tidycode

Please see the tidycode website for full documentation:

https://lucymcgowan.github.io/tidycode/

The tidycode package is an attempt to make analyzing R code tidy. It is modeled after the tidytext package.

library(tidycode)

Read R files in as a tidy data frame

One way to analyze code is to read in existing R files. The read_rfiles() function parses your R files into individual R calls, recording the original file path and the line number for each call. The tidycode package includes some example files, with the paths accessible via the tidycode_example() function. Let’s examine two: the example_plot.R file and the example_analysis.R file.

cat(readLines(tidycode_example("example_plot.R")), sep = '\n')
#> library(tidyverse)
#> 
#> starwars %>%
#>   select(height, mass) %>%
#>   filter(!is.na(mass), !is.na(height)) %>%
#>   ggplot(aes(height, mass)) +
#>   geom_point()

cat(readLines(tidycode_example("example_analysis.R")), sep = '\n')
#> library(tidyverse)
#> library(rms)
#> 
#> starwars %>%
#>   mutate(bmi = mass / ((height / 100) ^ 2)) %>%
#>   select(bmi, gender) -> starwars
#> 
#> dd <- datadist(starwars)
#> options(datadist = "dd")
#> 
#> mod <- ols(bmi ~ gender, data = starwars) %>%
#>   summary()
#> 
#> plot(mod)

Using the read_rfiles() function, we can read them in as a tidy data frame.

(d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
  ))
#> # A tibble: 9 × 3
#>   file                                                          expr        line
#>   <chr>                                                         <list>     <int>
#> 1 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_p… <language>     1
#> 2 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_p… <language>     2
#> 3 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     1
#> 4 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     2
#> 5 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     3
#> 6 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     4
#> 7 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     5
#> 8 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     6
#> 9 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_a… <language>     7

This tidy data frame has one row per R call in the original file. It places the file path in the file column, the R call in the expr column, and the line number in the line column. Since this is in a tidy format, we can manipulate it using common data manipulation functions.
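As a small illustration of that last point, the sketch below counts the calls found in each file using dplyr. The basename() step and the n_calls column name are our own choices for display, not something tidycode requires.

```r
library(tidycode)
library(dplyr)

# Recreate the data frame from the example above
d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)

# Shorten the paths for display, then count the calls in each file
calls_per_file <- d %>%
  mutate(file = basename(file)) %>%
  count(file, name = "n_calls")

calls_per_file
```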

Let’s examine the first row.

d[1, ]
#> # A tibble: 1 × 3
#>   file                                                          expr        line
#>   <chr>                                                         <list>     <int>
#> 1 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example_p… <language>     1

This is the first line of the example_plot.R file. We can dig into the expr list column to see what R call was made on this first line.

d[1, "expr"][[1]]
#> [[1]]
#> library(tidyverse)

The call is library(tidyverse).
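To get a feel for what kind of object sits in the expr column, here is a minimal base R sketch that parses a small snippet of code into calls. It only illustrates the idea of a "call" object; it is not a claim about how read_rfiles() is implemented internally, and the snippet in code is made up for this example.

```r
# A short piece of R code held as text (hypothetical snippet)
code <- "library(tidyverse)\n\nstarwars %>%\n  select(height, mass)"

# parse() splits the text into top-level expressions without evaluating them
exprs <- parse(text = code)
calls <- as.list(exprs)

length(calls)        # 2 top-level calls
class(calls[[1]])    # "call"
deparse(calls[[1]])  # "library(tidyverse)"
```

Because the code is parsed but never evaluated, the packages it mentions do not need to be installed.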

Unnest calls into individual functions

The tidytext package unnests text into tokens, such as words or sentences, using the unnest_tokens() function. Similarly, we can unnest these calls into individual functions using the unnest_calls() function. To do this, we pipe the data frame we just created, d, into the unnest_calls() function and specify the column that contains the R calls, in this case expr.

library(dplyr)

d_funcs <- d %>%
  unnest_calls(expr)

d_funcs
#> # A tibble: 35 × 4
#>    file                                                        line func  args  
#>    <chr>                                                      <int> <chr> <list>
#>  1 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     1 libr… <list>
#>  2 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 +     <list>
#>  3 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 %>%   <list>
#>  4 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 %>%   <list>
#>  5 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 %>%   <list>
#>  6 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 sele… <list>
#>  7 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 filt… <list>
#>  8 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 !     <list>
#>  9 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 is.na <list>
#> 10 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/exampl…     2 !     <list>
#> # ℹ 25 more rows

This replaced the expr column with two new columns: func, a character column indicating each function called, and args, a list column containing the arguments for each function. Let’s examine that first row again.

d_funcs[1, ]
#> # A tibble: 1 × 4
#>   file                                                         line func  args  
#>   <chr>                                                       <int> <chr> <list>
#> 1 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/example…     1 libr… <list>

Here the function is library, which tracks with what we have previously observed. Examining the args list column, we see the following.

d_funcs[1, "args"][[1]]
#> [[1]]
#> [[1]][[1]]
#> tidyverse

The argument for the library function on this first line is tidyverse. This aligns with what we observed earlier: the first R call is library(tidyverse).
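With one row per function, ordinary dplyr verbs are enough to summarise usage. The sketch below counts how often each function appears and how many arguments it receives on average. The column names n_args, n_calls and mean_args are our own, and counting arguments with lengths() assumes each element of args is a list of arguments, as the output above suggests.

```r
library(tidycode)
library(dplyr)

# Recreate the unnested data frame from the example above
d_funcs <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
) %>%
  unnest_calls(expr)

# Count calls per function and the average number of arguments passed
func_counts <- d_funcs %>%
  mutate(n_args = lengths(args)) %>%
  group_by(func) %>%
  summarise(n_calls = n(), mean_args = mean(n_args)) %>%
  arrange(desc(n_calls))

func_counts
```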

Remove “stopwords”

In text analysis, there is the concept of “stopwords”. These are often small, common filler words that you want to remove before completing an analysis, such as “a” or “the”. In a tidy code analysis, we can use a similar concept to remove some functions. For example, we may want to remove the assignment operator, <-, before completing an analysis. We have compiled a list of common stop functions, available via the get_stopfuncs() function, to anti-join from the data frame.

d_funcs %>%
  anti_join(get_stopfuncs())
#> Joining with `by = join_by(func)`
#> # A tibble: 17 × 4
#>    file                                                  line func  args        
#>    <chr>                                                <int> <chr> <list>      
#>  1 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     1 libr… <list [1]>  
#>  2 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 sele… <list [2]>  
#>  3 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 filt… <list [2]>  
#>  4 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 is.na <list [1]>  
#>  5 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 is.na <list [1]>  
#>  6 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 ggpl… <list [1]>  
#>  7 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 aes   <list [2]>  
#>  8 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     2 geom… <list [0]>  
#>  9 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     3 libr… <list [1]>  
#> 10 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     4 libr… <list [1]>  
#> 11 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     5 muta… <named list>
#> 12 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     5 sele… <list [2]>  
#> 13 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     6 data… <list [1]>  
#> 14 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     7 opti… <named list>
#> 15 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     8 ols   <named list>
#> 16 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     8 summ… <list [0]>  
#> 17 /tmp/RtmpMchi3W/Rinst10537f2e4381/tidycode/extdata/…     9 plot  <list [1]>
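If the built-in list does not cover everything you want to drop, the same anti-join pattern works with your own list. The sketch below is one way to do that, assuming only that the stop function data frame is matched on a func column, as the join message above indicates. The custom_stopfuncs tibble and its contents are hypothetical.

```r
library(tidycode)
library(dplyr)

# Recreate the unnested data frame from the example above
d_funcs <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
) %>%
  unnest_calls(expr)

# A hypothetical, user-defined list of extra functions to ignore
custom_stopfuncs <- tibble(func = c("library", "options"))

# Remove the packaged stop functions first, then our own additions
d_filtered <- d_funcs %>%
  anti_join(get_stopfuncs(), by = "func") %>%
  anti_join(custom_stopfuncs, by = "func")

d_filtered
```

Passing by = "func" explicitly also silences the "Joining with" message.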

Classify code

Akin to the tidytext get_sentiments() function for sentiment analysis, the tidycode package has a get_classifications() function that will output a classification data frame. By default, this outputs a data frame with two classification lexicons, crowdsource and leeklab. The crowdsource lexicon was developed by Twitter users who tried out the classify Shiny application. The leeklab lexicon was curated by members of Jeff Leek’s Lab.

Both lexicons involve the same functions classified multiple times by different users. The score column gives the proportion of times a function was assigned a given classification. To use only the most prevalent classification, set the include_duplicates parameter to FALSE in the get_classifications() function. To get just one lexicon, specify the lexicon parameter.

Here we will merge in the crowdsource lexicon, picking the most prevalent classification by setting the include_duplicates parameter to FALSE.

d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  select(func, classification)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
#> # A tibble: 15 × 2
#>    func       classification
#>    <chr>      <chr>         
#>  1 library    setup         
#>  2 select     data cleaning 
#>  3 filter     data cleaning 
#>  4 is.na      data cleaning 
#>  5 is.na      data cleaning 
#>  6 ggplot     visualization 
#>  7 aes        visualization 
#>  8 geom_point visualization 
#>  9 library    setup         
#> 10 library    setup         
#> 11 mutate     data cleaning 
#> 12 select     data cleaning 
#> 13 options    setup         
#> 14 summary    exploratory   
#> 15 plot       visualization

Notice we now have one classification per function. If we left the include_duplicates parameter at its default, TRUE, we would end up with more than one classification per function, along with a score column.

d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource")) %>%
  select(func, classification, score)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
#> Warning in inner_join(., get_classifications("crowdsource")): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1627 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> # A tibble: 115 × 3
#>    func    classification   score
#>    <chr>   <chr>            <dbl>
#>  1 library setup          0.687  
#>  2 library import         0.213  
#>  3 library visualization  0.0339 
#>  4 library data cleaning  0.0278 
#>  5 library modeling       0.0134 
#>  6 library exploratory    0.0128 
#>  7 library communication  0.00835
#>  8 library evaluation     0.00278
#>  9 library export         0.00111
#> 10 select  data cleaning  0.636  
#> # ℹ 105 more rows
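When the scores are available, you can also choose the top classification yourself. The sketch below keeps the highest-scoring class for each function with slice_max() and then tallies the classes. It is meant as an illustration of working with the score column, and we have not verified that it matches include_duplicates = FALSE exactly, for instance in how ties are broken. The names top_class and class_counts are our own.

```r
library(tidycode)
library(dplyr)

# Recreate the unnested data frame from the example above
d_funcs <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
) %>%
  unnest_calls(expr)

# Keep the single highest-scoring classification for each function
top_class <- get_classifications("crowdsource") %>%
  group_by(func) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()

# Join on func explicitly, then count how often each class appears
class_counts <- d_funcs %>%
  anti_join(get_stopfuncs(), by = "func") %>%
  inner_join(top_class, by = "func") %>%
  count(classification, sort = TRUE)

class_counts
```

Because top_class has at most one row per function, this join does not trigger the many-to-many warning shown above.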