Please see the tidycode website for full documentation.
The tidycode package is an attempt to make analyzing R code tidy. It is modeled after the tidytext package.
One way to analyze code is to read in existing R files. The read_rfiles()
function parses your R files into individual R calls, recording the
original file path along with the line number for each call. The
tidycode package includes some example files, with the paths accessible
via the tidycode_example() function. Let’s examine two: the
example_plot.R file and the example_analysis.R file.
library(tidycode)
cat(readLines(tidycode_example("example_plot.R")), sep = '\n')
#> library(tidyverse)
#>
#> starwars %>%
#> select(height, mass) %>%
#> filter(!is.na(mass), !is.na(height)) %>%
#> ggplot(aes(height, mass)) +
#> geom_point()
cat(readLines(tidycode_example("example_analysis.R")), sep = '\n')
#> library(tidyverse)
#> library(rms)
#>
#> starwars %>%
#> mutate(bmi = mass / ((height / 100) ^ 2)) %>%
#> select(bmi, gender) -> starwars
#>
#> dd <- datadist(starwars)
#> options(datadist = "dd")
#>
#> mod <- ols(bmi ~ gender, data = starwars) %>%
#> summary()
#>
#> plot(mod)
Using the read_rfiles() function, we can read them in as a tidy data
frame.
(d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
))
#> # A tibble: 9 × 3
#> file expr line
#> <chr> <list> <int>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_p… <language> 1
#> 2 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_p… <language> 2
#> 3 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 1
#> 4 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 2
#> 5 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 3
#> 6 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 4
#> 7 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 5
#> 8 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 6
#> 9 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_a… <language> 7
This tidy data frame has one row per R call in the original file. It
places the file path in the file column, the R call in the expr column,
and the line number in the line column. Since this is in a tidy format,
we can manipulate it using common data manipulation functions.
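To make the idea of “one row per R call” concrete, here is a minimal
base R sketch that parses a small script into its top-level calls and
records the line on which each one starts. It is an illustration only
and makes no claim about how read_rfiles() works internally; the script
contents and the names path, exprs, and calls are made up for this
example.

```r
# Write a small throwaway script (hypothetical contents, for illustration only)
path <- tempfile(fileext = ".R")
writeLines(c("library(stats)", "", "x <- c(1, 2, 3)", "mean(x)"), path)

# parse() returns one expression per top-level call;
# keep.source = TRUE attaches source references with line information
exprs <- parse(path, keep.source = TRUE)
start_lines <- vapply(
  attr(exprs, "srcref"),
  function(ref) as.integer(ref)[1],  # first element is the starting line
  integer(1)
)

# Assemble a data frame with one row per call and a list column of calls
calls <- data.frame(file = path, line = start_lines)
calls$expr <- lapply(seq_along(exprs), function(i) exprs[[i]])

nrow(calls)      # number of top-level calls found
calls$line       # starting line of each call
calls$expr[[3]]  # the third call, as a language object
```

The data frame d created above has the same three-column layout.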
Let’s examine the first row of d.
d[1, ]
#> # A tibble: 1 × 3
#> file expr line
#> <chr> <list> <int>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example_p… <language> 1
This is the first line of the example_plot.R file. We can dig into the
expr list column to see what R call was made on this first line. The
call is library(tidyverse).
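As a sketch of what digging into a list column looks like, the snippet
below pulls a call out of a hand-built stand-in, d_demo, so that it runs
on its own. The stand-in and its values are hypothetical; with the real
data frame the same indexing would presumably be d$expr[[1]].

```r
# Hand-built stand-in for the first row of d (hypothetical, for illustration)
d_demo <- data.frame(file = "example_plot.R", line = 1L)
d_demo$expr <- list(quote(library(tidyverse)))

# [[1]] pulls the first element out of the list column
first_call <- d_demo$expr[[1]]
first_call         # the call itself
class(first_call)  # calls are language objects of class "call"
```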
Similar to the tidytext package, whose unnest_tokens() function unnests
groups of words by token, such as by word or sentence, we can unnest
these calls into individual functions using the unnest_calls() function.
To do this, we pipe the data frame we just created, d, into the
unnest_calls() function and specify the column that contains the R
calls, in this case expr.
library(dplyr)
d_funcs <- d %>%
  unnest_calls(expr)
d_funcs
#> # A tibble: 35 × 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 1 libr… <list>
#> 2 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 + <list>
#> 3 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 %>% <list>
#> 4 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 %>% <list>
#> 5 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 %>% <list>
#> 6 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 sele… <list>
#> 7 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 filt… <list>
#> 8 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 ! <list>
#> 9 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 is.na <list>
#> 10 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/exampl… 2 ! <list>
#> # ℹ 25 more rows
This added two columns to our data frame: func, a character column
indicating each function called, and args, a list column containing the
arguments for each function. Let’s examine that first row again.
d_funcs[1, ]
#> # A tibble: 1 × 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/example… 1 libr… <list>
Here the function is library, which tracks with what we observed
previously. Examining the args list column shows that the argument for
the library function on this first line is tidyverse. This aligns with
what we saw earlier: the first R call is library(tidyverse).
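To see where values like func and args can come from, it helps to take a
single call apart by hand. The base R sketch below splits
library(tidyverse) into its function name and arguments, then walks a
nested call recursively to list every function it contains. This only
illustrates the idea and is not a description of how unnest_calls() is
implemented; collect_funcs() is a hypothetical helper, and it does not
handle empty arguments such as the one in x[1, ].

```r
# A call is list-like: the first element is the function,
# the remaining elements are its arguments
first_call <- quote(library(tidyverse))
func <- as.character(first_call[[1]])
args <- as.list(first_call)[-1]

func  # the function name as a string
args  # a list holding the symbol tidyverse

# Hypothetical helper: recursively collect every function name in a call
collect_funcs <- function(expr) {
  if (!is.call(expr)) {
    return(character(0))  # symbols and constants contain no calls
  }
  fn <- expr[[1]]
  # The function is usually a symbol; deparse() covers forms like pkg::fun
  name <- if (is.symbol(fn)) as.character(fn) else deparse(fn)[1]
  nested <- lapply(as.list(expr)[-1], collect_funcs)
  c(name, unlist(nested, use.names = FALSE))
}

collect_funcs(quote(filter(x, !is.na(mass), !is.na(height))))
```

The helper visits the outer function first and then its arguments from
left to right, so operators such as ! appear alongside ordinary
functions.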
In text analysis, there is the concept of “stopwords”. These are often
small, common filler words that you want to remove before completing an
analysis, such as “a” or “the”. In a tidy code analysis, we can use a
similar concept to remove some functions. For example, we may want to
remove the assignment operator, <-, before completing an analysis. The
get_stopfuncs() function returns a list of common stop functions that we
have compiled, which can be anti-joined from the data frame.
d_funcs %>%
  anti_join(get_stopfuncs())
#> Joining with `by = join_by(func)`
#> # A tibble: 17 × 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 1 libr… <list [1]>
#> 2 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 sele… <list [2]>
#> 3 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 filt… <list [2]>
#> 4 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 is.na <list [1]>
#> 5 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 is.na <list [1]>
#> 6 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 ggpl… <list [1]>
#> 7 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 aes <list [2]>
#> 8 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 2 geom… <list [0]>
#> 9 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 3 libr… <list [1]>
#> 10 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 4 libr… <list [1]>
#> 11 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 5 muta… <named list>
#> 12 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 5 sele… <list [2]>
#> 13 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 6 data… <list [1]>
#> 14 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 7 opti… <named list>
#> 15 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 8 ols <named list>
#> 16 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 8 summ… <list [0]>
#> 17 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata/… 9 plot <list [1]>
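The anti-join itself can be tried on a toy example. The sketch below
uses two made-up data frames, funcs_demo and stop_demo, in place of
d_funcs and the output of get_stopfuncs(); only the func column name is
taken from the output above. Passing by = "func" names the join column
explicitly, which also avoids the "Joining with" message.

```r
library(dplyr)

# Stand-in for d_funcs (hypothetical values)
funcs_demo <- data.frame(
  line = c(1L, 2L, 2L, 2L),
  func = c("library", "%>%", "select", "<-")
)

# Hypothetical stop list; the real one comes from get_stopfuncs()
stop_demo <- data.frame(func = c("%>%", "<-", "+"))

# Keep only the rows of funcs_demo whose func has no match in stop_demo
kept <- anti_join(funcs_demo, stop_demo, by = "func")
kept$func
```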
Akin to the tidytext get_sentiments() function for sentiment analysis,
the tidycode package has a get_classifications() function that outputs a
classification data frame. By default, this data frame contains two
classification lexicons, crowdsource and leeklab. The crowdsource
lexicon was developed by Twitter users who tried out the classify Shiny
application. The leeklab lexicon was curated by members of Jeff Leek’s
Lab. In both lexicons, the same functions were classified multiple times
by different users. The score column indicates the proportion of times a
function was classified as a given class. To use just the most prevalent
classification, set the include_duplicates parameter to FALSE in the
get_classifications() function. To get just one lexicon, specify the
lexicon parameter. Here we will merge in the crowdsource lexicon,
picking the most prevalent classification by setting the
include_duplicates parameter to FALSE.
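As a rough illustration of how a score and a most prevalent
classification relate, the sketch below starts from a handful of made-up
votes, computes the share of votes each class received per function, and
keeps the top class. The votes data frame and its values are
hypothetical, and this is not claimed to be how the lexicons were
actually built.

```r
library(dplyr)

# Hypothetical votes: each row is one user classifying one function
votes <- data.frame(
  func = c("library", "library", "library", "library", "plot"),
  classification = c("setup", "setup", "setup", "import", "visualization")
)

# Share of votes per class within each function
scores <- votes %>%
  count(func, classification) %>%
  group_by(func) %>%
  mutate(score = n / sum(n)) %>%
  ungroup()

# Keep only the most prevalent class for each function
top <- scores %>%
  group_by(func) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()

scores
top
```

With the real crowdsource lexicon, the merge looks like this.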
d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  select(func, classification)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
#> # A tibble: 15 × 2
#> func classification
#> <chr> <chr>
#> 1 library setup
#> 2 select data cleaning
#> 3 filter data cleaning
#> 4 is.na data cleaning
#> 5 is.na data cleaning
#> 6 ggplot visualization
#> 7 aes visualization
#> 8 geom_point visualization
#> 9 library setup
#> 10 library setup
#> 11 mutate data cleaning
#> 12 select data cleaning
#> 13 options setup
#> 14 summary exploratory
#> 15 plot visualization
Notice we now have one classification per function. If we left the
include_duplicates parameter at its default, TRUE, we would end up with
more than one classification per function, along with a score column.
d_funcs %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource")) %>%
  select(func, classification, score)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
#> Warning in inner_join(., get_classifications("crowdsource")): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1627 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
#> # A tibble: 115 × 3
#> func classification score
#> <chr> <chr> <dbl>
#> 1 library setup 0.687
#> 2 library import 0.213
#> 3 library visualization 0.0339
#> 4 library data cleaning 0.0278
#> 5 library modeling 0.0134
#> 6 library exploratory 0.0128
#> 7 library communication 0.00835
#> 8 library evaluation 0.00278
#> 9 library export 0.00111
#> 10 select data cleaning 0.636
#> # ℹ 105 more rows
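The many-to-many warning appears because a function can occur several
times in the code and also have several classifications in the lexicon.
When that is expected, the join can say so, as the warning itself
suggests. The sketch below shows this on two made-up data frames,
funcs_demo and lexicon_demo, whose values are hypothetical; the
relationship argument requires a version of dplyr that supports it.

```r
library(dplyr)

# Stand-in for the unnested functions: is.na appears twice (hypothetical)
funcs_demo <- data.frame(func = c("is.na", "is.na", "plot"))

# Stand-in lexicon with more than one class for is.na (hypothetical scores)
lexicon_demo <- data.frame(
  func = c("is.na", "is.na", "plot"),
  classification = c("data cleaning", "exploratory", "visualization"),
  score = c(0.8, 0.2, 1.0)
)

# Declare the expected relationship so the join runs without the warning
joined <- inner_join(
  funcs_demo, lexicon_demo,
  by = "func",
  relationship = "many-to-many"
)

nrow(joined)  # each is.na row is paired with both of its classes
joined
```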