tidycode can be used to easily classify the lines of code in an R file (e.g., as data cleaning, setup, etc.).
This vignette shows how to use tidycode to calculate the proportion of the code in an R file that is classified into each category.
We will first load the tidyverse and tidycode packages and then use the tidycode function read_rfiles() to read the two example files (built in to tidycode):
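A minimal sketch of this step follows. It assumes the bundled examples are reachable through tidycode_example() and are named "example_plot.R" and "example_analysis.R"; those file names are an assumption here, so check tidycode_example() for the names in your installed version.

```r
library(tidyverse)
library(tidycode)

# Read both bundled example scripts into one data frame.
# The file names below are assumed; tidycode_example() with no
# arguments lists the examples that ship with the package.
two_rfiles <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)
```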
Next, we can classify the lines of code in the two R files saved to the object two_rfiles, using unnest_calls() and the subsequent functions described in the tidycode vignette:
unnested_expressions <- unnest_calls(two_rfiles, expr)
classified_code <- unnested_expressions %>%
  inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  anti_join(get_stopfuncs()) %>%
  select(file, func, classification)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
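To make the two joins above concrete, here is a small sketch with toy tables. The tables calls, lookup, and stop_funcs are hypothetical stand-ins for the unnested calls, the classification table, and the stop functions; their contents are made up for illustration and are not the real tidycode data.

```r
library(dplyr)

# Hypothetical stand-ins for the real tables, for illustration only
calls <- tibble(func = c("library", "filter", "ggplot", "c", "myhelper"))
lookup <- tibble(
  func = c("library", "filter", "ggplot", "c"),
  classification = c("setup", "data cleaning", "visualization", "data cleaning")
)
stop_funcs <- tibble(func = "c")

classified_toy <- calls %>%
  inner_join(lookup, by = "func") %>%   # drops "myhelper": it has no classification
  anti_join(stop_funcs, by = "func")    # drops "c": it is listed as a stop function

classified_toy
```

Only calls that have a classification and are not stop functions remain, so the proportions calculated later are proportions of the classified code rather than of every line in the file.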
Then, we will create a simple function that a) takes the classified code and b) calculates the proportion of the classified lines of code in each file that fall into each category:
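One way such a function could look is sketched below. It assumes the input has file and classification columns, as classified_code does, and it is written to return a result grouped by file, which is consistent with the output shown further down; treat it as an illustration rather than the definitive definition. The toy input is hypothetical.

```r
library(dplyr)

# Sketch: per-file proportion of classified lines in each category.
# Assumes `d` has `file` and `classification` columns.
calc_proportion_file <- function(d) {
  d %>%
    group_by(file) %>%
    count(classification) %>%   # one row per file/classification pair
    mutate(prop = n / sum(n))   # still grouped by file, so sum(n) is per file
}

# Hypothetical toy input, only to show the shape of the result
toy <- tibble(
  file = c("a.R", "a.R", "a.R", "b.R"),
  classification = c("setup", "setup", "visualization", "setup")
)
toy_result <- calc_proportion_file(toy)
toy_result
```

Because the counts stay grouped by file, the proportions sum to 1 within each file.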
To use the function, pass the classified code to it:
proportion_of_file <- calc_proportion_file(classified_code)
proportion_of_file
#> # A tibble: 7 × 4
#> # Groups: file [2]
#> file classification n prop
#> <chr> <chr> <int> <dbl>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… data cleaning 2 0.286
#> 2 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… exploratory 1 0.143
#> 3 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… setup 3 0.429
#> 4 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… visualization 1 0.143
#> 5 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… data cleaning 4 0.5
#> 6 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… setup 1 0.125
#> 7 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… visualization 3 0.375
We can also easily visualize the results:
proportion_of_file %>%
  ggplot(aes(x = 0, y = prop, fill = reorder(classification, prop))) +
  geom_bar(stat = "identity", linewidth = 1) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  facet_wrap(~file, ncol = 1) +
  labs(
    title = "Proportion of Code by File",
    y = "Proportion of Code",
    fill = "Classification"
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    strip.text = element_text(hjust = 0),
    panel.background = element_blank(),
    strip.background = element_blank(),
    panel.grid.major.x = element_line(color = "grey80")
  )
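Because the file column holds full paths, the facet labels in this plot can be long. One optional tweak, not part of the workflow above, is to shorten the paths with base R's basename() before plotting. The paths in this sketch are hypothetical.

```r
library(dplyr)

# Hypothetical paths standing in for the long temporary paths in `file`
toy_props <- tibble(
  file = c("/path/to/first_script.R", "/path/to/second_script.R"),
  prop = c(0.25, 0.75)
)

# basename() keeps only the file name, which gives shorter facet labels
toy_short <- toy_props %>%
  mutate(file = basename(file))

toy_short$file
```

For a grouped result such as proportion_of_file, it is safest to call ungroup() before changing the file column. Note that files in different directories that share a name would be merged into one facet after this step.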
This can become quite a large visualization if there are many files.
Thus, this approach may be most useful for visualizing code on a per-file basis for a relatively small number of files (perhaps 10-15 or fewer).
Another approach, described next, scales to a larger number of files.
First, we’ll create a function that is an analog to calc_proportion_file(), but for calculating the overall proportion across all files, pooling the classified lines from every file:
calc_proportion_overall <- function(d) {
  d %>%
    group_by(classification) %>%
    count() %>%
    ungroup() %>%
    mutate(
      prop = prop.table(n)
    )
}
We can use this in the same way as calc_proportion_file(), passing classified_code as the sole argument:
proportion_overall <- calc_proportion_overall(classified_code)
proportion_overall
#> # A tibble: 4 × 3
#> classification n prop
#> <chr> <int> <dbl>
#> 1 data cleaning 6 0.4
#> 2 exploratory 1 0.0667
#> 3 setup 4 0.267
#> 4 visualization 4 0.267
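Note that calc_proportion_overall() pools the lines from all files before dividing, which is not the same as averaging the per-file proportions. A short check using the "data cleaning" counts from the two tables above (2 of 7 lines in one file, 4 of 8 in the other) illustrates the difference:

```r
# "data cleaning" counts and total classified lines per file,
# read off the per-file table above
n_cleaning <- c(2, 4)
n_total <- c(7, 8)

pooled_prop <- sum(n_cleaning) / sum(n_total)   # what calc_proportion_overall() reports
mean_of_props <- mean(n_cleaning / n_total)     # average of the per-file proportions

# pooled is 0.4; the mean of the per-file proportions is about 0.393
round(c(pooled = pooled_prop, mean_of_file_props = mean_of_props), 3)
```

The pooled value gives more weight to files with more classified lines, which may or may not be what is wanted when files differ greatly in length.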
These results can be visualized as follows:
proportion_overall %>%
  ggplot() +
  geom_bar(
    aes(x = reorder(classification, prop), y = 1),
    stat = "identity", fill = "grey80"
  ) +
  geom_bar(
    aes(x = reorder(classification, prop), y = prop, fill = prop),
    stat = "identity"
  ) +
  geom_text(
    aes(
      x = reorder(classification, prop), y = prop,
      label = paste0(round(prop * 100, digits = 0), "%"), hjust = -.5
    )
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  labs(
    title = "Overall Proportion of Code",
    y = "Proportion of Code",
    x = "Classification"
  ) +
  theme(
    panel.background = element_blank(),
    legend.position = "none"
  )