Calculating the proportion of code classified in an R file

Introduction

tidycode can be used to easily classify the lines of code in an R file (e.g., as data cleaning, setup, etc.).

This vignette shows how tidycode can be used to calculate the proportion of an R file's code that is classified into each category.

Loading and setting up

We will first load the tidycode, dplyr, and ggplot2 packages and then use the tidycode function read_rfiles() to read the two example files (built into tidycode):

library(tidycode)
library(dplyr)
library(ggplot2)

two_rfiles <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)

Classify the lines of code in the R files

Next, we can classify the lines of code in the two R files saved to the object two_rfiles, using unnest_calls() and the subsequent functions described in the tidycode vignette:

unnested_expressions <- unnest_calls(two_rfiles, expr)

classified_code <- unnested_expressions %>%
  inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  anti_join(get_stopfuncs()) %>%
  select(file, func, classification)
#> Joining with `by = join_by(func)`
#> Joining with `by = join_by(func)`
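The two joins above do different jobs: inner_join() keeps only the calls that have a classification, and anti_join() then drops the calls listed as stop functions. The sketch below illustrates this on toy data. The tables toy_calls, toy_classifications, and toy_stopfuncs are hypothetical stand-ins made up for illustration; they are not the real contents of get_classifications() or get_stopfuncs(). Passing by = "func" explicitly also avoids the "Joining with" messages.

```r
library(dplyr)

# Hypothetical unnested calls: one row per function call in a file
toy_calls <- data.frame(
  file = "a.R",
  func = c("library", "filter", "print", "my_helper")
)

# Hypothetical stand-in for get_classifications()
toy_classifications <- data.frame(
  func = c("library", "filter", "print"),
  classification = c("setup", "data cleaning", "exploratory")
)

# Hypothetical stand-in for get_stopfuncs()
toy_stopfuncs <- data.frame(func = "print")

classified <- toy_calls %>%
  # my_helper has no classification, so inner_join() drops it
  inner_join(toy_classifications, by = "func") %>%
  # print is listed as a stop function, so anti_join() drops it
  anti_join(toy_stopfuncs, by = "func")

classified
```

Only the library and filter rows remain, which means any proportions computed afterwards describe the classified calls rather than every call in the file.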

Creating a function

Then, we will create a simple function that (a) takes the classified code and (b) calculates the proportion of the lines of code in each file that are classified into each category:

calc_proportion_file <- function(d) {
  d %>% 
    count(file, classification) %>% 
    group_by(file) %>% 
    mutate(prop = n / sum(n))
}
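To make the behavior of this function concrete, here is a minimal sketch that applies it to a small hand-made data frame. The file names, functions, and classifications in toy_classified are hypothetical and chosen only so the arithmetic is easy to follow.

```r
library(dplyr)

calc_proportion_file <- function(d) {
  d %>%
    count(file, classification) %>%
    group_by(file) %>%
    mutate(prop = n / sum(n))
}

# Hypothetical classified code: four calls in a.R, two in b.R
toy_classified <- data.frame(
  file = c("a.R", "a.R", "a.R", "a.R", "b.R", "b.R"),
  func = c("library", "filter", "mutate", "ggplot", "library", "lm"),
  classification = c(
    "setup", "data cleaning", "data cleaning", "visualization",
    "setup", "modeling"
  )
)

toy_props <- calc_proportion_file(toy_classified)
toy_props

# Within each file, the proportions should add up to 1
toy_props %>% summarise(total = sum(prop))
```

Because the result is still grouped by file, the final summarise() returns one total per file.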

Using the function

Using the function on our classified code is straightforward; just pass the classified code to it:

proportion_of_file <- calc_proportion_file(classified_code)

proportion_of_file
#> # A tibble: 7 × 4
#> # Groups:   file [2]
#>   file                                                classification     n  prop
#>   <chr>                                               <chr>          <int> <dbl>
#> 1 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… data cleaning      2 0.286
#> 2 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… exploratory        1 0.143
#> 3 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… setup              3 0.429
#> 4 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… visualization      1 0.143
#> 5 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… data cleaning      4 0.5  
#> 6 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… setup              1 0.125
#> 7 /tmp/Rtmp9a4e2H/Rinst10625516df29/tidycode/extdata… visualization      3 0.375

Visualizing classified code on a per-file basis

We can also easily visualize the results:

proportion_of_file %>%
  ggplot(aes(x = 0, y = prop, fill = reorder(classification, prop))) +
  # linewidth replaces the size argument, which was deprecated for lines in ggplot2 3.4.0
  geom_bar(stat = "identity", linewidth = 1) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  facet_wrap(~file, ncol = 1) +
  labs(
    title = "Proportion of Code by File",
    y = "Proportion of Code",
    fill = "Classification"
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    strip.text = element_text(hjust = 0),
    panel.background = element_blank(),
    strip.background = element_blank(),
    panel.grid.major.x = element_line(color = "grey80")
  )

This can become quite a large visualization if there are many files.

Thus, this approach may be more useful when visualizing code on a per-file basis for a relatively small number of files (perhaps 10-15 or fewer).
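When the file column holds full paths, as in the output above, the facet labels can also become long. One possible way to shorten them, sketched here with base R's basename(), is to keep only the file name before plotting. The paths in toy_props are made up for illustration, and this assumes the file names are unique across directories.

```r
library(dplyr)

# Hypothetical per-file proportions with long, made-up paths
toy_props <- data.frame(
  file = c(
    "/home/user/project/scripts/clean.R",
    "/home/user/project/scripts/clean.R",
    "/home/user/project/scripts/plot.R"
  ),
  classification = c("setup", "data cleaning", "visualization"),
  prop = c(0.25, 0.75, 1)
)

# Keep only the file name so that facet labels stay short
toy_props_short <- toy_props %>%
  mutate(file = basename(file))

unique(toy_props_short$file)
```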

Another approach can be scaled up to a larger number of files, as is described next.

Visualizing classified code across files

First, we’ll create a function that is an analog to calc_proportion_file(), but for calculating the overall proportion across all files, pooling the classified code from every file rather than averaging the per-file proportions:

calc_proportion_overall <- function(d) {
  d %>%
    # count rows per classification, pooled across all files
    count(classification) %>%
    mutate(prop = n / sum(n))
}

We can use this in the same way as calc_proportion_file(), passing classified_code as the sole argument:

proportion_overall <- calc_proportion_overall(classified_code)
proportion_overall
#> # A tibble: 4 × 3
#>   classification     n   prop
#>   <chr>          <int>  <dbl>
#> 1 data cleaning      6 0.4   
#> 2 exploratory        1 0.0667
#> 3 setup              4 0.267 
#> 4 visualization      4 0.267
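Note that these are pooled proportions: each count is divided by the total across all files, so larger files carry more weight. The sketch below contrasts this with the mean of the per-file proportions, using the counts from the per-file output shown earlier. The labels file_1 and file_2 are placeholders for the two file paths.

```r
library(dplyr)

# Counts taken from the per-file output above; file labels are placeholders
counts <- data.frame(
  file = c(rep("file_1", 4), rep("file_2", 3)),
  classification = c(
    "data cleaning", "exploratory", "setup", "visualization",
    "data cleaning", "setup", "visualization"
  ),
  n = c(2, 1, 3, 1, 4, 1, 3)
)

# Pooled proportion: sum the counts across files, then divide by the grand total
pooled <- counts %>%
  group_by(classification) %>%
  summarise(n = sum(n)) %>%
  mutate(prop = n / sum(n))

# Mean of per-file proportions: a category missing from a file counts as 0
n_files <- n_distinct(counts$file)
per_file_mean <- counts %>%
  group_by(file) %>%
  mutate(prop = n / sum(n)) %>%
  group_by(classification) %>%
  summarise(mean_prop = sum(prop) / n_files)

pooled
per_file_mean
```

For data cleaning, the pooled proportion is 6 / 15 = 0.4, whereas the mean of the per-file proportions is (2/7 + 4/8) / 2, or about 0.393. The two measures agree only when every file contributes the same number of classified lines.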

These results can be visualized as follows:

proportion_overall %>%
  ggplot(aes(x = reorder(classification, prop))) +
  # grey background bar spanning the full 0-100% range
  geom_bar(aes(y = 1), stat = "identity", fill = "grey80") +
  # foreground bar showing the observed proportion
  geom_bar(aes(y = prop, fill = prop), stat = "identity") +
  geom_text(
    aes(y = prop, label = paste0(round(prop * 100, digits = 0), "%")),
    hjust = -.5
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  labs(
    title = "Overall Proportion of Code",
    y = "Proportion of Code",
    x = "Classification"
  ) +
  theme(
    panel.background = element_blank(),
    legend.position = "none"
  )