Introduction

Many investigators, project managers, and data managers have turned to REDCap to manage their data (Harris et al. 2009, 2019). Eventually, it falls to the statistician to take the REDCap data and load it into their statistical analysis program of choice. In this presentation, I show how to use the CSV and R script file downloaded from REDCap to create a clean R data set.

This presentation uses the R programming language and assumes the end user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2019; RStudio Team 2016). All of the files needed to reproduce these results can be downloaded from the Git repository git clone https://git.waderstats.com/redcap_to_r_data_set/.

Libraries

The libraries knitr, bookdown, kableExtra, and DT, generate the HTML output (Xie 2019, 2018; Zhu 2019; Xie, Cheng, and Tan 2018). The Hmisc library is loaded to generate and store variable labels (Harrell Jr, Charles Dupont, and others. 2019).

package_loader <- function(x, ...) {
  if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "DT", "Hmisc")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

REDCap Data Structure

For this presentation, I created two files that mimic the general structure of data exported from REDCap for use in the R programming language. This structure is accurate as of REDCap version 9.1.0. While not the focus of this presentation, for the sake of reproducibility, the file used the generate the example data is downloadable here.

The first file is a flat-file of comma-separated values (CSV). Below is a display of the example data downloaded here. Of note, all of the factor variables, such as sex, are numerically coded. There are also indicators for ambiguous data (666) and missing data (999).

d_csv <- read.csv(file = "fromREDCap_DATA_2019-12-05_1111.csv", stringsAsFactors = FALSE)

d_csv_DT <- datatable(d_csv,
          rownames = FALSE,
          class = 'cell-border stripe',
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;',
            '', htmltools::em('The CSV file generated with REDCap for use with the R programming language.  Of note, all of the factor variables are numerically coded.')
          ),
          filter = 'top',
          options = list(pageLength = 5,
                         autoWidth = TRUE,
                         scrollX = TRUE
          )
)

d_csv_DT

The second file is an R script that codes the data for use in R. The contents of the example REDCap script file used in this presentation are shown below and downloadable here. We see that the Hmisc library is used to set the labels, and none of the numerically coded variables are stored as factors, but rather as new variables with .factor appended.

d_script <- readLines("fromREDCap_R_2019-12-05_1111.r")

cat(paste(d_script, "\n", sep = ""))
## #Clear existing data and graphics
##  rm(list=ls())
##  graphics.off()
##  #Load Hmisc library
##  library(Hmisc)
##  #Read Data
##  data=read.csv('fromREDCap_DATA_2019-12-05_1111.csv')
##  #Setting Labels
##  
##  label(data$record_id)="Record ID"
##  label(data$redcap_event_name)="Event Name"
##  label(data$visit_date) ="Visit Date"
##  label(data$randomization)="Randomization"
##  label(data$sex)="Sex"
##  label(data$age)="Age"
##  label(data$social_connectedness)="Social Connectedness"
##  label(data$comments)="Comments"
##                   
##  #Setting Factors(will create new variable for factors)
##  data$redcap_event_name.factor = factor(data$redcap_event_name,levels=c("baseline", "followup"))
##  data$randomization.factor = factor(data$randomization,levels=c("0", "1"))
##  data$sex.factor = factor(data$sex,levels=c("0", "1", "666", "999"))
##  data$social_connectedness.factor = factor(data$social_connectedness,levels=c("0", "1", "2", "3", "4", "5", "6", "7", "666", "999"))
##  
##  levels(data$redcap_event_name.factor)=c("Baseline","Follow-up")
##  levels(data$randomization.factor)=c("Control","Treatment")
##  levels(data$sex.factor)=c("Male","Female","Ambiguous","Missing")
##  levels(data$social_connectedness)=c("0", "1", "2", "3", "4", "5", "6", "7", "Ambiguous","Missing")

REDCap Script File Clean-up and Parsing

To set up the data, we want to parse and evaluate the script file with the primary goal of removing the appended .factor. The function redcap_to_r_data_set does just that. It takes two arguments; The first is redcap_data_file, which is the path to the REDCap data set. The second is redcap_script_file, which is the path to the REDCap script file.

redcap_to_r_data_set <- function(redcap_data_file, redcap_script_file) {
  # Read in the data and script file.
  redcap_data <- read.csv(file = redcap_data_file, stringsAsFactors = FALSE)
  redcap_script <- readLines(redcap_script_file)

  # We want to remove the appended .factor, but since releveling the numerically coded data erases the labels, we need to reorder the file so that the labels are last.
    # Every line in the script file that uses the factor() function
    redcap_factor <- redcap_script[grep("factor\\(", redcap_script)]
    
    # Every line in the script file that uses the levels() function
    redcap_levels <- redcap_script[grep("levels\\(", redcap_script)]
    
    # Every line in the script file that begins with the label function
    redcap_label <- redcap_script[grep("^label\\(", redcap_script)]
    
    # Reorder the chunks in the script file.
    redcap_reorder <- c(redcap_factor, "", redcap_levels, "", redcap_label)
    
    # Remove the appended .factor.
    redcap_no_append <- gsub("\\.factor", "", redcap_reorder)
    
  # REDCap defaults to calling the data 'data'.  Before evaluating, we need to change this to what the data is named here.
  redcap_rename <- gsub("data\\$", "redcap_data\\$", redcap_no_append)

  # Now we can safely evaluate the script file.
  eval(parse(text = redcap_rename))
  
  return(redcap_data)
}

d <- redcap_to_r_data_set(redcap_data_file = "fromREDCap_DATA_2019-12-05_1111.csv", redcap_script_file = "fromREDCap_R_2019-12-05_1111.r")

Below is the completed R data set after cleaning and parsing the REDCap script file.

d_DT <- datatable(d,
          rownames = FALSE,
          class = 'cell-border stripe',
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;',
            '', htmltools::em('The complete R data set after cleaning and parsing the REDCap script file.')
          ),
          filter = 'top',
          options = list(pageLength = 5,
                         autoWidth = TRUE,
                         scrollX = TRUE
          )
)

d_DT

Printing the data summary shows that variable coding worked as expected. The label accessor function shows that the labels were accurately created.

summary(d)
##    record_id  redcap_event_name  visit_date          randomization
##  Min.   : 1   Baseline :25      Length:50          Control  :10   
##  1st Qu.: 7   Follow-up:25      Class1:labelled    Treatment:15   
##  Median :13                     Class2:character   NA's     :25   
##  Mean   :13                     Mode  :character                  
##  3rd Qu.:19                                                       
##  Max.   :25                                                       
##                                                                   
##         sex          age        social_connectedness   comments        
##  Male     : 0   Min.   : 25.0   1        :9          Length:50         
##  Female   :11   1st Qu.: 33.0   2        :9          Class1:labelled   
##  Ambiguous: 0   Median : 38.0   3        :8          Class2:character  
##  Missing  : 3   Mean   :228.8   4        :8          Mode  :character  
##  NA's     :36   3rd Qu.: 61.0   Missing  :6                            
##                 Max.   :999.0   Ambiguous:4                            
##                 NA's   :25      (Other)  :6
label(d)
##              record_id      redcap_event_name             visit_date 
##            "Record ID"           "Event Name"           "Visit Date" 
##          randomization                    sex                    age 
##        "Randomization"                  "Sex"                  "Age" 
##   social_connectedness               comments 
## "Social Connectedness"             "Comments"

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] Hmisc_4.4-1      ggplot2_3.3.2    Formula_1.2-4    survival_3.2-7  
## [5] lattice_0.20-41  DT_0.16          kableExtra_1.3.1 bookdown_0.21   
## [9] knitr_1.30      
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.19           splines_4.0.3       colorspace_2.0-0   
##  [4] vctrs_0.3.5         htmltools_0.5.0     viridisLite_0.3.0  
##  [7] yaml_2.2.1          base64enc_0.1-3     rlang_0.4.8        
## [10] pillar_1.4.6        foreign_0.8-79      glue_1.4.2         
## [13] withr_2.3.0         RColorBrewer_1.1-2  jpeg_0.1-8.1       
## [16] lifecycle_0.2.0     stringr_1.4.0       munsell_0.5.0      
## [19] gtable_0.3.0        rvest_0.3.6         htmlwidgets_1.5.2  
## [22] evaluate_0.14       latticeExtra_0.6-29 crosstalk_1.1.0.1  
## [25] htmlTable_2.1.0     backports_1.2.0     checkmate_2.0.0    
## [28] renv_0.12.2         scales_1.1.1        jsonlite_1.7.1     
## [31] webshot_0.5.2       gridExtra_2.3       png_0.1-7          
## [34] digest_0.6.27       stringi_1.5.3       grid_4.0.3         
## [37] tools_4.0.3         magrittr_2.0.1      tibble_3.0.4       
## [40] cluster_2.1.0       crayon_1.3.4        pkgconfig_2.0.3    
## [43] ellipsis_0.3.1      Matrix_1.2-18       data.table_1.13.2  
## [46] xml2_1.3.2          rmarkdown_2.5       httr_1.4.2         
## [49] rstudioapi_0.13     R6_2.5.0            rpart_4.1-15       
## [52] nnet_7.3-14         compiler_4.0.3

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

Harris, Paul A, Robert Taylor, Brenda L Minor, Veida Elliott, Michelle Fernandez, Lindsay O’Neal, Laura McLeod, et al. 2019. “The Redcap Consortium: Building an International Community of Software Platform Partners.” Journal of Biomedical Informatics 95: 103208.

Harris, Paul A, Robert Taylor, Robert Thielke, Jonathon Payne, Nathaniel Gonzalez, Jose G Conde, and others. 2009. “A Metadata-Driven Methodology and Workflow Process for Providing Translational Research Informatics Support.” J Biomed Inform 42 (2): 377–81.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2016. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.