Introduction

Creating an analytic data set is very important when doing data analysis and will be used to reproduce the results.

An analytic data set should include the following:

This presentation uses the R programming language and assumes the end user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2019; RStudio Team 2016). All of the files needed to reproduce these results can be downloaded from the Git repository git clone https://git.waderstats.com/creating_an_analytic_data_set/.

Libraries

The libraries knitr, bookdown, and kableExtra are used to generate the HTML output (Xie 2019, 2018; Zhu 2019). The ggplot2 library is loaded for the example data set that is used (Wickham 2016). The Hmisc and DT libraries provide functions needed to make and display the analytic data set (Harrell Jr, Charles Dupont, and others. 2019; Xie, Cheng, and Tan 2018).

package_loader <- function(x, ...) {
  if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "ggplot2", "Hmisc", "DT")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

Example Data

The example data for this presentation will be mpg from the ggplot2 package. From the description in the manual:

This dataset contains a subset of the fuel economy data that the EPA makes available here. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

data(mpg)
mpg <- data.frame(mpg)

kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Table 1: Header of mpg.
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

Data Setup

There are two essential parts to setting up the data. First, make sure each variable is coded to reflect its class. Second, we want to add labels to each variable in the data set using the label function of the Hmisc library.

set.seed(123)
data(mpg)
mpg <- data.frame(mpg)

colnames(mpg)[which(colnames(mpg) == "manufacturer")] <- "manu"

mpg$manu <- factor(mpg$manu)
mpg$model <- factor(mpg$model)
mpg$displ <- as.numeric(mpg$displ)
mpg$year <- factor(mpg$year, levels = c("1999", "2008"), ordered = TRUE)

mpg$dp <- as.Date(NA, origin = "1970-01-01")
mpg$dp[which(mpg$year == "1999")] <- sample(seq(as.Date('1999-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('1999-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)
mpg$dp[which(mpg$year == "2008")] <- sample(seq(as.Date('2008-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('2008-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)

mpg$cyl <- factor(mpg$cyl, levels = c(4, 5, 6, 8), ordered = TRUE)
mpg$trans <- factor(mpg$trans)
mpg$drv <- factor(mpg$drv, levels = c("f", "r", "4"), labels = c("front-wheel drive", "rear wheel drive", "4wd"))
mpg$fl <- factor(mpg$fl)
mpg$class <- factor(mpg$class)

mpg$rn <- rnorm(dim(mpg)[1], mean = 10, sd = 5)
mpg$rn[sample(1:length(mpg$rn), size = 50)] <- NA

mpg$party <- factor(sample(c("republican", "democrat", "independent", NA), dim(mpg)[1], replace = TRUE), levels = c("republican", "democrat", "independent"))
mpg$comments <- sample(c("I like this car!", "Meh.", "This is the worst car ever!", "Does it come in green?", "want cheese flavoured cars.", "Does it also fly?", "Blah, Blah, Blah, Blah, Blah, Blah, Blah, Blah", NA), dim(mpg)[1], replace = TRUE)

label(mpg$manu) <- "manufacturer"
label(mpg$model) <- "model name"
label(mpg$displ) <- "engine displacement, in litres"
label(mpg$year) <- "year of manufacture"
label(mpg$dp) <- "date of purchase"
label(mpg$cyl) <- "number of cylinders"
label(mpg$trans) <- "type of transmission"
label(mpg$drv) <- "drive type"
label(mpg$cty) <- "city miles per gallon"
label(mpg$hwy) <- "highway miles per gallon"
label(mpg$fl) <- "fuel type"
label(mpg$class) <- "type of car"
label(mpg$rn) <- "some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5"
label(mpg$party) <- "some random political parties"
label(mpg$comments) <- "some random comments"

kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Table 2: Header of mpg.
manu model displ year cyl trans drv cty hwy fl class dp rn party comments
audi a4 1.8 1999 4 auto(l5) front-wheel drive 18 29 p compact 1999-06-28 18.98570 democrat I like this car!
audi a4 1.8 1999 4 manual(m5) front-wheel drive 21 29 p compact 1999-01-14 NA NA I like this car!
audi a4 2.0 2008 4 manual(m6) front-wheel drive 20 31 p compact 2008-02-08 19.50450 independent Does it also fly?
audi a4 2.0 2008 4 auto(av) front-wheel drive 21 30 p compact 2008-07-14 NA independent I like this car!
audi a4 2.8 1999 6 auto(l5) front-wheel drive 16 26 p compact 1999-07-14 13.68097 democrat Meh.
audi a4 2.8 1999 6 manual(m5) front-wheel drive 18 26 p compact 1999-11-02 16.82888 NA Meh.

Data Dictionary

After the data is set up, the data dictionary can be printed as follows.

html(contents(mpg), 
     sortlevels = TRUE,
     maxlevels = Inf, 
     levelType = "table")

Data frame:mpg

234 observations and 15 variables, maximum # NAs:55  
NameLabelsLevelsClassStorageNAs
manumanufacturer15integer 0
modelmodel name38integer 0
displengine displacement, in litresnumericdouble 0
yearyear of manufacture 2orderedinteger 0
cylnumber of cylinders 4orderedinteger 0
transtype of transmission10integer 0
drvdrive type 3integer 0
ctycity miles per gallonintegerinteger 0
hwyhighway miles per gallonintegerinteger 0
flfuel type 5integer 0
classtype of car 7integer 0
dpdate of purchaseDatedouble 0
rnsome random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5numericdouble50
partysome random political parties 3integer55
commentssome random commentscharactercharacter31

VariableLevels
manuaudi
chevrolet
dodge
ford
honda
hyundai
jeep
land rover
lincoln
mercury
nissan
pontiac
subaru
toyota
volkswagen
model4runner 4wd
a4
a4 quattro
a6 quattro
altima
c1500 suburban 2wd
camry
camry solara
caravan 2wd
civic
corolla
corvette
dakota pickup 4wd
durango 4wd
expedition 2wd
explorer 4wd
f150 pickup 4wd
forester awd
grand cherokee 4wd
grand prix
gti
impreza awd
jetta
k1500 tahoe 4wd
land cruiser wagon 4wd
malibu
maxima
mountaineer 4wd
mustang
navigator 2wd
new beetle
passat
pathfinder 4wd
ram 1500 pickup 4wd
range rover
sonata
tiburon
toyota tacoma 4wd
year1999
2008
cyl4
5
6
8
transauto(av)
auto(l3)
auto(l4)
auto(l5)
auto(l6)
auto(s4)
auto(s5)
auto(s6)
manual(m5)
manual(m6)
drvfront-wheel drive
rear wheel drive
4wd
flc
d
e
p
r
class2seater
compact
midsize
minivan
pickup
subcompact
suv
partyrepublican
democrat
independent

Analytic Data

This analytic data is created using the datatable function of the DT library. There are many options here, so I will only cover the high points. For more information check out the excellent documentation here.

The main things I want to show are the extensions declared by the extensions argument. The Buttons option creates buttons so that the end user can download the data set in various formats. The KeyTable allows users to select cells within the table and navigate similar to how they would in spreadsheet software. The FixedColumns allows us to create fixed columns that follow the cursor and scroll bars.

Another thing to keep in mind is to remove commas from the analytic data set if you want to allow users to export to a CSV format.

commas <- lapply(mpg, grep, pattern = ",")
commas <- names(which(unlist(lapply(commas, length)) > 0))
mpg[, commas] <- gsub(",", "_comma_", mpg[, commas], fixed = TRUE)

datatable(mpg, escape = FALSE,
          extensions = c('Buttons', 'KeyTable', 'FixedColumns'),
          class = 'cell-border stripe',
          rownames = TRUE,
          options = list(
            dom = 'Bfrtip',
            pageLength = 5,
            deferRender = TRUE,
            fixedColumns = list(leftColumns = 1),
            responsive = TRUE,
            scrollX = TRUE,
            scrollCollaspe = TRUE,
            paging = TRUE,
            autoWidth = TRUE,
            keys = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
          ))

Session Info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] DT_0.24          Hmisc_4.7-0      Formula_1.2-4    survival_3.2-13 
## [5] lattice_0.20-45  ggplot2_3.3.6    kableExtra_1.3.4 bookdown_0.28   
## [9] knitr_1.39      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9          svglite_2.1.0       deldir_1.0-6       
##  [4] png_0.1-7           digest_0.6.29       utf8_1.2.2         
##  [7] R6_2.5.1            backports_1.4.1     evaluate_0.16      
## [10] httr_1.4.3          pillar_1.8.0        rlang_1.0.4        
## [13] data.table_1.14.2   rstudioapi_0.13     jquerylib_0.1.4    
## [16] rpart_4.1.16        Matrix_1.4-1        checkmate_2.1.0    
## [19] rmarkdown_2.14      splines_4.2.1       webshot_0.5.3      
## [22] stringr_1.4.0       foreign_0.8-82      htmlwidgets_1.5.4  
## [25] munsell_0.5.0       compiler_4.2.1      xfun_0.32          
## [28] pkgconfig_2.0.3     systemfonts_1.0.4   base64enc_0.1-3    
## [31] htmltools_0.5.3     nnet_7.3-17         tibble_3.1.8       
## [34] gridExtra_2.3       htmlTable_2.4.1     fansi_1.0.3        
## [37] viridisLite_0.4.0   withr_2.5.0         grid_4.2.1         
## [40] jsonlite_1.8.0      gtable_0.3.0        lifecycle_1.0.1    
## [43] magrittr_2.0.3      scales_1.2.0        cli_3.3.0          
## [46] stringi_1.7.8       cachem_1.0.6        renv_0.12.2        
## [49] latticeExtra_0.6-30 xml2_1.3.3          bslib_0.4.0        
## [52] vctrs_0.4.1         RColorBrewer_1.1-3  tools_4.2.1        
## [55] interp_1.1-3        glue_1.6.2          crosstalk_1.2.0    
## [58] jpeg_0.1-9          fastmap_1.1.0       yaml_2.3.5         
## [61] colorspace_2.0-3    cluster_2.1.3       rvest_1.0.2        
## [64] sass_0.4.2

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
RStudio Team. 2016. RStudio: Integrated Development Environment for r. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with r Markdown. https://github.com/rstudio/bookdown.
———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.name/knitr/.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the JavaScript Library ’DataTables’. https://CRAN.R-project.org/package=DT.
Zhu, Hao. 2019. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.