Introduction

Creating an analytic data set is very important when doing data analysis and will be used to reproduce the results.

An analytic data set should include the following:

Variables (observed data) used in the analysis
Variables (observed data) used to construct other variables used in the analysis
Variable labels
Variable coding

This presentation uses the R programming language and assumes the end user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2019; RStudio Team 2016). All of the files needed to reproduce these results can be downloaded from the Git repository git clone https://git.waderstats.com/creating_an_analytic_data_set/.

Libraries

The libraries knitr, bookdown, and kableExtra are used to generate the HTML output (Xie 2019, 2018; Zhu 2019). The ggplot2 library is loaded for the example data set that is used (Wickham 2016). The Hmisc and DT libraries provide functions needed to make and display the analytic data set (Harrell Jr, Charles Dupont, and others. 2019; Xie, Cheng, and Tan 2018).

package_loader <- function(x, ...) {
  if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "ggplot2", "Hmisc", "DT")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

Example Data

The example data for this presentation will be mpg from the ggplot2 package. From the description in the manual:

This dataset contains a subset of the fuel economy data that the EPA makes available here. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

data(mpg)
mpg <- data.frame(mpg)

kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Table 1: Header of **mpg**.
manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

Data Setup

There are two essential parts to setting up the data. First, make sure each variable is coded to reflect its class. Second, we want to add labels to each variable in the data set using the label function of the Hmisc library.

set.seed(123)
data(mpg)
mpg <- data.frame(mpg)

colnames(mpg)[which(colnames(mpg) == "manufacturer")] <- "manu"

mpg$manu <- factor(mpg$manu)
mpg$model <- factor(mpg$model)
mpg$displ <- as.numeric(mpg$displ)
mpg$year <- factor(mpg$year, levels = c("1999", "2008"), ordered = TRUE)

mpg$dp <- as.Date(NA, origin = "1970-01-01")
mpg$dp[which(mpg$year == "1999")] <- sample(seq(as.Date('1999-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('1999-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)
mpg$dp[which(mpg$year == "2008")] <- sample(seq(as.Date('2008-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('2008-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)

mpg$cyl <- factor(mpg$cyl, levels = c(4, 5, 6, 8), ordered = TRUE)
mpg$trans <- factor(mpg$trans)
mpg$drv <- factor(mpg$drv, levels = c("f", "r", "4"), labels = c("front-wheel drive", "rear wheel drive", "4wd"))
mpg$fl <- factor(mpg$fl)
mpg$class <- factor(mpg$class)

mpg$rn <- rnorm(dim(mpg)[1], mean = 10, sd = 5)
mpg$rn[sample(1:length(mpg$rn), size = 50)] <- NA

mpg$party <- factor(sample(c("republican", "democrat", "independent", NA), dim(mpg)[1], replace = TRUE), levels = c("republican", "democrat", "independent"))
mpg$comments <- sample(c("I like this car!", "Meh.", "This is the worst car ever!", "Does it come in green?", "want cheese flavoured cars.", "Does it also fly?", "Blah, Blah, Blah, Blah, Blah, Blah, Blah, Blah", NA), dim(mpg)[1], replace = TRUE)

label(mpg$manu) <- "manufacturer"
label(mpg$model) <- "model name"
label(mpg$displ) <- "engine displacement, in litres"
label(mpg$year) <- "year of manufacture"
label(mpg$dp) <- "date of purchase"
label(mpg$cyl) <- "number of cylinders"
label(mpg$trans) <- "type of transmission"
label(mpg$drv) <- "drive type"
label(mpg$cty) <- "city miles per gallon"
label(mpg$hwy) <- "highway miles per gallon"
label(mpg$fl) <- "fuel type"
label(mpg$class) <- "type of car"
label(mpg$rn) <- "some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5"
label(mpg$party) <- "some random political parties"
label(mpg$comments) <- "some random comments"

kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Table 2: Header of **mpg**.
manu	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	dp	rn	party	comments
audi	a4	1.8	1999	4	auto(l5)	front-wheel drive	18	29	p	compact	1999-06-28	18.98570	democrat	I like this car!
audi	a4	1.8	1999	4	manual(m5)	front-wheel drive	21	29	p	compact	1999-01-14	NA	NA	I like this car!
audi	a4	2.0	2008	4	manual(m6)	front-wheel drive	20	31	p	compact	2008-02-08	19.50450	independent	Does it also fly?
audi	a4	2.0	2008	4	auto(av)	front-wheel drive	21	30	p	compact	2008-07-14	NA	independent	I like this car!
audi	a4	2.8	1999	6	auto(l5)	front-wheel drive	16	26	p	compact	1999-07-14	13.68097	democrat	Meh.
audi	a4	2.8	1999	6	manual(m5)	front-wheel drive	18	26	p	compact	1999-11-02	16.82888	NA	Meh.

Data Dictionary

After the data is set up, the data dictionary can be printed as follows.

html(contents(mpg), 
     sortlevels = TRUE,
     maxlevels = Inf, 
     levelType = "table")

Data frame:mpg

234 observations and 15 variables, maximum # NAs:55

Name	Labels	Levels	Class	Storage	NAs
manu	manufacturer	15		integer	0
model	model name	38		integer	0
displ	engine displacement, in litres		numeric	double	0
year	year of manufacture	2	ordered	integer	0
cyl	number of cylinders	4	ordered	integer	0
trans	type of transmission	10		integer	0
drv	drive type	3		integer	0
cty	city miles per gallon		integer	integer	0
hwy	highway miles per gallon		integer	integer	0
fl	fuel type	5		integer	0
class	type of car	7		integer	0
dp	date of purchase		Date	double	0
rn	some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5		numeric	double	50
party	some random political parties	3		integer	55
comments	some random comments		character	character	31

Variable	Levels
manu	audi
	chevrolet
	dodge
	ford
	honda
	hyundai
	jeep
	land rover
	lincoln
	mercury
	nissan
	pontiac
	subaru
	toyota
	volkswagen
model	4runner 4wd
	a4
	a4 quattro
	a6 quattro
	altima
	c1500 suburban 2wd
	camry
	camry solara
	caravan 2wd
	civic
	corolla
	corvette
	dakota pickup 4wd
	durango 4wd
	expedition 2wd
	explorer 4wd
	f150 pickup 4wd
	forester awd
	grand cherokee 4wd
	grand prix
	gti
	impreza awd
	jetta
	k1500 tahoe 4wd
	land cruiser wagon 4wd
	malibu
	maxima
	mountaineer 4wd
	mustang
	navigator 2wd
	new beetle
	passat
	pathfinder 4wd
	ram 1500 pickup 4wd
	range rover
	sonata
	tiburon
	toyota tacoma 4wd
year	1999
	2008
cyl	4
	5
	6
	8
trans	auto(av)
	auto(l3)
	auto(l4)
	auto(l5)
	auto(l6)
	auto(s4)
	auto(s5)
	auto(s6)
	manual(m5)
	manual(m6)
drv	front-wheel drive
	rear wheel drive
	4wd
fl	c
	d
	e
	p
	r
class	2seater
	compact
	midsize
	minivan
	pickup
	subcompact
	suv
party	republican
	democrat
	independent

Analytic Data

This analytic data is created using the datatable function of the DT library. There are many options here, so I will only cover the high points. For more information check out the excellent documentation here.

The main things I want to show are the extensions declared by the extensions argument. The Buttons option creates buttons so that the end user can download the data set in various formats. The KeyTable allows users to select cells within the table and navigate similar to how they would in spreadsheet software. The FixedColumns allows us to create fixed columns that follow the cursor and scroll bars.

Another thing to keep in mind is to remove commas from the analytic data set if you want to allow users to export to a CSV format.

commas <- lapply(mpg, grep, pattern = ",")
commas <- names(which(unlist(lapply(commas, length)) > 0))
mpg[, commas] <- gsub(",", "_comma_", mpg[, commas], fixed = TRUE)

datatable(mpg, escape = FALSE,
          extensions = c('Buttons', 'KeyTable', 'FixedColumns'),
          class = 'cell-border stripe',
          rownames = TRUE,
          options = list(
            dom = 'Bfrtip',
            pageLength = 5,
            deferRender = TRUE,
            fixedColumns = list(leftColumns = 1),
            responsive = TRUE,
            scrollX = TRUE,
            scrollCollaspe = TRUE,
            paging = TRUE,
            autoWidth = TRUE,
            keys = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
          ))

Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] DT_0.24          Hmisc_4.7-0      Formula_1.2-4    survival_3.2-13 
## [5] lattice_0.20-45  ggplot2_3.3.6    kableExtra_1.3.4 bookdown_0.28   
## [9] knitr_1.39      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9          svglite_2.1.0       deldir_1.0-6       
##  [4] png_0.1-7           digest_0.6.29       utf8_1.2.2         
##  [7] R6_2.5.1            backports_1.4.1     evaluate_0.16      
## [10] httr_1.4.3          pillar_1.8.0        rlang_1.0.4        
## [13] data.table_1.14.2   rstudioapi_0.13     jquerylib_0.1.4    
## [16] rpart_4.1.16        Matrix_1.4-1        checkmate_2.1.0    
## [19] rmarkdown_2.14      splines_4.2.1       webshot_0.5.3      
## [22] stringr_1.4.0       foreign_0.8-82      htmlwidgets_1.5.4  
## [25] munsell_0.5.0       compiler_4.2.1      xfun_0.32          
## [28] pkgconfig_2.0.3     systemfonts_1.0.4   base64enc_0.1-3    
## [31] htmltools_0.5.3     nnet_7.3-17         tibble_3.1.8       
## [34] gridExtra_2.3       htmlTable_2.4.1     fansi_1.0.3        
## [37] viridisLite_0.4.0   withr_2.5.0         grid_4.2.1         
## [40] jsonlite_1.8.0      gtable_0.3.0        lifecycle_1.0.1    
## [43] magrittr_2.0.3      scales_1.2.0        cli_3.3.0          
## [46] stringi_1.7.8       cachem_1.0.6        renv_0.12.2        
## [49] latticeExtra_0.6-30 xml2_1.3.3          bslib_0.4.0        
## [52] vctrs_0.4.1         RColorBrewer_1.1-3  tools_4.2.1        
## [55] interp_1.1-3        glue_1.6.2          crosstalk_1.2.0    
## [58] jpeg_0.1-9          fastmap_1.1.0       yaml_2.3.5         
## [61] colorspace_2.0-3    cluster_2.1.3       rvest_1.0.2        
## [64] sass_0.4.2

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2016. RStudio: Integrated Development Environment for r. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with r Markdown. https://github.com/rstudio/bookdown.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.name/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the JavaScript Library ’DataTables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.