Creating an analytic data set is very important when doing data analysis and will be used to reproduce the results.
An analytic data set should include the following:
This presentation uses the R programming language and assumes the end user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2019; RStudio Team 2016). All of the files needed to reproduce these results can be downloaded from the Git repository git clone https://git.waderstats.com/creating_an_analytic_data_set/
.
The libraries knitr, bookdown, and kableExtra are used to generate the HTML output (Xie 2019, 2018; Zhu 2019). The ggplot2 library is loaded for the example data set that is used (Wickham 2016). The Hmisc and DT libraries provide functions needed to make and display the analytic data set (Harrell Jr, Charles Dupont, and others. 2019; Xie, Cheng, and Tan 2018).
package_loader <- function(x, ...) {
if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
library(x, ...)
}
packages <- c("knitr", "bookdown", "kableExtra", "ggplot2", "Hmisc", "DT")
invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))
The example data for this presentation will be mpg from the ggplot2 package. From the description in the manual:
This dataset contains a subset of the fuel economy data that the EPA makes available here. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
data(mpg)
mpg <- data.frame(mpg)
kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
There are two essential parts to setting up the data. First, make sure each variable is coded to reflect its class. Second, we want to add labels to each variable in the data set using the label function of the Hmisc library.
set.seed(123)
data(mpg)
mpg <- data.frame(mpg)
colnames(mpg)[which(colnames(mpg) == "manufacturer")] <- "manu"
mpg$manu <- factor(mpg$manu)
mpg$model <- factor(mpg$model)
mpg$displ <- as.numeric(mpg$displ)
mpg$year <- factor(mpg$year, levels = c("1999", "2008"), ordered = TRUE)
mpg$dp <- as.Date(NA, origin = "1970-01-01")
mpg$dp[which(mpg$year == "1999")] <- sample(seq(as.Date('1999-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('1999-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)
mpg$dp[which(mpg$year == "2008")] <- sample(seq(as.Date('2008-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('2008-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)
mpg$cyl <- factor(mpg$cyl, levels = c(4, 5, 6, 8), ordered = TRUE)
mpg$trans <- factor(mpg$trans)
mpg$drv <- factor(mpg$drv, levels = c("f", "r", "4"), labels = c("front-wheel drive", "rear wheel drive", "4wd"))
mpg$fl <- factor(mpg$fl)
mpg$class <- factor(mpg$class)
mpg$rn <- rnorm(dim(mpg)[1], mean = 10, sd = 5)
mpg$rn[sample(1:length(mpg$rn), size = 50)] <- NA
mpg$party <- factor(sample(c("republican", "democrat", "independent", NA), dim(mpg)[1], replace = TRUE), levels = c("republican", "democrat", "independent"))
mpg$comments <- sample(c("I like this car!", "Meh.", "This is the worst car ever!", "Does it come in green?", "want cheese flavoured cars.", "Does it also fly?", "Blah, Blah, Blah, Blah, Blah, Blah, Blah, Blah", NA), dim(mpg)[1], replace = TRUE)
label(mpg$manu) <- "manufacturer"
label(mpg$model) <- "model name"
label(mpg$displ) <- "engine displacement, in litres"
label(mpg$year) <- "year of manufacture"
label(mpg$dp) <- "date of purchase"
label(mpg$cyl) <- "number of cylinders"
label(mpg$trans) <- "type of transmission"
label(mpg$drv) <- "drive type"
label(mpg$cty) <- "city miles per gallon"
label(mpg$hwy) <- "highway miles per gallon"
label(mpg$fl) <- "fuel type"
label(mpg$class) <- "type of car"
label(mpg$rn) <- "some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5"
label(mpg$party) <- "some random political parties"
label(mpg$comments) <- "some random comments"
kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
manu | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | dp | rn | party | comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | front-wheel drive | 18 | 29 | p | compact | 1999-06-28 | 18.98570 | democrat | I like this car! |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | front-wheel drive | 21 | 29 | p | compact | 1999-01-14 | NA | NA | I like this car! |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | front-wheel drive | 20 | 31 | p | compact | 2008-02-08 | 19.50450 | independent | Does it also fly? |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | front-wheel drive | 21 | 30 | p | compact | 2008-07-14 | NA | independent | I like this car! |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | front-wheel drive | 16 | 26 | p | compact | 1999-07-14 | 13.68097 | democrat | Meh. |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | front-wheel drive | 18 | 26 | p | compact | 1999-11-02 | 16.82888 | NA | Meh. |
After the data is set up, the data dictionary can be printed as follows.
html(contents(mpg),
sortlevels = TRUE,
maxlevels = Inf,
levelType = "table")
Name | Labels | Levels | Class | Storage | NAs |
---|---|---|---|---|---|
manu | manufacturer | 15 | integer | 0 | |
model | model name | 38 | integer | 0 | |
displ | engine displacement, in litres | numeric | double | 0 | |
year | year of manufacture | 2 | ordered | integer | 0 |
cyl | number of cylinders | 4 | ordered | integer | 0 |
trans | type of transmission | 10 | integer | 0 | |
drv | drive type | 3 | integer | 0 | |
cty | city miles per gallon | integer | integer | 0 | |
hwy | highway miles per gallon | integer | integer | 0 | |
fl | fuel type | 5 | integer | 0 | |
class | type of car | 7 | integer | 0 | |
dp | date of purchase | Date | double | 0 | |
rn | some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5 | numeric | double | 50 | |
party | some random political parties | 3 | integer | 55 | |
comments | some random comments | character | character | 31 |
Variable | Levels |
---|---|
manu | audi |
chevrolet | |
dodge | |
ford | |
honda | |
hyundai | |
jeep | |
land rover | |
lincoln | |
mercury | |
nissan | |
pontiac | |
subaru | |
toyota | |
volkswagen | |
model | 4runner 4wd |
a4 | |
a4 quattro | |
a6 quattro | |
altima | |
c1500 suburban 2wd | |
camry | |
camry solara | |
caravan 2wd | |
civic | |
corolla | |
corvette | |
dakota pickup 4wd | |
durango 4wd | |
expedition 2wd | |
explorer 4wd | |
f150 pickup 4wd | |
forester awd | |
grand cherokee 4wd | |
grand prix | |
gti | |
impreza awd | |
jetta | |
k1500 tahoe 4wd | |
land cruiser wagon 4wd | |
malibu | |
maxima | |
mountaineer 4wd | |
mustang | |
navigator 2wd | |
new beetle | |
passat | |
pathfinder 4wd | |
ram 1500 pickup 4wd | |
range rover | |
sonata | |
tiburon | |
toyota tacoma 4wd | |
year | 1999 |
2008 | |
cyl | 4 |
5 | |
6 | |
8 | |
trans | auto(av) |
auto(l3) | |
auto(l4) | |
auto(l5) | |
auto(l6) | |
auto(s4) | |
auto(s5) | |
auto(s6) | |
manual(m5) | |
manual(m6) | |
drv | front-wheel drive |
rear wheel drive | |
4wd | |
fl | c |
d | |
e | |
p | |
r | |
class | 2seater |
compact | |
midsize | |
minivan | |
pickup | |
subcompact | |
suv | |
party | republican |
democrat | |
independent |
This analytic data is created using the datatable function of the DT library. There are many options here, so I will only cover the high points. For more information check out the excellent documentation here.
The main things I want to show are the extensions declared by the extensions argument. The Buttons option creates buttons so that the end user can download the data set in various formats. The KeyTable allows users to select cells within the table and navigate similar to how they would in spreadsheet software. The FixedColumns allows us to create fixed columns that follow the cursor and scroll bars.
Another thing to keep in mind is to remove commas from the analytic data set if you want to allow users to export to a CSV format.
commas <- lapply(mpg, grep, pattern = ",")
commas <- names(which(unlist(lapply(commas, length)) > 0))
mpg[, commas] <- gsub(",", "_comma_", mpg[, commas], fixed = TRUE)
datatable(mpg, escape = FALSE,
extensions = c('Buttons', 'KeyTable', 'FixedColumns'),
class = 'cell-border stripe',
rownames = TRUE,
options = list(
dom = 'Bfrtip',
pageLength = 5,
deferRender = TRUE,
fixedColumns = list(leftColumns = 1),
responsive = TRUE,
scrollX = TRUE,
scrollCollaspe = TRUE,
paging = TRUE,
autoWidth = TRUE,
keys = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] DT_0.24 Hmisc_4.7-0 Formula_1.2-4 survival_3.2-13
## [5] lattice_0.20-45 ggplot2_3.3.6 kableExtra_1.3.4 bookdown_0.28
## [9] knitr_1.39
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 svglite_2.1.0 deldir_1.0-6
## [4] png_0.1-7 digest_0.6.29 utf8_1.2.2
## [7] R6_2.5.1 backports_1.4.1 evaluate_0.16
## [10] httr_1.4.3 pillar_1.8.0 rlang_1.0.4
## [13] data.table_1.14.2 rstudioapi_0.13 jquerylib_0.1.4
## [16] rpart_4.1.16 Matrix_1.4-1 checkmate_2.1.0
## [19] rmarkdown_2.14 splines_4.2.1 webshot_0.5.3
## [22] stringr_1.4.0 foreign_0.8-82 htmlwidgets_1.5.4
## [25] munsell_0.5.0 compiler_4.2.1 xfun_0.32
## [28] pkgconfig_2.0.3 systemfonts_1.0.4 base64enc_0.1-3
## [31] htmltools_0.5.3 nnet_7.3-17 tibble_3.1.8
## [34] gridExtra_2.3 htmlTable_2.4.1 fansi_1.0.3
## [37] viridisLite_0.4.0 withr_2.5.0 grid_4.2.1
## [40] jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.1
## [43] magrittr_2.0.3 scales_1.2.0 cli_3.3.0
## [46] stringi_1.7.8 cachem_1.0.6 renv_0.12.2
## [49] latticeExtra_0.6-30 xml2_1.3.3 bslib_0.4.0
## [52] vctrs_0.4.1 RColorBrewer_1.1-3 tools_4.2.1
## [55] interp_1.1-3 glue_1.6.2 crosstalk_1.2.0
## [58] jpeg_0.1-9 fastmap_1.1.0 yaml_2.3.5
## [61] colorspace_2.0-3 cluster_2.1.3 rvest_1.0.2
## [64] sass_0.4.2