Introduction

It is not an overstatement to say that the default output for a Principal Components Analysis (PCA) biplot using the biplot() function in R is gross. An example of how bad it can be without modification is shown below using the farmworker stress data (discussed in detail here).

This presentation will show how to use the output from the prcomp() function with the ggplot2 library to create a more aesthetically pleasing and informative PCA biplot (Wickham 2016). Throughout, this presentation uses the R programming language and assumes the end user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2020; RStudio Team 2020). All of the files needed to reproduce these results can be downloaded from the Git repository git clone https://git.waderstats.com/aesthetic_pca_biplot/.

Libraries

The libraries knitr, bookdown, kableExtra, and DT, generate the HTML output (Xie 2019, 2018; Zhu 2019; Xie, Cheng, and Tan 2018). The ggplot2 package is used to generate the final biplot (Wickham 2016).

package_loader <- function(x, ...) {
  if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "DT", "ggplot2")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

Data Setup

The data are loaded from a previously constructed R data set and then subset only to include observations from the thinning season when the farmworkers are most active. We also remove two columns where there is no variance in the answers of the participants.

# Load the data
load("data/farmstress.RData")

# Season when the farmworkers are the most active.
farmstressS1 <- na.omit(farmstress[which(farmstress$season == "Thinning"), ])

# There is no variance to quantify in these columns
farmstressS1 <- farmstressS1[, (-1)*which(colnames(farmstressS1) == "violence")]
farmstressS1 <- farmstressS1[, (-1)*which(colnames(farmstressS1) == "druguse")]

# Variable to hold the occupation (e.g. farmworker vs. non-farmworker)
farmstressS1Occupation <- farmstressS1$occupation

# A dataframe that removes occupation and season so it contains only numeric values.
farmstressS1 <- farmstressS1[, c(-1, -2)]

PCA Analysis

The PCA code for farm stress data is shown below. Generally, any PCA’s first step is to scale the data so that each feature is comparable, followed by mean centering, shifting the data on their axes without altering any stochastic properties. In this specific case, but generally, not so, scaling isn’t necessary since each column is measured using the same Likert-type scale.

We use the prcomp() function to do the PCA and subsequently extract the loadings, scores matrix, and feature importance as measured by how much each of the variability each PC axis explains in the observed data.

# Make the features comparable by centering each column around its mean.  Since the data are already on the same scale, we do not need to also divide by their respective variance (e.g., scale = FALSE).
farmstressS1 <- scale(farmstressS1, center = TRUE, scale = FALSE)

# This function does all of the math to create the PC matrix factorization.
farmstressS1_pca <- prcomp(farmstressS1, center = FALSE, scale. = FALSE, retx = TRUE)

# Pieces from the PCA used in the biplot.
farmstressS1_pca_loadings <- farmstressS1_pca$rotation # Loadings Matrix
farmstressS1_pca_scores <- predict(farmstressS1_pca) # Scores Matris
farmstressS1_pca_importance <- summary(farmstressS1_pca)$importance # Explained Variance

PCA Biplot

The PCA biplot goal is to plot the projections of the observed data onto the first two PC axes and how well each of the features from the original feature space load onto the new factorization. To that end, we can deconstruct the process into five steps:

  1. Setup the data to be plotted.
  2. Plot the scores.
  3. Plot segments showing how well each feature loads onto the first two PCs.
  4. Label the axes with the explained variance of the first two PCs.
  5. Adjust the plot theme to make it more pretty!

Setup the Data

The scale of the feature loadings is between zero and one because they are correlations. Therefore it would be best if the PC axes we want to plot are also on the same scale. We can achieve this by using a min-max feature scaling approach.

# The plot data includes occupation and the first two PC components.
plotDataScores <- cbind(occupation = farmstressS1Occupation, as.data.frame(farmstressS1_pca_scores[, c("PC1", "PC2")]))

# We want to use min-max feature scaling on the scores so they are between zero and one, the same as the loadings.
normalize <- function(x) return ((x - min(x)) / (max(x) - min(x)))

plotDataScores[, "PC1"] <- scale(normalize(plotDataScores[, "PC1"]), center = TRUE, scale = FALSE)
plotDataScores[, "PC2"] <- scale(normalize(plotDataScores[, "PC2"]), center = TRUE, scale = FALSE)

plotDataLoadings <- as.data.frame(farmstressS1_pca_loadings)

Plot the Scores

The first thing to do is plot the scores, and since we have a categorical occupation, color the points as such to see if there is any separation between the two groups. The default colors are not very nice, so we choose two different (yes, colorblind-friendly!) colors using the scale_color_manual() function.

p1 <- ggplot() + 
  geom_point(data = plotDataScores, mapping = aes(x = PC1, y = PC2, colour = occupation)) + 
  scale_color_manual(values = c("#8F57BA55", "#57BA7D55"))

p1

Plot the Loadings

To express the loadings in the two dimensional PC space, we plot arrows that are the same length as to how well each original feature loads onto the PC space as measure by their correlation.

p2 <- p1 + 
  geom_segment(data = plotDataLoadings, aes(x = 0, y = 0, xend = PC1, yend = PC2), arrow = arrow(length = unit(0.03, "npc")), alpha = 0.2) + 
  geom_text(data = plotDataLoadings, mapping = aes(x = PC1, y = PC2, label = rownames(plotDataLoadings)), hjust = 1, vjust = -0.2, colour = "darkred", size = 4, check_overlap = TRUE)

p2

Label the Axes

The X and Y labels can also contain information. In this case, we add the proportion of variance explained by the first two PCs, respectively.

p3 <- p2 + 
  xlab(paste("PC1 (", round(farmstressS1_pca_importance["Proportion of Variance", "PC1"]*100, 1),"%)", sep = "")) + 
  ylab(paste("PC2 (", round(farmstressS1_pca_importance["Proportion of Variance", "PC2"]*100, 1),"%)", sep = ""))

p3

Update the Theme

We can make some finish touches by updating the basic theme to make the plot look nicer.

p4 <- p3 + 
  theme(axis.line = element_line(colour = "black"),
        panel.border = element_rect(colour = "black", fill = NA, size = 1),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = 50, hjust = 1),
        axis.title = element_text(size = 12),
        legend.title = element_blank(),
        legend.position = "right",
        panel.grid = element_line(color = "lightgray"),
        panel.background = element_rect(fill = "white", colour = "white"))

p4

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2    DT_0.16          kableExtra_1.3.1 bookdown_0.21   
## [5] knitr_1.30      
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.3    pillar_1.4.7      tools_4.0.3       digest_0.6.27    
##  [5] evaluate_0.14     lifecycle_0.2.0   tibble_3.0.4      gtable_0.3.0     
##  [9] viridisLite_0.3.0 pkgconfig_2.0.3   rlang_0.4.9       rstudioapi_0.13  
## [13] yaml_2.2.1        xfun_0.19         withr_2.3.0       httr_1.4.2       
## [17] stringr_1.4.0     dplyr_1.0.2       xml2_1.3.2        generics_0.1.0   
## [21] htmlwidgets_1.5.3 vctrs_0.3.5       tidyselect_1.1.0  webshot_0.5.2    
## [25] grid_4.0.3        glue_1.4.2        R6_2.5.0          rmarkdown_2.6    
## [29] farver_2.0.3      purrr_0.3.4       magrittr_2.0.1    scales_1.1.1     
## [33] htmltools_0.5.0   ellipsis_0.3.1    rvest_0.3.6       colorspace_2.0-0 
## [37] renv_0.12.2       labeling_0.4.2    stringi_1.5.3     munsell_0.5.0    
## [41] crayon_1.3.4

References

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.