Introduction

When the parameters of a general linear model are estimated, analysts often report the main effects. Sometimes, however, the hypothesis of interest is a linear combination of the main effects that is not displayed by default in a standard regression table. Testing several hypotheses from the same linear model is especially relevant when the model includes categorical variables, since many post-hoc comparisons among their levels may be of interest. This presentation shows how to code a categorical variable for use in the general linear model and how to use tests of the so-called general linear hypothesis to test any number of hypotheses about the effects (Fox 2008).

This presentation uses the R programming language and assumes the reader is using the RStudio IDE to compile R Markdown files into HTML (R Core Team 2020; RStudio Team 2020). All of the files needed to reproduce these results can be obtained by cloning the Git repository: git clone https://git.waderstats.com/general_linear_hypothesis_application/.

Required Libraries

The libraries knitr, bookdown, and kableExtra are loaded to generate the HTML output (Xie 2021, 2020; Zhu 2021). The library magrittr is loaded so we can use the pipe (%>%) operator (Bache and Wickham 2020). The library DT is loaded to display simulated data in an output table (Xie, Cheng, and Tan 2021). Data summaries are facilitated by the dplyr library (Wickham et al. 2021). The library multcomp is loaded so we can do tests of the general linear hypothesis (Hothorn, Bretz, and Westfall 2008; Fox 2008).

# Install a package if it is not already present, then load it.
package_loader <- function(x, ...) {
  if (!(x %in% rownames(installed.packages()))) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "magrittr", "DT", "dplyr", "multcomp")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

Data Generating Process

Suppose we have a set of \(i \in \{1, \ldots, n\}\) independent normal random variables, each denoted by \(y_i\), whose distribution depends on a categorical variable with three levels, \(x \in \{l, m, h\}\), corresponding to low, medium, and high, respectively. We will further assume that \(E[y_i|x] = \mu_x\) with constant variance \(\sigma^2\), so that conditional on \(x\) the \(y_i\) are identically distributed. The data generating process is written concisely in equation (1).

\[\begin{equation} \begin{split} y_i|x \sim N(\mu_{x}, \sigma^2) \end{split} \tag{1} \end{equation}\]

Simulation Data

Simulating data from this data generating process is straightforward. We will assume \(\mu_{x = l} = 1\), \(\mu_{x = m} = 1\), and \(\mu_{x = h} = 2\) with fixed variance \(\sigma^2 = 1\), and generate a single realization for each of \(n = 100\) normal random variables.

set.seed(123)

# Create an empty data frame to hold the simulated data.
d <- matrix(nrow = 100, ncol = 2)
colnames(d) <- c("Y", "X")
d <- as.data.frame(d)

# We assume without loss of generality that each of the categories is equally likely to be observed.
X <- sample(x = c("l", "m", "h"), size = 100, replace = TRUE, prob = c(1/3, 1/3, 1/3))
d$X <- X

# Since the distribution of each y_i depends on the value of x, we generate the outcome separately for each possible value.
d[which(d$X == "l"), "Y"] <- rnorm(n = length(which(d$X == "l")), mean = 1, sd = 1)
d[which(d$X == "m"), "Y"] <- rnorm(n = length(which(d$X == "m")), mean = 1, sd = 1)
d[which(d$X == "h"), "Y"] <- rnorm(n = length(which(d$X == "h")), mean = 2, sd = 1)

# The factor function tells R that X is a factor and should be treated as such in downstream analyses.
d$X <- factor(d$X, levels = c("l", "m", "h"))

# We can use the datatable function to display the simulated data.
datatable(d, escape = FALSE, 
          caption = "This table contains the simulated data from the data generating process.  Each value in the Y column is a realization of an independent and identically distributed normal random variable whose expectation is conditional on the value in the X column.",
          extensions = c('Buttons', 'KeyTable'),
          class = 'cell-border stripe',
          rownames = TRUE,
          options = list(
            dom = 'Bfrtip',
            pageLength = 10,
            deferRender = TRUE,
            responsive = TRUE,
            scrollX = TRUE,
            scrollCollapse = TRUE,
            paging = TRUE,
            autoWidth = FALSE,
            keys = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
          ))

Table 1 shows the estimated mean and variance for the outcome for each level of the categorical variable in the simulated data set.

dSummaries <- d %>% 
  group_by(X) %>%
  summarize(
    mean = round(mean(Y, na.rm = TRUE), 2),
    variance = round(sd(Y, na.rm = TRUE)^2, 2)
  )

dSummaries %>%
  kbl(escape = FALSE, caption = "The estimated mean and variance of the outcome for each value of the categorical variable in the simulated data.") %>%
  kable_classic_2(full_width = F)
Table 1: The estimated mean and variance of the outcome for each value of the categorical variable in the simulated data.
X mean variance
l 1.00 0.80
m 1.08 0.78
h 1.77 1.20

Dummy Coding the Categorical Predictor

We can write \(E[y|x] = \mu_x\) as a linear combination of a constant and \(x\) by dummy coding \(x\). Dummy coding transforms \(x\) into a set of dichotomous indicator variables, one per level: \(x_{l} = 1\) if the observation is in the low category and \(x_{l} = 0\) otherwise; \(x_{m} = 1\) if in the medium category and \(x_{m} = 0\) otherwise; \(x_{h} = 1\) if in the high category and \(x_{h} = 0\) otherwise. The model then includes the indicators for all but one of the levels; the omitted level is called the reference category.

If we treat \(x_l\) as the reference category, the expectation of \(y\) can be written as a linear function of the categorical predictor (equation (2)).

\[\begin{equation} \begin{split} E[y|x] = \alpha + \beta_1 x_{m} + \beta_2 x_{h} \end{split} \tag{2} \end{equation}\]
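Before extracting the full design matrix, we can inspect the coding scheme R assigns to the factor directly. This quick check is a sketch, not part of the original analysis; the contrasts function displays R's default treatment (dummy) coding, in which the reference level receives a row of zeros.

# Inspect the dummy coding R assigns to the factor X.
# The reference level "l" is the row of zeros.
contrasts(d$X)
##   m h
## l 0 0
## m 1 0
## h 0 1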

For the simulated data, the design matrix for the dummy coded categorical variable is shown below.

# The model.matrix function extracts the design matrix for a given formula.
dDM <- model.matrix(Y ~ X, data = d)

# We can use the datatable function to display the design matrix for the simulated data.
datatable(dDM, escape = FALSE, 
          caption = "The design matrix for the simulated data that shows the dummy coding for the categorical variable.",
          extensions = c('Buttons', 'KeyTable'),
          class = 'cell-border stripe',
          rownames = TRUE,
          options = list(
            dom = 'Bfrtip',
            pageLength = 10,
            deferRender = TRUE,
            responsive = TRUE,
            scrollX = TRUE,
            scrollCollapse = TRUE,
            paging = TRUE,
            autoWidth = FALSE,
            keys = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
          ))

Tests of the General Linear Hypothesis

A Simple Example

Under the dummy coding scheme, the mean of \(y\) for each level of the categorical variable can be derived from the linear predictor.

  • The mean value of \(y\) when the categorical variable is low is \(\mu_{x=l} = E[y|x_m = 0, x_h = 0] = \alpha\).
  • The mean value of \(y\) when the categorical variable is medium is \(\mu_{x=m} = E[y|x_m = 1, x_h = 0] = \alpha + \beta_1\).
  • The mean value of \(y\) when the categorical variable is high is \(\mu_{x=h} = E[y|x_m = 0, x_h = 1] = \alpha + \beta_2\).

The first step in deriving these effects for the current data is to fit a linear model to the data and extract the parameter estimates for the main effects.

# Estimate the parameters for the linear model
fit <- lm(Y ~ X, data = d)
fit
## 
## Call:
## lm(formula = Y ~ X, data = d)
## 
## Coefficients:
## (Intercept)           Xm           Xh  
##     0.99512      0.08678      0.77205

Next, we create a matrix containing a single row for each linear effect we care about. The linear effect matrix is multiplied by the vector of estimated coefficients for the main effects to get the effects of interest.

# Matrix that represents each linear effect we want to estimate.
mMat <- rbind(c(1, 0, 0), c(1, 1, 0), c(1, 0, 1))
mMat
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    1    1    0
## [3,]    1    0    1
# Vector of estimated coefficients from the linear model.
cMat <- matrix(coef(fit), nrow = 3)
cMat
##            [,1]
## [1,] 0.99511828
## [2,] 0.08678364
## [3,] 0.77205010
# The estimated value of y for each value of the category.
mMat %*% cMat
##           [,1]
## [1,] 0.9951183
## [2,] 1.0819019
## [3,] 1.7671684
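
Because the dummy-coded model is saturated in the categorical predictor, these fitted means reproduce the sample means from Table 1 exactly. As a quick sanity check (a sketch, not part of the original analysis), we can compute the group means directly:

# Sample mean of Y within each level of X; these match the
# linear combinations of the coefficients computed above.
tapply(d$Y, d$X, mean)
##         l         m         h 
## 0.9951183 1.0819019 1.7671684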

To test the null hypothesis that any of these estimated effects are zero, we can apply the general linear hypothesis (Fox 2008).

  • \(H_0: \alpha = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is low (\(x = l\)).
  • \(H_0: \alpha + \beta_1 = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is medium (\(x = m\)).
  • \(H_0: \alpha + \beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is high (\(x = h\)).
mTest <- glht(model = fit, linfct = mMat)
summary(mTest)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = Y ~ X, data = d)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0   0.9951     0.1680   5.923 1.35e-07 ***
## 2 == 0   1.0819     0.1680   6.439  < 1e-07 ***
## 3 == 0   1.7672     0.1655  10.676  < 1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
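
The glht object can also produce simultaneous confidence intervals consistent with the tests above; a minimal sketch:

# Simultaneous 95% confidence intervals for the three linear
# combinations of the coefficients.
confint(mTest, level = 0.95)

The point estimates match those in the summary above; the intervals are simultaneous because the critical value is taken from the joint (multivariate \(t\)) distribution of the estimates rather than from each marginal distribution.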

A More Complicated Example

Specifying linear hypotheses allows us to test any number of effects inferentially.

  • \(H_0: \mu_{x=l} - \mu_{x = h} = \alpha - (\alpha + \beta_2) = -\beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) in the referent category low (\(x = l\)) is the same as the mean value of \(y\) in the high category (\(x = h\)).

We can go further and compare the mean value of \(y\) in the low category to the mean value of \(y\) averaged over the medium and high categories.

  • \(H_0: \mu_{x=l} - \frac{\mu_{x=m} + \mu_{x = h}}{2} = \alpha - \frac{\alpha + \beta_1 + \alpha + \beta_2}{2} = -\frac{1}{2}\beta_1 - \frac{1}{2}\beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) in the reference category low (\(x = l\)) is the same as the average of the mean values of \(y\) in the medium and high categories (\(x = m\) and \(x = h\)).
oMat <- rbind(c(0, 0, -1), c(0, -1/2, -1/2))
oMat
##      [,1] [,2] [,3]
## [1,]    0  0.0 -1.0
## [2,]    0 -0.5 -0.5
oTest <- glht(model = fit, linfct = oMat)
summary(oTest)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = Y ~ X, data = d)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)   
## 1 == 0  -0.7721     0.2359  -3.273   0.0024 **
## 2 == 0  -0.4294     0.2053  -2.092   0.0577 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
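
As with the simple example, these estimates can be verified directly by multiplying the contrast matrix by the coefficient vector; the values below match the glht output and, up to rounding, Table 2.

# The first contrast is -beta_2; the second is -(beta_1 + beta_2)/2.
oMat %*% cMat
##            [,1]
## [1,] -0.7720501
## [2,] -0.4294169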

Multiple Comparisons

Unless the goal is to publish in the Journal of Irreproducible Results, an analyst should adjust for simultaneously testing multiple hypotheses.

The multiple testing problem occurs because the fixed and known probability of falsely rejecting a true null hypothesis (usually \(5\%\)) applies only to a single test; the chance of making at least one false rejection grows with the total number of inferences. For example, across 20 simultaneous tests of true null hypotheses, each at the \(5\%\) level, the expected number of false rejections is one, and the probability of at least one false rejection is \(1 - 0.95^{20} \approx 64\%\).
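
Both quantities are easy to verify directly; a quick sketch:

# Probability of at least one false rejection among 20 independent
# tests of true null hypotheses, each at the 5% level.
1 - (1 - 0.05)^20
## [1] 0.6415141

# Expected number of false rejections.
20 * 0.05
## [1] 1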

There are two commonly used methods for adjusting for multiple comparisons. The first controls the probability of falsely rejecting at least one true null hypothesis among all hypotheses tested, known as the family-wise error rate (FWER) and typified by the Bonferroni correction (Dunn 1961). The second controls the expected proportion of falsely rejected true null hypotheses among all rejected hypotheses, known as the false discovery rate (FDR) (Benjamini and Hochberg 1995).
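
Both corrections are available through R's p.adjust function, which is used to build Table 2 below. As a toy illustration with hypothetical p-values (not from our model): Bonferroni multiplies each p-value by the number of tests, while the Benjamini-Hochberg ("fdr") method scales each p-value by the number of tests divided by its rank.

# Hypothetical unadjusted p-values, for illustration only.
p <- c(0.001, 0.01, 0.04, 0.20)

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p.adjust(p, method = "bonferroni")
## [1] 0.004 0.040 0.160 0.800

# Benjamini-Hochberg: scale each p-value by n/rank, then enforce monotonicity.
p.adjust(p, method = "fdr")
## [1] 0.00400000 0.02000000 0.05333333 0.20000000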

Table 2 shows the inferential results after adjusting for both the FWER and FDR.

resTab <- rbind(cbind(summary(mTest)$test$coefficients, summary(mTest)$test$sigma, summary(mTest)$test$tstat, summary(mTest)$test$pvalues),
                cbind(summary(oTest)$test$coefficients, summary(oTest)$test$sigma, summary(oTest)$test$tstat, summary(oTest)$test$pvalues)
)

rownames(resTab) <- c("$H_0: \\mu_{x = l} = 0$", "$H_0: \\mu_{x = m} = 0$", "$H_0: \\mu_{x = h} = 0$", "$H_0: \\mu_{x = l} - \\mu_{x = h} = 0$", "$H_0: \\mu_{x = l} - \\frac{1}{2}(\\mu_{x = m} + \\mu_{x = h}) = 0$")

resTab <- cbind(resTab, p.adjust(resTab[, 4], method = "bonferroni"), p.adjust(resTab[, 4], method = "fdr"))

colnames(resTab) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)", "FWER Pr(>|t|)", "FDR Pr(>|t|)")

resTab %>%
  kbl(escape = FALSE, caption = "The results for all analyses with inferential results controlling the family-wise error rate (FWER) and false discovery rate (FDR).") %>%
  kable_classic_2(full_width = F)
Table 2: The results for all analyses with inferential results controlling the family-wise error rate (FWER) and false discovery rate (FDR).
Estimate Std. Error t value Pr(>|t|) FWER Pr(>|t|) FDR Pr(>|t|)
\(H_0: \mu_{x = l} = 0\) 0.9951183 0.1680157 5.922769 0.0000002 0.0000008 0.0000003
\(H_0: \mu_{x = m} = 0\) 1.0819019 0.1680157 6.439290 0.0000000 0.0000001 0.0000000
\(H_0: \mu_{x = h} = 0\) 1.7671684 0.1655265 10.676048 0.0000000 0.0000000 0.0000000
\(H_0: \mu_{x = l} - \mu_{x = h} = 0\) -0.7720501 0.2358565 -3.273389 0.0024022 0.0120108 0.0030027
\(H_0: \mu_{x = l} - \frac{1}{2}(\mu_{x = m} + \mu_{x = h}) = 0\) -0.4294169 0.2052714 -2.091947 0.0577379 0.2886896 0.0577379

References

Bache, Stefan Milton, and Hadley Wickham. 2020. Magrittr: A Forward Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1): 289–300.

Dunn, Olive Jean. 1961. “Multiple Comparisons Among Means.” Journal of the American Statistical Association 56 (293): 52–64.

Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Hothorn, Torsten, Frank Bretz, and Peter Westfall. 2008. “Simultaneous Inference in General Parametric Models.” Biometrical Journal 50 (3): 346–63.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2020. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.

———. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2021. DT: A Wrapper of the JavaScript Library 'DataTables'. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2021. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.