When parameters of a general linear model are estimated, analysts often report the main effects. However, sometimes the hypothesis of interest is a linear combination of the main effects that is not displayed by default in a standard regression table. Testing many hypotheses from the same linear model is especially relevant when an analyst fits a model using categorical variables where many post-hoc hypotheses are of interest. This presentation will show how to code a categorical variable for use in the general linear model and use tests of the so-called general linear hypothesis to test any number of hypotheses about the effects (Fox 2008).

This presentation uses the R programming language and assumes the end-user is taking advantage of RStudio IDE to compile their R markdown files into HTML (R Core Team 2020; RStudio Team 2020). All of the files needed to reproduce these results can be downloaded from the Git repository `git clone https://git.waderstats.com/general_linear_hypothesis_application/`

.

The libraries *knitr*, *bookdown*, and *kableExtra* are loaded to generate the HTML output (Xie 2021, 2020; Zhu 2021). The library *magrittr* is loaded so we can use the pipe (%>%) operator (Bache and Wickham 2020). The library *DT* is loaded to display simulated data in an output table (Xie, Cheng, and Tan 2021). Data summaries are facilitated by the *dplyr* library (Wickham et al. 2021). The library *multcomp* is loaded so we can do tests of the general linear hypothesis (Hothorn, Bretz, and Westfall 2008; Fox 2008).

```
package_loader <- function(x, ...) {
if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
library(x, ...)
}
packages <- c("knitr", "bookdown", "kableExtra", "magrittr", "DT", "dplyr", "multcomp")
invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))
```

Suppose we have a set of \(i \in \{1, ... , n\}\) independent and identically distributed normal random variables each denoted by \(y_i\), whose value depends on a categorical variable with three levels, \(x \in \{l, m, h\}\) that correspond to *low*, *medium*, and *high*, respectively. We will further assume that \(E[y_i|x] = \mu_x\) with constant variance \(\sigma^2\). The data generating process is written concisely in equation (1).

\[\begin{equation} \begin{split} y_i|x \sim N(\mu_{x}, \sigma^2) \end{split} \tag{1} \end{equation}\]

Simulating data using this data generating process is straight forward. We will assume \(\mu_{x = l} = 1\), \(\mu_{x = m} = 1\), \(\mu_{x = h} = 2\) with fixed variance \(\sigma^2 = 1\) and generate a single realization for each of \(n = 100\) normal random variables.

```
set.seed(123)
# Create an empty data frame to hold the simulated data.
d <- matrix(nrow = 100, ncol = 2)
colnames(d) <- c("Y", "X")
d <- as.data.frame(d)
# We assume without loss of generality that each of the categories is equally likely to be observed.
X <- sample(x = c("l", "m", "h"), size = 100, replace = TRUE, prob = c(1/3, 1/3, 1/3))
d$X <- X
# Since the distribution of each y_i depends on the value of x, we generate the outcome separately for each possible value.
d[which(d$X == "l"), "Y"] <- rnorm(n = length(which(d$X == "l")), mean = 1, sd = 1)
d[which(d$X == "m"), "Y"] <- rnorm(n = length(which(d$X == "m")), mean = 1, sd = 1)
d[which(d$X == "h"), "Y"] <- rnorm(n = length(which(d$X == "h")), mean = 2, sd = 1)
# The factor function tells R that X is a factor and should be treated as such in downstream analyses.
d$X <- factor(d$X, levels = c("l", "m", "h"))
# We can use the datatable function to display the simulated data.
datatable(d, escape = FALSE,
caption = "This table contains the simulated data from the data generating process. Each value in the Y column is a realization of an independent and identically distributed normal random variable whose expectation is conditional on the value in the X column.",
extensions = c('Buttons', 'KeyTable'),
class = 'cell-border stripe',
rownames = TRUE,
options = list(
dom = 'Bfrtip',
pageLength = 10,
deferRender = TRUE,
responsive = TRUE,
scrollX = TRUE,
scrollCollaspe = TRUE,
paging = TRUE,
autoWidth = FALSE,
keys = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))
```

Table 1 shows the estimated mean and variance for the outcome for each level of the categorical variable in the simulated data set.

```
dSumarries <- d %>%
group_by(X) %>%
summarize(
mean = round(mean(Y, na.rm = TRUE), 2),
variance = round(sd(Y, na.rm = TRUE)^2, 2)
)
dSumarries %>%
kbl(escape = FALSE, caption = "The estimated mean and variance of the outcome for each value of the categorical variable in the simulated data.") %>%
kable_classic_2(full_width = F)
```

X | mean | variance |
---|---|---|

l | 1.00 | 0.80 |

m | 1.08 | 0.78 |

h | 1.77 | 1.20 |

We can write \(E[y|x] = \mu_x\) as a linear combination of a constant and \(x\) by dummy coding \(x\). Dummy coding transforms \(x\) into a set of dichotomous predictors for all but one of the levels which we call the reference category. In the current example, dummy coding says that \(x_{l} = 1\) if in the *low* category and \(x_{l} = 0\) otherwise; \(x_{m} = 1\) if in the *medium* category and \(x_{m} = 0\) otherwise; \(x_{h} = 1\) if in the *high* category and \(x_{h} = 0\) otherwise.

If we treat \(x_l\) as the referent category, the expectation of \(y\) can be written as a linear function of the categorical predictor (equation (2)).

\[\begin{equation} \begin{split} E[y|x] = \alpha + \beta_1 x_{m} + \beta_2 x_{h} \end{split} \tag{2} \end{equation}\]

For the simulated data, the design matrix for the dummy coded categorical variable is shown below.

```
# The model.matrix function extracts the design matrix for a given formula object.
dDM <- model.matrix(as.formula(Y ~ X), data = d)
# We can use the datatable function to display the design matrix for the simulated data.
datatable(dDM, escape = FALSE,
caption = "The design matrix for the simulated data that shows the dummy coding for the categorical variable.",
extensions = c('Buttons', 'KeyTable'),
class = 'cell-border stripe',
rownames = TRUE,
options = list(
dom = 'Bfrtip',
pageLength = 10,
deferRender = TRUE,
responsive = TRUE,
scrollX = TRUE,
scrollCollaspe = TRUE,
paging = TRUE,
autoWidth = FALSE,
keys = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))
```

Under the dummy coding scheme, the mean of \(y\) for each level of the categorical variable can be derived from the linear predictor.

- The mean value of \(y\) when the categorical variable is
*low*is \(\mu_{x=l} = E[y|x_m = 0, x_h = 0] = \alpha\). - The mean value of \(y\) when the categorical variable is
*medium*is \(\mu_{x=m} = E[y|x_m = 1, x_h = 0] = \alpha + \beta_1\). - The mean value of \(y\) when the categorical variable is
*high*is \(\mu_{x=h} = E[y|x_m = 0, x_h = 0] = \alpha + \beta_2\).

The first step in deriving these effects for the current data is to fit a linear model to the data and extract the parameter estimates for the main effects.

```
# Estimate the parameters for the linear model
fit <- lm(Y~X, data = d)
fit
```

```
##
## Call:
## lm(formula = Y ~ X, data = d)
##
## Coefficients:
## (Intercept) Xm Xh
## 0.99512 0.08678 0.77205
```

Next, we create a matrix containing a single row for each linear effect we care about. The linear effect matrix is multiplied by the vector of estimated coefficients for the main effects to get the effects of interest.

```
# Matrix that represents each linear effect we want to estimate.
mMat <- rbind(c(1, 0, 0), c(1, 1, 0), c(1, 0, 1))
mMat
```

```
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 1 0
## [3,] 1 0 1
```

```
# Vector of estiamted coefficients from the linear model
cMat <- matrix(coef(fit), nrow = 3)
cMat
```

```
## [,1]
## [1,] 0.99511828
## [2,] 0.08678364
## [3,] 0.77205010
```

```
# The estimated value of y for each value of the category.
mMat %*% cMat
```

```
## [,1]
## [1,] 0.9951183
## [2,] 1.0819019
## [3,] 1.7671684
```

To test the null hypothesis that any of these estimated effects are zero, we can apply the general linear hypothesis (Fox 2008).

- \(H_0: \alpha = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is
*low*(\(x = l\)). - \(H_0: \alpha + \beta_1 = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is
*medium*(\(x = m\)). - \(H_0: \alpha + \beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) is zero when the categorical variable is
*high*(\(x = h\)).

```
mTest <- glht(model = fit, linfct = mMat)
summary(mTest)
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = Y ~ X, data = d)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 0.9951 0.1680 5.923 1.35e-07 ***
## 2 == 0 1.0819 0.1680 6.439 < 1e-07 ***
## 3 == 0 1.7672 0.1655 10.676 < 1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
```

Specifying linear hypotheses allows us to test any number of effects inferentially.

- \(H_0: \mu_{x=l} - \mu_{x = h} = \alpha - (\alpha + \beta_2) = -\beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) in the referent category
*low*(\(x = l\)) is the same as the mean value of \(y\) in the*high*category (\(x = h\)).

We can get even more fancy and derive the mean value of \(y\) that is marginalized over the *medium* and *high* categories.

- \(H_0: \mu_{x=l} - \frac{(\mu_{x=m} + \mu_{x = h})}{2} = \alpha - \frac{\alpha + \beta_1 + \alpha + \beta_2}{2} = - \frac{1}{2}\beta_1 - \frac{1}{2}\beta_2 = 0\) tests the null hypothesis that the mean value of \(y\) in the referent category
*low*(\(x = l\)) is the same as the mean value of \(y\) with in either the*medium*or*high*categories (\(x = m\) or \(x =l\)).

```
oMat <- rbind(c(0, 0, -1), c(0, -1/2, -1/2))
oMat
```

```
## [,1] [,2] [,3]
## [1,] 0 0.0 -1.0
## [2,] 0 -0.5 -0.5
```

```
oTest <- glht(model = fit, linfct = oMat)
summary(oTest)
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = Y ~ X, data = d)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.7721 0.2359 -3.273 0.0024 **
## 2 == 0 -0.4294 0.2053 -2.092 0.0577 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
```

Unless the goal is to publish in the Journal of Irreproducible Results, an analyst should adjust for simultaneously testing multiple hypotheses.

The multiple testing problem occurs because our fixed and known probability of falsely rejecting a true null hypothesis (usually \(5\%\)) applies only to a single test and increases as a function of the total number of inferences. For example, suppose we do 20 simultaneous tests, each with a \(5\%\) probability of being a false discovery. In that case, we expect at least one of these inferences will result in falsely rejecting a true null hypothesis.

There are two commonly used methods for adjusting for multiple comparisons. The first controls for the probability of falsely rejecting at least one true null hypothesis among all hypotheses tested, also known as the family-wise error rate (FWER), and typified by the Bonferroni correction (Dunn 1961). The second controls the probability of falsely rejecting at least one true null hypothesis among rejected hypotheses, known as the false discovery rate (FDR) (Benjamini and Hochberg 1995).

Table 2 shows the inferential results after adjusting for both the FWER and FDR.

```
resTab <- rbind(cbind(summary(mTest)$test$coefficients, summary(mTest)$test$sigma, summary(mTest)$test$tstat, summary(mTest)$test$pvalues),
cbind(summary(oTest)$test$coefficients, summary(oTest)$test$sigma, summary(oTest)$test$tstat, summary(oTest)$test$pvalues)
)
rownames(resTab) <- c("$H_0: \\mu_{x = l} = 0$", "$H_0: \\mu_{x = m} = 0$", "$H_0: \\mu_{x = h} = 0$", "$H_0: \\mu_{x = l} - \\mu_{x = h} = 0$", "$H_0: \\mu_{x = l} - \\frac{1}{2}(\\mu_{x = m} + \\mu_{x = h}) = 0$")
resTab <- cbind(resTab, p.adjust(resTab[, 4], method = "bonferroni"), p.adjust(resTab[, 4], method = "fdr"))
colnames(resTab) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)", "FWER Pr(>|t|)", "FDR Pr(>|t|)")
resTab %>%
kbl(escape = FALSE, caption = "The results for all analyses with inferential results controlling the family-wise error rate (FWER) and false discovery rate (FDR).") %>%
kable_classic_2(full_width = F)
```

Estimate | Std. Error | t value | Pr(>|t|) | FWER Pr(>|t|) | FDR Pr(>|t|) | |
---|---|---|---|---|---|---|

\(H_0: \mu_{x = l} = 0\) | 0.9951183 | 0.1680157 | 5.922769 | 0.0000002 | 0.0000008 | 0.0000003 |

\(H_0: \mu_{x = m} = 0\) | 1.0819019 | 0.1680157 | 6.439290 | 0.0000000 | 0.0000001 | 0.0000000 |

\(H_0: \mu_{x = h} = 0\) | 1.7671684 | 0.1655265 | 10.676048 | 0.0000000 | 0.0000000 | 0.0000000 |

\(H_0: \mu_{x = l} - \mu_{x = h} = 0\) | -0.7720501 | 0.2358565 | -3.273389 | 0.0024022 | 0.0120108 | 0.0030027 |

\(H_0: \mu_{x = l} - \frac{1}{2}(\mu_{x = m} + \mu_{x = h}) = 0\) | -0.4294169 | 0.2052714 | -2.091947 | 0.0577379 | 0.2886896 | 0.0577379 |

Bache, Stefan Milton, and Hadley Wickham. 2020. https://CRAN.R-project.org/package=magrittr.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” *Journal of the Royal Statistical Society: Series B (Methodological)* 57 (1): 289–300.

Dunn, Olive Jean. 1961. “Multiple Comparisons Among Means.” *Journal of the American Statistical Association* 56 (293): 52–64.

Fox, John. 2008. *Applied Regression Analysis and Generalized Linear Models*. Sage Publications.

Hothorn, Torsten, Frank Bretz, and Peter Westfall. 2008. “Simultaneous Inference in General Parametric Models.” *Biometrical Journal* 50 (3): 346–63.

R Core Team. 2020. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2020. *RStudio: Integrated Development Environment for R*. Boston, MA: RStudio, PBC. http://www.rstudio.com/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. *Dplyr: A Grammar of Data Manipulation*. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2020. *Bookdown: Authoring Books and Technical Documents with R Markdown*. https://github.com/rstudio/bookdown.

———. 2021. *Knitr: A General-Purpose Package for Dynamic Report Generation in R*. https://yihui.org/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2021. *DT: A Wrapper of the Javascript Library ’Datatables’*. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2021. *KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax*. https://CRAN.R-project.org/package=kableExtra.