# 9 Drawing Inferences

You can think of regression as a descriptive statistic or data reduction method—a simple way to summarize trends and relationships in multivariate data. But for better or worse, most social scientists view regression as a tool for hypothesis testing. In this section, we will talk about what it is that we’re doing when we gaze at the “stars” that accompany our regression output, and the math that brings those stars to our computer screens.

## 9.1 The Basics of Hypothesis Testing

Remember the general procedure for hypothesis testing.

Specify a null hypothesis—the “no effect” baseline that (in most cases) you’re hoping to argue against.

Choose a test statistic and significance level.

Derive the sampling distribution that the test statistic would have if the null hypothesis were true.

Calculate the value of the test statistic for our sample.

Compare the sample test statistic to the sampling distribution under the null hypothesis. If the probability of obtaining a result as least as extreme as ours is at or below the significance level, reject the null hypothesis.

Imagine the null hypothesis is true. Given that, imagine 100 labs run independent tests of the null hypothesis, each using a significance level of 0.05. If they follow this general procedure, on average 5 of the labs will reject the null hypothesis, and 95 will fail to reject the null hypothesis.

What if the null hypothesis is false? What percentage of the labs will falsely reject it? That’s the *power* of the test, and it depends on a number of factors: how far off the null hypothesis is, what size sample each lab is drawing, and the significance level.

Before we get into hypothesis tests for regression, let’s refresh ourselves on how we draw inferences from a random sample about the population mean.

Suppose we have a sequence of \(N\) i.i.d. draws of the random variable \(X\), which we will denote \(X_1, \ldots, X_N\), and we are interested in testing against the null hypothesis
\[
H_0 : E[X] = \mu_0.
\]
Let \(\bar{X}\) denote the sample mean and \(S_X\) denote the sample standard
deviation. Define the \(t\) statistic as
\[
t = \frac{\bar{X} - \mu_0}{S_X / \sqrt{N}}.
\]
The denominator of the \(t\) statistic is the *standard error*—our estimate of the standard deviation of the sampling distribution under the null hypothesis. The greater the standard error, the more the statistic varies across samples, and thus the less reliable it is in any given sample. Naturally enough, our standard errors decrease with our sample size; the more data we have, the more reliably we can draw inferences.

If \(X\) is known to be normally distributed—an assumption that, in the realms political scientists deal with, is usually implausible—then the sampling distribution of \(t\) under the null hypothesis is \(t_{N - 1}\), the Student’s \(t\) distribution with \(N - 1\) degrees of freedom.

If \(X\) is not known to be normally distributed, but our sample size is “large” (in practice, \(N \geq 30\)), we can rely on the Central Limit Theorem. As \(N \to \infty\), the distribution of \(t\) under the null hypothesis is approximately \(N(0, 1)\), the normal distribution with mean zero and variance one.

## 9.2 Variance of OLS

Now let us return to the world of the linear model, \[ \mathbf{Y} = \mathbf{X} \beta + \epsilon, \] and the OLS estimator of \(\beta\), \[ \hat{\beta}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}. \] In order to draw inferences on OLS results, the first thing we need to know is the variance of the OLS estimator. You will remember from Stat 1 that the variance of the sample mean, \(\bar{X}\), is \[ V[\bar{X}] = \frac{V[X]}{N}, \] This should look familiar to you from the denominator of the \(t\)-test statistic, where we replace \(V[X]\) here with the sample variance and then take the square root. We will do something similar for OLS.

Throughout this section, we will maintain the following two assumptions on the error term:

Strict exogeneity: \(E[\epsilon \,|\, \mathbf{X}] = \mathbf{0}\).

Spherical errors: \(V[\epsilon \,|\, \mathbf{X}] = \sigma^2 \mathbf{I}_N\), where \(\sigma^2 > 0\).

Without the first assumption, OLS is biased and inconsistent—not a solid way to estimate \(\beta\) in the first place, let alone test hypotheses. Without the second assumption, OLS is unbiased and consistent, but not efficient. Later on, we will discuss how to draw inferences in the presence of non-spherical errors.

The OLS estimator is a \(K \times 1\) vector, so its variance won’t be a single number—it will be a \(K \times K\) matrix, \[ V[\hat{\beta}] = \begin{bmatrix} V[\hat{\beta}_1] & \mathop{\rm Cov}\nolimits[\hat{\beta}_1, \hat{\beta}_2] & \cdots & \mathop{\rm Cov}\nolimits[\hat{\beta}_1, \hat{\beta}_K] \\ \mathop{\rm Cov}\nolimits[\hat{\beta}_2, \hat{\beta}_1] & V[\hat{\beta}_2] & \cdots & \mathop{\rm Cov}\nolimits[\hat{\beta}_2, \hat{\beta}_K] \\ \vdots & \vdots & \ddots & \vdots \\ \mathop{\rm Cov}\nolimits[\hat{\beta}_K, \hat{\beta}_1] & \mathop{\rm Cov}\nolimits[\hat{\beta}_K, \hat{\beta}_2] & \cdots & V[\hat{\beta}_K] \\ \end{bmatrix}. \] Specifically, the variance of the OLS estimator (treating the covariates as fixed) is \[ V[\hat{\beta} \,|\, \mathbf{X}] = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. \] See the appendix to this chapter—or any graduate-level econometrics textbook—for how we derive this result.

If we knew \(\sigma^2\), the variance of the error term, then we could just use the above formula to draw inferences. Realistically, though, we will need to estimate \(\sigma^2\). We will do so using the residual variance, \[ \hat{\sigma}^2 = \frac{\sum_n (Y_n - \mathbf{x}_n \cdot \hat{\beta})^2}{N - K} = \frac{\text{SSE}}{N - K}. \] Why do we divide by \(N - K\)? Remember that when we estimate the variance of a random variable, we divide the squared deviations from the mean by \(N - 1\) to correct for the degree of freedom we used to estimate the sample mean. The resulting estimator is unbiased. Similarly, when estimating the residual variance of a linear regression model, we need to correct for the \(K\) degrees of freedom we used to estimate the model coefficients. Hence we must divide by \(N - K\) in order for \(\hat{\sigma}^2\) to be unbiased.

Under the spherical error assumption, our estimate of the variance of OLS will therefore be the \(K \times K\) matrix
\[
\hat{\Sigma} = \hat{\sigma}^2 (\mathbf{X}^\top \mathbf{X})^{-1}.
\]
A very important note. If the errors are not spherical, this formula produces a *biased* and *inconsistent* estimate of the sampling variability of OLS. This is true even though the OLS estimate itself is unbiased and consistent. In other words, if we use OLS in the presence of non-spherical errors:

Our coefficient estimates will not be systematically biased away from the population parameter, and the probability of our estimate being any meaningful distance away from the population parameter goes to zero as our sample size increases.

Our hypothesis tests will

**not**perform as advertised—typically, they will lead us to reject the null hypothesis more often than we should—and this problem does**not**go away as our sample size increases.

For the remainder of this section, we will proceed under the assumption of spherical errors. Later on, we will discuss how to draw inferences appropriately when this assumption fails to hold.

## 9.3 Single Variable Hypotheses

Consider a null hypothesis about the population value of a single coefficient, of the form \[ H_0 : \beta_k = b, \] where \(b\) is a fixed constant. Usually, though not always, political scientists concern themselves with null hypotheses of the form \(\beta_k = 0\); i.e., the \(k\)’th variable has zero (so-called) marginal effect on the response.

We will test this hypothesis using the familiar \(t\)-statistic. The *estimated standard error* of \(\hat{\beta}_k\) is
\[
\mathop{\rm SE}\nolimits(\hat{\beta}_k) = \sqrt{\hat{\Sigma}_{kk}},
\]
where \(\hat{\Sigma}_{kk}\) denotes the \(k\)’th element of the diagonal of \(\hat{\Sigma} = \hat{\sigma}^2 (\mathbf{X}^\top \mathbf{X})^{-1}\). The “standard errors” that appear alongside your regression output are calculating by taking the square root of the diagonal of this matrix.

The \(t\) statistic for the test of the null hypothesis \(H_0\) is in the familiar “estimate divided by standard error” form, \[ t = \frac{\hat{\beta}_k - b}{\mathop{\rm SE}\nolimits(\hat{\beta}_k)} = \frac{\hat{\beta}_k - b}{\sqrt{\hat{\Sigma}_{kk}}}. \] Our mode of inference from there depends on the sample size and on whether we are willing to make a normality assumption.

If the error term, \(\epsilon\), is normally distributed, then the sampling distribution of our test statistic is \(t_{N - K}\), the \(t\) distribution with \(N - K\) degrees of freedom.

^{23}As our sample size grows large, the sampling distribution of our test statistic is approximately \(N(0, 1)\), the normal distribution with mean zero and variance one. This follows from the Central Limit Theorem, and it holds even if the error term in our model is

*not*normally distributed.

So if you have a small sample, the validity of the standard hypothesis test depends on a normality assumption that may or may not be palatable, depending on the circumstances. Other techniques exist for this situation, but they are beyond the scope of this course. Of course, if your sample is so small that you need to resort to non-standard hypothesis testing techniques in order to draw appropriate inferences, you should probably go back to the drawing board on your study design.

Regardless of your sample size, regression software will compare your test statistics to \(t_{N - K}\) to calculate \(p\)-values and the results of hypothesis tests. This is innocuous even if you don’t assume normality, since if \(N\) is large the \(t_{N - K}\) distribution is approximately the same as the \(N(0, 1)\) distribution (with infinitesimally fatter tails).

Our method of constructing confidence intervals for a single parameter is also analogous to what we do with the sample mean. Let \(z_{\alpha}\) be the critical value of the sampling distribution of our test statistic for our chosen significance level \(\alpha\). For example, the critical value of \(N(0, 1)\) for significance \(\alpha = 0.05\) is \(z_{\alpha} = 1.96\). Then the \((1 - \alpha)\)-confidence interval around \(\hat{\beta}_k\) is \[ \mathop{\rm CI}\nolimits_{1 - \alpha}(\hat{\beta}_k) = [\hat{\beta}_k - z_{\alpha} \mathop{\rm SE}\nolimits(\hat{\beta}_k), \hat{\beta}_k + z_{\alpha} \mathop{\rm SE}\nolimits(\hat{\beta}_k)]. \]

## 9.4 Multiple Variable Hypotheses

It is common, especially (though not exclusively) when working with categorical variables or higher-order terms, to have hypotheses involving multiple variables. For example, think of our model from our study of categorical variables, \[ \text{Trump}_n = \beta_1 + \beta_2 \text{Independent}_n + \beta_3 \text{Democratic}_n + \beta_4 \text{Age}_n + \epsilon_n. \] Remember that \(\beta_2\) denotes the expected difference between Independents and Republicans (the omitted category) of the same age in their propensity to vote for Trump, and \(\beta_3\) denotes the expected difference between Democrats and Republicans of the same age.

If our null hypothesis were that Independents and Republicans of the same age had the same chance of voting for Trump, we would state that as \[ H_0 : \beta_2 = 0. \] But what about the null hypothesis were that Independents and Democrats had the same chance of voting for Trump? We would have to phrase that in terms of multiple coefficients, \[ H_0 : \beta_2 = \beta_3, \] or equivalently, \[ H_0 : \beta_2 - \beta_3 = 0. \] Or what if our null hypothesis were that party identification made no difference at all? That would mean Independents and Democrats are both no different than Republicans on average, or in our model notation, \[ H_0 : \left\{ \begin{aligned} \beta_2 &= 0, \\ \beta_3 &= 0. \end{aligned} \right. \]

Each of these, including the simple single-variable hypothesis, is a linear system in \(\beta\). In other words, we can write each of these hypotheses in the form \[ H_0 : \mathbf{R} \beta - \mathbf{c} = \mathbf{0}, \] where \(\mathbf{R}\) is a fixed \(r \times K\) matrix (where \(r\) is the number of restrictions we intend to test) and \(\mathbf{c}\) is a fixed \(r \times 1\) vector.

Perhaps the easiest way to test a hypothesis of this form is the *Wald test*. We form the test statistic
\[
W = (\mathbf{R} \beta - \mathbf{c})^\top (\mathbf{R} \hat{\Sigma} \mathbf{R}^\top)^{-1} (\mathbf{R} \beta - \mathbf{c}),
\]
which, despite all the matrices involved, works out to be a scalar. Under \(H_0\), the asymptotic sampling distribution of \(W\) is \(\chi^2_r\), the chi-squared distribution with \(r\) degrees of freedom.^{24} Regression software like R will usually report an \(F\) statistic, since the exact (not asymptotic) distribution of \(W\) follows an \(F\) distribution in the special case of normal residual error.

The Wald test is not just an aggregation of the individual \(t\) tests of the coefficients. Two coefficients might each individually be statistically insignificant, yet the Wald test may lead us to reject the null hypothesis that both are zero. Conversely, one of a group of coefficients might be statistically significant, and yet the Wald test may not have us reject the null hypothesis that all are zero.

We already saw a couple of examples of how to use the Wald test with a model with a categorical variable. Let’s also quickly consider its use in some other models with less-common specifications.

Imagine an interactive model,
\[
Y_n = \beta_1 + \beta_2 X_n + \beta_3 Z_n + \beta_4 (X_n \times Z_n) + \epsilon_n.
\]
The interaction term captures how the (so-called) marginal effect of \(X_n\) depends on the value of \(Z_n\), and vice versa. If your null hypothesis is that the marginal effect of \(X_n\) does not depend on \(Z_n\), you could use perform a \(t\) test of
\[
H_0 : \beta_4 = 0.
\]
But what if your null hypothesis is that the marginal effect of \(X_n\) is *always* zero, regardless of the value of \(Z_n\)? Some people make the unfortunate mistake of testing this via the null hypothesis
\[
H_0 : \beta_2 = 0,
\]
but that only means the marginal effect of \(X_n\) is zero *when \(Z_n = 0\)*. What you want is the composite null
\[
H_0 : \left\{ \begin{aligned}
\beta_2 &= 0, \\
\beta_4 &= 0,
\end{aligned} \right.
\]
or, in matrix form,
\[
\begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4
\end{bmatrix}
=
\begin{bmatrix}
0 \\ 0
\end{bmatrix}.
\]

Similarly, think of the quadratic model,
\[
Y_n = \beta_1 + \beta_2 X_n + \beta_3 X_n^2 + \epsilon_n.
\]
The null hypothesis that the (so-called) marginal effect of \(X_n\) is constant is equivalent to
\[
H_0 : \beta_3 = 0.
\]
But if we wanted to test against the null hypothesis that the marginal effect of \(X_n\) is *always* zero, we would have to use a composite null,
\[
H_0 : \left\{ \begin{aligned}
\beta_2 &= 0, \\
\beta_3 &= 0.
\end{aligned} \right.
\]

## 9.5 Appendix: Full Derivation of OLS Variance

We will assume strict exogeneity, \[ E [\epsilon \,|\, \mathbf{X}] = \mathbf{0}, \] and spherical errors, \[ V [\epsilon \,|\, \mathbf{X}] = E [\epsilon \epsilon^\top \,|\, \mathbf{X}] = \sigma^2 \mathbf{I}. \] A useful thing to know is that since \(\mathbf{X}^\top \mathbf{X}\) is symmetric, so is its inverse: \[ [(\mathbf{X}^\top \mathbf{X})^{-1}]^\top = (\mathbf{X}^\top \mathbf{X})^{-1}. \]

Let’s start by deriving the variance from our formula for a vector-valued random variable, \[ V[C] = E \left[ (C - E[C]) (C - E[C])^\top \right]. \] For the OLS estimator \(\hat{\beta}\), we have \[ \begin{aligned} V[\hat{\beta} \,|\, \mathbf{X}] &= E \left[ \left(\hat{\beta} - E[\hat{\beta}]\right) \left(\hat{\beta} - E[\hat{\beta}]\right)^\top \,|\, \mathbf{X} \right] \\ &= E \left[ \left(\hat{\beta} - \beta\right) \left(\hat{\beta} - \beta\right)^\top \,|\, \mathbf{X} \right] \\ &= E \left[ \hat{\beta} \hat{\beta}^\top - 2 \beta \hat{\beta}^\top + \beta \beta^\top \,|\, \mathbf{X} \right] \\ &= E \left[ \hat{\beta} \hat{\beta}^\top \,|\, \mathbf{X} \right] - 2 \beta E \left[ \hat{\beta}^\top \,|\, \mathbf{X} \right] + \beta \beta^\top \\ &= E \left[ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \,|\, \mathbf{X} \right] - \beta \beta^\top \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top E \left[ \mathbf{Y} \mathbf{Y}^\top \,|\, \mathbf{X} \right] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} - \beta \beta^\top. \end{aligned} \] Since \(\mathbf{Y} = \mathbf{X} \beta + \epsilon\), we have \[ \begin{aligned} E \left[ \mathbf{Y} \mathbf{Y}^\top \,|\, \mathbf{X} \right] &= E \left[ (\mathbf{X} \beta + \epsilon) (\mathbf{X} \beta + \epsilon)^\top \,|\, \mathbf{X} \right] \\ &= E \left[ \mathbf{X} \beta \beta^\top \mathbf{X}^\top + 2 \epsilon \beta^\top \mathbf{X}^\top + \epsilon \epsilon^\top \,|\, \mathbf{X} \right] \\ &= \mathbf{X} \beta \beta^\top \mathbf{X}^\top + 2 \underbrace{E[\epsilon \,|\, \mathbf{X}]}_{= \mathbf{0}} \beta^\top \mathbf{X}^\top + E[\epsilon \epsilon^\top \,|\, \mathbf{X}] \\ &= \mathbf{X} \beta \beta^\top \mathbf{X}^\top + \sigma^2 \mathbf{I}. \end{aligned} \] Continuing from above, we have \[ \begin{aligned} V[\hat{\beta} \,|\, \mathbf{X}] &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top E \left[ \mathbf{Y} \mathbf{Y}^\top \,|\, \mathbf{X} \right] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} - \beta \beta^\top \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top ( \mathbf{X} \beta \beta^\top \mathbf{X}^\top + \sigma^2 \mathbf{I} ) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} - \beta \beta^\top \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} (\mathbf{X}^\top \mathbf{X}) \beta \beta^\top (\mathbf{X}^\top \mathbf{X}) (\mathbf{X}^\top \mathbf{X})^{-1} \\ &\quad + \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} - \beta \beta^\top \\ &= \beta \beta^\top + \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} - \beta \beta^\top \\ &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \]

A slightly easier way to get there would be to apply two of the helpful properties of variance. You’ll remember from the study of scalar-valued random variables that, for a random variable \(A\) and scalars \(c\) and \(d\), \[ \begin{aligned} V[c A] &= c^2 V[A], \\ V[A + d] &= V[A]. \end{aligned} \] Similarly, for an \(m \times 1\) vector random variable \(\mathbf{A}\), a fixed \(n \times m\) matrix \(\mathbf{C}\), and a fixed \(m \times 1\) vector \(\mathbf{d}\), we have \[ \begin{aligned} V[\mathbf{C} \mathbf{A}] &= \mathbf{C} V[\mathbf{A}] \mathbf{C}^\top, \\ V[\mathbf{A} + \mathbf{d}] &= V[\mathbf{A}]. \end{aligned} \] Consequently, \[ \begin{aligned} V \left[ \hat{\beta} \,|\, \mathbf{X} \right] &= V[ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \,|\, \mathbf{X} ] \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top V [\mathbf{Y} \,|\, \mathbf{X}] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top V [\mathbf{X} \beta + \epsilon \,|\, \mathbf{X}] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top V [\epsilon \,|\, \mathbf{X}] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\sigma^2 \mathbf{I}) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \]

## 9.6 Appendix: Regression Inference in R

As usual, we will rely on the **tidyverse** and **broom** packages. We will also be using the **car** package, not only for data but also for its hypothesis testing functions.

Using the `Prestige`

data, let us regress occupational prestige on the type of occupation (blue collar, white collar, or professional) and its average income and education.

```
data("Prestige", package = "carData")
fit <- lm(prestige ~ type + education + income, data = Prestige)
```

`summary()`

prints the “regression table” containing the following information:

- Estimate of each coefficient.
- Estimated standard error of each coefficient estimate.
- \(t\) statistic for each coefficient estimate, for the test against the null hypothesis that the population value of the corresponding coefficient is zero (\(H_0 : \beta_k = 0\)).
- \(p\) value (two-tailed)
^{25}for the aforementioned hypothesis test.

```
##
## Call:
## lm(formula = prestige ~ type + education + income, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9529 -4.4486 0.1678 5.0566 18.6320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6229292 5.2275255 -0.119 0.905
## typeprof 6.0389707 3.8668551 1.562 0.122
## typewc -2.7372307 2.5139324 -1.089 0.279
## education 3.6731661 0.6405016 5.735 1.21e-07 ***
## income 0.0010132 0.0002209 4.586 1.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.095 on 93 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8278
## F-statistic: 117.5 on 4 and 93 DF, p-value: < 2.2e-16
```

This is useful for a quick check on the output, but not so much if you want to use the standard errors for further calculations (e.g., making a plot). To extract the standard errors, use the `tidy()`

function from **broom**. This returns a data frame whose columns correspond to the `summary()`

output.

```
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.623 5.23 -0.119 0.905
## 2 typeprof 6.04 3.87 1.56 0.122
## 3 typewc -2.74 2.51 -1.09 0.279
## 4 education 3.67 0.641 5.73 0.000000121
## 5 income 0.00101 0.000221 4.59 0.0000140
```

`## [1] 5.2275254877 3.8668550979 2.5139324035 0.6405016204 0.0002209185`

To extract the full (estimated) variance matrix, use the `vcov()`

function.

```
## (Intercept) typeprof typewc education
## (Intercept) 2.732702e+01 1.663699e+01 7.3984873320 -3.196460e+00
## typeprof 1.663699e+01 1.495257e+01 6.7970730800 -2.123883e+00
## typewc 7.398487e+00 6.797073e+00 6.3198561295 -1.106184e+00
## education -3.196460e+00 -2.123883e+00 -1.1061842075 4.102423e-01
## income 9.996273e-05 -4.984256e-06 0.0001310818 -4.333455e-05
## income
## (Intercept) 9.996273e-05
## typeprof -4.984256e-06
## typewc 1.310818e-04
## education -4.333455e-05
## income 4.880497e-08
```

Luckily, you shouldn’t often need to extract the individual standard errors or the variance matrix. R has convenience functions to perform most of the calculations you would care about.

Just to see how everything works, let’s confirm that this gives us the same thing as our formula, \(V[\hat{\beta}_{\text{OLS}} \,|\, \mathbf{X}] = \hat{\sigma}^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).

```
X <- model.matrix(fit)
SSE <- sum(residuals(fit)^2)
sigma2_hat <- SSE / (nrow(X) - ncol(X))
vcov_est <- sigma2_hat * solve(t(X) %*% X)
all.equal(vcov_est, vcov(fit))
```

`## [1] TRUE`

To obtain confidence intervals for the regression coefficients, use the `confint()`

function.

```
## 2.5 % 97.5 %
## (Intercept) -1.100376e+01 9.757900433
## typeprof -1.639837e+00 13.717778520
## typewc -7.729402e+00 2.254940810
## education 2.401257e+00 4.945075332
## income 5.744929e-04 0.001451893
```

```
## 0.5 % 99.5 %
## (Intercept) -1.436992e+01 13.12406265
## typeprof -4.129823e+00 16.20776380
## typewc -9.348200e+00 3.87373813
## education 1.988818e+00 5.35751375
## income 4.322368e-04 0.00159415
```

```
## 2.5 % 97.5 %
## education 2.401257 4.945075
```

You may also be interested in testing against null hypotheses other than each individual coefficient being zero. This is where the `linearHypothesis()`

function from the car package comes in. For example, suppose we wanted to test against the null hypothesis that the population coefficient on education is 3.

```
## Linear hypothesis test
##
## Hypothesis:
## education = 3
##
## Model 1: restricted model
## Model 2: prestige ~ type + education + income
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 94 4736.9
## 2 93 4681.3 1 55.601 1.1046 0.296
```

We care about the last two columns of the bottom row. `F`

gives us the test statistic,^{26} and `Pr(>F)`

gives us the associated \(p\)-value. Here our \(p\)-value is 0.3, so we wouldn’t reject the null hypothesis that the population coefficient on education is 3.

Two quick notes on the `linearHypothesis()`

function:

Make sure to place your hypothesis in quotes. (Or if you have multiple hypotheses, that you use a vector of quoted strings.) The function will not work if you run something like

`linearHypothesis(fit, education = 3)`

.Make sure the name of the coefficient(s) you’re testing are exactly the same as in the regression output. This requires particular care when you’re dealing with factor variables, interactions, or quadratic terms.

Of course, for a univariate hypothesis test like this one, we could have just used the confidence interval to figure out the answer. The real value of `linearHypothesis()`

comes in simultaneously testing hypotheses about multiple coefficients—i.e., the Wald test.

For example, let’s test the null hypothesis that the population coefficients on white-collar and professional are the same.

```
## Linear hypothesis test
##
## Hypothesis:
## - typeprof + typewc = 0
##
## Model 1: restricted model
## Model 2: prestige ~ type + education + income
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 94 5186.2
## 2 93 4681.3 1 504.93 10.031 0.002082 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We would reject this null hypothesis except under particularly stringent significance levels (less than 0.002).

What about the composite hypothesis that the population coefficients on white-collar and professional both equal zero? To test this, we pass a *vector* of hypotheses.

```
## Linear hypothesis test
##
## Hypothesis:
## typewc = 0
## typeprof = 0
##
## Model 1: restricted model
## Model 2: prestige ~ type + education + income
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 95 5272.4
## 2 93 4681.3 2 591.16 5.8721 0.003966 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The \(p\)-value corresponding to this hypothesis test is 0.004. This illustrates a key feature of composite hypothesis tests. You’ll remember from the original regression output that neither of the occupational indicators were significant on their own.

```
##
## Call:
## lm(formula = prestige ~ type + education + income, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9529 -4.4486 0.1678 5.0566 18.6320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6229292 5.2275255 -0.119 0.905
## typeprof 6.0389707 3.8668551 1.562 0.122
## typewc -2.7372307 2.5139324 -1.089 0.279
## education 3.6731661 0.6405016 5.735 1.21e-07 ***
## income 0.0010132 0.0002209 4.586 1.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.095 on 93 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8278
## F-statistic: 117.5 on 4 and 93 DF, p-value: < 2.2e-16
```

So, operating at conventional significant levels, we would:

- Fail to reject the null that the coefficient on professional is zero
- Fail to reject the null that the coefficient on white-collar is zero
- Reject the null that the coefficients on both are zero

This seems contradictory, but it just illustrates the limits of drawing inferences from a finite sample. What the results of these hypothesis tests are telling us is that we have enough information to conclude that there are differences in prestige across occupational categories, holding average education and income fixed. However, we do not have enough information to identify exactly where those differences are coming—i.e., exactly which of the two categories it is that differs from the omitted baseline.

While I personally prefer to work with large samples and rely on asymptotics to justify my hypothesis tests, there is

*some*reasonable justification for assuming normality of the error term. Remember that \(\epsilon\) can be thought of as the sum of all the other influences on \(\mathbf{Y}\) that aren’t controlled for in the model (i.e., included as a column in \(\mathbf{X}\)). If there are a lot of these other influences, then the Central Limit Theorem implies that their sum is approximately normally distributed.↩︎You will notice that, for a single-variable hypothesis \(H_0 : \beta_k = b\), the Wald statistic reduces to the square of the \(t\) statistic. Since the asymptotic distribution of the \(t\) statistic is standard normal under the null hypothesis, it follows that the asymptotic distribution of its square is \(\chi^2_1\) under the null hypothesis.↩︎

There is a theoretical statistical justification for using one-tailed tests when your alternative hypothesis is directional, but in practice I would stick with two-tailed tests unless you have publicly preregistered your directional hypotheses.↩︎

Why an \(F\) statistic instead of a \(t\) statistic? If the random variable \(Z\) has a \(t\) distribution with \(n\) degrees of freedom, then \(Z^2\) has an \(F\) distribution with \(1,n\) degrees of freedom. The \(F\) statistic generalizes better to the case of multiple hypotheses.↩︎