# 12 Clustered and Panel Data

Grouped data structures, in which we observe individual units within larger groups, are common in political science and other social sciences. Examples include:

- Cross-national survey data, where we observe individual respondents grouped by country.
- Block-randomized field experiments. For example, in an experiment where the treatment is administered at the village level, we observe individual outcomes grouped by village.
- Panel data, where we observe the same units repeatedly over time. This includes panel surveys as well as observational data on states, countries, or other political units over time.

Grouped data presents problems and opportunities. At the root of both is the idea of *unobserved heterogeneity*—that some variation across groups might be due to unobservable features of the groups.

To return to our running example, suppose you are interested in the correlates of voting for Donald Trump in the 2016 general election. You observe survey respondents grouped by state. Vote choice might be affected by:

- *Individual* characteristics, like one’s age, gender, race, income, and education.
- *State* characteristics,^{29} like the unemployment rate, undocumented population, and exposure to trade with China.

Some state characteristics that might affect vote choice are difficult or impossible to measure. For example, some states have a more cosmopolitan culture, while others have a more traditional culture. A 50-year-old white man living in Connecticut is, we would expect, less likely to have voted for Trump than a 50-year-old white man living in Alabama. This is *unobserved heterogeneity*—characteristics that we do not observe, and therefore cannot control for, but which affect the response.

If we are not careful in dealing with grouped data, unobserved heterogeneity can be a major problem. The spherical errors assumption is usually not tenable for grouped data, since observations will be correlated with others in the same group. OLS will therefore be inefficient and yield invalid inferences. Even worse, if group-level sources of unobserved heterogeneity are correlated with individual characteristics—if, say, younger voters are more likely to live in cosmopolitan states—OLS may also be biased and inconsistent.

On the other hand, if we deal with grouped data properly, we can enhance the credibility of our inferences. We can eliminate the influence of variation *across* groups, allowing us to focus on the comparisons *within* groups that we are usually most interested in.

Before we get started, a note of caution. We only have enough time to scratch the surface of how to analyze grouped data. If you work with grouped data in your dissertation or other research, you should think carefully about your data structure and potential sources of unobserved heterogeneity. The methods we discuss this week may not solve your problems. As further references, I recommend Wooldridge (2002) and Gelman and Hill (2006).

## 12.1 The Linear Model with Grouped Data

We need to change our notation a bit to reflect the arrangement of observations into groups. Let there be \(G\) groups indexed by \(g = 1, \ldots, G\). Within each group, we have \(N\) observations indexed by \(n = 1, \ldots, N\).^{30} (In the special case of panel data, where each group is a unit observed over time, the standard notation is to use \(t = 1, \ldots, T\) instead.)

We will now index individual “rows” of the data by \(gn\), which stands for the \(n\)’th observation within group \(g\).

- Unit-level
  - \(Y_{gn}\): response for the \(gn\)’th observation
  - \(\mathbf{x}_{gn}\): vector of \(K\) covariates for the \(gn\)’th observation
  - \(\epsilon_{gn}\): random shock to the \(gn\)’th response
- Group-level
  - \(\mathbf{Y}_g\): vector of \(N\) responses for the \(g\)’th group
  - \(\mathbf{X}_g\): \(N \times K\) matrix of covariates for the \(g\)’th group
  - \(\epsilon_g\): vector of \(N\) random shocks for the \(g\)’th group
- Full data
  - \(\mathbf{Y}\): vector of all \(GN\) responses, where \[\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \vdots \\ \mathbf{Y}_G \end{bmatrix}.\]
  - \(\mathbf{X}\): \(GN \times K\) matrix of covariates, where \[\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_G \end{bmatrix}.\]
  - \(\mathbf{D}\): \(GN \times G\) matrix of group membership indicators
  - \(\epsilon\): vector of all \(GN\) random shocks

If we assume a standard linear model for each \(gn\)’th response, \[ Y_{gn} = \mathbf{x}_{gn} \cdot \beta + \epsilon_{gn}, \] we end up with the familiar matrix equation \[ \mathbf{Y} = \mathbf{X} \beta + \epsilon. \]

You may look at this equation and think, “This looks like something we’ve used OLS on. What’s wrong with OLS?” There are two reasons. First, at a minimum, we are unlikely to have spherical errors in grouped data. Observations in the same group—students in the same classroom, voters in the same state, LAPOP respondents in the same country—are likely to have correlated errors, because unmeasured factors commonly affect their responses. As in any non-spherical error model, this means OLS will yield invalid inferences. Moreover, because we are dealing with autocorrelation and not just heteroskedasticity, White’s heteroskedasticity-consistent standard errors won’t do the trick.

Second, depending on the nature of the group-specific effects, we may also have a failure of strict exogeneity. I hope I have impressed on you by now that this is a major problem. To see why we have a failure of strict exogeneity, let us return to the example of voters living in states with cosmopolitan versus traditional cultures. If we cannot measure cosmopolitanism (as I am assuming), then it ends up in the error term of our model: \[ \epsilon_{gn} = \text{Cosmopolitanism}_g + \text{Other Stuff}_{gn}. \] But we know that younger and more educated voters are more likely to live in cosmopolitan areas. So if our covariate matrix includes age and education, we have \[ E[\epsilon \,|\, \mathbf{X}] \neq \mathbf{0}. \]

We will proceed from the easiest problems to the hardest. We will first consider a standard error correction and an efficiency improvement for the case where errors are correlated within groups but strict exogeneity still holds. We will then identify an unbiased estimator in the case where strict exogeneity fails.

## 12.2 Autocorrelation within Groups

Imagine the following policy experiment. The federal government randomly selects half of the states to receive a grant intended to improve high school education, while the other half do not receive it. We observe some indicator of educational quality (e.g., graduation rates) at the school district level, where school districts are grouped within states. So our model looks like \[ \text{Quality}_{gn} = \beta_1 + \beta_2 \text{Grant}_g + \epsilon_{gn}. \] We want to know whether receiving the grant affected quality. What we cannot measure is how well the states used the money. If some states used it better than others, then we would expect the error term to be correlated across school districts within the state. In a state where the money was used wisely, we would expect most of the schools to do “better than expected”—to have positive residuals, in the language of regression. Conversely, in a state where the money was squandered, we would expect most of the schools to do worse than we would otherwise predict.

This is one example of the general phenomenon Moulton (1986, 1990) identifies—that there is often substantial correlation in the random errors within groups, especially when we are looking at the effects of variables that only vary at the group level. One way I think about it is that, with grouped data and group-level covariates, the effective number of observations is less than the nominal number of observations. If we use the OLS standard errors, we are pretending to have more data than we really do.

Let’s see a bit more formally why autocorrelation within groups would cause us to overstate the effective sample size. Imagine we have \(N = 4\) observations organized into \(G = 2\) groups, with the groups being the first two and last two observations respectively. Furthermore, let’s assume we have a single covariate that varies across groups but not within them (e.g., a random treatment assigned at the group level), so that our data matrix looks like \[ \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix}. \] Finally, assume the errors are homoskedastic but are autocorrelated within groups:

- \(V[\epsilon_n] = \sigma^2\) for all \(n\).
- \(\mathop{\rm Cor}\nolimits[\epsilon_1, \epsilon_2] = \mathop{\rm Cor}\nolimits[\epsilon_3, \epsilon_4] = \rho\).
- \(\mathop{\rm Cor}\nolimits[\epsilon_n, \epsilon_m] = 0\) for all other \(n, m\) pairs.

Then we can write the variance matrix of the error term as \(V[\epsilon \,|\, \mathbf{X}] = \sigma^2 \Omega\), where \[ \Omega = \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix}. \] It turns out that the variance-covariance matrix of the OLS estimator in this case is \[ V[\hat{\beta} \,|\, \mathbf{X}] = \sigma^2 (1 + \rho) (\mathbf{X}^\top \mathbf{X})^{-1}. \] Notice that this is a factor of \(1 + \rho\) larger than the naive OLS formula—the one that is only valid when the spherical errors assumption is satisfied. (See the appendix to this chapter for how we derive this formula.) Therefore, if we used the regular OLS standard errors for our model, the standard error for each coefficient would be too small by a factor of \(\sqrt{1 + \rho}\).

This is a special case of the *Moulton factor* (see MHE 8.2.1).
If all groups have \(m\) observations, the covariate varies only at the group level, errors are homoskedastic, and the within-group correlation is \(\rho\) for all groups, then the OLS standard error calculation will underestimate the true standard errors by a factor of \(\sqrt{1 + (m - 1) \rho}\).
If for some reason we were certain the homoskedasticity assumption held *and* we knew \(\rho\), then we could simply correct the OLS standard errors by applying this formula.
More broadly, if we knew the exact structure of heteroskedasticity and autocorrelation, we could obtain an efficient estimate of \(\beta\) and proper standard errors by using GLS.
Usually, however, this option will not be available to us.
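To get a sense of the magnitudes involved, here is a minimal R sketch of the correction; `moulton_factor` is just an illustrative name, not a function from any package.

```r
## Factor by which naive OLS standard errors understate the truth when
## every group has m members and the within-group error correlation is rho
moulton_factor <- function(m, rho) sqrt(1 + (m - 1) * rho)

moulton_factor(m = 2, rho = 0.5)    # small groups: modest inflation (~1.22)
moulton_factor(m = 50, rho = 0.05)  # large groups: big inflation even for small rho (~1.86)
```

Note how quickly the factor grows with group size: with 50 observations per group, even a within-group correlation of 0.05 nearly doubles the correct standard errors.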

## 12.3 Clustered Standard Errors

Luckily, we can correct for “clustered” errors in a manner similar to what we did when encountering heteroskedasticity of unknown form. The weakest assumptions we can impose on \(\Omega = V[\epsilon \,|\, \mathbf{X}]\) are:

- Heteroskedasticity of unknown form, within and across groups.
- Autocorrelation of unknown form within groups.
- *No* autocorrelation across groups.

Under these assumptions, \(\Omega\) has the block-diagonal form \[ \Omega = \begin{bmatrix} \Omega_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \Omega_2 & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \Omega_G \end{bmatrix}, \] where each \(\Omega_g\) is a symmetric \(N \times N\) matrix. These may differ from each other—each group may have its own special autocorrelation structure. This is more general than the assumptions under which we obtained the Moulton factor above.

I’ll illustrate the simplest nontrivial example of a residual variance matrix with this structure. Imagine there are \(G = 2\) groups with \(N = 3\) members each. Let \(\sigma_n^2\) denote the residual variance for each observation \(n\), and let \(\rho_{nm}\) denote the autocorrelation between observations \(n\) and \(m\). By definition of covariance, \(\text{Cov} [\epsilon_n, \epsilon_m] = \sigma_n \sigma_m \rho_{nm}\). Therefore, under the assumptions above, the residual variance matrix is \[ \Omega = \begin{bmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho_{12} & \sigma_1 \sigma_3 \rho_{13} & 0 & 0 & 0 \\ \sigma_1 \sigma_2 \rho_{12} & \sigma_2^2 & \sigma_2 \sigma_3 \rho_{23} & 0 & 0 & 0 \\ \sigma_1 \sigma_3 \rho_{13} & \sigma_2 \sigma_3 \rho_{23} & \sigma_3^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & \sigma_4^2 & \sigma_4 \sigma_5 \rho_{45} & \sigma_4 \sigma_6 \rho_{46} \\ 0 & 0 & 0 & \sigma_4 \sigma_5 \rho_{45} & \sigma_5^2 & \sigma_5 \sigma_6 \rho_{56} \\ 0 & 0 & 0 & \sigma_4 \sigma_6 \rho_{46} & \sigma_5 \sigma_6 \rho_{56} & \sigma_6^2 \end{bmatrix} \] Here you can see the “block diagonal” structure, with the top-right and bottom-left (cross-group covariances) being filled with zeroes.
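To make this concrete, here is a sketch in R that builds the \(6 \times 6\) matrix above from per-observation standard deviations and within-group correlations. The numerical values and the helper name `make_block` are made up for illustration.

```r
## Made-up standard deviations for the six observations (two groups of three)
sigma <- c(1, 2, 1.5, 1, 1, 2)

## Build one group's covariance block from its sds and correlations.
## rho = c(rho_12, rho_13, rho_23) for a three-member group.
make_block <- function(s, rho) {
  R <- diag(length(s))
  R[lower.tri(R)] <- rho
  R[upper.tri(R)] <- t(R)[upper.tri(R)]
  diag(s) %*% R %*% diag(s)   # Cov[e_n, e_m] = s_n s_m rho_nm
}

## Assemble the block-diagonal Omega: zeroes off the diagonal blocks
Omega <- matrix(0, 6, 6)
Omega[1:3, 1:3] <- make_block(sigma[1:3], rho = c(0.2, 0.3, 0.4))
Omega[4:6, 4:6] <- make_block(sigma[4:6], rho = c(0.1, 0.2, 0.3))
```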

If we knew \(\Omega\) exactly, we could use GLS to efficiently estimate \(\beta\) and obtain correct standard errors. That is unlikely. Instead, we will use OLS to obtain our coefficient estimates, then we will correct the standard errors so as to be approximately correct in large samples. The *cluster-robust variance estimator* is
\[
\hat{\Sigma}_{\text{CR}} =
(\mathbf{X}^\top \mathbf{X})^{-1}
\left( \sum_{g=1}^G \mathbf{X}_g^\top \hat{e}_g \hat{e}_g^\top \mathbf{X}_g \right)
(\mathbf{X}^\top \mathbf{X})^{-1},
\]
where \(\hat{e}_g\) is the \(N \times 1\) vector of OLS residuals for the \(g\)’th group. Like White’s estimator for heteroskedasticity, this is a “sandwich” estimator. The “meat” of the sandwich here accounts for the within-group correlations in the error term.
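The mechanics of the sandwich are simple enough to spell out by hand. The sketch below is for understanding only; for real work, use a vetted implementation such as `plm::vcovHC()` (shown in the appendix), and note that this version omits the finite-sample adjustments discussed below.

```r
## Hand-rolled cluster-robust variance estimator, following the formula above
cluster_vcov <- function(fit, group) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  bread <- solve(crossprod(X))            # (X'X)^{-1}
  meat <- matrix(0, ncol(X), ncol(X))
  for (g in unique(group)) {
    Xg <- X[group == g, , drop = FALSE]
    eg <- e[group == g]
    meat <- meat + crossprod(Xg, eg) %*% crossprod(eg, Xg)  # X_g' e_g e_g' X_g
  }
  bread %*% meat %*% bread
}
```

With the running example, `sqrt(diag(cluster_vcov(fit, state)))` would give clustered standard errors for a fit on survey respondents grouped by state. If every observation is its own “group,” the formula collapses to White’s heteroskedasticity-consistent estimator.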

Also like White’s estimator for heteroskedasticity, the cluster-robust estimator is consistent, but not unbiased. It approaches the truth as \(G\), the number of groups, grows large (holding fixed \(N\), the number of observations per group), but it may be badly biased in small samples. The consensus seems to be that \(G = 50\) is large enough (Cameron and Miller 2015), so you Americanists may go ahead and rejoice. However, if you observe a small number of units over a long period of time (e.g., in a study of the historical political economy of Western European countries), the cluster-robust standard error estimator will be severely biased. Beck and Katz (1995) provide a similar estimator for such data, except with the summation in the middle of the “sandwich” taken over the \(T\) time periods instead of the \(G\) groups.

The earliest derivation of this estimator was Liang and Zeger (1986), and that is whom you should cite when you use it. Arellano (1987) derives the same estimator but is a bit easier to follow. As with White’s estimator, there are various finite-sample corrections to \(\hat{\Sigma}_{\text{CR}}\) that you may want to use; we will not go through those here.

## 12.4 Random Effects

*The coverage and notation in this section and the next one closely follow Johnston and DiNardo (1997, chap. 12).*

Just as Huber-White standard errors do not fix the inefficiency of OLS under heteroskedasticity, cluster-robust standard errors do not fix the inefficiency of OLS under within-group correlation. To get a handle on the efficiency problem, we will make further assumptions about the source of that correlation.

We will assume each group \(g\) has a (potentially) different intercept. We can incorporate this into our original model,
\[
Y_{gn} = \mathbf{x}_{gn} \cdot \beta + \epsilon_{gn},
\]
by decomposing the error term into independent *group-specific* and *individual-specific* components:
\[
\epsilon_{gn} = \underbrace{\alpha_g}_{\text{group}} + \underbrace{\eta_{gn}}_{\text{individual}}.
\]
In this equation, \(\alpha_g\) represents the difference between the intercept for group \(g\) and the overall average, while \(\eta_{gn}\) is an idiosyncratic shock specific to the \(n\)’th observation of group \(g\). We will assume that \(\eta_{gn}\) is independent and identically distributed across observations, so that the only source of autocorrelation is that observations in the same group share the same \(\alpha_g\).

When we decompose the error term like this, we are assuming implicitly that the \(\alpha_g\)’s are uncorrelated with the covariates in the model. Otherwise, strict exogeneity fails and the techniques that follow are useless. Let me put that another way—the random effects model depends on the assumption that the group-specific shocks are uncorrelated with the covariates. We will proceed under this assumption, and return shortly to the questions of testing it and what to do if it fails.

The random-intercepts model gives us a convenient structure for the \(\Omega\) matrix in GLS. Let \(V[\alpha_g] = \sigma^2_\alpha\) and \(V[\eta_{gn}] = \sigma^2_\eta\). The error variance matrix for the \(g\)’th group then works out to \[ \Omega_g = \begin{bmatrix} \sigma_{\eta}^2 + \sigma_{\alpha}^2 & \sigma_{\alpha}^2 & \cdots & \sigma_{\alpha}^2 \\ \sigma_{\alpha}^2 & \sigma_{\eta}^2 + \sigma_{\alpha}^2 & \cdots & \sigma_{\alpha}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\alpha}^2 & \sigma_{\alpha}^2 & \cdots & \sigma_{\eta}^2 + \sigma_{\alpha}^2 \end{bmatrix}. \] Since there is no autocorrelation across groups, \(\Omega\) takes the same block-diagonal form as in our discussion of cluster-robust standard errors.
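In matrix shorthand, \(\Omega_g = \sigma^2_\eta \mathbf{I} + \sigma^2_\alpha \mathbf{J}\), where \(\mathbf{J}\) is the \(N \times N\) matrix of ones. A one-line R sketch (the name `omega_group` is just illustrative):

```r
## Omega_g = sigma_eta^2 * I + sigma_alpha^2 * J for one group of size N
omega_group <- function(N, sigma2_alpha, sigma2_eta) {
  sigma2_eta * diag(N) + sigma2_alpha * matrix(1, N, N)
}

omega_group(3, sigma2_alpha = 1, sigma2_eta = 2)
## diagonal entries are sigma2_eta + sigma2_alpha = 3, off-diagonal entries 1
```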

If we knew \(\sigma^2_\alpha\) and \(\sigma^2_\eta\), we could estimate \(\beta\) by GLS. It is unlikely, however, that we would know these in advance. Luckily, there is a compromise option available. *Feasible GLS*, or FGLS, entails using a pair of first-stage regressions to estimate the variance of \(\alpha\) and \(\eta\), then plugging these into the GLS formula.

There are many ways to estimate these variances. In practice, this isn’t a problem you’ll need to worry about, since the variance estimates will be computed behind the scenes by your statistical software. But I think it’s useful to have a basic understanding of what’s going on when you use random effects, so here is one simple way to estimate \(\sigma^2_\alpha\) and \(\sigma^2_\eta\).

1. Run OLS of \(\mathbf{Y}\) on \(\mathbf{X}\). Let \(\hat{\sigma}^2\) denote the estimated residual variance from this “pooled model”—so called because it ignores group-level variation, throwing all the observations into a single pool. This is a consistent estimator of the total residual variance, \(\sigma^2_\alpha + \sigma^2_\eta\).

2. Run OLS of \(\mathbf{Y}\) on \([\mathbf{X} \quad \mathbf{D}]\), i.e., the covariates plus a set of group membership indicators. This is the fixed effects regression, which we will talk about shortly. Because the group membership indicators account for variation *across* groups, the remaining variation in this model solely represents variance *within* groups. We can therefore let the estimated residual variance from this model be \(\hat{\sigma}^2_\eta\), a consistent estimator of the within-group variance, \(\sigma^2_\eta\).

3. Estimate the between-group variance as \(\hat{\sigma}^2_\alpha = \hat{\sigma}^2 - \hat{\sigma}^2_\eta\).

4. Form the matrix \(\hat{\Omega}\) by plugging \(\hat{\sigma}^2_\alpha\) and \(\hat{\sigma}^2_\eta\) into the formulas above, then run GLS using \(\hat{\Omega}\) as the weighting matrix.

This gives us the *random effects estimator*,
\[
\hat{\beta}_{\text{RE}} = (\mathbf{X}^\top \hat{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\Omega}^{-1} \mathbf{Y}.
\]

See Johnston and DiNardo (1997, 392–95) for full formulas and details. FGLS is consistent but not unbiased, so the random effects model may not be a good idea in small samples. If our specification of the error structure is correct, it is asymptotically efficient—as the sample size increases, no other estimator has lower standard errors.
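The steps above can be sketched by hand for balanced groups. This is illustrative only; the variance estimates will not exactly match the Swamy-Arora defaults that **plm** uses, and `re_by_hand` is a hypothetical name. Here `X` is assumed to be a covariate matrix that already includes an intercept column.

```r
## A by-hand sketch of FGLS for the random-intercepts model, assuming
## every group has the same number of members
re_by_hand <- function(y, X, group) {
  ## Step 1: pooled OLS gives the total residual variance
  s2_total <- sigma(lm(y ~ X - 1))^2
  ## Step 2: the fixed effects regression gives the within-group variance
  s2_eta <- sigma(lm(y ~ X + factor(group) - 1))^2
  ## Step 3: between-group variance by subtraction (truncated at zero)
  s2_alpha <- max(s2_total - s2_eta, 0)
  ## Step 4: build Omega-hat block by block and run GLS
  N <- max(table(group))
  O_g <- s2_eta * diag(N) + s2_alpha * matrix(1, N, N)
  O_inv <- solve(O_g)
  A <- matrix(0, ncol(X), ncol(X))
  b <- numeric(ncol(X))
  for (g in unique(group)) {
    Xg <- X[group == g, , drop = FALSE]
    A <- A + t(Xg) %*% O_inv %*% Xg
    b <- b + t(Xg) %*% O_inv %*% y[group == g]
  }
  solve(A, b)   # (X' Omega^-1 X)^-1 X' Omega^-1 Y, accumulated group by group
}
```

Because \(\Omega\) is block-diagonal, the GLS cross-products can be accumulated one group at a time, which is all the final loop does.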

## 12.5 Fixed Effects

What if the group-specific intercepts are correlated with the covariates? Then, in order to maintain strict exogeneity, we must pull them out of the error term and into the covariate matrix. This is easy to do—we can rewrite the full model as
\[
\mathbf{Y} = \mathbf{X} \beta + \mathbf{D} \alpha + \eta,
\]
where \(\alpha\) is the \(G \times 1\) vector of group-specific intercepts. This suggests that we run OLS on our covariates plus the full set of group membership indicators. (As with any set of indicators, we need to omit one category.) The first \(K\) elements of this regression constitute the *fixed effects* estimator.
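In R, the dummy-variable regression is just `lm()` with a factor for group membership. Here is a sketch on simulated data (all names and values made up) where the group intercepts are correlated with the covariate, so pooled OLS is biased but fixed effects is not:

```r
## Fixed effects as a dummy-variable regression. lm() automatically
## drops one group indicator to avoid perfect collinearity.
set.seed(13)
g <- rep(1:10, each = 20)
x <- rnorm(200) + g / 5             # covariate correlated with group effects
y <- 2 * x + g + rnorm(200)         # true slope 2, group intercepts alpha_g = g

coef(lm(y ~ x))["x"]                # pooled OLS: biased upward
coef(lm(y ~ x + factor(g)))["x"]    # fixed effects: close to the true slope of 2
```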

Most textbook treatments of fixed effect estimators go through a whole rigmarole about computation, because it used to be challenging to invert a \((K + G) \times (K + G)\) matrix when \(G\) was moderately large. This is no longer true,^{31} so you can safely ignore most of the hand-wringing about computational difficulties.

The standard errors of the fixed effects estimator are usually higher than those of the random effects estimator, since estimating \(G\) additional parameters uses a lot of degrees of freedom. This leads us to the following pair of observations.

- If the random effects assumption is met (group-specific effects are uncorrelated with covariates), then the random effects and fixed effects estimators are both consistent, but fixed effects is less efficient.
- If the random effects assumption is not met, then the random effects estimator is inconsistent, while the fixed effects estimator remains consistent.

The typical test for whether fixed effects are necessary comes from Hausman (1978). Under the null hypothesis that both estimators are consistent (and thus fixed effects are unnecessary and inefficient), the test statistic \[ H = (\hat{\beta}_{\text{RE}} - \hat{\beta}_{\text{FE}})^\top (\hat{\Sigma}_{\text{FE}} - \hat{\Sigma}_{\text{RE}})^{-1} (\hat{\beta}_{\text{RE}} - \hat{\beta}_{\text{FE}}) \] asymptotically has a \(\chi^2\) distribution with \(K\) degrees of freedom.
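Spelled out in R, the statistic is a single quadratic form; `hausman_stat` is a hypothetical helper (in practice, `plm::phtest()`, shown in the appendix, handles this for you along with the degrees of freedom and p-value):

```r
## Hausman statistic from the two estimators' coefficients and variance matrices
hausman_stat <- function(b_re, b_fe, V_re, V_fe) {
  d <- b_re - b_fe
  drop(t(d) %*% solve(V_fe - V_re) %*% d)   # compare to qchisq(0.95, df = K)
}
```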

The other main drawback of fixed effects estimation is that you cannot estimate the effects of variables that do not vary within groups. (Why not?) See Greene (2003, sec. 13.5) for estimation strategies with panel data and time-invariant covariates.

One final note: Arellano (1987) shows that the cluster-robust variance matrix estimator can be used with fixed effects. Cameron and Miller (2015) recommend doing so, at least when the number of groups is large.

## 12.6 Appendix: Deriving the Moulton Factor

For a general error structure \(V[\epsilon \,|\, \mathbf{X}] = \sigma^2 \Omega\), the true variance of the OLS estimator is \[ V[\hat{\beta} \,|\, \mathbf{X}] = \sigma^2 \underbrace{(\mathbf{X}^\top \mathbf{X})^{-1}}_{\text{bread}} \underbrace{\mathbf{X}^\top \Omega \mathbf{X}}_{\text{meat}} \underbrace{(\mathbf{X}^\top \mathbf{X})^{-1}}_{\text{bread}}. \] Recall that we are considering the model with \(N = 4\) and \[ \Omega = \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix}. \] In this case, the “meat” of the OLS variance is \[ \begin{aligned} \mathbf{X}^\top \Omega \mathbf{X} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_1 & x_2 & x_2 \end{bmatrix} \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= \begin{bmatrix} 1 + \rho & 1 + \rho & 1 + \rho & 1 + \rho \\ (1 + \rho) x_1 & (1 + \rho) x_1 & (1 + \rho) x_2 & (1 + \rho) x_2 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= (1 + \rho) \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_1 & x_2 & x_2 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= (1 + \rho) \mathbf{X}^\top \mathbf{X}. \end{aligned} \] Therefore, we have \[ \begin{aligned} V[\hat{\beta} \,|\, \mathbf{X}] &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \Omega \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} [(1 + \rho) \mathbf{X}^\top \mathbf{X}] (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (1 + \rho) (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \]
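The key step, \(\mathbf{X}^\top \Omega \mathbf{X} = (1 + \rho) \mathbf{X}^\top \mathbf{X}\), is easy to verify numerically. Here is a quick check with arbitrary values for \(x_1\), \(x_2\), and \(\rho\):

```r
## Numerical check of the derivation above: X' Omega X = (1 + rho) X'X
x1 <- 0; x2 <- 1; rho <- 0.3   # arbitrary values
X <- cbind(1, c(x1, x1, x2, x2))
Omega <- diag(4)
Omega[1, 2] <- Omega[2, 1] <- rho
Omega[3, 4] <- Omega[4, 3] <- rho

all.equal(t(X) %*% Omega %*% X, (1 + rho) * crossprod(X))  # TRUE
```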

## 12.7 Appendix: Implementation in R

The methods introduced here can be implemented via the **plm** package.

`library("plm")`

We will use the `Produc` dataset from **plm**, a riveting collection of economic statistics about the U.S. states from 1970 to 1986.

```
data(Produc, package = "plm")
head(Produc)
```

```
## state year region pcap hwy water util pc gsp emp
## 1 ALABAMA 1970 6 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5
## 2 ALABAMA 1971 6 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9
## 3 ALABAMA 1972 6 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3
## 4 ALABAMA 1973 6 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5
## 5 ALABAMA 1974 6 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8
## 6 ALABAMA 1975 6 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4
## unemp
## 1 4.7
## 2 5.2
## 3 4.7
## 4 3.9
## 5 5.5
## 6 7.7
```

The functions in **plm** assume that your data are organized like `Produc`, with the grouping variable in the first column and the identification variable (time, in the case of panel data) in the second column. See the **plm** package vignette on how to get datasets not organized this way into line.

We will treat unemployment (`unemp`) as our response and public capital stock (`pcap`) and private capital stock (`pc`) as our covariates. As a benchmark, let’s use OLS.

```
fit_ols <- lm(unemp ~ pcap + pc,
              data = Produc)
summary(fit_ols)
```

```
##
## Call:
## lm(formula = unemp ~ pcap + pc, data = Produc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5589 -1.6219 -0.3373 1.2129 11.5951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.202e+00 1.077e-01 57.571 <2e-16 ***
## pcap 1.006e-05 5.515e-06 1.824 0.0686 .
## pc 2.558e-06 2.563e-06 0.998 0.3186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.196 on 813 degrees of freedom
## Multiple R-squared: 0.03516, Adjusted R-squared: 0.03279
## F-statistic: 14.81 on 2 and 813 DF, p-value: 4.795e-07
```

The “pooling” estimator implemented by `plm()` ought to give us the same results.

```
fit_pooling <- plm(unemp ~ pcap + pc,
                   data = Produc,
                   model = "pooling")
summary(fit_pooling)
```

```
## Pooling Model
##
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "pooling")
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.55895 -1.62192 -0.33733 1.21292 11.59512
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 6.2015e+00 1.0772e-01 57.5709 < 2e-16 ***
## pcap 1.0059e-05 5.5152e-06 1.8238 0.06855 .
## pc 2.5582e-06 2.5633e-06 0.9980 0.31858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 4064.6
## Residual Sum of Squares: 3921.7
## R-Squared: 0.035162
## Adj. R-Squared: 0.032788
## F-statistic: 14.8141 on 2 and 813 DF, p-value: 4.7947e-07
```

We can obtain the cluster-robust variance matrix estimate via `vcovHC()`. Make sure to specify `method = "arellano"` so as to get the usual estimator. It is not entirely clear to me which of the various finite-sample adjustments corresponds to the defaults in Stata.

```
crvm_pooling <- vcovHC(fit_pooling,
                       method = "arellano",
                       type = "HC1")
summary(fit_pooling, vcov = crvm_pooling)
```

```
## Pooling Model
##
## Note: Coefficient variance-covariance matrix supplied: crvm_pooling
##
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "pooling")
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.55895 -1.62192 -0.33733 1.21292 11.59512
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 6.2015e+00 2.4475e-01 25.3385 <2e-16 ***
## pcap 1.0059e-05 1.2103e-05 0.8311 0.4062
## pc 2.5582e-06 7.3044e-06 0.3502 0.7263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 4064.6
## Residual Sum of Squares: 3921.7
## R-Squared: 0.035162
## Adj. R-Squared: 0.032788
## F-statistic: 6.45144 on 2 and 47 DF, p-value: 0.0033441
```

Our \(t\)-statistics are cut by more than half, even though our covariates vary within groups (unlike the example in Moulton (1990)).

We can also use `plm()` to estimate a random-effects model.

```
fit_random <- plm(unemp ~ pcap + pc,
                  data = Produc,
                  model = "random")
summary(fit_random)
```

```
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "random")
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Effects:
## var std.dev share
## idiosyncratic 3.092 1.758 0.69
## individual 1.389 1.179 0.31
## theta: 0.6598
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.65953 -1.35357 -0.28627 1.01012 9.72087
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 5.6656e+00 2.5521e-01 22.1996 < 2.2e-16 ***
## pcap 3.9178e-08 1.1236e-05 0.0035 0.997218
## pc 1.6080e-05 4.5644e-06 3.5228 0.000427 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2916.2
## Residual Sum of Squares: 2792.6
## R-Squared: 0.042399
## Adj. R-Squared: 0.040043
## Chisq: 35.9967 on 2 DF, p-value: 1.5256e-08
```

And, finally, a fixed-effects model, which `plm()` calls the `"within"` model.

```
fit_fixed <- plm(unemp ~ pcap + pc,
                 data = Produc,
                 model = "within")
summary(fit_fixed)
```

```
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "within")
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.80687 -1.16760 -0.24813 0.94681 8.39430
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## pcap 2.2699e-04 3.1813e-05 7.1352 2.244e-12 ***
## pc 4.1783e-06 6.5669e-06 0.6363 0.5248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2765.9
## Residual Sum of Squares: 2368.5
## R-Squared: 0.14368
## Adj. R-Squared: 0.088906
## F-statistic: 64.2646 on 2 and 766 DF, p-value: < 2.22e-16
```

We can extract the fixed-effect estimates themselves via `fixef()`.

`fixef(fit_fixed)`

```
## ALABAMA ARIZONA ARKANSAS CALIFORNIA COLORADO
## 3.763902 3.201232 5.020980 -24.351816 1.906885
## CONNECTICUT DELAWARE FLORIDA GEORGIA IDAHO
## 2.222801 5.723806 -2.950385 0.090978 5.926215
## ILLINOIS INDIANA IOWA KANSAS KENTUCKY
## -7.483176 1.499181 1.176446 1.258062 2.487150
## LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN
## 2.394977 6.132823 -0.469424 -0.081008 -2.073261
## MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA
## -0.913146 4.790411 0.576112 5.398842 1.033804
## NEVADA NEW_HAMPSHIRE NEW_JERSEY NEW_MEXICO NEW_YORK
## 6.121925 3.867146 -0.475632 5.875722 -22.416937
## NORTH_CAROLINA NORTH_DAKOTA OHIO OKLAHOMA OREGON
## 0.588848 3.714982 -5.208327 2.212210 4.913671
## PENNSYLVANIA RHODE_ISLAND SOUTH_CAROLINA SOUTH_DAKOTA TENNESSE
## -6.740590 6.045561 3.895325 2.942687 1.080415
## TEXAS UTAH VERMONT VIRGINIA WASHINGTON
## -11.161384 4.233540 5.624496 -0.960964 1.080757
## WEST_VIRGINIA WISCONSIN WYOMING
## 6.940764 0.191971 3.798969
```

If we wanted to include time dummies as well, we could specify `effect = "twoways"` in the fitting function.

```
fit_fixed_2 <- plm(unemp ~ pcap + pc,
                   data = Produc,
                   effect = "twoways",
                   model = "within")
summary(fit_fixed_2)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, effect = "twoways",
## model = "within")
##
## Balanced Panel: n = 48, T = 17, N = 816
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3.330599 -0.802635 -0.044942 0.749901 6.021784
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## pcap 9.1022e-05 2.6202e-05 3.4738 0.0005426 ***
## pc -1.2657e-05 5.0355e-06 -2.5135 0.0121641 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 1256.4
## Residual Sum of Squares: 1235.8
## R-Squared: 0.016355
## Adj. R-Squared: -0.068894
## F-statistic: 6.23517 on 2 and 750 DF, p-value: 0.0020624
```

`fixef(fit_fixed_2, effect = "time")`

```
## 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## 5.5480 6.2755 5.7642 5.1866 5.8250 8.5092 7.4508 6.9296 5.9312 5.8163 7.1746
## 1981 1982 1983 1984 1985 1986
## 7.6609 9.6937 9.6779 7.6333 7.4795 7.3435
```

`phtest()` implements the Hausman test. Remember that the null hypothesis is that both estimators are consistent; if we reject it, then the random effects estimator is inconsistent and we must use fixed effects.

```
phtest(fit_random, fit_fixed)
```

```
##
## Hausman Test
##
## data: unemp ~ pcap + pc
## chisq = 95.562, df = 2, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
```
Here the p-value is vanishingly small, so we reject the null hypothesis and rely on the fixed effects estimates.

## 12.8 Appendix: R-squared in plm

**plm** has annoyed me—and my students—by reporting incorrect (or, generously, “idiosyncratic”) \(R^2\) values for models with fixed effects.
I’ve seen a few instances where the \(R^2\) apparently goes down after adding fixed effects to a model, which is mathematically impossible.

The issue is that **plm** reports the \(R^2\) as the proportion of variance explained *by the substantive covariates*, not including the fixed effects.
This is inconsistent with usual practice.
Here I illustrate how to calculate proper \(R^2\) values while using **plm**.

I’ll generate some grouped data where most of the action is at the group level.
I want there to be a lot of variation across groups, but relatively little within groups.
This is exactly the circumstance where the true \(R^2\) and the one reported by `plm` should differ the most.

```
library("tidyverse")
## Draw group-level intercepts
set.seed(94)
groups <- LETTERS[1:20]
group_means <- rnorm(length(groups), sd = 16)
names(group_means) <- groups

## Assign individual observations to groups and draw outcomes
n_obs <- 5000
g <- sample(groups, size = n_obs, replace = TRUE)
x1 <- rnorm(n_obs, sd = 1)
x2 <- rnorm(n_obs, sd = 1)
y <- group_means[g] + x1 - x2 + rnorm(n_obs, sd = 4)

## Arrange into format plm likes
grouped_data <- tibble(g, x1, x2, y) %>%
  arrange(g) %>%
  group_by(g) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  select(g, idx, y, x1, x2)
grouped_data
```

```
## # A tibble: 5,000 x 5
## g idx y x1 x2
## <chr> <int> <dbl> <dbl> <dbl>
## 1 A 1 16.9 -0.932 -0.928
## 2 A 2 21.2 0.319 0.0733
## 3 A 3 23.5 0.228 0.708
## 4 A 4 13.9 -1.46 0.426
## 5 A 5 22.4 -1.70 0.154
## # … with 4,995 more rows
```

If we run a fixed effects model “by hand” by just plugging the group dummies into `lm`, we get the correct (high) \(R^2\).

```
fit_fe_lm <- lm(y ~ x1 + x2 + g, data = grouped_data)
summary(fit_fe_lm)
```

```
##
## Call:
## lm(formula = y ~ x1 + x2 + g, data = grouped_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5872 -2.6934 -0.0011 2.6249 13.4342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.52393 0.24933 74.294 < 2e-16 ***
## x1 1.00745 0.05689 17.710 < 2e-16 ***
## x2 -1.17708 0.05703 -20.639 < 2e-16 ***
## gB -12.73795 0.35900 -35.481 < 2e-16 ***
## gC -42.70627 0.35585 -120.011 < 2e-16 ***
## gD 1.68971 0.35226 4.797 1.66e-06 ***
## gE -37.92942 0.35331 -107.354 < 2e-16 ***
## gF -20.21707 0.35815 -56.449 < 2e-16 ***
## gG -24.80747 0.34477 -71.953 < 2e-16 ***
## gH -5.42873 0.34709 -15.641 < 2e-16 ***
## gI 0.40416 0.35664 1.133 0.257
## gJ -59.49246 0.35774 -166.300 < 2e-16 ***
## gK -21.63007 0.35404 -61.095 < 2e-16 ***
## gL 1.68897 0.35440 4.766 1.94e-06 ***
## gM -14.33295 0.35090 -40.846 < 2e-16 ***
## gN -15.51534 0.35628 -43.548 < 2e-16 ***
## gO -21.77582 0.35404 -61.506 < 2e-16 ***
## gP -35.90792 0.34894 -102.906 < 2e-16 ***
## gQ -38.35542 0.35735 -107.332 < 2e-16 ***
## gR -47.74437 0.35412 -134.825 < 2e-16 ***
## gS -26.16950 0.36139 -72.413 < 2e-16 ***
## gT 4.02471 0.34861 11.545 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.966 on 4978 degrees of freedom
## Multiple R-squared: 0.9536, Adjusted R-squared: 0.9534
## F-statistic: 4876 on 21 and 4978 DF, p-value: < 2.2e-16
```

`plm` gives us the same coefficients and standard errors on the substantive variables, `x1` and `x2`, but reports a curiously different \(R^2\).

```
fit_fe_plm <- plm(y ~ x1 + x2, data = grouped_data, model = "within")
summary(fit_fe_plm)
```

```
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1 + x2, data = grouped_data, model = "within")
##
## Unbalanced Panel: n = 20, T = 230-278, N = 5000
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -13.5871727 -2.6933956 -0.0011038 2.6249454 13.4342204
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 1.007450 0.056886 17.710 < 2.2e-16 ***
## x2 -1.177083 0.057033 -20.639 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 89881
## Residual Sum of Squares: 78292
## R-Squared: 0.12893
## Adj. R-Squared: 0.12526
## F-statistic: 368.416 on 2 and 4978 DF, p-value: < 2.22e-16
```
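Where does `plm`'s much smaller number come from? It is the \(R^2\) of the *within*-transformed regression: both sums of squares are computed on group-demeaned data, so only within-group variation counts. As a sketch (assuming `grouped_data` and `fit_fe_plm` from above, with the **tidyverse** already loaded), we can reproduce it by demeaning `y` within groups:

```
## plm's "Total Sum of Squares" is the variation in y within groups,
## i.e., after subtracting each group's mean of y
within_tss <- grouped_data %>%
  group_by(g) %>%
  mutate(y_dm = y - mean(y)) %>%
  ungroup() %>%
  summarise(tss = sum(y_dm^2)) %>%
  pull(tss)

## 1 - RSS / within-group TSS reproduces plm's reported R-Squared (~0.129)
1 - sum(residuals(fit_fe_plm)^2) / within_tss
```

Since most of the variation in these data is across groups, the within-group TSS is small, and the within \(R^2\) is correspondingly modest even though the model fits well overall.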

In order to get an \(R^2\) commensurate with what `lm` reports, we need to calculate the total sum of squares and residual sum of squares by hand. Here’s how to do that.

```
## Calculate total sum of squares
TSS <- sum((y - mean(y))^2)

## Calculate residual sum of squares
RSS <- sum(residuals(fit_fe_plm)^2)

correct_rsq <- 1 - (RSS / TSS)
correct_rsq
```

```
## [1] 0.9536398
```

Notice that this is the same as what `lm` reports.

```
all.equal(summary(fit_fe_lm)$r.squared, correct_rsq)
```

```
## [1] TRUE
```

If the response variable (here, `y`) used for fitting is not the same as in your raw data—for example, if you are using the `subset` argument or if there is missing data—you will need to be cautious about how you calculate the TSS.
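One cautious approach, sketched here under the hypothetical assumption that incomplete rows were dropped during fitting, is to rebuild the estimation sample yourself and compute the TSS on exactly those rows:

```
## Hypothetical: suppose rows with missing values were dropped in fitting
est_data <- grouped_data[complete.cases(grouped_data), ]

## Compute the TSS on the estimation sample, not the full raw data
TSS_est <- sum((est_data$y - mean(est_data$y))^2)
RSS_est <- sum(residuals(fit_fe_plm)^2)
1 - RSS_est / TSS_est
```

The key point is that the mean used in the TSS must be taken over the same observations that produced the residuals; mixing samples will give a meaningless \(R^2\).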

---

Why did I leave out national characteristics, like inflation or the national unemployment rate? In a single-election study, these are constant. While they might shape variation *across* elections, they cannot be linked to variation *within* a single election.

Everything we do here would go through if we allowed the number of observations to vary across groups, but the notation would get uglier. In the panel data context, we say that the panels are *balanced* if the number of observations is the same for each group and *unbalanced* otherwise.

When I taught this class last year, I simulated a dataset with a single covariate, \(N = 1{,}000\), and \(J = 1{,}000\), meaning the fixed effects regression entailed estimating \(1{,}001\) parameters from \(1{,}000{,}000\) observations. It took less than two minutes to run on my not-especially-powerful laptop. That said, should \(N\) ever grow another order of magnitude beyond that, the computational tricks given in the textbook treatments will once again be of use.