12 Clustered and Panel Data

Grouped data structures, in which we observe individual units within larger groups, are common in political science and other social sciences. Examples include:

  • Cross-national survey data, where we observe individual respondents grouped by country.

  • Block-randomized field experiments. For example, in an experiment where the treatment is administered at the village level, we observe individual outcomes grouped by village.

  • Panel data, where we observe the same units repeatedly over time. This includes panel surveys as well as observational data on states, countries, or other political units over time.

Grouped data presents problems and opportunities. At the root of both is the idea of unobserved heterogeneity—that some variation across groups might be due to unobservable features of the groups.

To return to our running example, suppose you are interested in the correlates of voting for Donald Trump in the 2016 general election. You observe survey respondents grouped by state. Vote choice might be affected by:

  1. Individual characteristics like one’s age, gender, race, income, and education.

  2. State characteristics29 like the unemployment rate, undocumented population, and exposure to trade with China.

Some state characteristics that might affect vote choice are difficult or impossible to measure. For example, some states have a more cosmopolitan culture, while others have a more traditional culture. A 50-year-old white man living in Connecticut is, we would expect, less likely to have voted for Trump than a 50-year-old white man living in Alabama. This is unobserved heterogeneity—characteristics that we do not observe, and therefore cannot control for, but which affect the response.

If we are not careful in dealing with grouped data, unobserved heterogeneity can be a major problem. The spherical errors assumption is usually not tenable for grouped data, since observations will be correlated with others in the same group. OLS will therefore be inefficient and yield invalid inferences. Even worse, if group-level sources of unobserved heterogeneity are correlated with individual characteristics—if, say, younger voters are more likely to live in cosmopolitan states—OLS may also be biased and inconsistent.

On the other hand, if we deal with grouped data properly, we can enhance the credibility of our inferences. We can eliminate the influence of variation across groups, allowing us to focus on the comparisons within groups that we are usually most interested in.

Before we get started, a note of caution. We only have enough time to scratch the surface of how to analyze grouped data. If you work with grouped data in your dissertation or other research, you should think carefully about your data structure and potential sources of unobserved heterogeneity. The methods we discuss this week may not solve your problems. As further references, I recommend Wooldridge (2002) and Gelman and Hill (2006).

12.1 The Linear Model with Grouped Data

We need to change our notation a bit to reflect the arrangement of observations into groups. Let there be \(G\) groups indexed by \(g = 1, \ldots, G\). Within each group, we have \(N\) observations indexed by \(n = 1, \ldots, N\).30 (In the special case of panel data, where each group is a unit observed over time, the standard notation is to use \(t = 1, \ldots, T\) instead.)

We will now index individual “rows” of the data by \(gn\), which stands for the \(n\)’th observation within group \(g\).

  • Unit-level
    • \(Y_{gn}\): response for the \(gn\)’th observation
    • \(\mathbf{x}_{gn}\): vector of \(K\) covariates for the \(gn\)’th observation
    • \(\epsilon_{gn}\): random shock to the \(gn\)’th response
  • Group-level
    • \(\mathbf{Y}_g\): vector of \(N\) responses for the \(g\)’th group
    • \(\mathbf{X}_g\): \(N \times K\) matrix of covariates for the \(g\)’th group
    • \(\epsilon_g\): vector of \(N\) random shocks for the \(g\)’th group
  • Full data
    • \(\mathbf{Y}\): vector of all \(GN\) responses, where \[\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \vdots \\ \mathbf{Y}_G \end{bmatrix}.\]
    • \(\mathbf{X}\): \(GN \times K\) matrix of covariates, where \[\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_G \end{bmatrix}.\]
    • \(\mathbf{D}\): \(GN \times G\) matrix of group membership indicators.
    • \(\epsilon\): vector of all \(GN\) random shocks

If we assume a standard linear model for each \(gn\)’th response, \[ Y_{gn} = \mathbf{x}_{gn} \cdot \beta + \epsilon_{gn}, \] we end up with the familiar matrix equation \[ \mathbf{Y} = \mathbf{X} \beta + \epsilon. \]

You may look at this equation and think, “This looks like something we’ve used OLS on. What’s wrong with OLS?” Two reasons. First, at a minimum, we are unlikely to have spherical errors in grouped data. Observations in the same group—students in the same classroom, voters in the same state, LAPOP respondents in the same country—are likely to have residual correlation. There are unmeasured factors that commonly affect their responses. As in any non-spherical error model, this means OLS will yield invalid inferences. Moreover, because we are talking about autocorrelation and not just heteroskedasticity, White’s heteroskedasticity-consistent standard errors won’t do the trick.

Second, depending on the nature of the group-specific effects, we may also have a failure of strict exogeneity. I hope I have impressed on you by now that this is a major problem. To see why we have a failure of strict exogeneity, let us return to the example of voters living in states with cosmopolitan versus traditional cultures. If we cannot measure cosmopolitanism (as I am assuming), then it ends up in the error term of our model: \[ \epsilon_{gn} = \text{Cosmpolitanism}_g + \text{Other Stuff}_{gn}. \] But we know that younger and more educated voters are more likely to live in cosmopolitan areas. So if our covariate matrix includes age and education, we have \[ E[\epsilon \,|\, \mathbf{X}] \neq \mathbf{0}. \]

We will proceed from the easiest problems to the hardest. We will first consider a standard error correction and an efficiency improvement for the case where errors are correlated within groups but strict exogeneity still holds. We will then identify an unbiased estimator in the case where strict exogeneity fails.

12.2 Autocorrelation within Groups

Imagine the following policy experiment. The federal government randomly selects half of the states to receive a grant intended to improve high school education, while the other half do not receive it. We observe some indicator of educational quality (e.g., graduation rates) at the school district level, where school districts are grouped within states. So our model looks like \[ \text{Quality}_{gn} = \beta_1 + \beta_2 \text{Grant}_g + \epsilon_{gn}. \] We want to know whether receiving the grant affected quality. What we cannot measure is how well the states used the money. If some states used it better than others, then we would expect the error term to be correlated across school districts within the state. In a state where the money was used wisely, we would expect most of the schools to do “better than expected”—to have positive residuals, in the language of regression. Conversely, in a state where the money was squandered, we would expect most of the schools to do worse than we would otherwise predict.

This is one example of the general phenomenon Moulton (1986, 1990) identifies—that there is often substantial correlation in the random errors within groups, especially when we are looking at the effects of variables that only vary at the group level. One way I think about it is that, with grouped data and group-level covariates, the effective number of observations is less than the nominal number of observations. If we use the OLS standard errors, we are pretending to have more data than we really do.

Let’s see a bit more formally why autocorrelation within groups would cause us to overstate the effective sample size. Imagine we have \(N = 4\) observations organized into \(G = 2\) groups, with the groups being the first two and last two observations respectively. Furthermore, let’s assume we have a single covariate that varies across groups but not within them (e.g., a random treatment assigned at the group level), so that our data matrix looks like \[ \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix}. \] Finally, assume the errors are homoskedastic but are autocorrelated within groups:

  • \(V[\epsilon_n] = \sigma^2\) for all \(n\).
  • \(\mathop{\rm Cor}\nolimits[\epsilon_1, \epsilon_2] = \mathop{\rm Cor}\nolimits[\epsilon_3, \epsilon_4] = \rho\).
  • \(\mathop{\rm Cor}\nolimits[\epsilon_n, \epsilon_m] = 0\) for all other \(n, m\) pairs.

Then we can write the variance matrix of the error term as \(V[\epsilon \,|\, \mathbf{X}] = \sigma^2 \Omega\), where \[ \Omega = \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix}. \] It turns out that the variance-covariance matrix of the OLS estimator in this case is \[ V[\hat{\beta} \,|\, \mathbf{X}] = \sigma^2 (1 + \rho) (\mathbf{X}^\top \mathbf{X})^{-1}. \] Notice that this is a factor of \(1 + \rho\) larger than the naive OLS formula—the one that is only valid when spherical errors is satisfied. (See the appendix to this chapter for how we derive this formula.) Therefore, if we used the regular OLS standard errors for our model, the standard error for each coefficient would be too small by a factor of \(\sqrt{1 + \rho}\).

This is a special case of the Moulton factor (see MHE 8.2.1). If all groups have \(m\) observations, the covariate varies only at the group level, errors are homoskedastic, and the within-group correlation is \(\rho\) for all groups, then the OLS standard error calculation will underestimate the true standard errors by a factor of \(\sqrt{1 + (m - 1) \rho}\). If for some reason we were certain the homoskedasticity assumption held and we knew \(\rho\), then we could simply correct the OLS standard errors by applying this formula. More broadly, if we knew the exact structure of heteroskedasticity and autocorrelation, we could obtain an efficient estimate of \(\beta\) and proper standard errors by using GLS. Usually, however, this option will not be available to us.

12.3 Clustered Standard Errors

Luckily, we can correct “clustered” errors in a manner similar to what we did when encountering heteroskedasticity of unknown form. The most we can assume on \(\Omega = V[\epsilon \,|\, \mathbf{X}]\) is

  • Heteroskedasticity of unknown form, within and across groups.

  • Autocorrelation of unknown form within groups.

  • No autocorrelation across groups.

Under these assumptions, \(\Omega\) has the block-diagonal form \[ \Omega = \begin{bmatrix} \Omega_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \Omega_2 & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \Omega_G \end{bmatrix}, \] where each \(\Omega_g\) is a symmetric \(N \times N\) matrix. These may differ from each other—each group may have its own special autocorrelation structure. This is more general than the assumptions under which we obtained the Moulton factor above.

I’ll illustrate the simplest nontrivial example of a residual variance matrix with this structure. Imagine there are \(G = 2\) groups with \(N = 3\) members each. Let \(\sigma_n^2\) denote the residual variance for each observation \(n\), and let \(\rho_{nm}\) denote the autocorrelation between observations \(n\) and \(m\). By definition of covariance, \(\text{Cov} [\epsilon_n, \epsilon_m] = \sigma_n \sigma_m \rho_{nm}\). Therefore, under the assumptions above, the residual variance matrix is \[ \Omega = \begin{bmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho_{12} & \sigma_1 \sigma_3 \rho_{13} & 0 & 0 & 0 \\ \sigma_1 \sigma_2 \rho_{12} & \sigma_2^2 & \sigma_2 \sigma_3 \rho_{23} & 0 & 0 & 0 \\ \sigma_1 \sigma_3 \rho_{13} & \sigma_2 \sigma_3 \rho_{23} & \sigma_3^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & \sigma_4^2 & \sigma_4 \sigma_5 \rho_{45} & \sigma_4 \sigma_6 \rho_{46} \\ 0 & 0 & 0 & \sigma_4 \sigma_5 \rho_{45} & \sigma_5^2 & \sigma_5 \sigma_6 \rho_{56} \\ 0 & 0 & 0 & \sigma_4 \sigma_6 \rho_{46} & \sigma_5 \sigma_6 \rho_{56} & \sigma_6^2 \end{bmatrix} \] Here you can see the “block diagonal” structure, with the top-right and bottom-left (cross-group covariances) being filled with zeroes.

If we knew \(\Omega\) exactly, we could use GLS to efficiently estimate \(\beta\) and obtain correct standard errors. That is unlikely. Instead, we will use OLS to obtain our coefficient estimates, then we will correct the standard errors so as to be approximately correct in large samples. The cluster-robust variance estimator is \[ \hat{\Sigma}_{\text{CR}} = (\mathbf{X}^\top \mathbf{X})^{-1} \left( \sum_{g=1}^G \mathbf{X}_g^\top \hat{e}_g \hat{e}_g^\top \mathbf{X}_g \right) (\mathbf{X}^\top \mathbf{X})^{-1}, \] where \(\hat{e}_g\) is the \(N \times 1\) vector of OLS residuals for the \(g\)’th group. Like White’s estimator for heteroskedasticity, this is a “sandwich” estimator. The “meat” of the sandwich here accounts for the within-group correlations in the error term.

Also like White’s estimator for heteroskedasticity, the cluster-robust estimator is consistent, but not unbiased. It approaches the truth as \(G\), the number of groups, grows large (holding fixed \(N\), the number of observations per group), but it may be badly biased in small samples. The consensus seems to be that \(N = 50\) is large enough (Cameron and Miller 2015), so you Americanists may go ahead and rejoice. However, if you observe a small number of units over a long period of time (e.g., in a study of the historical political economy of Western European countries), the cluster-robust standard error estimator will be severely biased. Beck and Katz (1995) provide a similar estimator for such data, except with the summation in the middle of the “sandwich” taken over the \(N\) time periods instead of the \(G\) groups.

The earliest derivation of this estimator was Liang and Zeger (1986), and that is whom you should cite when you use it. Arellano (1987) derives the same estimator but is a bit easier to follow. As with White’s estimator, there are various finite-sample corrections to \(\hat{\Sigma}_{\text{CR}}\) that you may want to use; we will not go through those here.

12.4 Random Effects

The coverage and notation in this section and the next one closely follow Johnston and DiNardo (1997, chap. 12).

Just as Huber-White standard errors do not fix the inefficiency of OLS under heteroskedasticity, cluster-robust standard errors do not fix the inefficiency of OLS under within-group correlation. To get a handle on the efficiency problem, we will make further assumptions about the source of that correlation.

We will assume each group \(g\) has a (potentially) different intercept. We can incorporate this into our original model, \[ Y_{gn} = \mathbf{x}_{gn} \cdot \beta + \epsilon_{gn}, \] by decomposing the error term into independent group-specific and individual-specific components: \[ \epsilon_{gn} = \underbrace{\alpha_g}_{\text{group}} + \underbrace{\eta_{gn}}_{\text{individual}}. \] In this equation, \(\alpha_g\) represents the difference between the intercept for group \(g\) and the overall average, while \(\eta_{gn}\) is an idiosyncratic shock specific to the \(n\)’th observation of group \(g\). We will assume that \(\eta_{gn}\) is independent and identically distributed across observations, so that the only source of autocorrelation is that observations in the same group share the same \(\alpha_g\).

When we decompose the error term like this, we are assuming implicitly that the \(\alpha_g\)’s are uncorrelated with the covariates in the model. Otherwise, strict exogeneity fails and the techniques that follow are useless. Let me put that another way—the random effects model depends on the assumption that the group-specific shocks are uncorrelated with the covariates. We will proceed under this assumption, and return shortly to the questions of testing it and what to do if it fails.

The random-intercepts model gives us a convenient structure for the \(\Omega\) matrix in GLS. Let \(V[\alpha_g] = \sigma^2_\alpha\) and \(V[\eta_{gn}] = \sigma^2_\eta\). The error variance matrix for the \(g\)’th group then works out to \[ \Omega_g = \begin{bmatrix} \sigma_{\eta}^2 + \sigma_{\alpha}^2 & \sigma_{\alpha}^2 & \cdots & \sigma_{\alpha}^2 \\ \sigma_{\alpha}^2 & \sigma_{\eta}^2 + \sigma_{\alpha}^2 & \cdots & \sigma_{\alpha}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\alpha}^2 & \sigma_{\alpha}^2 & \cdots & \sigma_{\eta}^2 + \sigma_{\alpha}^2 \end{bmatrix}. \] Since there is no autocorrelation across groups, \(\Omega\) takes the same block-diagonal form as in our discussion of cluster-robust standard errors.

If we knew \(\sigma^2_\alpha\) and \(\sigma^2_\eta\), we could estimate \(\beta\) by GLS. It is unlikely, however, that we would know these in advance. Luckily, there is a compromise option available. Feasible GLS, or FGLS, entails using a pair of first-stage regressions to estimate the variance of \(\alpha\) and \(\eta\), then plugging these into the GLS formula.

There are many ways to estimate these variances. In practice, this isn’t a problem you’ll need to worry about, since the variance estimates will be computed behind the scenes by your statistical software. But I think it’s useful to have a basic understanding of what’s going on when you use random effects, so here is one simple way to estimate \(\sigma^2_\alpha\) and \(\sigma^2_\eta\).

  1. Run OLS of \(\mathbf{Y}\) on \(\mathbf{X}\). Let \(\hat{\sigma}^2\) denote the estimated residual variance from this “pooled model”—so called because it ignores group-level variation, throwing all the observations into a single pool. This is a consistent estimator of the total residual variance, \(\sigma^2_\alpha + \sigma^2_\eta\).

  2. Run OLS of \(\mathbf{Y}\) on \([\mathbf{X} \quad \mathbf{D}]\), i.e., the covariates plus a set of group membership indicators. This is the fixed effects regression, which we will talk about shortly. Because the group membership indicators account for variation across groups, the remaining variation in this model solely represents variance within groups. We can therefore let the estimated residual variance from this model be \(\hat{\sigma}^2_\eta\), which is a consistent estimator of the within-group variance, \(\sigma^2_\eta\).

  3. Estimate the between-group variance as \(\hat{\sigma}^2_\alpha = \hat{\sigma}^2 - \hat{\sigma}^2_\eta\).

  4. Form the matrix \(\hat{\Omega}\) by plugging \(\hat{\sigma}^2_\alpha\) and \(\hat{\sigma}^2_\eta\) into the formulas above, then run GLS using \(\hat{\Omega}\) as the weighting matrix.

This gives us the random effects estimator, \[ \hat{\beta}_{\text{RE}} = (\mathbf{X}^\top \hat{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\Omega}^{-1} \mathbf{Y}. \]

See Johnston and DiNardo (1997, 392–95) for full formulas and details. FGLS is consistent but not unbiased, so the random effects model may not be a good idea in small samples. If our specification of the error structure is correct, it is asymptotically efficient—as the sample size increases, no other estimator has lower standard errors.

12.5 Fixed Effects

What if the group-specific intercepts are correlated with the covariates? Then, in order to maintain strict exogeneity, we must pull them out of the error term and into the covariate matrix. This is easy to do—we can rewrite the full model as \[ \mathbf{Y} = \mathbf{X} \beta + \mathbf{D} \alpha + \eta, \] where \(\alpha\) is the \(G \times 1\) vector of group-specific intercepts. This suggests that we run OLS on our covariates plus the full set of group membership indicators. (As with any set of indicators, we need to omit one category.) The first \(K\) elements of this regression constitute the fixed effects estimator.

Most textbook treatments of fixed effect estimators go through a whole rigmarole about computation, because it used to be challenging to invert a \((K + G) \times (K + G)\) matrix when \(G\) was moderately large. This is no longer true,31 so you can safely ignore most of the hand-wringing about computational difficulties.

The standard errors of the fixed effects estimator are usually higher than those of the random effects estimator, since estimating \(G\) additional parameters uses a lot of degrees of freedom. This leads us to the following pair of observations.

  • If the random effects assumption is met (group-specific effects are uncorrelated with covariates), then the random effects and fixed estimators are both consistent, but fixed effects is less efficient.

  • If the random effects assumption is not met, then the random effects estimator is inconsistent while the fixed effects estimator is consistent.

The typical test for whether fixed effects are necessary comes from Hausman (1978). Under the null hypothesis that both estimators are consistent (and thus fixed effects are unnecessary and inefficient), the test statistic \[ H = (\hat{\beta}_{\text{RE}} - \hat{\beta}_{\text{FE}})^\top (\hat{\Sigma}_{\text{FE}} - \hat{\Sigma}_{\text{RE}})^{-1} (\hat{\beta}_{\text{RE}} - \hat{\beta}_{\text{FE}}) \] asymptotically has a \(\chi^2\) distribution with \(K\) degrees of freedom.

The other main drawback of fixed effects estimation is that you cannot estimate the effects of variables that do not vary within groups. (Why not?) See Greene (2003, sec. 13.5) for estimation strategies with panel data and time-invariant covariates.

One final note: Arellano (1987) shows that the cluster-robust variance matrix estimator can be used with fixed effects. Cameron and Miller (2015) recommend doing so, at least when the number of groups is large.

12.6 Appendix: Deriving the Moulton Factor

For a general error structure \(V[\epsilon \,|\, \mathbf{X}] = \sigma^2 \Omega\), the true variance of the OLS estimator is \[ V[\hat{\beta} \,|\, \mathbf{X}] = \sigma^2 \underbrace{(\mathbf{X}^\top \mathbf{X})^{-1}}_{\text{bread}} \underbrace{\mathbf{X}^\top \Omega \mathbf{X}}_{\text{meat}} \underbrace{(\mathbf{X}^\top \mathbf{X})^{-1}}_{\text{bread}}. \] Recall that we are considering the model with \(N = 4\) and \[ \Omega = \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix}. \] In this case, the “meat” of the OLS variance is \[ \begin{aligned} \mathbf{X}^\top \Omega \mathbf{X} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_1 & x_2 & x_2 \end{bmatrix} \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= \begin{bmatrix} 1 + \rho & 1 + \rho & 1 + \rho & 1 + \rho \\ (1 + \rho) x_1 & (1 + \rho) x_1 & (1 + \rho) x_2 & (1 + \rho) x_2 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= (1 + \rho) \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_1 & x_2 & x_2 \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_1 \\ 1 & x_2 \\ 1 & x_2 \end{bmatrix} \\ &= (1 + \rho) \mathbf{X}^\top \mathbf{X}. \end{aligned} \] Therefore, we have \[ \begin{aligned} V[\hat{\beta} \,|\, \mathbf{X}] &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \Omega \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} [(1 + \rho) \mathbf{X}^\top \mathbf{X}] (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (1 + \rho) (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \]

12.7 Appendix: Implementation in R

The methods introduced here can be implemented via the plm package.

library("plm")

We will use the Produc dataset from plm, a riveting collection of economic statistics about the U.S. states from 1970 to 1986.

data(Produc, package = "plm")
head(Produc)
##     state year region     pcap     hwy   water    util       pc   gsp    emp
## 1 ALABAMA 1970      6 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5
## 2 ALABAMA 1971      6 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9
## 3 ALABAMA 1972      6 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3
## 4 ALABAMA 1973      6 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5
## 5 ALABAMA 1974      6 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8
## 6 ALABAMA 1975      6 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4
##   unemp
## 1   4.7
## 2   5.2
## 3   4.7
## 4   3.9
## 5   5.5
## 6   7.7

The functions in plm assume that your data are organized like Produc, with the grouping variable in the first column and the identification variable (time, in the case of panel data) in the second column. See the plm package vignette on how to get datasets not organized this way into line.

We will treat unemployment (unemp) as our response and public capital stock (pcap) and private capital stock (pc) as our covariates. As a benchmark, let’s use OLS.

fit_ols <- lm(unemp ~ pcap + pc,
              data = Produc)
summary(fit_ols)
## 
## Call:
## lm(formula = unemp ~ pcap + pc, data = Produc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5589 -1.6219 -0.3373  1.2129 11.5951 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.202e+00  1.077e-01  57.571   <2e-16 ***
## pcap        1.006e-05  5.515e-06   1.824   0.0686 .  
## pc          2.558e-06  2.563e-06   0.998   0.3186    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.196 on 813 degrees of freedom
## Multiple R-squared:  0.03516,    Adjusted R-squared:  0.03279 
## F-statistic: 14.81 on 2 and 813 DF,  p-value: 4.795e-07

The “pooling” estimator implemented by plm() ought to give us the same results.

fit_pooling <- plm(unemp ~ pcap + pc,
                   data = Produc,
                   model = "pooling")
summary(fit_pooling)
## Pooling Model
## 
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "pooling")
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -3.55895 -1.62192 -0.33733  1.21292 11.59512 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)    
## (Intercept) 6.2015e+00 1.0772e-01 57.5709  < 2e-16 ***
## pcap        1.0059e-05 5.5152e-06  1.8238  0.06855 .  
## pc          2.5582e-06 2.5633e-06  0.9980  0.31858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    4064.6
## Residual Sum of Squares: 3921.7
## R-Squared:      0.035162
## Adj. R-Squared: 0.032788
## F-statistic: 14.8141 on 2 and 813 DF, p-value: 4.7947e-07

We can obtain the cluster-robust variance matrix estimate via vcovHC(). Make sure to specify method = "arellano" so as to get the usual estimator. It is not entirely clear to me which of the various finite-sample adjustments corresponds to the defaults in Stata.

crvm_pooling <- vcovHC(fit_pooling,
                       method = "arellano",
                       type = "HC1")
summary(fit_pooling, vcov = crvm_pooling)
## Pooling Model
## 
## Note: Coefficient variance-covariance matrix supplied: crvm_pooling
## 
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "pooling")
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -3.55895 -1.62192 -0.33733  1.21292 11.59512 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)    
## (Intercept) 6.2015e+00 2.4475e-01 25.3385   <2e-16 ***
## pcap        1.0059e-05 1.2103e-05  0.8311   0.4062    
## pc          2.5582e-06 7.3044e-06  0.3502   0.7263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    4064.6
## Residual Sum of Squares: 3921.7
## R-Squared:      0.035162
## Adj. R-Squared: 0.032788
## F-statistic: 6.45144 on 2 and 47 DF, p-value: 0.0033441

Our \(t\)-statistics are cut in more than half, even though our variables have within-group variation (unlike Moulton (1990)’s example).

We can also use plm() to estimate a random-effects model.

fit_random <- plm(unemp ~ pcap + pc,
                  data = Produc,
                  model = "random")
summary(fit_random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "random")
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Effects:
##                 var std.dev share
## idiosyncratic 3.092   1.758  0.69
## individual    1.389   1.179  0.31
## theta: 0.6598
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -3.65953 -1.35357 -0.28627  1.01012  9.72087 
## 
## Coefficients:
##               Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept) 5.6656e+00 2.5521e-01 22.1996 < 2.2e-16 ***
## pcap        3.9178e-08 1.1236e-05  0.0035  0.997218    
## pc          1.6080e-05 4.5644e-06  3.5228  0.000427 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2916.2
## Residual Sum of Squares: 2792.6
## R-Squared:      0.042399
## Adj. R-Squared: 0.040043
## Chisq: 35.9967 on 2 DF, p-value: 1.5256e-08

And, finally, a fixed-effects model, which plm() calls the "within" model.

fit_fixed <- plm(unemp ~ pcap + pc,
                 data = Produc,
                 model = "within")
summary(fit_fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, model = "within")
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -3.80687 -1.16760 -0.24813  0.94681  8.39430 
## 
## Coefficients:
##        Estimate Std. Error t-value  Pr(>|t|)    
## pcap 2.2699e-04 3.1813e-05  7.1352 2.244e-12 ***
## pc   4.1783e-06 6.5669e-06  0.6363    0.5248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2765.9
## Residual Sum of Squares: 2368.5
## R-Squared:      0.14368
## Adj. R-Squared: 0.088906
## F-statistic: 64.2646 on 2 and 766 DF, p-value: < 2.22e-16

We can extract the fixed-effect estimates themselves via fixef().

fixef(fit_fixed)
##        ALABAMA        ARIZONA       ARKANSAS     CALIFORNIA       COLORADO 
##       3.763902       3.201232       5.020980     -24.351816       1.906885 
##    CONNECTICUT       DELAWARE        FLORIDA        GEORGIA          IDAHO 
##       2.222801       5.723806      -2.950385       0.090978       5.926215 
##       ILLINOIS        INDIANA           IOWA         KANSAS       KENTUCKY 
##      -7.483176       1.499181       1.176446       1.258062       2.487150 
##      LOUISIANA          MAINE       MARYLAND  MASSACHUSETTS       MICHIGAN 
##       2.394977       6.132823      -0.469424      -0.081008      -2.073261 
##      MINNESOTA    MISSISSIPPI       MISSOURI        MONTANA       NEBRASKA 
##      -0.913146       4.790411       0.576112       5.398842       1.033804 
##         NEVADA  NEW_HAMPSHIRE     NEW_JERSEY     NEW_MEXICO       NEW_YORK 
##       6.121925       3.867146      -0.475632       5.875722     -22.416937 
## NORTH_CAROLINA   NORTH_DAKOTA           OHIO       OKLAHOMA         OREGON 
##       0.588848       3.714982      -5.208327       2.212210       4.913671 
##   PENNSYLVANIA   RHODE_ISLAND SOUTH_CAROLINA   SOUTH_DAKOTA       TENNESSE 
##      -6.740590       6.045561       3.895325       2.942687       1.080415 
##          TEXAS           UTAH        VERMONT       VIRGINIA     WASHINGTON 
##     -11.161384       4.233540       5.624496      -0.960964       1.080757 
##  WEST_VIRGINIA      WISCONSIN        WYOMING 
##       6.940764       0.191971       3.798969

If we wanted to include time dummies as well, we could specify effect = "twoways" in the fitting function.

fit_fixed_2 <- plm(unemp ~ pcap + pc,
                   data = Produc,
                   effect = "twoways",
                   model = "within")
summary(fit_fixed_2)
## Twoways effects Within Model
## 
## Call:
## plm(formula = unemp ~ pcap + pc, data = Produc, effect = "twoways", 
##     model = "within")
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -3.330599 -0.802635 -0.044942  0.749901  6.021784 
## 
## Coefficients:
##         Estimate  Std. Error t-value  Pr(>|t|)    
## pcap  9.1022e-05  2.6202e-05  3.4738 0.0005426 ***
## pc   -1.2657e-05  5.0355e-06 -2.5135 0.0121641 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1256.4
## Residual Sum of Squares: 1235.8
## R-Squared:      0.016355
## Adj. R-Squared: -0.068894
## F-statistic: 6.23517 on 2 and 750 DF, p-value: 0.0020624
fixef(fit_fixed_2, effect = "time")
##   1970   1971   1972   1973   1974   1975   1976   1977   1978   1979   1980 
## 5.5480 6.2755 5.7642 5.1866 5.8250 8.5092 7.4508 6.9296 5.9312 5.8163 7.1746 
##   1981   1982   1983   1984   1985   1986 
## 7.6609 9.6937 9.6779 7.6333 7.4795 7.3435

phtest() implements the Hausman test. Remember that the null hypothesis is that both estimators are consistent; if we reject it, then the random effects estimator is inconsistent and we must use fixed effects.

phtest(fit_random, fit_fixed)
## 
##  Hausman Test
## 
## data:  unemp ~ pcap + pc
## chisq = 95.562, df = 2, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

12.8 Appendix: R-squared in plm

plm has annoyed me—and my students—by reporting incorrect (or, generously, “idiosyncratic”) \(R^2\) values for models with fixed effects. I’ve seen a few instances where the \(R^2\) apparently goes down after adding fixed effects to a model, which is mathematically impossible.

The issue is that plm reports the \(R^2\) as the proportion of variance explained by the substantive covariates, not including the fixed effects. This is inconsistent with usual practice. Here I illustrate how to calculate proper \(R^2\) values while using plm.

I’ll generate some grouped data where most of the action is at the group level. I want there to be a lot of variation across groups, but relatively little within groups. This is exactly the circumstance where the true \(R^2\) and the one reported by plm should differ the most.

library("tidyverse")

## Draw group-level intercepts
set.seed(94)
groups <- LETTERS[1:20]
group_means <- rnorm(length(groups), sd = 16)
names(group_means) <- groups

## Assign individual observations to groups and draw outcomes
n_obs <- 5000
g <- sample(groups, size = n_obs, replace = TRUE)
x1 <- rnorm(n_obs, sd = 1)
x2 <- rnorm(n_obs, sd = 1)
y <- group_means[g] + x1 - x2 + rnorm(n_obs, sd = 4)

## Arrange into format plm likes
grouped_data <- tibble(g, x1, x2, y) %>%
  arrange(g) %>%
  group_by(g) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  select(g, idx, y, x1, x2)

grouped_data
## # A tibble: 5,000 x 5
##   g       idx     y     x1      x2
##   <chr> <int> <dbl>  <dbl>   <dbl>
## 1 A         1  16.9 -0.932 -0.928 
## 2 A         2  21.2  0.319  0.0733
## 3 A         3  23.5  0.228  0.708 
## 4 A         4  13.9 -1.46   0.426 
## 5 A         5  22.4 -1.70   0.154 
## # … with 4,995 more rows

If we run a fixed effects model “by hand” by just plugging the group dummies into lm, we get the correct (high) \(R^2\).

fit_fe_lm <- lm(y ~ x1 + x2 + g, data = grouped_data)
summary(fit_fe_lm)
## 
## Call:
## lm(formula = y ~ x1 + x2 + g, data = grouped_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5872  -2.6934  -0.0011   2.6249  13.4342 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  18.52393    0.24933   74.294  < 2e-16 ***
## x1            1.00745    0.05689   17.710  < 2e-16 ***
## x2           -1.17708    0.05703  -20.639  < 2e-16 ***
## gB          -12.73795    0.35900  -35.481  < 2e-16 ***
## gC          -42.70627    0.35585 -120.011  < 2e-16 ***
## gD            1.68971    0.35226    4.797 1.66e-06 ***
## gE          -37.92942    0.35331 -107.354  < 2e-16 ***
## gF          -20.21707    0.35815  -56.449  < 2e-16 ***
## gG          -24.80747    0.34477  -71.953  < 2e-16 ***
## gH           -5.42873    0.34709  -15.641  < 2e-16 ***
## gI            0.40416    0.35664    1.133    0.257    
## gJ          -59.49246    0.35774 -166.300  < 2e-16 ***
## gK          -21.63007    0.35404  -61.095  < 2e-16 ***
## gL            1.68897    0.35440    4.766 1.94e-06 ***
## gM          -14.33295    0.35090  -40.846  < 2e-16 ***
## gN          -15.51534    0.35628  -43.548  < 2e-16 ***
## gO          -21.77582    0.35404  -61.506  < 2e-16 ***
## gP          -35.90792    0.34894 -102.906  < 2e-16 ***
## gQ          -38.35542    0.35735 -107.332  < 2e-16 ***
## gR          -47.74437    0.35412 -134.825  < 2e-16 ***
## gS          -26.16950    0.36139  -72.413  < 2e-16 ***
## gT            4.02471    0.34861   11.545  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.966 on 4978 degrees of freedom
## Multiple R-squared:  0.9536, Adjusted R-squared:  0.9534 
## F-statistic:  4876 on 21 and 4978 DF,  p-value: < 2.2e-16

plm gives us the same coefficients and standard errors on the substantive variables, x1 and x2, but reports a curiously different \(R^2\).

fit_fe_plm <- plm(y ~ x1 + x2, data = grouped_data, model = "within")
summary(fit_fe_plm)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1 + x2, data = grouped_data, model = "within")
## 
## Unbalanced Panel: n = 20, T = 230-278, N = 5000
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -13.5871727  -2.6933956  -0.0011038   2.6249454  13.4342204 
## 
## Coefficients:
##     Estimate Std. Error t-value  Pr(>|t|)    
## x1  1.007450   0.056886  17.710 < 2.2e-16 ***
## x2 -1.177083   0.057033 -20.639 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    89881
## Residual Sum of Squares: 78292
## R-Squared:      0.12893
## Adj. R-Squared: 0.12526
## F-statistic: 368.416 on 2 and 4978 DF, p-value: < 2.22e-16

In order to get an \(R^2\) commensurate with what lm reports, we need to calculate the total sum of squares and residual sum of squares by hand. Here’s how to do that.

## Calculate total sum of squares
TSS <- sum((y - mean(y))^2)

## Calculate residual sum of squares
RSS <- sum(residuals(fit_fe_plm)^2)

correct_rsq <- 1 - (RSS / TSS)
correct_rsq
## [1] 0.9536398

Notice that this is the same as what lm reports.

all.equal(summary(fit_fe_lm)$r.squared, correct_rsq)
## [1] TRUE

If the response variable (here, y) used for fitting is not the same as in your raw data—for example, if you are using the subset argument or if there is missing data—you will need to be cautious about how you calculate the TSS.


  1. Why did I leave out national characteristics, like inflation or the national unemployment rate? In a single-election study, these are constant. While they might shape variation across elections, they cannot be linked to variation within a single election.↩︎

  2. Everything we do here would go through if we allowed the number of observations to vary across groups, but the notation would get uglier. In the panel data context we say that the panels are balanced if the number of observations is the same for each group and unbalanced otherwise.↩︎

  3. When I taught this class last year, I simulated a dataset with a single covariate, \(N = 1{,}000\), and \(J = 1{,}000\), meaning the fixed effects regression entailed estimating \(1{,}001\) parameters from \(1{,}000{,}000\) observations. It took less than two minutes to run on my not-especially-powerful laptop. That said, should you ever have \(N\) another order of magnitude above that, the computational tricks given in the textbook treatments will once again be of use.↩︎