Main Task
In class this week, we discussed how replication with new data can help ameliorate the problems caused by publication bias and the statistical significance filter. But what can we do if new data isn’t forthcoming?
The literature on economic growth faced exactly this problem in the early 1990s. Dozens of different variables had been found to be statistically significant correlates of growth, depending on the exact specification one used. But we can’t wait another 50 years to replicate these studies with another half-century of growth data. To figure out which correlations were robust (i.e., which ones were not sensitive to a particular choice of specification), Levine and Renelt (1992) regressed growth on every conceivable combination of covariates (within limits). Sala-i-Martin (1997), in the brilliantly titled article “I Just Ran Two Million Regressions,” refined their approach.
In this assignment, you will only run a little more than 300,000 regressions.1 The application is a political science topic as prominent and bedeviling as that of growth rates in economics: the correlates of civil war onset. Your analysis will mirror that of Hegre and Sambanis (2006), who bring the millions-of-regressions approach to the civil war literature. You will be working with the data file hs-imputed.csv, a cleaned version of the Hegre and Sambanis (2006) data with missing values imputed.2
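As a quick check on the "little more than 300,000" figure, the count can be worked out from the design described in the rest of this assignment: three chosen variables, each combined with every set of three drawn from the other 87 concept variables. The names below are illustrative only.

```python
from math import comb

J = 3            # number of concept variables you choose to test
N_CONCEPT = 88   # total number of concept variables in the data

# For each chosen variable, the other 87 concept variables enter three at a time.
M = comb(N_CONCEPT - 1, 3)
total_regressions = J * M

print(M)                  # 105995
print(total_regressions)  # 317985
```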
The data is in country-year format, with the following variables:
- country and year: Self-explanatory.
- warstns: Indicator for whether a civil war began. This will be the response variable in all of your regressions.
- Three “core” covariates to include in every regression:
  - ln_popns: The natural logarithm of the country’s population.
  - ln_gdpen: The natural logarithm of the country’s GDP per capita.
  - pt8: \(2^{t/8}\), where \(t\) is the number of years the country has been at peace.
- Eighty-eight “concept” variables whose association with civil war onset you will be testing. Descriptions of each appear in Table 1 of Hegre and Sambanis (2006).
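One way to organize the columns before running anything is to separate the identifiers, the response, and the core covariates from the concept variables. The sketch below is only illustrative: it assumes that every column other than those named above is a concept variable, which you should verify against the actual file, and the header used here is a made-up stand-in (`concept_a` and `concept_b` are hypothetical names).

```python
ID_COLS = ["country", "year"]
RESPONSE = "warstns"
CORE = ["ln_popns", "ln_gdpen", "pt8"]


def concept_columns(header):
    """Return the columns that are not identifiers, the response, or core covariates.

    Assumption: everything left over is a concept variable.
    """
    reserved = set(ID_COLS + [RESPONSE] + CORE)
    return [name for name in header if name not in reserved]


# Made-up header standing in for the first row of hs-imputed.csv.
demo_header = ["country", "year", "warstns", "ln_popns", "ln_gdpen", "pt8",
               "ehet", "concept_a", "concept_b"]
demo_concepts = concept_columns(demo_header)
print(demo_concepts)
```

With the real file, the header could be read with the standard `csv` module, and the resulting list should have 88 entries if the assumption holds.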
You will select \(J = 3\) of the 88 “concept” variables to test the robustness of their association with civil war onset. One of these must be ehet, the ethnic heterogeneity index; the other two are up to you. For each variable, we want to estimate the average probability of \(\hat{\beta}_j > 0\), where the average is taken over the sampling distributions of each different regression specification. For each of these variables, \(j = 1, 2, 3\), you will do the following:
- Consider the set of \(M = \binom{87}{3}\) combinations of three covariates from the 87 other concept variables.
- For each \(m = 1, \ldots, M\), you will:
  - Regress warstns on your chosen variable \(X_j\), the three core covariates, and the \(m\)’th combination of other concept variables. (So every regression will have seven covariates, plus an intercept.)
  - Save the following quantities:
    - \(\hat{\beta}_{jm}\): the estimated coefficient on \(X_j\)
    - \(\hat{\sigma}_{jm}^2\): the estimated variance (squared standard error) of the coefficient on \(X_j\)
    - \(R_{jm}^2\): the \(R^2\) of the regression
- Using the results across each model, calculate the following:
  - A weight for each model, proportional to its \(R^2\): \[\omega_{jm} = \frac{R_{jm}^2}{\sum_{l=1}^M R_{jl}^2}\]
  - The average estimated coefficient, \[\bar{\beta}_j = \sum_{m = 1}^M \omega_{jm} \hat{\beta}_{jm}.\]
  - The average estimated variance of the coefficient, \[\bar{\sigma}_j^2 = \sum_{m=1}^M \omega_{jm} \hat{\sigma}_{jm}^2.\]
  - The so-called average \(p\)-value, \[\bar{p}_j = \sum_{m=1}^M \omega_{jm} \Phi(0 \,|\, \hat{\beta}_{jm}, \hat{\sigma}_{jm}^2),\] where \(\Phi(0 \,|\, \mu, \sigma^2)\) is the probability of drawing a value less than zero from a normal distribution with mean \(\mu\) and variance \(\sigma^2\). This is not the average of the individual \(p\)-values, since direction matters.
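The steps above can be sketched in Python roughly as follows. This is one possible layout, not a required implementation: the function names are made up, the standard errors are the classical homoskedastic OLS ones (an assumption, since the assignment does not specify), and the demonstration uses small simulated data with only six stand-in "other" variables so that it runs in well under a second.

```python
from itertools import combinations
from statistics import NormalDist

import numpy as np


def ols_summary(X, y):
    """OLS of y on X (X already includes a column of ones).

    Returns the coefficients, their estimated variances, and R^2.
    Variances assume homoskedastic errors (classical OLS formula).
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    rss = resid @ resid
    var_beta = (rss / (n - k)) * np.diag(xtx_inv)
    r2 = 1.0 - rss / np.sum((y - y.mean()) ** 2)
    return beta, var_beta, r2


def robustness(y, x_j, core, others, size=3):
    """Run one regression per `size`-column combination of `others`
    and return the R^2-weighted averages defined above."""
    n = len(y)
    betas, variances, r2s = [], [], []
    for combo in combinations(range(others.shape[1]), size):
        X = np.column_stack([np.ones(n), x_j, core, others[:, list(combo)]])
        beta, var_beta, r2 = ols_summary(X, y)
        betas.append(beta[1])          # column 1 holds x_j
        variances.append(var_beta[1])
        r2s.append(r2)
    betas, variances, r2s = map(np.array, (betas, variances, r2s))
    weights = r2s / r2s.sum()
    # Phi(0 | beta, sigma^2): normal probability mass below zero, per model.
    below_zero = np.array(
        [NormalDist(b, v ** 0.5).cdf(0.0) for b, v in zip(betas, variances)]
    )
    return {
        "beta_bar": float(weights @ betas),
        "var_bar": float(weights @ variances),
        "p_bar": float(weights @ below_zero),
        "n_models": len(betas),
    }


# Simulated stand-in data (NOT the Hegre and Sambanis data).
rng = np.random.default_rng(0)
n = 500
x_j = rng.normal(size=n)
core = rng.normal(size=(n, 3))
others = rng.normal(size=(n, 6))       # 6 columns, so C(6, 3) = 20 models
y = (0.5 * x_j + rng.normal(size=n) > 1.0).astype(float)  # binary response

result = robustness(y, x_j, core, others)
print(result)
```

With the real data, `others` would hold the 87 remaining concept variables, giving 105,995 regressions per chosen variable, so it may be worth timing a small subset of combinations first.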
In the last step, you’re essentially answering the following: Suppose you drew a model at random, where the probability of choosing each model is proportional to its \(R^2\). Then, suppose you drew a value of \(\beta_j\) from the estimated sampling distribution of the coefficients of this model. What is the probability that the value you drew is less than zero? A result close to 0 or 1 indicates a robust relationship. A result close to 0.5 indicates that the sign of the estimated coefficient is highly dependent on the particular specification—and thus the relationship is not robust.
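To see that this two-stage random draw really does correspond to the weighted sum in the formula for \(\bar{p}_j\), here is a small simulation. The three (coefficient, variance, \(R^2\)) triples are invented purely for illustration and are not results from the data.

```python
import random
from statistics import NormalDist

# Hypothetical (beta_hat, variance, R^2) for three models; not real estimates.
models = [(0.8, 0.25, 0.10), (-0.3, 0.16, 0.05), (0.1, 0.09, 0.15)]

total_r2 = sum(r2 for _, _, r2 in models)
weights = [r2 / total_r2 for _, _, r2 in models]

# Analytical version: weighted sum of normal probabilities below zero.
p_bar = sum(
    w * NormalDist(b, v ** 0.5).cdf(0.0)
    for w, (b, v, _) in zip(weights, models)
)

# Simulated version: pick a model with probability proportional to R^2,
# then draw a coefficient from that model's estimated sampling distribution.
rng = random.Random(42)
n_draws = 100_000
below = 0
for _ in range(n_draws):
    b, v, _r2 = rng.choices(models, weights=weights)[0]
    if rng.gauss(b, v ** 0.5) < 0:
        below += 1
p_sim = below / n_draws

print(round(p_bar, 3), round(p_sim, 3))
```

The two numbers should agree up to simulation noise. In this made-up example the result sits well away from both 0 and 1, which is the pattern you would read as a sign that is sensitive to specification.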
Interpret your results. Which of the covariates you chose is robustly associated with civil war onset? Which are not? What are the limitations of this approach—what might we be missing by analyzing robustness this way?
Your results will not be exactly the same as in Hegre and Sambanis (2006). First, you are assuming a linear model and using OLS, whereas they assume a logistic model and use its maximum likelihood estimator. Second, missing values in the data have been imputed, so you do not need to worry about combinations of covariates that make the sample size too small.