Main Task
You will be working with the dataset neumayer.dta, the replication data for Neumayer (2005). Each observation is a country-year. The response variable is lneuplusasyl, the natural logarithm of the number of people seeking asylum in Western Europe from the given country in the given year. The covariates for each country of origin include the following. I have tried my best to match variable names to the descriptions in the text of Neumayer (2005), but there may be some errors below.
- lngdp: logged GDP per capita, in 1997 USD (ln GDP)
- growthrate3years: average annual economic growth over the previous three years (GROWTH)
- lnpop: logged population (ln POPULATION)
- ecdis: a Freedom House index of discrimination against ethnic minorities, measured on a 0-4 scale (ECONDISCRIMINATION)
- free: sum of the two Freedom House indices of political rights and civil liberties, each ranging from 1 (most free) to 7 (least free) (AUTOCRACY)
- pts: average of two Political Terror Scales measuring human rights violations, ranging from 1 (best) to 5 (worst) (RIGHTSVIOLATION)
- sfallmax: magnitude score for civil war, ethnic war, or collapse of state authority (DOMWAR/STATEFAIL)
- genpoliticidemag: magnitude score for number of deaths from genocide and politicide (GEN/POLITICIDE)
- uppsalaexternalintensity: intensity of external conflicts, on an ordered categorical scale (0 = no conflict, 1 = 25-999 deaths that year in minor conflict, 2 = 25-999 deaths that year in conflict totaling 1000+ deaths across years, 3 = 1000+ deaths that year) (EXTERNALWAR)
- urban: urban population as a percentage of total population (%URBAN)
- sharepop1564: ages 15–64 as a percentage of total population (%POP15-64)
- food: net per capita food production (FOOD)
- sumdead: aggregate deaths from natural disasters (NATURALDISASTER)
- bnksum: total number of guerrilla and riot events (DISSIDENTVIOLENCE)
- distmineurope: logged minimum air distance between the country’s capital and the closest Western European capital (ln DISTANCE)
- christ: percentage of Christians (%CHRIST)
- colsec: number of years between 1900 and 1960 as a colony of any Western European destination country (COLONY)
- aidipolated: aid as percentage of GDP (AID)
- tradeipolated: trade as percentage of GDP (TRADE)
- arrivalsipolated: number of tourist arrivals (TOURISTS)
Like Neumayer, you should restrict your sample to developing countries, as indicated by the developing variable.
You will take the following steps:
You will randomly split the sample 50-50 into a training set and a validation set.
Using only the training set, you will run various models in search of an interesting finding. By “interesting” and “finding” I mean:
- The model must include at least one higher-order term.
- The higher-order term itself must be statistically significant, according to the nominal \(p\)-value.
- The test of the relevant composite hypothesis must also be statistically significant, again according to the nominal \(p\)-value.
Select a model from the training set. Present the estimates and nominal \(p\)-values from the model you fished for.
Run the exact same model on the validation set. How do your estimates and inferences change? Note: You should only do this after you have made a final decision about what model you want to present. In other words, unlike the training set, you should only use the validation set once.
Answer the following questions:
- From a statistical and scientific standpoint, what are the advantages and disadvantages of this sample-splitting exercise?
- Should we trust the results of the hypothesis tests you report from the training set? What about the ones from the validation set?
- Would it be ethical to “fish” for a finding in this way, but only report the validation set results? Why or why not?
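The restriction to developing countries and the 50-50 split described in the steps above might be sketched as follows. This is only an illustration on made-up rows, not the actual contents of neumayer.dta: the helper name `split_half`, the seed value, the toy country labels, and the assumption that `developing` is coded 0/1 are all hypothetical. Only the standard library is used.

```python
import random

def split_half(rows, seed=2005):
    """Randomly split a list of observations into two halves.

    Returns (training, validation). The seed is fixed so the split
    can be reproduced exactly; its value is arbitrary.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = len(indices) // 2
    train_idx = sorted(indices[:cut])   # keep original row order within each half
    valid_idx = sorted(indices[cut:])
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

# Toy stand-in for the country-year data (hypothetical values).
rows = [
    {"country": f"C{i % 5}", "year": 1982 + i // 5, "developing": int(i % 5 != 0)}
    for i in range(40)
]

# Keep only developing countries (assumes the indicator is coded 0/1).
developing = [r for r in rows if r["developing"] == 1]

train, valid = split_half(developing)
```

Fixing the seed matters here: if the split were redrawn every time the script ran, the validation half would no longer be untouched data.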
Your model need not include all of the covariates. You should be thoughtful about what you choose to control for.
You may, for the purposes of this exercise, assume the errors are independent and identically distributed across observations.
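Under the i.i.d. assumption, the classical OLS t-test and F-test apply directly. The sketch below shows one way to obtain both nominal \(p\)-values for a model with a squared term. It runs on synthetic data, not on neumayer.dta; the function names `ols_iid` and `joint_f_test` are hypothetical, and it assumes NumPy and SciPy are available. It also assumes that the "relevant composite hypothesis" for a quadratic in x is that the coefficients on x and x² are jointly zero; for an interaction term, the analogous test would cover every term involving the variable of interest.

```python
import numpy as np
from scipy import stats

def ols_iid(X, y):
    """OLS with classical (i.i.d.-error) standard errors.

    Returns coefficients, standard errors, two-sided t-test
    p-values, and the residual sum of squares.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    sigma2 = rss / (n - k)                      # unbiased error-variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    p = 2 * stats.t.sf(np.abs(beta / se), df=n - k)
    return beta, se, p, rss

def joint_f_test(X, y, drop):
    """F-test that the coefficients on the columns in `drop` are all zero."""
    n, k = X.shape
    keep = [j for j in range(k) if j not in drop]
    rss_full = ols_iid(X, y)[3]
    rss_restricted = ols_iid(X[:, keep], y)[3]
    q = len(drop)
    f_stat = ((rss_restricted - rss_full) / q) / (rss_full / (n - k))
    return f_stat, stats.f.sf(f_stat, q, n - k)

# Synthetic stand-in data with a true quadratic effect of x.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)                          # a control variable
y = 1.0 + 0.5 * x - 0.8 * x**2 + 0.3 * z + rng.normal(size=n)

# Columns: intercept, x, x^2, z
X = np.column_stack([np.ones(n), x, x**2, z])

beta, se, p, _ = ols_iid(X, y)                  # p[2] is the higher-order term
f_stat, f_p = joint_f_test(X, y, drop=[1, 2])   # x and x^2 jointly zero
```

The same two calls, with the design matrix rebuilt from the validation half, would give the validation-set estimates. Because those functions take the data as arguments, nothing about the model specification can drift between the two runs.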