3 Working with Data
Some material in this chapter is adapted from notes Matt DiLorenzo wrote for the Spring 2016 session of PSCI 8357.
In your careers as social scientists, starting with your dissertation research—if not earlier—you will probably spend more time collecting, merging, and cleaning data than you will on statistical analysis. So it’s worth taking some time to learn how to do this well.
Best practices for data management can be summarized in a single sentence: Record and document everything you do to the data.
The first corollary of this principle is that raw data is sacrosanct. You should never edit raw data “in place.” Once you download the raw data file, that file should never change.7
In almost any non-trivial analysis, the “final” data—the format you plug into your analysis—will differ significantly from the raw data. It may consist of information merged from multiple sources. The variables may have been transformed, aggregated, or otherwise altered. The unit of observation may even differ from the original source. You must document every one of these changes, so that another researcher working from the exact same raw data will end up with the exact same final data.
The most sensible way to achieve this level of reproducibility is to do all of your data merging and cleaning in a script. In other words, no going into Excel and mucking around manually. Like any other piece of your analysis, your pipeline from raw data to final data should follow the principles of programming that we discussed last week.
Luckily for you,8 the tidyverse suite of R packages (including dplyr, tidyr, and others) makes it easy to script your “data pipeline.”
library("tidyverse")
3.1 Loading
The first step in working with data is to acquire some data. Depending on the nature of your research, you will be getting some or all of your data from sources available online. When you download data from online repositories, you should keep track of where you got it from. The best way to do so is—you guessed it—to script your data acquisition.
The R function download.file()
is the easiest way to download files from URLs from within R. Just specify where you’re getting the file from and where you want it to go. For the examples today, we’ll use an “untidied” version of the World Development Indicators data from the World Bank that I’ve posted to my website.
download.file(url = "http://bkenkel.com/files/untidy-data.csv",
destfile = "my-untidy-data.csv")
Once you’ve got the file stored locally, use the utilities from the readr package (part of tidyverse) to read it into R as a data frame.9 We have a CSV file, so we will use read_csv
. See help(package = "readr")
for other possibilities.
read_csv
(with an underscore) is typically better than R’s built-in read.csv
(with a dot), as it reads files faster and processes them more cleanly.
<- read_csv(file = "my-untidy-data.csv") untidy_data
##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## country = col_character(),
## gdp.2005 = col_double(),
## gdp.2006 = col_double(),
## gdp.2007 = col_double(),
## gdp.2008 = col_double(),
## pop.2005 = col_double(),
## pop.2006 = col_double(),
## pop.2007 = col_double(),
## pop.2008 = col_double(),
## unemp.2005 = col_double(),
## unemp.2006 = col_double(),
## unemp.2007 = col_double(),
## unemp.2008 = col_double()
## )
Remember that each column of a data frame might be a different type, or more formally class, of object. read_csv
and its ilk try to guess the type of data each column contains: character, integer, decimal number (“double” in programming-speak), or something else. The readout above tells you what guesses it made. If it gets something wrong—say, reading a column as numbers that ought to be characters—you can use the col_types
argument to set it straight.
FYI, you could also run read_csv()
directly on a URL, as in:
read_csv("http://bkenkel.com/files/untidy-data.csv")
However, in analyses intended for publication, it’s usually preferable to download and save the raw data. What’s stored at a URL might change or disappear, and you’ll need to have a hard copy of the raw data for replication purposes.
Now let’s take a look at the data we’ve just loaded in.
untidy_data
## # A tibble: 215 x 13
## country gdp.2005 gdp.2006 gdp.2007 gdp.2008 pop.2005 pop.2006 pop.2007
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 3.85 4.03 4.03 3.68 0.0789 0.0810 0.0827
## 2 AE 257. 283. 292. 301. 4.59 5.30 6.17
## 3 AF 9.16 9.66 11.0 11.4 25.7 26.4 27.1
## 4 AG 1.15 1.29 1.41 1.41 0.0815 0.0827 0.0840
## 5 AL 9.22 9.77 10.4 11.1 3.01 2.99 2.97
## # … with 210 more rows, and 5 more variables: pop.2008 <dbl>, unemp.2005 <dbl>,
## # unemp.2006 <dbl>, unemp.2007 <dbl>, unemp.2008 <dbl>
We have a country
variable giving country abbreviations. The other variables are numerical values: the country’s GDP in 2005, 2006, 2007, and 2008; then the same for population and unemployment. Let’s get this into a format we could use for analysis.
3.2 Tidying
Wickham (2014) outlines three qualities that make data “tidy”:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
For one thing, this means that whether a dataset is tidy or not depends—at least in part (some data collections are messy from any angle)—on the purpose it’s being analyzed for.
Each row of untidy_data
is a country. In observational studies in comparative politics and international relations, more commonly the unit of observation is the country-year.10 How can we take untidy_data
and easily make it into country-year data?
We’ll use the tidyr package (again, part of tidyverse) to clean up this data. The biggest problem right now is that each column, besides the country identifier, really encodes two pieces of information: the year of observation and the variable being observed. To deal with this, we’ll have to first transform the data from one untidy format to another. We’re going to use the pivot_longer()
function to make each row a country-year-variable.
What pivot_longer()
does is make a row for each entry from a set of columns. It’s probably easiest to understand it by seeing it in practice:
<- pivot_longer(untidy_data,
long_data cols = starts_with(c("gdp", "pop", "unemp")))
long_data
## # A tibble: 2,580 x 3
## country name value
## <chr> <chr> <dbl>
## 1 AD gdp.2005 3.85
## 2 AD gdp.2006 4.03
## 3 AD gdp.2007 4.03
## 4 AD gdp.2008 3.68
## 5 AD pop.2005 0.0789
## # … with 2,575 more rows
The cols
argument tells pivot_longer()
which columns to collect.
The code here specifies that it be all of the columns starting with “gdp,” “pop,” or “unemp.”
As it happens, we could have achieved the same result by using cols = ends_with(c("2005", "2006", "2007", "2008"))
.
To select all columns, we instead would have used cols = everything()
.
For additional ways of selecting columns, see the help page ?tidyselect::select_helpers
.
Now we have a new problem, which is that the name
column encodes two pieces of information: the variable and the year of its observation. tidyr provides the separate()
function to solve that, splitting a single variable into two.
<- separate(long_data,
long_data col = "name",
into = c("var", "year"))
long_data
## # A tibble: 2,580 x 4
## country var year value
## <chr> <chr> <chr> <dbl>
## 1 AD gdp 2005 3.85
## 2 AD gdp 2006 4.03
## 3 AD gdp 2007 4.03
## 4 AD gdp 2008 3.68
## 5 AD pop 2005 0.0789
## # … with 2,575 more rows
So now we have country-year-variable data, with the year and variable conveniently stored in different columns. To turn this into country-year data, we can use the pivot_wider()
function, which is like the inverse of pivot_longer()
. pivot_wider()
takes a key column and a value column, and turns each different key into a column of its own.
<- pivot_wider(long_data,
clean_data names_from = "var",
values_from = "value")
clean_data
## # A tibble: 860 x 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
When using pivot_wider()
on data that you didn’t previously pivot_longer()
, be sure to set the values_fill
argument to tell it how to fill in empty cells. A simple example:
test_data
## # A tibble: 3 x 3
## id name value
## <chr> <chr> <dbl>
## 1 brenton a 10
## 2 brenton b 20
## 3 patrick b 5
pivot_wider(test_data, names_from = "name", values_from = "value")
## # A tibble: 2 x 3
## id a b
## <chr> <dbl> <dbl>
## 1 brenton 10 20
## 2 patrick NA 5
pivot_wider(test_data, names_from = "name", values_from = "value", values_fill = 0)
## # A tibble: 2 x 3
## id a b
## <chr> <dbl> <dbl>
## 1 brenton 10 20
## 2 patrick 0 5
One more important note on tidyverse semantics. It includes a fabulous feature called the pipe, %>%
, which makes it easy to string together a truly mind-boggling number of commands.
In pipe syntax, x %>% f()
is equivalent to f(x)
. That seems like a wasteful and confusing way to write f(x)
, and it is. But if you want to string together a bunch of commands, it’s much easier to comprehend
%>%
x f() %>%
g() %>%
h() %>%
i()
than i(h(g(f(x))))
.
You can pass function arguments using the pipe too. For example, f(x, bear = "moose")
is equivalent to x %>% f(bear = "moose")
.
The key thing about the tidyverse functions is that each of them takes a data frame as its first argument, and returns a data frame as its output. This makes them highly amenable to piping. For example, we can combine all three steps of our tidying above with a single command, thanks to the pipe:
%>%
untidy_data pivot_longer(cols = starts_with(c("gdp", "pop", "unemp"))) %>%
separate(col = "name",
into = c("var", "year")) %>%
pivot_wider(names_from = "var",
values_from = "value")
## # A tibble: 860 x 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
Without the pipe, if we wanted to run all those commands together, we would have to write:
pivot_wider(separate(pivot_longer(untidy_data,
cols = starts_with(c("gdp", "pop", "unemp"))),
col = "name",
into = c("var", "year")),
names_from = "var",
values_from = "value")
Going back to last week’s dictum of “write programs for people”: even though this would have the same computational result, it’s much harder for a human (your future self or a coauthor) to parse what the code is trying to do.
3.3 Transforming and Aggregating
Tidying the data usually isn’t the end of the process. If you want to perform further calculations on the raw, that’s where the tools in dplyr (part of, you guessed it, the tidyverse) come in.
Perhaps the simplest dplyr function (or “verb,” as the R hipsters would say) is rename()
, which lets you rename columns.
%>%
clean_data rename(gross_domestic_product = gdp)
## # A tibble: 860 x 5
## country year gross_domestic_product pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
The dplyr functions, like the vast majority of R functions, do not modify their inputs. In other words, running rename()
on clean_data
will return a renamed copy of clean_data
, but won’t overwrite the original.
clean_data
## # A tibble: 860 x 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
If you wanted to make the change stick, you would have to run:
<- clean_data %>%
clean_data rename(gross_domestic_product = gdp)
select()
lets you keep a couple of columns and drop all the others. Or vice versa if you use minus signs.
%>%
clean_data select(country, gdp)
## # A tibble: 860 x 2
## country gdp
## <chr> <dbl>
## 1 AD 3.85
## 2 AD 4.03
## 3 AD 4.03
## 4 AD 3.68
## 5 AE 257.
## # … with 855 more rows
%>%
clean_data select(-pop)
## # A tibble: 860 x 4
## country year gdp unemp
## <chr> <chr> <dbl> <dbl>
## 1 AD 2005 3.85 NA
## 2 AD 2006 4.03 NA
## 3 AD 2007 4.03 NA
## 4 AD 2008 3.68 NA
## 5 AE 2005 257. 3.12
## # … with 855 more rows
As we saw above, you can also select columns based on how their names start or end, or with even more complex criteria.
See ?tidyselect::select_helpers
.
%>%
untidy_data select(country, starts_with("gdp"))
## # A tibble: 215 x 5
## country gdp.2005 gdp.2006 gdp.2007 gdp.2008
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AD 3.85 4.03 4.03 3.68
## 2 AE 257. 283. 292. 301.
## 3 AF 9.16 9.66 11.0 11.4
## 4 AG 1.15 1.29 1.41 1.41
## 5 AL 9.22 9.77 10.4 11.1
## # … with 210 more rows
%>%
untidy_data select(country, ends_with("2005"))
## # A tibble: 215 x 4
## country gdp.2005 pop.2005 unemp.2005
## <chr> <dbl> <dbl> <dbl>
## 1 AD 3.85 0.0789 NA
## 2 AE 257. 4.59 3.12
## 3 AF 9.16 25.7 2.94
## 4 AG 1.15 0.0815 NA
## 5 AL 9.22 3.01 16.9
## # … with 210 more rows
mutate()
lets you create new variables that are transformations of old ones.
%>%
clean_data mutate(gdppc = gdp / pop,
log_gdppc = log(gdppc))
## # A tibble: 860 x 7
## country year gdp pop unemp gdppc log_gdppc
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA 48.8 3.89
## 2 AD 2006 4.03 0.0810 NA 49.7 3.91
## 3 AD 2007 4.03 0.0827 NA 48.7 3.89
## 4 AD 2008 3.68 0.0839 NA 43.9 3.78
## 5 AE 2005 257. 4.59 3.12 56.1 4.03
## # … with 855 more rows
filter()
cuts down the data according to the logical condition(s) you specify.
%>%
clean_data filter(year == 2006)
## # A tibble: 215 x 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2006 4.03 0.0810 NA
## 2 AE 2006 283. 5.30 2.77
## 3 AF 2006 9.66 26.4 2.83
## 4 AG 2006 1.29 0.0827 NA
## 5 AL 2006 9.77 2.99 16.4
## # … with 210 more rows
summarise()
calculates summaries of the data. For example, let’s find the maximum unemployment rate.
%>%
clean_data summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 1 x 1
## max_unemp
## <dbl>
## 1 37.2
This seems sort of useless, until you combine it with the group_by()
function. If you group the data before summarise
-ing it, you’ll calculate a separate summary for each group. For example, let’s calculate the maximum unemployment rate for each year in the data.
%>%
clean_data group_by(year) %>%
summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 4 x 2
## year max_unemp
## * <chr> <dbl>
## 1 2005 37.2
## 2 2006 36.0
## 3 2007 34.9
## 4 2008 33.8
summarise()
produces a “smaller” data frame than the input—one row per group. If you want to do something similar, but preserving the structure of the original data, use mutate
in combination with group_by
.
%>%
clean_data filter(!is.na(unemp)) %>%
group_by(year) %>%
mutate(max_unemp = max(unemp),
unemp_over_max = unemp / max_unemp) %>%
select(country, year, contains("unemp"))
## # A tibble: 744 x 5
## # Groups: year [4]
## country year unemp max_unemp unemp_over_max
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AE 2005 3.12 37.2 0.0838
## 2 AE 2006 2.77 36.0 0.0770
## 3 AE 2007 2.47 34.9 0.0706
## 4 AE 2008 2.33 33.8 0.0691
## 5 AF 2005 2.94 37.2 0.0790
## # … with 739 more rows
This gives us back the original data, but with a max_unemp
variable recording the highest unemployment level that year. We can then calculate each individual country’s unemployment as a percentage of the maximum. Whether grouped mutate
or summarise
is better depends, of course, on the purpose and structure of your analysis.
3.4 Merging
Particularly if you do observational work, you’ll often be merging together data from multiple sources. The key to merging data from separate tables is to have consistent identifiers across tables. For example, if you run an experiment, you might have demographic data on each subject in one table, and each subject’s response to treatment in another table. Naturally, you’ll want to have a subject identifier that “links” the records across tables, as in the following hypothetical example.
subject_data
## # A tibble: 3 x 4
## id gender loves_bernie does_yoga
## <dbl> <chr> <chr> <chr>
## 1 1001 male yes no
## 2 1002 female no yes
## 3 1003 male no no
subject_response_data
## # A tibble: 6 x 3
## id treatment response
## <dbl> <chr> <chr>
## 1 1001 read_book sad
## 2 1001 watch_tv sad
## 3 1002 read_book happy
## 4 1002 watch_tv sad
## 5 1003 read_book sad
## 6 1003 watch_tv happy
Let’s practice merging data with our cleaned-up country-year data. We’ll take two datasets from my website: a country-level dataset with latitudes and longitudes, and a country-year–level dataset with inflation over time.
<- read_csv("http://bkenkel.com/files/latlong.csv")
latlong_data latlong_data
## # A tibble: 245 x 3
## country latitude longitude
## <chr> <dbl> <dbl>
## 1 AD 42.5 1.60
## 2 AE 23.4 53.8
## 3 AF 33.9 67.7
## 4 AG 17.1 -61.8
## 5 AI 18.2 -63.1
## # … with 240 more rows
<- read_csv("http://bkenkel.com/files/inflation.csv")
inflation_data inflation_data
## # A tibble: 1,070 x 3
## country year inflation
## <chr> <dbl> <dbl>
## 1 AD 2004 NA
## 2 AD 2005 NA
## 3 AD 2006 NA
## 4 AD 2007 NA
## 5 AD 2008 NA
## # … with 1,065 more rows
For your convenience, both of these datasets use the same two-letter country naming scheme as the original data. Unfortunately, out in the real world, data from different sources often use incommensurate naming schemes. Converting from one naming scheme to another is part of the data cleaning process, and it requires careful attention.
dplyr contains various _join()
functions for merging. Each of these take as arguments the two data frames to merge, plus the names of the identifier variables to merge them on. The one I use most often is left_join()
, which keeps every row from the first (“left”) data frame and merges in the columns from the second (“right”) data frame.
For example, let’s merge the latitude and longitude data for each country into clean_data
.
left_join(clean_data,
latlong_data,by = "country")
## # A tibble: 860 x 7
## country year gdp pop unemp latitude longitude
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA 42.5 1.60
## 2 AD 2006 4.03 0.0810 NA 42.5 1.60
## 3 AD 2007 4.03 0.0827 NA 42.5 1.60
## 4 AD 2008 3.68 0.0839 NA 42.5 1.60
## 5 AE 2005 257. 4.59 3.12 23.4 53.8
## # … with 855 more rows
Since latlong_data
is country-level, the value is the same for each year. So the merged data contains redundant information. This is one reason to store data observed at different levels in different tables—with redundant observations, it is easier to make errors yet harder to catch them and fix them.
We can also merge data when the identifier is stored across multiple columns, as in the case of our country-year data. But first, a technical note.11 You might notice that the year
column of clean_data
is labeled <chr>
, as in character data. Yet the year
column of inflation_data
is labeled <int>
, as in integer data. We can check that by running class()
on each respective column.
class(clean_data$year)
## [1] "character"
class(inflation_data$year)
## [1] "numeric"
From R’s perspective, the character string "1999"
is a very different thing than the integer number 1999
. Therefore, if we try to merge clean_data
and inflation_data
on the year
variable, it will throw an error.
left_join(clean_data,
inflation_data,by = c("country", "year"))
## Error: Can't join on `x$year` x `y$year` because of incompatible types.
## ℹ `x$year` is of type <character>>.
## ℹ `y$year` is of type <double>>.
To fix this, let’s use mutate()
to convert the year
column of clean_data
to an integer. We probably should have done this in the first place—after all, having the year encoded as a character string would have thrown off plotting functions, statistical functions, or anything else where it would be more natural to treat the year like a number.
<- mutate(clean_data,
clean_data year = as.integer(year))
clean_data
## # A tibble: 860 x 5
## country year gdp pop unemp
## <chr> <int> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
Looks the same as before, except with an important difference: year
is now labeled <int>
.
Now we can merge the two datasets together without issue. Notice how we use a vector in the by
argument to specify multiple columns to merge on.
left_join(clean_data,
inflation_data,by = c("country", "year"))
## # A tibble: 860 x 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 855 more rows
You might remember that inflation_data
contained some country-years not included in the original data (namely, observations from 2004). If we want the merged data to use the observations from inflation_data
rather than clean_data
, we can use the right_join()
function.
right_join(clean_data,
inflation_data,by = c("country", "year"))
## # A tibble: 1,070 x 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 1,065 more rows
One last common issue in merging is that the identifier variables have different names in the two datasets. If it’s inconvenient or infeasible to correct this by renaming the columns in one or the other, you can specify the by
argument as in the following example.
<- rename(inflation_data,
inflation_data the_country = country,
the_year = year)
inflation_data
## # A tibble: 1,070 x 3
## the_country the_year inflation
## <chr> <dbl> <dbl>
## 1 AD 2004 NA
## 2 AD 2005 NA
## 3 AD 2006 NA
## 4 AD 2007 NA
## 5 AD 2008 NA
## # … with 1,065 more rows
left_join(clean_data,
inflation_data,by = c("country" = "the_country", "year" = "the_year"))
## # A tibble: 860 x 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 855 more rows
Even if it’s data you collected yourself, that data should still have a “canonical” representation that never gets overwritten. See Leek (2015) for more on distributing your own data.↩︎
But not for me, because these tools didn’t exist when I was a PhD student. Yes, I am old.↩︎
More precisely, the readr functions produce output of class
"tbl_df"
(pronounced “tibble diff,” I’m told), which are like data frames but better. Seehelp(package = "tibble")
for what can be done withtbl_df
s.↩︎Yes, this kind of huge cross-country analysis has fallen out of favor as we’ve become more attentive to causal inference. But wait until later in the semester to worry about that!↩︎
This won’t be the case if you got
clean_data
by loading it in directly fromclean-data.csv
on my website, sinceread_csv()
will have correctly encodedyear
as an integer.↩︎