3 Working with Data
Some material in this chapter is adapted from notes Matt DiLorenzo wrote for the Spring 2016 session of PSCI 8357.
In your careers as social scientists, starting with your dissertation research—if not earlier—you will probably spend more time collecting, merging, and cleaning data than you will on statistical analysis. So it’s worth taking some time to learn how to do this well.
Best practices for data management can be summarized in a single sentence: Record and document everything you do to the data.
The first corollary of this principle is that raw data is sacrosanct. You should never edit raw data “in place”. Once you download the raw data file, that file should never change.5
In almost any non-trivial analysis, the “final” data—the format you plug into your analysis—will differ significantly from the raw data. It may consist of information merged from multiple sources. The variables may have been transformed, aggregated, or otherwise altered. The unit of observation may even differ from the original source. You must document every one of these changes, so that another researcher working from the exact same raw data will end up with the exact same final data.
The most sensible way to achieve this level of reproducibility is to do all of your data merging and cleaning in a script. In other words, no going into Excel and mucking around manually. Like any other piece of your analysis, your pipeline from raw data to final data should follow the principles of programming that we discussed last week.
Luckily for you,6 the tidyverse suite of R packages (including dplyr, tidyr, and others) makes it easy to script your “data pipeline”.
3.1 Loading
The first step in working with data is to acquire some data. Depending on the nature of your research, you will be getting some or all of your data from sources available online. When you download data from online repositories, you should keep track of where you got it from. The best way to do so is—you guessed it—to script your data acquisition.
The R function download.file()
is the easiest way to download files from URLs from within R. Just specify where you’re getting the file from and where you want it to go. For the examples today, we’ll use an “untidied” version of the World Development Indicators data from the World Bank that I’ve posted to my website.
Once you’ve got the file stored locally, use the utilities from the readr package (part of tidyverse) to read it into R as a data frame.7 We have a CSV file, so we will use read_csv
. See help(package = "readr")
for other possibilities.
read_csv
(with an underscore) is typically better than R’s built-in read.csv
(with a dot), as it reads files faster and processes them more cleanly.
## Rows: 215 Columns: 13
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (12): gdp.2005, gdp.2006, gdp.2007, gdp.2008, pop.2005, pop.2006, pop.20...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Remember that each column of a data frame might be a different type, or more formally class, of object. read_csv
and its ilk try to guess the type of data each column contains: character, integer, decimal number (“double” in programming-speak), or something else. The readout above tells you what guesses it made. If it gets something wrong—say, reading a column as numbers that ought to be characters—you can use the col_types
argument to set it straight.
FYI, you could also run read_csv()
directly on a URL, as in:
However, in analyses intended for publication, it’s usually preferable to download and save the raw data. What’s stored at a URL might change or disappear, and you’ll need to have a hard copy of the raw data for replication purposes.
Now let’s take a look at the data we’ve just loaded in.
## # A tibble: 215 × 13
## country gdp.2005 gdp.2006 gdp.2007 gdp.2008 pop.2005 pop.2006 pop.2007 pop.2…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 3.85 4.03 4.03 3.68 0.0789 0.0810 0.0827 0.0839
## 2 AE 257. 283. 292. 301. 4.59 5.30 6.17 7.09
## 3 AF 9.16 9.66 11.0 11.4 25.7 26.4 27.1 27.7
## 4 AG 1.15 1.29 1.41 1.41 0.0815 0.0827 0.0840 0.0854
## 5 AL 9.22 9.77 10.4 11.1 3.01 2.99 2.97 2.95
## # … with 210 more rows, 4 more variables: unemp.2005 <dbl>, unemp.2006 <dbl>,
## # unemp.2007 <dbl>, unemp.2008 <dbl>, and abbreviated variable name ¹pop.2008
We have a country
variable giving country abbreviations. The other variables are numerical values: the country’s GDP in 2005, 2006, 2007, and 2008; then the same for population and unemployment. Let’s get this into a format we could use for analysis.
3.2 Tidying
Wickham (2014) outlines three qualities that make data “tidy”:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
For one thing, this means that whether a dataset is tidy or not depends—at least in part (some data collections are messy from any angle)—on the purpose it’s being analyzed for.
Each row of untidy_data
is a country. In observational studies in comparative politics and international relations, more commonly the unit of observation is the country-year.8 How can we take untidy_data
and easily make it into country-year data?
We’ll use the tidyr package (again, part of tidyverse) to clean up this data. The biggest problem right now is that each column, besides the country identifier, really encodes two pieces of information: the year of observation and the variable being observed. To deal with this, we’ll have to first transform the data from one untidy format to another. We’re going to use the pivot_longer()
function to make each row a country-year-variable.
What pivot_longer()
does is make a row for each entry from a set of columns. It’s probably easiest to understand it by seeing it in practice:
## # A tibble: 2,580 × 3
## country name value
## <chr> <chr> <dbl>
## 1 AD gdp.2005 3.85
## 2 AD gdp.2006 4.03
## 3 AD gdp.2007 4.03
## 4 AD gdp.2008 3.68
## 5 AD pop.2005 0.0789
## # … with 2,575 more rows
The cols
argument tells pivot_longer()
which columns to collect.
The code here specifies that it be all of the columns starting with “gdp”, “pop”, or “unemp”.
As it happens, we could have achieved the same result by using cols = ends_with(c("2005", "2006", "2007", "2008"))
.
To select all columns, we instead would have used cols = everything()
.
For additional ways of selecting columns, see the help page ?tidyselect::select_helpers
.
Now we have a new problem, which is that the name
column encodes two pieces of information: the variable and the year of its observation. tidyr provides the separate()
function to solve that, splitting a single variable into two.
## # A tibble: 2,580 × 4
## country var year value
## <chr> <chr> <chr> <dbl>
## 1 AD gdp 2005 3.85
## 2 AD gdp 2006 4.03
## 3 AD gdp 2007 4.03
## 4 AD gdp 2008 3.68
## 5 AD pop 2005 0.0789
## # … with 2,575 more rows
So now we have country-year-variable data, with the year and variable conveniently stored in different columns. To turn this into country-year data, we can use the pivot_wider()
function, which is like the inverse of pivot_longer()
. pivot_wider()
takes a key column and a value column, and turns each different key into a column of its own.
## # A tibble: 860 × 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
When using pivot_wider()
on data that you didn’t previously pivot_longer()
, be sure to set the values_fill
argument to tell it how to fill in empty cells. A simple example:
## # A tibble: 3 × 3
## id name value
## <chr> <chr> <dbl>
## 1 brenton a 10
## 2 brenton b 20
## 3 patrick b 5
## # A tibble: 2 × 3
## id a b
## <chr> <dbl> <dbl>
## 1 brenton 10 20
## 2 patrick NA 5
## # A tibble: 2 × 3
## id a b
## <chr> <dbl> <dbl>
## 1 brenton 10 20
## 2 patrick 0 5
One more important note on tidyverse semantics.
It is built to work with R’s pipe, |>
, which makes it easy to string together a truly mind-boggling number of commands.9
In pipe syntax, x |> f()
is equivalent to f(x)
. That seems like a wasteful and confusing way to write f(x)
, and it is. But if you want to string together a bunch of commands, it’s much easier to comprehend
than i(h(g(f(x))))
.
You can pass function arguments using the pipe too. For example, f(x, bear = "moose")
is equivalent to x |> f(bear = "moose")
.
The key thing about the tidyverse functions is that each of them takes a data frame as its first argument, and returns a data frame as its output. This makes them highly amenable to piping. For example, we can combine all three steps of our tidying above with a single command, thanks to the pipe:
untidy_data |>
pivot_longer(cols = starts_with(c("gdp", "pop", "unemp"))) |>
separate(col = "name",
into = c("var", "year")) |>
pivot_wider(names_from = "var",
values_from = "value")
## # A tibble: 860 × 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
Without the pipe, if we wanted to run all those commands together, we would have to write:
pivot_wider(separate(pivot_longer(untidy_data,
cols = starts_with(c("gdp", "pop", "unemp"))),
col = "name",
into = c("var", "year")),
names_from = "var",
values_from = "value")
Going back to last week’s dictum of “write programs for people”: even though this would have the same computational result, it’s much harder for a human (your future self or a coauthor) to parse what the code is trying to do.
3.3 Transforming and Aggregating
Tidying the data usually isn’t the end of the process. If you want to perform further calculations on the raw, that’s where the tools in dplyr (part of, you guessed it, the tidyverse) come in.
Perhaps the simplest dplyr function (or “verb”, as the R hipsters would say) is rename()
, which lets you rename columns.
## # A tibble: 860 × 5
## country year gross_domestic_product pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
The dplyr functions, like the vast majority of R functions, do not modify their inputs. In other words, running rename()
on clean_data
will return a renamed copy of clean_data
, but won’t overwrite the original.
## # A tibble: 860 × 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
If you wanted to make the change stick, you would have to run:
select()
lets you keep a couple of columns and drop all the others. Or vice versa if you use minus signs.
## # A tibble: 860 × 2
## country gdp
## <chr> <dbl>
## 1 AD 3.85
## 2 AD 4.03
## 3 AD 4.03
## 4 AD 3.68
## 5 AE 257.
## # … with 855 more rows
## # A tibble: 860 × 4
## country year gdp unemp
## <chr> <chr> <dbl> <dbl>
## 1 AD 2005 3.85 NA
## 2 AD 2006 4.03 NA
## 3 AD 2007 4.03 NA
## 4 AD 2008 3.68 NA
## 5 AE 2005 257. 3.12
## # … with 855 more rows
As we saw above, you can also select columns based on how their names start or end, or with even more complex criteria.
See ?tidyselect::select_helpers
.
## # A tibble: 215 × 5
## country gdp.2005 gdp.2006 gdp.2007 gdp.2008
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AD 3.85 4.03 4.03 3.68
## 2 AE 257. 283. 292. 301.
## 3 AF 9.16 9.66 11.0 11.4
## 4 AG 1.15 1.29 1.41 1.41
## 5 AL 9.22 9.77 10.4 11.1
## # … with 210 more rows
## # A tibble: 215 × 4
## country gdp.2005 pop.2005 unemp.2005
## <chr> <dbl> <dbl> <dbl>
## 1 AD 3.85 0.0789 NA
## 2 AE 257. 4.59 3.12
## 3 AF 9.16 25.7 2.94
## 4 AG 1.15 0.0815 NA
## 5 AL 9.22 3.01 16.9
## # … with 210 more rows
mutate()
lets you create new variables that are transformations of old ones.
## # A tibble: 860 × 7
## country year gdp pop unemp gdppc log_gdppc
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA 48.8 3.89
## 2 AD 2006 4.03 0.0810 NA 49.7 3.91
## 3 AD 2007 4.03 0.0827 NA 48.7 3.89
## 4 AD 2008 3.68 0.0839 NA 43.9 3.78
## 5 AE 2005 257. 4.59 3.12 56.1 4.03
## # … with 855 more rows
filter()
cuts down the data according to the logical condition(s) you specify.
## # A tibble: 215 × 5
## country year gdp pop unemp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AD 2006 4.03 0.0810 NA
## 2 AE 2006 283. 5.30 2.77
## 3 AF 2006 9.66 26.4 2.83
## 4 AG 2006 1.29 0.0827 NA
## 5 AL 2006 9.77 2.99 16.4
## # … with 210 more rows
summarise()
calculates summaries of the data. For example, let’s find the maximum unemployment rate.
## # A tibble: 1 × 1
## max_unemp
## <dbl>
## 1 37.2
This seems sort of useless, until you combine it with the group_by()
function. If you group the data before summarise
-ing it, you’ll calculate a separate summary for each group. For example, let’s calculate the maximum unemployment rate for each year in the data.
## # A tibble: 4 × 2
## year max_unemp
## <chr> <dbl>
## 1 2005 37.2
## 2 2006 36.0
## 3 2007 34.9
## 4 2008 33.8
summarise()
produces a “smaller” data frame than the input—one row per group. If you want to do something similar, but preserving the structure of the original data, use mutate
in combination with group_by
.
clean_data |>
filter(!is.na(unemp)) |>
group_by(year) |>
mutate(max_unemp = max(unemp),
unemp_over_max = unemp / max_unemp) |>
select(country, year, contains("unemp"))
## # A tibble: 744 × 5
## # Groups: year [4]
## country year unemp max_unemp unemp_over_max
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AE 2005 3.12 37.2 0.0838
## 2 AE 2006 2.77 36.0 0.0770
## 3 AE 2007 2.47 34.9 0.0706
## 4 AE 2008 2.33 33.8 0.0691
## 5 AF 2005 2.94 37.2 0.0790
## # … with 739 more rows
This gives us back the original data, but with a max_unemp
variable recording the highest unemployment level that year. We can then calculate each individual country’s unemployment as a percentage of the maximum. Whether grouped mutate
or summarise
is better depends, of course, on the purpose and structure of your analysis.
3.4 Merging
Particularly if you do observational work, you’ll often be merging together data from multiple sources. The key to merging data from separate tables is to have consistent identifiers across tables. For example, if you run an experiment, you might have demographic data on each subject in one table, and each subject’s response to treatment in another table. Naturally, you’ll want to have a subject identifier that “links” the records across tables, as in the following hypothetical example.
## # A tibble: 3 × 4
## id gender_id loves_bernie does_yoga
## <dbl> <chr> <chr> <chr>
## 1 1001 male yes no
## 2 1002 female no yes
## 3 1003 male no no
## # A tibble: 6 × 3
## id treatment response
## <dbl> <chr> <chr>
## 1 1001 read_book sad
## 2 1001 watch_tv sad
## 3 1002 read_book happy
## 4 1002 watch_tv sad
## 5 1003 read_book sad
## 6 1003 watch_tv happy
Let’s practice merging data with our cleaned-up country-year data. We’ll take two datasets from my website: a country-level dataset with latitudes and longitudes, and a country-year–level dataset with inflation over time.
## # A tibble: 245 × 3
## country latitude longitude
## <chr> <dbl> <dbl>
## 1 AD 42.5 1.60
## 2 AE 23.4 53.8
## 3 AF 33.9 67.7
## 4 AG 17.1 -61.8
## 5 AI 18.2 -63.1
## # … with 240 more rows
## # A tibble: 1,070 × 3
## country year inflation
## <chr> <dbl> <dbl>
## 1 AD 2004 NA
## 2 AD 2005 NA
## 3 AD 2006 NA
## 4 AD 2007 NA
## 5 AD 2008 NA
## # … with 1,065 more rows
For your convenience, both of these datasets use the same two-letter country naming scheme as the original data. Unfortunately, out in the real world, data from different sources often use incommensurate naming schemes. Converting from one naming scheme to another is part of the data cleaning process, and it requires careful attention.
tidyverse contains various _join()
functions for merging. Each of these take as arguments the two data frames to merge, plus the names of the identifier variables to merge them on. The one I use most often is left_join()
, which keeps every row from the first (“left”) data frame and merges in the columns from the second (“right”) data frame.
For example, let’s merge the latitude and longitude data for each country into clean_data
.
## # A tibble: 860 × 7
## country year gdp pop unemp latitude longitude
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA 42.5 1.60
## 2 AD 2006 4.03 0.0810 NA 42.5 1.60
## 3 AD 2007 4.03 0.0827 NA 42.5 1.60
## 4 AD 2008 3.68 0.0839 NA 42.5 1.60
## 5 AE 2005 257. 4.59 3.12 23.4 53.8
## # … with 855 more rows
Since latlong_data
is country-level, the value is the same for each year. So the merged data contains redundant information. This is one reason to store data observed at different levels in different tables—with redundant observations, it is easier to make errors yet harder to catch them and fix them.
We can also merge data when the identifier is stored across multiple columns, as in the case of our country-year data. But first, a technical note.10 You might notice that the year
column of clean_data
is labeled <chr>
, as in character data. Yet the year
column of inflation_data
is labeled <int>
, as in integer data. We can check that by running class()
on each respective column.
## [1] "character"
## [1] "numeric"
From R’s perspective, the character string "1999"
is a very different thing than the integer number 1999
. Therefore, if we try to merge clean_data
and inflation_data
on the year
variable, it will throw an error.
## Error in `left_join()`:
## ! Can't join `x$year` with `y$year` due to incompatible types.
## ℹ `x$year` is a <character>.
## ℹ `y$year` is a <double>.
To fix this, let’s use mutate()
to convert the year
column of clean_data
to an integer. We probably should have done this in the first place—after all, having the year encoded as a character string would have thrown off plotting functions, statistical functions, or anything else where it would be more natural to treat the year like a number.
## # A tibble: 860 × 5
## country year gdp pop unemp
## <chr> <int> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA
## 2 AD 2006 4.03 0.0810 NA
## 3 AD 2007 4.03 0.0827 NA
## 4 AD 2008 3.68 0.0839 NA
## 5 AE 2005 257. 4.59 3.12
## # … with 855 more rows
Looks the same as before, except with an important difference: year
is now labeled <int>
.
Now we can merge the two datasets together without issue. Notice how we use a vector in the by
argument to specify multiple columns to merge on.
## # A tibble: 860 × 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 855 more rows
You might remember that inflation_data
contained some country-years not included in the original data (namely, observations from 2004). If we want the merged data to use the observations from inflation_data
rather than clean_data
, we can use the right_join()
function.
## # A tibble: 1,070 × 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 1,065 more rows
One last common issue in merging is that the identifier variables have different names in the two datasets. If it’s inconvenient or infeasible to correct this by renaming the columns in one or the other, you can specify the by
argument as in the following example.
## # A tibble: 1,070 × 3
## the_country the_year inflation
## <chr> <dbl> <dbl>
## 1 AD 2004 NA
## 2 AD 2005 NA
## 3 AD 2006 NA
## 4 AD 2007 NA
## 5 AD 2008 NA
## # … with 1,065 more rows
## # A tibble: 860 × 6
## country year gdp pop unemp inflation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 2005 3.85 0.0789 NA NA
## 2 AD 2006 4.03 0.0810 NA NA
## 3 AD 2007 4.03 0.0827 NA NA
## 4 AD 2008 3.68 0.0839 NA NA
## 5 AE 2005 257. 4.59 3.12 NA
## # … with 855 more rows
References
Even if it’s data you collected yourself, that data should still have a “canonical” representation that never gets overwritten. See Leek (2015) for more on distributing your own data.↩︎
But not for me, because these tools didn’t exist when I was a PhD student. Yes, I am old.↩︎
More precisely, the readr functions produce output of class
"tbl_df"
(pronounced “tibble diff,” I’m told), which are like data frames but better. Seehelp(package = "tibble")
for what can be done withtbl_df
s.↩︎Yes, this kind of huge cross-country analysis has fallen out of favor as we’ve become more attentive to causal inference. But wait until later in the semester to worry about that!↩︎
Before the pipe was built directly into R, tidyverse included its own version,
%>%
. So if you look on Google or Stack Overflow for code help (i.e., if you do any nontrivial coding) you might find older code—including not-yet-updated chapters of this book!—using%>%
instead of|>
. Except for a few edge cases that we won’t get into, both work identically.↩︎This won’t be the case if you got
clean_data
by loading it in directly fromclean-data.csv
on my website, sinceread_csv()
will have correctly encodedyear
as an integer.↩︎