3 Working with Data

Some material in this chapter is adapted from notes Matt DiLorenzo wrote for the Spring 2016 session of PSCI 8357.

Let me repeat something I said last week. In your careers as social scientists, starting with your dissertation research—if not earlier—you will probably spend more time collecting, merging, and cleaning data than you will on statistical analysis. So it’s worth taking some time to learn how to do this well.

Best practices for data management can be summarized in a single sentence: Record and document everything you do to the data.

The first corollary of this principle is that raw data is sacrosanct. You should never edit raw data “in place”. Once you download the raw data file, that file should never change.4

In almost any non-trivial analysis, the “final” data—the format you plug into your analysis—will differ significantly from the raw data. It may consist of information merged from multiple sources. The variables may have been transformed, aggregated, or otherwise altered. The unit of observation may even differ from the original source. You must document every one of these changes, so that another researcher working from the exact same raw data will end up with the exact same final data.

The most sensible way to achieve this level of reproducibility is to do all of your data merging and cleaning in a script. In other words, no going into Excel and mucking around manually. Like any other piece of your analysis, your pipeline from raw data to final data should follow the principles of programming that we discussed last week.

Luckily for you,5 the tidyverse suite of R packages (including dplyr, tidyr, and others) makes it easy to script your “data pipeline”. We’ll begin by loading the package.

library("tidyverse")

3.1 Loading

The first step in working with data is to acquire some data. Depending on the nature of your research, you will be getting some or all of your data from sources available online. When you download data from online repositories, you should keep track of where you got it from. The best way to do so is—you guessed it—to script your data acquisition.

The R function download.file() is the easiest way to download files from URLs within R. Just specify where you’re getting the file from and where you want it to go. For the examples today, we’ll use an “untidied” version of the World Development Indicators data from the World Bank that I’ve posted to my website.

download.file(url = "http://bkenkel.com/data/untidy-data.csv",
              destfile = "my-untidy-data.csv")

Once you’ve got the file stored locally, use the utilities from the readr package (part of tidyverse) to read it into R as a data frame.6 We have a CSV file, so we will use read_csv. See help(package = "readr") for other possibilities.

untidy_data <- read_csv(file = "my-untidy-data.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   gdp.2005 = col_double(),
##   gdp.2006 = col_double(),
##   gdp.2007 = col_double(),
##   gdp.2008 = col_double(),
##   pop.2005 = col_double(),
##   pop.2006 = col_double(),
##   pop.2007 = col_double(),
##   pop.2008 = col_double(),
##   unemp.2005 = col_double(),
##   unemp.2006 = col_double(),
##   unemp.2007 = col_double(),
##   unemp.2008 = col_double()
## )

Remember that each column of a data frame might be a different type, or more formally class, of object. read_csv and its ilk try to guess the type of data each column contains: character, integer, decimal number (“double” in programming-speak), or something else. The readout above tells you what guesses it made. If it gets something wrong—say, reading a column as numbers that ought to be characters—you can use the col_types argument to set it straight.
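As a minimal, self-contained sketch of overriding those guesses (toy inline data, not the WDI file): suppose an ID column holds codes with leading zeros, which read_csv would otherwise guess to be numeric. A cols() specification passed to col_types forces the types you want.

```r
library(readr)

# Toy data passed as a literal string; read_csv treats text containing
# a newline as inline data rather than a file path.
dat <- read_csv("id,value\n01,3.5\n02,4.2",
                col_types = cols(id = col_character(),
                                 value = col_double()))

dat$id  # "01" "02" -- leading zeros preserved
```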

FYI, you could also run read_csv() directly on a URL, as in:

read_csv("http://bkenkel.com/data/untidy-data.csv")

However, in analyses intended for publication, it’s usually preferable to download and save the raw data. What’s stored at a URL might change or disappear, and you’ll need to have a hard copy of the raw data for replication purposes.
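One way to honor both goals in a script is to download only when no local copy exists, so re-running the pipeline never re-fetches or silently overwrites the raw file. The helper below is my own illustration, not something from these notes.

```r
# Fetch a file only if we don't already have a local copy. Keeping this
# call in the script documents where the raw data came from, without
# re-downloading on every run. (Illustrative helper, not from the notes.)
download_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile)
  }
  invisible(destfile)
}

# Usage: download_once("http://bkenkel.com/data/untidy-data.csv",
#                      "my-untidy-data.csv")
```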

Now let’s take a look at the data we’ve just loaded in.

head(untidy_data)
## # A tibble: 6 × 13
##   country gdp.2005 gdp.2006 gdp.2007 gdp.2008  pop.2005  pop.2006
##     <chr>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>     <dbl>
## 1      AD   3.8423   4.0184   4.0216   3.6759  0.081223  0.083373
## 2      AE 253.9655 278.9489 287.8318 297.0189  4.481976  5.171255
## 3      AF   9.7630  10.3052  11.7212  12.1445 24.399948 25.183615
## 4      AG   1.1190   1.2687   1.3892   1.3902  0.082565  0.083467
## 5      AL   9.2684   9.7718  10.3483  11.1275  3.011487  2.992547
## 6      AM   7.6678   8.6797   9.8731  10.5544  3.014917  3.002161
## # ... with 6 more variables: pop.2007 <dbl>, pop.2008 <dbl>,
## #   unemp.2005 <dbl>, unemp.2006 <dbl>, unemp.2007 <dbl>, unemp.2008 <dbl>

We have a country variable giving country abbreviations. The other variables are numerical values: the country’s GDP in 2005, 2006, 2007, and 2008; then the same for population and unemployment. Let’s get this into a format we could use for analysis.

3.2 Tidying

Wickham (2014) outlines three qualities that make data “tidy”:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For one thing, this means that whether a dataset is tidy depends, at least in part, on the purpose for which it will be analyzed (though some data collections are messy from any angle).

Each row of untidy_data is a country. In observational studies in comparative politics and international relations, more commonly the unit of observation is the country-year.7 How can we take untidy_data and easily make it into country-year data?

We’ll use the tidyr package (again, part of tidyverse) to clean up this data. The biggest problem right now is that each column, besides the country identifier, really encodes two pieces of information: the year of observation and the variable being observed. To deal with this, we’ll have to first transform the data from one untidy format to another. We’re going to use the gather() function to make each row a country-year-variable.

gather() creates a separate row for each entry in a set of columns. It’s probably easiest to understand by seeing it in practice:

long_data <- gather(untidy_data,
                    key = variable,
                    value = number,
                    gdp.2005:unemp.2008)
head(long_data)
## # A tibble: 6 × 3
##   country variable   number
##     <chr>    <chr>    <dbl>
## 1      AD gdp.2005   3.8423
## 2      AE gdp.2005 253.9655
## 3      AF gdp.2005   9.7630
## 4      AG gdp.2005   1.1190
## 5      AL gdp.2005   9.2684
## 6      AM gdp.2005   7.6678

With the first argument, we told gather() to use the untidy_data data frame. With the last argument, we told it the set of columns to “gather” together into a single column. The key argument gives the name of the new column that stores the original column names, and the value argument gives the name of the new column that stores the associated values. For example, the second row of long_data encodes the AE entry of what was previously the gdp.2005 column of untidy_data.

Now we have a new problem, which is that variable encodes two pieces of information: the variable and the year of its observation. tidyr provides the separate() function to solve that, splitting a single variable into two.

long_data <- separate(long_data,
                      col = variable,
                      into = c("var", "year"))
head(long_data)
## # A tibble: 6 × 4
##   country   var  year   number
##     <chr> <chr> <chr>    <dbl>
## 1      AD   gdp  2005   3.8423
## 2      AE   gdp  2005 253.9655
## 3      AF   gdp  2005   9.7630
## 4      AG   gdp  2005   1.1190
## 5      AL   gdp  2005   9.2684
## 6      AM   gdp  2005   7.6678

So now we have country-year-variable data, with the year and variable conveniently stored in different columns. To turn this into country-year data, we can use the spread() function, which is like the inverse of gather(). spread() takes a key column and a value column, and turns each different key into a column of its own.

clean_data <- spread(long_data,
                    key = var,
                    value = number)
head(clean_data)
## # A tibble: 6 × 5
##   country  year      gdp      pop unemp
##     <chr> <chr>    <dbl>    <dbl> <dbl>
## 1      AD  2005   3.8423 0.081223    NA
## 2      AD  2006   4.0184 0.083373    NA
## 3      AD  2007   4.0216 0.084878    NA
## 4      AD  2008   3.6759 0.085616    NA
## 5      AE  2005 253.9655 4.481976   3.1
## 6      AE  2006 278.9489 5.171255   3.3

When using spread() on data that you didn’t previously gather(), be sure to set the fill argument to tell it how to fill in empty cells. A simple example:

test_data
## # A tibble: 3 × 3
##        id     k     v
##     <chr> <chr> <dbl>
## 1 brenton     a    10
## 2 brenton     b    20
## 3 patrick     b     5
spread(test_data, key = k, value = v)
## # A tibble: 2 × 3
##        id     a     b
## *   <chr> <dbl> <dbl>
## 1 brenton    10    20
## 2 patrick    NA     5
spread(test_data, key = k, value = v, fill = 100)
## # A tibble: 2 × 3
##        id     a     b
## *   <chr> <dbl> <dbl>
## 1 brenton    10    20
## 2 patrick   100     5

One more important note on tidyverse semantics: the suite includes a fabulous feature called the pipe, %>%, which makes it easy to string together a truly mind-boggling number of commands.

In pipe syntax, x %>% f() is equivalent to f(x). That seems like a wasteful and confusing way to write f(x), and it is. But if you want to string together a bunch of commands, it’s much easier to comprehend

x %>%
  f() %>%
  g() %>%
  h() %>%
  i()

than i(h(g(f(x)))).

You can pass function arguments using the pipe too. For example, f(x, bear = "moose") is equivalent to x %>% f(bear = "moose").
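As a throwaway concrete example (base R, nothing to do with our data), these two lines compute the same thing:

```r
library(magrittr)  # provides %>%; attached automatically by tidyverse

sort(c(2, 9, 4), decreasing = TRUE)     # 9 4 2
c(2, 9, 4) %>% sort(decreasing = TRUE)  # 9 4 2
```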

The key thing about the tidyverse functions is that each of them takes a data frame as its first argument, and returns a data frame as its output. This makes them highly amenable to piping. For example, we can combine all three steps of our tidying above with a single command, thanks to the pipe:8

untidy_data %>%
  gather(key = variable,
         value = number,
         gdp.2005:unemp.2008) %>%
  separate(col = variable,
           into = c("var", "year")) %>%
  spread(key = var,
         value = number)
## # A tibble: 860 × 5
##   country  year      gdp      pop unemp
## *   <chr> <chr>    <dbl>    <dbl> <dbl>
## 1      AD  2005   3.8423 0.081223    NA
## 2      AD  2006   4.0184 0.083373    NA
## 3      AD  2007   4.0216 0.084878    NA
## 4      AD  2008   3.6759 0.085616    NA
## 5      AE  2005 253.9655 4.481976   3.1
## # ... with 855 more rows

Without the pipe, if we wanted to run all those commands together, we would have to write:

spread(separate(gather(untidy_data,
                       key = variable,
                       value = number,
                       gdp.2005:unemp.2008),
                col = variable,
                into = c("var", "year")),
       key = var,
       value = number)

Sad!

3.3 Transforming and Aggregating

Tidying the data usually isn’t the end of the process. If you want to perform further calculations on the tidied data, that’s where the tools in dplyr (part of, you guessed it, the tidyverse) come in.

Perhaps the simplest dplyr function (or “verb”, as the R hipsters would say) is rename(), which lets you rename columns.

clean_data %>%
  rename(gross_domestic_product = gdp)
## # A tibble: 860 × 5
##   country  year gross_domestic_product      pop unemp
## *   <chr> <chr>                  <dbl>    <dbl> <dbl>
## 1      AD  2005                 3.8423 0.081223    NA
## 2      AD  2006                 4.0184 0.083373    NA
## 3      AD  2007                 4.0216 0.084878    NA
## 4      AD  2008                 3.6759 0.085616    NA
## 5      AE  2005               253.9655 4.481976   3.1
## # ... with 855 more rows

The dplyr functions, like the vast majority of R functions, do not modify their inputs. In other words, running rename() on clean_data will return a renamed copy of clean_data, but won’t overwrite the original.

clean_data
## # A tibble: 860 × 5
##   country  year      gdp      pop unemp
## *   <chr> <chr>    <dbl>    <dbl> <dbl>
## 1      AD  2005   3.8423 0.081223    NA
## 2      AD  2006   4.0184 0.083373    NA
## 3      AD  2007   4.0216 0.084878    NA
## 4      AD  2008   3.6759 0.085616    NA
## 5      AE  2005 253.9655 4.481976   3.1
## # ... with 855 more rows

If you wanted to make the change stick, you would have to run:

clean_data <- clean_data %>%
  rename(gross_domestic_product = gdp)

select() lets you keep a couple of columns and drop all the others. Or vice versa if you use minus signs.

clean_data %>%
  select(country, gdp)
## # A tibble: 860 × 2
##   country      gdp
## *   <chr>    <dbl>
## 1      AD   3.8423
## 2      AD   4.0184
## 3      AD   4.0216
## 4      AD   3.6759
## 5      AE 253.9655
## # ... with 855 more rows
clean_data %>%
  select(-pop)
## # A tibble: 860 × 4
##   country  year      gdp unemp
## *   <chr> <chr>    <dbl> <dbl>
## 1      AD  2005   3.8423    NA
## 2      AD  2006   4.0184    NA
## 3      AD  2007   4.0216    NA
## 4      AD  2008   3.6759    NA
## 5      AE  2005 253.9655   3.1
## # ... with 855 more rows

mutate() lets you create new variables that are transformations of old ones.

clean_data %>%
  mutate(gdppc = gdp / pop,
         log_gdppc = log(gdppc))
## # A tibble: 860 × 7
##   country  year      gdp      pop unemp  gdppc log_gdppc
##     <chr> <chr>    <dbl>    <dbl> <dbl>  <dbl>     <dbl>
## 1      AD  2005   3.8423 0.081223    NA 47.305    3.8566
## 2      AD  2006   4.0184 0.083373    NA 48.198    3.8753
## 3      AD  2007   4.0216 0.084878    NA 47.381    3.8582
## 4      AD  2008   3.6759 0.085616    NA 42.935    3.7597
## 5      AE  2005 253.9655 4.481976   3.1 56.664    4.0371
## # ... with 855 more rows

filter() cuts down the data according to the logical condition(s) you specify.

clean_data %>%
  filter(year == 2006)
## # A tibble: 215 × 5
##   country  year      gdp       pop unemp
##     <chr> <chr>    <dbl>     <dbl> <dbl>
## 1      AD  2006   4.0184  0.083373    NA
## 2      AE  2006 278.9489  5.171255   3.3
## 3      AF  2006  10.3052 25.183615   8.8
## 4      AG  2006   1.2687  0.083467    NA
## 5      AL  2006   9.7718  2.992547  12.4
## # ... with 210 more rows

summarise() calculates summaries of the data. For example, let’s find the maximum unemployment rate.

clean_data %>%
  summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 1 × 1
##   max_unemp
##       <dbl>
## 1      37.6

This seems sort of useless, until you combine it with the group_by() function. If you group the data before summarise-ing it, you’ll calculate a separate summary for each group. For example, let’s calculate the maximum unemployment rate for each year in the data.

clean_data %>%
  group_by(year) %>%
  summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 4 × 2
##    year max_unemp
##   <chr>     <dbl>
## 1  2005      37.3
## 2  2006      36.0
## 3  2007      34.9
## 4  2008      37.6

summarise() produces a “smaller” data frame than the input—one row per group. If you want to do something similar, but preserving the structure of the original data, use mutate in combination with group_by.

clean_data %>%
  group_by(year) %>%
  mutate(max_unemp = max(unemp, na.rm = TRUE),
         unemp_over_max = unemp / max_unemp) %>%
  select(country, year, contains("unemp"))
## Source: local data frame [860 x 5]
## Groups: year [4]
## 
##   country  year unemp max_unemp unemp_over_max
##     <chr> <chr> <dbl>     <dbl>          <dbl>
## 1      AD  2005    NA      37.3             NA
## 2      AD  2006    NA      36.0             NA
## 3      AD  2007    NA      34.9             NA
## 4      AD  2008    NA      37.6             NA
## 5      AE  2005   3.1      37.3        0.08311
## # ... with 855 more rows

This gives us back the original data, but with a max_unemp variable recording the highest unemployment level that year. We can then calculate each individual country’s unemployment as a percentage of the maximum. Whether grouped mutate or summarise is better depends, of course, on the purpose and structure of your analysis.

Notice how I selected all of the unemployment-related columns with contains("unemp"). See ?select_helpers for a full list of helpful functions like this for select-ing variables.

3.4 Merging

Only rarely will you be lucky enough to draw all your data from a single source. More often, you’ll be merging together data from multiple sources.

The key to merging data from separate tables is to have consistent identifiers across tables. For example, if you run an experiment, you might have demographic data on each subject in one table, and each subject’s response to each treatment in another table. Naturally, you’ll want to have a subject identifier that “links” the records across tables, as in the following hypothetical example.

subject_data
## # A tibble: 3 × 4
##      id gender loves_bernie does_yoga
##   <dbl>  <chr>        <chr>     <chr>
## 1  1001   male          yes        no
## 2  1002 female           no       yes
## 3  1003   male           no        no
subject_response_data
## # A tibble: 6 × 3
##      id treatment response
##   <dbl>     <chr>    <chr>
## 1  1001 read_book      sad
## 2  1001  watch_tv      sad
## 3  1002 read_book    happy
## 4  1002  watch_tv      sad
## 5  1003 read_book      sad
## 6  1003  watch_tv    happy
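To make the linkage concrete, here is a sketch that rebuilds those two hypothetical tables (values copied from the printouts above) and merges them on the shared id, using dplyr’s left_join(), which is introduced properly in a moment:

```r
library(dplyr)
library(tibble)

subject_data <- tribble(
  ~id,  ~gender,  ~loves_bernie, ~does_yoga,
  1001, "male",   "yes",         "no",
  1002, "female", "no",          "yes",
  1003, "male",   "no",          "no"
)

subject_response_data <- tribble(
  ~id,  ~treatment,  ~response,
  1001, "read_book", "sad",
  1001, "watch_tv",  "sad",
  1002, "read_book", "happy",
  1002, "watch_tv",  "sad",
  1003, "read_book", "sad",
  1003, "watch_tv",  "happy"
)

# Each response row picks up its subject's demographics via the id link.
merged <- left_join(subject_response_data, subject_data, by = "id")
```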

Let’s practice merging data with our cleaned-up country-year data. We’ll take two datasets from my website: a country-level dataset with latitudes and longitudes, and a country-year–level dataset with inflation over time.

latlong_data <- read_csv("http://bkenkel.com/data/latlong.csv")
latlong_data
## # A tibble: 245 × 3
##   country latitude longitude
##     <chr>    <dbl>     <dbl>
## 1      AD   42.546    1.6016
## 2      AE   23.424   53.8478
## 3      AF   33.939   67.7100
## 4      AG   17.061  -61.7964
## 5      AI   18.221  -63.0686
## # ... with 240 more rows
inflation_data <- read_csv("http://bkenkel.com/data/inflation.csv")
inflation_data
## # A tibble: 1,070 × 3
##   country  year inflation
##     <chr> <int>     <dbl>
## 1      AD  2004        NA
## 2      AD  2005        NA
## 3      AD  2006        NA
## 4      AD  2007        NA
## 5      AD  2008        NA
## # ... with 1,065 more rows

For your convenience, both of these datasets use the same two-letter country naming scheme as the original data. Unfortunately, out in the real world, data from different sources often use incommensurate naming schemes. Converting from one naming scheme to another is part of the data cleaning process, and it requires careful attention.
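For example, the countrycode package (which the appendix below also uses) can translate between many coding schemes. A quick sketch, converting three-letter ISO codes into the two-letter codes our data uses; as with any automated conversion, eyeball the output for misses:

```r
library(countrycode)

# ISO3 -> ISO2; unmatched codes come back as NA (with a warning).
countrycode(c("AFG", "DZA"), origin = "iso3c", destination = "iso2c")
# returns "AF" "DZ"
```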

dplyr contains various _join() functions for merging. Each of these take as arguments the two data frames to merge, plus the names of the identifier variables to merge them on. The one I use most often is left_join(), which keeps every row from the first (“left”) data frame and merges in the columns from the second (“right”) data frame.

For example, let’s merge the latitude and longitude data for each country into clean_data.

left_join(clean_data,
          latlong_data,
          by = "country")
## # A tibble: 860 × 7
##   country  year      gdp      pop unemp latitude longitude
##     <chr> <chr>    <dbl>    <dbl> <dbl>    <dbl>     <dbl>
## 1      AD  2005   3.8423 0.081223    NA   42.546    1.6016
## 2      AD  2006   4.0184 0.083373    NA   42.546    1.6016
## 3      AD  2007   4.0216 0.084878    NA   42.546    1.6016
## 4      AD  2008   3.6759 0.085616    NA   42.546    1.6016
## 5      AE  2005 253.9655 4.481976   3.1   23.424   53.8478
## # ... with 855 more rows

Since latlong_data is country-level, the same latitude and longitude repeat across every year of a given country, so the merged data contains redundant information. This is one reason to store data observed at different levels in different tables: with redundant observations, it is easier to make errors and harder to catch and fix them.

We can also merge data when the identifier is stored across multiple columns, as in the case of our country-year data. But first, a technical note.9 You might notice that the year column of clean_data is labeled <chr>, as in character data. Yet the year column of inflation_data is labeled <int>, as in integer data. We can check that by running class() on each respective column.

class(clean_data$year)
## [1] "character"
class(inflation_data$year)
## [1] "integer"

From R’s perspective, the character string "1999" is a very different thing than the integer number 1999. Therefore, if we try to merge clean_data and inflation_data on the year variable, it will throw an error.

left_join(clean_data,
          inflation_data,
          by = c("country", "year"))
## Error in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): Can't join on 'year' x 'year' because of incompatible types (integer / character)

To fix this, let’s use mutate() to convert the year column of clean_data to an integer. We probably should have done this in the first place—after all, having the year encoded as a character string would have thrown off plotting functions, statistical functions, or anything else where it would be more natural to treat the year like a number.

clean_data <- mutate(clean_data,
                     year = as.integer(year))
clean_data
## # A tibble: 860 × 5
##   country  year      gdp      pop unemp
##     <chr> <int>    <dbl>    <dbl> <dbl>
## 1      AD  2005   3.8423 0.081223    NA
## 2      AD  2006   4.0184 0.083373    NA
## 3      AD  2007   4.0216 0.084878    NA
## 4      AD  2008   3.6759 0.085616    NA
## 5      AE  2005 253.9655 4.481976   3.1
## # ... with 855 more rows

Looks the same as before, except with an important difference: year is now labeled <int>.

Now we can merge the two datasets together without issue. Notice how we use a vector in the by argument to specify multiple columns to merge on.

left_join(clean_data,
          inflation_data,
          by = c("country", "year"))
## # A tibble: 860 × 6
##   country  year      gdp      pop unemp inflation
##     <chr> <int>    <dbl>    <dbl> <dbl>     <dbl>
## 1      AD  2005   3.8423 0.081223    NA        NA
## 2      AD  2006   4.0184 0.083373    NA        NA
## 3      AD  2007   4.0216 0.084878    NA        NA
## 4      AD  2008   3.6759 0.085616    NA        NA
## 5      AE  2005 253.9655 4.481976   3.1        NA
## # ... with 855 more rows

You might remember that inflation_data contained some country-years not included in the original data (namely, observations from 2004). If we want the merged data to use the observations from inflation_data rather than clean_data, we can use the right_join() function.

right_join(clean_data,
           inflation_data,
           by = c("country", "year"))
## # A tibble: 1,070 × 6
##   country  year    gdp      pop unemp inflation
##     <chr> <int>  <dbl>    <dbl> <dbl>     <dbl>
## 1      AD  2004     NA       NA    NA        NA
## 2      AD  2005 3.8423 0.081223    NA        NA
## 3      AD  2006 4.0184 0.083373    NA        NA
## 4      AD  2007 4.0216 0.084878    NA        NA
## 5      AD  2008 3.6759 0.085616    NA        NA
## # ... with 1,065 more rows
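One habit worth pairing with any merge (my suggestion, illustrated with toy data rather than the datasets above): check which rows failed to match. dplyr’s anti_join() returns the rows of its first argument that have no match in the second.

```r
library(dplyr)
library(tibble)

yearly <- tibble(country = c("AD", "AE", "ZZ"),
                 gdp     = c(3.8, 254.0, 1.0))
coords <- tibble(country  = c("AD", "AE"),
                 latitude = c(42.5, 23.4))

# Rows of `yearly` with no counterpart in `coords` -- here the bogus "ZZ".
# An empty result would mean every row merged cleanly.
unmatched <- anti_join(yearly, coords, by = "country")
```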

One last common issue in merging is that the identifier variables have different names in the two datasets. If it’s inconvenient or infeasible to correct this by renaming the columns in one or the other, you can specify the by argument as in the following example.

inflation_data <- rename(inflation_data,
                         the_country = country,
                         the_year = year)
inflation_data
## # A tibble: 1,070 × 3
##   the_country the_year inflation
##         <chr>    <int>     <dbl>
## 1          AD     2004        NA
## 2          AD     2005        NA
## 3          AD     2006        NA
## 4          AD     2007        NA
## 5          AD     2008        NA
## # ... with 1,065 more rows
left_join(clean_data,
          inflation_data,
          by = c("country" = "the_country", "year" = "the_year"))
## # A tibble: 860 × 6
##   country  year      gdp      pop unemp inflation
##     <chr> <int>    <dbl>    <dbl> <dbl>     <dbl>
## 1      AD  2005   3.8423 0.081223    NA        NA
## 2      AD  2006   4.0184 0.083373    NA        NA
## 3      AD  2007   4.0216 0.084878    NA        NA
## 4      AD  2008   3.6759 0.085616    NA        NA
## 5      AE  2005 253.9655 4.481976   3.1        NA
## # ... with 855 more rows

3.5 Appendix: Creating the Example Data

I used the same tools this chapter introduces to create the untidy data. I may as well include the code to do it, in case it helps further illustrate how to use the tidyverse tools (and, as a bonus, the WDI package for downloading World Development Indicators data).

First I load the necessary packages.

library("tidyverse")
library("WDI")
library("countrycode")
library("stringr")

Next, I download the relevant WDI data. I used the WDIsearch() function to locate the appropriate indicator names.

dat_raw <- WDI(country = "all",
               indicator = c("NY.GDP.MKTP.KD",  # GDP in 2000 USD
                             "SP.POP.TOTL",     # Total population
                             "SL.UEM.TOTL.ZS"), # Unemployment rate
               start = 2005,
               end = 2008)

head(dat_raw)
##   iso2c    country year NY.GDP.MKTP.KD SP.POP.TOTL SL.UEM.TOTL.ZS
## 1    1A Arab World 2005     1.6428e+12   313430911        12.1402
## 2    1A Arab World 2006     1.7629e+12   320906736        11.3296
## 3    1A Arab World 2007     1.8625e+12   328766559        10.8961
## 4    1A Arab World 2008     1.9799e+12   336886468        10.5060
## 5    1W      World 2005     5.7703e+13  6513959904         6.1593
## 6    1W      World 2006     6.0229e+13  6594722462         5.9000

I want to get rid of the aggregates, like the “Arab World” and “World” we see here. As a rough first pass, I’m going to exclude those so-called countries whose ISO codes don’t appear in the countrycode package data.10

dat_countries <- dat_raw %>%
  filter(iso2c %in% countrycode_data$iso2c)

Let’s check on which countries are left. (I cut it down to max six characters per country name for printing purposes.)

dat_countries$country %>%
  unique() %>%
  str_sub(start = 1, end = 6)
##   [1] "Andorr" "United" "Afghan" "Antigu" "Albani" "Armeni" "Angola"
##   [8] "Argent" "Americ" "Austri" "Austra" "Aruba"  "Azerba" "Bosnia"
##  [15] "Barbad" "Bangla" "Belgiu" "Burkin" "Bulgar" "Bahrai" "Burund"
##  [22] "Benin"  "Bermud" "Brunei" "Bolivi" "Brazil" "Bahama" "Bhutan"
##  [29] "Botswa" "Belaru" "Belize" "Canada" "Congo," "Centra" "Congo,"
##  [36] "Switze" "Cote d" "Chile"  "Camero" "China"  "Colomb" "Costa "
##  [43] "Cuba"   "Cabo V" "Curaca" "Cyprus" "Czech " "German" "Djibou"
##  [50] "Denmar" "Domini" "Domini" "Algeri" "Ecuado" "Estoni" "Egypt,"
##  [57] "Eritre" "Spain"  "Ethiop" "Finlan" "Fiji"   "Micron" "Faroe "
##  [64] "France" "Gabon"  "United" "Grenad" "Georgi" "Ghana"  "Gibral"
##  [71] "Greenl" "Gambia" "Guinea" "Equato" "Greece" "Guatem" "Guam"  
##  [78] "Guinea" "Guyana" "Hong K" "Hondur" "Croati" "Haiti"  "Hungar"
##  [85] "Indone" "Irelan" "Israel" "Isle o" "India"  "Iraq"   "Iran, "
##  [92] "Icelan" "Italy"  "Jamaic" "Jordan" "Japan"  "Kenya"  "Kyrgyz"
##  [99] "Cambod" "Kiriba" "Comoro" "St. Ki" "Korea," "Korea," "Kuwait"
## [106] "Cayman" "Kazakh" "Lao PD" "Lebano" "St. Lu" "Liecht" "Sri La"
## [113] "Liberi" "Lesoth" "Lithua" "Luxemb" "Latvia" "Libya"  "Morocc"
## [120] "Monaco" "Moldov" "Monten" "St. Ma" "Madaga" "Marsha" "Macedo"
## [127] "Mali"   "Myanma" "Mongol" "Macao " "Northe" "Maurit" "Malta" 
## [134] "Maurit" "Maldiv" "Malawi" "Mexico" "Malays" "Mozamb" "Namibi"
## [141] "New Ca" "Niger"  "Nigeri" "Nicara" "Nether" "Norway" "Nepal" 
## [148] "Nauru"  "New Ze" "Oman"   "Panama" "Peru"   "French" "Papua "
## [155] "Philip" "Pakist" "Poland" "Puerto" "West B" "Portug" "Palau" 
## [162] "Paragu" "Qatar"  "Romani" "Serbia" "Russia" "Rwanda" "Saudi "
## [169] "Solomo" "Seyche" "Sudan"  "Sweden" "Singap" "Sloven" "Slovak"
## [176] "Sierra" "San Ma" "Senega" "Somali" "Surina" "South " "Sao To"
## [183] "El Sal" "Sint M" "Syrian" "Swazil" "Turks " "Chad"   "Togo"  
## [190] "Thaila" "Tajiki" "Timor-" "Turkme" "Tunisi" "Tonga"  "Turkey"
## [197] "Trinid" "Tuvalu" "Tanzan" "Ukrain" "Uganda" "United" "Urugua"
## [204] "Uzbeki" "St. Vi" "Venezu" "Britis" "Virgin" "Vietna" "Vanuat"
## [211] "Samoa"  "Yemen," "South " "Zambia" "Zimbab"

With that out of the way, there’s still some cleaning up to do. The magnitudes of GDP and population are too large, and the variable names are impenetrable. Also, the country variable, while helpful, is redundant now that we’re satisfied with the list of countries remaining.

dat_countries <- dat_countries %>%
  select(-country) %>%
  rename(gdp = NY.GDP.MKTP.KD,
         pop = SP.POP.TOTL,
         unemp = SL.UEM.TOTL.ZS,
         country = iso2c) %>%
  mutate(gdp = gdp / 1e9,
         pop = pop / 1e6)

head(dat_countries)
##   country year      gdp      pop unemp
## 1      AD 2005   3.8423 0.081223    NA
## 2      AD 2006   4.0184 0.083373    NA
## 3      AD 2007   4.0216 0.084878    NA
## 4      AD 2008   3.6759 0.085616    NA
## 5      AE 2005 253.9655 4.481976   3.1
## 6      AE 2006 278.9489 5.171255   3.3

Now I convert the data to “long” format.

dat_countries_long <- dat_countries %>%
  gather(key = variable,
         value = value,
         gdp:unemp)

head(dat_countries_long)
##   country year variable    value
## 1      AD 2005      gdp   3.8423
## 2      AD 2006      gdp   4.0184
## 3      AD 2007      gdp   4.0216
## 4      AD 2008      gdp   3.6759
## 5      AE 2005      gdp 253.9655
## 6      AE 2006      gdp 278.9489

I then smush variable and year into a single column, and drop the individual components.

dat_countries_long <- dat_countries_long %>%
  mutate(var_year = paste(variable, year, sep = ".")) %>%
  select(-variable, -year)

head(dat_countries_long)
##   country    value var_year
## 1      AD   3.8423 gdp.2005
## 2      AD   4.0184 gdp.2006
## 3      AD   4.0216 gdp.2007
## 4      AD   3.6759 gdp.2008
## 5      AE 253.9655 gdp.2005
## 6      AE 278.9489 gdp.2006

Finally, I “widen” the data, so that each var_year is a column of its own.

dat_countries_wide <- dat_countries_long %>%
  spread(key = var_year, value = value)

head(dat_countries_wide)
##   country gdp.2005 gdp.2006 gdp.2007 gdp.2008  pop.2005  pop.2006
## 1      AD   3.8423   4.0184   4.0216   3.6759  0.081223  0.083373
## 2      AE 253.9655 278.9489 287.8318 297.0189  4.481976  5.171255
## 3      AF   9.7630  10.3052  11.7212  12.1445 24.399948 25.183615
## 4      AG   1.1190   1.2687   1.3892   1.3902  0.082565  0.083467
## 5      AL   9.2684   9.7718  10.3483  11.1275  3.011487  2.992547
## 6      AM   7.6678   8.6797   9.8731  10.5544  3.014917  3.002161
##    pop.2007  pop.2008 unemp.2005 unemp.2006 unemp.2007 unemp.2008
## 1  0.084878  0.085616         NA         NA         NA         NA
## 2  6.010100  6.900142        3.1        3.3        3.4        4.0
## 3 25.877544 26.528741        8.5        8.8        8.4        8.9
## 4  0.084397  0.085350         NA         NA         NA         NA
## 5  2.970017  2.947314       12.5       12.4       13.5       13.0
## 6  2.988117  2.975029       27.8       28.6       28.4       16.4

Now we have some ugly data. I save the output to upload to my website.

write_csv(dat_countries_wide, path = "untidy-data.csv")

And here’s how I made the second country-year dataset used in the merging section. The country dataset with latitudes and longitudes is from https://developers.google.com/public-data/docs/canonical/countries_csv.

dat_2 <-
  WDI(country = "all",
      indicator = "FP.CPI.TOTL.ZG",
      start = 2004,
      end = 2008) %>%
  as_data_frame() %>%
  select(country = iso2c,
         year,
         inflation = FP.CPI.TOTL.ZG) %>%
  mutate(year = as.integer(year)) %>%
  filter(country %in% clean_data$country) %>%
  arrange(country, year)

write_csv(dat_2, path = "inflation.csv")

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10).


  4. Even if it’s data you collected yourself, that data should still have a “canonical” representation that never gets overwritten. See Leek (2015) for more on distributing your own data.

  5. But not for me, because these tools didn’t exist when I was a PhD student. Also, get off my lawn!

  6. More precisely, the readr functions produce output of class "tbl_df" (pronounced “tibble diff,” I’m told), which are like data frames but better. See help(package = "tibble") for what can be done with tbl_dfs.

  7. Insert lame joke about how Americanists haven’t heard of other countries. But, seriously, if you’re confused because you haven’t heard of other countries, just think of “state-years”.

  8. If you are reading the PDF copy of these notes (i.e., the ones I hand out in class), the line breaks are eliminated, making the piped commands rather hard to read. I am working on fixing this. For now, you may find the online notes at http://bkenkel.com/pdaps easier to follow.

  9. This won’t be the case if you got clean_data by loading it in directly from clean-data.csv on my website, since read_csv() will have correctly encoded year as an integer.

  10. countrycode is a very useful, albeit imperfect, package for converting between different country naming/coding schemes.