3 Working with Data

Some material in this chapter is adapted from notes Matt DiLorenzo wrote for the Spring 2016 session of PSCI 8357.

In your careers as social scientists, starting with your dissertation research—if not earlier—you will probably spend more time collecting, merging, and cleaning data than you will on statistical analysis. So it’s worth taking some time to learn how to do this well.

Best practices for data management can be summarized in a single sentence: Record and document everything you do to the data.

The first corollary of this principle is that raw data is sacrosanct. You should never edit raw data “in place”. Once you download the raw data file, that file should never change.5

In almost any non-trivial analysis, the “final” data—the format you plug into your analysis—will differ significantly from the raw data. It may consist of information merged from multiple sources. The variables may have been transformed, aggregated, or otherwise altered. The unit of observation may even differ from the original source. You must document every one of these changes, so that another researcher working from the exact same raw data will end up with the exact same final data.

The most sensible way to achieve this level of reproducibility is to do all of your data merging and cleaning in a script. In other words, no going into Excel and mucking around manually. Like any other piece of your analysis, your pipeline from raw data to final data should follow the principles of programming that we discussed last week.

Luckily for you,6 the tidyverse suite of R packages (including dplyr, tidyr, and others) makes it easy to script your “data pipeline”.

library("tidyverse")

3.1 Loading

The first step in working with data is to acquire some data. Depending on the nature of your research, you will be getting some or all of your data from sources available online. When you download data from online repositories, you should keep track of where you got it from. The best way to do so is—you guessed it—to script your data acquisition.

The R function download.file() is the easiest way to download a file from a URL within R. Just specify where you’re getting the file from and where you want it to go. For the examples today, we’ll use an “untidied” version of the World Development Indicators data from the World Bank that I’ve posted to my website.

download.file(url = "http://bkenkel.com/files/untidy-data.csv",
              destfile = "my-untidy-data.csv")

Once you’ve got the file stored locally, use the utilities from the readr package (part of tidyverse) to read it into R as a data frame.7 We have a CSV file, so we will use read_csv. See help(package = "readr") for other possibilities. read_csv (with an underscore) is typically better than R’s built-in read.csv (with a dot), as it reads files faster and processes them more cleanly.

untidy_data <- read_csv(file = "my-untidy-data.csv")
## Rows: 215 Columns: 13
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): country
## dbl (12): gdp.2005, gdp.2006, gdp.2007, gdp.2008, pop.2005, pop.2006, pop.20...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Remember that each column of a data frame might be a different type, or more formally class, of object. read_csv and its ilk try to guess the type of data each column contains: character, integer, decimal number (“double” in programming-speak), or something else. The readout above tells you what guesses it made. If it gets something wrong—say, reading a column as numbers that ought to be characters—you can use the col_types argument to set it straight.
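For example, if read_csv() were to guess wrong here, you could spell out the types yourself with col_types. This is just a sketch, assuming we want country read as character and every other column as a double (which matches the guesses shown above):

# declare column types explicitly instead of relying on read_csv()'s guesses
untidy_data <- read_csv(file = "my-untidy-data.csv",
                        col_types = cols(country = col_character(),
                                         .default = col_double()))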

FYI, you could also run read_csv() directly on a URL, as in:

read_csv("http://bkenkel.com/files/untidy-data.csv")

However, in analyses intended for publication, it’s usually preferable to download and save the raw data. What’s stored at a URL might change or disappear, and you’ll need to have a hard copy of the raw data for replication purposes.
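A common pattern, shown here only as a sketch, is to wrap the download in a file check so that re-running your script doesn’t re-fetch (and potentially overwrite) the raw data you already saved:

# only download the raw data if we don't already have a local copy
if (!file.exists("my-untidy-data.csv")) {
  download.file(url = "http://bkenkel.com/files/untidy-data.csv",
                destfile = "my-untidy-data.csv")
}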

Now let’s take a look at the data we’ve just loaded in.

untidy_data
## # A tibble: 215 × 13
##   country gdp.2005 gdp.2006 gdp.2007 gdp.2008 pop.2005 pop.2006 pop.2007 pop.2…¹
##   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
## 1 AD          3.85     4.03     4.03     3.68   0.0789   0.0810   0.0827  0.0839
## 2 AE        257.     283.     292.     301.     4.59     5.30     6.17    7.09  
## 3 AF          9.16     9.66    11.0     11.4   25.7     26.4     27.1    27.7   
## 4 AG          1.15     1.29     1.41     1.41   0.0815   0.0827   0.0840  0.0854
## 5 AL          9.22     9.77    10.4     11.1    3.01     2.99     2.97    2.95  
## # … with 210 more rows, 4 more variables: unemp.2005 <dbl>, unemp.2006 <dbl>,
## #   unemp.2007 <dbl>, unemp.2008 <dbl>, and abbreviated variable name ¹​pop.2008

We have a country variable giving country abbreviations. The other variables are numerical values: the country’s GDP in 2005, 2006, 2007, and 2008; then the same for population and unemployment. Let’s get this into a format we could use for analysis.

3.2 Tidying

Wickham (2014) outlines three qualities that make data “tidy”:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For one thing, this means that whether a dataset is tidy depends, at least in part, on the purpose for which it is being analyzed (though some data collections are messy from any angle).

Each row of untidy_data is a country. In observational studies in comparative politics and international relations, more commonly the unit of observation is the country-year.8 How can we take untidy_data and easily make it into country-year data?

We’ll use the tidyr package (again, part of tidyverse) to clean up this data. The biggest problem right now is that each column, besides the country identifier, really encodes two pieces of information: the year of observation and the variable being observed. To deal with this, we’ll have to first transform the data from one untidy format to another. We’re going to use the pivot_longer() function to make each row a country-year-variable.

What pivot_longer() does is make a row for each entry from a set of columns. It’s probably easiest to understand it by seeing it in practice:

long_data <- pivot_longer(untidy_data,
                          cols = starts_with(c("gdp", "pop", "unemp")))
long_data
## # A tibble: 2,580 × 3
##   country name      value
##   <chr>   <chr>     <dbl>
## 1 AD      gdp.2005 3.85  
## 2 AD      gdp.2006 4.03  
## 3 AD      gdp.2007 4.03  
## 4 AD      gdp.2008 3.68  
## 5 AD      pop.2005 0.0789
## # … with 2,575 more rows

The cols argument tells pivot_longer() which columns to collect. The code here specifies that it be all of the columns starting with “gdp”, “pop”, or “unemp”. As it happens, we could have achieved the same result by using cols = ends_with(c("2005", "2006", "2007", "2008")). To select all columns, we instead would have used cols = everything(). For additional ways of selecting columns, see the help page ?tidyselect::select_helpers.
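Concretely, the equivalent ends_with() call mentioned above would look like this:

# same result as the starts_with() version: gather the year-suffixed columns
pivot_longer(untidy_data,
             cols = ends_with(c("2005", "2006", "2007", "2008")))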

Now we have a new problem, which is that the name column encodes two pieces of information: the variable and the year of its observation. tidyr provides the separate() function to solve that, splitting a single variable into two.

long_data <- separate(long_data,
                      col = "name",
                      into = c("var", "year"))
long_data
## # A tibble: 2,580 × 4
##   country var   year   value
##   <chr>   <chr> <chr>  <dbl>
## 1 AD      gdp   2005  3.85  
## 2 AD      gdp   2006  4.03  
## 3 AD      gdp   2007  4.03  
## 4 AD      gdp   2008  3.68  
## 5 AD      pop   2005  0.0789
## # … with 2,575 more rows

So now we have country-year-variable data, with the year and variable conveniently stored in different columns. To turn this into country-year data, we can use the pivot_wider() function, which is like the inverse of pivot_longer(). pivot_wider() takes a column of names and a column of values, and turns each distinct name into a column of its own.

clean_data <- pivot_wider(long_data,
                          names_from = "var",
                          values_from = "value")
clean_data
## # A tibble: 860 × 5
##   country year     gdp    pop unemp
##   <chr>   <chr>  <dbl>  <dbl> <dbl>
## 1 AD      2005    3.85 0.0789 NA   
## 2 AD      2006    4.03 0.0810 NA   
## 3 AD      2007    4.03 0.0827 NA   
## 4 AD      2008    3.68 0.0839 NA   
## 5 AE      2005  257.   4.59    3.12
## # … with 855 more rows

When using pivot_wider() on data that you didn’t previously pivot_longer(), be sure to set the values_fill argument to tell it how to fill in empty cells. A simple example:

test_data
## # A tibble: 3 × 3
##   id      name  value
##   <chr>   <chr> <dbl>
## 1 brenton a        10
## 2 brenton b        20
## 3 patrick b         5
pivot_wider(test_data, names_from = "name", values_from = "value")
## # A tibble: 2 × 3
##   id          a     b
##   <chr>   <dbl> <dbl>
## 1 brenton    10    20
## 2 patrick    NA     5
pivot_wider(test_data, names_from = "name", values_from = "value", values_fill = 0)
## # A tibble: 2 × 3
##   id          a     b
##   <chr>   <dbl> <dbl>
## 1 brenton    10    20
## 2 patrick     0     5
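If you want to play along, one way to build a test_data like the one printed above is row by row with tribble(), which loads with the tidyverse. This is just a reconstruction of the printout, not necessarily how the data frame was originally created:

# construct the small example data frame row by row
test_data <- tribble(
  ~id,       ~name, ~value,
  "brenton", "a",       10,
  "brenton", "b",       20,
  "patrick", "b",        5
)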

One more important note on tidyverse semantics: the whole suite is built to work with R’s pipe, |>, which makes it easy to string together a truly mind-boggling number of commands.9

In pipe syntax, x |> f() is equivalent to f(x). That seems like a wasteful and confusing way to write f(x), and it is. But if you want to string together a bunch of commands, it’s much easier to comprehend

x |>
  f() |>
  g() |>
  h() |>
  i()

than i(h(g(f(x)))).

You can pass function arguments using the pipe too. For example, f(x, bear = "moose") is equivalent to x |> f(bear = "moose").

The key thing about the tidyverse functions is that each of them takes a data frame as its first argument, and returns a data frame as its output. This makes them highly amenable to piping. For example, we can combine all three steps of our tidying above with a single command, thanks to the pipe:

untidy_data |>
  pivot_longer(cols = starts_with(c("gdp", "pop", "unemp"))) |>
  separate(col = "name",
           into = c("var", "year")) |>
  pivot_wider(names_from = "var",
              values_from = "value")
## # A tibble: 860 × 5
##   country year     gdp    pop unemp
##   <chr>   <chr>  <dbl>  <dbl> <dbl>
## 1 AD      2005    3.85 0.0789 NA   
## 2 AD      2006    4.03 0.0810 NA   
## 3 AD      2007    4.03 0.0827 NA   
## 4 AD      2008    3.68 0.0839 NA   
## 5 AE      2005  257.   4.59    3.12
## # … with 855 more rows

Without the pipe, if we wanted to run all those commands together, we would have to write:

pivot_wider(separate(pivot_longer(untidy_data,
                                  cols = starts_with(c("gdp", "pop", "unemp"))),
                     col = "name",
                     into = c("var", "year")),
            names_from = "var",
            values_from = "value")

Going back to last week’s dictum of “write programs for people”: even though this would have the same computational result, it’s much harder for a human (your future self or a coauthor) to parse what the code is trying to do.

3.3 Transforming and Aggregating

Tidying the data usually isn’t the end of the process. If you want to perform further calculations on the tidied data, that’s where the tools in dplyr (part of, you guessed it, the tidyverse) come in.

Perhaps the simplest dplyr function (or “verb”, as the R hipsters would say) is rename(), which lets you rename columns.

clean_data |>
  rename(gross_domestic_product = gdp)
## # A tibble: 860 × 5
##   country year  gross_domestic_product    pop unemp
##   <chr>   <chr>                  <dbl>  <dbl> <dbl>
## 1 AD      2005                    3.85 0.0789 NA   
## 2 AD      2006                    4.03 0.0810 NA   
## 3 AD      2007                    4.03 0.0827 NA   
## 4 AD      2008                    3.68 0.0839 NA   
## 5 AE      2005                  257.   4.59    3.12
## # … with 855 more rows

The dplyr functions, like the vast majority of R functions, do not modify their inputs. In other words, running rename() on clean_data will return a renamed copy of clean_data, but won’t overwrite the original.

clean_data
## # A tibble: 860 × 5
##   country year     gdp    pop unemp
##   <chr>   <chr>  <dbl>  <dbl> <dbl>
## 1 AD      2005    3.85 0.0789 NA   
## 2 AD      2006    4.03 0.0810 NA   
## 3 AD      2007    4.03 0.0827 NA   
## 4 AD      2008    3.68 0.0839 NA   
## 5 AE      2005  257.   4.59    3.12
## # … with 855 more rows

If you wanted to make the change stick, you would have to run:

clean_data <- clean_data |>
  rename(gross_domestic_product = gdp)

select() lets you keep just the columns you name and drop all the others, or vice versa if you use minus signs.

clean_data |>
  select(country, gdp)
## # A tibble: 860 × 2
##   country    gdp
##   <chr>    <dbl>
## 1 AD        3.85
## 2 AD        4.03
## 3 AD        4.03
## 4 AD        3.68
## 5 AE      257.  
## # … with 855 more rows
clean_data |>
  select(-pop)
## # A tibble: 860 × 4
##   country year     gdp unemp
##   <chr>   <chr>  <dbl> <dbl>
## 1 AD      2005    3.85 NA   
## 2 AD      2006    4.03 NA   
## 3 AD      2007    4.03 NA   
## 4 AD      2008    3.68 NA   
## 5 AE      2005  257.    3.12
## # … with 855 more rows

As we saw above, you can also select columns based on how their names start or end, or with even more complex criteria. See ?tidyselect::select_helpers.

untidy_data |>
  select(country, starts_with("gdp"))
## # A tibble: 215 × 5
##   country gdp.2005 gdp.2006 gdp.2007 gdp.2008
##   <chr>      <dbl>    <dbl>    <dbl>    <dbl>
## 1 AD          3.85     4.03     4.03     3.68
## 2 AE        257.     283.     292.     301.  
## 3 AF          9.16     9.66    11.0     11.4 
## 4 AG          1.15     1.29     1.41     1.41
## 5 AL          9.22     9.77    10.4     11.1 
## # … with 210 more rows
untidy_data |>
  select(country, ends_with("2005"))
## # A tibble: 215 × 4
##   country gdp.2005 pop.2005 unemp.2005
##   <chr>      <dbl>    <dbl>      <dbl>
## 1 AD          3.85   0.0789      NA   
## 2 AE        257.     4.59         3.12
## 3 AF          9.16  25.7          2.94
## 4 AG          1.15   0.0815      NA   
## 5 AL          9.22   3.01        16.9 
## # … with 210 more rows

mutate() lets you create new variables that are transformations of old ones.

clean_data |>
  mutate(gdppc = gdp / pop,
         log_gdppc = log(gdppc))
## # A tibble: 860 × 7
##   country year     gdp    pop unemp gdppc log_gdppc
##   <chr>   <chr>  <dbl>  <dbl> <dbl> <dbl>     <dbl>
## 1 AD      2005    3.85 0.0789 NA     48.8      3.89
## 2 AD      2006    4.03 0.0810 NA     49.7      3.91
## 3 AD      2007    4.03 0.0827 NA     48.7      3.89
## 4 AD      2008    3.68 0.0839 NA     43.9      3.78
## 5 AE      2005  257.   4.59    3.12  56.1      4.03
## # … with 855 more rows

filter() cuts down the data according to the logical condition(s) you specify.

clean_data |>
  filter(year == 2006)
## # A tibble: 215 × 5
##   country year     gdp     pop unemp
##   <chr>   <chr>  <dbl>   <dbl> <dbl>
## 1 AD      2006    4.03  0.0810 NA   
## 2 AE      2006  283.    5.30    2.77
## 3 AF      2006    9.66 26.4     2.83
## 4 AG      2006    1.29  0.0827 NA   
## 5 AL      2006    9.77  2.99   16.4 
## # … with 210 more rows

summarise() calculates summaries of the data. For example, let’s find the maximum unemployment rate.

clean_data |>
  summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 1 × 1
##   max_unemp
##       <dbl>
## 1      37.2

This seems sort of useless, until you combine it with the group_by() function. If you group the data before summarise-ing it, you’ll calculate a separate summary for each group. For example, let’s calculate the maximum unemployment rate for each year in the data.

clean_data |>
  group_by(year) |>
  summarise(max_unemp = max(unemp, na.rm = TRUE))
## # A tibble: 4 × 2
##   year  max_unemp
##   <chr>     <dbl>
## 1 2005       37.2
## 2 2006       36.0
## 3 2007       34.9
## 4 2008       33.8

summarise() produces a “smaller” data frame than the input, with one row per group. If you want to do something similar while preserving the structure of the original data, use mutate() in combination with group_by().

clean_data |>
  filter(!is.na(unemp)) |>
  group_by(year) |>
  mutate(max_unemp = max(unemp),
         unemp_over_max = unemp / max_unemp) |>
  select(country, year, contains("unemp"))
## # A tibble: 744 × 5
## # Groups:   year [4]
##   country year  unemp max_unemp unemp_over_max
##   <chr>   <chr> <dbl>     <dbl>          <dbl>
## 1 AE      2005   3.12      37.2         0.0838
## 2 AE      2006   2.77      36.0         0.0770
## 3 AE      2007   2.47      34.9         0.0706
## 4 AE      2008   2.33      33.8         0.0691
## 5 AF      2005   2.94      37.2         0.0790
## # … with 739 more rows

This gives us back the original data, but with a max_unemp variable recording the highest unemployment level that year. We can then calculate each individual country’s unemployment as a percentage of the maximum. Whether grouped mutate or summarise is better depends, of course, on the purpose and structure of your analysis.

3.4 Merging

Particularly if you do observational work, you’ll often be merging together data from multiple sources. The key to merging data from separate tables is to have consistent identifiers across tables. For example, if you run an experiment, you might have demographic data on each subject in one table, and each subject’s response to treatment in another table. Naturally, you’ll want to have a subject identifier that “links” the records across tables, as in the following hypothetical example.

subject_data
## # A tibble: 3 × 4
##      id gender_id loves_bernie does_yoga
##   <dbl> <chr>     <chr>        <chr>    
## 1  1001 male      yes          no       
## 2  1002 female    no           yes      
## 3  1003 male      no           no
subject_response_data
## # A tibble: 6 × 3
##      id treatment response
##   <dbl> <chr>     <chr>   
## 1  1001 read_book sad     
## 2  1001 watch_tv  sad     
## 3  1002 read_book happy   
## 4  1002 watch_tv  sad     
## 5  1003 read_book sad     
## 6  1003 watch_tv  happy
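To actually link the two tables, you would merge on the shared id column. As a preview of the join functions covered in a moment, a sketch with these hypothetical data would be:

# attach each subject's demographics to their responses, matching rows on id
left_join(subject_response_data, subject_data, by = "id")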

Let’s practice merging data with our cleaned-up country-year data. We’ll take two datasets from my website: a country-level dataset with latitudes and longitudes, and a country-year–level dataset with inflation over time.

latlong_data <- read_csv("http://bkenkel.com/files/latlong.csv")
latlong_data
## # A tibble: 245 × 3
##   country latitude longitude
##   <chr>      <dbl>     <dbl>
## 1 AD          42.5      1.60
## 2 AE          23.4     53.8 
## 3 AF          33.9     67.7 
## 4 AG          17.1    -61.8 
## 5 AI          18.2    -63.1 
## # … with 240 more rows
inflation_data <- read_csv("http://bkenkel.com/files/inflation.csv")
inflation_data
## # A tibble: 1,070 × 3
##   country  year inflation
##   <chr>   <dbl>     <dbl>
## 1 AD       2004        NA
## 2 AD       2005        NA
## 3 AD       2006        NA
## 4 AD       2007        NA
## 5 AD       2008        NA
## # … with 1,065 more rows

For your convenience, both of these datasets use the same two-letter country naming scheme as the original data. Unfortunately, out in the real world, data from different sources often use incommensurate naming schemes. Converting from one naming scheme to another is part of the data cleaning process, and it requires careful attention.

tidyverse contains various _join() functions for merging. Each of these takes as arguments the two data frames to merge, plus the names of the identifier variables to merge them on. The one I use most often is left_join(), which keeps every row from the first (“left”) data frame and merges in the columns from the second (“right”) data frame.

For example, let’s merge the latitude and longitude data for each country into clean_data.

left_join(clean_data,
          latlong_data,
          by = "country")
## # A tibble: 860 × 7
##   country year     gdp    pop unemp latitude longitude
##   <chr>   <chr>  <dbl>  <dbl> <dbl>    <dbl>     <dbl>
## 1 AD      2005    3.85 0.0789 NA        42.5      1.60
## 2 AD      2006    4.03 0.0810 NA        42.5      1.60
## 3 AD      2007    4.03 0.0827 NA        42.5      1.60
## 4 AD      2008    3.68 0.0839 NA        42.5      1.60
## 5 AE      2005  257.   4.59    3.12     23.4     53.8 
## # … with 855 more rows

Since latlong_data is country-level, the value is the same for each year. So the merged data contains redundant information. This is one reason to store data observed at different levels in different tables—with redundant observations, it is easier to make errors yet harder to catch them and fix them.

We can also merge data when the identifier is stored across multiple columns, as in the case of our country-year data. But first, a technical note.10 You might notice that the year column of clean_data is labeled <chr>, as in character data. Yet the year column of inflation_data is labeled <dbl>, as in numeric (“double”) data. We can check that by running class() on each respective column.

class(clean_data$year)
## [1] "character"
class(inflation_data$year)
## [1] "numeric"

From R’s perspective, the character string "1999" is a very different thing from the number 1999. Therefore, if we try to merge clean_data and inflation_data on the year variable, it will throw an error.

left_join(clean_data,
          inflation_data,
          by = c("country", "year"))
## Error in `left_join()`:
## ! Can't join `x$year` with `y$year` due to incompatible types.
## ℹ `x$year` is a <character>.
## ℹ `y$year` is a <double>.

To fix this, let’s use mutate() to convert the year column of clean_data to an integer. We probably should have done this in the first place—after all, having the year encoded as a character string would have thrown off plotting functions, statistical functions, or anything else where it would be more natural to treat the year like a number.

clean_data <- mutate(clean_data,
                     year = as.integer(year))
clean_data
## # A tibble: 860 × 5
##   country  year    gdp    pop unemp
##   <chr>   <int>  <dbl>  <dbl> <dbl>
## 1 AD       2005   3.85 0.0789 NA   
## 2 AD       2006   4.03 0.0810 NA   
## 3 AD       2007   4.03 0.0827 NA   
## 4 AD       2008   3.68 0.0839 NA   
## 5 AE       2005 257.   4.59    3.12
## # … with 855 more rows

Looks the same as before, except with an important difference: year is now labeled <int>.

Now we can merge the two datasets together without issue. Notice how we use a vector in the by argument to specify multiple columns to merge on.

left_join(clean_data,
          inflation_data,
          by = c("country", "year"))
## # A tibble: 860 × 6
##   country  year    gdp    pop unemp inflation
##   <chr>   <dbl>  <dbl>  <dbl> <dbl>     <dbl>
## 1 AD       2005   3.85 0.0789 NA           NA
## 2 AD       2006   4.03 0.0810 NA           NA
## 3 AD       2007   4.03 0.0827 NA           NA
## 4 AD       2008   3.68 0.0839 NA           NA
## 5 AE       2005 257.   4.59    3.12        NA
## # … with 855 more rows

You might remember that inflation_data contained some country-years not included in the original data (namely, observations from 2004). If we want the merged data to keep every row of inflation_data, even those without a match in clean_data, we can use the right_join() function.

right_join(clean_data,
           inflation_data,
           by = c("country", "year"))
## # A tibble: 1,070 × 6
##   country  year    gdp    pop unemp inflation
##   <chr>   <dbl>  <dbl>  <dbl> <dbl>     <dbl>
## 1 AD       2005   3.85 0.0789 NA           NA
## 2 AD       2006   4.03 0.0810 NA           NA
## 3 AD       2007   4.03 0.0827 NA           NA
## 4 AD       2008   3.68 0.0839 NA           NA
## 5 AE       2005 257.   4.59    3.12        NA
## # … with 1,065 more rows

One last common issue in merging is that the identifier variables have different names in the two datasets. If it’s inconvenient or infeasible to correct this by renaming the columns in one or the other, you can specify the by argument as in the following example.

inflation_data <- rename(inflation_data,
                         the_country = country,
                         the_year = year)
inflation_data
## # A tibble: 1,070 × 3
##   the_country the_year inflation
##   <chr>          <dbl>     <dbl>
## 1 AD              2004        NA
## 2 AD              2005        NA
## 3 AD              2006        NA
## 4 AD              2007        NA
## 5 AD              2008        NA
## # … with 1,065 more rows
left_join(clean_data,
          inflation_data,
          by = c("country" = "the_country", "year" = "the_year"))
## # A tibble: 860 × 6
##   country  year    gdp    pop unemp inflation
##   <chr>   <dbl>  <dbl>  <dbl> <dbl>     <dbl>
## 1 AD       2005   3.85 0.0789 NA           NA
## 2 AD       2006   4.03 0.0810 NA           NA
## 3 AD       2007   4.03 0.0827 NA           NA
## 4 AD       2008   3.68 0.0839 NA           NA
## 5 AE       2005 257.   4.59    3.12        NA
## # … with 855 more rows

References

Leek, Jeff. 2015. The Elements of Data Analytic Style. Leanpub.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10).

  5. Even if it’s data you collected yourself, that data should still have a “canonical” representation that never gets overwritten. See Leek (2015) for more on distributing your own data.↩︎

  6. But not for me, because these tools didn’t exist when I was a PhD student. Yes, I am old.↩︎

  7. More precisely, the readr functions produce output of class "tbl_df" (pronounced “tibble diff,” I’m told), which are like data frames but better. See help(package = "tibble") for what can be done with tbl_dfs.↩︎

  8. Yes, this kind of huge cross-country analysis has fallen out of favor as we’ve become more attentive to causal inference. But wait until later in the semester to worry about that!↩︎

  9. Before the pipe was built directly into R, tidyverse included its own version, %>%. So if you look on Google or Stack Overflow for code help (i.e., if you do any nontrivial coding) you might find older code—including not-yet-updated chapters of this book!—using %>% instead of |>. Except for a few edge cases that we won’t get into, both work identically.↩︎

  10. This won’t be the case if you got clean_data by loading it in directly from clean-data.csv on my website, since read_csv() will have correctly encoded year as an integer.↩︎