2  Data Wrangling

In Chapter 1, we saw how to get data into R and the basics of writing code in Quarto. Now we’re going to dig further into “wrangling” raw data into a form that’s useful for analysis.

In both academic research and industry data science, the hard part of data analysis usually isn’t the statistical modeling — it’s to acquire the data in the first place, and then to get it into a state that’s clean enough to plug into whatever statistical model you plan to use.

The main data wrangling skills you’re going to pick up here are:

  1. How to subset the data to the variables (columns) and observations (rows) that you care about for your analysis. Along the way, you’ll see how to use the most common logical and comparison operators in scientific computing:

    • equal to: ==
    • not equal to: !=
    • greater than: >
    • less than: <
    • not: !
    • and: &
    • or: |
  2. Methods to transform your data: changing variables in place, creating new variables, summarizing variables both for the full data and within subcategories of observations.

  3. How to reshape data into “longer” and “wider” formats, which is especially important when your raw data comes to you in a weird or inconvenient shape (more often than you’d like!).

  4. How to merge data sets from disparate sources together into a single data frame you can analyze.

2.1 Data: International crises

We will work with data from the International Crisis Behavior Project on countries involved in international crises. The project’s operational definition of an international crisis is a situation in which there is

… (1) a change in type and/or an increase in intensity of disruptive, that is, hostile verbal or physical interactions between two or more states, with a heightened probability of military hostilities; that, in turn, (2) destabilizes their relationship and challenges the structure of an international system — global, dominant, or subsystem.

Quotation from Michael Brecher and Jonathan Wilkenfeld, A Study of Crisis, pages 4–5.

To put it more simply, countries are in a state of crisis when something happens that raises tensions between them to the point that there is a legitimate fear of war. Every war starts as a crisis, but not all crises lead to war (and in fact the vast majority do not).

We will work with the ICB Project’s “actor level” dataset, in which each observation (row) is a crisis participant. For example, in the Cuban Missile Crisis of 1962 (crisis #196 in the ICB data), the United States, Cuba, and the Soviet Union are each separate crisis participants. The data is on my website at https://bkenkel.com/qps1/data/crises.csv.

library("tidyverse")
df_crises <- read_csv("https://bkenkel.com/qps1/data/crises.csv")

df_crises
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       1      1    365 RUS       1918        5       NA RUSSIAN …
 2 ICB2       2      2     93 NIC       1918        5       25 COSTA RI…
 3 ICB2       2      3     94 COS       1918        5       25 COSTA RI…
 4 ICB2       3      4    365 RUS       1918        6       23 RUSSIAN …
 5 ICB2       4      5    365 RUS       1918       11       18 BALTIC I…
 6 ICB2       4      6    366 EST       1918       11       18 BALTIC I…
 7 ICB2       4      7    368 LIT       1918       11       18 BALTIC I…
 8 ICB2       4      8    367 LAT       1918       11       18 BALTIC I…
 9 ICB2       5      9    315 CZE       1919        1       15 TESCHEN  
10 ICB2       5     10    290 POL       1919        1       15 TESCHEN  
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

As you can see, there are a ton of variables here. The ICB Project provides a full codebook describing each variable in the dataset. The most important variables you need to know from the start are:

  • crisname and crisno: The name of the crisis and a numeric identifier assigned to each separate crisis.

  • actor and cracid: A country involved in the crisis and a numeric identifier assigned to each different country (e.g., USA = 2, Russia/Soviet Union = 365).

  • systrgyr: The year the crisis began.

So for example, the first crisis in the data (crisno = 1) is the Russian Civil War, which only involved Russia. The second (crisno = 2) is a 1918 crisis between Nicaragua and Costa Rica, each of which gets their own row in the data.

2.2 Subsetting

2.2.1 Subsetting by column: select()

The select() function lets us reduce the data frame just to have the columns we care about, or even just reorder the columns if you’re not trying to get rid of any data but want things to be arranged differently.

2.2.1.1 Basic usage

The simplest way to use select() is to tell it what data frame you’re working with, then list the names of the columns you want to keep from it.

select(df_crises, crisname, actor, systrgyr)
# A tibble: 1,131 × 3
   crisname             actor systrgyr
   <chr>                <chr>    <dbl>
 1 RUSSIAN CIVIL WAR I  RUS       1918
 2 COSTA RICAN COUP     NIC       1918
 3 COSTA RICAN COUP     COS       1918
 4 RUSSIAN CIVIL WAR II RUS       1918
 5 BALTIC INDEPENDENCE  RUS       1918
 6 BALTIC INDEPENDENCE  EST       1918
 7 BALTIC INDEPENDENCE  LIT       1918
 8 BALTIC INDEPENDENCE  LAT       1918
 9 TESCHEN              CZE       1919
10 TESCHEN              POL       1919
# ℹ 1,121 more rows

An important thing to know about how R works: Just running a function like this won’t change your data frame permanently. To see this, after running select() like the above, go back and look at the data frame — nothing has changed:

df_crises
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       1      1    365 RUS       1918        5       NA RUSSIAN …
 2 ICB2       2      2     93 NIC       1918        5       25 COSTA RI…
 3 ICB2       2      3     94 COS       1918        5       25 COSTA RI…
 4 ICB2       3      4    365 RUS       1918        6       23 RUSSIAN …
 5 ICB2       4      5    365 RUS       1918       11       18 BALTIC I…
 6 ICB2       4      6    366 EST       1918       11       18 BALTIC I…
 7 ICB2       4      7    368 LIT       1918       11       18 BALTIC I…
 8 ICB2       4      8    367 LAT       1918       11       18 BALTIC I…
 9 ICB2       5      9    315 CZE       1919        1       15 TESCHEN  
10 ICB2       5     10    290 POL       1919        1       15 TESCHEN  
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

All the original data is still there. Unlike some other programming languages, R essentially never modifies data “in place”. If you want to overwrite df_crises with the subsetted version, you need to explicitly tell R to overwrite it by assigning the select()’ed data frame to df_crises, as in:

# I don't want to overwrite df_crises, so not actually running this
df_crises <- select(df_crises, crisname, actor, systrgyr)

Just as a matter of workflow, I usually try not to overwrite the raw data frame that I’ve read into R. That way, if I decide later on that I want more out of the original data, I don’t have to go back and load from the CSV file and clean it up again. So in this case, I would typically just create a new data frame with the subsetted data:

df_crises_reduced <- select(df_crises, crisname, actor, systrgyr)

df_crises_reduced
# A tibble: 1,131 × 3
   crisname             actor systrgyr
   <chr>                <chr>    <dbl>
 1 RUSSIAN CIVIL WAR I  RUS       1918
 2 COSTA RICAN COUP     NIC       1918
 3 COSTA RICAN COUP     COS       1918
 4 RUSSIAN CIVIL WAR II RUS       1918
 5 BALTIC INDEPENDENCE  RUS       1918
 6 BALTIC INDEPENDENCE  EST       1918
 7 BALTIC INDEPENDENCE  LIT       1918
 8 BALTIC INDEPENDENCE  LAT       1918
 9 TESCHEN              CZE       1919
10 TESCHEN              POL       1919
# ℹ 1,121 more rows

2.2.1.2 Selecting by range or pattern

Suppose you wanted everything from the crisno column through the systrgyr column. Instead of typing out each column name explicitly, you can just use a colon (:) to tell select() to grab this range of variables:

select(df_crises, crisno:systrgyr)
# A tibble: 1,131 × 5
   crisno cracno cracid actor systrgyr
    <dbl>  <dbl>  <dbl> <chr>    <dbl>
 1      1      1    365 RUS       1918
 2      2      2     93 NIC       1918
 3      2      3     94 COS       1918
 4      3      4    365 RUS       1918
 5      4      5    365 RUS       1918
 6      4      6    366 EST       1918
 7      4      7    368 LIT       1918
 8      4      8    367 LAT       1918
 9      5      9    315 CZE       1919
10      5     10    290 POL       1919
# ℹ 1,121 more rows

If you want everything except a particular variable, you can use the minus sign (-) to exclude it. Let’s see how to get rid of the useless icb2 column that just marks the dataset version:

select(df_crises, -icb2)
# A tibble: 1,131 × 94
   crisno cracno cracid actor systrgyr systrgmo systrgda crisname triggr
    <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>     <dbl>
 1      1      1    365 RUS       1918        5       NA RUSSIAN…      9
 2      2      2     93 NIC       1918        5       25 COSTA R…      7
 3      2      3     94 COS       1918        5       25 COSTA R…      4
 4      3      4    365 RUS       1918        6       23 RUSSIAN…      7
 5      4      5    365 RUS       1918       11       18 BALTIC …      6
 6      4      6    366 EST       1918       11       18 BALTIC …      9
 7      4      7    368 LIT       1918       11       18 BALTIC …      9
 8      4      8    367 LAT       1918       11       18 BALTIC …      9
 9      5      9    315 CZE       1919        1       15 TESCHEN       2
10      5     10    290 POL       1919        1       15 TESCHEN       7
# ℹ 1,121 more rows
# ℹ 85 more variables: yrtrig <dbl>, motrig <dbl>, datrig <dbl>,
#   trigent <dbl>, trigloc <dbl>, southv <dbl>, southpow <dbl>,
#   sizedu <dbl>, strcdu <dbl>, comlev <dbl>, majres <dbl>,
#   yerres <dbl>, monres <dbl>, dayres <dbl>, trgresra <dbl>,
#   crismg <dbl>, cenvio <dbl>, sevvio <dbl>, usinv <dbl>,
#   usfavr <dbl>, suinv <dbl>, sufavr <dbl>, gbinv <dbl>, …
In-class exercise

What if you wanted to exclude the range of variables from actor through yrtrig? How would you do that without typing out each column name explicitly?

#
# [Write your answer here]
#

You can use the starts_with() subfunction to select all of the columns whose name starts with “sys”, for example:

select(df_crises, starts_with("sys"))
# A tibble: 1,131 × 4
   systrgyr systrgmo systrgda syslev
      <dbl>    <dbl>    <dbl>  <dbl>
 1     1918        5       NA      2
 2     1918        5       25      1
 3     1918        5       25      1
 4     1918        6       23      2
 5     1918       11       18      2
 6     1918       11       18      1
 7     1918       11       18      1
 8     1918       11       18      1
 9     1919        1       15      1
10     1919        1       15      1
# ℹ 1,121 more rows
In-class exercise

What happens if you instead run select(df_crises, starts_with(sys)) (without the quotation marks around sys)? Why doesn’t this work?

[Write your answer here]

There are many other subfunctions you can use to select columns without naming them each explicitly. You can learn about them by looking at the help page for the select() function. To see the help page for any R function, just enter ?name_of_function at the R console, like the following:

?select
In-class exercise

Select the columns whose name has "fav" anywhere in it. Look at the ?select help page to figure out the subfunction you need to do this.

#
# [Write your answer here]
#

2.2.1.3 Reordering and renaming

As you select variables, you can rename them by using the syntax newname = oldname:

select(df_crises, year = systrgyr, name = crisname)
# A tibble: 1,131 × 2
    year name                
   <dbl> <chr>               
 1  1918 RUSSIAN CIVIL WAR I 
 2  1918 COSTA RICAN COUP    
 3  1918 COSTA RICAN COUP    
 4  1918 RUSSIAN CIVIL WAR II
 5  1918 BALTIC INDEPENDENCE 
 6  1918 BALTIC INDEPENDENCE 
 7  1918 BALTIC INDEPENDENCE 
 8  1918 BALTIC INDEPENDENCE 
 9  1919 TESCHEN             
10  1919 TESCHEN             
# ℹ 1,121 more rows

To rename individual columns without dropping or rearranging the other columns, use the rename() function instead of select().

rename(df_crises, year = systrgyr, name = crisname)
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor  year systrgmo systrgda name  triggr
   <chr>  <dbl>  <dbl>  <dbl> <chr> <dbl>    <dbl>    <dbl> <chr>  <dbl>
 1 ICB2       1      1    365 RUS    1918        5       NA RUSS…      9
 2 ICB2       2      2     93 NIC    1918        5       25 COST…      7
 3 ICB2       2      3     94 COS    1918        5       25 COST…      4
 4 ICB2       3      4    365 RUS    1918        6       23 RUSS…      7
 5 ICB2       4      5    365 RUS    1918       11       18 BALT…      6
 6 ICB2       4      6    366 EST    1918       11       18 BALT…      9
 7 ICB2       4      7    368 LIT    1918       11       18 BALT…      9
 8 ICB2       4      8    367 LAT    1918       11       18 BALT…      9
 9 ICB2       5      9    315 CZE    1919        1       15 TESC…      2
10 ICB2       5     10    290 POL    1919        1       15 TESC…      7
# ℹ 1,121 more rows
# ℹ 85 more variables: yrtrig <dbl>, motrig <dbl>, datrig <dbl>,
#   trigent <dbl>, trigloc <dbl>, southv <dbl>, southpow <dbl>,
#   sizedu <dbl>, strcdu <dbl>, comlev <dbl>, majres <dbl>,
#   yerres <dbl>, monres <dbl>, dayres <dbl>, trgresra <dbl>,
#   crismg <dbl>, cenvio <dbl>, sevvio <dbl>, usinv <dbl>,
#   usfavr <dbl>, suinv <dbl>, sufavr <dbl>, gbinv <dbl>, …

If you want certain columns to be placed first, use the relocate() function.

relocate(df_crises, crisname, actor)
# A tibble: 1,131 × 95
   crisname  actor icb2  crisno cracno cracid systrgyr systrgmo systrgda
   <chr>     <chr> <chr>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
 1 RUSSIAN … RUS   ICB2       1      1    365     1918        5       NA
 2 COSTA RI… NIC   ICB2       2      2     93     1918        5       25
 3 COSTA RI… COS   ICB2       2      3     94     1918        5       25
 4 RUSSIAN … RUS   ICB2       3      4    365     1918        6       23
 5 BALTIC I… RUS   ICB2       4      5    365     1918       11       18
 6 BALTIC I… EST   ICB2       4      6    366     1918       11       18
 7 BALTIC I… LIT   ICB2       4      7    368     1918       11       18
 8 BALTIC I… LAT   ICB2       4      8    367     1918       11       18
 9 TESCHEN   CZE   ICB2       5      9    315     1919        1       15
10 TESCHEN   POL   ICB2       5     10    290     1919        1       15
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

To rearrange the rows of the data, you can use the arrange() function. By default, it sorts the data in ascending order of the column name(s) you feed it.

# Arrange in ascending order of actor numeric ID
arrange(df_crises, cracid)
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2      59    128      2 USA       1937       12       12 PANAY IN…
 2 ICB2      88    224      2 USA       1941       11       26 PEARL HA…
 3 ICB2     104    251      2 USA       1945        5        1 TRIESTE I
 4 ICB2     108    259      2 USA       1945        8       23 AZERBAIJ…
 5 ICB2     111    265      2 USA       1946        8        7 TURKISH …
 6 ICB2     114    271      2 USA       1947        2       21 TRUMAN D…
 7 ICB2     123    293      2 USA       1948        6        7 BERLIN B…
 8 ICB2     125    295      2 USA       1948        9       23 CHINA CI…
 9 ICB2     132    308      2 USA       1950        6       25 KOREAN W…
10 ICB2     133    314      2 USA       1950        9       30 KOREAN W…
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …
# Arrange by start year, then by actor ID within year
arrange(df_crises, systrgyr, cracid)
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       2      2     93 NIC       1918        5       25 COSTA RI…
 2 ICB2       2      3     94 COS       1918        5       25 COSTA RI…
 3 ICB2       1      1    365 RUS       1918        5       NA RUSSIAN …
 4 ICB2       3      4    365 RUS       1918        6       23 RUSSIAN …
 5 ICB2       4      5    365 RUS       1918       11       18 BALTIC I…
 6 ICB2       4      6    366 EST       1918       11       18 BALTIC I…
 7 ICB2       4      8    367 LAT       1918       11       18 BALTIC I…
 8 ICB2       4      7    368 LIT       1918       11       18 BALTIC I…
 9 ICB2       8     18    200 UKG       1919        4       15 THIRD AF…
10 ICB2      11     24    220 FRN       1919       11       NA CILICIAN…
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

To instead sort in descending order, wrap the variable name in desc().

# Arrange so that the most recent crises come first
arrange(df_crises, desc(systrgyr))
# A tibble: 1,131 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2     506   1117    615 ALG       2021        1        9 US VISIT…
 2 ICB2     507   1118    710 CHN       2021        1        9 US-TAIWA…
 3 ICB2     507   1119    713 TAW       2021        1        9 US-TAIWA…
 4 ICB2     508   1120    652 SYR       2021        1       13 ISRAEL-I…
 5 ICB2     508   1121    630 IRN       2021        1       13 ISRAEL-I…
 6 ICB2     509   1122    740 JPN       2021        1       22 CHINA CO…
 7 ICB2     511   1128    703 KYR       2021        4       28 KOK-TASH…
 8 ICB2     511   1129    702 TAJ       2021        4       28 KOK-TASH…
 9 ICB2     512   1130    345 YUG       2021        9       20 KOSOVO L…
10 ICB2     512   1131    347 KOS       2021        9       20 KOSOVO L…
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

2.2.2 Subsetting by row: filter()

What if we only want the crises that started during the Cold War? That involves selecting rows from the data rather than columns, which we’ll use the filter() function for. But before we can do that, we need to learn a bit about making logical statements in R.

2.2.2.1 Comparisons and logical operators

If we make a comparison in R, it will spit out whether our statement is TRUE or FALSE. There are six logical comparisons we can make in R:

  • x == y: Is x equal to y? (Notice that we use a double equals sign, not a single one!)

  • x != y: Is x unequal to y?

  • x > y: Is x greater than y?

  • x >= y: Is x greater than or equal to y?

  • x < y: Is x less than y?

  • x <= y: Is x less than or equal to y?

As an example, let’s assign the value 1950 to the variable x.

x <- 1950

Now let’s look at the values of a few logical comparisons.

x == 1950
[1] TRUE
x == 1951
[1] FALSE
x != 1951
[1] TRUE
x < 1950
[1] FALSE
x <= 1950
[1] TRUE

If we use these with a vector of numbers, R calculates the comparison for each individual element of the vector. So instead of just looking at 1950, let’s look at each number from 1948 to 1952.

x <- 1948:1952
x
[1] 1948 1949 1950 1951 1952

We’ve got a vector of five years here, so each of our logical comparisons will now return a vector of five TRUE/FALSE values.

x == 1950
[1] FALSE FALSE  TRUE FALSE FALSE
x == 1951
[1] FALSE FALSE FALSE  TRUE FALSE
x != 1951
[1]  TRUE  TRUE  TRUE FALSE  TRUE
x < 1950
[1]  TRUE  TRUE FALSE FALSE FALSE
x <= 1950
[1]  TRUE  TRUE  TRUE FALSE FALSE

You can combine logical comparisons like this with square brackets to select elements from a vector on the basis of a logical comparison.

x[x < 1950]
[1] 1948 1949

Above we’ve compared a vector to a single number. What if we compare a vector to another vector of the same length?

The last possibility is comparing a vector to another vector of a different length. Loosely speaking, R handles this situation by repeating the shorter vector until the two match up. R’s behavior in this situation can be hard to predict, so I recommend only doing logical comparisons against either a single number, or a vector of the same length. You can check the length of a vector using the length() function, FYI.

y <- c(1948, 1948, 1950, 1950, 1946)
y
[1] 1948 1948 1950 1950 1946
x == y
[1]  TRUE FALSE  TRUE FALSE FALSE
x != y
[1] FALSE  TRUE FALSE  TRUE  TRUE
x > y
[1] FALSE  TRUE FALSE  TRUE  TRUE
x >= y
[1] TRUE TRUE TRUE TRUE TRUE

The exclamation point means “not”. You can use it to turn TRUE into FALSE and vice versa. Just make sure that you put parentheses around the whole statement that you’re trying to negate.

!(x >= y)
[1] FALSE FALSE FALSE FALSE FALSE

Sometimes we will want to string together multiple comparisons. There are two ways to combine comparisons.

  • and: a & b is TRUE when a is TRUE and b is also TRUE. If either or both are FALSE, then so is a & b.

  • or: a | b is TRUE when a is TRUE, or b is TRUE, or both. The only way for a | b to be FALSE is for both a and b to be FALSE.

a b a & b (and) a | b (or)
TRUE TRUE TRUE TRUE
TRUE FALSE FALSE TRUE
FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE
# Reminder:
# x = 1948, 1949, 1950, 1951, 1952
# y = 1948, 1948, 1950, 1950, 1946

x > y & x < 1950
[1] FALSE  TRUE FALSE FALSE FALSE
x > y | x < 1950
[1]  TRUE  TRUE FALSE  TRUE  TRUE

Be aware that & and | only operate on TRUE and FALSE values, not on other values. There are situations where you might use “and” or “or” in an English sentence, but where & and | can’t just be substituted into your R code. For example, think about trying to code up the logical statement “x or y is less than 100.”

x <- c(75, 100, 125, 150)
y <- c(100, 90, 150, 99)

# Won't work correctly, because x and y aren't TRUE/FALSE
(x | y) < 100
[1] TRUE TRUE TRUE TRUE
# Will work correctly, because "x < 100" and "y < 100" are TRUE/FALSE
(x < 100) | (y < 100)
[1]  TRUE  TRUE FALSE  TRUE

Completely optional aside for those who are morbidly curious why we get all TRUE values when we run (x | y) < 100: When you plug numbers into a logical operator, any non-zero number evaluates to TRUE. Since every element of x and y is nonzero, x | y evaluates to TRUE, TRUE, TRUE, TRUE. Then, when you plug logical values into a numerical comparison, TRUE values are treated like 1 and FALSE values like 0. So R converts each TRUE to 1, observes that 1 is less than 100, and again returns TRUE for each one.

One final minor note: If you look for R help on the Internet, you’ll sometimes see people use && and ||. What’s the difference between & and &&, or between | and ||? The doubled versions only work on a single TRUE/FALSE statement. This distinction is important in programming situations where you need to make sure you’re only working with a single value. That won’t typically come up in this course, so to make our lives easier we’ll just always use the single versions.

x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 4, 4, 8)

x < y | x > y   # returns c(FALSE, TRUE, TRUE, FALSE, TRUE)
[1] FALSE  TRUE  TRUE FALSE  TRUE
x < y || x > y  # won't run, will give error message
Error in x < y || x > y: 'length = 5' in coercion to 'logical(1)'

2.2.2.2 The filter() function

Now that we’re familiar with comparisons and logical operators, let’s return to the crisis data to reduce it down to Cold War–era observations.

Remember that we can pull a single column from a data frame using the dollar sign. And we can use square brackets on a data frame to select particular observations from it. So we could subset to the Cold War observations that way.

df_crises[df_crises$systrgyr >= 1946 & df_crises$systrgyr <= 1989, ]
# A tibble: 585 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2     110    263    365 RUS       1946        6       30 COMMUNIS…
 2 ICB2     111    264    640 TUR       1946        8        7 TURKISH …
 3 ICB2     111    265      2 USA       1946        8        7 TURKISH …
 4 ICB2     112    266    350 GRC       1946       11       13 GREEK CI…
 5 ICB2     113    267    365 RUS       1947        2       10 COMMUNIS…
 6 ICB2     113    268    310 HUN       1947        2       10 COMMUNIS…
 7 ICB2     114    269    350 GRC       1947        2       21 TRUMAN D…
 8 ICB2     114    270    640 TUR       1947        2       21 TRUMAN D…
 9 ICB2     114    271      2 USA       1947        2       21 TRUMAN D…
10 ICB2     115    272    365 RUS       1947        7        3 MARSHALL…
# ℹ 575 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

That works, but it’s a bit ugly and unwieldy. The filter() function gives us an easier way to do the same thing.

filter(df_crises, systrgyr >= 1946 & systrgyr <= 1989)
# A tibble: 580 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2     110    263    365 RUS       1946        6       30 COMMUNIS…
 2 ICB2     111    264    640 TUR       1946        8        7 TURKISH …
 3 ICB2     111    265      2 USA       1946        8        7 TURKISH …
 4 ICB2     112    266    350 GRC       1946       11       13 GREEK CI…
 5 ICB2     113    267    365 RUS       1947        2       10 COMMUNIS…
 6 ICB2     113    268    310 HUN       1947        2       10 COMMUNIS…
 7 ICB2     114    269    350 GRC       1947        2       21 TRUMAN D…
 8 ICB2     114    270    640 TUR       1947        2       21 TRUMAN D…
 9 ICB2     114    271      2 USA       1947        2       21 TRUMAN D…
10 ICB2     115    272    365 RUS       1947        7        3 MARSHALL…
# ℹ 570 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

When dealing with character columns, the greater/lesser comparisons no longer make sense, but we can still use the equality and inequality comparisons. Just make sure to use quotation marks around whatever you’re comparing to!

filter(df_crises, actor == "USA")
# A tibble: 78 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2      59    128      2 USA       1937       12       12 PANAY IN…
 2 ICB2      88    224      2 USA       1941       11       26 PEARL HA…
 3 ICB2     104    251      2 USA       1945        5        1 TRIESTE I
 4 ICB2     108    259      2 USA       1945        8       23 AZERBAIJ…
 5 ICB2     111    265      2 USA       1946        8        7 TURKISH …
 6 ICB2     114    271      2 USA       1947        2       21 TRUMAN D…
 7 ICB2     123    293      2 USA       1948        6        7 BERLIN B…
 8 ICB2     125    295      2 USA       1948        9       23 CHINA CI…
 9 ICB2     132    308      2 USA       1950        6       25 KOREAN W…
10 ICB2     133    314      2 USA       1950        9       30 KOREAN W…
# ℹ 68 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …
filter(df_crises, actor != "RUS")
# A tibble: 1,079 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       2      2     93 NIC       1918        5       25 COSTA RI…
 2 ICB2       2      3     94 COS       1918        5       25 COSTA RI…
 3 ICB2       4      6    366 EST       1918       11       18 BALTIC I…
 4 ICB2       4      7    368 LIT       1918       11       18 BALTIC I…
 5 ICB2       4      8    367 LAT       1918       11       18 BALTIC I…
 6 ICB2       5      9    315 CZE       1919        1       15 TESCHEN  
 7 ICB2       5     10    290 POL       1919        1       15 TESCHEN  
 8 ICB2       6     11    310 HUN       1919        3       20 HUNGARIA…
 9 ICB2       6     12    315 CZE       1919        3       20 HUNGARIA…
10 ICB2       6     13    310 HUN       1919        3       20 HUNGARIA…
# ℹ 1,069 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

What if you wanted to get all of the observations involving the United States (USA), Russia (RUS), or the United Kingdom (UKG)? One kind of tedious way to go about it would be to string many “or” statements together.

filter(df_crises, actor == "USA" | actor == "RUS" | actor == "UKG")
# A tibble: 178 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       1      1    365 RUS       1918        5       NA RUSSIAN …
 2 ICB2       3      4    365 RUS       1918        6       23 RUSSIAN …
 3 ICB2       4      5    365 RUS       1918       11       18 BALTIC I…
 4 ICB2       8     18    200 UKG       1919        4       15 THIRD AF…
 5 ICB2       9     19    365 RUS       1919        4       20 FINNISH/…
 6 ICB2      10     21    365 RUS       1919        4       30 BESSARAB…
 7 ICB2      13     27    365 RUS       1920        4       25 POLISH/R…
 8 ICB2      14     29    365 RUS       1920        5       66 PERSIAN …
 9 ICB2      26     56    200 UKG       1922        9       23 CHANAK   
10 ICB2      31     66    200 UKG       1924        9       29 MOSUL LA…
# ℹ 168 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …

The easier way — especially if you’re matching a large number of values — is to use the %in% operator in R. When you run x %in% y, it gives you a TRUE or FALSE for each entry of x: TRUE if it matches at least one entry of y, and FALSE otherwise.

filter(df_crises, actor %in% c("USA", "RUS", "UKG"))
# A tibble: 178 × 95
   icb2  crisno cracno cracid actor systrgyr systrgmo systrgda crisname 
   <chr>  <dbl>  <dbl>  <dbl> <chr>    <dbl>    <dbl>    <dbl> <chr>    
 1 ICB2       1      1    365 RUS       1918        5       NA RUSSIAN …
 2 ICB2       3      4    365 RUS       1918        6       23 RUSSIAN …
 3 ICB2       4      5    365 RUS       1918       11       18 BALTIC I…
 4 ICB2       8     18    200 UKG       1919        4       15 THIRD AF…
 5 ICB2       9     19    365 RUS       1919        4       20 FINNISH/…
 6 ICB2      10     21    365 RUS       1919        4       30 BESSARAB…
 7 ICB2      13     27    365 RUS       1920        4       25 POLISH/R…
 8 ICB2      14     29    365 RUS       1920        5       66 PERSIAN …
 9 ICB2      26     56    200 UKG       1922        9       23 CHANAK   
10 ICB2      31     66    200 UKG       1924        9       29 MOSUL LA…
# ℹ 168 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
#   datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
#   southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
#   majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
#   trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
#   usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …
In-class exercise

How can you filter down to the observations that are not those three countries?

#
# [Write your answer here]
#

2.2.3 “Piping” commands together

Here’s a pretty common way to start off your data analysis workflow:

  1. Load your raw data into R.
  2. Reduce the columns down to the ones you will use in your analysis.
  3. Filter out the rows that aren’t relevant to your analysis.

Remember that R doesn’t save what you do to a data frame unless you explicitly assign the result to a variable. This means that a data cleaning process might involve a lot of intermediate assignments that you never end up using again.

Imagine we want to extract the name of the crisis, the name and numeric ID of the country involved, the year the crisis began, the number of days it lasted (trgterra), and the outcome for the given actor (outcom) Along the way we’ll rename the variables to be less ugly. And furthermore imagine we only want Cold War–era observations, and only for the USA, Russia, and the United Kingdom. We could run the following sequence of commands:

df_crises_cw <- select(
  df_crises,
  crisis_name = crisname,
  actor_name = actor,
  actor_id = cracid,
  year = systrgyr,
  duration = trgterra,
  outcome = outcom
)
df_crises_cw <- filter(df_crises_cw, year >= 1946 & year <= 1989)
df_crises_cw <- filter(df_crises_cw, actor_name %in% c("USA", "RUS", "UKG"))
df_crises_cw
# A tibble: 87 × 6
   crisis_name          actor_name actor_id  year duration outcome
   <chr>                <chr>         <dbl> <dbl>    <dbl>   <dbl>
 1 COMMUNISM IN POLAND  RUS             365  1946      204       1
 2 TURKISH STRAITS      USA               2  1946       81       1
 3 COMMUNISM IN HUNGARY RUS             365  1947      112       1
 4 TRUMAN DOCTRINE      USA               2  1947       91       1
 5 MARSHALL PLAN        RUS             365  1947        9       1
 6 COMMUNISM IN CZECH.  RUS             365  1948       13       1
 7 BERLIN BLOCKADE      RUS             365  1948      340       4
 8 BERLIN BLOCKADE      UKG             200  1948      323       1
 9 BERLIN BLOCKADE      USA               2  1948      323       1
10 CHINA CIVIL WAR      USA               2  1948       34       4
# ℹ 77 more rows

That certainly works fine to give us what we want, but it involved a lot of repetitive typing and retyping of df_crises_cw <- function_name(df_crises_cw, ...). And that’s only with fairly simple subsetting operations.

The pipe operator in R lets us chain together commands like this much more succinctly. x |> function_name(...) is equivalent to function_name(x, ...) in R.

x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
mean(x)
[1] 5.5
x |> mean()
[1] 5.5

On its own the pipe might seem dumb, like just more typing for the same result. Weren’t we trying to reduce the amount of typing? The pipe is mainly useful when we are chaining many commands together.

# Chained commands
y <- first_command(x, arg1, arg2)
y <- second_command(y, arg3)
y <- third_command(y, arg4, arg5)
fourth_command(y, arg6)

# Same thing, but piped
x |>
  first_command(arg1, arg2) |>
  second_command(arg3) |>
  third_command(arg4, arg5) |>
  fourth_command(arg6)

Let’s see the pipe in action in the context of our data cleaning example.

df_crises |>
  select(
    crisis_name = crisname,
    actor_name = actor,
    actor_id = cracid,
    year = systrgyr,
    duration = trgterra,
    outcome = outcom
  ) |>
  filter(year >= 1946 & year <= 1989) |>
  filter(actor_name %in% c("USA", "RUS", "UKG"))
# A tibble: 87 × 6
   crisis_name          actor_name actor_id  year duration outcome
   <chr>                <chr>         <dbl> <dbl>    <dbl>   <dbl>
 1 COMMUNISM IN POLAND  RUS             365  1946      204       1
 2 TURKISH STRAITS      USA               2  1946       81       1
 3 COMMUNISM IN HUNGARY RUS             365  1947      112       1
 4 TRUMAN DOCTRINE      USA               2  1947       91       1
 5 MARSHALL PLAN        RUS             365  1947        9       1
 6 COMMUNISM IN CZECH.  RUS             365  1948       13       1
 7 BERLIN BLOCKADE      RUS             365  1948      340       4
 8 BERLIN BLOCKADE      UKG             200  1948      323       1
 9 BERLIN BLOCKADE      USA               2  1948      323       1
10 CHINA CIVIL WAR      USA               2  1948       34       4
# ℹ 77 more rows

It might take a bit getting used to, but I personally find it much easier to use the pipe when going through the data cleaning process. One thing to keep in mind, just like when you run these commands the regular way, they won’t change your data frame in place. You need to explicitly assign the output to a variable name.

df_crises_cw <- df_crises |>
  select(
    crisis_name = crisname,
    actor_name = actor,
    actor_id = cracid,
    year = systrgyr,
    duration = trgterra,
    outcome = outcom
  ) |>
  filter(year >= 1946 & year <= 1989) |>
  filter(actor_name %in% c("USA", "RUS", "UKG"))

Final note on the pipe. Until pretty recently, the pipe was %>% instead of |>. So you’ll see a lot of code on the Internet that uses %>%. Don’t worry, it works exactly the same as |> (except in some edge cases that won’t come up for us).

In-class exercise

Use the pipe to create a data frame called df_long_crises with the following specs:

  • Same variables as in my example, except with the numeric crisis ID instead of the crisis name
  • Only observations where the actor was involved in the crisis for 100+ days
  • Exclude observations of Russia from 1990 onward
#
# [Write your answer here]
#

2.3 Transforming

2.3.1 Changing variables and making new ones

The mutate() function lets you change a column in the dataframe, or create a new column using the values in existing columns. For example, right now the duration of each crisis is measured in days. We can add new columns to measure duration in weeks and years instead. As always, if we want our new columns to persist instead of vanishing into the ether, we need to assign the output to a variable — either overwriting df_crises_cw or creating a new data frame.

df_crises_cw <- df_crises_cw |>
  mutate(
    duration_weeks = duration / 7,
    duration_years = duration / 365
  )

df_crises_cw
# A tibble: 87 × 8
   crisis_name actor_name actor_id  year duration outcome duration_weeks
   <chr>       <chr>         <dbl> <dbl>    <dbl>   <dbl>          <dbl>
 1 COMMUNISM … RUS             365  1946      204       1          29.1 
 2 TURKISH ST… USA               2  1946       81       1          11.6 
 3 COMMUNISM … RUS             365  1947      112       1          16   
 4 TRUMAN DOC… USA               2  1947       91       1          13   
 5 MARSHALL P… RUS             365  1947        9       1           1.29
 6 COMMUNISM … RUS             365  1948       13       1           1.86
 7 BERLIN BLO… RUS             365  1948      340       4          48.6 
 8 BERLIN BLO… UKG             200  1948      323       1          46.1 
 9 BERLIN BLO… USA               2  1948      323       1          46.1 
10 CHINA CIVI… USA               2  1948       34       4           4.86
# ℹ 77 more rows
# ℹ 1 more variable: duration_years <dbl>

There’s a very useful function for mutating called if_else(). You invoke it as if_else(condition, a, b):

  • condition: vector of TRUE and FALSE values
  • a: value, or vector of values, to use for entries where condition is TRUE
  • b: value, or vector of values, to use for entries where condition is FALSE
# Simple if_else example
x <- c(1, 2, 3, 4)
y <- c(-100, -200, 300, 400)
if_else(x > y, x, y)
[1]   1   2 300 400
if_else(x > y, "ice cream", "pizza")
[1] "ice cream" "ice cream" "pizza"     "pizza"    
In-class exercise

Use mutate() and if_else() together to add another column to df_crises_cw called crisis_length. This variable should say "long" if the crisis is 100 or more days, and "short" otherwise.

#
# [Write your answer here]
#

The “outcome” variable here is recorded as numbers, which stand for different categories of crisis outcomes. The ICB codebook explains which category is associated with each numerical value.

  1. Victory: this country achieved its basic goals.
  2. Compromise: this country partly achieved its basic goals.
  3. Stalemate: there was no major change in the situation.
  4. Defeat: this country did not achieve its basic goals, and instead yielded or surrendered.
  5. Other.

So for example, looking at the rows associated with the Berlin Blockade, we see that the USA and United Kingdom experienced victory, while the Soviet Union was defeated.

df_crises_cw |>
  filter(crisis_name == "BERLIN BLOCKADE") |>
  select(actor_name, outcome, everything())
# A tibble: 3 × 8
  actor_name outcome crisis_name  actor_id  year duration duration_weeks
  <chr>        <dbl> <chr>           <dbl> <dbl>    <dbl>          <dbl>
1 RUS              4 BERLIN BLOC…      365  1948      340           48.6
2 UKG              1 BERLIN BLOC…      200  1948      323           46.1
3 USA              1 BERLIN BLOC…        2  1948      323           46.1
# ℹ 1 more variable: duration_years <dbl>

We could use a sequence of 5 if_else() statements to convert the outcome variable in terms of category names instead of (hard to remember) numeric identifiers. But it’s easier to use case_when():

df_crises_cw <- df_crises_cw |>
  mutate(
    outcome = case_when(
      outcome == 1 ~ "victory",
      outcome == 2 ~ "compromise",
      outcome == 3 ~ "stalemate",
      outcome == 4 ~ "defeat",
      outcome == 5 ~ "other"
    )
  )

df_crises_cw
# A tibble: 87 × 8
   crisis_name actor_name actor_id  year duration outcome duration_weeks
   <chr>       <chr>         <dbl> <dbl>    <dbl> <chr>            <dbl>
 1 COMMUNISM … RUS             365  1946      204 victory          29.1 
 2 TURKISH ST… USA               2  1946       81 victory          11.6 
 3 COMMUNISM … RUS             365  1947      112 victory          16   
 4 TRUMAN DOC… USA               2  1947       91 victory          13   
 5 MARSHALL P… RUS             365  1947        9 victory           1.29
 6 COMMUNISM … RUS             365  1948       13 victory           1.86
 7 BERLIN BLO… RUS             365  1948      340 defeat           48.6 
 8 BERLIN BLO… UKG             200  1948      323 victory          46.1 
 9 BERLIN BLO… USA               2  1948      323 victory          46.1 
10 CHINA CIVI… USA               2  1948       34 defeat            4.86
# ℹ 77 more rows
# ℹ 1 more variable: duration_years <dbl>

There’s another function called case_match() that would let you do this same thing with even less typing. Go ahead, look it up, and use it if you like! But case_when() is a bit more flexible, so to avoid confusion I’m only going to use it.

2.3.2 Grouping and summarizing

The summarize() function lets us calculate summaries of columns in the data. For example, let’s find the shortest, longest, and average crisis durations among the USA/UK/USSR during the Cold War.

df_crises_cw |>
  summarize(
    shortest = min(duration),
    longest = max(duration),
    average = mean(duration)
  )
# A tibble: 1 × 3
  shortest longest average
     <dbl>   <dbl>   <dbl>
1        2     486    99.2
# Seemingly easier way
min(df_crises_cw$duration)
[1] 2
max(df_crises_cw$duration)
[1] 486
mean(df_crises_cw$duration)
[1] 99.24138

The real power of summarize() comes in when you combine it with group_by() to calculate summaries for subcategories of the data. For example, say we wanted to calculate the shortest, longest, and average crisis duration separately for the USA, UK, and USSR. That would be tedious to do the “normal” way:

# one of nine long expressions you'd have to type out
min(df_crises_cw$duration[df_crises_cw$actor_name == "USA"])
[1] 2

group_by |> summarize makes this much easier.

df_crises_cw |>
  group_by(actor_name) |>
  summarize(
    shortest = min(duration),
    longest = max(duration),
    average = mean(duration)
  )
# A tibble: 3 × 4
  actor_name shortest longest average
  <chr>         <dbl>   <dbl>   <dbl>
1 RUS               3     486   124. 
2 UKG              12     323   106  
3 USA               2     433    84.6
In-class exercise

In which year of the Cold War did the USA spend the most total days in crises? Use filter(), group_by(), summarize(), and arrange() together to figure it out.

#
# [Write your answer here]
#

What if you wanted to count the number of crises involving each country? There’s a special function called n() that counts the number of observations within each group.

df_crises_cw |>
  group_by(actor_name) |>
  summarize(number = n())
# A tibble: 3 × 2
  actor_name number
  <chr>       <int>
1 RUS            24
2 UKG            15
3 USA            48

You can group on multiple variables at once. The number of rows in the result will be the number of unique combinations of values across the grouping variables.

df_crises_cw |>
  group_by(actor_name, outcome) |>
  summarize(
    number = n(),
    shortest = min(duration),
    longest = max(duration)
  )
`summarise()` has grouped output by 'actor_name'. You can override
using the `.groups` argument.
# A tibble: 13 × 5
# Groups:   actor_name [3]
   actor_name outcome    number shortest longest
   <chr>      <chr>       <int>    <dbl>   <dbl>
 1 RUS        compromise      3        8     222
 2 RUS        defeat          5        3     340
 3 RUS        stalemate       3        9     260
 4 RUS        victory        13        9     486
 5 UKG        compromise      2      183     192
 6 UKG        defeat          2       66     104
 7 UKG        other           1       17      17
 8 UKG        stalemate       2      112     293
 9 UKG        victory         8       12     323
10 USA        compromise      7        4     433
11 USA        defeat          6       10     336
12 USA        stalemate       8        5     293
13 USA        victory        27        2     323

2.4 Reshaping

To get practice with reshaping and merging data, we’re going to bring in a second dataset. military.csv, stored at https://bkenkel.com/qps1/data/military.csv, contains data on military spending and force sizes by country from 1816 to 2019.

df_military <- read_csv("https://bkenkel.com/qps1/data/military.csv")

df_military
# A tibble: 31,902 × 5
   ccode stateabb  year mil_indicator amount
   <dbl> <chr>    <dbl> <chr>          <dbl>
 1     2 USA       1816 spending        3823
 2     2 USA       1816 personnel         17
 3     2 USA       1817 spending        2466
 4     2 USA       1817 personnel         15
 5     2 USA       1818 spending        1910
 6     2 USA       1818 personnel         14
 7     2 USA       1819 spending        2301
 8     2 USA       1819 personnel         13
 9     2 USA       1820 spending        1556
10     2 USA       1820 personnel         15
# ℹ 31,892 more rows

The amounts here are in thousands of dollars and people respectively. For example, looking at the first two rows, in 1816 the USA spent $3.8 million on the military, which consisted of 17,000 soldiers.

How can we reshape this data so that there’s a single row for the USA in 1816, a single row for the USA in 1817, and so on? As of now the data is “longer” than we want it to be, and we’d like it to be “wider”. Hence we’ll use the function pivot_wider(). Here’s how we use it.

df_military_wide <- df_military |>
  pivot_wider(names_from = mil_indicator, values_from = amount)

df_military_wide
# A tibble: 15,951 × 5
   ccode stateabb  year spending personnel
   <dbl> <chr>    <dbl>    <dbl>     <dbl>
 1     2 USA       1816     3823        17
 2     2 USA       1817     2466        15
 3     2 USA       1818     1910        14
 4     2 USA       1819     2301        13
 5     2 USA       1820     1556        15
 6     2 USA       1821     1612        11
 7     2 USA       1822     1079        10
 8     2 USA       1823     1170        11
 9     2 USA       1824     1261        11
10     2 USA       1825     1336        11
# ℹ 15,941 more rows

Making a data frame “wider” means taking data that’s currently stored row by row, and spreading it out across columns instead. The names_from argument tells pivot_wider() which column of the original data frame contains the names of the new columns we want to create. The values_from argument tells it which column contains the values that we want to put into each column.

In-class exercise

Go back to our data on the USA, UK, and USSR in Cold War crises. Use pivot_wider() to create a new data frame where each row is a unique crisis name, and there are columns for the USA, UK, and USSR telling that country’s outcome from the given crisis. Were there any crises that ended in victory for both the USA and the USSR?

#
# [Write your answer here]
#

The opposite of pivot_wider() is, naturally enough, pivot_longer(). For example, imagine your raw data had a single row for each country, then a separate column for its military spending each year.

df_too_wide <- tibble(
  ccode = c(2, 200, 365),
  stateabb = c("USA", "UKG", "RUS"),
  mil1816 = c(17, 255, 800),
  mil1817 = c(15, 190, 700),
  mil1818 = c(14, 173, 600)
)

df_too_wide
# A tibble: 3 × 5
  ccode stateabb mil1816 mil1817 mil1818
  <dbl> <chr>      <dbl>   <dbl>   <dbl>
1     2 USA           17      15      14
2   200 UKG          255     190     173
3   365 RUS          800     700     600

To use pivot_longer(), we have to tell it (1) the set of columns whose values we want to collapse into a single column, (2) what we want to name the column that stores the column names from the original data, and (3) what we want to name the column that stores the values.

df_too_wide |>
  pivot_longer(
    starts_with("mil"),
    names_to = "year",
    values_to = "personnel"
  )
# A tibble: 9 × 4
  ccode stateabb year    personnel
  <dbl> <chr>    <chr>       <dbl>
1     2 USA      mil1816        17
2     2 USA      mil1817        15
3     2 USA      mil1818        14
4   200 UKG      mil1816       255
5   200 UKG      mil1817       190
6   200 UKG      mil1818       173
7   365 RUS      mil1816       800
8   365 RUS      mil1817       700
9   365 RUS      mil1818       600

FYI: If you wanted to turn the year column there into a proper year, you could mutate() it to remove the “mil” text and then convert it to a number.

df_too_wide |>
  pivot_longer(
    starts_with("mil"),
    names_to = "year",
    values_to = "personnel"
  ) |>
  mutate(
    year = str_replace(year, "mil", ""),
    year = as.numeric(year)
  )
# A tibble: 9 × 4
  ccode stateabb  year personnel
  <dbl> <chr>    <dbl>     <dbl>
1     2 USA       1816        17
2     2 USA       1817        15
3     2 USA       1818        14
4   200 UKG       1816       255
5   200 UKG       1817       190
6   200 UKG       1818       173
7   365 RUS       1816       800
8   365 RUS       1817       700
9   365 RUS       1818       600

How did I know to use the str_replace() function there? The stringr package contains tons of helpful functions for manipulating text in R. You can see everything it does by googling the documentation for it, or by entering help(package = "stringr") at the R console.

2.5 Merging

Data work in political science often involves combining data from multiple sources. For example, here we have one dataset on country participation in international crises, and another on the size of their militaries. How can we put these together?

The first thing to do is decide which dataset should be the “base” that we merge other data sources into. What’s appropriate depends on the question you’re trying to answer. For our purposes here, let’s imagine we want to know whether crisis winners tend to spend more on their military than crisis losers. That means we only care about the countries that actually ended up in crises, so we should use the crisis data as our base and merge the military size data into that.

The next thing we need to make sure of is that we have some way to identify the same observation across datasets. Here we are going to match observations by the numerical “country code” and by year. First we’ll edit the military size data so that the country code indicator has the same name as in df_crises_cw. (We’ll also drop the country name variable, since we already have that in the original data.)

df_military_wide <- df_military_wide |>
  rename(actor_id = ccode) |>
  select(-stateabb)

df_military_wide
# A tibble: 15,951 × 4
   actor_id  year spending personnel
      <dbl> <dbl>    <dbl>     <dbl>
 1        2  1816     3823        17
 2        2  1817     2466        15
 3        2  1818     1910        14
 4        2  1819     2301        13
 5        2  1820     1556        15
 6        2  1821     1612        11
 7        2  1822     1079        10
 8        2  1823     1170        11
 9        2  1824     1261        11
10        2  1825     1336        11
# ℹ 15,941 more rows

Next we’ll use left_join() to merge the datasets together. We need to tell this function three things:

  • The data frame to use as our base.
  • The data frame that we want to merge values in from.
  • The column names to match observations by.
df_crisis_and_mil <- left_join(
  df_crises_cw,
  df_military_wide,
  by = c("actor_id", "year")
)

df_crisis_and_mil
# A tibble: 87 × 10
   crisis_name actor_name actor_id  year duration outcome duration_weeks
   <chr>       <chr>         <dbl> <dbl>    <dbl> <chr>            <dbl>
 1 COMMUNISM … RUS             365  1946      204 victory          29.1 
 2 TURKISH ST… USA               2  1946       81 victory          11.6 
 3 COMMUNISM … RUS             365  1947      112 victory          16   
 4 TRUMAN DOC… USA               2  1947       91 victory          13   
 5 MARSHALL P… RUS             365  1947        9 victory           1.29
 6 COMMUNISM … RUS             365  1948       13 victory           1.86
 7 BERLIN BLO… RUS             365  1948      340 defeat           48.6 
 8 BERLIN BLO… UKG             200  1948      323 victory          46.1 
 9 BERLIN BLO… USA               2  1948      323 victory          46.1 
10 CHINA CIVI… USA               2  1948       34 defeat            4.86
# ℹ 77 more rows
# ℹ 3 more variables: duration_years <dbl>, spending <dbl>,
#   personnel <dbl>

Now we can use group_by() and summarize() to calculate the average size and spending of the military among crisis winners, crisis losers, and so on.

df_crisis_and_mil |>
  group_by(outcome) |>
  summarize(avg_spending = mean(spending),
            avg_personnel = mean(personnel))
# A tibble: 5 × 3
  outcome    avg_spending avg_personnel
  <chr>             <dbl>         <dbl>
1 compromise    57787476.         2436.
2 defeat        57416910.         2638.
3 other          4409778           839 
4 stalemate     81993296          2714.
5 victory       65411474.         2568.
In-class exercise

Winners have smaller militaries than losers on average. Does this mean that increasing the size of your military actually hurts your ability to win a crisis? Why or why not? How could you dig deeper into the data to better figure out whether this is the case?

[Write your answer here]