In Chapter 1, we saw how to get data into R and the basics of writing code in Quarto. Now we’re going to dig further into “wrangling” raw data into a form that’s useful for analysis.
In both academic research and industry data science, the hard part of data analysis usually isn’t the statistical modeling — it’s to acquire the data in the first place, and then to get it into a state that’s clean enough to plug into whatever statistical model you plan to use.
The main data wrangling skills you’re going to pick up here are:
How to subset the data to the variables (columns) and observations (rows) that you care about for your analysis. Along the way, you’ll see how to use the most common logical and comparison operators in scientific computing:
equal to: ==
not equal to: !=
greater than: >
less than: <
not: !
and: &
or: |
Methods to transform your data: changing variables in place, creating new variables, summarizing variables both for the full data and within subcategories of observations.
How to reshape data into “longer” and “wider” formats, which is especially important when your raw data comes to you in a weird or inconvenient shape (more often than you’d like!).
How to merge data sets from disparate sources together into a single data frame you can analyze.
2.1 Data: International crises
We will work with data from the International Crisis Behavior Project on countries involved in international crises. The project’s operational definition of an international crisis is a situation in which there is
… (1) a change in type and/or an increase in intensity of disruptive, that is, hostile verbal or physical interactions between two or more states, with a heightened probability of military hostilities; that, in turn, (2) destabilizes their relationship and challenges the structure of an international system — global, dominant, or subsystem.
Quotation from Michael Brecher and Jonathan Wilkenfeld, A Study of Crisis, pages 4–5.
To put it more simply, countries are in a state of crisis when something happens that raises tensions between them to the point that there is a legitimate fear of war. Every war starts as a crisis, but not all crises lead to war (and in fact the vast majority do not).
We will work with the ICB Project’s “actor level” dataset, in which each observation (row) is a crisis participant. For example, in the Cuban Missile Crisis of 1962 (crisis #196 in the ICB data), the United States, Cuba, and the Soviet Union are each separate crisis participants. The data is on my website at https://bkenkel.com/qps1/data/crises.csv.
As you can see, there are a ton of variables here. The ICB Project provides a full codebook describing each variable in the dataset. The most important variables you need to know from the start are:
crisname and crisno: The name of the crisis and a numeric identifier assigned to each separate crisis.
actor and cracid: A country involved in the crisis and a numeric identifier assigned to each different country (e.g., USA = 2, Russia/Soviet Union = 365).
systrgyr: The year the crisis began.
So for example, the first crisis in the data (crisno = 1) is the Russian Civil War, which only involved Russia. The second (crisno = 2) is a 1918 crisis between Nicaragua and Costa Rica, each of which gets their own row in the data.
2.2 Subsetting
2.2.1 Subsetting by column: select()
The select() function lets us reduce the data frame just to have the columns we care about, or even just reorder the columns if you’re not trying to get rid of any data but want things to be arranged differently.
2.2.1.1 Basic usage
The simplest way to use select() is to tell it what data frame you’re working with, then list the names of the columns you want to keep from it.
select(df_crises, crisname, actor, systrgyr)
# A tibble: 1,131 × 3
crisname actor systrgyr
<chr> <chr> <dbl>
1 RUSSIAN CIVIL WAR I RUS 1918
2 COSTA RICAN COUP NIC 1918
3 COSTA RICAN COUP COS 1918
4 RUSSIAN CIVIL WAR II RUS 1918
5 BALTIC INDEPENDENCE RUS 1918
6 BALTIC INDEPENDENCE EST 1918
7 BALTIC INDEPENDENCE LIT 1918
8 BALTIC INDEPENDENCE LAT 1918
9 TESCHEN CZE 1919
10 TESCHEN POL 1919
# ℹ 1,121 more rows
An important thing to know about how R works: Just running a function like this won’t change your data frame permanently. To see this, after running select() like the above, go back and look at the data frame — nothing has changed:
All the original data is still there. Unlike some other programming languages, R essentially never modifies data “in place”. If you want to overwrite df_crises with the subsetted version, you need to explicitly tell R to overwrite it by assigning the select()’ed data frame to df_crises, as in:
# I don't want to overwrite df_crises, so not actually running thisdf_crises <-select(df_crises, crisname, actor, systrgyr)
Just as a matter of workflow, I usually try not to overwrite the raw data frame that I’ve read into R. That way, if I decide later on that I want more out of the original data, I don’t have to go back and load from the CSV file and clean it up again. So in this case, I would typically just create a new data frame with the subsetted data:
# A tibble: 1,131 × 3
crisname actor systrgyr
<chr> <chr> <dbl>
1 RUSSIAN CIVIL WAR I RUS 1918
2 COSTA RICAN COUP NIC 1918
3 COSTA RICAN COUP COS 1918
4 RUSSIAN CIVIL WAR II RUS 1918
5 BALTIC INDEPENDENCE RUS 1918
6 BALTIC INDEPENDENCE EST 1918
7 BALTIC INDEPENDENCE LIT 1918
8 BALTIC INDEPENDENCE LAT 1918
9 TESCHEN CZE 1919
10 TESCHEN POL 1919
# ℹ 1,121 more rows
2.2.1.2 Selecting by range or pattern
Suppose you wanted everything from the crisno column through the systrgyr column. Instead of typing out each column name explicitly, you can just use a colon (:) to tell select() to grab this range of variables:
select(df_crises, crisno:systrgyr)
# A tibble: 1,131 × 5
crisno cracno cracid actor systrgyr
<dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 365 RUS 1918
2 2 2 93 NIC 1918
3 2 3 94 COS 1918
4 3 4 365 RUS 1918
5 4 5 365 RUS 1918
6 4 6 366 EST 1918
7 4 7 368 LIT 1918
8 4 8 367 LAT 1918
9 5 9 315 CZE 1919
10 5 10 290 POL 1919
# ℹ 1,121 more rows
If you want everything except a particular variable, you can use the minus sign (-) to exclude it. Let’s see how to get rid of the useless icb2 column that just marks the dataset version:
What happens if you instead run select(df_crises, starts_with(sys)) (without the quotation marks around sys)? Why doesn’t this work?
[Write your answer here]
There are many other subfunctions you can use to select columns without naming them each explicitly. You can learn about them by looking at the help page for the select() function. To see the help page for any R function, just enter ?name_of_function at the R console, like the following:
?select
In-class exercise
Select the columns whose name has "fav" anywhere in it. Look at the ?select help page to figure out the subfunction you need to do this.
## [Write your answer here]#
2.2.1.3 Reordering and renaming
As you select variables, you can rename them by using the syntax newname = oldname:
select(df_crises, year = systrgyr, name = crisname)
# A tibble: 1,131 × 2
year name
<dbl> <chr>
1 1918 RUSSIAN CIVIL WAR I
2 1918 COSTA RICAN COUP
3 1918 COSTA RICAN COUP
4 1918 RUSSIAN CIVIL WAR II
5 1918 BALTIC INDEPENDENCE
6 1918 BALTIC INDEPENDENCE
7 1918 BALTIC INDEPENDENCE
8 1918 BALTIC INDEPENDENCE
9 1919 TESCHEN
10 1919 TESCHEN
# ℹ 1,121 more rows
To rename individual columns without dropping or rearranging the other columns, use the rename() function instead of select().
rename(df_crises, year = systrgyr, name = crisname)
To rearrange the rows of the data, you can use the arrange() function. By default, it sorts the data in ascending order of the column name(s) you feed it.
# Arrange in ascending order of actor numeric IDarrange(df_crises, cracid)
# A tibble: 1,131 × 95
icb2 crisno cracno cracid actor systrgyr systrgmo systrgda crisname
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 ICB2 59 128 2 USA 1937 12 12 PANAY IN…
2 ICB2 88 224 2 USA 1941 11 26 PEARL HA…
3 ICB2 104 251 2 USA 1945 5 1 TRIESTE I
4 ICB2 108 259 2 USA 1945 8 23 AZERBAIJ…
5 ICB2 111 265 2 USA 1946 8 7 TURKISH …
6 ICB2 114 271 2 USA 1947 2 21 TRUMAN D…
7 ICB2 123 293 2 USA 1948 6 7 BERLIN B…
8 ICB2 125 295 2 USA 1948 9 23 CHINA CI…
9 ICB2 132 308 2 USA 1950 6 25 KOREAN W…
10 ICB2 133 314 2 USA 1950 9 30 KOREAN W…
# ℹ 1,121 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
# datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
# southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
# majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
# trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
# usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …
# Arrange by start year, then by actor ID within yeararrange(df_crises, systrgyr, cracid)
What if we only want the crises that started during the Cold War? That involves selecting rows from the data rather than columns, which we’ll use the filter() function for. But before we can do that, we need to learn a bit about making logical statements in R.
2.2.2.1 Comparisons and logical operators
If we make a comparison in R, it will spit out whether our statement is TRUE or FALSE. There are six logical comparisons we can make in R:
x == y: Is x equal to y? (Notice that we use a double equals sign, not a single one!)
x != y: Is x unequal to y?
x > y: Is x greater than y?
x >= y: Is x greater than or equal to y?
x < y: Is x less than y?
x <= y: Is x less than or equal to y?
As an example, let’s assign the value 1950 to the variable x.
x <-1950
Now let’s look at the values of a few logical comparisons.
x ==1950
[1] TRUE
x ==1951
[1] FALSE
x !=1951
[1] TRUE
x <1950
[1] FALSE
x <=1950
[1] TRUE
If we use these with a vector of numbers, R calculates the comparison for each individual element of the vector. So instead of just looking at 1950, let’s look at each number from 1948 to 1952.
x <-1948:1952x
[1] 1948 1949 1950 1951 1952
We’ve got a vector of five years here, so each of our logical comparisons will now return a vector of five TRUE/FALSE values.
x ==1950
[1] FALSE FALSE TRUE FALSE FALSE
x ==1951
[1] FALSE FALSE FALSE TRUE FALSE
x !=1951
[1] TRUE TRUE TRUE FALSE TRUE
x <1950
[1] TRUE TRUE FALSE FALSE FALSE
x <=1950
[1] TRUE TRUE TRUE FALSE FALSE
You can combine logical comparisons like this with square brackets to select elements from a vector on the basis of a logical comparison.
x[x <1950]
[1] 1948 1949
Above we’ve compared a vector to a single number. What if we compare a vector to another vector of the same length?
The last possibility is comparing a vector to another vector of a different length. Loosely speaking, R handles this situation by repeating the shorter vector until the two match up. R’s behavior in this situation can be hard to predict, so I recommend only doing logical comparisons against either a single number, or a vector of the same length. You can check the length of a vector using the length() function, FYI.
y <-c(1948, 1948, 1950, 1950, 1946)y
[1] 1948 1948 1950 1950 1946
x == y
[1] TRUE FALSE TRUE FALSE FALSE
x != y
[1] FALSE TRUE FALSE TRUE TRUE
x > y
[1] FALSE TRUE FALSE TRUE TRUE
x >= y
[1] TRUE TRUE TRUE TRUE TRUE
The exclamation point means “not”. You can use it to turn TRUE into FALSE and vice versa. Just make sure that you put parentheses around the whole statement that you’re trying to negate.
!(x >= y)
[1] FALSE FALSE FALSE FALSE FALSE
Sometimes we will want to string together multiple comparisons. There are two ways to combine comparisons.
and: a & b is TRUE when a is TRUE and b is also TRUE. If either or both are FALSE, then so is a & b.
or: a | b is TRUE when a is TRUE, or b is TRUE, or both. The only way for a | b to be FALSE is for both a and b to be FALSE.
a
b
a & b (and)
a | b (or)
TRUE
TRUE
TRUE
TRUE
TRUE
FALSE
FALSE
TRUE
FALSE
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
FALSE
# Reminder:# x = 1948, 1949, 1950, 1951, 1952# y = 1948, 1948, 1950, 1950, 1946x > y & x <1950
[1] FALSE TRUE FALSE FALSE FALSE
x > y | x <1950
[1] TRUE TRUE FALSE TRUE TRUE
Be aware that & and | only operate on TRUE and FALSE values, not on other values. There are situations where you might use “and” or “or” in an English sentence, but where & and | can’t just be substituted into your R code. For example, think about trying to code up the logical statement “x or y is less than 100.”
x <-c(75, 100, 125, 150)y <-c(100, 90, 150, 99)# Won't work correctly, because x and y aren't TRUE/FALSE(x | y) <100
[1] TRUE TRUE TRUE TRUE
# Will work correctly, because "x < 100" and "y < 100" are TRUE/FALSE(x <100) | (y <100)
[1] TRUE TRUE FALSE TRUE
Completely optional aside for those who are morbidly curious why we get all TRUE values when we run (x | y) < 100: When you plug numbers into a logical operator, any non-zero number evaluates to TRUE. Since every element of x and y is nonzero, x | y evaluates to TRUE, TRUE, TRUE, TRUE. Then, when you plug logical values into a numerical comparison, TRUE values are treated like 1 and FALSE values like 0. So R converts each TRUE to 1, observes that 1 is less than 100, and again returns TRUE for each one.
One final minor note: If you look for R help on the Internet, you’ll sometimes see people use && and ||. What’s the difference between & and &&, or between | and ||? The doubled versions only work on a singleTRUE/FALSE statement. This distinction is important in programming situations where you need to make sure you’re only working with a single value. That won’t typically come up in this course, so to make our lives easier we’ll just always use the single versions.
x <-c(1, 2, 3, 4, 5)y <-c(1, 1, 4, 4, 8)x < y | x > y # returns c(FALSE, TRUE, TRUE, FALSE, TRUE)
[1] FALSE TRUE TRUE FALSE TRUE
x < y || x > y # won't run, will give error message
Error in x < y || x > y: 'length = 5' in coercion to 'logical(1)'
2.2.2.2 The filter() function
Now that we’re familiar with comparisons and logical operators, let’s return to the crisis data to reduce it down to Cold War–era observations.
Remember that we can pull a single column from a data frame using the dollar sign. And we can use square brackets on a data frame to select particular observations from it. So we could subset to the Cold War observations that way.
When dealing with character columns, the greater/lesser comparisons no longer make sense, but we can still use the equality and inequality comparisons. Just make sure to use quotation marks around whatever you’re comparing to!
filter(df_crises, actor =="USA")
# A tibble: 78 × 95
icb2 crisno cracno cracid actor systrgyr systrgmo systrgda crisname
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 ICB2 59 128 2 USA 1937 12 12 PANAY IN…
2 ICB2 88 224 2 USA 1941 11 26 PEARL HA…
3 ICB2 104 251 2 USA 1945 5 1 TRIESTE I
4 ICB2 108 259 2 USA 1945 8 23 AZERBAIJ…
5 ICB2 111 265 2 USA 1946 8 7 TURKISH …
6 ICB2 114 271 2 USA 1947 2 21 TRUMAN D…
7 ICB2 123 293 2 USA 1948 6 7 BERLIN B…
8 ICB2 125 295 2 USA 1948 9 23 CHINA CI…
9 ICB2 132 308 2 USA 1950 6 25 KOREAN W…
10 ICB2 133 314 2 USA 1950 9 30 KOREAN W…
# ℹ 68 more rows
# ℹ 86 more variables: triggr <dbl>, yrtrig <dbl>, motrig <dbl>,
# datrig <dbl>, trigent <dbl>, trigloc <dbl>, southv <dbl>,
# southpow <dbl>, sizedu <dbl>, strcdu <dbl>, comlev <dbl>,
# majres <dbl>, yerres <dbl>, monres <dbl>, dayres <dbl>,
# trgresra <dbl>, crismg <dbl>, cenvio <dbl>, sevvio <dbl>,
# usinv <dbl>, usfavr <dbl>, suinv <dbl>, sufavr <dbl>, …
What if you wanted to get all of the observations involving the United States (USA), Russia (RUS), or the United Kingdom (UKG)? One kind of tedious way to go about it would be to string many “or” statements together.
filter(df_crises, actor =="USA"| actor =="RUS"| actor =="UKG")
The easier way — especially if you’re matching a large number of values — is to use the %in% operator in R. When you run x %in% y, it gives you a TRUE or FALSE for each entry of x: TRUE if it matches at least one entry of y, and FALSE otherwise.
filter(df_crises, actor %in%c("USA", "RUS", "UKG"))
How can you filter down to the observations that are not those three countries?
## [Write your answer here]#
2.2.3 “Piping” commands together
Here’s a pretty common way to start off your data analysis workflow:
Load your raw data into R.
Reduce the columns down to the ones you will use in your analysis.
Filter out the rows that aren’t relevant to your analysis.
Remember that R doesn’t save what you do to a data frame unless you explicitly assign the result to a variable. This means that a data cleaning process might involve a lot of intermediate assignments that you never end up using again.
Imagine we want to extract the name of the crisis, the name and numeric ID of the country involved, the year the crisis began, the number of days it lasted (trgterra), and the outcome for the given actor (outcom) Along the way we’ll rename the variables to be less ugly. And furthermore imagine we only want Cold War–era observations, and only for the USA, Russia, and the United Kingdom. We could run the following sequence of commands:
# A tibble: 87 × 6
crisis_name actor_name actor_id year duration outcome
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 COMMUNISM IN POLAND RUS 365 1946 204 1
2 TURKISH STRAITS USA 2 1946 81 1
3 COMMUNISM IN HUNGARY RUS 365 1947 112 1
4 TRUMAN DOCTRINE USA 2 1947 91 1
5 MARSHALL PLAN RUS 365 1947 9 1
6 COMMUNISM IN CZECH. RUS 365 1948 13 1
7 BERLIN BLOCKADE RUS 365 1948 340 4
8 BERLIN BLOCKADE UKG 200 1948 323 1
9 BERLIN BLOCKADE USA 2 1948 323 1
10 CHINA CIVIL WAR USA 2 1948 34 4
# ℹ 77 more rows
That certainly works fine to give us what we want, but it involved a lot of repetitive typing and retyping of df_crises_cw <- function_name(df_crises_cw, ...). And that’s only with fairly simple subsetting operations.
The pipe operator in R lets us chain together commands like this much more succinctly. x |> function_name(...) is equivalent to function_name(x, ...) in R.
x <-1:10x
[1] 1 2 3 4 5 6 7 8 9 10
mean(x)
[1] 5.5
x |>mean()
[1] 5.5
On its own the pipe might seem dumb, like just more typing for the same result. Weren’t we trying to reduce the amount of typing? The pipe is mainly useful when we are chaining many commands together.
# A tibble: 87 × 6
crisis_name actor_name actor_id year duration outcome
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 COMMUNISM IN POLAND RUS 365 1946 204 1
2 TURKISH STRAITS USA 2 1946 81 1
3 COMMUNISM IN HUNGARY RUS 365 1947 112 1
4 TRUMAN DOCTRINE USA 2 1947 91 1
5 MARSHALL PLAN RUS 365 1947 9 1
6 COMMUNISM IN CZECH. RUS 365 1948 13 1
7 BERLIN BLOCKADE RUS 365 1948 340 4
8 BERLIN BLOCKADE UKG 200 1948 323 1
9 BERLIN BLOCKADE USA 2 1948 323 1
10 CHINA CIVIL WAR USA 2 1948 34 4
# ℹ 77 more rows
It might take a bit getting used to, but I personally find it much easier to use the pipe when going through the data cleaning process. One thing to keep in mind, just like when you run these commands the regular way, they won’t change your data frame in place. You need to explicitly assign the output to a variable name.
Final note on the pipe. Until pretty recently, the pipe was %>% instead of |>. So you’ll see a lot of code on the Internet that uses %>%. Don’t worry, it works exactly the same as |> (except in some edge cases that won’t come up for us).
In-class exercise
Use the pipe to create a data frame called df_long_crises with the following specs:
Same variables as in my example, except with the numeric crisis ID instead of the crisis name
Only observations where the actor was involved in the crisis for 100+ days
Exclude observations of Russia from 1990 onward
## [Write your answer here]#
2.3 Transforming
2.3.1 Changing variables and making new ones
The mutate() function lets you change a column in the dataframe, or create a new column using the values in existing columns. For example, right now the duration of each crisis is measured in days. We can add new columns to measure duration in weeks and years instead. As always, if we want our new columns to persist instead of vanishing into the ether, we need to assign the output to a variable — either overwriting df_crises_cw or creating a new data frame.
Use mutate() and if_else() together to add another column to df_crises_cw called crisis_length. This variable should say "long" if the crisis is 100 or more days, and "short" otherwise.
## [Write your answer here]#
The “outcome” variable here is recorded as numbers, which stand for different categories of crisis outcomes. The ICB codebook explains which category is associated with each numerical value.
Victory: this country achieved its basic goals.
Compromise: this country partly achieved its basic goals.
Stalemate: there was no major change in the situation.
Defeat: this country did not achieve its basic goals, and instead yielded or surrendered.
Other.
So for example, looking at the rows associated with the Berlin Blockade, we see that the USA and United Kingdom experienced victory, while the Soviet Union was defeated.
# A tibble: 3 × 8
actor_name outcome crisis_name actor_id year duration duration_weeks
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 RUS 4 BERLIN BLOC… 365 1948 340 48.6
2 UKG 1 BERLIN BLOC… 200 1948 323 46.1
3 USA 1 BERLIN BLOC… 2 1948 323 46.1
# ℹ 1 more variable: duration_years <dbl>
We could use a sequence of 5 if_else() statements to convert the outcome variable in terms of category names instead of (hard to remember) numeric identifiers. But it’s easier to use case_when():
# A tibble: 87 × 8
crisis_name actor_name actor_id year duration outcome duration_weeks
<chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 COMMUNISM … RUS 365 1946 204 victory 29.1
2 TURKISH ST… USA 2 1946 81 victory 11.6
3 COMMUNISM … RUS 365 1947 112 victory 16
4 TRUMAN DOC… USA 2 1947 91 victory 13
5 MARSHALL P… RUS 365 1947 9 victory 1.29
6 COMMUNISM … RUS 365 1948 13 victory 1.86
7 BERLIN BLO… RUS 365 1948 340 defeat 48.6
8 BERLIN BLO… UKG 200 1948 323 victory 46.1
9 BERLIN BLO… USA 2 1948 323 victory 46.1
10 CHINA CIVI… USA 2 1948 34 defeat 4.86
# ℹ 77 more rows
# ℹ 1 more variable: duration_years <dbl>
There’s another function called case_match() that would let you do this same thing with even less typing. Go ahead, look it up, and use it if you like! But case_when() is a bit more flexible, so to avoid confusion I’m only going to use it.
2.3.2 Grouping and summarizing
The summarize() function lets us calculate summaries of columns in the data. For example, let’s find the shortest, longest, and average crisis durations among the USA/UK/USSR during the Cold War.
# A tibble: 1 × 3
shortest longest average
<dbl> <dbl> <dbl>
1 2 486 99.2
# Seemingly easier waymin(df_crises_cw$duration)
[1] 2
max(df_crises_cw$duration)
[1] 486
mean(df_crises_cw$duration)
[1] 99.24138
The real power of summarize() comes in when you combine it with group_by() to calculate summaries for subcategories of the data. For example, say we wanted to calculate the shortest, longest, and average crisis duration separately for the USA, UK, and USSR. That would be tedious to do the “normal” way:
# one of nine long expressions you'd have to type outmin(df_crises_cw$duration[df_crises_cw$actor_name =="USA"])
# A tibble: 3 × 4
actor_name shortest longest average
<chr> <dbl> <dbl> <dbl>
1 RUS 3 486 124.
2 UKG 12 323 106
3 USA 2 433 84.6
In-class exercise
In which year of the Cold War did the USA spend the most total days in crises? Use filter(), group_by(), summarize(), and arrange() together to figure it out.
## [Write your answer here]#
What if you wanted to count the number of crises involving each country? There’s a special function called n() that counts the number of observations within each group.
# A tibble: 3 × 2
actor_name number
<chr> <int>
1 RUS 24
2 UKG 15
3 USA 48
You can group on multiple variables at once. The number of rows in the result will be the number of unique combinations of values across the grouping variables.
`summarise()` has grouped output by 'actor_name'. You can override
using the `.groups` argument.
# A tibble: 13 × 5
# Groups: actor_name [3]
actor_name outcome number shortest longest
<chr> <chr> <int> <dbl> <dbl>
1 RUS compromise 3 8 222
2 RUS defeat 5 3 340
3 RUS stalemate 3 9 260
4 RUS victory 13 9 486
5 UKG compromise 2 183 192
6 UKG defeat 2 66 104
7 UKG other 1 17 17
8 UKG stalemate 2 112 293
9 UKG victory 8 12 323
10 USA compromise 7 4 433
11 USA defeat 6 10 336
12 USA stalemate 8 5 293
13 USA victory 27 2 323
2.4 Reshaping
To get practice with reshaping and merging data, we’re going to bring in a second dataset. military.csv, stored at https://bkenkel.com/qps1/data/military.csv, contains data on military spending and force sizes by country from 1816 to 2019.
# A tibble: 31,902 × 5
ccode stateabb year mil_indicator amount
<dbl> <chr> <dbl> <chr> <dbl>
1 2 USA 1816 spending 3823
2 2 USA 1816 personnel 17
3 2 USA 1817 spending 2466
4 2 USA 1817 personnel 15
5 2 USA 1818 spending 1910
6 2 USA 1818 personnel 14
7 2 USA 1819 spending 2301
8 2 USA 1819 personnel 13
9 2 USA 1820 spending 1556
10 2 USA 1820 personnel 15
# ℹ 31,892 more rows
The amounts here are in thousands of dollars and people respectively. For example, looking at the first two rows, in 1816 the USA spent $3.8 million on the military, which consisted of 17,000 soldiers.
How can we reshape this data so that there’s a single row for the USA in 1816, a single row for the USA in 1817, and so on? As of now the data is “longer” than we want it to be, and we’d like it to be “wider”. Hence we’ll use the function pivot_wider(). Here’s how we use it.
# A tibble: 15,951 × 5
ccode stateabb year spending personnel
<dbl> <chr> <dbl> <dbl> <dbl>
1 2 USA 1816 3823 17
2 2 USA 1817 2466 15
3 2 USA 1818 1910 14
4 2 USA 1819 2301 13
5 2 USA 1820 1556 15
6 2 USA 1821 1612 11
7 2 USA 1822 1079 10
8 2 USA 1823 1170 11
9 2 USA 1824 1261 11
10 2 USA 1825 1336 11
# ℹ 15,941 more rows
Making a data frame “wider” means taking data that’s currently stored row by row, and spreading it out across columns instead. The names_from argument tells pivot_wider() which column of the original data frame contains the names of the new columns we want to create. The values_from argument tells it which column contains the values that we want to put into each column.
In-class exercise
Go back to our data on the USA, UK, and USSR in Cold War crises. Use pivot_wider() to create a new data frame where each row is a unique crisis name, and there are columns for the USA, UK, and USSR telling that country’s outcome from the given crisis. Were there any crises that ended in victory for both the USA and the USSR?
## [Write your answer here]#
The opposite of pivot_wider() is, naturally enough, pivot_longer(). For example, imagine your raw data had a single row for each country, then a separate column for its military spending each year.
To use pivot_longer(), we have to tell it (1) the set of columns whose values we want to collapse into a single column, (2) what we want to name the column that stores the column names from the original data, and (3) what we want to name the column that stores the values.
# A tibble: 9 × 4
ccode stateabb year personnel
<dbl> <chr> <dbl> <dbl>
1 2 USA 1816 17
2 2 USA 1817 15
3 2 USA 1818 14
4 200 UKG 1816 255
5 200 UKG 1817 190
6 200 UKG 1818 173
7 365 RUS 1816 800
8 365 RUS 1817 700
9 365 RUS 1818 600
How did I know to use the str_replace() function there? The stringr package contains tons of helpful functions for manipulating text in R. You can see everything it does by googling the documentation for it, or by entering help(package = "stringr") at the R console.
2.5 Merging
Data work in political science often involves combining data from multiple sources. For example, here we have one dataset on country participation in international crises, and another on the size of their militaries. How can we put these together?
The first thing to do is decide which dataset should be the “base” that we merge other data sources into. What’s appropriate depends on the question you’re trying to answer. For our purposes here, let’s imagine we want to know whether crisis winners tend to spend more on their military than crisis losers. That means we only care about the countries that actually ended up in crises, so we should use the crisis data as our base and merge the military size data into that.
The next thing we need to make sure of is that we have some way to identify the same observation across datasets. Here we are going to match observations by the numerical “country code” and by year. First we’ll edit the military size data so that the country code indicator has the same name as in df_crises_cw. (We’ll also drop the country name variable, since we already have that in the original data.)
Winners have smaller militaries than losers on average. Does this mean that increasing the size of your military actually hurts your ability to win a crisis? Why or why not? How could you dig deeper into the data to better figure out whether this is the case?