3  Univariate Analysis

By now you’re experts on wrangling data. It’s time to start using data to learn about politics.

We will start by analyzing a single variable at a time. Our goal is to come up with good quantitative summaries of each variable in our data — i.e., to collapse the complexity of what we observe into a few numerical values that are easier to wrap our minds around. The major concepts we will hit are:

  1. Measures of central tendency, which tell us what a “typical value” of a variable looks like. There’s no single best way to do this — the best measure of central tendency depends on what you’re trying to understand or accomplish with your data analysis.

  2. Measures of spread, which tell us how far apart the values of the variable tend to be. Is a random observation from the data likely to be pretty close to the average (low spread) or far away from the average (high spread)?

  3. How the appropriate measures of central tendency and spread depend on the type of variable we’re dealing with.

3.1 Data: County-level presidential election results

We will work with the data file county_pres.csv, stored online at https://bkenkel.com/qps1/data/county_pres.csv, which records county-by-county results of US presidential elections from 2000 to 2024.

library("tidyverse")

df_county_pres <- read_csv("https://bkenkel.com/qps1/data/county_pres.csv")

glimpse(df_county_pres)
Rows: 22,093
Columns: 12
$ year            <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
$ state           <chr> "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "A…
$ region          <chr> "South", "South", "South", "South", "South", "…
$ county          <chr> "AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOU…
$ county_fips     <chr> "01001", "01003", "01005", "01007", "01009", "…
$ total_votes     <dbl> 17208, 56480, 10395, 7101, 17973, 4904, 7803, …
$ dem_votes       <dbl> 4942, 13997, 5188, 2710, 4977, 3395, 3606, 157…
$ rep_votes       <dbl> 11993, 40872, 5096, 4273, 12667, 1433, 4127, 2…
$ margin          <dbl> -7051, -26875, 92, -1563, -7690, 1962, -521, -…
$ pct_margin      <dbl> -0.409751278, -0.475832153, 0.008850409, -0.22…
$ competitiveness <dbl> -3, -3, 0, -3, -3, 3, -1, -2, 0, -1, -3, 0, -2…
$ dem_win_state   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Each observation (row) in this dataset is a particular county in a particular year. Here are all of the variables (columns) in the data, though we’ll only be using some of them in class today:

Name              Definition
year              Election year
state             State name
region            Region of the country (US Census codings) the state is in
county            County name
county_fips       County FIPS (Federal Information Processing Standards) code
total_votes       Total votes cast for any candidate in presidential race
dem_votes         Votes cast for the Democratic presidential candidate
rep_votes         Votes cast for the Republican presidential candidate
margin            Difference in votes cast: dem_votes - rep_votes
pct_margin        Difference relative to total votes: margin / total_votes
competitiveness   Categorization from -3 to +3 of how close the result was
dem_win_state     Did the Democrat win the state? 0 = No, 1 = Yes

We’re going to use this data to try to answer some simple questions about the competitiveness of presidential elections.

  • How competitive are recent presidential elections?

  • Are they getting more or less competitive?

  • Which places are competitive and non-competitive?

Data will help us answer these questions objectively, but any data analysis also involves judgment calls. For example, how should we define competitiveness? Is it closeness of the absolute number of votes, or of the vote percentages, or something else entirely? Can we sort races into “competitive” and “not competitive”, or does competitiveness lie on a continuum? Is competitiveness best assessed at the county, state, regional, or national level?

The important thing is to keep in mind the purpose of your data analysis. What are you trying to learn? What decisions are you hoping to make? If you’re working for a campaign and trying to help decide how to spend the advertising budget, the right notion of “competitiveness” might be different than if you’re a political science professor studying the effects of the economy on electoral competition. No matter what decision you make, it’s crucial that you:

  1. Put some thought and reasoning into the judgment calls you make.

  2. Be clear and transparent with your audience about which decisions you made, and why.

  3. Be open to thinking about your problem in a different way! Try out alternative measures, conceptualizations, choices, etc., and see how sensitive your conclusions are.

3.2 Central tendency: What’s a typical value?

Data analysis is all about summarizing. Think about our county-level presidential vote data. We’ve got 22,093 rows and 12 columns, for a whopping 265,116 separate pieces of information — and this isn’t even a particularly “big” dataset! In order to wrap our puny human brains around this much information, we need to find ways to summarize it much more concisely.

One way to summarize a variable is to characterize its central tendency: to answer the question “what’s a typical value for this variable?” For any variable, there are multiple different measures of central tendency available to us, and it is a judgment call which of them are most useful for a given purpose. But to even know what our options are, we have to know what kind of variable we are dealing with.

3.2.1 Continuous variables

By the strictest definition, a continuous variable is a variable whose values may lie in a continuum of numbers. The height of a person is a continuous variable: they might be 66 inches tall, or 66.5 inches tall, or 66.52 inches tall, or even 66.5240549 inches tall. We aren’t going to be that strict about continuous variables. A candidate can receive 10,000 votes or 10,001 votes, but she can’t receive 10,000.5 votes. Nonetheless, vote totals are close enough to continuous that we will treat them as such. What’s really important, for our purposes, is that it makes sense to add and subtract the values of continuous variables.

In other stats classes you may have heard of a further subdivision of continuous variables into “interval” and “ratio” variables. Typical measures of central tendency (and spread, covered below) don’t care about this distinction, so we won’t either.

One of the most common measures of central tendency for a continuous variable is the mean, also called the average. To calculate the mean, we add up all the values of the variable, then divide by the number of observations. If there are \(N\) observations of a continuous variable, say \(x_1, x_2, \ldots, x_N\), then the mean is \[\bar{x} = \frac{\sum_{i=1}^N x_i}{N}.\]

As you saw a bit in our section on data wrangling, we can use the mean() function in R to calculate the mean. For example, let’s calculate the average number of votes in a county across the 2000–2024 presidential elections.

mean(df_county_pres$total_votes)
[1] 44419.82
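
To connect the function back to the formula, we can also compute the same number from scratch. This is just a sanity check; in practice you’d always use mean().

sum(df_county_pres$total_votes) / nrow(df_county_pres)
[1] 44419.82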

In-class exercise

Calculate the average percentage by which Joe Biden won/lost across all counties in 2020. How does your result compare to his 4.5 percentage point margin of victory in the national popular vote? If there is a substantial difference, why do you think that is?

#
# [Write your answer here]
#

Sometimes when you calculate an average, you want to place more weight on some observations than others. This is common with survey data. For example, imagine you have a survey sample where 50% of respondents have a college degree and 50% do not. In the overall population, we know that about 38% of American adults have a college degree and 62% do not. So if you are trying to create a representative estimate, you might want to downweight the degree-holding respondents who are overrepresented in your sample, and upweight the degree-less respondents who are underrepresented. We call this a weighted mean.

For the math lovers out there, we define a weighted mean by associating positive weights \(w_1, w_2, \ldots, w_N\) with each observation \(1, 2, \ldots, N\). The formula is then \[\text{weighted mean} = \frac{\sum_{i=1}^N w_i \times x_i}{\sum_{i = 1}^N w_i}.\] The ordinary mean is the special case of the weighted mean where the weight on every observation is the same: \(w_1 = w_2 = \cdots = w_N\).

To calculate a weighted mean in R, we use the weighted.mean() function. This takes two arguments: the vector of values that we are averaging, and then the weight to place on each observation.
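
For example, here’s the college-degree scenario from above with made-up numbers (the support values and the 0–100 scale are hypothetical, just to illustrate the mechanics):

# Hypothetical survey: support for some policy on a 0-100 scale
support <- c(80, 70, 40, 30)
# Suppose the first two respondents have college degrees and the last two
# don't. Degree holders are 50% of this sample but 38% of the population,
# so we downweight them and upweight everyone else.
wts <- c(0.38 / 0.50, 0.38 / 0.50, 0.62 / 0.50, 0.62 / 0.50)
mean(support)                # unweighted mean: 55
weighted.mean(support, wts)  # weighted mean: 50.2, pulled toward the non-degree values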

In-class exercise

Return to the problem of calculating Biden’s average margin by county in 2020. Now calculate a weighted mean, where the weight on each county is the number of votes cast there. How does this compare to the raw mean, and to his 4.5 percentage point national margin of victory? What do you learn from the comparison?

#
# [Write your answer here]
#

The other most common measure of central tendency for a continuous variable is the median. If the median of some variable is \(m\), that means half of the observations are less than or equal to \(m\), and half of the observations are greater than or equal to \(m\). We calculate the median in R with the median() function. For example, let’s look at the median number of votes cast across the counties in our data frame.

median(df_county_pres$total_votes)
[1] 11140

You can imagine it this way: Suppose we lined up each row of the data frame in order by total_votes, lowest to highest. The row at exactly the halfway point would have 11,140 total votes.
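
Here’s that thought experiment carried out in code (just a sketch: arrange() does the lining up, and slice() grabs the row at a given position):

df_county_pres |>
  arrange(total_votes) |>   # line the rows up, lowest to highest
  slice(11047) |>           # grab the middle row: (22093 + 1) / 2
  select(total_votes)
# A tibble: 1 × 1
  total_votes
        <dbl>
1       11140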

The mean and the median differ in terms of their sensitivity to outliers, data points that are extremely far from typical. For example, imagine you had a sample of 100 people: 99 normal people and Elon Musk. Because Elon Musk’s net worth is $420 billion, the average net worth of this sample would be at least $4.2 billion. And yet it would be misleading, in a sense, to say the average person in the sample is a billionaire — really you have one mega-billionaire and 99 regular people. The median net worth of the sample would be much, much lower.
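
To put numbers on this, here’s a quick simulation (the $100,000 net worth for each “normal” person is a made-up figure):

# 99 hypothetical normal people, plus one Elon Musk
net_worth <- c(rep(100000, 99), 420e9)
mean(net_worth)
[1] 4200099000
median(net_worth)
[1] 1e+05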

In case you’re curious: I could only find statistics at the household level, but the median net worth of an American household is $192,000.

Because the mean is sensitive to outliers and the median is not, you might think we should prefer the median over the mean as a measure of central tendency. I think it’s more complicated than that. There are some situations where you care about outliers!

In-class exercise

You are running analytics for a political campaign. You recently tested two different templates for texts seeking political donations. You sent each template to a separate sample of 1,000 phone numbers from your list of supporters:

  • Template A yielded a median donation of $5 and a mean donation of $7.
  • Template B yielded a median donation of $0 (in other words, most recipients donated nothing) and a mean donation of $20.

If your goal is to just maximize the total amount of money donated, which template would you recommend sending out to the full list of supporters?

[Write your answer here]

How competitive are recent presidential elections? Are they getting more or less competitive? Let’s try a few different calculations. Which one do you think best captures the competitiveness of each election?

df_county_pres |>
  group_by(year) |>
  summarize(
    avg_pct_margin = mean(pct_margin),
    avg_weighted_pct_margin = weighted.mean(pct_margin, total_votes),
    med_pct_margin = median(pct_margin)
  )
# A tibble: 7 × 4
   year avg_pct_margin avg_weighted_pct_margin med_pct_margin
  <dbl>          <dbl>                   <dbl>          <dbl>
1  2000         -0.170                 0.00517         -0.175
2  2004         -0.212                -0.0248          -0.226
3  2008         -0.148                 0.0726          -0.162
4  2012         -0.205                 0.0395          -0.234
5  2016         -0.308                 0.0210          -0.376
6  2020         -0.313                 0.0445          -0.380
7  2024         -0.323                -0.0279          -0.375

It wouldn’t be an introductory stats class if I didn’t mention the mode, which is the single value that appears most commonly in the data. The mode is not very useful for continuous variables. For example, in our total votes variable, there are 16,941 distinct values observed. The most common is 3,930, as there just happen to be eight counties where this was the exact total of votes. Just because we get this exact total in eight out of 22,000+ observations doesn’t make it “typical” in any meaningful sense.

df_county_pres |>
  group_by(total_votes) |>
  summarize(number = n()) |>
  arrange(desc(number))
# A tibble: 16,941 × 2
   total_votes number
         <dbl>  <int>
 1        3930      8
 2        1017      7
 3        1369      7
 4        2031      6
 5        3906      6
 6         928      5
 7        1110      5
 8        1390      5
 9        1434      5
10        1440      5
# ℹ 16,931 more rows

3.2.2 Unordered categorical variables

A categorical variable is one with a discrete set of possible values. Usually these are stored as character strings in a data frame, though sometimes (like we saw in the previous unit with the “outcome” variable in the crisis data) people use numerical codes to represent different categories.

We say a categorical variable is unordered when none of the categories is “greater” or “less” than the others. For example, the region variable in our county-level election data has four categories: Midwest, Northeast, South, and West. Because no region is “more” or “less” than the others, this variable is unordered.

df_county_pres |>
  group_by(region) |>
  summarize(number = n())
# A tibble: 4 × 2
  region    number
  <chr>      <int>
1 Midwest     7391
2 Northeast   1560
3 South       9934
4 West        3208

You can’t calculate a mean for an unordered categorical variable, because you can’t add up the values. Nor can you calculate a median, because you can’t put them in order. The only calculation you can make is the mode — the category with the most observations. In the example here, the modal region is the South.
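
If you’d rather have R pull out the modal category than eyeball the table, here’s one way to do it (a sketch using dplyr’s count() and slice_max(); plenty of alternatives exist):

df_county_pres |>
  count(region, name = "number") |>
  slice_max(number, n = 1)
# A tibble: 1 × 2
  region number
  <chr>   <int>
1 South    9934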

I find it more informative to look at the distribution of values of an unordered categorical variable: what proportion of values fall into each category?

In-class exercise

Add a proportion column to the above table of frequency counts for census region, then order it from highest proportion to lowest.

#
# [Write your answer here]
#

3.2.3 Ordered categorical variables

A categorical variable is ordered when the categories can be put in order from “least” to “most”. For example, every election cycle the Cook Political Report issues qualitative ratings of Congressional races:

  • Likely Republican
  • Lean Republican
  • Toss-up
  • Lean Democrat
  • Likely Democrat

Our county-level presidential elections data contains an analogue of the Cook ratings in the competitiveness column. It’s a 7-point scale where the lowest values most strongly favor Republicans, while the highest values most strongly favor Democrats.

df_county_pres |>
  group_by(competitiveness) |>
  summarize(
    number = n(),
    avg_dem_margin = mean(pct_margin),
    med_dem_margin = median(pct_margin)
  )
# A tibble: 7 × 4
  competitiveness number avg_dem_margin med_dem_margin
            <dbl>  <int>          <dbl>          <dbl>
1              -3  12891       -0.444         -0.429  
2              -2   2631       -0.152         -0.151  
3              -1   1371       -0.0714        -0.0719 
4               0   1502       -0.00178       -0.00250
5               1    916        0.0691         0.0686 
6               2   1073        0.147          0.146  
7               3   1709        0.405          0.329  

With an ordered categorical variable, we can look at the mode and the distribution just like we did with an unordered categorical variable. We still can’t calculate a mean, because we can’t sensibly add and subtract values of a non-continuous variable. However, unlike with an unordered categorical variable, here it is also sensible to calculate the median.

median(df_county_pres$competitiveness)
[1] -3

Closely tracking what we saw with vote shares by county, we see here that the median county leans strongly in favor of Republican candidates.

What if your raw data is coded as a character string — e.g., "Strong Republican", "Likely Republican", etc.? You can still calculate the mode and the distribution the same way you did with an unordered categorical variable. However, to calculate the median, you’ll need to convert the character categories to numbers using case_when(), then run the median() function on the numerical version.
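
Here’s a sketch of that conversion, using made-up ratings on the Cook-style scale from above. The -2 to +2 coding is an arbitrary choice; with an odd number of observations, any order-preserving coding identifies the same median category.

# Hypothetical character-coded ratings for five races
ratings <- c("Likely Republican", "Toss-up", "Lean Democrat",
             "Likely Republican", "Lean Republican")

ratings_num <- case_when(
  ratings == "Likely Republican" ~ -2,
  ratings == "Lean Republican"   ~ -1,
  ratings == "Toss-up"           ~  0,
  ratings == "Lean Democrat"     ~  1,
  ratings == "Likely Democrat"   ~  2
)
median(ratings_num)
[1] -1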

3.2.4 Binary variables

A binary variable is a categorical variable that has exactly two categories. Binary variables are special because we can essentially treat them as if they were continuous, specifically by treating one category as 0 and the other as 1.

In our county-level presidential data, the indicator for whether the Democrat won the state is a binary variable, already coded in 0/1 format.

df_county_pres |>
  group_by(dem_win_state) |>
  summarize(number = n())
# A tibble: 2 × 2
  dem_win_state number
          <dbl>  <int>
1             0  15262
2             1   6831

With a binary variable coded in 0/1 format, the mean tells us the proportion of observations that are a 1.

mean(df_county_pres$dem_win_state)
[1] 0.309193

This value indicates that in about 31% of the counties in our data, the Democrat won the presidential election in the state that year.

In-class exercise

Create a new column with a binary variable that indicates whether the Democrat won the given county, rather than the state as a whole. What is its mean, and how do you interpret that?

#
# [Write your answer here]
#

You can also calculate the median and the mode of a binary variable. These will be the same as each other: a median and mode of 1 if more than 50% of observations are 1, and 0 if less than 50% are.
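
For instance, since only about 31% of the dem_win_state values are 1, both the median and the mode are 0:

median(df_county_pres$dem_win_state)
[1] 0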

In the rare case of a variable that’s exactly 50-50, the calculated median depends on precisely which algorithm you use. For our purposes we’re not going to worry about that sort of edge case.

3.3 Spread: How far from the central tendency is typical?

Imagine two hypothetical states, both with 9 counties. Say the average margin for the Democratic candidate across counties is 2% in both states. Furthermore say the median is 2% in both states too. Then they must be pretty politically similar — right?

Not necessarily. One way to end up with a mean and median of 2% is for every county to be slightly left of center, as in our hypothetical State A.

margins_state_a <- c(0.01, 0.01, 0.01, 0.01, 0.02, 0.03, 0.03, 0.03, 0.03)
mean(margins_state_a)
[1] 0.02
median(margins_state_a)
[1] 0.02

But you could also end up with these same numbers if the state is wildly polarized between heavily Democratic and heavily Republican counties, as in our hypothetical State B.

margins_state_b <- c(-0.40, -0.30, -0.25, -0.20, 0.02, 0.20, 0.30, 0.36, 0.45)
mean(margins_state_b)
[1] 0.02
median(margins_state_b)
[1] 0.02

If we only look at central tendency, these two places look the same. When we also examine the spread, meaning the dispersion of values around the central tendency, we see important differences. Our hypothetical State B has much more spread in vote margins than our hypothetical State A. The goal for us now is to work through some precise quantitative measures of spread.
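
As a quick preview, the standard deviation (the sd() function, introduced properly in the next subsection) puts a number on exactly this difference between the two states:

sd(margins_state_a)  # 0.01: counties tightly clustered around the mean
sd(margins_state_b)  # about 0.32: counties very spread out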

We’ll only look at measures of spread for continuous variables. Measures of spread for categorical variables do exist, but I honestly don’t find them particularly useful as data summaries.

The measure I’m most familiar with for categorical variables is information entropy. It does come up sometimes in machine learning applications, but not for the analyses we’ll do in PSCI 2300.

3.3.1 Standard deviation

By far the most common measure of spread for a continuous variable is the standard deviation. Very loosely, I think of the standard deviation as telling us “What’s a normal difference between a random observation and the average?”

Table 3.1: Calculating the “surprise factor” for an observation by looking at how many standard deviations away from the mean it is.

Distance from mean   How surprising?
Within 1 SD          Not surprising at all
1–2 SDs away         Mildly surprising
2–3 SDs away         Rare, but not out of this world
3+ SDs away          SurprisedPikachu.jpg

We can calculate the standard deviation in R using the sd() function. Let’s look at variation in total votes by county.

mean(df_county_pres$total_votes)
[1] 44419.82
sd(df_county_pres$total_votes)
[1] 135957.8

So we have a mean of about 44K and a standard deviation of about 136K. The standard deviation is always measured in the same units as the underlying variable. Because our total_votes column is measured in number of votes, so is its standard deviation.

Let’s see how much of the data falls into my quick categorization from Table 3.1.

vote_mean <- mean(df_county_pres$total_votes)
vote_sd <- sd(df_county_pres$total_votes)

df_county_pres |>
  # New column: How many SDs from the mean is total_votes in this row?
  mutate(sd_from_mean = (total_votes - vote_mean) / vote_sd) |>
  # Categorization based on absolute distance from mean
  mutate(category = case_when(
    abs(sd_from_mean) < 1 ~ "Within 1 SD",
    abs(sd_from_mean) < 2 ~ "Within 1-2 SD",
    abs(sd_from_mean) < 3 ~ "Within 2-3 SD",
    abs(sd_from_mean) >= 3 ~ "Within 3+ SD"
  )) |>
  # Count number in each categorization
  group_by(category) |>
  summarize(number = n()) |>
  # Calculate proportion in each categorization
  mutate(proportion = number / sum(number))
# A tibble: 4 × 3
  category      number proportion
  <chr>          <int>      <dbl>
1 Within 1 SD    20883     0.945 
2 Within 1-2 SD    624     0.0282
3 Within 2-3 SD    272     0.0123
4 Within 3+ SD     314     0.0142

Here we see that the vast majority of the data, about 94% of counties, falls within 1 standard deviation of the mean in terms of total votes. Only about 2.8% of counties are between 1 and 2 SDs from the mean, and fewer than half that many are between 2 and 3 SDs away.

In-class exercise

Calculate the standard deviation of the Democratic candidate’s percentage margin across counties for each different election in the data. Do you notice any trends? What does this tell us about how elections are changing over time?

#
# [Write your answer here]
#

I will admit that the formula for the standard deviation is kind of intense: \[ \text{sd} = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N - 1}}. \tag{3.1}\] Here’s what’s going on:

  • We take each of the \(N\) observations of our variable, \(x_1, x_2, \ldots, x_N\).

  • For each of these observations, \(x_i\), we find the difference between it and the mean: \(x_i - \bar{x}\).

  • We square each of those differences, so now they’re each a positive number, namely the squared distance between the observation and the mean: \((x_i - \bar{x})^2\).

  • We take the average of all the squared differences. Well, not quite the average, because for statistical reasons I won’t get into here, we divide the sum by \(N - 1\) instead of \(N\). But with any moderately large sample, dividing by \(N - 1\) has almost the same result as dividing by \(N\) anyway.

  • We take the square root of the whole thing so that it’s measured in the same units as the original variable.
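
To convince yourself that this really is what sd() computes, here’s Equation 3.1 translated directly into R and applied to the total votes variable. It matches the value we got from sd() above.

votes <- df_county_pres$total_votes
sqrt(sum((votes - mean(votes))^2) / (length(votes) - 1))
[1] 135957.8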

3.3.2 Median absolute deviation

The standard deviation shares some of the same problems as the mean, namely a sensitivity to outliers. One single observation can make a huge difference to the standard deviation.

# Create a vector of evenly-spaced hypothetical data points
x <- seq(from = 1, to = 2, length.out = 20)
x
 [1] 1.000000 1.052632 1.105263 1.157895 1.210526 1.263158 1.315789
 [8] 1.368421 1.421053 1.473684 1.526316 1.578947 1.631579 1.684211
[15] 1.736842 1.789474 1.842105 1.894737 1.947368 2.000000
sd(x)
[1] 0.3113726
# Take same vector of data, but make a single outlier
y <- x
y[20] <- 10
y
 [1]  1.000000  1.052632  1.105263  1.157895  1.210526  1.263158
 [7]  1.315789  1.368421  1.421053  1.473684  1.526316  1.578947
[13]  1.631579  1.684211  1.736842  1.789474  1.842105  1.894737
[19]  1.947368 10.000000
sd(y)  # Will be way higher!
[1] 1.928213

The median absolute deviation is to the standard deviation as the median is to the mean. It is another measure of spread that is less sensitive to outliers. We calculate it by going through the following steps:

  • Calculate the median of the sample, call it \(m\).

  • Calculate the distance between each observation and the sample median: \(|x_i - m|\).

  • Take the median of these distances.

We can calculate the median absolute deviation in R using the mad() function. There’s a kind of annoying default scaling option to try to make the MAD more similar to the standard deviation; I don’t like that, so I add the argument constant = 1 to turn that off.

median(df_county_pres$total_votes)
[1] 11140
mad(df_county_pres$total_votes, constant = 1)
[1] 7762
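
To connect mad() back to the recipe above, here’s the same calculation done step by step:

m <- median(df_county_pres$total_votes)
median(abs(df_county_pres$total_votes - m))
[1] 7762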

This means half of the counties in the data have between 3,378 and 18,902 votes (median of 11,140, plus or minus the MAD of 7,762 votes).

median_votes <- median(df_county_pres$total_votes)
mad_votes <- mad(df_county_pres$total_votes, constant = 1)

df_county_pres |>
  # Categorize each row:
  # 1. below median - MAD
  # 2. above median - MAD, but below median + MAD
  # 3. above median + MAD
  mutate(category = case_when(
    total_votes < median_votes - mad_votes ~ "lower",
    total_votes < median_votes + mad_votes ~ "middle",
    total_votes >= median_votes + mad_votes ~ "upper"
  )) |>
  # Count by category and calculate proportions
  group_by(category) |>
  summarize(number = n()) |>
  mutate(proportion = number / sum(number))
# A tibble: 3 × 3
  category number proportion
  <chr>     <int>      <dbl>
1 lower      3355      0.152
2 middle    11047      0.500
3 upper      7691      0.348

In-class exercise

Repeat the previous exercise, on the spread of the Democratic candidate’s percentage margin across counties by year, but now using the MAD instead of the standard deviation. Does the data tell the same big-picture story, or a different one?

#
# [Write your answer here]
#

3.3.3 Quartiles and other percentages

Another common way to gauge the spread of a variable is to look at its quartiles:

  • The first/lower quartile is the value that 25% of the data is below and 75% is above.

  • The second quartile is the value that 50% of the data is below and 50% is above. In other words, the second quartile is the median.

  • The third quartile is the value that 75% of the data is below and 25% is above.

We can calculate quartiles using the quantile() function in R. Yes, they’re called quartiles with an “r”, yet we use quantile() with an “n”. (Typing quartile() for quantile() is one of my most frequent R mistakes.)

quantile(df_county_pres$total_votes)
     0%     25%     50%     75%    100% 
     64    5073   11140   29674 5488998 

This function also helpfully tells us the minimum and maximum. What we see here is that:

  • The smallest quarter of counties had 64 to 5,073 votes cast.

  • The next quarter had 5,073 to 11,140 votes cast.

  • The next had 11,140 to 29,674 votes cast.

  • The largest quarter had 29,674 to 5,488,998 votes cast.

I think the very largest value in the data is an error. It’s the 2024 value for Harris County, TX, home of Houston. The total population of the county is estimated to be about 5,000,000 — which includes children, noncitizens, and others ineligible to vote — and the sum of the Trump and Harris vote totals there is only about 2,750,000. Let this be a lesson on working with real-world data: take a careful look at data quality before trying to draw firm conclusions!
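
If you spot a suspicious value like this, it’s worth pulling out the offending row to inspect it. Here’s one way (the select() just keeps the columns relevant to the problem):

df_county_pres |>
  filter(total_votes == max(total_votes)) |>
  select(year, state, county, total_votes, dem_votes, rep_votes)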

The difference between the third and first quartiles is called the interquartile range. You can calculate it in R with the IQR() function. Some people find this value useful, though frankly I don’t.

IQR(df_county_pres$total_votes)
[1] 24601

The quartiles are special cases of percentiles of the data. For any number \(p\) between 0 and 100, the \(p\)’th percentile of the data is the value that \(p\)% of the data is below and \((100-p)\)% of the data is above. To calculate a percentile, we can again use the quantile() function. For example, let’s find the 95th percentile, separating the smallest 95% of counties from the largest 5%. Note that we have to specify the argument to quantile() in terms of a proportion (between 0 and 1) rather than a percentage (between 0 and 100).

quantile(df_county_pres$total_votes, 0.95)
     95% 
196344.8 

So we see that 95% of counties had fewer than about 196,345 votes cast, while 5% had more.