2 Principles of Programming

As a working social scientist, most of the time you spend on data analysis won’t be on the analysis part. It’ll be on obtaining and cleaning the data, to get it in a form that makes sense to analyze. Good programming skills will let you spend less time cleaning data and more time publishing papers.

Even if you don’t want to develop good programming habits, journals are going to force you to. Every reputable political science journal requires that you provide replication scripts, and some of the best (e.g., American Journal of Political Science) have begun auditing the replication materials as a condition of publication. Better to learn The Right Way now when you have lots of time than to be forced to when you’re writing a dissertation or on the market or teaching your own courses.

Political scientists now have more data and more computing power than ever before. You can’t collect, manage, clean, and analyze large quantities of data without understanding the basic principles of programming.

As Bowers (2011) puts it, “Data analysis is computer programming.” By getting a PhD in political science,1 you are, by necessity, going to become a computer programmer. The choice before you is whether to be a good one or a bad one.

Wilson et al. (2014) list eight “best practices for scientific computing.” The first two encapsulate most of what you need to know:

  1. Write programs for people, not computers.
  2. Let the computer do the work.

2.1 Write Programs for People, Not Computers

The first two words here—write programs—are crucial. When you are doing analysis for a research project, you should be writing and running scripts, not typing commands into the R (or Stata) console. The console is ephemeral, but scripts are forever, at least if you save them.

Like the manuscripts you will write to describe your findings, your analysis scripts are a form of scientific communication. You wouldn’t write a paper that is disorganized, riddled with grammatical errors, or incomprehensible to anyone besides yourself. Don’t write your analysis scripts that way either.

I personally prefer each script to be self-contained, ideally accomplishing one major task. A typical breakdown of scripts for a project of mine looks like:

  • 0-download.r: downloads the data
  • 1-clean.r: cleans the data
  • 2-run.r: runs the main analysis
  • 3-figs.r: generates figures

The exact structure varies depending on the nature of the project. Notice that the scripts are numbered in the order they should be run.

Using an omnibus script that runs every bit of analysis is like writing a paper without paragraph breaks. And if portions of your analysis take a long time to run, an omnibus file at best is unwieldy to work with and at worst causes you to lose valuable time.

Within each script, write the code to make it as easy as possible for your reader to follow what you’re doing. You should indent your code according to style conventions such as https://style.tidyverse.org/. Even better, use the Code -> Reindent Lines menu option in RStudio to automatically indent according to a sane style.

# Terrible
my_results<-c(mean(variable),quantile(variable,probs=0.25),max(variable))

# Bad
my_results <- c(mean(variable),
quantile(variable,
probs = 0.25),
max(variable))

# Better
my_results <- c(mean(variable),
                quantile(variable,
                         probs = 0.25),
                max(variable))

Another way to make your code readable—one that, unfortunately, cannot be accomplished quite so algorithmically—is to add explanatory comments. The point of comments is not to document how the language works. The following is an extreme example of a useless comment.

# Take the square root of the errors and assign them to
# the output variable
output <- sqrt(errors)

A better use for the comment would be to explain why you’re taking the square root of the errors, at least if your purpose in doing so would be unclear to a hypothetical reader of the code.
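
By contrast, a comment along the following lines (a hypothetical example, not taken from a real script) tells the reader something the code alone doesn’t:

# Report error magnitudes on the same scale as the outcome,
# so convert the squared errors back to their original units
output <- sqrt(errors)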

My basic heuristic for code readability is: “If I got hit by a bus tomorrow, could one of my coauthors figure out what the hell I was doing and finish the paper?”

2.2 Let the Computer Do the Work

Computers are really good at structured, repetitive tasks. If you ever find yourself entering the same thing into the computer over and over again, you are Doing It Wrong. Your job as the human directing the computer is to figure out the structure that underlies the repeated task and to program the computer to do the repetition.

For example, imagine you have just run a large experiment and you want to estimate effects across subgroups.2 Your respondents differ across four categories—party ID (R or D), gender (male or female), race (white or nonwhite), and education (college degree or not)—giving you 2 × 2 × 2 × 2 = 16 subgroups. You could copy and paste your code to estimate the treatment effect 16 times. But this is a bad idea for a few reasons (a sketch of the loop-based alternative appears after the list).

  • Copy-paste doesn’t scale. Copy-paste is manageable (albeit misguided) for 16 iterations, but probably not for 50 and definitely not for more than 100.

  • Making changes becomes painful. Suppose you decide to change how you calculate the estimate. Now you have to go back and individually edit 16 chunks of code.

  • Copy-paste is error-prone, and insidiously so. If you do the calculation wrong all 16 times, you’ll probably notice. But what if you screwed up for just one or two cases? Are you really going to go through and check that you did everything right in each individual case?
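
To preview what the loop-based alternative might look like, here is a sketch. The data frame and variable names are hypothetical (a data frame dat with columns y, treat, party, gender, race, and college), and the pieces it uses, for loops in particular, are introduced over the rest of this chapter.

## Hypothetical sketch: loop over all 16 subgroups instead of copy-pasting
groups <- expand.grid(party = c("R", "D"),
                      gender = c("male", "female"),
                      race = c("white", "nonwhite"),
                      college = c(TRUE, FALSE))

effects <- rep(NA, nrow(groups))
for (i in 1:nrow(groups)) {
  ## Subset to the i-th subgroup
  in_group <- dat$party == groups$party[i] &
    dat$gender == groups$gender[i] &
    dat$race == groups$race[i] &
    dat$college == groups$college[i]
  sub <- dat[in_group, ]

  ## Difference in mean outcomes between treated and control respondents
  effects[i] <- mean(sub$y[sub$treat == 1]) - mean(sub$y[sub$treat == 0])
}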

We’re going to look at the most basic ways to get the computer to repeat structured tasks—functions and control flow statements. To illustrate these, we’ll run a Monte Carlo simulation on the trimmed mean, an alternative to the usual sample mean for estimating population means.

To calculate the trimmed mean, we remove the lowest x% and highest x% of observations from the data, then calculate the sample mean of what remains. For example, think of the following sample of 20 observations with a big obvious outlier.

outlier_sample
##  [1]   0.05   0.14   0.18   0.21   0.24   0.26   0.31   0.32   0.32   0.36
## [11]   0.41   0.43   0.51   0.63   0.78   0.83   0.88   0.89   0.96 100.54

Let’s compare the usual sample mean to the trimmed mean where we remove the top and bottom 5% of the data.

mean(outlier_sample)
## [1] 5.4625
mean(outlier_sample[-c(1, 20)])
## [1] 0.4811111

2.2.1 Functions

Is the trimmed mean a superior estimator of the population mean for non-normal distributions? We’ll examine this question by drawing data from the exponential distribution, a right-skewed distribution with support on \([0, \infty)\). The exponential distribution is characterized by its rate parameter \(\lambda > 0\), with a mean of \(1 / \lambda\) and variance of \(1 / \lambda^2\). For our example, we’ll set \(\lambda = 1/4\), so the population mean is 4.
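
As a quick sanity check on these formulas, you can draw a large sample from the distribution and compare its mean to the theoretical value of 4 (a sketch; the exact result will vary from run to run):

## With many draws, the sample mean should land near 1 / (1/4) = 4
mean(rexp(n = 1e5, rate = 1/4))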

We will write a function that calculates the trimmed mean for a given sample of data. This calculation requires that we know two things: the sample of data and the percentage of observations to remove from each side. We will make these the arguments of the function. To write a function, we give it a name and specify its arguments with a line like function_name <- function(arg1, arg2, ...). We follow that up with a block of code listing the series of calculations to perform on the arguments. At the end of that block of code, we tell it what to return. (Any intermediate computations are lost to the sands of time—the only thing we see is the return value.)

trimmed_mean <- function(sample, pct) {
  lower <- quantile(sample, pct)
  upper <- quantile(sample, 1 - pct)
  subset <- sample[sample >= lower & sample <= upper]
  return(mean(subset))
}

We can confirm that this does what we want by going back to our example data:

trimmed_mean(sample = outlier_sample, pct = 0)
## [1] 5.4625
trimmed_mean(sample = outlier_sample, pct = 0.05)
## [1] 0.4811111

2.2.2 For loops

We want to get an idea of the sampling distribution of the trimmed mean—the distribution of its value across many different independent random samples. We’ll use a for loop to perform the same operation repeatedly on different data.

## Set up vectors to store the output
n_per_sample <- 50
n_samples <- 1000
dist_mean <- dist_trim_5 <- dist_trim_10 <- rep(NA, n_samples)

## Run Monte Carlo simulation
for (i in 1:n_samples) {
  sample <- rexp(n = n_per_sample, rate = 1/4)
  dist_mean[i] <- mean(sample)
  dist_trim_5[i] <- trimmed_mean(sample = sample, pct = 0.05)
  dist_trim_10[i] <- trimmed_mean(sample = sample, pct = 0.1)
}

Here’s how the for loop works. We specified i as the name of the index variable. For each value in the vector specified after in (in this case, 1:n_samples), i is set to that value, and then the code inside the curly braces is run. The loop keeps doing this until it’s gotten through every element of 1:n_samples.
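
To see the mechanics in isolation, here’s a toy loop that has nothing to do with the simulation:

for (i in 1:3) {
  print(i * 10)
}
## [1] 10
## [1] 20
## [1] 30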

Now let’s look and see which estimator gets closest to the true population mean on average.

mean(dist_mean)
## [1] 4.004953
mean(dist_trim_5)
## [1] 3.541034
mean(dist_trim_10)
## [1] 3.365819

The trimmed means appear to be biased downward, with the bias growing with the amount of data that’s trimmed. This makes intuitive sense—you know from Stat I that the ordinary sample mean is an unbiased estimator of the population mean, so throwing out data is probably going to introduce some bias. Let’s also take a look at the standard error of each estimator.

sd(dist_mean)
## [1] 0.5940141
sd(dist_trim_5)
## [1] 0.5547327
sd(dist_trim_10)
## [1] 0.5424173

Here we see a potential advantage of the trimmed estimators—they’re more precise, with less variation across samples. Though notice that the standard error of the 10% trimmed estimator is barely smaller than that of the 5% trimmed, even though it is noticeably more biased.
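
If you want a single number that weighs the downward bias against the reduction in variance, one option (not pursued above, so treat this as a sketch) is the root mean squared error of each estimator around the true population mean of 4:

## Root mean squared error around the true mean of 4
sqrt(mean((dist_mean - 4)^2))
sqrt(mean((dist_trim_5 - 4)^2))
sqrt(mean((dist_trim_10 - 4)^2))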

For loops are fun, but don’t overuse them. Many simple operations are vectorized and don’t require a loop. For example, suppose you want to take the square of a sequence of numbers. You could use a for loop …

input <- c(1, 3, 7, 29)
output <- rep(NA, length(input))

for (i in 1:length(input)) {
  output[i] <- input[i]^2
}

output
## [1]   1   9  49 841

… but it’s faster (in terms of computational speed) and easier to just take advantage of vectorization:

input^2
## [1]   1   9  49 841

2.2.3 If/else statements

Sometimes you want to perform different computations depending on the circumstance. You can use an if/else statement for this. An if/else statement checks a logical condition—an expression whose value is TRUE or FALSE—and runs different code depending on the value of the expression. These conditions are often, but not always, obtained by using one of the comparison operators like ==, !=, >=, >, <=, or <.
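
Here’s a minimal sketch of the syntax, using toy values that aren’t part of the simulation:

x <- -3

if (x >= 0) {
  result <- sqrt(x)
} else {
  result <- NA  # avoid taking the square root of a negative number
}

result
## [1] NA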

We’ll illustrate if/else statements with a kind of wacky estimator of the population mean. If every observation is within 4 standard deviations of the sample mean, we’ll use the sample mean as our estimate. Otherwise, if at least one observation is more than 4 standard deviations away from the ordinary sample mean, we’ll report the 5% trimmed mean.

dist_wacky <- rep(NA, n_samples)

for (i in 1:n_samples) {
  sample <- rexp(n = n_per_sample, rate = 1/4)
  z_scores <- (sample - mean(sample)) / sd(sample)
  if (max(abs(z_scores)) > 4) {
    dist_wacky[i] <- trimmed_mean(sample = sample, pct = 0.05)
  } else {
    dist_wacky[i] <- mean(sample)
  }
}

We’ll see how this compares to the other estimators we’ve looked at.

mean(dist_wacky)
## [1] 3.871631
sd(dist_wacky)
## [1] 0.6078043

Our wacky estimator seems like the worst of all possible worlds: just as high a standard error as the ordinary sample mean, yet slightly downward biased.

There is a vectorized version of if/else statements called, naturally, the ifelse function. This function takes three arguments: (1) a vector of logical conditions, (2) the output values to use where the condition is TRUE, and (3) the output values to use where it is FALSE. The last two arguments are typically vectors of the same length as the condition, as in the example below.

x <- 1:10
big_x <- x * 100
small_x <- x * -100

ifelse(x > 5, big_x, small_x)
##  [1] -100 -200 -300 -400 -500  600  700  800  900 1000

Functions, for loops, and if/else statements are just a few of the useful tools for programming in R.3 But even these simple tools are enough to allow you to do much more at scale than you could with a copy-paste philosophy.
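
For example, the replicate function mentioned in the footnote can redo the first part of our Monte Carlo simulation without an explicit loop. This sketch reproduces the sampling distribution of the ordinary sample mean (up to random variation):

## Sampling distribution of the sample mean, without a for loop
dist_mean_alt <- replicate(n_samples,
                           mean(rexp(n = n_per_sample, rate = 1/4)))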

2.3 Debugging and Asking for Help

Programming is hard, and mistakes are common. Like virtually everything we do in academia, it’s an iterative process: your code starts bad, and the goal is to make it less bad over time until it’s finally acceptable. You should always feel welcome to ask for help! It’ll be easiest for others (including me) to help you if you follow some basic best practices for seeking help with code.

The first step is to properly diagnose the problem. “It didn’t work” isn’t very descriptive, because there are many ways for an R script to fail. Most problems fall into one of the three following categories, and there are important differences between them:

  1. Errors. An error is when your code literally fails to run: the R script stops and produces no output. You will know when an error has occurred because there is an error message.

    sqrt("pizza")
    ## Error in sqrt("pizza"): non-numeric argument to mathematical function

    Here, we have asked for the square root of a character string, which sqrt isn’t equipped to deal with, so R throws an error. The error message tells us what happened: “non-numeric argument to mathematical function”, i.e., you tried to do math on something that isn’t a number.

    The first step to fixing an error like this should be to read the error message. If you don’t understand the error message, Google is your friend. Stack Overflow, GitHub issues pages, and (to a lesser degree) R listservs are the best places to find useful discussions of error messages.

  2. Warnings. These look like errors if you’re not paying attention, but they’re actually not! Unlike with an error, when you receive a warning, R will keep running and produce output.

    sqrt(c(1, 4, 9, -16, 25))
    ## Warning in sqrt(c(1, 4, 9, -16, 25)): NaNs produced
    ## [1]   1   2   3 NaN   5

    Here, we have asked R for the square root of a negative number. Since R doesn’t deal with complex numbers by default, this results in R warning us that the output contains an NaN, which stands for “not a number”. But notice that unlike with the error above, our code did produce output.

    Warnings can be annoying to deal with. Sometimes a warning is an indication of a real problem that needs fixing. Other times, a warning is just that—an alert to you about something that might not have otherwise been obvious. When your code throws a warning, that’s an indicator to take a closer look and make a judgment call about whether it needs fixing.

  3. Unexpected output. Perhaps most commonly: even though there’s no error or warning message, the output of your code isn’t what you expected.

    x <- c(1, 2, 3, 4, 5, 6, NA)
    sum(x[x < 4])  # should be 1 + 2 + 3 = 6
    ## [1] NA

    No error or warning message here, but R didn’t give us what we hoped for. This is perhaps the most common debugging situation. Usually, the best thing to do is to go step-by-step and see what went wrong.

    x < 4
    ## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE    NA
    x[x < 4]
    ## [1]  1  2  3 NA
    sum(x[x < 4], na.rm = TRUE)
    ## [1] 6

    The problem is that x[x < 4] includes the NA, and that the sum() function defaults to treating any sum that includes an NA as NA. That all becomes clearer when we build our code up step-by-step, and we consult the documentation for the functions we’re using.

So when you’re asking for help, the first thing to figure out is whether you’re trying to eliminate an error, eliminate a warning, or correct unexpected output. In the last case, you should have a clear idea of what output you expected and of how the actual output differs from it.

Once you’ve diagnosed the problem, the next step in asking for help is to create a minimal, reproducible example. By “minimal”, I mean that you have identified the precise line/chunk of code where the problem is occurring. By “reproducible”, I mean that running your example in a controlled environment—a fresh R session in an empty working directory on someone else’s computer—will produce the same result that you find problematic. If at all possible, it is also good to include sample/simulated data instead of sending an entire data file.

A minimal reproducible example might look like this:

Problem extracting elements from each row of a matrix

Sample code:

x <- rbind(c(1, 2, 3), c(4, 6, 8), c(5, 10, 15), c(100, 200, 300))
x[c(2, 3, 1, 2)]

Expected output: 2, 8, 5, 200 (2nd element of 1st row, 3rd element of 2nd row, etc.)

Actual output: 4, 5, 1, 4

This gives me (or whoever’s helping you) exactly what’s needed to reason through the situation and help you find a solution.4

What you don’t want to do when seeking help is send someone 500 lines of code, which can’t run on their computer anyway because it depends on data that wasn’t provided, with the description “It doesn’t work”. Providing a minimal reproducible example allows the person helping you to focus on the actual problem, without having to wade through irrelevant material that never gets to the point. Just as important, the process of creating a minimal reproducible example can help you debug your code and understand for yourself what’s actually wrong. I would say that about half of the time I’m about to post something online seeking help with R code, I figure out and fix the problem myself in the process of creating the minimal reproducible example.

References

Bowers, Jake. 2011. “Six Steps to a Better Relationship with Your Future Self.” The Political Methodologist 18 (2): 2–8.
Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLOS Biology 12 (1): e1001745.

  1. Or whatever other social science field.↩︎

  2. There could be statistical problems with this kind of analysis, at least if the subgroups were specified post hoc. See https://xkcd.com/882/ (“Significant”). We’re going to leave this issue aside for now, but we’ll return to it later when we discuss the statistical crisis in science.↩︎

  3. Others include the replicate function, the apply family of functions (sapply, lapply, tapply, mapply, …), the foreach package, and the purrr package, just to name a few of the most useful off the top of my head.↩︎

  4. For this particular example, the syntax you’d want to use is x[cbind(c(1, 2, 3, 4), c(2, 3, 1, 2))]. Try it!↩︎