4 Data Visualization
Visualization is most important at the very beginning and the very end of the data analysis process. In the beginning, when you’ve just gotten your data together, visualization is perhaps the easiest tool to explore each variable and learn about the relationships among them. And when your analysis is almost complete, you will (usually) use visualizations to communicate your findings to your audience.
We only have time to scratch the surface of data visualization. This chapter will cover the plotting techniques I find most useful for exploratory and descriptive data analysis. We will talk about graphical techniques for presenting the results of regression analyses later in the class—once we’ve, you know, learned something about regression.
4.1 Basic Plots
We will use the ggplot2 package, which is part of—I’m as tired of it as you are—the tidyverse.
For the examples today, we’ll be using a dataset with statistics about the fifty U.S. states in 1977,11 which is posted on my website.
## # A tibble: 50 × 12
## State Abbrev Region Popul…¹ Income Illit…² LifeExp Murder HSGrad Frost Area
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaba… AL South 3615 3624 2.1 69.0 15.1 41.3 20 50708
## 2 Alaska AK West 365 6315 1.5 69.3 11.3 66.7 152 566432
## 3 Arizo… AZ West 2212 4530 1.8 70.6 7.8 58.1 15 113417
## 4 Arkan… AR South 2110 3378 1.9 70.7 10.1 39.9 65 51945
## 5 Calif… CA West 21198 5114 1.1 71.7 10.3 62.6 20 156361
## # … with 45 more rows, 1 more variable: IncomeGroup <chr>, and abbreviated
## # variable names ¹Population, ²Illiteracy
When I obtain data, I start by looking at the univariate distribution of each variable via a histogram. The following code creates a histogram in ggplot.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s walk through the syntax there. In the first line, we call ggplot()
, specifying the data frame to draw from, then in the aes()
command (which stands for “aesthetic”) we specify the variable to plot. If this were a bivariate analysis, here we would have also specified a y
variable to put on the y-axis. If we had just stopped there, we would have a sad, empty plot. The +
symbol indicates that we’ll be adding something to the plot. geom_histogram()
is the command to overlay a histogram.
We’ll only be looking at a few of the ggplot commands today. I recommend taking a look at the online package documentation at http://docs.ggplot2.org to see all of the many features available.
When you’re just making graphs for yourself to explore the data, you don’t need to worry about things like axis labels as long as you can comprehend what’s going on. But when you prepare graphs for others to read (including those of us grading your problem sets!) you need to include an informative title and axis labels. To that end, use the xlab()
, ylab()
, and ggtitle()
commands.
ggplot(state_data, aes(x = Population)) +
geom_histogram() +
xlab("Population (thousands)") +
ylab("Number of states") +
ggtitle("Some states are big, but most are small")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The density plot is a close relative of the histogram. It provides a smooth estimate of the probability density function of the data. Accordingly, the area under the density plot integrates to one. Depending on your purposes, this can make the y-axis of a density plot easier or (usually) harder to interpret than the count given by a histogram.
The box plot is a common way to look at the distribution of a continuous variable across different levels of a categorical variable.
A box plot consists of the following components:
- Center line: median of the data
- Bottom of box: 25th percentile
- Top of box: 75th percentile
- Lower “whisker”: range of observations no more than 1.5 IQR (height of box) below the 25th percentile
- Upper “whisker”: range of observations no more than 1.5 IQR above the 75th percentile
- Plotted points: any data lying outside the whiskers
If you want to skip the summary and plot the full distribution of a variable across categories, you can use a violin plot.
Technically, violin plots convey more information than box plots since they show the full distribution. However, readers aren’t as likely to be familiar with a violin plot. It’s harder to spot immediately where the median is (though you could add that to the plot if you wanted). Plus, violin plots look goofy with outliers—see the “West” column above—whereas box plots handle them easily.
For visualizing relationships between continuous variables, nothing beats the scatterplot.
When you’re plotting states or countries, a hip thing to do is plot abbreviated names instead of points. To do that, you can use geom_text()
instead of geom_point()
, supplying an additional aesthetic argument telling ggplot where to draw the labels from.
Maybe it’s overwhelming to look at all that raw data and you just want a summary. For example, maybe you want an estimate of expected LifeExp
for each value of Illiteracy
. This is called the conditional expectation and will be the subject of much of the rest of the course. For now, just now that you can calculate a smoothed conditional expectation via geom_smooth()
.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
And if you’re the kind of overachiever who likes to have the raw data and the summary, you can do it. Just add them both to the ggplot()
call.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
4.2 Saving Plots
When you’re writing in R Markdown, the plots go straight into your document without much fuss. Odds are, your dissertation will contain plots but won’t be written in R Markdown, which means you’ll need to learn how to save them.
It’s pretty simple:
- Assign your
ggplot()
call to a variable. - Pass that variable to the
ggsave()
function.
pop_hist <- ggplot(state_data, aes(x = Population)) +
geom_histogram()
ggsave(filename = "pop-hist.pdf",
plot = pop_hist,
width = 6,
height = 3)
If you want plot types other than PDF, just set a different extension. See ?ggsave
for the possibilities.
4.3 Faceting
Suppose you want to split the data into subgroups, as defined by some variable in the data (e.g., the region states are in), and make the same plot for each subgroup. ggplot’s faceting functions, facet_wrap()
and facet_grid()
, make this easy.
To split up plots according to a single grouping variable, use facet_wrap()
. This uses R’s formula syntax, defined by the tilde ~
, which you’ll become well acquainted with once we start running regressions.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you don’t like the default arrangement, use the ncol
argument.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For two grouping variables, use facet_grid()
, putting variables on both sides of the formula.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
4.4 Aesthetics
Faceting is one way to incorporate information about additional variables into what would otherwise be a plot of just one or two variables. Aesthetics—which alter the appearance of particular plot features depending on the value of a variable—provide another way to do that.
For example, when visualizing the relationship between statewide illiteracy and life expectancy, you might want larger states to get more visual weight. You can set the size
aesthetic of the point
geometry to vary according to the state’s population.
The ggplot2 documentation lists the available aesthetics for each function. Another popular one is colour
, which is great for on-screen display but not so much for the printed page. (And terrible for the colorblind!)
For line graphs or density plots, you can set the linetype
to vary by category.
(I always find these incomprehensible with more than two lines, but maybe that’s just me.) You can use multiple aesthetics together, and you can even combine aesthetics with faceting, as in the following example.
ggplot(state_data, aes(x = Illiteracy, y = LifeExp)) +
geom_smooth() +
geom_text(aes(label = Abbrev, colour = Region, size = Population)) +
facet_wrap(~ IncomeGroup)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
But the fact that you can do something doesn’t mean you should. That plot is so cluttered that it’s hard to extract the relevant information from it. Data visualizations should communicate a clear message to viewers without overwhelming them. To do this well takes practice, patience, and maybe even a bit of taste.
4.5 Appendix: Creating the Example Data
The example data comes from data on U.S. states in 1977 that are included with base R. See ?state
.
library("tidyverse")
state_data <- state.x77 %>%
as_tibble() %>%
add_column(State = rownames(state.x77),
Abbrev = state.abb,
Region = state.region,
.before = 1) %>%
rename(LifeExp = `Life Exp`,
HSGrad = `HS Grad`) %>%
mutate(IncomeGroup = cut(Income,
breaks = quantile(Income,
probs = seq(0, 1, by = 1/3)),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE))
write_csv(state_data, file = "state-data.csv")
Why 1977? Because it was easily available. See the appendix to this chapter.↩︎