library("tidyverse")
library("broom")
library("tidytext")
library("SnowballC")
library("caret")
library("glmnet")
7 Text Analysis
Lots of important data about politics comes in the form of text rather than numbers. Just a few examples off the top of my head:
The text of constitutions, legislation, executive speeches, legislative debates, and other official acts.
Transcripts of news programs, pre-election debates, and other media events.
Free response fields in public opinion surveys, social media posts about politics, and other political text sources produced by ordinary citizens.
Historical accounts of diplomatic negotiations, state visits, wars, and other events in international relations.
To analyze text in R, we’ll be using a few new packages we haven’t used before, in addition to the usual tidyverse and broom packages.
To install these, you’ll need to run the following command once from the R console. Don’t try to put it in a Quarto document; installing packages from within the Quarto rendering process is dicey at best. (Plus, you only need to install the package once ever, so it’d be a waste of computing time and power to reinstall it every time you render a Quarto file.)
install.packages(c("tidytext", "SnowballC", "caret", "glmnet"))
7.1 Data: The Federalist Papers
We’ll study text analysis through the lens of the Federalist Papers, a series of 85 brief essays by Alexander Hamilton, John Jay, and James Madison. These essays, published in 1787 and 1788, make the case for the adoption of the new Constitution to replace the prior Articles of Confederation, laying out the political philosophy underlying the Constitution.
The Federalist Papers were originally published anonymously under the pseudonym “Publius”. Decades later, authorship lists from Hamilton and (even later) Madison were circulated, concurring on the authorship of most of the papers while disagreeing on 11 or 12. Both historians and statisticians have kept themselves busy trying to establish the authorship of the disputed papers, as you can tell from a quick Google Scholar search.
In our study of text analysis, we will try to replicate some of the statistical approaches to establishing the authorship of the disputed papers. What we do here should just be considered a first cut at a very difficult problem, not the final answer! Most importantly, along the way, you’ll learn about some data analysis techniques that are applicable well beyond the Federalist Papers:
How to clean text data to make it amenable for statistical analysis. In essence, we’ll answer the question: How do we turn words into (statistically meaningful) numbers?
Regression-type statistical models for classification with many features. There are a few key differences from the regression techniques we worked with before:
The response and the predictions will be categorical instead of numeric.
We will be able to make predictions with many, many features — potentially even more features than we have observations in our sample.
We will use specialized techniques to avoid overfitting — treating “noise” in the data as if it were a “signal” — which is especially important when we have very many features.
We will work with a slightly cleaned-up version of the Federalist Papers, via the data file fed_papers.csv. This file is available on my website at https://bkenkel.com/qps1/data/fed_papers.csv. It’s constructed using the raw Federalist Papers text from Project Gutenberg.
df_fed_papers <- read_csv("https://bkenkel.com/qps1/data/fed_papers.csv")

df_fed_papers
# A tibble: 85 × 3
paper_id author text
<dbl> <chr> <chr>
1 1 hamilton "To the People of the State of New York:\n\nAfter …
2 2 jay "To the People of the State of New York:\n\nWhen t…
3 3 jay "To the People of the State of New York:\n\nIt is …
4 4 jay "To the People of the State of New York:\n\nMy las…
5 5 jay "To the People of the State of New York:\n\nQueen …
6 6 hamilton "To the People of the State of New York:\n\nThe th…
7 7 hamilton "To the People of the State of New York:\n\nIt is …
8 8 hamilton "To the People of the State of New York:\n\nAssumi…
9 9 hamilton "To the People of the State of New York:\n\nA firm…
10 10 madison "To the People of the State of New York:\n\nAmong …
# ℹ 75 more rows
There are just three columns here:
paper_id gives the number of the paper, in order of their publication.
author lists the author of the paper.
text contains the full text of the paper.
df_fed_papers |>
  group_by(author) |>
  summarize(number = n())
# A tibble: 5 × 2
author number
<chr> <int>
1 hamilton 51
2 hamilton and madison 3
3 hamilton or madison 11
4 jay 5
5 madison 15
We see here that most of the papers were written solely by Hamilton, with a smaller number by Madison and an even smaller number by Jay. The 11 disputed papers are labeled "hamilton or madison".
If you’d like to read the full text of one of these papers … well, it’s probably easiest to just Google it. But if for some reason you wanted to read it in R, you can use the cat() function.
cat(df_fed_papers$text[1])
## To the People of the State of New York:
##
## After an unequivocal experience of the inefficacy of the subsisting
## federal government, you are called upon to deliberate on a new
## Constitution for the United States of America. The subject speaks its
## own importance; comprehending in its consequences nothing less than the
## existence of the UNION, the safety and welfare of the parts of which it
## is composed, the fate of an empire in many respects the most
## interesting in the world. It has been frequently remarked that it seems
## to have been reserved to the people of this country, by their conduct
## and example, to decide the important question, whether societies of men
## are really capable or not of establishing good government from [...]
7.2 Cleaning text data
7.2.1 From raw text to tokens
The first thing we want to do is break up the text of each document into the individual words that comprise it. Data scientists would use the term tokens rather than words, (1) because we’ll capture some bits of text like numerals that aren’t technically words, and (2) in some applications the unit of analysis is something smaller than words (e.g., syllables) or bigger than words (e.g., full sentences). For our purposes you don’t have to worry about the word/token distinction.
The unnest_tokens() function takes a vector of document text and turns each document into a “bag of words”, extracting individual words. You use the input argument to tell it where to take the text from, and the output argument to tell it the name of the column to put the words into. I’ll illustrate with a simple example before applying it to the Federalist Papers data.
my_fake_data <- tibble(
  id = c(1, 2),
  text = c("The Muppets ate bananas!!", "Bananas are what they ate??")
)

my_fake_data |>
  unnest_tokens(input = text, output = word)
# A tibble: 9 × 2
id word
<dbl> <chr>
1 1 the
2 1 muppets
3 1 ate
4 1 bananas
5 2 bananas
6 2 are
7 2 what
8 2 they
9 2 ate
By default, unnest_tokens() gets rid of all capitalization and punctuation, leaving us with the raw words. That’s typically what we want for text analysis. We don’t want to treat "bananas" like a different word from "Bananas", or "ate" different from "ate??". (Of course, for some purposes you would want to make such distinctions, and then you’d use custom options in tokenization so that these would be treated as separate tokens.)
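For instance, here is a quick sketch (my own example, reusing my_fake_data from above) of two such options: to_lower = FALSE keeps capitalization, and token = "sentences" tokenizes into whole sentences rather than individual words.

# Keep capitalization instead of lowercasing everything
my_fake_data |>
  unnest_tokens(input = text, output = word, to_lower = FALSE)

# Tokenize into whole sentences instead of individual words
my_fake_data |>
  unnest_tokens(input = text, output = sentence, token = "sentences")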
When we put the Federalist Papers through the unnest_tokens() function, we end up with many thousands of individual words. We’ll store these in a new data frame called df_fed_tokens, so we don’t overwrite the original raw data.
df_fed_tokens <- df_fed_papers |>
  unnest_tokens(input = text, output = word)

df_fed_tokens
# A tibble: 190,623 × 3
paper_id author word
<dbl> <chr> <chr>
1 1 hamilton to
2 1 hamilton the
3 1 hamilton people
4 1 hamilton of
5 1 hamilton the
6 1 hamilton state
7 1 hamilton of
8 1 hamilton new
9 1 hamilton york
10 1 hamilton after
# ℹ 190,613 more rows
We can use our typical data wrangling functions to see which words appear most commonly in the Federalist Papers data. Are the most common words high-minded things like “politics” and “president” and “constitution”?
df_fed_tokens |>
  group_by(word) |>
  summarize(number = n()) |>
  arrange(desc(number))
# A tibble: 8,689 × 2
word number
<chr> <int>
1 the 17767
2 of 11812
3 to 7060
4 and 5089
5 in 4450
6 a 3989
7 be 3832
8 that 2789
9 it 2543
10 is 2190
# ℹ 8,679 more rows
Nope, they’re extremely boring common words that would appear in virtually any English language document. Which brings us to our next task…
7.2.2 Removing stop words
Words that appear commonly and don’t have much topical meaning — e.g., “the”, “an”, “but”, “with” — are called stop words. We don’t typically expect these words to do much to help us classify documents, so it’s common to remove them before performing text analysis. There are multiple different lists of English stop words out there for different use cases, but we’ll just use the default list included in the stop_words data frame (via the tidytext package).
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
We want to reduce df_fed_tokens, our “bag of words” data frame, to exclude these words. We can do that using the anti_join() function. Just like left_join(), which we saw back in the Data Wrangling unit, anti_join() takes two data frames and a column name as input. But it works a little bit differently: it returns only the rows of the first data frame that have no match in the second data frame. In this case, we want the rows of df_fed_tokens where the word column has no match in stop_words.
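Before applying this to the full data, here is a tiny illustration of anti_join() (my own toy example, not part of the Federalist analysis).

# Only rows with no match in stop_words survive the anti_join()
toy_words <- tibble(word = c("the", "constitution", "and", "union"))
anti_join(toy_words, stop_words, by = "word")

Only “constitution” and “union” should survive here, since “the” and “and” are both on the stop word list.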
df_fed_tokens <- df_fed_tokens |>
  anti_join(stop_words, by = "word")

df_fed_tokens
# A tibble: 65,387 × 3
paper_id author word
<dbl> <chr> <chr>
1 1 hamilton people
2 1 hamilton york
3 1 hamilton unequivocal
4 1 hamilton experience
5 1 hamilton inefficacy
6 1 hamilton subsisting
7 1 hamilton federal
8 1 hamilton government
9 1 hamilton called
10 1 hamilton deliberate
# ℹ 65,377 more rows
About 2/3 of the words in the raw text were removed once we got rid of the stop words. And now if we look at the most common words, it looks a lot more like what we might intuitively expect from the Federalist Papers.
df_fed_tokens |>
  group_by(word) |>
  summarize(number = n()) |>
  arrange(desc(number))
# A tibble: 8,173 × 2
word number
<chr> <int>
1 government 829
2 people 612
3 power 606
4 constitution 462
5 union 362
6 national 340
7 federal 324
8 authority 290
9 public 283
10 powers 255
# ℹ 8,163 more rows
Yet now we see another issue. Do we really want to treat “power” and “powers” as if they’re distinct words? Sometimes we do, but often we’d rather just group these together. That brings us to yet another data cleaning task…
7.2.3 From tokens to stems
Our last word-level processing step will be to stem the words. This is an algorithm that removes plurals and suffixes, reducing words to their root form. We use the wordStem() function for this. For a simple example:
gov_words <- c(
  "government", "govern", "governed", "governing",
  "governs", "governor", "governors"
)

wordStem(gov_words)
[1] "govern" "govern" "govern" "govern" "govern" "governor"
[7] "governor"
There’s a tradeoff with stemming words like this. On the plus side, we help ensure that the meanings of words are what drive our statistical analysis. A document that uses the word “govern” many times won’t be treated differently than one that uses “governs” many times. The downside is that we lose some grammatical and syntactical information. That doesn’t matter too much when we’re treating each document like a “bag of words”, but in other applications it would be inappropriate to neglect the differences between “govern” and “governs” and “governing.”
Going back to the full Federalist Papers data, we can use a simple mutate() to stem the raw tokens.
df_fed_tokens <- df_fed_tokens |>
  mutate(word_stem = wordStem(word))

df_fed_tokens
# A tibble: 65,387 × 4
paper_id author word word_stem
<dbl> <chr> <chr> <chr>
1 1 hamilton people peopl
2 1 hamilton york york
3 1 hamilton unequivocal unequivoc
4 1 hamilton experience experi
5 1 hamilton inefficacy inefficaci
6 1 hamilton subsisting subsist
7 1 hamilton federal feder
8 1 hamilton government govern
9 1 hamilton called call
10 1 hamilton deliberate deliber
# ℹ 65,377 more rows
After stemming, we end up with a similar — but not identical — list of most frequent terms across the 85 essays.
df_fed_tokens |>
  group_by(word_stem) |>
  summarize(number = n()) |>
  arrange(desc(number))
# A tibble: 4,801 × 2
word_stem number
<chr> <int>
1 govern 1040
2 power 911
3 constitut 685
4 peopl 612
5 nation 568
6 author 397
7 object 378
8 union 363
9 law 351
10 execut 347
# ℹ 4,791 more rows
7.2.4 Creating document-level features
Remember that our ultimate goal here is to discern the true author of the 11 disputed essays. That means our analysis will ultimately take place at the document level. So far we have:
Extracted the individual words from each document.
Removed the “stop words” that don’t contribute to the document’s substantive meaning.
“Stemmed” the words so that words representing the same concepts are grouped together.
Altogether, we have a list of word stems for each document. We want to use these to calculate a set of features for each document, which we can then use to classify them.
As a simple start, we will calculate the number of times each stemmed non-stop word appears in each document. At this point let’s make another new data frame, which we’ll call df_fed_tf (where tf stands for “term frequency” — more on that in a moment).
df_fed_tf <- df_fed_tokens |>
  group_by(paper_id, word_stem) |>
  summarize(number = n()) |>
  ungroup()
`summarise()` has grouped output by 'paper_id'. You can override using
the `.groups` argument.
df_fed_tf
# A tibble: 37,834 × 3
paper_id word_stem number
<dbl> <chr> <int>
1 1 1 2
2 1 absurd 1
3 1 accid 1
4 1 acknowledg 1
5 1 act 1
6 1 actuat 1
7 1 add 1
8 1 addit 1
9 1 address 1
10 1 admit 1
# ℹ 37,824 more rows
At this point, we could just take the raw word counts as our document-level features. However, that approach is not ideal for a couple of reasons.
Most simply, some of the essays are longer than others. For example, Federalist #22 includes the stemmed word “law” 10 times, while Federalist #56 includes it 8 times. That might make it sound like “law” plays a larger role in #22 than in #56 — until you observe that Federalist #22 has 1,247 non-stop words, while Federalist #56 has only 479, less than half as many.
Raw word counts don’t give us any idea of the distinctiveness of certain words. For example, the stem “govern” appears at least twice in every single essay. Knowing that an essay uses the word “govern” doesn’t really help us discern its topics or nature from other ones. By contrast, the stem “stamp” only appears in 5 of the 85 essays. In this sense, usage of “stamp” gives us a lot more distinct information about the topic — and perhaps even the authorship — of each essay than “govern” does.
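If you want to verify those document counts yourself, here is a quick check of my own, using df_fed_tokens as built above; it should reproduce the figures just mentioned.

# In how many of the 85 papers does each stem appear at least once?
df_fed_tokens |>
  filter(word_stem %in% c("govern", "stamp")) |>
  group_by(word_stem) |>
  summarize(n_papers = n_distinct(paper_id))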
A better measure that incorporates document length and word distinctiveness is tf-idf, which stands for term frequency–inverse document frequency. I’m going to break down each of these two parts of tf-idf. Apologies in advance: there’s a bit of math ahead.
Term frequency is measured separately for each word in each document. It is simply the proportion of the words in the document that are the word in question. The term frequency for word \(w\) in document \(d\) is given by the equation \[\text{tf}(w, d) = \frac{\text{number of times $w$ appears in $d$}}{\text{total words in $d$}}.\] For example, the term frequency for “law” in Federalist #22 is 10/1247 (0.008), while the term frequency for “law” in Federalist #56 is 8/479 (0.0167).
Document frequency is measured for each word across documents. It is the proportion of documents that contain the word in question. The document frequency for word \(w\) is given by the equation \[\text{df}(w) = \frac{\text{number of documents that contain $w$ at least once}}{\text{number of documents}}.\] For example, the document frequency of “govern” is 85/85 (1.0), while the document frequency of “stamp” is 5/85 (0.0588).
Inverse document frequency is — take a deep breath — the natural logarithm of the multiplicative inverse of the document frequency: \[\begin{align*} \text{idf}(w) &= \log \frac{1}{\text{df}(w)} \\ &= \log (\text{number of documents}) - \log (\text{number of documents that contain $w$}). \end{align*}\] Don’t worry, you don’t need to remember this exact formula, let alone why it’s defined in this weird way. The important thing is that the greater the percentage of documents that contain \(w\), the lower that idf(\(w\)) will be. In the extreme, if every document contains \(w\), then idf(\(w\)) will equal 0.
In case you’re curious: The “inverse” part, where we take 1/df, is just so that idf will get bigger as the percentage of documents containing the word is smaller. The harder part to grok is why we use a logarithm. This essentially means that there are diminishing returns: the difference between being in 1% of documents versus 2% of documents is “greater”, from the perspective of calculating tf-idf, than the difference between being in 50% versus 51%.
Finally, tf-idf is the product of term frequency and inverse document frequency. For word \(w\) in document \(d\), \[\text{tf-idf}(w, d) = \text{tf}(w, d) \times \text{idf}(w).\] Again, you don’t need to worry about the mathematical specifics here. What’s important is to understand tf-idf as a measure of word distinctiveness in a particular document. If tf-idf for word \(w\) in document \(d\) is high, that means a combination of (1) \(w\) appears a lot in \(d\), relative to other words, and (2) not very many other documents contain \(w\) at all.
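To make these formulas concrete, here is a rough sketch of my own (assuming df_fed_tf as constructed above, and introducing the helper name n_docs) of how you could compute tf, idf, and tf-idf by hand with dplyr. In practice, the bind_tf_idf() function introduced next does this for us.

# Total number of documents (papers)
n_docs <- n_distinct(df_fed_tf$paper_id)

df_fed_tf |>
  # Term frequency: share of each paper's words accounted for by each stem
  group_by(paper_id) |>
  mutate(tf = number / sum(number)) |>
  # Document frequency: share of papers containing the stem at least once
  group_by(word_stem) |>
  mutate(df = n_distinct(paper_id) / n_docs,
         idf = log(1 / df),
         tf_idf = tf * idf) |>
  ungroup()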
To calculate tf-idf, we can plug our data frame of word counts by document into the bind_tf_idf() function. We need to supply three arguments to bind_tf_idf():
term: What’s the name of the column containing the words/tokens we want to calculate for?
document: What’s the name of the column containing the ID of each different document?
n: What’s the name of the column containing the word/token counts by document?
In this case, our words are in word_stem, our document IDs are in paper_id, and our word counts are in number, so we run:
df_fed_tf <- df_fed_tf |>
  bind_tf_idf(term = word_stem, document = paper_id, n = number)

df_fed_tf
# A tibble: 37,834 × 6
paper_id word_stem number tf idf tf_idf
<dbl> <chr> <int> <dbl> <dbl> <dbl>
1 1 1 2 0.00364 0.832 0.00302
2 1 absurd 1 0.00182 1.73 0.00315
3 1 accid 1 0.00182 3.75 0.00682
4 1 acknowledg 1 0.00182 1.50 0.00272
5 1 act 1 0.00182 0.400 0.000727
6 1 actuat 1 0.00182 2.14 0.00389
7 1 add 1 0.00182 1.31 0.00238
8 1 addit 1 0.00182 0.729 0.00133
9 1 address 1 0.00182 1.61 0.00293
10 1 admit 1 0.00182 0.382 0.000695
# ℹ 37,824 more rows
We’ve now got the term frequencies, inverse document frequencies, and tf-idf for each word-document pairing in our data. Let’s see which word is the most distinctive in each document, by pulling out the word with the highest tf-idf in each document. We can use the slice_max() function to do this, after grouping by document IDs.
df_fed_tf |>
  group_by(paper_id) |>
  slice_max(tf_idf) |>
  arrange(desc(tf_idf))
# A tibble: 87 × 6
# Groups: paper_id [85]
paper_id word_stem number tf idf tf_idf
<dbl> <chr> <int> <dbl> <dbl> <dbl>
1 83 juri 66 0.0342 2.36 0.0808
2 82 court 38 0.0710 1.08 0.0764
3 54 slave 14 0.0222 3.06 0.0678
4 67 vacanc 12 0.0198 3.06 0.0605
5 29 militia 26 0.0330 1.80 0.0595
6 18 macedon 11 0.0131 3.75 0.0492
7 69 governor 25 0.0240 2.04 0.0491
8 19 emperor 13 0.0156 3.06 0.0478
9 81 court 54 0.0410 1.08 0.0441
10 20 provinc 16 0.0258 1.67 0.0431
# ℹ 77 more rows
The results we get here line up with some basic intuitions. Federalist #82 and #83 are about the judiciary, so it makes sense that “juri” and “court” are particularly distinctive in these essays. Federalist #54 is about the apportionment of House of Representatives seats across states. The major controversy for this issue was whether and how to count enslaved persons towards a state’s population for apportionment purposes, hence the distinctiveness of “slave” for this essay. Federalist #67 is about how the president will fill vacancies, and Federalist #29 is about the maintenance of a militia. At least at a glance, it looks like tf-idf is doing a reasonable job at picking up the distinctive terms from each document.
The last thing we need to do is transform this into a data frame where each row is an essay, each column is a word, and the entries are the tf-idf of each word in each essay. In this way, the tf-idf of each word will serve as the features we use for classification as we try to identify the authorship of the 11 disputed papers. To transform the data this way, we can simply use pivot_wider().
df_fed_features <- df_fed_tf |>
  select(paper_id, word_stem, tf_idf) |>
  pivot_wider(names_from = word_stem,
              values_from = tf_idf,
              values_fill = 0) |>
  select(-paper_id) |>
  select(where(~ any(.x > 0)))

df_fed_features
# A tibble: 85 × 4,796
`1` absurd accid acknowledg act actuat add addit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.00302 0.00315 0.00682 0.00272 0.000727 0.00389 0.00238 0.00133
2 0 0 0 0 0 0 0 0
3 0 0 0 0.00614 0.000819 0.00439 0.00268 0.00149
4 0 0 0 0 0.000739 0 0 0
5 0 0 0 0 0.000848 0 0 0
6 0.00191 0 0 0 0 0 0 0
7 0.00205 0 0 0 0.000493 0 0 0.00180
8 0.00217 0 0 0 0.000522 0 0 0
9 0.00233 0 0 0 0 0 0.00183 0
10 0 0 0 0 0.00156 0.00417 0.00127 0
# ℹ 75 more rows
# ℹ 4,788 more variables: address <dbl>, admit <dbl>, adopt <dbl>,
# advantag <dbl>, adversari <dbl>, advoc <dbl>, affect <dbl>,
# afford <dbl>, aggrand <dbl>, aim <dbl>, allow <dbl>, altern <dbl>,
# ambigu <dbl>, ambit <dbl>, ambiti <dbl>, america <dbl>, amus <dbl>,
# analogi <dbl>, angri <dbl>, animos <dbl>, answer <dbl>,
# antagonist <dbl>, appear <dbl>, apt <dbl>, ardent <dbl>, …
A couple of brief notes on the code above:
The values_fill argument in pivot_wider() ensures that the feature value is 0, rather than NA, when a particular word doesn’t appear in a particular document.
The last line of the pipe, select(where(~ any(.x > 0))), only selects those columns where at least one entry is nonzero. This means it leaves out the small set of terms like “govern” that appear in every single document, as these have an idf (and thus tf-idf) of 0.
7.3 Classifying documents
Our goal here is to get a (provisional) answer about who authored the disputed documents. To do that, we’re going to use statistical methods to recognize patterns in the documents whose authorship is known, then use that pattern-matching to sort the unknown documents into “likely Hamilton” versus “likely Madison”.
As a first step, we’re going to add authorship info to our data frame of word features. We will also split up the data frame into the “known” cases (those where exactly one of Hamilton or Madison is known to be the author) and the “unknown” cases (those with “Hamilton or Madison” as the listed author).
df_fed_features <- df_fed_features |>
  mutate(
    paper_author = df_fed_papers$author,
    is_hamilton = if_else(paper_author == "hamilton", 1, 0)
  ) |>
  select(paper_author, is_hamilton, everything())

df_fed_known <- df_fed_features |>
  filter(paper_author == "hamilton" | paper_author == "madison")

df_fed_known
# A tibble: 66 × 4,798
paper_author is_hamilton `1` absurd accid acknowledg act
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 hamilton 1 0.00302 0.00315 0.00682 0.00272 0.000727
2 hamilton 1 0.00191 0 0 0 0
3 hamilton 1 0.00205 0 0 0 0.000493
4 hamilton 1 0.00217 0 0 0 0.000522
5 hamilton 1 0.00233 0 0 0 0
6 madison 0 0 0 0 0 0.00156
7 hamilton 1 0.00186 0 0 0.00167 0
8 hamilton 1 0.00206 0 0 0.00186 0
9 hamilton 1 0 0 0 0 0
10 madison 0 0 0 0 0 0.00107
# ℹ 56 more rows
# ℹ 4,791 more variables: actuat <dbl>, add <dbl>, addit <dbl>,
# address <dbl>, admit <dbl>, adopt <dbl>, advantag <dbl>,
# adversari <dbl>, advoc <dbl>, affect <dbl>, afford <dbl>,
# aggrand <dbl>, aim <dbl>, allow <dbl>, altern <dbl>, ambigu <dbl>,
# ambit <dbl>, ambiti <dbl>, america <dbl>, amus <dbl>,
# analogi <dbl>, angri <dbl>, animos <dbl>, answer <dbl>, …
df_fed_unknown <- df_fed_features |>
  filter(paper_author == "hamilton or madison")

df_fed_unknown
# A tibble: 11 × 4,798
paper_author is_hamilton `1` absurd accid acknowledg act
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 hamilton or madi… 0 0 0 0 0 0
2 hamilton or madi… 0 0 0 0 0.00393 0
3 hamilton or madi… 0 0 0 0 0 0
4 hamilton or madi… 0 0 0 0 0 6.74e-4
5 hamilton or madi… 0 0 0 0 0 1.08e-3
6 hamilton or madi… 0 0 0 0 0 6.33e-4
7 hamilton or madi… 0 0 0 0 0 0
8 hamilton or madi… 0 0.00347 0 0 0 8.34e-4
9 hamilton or madi… 0 0 0 0 0 0
10 hamilton or madi… 0 0 0 0 0.00184 4.91e-4
11 hamilton or madi… 0 0 0 0 0.00142 1.14e-3
# ℹ 4,791 more variables: actuat <dbl>, add <dbl>, addit <dbl>,
# address <dbl>, admit <dbl>, adopt <dbl>, advantag <dbl>,
# adversari <dbl>, advoc <dbl>, affect <dbl>, afford <dbl>,
# aggrand <dbl>, aim <dbl>, allow <dbl>, altern <dbl>, ambigu <dbl>,
# ambit <dbl>, ambiti <dbl>, america <dbl>, amus <dbl>,
# analogi <dbl>, angri <dbl>, animos <dbl>, answer <dbl>,
# antagonist <dbl>, appear <dbl>, apt <dbl>, ardent <dbl>, …
7.3.1 Feature selection
Say we wanted to use a linear regression to predict the authorship of each document. We immediately run into a problem: we have 4,796 features and only 66 observations. A linear regression won’t even work when there are more features than observations. To even run a regression model in the first place, we need to do some feature selection, reducing the huge set of possible features down to the ones we want to put in our model.
As a very rough first cut at feature selection, let’s look for the words whose tf-idf is most strongly correlated with paper authorship. (The correlation coefficient isn’t really meant for use with a binary variable, but this is just a rough cut anyway.) We can do that in a single (admittedly complex!) data pipeline.
df_fed_known |>
  # Remove the non-numeric paper author column
  select(-paper_author) |>
  # Remove words that never appear in the "known" papers
  select(where(~ any(.x > 0))) |>
  # Put data into "long" format and group by word
  pivot_longer(cols = -is_hamilton, names_to = "word", values_to = "tf_idf") |>
  group_by(word) |>
  # Calculate correlation between authorship and tf-idf for each word
  summarize(cor = cor(is_hamilton, tf_idf)) |>
  # Show the strongest correlations (positive or negative) first
  arrange(desc(abs(cor)))
# A tibble: 4,469 × 2
word cor
<chr> <dbl>
1 whilst -0.609
2 form -0.563
3 li -0.484
4 mix -0.466
5 stamp -0.466
6 democrat -0.456
7 dishonor -0.456
8 congress -0.431
9 compass -0.428
10 lesser -0.425
# ℹ 4,459 more rows
The word stems with the strongest correlations are “whilst”, “form”, “li”, “mix”, and “stamp”. Each of these has a negative correlation, meaning a higher frequency of usage is typical of a paper written by Madison rather than Hamilton.
Let’s build a linear regression model using the tf-idf of the ten most strongly correlated words to predict paper authorship.
fit_authorship <- lm(
  is_hamilton ~ whilst + form + li + mix + stamp +
    democrat + dishonor + congress + compass + lesser,
  data = df_fed_known
)
summary(fit_authorship)
Call:
lm(formula = is_hamilton ~ whilst + form + li + mix + stamp +
democrat + dishonor + congress + compass + lesser, data = df_fed_known)
Residuals:
Min 1Q Median 3Q Max
-0.91397 0.00138 0.00855 0.02152 0.31624
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.99862 0.03651 27.350 < 2e-16 ***
whilst -223.72658 29.03297 -7.706 2.63e-10 ***
form -19.22100 47.48407 -0.405 0.68720
li -44.43080 38.39546 -1.157 0.25220
mix -264.97776 48.79483 -5.430 1.31e-06 ***
stamp -71.60991 47.86305 -1.496 0.14033
democrat -44.34696 48.38707 -0.917 0.36340
dishonor -118.67910 52.12201 -2.277 0.02670 *
congress 2.06805 11.95822 0.173 0.86333
compass -28.40312 27.62083 -1.028 0.30830
lesser -109.88549 36.40317 -3.019 0.00384 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1717 on 55 degrees of freedom
Multiple R-squared: 0.8601, Adjusted R-squared: 0.8346
F-statistic: 33.8 on 10 and 55 DF, p-value: < 2.2e-16
How accurately does our model predict the authorship of the known papers? To answer this question, we can build a confusion matrix. A confusion matrix is a table where we put the actual responses in the rows, and the predicted responses in the columns. Ideally, we would like the prediction to match the truth as often as possible, resulting in a confusion matrix with most observations along the “diagonal”.
To make a confusion matrix, first we need to extract predictions via augment(), then match them up with the true responses. The linear regression doesn’t directly make predictions, but we’ll treat a fitted value above 0.5 (50%) as a Hamilton prediction and a fitted value below 0.5 as a Madison prediction.
df_predictions <-
  augment(fit_authorship, newdata = df_fed_known) |>
  mutate(prediction = if_else(.fitted >= 0.5, "hamilton", "madison"))

table(
  actual = df_predictions$paper_author,
  predicted = df_predictions$prediction
)
predicted
actual hamilton madison
hamilton 51 0
madison 1 14
Incredible! (Or so it seems.) Our model correctly predicts the author of all but one of the known papers. Its accuracy, the proportion of observations for which it gives a correct classification, is 65/66 (0.985). So we should put a lot of faith in how it classifies the unknown papers — right?
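Rather than tallying the confusion matrix by hand, you could also compute accuracy directly from df_predictions (a small sketch of my own):

# Proportion of known papers where the prediction matches the true author
mean(df_predictions$prediction == df_predictions$paper_author)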
To calculate predictions for the unknown papers, we’d again use augment():
augment(fit_authorship, newdata = df_fed_unknown) |>
mutate(prediction = if_else(.fitted >= 0.5, "hamilton", "madison")) |>
count(prediction)
# A tibble: 2 × 2
prediction n
<chr> <int>
1 hamilton 3
2 madison 8
So with 98% accuracy, we can say Madison wrote 8 of the disputed papers and Hamilton wrote the other 3. Case closed?
7.3.2 Overfitting and cross-validation
Not so fast. Here’s what we did so far:
We used the known papers to estimate a regression model to predict authorship.
We looked at how well that model predicts authorship of the known papers — the same ones we used to estimate the model — in order to gauge its accuracy.
We then used the model to predict the authorship of the unknown papers.
It’s far too optimistic to expect 98% accuracy in that last step. The reason is that a statistical model will generally do a better job of “prediction” with data that it’s seen — data used to train the model in the first place — than on brand new data not used to fit the model.
Data scientists and statisticians call this phenomenon overfitting. Statistical models can’t fully separate “signal” (patterns in the training data that generalize more broadly) from “noise” (patterns in the training data that are idiosyncratic, rather than general population features). Overfitting is especially problematic when there are many features and few observations, exactly the situation we are in here.
Thanks to some combination of Aaron Burr and the ordinary passage of time, Hamilton and Madison aren’t around to write any more Federalist Papers, so we can’t collect new data to gauge the out-of-sample predictive accuracy of our model. Luckily for us, even without new data, we can get a good idea of our model’s out-of-sample accuracy through a process called cross-validation.
Randomly assign each observation to a group, or “fold” in data science lingo. The number of folds is usually 5 or 10, or sometimes even the same as the total number of observations in the data.
For each \(k\) from 1 to \(K\), where \(K\) is the number of folds:
Estimate the statistical model, using all of the data except the observations in the \(k\)’th fold.
Use that model to predict the outcome for the observations in the \(k\)’th fold. These are “unseen data” from the perspective of that model, as these observations were not used in its estimation.
Estimate accuracy by creating a confusion matrix of the true outcomes compared to the cross-validation predictions.
Not so coincidentally, cross-validation requires working with our new(ish) friends from Chapter 6, random number generators and for loops! Let’s use cross-validation to get a better guess at the out-of-sample predictive accuracy of our linear regression model with the top 10 strongest correlated words as features.
If we wanted to be entirely rigorous about cross-validation, we would repeat our feature selection process within the cross-validation loop, recalculating the correlations between authorship and tf-idf each time while excluding the data from the \(k\)’th fold. I won’t do that here, but the streamlined method discussed in Section 7.3.3 incorporates this automatically.
# Set seed for replicability
set.seed(1789)
# Randomly assign observations to folds
# Using 11 folds because it divides evenly
fold_assignment <- rep(1:11, each = 6)
fold_assignment <- sample(fold_assignment)
df_fed_known$fold <- fold_assignment

# Set up storage for loop results
outcome_actual <- NULL
outcome_cv_pred <- NULL

# Loop over number of folds
for (k in 1:11) {
  # Split data
  df_fit <- filter(df_fed_known, fold != k)
  df_pred <- filter(df_fed_known, fold == k)

  # Fit regression model only using df_fit
  fit_cv <- lm(
    is_hamilton ~ whilst + form + li + mix + stamp +
      democrat + dishonor + congress + compass + lesser,
    data = df_fit
  )

  # Calculate predictions for excluded data
  df_pred <- augment(fit_cv, newdata = df_pred) |>
    mutate(prediction = if_else(.fitted >= 0.5, "hamilton", "madison"))

  # Store results
  outcome_actual <- c(outcome_actual, df_pred$paper_author)
  outcome_cv_pred <- c(outcome_cv_pred, df_pred$prediction)
}
# Create confusion matrix with cross-validation predictions
table(outcome_actual, outcome_cv_pred)
outcome_cv_pred
outcome_actual hamilton madison
hamilton 51 0
madison 4 11
When looking at out-of-sample predictions via cross-validation, it’s still the case that every predicted Madison paper really is a Madison paper. However, among the 55 Hamilton predictions, 4 are actually Madison papers. The cross-validation estimate of accuracy is now 62/66 (0.939). That’s still very good, but not as ludicrously good as it looked when we used in-sample fit to gauge accuracy.
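As with the in-sample case, you could compute this cross-validation accuracy directly from the stored vectors (a small sketch of my own):

# Proportion of papers whose out-of-fold prediction matches the true author
mean(outcome_cv_pred == outcome_actual)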
7.3.3 A streamlined approach: The elastic net
Statisticians have developed modified regression models that automatically handle feature selection and overfitting prevention. We won’t go deeply into the underlying mathematical details; we’ll just focus on how to implement this type of model in R. If you want to know more, I strongly recommend the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction, which is available for free on the authors’ website.
The model we will work with is called the elastic net, which is implemented in the R package glmnet (and which we’ll access through the easier-to-use train() function from the caret package). Remember that the linear regression formula for a response \(y\) and features \(x_1, \ldots, x_M\) is \[\begin{align*}
y &\approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M \\
&= \beta_0 + \sum_{m=1}^M \beta_m x_m,
\end{align*}\] where \(\beta_0\) is the intercept and each \(\beta_m\) is the coefficient on the \(m\)’th feature. Without getting into the mathematical details, the elastic net has two important differences from ordinary regression:
Some of the coefficients — and in fact, usually quite a few of them — are estimated to be exactly zero. In this sense, the elastic net automatically performs “feature selection” for us, without us having to reduce the feature space in advance.
The remaining non-zero coefficients are typically estimated to be closer to zero than in an ordinary regression. This “shrinkage” property helps prevent overfitting, making the elastic net better for out-of-sample prediction.
Ordinary least squares, implemented by the lm() function in R, chooses the coefficients to minimize the sum of squared residuals, \[\min_{\beta_0, \ldots, \beta_M} \sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_M x_{i, M})^2.\] The elastic net chooses the coefficients to minimize a penalized version of the sum of squared residuals, \[\min_{\beta_0, \ldots, \beta_M} \sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_M x_{i,M})^2 + \frac{\lambda (1 - \alpha)}{2} \sum_{m=1}^M \beta_m^2 + \lambda \alpha \sum_{m=1}^M |\beta_m|.\]
The penalty terms added at the end keep the coefficients (other than the intercept, which is excluded from the penalty) from being too large. The values \(\alpha \in [0, 1]\) and \(\lambda \geq 0\) are “tuning parameters” typically selected via cross-validation, to find the values that perform best for out-of-sample prediction. If \(\lambda = 0\), then the elastic net is equivalent to ordinary least squares regression. Larger values of \(\lambda\) represent more “shrinkage” of the coefficients toward 0.
We will use the train() function to fit an elastic net. It works similarly to lm(), except with a lot more options. I’ll show you the code first and then explain it.
# Remove columns we don't want to use as features
df_fed_known <- df_fed_known |>
  select(-is_hamilton, -fold)

set.seed(42)
fit_enet <- train(
  paper_author ~ .,
  data = df_fed_known,
  method = "glmnet",
  family = "binomial",
  trControl = trainControl(
    method = "cv",
    number = 11
  ),
  metric = "Accuracy"
)
Here’s everything that’s going on in this function:
paper_author ~ . means “use paper_author as the response and everything else in the data frame as a feature”. This syntax is useful when dealing with data frames with many features, to avoid having to type them all out. One thing to notice here is that we’re using the categorical paper_author instead of the 0/1 is_hamilton as our response. Another is that we took the is_hamilton and fold columns out of the data frame before putting it into train() so that these wouldn’t be treated as features. (It would be pretty easy to maximize our in-sample fit with is_hamilton as a feature, but that would be useless for out-of-sample prediction!)
method = "glmnet" tells train() to use the elastic net. The train() function is a unified interface for fitting tons of different machine learning models; see the giant list of available models at https://topepo.github.io/caret/available-models.html.
family = "binomial" tells the elastic net to perform classification with a categorical response. This results in a slight change to the regression formula under the hood, though we won’t worry about that here. (Optional: the altered regression formula is now \[\Pr(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_M x_M)}}.\] Unlike a linear regression, this formula ensures that every prediction is between 0% and 100%.)
trControl = trainControl(method = "cv", number = 11) instructs train() to use 11-fold cross-validation to select the tuning parameters, \(\alpha\) and \(\lambda\), which govern how aggressively the elastic net zeroes out features and reduces the magnitudes of the remaining features. train() automatically performs the cross-validation process for us. Since the folds are assigned randomly, we set a random seed before running train() to ensure replicability of our results.
metric = "Accuracy" instructs train() to select the tuning parameters that result in the greatest out-of-fold prediction accuracy. There are other statistics like "Kappa" and "logLoss" that might be preferable measures of out-of-sample fit, but those are beyond the scope of PSCI 2300.
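If you want control over exactly which values of \(\alpha\) and \(\lambda\) get tested, you can pass train() an explicit grid via its tuneGrid argument. Here is a brief sketch of my own; the particular parameter values (and the names my_grid and fit_enet_grid) are arbitrary choices for illustration, not something we use below.

# Grid of tuning parameter combinations to evaluate by cross-validation
my_grid <- expand.grid(
  alpha = c(0.1, 0.55, 1),      # mix between the two penalty terms
  lambda = c(0.01, 0.05, 0.15)  # overall amount of shrinkage
)

set.seed(42)
fit_enet_grid <- train(
  paper_author ~ .,
  data = df_fed_known,
  method = "glmnet",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 11),
  tuneGrid = my_grid,
  metric = "Accuracy"
)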
When you print out the result of a model fit with train(), it shows you the values of the tuning parameters that it tested, as well as the estimated out-of-fold accuracy of each one.
fit_enet
glmnet
66 samples
4796 predictors
2 classes: 'hamilton', 'madison'
No pre-processing
Resampling: Cross-Validated (11 fold)
Summary of sample sizes: 61, 59, 61, 60, 59, 60, ...
Resampling results across tuning parameters:
alpha lambda Accuracy Kappa
0.10 0.01614625 0.8047619 0.1443850
0.10 0.05105894 0.8047619 0.1443850
0.10 0.16146255 0.8047619 0.1443850
0.55 0.01614625 0.8683983 0.4171123
0.55 0.05105894 0.8683983 0.4171123
0.55 0.16146255 0.8402597 0.3246753
1.00 0.01614625 0.8554113 0.3636364
1.00 0.05105894 0.8554113 0.3636364
1.00 0.16146255 0.8251082 0.2337662
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.55 and lambda
= 0.05105894.
There’s a lot of output here, and it’s different than a linear regression. The important part to extract is that out of the various elastic net specifications tested by the train() function, the highest cross-validation accuracy obtained was about 87%. This might be a bit easier to see by looking at the confusion matrix — which is easy to obtain when we’re working with a model fit by train().
confusionMatrix(fit_enet)
Cross-Validated (11 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction hamilton madison
hamilton 77.3 13.6
madison 0.0 9.1
Accuracy (average) : 0.8636
We can use the varImp() function to extract which features are most important, in terms of having the largest magnitude of coefficients in the elastic net model.
varImp(fit_enet)
glmnet variable importance
only 20 most important variables shown (out of 4796)
Overall
whilst 100.00
form 88.58
mix 69.68
stamp 52.16
dishonor 49.60
li 44.99
lesser 29.10
democrat 24.81
alloi 24.08
sphere 24.04
sure 19.88
assum 17.67
relief 15.39
overgrown 14.49
defin 13.42
indiscreet 13.38
`function` 11.91
develop 10.91
ninth 10.34
shoot 10.26
Here we see some overlap with the list of top-10 features that we extracted manually (with “whilst” still at the top), but also some differences.
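Relatedly, if you are curious how aggressively the elastic net zeroed out features, one way (a sketch of my own, assuming caret stores the final glmnet fit in fit_enet$finalModel and the selected tuning values in fit_enet$bestTune, as it usually does) is to count the nonzero coefficients at the chosen value of \(\lambda\):

# Coefficients of the final glmnet model at the selected lambda
enet_coefs <- coef(fit_enet$finalModel, s = fit_enet$bestTune$lambda)

# How many are estimated to be nonzero (including the intercept)?
sum(enet_coefs != 0)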
You may be concerned that we did all this work to come up with a model whose out-of-sample predictive power (at least as measured by cross-validation) is worse than that of our linear regression from before. But keep in mind that we gave the elastic net a much harder job. With the linear regression, we pre-selected just 10 features. With the elastic net, we asked it to sort through all 4800ish features to identify the important ones!
Finally, to make out-of-sample predictions with the elastic net, we can use the predict() function.
pred_enet <- predict(fit_enet, newdata = df_fed_unknown)

pred_enet
[1] hamilton hamilton madison hamilton hamilton madison hamilton
[8] hamilton madison hamilton madison
Levels: hamilton madison
table(pred_enet)
pred_enet
hamilton madison
7 4
We see a slight difference here from the model we trained in Section 7.3.1. That model said Madison wrote 8 of the 11 disputed papers, whereas our elastic net says he wrote 7.
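If you want to see exactly where the two models disagree, here is a closing sketch of my own (reusing the objects created above, and assuming the rows of df_fed_unknown appear in the same order as the disputed papers in df_fed_papers, as they do in our pipeline) that lines up their predictions paper by paper:

# Side-by-side predictions for the 11 disputed papers
df_compare <- tibble(
  paper_id = df_fed_papers$paper_id[df_fed_papers$author == "hamilton or madison"],
  lm_pred = if_else(
    augment(fit_authorship, newdata = df_fed_unknown)$.fitted >= 0.5,
    "hamilton", "madison"
  ),
  enet_pred = as.character(pred_enet)
)

df_compare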