This project examines a collection of Charles Dickens's public speeches using word-frequency text analysis to explore patterns in his language at scale. Rather than closely reading a single speech, the project uses computational techniques in R to examine Dickens's rhetorical tendencies across the full collection, asking which words appear most frequently once common stop words and punctuation are removed. By combining literary inquiry with data-driven methods, the project shows how digital tools can reveal thematic emphases and stylistic patterns that may be difficult to detect through traditional close reading alone.
The primary source for this project is a collection of Charles Dickens's speeches, accessed as a plain-text file from Project Gutenberg. Beyond his novels, Dickens was a prominent public speaker, often addressing broader social themes than those of his fiction. Project Gutenberg texts include licensing information, headers, and footers that are not part of the original literary content, so the first step was to remove this material with R code. The remaining text was then consolidated into a single document and tokenized into individual words. Additional cleaning steps included converting all text to lowercase, removing punctuation and numbers, and filtering out very common English stop words. These steps ensured that the dataset reflected meaningful content words rather than flooding the analysis with grammatically necessary but uninformative language.
The analysis was conducted in R using packages from the tidyverse and tidytext, which are well suited to transparent text analysis. Tokenization was used to break the text into individual words, and word frequencies were calculated to identify the most commonly used terms across the speeches. The workflow is shown in the following code chunk:
# Load packages for data manipulation and text tokenization
library(tidyverse)
library(tidytext)

# Read the raw Project Gutenberg file (file name is illustrative)
lines <- readLines("dickens_speeches.txt")

# DATA CLEANING
# Find the start and end markers of the Gutenberg boilerplate
start_idx <- grep("\\*\\*\\*START OF THE PROJECT GUTENBERG EBOOK", lines)
end_idx <- grep("\\*\\*\\*END OF THE PROJECT GUTENBERG EBOOK", lines)

# If markers are found, keep only the speeches between them
if (length(start_idx) > 0 && length(end_idx) > 0) {
  text_lines <- lines[(start_idx[1] + 1):(end_idx[1] - 1)]
} else {
  text_lines <- lines
}

# Convert to one big string
text <- paste(text_lines, collapse = " ")

# Turn into a one-row tibble
text_df <- tibble(doc = "Dickens_Speeches", text = text)

# Tokenize and keep just meaningful words
words_clean <- text_df %>%
  unnest_tokens(word, text) %>%                 # lowercase words, punctuation stripped
  filter(str_detect(word, "^[a-z']+$")) %>%     # drop numbers and stray symbols
  anti_join(stop_words, by = "word") %>%        # remove common English stop words
  filter(nchar(word) > 1)                       # drop single letters
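The cleaned tokens then feed a simple frequency count. A minimal sketch of that step, shown here on a toy stand-in for the `words_clean` tibble so it runs on its own:

```r
library(dplyr)

# Toy stand-in for `words_clean`: one cleaned word per row
words_clean <- tibble(word = c("people", "country", "people",
                               "children", "people", "country"))

# Count each word and sort by frequency, most common first
word_counts <- words_clean %>%
  count(word, sort = TRUE)

# "people" tops the toy table with a count of 3
word_counts
```

In the project itself, `count()` runs on the full cleaned token table, and the top rows become the input to the bar chart.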
While word frequencies provide little context about how individual words are used, this kind of analysis is effective for identifying dominant vocabulary and recurring themes across the speeches.
The results are presented as a clean, single-page HTML document generated with R Markdown. I chose a horizontal bar chart to display the most frequent words because it allows easy comparison across terms and accommodates longer word labels better than a vertical bar chart. Visual design choices, such as a color gradient and direct labeling of counts, were made to prioritize clarity and readability over decoration and to emphasize interpretability.
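A rough sketch of the chart described here, using ggplot2 with illustrative word counts (the column names `word` and `n`, and the counts themselves, are placeholders, not results from the actual analysis):

```r
library(ggplot2)
library(dplyr)

# Illustrative counts standing in for the real frequency table
word_counts <- tibble(word = c("people", "country", "children"),
                      n = c(48, 35, 29))

p <- word_counts %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = n, y = reorder(word, n), fill = n)) +
  geom_col(show.legend = FALSE) +                         # horizontal bars
  geom_text(aes(label = n), hjust = -0.2, size = 3) +     # direct count labels
  scale_fill_gradient(low = "grey70", high = "steelblue") + # color gradient
  labs(x = "Frequency", y = NULL,
       title = "Most frequent words in Dickens's speeches")
```

Mapping the words to the y-axis with `reorder(word, n)` keeps the longest bar on top and leaves room for long labels, which is the main reason for preferring the horizontal orientation.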
Looking at the bar plot, we can see that Dickens spoke as a public moral voice, concerned less with individual psychology than with institutions, social responsibility, and the ethical life of society in Victorian England.
This project illustrates how digital methods can complement traditional humanities scholarship by enabling scholars to analyze texts at a scale that would otherwise be impractical. By quantifying word usage in Dickens’s speeches, the analysis highlights recurring thematic elements. More broadly, the project exemplifies the goals of Digital Arts & Humanities by blending humanistic questions with computational tools, showing how code can function not as a replacement for interpretation, but as a way of expanding the scope and depth of literary inquiry.