Text Mining With R ⚡ Updated
data(stop_words) cleaned_austen <- tidy_austen %>% anti_join(stop_words, by = "word") Count most common words:
word_counts %>% filter(n > 500) %>% ggplot(aes(x = reorder(word, n), y = n)) + geom_col(fill = "steelblue") + coord_flip() + labs(title = "Most Frequent Words in Jane Austen's Novels", x = "Word", y = "Count") + theme_minimal() Sentiment lexicons (e.g., AFINN , bing , nrc ) assign emotional valence to words.
1. Introduction In the age of big data, most information exists as unstructured text —emails, social media posts, reviews, news articles, and research papers. Unlike numerical data, text cannot be directly fed into a statistical model. Text mining (or text analytics) is the process of transforming this free-form text into structured, quantifiable data for analysis, pattern discovery, and prediction. Text Mining With R
graph LR A[Raw Text] --> B[Preprocessing] --> C[Tokenization] --> D[Stop Word Removal] --> E[Analysis] --> F[Visualization] library(tidyverse) library(tidytext) library(janeaustenr) Load sample text (Jane Austen's books) austen_books <- austen_books() head(austen_books) 3.2. Preprocessing & Tokenization Tokenization splits text into meaningful units (words, sentences, n-grams). tidytext uses unnest_tokens() .
with a bar chart:
is an exceptional language for text mining. With a rich ecosystem of packages—most notably the tidytext , quanteda , and tm frameworks—R allows analysts to clean, tokenize, analyze sentiment, model topics, and visualize textual patterns efficiently.
This write-up outlines a reproducible workflow for text mining using R, emphasizing tidy data principles. | Package | Purpose | | :--- | :--- | | tidytext | Converts text to tidy data frames (one token per row). Integrates with dplyr , ggplot2 . | | dplyr | Data manipulation (filter, group, mutate). | | ggplot2 | Visualization of text metrics (word frequencies, sentiment scores). | | janeaustenr | Sample texts for practice. | | tidyverse | Meta-package for data science. | | wordcloud | Generates word clouds. | | quanteda | Advanced text analysis (DFM, keywords-in-context). | | tm | Classic text mining (corpus, term-document matrix). | Installation: install.packages(c("tidytext", "tidyverse", "wordcloud", "quanteda")) 3. The Text Mining Workflow A standard text mining pipeline in R consists of these steps: Unlike numerical data, text cannot be directly fed
# Using bing lexicon (positive/negative) bing_sent <- get_sentiments("bing") sentiment_scores <- cleaned_austen %>% inner_join(bing_sent, by = "word") %>% count(book = austen_books()$book, sentiment) %>% # approximate pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% mutate(net_sentiment = positive - negative)