## Lecture 9
This week we start with frequency analysis of specific unigrams in the text data.
Files to download: Hillary Clinton's tweets, Donald Trump's tweets
Let's start with a simple word cloud:
library(streamR)
tweets_HC <- parseTweets("tweets_LP_HC.json")
tweets_DT <- parseTweets("tweets_LP_DT.json")
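parseTweets() returns a data frame with one row per tweet, so before cleaning anything it is worth a quick sanity check (a small sketch; the exact set of columns can vary across streamR versions, but the text column used below is always there):
nrow(tweets_HC)            # number of Clinton tweets parsed from the JSON file
nrow(tweets_DT)            # number of Trump tweets
head(tweets_HC$text, 3)    # peek at the raw tweet text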
library(tm)
library(wordcloud)
library(Rstem)
library(stringr)
tweets_HC$text <- sapply(tweets_HC$text, function(row) iconv(row, "latin1", "ASCII", sub="")) # strip non-ASCII characters (emoji, etc.) that can trip up tm
TweetCorpus <- paste(unlist(tweets_HC$text), collapse =" ") # to get all of the tweets together
TweetCorpus <- Corpus(VectorSource(TweetCorpus)) # turn the combined text into a tm corpus so tm_map can be applied
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)
TweetCorpus <- tm_map(TweetCorpus, removePunctuation)
TweetCorpus <- tm_map(TweetCorpus, removeWords, stopwords('english'))
# TweetCorpus <- tm_map(TweetCorpus, stemDocument) # no stemming for now!
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower), lazy=TRUE)
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)
wordcloud(TweetCorpus, max.words = 100, random.order = FALSE)
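The word cloud is just a picture of unigram frequencies. If you want the counts themselves, one option (a sketch, assuming the cleaned TweetCorpus built above) is to build a term-document matrix and sort its row sums:
tdm <- TermDocumentMatrix(TweetCorpus)                    # rows = terms, columns = documents
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count of each unigram
head(freq, 20)                                            # the 20 most frequent unigrams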
Then let's try to check the level of positivity and negativity of each candidate's tweet corpus:
Link to Lexicon (Courtesy of Neal Caren and Pablo Barbera)
We use a simple function for finding the frequency of positive and negative words in each of the corpora (a sketch of such a function appears after the counting step below):
Importing the lexicon:
lexicon <- read.csv("lexicon.csv", stringsAsFactors=F)
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
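A quick look at what we just loaded (the lexicon.csv used here has at least a word and a polarity column, as the code above assumes):
table(lexicon$polarity)   # how many positive vs. negative entries
head(pos.words)           # sample of positive words
head(neg.words)           # sample of negative words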
Then we count the positive/negative words. Since %in% matches individual word tokens, we first split the tweet text into a vector of words:
words_HC <- unlist(str_split(tolower(tweets_HC$text), "\\s+"))  # tokenize the raw tweets
positive <- sum(words_HC %in% pos.words)
negative <- sum(words_HC %in% neg.words)
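A minimal sketch of the "simple function" idea mentioned above, wrapping the counting so it can be applied to each candidate's tweets; the helper name lexicon_score and the extra cleaning steps (lower-casing, stripping punctuation) are my own, not from the lecture:
lexicon_score <- function(text, pos.words, neg.words){
  # lower-case, strip punctuation, and split the raw tweet text into word tokens
  words <- unlist(str_split(tolower(str_replace_all(text, "[[:punct:]]", " ")), "\\s+"))
  words <- words[words != ""]
  c(positive = sum(words %in% pos.words),
    negative = sum(words %in% neg.words),
    total    = length(words))
}
lexicon_score(tweets_HC$text, pos.words, neg.words)
lexicon_score(tweets_DT$text, pos.words, neg.words)
Because the two corpora differ in size, comparing the shares (positive/total and negative/total) is more informative than comparing the raw counts.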
Now let's complicate the topic and simplify our lexicon, starting from this visualization.
Word clouds, frequency charts over time, and topics