## Lecture 9
This week we start with frequency analysis of specific unigrams in the text data.
Files to download: Hillary Clinton's tweets, Donald Trump's tweets
Let's start with a simple word cloud:
library(streamR)
tweets_HC <- parseTweets("tweets_LP_HC.json")
tweets_DT <- parseTweets("tweets_LP_DT.json")
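parseTweets() returns a data frame with one row per tweet, so before cleaning anything it is worth a quick sanity check (a small sketch; the exact set of columns can vary across streamR versions, but the text column used below is always there):
nrow(tweets_HC)            # number of Clinton tweets parsed from the JSON file
nrow(tweets_DT)            # number of Trump tweets
head(tweets_HC$text, 3)    # peek at the raw tweet text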
library(tm)
library(wordcloud)
library(Rstem)
library(stringr)
tweets_HC$text <- sapply(tweets_HC$text, function(row) iconv(row, "latin1", "ASCII", sub="")) # strip non-ASCII characters (emoji, etc.) that can trip up tm
TweetCorpus <- paste(unlist(tweets_HC$text), collapse =" ") # to get all of the tweets together
TweetCorpus <- Corpus(VectorSource(TweetCorpus)) # turn the combined text into a tm corpus so tm_map can be applied
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)
TweetCorpus <- tm_map(TweetCorpus, removePunctuation)
TweetCorpus <- tm_map(TweetCorpus, removeWords, stopwords('english'))
# TweetCorpus <- tm_map(TweetCorpus, stemDocument) # no stemming for now!
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower), lazy=TRUE)
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)
wordcloud(TweetCorpus, max.words = 100, random.order = FALSE)
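The word cloud is just a picture of unigram frequencies. If you want the counts themselves, one option (a sketch, assuming the cleaned TweetCorpus built above) is to build a term-document matrix and sort its row sums:
tdm <- TermDocumentMatrix(TweetCorpus)                    # rows = terms, columns = documents
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count of each unigram
head(freq, 20)                                            # the 20 most frequent unigrams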
Then let's try to check the level of positivity and negativity of each candidate's tweet corpus:
Link to Lexicon (Courtesy of Neal Caren and Pablo Barbera)
We use a simple function for finding the frequency of positive and negative words in each of the corpora (a sketch of such a function appears after the counting step below):
Importing the lexicon:
lexicon <- read.csv("lexicon.csv", stringsAsFactors=F)
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
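A quick look at what we just loaded (the lexicon.csv used here has at least a word and a polarity column, as the code above assumes):
table(lexicon$polarity)   # how many positive vs. negative entries
head(pos.words)           # sample of positive words
head(neg.words)           # sample of negative words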
Then we count the positive/negative words. Since %in% matches individual word tokens, we first split the tweet text into a vector of words:
words_HC <- unlist(str_split(tolower(tweets_HC$text), "\\s+"))  # tokenize the raw tweets
positive <- sum(words_HC %in% pos.words)
negative <- sum(words_HC %in% neg.words)
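A minimal sketch of the "simple function" idea mentioned above, wrapping the counting so it can be applied to each candidate's tweets; the helper name lexicon_score and the extra cleaning steps (lower-casing, stripping punctuation) are my own, not from the lecture:
lexicon_score <- function(text, pos.words, neg.words){
  # lower-case, strip punctuation, and split the raw tweet text into word tokens
  words <- unlist(str_split(tolower(str_replace_all(text, "[[:punct:]]", " ")), "\\s+"))
  words <- words[words != ""]
  c(positive = sum(words %in% pos.words),
    negative = sum(words %in% neg.words),
    total    = length(words))
}
lexicon_score(tweets_HC$text, pos.words, neg.words)
lexicon_score(tweets_DT$text, pos.words, neg.words)
Because the two corpora differ in size, comparing the shares (positive/total and negative/total) is more informative than comparing the raw counts.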
Now let's complicate the topic and simplify our lexicon, starting from this visualization.
Word clouds, frequency charts over time, and topics