
## Lecture 9

Text Analysis: Topic Detection and Visualization, Part I

This week we start with frequency analysis of specific unigrams in the text data.
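As a warm-up, here is a minimal sketch of what unigram frequency analysis looks like in R; the two example tweets are placeholders, not the course data:

```r
library(stringr)

# Placeholder tweets (not the course data)
tweets <- c("Make America great again!", "Stronger together, America.")

# Tokenize into lowercase unigrams, strip punctuation, and count frequencies
words <- unlist(str_split(tolower(tweets), "\\s+"))
words <- str_replace_all(words, "[[:punct:]]", "")
sort(table(words), decreasing = TRUE)
```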

Files to download: Hillary Clinton's tweets, Donald Trump's tweets

Let's start with a simple word cloud:

```r
library(streamR)

# Parse the tweets collected with streamR
tweets_HC <- parseTweets("tweets_LP_HC.json")
tweets_DT <- parseTweets("tweets_LP_DT.json")

library(tm)
library(wordcloud)
library(Rstem)
library(stringr)

# Remove non-ASCII characters that break the tm transformations
tweets_HC$text <- sapply(tweets_HC$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# Collapse all of the tweets into one document and build a corpus
TweetCorpus <- paste(unlist(tweets_HC$text), collapse = " ")
TweetCorpus <- Corpus(VectorSource(TweetCorpus))

# Clean the corpus: strip punctuation, drop stopwords, lowercase
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)
TweetCorpus <- tm_map(TweetCorpus, removePunctuation)
TweetCorpus <- tm_map(TweetCorpus, removeWords, stopwords('english'))
# TweetCorpus <- tm_map(TweetCorpus, stemDocument) # No stemming for now!
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower), lazy = TRUE)
TweetCorpus <- tm_map(TweetCorpus, PlainTextDocument)

# Plot the 100 most frequent words
wordcloud(TweetCorpus, max.words = 100, random.order = FALSE)
```
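The Trump tweets loaded above (`tweets_DT`) can go through the same pipeline. As a sketch that is not part of the original lecture code, the two corpora could also be drawn in a single `comparison.cloud` from the wordcloud package, reusing tm's cleaning helpers on the raw text:

```r
# Sketch: one column per candidate in a term-document matrix, then a comparison cloud
clean_text <- function(x) {
  x <- iconv(x, "latin1", "ASCII", sub = "")    # drop non-ASCII characters
  x <- tolower(x)
  x <- removePunctuation(x)
  x <- removeWords(x, stopwords("english"))
  paste(x, collapse = " ")                      # one long document per candidate
}

both <- Corpus(VectorSource(c(clean_text(tweets_HC$text),
                              clean_text(tweets_DT$text))))
tdm <- as.matrix(TermDocumentMatrix(both))
colnames(tdm) <- c("Clinton", "Trump")

comparison.cloud(tdm, max.words = 100, random.order = FALSE)
```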

Then let's check the level of positivity and negativity of each candidate's tweet corpus:

Link to Lexicon (Courtesy of Neal Caren and Pablo Barbera)

We use a simple function to find the frequency of positive and negative words in each corpus:

Importing the lexicon:

```r
# Load the lexicon and split it into positive and negative word lists
lexicon <- read.csv("lexicon.csv", stringsAsFactors = FALSE)
pos.words <- lexicon$word[lexicon$polarity == "positive"]
neg.words <- lexicon$word[lexicon$polarity == "negative"]
```

Then we count the positive/negative words:

```r
# Tokenize the tweet text into a vector of lowercase words, then count matches
words <- unlist(str_split(tolower(tweets_HC$text), "\\s+"))
positive <- sum(words %in% pos.words)
negative <- sum(words %in% neg.words)
```
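To compare the two candidates with one call, the counting can be wrapped in a small helper. This is a sketch rather than the lecture's own function; the name `classify` and the simple positive-minus-negative score are assumptions:

```r
# Hypothetical helper: score a character vector of tweets against the lexicon
classify <- function(text, pos.words, neg.words) {
  words <- unlist(str_split(tolower(text), "\\s+"))
  words <- str_replace_all(words, "[[:punct:]]", "")   # drop punctuation before matching
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Net sentiment score for each candidate's corpus
classify(tweets_HC$text, pos.words, neg.words)
classify(tweets_DT$text, pos.words, neg.words)
```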

Now let's complicate the topic and simplify our lexicon, starting from this visualization.

D3

Word clouds, frequency charts over time, and topics; a rough R sketch of a frequency-over-time chart follows the links below.

Link1, Link2

Link3, Link4
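Before moving to D3, the same kind of frequency-over-time chart can be prototyped directly in R. This is a sketch, not lecture code: it assumes streamR's `created_at` field uses Twitter's standard date format (which needs an English locale to parse), and the tracked term "america" is just an example.

```r
library(ggplot2)
library(dplyr)
library(stringr)

# Parse tweet timestamps (assumed format: "Wed Aug 27 13:08:45 +0000 2008")
tweets_HC$date <- as.Date(tweets_HC$created_at, format = "%a %b %d %H:%M:%S %z %Y")

# How often does an example term appear in each tweet?
tweets_HC$mentions <- str_count(tolower(tweets_HC$text), "america")

# Aggregate to daily counts and plot the series
daily <- tweets_HC %>%
  group_by(date) %>%
  summarise(freq = sum(mentions))

ggplot(daily, aes(x = date, y = freq)) +
  geom_line() +
  labs(x = "Date", y = "Mentions of 'america'",
       title = "Term frequency over time")
```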