Social Network Sentiment Analysis with twitteR

Analysing public sentiment around a trend or event has proven useful in many ways. Whether you are making an important business decision, running a marketing campaign or introducing a new product, it helps to know how the public feels about it on social media. In this blog, I’ve outlined the methodology I used to analyse public sentiment around the general election trend on Twitter, using the “twitteR” package in the R programming language.

The first thing to do is to extract the tweets. For that you need to create a Twitter Dev App at https://apps.twitter.com/. You need to fill in a few details and have your mobile phone already registered with your Twitter account. Once your app is created, make a note of the consumer key and consumer secret on the Details page of your app.

Once all the necessary packages are installed in R, we need to set up OAuth authentication. Jeff Gentry, the author of the twitteR package, has shown us how to do the OAuth setup here

Library and OAuth setup

#Calling necessary library functions
>library(twitteR)
>library(ggplot2)

#Enter your consumer key and consumer secret values
>setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
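Typing the keys straight into a script makes them easy to leak if the script is shared. One alternative, sketched here as my own assumption rather than anything the twitteR package prescribes, is to keep them in environment variables. The variable names TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET and the helper read_twitter_keys() are made up for this example.

```r
# Hypothetical helper: read the two keys from environment variables.
# The variable names are an assumption; use whatever you have set.
read_twitter_keys <- function(key_var = "TWITTER_CONSUMER_KEY",
                              secret_var = "TWITTER_CONSUMER_SECRET") {
  keys <- c(key = Sys.getenv(key_var), secret = Sys.getenv(secret_var))
  # Sys.getenv() returns "" for variables that are not set
  missing <- names(keys)[keys == ""]
  if (length(missing) > 0) {
    stop("Missing Twitter credential(s): ", paste(missing, collapse = ", "))
  }
  keys
}

# Usage (needs the twitteR package and real keys):
# keys <- read_twitter_keys()
# setup_twitter_oauth(keys[["key"]], keys[["secret"]])
```

The helper stops with an error as soon as a key is missing, so a forgotten variable shows up before any call to Twitter is made.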

Retrieving the tweets

Twitter collects millions of tweets per day. Handling “big data” is a whole topic of its own, which I will skip in this blog. I am therefore going to limit myself to a number of tweets that is easy to process, so that I can focus on the actual analysis. In the code below I extract 5000 tweets and strip out the duplicate retweets.

#Retrieving most recent tweets about the general election this month
>e.tweets<-searchTwitter('#generalelection', since='2017-06-06', until='2017-06-08', n=5000, lang='en', resultType = 'recent')

#Stripping the duplicated retweets from our collection
>e.tweets<-strip_retweets(e.tweets, strip_manual = TRUE, strip_mt = TRUE)
> length(e.tweets)
[1] 1276
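To get a feel for why 5000 tweets shrink to 1276, here is a rough base-R approximation of that filtering step on a plain character vector. It only drops texts that start with "RT @" or "MT @"; the actual rules inside strip_retweets() may well differ, and both the helper drop_retweet_like() and the example tweets are invented for illustration.

```r
# Rough approximation only: drop texts that look like manual retweets
# ("RT @user ...") or modified tweets ("MT @user ...").
drop_retweet_like <- function(texts) {
  is_rt <- grepl("^\\s*(RT|MT)\\s+@", texts)
  texts[!is_rt]
}

# Invented example tweets
example_texts <- c("RT @someone: Go vote! #generalelection",
                   "Polling stations open at 7am #generalelection",
                   "MT @someone: Go vote #generalelection",
                   "Not sure who to vote for #generalelection")

kept <- drop_retweet_like(example_texts)
length(kept)  # 2
```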

Let’s now convert the tweets into a data frame and have a peek at the data.

>tweets.df <- twListToDF(e.tweets)
>head(tweets.df)

text
1 7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz
2 MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy.
3 I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better.
4 No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection
5 Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46
6 Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC
 favorited favoriteCount replyToSN created truncated replyToSID
1 FALSE 0 <NA> 2017-06-07 23:59:31 FALSE <NA>
2 FALSE 13 <NA> 2017-06-07 23:59:12 FALSE <NA>
3 FALSE 0 <NA> 2017-06-07 23:58:45 FALSE <NA>
4 FALSE 0 <NA> 2017-06-07 23:58:23 FALSE <NA>
5 FALSE 1 <NA> 2017-06-07 23:58:03 FALSE <NA>
6 FALSE 0 <NA> 2017-06-07 23:57:35 TRUE <NA>
 id replyToUID
1 872603961286684678 <NA>
2 872603881125154816 <NA>
3 872603769795727360 <NA>
4 872603679182004224 <NA>
5 872603592368295940 <NA>
6 872603474449575936 <NA>
 statusSource
1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
3 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
4 <a href="http://www.echofon.com/" rel="nofollow">Echofon</a>
5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
 screenName retweetCount isRetweet retweeted longitude latitude
1 whatfrandid 0 FALSE FALSE <NA> <NA>
2 TheMurdochTimes 4 FALSE FALSE <NA> <NA>
3 mattrobin140s 0 FALSE FALSE <NA> <NA>
4 juliahobsbawm 0 FALSE FALSE <NA> <NA>
5 helbigfordeyes 0 FALSE FALSE <NA> <NA>
6 theokoulouris 0 FALSE FALSE <NA> <NA>

Now let’s get rid of some unwanted information and extract only the text of each tweet.

#Extracting the text from the tweets
> tweets.text<-tweets.df$text
> head(tweets.text)

[1] "7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz" 
[2] "MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy." 
[3] "I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better." 
[4] "No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection" 
[5] "Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46"
[6] "Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC"

This is easier to read.
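The \xed\xa0\xbd... sequences in tweet 5 are emoji that the console could not display. If you want to look at the text without them, one option is to drop everything that is not plain ASCII. This is a minimal sketch using base R's iconv(); the helper name strip_non_ascii() and the example string are my own, and how iconv() treats unconvertible bytes can vary a little between platforms.

```r
# Drop every byte that cannot be represented in ASCII
# (emoji, accented letters, typographic ellipses and so on).
strip_non_ascii <- function(texts) {
  iconv(texts, from = "UTF-8", to = "ASCII", sub = "")
}

# Example string with a flexed-biceps emoji, similar to tweet 5
cleaned <- strip_non_ascii("we're gonna make a difference \U0001F4AA #GeneralElection")
cleaned
```

Note that this also removes legitimate non-English letters, which is acceptable here only because the search was limited to English tweets.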

Collecting positive and negative words

Next, I take lists of positive and negative words from the website http://ptrckprry.com/. The negative list was missing a few words such as “wtf”, so I added them manually.

#After changing the required directory, read the positive and negative files
>pos<-scan('positive.txt', what='character', comment.char = ";")
>neg<-scan('negative.txt', what='character', comment.char = ";")
> head(pos)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"
> head(neg)
[1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably"
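Instead of editing the text file, the extra words can also be appended in R. The sketch below uses a three-word stand-in for the negative list, and the extra terms are only examples, not the exact words I added.

```r
# Toy stand-in for the list read from negative.txt (not the real lexicon)
neg_example <- c("abnormal", "abolish", "abominable")

# Hypothetical extra terms to add by hand; "abolish" is already present
extra_neg <- c("wtf", "abolish")

# unique() avoids duplicate entries if a term is already in the list
neg_example <- unique(c(neg_example, tolower(extra_neg)))
neg_example
```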

Sentiment scores

I am using a function by Jeffrey Breen with a slight modification to remove certain unwanted characters / unicode blocks.

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
 require(plyr)
 require(stringr)
 
 # we got a vector of sentences. plyr will handle a list
 # or a vector as an "l" for us
 # we want a simple array ("a") of scores back, so we use 
 # "l" + "a" + "ply" = "laply":
 scores = laply(sentences, function(sentence, pos.words, neg.words) {
 
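 # drop non-ASCII characters (emoji and the like) first, so the regex
 # calls below never see bytes they cannot handle. Treating the input
 # as ASCII makes every non-ASCII byte invalid, and sub="" removes it:
 sentence = iconv(sentence, "ASCII", "UTF-8", sub="")

 # clean up sentences with R's regex-driven global substitute, gsub():
 sentence = gsub('[[:punct:]]', '', sentence)
 sentence = gsub('[[:cntrl:]]', '', sentence)
 sentence = gsub('\\d+', '', sentence)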
 
 # and convert to lower case:
 sentence = tolower(sentence)
 
 # split into words. str_split is in the stringr package
 word.list = str_split(sentence, '\\s+')
 # sometimes a list() is one level of hierarchy too much
 words = unlist(word.list)
 
 # compare our words to the dictionaries of positive & negative terms
 pos.matches = match(words, pos.words)
 neg.matches = match(words, neg.words)
 
 # match() returns the position of the matched term or NA
 # we just want a TRUE/FALSE:
 pos.matches = !is.na(pos.matches)
 neg.matches = !is.na(neg.matches)
 
 # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
 score = sum(pos.matches) - sum(neg.matches)
 
 return(score)
 }, pos.words, neg.words, .progress=.progress )
 
 scores.df = data.frame(score=scores, text=sentences)
 return(scores.df)
}
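The core of the function is the word matching, and it may help to see that step on its own for a single sentence. The two short lexicons below are toy lists made up for this illustration, not the real word lists, and I use base R's strsplit() and %in% in place of str_split() and match().

```r
# Toy lexicons for illustration only (not the real word lists)
pos_toy <- c("enjoyed", "informative", "good")
neg_toy <- c("delayed", "dreadful", "bad")

sentence <- "All flights delayed, such a dreadful experience"

# Strip punctuation, convert to lower case, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+"))

# %in% is shorthand for !is.na(match(...))
n_pos <- sum(words %in% pos_toy)
n_neg <- sum(words %in% neg_toy)

score <- n_pos - n_neg
score  # -2
```

Each matched word counts once, so the score is simply the number of positive words minus the number of negative words in the sentence.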

Here’s an example of how the function works. I have written three sentences with different sentiments, and you can see that the function has scored them accordingly.

> sample <- c("All flights delayed, such a dreadful experience", 
 "Enjoyed & looking forward to another informative conference",
 "Today you are you! That is truer than true!")
> result <- score.sentiment(sample, pos, neg)
> result$score
[1] -2 1 0

Now let’s score our tweets and plot them out.

>sentence<-tweets.text
>e.result <- score.sentiment(tweets.text,pos,neg)
>sentiment<-e.result$score

>qplot(sentiment,
 geom="histogram",
 binwidth = 0.5,
 color= sentiment >= 0,
 xlab = "Sentiment",
 ylab = "Number of tweets",
 main = "Social Network Sentiment Analysis of General Election")

[Figure: sentiment histogram for the #generalelection tweets]

The negative tweets are shown in red and the neutral/positive ones in green. If we ignore the neutral tweets, positive tweets slightly outnumber negative ones.
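The same comparison can be made with numbers instead of reading it off the plot. The scores below are made up to keep the example self-contained; in practice the vector would be e.result$score.

```r
# Made-up scores standing in for e.result$score
sentiment_example <- c(-2, -1, 0, 0, 0, 1, 1, 2, -1, 1)

# sign() maps each score to -1, 0 or 1; the factor adds readable labels
polarity <- factor(sign(sentiment_example),
                   levels = c(-1, 0, 1),
                   labels = c("negative", "neutral", "positive"))
counts <- table(polarity)
counts

# Share of positive tweets among the non-neutral ones
positive_share <- counts[["positive"]] /
  (counts[["positive"]] + counts[["negative"]])
positive_share
```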

Now let’s see how the function works if I extract Labour and Conservative tweets separately. After changing the search term in the above code to #labour and then to #conservatives, I get the following graphs. The two graphs look very similar.

[Figure: sentiment histogram for the #labour tweets]

[Figure: sentiment histogram for the #conservatives tweets]
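“Very similar” is easier to judge with a couple of summary numbers next to the plots. This is a small sketch of how the two sets of scores could be compared side by side; the score vectors are made up and stand in for the real results of the two searches.

```r
# Made-up score vectors standing in for the #labour and #conservatives results
scores_by_term <- list(labour = c(-1, 0, 1, 2, 0),
                       conservatives = c(0, -2, 1, 1, 0))

# One column per search term: mean score and share of non-neutral tweets
summary_table <- sapply(scores_by_term, function(s) {
  c(mean_score = mean(s), share_non_neutral = mean(s != 0))
})
summary_table
```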

Possible extensions to the project

  • Use big data technologies to analyse a much larger number of tweets and track hourly sentiment over a longer continuous time period.
  • Produce more detailed visualisations of the tweets and their sentiments.
  • Use machine learning to predict sentiment based on past trends, and reduce the number of tweets scored as neutral with a more sophisticated model.
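As a starting point for the first extension, hourly sentiment can already be computed in base R on a small scale. The data frame below is made up, but it has the same shape as the tweets data frame with a score column added: a created timestamp and a numeric score.

```r
# Made-up data shaped like the tweets data frame plus a score column
scored <- data.frame(
  created = as.POSIXct(c("2017-06-07 22:15:00", "2017-06-07 22:40:00",
                         "2017-06-07 23:05:00", "2017-06-07 23:58:00"),
                       tz = "UTC"),
  score = c(1, -1, 2, 0)
)

# Truncate each timestamp to the hour, then average the scores per hour
scored$hour <- format(scored$created, "%Y-%m-%d %H:00", tz = "UTC")
hourly <- aggregate(score ~ hour, data = scored, FUN = mean)
hourly
```

For millions of tweets the same group-and-average idea applies, only with tools built for larger data.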

 
