Social Network Sentiment Analysis with twitteR

Public sentiment analysis of a trend or event has proven useful in many ways. Whether you are making an important business decision, running a marketing campaign or launching a new product, it is always beneficial to know the public mood on social media. In this blog, I’ve outlined the methodology I used to analyse public sentiment around the general election trend on Twitter, using the “twitteR” package in the R programming language.

The first thing to do is to extract the tweets. You need to create a Twitter developer app at https://apps.twitter.com/. You will need to fill in a few details and have your mobile phone number already registered with your Twitter account. Once your app is created, make note of the consumer key and consumer secret from its Details page.

Once we have all the necessary packages installed in R, we need to set up OAuth authentication. Jeff Gentry, the author of the twitteR package, has shown us how to do the OAuth setup here

Library and OAuth setup

#Calling necessary library functions
>library(twitteR)
>library(ggplot2)

#Enter your consumer key and consumer secret values
>setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")

Retrieving the tweets

Twitter collects millions of tweets per day. Handling “big data” is a whole topic of its own, which I will skip in this blog. I am therefore going to limit myself to a reasonable number of tweets that I can process easily, so I can focus on the actual analysis. In the code below I extract 5,000 tweets and strip off the duplicate retweets.

#Retrieving most recent tweets about the general election this month
>e.tweets<-searchTwitter('#generalelection', since='2017-06-06', until='2017-06-08', n=5000, lang='en', resultType = 'recent')

#Stripping the duplicated retweets from our collection
>e.tweets<-strip_retweets(e.tweets, strip_manual = TRUE, strip_mt = TRUE)
> length(e.tweets)
[1] 1276

Let’s now convert the tweets into a data frame and have a peek at the data.

>tweets.df <- twListToDF(e.tweets)
>head(tweets.df)

text
1 7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz
2 MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy.
3 I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better.
4 No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection
5 Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46
6 Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC
 favorited favoriteCount replyToSN created truncated replyToSID
1 FALSE 0 <NA> 2017-06-07 23:59:31 FALSE <NA>
2 FALSE 13 <NA> 2017-06-07 23:59:12 FALSE <NA>
3 FALSE 0 <NA> 2017-06-07 23:58:45 FALSE <NA>
4 FALSE 0 <NA> 2017-06-07 23:58:23 FALSE <NA>
5 FALSE 1 <NA> 2017-06-07 23:58:03 FALSE <NA>
6 FALSE 0 <NA> 2017-06-07 23:57:35 TRUE <NA>
 id replyToUID
1 872603961286684678 <NA>
2 872603881125154816 <NA>
3 872603769795727360 <NA>
4 872603679182004224 <NA>
5 872603592368295940 <NA>
6 872603474449575936 <NA>
 statusSource
1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
3 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
4 <a href="http://www.echofon.com/" rel="nofollow">Echofon</a>
5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
 screenName retweetCount isRetweet retweeted longitude latitude
1 whatfrandid 0 FALSE FALSE <NA> <NA>
2 TheMurdochTimes 4 FALSE FALSE <NA> <NA>
3 mattrobin140s 0 FALSE FALSE <NA> <NA>
4 juliahobsbawm 0 FALSE FALSE <NA> <NA>
5 helbigfordeyes 0 FALSE FALSE <NA> <NA>
6 theokoulouris 0 FALSE FALSE <NA> <NA>

Now let’s get rid of some unwanted information and extract only the text.

#Extracting the text from the tweets
> tweets.text<-tweets.df$text
> head(tweets.text)

[1] "7 hours til voting time and I still don't have a clue what to do #help #GeneralElection #givemeapoliticslesson #plz" 
[2] "MSM doesn't realise this is Rupert Murdoch's last #GeneralElection after 40 years of interference and corruption of British democracy." 
[3] "I have no doubt that the Tories will win the #GeneralElection ...too many people are afraid to vote for something different and better." 
[4] "No more sleeps. Less than 7 hours to show if the latest 7 point Conservative poll predicted lead holds true #generalelection" 
[5] "Me toooo, hell yeah we're gonna make a difference \xed\xa0\xbd\xed\xb2\xaa\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb2\x81\xed\xa0\xbc\xed\xbf\xbb #GeneralElection https://t.co/DCas5Y8Q46"
[6] "Go Gary #GeneralElection #GE2017 #VoteLabour #VoteConservative #VoteTory #VoteForChange #ForTheMany #VoteLibDem… https://t.co/thJDlDZgdC"

This looks better to read.

Collecting positive and negative words

Next, I am taking a list of positive and negative opinion words from the website http://ptrckprry.com/. The negative list was missing a few words like “wtf”, so I added them manually!

#After changing the required directory, read the positive and negative files
>pos<-scan('positive.txt', what='character', comment.char = ";")
>neg<-scan('negative.txt', what='character', comment.char = ";")
> head(pos)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"
> head(neg)
[1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably"

Sentiment scores

I am using a function by Jeffrey Breen, with a slight modification to remove certain unwanted characters and Unicode blocks.

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
 require(plyr)
 require(stringr)
 
 # we got a vector of sentences. plyr will handle a list
 # or a vector as an "l" for us
 # we want a simple array ("a") of scores back, so we use 
 # "l" + "a" + "ply" = "laply":
 scores = laply(sentences, function(sentence, pos.words, neg.words) {
 
 # clean up sentences with R's regex-driven global substitute, gsub():
 sentence = gsub('[[:punct:]]', '', sentence)
 sentence = gsub('[[:cntrl:]]', '', sentence)
 sentence = gsub('\\d+', '', sentence)
 
 # remove non-ASCII bytes (emoji etc.); note iconv must convert *from*
 # UTF-8 *to* ASCII, substituting "" for anything it cannot map
 sentence = iconv(sentence, "UTF-8", "ASCII", sub="")
 
 # and convert to lower case:
 sentence = tolower(sentence)
 
 # split into words. str_split is in the stringr package
 word.list = str_split(sentence, '\\s+')
 # sometimes a list() is one level of hierarchy too much
 words = unlist(word.list)
 
 # compare our words to the dictionaries of positive & negative terms
 pos.matches = match(words, pos.words)
 neg.matches = match(words, neg.words)
 
 # match() returns the position of the matched term or NA
 # we just want a TRUE/FALSE:
 pos.matches = !is.na(pos.matches)
 neg.matches = !is.na(neg.matches)
 
 # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
 score = sum(pos.matches) - sum(neg.matches)
 
 return(score)
 }, pos.words, neg.words, .progress=.progress )
 
 scores.df = data.frame(score=scores, text=sentences)
 return(scores.df)
}

Here’s an example of how the function works. I have written 3 sentences with different sentiments, and you can see the function has scored them accordingly.

> sample <- c("All flights delayed, such a dreadful experience", 
 "Enjoyed & looking forward to another informative conference",
 "Today you are you! That is truer than true!")
> result <- score.sentiment(sample, pos, neg)
> result$score
[1] -2 1 0

Now let’s score our tweets and plot them out.

>e.result<-score.sentiment(tweets.text, pos, neg)
>sentiment<-e.result$score

>qplot(sentiment,
 geom="histogram",
 binwidth = 0.5,
 color= sentiment >= 0,
 xlab = "Sentiment",
 ylab = "Number of tweets",
 main = "Social Network Sentiment Analysis of General Election")

[Figure: histogram of sentiment scores for #generalelection tweets]

The negative tweets are shown in red and the neutral/positive tweets in green. Ignoring the neutral tweets, we can see that positive tweets are slightly higher in number than negative ones.
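
To put rough numbers on this, here is a quick check I’ve added (not part of the original analysis) that tabulates tweets by polarity, since sign() maps each score to -1 (negative), 0 (neutral) or +1 (positive):

#Count negative (-1), neutral (0) and positive (+1) tweets
>table(sign(sentiment))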

Now let’s see how the function works if I extract Labour and Conservative tweets separately. After changing the search term to #labour and #conservatives in the above code, I get the following graphs. Both graphs look very similar.
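
For reference, the Labour run would look something like the sketch below, assuming the same date range and settings as before; the Conservative run simply swaps the hashtag. (The variable names here are my own.)

#Repeat the pipeline for a party-specific hashtag
>l.tweets<-searchTwitter('#labour', since='2017-06-06', until='2017-06-08', n=5000, lang='en', resultType = 'recent')
>l.tweets<-strip_retweets(l.tweets, strip_manual = TRUE, strip_mt = TRUE)
>l.result<-score.sentiment(twListToDF(l.tweets)$text, pos, neg)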

[Figure: sentiment histogram for #labour tweets]


[Figure: sentiment histogram for #conservatives tweets]

Possible extensions to the project

  • Invoke big data technologies to analyse a much larger number of tweets and track hourly sentiment over a longer continuous period.
  • Produce more detailed visualisations of the tweets and sentiments.
  • Use machine learning to predict sentiment based on past trends, and reduce the number of neutral tweets with a more sophisticated model.


Linear Regression in R

Linear regression is a fundamental method in regression analysis. It is a quick and easy way to understand how predictive algorithms work, both in general and in the field of machine learning.

To give a simple overview of how the algorithm works: a linear regression fits a linear relationship of the form Y = a + bX for a given target variable (Y) using one or more explanatory variables (X).

For example, consider the age, height and weight of children. Intuitively, we can say that these variables are correlated: as height and age increase, so does weight. A linear regression fits a model, or an equation, for the target variable assuming the relationship between X and Y is linear. In many practical applications a linear regression model works effectively.

[Figure: example of a fitted linear regression line]

The figure above shows an example of a linear regression model. The red line indicates the fitted model of a target variable (Y axis) against one explanatory variable (X axis).

Let’s perform a linear regression analysis using the trees dataset in R.

First, let’s load the dataset into the R workspace and have a sneak peek at the data.

> data(trees)
> head(trees)
  Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
> str(trees)
'data.frame': 31 obs. of 3 variables:
 $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
 Girth Height Volume 
 Min. : 8.30 Min. :63 Min. :10.20 
 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40 
 Median :12.90 Median :76 Median :24.20 
 Mean :13.25 Mean :76 Mean :30.17 
 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30 
 Max. :20.60 Max. :87 Max. :77.00

The head() function shows us the top few rows of the dataset. We can see the dataset has 3 variables, namely Girth, Height and Volume. The str() function gives a quick view of the type and values of each variable. For a more detailed understanding, use summary() to see the mean, median and quartiles of each variable.

Let us now do some exploratory analysis of these variables by plotting them out.

> plot(trees)

[Figure: pairwise scatterplots of the trees variables]

We can see from the above plot that Volume and Girth have a linear relationship. Let’s build a linear model using that knowledge, and assume we want to predict the volume of a tree using its girth.
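
Before fitting, we can quantify that relationship with the correlation coefficient. This is a quick check of my own; the original analysis goes straight to the model:

#Correlation between Girth and Volume: about 0.967, which squared
#gives the R-squared of 0.9353 reported in the model summary below
> cor(trees$Girth, trees$Volume)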

> lmod<-lm(Volume~Girth, data = trees)
> summary(lmod)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
 Min 1Q Median 3Q Max 
-8.065 -3.107 0.152 3.495 9.587 

Coefficients:
 Estimate Std. Error t value Pr(>|t|) 
(Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
Girth 5.0659 0.2474 20.48 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331 
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16

The lm() function fits a linear model, and in the Call you can see the target and explanatory variables the model uses. By calling summary() we get the details of the fitted model.

The model has an intercept of -36.9435 and a slope of 5.0659. The intercept indicates the point where the fitted line crosses the Y axis, and the slope is the steepness of the line. So for our linear equation Y = a + bX, the model tells us that a (the intercept) is -36.9435 and b (the slope) is 5.0659, where Y is Volume and X is Girth.
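
As a worked illustration (my own example, not from the original post), we can plug a girth of 14 into the fitted equation, or ask R to predict it for us:

#By hand: Y = -36.9435 + 5.0659 * 14
> -36.9435 + 5.0659 * 14
[1] 33.9791

#Using the fitted model directly (gives essentially the same value)
> predict(lmod, newdata = data.frame(Girth = 14))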

From the significance codes we can see that Girth is statistically significant in our model. We can plot our model for a better understanding.
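
The plot below can be reproduced with something like this; the original post doesn’t show the plotting code, so this is my reconstruction:

#Scatterplot of the data with the fitted regression line in red
> plot(Volume ~ Girth, data = trees)
> abline(lmod, col = "red")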

[Figure: Volume vs Girth with the fitted regression line in red]

That’s it! We have built our first linear model. The red line shows that we are able to draw a line that matches most of our data points. Some data points are further away from the line and some are quite close. The distance between the predicted and actual values is the residual. In the following blogs we’ll see how we can improve our model fit further.
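
If you want to inspect those residuals yourself, here is a small addition of mine:

#Residuals = actual Volume minus the Volume predicted by the model
> head(residuals(lmod))
> head(trees$Volume - fitted(lmod)) #the same numbers, computed by hand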

The types of machine learning problems

I have written this two-part blog to articulate the technical aspects of machine learning in layman’s terms. For part 1 of this series click here


In the last blog we gave a short introduction to machine learning and why it is important. I suggest you read it first to get the most out of this one.

This time we will dive into a more technical introduction to the types of machine learning problems. Let’s look closely at the situations you might come across when trying to build your own predictive model.

[Image: chocolate cupcakes with raspberry buttercream]

Situation 1

Imagine you own a bakery. Your business seems to be quite popular among all types of customers: kids, teens, adults. But you want to know whether people truly like your bakery or not. The answer can depend on many things (e.g., the order they place, their age, their favourite flavour, suggestions from their family and friends). These are the predictor variables that influence our answer. But the answer you are looking for is a simple Yes or No: do people like your bakery or not? This type of machine learning problem is known as classification. Sometimes there are more than 2 categories, for example how much people like your bakery (Very much, Quite a bit, Not at all). These are ordinal classes. Ordinal classes can also be labelled 1, 2 or 3, but remember this is not the same as regression (see below).

Situation 2

You are the owner of the same bakery, but you want more than a classification answer. You want to go straight to the target and find out how much a customer might spend, based on their historic data. You are now looking for an answer on a numerical scale; it can range anywhere from £5 to £15 per visit. Imagine that every time a new customer walks into your bakery, you see the amount they are most likely to spend floating above their head. This is a regression situation.

Situation 3

You don’t know what you want to know. You just want to find out whether there are groups of customers who are likely to act in a particular way. Do little kids always go for the cupcakes with cartoon characters? Do young teens with their girlfriend or boyfriend go for the heart-shaped ones? You want the data to frame the question and answer it. We are looking for patterns, groups or clusters in the data. This is the clustering problem.

In situations 1 and 2 we have a question framed and a set of predictors that we think might influence the answer. This type of machine learning is known as supervised learning. In situation 3 we did not have any particular question in mind; we were looking for patterns or groups in the data. This is known as unsupervised learning.

Summary:

  • Classification: a supervised machine learning method where the output variable takes the form of class labels.
  • Regression: a supervised machine learning method where the output variable takes the form of continuous values.
  • Clustering: an unsupervised machine learning method where we group a set of objects and find whether there are relationships between them (see the short R sketch below).
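
To make these three concrete in R, here is a minimal sketch on synthetic bakery-style data. This is entirely my own toy example; the variables (spend, visits, likes) are made up for illustration:

#Toy data: spend per visit, visits per month, and a Yes/No "likes us" label
> set.seed(42)
> spend<-runif(100, 5, 15)
> visits<-rpois(100, 4)
> likes<-factor(ifelse(spend + visits + rnorm(100) > 14, "Yes", "No"))

#Classification (supervised): predict a class label, e.g. logistic regression
> cls<-glm(likes ~ spend + visits, family = binomial)

#Regression (supervised): predict a continuous value
> reg<-lm(spend ~ visits)

#Clustering (unsupervised): look for groups without using any labels
> clu<-kmeans(cbind(spend, visits), centers = 3)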

For part 1 of this series click here

Or read my blog on big data