Wednesday, July 13, 2016

Identifying Emotions from tweets for a Social Event using R in IBM Data Scientist Workbench


This post walks through how to identify the emotions expressed in tweets about an event. As an example, I have taken the "Britain Exit" (Brexit) event. I am using IBM Data Scientist Workbench, an online platform where you can run R statements directly in RStudio. I recommend Data Scientist Workbench because it provides RStudio and Apache Spark on a single platform, with nothing to install on your machine.

The blog covers

1) Setting up the infrastructure - IBM Data Scientist Workbench
2) Setting up Twitter App for OAuth
3) Connecting to Twitter using R
4) Cleansing the tweets
5) Identifying the emotions
6) Plot the graph with emotions & tweets


1) Setting up the infrastructure - IBM Data Scientist Workbench

Create an online account at https://my.datascientistworkbench.com/login

Log in to Data Scientist Workbench and click on RStudio IDE.


2) Setting up Twitter App for OAuth

Log in to https://apps.twitter.com/app/new and create an application.


Open the "Keys and Access Tokens" tab and copy the Consumer Key, Consumer Secret, Access Token, and Access Token Secret.


3) Connecting to Twitter using R

We use existing R libraries to connect to Twitter. Install the required libraries:

install.packages("twitteR")
install.packages("ROAuth")
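
If some of the libraries may already be installed, a small check avoids reinstalling them. The following is only a sketch: the helper name missing_packages is my own, and the package list simply collects the libraries used later in this post.

```r
# Packages used across the steps in this post
needed <- c("twitteR", "ROAuth", "tm", "syuzhet")

# Hypothetical helper: return the packages that cannot be loaded
# in this R installation
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Only install from an interactive session, where a CRAN mirror can be chosen
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

In a non-interactive script you would probably want to pass an explicit repos argument to install.packages instead of relying on the interactive() guard.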


The statements below load the libraries and connect to Twitter. We then search for some popular hashtags.

> library("twitteR")
>
> library("ROAuth") 
>
> dir.create(file.path("/resources", "BRExit"), mode = "0777")
>
> setwd("/resources/BRExit")
>
> getwd()
[1] "/resources/BRExit"
>
> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
trying URL 'http://curl.haxx.se/ca/cacert.pem'
Content type 'unknown' length 250607 bytes (244 KB)
==================================================
downloaded 244 KB

> ## Provide the keys and tokens obtained in step 2.
> consumer_key <- 'xxxxxxxxxxx'
>
> consumer_secret <- 'xxxxxxxxxxxxxxxxxxxxxxx'
>
> access_token <- 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>
> access_secret <- 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>
> twiter_handle <- setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret)
[1] "Using direct authentication"
Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions?

1: Yes
2: No

Selection: 1
Adding .httr-oauth to .gitignore
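
Instead of typing the keys into the script, they can be read from environment variables so that they never end up in a saved file. This is a minimal sketch and not part of the original session: the helper read_credential and the variable names (TWITTER_CONSUMER_KEY and so on) are assumptions chosen for illustration.

```r
# Hypothetical helper: read one credential from the environment and
# fail early with a clear message if it is missing or empty
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Assumed usage, with the twitteR library loaded and the four variables
# set beforehand (the variable names are illustrative):
#   setup_twitter_oauth(read_credential("TWITTER_CONSUMER_KEY"),
#                       read_credential("TWITTER_CONSUMER_SECRET"),
#                       read_credential("TWITTER_ACCESS_TOKEN"),
#                       read_credential("TWITTER_ACCESS_SECRET"))
```

The variables can be set in the shell that starts R, or for a single session with Sys.setenv.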




Now we query Twitter for the tweets that contain the hashtags.

> search_term <- "#brexit,#strongerin,#yes2eu"
> 
> search_term <- unlist(strsplit(search_term,","))
> 
> tweets = list()
>  
> ## You may get a warning because of the Twitter rate limit.
> ## If there are many hashtags, you may need to stop the loop after some time,
> ## because once Twitter imposes the rate limit, further requests for tweets
> ## will fail with an exception.
> for(i in 1:length(search_term)){
     result<-searchTwitter(search_term[i],n=1500,lang="en")
     tweets <- c(tweets,result)
     tweets <- unique(tweets)
 }
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
  1500 tweets were requested but the API can only return 107
>
> ## Collected 3107 tweets  
> length(tweets)
[1] 3107
>  
> ## Display the first few tweets
> head(tweets)
[[1]]
[1] "AAPLRoom: AAPL Trading Room Newsletter is out! https://t.co/cXzBstbEI1 #trading #brexit"

[[2]]
[1] "DVATW: If #TheresaMay becomes PM then the hope and optimism of a fast clean #Brexit is over."

[[3]]
[1] "melvine: RT @RoubiniGlobal: #Brexit Circus: Which branch of the #EU line will the UK take? #MindTheGap https://t.co/mvFceRuJpy"
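
A tweet that carries several of the hashtags can come back from more than one search. One way to make the de-duplication explicit is to key on the tweet id. The sketch below shows the idea without calling Twitter: fake_fetch is a made-up stand-in for a real search, and collect_tweets is a hypothetical helper, not a twitteR function.

```r
# Split the comma-separated hashtags, as in the session above
terms <- unlist(strsplit("#brexit,#strongerin,#yes2eu", ","))

# Made-up stand-in for a search: each term returns one tweet shared by
# all terms (id "1") and one tweet of its own
fake_fetch <- function(term) {
  data.frame(id = c("1", paste0("id-", term)),
             text = paste("tweet about", term),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: fetch every term, stack the results and keep
# the first occurrence of each tweet id
collect_tweets <- function(terms, fetch) {
  batches <- lapply(terms, fetch)
  all_rows <- do.call(rbind, batches)
  all_rows[!duplicated(all_rows$id), ]
}

tweets_df <- collect_tweets(terms, fake_fetch)
nrow(tweets_df)   # 4: the shared tweet is kept once, plus one per hashtag
```

With real data, the fetch function could wrap searchTwitter followed by twListToDF, under the assumption that the session is already authenticated.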


Save the tweets to a file for other types of analytics on the data.


> file<-NULL
> 
> if (file.exists("tweetsBRExit.csv")){file<- read.csv("tweetsBRExit.csv")}
> 
> df <- do.call("rbind", lapply(tweets, as.data.frame))
> 
> df<-rbind(df,file)
> 
> df <- df[!duplicated(df[c("id")]),]
> 
> write.csv(df,file="tweetsBRExit.csv",row.names=FALSE)
> 
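
One detail to watch when the CSV file is read back: tweet ids are very long numbers, and read.csv will by default turn them into doubles, which cannot represent every id exactly. The sketch below uses made-up ids and a temporary file to show one way around this, by reading the id column as character before merging and de-duplicating. The data and file path are illustrative assumptions, not the tweets collected above.

```r
# Toy stand-ins for an earlier export and a fresh batch of tweets
batch_old <- data.frame(id = c("753000000000000001", "753000000000000002"),
                        text = c("first tweet", "second tweet"),
                        stringsAsFactors = FALSE)
batch_new <- data.frame(id = c("753000000000000002", "753000000000000003"),
                        text = c("second tweet", "third tweet"),
                        stringsAsFactors = FALSE)

# Write the earlier export to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(batch_old, file = csv_path, row.names = FALSE)

# Read ids back as character so they are compared as exact strings
previous <- read.csv(csv_path, colClasses = c(id = "character"),
                     stringsAsFactors = FALSE)

# Merge, drop tweets already seen, and write the file again
combined <- rbind(batch_new, previous)
combined <- combined[!duplicated(combined$id), ]
write.csv(combined, file = csv_path, row.names = FALSE)

nrow(combined)   # 3 distinct tweets
```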

The tweets are now exported to /resources/BRExit/tweetsBRExit.csv. The file contains various pieces of information that can be helpful for building other insights, such as influencers or geographic influence.

Here are the column names for your reference.

"text","favorited","favoriteCount","replyToSN","created","truncated","replyToSID","id",
"replyToUID","statusSource","screenName","retweetCount","isRetweet",
"retweeted","longitude","latitude"



4) Cleansing the tweets 

We cleanse the tweets by stripping extra whitespace, removing numbers and punctuation, and converting the text to lower case.

> library(tm)
Loading required package: NLP
> 
> twitter_brexit_df = twListToDF(tweets)
> 
> r_text_corpus <- Corpus(VectorSource(twitter_brexit_df$text))
> 
> r_text_cleansing <- tm_map(r_text_corpus, stripWhitespace)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, removeNumbers)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, removePunctuation)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, content_transformer(tolower))
> 
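
To see what these steps do to a single tweet, here is a rough base R equivalent. It is a sketch and not the tm implementation: clean_tweet is a hypothetical helper, and unlike the session above it removes URLs first, while they are still intact. In the session above the URLs survive this step as a single word starting with "http" (their punctuation and digits are stripped), which the gsub call in the next step appears intended to remove.

```r
# Hypothetical helper: approximate the cleansing steps with base R
clean_tweet <- function(x) {
  x <- gsub("http\\S+", "", x)         # drop URLs while they are still intact
  x <- gsub("[[:digit:]]+", "", x)     # remove numbers
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation, including '#'
  x <- tolower(x)                      # lower-case everything
  x <- gsub("\\s+", " ", x)            # collapse runs of whitespace
  trimws(x)
}

clean_tweet("AAPL Trading Room Newsletter is out! https://t.co/cXzBstbEI1 #trading #brexit")
# [1] "aapl trading room newsletter is out trading brexit"
```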

5) Identifying the emotions

We use the syuzhet library to identify the emotions in the tweets. Its get_nrc_sentiment function implements Saif Mohammad’s NRC Emotion Lexicon.
Reference: http://www.purl.org/net/NRCemotionlexicon

> install.packages("syuzhet")
>
> library(syuzhet)
> 
> isNull <- function(data) {
     if(is.null(data))
         return(0)
     else
         return(data)
 }
> 
> text_vec = c()
> anger = c() ; anticipation=c() ; disgust=c() ; fear=c() ; joy=c() ;
> sadness=c() ; surprise=c() ; trust=c() ; nrc_negative=c() ; nrc_positive=c();
> 
> for(i in 1:length(r_text_cleansing)){
     text <- lapply(r_text_cleansing[i], as.character)
     text <- gsub("http\\w+", "", text)
     nrc_emotions <- get_nrc_sentiment(text)
     
     text_vec[i] <- text
     anger[i] <- isNull(nrc_emotions$anger)
     anticipation[i] <- isNull(nrc_emotions$anticipation)
     disgust[i] <- isNull(nrc_emotions$disgust)
     fear[i] <- isNull(nrc_emotions$fear)
     joy[i] <- isNull(nrc_emotions$joy)
     sadness[i] <- isNull(nrc_emotions$sadness)
     surprise[i] <- isNull(nrc_emotions$surprise)
     trust[i] <- isNull(nrc_emotions$trust)
     nrc_negative[i] <- isNull(nrc_emotions$negative)
     nrc_positive[i] <- isNull(nrc_emotions$positive)
 }
> 
> nrc_df <- data.frame(text_vec,anger,anticipation,disgust,fear,joy,sadness,surprise,
                      trust,nrc_negative,nrc_positive)
> 
> ## Show the scores of the first two tweets (all columns except trust)
> nrc_df[1:2,c(2:8,10:11)]
  anger anticipation disgust fear joy sadness surprise nrc_negative nrc_positive
1     0            0       0    0   0       0        0            0            0
2     0            2       0    0   3       0        2            0            3
>
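
The scores above are word counts: each word of a tweet is looked up in the lexicon, and every emotion associated with that word is incremented. The sketch below illustrates that idea with a tiny made-up lexicon. The entries in toy_lexicon are invented for illustration and are not real NRC entries, and score_text is a hypothetical helper, not a syuzhet function.

```r
# Made-up mini-lexicon: word -> associated emotions (NOT real NRC data)
toy_lexicon <- list(
  hope     = c("anticipation", "joy", "trust"),
  optimism = c("anticipation", "joy", "trust"),
  fear     = c("fear"),
  crisis   = c("fear", "sadness")
)
emotions <- c("anticipation", "fear", "joy", "sadness", "trust")

# Hypothetical helper: count, per emotion, how many words of the text
# are associated with it in the lexicon
score_text <- function(text, lexicon, emotions) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  known <- words[words %in% names(lexicon)]
  hits <- unlist(lexicon[known], use.names = FALSE)
  counts <- as.vector(table(factor(hits, levels = emotions)))
  names(counts) <- emotions
  counts
}

scores <- score_text("the hope and optimism of a fast clean brexit is over",
                     toy_lexicon, emotions)
scores
# anticipation         fear          joy      sadness        trust 
#            2            0            2            0            2 
```

A tweet with no lexicon words gets a row of zeros, which matches the first row of the output shown earlier.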

6) Plot the graph with emotions & tweets

Plot the graph in R.

> par(mar=c(5.1,5,4.1,2.1))
> 
> barplot(
     sort(colSums(prop.table(nrc_df[, 2:9]))), 
     horiz = TRUE, 
     cex.names = 0.7,
     las = 1, 
     main = "Emotions for Britain Exit", 
     xlab="Proportion",
     col="lightblue"
 )
>
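
The bar lengths are each emotion's share of all emotion words counted across the tweets, so they add up to 1. The toy numbers below are made up and only illustrate the arithmetic behind colSums and prop.table. Multiplying by 100 turns the shares into percentages.

```r
# Made-up emotion counts for two tweets
toy <- data.frame(anger = c(0, 1), joy = c(3, 0), trust = c(2, 2))

# Share of each emotion in the total count, sorted as in the plot
shares <- sort(colSums(toy) / sum(toy))
shares
#  anger    joy  trust 
#  0.125  0.375  0.500 

# The same values expressed as percentages
round(100 * shares, 1)
# anger   joy trust 
#  12.5  37.5  50.0 
```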

I tracked the tweets over the last couple of weeks and could see the emotions change as time progressed.
