I will run a three-way sentiment analysis on tweets to classify them as positive, negative, or neutral, using NLTK's built-in sentiment analysis tools as well as a custom script to analyze the tweets.
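The built-in piece will be NLTK's VADER analyzer, whose entire interface is a single polarity_scores call. A quick taste (assuming the vader_lexicon data has been downloaded):
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# polarity_scores returns a dict of neg/neu/pos/compound scores
print(SentimentIntensityAnalyzer().polarity_scores("I love this!"))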
Let's get started with something simpler, though, to demonstrate an easy positive/negative binary classification.
The two most common ways to gather data from Twitter are:
a. You can collect your own tweets, typically through the Twitter API.
b. You can use an inbuilt dataset that ships with many natural language processing libraries, such as NLTK's twitter_samples corpus.
Ideally, I would create my own dataset tailored to the needs of my research question. Typically, the questions might be:
a) What do people associate a specific brand/company with? Is the opinion positive, negative, or neutral?
b) Did a certain event within the history or timeline of a brand/company impact the sentiment associated with that brand/company?
The question you seek to answer determines the parameters of your data collection: which tweets to gather and over how long a time span, so that you end up with enough data for a reliable and robust analysis.
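I will take route (b) here, but for completeness, a minimal collection sketch for route (a) might look like the snippet below. This is only a sketch: the credentials and the query are placeholders, and it assumes tweepy 4.x, where the standard search endpoint is exposed as api.search_tweets.
import tweepy
# Placeholder credentials - substitute your own from the Twitter developer portal
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
# Collect recent English tweets about a (hypothetical) brand, skipping retweets
query = "SomeBrand -filter:retweets"
collected = [status.full_text for status in
             tweepy.Cursor(api.search_tweets, q=query, lang="en", tweet_mode="extended").items(500)]
print(len(collected))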
# Let's import nltk first, download 'twitter_samples' from nltk, and print out the file ids of the downloaded files
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
print (twitter_samples.fileids())
# Note: importing twitter_samples alone does not fetch the data files; nltk.download ensures they are actually on disk (and is a no-op if they already are)
Let's look at each file separately to see its length (that is, its count of tweets).
# let's import each file to a corresponding variable and check lengths
positive_tweets = twitter_samples.strings('positive_tweets.json')
print (len(positive_tweets))
negative_tweets = twitter_samples.strings('negative_tweets.json')
print (len(negative_tweets))
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets))
Let's take a look at the first 5 tweets in one of the files. I will do this for the negative tweets, but feel free to change the file name to see the first 5 (or however many you want) tweets.
# a for loop that prints the first 5 tweets from the negative_tweets
for tweet in negative_tweets[:5]:
    print (tweet)
I can sometimes relate to that last tweet! I feel you, random stranger!
Now that we see that our tweets are properly imported, we need to tokenize them. Tokenizing is a fancy term for splitting a tweet into its individual words and adding them to a list. The easiest way to do that is to use NLTK's built-in TweetTokenizer class, which does the job fairly well in most cases. You can read more about NLTK's TweetTokenizer in the NLTK documentation.
Essentially, the information in the documentation boils down to this -
You can 'preserve_case' for each tweet or choose not to, in which case the tweets are converted to lowercase. This comes in handy when you do not care about capitalization and want to avoid two otherwise identical words being treated differently because of it.
You can 'strip_handles', which removes the Twitter user handles from the tweets. This is useful, for instance, when you want to anonymize the data.
You can 'reduce_len' to shorten elongated words such as "YESSSSSSSSS", which may otherwise throw off our sentiment analyzer.
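For a quick feel of what these flags do before we apply them to real data, here is a made-up tweet run through the tokenizer; note the handle disappearing, the lowercasing, and the elongated word shrinking to three repeats:
from nltk.tokenize import TweetTokenizer
demo_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
# Expect roughly: ['yesss', 'this', 'rocks', '!', '!', '!']
print(demo_tokenizer.tokenize("@SomeUser YESSSSSSSSS this rocks!!!"))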
Now let's get into it.
# let's use nltk.tokenize to break down our tweets into words and remove the twitter handles associated with each tweet.
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
# Let's tokenize the first 5 tweets to see what's happening behind the scenes
for tweet in negative_tweets[:5]:
    print (tweet_tokenizer.tokenize(tweet))
Well, there are a lot of non-words included in the word list. For instance, you can see commas, colons, and quotation marks treated as words. These add little value to the sentiment analysis we want to perform, so we need to clean the tweets to remove them and any other unwanted information: emoticons, stock market tickers, hyperlinks, hashtags, and punctuation.
In addition, we can also remove 'stop words', words like a, the, and, an, etc. that are not valuable in determining sentiment. Furthermore, we can reduce similar words (write, writing, written) to a single stem word (write) using NLTK's built-in Porter stemming algorithm.
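A small taste of both ideas before we wire them in (the outputs in the comments are what I would expect from NLTK's rule-based Porter stemmer, so treat them as illustrative):
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
print(PorterStemmer().stem('writing'))  # 'write'
print(PorterStemmer().stem('written'))  # likely unchanged - rule-based stemmers miss irregular forms
print(stopwords.words('english')[:5])   # the first few English stopwords, e.g. ['i', 'me', 'my', 'myself', 'we']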
Let's do it!
# Let's get the imports out of the way
import string
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
# The lemmatizer needs the WordNet data, so make sure it is downloaded
nltk.download('wordnet')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
# Likewise, if the stopwords import fails, the line below downloads that corpus directly
nltk.download('stopwords')
# We will only remove English stopwords, given our dataset consists mostly of English tweets. You can play around with other languages depending on your dataset
stopwords_english = stopwords.words('english')
# Let's also store our PorterStemmer instance for ease of use later
stemmer = PorterStemmer()
# Next, we create sets of happy and sad emoticons. I chose to define them separately for convenience, since such lists are readily available online
# First let's create a list of Happy Emoticons
emoticons_happy = set([
':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3'
])
# Now let's create a list of Sad ones
emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';('
])
# Now let's combine all emoticons so they are easy to use later
emoticons = emoticons_happy.union(emoticons_sad)
# One last step to do is to create a lemma for each word
def lemmatize(word):
    lemma = lemmatizer.lemmatize(word, 'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, 'n')
    return lemma
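A quick sanity check of the helper (verb lemma first, noun lemma as the fallback):
print(lemmatize('running'))  # 'run' (found as a verb)
print(lemmatize('feet'))     # 'foot' (no verb lemma, so the noun fallback kicks in)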
# Let's create a simple function that will let us clean the tweets as we need to analyze them
def clean_tweets(tweet):
    # First, let's remove stock market tickers like $GE from each tweet
    tweet = re.sub(r'\$\w*', '', tweet)
    # Next, let's remove the old-style retweet tag "RT" - this might differ depending on your dataset
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Then we remove the hyperlinks because we do not need them
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # The next thing we need to take care of is hashtags. We only strip the hash # sign, keeping the word itself
    tweet = re.sub(r'#', '', tweet)
    # Now that the unwanted characters are removed, let's tokenize each tweet
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    # Finally, let's clean the tokens and append them to a list
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in emoticons and  # remove emoticons
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word) can be used to append words right away, without stemming, if you foresee value in not stemming
            stem_word = stemmer.stem(word)  # stem the word
            lemma_word = lemmatize(stem_word)
            tweets_clean.append(lemma_word)
    return tweets_clean  # Here we return the cleaned tokens
Let's write our own sample tweet to see if it works.
How about something simple like this --> "RT @Twitter @shafeethexenos Hey there! Check out my portfolio at the link below! Have a great day. :) #good #morning http://shaf.codes "
custom_tweet = "RT @Twitter @shafeethexenos Hey there! Check out my portfolio at the link below! Have a great day. :) #good #morning http://shaf.codes "
# Let's print out the cleaned tweet to see if it looks like the one we are expecting.
print (clean_tweets(custom_tweet))
print (positive_tweets[5])
print (clean_tweets(positive_tweets[5]))
# let's start by defining a function that extracts and stores our bag of words
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    # NLTK's classifiers expect a feature dictionary, so map each word to True
    words_dictionary = dict([word, True] for word in words)
    return words_dictionary
# Let's try our new function on our custom tweet from earlier.
print (bag_of_words(custom_tweet))
# positive tweets feature set
positive_tweets_set = []
for tweet in positive_tweets:
    positive_tweets_set.append((bag_of_words(tweet), 'positive'))
# negative tweets feature set
negative_tweets_set = []
for tweet in negative_tweets:
    negative_tweets_set.append((bag_of_words(tweet), 'negative'))
# All tweets feature set
all_tweets_set = []
for tweet in all_tweets:
    all_tweets_set.append((bag_of_words(tweet), 'all_tweets'))
# Let's make sure all of our tweets are converted into their corresponding bags of words
print (len(positive_tweets_set), len(negative_tweets_set), len(all_tweets_set)) # If everything went well, the expected output is: 5000 5000 20000
import pandas as pd
pos_df = pd.DataFrame(positive_tweets_set)
pos_df.head()
positive_df = pd.DataFrame(pos_df[0].values.tolist(), index=pos_df.index)
# positive_df.head()
def get_all_values(d):
    if isinstance(d, dict):
        for v in d.values():
            yield from get_all_values(v)
    elif isinstance(d, list):
        for v in d:
            yield from get_all_values(v)
    else:
        yield d
list(get_all_values(pos_df[0]))
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib
import matplotlib.pyplot as plt
%matplotlib notebook
# Each bag of words is a dict, so convert each one to a plain list of tokens for Word2Vec
model = Word2Vec(pos_df[0].apply(list), min_count=1)
print(model)
# summarize the vocabulary (gensim < 4.0 API; in gensim 4.x use model.wv.key_to_index)
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'true' is in the vocabulary)
print(model.wv['true'])
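Beyond raw vectors, the model can list nearest neighbours in the embedding space. One caveat from our pipeline: the tokens are stemmed, so query the stemmed form ('happy' becomes 'happi'), and guard against out-of-vocabulary words. A sketch:
# nearest neighbours of a (stemmed) token, guarded against missing vocabulary
try:
    print(model.wv.most_similar('happi', topn=5))
except KeyError:
    print("token not in vocabulary")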
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('Tweets')
plt.show()
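The unlabeled scatter is hard to read. Annotating a handful of points with their words makes the clusters interpretable; a sketch, capped at 50 labels to keep it legible:
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
# label the first 50 points with their corresponding words
for i, word in enumerate(words[:50]):
    ax.annotate(word, xy=(result[i, 0], result[i, 1]))
ax.set_title('Tweets (first 50 words labeled)')
plt.show()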
from sklearn.manifold import TSNE
# keyword arguments for clarity: 2 components, perplexity 50, early exaggeration 30
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('Tweets')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
neg_df = pd.DataFrame(negative_tweets_set)
list(get_all_values(neg_df[0]))
# Again, convert each bag of words to a plain list of tokens for Word2Vec
model = Word2Vec(neg_df[0].apply(list), min_count=1)
# summarize the vocabulary
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'sad' is in the vocabulary)
print(model.wv['sad'])
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('Negative Tweets')
plt.show()
from sklearn.manifold import TSNE
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('Negative Tweets post T-SNE')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
all_tweets_df = pd.DataFrame(all_tweets_set)
list(get_all_values(all_tweets_df[0]))
model = Word2Vec(all_tweets_df[0].apply(list), min_count=1)
# summarize the vocabulary
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'sad' is in the vocabulary)
print(model.wv['sad'])
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('All Tweets')
plt.show()
from sklearn.manifold import TSNE
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('All Tweets post T-SNE')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
The training data is used to train our model, and the testing data shows how well the trained model performs on data it has never seen. This ensures the model is not simply memorizing individual tweets and regurgitating what it memorized.
# We could assign a fixed subset of our data as the test set (usually the smaller subset). However, this might yield biased results
# The bias comes from a lack of certainty in our results: what if we get a high accuracy just because of how we happened to divide the data, making it easier for our model to predict?
# Thus, we randomize positive_tweets_set and negative_tweets_set, which, in turn, will output a different accuracy result every time we run the program
from random import shuffle
shuffle(positive_tweets_set)
shuffle(negative_tweets_set)
test_set = positive_tweets_set[:1000] + negative_tweets_set[:1000]
train_set = positive_tweets_set[1000:] + negative_tweets_set[1000:]
print(len(test_set), len(train_set)) # Output: 2000 8000
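Equivalently, scikit-learn can shuffle and split in one call; a sketch, with stratify preserving the 50/50 class balance:
from sklearn.model_selection import train_test_split
data = positive_tweets_set + negative_tweets_set
labels = [label for _, label in data]
# train_test_split shuffles by default; stratify keeps the label ratio identical in both splits
train_set2, test_set2 = train_test_split(data, test_size=0.2, stratify=labels)
print(len(test_set2), len(train_set2))  # 2000 8000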
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy)
classifier.show_most_informative_features(10) # this method prints directly and returns None, so no print() wrapper is needed
# Let's try this out on a different custom_tweet. How about something really negative like the one below?
custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: negative
# Now that we see the tweet is correctly classified as negative, let's look at a more detailed output
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: negative
print (prob_result.prob("negative"))
print (prob_result.prob("positive"))
# How about on something more of a positive tone?
custom_tweet = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: positive
# Positive tweet correctly classified as positive
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result)
print (prob_result.max())
print (prob_result.prob("negative"))
print (prob_result.prob("positive"))
# Now that we have our basic accuracy numbers, let us derive a confusion matrix.
# To accomplish that, we need to record the positive and negative predictions and compare them with the actual labels
from collections import defaultdict
actual_set = defaultdict(set)
predicted_set = defaultdict(set)
actual_set_cm = []
predicted_set_cm = []
for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)
    predicted_label = classifier.classify(feature)
    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)
from nltk.metrics import precision, recall, f_measure, ConfusionMatrix
print ('positive precision:', precision(actual_set['positive'], predicted_set['positive']))
print ('positive recall:', recall(actual_set['positive'], predicted_set['positive']))
print ('positive F-measure:', f_measure(actual_set['positive'], predicted_set['positive']))
print ('negative precision:', precision(actual_set['negative'], predicted_set['negative']))
print ('negative recall:', recall(actual_set['negative'], predicted_set['negative']))
print ('negative F-measure:', f_measure(actual_set['negative'], predicted_set['negative']))
# The last step is to build the actual confusion matrix for the test set
cm = ConfusionMatrix(actual_set_cm, predicted_set_cm)
print (cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
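As a cross-check, scikit-learn can compute the same metrics directly from the two label lists we just built:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(actual_set_cm, predicted_set_cm))
print(confusion_matrix(actual_set_cm, predicted_set_cm))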
for tweet in all_tweets[:5]:
    print (tweet)
all_tweets_df.head()
# Stringifying the bags (e.g. all_tweets_df.applymap(str)) leaves ': True' dict notation
# after every word, so instead let's flatten each bag of words back into a plain
# space-separated string of words
all_tweets_df['text'] = all_tweets_df[0].apply(lambda bag: ' '.join(bag.keys()))
all_tweets_df['text'].head()
# from wordcloud import WordCloud, STOPWORDS
# def wordcloud(tweets):
# stopwords = set(STOPWORDS)
# wordcloud = WordCloud(background_color="white",stopwords=stopwords,random_state = 2016).generate(tweets)
# plt.figure( figsize=(20,10), facecolor='k')
# plt.imshow(wordcloud)
# plt.axis("off")
# plt.title("All Tweets")
# wordcloud(all_tweets_df[0])
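The block above is commented out, likely because WordCloud.generate expects one big string rather than a Series of bag-of-words dicts. A working sketch under that assumption, flattening the bags first:
from wordcloud import WordCloud
# flatten every bag of words into one space-separated string
all_words = ' '.join(' '.join(bag.keys()) for bag, _ in all_tweets_set)
wc = WordCloud(background_color="white", random_state=2016).generate(all_words)
plt.figure(figsize=(20, 10), facecolor='k')
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("All Tweets")
plt.show()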
# Now for the three-way classification. NLTK's VADER SentimentIntensityAnalyzer works on plain text,
# so let's put the lemmatized tweet text into a DataFrame first
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
tweets = pd.DataFrame({'text_lem': [' '.join(clean_tweets(t)) for t in all_tweets]})
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=10, stop_words='english', use_idf=True)
X = vectorizer.fit_transform(tweets['text_lem'])
sid = SentimentIntensityAnalyzer()
tweets['sentiment_compound_polarity'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['compound'])
tweets['sentiment_neutral'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['neu'])
tweets['sentiment_negative'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['neg'])
tweets['sentiment_pos'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['pos'])
tweets['sentiment_type'] = ''
tweets.loc[tweets.sentiment_compound_polarity > 0, 'sentiment_type'] = 'POSITIVE'
tweets.loc[tweets.sentiment_compound_polarity == 0, 'sentiment_type'] = 'NEUTRAL'
tweets.loc[tweets.sentiment_compound_polarity < 0, 'sentiment_type'] = 'NEGATIVE'
tweets_sentiment = tweets.groupby(['sentiment_type'])['sentiment_neutral'].count()
tweets_sentiment.rename("",inplace=True)
explode = (1, 0, 0)
plt.subplot(221)
tweets_sentiment.transpose().plot(kind='barh',figsize=(20, 20))
plt.title('Sentiment Analysis 1', bbox={'facecolor':'0.8', 'pad':0})
plt.subplot(222)
tweets_sentiment.plot(kind='pie',figsize=(20, 20),autopct='%1.1f%%',shadow=True,explode=explode)
plt.legend(bbox_to_anchor=(1, 1), loc=3, borderaxespad=0.)
plt.title('Sentiment Analysis 2', bbox={'facecolor':'0.8', 'pad':0})
plt.show()