I will run a three-way sentiment analysis on tweets to classify them as positive, negative, or neutral, using NLTK's built-in sentiment analysis tools as well as a custom script to analyze the tweets.
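The built-in piece will be NLTK's VADER analyzer, whose entire interface is a single polarity_scores call. A quick taste (assuming the vader_lexicon data has been downloaded):
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# polarity_scores returns a dict of neg/neu/pos/compound scores
print(SentimentIntensityAnalyzer().polarity_scores("I love this!"))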
Let's get started with something simpler, though, to demonstrate an easy positive/negative binary classification.
The two most common ways to gather data from Twitter are:
a. You can collect your own tweets, typically through the Twitter API.
b. You can use an inbuilt dataset that ships with many natural language processing libraries, such as NLTK's twitter_samples corpus.
Ideally, I would create my own dataset tailored to the needs of my research question. Typically, the questions might be:
a) What do people associate a specific brand/company with? Is the opinion positive, negative, or neutral?
b) Did a certain event within the history or timeline of a brand/company impact the sentiment associated with that brand/company?
The question you seek to answer determines the parameters of your data collection: which tweets to gather and over how long a time span, so that you end up with enough data for a reliable and robust analysis.
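I will take route (b) here, but for completeness, a minimal collection sketch for route (a) might look like the snippet below. This is only a sketch: the credentials and the query are placeholders, and it assumes tweepy 4.x, where the standard search endpoint is exposed as api.search_tweets.
import tweepy
# Placeholder credentials - substitute your own from the Twitter developer portal
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
# Collect recent English tweets about a (hypothetical) brand, skipping retweets
query = "SomeBrand -filter:retweets"
collected = [status.full_text for status in
             tweepy.Cursor(api.search_tweets, q=query, lang="en", tweet_mode="extended").items(500)]
print(len(collected))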
# Let's import nltk first, download 'twitter_samples' from nltk, and print out the file ids of the downloaded files
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
print (twitter_samples.fileids())
# Note: importing twitter_samples alone does not fetch the data files; nltk.download ensures they are actually on disk (and is a no-op if they already are)
Let's look at each file separately to see its length (that is, its count of tweets).
# let's import each file to a corresponding variable and check lengths
positive_tweets = twitter_samples.strings('positive_tweets.json')
print (len(positive_tweets))
negative_tweets = twitter_samples.strings('negative_tweets.json')
print (len(negative_tweets))
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets))
Let's take a look at the first 5 tweets in one of the files. I will do this for the negative tweets, but feel free to change the file name to see the first 5 (or however many you want) tweets.
# a for loop that prints the first 5 tweets from the negative_tweets
for tweet in negative_tweets[:5]:
    print (tweet)
I can sometimes relate to that last tweet! I feel you, random stranger!
Now that we see that our tweets are properly imported, we need to tokenize them. Tokenizing is a fancy term for splitting a tweet into its individual words and adding them to a list. The easiest way to do that is to use NLTK's built-in TweetTokenizer class, which does the job fairly well in most cases. You can read more about NLTK's TweetTokenizer in the NLTK documentation.
Essentially, the information in the documentation boils down to this -
You can 'preserve_case' for each tweet or choose not to, in which case the tweets are converted to lowercase. This comes in handy when you do not care about capitalization and want to avoid two otherwise identical words being treated differently because of it.
You can 'strip_handles', which removes the Twitter user handles from the tweets. This is useful, for instance, when you want to anonymize the data.
You can 'reduce_len' to shorten elongated words such as "YESSSSSSSSS", which may otherwise throw off our sentiment analyzer.
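For a quick feel of what these flags do before we apply them to real data, here is a made-up tweet run through the tokenizer; note the handle disappearing, the lowercasing, and the elongated word shrinking to three repeats:
from nltk.tokenize import TweetTokenizer
demo_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
# Expect roughly: ['yesss', 'this', 'rocks', '!', '!', '!']
print(demo_tokenizer.tokenize("@SomeUser YESSSSSSSSS this rocks!!!"))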
Now let's get into it.
# let's use nltk.tokenize to break down our tweets into words and remove the twitter handles associated with each tweet.
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
# Let's tokenize the first 5 tweets to see what's happening behind the scenes
for tweet in negative_tweets[:5]:
    print (tweet_tokenizer.tokenize(tweet))
Well, there are a lot of non-words included in the word list. For instance, you can see commas, colons, and quotation marks treated as words. These add little value to the sentiment analysis we want to perform, so we need to clean the tweets to remove them and any other unwanted information: emoticons, stock market tickers, hyperlinks, hashtags, and punctuation.
In addition, we can also remove 'stop words', words like a, the, and, an, etc. that are not valuable in determining sentiment. Furthermore, we can reduce similar words (write, writing, written) to a single stem word (write) using NLTK's built-in Porter stemming algorithm.
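A small taste of both ideas before we wire them in (the outputs in the comments are what I would expect from NLTK's rule-based Porter stemmer, so treat them as illustrative):
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
print(PorterStemmer().stem('writing'))  # 'write'
print(PorterStemmer().stem('written'))  # likely unchanged - rule-based stemmers miss irregular forms
print(stopwords.words('english')[:5])   # the first few English stopwords, e.g. ['i', 'me', 'my', 'myself', 'we']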
Let's do it!
# Let's get the imports out of the way
import string
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
# The lemmatizer needs the WordNet data, so make sure it is downloaded
nltk.download('wordnet')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
# Likewise, if the stopwords import fails, the line below downloads that corpus directly
nltk.download('stopwords')
# We will only remove English stopwords, given our dataset consists mostly of English tweets. You can play around with other languages depending on your dataset
stopwords_english = stopwords.words('english')
# Let's also store our PorterStemmer instance for ease of use later
stemmer = PorterStemmer()
# Next, we create sets of happy and sad emoticons. I chose to define them separately for convenience, since such lists are readily available online
# First let's create a list of Happy Emoticons
emoticons_happy = set([
':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3'
])
# Now let's create a list of Sad ones
emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';('
])
# Now let's combine all emoticons so they are easy to use later
emoticons = emoticons_happy.union(emoticons_sad)
# One last step to do is to create a lemma for each word
def lemmatize(word):
    lemma = lemmatizer.lemmatize(word, 'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, 'n')
    return lemma
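A quick sanity check of the helper (verb lemma first, noun lemma as the fallback):
print(lemmatize('running'))  # 'run' (found as a verb)
print(lemmatize('feet'))     # 'foot' (no verb lemma, so the noun fallback kicks in)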
# Let's create a simple function that will let us clean the tweets as we need to analyze them
def clean_tweets(tweet):
    # First, let's remove stock market tickers like $GE from each tweet
    tweet = re.sub(r'\$\w*', '', tweet)
    # Next, let's remove the old-style retweet tag "RT" - this might differ depending on your dataset
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Then we remove the hyperlinks because we do not need them
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # The next thing we need to take care of is hashtags. We only strip the hash # sign, keeping the word itself
    tweet = re.sub(r'#', '', tweet)
    # Now that the unwanted characters are removed, let's tokenize each tweet
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    # Finally, let's clean the tokens and append them to a list
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in emoticons and  # remove emoticons
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word) can be used to append words right away, without stemming, if you foresee value in not stemming
            stem_word = stemmer.stem(word)  # stem the word
            lemma_word = lemmatize(stem_word)
            tweets_clean.append(lemma_word)
    return tweets_clean  # Here we return the cleaned tokens
Let's write our own sample tweet to see if it works.
How about something simple like this --> "RT @Twitter @shafeethexenos Hey there! Check out my portfolio at the link below! Have a great day. :) #good #morning http://shaf.codes "
custom_tweet = "RT @Twitter @shafeethexenos Hey there! Check out my portfolio at the link below! Have a great day. :) #good #morning http://shaf.codes "
# Let's print out the cleaned tweet to see if it looks like the one we are expecting.
print (clean_tweets(custom_tweet))
print (positive_tweets[5])
print (clean_tweets(positive_tweets[5]))
# let's start by defining a function that extracts and stores our bag of words
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    # NLTK's classifiers expect a feature dictionary, so map each word to True
    words_dictionary = dict([word, True] for word in words)
    return words_dictionary
# Let's try our new function on our custom tweet from earlier.
print (bag_of_words(custom_tweet))
# positive tweets feature set
positive_tweets_set = []
for tweet in positive_tweets:
    positive_tweets_set.append((bag_of_words(tweet), 'positive'))
# negative tweets feature set
negative_tweets_set = []
for tweet in negative_tweets:
    negative_tweets_set.append((bag_of_words(tweet), 'negative'))
# All tweets feature set
all_tweets_set = []
for tweet in all_tweets:
    all_tweets_set.append((bag_of_words(tweet), 'all_tweets'))
# Let's make sure all of our tweets are converted into their corresponding bags of words
print (len(positive_tweets_set), len(negative_tweets_set), len(all_tweets_set)) # If everything went well, the expected output is: 5000 5000 20000
import pandas as pd
pos_df = pd.DataFrame(positive_tweets_set)
pos_df.head()
positive_df = pd.DataFrame(pos_df[0].values.tolist(), index=pos_df.index)
# positive_df.head()
def get_all_values(d):
    if isinstance(d, dict):
        for v in d.values():
            yield from get_all_values(v)
    elif isinstance(d, list):
        for v in d:
            yield from get_all_values(v)
    else:
        yield d
list(get_all_values(pos_df[0]))
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib
import matplotlib.pyplot as plt
%matplotlib notebook
# Each bag of words is a dict, so convert each one to a plain list of tokens for Word2Vec
model = Word2Vec(pos_df[0].apply(list), min_count=1)
print(model)
# summarize the vocabulary (gensim < 4.0 API; in gensim 4.x use model.wv.key_to_index)
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'true' is in the vocabulary)
print(model.wv['true'])
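Beyond raw vectors, the model can list nearest neighbours in the embedding space. One caveat from our pipeline: the tokens are stemmed, so query the stemmed form ('happy' becomes 'happi'), and guard against out-of-vocabulary words. A sketch:
# nearest neighbours of a (stemmed) token, guarded against missing vocabulary
try:
    print(model.wv.most_similar('happi', topn=5))
except KeyError:
    print("token not in vocabulary")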
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('Tweets')
plt.show()
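The unlabeled scatter is hard to read. Annotating a handful of points with their words makes the clusters interpretable; a sketch, capped at 50 labels to keep it legible:
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
# label the first 50 points with their corresponding words
for i, word in enumerate(words[:50]):
    ax.annotate(word, xy=(result[i, 0], result[i, 1]))
ax.set_title('Tweets (first 50 words labeled)')
plt.show()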
from sklearn.manifold import TSNE
# keyword arguments for clarity: 2 components, perplexity 50, early exaggeration 30
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('Tweets')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
neg_df = pd.DataFrame(negative_tweets_set)
list(get_all_values(neg_df[0]))
# Again, convert each bag of words to a plain list of tokens for Word2Vec
model = Word2Vec(neg_df[0].apply(list), min_count=1)
# summarize the vocabulary
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'sad' is in the vocabulary)
print(model.wv['sad'])
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('Negative Tweets')
plt.show()
from sklearn.manifold import TSNE
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('Negative Tweets post T-SNE')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
all_tweets_df = pd.DataFrame(all_tweets_set)
list(get_all_values(all_tweets_df[0]))
model = Word2Vec(all_tweets_df[0].apply(list), min_count=1)
# summarize the vocabulary
words = list(model.wv.vocab)
print(words[0:5])
# access the vector for one word, for reference (assuming 'sad' is in the vocabulary)
print(model.wv['sad'])
# fit a 2d PCA model to the word vectors
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a plot of the projection
fig, ax = plt.subplots()
ax.plot(result[:, 0], result[:, 1], 'o')
ax.set_title('All Tweets')
plt.show()
from sklearn.manifold import TSNE
Y = TSNE(n_components=2, perplexity=50, early_exaggeration=30.0)
tsne_results = Y.fit_transform(X)
x=tsne_results[:,0]
y=tsne_results[:,1]
#Plot the t-SNE output
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_title('All Tweets post T-SNE')
ax.set_yticklabels([]) #Hide ticks
ax.set_xticklabels([]) #Hide ticks
plt.show()
The training data is used to train our model, and the testing data shows how well the trained model performs on data it has never seen. This ensures the model is not simply memorizing individual tweets and regurgitating what it memorized.
# We could assign a fixed subset of our data as the test set (usually the smaller subset). However, this might yield biased results
# The bias comes from a lack of certainty in our results: what if we get a high accuracy just because of how we happened to divide the data, making it easier for our model to predict?
# Thus, we randomize positive_tweets_set and negative_tweets_set, which, in turn, will output a different accuracy result every time we run the program
from random import shuffle
shuffle(positive_tweets_set)
shuffle(negative_tweets_set)
test_set = positive_tweets_set[:1000] + negative_tweets_set[:1000]
train_set = positive_tweets_set[1000:] + negative_tweets_set[1000:]
print(len(test_set), len(train_set)) # Output: 2000 8000
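Equivalently, scikit-learn can shuffle and split in one call; a sketch, with stratify preserving the 50/50 class balance:
from sklearn.model_selection import train_test_split
data = positive_tweets_set + negative_tweets_set
labels = [label for _, label in data]
# train_test_split shuffles by default; stratify keeps the label ratio identical in both splits
train_set2, test_set2 = train_test_split(data, test_size=0.2, stratify=labels)
print(len(test_set2), len(train_set2))  # 2000 8000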
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy)
classifier.show_most_informative_features(10) # this method prints directly and returns None, so no print() wrapper is needed
# Let's try this out on a different custom_tweet. How about something really negative like the one below?
custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: negative
# Now that we see the tweet is correctly classified as negative, let's look at a more detailed output
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: negative
print (prob_result.prob("negative"))
print (prob_result.prob("positive"))
# How about on something more of a positive tone?
custom_tweet = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: positive
# Positive tweet correctly classified as positive
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result)
print (prob_result.max())
print (prob_result.prob("negative"))
print (prob_result.prob("positive"))
# Now that we have our basic accuracy numbers, let us derive a confusion matrix.
# To accomplish that, we need to record the positive and negative predictions and compare them with the actual labels
from collections import defaultdict
actual_set = defaultdict(set)
predicted_set = defaultdict(set)
actual_set_cm = []
predicted_set_cm = []
for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)
    predicted_label = classifier.classify(feature)
    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)
from nltk.metrics import precision, recall, f_measure, ConfusionMatrix
print ('positive precision:', precision(actual_set['positive'], predicted_set['positive']))
print ('positive recall:', recall(actual_set['positive'], predicted_set['positive']))
print ('positive F-measure:', f_measure(actual_set['positive'], predicted_set['positive']))
print ('negative precision:', precision(actual_set['negative'], predicted_set['negative']))
print ('negative recall:', recall(actual_set['negative'], predicted_set['negative']))
print ('negative F-measure:', f_measure(actual_set['negative'], predicted_set['negative']))
# The last step is to build the actual confusion matrix for the test set
cm = ConfusionMatrix(actual_set_cm, predicted_set_cm)
print (cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
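As a cross-check, scikit-learn can compute the same metrics directly from the two label lists we just built:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(actual_set_cm, predicted_set_cm))
print(confusion_matrix(actual_set_cm, predicted_set_cm))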
for tweet in all_tweets[:5]:
    print (tweet)
all_tweets_df.head()
# Stringifying the bags (e.g. all_tweets_df.applymap(str)) leaves ': True' dict notation
# after every word, so instead let's flatten each bag of words back into a plain
# space-separated string of words
all_tweets_df['text'] = all_tweets_df[0].apply(lambda bag: ' '.join(bag.keys()))
all_tweets_df['text'].head()
# from wordcloud import WordCloud, STOPWORDS
# def wordcloud(tweets):
# stopwords = set(STOPWORDS)
# wordcloud = WordCloud(background_color="white",stopwords=stopwords,random_state = 2016).generate(tweets)
# plt.figure( figsize=(20,10), facecolor='k')
# plt.imshow(wordcloud)
# plt.axis("off")
# plt.title("All Tweets")
# wordcloud(all_tweets_df[0])
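The block above is commented out, likely because WordCloud.generate expects one big string rather than a Series of bag-of-words dicts. A working sketch under that assumption, flattening the bags first:
from wordcloud import WordCloud
# flatten every bag of words into one space-separated string
all_words = ' '.join(' '.join(bag.keys()) for bag, _ in all_tweets_set)
wc = WordCloud(background_color="white", random_state=2016).generate(all_words)
plt.figure(figsize=(20, 10), facecolor='k')
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("All Tweets")
plt.show()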
# Now for the three-way classification. NLTK's VADER SentimentIntensityAnalyzer works on plain text,
# so let's put the lemmatized tweet text into a DataFrame first
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
tweets = pd.DataFrame({'text_lem': [' '.join(clean_tweets(t)) for t in all_tweets]})
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=10, stop_words='english', use_idf=True)
X = vectorizer.fit_transform(tweets['text_lem'])
sid = SentimentIntensityAnalyzer()
tweets['sentiment_compound_polarity'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['compound'])
tweets['sentiment_neutral'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['neu'])
tweets['sentiment_negative'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['neg'])
tweets['sentiment_pos'] = tweets.text_lem.apply(lambda x: sid.polarity_scores(x)['pos'])
tweets['sentiment_type'] = ''
tweets.loc[tweets.sentiment_compound_polarity > 0, 'sentiment_type'] = 'POSITIVE'
tweets.loc[tweets.sentiment_compound_polarity == 0, 'sentiment_type'] = 'NEUTRAL'
tweets.loc[tweets.sentiment_compound_polarity < 0, 'sentiment_type'] = 'NEGATIVE'
tweets_sentiment = tweets.groupby(['sentiment_type'])['sentiment_neutral'].count()
tweets_sentiment.rename("",inplace=True)
explode = (1, 0, 0)
plt.subplot(221)
tweets_sentiment.transpose().plot(kind='barh',figsize=(20, 20))
plt.title('Sentiment Analysis 1', bbox={'facecolor':'0.8', 'pad':0})
plt.subplot(222)
tweets_sentiment.plot(kind='pie',figsize=(20, 20),autopct='%1.1f%%',shadow=True,explode=explode)
plt.legend(bbox_to_anchor=(1, 1), loc=3, borderaxespad=0.)
plt.title('Sentiment Analysis 2', bbox={'facecolor':'0.8', 'pad':0})
plt.show()