Question

Posted on Apr 06, 2022

Sentiment classification of tweets helps people understand public opinion. Here, we will implement a Naive Bayes classifier to classify the sentiment polarity of tweets. In class, we discussed the case with only positive and negative polarity; you can generalize that to a classification problem with three classes: positive, negative, and neutral. Neutral means that the tweet does not exhibit a subjective positive or negative polarity and tends to be objective, such as “I did the laundry on the weekend.” We’ve released the training data in “Q2/train_tweets.txt”. Each line holds one tweet record, with the fields “tweetID”, “userID”, “sentiment” (positive, neutral, or negative), and “text” separated by tabs (\t).

(a) Read in the training data and create a dataframe named “tweets” with four columns to hold the data. Each row corresponds to a tweet record and each column to an attribute. Name the columns “tweetID”, “userID”, “sentiment”, and “text” accordingly. (10’)
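A minimal sketch of how part (a) could be approached with base R's `read.delim`. The two sample lines and the temporary file are stand-ins (assumptions) so the snippet runs on its own; in practice `path` would point at the released training file.

```r
# Stand-in for the training file: two tab-separated records written to a temp file.
sample_lines <- c(
  "264183816548130816\t15140428\tpositive\tGas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)",
  "264169034155696130\t382403760\tneutral\tIm bringing the monster load of candy tomorrow, I just hope it doesn't get all squiched"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

tweets <- read.delim(
  path,
  header = FALSE,                 # the file has no header row
  sep = "\t",
  quote = "",                     # tweets contain quote characters; do not parse them as quoting
  col.names = c("tweetID", "userID", "sentiment", "text"),
  colClasses = "character",       # 18-digit IDs would lose precision if read as numbers
  stringsAsFactors = FALSE
)

str(tweets)
```

Reading every column as character is a design choice here: the tweet IDs are too long to be stored exactly as doubles, and the sentiment column can be converted to a factor later if needed.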

(b) Use the external R library “tokenizers” to tokenize the text of each tweet with the function “tokenize_words(·)”. Create a fifth column in the “tweets” dataframe named “tokens” to hold the tokenized text of each tweet record. For the i-th row of the “tweets” dataframe, the “text” column holds the raw text of the i-th tweet and the “tokens” column holds the tokens returned by “tokenize_words(·)”. (10’)
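One way part (b) might look, shown on a two-row toy dataframe (an assumption, not the real data). `tokenize_words` returns a list with one character vector per input string, so it can be stored as a list column. The base-R fallback is only a rough stand-in so the sketch still runs where the tokenizers package is not installed; it is not equivalent to `tokenize_words`.

```r
# Toy dataframe standing in for the one built in part (a).
tweets <- data.frame(
  text = c("I love the banner!", "Dallas ain't winning a SuperBowl."),
  stringsAsFactors = FALSE
)

if (requireNamespace("tokenizers", quietly = TRUE)) {
  # One character vector of lowercased word tokens per tweet.
  tweets$tokens <- tokenizers::tokenize_words(tweets$text)
} else {
  # Rough fallback (assumption): lowercase, replace punctuation with spaces, split on whitespace.
  cleaned <- trimws(gsub("[^[:alnum:]' ]+", " ", tolower(tweets$text)))
  tweets$tokens <- strsplit(cleaned, "\\s+")
}

print(tweets$tokens[[1]])
```

Assigning with `tweets$tokens <- ...` keeps the list structure, so row i of “tokens” lines up with row i of “text”.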

(c) Create a new vector named “vocab” that holds the distinct tokens appearing in more than 3 tweets of the training data. In other words, each token in “vocab” should appear in > 3 tweets, as measured by the “tokens” column of the “tweets” dataframe, and “vocab” should contain no duplicate tokens. Print the number of tokens in “vocab” on the screen. (10’)
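The vocabulary filter in part (c) counts tweets, not occurrences, so a token repeated inside one tweet should count once. A small sketch under that reading, using a hypothetical five-tweet token list in place of the “tokens” column:

```r
# Hypothetical token lists, one per tweet.
tokens <- list(
  c("good", "day", "good"),   # "good" twice in one tweet still counts as one tweet
  c("good", "game"),
  c("good", "day"),
  c("good", "bad", "day"),
  c("bad", "day")
)

# Deduplicate within each tweet, then count how many tweets contain each token.
doc_freq <- table(unlist(lapply(tokens, unique)))

# Keep tokens that appear in more than 3 tweets; names(table) are already distinct.
vocab <- names(doc_freq)[doc_freq > 3]

print(length(vocab))  # prints [1] 2
```

Here “good” and “day” each occur in 4 tweets and are kept, while “bad” (2 tweets) and “game” (1 tweet) are dropped.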

(d) Build a Naive Bayes classifier that estimates the likelihood of observing each word conditioned on the “positive”, “neutral”, and “negative” sentiments, and the prior of each sentiment in the training data. Use your model to predict the sentiment label of the following three tweets and print the results on the screen:

〈i〉 “I love the banner that was unfurled in the United end last night. It read: Chelsea - Standing up against racism since Sunday”

〈ii〉 “So Clattenburg’s alleged racism may mean end of his career; Terry, Suarez, Rio use it and can’t play for a couple of weeks?”

〈iii〉 “In our busy lives in Dubai could we just spare a moment of silence this Friday morning for the people who still wear crocs.”

Remember to tokenize the text of the above three tweets before predicting their sentiment. Use “add-1” smoothing for the likelihood estimates and work in log space when building and applying the model. (20’)
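A hedged sketch of the model in part (d), on a hypothetical toy training set rather than the released data. It assumes a multinomial Naive Bayes with add-1 smoothing, log P(w | c) = log((count(w, c) + 1) / (tokens in c + |V|)), and it skips out-of-vocabulary tokens at prediction time, which is one common choice and not something the assignment specifies. The names `train_tokens`, `train_labels`, and `predict_sentiment` are illustrative.

```r
# Hypothetical toy training data: one token vector per tweet, plus its label.
train_tokens <- list(
  c("love", "this", "game"),
  c("great", "game"),
  c("hate", "this", "game"),
  c("awful", "day"),
  c("game", "is", "on", "saturday")
)
train_labels <- c("positive", "positive", "negative", "negative", "neutral")
vocab <- unique(unlist(train_tokens))  # toy vocabulary; the assignment uses the > 3 tweets filter
classes <- sort(unique(train_labels))

# Log prior: log of the share of training tweets in each class.
log_prior <- log(vapply(classes, function(cl) mean(train_labels == cl), numeric(1)))

# Log likelihood with add-1 smoothing; rows are vocabulary tokens, columns are classes.
log_lik <- vapply(classes, function(cl) {
  toks <- unlist(train_tokens[train_labels == cl])
  counts <- as.numeric(table(factor(toks, levels = vocab)))  # tokens outside vocab are dropped
  log((counts + 1) / (sum(counts) + length(vocab)))
}, numeric(length(vocab)))
rownames(log_lik) <- vocab

# Score = log prior + sum of log likelihoods of the known tokens; pick the best class.
predict_sentiment <- function(tokens) {
  known <- tokens[tokens %in% vocab]
  scores <- log_prior + colSums(log_lik[known, , drop = FALSE])
  names(which.max(scores))
}

print(predict_sentiment(c("love", "great", "game")))
print(predict_sentiment(c("awful", "hate")))
```

Summing logs instead of multiplying probabilities avoids numeric underflow on longer tweets. On real input, each tweet's text would first be tokenized the same way as the training data before being passed to `predict_sentiment`.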

data

264183816548130816 15140428 positive Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)
263405084770172928 591166521 negative Theo Walcott is still shit, watch Rafa and Johnny deal with him on Saturday.
264249301910310912 18516728 negative Iranian general says Israel's Iron Dome can't deal with their missiles (keep talking like that and we may end up finding out)
264105751826538497 147088367 positive with J Davlar 11th. Main rivals are team Poland. Hopefully we an make it a successful end to a tough week of training tomorrow.
264094586689953794 332474633 negative Talking about ACT's && SAT's, deciding where I want to go to college, applying to colleges and everything about college stresses me out.
254941790757601280 557103111 negative They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner. @S4NYC @RasmussenPoll
264169034155696130 382403760 neutral Im bringing the monster load of candy tomorrow, I just hope it doesn't get all squiched