Question

Posted on Sep 09, 2024

Sentiment Classification with Naïve Bayes*

In this assignment, you will build a naïve Bayes classifier for sentiment classification. We define sentiment classification as a task with two classes: positive and negative. Our data set consists of airline reviews. The zip directory for the data contains training and test datasets, where each file contains one airline review tweet. You will build the model using the training data and evaluate it with the test data. The training data and the test data each contain 4182 reviews. You will have to build the system from scratch (e.g., numpy). Do not use any existing libraries (e.g., scikit-learn).

1. Build your naive Bayes classifier

Create your Vocabulary: Read the complete training data word by word and create the vocabulary V for the corpus. You must not include the test set in this process. Remove any markup tags, e.g., HTML tags, from the data. Lowercase capitalized words (i.e., words that start with a capital letter) but not all-capital words (e.g., USA).

Keep all stop words. Create 2 versions of V: with stemming and without stemming. You can use appropriate tools in nltk to stem. Tokenize at white space and also at each punctuation. In other words, "child's" consists of two tokens, "child" and "'s", and "home." consists of two tokens, "home" and ".".
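The normalization and tokenization rules above can be sketched as follows. This is only an illustration under stated assumptions, not the assignment's reference solution: the function names and the two regular expressions are hypothetical, markup is stripped with a crude `<...>` pattern, a single capital letter such as "I" is treated as a capitalized word, and an apostrophe followed by letters is kept as one token so that "child's" yields "child" and "'s". The optional `stem` argument is any callable that maps a token to its stem, for example the `stem` method of `nltk.stem.PorterStemmer`.

```python
import re

# Assumption: anything between "<" and ">" counts as a markup tag.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a token is an apostrophe clitic ('s), a run of word
# characters, or a single punctuation character.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def normalize(token):
    """Lowercase capitalized words, but leave all-capital words (USA) alone."""
    if len(token) > 1 and token.isupper():
        return token
    if token[0].isupper():
        return token.lower()
    return token


def tokenize(text, stem=None):
    """Strip tags, split at white space and punctuation, normalize case."""
    text = TAG_RE.sub(" ", text)
    tokens = [normalize(t) for t in TOKEN_RE.findall(text)]
    if stem is not None:
        # Build the stemmed version of V by passing a stemming callable.
        tokens = [stem(t) for t in tokens]
    return tokens


def build_vocabulary(train_docs, stem=None):
    """Map each token seen in the TRAINING documents to a column index."""
    vocab = {}
    for doc in train_docs:
        for tok in tokenize(doc, stem):
            vocab.setdefault(tok, len(vocab))
    return vocab
```

Note that this pattern splits an emoticon such as ":)" into two punctuation tokens, so emoticon handling would need an extra rule or a dedicated tokenizer. It is also worth checking whether the chosen stemmer lowercases its input, since that would undo the all-capitals rule.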

Consider emoticons in this process. You can use an emoticon tokenizer, if you so choose. If yes, specify which one.

Extract Features: Convert documents to vectors using the Bag of Words (BoW) representation. Do this in two ways: keeping a frequency count, where each word is represented by its count in each document, and keeping a binary representation that only keeps track of the presence (or absence) of a word in a document.

Training: Calculate the prior for each class and the likelihood for each word given the class, P(word | class).
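A minimal numpy sketch of the feature extraction and training steps described above is shown below. The function names are hypothetical, documents are assumed to be already tokenized, and `vocab` is assumed to map each training token to a column index. The assignment text does not mention smoothing; the add-alpha smoothing used here (`alpha=1.0`) is an assumption added so that no likelihood is zero.

```python
import numpy as np


def vectorize(token_docs, vocab, binary=False):
    """Build a dense (n_docs, |V|) Bag-of-Words matrix from tokenized docs."""
    X = np.zeros((len(token_docs), len(vocab)), dtype=np.int32)
    for row, tokens in enumerate(token_docs):
        for tok in tokens:
            col = vocab.get(tok)  # None for words that are not in V
            if col is not None:
                X[row, col] += 1
    # Binary representation: 1 if the word occurs at all, else 0.
    return np.minimum(X, 1) if binary else X


def train_naive_bayes(X, y, alpha=1.0):
    """Return classes, log priors, and log likelihoods log P(word | class).

    alpha is an assumed add-alpha smoothing constant, not part of the
    assignment text.
    """
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes], dtype=float)
    smoothed = counts + alpha
    log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood
```

Because `vectorize` skips tokens that are missing from `vocab`, words that appear only in the test set are ignored automatically. Working in log space avoids numerical underflow when many word probabilities are multiplied.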

Note that:

1. Ignore any words that appear in the test set but not the training set.

2. If you want to experiment with different stemmers or other aspects of the input features, you must do so on the training set, through cross-validation. You must not do such preliminary evaluations on test data. When you have finalized your system, features, and parameters, you can evaluate on test data.

Evaluation:

1. Compute the most likely class for each document in the test set using each of the combinations of stemming + frequency count, stemming + binary, no-stemming + frequency count, no-stemming + binary.

2. Compute and report accuracy. Accuracy is the number of correctly classified reviews / the number of all reviews in the test set.

3. Create a confusion matrix for each classifier. Save your results in a .txt or .log file.

Bonus points: How would the results change if you used term frequency × inverse document frequency (TF-IDF) instead of binary representation for Naïve Bayes (10 points)?
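Evaluation steps 1 to 3 above can be sketched in numpy as follows. The function names are hypothetical, and the sketch assumes the model is stored as an array of class labels, a vector of log priors, and a (n_classes, |V|) matrix of log likelihoods, with test documents already converted to a Bag-of-Words matrix `X`. The choice of rows for gold labels and columns for predictions in the confusion matrix is a convention picked here, not one stated in the assignment.

```python
import numpy as np


def predict(X, classes, log_prior, log_likelihood):
    """Pick the class maximizing log P(c) + sum_w x_w * log P(w | c)."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


def accuracy(y_true, y_pred):
    """Correctly classified reviews divided by all reviews."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def confusion_matrix(y_true, y_pred, classes):
    """Count (gold, predicted) pairs; rows are gold, columns are predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for gold, pred in zip(y_true, y_pred):
        matrix[index[gold], index[pred]] += 1
    return matrix
```

The same three functions can be reused for each of the four stemming and representation combinations, since only the matrix `X` and the trained parameters change.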

* Originally designed by Dr. Uzuner for AIT 726. Revised by Dr. Liao for AIT 526 (3/15/2024).

3. Documentation

Identify your information (group number, name, date, etc.) and describe the problem to be solved well enough so that someone not familiar with our class could understand it. Give actual examples of program input and output, along with usage instructions. Describe the algorithm you have used to solve the problem, specified in a stepwise or point-by-point fashion.

Additional description: Please state whether the bonus credit questions are answered or not.

4. Deliverables

Please submit a zip file named student1 [firstname initial][lastname]_student2 [firstname initial][lastname]_[hw#].zip (e.g., student 1 Jamie Lee, student 2 Kahyun Lee: jlee_klee_1.zip). The zip file should include the following:

Your code.

A .log file or .txt file that contains your output. You can choose whatever is convenient for you. A .log file can be created using the logging library. A .txt file can be created using simple file I/O.

NOTE: If you use a Jupyter notebook, you still need to save results to a text file; do NOT print all results in the Jupyter notebook. You can have some small outputs in the notebook and then save it as HTML.

Zip all notebook files, HTML files, and/or related intermediate datasets, and other files into ONE zip file. Do not zip the original datasets provided for the assignment. Please submit only one zip file.
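One possible way to write results to a .log file with Python's standard `logging` library is sketched below. The logger name, the function names, and the line layout are all assumptions made for illustration; the assignment does not prescribe an output format.

```python
import logging


def make_result_logger(path):
    """Create a logger that writes plain message lines to the given file."""
    logger = logging.getLogger("nb_results")  # arbitrary, hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines when re-run in a notebook
    logger.propagate = False  # keep results out of the notebook output
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_result(logger, config_name, accuracy, matrix):
    """Write one configuration's accuracy and confusion matrix."""
    logger.info("configuration: %s", config_name)
    logger.info("accuracy: %.4f", accuracy)
    logger.info("confusion matrix (rows = gold, columns = predicted):")
    for row in matrix:
        logger.info("  %s", " ".join(str(v) for v in row))
```

Calling `log_result` once per configuration (for example "stemming + binary") keeps all four classifiers in a single file that can go into the zip.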