Question


Sentiment Classification with Naïve Bayes*

In this assignment, you will build a naïve Bayes classifier for sentiment classification. We define sentiment classification as two classes: positive and negative. Our data set consists of airline reviews. The zip directory for the data contains training and test datasets, where each file contains one airline review tweet. You will build the model using the training data and evaluate it with the test data. The training data and the test data each contain 4182 reviews. You will have to build the system from scratch (e.g., with numpy). Do not use any existing libraries (e.g., scikit-learn).

1. Build your naïve Bayes classifier

Create your Vocabulary: Read the complete training data word by word and create the vocabulary V for the corpus. You must not include the test set in this process. Remove any markup tags, e.g., HTML tags, from the data. Lowercase capitalized words (i.e., words that start with a capital letter) but not all-capital words (e.g., USA). Keep all stop words. Create 2 versions of V: with stemming and without stemming. You can use appropriate tools in nltk to stem. Tokenize at white space and also at each punctuation mark. In other words, "child's" consists of two tokens, "child" and "'s"; "home." consists of two tokens, "home" and ".". Consider emoticons in this process. You can use an emoticon tokenizer if you so choose. If you do, specify which one.

Extract Features: Convert documents to vectors using the Bag of Words (BoW) representation. Do this in two ways: keeping a frequency count, where each word is represented by its count in each document, and keeping a binary representation that only tracks the presence (or absence) of a word in a document.

Training: Calculate the prior for each class and the likelihood for each word given each class (word|class).

Note that:
1. Ignore any words that appear in the test set but not the training set.
2. If you want to experiment with different stemmers or other aspects of the input features, you must do so on the training set, through cross-validation. You must not do such preliminary evaluations on test data. When you have finalized your system, features, and parameters, you can evaluate on test data.

Evaluation:
1. Compute the most likely class for each document in the test set using each of the combinations: stemming + frequency count, stemming + binary, no-stemming + frequency count, no-stemming + binary.
2. Compute and report accuracy. Accuracy is the number of correctly classified reviews divided by the number of all reviews in the test set.
3. Create a confusion matrix for each classifier. Save your results in a .txt or .log file.

Bonus points: How would the results change if you used term frequency x inverse document frequency instead of the binary representation for naïve Bayes (10 points)?

*Originally designed by Dr. Uzuner for AIT 726. Revised by Dr. Liao for AIT 526 (3/15/2024).

3. Documentation

Identify your information (group number, name, date, etc.) and describe the problem to be solved well enough that someone not familiar with our class could understand it.

Give actual examples of program input and output, along with usage instructions.

Describe the algorithm you have used to solve the problem, specified in a stepwise or point-by-point fashion.

Additional description: Please state whether the bonus credit questions are answered or not.

4. Deliverables

Please submit a zip file named student1[firstname initial][lastname]_student2[firstname initial][lastname]_[hw#].zip (e.g., for student 1 Jamie Lee and student 2 Kahyun Lee: jlee_klee_hw1.zip).

The zip file should include the following:

Your code(s).

A .log file or .txt file that contains your output. You can choose whichever is convenient for you. A .log file can be created using the logging library. A .txt file can be created using simpleio library.

NOTE: If you use a Jupyter notebook, you still need to save results to a text file; do NOT print all results in the Jupyter notebook. You can have some small outputs in the notebook and then save it as HTML(s). Zip all notebook file(s), HTML files, and/or related intermediate datasets, and other files into ONE zip file. Do not zip the original datasets provided for the assignment. Please submit only one zip file.
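The vocabulary step above (strip markup, lowercase capitalized but not all-capital words, split at white space and punctuation, keep emoticons) could be sketched as follows. This is only an illustration under stated assumptions: the names tokenize and build_vocabulary are hypothetical, the emoticon pattern covers just a handful of ASCII smileys, an apostrophe is kept attached to the letters that follow it, and a one-letter capital such as "I" is treated as all-capital and left unchanged. The assignment does not prescribe any of these choices.

```python
import re

# Crude markup matcher: anything between "<" and ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")

# Alternatives are tried in order at each position:
#   1. a small, illustrative set of ASCII emoticons such as :) ;-) =D
#   2. an apostrophe followed by letters ('s), kept as one token (assumption)
#   3. a run of word characters
#   4. any single remaining punctuation character
TOKEN_RE = re.compile(r"[:;=]-?[)(DP](?!\w)|'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split one review into tokens following the rules sketched above."""
    text = TAG_RE.sub(" ", text)
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Lowercase words that start with a capital unless fully upper case.
        if tok[0].isupper() and not tok.isupper():
            tok = tok.lower()
        tokens.append(tok)
    return tokens


def build_vocabulary(train_docs):
    """Map each distinct training token to a column index (training data only)."""
    vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


print(tokenize("My child's flight to the USA was <b>Great</b> :)"))
# ['my', 'child', "'s", 'flight', 'to', 'the', 'USA', 'was', 'great', ':)']
print(tokenize("home."))
# ['home', '.']
```

For the stemmed version of V, each token could additionally be passed through a stemmer, for example the stem method of nltk's PorterStemmer. Known limitations of this sketch include typographic apostrophes and quoted words, which it does not handle specially.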
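The feature extraction and training steps above can be illustrated with a short numpy sketch of a multinomial naïve Bayes model. Several things here are assumptions rather than part of the assignment: the function names vectorize, train_nb, and predict_nb are hypothetical, labels are encoded as 0 (negative) and 1 (positive), and add-one (Laplace) smoothing is used for the likelihoods because the assignment text does not say how zero counts should be handled. Working in log space avoids underflow when many small probabilities are multiplied.

```python
import numpy as np


def vectorize(token_lists, vocab, binary=False):
    """Turn tokenized documents into a Bag of Words matrix (docs x vocab)."""
    X = np.zeros((len(token_lists), len(vocab)))
    for row, tokens in enumerate(token_lists):
        for tok in tokens:
            col = vocab.get(tok)
            if col is not None:  # words not seen in training are ignored
                X[row, col] += 1
    # Binary mode records only presence or absence of each word.
    return (X > 0).astype(float) if binary else X


def train_nb(X, y, n_classes=2, alpha=1.0):
    """Estimate log priors and log likelihoods; alpha=1.0 is add-one smoothing."""
    log_prior = np.zeros(n_classes)
    log_lik = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))
        counts = Xc.sum(axis=0) + alpha       # smoothed word counts in class c
        log_lik[c] = np.log(counts / counts.sum())
    return log_prior, log_lik


def predict_nb(X, log_prior, log_lik):
    """Return the most likely class index for each row of X."""
    return np.argmax(X @ log_lik.T + log_prior, axis=1)


# Toy example with a hypothetical three-word vocabulary.
vocab = {"good": 0, "bad": 1, "flight": 2}
train = [["good", "flight"], ["good", "good"], ["bad", "flight"], ["bad"]]
labels = np.array([1, 1, 0, 0])
log_prior, log_lik = train_nb(vectorize(train, vocab), labels)
test = [["good", "unknown"], ["bad", "flight"]]
print(predict_nb(vectorize(test, vocab), log_prior, log_lik))  # [1 0]
```

Passing binary=True to vectorize for both training and test documents gives the binary variant, so the four required combinations differ only in the tokenizer (stemmed or not) and this flag.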
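For the evaluation step, accuracy and the confusion matrix can both be derived from the true and predicted labels. The sketch below assumes labels encoded as 0 and 1, rows indexed by the true class, and columns by the predicted class. The names confusion_matrix, accuracy, and format_report are hypothetical, and the report layout is only one possible choice.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes=2):
    """Count documents by (true class, predicted class)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def accuracy(cm):
    """Correctly classified reviews (the diagonal) over all reviews."""
    return np.trace(cm) / cm.sum()


def format_report(name, cm):
    """Build a plain-text summary suitable for a .txt or .log file."""
    return (f"{name}\n"
            f"accuracy: {accuracy(cm):.4f}\n"
            f"confusion matrix (rows = true, cols = predicted):\n{cm}\n")


cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(accuracy(cm))  # 0.6
print(format_report("no-stemming + binary (toy data)", cm))
```

The returned string can be written out with the built-in open function in "w" or "a" mode, or passed to a logger configured with the standard logging module.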
