Question
(REALLY NEED HELP CREATING THIS CODE IN FULL AND ITS COMPLETE ENTIRETY... ALL OF THE DETAILS ARE PROVIDED AND THE CODE SHOULD HAVE EACH PART FOR EACH QUESTION LABELED SEPARATELY... PLEASE HELP ME AND I PROMISE TO LEAVE GOOD FEEDBACK AND REVIEWS)
1 Introduction
With this assignment, you will learn about the implementation of two basic classifiers: the Naive Bayes Classifier and the Logistic Regression Classifier. All the code must be your own; the use of libraries such as scikit-learn is prohibited unless stated otherwise.
Naive Bayes is a classifier that, for a document d, returns the class ĉ out of all classes c ∈ C with the maximum posterior probability given the document. It is a generative model which calculates the likelihood P(d|c) and the prior P(c) to predict the class instead of calculating P(c|d) directly. By contrast, a discriminative model, like logistic regression, attempts to compute the posterior P(c|d) directly by learning boundaries in the data space.
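Written out, the decision rule this paragraph describes follows from Bayes' rule; since P(d) is the same for every class, it drops out of the argmax:

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)
        = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}
        = \operatorname*{argmax}_{c \in C} P(d \mid c)\, P(c)
```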
You will also learn to implement the Gradient Descent algorithm for binary logistic regression. The objective of the algorithm is to minimize the loss defined by the learned boundary by updating the parameters of the model. It tries to find a local minimum of the loss function, where the loss is lowest within the surrounding region.
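In its general form, each gradient descent step moves the parameters θ against the gradient of the loss, with the step size set by a learning rate α (the rule specialized to logistic regression appears in Q2):

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta}\, \mathcal{L}(\theta_t)
```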
2 Instructions
Each question is labelled as Code or Written and the guidelines for each type are provided below.
Code
This section needs to be completed using Python 3.6+. You will also require the following packages:
- pandas
- numpy
- NLTK or SpaCy
- scikit-learn
If you want to use an external package for any reason, you are required to get approval from the course staff prior to submission.
3 Questions
Q1. Naive Bayes: Code [25]
In this question, you will learn to build a Naive Bayes Classifier for the binary classification task.
- Dataset: the "Financial Phrasebank" dataset from HuggingFace [1]. To load the data, you need to install the "datasets" library (pip install datasets) and then use the load_dataset() method. You can find example code at the link provided below.
- The dataset contains 3 class labels: neutral (1), positive (2), and negative (0). Consider only the positive and negative samples and ignore the neutral ones. Use 80% of the samples, selected randomly, to train the model and the remaining 20% for testing.
- Clean the dataset with the steps from the previous assignment and build a vocabulary of all the words.
- Compute the prior probability of each class:

  $$p(c_i) = \frac{\mathrm{count}(c_i)}{N}$$

  Here, count(c_i) is the number of samples with class c_i and N is the total number of samples in the dataset.
- Compute the likelihood p(w|c) for all words w and all classes c with the following equation:
  $$p(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$

  Here, count(w_i, c) is the frequency of the word w_i in class c, while $\sum_{w \in V} \mathrm{count}(w, c)$ is the total frequency of all the words in class c. Laplace smoothing (the +1 and +|V| terms) is used to avoid zero probability in the case of a new word.
- For each sample in the test set, predict the class c_NB with the highest posterior probability. To avoid underflow and increase speed, work in log space as follows:

  $$c_{NB} = \operatorname*{argmax}_{c \in C} \Big( \log p(c) + \sum_{w_i \in V} \log p(w_i \mid c) \Big)$$

- Using the metrics from the scikit-learn library [2], calculate the accuracy and macro-averaged precision, recall, and F1 score, and also provide the confusion matrix on the test set. (A hedged end-to-end sketch covering these steps follows this list.)
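Below is a minimal end-to-end sketch of Q1. It assumes the "sentences_allagree" configuration of the dataset (pick whichever agreement level your course specifies) and uses a placeholder clean() where the cleaning pipeline from the previous assignment should go; treat it as a starting point rather than a definitive solution.

```python
import math
import random
from collections import Counter, defaultdict

from datasets import load_dataset
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Load the data; the configuration name is an assumption -- adjust as needed.
data = load_dataset("financial_phrasebank", "sentences_allagree")["train"]

# Keep only positive (2) and negative (0) samples; drop neutral (1).
samples = [(ex["sentence"], ex["label"]) for ex in data if ex["label"] != 1]

def clean(text):
    # Placeholder: substitute the cleaning steps from the previous
    # assignment (stop-word removal, lemmatization with NLTK/spaCy, ...).
    return text.lower().split()

random.seed(0)
random.shuffle(samples)
split = int(0.8 * len(samples))
train, test = samples[:split], samples[split:]

# Vocabulary and per-class counts over the training set.
vocab = set()
class_counts = Counter()            # count(c_i)
word_counts = defaultdict(Counter)  # count(w_i, c)
for text, label in train:
    class_counts[label] += 1
    for w in clean(text):
        vocab.add(w)
        word_counts[label][w] += 1

N, V = len(train), len(vocab)
classes = sorted(class_counts)
log_prior = {c: math.log(class_counts[c] / N) for c in classes}
total_words = {c: sum(word_counts[c].values()) for c in classes}

def log_likelihood(w, c):
    # Laplace smoothing: (count(w, c) + 1) / (sum_w count(w, c) + |V|)
    return math.log((word_counts[c][w] + 1) / (total_words[c] + V))

def predict(text):
    # Log-space posterior: log p(c) + sum over in-vocabulary words of log p(w | c).
    scores = {c: log_prior[c] + sum(log_likelihood(w, c)
                                    for w in clean(text) if w in vocab)
              for c in classes}
    return max(scores, key=scores.get)

y_true = [label for _, label in test]
y_pred = [predict(text) for text, _ in test]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```

Words unseen in training are skipped at prediction time, matching the vocabulary-restricted sum in the decision rule above.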
Q2. Logistic Regression: Code [25]
In this task, you will learn to build a Logistic Regression Classifier for the same "Financial Phrasebank" dataset. A bag-of-words model will be used for this task.
- Use 60% of the data, selected randomly, for training, 20% for testing, and the remaining 20% as the validation set. Use the classes 'positive' and 'negative' only. Perform the same cleaning tasks on the text data and build a vocabulary of the words.
- Using scikit-learn's CountVectorizer, fit the cleaned train data. This creates the bag-of-words model for the train data. Transform the test and validation sets using the same CountVectorizer.
- Implement logistic regression using the following equations:

  $$z_i = W \cdot x_i$$
  $$\hat{y}_i = \sigma(z_i)$$

  For this we need the weight vector W. Create an array with dimension equal to that of each x_i produced by the CountVectorizer.
- Apply the above equations over the whole training dataset and calculate ŷ and the cross-entropy loss L_CE, which is given by
  $$\mathcal{L}_{CE} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$

- Now, update the weights as follows:
  $$W_{t+1} = W_t - \alpha\, (\hat{y}_i - y_i)\, x_i$$

  Here, (ŷ_i − y_i) x_i is the gradient of the cross-entropy loss for the sigmoid model, and α = 0.01 is the learning rate.
- Repeat the forward-pass and weight-update steps above for 500 iterations, or epochs. For each iteration, calculate the cross-entropy loss on the validation set.
- Calculate the accuracy and macro-averaged precision, recall, and F1 score, and provide the confusion matrix on the test set.
- Experiment with varying values of α ∈ {0.0001, 0.001, 0.01, 0.1}. Report your observations with respect to the performance of the model. You can vary the number of iterations to enhance performance if necessary. (A sketch of the training loop follows this list.)
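A minimal sketch of Q2 under the same dataset assumption as before. Note two deliberate choices that are assumptions, not part of the handout: it uses the averaged batch gradient over the whole training set rather than the per-sample update written above, and it clips ŷ before taking logs to avoid log(0).

```python
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Load and filter the data as in Q1; the configuration name is an assumption.
data = load_dataset("financial_phrasebank", "sentences_allagree")["train"]
texts = [ex["sentence"] for ex in data if ex["label"] != 1]
ys = [1 if ex["label"] == 2 else 0 for ex in data if ex["label"] != 1]

# 60/20/20 random split into train/validation/test.
rng = np.random.default_rng(0)
idx = rng.permutation(len(texts))
n_tr, n_va = int(0.6 * len(texts)), int(0.2 * len(texts))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def take(seq, ids):
    return [seq[i] for i in ids]

# Bag of words: fit on train only, transform val/test with the same vectorizer.
vec = CountVectorizer()  # pass your Q1 cleaning via `preprocessor=` if desired
X_tr = vec.fit_transform(take(texts, tr)).toarray().astype(float)
X_va = vec.transform(take(texts, va)).toarray().astype(float)
X_te = vec.transform(take(texts, te)).toarray().astype(float)
y_tr, y_va, y_te = (np.array(take(ys, s)) for s in (tr, va, te))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

alpha, epochs = 0.01, 500
W = np.zeros(X_tr.shape[1])  # one weight per vocabulary word

for epoch in range(epochs):
    y_hat = sigmoid(X_tr @ W)                   # z = W . x, y_hat = sigma(z)
    grad = X_tr.T @ (y_hat - y_tr) / len(y_tr)  # averaged gradient of L_CE
    W -= alpha * grad                           # W_{t+1} = W_t - alpha * grad
    val_loss = cross_entropy(y_va, sigmoid(X_va @ W))
    if (epoch + 1) % 100 == 0:
        print(f"epoch {epoch + 1}: validation loss = {val_loss:.4f}")

y_pred = (sigmoid(X_te @ W) >= 0.5).astype(int)
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred, average="macro"))
print("recall   :", recall_score(y_te, y_pred, average="macro"))
print("F1       :", f1_score(y_te, y_pred, average="macro"))
print(confusion_matrix(y_te, y_pred))
```

For the last item, wrap the training loop in a function parameterized by alpha and compare the validation-loss curves across the four values.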
Links
[1] Financial Phrasebank dataset: https://huggingface.co/datasets/financial_phrasebank
[2] scikit-learn evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html