Question
(REALLY NEED HELP CREATING THIS CODE IN FULL AND ITS COMPLETE ENTIRETY... ALL OF THE DETAILS ARE PROVIDED AND THE CODE SHOULD HAVE EACH PART FOR EACH QUESTION LABELED SEPARATELY... PLEASE HELP ME AND I PROMISE TO LEAVE GOOD FEEDBACK AND REVIEWS)
1 Introduction
With this assignment, you will learn about the implementation of two basic classifiers: the Naive Bayes Classifier and the Logistic Regression Classifier. All the code must be your own; the use of libraries such as scikit-learn is prohibited unless stated otherwise.
Naive Bayes is a classifier that, for a document d, returns the class ĉ out of all classes c ∈ C with the maximum posterior probability given the document. It is a generative model which calculates the likelihood P(d|c) and the prior P(c) to predict the class instead of calculating P(c|d) directly. By contrast, a discriminative model, like logistic regression, attempts to compute the posterior P(c|d) directly by learning boundaries in the data space.
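Written out, the decision rule this paragraph describes follows from Bayes' rule; since P(d) is the same for every class, it drops out of the argmax:

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)
        = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}
        = \operatorname*{argmax}_{c \in C} P(d \mid c)\, P(c)
```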
You will also learn to implement the Gradient Descent algorithm for binary logistic regression. The objective of the algorithm is to minimize the loss defined by the learned boundary by updating the parameters of the model. It tries to find a local minimum of the loss function, where the loss is lowest within the surrounding region.
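In its general form, each gradient descent step moves the parameters θ against the gradient of the loss, with the step size set by a learning rate α (the rule specialized to logistic regression appears in Q2):

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta}\, \mathcal{L}(\theta_t)
```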
2 Instructions
Each question is labelled as Code or Written and the guidelines for each type are provided below.
Code
This section needs to be completed using Python 3.6+. You will also require the following packages:
- pandas
- numpy
- NLTK or SpaCy
- scikit-learn
If you want to use an external package for any reason, you are required to get approval from the course staff prior to submission.
3 Questions
Q1. Naive Bayes: Code [25]
In this question, you will learn to build a Naive Bayes Classifier for the binary classification task.
- Dataset: the "Financial Phrasebank" dataset from HuggingFace [1]. To load the data, you need to install the "datasets" library (pip install datasets) and then use the load_dataset() method. You can find example code at the link provided below.
- The dataset contains 3 class labels: neutral (1), positive (2), and negative (0). Consider only the positive and negative samples and ignore the neutral ones. Use 80% of the samples, selected randomly, to train the model and the remaining 20% for testing.
- Clean the dataset with the steps from the previous assignment and build a vocabulary of all the words.
- Compute the prior probability of each class:

  $$p(c_i) = \frac{\mathrm{count}(c_i)}{N}$$

  Here, count(c_i) is the number of samples with class c_i and N is the total number of samples in the dataset.
- Compute the likelihood p(w|c) for all words w and all classes c with the following equation:
  $$p(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$

  Here, count(w_i, c) is the frequency of the word w_i in class c, while $\sum_{w \in V} \mathrm{count}(w, c)$ is the total frequency of all the words in class c. Laplace smoothing (the +1 and +|V| terms) is used to avoid zero probability in the case of a new word.
- For each sample in the test set, predict the class c_NB with the highest posterior probability. To avoid underflow and increase speed, work in log space as follows:

  $$c_{NB} = \operatorname*{argmax}_{c \in C} \Big( \log p(c) + \sum_{w_i \in V} \log p(w_i \mid c) \Big)$$

- Using the metrics from the scikit-learn library [2], calculate the accuracy and macro-averaged precision, recall, and F1 score, and also provide the confusion matrix on the test set. (A hedged end-to-end sketch covering these steps follows this list.)
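Below is a minimal end-to-end sketch of Q1. It assumes the "sentences_allagree" configuration of the dataset (pick whichever agreement level your course specifies) and uses a placeholder clean() where the cleaning pipeline from the previous assignment should go; treat it as a starting point rather than a definitive solution.

```python
import math
import random
from collections import Counter, defaultdict

from datasets import load_dataset
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Load the data; the configuration name is an assumption -- adjust as needed.
data = load_dataset("financial_phrasebank", "sentences_allagree")["train"]

# Keep only positive (2) and negative (0) samples; drop neutral (1).
samples = [(ex["sentence"], ex["label"]) for ex in data if ex["label"] != 1]

def clean(text):
    # Placeholder: substitute the cleaning steps from the previous
    # assignment (stop-word removal, lemmatization with NLTK/spaCy, ...).
    return text.lower().split()

random.seed(0)
random.shuffle(samples)
split = int(0.8 * len(samples))
train, test = samples[:split], samples[split:]

# Vocabulary and per-class counts over the training set.
vocab = set()
class_counts = Counter()            # count(c_i)
word_counts = defaultdict(Counter)  # count(w_i, c)
for text, label in train:
    class_counts[label] += 1
    for w in clean(text):
        vocab.add(w)
        word_counts[label][w] += 1

N, V = len(train), len(vocab)
classes = sorted(class_counts)
log_prior = {c: math.log(class_counts[c] / N) for c in classes}
total_words = {c: sum(word_counts[c].values()) for c in classes}

def log_likelihood(w, c):
    # Laplace smoothing: (count(w, c) + 1) / (sum_w count(w, c) + |V|)
    return math.log((word_counts[c][w] + 1) / (total_words[c] + V))

def predict(text):
    # Log-space posterior: log p(c) + sum over in-vocabulary words of log p(w | c).
    scores = {c: log_prior[c] + sum(log_likelihood(w, c)
                                    for w in clean(text) if w in vocab)
              for c in classes}
    return max(scores, key=scores.get)

y_true = [label for _, label in test]
y_pred = [predict(text) for text, _ in test]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```

Words unseen in training are skipped at prediction time, matching the vocabulary-restricted sum in the decision rule above.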
Q2. Logistic Regression: Code [25]
In this task, you will learn to build a Logistic Regression Classifier for the same "Financial Phrasebank" dataset. A bag-of-words model will be used for this task.
- Use 60% of the data, selected randomly, for training, 20% for testing, and the remaining 20% as the validation set. Use the classes 'positive' and 'negative' only. Perform the same cleaning tasks on the text data and build a vocabulary of the words.
- Using scikit-learn's CountVectorizer, fit the cleaned train data. This creates the bag-of-words model for the train data. Transform the test and validation sets using the same CountVectorizer.
- Implement logistic regression using the following equations:

  $$z_i = W \cdot x_i$$
  $$\hat{y}_i = \sigma(z_i)$$

  For this we need the weight vector W. Create an array with dimension equal to that of each x_i produced by the CountVectorizer.
- Apply the above equations over the whole training dataset and calculate ŷ and the cross-entropy loss L_CE, which is given by
  $$\mathcal{L}_{CE} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$

- Now, update the weights as follows:
  $$W_{t+1} = W_t - \alpha\, (\hat{y}_i - y_i)\, x_i$$

  Here, (ŷ_i − y_i) x_i is the gradient of the cross-entropy loss for the sigmoid model, and α = 0.01 is the learning rate.
- Repeat the forward-pass and weight-update steps above for 500 iterations, or epochs. For each iteration, calculate the cross-entropy loss on the validation set.
- Calculate the accuracy and macro-averaged precision, recall, and F1 score, and provide the confusion matrix on the test set.
- Experiment with varying values of α ∈ {0.0001, 0.001, 0.01, 0.1}. Report your observations with respect to the performance of the model. You can vary the number of iterations to enhance performance if necessary. (A sketch of the training loop follows this list.)
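A minimal sketch of Q2 under the same dataset assumption as before. Note two deliberate choices that are assumptions, not part of the handout: it uses the averaged batch gradient over the whole training set rather than the per-sample update written above, and it clips ŷ before taking logs to avoid log(0).

```python
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Load and filter the data as in Q1; the configuration name is an assumption.
data = load_dataset("financial_phrasebank", "sentences_allagree")["train"]
texts = [ex["sentence"] for ex in data if ex["label"] != 1]
ys = [1 if ex["label"] == 2 else 0 for ex in data if ex["label"] != 1]

# 60/20/20 random split into train/validation/test.
rng = np.random.default_rng(0)
idx = rng.permutation(len(texts))
n_tr, n_va = int(0.6 * len(texts)), int(0.2 * len(texts))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def take(seq, ids):
    return [seq[i] for i in ids]

# Bag of words: fit on train only, transform val/test with the same vectorizer.
vec = CountVectorizer()  # pass your Q1 cleaning via `preprocessor=` if desired
X_tr = vec.fit_transform(take(texts, tr)).toarray().astype(float)
X_va = vec.transform(take(texts, va)).toarray().astype(float)
X_te = vec.transform(take(texts, te)).toarray().astype(float)
y_tr, y_va, y_te = (np.array(take(ys, s)) for s in (tr, va, te))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

alpha, epochs = 0.01, 500
W = np.zeros(X_tr.shape[1])  # one weight per vocabulary word

for epoch in range(epochs):
    y_hat = sigmoid(X_tr @ W)                   # z = W . x, y_hat = sigma(z)
    grad = X_tr.T @ (y_hat - y_tr) / len(y_tr)  # averaged gradient of L_CE
    W -= alpha * grad                           # W_{t+1} = W_t - alpha * grad
    val_loss = cross_entropy(y_va, sigmoid(X_va @ W))
    if (epoch + 1) % 100 == 0:
        print(f"epoch {epoch + 1}: validation loss = {val_loss:.4f}")

y_pred = (sigmoid(X_te @ W) >= 0.5).astype(int)
print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred, average="macro"))
print("recall   :", recall_score(y_te, y_pred, average="macro"))
print("F1       :", f1_score(y_te, y_pred, average="macro"))
print(confusion_matrix(y_te, y_pred))
```

For the last item, wrap the training loop in a function parameterized by alpha and compare the validation-loss curves across the four values.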
Links
[1] Financial Phrasebank dataset: https://huggingface.co/datasets/financial_phrasebank
[2] scikit-learn evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html