In this question, you will learn to build a Naive Bayes Classifier for the binary classification
task.
Dataset: "Financial Phrasebank" dataset from HuggingFace To load the data, you
need to install library "datasets" pip install datasets and then use loaddatset
method to load the dataset. You can find the code on the link provided above.
The dataset contains three class labels: neutral, positive, and negative. Consider only the positive and negative samples and ignore the neutral samples. Use a randomly selected portion of the samples to train the model and the remaining samples for the test set.
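A minimal loading and splitting sketch is shown below. The "sentences_allagree" configuration, the "sentence"/"label" field names, and the 80/20 split ratio are assumptions, since the question does not restate them.

```python
# Sketch: load Financial Phrasebank, drop neutral sentences, random train/test split.
# The "sentences_allagree" configuration and the 80/20 split are assumptions here.
import random
from datasets import load_dataset

# (some datasets versions may also require trust_remote_code=True)
dataset = load_dataset("financial_phrasebank", "sentences_allagree", split="train")
label_names = dataset.features["label"].names          # e.g. negative / neutral / positive

# Keep only positive and negative samples.
samples = [(ex["sentence"], label_names[ex["label"]])
           for ex in dataset
           if label_names[ex["label"]] != "neutral"]

random.seed(0)
random.shuffle(samples)
cut = int(0.8 * len(samples))                           # assumed split ratio
train_data, test_data = samples[:cut], samples[cut:]
```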
Clean the dataset with the steps from the previous assignment and build a vocabulary of
all the words.
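The cleaning steps themselves come from the previous assignment and are not restated here; the sketch below substitutes a simple lowercase-and-strip-punctuation tokenizer so the vocabulary-building step is concrete.

```python
# Stand-in cleaning (lowercase, strip punctuation) -- the real steps come from
# the previous assignment. The vocabulary is the set of all words in the training data.
import re

def clean_tokens(text):
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def build_vocabulary(samples):
    vocab = set()
    for sentence, _label in samples:
        vocab.update(clean_tokens(sentence))
    return vocab

vocab = build_vocabulary([("Profit rose 10 %.", "positive"), ("Sales fell.", "negative")])
# {'profit', 'rose', '10', 'sales', 'fell'}
```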
Compute the prior probability of each class:

$P(c) = \dfrac{N_c}{N}$

Here, $N_c$ is the number of samples with class $c$ and $N$ is the total number of samples in the dataset.
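For example, the prior can be computed directly from the label counts in the training split (the names below are illustrative):

```python
# P(c) = N_c / N from the (sentence, label) training pairs.
from collections import Counter

def class_priors(train_data):
    counts = Counter(label for _sentence, label in train_data)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(class_priors([("good", "positive"), ("bad", "negative"), ("great", "positive")]))
# {'positive': 0.666..., 'negative': 0.333...}
```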
Compute the likelihood $P(w_i \mid c)$ for all words and all classes with the following equation:

$P(w_i \mid c) = \dfrac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$

Here, $\mathrm{count}(w_i, c)$ is the frequency of the word $w_i$ in class $c$, $\sum_{w \in V} \mathrm{count}(w, c)$ is the frequency of all the words in class $c$, and $|V|$ is the vocabulary size. Laplace (add-one) smoothing is used to avoid zero probability in the case of a new word.
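A sketch of the smoothed likelihood table, assuming the (sentence, label) pairs, vocabulary set, and tokenizer shapes used in the sketches above:

```python
# P(w|c) = (count(w, c) + 1) / (total words in c + |V|)   -- add-one smoothing
from collections import Counter, defaultdict

def word_likelihoods(train_data, vocab, tokenize):
    class_counts = defaultdict(Counter)                 # class -> word -> frequency
    for sentence, label in train_data:
        class_counts[label].update(w for w in tokenize(sentence) if w in vocab)
    likelihoods = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())                    # frequency of all words in the class
        likelihoods[label] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return likelihoods

# Toy example with whitespace tokenization:
lk = word_likelihoods([("profit rose", "positive"), ("profit fell", "negative")],
                      {"profit", "rose", "fell"}, str.split)
print(lk["positive"]["rose"])                           # (1 + 1) / (2 + 3) = 0.4
```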
For each sample in the test set, predict the class $\hat{c}$ with the highest posterior probability. To avoid underflow and increase speed, use log space to predict the class as follows:

$\hat{c} = \operatorname*{argmax}_{c} \Big[ \log P(c) + \sum_{i} \log P(w_i \mid c) \Big]$
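A corresponding prediction sketch (out-of-vocabulary words are simply skipped here, which is one common choice; the question does not specify it):

```python
# argmax_c [ log P(c) + sum_i log P(w_i | c) ], computed in log space.
import math

def predict(sentence, priors, likelihoods, vocab, tokenize):
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior)
        for w in tokenize(sentence):
            if w in vocab:                              # skip out-of-vocabulary words
                score += math.log(likelihoods[label][w])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```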
Using the metrics from the scikit-learn library, calculate the accuracy and macro-average precision, recall, and F1 score, and also provide the confusion matrix on the test set.
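One way to compute these with scikit-learn; the toy y_true/y_pred lists are placeholders for the real test labels and model predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = ["positive", "negative", "positive"]           # test-set labels (placeholder)
y_pred = ["positive", "positive", "positive"]           # model predictions (placeholder)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
print(accuracy, precision, recall, f1)
print(cm)
```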
Q. Logistic Regression: Code
In this task, you will learn to build a Logistic Regression Classifier for the same "Financial Phrasebank" dataset. A Bag-of-Words model will be used for this task.
Use a randomly selected portion of the data for training, another randomly selected portion for testing, and the remaining data for the validation set. Use the positive and negative classes only. Perform the same cleaning tasks on the text data and build a vocabulary of the words.
Link to the dataset.
Link to the scikit-learn documentation for metrics.
Using CountVectorizer, fit the cleaned training data. This will create the bag-of-words model for the training data. Transform the test and validation sets using the same CountVectorizer.
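For instance (the toy sentence lists stand in for the cleaned splits):

```python
# Fit the vectorizer on the training text only; reuse it for validation and test.
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["profit rose sharply", "sales fell again"]   # cleaned training sentences
val_texts, test_texts = ["profit fell"], ["sales rose"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)             # learns the vocabulary
X_val = vectorizer.transform(val_texts)                     # same vocabulary, new counts
X_test = vectorizer.transform(test_texts)
print(X_train.shape, len(vectorizer.vocabulary_))
```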
To implement the logistic regression, use the following equations:

$\hat{y} = \sigma(W \cdot x), \qquad \sigma(z) = \dfrac{1}{1 + e^{-z}}$

We need the weight vector $W$: create an array whose dimension equals that of each feature vector produced by the CountVectorizer.
Apply the above equations over the whole training dataset and calculate $\hat{y}$ and the cross-entropy loss, which can be calculated as

$L = -\dfrac{1}{N} \sum_{j=1}^{N} \big[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \big]$
Now, update the weights as follows:

$W \leftarrow W - \eta \, \nabla_W L$

Here, $\nabla_W L$ is the gradient of the loss (obtained via the derivative of the sigmoid function) and $\eta$ is the learning rate.
Repeat the forward-pass and weight-update steps above for a chosen number of iterations or epochs. For each iteration, calculate the cross-entropy loss on the validation set.
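A minimal NumPy sketch of this training loop, assuming the sparse matrices from CountVectorizer, 0/1 label arrays, no bias term (matching the W-only formulation above), and placeholder values for the learning rate and epoch count:

```python
# Batch gradient descent for logistic regression on bag-of-words features.
# lr and epochs are placeholder values; X_* can be the sparse CountVectorizer output.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)                   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def train(X_train, y_train, X_val, y_val, lr=0.1, epochs=100):
    W = np.zeros(X_train.shape[1])                           # one weight per vocabulary word
    for epoch in range(epochs):
        y_hat = sigmoid(X_train @ W)                         # forward pass
        grad = X_train.T @ (y_hat - y_train) / len(y_train)  # gradient of the loss w.r.t. W
        W -= lr * grad                                       # weight update
        val_loss = cross_entropy(y_val, sigmoid(X_val @ W))  # track validation loss
        if epoch % 10 == 0:
            print(epoch, cross_entropy(y_train, y_hat), val_loss)
    return W
```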
Calculate the accuracy and macro-average precision, recall, and F1 score, and provide the confusion matrix on the test set.
Experiment with varying values of the learning rate and report your observations with respect to the performance of the model. You can vary the number of iterations to enhance the performance if necessary.
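A possible sweep, reusing the train and cross_entropy helpers from the sketch above; the candidate learning rates and epoch count are illustrative only:

```python
# Illustrative learning-rate sweep; pick the value with the best validation loss.
for lr in (0.001, 0.01, 0.1, 1.0):
    W = train(X_train, y_train, X_val, y_val, lr=lr, epochs=200)
    print(f"lr={lr}: validation loss {cross_entropy(y_val, sigmoid(X_val @ W)):.4f}")
```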