Q1. Naive Bayes: Code
In this question, you will learn to build a Naive Bayes Classifier for the binary classification task.
1. Dataset: the "Financial Phrasebank" dataset from HuggingFace.1 To load the data, you need to install the library "datasets" (pip install datasets) and then use the load_dataset() method to load the dataset. You can find the code at the link provided above.
2. The dataset contains 3 class labels: neutral (1), positive (2), and negative (0). Consider only the positive and negative samples and ignore the neutral ones. Use 80% of the samples, selected randomly, to train the model and the remaining 20% for testing.
3. Clean the dataset with the steps from the previous assignment and build a vocabulary of
all the words.
4. Compute the prior probability of each class:
P(c) = N(c) / N
Here, N(c) is the number of samples with class c and N is the total number of samples in the dataset.
5. Compute the likelihood P(w_i | c) for all words and all classes with the following equation:
P(w_i | c) = (count(w_i, c) + 1) / (|V| + Σ_w count(w, c))
Here, count(w_i, c) is the frequency of the word w_i in class c, Σ_w count(w, c) is the frequency of all the words in class c, and |V| is the size of the vocabulary. Laplace smoothing is used to avoid zero probability in the case of a new word.
6. For each sample in the test set, predict the class ĉ with the highest posterior probability. To avoid underflow and increase speed, use log space to predict the class as follows:
ĉ = argmax_c ( log P(c) + Σ_i log P(w_i | c) )    (3.1)
7. Using the metrics from the scikit-learn library2, calculate the accuracy and macro-average precision, recall, and F1 score, and also provide the confusion matrix on the test set.
Q2. Logistic Regression: Code [25]
In this task, you will learn to build a Logistic Regression Classifier for the same "Financial Phrasebank" dataset. A Bag-of-Words model will be used for this task.
1. Use 60% of the data, selected randomly, for training, 20%, selected randomly, for testing, and the remaining 20% for the validation set. Use the positive and negative classes only. Perform the same cleaning tasks on the text data and build a vocabulary of the words.
1Link to the dataset.
2Link to the scikit-learn documentation for metrics.
2. Using CountVectorizer, fit the cleaned train data. This will create the bag-of-words model for the train data. Transform the test and validation sets using the same CountVectorizer.
3. To implement logistic regression using the following equations,
z = W · x
ŷ = σ(z)
we need the weight vector W. Create an array W of dimension equal to that of each feature vector x from the CountVectorizer.
4. Apply the above equations over the whole training dataset and calculate ŷ and the cross-entropy loss, which can be calculated as
L = −(1/N) Σ [ y log ŷ + (1 − y) log(1 − ŷ) ]
5. Now, update the weights as follows:
W_{t+1} = W_t − η (ŷ − y) · x
Here, (ŷ − y) · x is the gradient of the loss with respect to W, and η = 0.01 is the learning rate.
6. Repeat steps 4 and 5 for 500 iterations, or epochs. For each iteration, calculate the cross-entropy loss on the validation set.
7. Calculate the accuracy and macro-average precision, recall, and F1 score and provide
the confusion matrix on the test set.
8. Experiment with varying values of η in {0.0001, 0.001, 0.01, 0.1}. Report your observations with respect to the performance of the model. You can vary the number of iterations to enhance the performance if necessary.
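Steps 2-6 above can be sketched as follows. This is a minimal illustration with toy sentences in place of the cleaned Financial Phrasebank splits (the validation-loss tracking of step 6 is omitted for brevity); the positive/negative labels are recoded as 1/0 so they match the sigmoid output, and the variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy corpus standing in for the cleaned training split; 1 = positive, 0 = negative.
train_texts = ["profit rose", "sales grew", "profit grew",
               "loss widened", "revenue fell", "sales fell"]
y_train = np.array([1, 1, 1, 0, 0, 0])

# Step 2: bag-of-words model fitted on the train data only.
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts).toarray().astype(float)

# Step 3: one weight per vocabulary word.
W = np.zeros(X_train.shape[1])
eta = 0.01  # learning rate

# Steps 4-6: 500 epochs of full-batch gradient descent.
for epoch in range(500):
    y_hat = sigmoid(X_train @ W)                        # y_hat = sigma(W . x)
    loss = -np.mean(y_train * np.log(y_hat + 1e-12)
                    + (1 - y_train) * np.log(1 - y_hat + 1e-12))
    grad = X_train.T @ (y_hat - y_train) / len(y_train)  # (y_hat - y) . x, averaged
    W -= eta * grad                                      # W_{t+1} = W_t - eta * grad

# Predict on (toy) test sentences, transformed with the same CountVectorizer.
X_test = vec.transform(["profit grew", "loss fell"]).toarray().astype(float)
preds = (sigmoid(X_test @ W) >= 0.5).astype(int)
print(preds)
```

As in Q1, the test-set predictions can then be scored with scikit-learn's accuracy_score, precision_recall_fscore_support (average="macro"), and confusion_matrix for step 7.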
