Question

Posted on Sep 22, 2024

Problem 1 TF-IDF

Implement TF-IDF using Python, NumPy, Pandas, and whatever text-cleaning library is required.

The tf-idf is the product of two statistics, term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics; you can use the following formulas.

Term Frequency

$$\mathrm{tf}_{t,d} = \log_{10}\bigl(\mathrm{count}(t,d) + 1\bigr)$$

$\mathrm{tf}_{t,d}$ is the frequency of the word $t$ in the document $d$
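
As a minimal sketch of the term-frequency formula above, assuming the document has already been cleaned and split into tokens (the helper name `log_term_frequency` is hypothetical, not part of the assignment):

```python
import math
from collections import Counter

def log_term_frequency(tokens):
    """Return {term: log10(count(t, d) + 1)} for one tokenized document."""
    counts = Counter(tokens)
    return {term: math.log10(count + 1) for term, count in counts.items()}

doc = "the cat sat on the mat".split()
tf = log_term_frequency(doc)
# "the" occurs twice, so its weight is log10(2 + 1);
# "cat" occurs once, so its weight is log10(1 + 1).
```

A term that does not occur in the document has count 0, so its weight under this formula is log10(1) = 0; the sketch simply omits such terms from the dictionary.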

Inverse Document Frequency

$$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$$

$N$ is the total number of documents

$\mathrm{df}_t$ is the number of documents in which term $t$ occurs
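
The inverse document frequency can be sketched in the same style. This assumes a corpus given as a list of token lists; the helper name `inverse_document_frequency` and the toy corpus are illustrative only.

```python
import math

def inverse_document_frequency(tokenized_docs):
    """Return {term: log10(N / df_t)} over a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

docs = [["covid", "vaccine"], ["covid", "lockdown"], ["vaccine", "news"], ["covid"]]
idf = inverse_document_frequency(docs)
# "covid" appears in 3 of 4 documents: log10(4 / 3)
# "news" appears in 1 of 4 documents:  log10(4 / 1)
```

Terms that occur in every document get an idf of log10(1) = 0, so they contribute nothing to the final tf-idf weight.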

TF-IDF

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$
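
To make the product concrete, here is a small worked example with made-up numbers (not taken from the assignment data): a term that occurs twice in a document and appears in 1 of 4 documents overall.

```latex
\begin{align*}
\mathrm{tf}_{t,d} &= \log_{10}(2 + 1) \approx 0.4771 \\
\mathrm{idf}_t &= \log_{10}\left(\frac{4}{1}\right) \approx 0.6021 \\
\text{tf-idf}_{t,d} &= \mathrm{tf}_{t,d} \times \mathrm{idf}_t \approx 0.4771 \times 0.6021 \approx 0.2873
\end{align*}
```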

What is expected?

Your implementation should include the following two functions:

compute_tfidf_weights(train_docs)

word_tfidf_vector(word, _, idf_)

def compute_tfidf_weights(train_docs):
    '''
    Input arguments:
        train_docs : list of documents, i.e., strings
    Output arguments:
        docs_tf : tf as a DataFrame
        docs_idf : idf as a DataFrame
    Note: the use of Pandas DataFrame is not mandatory
    '''
    docs_tf = None
    docs_idf = None
    return docs_tf, docs_idf
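
One way the `compute_tfidf_weights` skeleton might be filled in is sketched below. It is not the reference solution. The `tokenize` helper, the terms-by-documents table layout, and the `"idf"` column name are all assumptions made for illustration; the assignment leaves the text-cleaning library and the data layout open.

```python
import re
from collections import Counter

import numpy as np
import pandas as pd

def tokenize(text):
    # Hypothetical cleaning step: lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def compute_tfidf_weights(train_docs):
    tokenized = [tokenize(doc) for doc in train_docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    # Raw counts: one row per term, one column per document (assumed layout).
    counts = pd.DataFrame(0, index=vocab, columns=range(len(train_docs)), dtype=float)
    for j, tokens in enumerate(tokenized):
        for term, c in Counter(tokens).items():
            counts.loc[term, j] = c
    docs_tf = np.log10(counts + 1)                 # tf = log10(count + 1)
    doc_freq = (counts > 0).sum(axis=1)            # df_t: documents containing t
    docs_idf = np.log10(len(train_docs) / doc_freq).to_frame("idf")  # log10(N / df_t)
    return docs_tf, docs_idf

docs_tf, docs_idf = compute_tfidf_weights(["COVID vaccine", "covid lockdown"])
```

In this two-document example, "covid" occurs in both documents, so its idf is 0, while "vaccine" and "lockdown" each get log10(2).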

def word_tfidf_vector(word, _, idf_):
    '''
    Input arguments:
        word : a query string
        _tf : tf as a DataFrame
        _idf : idf as a DataFrame
    Output arguments:
        tf_idf_value : a numpy array of dimension 1
    '''
    tf_idf_value = None
    return tf_idf_value
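
A hedged sketch of how `word_tfidf_vector` could look up one word follows. The parameter names `tf_table` and `idf_table` are placeholders, the tables are assumed to be indexed by term (terms by documents for tf, a single `"idf"` column for idf), and returning zeros for an unseen word is an assumption rather than something the assignment specifies.

```python
import numpy as np
import pandas as pd

def word_tfidf_vector(word, tf_table, idf_table):
    # tf_table: terms x documents; idf_table: terms x one "idf" column (assumed layout).
    if word not in tf_table.index:
        return np.zeros(tf_table.shape[1])  # unseen word: all-zero vector (assumption)
    # Multiply the word's tf row by its scalar idf to get one weight per document.
    return tf_table.loc[word].to_numpy() * idf_table.loc[word, "idf"]

terms = ["covid", "vaccine", "lockdown"]
tf_table = pd.DataFrame(
    {0: [np.log10(2), np.log10(2), 0.0], 1: [np.log10(2), 0.0, np.log10(2)]},
    index=terms,
)
idf_table = pd.DataFrame({"idf": [0.0, np.log10(2), np.log10(2)]}, index=terms)

vec = word_tfidf_vector("vaccine", tf_table, idf_table)
```

The result has one entry per training document, so a word that is absent from a document gets weight 0 there.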

Problem 2 Word embedding as features for classification

Task

Implement a sentiment classifier based on Twitter data to analyse the sentiments of COVID-19 tweets.

Train and test multiple classification models using the necessary libraries, with the features being sentence embeddings of tweets.

Report the accuracy and F1 score (micro- and macro-averaged) for multiple classifiers and discuss the differences.
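
To illustrate how the requested metrics relate to each other, here is a small from-scratch sketch in NumPy. The function name `accuracy_and_f1` and the toy labels are made up for illustration, and in practice a library implementation would normally be used instead.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, micro-F1, macro-F1) for single-label multiclass predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in labels])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in labels])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in labels])
    # Per-class F1 = 2TP / (2TP + FP + FN); every label here has a nonzero denominator.
    per_class_f1 = 2 * tp / (2 * tp + fp + fn)
    macro = per_class_f1.mean()                  # unweighted mean over classes
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # pooled counts
    accuracy = np.mean(y_true == y_pred)
    return accuracy, micro, macro

y_true = ["neg", "neg", "neg", "neg", "pos", "neu"]
y_pred = ["neg", "neg", "neg", "neg", "neg", "neu"]
accuracy, micro_f1, macro_f1 = accuracy_and_f1(y_true, y_pred)
```

In a single-label multiclass setting, micro-averaged F1 equals accuracy, whereas macro-averaged F1 weights every class equally and therefore drops when a rare class (here "pos") is never predicted correctly. That gap is usually the interesting part of the comparison.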

Dataset

The dataset has been provided in the first code chunk with the assignment. You are required to use the original tweet text for this classification task.

Tweet representation

After the necessary pre-processing of the tweets, convert the words into their embeddings, then take the mean of all the word vectors in a tweet to end up with a single vector representing each tweet. The tweet vector is then used for sentiment classification.

In the process of finding the embeddings for each word, you can ignore out-of-vocabulary words.
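
The averaging step can be sketched as follows. The embedding table is a toy dictionary with made-up 3-dimensional vectors standing in for a real pre-trained model, and returning a zero vector for a tweet with no known words is an assumption, since the assignment does not say how to handle that case.

```python
import numpy as np

def tweet_vector(tokens, embeddings, dim):
    """Mean of the word vectors of in-vocabulary tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV words
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy embedding table (hypothetical values, not a trained model).
embeddings = {
    "stay": np.array([1.0, 0.0, 0.0]),
    "safe": np.array([0.0, 1.0, 0.0]),
    "home": np.array([0.0, 0.0, 1.0]),
}

vec = tweet_vector(["stay", "safe", "covid19"], embeddings, dim=3)
# "covid19" is out of vocabulary, so only "stay" and "safe" are averaged.
```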

Classifier choice

You are required to implement the following TWO classifiers:

One traditional classification model (not a neural-network-based model)

One classifier based on any neural-network-based model.

You can use PyTorch/TensorFlow/scikit-learn to implement your classifier. However, you are free to develop a classifier from scratch.
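
As a rough sketch of the two-classifier setup using scikit-learn, the example below trains a logistic regression (traditional) and a small multilayer perceptron (neural) on the same features. The features here are synthetic random vectors standing in for tweet embeddings, and the model choices and hyperparameters are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for tweet vectors: 3 sentiment classes, 50-dimensional features.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
X = rng.normal(size=(600, 50)) + y[:, None] * 0.5  # shift the mean per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_micro": f1_score(y_test, y_pred, average="micro"),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Using the same train/test split for both models keeps the comparison fair; with real tweet vectors, `X` and `y` would come from the averaged embeddings and the sentiment labels.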

Your answer must include the following:

Code for data loading, data pre-processing, training, and testing of the models.

A discussion on the comparison between the classifiers based on classifier accuracy and F1 score.
