Problem 1 TF-IDF
Implement TF-IDF using Python, NumPy, Pandas, and whichever text-cleaning library you need.
TF-IDF is the product of two statistics: term frequency and inverse document frequency. There are various ways to determine the exact values of both statistics; you can use the following formulas.
Term Frequency
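$\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t,d) + 1)$
$\mathrm{tf}_{t,d}$ is the frequency of the word $t$ in the document $d$
Inverse Document Frequency
$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$
$N$ is the total number of documents
$\mathrm{df}_t$ is the number of documents in which term $t$ occurs
TF-IDF
$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$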
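As a quick sanity check of the formulas above, the following minimal sketch evaluates them on a toy corpus using only the standard library. The corpus, the whitespace tokenization, and the helper names `tf`, `idf`, and `tfidf` are assumptions made for illustration; they are not part of the assignment.

```python
import math

# Toy corpus (hypothetical); tokens are obtained by a plain whitespace split.
docs = ["the cat sat", "the dog sat", "the cat ran fast"]
tokens = [d.split() for d in docs]

def tf(term, doc_tokens):
    # tf_{t,d} = log10(count(t, d) + 1)
    return math.log10(doc_tokens.count(term) + 1)

def idf(term, all_tokens):
    # idf_t = log10(N / df_t); assumes the term occurs in at least one document
    n_docs = len(all_tokens)
    df = sum(1 for doc in all_tokens if term in doc)
    return math.log10(n_docs / df)

def tfidf(term, doc_tokens, all_tokens):
    # tf-idf_{t,d} = tf_{t,d} * idf_t
    return tf(term, doc_tokens) * idf(term, all_tokens)

cat_in_doc0 = tfidf("cat", tokens[0], tokens)
the_in_doc0 = tfidf("the", tokens[0], tokens)
```

With this definition of idf, a term that occurs in every document (here "the") has idf 0, so its tf-idf weight is 0 in every document.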
What is expected?
Your implementation should include the following two functions:
compute_tfidf_weights(train_docs)
word_tfidf_vector(word, tf_df, idf_df)
def compute_tfidf_weights(train_docs):
    '''
    Input arguments:
        train_docs : list of documents, i.e., strings
    Output arguments:
        docs_tf  : tf as a DataFrame
        docs_idf : idf as a DataFrame
    Note: the use of Pandas DataFrame is not mandatory
    '''
    docs_tf = None
    docs_idf = None
    return docs_tf, docs_idf


def word_tfidf_vector(word, tf_df, idf_df):
    '''
    Input arguments:
        word   : a query string
        tf_df  : tf as a DataFrame
        idf_df : idf as a DataFrame
    Output arguments:
        tf_idf_value : a numpy array of dimension 1xN
    '''
    tf_idf_value = None
    return tf_idf_value
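One possible way to fill in the two functions is sketched below. It is not the reference solution: the regex-based `tokenize` helper, the term-by-document layout of the DataFrames, the `"idf"` column name, and the choice to return a zero vector for an unseen word are all assumptions made for this example.

```python
import re
import numpy as np
import pandas as pd

def tokenize(text):
    # Hypothetical cleaning step: lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def compute_tfidf_weights(train_docs):
    tokenized = [tokenize(doc) for doc in train_docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Raw counts: one row per term, one column per document.
    counts = pd.DataFrame(0, index=vocab, columns=range(len(train_docs)), dtype=float)
    for j, doc in enumerate(tokenized):
        for tok in doc:
            counts.at[tok, j] += 1
    docs_tf = np.log10(counts + 1)                 # tf = log10(count + 1)
    df = (counts > 0).sum(axis=1)                  # documents containing each term
    docs_idf = np.log10(len(train_docs) / df).to_frame(name="idf")
    return docs_tf, docs_idf

def word_tfidf_vector(word, tf_df, idf_df):
    n_docs = tf_df.shape[1]
    key = word.lower()
    if key not in tf_df.index:
        # Assumption: an out-of-vocabulary query gets an all-zero vector.
        return np.zeros((1, n_docs))
    return (tf_df.loc[key].to_numpy() * idf_df.at[key, "idf"]).reshape(1, n_docs)

docs_tf, docs_idf = compute_tfidf_weights(["The cat sat.", "The dog sat!", "The cat ran fast"])
cat_vec = word_tfidf_vector("cat", docs_tf, docs_idf)
```

Here the returned array has one entry per training document, which matches the 1xN shape requested in the docstring when N is read as the number of documents.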
Problem 2 Word embedding as features for classification
Task
Implement a sentiment classifier based on Twitter data to analyse the sentiments of COVID-19 tweets.
Train and test multiple classification models using the necessary libraries, with sentence embeddings of the tweets as features.
Report the accuracy and F1 score (micro- and macro-averaged) for each classifier and discuss the differences.
Dataset
The dataset is provided in the first code chunk of the assignment. You are required to use the original tweet text for this classification task.
Tweet representation
After necessary pre-processing of the tweets, convert the words into their embeddings, then take the mean of all the word vectors in a tweet to end up with a single vector representing each tweet. The tweet vector is then used for sentiment classification.
In the process of finding the embeddings for each word, you can ignore out-of-vocabulary words.
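The averaging step described above can be sketched as follows. The tiny `embeddings` dictionary stands in for a real pre-trained table (for example word2vec or GloVe vectors), and both the function name `tweet_vector` and the zero-vector fallback for tweets with no known words are assumptions made for this example.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table; in practice this would be
# loaded from pre-trained word vectors.
embeddings = {
    "stay": np.array([1.0, 0.0, 2.0]),
    "safe": np.array([3.0, 2.0, 0.0]),
}

def tweet_vector(tokens, embeddings, dim):
    """Mean of the word vectors of the in-vocabulary tokens."""
    # Out-of-vocabulary tokens are skipped, as the task allows.
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        # Assumption: a tweet with no known words maps to the zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = tweet_vector(["stay", "safe", "covid19"], embeddings, dim=3)
```

Stacking one such vector per tweet gives a feature matrix that either classifier can consume.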
Classifier choice
You are required to implement the following TWO classifiers:
One traditional classification model (not a neural-network-based model)
One classifier based on any neural-network-based model.
You can use PyTorch/TensorFlow/scikit-learn to implement your classifier. However, you are free to develop a classifier from scratch.
Your answer must include the following:
Code for data loading, data pre-processing, training, and testing of the models.
A discussion comparing the classifiers in terms of accuracy and F1 score.
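To make the requested metrics concrete, here is a minimal from-scratch sketch of accuracy and micro- and macro-averaged F1 for single-label predictions. The function names and the toy labels are hypothetical; in practice `sklearn.metrics.accuracy_score` and `sklearn.metrics.f1_score` with `average="micro"` or `average="macro"` provide the same quantities.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# Toy labels (hypothetical), not taken from the assignment dataset.
y_true = ["pos", "pos", "neg", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neu", "neg", "pos"]
acc = accuracy(y_true, y_pred)
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

When every tweet has exactly one label, micro-averaged F1 equals accuracy, so macro-averaged F1 is usually the more informative number when the sentiment classes are imbalanced.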
