
Question


1 Introduction
In this assignment, you will learn various ways to normalize text data using libraries like
NLTK and SpaCy. Text normalization converts textual data to a more standard form, which
can then be converted to the numerical form used by the machine learning models you will
implement in later assignments. Normalization tasks include sentence and word tokenization,
removal of stop words and special characters, stemming, and lemmatization. In this
assignment, you will clean a dataset of healthcare-related tweets.
You will also learn about regular expressions (regex) and implement them in Python. A regex
can be used to extract, replace, or verify a string according to specified rules or patterns.
In Python, regex support is provided by the re package [1]. In this assignment, you will use
regex to extract information such as hashtags from tweets and to remove unwanted data such
as special characters.
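As a rough illustration of the three regex uses named above (extract, replace, verify), here is a minimal sketch using only the standard re package. The sample tweets, the patterns, and the helper name is_hashtag are assumptions made for this example and are not part of the assignment; lowercasing the hashtags before counting is also just one possible choice.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the real ones come from the dataset files.
tweets = [
    "New #Ebola case confirmed in Texas http://example.com/a1 #health",
    "5 tips for better sleep! #Health",
]

HASHTAG = re.compile(r"#\w+")        # '#' followed by letters, digits, or '_'
URL = re.compile(r"https?://\S+")    # an http(s) link up to the next space

# Extract: collect every hashtag, lowercased so #Health and #health match.
hashtag_counts = Counter(
    tag.lower() for tweet in tweets for tag in HASHTAG.findall(tweet)
)

# Replace: drop links from each tweet.
no_links = [URL.sub("", tweet).strip() for tweet in tweets]

# Verify: check that a whole string is a single well-formed hashtag.
def is_hashtag(text):
    return HASHTAG.fullmatch(text) is not None

print(hashtag_counts.most_common())
print(no_links)
```

The pattern `#\w+` is deliberately simple; real tweets may need a stricter or looser definition of a hashtag.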
Next, you will learn about the minimum edit distance and use it to implement a spell-checker.
The minimum edit distance between two strings is the minimum number of editing operations
(insertions, deletions, and substitutions) needed to transform the target string into the
source string. It therefore measures the similarity between two strings: the smaller the
distance, the closer the strings are. One use case is a spell-checker, for which you will
implement a dynamic programming approach that calculates the distance between the target
string and every word in the dictionary to find the most similar words. Please follow all the
instructions in Section 2 when solving the questions in Section 3.
2 Instructions
Each question is labelled as Code or Written and the guidelines for each type are provided
below.
Code
This section needs to be completed using Python 3.6+. You will also require the following packages:
pandas
numpy
NLTK or SpaCy
[1] Documentation for the package re.
If you want to use an external package for any reason, you are required to get approval
from the course staff prior to submission.
Written
In this section, you will need to write your answers on paper, in Microsoft Word, or in LaTeX,
and convert them to PDF format.
To submit your solution for Written questions, you need to provide answers to the
following questions in a single PDF.
Q2
Before submission, ensure that all pages of your solution are present and in order. Submit
this PDF on Gradescope under Assignment 1- Written. Please match all questions to their
respective solutions (pages) on Gradescope. Questions not associated with any pages will be
considered blank or missing and all questions need to be completed to receive full credit.
3 Questions
Q1. Normalize the text data: Code [20]
In this question, you need to complete the tasks listed below. For these tasks, use the "Health
News in Twitter" dataset from the UC Irvine Machine Learning Repository [2].
1. Load the dataset into a pandas DataFrame; use only foxnewshealth.txt and cnnhealth.txt
for the following tasks. Use only the tweets column for further tasks.
2. Using regular expressions, extract hashtags in the tweets and store them. Keep a counter
for each hashtag.
3. Then sentence tokenize each tweet and then word tokenize each sentence. Keep a counter
of each word and sort them in descending order. Notice that there are a lot of special
characters and repeated words with different cases, like New vs new.
4. Clean the tweets by removing the special characters and unwanted strings like numbers,
HTTP links, etc. Use regular expressions to do so. Normalize the case of all the characters
in all the tweets.
5. Remove stop words from all the text examples. You may use any library, such as SpaCy or
NLTK. Repeat task 3 and notice the difference.
6. Lemmatize the words from step 5 to their base form using either NLTK or SpaCy. Apply
stemming to the same words from step 5 and observe the difference.
7. Maintain a corpus of all the words that appeared in the given files.
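The cleaning and counting steps above (tasks 3 to 5) can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for the NLTK or SpaCy tokenizers, STOPWORDS is a tiny hypothetical list rather than a library's stop word list, and the function names clean_tweet and word_counts and the sample tweets are invented for the example.

```python
import re
from collections import Counter

# Tiny hypothetical stop word set; in the assignment this would come from
# NLTK or SpaCy instead.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "for", "is", "and", "on"}

def clean_tweet(text):
    """Lowercase a tweet and strip links, numbers, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # HTTP links
    text = re.sub(r"[^a-z\s]", " ", text)       # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

def word_counts(tweets, remove_stopwords=False):
    """Count words over all tweets, optionally dropping stop words."""
    counts = Counter()
    for tweet in tweets:
        words = clean_tweet(tweet).split()      # stand-in for a real tokenizer
        if remove_stopwords:
            words = [w for w in words if w not in STOPWORDS]
        counts.update(words)
    return counts

# Hypothetical sample tweets.
tweets = [
    "New study: the flu shot is safe for 9 in 10 adults http://example.com/x",
    "Is the NEW flu vaccine safe? #flu",
]
raw = word_counts(tweets)
filtered = word_counts(tweets, remove_stopwords=True)

# most_common() returns (word, count) pairs sorted in descending order.
print(raw.most_common(5))
print(filtered.most_common(5))
```

Because clean_tweet removes the `#` symbol, hashtags would need to be extracted before cleaning (task 2) if they are to be counted separately.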
Q2. Normalize the text data: Written [5]
Write the answer to the following question:
[2] Link to the dataset.
1. Based on the above results, analyze and discuss if there is any pattern to the occurrence
of the words and hashtags in the news outlets.
Q3. Minimum edit distance: Code [25]
In this question, the following tasks need to be completed.
1. Implement a method, min_edit_dist(target, source, ins_cost, del_cost, sub_cost),
to compute the minimum edit distance between a target word and source word. The
variables ins_cost, del_cost, sub_cost are insertion, deletion, and substitution costs
respectively.
2. Implement a method, spell_checker(target), which will access the corpus generated
in Q1.7 and calculate the minimum edit distance of the target word with all the words in
the corpus. Assume insertion and deletion cost as 1 and substitution cost as 0. Return
the five words with the smallest distances.
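One possible shape for these two functions is sketched below using the standard dynamic programming table. It is a sketch under assumptions, not a reference solution: the corpus is passed in as an explicit argument (the task statement has spell_checker read it from Q1.7), the parameters corpus and top_n are hypothetical additions, and ties are broken alphabetically so the output is deterministic.

```python
def min_edit_dist(target, source, ins_cost=1, del_cost=1, sub_cost=0):
    """Cost of the cheapest edit sequence turning target into source."""
    rows, cols = len(target) + 1, len(source) + 1
    # dist[i][j] = cost of turning target[:i] into source[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                 # delete every target character
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):                 # insert every source character
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = target[i - 1] == source[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                      # deletion
                dist[i][j - 1] + ins_cost,                      # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost), # substitution
            )
    return dist[-1][-1]

def spell_checker(target, corpus, ins_cost=1, del_cost=1, sub_cost=0, top_n=5):
    """Return the top_n corpus words closest to target (ties: alphabetical)."""
    scored = [
        (min_edit_dist(target, word, ins_cost, del_cost, sub_cost), word)
        for word in set(corpus)
    ]
    return [word for _, word in sorted(scored)[:top_n]]

# Hypothetical corpus standing in for the one built in Q1.7.
corpus = ["health", "heart", "help", "world", "healthy"]
print(spell_checker("helth", corpus))              # costs stated in the task
print(spell_checker("helth", corpus, sub_cost=2))  # a stricter alternative
```

With insertion and deletion costs of 1 and a substitution cost of 0, every mismatch can be substituted for free, so the distance reduces to the difference in string lengths. The second call uses a substitution cost of 2 purely to show how the ranking changes when substitutions are penalized.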
