
Question


In this question, you need to complete the tasks listed below. For these tasks, use the "Health News in Twitter" dataset from the UC Irvine Machine Learning Repository.
1. Load the dataset into a pandas DataFrame; use only foxnewshealth.txt and cnnhealth.txt for the following tasks. Use only the tweets column for all further tasks.
2. Using regular expressions, extract the hashtags from the tweets and store them. Keep a counter for each hashtag.
3. Sentence-tokenize each tweet, then word-tokenize each sentence. Keep a counter for each word and sort the counts in descending order. Notice that there are many special characters and repeated words that differ only in case, like 'New' vs 'new'.
4. Clean the tweets by removing special characters and unwanted strings such as numbers, HTTP links, etc. Use regular expressions to do so. Normalize the case of all characters in all tweets.
5. Remove stopwords from all the text examples. You may use any library, such as spaCy or NLTK. Repeat task 3 and notice the difference.
6. Lemmatize the words from step 5 to their base form using either NLTK or spaCy. Apply stemming to the same words from step 5 and observe the difference.
7. Maintain a corpus of all the words that appeared in the given files.
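
As a rough starting point, tasks 1 to 5 and 7 could be sketched as below. This is only an illustration under several assumptions, not a verified solution: it assumes each line of the dataset files has the form `id|date|tweet`; the helper names (`load_tweets`, `count_hashtags`, `clean_tweet`, `count_words`), the two sample tweets, and the tiny `STOPWORDS` set are all made up for the example; and plain whitespace splitting stands in for the NLTK or spaCy tokenizers and stopword lists the tasks ask for. Hashtags are lowercased before counting, which is a choice the task does not specify.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")            # a '#' followed by word characters
URL_RE = re.compile(r"https?://\S+")        # HTTP/HTTPS links
NON_LETTER_RE = re.compile(r"[^a-z\s]")     # anything except lowercase letters and whitespace

# Tiny illustrative stopword set (assumption); a real run would use NLTK's or spaCy's list.
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "in", "for", "and", "is", "on"})

# Invented sample tweets, used only so the sketch runs without the dataset.
SAMPLE_TWEETS = [
    "New study links sugar to heart risk http://example.com/a1 #Health",
    "new #health tips for 2015: sleep more!",
]


def load_tweets(paths):
    """Task 1: read files assumed to be 'id|date|tweet' per line into a DataFrame."""
    import pandas as pd  # imported lazily so the rest of the sketch runs without pandas

    rows = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                parts = line.rstrip("\n").split("|", 2)  # the tweet itself may contain '|'
                if len(parts) == 3:
                    rows.append(parts)
    return pd.DataFrame(rows, columns=["id", "date", "tweet"])


def count_hashtags(tweets):
    """Task 2: count hashtags, lowercased so '#Health' and '#health' merge."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet))
    return counts


def clean_tweet(tweet):
    """Task 4: lowercase, drop links, then drop digits and special characters."""
    text = URL_RE.sub(" ", tweet.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def count_words(tweets, stopwords=frozenset()):
    """Tasks 3 and 5: count whitespace-separated words, skipping any stopwords."""
    counts = Counter()
    for tweet in tweets:
        counts.update(word for word in tweet.split() if word not in stopwords)
    return counts


hashtag_counts = count_hashtags(SAMPLE_TWEETS)
cleaned = [clean_tweet(t) for t in SAMPLE_TWEETS]
word_counts = count_words(cleaned, STOPWORDS)
corpus = sorted(word_counts)  # Task 7: the distinct words seen

print(hashtag_counts.most_common())  # most_common() sorts in descending order of count
print(word_counts.most_common(5))
```

With the real data, the same functions would be applied to `load_tweets(["foxnewshealth.txt", "cnnhealth.txt"])["tweet"]`. Task 6 is not covered here; lemmatization and stemming would need NLTK or spaCy on top of the cleaned words.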
