Question
In this question, you need to complete the tasks listed below. For these tasks, use the "Health News in Twitter" dataset from the UC Irvine Machine Learning Repository.
Load the dataset into a pandas DataFrame, using only foxnewshealth.txt and cnnhealth.txt. Use only the tweets column for the remaining tasks.
Using regular expressions, extract hashtags in the tweets and store them. Keep a counter for each hashtag.
Then sentence tokenize each tweet and then word tokenize each sentence. Keep a counter of each word and sort them in descending order. Notice that there are a lot of special characters and repeated words with different cases, like 'New' vs 'new'.
Clean the tweets by removing special characters and unwanted strings such as numbers and HTTP links, using regular expressions. Normalize the case of all characters in all the tweets.
Remove stopwords from all the text examples. You may use any library, such as SpaCy or NLTK. Repeat task and notice the difference.
Lemmatize the words in step to their base form using either NLTK or SpaCy. Apply stemming to the same words in and observe the difference.
Maintain a corpus of all the words that appeared in the given files.
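The regular-expression parts of these tasks (hashtag extraction, cleaning, case normalization, word counting, and the corpus) can be sketched with the Python standard library alone. This is a minimal sketch, not a full solution: it assumes the tweets have already been pulled out of the DataFrame's tweet column into a list of strings, it uses plain whitespace splitting in place of the NLTK or SpaCy tokenizers the task asks for, and it leaves out stopword removal, lemmatization, and stemming. The function names, the patterns, and the two sample tweets are illustrative and are not taken from the dataset.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not prescribed by the task):
HASHTAG_RE = re.compile(r"#\w+")          # a '#' followed by word characters
URL_RE = re.compile(r"https?://\S+")      # http/https links up to the next space
NON_LETTER_RE = re.compile(r"[^a-z\s]")   # anything that is not a-z or whitespace


def count_hashtags(tweets):
    """Count every hashtag across the tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet))
    return counts


def clean_tweet(tweet):
    """Lowercase a tweet, then strip links, digits and special characters."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def count_words(tweets):
    """Count words after cleaning; whitespace split stands in for a tokenizer."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_tweet(tweet).split())
    return counts


# Made-up sample tweets standing in for the DataFrame's tweet column.
sample_tweets = [
    "New #Ebola case confirmed in Texas http://example.com/a1",
    "5 tips for a new you in 2015! #health #ebola",
]

hashtag_counts = count_hashtags(sample_tweets)
word_counts = count_words(sample_tweets)
corpus = sorted(word_counts)  # every distinct cleaned word, alphabetically

print(hashtag_counts.most_common())  # hashtags, most frequent first
print(word_counts.most_common(3))    # top three words after cleaning
```

Because the cleaning step lowercases before counting, 'New' and 'new' fall into the same bucket, which is the difference the task asks you to notice when the word count is repeated after cleaning.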