Question
You have been provided with a starter notebook that reads a collection of tweets and a collection of news articles.
You need to determine which news articles (news_df) are similar to each other and which tweets (tweets_df) are similar to each other. To accomplish this, create n-grams and compare the similarity of the texts using Jaccard distance.
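The two building blocks named above, n-grams and Jaccard distance, can be sketched with the standard library alone. This is a minimal illustration, assuming word-level n-grams split on whitespace; the function names `word_ngrams` and `jaccard_distance` are hypothetical helpers, not part of the starter notebook.

```python
def word_ngrams(text, n):
    """Return the set of word-level n-grams (as tuples) found in text."""
    tokens = text.split()
    # A text shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A & B| / |A | B|.

    Assumption: two empty sets are treated as identical (distance 0.0).
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


doc_a = "the quick brown fox jumps"
doc_b = "the quick brown dog jumps"
bigrams_a = word_ngrams(doc_a, 2)
bigrams_b = word_ngrams(doc_b, 2)

# 2 shared bigrams out of 6 distinct ones, so the distance is 1 - 2/6.
distance = jaccard_distance(bigrams_a, bigrams_b)
print(round(distance, 3))
```

A distance of 0 means the two n-gram sets are identical and 1 means they share nothing, so lower values indicate more similar texts.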
Additional instructions:
- To get quality results, apply appropriate text cleaning methods
- Your submission must be a Python Notebook (ipynb)
- Use 'Markdown' in the cell to document your answers / provide comments as needed
- Visualize your results instead of writing about them. Remember, a picture is worth a thousand words
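One possible reading of "appropriate text cleaning" is sketched here. The specific steps (lowercasing, dropping URLs, @mentions, the "RT" marker, and ASCII punctuation) are assumptions about what suits tweets and news text, not requirements from the assignment, and `clean_text` is a hypothetical helper name.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
# Translation table that deletes ASCII punctuation only; curly quotes and
# other non-ASCII symbols would need extra handling.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip URLs, mentions, retweet markers and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


sample = "RT @user: Check THIS out!! https://example.com/a?b=1 #NLP"
cleaned = clean_text(sample)
print(cleaned)
```

Whether to also remove stop words or apply stemming is a judgment call: both shrink the n-gram sets and can change which texts look similar, so it is worth comparing results with and without them.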
Your final submission must include the following:
- Which news articles / tweets were similar and which ones were dissimilar?
- A brief write-up explaining why and how you chose "n" for your analysis (for n-grams)
- Was the "n" identical or different for articles vs. tweets, and why?
- Visualize the selection of "n"
- For the news articles, please explain why you chose text or title or both text + title combined
- Include all of your program code (creating n-grams from text as well as selecting the "n" for analysis)
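One way to approach the selection of "n" is to sweep several values and watch how the average pairwise Jaccard distance changes. The sketch below assumes word-level n-grams and uses three made-up sentences in place of the real data; `mean_pairwise_distance` is a hypothetical helper, and the mean distance is only one of several criteria that could be used.

```python
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word-level n-grams (as tuples) found in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """Jaccard distance; two empty sets are treated as distance 0.0."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def mean_pairwise_distance(texts, n):
    """Average Jaccard distance over all pairs of texts (needs >= 2 texts)."""
    sets = [word_ngrams(t, n) for t in texts]
    pairs = list(combinations(sets, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Made-up stand-ins for cleaned article or tweet texts.
toy_texts = [
    "markets rallied after the central bank held rates steady",
    "markets rallied after the central bank cut rates sharply",
    "the local team won the championship game last night",
]

distance_by_n = {n: mean_pairwise_distance(toy_texts, n) for n in range(1, 5)}
for n, d in distance_by_n.items():
    print(n, round(d, 3))
```

The resulting `distance_by_n` dictionary can be drawn as a line chart (n on the x-axis, mean distance on the y-axis) to visualize the selection. As n grows, the distances tend to drift toward 1 because long n-grams rarely repeat, so a reasonable choice is often the largest n at which similar texts are still clearly separated from dissimilar ones. Short tweets may reach that point at a smaller n than long articles.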
Here's my code so far:
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json'
news_df = pd.read_json(news_path, orient='records', lines=True)
print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)
tweets_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_3_tweets.json'
tweets_df = pd.read_json(tweets_path, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)
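Once the data is loaded, the question of which items are similar and which are dissimilar comes down to ranking all pairs by distance. The following is a rough sketch that uses a small made-up list of strings rather than the real DataFrames, since the actual column names are not shown here; `rank_pairs` is a hypothetical helper, and n=2 is an arbitrary choice for illustration.

```python
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word-level n-grams (as tuples) found in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """Jaccard distance; two empty sets are treated as distance 0.0."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def rank_pairs(texts, n):
    """Return (distance, i, j) for every pair of texts, most similar first."""
    sets = [word_ngrams(t, n) for t in texts]
    scored = [
        (jaccard_distance(sets[i], sets[j]), i, j)
        for i, j in combinations(range(len(texts)), 2)
    ]
    return sorted(scored)


# Made-up stand-ins; in the notebook these would come from a cleaned
# text column of news_df or tweets_df (column name is an assumption).
texts = [
    "markets rallied after the central bank held rates steady",
    "markets rallied after the central bank cut rates sharply",
    "the local team won the championship game last night",
]

ranked = rank_pairs(texts, n=2)
most_similar = ranked[0]
most_dissimilar = ranked[-1]
print("most similar pair:", most_similar)
print("most dissimilar pair:", most_dissimilar)
```

Comparing every pair is quadratic in the number of texts, so for a large collection it may be worth working on a sample first. The same list of distances can also be reshaped into a square matrix and shown as a heatmap.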
Step by Step Solution
Step: 1
Import the necessary libraries and load the data: import pandas as pd, then load the news and tweets data (news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json', news_df = pd.re...)