Question

Posted on Aug 10, 2023

You have been provided with a starter notebook that reads a collection of tweets and a collection of news articles.

You need to determine which news articles (news_df) are similar to each other and which tweets (tweets_df) are similar to each other. To do this, create n-grams and compare the similarity of the texts using Jaccard distance.
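For reference, the two building blocks named above can be sketched in plain Python. This is a minimal illustration, not the assignment's required implementation: the helper names `word_ngrams` and `jaccard_distance` are made up here, word-level (rather than character-level) n-grams are an assumption, and two empty n-gram sets are arbitrarily treated as identical.

```python
def word_ngrams(text, n):
    """Return the set of word-level n-grams (as tuples) found in `text`."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        # Assumption: two empty sets count as identical (distance 0).
        return 0.0
    return 1.0 - len(a & b) / len(union)


doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the rug"

# The two sentences share 4 of their 6 distinct bigrams.
dist = jaccard_distance(word_ngrams(doc_a, 2), word_ngrams(doc_b, 2))
print(f"Jaccard distance for bigrams: {dist:.3f}")
```

A distance of 0 means the two n-gram sets are identical and 1 means they share nothing, so smaller values indicate more similar texts.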

Additional instructions:

To get quality results, apply appropriate text cleaning methods.
Your submission must be a Python Notebook (ipynb).
Use 'Markdown' cells to document your answers and provide comments as needed.
Visualize your results instead of writing about them. Remember, a picture is worth a thousand words.
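One possible shape for the text cleaning step is sketched below. Which steps count as "appropriate" depends on the data, so the choices here (lowercasing, dropping a leading retweet marker, URLs, and @mentions, then stripping punctuation) are assumptions rather than a prescribed recipe, and `clean_text` is a hypothetical helper name.

```python
import re
import string

# Patterns for noise that is common in tweets (assumed, not taken from the data).
RT_RE = re.compile(r"^rt\b\s*")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, drop retweet marker, URLs and mentions, strip punctuation."""
    text = text.lower()
    text = RT_RE.sub("", text)            # leading "RT" retweet marker
    text = URL_RE.sub(" ", text)          # links carry little lexical signal
    text = MENTION_RE.sub(" ", text)      # user handles
    text = text.translate(PUNCT_TABLE)    # also removes '#', keeping the hashtag word
    return " ".join(text.split())         # collapse repeated whitespace


sample = "RT @user: Check THIS out!! https://t.co/abc123 #NLP"
cleaned = clean_text(sample)
print(cleaned)
```

Cleaning before building n-grams matters for Jaccard distance because differences in case, links, or punctuation would otherwise produce distinct n-grams for texts that say the same thing.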

Your final submission must include the following:

Which news articles / tweets were similar and which ones were dissimilar?
A brief write-up explaining why and how you chose "n" for your analysis (for n-grams).
Was the "n" identical or different for articles vs. tweets, and why?
Visualize the selection of "n".
For the news articles, please explain why you chose text, title, or text + title combined.
Include all of your program code (creating n-grams from text as well as selecting the "n" for analysis).
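One way to approach the selection of "n" is to compute a summary statistic for several candidate values and plot it. The sketch below uses the mean pairwise Jaccard distance on a toy corpus; that statistic, the helper names, and the toy sentences are all assumptions made for illustration. The plotting helper needs matplotlib and is defined but not called here.

```python
from itertools import combinations
from statistics import mean


def word_ngrams(text, n):
    """Set of word-level n-grams (as tuples) in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(docs, n):
    """Average Jaccard distance over all document pairs for a given n."""
    grams = [word_ngrams(d, n) for d in docs]
    return mean(jaccard_distance(a, b) for a, b in combinations(grams, 2))


def distance_by_n(docs, n_values=(1, 2, 3, 4)):
    """Map each candidate n to the mean pairwise distance it produces."""
    return {n: mean_pairwise_distance(docs, n) for n in n_values}


def plot_distance_by_n(scores):
    """Line plot of mean distance against n (requires matplotlib)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.plot(list(scores), list(scores.values()), marker="o")
    ax.set_xlabel("n (words per n-gram)")
    ax.set_ylabel("mean pairwise Jaccard distance")
    return ax


# Toy corpus standing in for cleaned article or tweet text.
toy_docs = [
    "stocks rally as inflation cools",
    "stocks rally as rates hold steady",
    "local team wins championship game",
]
scores = distance_by_n(toy_docs)
for n, score in scores.items():
    print(n, round(score, 3))
```

On this toy corpus the mean distance rises with n until every pair is maximally distant, which illustrates the trade-off: a small n finds overlap easily, while a large n leaves short texts such as tweets with almost no shared n-grams. Running the same calculation separately on articles and tweets is one way to justify using a different n for each.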

Here's my code so far:

import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json'
news_df = pd.read_json(news_path, orient='records', lines=True)

print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

tweets_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_3_tweets.json'
tweets_df = pd.read_json(tweets_path, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)
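Once the data is loaded, one way to answer "which items are similar and which are dissimilar" is to rank every pair by Jaccard distance. The sketch below works on a plain list of strings so it runs without the download; applying it to the DataFrames would require knowing the column names, which the starter code does not show, so the `'text'` column mentioned in the comment is an assumption. `rank_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def word_ngrams(text, n):
    """Set of word-level n-grams (as tuples) in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def rank_pairs(docs, n):
    """Return (distance, i, j) for every pair of documents, closest first."""
    grams = [word_ngrams(d, n) for d in docs]
    pairs = [
        (jaccard_distance(grams[i], grams[j]), i, j)
        for i, j in combinations(range(len(docs)), 2)
    ]
    return sorted(pairs)


# Stand-in for something like news_df['text'].tolist() (column name assumed).
toy_docs = [
    "stocks rally as inflation cools",
    "stocks rally as rates hold steady",
    "local team wins championship game",
]
ranked = rank_pairs(toy_docs, n=2)
print("most similar pair:", ranked[0])
print("least similar pair:", ranked[-1])
```

Comparing all pairs is quadratic in the number of documents, so for a large sample it may be worth working on a subset first. The sorted list can also be reshaped into a matrix and drawn as a heatmap to satisfy the visualization requirement.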

Step by Step Solution

There are 3 Steps involved in it

Step: 1

Solution with steps

Step 1: Import the necessary libraries and load the data

import pandas as pd

# Load news and tweets data
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json'
news_df = pd.re...