
Question


You have been provided with a starter notebook that reads a collection of tweets and a collection of news articles.

You need to determine which news articles (news_df) are similar to each other and which tweets (tweets_df) are similar to each other. To accomplish this, you need to create n-grams and compare the similarity of the text using Jaccard distance.
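As a minimal sketch of the technique named above, the snippet below builds word-level n-grams as sets and computes the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, between two short strings. The helper names `word_ngrams` and `jaccard_distance` are hypothetical, and word-level (rather than character-level) n-grams are an assumption; the assignment does not say which to use.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in `text`."""
    tokens = text.lower().split()
    # A text with fewer than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # assumed convention: two empty sets count as identical
    return 1.0 - len(a & b) / len(a | b)


doc1 = "the cat sat on the mat"
doc2 = "the cat sat on the rug"

# The two sentences share 4 of their 6 distinct bigrams.
d = jaccard_distance(word_ngrams(doc1, 2), word_ngrams(doc2, 2))
print(f"Bigram Jaccard distance: {d:.3f}")
```

A distance of 0 means the two n-gram sets are identical and 1 means they share nothing, so lower values indicate more similar texts.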

Additional instructions:

  • To get quality results, apply appropriate text cleaning methods
  • Your submission must be a Python Notebook (ipynb)
  • Use 'Markdown' in the cell to document your answers / provide comments as needed
  • Visualize your results instead of writing about them. Remember, a picture is worth a thousand words
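The cleaning step in the instructions above could look something like the following sketch. The function name `clean_text` and the specific rules (lowercasing, dropping URLs, @mentions, a leading retweet marker, and punctuation) are assumptions about what "appropriate" means for this data, not requirements stated in the assignment. Note that the punctuation rule also drops non-ASCII letters, which assumes English text.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text):
    """Normalize a tweet or article so that n-grams compare on content only."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # links add noise, not content
    text = MENTION_RE.sub(" ", text)    # drop @user handles
    text = re.sub(r"^rt\s+", "", text.strip())  # drop a leading retweet marker
    text = NON_ALNUM_RE.sub(" ", text)  # punctuation and '#' become spaces
    return " ".join(text.split())       # collapse repeated whitespace


print(clean_text("RT @user: Check this out!! https://t.co/abc #NLP"))
```

Hashtag words are kept (only the `#` is removed) because they often carry the topic of a tweet; whether to also remove stop words is a separate judgment call that interacts with the choice of n.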

Your final submission must include the following:

  • Which news articles / tweets were similar and which ones were dissimilar?
  • A brief write-up explaining why and how you chose "n" for your analysis (for n-grams)
  • Was the "n" identical or different for articles vs. tweets, and why?
  • Visualize the selection of "n"
  • For the news articles, please explain why you chose text or title or both text + title combined
  • Include all of your program codes (creating n-grams from text as well as selecting the "n" for analysis)
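One way to approach the "selecting n" items in the list above is to track how the mean pairwise Jaccard similarity changes as n grows, then plot that curve for articles and tweets separately. The sketch below is one such heuristic, not a prescribed method; all function names are hypothetical, and the three sample strings are made up for illustration. The plotting helper assumes matplotlib is installed and is imported lazily so the rest runs without it.

```python
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    """|A & B| / |A | B|, with 0.0 when both sets are empty (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(docs, n):
    """Average Jaccard similarity over all unordered document pairs."""
    sets = [word_ngrams(d, n) for d in docs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


def similarity_by_n(docs, n_values):
    """Map each candidate n to the mean pairwise similarity it produces."""
    return {n: mean_pairwise_similarity(docs, n) for n in n_values}


def plot_similarity_by_n(scores, label):
    """Line plot of mean similarity against n (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.plot(list(scores), list(scores.values()), marker="o", label=label)
    plt.xlabel("n (words per n-gram)")
    plt.ylabel("Mean pairwise Jaccard similarity")
    plt.legend()


# Made-up sample texts, for illustration only.
sample_docs = [
    "stocks rally as markets open higher",
    "stocks rally as markets close lower",
    "rain expected across the region today",
]
scores = similarity_by_n(sample_docs, range(1, 4))
for n, score in scores.items():
    print(f"n={n}: mean similarity {score:.3f}")
```

Small n tends to match on common words and inflate similarity, while large n can leave short texts such as tweets with few or no n-grams. Calling `plot_similarity_by_n` once per corpus on the same axes makes that difference between articles and tweets visible.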

Here's my code so far: 

import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

 

news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json'
news_df = pd.read_json(news_path, orient='records', lines=True)

print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

 

tweets_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_3_tweets.json'
tweets_df = pd.read_json(tweets_path, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)
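A possible next step from the starter code is to rank every pair of texts by Jaccard distance, so the most and least similar pairs can be read off directly. The sketch below works on a plain list of strings; the name `rank_pairs` is hypothetical, and the column name in the usage note is an assumption based on the question's mention of article "text" and "title".

```python
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|, with 0.0 when both sets are empty (assumed convention)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def rank_pairs(texts, n):
    """Return (distance, i, j) for every pair i < j, most similar first."""
    sets = [word_ngrams(t, n) for t in texts]
    scored = [
        (jaccard_distance(sets[i], sets[j]), i, j)
        for i, j in combinations(range(len(sets)), 2)
    ]
    return sorted(scored)


# Tiny made-up example; i and j are positions in the input list.
ranked = rank_pairs(["a b c d", "a b c e", "x y z w"], n=2)
for dist, i, j in ranked:
    print(f"texts {i} and {j}: distance {dist:.2f}")
```

With the notebook's data this might be called as `rank_pairs(news_df["text"].tolist(), n=3)`, assuming a `text` column exists. The number of pairs grows quadratically with the number of texts, so working on a subsample first is advisable for large collections.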

Step by Step Solution


Step: 1

Import the necessary libraries and load the data:

import pandas as pd

# Load news and tweets data
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_3_news.json'
news_df = pd.re...


