Question

We will use one full day's worth of tweets as our input (there are a total of 4.4M tweets in this file).
Execute and time the following tasks with 110,000 tweets and 550,000 tweets:
a. Use Python to download tweets from the web and save them to a local text file (not into a database yet, just to a text file). This is as simple as it sounds: all you need is a for-loop that reads lines from the web and writes them into a file.
NOTE: Do not call read() or readlines(). Those calls attempt to read the entire file, which is too much data. Clicking on the link in the browser would cause the same problem.
b. Repeat what you did in part 1-a, but instead of saving tweets to the file, populate the 3-table schema that you previously created in SQLite. Be sure to execute commit and verify that the data has been successfully loaded. Report the loaded row counts for each of the 3 tables by running a SELECT DISTINCT on the primary key of each table. Additionally, report the runtime of finding the number of rows for each table.
NOTE: If your schema contains a foreign key in the Geo table or relies on TweetID as the primary key for the Geo table, you should change your schema. Geo entries should be identified based on the location they represent. There should not be any blank Geo entries such as (ID, None, None, None). The easiest way to create an ID is by combining lon_lat into a primary key.
c. Use your locally saved tweet file to repeat the database population step from part 1-b. That is, load the tweets into the 3-table database using your saved file of tweets. This is the same code as in 1-b, but reading tweets from your file, not from the web. Time the code used to run this step and report the result.
d. Repeat the same step with a batch size of 2,000 (i.e. by inserting 2,000 rows at a time with executemany instead of doing individual inserts). Since many of the tweets are missing a Geo location, it's fine for the batches of Geo inserts to be smaller than 2,000.
e. Plot the resulting runtimes (# of tweets versus runtime) using matplotlib for 1-a, 1-b, 1-c, and 1-d. How do the runtimes compare?
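
The building blocks for parts a through d can be sketched roughly as follows. This is a minimal sketch, not a full solution. The question does not give the download URL, the table names, or the column layout, so the `url` argument, the helper names (`save_tweets`, `download_tweets`, `geo_id`, `insert_in_batches`, `timed`) and the SQL passed in by the caller are all assumptions made for illustration.

```python
import sqlite3
import time
import urllib.request
from itertools import islice


def save_tweets(lines, out_path, limit):
    """Copy up to `limit` lines (bytes) from an iterable into a file, one at a time."""
    count = 0
    with open(out_path, "wb") as out:
        for line in islice(lines, limit):
            out.write(line)
            count += 1
    return count


def download_tweets(url, out_path, limit):
    """Stream tweets from `url` line by line; never calls read() or readlines()."""
    with urllib.request.urlopen(url) as response:
        # The HTTP response object can be iterated over lazily, one line per step.
        return save_tweets(response, out_path, limit)


def geo_id(lon, lat):
    """Build a Geo primary key from the location itself (assumed 'lon_lat' format)."""
    return f"{lon}_{lat}"


def insert_in_batches(conn, sql, rows, batch_size=2000):
    """Send rows to SQLite with executemany, `batch_size` rows at a time."""
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()  # commit once at the end so the load is actually persisted
    return total


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) for the runtime report."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

A call such as `timed(download_tweets, url, "tweets.txt", 110_000)` would give the count and runtime for part a, with `url` standing in for the real tweet source. For the Geo table, rows keyed with `geo_id(lon, lat)` can be inserted with a statement like `INSERT OR IGNORE INTO Geo VALUES (?, ?, ?)`, assuming a three-column Geo table, so that repeated locations do not violate the primary key. Tweets without a location should be skipped rather than inserted as blank rows.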
