Question
The Internet Archive is a digital archive that makes snapshots (local copies) of a huge number of Web pages, archiving the Internet for future generations. It has currently collected more than 400 billion snapshots of Web pages; very dynamic pages such as the front page of the New York Times website or the Volkskrant website are crawled several times per day. Other Web pages are crawled only once a year. Assume you are tasked by the Internet Archive to provide the following services for their data (25p):
* All snapshots crawled within the past 30 days should be directly accessible to users: it should take merely seconds between a user submitting a URL and the Internet Archive returning the list of snapshots taken during the past 30 days.
** Requesting older snapshots requires more patience: a user can submit a list of URLs and a time frame of interest (e.g. between March 1, 2009 and March 15, 2009), and within a few hours or at most days the Internet Archive should return the requested snapshots.
*** On the website of the Internet Archive, live statistics should be shown, indicating the number of URL requests issued by users today and in the past month, the number of snapshots crawled today, and the number of older snapshot requests currently being processed.

Given your knowledge of big data (and small-data) technologies, discuss which technologies you would use to provide these three services.
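To make the first requirement concrete, the query it describes can be sketched as a keyed lookup restricted to a rolling 30-day window. The snippet below is only an in-memory illustration of that access pattern, not a proposed architecture: the class name `RecentSnapshotIndex` and its methods are hypothetical, and at the scale of 400 billion snapshots the same lookup would have to be served by a distributed store.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime, timedelta


class RecentSnapshotIndex:
    """Hypothetical in-memory index: URL -> sorted crawl timestamps."""

    def __init__(self):
        # Each URL maps to a list of crawl times kept in ascending order.
        self._by_url = defaultdict(list)

    def add(self, url, crawled_at):
        # insort keeps the per-URL list sorted as snapshots arrive.
        insort(self._by_url[url], crawled_at)

    def recent(self, url, now, days=30):
        # Binary-search for the cutoff, then return everything at or after it.
        times = self._by_url.get(url, [])
        cutoff = now - timedelta(days=days)
        return times[bisect_left(times, cutoff):]


index = RecentSnapshotIndex()
index.add("http://example.org/", datetime(2009, 1, 1))
index.add("http://example.org/", datetime(2009, 3, 10))
index.add("http://example.org/", datetime(2009, 3, 1))
recent = index.recent("http://example.org/", now=datetime(2009, 3, 15))
```

The point of the sketch is that the first service is a point lookup by URL over a small, recent slice of the data, whereas the second service scans arbitrary historical ranges for many URLs at once, which is why the question allows it hours or days.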