Question

Posted on Sep 26, 2024

The Internet Archive is a digital archive that makes snapshots (local copies) of a huge number of Web pages, archiving the Internet for future generations. It has currently collected more than 400 billion snapshots of Web pages; very dynamic pages, such as the front page of the New York Times website or the Volkskrant website, are crawled several times per day. Other Web pages are crawled only once a year. Assume you are tasked by the Internet Archive to provide the following services for their data (25p):

* All snapshots crawled within the past 30 days should be directly accessible to users: it should take merely seconds between a user submitting a URL and the Internet Archive returning the list of snapshots taken during the past 30 days.

** Requesting older snapshots requires more patience: a user can submit a list of URLs and a time frame of interest (e.g., …html between March 1, 2009 and March 15, 2009), and within a few hours, or at most days, the Internet Archive should return the requested snapshots.

*** On the website of the Internet Archive, live statistics should be shown, indicating the number of URL requests issued by users today and in the past month, the number of snapshots crawled today, and the number of older snapshot requests currently being processed.

Given your knowledge of big data (and small data) technologies, discuss which technologies you would use to provide these three services.
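
To make the three requirements concrete before choosing technologies, here is a minimal Python sketch of the access patterns they imply. It is not a proposed answer: plain in-memory structures stand in for whatever storage and processing systems are eventually chosen. The class name `ArchiveSketch`, its method names, and the reading of "past month" as the last 30 days are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timedelta


class ArchiveSketch:
    """Toy in-memory model of the three services (all names hypothetical)."""

    def __init__(self):
        self.snapshots = defaultdict(list)  # URL -> sorted crawl timestamps
        self.url_requests = []              # timestamps of user URL lookups
        self.pending_batches = 0            # older-snapshot requests in progress

    def add_snapshot(self, url, crawled_at):
        # Keep each URL's timestamps sorted so range lookups can use bisect.
        times = self.snapshots[url]
        times.insert(bisect_right(times, crawled_at), crawled_at)

    def _between(self, url, start, end):
        # All crawl timestamps for `url` in the closed interval [start, end].
        times = self.snapshots.get(url, [])
        return times[bisect_left(times, start):bisect_right(times, end)]

    def recent(self, url, now):
        # Service 1: interactive lookup of one URL over the past 30 days.
        self.url_requests.append(now)
        return self._between(url, now - timedelta(days=30), now)

    def batch(self, urls, start, end):
        # Service 2: many URLs plus a time frame; slow in reality, so the
        # request is counted as pending while it is being processed.
        self.pending_batches += 1
        try:
            return {u: self._between(u, start, end) for u in urls}
        finally:
            self.pending_batches -= 1

    def stats(self, now):
        # Service 3: live counters (assumes "past month" means 30 days).
        today = now.date()
        month_ago = now - timedelta(days=30)
        return {
            "url_requests_today": sum(t.date() == today for t in self.url_requests),
            "url_requests_past_month": sum(t >= month_ago for t in self.url_requests),
            "snapshots_crawled_today": sum(
                t.date() == today for ts in self.snapshots.values() for t in ts
            ),
            "pending_older_requests": self.pending_batches,
        }
```

The sketch separates the three workloads the question asks about: `recent` is a latency-sensitive point lookup over a small time window, `batch` is a throughput-oriented scan over the full history, and `stats` aggregates a continuous stream of events. At 400 billion snapshots none of this fits in one process, which is the reason the question asks for a choice of technologies.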