Question
Create a web crawler. Following are the parameters:
Input. The crawler should take as command-line input: the name of a file containing a list of seed URLs; the maximum total number of pages to crawl (an integer); the name of a directory in which to save the crawled pages, one page per file; and a string that indicates the crawling algorithm that should be used (either dfs for depth-first search or bfs for breadth-first search). For example, you might run your crawler like this:
crawl.py seeds.txt 100 pages/ dfs
which would start crawling from the URLs in the file seeds.txt, visit at most 100 pages, save each page in the directory pages/, and use a depth-first traversal. The seed file should be a list of URLs, one per line, like this:
http://www.thesimpsons.com/
http://homerforpresident.tripod.com/
http://www.snpp.com/
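A sketch of how this command-line interface might be parsed with Python's argparse; the function names and help strings are one possible choice, not mandated by the assignment:

import argparse

def parse_args():
    # Command-line interface matching the invocation shown above.
    parser = argparse.ArgumentParser(description="Simple web crawler")
    parser.add_argument("seed_file", help="file containing seed URLs, one per line")
    parser.add_argument("max_pages", type=int, help="maximum total number of pages to crawl")
    parser.add_argument("out_dir", help="directory in which to save the crawled pages")
    parser.add_argument("algorithm", choices=["bfs", "dfs"], help="traversal algorithm")
    return parser.parse_args()

def read_seeds(path):
    # Read seed URLs from the file, one per line, skipping blank lines.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]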
Output. Your crawler should produce 2 kinds of output:
It should write the HTML code for the pages that it discovers, one file per page, into the output directory specified on the command line. Many pages on different websites will have the same name (e.g. almost every site on the web has a file named index.html), so you'll have to generate unique file names for each of these pages so that they don't overwrite each other in your output directory. One simple approach is to name the files with consecutive integers; e.g. name the first file you download 0.html, the second 1.html, etc. (see the sketch after the index.txt example below).
It should also output a file called index.txt that lists the mapping between the URLs and the filenames you've assigned locally. The file should also record the time that the page was downloaded. For example:
0.html 2013-10-05_12:15:12 http://www.cnn.com/
1.html 2013-10-05_12:15:14 http://www.cnn.com/links.html
2.html 2013-10-05_12:15:20 http://www.cs.indiana.edu/
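One possible way to handle the unique file names and the index.txt bookkeeping, assuming Python; the helper name save_page and its arguments are illustrative:

import os
import time

def save_page(html, url, out_dir, page_id, index_file):
    # Write the page under a consecutive-integer name so files never collide.
    filename = str(page_id) + ".html"
    with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
        f.write(html)
    # Append "filename timestamp url" to index.txt, matching the format above.
    timestamp = time.strftime("%Y-%m-%d_%H:%M:%S")
    index_file.write(filename + " " + timestamp + " " + url + "\n")

Here index_file would be an already-open handle to index.txt and page_id a counter incremented after each saved page.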
Crawler politeness. Your crawler should behave in an ethical and polite way, i.e. avoid placing unwelcome load on the network. For this purpose, you must avoid sending too many requests in rapid succession to a server. Furthermore, your crawler should obey the Robots Exclusion Protocol by which a webmaster may elect to exclude any crawler from all or parts of a site (see tips section below for how to do this). When you fetch web pages, make sure to identify yourself using an appropriate User-Agent and From string. For the User-Agent, use SCB-I427-login where login is your Spelman username (network ID). For the From string, use your full Spelman email address, e.g. login@scmail.spelman.edu. Note that compliance with these identification steps and the Robots Exclusion Protocol is an absolute requirement and is necessary to comply with the Spelman network use policy. If you experience issues, set User-Agent to *.
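A sketch of one way to meet these requirements with Python's standard library (urllib.robotparser for the Robots Exclusion Protocol, a fixed delay between requests for rate limiting); the one-second delay, the timeout, and the placeholder login are assumptions, not part of the assignment:

import time
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "SCB-I427-login"              # replace login with your own network ID
FROM_ADDRESS = "login@scmail.spelman.edu"  # replace with your own email address

robot_parsers = {}  # one cached robots.txt parser per host

def allowed_by_robots(url):
    # Consult the site's robots.txt before fetching; parsers are cached per host.
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in robot_parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(host, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            pass  # unreadable robots.txt: the parser then allows everything by default
        robot_parsers[host] = rp
    return robot_parsers[host].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    # Identify the crawler via User-Agent and From, and pause between requests.
    headers = {"User-Agent": USER_AGENT, "From": FROM_ADDRESS}
    request = urllib.request.Request(url, headers=headers)
    time.sleep(1)  # simple fixed delay; a per-host delay would be even politer
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")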
Crawling algorithms. As described above, the fourth parameter to the crawler specifies the traversal algorithm that should be used:
bfs should conduct a breadth-first traversal. Recall that this means that in any iteration of the crawler, it should visit the page that has been in the request queue the longest.
dfs should conduct a depth-first traversal. Recall that this means the crawler should visit the page that was most recently added to the request queue.
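Both traversals can share a single frontier stored in a collections.deque; popping from opposite ends gives BFS or DFS. A sketch with illustrative function and variable names (the extract_links call in the comment is a placeholder):

from collections import deque

def crawl(seeds, max_pages, algorithm):
    # Frontier of URLs waiting to be visited; both algorithms share it.
    frontier = deque(seeds)
    visited = set()
    while frontier and len(visited) < max_pages:
        if algorithm == "bfs":
            url = frontier.popleft()   # oldest entry in the queue: breadth-first
        else:
            url = frontier.pop()       # most recently added entry: depth-first
        if url in visited:
            continue
        visited.add(url)
        # fetch the page, save it, then extract its links and add them, e.g.:
        # frontier.extend(extract_links(page))
    return visited

Appending newly discovered links to the right and choosing which end to pop keeps the two algorithms identical except for that one line.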