Question
Create a web crawler. Following are the parameters:
Input. The crawler should take as command-line input: the name of a file containing a list of seed URLs; the maximum total number of pages to crawl (an integer); the name of a directory in which to save the crawled pages, one page per file; and a string that indicates the crawling algorithm that should be used (either dfs for depth-first search or bfs for breadth-first search). For example, you might run your crawler like this:
crawl.py seeds.txt 100 pages/ dfs
which would start crawling from the URLs in the file seeds.txt, visit at most 100 pages, save each page in the directory pages/, and use a depth-first traversal. The seed file should be a list of URLs, one per line, like this:
http://www.thesimpsons.com/
http://homerforpresident.tripod.com/
http://www.snpp.com/
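A sketch of how this command-line interface might be parsed with Python's argparse; the function names and help strings are one possible choice, not mandated by the assignment:

import argparse

def parse_args():
    # Command-line interface matching the invocation shown above.
    parser = argparse.ArgumentParser(description="Simple web crawler")
    parser.add_argument("seed_file", help="file containing seed URLs, one per line")
    parser.add_argument("max_pages", type=int, help="maximum total number of pages to crawl")
    parser.add_argument("out_dir", help="directory in which to save the crawled pages")
    parser.add_argument("algorithm", choices=["bfs", "dfs"], help="traversal algorithm")
    return parser.parse_args()

def read_seeds(path):
    # Read seed URLs from the file, one per line, skipping blank lines.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]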
Output. Your crawler should produce 2 kinds of output:
It should write the HTML code for the pages that it discovers, one file per page, into the output directory specified on the command line. Many pages on different websites will have the same name (e.g. almost every site on the web has a file named index.html), so you'll have to generate unique file names for each of these pages so that they don't overwrite each other in your output directory. One simple approach is to name the files with consecutive integers; e.g. name the first file you download 0.html, the second 1.html, etc. (see the sketch after the index.txt example below).
It should also output a file called index.txt that lists the mapping between the URLs and the filenames you've assigned locally. The file should also record the time that the page was downloaded. For example:
0.html 2013-10-05_12:15:12 http://www.cnn.com/
1.html 2013-10-05_12:15:14 http://www.cnn.com/links.html
2.html 2013-10-05_12:15:20 http://www.cs.indiana.edu/
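One possible way to handle the unique file names and the index.txt bookkeeping, assuming Python; the helper name save_page and its arguments are illustrative:

import os
import time

def save_page(html, url, out_dir, page_id, index_file):
    # Write the page under a consecutive-integer name so files never collide.
    filename = str(page_id) + ".html"
    with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
        f.write(html)
    # Append "filename timestamp url" to index.txt, matching the format above.
    timestamp = time.strftime("%Y-%m-%d_%H:%M:%S")
    index_file.write(filename + " " + timestamp + " " + url + "\n")

Here index_file would be an already-open handle to index.txt and page_id a counter incremented after each saved page.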
Crawler politeness. Your crawler should behave in an ethical and polite way, i.e. avoid placing unwelcome load on the network. For this purpose, you must avoid sending too many requests in rapid succession to a server. Furthermore, your crawler should obey the Robots Exclusion Protocol by which a webmaster may elect to exclude any crawler from all or parts of a site (see tips section below for how to do this). When you fetch web pages, make sure to identify yourself using an appropriate User-Agent and From string. For the User-Agent, use SCB-I427-login where login is your Spelman username (network ID). For the From string, use your full Spelman email address, e.g. login@scmail.spelman.edu. Note that compliance with these identification steps and the Robots Exclusion Protocol is an absolute requirement and is necessary to comply with the Spelman network use policy. If you experience issues, set User-Agent to *.
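A sketch of one way to meet these requirements with Python's standard library (urllib.robotparser for the Robots Exclusion Protocol, a fixed delay between requests for rate limiting); the one-second delay, the timeout, and the placeholder login are assumptions, not part of the assignment:

import time
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "SCB-I427-login"              # replace login with your own network ID
FROM_ADDRESS = "login@scmail.spelman.edu"  # replace with your own email address

robot_parsers = {}  # one cached robots.txt parser per host

def allowed_by_robots(url):
    # Consult the site's robots.txt before fetching; parsers are cached per host.
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in robot_parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(host, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            pass  # unreadable robots.txt: the parser then allows everything by default
        robot_parsers[host] = rp
    return robot_parsers[host].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    # Identify the crawler via User-Agent and From, and pause between requests.
    headers = {"User-Agent": USER_AGENT, "From": FROM_ADDRESS}
    request = urllib.request.Request(url, headers=headers)
    time.sleep(1)  # simple fixed delay; a per-host delay would be even politer
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")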
Crawling algorithms. As described above, the fourth parameter to the crawler specifies the traversal algorithm that should be used:
bfs should conduct a breadth-first traversal. Recall that this means that in any iteration of the crawler, it should visit the page that has been in the request queue the longest.
dfs should conduct a depth-first traversal. Recall that this means the crawler should visit the page that was most recently added to the request queue.
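Both traversals can share a single frontier stored in a collections.deque; popping from opposite ends gives BFS or DFS. A sketch with illustrative function and variable names (the extract_links call in the comment is a placeholder):

from collections import deque

def crawl(seeds, max_pages, algorithm):
    # Frontier of URLs waiting to be visited; both algorithms share it.
    frontier = deque(seeds)
    visited = set()
    while frontier and len(visited) < max_pages:
        if algorithm == "bfs":
            url = frontier.popleft()   # oldest entry in the queue: breadth-first
        else:
            url = frontier.pop()       # most recently added entry: depth-first
        if url in visited:
            continue
        visited.add(url)
        # fetch the page, save it, then extract its links and add them, e.g.:
        # frontier.extend(extract_links(page))
    return visited

Appending newly discovered links to the right and choosing which end to pop keeps the two algorithms identical except for that one line.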