Question
Web crawler for python
~The crawler needs to discover web pages by following links.
~We need to start with a set of known URLs, download those pages, look for links to other HTML pages, download those pages, etc.
~It will stop when all pages are explored, i.e., when all the links on all the pages it knows about have already been followed,
~or it will stop when a user-specified maximum number of pages has been reached.
~The crawler should only follow and download links to HTML pages, not any other documents (PDFs, images, etc.).
~It should not visit any page twice.
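One way to sketch the "HTML pages only" requirement is a small URL filter based on the path extension. This is my own hypothetical helper (the extension list is an assumption, not part of the assignment); a more robust crawler would also check the Content-Type header of the response.

```python
from urllib.parse import urlparse

# Hypothetical helper: guess whether a URL points to an HTML page
# from its path extension. Paths with no extension (e.g. "/" or
# "/news") are assumed to be HTML; known non-HTML extensions are
# rejected. The set below is illustrative, not exhaustive.
NON_HTML_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif",
                       ".css", ".js", ".zip", ".mp3", ".mp4"}

def looks_like_html(url):
    path = urlparse(url).path
    dot = path.rfind(".")
    if dot == -1:
        return True  # no extension at all: assume an HTML page
    return path[dot:].lower() not in NON_HTML_EXTENSIONS
```

Applying it to every extracted link before enqueueing keeps PDFs and images out of the request queue entirely.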
Input (taken from the command line), like:
~name of a file containing the list of seed URLs
~maximum total number of pages to crawl (an integer)
~name of a directory in which to save the crawled pages, one page per file
~a string which indicates the crawling algorithm to use (either dfs (depth-first search) or bfs (breadth-first search))
bfs should do a breadth-first traversal. This means that in any iteration of the
crawler, it should visit the page that has been in the request queue the longest.
dfs should do a depth-first traversal. This means that the crawler should visit the
page that was most recently added to the request queue.
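Both traversals can share one frontier structure; the only difference is which end you pop from. A minimal sketch using Python's `collections.deque` (the function name `next_url` is my own):

```python
from collections import deque

def next_url(frontier, algorithm):
    """Pick the next page to visit from the request queue.

    bfs pops the oldest entry (front of the deque): breadth-first.
    dfs pops the newest entry (back of the deque): depth-first.
    """
    if algorithm == "bfs":
        return frontier.popleft()
    return frontier.pop()

frontier = deque(["a", "b", "c"])  # "a" was enqueued first
```

With this, the crawl loop itself is identical for both algorithms; only the pop direction changes.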
Example of how it might run:
crawler.py seed.txt 100 pages/ dfs
It would start crawling from the URLs in the file seed.txt, visit at most 100 pages, save each page in the directory pages/, and use a dfs traversal. The seed file is a list of URLs, one per line, like:
http://www.cnn.com/
http://www.fox.com/
I will ask someone to help with the output in a separate question
You can even get the python code online somewhere, but it might need to be edited to include these things.
Do what you can
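Putting the pieces together, here is a minimal standard-library-only sketch of such a crawler. It is not a complete solution: the output filename scheme (`page0000.html`, etc.), the extension filter, and the lack of robots.txt handling or politeness delays are all my own simplifying assumptions.

```python
import os
import sys
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def is_html_url(url):
    # Crude extension-based filter; a real crawler should also
    # check the Content-Type header (done below as a second guard).
    last = urlparse(url).path.rsplit("/", 1)[-1]
    return "." not in last or last.lower().endswith((".html", ".htm"))

def crawl(seed_file, max_pages, out_dir, algorithm):
    os.makedirs(out_dir, exist_ok=True)
    with open(seed_file) as f:
        frontier = deque(line.strip() for line in f if line.strip())
    visited = set()
    count = 0
    while frontier and count < max_pages:
        # bfs: oldest entry; dfs: most recently added entry.
        url = frontier.popleft() if algorithm == "bfs" else frontier.pop()
        if url in visited:
            continue  # never visit any page twice
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue  # only download HTML pages
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or malformed URLs
        name = os.path.join(out_dir, "page%04d.html" % count)
        with open(name, "w", encoding="utf-8") as out:
            out.write(body)  # one crawled page per file
        count += 1
        parser = LinkExtractor(url)
        parser.feed(body)
        for link in parser.links:
            if is_html_url(link) and link not in visited:
                frontier.append(link)

if __name__ == "__main__" and len(sys.argv) == 5:
    # e.g. crawler.py seed.txt 100 pages/ dfs
    crawl(sys.argv[1], int(sys.argv[2]), sys.argv[3], sys.argv[4])
```

Note that with a plain deque the dfs branch is an approximation of true depth-first order (a recursive crawl would interleave differently), but it satisfies the stated rule of visiting the most recently enqueued page.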