
Question


Web crawler in Python

~The crawler needs to discover web pages by following links.

~It starts with a set of known URLs, downloads those pages, looks for links to other HTML pages, downloads those pages, and so on.

~It stops when all pages have been explored, that is, when every link on every page it knows about has already been followed,

~or when a user-specified maximum number of pages has been reached.

~The crawler should only follow and download links to HTML pages, not other documents (PDFs, images, etc.).

~It should not visit any page twice.
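
The link-following and HTML-only requirements above could be sketched roughly like this, using only the standard library. The names `LinkParser`, `extract_links`, `is_html_content_type`, and the extension list are my own assumptions for illustration, not part of the assignment. Checking the Content-Type header of the response is generally more reliable than guessing from the file extension, so this sketch shows both.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

# Assumed list of extensions treated as non-HTML; extend as needed.
NON_HTML_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".css", ".js")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_html_content_type(header_value):
    """True if a Content-Type header value such as 'text/html; charset=utf-8' is HTML."""
    if not header_value:
        return False
    return header_value.split(";")[0].strip().lower() == "text/html"


def extract_links(base_url, html_text):
    """Return absolute http(s) links found in html_text, without duplicates."""
    parser = LinkParser()
    parser.feed(html_text)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so the same page
        # is not treated as two different URLs.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parts.path.lower().endswith(NON_HTML_EXTENSIONS):
            continue  # skip obvious non-HTML documents
        if absolute not in links:
            links.append(absolute)
    return links
```

The "never visit twice" rule is then a matter of keeping a set of URLs already seen and checking each extracted link against it before queueing.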

Input (taken from the command line):

~name of a file containing a list of seed URLs

~maximum total number of pages to crawl (integer)

~name of a directory in which to save the crawled pages, one page per file

~string which indicates the crawling algorithm that should be used, either dfs (depth-first search) or bfs (breadth-first search)
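
One possible way to read these four arguments is with `argparse` from the standard library. This is only a sketch: the argument names (`seed_file`, `max_pages`, `output_dir`, `algorithm`) and the function name `parse_args` are assumptions I picked, not something the assignment prescribes.

```python
import argparse


def parse_args(argv=None):
    """Parse the four positional arguments; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Simple web crawler")
    parser.add_argument("seed_file", help="file containing seed URLs, one per line")
    parser.add_argument("max_pages", type=int, help="maximum total number of pages to crawl")
    parser.add_argument("output_dir", help="directory in which to save the crawled pages")
    # choices= rejects anything other than the two supported traversals.
    parser.add_argument("algorithm", choices=["bfs", "dfs"], help="traversal order")
    return parser.parse_args(argv)
```

Calling `parse_args(["seed.txt", "100", "pages/", "dfs"])` yields an object with `max_pages` already converted to an integer, and an invalid algorithm string makes `argparse` print a usage message and exit.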

bfs should do a breadth-first traversal. This means that in any iteration of the crawler, it should visit the page that has been in the request queue the longest.

dfs should do a depth-first traversal. This means that the crawler should visit the page that was most recently added to the request queue.
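
The two traversals differ only in which end of the request queue the next URL is taken from, so a single `collections.deque` can serve both. Below is a minimal sketch under that assumption. The function name `crawl_order` and the `get_links` callback are hypothetical: `get_links` stands in for "download the page and return the links found on it", so the traversal logic can be tried without any network access.

```python
from collections import deque


def crawl_order(seeds, get_links, max_pages, algorithm):
    """Return the URLs in the order they would be visited."""
    frontier = deque()
    seen = set()  # every URL ever queued, so no page is visited twice
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    visited = []
    while frontier and len(visited) < max_pages:
        if algorithm == "bfs":
            url = frontier.popleft()  # oldest entry in the queue
        else:
            url = frontier.pop()      # most recently added entry
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

For example, on a tiny link graph where A links to B and C, B links to D, and C links to E, bfs visits A, B, C, D, E while dfs visits A, C, E, B, D.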

Example of how it might run:

crawler.py seed.txt 100 pages/ dfs

It would start crawling from the URLs in the file seed.txt, visit at most 100 pages, save each page in the directory pages/, and use a dfs traversal. The seed file is a list of URLs, one per line, like:

http://www.cnn.com/

http://www.fox.com/
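
Reading a seed file in that format and saving one page per file might look like the sketch below. The helper names `read_seeds` and `save_page`, and the `page_00001.html` naming scheme, are assumptions made for illustration; the question does not specify how the saved files should be named.

```python
import os


def read_seeds(path):
    """Return the non-empty lines of the seed file as a list of URLs."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def save_page(output_dir, index, content):
    """Write the raw bytes of one page to its own numbered file and return the path."""
    os.makedirs(output_dir, exist_ok=True)  # create pages/ if it does not exist
    path = os.path.join(output_dir, f"page_{index:05d}.html")
    with open(path, "wb") as handle:
        handle.write(content)
    return path
```

Numbering the files avoids having to turn arbitrary URLs into safe file names; a separate index file mapping numbers to URLs could be added if the original addresses are needed later.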

I will ask someone to help with the output in a separate question.

You can even get the Python code online somewhere, but it might need to be edited to include these things.

Do what you can.
