Question
Web crawler for python
~The crawler needs to discover web pages by following links.
~We need to start with a set of known URLs, download those pages, look for links to other HTML pages, download those pages, etc.
~It will stop when all pages are explored, i.e., when all the links on all the pages it knows about have already been followed,
~or it will stop when a user-specified maximum number of pages has been reached.
~The crawler should only follow and download links to HTML pages, not any other documents (PDFs, images, etc.).
~It should not visit any page twice.
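One way to sketch the "HTML pages only" requirement is a small URL filter based on the path extension. This is my own hypothetical helper (the extension list is an assumption, not part of the assignment); a more robust crawler would also check the Content-Type header of the response.

```python
from urllib.parse import urlparse

# Hypothetical helper: guess whether a URL points to an HTML page
# from its path extension. Paths with no extension (e.g. "/" or
# "/news") are assumed to be HTML; known non-HTML extensions are
# rejected. The set below is illustrative, not exhaustive.
NON_HTML_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif",
                       ".css", ".js", ".zip", ".mp3", ".mp4"}

def looks_like_html(url):
    path = urlparse(url).path
    dot = path.rfind(".")
    if dot == -1:
        return True  # no extension at all: assume an HTML page
    return path[dot:].lower() not in NON_HTML_EXTENSIONS
```

Applying it to every extracted link before enqueueing keeps PDFs and images out of the request queue entirely.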
Input (taken from the command line), like:
~name of a file containing the list of seed URLs
~maximum total number of pages to crawl (an integer)
~name of a directory in which to save the crawled pages, one page per file
~a string which indicates the crawling algorithm to use (either dfs (depth-first search) or bfs (breadth-first search))
bfs should do a breadth-first traversal. This means that in any iteration of the
crawler, it should visit the page that has been in the request queue the longest.
dfs should do a depth-first traversal. This means that the crawler should visit the
page that was most recently added to the request queue.
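Both traversals can share one frontier structure; the only difference is which end you pop from. A minimal sketch using Python's `collections.deque` (the function name `next_url` is my own):

```python
from collections import deque

def next_url(frontier, algorithm):
    """Pick the next page to visit from the request queue.

    bfs pops the oldest entry (front of the deque): breadth-first.
    dfs pops the newest entry (back of the deque): depth-first.
    """
    if algorithm == "bfs":
        return frontier.popleft()
    return frontier.pop()

frontier = deque(["a", "b", "c"])  # "a" was enqueued first
```

With this, the crawl loop itself is identical for both algorithms; only the pop direction changes.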
Example of how it might run:
crawler.py seed.txt 100 pages/ dfs
It would start crawling from the URLs in the file seed.txt, visit at most 100 pages, save each page in the directory pages/, and use a dfs traversal. The seed file is a list of URLs, one per line, like:
http://www.cnn.com/
http://www.fox.com/
I will ask someone to help with the output in a separate question
You can even get the python code online somewhere, but it might need to be edited to include these things.
Do what you can
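Putting the pieces together, here is a minimal standard-library-only sketch of such a crawler. It is not a complete solution: the output filename scheme (`page0000.html`, etc.), the extension filter, and the lack of robots.txt handling or politeness delays are all my own simplifying assumptions.

```python
import os
import sys
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def is_html_url(url):
    # Crude extension-based filter; a real crawler should also
    # check the Content-Type header (done below as a second guard).
    last = urlparse(url).path.rsplit("/", 1)[-1]
    return "." not in last or last.lower().endswith((".html", ".htm"))

def crawl(seed_file, max_pages, out_dir, algorithm):
    os.makedirs(out_dir, exist_ok=True)
    with open(seed_file) as f:
        frontier = deque(line.strip() for line in f if line.strip())
    visited = set()
    count = 0
    while frontier and count < max_pages:
        # bfs: oldest entry; dfs: most recently added entry.
        url = frontier.popleft() if algorithm == "bfs" else frontier.pop()
        if url in visited:
            continue  # never visit any page twice
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue  # only download HTML pages
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or malformed URLs
        name = os.path.join(out_dir, "page%04d.html" % count)
        with open(name, "w", encoding="utf-8") as out:
            out.write(body)  # one crawled page per file
        count += 1
        parser = LinkExtractor(url)
        parser.feed(body)
        for link in parser.links:
            if is_html_url(link) and link not in visited:
                frontier.append(link)

if __name__ == "__main__" and len(sys.argv) == 5:
    # e.g. crawler.py seed.txt 100 pages/ dfs
    crawl(sys.argv[1], int(sys.argv[2]), sys.argv[3], sys.argv[4])
```

Note that with a plain deque the dfs branch is an approximation of true depth-first order (a recursive crawl would interleave differently), but it satisfies the stated rule of visiting the most recently enqueued page.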