
Question


Hi, can someone:

- edit the crawl function so that the number of urls to crawl can be entered on the command line
- edit it so that it puts the source code of each page visited in a directory specified on the command line

(The starter code is below; see the sketch after the listing for one way to make both changes.)

"""Usage: crawler.py seed_url

seed: absolute url - the crawler will use it as the initial web address

"""

import urllib.request

import urllib.parse

import urllib.error

import urllib.robotparser

import re

import sys

# DO NOT CHANGE ok_to_crawl!!!
def ok_to_crawl(absolute_url):
    """
    check if it is OK to crawl the specified absolute url

    We are implementing polite crawling by checking the robots.txt file
    for all urls except the ones using the file scheme (these are urls
    on the local host and they are all OK to crawl.)
    We also use this function to skip over mailto: links and javascript: links.

    Parameter:
    absolute_url (string): this is an absolute url that we would like to crawl

    Returns:
    boolean: True if the scheme is file (it is a local webpage)
             True if we successfully read the corresponding robots.txt
                  file and determined that user-agent * is allowed to crawl
             False if it is a mailto: link or a javascript: link,
                   if user-agent * is not allowed to crawl it, or
                   if it is NOT an absolute url.
    """
    if absolute_url.lower().startswith('mailto:'):
        return False
    if absolute_url.lower().startswith('javascript:'):
        return False
    link_obj = urllib.parse.urlparse(absolute_url)
    if link_obj.scheme.lower().startswith('file'):
        return True
    # check if the url given as input is an absolute url
    if not link_obj.scheme or not link_obj.hostname:
        print('Not a valid absolute url: ', absolute_url)
        return False
    # construct the robots.txt url from the scheme and host name
    else:
        robot_url = link_obj.scheme + '://' + link_obj.hostname + '/robots.txt'
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robot_url)
        try:
            rp.read()
        except:
            print("Error accessing robot file: ", robot_url)
            return False
        else:
            return rp.can_fetch("*", absolute_url)

# DO NOT CHANGE crawl!!!
def crawl(seed_url):
    """
    start with the seed_url and crawl up to 10 urls

    Parameter:
    seed_url (string) - this is the first url we'll visit.

    Returns:
    set of strings - set of all the urls we have visited.
    """
    urls_tocrawl = {seed_url}   # initialize our set of urls to crawl
    urls_visited = set()        # initialize our set of urls visited
    while urls_tocrawl and len(urls_visited) < 10:
        current_url = urls_tocrawl.pop()      # just get any url from the set
        if current_url not in urls_visited:   # check if we have crawled it before
            page = get_page(current_url)
            if page:
                more_urls = extract_links(current_url, page)   # get the links
                urls_tocrawl = urls_tocrawl | more_urls        # add them to be crawled
            urls_visited.add(current_url)
    return urls_visited

# ------------Do not change anything above this line-----------------------------

def get_page(url):
    """
    generate a web page of html in string from url

    params: absolute url as string
    return: if there is URLError or DecodeError, return an empty string
            else return the full html page content as string
    """
    try:
        with urllib.request.urlopen(url) as url_file:
            page_string = url_file.read().decode('UTF-8')
            return page_string
    except urllib.error.URLError as url_err:
        print("Error opening url: ", url, url_err)
        return ""   # empty string, so the caller's truthiness check skips this page
    except UnicodeDecodeError as decode_err:
        print("Error decoding url", url, decode_err)
        return ""

def extract_links(base_url, page):
    """
    extract the links contained in the page at the base_url

    Parameters:
    base_url (string): the url we are currently crawling - web address
    page (string): the content of that url - html

    Returns:
    A set of absolute urls (set of strings) - These are all the urls extracted
    from the current url and converted to absolute urls.
    """
    urls_set = set()
    # the original regex was cut off; this pattern grabs the href target of each link
    page_links = re.findall(r'href="([^"]+)"', page, re.IGNORECASE)
    for link in page_links:
        # Convert each link to an absolute url
        absolute_url = urllib.parse.urljoin(base_url, link)
        # only keep links we are allowed to crawl (polite crawling via ok_to_crawl)
        if ok_to_crawl(absolute_url):
            urls_set.add(absolute_url)
    return urls_set

def main():
    # the script expects exactly one command line argument: the seed url
    if len(sys.argv) != 2:
        print(__doc__)   # wrong number of arguments: print the usage message
    else:
        seed_url = sys.argv[1]
        urls_visited = crawl(seed_url)
        with open('crawled.txt', 'w', encoding='utf-8') as new_file:
            for url in urls_visited:
                new_file.write(url + '\n')


__author__ = 'xxx'

if __name__ == '__main__':
    main()
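
One way to make both requested changes without touching the DO NOT CHANGE sections is to add a variant of crawl that takes the limit and the output directory as parameters, and to replace main with a version that reads both values from the command line. This is a minimal sketch, assuming the command line becomes crawler.py seed_url max_urls save_dir; the names crawl_and_save, max_urls and save_dir, and the MD5-hash filename scheme are illustrative choices, not part of the original assignment. It reuses get_page and extract_links from the listing above.

import hashlib
import os

def crawl_and_save(seed_url, max_urls, save_dir):
    """
    Variant of crawl: visit up to max_urls urls starting from seed_url and
    write the source of every page we manage to fetch into save_dir.
    Returns the set of urls visited.
    """
    os.makedirs(save_dir, exist_ok=True)   # create the output directory if needed
    urls_tocrawl = {seed_url}
    urls_visited = set()
    while urls_tocrawl and len(urls_visited) < max_urls:
        current_url = urls_tocrawl.pop()
        if current_url not in urls_visited:
            page = get_page(current_url)
            if page:
                # hypothetical filename scheme: hash the url so the name is filesystem-safe
                filename = hashlib.md5(current_url.encode('utf-8')).hexdigest() + '.html'
                with open(os.path.join(save_dir, filename), 'w', encoding='utf-8') as out:
                    out.write(page)
                urls_tocrawl = urls_tocrawl | extract_links(current_url, page)
            urls_visited.add(current_url)
    return urls_visited

def main():
    # assumed usage: crawler.py seed_url max_urls save_dir
    if len(sys.argv) != 4:
        print('Usage: crawler.py seed_url max_urls save_dir')
    else:
        seed_url = sys.argv[1]
        max_urls = int(sys.argv[2])
        save_dir = sys.argv[3]
        urls_visited = crawl_and_save(seed_url, max_urls, save_dir)
        with open('crawled.txt', 'w', encoding='utf-8') as new_file:
            for url in urls_visited:
                new_file.write(url + '\n')

Running, for example, python crawler.py http://example.com 25 pages would crawl up to 25 urls and write each page's source into the pages directory. Hashing the url is only one way to get a safe file name; a simple counter or a sanitized form of the url would work just as well.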
