Question

1 Approved Answer

Posted on Sep 26, 2024


Using Python 3+

1. Include the following imports at the top of your module (hopefully this is sufficient):

from web import LinkCollector  # make sure you did 1
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import URLError

2. Implement a class ImageCrawler that inherits from the Crawler class in web.py and both crawls links and collects images. This is straightforward if you inherit from and extend the Crawler class. You will need to collect the images in a set. Hint: what does it mean to extend? Implementation details:

You must inherit from Crawler. Make sure that the module web.py is in your working folder and make sure that you import Crawler from the web module.

__init__ - extends Crawler's __init__ by adding a set attribute that will be used to store images

crawl - extends Crawler's crawl by creating an image collector, opening the url, and then adding any images found at the url to the set of images being stored. I recommend that you collect the images before you call Crawler's crawl method.

getImages - returns the set of images collected
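The details above mention an image collector without defining it. Here is a minimal sketch of one possible shape, assuming the collector is an HTMLParser subclass that resolves each img src against the page URL. The name ImageCollector, its getImages method, and the stand-in Crawler (there only so the snippet runs when web.py is not available) are assumptions, not part of the assignment text.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

try:
    from web import Crawler  # the assignment's base class, if web.py is available
except ImportError:
    class Crawler:  # stand-in (assumption) so the sketch runs on its own
        def __init__(self):
            self.crawled = set()

        def crawl(self, url, depth, relativeOnly=True):
            self.crawled.add(url)


class ImageCollector(HTMLParser):
    """Hypothetical helper: gathers absolute URLs of <img> tags on one page."""

    def __init__(self, url):
        super().__init__()
        self.url = url        # base used to resolve relative src values
        self.images = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.images.add(urljoin(self.url, value))

    def getImages(self):
        return self.images


class ImageCrawler(Crawler):
    def __init__(self):
        super().__init__()    # keep Crawler's own attributes
        self.images = set()   # the added attribute

    def crawl(self, url, depth, relativeOnly=True):
        # collect images first, as the assignment recommends
        ic = ImageCollector(url)
        try:
            ic.feed(urlopen(url).read().decode())
            self.images.update(ic.getImages())
        except (UnicodeDecodeError, URLError, TypeError):
            pass              # the real Crawler.crawl records the url as dead
        super().crawl(url, depth, relativeOnly)

    def getImages(self):
        return self.images
```

Because Crawler's crawl calls self.crawl for each new link, the override runs on every page visited, so images from linked pages are collected too. This sketch downloads each page twice (once for images, once for links), which keeps it simple at the cost of extra requests.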

>>> c = ImageCrawler()
>>> c.crawl('http://www2.warnerbros.com/spacejam/movie/jam.htm',1,True)
>>> c.getImages()
{'http://www2.warnerbros.com/spacejam/movie/img/p-lunartunes.gif', 'http://www2.warnerbros.com/spacejam/movie/cmp/pressbox/img/r-blue.gif'}

>>> c = ImageCrawler()
>>> c.crawl('http://www.pmichaud.com/toast/',1,True)
>>> c.getImages()
{'http://www.pmichaud.com/toast/toast-6a.gif', 'http://www.pmichaud.com/toast/toast-2c.gif', 'http://www.pmichaud.com/toast/toast-4c.gif', 'http://www.pmichaud.com/toast/toast-6c.gif', 'http://www.pmichaud.com/toast/ptart-1c.gif', 'http://www.pmichaud.com/toast/toast-7b.gif', 'http://www.pmichaud.com/toast/krnbo24.gif', 'http://www.pmichaud.com/toast/toast-1b.gif', 'http://www.pmichaud.com/toast/toast-3c.gif', 'http://www.pmichaud.com/toast/toast-5c.gif', 'http://www.pmichaud.com/toast/toast-8a.gif'}
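Every entry in the expected output is an absolute URL, which suggests that each img src is resolved against the URL of the page it was found on. A small sketch of how urljoin does that follows. The two src values are hypothetical, chosen to reproduce the first example's output; the page's actual markup is not shown here.

```python
from urllib.parse import urljoin

page = 'http://www2.warnerbros.com/spacejam/movie/jam.htm'

# Hypothetical src values (assumptions, not taken from the real page)
relative_src = 'img/p-lunartunes.gif'                       # relative to the page's folder
rooted_src = '/spacejam/movie/cmp/pressbox/img/r-blue.gif'  # relative to the site root

resolved_relative = urljoin(page, relative_src)
resolved_rooted = urljoin(page, rooted_src)

print(resolved_relative)
print(resolved_rooted)
```

Since the results are stored in a set, the order in which getImages displays them may differ from run to run.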

web.py:

from urllib.request import urlopen
from urllib.error import URLError

# LinkCollector is defined elsewhere in web.py (not shown here)

class Crawler:  # does not inherit

    # init - create empty sets
    def __init__(self):
        self.crawled = set()  # html read
        self.found = set()    # found
        self.dead = set()     # cant load

    # recursive crawl
    # c.crawl('http://www.kli.org/', 2, True)
    def crawl(self, url, depth, relativeOnly=True):
        # read the html found at url
        lc = LinkCollector(url)
        try:
            lc.feed(urlopen(url).read().decode())
        except (UnicodeDecodeError, URLError, TypeError):
            self.dead.add(url)
        self.crawled.add(url)

        # extract links
        if relativeOnly:  # if relativeOnly==True:
            found = lc.getRelatives()
        else:
            found = lc.getLinks()
        self.found.update(found)

        # recursively crawl all the (new) links
        # that were found
        if depth > 0:
            for link in found:
                # dont want to duplicate work
                if link not in self.crawled:
                    self.crawl(link, depth - 1, relativeOnly)

    # easy get methods
    def getCrawled(self):
        return self.crawled

    def getFound(self):
        return self.found

    def getDead(self):
        return self.dead
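Crawler relies on a LinkCollector whose definition is not included in the listing. The sketch below is only a guess at its shape, inferred from how Crawler uses it: a constructor that takes the page url, feed (inherited from HTMLParser), getRelatives, and getLinks. It assumes that getRelatives returns links written as relative paths (resolved to absolute URLs) and that getLinks returns those plus links that already carried a scheme. The real class in web.py may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical reconstruction; the real web.py version may differ."""

    def __init__(self, url):
        super().__init__()
        self.url = url          # page URL, used to resolve relative hrefs
        self.relatives = set()  # hrefs written without a scheme, made absolute
        self.absolutes = set()  # hrefs that already had a scheme

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                if urlparse(value).scheme:  # e.g. http, https, mailto
                    self.absolutes.add(value)
                else:
                    self.relatives.add(urljoin(self.url, value))

    def getRelatives(self):
        return self.relatives

    def getLinks(self):
        # assumption: "links" means relative and absolute links together
        return self.relatives | self.absolutes
```

With relativeOnly=True the crawler follows only links from getRelatives, which under these assumptions tends to keep the crawl on the starting site.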