Question
Using Python 3+
1. Include the following imports at the top of your module (hopefully this is sufficient):
from web import LinkCollector # make sure you did 1
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import URLError
2. Implement a class ImageCrawler that inherits from the Crawler class in web.py and both crawls links and collects images. This is straightforward if you inherit from and extend the Crawler class. You will need to collect images in a set. Hint: what does it mean to extend? Implementation details:
You must inherit from Crawler. Make sure that the module web.py is in your working folder and make sure that you import Crawler from the web module.
__init__ extends Crawler's __init__ by adding a set attribute that will be used to store images.
crawl extends Crawler's crawl by creating an image collector, opening the url, and then adding any images found at the url to the stored set of images. I recommend that you collect the images before you call Crawler's crawl method.
getImages returns the set of images collected
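The implementation details above can be sketched roughly as follows. This is one possible reading of the assignment, not an official solution. ImageCollector is a hypothetical helper (the problem mentions "an image collector" but does not define it), and the small stand-in Crawler in the except branch is an assumption that exists only so the snippet runs when web.py is not available.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

try:
    from web import Crawler  # the base class the assignment provides
except ImportError:
    # Stand-in so this sketch runs without web.py; NOT the real Crawler.
    class Crawler:
        def __init__(self):
            self.crawled = set()

        def crawl(self, url, depth, relativeOnly=True):
            self.crawled.add(url)


class ImageCollector(HTMLParser):
    """Hypothetical helper: gathers absolute URLs of <img> tags."""

    def __init__(self, url):
        super().__init__()
        self.url = url        # base URL used to resolve relative src values
        self.images = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.images.add(urljoin(self.url, value))

    def getImages(self):
        return self.images


class ImageCrawler(Crawler):
    def __init__(self):
        super().__init__()    # keep the sets Crawler already creates
        self.images = set()   # new attribute: every image URL seen so far

    def crawl(self, url, depth, relativeOnly=True):
        # Collect images first, as the assignment recommends.
        ic = ImageCollector(url)
        try:
            ic.feed(urlopen(url).read().decode())
        except (UnicodeDecodeError, URLError, TypeError):
            pass              # Crawler.crawl records unreachable urls itself
        self.images.update(ic.getImages())
        # Then let the inherited method follow the links.
        super().crawl(url, depth, relativeOnly)

    def getImages(self):
        return self.images
```

Because Crawler's crawl recurses through self.crawl, each recursive call lands back in ImageCrawler.crawl, so images are collected from every page visited and not only from the starting url. The cost of this design is that each page is downloaded twice, once for images and once for links.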
>>> c = ImageCrawler()
>>> c.crawl('http://www2.warnerbros.com/spacejam/movie/jam.htm',1,True)
>>> c.getImages()
{'http://www2.warnerbros.com/spacejam/movie/img/p-lunartunes.gif',
 'http://www2.warnerbros.com/spacejam/movie/cmp/pressbox/img/r-blue.gif'}

>>> c = ImageCrawler()
>>> c.crawl('http://www.pmichaud.com/toast/',1,True)
>>> c.getImages()
{'http://www.pmichaud.com/toast/toast-6a.gif',
 'http://www.pmichaud.com/toast/toast-2c.gif',
 'http://www.pmichaud.com/toast/toast-4c.gif',
 'http://www.pmichaud.com/toast/toast-6c.gif',
 'http://www.pmichaud.com/toast/ptart-1c.gif',
 'http://www.pmichaud.com/toast/toast-7b.gif',
 'http://www.pmichaud.com/toast/krnbo24.gif',
 'http://www.pmichaud.com/toast/toast-1b.gif',
 'http://www.pmichaud.com/toast/toast-3c.gif',
 'http://www.pmichaud.com/toast/toast-5c.gif',
 'http://www.pmichaud.com/toast/toast-8a.gif'}
web.py:
class Crawler: # does not inherit

    # init - create empty sets
    def __init__(self):
        self.crawled = set() # html read
        self.found = set()   # found
        self.dead = set()    # cant load

    # recursive crawl
    # c.crawl( 'http://www.kli.org/',2,True)
    def crawl(self,url,depth,relativeOnly=True):
        # read the html found at url
        lc = LinkCollector(url)
        try:
            lc.feed( urlopen(url).read().decode() )
        except (UnicodeDecodeError,URLError,TypeError):
            self.dead.add(url)
        self.crawled.add( url )

        # extract links
        if relativeOnly: # if relativeOnly==True:
            found = lc.getRelatives()
        else:
            found = lc.getLinks()
        self.found.update( found )

        # recursively crawl all the (new) links
        # that were found
        if depth>0:
            for link in found:
                # dont want to duplicate work
                if link not in self.crawled:
                    self.crawl( link,depth-1,relativeOnly)

    # easy get methods
    def getCrawled(self):
        return self.crawled
    def getFound(self):
        return self.found
    def getDead(self):
        return self.dead
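The Crawler above relies on a LinkCollector that the question does not show. A minimal sketch of what such a class might look like is given here, assuming only the interface Crawler actually uses: a constructor taking the page url, feed, getRelatives, and getLinks. The attribute names, the getAbsolutes method, and the rule for telling relative links from absolute ones are assumptions for illustration and may differ from the version written for problem 1.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Assumed shape of the helper: collects <a href> targets from a page."""

    def __init__(self, url):
        super().__init__()
        self.url = url           # page url, used to resolve relative hrefs
        self.relatives = set()   # relative hrefs, resolved to full urls
        self.absolutes = set()   # hrefs that were already full http(s) urls

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name != 'href' or not value:
                continue
            if value.startswith(('http://', 'https://')):
                self.absolutes.add(value)
            else:
                resolved = urljoin(self.url, value)
                # Skip mailto:, javascript: and similar non-web targets.
                if resolved.startswith(('http://', 'https://')):
                    self.relatives.add(resolved)

    def getRelatives(self):
        return self.relatives

    def getAbsolutes(self):
        return self.absolutes

    def getLinks(self):
        # Every link found, relative and absolute together.
        return self.relatives | self.absolutes
```

With relativeOnly=True the Crawler only follows what getRelatives returns, which under this sketch keeps the crawl on pages the site links to with relative hrefs.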