Question
Write a scraper that collects images on the Web. The scraper is given a "seed" page from which to begin its crawl. For each page it visits, it looks for img start tags and then for the src attribute, whose value is the URL of an image. After the scraper is done with a page, the crawler finds additional pages that can be scraped for more images.
Below you will find two programs. The first extracts images from a single Web page. It writes img elements to a file, so that when the program terminates, one can open a browser window on the file and see the images that have been collected.
The second program is a crawler. It is given a maximum number of pages to crawl (otherwise it would probably never terminate). Once it is done crawling, it writes anchor elements to a file, so that again one can open a browser window on the file and see the hyperlinks that have been collected. The hyperlinks can be clicked to go to the page specified by the URL in the href attribute of the anchor element.
At the bottom of this file, I've also started the ImageScraper class for you. You may fill in the code I have started, or you may delete my code and write it another way.
from html import unescape
from urllib.request import urlopen
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from http.client import BadStatusLine
####################################################################
# This program extracts images from a single page, using HTMLParser
####################################################################

class ImageParser(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.images = set()
        self.url = url
    def extract_images(self):
        contents = unescape(urlopen(self.url).read().decode())
        self.feed(contents)
        return len(self.images)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.images.add(urljoin(self.url, attr[1]))
    def get_images(self):
        return self.images
########################################################
# this code crawls the Web and gathers all of the
# hyperlinks that it can find, up to a maximum number
#
# to run it, for example:
#
#   c = Crawler('http://www.nytimes.com', 20)
#   c.crawl()
#
# this will collect a set of 20 URLs, which can then
# be retrieved with c.get_pages()
########################################################
class Crawler(HTMLParser):
    def __init__(self, url, max_links=100, visited=None):
        HTMLParser.__init__(self)
        self.url = url
        self.max_links = max_links
        if visited is None:
            self.visited = set()
        else:
            self.visited = visited
        self.href = None

    def crawl(self):
        if self.url in self.visited:
            print('already visited')
            return
        self.visited.add(self.url)
        try:
            page_contents = unescape(urlopen(self.url).read().decode())
            self.feed(page_contents)
        except UnicodeDecodeError:
            print('unicodedecode')
        except UnicodeEncodeError:
            print('unicodeencode')
        except URLError:
            print('urlerror')
        except BadStatusLine:
            print('bad')
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    self.href = urljoin(self.url, attr[1])
                    if len(self.visited) < self.max_links:
                        print(len(self.visited))
                        c = Crawler(self.href, self.max_links, self.visited)
                        return c.crawl()

    def get_pages(self):
        return self.visited
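Both ImageParser and Crawler depend on urljoin to turn relative src and href values into absolute URLs before storing them. A quick illustration of how it resolves the common cases (the URLs here are made-up placeholders):

```python
from urllib.parse import urljoin

base = 'http://example.com/news/index.html'

# relative path: resolved against the page's directory
print(urljoin(base, 'story.html'))        # -> http://example.com/news/story.html

# root-relative path: resolved against the site root
print(urljoin(base, '/images/logo.png'))  # -> http://example.com/images/logo.png

# already-absolute URL: left unchanged
print(urljoin(base, 'http://other.org/a'))  # -> http://other.org/a
```

This is why the parsers join against self.url rather than storing attr[1] directly: a bare 'story.html' would be useless once the crawl moves to another page.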
########################################################
# YOU MUST COMPLETE THIS CLASS.
########################################################
class ImageScraper(HTMLParser):
    def __init__(self, url, max_links=100):
        HTMLParser.__init__(self)   # initialize the parser machinery
        self.url = url
        self.max_links = max_links
        self.visited = set()
        self.images = [ ]

    def scrape(self):
        # you must complete this
        pass
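One possible completion is sketched below. It simply merges the two patterns above: handle_starttag collects img/src URLs the way ImageParser does and recurses on anchors the way Crawler does. The extra visited and images constructor parameters (so the recursive instances share state) are my own addition, not part of the starter code, and I use a set for images instead of the starter's list to avoid duplicates; treat this as one sketch, not the required answer.

```python
from html import unescape
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.error import URLError
from urllib.parse import urljoin
from http.client import BadStatusLine

class ImageScraper(HTMLParser):
    def __init__(self, url, max_links=100, visited=None, images=None):
        HTMLParser.__init__(self)
        self.url = url
        self.max_links = max_links
        # shared across recursive instances, like Crawler's visited set
        self.visited = set() if visited is None else visited
        self.images = set() if images is None else images

    def scrape(self):
        if self.url in self.visited:
            return self.images
        self.visited.add(self.url)
        try:
            contents = unescape(urlopen(self.url).read().decode())
            self.feed(contents)
        except (UnicodeDecodeError, UnicodeEncodeError, URLError, BadStatusLine):
            pass
        return self.images

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # collect image URLs, resolved against the current page
            for name, value in attrs:
                if name == 'src':
                    self.images.add(urljoin(self.url, value))
        elif tag == 'a':
            # follow hyperlinks until the page budget is exhausted
            for name, value in attrs:
                if name == 'href' and len(self.visited) < self.max_links:
                    ImageScraper(urljoin(self.url, value), self.max_links,
                                 self.visited, self.images).scrape()
```

As with ImageParser, the class can be exercised offline by calling feed directly on a string of HTML instead of scrape (which fetches over the network).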