Question
Define function fetch_url (crawler.py)
This method, using the given URL, should find the corresponding file in the corpus and return a dictionary containing the URL, the content of the file in binary format, and the content size in bytes. To find out how to locate the corresponding file in the corpus, take a look at the Corpus class (corpus.py).
Input: the URL to be fetched
Output: a dictionary containing the URL, the content, and the size of the content. If the URL does not exist in the corpus, a dictionary with content set to None and size set to 0 can be returned.
Define function extract_next_links (crawler.py)
This function extracts links from the content of a fetched webpage.
Input: url_data, which is a dictionary with the 'url', 'content' and 'size' keys. The 'content' value is the content of the URL's corresponding file in the corpus, and the 'size' is the size of the content in bytes.
Output: list of URLs in string form. Each URL should be in absolute form. It is not required to remove duplicates that have already been fetched. The frontier takes care of ignoring duplicates.
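A minimal sketch of extract_next_links along these lines, using the standard-library html.parser instead of the suggested lxml so it has no third-party dependency (the _LinkParser helper class is introduced here for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _LinkParser(HTMLParser):
    """Collects href values from <a> tags as they are parsed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_next_links(url_data):
    # Pages that could not be fetched have content set to None.
    if url_data["content"] is None:
        return []
    parser = _LinkParser()
    # Content is stored in binary form; decode leniently before parsing.
    parser.feed(url_data["content"].decode("utf-8", errors="ignore"))
    # Resolve relative links against the fetched URL to get absolute form.
    return [urljoin(url_data["url"], link) for link in parser.links]
```

Duplicates are deliberately not filtered here, since the frontier handles that.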
Define function is_valid (crawler.py)
This function returns True or False based on whether a URL is valid and must be fetched or not.
Input: URL is the URL of a web page in string form
Output: True if the URL is valid, False otherwise. This is a great place to filter out crawler traps. Duplicated URLs will be taken care of by the frontier; you don't need to check for duplication in this method.
Filter out crawler traps (e.g. the ICS calendar, dynamic URLs, etc.). You will need to do some research online or apply concepts regarding crawler traps covered in class.
Returning False on a URL does not let that URL enter your frontier. Some part of the function has already been implemented. It is your job to figure out how to add to the existing logic in order to avoid crawler traps and ensure that only valid links are sent to the frontier.
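One way to extend the existing is_valid logic is a helper that applies common trap heuristics. This is a sketch only; the function name and the specific thresholds (URL length, segment repetition count, query-parameter count) are illustrative assumptions, not part of the assignment:

```python
import re
from urllib.parse import urlparse

def looks_like_trap(url):
    """Heuristic crawler-trap checks; thresholds are illustrative."""
    parsed = urlparse(url)
    # Overly long URLs often come from dynamically generated pages.
    if len(url) > 300:
        return True
    # A path segment repeated several times suggests a loop, e.g. /a/b/a/b/a/b.
    segments = [s for s in parsed.path.lower().split("/") if s]
    for seg in set(segments):
        if segments.count(seg) > 2:
            return True
    # Calendar-style URLs can generate an unbounded number of date pages.
    if re.search(r"calendar|\d{4}-\d{2}-\d{2}", url.lower()):
        return True
    # Many query parameters are typical of dynamic URLs.
    if parsed.query.count("=") > 3:
        return True
    return False
```

Inside is_valid, this would be one extra condition: return False whenever looks_like_trap(url) is True, after the existing scheme and extension checks.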
import json
import os
from urllib.parse import urlparse


class Corpus:
    """
    This class is responsible for handling corpus-related functionalities
    like mapping a url to its local file name
    """

    # The corpus directory name
    WEBPAGES_RAW_NAME = "WEBPAGES_RAW"
    # The corpus JSON mapping file
    JSON_FILE_NAME = os.path.join(".", WEBPAGES_RAW_NAME, "bookkeeping.json")

    def __init__(self):
        # The encoding argument belongs to open(), not json.load().
        with open(self.JSON_FILE_NAME, encoding="utf-8") as f:
            self.file_url_map = json.load(f)
        self.url_file_map = dict()
        for key in self.file_url_map:
            self.url_file_map[self.file_url_map[key]] = key

    def get_file_name(self, url):
        """
        Given a url, this method looks up a local file in the corpus and,
        if it exists, returns the file address. Otherwise returns None.
        """
        url = url.strip()
        parsed_url = urlparse(url)
        # Drop the scheme plus "://" so the key matches bookkeeping.json.
        url = url[len(parsed_url.scheme) + 3:]
        if url in self.url_file_map:
            addr = self.url_file_map[url].split("/")
            dir = addr[0]
            file = addr[1]
            return os.path.join(".", self.WEBPAGES_RAW_NAME, dir, file)
        return None
---------------------------
import logging
import re
from urllib.parse import urlparse
from corpus import Corpus

logger = logging.getLogger(__name__)


class Crawler:
    """
    This class is responsible for scraping urls from the next available link
    in frontier and adding the scraped links to the frontier
    """

    def __init__(self, frontier):
        self.frontier = frontier
        self.corpus = Corpus()

    def start_crawling(self):
        """
        This method starts the crawling process which is scraping urls from
        the next available link in frontier and adding the scraped links to
        the frontier
        """
        while self.frontier.has_next_url():
            url = self.frontier.get_next_url()
            logger.info("Fetching URL %s ... Fetched: %s, Queue size: %s",
                        url, self.frontier.fetched, len(self.frontier))
            url_data = self.fetch_url(url)

            for next_link in self.extract_next_links(url_data):
                if self.corpus.get_file_name(next_link) is not None:
                    if self.is_valid(next_link):
                        self.frontier.add_url(next_link)

    def fetch_url(self, url):
        """
        This method, using the given url, should find the corresponding file
        in the corpus and return a dictionary containing the url, content of
        the file in binary format and the content size in bytes
        :param url: the url to be fetched
        :return: a dictionary containing the url, content and the size of the
        content. If the url does not exist in the corpus, a dictionary with
        content set to None and size set to 0 can be returned.
        """
        url_data = {
            "url": url,
            "content": None,
            "size": 0
        }
        return url_data

    def extract_next_links(self, url_data):
        """
        The url_data coming from the fetch_url method will be given as a
        parameter to this method. url_data contains the fetched url, the url
        content in binary format, and the size of the content in bytes.

        This method should return a list of urls in their absolute form (some
        links in the content are relative and need to be converted to the
        absolute form). Validation of links is done later via the is_valid
        method. It is not required to remove duplicates that have already been
        fetched. The frontier takes care of that.

        Suggested library: lxml
        """
        outputLinks = []
        return outputLinks

    def is_valid(self, url):
        """
        Function returns True or False based on whether the url has to be
        fetched or not. This is a great place to filter out crawler traps.
        Duplicated urls will be taken care of by the frontier. You don't
        need to check for duplication in this method.
        """
        parsed = urlparse(url)
        if parsed.scheme not in set(["http", "https"]):
            return False
        try:
            # Raw strings avoid invalid-escape warnings; the duplicate "pdf"
            # entry in the original extension list has been removed.
            return ".ics.uci.edu" in parsed.hostname \
                   and not re.match(r".*\.(css|js|bmp|gif|jpe?g|ico"
                                    r"|png|tiff?|mid|mp2|mp3|mp4"
                                    r"|wav|avi|mov|mpeg|ram|m4v|mkv|ogg|ogv|pdf"
                                    r"|ps|eps|tex|ppt|pptx|doc|docx|xls|xlsx"
                                    r"|names|data|dat|exe|bz2|tar|msi|bin|7z"
                                    r"|psd|dmg|iso|epub|dll|cnf|tgz|sha1"
                                    r"|thmx|mso|arff|rtf|jar|csv"
                                    r"|rm|smil|wmv|swf|wma|zip|rar|gz)$",
                                    parsed.path.lower())
        except TypeError:
            print("TypeError for ", parsed)
            return False