Question
Objective: Modify "def extract_next_links(self, url_data):" and "def is_valid(self, url):" in crawler.py
Requirements: Python 3.6+, lxml, and BeautifulSoup
----------------------------------------------------
Define function extract_next_links (crawler.py) This function extracts links from the content of a fetched webpage.
Input: url_data, a dictionary containing the content and the required metadata for a downloaded webpage. The keys are:
-url: the requested URL to be downloaded
-content: the content of the downloaded URL in binary format; None if the URL does not exist in the corpus
-size: the size of the downloaded content in bytes; 0 if the URL does not exist in the corpus
-content_type: the Content-Type from the response HTTP headers; None if the URL does not exist in the corpus or no Content-Type was provided
-http_code: the response HTTP status code; 404 if the URL does not exist in the corpus
-is_redirected: a boolean indicating whether redirection happened to get the final response
-final_url: the final URL after all of the redirections; None if there was no redirection
Output: list of URLs in string form. Each URL should be in absolute form. It is not required to remove duplicates that have already been fetched. The frontier takes care of ignoring duplicates.
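Converting scraped hrefs to the required absolute form is typically a one-liner with the standard library's urljoin; a minimal illustration (the example URLs are made up):

```python
from urllib.parse import urljoin

# Base URL of the page the links were scraped from (hypothetical example).
base = "http://www.ics.uci.edu/~lopes/teaching/"

# Relative hrefs resolve against the base; absolute hrefs pass through unchanged.
print(urljoin(base, "cs221/index.html"))            # relative path
print(urljoin(base, "/about/contact.php"))          # root-relative path
print(urljoin(base, "http://vision.ics.uci.edu/"))  # already absolute
```

Note that when url_data["is_redirected"] is True, links should be resolved against final_url rather than the originally requested url.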
----------------------------------------------------
Define function is_valid (crawler.py) This function returns True or False based on whether a URL is valid and must be fetched or not.
-Input: URL, the URL of a web page in string form
-Output: True if the URL is valid, False otherwise. This is a great place to filter out crawler traps. Duplicated URLs will be taken care of by the frontier; you don't need to check for duplication in this method.
Filter out crawler traps (e.g. the ICS calendar, dynamic URLs, etc.). Crawler traps additionally include cases that call for history-based trap detection: based on your practice runs, you will determine whether any of the sites you have crawled are traps, such as continuously repeating sub-directories and very long URLs. You will need to do some research online, and you must describe the type of trap detection you implemented and why you implemented it that way. (DO NOT HARD-CODE URLS YOU THINK ARE TRAPS, i.e. regexes for specific URLs; YOU SHOULD USE LOGIC TO FILTER THEM OUT.)
Returning False on a URL prevents that URL from entering your frontier. Part of the function has already been implemented. It is your job to figure out how to add to the existing logic in order to avoid crawler traps and ensure that only valid links are sent to the frontier.
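One way to realize the history-based detection described above is to count how many URLs have been seen under each (hostname, first path segment) prefix and reject further URLs once a prefix dominates the crawl; a minimal sketch, where the default threshold is an assumption to be calibrated on practice runs, not part of the assignment:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Running history: how many URLs we have seen per (host, first path segment).
_seen_prefixes = defaultdict(int)
PREFIX_LIMIT = 500  # assumed threshold; tune based on practice runs

def looks_like_history_trap(url, limit=PREFIX_LIMIT):
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    prefix = (parsed.hostname, segments[0] if segments else "")
    _seen_prefixes[prefix] += 1
    # Once one sub-tree of a site dominates the crawl, treat further URLs
    # under it as a likely trap (e.g. an infinitely paginated calendar).
    return _seen_prefixes[prefix] > limit
```

This keeps the rule generic (no hard-coded URLs): any sub-tree that keeps producing links indefinitely eventually trips the counter.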
----------------------------------------------------
crawler.py: This file is responsible for scraping URLs from the next available link in frontier and adding the scraped links back to the frontier
import logging
import re
from urllib.parse import urlparse
logger = logging.getLogger(__name__)
class Crawler:
    """
    This class is responsible for scraping urls from the next available link in frontier and adding
    the scraped links to the frontier
    """

    def __init__(self, frontier, corpus):
        self.frontier = frontier
        self.corpus = corpus

    def start_crawling(self):
        """
        This method starts the crawling process which is scraping urls from the next available link
        in frontier and adding the scraped links to the frontier
        """
        while self.frontier.has_next_url():
            url = self.frontier.get_next_url()
            logger.info("Fetching URL %s ... Fetched: %s, Queue size: %s", url, self.frontier.fetched, len(self.frontier))
            url_data = self.corpus.fetch_url(url)

            for next_link in self.extract_next_links(url_data):
                if self.is_valid(next_link):
                    if self.corpus.get_file_name(next_link) is not None:
                        self.frontier.add_url(next_link)
    def extract_next_links(self, url_data):
        """
        The url_data coming from the fetch_url method will be given as a parameter to this method.
        url_data contains the fetched url, the url content in binary format, and the size of the
        content in bytes. This method should return a list of urls in their absolute form (some
        links in the content are relative and need to be converted to the absolute form).
        Validation of links is done later via the is_valid method. It is not required to remove
        duplicates that have already been fetched. The frontier takes care of that.

        Suggested library: lxml
        """
        outputLinks = []
        return outputLinks
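A possible shape for this method, assuming url_data follows the dictionary described above. The assignment suggests lxml, but to keep this sketch dependency-free it uses the stdlib html.parser instead; an lxml-based version would typically use lxml.html with make_links_absolute or iterlinks. It is written as a free function here for self-containment; in crawler.py it would be a method taking self:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _LinkCollector(HTMLParser):
    """Collects href attribute values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_next_links(url_data):
    # Nothing to parse if the URL was not in the corpus.
    if not url_data.get("content"):
        return []
    # Resolve relative links against the post-redirect URL when one exists.
    base = url_data.get("final_url") or url_data["url"]
    parser = _LinkCollector()
    parser.feed(url_data["content"].decode("utf-8", errors="ignore"))
    return [urljoin(base, href) for href in parser.hrefs]
```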
    def is_valid(self, url):
        """
        Function returns True or False based on whether the url has to be fetched or not. This is a
        great place to filter out crawler traps. Duplicated urls will be taken care of by frontier.
        You don't need to check for duplication in this method
        """
        parsed = urlparse(url)
        if parsed.scheme not in set(["http", "https"]):
            return False
        try:
            return ".ics.uci.edu" in parsed.hostname \
                   and not re.match(r".*\.(css|js|bmp|gif|jpe?g|ico"
                                    r"|png|tiff?|mid|mp2|mp3|mp4"
                                    r"|wav|avi|mov|mpeg|ram|m4v|mkv|ogg|ogv|pdf"
                                    r"|ps|eps|tex|ppt|pptx|doc|docx|xls|xlsx|names|data|dat|exe"
                                    r"|bz2|tar|msi|bin|7z|psd|dmg|iso|epub|dll|cnf|tgz|sha1"
                                    r"|thmx|mso|arff|rtf|jar|csv"
                                    r"|rm|smil|wmv|swf|wma|zip|rar|gz)$", parsed.path.lower())
        except TypeError:
            print("TypeError for ", parsed)
            return False
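The structural trap checks the assignment asks for (very long URLs, continuously repeating sub-directories, dynamic/calendar-style URLs) could be layered on top of the existing checks along these lines; the length cut-off, the repetition count, and the list of suspicious query keys are all assumptions to tune during practice runs:

```python
from urllib.parse import urlparse, parse_qs

MAX_URL_LENGTH = 300  # assumed cut-off; very long URLs often signal a trap

def is_probable_trap(url):
    parsed = urlparse(url)
    # 1. Excessively long URLs.
    if len(url) > MAX_URL_LENGTH:
        return True
    # 2. Continuously repeating sub-directories, e.g. /a/b/a/b/a/b/.
    segments = [s for s in parsed.path.split("/") if s]
    if any(segments.count(s) > 2 for s in set(segments)):
        return True
    # 3. Dynamic URLs whose query strings look like calendar paging or sessions.
    params = parse_qs(parsed.query)
    if any(key in params for key in ("calendar", "date", "month", "year", "sessionid")):
        return True
    return False
```

In is_valid, a `if is_probable_trap(url): return False` guard after the existing scheme and extension checks would wire this in, using logic rather than hard-coded URLs.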
----------------------------------------------------
frontier.py: This file acts as a representation of a frontier. It has methods to add a URL to the frontier, get the next URL, and check whether the frontier has any more URLs. Additionally, it has methods to save the current state of the frontier and to load an existing state.
import logging
import os
import pickle
from collections import deque
logger = logging.getLogger(__name__)
class Frontier:
    """
    This class acts as a representation of a frontier. It has methods to add a url to the frontier,
    get the next url and check if the frontier has any more urls. Additionally, it has methods to
    save the current state of the frontier and load an existing state

    Attributes:
        urls_queue: A queue of urls to be downloaded by crawlers
        urls_set: A set of urls to avoid duplicated urls
        fetched: the number of fetched urls so far
    """

    # File names to be used when loading and saving the frontier state
    FRONTIER_DIR_NAME = "frontier_state"
    URL_QUEUE_FILE_NAME = os.path.join(".", FRONTIER_DIR_NAME, "url_queue.pkl")
    URL_SET_FILE_NAME = os.path.join(".", FRONTIER_DIR_NAME, "url_set.pkl")
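The frontier listing shown here stops at the class attributes. Based purely on the attributes described in the docstring, its core add/get methods plausibly look like the following sketch (an illustration of the deque-plus-set pattern, not the course's actual implementation):

```python
from collections import deque

class Frontier:
    def __init__(self):
        self.urls_queue = deque()  # urls waiting to be fetched, FIFO order
        self.urls_set = set()      # every url ever enqueued, for duplicate checks
        self.fetched = 0           # number of urls handed out so far

    def add_url(self, url):
        # The set membership test is what makes duplicate URLs a no-op.
        if url not in self.urls_set:
            self.urls_queue.append(url)
            self.urls_set.add(url)

    def has_next_url(self):
        return len(self.urls_queue) > 0

    def get_next_url(self):
        self.fetched += 1
        return self.urls_queue.popleft()

    def __len__(self):
        return len(self.urls_queue)
```

This is why is_valid and extract_next_links need not worry about duplicates: urls_set filters them before they ever reach the queue.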
Define function fetch_url (crawler.py) This method, using the given URL, should find the corresponding file in the corpus and return a dictionary containing the URL, the content of the file in binary format, and the content size in bytes. To find out how to locate the corresponding file in the corpus, take a look at the Corpus class (corpus.py).
Input: the URL to be fetched
Output: a dictionary containing the URL, the content, and the size of the content. If the URL does not exist in the corpus, a dictionary with content set to None and size set to 0 can be returned.