Question
This assignment requires you to develop a topical/focused crawler that crawls 500 pages from Wikipedia on a topic of your choice. You need to specify: 1) the topic, 2) at least 10 related terms (single words or phrases), and 3) at least 2 seed URLs. During crawling, you need to determine whether a page is relevant to the topic before saving it into the crawled collection: a page is relevant if it contains at least 2 different related terms from the list you specified. The page-relevance check should be case-insensitive. For example, if the topic is Information Retrieval, the seed URLs can be:
http://en.wikipedia.org/wiki/Information_retrieval and http://en.wikipedia.org/wiki/Search_engine_(computing).
Example related terms for the topic information retrieval might be: Information Retrieval, Crawler, Search Engine, tf-idf, Mean Average Precision, Precision, Recall, Relevance Feedback, Query Expansion, Retrieval Models, Boolean Model, Vector Space Model, and Language Model.
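The relevance rule described above (at least 2 different related terms, case-insensitive) could be sketched as follows. This is only one possible reading of the assignment: the names RELATED_TERMS, matched_terms and is_relevant are made up for illustration, and matching on word boundaries is an assumption the assignment does not state. Note that overlapping terms such as "precision" and "mean average precision" are counted as two different terms here.

```python
import re

# Related terms copied from the example in the question.
RELATED_TERMS = [
    "information retrieval", "crawler", "search engine", "tf-idf",
    "mean average precision", "precision", "recall", "relevance feedback",
    "query expansion", "retrieval models", "boolean model",
    "vector space model", "language model",
]


def matched_terms(text, terms=RELATED_TERMS):
    """Return the set of related terms that occur in text, ignoring case."""
    lowered = text.lower()
    found = set()
    for term in terms:
        # Assumption: require word boundaries, so "recall" does not
        # match inside "recalled". A plain substring test is a
        # simpler alternative.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(term.lower())
    return found


def is_relevant(text, terms=RELATED_TERMS, min_terms=2):
    """A page is relevant if it contains at least min_terms different terms."""
    return len(matched_terms(text, terms)) >= min_terms
```

Repeating one term many times does not make a page relevant under this sketch, because the check counts distinct terms rather than occurrences.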
Please write the solution in Python, with comments.
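A minimal sketch of the crawl loop, using only the Python standard library, might look like the code below. It is not a complete or official solution. The names LinkParser, is_article_url, fetch and crawl are hypothetical, the breadth-first order and the choice to follow links only from relevant pages are assumptions, and the relevance test is a simple case-insensitive substring check. The one-second delay and the User-Agent string are placeholders for whatever politeness settings you choose.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects <a href> values and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def is_article_url(url):
    """Keep only English Wikipedia article pages (assumed crawl scope)."""
    parts = urlparse(url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith("/wiki/"):
        return False
    # Skip namespaced pages such as File:, Help:, Special:.
    return ":" not in parts.path[len("/wiki/"):]


def fetch(url):
    """Download a page and return its HTML as text."""
    req = Request(url, headers={"User-Agent": "focused-crawler-sketch/0.1"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, terms, limit=500, fetch_page=fetch, delay=1.0):
    """Breadth-first focused crawl; returns {url: html} for relevant pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}
    while frontier and len(saved) < limit:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        text = " ".join(parser.text_parts).lower()
        # Case-insensitive check for at least 2 different related terms.
        hits = {t for t in terms if t.lower() in text}
        if len(hits) < 2:
            continue  # not relevant: do not save, do not follow its links
        saved[url] = html
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if is_article_url(link) and link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(delay)  # be polite to the server
    return saved
```

Calling crawl with the two seed URLs from the question and your list of related terms would return up to 500 relevant pages. The fetch_page parameter exists so the loop can be tried on canned HTML without touching the network.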