Question
Using Python 3.7 in PyCharm
This assignment requires you to develop a topical (focused) crawler that crawls 500 pages from Wikipedia on a topic of your choice. You need to specify:
1) the topic,
2) at least 10 related terms (single words or phrases), and
3) at least 2 seed URLs.

During crawling, you need to determine whether each page is relevant to the topic by checking that it contains at least 2 different related terms from your list before saving it to the crawled collection. The page-relevance check should be case-insensitive. For example, if the topic is Information Retrieval, the related terms might be: Information Retrieval, Crawler, Search Engine, tf-idf, Mean Average Precision, Precision, Recall, Relevance Feedback, Query Expansion, Retrieval Models, Boolean Model, Vector Space Model, and Language Model.

You can use any programming language you are comfortable with, and you are free to adapt code found online.

Prepare a folder that contains 2 sub-folders:
1. The first sub-folder has all the crawled pages.
2. The second sub-folder has the source code and a report. The report must include the following:
2a. The topic of your choice, at least 10 related terms, and at least 2 seed URLs.
2b. How the crawler is implemented, the number of pages crawled, and the URLs of all crawled pages.
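One possible way to approach the requirements above is sketched below, using only the Python standard library. It is an illustration under stated assumptions, not a reference solution: the function names (`is_relevant`, `extract_wiki_links`, `crawl`), the User-Agent string, the one-second delay, and the choice to follow links only from relevant pages are all hypothetical. The term list is copied from the example in the question. Relevance is tested by plain substring matching on the raw HTML, so "precision" also matches inside "mean average precision"; a stricter solution might strip tags or match whole words.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

# Example terms taken from the assignment text; replace with your own topic.
RELATED_TERMS = [
    "information retrieval", "crawler", "search engine", "tf-idf",
    "mean average precision", "precision", "recall", "relevance feedback",
    "query expansion", "vector space model",
]


def is_relevant(text, terms, min_terms=2):
    """True if text contains at least min_terms different terms (case-insensitive)."""
    lowered = text.lower()
    distinct_terms = {term.lower() for term in terms}
    return sum(1 for term in distinct_terms if term in lowered) >= min_terms


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_wiki_links(html, base_url):
    """Return absolute English Wikipedia article URLs, without fragments.

    Titles containing ':' (File:, Help:, Category:, ...) are skipped.
    """
    parser = LinkParser()
    parser.feed(html)
    result = []
    for href in parser.links:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(WIKI_PREFIX) and ":" not in url[len(WIKI_PREFIX):]:
            result.append(url)
    return result


def fetch(url):
    """Download a page as text. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "focused-crawler-homework/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, terms, max_pages=500, fetch_page=fetch, delay=1.0):
    """Breadth-first crawl; returns {url: html} for relevant pages only."""
    queue = deque(seeds)
    seen = set(seeds)
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        time.sleep(delay)  # pause between requests to be polite to the server
        if not is_relevant(html, terms):
            continue  # assumption: irrelevant pages are neither saved nor expanded
        saved[url] = html
        for link in extract_wiki_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

Calling `crawl` with two or more seed URLs and the term list returns a dictionary whose keys are the crawled URLs (useful for the report) and whose values are the page contents, which can then be written to the first sub-folder. The `fetch_page` parameter exists so the crawl logic can be exercised with canned pages instead of live requests.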