
Question



NOTE: USE PYTHON 3.8+

PROJECT: SEARCH ENGINE

Corpus: all ICS web pages. We will provide you with the crawled data as a file (https://github.com/avourakis/Search-Engine/tree/master/WEBPAGES_CLEAN). You are expected to build your search engine index off of this data.

Main challenges: full HTML parsing, file/DB handling, and handling user input (using the command line, a desktop GUI application, or a web interface).

COMPONENT 1 - INDEX: Create an inverted index for the entire corpus given to you. You can either use a database to store your index (MongoDB, Redis, and memcached are some examples) or you can store the index in a file. You are free to choose an approach here. You need to remove stop words from the identified tokens and apply lemmatization to them. The index should store more than just a simple list of documents where the token occurs. At the very least, your index should store the TF-IDF of every term/document pair. Words in title, bold, and heading (h1, h2, h3) tags are more important than other words. You should store metadata about their importance to be used later in the retrieval phase.

Sample Index: This is a simplistic example provided for your understanding. Please do not consider this the expected index format. A good inverted index will store more information than this.

Index Structure:
token    docId1, tf-idf1 ; docId2, tf-idf2

Example:
informatics    doc_1, 5 ; doc_2, 10 ; doc_3, 7

You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. These are useful metadata that could be added to your inverted index data.

COMPONENT 2 - SEARCH AND RETRIEVE: Your program should prompt the user for a query. This doesn't need to be a web interface; it can be a console prompt. At the time of the query, your program will look up your index, perform some calculations (see ranking below), and give out the ranked list of pages that are relevant for the query.
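As an illustration of Component 1, here is a minimal sketch of an in-memory inverted index that stores a TF-IDF weight and an "appeared in an important tag" flag per term/document pair. Everything in it is an assumption made for illustration rather than the expected format: the names `PageParser`, `build_index`, and `tokenize`, the tiny stop word list, the posting layout, and the log-scaled TF-IDF variant are all hypothetical choices. Lemmatization is left out to keep the sketch dependency-free; a real solution could plug in something like NLTK's `WordNetLemmatizer` inside `tokenize`.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Assumed, deliberately tiny stop word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "the", "of", "in", "is", "to", "for"}
IMPORTANT_TAGS = {"title", "b", "strong", "h1", "h2", "h3"}
SKIPPED_TAGS = {"script", "style"}


def tokenize(text):
    """Lowercase alphanumeric tokens with stop words removed (no lemmatization)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class PageParser(HTMLParser):
    """Collects every token on a page, plus the tokens seen inside important tags."""

    def __init__(self):
        super().__init__()
        self.open_important = 0  # nesting depth of important tags
        self.open_skipped = 0    # nesting depth of script/style tags
        self.tokens = []
        self.important = set()

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.open_important += 1
        elif tag in SKIPPED_TAGS:
            self.open_skipped += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.open_important:
            self.open_important -= 1
        elif tag in SKIPPED_TAGS and self.open_skipped:
            self.open_skipped -= 1

    def handle_data(self, data):
        if self.open_skipped:
            return
        words = tokenize(data)
        self.tokens.extend(words)
        if self.open_important:
            self.important.update(words)


def build_index(pages):
    """pages maps doc id -> raw HTML. Returns token -> {doc_id: posting dict}."""
    parsed = {}
    for doc_id, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        parsed[doc_id] = (Counter(parser.tokens), parser.important)

    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for counts, _ in parsed.values():
        doc_freq.update(counts.keys())

    n_docs = len(parsed)
    index = defaultdict(dict)
    for doc_id, (counts, important) in parsed.items():
        for token, tf in counts.items():
            idf = math.log10(n_docs / doc_freq[token])
            index[token][doc_id] = {
                "tfidf": (1 + math.log10(tf)) * idf,  # log-scaled TF times IDF
                "important": token in important,
            }
    return dict(index)
```

Calling `build_index({"doc_1": "<title>Informatics</title>", ...})` returns a plain dictionary, which could then be written to a file (for example with `json.dump`) or loaded into one of the databases mentioned above.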
COMPONENT 3 - RANKING: At the very least, your ranking formula should include the following (but you should feel free to add additional components to this formula if you think they improve the retrieval):
- TF-IDF score
- Importance of the words in important HTML tags

EXTRA CREDITS:
+2 Implement PageRank and use it in your ranking formula
+2 Implement an additional 2-gram index and use it during retrieval
+2 Enhance the index with word positions and use that information in retrieval
+1 Index anchor words for the target pages
+2 Implement a Web or GUI interface instead of a console. You should display the title and a brief description of each page in the results.

Goal: Build an index and a basic retrieval component. By basic retrieval component, we mean that at this point you just need to be able to query your index for links (the query can be as simple as a single word at this point). These links do not need to be accurate/ranked.

At least the following queries should be used to test your retrieval:
1. Informatics
2. Mondego
3. Irvine
4. artificial intelligence
5. computer science
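The two required ranking ingredients can be combined in many ways; the following is only a rough sketch of one of them. It assumes a hypothetical index layout of `token -> {doc_id: {"tfidf": float, "important": bool}}`, and both the `search` function and the `IMPORTANT_BOOST` weight are made up for illustration, not prescribed by the assignment. Each query token adds its TF-IDF to a document's score, scaled up when the token appeared in a title, bold, or heading tag of that document.

```python
import re
from collections import defaultdict

# Assumed weight: a token found in an important tag counts 50% more.
IMPORTANT_BOOST = 0.5


def search(index, query, top_k=10):
    """Return up to top_k (doc_id, score) pairs, best score first."""
    scores = defaultdict(float)
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        for doc_id, posting in index.get(token, {}).items():
            weight = 1.0 + (IMPORTANT_BOOST if posting["important"] else 0.0)
            scores[doc_id] += posting["tfidf"] * weight
    # Sort by descending score; ties are broken by doc id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Sample postings reusing the TF-IDF numbers from the assignment's example;
# the "important" flags are invented for this demonstration.
sample_index = {
    "informatics": {
        "doc_1": {"tfidf": 5.0, "important": False},
        "doc_2": {"tfidf": 10.0, "important": False},
        "doc_3": {"tfidf": 7.0, "important": True},
    }
}
```

With these sample postings, `search(sample_index, "Informatics")` ranks `doc_3` ahead of `doc_2` because the boost lifts its score from 7 to 10.5. A console front end could wrap the call in a loop around `input("Query: ")`, and extra-credit signals such as PageRank could be added as further terms in the score.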
