Question
Information Retrieval
Juntao Yu
February 2022

Plagiarism
You are reminded that this work is for credit towards the composite mark in CE706, and that the work you submit must therefore be your own. Any material you make use of, whether it be from textbooks, the Web or any other source, must be acknowledged as a comment in the program, and the extent of the reference clearly indicated.

The context of your task
Researchers, clinicians, and policy makers involved with the response to COVID-19 are constantly searching for reliable information on the virus and its impact. This presents a unique opportunity for the information retrieval (IR) and text processing communities to contribute to the response to this pandemic, as well as to study methods for quickly standing up information systems for similar future events.

The idea of this assignment is that you apply the information retrieval knowledge you acquired during this term and put it into practice. You are already familiar with Elasticsearch. You also know the processing steps that turn documents into a structured index and the commonly applied retrieval models, and you know the key evaluation approaches that are employed in IR. Now is a good time to put it all together.

Scenario: In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19)[1]. CORD-19 is a resource of over 181,000 scholarly articles, including over 80,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in information retrieval and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
Your task
This task comes in stages. Marks are given for each stage. The stages are as follows:
[1] https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
- Indexing (20%) The first step for you will be to obtain the dataset. Once you have done so, choose a sample of 1000 articles as your corpus (the simplest thing is to use the first 1000 documents). This will need to be imported into Elasticsearch later (after you have defined your processing pipeline). Please note that the full dataset is very large; you will only need to download the metadata.csv file provided by the challenge.
- Tokenization and Case folding (10%) The next step should be to transform the input text into a normal form. For this task you are required to use Elasticsearch's built-in analyzers or other libraries (as learned in Lab 2) to tokenize the documents and perform case folding on the tokens.
- Selecting Keywords (20%) One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. For this task you are required to include stopword removal and either n-gram extraction or named entity recognition, and to apply tf.idf as part of your selection and weighting step. (Hint: stopword removal and n-gram extraction can be done with Elasticsearch's built-in tokenizers and token filters, and tf.idf scores can be configured using the Elasticsearch similarity module.)
- Stemming or Morphological Analysis (10%) Writing word stems to the database rather than words allows the various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
- Searching (10%) Once you have indexed the collection you want to be able to search it. You can do that on the command line (like in Lab 1), but it would be easier to do it in Kibana's Dev Tools. The task is to create 3 textual queries that a user might come up with and to write the corresponding Elasticsearch queries.
- Working with Elasticsearch API (10%) The final 10% will be given if you can make everything work with the Elasticsearch API.
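The tokenization, case folding, keyword selection and stemming stages above can all be expressed as index settings. The following is a minimal sketch, not a reference solution: the analyzer names `cord_english` and `cord_bigrams`, the filter names, and the choice of `title` and `abstract` as the indexed fields are assumptions made for illustration. The tf.idf formula is written as a scripted similarity in the style shown in the Elasticsearch similarity module documentation, since BM25 is the default in current versions.

```python
import json

# Hypothetical names chosen for this sketch (not given in the assignment).
ANALYZER = "cord_english"
SHINGLE_ANALYZER = "cord_bigrams"

# tf.idf written as a scripted similarity (Painless source string).
TFIDF_SCRIPT = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def build_index_body():
    """Return the settings and mappings for a create-index request."""
    text_field = {
        "type": "text",
        "analyzer": ANALYZER,
        "similarity": "scripted_tfidf",
        # Sub-field that stores word bigrams as phrase-like keywords.
        "fields": {"bigrams": {"type": "text", "analyzer": SHINGLE_ANALYZER}},
    }
    return {
        "settings": {
            "similarity": {
                "scripted_tfidf": {"type": "scripted", "script": {"source": TFIDF_SCRIPT}}
            },
            "analysis": {
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                    "bigram_shingles": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": False,
                    },
                },
                "analyzer": {
                    # Tokenize, case-fold, drop stopwords, then stem.
                    ANALYZER: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "english_stop", "english_stemmer"],
                    },
                    # Tokenize, case-fold, then emit word bigrams.
                    SHINGLE_ANALYZER: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "bigram_shingles"],
                    },
                },
            },
        },
        # Assumed fields: title and abstract columns of the metadata file.
        "mappings": {"properties": {"title": text_field, "abstract": text_field}},
    }


index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```

The printed JSON could be pasted into Kibana's Dev Tools as the body of a create-index request (for example `PUT /cord19-sample`, where the index name is again hypothetical), and the `_analyze` API can then be used to inspect the tokens each analyzer produces for a sample sentence.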
- Instructions for running your system
- Screenshots illustrating the functionality you have implemented
- A description of the document collection you have chosen
- Discussion of your solution focussing on functionality implemented and possible improvements and extensions.
- Report (use the template below)
- Code
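As a possible starting point for the code, the indexing, searching and API stages could be sketched as follows. This is an illustration under stated assumptions rather than a complete solution: the index name `cord19-sample`, the helper names, the use of the `cord_uid`, `title` and `abstract` columns, and a local Elasticsearch node at `http://localhost:9200` without authentication are all assumptions. A tiny in-memory CSV stands in for the real metadata file so that the sketch runs without the dataset or a server.

```python
import csv
import io
import json
import urllib.request
from itertools import islice

INDEX_NAME = "cord19-sample"  # hypothetical index name


def first_n_rows(csv_file, n=1000):
    """Return the first n rows of an open CSV file as dictionaries."""
    return list(islice(csv.DictReader(csv_file), n))


def to_bulk_ndjson(rows, index_name=INDEX_NAME):
    """Build a _bulk body: one action line and one source line per document."""
    lines = []
    for row in rows:
        # Assumes cord_uid is usable as a document id; repeated ids would overwrite.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": row["cord_uid"]}}))
        lines.append(json.dumps({"title": row["title"], "abstract": row["abstract"]}))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def search_body(user_query):
    """Turn a free-text query into a multi_match query, weighting titles higher."""
    return {
        "query": {
            "multi_match": {"query": user_query, "fields": ["title^2", "abstract"]}
        }
    }


def post(path, body, content_type, host="http://localhost:9200"):
    """POST a string body to an assumed local, unsecured Elasticsearch node."""
    request = urllib.request.Request(
        host + path,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# In-memory stand-in for the metadata file (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "cord_uid,title,abstract\n"
    "uid1,Masks and transmission,Face masks reduce droplet spread.\n"
    "uid2,Vaccine trials,Early results of vaccine trials.\n"
)
rows = first_n_rows(sample_csv, n=1000)
bulk_body = to_bulk_ndjson(rows)
query = search_body("vaccine side effects")
```

With a node running, `post("/_bulk", bulk_body, "application/x-ndjson")` would load the documents and `post("/" + INDEX_NAME + "/_search", json.dumps(query), "application/json")` would run the query. For the real data, the in-memory sample would be replaced by the downloaded metadata file opened with `open(..., newline="", encoding="utf-8")`.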