

Posted on May 15, 2024


Information Retrieval
Juntao Yu
February 2022

Plagiarism
You are reminded that this work is for credit towards the composite mark in CE706, and that the work you submit must therefore be your own. Any material you make use of, whether it be from textbooks, the Web or any other source, must be acknowledged as a comment in the program, and the extent of the reference clearly indicated.

The context of your task
Researchers, clinicians, and policy makers involved with the response to COVID-19 are constantly searching for reliable information on the virus and its impact. This presents a unique opportunity for the information retrieval (IR) and text processing communities to contribute to the response to this pandemic, as well as to study methods for quickly standing up information systems for similar future events.

The idea of this assignment is that you apply the information retrieval knowledge you acquired during this term and put it into practice. You are already familiar with Elasticsearch. You also know the processing steps that turn documents into a structured index, the commonly applied retrieval models, and the key evaluation approaches employed in IR. Now is a good time to put it all together.

Scenario
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19) [1]. CORD-19 is a resource of over 181,000 scholarly articles, including over 80,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in information retrieval and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Your task
This task comes in stages. Marks are given for each stage. The stages are as follows:

Indexing (20%): The first step for you will be to obtain the dataset. Once you have done so, choose a sample of 1000 articles as your corpus (the simplest thing is to use the first 1000 documents). This will need to be imported into Elasticsearch later (after you have defined your processing pipeline). Please note that the full dataset is very large; you only need to download the metadata.csv file provided by the challenge.
Tokenization and Case folding (10%): The next step should be to transform the input text into a normal form. For this task you are required to use Elasticsearch's built-in analyzers or other libraries (as learned in Lab 2) to tokenize the documents and apply case folding to the tokens.
Selecting Keywords (20%): One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. For this task you are required to include stopword removal and either n-gram extraction or named entity recognition, and to apply tf.idf as part of your selection and weighting step. (Hint: stopword removal and n-gram extraction can be done with Elasticsearch's built-in tokenizers and token filters, and tf.idf scores can be configured using the Elasticsearch similarity module.)
Stemming or Morphological Analysis (10%): Writing word stems to the database rather than words allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
Searching (10%): Once you have indexed the collection you want to be able to search it. You can do that on the command line (like in Lab 1), but it would be easier to do it in Kibana's Dev Tools. The task is to create 3 textual queries that a user might come up with and write the corresponding Elasticsearch queries.
Working with Elasticsearch API (10%): Finally, the last 10% will be given if you can make everything work with the Elasticsearch API.
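
One way the stages above might be wired together through the Elasticsearch API is sketched below in Python. It only builds the JSON request bodies (index settings, `_bulk` lines and a search query), so they can be pasted into Kibana's Dev Tools or sent with any HTTP client; nothing here is prescribed by the assignment. The index name `cord19_sample`, the analyzer and filter names, the BM25 parameter values and the CSV column names `cord_uid`, `title` and `abstract` are assumptions, and BM25 (Elasticsearch's default, tf.idf-style similarity) stands in for plain tf.idf.

```python
import csv
import io
import json
from itertools import islice

INDEX_NAME = "cord19_sample"  # hypothetical index name


def build_index_body():
    """Index body: lowercase -> stopwords -> stemmer -> bigram shingles."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                    "bigrams": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": True,
                    },
                },
                "analyzer": {
                    "cord_analyzer": {  # assumed name for the custom analyzer
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "english_stop",
                                   "english_stemmer", "bigrams"],
                    }
                },
            },
            # BM25 is used here as the tf.idf-style weighting (an assumption).
            "index": {"similarity": {
                "cord_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75}}},
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "cord_analyzer",
                        "similarity": "cord_bm25"}
                for field in ("title", "abstract")
            }
        },
    }


def bulk_lines(csv_file, limit=1000):
    """Yield NDJSON lines for the _bulk API from the first `limit` CSV rows."""
    for row in islice(csv.DictReader(csv_file), limit):
        # Action line first, then the document source line.
        yield json.dumps({"index": {"_index": INDEX_NAME, "_id": row["cord_uid"]}})
        yield json.dumps({"title": row["title"], "abstract": row["abstract"]})


def search_body(text, size=10):
    """Full-text query over title and abstract."""
    return {"size": size,
            "query": {"multi_match": {"query": text,
                                      "fields": ["title", "abstract"]}}}


if __name__ == "__main__":
    # Tiny made-up CSV standing in for metadata.csv.
    sample = io.StringIO(
        "cord_uid,title,abstract\n"
        "abc123,Masks and transmission,Busses were studied.\n"
    )
    print(json.dumps(build_index_body(), indent=2))
    print("\n".join(bulk_lines(sample)) + "\n")
    print(json.dumps(search_body("mask effectiveness")))
```

The three bodies would go to `PUT /cord19_sample`, `POST /_bulk` (newline-delimited, ending with a newline) and `GET /cord19_sample/_search` respectively. With this filter order, the shingle filter by default puts a `_` filler token where a stopword was removed, which is worth checking when inspecting the extracted bigrams.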

You will have noticed that the percentages above only add up to 80%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 20% of your mark will come from this. The report should contain:

Instructions for running your system
Screenshots illustrating the functionality you have implemented
A description of the document collection you have chosen
Discussion of your solution focussing on functionality implemented and possible improvements and extensions.

The report does not need to be long as long as it addresses all the above points.

Software
The backend search engine to be used is Elasticsearch. Apart from that you are free to write additional code in any language of your choice, and employ any open source tool that you find suitable.

Submission
You should submit:

Report (use the template below)
Code

Both completed items (report and code) should be submitted as a single zip file via the electronic submission system. Please check the details of the submission deadline with the CSEE School Office. The guidelines about late assignments are explained in the students' handbook.

CE706 - Information Retrieval 2022
Assignment 1
Student ID:

Instructions for running your system
Include here instructions to run your system; this could be as simple as "start Elasticsearch and Kibana" if you are not using the Elasticsearch API. You may include screenshots to clarify.

Indexing
Include here the details of how you downloaded your dataset and indexed it, including any issues that you had and how you dealt with them. Explain which documents you have selected for your experiments. You may include screenshots to clarify.

Tokenization and Normalization
Include here the details of how you did this step, including any issues that you had and how you dealt with them. Present examples to show how your system works, e.g., if you use Elasticsearch analyzers, you can show how the analyzer works by giving sample text input (remember we did this in Lab 2). You may include screenshots to clarify.

Selecting Keywords
Include here the details of how you did this step, including any issues that you had and how you dealt with them. Present examples to show how your system works, e.g., if you use Elasticsearch analyzers, you can show how the analyzer works by giving sample text input (remember we did this in Lab 2). You may include screenshots to clarify.

Stemming or Morphological Analysis
Include here the details of how you did this step, including any issues that you had and how you dealt with them. Present examples to show how your system works, e.g., if you use Elasticsearch analyzers, you can show how the analyzer works by giving sample text input (remember we did this in Lab 2). You may include screenshots to clarify.

Searching
Include here the details of your textual and Elasticsearch queries as well as the system outputs. You may include screenshots to clarify.
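
For the analyzer demonstrations the template asks for, a minimal Python sketch of `_analyze` request bodies is given here, one per processing stage, each adding a built-in token filter to the previous one. The sample sentence, the stage labels and the helper names are made up for illustration; the choice of the `standard` tokenizer with the built-in `lowercase`, `stop`, `stemmer` and `shingle` filters is one possible setup, not the required one.

```python
import json

# Hypothetical sample sentence for the report screenshots.
SAMPLE_TEXT = "The busses were carrying COVID-19 patients"

# Each stage adds one built-in token filter to the previous stage.
STAGES = [
    ("tokenization + case folding", ["lowercase"]),
    ("+ stopword removal", ["lowercase", "stop"]),
    ("+ stemming", ["lowercase", "stop", "stemmer"]),
    ("+ n-grams (shingles)", ["lowercase", "stop", "stemmer", "shingle"]),
]


def analyze_body(filters, text=SAMPLE_TEXT):
    """Request body for the _analyze API using the standard tokenizer."""
    return {"tokenizer": "standard", "filter": filters, "text": text}


def tokens_from_response(response):
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


if __name__ == "__main__":
    for label, filters in STAGES:
        print(f"# {label}")
        print("POST /_analyze")
        print(json.dumps(analyze_body(filters), indent=2))
```

Running each printed request in Kibana's Dev Tools and comparing the returned token lists stage by stage gives the kind of before/after example the template sections ask for.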

[1] https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

Attachments:

ce706-2022-as....docx