Question

Posted on Sep 25, 2024

https://openlibrary.org catalogues books from various sources. Assume that you are working for a startup company called Ultimate Library, which is cash-strapped but wants to build a similar system with many novel features that it assumes will make its solution more popular than openlibrary.

You are given one of the JSON documents, which details typical book content for an Information Retrieval task. Here is the link:

http://openlibrary.org/books/OL6807502M.json

The HTML version can be seen at this link: https://openlibrary.org/books/OL6807502M/Code

At a minimum, the business wants you to enable search on the ToC title field (the title field of each Table of Contents entry) and on the following main book fields: isbn_10, publishers, title, description, by_statement and subjects.

When a user invokes a search on the proposed website, the front-end developers need to show the search field values on the web page, along with the following main document fields: first_sentence, notes, number_of_pages and publish_date.
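The two requirement lists above can be read as a small decision table: searchable fields must be both indexed and stored (they are searched and also shown), display-only fields only need to be stored, and everything else needs neither. The sketch below is one way to write that reading down. The name `table_of_contents.title` is an assumed flattened name for the ToC title, and `field_flags` is a hypothetical helper, not part of Solr.

```python
# Requirement lists taken from the assignment text.
# "table_of_contents.title" is an ASSUMED flattened name for the ToC title.
SEARCH_FIELDS = {
    "table_of_contents.title", "isbn_10", "publishers", "title",
    "description", "by_statement", "subjects",
}
DISPLAY_ONLY_FIELDS = {
    "first_sentence", "notes", "number_of_pages", "publish_date",
}


def field_flags(name: str) -> dict:
    """Return the indexed/stored flags implied by the requirements."""
    if name in SEARCH_FIELDS:
        # Searched and shown on the results page.
        return {"indexed": True, "stored": True}
    if name in DISPLAY_ONLY_FIELDS:
        # Shown on the results page but never searched.
        return {"indexed": False, "stored": True}
    # Not mentioned in the requirements: neither index nor store.
    return {"indexed": False, "stored": False}


if __name__ == "__main__":
    for name in ("title", "notes", "some_other_field"):
        print(name, field_flags(name))
```

Any field that appears in the JSON document but in neither list falls through to the last branch, which is the "lean startup" constraint in code form.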

Since you are required to index several million documents, you have to ensure that you are not indexing or storing fields that are not required. This is a lean startup and they only want to pay for resources that are absolutely required.

Your first instinct is to try to index the JSON document as is into Solr, since Solr is configured by default to use ManagedIndexSchemaFactory, using the JSON update handler, which is available at: /update/json/docs

Read more on Custom JSON handler: https://lucene.apache.org/solr/guide/7_0/transforming-and-indexing-custom-json.html
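As a rough illustration of posting a document to /update/json/docs, the sketch below only builds the HTTP request and does not send it. The host, port and the collection name `books` are assumptions, as is the helper `build_index_request`; the `f` and `commit` parameters are the ones described in the guide linked above, with `f=/**` asking Solr to map every JSON field.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_index_request(solr_base, collection, doc, field_paths):
    """Build (but do not send) a POST to Solr's custom JSON handler."""
    # One "f" parameter per field mapping, then an explicit commit.
    params = [("f", path) for path in field_paths]
    params.append(("commit", "true"))
    url = f"{solr_base}/{collection}/update/json/docs?{urlencode(params)}"
    body = json.dumps(doc).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Host, collection name and document are placeholders for illustration.
    req = build_index_request(
        "http://localhost:8983/solr", "books", {"title": "example"}, ["/**"]
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```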

Once you successfully index and store the document in Solr using the default managed-index schema, verify the fields that are created on the fly and provide your critical analysis based on the requirements given.
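One way to do the verification step is to fetch the field list from Solr's Schema API (GET /solr/<collection>/schema/fields) and compare each field's flags against what the requirements call for. The sketch below works on an already-parsed field list so that it runs without a server. The sample data is made up, `audit_fields` is a hypothetical helper, and treating a missing `indexed` or `stored` attribute as true is an assumption based on Solr's usual defaults.

```python
def audit_fields(schema_fields, wanted):
    """Return {name: (actual, expected)} for fields whose flags differ.

    schema_fields: list of dicts shaped like the "fields" array of a
                   Schema API response.
    wanted:        {name: (indexed, stored)}; fields not listed are
                   expected to be neither indexed nor stored.
    """
    mismatches = {}
    for field in schema_fields:
        name = field["name"]
        # ASSUMPTION: an absent attribute means Solr's default of true.
        actual = (field.get("indexed", True), field.get("stored", True))
        expected = wanted.get(name, (False, False))
        if actual != expected:
            mismatches[name] = (actual, expected)
    return mismatches


if __name__ == "__main__":
    # Made-up sample, not real output from any Solr instance.
    sample = [
        {"name": "title", "type": "text_general"},
        {"name": "notes", "type": "text_general"},
        {"name": "unused_field", "type": "text_general"},
    ]
    wanted = {"title": (True, True), "notes": (False, True)}
    for name, (actual, expected) in audit_fields(sample, wanted).items():
        print(f"{name}: actual={actual} expected={expected}")
```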

Do all the auto-generated fields have attributes defined as per the business requirements? Check field attributes such as indexed, stored, omitNorms and omitTermFreqAndPositions, and see whether they are optimized for every field by default. Explain why or why not, and whether these fields are created correctly or incorrectly.

Fix the issues by providing a schema.xml file with correct attributes. Please note that many fields will have been created with appropriate attributes. In such cases, explain why the attributes are appropriate for the given field. List any field value assumptions on which you base your argument.
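For a sense of what corrected field entries in schema.xml could look like, the sketch below generates `<field>` elements from a small table. It is an illustration under assumptions, not a model answer: the field types `text_general`, `string` and `pint` are taken from Solr's default configset, the choice of type and the multiValued setting per field are assumptions about the data, and `field_element` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def field_element(name, field_type, indexed, stored, multi_valued=False):
    """Build one schema.xml <field> element with explicit attributes."""
    attrs = {
        "name": name,
        "type": field_type,
        "indexed": str(indexed).lower(),
        "stored": str(stored).lower(),
        "multiValued": str(multi_valued).lower(),
    }
    return ET.Element("field", attrs)


if __name__ == "__main__":
    # (name, type, indexed, stored, multiValued); types and multiValued
    # settings are ASSUMPTIONS about the data, not given by the assignment.
    rows = [
        ("title", "text_general", True, True, False),
        ("isbn_10", "string", True, True, True),
        ("notes", "text_general", False, True, False),
        ("number_of_pages", "pint", False, True, False),
    ]
    for row in rows:
        print(ET.tostring(field_element(*row), encoding="unicode"))
```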

Please note that you are only required to provide a schema.xml for this assignment. There is no need to re-deploy your Solr with the corrected schema.xml.

Provide your Google Cloud Solr instance password for verification of the inserted document along with your analysis. Submit a Word or PDF document with your answers for the assignment on Canvas. Your deliverables should include the corrected schema.xml to obtain full marks.