Question
https://openlibrary.org catalogues books from various sources. Assume that you are working for a cash-strapped startup called Ultimate Library, which wants to build a similar system with many novel features that it believes will make its solution more popular than openlibrary.
You are given a JSON document that describes a typical book record for the Information Retrieval task. Here is the link:
http://openlibrary.org/books/OL6807502M.json
The HTML version can be seen at this link: https://openlibrary.org/books/OL6807502M/Code
At a minimum, the business wants you to enable search on the ToC title field (the title of each Table of Contents entry) and on these main book fields: isbn_10, publishers, title, description, by_statement and subjects.
When a user runs a search on the proposed website, the front-end developers need to show the values of those search fields on the web page, along with the following main document fields: first_sentence, notes, number_of_pages and publish_date.
Since you are required to index several million documents, you have to ensure that you are not indexing or storing fields that are not required. This is a lean startup and they only want to pay for resources that are absolutely required.
Your first instinct is to index the JSON document as is into Solr, since Solr is configured by default to use ManagedIndexSchemaFactory, using the JSON update handler available at: /update/json/docs
Read more on the custom JSON handler: https://lucene.apache.org/solr/guide/7_0/transforming-and-indexing-custom-json.html
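As a rough illustration of the indexing step, the sketch below builds (but does not send) a POST request to the /update/json/docs handler. The host, the collection name "books" and the helper name build_update_request are assumptions made for this example, not part of the assignment.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, collection, doc, commit=True):
    """Build a POST to Solr's custom JSON handler without sending it.

    base_url and collection are placeholders for your own Solr instance.
    """
    query = urllib.parse.urlencode({"commit": "true" if commit else "false"})
    url = f"{base_url.rstrip('/')}/solr/{collection}/update/json/docs?{query}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical local instance and a cut-down document for illustration.
    request = build_update_request(
        "http://localhost:8983", "books", {"title": "Code"}
    )
    print(request.get_method(), request.full_url)
    # To actually index, pass `request` to urllib.request.urlopen().
```

In practice the document body would be the JSON downloaded from the Open Library link above rather than the one-field placeholder used here.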
Once you have successfully indexed and stored the document in Solr using the default managed schema, verify the fields that are created on the fly and provide your critical analysis based on the requirements given.
Do all the auto-generated fields have attributes defined as per the business requirements? Check field attributes such as indexed, stored, omitNorms and omitTermFreqAndPositions, and see whether they are optimized for every field by default. Explain why these fields are or are not created correctly.
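One possible way to structure this check is sketched below. It assumes the field list has already been fetched from Solr's Schema API (for example the "fields" array returned by /solr/<collection>/schema/fields, optionally with showDefaults=true) and parsed into Python dictionaries. The helper name audit_fields and the sample data are hypothetical.

```python
def audit_fields(schema_fields, wanted):
    """Return {field_name: {attribute: (actual, wanted)}} for each mismatch.

    Fields missing from `wanted` are treated as not required, so the
    target for them is indexed=False and stored=False.
    """
    problems = {}
    for field in schema_fields:
        name = field["name"]
        target = wanted.get(name, {"indexed": False, "stored": False})
        diffs = {
            attr: (field.get(attr), value)
            for attr, value in target.items()
            if field.get(attr) != value
        }
        if diffs:
            problems[name] = diffs
    return problems


if __name__ == "__main__":
    # Made-up sample, not real output from a Solr instance.
    sample = [
        {"name": "title", "type": "text_general", "indexed": True, "stored": True},
        {"name": "notes", "type": "text_general", "indexed": True, "stored": True},
        {"name": "covers", "type": "plongs", "indexed": True, "stored": True},
    ]
    wanted = {
        "title": {"indexed": True, "stored": True},   # searched and displayed
        "notes": {"indexed": False, "stored": True},  # displayed only
    }
    for name, diffs in audit_fields(sample, wanted).items():
        print(name, diffs)
```

System fields such as id and _version_ would need their own entries in wanted, otherwise this sketch flags them as unnecessary.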
Fix the issues by providing a schema.xml file with correct attributes. Please note that many fields will have been created with appropriate attributes. In such cases, explain why the attributes are appropriate for the given field. List any assumptions about field values on which you base your argument.
Please note that you are only required to provide a schema.xml for this assignment. There is no need to redeploy your Solr instance with the corrected schema.xml.
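As a starting point for the corrected schema, the sketch below generates candidate field definitions from the stated requirements: search fields are indexed and stored, display-only fields are stored but not indexed. The field types, the multiValued flags and the flattened name table_of_contents.title are all assumptions for illustration; verify them against the fields your own Solr instance actually created.

```python
import xml.etree.ElementTree as ET

# Searched and shown on the results page: indexed and stored.
# Value format: (assumed field type, assumed multiValued flag).
SEARCH_FIELDS = {
    "table_of_contents.title": ("text_general", True),  # assumed flattened name
    "isbn_10": ("string", True),
    "publishers": ("text_general", True),
    "title": ("text_general", False),
    "description": ("text_general", False),
    "by_statement": ("text_general", False),
    "subjects": ("text_general", True),
}

# Shown on the results page only: stored but not indexed.
DISPLAY_FIELDS = {
    "first_sentence": ("string", False),
    "notes": ("string", False),
    "number_of_pages": ("pint", False),
    "publish_date": ("string", False),
}


def field_definitions():
    """Return one <field .../> line per required field."""
    lines = []
    for fields, indexed in ((SEARCH_FIELDS, True), (DISPLAY_FIELDS, False)):
        for name, (field_type, multi) in fields.items():
            element = ET.Element("field", {
                "name": name,
                "type": field_type,
                "indexed": "true" if indexed else "false",
                "stored": "true",
                "multiValued": "true" if multi else "false",
            })
            lines.append(ET.tostring(element, encoding="unicode"))
    return lines


if __name__ == "__main__":
    print("\n".join(field_definitions()))
```

Attributes such as omitNorms and omitTermFreqAndPositions are left out here on purpose, since choosing them per field is part of the analysis the question asks for.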
Provide your Google Cloud Solr instance password, along with your analysis, so that the inserted document can be verified. Submit a Word or PDF document with your answers for the assignment on Canvas. To obtain full marks, your deliverables should include the corrected schema.xml.