Question
Text Mining
1 Introduction
This project is intended to familiarize students with a search problem. We focus our attention on the Cranfield dataset.
Cranfield is a small, curated dataset that is used extensively in information retrieval experiments. The dataset contains 226 queries (search terms), 1400 documents, and 1837 evaluations (relevance judgments). It is considered complete in the sense that, for each of the 226 queries, we know which documents should be returned and in what order. This makes evaluation easier. A basic implementation for the dataset has already been included and must be used for the assignment.
You can find the dataset at http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
You can find the implementation code at https://github.com/feroshjacob/cranfield/tree/spring2019-assignment-br
2 What are my possible next steps?
Currently we have a basic solution, but we discussed several methods in class for retrieving and scoring documents. These methods fall into two categories, based on where they are applied:
2.1 Text processing methods
We discussed several text processing methods to improve the search, e.g., lemmatization, stemming, stopword removal, including synonyms, normalization, and more sophisticated tokenization methods.
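As a rough illustration of how a few of these steps chain together, here is a minimal Python sketch, written independently of the shared code. The stopword list is a tiny hand-picked sample, and `naive_stem` is a hypothetical stand-in for a real stemmer such as Porter's, so treat both as assumptions rather than recommended choices.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Naive suffix stripping, standing in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters/digits, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("The wings are heating in the flows"))
```

Whatever pipeline is chosen, the same function should be applied to both the queries and the documents; otherwise the processed query terms will not match the indexed terms.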
2.2 Query-document scoring methods
We discussed several IR scoring methods, including the classical TF-IDF, TF, cosine similarity, log TF, the probabilistic model, Okapi BM25, and their variations.
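To make one of these scoring methods concrete, the following is a minimal Python sketch of Okapi BM25, written independently of the shared code. The parameter values `k1=1.2` and `b=0.75` are common defaults used here as assumptions, the IDF uses the variant with `+ 1` inside the logarithm, and the function names and the `docs` layout (document ID mapped to a token list, for a non-empty collection) are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25.

    docs maps a document ID to its list of tokens.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def rank(query_tokens, docs):
    """Return document IDs sorted by descending BM25 score."""
    scores = bm25_scores(query_tokens, docs)
    return sorted(scores, key=lambda doc_id: -scores[doc_id])
```

Setting `b=0` switches off length normalization, and a larger `k1` lets repeated terms keep adding to the score for longer, so both are natural knobs to vary when comparing variations.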
For this assignment, students are expected to try these methods and see how they can improve the NDCG score.
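For reference, NDCG for a single query can be sketched in Python as below. This is only an illustration: the cutoff `k`, the base-2 logarithmic discount, and the mapping from document IDs to gains are assumptions, and the shared EvalSearch class may compute the score differently.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    relevance maps a document ID to a non-negative gain (higher is better);
    documents missing from the map count as gain 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that returns the relevant documents in the ideal order scores 1.0, and pushing relevant documents further down the list lowers the score, which is why both retrieval and ordering matter here.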
3 Should I use the shared source code?
Yes. The shared code has two sections:
- A very basic search implementation (CoreSearchImpl class).
- Evaluation (EvalSearch class) of the search results.
Please make sure your search method always returns a list of unique document IDs, because the evaluation is implemented on the returned document IDs and I would like you to leave it untouched.
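One simple way to satisfy the uniqueness requirement is to de-duplicate the ranked list just before returning it, keeping the first (highest-ranked) occurrence of each ID. The Python sketch below shows the idea; `unique_in_order` is a hypothetical helper, not part of the shared code.

```python
def unique_in_order(doc_ids):
    """Drop repeated document IDs, keeping the first (highest-ranked) one."""
    # dict preserves insertion order, so the ranking is left intact.
    return list(dict.fromkeys(doc_ids))
```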