Question

Posted on Sep 05, 2024

Text Mining

1 Introduction

This project is intended to familiarize students with a search problem. We focus our attention on the Cranfield dataset.

Cranfield is a small, curated dataset that is used extensively in information retrieval experiments. The dataset contains 226 queries (search terms), 1400 documents, and 1837 evaluations. The dataset is considered complete in the sense that, for each of the 226 queries, we know which documents should be returned and in what order. This makes evaluation easier. A basic implementation for the dataset is already included and must be used for the assignment.

You can find the dataset at http://ir.dcs.gla.ac.uk/resources/test_collections/cran/

You can find the implementation code at https://github.com/feroshjacob/cranfield/tree/spring2019-assignment-br

2 What are my possible next steps?

Currently we have a basic solution, but we discussed several methods in class for retrieving and scoring documents. These methods fall into two categories, based on where they are applied:

2.1 Text processing methods

We discussed several text processing methods to improve the search, e.g., lemmatization, stemming, stopword removal, synonym expansion, normalization, and more sophisticated tokenization methods.
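To make a few of these text processing steps concrete, here is a minimal Python sketch covering normalization (lowercasing), simple tokenization, stopword removal, and a very crude suffix-stripping stemmer. The names `STOPWORDS`, `naive_stem`, and `preprocess` are hypothetical and are not part of the shared assignment code; the stopword list is deliberately tiny, and a real experiment would more likely use an established stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

The same `preprocess` function would need to be applied to both queries and documents, otherwise the processed query terms will not match the indexed terms.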

2.2 Query-document scoring methods

We discussed several IR scoring methods, including TF, log TF, the classical TF-IDF, cosine similarity, the probabilistic model, Okapi BM25, and their variations.
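As one illustration of a scoring method from this list, the following Python sketch scores tokenized documents against a tokenized query with Okapi BM25. It is a sketch under stated assumptions, not the shared implementation: the function name `bm25_scores` is hypothetical, the defaults `k1=1.2` and `b=0.75` are common choices rather than values given in the assignment, the IDF uses a smoothed variant that stays non-negative, and the document collection is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document for the given query tokens."""
    n_docs = len(docs_tokens)  # assumed to be > 0
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting document IDs by descending score gives a ranked result list. Varying `k1` and `b`, or swapping the IDF formula, is one way to explore the "variations" mentioned above.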

For this assignment, students are expected to try these methods and see how they can improve the NDCG score.
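For reference, NDCG (normalized discounted cumulative gain) can be sketched in Python as follows. This is only an illustration of the metric: the function names, the cutoff `k=10`, and the use of the raw relevance grade as the gain are assumptions, and the shared evaluation code may compute the score differently (for example, with a different gain function or cutoff).

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    # The item at rank 1 (index 0) is discounted by log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_doc_ids, relevance, k=10):
    """NDCG@k; `relevance` maps doc ID -> graded gain (missing IDs count as 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # The ideal ranking lists the most relevant documents first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that returns the relevant documents in the ideal order scores 1.0; placing relevant documents lower in the list, or omitting them, lowers the score.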

3 Should I use the shared source code?

Yes. The shared code has two sections: a very basic search implementation (the CoreSearchImpl class) and an evaluation of the search results (the EvalSearch class).

Please make sure your search method always returns a list of unique document IDs. The evaluation operates on the returned document IDs, and I would like you to leave the evaluation code untouched.
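One simple way to guarantee uniqueness without disturbing the ranking is to drop repeated IDs while keeping the first occurrence of each. The Python sketch below illustrates the idea; the function name `unique_doc_ids` is hypothetical, and the same logic would need to be expressed in whatever language the shared code uses.

```python
def unique_doc_ids(ranked_doc_ids):
    """Drop repeated IDs, keeping the first (highest-ranked) occurrence of each."""
    # dict preserves insertion order, so the original ranking is retained.
    return list(dict.fromkeys(ranked_doc_ids))
```

For example, `unique_doc_ids([12, 7, 12, 3, 7])` yields `[12, 7, 3]`.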
