Question

Posted on Sep 03, 2024

Text Mining

1 Introduction

This project is intended to familiarize students with a search problem. We focus our attention on the Cranfield dataset.

Cranfield is a small, curated dataset that is used extensively in information retrieval experiments. The dataset contains 226 queries (search terms), 1400 documents, and 1837 evaluations. It is considered complete in the sense that, for each of the 226 queries, we know which documents should be returned and in what order. This makes evaluation easier. A basic implementation for the dataset has already been included and must be used for the assignment.

You can find the dataset at http://ir.dcs.gla.ac.uk/resources/test_collections/cran/

You can find the implementation code at https://github.com/feroshjacob/cranfield/tree/spring2019-assignment-br

2 What are my possible next steps?

Currently we have a basic solution, but we discussed several methods in class for retrieving and scoring documents. These methods fall into two categories, based on where they are applied:

2.1 Text processing methods

We discussed several text processing methods to improve the search, e.g., lemmatization, stemming, stopword removal, synonym expansion, normalization, and more sophisticated tokenization methods.
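As a rough illustration of how a few of these steps chain together, here is a minimal sketch in Python (chosen only for brevity; it is not part of the shared code). The stopword list, the suffix rules, and the function names `tokenize`, `crude_stem`, and `preprocess` are all assumptions made for this example. The stemmer in particular is a toy stand-in, not the Porter algorithm or any method prescribed by the assignment.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "what"}


def tokenize(text):
    """Normalize to lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem whatever remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("What are the effects of heating on wings?"))
```

Whatever pipeline is chosen, the same steps should be applied to both the queries and the documents; otherwise a stemmed query term cannot match its unstemmed form in the index.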

2.2 Query-document scoring methods

We discussed several IR scoring methods, including classical TF-IDF, TF, cosine similarity, log TF, the probabilistic model, Okapi BM25, and their variations.
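To make one of these scoring methods concrete, the following is a hedged sketch of Okapi BM25 in Python. It is not taken from the shared code. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant used here are assumptions; several BM25 variations exist, and the one discussed in class may differ. The sketch also assumes a non-empty collection of already tokenized documents.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [["wing", "heat", "flow"], ["flow", "pressure"], ["boundary", "layer"]]
    print(bm25_scores(["wing", "flow"], docs))
```

Setting `b=0` switches off length normalization, and a very large `k1` makes the score grow almost linearly with raw term frequency, which is one way to compare BM25 against plain TF weighting.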

For this assignment, students are expected to try these methods and see how they can improve the NDCG score.
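For reference, NDCG (normalized discounted cumulative gain) can be sketched as below. This is only an illustration of the metric. The EvalSearch class in the shared code is the authoritative evaluation and may use a different gain mapping, cutoff, or discount. The function names `dcg` and `ndcg`, the cutoff `k=10`, and the assumption that a larger relevance number means a more relevant document are all choices made for this example.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains))


def ndcg(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document IDs in the order the search returned them.
    relevance:  dict mapping document ID to a gain (assumed: higher is better).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    judged = {"d1": 3, "d2": 2, "d3": 1}
    print(ndcg(["d1", "d2", "d3"], judged))  # ideal ordering
    print(ndcg(["d3", "d2", "d1"], judged))  # reversed ordering scores lower
```

Because the discount grows with rank, moving a highly relevant document from position 5 to position 1 raises the score more than adding a marginally relevant document at the bottom of the list.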

3 Should I use the shared source code?

Yes. The shared code has two sections: a very basic search implementation (the CoreSearchImpl class) and the evaluation of the search results (the EvalSearch class).

Please make sure your search method always returns a list of unique document IDs, because the evaluation is implemented on the returned document IDs and I would like you to leave it untouched.
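One simple way to satisfy the uniqueness requirement is to deduplicate the ranked list just before returning it, keeping the first (highest-ranked) occurrence of each ID. The sketch below is in Python and is independent of the shared code; the helper name `unique_in_order` is hypothetical.

```python
def unique_in_order(doc_ids):
    """Drop repeated document IDs while preserving the original ranking order."""
    # dict preserves insertion order (Python 3.7+), so the first occurrence wins.
    return list(dict.fromkeys(doc_ids))


if __name__ == "__main__":
    print(unique_in_order([3, 1, 3, 2, 1]))
```

Deduplicating after ranking, rather than before, ensures that a document matched by several query terms keeps its best position.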
