
Question

Task 1: Design a BM25-based IR model (BM25) that ranks documents in each data collection
using the corresponding topic (query) for all 50 data collections.
Inputs: 50 long queries (topics) in the50Queries.txt and the corresponding 50 data collections
(Data_C101, Data_C102,..., Data_C150).
Output: 50 ranked document files (e.g., for Query R107, the output file name is
BM25_R107Ranking.dat) for all 50 data collections and save them in the folder
RankingOutputs.
For each long query (topic) Q, you need to use the following equation to calculate a score for
each document D in the corresponding data collection (dataset):
\sum_{i \in Q} \log\left(\frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}\right) \cdot \frac{(k_1+1) f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}
where Q is the title of the long query, k1 = 1.2, k2 = 500, b = 0.75, K = k1*((1-b) + b*dl/avdl),
dl is the length of document D, and avdl is the average length of a document in the dataset.
The base of the log function is 10. Note that BM25 values can be negative, and you may need to
update the above equation to produce non-negative values while keeping the resulting documents
in the same rank order.
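The scoring step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the expected solution: it assumes no relevance information is available (so R = r_i = 0, which reduces the log term to log10((N - n_i + 0.5) / (n_i + 0.5))), it uses the length-normalised K in the document-frequency denominator, and it assumes documents and queries are already tokenised into lists of terms. The function names `bm25_scores` and `shift_non_negative` are hypothetical. The shift subtracts the lowest score in the collection from every document, which is one way to remove negative values without changing the rank order.

```python
import math
from collections import Counter

K1, K2, B = 1.2, 500.0, 0.75  # parameter values given in the task


def bm25_scores(query_terms, docs):
    """Score every document in one collection against one query.

    query_terms: list of terms from the query title (assumed pre-tokenised).
    docs: dict mapping doc id -> list of terms (assumed pre-tokenised).
    Assumes R = r_i = 0 (no relevance judgements).
    """
    N = len(docs)
    if N == 0:
        return {}
    total_len = sum(len(terms) for terms in docs.values())
    avdl = (total_len / N) or 1.0  # avoid dividing by zero for empty documents

    qf = Counter(query_terms)  # qf_i: frequency of term i in the query
    df = Counter()             # n_i: number of documents containing term i
    for terms in docs.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)    # f_i: frequency of term i in this document
        K = K1 * ((1 - B) + B * len(terms) / avdl)
        score = 0.0
        for term, qf_i in qf.items():
            f_i = tf.get(term, 0)
            if f_i == 0:
                continue  # a term absent from the document contributes nothing
            n_i = df[term]
            idf = math.log10((N - n_i + 0.5) / (n_i + 0.5))
            score += (idf
                      * (K1 + 1) * f_i / (K + f_i)
                      * (K2 + 1) * qf_i / (K2 + qf_i))
        scores[doc_id] = score
    return scores


def shift_non_negative(scores):
    """Add a constant so the lowest score becomes 0; rank order is unchanged."""
    lowest = min(scores.values(), default=0.0)
    if lowest >= 0:
        return dict(scores)
    return {doc_id: s - lowest for doc_id, s in scores.items()}
```

Because the same constant is added to every document in a collection, the shifted scores are comparable within that collection only, not across collections.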
Formally describe your design for BM25 in an algorithm to rank documents in each data
collection using the corresponding query (topic) for all 50 data collections. When you use the
BM25 score to rank the documents of each data collection, you also need to state what the
query feature function and document feature function are.
Use Python to implement BM25. For each long query, your Python program will produce ranked
results and save them into .dat files. For example, for query R107, you can save the ranked
results of the BM25 model into BM25_R107Ranking.dat, using the following format, where the first
column is the document id (the itemid in the corresponding XML document) and the second
column is the document score (or probability).
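The output step could be sketched as below. This is a hedged illustration rather than a required layout: the helper names `format_ranking` and `write_ranking` are hypothetical, the sketch assumes each file holds one "DocID Weight" pair per line separated by a single space, and it assumes no header line is written inside the per-query file. Ties are broken by document id only to make the output deterministic.

```python
from pathlib import Path


def format_ranking(scores):
    """Return 'DocID Weight' lines sorted by descending score."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return [f"{doc_id} {score}" for doc_id, score in ranked]


def write_ranking(scores, query_id, out_dir="RankingOutputs"):
    """Write one query's ranking to BM25_R<query_id>Ranking.dat in out_dir."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    path = folder / f"BM25_R{query_id}Ranking.dat"
    path.write_text("\n".join(format_ranking(scores)) + "\n", encoding="utf-8")
    return path
```

For example, `write_ranking(scores, 107)` would create RankingOutputs/BM25_R107Ranking.dat from a dict of document ids to scores.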
OUTPUT for BM25 Model
Query101(DocID Weight):
...
...
Query107(DocID Weight):
51576 2.765639300454872
71157 2.386400256599359
77936 2.273108908141271
79950 2.144090202267175
59244 1.8983172340327779
67107 1.6074284655326316
67411 1.6074284655326316
86459 1.128090008008347
31404 0.5807469321665876
41791 0.56351553901341
69377 0.5625603334593126
37330 0.5584756327943026
18297 0.5497333978620387
18529 0.548367073995152
78164 0.5467090324257456
...
Query109(DocID Weight):
16953 1.7689441459684538
26073 1.64507953070341
61540 1.6346667531459578
4933 1.4873010135488298
64476 1.4002858970645484
16575 1.3834259539490046
