Question:

1. (a) Two documents D1 and D2 have the following forms:

D1: Leaves on railway tracks were causing train cancellations and train delays in the north of England
D2: Delays on Northern Rail trains peaked in October, with an average delay of an hour

After stop-word removal and stemming these become:

d1: leave rail track cause train cancel train delay north england
d2: delay north rail train peak october average delay hour

The IDFs of the words that occur in these documents are given in Table 1:

Term (t)    IDF(t)
average     0.2
cancel      0.6
cause       0.3
delay       0.8
england     0.4
hour        1.1
leave       1.8
north       0.8
october     0.5
peak        0.6
rail        2.2
track       1.5
train       0.9

Table 1

(i) Calculate the TF-IDF similarity sim(d1, d2) between d1 and d2. You must show all of your calculations. [4]

(ii) Assuming that the vocabulary in the table above is the complete vocabulary and is ordered according to the table, write down the document vectors vec(d1) and vec(d2). [2]

(iii) Suppose that the term "delay" is repeated N times in d1. Write down a formula for the angle θ_N between vec(d1) and vec(d2) as a function of N. [5]

(b) Explain what WordNet is and how it can be employed in a text retrieval system. Give examples to illustrate your answer. [4]

(c) A corpus consists of C documents, which between them contain a total of V different terms. Explain how you would analyse this corpus to obtain a set of conceptual topics. Discuss the advantages of using this approach for text retrieval as opposed to retrieval based on TF-IDF. [5]
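
The arithmetic behind the vector and similarity sub-parts of part (a) can be checked with a short script. This is only a sketch under stated assumptions, since the question does not define the weighting or the similarity measure: it assumes each weight is the raw term count multiplied by the IDF from Table 1, and that the similarity is the cosine of the angle between the two vectors. It also assumes that "repeated N times" means "delay" occurs N times in total in d1. The names `tfidf_vector`, `cosine` and `angle_with_n_delays` are hypothetical helpers introduced for illustration.

```python
import math

# IDF values copied from Table 1, in table order.
IDF = {
    "average": 0.2, "cancel": 0.6, "cause": 0.3, "delay": 0.8,
    "england": 0.4, "hour": 1.1, "leave": 1.8, "north": 0.8,
    "october": 0.5, "peak": 0.6, "rail": 2.2, "track": 1.5, "train": 0.9,
}
VOCAB = list(IDF)  # dicts keep insertion order, so this matches the table order

# The stemmed, stop-word-free documents from the question.
D1 = "leave rail track cause train cancel train delay north england".split()
D2 = "delay north rail train peak october average delay hour".split()


def tfidf_vector(tokens):
    """Assumed weighting: raw term count times IDF, one entry per vocabulary term."""
    return [tokens.count(term) * IDF[term] for term in VOCAB]


def cosine(u, v):
    """Assumed similarity: cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_with_n_delays(n):
    """Angle in radians between vec(d1) and vec(d2) when 'delay' occurs n times in d1."""
    d1_tokens = [t for t in D1 if t != "delay"] + ["delay"] * n
    return math.acos(cosine(tfidf_vector(d1_tokens), tfidf_vector(D2)))


v1, v2 = tfidf_vector(D1), tfidf_vector(D2)
similarity = cosine(v1, v2)

print([round(x, 2) for x in v1])  # vec(d1) in table order
print([round(x, 2) for x in v2])  # vec(d2) in table order
print(round(similarity, 3))
```

Under these assumptions the shared terms are delay, north, rail and train, the dot product is 1.28 + 0.64 + 4.84 + 1.62 = 8.38, the squared norms are 15.46 and 10.71, and the similarity is about 0.65. With "delay" occurring N times in d1, the same calculation gives cos θ_N = (7.10 + 1.28N) / sqrt((14.82 + 0.64N²) × 10.71). A different TF scheme (for example log-scaled or length-normalised counts) would change these numbers.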
