Question

From the book Mining of Massive Datasets

Exercise 3.9.2 : Suppose we filter candidate pairs based only on length, as in

Section 3.9.3. If s is a string of length 20, with what strings is s compared when

J, the lower bound on Jaccard similarity, has the following values: (a) J = 0.85

(b) J = 0.95 (c) J = 0.98?

============================================================

3.9.3 Length-Based Filtering

The simplest way to exploit the string representation of Section 3.9.2 is to sort

the strings by length. Then, each string s is compared with those strings t that

follow s in the list, but are not too long. Suppose the lower bound on Jaccard

similarity between two strings is J. For any string x, denote its length by $L_x$.

Note that $L_s \le L_t$. The intersection of the sets represented by s and t cannot

have more than $L_s$ members, while their union has at least $L_t$ members. Thus,

the Jaccard similarity of s and t, which we denote SIM(s, t), is at most $L_s/L_t$.

That is, in order for s and t to require comparison, it must be that $J \le L_s/L_t$,

or equivalently, $L_t \le L_s/J$.
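
The bound above can be turned into a small calculation. The following is a minimal sketch, not taken from the book: the function name max_partner_length is hypothetical, and J is passed as a decimal string so that exact rational arithmetic avoids floating-point rounding at the boundary. Since lengths are integers, the largest length worth comparing is the floor of the length of s divided by J.

```python
from fractions import Fraction
import math


def max_partner_length(length_s: int, j: str) -> int:
    """Largest integer L_t satisfying L_t <= L_s / J.

    `j` is a decimal string such as "0.9" (an assumption of this sketch),
    converted to an exact fraction before dividing.
    """
    bound = Fraction(length_s) / Fraction(j)
    return math.floor(bound)


# Applied to a string of length 20, as in Exercise 3.9.2:
for j in ("0.85", "0.95", "0.98"):
    print(j, max_partner_length(20, j))
# 0.85 23
# 0.95 21
# 0.98 20
```

Under this sketch, a string s of length 20 is compared with following strings of length 20 through 23 when J = 0.85, of length 20 or 21 when J = 0.95, and only of length 20 when J = 0.98.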

Example 3.25 : Suppose that s is a string of length 9, and we are looking for

strings with at least 0.9 Jaccard similarity. Then we have only to compare s

with strings following it in the length-based sorted order that have length at

most 9/0.9 = 10. That is, we compare s with those strings of length 9 that

follow it in order, and all strings of length 10. We have no need to compare s

with any other string.

Suppose the length of s were 8 instead. Then s would be compared with

following strings of length up to 8/0.9 = 8.89. That is, a string of length 9

would be too long to have a Jaccard similarity of 0.9 with s, so we only have to

compare s with the strings that have length 8 but follow it in the sorted order.
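
The procedure in this section can be sketched as a generator of candidate pairs. This is an illustrative sketch under stated assumptions, not the book's code: the name length_filtered_pairs is hypothetical, each string is assumed to represent a set (no repeated symbols, so string length equals set size), and J is again given as a decimal string. Only the length filter is applied; surviving pairs would still need their Jaccard similarity checked.

```python
from fractions import Fraction


def length_filtered_pairs(strings, j):
    """Yield pairs (s, t) that survive the length-based filter.

    Strings are sorted by length; each s is paired only with strings t
    that follow it in that order and satisfy len(t) <= len(s) / J.
    """
    threshold = Fraction(j)
    ordered = sorted(strings, key=len)
    for i, s in enumerate(ordered):
        for t in ordered[i + 1:]:
            # len(t) * J > len(s) means L_t > L_s / J: t is too long,
            # and so is every later string, because the list is sorted.
            if len(t) * threshold > len(s):
                break
            yield s, t


# Lengths 8, 9, 9, 10 with J = 0.9, mirroring Example 3.25.
data = ["abcdefgh", "abcdefghi", "abcdefghj", "abcdefghij"]
for s, t in length_filtered_pairs(data, "0.9"):
    print(len(s), len(t))
# 9 9
# 9 10
# 9 10
```

The string of length 8 produces no candidates here, because the next string in the sorted order already has length 9, which exceeds 8/0.9.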
