Question

1 Approved Answer

Posted on Sep 21, 2024

Write a Java class for Document Similarity (instructions and example below)


Given two documents, how similar are they? For example, Chrome's similar pages plug-in finds web pages that are similar to the page that you are currently browsing. The notion of document/set similarity has quite a few applications in text/image processing, recommendation systems, etc. We will use set similarity to define two notions of document similarity. This is done by converting each document into a String and then associating a multi-set with each string.

Let D1 and D2 be two documents whose similarity we wish to estimate. We convert both D1 and D2 into strings by eliminating white space/tab characters and the punctuation symbols period, comma, colon and semi-colon, and by converting each character into lower case. Now each document can be viewed as a (perhaps very long) string. We now define a notion of k-shingles of a string.
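The conversion of a document into a string, as described above, could be sketched roughly as follows. The class name `DocumentNormalizer` and the method name `normalize` are illustrative assumptions, not names given in the assignment. The sketch also assumes that "white space" covers every character for which `Character.isWhitespace` returns true (spaces, tabs, line breaks).

```java
/** Hypothetical helper: class and method names are assumptions, not part of the assignment. */
final class DocumentNormalizer {

    private DocumentNormalizer() {}

    /**
     * Drops whitespace (assumed to include tabs and line breaks) and the four
     * punctuation symbols named in the text (period, comma, colon, semi-colon),
     * then converts every remaining character to lower case.
     */
    static String normalize(String document) {
        StringBuilder sb = new StringBuilder(document.length());
        for (int i = 0; i < document.length(); i++) {
            char c = document.charAt(i);
            boolean skipped = Character.isWhitespace(c)
                    || c == '.' || c == ',' || c == ':' || c == ';';
            if (!skipped) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two example documents from the question.
        System.out.println(normalize("A rose is a rose is a rose"));
        System.out.println(normalize("A rose is a flower, which is a rose"));
    }
}
```

Any other punctuation (for example question marks or quotes) is left in the string, because the text only lists four symbols to remove.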


A k-shingle of a string is a substring of length k. Let S1 be the multi-set of all k-shingles of D1 and S2 be the multi-set of all k-shingles of D2. Now

Similarity_k(D1, D2) = Similarity(S1, S2)

Note that the above value depends on the value of k.

Another way to define similarity is by considering the hashCodes of the elements of S1 and S2. Let HS1 be the multi-set of all hashCodes of strings from the multi-set S1, and let HS2 be the multi-set of all hashCodes of strings from the multi-set S2. Then

HashSimilarity_k(D1, D2) = Similarity(HS1, HS2)

Here is an example. Suppose you have two documents. Say that the contents of D1 are

A rose is a rose is a rose

The contents of D2 are

A rose is a flower, which is a rose

Then the string corresponding to the first document is (by ignoring case and removing white space, period and comma)

aroseisaroseisarose

The string corresponding to the second document is

aroseisaflowerwhichisarose
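As a rough sketch of the two multi-sets described above, the k-shingles of a normalized string and their hashCodes can each be kept in a map from element to occurrence count. The class name `ShingleMultisets` and both method names are assumptions made for illustration. The set-similarity function itself is not reproduced in the text above, so this sketch stops at building the multi-sets that would be passed to it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: names are assumptions, not part of the assignment. */
final class ShingleMultisets {

    private ShingleMultisets() {}

    /** Multi-set of all k-shingles of s, stored as shingle -> number of occurrences. */
    static Map<String, Integer> shingles(String s, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        // Every substring of length k, starting at each valid index.
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Multi-set of the hashCodes of all k-shingles of s, stored as hashCode -> count. */
    static Map<Integer, Integer> shingleHashes(String s, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> e : shingles(s, k).entrySet()) {
            // Distinct shingles that share a hashCode have their counts added together.
            counts.merge(e.getKey().hashCode(), e.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String d1 = "aroseisaroseisarose"; // first example string from the question
        Map<String, Integer> s1 = shingles(d1, 4);
        System.out.println(s1.get("rose"));  // prints 3
        System.out.println(s1.get("seis"));  // prints 2
        System.out.println(shingleHashes(d1, 4).get("rose".hashCode()));
    }
}
```

For the first example string with k = 4 there are 16 shingles in total, 7 of them distinct. Storing counts rather than a plain `Set` keeps the multiplicity that the multi-set definition relies on.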