
Question


Write a Java class for Document Similarity (instructions and example below)

Given two documents, how similar are they? For example, Chrome's similar pages plug-in finds web pages that are similar to the page you are currently browsing. The notion of document/set similarity has quite a few applications in text and image processing, recommendation systems, and so on. We will use set similarity to define two notions of document similarity. This is done by converting each document into a String and then associating a multi-set with each string. Let D1 and D2 be two documents whose similarity we wish to estimate. We convert both D1 and D2 into strings by eliminating whitespace (including tab characters) and the punctuation symbols period, comma, colon and semicolon, and by converting each character to lower case. Now each document can be viewed as a (perhaps very long) string. We now define the notion of k-shingles of a string.
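The conversion step described above can be sketched as follows. This is a minimal illustration, not the required solution: the class name `DocumentNormalizer` and the method name `normalize` are hypothetical, and the sketch assumes that "white space" covers every character for which `Character.isWhitespace` returns true (spaces, tabs, newlines).

```java
final class DocumentNormalizer {

    // Drops whitespace (spaces, tabs, newlines) and the four punctuation
    // symbols named in the text (period, comma, colon, semicolon), then
    // lower-cases every remaining character.
    // Assumption: any other character (digits, quotes, hyphens) is kept.
    static String normalize(String document) {
        StringBuilder sb = new StringBuilder(document.length());
        for (int i = 0; i < document.length(); i++) {
            char c = document.charAt(i);
            if (Character.isWhitespace(c) || c == '.' || c == ',' || c == ':' || c == ';') {
                continue;
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "aroseisaroseisarose"
        System.out.println(normalize("A rose is a rose is a rose."));
    }
}
```

A single pass with a `StringBuilder` avoids building intermediate strings, which matters when a document is very long.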


A k-shingle of a string is a substring of length k. Let $S_1^k$ be the multi-set of all k-shingles of D1 and $S_2^k$ be the multi-set of all k-shingles of D2. Now

$$\mathrm{Similarity}_k(D1, D2) = \mathrm{Similarity}(S_1^k, S_2^k)$$

Note that the above value depends on the value of k.

Another way to define similarity is by considering hashCodes of elements of $S_1^k$ and $S_2^k$. Let $HS_1^k$ be the multi-set of all hashCodes of strings from the multi-set $S_1^k$, and let $HS_2^k$ be the multi-set of all hashCodes of strings from the multi-set $S_2^k$.

$$\mathrm{HashSimilarity}_k(D1, D2) = \mathrm{Similarity}(HS_1^k, HS_2^k)$$

Here is an example. Suppose you have two documents. Say that the contents of D1 are

A rose is a rose is a rose

The contents of D2 are

A rose is a flower, which is a rose

Then the string corresponding to the first document (ignoring case and removing white space, period and comma) is

aroseisaroseisarose

The string corresponding to the second document is

aroseisaflowerwhichisarose
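The two multi-sets can be built as count maps, as in the rough sketch below. All names (`ShingleSimilarity`, `shingles`, `hashedShingles`, `similarity`) are hypothetical. Note in particular that the text above does not spell out how Similarity is computed on two multi-sets, so the `similarity` method assumes cosine similarity over occurrence counts purely for illustration; swap in the assignment's own definition if it differs.

```java
import java.util.HashMap;
import java.util.Map;

final class ShingleSimilarity {

    // Multi-set of all k-shingles of s, stored as shingle -> occurrence count.
    static Map<String, Integer> shingles(String s, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Multi-set of the hashCodes of all k-shingles of s.
    // Shingles whose hashCodes collide pool their counts.
    static Map<Integer, Integer> hashedShingles(String s, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> e : shingles(s, k).entrySet()) {
            counts.merge(e.getKey().hashCode(), e.getValue(), Integer::sum);
        }
        return counts;
    }

    // ASSUMPTION: cosine similarity of the two count vectors. The source
    // text does not define Similarity on multi-sets; this is a stand-in.
    static <T> double similarity(Map<T, Integer> a, Map<T, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<T, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty multi-set is treated as similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        String d1 = "aroseisaroseisarose";
        String d2 = "aroseisaflowerwhichisarose";
        int k = 4; // arbitrary choice; the result depends on k
        System.out.println(similarity(shingles(d1, k), shingles(d2, k)));
        System.out.println(similarity(hashedShingles(d1, k), hashedShingles(d2, k)));
    }
}
```

Storing each multi-set as a map from element to count keeps memory proportional to the number of distinct shingles rather than to the document length. The hashed variant replaces string keys with `int` keys, at the cost that two distinct shingles with equal hashCodes become indistinguishable.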
