Question

Exercise 3.25 (Document-Term Matrices): Suppose we have an m × n document-term matrix A where each row corresponds to a document and has been normalized to length one. Define the "similarity" between two such documents by their dot product.

1. Consider a "synthetic" document whose sum of squared similarities with all documents in the matrix is as high as possible. What is this synthetic document and how would you find it?

2. How does the synthetic document in (1) differ from the center of gravity?

3. Building on (1), given a positive integer k, find a set of k synthetic documents such that the sum of squares of the mk similarities between each document in the matrix and each synthetic document is maximized. To avoid the trivial solution of selecting k copies of the document in (1), require the k synthetic documents to be orthogonal to each other. Relate these synthetic documents to singular vectors.

4. Suppose that the documents can be partitioned into k subsets (often called clusters), where documents in the same cluster are similar and documents in different clusters are not very similar. Consider the computational problem of isolating the clusters. This is a hard problem in general. But assume that the terms can also be partitioned into k clusters so that for i ≠ j, no term in the ith cluster occurs in a document in the jth cluster. If we knew the clusters and arranged the rows and columns in them to be contiguous, then the matrix would be a block-diagonal matrix. Of course the clusters are not known. By a "block" of the document-term matrix, we mean a submatrix with rows corresponding to the ith cluster of documents and columns corresponding to the ith cluster of terms. We can also partition any n-vector into blocks. Show that any right-singular vector of the matrix must have the property that each of its blocks is a right-singular vector of the corresponding block of the document-term matrix.

5. Suppose now that the k singular values are all distinct. Show how to solve the clustering problem.

Hint for (4): Use the fact that the right-singular vectors must be eigenvectors of A^T A. Show that A^T A is also block-diagonal and use properties of eigenvectors.
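
The quantities in the exercise can be checked numerically. The following is a minimal sketch with NumPy, not a worked solution: the data is synthetic, and every name in it (`score`, `best`, `cent`, `topk`, `term_labels`, `doc_labels`, `cross`) is made up for illustration. It relies on the standard fact that the unit vector maximizing the sum of squared dot products with the rows of A is the first right-singular vector. The clustering step additionally assumes that each cluster contributes exactly one of the top k singular values, which holds for this data because every block is entrywise positive with fairly similar rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: k clusters, each with its own documents and private terms.
k, docs_per, terms_per = 3, 5, 4
A = np.zeros((k * docs_per, k * terms_per))
for c in range(k):
    r = slice(c * docs_per, (c + 1) * docs_per)
    t = slice(c * terms_per, (c + 1) * terms_per)
    A[r, t] = rng.uniform(0.1, 1.0, size=(docs_per, terms_per))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # rows normalized to length one

# Hide the block structure by shuffling rows and columns.
A = A[rng.permutation(A.shape[0])][:, rng.permutation(A.shape[1])]

def score(A, X):
    """Sum of squared similarities between the rows of A and the columns of X."""
    return float(np.sum((A @ X) ** 2))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Parts 1-2: first right-singular vector versus the normalized center of gravity.
v1 = Vt[0]
centroid = A.mean(axis=0)
centroid /= np.linalg.norm(centroid)
best = score(A, v1[:, None])        # matches s[0] ** 2
cent = score(A, centroid[:, None])  # never exceeds best

# Part 3: k orthonormal synthetic documents, taken as the top k right-singular vectors.
topk = score(A, Vt[:k].T)           # matches the sum of s[:k] ** 2

# Parts 4-5 (under the assumption stated above): each top right-singular vector
# is supported on one term cluster, so its large entries identify that cluster.
term_labels = np.argmax(np.abs(Vt[:k]), axis=0)
doc_labels = np.argmax(np.abs(A @ Vt[:k].T), axis=1)

# Entries of A pairing a document with a term from a different recovered cluster.
cross = A[doc_labels[:, None] != term_labels[None, :]]

print(best, cent, topk)
print(doc_labels, term_labels, np.abs(cross).max())
```

If the recovered labels are right, every entry in `cross` is zero, which is the block-diagonal structure described in part (4) expressed in terms of the recovered clusters rather than the hidden ones.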

