Question
Exercise 3.25 (Document-Term Matrices): Suppose we have an m × n document-term matrix A where each row corresponds to a document and has been normalized to length one. Define the "similarity" between two such documents by their dot product.

1. Consider a "synthetic" document whose sum of squared similarities with all documents in the matrix is as high as possible. What is this synthetic document and how would you find it?

2. How does the synthetic document in (1) differ from the center of gravity?

3. Building on (1), given a positive integer k, find a set of k synthetic documents such that the sum of squares of the mk similarities between each document in the matrix and each synthetic document is maximized. To avoid the trivial solution of selecting k copies of the document in (1), require the k synthetic documents to be orthogonal to each other. Relate these synthetic documents to singular vectors.

4. Suppose that the documents can be partitioned into k subsets (often called clusters), where documents in the same cluster are similar and documents in different clusters are not very similar. Consider the computational problem of isolating the clusters. This is a hard problem in general. But assume that the terms can also be partitioned into k clusters so that for i ≠ j, no term in the ith cluster occurs in a document in the jth cluster. If we knew the clusters and arranged the rows and columns in them to be contiguous, then the matrix would be a block-diagonal matrix. Of course the clusters are not known. By a "block" of the document-term matrix, we mean a submatrix with rows corresponding to the ith cluster of documents and columns corresponding to the ith cluster of terms. We can also partition any n-vector into blocks. Show that any right-singular vector of the matrix must have the property that each of its blocks is a right-singular vector of the corresponding block of the document-term matrix.

5. Suppose now that the k singular values are all distinct. Show how to solve the clustering problem.

Hint for (4): Use the fact that the right-singular vectors must be eigenvectors of A^T A. Show that A^T A is also block-diagonal and use properties of eigenvectors.
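To make the quantities in parts (1) and (2) concrete, here is a minimal Python sketch rather than a worked solution. It assumes a made-up 4 × 3 toy matrix, and the helper names (normalize, top_right_singular_vector, sum_sq_similarity) are hypothetical. Power iteration on A^T A is one common way to approximate the top right-singular vector; the exercise itself does not prescribe it. The sketch compares the sum of squared similarities achieved by that vector with the value achieved by the normalized center of gravity.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sum_sq_similarity(A, v):
    """Sum over the rows a of A of (a . v)^2, which equals |A v|^2."""
    return sum(sum(a * x for a, x in zip(row, v)) ** 2 for row in A)


def top_right_singular_vector(A, iters=500):
    """Approximate the top right-singular vector by power iteration on A^T A.

    Assumes the all-ones starting vector is not orthogonal to the answer
    and that the largest singular value is strictly larger than the next.
    """
    m, n = len(A), len(A[0])
    v = normalize([1.0] * n)
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]          # A v
        AtAv = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]  # A^T (A v)
        v = normalize(AtAv)
    return v


# Toy data (illustrative only): three documents share terms 0 and 1,
# one document uses only term 2. Rows are normalized to length one.
docs = [[3, 1, 0], [2, 2, 0], [1, 3, 0], [0, 0, 1]]
A = [normalize(row) for row in docs]

v1 = top_right_singular_vector(A)

# Center of gravity of the rows, rescaled to unit length for a fair comparison.
centroid = normalize([sum(col) / len(A) for col in zip(*A)])

score_v1 = sum_sq_similarity(A, v1)
score_centroid = sum_sq_similarity(A, centroid)
print("top singular vector:", v1, "score:", score_v1)
print("normalized centroid:", centroid, "score:", score_centroid)
```

On this toy matrix the centroid puts weight on all three terms, while the power-iteration vector concentrates on the two terms shared by the larger group of documents, which is the kind of difference part (2) asks about.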