I have a document-term matrix and I performed SVD on it. How can I cluster terms based on the singular values?
Is there any relationship between SVD and factor analysis?
You have a document-term matrix; let's call it $U$, with documents as rows and terms as columns. Let the number of documents be $n$ and the number of terms be $k$, so $U$ is $n \times k$.
Now, it's natural to think that each document is made up of certain (hidden) broad topics, and that each (hidden) topic is itself associated with a group of terms. Suppose further that we want to represent documents not as a mere collection of terms, but as a meaningful collection of topics, with the topics themselves expressed as collections of terms.
These (hidden) topics are referred to as factors, and the idea is that similar documents will have similar topics associated with them.
We want some way to represent documents in the 'topic-space'. We also want to represent topics in the 'term-space'. This is achieved by Singular Value Decomposition.
Mathematically, we want to 'break' the document-term matrix $U$ into two parts: a matrix $P$ containing the document-topic interactions and a matrix $Q$ containing the topic-term interactions.
$$ \Large U \approx PQ^T$$
This is what Singular Value Decomposition (henceforth SVD) does. Applying SVD to $U$ yields three matrices: $S$ and $V$, whose columns are the left and right singular vectors, and the diagonal matrix $\Sigma$ of singular values:
$$ \Large U = S\Sigma V^T$$
Now if you assign the values to $P$ and $Q$ matrices as follows
$$ \Large S \rightarrow P$$
$$ \Large \Sigma V^T \rightarrow Q^T$$
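We see that $$ \Large PQ^T = S\Sigma V^T$$
With all singular values kept this reproduces $U$ exactly. The approximation $U \approx PQ^T$ arises when you keep only the largest few singular values, together with the corresponding columns of $S$ and $V$; the number you keep is the number of topics.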
Hence, we can use SVD to find matrices $P$ and $Q$ that can approximately reconstruct $U$
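To make the assignment of $P$ and $Q$ concrete, here is a minimal NumPy sketch. The toy matrix and the choice of `r = 2` topics are assumptions made purely for illustration; the variable names simply mirror the symbols used above.

```python
import numpy as np

# Hypothetical toy document-term matrix: 4 documents (rows) x 5 terms (columns).
U = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Thin SVD: U = S @ diag(sigma) @ Vt, singular values sorted in descending order.
S, sigma, Vt = np.linalg.svd(U, full_matrices=False)

r = 2  # number of topics to keep (an arbitrary choice for this example)

P = S[:, :r]                         # documents x topics
Q = (np.diag(sigma[:r]) @ Vt[:r]).T  # terms x topics, i.e. Q^T = Sigma V^T

U_approx = P @ Q.T

# Frobenius-norm reconstruction errors: full SVD vs. rank-r truncation.
full_error = np.linalg.norm(U - S @ np.diag(sigma) @ Vt)
trunc_error = np.linalg.norm(U - U_approx)
print("full SVD error:", full_error)
print("rank-%d error:" % r, trunc_error)
```

The full decomposition reconstructs $U$ up to floating-point error, while the truncated one loses exactly the part carried by the discarded singular values.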
Now, let's look at matrix $P$ that contains the document - topic interactions. Each row of $P$ refers to a document, and the column values are the weights that each topic contributes to the document.
As an example, a sensational bank robbery article may have high weights for topics such as 'current-affairs', 'crime', 'money' whereas a financial report on inflation and its effects might contain the topics 'economy', 'geopolitics', 'money'. Do note that a topic can belong to more than one document, and each document can contain more than one topic.
Now to answer your two questions:
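The matrix $P$ represents each document as a weighted combination of topics. These topics are unknown and play the role of the (latent or hidden) factors, which can be used for further analysis. In that sense the truncated SVD resembles factor analysis: both describe the observed data through a small number of latent factors.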
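You can use the matrix $P$ and run any clustering algorithm (KMeans, GMM etc.) to find clusters of similar documents. To cluster terms, which is what you asked about, do the same with the rows of $Q$: each row of $Q$ represents a term in the topic-space. Note that the clustering is done on these vectors, not on the singular values themselves; the singular values only determine how much weight each topic carries.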
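As a rough sketch of that last step, the snippet below clusters both documents (rows of $P$) and terms (rows of $Q$). Everything specific here is an assumption for illustration: the toy matrix is built with two obvious blocks, `r = 2` topics and `k = 2` clusters are picked by hand, and the small `kmeans` helper is a hypothetical stand-in written with NumPy only. In practice you would more likely use a library implementation such as scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's k-means with farthest-point initialisation (illustrative helper)."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest already-chosen centre.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical toy data: documents 0-1 use only terms 0-2, documents 2-4 only terms 3-5.
U = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

S, sigma, Vt = np.linalg.svd(U, full_matrices=False)
r = 2                                # number of topics kept (assumed)
P = S[:, :r]                         # documents in topic-space
Q = (np.diag(sigma[:r]) @ Vt[:r]).T  # terms in topic-space

doc_labels = kmeans(P, k=2)
term_labels = kmeans(Q, k=2)
print("document clusters:", doc_labels)
print("term clusters:    ", term_labels)
```

On this deliberately clean example the two term groups and the two document groups are recovered; real document-term matrices are noisier, so the number of topics and clusters usually needs some experimentation.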