
I am wondering why, in this work https://github.com/facebookresearch/dlrm, the authors compute the dot product of the embeddings with their transpose. Here is the relevant sentence from their paper, in the last paragraph of page 3:

We will compute second-order interaction of different features explicitly, following the intuition for handling sparse data provided in FMs (factorization machine), optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features.

Here is a picture from the architecture of their model in their github repository:

[Image: DLRM model architecture diagram]

I cannot understand the intuition behind the dot product. Why does the dot product compute a second-order interaction? I studied the factorization machine method but could not grasp the intuition. Can anyone point me to some sources to study and understand, or clear it up for me?
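To make the operation concrete, here is a minimal sketch (not the actual DLRM code; the embedding values are made up) of the second-order interaction step: given the embedding vectors of the sparse features plus the processed dense-feature vector, take the dot product of every distinct pair.

```python
def dot(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def pairwise_interactions(vectors):
    """Second-order interactions: dot products of all distinct pairs."""
    n = len(vectors)
    return [dot(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

# Three toy "embedding" vectors of dimension 2 (hypothetical values):
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(pairwise_interactions(emb))  # [0.0, 1.0, 1.0]
```

Each output scalar measures how much one pair of features co-occurs in the learned embedding space, which is why the result is called a second-order (pairwise) interaction.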

  • See https://stats.stackexchange.com/questions/22501/is-there-an-intuitive-interpretation-of-ata-for-a-data-matrix-a/22520#22520 – Tim Jul 27 '23 at 09:27

1 Answer


I wonder if it is a case of saying 'covariance' in a different language.

One could consider two random variables $X$ and $Y$; the two vectors are then realizations of these variables, which can be written as sets of measurements $\{X_k\}_{k=1\dots N}$, $\{Y_k\}_{k=1\dots N}$, or as vectors $\mathbf{X}=\left(X_1,\,\dots,\,X_N\right)^T$, $\mathbf{Y}=\left(Y_1,\,\dots,\,Y_N\right)^T$.

The dot product of the two vectors is then related to the sample covariance $S^2_{XY}$:

$$ \begin{align} \mathbf{X}^T\mathbf{Y}&=\sum_{k=1}^N X_k Y_k=\left(N-1\right)\cdot\frac{1}{N-1}\sum_{k=1}^N \left(X_k-\bar{X}\right) \left(Y_k-\bar{Y}\right)+N\cdot\bar{X}\cdot\bar{Y} \\ &=\left(N-1\right)\cdot S^2_{XY}+N\cdot\bar{X}\cdot\bar{Y} \end{align} $$

where $\bar{X}$ and $\bar{Y}$ are the sample means ($\bar{X}=\frac{1}{N}\sum_{k=1}^N X_k$, and similarly for $\bar{Y}$).
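A quick numeric check of this identity, with toy numbers chosen only for illustration:

```python
# Verify: sum_k X_k Y_k = (N-1) * S_XY + N * Xbar * Ybar
X = [1.0, 2.0, 4.0]
Y = [3.0, 5.0, 10.0]
N = len(X)

xbar = sum(X) / N
ybar = sum(Y) / N
# Sample covariance with the (N-1) denominator:
S_xy = sum((a - xbar) * (b - ybar) for a, b in zip(X, Y)) / (N - 1)

lhs = sum(a * b for a, b in zip(X, Y))          # dot product
rhs = (N - 1) * S_xy + N * xbar * ybar
assert abs(lhs - rhs) < 1e-9
```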

If the samples are conditioned so that the sample means are zero (e.g. by z-scaling), then the dot product is proportional to the sample covariance, which is an estimator of the covariance between the two variables.
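This special case is easy to demonstrate as well (again with made-up numbers): after centering both vectors, the plain dot product equals $(N-1)$ times the sample covariance.

```python
def mean(v):
    return sum(v) / len(v)

def sample_cov(x, y):
    # Sample covariance with the (N-1) denominator.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

# Centre both vectors so their sample means are zero:
xc = [a - mean(x) for a in x]
yc = [b - mean(y) for b in y]
dot_centered = sum(a * b for a, b in zip(xc, yc))

assert abs(dot_centered - (len(x) - 1) * sample_cov(x, y)) < 1e-12
```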

If $X$ and $Y$ are statistically independent, then their covariance is zero. Furthermore, if the two variables come from suitable distributions, e.g. if they are jointly Normal, then zero covariance actually implies statistical independence.

Cryo