
I am reading the Performer paper https://arxiv.org/abs/2009.14794. To understand their ReLU kernel used to approximate softmax attention, I need to evaluate $\mathbb{E}[\text{ReLU}(x^T w) \cdot \text{ReLU}(y^T w)]$ where $w \sim \mathcal{N}(0, I)$. I was able to reduce the problem to the more general one of computing $\mathbb{E}[|X \cdot Y|]$ where $X, Y$ are correlated jointly Gaussian. However, now I am stuck. Is there a way to compute this expression, or to bound it from above reasonably tightly, without numerical simulation? The Cauchy-Schwarz bound seems too loose.

  • Because $\text{ReLU}(z) = 0$ if $z < 0$, then you could write (using the law of total expectation) $$\mathbb{E}[\text{ReLU}(x^T w) \cdot \text{ReLU}(y^T w)] = \Pr(x^Tw \geq 0, y^Tw \geq 0) \cdot E[x^Tw \cdot y^T w \mid x^Tw \geq 0, y^Tw \geq 0]$$ Would this be easier to evaluate? I think the answer may be yes, since $w$ consists of i.i.d. Gaussians, and so $x^Tw$ and $y^T w$ are both Gaussian. – mhdadk Apr 26 '23 at 21:35
  • I did think about this approach as well, but the conditional expectation seems really hard to evaluate. In particular, you can simplify this to $Trace(y x^T\mathbb{E}[ww^T | x^T w \geq 0, y^T w \geq 0])$. Intuitively for the dimension of $w$ large enough the expectation should yield about the identity, but how to make this exact? – N. Menet Apr 27 '23 at 01:22

1 Answer


Probably not the best solution, but a direction that may be useful.

Suppose \begin{align} \begin{bmatrix} X \\ Y \end{bmatrix} \sim N_2\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}\right). \end{align}
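In the setting of the original question, this covers $X = w^T x$ and $Y = w^T y$ with $w \sim \mathcal{N}(0, I)$: both are centered Gaussians with parameters \begin{align} \mu_1 = \mu_2 = 0, \qquad \sigma_1 = \|x\|, \qquad \sigma_2 = \|y\|, \qquad \rho = \frac{x^T y}{\|x\|\,\|y\|}, \end{align} since $\operatorname{Cov}(w^T x, w^T y) = x^T \mathbb{E}[w w^T] y = x^T y$.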

It is well known that the conditional distribution of $X$ given $Y = y$ is univariate normal: \begin{align} X | Y = y \sim N(\mu_1 + \rho\sigma_1\sigma_2^{-1}(y - \mu_2), (1 - \rho^2)\sigma_1^2). \tag{1} \end{align}

This suggests that, to bound $E[|XY|] = E[E[|X||Y| \mid Y]] = E[|Y|\,E[|X| \mid Y]]$, we can evaluate $E[|X| \mid Y]$ first. In this answer (or use the property of the folded normal distribution), it is shown that if $Z \sim N(\mu, \sigma^2)$, then \begin{align} E[|Z|] = \sigma\sqrt{\frac{2}{\pi}}e^{-\frac{\mu^2}{2\sigma^2}} + \mu(1 - 2\Phi(-\mu\sigma^{-1})), \tag{2} \end{align} where $\Phi$ is the CDF of $N(0, 1)$. $(1)$ and $(2)$ together then imply \begin{align} & E[|X| \mid Y] = \sqrt{1 - \rho^2}\sigma_1\sqrt{\frac{2}{\pi}}e^{-\frac{[\mu_1 + \rho\sigma_1\sigma_2^{-1}(Y - \mu_2)]^2}{2(1 - \rho^2)\sigma_1^2}} \\ +& (\mu_1 + \rho\sigma_1\sigma_2^{-1}(Y - \mu_2))\left(1 - 2\Phi\left(-\frac{\mu_1 + \rho\sigma_1\sigma_2^{-1}(Y - \mu_2)}{\sqrt{1 - \rho^2}\sigma_1}\right)\right). \tag{3} \end{align}
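For instance, in the centered case $\mu = 0$, $(2)$ reduces to the familiar half-normal mean, since $\Phi(0) = 1/2$: \begin{align} E[|Z|] = \sigma\sqrt{\frac{2}{\pi}}. \end{align}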

A trivial upper bound for $(3)$ is \begin{align} \sqrt{1 - \rho^2}\sigma_1\sqrt{\frac{2}{\pi}} + |\mu_1| + |\rho|\sigma_1\sigma_2^{-1}(|Y| + |\mu_2|), \end{align} which gives an upper bound for $E[|XY|]$ as \begin{align} & CE[|Y|] + |\rho|\sigma_1\sigma_2^{-1}E[Y^2] \\ =& C\left[\sigma_2\sqrt{\frac{2}{\pi}}e^{-\frac{\mu_2^2}{2\sigma_2^2}} + \mu_2(1 - 2\Phi(-\mu_2\sigma_2^{-1}))\right] + |\rho|\sigma_1\sigma_2^{-1}(\sigma_2^2 + \mu_2^2), \tag{4} \end{align} where $C = \sqrt{1 - \rho^2}\sigma_1\sqrt{\frac{2}{\pi}} + |\mu_1| + |\rho|\sigma_1\sigma_2^{-1}|\mu_2|$.
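In the centered case $\mu_1 = \mu_2 = 0$ of the original question, $C = \sqrt{1 - \rho^2}\sigma_1\sqrt{2/\pi}$ and $E[|Y|] = \sigma_2\sqrt{2/\pi}$, so $(4)$ collapses to \begin{align} E[|XY|] \leq \frac{2}{\pi}\sqrt{1 - \rho^2}\,\sigma_1\sigma_2 + |\rho|\sigma_1\sigma_2. \end{align} If I recall correctly, the centered case even admits an exact classical formula, $E[|XY|] = \frac{2\sigma_1\sigma_2}{\pi}\left(\sqrt{1 - \rho^2} + \rho\arcsin\rho\right)$, which agrees with the bound above at both endpoints $\rho = 0$ and $|\rho| = 1$; worth double-checking before relying on it.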


Updates

Added R code comparing bound $(4)$ and the Cauchy-Schwarz bound $\sqrt{(\sigma_1^2 + \mu_1^2)(\sigma_2^2 + \mu_2^2)}$. As expected, bound $(4)$ is sharper when $X$ and $Y$ are weakly correlated, while the Cauchy-Schwarz bound is sharper when they are highly correlated. Different setups of $\mu_i$ and $\sigma_i$ also produce different winners. Those who are interested can play with it.

bounds <- function(mu1, mu2, sigma1, sigma2, rho) {
  # C in bound (4): sqrt(1 - rho^2) sigma1 sqrt(2/pi) + |mu1| + |rho| sigma1 |mu2| / sigma2
  C1 <- sqrt(1 - rho^2) * sigma1 * sqrt(2/pi) + abs(mu1) + abs(rho * mu2) * sigma1/sigma2
  # E[|Y|], the folded normal mean from (2)
  C2 <- sigma2 * sqrt(2/pi) * exp(-mu2^2/(2 * sigma2^2)) + 
    mu2 * (1 - 2 * pnorm(-mu2/sigma2))
  # |rho| sigma1 E[Y^2] / sigma2, with E[Y^2] = sigma2^2 + mu2^2
  C3 <- abs(rho) * sigma1 * (sigma2^2 + mu2^2)/sigma2
  UB1 <- C1 * C2 + C3  # bound (4)
  UB2 <- sqrt((sigma1^2 + mu1^2) * (sigma2^2 + mu2^2))  # Cauchy-Schwarz bound
  c(UB1, UB2)
}

para_mat <- cbind(rep(0, 10), rep(0, 10), rep(1, 10), rep(1, 10), seq(0.1, 0.9, len = 10))
colnames(para_mat) <- c("mu1", "mu2", "sigma1", "sigma2", "rho")

My_bound <- apply(para_mat, 1, function(t) bounds(t[1], t[2], t[3], t[4], t[5])[1])
CS_bound <- apply(para_mat, 1, function(t) bounds(t[1], t[2], t[3], t[4], t[5])[2])
cbind(para_mat, My_bound, CS_bound)

Result:

          mu1 mu2 sigma1 sigma2       rho  My_bound CS_bound
     [1,]   0   0      1      1 0.1000000 0.7334287        1
     [2,]   0   0      1      1 0.1888889 0.8140485        1
     [3,]   0   0      1      1 0.2777778 0.8893436        1
     [4,]   0   0      1      1 0.3666667 0.9589474        1
     [5,]   0   0      1      1 0.4555556 1.0222792        1
     [6,]   0   0      1      1 0.5444444 1.0784391        1
     [7,]   0   0      1      1 0.6333333 1.1260001        1
     [8,]   0   0      1      1 0.7222222 1.1625473        1
     [9,]   0   0      1      1 0.8111111 1.1834650        1
    [10,]   0   0      1      1 0.9000000 1.1774961        1

Zhanxiong
  • I did some numerical experiments; this upper bound is not uniformly tighter than the bound given by the C-S inequality (i.e., for different $(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, it may or may not be sharper than the C-S bound). So if you want to apply this upper bound, compare it with the C-S bound first. – Zhanxiong Apr 27 '23 at 02:47
  • Nice idea. Could you include the code you used to run the numerical experiment? – mhdadk Apr 27 '23 at 12:29
  • @mhdadk Added the updates. Enjoy playing it :). – Zhanxiong Apr 27 '23 at 13:24
  • Did a square in (1) get lost? Should the variance not be $(1-\rho^2) \sigma_1^2$? I guess for the centered jointly Gaussians $X = w^T x$ and $Y = w^T y$ of the original formulation one would thus get $E[|X \cdot Y |] \leq \sqrt{1-\rho^2} \sigma_1 \sigma_2 \frac{2}{\pi} + |\rho|\sigma_1\sigma_2$ compared to what Cauchy-Schwarz gives: $E[|X \cdot Y |] \leq \sigma_1 \sigma_2$ ? So for $|\rho|=1$ the two coincide and for $\rho=0$ an improvement of $\frac{2}{\pi} $ is observed? – N. Menet Apr 27 '23 at 23:31
  • Actually, if $\rho = 0$, i.e. $X,Y$ independent, then according to the expectation of centred folded distributions $\mathbb{E}[|X \cdot Y|] = \mathbb{E}[|X|] \cdot \mathbb{E}[|Y|] = \frac{2}{\pi} \sigma_1 \sigma_2$, meaning your formula is tight at both endpoints $\rho=0$ and $|\rho| = 1$. Neat! – N. Menet Apr 27 '23 at 23:39
  • @N.Menet Yes, it should be $1 - \rho^2$, I will fix it and check the numerical results again. – Zhanxiong Apr 27 '23 at 23:59
  • @N.Menet Surprisingly, for non-zero means ($\mu_1 \neq 0, \mu_2 \neq 0$) and heteroscedastic variances, the C-S bound works better under many situations. – Zhanxiong Apr 28 '23 at 00:16
  • That is indeed interesting. Thank you for your answer, it helped a great deal. I never encountered folded normal distributions before. – N. Menet Apr 28 '23 at 00:54