
I am interested in estimating the expected value of a function $f(x)$ with respect to a probability density function $P(x)$.

I am exploring a method that requires changing variables from the underlying distribution $P(x)$ to a uniform distribution $u\sim U[0, 1]$. The idea is to map the uniform samples into the input of $f(x)$ as $f(\Phi^{-1}(u))$, where $u = \Phi(x)$ and $\Phi(x)=\int_{-\infty}^x P(x')\,dx'$ is the CDF. Substituting $u = \Phi(x)$, so that $du = P(x)\,dx$, gives:

$\int f(x) P(x)\, dx = \int_0^1 f(\Phi^{-1}(u))\, du$

This was straightforward enough when I let $P(x)$ be an uncorrelated standard normal distribution:

```python
import numpy as np
from scipy.stats import norm

np.random.seed(1)
nsamps = 200000

def f(x):
    return np.sum(x ** 2, 0)

u = np.random.uniform(0, 1, (2, nsamps))
x = np.random.normal(0, 1, (2, nsamps))

print('Expectation from normal samples: ', np.mean(f(x)))
print('Expectation from uniform samples: ', np.mean(f(norm.ppf(u))))
```

Output:

```
Expectation from normal samples:  1.99866994956538
Expectation from uniform samples:  1.9984323166733162
```

In my real problem, I have estimated $P(x)$ using a kernel density estimation procedure. Even if I can find $\Phi$, I am unsure whether it is even possible to compute $\Phi^{-1}(u)$ in this case.

**Is it possible to compute $\Phi^{-1}(u)$ in this case? If so, how could this be accomplished?**

That is my question. I ask it in the context of my challenge problem: estimating $\int f(\Phi^{-1}(u))\,du$, where $\Phi(x)$ is the CDF of the joint probability density function found using a kernel density estimator.

What follows is my best attempt so far:

This Stack Overflow thread covers the case where $x$ is one-dimensional. The solution interpolates the CDF as a function of $x$.
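For concreteness, here is a minimal sketch of that one-dimensional approach (my reconstruction, not the thread's exact code), assuming a `gaussian_kde` fit: the CDF of a Gaussian KDE is the average of the component normal CDFs, and the inverse can be tabulated by interpolating $x$ against the CDF values on a grid.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.special import ndtr
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
kde = gaussian_kde(samples)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

# CDF of the KDE on a grid: average of the component normal CDFs
grid = np.linspace(samples.min() - 3 * bw, samples.max() + 3 * bw, 512)
cdf = np.array([ndtr((g - kde.dataset[0]) / bw).mean() for g in grid])

# invert by interpolating x as a function of the CDF values
invcdf = interp1d(cdf, grid, bounds_error=False, fill_value=(grid[0], grid[-1]))

u = rng.uniform(0, 1, 200000)
print(np.mean(invcdf(u) ** 2))  # ~E[x^2] under the KDE, about 1 + bw**2 here
```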

I've tried emulating this solution with multidimensional $x$, but it is problematic to interpolate a two-dimensional output from a one-dimensional input. This is my best attempt so far at approximating the integral using uniform samples, but it does not compute the correct expectation.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde
from scipy.special import ndtr
from scipy.interpolate import interp2d  # note: deprecated in recent SciPy versions

def f(x):
    return np.sum(x ** 2, 0)

# create kde
samples = np.random.normal(loc=0, scale=1, size=(2, 1000))
kde = gaussian_kde(samples)

# compute cdf
cdf = tuple(ndtr(np.ravel(item - kde.dataset.T) / kde.factor).mean()
            for item in samples.T)

# interpolate x_2 from cdf and x_1
invcdf = interp2d(cdf, samples[0, :], samples[1, :])

# though this is my best attempt so far at using uniform samples, it is not very good
print('estimation from uniform samples: ',
      np.mean(f(invcdf(np.random.uniform(0, 1, 1000), 0))))

# expectation approximated by sampling the KDE
print('estimation from direct samples: ', np.mean(f(kde.resample(10000))))
```

Output:

```
estimation from uniform samples:  1026.192868721676
estimation from direct samples:  2.178433139240008
```

Edit

From g g's answer below, I think I have put together a code example that suits my needs:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from scipy.stats import norm
import matplotlib.pyplot as plt

# load data
dat = load_diabetes()
train_x = dat.data[:, [4, 5]]
train_x -= train_x.mean()
train_x /= train_x.std()
dimension = train_x.shape[1]
data_size = train_x.shape[0]

# define length scale of KDE estimate
LSCALE = 0.5

# create KDE of data
def kde(x, lscale=1):
    density = 0
    for point in train_x:
        density += (norm.pdf(x[0], loc=point[0], scale=lscale)
                    * norm.pdf(x[1], loc=point[1], scale=lscale))
    density /= train_x.shape[0]
    return density

# compute Cholesky factors
C = np.zeros((data_size, dimension, dimension))
for imat in range(data_size):
    C[imat, :, :] = np.linalg.cholesky(np.cov(train_x.T))

# it seems like the covariance should depend on the length scale of the kernels,
# so I multiplied it by the length scale
C *= LSCALE

# define phi: cumulative probabilities, here all equal 1/data_size for simplicity
qprob = np.arange(data_size) / data_size

# function psi doing the transformation
def psi(u):
    # determine component according to the first coordinate
    comp = sum(qprob < u[0]) - 1
    # determine normal according to the remaining coordinates
    Z = norm.ppf(u[1:])
    return train_x[comp, :] + C[comp, :, :] @ Z

# map samples from U(0, 1) into the sample space using psi
np.random.seed(10)
u_samps = np.random.uniform(0, 1, (dimension + 1, 100))
generated_samps = np.array([psi(u) for u in u_samps.T])

# plot results
NX = 16
plot_x = np.linspace(generated_samps.min(0)[0], generated_samps.max(0)[0], NX)
plot_y = np.linspace(generated_samps.min(0)[1], generated_samps.max(0)[1], NX)
X, Y = np.meshgrid(plot_x, plot_y)
plot_points = np.array([X, Y]).reshape(2, -1)
dens = np.array([kde(p, LSCALE) for p in plot_points.T])
plt.contourf(X, Y, dens.reshape(NX, NX), 1000)
plt.scatter(train_x[:, 0], train_x[:, 1], c='k', s=10, label='Observed')
plt.scatter(generated_samps[:, 0], generated_samps[:, 1], c='r', s=10, label='Generated')
plt.xlim(plot_x.min(), plot_x.max())
plt.ylim(plot_y.min(), plot_y.max())
plt.legend()
plt.savefig('density')
plt.clf()
```

[Figure: KDE density contours with observed (black) and generated (red) samples]

```
>>> np.mean(f(train_x.T))
1.1494072502058117
>>> np.mean(f(generated_samps.T))
1.105391993079355
```

  • Is it at all possible to perform the integral with the change of variables for any joint probability distributions of $x$? This answer indicated it is possible for correlated Gaussian inputs. https://stats.stackexchange.com/questions/572787/bayesian-quadrature-to-find-expectation-of-unkown-function-w-r-t-known-pdf/572800#572800 – kilojoules Apr 25 '22 at 07:21
  • This comment refers to this answer: Watch out! It is not $P$, i.e. the density, which you need to invert, but another function $\Phi$. This may explain the issue you have with the inversion. Note that $\Phi$ is not a CDF as it does not map to $]0,1[$. I could not find a sufficiently detailed description of "gaussian_kde" in the SciPy docs, but it is most likely not simply a correlated Gaussian but something more complicated, such as a Gaussian mixture. – g g Apr 25 '22 at 07:41
  • Thanks for this feedback, all. I reworded the question to be more careful to use $\Phi$ instead of $P$ where appropriate. Say I have $\Phi$: how can I then compute $\Phi^{-1}$? – kilojoules Apr 25 '22 at 07:43
  • As I understand the OP, he wants to use a library for integration with respect to the uniform measure. The measure he is interested in is not uniform, so I suggested pulling back his measure to the uniform measure and performing the integral with the pull-back. – g g Apr 25 '22 at 07:44
  • The practical issue is now whether it is possible to find such a pull-back map explicitly. – g g Apr 25 '22 at 07:46
  • You should change the title of the question. Since you are not looking for a CDF but for "a transformation of a kernel density estimate to uniform", the title as it stands is misleading. – g g Apr 25 '22 at 10:23
  • Thanks! I changed the title. – kilojoules Apr 25 '22 at 19:10
  • Because the KDE is a mixture of kernels, https://stats.stackexchange.com/questions/411647 answers your question. There's no explicit (closed-form analytical) solution, so the key is to implement an efficient numerical procedure. That means deploying a suitable root finder (see the sketch after this thread). cc @gg – whuber Apr 25 '22 at 20:05
  • @whuber I do not see how the linked answer applies to this question. The OP neither wants to compute the quantiles nor the likelihood. He "just" wants to compute the expectation in a certain way (due to his restrictions on numerical libraries). While the expectation of a mixture is straightforward to compute in terms of the components this will contradict his goal of having as few evaluations of the integrand as possible. – g g Apr 25 '22 at 20:42
  • @gg I was responding to the emphasized question in the middle of the post, "Is it possible to compute $\Phi^{-1}(u)$ in this case?" I have understood the introductory material about computing expectations as background and motivating material, utilizing the OP's typographic emphasis to determine the ultimate question. After all, when the kernel has a mean of zero, the expectation of the KDE is the arithmetic mean of the data. – whuber Apr 25 '22 at 20:47
  • Re the motivating question: why not use a different method to estimate the expectation? Your approach appears to be a very complicated way to compute the arithmetic mean of your data! – whuber Apr 26 '22 at 12:57
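Following up on whuber's root-finder suggestion: here is a minimal sketch (my addition, not code from the thread) that inverts the CDF of a one-dimensional Gaussian KDE numerically with `scipy.optimize.brentq`, since the CDF of a mixture has no closed-form inverse.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.special import ndtr
from scipy.optimize import brentq

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
kde = gaussian_kde(samples)
bw = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation

def kde_cdf(x):
    # CDF of the mixture: average of the component normal CDFs
    return ndtr((x - kde.dataset[0]) / bw).mean()

def kde_ppf(u):
    # solve kde_cdf(x) = u with a bracketing root finder; the bracket is
    # padded so the CDF straddles any u in (0, 1)
    lo = samples.min() - 10 * bw
    hi = samples.max() + 10 * bw
    return brentq(lambda x: kde_cdf(x) - u, lo, hi)

u = rng.uniform(0, 1, 2000)
x = np.array([kde_ppf(ui) for ui in u])
print(np.mean(x ** 2))  # again close to E[x^2] under the KDE
```

One way to extend this to a multidimensional KDE is coordinate-by-coordinate inversion of the conditional CDFs (the Rosenblatt transformation), which works because each conditional of a Gaussian mixture is again a Gaussian mixture.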