Questions tagged [kernel-smoothing]

Kernel smoothing techniques, such as kernel density estimation (KDE) and Nadaraya-Watson kernel regression, estimate functions by local interpolation from data points. Not to be confused with [kernel-trick], for the kernels used e.g. in SVMs.

A kernel in the context of kernel smoothing is a local similarity function $K$, which must integrate to 1 and is typically symmetric and nonnegative. Kernel smoothing uses these functions to interpolate observed data points into a smooth function.

For example, Nadaraya-Watson kernel regression estimates a function $f : \mathcal X \to \mathbb R$ based on observations $\{ (x_i, y_i) \}_{i=1}^n$ by $$ \hat{f}(x) = \frac{\sum_{i=1}^n K(x, x_i) \, y_i}{\sum_{i=1}^n K(x, x_i)} ,$$ i.e. a mean of the observed data points weighted by their similarity to the test point.
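As a minimal sketch of the estimator above (the Gaussian kernel and the bandwidth value are illustrative assumptions, not part of the definition):

```python
import numpy as np

def gaussian_kernel(u):
    # standard normal density: symmetric, nonnegative, integrates to 1
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x, xs, ys, h):
    # weight each observation by its kernel similarity to the query point x,
    # then return the weighted mean of the responses
    w = gaussian_kernel((x - xs) / h)
    return np.sum(w * ys) / np.sum(w)

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 1.0, 2.0, 3.0])
estimate = nadaraya_watson(1.5, xs, ys, h=0.5)  # a locally weighted mean of ys near x = 1.5
```

Because the weights sum to one in the denominator, the estimate always lies within the range of the observed $y_i$.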

Kernel density estimation estimates a density function $\hat{p}$ from samples $\{ x_i \}_{i=1}^n$ by $$ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^n K(x, x_i) ,$$ essentially placing density "bumps" at each observed data point.
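A sketch of this formula, taking $K(x, x_i) = \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right)$ with a Gaussian $K$ so that each bump, and hence $\hat p$, integrates to 1 (kernel choice and bandwidth are illustrative assumptions):

```python
import numpy as np

def kde(x, samples, h):
    # average of Gaussian density "bumps" centered at each sample;
    # dividing by h makes each bump integrate to 1
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi)))

samples = np.array([0.0, 1.0, 1.2, 3.0])
grid = np.linspace(-8.0, 11.0, 5001)
density = np.array([kde(g, samples, h=0.5) for g in grid])
# the result is a proper density: nonnegative, integrating to approximately 1
mass = np.sum(density) * (grid[1] - grid[0])
```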

The choice of kernel function is of theoretical importance but typically does not matter much in practice for estimation quality. (Wikipedia has a table of the most common choices.) Rather, the important practical problem for kernel smoothing methods is that of bandwidth selection: choosing the scale of the kernel function. Undersmoothing or oversmoothing can result in extremely poor estimates, and so care must be taken to choose an appropriate bandwidth, often via cross-validation.


Note that the word "kernel" is also used to refer to the kernel of a reproducing kernel Hilbert space, as in the "kernel trick" common in support vector machines and other kernel methods. See [kernel-trick] for this usage.

631 questions
21
votes
1 answer

Kernel Bandwidth: Scott's vs. Silverman's rules

Could anyone explain in plain English what the difference is between Scott's and Silverman's rules of thumb for bandwidth selection? Specifically, when is one better than the other? Is it related to the underlying distribution? Number of…
xrfang
10
votes
1 answer

Efficient evaluation of multidimensional kernel density estimate

I've seen a reasonable amount of literature about how to choose kernels and bandwidths when computing a kernel density estimate, but I am currently interested in how to improve the time it takes to evaluate the resulting KDE at an arbitrary number…
Gabriel
8
votes
1 answer

Bias for kernel density estimator (periodic case)

The kernel density estimator is given by $$\hat{f}(x,h)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_{i}}{h}\right)$$ where $X_1,\dots,X_n$ are i.i.d. with some unknown density $f$, $h$ is the bandwidth, and $K$ is the kernel function …
Katja
4
votes
1 answer

Problem on spherical kernel and spherical distribution

Suppose $X$ is a random variable which follows $f$ (i.e., $X \sim f$), such that $f(x)$ is the probability density function (p.d.f.) of a spherical distribution. Here, $X$ is multi-dimensional! By spherical distribution, I mean any…
4
votes
2 answers

Density estimation for streams of Data

What statistical methods are out there that will estimate the probability density of data as it arrives temporally? This is the situation I have: I need to estimate the pdf of a multivariate dataset; however, new data arrives over time, and as the data…
3
votes
3 answers

Is kernel density in kernel density estimation derived or defined?

Is kernel density in kernel density estimation derived or defined? If defined, why is it defined this way, if derived, how to derive it? In particular, why $h^d$ and not $h$ in the multivariate case, where kernel density is defined as…
user10024395
3
votes
0 answers

Does it make sense to compute the KL divergence of two KDEs?

Does it? I'm not exactly certain. I fit two KDE PDFs on two datasets. I want to measure the dissimilarity between them.
user46925
2
votes
1 answer

Order of the kernel for periodic case

This question is related to my previous question Bias for kernel density estimator (periodic case). A kernel $K(x)$ is of order $p$ if $$\int_{-\infty}^{\infty}K(x)x^{j}\,dx=\delta_{0,j},\quad j=0,\dots,p-1,$$ $$\int_{-\infty}^{\infty}K(x)x^{p}\,dx\neq0\,$$…
Katja
2
votes
1 answer

Kernel Bandwidth: Why does Scott's rule use only n**(-1./(d+4)) in scipy.stats?

I have a question about bandwidth selection of kernel density estimate in scipy.stats. In the method, if we use Scott's rule, the bandwidth is equal to n**(-1./(d+4)), which means that the bandwidth is only related to the number and dimensions of…
Gid
2
votes
0 answers

Divide Kernel Density Estimate with another one

Let's say, for example on an e-commerce website, I create a kernel density estimate of all sold items at their price points. I also create another KDE of all listed items at their price points. Does it make sense to divide the first KDE by the…
mitbal
2
votes
1 answer

Kernel density estimation: Order remains under integration

In (univariate) kernel density estimation, I often come across constructions where some Taylor expansion like $ \int K_h ( u - y) f(y) dy = \int K(x) f( u - hx) dx = \sum_{k = 0}^2 h^{2k} f^{(2k)} (u) \frac{m_{2k}(K)}{(2k)!} + O(h^6)$ is done and…
xxx
2
votes
1 answer

Kernel Density Estimation

In the Kernel density estimation formula below (from Wikipedia), what do the values of $x$ and $x_i$ represent? $$ \hat f_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x-x_i) = \frac{1}{nh}\sum_{i=1}^n K\bigg(\frac{x-x_i}{h}\bigg) $$ I am implementing this…
1
vote
1 answer

How does kernel regression work?

I am working on propensity score matching using Nadaraya–Watson kernel regression. But I am looking to understand the logic of the estimation; first, we estimate the kernel density of each unit ($i$) in the data using the formula $$ \hat Y_{0i}…
1
vote
0 answers

Mathematical underpinnings of variations on Silverman's bandwidth

When using a Gaussian kernel to estimate the distribution of a Gaussian-distributed $x$, the bandwidth that minimizes the mean integrated squared error is: $$h=\left(\frac{4 \hat{\sigma}^5}{3n}\right)^{\frac{1}{5}} $$ where $\hat{\sigma}$ is the…
1
vote
1 answer

smoother: "... prediction on x, which is unrelated to the values of x_i." What does it mean?

My lecture notes say: Any practical implementation of a smoother is based on input in the form of a scatterplot $(x_i, y_i)_{i=1}^n$, on a tuning parameter $h$, and on a grid of output points $x$ where one would like to see the estimate (usually…
WCMC