
I have two versions of the same question:

  1. Given a list of numbers (with possible duplicates), how do I find a $k$-subset (with possible duplicates) that maximizes the variance? Is there a more efficient way than the obvious "check all $k$-subsets" approach?
  2. Given a set of numbers, how do I select from that set a list of $k$ numbers that maximizes the variance?
GreyGeek
  • I believe there's a simple $O(k)$ algorithm, because when $k \gt 1$, a variance-maximizing subset must consist of the $k_0$ smallest and $k-k_0$ largest elements, whence a search over $k_0=1,2,\ldots,k-1$ does the trick. – whuber Mar 16 '12 at 19:37
  • Don't forget about computing the $k$ smallest and largest elements of the input; this requires $O(n \log k)$. – krlmlr Mar 19 '12 at 03:05
  • Good point, @user. Regardless, these execution times ($O(k)$ for a pre-sorted list and $O(n\log(k))$ otherwise) are very small compared to the number of $k$-subsets of $n$ whenever $k\gt 1$. – whuber Mar 19 '12 at 17:29
  • @whuber should that be "$k-k_0$ largest elements" – Spacedman Jan 31 '18 at 17:43
  • @Spaced yes. I'll fix it. – whuber Jan 31 '18 at 18:34
  • @whuber I got here from here: https://gis.stackexchange.com/questions/269719/select-n-cells-from-a-raster-maximizing-their-variance/269731#269731 where someone wants to do this for raster values. I've implemented a method which seems to give same results as yours but slightly different method... – Spacedman Jan 31 '18 at 19:55
  • @Spaced Yes, I saw that yesterday (and upvoted your answer then: it's a nice one). I have posted an answer here to justify your solution. – whuber Jan 31 '18 at 20:12

1 Answer


When the numbers are sorted, there's a simple $O(k)$ algorithm, because when $k\gt1,$ some variance-maximizing subset will consist of the $k_0\ge 1$ smallest and $k-k_0$ largest elements, whence a search over $k_0=1,\ldots,k-1$ does the trick.

(Even when the $n$ numbers are not sorted, finding the $k^\text{th}$ largest or smallest element takes $O(n)$ effort, so the algorithm is at worst $O(k n)$ or $O(n\log n)$, whichever is smaller.)
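For concreteness, here is a minimal Python sketch of that two-tails search (my own illustration, not code from this answer; the name `max_variance_subset` and the use of the population variance are assumptions). For clarity it recomputes the variance of each candidate split from scratch, which costs $O(k^2)$ after sorting; the $O(k)$ bound quoted above would require updating the mean and sum of squares incrementally as $k_0$ advances.

```python
def max_variance_subset(xs, k):
    """Return a size-k selection from xs with maximal (population) variance.

    Two-tails search: sort once, then for each split k0 = 1, ..., k-1 take
    the k0 smallest and k-k0 largest values, keeping the best split.
    """
    assert 1 < k <= len(xs)
    xs = sorted(xs)
    best, best_var = None, -1.0
    for k0 in range(1, k):
        candidate = xs[:k0] + xs[len(xs) - (k - k0):]
        mean = sum(candidate) / k
        var = sum((x - mean) ** 2 for x in candidate) / k
        if var > best_var:
            best, best_var = candidate, var
    return best, best_var


print(max_variance_subset([3, 1, 4, 1, 5, 9, 2, 6], 3))  # -> ([1, 1, 9], ...)
```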


By understanding how changing a single value changes the variance, we can see immediately why this is. Consider any $k$ numbers $x_1, \ldots, x_k$. Contemplate changing $x_k$ to $x_k+\delta$ for some number $\delta$. Because the original mean is

$$\bar x = \frac{1}{k}(x_1+\cdots+x_k),$$

adding $\delta$ changes it to

$$\bar x^\prime = \bar x + \frac{1}{k}\delta.\tag{1}$$

Because the original variance, multiplied by $k$, is

$$ks^2 = (x_1^2 + \cdots + x_k^2) - k\bar x^2,$$

we may directly compute the effect of adding $\delta$ to the sum of squares (it affects only the last square) and exploit $(1)$ to determine how the variance changes:

$$\eqalign{ ks^{\prime 2} - ks^2 &= [(x_k+\delta)^2 - x_k^2] - k[\bar x^{\prime 2} - \bar x^2]\\ &= 2\delta(x_k - \bar x) + \frac{k-1}{k}\delta^2 \ge 2\delta(x_k - \bar x). }$$

This will be nonnegative whenever the signs of $\delta$ and $x_k-\bar x$ are the same. In other words,

if we can manage to move $x_k$ further from the mean, we will increase the variance.
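As a quick numerical sanity check on $(1)$ and the display above (a sketch of my own, not part of the original answer), the following snippet compares the direct change in $k s^2$ against the closed form $2\delta(x_k-\bar x) + \frac{k-1}{k}\delta^2$ for an arbitrary sample and an arbitrary $\delta$:

```python
xs = [2.0, 3.0, 7.0, 10.0]   # arbitrary sample, k = 4
k = len(xs)
delta = 2.5                  # perturbation added to the last value

mean = sum(xs) / k
ks2 = sum(x * x for x in xs) - k * mean * mean            # k times the variance

ys = xs[:-1] + [xs[-1] + delta]
mean_new = sum(ys) / k
ks2_new = sum(y * y for y in ys) - k * mean_new * mean_new

direct = ks2_new - ks2
closed_form = 2 * delta * (xs[-1] - mean) + (k - 1) / k * delta ** 2
print(direct, closed_form)   # both print 27.1875
```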

Now suppose we have identified a variance-maximizing sequence $\mathcal A$ of $k$ values chosen from a sequence $X$ of given values. It should be clear--and is easy to demonstrate formally--that if $\mathcal A$ does not consist of the very smallest and the very largest elements of $X$, then we can select some $x\in\mathcal A$ that comes from somewhere in the middle of $X$ and replace it by another element $y\in X\setminus \mathcal A$ that is at least as far from the mean of $\mathcal A$ as $x$ itself is, and on the same side of that mean. This replacement would be tantamount to adding the value $\delta = y-x$ to $x$, and (by the foregoing calculations, since $\delta$ and $x-\bar x$ then share a sign) would produce another variance-maximizing subsequence.

That's it: after a finite number of such moves, we would no longer be able to find such an $x$. At this point, $\mathcal A$ would be the union of two extreme "tails" of $X$, as claimed.

Similar reasoning shows that $\mathcal A$ will not consist of just one tail of $X$ (unless $\mathcal A$ were all of $X$ itself or the elements of $X$ were all equal), for by replacing the most "inner" element of $\mathcal A$ by the extreme element in the same direction, we would not decrease the variance.
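Finally, to illustrate the conclusion of the exchange argument, here is a small brute-force check (again my own sketch, reusing the hypothetical `max_variance_subset` from the first snippet) that the two-tails search agrees with exhaustive enumeration over all $k$-subsets of a random list:

```python
import itertools
import random

def population_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

random.seed(0)
xs = [random.randint(0, 20) for _ in range(10)]
k = 4

# Exhaustive search over all k-subsets, by index so duplicate values are handled.
brute_force = max(
    population_variance([xs[i] for i in idx])
    for idx in itertools.combinations(range(len(xs)), k)
)

# Two-tails search from the earlier sketch (assumed defined above).
_, two_tails = max_variance_subset(xs, k)

print(brute_force, two_tails)   # the two maxima coincide
```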

whuber