
Suppose I have some dataset $X = \{x_1, x_2, \ldots, x_n\}$ which has a mean $\bar{X}$ and a standard deviation $\sigma_X$. Now, suppose that I want to trim the tails of the dataset such that the new average is $\bar{X}_d$ with new standard deviation $\sigma_{X_d}$. In other words, I want to remove the tail points in the dataset such that the new average of $X$ is approximately $\bar{X}_d$ with a new standard deviation of approximately $\sigma_{X_d}$.

Is there a way to formulate some (convex) optimization problem to accomplish this? Basically, the optimization objective might be to find the threshold value $x_i$ which separates the dataset. I was thinking of this kind of formulation:

$$\min \|\bar{X} - \bar{X}_d\|+ \|\sigma_{X}^2 - \sigma_{X_d}^2\|$$

But I'm not sure how to formulate this with optimization variables.

Johnny
    Brute force? Assuming you want to cut from both tails, there are approximately $(n/2)^2$ combinations of a first and last observation to keep. Computing sample moments is pretty quick compared to discrete optimization. – prubin Oct 08 '21 at 20:27
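
To make that comment concrete, a brute-force sketch along those lines might look like the following (numpy only). Note that `mean_d` and `var_d` are placeholder target values you would supply yourself, and this version scans all $O(n^2)$ contiguous windows of the sorted data rather than only the $(n/2)^2$ both-tail cuts:

```python
import numpy as np

def trim_both_tails(x, mean_d, var_d):
    """Try every contiguous window [i, j] of the sorted data and return the
    window whose sample mean and variance best match the targets."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    best_window, best_obj = xs, float("inf")
    for i in range(n):              # index of the first kept observation
        for j in range(i, n):       # index of the last kept observation
            kept = xs[i : j + 1]
            # population variance (ddof=0); swap in kept.var(ddof=1) if preferred
            obj = abs(kept.mean() - mean_d) + abs(kept.var() - var_d)
            if obj < best_obj:
                best_window, best_obj = kept, obj
    return best_window

# example usage with placeholder targets:
# trimmed = trim_both_tails(np.random.randn(100), 0.0, 0.5)
```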

1 Answer


I think the standard-deviation part is not easy (at least for me) to formulate, so let's omit it and focus on the mean...

Let decision variable $b_i = 1$ if $x_i$ is kept, otherwise $b_i = 0$, and let $\bar{X}$ here denote the mean of the kept points (a continuous decision variable, not the original sample mean). The mean is then defined by $$ \sum b_i \bar{X} = \sum b_ix_i. $$

To eliminate the bilinear term, introduce new variables $y_i$; to ensure $$ y_i = b_i\bar{X}, $$ add the big-$M$ constraints $$ -Mb_i \leq y_i \leq Mb_i\\ -M(1-b_i)\leq y_i - \bar{X}\leq M(1-b_i) $$
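
As a sanity check, a minimal PuLP sketch of this big-$M$ formulation might look as follows. Everything concrete here (the toy data, the target mean `mean_d`, the minimum kept count `n_min`, and the value of $M$) is my own placeholder, and the minimum-count constraint is an addition: without it, keeping nothing satisfies every constraint and matches any target trivially.

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

x = [1.2, 3.4, 2.2, 9.7, 0.1, 4.5]        # toy data (placeholder)
mean_d = 2.5                               # desired mean (placeholder)
n_min = 3                                  # keep at least this many points (my addition)
M = max(abs(v) for v in x)                 # valid big-M: the subset mean lies in [min x, max x]
n = len(x)

prob = LpProblem("trim_to_mean", LpMinimize)
b = [LpVariable(f"b_{i}", cat=LpBinary) for i in range(n)]            # keep x_i?
y = [LpVariable(f"y_{i}", lowBound=-M, upBound=M) for i in range(n)]  # y_i = b_i * mean
mean = LpVariable("mean", lowBound=-M, upBound=M)                     # mean of kept points
t = LpVariable("t", lowBound=0)                                       # |mean - mean_d|

prob += t                    # objective: minimize |mean - mean_d|
prob += t >= mean - mean_d
prob += t >= mean_d - mean
for i in range(n):
    # big-M linearization of y_i = b_i * mean
    prob += y[i] <= M * b[i]
    prob += y[i] >= -M * b[i]
    prob += y[i] - mean <= M * (1 - b[i])
    prob += y[i] - mean >= -M * (1 - b[i])
# mean definition: sum_i b_i * mean == sum_i b_i * x_i
prob += lpSum(y) == lpSum(x[i] * b[i] for i in range(n))
prob += lpSum(b) >= n_min    # rule out the empty selection

prob.solve()
kept = [x[i] for i in range(n) if b[i].value() > 0.5]
print(kept, value(mean))
```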

By 'tail points', you might mean that values larger than some threshold $x^\mathrm{th}$ should be discarded, which is equivalent to $$ x_i \leq x^\mathrm{th} + M(1-b_i). $$

A better way is to sort the data first (in ascending order) and constrain $$ b_{i-1} \geq b_i, $$ so that once a point is dropped, every larger point is dropped as well.

(If you mean ruling out the left tail, just flip the sign. To chop both tails, you might need two sets of 0-1 variables.)
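
As a sketch of how either tail-selection variant could be bolted onto the PuLP model above (the names `prob`, `b`, `M`, `n`, and `x` are carried over from that sketch, so this fragment is not standalone):

```python
# Variant A: big-M threshold form. x_th is a new continuous variable;
# any kept point (b_i = 1) must satisfy x_i <= x_th.
x_th = LpVariable("x_th", lowBound=min(x), upBound=max(x))
for i in range(n):
    prob += x[i] <= x_th + M * (1 - b[i])

# Variant B: sort x ascending beforehand, then force a monotone keep
# pattern: once a point is dropped, every larger point is dropped too.
# (Flip the inequality for the left tail; use two sets of binaries,
# one decreasing and one increasing, to chop both tails.)
for i in range(1, n):
    prob += b[i - 1] >= b[i]
```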

I think if you are just chopping one tail, you can simply iterate over the size of the kept set and choose your favorite.
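
That one-sided scan might look like this sketch (numpy only; `mean_d` and `var_d` are again placeholder targets):

```python
import numpy as np

def trim_right_tail(x, mean_d, var_d):
    """Drop the k largest points for k = 0..n-1 and return the best kept set."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    best_kept, best_obj = xs, float("inf")
    for k in range(n):              # k = number of points chopped off the right tail
        kept = xs[: n - k]
        obj = abs(kept.mean() - mean_d) + abs(kept.var() - var_d)
        if obj < best_obj:
            best_kept, best_obj = kept, obj
    return best_kept
```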

xd y
  • Thanks for your answer! A couple of questions...$\bar{X}_d$ is a constant value, so where is the bilinear term in your definition of $y_i$? Also, in the minimization problem you stated, $\bar{X}_d$ and $\bar{X}$ are known, so what is the point of including $| \bar{X} - \bar{X}_d |$? – Johnny Oct 08 '21 at 13:50
  • $\bar{X}_d$ is a variable. – xd y Oct 08 '21 at 14:22
  • $\bar{X}_d$ is the desired mean. It should be set by us, not by the algorithm. – Johnny Oct 08 '21 at 14:28
  • Oh, I've misunderstood your question. But you can just exchange $\bar X$ and $\bar X_d$ everywhere, and omit the $\alpha \sum b_i$ part. I've updated my answer. – xd y Oct 08 '21 at 14:33