
I have a dataset of $N$ data points $\{x_i, y_i\}$, where the $x_i$ are equally spaced over the interval $[0, 1]$ and the $y_i$ are non-negative. It is known that $y_i$ is the sum of a signal and Gaussian noise. The goal is to find a function $F(\vec{y})$ that returns two scalar metrics: the center of mass $c$ and the spread $s$ of $y$. The requirements are as follows:

  1. $c \in [0, 1]$ corresponds to the position of the center of mass on the interval.
  2. $s$ should be minimal if only one $y_i$ is nonzero, and maximal if the $y_i$ are uniform.
  3. $s$ should be independent of $c$ as much as possible. For example, a Gaussian of the same variance centered at 0.5 or at 0.1 should give the same spread.
  4. The metrics $c$ and $s$ should be robust to small perturbations of the data.

So far, I have tried converting $y_i$ into a probability distribution $p_i = \frac{y_i}{\sum_i y_i}$, and estimating the mean and variance of that distribution. The problems with using variance as a measure of spread are as follows:

  • Requirement 4 does not hold. For example, if a Gaussian profile is used for $y$, adding a small amount of Poisson noise can change the variance estimate by a factor of several.
  • Further, if the signal is slightly positive everywhere, then even at a reasonable SNR the estimated variance barely changes with the true variance. This suggests that the mere presence of nonzero entries affects the estimate more than the SNR does.
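To make the second problem concrete, here is a minimal sketch of the moment-based estimate described above. The function name `moment_center_spread`, the grid size, the Gaussian width, and the floor value 0.05 are all assumptions chosen for illustration, not values from my data.

```python
import math

def moment_center_spread(y):
    """Mean and standard deviation of y treated as weights on a grid over [0, 1]."""
    n = len(y)
    total = float(sum(y))
    if n < 2 or total <= 0:
        raise ValueError("need at least two points and a positive total")
    x = [i / (n - 1) for i in range(n)]         # equally spaced x_i on [0, 1]
    p = [v / total for v in y]                  # p_i = y_i / sum_i y_i
    c = sum(pi * xi for pi, xi in zip(p, x))    # center of mass
    var = sum(pi * (xi - c) ** 2 for pi, xi in zip(p, x))
    return c, math.sqrt(max(var, 0.0))

def gaussian_profile(n, mu, sigma):
    """Unnormalized Gaussian bump sampled on the same grid (illustrative signal)."""
    return [math.exp(-0.5 * ((i / (n - 1) - mu) / sigma) ** 2) for i in range(n)]

n = 201
clean = gaussian_profile(n, 0.5, 0.05)
offset = [v + 0.05 for v in clean]   # same bump on a small constant floor (assumed value)

c_clean, s_clean = moment_center_spread(clean)
c_off, s_off = moment_center_spread(offset)
print(f"clean:  c={c_clean:.3f}, s={s_clean:.3f}")
print(f"offset: c={c_off:.3f}, s={s_off:.3f}")
```

In this sketch the floor is only 5% of the peak height, yet it carries a sizeable share of the total mass because it extends over the whole interval, so the estimated spread is dominated by the floor rather than by the width of the bump.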

I wonder whether there are other ways to estimate the spread in my case. I emphasize that the spread need not be an estimate of the variance. It just has to satisfy the requirements above.

  • The $x_i$ appear to play no role in your question, leading one to wonder whether you have asked what you intended to ask. Your criteria (1) and (2) appear to conflict with one another, making it difficult to determine what you might mean conceptually by "spread." – whuber May 04 '20 at 20:15
  • @whuber Your first concern: if the $x_i$ are uniformly spaced with $x_1 = 0$ and $x_N = 1$, then the function can infer all the values of $x$ from $N$ alone, and it can infer $N = \dim(y)$. Your second concern: can you elaborate on why criteria (1) and (2) appear to conflict? (1) is a statement about $c$, (2) is a statement about $s$. To me these statements have no overlap at all – Aleksejs Fomins May 05 '20 at 08:28
  • (1) Why mention the $x_i$ at all? They are irrelevant. (2) Your criterion (2) does not characterize anything like a measure of "spread." – whuber May 05 '20 at 11:16
  • @whuber (1) Sure, I'll eliminate $x_i$ from the text. (2) I don't really know how to write it nicely, which is why I give examples. If the signal is concentrated within one bin, its spread is minimal; if it is spread over all bins, its spread is maximal. In the in-between cases it should characterize how close the bulk of the signal is to its mean. If I knew how to write this more precisely, I would know the answer to my question. That is why I am asking for a metric – Aleksejs Fomins May 05 '20 at 13:09
  • How "spread" is the signal if half the values are at one extreme and the other half are at the other extreme? – whuber May 05 '20 at 15:21
  • @whuber if the two peaks are far apart, then big spread. If two peaks are very close, and the peaks themselves are narrow, then small spread. In other words, I am not looking for a measure of entropy. – Aleksejs Fomins May 06 '20 at 13:31
  • But this is almost the opposite of what you are describing in your question! You state the spread is "maximal when the $y_i$ are uniform." That is far less spread than concentrating all the values at the two extremes. – whuber May 06 '20 at 15:42
  • @whuber Thanks for noticing. I think you are right that two peaks at extremes should in principle be more spread than uniform. The reason I stopped at uniform is that I am only interested in testing if something is less spread than uniform. If something is more spread than uniform I would just round it down to that of uniform for the purpose of this analysis. I am sorry I did not write it more clearly, your questions help me better understand what I want – Aleksejs Fomins May 07 '20 at 07:56
  • It sounds like any robust version of the standard deviation does what you want. To help guide readers, I therefore added the [tag:robust] tag to your post. – whuber May 07 '20 at 11:41
  • @whuber, Thanks, that makes sense to me – Aleksejs Fomins May 08 '20 at 07:36
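
One possible reading of the "robust version of the standard deviation" suggested in the comments is a quantile-based spread. The sketch below is only an illustration under assumptions: the names `weighted_quantile` and `quantile_center_spread`, the choice of the 25% and 75% quantiles, and the cap at `hi - lo` (roughly the value a uniform profile gives, following the comment about rounding anything wider than uniform down to uniform) are not taken from the thread.

```python
import math

def weighted_quantile(y, q):
    """Grid position on [0, 1] where the cumulative share of y first reaches q."""
    n = len(y)
    total = float(sum(y))
    if n < 2 or total <= 0:
        raise ValueError("need at least two points and a positive total")
    cum = 0.0
    for i, v in enumerate(y):
        cum += v / total
        if cum >= q:
            return i / (n - 1)
    return 1.0  # guard against rounding in the cumulative sum

def quantile_center_spread(y, lo=0.25, hi=0.75):
    """Weighted median as center, inter-quantile range as spread, capped at hi - lo."""
    c = weighted_quantile(y, 0.5)
    s = weighted_quantile(y, hi) - weighted_quantile(y, lo)
    return c, min(s, hi - lo)   # anything wider than uniform is rounded down

def gaussian_profile(n, mu, sigma):
    """Unnormalized Gaussian bump on the grid (illustrative signal)."""
    return [math.exp(-0.5 * ((i / (n - 1) - mu) / sigma) ** 2) for i in range(n)]

n = 201
c_mid, s_mid = quantile_center_spread(gaussian_profile(n, 0.5, 0.02))
c_edge, s_edge = quantile_center_spread(gaussian_profile(n, 0.1, 0.02))
print(f"centered at 0.5: c={c_mid:.3f}, s={s_mid:.3f}")
print(f"centered at 0.1: c={c_edge:.3f}, s={s_edge:.3f}")
```

Quantiles depend on where the mass accumulates rather than on squared distances, so a small amount of mass far from the center tends to move them less than it moves the variance. A constant floor under the signal still shifts them, so a baseline estimate may need to be subtracted first.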
