I would like to construct a 1-sigma confidence interval for my 1D data. I don't know the underlying distribution, and it is strongly skewed, so the standard deviation will not suffice. I see that people recommend CDF-based estimators such as DKW. However, that looks difficult at first glance and, if possible, I would like to avoid having to fully understand what it does. I'm looking for a function (preferably in Python) where I can plug in an array of IID samples and get back an interval. I'm going to use it for visual comparison of several datasets, so it's OK if it is not very precise.

Edit: It was explained to me that what I seek is not a confidence interval, because I am interested in an interval that predicts the positions of future samples from the underlying distribution, not an interval for a parameter of that distribution. What I want is to find two numbers, $x_{\min}$ and $x_{\max}$, such that

  1. $x_{\max} > x_{\min}$
  2. $|x_{\max} - x_{\min}| \rightarrow \min$
  3. $\int^{x_{\max}}_{x_{\min}} f(x)dx = 1-\alpha$, where $\alpha = 1\%$ is just an example (I want a procedure where I can pick my own $\alpha \in (0,1]$)

and $f(x)$ is the underlying probability density of my data, which I do not know. Again, I note that the solution need not be perfectly optimal.
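A minimal sketch of how conditions 1 to 3 could be approximated directly from the samples, assuming that the unknown $f(x)$ is replaced by the empirical distribution: sort the data and take the narrowest window of consecutive order statistics that contains at least a fraction $1-\alpha$ of the points. The function name `shortest_interval` is hypothetical (it is not from any library), and this is only an empirical approximation with no coverage guarantee for future samples.

```python
import numpy as np

def shortest_interval(samples, alpha=0.32):
    """Return (x_min, x_max): the shortest interval that contains at least
    a (1 - alpha) fraction of the given samples (empirical approximation)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    if n == 0:
        raise ValueError("need at least one sample")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Number of sorted points the interval has to contain (at least one).
    k = max(1, int(np.ceil((1.0 - alpha) * n)))
    # Width of every window of k consecutive order statistics.
    widths = x[k - 1:] - x[:n - k + 1]
    # The narrowest window is the shortest interval with the required coverage.
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

# Usage on a strongly skewed toy sample (assumed data, for illustration only).
rng = np.random.default_rng(0)
data = rng.exponential(size=1000)
x_min, x_max = shortest_interval(data, alpha=0.32)
print(x_min, x_max)
```

Here `alpha=0.32` roughly corresponds to a 1-sigma (68%) interval. For very small samples or `alpha` close to 1 the window can shrink to a single point, so condition 1 holds only as $x_{\max} \ge x_{\min}$.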

  • For what, exactly, do you want to calculate a confidence interval? Around the empirical cdf? The mean? – COOLSerdash Sep 19 '19 at 14:47
  • Sorry, did I mention I did not fully understand it :D. I think the mean. I'll write an edit – Aleksejs Fomins Sep 19 '19 at 14:48
  • As @COOLSerdash indicates, data don't have confidence intervals. The concept of confidence interval applies to some property of a hypothesized distribution from which the data were sampled. You need to specify the property of interest. How a CI is constructed depends on that property as well as on the possible underlying distribution. – whuber Sep 19 '19 at 14:49
  • Thank you for your advice. I think I can assume that my distribution has a single extremum (first monotonically increasing, then monotonically decreasing). Can I request a confidence interval that has the smallest possible length and is a single interval? I think those constraints are probably sufficient to make it unique – Aleksejs Fomins Sep 19 '19 at 14:59
  • @AleksejsFomins What population value do you want to measure? Mean? – Dave Sep 19 '19 at 15:02
  • @Dave I understand the question, I just don't understand why a specific measure is necessary to construct an interval. I want to get an interval as small as possible where "most" of my points lie. If it helps you to use the mean to construct such an interval around it, go ahead. Please excuse my persistence. I'm not an expert in building confidence intervals – Aleksejs Fomins Sep 19 '19 at 15:09
  • @Dave I think the 1-sigma interval is 68%. But really I want a function where I provide a dataset and a p-value, and get out two numbers $x_{\min}$ and $x_{\max}$ defining an interval. After reading the wiki article, I finally remember: confidence intervals are defined for parameters, such as the mean. As in, how likely is the mean of the distribution to lie within $[a,b]$. I'm sorry about confusing everybody. Is there a good name for an interval which contains future observations with a certain probability? – Aleksejs Fomins Sep 19 '19 at 15:24
  • I don't (yet) know how to solve this, but it looks like what you're looking to do is $ \underset{a,b}{\mathrm{argmax}} \left\{ \int_a^b f(x)dx \ge 0.68\right\} $ or something analogous for an empirical distribution. Is this right? (I think the empirical variant would be $ \underset{a,b}{\mathrm{argmax}} \left\{ \sum_{i=a}^b X_{(i)} \ge 0.68\right\} $ for the order statistics $X_{(i)}$.) – Dave Sep 19 '19 at 15:24
  • @Dave Yes, I want exactly this, subject to the constraint that |a-b| is as small as possible, and given that I don't know $f(x)$, but I only have data sampled from it. I have actually considered taking the middle point and adding the closest point to the set of chosen points until I reach the correct proportion. I am not 100% convinced that this would always work, so I wanted to see if there is already an established way to do this – Aleksejs Fomins Sep 19 '19 at 15:27
  • I've made a mistake. You want $ \underset{a,b}{\mathrm{argmin}} \left\{ | a-b | \;\bigg\vert\; \sum_{i=a}^b X_{(i)} \ge 0.68\right\} $ for the order statistics $X_{(i)}$. Okay, now this is phrased the right way, I think. And my previous comment should be argmin instead of argmax. – Dave Sep 19 '19 at 15:31
  • You are asking for a form of tolerance interval, q.v. – whuber Sep 19 '19 at 16:05

0 Answers