Calculating the probability my observation, $Y_i$, is drawn from a random variable $X$?

Question

If I sample a population distribution 2,000 times and get an estimator for the population mean, $\mu$, and the standard deviation, $\sigma$, how can I use these to get the probability that an observation is part of the population distribution?

Mathematically, say I sample the population distribution and get estimators for the mean and standard deviation. I assume my population is distributed:

$$ X \sim \mathcal{N}(\mu, \sigma) $$

where $X$ is a normally distributed random variable, $\mathcal{N}$ is a normal distribution, $\mu$ is an estimator for the population mean, and $\sigma$ is an estimator for the population standard deviation.

How can I use this distribution to test the probability that an observation, $Y_i$, has been drawn from the population distribution?

This sounds like part of a simple hypothesis test. Compute $$z = \frac{Y_i - \mu}{\sigma}$$ and then, to get the probability density of the standardized value $z$ under $\mathcal N(0,1)$, evaluate $f(z | 0,1)$, where $f$ is the standard normal pdf. Alternatively, for a more standard approach, see Student’s $t$-test. — mhdadk, Mar 30 '23 at 11:11
You seem to use $\sigma$ for both the population distribution and the sampling distribution of the mean here, which is confusing. — Christian Hennig, Mar 30 '23 at 11:31
@mhdadk Why can't I pull the probability from the $z$ I calculate in the first equation? Is the student's t-test the usual way something like this is calculated? — Connor, Mar 30 '23 at 11:36
@ChristianHennig Okay, so here I should use something like $\sigma_{\bar X}$? — Connor, Mar 30 '23 at 11:37
@Tim $Y_i$ is the same type of variable. Specifically, I'm thinking of correlations. So $\bar X$ is drawn from a background population of correlations and $Y_i$ is a potentially related correlation. I tried posting a long question on precisely what I mean, but it got no traction! If you'd like to know exactly what I'm getting at, please see: https://stats.stackexchange.com/questions/611190/can-you-use-a-z-test-with-a-sample-size-of-1 — Connor, Mar 30 '23 at 11:40
But you are talking about the sampling distribution of the mean while $Y_i$ seems to be a single sample here, or is it a mean? — Tim, Mar 30 '23 at 11:41
$Y_i$ is a single sample. Should I change the notation to make that clearer? — Connor, Mar 30 '23 at 11:41
Why would you want to test if a single observation comes from the same distribution as the distribution of the mean of many samples? It's like asking if orange is the same as orange juice. — Tim, Mar 30 '23 at 11:47
@Connor I don't mind what you use, but use different notation for different things, and explain it clearly. In standard theory, by the way, the sd of the sampling distribution of the mean is $\sigma/\sqrt{n}$ if $\sigma$ is the sd of the underlying population. Also, if $\bar X$ is meant to be a mean of observations called $Y_i$, chances are $\bar Y$ would be a more intuitive notation. — Christian Hennig, Mar 30 '23 at 11:53
@Tim I don't know if $Y_i$ is from the same distribution. It could be, I want to test that. That's why I have it as $Y_i$ not $X_i$. I'll update the question to make that clearer. — Connor, Mar 30 '23 at 11:57
Could you edit your question to explain what exactly does each symbol mean and how its value was obtained? Why do you want to calculate the probability? What do you need it for? — Tim, Mar 30 '23 at 16:49
@Tim I've edited the question to explain all the symbols. I can re-explain everything in the question I've posted above. But it seems to me that what I'm asking about is fairly common. Wanting to know if an observation is in a distribution or not is pretty fundamental. If I'm assuming that it's normal and I know the mean and the variance, does it really matter why I want to know? I apologise if my poor use of concepts and terminology is obscuring my question. — Connor, Mar 30 '23 at 19:06
Do you mean $P(Y_i\sim N(\hat\mu,\hat\sigma)\vert Y_i = y)?$ // Yes, it matters why you want to know. Many questions about statistics (among other fields) are asked because someone is trying to solve a problem, develops an idea for a solution, gets confused about that idea, and asks about that confusion, rather than asking about the original problem for which the idea might be a poor approach. This results in the original problem not being solved. This is often called an XY Problem. — Dave, Mar 30 '23 at 19:43
@Dave Yes, if I'm reading what you wrote correctly, that is exactly what I mean! I get your point, but I've been extremely clear on the underlying problem I'm having in another question: https://stats.stackexchange.com/questions/611190/can-you-use-a-z-test-with-a-sample-size-of-1, and that didn't help me get an answer at all! The reason I'm hesitant to explain is that there's a lot of background and it seems to overload the question. I don't know the best way forward. If you happen to read the question I posted, could you let me know how much detail needs porting over to this question? — Connor, Mar 30 '23 at 19:58
In order to give a value to the conditional probability I wrote above, there should be a probability space. Do you have a sense of what that should be? Note that the probability space is a sigma algebra over the set of Gaussian distributions (or maybe more than just Gaussians), rather than being over the real numbers, since the event in question is a random variable being distributed a certain way, rather than a random variable equaling a certain value. (If you want to think of $Y_i = N(\hat\mu,\hat\sigma)$, that is an abuse of notation but might convey the idea of the event in question.) — Dave, Mar 30 '23 at 20:02
@Dave I believe that distribution will be a Gaussian as well. But do I need to know the precise distribution of $Y_i$ to say with certainty that it's not the same as $\mathcal{N}(\hat \mu,\hat \sigma)$? (BTW: I'm taking sigma algebra here to mean, "some distribution with assigned probability". From wikipedia's third paragraph: https://en.wikipedia.org/wiki/%CE%A3-algebra) — Connor, Mar 30 '23 at 20:09
@Dave Isn't your statement of the problem perfect for a Z-Test? Isn't that essentially what a Z-Test is? — Connor, Mar 30 '23 at 20:13
@Dave Yes, the p-value of a z-test, is that a reasonable way of calculating this? — Connor, Mar 30 '23 at 20:14
Unless you put some serious assumptions on your probability space, assumptions that it is not a given that you want to make, that conditional probability above is likely zero. Likewise, in a z-test, the probability of getting the exact z-stat you get is zero, but there is that "or more extreme" part of the p-value that has you integrate the tail(s) to calculate the probability of a continuum of values instead of just the one value. — Dave, Mar 30 '23 at 20:17
@Dave Yes, that's exactly what I'm talking about. Getting the probability of a value at least as extreme as the observation, and using significance to reject or accept the null hypothesis that $Y_i$ is in the distribution $\mathcal{N}(\hat \mu, \hat \sigma)$. Is using the z-test in this fashion reasonable? — Connor, Mar 30 '23 at 20:19
It would help immensely if you wrote the exact conditional probability you want to calculate, since it seems not to be the one I wrote a few comments ago. Please edit that into your original post and not just write it in the comments, as comments often get overlooked. — Dave, Mar 30 '23 at 20:27
@Dave Okay, thank you for all of your input, it's massively helpful. Is this the correct way to state the problem: $$ P(Y_i = y | Y_i \sim \mathcal{N}(\hat \mu, \hat \sigma))$$ Also, it would really help me if you could say if the z-test is useful here. — Connor, Mar 30 '23 at 20:43
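To make the point from the comments above concrete (the probability of one exact value is zero, while a tail probability is not), here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The value `z = 2.4` and the helper `interval_prob` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, N(0, 1)
z = 2.4  # hypothetical standardized observation

# The pdf gives the height of the curve at z, not a probability.
density = std_normal.pdf(z)


def interval_prob(eps):
    """Probability that a standard normal draw lands within eps of z."""
    return std_normal.cdf(z + eps) - std_normal.cdf(z - eps)


# As the interval around z shrinks, its probability shrinks toward zero.
widths = [0.1, 0.01, 0.001]
probs = [interval_prob(eps) for eps in widths]

# "At least as extreme" in both directions: the two-sided z-test p-value.
two_sided_p = 2 * (1 - std_normal.cdf(abs(z)))

for eps, p in zip(widths, probs):
    print(f"P(|Z - z| <= {eps}) = {p:.6f}")
print(f"density at z = {density:.4f}")
print(f"two-sided tail probability = {two_sided_p:.4f}")
```

The interval probabilities are roughly `2 * eps * density`, so they vanish as `eps` goes to zero, whereas the tail probability stays fixed. That tail probability is computed assuming the observation came from the fitted normal; it is not the probability that the observation belongs to that distribution.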

Tim · Accepted Answer · 2023-03-30T22:08:51.300

I assume my population is distributed:

$$ X \sim \mathcal{N}(\mu, \sigma) $$

where $X$ is a normally distributed random variable, $\mathcal{N}$ is a normal distribution, $\mu$ is an estimator for the population mean, and $\sigma$ is an estimator for the population standard deviation.

This statement does not make sense. Either $\mu$ and $\sigma$ are the parameters of the population, or they are estimates of the parameters. They cannot be both.

How can I use this distribution to test the probability that an observation, $Y_i$, has been drawn from the population distribution?

Assuming that you are asking what you wrote here, you want the probability associated with $Y_i$ under the $\mathcal{N}(\mu, \sigma)$ distribution. If that is the case, plug $Y_i$ into the Gaussian cumulative distribution function parametrized by $\mu$ and $\sigma$ and read off the probability it returns, that is, $P(X \le Y_i)$. There is nothing more to it, if this is actually what you mean.
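As a minimal sketch of that lookup, assuming Python's standard-library `statistics.NormalDist`: the numbers `mu_hat`, `sigma_hat`, and `y` below are hypothetical placeholders for the two estimates and the single observation.

```python
from statistics import NormalDist

# Hypothetical values; in practice mu_hat and sigma_hat would be the
# estimates obtained from the 2,000 samples, and y the observation Y_i.
mu_hat = 0.10
sigma_hat = 0.05
y = 0.22

fitted = NormalDist(mu=mu_hat, sigma=sigma_hat)

# Value returned by the Gaussian CDF: P(X <= y) under N(mu_hat, sigma_hat).
lower_tail = fitted.cdf(y)

# Complement: P(X > y) under the same distribution.
upper_tail = 1 - lower_tail

print(f"P(X <= y) = {lower_tail:.4f}")
print(f"P(X >  y) = {upper_tail:.4f}")
```

Both numbers are probabilities about $X$ computed under the assumed distribution; neither is the probability that $Y_i$ was drawn from it.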

Something like

$$ P(Y_i = y | Y_i \sim \mathcal{N}(\hat \mu, \hat \sigma))$$

(in the comments) does not make sense; it is like asking "is the color red if we know that the color is red". You cannot have a conditional distribution conditioned on the distribution itself. The only way to read this notation would be as $P(Y_i|Y_i)$, which is a tautology.

edited Mar 30 '23 at 22:08

answered Mar 30 '23 at 22:01

Thank you! What's the confusion with my notation? I've re-read the question and it seems clear to me that $\mu$ and $\sigma$ are estimators (or estimates, I'm using estimators as I read today that is a type of statistic). – Connor Mar 30 '23 at 22:14
What should the notation be? $$ P(y | Y \sim \mathcal{N}(\hat \mu, \hat \sigma), y \in Y)$$ Or is this hopelessly wrong too! – Connor Mar 30 '23 at 22:15
@Connor you cannot condition a random variable on itself. What is confusing about your notation is that the parameters of the population and their estimates are different things. The parameters of the population are some possibly unknown values; estimates are computed from the samples. It's like the winning lottery numbers versus your guess about the winning numbers: if they were the same, you'd be a lucky man. – Tim Mar 30 '23 at 22:28
I guess true to some extent. But I could guess the average of winning lottery numbers over a few thousand samples right? If that's all I care about, what's the issue? – Connor Mar 30 '23 at 22:49
@Connor it's completely untrue. I'd recommend you review basic probability and statistics concepts as you seem to be confusing many of them and without understanding the basics it may be impossible to move forward. – Tim Mar 31 '23 at 05:45
How is it completely untrue? If I pull numbers from 1-100 enough times, will they not have a mean and standard deviation that stabilises? – Connor Mar 31 '23 at 05:59
@Connor if you have a very detailed photograph of a thing is it the same as the thing itself? – Tim Mar 31 '23 at 08:10
No, but I'm not suggesting it is, just that it's a guess that improves the more samples you draw. As far as I'm aware the mean of a discrete random variable is knowable! I think we're going off into the weeds here though! Is this the correct notation: $$ P ( Y_i < y | X \sim \mathcal{N}(\hat \mu, \hat \sigma)) $$ Using the answer from this question: https://stats.stackexchange.com/questions/110194/what-do-vertical-bars-mean-in-statistical-distributions#:~:text=The%20vertical%20bar%20is%20often,it%20as%20'conditional%20on'. (I read $|$ as "given that", which explains the confusion ) – Connor Mar 31 '23 at 08:13
@Connor It might help to say in words what you want to know. If you can say why you want to know it, that will help, too, since this appears to be a major XY problem. – Dave Mar 31 '23 at 17:35
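The distinction argued over in this thread, population parameters versus their estimates, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" values `mu` and `sigma` are hypothetical (in a real problem they are unknown), and the population is assumed to be normal.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters; unknown in a real problem.
mu, sigma = 0.0, 1.0
n = 2000

# Draw n observations from the assumed normal population.
sample = [random.gauss(mu, sigma) for _ in range(n)]

# Estimates computed from the sample: close to, but not equal to,
# the parameters.
mu_hat = fmean(sample)
sigma_hat = stdev(sample)  # sample sd, n - 1 denominator

# sd of the sampling distribution of the mean, sigma / sqrt(n).
# This is far smaller than sigma, the sd of a single observation.
se_mean = sigma / sqrt(n)

print(f"mu = {mu}, mu_hat = {mu_hat:.4f}")
print(f"sigma = {sigma}, sigma_hat = {sigma_hat:.4f}")
print(f"sd of the sample mean = {se_mean:.4f}")
```

With 2,000 draws the estimates settle near the parameters, which is the stabilisation described in the comments, but they remain estimates. A single observation should be compared against the spread `sigma` of the population, not against the much narrower `sigma / sqrt(n)` of the mean.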
