
I have four independent trials of a random variable. I have run a Shapiro-Wilk test on it and it is not normally distributed.

I have two different images of a peak. The background is flat with a little bit of noise but no large trends. I am subtracting the two peaks and seeing if their difference is significant or if they are the same.

What is the best way to quantify the statistical significance of the difference region?

EDIT: I have clarified my problem. I am trying to see if the region of interest between the dashed lines (where my two peaks were) is significantly different from the rest of the image.

My plots

zaphod
  • "I want to know if my peak above the threshold area in Run 4 is just due to noise or if it is a signal." The problem here is what exactly you mean by "noise", and what constitutes a "signal" in your application. Some use the term "noise" for what is normally distributed, in which case your data wouldn't be "noise" anyway (in fact, peaks could lead to rejection of normality). The thing is that a peak "stands out" relative to a model assumption such as normality, so such an assumption needs to be provided. – Christian Hennig Jul 08 '23 at 13:57
  • Also, observations within runs don't look independent, although I'm not sure, as there's no information about potential smoothing. If they are dependent, a model for the dependence would be needed. – Christian Hennig Jul 08 '23 at 14:00
  • @ChristianHennig I do not know if there is an alternative way of devising a hypothesis.

    I am considering making a histogram of the noise outside of the region of interest and doing a chi-squared test with the noise/signal inside the region to find the goodness of fit. Would this work?

    – zaphod Jul 09 '23 at 05:04
  • The goodness of fit of what? What is your hypothesis there? – Christian Hennig Jul 09 '23 at 08:35
  • So I want to check if the data within the region of interest has the same distribution as the data outside of it. My hypothesis is that if the difference between the two peaks is significant then the data in the region of interest (where my peaks were) will have a different distribution to that of the background. – zaphod Jul 09 '23 at 20:52
  • Not sure I understand this. Am I right that the peaks are in one region of the data space ("region of interest") and the "background" in another? But then any distribution of the "background" can be combined with any distribution in the "region of interest" to form a joint distribution. The background cannot tell you what data outside the background should look like. Apart from the dependence issue, that is. – Christian Hennig Jul 09 '23 at 21:39
  • Yes, the peaks are in a smaller region of the whole image. By background, I mean the area outside of the peak region. I was thinking of comparing the distribution of the background to the distribution of the region of interest to see if there is a significant difference. My thought process is that the region of interest would have the same distribution as the background if the difference between the two peaks is not significant. Does this make sense or am I making poor assumptions? – zaphod Jul 10 '23 at 01:24
  • The peaks are extreme and the background isn't. They are in different regions of the data space, so they can't have the same distribution. That would mean that they can take the same values with the same probability, which obviously can't be the case if they take systematically different values. (What you mean is probably that the overall distribution is the same, but as written before, you can put together any two distributions on disjoint domains and get a legitimate distribution on the overall space.) – Christian Hennig Jul 10 '23 at 10:41

1 Answer


Welcome to the site.

A "peak" does not have a p-value. To get a p-value, you need a hypothesis and a hypothesis test. You haven't stated one, and I don't think you actually need one here. What you need to do is define "peak" more precisely (ideally, you would have done this beforehand; you should not let the actual results affect your definition). So, for instance, you might say:

A peak is a value that is higher than both the preceding and the succeeding value by at least XXX.

I am not saying this is what you should use, it's just an example of the sort of thing I mean.

Then you can say whether or not you have peaks, and p values never come into it.
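To make the example definition concrete, here is a minimal sketch in Python. It assumes the data are a one-dimensional sequence of values; the function name `find_peaks`, the parameter `min_rise` (standing in for XXX) and the numbers in `profile` are all made up for illustration, and the threshold itself still has to be chosen before looking at the results.

```python
def find_peaks(values, min_rise):
    """Return the indices whose value exceeds BOTH neighbours by at least min_rise.

    min_rise is a hypothetical name for the XXX threshold in the definition above.
    The first and last points are skipped because they have only one neighbour.
    """
    peaks = []
    for i in range(1, len(values) - 1):
        rise_from_previous = values[i] - values[i - 1]
        rise_over_next = values[i] - values[i + 1]
        if rise_from_previous >= min_rise and rise_over_next >= min_rise:
            peaks.append(i)
    return peaks


# Made-up difference profile, for illustration only (not the asker's data).
profile = [0.1, -0.2, 0.0, 2.5, 0.3, 0.1, 0.4, 0.2]

# With a threshold of 1.0, only the large excursion at index 3 qualifies.
print(find_peaks(profile, min_rise=1.0))
```

Note that the answer depends entirely on the chosen threshold: a smaller `min_rise` would also flag the small bump near the end of `profile`, which is why the definition should be fixed in advance.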

Peter Flom
  • Thank you. I am asking about p-values because I want to know if my peak above the threshold area in Run 4 is just due to noise or if it is a signal. I am not sure how to 'combine' it with the other Runs where there isn't a big peak. This is why I want to ask about p values. – zaphod Jul 08 '23 at 13:47
  • P values won't help you answer that question. – Peter Flom Jul 08 '23 at 13:53
  • Would comparing the distribution of the noise outside the region of interest (I am only concerned with the region in between the dashed lines) and the noise within the region with a chi-squared test work? – zaphod Jul 09 '23 at 05:09
  • If you want to test that, you need to define the process that is generating the "signal". – Peter Flom Jul 09 '23 at 11:47