3

I collected disease severity data, expressed as proportions between 0 and 1, on plants in field trials conducted across different years. Additionally, I gathered weather data, representing average values over the duration of the field trials in each year.

For relative humidity (and temperature, rainfall, wind speed), I aim to determine whether the minimum, maximum, or mean relative humidity exhibits the highest correlation with disease_severity. Could you advise on the recommended correlation method to use, considering that my response variable, 'disease_severity,' is not normally distributed? Moreover, the relationship between weather variables and disease_severity is non-linear.

Here is the scatter plot for relative humidity:

[Image: scatter plot for relative humidity]

Ahsk
  • 405
  • 4
    You should clarify the purpose of your analysis; it's presently unclear what question "correlation" would be a good answer to. The standard error of your 'disease severity' values depends on the number of plants (BTW it's unclear how you're computing this value, please explain it) as well as on the underlying severity these are estimates of (it's bounded), so just using the raw 'disease severity' in a correlation without knowing at least the relative number of plants they're based on could be problematic. If you have covariates, raw bivariate correlation of any sort doesn't seem useful. – Glen_b Dec 12 '23 at 00:59
  • 1
    Okay, thanks for the clarification on the percentages. Why are the dots in the diagram different sizes (blue dots mostly having about 4 times the area of red dots, green in between)? What does size represent? – Glen_b Dec 13 '23 at 03:38
  • I counted pixels. They're not remotely close to the same size. When I said blue is 4 times the area of red I wasn't making it up. I have numbers! If anything 4 appears to be a slight underestimate – Glen_b Dec 13 '23 at 22:36
  • Eyeballing this graph suggests that the correlations are all fairly weak anyway, and for that and other reasons the question of which correlation to use is likely to be incidental. Your tag and text in earlier versions of your question imply an interest in generalized additive models, meaning that you intend to look at several predictors and will do so in a way that doesn't depend on knowing whether individual relationships are linear, monotonic, more complicated or weakly defined. – Nick Cox Jan 23 '24 at 11:08
  • Never display summary statistics to a statistician. The statistician knows how to summarize the data herself...

    Show us the graph of raw data (no max/min). In general, Spearman's Rho is a safe start.

    – stans Jan 23 '24 at 11:11
  • @stans That maxim may be good in general, but it likely doesn't fit the case at all here. Relative humidity varies on just about any time scale you like, from daily upwards. Underlying the graph you see would be many, many values of relative humidity that would make a graph with all available data useless. A reduction of those data to summaries is, I guess, as someone with moderate experience of climatological and ecological data, essential to make any progress at all. – Nick Cox Jan 23 '24 at 11:21
  • @NickCox The correlation analysis was just to shortlist variables that show the highest association with the outcome variable. I can't include min, max and mean RH in GAMs due to multicollinearity. I just need one (mean, max or min) showing the highest association with the disease. – Ahsk Jan 23 '24 at 15:52
  • Pearson measures collinearity. I would try all 3 in turn. Pre-screening could answer the wrong question. My guess would be that min works best, but nothing beats trying it out. – Nick Cox Jan 23 '24 at 16:38
  • So you recommend Pearson, is that right? Min does show a higher association/correlation coefficient value than mean and max. – Ahsk Jan 23 '24 at 16:47
  • Correct on which correlation. – Nick Cox Jan 23 '24 at 22:11
  • Note as before that which of three variables -- min, mean, max RH -- looks best when examined with the outcome is not guaranteed to be best within a GAM with other predictors. – Nick Cox Jan 23 '24 at 23:58

1 Answer

10

You wrote

my response variable, 'disease_severity,' is non-linear.

A variable cannot be linear or non-linear. That is a characteristic of relationships between variables.

The Spearman method seems right, but the disease severity was the average of 10 plants, so I am not certain it's actually ranked data.

Spearman does not require ranked data. It ranks whatever data you give it.
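To make that concrete, here is a minimal sketch in plain Python showing that Spearman's rho is just Pearson's r computed on the ranks of the raw values. The helper names (`average_ranks`, `pearson`, `spearman`) and the numbers are made up for illustration; they are not the poster's data, and in practice a statistics library would do this for you.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: rank each variable, then apply Pearson to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: averaged proportions are fine as input, no pre-ranking needed.
min_rh = [41.0, 48.5, 52.0, 55.5, 60.0, 63.5, 70.0, 74.5]
severity = [0.02, 0.03, 0.03, 0.05, 0.10, 0.22, 0.45, 0.80]

rho = spearman(min_rh, severity)
r = pearson(min_rh, severity)
print("Spearman:", round(rho, 3), "Pearson:", round(r, 3))
```

Because only the ordering matters, a monotonic but curved relationship like this toy one gives a Spearman value closer to 1 than the Pearson value.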

The internet suggests that Pearson is only suitable for normally distributed data.

The internet is wrong a lot. Pearson does not require normally distributed variables.

What I would do is make scatter plots and then base my choice on those, combined with what my exact goals were and which correlation better met those goals. For example, if there are outliers, do you want them to be de-emphasized? Then Spearman. But if they are very important, Pearson might be better.
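A small sketch of the outlier point, assuming made-up numbers and hypothetical helper names: one extreme observation is appended to a weak, noisy pattern, and both coefficients are computed with and without it. The simple `ranks` helper assumes no tied values, which holds for this toy data only.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; assumes no ties (true for the toy data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman's rho as Pearson applied to ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: seven points with little pattern...
rh = [50.0, 52.0, 54.0, 56.0, 58.0, 60.0, 62.0]
sev = [0.12, 0.08, 0.15, 0.10, 0.14, 0.09, 0.13]

# ...plus one extreme observation, assumed here to be genuine rather than an error.
rh_out = rh + [95.0]
sev_out = sev + [0.90]

pearson_shift = pearson(rh_out, sev_out) - pearson(rh, sev)
spearman_shift = spearman(rh_out, sev_out) - spearman(rh, sev)

for label, x, y in [("without outlier", rh, sev), ("with outlier", rh_out, sev_out)]:
    print(label, "| Pearson:", round(pearson(x, y), 2), "| Spearman:", round(spearman(x, y), 2))
```

In this toy case the single extreme point moves Pearson much more than Spearman, because Spearman sees it only as "the largest value" while Pearson uses its full distance from the rest. Which behaviour is preferable depends on whether that point should carry that much weight.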

Peter Flom
  • 119,535
  • 36
  • 175
  • 383
  • 3
  • 2
    If your "outlier" is a data error, then it should be removed before analysis. If your "outlier" is a bona fide data point, then including it does not introduce "false" correlations. If you believe that correlations calculated with these data points are misleading, then you presumably do not want to use linear (Pearson) correlations. Best to first think about what your research question is, then decide on your analysis strategy, then collect data, then perform the analysis. – Stephan Kolassa Dec 12 '23 at 14:27
  • 1
    If outliers are present, and indeed if some kinds of curvature or heteroscedasticity are present, you may be better off transforming first and then calculating correlations afterwards. – Nick Cox Jan 17 '24 at 15:29
  • I would replace “what my exact goals were” with “my goals” — because the emphatic language doesn’t help and because imprecise goals are often appropriate. (The problem is that the OP doesn’t describe their ultimate goal of how they intend to use the correlations, at all.) – Matt F. Jan 23 '24 at 11:44
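On the suggestion in the comments to transform first and then correlate, here is a hedged sketch. The comment does not name a transformation; the logit used here is an assumption, chosen only because the response is a proportion, and it is undefined for values of exactly 0 or 1. The data and helper names are invented for illustration.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def logit(p):
    """Log-odds of a proportion; only defined for 0 < p < 1."""
    return math.log(p / (1 - p))


# Hypothetical proportions that rise slowly at first and then steeply.
min_rh = [41.0, 48.5, 52.0, 55.5, 60.0, 63.5, 70.0, 74.5]
severity = [0.02, 0.03, 0.04, 0.05, 0.10, 0.22, 0.45, 0.80]

r_raw = pearson(min_rh, severity)
r_logit = pearson(min_rh, [logit(p) for p in severity])
print("Pearson on raw proportions:", round(r_raw, 3))
print("Pearson after logit transform:", round(r_logit, 3))
```

For this toy curve the transform straightens the relationship, so Pearson rises. A rank-based coefficient such as Spearman would be unchanged by any strictly increasing transformation, which is one reason the choice of coefficient and the choice of scale are worth considering together.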