
Yesterday, I set up a topic outlining the problem I am currently working on.

After receiving many interesting responses, I added linear regression to my results, following this suggestion.

My research area consists of a river and two rows of wells, one closer to and one further away from the river, in a north-south direction.

In my research problem, I want to show using statistics that the river I am studying significantly affects the water quality of the row of wells closer to it (a process called bank filtration is at work). At the same time, the second row of wells is also influenced to some extent by the river, but not as much as the first row. To do this, I decided to collate laboratory physicochemical data (e.g. chlorides or sulphates in [mg/L]) from the river and the two well rows and compare them statistically.

As you can see in my earlier post, I decided to use statistical tests. Still, after the suggestion today, I also ran a linear regression for the three groups of results divided by location (the river and the two well rows).

Below I insert an example output for one group of parameters from the suggested method in Python:

"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               chloride   R-squared:                       0.487
Model:                            OLS   Adj. R-squared:                  0.480
Method:                 Least Squares   F-statistic:                     65.04
Date:                Tue, 22 Aug 2023   Prob (F-statistic):           1.38e-20
Time:                        20:36:58   Log-Likelihood:                -372.99
No. Observations:                 140   AIC:                             752.0
Df Residuals:                     137   BIC:                             760.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            14.6316      0.806     18.162      0.000      13.038      16.225
location[T.a]        -0.2420      0.900     -0.269      0.788      -2.021       1.537
location[T.b]         7.0957      0.964      7.361      0.000       5.189       9.002
==============================================================================
Omnibus:                       41.716   Durbin-Watson:                   1.416
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              121.766
Skew:                           1.122   Prob(JB):                     3.62e-27
Kurtosis:                       6.979   Cond. No.                         5.84
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
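For readers who want to reproduce a table of this shape, here is a minimal sketch using the statsmodels formula interface. The data are synthetic stand-ins, and the group sizes (20/70/50) and spreads are assumptions made up for illustration; only the column names `chloride` and `location` and the levels `a` and `b` are taken from the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the measurements from the question).
# Group sizes and spreads are hypothetical; only the total of 140 matches.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "location": ["river"] * 20 + ["a"] * 70 + ["b"] * 50,
    "chloride": np.concatenate([
        rng.normal(14.6, 3.0, 20),  # river
        rng.normal(14.4, 3.0, 70),  # row A
        rng.normal(21.7, 5.0, 50),  # row B
    ]),
})

# List "river" first so that it becomes the reference level of the factor.
df["location"] = pd.Categorical(df["location"], categories=["river", "a", "b"])

# One categorical predictor: the model estimates one mean per location.
fit = smf.ols("chloride ~ location", data=df).fit()
print(fit.summary())
```

With a single categorical predictor this is the same model as a one-way ANOVA, which is why the coefficients can be read as differences between group means.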

I want to ask a) whether these results seem to make sense and b) which of the results are worth including, e.g. in a table or in the supplementary material of the paper for which I am performing these analyses.

Do I correctly understand that in a nutshell:

  • The R-squared is 0.487, which means that the model explains about 48.7% of the variability in the data.

  • The F-statistic is 65.04, and the associated p-value is a very low 1.38e-20, suggesting that at least one of the model coefficients is significant.

  • Since the p-value for T.b is small (<0.05) and the one for T.a is large, the means of the river and of row A (the row closer to it) do not differ significantly, whereas the means of the river and of row B do.
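As a sanity check on the three readings above, the summary statistics can be recomputed from one another. This is a small sketch that uses only numbers copied from the output; the variable names are mine.

```python
# Values copied from the OLS summary in the question.
r2 = 0.487
n_obs = 140
df_model = 2    # two dummy variables (row A, row B)
df_resid = 137  # n_obs - df_model - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

# The overall F-statistic can be written in terms of R-squared.
f_stat = (r2 / df_model) / ((1 - r2) / df_resid)

# Each t value is simply the coefficient divided by its standard error.
t_a = -0.2420 / 0.900
t_b = 7.0957 / 0.964

print(f"adj. R2 = {adj_r2:.3f}, F = {f_stat:.2f}")
print(f"t(row A) = {t_a:.3f}, t(row B) = {t_b:.3f}")
```

The recomputed values agree with the reported ones up to the rounding of the inputs, so the table is internally consistent.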

Did I get it right? Also, which other results from this output are worth noting? Also, keep in mind that if I leave the OLS in the paper, I would need to do the same for 17 other groups, so I need to be concise with the number of results.

And the other question, if I may:

Do you think it's worth adding plots like this on top of the OLS results table?:

[Figure: strip chart of chloride (mg/L) by location: River, Row A, Row B]

I should mention that I have already prepared box plots representing the river and the wells, so I am not sure whether that would be too many figures.

crtnnn
  • Your comparison is just testing whether the means are different, but does that mean that the river is unrelated to the wells? – Sextus Empiricus Aug 26 '23 at 15:50
  • Also, what do the coefficients in your model mean? The intercept is the value of the river and the T.a is the difference with the row A? Why is the value of that coefficient negative when in the image the row A seems to have a higher value? – Sextus Empiricus Aug 26 '23 at 15:53
  • An alternative test could be to use a pairwise comparison, and look at the distribution of the difference. These values might be consistently positive, yet due to the large variation in your values this is not significant for the non-paired comparison. – Sextus Empiricus Aug 26 '23 at 15:59
  • Hello. "but does that mean that the river is unrelated to the wells?" - Hmm, I don't think I fully understand what you mean by "unrelated", but if I were to ask "Do the wells impact the river?", the answer would be "no". However, does the river impact the wells? Yes. The groundwater flows from the river to the wells, not the other way around. I don't know if that was your point. – crtnnn Aug 26 '23 at 16:21
  • "The intercept is the value of the river and the T.a is the difference with the row A?" - Yes. I think the plot is kinda misleading, because the mean of the group "Row A" in this example is actually smaller than the mean of the group "river". That's why the coefficient for row A is negative. The misleading plot might be due to the number of samples (observations), I guess. – crtnnn Aug 26 '23 at 16:23
  • "An alternative test could be to use a pairwise comparison, and look at the distribution of the difference." - I did this too. I ran a Kruskal-Wallis test with Dunn's post hoc test for the pairwise comparisons: river vs row A, river vs row B, row A vs row B. – crtnnn Aug 26 '23 at 16:24
  • With pairwise I do not mean the comparison of two values, but the comparison of the difference for each pair of observations within the sample. This will reduce the variance because it ignores the variance between the pairs. – Sextus Empiricus Aug 26 '23 at 16:52
  • What does a plot of river vs row look like? – Sextus Empiricus Aug 26 '23 at 17:07
  • As an alternative to running multiple hypothesis tests on 20 wells & assuming you have geographic coordinates for all locations, you could fit a model for Chloride (mg/L) with a smooth nonlinear function of distance from river. The fitted curve will then illustrate the relationship between distance and chloride much more intuitively than any number of pairwise comparisons. – dipetkov Aug 27 '23 at 09:05
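The paired comparison described in the comments above (differences within each pair of observations, rather than differences between group means) can be sketched as follows. The thread does not say whether river and well samples can actually be matched, so pairing by sampling date is an assumption here, and all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30  # hypothetical number of shared sampling dates

# Hypothetical setup: river and row A are sampled on the same dates and share
# a large date-to-date component, plus a small constant offset of 1 mg/L.
seasonal = rng.normal(0.0, 5.0, n)
river = 15.0 + seasonal + rng.normal(0.0, 0.5, n)
row_a = 16.0 + seasonal + rng.normal(0.0, 0.5, n)

# Unpaired (Welch) t statistic: the shared variation inflates the standard error.
se_unpaired = np.sqrt(river.var(ddof=1) / n + row_a.var(ddof=1) / n)
t_unpaired = (row_a.mean() - river.mean()) / se_unpaired

# Paired t statistic: differencing removes the component shared within pairs.
diff = row_a - river
t_paired = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

Both statistics estimate the same mean difference; only the standard error changes, which is the variance reduction the comment refers to.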

2 Answers


Your first two bullet points are exactly right.

Your third isn't wrong, but, instead, I would look at the parameter estimates. Row A's is very close to 0 and is actually negative (less chloride than the river) but not close to sign Row B, though, is high, and sig.

I like lots of plots, but that may depend on the editor of the journal you submit to. I prefer your plot to a box plot, but ... You can overlay a box on top of the strip plot that you have. I know how to do this in R or SAS but I am sure it is possible in Python.

Peter Flom
  • Thanks. I think that I will put the strip plots in the supplementary materials. "but not close to sign Row B, though, is high, and sig." - I feel like your sentence got cut off there? Can you explain why you would look at the parameter estimates instead? If I understand correctly, the lower the "t-value", the closer the results from one group are to the reference group (the river in this case), which is "good" because that's what I'm trying to prove here...? – crtnnn Aug 23 '23 at 14:36
  • I look at the parameter estimates because they tell you what you really want to know, which is NOT whether a difference is significant, but, rather, how big it is (and in which direction).

    You aren't trying to PROVE things (statistics won't do that), you are showing evidence for something. So, you can say "this parameter is big, and positive; that parameter is very close to 0; this is just what I thought would happen."

    – Peter Flom Aug 23 '23 at 14:43
  • Thank you for the clarification, and I am sorry for the late response. So, did I get it right that in the OLS results table in the paper, for every studied parameter, you would suggest putting: a) Adj. R2, b) F-statistic, c) Coefficient (the parameter estimate, am I right?) and d) p-value (P>|t|)? Also, I wanted to ask: I have 3 location groups, but there are two "location" outputs, namely "location[T.a]" and "location[T.b]". Is, let's call it "location[river]" presented as an "Intercept" in this output? So in other words the river is treated as a reference group? Thanks. – crtnnn Aug 25 '23 at 12:17
  • Yes, river is the reference group. – Peter Flom Aug 25 '23 at 12:22
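What "reference group" means can be checked by hand. The sketch below builds the dummy-coded design matrix explicitly with NumPy on a tiny made-up data set (the values are purely illustrative) and shows that the intercept is the mean of the reference group while the other coefficients are differences from it.

```python
import numpy as np

# Tiny made-up samples, for illustration only.
river = np.array([14.0, 15.0, 16.0])
row_a = np.array([13.0, 14.0, 15.0, 16.0])
row_b = np.array([20.0, 22.0, 24.0])

y = np.concatenate([river, row_a, row_b])

# Treatment (dummy) coding: a column of ones plus one indicator per
# non-reference group. The river gets no indicator of its own.
is_a = np.concatenate([np.zeros(3), np.ones(4), np.zeros(3)])
is_b = np.concatenate([np.zeros(3), np.zeros(4), np.ones(3)])
X = np.column_stack([np.ones_like(y), is_a, is_b])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_a, coef_b = beta

print(f"intercept = {intercept:.2f} (river mean = {river.mean():.2f})")
print(f"coef_a = {coef_a:.2f} (row A mean - river mean = {row_a.mean() - river.mean():.2f})")
print(f"coef_b = {coef_b:.2f} (row B mean - river mean = {row_b.mean() - river.mean():.2f})")
```

Read the same way, the output in the question implies group means of about 14.63 mg/L for the river, 14.63 - 0.24 = 14.39 mg/L for row A, and 14.63 + 7.10 = 21.73 mg/L for row B.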

The other answer by @peter-flom has already addressed the part of the question that pertains to the interpretation of the OLS regression model, so I will go more into the last part – whether the strip chart is a useful visualization here (after all, I was the one to suggest this in a comment to the earlier question).

I think the strip charts work reasonably well for your data. It's pretty easy to see, for example, that the overall range of values for "Row A" isn't too different from "River", and also that the number of observations in this group is notably larger than the number of observations in the first group. "Row B" obviously has a larger variance, and its mean is also much higher. Yet, you mention that you have 17 other groups, and it may turn out that the strip chart won't work as well for these.

But there's still room for some improvement even for this strip chart. The most glaring issue is that there is quite some overplotting happening: many of the dots in "Row A" are printed on top of or very close to other dots from this category. This makes it difficult to assess the overall distribution of values especially within this category (for instance, comparing the number of observations between 11 and 14 to the number of observations between 15 and 18 is basically impossible).

Spreading the observations out more is one useful way of reducing overlap, so I'd suggest increasing the amount of jitter – this will also make sure that the available whitespace (in particular, the currently empty space between the categories) is used more economically.

You may also want to experiment with the size of the dots as well as with the degree of transparency. Smaller dots will result in less overplotting, and even more transparent colors will emphasize the value range in which there are many observations.

Compared to box plots, one disadvantage of strip charts is that while they can be used to faithfully visualize all observations in small data sets like the one you're dealing with (N=140), they don't show any centrality measures (means, medians) or dispersion measures (SD, IQR). In other words, they don't attempt to represent an abstraction of the underlying distributions of each category.

However, you can easily mitigate this disadvantage by plotting these measures on top of the strip plot. As suggested in @peter-flom's answer, you could simply plot both the strip chart and the corresponding box plot on top of each other (if you use seaborn in Python, there's an example in the gallery that does exactly that). If the resulting combination of box plot and strip chart looks overly cluttered, you could manually add a thick line for the median, and thinner lines for the 1st and 3rd quartiles, to your strip charts.
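The tweaks discussed above (more jitter, smaller and more transparent dots, and lines for the median and quartiles) could look roughly like the following. This is a sketch in plain matplotlib rather than seaborn, on synthetic data; the group sizes, spreads, jitter width, dot size and alpha are all assumptions to be tuned by eye.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in data; sizes and spreads are hypothetical.
groups = {
    "River": rng.normal(14.6, 3.0, 20),
    "Row A": rng.normal(14.4, 2.0, 70),
    "Row B": rng.normal(21.7, 5.0, 50),
}

fig, ax = plt.subplots(figsize=(6, 4))
for pos, (label, values) in enumerate(groups.items()):
    # Wide horizontal jitter uses the whitespace between categories.
    x = pos + rng.uniform(-0.3, 0.3, size=values.size)
    # Small, semi-transparent dots reduce overplotting.
    ax.scatter(x, values, s=12, alpha=0.4, color="tab:blue")
    # Thick line for the median, thinner lines for the 1st and 3rd quartiles.
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    ax.hlines(median, pos - 0.35, pos + 0.35, colors="black", linewidths=2.5)
    ax.hlines([q1, q3], pos - 0.35, pos + 0.35, colors="black", linewidths=1.0)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Chloride (mg/L)")
fig.tight_layout()
```

Calling `fig.savefig("strip_chart.png", dpi=150)` afterwards would write the figure to disk (the file name is just an example).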

So, to wrap up: The strip chart appears to be appropriate for this data set (but it may not be optimal for the other groups that you work with). It can be tweaked so that it nicely visualizes the location of all 140 observations, something that will get lost in abstraction when using a box plot. In my opinion, if you add lines to show the median and the 1st and 3rd quartile, the strip chart will become superior to a box plot, as the latter can add only little information that wouldn't be conveyed by the strip chart as well. Combining the box plot and the strip chart into a single figure may also be an option, but I can easily imagine it to be too cluttered.

What I wouldn't recommend, though, is having the box plot in the main text, and the strip chart in the supplementary material – there is too much redundancy between the two types of visualizations to warrant this, especially since there are ways to produce a single, superior visualization.

Schmuddi
  • +1 You make a comment about the variance, but I think the variance situation may be more complex. I would say that row A has lower variance than River and row B. That seems unexpected, since if we had River and B only we might be asking whether the variance increases with the mean Chloride (mg/L). In any case, the OLS variance assumption seems violated. The question is -- does it matter much? – dipetkov Aug 25 '23 at 16:09
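One way to get a feel for whether unequal variances matter is to compare the pooled standard error (which assumes equal variances, as nonrobust OLS does) with the Welch standard error for a single difference of means. The sketch below is simplified to two groups and uses synthetic data with assumed sizes and spreads; the OLS in the question pools the variance over all three groups.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in data: a small, tighter group and a larger, noisier one.
river = rng.normal(14.6, 3.0, 20)
row_b = rng.normal(21.7, 6.0, 50)

n1, n2 = river.size, row_b.size
v1, v2 = river.var(ddof=1), row_b.var(ddof=1)

# Pooled standard error: assumes both groups share one error variance.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Welch standard error: lets each group keep its own variance.
se_welch = np.sqrt(v1 / n1 + v2 / n2)

print(f"variances: river = {v1:.1f}, row B = {v2:.1f}")
print(f"SE of the difference: pooled = {se_pooled:.2f}, Welch = {se_welch:.2f}")
```

If the two standard errors are close, the violation has little practical effect on that comparison. If they differ noticeably, statsmodels can report heteroscedasticity-robust standard errors through the `cov_type` argument of `fit`, for example `cov_type="HC3"`.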