1

I analyse a set of physicochemical data from the river and two rows of wells - one located closer and the other further from the river (in the N-S direction).

The study's main aim is to investigate the variability (or heterogeneity) in the content of the different parameters at the water intake with respect to distance from the river.

My initial assumption is that, statistically, a more significant influence of the river on the first well row can be demonstrated (the water in row 1 is more similar to the water in the river in terms of analysed parameters content, e.g. chlorides).

Thus, I defined 3 sampling locations/regions:

  • the river,

  • the first row of wells

  • the second row of wells

I rearranged my data by the sampling location of the selected parameter (river, row 1 and row 2), as suggested here. Still, after further research and the discussion on reddit, I concluded (I might be wrong) that a Linear Mixed Effects Model may be a better fit for my data than OLS. So, to the prepared data frames, I added one more column, "ID", with the ID of the given measurement "region" (river or one of the wells).

So, for example, for chlorides, the analysed group (data frame) has 3 columns:

  • chloride (numeric values in mg/L),
  • location (categorical values: river, row A and row B)
  • ID (categorical values: river, well_1, well_2 ... well_11)

For the data frame prepared this way, I performed LMM to see if the sampling location affects the value of that sample. Here is the line of code I used:

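import statsmodels.formula.api as smf

# Random intercept per measurement point (ID); location is the fixed effect.
cl_spring_mix = smf.mixedlm(
    'chloride ~ location',
    data=cl_spring,
    groups=cl_spring['ID']
).fit()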

and there is my example output:

print(cl_spring_mix.summary())

          Mixed Linear Model Regression Results
===========================================================
Model:            MixedLM Dependent Variable: chloride
No. Observations: 140     Method:             REML
No. Groups:       12      Scale:              11.3058
Min. group size:  11      Log-Likelihood:     -369.9698
Max. group size:  19      Converged:          Yes
Mean group size:  11.7
-----------------------------------------------------------
                 Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept        14.632    1.419 10.308 0.000 11.849 17.414
location[T.rowA] -0.242    1.538 -0.157 0.875 -3.256  2.772
location[T.rowB]  7.096    1.621  4.378 0.000  3.919 10.272
Group Var         1.420    0.355
===========================================================

However, I am not entirely convinced about my interpretation of these results. My understanding is that, since the Intercept here represents the river, in row A the chloride values are very similar to the river (actually 0.242 mg/L lower on average), while in row B they are more different (7.096 mg/L higher on average).

What I do not know is how to interpret the values of "z" and "P>|z|". I guess that P>|z| values equal to 0.000 for "Intercept" and "location[T.rowB]" mean that the result is statistically significant, but what does this mean in the case of my study? Does it mean that, in the dataset analysed, the location statistically significantly affects the chloride value in row B relative to the river, whereas in row A, where the p-value is large, it does not? In other words, if the value of chlorides in row A was 0, the value in the river wouldn't change? Or did I mess something up?

I mean, that would be OK for my initial assessment, but I am not sure if I am reading these results properly.

And what does the p-value = 0.000 mean for the Intercept? The results are statistically significant in relation to what?

Galen
crtnnn

2 Answers

2

The p value here for each parameter comes from a two-sided Wald z test with the null hypothesis being that the parameter is equal to 0 (you can confirm for yourself in the table below that the value of z is always (Coef. - 0) / Std.Err., or just Coef. / Std.Err.). The 95% confidence interval provided is also just ±1.96 * Std.Err. around the coefficient estimate (table reproduced below).

-----------------------------------------------------------
                 Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept        14.632    1.419 10.308 0.000 11.849 17.414
location[T.rowA] -0.242    1.538 -0.157 0.875 -3.256  2.772
location[T.rowB]  7.096    1.621  4.378 0.000  3.919 10.272
Group Var         1.420    0.355                           
===========================================================
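As a quick check of the relationships described above, the sketch below recomputes z, the two-sided p value, and the 95% interval from the rounded Coef. and Std.Err. values in the table. It uses only the Python standard library and assumes a standard normal reference distribution, as the z column suggests. The helper name `wald_summary` is made up for illustration, and because the inputs are rounded to three decimals, the results agree with the printed table only to about two decimals.

```python
import math

# Coef. and Std.Err. copied from the summary table (rounded to 3 decimals).
table = {
    "Intercept": (14.632, 1.419),
    "location[T.rowA]": (-0.242, 1.538),
    "location[T.rowB]": (7.096, 1.621),
}

def wald_summary(coef, se):
    """Return z, two-sided normal p value, and 95% interval (hypothetical helper)."""
    z = (coef - 0.0) / se
    # Two-sided tail area of the standard normal: P(|Z| > |z|).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    half_width = 1.96 * se
    return z, p, coef - half_width, coef + half_width

results = {name: wald_summary(coef, se) for name, (coef, se) in table.items()}

for name, (z, p, lo, hi) in results.items():
    print(f"{name:17s} z={z:7.3f}  p={p:.2e}  [{lo:7.3f}, {hi:7.3f}]")
```

Printing p in scientific notation also shows the magnitude that the summary rounds to 0.000.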

This means that a "statistically significant" result for Intercept and location[T.rowB] should be treated quite differently. The test being run for Intercept is approximately "are there detectable levels of chloride in the river water?" The location parameter tests assess whether the levels of chloride in each row of wells differ from the levels in the river (I believe these two tests are what you are primarily interested in). In this case, the levels in rowA are not distinguishable from those in the river, while those in rowB are (p<0.0005), based on that two-tailed test.

Specifically, the test being run is always the same (comparing the parameter to 0). However, the location parameters are offsets from the Intercept (mean chloride levels in the river are estimated at 14.632 units, in rowA wells at 14.390 units, and in rowB wells at 21.728 units). Comparing the per-location offsets to 0 is equivalent to testing whether the estimated mean values are different at each location.
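The arithmetic behind those three estimated means can be sketched in a few lines. The coefficient values are copied from the table; the variable names are only illustrative.

```python
# Coefficients copied from the summary table; river is the reference level.
intercept = 14.632
offsets = {"river": 0.0, "rowA": -0.242, "rowB": 7.096}

# Estimated mean at each location = intercept + that location's offset.
means = {loc: round(intercept + off, 3) for loc, off in offsets.items()}

for loc, mean in means.items():
    print(f"{loc}: {mean:.3f}")
```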

1

First, while the printout says 0.000, the p value is never exactly 0. It ought to print out < 0.001.

Second, the intercept being highly significant means that the chloride value for the river is not 0. Your model is estimating that chloride at the river is $14.632$, at row A it is $14.632 - 0.242 = 14.390$ and at row B it is $14.632 + 7.096 = 21.728$.

Peter Flom
  • "It ought to print out < 0.001." I agree in this instance. I wish that the full numerical precision (e.g. float32) values were reported in the supplement of papers. This is more valuable to a detailed validation of the analysis than the inequality < 0.001. – Galen Aug 27 '23 at 16:12
  • "the intercept being highly significant means that the chloride value for the river is not 0." - of course... I don't know why I didn't think of it that way... And regarding the 0.000 -> yes, this information I actually knew. Thank you. The actual output I put in the post is 0.000 because I don't know how to set it in Python to get the actual value in scientific E notation. – crtnnn Aug 27 '23 at 16:24
  • I found the way around for anyone wondering about the precise number if the output in summary is rounded to 0: You have to write your_model_name.pvalues and there you have the specific value. Before that make sure to set your pandas options, e.g. pd.set_option('display.float_format', '{:.2E}'.format) – crtnnn Aug 27 '23 at 16:50
  • Hello, I also wanted to ask one thing: I am writing about my results now. You said that "the intercept being highly significant means that the chloride value for the river is not 0." but I wonder what it means when, for example "location[T.rowA]" p-value is > 0.05? Does it mean its coefficient is not statistically significantly different from the intercept coefficient? In my case, it would mean "p-value >0.05 implies that the estimated value of parameter X in row B is not significantly different than the estimated parameter value in the river."? Did I get it right? – crtnnn Aug 29 '23 at 17:36
  • I think you have the right idea but are not expressing it correctly. The fact that the parameter estimate for row A is not significant means that it is not significantly different from 0. – Peter Flom Aug 29 '23 at 17:56
  • Well then I didn't quite understand it after all. Could you briefly explain in relation to what this is calculated? Or how is the p-value calculated here? I can understand the assumption when e.g. the Coeff. for one variable is eg. 0.5 and the others are e.g. 15 and 30. In this case the 0.5 looks like not significantly different to 0. But suppose e.g. the situation that a Coeff. of p-value = 0.36 has a value of 3.5 - in what sense is this different from 0? – crtnnn Aug 30 '23 at 10:08
  • Or for one of my results the coefficients are: intercept: 5.2, row A: 7.6, row B: 4.3 - all have p > 0.05 (0.28, 0.15 and 0.44). Now, this would mean that all these coefficients are not significantly different from 0? But what does this information mean exactly? I cannot get it. – crtnnn Aug 30 '23 at 10:09
  • or maybe you meant "0" in the sense of "location 0" - in this case the river? - not a "number zero" and I misunderstood your response? – crtnnn Aug 30 '23 at 10:19
  • When a coefficient is not significantly different from 0 it means that, if, in the population from which the sample is randomly drawn, the coefficients really were 0, then you could have gotten results as extreme as yours at least 5% of the time, in samples the size of yours. – Peter Flom Aug 30 '23 at 10:25
  • The p value is calculated from the distribution of the test statistic (here the coefficient divided by its standard error) when the null hypothesis is true (here, the actual coefficient is 0). – Peter Flom Aug 30 '23 at 10:26
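To make the last few comments concrete: whether a coefficient is "significantly different from 0" depends on its size relative to its standard error, not on its size alone. The sketch below uses made-up standard errors (3.8 and 0.1 are assumptions, not values from the question) to show how a coefficient of 3.5 can have a p value around 0.36 while a coefficient of 0.5 is highly significant. It assumes the normal reference distribution implied by the z column, and `two_sided_p` is a hypothetical helper name.

```python
import math

def two_sided_p(coef, se):
    """Two-sided p value for H0: coefficient = 0, using a standard normal."""
    z = coef / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical standard errors chosen only for illustration.
p_large_coef = two_sided_p(3.5, 3.8)   # large estimate, but very noisy
p_small_coef = two_sided_p(0.5, 0.1)   # small estimate, but precise

# Scientific notation avoids the rounded "0.000" seen in the summary.
print(f"coef=3.5, se=3.8 -> p={p_large_coef:.2E}")
print(f"coef=0.5, se=0.1 -> p={p_small_coef:.2E}")
```

In both cases the number being compared with 0 is the same kind of quantity, an offset from the reference level; only the ratio of estimate to standard error changes the verdict.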