Different p-values from weighted and unweighted regressions?

Question

I'm trying to determine whether an unweighted or weighted regression would be more suitable for my data.

I have variables X and Y. Both are measured, but X has very small errors while Y has quite large errors. This is because Y is calculated as the average of 5 measurements, so the error bars include the repeatability of these values on our analytical instrument.

I initially thought weighted regression would work because it would treat data points with smaller errors as more important, giving more weight to Y values with smaller standard deviations (so that data points that were more reproducible by the instrument are more reliable).

However, I've been warned against using weighted linear regressions because they are said to be ideal for data that has large errors on both the X and Y variables. Is this statement true?

I'm also worried about choosing the regression method because I get very different p-values from the two regressions. With the weighted regression my p-value is <0.001, but with the unweighted regression it is ~0.5, which completely changes my interpretation. I'm trying to understand what's causing this much difference in p-values between the two methods and which regression would be best for my data.

Any insight would be appreciated!

I would not consider your Y variable with larger error, because you repeated the measurement 5 times. — Sextus Empiricus, Mar 29 '18 at 19:44
The weighted regression might perform 'better' (lower significance). However, I imagine that this fit is dominated by a few 'accurate' high weight points (in only a part of the entire domain of X). So the result should not be generalized over the entire domain of your measurement for X. This would be sort of extrapolation. — Sextus Empiricus, Mar 29 '18 at 19:48
@MartijnWeterings Initially the idea of repeating measurement was to test how reproducible my data was. That's why I'm wondering, if low standard deviations across 5 measurements of a same sample result in more 'accurate' data points, would it be acceptable to treat them with more 'weight'. What did you mean by "generalized over the entire domain"? Sorry, I'm not super familiar with regression methods. — Jen, Mar 29 '18 at 20:16

mkt · Answer 1 · 2018-03-29T19:26:05.437

Given the large differences in error for the different points, the unweighted regression p-values are not reliable. Your weighted regression is probably more appropriate; I am not aware of the assumption that you describe and do not think that it is accurate. If anything, I would expect the opposite to be true (i.e. that X-error should be much lower than Y-error, as in your data).
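To make the comparison concrete, here is a minimal sketch of a weighted straight-line fit with weights 1/sd^2, written in plain Python. The data values and the name `wls_fit` are made up for illustration and are not from the question. The residual variance is estimated from the weighted residuals, which is a common convention for weighted least squares and an assumption here. The function returns the slope, its standard error and the t-statistic; turning t into a p-value needs a t-distribution with n - 2 degrees of freedom, which is left out.

```python
import math

def wls_fit(x, y, w=None):
    """Fit y = a + b*x by weighted least squares.

    Returns (intercept, slope, slope standard error, slope t-statistic).
    With w=None all weights are 1, i.e. ordinary unweighted regression.
    """
    n = len(x)
    if w is None:
        w = [1.0] * n
    sw = sum(w)
    # Weighted means of x and y
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of squares and cross-products
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    # Residual variance from weighted residuals, n - 2 degrees of freedom
    rss = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    s2 = rss / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    return a, b, se_b, b / se_b

# Hypothetical data: Y means with very different standard deviations
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 3.0, 6.5, 5.1]
sd = [0.1, 0.1, 0.2, 1.5, 1.8, 1.2]
weights = [1.0 / s ** 2 for s in sd]

print("unweighted:", wls_fit(x, y))
print("weighted:  ", wls_fit(x, y, weights))
```

With weights this uneven, the weighted fit is driven almost entirely by the few low-sd points, so its slope and t-statistic can differ a lot from the unweighted ones. Multiplying all weights by the same constant leaves the slope, standard error and t-statistic unchanged, so equal weights reproduce the unweighted fit.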

A few additional thoughts/ideas.

1) There's no need to average your 5 measurements, and therefore little need for weighting. You can include them all while accounting for their non-independence with a random intercept term in a linear mixed effects model.

2) It's true that having errors in both X & Y can get complicated. There are methods that have been developed for this (orthogonal regression), but they have their limitations as well. I discuss their pros and cons a bit in this answer. If your errors on the X-axis are small, it may be best to avoid this approach.

3) A completely different approach one can take when faced with errors in both X & Y is to bootstrap your regressions. Let's say you can estimate the mean and standard deviation of each point, both in X & Y dimensions. You can randomly draw a value from the normal distribution fitted to each point, in each dimension. Do this for every point and fit a linear model to this resampled dataset. Repeat this process a large number of times, fitting your model to each resampled dataset. By aggregating across all your fitted models, you can estimate the distribution of intercept and slope values. Since each point is drawn from its own distributions, this accounts for differences in the error of each point, in both dimensions.
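The resampling procedure in point 3 can be sketched as follows, in plain Python. The data values and the function names `ols` and `resampled_fits` are hypothetical, and the sketch assumes each point's error is normal and independent in X and Y. It is an illustration under those assumptions, not a definitive implementation.

```python
import random
import statistics

def ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def resampled_fits(x_mean, x_sd, y_mean, y_sd, n_rep=2000, seed=0):
    """Draw each point from its own normal distribution and refit n_rep times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    fits = []
    for _ in range(n_rep):
        xs = [rng.gauss(m, s) for m, s in zip(x_mean, x_sd)]
        ys = [rng.gauss(m, s) for m, s in zip(y_mean, y_sd)]
        fits.append(ols(xs, ys))
    return fits

# Hypothetical per-point means and standard deviations
x_mean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_sd = [0.01] * 6
y_mean = [2.1, 2.9, 4.2, 3.0, 6.5, 5.1]
y_sd = [0.1, 0.1, 0.2, 1.5, 1.8, 1.2]

fits = resampled_fits(x_mean, x_sd, y_mean, y_sd)
slopes = [b for _, b in fits]
# With n=40, the first and last cut points are the 2.5% and 97.5% quantiles
cuts = statistics.quantiles(slopes, n=40)
print("median slope:", statistics.median(slopes))
print("95% interval:", (cuts[0], cuts[-1]))
```

The spread of the resampled slopes gives an interval for the slope. If that interval comfortably excludes zero, that plays a role similar to a small p-value.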

Reading your comment 1), I tried using all my measurement values as Y (and just assigned a small constant (0.01) as the error for all values, assuming those are analytical errors from the instrument). Now that I have assigned the same X and Y errors to all my data points, I'm getting the same regression lines (slope/intercept/R2) for both my weighted and unweighted regressions, but my p-values are still 3 orders of magnitude apart. So maybe there were factors other than the large error bars in Y that affect my p-values? — Jen, Mar 29 '18 at 20:21
@Jen I don't quite follow, I'm afraid. My point #1 advocated using all your data points and not including weights (but a random intercept instead). So I don't see how the process you describe reflected this. However, if you weight all points identically, then your results (including p-values) should be identical to unweighted regression. — mkt, Mar 30 '18 at 05:50
I figured it might be just because I was using an 'unusual' weighted regression (York method). When I used the lm function with weights, I did find that my weighted and unweighted regressions yield the same results. I'll have to think more about which weighted regression algorithm I want to use, but thank you! — Jen, Mar 30 '18 at 21:36
