Is it okay to fit a linear model to my data, and if so, are the errors normally distributed?

Question

I have fit a linear model to my data, because a (somewhat) linear model would greatly simplify my analysis.

I would like a second opinion on whether my data plot and my residual plots suggest roughly linear data with normally distributed residuals, i.e. can I get away with using a linear model?
If not, can you recommend a transformation of the data that would make the relationship linear?

The dataset (R code)

effective.vpip=c(0.004489674,
             0.004489674,
             0.004489674,
             0.004489674,
             0.004489674,
             0.004489674,
             0.004489674,
             0.004302604,
             0.00432339,
             0.004190362,
             0.004634795,
             0.004497988,
             0.004382229,
             0.004717721,
             0.00500848,
             0.00488252,
             0.004859412,
             0.004755728,
             0.004983275,
             0.004883767,
             0.004793736,
             0.00471189,
             0.004901769,
             0.004822242,
             0.004992517,
             0.005149694,
             0.005069821,
             0.005049101,
             0.005188067,
             0.005114902,
             0.005046458,
             0.004982291,
             0.005106437,
             0.005044281,
             0.005159562,
             0.005099383,
             0.005042456,
             0.005148684,
             0.005093413,
             0.005193056,
             0.005139399,
             0.005233202,
             0.005181107,
             0.005269698,
             0.005219107,
             0.00530302,
             0.005253874,
             0.005237953,
             0.00519214,
             0.005269879,
             0.00534457,
             0.00529935,
             0.005255836,
             0.005213934,
             0.00528421,
             0.005351976,
             0.005417365,
             0.005480498,
             0.005438339,
             0.005499019,
             0.005457939,
             0.005516345,
             0.005476293,
             0.005437494,
             0.005399888,
             0.005455634,
             0.00541888,
             0.005405215,
             0.00545846,
             0.005510184,
             0.005474732,
             0.005440265,
             0.005406743,
             0.005374126,
             0.005423526,
             0.005471626,
             0.005439437,
             0.0054861,
             0.005531581,
             0.005575925,
             0.005619175,
             0.00558715,
             0.005555896,
             0.005525387,
             0.005495595,
             0.005483899,
             0.005525223,
             0.005565608,
             0.005536704,
             0.005576064,
             0.00554768,
             0.005586065,
             0.005558184,
             0.005530896,
             0.005568246,
             0.005541421,
             0.005577892,
             0.005613619,
             0.00558715,
             0.005622069,
             0.005656297,
             0.005630188,
             0.005604585,
             0.005579475,
             0.005612805,
             0.005588091,
             0.005620717,
             0.005652739,
             0.005628339,
             0.005604383,
             0.005635687,
             0.005666431,
             0.005642774,
             0.005619531,
             0.005596693,
             0.005626714,
             0.005656221,
             0.005685229,
             0.005662606,
             0.005640361,
             0.00566878,
             0.005696734,
             0.005674753,
             0.005653127,
             0.005680535,
             0.005707508,
             0.005686134,
             0.005665095,
             0.005691561,
             0.005670803,
             0.005696821,
             0.005722444,
             0.005701923,
             0.005681708,
             0.005706874,
             0.00573167,
             0.005711681,
             0.005691981,
             0.005672565,
             0.005696897,
             0.005720885,
             0.005701675,
             0.005725293,
             0.00570632,
             0.005729581,
             0.005752523,
             0.005733751,
             0.005715233,
             0.00573781,
             0.005760085,
             0.005741761,
             0.005763717,
             0.005745609,
             0.005767255,
             0.005788622,
             0.005770702,
             0.00575301,
             0.005774061,
             0.005794848,
             0.005777337,
             0.005797845,
             0.0058181,
             0.005800769,
             0.005820759,
             0.005840506,
             0.005860016,
             0.005879292,
             0.005898339,
             0.00591716)

The linear model and the residual plots

case <- seq_along(effective.vpip)  # case number 1..n, used as the predictor
fit <- lm(effective.vpip ~ case)
plot(fit)  # standard lm diagnostic plots

Sequel at https://stats.stackexchange.com/questions/306025/do-these-plots-imply-a-good-fit-of-a-linear-model-with-normal-errors — Nick Cox, Oct 03 '17 at 07:28
I'm voting to close this question as off-topic because it is apparently continued in another question that is, itself, a duplicate. — Peter Flom, Aug 10 '19 at 11:26

rolando2 · Accepted Answer · 2017-10-02T20:50:39.373

3

Of course the plots imply "somewhat" linear data; the question is one of degree. Rather than answering whether the linear fit is "OK" or whether you can "get away with it", let me describe the consequences of using the linear fit as opposed to a quadratic one. R-squared will be .82, compared to .93 if you build in a squared term. The residuals from the linear fit will be much more strongly associated with the fitted values and the leverage values. The residuals will conform less well to the theoretical line in a Q-Q plot. And you will be excluding a term whose predictive power is unquestionably statistically significant, with p < 1 in 1 trillion. Standing outside your situation, the quadratic fit seems preferable in every way. Yet you clearly have reason to favor the linear solution. Only you can say what your ultimate criteria will be, or how superior the quadratic fit would have to be to sway you in that direction.
Transformation: taking the square root of the case number and using it in place of the original case number in a linear model yields an R-squared of .92.
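
The comparison described in this answer could be reproduced along the following lines. This is only a sketch: `compare_fits` and `y_demo` are hypothetical names introduced for illustration, and `y_demo` is synthetic stand-in data rather than the question's values, so the numbers it prints are not the .82, .93 and .92 quoted above.

```r
# Sketch: fit the linear, quadratic and square-root-of-case-number models
# and collect their R-squared values.
# `compare_fits` is a hypothetical helper, not part of any package.
compare_fits <- function(y) {
  d <- data.frame(y = y, case = seq_along(y))
  fits <- list(
    linear    = lm(y ~ case, data = d),
    quadratic = lm(y ~ case + I(case^2), data = d),
    sqrt_case = lm(y ~ sqrt(case), data = d)
  )
  r2 <- sapply(fits, function(f) summary(f)$r.squared)
  # F test for the squared term: the linear model is nested in the quadratic
  p_sq <- anova(fits$linear, fits$quadratic)[2, "Pr(>F)"]
  list(r_squared = r2, p_squared_term = p_sq, fits = fits)
}

# Synthetic stand-in data (an assumption, NOT the question's dataset):
# a concave trend in the case number plus Gaussian noise.
set.seed(1)
y_demo <- 0.0045 + 1e-4 * sqrt(1:150) + rnorm(150, sd = 5e-5)

res <- compare_fits(y_demo)
print(round(res$r_squared, 3))
print(res$p_squared_term)
```

With the question's vector loaded, the call would be `compare_fits(effective.vpip)`. The diagnostic plots for any one of the fits are then available via, for example, `plot(res$fits$quadratic)`.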

edited Oct 02 '17 at 20:50

answered Oct 02 '17 at 20:44


Some striking features about this plot, which should cause serious re-examination of the entire approach, are (1) its strongly heteroscedastic response; (2) the cluster of constant values at the left; and (3) the complete abstraction of the data description. I am concerned that any opinion one might try to derive from this plot could be counterproductive in the actual application. It might be better to point out the pitfalls in fitting a linear or quadratic function and to elicit more information about what these data mean and what the model is intended for. – whuber Oct 02 '17 at 21:29
There are even more striking hidden features that are revealed upon applying EDA techniques. To see them (using R), plot diagnostics for a robust fit of $\log(y_0 - y)\sim x$ where $y_0$ is barely greater than the largest response. E.g., library(robust); y <- effective.vpip; x <- 1:length(y); fit <- rlm(log(max(y) + 1e-5 - y) ~ x); plot(fit, cex=fit$w). This is nearly linear with Normal residuals. The response appears to be discrete and there are outliers at the right of the plot. These patterns beg for a better description of how the data were collected and what they represent. – whuber Oct 02 '17 at 21:54
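
The one-liner in the comment above can be unpacked as follows. This is a sketch under assumptions: `robust_log_fit` and `y_demo` are hypothetical names, and `y_demo` is synthetic stand-in data. The comment loads `robust`, but `rlm()` is exported by MASS, so the sketch loads MASS directly.

```r
library(MASS)  # provides rlm(); MASS is a recommended package shipped with R

# Sketch: robust fit of log(y0 - y) on the case number, where y0 is
# barely greater than the largest response (here max(y) + eps).
# `robust_log_fit` is a hypothetical helper name.
robust_log_fit <- function(y, eps = 1e-5) {
  x  <- seq_along(y)
  y0 <- max(y) + eps
  rlm(log(y0 - y) ~ x)
}

# Synthetic stand-in data (an assumption, NOT the question's dataset):
# a response rising towards a plateau with multiplicative noise in the gap.
set.seed(1)
y_demo <- 0.006 - 0.0015 * exp(-0.02 * (1:150)) * exp(rnorm(150, sd = 0.1))

fit <- robust_log_fit(y_demo)
print(coef(fit))

# Diagnostic plots with point size proportional to the robustness weights,
# so heavily down-weighted (outlying) cases are drawn small.
if (interactive()) plot(fit, cex = fit$w)
```

With the question's vector loaded, the call would be `robust_log_fit(effective.vpip)`. The weights in `fit$w` give a quick way to list the cases the robust fit treats as outlying, for example `which(fit$w < 0.5)`.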
Thanks, this makes sense. Perhaps a transformation of the data is in order – user5211911 Oct 03 '17 at 05:40
