I am trying to use linear regression on a dataset using scikit-learn in Python. My understanding is that linear regression requires "some linearity" between the independent and dependent variables. Below is a scatter plot of one of my features against the target variable. As far as I understand, I need to either transform my variables or use another model. I do not want to give up on linear regression. I tried several transformations, such as square root, square, and log, but none of them reveals any trend.

What should I do in this case? Should I just drop the idea of using Linear models? Or are there other things to do as well before I move on to other models?

[Figure: scatter plot of humidity (hum) against the target]

There are other features in my dataset too, both numerical and categorical.

The scatter plot shows humidity on a day against Target, the count of bikes rented on that day. I have several features for predicting Target, such as temperature. The humidity (hum) lies between 0 and 1 because it was scaled.

Most of my feature-vs-target scatter plots share the same problem: they show no clear shape or direction.

letdatado
  • 325
  • 1
    Why don't you want to give up linear regression? Why does windspeed vary only from 0 to 1? And, your scatterplot has so many points that it's impossible to see if a pattern exists (you could try making the points smaller and maybe hollow) but it may be that there simply is no pattern. – Peter Flom Dec 27 '23 at 12:05
  • 1
    Where is it stated that linear regression requires linear relationships? It is always useful in providing a baseline fit against which more sophisticated, less intuitive alternatives can be evaluated. – user78229 Dec 27 '23 at 12:06
  • 1
    Linear regression models a linear relationship between predictors and outcomes - but predictors can be nonlinear transformations of input features. For instance, you can model nonlinear relationships between wind speed and target by using spline transforms of wind speed. Linear models can be quite flexible, one reason why they are still taught. – Stephan Kolassa Dec 27 '23 at 12:54
  • 1
    You could consider using a hexbinplot to avoid the perceptual issues with your large data set. Also, your dots have different colors, what do these stand for? – Stephan Kolassa Dec 27 '23 at 12:56
  • @PeterFlom I want to learn about data preparation for linear models. The humidity "hum" was already scaled from 0 to 1. On your advice, I have changed the figure to show smaller markers. My question is how to transform this data. – letdatado Dec 28 '23 at 05:21
  • @MikeHunter Mikeee.... Some experts say that it is a condition, like no multicollinearity, etc., that must be taken care of. I know your point and I have seen other experts doing the same as you said. – letdatado Dec 28 '23 at 05:22
  • @StephanKolassa Is it correct to say that the figure I have shared shows a non-linear relationship? I have made my scatter markers smaller and added more details. I don't think that changing colors is possible because each point is showing both variables. Colors could make a difference if each point was showing a single variable. – letdatado Dec 28 '23 at 05:24
  • 1
    Hm. By the colors, I was referring to the original plot, which showed different colors. As to whether there is a particular pattern, that is extremely hard to say since you have very many data points. As above, a hexbinplot would be useful. I don't know whether there is a Python implementation, but there is a "hexbinplot" package for R. – Stephan Kolassa Dec 28 '23 at 06:39
  • You could try adding a smooth line to the plot, e.g. with loess. Also, while it's great to "want to learn about data preparation for linear models", sometimes it's not the right tool. Just because you want to learn about hammering doesn't mean you shouldn't learn about screwdrivers. – Peter Flom Dec 28 '23 at 09:59

1 Answer

2

First, the linearity of linear regression is linearity in the coefficients, not in the predictors themselves or transformations of them. There's no need to restrict yourself to a single transformation like log or square root or whatever. As comments indicate, a regression spline can be a good way of letting the data tell you the shape of the association between a continuous predictor and outcome.

Second, as you have multiple predictors, any single plot like this can be misleading. It doesn't take into account the associations of the other predictors with outcome, potentially hiding the shape of the association of this single predictor with outcome. You first need a multiple regression model based on all the predictors. Then examine how individual predictors are associated with outcome when all the other predictors are thus taken into account.
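To illustrate how a single-predictor view can hide an association, here is a small sketch on synthetic data (an assumption, not the asker's data). The variables `temp` and `hum` are standardized, correlated stand-ins, and the coefficients 100 and -80 are chosen so that the marginal slope of the outcome on `hum` is close to zero even though its adjusted effect is strongly negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000

# Synthetic, standardized stand-ins (an assumption, NOT the asker's data).
temp = rng.normal(size=n)
hum = 0.8 * temp + 0.6 * rng.normal(size=n)  # hum is correlated with temp
y = 100.0 * temp - 80.0 * hum + rng.normal(0.0, 30.0, size=n)

# Slope seen in a one-predictor scatter plot of y against hum.
marginal_slope = LinearRegression().fit(hum.reshape(-1, 1), y).coef_[0]

# Slope for hum once temp is also in the model.
full_model = LinearRegression().fit(np.column_stack([temp, hum]), y)
adjusted_slope = full_model.coef_[1]

print(f"hum slope, hum alone:        {marginal_slope:.1f}")
print(f"hum slope, adjusted for temp: {adjusted_slope:.1f}")
```

In this constructed case the scatter plot of `y` against `hum` would look like a shapeless cloud, while the multiple regression recovers a clear negative association.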

Third, if you want to learn about ordinary least squares multiple linear regression, this doesn't seem like a good data set. The outcome values are count values, necessarily integer and non-negative, with what seems to be a preponderance of values at or near 0. Such data are unlikely to meet the usual requirements of ordinary least squares; a generalized linear model designed for counts, like a Poisson or negative-binomial model, might be called for.
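A Poisson model of this kind can be sketched with scikit-learn's `PoissonRegressor`, which uses a log link. The data below are simulated counts (an assumption, not the bike-rental data), and the true coefficients 1.0 and -1.5 are arbitrary illustration values. scikit-learn does not provide a negative-binomial model as far as I know, so overdispersed counts would need another package.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

rng = np.random.default_rng(2)
n = 2000

# Synthetic stand-in features scaled to [0, 1] (an assumption, NOT real data).
temp = rng.uniform(0.0, 1.0, size=n)
hum = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([temp, hum])

# Counts generated with a log link: log(mean) = 3 + 1.0*temp - 1.5*hum.
mu = np.exp(3.0 + 1.0 * temp - 1.5 * hum)
counts = rng.poisson(mu)

# Poisson GLM; alpha=0.0 switches off the default L2 penalty.
poisson_model = PoissonRegressor(alpha=0.0, max_iter=1000).fit(X, counts)

# Ordinary least squares on the raw counts, for comparison.
ols_model = LinearRegression().fit(X, counts)

poisson_pred = poisson_model.predict(X)
ols_pred = ols_model.predict(X)

print("Poisson coefficients (log scale):", poisson_model.coef_)
print("Smallest Poisson prediction:", poisson_pred.min())
print("Smallest OLS prediction:    ", ols_pred.min())
```

The Poisson predictions are positive by construction because of the log link, whereas nothing constrains the least-squares predictions to stay non-negative.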

EdM
  • 92,183
  • 10
  • 92
  • 267