One of the factors in my regression analysis is a customer's familiarity with the store, which equals 1 if the customer visited the store more than $N$ times and 0 otherwise. Is there a right way to choose $N$? For some values of $N$ this factor is statistically significant and for others it is not.
1 Answer
Don't discretize your predictor at all. This would amount to treating everyone with $0$ to $N$ visits exactly the same, and also treating everyone with $N+1, N+2, \dots, 2N, \dots, 1000N, \dots$ exactly the same - with a discontinuous step at $N$. This is almost certainly not a good reflection of reality. See this earlier thread for more information: What is the justification for unsupervised discretization of continuous variables?, in particular this page edited by Frank Harrell.
As you note, it makes little sense to include the number of visits "as is", as familiarity with the store will not scale linearly with the number of visits.
My recommendation would be to transform the number of visits using splines, e.g., restricted cubic splines or natural splines. A very good introduction can be found at the very beginning of Frank Harrell's Regression Modeling Strategies.
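As a minimal sketch of what such a transformation looks like, here is a restricted cubic spline basis in plain Python, using the truncated-power form given in Harrell's book. The function names `quantile` and `rcs_basis` and the visit counts are made up for illustration. The knot quantiles (0.10, 0.50, 0.90) are the ones Harrell suggests for three knots, and dividing by the squared distance between the outer knots is an optional rescaling that only keeps the columns on a comparable scale.

```python
def quantile(sorted_xs, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def rcs_basis(x, knots):
    """Restricted cubic spline regressors for a single value x.

    With k knots this returns k - 1 numbers: x itself plus k - 2 cubic
    terms that are constrained to be linear beyond the outer knots.
    """
    t_prev, t_last = knots[-2], knots[-1]
    scale = (t_last - knots[0]) ** 2  # optional rescaling

    def cube(u):
        return max(u, 0.0) ** 3  # truncated power (u)_+^3

    row = [float(x)]
    for t_j in knots[:-2]:
        term = (cube(x - t_j)
                - cube(x - t_prev) * (t_last - t_j) / (t_last - t_prev)
                + cube(x - t_last) * (t_prev - t_j) / (t_last - t_prev))
        row.append(term / scale)
    return row


# Hypothetical visit counts, only to make the sketch runnable.
visits = [1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30]
knots = [quantile(sorted(visits), q) for q in (0.10, 0.50, 0.90)]

# One row of spline regressors per customer; these columns replace the
# single 0/1 "familiarity" dummy in the regression.
basis = [rcs_basis(v, knots) for v in visits]
```

With count data, several quantiles can coincide, in which case the knots would need to be spread out by hand before the basis is usable.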
-
Thank you for your detailed explanation. How can I choose the spline knots? Isn't it the same problem as choosing $N$ in my original question? – 8k14 Dec 28 '17 at 11:50
-
I wouldn't say that it's the same as choosing $N$. Yes, it's a choice for a parameter. But the consequences are different: choosing a threshold makes your response discontinuous, but choosing spline knots only deforms a continuous response curve. ... – Stephan Kolassa Dec 28 '17 at 15:06
-
... You would typically set the knots at specific quantiles of your observed number of store visits. Harrell has a rule of thumb table in his book. You can look at the default behavior of the splines::ns() function in R, see this earlier thread or look through earlier questions. – Stephan Kolassa Dec 28 '17 at 15:06
-
Thank you. By saying that choosing knots is the same as choosing the threshold, I mean that it is also unrelated to the relation between familiarity and the number of visits. For example, the idea that this relation is not linear is not reflected there, right? – 8k14 Dec 28 '17 at 17:06
-
The nonlinearity will come in once you have transformed your original variable into a set of multiple spline regressors and fitted a model. The weighted sum of the spline regressors, weighted by the estimated coefficients, will be a nonlinear response function. Take a look at the Wikipedia page, or run the example in the help page for splines::ns. – Stephan Kolassa Dec 28 '17 at 17:12
-
Interpretation of spline coefficients is hard. Better to just calculate the matrix product between the spline regressors and the parameter estimates and plot this (the response function) against the number of visits. – Stephan Kolassa Dec 28 '17 at 19:06
-
Thanks again. Could you please be a bit more detailed? I just need to know if the corresponding factor has a statistically significant effect on the outcome. – 8k14 Dec 28 '17 at 20:26
-
Ah. In that case, I would recommend that you compare two models, e.g., using ANOVA or a likelihood ratio test. Model 1 would contain all your predictors except the number of visits (or any transform). Model 2 would contain all predictors plus the spline-transformed number of visits. Thus, Model 2 nests Model 1, and you can compare them using ANOVA or similar. – Stephan Kolassa Dec 28 '17 at 20:31
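A rough sketch of that comparison in plain Python, under several assumptions: the data are simulated, the knots are picked by hand, and `rcs_terms` and `rss` are hypothetical helper names. It uses the likelihood ratio statistic for a Gaussian linear model, $n \log(\mathrm{RSS}_1/\mathrm{RSS}_2)$, which is approximately chi-square with as many degrees of freedom as Model 2 has extra regressors. With three knots that is 2, and only for 2 degrees of freedom does the p-value reduce to $\exp(-\text{statistic}/2)$.

```python
import math


def rcs_terms(x, knots):
    """Restricted cubic spline regressors for x (linear term + k-2 cubic terms)."""
    t_prev, t_last = knots[-2], knots[-1]
    scale = (t_last - knots[0]) ** 2  # optional rescaling

    def cube(u):
        return max(u, 0.0) ** 3

    terms = [float(x)]
    for t_j in knots[:-2]:
        terms.append((cube(x - t_j)
                      - cube(x - t_prev) * (t_last - t_j) / (t_last - t_prev)
                      + cube(x - t_last) * (t_prev - t_j) / (t_last - t_prev)) / scale)
    return terms


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


# Simulated data (not real): outcome depends on log(1 + visits).
n = 40
visits = [1 + (7 * i) % 30 for i in range(n)]
other = [(3 * i) % 11 for i in range(n)]  # stand-in for the other predictors
y = [2 + 0.5 * z + 3 * math.log(1 + v) + 0.3 * math.sin(7 * i)
     for i, (v, z) in enumerate(zip(visits, other))]

knots = (2.0, 8.0, 25.0)  # hand-picked for brevity
X1 = [[1.0, z] for z in other]                       # Model 1: no visits
X2 = [[1.0, z] + rcs_terms(v, knots)                 # Model 2: plus spline terms
      for v, z in zip(visits, other)]

lr = n * math.log(rss(X1, y) / rss(X2, y))  # likelihood ratio statistic
df = len(X2[0]) - len(X1[0])                # extra regressors in Model 2
p_value = math.exp(-lr / 2)                 # chi-square tail, valid for df == 2 only
print(f"LR = {lr:.1f}, df = {df}, p = {p_value:.3g}")
```

For small samples the F test that an ANOVA comparison of the two fits performs is the better choice, since the chi-square approximation is only asymptotic. In practice one would let the statistics package do both the fitting and the test.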
-
Thank you very much. So that gives me the statistical significance, but what about the direction of the effect? Do I look at the signs of the spline coefficients? It's fine if they all have the same sign, but what if they don't? – 8k14 Dec 28 '17 at 20:46
-
Even the signs are not overly helpful. The response function can curve up and down. Best to plot it and eyeball it. – Stephan Kolassa Dec 28 '17 at 20:51
-
Thank you. Then why on earth is linear regression still in use? No relation is purely linear... – 8k14 Dec 29 '17 at 04:34
-
npreg Y X. – Alexis Dec 27 '17 at 23:11