3

I have a GAM with automatic predictor selection, using either cubic regression splines (bs = "cr") with select = TRUE or shrinkage cubic regression splines (bs = "cs") with select = FALSE.

As far as I know, these predictors should be scaled (because of their different ranges and units) and transformed towards a Gaussian distribution.

I got two questions with respect to this topic:

  1. Should I first scale and then transform my predictor data?
  2. I'm scaling and transforming my predictor data to build the model. Do I also have to scale and transform the predictor data when I make a prediction, i.e. apply the same transformation and scaling to new, unseen data?

I'm using the following model:

basis <- "ts"
selection <- TRUE
gammaVal <- 1.8
model <- mgcv::bam(formula = TT2 ~
            s(DGM5, k = k, bs = basis) +
            ...
            s(NDVI_MEAN, k = k, bs = basis),
            data = trainSet, family = gaussian(), method = "fREML", select = selection,
            gamma = gammaVal, scale = 1, control = ctrl, cluster = cl, discrete = TRUE, nthreads = 31)

I am trying to predict a kind of air temperature from predictors such as a digital height model, a vegetation index, buildings, sealing, etc.

  • 1
    You should probably add a little more information about the computing environment. I suspect you're using the R package mgcv? It would also help to know more about the specific variables you're using. – COOLSerdash Jan 30 '23 at 07:57
  • 2
    Scaling a random variable (by a constant) to achieve a Gaussian distribution can only happen if the original distribution was already Gaussian? In any case, the predictors in the GAM don't need to be Gaussian. Can you elaborate on your model more? What is the family/link choice? Regarding your second question, yes whatever variable is used in the estimation should be the same variable passed to the prediction. – statsplease Jan 30 '23 at 08:20
  • @COOLSerdash yes, you are right, I'm using mgcv in an R environment. I adjusted my question accordingly. – Marcel Gangwisch Jan 30 '23 at 09:06
  • @statsplease thanks for your comment. You are right if you are just scaling, but if I transform the data by log/sqrt or by Box-Cox/Yeo-Johnson I also change the distribution. – Marcel Gangwisch Jan 30 '23 at 10:05
  • 2
    But why would you transform them given we are using splines anyway? – usεr11852 Jan 30 '23 at 11:15
  • 2
    Why are you transforming your data to a Gaussian distribution? There's no point in doing so. – jbowman Feb 01 '23 at 22:39
  • OK; I'll try to run the model without transforming, only scaling. Thank you for your remarks! – Marcel Gangwisch Feb 05 '23 at 16:56
  • 1
    Let's suppose one predictor is binary. Then there is no scope to make it Gaussian. No transformation can do anything useful. But there is no need to make it Gaussian. The idea that predictors should be Gaussian is a myth. What is true is that some non-Gaussian predictors can be hard to handle and might be easier to work with if transformed. – Nick Cox Feb 06 '23 at 08:56

2 Answers

3

Scaling/Transformation

To directly answer your two main questions here: the answer is no to both.

First off, there is pretty much no defensible reason to scale or transform your predictors here. Even if you were to include an interaction, you could simply add tensor product smooths to your model and avoid the transformation/scaling that would be needed for interactions in other kinds of regression (a sketch of this is shown below). The only thing you should really be concerned with in this respect is choosing an appropriate link function. Unfortunately, without knowing more about your data it is hard to say, but if you have a normally distributed outcome there is no need to change anything, and it looks like you have already explicitly modeled it as such with family = gaussian(). So just leave your predictors in without scaling or transformation. I don't know whether the scale and control arguments were specified for that purpose, but I would simply remove them, and I also don't know what cluster is being used for here.
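As a minimal sketch of that point (with made-up predictors x1 and x2 on very different scales, not your variables), a tensor product smooth handles the differing units/ranges internally, so no pre-scaling is needed:

library(mgcv)

set.seed(1)
d <- data.frame(
  x1 = runif(500, 0, 1),      # e.g. an index on [0, 1]
  x2 = runif(500, 0, 3000)    # e.g. elevation in metres
)
d$y <- sin(2 * pi * d$x1) + 0.0005 * d$x2 + rnorm(500, sd = 0.2)

# te() builds a tensor product smooth with a separate penalty per margin,
# so the very different ranges of x1 and x2 are handled without rescaling
m <- gam(y ~ te(x1, x2), data = d, method = "REML")
summary(m)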

As an example of why scaling is not helpful, I have fit two models to the MASS package's mcycle data, one with the predictor scaled and one without, and then plotted them.

#### Load Libraries ####
library(MASS)
library(mgcv)
library(tidyverse)

Scale Data

data <- mcycle %>% mutate(scale_times = scale(times))

Fit Unscaled Model

fit.reg <- gam( accel ~ s(times, bs = "cr"), data = data, method = "REML" )

Fit Scaled Model

fit.scale <- gam( accel ~ s(scale_times, bs = "cr"), data = data, method = "REML" )

Plot Them

par(mfrow = c(1, 2))
plot(fit.reg)
plot(fit.scale)

You can probably see an issue already:

[Plot: fitted smooth of head acceleration against raw times (left) and against z-scored times (right)]

This mcycle data set is commonly used for GAMs and shows time in milliseconds after a crash plotted against head acceleration in g's. The plot on the left shows the data in raw form. With this, we can predict, or at least guess, where the crash test dummy's head should be at specific times (for example, at around 15 milliseconds we expect the head to jerk in one direction, then whiplash at about 20 milliseconds). The plot on the right (the scaled version) has completely lost interpretability: now we have z-scores on the x-axis and cannot ascertain what they actually mean in terms of head acceleration. What does a z-score of zero mean here? Very little, if anything.

Of course, a particular rescaling may solve a genuine problem (maybe you are converting hours to minutes, or something similar; see the sketch below), but as it stands you really shouldn't adjust your data this way, and other transformations would likely yield equally dubious outcomes.
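For instance, continuing with the objects created above, a unit conversion keeps the axis interpretable; the estimated curve is the same, only the labelling changes (just a sketch):

# Convert milliseconds to seconds: a meaningful rescaling, unlike z-scoring
data$times_sec <- data$times / 1000
fit.sec <- gam(accel ~ s(times_sec, bs = "cr"), data = data, method = "REML")

par(mfrow = c(1, 2))
plot(fit.reg)   # x-axis in milliseconds
plot(fit.sec)   # same estimated smooth, x-axis in seconds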

Spline Terms

I'm also not sure what you are doing with your splines. In your question you say they are cubic regression splines, but that is not what your code specifies: bs = "ts" is a thin plate regression spline with shrinkage, and personally I would recommend against using it in a bam model. You are likely using bam for speed of estimation (a large number of observations, etc.), and you are already using discrete = TRUE to speed up the optimization, but thin plate regression splines tend to increase estimation time, likely unnecessarily, because they estimate as many unknown parameters as there are unique combinations of covariate values in the data set (see Simpson, 2018). Since you are using single-predictor smooths here, you could instead use something less computationally demanding such as a cubic regression spline (bs = "cr"); a sketch is given below. It would also help to know why gamma is being used if you are including thin plate regression splines; gamma is usually adjusted (to values above 1) to make your model smoother.
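A rough sketch of what that could look like, keeping only the two smooths shown in the question (TT2, trainSet, k and nthreads as in your own script) and dropping the scale, control, cluster and gamma arguments discussed above:

# Cubic regression splines with the double-penalty selection approach;
# select = TRUE lets whole terms be shrunk out of the model
model_cr <- mgcv::bam(
  TT2 ~ s(DGM5, k = k, bs = "cr") +
        s(NDVI_MEAN, k = k, bs = "cr"),
  data = trainSet, family = gaussian(), method = "fREML",
  select = TRUE, discrete = TRUE, nthreads = 31
)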

Knots

You have set k to k here (I'm assuming you saved this value elsewhere), but generically applying the same basis dimension to all of your predictors may not be wise. Check what your data actually look like and whether some relationships need more flexibility. If, for example, a relationship has several turns and you only allow k = 5, your smooth will miss a lot of that curvilinearity. A quick check is sketched below.
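Assuming your fitted object is called model as in the question, mgcv's built-in diagnostics will flag smooths whose basis dimension may be too low:

# Basis-dimension diagnostics: for each smooth, an edf close to k' together
# with a low k-index / small p-value suggests k may be too small
gam.check(model)

# If a term is flagged, refit it with a larger basis, e.g.
# s(DGM5, k = 20, bs = basis), and check again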

  • I like your example, but I'm wondering what happens if you add a second smooth for a predictor with a different range of data. And if we are interested in the size of the effect of different predictors, it might be useful to scale them, right? – Marcel Gangwisch Feb 07 '23 at 06:54
  • Furthermore, I'm generically setting the number of knots elsewhere, because I was using ts or cs, i.e. thin plate or cubic splines WITH shrinkage, together with select = TRUE, so that the model can also shrink whole terms to zero. – Marcel Gangwisch Feb 07 '23 at 06:57
  • Let's say we added wind speed as a predictor in the model I showed. Would it make the time variable any less confusing? – Shawn Hemelstrand Feb 07 '23 at 13:07
1

Scaling is a linear transformation and therefore the order doesn't matter. You can scale first and then transform or the other way around.

When predicting, you must use exactly the same scaling and transformation you used when fitting the model for the prediction to be valid. Note that you still apply the preprocessing fitted on your training data (i.e. the data used to fit the model) when predicting on new data; see the sketch below.
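A minimal sketch of what that looks like in practice, with a made-up predictor x (none of these objects come from the question):

library(mgcv)

set.seed(1)
train <- data.frame(x = runif(200, 0, 100))
train$y <- sin(train$x / 15) + rnorm(200, sd = 0.2)
new <- data.frame(x = runif(50, 0, 100))

# Centre and scale with the *training* statistics, and keep them
x_mean <- mean(train$x)
x_sd <- sd(train$x)
train$x_z <- (train$x - x_mean) / x_sd

fit <- gam(y ~ s(x_z), data = train, method = "REML")

# Apply exactly the same training-set preprocessing to the new data
new$x_z <- (new$x - x_mean) / x_sd
pred <- predict(fit, newdata = new)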

leviva
  • This advice would contradict e.g. generalized linear models where the whole point of fitting using a link function is to have the best of both worlds, fitting on one scale but presenting predictions on the original scale. – Nick Cox Feb 06 '23 at 09:26
  • I'm not sure I understand your point. If the inputs are scaled when fitting the model, they must also be scaled when using the model for prediction. For example y=ax+b is a simple case of a GLM. If x is a scaled version of the input t, say S*t, then predicting directly on t will lead to the wrong result: y= at+b instead of y=aSt+b. – leviva Feb 06 '23 at 09:42
  • Consider generalized linear models with a log link as a leading counterexample. (Also, the abbreviation GLM is often used to mean general linear models, which are usually generalized linear models, but the converse isn't true. I really do mean generalized linear models.) – Nick Cox Feb 06 '23 at 11:07
  • 1
    As for definitions, for something to be correct for GLMs, it must hold for all GLMs, including the degenerate case of linear regression. It also holds for any other GLM like logistic regression for example. It's a well known fact that if the input is scaled in the train set, it (the input) must also be scaled in the test set. For example see this answer https://datascience.stackexchange.com/questions/39932/feature-scaling-both-training-and-test-data – leviva Feb 06 '23 at 12:05
  • It may be machine learning orthodoxy that you always have a train[ing] set and a test set, but that is not universal across statistical science. But if I fit e.g. $y = \exp(Xb)$ via a generalized linear model with log link, that equation is available directly for new $X$; I don't have to repeat the manipulations. – Nick Cox Feb 06 '23 at 12:15