I am unaware of how to generate a sample data set for reproduction that has the same characteristics as my actual data that would deliver similar results when running a log(response_variable) model vs. a regular non transformed equivalent.
My goal here is therefore to provide some detail about my response variable and the rmse of a model before and after log transformation.
Here are histograms of response variable for non transformed and then log transformed version:
Regular non transformed response variable:
And the log version:
So the log version looks closer to a normal distribution than the regular one. So perhaps I should consider a model with a log transformation (right?).
Here's a function I use after running my models to get my evaluation metric:
rmse <- function(errors) {
mse <- mean(errors^2)
rmse <- sqrt(mse)
}
My base lm model:
mod.spend_transactions <- lm(Total.Transactions ~
Video.Streaming.Spend +
Display.Banner.Spend +
Shopping.Spend +
Trademark.Search.Spend +
Non.branded.Search.Spend, data = campaign_data)
And my evaluation metric:
rmse(residuals(mod.spend_transactions))
Gives: 12.60294
And now my log transformed model:
logmod.spend_logtransactions <- lm(log(Total.Transactions+1) ~
Video.Streaming.Spend +
Display.Banner.Spend +
Shopping.Spend +
Trademark.Search.Spend +
Non.branded.Search.Spend, data = campaign_data)
rmse(exp(residuals(logmod.spend_logtransactions)))
Gives: 1.412357
This is great!
Is it? I back-transformed my log model residuals using exp().
Have I missed something? Based on this info and on evaluation measure RMSE is my log model better than it's non log equivalent?

