6

I have only a basic understanding of neural networks (NN). Recently, I encountered a scenario at my company where a team was using linear regression (LR) to forecast an important continuous parameter. It looked like a classic problem for an NN, so I tried to produce a better forecast.

In a Colab notebook, ChatGPT and I created a simple NN. Initially it only marginally improved the mean absolute error (MAE). However, after replacing "not a number" (NaN) values with zeros and standardizing the data, the performance improved significantly:

  • LR: MAE = 4.7
  • 1st NN: MAE = 4.3
  • 2nd NN + replace NaNs with zeros + standardized data: MAE = 1.6
  • Edit: LR + replace NaNs with zeros + standardized data: MAE = 2.8

Does the fact that the 2nd NN lowers the MAE so much mean it is a good forecast? What should I check or verify before I can say I have found a better forecast?

code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Masking
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Replace NaN values with 0
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

# Standardize the data (optional but recommended for neural networks)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the model with a Masking layer
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(X_train_scaled.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(42, activation='relu'))
model.add(Dense(1, activation='linear'))  # Linear activation for regression
model.summary()

# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')

# Fit the model to the data
model.fit(X_train_scaled, y_train, epochs=50, batch_size=42, validation_data=(X_test_scaled, y_test))

# Evaluate the model: predict on the test set
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
mae_nn = metrics.mean_absolute_error(y_test, y_pred)
mse_nn = metrics.mean_squared_error(y_test, y_pred)
rmse_nn = np.sqrt(mse_nn)
r2_nn = metrics.r2_score(y_test, y_pred)
explained_variance_nn = metrics.explained_variance_score(y_test, y_pred)

# Print the summary of evaluation metrics
print(f"MAE: {round(mae_nn, 3)}, RMSE: {round(rmse_nn, 3)}, R^2: {round(r2_nn, 3)}, Explained Variance Score: {round(explained_variance_nn, 3)}")

EDIT: Based on @Stephan Kolassa's answer I checked a flat zero forecast. It does very badly:

  • MAE: 7.864, MSE: 221.427, RMSE: 14.88, R^2: -0.388, Explained Variance Score: 0.0

Zero Forecast Code:

import numpy as np
import pandas as pd
from sklearn import metrics

# Create a flat zero forecast
zero_forecast = pd.Series(0, index=y_test.index)

# Calculate evaluation metrics for the zero forecast
mae_zero = metrics.mean_absolute_error(y_test, zero_forecast)
mse_zero = metrics.mean_squared_error(y_test, zero_forecast)
rmse_zero = np.sqrt(mse_zero)
r2_zero = metrics.r2_score(y_test, zero_forecast)
explained_variance_zero = metrics.explained_variance_score(y_test, zero_forecast)

# Print the summary of evaluation metrics for the zero forecast
print(f"Zero Forecast: MAE: {round(mae_zero, 3)}, MSE: {round(mse_zero, 3)}, RMSE: {round(rmse_zero, 3)}, R^2: {round(r2_zero, 3)}, Explained Variance Score: {round(explained_variance_zero, 3)}")

Cohensius
  • How much data did you have to feed into your models? – Stephan Kolassa Jan 22 '24 at 13:35
  • I am in the middle of feeding the model more data; these results come from 1 million data points – Cohensius Jan 22 '24 at 13:37
  • This really depends on your data, but if you're doing forecasting I'm assuming you're working with time series, for which recurrent networks can be really useful. Not sure if this counts as 'low-hanging fruit' for you, but there's a really good course on this here. – A. Bollans Jan 22 '24 at 14:00
  • In order to improve communication among the millions of Cross Validated users, it is advisable to define your acronyms, like "NN" for example, but also the rest. You also lose out on those users who may have something to contribute that could be helpful to you, but are not sure about these acronyms. – Alecos Papadopoulos Jan 22 '24 at 16:11
  • @AlecosPapadopoulos, added explanations to the acronyms. – Cohensius Jan 23 '24 at 09:16
  • Thank you, appreciated. – Alecos Papadopoulos Jan 23 '24 at 09:24
  • @RichardHardy, yes, my bad, "mean average" is a bit silly – Cohensius Jan 23 '24 at 11:17
  • Could you make your Python code reproducible, or explain how you acquired the three different MAE values? – Sextus Empiricus Jan 23 '24 at 12:40
  • "I'm curious if there are other "low-hanging fruits" that I should explore to further reduce the error." Is this really the true question here? It would be too broad for a SE question not? – Sextus Empiricus Jan 23 '24 at 12:41
  • @SextusEmpiricus, is "What should I check or verify before I can say I found a better forecast?" a better question? – Cohensius Jan 23 '24 at 13:04
  • @Cohensius that would change the nature of the question a lot. The question 'how to improve my model' is a valid question, but an undirected question like 'how to improve (in whatever possibly low-hanging-fruit way)?' is too broad. Ideally it should be more focused. For instance, it can be related to domain knowledge which makes particular fruits more suitable than others, and in order to make such focus easier an introduction to the underlying processes behind the data is welcome. – Sextus Empiricus Jan 23 '24 at 13:42
  • Regarding your new question "Does the fact that the 2nd NN largely lowers the MAE mean it is a good forecast?": it depends a lot on what your data look like and what statistical considerations can be made. For example, 1) did you compute the MAE in the same way (based on the original variable, or on the transformed variable)? 2) How is the data obtained and what variation can we assume? What are these NaN values? Do they relate to potential sampling bias? If you include the NaN values in the testing dataset, then how is the LR model supposed to make a prediction for these cases? – Sextus Empiricus Jan 23 '24 at 13:48
  • @SextusEmpiricus, I started to measure MAE, RMSE, R^2 and Explained Variance Score; the ANN is better in all of them. Regarding what these NaN values are: the data is about users who play a mobile game. Some users are new, so they do not have a value in some of the fields, for example: #games played in the last session, #games in the session before that, and so on... some users have not yet played X sessions. The team that predicts using LR made several different models in order to avoid NaN values (a model for users with X features, a model for Y features, ...). – Cohensius Feb 04 '24 at 16:19

2 Answers

13

Your results may not mean what you think they mean.

Linear regression aims at unbiased expectation predictions. This is not what you elicit using the MAE, which elicits the conditional median: Why does minimizing the MAE lead to forecasting the median and not the mean? Essentially, you told the regression to give you expectation forecasts and your NN to give you conditional median forecasts, and then evaluated both with a quality measure that is appropriate for median predictions. This stacks the deck against regression.

If your "real" data are nonnegative, then filling NAs with zero may have reduced the median more than the expectation, so this may well have exacerbated this effect. Especially since you apparently did not retrain the linear regression on the imputed data. Fun fact: a flat zero forecast may well be optimal under MAE loss for intermittent demands.

I would very much recommend you even the playing field, by fitting a regression with MAE loss (which essentially would be a quantile regression aiming for the median), then you can compare the two predictions more meaningfully. Alternatively, first think about which functional of the unknown future distribution you want, then choose an appropriate error measure, then fit all models using this measure (Kolassa, 2020).
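For instance (a minimal sketch, assuming scikit-learn >= 1.0 and reusing the imputed, scaled matrices from the question's code), a median regression can be fit so that both models target the conditional median:

from sklearn.linear_model import QuantileRegressor
from sklearn import metrics

# 0.5-quantile (median) regression: a linear model fit under absolute-error loss,
# so it targets the same functional as a NN trained with an MAE loss
median_reg = QuantileRegressor(quantile=0.5, alpha=0.0, solver="highs")
median_reg.fit(X_train_scaled, y_train)

y_pred_medreg = median_reg.predict(X_test_scaled)
print("Median regression MAE:", metrics.mean_absolute_error(y_test, y_pred_medreg))

Note that exact quantile regression can be slow on a million rows; fitting on a subsample, or using a model with a built-in absolute-error loss, are common workarounds.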

Stephan Kolassa
  • I hugely appreciate your answers! My real data is indeed nonnegative. A flat zero forecast gets a large MAE. Do you mean that the NN results may still be worse than the linear regression, which has a larger MAE? – Cohensius Jan 23 '24 at 09:53
  • The zero forecast is really something specific to intermittent demands. What is important is that the MAE incentivizes you to predict the conditional median (which, for sufficiently intermittent demands, is zero, see that 2016 paper). Per above, I would recommend that you first figure out what you want (conditional mean, median, something else, possibly really a high quantile to set safety amounts), then pick an appropriate error measure. If you are interested in those papers, feel free to reach out to me on ResearchGate or LinkedIn (please point to this thread so I get the context). – Stephan Kolassa Jan 23 '24 at 10:17
  • Not sure how big of an effect this might have in the OP's case, but what is relevant here is the evaluation loss, not the fitting loss. So it is not necessarily best to fit the regression using absolute loss, as long as the prediction from the fitted model targets the median. E.g. fit by OLS, obtain the fitting errors, obtain their median and adjust the point forecast (targeting the conditional mean) by the median of the errors to target the median (see the sketch after this comment thread). – Richard Hardy Jan 23 '24 at 10:25
  • @RichardHardy. Any particular reason why one would not use quantile regression in the first place? – Stephan Kolassa Jan 23 '24 at 10:32
  • @StephanKolassa, Dave beat me to it. Not saying this will always work, but it is good to be aware of such cases. – Richard Hardy Jan 23 '24 at 11:33
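A minimal sketch of the OLS-plus-median-shift idea from the comments above (again reusing the imputed, scaled matrices from the question's code; this is an assumption-laden illustration, not the only way to do it):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Fit by OLS, which targets the conditional mean ...
ols = LinearRegression().fit(X_train_scaled, y_train)

# ... then shift the point forecasts by the median of the in-sample errors,
# so the adjusted forecast targets the conditional median instead
median_shift = np.median(y_train - ols.predict(X_train_scaled))
y_pred_shifted = ols.predict(X_test_scaled) + median_shift

print("OLS + median-shift MAE:", metrics.mean_absolute_error(y_test, y_pred_shifted))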
1

However, after replacing "not a number" (NaN) values with 0 and standardizing the data, the performance significantly improved:

While I consider the project successful as it is, I'm curious if there are other "low-hanging fruits" that I should explore to further reduce the error.

One thing that you could try is to replace all NaN values with a number $x$ (different from zero) and optimize by searching for which $x$ makes your model perform better. (In a neural network this would be equivalent to an extra layer that performs a transformation of the data.)

... that was an ironic answer. But it shows the problem with the approach. The improvement that you made is not necessarily a low-hanging fruit and could have been simply overfitting.

So, what is important in your considerations about more low-hanging fruit is that you try not to fall into the trap of data dredging.
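One common guard against this (a minimal sketch, assuming X_train and y_train as produced by the train_test_split in the question, before any imputation): keep the test set untouched and make all preprocessing and model choices via cross-validation inside the training data, e.g. with a pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Imputation and scaling live inside the pipeline, so they are re-fit on each
# training fold and never see the validation fold or the held-out test set
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# Cross-validated MAE on the training data only; X_test / y_test stay untouched
cv_mae = -cross_val_score(pipe, X_train, y_train,
                          scoring="neg_mean_absolute_error", cv=5)
print("CV MAE per fold:", np.round(cv_mae, 3))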

"Everything should be made as simple as possible, but not simpler"?

We can consider that quote in reverse:

"Everything should be made as complex as possible, but not complexer"?

You are trying to make something less simple. But is your current model too simple? For this you can consider hypothesis testing or information criteria.
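For the linear side, a minimal sketch of such a check (assuming statsmodels is available; the split of the scaled feature matrix into a "basic" and a "richer" nested set is purely hypothetical):

import statsmodels.api as sm

# Compare a simpler and a richer nested linear model via AIC and an F-test
X_basic = sm.add_constant(X_train_scaled[:, :5])   # hypothetical subset of features
X_rich = sm.add_constant(X_train_scaled[:, :10])   # hypothetical superset of features

fit_basic = sm.OLS(y_train, X_basic).fit()
fit_rich = sm.OLS(y_train, X_rich).fit()

print("AIC basic:", fit_basic.aic, " AIC rich:", fit_rich.aic)
print(fit_rich.compare_f_test(fit_basic))  # (F statistic, p-value, df difference)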


What exactly happened with your model is difficult to say. For example, the question "Did the NN work better than linear regression?" is difficult to answer or discuss because you didn't compare both under the same conditions.

One interesting aspect is what the NaN values mean and how they influence the model. This is difficult to assess with the information given. One potential explanation is that the NaN values exclude a large part of the data from consideration, which reduces the information available to fit the model. Imputing by replacing the NaN values with zeros is a bit arbitrary, but it may have helped you because adding data, albeit of low quality, can improve the model.
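For example (a minimal sketch reusing the imputed, scaled matrices from the question's code), re-fitting the linear regression under exactly the same preprocessing as the NN makes the comparison even:

from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Same inputs as the NN: NaNs replaced with zeros, then standardized
lr = LinearRegression().fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

print("LR (same preprocessing) MAE:",
      round(metrics.mean_absolute_error(y_test, y_pred_lr), 3))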

  • If I understand correctly, my next step should be to compare the NN and the LR under the same conditions. How do I do that? – Cohensius Jan 23 '24 at 13:08
  • @Cohensius you could create the same conditions, and subsequently compare the NN and LR with those same conditions. – Sextus Empiricus Jan 23 '24 at 13:11
  • Just verifying, you mean: replace the NaNs with zeros + standardize the data, then run the NN and the LR. – Cohensius Jan 23 '24 at 13:13
  • @Cohensius that's at least what you should do when you wish to compare NN and LR. However, stay on guard that you are not following a route of data dredging. There can be a lot of low-hanging fruit to pick, but that doesn't mean that you should try all of it. This question feels a bit uncomfortable in that sense and sounds like "tell me all the ways to reduce MAE", but reducing MAE by 'whatever way possible' should not be the goal. Ideally you have some motivation (based on domain knowledge) to pick certain specific fruits, instead of picking whatever fruit is possible. – Sextus Empiricus Jan 23 '24 at 13:35
  • Nice, I did the comparison on even ground: NaNs as zeros + standardized data. The LR improves a lot, but there is still a gap between the LR and the NN. LR results: MAE: 2.831, RMSE: 5.145, R^2: 0.697, Explained Variance Score: 0.697 – Cohensius Jan 23 '24 at 15:29