So far, there is no theoretically justified or standard way to train a typical neural network on new data alone, without the old data, that is equivalent to training on the union of the two datasets.
The closest thing today is a Bayesian modelling approach, but even then catastrophic forgetting remains, and that approach applies to Bayesian Neural Networks (BNNs) rather than to typical artificial neural networks (ANNs).
This can only occur when the parameter space is
- identifiable [ref 1], meaning that the model is able to represent the observed process; identifiability is typically understood as a single parameter being point-identifiable from the information available in the observations, and
- "consistent", i.e. a "well behaved" parameter space with a "well behaved" update function that can converge to the true parameter.
This is discussed at length in [ref 2, Ch. 6--9], specifically with regard to Bayesian modelling. Unfortunately, that book is not open access.
What you are seeking in such a case is a mathematical and algorithmic formalization of inductive reasoning and inference, which has been of great interest for centuries. The Bayesian modelling approach tends to be framed in terms of probabilistic induction. Some relevant books include [ref 3, Ch. 5] and [ref 4, Ch. 2].
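To make the Bayesian point concrete, here is a minimal sketch (my own illustration, not from any standard workflow; the Beta-Binomial model and variable names are chosen for simplicity) of the one kind of setting where updating on $D_2$ alone, starting from the posterior obtained on $D_1$, is provably identical to a single update on $D_1 \cup D_2$:

```python
# Sketch: sequential vs. batch Bayesian updating in a conjugate Beta-Binomial model.
# The posterior after seeing D1 and then D2 equals the posterior from D1 ∪ D2 exactly,
# which is the property that has no general analogue for SGD-trained ANNs.
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.binomial(1, 0.7, size=100)   # "old" data
d2 = rng.binomial(1, 0.7, size=50)    # "new" data

a0, b0 = 1.0, 1.0                      # Beta(1, 1) prior

# Sequential: posterior from D1 becomes the prior for D2.
a1, b1 = a0 + d1.sum(), b0 + len(d1) - d1.sum()
a_seq, b_seq = a1 + d2.sum(), b1 + len(d2) - d2.sum()

# Batch: a single update on the union D1 ∪ D2.
d_all = np.concatenate([d1, d2])
a_batch, b_batch = a0 + d_all.sum(), b0 + len(d_all) - d_all.sum()

assert (a_seq, b_seq) == (a_batch, b_batch)   # identical posteriors
```

This exactness is what conjugacy and identifiability buy you in the Bayesian setting, and it is exactly what SGD-trained ANNs lack.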
Engineering Approaches
Theory aside, there are plenty of engineering approaches.
Given how ANNs are trained, through typical back-propagation and stochastic gradient descent (SGD), the weights of a trained ANN alone are typically not enough to properly adjust those weights given new data; you also need more information about the prior training run, such as the state of the SGD method, including the learning rate and the other parameters and internal state of the optimizer used.
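For example, if you anticipate updating later, you can at least preserve the optimizer state alongside the weights. A minimal sketch, assuming PyTorch (the architecture and file name are placeholders):

```python
# Sketch: save optimizer state together with the weights after training on D1,
# so a later update on new data can resume the same optimizer configuration
# (learning rate, momentum buffers, etc.).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# ... train on D1 ...

torch.save(
    {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),  # includes momentum buffers
    },
    "checkpoint_d1.pt",
)
```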
As others have mentioned, you can approach this from an online-learning or incremental-learning perspective; however, those subfields also do not have a theoretically sound solution (as requested by the bounty).
To be honest, why SGD works is still somewhat of a mystery, and the reality is that there is no sound way (as far as I know) to use SGD to update your weights without all of the prior training data along with the new data.
The only way to get this is to constrain the space of the parameters (weights) being learned, in a fashion similar to Bayesian modelling. BNNs are one potential way to explore updating given only the most recent data, but even then there is difficulty.
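One crude way to approximate such a constraint, and it is only an approximation rather than a theoretically sound solution, is to penalize movement away from the $D_1$ weights, which corresponds to an isotropic Gaussian prior centred on them (methods such as Elastic Weight Consolidation refine this with a Fisher-information weighting). A minimal sketch, assuming PyTorch; `prior_penalty`, `lambda_reg`, `old_params`, and `loss_d2` are placeholder names of my own:

```python
# Sketch: a quadratic penalty keeping the new weights close to those learned on D1,
# i.e. an isotropic Gaussian prior centred on the old parameters. This is a crude
# stand-in for a proper Bayesian treatment, not an equivalent of retraining.
import torch

def prior_penalty(model, old_params, lambda_reg=1.0):
    """Sum of squared deviations from the D1 parameters."""
    penalty = 0.0
    for p, p_old in zip(model.parameters(), old_params):
        penalty = penalty + ((p - p_old.detach()) ** 2).sum()
    return lambda_reg * penalty

# Usage inside the D2 training loop (loss_d2 is the ordinary task loss):
#   old_params = [p.clone() for p in model.parameters()]   # snapshot after D1
#   total_loss = loss_d2 + prior_penalty(model, old_params)
#   total_loss.backward()
```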
In the end, the transduction approach of "re-train the model on all the data available" is the current standard.
"Hot start": Use prior weights at start of training
The issue is, as mentioned before, that the initial weights alone will not produce a model equivalent to one trained on the union of the datasets. This is mostly because SGD is stochastic and unconstrained. Even if the new data is rather similar to the old data and you would rather keep exploring around the old local minimum, the optimization method (especially one with a temperature setting) may quickly travel away from the initial weights. This is where having the parameters and final state of the optimizer can help you explore around the prior local minimum, rather than only providing the initial weights.
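A minimal hot-start sketch, assuming PyTorch and the checkpoint format from the earlier sketch (the data, loss, and architecture are placeholders); note that restoring the optimizer state only resumes the previous trajectory, it does not prevent drift away from the $D_1$ minimum:

```python
# Sketch: "hot start" on the new data D2 by restoring both the weights and the
# optimizer state, rather than only the initial weights.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

ckpt = torch.load("checkpoint_d1.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])  # resume momentum buffers etc.

# Placeholder stand-in for the new dataset D2.
d2_loader = DataLoader(TensorDataset(torch.randn(50, 10), torch.randn(50, 1)), batch_size=10)
loss_fn = nn.MSELoss()

for x, y in d2_loader:                 # iterate over the new data only
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()                   # may still travel far from the D1 minimum
```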
Freeze (part of) ANN then Fine-Tune
As others mentioned, freezing the weights of the trained network and then training only a subset of the network, or a new ANN appended to the end of the frozen one, is one approach; however, it is an engineering approach with nuance in how to do it appropriately.
You could explore keeping the ANN trained on $D_1$ frozen and then, rather than only having a trainable ANN at the end for fine-tuning, training a new ANN, say a clone of the old network, in parallel. The new model is then trained on $D_2$, receiving both the raw input and that input run through the frozen network. This avoids losing possibly relevant new information in $D_2$ that was not in $D_1$.
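A minimal sketch of that combined approach, assuming PyTorch (dimensions and names are illustrative): the $D_1$ network is frozen, and the new trainable head receives both the raw input and the frozen network's features.

```python
# Sketch: freeze the network trained on D1 and train a new "head" on D2 that sees
# both the raw input and the frozen network's features, so new information in D2
# is not filtered out by the frozen model.
import torch
import torch.nn as nn

frozen = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 16))  # trained on D1
for p in frozen.parameters():
    p.requires_grad = False            # freeze the D1 model

head = nn.Sequential(nn.Linear(10 + 16, 32), nn.ReLU(), nn.Linear(32, 1))

def forward(x):
    with torch.no_grad():
        feats = frozen(x)              # the input as seen through the D1 model
    return head(torch.cat([x, feats], dim=-1))   # raw input + frozen features

optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)  # only the head is trained
```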
Information Theoretic Perspective on Frozen Pre-trained Models
When you freeze the weights of the ANN trained on the first dataset $D_1$ and then train a "fine-tuning" layer or ANN at the end of that frozen network (meaning the penultimate layer, or some earlier layer, serves as the input to the fine-tuning ANN rather than the task output), you are limiting the amount of new information that can be learned from the new data $D_2$, because the frozen ANN acts as a filter on the input data.
This is related to the data processing inequality: the information you care to learn must be available in the observations (the training data), and without providing any further information, the only information available to solve the task is in that training data. When you train the ANN on $D_1$ and freeze it, the ANN was optimized for that data, so any relevant new information in $D_2$ can be filtered out when it is run through this network, as in the fine-tuning approach.
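Formally, if the target is $Y$ and the frozen network $f$ feeds a fine-tuning head $g$, then $Y \to X \to f(X) \to g(f(X))$ is a Markov chain and the data processing inequality gives
$$ I\big(Y; g(f(X))\big) \;\le\; I\big(Y; f(X)\big) \;\le\; I(Y; X), $$
so any information about $Y$ that the frozen network discards cannot be recovered by the head alone.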
This is the justification for giving the fine-tuning ANN both the raw data and the data run through the frozen ANN. However, this still will not be equivalent to training on all of the data together, as done in transduction.
Further Reading
Kevin Patrick Murphy's Probabilistic Machine Learning books are open access and useful for getting a grasp of the different approaches to, and nuances of, optimizing ANNs, as well as the different approaches to the problem of updating an ANN.
References
- [ref 1] Lewbel, Arthur. 2019. "The Identification Zoo: Meanings of Identification in Econometrics." Journal of Economic Literature, 57 (4): 835-903. Open-access preprint: https://www.bc.edu/content/dam/bc1/schools/mcas/economics/pdf/working-papers/wp957.pdf
- [ref 2] Ghosal, Subhashis, and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Vol. 44. Cambridge University Press, 2017. Closed access.
- [ref 3] Li, Ming, and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Vol. 3. New York: Springer, 2008. Closed access: https://link.springer.com/book/10.1007/978-3-030-11298-1
- [ref 4] Hutter, Marcus. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2004. Closed access: https://link.springer.com/book/10.1007/b138233 ; related public slides: http://www.hutter1.net/ai/suaibook.pdf