So far, there is no theoretically justified or standard way to train a typical neural network on new data alone, without the old data, that is equivalent to training on the union of the two datasets.
The closest thing today is a Bayesian modelling approach, but even then catastrophic forgetting remains, and that approach applies to Bayesian Neural Networks (BNNs) rather than to typical artificial neural networks (ANNs).
This can only occur when the parameter space is
- identifiable [ref 1], meaning that the model is able to represent the observed process; identifiability is typically understood as a single parameter being point-identifiable from the information available in the observations, and
- "consistent", i.e. a "well behaved" parameter space with a "well behaved" update function that can converge to the true parameter.
This is discussed at length in [ref 2, Ch. 6--9], specifically with regard to Bayesian modelling. Unfortunately, that book is not open access.
What you are seeking in such a case is a mathematical and algorithmic formalization of inductive reasoning and inference, which has been of great interest for centuries. The Bayesian modelling approach tends to be framed in terms of probabilistic induction. Some relevant books include [ref 3, Ch. 5] and [ref 4, Ch. 2].
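To make the Bayesian point concrete, here is a minimal sketch (my own illustration, not from any standard workflow; the Beta-Binomial model and variable names are chosen for simplicity) of the one kind of setting where updating on $D_2$ alone, starting from the posterior obtained on $D_1$, is provably identical to a single update on $D_1 \cup D_2$:

```python
# Sketch: sequential vs. batch Bayesian updating in a conjugate Beta-Binomial model.
# The posterior after seeing D1 and then D2 equals the posterior from D1 ∪ D2 exactly,
# which is the property that has no general analogue for SGD-trained ANNs.
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.binomial(1, 0.7, size=100)   # "old" data
d2 = rng.binomial(1, 0.7, size=50)    # "new" data

a0, b0 = 1.0, 1.0                      # Beta(1, 1) prior

# Sequential: posterior from D1 becomes the prior for D2.
a1, b1 = a0 + d1.sum(), b0 + len(d1) - d1.sum()
a_seq, b_seq = a1 + d2.sum(), b1 + len(d2) - d2.sum()

# Batch: a single update on the union D1 ∪ D2.
d_all = np.concatenate([d1, d2])
a_batch, b_batch = a0 + d_all.sum(), b0 + len(d_all) - d_all.sum()

assert (a_seq, b_seq) == (a_batch, b_batch)   # identical posteriors
```

This exactness is what conjugacy and identifiability buy you in the Bayesian setting, and it is exactly what SGD-trained ANNs lack.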
Engineering Approaches
Theory aside, there are plenty of engineering approaches.
Given how ANNs are trained, through typical back-propagation and stochastic gradient descent (SGD), the weights of a trained ANN alone are typically not enough to properly adjust those weights given new data; you also need more information about the prior training run, such as the state of the SGD method, including the learning rate and the other parameters and internal state of the optimizer used.
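For example, if you anticipate updating later, you can at least preserve the optimizer state alongside the weights. A minimal sketch, assuming PyTorch (the architecture and file name are placeholders):

```python
# Sketch: save optimizer state together with the weights after training on D1,
# so a later update on new data can resume the same optimizer configuration
# (learning rate, momentum buffers, etc.).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# ... train on D1 ...

torch.save(
    {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),  # includes momentum buffers
    },
    "checkpoint_d1.pt",
)
```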
As others have mentioned, you can approach this from an online-learning or incremental-learning perspective; however, those subfields also do not have a theoretically sound solution (as requested by the bounty).
To be honest, why SGD works is still somewhat of a mystery, and the reality is that there is no sound way (as far as I know) to use SGD to update your weights without all of the prior training data along with the new data.
The only way to get this is to constrain the space of the parameters (weights) being learned, in a fashion similar to Bayesian modelling. BNNs are one potential way to explore updating given only the most recent data, but even then there is difficulty.
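One crude way to approximate such a constraint, and it is only an approximation rather than a theoretically sound solution, is to penalize movement away from the $D_1$ weights, which corresponds to an isotropic Gaussian prior centred on them (methods such as Elastic Weight Consolidation refine this with a Fisher-information weighting). A minimal sketch, assuming PyTorch; `prior_penalty`, `lambda_reg`, `old_params`, and `loss_d2` are placeholder names of my own:

```python
# Sketch: a quadratic penalty keeping the new weights close to those learned on D1,
# i.e. an isotropic Gaussian prior centred on the old parameters. This is a crude
# stand-in for a proper Bayesian treatment, not an equivalent of retraining.
import torch

def prior_penalty(model, old_params, lambda_reg=1.0):
    """Sum of squared deviations from the D1 parameters."""
    penalty = 0.0
    for p, p_old in zip(model.parameters(), old_params):
        penalty = penalty + ((p - p_old.detach()) ** 2).sum()
    return lambda_reg * penalty

# Usage inside the D2 training loop (loss_d2 is the ordinary task loss):
#   old_params = [p.clone() for p in model.parameters()]   # snapshot after D1
#   total_loss = loss_d2 + prior_penalty(model, old_params)
#   total_loss.backward()
```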
In the end, the transduction approach of "re-train the model on all the data available" is the current standard.
"Hot start": Use prior weights at start of training
The issue is, as mentioned before, that the initial weights alone will not produce a model equivalent to one trained on the union of the datasets. This is mostly because SGD is stochastic and unconstrained. Even if the new data is rather similar to the old data and you would rather keep exploring around the old local minimum, the optimization method (especially one with a temperature setting) may quickly travel away from the initial weights. This is where having the parameters and final state of the optimizer can help you explore around the prior local minimum, rather than only providing the initial weights.
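A minimal hot-start sketch, assuming PyTorch and the checkpoint format from the earlier sketch (the data, loss, and architecture are placeholders); note that restoring the optimizer state only resumes the previous trajectory, it does not prevent drift away from the $D_1$ minimum:

```python
# Sketch: "hot start" on the new data D2 by restoring both the weights and the
# optimizer state, rather than only the initial weights.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

ckpt = torch.load("checkpoint_d1.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])  # resume momentum buffers etc.

# Placeholder stand-in for the new dataset D2.
d2_loader = DataLoader(TensorDataset(torch.randn(50, 10), torch.randn(50, 1)), batch_size=10)
loss_fn = nn.MSELoss()

for x, y in d2_loader:                 # iterate over the new data only
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()                   # may still travel far from the D1 minimum
```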
Freeze (part of) ANN then Fine-Tune
As others mentioned, freezing the weights of the trained network and then training only a subset of the network, or a new ANN appended to the end of the frozen one, is one approach; however, it is an engineering approach with nuance in how to do it appropriately.
You could explore keeping the ANN trained on $D_1$ frozen and then, rather than only having a trainable ANN at the end for fine-tuning, training a new ANN, say a clone of the old network, in parallel. The new model is then trained on $D_2$, receiving both the raw input and that input run through the frozen network. This avoids losing possibly relevant new information in $D_2$ that was not in $D_1$.
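A minimal sketch of that combined approach, assuming PyTorch (dimensions and names are illustrative): the $D_1$ network is frozen, and the new trainable head receives both the raw input and the frozen network's features.

```python
# Sketch: freeze the network trained on D1 and train a new "head" on D2 that sees
# both the raw input and the frozen network's features, so new information in D2
# is not filtered out by the frozen model.
import torch
import torch.nn as nn

frozen = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 16))  # trained on D1
for p in frozen.parameters():
    p.requires_grad = False            # freeze the D1 model

head = nn.Sequential(nn.Linear(10 + 16, 32), nn.ReLU(), nn.Linear(32, 1))

def forward(x):
    with torch.no_grad():
        feats = frozen(x)              # the input as seen through the D1 model
    return head(torch.cat([x, feats], dim=-1))   # raw input + frozen features

optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)  # only the head is trained
```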
Information Theoretic Perspective on Frozen Pre-trained Models
When you freeze the weights of the ANN trained on the first dataset $D_1$ and then train a "fine-tuning" layer or ANN at the end of that frozen network (meaning the penultimate layer, or some earlier layer, serves as the input to the fine-tuning ANN rather than the task output), you are limiting the amount of new information that can be learned from the new data $D_2$, because the frozen ANN acts as a filter on the input data.
This is related to the data processing inequality: the information you care to learn must be available in the observations (the training data), and without providing any further information, the only information available to solve the task is in that training data. When you train the ANN on $D_1$ and freeze it, the ANN was optimized for that data, so any relevant new information in $D_2$ can be filtered out when it is run through this network, as in the fine-tuning approach.
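Formally, if the target is $Y$ and the frozen network $f$ feeds a fine-tuning head $g$, then $Y \to X \to f(X) \to g(f(X))$ is a Markov chain and the data processing inequality gives
$$ I\big(Y; g(f(X))\big) \;\le\; I\big(Y; f(X)\big) \;\le\; I(Y; X), $$
so any information about $Y$ that the frozen network discards cannot be recovered by the head alone.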
This is the justification for giving the fine-tuning ANN both the raw data and the data run through the frozen ANN. However, this still will not be equivalent to training on all of the data together, as done in transduction.
Further Reading
Kevin Patrick Murphy's Probabilistic Machine Learning books are open access and useful for getting a grasp of the different approaches to, and nuances of, optimizing ANNs, as well as the different approaches to the problem of updating an ANN.
References
- [ref 1] Lewbel, Arthur. 2019. "The Identification Zoo: Meanings of Identification in Econometrics." Journal of Economic Literature, 57 (4): 835-903. Open-access preprint: https://www.bc.edu/content/dam/bc1/schools/mcas/economics/pdf/working-papers/wp957.pdf
- [ref 2] Ghosal, Subhashis, and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Vol. 44. Cambridge University Press, 2017. Closed access.
- [ref 3] Li, Ming, and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Vol. 3. New York: Springer, 2008. Closed access: https://link.springer.com/book/10.1007/978-3-030-11298-1
- [ref 4] Hutter, Marcus. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2004. Closed access: https://link.springer.com/book/10.1007/b138233 ; related public slides: http://www.hutter1.net/ai/suaibook.pdf