I've recently started experimenting with Genetic Programming as an optimization tool. I'm still a little confused as to how to reduce overfitting in this framework.
A few techniques I've read about:
- Limiting the number of generations
- Limiting the tree (genome) size
- Adding a complexity penalty to the fitness function
- Incorporating cross-validation into the fitness function
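To make the complexity-penalty idea (technique 3) concrete, here's a toy sketch of what I understand it to mean. The names `raw_error`, `tree_size`, and the weight `ALPHA` are all hypothetical stand-ins, not from any particular GP library:

```python
# Sketch of a parsimony-pressure fitness: raw error plus a size penalty.
# ALPHA is an assumed tunable weight; raw_error and tree_size stand in
# for whatever error measure and genome size your GP system exposes.
ALPHA = 0.01

def fitness(raw_error, tree_size):
    """Lower is better: prediction error plus a penalty on program size."""
    return raw_error + ALPHA * tree_size

# A smaller tree with the same error scores (strictly) better:
assert fitness(1.0, 10) < fitness(1.0, 50)
```

The point is just that two programs with equal error are no longer tied; the smaller one wins selection.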
Of the methods above, the fourth is the most intriguing/confusing to me. I've read this post on SO, How to avoid overfitting when using crossvalidation within Genetic Algorithms, but it didn't give me much insight into the actual implementation.
Let's say I'm trying to implement cross-validation in my fitness function. I can "train" my GP on some time period (say 2000-2010) and evaluate its fitness on another time period (say 2010-2014). But won't this just introduce overfitting on the 2010-2014 period? The GP will simply gravitate towards programs that happen to score well on the testing period.
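Here's a minimal sketch of the setup I'm describing, with a hypothetical `evaluate` helper scoring a program over (x, y) pairs; the toy data just stands in for the two date ranges:

```python
# Sketch: selection pressure comes only from the training window; the
# holdout window is scored separately. The concern in the text is what
# happens if holdout performance leaks back into selection.

def evaluate(program, data):
    """Mean squared error of `program` over (x, y) pairs; lower is better."""
    return sum((program(x) - y) ** 2 for x, y in data) / len(data)

train = [(x, 2 * x) for x in range(10)]        # stands in for 2000-2010
holdout = [(x, 2 * x) for x in range(10, 15)]  # stands in for 2010-2014

def fitness(program):
    # Only the training window drives evolution here.
    return evaluate(program, train)

candidate = lambda x: 2 * x
train_err, holdout_err = fitness(candidate), evaluate(candidate, holdout)
```

If instead `fitness` were computed on `holdout` generation after generation, the holdout window would effectively become a second training set, which is exactly the overfitting worry.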
Alternatively, you could say that you want a program that performs well on both 2000-2010 (performance P1) and 2010-2014 (performance P2). Say your fitness function requires the performance on the two periods to be high and similar (e.g. Fitness = (P1 + P2) / (1 + |P1 - P2|), which rewards the total while penalizing the gap), but then isn't that effectively the same as just evaluating fitness on the whole period 2000-2014?
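As a toy illustration of a "high and similar" fitness, here's one possible form (the exact formula is my own guess at what such a function could look like, not a standard one):

```python
# Sketch: reward high combined performance across the two windows,
# discounted by how far apart the two performances are.

def combined_fitness(p1, p2):
    """Higher is better: total performance divided by (1 + performance gap)."""
    return (p1 + p2) / (1.0 + abs(p1 - p2))

# The same total performance, split evenly, beats an uneven split:
assert combined_fitness(0.5, 0.5) > combined_fitness(0.9, 0.1)
```

Note this is not quite the same as scoring 2000-2014 as one block: a program that does great on one decade and poorly on the other gets penalized here, whereas a single pooled score could hide that.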
I guess what I'm asking is: is it even possible to implement a form of cross-validation in a fitness function? GP isn't a parameterized algorithm like kNN, SVM, or simple linear regression, so I'm confused as to how cross-validation even works in this setting. I can always just evaluate the output of the naive GP on test data, but in my experiments that hasn't performed well, because of GP's tendency to overfit.
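For reference, the closest thing I can picture is averaging a program's error over several contiguous time slices inside the fitness function, so no single window dominates selection. This is only a sketch under that assumption (`evaluate` is again a hypothetical scoring helper, and contiguous folds are used because the data is a time series):

```python
# Sketch: k-fold-style fitness for a time series, averaging error over
# contiguous slices rather than shuffled folds.

def evaluate(program, data):
    """Mean squared error of `program` over (x, y) pairs; lower is better."""
    return sum((program(x) - y) ** 2 for x, y in data) / len(data)

def cv_fitness(program, data, k=5):
    """Average error across k contiguous folds; lower is better."""
    fold = len(data) // k
    scores = [evaluate(program, data[i * fold:(i + 1) * fold]) for i in range(k)]
    return sum(scores) / k

series = [(x, 2 * x) for x in range(20)]
score = cv_fitness(lambda x: 2 * x, series)
```

Whether this actually reduces overfitting, or just smears it across the folds, is essentially my question.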