Overfitting can set in during model selection itself: when too many candidate models are tried, or when they grow too complex for the small dataset at hand, the models begin to fit the noise rather than the real pattern. That is the warning sign to watch for.
It can be hard to know how many candidate models to try. One practical answer is early stopping combined with monitoring validation scores: keep trying new models, and halt when the validation scores stop improving.
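To make that concrete, here is a minimal sketch of the idea, assuming scikit-learn, a synthetic dataset, decision trees of increasing depth as the candidate models, and a patience of three attempts. All of these choices are illustrative, not prescriptions:

```python
# Early stopping applied to model selection: try progressively more
# complex models and stop once the validation score has not improved
# for `patience` consecutive attempts.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real (small) labeled dataset.
X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_score, best_model = 0.0, None
patience, stalled = 3, 0

for depth in range(1, 30):  # each step is a more complex candidate model
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy
    if score > best_score:
        best_score, best_model, stalled = score, model, 0
    else:
        stalled += 1
        if stalled >= patience:  # validation score stopped improving
            break

print(f"Selected depth={best_model.get_depth()}, val accuracy={best_score:.3f}")
```

Here a "model" is just a tree of a given depth; in practice each iteration might be a different architecture or hyperparameter setting, but the stopping logic is the same.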
Always aim for the simplest solution that works. Overfitting is especially likely on datasets with fewer than about 1,000 labeled examples, so simpler models are recommended there. Occam's razor is a dependable default, even if it is no guarantee.
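As a rough illustration of that heuristic, the sketch below compares a simple linear model against an unconstrained decision tree on a small synthetic dataset using cross-validation. The dataset size, the two model choices, and the five-fold split are assumptions for demonstration only:

```python
# Small-data heuristic: with a few hundred labels, compare a simple
# model against a high-capacity one and let cross-validation decide.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 300 examples, 50 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

candidates = [
    ("logistic regression (simple)", LogisticRegression(max_iter=1000)),
    ("unconstrained tree (complex)", DecisionTreeClassifier(random_state=0)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

On data this small, the simpler model will typically hold up as well as or better than the high-capacity one, which is exactly the point of reaching for it first.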
None of this is a perfect science, after all: you try things, gain insight, and adjust as you go.