I'm trying to build an entity matching model. There are two kinds of features: binary (0/1) and text. Initially I built a deep learning model that uses character-level embeddings of some of the text features, word-level embeddings of the remaining text features, and the binary features as additional inputs.
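For concreteness, the architecture is roughly along these lines (vocabulary sizes, sequence lengths and layer widths below are illustrative placeholders, not my actual values):

```
from tensorflow.keras import layers, Model

# Illustrative sizes only -- not the real vocabularies / sequence lengths.
MAX_CHARS, MAX_WORDS, N_BINARY = 200, 50, 10
CHAR_VOCAB, WORD_VOCAB = 100, 20000

char_in = layers.Input(shape=(MAX_CHARS,), dtype='int32', name='char_seq')
word_in = layers.Input(shape=(MAX_WORDS,), dtype='int32', name='word_seq')
bin_in  = layers.Input(shape=(N_BINARY,), name='binary_feats')

# Character-level and word-level embeddings, pooled to fixed-size vectors.
char_emb = layers.GlobalAveragePooling1D()(layers.Embedding(CHAR_VOCAB, 32)(char_in))
word_emb = layers.GlobalAveragePooling1D()(layers.Embedding(WORD_VOCAB, 128)(word_in))

# Combine both text representations with the binary features.
x = layers.concatenate([char_emb, word_emb, bin_in])
x = layers.Dense(64, activation='relu')(x)
out = layers.Dense(3, activation='softmax')(x)

model_DL = Model(inputs=[char_in, word_in, bin_in], outputs=out)
model_DL.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```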
The output goes through a softmax over 3 classes, so the predictions are basically an $n \times 3$ array (where $n$ is the number of input rows).
I've made three splits: train, val and test. When training the DL model with Keras, I specified train as the training data and val as the validation data, and I measured performance on the test split to get the DL model's metrics. The softmax outputs for all three splits were then obtained with model_DL.predict.
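Roughly, the first-stage training and prediction flow looks like this (X_train_inputs etc. are shorthand for the preprocessed input lists each split is turned into; the epoch/batch settings are placeholders):

```
# Train on the train split, validate on the val split.
model_DL.fit(X_train_inputs, y_train,
             validation_data=(X_val_inputs, y_val),
             epochs=20, batch_size=128)

# Softmax outputs for all three splits, each of shape (n_split, 3).
predtrain = model_DL.predict(X_train_inputs)
predval   = model_DL.predict(X_val_inputs)
predtest  = model_DL.predict(X_test_inputs)
```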
Next, I used a Random Forest model as a second stage. Its inputs are all the binary features plus the softmax outputs. For example, I took the train split, removed the text features and added the columns of the predicted array as separate features. To be specific, if predtrain was obtained by calling model_DL.predict on the train split, then the additional features were added via train['class1prob'] = predtrain[:,0], train['class2prob'] = predtrain[:,1], train['class3prob'] = predtrain[:,2].
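As a sketch of that augmentation (text_cols and the augment helper are just my shorthand here, not actual names):

```
def augment(df, probs, text_cols):
    # Drop the raw text columns (keeping the binary features)
    # and attach the three softmax probabilities as new columns.
    out = df.drop(columns=text_cols).copy()
    out['class1prob'] = probs[:, 0]
    out['class2prob'] = probs[:, 1]
    out['class3prob'] = probs[:, 2]
    return out

train_aug = augment(train, predtrain, text_cols)
```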
I did the same for the val and test splits. I then trained the RF on the augmented train split and measured its performance on the val and test splits. The F-scores on the val and test splits were around 0.85, 0.74 and 0.73 for the three classes respectively (i.e. performance was similar on both splits).
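A sketch of the second stage (y_train/y_val/y_test are the 3-class labels, kept separate from the feature columns; the RF hyperparameters are placeholders):

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

val_aug  = augment(val, predval, text_cols)
test_aug = augment(test, predtest, text_cols)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(train_aug, y_train)

# Per-class F-scores on each split.
for name, X, y in [('train', train_aug, y_train),
                   ('val', val_aug, y_val),
                   ('test', test_aug, y_test)]:
    print(name, f1_score(y, rf.predict(X), average=None))
```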
BUT for the train split the predictions were near perfect - 0.98, 0.99, 0.98 F-scores for the three classes. My intuition is that overfitting of the second-stage RF on train is understandable, since its softmax-output features were predicted by the first-stage DL model, which was itself trained on train. There's also some data leakage for the val split, since val was used as the validation set to tune the DL model in Keras, so maybe even the val metrics aren't fully reliable. But there is no leakage for the test set.
My question is: in this scenario have I made an error, or is this kind of blatant overfitting on the train split normal for two-stage models? If there is a glaring error, is there a best practice to fix it?