I am doing a document classification task and I find that simple BOW features with a random forest provide better results than complex models like BERT or ELECTRA, even after some parameter tuning. What could be the reasons behind this? Many of the documents I need to classify lack continuous text, so maybe single terms are enough to figure out which class they belong to. I am also dealing with a highly imbalanced dataset, but the difference is still so big that I wonder whether I am missing something. I would have assumed BERT and other pretrained NLP models would at least match a simple BOW representation. In what general cases do you think a BOW model could outscore BERT or similar deep NN models? Maybe I can get some tips to inspect the data.
1 Answer
It is not true that big models always outperform smaller, simpler ones. There are examples where logistic regression outperforms deep neural networks, examples where an LSTM outperforms more complicated language models, and even cases where much simpler models work well for NLP tasks.
Your data may differ from the data that was used to pretrain BERT or other pretrained models. Say the pretrained model was trained on a Reddit corpus, while you want to classify legal or technical documents, or theoretical physics journal papers; these would differ greatly in language at all levels, starting with the words themselves, their length, the grammar, etc. Fine-tuning a pretrained model in such cases may help, but does not have to.
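One quick way to check for such a domain mismatch is to look at how aggressively BERT's tokenizer fragments your documents: domain-specific terms the pretraining corpus never contained get split into many subword pieces. Below is a minimal sketch, assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint; `documents` is a hypothetical stand-in for your corpus.

```python
# Minimal sketch: measure subword fragmentation under BERT's tokenizer.
# Assumes `transformers` is installed; `documents` is a placeholder for your data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

documents = ["EBITDA margin 4.2% FY2021", "see schedule C, line 7"]  # placeholder

fragment_ratios = []
for doc in documents:
    words = doc.split()
    if not words:
        continue
    # Tokenize without special tokens to count subword pieces per whitespace word.
    pieces = tokenizer.tokenize(doc)
    fragment_ratios.append(len(pieces) / len(words))

print(f"avg subword pieces per word: {sum(fragment_ratios) / len(fragment_ratios):.2f}")
# As a rough heuristic, a ratio far above 1 means many of your terms are
# out-of-vocabulary for BERT and only seen as fragments, which can hurt fine-tuning.
```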
Finally, in some cases you don't need complicated models. Many problems can be, and are, solved with very simple models such as random forests, logistic regression, or even rule-based systems. It can be enough to classify a text just by checking whether it contains particular keywords (say, "covid" for documents related to the COVID-19 illness).
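To see whether that is what is happening in your case, you can inspect which terms your random forest actually relies on. Here is a minimal sketch with scikit-learn; `texts` and `labels` are hypothetical placeholders for your data. If a handful of keywords carries most of the importance, your documents are probably keyword-separable and BOW is a natural fit.

```python
# Minimal sketch: rank BOW features by random forest importance.
# Assumes scikit-learn; `texts` and `labels` are placeholders for your data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

texts = ["covid vaccine trial", "quarterly revenue report"]  # placeholder
labels = ["health", "finance"]                               # placeholder

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, labels)

# Sort terms by impurity-based importance; if a few keywords dominate,
# single terms are likely enough to separate your classes.
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.feature_importances_)[::-1]
for term, weight in zip(terms[order][:20], clf.feature_importances_[order][:20]):
    print(f"{term}\t{weight:.4f}")
```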