
I have a small dataset of 4840 samples (60% neutral, 28% positive, 12% negative). I apply data augmentation to the training set (70% train / 30% test), which gives me about 2000 samples per class, while the test set stays unbalanced (800 neutral, 400 positive, 200 negative). I use synonym replacement from WordNet and word insertion with contextual embeddings, both taken from the nlpaug library. With 10-fold cross-validation the performance doesn't increase very much: without data augmentation I get 86% accuracy and 84% F1, with data augmentation 87% accuracy and 85% F1.
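For reference, a minimal sketch of the two augmenters described above using the nlpaug API (the 3-word cap and the generic 'bert-base-uncased' model name are illustrative assumptions, not necessarily the exact configuration used):

    import nlpaug.augmenter.word as naw

    # Synonym replacement from WordNet, capped at 3 replaced words per sentence.
    syn_aug = naw.SynonymAug(aug_src='wordnet', aug_max=3)

    # Word insertion driven by a contextual masked-LM; 'bert-base-uncased' is a
    # placeholder for whichever model is actually used.
    ins_aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action='insert', aug_max=3)

    text = "The company reported strong quarterly earnings."
    print(syn_aug.augment(text))  # recent nlpaug versions return a list of strings
    print(ins_aug.augment(text))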

P.S. I tried different data augmentation strategies and the results are quite similar. Given the train and validation loss values below, I was expecting overfitting.

The train and validation loss after 10 epochs: val_loss = 0.228, train_loss = 1.661.

What is the problem?

1 Answer


I'm not sure there is necessarily any problem. If you are already fine-tuning a pre-trained language model (e.g. ULMFiT, BERT or something similar), how much you gain through augmentation varies a lot from application to application. It may have been reasonable to hope for a bit more, but it may simply not add that much. If you look at this paper (also a sentiment classification example), where round-trip translation was used as augmentation, the gains over just fine-tuning a pre-trained model are huge when you have very little data (e.g. 50 examples). However, they get a lot smaller (an order of magnitude similar to what you see, Figure 1) when the number of examples is larger. I looked at both the 5000-record case (similar number of records to yours) and the 500-record case (similar error rate to yours, and the smallest class is about the same size as in your case). I'm not sure how I'd expect round-trip translation as an augmentation to compare with what you did, but I'd speculate it might even be more effective. By the way, that type of augmentation is now very easy, because huggingface offers translation models, so you could easily give it a try.
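For illustration, a minimal round-trip-translation sketch using huggingface translation pipelines (the Helsinki-NLP model names and the English-German pair are just one possible choice, not something taken from the paper):

    from transformers import pipeline

    # One possible model pair; any reasonable en->X and X->en translation models work.
    to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    def back_translate(text):
        """English -> German -> English gives a paraphrased variant of the input."""
        german = to_de(text, max_length=512)[0]["translation_text"]
        return to_en(german, max_length=512)[0]["translation_text"]

    print(back_translate("The company reported strong quarterly earnings."))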

To check whether there's an issue: did you inspect some augmented examples to see whether the augmentations really modify a lot of the text without changing the meaning? Did you look at the validation cases that get misclassified (are they simply ambiguous cases that no model could realistically get right? Are some of the labels wrong? etc.)?
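For the second check, a minimal sketch (val_texts, val_labels and predict are placeholders for whatever your own pipeline provides):

    # Print the validation examples the model gets wrong, so they can be read manually.
    # `val_texts`, `val_labels` and `predict` are placeholders for your own objects.
    for text, gold in zip(val_texts, val_labels):
        pred = predict(text)
        if pred != gold:
            print(f"gold={gold}  pred={pred}  text={text[:120]}")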

Other ideas:

  • With so few examples, it's realistic to try counter-factual augmentation of your training data, i.e. have a human manually create 2 or more versions of each training example that make the smallest possible change to the text that (in the human's judgement) flips the category to each of the other 2 categories. It might be enough to do this only for the training cases that are particularly hard (e.g. selected using something like the BatchBALD criterion).

  • You could try test-time augmentation like in the paper I linked (see the sketch after this list).
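A minimal test-time-augmentation sketch (here augment stands for any augmentation function, e.g. one of those mentioned above, and predict_proba for a function returning class probabilities; both are placeholders):

    import numpy as np

    def tta_predict(text, augment, predict_proba, n_aug=4):
        """Average class probabilities over the original text and augmented copies."""
        variants = [text] + [augment(text) for _ in range(n_aug)]
        probs = np.mean([predict_proba(v) for v in variants], axis=0)
        return int(np.argmax(probs))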

Björn
  • Thanks for the reply. Yes, I am using a BERT model pre-trained on a financial dataset. I tried to change each sentence only a little, to avoid altering its meaning while not producing exactly the same sentence, for example by replacing at most 3 synonyms or inserting at most 3 words; the labels are reliable. Even trying different parameters (learning rate, batch size, epochs), the model stays at an average accuracy of 87% over the 10 folds. I'll now give the article a read. – Flavio Torre Jul 21 '21 at 16:06
  • Financial texts might be somewhat different from normal internet text. I wonder whether huggingface has a FinanceBERT or something like that, or whether you could fine-tune BERT on the language-modelling task on a bunch of unlabelled similar documents (as self-supervised pre-training prior to the supervised training)? – Björn Jul 21 '21 at 16:13
  • Yes, I'm using FinBERT (the ProsusAI/finBERT repository on GitHub); I want to improve the model on this dataset. – Flavio Torre Jul 21 '21 at 16:17