
I have a small dataset of 4840 samples (60% neutral, 28% positive, 12% negative). I apply data augmentation to the training set (70% train / 30% test), which gives me about 2000 samples per class, while the test set stays unbalanced (800 neutral, 400 positive, 200 negative). I use synonym replacement from WordNet and word insertion with contextual embeddings, both taken from the nlpaug library. With 10-fold cross-validation the performance doesn't increase very much: without data augmentation I get 86% accuracy and 84% F1, with data augmentation 87% accuracy and 85% F1.
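For reference, a minimal sketch of the two augmenters described above using the nlpaug API (the 3-word cap and the generic 'bert-base-uncased' model name are illustrative assumptions, not necessarily the exact configuration used):

    import nlpaug.augmenter.word as naw

    # Synonym replacement from WordNet, capped at 3 replaced words per sentence.
    syn_aug = naw.SynonymAug(aug_src='wordnet', aug_max=3)

    # Word insertion driven by a contextual masked-LM; 'bert-base-uncased' is a
    # placeholder for whichever model is actually used.
    ins_aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action='insert', aug_max=3)

    text = "The company reported strong quarterly earnings."
    print(syn_aug.augment(text))  # recent nlpaug versions return a list of strings
    print(ins_aug.augment(text))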

P.S. I tried different data augmentation strategies and the results are quite similar. Given the train and validation loss values below, I was expecting overfitting.

The train and validation loss after 10 epochs: val_loss = 0.228, train_loss = 1.661.

What is the problem?

1 Answer


I'm not sure there is necessarily any problem. If you are already fine-tuning a pre-trained language model (e.g. ULMFiT, BERT or something similar), how much you gain through augmentation varies a lot from application to application. It may have been reasonable to hope for a bit more, but it may simply not add that much. If you look at this paper (also a sentiment classification example), where round-trip translation was used as augmentation, the gains over just fine-tuning a pre-trained model are huge when you have very little data (e.g. 50 examples). However, they get a lot smaller (an order of magnitude similar to what you see, Figure 1) when the number of examples is larger. I looked at both the 5000-record case (similar number of records to yours) and the 500-record case (similar error rate to yours, and the smallest class is about the same size as in your case). I'm not sure how I'd expect round-trip translation as an augmentation to compare with what you did, but I'd speculate it might even be more effective. By the way, that type of augmentation is now very easy, because huggingface offers translation models, so you could easily give it a try.
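For illustration, a minimal round-trip-translation sketch using huggingface translation pipelines (the Helsinki-NLP model names and the English-German pair are just one possible choice, not something taken from the paper):

    from transformers import pipeline

    # One possible model pair; any reasonable en->X and X->en translation models work.
    to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    def back_translate(text):
        """English -> German -> English gives a paraphrased variant of the input."""
        german = to_de(text, max_length=512)[0]["translation_text"]
        return to_en(german, max_length=512)[0]["translation_text"]

    print(back_translate("The company reported strong quarterly earnings."))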

To check whether there's an issue: did you inspect some augmented examples to see whether the augmentations really modify a lot of the text without changing the meaning? Did you look at the validation cases that get misclassified (are they simply ambiguous cases that no model could realistically get right? Are some of the labels wrong? etc.)?
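For the second check, a minimal sketch (val_texts, val_labels and predict are placeholders for whatever your own pipeline provides):

    # Print the validation examples the model gets wrong, so they can be read manually.
    # `val_texts`, `val_labels` and `predict` are placeholders for your own objects.
    for text, gold in zip(val_texts, val_labels):
        pred = predict(text)
        if pred != gold:
            print(f"gold={gold}  pred={pred}  text={text[:120]}")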

Other ideas:

  • With so few examples, it's realistic to try counter-factual augmentation of your training data, i.e. have a human manually create 2 or more versions of each training example that make the smallest possible change to the text that (in the human's judgement) flips the category to each of the other 2 categories. It might be enough to do this only for the training cases that are particularly hard (e.g. selected using something like the BatchBALD criterion).

  • You could try test-time augmentation like in the paper I linked (see the sketch after this list).
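A minimal test-time-augmentation sketch (here augment stands for any augmentation function, e.g. one of those mentioned above, and predict_proba for a function returning class probabilities; both are placeholders):

    import numpy as np

    def tta_predict(text, augment, predict_proba, n_aug=4):
        """Average class probabilities over the original text and augmented copies."""
        variants = [text] + [augment(text) for _ in range(n_aug)]
        probs = np.mean([predict_proba(v) for v in variants], axis=0)
        return int(np.argmax(probs))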

Björn
  • Thanks for the reply. Yes, I am using a BERT model pre-trained on a financial dataset. I tried to change each sentence only a little, to avoid altering its meaning while not producing exactly the same sentence, for example by replacing at most 3 synonyms or inserting at most 3 words; the labels are reliable. Even trying different parameters (learning rate, batch size, epochs), the model stays at an average accuracy of 87% over the 10 folds. I'll now give the article a read. – Flavio Torre Jul 21 '21 at 16:06
  • Financial texts might be somewhat different from normal internet text. I wonder whether huggingface has a FinanceBERT or something like that, or whether you could fine-tune BERT on the language-modelling task on a bunch of unlabelled similar documents (as self-supervised pre-training prior to the supervised training)? – Björn Jul 21 '21 at 16:13
  • Yes, I'm using FinBERT (the ProsusAI/finBERT repository on GitHub); I want to improve the model on this dataset. – Flavio Torre Jul 21 '21 at 16:17