How important is speaker-balance for speech recognition models?

Question

While working on ASR models, I have encountered several datasets in which a small number of speakers account for a large share of the total data.

The following image shows the extracted time spoken (log) per speaker from the Voxforge (de) dataset:

Extreme cases are possible in which the top 2% of speakers account for over 50% of the total time spoken.
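For reference, this kind of share can be computed with a few lines of Python. This is a minimal sketch on made-up toy data, not the Voxforge numbers; the function name `top_speaker_share` and the `(speaker_id, seconds)` input layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def top_speaker_share(utterances, top_fraction=0.02):
    """Share of total duration contributed by the top `top_fraction` of speakers.

    `utterances` is an iterable of (speaker_id, seconds) pairs (assumed layout).
    """
    totals = defaultdict(float)
    for speaker_id, seconds in utterances:
        totals[speaker_id] += seconds
    if not totals:
        raise ValueError("no utterances given")

    # Rank speakers by total time spoken, longest first.
    ranked = sorted(totals.values(), reverse=True)
    # Always count at least one speaker, even for very small datasets.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Toy data (hypothetical): one heavy speaker and 49 light ones.
utterances = [("spk0", 600.0)] + [(f"spk{i}", 10.0) for i in range(1, 50)]
share = top_speaker_share(utterances)
print(f"top 2% of speakers: {share:.1%} of time spoken")
```

With 50 speakers, the top 2% is a single speaker, so the result is simply that speaker's duration divided by the total.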

My question is whether, and by how much, this affects the model's performance. How important is it to balance the dataset, and which factors matter most (age, gender, total time spoken, etc.)?

score 1 · Answer 1 · answered Apr 13 '19 at 22:27

Real-world datasets are almost always skewed like this. You are better off focusing on training algorithms that cope with such datasets than on fixing the dataset itself.

With a suitable algorithm, speaker imbalance is not a problem. Adding more training data matters much more than carefully curating the dataset.
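The answer does not name a specific algorithm. As one possible illustration of handling imbalance on the training side rather than by editing the dataset, here is a minimal sketch of speaker-balanced sampling weights. The helper `speaker_balanced_weights` is hypothetical, and whether this kind of reweighting actually improves ASR accuracy is not something this thread establishes.

```python
import random
from collections import Counter


def speaker_balanced_weights(speaker_ids):
    """Per-utterance sampling weights giving every speaker equal total probability.

    `speaker_ids` holds one speaker label per utterance (assumed layout).
    """
    counts = Counter(speaker_ids)
    n_speakers = len(counts)
    # Each speaker gets probability mass 1 / n_speakers, split evenly
    # across that speaker's utterances.
    return [1.0 / (n_speakers * counts[s]) for s in speaker_ids]


# Toy data (hypothetical): speaker "a" has three utterances, "b" has one.
speaker_ids = ["a", "a", "a", "b"]
weights = speaker_balanced_weights(speaker_ids)

# Draw utterance indices for a batch according to the weights.
batch = random.choices(range(len(speaker_ids)), weights=weights, k=8)
```

Here each utterance from "a" is drawn with probability 1/6 and the single utterance from "b" with probability 1/2, so both speakers are equally likely to appear in a batch. The dataset itself stays untouched; only the sampling changes.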
