1

Working on ASR models, I have encountered several datasets which have distributions where a small amount of speaker make a huge part of the actual dataset.

The following image shows the extracted time spoken (log) per speaker from the Voxforge (de) dataset:

enter image description here

Extreme cases where the top 2% of speaker make up over 50% of the time spoken are possible.

My question is whether, or how much, this may impact the model's performance. How important is it to balance the dataset and which factors are most important (age, gender, total time spoken, etc.)?

Stefan Falk
  • 121
  • 4

1 Answers1

1

Realistic datasets are always like that. You'd better focus on proper algorithms to train on such datasets than on fixing the dataset itself.

With proper algorithm balance is not a problem. It is much more relevant to add more training data than to prepare dataset carefully.