1

Let's say I have a balanced dataset with a small training sample size (lack of data).

How do I increase the training sample size by generating synthetic data based on the original data?

I believe methods like SMOTE are only suitable for imbalanced data.

Wherever I read about SMOTE, it is discussed as a way to balance the classes.

What about increasing the training sample size of a dataset that has no class imbalance?

Aqee
  • 39

3 Answers

2

I'm dubious of generating more training data, even with techniques like SMOTE. Generating more training data by any method other than observing the phenomenon risks biasing any estimates, because it imposes a synthetic data-generating process on the data. It could work, but I'm not sure it is reliable.

If you're worried your training sample is too small, you could use Bayesian models with informative priors. Otherwise, get more data.
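As a rough illustration of how an informative prior can stand in for missing data, here is a minimal sketch using a conjugate Beta-Binomial model. The prior parameters and the counts are hypothetical, chosen only for the example; the model choice itself is an assumption and not something specific to the question's dataset.

```python
def beta_binomial_posterior(prior_a, prior_b, successes, failures):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical informative prior: Beta(30, 70), i.e. a belief that the rate
# is around 30%, weighted like 100 earlier observations.
prior_a, prior_b = 30, 70

# Hypothetical small sample: 6 successes in 10 trials.
successes, failures = 6, 4

post_a, post_b = beta_binomial_posterior(prior_a, prior_b, successes, failures)
posterior_mean = beta_mean(post_a, post_b)        # 36 / 110
sample_rate = successes / (successes + failures)  # 6 / 10
```

With only ten observations, the posterior mean stays close to the prior rather than jumping to the raw sample rate. The estimate is stabilized by stated prior knowledge instead of by fabricated data points.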

0

Why do you think SMOTE is not suitable for generating more training data here? Given that SMOTE creates new points on the line segments between a point and its nearby points, it's reasonable to use SMOTE to generate synthetic data.
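The interpolation idea can be sketched in plain Python as follows. This is a simplified, SMOTE-style illustration and not the reference SMOTE implementation; the function name `smote_style_oversample` and the toy data are hypothetical. To keep a balanced dataset balanced, it would be applied to each class separately with the same `n_new`.

```python
import math
import random


def smote_style_oversample(points, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between neighbours.

    points: list of equal-length numeric feature vectors, all from ONE class.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(points) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random base point from the class.
        i = rng.randrange(len(points))
        base = points[i]
        # Sort the other points by Euclidean distance to the base point.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(base, points[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = points[rng.choice(others[:k])]
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data for a single class with two features.
class_a = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_a = smote_style_oversample(class_a, n_new=6, k=2)
```

Every synthetic point is a convex combination of two observed points, so it can never leave the region the small sample already covers. That is one way to see why the new points add no information beyond what was observed.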

tourism
  • 11
  • I'm not sure. Wherever I read about SMOTE, it is discussed as a way to balance the classes. – Aqee Jan 26 '21 at 03:01
  • You can use SMOTE to generate training data. But as @Demmtri suggested in the answer, generating synthetic training data with SMOTE or any other technique is not reliable. – tourism Jan 26 '21 at 11:56
0

A SMOTE-style technique is completely reasonable for balanced data. Yes, the "M" means "minority" and there is no minority class in a balanced problem. However, the idea of using the SMOTE-style synthesis for points in general could apply.

Note that SMOTE seems not to be so great at synthesizing reasonable points, which diminishes its potential utility.

I am with the answer by Pananos expressing skepticism about synthesizing new points in order to expand the sample size. If you have a small sample size that truly needs to be expanded, then there is a real risk that the point synthesis will overfit to the coincidences in the small training data, and funky behavior might not get washed out by other, more mainstream points. Once you reach a large sample size where this is not such a concern, then I question whether you really need to generate new points.

Dave
  • 62,186