What if I factor the training set?

Question

In practice, we often don't have enough observations to build the models we want. One idea that came to mind is that the population can be factored: in other words, we can simply duplicate every observation, making (say) 5 copies of each one. However, this is risky, because if we oversample both the training and testing sets, we may overestimate the performance of the trained model.

One way to overcome this might be to factor only the training set, build the model on the duplicated training set, and test it on the original testing set. Is this a good idea? I guess that in most situations it could even lead to a worse model. Is there any case in which factoring the training set could make sense?

Can you define in your question what you mean by "factor"? Do you just mean "duplicate"? — D.W., Oct 19 '15 at 19:16

score 8 · Accepted Answer · answered Oct 19 '15 at 13:52

As FrankH wrote, naive multiplication of observations is at least redundant and can be harmful.
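To make the "redundant" and "harmful" parts concrete, here is a minimal sketch, assuming ordinary least squares as a stand-in model (the question does not name one) and synthetic data invented purely for illustration. Copying every row 5 times leaves the fitted coefficients unchanged, but the naive standard errors shrink, so the model looks more certain than the data justify.

```python
import numpy as np

# Synthetic data, made up for this illustration only.
rng = np.random.default_rng(0)
n, k = 30, 5                      # n observations, k copies of each
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients and their naive standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]          # rows minus parameters
    sigma2 = resid @ resid / dof           # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

beta_orig, se_orig = ols(X, y)
# "Factor" the training set: stack k identical copies of every row.
beta_dup, se_dup = ols(np.tile(X, (k, 1)), np.tile(y, k))

print("same coefficients:", np.allclose(beta_orig, beta_dup))
print("standard error ratio (duplicated / original):", se_dup / se_orig)
```

The duplicated fit carries no new information, yet its standard errors are smaller by a factor of sqrt((n - p) / (k*n - p)), where p is the number of fitted parameters. Other learners may react differently to duplicates (for example through regularization strength or minimum leaf sizes), which this sketch does not cover.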

Yet in some areas, mostly computer sensing, there is a similar trick: multiply an observation by applying transformations that should be irrelevant to classification. For instance, a picture of a cat is still a picture of a cat if it is slightly rotated, tinted warmer and cropped a bit, and a spoken word "computer" is still "computer" even if it is played 4% slower and 7% louder, with a bear roar and a flute in the background. This usually helps teach the model that those transformations really are irrelevant, which increases robustness.
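A minimal sketch of this kind of augmentation, assuming images stored as NumPy arrays with values in [0, 1]. The `augment` helper, the shift and brightness ranges, and the random stand-in data are all hypothetical choices for illustration; a small shift and a brightness change stand in here for the rotation, tint and crop mentioned above.

```python
import numpy as np

def augment(image, rng, max_shift=2, max_gain=0.1):
    """Return a randomly perturbed copy of `image` (H x W x C, floats in [0, 1])."""
    # Shift by a few pixels; np.roll wraps pixels around at the edges.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, shift=(dy, dx), axis=(0, 1))
    # Scale brightness by a random factor close to 1, then clip to [0, 1].
    gain = 1.0 + rng.uniform(-max_gain, max_gain)
    return np.clip(out * gain, 0.0, 1.0)

rng = np.random.default_rng(0)
# Stand-in training data: 10 random 8x8 RGB "images" with binary labels.
X_train = rng.random((10, 8, 8, 3))
y_train = rng.integers(0, 2, size=10)

copies = 5
# Each original image yields `copies` perturbed variants with the same label.
X_aug = np.stack([augment(x, rng) for x in X_train for _ in range(copies)])
y_aug = np.repeat(y_train, copies)

print(X_aug.shape, y_aug.shape)
```

Unlike plain duplication, each copy differs from its source while keeping the same label. In line with the question, only the training set would be expanded this way; the model is still evaluated on the original, untransformed test set.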

Frank H. · Answer 2 · 2015-10-19T13:17:54.310

I don't think factoring or oversampling would be beneficial just to increase the number of data points for training or modeling. If you wanted to adjust your dataset because you believe it is not representative of the "true" dataset or the prospective dataset, then I think this sounds reasonable although very difficult in practice and difficult to support. Otherwise, factoring could lead you to a significantly different distribution of data which might lead you to a worse or a completely incorrect model.
