Is DoE applicable to collect data for machine learning model?

Question

I'm currently working on a machine learning model for a classification task in an engineering application. While working on this project I realized that the provided data is insufficient to get a robust classification running.

Now I'm planning to collect more data using DoE methods such as fractional factorial designs, to cover the whole plausible range of levels for the factors while keeping the number of experimental runs at a reasonable level.

While researching this, I found no evidence that these methods have been validated for gathering training data for ML models. So I'm worried that I'm missing something and will end up with just another batch of insufficient or biased data.

Some figures: The DoE I'm thinking of consists of three continuous factors and five discrete factors with two or three levels.

Can you give some more details? Which machine-learning methods do you plan to use? What will be the goal of the analysis? There is nothing in particular that should hinder the use of DoE ideas in ML, but to say more we want details. — kjetil b halvorsen, Apr 05 '19 at 17:53

score 2 · Answer 1 · answered Apr 16 '19 at 14:05

Design of experiments (DoE) is most often used with regression-like (or ANOVA-like) models, so machine learning is a red herring here. If your intended model is regression-like (this includes classification; you might look into logistic regression), then you can certainly use DoE. To say much more, we would need more details about your setup, but I would start by looking into fractional factorial designs.
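As a rough illustration of what such a design looks like, here is a minimal Python sketch that builds a two-level 2^(5-2) fractional factorial: five factors in 8 runs instead of the 32 a full factorial would need. It assumes all five factors are coded at two levels (-1/+1), which simplifies the setup in the question. The function name `fractional_factorial` and the generators D = A*B, E = A*C are choices made for this example only.

```python
from itertools import product


def fractional_factorial(n_base, generators):
    """Two-level fractional factorial design in -1/+1 coding.

    n_base:     number of base factors that get a full factorial.
    generators: one tuple per extra factor, listing the base-column
                indices whose product defines that factor's column.
    (Hypothetical helper written for this illustration.)
    """
    runs = []
    for base in product((-1, 1), repeat=n_base):
        row = list(base)
        for gen in generators:
            level = 1
            for idx in gen:
                level *= base[idx]
            row.append(level)
        runs.append(tuple(row))
    return runs


# Factors A, B, C form the base; D = A*B and E = A*C are generated columns.
design = fractional_factorial(3, [(0, 1), (0, 2)])
for run in design:
    print(run)
```

Each column is balanced and the five main-effect columns are mutually orthogonal, but D and E are aliased with the A*B and A*C interactions by construction, so this design cannot separate those effects.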

Some similar posts here with answers are "Machine Learning for optimization of configuration file" and "Is factorial experiment used only for prediction (regression or classification)?". Beyond those, search this site.

score 0 · Answer 2 · answered Apr 16 '19 at 14:10

(This isn't a real answer, because we'd probably need more details, as Kjetil asked for, but it is slightly too long for a comment.)

One thing to keep in mind is that the whole idea of fractional factorial designs is predicated on linear or at least additive models, while many ML models won't satisfy this assumption. I'm sure it wouldn't be hard to come up with an example where a traditional fractional factorial design would be useless for learning some concept for which an equal number of random query points would work fine.
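One such toy example, sketched in Python under stated assumptions: three two-level factors, a class label that depends only on the three-way interaction (label 1 when A*B*C = +1), and the standard half fraction with generator C = A*B. The label rule and all names here are invented for illustration. Every run in the half fraction has A*B*C = +1, so the design only ever observes one class.

```python
from itertools import combinations, product


def label(run):
    """Hypothetical concept: class 1 exactly when A*B*C == +1."""
    a, b, c = run
    return 1 if a * b * c == 1 else 0


# Full 2^3 cube and the 2^(3-1) half fraction with generator C = A*B.
cube = list(product((-1, 1), repeat=3))
half_fraction = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

# Which class labels does the fractional design ever observe?
design_labels = {label(run) for run in half_fraction}
print("labels seen by the half fraction:", design_labels)

# Compare with every possible choice of four runs from the full cube.
subsets = list(combinations(cube, 4))
informative = sum(1 for s in subsets if len({label(r) for r in s}) == 2)
print(f"{informative} of {len(subsets)} four-run subsets see both classes")
```

The half fraction sees only label 1, whereas 68 of the 70 possible four-run subsets contain both classes. Seeing both classes is only a necessary condition for learning the concept, but it shows how a design built around additive effects can miss a purely interactive one.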

Machine learning people have worked a fair amount on an area called active learning, closely related to optimal experimental design. Here's one such scheme (a simple version of disagreement-based active learning), though there are many options:

1. Fit a bunch of models on the data you have so far.
2. Identify the data point on which those models disagree the most, so that learning its label would narrow down the candidate models the most.
3. Go get the true label for that data point from the real world.
4. Update the models and repeat.

Whether this scheme works better or worse than the various alternatives depends a lot on the kind of models you're using and on your budget for getting more labels, for example whether you can go back and forth many times or need to collect the labels in a few batches.
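The disagreement-based loop described above can be sketched in a few lines of Python. This is a deliberately tiny toy, not a recommendation for the engineering problem in the question: the input is one-dimensional, the models are threshold classifiers, the committee is drawn at random from the thresholds consistent with the labels so far, and `oracle` is a hypothetical stand-in for running a real experiment.

```python
import random


def oracle(x):
    """Hypothetical stand-in for a real experiment (true boundary at 0.62)."""
    return 1 if x > 0.62 else 0


def sample_committee(labeled, size, rng):
    """Draw threshold models that agree with every label seen so far."""
    lo = max(x for x, y in labeled if y == 0)
    hi = min(x for x, y in labeled if y == 1)
    return [rng.uniform(lo, hi) for _ in range(size)]


def disagreement(x, committee):
    """Size of the minority vote; 0 means the committee is unanimous."""
    votes = sum(1 for t in committee if x > t)
    return min(votes, len(committee) - votes)


rng = random.Random(0)
pool = [i / 100 for i in range(1, 100)]   # candidate settings we could run
labeled = [(0.0, 0), (1.0, 1)]            # assumed starting data

for _ in range(6):
    committee = sample_committee(labeled, 25, rng)                     # step 1
    query = max(pool, key=lambda x: disagreement(x, committee))        # step 2
    pool.remove(query)
    labeled.append((query, oracle(query)))                             # step 3
    # step 4: the next iteration resamples the committee from the new data

lo = max(x for x, y in labeled if y == 0)
hi = min(x for x, y in labeled if y == 1)
print(f"boundary bracketed between {lo} and {hi} after {len(labeled) - 2} queries")
```

In this toy setting the most contested point lies near the middle of the still-plausible thresholds, so the loop behaves roughly like bisection. With richer model classes and several factors, the committee and the disagreement measure would need to be chosen to match the models actually in use.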
