
In a data science project, data is typically preprocessed, and we build, test, and select different models. Models also come with their own preprocessing requirements, which can vary greatly from model to model; e.g., some models require scaling, while others are indifferent to it.

What is considered best practice when managing the preprocessing and transformation of a dataset (or datasets) that will feed multiple distinct models, each with its own preprocessing requirements?

I want to know how to make preprocessing flexible enough to support multiple models, while also keeping changes easy to manage.

I recently started using the cookiecutter data science project template, which advocates using an interim dataset. Presumably, this interim dataset forms a base set from which model-specific preprocessing is built. This is one approach, but I would like to know what is considered best practice.

Oaty

3 Answers


If the data preparation steps take significant processing time, then best practice would be to make that interim data set. If it is not significant, then repeat the data preparation steps for each model you build, as it is simpler and gives you more flexibility.

An example of the former: say you have NLP training data, some in PDF, some in docx, some in HTML, some in epub, etc., but all your models will require either plain-text sentences or plain-text paragraphs. It would be sensible to first run various scripts to extract each paragraph as a single row in a text file (for instance).
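
For instance, here is a minimal sketch of that one-off extraction step (pypdf, python-docx, and BeautifulSoup are illustrative assumptions, not part of the recipe; the data/raw and data/interim folders follow the cookiecutter layout mentioned in the question), writing one paragraph per line to an interim text file:

    # One-off step: mixed-format raw documents -> interim file of plain-text paragraphs.
    from pathlib import Path

    from bs4 import BeautifulSoup        # HTML
    from docx import Document            # .docx
    from pypdf import PdfReader          # PDF

    def extract_paragraphs(path: Path) -> list[str]:
        """Return the plain-text paragraphs of one source document."""
        if path.suffix == ".pdf":
            text = "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
        elif path.suffix == ".docx":
            text = "\n".join(p.text for p in Document(str(path)).paragraphs)
        elif path.suffix in {".html", ".htm"}:
            text = BeautifulSoup(path.read_text(), "html.parser").get_text("\n")
        else:                             # fall back to treating the file as plain text
            text = path.read_text()
        return [p.strip() for p in text.split("\n") if p.strip()]

    with open("data/interim/paragraphs.txt", "w") as out:
        for doc in sorted(Path("data/raw").rglob("*")):
            if doc.is_file():
                for para in extract_paragraphs(doc):
                    out.write(para + "\n")   # one paragraph per row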

As an example of the latter, don't tokenize into sentences: that should be a pipeline step you run for each model, as needed, because it is relatively quick. (And only some models require it.)
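
A sketch of what that per-model step might look like (NLTK is an assumption here, not something the approach requires): each model reads the shared interim paragraphs and tokenizes only if it needs to.

    # Per-model preparation: read the shared interim paragraphs, split into sentences only if needed.
    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)    # "punkt_tab" on newer NLTK releases

    def load_training_units(interim_path: str, split_sentences: bool) -> list[str]:
        """Paragraph-level models pass split_sentences=False; sentence-level models pass True."""
        with open(interim_path) as f:
            paragraphs = [line.strip() for line in f if line.strip()]
        if not split_sentences:
            return paragraphs
        return [s for p in paragraphs for s in sent_tokenize(p)]

    paragraph_data = load_training_units("data/interim/paragraphs.txt", split_sentences=False)
    sentence_data = load_training_units("data/interim/paragraphs.txt", split_sentences=True)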

Another example is where you have a very large set of images that are going to be processed over a cluster. If you plan to augment the data by also training on the mirror image of each, you will find that loading the data into the cluster takes a significant amount of time, but flipping an image is very quick. If you were to create all those mirror images in advance, you would have twice as much data, so the bottleneck stage would now take twice as long.

However, if those images were all hi-res photos but all your models want to work with them resized to 256x256, I would prepare that in advance. Let's assume the new file size is, on average, 1/10th of the original. Your total storage requirements have only gone up 10% (if you kept the originals), but the bottleneck stage of loading into the cluster will now be 10x quicker.
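
A rough sketch of that split, assuming Pillow (an illustrative choice): the expensive resize to 256x256 happens once up front, while the cheap mirror augmentation stays at training time.

    # One-off step: shrink hi-res photos to 256x256 before loading them into the cluster.
    from pathlib import Path
    from PIL import Image, ImageOps

    def resize_all(src_dir: str, dst_dir: str, size: tuple = (256, 256)) -> None:
        Path(dst_dir).mkdir(parents=True, exist_ok=True)
        for path in Path(src_dir).glob("*.jpg"):
            Image.open(path).convert("RGB").resize(size).save(Path(dst_dir) / path.name)

    # Training-time step: flipping is cheap, so mirror on the fly instead of storing copies.
    def augmented(img):
        yield img
        yield ImageOps.mirror(img)        # left-right flip, computed per batch as needed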

Darren Cook
  • What would you say to extracting all the shared (across all models) preprocessing steps and putting them into an interim dataset? – Oaty Oct 12 '21 at 03:40
  • @Oaty What I said in my first paragraph: make the interim dataset if it takes significant processing time to generate, or it will be significantly smaller; otherwise don't, as you are adding complexity for little gain. BTW, extracting all those shared preprocessing steps into a shared library is definitely a good idea; as is specifying random seeds, so you know each model is getting the same input data. – Darren Cook Oct 12 '21 at 08:01
  • Thanks for the detailed response. – Oaty Oct 12 '21 at 08:17

One approach is pipelines. A pipeline is an explicit list of the sequential steps to create and use a model.
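
For example, with scikit-learn's Pipeline (one common implementation; other libraries offer equivalents), each model bundles its own preprocessing, so a scaler is attached only to the model that needs it. The data below is a toy stand-in, not anything from the question:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, random_state=0)   # toy stand-in for the base dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each pipeline bundles its own preprocessing with its model.
    models = {
        "logreg": Pipeline([("scale", StandardScaler()),        # this model benefits from scaling
                            ("clf", LogisticRegression(max_iter=1000))]),
        "forest": Pipeline([("clf", RandomForestClassifier(random_state=0))]),  # trees don't care
    }

    # Every model is fit and evaluated the same way, from the same base dataset.
    for name, pipe in models.items():
        pipe.fit(X_train, y_train)
        print(name, pipe.score(X_test, y_test))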

Brian Spiering

Preprocessing, training, and postprocessing, along with correctly tuning model hyperparameters and benchmarking models via cross-validation, make for a complex and time-consuming effort. To facilitate that effort, several "meta" machine learning packages were developed in R (most notably mlr3 and tidymodels) and Python (scikit-learn).

These packages not only implement all the ML pipeline best practices but are also very useful for learning those practices while experimenting with actual data. I would advise picking one of them (it doesn't really matter which, they are all great) and some Kaggle datasets, and learning by doing actual modeling.
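
As a small scikit-learn illustration of what these packages take care of (the dataset and models here are arbitrary choices for the sketch), hyperparameter tuning nested inside cross-validated benchmarking takes only a few lines:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Preprocessing, hyperparameter tuning, and cross-validation handled by the package.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
    search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
    print(cross_val_score(search, X, y, cv=5).mean())   # nested CV: tuning inside, benchmarking outside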

Iyar Lin