Where should tests and checks be implemented in a mathematical modelling process?

Question

I have a mathematical model with two main stages: (1) data loading (only a few KB of data) and (2) running a model on the data. I want to implement some tests and checks to ensure the data fed into the model are valid (correct types, correct values, etc.) so that the model output is correct.

At what stage in the whole process is the best time to implement these tests - when loading the data or when running the model on the data? Or should tests be run at both stages?

Running tests only at the data loading stage reduces the overall overhead and guarantees correct model output. Running tests at the model stage results in some duplication of code (as well as extra overhead if the tests are already implemented at the data loading stage).

This model will only ever be run once a week by internal users, so there are no external applications or liabilities / dependent parties. However, the model object is available to the user without having to use the (separate) data loading object (so again, data can, but really shouldn't, be passed into the model from the user without having gone through the data loading stage).

Is there a best practice here? Could there be a better way to structure the code to implement the tests?

Please add the programming language and runtime environment to your question, as automated testing can be quite different depending on the language/environment used. — k3b, Dec 21 '18 at 20:44
@jonrsharpe: I am pretty sure OP does not mean unit tests, so I see no conflation in the question, only in your comment. The term "test" (without "unit", which occurs nowhere in the question) is a very general one, sometimes just used as a synonym for validation. — Doc Brown, Dec 21 '18 at 20:51
@k3b: I don't think OP is talking about test automation in the sense it is discussed on this site in many other questions. — Doc Brown, Dec 22 '18 at 06:59

Doc Brown · Accepted Answer · 2018-12-22T07:52:29.577

At what stage in the whole process is the best time to implement these tests - when loading the data or when running the model on the data?

Neither.

Instead, separate the process into three stages:

1. Data loading
2. Validation
3. Running the model on the data

This makes validation independent of steps 1 and 3, which allows it to be reused in other contexts. It also allows forms of validation that are easier to implement when all the data is available and that may be harder to implement in a sequential loading step.
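As a rough illustration of the three-stage split, here is a minimal Python sketch. The question does not name a language or data format, so everything here is an assumption: the CSV input, the "value" column, the function names (load_rows, validate_rows, run_model, run_pipeline) and the placeholder model (a plain mean) are all hypothetical.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def load_rows(text: str) -> List[Row]:
    """Stage 1: parse raw CSV text into rows. No validation happens here."""
    return list(csv.DictReader(io.StringIO(text)))


def validate_rows(rows: List[Row]) -> List[str]:
    """Stage 2: return a list of problems; an empty list means the data passed."""
    errors = []
    if not rows:
        errors.append("no rows loaded")
    for i, row in enumerate(rows):
        try:
            value = float(row["value"])  # "value" is a hypothetical column name
        except (KeyError, TypeError, ValueError):
            errors.append(f"row {i}: 'value' is missing or not a number")
            continue
        if value < 0:
            errors.append(f"row {i}: 'value' must be non-negative")
    return errors


def run_model(rows: List[Row]) -> float:
    """Stage 3: placeholder model (a mean); assumes the rows were validated."""
    values = [float(row["value"]) for row in rows]
    return sum(values) / len(values)


def run_pipeline(text: str) -> float:
    """Wire the three stages together and stop if validation reports problems."""
    rows = load_rows(text)
    errors = validate_rows(rows)
    if errors:
        raise ValueError("; ".join(errors))
    return run_model(rows)
```

Because validate_rows sees the complete data set, checks that span several rows (duplicates, totals, ordering) could be added there without touching the loader or the model.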

If some validations are specific to the model, step 2 may use the model for this. Nevertheless, it is often a good idea to keep steps 2 and 3 clearly separated.
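One possible way to let the validation step use the model without merging the two is for the model to publish its constraints and for a separate function to check data against them. The sketch below is in Python and is entirely hypothetical: GrowthModel, input_ranges, validate_for and the chosen ranges are invented for illustration, not taken from the question.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GrowthModel:
    """Hypothetical model that declares which inputs it needs and their ranges."""
    rate: float = 0.1

    def input_ranges(self) -> Dict[str, Tuple[float, float]]:
        # name -> (minimum, maximum) allowed value; purely illustrative numbers
        return {"population": (0.0, 1e9), "years": (0.0, 100.0)}

    def run(self, data: Dict[str, float]) -> float:
        # Assumes the data was checked beforehand; no validation in here.
        return data["population"] * (1.0 + self.rate) ** data["years"]


def validate_for(model: GrowthModel, data: Dict[str, float]) -> List[str]:
    """Validation step: check data against the constraints the model publishes."""
    errors = []
    for name, (low, high) in model.input_ranges().items():
        if name not in data:
            errors.append(f"missing input: {name}")
        elif not low <= data[name] <= high:
            errors.append(f"{name}={data[name]} outside [{low}, {high}]")
    return errors
```

The validation still lives outside run, so it can be called on its own, while the model-specific knowledge stays with the model.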

I guess the only code duplication that will occur is some code for iterating over the data in steps 1 and 2, which is quite acceptable because it typically does not repeat any logic. But without knowing any details (such as the data structures involved), this is just a shot in the dark.

score 0 · Answer 2 · answered Dec 22 '18 at 07:40

The validation belongs in the model. Whether the data is bad or not depends on the model it is ultimately fed to, and at the loading stage this is not yet known. OK, you currently have only one model, but from a software engineering perspective that could change. So, just as you test for null before using an object, you test for data sanity before you perform a calculation. The loading stage does not need, and should not have, knowledge of (that is, a dependency on) the model.
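A minimal Python sketch of this approach, under the assumption that the model takes a plain sequence of numbers: MeanModel and its _check method are hypothetical names, and the mean stands in for the real calculation.

```python
import math
from typing import Sequence


class MeanModel:
    """Hypothetical model that guards its own calculation."""

    def _check(self, values: Sequence[float]) -> None:
        # Sanity checks owned by the model; the loader knows nothing about them.
        if len(values) == 0:
            raise ValueError("model needs at least one value")
        for v in values:
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                raise TypeError(f"expected a number, got {type(v).__name__}")
            if not math.isfinite(v):
                raise ValueError("values must be finite")

    def run(self, values: Sequence[float]) -> float:
        self._check(values)  # analogous to a null check before using an object
        return sum(values) / len(values)
```

With the check inside run, data passed straight to the model without going through the loader is still rejected if it is invalid.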
