We are working with an anonymized dataset (similar to commercially available consumer data) to build a binary classification model. We have commercially available data covering essentially the entire US adult population. We constructed our model using a standard hold-out validation approach (roughly 80% for training and 20% held out for validation).
A research partner (a business with thousands of customers) is interested in testing our model on their customer base to verify its 'generalizability'. The challenge is that almost all of their customers are present in our dataset, and we have no way to isolate them because we are legally forbidden from matching records.
The model fits the partner's data quite well, as expected (obviously!). The real challenge is that we want to publish our results in reputable journals, and every standard guideline we have seen calls for true out-of-sample validation as an unwritten golden rule for publication. Is there any way to overcome this objection? Are there any precedents for this that I can refer to?
For example, suppose our model predicts the likelihood of filing for bankruptcy, trained on a commercially available consumer behavior dataset from XYZ covering about 200 million adults (everything that is commercially available). Say the ROC-AUC we claim on the held-out validation set is 0.85. Now a bank operating in 5 states wants to test our model. We scored their data and obtained a ROC-AUC of 0.845. From a utility point of view this would be considered a great fit. However, since the bank's customers are part of the US population, we expect about 80% of them to be represented in our training data. This raises the concern that our results may not be accepted for publication (at least, that is what our internal reviewers are opining). Building a model that leaves out the bank's customers is not an option, because it would require matching records in order to exclude them, which is legally prohibited.
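One thing that may help with reviewers, short of true external validation, is a quantitative sensitivity analysis of how much the ~80% overlap could be propping up the observed 0.845. Below is a minimal illustrative sketch (not your actual pipeline) that treats the bank sample as a blend of a "seen" cohort scoring at the in-sample level and an "unseen" cohort, under the simplifying assumptions of Gaussian score distributions with a shared negative-class distribution (under which blended AUC is linear in the cohort mix; in general AUC does not decompose this cleanly). All cohort AUC values here are assumptions for illustration.

```python
import random
from statistics import NormalDist

def auc(pos, neg):
    """Empirical AUC = P(score_pos > score_neg), via the rank-sum
    (Mann-Whitney) formula; ties are negligible for continuous scores."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg])
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(ranked) if y == 1)
    n_pos, n_neg = len(pos), len(neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def separation_for_auc(a):
    # For unit-variance Gaussian scores, AUC = Phi(d / sqrt(2)),
    # so d = sqrt(2) * Phi^{-1}(AUC).
    return 2 ** 0.5 * NormalDist().inv_cdf(a)

random.seed(0)
n = 50_000
seen_frac = 0.80        # assumed fraction of bank customers in training data
auc_seen = 0.85         # assumed: overlapping cohort scores at in-sample level
auc_unseen = 0.80       # assumed: a pessimistic AUC for the novel cohort

d_seen = separation_for_auc(auc_seen)
d_unseen = separation_for_auc(auc_unseen)

# Positive-class scores come from a mixture of seen/unseen cohorts;
# negatives share one distribution, so the blend is linear in the mix.
pos = [random.gauss(d_seen if random.random() < seen_frac else d_unseen, 1.0)
       for _ in range(n)]
neg = [random.gauss(0.0, 1.0) for _ in range(n)]
blend = auc(pos, neg)   # expected near 0.8 * 0.85 + 0.2 * 0.80 = 0.84

# Back out the unseen-cohort AUC implied by the observed blend of 0.845,
# under the worst case that every seen customer scores at 0.85:
implied_unseen = (0.845 - seen_frac * auc_seen) / (1 - seen_frac)
print(f"simulated blend: {blend:.3f}, implied unseen AUC: {implied_unseen:.3f}")
```

Under these assumptions, even if the entire overlapping 80% scored at the full in-sample 0.85, the novel 20% would still need an AUC of about 0.825 to produce the observed 0.845, which is a bound you could report alongside the headline number.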