
I have to automate a yes/no-type business decision for a customer (think: is the use of chemical compound X beneficial in combination with chemicals A, B, and C?). He dumped a very large dataset on me that contains all the data I need (and much more) and basically said, "I don't care what you do, as long as in the end a yes or no answer comes out in which I can be fairly confident".

From this dataset I tried various sets of features, so that I could achieve a very good prediction score with a binary classification algorithm.
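
A minimal sketch of what such a feature-screening step could look like, assuming a scikit-learn-style workflow (the file name, column names, and candidate feature sets below are hypothetical placeholders, not taken from the actual project):

```python
# Sketch: screen candidate feature sets with cross-validated scores.
# All file, column, and set names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customer_dump.csv")        # the customer's raw data dump
candidate_sets = {
    "set_a": ["conc_X", "conc_A", "temperature"],
    "set_b": ["conc_X", "conc_B", "conc_C", "pH"],
}

for name, cols in candidate_sets.items():
    X, y = df[cols], df["outcome"]           # binary 0/1 target
    # cross-validated AUC as a screening score (not the final validation!)
    scores = cross_val_score(GradientBoostingClassifier(), X, y,
                             cv=5, scoring="roc_auc")
    print(name, round(scores.mean(), 3))
```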

I have incorporated this model into software that I will deliver to him, where the input is his whole dataset. Internally, the software computes the features that I identified in my own analysis as being good ones and makes the prediction.
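
And a minimal sketch of how the delivered software could wrap such a model so that the feature computation stays internal (again, all names are hypothetical assumptions, not the actual implementation):

```python
# Sketch of the delivered predictor: the customer feeds in the raw dataset,
# the proprietary feature computation happens inside, and only a yes/no
# answer comes out. All names are hypothetical.
import pandas as pd

class YesNoPredictor:
    def __init__(self, fitted_model, feature_fn):
        self._model = fitted_model      # trained binary classifier
        self._features = feature_fn     # proprietary feature computation

    def predict(self, raw_df: pd.DataFrame) -> str:
        X = self._features(raw_df)      # derived features never leave this object
        # aggregate the per-row probabilities into one overall decision
        proba = self._model.predict_proba(X)[:, 1].mean()
        return "yes" if proba >= 0.5 else "no"
```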

The problem is: the customer now wants me to prove to him that the good prediction score I claim is actually true. But it seems to me that I cannot prove that to him unless I give him a good part of my features, so that he can check for himself, which I don't want to do, since that is my IP...

    "The customer now wants me to prove to him that the good prediction score that I claim is actually true." You should ask how the customer wishes to do that. They can have your code, or send validation data without an outcome, for them to cross reference... "But it seems to me I cannot prove that to him, unless I give him a good part of my features": that is the last conclusion I would make. – AdamO Nov 07 '18 at 16:58

2 Answers


The customer now wants me to prove to him that the good prediction score that I claim is actually true

As an analytical chemist, I think this is a totally valid request. It is part of the necessary method validation.

As a business owner, I'd say that who is to do the work (you, the customer themselves, or an independent third party) is a matter of contract and possibly regulations. Likewise, whether the features should be handed over to the client is a matter of the contract/license.

But it seems to me I cannot prove that to him, unless I give him a good part of my features

I'm sorry to be so harsh, but at least in the context of analytical chemistry/chemometrics (which your application description hints at as the relevant field), this is so blatantly wrong that it would make me quite wary of accepting any claim of yours about the predictive quality of your model.

  • Verification of your model's predictive performance can (and should) be done with a (well designed) set of test cases which are subject to blinded or double-blind prediction.
    Depending on the application, this may be subject to regulations that may even prescribe these blinded tests to be done by an independent third party, in regular ring trials or the like.
    While we work a lot with resampling validation up to a point, we are also aware of its limitations: it is often very hard (or plainly impossible within a given dataset) to achieve statistical independence in the splitting procedure for resampling, to the point where it is usually less expensive to check performance against a set of new samples, where independence is easier to achieve (acquired later = comprises drift, at a different facility/reactor, etc.).
    The client may be thinking of such a verification, which would not require that you reveal the features (a sketch of such a blinded, group-aware check follows after this list).

  • Validation has a wider scope, and as soon as your predictions have a certain importance in terms of the harm that may be done by wrong predictions, I'd at least think it very desirable that the actually evaluated features are subject to some scrutiny during validation. I.e., the customer (or the third party doing the validation) should compare your findings from the data with their knowledge about the application.
    Even many kinds of models that are hard to interpret in terms of their features at least allow checking that the evaluated features do not contradict the known behaviour of the application system.
    So in the context of such a wider validation, I'd think it a sensible request from the customer to ask for your features (one way to support such scrutiny is sketched below as well).
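
To make the first point a bit more concrete, here is a minimal sketch of (a) a group-aware resampling split that respects batch/facility structure, and (b) a blinded prediction hand-off where the customer keeps the outcomes. The file names, column names, and the batch_id grouping are assumptions for illustration, not taken from the answer:

```python
# (a) Group-aware resampling: samples from the same batch/reactor never end up
#     in both the training and the test split, which removes one common source
#     of the dependence problem described above. Names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("customer_dump.csv")
feature_cols = ["conc_X", "conc_A", "conc_B"]
X, y = df[feature_cols], df["outcome"]
groups = df["batch_id"]                      # e.g. production batch or facility

scores = cross_val_score(GradientBoostingClassifier(), X, y,
                         groups=groups, cv=GroupKFold(n_splits=5),
                         scoring="roc_auc")
print("group-wise CV AUC:", round(scores.mean(), 3))

# (b) Blinded verification: the customer (or an independent third party) keeps
#     the outcomes of newly acquired samples; only predictions are sent back,
#     and they compute the performance themselves.
model = GradientBoostingClassifier().fit(X, y)
blind = pd.read_csv("blind_test_without_outcome.csv")
pd.DataFrame({"sample_id": blind["sample_id"],
              "prediction": model.predict(blind[feature_cols])}
             ).to_csv("predictions_for_customer.csv", index=False)
```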
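
For the second point, one possible way to give a domain expert something concrete to scrutinize is a ranking of how strongly the predictions depend on each evaluated feature, e.g. via permutation importance. This is only a sketch with hypothetical names; for a real validation the importances should be computed on independent test data rather than the training data:

```python
# Sketch: rank the evaluated features by permutation importance so that a
# domain expert can check whether the model leans on anything that contradicts
# the known behaviour of the system. Names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

df = pd.read_csv("customer_dump.csv")
X, y = df[["conc_X", "conc_A", "conc_B"]], df["outcome"]
model = GradientBoostingClassifier().fit(X, y)

result = permutation_importance(model, X, y, scoring="roc_auc",
                                n_repeats=20, random_state=0)
for col, imp in sorted(zip(X.columns, result.importances_mean),
                       key=lambda t: -t[1]):
    print(f"{col}: {imp:.3f}")
```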

So I'd recommend that you:

  1. find out what exactly the customer wants, and then
  2. if they are asking for the features and your contract really does not include handing them over, prepare an offer stating at what price you are willing to sell them.
  • Thanks, Claudia! I'm not a chemist but a data scientist in training, so I did not know about these domain-specific issues. (And I apologize for answering so late; I got swept away with other projects and did not have time to follow up on this one until now.) – l7ll7 Apr 30 '19 at 06:10
  • Could you please clarify two things in your answer: 1) What exactly do you mean by double-blinded predictions? I know of double-blinded (or triple-blinded) experiments, where various parts of the experiment's data are not shared with the various parties involved. But I haven't come across that term in the context of predictions from a statistical model. – l7ll7 Apr 30 '19 at 06:17
  • 2) Could you please help me understand why one should use domain knowledge to check that "features do not contradict the known behaviour of the application system"? It seems to me that an essential part of machine learning is precisely to discover features (for example by using model selection, neural networks, or other techniques) that help to create a model that achieves excellent predictive performance, even though those features might not seem very helpful or compelling if one relies on domain knowledge of the application system. – l7ll7 Apr 30 '19 at 06:31
  • For example, insurance companies might detect that people who drive blue cars have more accidents (and thus would raise the premium for owners of blue cars), even though domain knowledge would at first suggest that the feature "car_color" should not significantly influence driving behavior. Only statistics can discover this (one could imagine that darker cars are perhaps harder to spot, which would actually make this feature relevant; though without further testing we won't know the exact reason for the feature's relevance). – l7ll7 Apr 30 '19 at 06:33
  • So we should not use domain knowledge to verify features, as you implied - but rather use discovered features to improve domain knowledge. (If you don't like the "car_color" example, one can always come up with other features that seem even more obscure when relying on domain knowledge but might make sense statistically.) Also, please note that it is not my goal to be polemical here, and I apologize in advance if I sounded that way. I know that I'm still in training, so if you see a flaw in the argument above, please point it out. – l7ll7 Apr 30 '19 at 06:36
  • blinding for validation: no, the term is not standard. What I mean here is that I've met a lot of model validation/verification results that were actually far from being as independent as claimed/considered, for various reasons ranging from dependence in the data that was not known to or suspected by the data analyst, to computational shortcuts resting on long-forgotten unrealistic assumptions, all the way to cheating (which, in accordance with Hanlon's razor, I think is rare). Very similar problems have existed for a long time when determining whether medical procedures or drugs are effective. – cbeleites unhappy with SX May 02 '19 at 10:29
  • I see resampling validation, or a hold-out test set set aside by the data analyst who trains the model, as analogous to single-blinding (the model isn't told what the outcome is). Test data that the data analyst receives without knowing the outcome themselves would be analogous to double-blinding (avoiding inadvertently giving hints - e.g. via dependence due to clusters in the data that the analyst doesn't realize are there). Note that this is still just verification. (I'm not so sure we need a direct analogy to triple blinding - at least not before we have established good practices for designing validation experiments.) – cbeleites unhappy with SX May 02 '19 at 10:37
  • So: blinding is a technique to improve the independence of validation data. As you say, ML can be used to discover features. But after that we need to validate the features and/or the quality of the predictions. This is done with data independent of the data you used for training the model. And while maybe not as convenient for you and not as easily included in numerical descriptions of model validity, chemical knowledge can be used here (in several ways). And it can be highly efficient in the sense that it may allow concentrating experimental effort during validation on some crucial points ... – cbeleites unhappy with SX May 02 '19 at 11:18
  • (i.e. to narrow down the design of the validation experiment, or to put dedicated stress tests on crucial features) compared to needing huge numbers of test cases without such knowledge guiding the DoE. Your blue car example is IMHO actually a nice one. a) Note how you immediately start to propose possible (understandable, causative) mechanisms for why blue cars are more prone to accidents. So the next step would be to check the literature/domain knowledge for whether blue cars have been known to be more prone to accidents and whether/which reasons have been proposed. You are correct to say that – cbeleites unhappy with SX May 02 '19 at 11:42
  • without further experiments we cannot say more than that the training algorithm found a correlation in the training data. But knowing that the training algorithm found "blue cars" is crucial for formulating these further experiments, including formulating alternatives that may be correlated with blue cars (dark car rather than blue car; in a chemical setting, maybe the pH feature was a surrogate for acetic acid - or the other way round). For the insurance company it may very well be sufficient to know that a car being blue needs to be compensated by an additional fee of at least $x$. Now imagine – cbeleites unhappy with SX May 02 '19 at 11:58
  • ... the underlying reason is that blue cars were the fashion n years ago, so they are now the age cohort (car-wise) that is predominantly driven by young drivers, who are more prone to accidents (as you don't know the underlying chemistry, you may be able to spot a correlation if driver age happens to be among your features. If not, tough luck. But even so, you are not able to decide which of the correlated features is the primary reason). Now, the insurance company can show their blue-car owners data indicating that they belong to a risk group, and charge them accordingly. If that's spurious, it doesn't – cbeleites unhappy with SX May 02 '19 at 12:03