
How can I find weights such that the weighted sum weight1 * var1 + weight2 * var2 + ... lies in the range 0-1, so it can be read as a probability, and maximizes rank-ordering performance (AUC, or any other suitable performance measure)?

Logistic regression does not provide the solution I want.

The data I'm dealing with look like this (var1 ... varX are variables already converted to probability measures). I would like to aggregate these individual risk measurements from different tools into a single risk measure, ideally with some confidence interval. The exit variable is the binary target response indicating that the risk materialized.

[image: example data table with columns var_prob1, var_prob2, var_prob3 and exit]

How should I approach this, and which tool should I use? I'm using the R statistical software, so suggesting an R package would also be helpful.

  • I'm unclear what you are looking for. What does "rank ordering performance" mean? If you had three weights in your example, how would you calculate the "rank ordering performance"? – Stephan Kolassa Sep 06 '22 at 09:36
  • I have added a reference regarding "rank order performance", which is a measure I know of; any other measure is fine, as long as I get the "best aggregated probability" that explains the 0/1 response variable optimally. – Maximilian Sep 06 '22 at 11:07
  • Hm. You added a link to a Google search (in itself not a good idea, since (a) everybody gets different search results, and (b) search results change over time) for "Accuracy and ROC curve", which is something different from "rank ordering performance". I don't think people will dig through their personal search results to try to figure out what you mean. Please make it easy for us to help you, by providing full details and using references only as backup. – Stephan Kolassa Sep 06 '22 at 11:27
  • @Stephan: reference updated. The question also remains what would be the best method to evaluate the performance. One can look into calibration like here: https://stats.stackexchange.com/questions/493023/brier-score-of-calibrated-probs-is-worse-than-non-calibrated-probs – Maximilian Sep 06 '22 at 11:47
  • Also this method by @whuber would be adequate to me: https://stats.stackexchange.com/questions/25482/visualizing-the-calibration-of-predicted-probability-of-a-model/25489#25489 – Maximilian Sep 06 '22 at 11:48
  • Ah. We started out with "rank ordering performance", you added a search for AUC and accuracy, and now we are at "AUC or any other performance measure", with a link to the Brier score. All these are quite different, some more closely related than others. Can you decide which performance measure you want? – Stephan Kolassa Sep 06 '22 at 11:49
  • It does not matter which of these is used; I tried to communicate an example on which to benchmark the performance. But if this is the only outstanding question, then AUC would be the preferred statistic. – Maximilian Sep 06 '22 at 12:00
  • Very good, thank you, we are getting somewhere. Next question: I assume you want your weights to sum to 1? – Stephan Kolassa Sep 06 '22 at 12:19
  • @Stephan: Thank you for bearing with me on this. Yes, I would expect the weights to sum to 1 (though I can't claim this with absolute certainty, I would say so). – Maximilian Sep 06 '22 at 12:46

1 Answer


You have a straightforward optimization problem. The Optimization CRAN Task View is a very helpful resource.

First off, your objective function takes (in your example) three weight parameters. These are matrix-vector multiplied with your variables. The result is a vector of class membership probabilities. We compare these to the actuals (exit) and calculate the AUC. This can be done using auc() in the pROC package. This entire function needs to be fed into the optimization algorithm.

Next, your optimization is nonlinear (since the AUC does not depend linearly on the weights). You also have constraints: your weights must be nonnegative and less than one. These are box or bound constraints in the Task View. However, we also have the constraint that they must sum to one. This is not a box/bound constraint any more, but a linear constraint, so any tool on the Task View that only discusses box/bound constraints is not helpful.

I settled on the Rsolnp package, not because I have any experience with it, but because it was the first one to look helpful. It looks like it performs minimization (though it apparently nowhere says so explicitly).

Read your dataset (next time, best to provide the data as an MWE, ideally as the output of dput()):

dataset <- data.frame(
    var_prob1 = c(.95,.12,.34,.61,.17,.26,.78),
    var_prob2 = c(.28,.18,.33,.77,.70,.48,.05),
    var_prob3 = c(.77,.74,.47,.67,.14,.38,.43),
    exit = c(0,0,1,0,1,0,0))

Define the objective function - since we want to maximize the AUC and Rsolnp minimizes, we minimize negative AUC:

library(pROC)
objective <- function(parameters) {
    # Negative AUC of the weighted sum (Rsolnp minimizes, so we negate).
    # Fixing direction = "<" keeps auc() from silently flipping poor
    # weightings above 0.5 (its default direction = "auto" would do so,
    # which breaks the optimization); quiet = TRUE suppresses messages.
    -as.numeric(auc(response = dataset$exit,
        predictor = (as.matrix(dataset)[, 1:3] %*% parameters)[, 1],
        direction = "<", quiet = TRUE))
}
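As a quick sanity check (and a package-free cross-check of pROC), the AUC can also be computed directly from the Mann-Whitney rank formula; this sketch redefines the example data so it runs standalone, and `auc_rank` is my own helper name, not part of any package:

```r
# Rank-based (Mann-Whitney) AUC: base R only, no packages needed.
# Mid-ranks from rank() give ties the conventional 0.5 credit.
auc_rank <- function(response, predictor) {
  r <- rank(predictor)
  n_pos <- sum(response == 1)
  n_neg <- sum(response == 0)
  (sum(r[response == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

dataset <- data.frame(
  var_prob1 = c(.95, .12, .34, .61, .17, .26, .78),
  var_prob2 = c(.28, .18, .33, .77, .70, .48, .05),
  var_prob3 = c(.77, .74, .47, .67, .14, .38, .43),
  exit      = c(0, 0, 1, 0, 1, 0, 0))

# Equal weights as a baseline:
score <- as.matrix(dataset[, 1:3]) %*% rep(1/3, 3)
auc_rank(dataset$exit, score[, 1])   # 0.2 here: well below 0.5, so
                                     # there is room for better weights
```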

Finally, call the solver:

library(Rsolnp)
n_parameters <- ncol(dataset)-1
result <- gosolnp(
    fun=objective,
    eqfun=function(parameters)sum(parameters),
    eqB=1,
    LB=rep(0,n_parameters),
    UB=rep(1,n_parameters))
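If the solver converges, the fitted weights can be read off the returned object (Rsolnp stores the solution vector in `$pars`); a short sketch of how one might inspect them:

```r
# Inspect the solution: optimal weights and the achieved AUC.
weights <- result$pars
round(weights, 3)     # the three weights
sum(weights)          # should equal 1 up to solver tolerance
-objective(weights)   # the maximized AUC itself (objective is negated AUC)
```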
Stephan Kolassa
  • Many thanks for this. With the example above, it's not converging; there is an endless iteration which I have to force to stop. I'm just running the code you provided. – Maximilian Sep 06 '22 at 13:32
  • Strange. How long did you let it run? It does take a little while, but it terminates correctly for me. (It takes a while because it tries the underlying solver with multiple random starting values, a good practice). Which R and package versions are you running (sessionInfo())? I have R 4.2.1, Rsolnp 1.16 and pROC 1.18.0. – Stephan Kolassa Sep 06 '22 at 13:35
  • compiler_4.1.3 tools_4.1.3, with packageVersion("Rsolnp") [1] ‘1.16’ – Maximilian Sep 06 '22 at 13:58
  • packageVersion("pROC") [1] ‘1.18.0’ – Maximilian Sep 06 '22 at 14:11
  • The main difference seems to be that I have R 4.2.1, you have R 4.1.3. I don't know whether this is important, but you could try upgrading. – Stephan Kolassa Sep 06 '22 at 14:20
  • (+1) Thanks, I see. Unfortunately I have to stick with the current R version because of other dependencies, so I have to find another way to solve this. Many thanks for your proposed solution; I hope someone else will be able to contribute a "simpler, less dependent" solution. – Maximilian Sep 06 '22 at 14:23
  • OK, I'm sorry I couldn't be more helpful. You could take a look at the optimization task view and see whether something else addresses your problem. Unfortunately, R is not really strong at optimization. – Stephan Kolassa Sep 06 '22 at 14:25
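Regarding the "less dependent" wish in the comments: one base-R alternative (my own sketch, not from the answer above) is to reparameterize the weights through a softmax, so they automatically lie in [0, 1] and sum to 1, and then maximize a rank-based AUC with `optim()`; `auc_rank`, `softmax`, and `neg_auc` are illustrative helper names defined inline:

```r
# Base-R alternative: the softmax reparameterization removes both the
# box and the sum-to-one constraints, so plain optim() can be used.
auc_rank <- function(response, predictor) {
  r <- rank(predictor)   # Mann-Whitney AUC via mid-ranks
  n_pos <- sum(response == 1); n_neg <- sum(response == 0)
  (sum(r[response == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

dataset <- data.frame(
  var_prob1 = c(.95, .12, .34, .61, .17, .26, .78),
  var_prob2 = c(.28, .18, .33, .77, .70, .48, .05),
  var_prob3 = c(.77, .74, .47, .67, .14, .38, .43),
  exit      = c(0, 0, 1, 0, 1, 0, 0))

softmax <- function(z) exp(z) / sum(exp(z))
neg_auc <- function(z) {
  w <- softmax(z)   # unconstrained z maps to valid weights
  -auc_rank(dataset$exit, (as.matrix(dataset[, 1:3]) %*% w)[, 1])
}

# Nelder-Mead is derivative-free, which suits the piecewise-constant AUC.
fit <- optim(par = rep(0, 3), fn = neg_auc, method = "Nelder-Mead")
weights <- softmax(fit$par)
weights      # weights in [0, 1] summing to 1
-fit$value   # achieved AUC
```

Because the AUC is piecewise constant in the weights, a single run can stall on a flat region; restarting `optim()` from several random `par` values and keeping the best fit is advisable, much as `gosolnp` does internally.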