
I am moving from a single-model prediction of a binary outcome to an aggregate of a small number of models, for example:

set.seed(123)
library('data.table')  

m <- 1e5 # arbitrary number of obs
n <- 5 # arbitrary number of models

# y is the target outcome --  in this case, dichotomous and imbalanced
dt <- data.table(y = rbinom(m, 1, 0.01)) 

# jitter y away from 0 or 1 with exponential noise, so each resulting column
# looks like a predicted probability that is correlated with y
add_some_noise <- function(y, n){
  lapply(1:n, function(x){
    ifelse(y > 0.5, y - rexp(1, 6), y + rexp(1, 6))
  })
}

# x_1 - x_{n-1} are the predicted response probabilities from models 1 to n-1
dt[ , paste0('x_', 1:(n-1)) := add_some_noise(y, n-1), by = 1:m]

# to include a redundant model, let x_n be a noisy composite of x_1 and x_2
dt[ , paste0('x_', n) := add_some_noise((x_1 + x_2)/2, 1), by = 1:m]

What are good ways to (a) aggregate these predictions and (b) measure whether keeping any particular model is helpful?

Are there any key differences between this process and building the base model? The only difference I can see is that here the inputs are all in (0,1).

– C8H10N4O2

2 Answers


For (a) I wound up using logistic regression to determine the weight coefficients. In real life, tree-based approaches might be more sensitive to the strengths of different predictors in different parts of the feature space, but I didn't impose that structure on the fake data.

train_rows <- sample.int(m, size = .5 * m)

dt[-train_rows, cor(y, x_1)] # ~0.4

fit_lm <- glm(y ~ . + 0, data=dt[train_rows]) # default gaussian family, i.e. a linear probability model
summary(fit_lm)            # What does negative AIC mean here?
cor(dt[-train_rows, y],    # ~0.6, better than single predictor, but...
    predict(fit_lm, newdata=dt[-train_rows], type='response')) 
                           # ... logistic regression seems more appropriate here
fit_glm <- glm(y ~ ., family = binomial, data=dt[train_rows]) 
summary(fit_glm) 
cor(dt[-train_rows, y],    # >0.9, this works well
    predict(fit_glm, newdata=dt[-train_rows], type='response')) 
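
One caveat with the correlations above: with only ~1% positives, Pearson correlation is a fairly blunt summary. A rank-based (Mann-Whitney) AUC on the held-out rows may be a more direct measure of how well the stacked predictions separate the classes. A rough base-R sketch (the auc_rank helper is my own, not from a package):

# rank-based (Mann-Whitney) AUC on the held-out rows -- my own helper,
# used here as an alternative to Pearson correlation
auc_rank <- function(y, p){
  r <- rank(p)
  n_pos <- sum(y == 1); n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
p_stack <- predict(fit_glm, newdata = dt[-train_rows], type = 'response')
auc_rank(dt[-train_rows, y], p_stack)              # stacked predictions
auc_rank(dt[-train_rows, y], dt[-train_rows, x_1]) # single base model, for comparison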

I'm still a little stuck on (b).

I found another question on relative importance in logistic regression suggesting caret's varImp. Based on the docs and some experimentation, varImp on a bare glm object appears to rank predictors by the absolute value of each coefficient's test statistic (a parametric measure like $|t|$, or $|z|$ here), whereas caret's model-free filterVarImp is the one that uses an ROC/AUC-style measure.

library('caret'); varImp(fit_glm)   

This approach wasn't terribly useful for me without a cutoff value for "not useful." Clearly x_5 is less important, but it's not obviously useless. (And because it's a composite of multiple predictors, it might not stand out on a correlation matrix.)
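
A cutoff-free check that gets at (b) more directly is permutation importance on the held-out rows: shuffle one x column at a time and see how much the held-out deviance worsens. A rough sketch (the perm_importance helper and its details are mine, not from caret or any package):

# permutation importance on held-out data: increase in binomial deviance
# when each predictor is shuffled (larger = more useful; ~0 = adds little)
perm_importance <- function(fit, data, target = 'y'){
  dev_of <- function(p){
    p <- pmin(pmax(p, 1e-12), 1 - 1e-12)        # guard against log(0)
    -2 * sum(data[[target]] * log(p) + (1 - data[[target]]) * log(1 - p))
  }
  base_dev <- dev_of(predict(fit, newdata = data, type = 'response'))
  vars <- setdiff(names(data), target)
  sapply(vars, function(v){
    d <- copy(data)                             # data.table::copy
    set(d, j = v, value = sample(d[[v]]))       # shuffle one predictor
    dev_of(predict(fit, newdata = d, type = 'response')) - base_dev
  })
}
perm_importance(fit_glm, dt[-train_rows])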

I also considered:

drop1(fit_glm, test='Chisq')

# Single term deletions
# 
# Model:
# y ~ x_1 + x_2 + x_3 + x_4 + x_5
#          Df Deviance    AIC     LRT Pr(>Chi)    
#   <none>       93.35 105.35                     
#   x_1     1   281.98 291.98 188.630   <2e-16 ***
#   x_2     1   276.93 286.93 183.580   <2e-16 ***
#   x_3     1   368.13 378.13 274.773   <2e-16 ***
#   x_4     1   324.85 334.85 231.498   <2e-16 ***
#   x_5     1    93.95 103.95   0.592   0.4418   

So AIC improves (goes down) when x_5 is dropped, and p = 0.4418 indicates that the model without x_5 does not fit significantly worse than the full model, so I would be inclined to keep only the first four predictors.
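
To back that up out of sample, a quick extra check on the same train/test split: refit without x_5 and see whether the held-out correlation from part (a) moves at all.

fit_glm4 <- glm(y ~ . - x_5, family = binomial, data = dt[train_rows])
cor(dt[-train_rows, y],    # if this is essentially the same as the full model's >0.9,
    predict(fit_glm4, newdata = dt[-train_rows], type = 'response'))
                           # ... then dropping x_5 costs nothing on new data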

I'm just posting what I tried, but hoping the experts here can point out anything I overlooked.

– C8H10N4O2

Late to the party here, but for future readers I suggest looking into Bayesian Model Averaging for this type of problem. The R package 'BMA' is a good place to start.
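
A minimal sketch of what that could look like on the question's simulated data (only a starting point: it assumes the 'BMA' package is installed, and the bic.glm call is from memory, so check ?bic.glm):

library('BMA')
# posterior inclusion probabilities, P(beta != 0), double as an answer to (b):
# a predictor with a low inclusion probability is a candidate to drop
fit_bma <- bic.glm(y ~ ., data = as.data.frame(dt), glm.family = binomial())
summary(fit_bma)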

  • Welcome to CV! Could you expand on this answer to make it more useful to readers? – mkt Jul 19 '22 at 05:17