It sounds like you want to do a logistic regression.
/edit: In response to your comment: When you said "optimize a weighted average" I thought "aha! that's exactly what a regression does!" I totally sympathize with your situation, as I have been there before. People will put a surprising amount of trust into "scoring" models that are absolutely worthless when it comes to prediction because anything more complicated is too difficult to understand. I would say:
- Step 1 is a simple linear regression, where the outcome is 0/1. This
will give you a weighted average, where the weights are your
coefficients. In fact, you don't have to tell them it's a regression
at all. Just give them your weights, and say you optimized them
using statistical magic. Calculate the accuracy of your model versus
the accuracy of the old model to demonstrate that it's better. Since
they're already using a weighted average, just find better
weights for them!
- Step 2 would be to optimize for accuracy directly. The goal
is the most accurate possible classification, so you'd use an
optimizer to find the weights that maximize accuracy. (Accuracy is a
step function of the weights, while linear regression minimizes the
sum of squared errors, so this will generally give a different
answer). Again, you should be extremely concrete in explaining your
model as a weighted average with different weights, and demonstrate
that it is more accurate at predicting.
- Step 3 is to get them thinking probabilistically. "Wouldn't it be nice
if we could say there's a 75% chance of event X occurring, and act
accordingly!" To get here, you simply plug the output from your
weighted average into the logistic function. This will map
predictions on the scale of (-Inf,Inf) to (0,1), and they can
interpret these predictions as probabilities!
- Step 4 is to realize that the probabilities from step 3 are terrible,
and use a logistic regression, which is designed to give reasonable
probabilities.
This sort of thing is always an uphill battle, but as a statistician, it's a HUGE part of your job to fight this fight. Present your results in a simple, concrete manner that demonstrates the value of what you are doing. Don't be afraid to attach a value to your model (e.g. an incorrect prediction costs \$10 and a correct prediction is worth \$100, so my model saves the company \$10,000/month vs the old one). Start simple, give them a chance to criticize your work, and then incorporate their feedback into the next version. Before you know it, they'll have a lot of investment in your model and will start finding ways to help you succeed.
Good Luck!
/edit 2: Here is an example in R:
library(caret)
set.seed(42)
N <- 100
logit <- function(t){1/(1+exp(-t))}  # the inverse logit: maps (-Inf,Inf) to (0,1)
a <- runif(N); b <- runif(N); c <- runif(N)  # three predictors
y <- 0.5*a + 5*b + 3*c + runif(N)
y <- y - mean(y)
y <- round(logit(y), 0)  # squash and round to get a 0/1 outcome
This creates some sample data
py <- round(a) + 2*round(b) + 3*round(c)
py <- (py-min(py))/(max(py)-min(py))
confusionMatrix(factor(round(py), levels=0:1), factor(y), positive='1')  # newer caret requires factors
>Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 41 30
         1  5 24

               Accuracy : 0.65
                 95% CI : (0.5482, 0.7427)
    No Information Rate : 0.54
    P-Value [Acc > NIR] : 0.0169
Some arbitrary weights get us an accuracy of 65%. Not bad, but you have to consider the fact that guessing "1" every time gets us an accuracy of 54%. (That's the no information rate)
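If you want to demo step 2, here is one way to sketch the "optimize accuracy directly" idea. This is my own addition: `optim` and its Nelder-Mead default are standard R, but the setup is purely illustrative, since accuracy is a step function of the weights and a derivative-free search like this is crude at best.

```r
# Step 2 sketch: search directly for the weights that maximize accuracy.
# Accuracy is a step function of the weights, so gradient-based methods
# won't work; Nelder-Mead (optim's default) is a crude but serviceable
# derivative-free search for a demo like this.
accuracy <- function(w){
  score <- w[1] + w[2]*a + w[3]*b + w[4]*c  # intercept + weighted average
  mean(as.numeric(score > 0) == y)          # fraction classified correctly
}
fit <- optim(c(0, 1, 2, 3), function(w) -accuracy(w))  # minimize -accuracy
fit$par            # the accuracy-optimized weights
accuracy(fit$par)  # compare against the 0.65 from the arbitrary weights
```

The weights it finds won't be unique (any rescaling of a separating set of weights gives the same accuracy), but that's fine for making the point.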
py <- predict(lm(y~a+b+c))
confusionMatrix(factor(round(py), levels=0:1), factor(y), positive='1')
>Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 40  3
         1  6 51

               Accuracy : 0.91
                 95% CI : (0.836, 0.958)
    No Information Rate : 0.54
    P-Value [Acc > NIR] : 8.791e-16
A linear regression gets us to an accuracy of 91%. Wahoo, you can stop here!
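If you want to demo step 3 on its own before reaching for glm, one sketch is to push the linear-regression output through the logistic function. The centering step is my own choice (so that an average score maps to 0.5); it reuses `py` from the linear regression just above.

```r
# Step 3 sketch: squash the linear predictions onto (0,1) so they can be
# presented as "probabilities". Centering first is a choice, so that an
# average score maps to 0.5; these numbers are typically poorly
# calibrated, which is exactly what step 4 (logistic regression) fixes.
p <- logit(py - mean(py))  # py holds the linear-regression predictions
range(p)                   # everything now lives strictly inside (0, 1)
```

The point of the demo is the interpretation, not the numbers: the classifications barely change, but now you can say "75% chance" out loud.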
py <- predict(glm(y~a+b+c, family=binomial(link = "logit")), type='response')
confusionMatrix(factor(round(py), levels=0:1), factor(y), positive='1')
>Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 43  3
         1  3 51

               Accuracy : 0.94
                 95% CI : (0.874, 0.9777)
    No Information Rate : 0.54
    P-Value [Acc > NIR] : <2e-16
Logistic regression gets us to 94% accuracy. In this example it might not be worth the extra effort, but if the sole goal is predictive accuracy it's worth demoing a superior model and evaluating the $$$ it could make you...
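To put the dollar framing from earlier into practice, here's a quick sketch using the hypothetical payoffs from above (a correct prediction worth \$100, an incorrect one costing \$10) together with the counts from the confusion matrices:

```r
# Hypothetical payoffs from the discussion above: a correct prediction
# is worth $100 and an incorrect one costs $10.
value <- function(correct, incorrect){100*correct - 10*incorrect}
value(41 + 24, 30 + 5)  # arbitrary weights: 65 right, 35 wrong -> 6150
value(43 + 51, 3 + 3)   # logistic model:    94 right,  6 wrong -> 9340
```

A gap of \$3,190 per batch of 100 predictions is exactly the kind of concrete number that gets a model adopted.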