
I'm working on a few logistic regression problems ("regular" and conditional logistic regression).

Ideally, I'd like to weight each of the input cases so that the GLM focuses more on predicting the higher-weighted cases correctly, at the expense of possibly misclassifying the lower-weighted cases.

Surely this has been done before. Can anyone point me toward some relevant literature, or possibly suggest a modified likelihood function?

Thanks!

Noah
  • You are assuming that classification is the goal, as opposed to prediction. For optimum estimation of probabilities you don't need to re-weight anything. "False negatives" and "false positives" only occur with forced choices, and usually no one is forcing a pure binary choice. – Frank Harrell Aug 16 '11 at 11:08
  • @Frank You make a good point. Ultimately, the goal for this project is to predict the outcome of further events (so I guess it can be thought of as a machine learning task with training data). Some outcomes are more "important" than others, so I was looking for a way to weight them accordingly. Nick's suggestion for the likelihood function makes sense and should be fairly trivial to implement in code. – Noah Aug 17 '11 at 03:51
  • Sounds like you need exactly a probability model, with no need for weights. – Frank Harrell Aug 17 '11 at 07:24
  • @FrankHarrell, as you know, decisions have costs, and it's useful to minimise your costs. A concrete example is making a loan: it's more important to correctly predict 100,000-dollar loans than 100-dollar loans (obviously the number of loans matters too). – seanv507 Nov 19 '15 at 12:38
  • Right; plug in the cost function and use the predicted probability, and you have an optimal decision. – Frank Harrell Nov 19 '15 at 13:52
  • @FrankHarrell, no, I am saying that if I estimate the probability of default, then estimation errors hurt me more when it is a large loan than a small loan, so it would be nice to weight errors for large loans more. But does that make sense in a maximum likelihood framework? – seanv507 Nov 20 '15 at 12:38
  • With a well-calibrated probability model there are no "errors", there's just randomness that cannot be predicted. Optimal decisions are a function of the predicted probability and the cost function for making various decisions to act. – Frank Harrell Nov 20 '15 at 13:22
  • Too late to weigh in? @FrankHarrell, I think you could say that there might be certain features which are more important for predicting default probability for larger loans (say for example, all your larger loans are to energy producers, in which case the price of oil might have a greater influence). In this case, simply selecting a confidence level will not allow you to minimize errors for your specific objective. – GPB Aug 11 '17 at 17:07
  • Stick to maximum likelihood estimation and you'll be OK. If you want to optimize decision making and not just make predictions, you need a utility function to feed the predicted probabilities into. – Frank Harrell Aug 12 '17 at 10:56
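
To make the point in these comments concrete: with a calibrated model fitted by ordinary maximum likelihood, unequal costs enter at the decision stage, not in the fit. Below is a minimal R sketch of that idea; the simulated data and the cost values are invented for illustration.

set.seed(1)                                          # hypothetical simulated example
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
d <- data.frame(y = y, x = x)

fit <- glm(y ~ x, family = binomial, data = d)       # plain unweighted maximum likelihood fit
p   <- predict(fit, type = "response")               # predicted probabilities

c_fp <- 1      # assumed cost of acting when the event does not occur
c_fn <- 10     # assumed cost of not acting when the event does occur

# Expected cost of acting is (1 - p) * c_fp; of not acting, p * c_fn.
# Acting is therefore optimal whenever p > c_fp / (c_fp + c_fn).
threshold <- c_fp / (c_fp + c_fn)
decision  <- p > threshold
table(decision, y)

The fit itself never needs case weights here; the costs only affect the threshold applied to the predicted probabilities.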

2 Answers


glm has a weights parameter for exactly this purpose. You provide it with a numeric vector, on any scale, containing one weight per observation.

I only now realize that you may not be talking about R. If not, you might want to.
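
For concreteness, here is a minimal sketch of the kind of call being described, with invented data and weights. For a 0/1 response, glm() effectively multiplies each observation's log-likelihood contribution by its weight; see the comments below for how the weights argument is documented for binomial models.

d <- data.frame(
  y = c(0, 1, 1, 0, 1),                # binary outcome
  x = c(1.2, 0.4, 2.1, 0.3, 1.8),      # a predictor
  w = c(1, 5, 1, 1, 3)                 # larger weight = "more important" case
)

fit <- glm(y ~ x, family = binomial, data = d, weights = w)
summary(fit)

Non-integer weights will trigger a warning about non-integer successes in a binomial glm, but the coefficients should still be the weighted maximum likelihood estimates.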

Nick Sabbe
  • I am very familiar with R; however, I'd like to understand the math behind the likelihood function, since I might code this in C++ or some other language. (Just trusting the "black box" of the glm function isn't always the best solution.) – Noah Aug 16 '11 at 08:04
  • Ah. Good on you. Well, as far as I know, the weights are simply used to multiply each observation's contribution to the log-likelihood (written out after these comments). So if you've written an unweighted version, adding the weights should be a doddle. Note also that you can always look at the source code for glm to (probably) find a C implementation. – Nick Sabbe Aug 16 '11 at 08:15
  • @Nick, I too was under the misconception that this was the function of the weights argument in glm - it is not. It is actually used when the binomial outcomes are inhomogeneous, in the sense that they are based on different numbers of trials. For example, if the first observation were Binomial($3, 0.5$) and the second were Binomial($7, 0.5$), their weights would be $3$ and $7$. Again, the weights argument in glm() is NOT a vector of sampling weights. To do this in R you will need to expand the data set according to the weights and fit the model to the expanded data set (though the SEs may be wrong in that case). – Macro Aug 16 '11 at 14:12
  • Here is a discussion of the 'weights' argument on a message board: http://r.789695.n4.nabble.com/Weights-in-binomial-glm-td1991249.html – Macro Aug 16 '11 at 14:18
  • @Macro: thx! Very neat. One of the things that could have hit me in the teeth if I'd used it before your comment :-) – Nick Sabbe Aug 16 '11 at 14:28
  • @Nick. That makes perfect sense. Should be fairly easy to add to my C++ implementation. Thanks! – Noah Aug 17 '11 at 02:47
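
To spell out the likelihood modification discussed in these comments (this is the standard weighted form, not specific to any one implementation): with case weights $w_i$, predictor vectors $x_i$, and binary outcomes $y_i$, the weighted log-likelihood for logistic regression is

$$\ell_w(\beta) \;=\; \sum_{i=1}^{n} w_i \Bigl[ y_i \log \pi_i(\beta) + (1 - y_i)\log\bigl(1 - \pi_i(\beta)\bigr) \Bigr], \qquad \pi_i(\beta) = \frac{1}{1 + \exp(-x_i^{\top}\beta)}.$$

Setting every $w_i = 1$ recovers the ordinary log-likelihood, and integer weights are equivalent to replicating the corresponding observations, which is the data-expansion trick Macro mentions above.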

If you have access to SAS, this is very easily accomplished using PROC GENMOD. As long as each observation has a weight variable, the WEIGHT statement will let you perform the kind of analysis you're looking for. I've mostly used it with inverse-probability-of-treatment weights, but I see no reason why you couldn't assign weights to your data to emphasize certain types of cases, so long as you make sure your N remains constant. You'll also want to include some sort of ID variable, because technically the upweighted cases are repeated observations. Example code, with an observation ID of 'id' and a weight variable of 'wt':

proc genmod data=work.dataset descending;
    class id;                                                   /* observation ID, needed for the REPEATED statement */
    model exposure = outcome covariate / dist=bin link=logit;   /* binomial distribution, logit link */
    weight wt;                                                  /* per-observation weight variable */
    repeated subject=id / type=ind;                             /* robust (empirical) standard errors, independence working correlation */
run;
Fomite