7

What does "weighted logistic regression" mean?

I came across the term "weighted logistic regression" in this paper.

I have read the paper thoroughly several times, but I still can't grasp the author's idea. I hope you can help me!

Ben

2 Answers

7

Let's begin with a weighted average, which slightly modifies the formula for an ordinary average:

$$\bar{x}^w=\frac{\sum_i w_i x_i}{\sum_i w_i}$$

An unweighted average would correspond to using $w_i=1$ (though any other constant would do as well).

Why would we do that?

Imagine, for example, that each value occurred multiple times ("We have 15 ones, 23 twos, 19 threes, 8 fours and 1 six"); then we could use weights to reflect the multiplicity of each value ($w_1=15$, etc.). The weighted average is then a faster way to calculate the average you'd get if you wrote "1" fifteen times and "2" twenty-three times (etc.) and calculated the ordinary average.
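To make the multiplicity example concrete, here is a small sketch (the numbers are the ones from the paragraph above) showing that the weighted average with count weights equals the ordinary average of the expanded data:

```python
values  = [1, 2, 3, 4, 6]
weights = [15, 23, 19, 8, 1]   # "15 ones, 23 twos, 19 threes, 8 fours, 1 six"

# Weighted average: sum(w_i * x_i) / sum(w_i)
weighted_avg = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Expand the data so each value appears with its multiplicity,
# then take the ordinary average.
expanded = [x for w, x in zip(weights, values) for _ in range(w)]
ordinary_avg = sum(expanded) / len(expanded)

print(weighted_avg, ordinary_avg)  # the two agree
```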

For another possible example, imagine that each observation was itself an average. Each average is not equally informative -- the ones based on larger samples should carry more weight (other things being equal).

(In that case if we set each observation's weight to the underlying sample size, we get the overall average of all the data that would comprise the component averages.)

There are many other reasons one might weight observations differently, though (e.g. if the observation values are not all equally precise).


In somewhat similar fashion, we can modify the estimator in ordinary regression to attach weights to the observations. It reproduces a weighted average when the regression is intercept-only.

The usual multiple regression estimator is $\hat{\beta}=(X^\top X)^{-1}X^\top y$. The weighted regression estimator is $\hat{\beta}=(X^\top W X)^{-1}X^\top W y$, where $W$ is a diagonal matrix, with weights on the diagonal, $W_{ii} = w_i$.
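A short numerical sketch of the formula above (the data here is synthetic, invented for illustration): compute $(X^\top W X)^{-1}X^\top W y$ directly, and confirm that the intercept-only case reproduces the weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)                      # per-observation weights

W = np.diag(w)
# beta_hat = (X' W X)^{-1} X' W y, solved without forming the inverse
beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Intercept-only weighted regression gives back the weighted average of y:
ones = np.ones((n, 1))
b0 = np.linalg.solve(ones.T @ W @ ones, ones.T @ W @ y)
assert np.isclose(b0[0], np.average(y, weights=w))
```

In practice you would avoid building the full diagonal matrix `W` for large $n$ (scaling rows of $X$ and $y$ by $\sqrt{w_i}$ is equivalent), but the matrix form matches the formula in the text.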


Weighted logistic regression works similarly, but without the closed-form solution you get with weighted linear regression.
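As a quick illustration (my addition, not from the answer): scikit-learn's `LogisticRegression` accepts per-observation weights via the `sample_weight` argument of `fit`; the model is fitted iteratively, since there is no closed form. The data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)
w = rng.uniform(0.5, 2.0, size=200)   # per-observation weights

# Each observation's contribution to the log-likelihood is scaled by w_i
model = LogisticRegression().fit(X, y, sample_weight=w)
print(model.coef_, model.intercept_)
```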

Glen_b
1

Weighted logistic regression is often used when you have an imbalanced dataset. Let's understand with an example.

Suppose you have a dataset of patient details and need to predict whether each patient has cancer. Such datasets are generally imbalanced: say 10,000 patients have cancer and 1,000,000 do not. Your approach might be:

  • Sample all 10,000 data points with cancer (100% of the cancer patients)
  • Sample 100,000 data points without cancer (10% of the non-cancer patients)

Now give weight = 10 to the data points without cancer, so that the effect of the 100,000 sampled points is the same as that of the original 1,000,000. This is one technique you can use when you have an imbalanced dataset.
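The scheme above can be expressed with scikit-learn's `class_weight` parameter, which applies one weight per class rather than per observation. The toy data below is invented (a small imbalanced sample standing in for the subsampled cancer dataset), and the weight of 10 mirrors the subsampling ratio described in the answer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Toy imbalanced subsample: 10 positives ("cancer"), 100 negatives
X = np.vstack([rng.normal(loc=1.0, size=(10, 2)),
               rng.normal(loc=-1.0, size=(100, 2))])
y = np.array([1] * 10 + [0] * 100)

# class 0 was subsampled 10:1, so it gets weight 10 to restore
# its original influence on the fit
model = LogisticRegression(class_weight={0: 10, 1: 1}).fit(X, y)
```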

  • 1
  • Thanks. But what does the weight mean in essence? How do we calculate it? – Ben Dec 31 '19 at 04:56
  • The weight means you are giving more importance to a particular class. In the above case, a person without cancer gets weight = 10 and a person with cancer gets weight = 1, meaning each non-cancer observation counts 10 times as much as a cancer observation. We do this so that the subsample has an effect similar to the original, balanced-by-weighting dataset.

    The calculation depends on your sample size. Here the sample size is 100,000 but the original size is 1,000,000; hence, weight = 1,000,000 / 100,000 = 10.

    – Parimal Roy Dec 31 '19 at 05:12
  • Where do we use it? In the loss function, or somewhere else? – Ben Jan 01 '20 at 05:02
  • Yes, we use it in the loss function: each observation's contribution to the loss is multiplied by its weight. For code you can refer to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html – Parimal Roy Jan 02 '20 at 06:38