
I used logistic regression. I have six features, and I want to know which features influence the result more than the others. I used Information Gain, but it does not seem to depend on the classifier used. Is there any method to rank the features according to their importance based on a specific classifier (like logistic regression)? Any help would be highly appreciated.

BlueGirl
  • Logistic regression is not a classifier. Please re-write your question to reflect that logistic regression is a direct probability estimation model. – Frank Harrell Feb 14 '16 at 18:19
  • Aside from the point raised by FrankHarrell, did you look at the $p$-values of your estimated coefficients? It is definitely not the best way of ranking features, but it can give you a starting point. – usεr11852 Feb 14 '16 at 18:46
  • Sure, logistic regression is estimating probabilities and not explicitly classifying things, but who cares? The purpose is often to decide which class is most likely, and there's nothing wrong with calling it a classifier if that's what you're using it for. – dsaxton Feb 14 '16 at 20:40

3 Answers


I think the answer you are looking for might be the Boruta algorithm. This is a wrapper method that directly measures the importance of features in an "all relevant" sense and is implemented in an R package. The package produces plots in which the importance of each feature is shown on the y-axis and compared against a null (shadow) feature. This blog post describes the approach, and I would recommend reading it as a very clear introduction.
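For illustration, here is a minimal sketch of a typical call to the Boruta package; the data frame `df` and outcome column `outcome` are placeholders for your own data, not anything from the question:

```r
# Minimal sketch using the Boruta R package. `df` is assumed to hold a
# binary outcome column `outcome` and the candidate features.
library(Boruta)

set.seed(42)                           # Boruta is randomized, so fix the seed
res <- Boruta(outcome ~ ., data = df)  # compares each feature against shadow (null) features

print(res)        # decisions: Confirmed / Tentative / Rejected per feature
plot(res)         # importance boxplots, with the shadow features as the null reference
attStats(res)     # numeric importance statistics per feature
```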


To begin understanding how to rank variables by importance for regression models, you can start with linear regression. A popular approach is to decompose $R^2$ into contributions attributed to each variable. Variable importance is not straightforward in linear regression, however, because of correlations between the variables; refer to the document describing the PMD method (Feldman, 2005) [3]. Another popular approach is averaging the contributions over all orderings of the variables (LMG, 1980) [2].
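As a rough illustration of the averaging-over-orderings (LMG) idea, here is a minimal sketch using the relaimpo R package (one common tool for this; the data frame `df` and response `y` are placeholders):

```r
# Sketch of an R^2 decomposition for a linear model with the relaimpo package.
# `df` is a placeholder data frame with a numeric response `y` and the predictors.
library(relaimpo)

fit <- lm(y ~ ., data = df)

# "lmg" averages each predictor's contribution to R^2 over all orderings
# of the predictors, which is one way to handle correlated variables.
imp <- calc.relimp(fit, type = "lmg", rela = TRUE)
print(imp)   # per-predictor shares of R^2; they sum to 1 when rela = TRUE
```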

There isn't much consensus on how to rank variables for logistic regression. A good overview of this topic is given in [1], which describes adaptations of the linear regression relative importance techniques to logistic regression using pseudo-$R^2$.

Some popular approaches to ranking feature importance in logistic regression models are:

  1. Logistic pseudo partial correlation (using Pseudo-$R^2$)
  2. Adequacy: the proportion of the full-model log-likelihood that is explainable by each predictor individually (a rough sketch follows this list)
  3. Concordance: indicates a model's ability to separate the positive and negative responses. A separate model is constructed for each predictor, and the importance score is that single-predictor model's concordance, i.e., the probability that it ranks a randomly chosen positive case above a randomly chosen negative one.
  4. Information value: Information values quantify the amount of information about the outcome gained from a predictor. It is based on an analysis of each predictor in turn, without taking into account the other predictors.
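As a rough illustration of the adequacy idea (item 2), here is a minimal sketch in base R; the data frame `df` with binary outcome `y` is a placeholder, and this is only one way such a measure could be computed:

```r
# Adequacy sketch: share of the full-model likelihood-ratio chi-square
# that each predictor recovers on its own. `df` has a binary outcome `y`.
predictors <- setdiff(names(df), "y")

full    <- glm(y ~ ., data = df, family = binomial)
lr_full <- full$null.deviance - full$deviance   # full-model LR chi-square

adequacy <- sapply(predictors, function(p) {
  single <- glm(reformulate(p, response = "y"), data = df, family = binomial)
  (single$null.deviance - single$deviance) / lr_full  # fraction explained alone
})

sort(adequacy, decreasing = TRUE)               # rank predictors by adequacy
```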

References:

  1. On Measuring the Relative Importance of Explanatory Variables in a Logistic Regression
  2. Relative importance of Linear Regressors in R
  3. Relative Importance and Value, Barry Feldman (PMD method)

Don't be alarmed. Logistic regression (LR) can very much be used as a classification scheme. LR minimizes the following loss: $$ \mathop{\min}\limits_{\mathbf{w},b} \; \sum\limits_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, f_{\mathbf{w},b}(x_i)\right)\right) + \lambda \left\| \mathbf{w} \right\|^2 $$ where $x_i$ and $y_i$ are the feature vector and target label for example $i$ from your training set. This function originates from the joint likelihood over all training examples, which explains its probabilistic nature even though we use it for classification. In the equation, $\mathbf{w}$ is your weight vector and $b$ your bias. I trust that you know what $f_{\mathbf{w},b}(x_i)$ is. The last term in the minimization problem is the regularization term, which, among other things, controls the generalization of the model.

Assuming all your $\mathbf{x}$ are normalized, for example by dividing by the magnitude of $\mathbf{x}$, it is quite easy to see which variables are more important: those whose weights are larger c.f. the others, or (on the negative side) smaller c.f. the others. They influence the loss the most.
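A minimal sketch of this ranking in R, using an unpenalized `glm` for simplicity (the ridge-penalized fit in the loss above would require a package such as glmnet); the data frame `df` and outcome `y` are placeholders:

```r
# Sketch: rank features by coefficient magnitude after putting all
# predictors on the same scale. `df` holds a binary outcome `y`.
X <- scale(df[setdiff(names(df), "y")])      # standardize the features
d <- data.frame(y = df$y, X)

fit <- glm(y ~ ., data = d, family = binomial)

# Larger |coefficient| => larger influence on the linear predictor / loss.
sort(abs(coef(fit))[-1], decreasing = TRUE)  # drop the intercept, rank the rest
```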

If you are keen on finding the variables which really are important, and in the process don't mind kicking a few out, you can $\ell_1$-regularize your loss function: $$ \mathop{\min}\limits_{\mathbf{w},b} \; \sum\limits_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, f_{\mathbf{w},b}(x_i)\right)\right) + \lambda \left\| \mathbf{w} \right\|_1 $$

The derivatives of the regularizer are quite straightforward, so I will not state them here. Using this form of regularization with an appropriate $\lambda$ will force the less important elements of $\mathbf{w}$ to zero while leaving the others nonzero.
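A minimal sketch of the $\ell_1$-penalized fit using the glmnet R package (one possible tool; the feature matrix `X` and outcome vector `y` are placeholders):

```r
# Sketch: L1-regularized (lasso) logistic regression with glmnet.
# `X` is a numeric matrix of the features, `y` a binary outcome vector.
library(glmnet)

cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 => pure L1 penalty
coefs <- coef(cvfit, s = "lambda.min")                    # coefficients at the CV-chosen lambda

# Features whose coefficients are driven exactly to zero are the
# "less important" ones; the survivors are what the penalty kept.
print(coefs)
```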

I hope this helps. Ask if you have any further questions.

pAt84
  • LR is not a classification scheme. Any use of classification comes as a postestimation step after defining the utility/cost function. Also, the OP did not ask about penalized maximum likelihood estimation. To provide evidence for relative importance of variables in regression it is very easy to use the bootstrap to obtain confidence limits for the ranks of added predictive information provided by each predictor. An example appears in Chapter 4 of Regression Modeling Strategies, whose online notes and R code are available at http://biostat.mc.vanderbilt.edu/RmS#Materials – Frank Harrell Feb 14 '16 at 19:31
  • Prof. Harrell, please. It is obvious we are approaching this from two different sides. You from the statistical one and I am from machine learning. I respect you, your research and your career but you are very free to formulate your own answer and let the OP decide, which one he considers the better answer for his question. I am keen on learning, so please teach me your approach but don't make me buy your book. – pAt84 Feb 14 '16 at 19:40
  • I'll just note that logistic regression was developed by statistician DR Cox in 1958, decades before machine learning existed. It is also important to note that the "loss function" (better called an objective function perhaps?) you formulated does not have any relationship whatsoever to classification. And what implied to you that my extensive notes and audio files available online with all the information I referred to cost anything? – Frank Harrell Feb 14 '16 at 20:13
  • I couldn't find Chapter 4, which you recommended. It is the proper loss function, and it does have a relationship to classification. Why make vague statements without backing them up? Yes, logistic regression pre-dates the first baby steps of machine learning. So what? The OP specifically said he is using it as a classifier; is it forbidden to give an answer in this direction? Again, I urge you to provide your own answer. I wouldn't mind learning a thing or two from you and the original statistical approach. – pAt84 Feb 14 '16 at 20:23
  • I upvoted both initial comments, as both raise valid points. The later comments read a bit like petty quarreling to me... – usεr11852 Feb 14 '16 at 20:29
  • @pAt84: I think you want to explain what you mean by c.f.; it is an acronym you do not explain in the text. – usεr11852 Feb 14 '16 at 20:31
  • The abbreviation cf. derives from the Latin verb conferre, while in English it is commonly read as "compare". [Wikipedia] – pAt84 Feb 14 '16 at 20:34
  • Thanks. I have seen it only used without the middle period as in your comment and not as in your original post. – usεr11852 Feb 14 '16 at 20:43
  • Apologies - it's in Chapter 5, section 5.4. Go to the URL I gave, click Handouts, then click Handouts again on the page that comes up. The loss function you specified has no relationship to classification. It is only related to estimation. – Frank Harrell Feb 14 '16 at 21:08
    Minimizing it realizes the learning of the classification model. I don't understand your point, but in my world this is a strong relationship to classification. Btw.: it is neither my fault nor the community's that Vladimir Vapnik (among others) came up with the SVM. Later on it was realized that we can embed other loss functions, including the logistic loss, instead of only the hinge loss. Despite the fact that it has not taken the statistical route, it has worked and is working well in many applications. I don't see your problem with that but honestly do also not want to hear about it. – pAt84 Feb 14 '16 at 21:20
    Never thought I would find myself in a kindergarten fight with an established Professor. Thank you for the link. – pAt84 Feb 14 '16 at 21:20
  • No, if you want classification you either need a second step or you need a loss function that optimizes classification error. And I'm trying to see how SVM relates to this discussion. – Frank Harrell Feb 15 '16 at 12:35
  • P.S. Trying for a more clear way to say this, optimizing prediction/estimation leads to optimum decisions because the utility function is applied in a second step and is allowed to be unrelated to the predictors. Optimizing prediction/estimation does not optimize classification and vice-versa. Optimizing classification amounts to using a strange utility function that is tailored to the dataset at hand and may not apply to new datasets. Folks who really want to optimize classification (not recommended) can use a method that bypasses estimation/prediction altogether. – Frank Harrell Feb 15 '16 at 12:55
  • @FrankHarrell The link in the first comment leads nowhere. Do you have an updated link? – Igor F. Dec 26 '20 at 20:28
  • Sorry - the new link is https://hbiostat.org/rms – Frank Harrell Dec 26 '20 at 21:22