
I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision of the positive or negative class. I know there is a coef_ attribute in the scikit-learn package, but I don't know whether it is enough to judge feature importance. Another thing is how I can evaluate the coef_ values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients, but I don't know what they are.

Let's say there are features like size of tumor, weight of tumor, etc. used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and not-malignant predictions. Does that make sense?

mgokhanbakal
  • Can you perhaps include an example to make things more concrete? – carlosdc Dec 02 '15 at 20:34
  • Lets say there are features like size of tumor, weight of tumor, and etc to make a decision for a test case like malignant or not malignant. I want to know which of the features are more important for malignant and not malignant prediction. Does it make sort of sense? – mgokhanbakal Dec 02 '15 at 20:46

1 Answer


One of the simplest options for getting a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.

Consider this example:

import numpy as np    
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)

Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
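For instance, here is a rough sketch of the bootstrap idea, reusing X and y from the snippet above (the number of resamples and the use of the coefficient-times-standard-deviation score are arbitrary choices here):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = []
for _ in range(200):
    # Resample the rows with replacement and refit the model
    idx = rng.integers(0, len(y), len(y))
    mb = LogisticRegression().fit(X[idx], y[idx])
    scores.append(np.std(X[idx], 0) * mb.coef_[0])
scores = np.array(scores)

# Average "influence" per feature and its spread across the resamples
print(scores.mean(axis=0))
print(scores.std(axis=0))

Features whose scores vary wildly across the resamples should not be over-interpreted.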

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

KT.
  • Thank you for the explanation. One more thing: what does a negative value of m.coef_ mean? Does it mean it is more discriminative for the decision of the negative class? Same question for positive values, too. – mgokhanbakal Dec 02 '15 at 21:06
  • A negative coefficient means that a higher value of the corresponding feature pushes the classification more towards the negative class. – KT. Dec 02 '15 at 21:08
  • Thank you for the additional explanation. – mgokhanbakal Dec 02 '15 at 22:21
  • By "standardized parameters" do you mean "standardized input"? – minhle_r7 Mar 19 '18 at 10:25
  • @ngocminh.oss Yes, as you can see in the code snippet. – KT. Mar 19 '18 at 16:30
  • Should it be np.std(X, 1)? – kravi Jul 31 '18 at 03:57
  • np.std(X,1) would compute standard deviations for each row. What you need is a standard deviation for each column. – KT. Jul 31 '18 at 12:57
  • " more interesting at stats.stackexchange.com " well *this* answer was already interesting enough - thanks! – WestCoastProjects Jul 31 '18 at 19:36
  • Note that this approach may be misleading: the coefficient size does not always reflect feature importance. As a counterexample, think of this: ``x1 = np.random.randn(100), x2 = x1 + 0.00001*np.random.randn(100), x3 = np.random.randn(100), y = 100*x1 - 100*x2 + x3`` (A more correct approach is to turn some features on and off, and compare predictive powers.) – Peter Franek Dec 17 '18 at 21:37
  • @PeterFranek Let us see how your counterexample works out in practice: https://pastebin.com/NXPxtPwc Note how the resulting model is "smart" enough to estimate smaller coefficients for the correlated features and thus correctly concludes that it is the third value which is the more important one (a sketch of this experiment is included after the comments below). Try coming up with a working counterexample ;) – KT. Dec 17 '18 at 23:56
  • You are correct to note, of course, that the described approach is among the simplest ones, that linear models with correlated inputs should always be handled with care, and that there are many other ways to estimate feature importance. Also, let me suggest you add your favourite method (whether it is stepwise feature elimination or something else of that kind) as a second possible answer to this question. It is a good question, after all, and is worth getting more than one answer in three years, right? – KT. Dec 17 '18 at 23:59
  • @KT Agree, thank you. The example I suggested works for Linear Regression, with continuous y and without the ``> 0).astype(int)`` constraint. But you are right, I should come up with a LogisticRegression counterexample. Thanks for the challenge; I will come back to this within a few days. (To explain my initial reaction -- I'm not an expert in the topic and was searching for a good method myself when I found this, and somehow to me it looked too simple to be true.) – Peter Franek Dec 19 '18 at 00:24
  • @PeterFranek The method seems "too simple to be true" because it kind-of assumes the model fitting did all the hard job of extracting a meaningful pattern in the data. Coefficients then simply describe this pattern and, obviously, looking at them should be a good way to interpret the model. The process of *fitting* the model in a way where it extracts the pattern is not, however, necessarily as simple as you might think (contd) – KT. Dec 19 '18 at 08:26
  • .. in particular, note that your example would only work with *unregularized, bare linear regression*. The latter is notorious for easily overfitting on data with correlated inputs, which would be the case here. (Although you might argue whether it is fair to call a model which manages to estimate the parameters used to generate the data "overfitting", I would say in this case it is, the reason being that the model does not take parameter uncertainty into account, and if you added even the slightest bit of noise to the output, your estimates would vary like crazy all over the place). (contd) – KT. Dec 19 '18 at 08:36
  • ... so try adding even the slightest bit of regularization, and you'll see how the model, already during the fitting process, will manage to figure out the true, statistically significant pattern, where the third parameter is the more important one. https://pastebin.com/Rfqfy0un – KT. Dec 19 '18 at 08:44
  • And, more generally, note that the questions of "how to understand the importance of features in an (already fitted) model of type X" and "how to understand the most influential features in the data in general" are different. Depending on your fitting process you may end up with different models for the same data: some features may be deemed more important by one model, while others by another. The important features "within a model" would only be important "in the data in general" when your model was estimated in a somewhat "valid" way in the first place. – KT. Dec 19 '18 at 08:48
  • In particular, if the most important feature in your data has a nonlinear dependency on the output, most linear models may not discover this, no matter how you tease them. Hence, it is nice to remember the difference between modeling and model interpretation. – KT. Dec 19 '18 at 08:49
  • @KT. Thanks for your comments, appreciate it. – Peter Franek Dec 24 '18 at 09:49
  • I am not very sure about this method - looks like it might be misleading. Could you give some references to it? Thanks – Huanfa Chen Jun 01 '20 at 18:23
  • @HuanfaChen, it should be obvious from the definition that the bare coefficient of a variable in a linear model indicates the rate of change in the model's output in response to a unit change of the corresponding variable. The coefficient multiplied by the standard deviation of the respective variable also takes the variable's own spread into account and thus indicates the rate of change in the output in response to a "standardized unit" change in the corresponding variable. Whether this measure is misleading for your purposes or not depends on your goals and is for you to decide. – KT. Jun 03 '20 at 10:42
  • Thanks for this post and its replies. I have a question: if I rank features based on coef times std, what threshold should I use to decide whether or not to keep a feature? I mean, should it be coef times std > 1e-5, or how do I find such a threshold? Many thanks – Luigi87 Dec 30 '20 at 13:38
  • @Luigi87 I don't think there's an objective universal threshold - it all depends on your ultimate goal. Ask yourself, why are you performing feature selection in the first place. If you can come up with a numeric measure of goodness for the selected feature set, then you can, for example, use cross-validation to find the optimal number of features to retain. Do note, though, that sorting by linear model coefficient values is not the only (nor often the best) way to do feature selection in general. The question here is not about feature selection but about the interpretation of a given model. – KT. Jan 06 '21 at 18:35
  • How else could you perform feature importance using linear regression? Using coefficient * s.d. seems to be the only viable choice – Maths12 Feb 10 '21 at 19:50
  • Does this method require features to be normalized ? I'm asking this because features that span over a large range of values are likely to have a higher standard deviation (and therefore a higher importance value) even if they don't contribute much to the success of the logistic regression. – Holaf Jul 12 '21 at 14:01
  • If your features are normalized, you could just look at the model coefficients directly. std(feature)*coefficient expresses by how much the model output changes if the respective feature changes by a fraction of its operating range and thus takes into account the possible different spread of the features. – KT. Jul 20 '21 at 09:10
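
For completeness, here is a rough reconstruction of the correlated-features experiment discussed in the comments above (the linked pastebin snippets are not reproduced here; the data-generating code follows Peter Franek's comment, with a > 0 threshold added to turn it into a classification problem):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Two almost identical features whose huge opposing "true" weights nearly
# cancel out, plus one modest independent feature.
x1 = np.random.randn(100)
x2 = x1 + 0.00001*np.random.randn(100)
x3 = np.random.randn(100)
y = (100*x1 - 100*x2 + x3) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()   # L2-regularized by default
m.fit(X, y)

# The scaled coefficients single out the third feature as the influential one.
print(np.std(X, 0)*m.coef_)

With the default regularization, the model does not assign large opposing weights to the two correlated features, so the coefficient-times-standard-deviation score points to the third feature, in line with the discussion in the comments.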