
I have a logistic regression model whose goal is to predict the winner of a sports match (a two-player heads-up game). I use statistics from both players as features, taking the difference between the statistics rather than the raw values. For example, Person A's average running time minus Person B's average running time could be a feature.

After implementing logistic regression and cross-validation, I get an odd result: scikit-learn's built-in probability function gives asymmetrical outputs. For example, if I use my model to predict Person A's chances of beating Person B (using Person A's features minus Person B's features), this value is not equal to the complement of the prediction of Person B's chances of beating Person A (using Person B's features minus Person A's features).

Is there a way to interpret the results of my model, since I do not know which probability to use: P(Person A beats Person B), or 1 − P(Person B beats Person A)? The average difference between the two probabilities is around 10–20%.

Here's an example.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame([[1, 2, 1], [1, 3, 0], [4, 6, 1]], columns=['Diff_Average_Running Time', 'Diff_Average_Max_Speed', 'Result'])

selected_features = ['Diff_Average_Running Time', 'Diff_Average_Max_Speed']

X_train, X_test, y_train, y_test = train_test_split(df[selected_features], df['Result'])

y_train = y_train.astype(int)
y_test = y_test.astype(int)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(fit_intercept=True, random_state=42)
model.fit(X_train_scaled, y_train)

Then I would create a new dataframe with the statistics of a matchup I want to project and scale it using the same scaler. For example:

predict_df = pd.DataFrame([[-1,2]], columns = ['Diff_Average_Running Time', 'Diff_Average_Max_Speed'])
predict_df = scaler.transform(predict_df)

and use scikit-learn's prediction function to project the probabilities

model.predict_proba(predict_df)

Now the opposite projection (Person B's chances of beating Person A) would simply be this dataframe

opposite_predict_df = pd.DataFrame([[1,-2]], columns = ['Diff_Average_Running Time', 'Diff_Average_Max_Speed'])
opposite_predict_df = scaler.transform(opposite_predict_df)

When projecting the probability of B beating A using the same method, I expect

model.predict_proba(opposite_predict_df) = 1 - model.predict_proba(predict_df)

However, this does not hold. Often the complement of Person B's probability is larger than Person A's probability by a noticeable margin.
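For concreteness, here is a self-contained sketch (with synthetic data, since my real features are proprietary) of the kind of asymmetry I am seeing: after fitting on scaled difference features, the probability for the A-vs-B matchup is not the complement of the probability for the reversed B-vs-A matchup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic "difference" features (A - B) with a nonzero mean, and match results
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Same pipeline as above: scale, then fit logistic regression with an intercept
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(random_state=42).fit(X_scaled, y)

x = np.array([[-1.0, 2.0]])                             # A - B features
p_a = model.predict_proba(scaler.transform(x))[0, 1]    # P(A beats B)
p_b = model.predict_proba(scaler.transform(-x))[0, 1]   # P(B beats A)

print(p_a, 1 - p_b)  # not equal in general
```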

  • What exactly are the features and what exactly are you predicting? What is the model? Could you give us a short example of the data you are using (a few rows)? – Tim Jun 12 '23 at 07:12
  • The features are proprietary, however I can give an example of something very similar. An example would be a prediction of a race between two people in a marathon, I would be using the differences between their previous statistics as features (for example average run time, max speed, etc). So one feature could be the difference between Person A's average time in a 5 meter race compared to Person B's. Note that when I reverse the prediction(Person B's prob of beating Person A) I will have a negative value in these features. The model is a logistic regression model. – user54565 Jun 12 '23 at 21:06
  • Then could you edit to add a reproducible example? – Tim Jun 12 '23 at 21:08
  • Updated, let me know if any other info is needed. – user54565 Jun 15 '23 at 02:49

1 Answer


To answer your question, let me simplify the model you are using. Recall that logistic regression consists of the linear predictor

$$ \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

which is transformed using the logistic function $\sigma$

$$ E[y|X] = \sigma(\eta) $$

The linear predictor $\eta$ is unbounded, and the logistic function is symmetric around its midpoint $\sigma(0) = 0.5$. This means that the poor man's approximation of the model would be a linear regression model

$$ E[\tilde y|X] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

where $\tilde y$ has the negative category encoded as $-1$ (instead of $0$) and the positive category as $+1$. This is almost as if we used $\eta$ directly for making predictions (though the models would not be the same, because of the missing $\sigma$ transformation and the different loss functions and optimization algorithms).

Now you should easily see the first problem

$$ \beta_0 + \beta_1 (-x_1) + \beta_2 (-x_2) \ne -1 \times (\beta_0 + \beta_1 x_1 + \beta_2 x_2) $$

because of the intercept $\beta_0$: for the two sides to be equal, $\beta_0$ would need to be zero. In your model, you used fit_intercept = True (the default). Keep in mind, however, that in most cases we should not drop the intercept from a model, because it is responsible for bias correction.
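A quick numeric check makes this concrete (the coefficient values below are hypothetical, chosen just for illustration): with a nonzero intercept, $\sigma(\eta(x))$ and $\sigma(\eta(-x))$ do not sum to one; with a zero intercept, they do, because $\sigma(-z) = 1 - \sigma(z)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = 0.5, np.array([1.2, -0.8])   # hypothetical fitted parameters
x = np.array([1.0, 2.0])                   # A - B feature differences

# With an intercept: probabilities for the matchup and its reverse
p_forward = sigmoid(beta0 + beta @ x)      # "P(A beats B)"
p_reverse = sigmoid(beta0 + beta @ (-x))   # "P(B beats A)"
print(p_forward + p_reverse)               # != 1 because beta0 != 0

# Without an intercept the symmetry holds exactly
q_forward = sigmoid(beta @ x)
q_reverse = sigmoid(beta @ (-x))
print(q_forward + q_reverse)               # exactly 1
```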

Another problem is that you used StandardScaler which re-scaled the features and changed their meaning. For example, check the following code.

scaler = StandardScaler()
scaler.fit_transform([[1], [2], [3], [5]])
# array([[-1.18321596],
#        [-0.50709255],
#        [ 0.16903085],
#        [ 1.52127766]])

scaler.transform([[-1], [-2], [-3], [-5]])
# array([[-2.53546276],
#        [-3.21158617],
#        [-3.88770957],
#        [-5.23995638]])

As you can see, after the scaling it is not the case anymore that the negative value is the complement of the positive value. After the scaling, the values and their signs are relative to the mean of the data that was used for fitting the scaler. In the raw data, they were already using relative units ("faster than") and by scaling you destroyed their meaning.
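As a side note, it is specifically the centering step that destroys the sign symmetry; dividing by the standard deviation alone maps $-x$ to the negative of the scaled $x$. A small sketch (using StandardScaler's with_mean=False option) illustrates this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [5.0]])

# with_mean=False skips centering and only divides by the standard deviation,
# so negating the input simply negates the scaled output
scaler = StandardScaler(with_mean=False)
scaled_pos = scaler.fit_transform(X)
scaled_neg = scaler.transform(-X)

print(np.allclose(scaled_neg, -scaled_pos))  # True
```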

I would also argue that there are problems with your features. Say that you are comparing two elite runners who completed a marathon in under three hours and differed by one minute at the finish line. Would the same one-minute difference mean the same for two runners who finished the marathon in eight hours? I would say "no". The variance of the times among elite runners is much smaller than among the people who finish at the tail of the race. It is a success for an elite runner to beat their time by one minute; someone who finished their last run in eight hours would probably consider it a success only if they beat it by half an hour or more next time. Of course, they may be running just for fun and the health benefits, and not care about anything like that.

Finally, keep in mind that scikit-learn uses regularization by default. This may be a reasonable default for machine learning but is not if your aim is inference. From your description, it sounds like you may care about the interpretability of the parameters, so you may want to use penalty = None.

To summarize, if you use LogisticRegression(fit_intercept=False, penalty=None) and do not use the scaler, you would see symmetric results. But, as I argued, usually dropping the intercept is not a great decision, and the way you coded your data has possible flaws by itself that would handicap your model. But this is a completely different story and the answer is already lengthy.

Tim
  • Thanks for the in-depth answer, I think this resolves my understanding of why this issue was occurring. If I do not use scaling, would this greatly affect my accuracy score? If so, what alternative can I use to normalize the data (since my features are not all of the same type)? Would it be better to use the individual features instead of the differences between them? This would prevent negative feature values and allow me to use the scaler as is. Open to any suggestions, thank you! – user54565 Jun 16 '23 at 00:12
  • @user54565 scaling is needed for logistic regression only if you use regularization, which you probably don't need. – Tim Jun 16 '23 at 05:20