
Recently I ran a logistic regression model using the census data here. The model attempts to assess the relationship between female labor force participation and other variables. Here is the code I used to create the model:

library(tidyverse)
library(stargazer)
library(ggplot2)
library(ggpubr)
library(dplyr)
library(haven)
library(tidyr)

individu <- read_dta("Le fichier Individu en format STATA.dta")

newdata <- individu %>%
  filter(sexe == 2 & AGE5 >= 25) %>%
  select(AGE5, TY_ACT, NIV_ET_AGR, E_MAT, mil, reg, ENF_VIV) %>%
  rename_all(~ c("Age", "Act", "Edu", "Mar", "Loc", "Reg", "Kid")) %>%
  mutate(across(c("Edu", "Mar", "Loc"), as_factor)) %>%
  mutate(Kid = ifelse(is.na(Kid), 0, Kid)) %>%
  mutate(Act = ifelse(Act == 0, 1, 0)) %>%
  mutate(across(c("Edu", "Mar", "Loc"), fct_drop)) %>%
  na.omit()

newdata$Edu <- recode_factor(newdata$Edu, "Aucun niveau d'études" = "No Education", "Préscolaire" = "Preschool", "Primaire" = "Primary Sch.", "Secondaire collégial" = "Middle Sch.", "Secondaire qualifiant" = "High Sch.", "Supérieur" = "University")

newdata$Mar <- recode_factor(newdata$Mar, "Célibataire" = "Single", "Marié" = "Married", "Divorcé" = "Divorced", "Veuf" = "Widow")

newdata$Loc <- recode_factor(newdata$Loc, "Urbain" = "Urban")

Then I ran the general model:

main_model <- glm(Act ~ Age + I(Age^2) + Edu + Mar + Kid + Loc, family = binomial(link = "logit"), data = newdata)

Which gave me the following results:

> summary(main_model)

Call:
glm(formula = Act ~ Age + I(Age^2) + Edu + Mar + Kid + Loc, 
    family = binomial(link = "logit"), data = newdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0413  -0.5466  -0.3897  -0.2084   3.6565  

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)     -5.216e+00  4.054e-02 -128.665  < 2e-16 ***
Age              2.211e-01  1.958e-03  112.901  < 2e-16 ***
I(Age^2)        -2.677e-03  2.301e-05 -116.307  < 2e-16 ***
EduPreschool    -5.677e-02  3.970e-02   -1.430   0.1528    
EduPrimary Sch.  2.145e-02  9.476e-03    2.263   0.0236 *  
EduMiddle Sch.   2.743e-01  1.110e-02   24.711  < 2e-16 ***
EduHigh Sch.     1.018e+00  1.032e-02   98.576  < 2e-16 ***
EduUniversity    1.971e+00  1.087e-02  181.393  < 2e-16 ***
MarMarried      -7.747e-01  9.175e-03  -84.431  < 2e-16 ***
MarDivorced      6.336e-01  1.318e-02   48.081  < 2e-16 ***
MarWidow         1.165e-01  1.452e-02    8.026 1.01e-15 ***
Kid             -1.781e-01  2.245e-03  -79.342  < 2e-16 ***
LocRural        -4.381e-01  8.251e-03  -53.105  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 809064  on 913577  degrees of freedom
Residual deviance: 654932  on 913565  degrees of freedom
AIC: 654958

Number of Fisher Scoring iterations: 6

However, when I followed the steps on this website to assess the statistical validity of my model, the results suggested that it is not very sound:

# Hosmer-Lemeshow Test: the p-value should be high.

> ResourceSelection::hoslem.test(newdata$Act, fitted(main_model))

	Hosmer and Lemeshow goodness of fit (GOF) test

data:  newdata$Act, fitted(main_model)
X-squared = 2414.8, df = 8, p-value < 2.2e-16
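Worth noting before reading too much into this: with roughly 900,000 observations, the Hosmer-Lemeshow statistic flags even tiny, practically irrelevant calibration errors, since the χ² grows with sample size. One way to gauge how large the miscalibration actually is (rather than whether it is merely detectable) is to compare observed vs. mean predicted event rates by decile of predicted risk. A minimal sketch on simulated toy data (not the census data; all names here are made up):

```r
set.seed(7)
# Toy data with a known logistic relationship and ~16% positives, like Act
n <- 50000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.8 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

p <- fitted(fit)
decile <- cut(p, quantile(p, seq(0, 1, 0.1)), include.lowest = TRUE)

# Observed vs. mean predicted event rate within each risk decile:
# large gaps indicate real miscalibration, not just a large-n p-value
round(cbind(observed  = tapply(y, decile, mean),
            predicted = tapply(p, decile, mean)), 3)
```

On the real model, the same comparison with `y = newdata$Act` and `p = fitted(main_model)` would show where along the risk scale the fit drifts.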

> # Pseudo R^2: this should be close to 1, ideally.
> DescTools::PseudoR2(main_model, which = 'Nagelkerke')
Nagelkerke 
 0.2642365 

> # Residuals plots
> dev_res <- residuals(main_model, type = 'deviance')
> dev_res_std <- residuals(main_model, type = 'deviance') /
      sqrt(1 - hatvalues(main_model))
> pearson_res <- residuals(main_model, type = 'pearson')
> pearson_res_std <- residuals(main_model, type = 'pearson') /
      sqrt(1 - hatvalues(main_model))
> par(mfrow = c(1, 2))
> plot(density(pearson_res), main = 'Deviance (red) v. Pearson Residuals')
> lines(density(dev_res), col = 'red')
> plot(density(pearson_res_std),
       main = 'Deviance Standardized (red) v.\n Pearson Standardized Residuals')
> lines(density(dev_res_std), col = 'red')

[Plot: density curves of the deviance vs. Pearson residuals, raw and standardized]

Therefore I would like to know: is this because my response variable Act is unbalanced (0: 84%, 1: 16%)? If so, how do I extract a sample that is balanced on the variable Act, but also balanced by region? Any help would be very much appreciated.
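One way to draw such a sample (a minimal base-R sketch on toy data, since the census file isn't shared; `toy`, `n_cell`, etc. are made-up names) is to sample an equal number of rows from every (Reg, Act) cell, which balances the sample on Act and on region simultaneously:

```r
set.seed(42)
# Toy stand-in for `newdata`: ~16% of rows have Act == 1, spread over 12 regions
toy <- data.frame(Act = rbinom(12000, 1, 0.16),
                  Reg = sample(paste0("R", 1:12), 12000, replace = TRUE))

# Equal draws from every (Reg, Act) cell: balanced on Act and on region
n_cell <- min(table(toy$Reg, toy$Act))
cells  <- split(seq_len(nrow(toy)), interaction(toy$Reg, toy$Act))
balanced <- toy[unlist(lapply(cells, sample, size = n_cell)), ]

table(balanced$Act)  # equal counts of 0 and 1
table(balanced$Reg)  # equal counts per region
```

With dplyr, the same draw is roughly `newdata %>% group_by(Reg, Act) %>% slice_sample(n = n_cell)`. Keep in mind, though, that downsampling the majority class mainly shifts the model's intercept (the baseline rate); it does not by itself repair calibration or fit.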

PS: I wanted to use the BalancedSampling library; it can create a sample balanced by region, but I don't think it will guarantee that the proportions of Act are balanced.

EDIT 0:

I created a balanced sample, i.e. one with 100,000 observations with Act == 0 and 100,000 observations with Act == 1. The problem persists, and the deviance graph looks like this: [Plot: residual density curves for the balanced-sample model]

EDIT 1:

Following the links in the first comment, which argue against using R^2 and the Hosmer-Lemeshow test to measure goodness of fit and propose other measures instead, I installed the rms package to check the goodness of fit as measured by its resid function. It gave me the following results:

> lrm_one <- lrm(formula = Act ~ Age + AgeSq + Edu + Mar + Kid + 
                     Loc, data = newdata, y = TRUE, x = TRUE)
> resid(lrm_one, "gof")
Sum of squared errors     Expected value|H0                    SD                     Z                     P 
             1.004713e+05          1.003799e+05          2.218171e+01          4.121127e+00          3.770231e-05 

The p-value is very small and the model is not good by this measure either. Therefore it is clear that the problem is in my model, not in how I measure the goodness of fit. The residual plots generated by rms are also similar to the ones above.

  • Have a look at https://stats.stackexchange.com/questions/273966/logistic-regression-with-poor-goodness-of-fit-hosmer-lemeshow and https://stats.stackexchange.com/questions/169438/evaluating-logistic-regression-and-interpretation-of-hosmer-lemeshow-goodness-of/207512#207512 – kjetil b halvorsen Feb 01 '23 at 17:48
  • I'm not seeing why you think your model is poor in what you've presented. Your comment in the code - that the pseudo-R^2 should be close to 1 - seems to indicate that you believe you should be able to predict whether a given individual is in the labor force or not with a very high degree of accuracy; this seems overly optimistic to me, to say the least. – jbowman Feb 02 '23 at 19:46
  • @jbowman I think the pseudo-$R^2$ is too low. I am trying to submit a paper to a respected journal, and I believe the reviewers will ask about the statistical validity of my model (goodness of fit + predictive ability). – Saïd Maanan Feb 02 '23 at 20:21
  • 1) Predictive performance of the regressions in many academic articles (even in top journals) can be terrible. I have seen linear regressions with $R^2\approx 0.01$, maybe lower. $//$ 2) Pseudo $R^2$ in logistic regressions play by different rules, and for McFadden's pseudo $R^2$, McFadden himself claimed that $0.2$ to $0.4$ would qualify as outstanding performance. – Dave Feb 02 '23 at 20:23
  • You might want to read this article: https://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression/ for a bit of insight on McFadden's $R^2$. The money quote: "the conclusion I draw is that one should really not be surprised if, from a fitted logistic regression, McFadden's R2 is not particularly large – we need extremely strong predictors in order for it to get close to 1." In his simulations, even 90% accuracy only got an $R^2 = 0.55$. – jbowman Feb 02 '23 at 20:35
  • You seem to have a very large sample, which could support a larger model. Effects of age could well be non-linear, so maybe represent it with a spline? Maybe some interactions make sense? Can you show us some plot of residuals versus age? – kjetil b halvorsen Feb 03 '23 at 16:48
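A non-linear age effect, as suggested in the comment above, can be sketched with a natural cubic spline from the base `splines` package. This is a toy illustration, not the census data; the variable names and `df = 4` are arbitrary choices (the degrees of freedom could be tuned, e.g. by AIC):

```r
library(splines)

set.seed(1)
# Toy data: participation probability that rises then falls with age
age <- sample(25:75, 5000, replace = TRUE)
act <- rbinom(5000, 1, plogis(-9 + 0.35 * age - 0.004 * age^2))

quad_fit   <- glm(act ~ age + I(age^2), family = binomial)
spline_fit <- glm(act ~ ns(age, df = 4), family = binomial)

# The spline spends two extra parameters; compare the fits by AIC
AIC(quad_fit, spline_fit)
```

In the rms framework used in EDIT 1, the analogue would be a restricted cubic spline, e.g. `rcs(Age, 4)` inside the `lrm()` formula.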
  • A couple of other (possible) issues with your analysis: you fill in "unknown number of children" with 0. This is incorrect unless the documentation explicitly says that "missing" means "no children". And then you drop all rows with any remaining missing data, which can introduce bias. To add to @kjetilbhalvorsen, you can spline the number of children as well as age (ideally after coming up with a more appropriate way to handle the missing data). – dipetkov Feb 04 '23 at 19:22
  • Thank you for the remarks @dipetkov, I will rewrite the code differently and see what happens. – Saïd Maanan Feb 04 '23 at 22:08