Recently I ran a logistic regression model using the census data here. The model attempts to assess the relationship between female labor force participation and other variables. Here is the code I used to create the model:
library(tidyverse)
library(stargazer)
library(ggplot2)
library(ggpubr)
library(dplyr)
library(haven)
library(tidyr)
individu <- read_dta("Le fichier Individu en format STATA.dta")
newdata <- individu %>% filter(sexe==2 & AGE5>=25) %>%
select(AGE5, TY_ACT, NIV_ET_AGR, E_MAT, mil, reg, ENF_VIV) %>%
rename_all(funs(c("Age","Act","Edu","Mar","Loc","Reg","Kid"))) %>%
mutate(across(c("Edu","Mar","Loc"), as_factor)) %>%
mutate(Kid = ifelse(is.na(Kid), 0, Kid)) %>%
mutate(Act = ifelse(Act == 0, 1, 0)) %>%
mutate(Edu = fct_drop(Edu)) %>%
mutate(Mar = fct_drop(Mar)) %>%
mutate(Loc = fct_drop(Loc)) %>%
na.omit()
newdata$Edu <- recode_factor(newdata$Edu,
"Aucun niveau d'études" = "No Education",
"Préscolaire" = "Preschool",
"Primaire" = "Primary Sch.",
"Secondaire collégial" = "Middle Sch.",
"Secondaire qualifiant" = "High Sch.",
"Supérieur" = "University")
newdata$Mar <- recode_factor(newdata$Mar, "Célibataire" = "Single",
"Marié" = "Married",
"Divorcé" = "Divorced",
"Veuf" = "Widow")
newdata$Loc <- recode_factor(newdata$Loc, "Urbain" = "Urban")
Run the general model
main_model <- glm(Act ~ Age + I(Age^2) + Edu + Mar + Kid + Loc,
family = binomial(link = "logit"),
data = newdata)
Which gave me the following results:
> summary(main_model)
Call:
glm(formula = Act ~ Age + I(Age^2) + Edu + Mar + Kid + Loc,
family = binomial(link = "logit"),
data = newdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0413 -0.5466 -0.3897 -0.2084 3.6565
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.216e+00 4.054e-02 -128.665 < 2e-16 ***
Age 2.211e-01 1.958e-03 112.901 < 2e-16 ***
I(Age^2) -2.677e-03 2.301e-05 -116.307 < 2e-16 ***
EduPreschool -5.677e-02 3.970e-02 -1.430 0.1528
EduPrimary Sch. 2.145e-02 9.476e-03 2.263 0.0236 *
EduMiddle Sch. 2.743e-01 1.110e-02 24.711 < 2e-16 ***
EduHigh Sch. 1.018e+00 1.032e-02 98.576 < 2e-16 ***
EduUniversity 1.971e+00 1.087e-02 181.393 < 2e-16 ***
MarMarried -7.747e-01 9.175e-03 -84.431 < 2e-16 ***
MarDivorced 6.336e-01 1.318e-02 48.081 < 2e-16 ***
MarWidow 1.165e-01 1.452e-02 8.026 1.01e-15 ***
Kid -1.781e-01 2.245e-03 -79.342 < 2e-16 ***
LocRural -4.381e-01 8.251e-03 -53.105 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 809064 on 913577 degrees of freedom
Residual deviance: 654932 on 913565 degrees of freedom
AIC: 654958
Number of Fisher Scoring iterations: 6
However, when I followed the steps in this website to assess the statistical validity of my model, I found results suggesting that my model is not statistically very sound:
# Hosmer-Lemeshow Test: The p.value should be high.
>
> ResourceSelection::hoslem.test(newdata$Act, fitted(main_model))
Hosmer and Lemeshow goodness of fit (GOF) test
data: newdata$Act, fitted(main_model)
X-squared = 2414.8, df = 8, p-value < 2.2e-16
> # Pseudo R^2: this should be close to 1, ideally.
>
> DescTools::PseudoR2(main_model, which = 'Nagelkerke')
Nagelkerke
0.2642365
> # Residuals plot
>
> dev_res <- residuals(main_model, type='deviance')
> dev_res_std <- residuals(main_model, type='deviance') /
sqrt(1 - hatvalues(main_model))
> pearson_res <- residuals(main_model, type='pearson')
> pearson_res_std <- residuals(main_model, type='pearson') /
sqrt(1 - hatvalues(main_model))
> par(mfrow=c(1, 2))
> plot(density(pearson_res), main='Deviance (red) v. Pearson
Residuals')
> lines(density(dev_res), col='red')
> plot(density(pearson_res_std), main='Deviance Standardized (red) v.\n Pearson Standardized Residuals')
> lines(density(dev_res_std), col='red')
Therefore I would like to know: is this because my response variable Act is not balanced (0: 84%, 1: 16%)? If so, how do I extract a sample that is balanced as per the variable Act, but also balanced by region. Any help would be very much appreciated.
PS: I wanted to use the BalancedSampling library, it can create a balanced sample by region, but I don't think it will guarantee the proportions of Act will be balanced.
EDIT 0:
I created a balanced sample, i.e. one with 100,000 observations with Act==0, and 100,000 observations with Act==1. The problem persists, and the deviance graph looks like this:

EDIT 1:
Following the links in the first comment,
which argue against the use of R^2 and Hosmer-Lemeshow test to measure the goodness-of-fit, and propose their own measures instead, I installed the rms package to check the goodness of fit as measured by their resid function. It gave me the following results:
> lrm_one <- lrm(formula = Act ~ Age + AgeSq + Edu + Mar + Kid +
Loc, data = newdata, y = TRUE, x = TRUE)
> resid(lrm_one, "gof")
Sum of squared errors Expected value|H0 SD Z P
1.004713e+05 1.003799e+05 2.218171e+01 4.121127e+00 3.770231e-05
The p.value is very small and the model is not good. Therefore it is clear that the problem is in my model, not in how I measure the goodness of fit. The residuals plots generated by rms are also similar to the ones above.

agecould well be non-linear, so maybe represent it with a spline? Maybe some interactions make sense? Can you show us some plot of residuals versus age? – kjetil b halvorsen Feb 03 '23 at 16:48