Update: The splines-based method to correct for age heaping is what the OP asks for but the result is not convincing: to my eye, the number of middle-aged adults is overpredicted at the expense of underpredicting the number of babies and toddlers. Parametrizing the population proportions on the logit scale — although mathetically correct as it produces probabilities that sum up to 1 — may exacerbate the problem. An alternative to splines is to model the age heaping process directly.
Here is one way to ensure the (smooth) age-specific estimates add up to the (known) total population size $N$: Put a spline on the population proportions $\mathbf{p} = \{p_i\}$ and model the observed counts $N_i$ as $\operatorname{multinomial}(N, \mathbf{p})$. The proportions form a simplex: $\sum_i p_i = 1$ and $0 < p_i < 1$, and the breakdown $N_i = Np_i$ preserves the total by construction.
At first blush, this doesn't get us far: we exchanged one set of constraints (non-negative counts by age that sum to $N$) with an equivalent set of constraints (non-negative proportions by age that sum to 1).
So we introduce age-specific parameters $\theta_i$ on the unconstrained scale (the real line) and map those to a simplex using a softmax transformation. Here is the complete model specification:
$$
\begin{aligned}
\boldsymbol{\theta} &= \mathbf{X}\boldsymbol{\beta}\\
\mathbf{p} &= \operatorname{softmax}(\boldsymbol{\theta}) = \frac{\exp(\mathbf{p})}{\sum_i\exp(p_i)} \\
\mathbf{N}_i &\sim \operatorname{multinomial}(N, \mathbf{p})
\end{aligned}
$$
where the model matrix $\mathbf{X}$ is determined by the spline function.
I fit this model with Stan for knots = 5, ..., 11. (The R code is attached at the end.) The corresponding age-specific estimates are in shades of green, and the original smooth estimates (obtained with lm(log(pop) ~ bs(age)) are in red.

Now about the choice of spline function, the number and location of knots, etc. It's straightforward to calculate the log-likelihood $\operatorname{multinomial}(N, \hat{\mathbf{p}})$, so models can be compared with an information criterion such as AIC. In this case, the process may not be particularly informative because $N$ is 68,000,000, so even a visually small change to the proportions $\hat{\mathbf{p}}$ makes a big difference to the log-likelihood. The penalty term (the number of spline parameters) is negligible in comparison. I stopped at 11 knots.
stan_code <- "
data {
int M;
int K;
matrix[M,K] X;
array[M] int Y;
}
parameters {
vector[K] beta;
}
transformed parameters {
vector[M] p = softmax(X * beta);
}
model {
Y ~ multinomial(p);
}
"
library("splines")
library("cmdstanr")
census <- read.csv("Age_pop.csv")
age <- census$age
pop <- census$pop
stan_model <- cmdstan_model(
write_stan_file(stan_code)
)
compose_data <- function(knots) {
Note 1: I use B-splines (with the splines::bs function) as this is
what the OP uses even though the question itself refers to cubic splines.
Note 2: The softmax is invariant under translation (ie. adding an intercept)
so I force the matrix of B-spline basis functions to have no intercept.
X <- model.matrix(~ 0 + bs(age, knots))
list(
M = nrow(X),
K = ncol(X),
X = X,
Y = pop
)
}
aic <- function(p, k) {
Ignores constants that are common to all models.
log_lik <- sum(pop * log(p$mean))
2 * k - 2 * log_lik
}
aics <- list()
for (knots in 5:11) {
fit <- stan_model$sample(data = compose_data(knots), seed = 1234)
p_hat <- fit$summary("p")
aics[knots] <- aic(p_hat, knots)
}
multinomialto thedirichlet_multinomialfamily. Those smooth probabilities are not any more realistic than the spline-based ones. – dipetkov Mar 29 '24 at 17:42