Age specific population interpolation: Why Cubic Spline interpolation is not giving sum equal to original sum

Question

Because of Age heaping, I want to smooth age-specific population derived from the census of Pakistan. The Cubic spline is assumed as an appropriate choice. However, the fitted spline is not giving the sum of the total population to the original total. Here is my code in R. The data can be found in the link.

Data on population by age

age <- seq(0,99, by=1)
library(splines)
pop <- log(pop) # use log to aviod negative values

sf <- lm(pop ~ bs(age, knots=c(25, 50, 75)))

pred = predict(sf,se=T)

plot(age,pop, pch=c(1),lty=c(1),type=c("o"))
  title("Cubic Splines")
  lines(age,pred$fit,lwd=2)#plot the fitted spline curve  #line width

Is there any way to keep sum equal before and after the smoothing? Before sum (68185120) != after sum(65270572)

Further, is there any logical way to assign the knot? The selection of the knots, in this case, is purely subjective. Regards, Asif

https://stackoverflow.com/q/48575615/626775

dipetkov · Answer 1 · 2024-03-29T16:53:27.843

Update: The splines-based method to correct for age heaping is what the OP asks for but the result is not convincing: to my eye, the number of middle-aged adults is overpredicted at the expense of underpredicting the number of babies and toddlers. Parametrizing the population proportions on the logit scale — although mathetically correct as it produces probabilities that sum up to 1 — may exacerbate the problem. An alternative to splines is to model the age heaping process directly.

Here is one way to ensure the (smooth) age-specific estimates add up to the (known) total population size $N$: Put a spline on the population proportions $\mathbf{p} = \{p_i\}$ and model the observed counts $N_i$ as $\operatorname{multinomial}(N, \mathbf{p})$. The proportions form a simplex: $\sum_i p_i = 1$ and $0 < p_i < 1$, and the breakdown $N_i = Np_i$ preserves the total by construction.

At first blush, this doesn't get us far: we exchanged one set of constraints (non-negative counts by age that sum to $N$) with an equivalent set of constraints (non-negative proportions by age that sum to 1).

So we introduce age-specific parameters $\theta_i$ on the unconstrained scale (the real line) and map those to a simplex using a softmax transformation. Here is the complete model specification:

$$ \begin{aligned} \boldsymbol{\theta} &= \mathbf{X}\boldsymbol{\beta}\\ \mathbf{p} &= \operatorname{softmax}(\boldsymbol{\theta}) = \frac{\exp(\mathbf{p})}{\sum_i\exp(p_i)} \\ \mathbf{N}_i &\sim \operatorname{multinomial}(N, \mathbf{p}) \end{aligned} $$ where the model matrix $\mathbf{X}$ is determined by the spline function.

I fit this model with Stan for knots = 5, ..., 11. (The R code is attached at the end.) The corresponding age-specific estimates are in shades of green, and the original smooth estimates (obtained with lm(log(pop) ~ bs(age)) are in red.

Now about the choice of spline function, the number and location of knots, etc. It's straightforward to calculate the log-likelihood $\operatorname{multinomial}(N, \hat{\mathbf{p}})$, so models can be compared with an information criterion such as AIC. In this case, the process may not be particularly informative because $N$ is 68,000,000, so even a visually small change to the proportions $\hat{\mathbf{p}}$ makes a big difference to the log-likelihood. The penalty term (the number of spline parameters) is negligible in comparison. I stopped at 11 knots.

stan_code <- "
data {
  int M;
  int K;
  matrix[M,K] X;
  array[M] int Y;
}
parameters {
  vector[K] beta;
}
transformed parameters {
  vector[M] p = softmax(X * beta);
}
model {
  Y ~ multinomial(p);
}
"
library("splines")
library("cmdstanr")
census <- read.csv("Age_pop.csv")
age <- census$age
pop <- census$pop
stan_model <- cmdstan_model(
  write_stan_file(stan_code)
)
compose_data <- function(knots) {
Note 1: I use B-splines (with the splines::bs function) as this is
what the OP uses even though the question itself refers to cubic splines.
Note 2: The softmax is invariant under translation (ie. adding an intercept)
so I force the matrix of B-spline basis functions to have no intercept.
X <- model.matrix(~ 0 + bs(age, knots))
  list(
    M = nrow(X),
    K = ncol(X),
    X = X,
    Y = pop
  )
}
aic <- function(p, k) {
Ignores constants that are common to all models.
log_lik <- sum(pop * log(p$mean))
  2 * k - 2 * log_lik
}
aics <- list()
for (knots in 5:11) {
fit <- stan_model$sample(data = compose_data(knots), seed = 1234)
  p_hat <- fit$summary("p")
aics[knots] <- aic(p_hat, knots)
}

score 1 · Answer 2 · answered Mar 29 '24 at 17:08

Here is a second approach to correct for age heaping and estimate the true population frequencies, without using splines. The gist of this approach is to model the age heaping process explicitly.

Age heaping occurs when a person doesn't know their age exactly and rounds up/down to the nearest "milestone" age. For example, a person who is 41 rounds down and reports 40; a person who is 44 rounds up and reports 45. So the observed frequencies $\hat{\mathbf{p}}$ are a "corrupted" version of the true frequencies $\mathbf{p}$.

I came across this method in Bayesian Statistical Modeling with Stan, R, and Python by Kentaro Matsuura. The data and scripts can be found in this GitHub repo. The age heaping example is covered in Section 12.3; see model 12—4.

The first step is to code up the "flow" from true age to milestone age. (The flow is partial as many people report correctly their age and their child's age.)

I started with milestone ages every 5th year (ie. ..., 40, 45, 50, ...) but it's clear from the OP's plot of the raw population frequencies that in this case the age heaping process is less regular; for example there is some heaping on 42 and 48 as well. So I ended up with allowing flow into every age that has a local peak. I also make a strong assumption about the direction of the flow: there is some probability (to be estimated) that a person mis-reports their age but if the age is mis-reported it's always assigned to the same milestone age. For example: 41→40, 43→42, 44→45, 46→45. This seems too rigid of an assumption but keeps (the model of) the age heaping process simple.

To complete the model, I also put an informative normal(0, 1) prior on the second-order differences of the true population proportions $\mathbf{p}$. This encourages smoothness but some traces of age heaping do remain. The dip at ages 0 and 1 also remains; this detail probably cannot be addressed satisfactorily by modeling the census data alone; it requires understanding the reasons, if any, why babies may be under-counted.

Note (to myself I guess): It's easy to smooth the probabilities by switching from the multinomial to the dirichlet_multinomial family. Those smooth probabilities are not any more realistic than the spline-based ones. — dipetkov, Mar 29 '24 at 17:42

Age specific population interpolation: Why Cubic Spline interpolation is not giving sum equal to original sum

2 Answers2

Note 1: I use B-splines (with the `splines::bs` function) as this is

what the OP uses even though the question itself refers to cubic splines.

Note 2: The softmax is invariant under translation (ie. adding an intercept)

so I force the matrix of B-spline basis functions to have no intercept.

Ignores constants that are common to all models.