Regression model with aggregated targets

Question

Similarly to this self-answered question, I want to ask about possible approaches to modelling data with aggregated targets, i.e. things like

$$ \bar y_{j[i]} = \alpha + \beta x_i + \varepsilon_i $$

where $j[i]$ denotes the group $j$ to which the $i$-th observation belongs. For each group $j$ of size $|j|$, the target is the average of all the $y_i$ observations within the group, $\bar y_{j[i]} = |j|^{-1} \sum_{i \in j[i]} y_i$. The means are given and cannot be disaggregated; this is the only data we have.

An additional assumption that can be made here is that there is clustering within the $j[i]$ groups, so the group assignment is not completely random: the subjects within each group share some characteristics.

For example, imagine that you have data on the average test score per class (the quantity to predict), and features at the student level, e.g. individual IQ scores (which should be highly, but not perfectly, predictive of exam scores), at the class level, and at a higher level of aggregation (school level). I am interested in finding the factors that contribute to each individual test score, and in predicting those scores. The data is a random sample of classes; the final predictions will be made for students from classes that were not observed in the training data.

Can we use such data to learn anything (approximately) about the unobserved individual-level targets?

What are the approaches used for modelling such data? Can you give some references? Obviously, with aggregated data we lose precision, and the variance of the means $\bar y_{j[i]}$ is smaller than that of the individual observations $y_i$, so predicting the average target is not the same as predicting individual values. Is there any way to translate the predictions of the group averages into possible variability between subjects?

@user20160 1 - yes, 2 - shared, but there may be some per-group features, 3 - see example in my edit, 4 - group id's are not meaningful in here, I'd like to make predictions for groups that were not seen in the training data. — Tim, Oct 07 '19 at 05:08

user20160 · Accepted Answer · 2019-10-10T00:04:45.907

Here's an approach for solving this type of problem using latent variable models. It's not a specific model, but a general way to formulate a model by breaking the description of the system into two parts: the relationship between individual inputs and (unobserved) individual outputs, and the relationship between individual outputs and (observed) aggregate group outputs. This gives a natural way to think about the problem that (hopefully somewhat) mirrors the data generating process, and makes assumptions explicit. Linear or nonlinear relationships can be accommodated, as well as various types of noise model. There's well-developed, general-purpose machinery for performing inference in latent variable models (mentioned below). Finally, explicitly including individual outputs in the model gives a principled way to make predictions about them. But, of course there's no free lunch--aggregating data destroys information.

General approach

The central idea is to treat the individual outputs as latent variables, since they're not directly observed.

Suppose the individual inputs are $\{x_1, \dots, x_n\}$, where each $x_i \in \mathbb{R}^d$ contains both individual and group-level features for the $i$th individual (group-level features would be duplicated across individuals). Inputs are stored on the rows of matrix $X \in \mathbb{R}^{n \times d}$. The corresponding individual outputs are represented by $y = [y_1, \dots, y_n]^T$ where $y_i \in \mathbb{R}$.

The first step is to postulate a relationship between the individual inputs and outputs, even though the individual outputs are not directly observed in the training data. This takes the form of a joint conditional distribution $p(y \mid X, \theta)$ where $\theta$ is a parameter vector. Of course, it factorizes as $\prod_{i=1}^n p(y_i \mid x_i, \theta)$ if the outputs are conditionally independent, given the inputs (e.g. if error terms are independent).

Next, we relate the unobserved individual outputs to the observed aggregate group outputs $\bar{y} = [\bar{y}_1, \dots, \bar{y}_k]^T$ (for $k$ groups). In general, this takes the form of another conditional distribution $p(\bar{y} \mid y, \phi)$, since the observed group outputs may be a noisy function of the individual outputs (with parameters $\phi$). Note that $\bar{y}$ is conditionally independent of $X$, given $y$. If group outputs are a deterministic function of the individual outputs, then $p(\bar{y} \mid y)$ takes the form of a delta function.

The joint likelihood of the individual and group outputs can then be written as:

$$p(y, \bar{y} \mid X, \theta, \phi) = p(\bar{y} \mid y, \phi) p(y \mid X, \theta)$$

Since the individual outputs are latent variables, they must be integrated out of the joint likelihood to obtain the marginal likelihood for the observed group outputs:

$$p(\bar{y} \mid X, \theta, \phi) = \int p(\bar{y} \mid y, \phi) p(y \mid X, \theta) dy$$

If group outputs are a known, deterministic function of the individual outputs, the marginal likelihood can be written directly without having to think about this integral (and $\phi$ can be ignored).
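
As a worked instance of the deterministic case, assume (for illustration only) that each group output is the plain mean of its members and that the individual outputs are conditionally independent given the inputs. Writing $G_j$ for the index set of group $j$ and $n_j = |G_j|$ (notation introduced here, not taken from the text above), the first two moments of an observed group mean follow directly:

$$\mathbb{E}[\bar{y}_j \mid X, \theta] = \frac{1}{n_j} \sum_{i \in G_j} \mathbb{E}[y_i \mid x_i, \theta], \qquad \operatorname{Var}[\bar{y}_j \mid X, \theta] = \frac{1}{n_j^2} \sum_{i \in G_j} \operatorname{Var}[y_i \mid x_i, \theta]$$

If all individuals share a common noise variance $\sigma^2$, the second expression reduces to $\sigma^2 / n_j$, which is what allows a variance fitted at the group level to be scaled back to individual-level variability.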

Maximum likelihood estimation

Maximum likelihood estimation of the parameters proceeds by maximizing the marginal likelihood:

$$\theta_{ML}, \phi_{ML} \ = \ \arg \max_{\theta,\phi} \ p(\bar{y} \mid X, \theta, \phi)$$

If the above integral can be solved analytically, it's possible to directly optimize the resulting marginal likelihood (either analytically or numerically). However, the integral may be intractable, in which case the expectation maximization algorithm can be used.

The maximum likelihood parameters $\theta_{ML}$ could be studied to learn about the data generating process, or used to predict individual outputs for out-of-sample data. For example, given a new individual input $x_*$, we have the predictive distribution $p(y_* \mid x_*, \theta_{ML})$ (whose form we already chose in the first step above). Note that this distribution doesn't account for uncertainty in estimating the parameters, unlike the Bayesian version below. But, one could construct frequentist prediction intervals (e.g. by bootstrapping).

Care may be needed when making inferences about individuals based on aggregated data (e.g. see various forms of ecological fallacy). It's possible that these issues may be mitigated to some extent here, since individual inputs are known, and only the outputs are aggregated (and parameters are assumed to be common to all individuals). But, I don't want to make any strong statements about this without thinking about it more carefully.

Bayesian inference

Alternatively, we may be interested in the posterior distribution over parameters:

$$p(\theta, \phi \mid \bar{y}, X) = \frac{1}{Z} p(\bar{y} \mid X, \theta, \phi) p(\theta, \phi)$$

where $Z$ is a normalizing constant. Note that this is based on the marginal likelihood, as above. It also requires that we specify a prior distribution over parameters $p(\theta, \phi)$. In some cases, it may be possible to find a closed form expression for the posterior. This requires an analytical solution to the integral in the marginal likelihood, as well as the integral in the normalizing constant. Otherwise, the posterior can be approximated, e.g. by sampling (as in MCMC) or variational methods.

Given a new individual input $x_*$, we can make predictions about the output $y_*$ using the posterior predictive distribution. This is obtained by averaging the predictive distributions for each possible choice of parameters, weighted by the posterior probability of these parameters given the training data:

$$p(y_* \mid x_*, X, \bar{y}) = \iint p(y_* \mid x_*, \theta) p(\theta, \phi \mid \bar{y}, X) d\theta d\phi$$

As above, approximations may be necessary.
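
One special case where no approximation is needed can be sketched as follows. It assumes the linear-Gaussian model of the example section, a known noise variance $\sigma^2$, and a zero-mean isotropic Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$; the prior, $\tau^2$, and the symbols $m$ and $S$ are assumptions made here for illustration. With $\bar{X}$ the matrix of group-mean inputs and $N = \text{diag}(n_1, \dots, n_k)$ the group sizes, the likelihood is $\mathcal{N}(\bar{y} \mid \bar{X} \beta, \sigma^2 N^{-1})$, and standard Gaussian conjugacy gives:

$$p(\beta \mid \bar{y}, X) = \mathcal{N}(\beta \mid m, S), \qquad S = \left( \frac{1}{\sigma^2} \bar{X}^T N \bar{X} + \frac{1}{\tau^2} I \right)^{-1}, \qquad m = \frac{1}{\sigma^2} S \bar{X}^T N \bar{y}$$

$$p(y_* \mid x_*, X, \bar{y}) = \mathcal{N}\left( y_* \ \Big| \ m \cdot x_*, \ \sigma^2 + x_*^T S x_* \right)$$

The predictive variance has two parts: the individual-level noise $\sigma^2$, and the term $x_*^T S x_*$ that reflects the remaining uncertainty about $\beta$.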

Example

Here's an example showing how to apply the above approach with a simple, linear model, similar to that described in the question. One could naturally apply the same techniques using nonlinear functions, more complicated noise models, etc.

Generating individual outputs

Let's suppose the unobserved individual outputs are generated as a linear function of the inputs, plus i.i.d. Gaussian noise. Assume the inputs include a constant feature (i.e. $X$ contains a column of ones), so we don't need to worry about an extra intercept term.

$$y_i = \beta \cdot x_i + \epsilon_i \quad \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

Therefore, $y = [y_1, \dots, y_n]^T$ has a Gaussian conditional distribution:

$$p(y \mid X, \beta, \sigma^2) = \mathcal{N}(y \mid X \beta, \sigma^2 I)$$

Generating aggregate group outputs

Suppose there are $k$ non-overlapping groups, and the $i$th group contains $n_i$ known points. For simplicity, assume we observe the mean output for each group:

$$\bar{y} = W y$$

where $W$ is a $k \times n$ weight matrix that performs averaging over individuals in each group. $W_{ij} = \frac{1}{n_i}$ if group $i$ contains point $j$, otherwise $0$. Alternatively, we might have assumed that observed group outputs are contaminated with additional noise (which would lead to a different expression for the marginal likelihood below).

Marginal likelihood

Note that $\bar{y}$ is a deterministic, linear transformation of $y$, and $y$ has a Gaussian conditional distribution. Therefore, the conditional distribution of $\bar{y}$ (i.e. the marginal likelihood) is also Gaussian, with mean $W X \beta$ and covariance matrix $\sigma^2 W W^T$. Note that $W W^T = \text{diag}(\frac{1}{n_1}, \dots, \frac{1}{n_k})$, which follows from the structure of $W$ above. Let $\bar{X} = W X$ be a matrix whose $i$th row contains the mean of the inputs in the $i$th group. Then, the marginal likelihood can be written as:

$$p(\bar{y} \mid X, \beta, \sigma^2) = \mathcal{N} \left( \bar{y} \ \Big| \ \bar{X} \beta, \ \sigma^2 \text{diag} \big( \frac{1}{n_1}, \dots, \frac{1}{n_k} \big) \right)$$

The covariance matrix is diagonal, so the observed outputs are conditionally independent. But, they're not identically distributed; the variances are scaled by the reciprocal of the number of points in each group. This reflects the fact that larger groups average out the noise to a greater extent.
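
The structure of $W$ can be checked numerically. The following is a minimal sketch on made-up data (the group sizes and variable names are assumptions, not part of the answer) that builds the averaging matrix for unequal group sizes and confirms that $W y$ gives the group means and that $W W^T$ is diagonal with entries $1/n_i$.

```r
# Hypothetical toy data: 3 groups of unequal size (sizes are an assumption).
set.seed(1)
n_j <- c(2, 3, 5)                          # group sizes n_1, ..., n_k
grp <- rep(seq_along(n_j), times = n_j)    # group index of each individual
n <- length(grp); k <- length(n_j)

# Averaging matrix: W[i, j] = 1 / n_i if individual j is in group i, else 0.
W <- matrix(0, nrow = k, ncol = n)
W[cbind(grp, seq_len(n))] <- 1 / n_j[grp]

y <- rnorm(n)
ybar <- as.vector(W %*% y)

# W y reproduces the per-group means, and W W^T equals diag(1 / n_i).
stopifnot(isTRUE(all.equal(ybar, as.vector(tapply(y, grp, mean)))))
stopifnot(isTRUE(all.equal(W %*% t(W), diag(1 / n_j))))
```

Because the groups do not overlap, different rows of $W$ have disjoint support, which is why the off-diagonal entries of $W W^T$ vanish.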

Maximum likelihood estimation

Maximizing the likelihood is equivalent to minimizing the following loss function, which was obtained by writing out the negative log marginal likelihood and then discarding constant terms:

$$\mathcal{L}(\beta, \sigma^2) = k \log(\sigma^2) + \frac{1}{\sigma^2} (\bar{y} - \bar{X} \beta)^T N (\bar{y} - \bar{X} \beta)$$

where $N = \text{diag}(n_1, \dots, n_k)$. From the loss function, it can be seen that the maximum likelihood weights $\beta_{ML}$ are equivalent to those obtained by a form of weighted least squares: regressing the group-average outputs $\bar{y}$ against the group-average inputs $\bar{X}$, with each group weighted by the number of points it contains.

$$\beta_{ML} = (\bar{X}^T N \bar{X})^{-1} \bar{X}^T N \bar{y}$$

The estimated variance is given by a weighted sum of the squared residuals:

$$\sigma^2_{ML} = \frac{1}{k} (\bar{y} - \bar{X} \beta_{ML})^T N (\bar{y} - \bar{X} \beta_{ML})$$
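
The two closed-form estimates can be sketched in R as follows. The simulated data, group sizes, and "true" parameter values are assumptions chosen for illustration; the cross-check relies on the fact that `lm()` with `weights` solves the same weighted least squares problem.

```r
# Simulated data; all sizes and true parameter values are assumptions.
set.seed(123)
n_j <- sample(20:60, size = 200, replace = TRUE)  # unequal group sizes
grp <- rep(seq_along(n_j), times = n_j)
n <- length(grp); k <- length(n_j)

X <- cbind(1, rnorm(n, mean = grp / 20, sd = 3))  # column of ones + one feature
beta_true <- c(3, 0.75); sigma_true <- 10
y <- as.vector(X %*% beta_true) + rnorm(n, sd = sigma_true)  # unobserved in practice

# Observed quantities: group-mean inputs and group-mean outputs.
Xbar <- rowsum(X, grp) / n_j
ybar <- as.vector(rowsum(y, grp)) / n_j

# beta_ML = (Xbar' N Xbar)^{-1} Xbar' N ybar, with N = diag(n_1, ..., n_k).
XtN <- t(Xbar * n_j)                       # equals Xbar' N
beta_ml <- as.vector(solve(XtN %*% Xbar, XtN %*% ybar))

# sigma^2_ML = (1 / k) * sum_i n_i * residual_i^2
res <- ybar - as.vector(Xbar %*% beta_ml)
sigma2_ml <- sum(n_j * res^2) / k

# Cross-check against weighted least squares via lm().
fit <- lm(ybar ~ Xbar[, 2], weights = n_j)
stopifnot(isTRUE(all.equal(unname(coef(fit)), beta_ml)))
print(beta_ml); print(sqrt(sigma2_ml))
```

Note that $\sigma^2_{ML}$ estimates the individual-level noise variance, not the variance of the group means, because each squared residual is multiplied by its group size.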

Prediction

Given a new input $x_*$, the conditional distribution for the corresponding individual output $y_*$ is:

$$p(y_* \mid x_*, \beta_{ML}, \sigma^2_{ML}) = \mathcal{N}(y_* \mid \beta_{ML} \cdot x_*, \sigma^2_{ML})$$

The conditional mean $\beta_{ML} \cdot x_*$ could be used as a point prediction.
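
A plug-in prediction for a new individual might look like the sketch below. The numeric values of `beta_ml` and `sigma2_ml` are hypothetical stand-ins for fitted estimates, and the interval ignores uncertainty in those estimates, as noted in the maximum likelihood section.

```r
# Hypothetical fitted values standing in for beta_ML and sigma^2_ML.
beta_ml   <- c(3.1, 0.74)
sigma2_ml <- 98
x_new <- c(1, 12)                          # constant feature first, then the input

# Point prediction and plug-in 95% interval for an individual output.
y_hat <- sum(beta_ml * x_new)
pred_interval <- y_hat + c(-1, 1) * qnorm(0.975) * sqrt(sigma2_ml)

# If x_new were instead the mean input of a new group of size n_new,
# the noise variance of that group's mean would shrink to sigma^2 / n_new.
n_new <- 25
group_interval <- y_hat + c(-1, 1) * qnorm(0.975) * sqrt(sigma2_ml / n_new)
```

The individual interval is wider than the group interval by a factor of $\sqrt{n_{new}}$, which is the translation from group-level to individual-level variability that the question asks about.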

References

Machine learning: A probabilistic perspective (Murphy 2012). I don't recall that it speaks specifically about aggregated data, but, it covers concepts related to latent variable models quite well.

Could you give an example? For example, how would it differ from the linear regression model like $\bar y_{j[i]} = \alpha + \beta x_i + \varepsilon_i$? How the likelihood would be defined in such case (single predictor)? What would be the advantage over linear regression? — Tim, Oct 09 '19 at 09:40
@Tim I've edited to include the example, and address questions in your comment — user20160, Oct 09 '19 at 23:50
Thanks for the example, but I'm a little bit confused. Wouldn't this be equivalent of running standard linear regression? Consider minimizing SSE for your model it would be $\sum_j ((\sum_{i\in j[i]} \hat y_{j[i]} / n_{j[i]}) - \bar y_j)^2$ vs $\sum_j (\sum_{i\in j[i]} (\hat y_{j[i]} - \bar y_j))^2$ with standard linear regression, so the losses would differ only with dividing by constant $n_{j[i]}$, it should not make any difference. Please correct me if I'm wrong. — Tim, Oct 10 '19 at 07:14
@Tim I don't think they're equivalent. Conceptually, they're trying to predict different quantities. If I understand correctly what you mean by standard linear regression, $\beta \cdot x_i$ gives a prediction for $\bar{y}_{j[i]}$, the average output for the group containing $x_i$ (the individual input). In my example, $\beta \cdot x_i$ gives a prediction for $y_i$, the individual output. To predict the average group output, one would predict $y_i$ for each group member, then take the average. (continued...) — user20160, Oct 10 '19 at 12:39
To find $\beta$, standard linear regression minimizes $\sum_j \sum_{i \in j[i]} (\bar{y}_j - \beta \cdot x_i)^2$. In my example, the quantity minimized is $\sum_j n_j \left( \bar{y}_j - \frac{1}{n_j} \sum_{i \in j[i]} \beta \cdot x_i \right)^2$. For some simple toy data, the example method recovers the weights more accurately than standard linear regression, and has lower prediction error. But, please let me know if I've misunderstood what you meant by standard linear regression. — user20160, Oct 10 '19 at 12:39
I think the issue with standard linear regression is that it treats the average group outputs as if they were individual outputs, and doesn't account for the fact that they're averages (which reduces variability, as you mentioned). — user20160, Oct 10 '19 at 12:54
Yes, I tested it on artificial data and you are right. I still can't see what I'm missing, but understand your point. Thanks! — Tim, Oct 10 '19 at 13:19
@Tim The following visualization might be interesting. First plot $y_i$ vs. $x_i$ (true individual outputs vs. inputs, for simulated 1d data). Assign points to groups at random. Then, plot $\bar{y}_j$ vs. $\bar{x}_j$ (average group outputs vs. average group inputs). It should look like a smaller point cloud with similar slope to the first plot. The example method is essentially trying to fit this second point cloud (with a particular weighting scheme). (continued...) — user20160, Oct 10 '19 at 14:38
Now, plot $\bar{y}_{j[i]}$ vs. $x_i$ (average group outputs vs. individual inputs). Regular linear regression is trying to fit this third point cloud. But, it should look haphazardly distributed around a nearly horizontal line. This is because there's no structure in the group assignments, so each $\bar{y}_j$ is close to the global mean, and each $x_i$ is not strongly related to $\bar{y}_{j[i]}$. — user20160, Oct 10 '19 at 14:38
Things improve with more structured group assignments. In the best case, groups contain contiguous inputs. Here, the third point cloud should look like a staircase with slope similar to the original point cloud. And, the performance of regular linear regression should improve. — user20160, Oct 10 '19 at 14:38
@Tim glad to help. Good question as well--it was fun thinking about this with you — user20160, Oct 10 '19 at 14:46
@Tim Interesting notebook. Maybe it could be edited into the question (or answer) in case others find it useful? — user20160, Oct 10 '19 at 14:51

Tim · Answer 2 · 2019-10-11T19:25:35.207

To verify the solution suggested in the great answer by @user20160 I prepared a toy example that demonstrates it. As suggested by @user20160, I am posting the code as a supplement to the answer. For explanations of this approach, check the other answer.

First, let's generate the independent variable and append the column of ones to it, to use matrix formulation of the model.

set.seed(42)
n <- 5000; k <- 50; m <- n/k

x <- rnorm(n, mean = (1:n)*0.01, sd = 10)
X <- cbind(Intercept=1, x)

Next, let's generate the individual outputs $y = X\beta + \varepsilon$.

beta <- rbind(3, 0.75)
sigma <- 10
y <- rnorm(n, X %*% beta, sigma)

To aggregate the results, we use the matrix $W$ of zeros and ones to indicate group membership of size $k \times n$. To estimate group means, we take $\bar y = \tfrac{1}{m}W y$ (same results as tapply(y, grp, mean)).

grp <- factor(rep(1:k, each=m))
W <- t(model.matrix(~grp-1))
ybar <- as.vector((W/m) %*% y)

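As expected, the conditional variability of $\bar y$ is much smaller than that of $y$.

Next, define two loss functions. Here mu holds the group means and mu_rep repeats each group mean once per group member, so the regular regression treats the group mean as every member's target, while the "aggregated" model averages the individual predictions within each group before comparing them with the group means.

mu <- ybar
mu_rep <- rep(mu, each = m)
lm_loss <- function(pars) mean((mu_rep - as.vector(X %*% pars))^2)
aggr_loss <- function(pars) mean((mu - as.vector((W/m) %*% (X %*% pars)))^2)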

Results from the regular regression model are pretty poor.

init <- rbind(0, 0)
(est1 <- optim(init, lm_loss))$par
##          [,1]
## [1,] 9.058655
## [2,] 0.502987

The "aggregated" model gives results that are really close to the true values of $\beta$.

(est2 <- optim(init, aggr_loss))$par
##           [,1]
## [1,] 3.1029468
## [2,] 0.7424815

Plotting the fitted lines against the data also shows that, even though the targets were aggregated, the "aggregated" model recovers the true regression line almost perfectly.

Also if we compare mean squared error of predictions for the individual values given the estimated parameters, the "aggregated" model has smaller squared error.

mean((y - as.vector(X %*% est1$par))^2)
## [1] 119.4491
mean((y - as.vector(X %*% est2$par))^2)
## [1] 101.4573

The same thing happens if we minimize the negative log-likelihood. Additionally, this lets us estimate $\sigma$, and it also gives a much better result (43.95 for linear regression vs 8.02 for the "aggregated" model).

lm_llik <- function(pars) -1 * sum(dnorm(mu_rep, as.vector(X %*% pars[1:2]), pars[3]/sqrt(k), log=TRUE))
aggr_llik <- function(pars) -1 * sum(dnorm(mu, as.vector((W/m) %*% (X %*% pars[1:2])), pars[3]/sqrt(k), log=TRUE))
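
One possible way to run such a fit is sketched below. It is not the code that produced the numbers above: it follows the marginal likelihood from the accepted answer, where a group mean of m individuals has standard deviation $\sigma/\sqrt{m}$ (group size, rather than the number of groups k), so its estimate of $\sigma$ need not match the value reported above. The log-parameterization of $\sigma$, the starting values, and the use of BFGS are choices made here for illustration.

```r
set.seed(42)
n <- 5000; k <- 50; m <- n / k             # same sizes as in the toy example
x <- rnorm(n, mean = (1:n) * 0.01, sd = 10)
X <- cbind(Intercept = 1, x)
y <- rnorm(n, X %*% c(3, 0.75), 10)
grp <- rep(1:k, each = m)
Xbar <- rowsum(X, grp) / m                 # group-mean inputs
ybar <- as.vector(rowsum(y, grp)) / m      # group-mean outputs

# Negative log marginal likelihood; pars = (intercept, slope, log sigma).
# A group mean of m individuals has standard deviation sigma / sqrt(m).
nll <- function(pars) {
  -sum(dnorm(ybar, as.vector(Xbar %*% pars[1:2]), exp(pars[3]) / sqrt(m), log = TRUE))
}
init <- c(mean(ybar), 0, log(sd(ybar) * sqrt(m)))
fit <- optim(init, nll, method = "BFGS")
beta_hat <- fit$par[1:2]; sigma_hat <- exp(fit$par[3])

# Closed-form check: with equal group sizes, beta is OLS on the group means.
beta_ols <- unname(coef(lm(ybar ~ Xbar[, 2])))
print(rbind(beta_hat, beta_ols)); print(sigma_hat)
```

Optimizing over $\log \sigma$ keeps the standard deviation positive without needing box constraints.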

score 1 · Answer 3 · answered Oct 07 '19 at 17:36

Different approaches could be appropriate depending on your goal. I'll describe one approach in case your goal is group-level prediction.

You could use the individual-level features to build a bunch of aggregated features for each group (mean, std, median, max, min, ...). You now have richer features for each group which are likely to perform well on the group level. I've seen this work thousands of times in Kaggle competitions. Also, don't stick to linear regression, gradient boosting works in many cases with tabular data, and can even help you weed out some features (make lots of them, you never know what will work).

As a bonus, this also gives you a way of predicting individual scores by feeding the model a group of one (this feels a little shady though).
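
The feature-building step described above could be sketched like this. The data frame, its column names, and the particular set of summaries are hypothetical, chosen only to illustrate turning student-level rows into one row of features per class.

```r
# Hypothetical student-level data: class id and one feature (IQ); names are assumptions.
set.seed(7)
students <- data.frame(class = rep(1:4, times = c(20, 25, 18, 30)))
students$iq <- rnorm(nrow(students), mean = 100, sd = 15)

# Summary features computed from one class's student-level values.
summarise_class <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v),
    min = min(v), max = max(v), size = length(v))
}

# One row of aggregated features per class.
per_class <- lapply(split(students$iq, students$class), summarise_class)
class_features <- as.data.frame(do.call(rbind, per_class))
class_features$class <- as.integer(rownames(class_features))
print(class_features)
```

The resulting table has one row per class and can be joined to the observed class-average scores before fitting any group-level regressor.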

Answered by Bananin, Oct 07 '19 at 17:36

Why would using aggregate features work better than using raw data? Raw data provides much richer information, unless I'm missing something about your answer. – Tim Oct 07 '19 at 18:28
I'm assuming you want to predict for groups of different sizes, so this avoids the problem of having dependent variables with different lengths. On the other hand, the students don't have any particular order, so having variables 'student1score', 'student2score',.. will give your model the confusing task of figuring out how to use each of those differently, when in practice they mean the same to us. Some raw data could be used in the form of ordered scores: 'best_score', '2nd_best', ... provided that there is a minimum group size. Notice that aggregates like 'max' and 'min' have this spirit – Bananin Oct 07 '19 at 19:34
Why would you store the values in columns? It makes more sense to keep them in rows. As you noticed, in columns both the ordering and the varying number of values become problematic. In rows, you would need either a model that extracts the group-level information (a mixed-effects model), or some kind of aggregation of individual predictions (rows per group). With aggregate features you lose a lot of information. – Tim Oct 07 '19 at 21:48
I agree that rows would be better, but then rows would have to be students, and we don't have individual scores. I also agree that there is information loss in aggregating. However, as we know, machine learning is an empirical practice; I would suggest that you try fitting some reasonable model (maybe this one) and let the results guide your process: if training error turns out unacceptably high, you might need a more complex model. My guess is you will get a tight fit if you add enough aggregates. – Bananin Oct 08 '19 at 15:11
but we have the individual-level features. Moreover the point is what can we do with such data about predicting the individual scores? – Tim Oct 08 '19 at 16:13
