Is there a way to build a regression model for continuous output using aggregate data instead of individual data points when all input variables are categorical?
I have a moderately large dataset (a few million rows). All my predictor variables are categorical or binary. I have two outcome variables: one binary and one continuous. For the binary outcome I am using logistic regression. My R code is as follows:
rawdata <- readRDS(file = "mvt_week2.rds")
system.time(m <- glm(y ~ F1 + F2 + F3, data = rawdata, family=binomial))
# 900 seconds
If I aggregate the data first and then fit the model, it's almost instant.
library(dplyr)
sumdata <- rawdata %>% group_by(F1,F2,F3) %>% summarize(y1 = sum(y),Visits = n())
system.time(agg_m <- glm(y1/Visits ~ F1+ F2 + F3, data = sumdata, family=binomial, weights = Visits))
# 0.05 seconds
The fitted coefficients and their standard errors are exactly the same, and the aggregated fit saves a lot of time and requires much less memory and computing power.
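To make the comparison concrete, here is a small check on simulated data. The data frame `sim`, its factor levels, and the effect sizes are made up for illustration; only the two `glm` calls mirror the code above. It is a sketch of what matches between the two fits and what does not, not a statement about the real dataset.

```r
# Simulated stand-in for the raw data (all names and values are hypothetical).
set.seed(1)
n <- 20000
sim <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  F2 = factor(sample(c("x", "y"), n, replace = TRUE)),
  F3 = factor(sample(c("p", "q"), n, replace = TRUE))
)
eta <- -1 + 0.5 * (sim$F1 == "b") - 0.3 * (sim$F2 == "y") + 0.2 * (sim$F3 == "q")
sim$y <- rbinom(n, 1, plogis(eta))

# Fit on the individual rows.
m_raw <- glm(y ~ F1 + F2 + F3, data = sim, family = binomial)

# Aggregate to one row per cell: number of successes and number of rows.
sim$Visits <- 1
agg <- aggregate(cbind(y, Visits) ~ F1 + F2 + F3, data = sim, FUN = sum)

# Fit on the cell proportions, weighted by cell size.
m_agg <- glm(y / Visits ~ F1 + F2 + F3, data = agg,
             family = binomial, weights = Visits)

# Coefficients and standard errors agree up to convergence tolerance.
coef_match <- isTRUE(all.equal(coef(m_raw), coef(m_agg), tolerance = 1e-4))
se_match <- isTRUE(all.equal(sqrt(diag(vcov(m_raw))),
                             sqrt(diag(vcov(m_agg))), tolerance = 1e-4))

# Deviance and residual degrees of freedom do not agree.
c(deviance(m_raw), deviance(m_agg))
c(df.residual(m_raw), df.residual(m_agg))
```

In this sketch the coefficients and standard errors coincide, while the residual deviance and residual degrees of freedom differ, so goodness-of-fit summaries from the two fits are not interchangeable.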
My question: this works for a logistic model, but is there a way to make it work for a continuous outcome variable? If I calculate the mean and standard deviation within each group, can I feed those into a model?
A good explanation for the case where the outcome takes a finite set of values is given in "Standard errors in weighted least squares on aggregated data", but is there an approach that works for continuous values? Please note that I do have the raw data, so I can compute any statistic on it when aggregating.
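One possible approach for the continuous outcome, sketched here under the assumption of an ordinary least squares model with constant error variance: keep the count, the sum, and the sum of squares of the outcome in each cell, fit a weighted `lm` on the cell means, and then rebuild the residual variance from the within-cell sums of squares. The simulated data and the names `z`, `z_sq`, `n_obs`, and `ss_within` are hypothetical.

```r
# Simulated stand-in for the raw data (all names and values are hypothetical).
set.seed(2)
n <- 20000
sim <- data.frame(
  F1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  F2 = factor(sample(c("x", "y"), n, replace = TRUE)),
  F3 = factor(sample(c("p", "q"), n, replace = TRUE))
)
sim$z <- 2 + 1.5 * (sim$F1 == "b") - 0.7 * (sim$F2 == "y") + rnorm(n)

# Reference fit on the individual rows.
fit_raw <- lm(z ~ F1 + F2 + F3, data = sim)

# Per-cell sufficient statistics: sum of z, sum of z^2, and row count.
sim$z_sq <- sim$z^2
sim$n_obs <- 1
agg <- aggregate(cbind(z, z_sq, n_obs) ~ F1 + F2 + F3, data = sim, FUN = sum)
agg$z_mean <- agg$z / agg$n_obs
agg$ss_within <- agg$z_sq - agg$z^2 / agg$n_obs  # within-cell sum of squares

# Weighted least squares on the cell means reproduces the coefficients.
fit_agg <- lm(z_mean ~ F1 + F2 + F3, data = agg, weights = n_obs)

# The aggregated fit only sees between-cell lack of fit, so its own sigma
# is not the raw-data sigma. Add the within-cell part back in.
rss_between <- sum(agg$n_obs * residuals(fit_agg)^2)
df_raw <- sum(agg$n_obs) - length(coef(fit_agg))
sigma2_raw <- (sum(agg$ss_within) + rss_between) / df_raw

# Rescale the unscaled covariance (X'WX)^{-1} by the recovered variance.
se_fixed <- sqrt(diag(sigma2_raw * summary(fit_agg)$cov.unscaled))

coef_match <- isTRUE(all.equal(coef(fit_raw), coef(fit_agg), tolerance = 1e-6))
se_match <- isTRUE(all.equal(se_fixed, sqrt(diag(vcov(fit_raw))),
                             tolerance = 1e-6))
```

The standard errors printed by `summary(fit_agg)` itself would generally not match the raw-data fit, because they use only the between-cell residuals and far fewer degrees of freedom; the correction above is what brings them into agreement in this sketch. Computing the within-cell sum of squares as `z_sq - z^2 / n_obs` can lose precision when the outcome has a large mean relative to its spread, in which case centering the outcome before aggregating may help.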
glm.) As a result, all measures of spread and variation, as well as associated tests, ought to change too. The Ecological Fallacy is the mistake of supposing that regression based on aggregated data can be interpreted as a regression on the original data. – whuber Aug 21 '15 at 12:55