
The context is that I am interested in whether class size predicts test results. I have each individual's test results and each individual's class size. I've been warned against simply calculating the average test result for each class (thus making a new variable, class_test_average) and then using class_size to predict class_test_average. I've been informed that if I do that I may have a problem with "aggregation bias" and "the ecological fallacy". However, these concepts were explained to me in a somewhat hand-waving way. I have grasped that the ecological fallacy concerns inferring that relationships at the macro level will translate into the same relationships at the micro level. However, I didn't understand aggregation bias at all.

This isn't practically a serious issue for me as I was planning to do multilevel modeling anyway, which I guess will avert both aggregation bias and the ecological fallacy. However, I am curious about what aggregation bias actually means. There is no Wikipedia article which speaks to this issue, and googling turns up all sorts of definitions. However, I think the classic citation in this area is James (1982).

To me, the term bias indicates that by aggregation I should be systematically pushing the results to either overestimate or underestimate the size of relationships. However, it's not clear to me that that actually happens.

James, L. R. (1982). Aggregation bias in estimates of perceptual agreement. Journal of Applied Psychology, 67(2), 219.

1 Answer


From Clark and Avery (1976):

It has long been known that the use of aggregate data may yield correlation coefficients exhibiting considerable bias above their values at the individual level [10, 21]; and Blalock [2] has shown that the regression coefficients may be biased also. It is well established that it is incorrect to assume that relationships existing at one level of analysis will necessarily demonstrate the same strength at another level. The estimates derived from aggregate data are valid only for the particular system of observational units employed. The consequences of using potentially biased estimates of the correlation and regression coefficients as substitutes for the “true” microlevel estimates are most serious in terms of the causal inferences to be drawn from statistical analyses

And just a bit later in the paper, on how aggregation bias and the ecological fallacy are related (bold is mine):

Probably the most serious disadvantage of using aggregate data is the inherent difficulty of making valid multilevel inferences based on a single level of analysis [1]. Alker has identified three types of erroneous inferences that may appear should a researcher attempt to generalize from one level of investigation to another. The individualistic fallacy is the attempt to impute macrolevel (aggregate) relationships from microlevel (individual) relationships. It is the classic aggregation problem first examined by economists, and according to Hannan [15, p. 5] it concerns attempts to group observations on ‘behavioral units’ so as to investigate economic relationships holding for sectors or total economies.” Cross-level fallacies can occur when one makes inferences from one subpopulation to another at the same level of analysis. The ecological fallacy, so named from the work of Robinson [18], is the opposite of the individualistic fallacy and involves making inferences from higher to lower levels of analysis. Robinson demonstrated that there was not necessarily a correspondence between individual and ecological correlations, and that generally the latter would be larger than the former. Although the ecological fallacy has been widely discussed and publicized, it is still a common error in studies involving causal inference.

Clark, W. A., & Avery, K. L. (1976). The effects of data aggregation in statistical analysis. Geographical Analysis, 8(4), 428-438.

The paper is available as a PDF on Google Scholar (not linking because links break).
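To make the direction of the bias concrete, here is a small simulation sketch. It is my own illustration, not taken from Clark and Avery, and the data-generating process is an assumption: each group shares a group-level component of a predictor x, and the outcome y depends weakly on x plus a lot of individual noise. The names `simulate` and `pearson` and all parameter values are hypothetical. Averaging within groups cancels most of the individual noise, so the correlation between group means comes out larger than the individual-level correlation, which is the pattern Robinson described.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_groups=200, group_size=30, slope=0.5, noise_sd=3.0, seed=0):
    """Return (individual-level r, group-mean r) for simulated data.

    Assumed model (illustrative only):
      x_ij = group_level_j + individual deviation
      y_ij = slope * x_ij + individual noise
    """
    rng = random.Random(seed)
    all_x, all_y = [], []
    mean_x, mean_y = [], []
    for _ in range(n_groups):
        group_level = rng.gauss(0, 1)  # component shared by the whole group
        gx = [group_level + rng.gauss(0, 1) for _ in range(group_size)]
        gy = [slope * x + rng.gauss(0, noise_sd) for x in gx]
        all_x.extend(gx)
        all_y.extend(gy)
        # Aggregation step: one (mean x, mean y) point per group
        mean_x.append(mean(gx))
        mean_y.append(mean(gy))
    return pearson(all_x, all_y), pearson(mean_x, mean_y)


r_individual, r_group = simulate()
print(f"individual-level r: {r_individual:.2f}")
print(f"group-mean r:       {r_group:.2f}")
```

In this particular setup the regression slope is the same at both levels in expectation, so only the correlation is inflated. Slopes can be biased as well when group membership is related to the outcome in other ways, which this sketch does not model.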

HFBrowning