Suppose I have a data set containing every tax return lodged with a particular tax authority. Suppose I would like to allow people to study trends and patterns but not see any individual's tax return information.
I can produce a lot of aggregate statistics which don't necessarily identify any individual or disclose any information about their tax affairs. A simple one might be the average taxable income for various age groups. A more complicated one might be the number of individuals in each age group where the claim for interest deductions is above a certain threshold.
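To make the two examples concrete, here is a minimal sketch of them as GROUP BY queries, run through Python's built-in sqlite3 module. The table name `returns`, its columns, and the toy rows are all made up for illustration; a real tax data set would look nothing like this.

```python
import sqlite3

# Hypothetical toy data: (age_group, taxable_income, interest_deduction).
ROWS = [
    ("18-29", 40000, 0),
    ("18-29", 52000, 1500),
    ("30-44", 80000, 6000),
    ("30-44", 95000, 9000),
    ("30-44", 61000, 200),
    ("45-64", 120000, 12000),
]


def build_db(rows):
    """Load the toy rows into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE returns ("
        "age_group TEXT, taxable_income REAL, interest_deduction REAL)"
    )
    conn.executemany("INSERT INTO returns VALUES (?, ?, ?)", rows)
    return conn


def average_income_by_age(conn):
    """Simple aggregate: average taxable income per age group."""
    return conn.execute(
        "SELECT age_group, AVG(taxable_income) "
        "FROM returns GROUP BY age_group ORDER BY age_group"
    ).fetchall()


def count_large_interest_claims(conn, threshold):
    """More complicated aggregate: people per age group whose
    interest deduction claim exceeds the threshold."""
    return conn.execute(
        "SELECT age_group, COUNT(*) FROM returns "
        "WHERE interest_deduction > ? "
        "GROUP BY age_group ORDER BY age_group",
        (threshold,),
    ).fetchall()


conn = build_db(ROWS)
averages = average_income_by_age(conn)
large_claims = count_large_interest_claims(conn, 5000)
```

Even in this toy data the 45-64 group contains a single person, so its "average" is exactly that person's taxable income, which is the kind of disclosure I am worried about.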
My understanding is that, when aggregates are released publicly by government authorities, they are first checked by statisticians who de-identify the data. The most detailed information I can find online is about censuses, but presumably the same techniques apply to tax statistics. I have heard it called 'statistical disclosure control'. Techniques include blanking out cells that contain only a few individuals, applying random adjustments to numbers, and 'swapping' records between groups. Statisticians are expensive, and requiring a human to vet every table rules out online querying.
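As a rough sketch of what two of these techniques might look like if applied mechanically to a table of counts: the function names, the minimum cell size of 5, and the rounding base of 3 below are my own assumptions, not anything a particular authority is known to use. Random rounding is just one possible form of "random adjustment", and record swapping is not shown.

```python
import random


def suppress_small_cells(counts, min_count=5):
    """Blank out (set to None) any cell with fewer than min_count people.

    min_count=5 is an assumed threshold chosen for illustration.
    """
    return {g: (n if n >= min_count else None) for g, n in counts.items()}


def random_round(n, base=3, rng=random):
    """Randomly round n to an adjacent multiple of base.

    The chance of rounding up is remainder/base, so the expected
    value of the result equals n (an unbiased adjustment).
    """
    remainder = n % base
    if remainder == 0:
        return n
    lower = n - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


def protect(counts, min_count=5, base=3, seed=None):
    """Suppress small cells, then randomly round what is left."""
    rng = random.Random(seed)
    suppressed = suppress_small_cells(counts, min_count)
    return {
        g: (None if n is None else random_round(n, base, rng))
        for g, n in suppressed.items()
    }


# Hypothetical counts per age group.
raw = {"18-29": 2, "30-44": 17, "45-64": 9}
released = protect(raw, seed=1)
```

Here the 18-29 cell would be blanked, 45-64 is already a multiple of 3 and stays at 9, and 30-44 comes out as either 15 or 18. Whether rules this simple are enough for an arbitrary query is exactly what I am unsure about.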
Is it possible to do this automatically, that is, to run a program that de-identifies arbitrary aggregate data? Or do you need to examine each table and decide which techniques to use based on the characteristics of that data?
What I have in mind (though I don't know if it is feasible) is being able to ask pretty much any question that boils down to an SQL query containing a GROUP BY clause and get an answer without breaching any individual's privacy.