
Suppose I have a data set containing every tax return lodged with a particular tax authority. Suppose I would like to allow people to study trends and patterns but not see any individual's tax return information.

I can produce a lot of aggregate statistics which don't necessarily identify any individual or disclose any information about their tax affairs. A simple one might be the average taxable income for various age groups. A more complicated one might be the number of individuals in each age group where the claim for interest deductions is above a certain threshold.
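For concreteness, those two examples would correspond to queries along these lines (the tax_returns table and its column names are invented purely for illustration):

    -- Average taxable income for each age group
    SELECT age_group, AVG(taxable_income) AS avg_taxable_income
    FROM tax_returns
    GROUP BY age_group;

    -- Number of individuals per age group claiming interest deductions
    -- above some threshold (10000 here is an arbitrary example value)
    SELECT age_group, COUNT(*) AS n_individuals
    FROM tax_returns
    WHERE interest_deduction > 10000
    GROUP BY age_group;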

My understanding is that, when aggregates are released publicly by government authorities, they are first checked by statisticians who de-identify the data. The most detailed information I can find online is about censuses, but presumably the same techniques apply to tax statistics. I have heard it called 'statistical disclosure control'. Techniques include blanking out cells with only a few individuals in them, applying random adjustments to numbers, and 'swapping' records between groups. Statisticians are expensive and the requirement for human vetting precludes online querying.
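The first of those techniques is the easiest one for me to picture as an automatic rule. A rough sketch of what I mean, again with made-up table and column names, would be to refuse to publish any cell supported by fewer than some minimum number of individuals:

    -- Suppress ('blank out') cells with fewer than 5 contributors
    SELECT age_group, AVG(taxable_income) AS avg_taxable_income, COUNT(*) AS n
    FROM tax_returns
    GROUP BY age_group
    HAVING COUNT(*) >= 5;

The other techniques (random adjustment, record swapping) don't seem to reduce to a single query quite so neatly, which is part of what makes me wonder whether the whole process can be automated.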

Is it possible to do this automatically, that is, run a program that de-identifies any arbitrary aggregate data? Or do you need to examine each table and decide what techniques to use based on the characteristics of that data?

What I have in mind (though I don't know if it is feasible) is to be able to ask pretty much any question that boils down to an SQL query containing a GROUP BY clause, and to get an answer without breaching any individual's privacy.

1 Answer


I don't think this is possible. The issue is that all kinds of domain- and context-specific tricks can be used to identify people in a dataset. You can automatically protect against certain classes of identifiability exploits, such as small cells in contingency tables, but unless you have a smart statistician actually try to identify people and then patch up any such vulnerabilities he's found, you have no real protection against another smart statistician who wants to identify people.

The analogy to computer security is clear. Programs can be written or supervised during execution to automatically resist certain well-known exploits, such as buffer overflows, but ultimately, you need a smart human security expert if you want protection against smart human adversaries, because security holes are infinite in variety.

Kodiologist
  • Thanks for this. I have a rough idea of computer security, and the difficulty of protecting against all possible exploits which you don't know you don't know about. In that domain, I know where/how to look for examples of exploits. However, I don't know about any statistics exploits apart from 'there is only one person in that town so it must be Elvis' income'. Do you have any pointers for where/how I can find out more about statistics exploits? – Patrick Conheady Jun 22 '16 at 13:48
  • Not really. You could look for articles or books on deidentification, but I'm not familiar with any myself. You could create a new question asking for pointers to literature on this. Be sure to specify whether you're looking for a general overview or for help with a particular dataset or kind of data. – Kodiologist Jun 22 '16 at 17:49