I have a dataset on avoided maritime accidents (near misses) that looks like this:
[dataset preview missing]
All variables are categorical (type = 1-3, position = 1-5, area = 1-5, risk = 1-7). There are 4 columns and 525 rows, each row describing one near miss, and I encoded them in R.
From my own and others' experience, I know that accident avoidance data are often fabricated to satisfy bureaucratic forms ahead of future inspections. If there are no avoided accidents, almost every company expects you to invent and report some anyway.
Before I do any analysis of this dataset, I would like to test whether a significant share of it is fabricated.
I am familiar with Benford's law (as used to detect fraud in economics and finance), and I am interested in the following:
- How to use Benford's law when it comes to categorical variables?
- Are there other (statistical or ML) ways to detect fabricated records in raw data?
- If not, which methods do you recommend for analyzing this dataset and finding any structure in it?
If you had observations clearly known to be fraudulent and not fraudulent, there's more you could do (e.g. train a logistic regression on the labelled data).
– Matthew Gunn Jan 02 '21 at 17:32
Benford's law isn't some magical incantation applicable any time you see a number. Speaking loosely, the leading digit of a random variable $X$ may follow Benford's law when the distribution of $\log X$ is uniform over an appropriately wide interval (or in other related situations). This can arise when there's exponential growth (e.g. revenues growing by some random percentage each month).
How is that in any way related to categorical variables (e.g. equipment, personal injury, ...) where you don't even have numbers?
– Matthew Gunn Jan 02 '21 at 19:31
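To make the point of the last comment concrete, here is a minimal Python sketch (not part of the original thread, and in Python rather than the R mentioned in the question). It compares leading-digit frequencies for a simulated variable whose $\log_{10} X$ is uniform over six decades with those of simulated category codes 1-7. The sample size, the seed, the six-decade range, and the uniform draw of category codes are all illustrative assumptions, not properties of the near-miss dataset.

```python
import math
import random
from collections import Counter


def leading_digit(x):
    """Return the first significant decimal digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def benford_expected(d):
    """Benford's law probability that the leading digit equals d (1..9)."""
    return math.log10(1 + 1 / d)


def digit_frequencies(values):
    """Observed relative frequency of each leading digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 100_000             # illustrative sample size

# Assumption: log10(X) is uniform on [0, 6), i.e. X spans six full decades.
log_uniform = [10 ** rng.uniform(0, 6) for _ in range(n)]

# Assumption: hypothetical category codes 1..7 drawn uniformly. These are
# labels, not magnitudes, so their "leading digit" is just the code itself.
category_codes = [rng.randint(1, 7) for _ in range(n)]

freq_log = digit_frequencies(log_uniform)
freq_cat = digit_frequencies(category_codes)

print("digit  Benford  log-uniform  category codes")
for d in range(1, 10):
    print(f"{d:5d}  {benford_expected(d):7.3f}  "
          f"{freq_log[d]:11.3f}  {freq_cat[d]:14.3f}")
```

In this simulation the log-uniform column should track the Benford probabilities closely, while the category codes cannot: digits 8 and 9 never occur, and relabelling the categories would change the "digits" without changing the data. That is one way to see why Benford's law has no natural counterpart for categorical variables.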