I would like to know some general parameters that can be used to describe how "dirty" the data is.
Issues I am having are the following:
- Lots of missing values;
- The values of some predictors are filled in but are often completely wrong;
- I can try to extract additional variables from a long text string, but doing so raises doubts about the observations that lack this text string.
Is there a general approach to quantifying how dirty the data is?
I was thinking of something like a Sankey diagram that visualises what fraction of the data remains usable as each predictor is taken into account.
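To make the question concrete, here is a minimal sketch (with made-up column names and data) of the kind of summary I have in mind: per-predictor missingness, plus the complete-case fraction, i.e. the share of rows that survive when every predictor must be present:

```python
# Hypothetical sketch: quantify "dirtiness" as per-column missingness
# and the complete-case fraction. The DataFrame is fabricated for
# illustration only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, np.nan, 31],
    "income": [50000, 60000, np.nan, np.nan, 45000],
    "notes":  ["long text", None, "short", "long text", None],
})

# Fraction of missing values per predictor
missing_per_col = df.isna().mean()

# Fraction of rows usable when all predictors are required at once
complete_case_fraction = len(df.dropna()) / len(df)

print(missing_per_col)
print(complete_case_fraction)  # 0.2 here: only one fully observed row
```

This only captures missingness, not the "filled in but wrong" problem, which is part of why I am asking whether a more general measure exists.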