I would like to know some general parameters that can be used to describe how "dirty" the data is.
Issues I am having are the following:
- Lots of missing values;
- The values of some predictors are filled in but are often completely wrong;
- I can try to extract additional variables from a long text string, but doing so raises doubts about the observations that lack this text string.
Is there a general approach to quantifying how dirty the data is?
I was thinking of something like a Sankey diagram that visualises what fraction of the data remains usable as each predictor is taken into account.
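To make the question concrete, here is a minimal sketch (with made-up column names and data) of the kind of summary I have in mind: per-predictor missingness, plus the complete-case fraction, i.e. the share of rows that survive when every predictor must be present:

```python
# Hypothetical sketch: quantify "dirtiness" as per-column missingness
# and the complete-case fraction. The DataFrame is fabricated for
# illustration only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, np.nan, 31],
    "income": [50000, 60000, np.nan, np.nan, 45000],
    "notes":  ["long text", None, "short", "long text", None],
})

# Fraction of missing values per predictor
missing_per_col = df.isna().mean()

# Fraction of rows usable when all predictors are required at once
complete_case_fraction = len(df.dropna()) / len(df)

print(missing_per_col)
print(complete_case_fraction)  # 0.2 here: only one fully observed row
```

This only captures missingness, not the "filled in but wrong" problem, which is part of why I am asking whether a more general measure exists.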