
When preparing data for analysis, I often encounter issues such as outliers, logically inconsistent entries (e.g. age=150 or age=-2), duplicates that are not exactly equal, etc. When integrating data sets from different sources, there might not be a unique identifier, or entries may follow different standards (e.g. "US" vs. "United States"), which makes matching hard.

What are the issues you experience most commonly when working with data?

Elisabeth
    This question is very broad, probably too broad for a question and answer format. Can you be more specific? –  Aug 16 '16 at 10:14

2 Answers


There are numerous instances where data needs to be cleansed and standardised. To name a few:

1) Mandatory fields with NULL values - this is the greatest source of data loss.

Remedy: go back to the business for the correct values, or impute them, e.g. with the mean.
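
For illustration, a rough pandas sketch (the frame and the age column are just made-up placeholders):

    import pandas as pd

    # Toy frame with a mandatory numeric field that has gaps
    df = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                       "age": [34, None, 29, None]})

    # Option 1: list the incomplete rows to send back to the business
    print(df[df["age"].isna()])

    # Option 2: impute with the mean (only defensible for numeric fields
    # where the mean is a reasonable stand-in)
    df["age"] = df["age"].fillna(df["age"].mean())
    print(df)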

2) Standardising values in a column - like USA, US, America, United States. Remedy: use a lookup to map them all to one standard value.
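
A minimal sketch of such a lookup in pandas (the country column and its variants are invented for the example):

    import pandas as pd

    df = pd.DataFrame({"country": ["USA", "US", "America",
                                   "United States", "Germany"]})

    # Lookup table mapping every known variant to one standard value;
    # anything not in the lookup is kept as-is for manual review
    lookup = {"USA": "United States", "US": "United States",
              "America": "United States", "United States": "United States"}

    df["country_std"] = df["country"].map(lookup).fillna(df["country"])
    print(df)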

3) Dates - one major issue is that a file might contain multiple date formats. Remedy: this can again be solved with a lookup, or by parsing each known format into one standard format.
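
Along those lines, a small sketch that tries a list of expected formats in turn (the formats themselves are assumptions, adjust them to your file):

    import pandas as pd

    raw = pd.Series(["2016-08-16", "16/08/2016", "08-16-2016"])
    formats = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

    def parse_date(value):
        # Keep the first format that parses; leave the rest for manual review
        for fmt in formats:
            try:
                return pd.to_datetime(value, format=fmt)
            except ValueError:
                continue
        return pd.NaT

    print(raw.map(parse_date))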

4) Duplicates - these can be removed easily either in Excel or in the database; you can even have a stored procedure for this.
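
For example, a quick pandas sketch (the names and cities are dummy data; in the database you would typically do the same with ROW_NUMBER() inside a stored procedure):

    import pandas as pd

    df = pd.DataFrame({"name": ["Jane Doe", "Jane Doe ", "John Smith"],
                       "city": ["Berlin", "Berlin", "Munich"]})

    # Exact duplicates: drop_duplicates keeps the first occurrence
    exact = df.drop_duplicates()

    # Near-duplicates (not exactly equal) usually need a normalised key
    df["name_key"] = df["name"].str.lower().str.strip()
    fuzzy = df.drop_duplicates(subset=["name_key", "city"])
    print(fuzzy)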

5) Outliers - this is a separate topic altogether: an outlier is a data point that comes from a different distribution. I would suggest using the 3-standard-deviation rule to flag potential outliers and then using Grubbs' test to check whether they really are outliers.
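
A rough sketch of both steps on synthetic numbers (SciPy has no built-in Grubbs' test, so the statistic and critical value are computed by hand from the usual formula):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(50, 5, size=30), 95.0)  # one planted outlier

    # 3-standard-deviation rule to flag candidates
    z = (x - x.mean()) / x.std(ddof=1)
    print("candidates:", x[np.abs(z) > 3])

    # Two-sided Grubbs' test on the most extreme point
    def grubbs(values, alpha=0.05):
        n = len(values)
        g = np.max(np.abs(values - values.mean())) / values.std(ddof=1)
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        return g, g_crit

    g, g_crit = grubbs(x)
    print("outlier" if g > g_crit else "no outlier", round(g, 2), round(g_crit, 2))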

You can find cleansed and uncleansed datasets here https://www.datazar.com/

I hope this helped you. If you need any further details, feel free to email me at pramit@datazar.com

Pramit

In addition to Pramit's answer:

  • Entities are stored differently: system A stores full name, whereas B stores first name and last name.
  • Character set issues: data was exported from system A to B and B cannot interpret characters from A correctly (or the conversion process had errors). This is common when exporting to CSV format.
  • Field values in table X point to an index field in table Y, but that record no longer exists (lookup fails); see the sketch after this list
  • Not properly distinguishing NULL and 0 values (or NULL values and empty strings)
  • Varying CASING For values that Should HAVE consistent CASing
  • End dates before start dates
  • Date ranges that are not sequential (have gaps)
  • Invalid dates
  • Missing values
  • Invalid usage of field types, e.g. storing numbers or dates in text fields, or strings in a text field that have their length prepended (yes, I've seen that)
  • Values that are valid in one system but not in another, e.g. in a DB2 database you can store the time 24:00 - try that in Oracle or SQL Server.
  • Different decimal separators; using thousand separators
  • Different languages used for texts
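
For the orphaned-lookup point above, a small pandas sketch (both tables and the customer_id key are hypothetical; in SQL this is the classic LEFT JOIN ... WHERE y.id IS NULL pattern):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    orders = pd.DataFrame({"order_id": [10, 11, 12],
                           "customer_id": [1, 2, 99]})  # 99 has no match

    # Anti-join: orders whose customer_id no longer exists in customers
    orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
    print(orphans)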

Then there's a whole separate class of issues coming from databases that are not properly normalized. Some that come to mind:

  • A value duplicated into table X from a lookup in table Y no longer matches the current value in Y
  • Field value Z calculated from fields X and Y does not equal the actual calculation when you re-do it
  • Putting 2;12;34;5 in a string field instead of storing those numbers as integers in a detail table
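
As a rough sketch of fixing that last point (table and column names are invented), unpacking the packed string field into a proper detail table:

    import pandas as pd

    parent = pd.DataFrame({"order_id": [1, 2],
                           "item_ids": ["2;12;34;5", "7;8"]})

    # One integer per row in a detail table instead of a packed string
    detail = (parent.assign(item_id=parent["item_ids"].str.split(";"))
                    .explode("item_id")
                    .astype({"item_id": int})[["order_id", "item_id"]])
    print(detail)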