
When preparing data for analysis, I often encounter issues such as outliers, logically inconsistent entries (e.g. age=150 or age=-2), duplicates that are not exactly equal, etc. When integrating data sets from different sources, there might not be a unique identifier, or entries may follow different standards (e.g. "US" vs. "United States"), which makes matching hard.

What are the issues you experience most commonly when working with data?

Elisabeth
    This question is very broad, probably too broad for a question and answer format. Can you be more specific? –  Aug 16 '16 at 10:14

2 Answers


There are numerous instances where data needs to be cleansed and standardised. To name a few:

1) Mandatory fields with NULL values - this is the greatest source of data loss.

Remedy: go back to the business for the correct values, or impute them, e.g. with the mean.
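
For illustration, a rough pandas sketch (the frame and the age column are just made-up placeholders):

    import pandas as pd

    # Toy frame with a mandatory numeric field that has gaps
    df = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                       "age": [34, None, 29, None]})

    # Option 1: list the incomplete rows to send back to the business
    print(df[df["age"].isna()])

    # Option 2: impute with the mean (only defensible for numeric fields
    # where the mean is a reasonable stand-in)
    df["age"] = df["age"].fillna(df["age"].mean())
    print(df)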

2) Standardising values in a column - like USA, US, America, United States. Remedy: use a lookup to map them all to one standard value.
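
A minimal sketch of such a lookup in pandas (the country column and its variants are invented for the example):

    import pandas as pd

    df = pd.DataFrame({"country": ["USA", "US", "America",
                                   "United States", "Germany"]})

    # Lookup table mapping every known variant to one standard value;
    # anything not in the lookup is kept as-is for manual review
    lookup = {"USA": "United States", "US": "United States",
              "America": "United States", "United States": "United States"}

    df["country_std"] = df["country"].map(lookup).fillna(df["country"])
    print(df)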

3) Dates - one major issue is that a file might contain multiple date formats. Remedy: this can again be solved with a lookup, or by parsing each known format into one standard format.
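
Along those lines, a small sketch that tries a list of expected formats in turn (the formats themselves are assumptions, adjust them to your file):

    import pandas as pd

    raw = pd.Series(["2016-08-16", "16/08/2016", "08-16-2016"])
    formats = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

    def parse_date(value):
        # Keep the first format that parses; leave the rest for manual review
        for fmt in formats:
            try:
                return pd.to_datetime(value, format=fmt)
            except ValueError:
                continue
        return pd.NaT

    print(raw.map(parse_date))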

4) Duplicates - these can be removed easily either in Excel or in the database; you can even have a stored procedure for this.
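
For example, a quick pandas sketch (the names and cities are dummy data; in the database you would typically do the same with ROW_NUMBER() inside a stored procedure):

    import pandas as pd

    df = pd.DataFrame({"name": ["Jane Doe", "Jane Doe ", "John Smith"],
                       "city": ["Berlin", "Berlin", "Munich"]})

    # Exact duplicates: drop_duplicates keeps the first occurrence
    exact = df.drop_duplicates()

    # Near-duplicates (not exactly equal) usually need a normalised key
    df["name_key"] = df["name"].str.lower().str.strip()
    fuzzy = df.drop_duplicates(subset=["name_key", "city"])
    print(fuzzy)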

5) Outliers - this is a separate topic altogether: an outlier is a data point that comes from a different distribution. I would suggest using the 3-standard-deviation rule to flag potential outliers and then using Grubbs' test to check whether they really are outliers.
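
A rough sketch of both steps on synthetic numbers (SciPy has no built-in Grubbs' test, so the statistic and critical value are computed by hand from the usual formula):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(50, 5, size=30), 95.0)  # one planted outlier

    # 3-standard-deviation rule to flag candidates
    z = (x - x.mean()) / x.std(ddof=1)
    print("candidates:", x[np.abs(z) > 3])

    # Two-sided Grubbs' test on the most extreme point
    def grubbs(values, alpha=0.05):
        n = len(values)
        g = np.max(np.abs(values - values.mean())) / values.std(ddof=1)
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        return g, g_crit

    g, g_crit = grubbs(x)
    print("outlier" if g > g_crit else "no outlier", round(g, 2), round(g_crit, 2))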

You can find cleansed and uncleansed datasets here https://www.datazar.com/

I hope this helped you. If you need any further details, feel free to email me at pramit@datazar.com

Pramit

In addition to Pramit's answer:

  • Entities are stored differently: system A stores full name, whereas B stores first name and last name.
  • Character set issues: data was exported from system A to B and B cannot interpret characters from A correctly (or the conversion process had errors). This is common when exporting to CSV format.
  • Field values in table X point to an index field in table Y, but that record no longer exists (lookup fails); see the sketch after this list
  • Not properly distinguishing NULL and 0 values (or NULL values and empty strings)
  • Varying CASING For values that Should HAVE consistent CASing
  • End dates before start dates
  • Date ranges that are not sequential (have gaps)
  • Invalid dates
  • Missing values
  • Invalid usage of field types, e.g. storing numbers or dates in text fields, or strings in a text field that have their length prepended (yes, I've seen that)
  • Values that are valid in one system but not in another, e.g. in a DB2 database you can store the time 24:00 - try that in Oracle or SQL Server.
  • Different decimal separators; using thousand separators
  • Different languages used for texts
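
For the orphaned-lookup point above, a small pandas sketch (both tables and the customer_id key are hypothetical; in SQL this is the classic LEFT JOIN ... WHERE y.id IS NULL pattern):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3]})
    orders = pd.DataFrame({"order_id": [10, 11, 12],
                           "customer_id": [1, 2, 99]})  # 99 has no match

    # Anti-join: orders whose customer_id no longer exists in customers
    orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
    print(orphans)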

Then there's a whole separate class of issues coming from databases that are not properly normalized. Some that come to mind:

  • A value duplicated into table X from a lookup in table Y no longer matches the current value in Y
  • Field value Z calculated from fields X and Y does not equal the actual calculation when you re-do it
  • Putting 2;12;34;5 in a string field instead of storing those numbers as integers in a detail table
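
As a rough sketch of fixing that last point (table and column names are invented), unpacking the packed string field into a proper detail table:

    import pandas as pd

    parent = pd.DataFrame({"order_id": [1, 2],
                           "item_ids": ["2;12;34;5", "7;8"]})

    # One integer per row in a detail table instead of a packed string
    detail = (parent.assign(item_id=parent["item_ids"].str.split(";"))
                    .explode("item_id")
                    .astype({"item_id": int})[["order_id", "item_id"]])
    print(detail)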