I have a case where only 2.7% of the rows have a value for a particular column.
What steps could I take to make use of it? Any specific methods?
Also, any algorithms or techniques for handling missing data are welcome.
For missing data, one of the best-known approaches is proximity-based imputation using a Random Forest.
Since in your case roughly 97% of the values are missing, keep in mind that this method works best when you have a very large amount of data.
Algorithm:
We first make an initial guess for the missing values, then gradually refine that guess until it converges (stops changing between iterations).
For the initial guess: in a categorical column we take the most frequent value, and in a numeric column we take the median of the observed values.
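A minimal sketch of this initial-guess step in Python, assuming the data lives in a pandas DataFrame; the function name `initial_fill` and the DataFrame `df` are placeholders, not part of the original answer:

```python
import pandas as pd


def initial_fill(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the column median (numeric) or mode (categorical)."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # numeric column: use the median of the observed values
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # categorical column: use the most frequent observed value
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled
```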
Now we want to refine these guesses. We do this by first determining which samples are similar to the sample with the missing value.
How do we determine similarity? Simple: we build a random forest.
Run all of the data down every tree; samples that end up in the same leaf node are considered similar.
To keep track of similarity we use a proximity matrix. For each tree, add 1 to the cell for every pair of samples that land in the same leaf. For example, if sample 3 and sample 5 end up in the same leaf node of the first tree, set cell (3, 5) = 1.
Then run the data down the second tree and update the proximity matrix: if sample 3 and sample 5 again end up in the same leaf, add 1 to the previous value, so cell (3, 5) = 1 + 1 = 2. Do the same for every other pair of samples and every remaining tree.
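Here is a sketch of that proximity computation with scikit-learn, assuming a forest `rf` already fitted on the currently imputed feature matrix `X` (both names are placeholders). `rf.apply(X)` returns the leaf index each sample reaches in every tree, so two samples are "close" when they share many leaves:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def proximity_matrix(rf: RandomForestClassifier, X: np.ndarray) -> np.ndarray:
    """Fraction of trees in which each pair of samples ends up in the same leaf."""
    leaves = rf.apply(X)                      # shape: (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # +1 for every pair of samples sharing a leaf in tree t
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        prox += same_leaf
    return prox / n_trees                     # normalise so entries lie in [0, 1]
```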
Then, after normalizing it, use the proximity matrix values as weights.
For a categorical column (e.g., yes/no values), compute the proximity-weighted frequency of each value and fill in the most frequent one.
For a numeric column, use the normalized proximities to compute a weighted average of the observed values and place it in the missing cell of that row.
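A sketch of this refinement step, assuming `prox` is the normalized proximity matrix from above, `values` holds the current values of the column being imputed, and `missing` is a boolean mask of the originally missing entries (all placeholder names). The weighted average/frequency is taken over the observed entries only:

```python
import numpy as np


def update_numeric(values, missing, prox):
    """Replace missing entries with a proximity-weighted average of observed ones."""
    updated = np.array(values, dtype=float)
    observed = ~missing
    for i in np.where(missing)[0]:
        w = prox[i][observed]                 # proximities to the observed rows
        if w.sum() > 0:
            updated[i] = np.dot(w, updated[observed]) / w.sum()
    return updated


def update_categorical(values, missing, prox):
    """Replace missing entries with the proximity-weighted most frequent category."""
    updated = values.copy()
    observed = ~missing
    obs_vals = values[observed]
    for i in np.where(missing)[0]:
        w = prox[i][observed]
        # proximity-weighted frequency of each observed category for this row
        freq = {cat: w[obs_vals == cat].sum() for cat in np.unique(obs_vals)}
        updated[i] = max(freq, key=freq.get)
    return updated
```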
After filling in all the missing values, we build the random forest again and repeat the whole process until the imputed values converge.
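A rough sketch of that outer loop, reusing `proximity_matrix` and `update_numeric` from the sketches above; `X`, `y`, `col`, and `missing` are placeholders, and in practice you would stop when the imputed column changes by less than some tolerance rather than after a fixed number of iterations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_impute_column(X: np.ndarray, y: np.ndarray, col: int,
                     missing: np.ndarray, n_iter: int = 5) -> np.ndarray:
    """Iteratively refit the forest and re-impute column `col`."""
    X = X.copy()
    for _ in range(n_iter):
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        prox = proximity_matrix(rf, X)                         # sketch above
        X[:, col] = update_numeric(X[:, col], missing, prox)   # sketch above
    return X
```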
The idea is that even though the data may be high-dimensional, involving mixed variables, etc., the proximity matrix gives an indication of which observations are effectively close together in the eyes of the random forest classifier.
Reference:
This algorithm is described in The Elements of Statistical Learning, in the chapter on Random Forests.