I have a case where only 2.7% of the rows have a value for a particular column.
What steps could I take to make use of it? Any specific methods?
Also, any algorithms or techniques for handling missing data are welcome.
For missing data, one of the best-known approaches is proximity-based imputation using a Random Forest.
Since in your case roughly 97% of the values are missing, keep in mind that this method works best when you have a very large amount of data.
Algorithm:
We first make an initial guess for the missing values, then gradually refine that guess until it converges (stops changing between iterations).
For the initial guess: in a categorical column we take the most frequent value, and in a numeric column we take the median of the observed values.
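A minimal sketch of this initial-guess step in Python, assuming the data lives in a pandas DataFrame; the function name `initial_fill` and the DataFrame `df` are placeholders, not part of the original answer:

```python
import pandas as pd


def initial_fill(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the column median (numeric) or mode (categorical)."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # numeric column: use the median of the observed values
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # categorical column: use the most frequent observed value
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled
```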
Now we want to refine these guesses. We do this by first determining which samples are similar to the sample with the missing value.
How do we determine similarity? Simple: we build a random forest.
Run all of the data down every tree; samples that end up in the same leaf node are considered similar.
To keep track of similarity we use a proximity matrix. For each tree, add 1 to the cell for every pair of samples that land in the same leaf. For example, if sample 3 and sample 5 end up in the same leaf node of the first tree, set cell (3, 5) = 1.
Then run the data down the second tree and update the proximity matrix: if sample 3 and sample 5 again end up in the same leaf, add 1 to the previous value, so cell (3, 5) = 1 + 1 = 2. Do the same for every other pair of samples and every remaining tree.
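Here is a sketch of that proximity computation with scikit-learn, assuming a forest `rf` already fitted on the currently imputed feature matrix `X` (both names are placeholders). `rf.apply(X)` returns the leaf index each sample reaches in every tree, so two samples are "close" when they share many leaves:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def proximity_matrix(rf: RandomForestClassifier, X: np.ndarray) -> np.ndarray:
    """Fraction of trees in which each pair of samples ends up in the same leaf."""
    leaves = rf.apply(X)                      # shape: (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # +1 for every pair of samples sharing a leaf in tree t
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        prox += same_leaf
    return prox / n_trees                     # normalise so entries lie in [0, 1]
```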
Then, after normalizing it, use the proximity matrix values as weights.
For a categorical column (e.g., yes/no values), compute the proximity-weighted frequency of each value and fill in the most frequent one.
For a numeric column, use the normalized proximities to compute a weighted average of the observed values and place it in the missing cell of that row.
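A sketch of this refinement step, assuming `prox` is the normalized proximity matrix from above, `values` holds the current values of the column being imputed, and `missing` is a boolean mask of the originally missing entries (all placeholder names). The weighted average/frequency is taken over the observed entries only:

```python
import numpy as np


def update_numeric(values, missing, prox):
    """Replace missing entries with a proximity-weighted average of observed ones."""
    updated = np.array(values, dtype=float)
    observed = ~missing
    for i in np.where(missing)[0]:
        w = prox[i][observed]                 # proximities to the observed rows
        if w.sum() > 0:
            updated[i] = np.dot(w, updated[observed]) / w.sum()
    return updated


def update_categorical(values, missing, prox):
    """Replace missing entries with the proximity-weighted most frequent category."""
    updated = values.copy()
    observed = ~missing
    obs_vals = values[observed]
    for i in np.where(missing)[0]:
        w = prox[i][observed]
        # proximity-weighted frequency of each observed category for this row
        freq = {cat: w[obs_vals == cat].sum() for cat in np.unique(obs_vals)}
        updated[i] = max(freq, key=freq.get)
    return updated
```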
After filling in all the missing values, we build the random forest again and repeat the whole process until the imputed values converge.
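A rough sketch of that outer loop, reusing `proximity_matrix` and `update_numeric` from the sketches above; `X`, `y`, `col`, and `missing` are placeholders, and in practice you would stop when the imputed column changes by less than some tolerance rather than after a fixed number of iterations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_impute_column(X: np.ndarray, y: np.ndarray, col: int,
                     missing: np.ndarray, n_iter: int = 5) -> np.ndarray:
    """Iteratively refit the forest and re-impute column `col`."""
    X = X.copy()
    for _ in range(n_iter):
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        prox = proximity_matrix(rf, X)                         # sketch above
        X[:, col] = update_numeric(X[:, col], missing, prox)   # sketch above
    return X
```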
The idea is that even though the data may be high-dimensional, involving mixed variables, etc., the proximity matrix gives an indication of which observations are effectively close together in the eyes of the random forest classifier.
Reference:
This algorithm is described in The Elements of Statistical Learning, in the chapter on Random Forests.