How to handle missing values in Naive Bayes with scikit-learn

Question

I am working with a dataset that has 34 features (numerical and nominal) plus the target class. Several columns have missing values; one column in particular has approximately 50% missing values.

I was not concerned about this, because in R Naive Bayes works fine regardless of missing values or feature types. But since I read in the scikit-learn docs that its Naive Bayes cannot handle mixed data or missing values, I am now concerned.

Is there a Python library that works exactly like Naive Bayes in R? If not, what can I do to run Naive Bayes on mixed features that also have missing values?

Hello, maybe I should restate my issue. After going back over my university exercises in R, I found that we did not use a mixed Naive Bayes, only one with categorical data and no missing values. I had confused missing values with a zero conditional probability of an attribute value given a class label. I am now trying to reimplement an NB algorithm from a paper and reproduce its results. The authors used WEKA, and they do not say how mixed and missing values were handled when evaluating the dataset. So basically I need to learn what WEKA's NB does with missing and mixed data – Panos, Apr 04 '22 at 20:02

Tim · Answer 1 · 2022-04-05T12:28:00.590

1

Recall how naive Bayes does the computations. It defines the problem in terms of a probability distribution, but with the "naive" assumption that the features are conditionally independent given the class:

$$ p(y, x_1, x_2, \dots, x_m) = p(y) \prod_{j=1}^m p(x_j \mid y) $$

What this means for us is that we can calculate $p(x_j \mid y)$ independently for each feature, using only the rows where that feature is not missing. In an extreme case, this would even be possible if you didn't have a single row in your dataset where all the features are non-missing, or with features coming from completely different samples (though it'd be risky). So technically, missing data is not a problem for the naive Bayes algorithm.
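To make this concrete, here is a minimal sketch of the idea for numerical features, assuming a Gaussian model for each $p(x_j \mid y)$. The function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are made up for illustration; this is not what scikit-learn, R, or Weka do internally. At prediction time a missing feature is simply left out of the product.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate p(y) and a Gaussian p(x_j | y) per feature, skipping NaNs."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    # nanmean / nanvar use only the non-missing rows of each feature
    means = np.array([np.nanmean(X[y == c], axis=0) for c in classes])
    variances = np.array([np.nanvar(X[y == c], axis=0) for c in classes])
    variances = variances + 1e-9  # small constant to avoid division by zero
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # Gaussian log-density with shape (rows, classes, features)
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    # A missing feature contributes a factor of 1 (log 0.0) to the product
    log_lik = np.where(np.isnan(log_lik), 0.0, log_lik)
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Toy data (hypothetical): no row has both features observed in every case
X = np.array([[1.0, np.nan], [1.2, 5.0], [np.nan, 5.2],
              [3.0, 9.0], [3.2, np.nan], [np.nan, 9.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X, y)
preds = predict_gaussian_nb(model, np.array([[1.1, np.nan], [np.nan, 9.0]]))
```

One caveat of this sketch: if a feature is entirely missing within one class, `np.nanmean` returns NaN for it and the feature is ignored for that class only, which biases the comparison between classes.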

The limitations you describe seem to come from the particular implementation rather than from the algorithm itself. In that case, you can implement it by hand or look for other software. Another solution is to use one of the many generic approaches to missing data described in many threads on this site.
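As one example of such a generic approach, imputation can be placed in front of scikit-learn's `GaussianNB`. The sketch below assumes numerical features only and uses mean imputation, which is just one possible choice; the toy data is made up. Wrapping both steps in a pipeline means the imputation statistics are learned from the training set only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical) with missing values encoded as NaN
X_train = np.array([[1.0, np.nan], [1.2, 5.0], [np.nan, 5.2],
                    [3.0, 9.0], [3.2, np.nan], [np.nan, 9.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# The imputer learns column means on the training data,
# then the same means are reused when predicting on new rows
model = make_pipeline(SimpleImputer(strategy="mean"), GaussianNB())
model.fit(X_train, y_train)

X_new = np.array([[1.1, np.nan], [np.nan, 9.0]])
preds = model.predict(X_new)
```

For nominal columns, `SimpleImputer(strategy="most_frequent")` fills in the mode instead; mixing both feature types would additionally need something like a `ColumnTransformer` and a suitable encoding, which is left out here.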

edited Apr 05 '22 at 12:28

answered Apr 05 '22 at 12:22


Basically, I am following a paper whose authors used WEKA, and I want to reproduce the results using scikit-learn. The problem is that the scikit-learn algorithms cannot handle missing values, and worse, we have no source showing what the WEKA algorithms do under the hood with missing values – Panos Apr 05 '22 at 12:57
@Panos I'd expect to see such information in the documentation. If it is not there, Weka is open-source so you can always check the source code. – Tim Apr 05 '22 at 12:59
Yes, of course, but I was hoping that members with more WEKA experience could help. My diploma thesis is about implementing algorithms in Python, and because I found a paper with nice algorithms that were implemented in WEKA, I am now reading up on WEKA. WEKA itself is ultimately outside the scope of my research – Panos Apr 05 '22 at 13:04
@Panos if you are only looking for help with Weka, this would not be the best place; better to ask on one of the Weka user support groups: https://waikato.github.io/weka-wiki/getting_help/ – Tim Apr 05 '22 at 13:16
Anyway, thanks for the help. I have given up on diving into WEKA. I started trying scikit-learn imputation (KNN, MICE, etc.) to see which method gets me closest to the paper's results – Panos Apr 05 '22 at 17:23
And finally I got a very acceptable result using KNN imputation with n_neighbors = 1: optimal threshold difference 0.009 - 0.003 = 0.6%, Youden index 0.436 - 0.391 = 4.5% – Panos Apr 05 '22 at 18:30
Then, while digging into the logistic regression implementation in WEKA, I found that it uses a filter that replaces missing nominal values with the mode and missing numerical values with the mean, both computed from the training set – Panos Apr 05 '22 at 18:42
