46

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.

I gather I can do something like

len(mydata.index) - mydata.count()

to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).

desertnaut
orome
  • Similar question for rows: [Python/Pandas: counting the number of missing/NaN in each row](http://stackoverflow.com/questions/30059260/python-pandas-counting-the-number-of-nan-in-each-row) – smci Nov 17 '16 at 11:00

7 Answers

55

Both describe and info report the count of non-missing values.

In [1]: df = DataFrame(np.random.randn(10,2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]: 
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]: 
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]


In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)

To get a count of missing values, your solution is correct:

In [20]: len(df.index)-df.count()
Out[20]: 
0    3
1    0
dtype: int64

You could do this too

In [23]: df.isnull().sum()
Out[23]: 
0    3
1    0
dtype: int64
Jeff
10

As a tiny addition, to get the percentage missing by DataFrame column, combining @Jeff's and @userS's answers gets you:

df.isnull().sum()/len(df)*100
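For instance, on a toy frame (made-up data, just to show the shape of the result):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0], 'b': [1.0, 2.0, 3.0, 4.0]})
df.isnull().sum()/len(df)*100
# a    50.0
# b     0.0
# dtype: float64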
Ricky McMaster
5

The following will do the trick and return the count of nulls for every column:

df.isnull().sum(axis=0)

df.isnull() returns a DataFrame of True / False values
.sum(axis=0) sums those values down the rows, giving one count per column (axis=1 counts per row instead, as sketched below)
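A minimal sketch (toy data, not from the question) contrasting the two axes; axis=1 gives the per-row counts asked about in the comments under the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
df.isnull().sum(axis=0)  # per column: a -> 1, b -> 2
df.isnull().sum(axis=1)  # per row:    0 -> 1, 1 -> 2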

Kshitij
3

This isn't quite a full summary, but it will give you a quick sense of your column-level data:

def getPctMissing(series):
    num = series.isnull().sum()   # number of missing values in the column
    den = len(series)             # total number of values, missing included
    return 100 * (num / den)
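To run it over every column at once, you can apply the helper defined above column-wise (toy data for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [1.0, 2.0, 3.0, 4.0]})
df.apply(getPctMissing)
# a    50.0
# b     0.0
# dtype: float64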
Vivek Kalyanarangan
userS
2

If you want to see a non-null summary of each column, just use df.info(null_counts=True):

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10,5), columns=list('abcde'))
df.iloc[:4,0] = np.nan
df.iloc[:3,1] = np.nan
df.iloc[:2,2] = np.nan
df.iloc[:1,3] = np.nan

df.info(null_counts=True)

output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       7 non-null      float64
 2   c       8 non-null      float64
 3   d       9 non-null      float64
 4   e       10 non-null     float64
dtypes: float64(5)
memory usage: 528.0 bytes
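Note that in more recent pandas releases the null_counts keyword was renamed; if your version complains about it, the equivalent call should be:

df.info(show_counts=True)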

In addition, if you want to customize the result, such as adding nan_rate, I wrote a method:


def describe_nan(df):
    return pd.DataFrame(
        [(i, df[df[i].isna()].shape[0], df[df[i].isna()].shape[0] / df.shape[0]) for i in df.columns],
        columns=['column', 'nan_counts', 'nan_rate'])

describe_nan(df)

  column  nan_counts  nan_rate
0      a           4       0.4
1      b           3       0.3
2      c           2       0.2
3      d           1       0.1
4      e           0       0.0

Mithril
0

If you don't care which columns have NaNs and you just want to check the DataFrame overall, add a second .sum() to get a single value:

result = df.isnull().sum().sum()
result > 0

A Series would only need one .sum(), and a Panel (since removed from pandas) would need three.
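If all you need is that yes/no check, a shortcut that is often used (not part of the answer above, but equivalent) is to ask whether any value in the underlying array is null:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
df.isnull().values.any()  # True -- at least one NaN somewhere in the frame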

Peter O.
Drafter250
0

I had to process numerous large datasets to get NaN information (counts and portions per column), and timing was an issue. So I timed various methods for getting summary counts of NaNs per column into a separate dataframe with column names, NaN counts and NaN portions as columns:

import numpy as np
import pandas as pd

# create random dataframe
dfa = pd.DataFrame(np.random.randn(100000, 300))
# add 30% random NaNs
dfa = dfa.mask(np.random.random(dfa.shape) < 0.3)

With pandas methods only:

%%timeit
nans_dfa = dfa.isna().sum().rename_axis('Columns').reset_index(name='Counts')
nans_dfa["NaNportions"] = nans_dfa["Counts"] / dfa.shape[0]

# Output:
# 10 loops, best of 5: 57.8 ms per loop

Using list comprehension, based on the fine answer from @Mithril:

%%timeit
nan_dfa_loop2 = pd.DataFrame([(col, dfa[dfa[col].isna()].shape[0], dfa[dfa[col].isna()].shape[0]/dfa.shape[0]) for col in dfa.columns], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 13.9 s per loop

Using a list comprehension with a second for clause that stores the result of the method call, so it is only computed once per column (note that the if n clause skips columns with no NaNs at all; in this test every column has some, so the result matches the other methods):

%%timeit
nan_dfa_loop1 = pd.DataFrame([(col, n, n/dfa.shape[0]) for col in dfa.columns for n in (dfa[col].isna().sum(),) if n], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 373 ms per loop

All the above will produce the same dataframe:

    Columns Counts  NaNportions
0   0   29902   0.29902
1   1   30101   0.30101
2   2   30008   0.30008
3   3   30194   0.30194
4   4   29856   0.29856
... ... ... ...
295 295 29823   0.29823
296 296 29818   0.29818
297 297 29979   0.29979
298 298 30050   0.30050
299 299 30192   0.30192

('Columns' is redundant with this test dataframe; it is just used as a placeholder, where in a real-life dataset it would hold the names of the attributes in the initial dataframe.)

C. Liatas