46

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.

I gather I can do something like

len(mydata.index) - mydata.count()

to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).

desertnaut
orome
  • Similar question for rows: [Python/Pandas: counting the number of missing/NaN in each row](http://stackoverflow.com/questions/30059260/python-pandas-counting-the-number-of-nan-in-each-row) – smci Nov 17 '16 at 11:00

7 Answers

55

Both describe and info report the count of non-missing values.

In [1]: df = DataFrame(np.random.randn(10,2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]: 
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]: 
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]


In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)

To get a count of missing values, your solution is correct:

In [20]: len(df.index)-df.count()
Out[20]: 
0    3
1    0
dtype: int64

You could do this too

In [23]: df.isnull().sum()
Out[23]: 
0    3
1    0
dtype: int64
Jeff
10

As a tiny addition, to get the percentage missing by DataFrame column, combining @Jeff's and @userS's answers gets you:

df.isnull().sum()/len(df)*100
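For instance, on a toy frame (made-up data, just to show the shape of the result):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0], 'b': [1.0, 2.0, 3.0, 4.0]})
df.isnull().sum()/len(df)*100
# a    50.0
# b     0.0
# dtype: float64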
Ricky McMaster
5

The following will do the trick and return the count of nulls for every column:

df.isnull().sum(axis=0)

df.isnull() returns a DataFrame of True / False values
.sum(axis=0) sums those values down the rows, giving one count per column (axis=1 counts per row instead, as sketched below)
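A minimal sketch (toy data, not from the question) contrasting the two axes; axis=1 gives the per-row counts asked about in the comments under the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
df.isnull().sum(axis=0)  # per column: a -> 1, b -> 2
df.isnull().sum(axis=1)  # per row:    0 -> 1, 1 -> 2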

Kshitij
3

This isn't quite a full summary, but it will give you a quick sense of your column-level data:

def getPctMissing(series):
    num = series.isnull().sum()   # number of missing values in the column
    den = len(series)             # total number of values, missing included
    return 100 * (num / den)
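To run it over every column at once, you can apply the helper defined above column-wise (toy data for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [1.0, 2.0, 3.0, 4.0]})
df.apply(getPctMissing)
# a    50.0
# b     0.0
# dtype: float64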
Vivek Kalyanarangan
userS
2

If you want to see a non-null summary of each column, just use df.info(null_counts=True):

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10,5), columns=list('abcde'))
df.iloc[:4,0] = np.nan
df.iloc[:3,1] = np.nan
df.iloc[:2,2] = np.nan
df.iloc[:1,3] = np.nan

df.info(null_counts=True)

output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       7 non-null      float64
 2   c       8 non-null      float64
 3   d       9 non-null      float64
 4   e       10 non-null     float64
dtypes: float64(5)
memory usage: 528.0 bytes
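Note that in more recent pandas releases the null_counts keyword was renamed; if your version complains about it, the equivalent call should be:

df.info(show_counts=True)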

In addition, if you want to customize the result, such as adding nan_rate, I wrote a method:


def describe_nan(df):
    return pd.DataFrame(
        [(i, df[df[i].isna()].shape[0], df[df[i].isna()].shape[0] / df.shape[0]) for i in df.columns],
        columns=['column', 'nan_counts', 'nan_rate'])

describe_nan(df)

  column  nan_counts  nan_rate
0      a           4       0.4
1      b           3       0.3
2      c           2       0.2
3      d           1       0.1
4      e           0       0.0

Mithril
0

If you don't care which columns have NaNs and you just want to check the DataFrame overall, add a second .sum() to get a single value:

result = df.isnull().sum().sum()
result > 0

A Series would only need one .sum(), and a Panel (since removed from pandas) would need three.
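If all you need is that yes/no check, a shortcut that is often used (not part of the answer above, but equivalent) is to ask whether any value in the underlying array is null:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
df.isnull().values.any()  # True -- at least one NaN somewhere in the frame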

Peter O.
Drafter250
0

I had to process numerous large datasets to get NaN information (counts and portions per column), and timing was an issue. So I timed various methods for getting summary counts of NaNs per column into a separate dataframe with column names, NaN counts and NaN portions as columns:

import numpy as np
import pandas as pd

# create random dataframe
dfa = pd.DataFrame(np.random.randn(100000, 300))
# add 30% random NaNs
dfa = dfa.mask(np.random.random(dfa.shape) < 0.3)

With pandas methods only:

%%timeit
nans_dfa = dfa.isna().sum().rename_axis('Columns').reset_index(name='Counts')
nans_dfa["NaNportions"] = nans_dfa["Counts"] / dfa.shape[0]

# Output:
# 10 loops, best of 5: 57.8 ms per loop

Using list comprehension, based on the fine answer from @Mithril:

%%timeit
nan_dfa_loop2 = pd.DataFrame([(col, dfa[dfa[col].isna()].shape[0], dfa[dfa[col].isna()].shape[0]/dfa.shape[0]) for col in dfa.columns], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 13.9 s per loop

Using a list comprehension with a second for clause that stores the result of the method call, so it is only computed once per column (note that the if n clause skips columns with no NaNs at all; in this test every column has some, so the result matches the other methods):

%%timeit
nan_dfa_loop1 = pd.DataFrame([(col, n, n/dfa.shape[0]) for col in dfa.columns for n in (dfa[col].isna().sum(),) if n], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 373 ms per loop

All the above will produce the same dataframe:

    Columns Counts  NaNportions
0   0   29902   0.29902
1   1   30101   0.30101
2   2   30008   0.30008
3   3   30194   0.30194
4   4   29856   0.29856
... ... ... ...
295 295 29823   0.29823
296 296 29818   0.29818
297 297 29979   0.29979
298 298 30050   0.30050
299 299 30192   0.30192

('Columns' is redundant with this test dataframe; it is just used as a placeholder, where in a real-life dataset it would hold the names of the attributes in the initial dataframe.)

C. Liatas