498

Is there a way to check if a column exists in a Pandas DataFrame?

Suppose that I have the following DataFrame:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                       'B': [randint(1, 9)*10 for x in range(10)],
                       'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  3  40  100
1  6  30  200
2  7  70  800
3  3  50  200
4  7  50  400
5  4  10  400
6  3  70  500
7  8  30  200
8  3  40  800
9  6  60  200

and I want to calculate df['sum'] = df['A'] + df['C']

But first I want to check if df['A'] exists, and if not, I want to calculate df['sum'] = df['B'] + df['C'] instead.

Mel
npires

4 Answers

993

This will work:

if 'A' in df:

But for clarity, I'd probably write it as:

if 'A' in df.columns:
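
Applied to the question, the membership test might look like the sketch below. The helper name `add_sum` and the small sample frames are hypothetical, chosen only for illustration; the sketch assumes columns 'B' and 'C' are always present.

```python
import pandas as pd


def add_sum(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'sum' = A + C if column 'A' exists, else B + C."""
    out = df.copy()
    # Membership test on the column index decides which column to add to 'C'.
    first = 'A' if 'A' in out.columns else 'B'
    out['sum'] = out[first] + out['C']
    return out


# Hypothetical sample data: one frame with 'A', one without.
with_a = add_sum(pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100, 200]}))
without_a = add_sum(pd.DataFrame({'B': [10, 20], 'C': [100, 200]}))
```

Here `with_a['sum']` is built from 'A' and 'C', while `without_a['sum']` falls back to 'B' and 'C'.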
chrisb
  • Conversely, one could use `if not 'A' in df.columns:` to execute an operation if `A` is not present in `df` – Robvh Feb 05 '20 at 10:59
  • Additionally, you can check multiple columns with `if all(header in df.columns for header in ('A', 'B')):` – Joe Sadoski May 28 '21 at 14:11
  • @Robvh I think `if 'A' not in df.columns:` is better than `if not 'A' in df.columns:`, because `not in` is a single Python operator, whereas `not A in B` is read as `not (A in B)`: `A in B` is evaluated first and `not` is then applied to the result. – Ramesh-X May 28 '22 at 03:19
161

To check if one or more columns all exist, you can use set.issubset, as in:

if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']

As @brianpck points out in a comment, set([]) can alternatively be constructed with curly braces,

if {'A', 'C'}.issubset(df.columns):

See this question for a discussion of the curly-braces syntax.

Or, you can use a generator expression, as in:

if all(item in df.columns for item in ['A','C']):
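
As a rough illustration, the two checks can be run side by side. The sample frame and the `required` and `missing` names below are assumptions made for this sketch; the list comprehension is an extra step that reports which columns are absent.

```python
import pandas as pd

# Hypothetical frame that lacks column 'A'.
df = pd.DataFrame({'B': [10, 20], 'C': [100, 200]})
required = ['A', 'C']

# Both checks answer "are all required columns present?".
has_all_subset = set(required).issubset(df.columns)
has_all_gen = all(col in df.columns for col in required)

# Collect the absent columns, which is handy for an error message.
missing = [col for col in required if col not in df.columns]
```

Both flags are False for this frame, and `missing` holds the single absent label 'A'.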
C8H10N4O2
18

As an alternative that avoids if statements, you can use the DataFrame get() method. To compute the sum from the question:

df['sum'] = df.get('A', df['B']) + df['C']

The DataFrame get method behaves like the get method of Python dictionaries: it returns the column if the key exists and the supplied default otherwise.
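
A short sketch of that behavior, using a hypothetical frame without column 'A'. Note that the default `df['B']` is evaluated before `get()` is called, so this pattern assumes 'B' always exists.

```python
import pandas as pd

# Hypothetical frame that lacks column 'A'.
df = pd.DataFrame({'B': [10, 20], 'C': [100, 200]})

# 'A' is absent, so get() returns the default, here column 'B'.
total = df.get('A', df['B']) + df['C']

# Without a default, get() returns None for a missing column.
no_default = df.get('A')
```

Because `no_default` is None, adding it to another column would raise a TypeError, which is why the fallback argument matters.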

Gerges
  • Thank you, this works: `df['sum'] = df.get('A') + df['B'] + df['C']` or to avoid any column error if it does not exist, using get() for all the terms .. e.g. `df['sum'] = df.get('A') + df.get('B') + df.get('C')` – Santosh K Apr 05 '21 at 07:36
  • `df.get("A") + df.get("B")` still gives you an error if those don't exist, just the more confusing `TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'` rather than the easier-to-debug `KeyError`. `.get()` should only be used if you're actually planning on using the default value, otherwise it just pushes the error away from the point of failure and makes the state contract more confusing to intuit. The whole point of Gerges' answer is to use the second parameter to `.get()` to specify a column you know will exist as a fallback, not to let a bunch of Nones crash the code. – ggorlen Nov 11 '21 at 00:23
5

You can use the set method issuperset:

set(df).issuperset(['A', 'B'])
# set(df.columns).issuperset(['A', 'B'])
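
The two spellings are interchangeable because iterating over a DataFrame yields its column labels. A small sketch with a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical frame with columns 'A', 'B' and 'C'.
df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})

# Iterating a DataFrame yields column labels, so both sets are equal.
same_labels = set(df) == set(df.columns)

has_a_b = set(df).issuperset(['A', 'B'])  # both columns exist
has_a_d = set(df).issuperset(['A', 'D'])  # 'D' does not exist
```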
Mykola Zotko