Compare two columns using pandas

Question

Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, check each row via the if statement, create new column.

The docs say to use .all but there is no example...

@Merlin: If you have numeric data in a column, it is best not to mix it with strings. Doing so changes the column's dtype to `object`. This allows arbitrary Python objects to be stored in the column, but it comes at the cost of slower numeric computation. Thus if the column is storing numeric data, using NaNs for not-a-numbers is preferable. — unutbu, Dec 14 '14 at 22:54
Having integers as strings and trying to do comparison on them looks odd: `a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]`. This creates confusing results with "correct" code: `df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]` yields `10` for the first line, while it should yield `NaN` if the input would have been integers. — Primer, Dec 15 '14 at 14:10

unutbu · Accepted Answer · 2014-12-15T11:44:27.400

You could use np.where. If cond is a boolean array, and A and B are arrays, then

C = np.where(cond, A, B)

defines C to be equal to A where cond is True, and B where cond is False.

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(The assumption may not be true if df['one'] or df['two'] contain NaNs.)

Note that

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

df2 = df.astype(float)

This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

score 108 · Answer 2 · edited Jul 25 '16 at 22:09

108

You can use .equals for columns or entire dataframes.

df['col1'].equals(df['col2'])

If they're equal, that statement will return True, else False.

edited Jul 25 '16 at 22:09

Alicia Garcia-Raboso

11,843
1
40
45

answered Jul 25 '16 at 21:35

ccook5760

1,140
1
7
6

42

Note: this only compares the whole column to another one. This does not compare the columsn element wise – guerda Oct 18 '18 at 12:46
2

How about if you want to see if one column always has value "greater than" or "lesser than" the other columns? – rrlamichhane Apr 21 '20 at 16:44

score 32 · Answer 3 · edited Aug 24 '20 at 09:48

32

You could use apply() and do something like this

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

or if you prefer not to use a lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)

edited Aug 24 '20 at 09:48

divykj

556
6
12

answered Dec 14 '14 at 22:46

Bob Haffner

7,393
1
33
39

3

I suspect this is probably a bit slower than the other approaches posted, since it doesn't take advantage of the vectorized operations that pandas allows. – Marius Dec 14 '14 at 22:51
@BobHaffner: lambda are not readable when using complex if/then/else statements. – Merlin Dec 14 '14 at 23:31
@Merlin you could add an elseif and I would agree with you on lambdas and multiple conditions – Bob Haffner Dec 14 '14 at 23:45
is there a way to generalize the non lambda function such that you can pass dataframe columns in, and not change the name? – AZhao Jul 09 '15 at 13:18
@AZhao you could generalize with iloc like this df['que'] = df.apply(lambda x : x.iloc[0] if x.iloc[0] >= x.iloc[1] and x.iloc[0] <= x.iloc[2] else "", axis=1) Is that what you mean? Obviously. the order of your columns matter – Bob Haffner Jul 09 '15 at 14:22
This method is the only one that worked for me! https://stackoverflow.com/questions/54476753/compare-two-columns-using-pandas-2 – R. Cox Feb 01 '19 at 09:42

Alex Riley · Answer 4 · 2014-12-14T22:54:11.100

One way is to use a Boolean series to index the column df['one']. This gives you a new column where the True entries have the same value as the same row as df['one'] and the False values are NaN.

The Boolean series is just given by your if statement (although it is necessary to use & instead of and):

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
    one two three   que
0   10  1.2 4.2      10
1   15  70  0.03    NaN
2   8   5   0       NaN

If you want the NaN values to be replaced by other values, you can use the fillna method on the new column que. I've used 0 instead of the empty string here:

>>> df['que'] = df['que'].fillna(0)
>>> df
    one two three   que
0   10  1.2   4.2    10
1   15   70  0.03     0
2    8    5     0     0

score 9 · Answer 5 · answered Dec 14 '14 at 22:47

Wrap each individual condition in parentheses, and then use the & operator to combine the conditions:

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

You can fill the non-matching rows by just using ~ (the "not" operator) to invert the match:

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

You need to use & and ~ rather than and and not because the & and ~ operators work element-by-element.

The final result:

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0

score 4 · Answer 6 · answered Aug 13 '21 at 03:56

I'd like to add this answer for those who are trying to compare the equality of values in two columns that have NaN values, and get False when both values are NaN. By definition, NaN != NaN (See: numpy.isnan(value) not the same as value == numpy.nan?).

If you want the two NaN comparison to return True, you can use:

df['compare'] = (df["col_1"] == df["col_2"]) | (df["col_1"].isna() & df["col_2"].isna())

score 3 · Answer 7 · edited Aug 16 '19 at 10:09

3

Use np.select if you have multiple conditions to be checked from the dataframe and output a specific choice in a different column

conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]

df["new column"]=np.select=(condtion,choice,default=)

Note: No of conditions and no of choices should match, repeat text in choice if for two different conditions you have same choices

edited Aug 16 '19 at 10:09

Dharman

26,923
21
73
125

answered Aug 16 '19 at 09:41

psn1997

124
9

score 3 · Answer 8 · answered Dec 03 '20 at 16:17

3

Use lambda expression:

df[df.apply(lambda x: x['col1'] != x['col2'], axis = 1)]

answered Dec 03 '20 at 16:17

aze45sq6d

781
3
7
24

score 2 · Answer 9 · answered Nov 07 '21 at 20:34

2

You can use the Series method where:

df['que'] = df['one'].where((df['one'] >= df['two']) & (df['one'] <= df['three']))

Result:

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

answered Nov 07 '21 at 20:34

Mykola Zotko

12,250
2
39
53

score 0 · Answer 10 · answered Oct 12 '18 at 19:28

0

I think the closest to the OP's intuition is an inline if statement:

df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three']))

answered Oct 12 '18 at 19:28

Nic Scozzaro

5,117
1
31
42

1

Your code gives me error `df['que'] = (df['one'] if ((df['one'] >= df['two']) and (df['one'] <= df['three'])) ^ SyntaxError: unexpected EOF while parsing` – vasili111 Feb 06 '20 at 14:41

Compare two columns using pandas

10 Answers10

Linked

Related