2

I need to annotate data (a liste of names in a column) with the content of a column of a second dataframe (containing some atributes of these names). Like dataframe 1 :

id      name             col
----------------------------
 29834   Marie Peer
890384   Marie Peach
 30047   Susan Peer

with an other dataframe containing

name       city
---------------------
Marie      Chicago
Marie      Boston 
Harley     Milan 
Josh       London

Questions:

  • Shall I better use pandas or does it worth to create python dictionaries?
  • I previously used data = pd.merge(df1, df2, on='column_name', how='left') but how does it work if my second dataframe has multiple times the same row in the column (I would need them to go in the same row of a new column of the first dataframe).

    I would need that, if 'Marie' has multiple entries in the second dataframe, that annotate in a new column the multiple entries of 'Marie' from the second df output needed :

     29834 Marie Peer      Chicago, Boston
    890384 Marie Peach     Chicago, Boston
     30047 Susan Peer
    

    (and if there is no entry in the second dataframe, like for 'Susan' here, the field would stay blank)

  • In general is that the best way to annotate a dataset with data from an other dataset ?

Many thanks in advance (and apologies, i'm a beginner though that might sound very basic questions) !

marc_s
  • 704,970
  • 168
  • 1,303
  • 1,425
martin
  • 25
  • 3

1 Answers1

0

Use DataFrame.groupby with join for second DataFrame, so possible use DataFrame.join:

df = df1.join(df2.groupby('name')['city'].apply(', '.join), on='name')
print (df)
       id   name    col             city
0   29834  Marie   Peer  Chicago, Boston
1  890384  Marie  Peach  Chicago, Boston
2   30047  Susan   Peer              NaN

Or Series.map:

df1['city'] = df1['name'].map(df2.groupby('name')['city'].apply(', '.join))
print (df1)
       id   name    col             city
0   29834  Marie   Peer  Chicago, Boston
1  890384  Marie  Peach  Chicago, Boston
2   30047  Susan   Peer              NaN

Detail:

print (df2.groupby('name')['city'].apply(', '.join))
name
Harley              Milan
Josh               London
Marie     Chicago, Boston
Name: city, dtype: object
jezrael
  • 729,927
  • 78
  • 1,141
  • 1,090