How to create a dictionary of two pandas DataFrame columns

Question

What is the most efficient way to organise the following pandas Dataframe:

data =

Position    Letter
1           a
2           b
3           c
4           d
5           e

into a dictionary like alphabet[1 : 'a', 2 : 'b', 3 : 'c', 4 : 'd', 5 : 'e']?

score 300 · Accepted Answer · edited May 09 '19 at 16:29

300

In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparion (using Wouter's method)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop

edited May 09 '19 at 16:29

quantif

166
3
12

answered Jul 02 '13 at 13:08

Jeff

117,982
20
211
179

Do I understand this correctly, that your df is the same as my data (the first two commands being to just enter the data as I have it)? If not, why do you enter in the data values as a string, manually? – user1083734 Jul 02 '13 at 13:48
yes they are the same, I just copy and pasted your data (that step is only necessary for reproducibility) – Jeff Jul 02 '13 at 13:52
69

Without creating a Series first ... dict(zip(df.Position, df.Letter)) – Wouter Overmeire Jul 02 '13 at 14:05
@WouterOvermeire they are suprising close in speed, thought yours would be much faster – Jeff Jul 02 '13 at 14:13
1

FYI.....my method is very close under to the hood as to what Wouter is doing, difference is its implemented using ``izip``, rather than ``zip``; generator makes the difference I guess – Jeff Jul 02 '13 at 14:17
Thanks @Jeff, this answers my question but doesn't work in my actual application. I get KeyError: 0L – user1083734 Jul 02 '13 at 14:32
1

@WouterOvermeire this works in my application perfectly, thank you for your contribution – user1083734 Jul 02 '13 at 14:35
4

@Jeff dict(zip...) fastest one – Wouter Overmeire Jul 02 '13 at 17:44
4

On a DataFrame with shape=(100,2), Wouter's method with dict(zip...) was 3x faster than Jeff's - I used %timeit – Quetzalcoatl Jun 30 '15 at 15:11
@Jeff Is there a way to have another Series added rather than just the Letter one? – kgui May 31 '18 at 18:40

Kirell · Answer 2 · 2016-09-01T15:56:11.733

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)

score 9 · Answer 3 · answered Aug 24 '18 at 08:28

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs' proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

score 7 · Answer 4 · answered Mar 27 '22 at 15:08

7

dict (zip(data['position'], data['letter']))

this will give you:

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

answered Mar 27 '22 at 15:08

Ayeni Lawrence

321
2
3

alvas · Answer 5 · 2018-10-22T00:28:12.193

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of list of 2 elements into a dict.

dict(sorted(df.values.tolist()))

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

I like the Wouter method, however the behaviour with duplicate values might not be what is expected and this scenario is not discussed one way or the other by the OP unfortunately. Wouter, will always choose the last value for each key encountered. So in other words, it will keep overwriting the value for each key.

The expected behavior in my mind would be more like Create a dict using two columns from dataframe with duplicates in one column where a list is kept for each key.

So for the case of keeping duplicates, let me submit df.groupby('Position')['Letter'].apply(list).to_dict() (Or perhaps even a set instead of a list)

How to create a dictionary of two pandas DataFrame columns

7 Answers7

TL;DR

In Long

Related

Linked

Related