221

What is the most efficient way to organise the following pandas Dataframe:

data =

Position    Letter
1           a
2           b
3           c
4           d
5           e

into a dictionary like alphabet[1 : 'a', 2 : 'b', 3 : 'c', 4 : 'd', 5 : 'e']?

Trenton McKinney
  • 43,885
  • 25
  • 111
  • 113
user1083734
  • 4,193
  • 7
  • 22
  • 24

7 Answers7

300
In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparion (using Wouter's method)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop
quantif
  • 166
  • 3
  • 12
Jeff
  • 117,982
  • 20
  • 211
  • 179
  • Do I understand this correctly, that your df is the same as my data (the first two commands being to just enter the data as I have it)? If not, why do you enter in the data values as a string, manually? – user1083734 Jul 02 '13 at 13:48
  • yes they are the same, I just copy and pasted your data (that step is only necessary for reproducibility) – Jeff Jul 02 '13 at 13:52
  • 69
    Without creating a Series first ... dict(zip(df.Position, df.Letter)) – Wouter Overmeire Jul 02 '13 at 14:05
  • @WouterOvermeire they are suprising close in speed, thought yours would be much faster – Jeff Jul 02 '13 at 14:13
  • 1
    FYI.....my method is very close under to the hood as to what Wouter is doing, difference is its implemented using ``izip``, rather than ``zip``; generator makes the difference I guess – Jeff Jul 02 '13 at 14:17
  • Thanks @Jeff, this answers my question but doesn't work in my actual application. I get KeyError: 0L – user1083734 Jul 02 '13 at 14:32
  • 1
    @WouterOvermeire this works in my application perfectly, thank you for your contribution – user1083734 Jul 02 '13 at 14:35
  • 4
    @Jeff dict(zip...) fastest one – Wouter Overmeire Jul 02 '13 at 17:44
  • 4
    On a DataFrame with shape=(100,2), Wouter's method with dict(zip...) was 3x faster than Jeff's - I used %timeit – Quetzalcoatl Jun 30 '15 at 15:11
  • @Jeff Is there a way to have another Series added rather than just the Letter one? – kgui May 31 '18 at 18:40
128

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)
Kirell
  • 8,460
  • 4
  • 43
  • 58
9

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs' proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs
pakobill
  • 316
  • 3
  • 8
7
dict (zip(data['position'], data['letter']))

this will give you:

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
Ayeni Lawrence
  • 321
  • 2
  • 3
6

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of list of 2 elements into a dict.

dict(sorted(df.values.tolist())) 

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Related

Answering @sbradbio comment:

If there are multiple values for a specific key and you would like to keep all of them, it's the not the most efficient but the most intuitive way is:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[out]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})
alvas
  • 105,505
  • 99
  • 405
  • 683
1

Here are two other ways tested with the following df.

df = pd.DataFrame(np.random.randint(0,10,10000).reshape(5000,2),columns=list('AB'))

using to_records()

dict(df.to_records(index=False))

using MultiIndex.from_frame()

dict(pd.MultiIndex.from_frame(df))

Time of each.

24.6 ms ± 847 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.86 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
rhug123
  • 4,193
  • 1
  • 3
  • 20
1

I like the Wouter method, however the behaviour with duplicate values might not be what is expected and this scenario is not discussed one way or the other by the OP unfortunately. Wouter, will always choose the last value for each key encountered. So in other words, it will keep overwriting the value for each key.

The expected behavior in my mind would be more like Create a dict using two columns from dataframe with duplicates in one column where a list is kept for each key.

So for the case of keeping duplicates, let me submit df.groupby('Position')['Letter'].apply(list).to_dict() (Or perhaps even a set instead of a list)