
How can I convert a list of dictionaries into a DataFrame? Given:

[{'points': 50, 'time': '5:00', 'year': 2010},
 {'points': 25, 'time': '6:00', 'month': 'february'},
 {'points': 90, 'time': '9:00', 'month': 'january'},
 {'points_h1': 20, 'month': 'june'}]

I want to turn the above into a DataFrame:

      month  points  points_h1  time  year
0       NaN      50        NaN  5:00  2010
1  february      25        NaN  6:00   NaN
2   january      90        NaN  9:00   NaN
3      june     NaN         20   NaN   NaN

Note: Order of the columns does not matter.

– appleLover

6 Answers


If ds is a list of dicts:

import pandas as pd

df = pd.DataFrame(ds)

Note: this does not work with nested data.
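
For instance (a minimal sketch), a nested value survives as a single dict object in one cell rather than being flattened:

import pandas as pd

ds = [{'a': 1, 'b': {'c': 2}}]
df = pd.DataFrame(ds)
# df['b'] holds the whole dict {'c': 2} in a single cell.
# pd.json_normalize flattens it into a 'b.c' column instead:
flat = pd.json_normalize(ds)  # columns: 'a', 'b.c'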

– joris

How do I convert a list of dictionaries to a pandas DataFrame?

The other answers are correct, but little has been said about the advantages and limitations of each method. The aim of this post is to show these methods under different situations, discuss when to use (and when not to use) each one, and suggest alternatives.


DataFrame(), DataFrame.from_records(), and .from_dict()

Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.

Consider a very contrived example.

import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')

print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
 {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

This list consists of "records" with every key present. This is the simplest case you could encounter.

# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

A Word on Dictionary Orientations: orient='index'/'columns'

Before continuing, it is important to make the distinction between the different types of dictionary orientations and how pandas supports them. There are two primary types: "columns" and "index".

orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.

For example, data above is in the "columns" orient.

data_c = [
 {'A': 5, 'B': 0, 'C': 3, 'D': 3},
 {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 {'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.

orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.

data_i ={
 0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
 1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

This case is not considered in the OP, but is still useful to know.
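
If you are working with one of the constructors that only supports the "columns" orientation, the footnote trick from the summary below applies; a sketch:

# Same result as from_dict(data_i, orient='index'): the outer keys become
# columns first, then transposing moves them onto the index.
pd.DataFrame.from_dict(data_i, orient='columns').T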

Setting Custom Index

If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.

pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])

   A  B  C  D
a  5  0  3  3
b  7  9  3  5
c  2  4  7  6

This is not supported by pd.DataFrame.from_dict.
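
A workaround sketch: build the DataFrame first, then assign the index afterwards.

df = pd.DataFrame.from_dict(data)
df.index = ['a', 'b', 'c']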

Dealing with Missing Keys/Columns

All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,

data2 = [
     {'A': 5, 'C': 3, 'D': 3},
     {'A': 7, 'B': 9, 'F': 5},
     {'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)

     A    B    C    D    E    F
0  5.0  NaN  3.0  3.0  NaN  NaN
1  7.0  9.0  NaN  NaN  NaN  5.0
2  NaN  4.0  7.0  NaN  6.0  NaN

Reading Subset of Columns

"What if I don't want to read in every single column"? You can easily specify this using the columns=... parameter.

For example, from the example dictionary of data2 above, if you wanted to read only columns "A', 'D', and 'F', you can do so by passing a list:

pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])

     A    D    F
0  5.0  3.0  NaN
1  7.0  NaN  5.0
2  NaN  NaN  NaN

This is not supported by pd.DataFrame.from_dict with the default orient "columns".

pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'

Reading Subset of Rows

Not supported by any of these methods directly. You will have to iterate over your data and delete unwanted records in place, iterating in reverse so that deletions do not shift the indices you still need to check. For example, to extract only the 0th and 2nd rows from data2 above, you can use:

rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]

pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0
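
An alternative sketch that avoids mutating data2: filter the records into a new list with a comprehension, then construct from that.

rows_to_select = {0, 2}
selected = [d for i, d in enumerate(data2) if i in rows_to_select]
pd.DataFrame(selected)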

The Panacea: json_normalize for Nested Data

A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.

pd.json_normalize(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
pd.json_normalize(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
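
If you have a single dictionary, wrapping it in a list yields the records form; a minimal sketch:

pd.json_normalize([{'A': 1, 'B': 2}])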

As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.

data_nested = [
  {'counties': [{'name': 'Dade', 'population': 12345},
                {'name': 'Broward', 'population': 40000},
                {'name': 'Palm Beach', 'population': 60000}],
   'info': {'governor': 'Rick Scott'},
   'shortname': 'FL',
   'state': 'Florida'},
  {'counties': [{'name': 'Summit', 'population': 1234},
                {'name': 'Cuyahoga', 'population': 1337}],
   'info': {'governor': 'John Kasich'},
   'shortname': 'OH',
   'state': 'Ohio'}
]
pd.json_normalize(data_nested,
                  record_path='counties',
                  meta=['state', 'shortname', ['info', 'governor']])

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

For more information on the meta and record_path arguments, check out the documentation.


Summarising

Here's a table of all the methods discussed above, along with supported features/functionality.

Method                     Orientation accepted    index=  columns=                   Nested data
pd.DataFrame               "columns" only*         yes     yes                        no
pd.DataFrame.from_records  "columns" only*         yes     yes                        no
pd.DataFrame.from_dict     "columns" or "index"    no      not with orient='columns'  no
pd.json_normalize          records                 no      no                         yes

* Use orient='columns' and then transpose to get the same effect as orient='index'.

– cs95

In pandas 0.16.2, I had to use pd.DataFrame.from_records(d) to get this to work.

– szeitlin

You can also use pd.DataFrame.from_dict(d), as follows:

In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010}, 
   ...: {'points': 25, 'time': '6:00', 'month': "february"}, 
   ...: {'points':90, 'time': '9:00', 'month': 'january'}, 
   ...: {'points_h1':20, 'month': 'june'}]

In [12]: pd.DataFrame.from_dict(d)
Out[12]: 
      month  points  points_h1  time    year
0       NaN    50.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN
– shivsn

Python 3: Most of the solutions listed previously work. However, there are instances when the row numbering of the DataFrame is not required and each row (record) has to be written out individually.

The following method is useful in that case.

import csv

myfile = r'C:\Users\John\Desktop\export_dataframe.csv'  # raw string so the backslashes stay literal

records_to_save = data2  # used as in the thread

# Collect the union of all keys, preserving first-seen order, so that
# records with differing keys still get every column.
colnames = []
for d in records_to_save:
    for k in d:
        if k not in colnames:
            colnames.append(k)

# All values are written corresponding to the keys, and "None" is
# written in case of a missing value.
with open(myfile, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(colnames)
    for d in records_to_save:
        writer.writerow([d.get(key, "None") for key in colnames])
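
A shorter route to the same result, sketched here, is csv.DictWriter, whose restval parameter fills in missing keys:

import csv

with open(myfile, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=colnames, restval="None")
    writer.writeheader()
    writer.writerows(records_to_save)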
– Soum

The easiest way I have found to do it is like this:

dict_count = len(dict_list)
df = pd.DataFrame(dict_list[0], index=[0])
for i in range(1, dict_count):  # range(1, dict_count - 1) would skip the last dict
    df = df.append(dict_list[i], ignore_index=True)
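
Note that DataFrame.append is deprecated (and removed in pandas 2.0), so on recent versions the same idea can be sketched with pd.concat:

import pandas as pd

# Build one single-row frame per dict, then concatenate once.
df = pd.concat([pd.DataFrame([d]) for d in dict_list], ignore_index=True)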
– scottapotamus
  • Avoid looping when using pandas; looping defeats the whole purpose of pandas. – sushanth Aug 31 '20 at 11:29
  • I didn't downvote, but while this will technically work, its performance is quite poor. See [this](https://stackoverflow.com/questions/31674557/how-to-append-rows-in-a-pandas-dataframe-in-a-for-loop) for more information. – EJoshuaS - Stand with Ukraine Jun 27 '21 at 04:50