225

I'm new to pandas and trying to figure out how to add multiple columns to pandas simultaneously. Any help here is appreciated. Ideally I would like to do this in one step rather than multiple repeated steps...

import pandas as pd

df = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)

df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3]  #thought this would work here...
smci
  • 29,564
  • 18
  • 109
  • 144
runningbirds
  • 5,411
  • 11
  • 48
  • 84
  • 2
    You need to state what error you got. When I try this on pandas 1.0 I get `KeyError: "None of [Index(['column_new_1', 'column_new_2', 'column_new_3'], dtype='object')] are in the [columns]"` – smci Apr 19 '20 at 07:24

11 Answers11

334

I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).

Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.

Here are several approaches that will work:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col_1': [0, 1, 2, 3],
    'col_2': [4, 5, 6, 7]
})

Then one of the following:

1) Three assignments in one, using list unpacking:

df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]

2) DataFrame conveniently expands a single row to match the index, so you can do this:

df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)

3) Make a temporary data frame with new columns, then combine with the original data frame later:

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [[np.nan, 'dogs', 3]], 
            index=df.index, 
            columns=['column_new_1', 'column_new_2', 'column_new_3']
        )
    ], axis=1
)

4) Similar to the previous, but using join instead of concat (may be less efficient):

df = df.join(pd.DataFrame(
    [[np.nan, 'dogs', 3]], 
    index=df.index, 
    columns=['column_new_1', 'column_new_2', 'column_new_3']
))

5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):

df = df.join(pd.DataFrame(
    {
        'column_new_1': np.nan,
        'column_new_2': 'dogs',
        'column_new_3': 3
    }, index=df.index
))

6) Use .assign() with multiple column arguments.

I like this variant on @zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:

df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)

7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:

new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols)   # add empty cols
df[new_cols] = new_vals  # multi-column assignment works for existing cols

8) In the end it's hard to beat three separate assignments:

df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3

Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame

smci
  • 29,564
  • 18
  • 109
  • 144
Matthias Fripp
  • 16,256
  • 5
  • 25
  • 40
  • Wouldn't approach #7 ([`.reindex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html)) alter the dataframe's index? Why would someone want to needlessly alter the index when adding columns unless it's an explicit goal... – Asclepius Feb 18 '20 at 20:12
  • 1
    `.reindex()` is used with the `columns` argument, so it only changes the column "index" (names). It doesn't alter the row index. – Matthias Fripp Feb 18 '20 at 23:19
  • for some of the approaches, you can use `OrderedDict`: for instance, `df.join(pd.DataFrame( OrderedDict([('column_new_2', 'dogs'),('column_new_1', np.nan),('column_new_3', 3)]), index=df.index ))` – hashmuke Mar 11 '20 at 12:12
  • @hashmuke That makes sense for early versions of Python. It may appeal especially to people using dictionaries for multiple things in Pandas, e.g., `df = pd.DataFrame({'before': [1, 2, 3], 'after': [4, 5, 6]})` vs. `df = pd.DataFrame(OrderedDict([('before', [1, 2, 3]), ('after', [4, 5, 6])])` – Matthias Fripp Mar 13 '20 at 18:32
  • `.assign()` (approach #6) is now the best, clearest, simplest with Python 3.7. I confirm that the comment *"the new columns will always be sorted alphabetically, at least with early versions of Python:"* is obsolete with 3.7. (Avoid saying "always"; say something version-related like "as of 3.5/3.6"; that sort of comment ages quickly). This makes approaches #1,2,3,4,5,7 obsolete. I can't see any positive reason you would ever want to do 3,4,5 or 7. You might do 1 or 8 on code you're still debugging, to compartmentalize changes. 2 seems unnecessary version of 1. – smci Apr 19 '20 at 07:42
  • Please move 6 up to the top of the list, followed by 1 and 8. The rest are only for historical curiosity. – smci Apr 19 '20 at 07:46
  • You also want to add a note per [Add multiple empty columns to pandas DataFrame](https://stackoverflow.com/questions/30926670/add-multiple-empty-columns-to-pandas-dataframe) that approaches #1,2 won't work if the values are `None`, but #6,8 will. – smci Apr 19 '20 at 11:15
  • 2
    In case you are using the option with `join`, make sure that you don't have duplicates in your index (or use a `reset_index` first). Might save you a few hours debugging. – Guido Apr 30 '20 at 13:53
  • @smci I used option 2 using a nested list of arrays for e.g. 10 new cols and 100 rows, without needing the index definition, and I do not know why this shoud be outdated or worse than 6, 1 and 8, I would rather put 2 to the top. Because there you have the lowest writing effort (no need to repeat the 'df') and the clearest structure. – questionto42standswithUkraine May 28 '20 at 17:28
  • @Lorenz: #6 (`df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)`) is a lot more human-readable than #2, #2 requires the reader to figure out which of the column-names on the left get assigned to each of the values on right, esp. for long lines. #6 is better because it directly tells you: `.assign(new_col = new_value, ...)` ; (and by the way `.assign()` syntax allows more complex expressions on the df). You could even automatically read it with a grep statement. And #1 is simply a more inefficient version of #2, multiple separate columns on the LHS instead of one slice. – smci May 29 '20 at 07:35
  • @smci that is of course correct in the very simple example case of the question, and still: you say it as well, #1 is not as efficient as #2, think of 10 columns instead of 3, then you do not want to repeat the df[] in #1 or the assignments in #6 for every column. You are right that #6 is more readable, but if you already have an object of many rows and columns to add, and you typically know the order of the columns, then you do not need this readability and prefer #2 in practice. – questionto42standswithUkraine May 29 '20 at 11:05
  • @Lorenz: the question title asks the very general *"add multiple columns to df"*, not just the incredibly narrow case where they are each constants. **`.assign()` allows you to use arbitrarily complex expressions of multiple columns**, not just constants, or functions on a single Series like `.apply()`. Example: *"median of col1, masked where col2 > 0, then multiply that elementwise by (col3+col4)"* – smci May 31 '20 at 02:13
  • 2
    @smci `.assign()` is certainly more flexible, but if you have simply a couple of columns to add, these must simply be put into their right order in a nested list of arrays or as a df like in #2, and then assigned. I just do not see why I should split up every column assignment with `.assign()` then. #2 is what I do in practice everytime I assign columns: `df[['column_new_1', 'column_new_2', 'column_new_3']] = [[np.array(),np.array(),np.array()]]`, and the `pd.DataFrame()` of #2 is not even needed. No repeated df[] (#1) nor individual column assignments (#6) needed. – questionto42standswithUkraine Jun 02 '20 at 10:08
  • 2
    Some performance metrics would really make this post golden. – np8 Jun 22 '20 at 19:29
  • @MatthiasFripp - How to add a **checkbox** as one of the column headers using dataframe? – Love Putin Jun 25 '20 at 07:39
  • @ user12379095 That is a different question from this. A checkbox is a GUI element and pandas is non-GUI code. In some places you can see pandas tables in a GUI environment (e.g., IPython/Jupyter displays them as HTML tables). But you would need to think more about what you want the display to look like, what you want to happen when people check the box, and what environment you expect to run in. Then go as far as you can and ask a new question on stackoverflow if you get stuck. – Matthias Fripp Jun 25 '20 at 20:39
  • I like #7 only because this is how my brain WANTS to do it, even though it is the most complicated on paper. – Nesha25 Oct 28 '21 at 00:24
52

You could use assign with a dict of column names and values.

In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
   col_1  col_2 col2_new_2  col3_new_3  col_new_1
0      0      4       dogs           3        NaN
1      1      5       dogs           3        NaN
2      2      6       dogs           3        NaN
3      3      7       dogs           3        NaN
Zero
  • 66,763
  • 15
  • 141
  • 151
  • Is there a way of doing the same that maintains a specific ordering of the columns? – user48956 May 31 '18 at 00:58
  • 2
    You can maintain a specific ordering with earlier versions of Python by calling assign multiple times: `df.assign(**{'col_new_1': np.nan}).assign(**{'col2_new_2': 'dogs'}).assign(**{'col3_new_3': 3})` – skasch Apr 02 '20 at 18:35
  • If column names only contain strings that are legal variable names: `df.assign(col_new_1=np.nan, col2_new_2='dogs', col3_new_3=3)`. This maintains order. – Tobias Bergkvist Aug 17 '20 at 08:56
  • What if the values `np.nan`, `dogs` and `3` were obtained out of single operation? Using this approach would require doing that operation thrice. Is there a way I can use assign and do the operation just once? @Zero – paradocslover May 25 '22 at 03:31
15

With the use of concat:

In [128]: df
Out[128]: 
   col_1  col_2
0      0      4
1      1      5
2      2      6
3      3      7

In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]: 
   col_1  col_2 column_new_1 column_new_2 column_new_3
0    0.0    4.0          NaN          NaN          NaN
1    1.0    5.0          NaN          NaN          NaN
2    2.0    6.0          NaN          NaN          NaN
3    3.0    7.0          NaN          NaN          NaN

Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?

In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]

In [144]: df1
Out[144]: 
   col_1  col_2  column_new_1 column_new_2  column_new_3
0    0.0    4.0           NaN         dogs             3
1    1.0    5.0           NaN         dogs             3
2    2.0    6.0           NaN         dogs             3
3    3.0    7.0           NaN         dogs             3
Nehal J Wani
  • 15,157
  • 3
  • 59
  • 87
3

use of list comprehension, pd.DataFrame and pd.concat

pd.concat(
    [
        df,
        pd.DataFrame(
            [[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
            df.index, ['column_new_1', 'column_new_2','column_new_3']
        )
    ], axis=1)

enter image description here

piRSquared
  • 265,629
  • 48
  • 427
  • 571
3

if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:

    new_cols = ["a", "b", "c" ] 
    df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)

It's based on the second variant of the accepted answer.

A. Rabus
  • 429
  • 4
  • 9
1

Just want to point out that option2 in @Matthias Fripp's answer

(2) I wouldn't necessarily expect DataFrame to work this way, but it does

df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)

is already documented in pandas' own documentation http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics

You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner. You may find this useful for applying a transform (in-place) to a subset of the columns.

Community
  • 1
  • 1
  • I think this is pretty standard for multi-column assignment. The part that surprised me was that `pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)` replicates the one row it is given to create a whole dataframe the same length as the index. – Matthias Fripp Nov 10 '17 at 06:57
1

Dictionary mapping with .assign():

This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.

import pandas as pd
import numpy as np

new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)

If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.

import pandas as pd
import numpy as np

new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
bradylange
  • 182
  • 1
  • 5
0

If you just want to add empty new columns, reindex will do the job

df
   col_1  col_2
0      0      4
1      1      5
2      2      6
3      3      7

df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
   col_1  col_2  column_new_1  column_new_2  column_new_3
0      0      4           NaN           NaN           NaN
1      1      5           NaN           NaN           NaN
2      2      6           NaN           NaN           NaN
3      3      7           NaN           NaN           NaN

full code example

import numpy as np
import pandas as pd

df = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')

otherwise go for zeros answer with assign

Markus Dutschke
  • 7,177
  • 2
  • 46
  • 45
0

I am not comfortable using "Index" and so on...could come up as below

df.columns
Index(['A123', 'B123'], dtype='object')

df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])

df.rename(columns={
    'C':'C123',
    'D':'D123',
    'E':'E123'
},inplace=True)


df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
Nensi Kasundra
  • 1,315
  • 3
  • 16
  • 27
Alex
  • 1
  • 1
0

You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
  'col_1': [0, 1, 2, 3], 
  'col_2': [4, 5, 6, 7]
})
>>> df
   col_1  col_2
0      0      4
1      1      5
2      2      6
3      3      7
>>> cols = {
  'column_new_1':np.nan,
  'column_new_2':'dogs',
  'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
   col_1  col_2  column_new_1 column_new_2  column_new_3
0      0      4           NaN         dogs             3
1      1      5           NaN         dogs             3
2      2      6           NaN         dogs             3
3      3      7           NaN         dogs             3

Not necessarily better than the accepted answer, but it's another approach not yet listed.

spen.smith
  • 536
  • 2
  • 13
0
import pandas as pd
df = pd.DataFrame({
 'col_1': [0, 1, 2, 3], 
 'col_2': [4, 5, 6, 7]
 })
df['col_3'],  df['col_4'] =  [df.col_1]*2

>> df
col_1   col_2   col_3   col_4
0      4       0       0
1      5       1       1
2      6       2       2
3      7       3       3