1

Unpredictably formatted df:

  First Name  number last_name
0    Cthulhu     666     Smith
    
df = pd.DataFrame({'First Name': ['Cthulhu'], 'number': [666], 'last_name': ['Smith']})

This needs to be put into column names and order: TemplateColumns = ['First Name', 'other', 'number']. If columns don't exist they can be created:

for col in TemplateColumns:
    if col not in df:
        df[col] = np.nan

Which gives:

  First Name  number last_name  other
0    Cthulhu     666     Smith    NaN

And initial columns need to be ordered the same as TemplateColumns, leaving the remaining columns at the end, to get desired_df:

  First Name  other   number last_name
0    Cthulhu    NaN      666     Smith

desired_df = pd.DataFrame({'First Name': ['Cthulhu'], 'other': [np.nan], 'number': [666], 'last_name': ['Smith']})

Reordering columns is well explained in other posts, but I don't know how to order the first n columns and keep the rest at the end. How can I do this?

Chris Dixon
  • 647
  • 5
  • 14

3 Answers3

1

You can write your own function to achieve this. Essentially you can use .reindex() to reorder the dataframe while including empty columns if they don't exist. The only remaining part to figure out would be how to add the remaining columns not in TemplateColumns to your dataframe. You can do this by obtaining the set difference of the column index from the TemplateColumns then updating the order before your call to .reindex

Set up data & function

def reordered(df, new_order, include_remaining=True):
    cols_to_end = []
    if include_remaining:
        # gets the items in `df.columns` that are NOT in `new_order` 
        cols_to_end = df.columns.difference(new_order, sort=False)
    
    # Ensures that the new_order items are first
    final_order = new_order + list(cols_to_end)
    return df.reindex(columns=final_order)

df = pd.DataFrame({'First Name': ['Cthulhu'], 'number': [666], 'last_name': ['Smith']})
new_order = ['First Name', 'other', 'number']

with include_remaining:

out = reordered(df, new_order, include_remaining=True)

print(out)
  First Name  other  number last_name
0    Cthulhu    NaN     666     Smith

without include_remaining:

out = reordered(df, new_order, include_remaining=False)

print(out)
  First Name  other  number
0    Cthulhu    NaN     666
Cameron Riddell
  • 7,865
  • 5
  • 16
1

Try this

cols = TemplateColumns + df.columns.difference(TemplateColumns, sort=False).tolist()
df_final =  df.reindex(cols, axis=1)

Out[714]:
  First Name  other  number last_name
0    Cthulhu    NaN     666     Smith
Andy L.
  • 24,086
  • 3
  • 14
  • 26
0

Use insert like this:

for col in TemplateColumns:
    if col not in df:
        df.insert(1, col, np.nan)
U12-Forward
  • 65,118
  • 12
  • 70
  • 89
  • @ChrisDixon No problem, don't forget to accept and upvote tho – U12-Forward Oct 21 '20 at 01:18
  • Deleted 1st comment as realized your suggestion didn't fix the problem. The 2nd column needs to be 'other'. – Chris Dixon Oct 21 '20 at 01:24
  • @ChrisDixon change your `TemplateColumns = ['First Name', 'other', 'numbers']` to `TemplateColumns = ['First Name', 'other', 'number']` – U12-Forward Oct 21 '20 at 01:36
  • I've fixed the 'numbers' typo, and although that works for this simplified example, that only leaves the columns in the correct order by coincidence. The columns need to be reordered. – Chris Dixon Oct 21 '20 at 01:48