Pandas split dataframe column for every character

Question

I have multiple dataframe columns which look like this:

                         Day1
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

What I want is for every character to be separated into its own column:

     012345678910111213....
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

So the "Day1" column is split into 48 columns, and every column holds one of the values A/B/C/D.

I tried split, but that didn't work.

Post raw data and the code that loads it into a df, so that we can try to replicate your issue if our answers don't work – EdChum, May 08 '17 at 13:32
It looks like you have trailing spaces, try `dataframe['Mo'] = dataframe['Mo'].str.rstrip()` to remove any trailing spaces – EdChum, May 08 '17 at 13:34
See my first comment: without data to reproduce this, this becomes a fishing expedition – EdChum, May 08 '17 at 13:40
OK, I found the problem: I had trailing spaces. Thanks!!! @EdChum – Warry S., May 08 '17 at 14:15

score 26 · Accepted Answer · edited Feb 03 '22 at 08:18

You can call apply and, for each row, call pd.Series on the list of its characters:

In [16]: df['Day1'].apply(lambda x: pd.Series(list(x)))
Out[16]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]

It looks like you have trailing spaces, which would each end up in a column of their own. Remove them first using str.rstrip:

df['Day1'] = df['Day1'].str.rstrip()

Then apply the split shown above.
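A minimal, self-contained sketch of the two steps in this answer (strip, then expand). The sample frame and its padded values are made up for illustration; only str.rstrip, apply, and pd.Series come from the answer itself.

```python
import pandas as pd

# Hypothetical sample data: short strings with trailing spaces,
# mimicking the problem described in the question.
df = pd.DataFrame({"Day1": ["DDBA  ", "DBBA  ", "AABD  "]})

# Without this step, each trailing space would become an extra column.
df["Day1"] = df["Day1"].str.rstrip()

# list(x) splits the string into characters, pd.Series turns that list
# into a row, and apply stacks the rows into a DataFrame.
chars = df["Day1"].apply(lambda x: pd.Series(list(x)))

print(chars.shape)  # (3, 4)
```

If the strings had different lengths, the shorter rows would be padded with NaN on the right.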

score 6 · Answer 2 · MaxU - stop genocide of UA · edited May 08 '17 at 13:39

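Use the Series.str.extractall() method (re.U requires `import re`; recent pandas versions need axis passed by keyword in rename_axis):

In [19]: df.Day1.str.extractall('(.)', flags=re.U)[0].unstack().rename_axis(None, axis=1)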
Out[19]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]
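For readers who want to see what each link in that chain does, here is a small sketch on made-up data. The sample frame and the intermediate variable names are assumptions for illustration; the method calls are the ones used in the answer.

```python
import re
import pandas as pd

# Hypothetical sample data, shorter than the 48-character strings above.
df = pd.DataFrame({"Day1": ["DDBA", "DBBA", "AABD"]})

# extractall returns one row per regex match, indexed by
# (original row label, match number); column 0 holds the captured character.
matches = df["Day1"].str.extractall("(.)", flags=re.U)[0]

# unstack pivots the match number into columns; rename_axis then drops
# the leftover "match" label from the column axis.
chars = matches.unstack().rename_axis(None, axis=1)

print(chars.shape)  # (3, 4)
```

Note that "." does not match a newline by default, so strings containing line breaks would lose those characters with this pattern.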

edited May 08 '17 at 13:39

answered May 08 '17 at 13:16

MaxU - stop genocide of UA

Hey, I tried your way... I edited my question; something is wrong – Warry S. May 08 '17 at 13:28
@WarryS., do you have leading or trailing spaces? What is the output of `df.Day1.str.len()`? – MaxU - stop genocide of UA May 08 '17 at 13:30
It's 48 for every entry @MaxU, and there are no spaces between the values – Warry S. May 08 '17 at 13:32
@WarryS., I can't reproduce this behavior using the provided sample data set :( – MaxU - stop genocide of UA May 08 '17 at 13:34
Thanks for your help @MaxU – Warry S. May 08 '17 at 14:21

score 2 · Answer 3 · answered May 27 '20 at 16:15

Try this:

df['Day1'].str.split(pat=r"\s*", expand=True)

The result has empty first and last columns, so trim it with .iloc[:, 1:-1] on the split result, i.e. df['Day1'].str.split(pat=r"\s*", expand=True).iloc[:, 1:-1]
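A short sketch of this approach on made-up data, showing where the empty edge columns come from. It assumes Python 3.7 or later (where re.split also splits on empty matches) and forces dtype=object so the split goes through Python's re module; behavior with other string backends was not checked here.

```python
import pandas as pd

# Hypothetical sample data; dtype=object is an assumption made for this sketch.
df = pd.DataFrame({"Day1": pd.Series(["DDBA", "DBBA", "AABD"], dtype=object)})

# r"\s*" also matches the empty string between characters, so the split
# produces an empty string before the first and after the last character.
parts = df["Day1"].str.split(pat=r"\s*", expand=True)

# Drop the two empty edge columns, then renumber the columns from 0.
trimmed = parts.iloc[:, 1:-1]
chars = trimmed.set_axis(range(trimmed.shape[1]), axis=1)

print(parts.shape, chars.shape)  # (3, 6) (3, 4)
```

Because the pattern is \s*, any whitespace inside a string is consumed as a separator rather than kept as a character.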

arjepak

score 0 · Answer 4 · answered May 04 '22 at 22:55

Following on from the answer from @ric-s: splitting each string with list in a plain list comprehension and passing the result to pd.DataFrame is somewhat faster than calling pd.Series inside apply:

In [1]: %timeit df['Day1'].apply(lambda x: pd.Series(list(x)))
1.08 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit pd.DataFrame([list(x) for x in df['Day1']])
718 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

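Also, the following construction creates meaningful column names for the extracted characters. It assumes every string has the same length as the first one, and it passes index=df.index so the new columns line up even when df does not use the default 0..n-1 index:

df[[f'Day1_{i}' for i in range(len(df['Day1'].iloc[0]))]] = pd.DataFrame([list(x) for x in df['Day1']], index=df.index)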
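The same idea can be written in a few separate steps, which may be easier to read. This is only a sketch on made-up data: the sample frame, its index, and the names n_chars, names, chars, and out are illustrative, and it assumes all strings share the length of the first one.

```python
import pandas as pd

# Hypothetical sample data with a non-default index, to show that the
# new columns stay aligned with the original rows.
df = pd.DataFrame({"Day1": ["DDBA", "DBBA", "AABD"]}, index=[10, 20, 30])

n_chars = len(df["Day1"].iloc[0])  # assumes every string has this length
names = [f"Day1_{i}" for i in range(n_chars)]

# Build the character columns in plain Python, reusing df's index.
chars = pd.DataFrame([list(x) for x in df["Day1"]], index=df.index, columns=names)

# join returns a new frame: the original column plus one column per character.
out = df.join(chars)

print(list(out.columns))  # ['Day1', 'Day1_0', 'Day1_1', 'Day1_2', 'Day1_3']
```

Using join rather than assigning in place leaves the original df untouched, which can be convenient when several "Day" columns are expanded one after another.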
