
Suppose we have the following DataFrame, in which one column holds a list of values per row:

    categories
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]

How can we get a table like this?

   "A"  "B"  "C"  "D"
0 - 1    1    0    0
1 - 0    1    1    1
2 - 0    1    0    1

Note: I don't necessarily need a new DataFrame; I'm wondering how to transform such DataFrames into a format more suitable for machine learning.

Denis L

1 Answer


If [0, 1, 2] are numerical labels stored in a column rather than the index, then pandas.DataFrame.pivot_table works:

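In []:
import pandas as pd

# Long format: one row per (number_label, category) pair.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])
# Count the rows for each (number_label, category) pair; missing pairs become 0.
data.pivot_table(index=['number_label'], columns=['category'], aggfunc=[len], fill_value=0)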
Out[]:
              len
category      A      B      C      D
number_label                       
0             1      1      0      0
1             0      1      1      1
2             0      1      0      1

This blog post was helpful.
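
For the same long-format data, pd.crosstab is another way to count each (label, category) pair. This is only a sketch of an alternative, not part of the original answer; the variable name `indicator` is made up for illustration.

```python
import pandas as pd

# Same long-format data as above: one row per (number_label, category) pair.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])

# crosstab counts how often each pair occurs; pairs that never occur get 0.
indicator = pd.crosstab(data['number_label'], data['category'])
print(indicator)
```

Unlike the pivot_table call, the result has a single level of column labels, which is usually easier to hand to a machine learning library.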


If [0, 1, 2] is the index, then collections.Counter is useful:

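In []:
import collections

data2 = pd.DataFrame.from_dict(
    {'categories': {0: ['A', 'B'], 1: ['B', 'C', 'D'], 2:['B', 'D']}})
# Turn each list into a Counter mapping category -> number of occurrences.
data3 = data2['categories'].apply(collections.Counter)
# One column per category; categories missing from a row become NaN, then 0.
pd.DataFrame.from_records(data3.tolist(), index=data3.index).fillna(value=0).astype(int)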
Out[]:
       A      B      C      D
0      1      1      0      0
1      0      1      1      1
2      0      1      0      1
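
If the pandas version in use provides Series.explode, the same table can be built without collections.Counter. The following is a minimal sketch under that assumption; `exploded` and `indicator` are illustrative names, not something from the answer above.

```python
import pandas as pd

data2 = pd.DataFrame({'categories': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# explode() yields one row per list element and repeats the original index value.
exploded = data2['categories'].explode()

# One indicator column per category, then add the rows back up per original index.
indicator = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
print(indicator)
```

Because the rows are summed, a category that appears twice in one list would be counted as 2, which matches what the Counter approach does.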
Zephyr
Samuel Harrold
  • Thanks, I'll check it out. Actually, the 0, 1, and 2 are the index. Also, do you have any idea how sparsity can be handled efficiently here, since there are lots of zeros? – Denis L Oct 01 '15 at 10:58
  • Both pandas and scipy have sparse data structures (pandas sparse, scipy sparse) for saving memory, but they might not be supported by the machine learning library you use. If the dimensionality of your problem (number of columns) is so large that a sparse representation is necessary, you may also want to consider dimensionality reduction techniques. – Samuel Harrold Oct 01 '15 at 11:41
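
To illustrate the pandas sparse structures mentioned in the comment, here is a small sketch that converts the indicator table to a sparse dtype. It assumes a pandas version that provides pd.SparseDtype and the DataFrame `.sparse` accessor; the names `dense`, `sparse`, and `restored` are illustrative.

```python
import pandas as pd

dense = pd.DataFrame(
    [[1, 1, 0, 0], [0, 1, 1, 1], [0, 1, 0, 1]],
    columns=['A', 'B', 'C', 'D'])

# Store only the non-zero entries; 0 is the implicit fill value.
sparse = dense.astype(pd.SparseDtype('int64', fill_value=0))

# Fraction of entries that are explicitly stored (i.e. not the fill value).
print(sparse.sparse.density)

# Convert back to ordinary dense columns when a library needs them.
restored = sparse.sparse.to_dense()
```

If SciPy is installed, `sparse.sparse.to_coo()` should give a SciPy COO matrix, which may be a better fit for libraries that accept SciPy sparse input.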