8

I'm new to data analytics and am trying out some models in Python with scikit-learn. Some of the columns in my dataset contain text values, like below:

Dataset

Is there a way to convert these column values into numbers in pandas or scikit-learn? Is assigning numbers to these values the right approach? And what if a new string shows up in the test data?

Please advise.

Sumithran
Selva Saravana Er
  • Consider using the [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function available in pandas. Ignore any new values encountered in the test data; you cannot use values that were not seen during training. – shanmuga Jan 21 '16 at 05:15
  • I was thinking of using it, but some of the columns have many unique values (up to 400+). – Selva Saravana Er Jan 21 '16 at 05:23
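
One way to act on the advice above (ignore values not seen during training) is scikit-learn's OneHotEncoder with handle_unknown="ignore". The following is only a minimal sketch: the "city" column and its values are made up for illustration and are not from the question's dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "city" is an illustrative column name.
train = pd.DataFrame({"city": ["Paris", "Rome", "Paris", "Oslo"]})
test = pd.DataFrame({"city": ["Rome", "Lima"]})  # "Lima" never appears in training

# handle_unknown="ignore" encodes an unseen category as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["city"]])

# The default output is a SciPy sparse matrix, which helps when a column
# has hundreds of distinct values; toarray() is used here only for display.
encoded_test = encoder.transform(test[["city"]]).toarray()

print(encoder.categories_[0])  # categories learned from the training data
print(encoded_test)            # the "Lima" row contains only zeros
```

Because the result stays sparse until toarray() is called, a column with 400+ unique values does not necessarily have to be materialized as 400+ dense columns.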

2 Answers

3

Consider using Label Encoding - it transforms the categorical data by assigning each category an integer between 0 and the num_of_categories-1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# apply() calls fit_transform on each column in turn, so the encoder is refit for every column
encoded_series = df.apply(le.fit_transform)

encoded_series:

    letter
0   0
1   1
2   2
3   3
4   0
5   2
6   0
7   3
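
To reuse the same mapping at prediction time, one option is to keep a fitted encoder for each column and call transform on new values instead of refitting. This is a rough sketch built on the example above; the encoders dictionary and the variable names are illustrative choices, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(['a', 'b', 'c', 'd', 'a', 'c', 'a', 'd'], columns=['letter'])

# Keep one fitted encoder per column so each mapping can be reused later.
encoders = {}
encoded = df.copy()
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# At prediction time, reuse the stored encoder rather than fitting a new one.
code_for_c = encoders['letter'].transform(['c'])[0]          # 'c' maps to 2
letter_for_3 = encoders['letter'].inverse_transform([3])[0]  # 3 maps back to 'd'
```

Note that transform raises a ValueError for a label the encoder did not see during fitting, so unseen test values need to be handled separately.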
Amir F
  • How would you apply this to prediction data to get the matching letter number? e.g. when I want to predict `d` it has to be converted to `3` from your example. – STIKO Sep 26 '18 at 02:43
  • If I am understanding you correctly, you can keep a copy of the original values on the side for reference. You will be able to convert back to letters if needed. I hope this is helpful; if it's not, please clarify what you are trying to do. – Amir F Sep 27 '18 at 06:31
  • So, let's use your example as my dataset for simplicity and let's pretend there is a target column (we don't care about it for this example), before I train my model on it, I convert it to numbers, then, I train my model on it. Now I have a trained model. Now I want to feed my model with a feature `c` to get a prediction. From your example `c` was converted to `2` (easy since I can look at it), so I need to feed my model with `2` to get my prediction. The question is how do I get `2` for `c`? – STIKO Sep 27 '18 at 22:24
  • You can toggle back and forth (2 to c and back) with np.where. It's as simple as 'if' in Excel. (https://medium.com/@emayoung95/using-numpy-where-function-to-replace-for-loops-with-if-else-statements-a1e6044ac4c1) – Amir F Sep 30 '18 at 08:05
  • 1
    This may be helpful as well - https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn – Amir F Oct 01 '18 at 09:33
  • That's exactly what I was looking for. Thanks a bunch @Amir – STIKO Oct 02 '18 at 14:22
0

You can convert them into integer codes by using the categorical datatype.

column = column.astype('category')
column_encoded = column.cat.codes

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split out the categories again.
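
If the test data may contain new strings, the category codes only stay consistent when the test column reuses the categories learned from the training column. Here is a small sketch of that idea; the colour values and variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical training and test columns.
train = pd.Series(['red', 'green', 'blue', 'green'])
test = pd.Series(['blue', 'purple'])  # 'purple' is not in the training data

train_cat = train.astype('category')
train_codes = train_cat.cat.codes

# Reuse the training categories so the same string always gets the same code.
# Values outside those categories become missing and receive the code -1.
test_cat = pd.Categorical(test, categories=train_cat.cat.categories)
test_codes = pd.Series(test_cat.codes, index=test.index)

print(train_codes.tolist())
print(test_codes.tolist())
```

Whether -1 is an acceptable code for unseen values depends on the model; it is shown here only because that is how pandas marks values outside the known categories.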

maxymoo