convert text columns into numbers in sklearn

Question

I'm new to data analytics. I'm trying out some models in Python with sklearn. I have a dataset in which some of the columns contain text, like below:

Dataset

Is there a way to convert these column values into numbers in pandas or sklearn? Would assigning numbers to these values be the right approach? And what if a new string shows up in the test data?

Please advise.

Consider using the [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function available in pandas. Ignore any new values encountered in the test data; you cannot use values that were not seen during training. — shanmuga, Jan 21 '16 at 05:15
I was thinking of using it, but some of the columns have many unique values (up to 400+). — Selva Saravana Er, Jan 21 '16 at 05:23
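
A minimal sketch of the get_dummies approach suggested in the comment above. The `train` and `test` frames and the `color` column are hypothetical, not taken from the question's dataset. Reindexing the test columns to the training columns drops any category that was not seen in training and adds all-zero columns for categories missing from the test data.

```python
import pandas as pd

# Hypothetical data: "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# One-hot encode each frame separately.
train_encoded = pd.get_dummies(train, columns=["color"])
test_encoded = pd.get_dummies(test, columns=["color"])

# Align the test frame to the training columns: unseen categories are dropped,
# and categories missing from the test set become all-zero columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
```

With 400+ unique values per column this produces very wide frames, so it may be less practical than integer codes for the high-cardinality columns.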

score 3 · Answer 1 · answered Mar 11 '17 at 09:08

Consider using Label Encoding - it transforms the categorical data by assigning each category an integer between 0 and the num_of_categories-1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# fit_transform runs once per column, so each column gets its own 0..n-1 codes
encoded_series = df.apply(le.fit_transform)

encoded_series:

   letter
0       0
1       1
2       2
3       3
4       0
5       2
6       0
7       3

Amir F

How would you apply this to prediction data to get the matching letter number? e.g. when I want to predict `d` it has to be converted to `3` from your example. – STIKO Sep 26 '18 at 02:43
If I am understanding you correctly, you can keep a copy of the original values on the side for reference. You will be able to convert back to letters if needed. I hope this is helpful; if it's not, please clarify what you are trying to do. – Amir F Sep 27 '18 at 06:31
So, let's use your example as my dataset for simplicity and let's pretend there is a target column (we don't care about it for this example), before I train my model on it, I convert it to numbers, then, I train my model on it. Now I have a trained model. Now I want to feed my model with a feature `c` to get a prediction. From your example `c` was converted to `2` (easy since I can look at it), so I need to feed my model with `2` to get my prediction. The question is how do I get `2` for `c`? – STIKO Sep 27 '18 at 22:24
You can toggle back and forth (2 to c and back) with np.where. It's as simple as 'if' in Excel (https://medium.com/@emayoung95/using-numpy-where-function-to-replace-for-loops-with-if-else-statements-a1e6044ac4c1). – Amir F Sep 30 '18 at 08:05
This may be helpful as well - https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn – Amir F Oct 01 '18 at 09:33
That's exactly what I was looking for. Thanks a bunch @Amir – STIKO Oct 02 '18 at 14:22
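
The lookup asked about in the comments above (getting `2` for `c` at prediction time) can also be done by keeping the fitted encoder and reusing it. This is a minimal sketch, assuming the same single-column `letter` frame from the answer; the `letter_code` column and the variable names are illustrative. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels, and OrdinalEncoder as the equivalent for feature columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(["a", "b", "c", "d", "a", "c", "a", "d"], columns=["letter"])

# Fit once on the training column and keep the fitted encoder around.
le = LabelEncoder()
df["letter_code"] = le.fit_transform(df["letter"])

# At prediction time, reuse the same encoder instead of refitting.
code_for_c = le.transform(["c"])[0]
letter_for_3 = le.inverse_transform([3])[0]

# The learned mapping lives in classes_; the index of each class is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# transform raises ValueError for a label that was not seen during fit.
try:
    le.transform(["z"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(code_for_c, letter_for_3, mapping, unseen_ok)
```

Because the encoder in this sketch is fitted on one column only, a frame with several text columns would need one fitted encoder per column.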

score 0 · Answer 2 · answered Jan 21 '16 at 05:53

You can convert them into integer codes by using the categorical datatype.

column = column.astype('category')
column_encoded = column.cat.codes

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split the categories apart again.
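
A minimal sketch of how the categorical codes above could be kept consistent between training and test data. The `train` and `test` series are hypothetical. The category list is learned from the training data only and then applied to the test data, so any value not seen in training gets the code -1.

```python
import pandas as pd

train = pd.Series(["a", "b", "c", "d", "a"])
test = pd.Series(["c", "d", "z"])  # "z" never appears in train

# Learn the category list from the training data only.
train_cat = train.astype("category")
categories = train_cat.cat.categories

# Apply the same category list to the test data so the codes line up.
test_cat = pd.Categorical(test, categories=categories)

train_codes = train_cat.cat.codes.tolist()
test_codes = test_cat.codes.tolist()  # values outside the categories get -1

print(train_codes, test_codes)
```

Calling `astype('category')` on the test column by itself would instead build a new category list, so the same string could map to a different integer than it did in training.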
