My goal is to feature-engineer the column Education_Level, which is clearly ordinal data. However, I am having difficulty deciding whether to encode the 'Unknown' values of Education_Level as -1 or np.nan, because I don't know how np.nan and -1 affect regression, classification, and clustering algorithms.
I believe the ordinal levels are as follows: 1 = 'Uneducated', 2 = 'High School', 3 = 'College', 4 = 'Graduate', 5 = 'Post-Graduate', 6 = 'Doctorate'
df['Education_Level'].value_counts()
Graduate 3128
High School 2013
Unknown 1519
Uneducated 1487
College 1013
Post-Graduate 516
Doctorate 451
What I've done:
- I tried to simulate it with the scikit-learn library.
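import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Encode the levels in rank order; any value not listed (here 'Unknown') becomes unknown_value.
ord_enc = OrdinalEncoder(categories=[
['Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate', 'Doctorate']
], handle_unknown='use_encoded_value', unknown_value=np.nan).fit(df)
X_encoded = ord_enc.transform(X=df)
# Rescale the encoded column to the range [0, 1].
sca = MinMaxScaler().fit(X=X_encoded)
X_scaled = sca.transform(X=X_encoded)
X_scaled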
X_scaled (with unknown_value=np.nan)
array([[0.6],
[0.2],
[nan],
[0. ],
[0.4],
[0.8],
[1. ]])
X_scaled (with unknown_value=-1)
array([[0.66666667],
[0.33333333],
[0. ],
[0.16666667],
[0.5 ],
[0.83333333],
[1. ]])
What I know is that unknown_value=-1 will affect the scaler, while unknown_value=np.nan should not. My assumption is that in a linear regression with the equation y = ax + b, if x is np.nan the term ax drops out and it becomes y = b, while if x is -1 it becomes y = -a + b. I am not sure the first part holds, though, since arithmetic with np.nan normally yields np.nan rather than 0.
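A quick way to check the assumption about y = ax + b is to plug both values in. The sketch below uses made-up coefficients a = 2 and b = 1 (they are not from my data) and also tries the same inputs on scikit-learn's LinearRegression fitted to points on that line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical slope and intercept, chosen only for illustration.
a, b = 2.0, 1.0


def line(x):
    """Evaluate y = a*x + b."""
    return a * x + b


y_nan = line(np.nan)  # NaN propagates through the arithmetic
y_neg = line(-1.0)    # equals -a + b

# Fit a model on clean points from the same line.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = line(X[:, 0])
model = LinearRegression().fit(X, y)

# Predicting on -1 works like any other number.
pred_neg = model.predict(np.array([[-1.0]]))[0]

# Predicting on NaN is rejected by input validation.
try:
    model.predict(np.array([[np.nan]]))
    rejected_nan = False
except ValueError:
    rejected_nan = True

print(y_nan, y_neg, pred_neg, rejected_nan)
```

So with plain arithmetic np.nan does not reduce the equation to y = b, and LinearRegression refuses the NaN input instead of silently treating it as zero. The -1 value is used as an ordinary number one step below 'Uneducated'.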
As I write this post, I believe the answer is 'It depends'.
- It is ordinal if you believe that the higher the education level, the less likely a person is to churn.
- If you believe churn does not follow the education order (for example, 'Graduate' is more likely to churn than 'High School' but less likely to churn than 'Uneducated'), then you can't use ordinal encoding. With ordinal encoding you only have one axis, while with nominal encoding you have six axes, one per level.
- It is nominal if you believe that whether a person churns depends on the specific education level rather than on its rank.