
My goal is to feature-engineer the column Education_Level. This is obviously ordinal data. However, I am having difficulty deciding whether to encode the 'Unknown' values of Education_Level as -1 or np.nan, because I don't know how np.nan and -1 affect regression, classification, and clustering algorithms.

I believe the ordinal levels are as follows: 1 = 'Uneducated', 2 = 'High School', 3 = 'College', 4 = 'Graduate', 5 = 'Post-Graduate', 6 = 'Doctorate'

df['Education_Level'].value_counts()

Graduate         3128
High School      2013
Unknown          1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451

What I've done:

  1. I tried to simulate it with the scikit-learn library.
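import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# df holds only the Education_Level column (one column per entry in `categories`).
# Values not listed in `categories` (here 'Unknown') are encoded as `unknown_value`.
ord_enc = OrdinalEncoder(categories=[
    ['Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate', 'Doctorate']
], handle_unknown='use_encoded_value', unknown_value=np.nan).fit(df)

X_encoded = ord_enc.transform(X=df)

sca = MinMaxScaler().fit(X=X_encoded)

X_scaled = sca.transform(X=X_encoded)
X_scaled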

X_scaled (with unknown_value=np.nan)

array([[0.6],
       [0.2],
       [nan],
       [0. ],
       [0.4],
       [0.8],
       [1. ]])

X_scaled (with unknown_value=-1)

array([[0.66666667],
       [0.33333333],
       [0.        ],
       [0.16666667],
       [0.5       ],
       [0.83333333],
       [1.        ]])

What I know is that unknown_value=-1 will affect the scaler, while unknown_value=np.nan should not. My assumption is that in a linear regression with an equation such as y = ax + b, if x is np.nan the equation becomes y = b, while if x is -1 it becomes y = -a + b.
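To make the two effects concrete, here is a minimal pure-Python sketch. It redoes the min-max arithmetic by hand (skipping NaN when finding the minimum and maximum, which is what produces the first array above) and evaluates y = ax + b for both placeholder values. The coefficients a = 2 and b = 1 are made up for illustration, not fitted to any data.

```python
import math

# Integer codes for the seven rows above, in value_counts() order:
# Graduate, High School, Unknown, Uneducated, College, Post-Graduate, Doctorate
codes = [3, 1, None, 0, 2, 4, 5]  # None marks 'Unknown'


def min_max(values):
    """Scale to [0, 1]; NaN is skipped when finding min/max and stays NaN."""
    known = [v for v in values if not math.isnan(v)]
    lo, hi = min(known), max(known)
    return [(v - lo) / (hi - lo) for v in values]


# 'Unknown' as NaN: the range stays 0..5, so Graduate -> 3/5 = 0.6
with_nan = min_max([math.nan if c is None else float(c) for c in codes])

# 'Unknown' as -1: the range becomes -1..5, so Graduate -> 4/6 = 0.667
with_minus_one = min_max([-1.0 if c is None else float(c) for c in codes])

# Hypothetical coefficients for y = a*x + b (assumed, for illustration only)
a, b = 2.0, 1.0
y_minus_one = a * -1 + b   # equals -a + b
y_nan = a * math.nan + b   # NaN propagates; the result is NaN, not b

print(with_nan)
print(with_minus_one)
print(y_minus_one, y_nan)
```

Under these assumptions a NaN input yields a NaN prediction rather than y = b, and scikit-learn's LinearRegression raises an error when given NaN input, so np.nan has to be imputed or otherwise handled before fitting.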

As I write this post, I believe the answer is 'it depends'.

  1. It is ordinal if you believe that the higher the education level, the less likely a person is to churn.
  2. If you believe churn does not follow that order (e.g. 'Graduate' > 'High School' < 'Uneducated', i.e. 'Graduate' is more likely to churn than 'High School' but less likely to churn than 'Uneducated'), then you can't use ordinal encoding. With ordinal encoding you have only one axis, while with nominal encoding you have six.
  3. It is nominal if you believe that whether a person churns depends on the specific education level.
  • You should probably align the education levels with a validated source like the US Census if these are US-based data. The standard terms are: "less than high school", "high school or equivalent", "some college", "associate's degree", "bachelor's degree", "advanced degree". – AdamO Nov 23 '22 at 12:59

2 Answers


The variable Education with levels 'Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate' and 'Doctorate' is certainly an ordered categorical variable. Indeed, we all agree that these levels can be ordered in a meaningful way.

A legitimate question may be: is the distance between Uneducated and High School the same as the distance between Post-Graduate and Doctorate? The answer to this question is, I guess, opinion-based.

If you included a further level 'Unknown', this would destroy the property that "we all agree that these levels can be ordered in a meaningful way", since there is no meaningful place to put 'Unknown' in the ordered list. In my opinion, 'Unknown' is most likely missing information and thus cannot be treated as a level.

utobi
  • Therefore, treat it as nominal and one-hot encode it? – Jason Rich Darmawan Nov 23 '22 at 09:24
  • Yes. Treating it as nominal wouldn't do much harm after all, since it is a feature. If you were to model it as a RESPONSE, then I'd be a little more careful, but that is another matter. – utobi Nov 23 '22 at 09:28
  • How do you solve the dummy variable trap? I am having difficulty with sklearn.preprocessing.OneHotEncoder. Suppose I drop the first column to address the dummy variable trap, using OneHotEncoder(drop='first', handle_unknown='ignore'). Then any category the encoder has not seen, e.g. 'Unknown' or 'Unknowns', is encoded as [0. 0. 0. 0. 0.], which is also the encoding of the dropped category 'Uneducated'. So every unknown category will be treated as 'Uneducated'. – Jason Rich Darmawan Nov 23 '22 at 09:35
  • I'm sorry, I can't help you with this; I'm not familiar with Python. R gives NA to everything that is missing or not available. No idea what Python does. However, this is a programming question and not related to your post :-) – utobi Nov 23 '22 at 09:41
  • A note of caution: when you code 'unknown' as an additional category, you're making the implicit assumption that the mechanism by which 'unknown' is generated is common to the data to which you fit your model & the data on which you intend to use your model to make predictions. See threads tagged with 'missing-data', & my answer here. – Scortchi - Reinstate Monica Nov 23 '22 at 11:48
  • I am learning from a credit card dataset in which the proportion of 'Unknown' is 14%. I therefore assume that it is common practice for the credit card company to let customers not disclose their education, so in a linear regression equation y = b + b0x0 + b1x1, 'Unknown' should have its own slope. Do you suggest handling 'Unknown' as missing data and replacing it, e.g. with the most frequent category? – Jason Rich Darmawan Nov 23 '22 at 16:20
  • @kidfrom imputation may be a viable alternative to row-wise deletion. However, for the imputation to be efficacious, there are some assumptions about missingness that have to be satisfied. Simplifying, imputation works best when missingness is at random. There are many references for the imputation of missing values; I'm not very familiar with it, to be honest. – utobi Nov 23 '22 at 21:09
  • @kidfrom: Depends what your model's for. If it's to estimate the relation between churn & education level, it's concerning that you don't know the latter in 15% of cases. If it's to predict churn for new customers whose data is collected in the same fashion then Unknown's having its own slope should be fine. [BTW I've edited a mistake in my previous comment so it's perhaps clearer now.] – Scortchi - Reinstate Monica Nov 24 '22 at 09:13

Education level is ordinal, as you already noticed. However, you cannot consider "Unknown" as one of the ordinal levels. It is missing data, so you need to pick one of the many approaches for dealing with missing data.

As an additional comment, OrdinalEncoder from scikit-learn encodes the levels as integers, and this is not necessarily the best encoding. With ordinal data, we assume that there is an ordering between the categories, but the values do not have a standard numerical interpretation (you cannot subtract them, etc.) as interval or ratio data do. When you use such coding, you ignore the fact that the variable is ordinal and treat the categories as if they had a numerical meaning.
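As a small illustration of that point, the sketch below (toy data, one row per level) shows that the integer codes produced by OrdinalEncoder are equally spaced, so any model that uses them as numbers implicitly treats each step between adjacent levels as the same size.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = ['Uneducated', 'High School', 'College',
          'Graduate', 'Post-Graduate', 'Doctorate']
X = np.array(levels, dtype=object).reshape(-1, 1)  # toy data, one row per level

# Each level is mapped to its position in `levels`: 0.0, 1.0, ..., 5.0
codes = OrdinalEncoder(categories=[levels]).fit_transform(X).ravel()

# Differences between adjacent levels are all exactly 1
steps = np.diff(codes)

print(codes)
print(steps)
```

In a linear model this means the step from 'Uneducated' to 'High School' is forced to have the same effect as the step from 'Post-Graduate' to 'Doctorate'.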

Tim
  • The proportion of the 'Unknown' label is 14%. I assume that the credit card company marks it as 'Unknown' if the customer does not wish to disclose that information. Knowing that, is it possible to handle the missing data, e.g. by replacing it with the most frequent category? Regarding OrdinalEncoder, do you mean the algorithm won't know that the feature is ordinal although we encode it with 'OrdinalEncoder'? Therefore, do you suggest using 'OneHotEncoder' even for an ordinal feature? – Jason Rich Darmawan Nov 23 '22 at 16:18
  • @kidfrom Imputing the most frequent category is a possible solution but not necessarily the best one (see other questions for more details). As for OrdinalEncoder, yes, it doesn't encode in a way that is consistent with the meaning of ordinal measurement level variables. – Tim Nov 23 '22 at 18:23
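For reference, most-frequent imputation as discussed in these comments could look like the sketch below. It uses made-up toy rows and assumes the string 'Unknown' is the only missing-value marker. Keeping a separate indicator column is an optional extra, shown here so the information that a value was originally unknown is not lost.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column; 'Graduate' is the most frequent known level here
X = np.array([['Graduate'], ['Unknown'], ['High School'],
              ['Graduate'], ['Unknown'], ['College']], dtype=object)

# Flag which rows were originally 'Unknown' before they are overwritten
was_unknown = (X == 'Unknown').astype(int)

# Treat the string 'Unknown' as the missing marker and fill it
# with the most frequent remaining category
imputer = SimpleImputer(missing_values='Unknown', strategy='most_frequent')
X_imputed = imputer.fit_transform(X)

print(X_imputed.ravel())
print(was_unknown.ravel())
```

As noted above, this is only one of several options, and it works best when the values are missing at random.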