Machine learning feature encoding

Question

I'm new to Machine Learning.

I've just finished the Coursera course. :)

For my first practical attempt I wanted to "analyse" a local used-car sales website in order to build a model that would "predict" the final price.

I have a problem with "encoding" car features. Some of them are "discrete" (make, model, gearbox encoding: 1 - manual, 2 - automatic, 3 - semi-automatic; fuel encoding: 1 - petrol, 2 - diesel, 3 - electric, etc.), and some are continuous (engine volume, engine power, mileage, etc.).

The issue is - some of these features might be absent as it is not compulsory to fill them all in.

My main question is: should I use some special value for representing a missing feature?

I don't feel like using "0" (zero) would do any good, as "0 * x = 0": absolutely any "theta" would do in this particular case. Should I set it to, say, "-1" or something? What is a common approach to this?

And what about feature scaling in that case?

Are you working with a programming language? R and NumPy both have special values for representing missing data (although these often get recoded anyway) — shadowtalker, Feb 01 '15 at 18:20
And I'm not recommending that you do this, but a lot of survey data comes with missing data coded as unreasonable values like "999" or, as you suggested, "-1". This is actually a very big gotcha for working with survey data if you aren't careful to read the codebook and completely clean the data set before using it. Make sure whatever you do is thoroughly documented! — shadowtalker, Feb 01 '15 at 18:22
Right now I'm collecting data and coming up with different features to use. I was actually planning to use "Octave" or "Matlab". Not very familiar with R or NumPy, but might give it a try as well :) — Dmitri, Feb 01 '15 at 19:31

Scortchi - Reinstate Monica · Accepted Answer · 2015-09-02T13:00:08.953

For categorical variables code a new category of "missing"; for continuous variables set missing values to any constant value $a$ & add an indicator variable for missingness. E.g. let the linear predictor be given by

$$\eta=\beta_0 + \beta_1 x_1 + \beta_2 q_1 + \ldots$$

When $x_1$ is not missing set $q_1$ to $0$:

$$\eta=\beta_0 + \beta_1 x_1 + \ldots$$

When $x_1$ is missing set $q_1$ to $1$ & $x_1$ to $a$:

$$\eta=\beta_0 + \beta_1 a + \beta_2 + \ldots$$

Note that whatever value you choose for $a$ affects only the interpretation of $\beta_2$, leaving the model's predictions unchanged.^† The new predictor $(x_1, q_1)$ can be used in interactions, & $q_1$ alone; but not $x_1$ alone as it contains the arbitrary $a$.
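A minimal Python/NumPy sketch of the encoding described above, fitted with unpenalized least squares. The function names (`encode_with_indicator`, `fit_predict`, `one_hot_with_missing`) and the toy numbers are made up for illustration; they are not from any library or from the original data. It refits the same model with two different choices of $a$ so that the fitted values and coefficients can be compared.

```python
import numpy as np

def encode_with_indicator(x, fill_value):
    """Return (x_filled, q): NaNs in x replaced by fill_value, q flags missingness."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    q = missing.astype(float)                  # q_1 = 1 where x_1 is missing, else 0
    x_filled = np.where(missing, fill_value, x)  # x_1 = a where missing
    return x_filled, q

def fit_predict(x, y, fill_value):
    """Unpenalized least squares on the columns [1, x_1, q_1]."""
    x_filled, q = encode_with_indicator(x, fill_value)
    X = np.column_stack([np.ones_like(x_filled), x_filled, q])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta, beta

def one_hot_with_missing(values, categories):
    """One-hot encode, sending None (or any unlisted value) to an extra 'missing' level."""
    levels = list(categories) + ["missing"]
    idx = [levels.index(v) if v in categories else len(categories) for v in values]
    return np.eye(len(levels))[idx]

# Hypothetical toy data: engine power with two missing entries, and a price.
power = np.array([55.0, 77.0, np.nan, 110.0, 96.0, np.nan, 132.0])
price = np.array([4.1, 5.9, 6.5, 9.8, 8.0, 7.2, 12.5])

pred_a0, beta_a0 = fit_predict(power, price, fill_value=0.0)   # a = 0
pred_a1, beta_a1 = fit_predict(power, price, fill_value=-1.0)  # a = -1

# Categorical case: a missing gearbox value becomes its own category.
gearbox = one_hot_with_missing(["manual", None, "automatic"],
                               ["manual", "automatic", "semi-automatic"])
```

Comparing `pred_a0` with `pred_a1` shows the same fitted values for both choices of $a$, and the slope on $x_1$ is also the same; only the coefficient on $q_1$ shifts. This holds for the unpenalized fit used here and would not carry over unchanged to a penalized one.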

Unfortunately, even when data are missing completely at random this technique leads to biased estimates when the predictors are correlated (i.e. almost always, with observational data). See Jones (1996), "Indicator & stratification methods for missing explanatory variables in multiple linear regression", JASA, 91, 433. Imputation of missing values is preferable. Little & Rubin (2002), Statistical Analysis with Missing Data, is a good introduction to the problems arising with missing data & techniques for dealing with them.

† Of course you need to be careful when using any technique that penalizes coefficients according to their magnitude.

Thank you for your answer!
For categorical variables - can this new "category" be zero? For continuous variables - can this new constant value "a" be zero? What is a missingness indicator variable? Like a new boolean feature: nothing is missing / something is missing? If yes - what is the purpose of it?

If it all leads to bias, then maybe it would be better to discard the training "record" altogether? — Dmitri, Feb 01 '15 at 17:03
I added an illustration. Not sure what you mean by a category's being zero: you can call it "zero" or "nought" or "missing" or "potato". Discarding incomplete records can, as well as increasing variability, introduce bias when observations aren't missing completely at random. — Scortchi - Reinstate Monica, Feb 01 '15 at 17:53

score 2 · Answer 2 · answered Feb 01 '15 at 17:02

Suppose you have trained a Bayesian classifier using the full data. When classifying a pattern that contains one (or more) unmeasured variables, you marginalize over those variables, as described in textbooks on machine learning and pattern classification (e.g., Chapter 2 of Pattern Classification, 2nd ed., by Duda, Hart and Stork).
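As a rough illustration of marginalizing at classification time, here is a Python/NumPy sketch that assumes Gaussian class-conditional densities; that assumption, the function names and the toy parameters are all hypothetical and not taken from the textbook. For a joint Gaussian, integrating out the unmeasured coordinates leaves a Gaussian over the measured ones, with the corresponding sub-vector of the mean and sub-matrix of the covariance, so the posterior can be computed from the observed features only.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + maha)

def posterior_with_missing(x, priors, means, covs):
    """Class posteriors for x; NaN entries are marginalized out.

    Assumes at least one feature is observed.
    """
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(x)                      # mask of measured features
    log_post = np.array([
        np.log(p) + log_gaussian(x[obs], m[obs], c[np.ix_(obs, obs)])
        for p, m, c in zip(priors, means, covs)
    ])
    log_post -= log_post.max()              # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical two-class model with two features.
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

# The second feature was not measured for this pattern.
post = posterior_with_missing([2.0, np.nan], priors, means, covs)
```

With other class-conditional models the marginalization step would look different (a sum for discrete features, for example), but the idea of classifying on the marginal density of the measured features is the same.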

David G. Stork

Thank you for the reference! I'll take a look, as I'm pretty sure I'm lacking knowledge and cases like mine weren't really covered in the ML course. – Dmitri Feb 01 '15 at 17:11
