
I have a regression problem in which two input features describe a category and a subcategory. For illustration, let's say they are city and district.

Some more details about the regression problem: 39000 observations, 104 cities, each with 1-17 districts, and 5 additional demographic variables (age, salary, gender, marital status, education level). The goal is to predict the number of children of a given person (from a given city, district, age, salary, ...). In some districts we have only one record.

The question: Is there any specific method for representing nested categories for machine learning?

Important comments:

  • Plain application of a one-hot encoder to city-district pairs will not work, as many combinations are very rare in the data.
  • Still, I am not willing to ignore the information about districts completely.
  • If I were doing just logistic regression, a hierarchical Bayesian model could help. However, what about xgboost or neural networks?

Non-specific attempts so far:

  • One-hot encoding both the higher level (city) and the lower level (city-district pairs), then applying standard feature selection methods.
  • Combining the above with explicit filtering of very rare combinations (keeping only pairs with, say, at least 5 examples); a sketch of this encoding follows below.
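
For concreteness, a minimal sketch of this attempted encoding in pandas (the toy data and column names are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data; column names are illustrative.
df = pd.DataFrame({
    "city":     ["A", "A", "A", "A", "A", "B", "B"],
    "district": ["a1", "a1", "a1", "a1", "a1", "b1", "b2"],
})

# Combined label for the lower level of the hierarchy.
df["city_district"] = df["city"] + "_" + df["district"]

# Collapse city-district pairs with fewer than 5 observations into a
# single "RARE" bucket; those rows keep only their city-level dummy.
counts = df["city_district"].value_counts()
rare = counts[counts < 5].index
df.loc[df["city_district"].isin(rare), "city_district"] = "RARE"

# One-hot encode both hierarchy levels.
X = pd.get_dummies(df[["city", "city_district"]])
```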

1 Answer


As a start I would go for a random-effects model with districts nested within cities, i.e. a random intercept for each city and for each district within it. Partial pooling then shrinks the estimates for districts with only one or very few observations toward their city's mean, so those districts are not a problem. With 39000 observations and only a few covariates I would just use them all and avoid feature selection; see for instance Why is variable selection necessary?. Use splines for continuous variables such as salary and age, and use relevant interactions!
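
A minimal sketch of such a nested random-intercept model, here using Python's statsmodels (column names are assumptions, and `df` stands for the 39000-row data frame; the suggestion itself is library-agnostic):

```python
import statsmodels.formula.api as smf

# df: one row per person, with columns children, age, salary, gender,
# marital_status, education, city, district (names are illustrative).
# Fixed effects: B-spline bases for the continuous covariates,
# dummies for the categorical ones.
formula = ("children ~ bs(age, df=4) + bs(salary, df=4) "
           "+ C(gender) + C(marital_status) + C(education)")

model = smf.mixedlm(
    formula,
    data=df,
    groups="city",                              # top level of the hierarchy
    re_formula="1",                             # random intercept per city
    vc_formula={"district": "0 + C(district)"}, # districts nested in cities
)
result = model.fit()
print(result.summary())
```

In R with lme4 the equivalent nested structure is written as `(1 | city/district)` in the model formula.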

You seem to want some flexible model, like a neural network? I'm not sure how much sense that makes with 39000 observations; start with flexible linear mixed models first and evaluate the results. But there is an R package, Buddle, for neural networks with random effects, and Google gives a lot of hits.
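
If you do end up trying a neural network, one common way to feed nested categories in without the one-hot blow-up is entity embeddings. A minimal PyTorch sketch (the dimensions and default counts are illustrative, and this is plain embeddings, not the random-effects networks of the Buddle package):

```python
import torch
import torch.nn as nn

class ChildrenRegressor(nn.Module):
    """Entity embeddings for city and district plus dense demographics."""

    def __init__(self, n_cities=104, n_districts=1000, n_numeric=5):
        # n_districts: total number of distinct city-district pairs,
        # indexed globally; the default here is a placeholder.
        super().__init__()
        # Small embedding tables; rare districts still share strength
        # through the city embedding and the joint hidden layer.
        self.city_emb = nn.Embedding(n_cities, 8)
        self.district_emb = nn.Embedding(n_districts, 4)
        self.mlp = nn.Sequential(
            nn.Linear(8 + 4 + n_numeric, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, city_idx, district_idx, numeric):
        x = torch.cat(
            [self.city_emb(city_idx), self.district_emb(district_idx), numeric],
            dim=1,
        )
        return self.mlp(x)

# Usage: integer-encode city and district, then call
# model(city_idx, district_idx, numeric_features) on batched tensors.
```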