I'm working on an anomaly detection problem with streaming data, using Robust Random Cut Forest (RRCF). I have 295,000+ rows to start with, and more data is coming in.
The problem is encoding the categorical features. Several columns have hundreds of unique values, and one column currently contains about 10,000 unique values. The number of unique values will keep increasing over time, which is the main problem.
That means I cannot one-hot encode the variables: the number of generated columns would be huge, and since the number of unique values keeps growing, the training set and test set would end up with different columns.
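To make the mismatch concrete, here is a minimal stdlib-only sketch (with hypothetical category values) of how the one-hot feature space diverges once unseen categories arrive:

```python
from typing import List

def one_hot(rows: List[str], vocabulary: List[str]) -> List[List[int]]:
    """Encode each value as an indicator vector over a fixed vocabulary."""
    return [[1 if v == cat else 0 for cat in vocabulary] for v in rows]

train = ["red", "blue", "red"]
vocab = sorted(set(train))            # vocabulary frozen at training time
encoded_train = one_hot(train, vocab)  # 2 columns per row

test = ["red", "green"]               # "green" was never seen in training
vocab_test = sorted(set(train + test))
encoded_test = one_hot(test, vocab_test)  # 3 columns per row

# The encoded widths no longer match, so the model's input shape breaks.
print(len(encoded_train[0]), len(encoded_test[0]))  # 2 3
```

Dropping unseen categories instead would keep the width fixed, but then every new value in that 10,000+ category column is silently erased, which defeats the purpose for anomaly detection.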
Label encoding won't work either: the categories are nominal, so the integer codes impose an ordering and distance between values that doesn't actually exist.
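A tiny sketch of that false ordering (hypothetical category values, codes assigned alphabetically the way a typical label encoder would):

```python
# Assign each category an arbitrary integer code.
categories = ["apple", "banana", "zebra"]
codes = {cat: i for i, cat in enumerate(sorted(categories))}
print(codes)  # {'apple': 0, 'banana': 1, 'zebra': 2}

# RRCF cuts on numeric axes, so it would treat "banana" (1) as lying
# between "apple" (0) and "zebra" (2) -- an ordering and a notion of
# distance that the original categories never had.
```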
I have also tried frequency encoding, but that breaks down when you keep feeding the model new data and training on it: as some values become more frequent than before, their encodings change, so new data ends up encoded differently from the training data.
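Here is a small stdlib-only sketch (hypothetical stream values) of that drift, where the same category maps to different numbers as the stream grows:

```python
from collections import Counter

# Frequency-encode day 1 of the stream.
stream_day1 = ["A", "A", "B", "C"]
counts = Counter(stream_day1)
enc_day1 = {k: counts[k] / len(stream_day1) for k in counts}
print(enc_day1["B"])  # 0.25

# More data arrives and the relative frequencies shift.
stream_day2 = stream_day1 + ["B", "B", "B", "C"]
counts = Counter(stream_day2)
enc_day2 = {k: counts[k] / len(stream_day2) for k in counts}
print(enc_day2["B"])  # 0.5 -- the same category now encodes differently
```

So points inserted into the forest yesterday and points inserted today live on different scales, even when the underlying category is identical.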
I have also looked into other encoding techniques, but none of them seem like they would work: either the number of columns changes over time, or they require some kind of target variable (which I don't have).
I'm using collusive displacement (CoDisp) to give each point an anomaly score.