I'm working on an anomaly detection problem with streaming data, using Robust Random Cut Forest (RRCF). I have 295,000+ rows to start with, and more data is coming in.
The problem is encoding the categorical features. Several columns have hundreds of unique values, and one column currently contains about 10,000 unique values. The number of unique values will keep increasing over time, which is the main problem.
That means I cannot one-hot encode the variables: the number of generated columns would be huge, and since the number of unique values keeps growing, the training set and test set would end up with different columns.
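To make the mismatch concrete, here is a minimal stdlib-only sketch (with hypothetical category values) of how the one-hot feature space diverges once unseen categories arrive:

```python
from typing import List

def one_hot(rows: List[str], vocabulary: List[str]) -> List[List[int]]:
    """Encode each value as an indicator vector over a fixed vocabulary."""
    return [[1 if v == cat else 0 for cat in vocabulary] for v in rows]

train = ["red", "blue", "red"]
vocab = sorted(set(train))            # vocabulary frozen at training time
encoded_train = one_hot(train, vocab)  # 2 columns per row

test = ["red", "green"]               # "green" was never seen in training
vocab_test = sorted(set(train + test))
encoded_test = one_hot(test, vocab_test)  # 3 columns per row

# The encoded widths no longer match, so the model's input shape breaks.
print(len(encoded_train[0]), len(encoded_test[0]))  # 2 3
```

Dropping unseen categories instead would keep the width fixed, but then every new value in that 10,000+ category column is silently erased, which defeats the purpose for anomaly detection.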
Label encoding won't work either: the categories are nominal, so the integer codes impose an ordering and distance between values that doesn't actually exist.
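A tiny sketch of that false ordering (hypothetical category values, codes assigned alphabetically the way a typical label encoder would):

```python
# Assign each category an arbitrary integer code.
categories = ["apple", "banana", "zebra"]
codes = {cat: i for i, cat in enumerate(sorted(categories))}
print(codes)  # {'apple': 0, 'banana': 1, 'zebra': 2}

# RRCF cuts on numeric axes, so it would treat "banana" (1) as lying
# between "apple" (0) and "zebra" (2) -- an ordering and a notion of
# distance that the original categories never had.
```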
I have also tried frequency encoding, but that breaks down when you keep feeding the model new data and training on it: as some values become more frequent than before, their encodings change, so new data ends up encoded differently from the training data.
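Here is a small stdlib-only sketch (hypothetical stream values) of that drift, where the same category maps to different numbers as the stream grows:

```python
from collections import Counter

# Frequency-encode day 1 of the stream.
stream_day1 = ["A", "A", "B", "C"]
counts = Counter(stream_day1)
enc_day1 = {k: counts[k] / len(stream_day1) for k in counts}
print(enc_day1["B"])  # 0.25

# More data arrives and the relative frequencies shift.
stream_day2 = stream_day1 + ["B", "B", "B", "C"]
counts = Counter(stream_day2)
enc_day2 = {k: counts[k] / len(stream_day2) for k in counts}
print(enc_day2["B"])  # 0.5 -- the same category now encodes differently
```

So points inserted into the forest yesterday and points inserted today live on different scales, even when the underlying category is identical.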
I have also looked into other encoding techniques, but none of them seem like they would work: either the number of columns changes over time, or they require some kind of target variable (which I don't have).
I'm using collusive displacement (CoDisp) to give each point an anomaly score.