I am a beginner in machine learning and am currently studying anomaly detection, which, as far as I understand, comes down to identifying outliers under a Gaussian or some other (e.g. chi-squared) distribution.
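For reference, this is the formulation I have in mind, written as a rough sketch. It assumes the n features are independent and Gaussian; the symbols (mu_j, sigma_j, epsilon) are my own notation, and epsilon is a threshold you would have to tune:

```latex
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}
       \exp\!\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right),
\qquad
\text{flag } x \text{ as an anomaly if } p(x) < \varepsilon
```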
What do you do if, for example, you have three machine types with parameters such as temperature and processing delay, and you want to evaluate whether there is an anomaly? The three machine types differ (both features have a slightly different variance in each type), and one type has very few samples, around 20 against 1000 in each of the other types.
What is the standard approach: drop this type, since it might otherwise be flagged as an anomaly itself, or train a separate anomaly detector for it?
What do you do if the raw data contains a large number of categorical features? Should the same analysis then be performed on every class of each categorical feature, to check whether it represents an outlier?
Answering the question: in the course we were given an example in which you have an airplane engine and, based on two features (heat and vibration), have to build an anomaly detector (is the engine close to failure or not).
Now, since real-life data usually contains more than numerical variables like heat or vibration, what do you do with categorical features? The numerical features may or may not vary depending on the class of a categorical feature.
- Drop all categorical features? This would make the model confuse, for example, different engine types (for some a certain level of vibration is fine, while others would already break at that level).
- Build a separate anomaly detector for each class of the categorical feature? This would probably be the best solution, but it would consume a lot of resources and would probably not be computable on production datasets.
- Other options?
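
To make the second option concrete, here is a minimal sketch of what I mean by a per-class detector. It assumes each feature is roughly Gaussian within a class and uses a plain z-score rule. The group names "A" and "B", the toy values, and the threshold of 3 standard deviations are all made up for illustration:

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_per_group(samples):
    """Estimate a mean and standard deviation per feature, separately per group.

    samples: list of (group, features) pairs, where features is a tuple of floats.
    Returns {group: [(mean, std), ...]} with one pair per feature.
    """
    by_group = defaultdict(list)
    for group, features in samples:
        by_group[group].append(features)

    params = {}
    for group, rows in by_group.items():
        columns = list(zip(*rows))  # transpose rows into per-feature columns
        params[group] = [(mean(col), stdev(col)) for col in columns]
    return params


def is_anomaly(params, group, features, threshold=3.0):
    """Flag a sample if any feature is more than `threshold` standard
    deviations away from the mean of its own group."""
    for value, (mu, sigma) in zip(features, params[group]):
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            return True
    return False


# Toy data (hypothetical): type "A" runs around 60-64 degrees with 100 samples,
# type "B" runs around 90-94 degrees with only 20 samples.
samples = (
    [("A", (60.0 + i % 5, 10.0 + i % 3)) for i in range(100)]
    + [("B", (90.0 + i % 5, 10.0 + i % 3)) for i in range(20)]
)
params = fit_per_group(samples)

# The same reading is judged against the statistics of its own machine type.
reading = (90.0, 11.0)
print(is_anomaly(params, "A", reading))
print(is_anomaly(params, "B", reading))
```

In this simple form the per-class fit is a single pass over the data, so the cost may be less of a problem than I feared. What still worries me is the class with only about 20 samples, where the estimated mean and standard deviation will be much noisier than for the large classes.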