Dataset description
My dataset features are:
- some features not important for this question
- Price (target)
- Collection (categorical feature, there are 1.8k collections)
- Latest 10 prices (time series: the 10 most recent sale prices within the same collection)
- up to 37 attribute features (the number and type of attributes depends entirely on the collection)
I am trying to create a model to predict the price of the next sale.
Items belonging to different collections have different price ranges, but even within the same collection some items are "rarer" than others, so those items should sell for a higher price.
Example
Let's say we only have 2 collections, Coll1 and Coll2, each having 10k items.
Coll1 has 3 different attributes: Background, Hat and Clothes
Coll2 has 5 different attributes: Background, Weapon, Vehicle, Pet and Power
As you can see, depending on the collection we end up with a different number of features as well as different types (Power is an actual number between 0 and 100). Although both collections have an attribute named Background, it is not really a shared feature: the possible values may or may not overlap, so we cannot treat it as the same feature.
As I said before, even 2 items belonging to the same collection are not priced the same, because a specific category of an attribute may appear on only 1 item. For example, let's take 2 items from Coll1:
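To make the layout problem concrete, here is a minimal sketch in Python with pandas of what the data looks like if every attribute gets its own column. The item `ItemX`, all of its attribute values, and the `Coll1.` / `Coll2.` column prefixes are made up for illustration; this is only one possible way to lay the data out, not necessarily the right one.

```python
import pandas as pd

# Hypothetical items: ItemX and its attribute values are invented for illustration.
items = [
    {"item": "ItemA", "collection": "Coll1",
     "attributes": {"Background": "Green", "Hat": "Cowboy", "Clothes": "Casual"}},
    {"item": "ItemX", "collection": "Coll2",
     "attributes": {"Background": "Space", "Weapon": "Sword", "Vehicle": "Bike",
                    "Pet": "Dog", "Power": 87}},
]

# Long format: one row per (item, attribute) pair. The attribute name is
# prefixed with the collection so Coll1.Background and Coll2.Background
# stay separate features.
long_df = pd.DataFrame([
    {"item": it["item"],
     "collection": it["collection"],
     "attribute": f'{it["collection"]}.{name}',
     "value": value}
    for it in items
    for name, value in it["attributes"].items()
])

# Wide format: one column per collection-qualified attribute.
# Attributes that do not exist for an item's collection become NaN.
wide_df = long_df.pivot(index="item", columns="attribute", values="value")

print(wide_df.shape)
print(wide_df.isna().sum(axis=1))
```

With only 2 collections this already gives 8 columns that are mostly empty per item; with 1.8k collections and up to 37 attributes each, the wide table becomes extremely sparse, which is the core of my problem.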
| Item | Collection | Background | Hat | Clothes |
|---|---|---|---|---|
| ItemA | Coll1 | Green | Cowboy | Casual |
| ItemB | Coll1 | Green | Firefighter | Gucci |
There is only 1 item in the whole collection with Clothes = Gucci, so this item should be recognized as rarer than the one with Clothes = Casual (hundreds of items have that value).
No 2 items inside the same collection can have the same combination of attributes.
My question is: how can I handle these attribute features (usually categorical, but sometimes numerical) whose number and type depend on the collection they belong to? How should I treat and encode these columns?
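To be explicit about what I mean by "rare", here is a minimal sketch of the per-collection frequency of each attribute value, again in Python with pandas. The items `ItemC` and `ItemD` and the `_freq` column names are hypothetical, and the 4-row sample stands in for the real 10k items. It only covers categorical attributes; a numeric attribute such as Power would need something else.

```python
import pandas as pd

# Hypothetical sample of Coll1 (the real collection has 10k items).
coll1 = pd.DataFrame({
    "item": ["ItemA", "ItemB", "ItemC", "ItemD"],
    "collection": ["Coll1"] * 4,
    "Background": ["Green", "Green", "Blue", "Green"],
    "Hat": ["Cowboy", "Firefighter", "Cowboy", "Cap"],
    "Clothes": ["Casual", "Gucci", "Casual", "Casual"],
})
attribute_cols = ["Background", "Hat", "Clothes"]

# Number of items in each row's collection.
collection_size = coll1.groupby("collection")["item"].transform("count")

freq = pd.DataFrame({"item": coll1["item"]})
for col in attribute_cols:
    # Number of items in the same collection sharing this attribute value.
    value_count = coll1.groupby(["collection", col])["item"].transform("count")
    # Lower frequency means a rarer value.
    freq[col + "_freq"] = value_count / collection_size

# Check that no attribute combination repeats inside a collection.
has_duplicates = coll1.duplicated(subset=["collection"] + attribute_cols).any()

print(freq)
print(has_duplicates)
```

In this sample ItemB gets the lowest `Clothes_freq` because it is the only item with Clothes = Gucci. Frequencies like these have the same meaning in every collection, but I do not know whether replacing the attributes with them is a reasonable encoding or whether it throws away too much information.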