How to handle features that depend on a category, where each category has a different set of features?

Question

Dataset description

My dataset features are:

Some features that are not important for this question
Price (target)
Collection (categorical feature; there are 1.8k collections)
Latest 10 prices (time series of the latest prices in the same collection)
Up to 37 attribute features (the number and type of attributes depend entirely on the collection)

I am trying to create a model to predict the price of the next sale.

Items belonging to different collections have different price ranges, but even inside the same collection some items are "rarer" than others, so these items should sell for a higher price.

Example

Let's say we only have 2 collections, Coll1 and Coll2, each with 10k items.

Coll1 has 3 different attributes: Background, Hat and Clothes

Coll2 has 5 different attributes: Background, Weapon, Vehicle, Pet and Power

As you can see, depending on the collection we end up with a different number of features as well as different types (Power is numeric, between 0 and 100). The attribute Background is not really shared between the two collections: its possible values may or may not overlap, so we cannot treat it as the same feature.
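To make the point about Background concrete, here is a minimal sketch of how I picture the raw data, assuming each item carries a dictionary of attributes. The item ItemX and all Coll2 values are made up for illustration, and `to_long_rows` is a hypothetical helper, not something from an existing library. Prefixing each attribute name with its collection keeps Coll1's Background and Coll2's Background apart:

```python
# Hypothetical items; ItemX and all Coll2 values are invented for illustration.
items = [
    {"item": "ItemA", "collection": "Coll1",
     "attributes": {"Background": "Green", "Hat": "Cowboy", "Clothes": "Casual"}},
    {"item": "ItemX", "collection": "Coll2",
     "attributes": {"Background": "Green", "Weapon": "Sword", "Vehicle": "Bike",
                    "Pet": "Dog", "Power": 87}},
]


def to_long_rows(item):
    """Flatten one item into (item, namespaced_attribute, value) rows."""
    rows = []
    for name, value in item["attributes"].items():
        # Prefix with the collection so that attributes with the same name
        # in different collections stay distinct features.
        key = f'{item["collection"]}.{name}'
        rows.append((item["item"], key, value))
    return rows


# One row per (item, attribute), regardless of how many attributes an item has.
long_rows = [row for item in items for row in to_long_rows(item)]

for row in long_rows:
    print(row)
```

This long format holds any number of attributes per item, but it is only a representation, not yet an encoding a model can consume.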

As mentioned above, even two items belonging to the same collection are not priced the same: a specific value of an attribute may appear on only one item. For example, take 2 items from Coll1:

Item	Collection	Background	Hat	Clothes
ItemA	Coll1	Green	Cowboy	Casual
ItemB	Coll1	Green	Firefighter	Gucci

There is only 1 item in the whole collection with Clothes = Gucci, so this item should be recognized as rarer than the one with Clothes = Casual (a value that hundreds of items share).
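One way to express this notion of rarity numerically is the relative frequency of each attribute value within its own collection. The sketch below is only an illustration under that assumption, not a claim that this is the right encoding. ItemC and ItemD are invented rows, the 4-item sample stands in for the full 10k-item collection, and `value_frequencies` and `rarity_features` are hypothetical helper names:

```python
from collections import Counter

# Hypothetical 4-item sample of Coll1 (the real collection has 10k items).
# ItemC and ItemD are invented; every attribute combination is unique.
coll1 = [
    {"item": "ItemA", "Background": "Green", "Hat": "Cowboy", "Clothes": "Casual"},
    {"item": "ItemB", "Background": "Green", "Hat": "Firefighter", "Clothes": "Gucci"},
    {"item": "ItemC", "Background": "Blue", "Hat": "Cowboy", "Clothes": "Casual"},
    {"item": "ItemD", "Background": "Red", "Hat": "Cowboy", "Clothes": "Casual"},
]
ATTRIBUTES = ["Background", "Hat", "Clothes"]


def value_frequencies(items, attributes):
    """Relative frequency of each value, per attribute, within one collection."""
    n = len(items)
    counts = {a: Counter(it[a] for it in items) for a in attributes}
    return {a: {v: c / n for v, c in counts[a].items()} for a in attributes}


def rarity_features(item, freqs):
    """Replace each categorical value by its in-collection frequency."""
    return {a: freqs[a][item[a]] for a in freqs}


freqs = value_frequencies(coll1, ATTRIBUTES)
features = {it["item"]: rarity_features(it, freqs) for it in coll1}

# Lower frequency means a rarer value.
print(features["ItemA"]["Clothes"], features["ItemB"]["Clothes"])
```

In this toy sample, Gucci gets a lower frequency than Casual, which matches the intuition that ItemB is the rarer item. It does not address numerical attributes such as Power, or the fact that the number of attributes differs between collections.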

No two items in the same collection can have the same combination of attributes.

My question is: how can I handle these attribute features (usually categorical, but sometimes numerical) whose number differs depending on the collection they belong to? How should I treat and encode these columns?

The following post could help you: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258 — kjetil b halvorsen, Oct 13 '22 at 02:54


0 Answers