
I have a clustering problem that I might solve with an algorithm based on Euclidean distance (e.g. K-Means).

One potential feature is the "hour" at which each user began an interaction. As part of feature scaling, I doubt that either keeping it as a number or one-hot encoding it makes sense, so I have been thinking about alternatives.

  • keeping it as a number feels wrong because it penalizes the first and last hours of the day (23 and 01 are both night-time hours only two hours apart, yet they would be 22 units apart);
  • one-hot encoding also seems wrong, given that Euclidean distance doesn't work well with binary features (every pair of distinct hours ends up equally far apart).

One (possibly crazy, or maybe not) idea I had is to convert each hour into 2D coordinates, as if the hours were laid out on a 24-hour analog clock centred at the origin. So midnight would have coordinates (0,1), midday (0,-1), 6am (1,0), 6pm (-1,0), etc.

This way, Euclidean distance would make more sense, because hours that are close in time would also be close in space. Does this make sense, and would it be considered a valid approach?
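To make the idea concrete, here is a minimal sketch of the clock mapping, assuming hours are given as numbers in [0, 24). The function names `hour_to_clock_xy` and `clock_distance` are made up for illustration; the sine/cosine orientation is chosen only so that it reproduces the coordinates listed above.

```python
import math


def hour_to_clock_xy(hour, period=24.0):
    """Map an hour of the day to a point on the unit circle.

    Orientation follows the question: midnight -> (0, 1), 6am -> (1, 0),
    midday -> (0, -1), 6pm -> (-1, 0).
    """
    angle = 2.0 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)


def clock_distance(h1, h2):
    """Euclidean (chord) distance between two hours after the mapping."""
    x1, y1 = hour_to_clock_xy(h1)
    x2, y2 = hour_to_clock_xy(h2)
    return math.hypot(x1 - x2, y1 - y2)


d_night = clock_distance(23, 1)     # two hours apart, across midnight
d_day = clock_distance(11, 13)      # two hours apart, across midday
d_opposite = clock_distance(0, 12)  # opposite sides of the clock
```

With this mapping, 23 and 01 are exactly as far apart as 11 and 13, and the largest possible distance (between hours 12 apart) is 2. The distance between two hours that differ by d hours around the clock is the chord length 2·|sin(πd/24)|, so it increases with the time difference up to 12 hours, but not linearly.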

Sycorax
rusiano
