Can anyone suggest a good way to approach the following machine learning problem in sports analytics?
The data set looks like this:
HomeTeam      AwayTeam         NoOfSpectators
AC Milan      FC Barcelona     56900
Real Madrid   Bayern Munchen   78900
The outcome variable is NoOfSpectators. Both HomeTeam and AwayTeam are categorical variables with about 50 levels each.
I know I can use one-hot encoding or label encoding, but what other options are worth trying?
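To be concrete about the two encodings I already know, here is a minimal sketch in plain Python using the two example rows above. The helper names `label_encode` and `one_hot_encode` are made up for illustration and are not from any library; in practice a library encoder would do the same job.

```python
# Illustrative sketch only: helper names are hypothetical, rows are the
# two example rows from the question.
rows = [
    ("AC Milan", "FC Barcelona", 56900),
    ("Real Madrid", "Bayern Munchen", 78900),
]

def label_encode(values):
    """Map each distinct level to an integer code (levels sorted for determinism)."""
    levels = sorted(set(values))
    mapping = {level: code for code, level in enumerate(levels)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct level, plus the level order."""
    levels = sorted(set(values))
    encoded = [[1 if v == level else 0 for level in levels] for v in values]
    return encoded, levels

home = [r[0] for r in rows]
home_codes, home_mapping = label_encode(home)
home_onehot, home_levels = one_hot_encode(home)
```

With about 50 levels per column, the one-hot version gives roughly 50 indicator columns for HomeTeam and another 50 for AwayTeam, while the label version gives a single integer column whose ordering is arbitrary.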
For example, should I use a model such as RandomForest or LightGBM that can handle categorical (factor) variables automatically?
Also, a row such as:
HomeTeam       AwayTeam       NoOfSpectators
AC Milan       FC Barcelona   56900
is the same as:
HomeTeam       AwayTeam       NoOfSpectators
FC Barcelona   AC Milan       56900
Given that, how do you suggest the data set should be structured / modeled before it is fed to an ML model?
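One way I could express that symmetry is to build an order-independent key for each pair of teams. This is only a rough sketch under the assumption stated above (that swapping home and away does not change the row); the name `fixture_key` is hypothetical, and the key deliberately throws away which team was at home.

```python
from collections import defaultdict

def fixture_key(home_team, away_team):
    """Order-independent key: the same two teams always give the same tuple."""
    return tuple(sorted((home_team, away_team)))

# Example rows from the question, including the swapped AC Milan / FC Barcelona row.
rows = [
    ("AC Milan", "FC Barcelona", 56900),
    ("FC Barcelona", "AC Milan", 56900),
    ("Real Madrid", "Bayern Munchen", 78900),
]

# Group attendance figures by unordered pair of teams.
spectators_by_fixture = defaultdict(list)
for home_team, away_team, spectators in rows:
    spectators_by_fixture[fixture_key(home_team, away_team)].append(spectators)
```

If the home/away distinction turns out to matter after all, the original two columns can be kept alongside this key rather than replaced by it.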
HomeTeam and AwayTeam? You might find some help in https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels (maybe a duplicate) – kjetil b halvorsen Aug 28 '18 at 09:54