
I need help ranking data (car models in this case) based on multiple variables. For some variables (e.g. MPG), higher is better. For others (e.g. car age), lower is better. Defects history has three possible statuses (Yes, No and Unknown). How can I rank these car models in a mathematical way?

Sample input:

Car Model   MPG   Car Age   Defects
A           25    6         Yes
B           40    2.5       No
C           23    5         Unknown

The intended output has a Rank column added. The ranks should be 1, 2, 3 without repetition, unless two models are equally good.

zZzZ
  • There are many ways to rank them: by an individual column, or by choosing an order of columns and sorting in that order. It may be easier to decide if you tell us why you need to rank them. Imagine you had a perfect ranking system: what would you do with it? If the next step is some sort of prediction, you may not need ranking as an intermediate step at all – Cryo Jan 24 '24 at 18:43

1 Answer


To achieve your goal, two tasks need to be solved:

  • rank different variables according to their "direction" (whether lower or higher values should obtain best rank)
  • combine multiple ranks into single Rank column

Obtaining variables ranks

For the first task I suggest cloning the original DataFrame and converting all variables to the same direction:

  • make sure to handle NaN values according to column and task logic
  • columns like Car Age can be left intact since they already follow "the lower the better" direction
  • columns like MPG should be inverted, e.g. by taking the reciprocal (assuming no zero values), so that they also follow "the lower the better" direction
  • categorical columns like Defects can be converted to ordered categorical type by providing the order of the options from best to worst: pd.Categorical(cars_df.Defects, ordered=True, categories=['No', 'Unknown', 'Yes'])

After that you can call cars_for_ranking_df.rank(axis=0) to obtain ranks for each variable. See the method argument in the pd.DataFrame.rank() documentation to choose how ties should be handled.
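The conversion steps above can be sketched roughly as follows, using the sample data from the question. This is only one possible reading of those steps: taking the reciprocal of MPG is one way to invert it, and replacing the ordered categorical with its integer codes is my own choice to keep every column numeric before ranking. The name variable_ranks is made up for illustration.

```python
import pandas as pd

# Sample data from the question, indexed by car model.
cars_df = pd.DataFrame(
    {
        "Car Model": ["A", "B", "C"],
        "MPG": [25, 40, 23],
        "Car Age": [6, 2.5, 5],
        "Defects": ["Yes", "No", "Unknown"],
    }
).set_index("Car Model")

# Work on a copy so the original values stay untouched.
cars_for_ranking_df = cars_df.copy()

# MPG: higher is better, so take the reciprocal (assumes no zero values).
cars_for_ranking_df["MPG"] = 1 / cars_for_ranking_df["MPG"]

# Car Age already follows "the lower the better", so it is left as is.

# Defects: order the categories from best to worst, then use the integer
# codes (No=0, Unknown=1, Yes=2). Missing values would get code -1 and
# need separate handling.
defects = pd.Categorical(
    cars_df["Defects"], ordered=True, categories=["No", "Unknown", "Yes"]
)
cars_for_ranking_df["Defects"] = defects.codes

# Rank every column independently; rank 1 is the best value.
variable_ranks = cars_for_ranking_df.rank(axis=0)
print(variable_ranks)
```

With this data, car B gets rank 1 in every column, since it has the highest MPG, the lowest age and no defects.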

Combining ranks

This task has multiple possible solutions, and the right choice will depend on your use case. Even then it might be subjective: is a new car with known defects better than an old (but fuel-efficient) car? It depends on who you ask.

One possible way to obtain final ranks is to average the ranks of all variables for each car, and then rank the resulting averages:

final_ranks = cars_for_ranking_df.rank().mean(axis=1).rank(method='dense')

method='dense' is used to make sure the final ranks are consecutive (1, 2, 3, ...) with no gaps after ties.
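As a rough end-to-end sketch with the sample data from the question, the averaging approach could look like this. The same-direction copy is rebuilt here in a compact form (a plain dict mapping for Defects is my shortcut, not a requirement), and ranked_df is a hypothetical name for the final table.

```python
import pandas as pd

# Sample data from the question.
cars_df = pd.DataFrame(
    {
        "MPG": [25, 40, 23],
        "Car Age": [6, 2.5, 5],
        "Defects": ["Yes", "No", "Unknown"],
    },
    index=pd.Index(["A", "B", "C"], name="Car Model"),
)

# Same-direction copy: lower is better in every column.
cars_for_ranking_df = pd.DataFrame(
    {
        "MPG": 1 / cars_df["MPG"],  # assumes no zero values
        "Car Age": cars_df["Car Age"],
        # Assumed order from best to worst: No, Unknown, Yes.
        "Defects": cars_df["Defects"].map({"No": 0, "Unknown": 1, "Yes": 2}),
    }
)

# Average the per-variable ranks of each car, then rank the averages.
final_ranks = cars_for_ranking_df.rank().mean(axis=1).rank(method="dense")

# Attach the result to the original table and sort best first.
ranked_df = cars_df.assign(Rank=final_ranks.astype(int)).sort_values("Rank")
print(ranked_df)
```

For this data the average ranks are 1 for B, about 2.33 for C and about 2.67 for A, so the final order is B, C, A. Two cars with the same average would share a rank, which matches the "no repetition unless equally good" requirement.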

Other options have been mentioned in these questions: