
I have an assignment in which we have to classify the cuisine and also return the top-5 recipes for a given input. I applied count vectorization (countVectorize.transformer()) to the following data and then used Jaccard distance to find the closest matches. Is this approach right, or are there better distance metrics for my purpose?

Dataset: https://www.kaggle.com/c/whats-cooking/data

{ "id": 24717, "cuisine": "indian", "ingredients": [ "tumeric", "vegetable stock", "tomatoes", "garam masala", "naan", "red lentils", "red chili peppers", "onions", "spinach", "sweet potatoes" ] },

1 Answer


Since each recipe's ingredients can be treated as a set, you can compute Jaccard distance on those sets directly. There is no need to apply count vectorization first.
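As a rough illustration of the set-based approach, here is a minimal sketch. The helper names `jaccard_distance` and `top_k_recipes` and the three toy recipes are made up for this example; only the `id`/`cuisine`/`ingredients` layout follows the record shown in the question.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of ingredients."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical here (a convention, not a rule).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def top_k_recipes(query, recipes, k=5):
    """Return the k recipes whose ingredient sets are closest to the query."""
    ranked = sorted(recipes, key=lambda r: jaccard_distance(query, r["ingredients"]))
    return ranked[:k]


# Toy data, invented for illustration only.
recipes = [
    {"id": 1, "cuisine": "indian",
     "ingredients": ["garam masala", "onions", "tomatoes", "spinach"]},
    {"id": 2, "cuisine": "italian",
     "ingredients": ["tomatoes", "basil", "olive oil", "garlic"]},
    {"id": 3, "cuisine": "mexican",
     "ingredients": ["tortillas", "black beans", "onions"]},
]

query = ["onions", "tomatoes", "garam masala"]
closest = top_k_recipes(query, recipes, k=2)
```

One possible way to reuse the same ranking for the classification part is to take a majority vote over the `cuisine` of the closest recipes, though that is only one option.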

Another option is to use pre-trained word embeddings. Each word is then represented by a dense vector, and any Minkowski distance or cosine distance can be used to compare those vectors.
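A minimal sketch of the embedding idea follows. The 3-dimensional vectors in `TOY_EMBEDDINGS` are invented stand-ins for real pre-trained embeddings, and averaging the ingredient vectors into one recipe vector is an assumption of this example, not something the answer prescribes.

```python
import math

# Hypothetical vectors for illustration; real pre-trained embeddings
# would have far more dimensions and come from a trained model.
TOY_EMBEDDINGS = {
    "onions": [0.9, 0.1, 0.0],
    "garlic": [0.8, 0.2, 0.1],
    "basil": [0.1, 0.9, 0.2],
    "sugar": [0.0, 0.1, 0.9],
}


def recipe_vector(ingredients, embeddings):
    """Average the vectors of the known ingredients (an assumed aggregation)."""
    vectors = [embeddings[i] for i in ingredients if i in embeddings]
    if not vectors:
        raise ValueError("none of the ingredients has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


savory = recipe_vector(["onions", "garlic"], TOY_EMBEDDINGS)
herby = recipe_vector(["garlic", "basil"], TOY_EMBEDDINGS)
sweet = recipe_vector(["sugar"], TOY_EMBEDDINGS)
```

Unlike Jaccard distance, this kind of comparison can treat different but related ingredients as partially similar, provided the embeddings capture that relationship.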

Brian Spiering