
Sorry if this is a really simple question, but I'm new to this and wondered if there's an easy way to do what I'm picturing.

Imagine I've got a bunch of people and I'm asking them what they've eaten in the last week. Each person will have a list of foods they've eaten, and a count of how many times they've eaten each one. I want a single number that measures how similar each person's food list is to everyone else's. Probably no one will have eaten something that no one else has eaten, but probably no one will have eaten every kind of food either.

This is data wrangling for a logistic regression machine learning model, so the resulting feature will need to be a continuous variable. I was going to count something like the number of different foods they've eaten, but that's probably not very predictive.

My actual dataset has 500,000 "people" and 48 different "kinds of food".

How would you do this please? Thanks!

  • May this help? https://stats.stackexchange.com/a/173669/3277 – ttnphns Mar 26 '23 at 18:18
  • 500K people is a great many to compute a distance matrix between. I expect you will need to do your analysis (such as clustering) on a much smaller subsample (or subsamples). – ttnphns Mar 26 '23 at 18:23
  • 2
    What about using the Jensen-Shannon Distance. The reason I mention that is because if the food lists are measured as proportions (hence sum to 1) then they can be interpreted as probability distributions (multinomial) and the JS distance is like the KL divergence which measures roughly the "distance" between distributions. – Demetri Pananos Mar 26 '23 at 18:27
  • 1
    Hey @DemetriPananos - they're not probabilities / sum to 1; they're just the raw frequencies of each category. – travelsandbooks Mar 26 '23 at 18:32
  • 2
    +1, @Demetri; I might add that besides chi-square distance and Jensen-Shannon distance you could check and try other compositional distance measures. Just to cite my document... – ttnphns Mar 26 '23 at 18:35
  • (cont.) "Bhattacharyya distance, Hellinger distance, Chi-square distance for probabilities, Pearson/Neyman chi-square divergence, Kullback–Leibler symmetric divergence (Jeffreys divergence), K-divergence (symmetric, Topsoe distance), Jensen difference (Information radius), Taneja distance. Harmonic mean similarity and Geometric mean similarity are also used for probability vectors" Find their formulas in !KO_proxqnt of "Various proximities" on my web-page. – ttnphns Mar 26 '23 at 18:36
  • ... you may always convert raw counts to compositional (probabilities). – ttnphns Mar 26 '23 at 18:37
  • A function of a collection of discrete variables will still be discrete. – Glen_b Mar 26 '23 at 20:58
  • 1
    I don't see why not use all the 48 binary variables as explanatory in your model. The data wrangling you are asking for is too vague. As others have already commented, you can do data wrangling in infinitely many ways. – utobi Mar 27 '23 at 10:00
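For concreteness, here is a minimal Python sketch of the Jensen-Shannon idea from the comments. The array shape matches the question (people × 48 foods), but the toy data and the step of comparing each person to the overall average profile are assumptions for illustration, not anything the commenters prescribed:

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    rng = np.random.default_rng(0)
    counts = rng.poisson(2.0, size=(5, 48))  # toy stand-in for the real 500k x 48 count matrix

    # Convert raw counts to compositions (rows sum to 1) so each row can be
    # treated as a multinomial probability distribution, as suggested above.
    probs = counts / counts.sum(axis=1, keepdims=True)

    # JS distance between two specific people...
    d01 = jensenshannon(probs[0], probs[1])

    # ...or, for a single continuous number per person without a 500k x 500k
    # matrix, the JS distance from each person to the average profile.
    avg = probs.mean(axis=0)
    scores = np.array([jensenshannon(p, avg) for p in probs])
    print(d01, scores)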

2 Answers


I've ended up going with the below:

  • Find the percent breakdown of all meals -- e.g., 10% were fish and chips, 1% were hot dogs, etc.

  • Multiply each person's frequency of each meal by that percent breakdown (e.g. if I ate 3 hot dogs, my number for hot dogs would be 3 * 0.01 = 0.03).

  • Calculate the mean and the sum of those values for each person.

I'm going to pass both into my model and work out whether either is predictive.
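A minimal pandas sketch of those three steps; the food names and counts below are made up for illustration:

    import pandas as pd

    df = pd.DataFrame(
        {"fish_and_chips": [4, 0, 2], "hot_dogs": [3, 1, 0], "salad": [0, 5, 2]},
        index=["alice", "bob", "carol"],
    )

    # 1. Percent breakdown of all meals across everyone.
    breakdown = df.sum(axis=0) / df.values.sum()

    # 2. Weight each person's meal frequencies by that breakdown
    #    (e.g. 3 hot dogs * 0.01 = 0.03).
    weighted = df * breakdown

    # 3. Mean and sum of the weighted values per person, to feed the model.
    features = pd.DataFrame({"mean": weighted.mean(axis=1),
                             "sum": weighted.sum(axis=1)})
    print(features)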

If I'm going absolutely down the wrong road, or if there's something I can do that would be much better, please let me know! Thank you!


Suppose you create a dataframe where every row represents a person, every column represents a food, and each value is the frequency of consumption. In mathematical terms, every row is then a vector, and you can compute the cosine of the angle between any two rows, i.e. their cosine similarity. This returns a continuous value, though I'm not sure it will represent exactly what you intend. Of course, you will need to encode the data as numbers first.
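A rough sketch of that idea with invented data. Since the full 500,000 × 500,000 pairwise matrix would be impractical, the last two lines get one continuous number per person by comparing each row to the average profile; that adaptation is an assumption of the sketch, not part of the answer itself:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    X = np.array([[4, 3, 0],
                  [0, 1, 5],
                  [2, 0, 2]])  # toy 3-person x 3-food frequency matrix

    # Pairwise cosine similarity between every pair of people.
    sims = cosine_similarity(X)

    # One continuous number per person: cosine similarity to the
    # average food profile across everyone.
    avg = X.mean(axis=0, keepdims=True)
    per_person = cosine_similarity(X, avg).ravel()
    print(per_person)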

otaku