What sampling approach should I use to characterise groups of different sizes?

Question

Suppose I have a large (~10M) group of games, played by a large (~100K) group of players. Games are 1v1 and scored, so each player has a rating (analogous to chess in these regards; the real dataset is this).

Player ratings are roughly normally distributed, but the ratings of the players who appear in games are biased:

Strong players play a lot more than typical or weak players
The game-matching algorithm matches strong players with strong players

I'd like to check whether strong players exhibit certain characteristics. For example, keeping the chess analogy, is it true that strong players tend to move the queen more? This is a hobby project, but I work enough around statisticians to know that the sampling approach matters. Alas, not enough to know what to use in this case :)

Naively, my plan would be:

bin players into 'rating buckets'
uniformly sample M players from each bucket
uniformly sample N games for each player
scatter plot (rating, ratio-of-queen-moves) for these games
do a Pearson correlation (to compare strategies, e.g. queen vs rook moves)
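
The plan above could be sketched in Python roughly as follows. This is only a minimal illustration using the standard library, not a recommendation of the design. The data layout (a `player_rating` dict and a `games_by_player` dict), the `queen_ratio` field, and all function and parameter names are assumptions made for the example, and the data in the usage section is synthetic.

```python
import math
import random
from collections import defaultdict


def two_stage_sample(player_rating, games_by_player, bucket_width,
                     m_players, n_games, seed=0):
    """Sample up to m_players per rating bucket, then up to n_games per player.

    player_rating:   dict player_id -> rating (hypothetical layout)
    games_by_player: dict player_id -> list of game records (hypothetical layout)
    Returns a list of (player_id, rating, game) tuples.
    """
    rng = random.Random(seed)

    # Step 1: bin players into fixed-width rating buckets.
    buckets = defaultdict(list)
    for player, rating in player_rating.items():
        buckets[int(rating // bucket_width)].append(player)

    sample = []
    for bucket in sorted(buckets):
        players = sorted(buckets[bucket])  # sorted so a given seed is reproducible
        # Step 2: uniformly sample players within the bucket.
        chosen = rng.sample(players, min(m_players, len(players)))
        for player in chosen:
            games = games_by_player.get(player, [])
            # Step 3: uniformly sample games for each chosen player.
            for game in rng.sample(games, min(n_games, len(games))):
                sample.append((player, player_rating[player], game))
    return sample


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Synthetic stand-in data, for illustration only.
    rng = random.Random(42)
    player_rating = {p: rng.gauss(1500, 300) for p in range(2000)}
    games_by_player = {
        p: [{"queen_ratio": rng.random()} for _ in range(rng.randint(1, 30))]
        for p in player_rating
    }
    rows = two_stage_sample(player_rating, games_by_player,
                            bucket_width=100, m_players=20, n_games=5)
    ratings = [rating for _, rating, _ in rows]
    ratios = [game["queen_ratio"] for _, _, game in rows]
    print(len(rows), pearson(ratings, ratios))
```

Note that drawing the same number of players from every bucket over-represents sparsely populated rating ranges relative to the player population, and that several games from one player are not independent observations. Both points affect how a pooled correlation from this sample should be interpreted.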

Does this sound plausible? Any particular terms or topics you recommend I read up on to better decide this myself? (e.g., AFAICT, what I described is neither 'stratified' nor 'clustered', but I don't know what to call it).

Binning a continuous variable means loss of information, so a big no-no if it can be avoided. You can sample players without rating buckets. If you think it's important to take player ratings into account when sampling (why?), then you can do weighted sampling. — dipetkov, Jun 29 '22 at 17:09
Making plots of the data is always a great place to start. And unless there are many features (how many variables per player?), it might be possible to look at all pairs of features. Note that Pearson's coefficient measures linear correlation. — dipetkov, Jun 29 '22 at 17:14
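
To illustrate the point about linearity, here is a small standard-library sketch that compares Pearson's coefficient with a rank-based (Spearman) coefficient on a relationship that is perfectly monotonic but not linear. The function names and the exponential toy data are assumptions chosen for the example and are not taken from the dataset in the question.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient: measures linear association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Toy data: y grows exponentially in x, so the relation is monotonic
    # but far from a straight line.
    xs = list(range(1, 21))
    ys = [math.exp(x / 2) for x in xs]
    print("Pearson: ", pearson(xs, ys))
    print("Spearman:", spearman(xs, ys))
```

On data like this the rank-based coefficient reaches 1 while Pearson's stays noticeably lower, which is one reason to look at the scatter plot before settling on a summary statistic.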
To sum up, you have cool data but not a specific question to ask of it (yet). So you might want to start with some exploratory data analysis (EDA). — dipetkov, Jun 29 '22 at 17:16
Just remembered a recent question that also proposed computing multiple pairwise Pearson's correlation coefficients and using the corresponding p-values to "scan" the data for "important" relationships: When not to look at p-value. It might be interesting to read more about why this isn't such a great idea. — dipetkov, Jun 30 '22 at 11:52
