Pandas representative sampling across multiple columns

Question

I have a dataframe which represents a population, with each column denoting a different quality/ characteristic of that person. How can I get a sample of that dataframe/ population, which is representative of the population as a whole across all characteristics.

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),
              'favourite_colour':c(colours, 650),
              'favourite_knight':c(knights, 650),
              'favourite_quality':c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

My question is this: How can I generate a sample dataframe (df_sample) such that the above i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

is true for all columns (except 'name_id') instead of just one of them? population of 650 with a sample size of 140 is approximately the sizes I'm working with so performance isn't too much of an issue. I'll happily accept solutions that take a couple of minutes to run as this will still be considerably faster than producing the above sample manually. Thank you for any help.

I'd like to know how to do this in R. – Kaleb Coberly Jun 16 '21 at 00:22 — Kaleb Coberly, Jun 16 '21 at 00:22

score 5 · Accepted Answer · answered Nov 23 '20 at 12:01

5

You create a combined feature column, weight that one and draw with it as weights:

df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

combined_weight = df['combined'].value_counts(normalize=True)

df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

df_sample = df.sample(140, weights=df['combined_weight'])

answered Nov 23 '20 at 12:01

Patrick Artner

48,339
8
43
63

I think you don't need the `normalize=True` – Dani Mesejo Nov 23 '20 at 12:43
1

@DaniMesejo afaik the normalize will simply do a full sum over all and divide each count by that so you get values (0,1] - as sample can either use those or simply the amounts as is - you are correct. It is not needed as `pd.samle()` provided you give it weights, will _If weights do not sum to 1, they will be normalized to sum to 1_. So normalization is done anyhow and the OP decided to do it directly when gathering the weights so I copied that over ;) – Patrick Artner Nov 23 '20 at 13:05
I'd like to know how to do this in R. – Kaleb Coberly Jun 16 '21 at 00:22
@Kaleb no R expertise here - maybe ask a new question, provide a dataset in R and link to this question for explanation purposes? – Patrick Artner Jun 16 '21 at 05:11
1

Thank you so much for this. Using `weights` in pandas sample is the way to draw representative distributions! – Iopheam Mar 26 '22 at 15:02

Pandas representative sampling across multiple columns

1 Answers1

Linked