
The topic is somewhat generic, but I will try to be as specific as possible.

In theory, we have a dataset that, being a survey, could be biased (geographically, by gender, ...). In this case there are about 100k respondents.

This dataset is a sample of people with their characteristics and a survey response variable, "bike_buyer", that measures the propensity to buy bicycles.

Question

This dataset is used to train an ML model. Prior to training the model:

Would it be correct to weight the dataset so that the distribution of its variables matches the correct theoretical distribution?

I have not seen publications on this type of methodology, so I wonder whether it is correct to do so.

I understand that this question is open-ended, since I cannot specify certain details, such as how the weighting is done or how the weights are obtained.

So it can first be answered generically with yes/no and why, and then with a comment or a reference to a methodology or article on the subject that demonstrates good practice.

Roger V.
  • 3,903
PeCa
  • 75
  • For a start, (re)weighting a sample to match population totals/means is known as "raking" in the survey literature, or "iterative proportional fitting". – David Thiessen Sep 07 '22 at 16:14
  • Note that while weighting can reduce any bias in estimates/predictions owing to the sampling procedure, it increases their variance. If your predictors control for sampling bias, it may well be counter-productive. – Scortchi - Reinstate Monica Oct 16 '22 at 08:19
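
The variance cost raised in the comment above is often summarized with Kish's approximate effective sample size, n_eff = (sum of weights)^2 / (sum of squared weights). The sketch below is only an illustration; the function name `kish_effective_n` is made up for this example and does not come from the thread.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size for a set of weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Equal weights lose nothing; unequal weights shrink the effective sample.
equal = kish_effective_n([1.0, 1.0, 1.0, 1.0])
unequal = kish_effective_n([3.0, 1.0])
print(equal, unequal)
```

The more variable the weights, the smaller this number gets relative to the actual number of respondents, which is one rough way to judge whether the bias correction is worth its cost.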

1 Answer


From the webpage "Raking Survey Data (a.k.a. Sample Balancing)" by Abt Associates:

A survey sample may cover segments of the target population in proportions that do not match the proportions of those segments in the population itself. The differences may arise, for example, from sampling fluctuations, from nonresponse, or because the sample design was not able to cover the entire target population. In such situations one can often improve the relation between the sample and the population by adjusting the sampling weights of the cases in the sample so that the marginal totals of the adjusted weights on specified characteristics agree with the corresponding totals for the population. This operation is known as raking or sample-balancing, and the population totals are usually referred to as control totals.

Raking assigns a weight value to each survey respondent such that the weighted distribution of the sample is in very close agreement with two or more marginal control variables. For example, in household surveys the control variables are typically sample design and socio-demographic variables. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when the convergence criterion is achieved. The resulting final weight may however exhibit considerable variability, with some sampling units having extremely low or high weights relative to most of the other sampling units. This leads to inflated sampling variances of the survey estimates. To combat this problem we enhanced the previously released IHB Raking macro by adding two weight trimming options that are implemented during the actual iterative process, allowing one to achieve convergence while controlling the highest and lowest weight values.
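
To make the iterative procedure described in the quote concrete, here is a minimal pure-Python sketch of raking (iterative proportional fitting). It is not the Abt Associates macro and does no weight trimming. The function name `rake`, the toy respondents, and the population shares are all hypothetical, chosen only for illustration. It assumes every target category appears at least once in the sample.

```python
from collections import defaultdict


def rake(rows, targets, base_weights=None, tol=1e-8, max_iter=100):
    """Adjust weights until each weighted margin matches its target share.

    rows: list of dicts, one per respondent.
    targets: {variable: {category: population share}}, shares sum to 1.
    Assumes every target category occurs in the sample.
    """
    weights = list(base_weights) if base_weights is not None else [1.0] * len(rows)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total per category of this variable.
            current = defaultdict(float)
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Rescale so this margin matches its target exactly.
            factors = {cat: shares[cat] * total / current[cat] for cat in shares}
            weights = [w * factors[row[var]] for row, w in zip(rows, weights)]
            max_gap = max(max_gap, max(abs(f - 1.0) for f in factors.values()))
        # Stop once a full pass over all margins barely changes the weights.
        if max_gap < tol:
            break
    return weights


def weighted_share(rows, weights, var, cat):
    """Weighted proportion of respondents with rows[var] == cat."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == cat)
    return hit / sum(weights)


# Hypothetical sample: 60% men and 50% north, versus assumed population
# margins of 50% men and 40% north.
rows = (
    [{"gender": "m", "region": "north"}] * 4
    + [{"gender": "m", "region": "south"}] * 2
    + [{"gender": "f", "region": "north"}] * 1
    + [{"gender": "f", "region": "south"}] * 3
)
targets = {
    "gender": {"m": 0.5, "f": 0.5},
    "region": {"north": 0.4, "south": 0.6},
}
weights = rake(rows, targets)
```

The resulting per-respondent weights could then be supplied to model training wherever the library accepts observation weights (for example, many scikit-learn estimators take a `sample_weight` argument in `fit`). Whether that helps depends on the trade-off between bias and variance noted in the comments on the question.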

Stackoverflow code

Sycorax
  • 90,934
PeCa
  • 75
  • Please ensure you clearly distinguish what you've written yourself from what you've copied from your source (for which you ought to provide a readable reference, not just a bare link). If your answer then consists entirely of a quote you might as well delete it & link to the source in a comment; or, better, add your own commentary and explanation of how it helps answer the question. See https://stats.stackexchange.com/help/referencing – Scortchi - Reinstate Monica Oct 15 '22 at 14:36
  • It was almost done; I think it is now in the desired format, with the quote and source link. – PeCa Oct 15 '22 at 18:02