
Assume a survey asks a question in year one that 100 respondents answer. In year two, the survey asks the identical question and 120 respondents answer, but only 80 of them took part in both surveys (these numbers are arbitrary, purely for creating an illustrative example). I gather from Time series with same variables but different respondents that this situation has produced what is called "repeated cross section data," but that answer only points to books or articles.

It is straightforward to calculate the margin of error for each year's survey on its own, but is there a statistical way to quantify something like a margin of error when there is an "overlap" of 80 respondents and 40 new respondents? In my field, surveys in the legal industry, reports simply state the increase or decrease from year to year and ignore the changed composition of the respondents answering the question. How should they characterize the year-two data to quantify the mixture of returning and new respondents?

Thank you for any guidance to this statistically challenged surveyor.

In response to the two comments: The first year obtained its participants from a large email campaign to law firm executives; only a small portion of those emailed completed the survey. The second year followed the same procedure and obtained completed surveys from the overlap group plus the others. There is no way to know why in the second year some of the first year folks dropped out or why the new group decided to take part. This is the way many surveys are conducted in the methodologically sloppy legal industry.

lawyeR
  • In addition to a repeated cross-section with overlap, this could also be unbalanced panel data, sometimes called a rotating panel. The Kitagawa-Oaxaca-Blinder decomposition has been extended to panel data, but it would be good to get a better idea of your setting and goals. Can you clarify how your survey sampling works and what you are trying to measure? Are the differences in composition from sampling, or attrition/death/birth, or non-response? – dimitriy Jan 27 '23 at 01:41
  • Like @dimitriy says, the answer will completely depend on how the data were sampled and why there's overlap. You'll get a much better answer if you can clearly describe that. – bschneidr Jan 27 '23 at 02:44
  • It's still not clear to me what is the response variable. A single question is being asked in the survey or multiple questions are being asked? Is your goal assessing how the responses are changed from the 1st year to the 2nd year for each question? Could you clarify the exact goal of the survey? – Amin Shn Jan 31 '23 at 22:04

1 Answer


Suppose you are surveying firms to learn how many data scientists they hired this year. You can decompose the difference in the average number of hires across the two surveys into three pieces. The first is the change in the composition of respondents; for example, you might have more large firms this year. The second is the change in the effect associated with those characteristics over time; for example, small firms might hire more this year than before. The third is the interaction of the two forces.
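The arithmetic behind this three-way split can be sketched in a few lines of Python. This is a toy illustration of the identity, not the Stata estimator; the firm-size shares and mean hires below are made up for the example:

```python
# Three-fold Kitagawa-Oaxaca-Blinder decomposition of a change in means,
# using made-up firm-size shares and mean hires for two survey waves.
# Reference quantities are taken from survey 0 (one common convention).

sizes = ["S", "M", "L", "XL"]

share0 = {"S": 0.25, "M": 0.25, "L": 0.25, "XL": 0.25}  # survey 0 composition
share1 = {"S": 0.15, "M": 0.20, "L": 0.30, "XL": 0.35}  # survey 1 composition
mean0 = {"S": 3.0, "M": 5.5, "L": 9.0, "XL": 12.0}      # mean hires, survey 0
mean1 = {"S": 6.8, "M": 9.3, "L": 11.6, "XL": 16.2}     # mean hires, survey 1

ybar0 = sum(share0[s] * mean0[s] for s in sizes)
ybar1 = sum(share1[s] * mean1[s] for s in sizes)
gap = ybar1 - ybar0

# (1) endowments: composition changed, effects held at survey-0 values
endowments = sum((share1[s] - share0[s]) * mean0[s] for s in sizes)
# (2) coefficients: effects changed, composition held at survey-0 values
coefficients = sum(share0[s] * (mean1[s] - mean0[s]) for s in sizes)
# (3) interaction: composition and effects moving together
interaction = sum((share1[s] - share0[s]) * (mean1[s] - mean0[s]) for s in sizes)

# The three pieces sum to the gap exactly, by construction.
assert abs(gap - (endowments + coefficients + interaction)) < 1e-12
print(gap, endowments, coefficients, interaction)  # gap and its three components
```

With a saturated dummy model like the one below, the regression coefficients are just these cell means, so the same identity carries over to the `oaxaca` output.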

Here's an example of a Kitagawa–Blinder–Oaxaca decomposition in Stata. First, I simulate the data and get it in the right shape for analysis:

. /* (1) Simulate Data */
. clear

. set seed 58347390

. set obs 4
Number of observations (_N) was 0, now 4.

. gen size = _n - 1

. label define size 0 "S" 1 "M" 2 "L" 3 "XL"

. lab val size size

. expand 25
(96 observations created)

. sort size

. gen id = _n

. gen ds_hires0 = rpoisson(3 + size*3)

. gen ds_hires1 = rpoisson(3 + size*3 + 2.5 + 1/(1 + size)^2)

. reshape long ds_hires, i(id size) j(survey)
(j = 0 1)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              100   ->   200
Number of variables                   4   ->   4
j variable (2 values)                     ->   survey
xij variables:
                  ds_hires0 ds_hires1   ->   ds_hires
-----------------------------------------------------------------------------


. list in 1/4, noobs clean

id   size   survey   ds_hires  
 1      S        0          2  
 1      S        1          6  
 2      S        0          3  
 2      S        1          4  
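For readers without Stata, the same wide-to-long reshape can be sketched in plain Python (toy rows mirroring the listing above; `pandas.wide_to_long` would do the same job on a real dataset):

```python
# Reshape wide records (one row per firm, one column per survey wave)
# into long format (one row per firm-wave), mirroring Stata's
# `reshape long ds_hires, i(id size) j(survey)`.

wide = [
    {"id": 1, "size": "S", "ds_hires0": 2, "ds_hires1": 6},
    {"id": 2, "size": "S", "ds_hires0": 3, "ds_hires1": 4},
]

long = [
    {"id": row["id"], "size": row["size"], "survey": j,
     "ds_hires": row[f"ds_hires{j}"]}
    for row in wide
    for j in (0, 1)
]

print(long[0])  # {'id': 1, 'size': 'S', 'survey': 0, 'ds_hires': 2}
```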

There are 100 firms, each surveyed twice, for a total of 200 observations. Next, I fit the model:

. /* (2) Ground Truth Regression */
. regress ds_hires i.size##i.survey, vce(cluster id)

Linear regression                               Number of obs     =        200
                                                F(7, 99)          =      66.96
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6653
                                                Root MSE          =     2.7933

                                (Std. err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
    ds_hires | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        size |
          M  |       2.44   .5189928     4.70   0.000     1.410206    3.469794
          L  |       5.88   .6295966     9.34   0.000     4.630744    7.129256
         XL  |       8.92   .6819116    13.08   0.000      7.56694    10.27306
             |
    1.survey |       3.72   .6166937     6.03   0.000     2.496346    4.943654
             |
 size#survey |
        M#1  |        .24   .9873697     0.24   0.808    -1.719156    2.199156
        L#1  |      -1.16   .9432946    -1.23   0.222    -3.031701    .7117011
       XL#1  |        .48   1.171307     0.41   0.683    -1.844128    2.804128
             |
       _cons |       3.12   .3535934     8.82   0.000     2.418394    3.821606
------------------------------------------------------------------------------


The coefficient on the survey dummy tells us that the average small firm hired 3.72 additional data scientists in the second survey. Here I clustered the standard errors to reflect that I surveyed each firm twice.

Now I drop at random some S and M firms from survey 0 and some L and XL firms from survey 1, leaving me with an unbalanced panel (61 firms in both surveys, 25 in the first only, and 14 in the second only). Then I fit the same model as above:

. /* (3) Unbalance the panel at random */
. gen missing = cond( ///
>         survey == 0 & (inlist(size,"S":size,"M":size) & runiform() > .75) | ///
>         survey == 1 & (inlist(size,"L":size,"XL":size) & runiform() > .5) ///
>         ,1,0)

. regress ds_hires ib0.size##i.survey if !missing, vce(cluster id)

Linear regression                               Number of obs     =        161
                                                F(7, 99)          =      52.08
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6335
                                                Root MSE          =     2.8216

                                (Std. err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
    ds_hires | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        size |
          M  |   2.049536   .5910518     3.47   0.001     .8767606    3.222311
          L  |   5.470588   .6543468     8.36   0.000     4.172222    6.768954
         XL  |   8.510588   .7052633    12.07   0.000     7.111193    9.909984
             |
    1.survey |   3.310588   .5859948     5.65   0.000     2.147847    4.473329
             |
 size#survey |
        M#1  |   .6304644   1.043286     0.60   0.547    -1.439642    2.700571
        L#1  |  -.2105882    .914476    -0.23   0.818    -2.025107    1.603931
       XL#1  |   1.516078   1.459585     1.04   0.301    -1.380054    4.412211
             |
       _cons |   3.529412   .3929032     8.98   0.000     2.749807    4.309017
------------------------------------------------------------------------------


. margins, over(survey) post

Predictive margins                                         Number of obs = 161
Model VCE: Robust

Expression: Linear prediction, predict()
Over:       survey

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      survey |
          0  |   8.046512   .2601406    30.93   0.000     7.530336    8.562687
          1  |      10.44   .3738436    27.93   0.000     9.698213    11.18179
------------------------------------------------------------------------------


. lincom _b[1.survey] - _b[0.survey]

( 1) - 0bn.survey + 1.survey = 0


------------------------------------------------------------------------------
             | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   2.393488   .4657015     5.14   0.000     1.469436    3.317541
------------------------------------------------------------------------------


This says that the average number of hires was about 8.05 in survey 0 and 10.44 in survey 1, a difference of about 2.39.

Now I decompose this gap into three pieces:

. qui tab size, gen(d) // generate dummy variables for size

. // KOB decomposition from the viewpoint of the second survey
. oaxaca ds_hires d2 d3 d4 if !missing, by(survey) swap threefold(reverse) cluster(id)

Blinder-Oaxaca decomposition Number of obs = 161

       1: survey = 1
       2: survey = 0

                                (Std. err. adjusted for 100 clusters in id)

              |               Robust
     ds_hires | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
Differential  |
 Prediction_1 |      10.44   .5584137    18.70   0.000     9.345529    11.53447
 Prediction_2 |   8.046512   .4306298    18.69   0.000     7.202493    8.890531
   Difference |   2.393488   .5297834     4.52   0.000     1.355132    3.431845
--------------+----------------------------------------------------------------
Decomposition |
   Endowments |  -1.435891   .3199881    -4.49   0.000    -2.063057   -.8087264
 Coefficients |    3.79588   .4441424     8.55   0.000     2.925377    4.666383
  Interaction |   .0334996   .2089723     0.16   0.873    -.3760786    .4430777
------------------------------------------------------------------------------


The first panel repeats the two survey averages from the margins output and their difference. The second panel gives the decomposition. The shift in the firm-size mix across surveys shrank the gap by 1.44 hires (the endowments effect), changes in hiring within each size class widened it by 3.80 (the coefficients effect), and the interaction of the two added .03. Together these sum to the total difference of 2.39 data scientists. The first two effects are statistically significant, while the third is not.
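As a quick sanity check, the three reported components do add back up to the overall difference. This is an exact accounting identity of the three-fold decomposition, so it should always hold up to rounding (numbers copied from the output above):

```python
# Three-fold decomposition identity:
# Difference = Endowments + Coefficients + Interaction.
endowments = -1.435891
coefficients = 3.79588
interaction = 0.0334996
difference = 2.393488  # Prediction_1 - Prediction_2 = 10.44 - 8.046512

total = endowments + coefficients + interaction
assert abs(total - difference) < 1e-5  # matches up to rounding in the output
print(round(total, 6))  # 2.393489
```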

dimitriy