
Assume a survey asks a question in year one that 100 respondents answer. In year two, the survey asks the identical question and 120 respondents answer, but only 80 of them took part in both surveys (these numbers are arbitrary, purely for creating an illustrative example). I gather from Time series with same variables but different respondents that this situation has produced what is called "repeated cross section data," but that answer only points to books or articles.

It is straightforward to calculate the margin of error for each year's survey on its own, but is there a statistical way to quantify something like a margin of error when there is an "overlap" of 80 respondents and 40 new respondents? In my field, surveys in the legal industry, reports simply state the increase or decrease from year to year and ignore the changed composition of the respondents answering the question. How should they characterize the year-two data to quantify the mixture of returning and new respondents?

Thank you for any guidance to this statistically challenged surveyor.

In response to the two comments: The first year obtained its participants from a large email campaign to law firm executives; only a small portion of those emailed completed the survey. The second year followed the same procedure and obtained completed surveys from the overlap group plus the others. There is no way to know why in the second year some of the first year folks dropped out or why the new group decided to take part. This is the way many surveys are conducted in the methodologically sloppy legal industry.

lawyeR
  • In addition to a repeated cross-section with overlap, this could also be unbalanced panel data, sometimes called a rotating panel. The Kitagawa-Oaxaca-Blinder decomposition has been extended to panel data, but it would be good to get a better idea of your setting and goals. Can you clarify how your survey sampling works and what you are trying to measure? Are the differences in composition from sampling, or attrition/death/birth, or non-response? – dimitriy Jan 27 '23 at 01:41
  • Like @dimitriy says, the answer will completely depend on how the data were sampled and why there's overlap. You'll get a much better answer if you can clearly describe that. – bschneidr Jan 27 '23 at 02:44
  • It's still not clear to me what is the response variable. A single question is being asked in the survey or multiple questions are being asked? Is your goal assessing how the responses are changed from the 1st year to the 2nd year for each question? Could you clarify the exact goal of the survey? – Amin Shn Jan 31 '23 at 22:04

1 Answer


Suppose you are surveying firms to learn how many data scientists they hired this year. You can decompose the difference in the average number of hires across the two surveys into three pieces. The first is the change in the composition of respondents; for example, you might have more large firms this year. The second is the change in the effect associated with those characteristics over time; for example, small firms might hire more this year than before. The third is the interaction of the two forces.
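The arithmetic behind this three-way split can be sketched in a few lines of Python. This is a toy illustration of the identity, not the Stata estimator; the firm-size shares and mean hires below are made up for the example:

```python
# Three-fold Kitagawa-Oaxaca-Blinder decomposition of a change in means,
# using made-up firm-size shares and mean hires for two survey waves.
# Reference quantities are taken from survey 0 (one common convention).

sizes = ["S", "M", "L", "XL"]

share0 = {"S": 0.25, "M": 0.25, "L": 0.25, "XL": 0.25}  # survey 0 composition
share1 = {"S": 0.15, "M": 0.20, "L": 0.30, "XL": 0.35}  # survey 1 composition
mean0 = {"S": 3.0, "M": 5.5, "L": 9.0, "XL": 12.0}      # mean hires, survey 0
mean1 = {"S": 6.8, "M": 9.3, "L": 11.6, "XL": 16.2}     # mean hires, survey 1

ybar0 = sum(share0[s] * mean0[s] for s in sizes)
ybar1 = sum(share1[s] * mean1[s] for s in sizes)
gap = ybar1 - ybar0

# (1) endowments: composition changed, effects held at survey-0 values
endowments = sum((share1[s] - share0[s]) * mean0[s] for s in sizes)
# (2) coefficients: effects changed, composition held at survey-0 values
coefficients = sum(share0[s] * (mean1[s] - mean0[s]) for s in sizes)
# (3) interaction: composition and effects moving together
interaction = sum((share1[s] - share0[s]) * (mean1[s] - mean0[s]) for s in sizes)

# The three pieces sum to the gap exactly, by construction.
assert abs(gap - (endowments + coefficients + interaction)) < 1e-12
print(gap, endowments, coefficients, interaction)  # gap and its three components
```

With a saturated dummy model like the one below, the regression coefficients are just these cell means, so the same identity carries over to the `oaxaca` output.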

Here's an example of a Kitagawa–Blinder–Oaxaca decomposition in Stata. First, I simulate the data and get it in the right shape for analysis:

. /* (1) Simulate Data */
. clear

. set seed 58347390

. set obs 4
Number of observations (_N) was 0, now 4.

. gen size = _n - 1

. label define size 0 "S" 1 "M" 2 "L" 3 "XL"

. lab val size size

. expand 25
(96 observations created)

. sort size

. gen id = _n

. gen ds_hires0 = rpoisson(3 + size*3)

. gen ds_hires1 = rpoisson(3 + size*3 + 2.5 + 1/(1 + size)^2)

. reshape long ds_hires, i(id size) j(survey)
(j = 0 1)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              100   ->   200
Number of variables                   4   ->   4
j variable (2 values)                     ->   survey
xij variables:
                  ds_hires0 ds_hires1   ->   ds_hires
-----------------------------------------------------------------------------


. list in 1/4, noobs clean

id   size   survey   ds_hires  
 1      S        0          2  
 1      S        1          6  
 2      S        0          3  
 2      S        1          4  
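For readers without Stata, the same wide-to-long reshape can be sketched in plain Python (toy rows mirroring the listing above; `pandas.wide_to_long` would do the same job on a real dataset):

```python
# Reshape wide records (one row per firm, one column per survey wave)
# into long format (one row per firm-wave), mirroring Stata's
# `reshape long ds_hires, i(id size) j(survey)`.

wide = [
    {"id": 1, "size": "S", "ds_hires0": 2, "ds_hires1": 6},
    {"id": 2, "size": "S", "ds_hires0": 3, "ds_hires1": 4},
]

long = [
    {"id": row["id"], "size": row["size"], "survey": j,
     "ds_hires": row[f"ds_hires{j}"]}
    for row in wide
    for j in (0, 1)
]

print(long[0])  # {'id': 1, 'size': 'S', 'survey': 0, 'ds_hires': 2}
```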

There are 100 firms, each surveyed twice, for a total of 200 observations. Next, I fit the model:

. /* (2) Ground Truth Regression */
. regress ds_hires i.size##i.survey, vce(cluster id)

Linear regression                               Number of obs     =        200
                                                F(7, 99)          =      66.96
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6653
                                                Root MSE          =     2.7933

                                (Std. err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
    ds_hires | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        size |
          M  |       2.44   .5189928     4.70   0.000     1.410206    3.469794
          L  |       5.88   .6295966     9.34   0.000     4.630744    7.129256
         XL  |       8.92   .6819116    13.08   0.000      7.56694    10.27306
             |
    1.survey |       3.72   .6166937     6.03   0.000     2.496346    4.943654
             |
 size#survey |
        M#1  |        .24   .9873697     0.24   0.808    -1.719156    2.199156
        L#1  |      -1.16   .9432946    -1.23   0.222    -3.031701    .7117011
       XL#1  |        .48   1.171307     0.41   0.683    -1.844128    2.804128
             |
       _cons |       3.12   .3535934     8.82   0.000     2.418394    3.821606
------------------------------------------------------------------------------


The coefficient on the survey dummy tells us that the average small firm hired 3.72 additional data scientists in the second survey. Here I clustered the standard errors to reflect that I surveyed each firm twice.

Now I drop at random some S and M firms from survey 0 and some L and XL firms from survey 1, leaving me with an unbalanced panel (61 firms in both surveys, 25 in the first only, and 14 in the second only). Then I fit the same model as above:

. /* (3) Unbalance the panel at random */
. gen missing = cond( ///
>         survey == 0 & (inlist(size,"S":size,"M":size) & runiform() > .75) | ///
>         survey == 1 & (inlist(size,"L":size,"XL":size) & runiform() > .5) ///
>         ,1,0)

. regress ds_hires ib0.size##i.survey if !missing, vce(cluster id)

Linear regression                               Number of obs     =        161
                                                F(7, 99)          =      52.08
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6335
                                                Root MSE          =     2.8216

                                (Std. err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
    ds_hires | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        size |
          M  |   2.049536   .5910518     3.47   0.001     .8767606    3.222311
          L  |   5.470588   .6543468     8.36   0.000     4.172222    6.768954
         XL  |   8.510588   .7052633    12.07   0.000     7.111193    9.909984
             |
    1.survey |   3.310588   .5859948     5.65   0.000     2.147847    4.473329
             |
 size#survey |
        M#1  |   .6304644   1.043286     0.60   0.547    -1.439642    2.700571
        L#1  |  -.2105882    .914476    -0.23   0.818    -2.025107    1.603931
       XL#1  |   1.516078   1.459585     1.04   0.301    -1.380054    4.412211
             |
       _cons |   3.529412   .3929032     8.98   0.000     2.749807    4.309017
------------------------------------------------------------------------------


. margins, over(survey) post

Predictive margins                                         Number of obs = 161
Model VCE: Robust

Expression: Linear prediction, predict()
Over:       survey

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      survey |
          0  |   8.046512   .2601406    30.93   0.000     7.530336    8.562687
          1  |      10.44   .3738436    27.93   0.000     9.698213    11.18179
------------------------------------------------------------------------------


. lincom _b[1.survey] - _b[0.survey]

( 1) - 0bn.survey + 1.survey = 0


------------------------------------------------------------------------------
             | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   2.393488   .4657015     5.14   0.000     1.469436    3.317541
------------------------------------------------------------------------------


This says that the average number of hires was about 8.05 in survey 0 and 10.44 in survey 1, a difference of about 2.39.

Now I decompose this gap into three pieces:

. qui tab size, gen(d) // generate dummy variables for size

. // KOB decomposition from the viewpoint of the second survey
. oaxaca ds_hires d2 d3 d4 if !missing, by(survey) swap threefold(reverse) cluster(id)

Blinder-Oaxaca decomposition Number of obs = 161

       1: survey = 1
       2: survey = 0

                                (Std. err. adjusted for 100 clusters in id)

              |               Robust
     ds_hires | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
Differential  |
 Prediction_1 |      10.44   .5584137    18.70   0.000     9.345529    11.53447
 Prediction_2 |   8.046512   .4306298    18.69   0.000     7.202493    8.890531
   Difference |   2.393488   .5297834     4.52   0.000     1.355132    3.431845
--------------+----------------------------------------------------------------
Decomposition |
   Endowments |  -1.435891   .3199881    -4.49   0.000    -2.063057   -.8087264
 Coefficients |    3.79588   .4441424     8.55   0.000     2.925377    4.666383
  Interaction |   .0334996   .2089723     0.16   0.873    -.3760786    .4430777
------------------------------------------------------------------------------


The first panel repeats the two survey averages from the margins output and their difference. The second panel gives the decomposition. The shift in the firm-size mix across surveys shrank the gap by 1.44 hires (the endowments effect), changes in hiring within each size class widened it by 3.80 (the coefficients effect), and the interaction of the two added .03. Together these sum to the total difference of 2.39 data scientists. The first two effects are statistically significant, while the third is not.
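As a quick sanity check, the three reported components do add back up to the overall difference. This is an exact accounting identity of the three-fold decomposition, so it should always hold up to rounding (numbers copied from the output above):

```python
# Three-fold decomposition identity:
# Difference = Endowments + Coefficients + Interaction.
endowments = -1.435891
coefficients = 3.79588
interaction = 0.0334996
difference = 2.393488  # Prediction_1 - Prediction_2 = 10.44 - 8.046512

total = endowments + coefficients + interaction
assert abs(total - difference) < 1e-5  # matches up to rounding in the output
print(round(total, 6))  # 2.393489
```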

dimitriy