How to compare equality of the distribution at different quantiles

Question

I have a dataset from which I am taking a set of descriptive statistics as follows:

The value measured is productivity of a firm for each of the group at different quantile (I use Stata command: table group_var, c(p10 p25 p50... ).

The below is repeated for two subsets of the main data (I have two sets of descriptive stats as below).

              p10   p25   p50   p75   p90
group1        50    45    43    34    10
group2         ......

What I want to do is to compare the two descriptive stats for statistical significance:

so there will be:

       p10_1   p10_2
group1  50       52
group2  52      ....
....

I want to determine whether these are different, for p50 I am using a ranksum (median) test in stata. for mean, running a t-test but struggle to find a method for values at different quantiles. Can someone suggest the right approach?

Thanks,

Paul

"Whether these are different" is ambiguous, so it will help to present a clearer statement of your null hypothesis. For instance, would it be $$H_0: F_{.10}=G_{.10}\text{ and } F_{.25}=G_{.25}\text{ and }\cdots\text{ and } F_{.90}=G_{.90}$$ where $F$ and $G$ are the two distributions and the subscripts represent their quantiles? With just your summaries, this will be both difficult to test and not very powerful. Do you have the original data? — whuber, Mar 29 '22 at 21:06

dimitriy · Accepted Answer · 2022-03-30T18:22:44.303

You can do this using quantile regression.

The code below does the single quantile case. It

estimates the q90 price for foreign and domestic cars with various repair records. Here origin is like your two cities and repair record is like your groups.
calculates the statistics within each origin $\times$ repair cell from the model, which should match the output of the table command.
tests the hypothesis that the q90 prices for each group are the same regardless of manufacturing origin.

Here's the output:

. sysuse auto, clear
(1978 automobile data)
. table rep78 foreign, statistic(p90 price) nototals

               |      Car origin    
               |  Domestic   Foreign

-------------------+--------------------
Repair record 1978 |

  1                |      4934

  2                |     14500

  3                |     13466      6295
  4                |      8814      9735
  5                |      4425     11995

. keep if rep78>2 & !missing(rep78)
(15 observations deleted)
. qreg price i.rep78##i.foreign, quantile(0.9) nolog
.9 Quantile regression                              Number of obs =         59
  Raw sum of deviations  39983.8 (about 11385)
  Min sum of deviations  32468.6                    Pseudo R2     =     0.1880

    price | Coefficient  Std. err.      t    P&gt;|t|     [95% conf. interval]

--------------+----------------------------------------------------------------
        rep78 |
           4  |      -4652   2130.507    -2.18   0.033    -8925.256    -378.744
           5  |      -9041   4056.365    -2.23   0.030    -17177.04   -904.9629
              |
      foreign |
     Foreign  |      -7171   3368.627    -2.13   0.038    -13927.61   -414.3889
              |
rep78#foreign |
   4#Foreign  |       8092   4261.014     1.90   0.063    -454.5121    16638.51
   5#Foreign  |      14741   5483.728     2.69   0.010     3742.034    25739.97
              |
        _cons |      13466   1065.254    12.64   0.000     11329.37    15602.63

. margins foreign#rep78, post // coeflegend
warning: cannot perform check for estimable functions.
Adjusted predictions                                        Number of obs = 59
Model VCE: IID
Expression: Linear prediction, predict()

          |            Delta-method
          |     Margin   std. err.      z    P&gt;|z|     [95% conf. interval]

--------------+----------------------------------------------------------------
foreign#rep78 |
  Domestic#3  |      13466   1065.254    12.64   0.000     11378.14    15553.86
  Domestic#4  |       8814   1845.073     4.78   0.000     5197.723    12430.28
  Domestic#5  |       4425   3913.991     1.13   0.258    -3246.282    12096.28
   Foreign#3  |       6295   3195.761     1.97   0.049     31.42429    12558.58
   Foreign#4  |       9735   1845.073     5.28   0.000     6118.723    13351.28
   Foreign#5  |      11995   1845.073     6.50   0.000     8378.723    15611.28

. test ///
> (_b[0.foreign#3.rep78] = _b[1.foreign#3.rep78]) ///
> (_b[0.foreign#4.rep78] = _b[1.foreign#4.rep78]) ///
> (_b[0.foreign#5.rep78] = _b[1.foreign#5.rep78])
( 1)  0bn.foreign#3bn.rep78 - 1.foreign#3bn.rep78 = 0
 ( 2)  0bn.foreign#4.rep78 - 1.foreign#4.rep78 = 0
 ( 3)  0bn.foreign#5.rep78 - 1.foreign#5.rep78 = 0
       chi2(  3) =    7.72
     Prob &gt; chi2 =    0.0522

The p-value on the two-sided null that the q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5 is .0522. This means that it is fairly unlikely that we would observe differences like this (or larger) if they were the same for each repair record group.

But you want to test more than one quantile at the same time, so you need to use sqreg for simultaneous-quantile regression. It produces the same coefficients as qreg for each quantile. Reported standard errors will be similar, but sqreg obtains an estimate of the VCE via bootstrapping, and the VCE includes between-quantile blocks. This lets you do tests comparing predictions at different quantiles:

. sysuse auto, clear
(1978 automobile data)
. table rep78 foreign, stat(p50 price) statistic(p90 price)  nototals

                |      Car origin    
                |  Domestic   Foreign

--------------------+--------------------
Repair record 1978  |

  1                 |

    50th percentile |    4564.5

    90th percentile |      4934

  2                 |

    50th percentile |      4638

    90th percentile |     14500

  3                 |

    50th percentile |      4749      4296
    90th percentile |     13466      6295
  4                 |

    50th percentile |      5705      6229
    90th percentile |      8814      9735
  5                 |

    50th percentile |    4204.5      5719
    90th percentile |      4425     11995

. keep if rep78>2 & !missing(rep78)
(15 observations deleted)
. sqreg price i.rep78##i.foreign, quantile(0.5 0.9) nolog
Simultaneous quantile regression                    Number of obs =         59
  bootstrap(20) SEs                                 .50 Pseudo R2 =     0.0574
                                                    .90 Pseudo R2 =     0.1880

          |              Bootstrap
    price | Coefficient  std. err.      t    P&gt;|t|     [95% conf. interval]

--------------+----------------------------------------------------------------
q50           |
        rep78 |
           4  |        956   410.4789     2.33   0.024     132.6835    1779.316
           5  |       -324   358.3828    -0.90   0.370    -1042.825    394.8248
              |
      foreign |
     Foreign  |       -453   1002.362    -0.45   0.653    -2463.483    1557.483
              |
rep78#foreign |
   4#Foreign  |        977   1286.285     0.76   0.451    -1602.962    3556.962
   5#Foreign  |       1747   1127.146     1.55   0.127    -513.7688    4007.769
              |
        _cons |       4749   326.4424    14.55   0.000      4094.24     5403.76
--------------+----------------------------------------------------------------
q90           |
        rep78 |
           4  |      -4652   1808.008    -2.57   0.013    -8278.405   -1025.595
           5  |      -9041   1618.042    -5.59   0.000    -12286.38   -5795.619
              |
      foreign |
     Foreign  |      -7171   2282.641    -3.14   0.003     -11749.4   -2592.601
              |
rep78#foreign |
   4#Foreign  |       8092    2949.62     2.74   0.008     2175.812    14008.19
   5#Foreign  |      14741     2256.1     6.53   0.000     10215.84    19266.16
              |
        _cons |      13466   1601.298     8.41   0.000      10254.2     16677.8

. margins foreign#rep78, predict(equation(q50)) predict(equation(q90)) post // coeflegend
Adjusted predictions                                        Number of obs = 59
Model VCE: Bootstrap
1._predict: Linear prediction, predict(equation(q50))
2._predict: Linear prediction, predict(equation(q90))

                   |            Delta-method
                   |     Margin   std. err.      z    P&gt;|z|     [95% conf. interval]

-----------------------+----------------------------------------------------------------
_predict#foreign#rep78 |
         1#Domestic#3  |       4749   326.4424    14.55   0.000     4109.185    5388.815
         1#Domestic#4  |       5705   330.9142    17.24   0.000      5056.42     6353.58
         1#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          1#Foreign#3  |       4296   975.0888     4.41   0.000     2384.861    6207.139
          1#Foreign#4  |       6229   860.8888     7.24   0.000     4541.689    7916.311
          1#Foreign#5  |       5719   990.1683     5.78   0.000     3778.306    7659.694
         2#Domestic#3  |      13466   1601.298     8.41   0.000     10327.51    16604.49
         2#Domestic#4  |       8814   1048.677     8.40   0.000     6758.631    10869.37
         2#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          2#Foreign#3  |       6295   1123.791     5.60   0.000     4092.411    8497.589
          2#Foreign#4  |       9735   1285.327     7.57   0.000     7215.806    12254.19
          2#Foreign#5  |      11995   1902.861     6.30   0.000     8265.462    15724.54

. test ///
> (_b[1._predict#0.foreign#3.rep78] = _b[1._predict#1.foreign#3.rep78]) ///
> (_b[1._predict#0.foreign#4.rep78] = _b[1._predict#1.foreign#4.rep78]) ///
> (_b[1._predict#0.foreign#5.rep78] = _b[1._predict#1.foreign#5.rep78]) ///
> (_b[2._predict#0.foreign#3.rep78] = _b[2._predict#1.foreign#3.rep78]) ///
> (_b[2._predict#0.foreign#4.rep78] = _b[2._predict#1.foreign#4.rep78]) ///
> (_b[2._predict#0.foreign#5.rep78] = _b[2._predict#1.foreign#5.rep78])
( 1)  1bn._predict#0bn.foreign#3bn.rep78 - 1bn._predict#1.foreign#3bn.rep78 = 0
 ( 2)  1bn._predict#0bn.foreign#4.rep78 - 1bn._predict#1.foreign#4.rep78 = 0
 ( 3)  1bn._predict#0bn.foreign#5.rep78 - 1bn._predict#1.foreign#5.rep78 = 0
 ( 4)  2._predict#0bn.foreign#3bn.rep78 - 2._predict#1.foreign#3bn.rep78 = 0
 ( 5)  2._predict#0bn.foreign#4.rep78 - 2._predict#1.foreign#4.rep78 = 0
 ( 6)  2._predict#0bn.foreign#5.rep78 - 2._predict#1.foreign#5.rep78 = 0
       chi2(  6) =   50.71
     Prob &gt; chi2 =    0.0000

The factor variable notation above is tricky, but it is just quantile $\times$ origin $\times$ repair record level. The coeflegend can be useful here for decoding, but I left it commented out.

Here we reject the two-sided null that the q50 and q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5: the p-value is effectively zero.

@dimitry Stata command version 16: table group_var, c(p10 p25 p50... ) generates let's say 15 different groups, so for every p10 I am going to have 15 different values. What I meant there is to compare these 15 values at p10 to another 15 values at p10 from another subset (e.g. city). Your test would be fine if I was comparing just a single value at p10 - but I can have group 1 to group 15 for one city and group 1 - 15 for another city. Then will have these at q5 a10 q25 etc. — Paul, Mar 28 '22 at 22:07
It's straightforward to test the joint hypothesis. You just need to type out all the terms, so something like test _b[1.group] = _b[2.group] = ... =_b[15.group]. — dimitriy, Mar 28 '22 at 22:15
Or equivalently margins group, post contrast without test after. — dimitriy, Mar 28 '22 at 22:22
But this is testing if these groups are different, and this is not what I am after. I want to test if the cities are different for all the given descriptive stats. each city will have a group, let's say size, and there will be multiple sizes and productivity will be different by each size band ( group1 - 20, group 2- 30, group 3 - 28 etc.). What I end up with is a set of descriptive stats of those size bands for each city - and I want to check whether these two are different at different q25 and q75. it sounds like taking your approach require comparing 2 regressions (if 2 cities to compare) — Paul, Mar 29 '22 at 09:02
I've made some edits in response to your comments. If I am still not getting what you have in mind, edit your question with more details about the structure of your data and the problem and I can take another look. — dimitriy, Mar 29 '22 at 20:27
The asymmetry of this procedure makes it suspect: which of the two groups should be selected as the explanatory variable and which as the response? — whuber, Mar 30 '22 at 16:30
@whuber I have all group x city (or repair record x origin) cells as explanatory variables and price as the response/outcome. Could you elaborate on what is suspect in that? — dimitriy, Mar 30 '22 at 16:34
As far as I can tell, the question asks to test a hypothesis of the form I articulated in a comment to the question. I cannot see how this quantile regression correctly addresses that kind of test. Perhaps the evidence is buried in the details of the Stata output, but frankly I don't want to wade through that in order to understand what you propose. — whuber, Mar 30 '22 at 16:35
@whuber I am doing your comment version in the second part at the bottom, though for two different quantiles. I broke it up for didactic purposes. — dimitriy, Mar 30 '22 at 16:39
The information is buried in the (arcane) Stata commands. How, specifically, are you doing the multiple testing? How are you accounting for the lack of independence among the various quantiles of each distribution? — whuber, Mar 30 '22 at 16:41
@whuber I added that info above. Let me know if it passes muster! — dimitriy, Mar 30 '22 at 18:18

score 0 · Answer 2 · answered Mar 24 '22 at 16:54

0

Instead of making quantiles, use the Kolmogorov-Smirnov test.

answered Mar 24 '22 at 16:54

chrishmorris

1,780

1

I am adding these quantiles as my summary stats to a paper and wanted to assess their equality. The test mentioned doesn't allow me to check the values at different quantiles for given groups but would perform a test on the entire two samples – Paul Mar 25 '22 at 11:54
For the paper, you could show the K-S graph. See: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test. This is definitely statistically better than quantizing then testing. However, depending on the journal, your readers might not understand this! – chrishmorris Apr 04 '22 at 10:42

How to compare equality of the distribution at different quantiles

2 Answers2

Linked