0

I have a dataset from which I am taking a set of descriptive statistics as follows:

The value measured is firm productivity, summarized for each group at different quantiles (I use the Stata command: table group_var, c(p10 p25 p50... )).

The table below is computed separately for two subsets of the main data, so I have two sets of descriptive statistics like this one.

              p10   p25   p50   p75   p90
group1        50    45    43    34    10
group2         ......

What I want to do is to compare the two descriptive stats for statistical significance:

so there will be:

       p10_1   p10_2
group1  50       52
group2  52      ....
....

I want to determine whether these are different. For p50 I am using a ranksum (median) test in Stata, and for the mean I am running a t-test, but I am struggling to find a method for values at the other quantiles. Can someone suggest the right approach?

Thanks,

Paul

Paul
  • 3
  • "Whether these are different" is ambiguous, so it will help to present a clearer statement of your null hypothesis. For instance, would it be $$H_0: F_{.10}=G_{.10}\text{ and } F_{.25}=G_{.25}\text{ and }\cdots\text{ and } F_{.90}=G_{.90}$$ where $F$ and $G$ are the two distributions and the subscripts represent their quantiles? With just your summaries, this will be both difficult to test and not very powerful. Do you have the original data? – whuber Mar 29 '22 at 21:06

2 Answers

1

You can do this using quantile regression.

The code below does the single quantile case. It

  1. estimates the q90 price for foreign and domestic cars with various repair records. Here origin is like your two cities and repair record is like your groups.
  2. calculates the statistics within each origin $\times$ repair cell from the model, which should match the output of the table command.
  3. tests the hypothesis that the q90 prices for each group are the same regardless of manufacturing origin.

Here's the output:

. sysuse auto, clear
(1978 automobile data)

. table rep78 foreign, statistic(p90 price) nototals


-------------------+--------------------
                   |      Car origin    
                   |  Domestic   Foreign
-------------------+--------------------
Repair record 1978 |                    
  1                |      4934          
  2                |     14500          
  3                |     13466      6295
  4                |      8814      9735
  5                |      4425     11995
----------------------------------------


. keep if rep78>2 & !missing(rep78)
(15 observations deleted)

. qreg price i.rep78##i.foreign, quantile(0.9) nolog

.9 Quantile regression                              Number of obs =         59
  Raw sum of deviations  39983.8 (about 11385)
  Min sum of deviations  32468.6                    Pseudo R2     =     0.1880

-------------------------------------------------------------------------------
        price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
        rep78 |
           4  |      -4652   2130.507    -2.18   0.033    -8925.256    -378.744
           5  |      -9041   4056.365    -2.23   0.030    -17177.04   -904.9629
              |
      foreign |
     Foreign  |      -7171   3368.627    -2.13   0.038    -13927.61   -414.3889
              |
rep78#foreign |
   4#Foreign  |       8092   4261.014     1.90   0.063    -454.5121    16638.51
   5#Foreign  |      14741   5483.728     2.69   0.010     3742.034    25739.97
              |
        _cons |      13466   1065.254    12.64   0.000     11329.37    15602.63
-------------------------------------------------------------------------------


. margins foreign#rep78, post // coeflegend
warning: cannot perform check for estimable functions.

Adjusted predictions                                Number of obs =         59
Model VCE: IID

Expression: Linear prediction, predict()

-------------------------------------------------------------------------------
              |            Delta-method
              |     Margin   std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
  Domestic#3  |      13466   1065.254    12.64   0.000     11378.14    15553.86
  Domestic#4  |       8814   1845.073     4.78   0.000     5197.723    12430.28
  Domestic#5  |       4425   3913.991     1.13   0.258    -3246.282    12096.28
   Foreign#3  |       6295   3195.761     1.97   0.049     31.42429    12558.58
   Foreign#4  |       9735   1845.073     5.28   0.000     6118.723    13351.28
   Foreign#5  |      11995   1845.073     6.50   0.000     8378.723    15611.28
-------------------------------------------------------------------------------


. test ///
>     (_b[0.foreign#3.rep78] = _b[1.foreign#3.rep78]) ///
>     (_b[0.foreign#4.rep78] = _b[1.foreign#4.rep78]) ///
>     (_b[0.foreign#5.rep78] = _b[1.foreign#5.rep78])

 ( 1)  0bn.foreign#3bn.rep78 - 1.foreign#3bn.rep78 = 0
 ( 2)  0bn.foreign#4.rep78 - 1.foreign#4.rep78 = 0
 ( 3)  0bn.foreign#5.rep78 - 1.foreign#5.rep78 = 0

       chi2(  3) =    7.72
     Prob > chi2 =    0.0522

The p-value on the two-sided null that the q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5 is .0522. This means that it is fairly unlikely that we would observe differences like these (or larger) if the true q90 prices of foreign and domestic cars were equal within each repair record group, although the result falls just short of significance at the conventional 5% level.
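For readers who want the mechanics, the joint test reported above is a Wald test on the posted margins. The following is a sketch in notation introduced here for illustration (the symbols $\hat m$, $\hat V$ and $R$ are assumptions and do not appear in the Stata output). Let $\hat m$ be the vector of the six cell predictions in the order listed by margins, $\hat V$ its estimated covariance matrix, and $R$ the $3 \times 6$ matrix whose rows take the Domestic minus Foreign difference within each repair record. Then

$$W = (R\hat m)^\top \left(R \hat V R^\top\right)^{-1} (R\hat m), \qquad W \overset{a}{\sim} \chi^2_3 \quad\text{under } H_0: Rm = 0.$$

Here $W = 7.72$, and the upper tail of the $\chi^2_3$ distribution beyond 7.72 gives the reported p-value of 0.0522.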

But you want to test more than one quantile at the same time, so you need to use sqreg for simultaneous-quantile regression. It produces the same coefficients as qreg for each quantile. Reported standard errors will be similar, but sqreg obtains an estimate of the VCE via bootstrapping, and the VCE includes between-quantile blocks. This lets you do tests comparing predictions at different quantiles:

. sysuse auto, clear
(1978 automobile data)

. table rep78 foreign, stat(p50 price) statistic(p90 price) nototals


--------------------+--------------------
                    |      Car origin    
                    |  Domestic   Foreign
--------------------+--------------------
Repair record 1978  |                    
  1                 |                    
    50th percentile |    4564.5          
    90th percentile |      4934          
  2                 |                    
    50th percentile |      4638          
    90th percentile |     14500          
  3                 |                    
    50th percentile |      4749      4296
    90th percentile |     13466      6295
  4                 |                    
    50th percentile |      5705      6229
    90th percentile |      8814      9735
  5                 |                    
    50th percentile |    4204.5      5719
    90th percentile |      4425     11995
-----------------------------------------


. keep if rep78>2 & !missing(rep78)
(15 observations deleted)

. sqreg price i.rep78##i.foreign, quantile(0.5 0.9) nolog

Simultaneous quantile regression                    Number of obs =         59
  bootstrap(20) SEs                                 .50 Pseudo R2 =     0.0574
                                                    .90 Pseudo R2 =     0.1880

-------------------------------------------------------------------------------
              |              Bootstrap
        price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
q50           |
        rep78 |
           4  |        956   410.4789     2.33   0.024     132.6835    1779.316
           5  |       -324   358.3828    -0.90   0.370    -1042.825    394.8248
              |
      foreign |
     Foreign  |       -453   1002.362    -0.45   0.653    -2463.483    1557.483
              |
rep78#foreign |
   4#Foreign  |        977   1286.285     0.76   0.451    -1602.962    3556.962
   5#Foreign  |       1747   1127.146     1.55   0.127    -513.7688    4007.769
              |
        _cons |       4749   326.4424    14.55   0.000      4094.24     5403.76
--------------+----------------------------------------------------------------
q90           |
        rep78 |
           4  |      -4652   1808.008    -2.57   0.013    -8278.405   -1025.595
           5  |      -9041   1618.042    -5.59   0.000    -12286.38   -5795.619
              |
      foreign |
     Foreign  |      -7171   2282.641    -3.14   0.003     -11749.4   -2592.601
              |
rep78#foreign |
   4#Foreign  |       8092    2949.62     2.74   0.008     2175.812    14008.19
   5#Foreign  |      14741     2256.1     6.53   0.000     10215.84    19266.16
              |
        _cons |      13466   1601.298     8.41   0.000      10254.2     16677.8
-------------------------------------------------------------------------------


. margins foreign#rep78, predict(equation(q50)) predict(equation(q90)) post // coeflegend

Adjusted predictions                                Number of obs =         59
Model VCE: Bootstrap

1._predict: Linear prediction, predict(equation(q50))
2._predict: Linear prediction, predict(equation(q90))

----------------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
_predict#foreign#rep78 |
         1#Domestic#3  |       4749   326.4424    14.55   0.000     4109.185    5388.815
         1#Domestic#4  |       5705   330.9142    17.24   0.000      5056.42     6353.58
         1#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          1#Foreign#3  |       4296   975.0888     4.41   0.000     2384.861    6207.139
          1#Foreign#4  |       6229   860.8888     7.24   0.000     4541.689    7916.311
          1#Foreign#5  |       5719   990.1683     5.78   0.000     3778.306    7659.694
         2#Domestic#3  |      13466   1601.298     8.41   0.000     10327.51    16604.49
         2#Domestic#4  |       8814   1048.677     8.40   0.000     6758.631    10869.37
         2#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          2#Foreign#3  |       6295   1123.791     5.60   0.000     4092.411    8497.589
          2#Foreign#4  |       9735   1285.327     7.57   0.000     7215.806    12254.19
          2#Foreign#5  |      11995   1902.861     6.30   0.000     8265.462    15724.54
----------------------------------------------------------------------------------------


. test ///
>     (_b[1._predict#0.foreign#3.rep78] = _b[1._predict#1.foreign#3.rep78]) ///
>     (_b[1._predict#0.foreign#4.rep78] = _b[1._predict#1.foreign#4.rep78]) ///
>     (_b[1._predict#0.foreign#5.rep78] = _b[1._predict#1.foreign#5.rep78]) ///
>     (_b[2._predict#0.foreign#3.rep78] = _b[2._predict#1.foreign#3.rep78]) ///
>     (_b[2._predict#0.foreign#4.rep78] = _b[2._predict#1.foreign#4.rep78]) ///
>     (_b[2._predict#0.foreign#5.rep78] = _b[2._predict#1.foreign#5.rep78])

 ( 1)  1bn._predict#0bn.foreign#3bn.rep78 - 1bn._predict#1.foreign#3bn.rep78 = 0
 ( 2)  1bn._predict#0bn.foreign#4.rep78 - 1bn._predict#1.foreign#4.rep78 = 0
 ( 3)  1bn._predict#0bn.foreign#5.rep78 - 1bn._predict#1.foreign#5.rep78 = 0
 ( 4)  2._predict#0bn.foreign#3bn.rep78 - 2._predict#1.foreign#3bn.rep78 = 0
 ( 5)  2._predict#0bn.foreign#4.rep78 - 2._predict#1.foreign#4.rep78 = 0
 ( 6)  2._predict#0bn.foreign#5.rep78 - 2._predict#1.foreign#5.rep78 = 0

       chi2(  6) =   50.71
     Prob > chi2 =    0.0000

The factor variable notation above is tricky, but it is just quantile $\times$ origin $\times$ repair record level. The coeflegend can be useful here for decoding, but I left it commented out.

Here we reject the two-sided null that the q50 and q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5: the p-value is effectively zero.
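The same pattern should extend to the five quantiles in the question. The following is a minimal sketch rather than output from an actual run: the seed value and reps(500) are arbitrary choices made here (the runs above used the default of 20 bootstrap replications), and the test is written directly on the sqreg coefficients instead of on the margins. Because the model is saturated, the foreign-domestic gap within every repair record is zero at a given quantile exactly when the foreign main effect and both interaction terms are zero in that quantile's equation.

```stata
* Sketch only: five quantiles instead of two, on the same example data.
sysuse auto, clear
keep if rep78 > 2 & !missing(rep78)

* Arbitrary seed so the bootstrap VCE is reproducible (an assumption, not
* part of the original answer); reps(500) replaces the default of 20.
set seed 20220329
sqreg price i.rep78##i.foreign, quantile(0.1 0.25 0.5 0.75 0.9) reps(500) nolog

* Collect the foreign main effect and both interactions from every
* equation (sqreg names them q10, q25, q50, q75, q90).
local terms
foreach q in q10 q25 q50 q75 q90 {
    local terms `terms' [`q']1.foreign [`q']4.rep78#1.foreign [`q']5.rep78#1.foreign
}

* Joint Wald test that all 15 coefficients are zero.
test `terms'
```

With cells as small as the ones in this example, neighboring quantiles can coincide, so Stata may drop some of the constraints as redundant. The bootstrap VCE from sqreg includes the between-quantile blocks, which is what accounts for the dependence among the quantiles in the joint test.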

dimitriy
  • 35,430
  • @dimitriy With the Stata 16 command table group_var, c(p10 p25 p50... ) I get, say, 15 different groups, so for p10 I have 15 different values. What I meant is comparing these 15 values at p10 with another 15 values at p10 from another subset (e.g. a city). Your test would be fine if I were comparing just a single value at p10, but I can have group 1 to group 15 for one city and group 1 to group 15 for another, and then the same again at q5, q10, q25, etc. – Paul Mar 28 '22 at 22:07
  • It's straightforward to test the joint hypothesis. You just need to type out all the terms, so something like test _b[1.group] = _b[2.group] = ... =_b[15.group]. – dimitriy Mar 28 '22 at 22:15
  • Or equivalently margins group, post contrast without test after. – dimitriy Mar 28 '22 at 22:22
  • 1
    But this is testing whether these groups are different, and this is not what I am after. I want to test whether the cities are different for all the given descriptive stats. Each city has a grouping, say size, with multiple size bands, and productivity differs by size band (group 1: 20, group 2: 30, group 3: 28, etc.). What I end up with is a set of descriptive stats for those size bands in each city, and I want to check whether the two sets differ at given quantiles, e.g. q25 and q75. It sounds like taking your approach would require comparing 2 regressions (if there are 2 cities to compare). – Paul Mar 29 '22 at 09:02
  • I've made some edits in response to your comments. If I am still not getting what you have in mind, edit your question with more details about the structure of your data and the problem and I can take another look. – dimitriy Mar 29 '22 at 20:27
  • works good. thanks! – Paul Mar 30 '22 at 15:34
  • The asymmetry of this procedure makes it suspect: which of the two groups should be selected as the explanatory variable and which as the response? – whuber Mar 30 '22 at 16:30
  • @whuber I have all group x city (or repair record x origin) cells as explanatory variables and price as the response/outcome. Could you elaborate on what is suspect in that? – dimitriy Mar 30 '22 at 16:34
  • As far as I can tell, the question asks to test a hypothesis of the form I articulated in a comment to the question. I cannot see how this quantile regression correctly addresses that kind of test. Perhaps the evidence is buried in the details of the Stata output, but frankly I don't want to wade through that in order to understand what you propose. – whuber Mar 30 '22 at 16:35
  • @whuber I am doing your comment version in the second part at the bottom, though for two different quantiles. I broke it up for didactic purposes. – dimitriy Mar 30 '22 at 16:39
  • The information is buried in the (arcane) Stata commands. How, specifically, are you doing the multiple testing? How are you accounting for the lack of independence among the various quantiles of each distribution? – whuber Mar 30 '22 at 16:41
  • @whuber I added that info above. Let me know if it passes muster! – dimitriy Mar 30 '22 at 18:18
  • Got it--thank you! (+1) – whuber Mar 30 '22 at 21:25
0

Instead of computing quantiles and testing them one by one, use the two-sample Kolmogorov-Smirnov test, which compares the two distributions as a whole.
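In Stata this is the ksmirnov command. A minimal sketch, assuming the same auto example data used in the other answer, with car origin standing in for the two subsets being compared and repair record for the groups:

```stata
* Sketch only: two-sample Kolmogorov-Smirnov test of equal price
* distributions for domestic and foreign cars.
sysuse auto, clear
ksmirnov price, by(foreign)

* The same comparison within a single group (here repair record 4).
ksmirnov price if rep78 == 4, by(foreign)
```

Note that this tests equality of the entire distributions, so it does not say which particular quantiles differ.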

chrishmorris
  • 1,780
  • 1
    I am adding these quantiles as my summary stats to a paper and wanted to assess their equality. The test mentioned doesn't allow me to check the values at different quantiles for given groups but would perform a test on the entire two samples – Paul Mar 25 '22 at 11:54
  • For the paper, you could show the K-S graph. See: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test. This is definitely statistically better than quantizing then testing. However, depending on the journal, your readers might not understand this! – chrishmorris Apr 04 '22 at 10:42