6

This is probably a far too basic question for this board - but on the other hand, I know I'll get good answers. "Stats 101" is a metaphor, by the way. I'm asking for help with my work, not my homework!

I am looking at aggregate financial data for hospitals. I have identified two hospital systems that accumulate unusually large operating surpluses (profits) compared to their peers - in the 8% to 12% range, when the standard for a non-profit hospital is 3%. This amounts to hundreds of millions of dollars after expenses. I created a metric by dividing these profits by annual case-adjusted admissions, and the results rule out volume or patient mix as reasons for the difference. I've also looked at expenses, and they are about the same as at peer hospitals, so low expenses are not an explanation either. This leaves pricing as the remaining explanation for the difference.

Only aggregate data is available - I do not have case-level data. By simply ranking my list of 85 hospitals, the annual "profit per patient" for these two hospitals rises to the top of the list. The gap between these two hospitals and the rest is great enough that I am certain the variance would be statistically significant if I ran the right test. I'd like to do that - show that it is highly unlikely that this is chance variation.

Can you recommend the best test to run on these figures? By the way, I do not have access to SPSS or SAS through my employer, so I'd likely be trying this in Excel or possibly Access.

  • By the way, when I say there is not case-level data, I mean that I do not have specific data attached to specific patients. I have a record of number of admissions and aggregate financial data. My unit of analysis is a hospital. – Vince Kueter Apr 03 '12 at 20:31

3 Answers

5

A couple of things you can do to confirm that these are really oddballs. They actually might not be, since someone has to rank #1 and #2.

(1) Express the profit ratios as a multiplier (1+rateOfReturn) and plot them to see if they follow some likely distribution (you might start with a Q-Q plot for normality, and a Q-Q plot on log(1+rateOfReturn) for log-normality). There's a good chance your top 2 fall right in line with a log-normal distribution. But maybe not, and you're on to something.

(2) Fit a multiple regression model (it's in the Data Analysis add-in for Excel) to predict the rate of return based on possible contributing factors, e.g. case loads, patient mix, etc. If your two hospitals are really unusual, they will have very large regression residuals.

  • I like these ideas. Also, some less visually-oriented people will prefer to check (if normality holds) whether these 2 cases are an unusual number of standard deviations beyond the mean. There are also tests for outliers, though it seems they're controversial. – rolando2 Apr 03 '12 at 23:23
  • Thanks Mike! Those are good ideas. The multiple regression would be a bit complicated because while I could have developed separate variables for the environment, I've chosen the easy way out and used a case mix index that's already been created to normalize the differences in case mix. I like the plotting idea. Is there something simpler, like an ANOVA, that would be appropriate for this data? I'm looking for some sort of significance score that says that the Profit per Patient is far higher than the typical hospital (out of my sample of 84) than random variance would predict. – Vince Kueter Apr 06 '12 at 16:17
3

Before using Excel for something like this, first read the Spreadsheet Addiction page.

One problem you will have with whatever analysis you do is that you first identified the top 2 as unusual and then want to test them. This will always invite more skepticism than if you had formulated the question before looking at the data.

Also, you should look for additional potential explanations. You divided by admissions - did these hospitals have fewer admissions? (Dividing by a small number makes the ratio look bigger by chance.) Also look at the sizes of the hospitals: small groups will vary more in aggregate statistics than large groups will. If you have several big hospitals and a few small ones, then some of the small ones will look larger or smaller due to chance and higher variation.

Given all that, there could still be some possibilities. The simplest ones arise if you can make reasonable assumptions about what the distribution should look like (but normal does not seem reasonable here). If there is no obvious distribution, you could still estimate one. One possibility is to estimate a distribution based on the 83 lower values and the fact that there are 2 higher values; the logspline package for R has one possible way to do this. Then you can generate random samples of size 85 from this distribution (this assumes all the hospitals come from the same distribution, i.e. similar size, etc.), and in each sample compare the distance between the top 2 and the rest, to see how that compares to your actual data.
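As a rough stand-in for the logspline idea, here is a Python sketch that fits a simple normal to the 83 lower values (logspline in R would estimate the shape nonparametrically instead) and simulates how often a gap as large as the observed one appears by chance; all figures below are invented:

```python
import random
import statistics

random.seed(2)

# Hypothetical profit-per-patient figures -- substitute the real 85 values.
values = [random.gauss(300, 60) for _ in range(83)] + [700, 650]

# Crude stand-in for the logspline fit: estimate a normal from the
# 83 lower values only.
lower = sorted(values)[:-2]
mu, sigma = statistics.mean(lower), statistics.stdev(lower)

# Observed gap between the top 2 and the rest.
obs_gap = sorted(values)[-2] - sorted(values)[-3]

# Draw many samples of size 85 and count how often such a gap appears.
n_sim, hits = 2000, 0
for _ in range(n_sim):
    sim = sorted(random.gauss(mu, sigma) for _ in range(85))
    if sim[-2] - sim[-3] >= obs_gap:
        hits += 1
print(f"Simulated p-value for the observed gap: {hits / n_sim:.4f}")
```

A small simulated p-value says the separation between the top 2 and the rest is unlikely under the fitted distribution; the same loop works unchanged once a better-fitting distribution is plugged in.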

Even better would be to simulate the whole process of deciding how many outliers there are and then testing those outliers, but it is less clear how to automate this and what assumptions would be needed.

Greg Snow
  • Maybe this will motivate me to download and learn R. I like this approach. The ACMVU multiplier that I am using does a pretty good job of accounting for differences in hospitals and the most of them are of similar size. We began by noting unusually large surpluses at these two hospital systems based on industry benchmarks before we compared them to peer hospitals in the state, so it wasn't a matter of simply picking off the top of the ranking. Thanks! You've been a big help! – Vince Kueter Apr 06 '12 at 19:52
1

It is difficult to use a significance test for this problem (i.e., one data vector with unusual observations). But, now that I describe the problem that way, I have an idea:

You want to know whether these two data points are clear outliers (i.e., conceptually what Mike Anderson says). The easiest way to do this is to make a boxplot of your data and see whether they can be considered outliers (i.e., whether they fall outside the whiskers). Following Tukey (1977), the fences lie 1.5 times the interquartile range beyond the quartiles; points beyond 3 times the interquartile range are "far out". Both criteria appear in Tukey (1977), and I tend to use the more extreme one to classify outliers.
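The fence calculation is easy to script; a sketch in Python, with made-up figures standing in for the profit-per-patient data:

```python
import statistics

# Hypothetical profit-per-patient figures -- substitute the real values.
values = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.2, 3.4, 2.6, 11.5, 9.8]

q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
iqr = q3 - q1

# Tukey's fences sit beyond the quartiles:
# 1.5 * IQR marks "outside" values, 3 * IQR marks "far out" values.
far_out_hi = q3 + 3 * iqr
outliers = [v for v in values if v > far_out_hi]
print("Far-out high values:", outliers)
```

Anything in `outliers` would be drawn beyond the whiskers on the boxplot.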

If your data are approximately normal (check with either a histogram or a Q-Q plot), you can simply use the standard normal distribution to see whether a data point is an outlier. You need to transform your data into z-scores (i.e., each value minus the mean, divided by the standard deviation). If the z-score is extreme enough (more extreme than 1.96 for a 5% significance level, or more extreme than 2.58 for 1%) you can even say that the data point is a significant outlier. This approach is described in Tabachnick and Fidell, I think in chapter 4.
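The z-score check takes only a few lines; a sketch in Python with invented figures in place of the real data:

```python
import statistics

# Hypothetical figures -- substitute the real 85 values.
values = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.2, 3.4, 2.6, 11.5]

mean = statistics.mean(values)
sd = statistics.stdev(values)          # sample standard deviation

# z-score: (value - mean) / standard deviation.
z = [(v - mean) / sd for v in values]
flagged = [v for v, s in zip(values, z) if abs(s) > 1.96]  # ~5% two-tailed
print("Flagged as outliers:", flagged)
```

Note that with a very extreme point in a small sample, the point itself inflates the mean and standard deviation, so its z-score is capped; with N = 85 that effect is mild.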

UPDATE: Given that you have an N of 85, it is perhaps better to treat only really rare cases as outliers (i.e., a z-score more extreme than 3.29, which corresponds to p < .001). I would simply write in the text that there were clear outliers that were deleted, and put in a footnote: these cases had extremely high z-scores (above 3.29, p < .001).

Henrik
  • Once ranking began to be discussed, I thought about z-scores. That sounds like a good approach. I'm leaning toward that as the answer - there are outliers on both ends, but for a small sample, this is a fairly normal distribution. Now thinking about how to explain that to a general audience and a technical audience at the same time with a one-liner in the text and details in a footnote and/or technical appendix. – Vince Kueter Apr 06 '12 at 19:41
  • If you like my answer you can also upvote it. – Henrik Apr 06 '12 at 19:59