To use or not to use a variable that contains information w.r.t. uncertainty

Question

The data I'm looking at is concerned with percentages of people who would recommend a hospital.

E.g. 
               NOT         PROBABLY    DEFINITELY          FREQ
hospital_1     10.0        80.0        10.0                100
hospital_2     20.0        50.0        30.0                200

We'd like to do the analysis at the hospital level, so I converted the percentages into a weighted score (for example the arbitrarily chosen NOT * 0 + PROBABLY * 1 + DEFINITELY * 2; the exact weights do not matter), which yields a single variable called "average_recommendation_score".
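For concreteness, here is a minimal sketch of that conversion, assuming the 0/1/2 weights mentioned above. The function name and the dictionary layout are illustrative only, not part of any existing code.

```python
# Illustrative weights from the question: NOT -> 0, PROBABLY -> 1, DEFINITELY -> 2.
WEIGHTS = (0, 1, 2)


def average_recommendation_score(percentages, weights=WEIGHTS):
    """Collapse one hospital's answer percentages into a single weighted score."""
    total = sum(percentages)
    if total <= 0:
        raise ValueError("percentages must sum to a positive number")
    # Dividing by the total (100 for clean data) turns percentages into shares.
    return sum(p * w for p, w in zip(percentages, weights)) / total


# Rows from the example table: (NOT, PROBABLY, DEFINITELY) and FREQ.
hospitals = {
    "hospital_1": ((10.0, 80.0, 10.0), 100),
    "hospital_2": ((20.0, 50.0, 30.0), 200),
}

scores = {
    name: average_recommendation_score(pct)
    for name, (pct, _freq) in hospitals.items()
}
print(scores)
```

Note that FREQ plays no role in the score itself, which is exactly why the question of whether to use it separately arises.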

I expect this score to be unbiased regardless of the number of people who filled in the survey (FREQ), at least in the limit. However, in an analysis where this score might be the dependent variable, should we also use the FREQ variable when estimating the parameters? It feels as though the uncertainty about the average_recommendation_score carries extra information: a maximum score is probably not the "real" score if it is based on only one participant's vote.

Is it OK to disregard the FREQ variable or should it / could it be incorporated somehow yielding a single variable? If it should be, how?

Note: The goal is inference using some variables that are assumed to be predictors.

Alecos Papadopoulos · Answer 1 · 2014-08-02T20:36:47.327

From what I understand, you want to postulate a relationship (linear for exposition's sake)

$$q_i^* = \mathbf x_i'\beta + u_i,\;\; E(u_i\mid \mathbf X)=0,\;\; i=1,...,n$$

where $q_i^*$ is a "quality score" of hospital $i$, and $\mathbf x_i'$ contains a series of regressors that in some way explain "objectively" this score (say, number of qualified staff, age of equipment, rate of in-hospital infections etc).

But what you have available for the dependent variable is an imperfect measurement of it, obtained through the subjective evaluation of a sample of people who have used the services of each hospital. So instead of $q^*_i$ you have the series $q_i$. Now you need to make some assumptions.

First we assume that people perceive unbiasedly what the results of the "objective determinants of quality" are, and so "on average" they will provide a reliable measurement. This implies the relationship

$$q^*_i = q_i + \epsilon_i, \;\;E(\epsilon_i) =0,$$

where $\epsilon_i$ is an i.i.d. error term, uncorrelated with the measurement.

Now each $q_i$ is calculated as

$$q_i = \sum_{k=0}^{h}\frac {m_{ik}}{m_i}k$$

where the set $\{0,...,h\}$ is the set of values to which you map the evaluation of each person for hospital $i$, $m_i$ is the total number of persons that expressed an opinion for hospital $i$, and $m_{ik}$ is the number among them that evaluated it with a score $k$. Given our assumption we have

$$q_i = \sum_{k=0}^{h}\hat p_{ik}k =\hat E(q^*_i),\;\; E(q_i) = E[\hat E(q^*_i)]=E(q^*_i)$$

So $q_i$ is an estimator of the expected value of $q^*_i$, and as such, it has a variance. And its variance will depend inversely on the size of the sample on which it is based. We have

$$\operatorname{Var}[q_i] = \operatorname{Var}\left(\sum_{k=0}^{h}\frac {m_{ik}}{m_i}k\right) = \frac 1{m_i^2}\operatorname{Var}\left(\sum_{k=0}^{h} m_{ik}k\right)$$

If we are willing to assume that $\operatorname{Var}\left(\sum_{k=0}^{h} m_{ik}k\right)$ is constant across $i$, then we can write

$$\operatorname{Var}[q_i] = \frac {\alpha^2}{m_i^2}$$

meaning that the variance will be different for each $i$, because of the different sample sizes. This in turn implies that in the feasible specification that we can estimate,

$$q_i = \mathbf x_i'\beta + u_i,\;\; E(u_i\mid \mathbf X)=0,\;\;i=1,...,n$$

we have

$$\operatorname{Var}[q_i \mid X] = \operatorname{Var}[u_i \mid X] = \sigma^2_i=\frac {\alpha^2}{m_i^2}$$

and you have a model with conditional heteroskedasticity, of known functional form, a functional form that arises consistently from the characteristics of the model.
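To put numbers on this, take the FREQ values from the question's example table, $m_1 = 100$ and $m_2 = 200$. Under the constant-$\alpha$ assumption made above, the two conditional variances compare as

$$\sigma_1^2 = \frac{\alpha^2}{100^2},\;\; \sigma_2^2 = \frac{\alpha^2}{200^2},\;\; \frac{\sigma_1^2}{\sigma_2^2} = \frac{200^2}{100^2} = 4$$

so, in this specification, the score of the hospital with half the respondents has four times the error variance.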

Then, the road forward is not to include the "sample size" as a regressor, but to use it to estimate the conditional variances.

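If you are willing to make a distributional assumption for $u_i$, you can estimate this in one take by using maximum likelihood (the $\alpha$ will be estimated alongside the $\beta$'s). Alternatively, since the conditional variance is known up to the constant $\alpha^2$, you can use weighted (generalized) least squares with weights proportional to $m_i^2$.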
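Because the conditional variance is known up to the constant $\alpha^2$, one least-squares route is to scale each observation by $m_i$, which makes the transformed errors homoskedastic, and then run ordinary least squares. The following is only a sketch under that variance assumption. The function name `wls_known_weights` and the simulated data are hypothetical and are not taken from the question.

```python
import numpy as np


def wls_known_weights(X, q, m):
    """Weighted least squares assuming Var[u_i | X] = alpha^2 / m_i^2.

    Multiplying row i by m_i gives transformed errors with constant
    variance alpha^2, so OLS on the transformed data is the WLS estimate.
    """
    X = np.asarray(X, dtype=float)
    q = np.asarray(q, dtype=float)
    m = np.asarray(m, dtype=float)
    Xw = X * m[:, None]  # scale each row of X by m_i
    qw = q * m           # scale each response by m_i
    beta_hat, *_ = np.linalg.lstsq(Xw, qw, rcond=None)
    return beta_hat


# Synthetic, purely illustrative data (not from the question).
rng = np.random.default_rng(0)
n = 200
m = rng.integers(20, 500, size=n)                       # respondents per hospital
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
beta_true = np.array([1.0, 0.5])
alpha = 10.0
q = X @ beta_true + rng.normal(size=n) * alpha / m      # sd of u_i is alpha / m_i

beta_hat = wls_known_weights(X, q, m)
print(beta_hat)
```

Note that $\alpha$ is not needed to compute the coefficient estimates here, because a common scale factor in the weights cancels out. It only matters for the standard errors.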

For exposition's sake, if we assume that the distribution of $u_i$ conditional on the regressors is normal, then the log-likelihood of the sample will be

$$\ln L (\beta, \alpha \mid \mathbf X, \mathbf m) = \ln \left[\left(\frac 1{\sqrt {2\pi}}\right)^n\cdot \prod \frac {m_i}{\alpha}\cdot \exp{\left\{-\frac 12 \sum\frac {m_i^2}{\alpha^2}(q_i-\mathbf x_i'\beta)^2\right\}}\right]$$

$$= C -n\ln \alpha -\frac 1{2\alpha^2} \sum m_i^2(q_i-\mathbf x_i'\beta)^2$$

which would give us the estimate

$$\hat \alpha^2 = \frac 1n\sum m_i^2(q_i-\mathbf x_i'\hat \beta)^2$$

alongside the estimates for the $\beta$ vector, in which each observation is weighted by $m_i^2$, thus taking into account the different sample sizes from which the $q_i$'s were calculated.
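As a rough numerical check of these estimates, the sketch below computes $\hat\beta$ and $\hat\alpha^2$ on simulated data and evaluates the log-likelihood without its additive constant $C$. It assumes normal errors with variance $\alpha^2/m_i^2$, as in the derivation. The function names and the simulated data are hypothetical.

```python
import numpy as np


def fit_mle(X, q, m):
    """Return (beta_hat, alpha2_hat) under normal errors with sd alpha / m_i."""
    X = np.asarray(X, dtype=float)
    q = np.asarray(q, dtype=float)
    m = np.asarray(m, dtype=float)
    # Maximising over beta is least squares on the data scaled by m_i.
    beta_hat, *_ = np.linalg.lstsq(X * m[:, None], q * m, rcond=None)
    resid = q - X @ beta_hat
    # alpha2_hat = (1/n) * sum of m_i^2 * residual_i^2
    alpha2_hat = np.mean((m * resid) ** 2)
    return beta_hat, alpha2_hat


def log_likelihood(beta, alpha2, X, q, m):
    """Log-likelihood without the constant C: -n ln(alpha) - sum(m^2 r^2) / (2 alpha^2)."""
    n = len(q)
    resid = q - X @ beta
    return -0.5 * n * np.log(alpha2) - np.sum((m * resid) ** 2) / (2.0 * alpha2)


# Synthetic, purely illustrative data (not from the question).
rng = np.random.default_rng(1)
n = 500
m = rng.integers(20, 500, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
q = X @ np.array([1.0, 0.5]) + rng.normal(size=n) * 10.0 / m

beta_hat, alpha2_hat = fit_mle(X, q, m)
print(beta_hat, alpha2_hat)
print(log_likelihood(beta_hat, alpha2_hat, X, q, m))
```

Moving either `beta_hat` or `alpha2_hat` away from the fitted values should lower the value returned by `log_likelihood`, which is a quick way to confirm that the closed-form expressions are the maximisers.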

Finally, note that if the sets of people expressing opinions on the different hospitals overlap, then the $q_i$'s are correlated with each other, and you have additional issues.
