5

I need help thinking about and identifying the kind of regression analysis that would be appropriate for this problem. Nothing I've discovered so far seems quite right. Referrals to articles or examples would be helpful. Thank you.

The data look like this:

  • The data are observational.

  • The sampling unit is a geographic location (EDIT: let's assume units are independent; I'm just trying to understand the basic analytical problem here).

  • At each sampling unit, there are events of two types: the event type of interest (A) and all other types (B). EDIT: In other words, each event is a binary outcome (success, failure). The outcomes are aggregated to the location level (Location 1: Success 3, Failure 2. Location 2: Success 0, Failure 1. Location 3: Success 0, Failure 0. Location 4: Success 4, Failure 9, etc.).

  • Often, A=0 and, somewhat less often but still frequently, A+B=0.

I am interested in testing a hypothesis about A, somehow controlling for the total count of events (A+B), so either a proportion A/(A+B) or a count model that controls for the total count.

If I understand correctly, if the counts were large, proportions could be calculated for all units, and I could do beta regression. But that definitely can't happen when the total number of events for a unit is zero.

If all I cared about was the count, I could use a ZIP or other count model (and maybe still can). But the research question regards the frequency of A relative to the total number of events.

But how to control for the total number of events? Does it just go in the predictors of a ZIP or similar model? I suspect it's more complicated than that.

It seems obvious to me that the individual events could be modeled directly using multi-level logistic regression (EDIT: or another model for clustered data), but I'm wondering if there is a simpler way to examine what I'm interested in, and I just somehow haven't seen an example of this.

Rico
  • If $A$ & $B$ are independent Poisson counts, you can consider $A$ a binomial response for each location conditional on $A+B$ (which is ancillary for the binomial probability parameter). – Scortchi - Reinstate Monica Mar 03 '14 at 17:12
  • Thank you. Sorry if my notation is unclear. There is a single count of events that can have two outcomes (A & B). I'm interested in the proportion of outcome A.

    Can you assume that I'm about two levels less sophisticated? How would you set up an analysis? How would you use the total count?

    – Rico Mar 03 '14 at 17:16
  • You've made it more confusing now. Can you show an example of the dependent variable observations? – Aksakal Mar 03 '14 at 17:26
  • Aksakal: At the event level, there is a binary outcome (e.g., success/failure). At each location observed, there is some number of observed events. I want to model the proportion of successes at the location level OR I want to model the count of successes controlling for the total number of events at the location. This is complicated by the fact that the total number of events observed at some locations is zero (preventing the calculation of a simple proportion for each location).

    Outcome Data: Location 1: Success 0, Fail 0. Location 2: Success 1, Fail 0. Location 3: Success 1, Fail 1 ...

    – Rico Mar 03 '14 at 18:23
  • In the example data I just gave, I can't model the proportion: Location 1 = 0/0 is undefined. – Rico Mar 03 '14 at 18:33
  • Is location 1 at 0 successes and 0 failures because you have no data for that location, or because there were no "events" at that location? Are there some events C? In other words, are A and B both zero because you didn't measure that location, or is there some other outcome C? – RioRaider Mar 03 '14 at 18:54
  • Modelling the count of successes conditional on the total no. events is what I was getting at. Zero successes is fine; zero trials means you don't have an estimate for that location - unless you use a hierarchical model. One way or another you have to use information from other locations to guess at what might have been in a location where you didn't see anything. – Scortchi - Reinstate Monica Mar 03 '14 at 18:54
  • Thanks all. The data are observational (they are police-reported car crashes); if none are recorded, I am assuming that none occurred (no data are missing). This is relevant, because I am interested in crashes of a particular type (call it Type A). If there were no crashes at all, then there were no crashes of Type A (this might be related to the hypothesized location-level predictor); also, if there were nine crashes overall, but only one Type A crash, that's quite different from there being eight Type A crashes (also maybe related to X). A place having no crashes is important information. – Rico Mar 03 '14 at 19:43
  • So this remains my question: Do I model the count of Type A events directly, and, if so, what do I do with the total number of events (e.g., include it as a predictor somehow), or do I model the proportion (if so, is there a way to do this with cases having a total of zero events), or do I have to model the events directly with a multi-level model or GEE or something else? – Rico Mar 03 '14 at 19:46
  • You can model the counts of type $A$ crashes out of $A+B$ crashes in total as a binomial variate, on the assumption that type $A$ & type $B$ crashes occur independently. Locations with no crashes at all give no information on the proportion of type $A$ crashes that they might have had. The next step is to consider the type of model to fit: is location the only predictor, & do you want to model it as a fixed or random effect (are you interested in making predictions for those locations with no crashes, or others)? Would you consider a Bayesian multi-level model? – Scortchi - Reinstate Monica Mar 03 '14 at 23:06
  • Maybe an odds ratio will work here, e.g. $a\sim b$ – Aksakal Mar 04 '14 at 05:18

3 Answers

4

Probably the most common way to look at this kind of thing, if you're only interested in the proportions, is to assume that at the $i$th location $A_i$ & $B_i$ are independent Poisson variables with rates $\lambda_i$ & $\mu_i$ respectively. (That doesn't seem unreasonable for two types of car crashes at the same location over a limited period of time.) The joint mass function is

$$\newcommand{\e}{\mathrm{e}} f_{A_i,B_i}(a_i,b_i) = \frac{\lambda_i^{a_i} \e^{-\lambda_i}}{a_i!} \cdot \frac{\mu_i^{b_i} \e^{-\mu_i}}{b_i!}$$

Reparametrize with

$$\pi_i = \frac{\lambda_i}{\lambda_i+\mu_i}, \qquad \nu_i= \lambda_i+\mu_i,$$

let $N_i = A_i+B_i$, & the joint mass function of $(A_i,N_i)$ can be written as

$$f_{A_i,N_i}(a_i,n_i)=\frac{1}{a_i!(n_i-a_i)!}\cdot\pi_i^{a_i} (1-\pi_i)^{n_i-a_i}\cdot \nu_i^{n_i} \e^{-\nu_i}$$

Note that $\pi_i$, what you're interested in, & $\nu_i$, the nuisance parameter, separate cleanly; $N_i$ is sufficient for $\nu_i$, & $(A_i,N_i)$ sufficient for $\pi_i$. Sum over $a_i$ to get the marginal distribution of $N_i$, which is also Poisson, with rate $\nu_i$:

$$f_{N_i}(n_i)= \frac{\nu_i^{n_i} \e^{-\nu_i}}{n_i!}$$
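
Spelled out, the sum over $a_i$ is just a binomial expansion. A sketch of the step, using the reparametrized joint mass function with its factor $\e^{-\nu_i}$:

$$f_{N_i}(n_i)=\sum_{a_i=0}^{n_i}\frac{\pi_i^{a_i}(1-\pi_i)^{n_i-a_i}}{a_i!\,(n_i-a_i)!}\cdot\nu_i^{n_i}\e^{-\nu_i} =\frac{\nu_i^{n_i}\e^{-\nu_i}}{n_i!}\sum_{a_i=0}^{n_i}\binom{n_i}{a_i}\pi_i^{a_i}(1-\pi_i)^{n_i-a_i} =\frac{\nu_i^{n_i}\e^{-\nu_i}}{n_i!}$$

since the remaining sum is $\left(\pi_i+(1-\pi_i)\right)^{n_i}=1$.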

Conditioning on the observed value of the ancillary complement $N_i=n_i$ gives

$$f_{A_i|N_i=n_i}(a_i;n_i)=\frac{n_i!}{a_i!(n_i-a_i)!}\cdot\pi_i^{a_i} (1-\pi_i)^{n_i-a_i}$$

, i.e. a binomial distribution for $A_i$ successes out of $n_i$ trials.
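
The conditioning argument can be checked numerically. The following is a minimal sketch in Python (standard library only); the rates `lam` and `mu` are arbitrary illustrative values, not estimates from the crash data, and the function names are made up for this example.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(k, rate):
    """P(K = k) for a Poisson variable with the given rate."""
    return rate**k * exp(-rate) / factorial(k)


def conditional_pmf(a, n, lam, mu):
    """P(A = a | A + B = n) for independent Poisson A (rate lam) and B (rate mu)."""
    joint = poisson_pmf(a, lam) * poisson_pmf(n - a, mu)
    marginal_n = poisson_pmf(n, lam + mu)  # N = A + B is Poisson(lam + mu)
    return joint / marginal_n


def binomial_pmf(a, n, p):
    """P(A = a) for a binomial count of a successes in n trials."""
    return comb(n, a) * p**a * (1 - p) ** (n - a)


lam, mu = 0.8, 2.5          # arbitrary illustrative rates (assumption)
pi = lam / (lam + mu)       # the proportion parameter from the reparametrization
n = 6
for a in range(n + 1):
    # The conditional Poisson probabilities match the binomial ones.
    assert isclose(conditional_pmf(a, n, lam, mu), binomial_pmf(a, n, pi))
```

The total rate `lam + mu` cancels out of the conditional probabilities, which is why only the proportion `pi` matters once you condition on the total.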

I'm not sure what your concern is about locations where there are no events—there's simply no data at these to estimate the proportion of type-A crashes because there weren't any crashes. That doesn't stop you estimating $\pi_i$ at other locations. If location is the only predictor you have a simple $2\times k$ contingency table for the $k$ locations with data. If there are continuous predictors you can use a logistic regression model. If you want to make estimates for the $n=0$ locations you need in some way to borrow information from other locations: e.g. with predictors whose coefficients are estimated from other locations, treating location as a random effect. A Bayesian multi-level model might be quite useful, as some locations will have small, though non-zero, event counts, & estimates for these will be pulled further in the direction of the global model.
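
To make the point about empty locations concrete, here is a small sketch in Python (standard library only). It uses the four example locations from the question and assumes, purely for illustration, a single common proportion `p` across locations; the function names are hypothetical. A location with zero events contributes exactly zero to the binomial log-likelihood whatever `p` is, so it cannot change the estimate.

```python
from math import comb, log


def binom_loglik(a, n, p):
    """Log-likelihood contribution of one location: a type-A events out of n.

    Assumes 0 < p < 1 so that both logarithms are defined.
    """
    return log(comb(n, a)) + a * log(p) + (n - a) * log(1 - p)


def total_loglik(p, data):
    """Sum of binomial log-likelihood contributions over (A, B) pairs."""
    return sum(binom_loglik(a, a + b, p) for a, b in data)


def pooled_proportion(data):
    """MLE of a common proportion: total type-A events over total events."""
    total_a = sum(a for a, _ in data)
    total_n = sum(a + b for a, b in data)
    return total_a / total_n


# (A, B) counts taken from the example in the question
locations = [(3, 2), (0, 1), (0, 0), (4, 9)]
non_empty = [(a, b) for a, b in locations if a + b > 0]

# The empty location adds log(1) + 0 + 0 = 0 for every value of p.
print(binom_loglik(0, 0, 0.3))
# Dropping it leaves the pooled estimate unchanged.
print(pooled_proportion(locations), pooled_proportion(non_empty))
```

The same holds when `p` depends on predictors through a logistic link: zero-trial rows add nothing to the fit, though the fitted model can still be used to predict for them.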

2

I may be missing something about your motivation to 'control' for the total number of events, but how about the following:

Concentrate on modelling the rate (not the count) of A per area. To do this you would need to control for different numbers of total events in each area. You'd do this by adding an offset of $\log(A+B)$ to a regular Poisson regression (or over-dispersed, zero-inflated, etc. variant) for each area.

This, btw, is my reading of the second disjunct in your comment: "I want to model the proportion of successes at the location level OR I want to model the count of successes controlling for the total number of events at the location."
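
A rough sketch of the offset idea in Python (standard library only), under strong simplifying assumptions: an intercept-only Poisson model, the four example locations from the question, and a hand-rolled Newton-Raphson loop rather than any particular regression library. The function name is hypothetical. Locations with $A+B=0$ are dropped because $\log(0)$ is undefined.

```python
from math import exp

# (A, B) counts from the example in the question; N = A + B acts as exposure.
locations = [(3, 2), (0, 1), (0, 0), (4, 9)]

# log(0) is undefined, so locations with no events cannot carry an offset
# and are dropped here.
usable = [(a, a + b) for a, b in locations if a + b > 0]


def fit_intercept_with_offset(data, iterations=25):
    """Newton-Raphson for the model log E[A_i] = beta0 + log(N_i).

    Intercept only: the fitted mean at location i is N_i * exp(beta0).
    """
    total_a = sum(a for a, _ in data)
    total_n = sum(n for _, n in data)
    beta0 = 0.0
    for _ in range(iterations):
        fitted_total = exp(beta0) * total_n  # sum of fitted means
        # Score is total_a - fitted_total; Fisher information is fitted_total.
        beta0 += (total_a - fitted_total) / fitted_total
    return beta0


beta0 = fit_intercept_with_offset(usable)
rate = exp(beta0)  # estimated type-A events per event
print(rate)
```

With no covariates the fitted rate is just the pooled ratio of type-A events to all events; covariates would enter as extra terms next to `beta0` in the linear predictor, with the offset coefficient still fixed at 1.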

Nick Stauner
  • Bullet point #4 suggests that $A+B=0$ often, and $\log(0)$ is undefined. – dimitriy Mar 04 '14 at 01:13
  • in which case there is no data for that area and hence no information about the rate (unless one is imputed from other information e.g. in a multilevel setup, or when there are covariates on which the rate is conditioned). – conjugateprior Mar 04 '14 at 11:01
  • @DimitriyV.Masterov (which is essentially the same answer as Scortchi gives to this issue). – conjugateprior Mar 04 '14 at 11:09
1

Have you thought about using Tukey's folded logs, as in $$\frac{1}{2} \cdot \ln \left( \frac{A + 1/6}{A+B + 1/3}\right) - \frac{1}{2} \cdot \ln \left(1 - \frac{A + 1/6}{A+B + 1/3}\right)$$

You can justify this type of transformation with some Bayesian arguments. For example, here's some data illustrating this transformation:

A    B    ratio      flog
100  0    1           3.1992975
1    0    1           0.97295507
0    1    0          -0.97295507
0    0    undefined   0
0    25   0          -2.5086399
0    75   0          -3.0557337
50   100  1/3        -0.34574233

One of the things I really like about it is that $A=100,B=0$ has a higher value than $A=1,B=0$ even if the proportions are both 1. The disadvantage of this approach is that the coefficients will be hard to interpret.

If you have a different prior, you can adjust accordingly.
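
For reference, a minimal Python sketch (standard library only) of the folded-log formula above. The parameter names `prior_a` and `prior_total` are made up for this example; they default to the $1/6$ and $1/3$ constants in the formula and can be changed to reflect a different prior.

```python
from math import log


def folded_log(a, b, prior_a=1 / 6, prior_total=1 / 3):
    """Tukey's folded log of the shrunken proportion of A among A + B.

    prior_a is added to the count of A and prior_total to the total count,
    so the proportion is defined even when a + b == 0.
    """
    p = (a + prior_a) / (a + b + prior_total)
    return 0.5 * log(p) - 0.5 * log(1 - p)


# (A, B) pairs from the illustration above
for a, b in [(100, 0), (1, 0), (0, 1), (0, 0), (0, 25), (0, 75), (50, 100)]:
    print(a, b, round(folded_log(a, b), 7))
```

With the default constants the expression simplifies to $\frac{1}{2}\ln\frac{A+1/6}{B+1/6}$, which makes it easy to see why $A=B=0$ maps to 0.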

dimitriy