
My design involves three dependent variables. They are count data: frequencies of certain words used in a 20-min conversation. They are also repeated measures: the same subjects engaged in three 20-min conversations on three different topics, and the use of certain words was counted alongside other variables. There is a between-subjects factor with 2 levels and a covariate (IQ). My sample size is small (25), and the number of subjects is unequal between the levels (13 and 12).

I've done a ton of online searching and gone through a dozen stats books but didn't find an appropriate model (I have probably missed something in my search). I think I am looking for a mixed-design ANCOVA for count data (mixed-effects negative binomial regression, perhaps?). Any suggestions?

Xikeck

3 Answers


This sounds like a repeated-measures version of shared frailty models. From what you say, mixed-effects negative binomial or mixed-effects Poisson models sound very reasonable. At a minimum you would have a random subject effect and a random outcome effect (for the type of thing you are measuring); that alone would not yet reflect whether two records come from the same conversation, which you can capture by numbering conversations across subjects and adding a random conversation effect. In the notation of the lme4 R package: glmer(count ~ (1|subject) + (1|outcome) + (1|conversation) + factor(something) + IQ, family = poisson(link = "log")).
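Since the question suspects overdispersion, the same structure can also be fitted as a negative binomial model via lme4's glmer.nb. A minimal sketch with simulated stand-in data mirroring the design in the question (all column names here are hypothetical, not the real study variables):

```r
# Sketch: mixed-effects Poisson vs. negative binomial count models with lme4.
# The data frame 'd' is simulated stand-in data, not the real study data.
library(lme4)

set.seed(1)
d <- data.frame(
  subject      = factor(rep(1:25, each = 3)),   # 25 subjects, 3 records each
  conversation = factor(rep(1:3, times = 25)),  # conversation topic (3 levels)
  group        = factor(rep(rep(c("A", "B"), c(13, 12)), each = 3)),
  IQ           = rep(round(rnorm(25, 100, 15)), each = 3)
)
d$count <- rnbinom(nrow(d), mu = 5, size = 1)   # overdispersed word counts

# Poisson version, as in the answer above
m_pois <- glmer(count ~ group + IQ + (1 | subject) + (1 | conversation),
                data = d, family = poisson(link = "log"))

# Negative binomial version, if the counts are overdispersed
m_nb <- glmer.nb(count ~ group + IQ + (1 | subject) + (1 | conversation),
                 data = d)
```

Comparing the two fits (e.g. by AIC) gives a quick check of whether the extra dispersion parameter is worth having.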

Björn

Correct: you wouldn't find much on models with multiple dependent variables based on "count data." However, let's slice and dice what you are doing to break it down into a simpler idea, and then see whether your goal can be accomplished.

I commonly only use the term count data when I have categorical data, such as in a chi-squared test. For an IQ example, this would be a 2-by-2 table with 2 columns (low IQ, high IQ) and 2 rows (treated, control), with the counts representing the number of subjects (mice, patients) in the four categories. This type of analysis would be done if you said, "there is only count data for the 4 categories, and there are no means and standard deviations of anything."

While you do have count data, couldn't you still consider more counts a better (or worse) outcome and treat it like a Poisson or normally distributed outcome? What do the histograms of each dependent variable look like for all records combined (independent of each experimental unit: patient, mouse)? Also, are you saying you have 12 and 14 levels (groups) for your categories? If so, you won't have a lot of data ($n=26$), since some counts will be sparse, leading to an "ill-conditioned," or "over-parametrized," problem. This is also called the "curse of dimensionality," i.e., too many dimensions or degrees of freedom for your model.

Let's keep forging ahead, however. A trick we sometimes use is to assign ranks to measurement values when they are highly skewed; that is, we sometimes (rarely) replace values with the rank of the observation across the research subjects (within a variable). Then we feed those ranks into methods that usually require continuously scaled data, like ANOVA and regression. We have done this on smaller-sample mouse data, and when submitting to journals the statistical reviewers understand that we ran into trouble with skewness and outliers in the original data and transformed to ranks, so it's not a problem. We note that it's not perfect, since ranks are rectangularly (uniformly) distributed, but the papers get published.
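To make the rank trick concrete, here is a minimal base-R sketch with made-up data (the variable names and the simulated counts are hypothetical):

```r
# Sketch: replace skewed counts with ranks, then use ordinary ANCOVA machinery.
set.seed(2)
counts <- rnbinom(25, size = 1, mu = 5)      # skewed, overdispersed word counts
group  <- factor(rep(c("A", "B"), c(13, 12)))
IQ     <- round(rnorm(25, 100, 15))

r <- rank(counts, ties.method = "average")   # ranks within the variable

# The ranks go into a standard linear model (ANCOVA-style)
fit <- lm(r ~ group + IQ)
summary(fit)
```

The coefficients are then interpreted on the rank scale, which is the interpretive cost of the transformation mentioned above.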

Now, if you look at your counts, don't they look a little like ranks? If so, then consider running a panel-data (longitudinal) model such as GEE (generalized estimating equations) in Stata, using a Poisson link for the count data while clustering on subject ID. Setting up panel-data regression models requires specifying the ID variable for each subject (mouse, patient, student) so the model can see the repeated measures for each unit. All the large packages (SAS, SPSS, Stata) have panel-data regression models for repeated measurements that allow specification of a link function, and they allow time to be used as a predictor as well. Stata has no categorical (count) link function, but SAS does; you could use the Poisson link in Stata, however. In R, panel-data regression models with link functions are certainly available. Every package has a binomial link, but that is for binomial outcomes (y/n, 0/1) at each repeated measurement; Poisson, on the other hand, can take on count values of $0,1,2,3,4,5,\ldots,\infty$.
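In R, the analogue of this panel-data setup is a GEE with a Poisson link, e.g. via the geepack package. A sketch with simulated stand-in data (column names hypothetical, mirroring the question's design):

```r
# Sketch: GEE with a Poisson link, clustering on subject ID (geepack).
library(geepack)

set.seed(3)
d <- data.frame(
  subject = rep(1:25, each = 3),             # id variable: 3 records per subject
  group   = factor(rep(rep(c("A", "B"), c(13, 12)), each = 3)),
  IQ      = rep(round(rnorm(25, 100, 15)), each = 3)
)
d$count <- rpois(nrow(d), 5)                 # word counts (simulated)

# Rows must be sorted by the id variable, as they are here
gee_fit <- geeglm(count ~ group + IQ, id = subject, data = d,
                  family = poisson(link = "log"), corstr = "exchangeable")
summary(gee_fit)
```

The "exchangeable" working correlation treats any two conversations from the same subject as equally correlated, which seems a reasonable default for three topics.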

  • I meant 12 and 14 subjects at each level. I call them count data because they're frequencies of certain words used in a 20-min conversation. They're repeated measures in that the subjects engaged in three 20-min conversations on three different topics, and the use of certain words was counted alongside other variables. I checked the mean, variance, skewness, and kurtosis of the count variables and believe negative binomial regression to be a better fit given overdispersion. I think I am looking for a mixed-design ANCOVA for count data (mixed-effects negative binomial regression?). – Xikeck Aug 26 '18 at 20:57
  • I also ranked the count data and ran them through a mixed-design ANCOVA. While that helps with skewness and heteroscedasticity, I don't know how to interpret the results. I checked GEE in SPSS but saw that only one dependent variable is allowed in Response. – Xikeck Aug 26 '18 at 21:08
  • I noticed that the topics were different during the three interviews. So the information being sought each time was from a different cognitive construct? If so, nothing is repeated. Just use a Poisson link and run a regression of the key (trigger) word counts from each interview separately. Yes, the responses from the 3 interviews may be correlated, but you don't have to throw all of the response data into one sausage machine (model) just because they're correlated. –  Aug 28 '18 at 02:51
  • Look at SPSS GLM with a Poisson link (assuming the word counts are Poisson) and select subject ID as the "within-subject" factor. Set up the run to use the word counts (for one interview) as the dependent variable and your covariates as independent predictors (treatment variables, covariates like gender and age), with subject ID as the within-subject factor variable. –  Aug 28 '18 at 02:51
  • My bad: you don't even need a within-subject factor, since nothing is repeated. Just run the Shapiro-Wilk normality test on the word-count values for one interview across all subjects, and if the p-value is not below 0.05, just use linear regression and regress the word-count variable on your treatment variable and covariates. You will have three regression models when done. You could also skip the normality test and regress the ranks of the word counts from each interview on the independent predictors. Forget about GLM and Poisson. –  Aug 28 '18 at 02:59
  • This is why I mentioned slicing and dicing your problem, called "divide and conquer": split what seems to be a complex problem into smaller parts so it is more understandable and easier to tackle. –  Aug 28 '18 at 03:01
  • The matched variables (meaning the same things of interest collected in the three conversations, count data or not) are all highly correlated with each other (effect sizes between .6 and .9), so I'd like to put them in one model. Some of them follow a normal distribution, others a negative binomial distribution or something else. Skewness, kurtosis, and outliers are prevalent. I tried a square-root transformation on several matched variables, but since the distributions may differ among the matched variables, the transformation did not always fix heteroscedasticity. – Xikeck Aug 28 '18 at 21:01

Implicitly, you are suggesting that there is one effect of your predictors on your responses, and that your responses really are "only" three different ways to measure this effect. If you want to throw them together, and if they additionally seem to derive from different distributions, I can only think of a hierarchical (Bayesian) approach:

Option 1: model each response separately, but state that the model coefficients (with the exception of the intercept) are identical across the three models. This way, you can control the distribution of each response: make one normal, another negative binomial, and so on.

Option 2: a single model of a latent response, which is a linear combination of the three measured responses.

Fundamentally, these two approaches should converge on the same statement: the effect of your predictors on a combination of responses. To the best of my knowledge, either approach will take you to specialised software (e.g. JAGS, Stan, Win/OpenBUGS), as SEM packages typically do not offer the flexibility of different distributions for the latent variable. (No one said it would be easy.)
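As one possible starting point in R, the brms package can specify a multivariate model with a different family per response, in the spirit of Option 1. This is only a sketch of the model specification (the response and predictor names y1, y2, y3, group, IQ are hypothetical), and the actual fit requires Stan compilation:

```r
# Sketch: joint model of three responses with different families (brms).
# Only the multivariate formula is built here; brm() itself would compile
# and run the underlying Stan model.
library(brms)

f <- bf(y1 ~ group + IQ, family = gaussian()) +
     bf(y2 ~ group + IQ, family = negbinomial()) +
     bf(y3 ~ group + IQ, family = poisson())

# fit <- brm(f, data = d)   # 'd' would hold the three responses in wide format
```

Note that forcing the non-intercept coefficients to be identical across the three responses, as Option 1 requires, is not automatic in this sketch; that constraint would need to be written out explicitly, e.g. directly in Stan or JAGS.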

Carsten