I've fitted a mixed model with participants and vowels as random factors and language (Tamil and French) as the fixed factor. The dependent variable is the duration of prolongations (of a phoneme). The question I'm trying to answer: are the mean durations of prolongations in the two languages significantly different from one another? I considered a t-test, but realized that it may not capture the effects of variation between participants or even between vowels (some vowels in English or German are more easily prolonged than consonants, for example).

library(lmerTest)  # provides lmer() with Satterthwaite t-tests, as in the output below
model1 <- lmer(log(Duration) ~ Language + (1 | Participant) + (1 | Vowel), data = lmmdf, REML = TRUE)
summary(model1)

Output:

Linear mixed model fit by REML. t-tests use Satterthwaite's  method
 [lmerModLmerTest]
Formula: log(Duration) ~ Language + (1 | Participant) + (1 | Vowel)
   Data: lmmdf

REML criterion at convergence: 363.8

Scaled residuals:
     Min       1Q   Median       3Q      Max
-3.03238 -0.65717 -0.05817  0.67908  2.42541

Random effects:
 Groups      Name        Variance Std.Dev.
 Vowel       (Intercept) 0.030179 0.17372
 Participant (Intercept) 0.009809 0.09904
 Residual                0.178592 0.42260
Number of obs: 297, groups:  Vowel, 26; Participant, 18

Fixed effects:
              Estimate Std. Error       df t value           Pr(>|t|)
(Intercept)   -1.43762    0.07990 21.75516 -17.992 0.0000000000000151 ***
LanguageTamil  0.03926    0.09777 20.19149   0.402              0.692
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
LanguageTml -0.667

I can see that Language is not a statistically significant predictor of prolongation duration. But how do I interpret the estimate of 'LanguageTamil' (0.03926) when it is a categorical variable (French or Tamil) and there is no "1 unit increase" between the two? Does this actually mean that when the participant spoke in Tamil, they prolonged 0.03926ms longer than when they did in French (but not statistically significant)? Also, what does the negative sign in the 'Correlation of Fixed Effects' mean? (-0.667).

  • As you log-transformed the outcome, the exponentiated coefficient represents the ratio of geometric means of Duration. The ratio of geometric means is $\exp(0.03926) = 1.04$, meaning that the geometric mean of Duration is about 4% higher for Tamil than for French. – COOLSerdash Jun 24 '22 at 11:06
  • Thanks! However, I don't get how you went from 1.04 (exponentiating the log) to 4%. Was this derived by multiplying the exponentiated coefficient by the log value? Like (1.04 * 0.03926) * 100 = ~4%? Why is that done? (I'm very new to this, sorry and thanks again!). – MaVeee2021 Jun 24 '22 at 11:32
  • I just calculated $(1.04-1)\times 100\%$. It's just another way of looking at the ratio. You'll find a lot of resources about this on this website. For example, here. This article is also very informative. – COOLSerdash Jun 24 '22 at 11:37
  • Thanks so much! – MaVeee2021 Jun 24 '22 at 11:42
  • Are the vowels in any language random? – dipetkov Jun 24 '22 at 19:56
  • I don't know exactly what you mean? They were not controlled for. Participants spoke about a specific art piece colloquially, without any leading questions or guidance, so it was spontaneous speech and the vowels (and words) were annotated. The 'vowels' label here actually represents all phones, vowel and consonant alike. That was just the variable name I gave to it. I transcribed the list of prolongations to their corresponding phones. Does this have any effect, do you think? – MaVeee2021 Jun 24 '22 at 20:59

1 Answer

how do I interpret the estimate of 'LanguageTamil' (0.03926) when it is a categorical variable (French or Tamil) and there is no "1 unit increase" between the two.

Categorical variables get converted into numerical variables, one for each category, with values zero or one depending on the category of the specific row.

For example

$$\begin{array}{c|c|c|c} \text{Observation id} & \text{Language} & \text{LanguageTamil} & \text{LanguageFrench} \\ \hline 1 & \text{French} & 0 & 1\\ 2 & \text{Tamil} & 1 & 0\\ 3 & \text{French} & 0 & 1\\ 4 & \text{French} & 0 & 1\\ 5 & \text{Tamil} & 1 & 0\\ \vdots & \vdots & \vdots & \vdots\\ n & \text{French} & 0 & 1\\ \end{array}$$

These variables are also called dummy variables.
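The coding in the table can be reproduced in R with `model.matrix`. This is only a small sketch on made-up toy data (the first five rows of the table); the variable names `language`, `all_dummies` and `design` are chosen for illustration and do not come from the question's data set.

```r
# Toy data mirroring the first five rows of the table above.
language <- factor(c("French", "Tamil", "French", "French", "Tamil"))

# One indicator (dummy) column per category, without an intercept.
all_dummies <- model.matrix(~ 0 + language)

# R's default treatment coding: an intercept plus one dummy column.
# French, the first level alphabetically, is the reference level
# and gets no column of its own.
design <- model.matrix(~ language)

colnames(all_dummies)
colnames(design)
design
```

With the default coding the design matrix has only the columns `(Intercept)` and `languageTamil`, which is why the model summary lists a single Language coefficient.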

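In your regression only one of the dummy variables is used, namely LanguageTamil. French acts as the reference category: the intercept is the estimated mean of log(Duration) for French, and the intercept plus the coefficient for LanguageTamil is the estimated mean of log(Duration) for Tamil. The coefficient 0.03926 is therefore the Tamil-minus-French difference on the log scale, not a difference in milliseconds.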
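As a rough illustration of that reading, the two fixed-effect estimates from the question's summary output can be turned back into the original scale. This is a sketch that assumes French is the reference level (as the output suggests); the variable names are made up for the example.

```r
# Fixed-effect estimates copied from the summary output in the question.
# Both are on the log(Duration) scale.
intercept      <- -1.43762  # estimated mean log-duration for French
language_tamil <-  0.03926  # Tamil minus French, in log units

# Estimated mean log-duration for each language.
log_french <- intercept
log_tamil  <- intercept + language_tamil

# exp() of a mean of logs is a geometric mean, in the units of Duration.
geo_mean_french <- exp(log_french)
geo_mean_tamil  <- exp(log_tamil)

# The coefficient itself back-transforms to a ratio, not a difference.
ratio          <- exp(language_tamil)
percent_change <- (ratio - 1) * 100

round(c(ratio = ratio, percent_change = percent_change), 2)
```

The ratio comes out at about 1.04, i.e. a geometric mean roughly 4% higher for Tamil, which matches the interpretation given in the comments under the question.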

Related: Why does glm in R create new variables upon training?

Also, what does the negative sign in the 'Correlation of Fixed Effects' mean? (-0.667).

If you repeated the experiment, the estimates would differ from one run to the next, and the errors in the two coefficients would not be independent of each other: they are correlated. In your case the correlation between the estimate of the intercept and the estimate of the language effect is negative. If the estimate of the intercept is too high, then the estimate of the coefficient for Tamil will relatively often be too low.
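A small simulation can make this visible. The sketch below is not the mixed model from the question: it uses an ordinary `lm` on simulated data with two equally sized groups, and the sample size, mean and standard deviation are arbitrary assumptions chosen only for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n_per_group <- 50
language <- factor(rep(c("French", "Tamil"), each = n_per_group))

# One "experiment": simulate log-durations with no true language effect
# and return the two fitted coefficients.
run_experiment <- function() {
  log_duration <- rnorm(2 * n_per_group, mean = -1.4, sd = 0.45)
  coef(lm(log_duration ~ language))
}

# Repeat the experiment many times; each row holds one pair of estimates.
estimates <- t(replicate(2000, run_experiment()))

# Empirical correlation between the intercept and languageTamil estimates.
simulated_cor <- cor(estimates[, "(Intercept)"], estimates[, "languageTamil"])

# Correlation implied by a single fit (the kind of number summary() reports).
one_fit <- lm(rnorm(2 * n_per_group) ~ language)
reported_cor <- cov2cor(vcov(one_fit))["(Intercept)", "languageTamil"]

c(simulated = simulated_cor, reported = reported_cor)
```

Both numbers should be close to -0.7 in this balanced setup. The intuition is that the intercept is the French mean and the coefficient is the Tamil mean minus the French mean, so an overestimate of the French mean pushes the estimated difference down.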

In the question "Why and how does adding an interaction term affects the confidence interval of a main effect?" you can see an example of a regression where the estimates of the coefficients correlate.

The ellipse shows how the parameter estimates are estimated to be distributed. In the left image they have some correlation.

[Figure: correlation and confidence regions]

This is also a reason why the application of a t-distribution in order to estimate the significance is not always so great. On the other hand, in the image you see that this relates mostly to the intercept and the error of the slope parameter is the same in both left/right images.