I am working on a data set that consists of Patients (after stroke), Time (the time until they can walk after going through the program), and the Program they follow.

Patient   Program   Time
1           P1       12
2           P3       23
3           P3       8
4           P2       36
5           P1       10

I am investigating the influence of the program on the patients' results, but I also want to investigate how much of the variation is due to differences between the patients.

My approach was to consider the between-group variation using ANOVA and Tukey's HSD:

new <- aov(Patient~Programme, dane)
TukeyHSD(new)
-------------------------------------------
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Patient ~ Programme, data = dane)

$Programme
                   diff       lwr       upr     p adj
Program2-Program1  2.75 -4.294925  9.794925 0.4682698
Program3-Program1  2.50 -5.634778 10.634778 0.6082676
Program3-Program2 -0.25 -7.294925  6.794925 0.9926869

But I don't think this is a good approach, as it does not show how much of the variation in the results is due to the patients.

Summary of anova:

> summary(new)                                                                                                     
            Df Sum Sq Mean Sq F value Pr(>F)
Programme    2  10.75   5.375    0.86  0.478
Residuals    5  31.25   6.250

Summary of the Time ~ Programme ANOVA:

summary(new2)
            Df Sum Sq Mean Sq F value Pr(>F)
Programme    2  398.4   199.2    0.61   0.58
Residuals    5 1633.5   326.7               
  • Could you please edit the question to display the result of summary(new)? That should produce some clues. – EdM May 19 '22 at 18:45
  • @EdM sure, I edited the question – user13696679 May 21 '22 at 11:54
  • Sorry, I misread the question. I assumed that you were modeling the outcome as a function of Programme, as there isn't really anything that can be learned from a regression of an arbitrary numeric patient ID upon Programme. Post the anova results of Time ~ Programme and I will provide a new answer. – EdM May 21 '22 at 13:45
  • @EdM I put the summary of Time~Programme – user13696679 May 21 '22 at 17:45

1 Answer

Analysis of variance (ANOVA) provides the answer to your specific question. Here, the proper outcome to examine is Time in the Time-Programme ANOVA.*

The "Residuals" mean square is the variance in Time values that isn't accounted for by the modeling of Programme. In your case, that residual variance is the variance associated with the patients, while accounting for their different values of Programme. The square root of that mean square is about 18, an estimate of the standard deviation among patients under the same Programme.
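As a quick check, that standard deviation can be recomputed from the Residuals row of the summary(new2) table alone. This is only a sketch; the variable names are illustrative and not from the original code.

```r
# Values copied from the Residuals row of summary(new2) in the question
ss_resid <- 1633.5   # residual sum of squares
df_resid <- 5        # residual degrees of freedom

ms_resid <- ss_resid / df_resid   # residual mean square
sd_resid <- sqrt(ms_resid)        # residual standard deviation, about 18

print(round(c(ms_resid = ms_resid, sd_resid = sd_resid), 2))
```

With the fitted object in hand, sigma(new2) should give the same number.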

An ANOVA significance test evaluates whether the between-treatments mean square is greater than the residual mean square. In your case, it isn't: the F value of 0.61 is their ratio, and it is below 1. The modeling of Programme didn't "explain" any more variance in Time values than there is inherently among patients.
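To make the comparison concrete, here is a minimal sketch that rebuilds the F value and p-value from the sums of squares printed in summary(new2). The variable names are illustrative. The last quantity, the share of the total sum of squares attributed to Programme (often called eta squared), does not appear in the R output and is added here only as one common way to express "how much variation" went to Programme versus patients.

```r
# Values copied from summary(new2) in the question
ss_between <- 398.4;  df_between <- 2   # Programme row
ss_within  <- 1633.5; df_within  <- 5   # Residuals row

ms_between <- ss_between / df_between   # mean square for Programme
ms_within  <- ss_within  / df_within    # residual mean square

# F statistic and its upper-tail p-value under the F(2, 5) distribution
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Share of the total sum of squares attributed to Programme (eta squared);
# the remainder is the residual, among-patient part
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_stat, p = p_value, eta_sq = eta_sq), 3))
```

The F and p values should match the 0.61 and 0.58 in the table up to rounding. With only 8 observations, all of these estimates are very imprecise.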

See this page for an introduction to how to think about the mean-square values in a model like yours. Your Programme mean square is "MS_between" on that page; your Residuals mean square is "MS_within."

A caution: the above assumes that the assumptions underlying ANOVA are reasonably well met. I worry about that for a couple of reasons.

First, the high residual SD (compared against the data values you show) suggests that there might be a lot of skew in the Time values, so that (particularly with a small sample as you have) the estimates might not be valid.

Second, with recovery after stroke there might be some who never recover. If so, how do you handle their Time values? Whether you just take their values to be the last Time at which they were examined or omit those patients from the study, you are making the same mistakes as in improper handling of censored survival times. Your estimates of mean time to recovery will be invalid, probably too low. Unless all of your patients recover function during the time course of your study, you should be using some type of survival analysis.
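If some patients have not recovered by the end of follow-up, a survival analysis could look roughly like the sketch below. It assumes the survival package is available. The data frame is made up purely for illustration and is not your data, and the `Recovered` column is a hypothetical indicator (1 = walked at `Time`, 0 = still not walking at last follow-up) that your data set would need to contain.

```r
library(survival)  # assumed to be installed; ships with standard R builds

# Made-up illustrative data, NOT the poster's data.
# `Recovered` is a hypothetical column: 1 = walked at `Time`,
# 0 = censored (still not walking at last follow-up).
dane_surv <- data.frame(
  Programme = factor(c("P1", "P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3")),
  Time      = c(12, 10, 30, 36, 15, 40, 23, 8, 40),
  Recovered = c(1, 1, 0, 1, 1, 0, 1, 1, 0)
)

# Kaplan-Meier curves per programme; censored patients contribute their
# follow-up time without being counted as recovered
km <- survfit(Surv(Time, Recovered) ~ Programme, data = dane_surv)
print(km)

# Log-rank test for a difference in time to recovery between programmes
lr <- survdiff(Surv(Time, Recovered) ~ Programme, data = dane_surv)
print(lr)
```

If every patient did recover during the study, `Recovered` would be 1 throughout and there would be no censoring to handle.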


*It's not clear what would be measured by the Patient~Programme model, as that just takes what are effectively names (patient ID values) and treats them as numeric outcomes. A simple "re-naming" of patients (exchanging their ID values) would affect those results. The outcome of interest here is Time.
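The re-naming point can be seen with the five rows shown in the question. The sketch below uses the column names from that table (`Program`, not `Programme`); the helper `ss_programme` and the particular shuffling of IDs are hypothetical choices made only for this illustration.

```r
# The five rows shown in the question
dane5 <- data.frame(
  Patient = 1:5,
  Program = factor(c("P1", "P3", "P3", "P2", "P1")),
  Time    = c(12, 23, 8, 36, 10)
)

# Same patients, same outcomes; only the arbitrary ID labels are shuffled
renamed <- transform(dane5, Patient = c(1, 2, 5, 3, 4))

# Hypothetical helper: Program sum of squares from a one-way ANOVA
# of the named outcome column on Program
ss_programme <- function(outcome, data) {
  fit <- aov(reformulate("Program", response = outcome), data = data)
  summary(fit)[[1]][1, "Sum Sq"]
}

# Patient ~ Program changes when the IDs are swapped ...
print(c(original = ss_programme("Patient", dane5),
        renamed  = ss_programme("Patient", renamed)))

# ... while Time ~ Program is unaffected
print(c(original = ss_programme("Time", dane5),
        renamed  = ss_programme("Time", renamed)))
```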

EdM