I have a nested dataset where information on individual workplace characteristics is available on the case level, and data on recorded sick leave on the group level.
About 7,000 individuals are nested within approximately 40 groups, so the sick leave value is the same for every case in a group: for example, 2.9 for group 1, 3.6 for group 2, and so on. This is what prevents multilevel modelling and where my problem starts.
I would now like to run a regression analysis to investigate how well the individual workplace characteristics predict sick leave on the group level. As far as I know, I can either
(a) aggregate the data to the group level, producing a dataset of 40 cases (the 40 groups) instead of 7,000, and run the regression on that, or
(b) simply run the regression on the whole dataset.
However, the resulting R squared values differ greatly: .645 for (a) versus .069 for (b).
I do not understand why this is happening or what causes it mathematically. My guess is that R squared is artificially inflated in option (a) because of the reduced variance and the smaller number of cases.
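To make the setup concrete, here is a minimal simulation sketch of the two options. It uses simulated data, not my actual dataset: a single predictor, equal group sizes, and all names and parameter values (`group_effect`, the noise levels, the slope of 0.8) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group = 40, 175  # 40 * 175 = 7000 individuals (hypothetical balanced design)

# Hypothetical individual-level predictor: a group component plus
# much larger individual-level noise.
group_effect = rng.normal(0.0, 1.0, n_groups)
x = group_effect[:, None] + rng.normal(0.0, 3.0, (n_groups, n_per_group))

# The outcome exists only on the group level (one value per group).
y_group = 3.0 + 0.8 * group_effect + rng.normal(0.0, 0.3, n_groups)


def r_squared(pred, outcome):
    """R squared of a simple linear regression (squared Pearson correlation)."""
    return np.corrcoef(pred, outcome)[0, 1] ** 2


# Option (a): aggregate the predictor to group means, 40 cases.
x_mean = x.mean(axis=1)
r2_a = r_squared(x_mean, y_group)

# Option (b): all 7000 cases, each case carrying its group's outcome value.
y_repeated = np.repeat(y_group, n_per_group)
r2_b = r_squared(x.ravel(), y_repeated)

# Share of the predictor's total variance that lies between groups.
between_share = x_mean.var() / x.var()

print(f"R^2 (a), aggregated: {r2_a:.3f}")
print(f"R^2 (b), individual: {r2_b:.3f}")
print(f"R^2 (a) * between-group share: {r2_a * between_share:.3f}")
```

In this simplified case (one predictor, equal group sizes) the R squared from (b) equals the R squared from (a) multiplied by the between-group share of the predictor's variance. The outcome is constant within groups, so the within-group variation in the predictor cannot explain any of it. I am not sure how far this carries over to several predictors and unequal group sizes.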
Can anyone explain to me why this is happening?
Thanks a lot!
The figures I stated are examples of sickness absence rates per group. For example, group 1 shows an average sickness absence rate of 2.9 percent, measured 12 months after the data on workplace characteristics were surveyed. Because data on individual absence rates are not available, each employee in group 1 is assigned the group's value.
– ym.87 Aug 31 '22 at 19:43