I have a data set with 1,000 observations and 20 variables (1 numeric and 19 categorical). In a single piece of code, I would like to calculate the mean and SD of the numeric variable for each level of every categorical variable. For example, I want the mean of the numeric variable by gender and, in the same step, by age group, education, and the other categorical variables.

If I use `group_by(sex, age, education, ...)`, I get the mean for each combination of levels, so I cannot calculate the mean of the numeric variable for each level of each categorical variable separately.

How can I calculate the mean of the numeric variable by each of the categorical variables?

  • Welcome to SO! Please see [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You can provide data via the output of `dput(df)`. Providing a minimal reproducible example and the expected output increases the chances that someone will be able to help. – AndrewGB Jan 05 '22 at 00:45
  • Please provide enough code so others can better understand or reproduce the problem. – Community Jan 12 '22 at 12:14

1 Answer

I didn't do all 20 variables, but something like this should be sufficient. I wasn't sure what you meant by "one code", but all of this can be run in a single chunk of an R Markdown (.Rmd) file.

library(tidyverse)

# Simulated data: an ID, one numeric variable, and three categorical variables
studyid <- 1:1000
numeric <- sample(1:25, 1000, replace=TRUE)
color <- sample(c("red","blue","yellow"), 1000, replace=TRUE)
gender <- sample(c("male", "female"), 1000, replace=TRUE)
age <- sample(c("old","older","oldest"), 1000, replace=TRUE)

# Combine the vectors into one data frame
df <- data.frame(studyid, numeric, color, gender, age)

df %>%
  select(numeric, color) %>%
  group_by(color) %>%
  summarize(
    count = n(), 
    mean = mean(numeric, na.rm = TRUE),
    sd = sd(numeric, na.rm = TRUE)
  )
#> # A tibble: 3 × 4
#>   color  count  mean    sd
#>   <chr>  <int> <dbl> <dbl>
#> 1 blue     342  12.8  7.31
#> 2 red      324  12.9  6.86
#> 3 yellow   334  13.4  7.32

df %>%
  select(numeric, gender) %>%
  group_by(gender) %>%
  summarize(
    count = n(), 
    mean = mean(numeric, na.rm = TRUE),
    sd = sd(numeric, na.rm = TRUE)
  )
#> # A tibble: 2 × 4
#>   gender count  mean    sd
#>   <chr>  <int> <dbl> <dbl>
#> 1 female   485  13.0  7.25
#> 2 male     515  13.0  7.09

df %>%
  select(numeric, age) %>%
  group_by(age) %>%
  summarize(
    count = n(), 
    mean = mean(numeric, na.rm = TRUE),
    sd = sd(numeric, na.rm = TRUE)
  )
#> # A tibble: 3 × 4
#>   age    count  mean    sd
#>   <chr>  <int> <dbl> <dbl>
#> 1 old      330  12.0  6.95
#> 2 older    347  12.9  7.12
#> 3 oldest   323  14.2  7.27
Created on 2022-01-05 by the reprex package (v2.0.1)
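
With 19 categorical variables, repeating the same pipeline by hand gets tedious. A possible sketch, not tested on the asker's real data, is to loop over the column names with `purrr::map()` and group by each one in turn. The data frame below is simulated to mirror the example above, and the name `cat_vars` is only illustrative; `set.seed(1)` is added here for reproducibility, so the numbers will differ from the output shown earlier.

```r
library(dplyr)
library(purrr)

# Hypothetical example data, mirroring the simulated data frame above
set.seed(1)
df <- data.frame(
  studyid = 1:1000,
  numeric = sample(1:25, 1000, replace = TRUE),
  color   = sample(c("red", "blue", "yellow"), 1000, replace = TRUE),
  gender  = sample(c("male", "female"), 1000, replace = TRUE),
  age     = sample(c("old", "older", "oldest"), 1000, replace = TRUE)
)

# Names of the categorical columns to summarise by (illustrative name)
cat_vars <- c("color", "gender", "age")

# Build one summary table per categorical variable, returned as a named list
summaries <- map(
  set_names(cat_vars),
  function(v) {
    df %>%
      group_by(across(all_of(v))) %>%
      summarize(
        count = n(),
        mean  = mean(numeric, na.rm = TRUE),
        sd    = sd(numeric, na.rm = TRUE)
      )
  }
)

# Look at one of the tables
summaries$gender
```

Each element of `summaries` has the same shape as the tables above, so `summaries$color` or `summaries$age` can be printed or exported individually.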
jrcalabrese
  • Could you write the code using just one `group_by` to get the average age for the different categories of the qualitative variables x1, x2 and x3? – Mehdi Looha Feb 08 '22 at 09:21
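
A single `group_by` seems possible if the categorical columns are first reshaped to long format with `tidyr::pivot_longer()`, so that each row carries a variable name and a level. The sketch below assumes the categorical columns all share one type (character here); the data are simulated, and the column names `variable` and `level` are illustrative choices rather than anything from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data, mirroring the simulated data frame in the answer
set.seed(1)
df <- data.frame(
  studyid = 1:1000,
  numeric = sample(1:25, 1000, replace = TRUE),
  color   = sample(c("red", "blue", "yellow"), 1000, replace = TRUE),
  gender  = sample(c("male", "female"), 1000, replace = TRUE),
  age     = sample(c("old", "older", "oldest"), 1000, replace = TRUE)
)

long_summary <- df %>%
  # Stack the categorical columns: one row per observation per variable
  pivot_longer(
    cols      = c(color, gender, age),
    names_to  = "variable",
    values_to = "level"
  ) %>%
  # A single group_by over variable name and level
  group_by(variable, level) %>%
  summarize(
    count = n(),
    mean  = mean(numeric, na.rm = TRUE),
    sd    = sd(numeric, na.rm = TRUE),
    .groups = "drop"
  )

long_summary
```

The result is one table with a row for every level of every categorical variable, which can be filtered with, for example, `filter(long_summary, variable == "gender")`.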