I am trying to summarise and visualise three attributes in R:

Patient_Age, Patient_Deprivation and Hospital_Time.

I want to summarise the time patients spent in hospital (diffr_days) by deprivation (on a scale of 1-5) and age. I have also calculated the covariance and correlation. What would be the best way to visualise this? So far, only the violin plot and the correlation matrix make some sense of the data; the other plots I tried do not. What would be the best plot and descriptive method for this kind of data?

Age Cov: 28.42936 Corr: 0.2497208

Deprivation Cov: 0.3389552 Corr: 0.07772134
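
For reference, a minimal sketch of how figures like these can be computed in base R with `cov()` and `cor()`. It uses only the first ten rows of the data posted below, so its output will not match the values above, which come from the full data set; the name `sample_df` is introduced here purely for illustration.

```r
# First ten rows of the posted data (illustration only; `sample_df` is a hypothetical name)
sample_df <- data.frame(
  age         = c(23L, 50L, 6L, 92L, 70L, 70L, 53L, 16L, 16L, 70L),
  deprivation = c(1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L),
  diffr_days  = c(8L, 2L, 4L, 12L, 12L, 6L, 7L, 6L, 9L, 7L)
)

# Covariance and Pearson correlation of hospital time with each predictor
stats <- sapply(sample_df[c("age", "deprivation")], function(x) {
  c(cov  = cov(x, sample_df$diffr_days),
    corr = cor(x, sample_df$diffr_days))
})
print(round(stats, 3))
```

Because deprivation is an ordinal 1-5 score, a rank-based measure such as `cor(..., method = "spearman")` may be worth reporting alongside the Pearson value.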

Below is the screenshot of my data frame and some visualisations

do.call(cbind.data.frame,list(id = c( 
1011L, 1012L, 1012L, 1014L, 1015L, 1015L, 1018L, 1018L, 1021L, 
1022L, 1028L, 1029L, 1036L, 1037L, 1042L, 1044L, 1045L, 1048L, 
1049L, 1050L, 1050L, 1051L, 1051L, 1052L, 1054L, 1057L, 1061L, 
1064L, 1064L, 1065L, 1066L, 1067L, 1067L, 1069L, 1072L, 1073L, 
1078L, 1079L, 1082L, 1083L, 1084L, 1086L, 1087L, 1089L, 1090L, 
1090L, 1091L, 1092L, 1095L, 1096L),

age = c(23L, 50L, 6L, 92L, 70L, 70L, 53L, 16L, 16L, 70L, 58L, 58L, 30L, 30L, 30L, 10L, 34L, 69L, 22L, 16L, 38L, 71L, 24L, 57L, 5L, 79L, 79L, 37L, 37L, 35L, 40L, 45L, 72L, 97L, 97L, 34L, 28L, 29L, 29L, 78L, 22L, 25L, 31L, 36L, 53L, 49L, 17L, 48L, 56L, 32L),

deprivation = c(1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 4L, 4L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 4L, 4L, 2L, 4L, 1L, 1L, 3L, 3L, 4L, 3L, 3L, 3L, 2L, 2L, 1L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L),

diffr_days = c(8L, 2L, 4L, 12L, 12L, 6L, 7L, 6L, 9L, 7L, 8L, 4L, 11L, 5L, 1L, 10L, 7L, 9L, 2L, 5L, 10L, 6L, 3L, 2L, 9L, 1L, 2L, 9L, 0L, 9L, 4L, 1L, 10L, 2L, 5L, 9L, 14L, 0L, 5L, 0L, 0L, 1L, 10L, 8L, 3L, 4L, 2L, 1L, 0L, 2L )))

Nick Cox
  • Can you edit your post to include your data, ideally by posting the output of dput(your_data_frame)? Given that you apparently have a huge amount of data, perhaps subsample it? – Stephan Kolassa Jun 27 '22 at 14:21
  • @StephanKolassa. Hi, I have added the data now. Is that fine? Thanks – Usman YousafZai Jun 27 '22 at 14:51
  • I would suggest a beeswarm plot or a jittered scatter plot for the first case, and a hexbin plot for the second – mkt Jun 27 '22 at 14:58
  • Thanks, adding the data makes life easier for us. Perhaps add a little more. I agree with @mkt that a beeswarm plot would be a good replacement for your first plot, although I would overlay the "bees" over a violinplot/beanplot. A hexbinplot would also be a good replacement for the second plot. Q: are you trying to put all the information into a single plot? You could add a grayscale indication of deprivation to the second plot, but you would need to do a lot more jittering, and I don't know how legible that would be. Or do five versions of the second plot, faceting by deprivation. – Stephan Kolassa Jun 27 '22 at 15:07
  • @mkt The beeswarm is very congested and no information can be inferred from the visualisation. – Usman YousafZai Jun 27 '22 at 15:18
  • As Stephan Kolassa says, a violin plot is another good option for plot #1. This is especially true if you have a lot of data. – mkt Jun 27 '22 at 15:20
  • @StephanKolassa, I am trying to visualise it in a way that some sort of information can be highlighted easily. I am also looking to see what type of model (PCA, regression) or statistical tests can be applied. If a user is not able to capture any information from the visualisations, then he/she should be able to look at the model values and conclude something. – Usman YousafZai Jun 27 '22 at 15:23
  • Hm. I would not use a visualization to decide between PCA and regression, which are very different things and used for very different purposes - this decision should be based on what you are trying to do. One more possibility: plot time vs. age (as your bottom plot), but without the raw data, and with five different loess lines, one per deprivation. – Stephan Kolassa Jun 27 '22 at 15:26
  • Is there any censoring in hospital_time? – gung - Reinstate Monica Jun 27 '22 at 17:26
  • @gung-ReinstateMonica. No there is no censoring. – Usman YousafZai Jun 28 '22 at 08:27
  • @StephanKolassa. Can you help with the plot (time vs age with loess lines)? I tried to do it but couldn't. The violin plot and correlation matrix make some sense of the data. – Usman YousafZai Jun 28 '22 at 08:31
  • I will try. Can you please include a little more data? Ideally, about five times as much as there is in your post right now. – Stephan Kolassa Jun 28 '22 at 09:16
  • @StephanKolassa . I have edited the question and added more data. I also deleted the extra columns. – Usman YousafZai Jun 28 '22 at 10:09

2 Answers

Here is a possibility: plot hospital stay against age, with five different loess fit lines, one per deprivation level. In the plot below, I used your example data (actually, I replicated it five times, because loess needs enough data to calculate the smoother; with your original data, this should not be a problem). I also plotted the raw points from the example. With your original data, plotting the raw points will likely not be very helpful, and if you really want to include them, you should make the dots a lot smaller.

loess plot

I used a black body radiation palette as per here.

As an alternative for your first plot, I would suggest a beanplot. I overlaid it with boxplots and again added the example data, with some horizontal jittering to avoid overplotting. You could even indicate ages using some kind of color coding, e.g., a black body radiation palette as above, but with your original sample size, I don't think that would be very helpful.

beanplots
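
The horizontal jittering used for the raw points can be sketched in isolation as follows. It shifts each point's x position by a small uniform random amount, so that patients with the same deprivation level and length of stay no longer sit on top of each other. The ±0.2 half-width follows the R code below; the fixed seed and the ten-row sample (the first ten rows of the posted data) are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed so the jitter is reproducible

# First ten rows of the posted data
deprivation <- c(1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L)
diffr_days  <- c(8L, 2L, 4L, 12L, 12L, 6L, 7L, 6L, 9L, 7L)

# Shift each point horizontally by a uniform offset in [-0.2, 0.2]
x_jittered <- deprivation + runif(length(deprivation), -0.2, 0.2)

# Draw the jittered points, labelling the axis at the true deprivation levels
plot(x_jittered, diffr_days, pch = 19, xaxt = "n",
     xlab = "Deprivation", ylab = "Time in hospital (days)")
axis(1, at = sort(unique(deprivation)))
```

Keeping the offset well below 0.5 ensures that points from neighbouring deprivation levels cannot overlap.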

I personally prefer base graphics. You should be able to build the first plot with ggplot2, but I don't know whether beanplots are available there.

R code:

dataset <- 
    do.call(cbind.data.frame,list(id = c( 
    1011L, 1012L, 1012L, 1014L, 1015L, 1015L, 1018L, 1018L, 1021L, 
    1022L, 1028L, 1029L, 1036L, 1037L, 1042L, 1044L, 1045L, 1048L, 
    1049L, 1050L, 1050L, 1051L, 1051L, 1052L, 1054L, 1057L, 1061L, 
    1064L, 1064L, 1065L, 1066L, 1067L, 1067L, 1069L, 1072L, 1073L, 
    1078L, 1079L, 1082L, 1083L, 1084L, 1086L, 1087L, 1089L, 1090L, 
    1090L, 1091L, 1092L, 1095L, 1096L),
age = c(23L, 50L, 
6L, 92L, 70L, 70L, 53L, 16L, 16L, 70L, 58L, 58L, 30L, 30L, 30L, 
10L, 34L, 69L, 22L, 16L, 38L, 71L, 24L, 57L, 5L, 79L, 79L, 37L, 
37L, 35L, 40L, 45L, 72L, 97L, 97L, 34L, 28L, 29L, 29L, 78L, 22L, 
25L, 31L, 36L, 53L, 49L, 17L, 48L, 56L, 32L),

deprivation = c(1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 
2L, 2L, 4L, 4L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 
3L, 2L, 2L, 4L, 4L, 2L, 4L, 1L, 1L, 3L, 3L, 4L, 3L, 3L, 3L, 2L, 
2L, 1L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L), 

diffr_days = c(8L, 2L, 4L, 12L, 12L, 
6L, 7L, 6L, 9L, 7L, 8L, 4L, 11L, 5L, 1L, 10L, 7L, 9L, 2L, 5L, 
10L, 6L, 3L, 2L, 9L, 1L, 2L, 9L, 0L, 9L, 4L, 1L, 10L, 2L, 5L, 
9L, 14L, 0L, 5L, 0L, 0L, 1L, 10L, 8L, 3L, 4L, 2L, 1L, 0L, 2L
)))

dataset <- do.call(rbind.data.frame,list(dataset,dataset,dataset,dataset,dataset))

blackBodyRadiationColors <- function(x, max_value=1) {
    # x should be between 0 (black) and 1 (white)
    # if large x come out too bright, constrain the bright end of the palette
    # by setting max_value lower than 1
    foo <- colorRamp(c(rgb(0,0,0),rgb(1,0,0),rgb(1,1,0),rgb(1,1,1)))(x*max_value)/255
    apply(foo,1,function(bar)rgb(bar[1],bar[2],bar[3]))
}

n.colors <- length(unique(dataset$deprivation))
colors.blackBody <- blackBodyRadiationColors(seq(0,0.6,length.out=n.colors))
par(mai=c(.8,.8,.1,.1))

# scatter plot of the raw data, colored by deprivation level
with(dataset,plot(range(age),range(diffr_days),type="n",las=1,xlab="Age",ylab="Time in hospital (days)"))
with(dataset,points(age,diffr_days,pch=19,col=colors.blackBody[deprivation]))
legend("topleft",pch=19,lwd=1,col=colors.blackBody,legend=sort(unique(dataset$deprivation)),title="Deprivation")

# one loess fit per deprivation level
for ( ii in unique(dataset$deprivation) ) {
    loess_model <- loess(diffr_days~age,data=subset(dataset,deprivation==ii))
    xx <- seq(min(subset(dataset,deprivation==ii)["age"]),max(subset(dataset,deprivation==ii)["age"]))
    loess_fit <- predict(loess_model,newdata=data.frame(age=xx))
    # plot the fit against the age grid xx, not against the vector index
    lines(xx,loess_fit,col=colors.blackBody[ii])
}

# beanplot with overlaid boxplots and horizontally jittered raw data
library(beanplot)
with(dataset,beanplot(diffr_days~deprivation,las=1,what=c(0,1,0,0),col="lightgray",border=NA,
    xlab="Deprivation",ylab="Time in hospital (days)"))
with(dataset,boxplot(diffr_days~deprivation,add=TRUE,outline=FALSE,col=NA,yaxt="n"))
with(dataset,points(deprivation+runif(nrow(dataset),-0.2,0.2),diffr_days,pch=19))

Stephan Kolassa

A Wilkinson dot plot is a handy graphic for visualizing counts. It's a histogram of stacked dots.

Here is a Wilkinson dot plot of the sample data, split by deprivation. It's easy to notice, for example, that all counts are multiples of 5.

Wilkinson dot plot

With the sample data, one dot represents one patient. With the full dataset, it would probably be necessary to summarize so that the chart doesn't get cluttered, e.g., one dot could represent 10 patients.
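
A rough sketch of that summarizing step, assuming one dot per 10 patients and rounding up so that small groups still get a dot. The `patients_per_dot` name and the rounding rule are assumptions rather than part of the plot above, and the data are the first ten rows of the posted sample repeated 20 times to mimic a larger dataset.

```r
# First ten rows of the posted data, repeated to mimic a larger dataset (assumption)
small <- data.frame(
  deprivation = c(1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L),
  diffr_days  = c(8L, 2L, 4L, 12L, 12L, 6L, 7L, 6L, 9L, 7L)
)
big <- small[rep(seq_len(nrow(small)), times = 20), ]

patients_per_dot <- 10  # hypothetical scale: one dot stands for 10 patients

# Count patients per (deprivation, length of stay) cell
counts <- aggregate(n ~ deprivation + diffr_days,
                    data = transform(big, n = 1L), FUN = sum)

# Number of dots to draw per cell, rounding up so no non-empty cell disappears
counts$dots <- ceiling(counts$n / patients_per_dot)
print(counts[order(counts$deprivation, counts$diffr_days), ])
```

The resulting `dots` column could then be drawn in place of the raw rows, with the y-axis labels multiplied by `patients_per_dot` so the chart still reads in patients.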

PS: Color is under-used in this dot plot. It would be great to color by age (or by average age, if dots represent multiple patients of similar age). Unfortunately, I hit the limits of ggplot2, so in R at least it would have to be a custom-made chart.


R code to make a Wilkinson dot plot.

# Adapted from:
# https://stackoverflow.com/questions/49330742/change-y-axis-of-dot-plot-to-reflect-actual-count-using-geom-dotplot

library("tidyverse")

dataset <- as_tibble(dataset) %>%
  mutate(
    deprivation = factor(deprivation)
  )
counts <- dataset %>%
  count(
    deprivation, diffr_days
  )

binwidth <- 0.2
dotsize <- 1
yheight <- max(counts$n)

dataset %>%
  ggplot(
    aes(x = diffr_days, fill = deprivation)
  ) +
  geom_dotplot(
    binwidth = binwidth,
    dotsize = dotsize,
    method = "histodot"
  ) +
  coord_fixed(
    ratio = binwidth * dotsize * yheight
  ) +
  facet_wrap(
    ~ "deprivation" + deprivation,
    ncol = 2
  ) +
  scale_x_continuous(
    limits = c(0, 15)
  ) +
  scale_y_continuous(
    limits = c(0, 1),
    expand = c(0, 0),
    breaks = seq(0, 1, 5 * 1 / yheight),
    labels = seq(0, yheight, by = 5)
  ) +
  labs(
    x = "Time in hospital (days)",
    y = "Number of patients"
  ) +
  theme(legend.position = "none")

dipetkov
  • I agree strongly that (Wilkinson) dot plots can be helpful. But the bottom line with this size of sample is that you are essentially creating pointillist versions of histograms or bar charts, either of which in many software environments requires just one line of code. – Nick Cox Aug 25 '22 at 13:02
  • @NickCox Thanks for the feedback. I agree it's not the most successful plot ever. Better than a violin plot? But I admit I also wanted to try making a Wilkinson dot plot. It turned out to be surprisingly hard actually. – dipetkov Aug 25 '22 at 13:30
  • You mean surprisingly hard in R? I am only an occasional user, but I find that astonishing. – Nick Cox Aug 25 '22 at 13:56
  • @NickCox It's entirely possible that I missed something since several kinds of plots are called "dot plot". However: https://github.com/tidyverse/ggplot2/issues/2203. I don't really use base R for my plotting; it's very likely that it's easy to do in base R (stripchart?). I came away with the feeling that the math of making a dot plot is not trivial. – dipetkov Aug 25 '22 at 14:29