1

I have a data set looking sth like this ....

Date        Remaining  Volume   ID
1990-01-01  0          1000     1
1990-01-01  1          2000     2
1990-01-01  1          5000     3
1990-02-01  0          200      4
1990-03-01  1          4000     5
1990-03-01  0          3000     6

I filter the data according to a series of conditional statements and assign the binary flag variable to the data.table. A value of 0 means that the particular row entry doesn't meet the defined requirements and will subsequently be excluded; 1-flagged rows remain in the data.table. The key is ID and is unique for each row.

I would like to show two relationships.

(1) A stacked normalized/percentage bar chart over the monthly time series to show the percentage of entries remaining/being excluded in the data.set for each month,

f.ex. Jan 1990 --> 2/3 values remaining --> 66.6% vs. 33.3% of entries remain vs. are excluded

(2) A stacked normalized/percentage bar chart showing the normalized percentage of volume remaining/ being excluded by the filtering operation for each month,

f.ex. Jan 1990 --> 2k + 5k out of 8k remaining --> 87.5% vs. 12.5% of volume remains vs. is excluded

I tried various things so far, f.ex. compute the number of occurences of each flag-value per month and the sum of the corresponding "bucket" (0/1) volume, but all my attempts failed so far.

# dt_1 is the original data.table
id.vec <- dt_1[ , id]
dt_2 <- dt_1
# dt_1 is filterd subsequently
id_remaining.vec <- dt_1[ , id]

dt_2 <- dt_2[id.vec %in% id_remaining.vec, REMAIN := 1]
dt_2 <- dt_2[id.vec %notin% id_remaining.vec, REMAIN := 0]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN := .N]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN_MON := .N]

# Tried the code below to no avail
ggplot(data = dt_2, aes(x = Date, y = REMAIN, color = REMAIN, fill = REMAIN)) +
  geom_bar(position = "fill", stat = "identity")

Usually, I find ggplot grammar very intuitive, but I guess I am overlooking sth here or maybe the data set is not in the right format.

Any pointer or idea highly appreciated!

aimbotter21
  • 171
  • 1
  • 5

2 Answers2

2

Here's how I'd do it with dplyr:

library(dplyr)
dt_2 %>%
  mutate(Remaining = as.character(Remaining)) %>%  # just to make the charts use scale_fill_discrete by default
  group_by(Date, Remaining) %>%
  summarize(entries = n(),
         volume = sum(Volume)) %>%
  mutate(share_entries = entries / sum(entries),
            share_volume = volume / sum(volume)) %>%
  ungroup() -> dt_2_summary

> dt_2_summary
# A tibble: 5 x 6
  Date       Remaining entries volume share_entries share_volume
  <chr>      <chr>       <int>  <int>         <dbl>        <dbl>
1 1990-01-01 0               1   1000         0.333        0.125
2 1990-01-01 1               2   7000         0.667        0.875
3 1990-02-01 0               1    200         1            1    
4 1990-03-01 0               1   3000         0.5          0.429
5 1990-03-01 1               1   4000         0.5          0.571

Then to chart:

dt_2_summary %>%
  ggplot(aes(Date, share_entries, fill = Remaining)) +
  geom_col()

enter image description here

dt_2_summary %>%
  ggplot(aes(Date, share_volume, fill = Remaining)) +
  geom_col()

enter image description here

Jon Spring
  • 40,151
  • 4
  • 32
  • 50
  • That it awesome John! I love your proposed solution in particular as the ```ggplot()``` calls are so compact as a result of the data wrangling beforehand. Your answer is exactly what I was hoping for and more! **Very much appreciated!** One short follow-up question though. Why do you have to change the ```integer``` column ```Remain``` to ```character``` before grouping and summarizing? – aimbotter21 Mar 17 '21 at 09:47
  • I tried several versions now, but it took me a lot of time to debug on my end as I was getting a couple weird results before it finanally worked! And btw sorry for misspelling your name earlier Jon :) – aimbotter21 Mar 17 '21 at 12:55
  • I switched Remain to character only so that it would be a discrete variable going into ggplot. Otherwise it plots the 0's and 1's on a continuous scale and that seemed the quickest way around it. – Jon Spring Mar 17 '21 at 14:35
  • Ah, that makes sense. Usually I just use ```as.factor(xyz)``` in the ```aes()``` mapping definition of the variable I want to use for coloring / filling my ```geom```. But your way is quite elegant as well. Thanks again Jon! In the end this was essentially just the backbone of my plot, that I further customized (date labels, etc.) – aimbotter21 Mar 17 '21 at 15:16
0

Just as an appendix to Jon's great soution.

I had a large project with >25 libraries loaded and while the proposed code seemingly worked, it only did work for the share_entries and not for share_volume. Output of dt_2_summary was weird. The share_entries column was apparently computed to the total number of entries and not within each group and the share_volume column only showed NAs.

After hours of troubleshooting, I identified the culprit to be the pkg plyr, which did overwrite some functions. Thus, I had to specify which version of the applied functions I wanted to use.

The code below did the trick for me.

library(plyr) # the culprit
library(dplyr)
dt_2 %>%
  dplyr::mutate(Remaining = as.character(Remaining)) %>%
  group_by(Date, Remaining) %>%
  dplyr::summarize(entries = n(),
         volume = sum(Volume)) %>%
  dplyr::mutate(share_entries = entries / sum(entries),
            share_volume = volume / sum(volume)) %>%
  ungroup() -> dt_2_summary

Thanks again Jon for your wonderful solutiopn!

aimbotter21
  • 171
  • 1
  • 5