5

I have a numeric vector that I want to convert to five numeric levels. I can get the five levels using `cut`:

dx <- data.frame(x=1:100)
dx$cut <- cut(dx$x,5)

But I am now having problems extracting the lower and upper boundaries of the levels. For example, for the level `(0.901,20.8]` I would like 0.901 in `dx$min` and 20.8 in `dx$max`.

I tried:

dx$min <- pmin(dx$cut)
dx$max <- pmax(dx$cut)
dx

But this does not work.

Ronak Shah
adam.888
  • `dx$cut` is a factor variable. You would need to split / extract the numbers from it to get numerical values – talat Sep 08 '16 at 10:11

2 Answers

10

You can try splitting the labels on the comma, after converting them to character and stripping the punctuation (except `,` and `.`, plus `+` and `-` so that negative or exponent-format boundaries survive), and then create two columns:

# the regex replaces every punctuation mark except , . + and - with an empty string
min_max <- unlist(strsplit(gsub("(?![,.+-])[[:punct:]]", "", as.character(dx$cut), perl=TRUE), ","))

dx$min <- min_max[seq(1, length(min_max), by=2)]
dx$max <- min_max[seq(2, length(min_max), by=2)]

head(dx)
#  x          cut   min  max
#1 1 (0.901,20.8] 0.901 20.8
#2 2 (0.901,20.8] 0.901 20.8
#3 3 (0.901,20.8] 0.901 20.8
#4 4 (0.901,20.8] 0.901 20.8
#5 5 (0.901,20.8] 0.901 20.8
#6 6 (0.901,20.8] 0.901 20.8
Cath
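Note that the columns produced above are character vectors. As a minimal sketch (not part of the original answer), the same idea can be applied once per factor level rather than once per row, with the result converted to numeric. The variable names `lev`, `inner`, `parts`, `lower`, and `upper` are illustrative only, and the code assumes the default `cut` label format of one bracket character on each side.

```r
dx <- data.frame(x = 1:100)
dx$cut <- cut(dx$x, 5)

# Work on the distinct labels only, e.g. "(0.901,20.8]"
lev <- levels(dx$cut)

# Drop the opening and closing bracket characters: "0.901,20.8"
inner <- substr(lev, 2, nchar(lev) - 1)

# Split on the comma and convert each half to numeric
parts <- strsplit(inner, ",", fixed = TRUE)
lower <- as.numeric(vapply(parts, `[`, character(1), 1))
upper <- as.numeric(vapply(parts, `[`, character(1), 2))

# Map the per-level bounds back to the rows via the factor's integer codes
dx$min <- lower[as.integer(dx$cut)]
dx$max <- upper[as.integer(dx$cut)]

head(dx)
```

Because `as.numeric` is applied to the text between the brackets, signs and exponent notation such as `-1e+03` are handled without extra regex work.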
  • I find it surprising that there is no default way to do this. It is something I often end up doing, and I always need some hack like the one given here. Maybe I'm not searching for the right keywords... – Simon C. Nov 20 '20 at 16:00
  • @SimonC. I guess you could get the breaks from your data using `seq` and `range`, setting `length.out` to 1 + the number of factors in your `cut` call. (I usually use `cut` with predefined breaks, so I don't encounter this problem.) – Cath Nov 20 '20 at 16:07
  • For anyone who might be interested in how to implement @Cath's suggestion, you could do: `breaks=seq(from=range(my.x)[1]-1, to=range(my.x)[2], length.out=1+my.n.factors)`. I subtract one at `from` because `cut`'s default doesn't start at the minimum. It doesn't start at `min-1` either, but it was the easiest way for me. – Andres Silva Apr 29 '21 at 22:11
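The idea in these comments, computing the breaks up front so the boundaries never have to be parsed out of labels, could be sketched as follows. This is only one possible reading of the suggestion: the names `n_bins`, `breaks`, and `bin` are made up for the example, and it uses `include.lowest = TRUE` instead of subtracting 1 from the minimum. The resulting bins span exactly `min(x)` to `max(x)`, so the outer boundaries differ slightly from those of `cut(x, 5)`, which moves the outer limits away by 0.1% of the range.

```r
x <- 1:100
n_bins <- 5

# Equal-width break points from the data range: 1, 20.8, 40.6, 60.4, 80.2, 100
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# labels = FALSE returns the integer bin index instead of a factor;
# include.lowest = TRUE keeps min(x) inside the first bin
bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Look the numeric boundaries up directly from the break vector
dx <- data.frame(x = x, min = breaks[bin], max = breaks[bin + 1])

head(dx)
```

Since `min` and `max` come straight from `breaks`, they are numeric and carry full precision rather than the rounded values printed in the labels.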
5

Below is a tidyverse-style solution.

library(tidyverse)

tibble(x = seq(-1000, 1000, length.out = 10),
       x_cut = cut(x, 5)) %>% 
  mutate(x_tmp = str_sub(x_cut, 2, -2)) %>% 
  separate(x_tmp, c("min", "max"), sep = ",") %>% 
  mutate_at(c("min", "max"), as.double)
#> # A tibble: 10 x 4
#>         x x_cut           min   max
#>     <dbl> <fct>         <dbl> <dbl>
#>  1 -1000  (-1e+03,-600] -1000  -600
#>  2  -778. (-1e+03,-600] -1000  -600
#>  3  -556. (-600,-200]    -600  -200
#>  4  -333. (-600,-200]    -600  -200
#>  5  -111. (-200,200]     -200   200
#>  6   111. (-200,200]     -200   200
#>  7   333. (200,600]       200   600
#>  8   556. (200,600]       200   600
#>  9   778. (600,1e+03]     600  1000
#> 10  1000  (600,1e+03]     600  1000

Created on 2019-01-10 by the reprex package (v0.2.1)
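One caveat when parsing labels: `cut` formats them with `dig.lab = 3` significant digits by default, so a label such as `(-1e+03,-600]` is a rounded rendering of the break actually used. The base R sketch below (not part of the original answer) compares the parsed bounds with and without a larger `dig.lab`. The helper name `parse_bounds` is hypothetical, and the code assumes the default bracketed label format.

```r
x <- seq(-1000, 1000, length.out = 10)

# Hypothetical helper: turn "(a,b]" style levels into numeric bounds
parse_bounds <- function(f) {
  lev <- levels(f)
  inner <- substr(lev, 2, nchar(lev) - 1)      # strip the two brackets
  parts <- strsplit(inner, ",", fixed = TRUE)  # split "a,b" into c("a", "b")
  data.frame(
    level = lev,
    min = as.numeric(vapply(parts, `[`, character(1), 1)),
    max = as.numeric(vapply(parts, `[`, character(1), 2))
  )
}

rounded <- parse_bounds(cut(x, 5))               # default dig.lab = 3
precise <- parse_bounds(cut(x, 5, dig.lab = 6))  # labels keep more digits

rounded$min[1]
precise$min[1]
```

With the default labels the first lower bound parses as -1000, while with `dig.lab = 6` it parses as -1002, reflecting the 0.1% that `cut` adds to each end of the range when `breaks` is a single number.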

Bryan Shalloway