How can I calculate the distance between consecutive regions in a BED-like (chr, start, end) data frame in R? Specifically, for each row r, I want to subtract the end of row r from the start of row r+1. The calculation should be done separately for each chromosome.

head(G3_dat[1:3])

# A tibble: 6 × 3
  chrom   start     end
  <chr>   <int>   <int>
1 chr1  1330426 1330628
2 chr1  2017424 2017750
3 chr1  2017424 2017750
4 chr1  2018462 2018871
5 chr1  3899411 3899468
6 chr1  4653431 4653724

The expected outcome would be,

head(G3_dat[1:4])
# A tibble: 6 × 4
  chrom   start     end     dis
  <chr>   <int>   <int>     <int>
1 chr1  1330426 1330628
2 chr1  2017424 2017750     686796
3 chr1  2017424 2017750       -326
4 chr1  2018462 2018871        712
5 chr1  3899411 3899468    1880540
6 chr1  4653431 4653724     753963

Ram RS
Deb

1 Answer

There are two functions in the tidyverse suite (specifically dplyr) called lead and lag that allow calculations like this to be done:

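library(tidyverse)
G3_dat %>%
   group_by(chrom) %>%                    # compute distances separately per chromosome
   arrange(start, .by_group = TRUE) %>%   # sort by start within each chromosome
   mutate(dis = start - lag(end)) %>%     # gap to the previous region's end; first row per chromosome is NA
   ungroup()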

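I wasn't sure at first whether lead or lag was the right function here (I've updated it based on a suggestion from @ram-rs). lag(end) returns the end of the previous row within each chromosome group, so start - lag(end) is the distance from the preceding region, and the first row of each chromosome gets NA. If in doubt, it doesn't take long to try the combinations on a few rows and compare against the expected output.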
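The lead/lag question can be checked directly on the six rows shown in the question. The following is a minimal sketch, assuming dplyr is installed; the data frame is retyped from the question, and the name res is only used here for illustration.

```r
library(dplyr)

# Six rows retyped from the question (all on chr1)
G3_dat <- tibble(
  chrom = rep("chr1", 6),
  start = c(1330426L, 2017424L, 2017424L, 2018462L, 3899411L, 4653431L),
  end   = c(1330628L, 2017750L, 2017750L, 2018871L, 3899468L, 4653724L)
)

res <- G3_dat %>%
  group_by(chrom) %>%                    # one calculation per chromosome
  arrange(start, .by_group = TRUE) %>%   # sort by start within each chromosome
  mutate(dis = start - lag(end)) %>%     # current start minus previous row's end
  ungroup()

res$dis
# NA 686796 -326 712 1880540 753963
```

These values match the dis column in the expected outcome, which suggests lag (not lead) is the right choice for "start of row r+1 minus end of row r". The negative value comes from the duplicated region in rows 2 and 3.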

FWIW, %>% (or |>, which is built into base R since version 4.1) is a "pipe" operator that means "take the result of the last function and feed it into the first argument of the next function". It is used extensively in tidyverse scripts. More details here.
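As a small illustration of what the pipe does, the sketch below compares a nested call with the same call written as a pipeline. It assumes R 4.1 or later for the native |> operator; the variable names are made up for the example.

```r
x <- c(4, 9, 16)

# Nested form: read from the inside out
nested <- round(sqrt(x), 1)

# Piped form: x goes into sqrt(), and that result becomes the first argument of round()
piped <- x |> sqrt() |> round(1)

identical(nested, piped)
# TRUE
```

With magrittr or dplyr loaded, replacing |> with %>% in this example gives the same result.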

zx8754
gringer