Range overlap python error with genomic regions

Question

I have two files

I am trying to match col0 of both the files and get rows that fall between range(inclusive of themselves) 10-20 and 20-30 as seen in s4 file. file s4 has co ordinates which can be used as reference range (chrom start and end) and s3 has list of co ordinates from an experimental condition what I am trying to achieve is to which co-ordinates from my file s3 fall on or between my reference co ordinates in s4.

code so far:

containing_ranges = []
with open('s4.txt', 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        containing_ranges.append(fields)

tested_ranges = []       
with open('s3.txt', 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        tested_ranges.append(fields)

for c_range in containing_ranges:
    for t_range in tested_ranges:
        tst = int(t_range[1])
        ten = int(t_range[2])
        cst = int(c_range[1])
        cen = int(c_range[2])
        if  c_range[0] == t_range[0]:
            included = cst >= tst and cen <= ten
            if included == True:
               print t_range

Output with missing row(1 14 17) :

['1', '10', '20']
['1', '5', '20']
['1', '10', '50']
['2', '20', '30']
['2', '20', '60']

Desired output:

Not sure if my logic is wrong and why does it miss 14-17 as it falls between 10-20

[EDIT] using pybedtools

>>> print(s4.intersect(s3, wb=True))
1   10  20  1   10  20
1   10  20  1   5   20
1   10  20  1   10  50
1   14  17  1   14  17
2   20  30  2   20  30
2   25  30  2   25  30
2   20  30  2   20  60

>>> print(s4.intersect(s3, wa=True, wb=True, F=1))
1   10  20  1   10  20
1   10  20  1   14  17
2   20  30  2   20  30
2   20  30  2   25  30


using bedops 
bin$ less answer.bed 
1       5       20
1       10      20
1       10      50
1       14      17
2       20      30
2       20      60
2       25      30

using @bli code(on python2.7)
('1', 10, 20)
('1', 14, 17)
('2', 20, 30)
('2', 25, 30)
 why can I not see the interval 1 5 20

Please bring your logic/coding questions to Stack Overflow. The relationship of your question to the subject of Bioinformatics is merely coincidental. — Robert Cartaino, Jun 20 '17 at 23:04
@RobertC If OP adds a "bed" tag, this question will immediately look like a bioinformatics question. Also, see the answers. OP is much more likely to get such spot-on answers here. This question could be improved for sure, but it is not off-topic. — user172818, Jun 20 '17 at 23:50
Just use bedops, as indicated. Using wrappers to command-line tools is rarely a substitute for learning the tools. — Alex Reynolds, Jun 21 '17 at 04:00
Can you please add more story/context around this question? It has the appearance of a pure programming question (presumably why it was marked as off-topic). It would be nice if you could explain what the different numbers mean, and why you want to do this. — gringer, Jun 21 '17 at 04:49
You should use more meaningful names for your variables. It would make the code easier to read, for others but also for you. — bli, Jun 21 '17 at 08:39
It seems you want the ranges from s3.txt ("tested ranges") that are included in a range in s4.txt ("containing ranges"). In this case, I think your error is in the start and end coordinates comparisons. Let's note t_start and t_end the coordinates of the tested range and c_start and c_end the coordinates of the containing range. What you want is c_start <= t_start and t_end <= c_end. — bli, Jun 21 '17 at 09:18
I edited your question to use more meaningful variable names, and I also removed 1 5 20 from the expected output: If I understood correctly, this is not what you want because this is not included in any of the ranges defined in s4.txt — bli, Jun 21 '17 at 09:31
I can't post an answer since your question is "on hold", but here is a (hopefully) corrected version of your code, with minor some coding style improvements and using python3: http://paste.ubuntu.com/24915950/ I hope this helps. — bli, Jun 21 '17 at 09:41
ohh I was away for a bit, so many suggestions thank you all. will edit my post once I go through each of your suggestions — novicebioinforesearcher, Jun 21 '17 at 14:09
You say "which co-ordinates from my file s3 fall on or between my reference co ordinates in s4". If I interpret this correctly, this means that you accept partial overlaps too, not only complete inclusions. Then the desired output should be all the ranges in s3, and not the restricted list I mistakenly corrected. — bli, Jun 21 '17 at 14:56
@novicebioinforesearcher It looks like bedops was able to find your missing interval. If you need to deal with strand labels in the sixth column (per BED specification), you can split a BED file by strand via awk '$6=="+"' in.bed > in.forward.bed and awk '$6=="-"' in.bed > in.reverse.bed, and then run set operations on each of the strand-split files. If you need to reconstruct one file at the end, use bedops -u to do a multiset union of all input BED files. — Alex Reynolds, Jun 21 '17 at 20:45

Devon Ryan · Answer 1 · 2017-06-22T11:22:23.687

5

You're reinventing bedtools intersect (or bedops), for which there's already a convenient python module:

from pybedtools import BedTool

s3 = BedTool('s3.bed')
s4 = BedTool('s4.bed')

print(s4.intersect(s3, wa=True, wb=True, F=1))

The wb=True is equivalent to -wb with bedtools intersect on the command line. Similarly, F=1 is the same as -F 1.

edited Jun 22 '17 at 11:22

answered Jun 20 '17 at 19:38

Devon Ryan

19,602
2
29
60

NotImplementedError: "intersectBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method. used pip install pybedtools – novicebioinforesearcher Jun 20 '17 at 19:55
1

You should be able to read and understand how to deal with that error message. – Devon Ryan Jun 20 '17 at 19:57
Hi i installed bedtools made is environment variable i can use it in bash from any directory i still get the same error, is there a different way than this? thank you – novicebioinforesearcher Jun 20 '17 at 20:15
You can always install it via conda, though it works for me with the outdated version of bedtools that I have installed (2.25.0). – Devon Ryan Jun 20 '17 at 20:18
i was able to get it working thanks, but the result seems to be ambiguous have added it to my original post, 5-20 seems to be missing – novicebioinforesearcher Jun 20 '17 at 20:24
Whether you want F=1 or f=1 will depend on your exact needs and the order you do things. – Devon Ryan Jun 20 '17 at 20:26

bli · Answer 2 · 2017-06-21T15:34:15.073

Regarding your python code

If you want the experimental ranges that are entirely contained in one of the reference ranges, you need to have the coordinates in the following order:

cst <= tst < ten <= cen

If what you want are the experimental ranges that overlap one of the reference ranges, you need to have either the start or the end of the experimental range fall within the reference range:

(cst <= tst < cen) or (cst < ten <= cen)

Your code is equivalent to neither possibilities:

cst >= tst and cen <= ten

This is equivalent to:

tst <= cst and cen <= ten

(Or tst <= cst < cen <= ten, since we know that cst < cen, by definition of a bed interval).

With this rewriting, we can more easily see that you are actually selecting the experimental ranges that contain a reference range.

Here is some (python3) code that gives you the results in the other two situations:

#!/usr/bin/env python3

ref_intervals = []
with open("s4.txt", "r") as f:
    for line in f:
        (chr, start, end) = line.strip().split("\t")
        ref_intervals.append((chr, int(start), int(end)))

exp_intervals = []
with open("s3.txt", "r") as f:
    for line in f:
        (chr, start, end) = line.strip().split("\t")
        exp_intervals.append((chr, int(start), int(end)))

contained = []
overlapping = []
for (r_chr, r_start, r_end) in ref_intervals:
    for (e_chr, e_start, e_end) in exp_intervals:
        if e_chr == r_chr:
            if r_start <= e_start < r_end or r_start < e_end <= r_end:
                overlapping.append((e_chr, e_start, e_end))
            if r_start <= e_start < e_end <= r_end:
                contained.append((e_chr, e_start, e_end))

print("overlapping")
for (chr, start, end) in overlapping:
    print(chr, start, end, sep="\t")

print("contained")
for (chr, start, end) in contained:
    print(chr, start, end, sep="\t")

If I run it, I obtain the following results:

$ ./overlap.py 
overlapping
1   10  20
1   5   20
1   10  50
1   14  17
2   20  30
2   25  30
2   20  60
contained
1   10  20
1   14  17
2   20  30
2   25  30

It is probably an good exercise to program this, but as the other answers point out, there are efficient tools that would be a better solution in a professional setting.

this makes sense, I guess I need to start thinking and talking like a programmer which would help me write the questions like one. Sure there are other tools but doing it myself helps me to learn and understand whats exactly going behind the tools. — novicebioinforesearcher, Jun 21 '17 at 15:57

The Unfun Cat · Accepted Answer · 2019-10-30T08:15:23.480

pyranges answer:

# pip install pyranges
# or 
# conda install -c bioconda pyranges

import pyranges as pr
s3 = pr.read_bed("s3.bed")
s4 = pr.read_bed("s4.bed")
s3.intersect(s4, how="containment")

Answer with setup (Edit):

import pandas as pd
from io import StringIO
import pyranges as pr

c1 = """1   10  20
1   5   20
2   20  30
2   25  30
1   10  50
2   20  60
1   14  17"""

c2 = """1   10  20
2   20  30"""

columns = "Chromosome Start End".split()

df1 = pd.read_table(StringIO(c1), sep="\s+", header=None, names=columns)
df2 = pd.read_table(StringIO(c2), sep="\s+", header=None, names=columns)

gr1 = pr.PyRanges(df1)
gr2 = pr.PyRanges(df2)

print(gr1.intersect(gr2, how="containment"))

Result:

+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |        10 |        20 |
|            1 |        14 |        17 |
|            2 |        20 |        30 |
|            2 |        25 |        30 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 4 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

score 1 · Answer 4 · answered Jun 20 '17 at 19:35

1

You could use BEDOPS, instead:

$ sort-bed s3.txt > s3.bed
$ sort-bed s4.txt > s4.bed
$ bedops --element-of 1 s3.bed s4.bed > answer.bed

If you need to run it from within Python, you could use subprocess.check_output():

import subprocess
...
try:
    result = subprocess.check_output("bedops --element-of 1 %s %s > %s" % (set_a_fn, set_b_fn, answer_fn), shell=True)
except subprocess.CalledProcessError as err:
    raise SystemExit("Could not run bedops\n")
# do stuff with 'result'

answered Jun 20 '17 at 19:35

Alex Reynolds

3,135
11
27

thank you I would certainly try it , but can you please help me understand what is wrong with my logic/code – novicebioinforesearcher Jun 20 '17 at 19:37
You're better off not reinventing the wheel, is my advice. Python is not the right tool for set operations. – Alex Reynolds Jun 20 '17 at 19:38
1

If you really want to do this, use debug statements (sys.stderr.write(...)) so you can see what's happening at each stage. – Alex Reynolds Jun 20 '17 at 19:39
If this is a homework assignment and you are required to write your own solution from scratch, please add a "homework" tag to your question to make this explicit. We're happy to help with homework questions, but the responses will likely be different (and hopefully accommodating). – Daniel Standage Jun 20 '17 at 19:46
1

If it's not a homework assignment, Alex is right: better to use well-tested tools for interval arithmetic. – Daniel Standage Jun 20 '17 at 19:47
@DanielStandage yes it is homework will make changes to tag, tried doing it it would not allow me to create a tag – novicebioinforesearcher Jun 20 '17 at 20:25
@novicebioinforesearcher I've added the homework tag to your post. – Devon Ryan Jun 20 '17 at 20:30
@AlexReynolds does the method suggested account for strand as well? – novicebioinforesearcher Jun 20 '17 at 20:40
I'm not sure. I don't see anything in your code that separates intervals by strand. You could use awk ($6=="+") etc. to do this. – Alex Reynolds Jun 20 '17 at 20:55
@AlexReynolds sorry for late reply i was curious you are right my code does not handle strand, bedops is new for me hence wanted to ask you upfront will install and update you with results – novicebioinforesearcher Jun 20 '17 at 21:48

Range overlap python error with genomic regions

4 Answers4

Regarding your python code