5

I have two files

s3.txt :
1   10  20
1   5   20
2   20  30
2   25  30
1   10  50
2   20  60
1   14  17

s4.txt:
1   10  20
2   20  30

I am trying to match col0 of both the files and get rows that fall between range(inclusive of themselves) 10-20 and 20-30 as seen in s4 file. file s4 has co ordinates which can be used as reference range (chrom start and end) and s3 has list of co ordinates from an experimental condition what I am trying to achieve is to which co-ordinates from my file s3 fall on or between my reference co ordinates in s4.

code so far:

containing_ranges = []
with open('s4.txt', 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        containing_ranges.append(fields)

tested_ranges = []       
with open('s3.txt', 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        tested_ranges.append(fields)

for c_range in containing_ranges:
    for t_range in tested_ranges:
        tst = int(t_range[1])
        ten = int(t_range[2])
        cst = int(c_range[1])
        cen = int(c_range[2])
        if  c_range[0] == t_range[0]:
            included = cst >= tst and cen <= ten
            if included == True:
               print t_range

Output with missing row(1 14 17) :

['1', '10', '20']
['1', '5', '20']
['1', '10', '50']
['2', '20', '30']
['2', '20', '60']

Desired output:

1   10  20
2   20  30
2   25  30
1   14  17

Not sure if my logic is wrong and why does it miss 14-17 as it falls between 10-20

[EDIT] using pybedtools

>>> print(s4.intersect(s3, wb=True))
1   10  20  1   10  20
1   10  20  1   5   20
1   10  20  1   10  50
1   14  17  1   14  17
2   20  30  2   20  30
2   25  30  2   25  30
2   20  30  2   20  60

>>> print(s4.intersect(s3, wa=True, wb=True, F=1))
1   10  20  1   10  20
1   10  20  1   14  17
2   20  30  2   20  30
2   20  30  2   25  30


using bedops 
bin$ less answer.bed 
1       5       20
1       10      20
1       10      50
1       14      17
2       20      30
2       20      60
2       25      30

using @bli code(on python2.7)
('1', 10, 20)
('1', 14, 17)
('2', 20, 30)
('2', 25, 30)
 why can I not see the interval 1 5 20
  • What result do you get with bedops --element-of? – Alex Reynolds Jun 20 '17 at 20:26
  • Please bring your logic/coding questions to Stack Overflow. The relationship of your question to the subject of Bioinformatics is merely coincidental. – Robert Cartaino Jun 20 '17 at 23:04
  • 2
    @RobertC If OP adds a "bed" tag, this question will immediately look like a bioinformatics question. Also, see the answers. OP is much more likely to get such spot-on answers here. This question could be improved for sure, but it is not off-topic. – user172818 Jun 20 '17 at 23:50
  • Just use bedops, as indicated. Using wrappers to command-line tools is rarely a substitute for learning the tools. – Alex Reynolds Jun 21 '17 at 04:00
  • 1
    Can you please add more story/context around this question? It has the appearance of a pure programming question (presumably why it was marked as off-topic). It would be nice if you could explain what the different numbers mean, and why you want to do this. – gringer Jun 21 '17 at 04:49
  • 1
    You should use more meaningful names for your variables. It would make the code easier to read, for others but also for you. – bli Jun 21 '17 at 08:39
  • It seems you want the ranges from s3.txt ("tested ranges") that are included in a range in s4.txt ("containing ranges"). In this case, I think your error is in the start and end coordinates comparisons. Let's note t_start and t_end the coordinates of the tested range and c_start and c_end the coordinates of the containing range. What you want is c_start <= t_start and t_end <= c_end. – bli Jun 21 '17 at 09:18
  • I edited your question to use more meaningful variable names, and I also removed 1 5 20 from the expected output: If I understood correctly, this is not what you want because this is not included in any of the ranges defined in s4.txt – bli Jun 21 '17 at 09:31
  • I can't post an answer since your question is "on hold", but here is a (hopefully) corrected version of your code, with minor some coding style improvements and using python3: http://paste.ubuntu.com/24915950/ I hope this helps. – bli Jun 21 '17 at 09:41
  • ohh I was away for a bit, so many suggestions thank you all. will edit my post once I go through each of your suggestions – novicebioinforesearcher Jun 21 '17 at 14:09
  • @AlexReynolds added answer – novicebioinforesearcher Jun 21 '17 at 14:47
  • @bli thank you for cleaning up code added answer – novicebioinforesearcher Jun 21 '17 at 14:48
  • You say "which co-ordinates from my file s3 fall on or between my reference co ordinates in s4". If I interpret this correctly, this means that you accept partial overlaps too, not only complete inclusions. Then the desired output should be all the ranges in s3, and not the restricted list I mistakenly corrected. – bli Jun 21 '17 at 14:56
  • @novicebioinforesearcher It looks like bedops was able to find your missing interval. If you need to deal with strand labels in the sixth column (per BED specification), you can split a BED file by strand via awk '$6=="+"' in.bed > in.forward.bed and awk '$6=="-"' in.bed > in.reverse.bed, and then run set operations on each of the strand-split files. If you need to reconstruct one file at the end, use bedops -u to do a multiset union of all input BED files. – Alex Reynolds Jun 21 '17 at 20:45

4 Answers4

5

You're reinventing bedtools intersect (or bedops), for which there's already a convenient python module:

from pybedtools import BedTool

s3 = BedTool('s3.bed')
s4 = BedTool('s4.bed')

print(s4.intersect(s3, wa=True, wb=True, F=1))

The wb=True is equivalent to -wb with bedtools intersect on the command line. Similarly, F=1 is the same as -F 1.

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
  • NotImplementedError: "intersectBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method. used pip install pybedtools – novicebioinforesearcher Jun 20 '17 at 19:55
  • 1
    You should be able to read and understand how to deal with that error message. – Devon Ryan Jun 20 '17 at 19:57
  • Hi i installed bedtools made is environment variable i can use it in bash from any directory i still get the same error, is there a different way than this? thank you – novicebioinforesearcher Jun 20 '17 at 20:15
  • You can always install it via conda, though it works for me with the outdated version of bedtools that I have installed (2.25.0). – Devon Ryan Jun 20 '17 at 20:18
  • i was able to get it working thanks, but the result seems to be ambiguous have added it to my original post, 5-20 seems to be missing – novicebioinforesearcher Jun 20 '17 at 20:24
  • Whether you want F=1 or f=1 will depend on your exact needs and the order you do things. – Devon Ryan Jun 20 '17 at 20:26
3

Regarding your python code

If you want the experimental ranges that are entirely contained in one of the reference ranges, you need to have the coordinates in the following order:

cst <= tst < ten <= cen

If what you want are the experimental ranges that overlap one of the reference ranges, you need to have either the start or the end of the experimental range fall within the reference range:

(cst <= tst < cen) or (cst < ten <= cen)

Your code is equivalent to neither possibilities:

cst >= tst and cen <= ten

This is equivalent to:

tst <= cst and cen <= ten

(Or tst <= cst < cen <= ten, since we know that cst < cen, by definition of a bed interval).

With this rewriting, we can more easily see that you are actually selecting the experimental ranges that contain a reference range.

Here is some (python3) code that gives you the results in the other two situations:

#!/usr/bin/env python3

ref_intervals = []
with open("s4.txt", "r") as f:
    for line in f:
        (chr, start, end) = line.strip().split("\t")
        ref_intervals.append((chr, int(start), int(end)))

exp_intervals = []
with open("s3.txt", "r") as f:
    for line in f:
        (chr, start, end) = line.strip().split("\t")
        exp_intervals.append((chr, int(start), int(end)))

contained = []
overlapping = []
for (r_chr, r_start, r_end) in ref_intervals:
    for (e_chr, e_start, e_end) in exp_intervals:
        if e_chr == r_chr:
            if r_start <= e_start < r_end or r_start < e_end <= r_end:
                overlapping.append((e_chr, e_start, e_end))
            if r_start <= e_start < e_end <= r_end:
                contained.append((e_chr, e_start, e_end))

print("overlapping")
for (chr, start, end) in overlapping:
    print(chr, start, end, sep="\t")

print("contained")
for (chr, start, end) in contained:
    print(chr, start, end, sep="\t")

If I run it, I obtain the following results:

$ ./overlap.py 
overlapping
1   10  20
1   5   20
1   10  50
1   14  17
2   20  30
2   25  30
2   20  60
contained
1   10  20
1   14  17
2   20  30
2   25  30

It is probably an good exercise to program this, but as the other answers point out, there are efficient tools that would be a better solution in a professional setting.

bli
  • 3,130
  • 2
  • 15
  • 36
  • 1
    this makes sense, I guess I need to start thinking and talking like a programmer which would help me write the questions like one. Sure there are other tools but doing it myself helps me to learn and understand whats exactly going behind the tools. – novicebioinforesearcher Jun 21 '17 at 15:57
1

pyranges answer:

# pip install pyranges
# or 
# conda install -c bioconda pyranges

import pyranges as pr
s3 = pr.read_bed("s3.bed")
s4 = pr.read_bed("s4.bed")
s3.intersect(s4, how="containment")

Answer with setup (Edit):

import pandas as pd
from io import StringIO
import pyranges as pr

c1 = """1   10  20
1   5   20
2   20  30
2   25  30
1   10  50
2   20  60
1   14  17"""

c2 = """1   10  20
2   20  30"""

columns = "Chromosome Start End".split()

df1 = pd.read_table(StringIO(c1), sep="\s+", header=None, names=columns)
df2 = pd.read_table(StringIO(c2), sep="\s+", header=None, names=columns)

gr1 = pr.PyRanges(df1)
gr2 = pr.PyRanges(df2)

print(gr1.intersect(gr2, how="containment"))

Result:

+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |        10 |        20 |
|            1 |        14 |        17 |
|            2 |        20 |        30 |
|            2 |        25 |        30 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 4 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
The Unfun Cat
  • 461
  • 3
  • 10
1

You could use BEDOPS, instead:

$ sort-bed s3.txt > s3.bed
$ sort-bed s4.txt > s4.bed
$ bedops --element-of 1 s3.bed s4.bed > answer.bed

If you need to run it from within Python, you could use subprocess.check_output():

import subprocess
...
try:
    result = subprocess.check_output("bedops --element-of 1 %s %s > %s" % (set_a_fn, set_b_fn, answer_fn), shell=True)
except subprocess.CalledProcessError as err:
    raise SystemExit("Could not run bedops\n")
# do stuff with 'result'
Alex Reynolds
  • 3,135
  • 11
  • 27