I have two files
s3.txt :
1 10 20
1 5 20
2 20 30
2 25 30
1 10 50
2 20 60
1 14 17
s4.txt:
1 10 20
2 20 30
I am trying to match col0 of both the files and get rows that fall between range(inclusive of themselves) 10-20 and 20-30 as seen in s4 file. file s4 has co ordinates which can be used as reference range (chrom start and end) and s3 has list of co ordinates from an experimental condition what I am trying to achieve is to which co-ordinates from my file s3 fall on or between my reference co ordinates in s4.
code so far:
containing_ranges = []
with open('s4.txt', 'r') as f:
for line in f:
fields = line.strip().split('\t')
containing_ranges.append(fields)
tested_ranges = []
with open('s3.txt', 'r') as f:
for line in f:
fields = line.strip().split('\t')
tested_ranges.append(fields)
for c_range in containing_ranges:
for t_range in tested_ranges:
tst = int(t_range[1])
ten = int(t_range[2])
cst = int(c_range[1])
cen = int(c_range[2])
if c_range[0] == t_range[0]:
included = cst >= tst and cen <= ten
if included == True:
print t_range
Output with missing row(1 14 17) :
['1', '10', '20']
['1', '5', '20']
['1', '10', '50']
['2', '20', '30']
['2', '20', '60']
Desired output:
1 10 20
2 20 30
2 25 30
1 14 17
Not sure if my logic is wrong and why does it miss 14-17 as it falls between 10-20
[EDIT] using pybedtools
>>> print(s4.intersect(s3, wb=True))
1 10 20 1 10 20
1 10 20 1 5 20
1 10 20 1 10 50
1 14 17 1 14 17
2 20 30 2 20 30
2 25 30 2 25 30
2 20 30 2 20 60
>>> print(s4.intersect(s3, wa=True, wb=True, F=1))
1 10 20 1 10 20
1 10 20 1 14 17
2 20 30 2 20 30
2 20 30 2 25 30
using bedops
bin$ less answer.bed
1 5 20
1 10 20
1 10 50
1 14 17
2 20 30
2 20 60
2 25 30
using @bli code(on python2.7)
('1', 10, 20)
('1', 14, 17)
('2', 20, 30)
('2', 25, 30)
why can I not see the interval 1 5 20
bedops --element-of? – Alex Reynolds Jun 20 '17 at 20:26s3.txt("tested ranges") that are included in a range ins4.txt("containing ranges"). In this case, I think your error is in the start and end coordinates comparisons. Let's notet_startandt_endthe coordinates of the tested range andc_startandc_endthe coordinates of the containing range. What you want isc_start <= t_start and t_end <= c_end. – bli Jun 21 '17 at 09:181 5 20from the expected output: If I understood correctly, this is not what you want because this is not included in any of the ranges defined ins4.txt– bli Jun 21 '17 at 09:31bedopswas able to find your missing interval. If you need to deal with strand labels in the sixth column (per BED specification), you can split a BED file by strand viaawk '$6=="+"' in.bed > in.forward.bedandawk '$6=="-"' in.bed > in.reverse.bed, and then run set operations on each of the strand-split files. If you need to reconstruct one file at the end, usebedops -uto do a multiset union of all input BED files. – Alex Reynolds Jun 21 '17 at 20:45