Problem is to generate a random BED interval given the following constraints:
- minimum
start - maximum
end - fixed length
- maximum number of masked bases (similar to
-maxNoption infaSplit) - set of intervals to avoid overlap with
- stay within chromosome size constraints
I can think of computationally intensive ways to code it up from scratch, but am wondering if there is a more efficient approach before re-inventing the wheel.
seq_records = {x.name: x for x in SeqIO.parse('path/to/genome.fa', 'fasta')}
def generate_random_interval(chrom, lower, upper, length, maxrep=1.0, avoid_intervals=None):
while True:
start = np.random.randint(lower, upper - length)
end = start + length
regenerate = False
for interval in avoid_intervals:
if (interval.start < end) or (interval.end > start):
regenerate = True
if 1. * seq_records[chrom].seq[lower, upper].count('N') / length > maxrep:
regenerate = True
if not regenerate:
break
return chrom, start, end