Is this a low complexity region in our human genome?

Question

I have a screenshot where many of my reads are aligned to a region that I suspect is a low complexity region. Although you can't see it in the screenshot, all of those reads are clipped according to their CIGAR strings. Sample CIGAR strings: 64S32M29S, 74S32M18S, etc. Consequently, the sequence actually mapped to the genome is shorter than the read length.
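To put numbers on the clipping, the two sample CIGAR strings above can be tallied with a short script. This is only a minimal sketch: the function name `cigar_summary` is made up for illustration, and it assumes standard SAM CIGAR operation codes, where `S` is a soft clip and `M`, `=`, and `X` are aligned bases.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Return (read_length, aligned_bases, soft_clipped_bases) for a CIGAR string."""
    read_len = aligned = clipped = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # bases aligned to the reference
            aligned += n
            read_len += n
        elif op == "I":      # insertion: consumes read bases, not reference
            read_len += n
        elif op == "S":      # soft clip: present in the read, but not aligned
            clipped += n
            read_len += n
        # D, N, H and P do not consume bases stored in the read sequence
    return read_len, aligned, clipped

for cigar in ("64S32M29S", "74S32M18S"):
    read_len, aligned, clipped = cigar_summary(cigar)
    print(f"{cigar}: {aligned} of {read_len} bases aligned, {clipped} soft-clipped")
```

For both sample strings only 32 bases are aligned, out of reads of 125 and 124 bases respectively.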

I have a feeling that my alignments are bogus because of the low complexity and the clipping, but I'm not sure whether the complexity is indeed low. All I see is a few "T"s among a bunch of "A"s. How is a low complexity region defined?

Q: Is this a low complexity region in the human hg38 genome? Would it be fair to report that "the alignments are likely to be errors due to the low complexity and alignment clipping"?

score 7 · Accepted Answer · answered Jan 09 '18 at 07:51


Yes, that's a short low complexity region wedged between a SINE and an snRNA. More importantly, your alignments have a MAPQ of 0 (that's why they're filled with white in IGV), which happens when reads map equally well to multiple locations. Even without looking at the sequence, that alone is enough to conclude that these mappings are not trustworthy.
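If you want to check this outside of IGV, the MAPQ is the fifth column of each SAM record. Below is a minimal sketch in plain Python that lists aligned reads with a MAPQ under a threshold. The function name `low_mapq_reads` and the example records are hypothetical, invented purely for illustration, and the exact meaning of MAPQ 0 depends on the aligner used.

```python
def low_mapq_reads(sam_lines, min_mapq=1):
    """Return names of aligned reads whose MAPQ is below min_mapq."""
    flagged = []
    for line in sam_lines:
        if line.startswith("@"):       # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & 0x4:                 # unmapped read: MAPQ is not meaningful
            continue
        if mapq < min_mapq:
            flagged.append(qname)
    return flagged

# Hypothetical records, invented for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t1000\t0\t64S32M29S\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t2000\t60\t125M\t*\t0\t0\t*\t*",
    "read3\t16\tchr1\t1005\t0\t74S32M18S\t*\t0\t0\t*\t*",
]
print(low_mapq_reads(example))
```

On a real BAM file, `samtools view -q 1` does the complementary job of keeping only alignments with a MAPQ of at least 1.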


Devon Ryan



Oh, I didn’t know about white meaning MAPQ=0. Very useful! – Konrad Rudolph Jan 09 '18 at 12:09

score 6 · Answer 2 · answered Jan 09 '18 at 07:54

Yes, this is a low complexity region.

Regions are considered low complexity (or having a simple sequence) when they contain an abundance of a single base, or an abundance of short tandem repeats. The simplest is a tandem repeat of a single base (e.g. AAAAAAAAAAAAAAAA), also called a homopolymer. In your case there is a 3-unit tandem repeat of TAAAAA, and surrounding As (with a few intermingled Ts). Depending on the genetic history of this region, it might be that those other As and Ts are derived from the same TAAAAA unit, with some slipping that has occurred during replication to create a cryptic (i.e. non-perfect) repeat structure.
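As a rough illustration of the two ideas above (base composition and tandem repeats), here is a minimal sketch. It is not how any particular masking tool scores complexity; Shannon entropy is just one simple heuristic, and the function names and the example sequence are hypothetical, made up to resemble the A-rich region described rather than copied from hg38.

```python
import math
import re
from collections import Counter

def shannon_entropy(seq):
    """Per-base Shannon entropy in bits: 0 for a homopolymer, 2 for equal A/C/G/T."""
    counts = Counter(seq.upper())
    total = len(seq)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def longest_tandem_repeat(seq, unit):
    """Return the largest number of back-to-back copies of `unit` found in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)

# Hypothetical sequence: three TAAAAA units flanked by A-rich sequence.
example_seq = "AAAA" + "TAAAAA" * 3 + "AATAAA"

print(f"entropy: {shannon_entropy(example_seq):.2f} bits")
print(f"TAAAAA copies in a row: {longest_tandem_repeat(example_seq, 'TAAAAA')}")
```

An entropy well below the 2-bit maximum, together with several consecutive copies of a short unit, is the kind of signal that marks a region as low complexity.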
