Most Popular
1500 questions
7
votes
2 answers
Extracting the CIGAR string from a BAM via Python?
Is there a standard method in Python to extract a CIGAR string from the BAM?
There are great libraries which parse the CIGAR, e.g. https://pypi.python.org/pypi/cigar/0.1
>>> c = Cigar('10M20S10M')
>>> c.mask_left(10).cigar
'10S20S10M'
>>>…
ShanZhengYang
- 1,691
- 1
- 14
- 20
7
votes
2 answers
Stranded vs. unstranded library preparation protocols in RNAseq
I've been reading this paper lately:
Sailfish Enables Alignment-Free Isoform Quantification from RNA-Seq Reads Using Lightweight Algorithms
I don't really understand the second paragraph under Quantification in the Online Methods section. I quote…
Paghillect
- 309
- 1
- 7
7
votes
2 answers
Missing genes and normalisation of RSEM output using EBSeq
Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they…
J0HN_TIT0R
- 541
- 1
- 4
- 7
7
votes
6 answers
bed file with N regions of GRCh38 reference?
How do I obtain a bed file with the list of N-regions of GRCh38 reference? This is, the regions where the sequence is a stretch of Ns.
719016
- 2,324
- 13
- 19
7
votes
2 answers
How to write a hash function for canonical kmers?
This question is a follow up from How do kmer counters determine which kmer is 'canonical'?.
In that question we learned that kmer counting programs use a 2-bit hash function to internally represent canonical kmers as they are being counted.
Now I…
conchoecia
- 3,141
- 2
- 16
- 40
7
votes
3 answers
How to subset a VCF by chromosome and keep the header?
I would like to subset a VCF which only has chromosome 2.
The problem with using various grep commands, e.g.
grep -w '^#\|^2' my.vcf > my_new.vcf
or if there's a 'chr' prefix
grep -w '^#\|^chr2' my.vcf > my_new.vcf
is that this will remove the…
EB2127
- 1,413
- 2
- 10
- 23
7
votes
1 answer
Imputing missing genotypes from separate genotyping panels
What is the current standard for imputing missing genotypes between two genotyping panels? I have two populations genotyped using two different panels (A & B), and I would like to impute all the genotypes in population B for those positions on used…
Greg
- 831
- 6
- 12
7
votes
1 answer
ethnicity check either from bam or vcf files
What tool could I use to check the ethnicity of a human bam or vcf file? I would like to use the results as a QC check to know whether a given sample or set of samples match the ethnicity information annotated in the meta data or not.
Seen so…
719016
- 2,324
- 13
- 19
7
votes
2 answers
Is this a low complexity region in our human genome?
I have a screenshot where many of my reads were aligned to a region that I suspect a low complexity region. Although you can't see, all those reads are clipped in the cigar strings. Sample cigar strings: 64S32M29S, 74S32M18S etc... Consequently, the…
SmallChess
- 2,699
- 3
- 19
- 35
7
votes
2 answers
Lisp in bioinformatics
After many years of programming in Python, C/C++ and other mainstream languages, I am looking for a new language of pleasure, mainly as a part of mental hygiene. Since I really admire the ideas behind Lisp, including the original paper from the…
Kkk
- 71
- 1
7
votes
1 answer
How to remove all BAM read groups from all reads (not just the header)?
I have problem with one my BAMs---it appears to have invalid read groups.
Normally when I have such a problem, I remove all the read groups from the BAM header as follows:
samtools view -H your.bam | grep -v "^@RG" | samtools reheader -…
EB2127
- 1,413
- 2
- 10
- 23
7
votes
2 answers
How to get all PDB homologs from Uniprot (mapping + BLAST)?
I'd like to create a dataset consisting of all sequences which are either present in the PDB, or whose homolog is present in the PDB. In other words, any sequence in the PDB or any sequence related to it. The similiarity margin is to be very wide,…
Zubo
- 179
- 3
7
votes
1 answer
low-memory high-speed replacement for Picard MarkDuplicates
I am running Picard MarkDuplicates with the following parameters below. On the file described, it takes about 41.6Gb of RAM memory and about 20-25 minutes to compute (only uses 1 core AFAICS).
java -Xmx54890m -jar picard.jar MarkDuplicates \
…
719016
- 2,324
- 13
- 19
7
votes
1 answer
How to trim adapter sequences from GSE65360 in order to map the reads?
I am trying to map single-cell chromatin accessibility data from doi:10.1038/nature14590, obtained using scATAC-seq, to the reference genome. Example paired-end reads are here.
The reads contain various adapter sequences, which the authors "trimmed…
firefly
- 73
- 5
7
votes
2 answers
Which tools can detect chimeric RNA (fusion genes) from WGS or RNA-Seq data?
Given WGS data or RNA-seq data, which tools can I use to detect gene fusions?
L42
- 281
- 2
- 6