Most Popular

1500 questions
12
votes
5 answers

How to download FASTA sequences from NCBI using the terminal?

I have the following accession numbers of the 10 chromosomes of the Theobroma cacao genome: NC_030850.1 NC_030851.1 NC_030852.1 NC_030853.1 NC_030854.1 NC_030855.1 NC_030856.1 NC_030857.1 NC_030858.1 NC_030859.1. I need to download these FASTA files using…
MudithMMBc
  • 361
  • 1
  • 2
  • 9
11
votes
1 answer

Which quality score encoding does PacBio use?

Do you know which quality score encoding PacBio uses now? I know some of their file formats have changed in the past year or two, but I haven't found much on their quality score encoding. The most recent answer I found is from 2012, where one user…
Mark Ebbert
  • 1,354
  • 10
  • 22
11
votes
1 answer

What is the best method to estimate a phylogenetic tree from a large dataset of >1000 loci and >100 species?

I have a large phylogenomic alignment of >1000 loci (each locus is ~1000bp), and >100 species. I have relatively little missing data (<10%). I want to estimate a maximum-likelihood phylogenetic tree from this data, with measures of statistical…
roblanf
  • 962
  • 7
  • 15
11
votes
2 answers

Do variant calls change when you call from CRAM?

We're considering switching our storage format from BAM to CRAM. We work with human cancer samples, which may have very low prevalence variants (i.e. not diploid frequency). If we use lossy CRAM to save more space, how much will variants called from…
morgantaschuk
  • 530
  • 4
  • 9
11
votes
5 answers

How to convert species names into common names?

I’m trying to find common names from a list of scientific names (not all will have them though). I was attempting to use taxize in R but it aborts if it doesn’t find an entry in EOL and I don’t know a way around this other than manually editing the…
Daniel Mead
  • 113
  • 1
  • 4
11
votes
1 answer

How to read and interpret a gene expression quantification file?

I have a gene expression quantification file from TCGA that contains the following lines: ENSG00000242268.2 591.041000514 ENSG00000270112.3 0.0 ENSG00000167578.15 62780.6543066 ENSG00000273842.1 0.0 ENSG00000078237.5 …
0x90
  • 1,437
  • 9
  • 18
11
votes
3 answers

How to extract RNA sequence and secondary structure restraints from a PDB file?

I'm trying to find a programmatic way to automatically extract the following information from a PDB file: (1) the RNA sequence; (2) secondary structure restraints in bracket format, e.g. . (( . ( . ) . )). Does software exist that can take a PDB file as input…
Peter
  • 353
  • 1
  • 8
11
votes
3 answers

What is the index fastq file (sample_I*.fastq.gz) generated when demultiplexing Illumina paired-end runs?

What is the index fastq file that comes with some Illumina sequencing datasets? (The samplename_I*.fastq.gz file.) For example, I recently received some 10X Chromium reads for two libraries sequenced on the same lane. This was a 2x150 sequencing…
conchoecia
  • 3,141
  • 2
  • 16
  • 40
11
votes
1 answer

Can I create a CRAM file with a relative reference path?

I’m trying to create a CRAM file that stores its path to the FASTA reference as a relative path, rather than an absolute path, so that I can move the files around. Unfortunately I can’t get this to work; I was expecting the following to work: ⟩⟩⟩…
Konrad Rudolph
  • 4,845
  • 14
  • 45
11
votes
4 answers

What are the pros and cons of the different basecallers in Oxford Nanopore Technology Sequencing?

What are the pros and cons of the different basecallers in Oxford Nanopore Technology Sequencing? I am about to start a MinION run on my laptop. What should I consider when choosing my basecaller? Can I let MinION do its sequencing and generate…
Biomagician
  • 2,459
  • 16
  • 30
11
votes
1 answer

Changing the record id in a FASTA file using BioPython

I have the following FASTA file, original.fasta: >foo GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA I need to change the record id from foo to bar, so I wrote the following code: from Bio import SeqIO original_file =…
BioGeek
  • 496
  • 5
  • 15
11
votes
1 answer

State of the art mutation simulation software

There are many features affecting mutation probabilities, e.g. CpG mutations are 10-fold more likely than other types of mutations. Is there a model (preferably with software) which can take two aligned genomic regions, estimate parameters of the…
Iakov Davydov
  • 2,695
  • 1
  • 13
  • 34
11
votes
1 answer

Why does a very strong BLAST hit get lost when I change the num_alignments, num_descriptions or max_target_seqs parameters?

Disclaimer: This is a self-answered question for documentation purposes, adapted from the following GitHub gist, especially from users terrycojones and peterjc, as well as sujaikumar, who raised the issue. I have a strange situation. I have…
voiDnyx
  • 401
  • 2
  • 12
11
votes
3 answers

Converting a VCF into a FASTA given a reference with Python or R

I am interested in converting a VCF file into a FASTA file given a reference sequence with Python or R. Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, which does this via the function vcf2fq (lines 469-528). This script has been…
ShanZhengYang
  • 1,691
  • 1
  • 14
  • 20
11
votes
1 answer

Quantifying reads mapping to multiple loci

I have been using STAR for our RNA-Seq samples. The final.out log file reports the percentage of uniquely mapped reads, along with the percentage of reads that map to multiple loci (less than or equal to 10) and the percentage of reads mapping to too many loci…
rightskewed
  • 991
  • 8
  • 17