Most Popular

1500 questions
13
votes
2 answers

How to read structural variant VCF?

The IGSR has a sample for encoding structural variants in the VCF 4.0 format. An example from the site (the first record): #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 1 2827693 .…
SmallChess
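
As a rough illustration of reading such a record, the sketch below splits one tab-separated VCF data line into the columns named in the header quoted above and unpacks the INFO field. The function name `parse_vcf_line` and the example record are made up for illustration (they are not taken from the IGSR sample); `SVTYPE`, `END` and `IMPRECISE` are INFO keys reserved by the VCF specification for structural variants.

```python
# Column names taken from the header line quoted in the question.
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001".split()


def parse_vcf_line(line, header=HEADER):
    """Split one tab-separated VCF data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip((name.lstrip("#") for name in header), fields))
    info = {}
    for entry in record["INFO"].split(";"):
        if entry in ("", "."):
            continue  # "." means the INFO field is empty
        key, sep, value = entry.partition("=")
        # Keys without "=value" are flags (e.g. IMPRECISE).
        info[key] = value if sep else True
    record["INFO"] = info
    return record


# Hypothetical record for illustration only, not the truncated one above.
example = "\t".join(
    ["1", "100", ".", "N", "<DEL>", ".", "PASS",
     "SVTYPE=DEL;END=200;IMPRECISE", "GT", "0/1"]
)
rec = parse_vcf_line(example)
# For a symbolic <DEL> allele, END - POS gives the span of the event.
span = int(rec["INFO"]["END"]) - int(rec["POS"])
print(rec["INFO"]["SVTYPE"], span)
```

For real files a dedicated parser such as pysam or cyvcf2 is likely a better choice; the point here is only to show where the structural-variant details live in the record.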
13
votes
2 answers

How to convert fastq to fast5

fast5 is a variant of HDF5, the native format in which raw data from Oxford Nanopore MinION are provided. You can easily extract the reads in fast5 format into a standard fastq format, using for example poretools. Say I have aligned these reads in…
aechchiki
13
votes
3 answers

How can I do an overlapping sequence count in Biopython?

Biopython's .count() methods, like Python's str.count(), perform a non-overlapping count. How can I do an overlapping one? For example, these code snippets return 2, but I want the answer 3: >>> from Bio.Seq import Seq >>>…
Chris_Rands
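
One way to get an overlapping count is to search repeatedly and advance the start position by one instead of by the length of the match. The sketch below works on plain Python strings; the helper name `count_overlapping` is an assumption for illustration, and a Biopython `Seq` could be passed in as `str(seq)`. It is also worth checking whether your Biopython version provides `Seq.count_overlap()`, which does this directly.

```python
def count_overlapping(sequence: str, sub: str) -> int:
    """Count occurrences of `sub` in `sequence`, allowing overlaps."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Advance by one position (not len(sub)) so overlapping matches are found.
        start = sequence.find(sub, start + 1)
    return count


print("AAAA".count("AA"))               # non-overlapping: 2
print(count_overlapping("AAAA", "AA"))  # overlapping: 3
```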
13
votes
3 answers

Given a VCF of a human genome, how do I assess the quality against known SNVs?

I'm looking for tools to check the quality of a VCF I have of a human genome. I would like to check the VCF against publicly known variants across other human genomes, e.g. how many SNPs are already in public databases, whether insertions/deletions…
ShanZhengYang
13
votes
2 answers

Are fgsea and Broad Institute GSEA equivalent?

Several gene set enrichment methods are available; the most famous/popular is the Broad Institute tool. Many other tools are available (see, for example, the biocView of GSE, which lists 82 different packages). There are several parameters in…
llrs
13
votes
3 answers

What is the best way to account for GC-content shift while constructing a nucleotide-based phylogenetic tree?

Let's say I want to construct a phylogenetic tree based on orthologous nucleotide sequences; I do not want to use protein sequences to have a better resolution. These species have different GC-content. If we use a straightforward approach like…
Iakov Davydov
13
votes
2 answers

Can open-source software be peer-reviewed and published?

My colleague and I have developed a software tool and intend to release it as open source. This tool is specifically for tasks in bioinformatics, but we think it would be helpful for the wider community. Our institution will permit us to release it provided…
Tom Kelly
12
votes
5 answers

Improve a reference genome with sequencing data

I have a DNA sample which I know doesn't quite match my reference genome: my culture comes from a subpopulation which has undergone significant mutation since the reference was created. The example I have in mind is E. coli. We've tried assembly…
Scott Gigante
12
votes
1 answer

How to calculate overlapping genes between two genome annotation versions

I have two annotations of the same genome generated with different annotation pipelines. I want to identify overlapping gene models. An important feature of this genome is that there are many 'genes within genes', i.e. a gene model in the intron of…
holmrenser
12
votes
2 answers

BLAST(n): No hits found

I am currently exploring the BLAST program. Just for testing purposes, I generated two FASTA files containing two genes, A and B, such that B is just a motif of repeated 'G's that occurs in A. file…
Paul
12
votes
2 answers

Converting VCF file to PLINK bed/bim/fam files

I am trying to find the best way to convert VCF files to PLINK binary bed/bim/fam files, but it seems like there are many varied ways to do this (for example, using the PLINK 1.9 --vcf flag, bcftools, GATK, and vcftools). Obviously they all probably…
Sarah
12
votes
2 answers

How to select high quality structures from the Protein Data Bank?

Models of structures deposited in the Protein Data Bank vary in quality, depending on both the data quality and the expertise and patience of the person who built the model. Is there a well-accepted subset of the PDB entries that has only "high…
marcin
12
votes
2 answers

Determining doublets in single-cell RNA-seq

Doublets are a known problem with scRNA-seq experiments, where two or more cells are sometimes captured instead of one. To determine their presence, there are studies that mix multiple species (such as human and mouse) or distinct cell types and then count…
burger
12
votes
4 answers

How can I systematically detect unknown barcode/adapter sequences within a set of samples?

I have often downloaded datasets from the SRA where the authors failed to mention which adapters were trimmed during the processing. Local alignments tend to overcome this obstacle, but it feels a bit barbaric. FastQC works occasionally to pick them…
story
12
votes
5 answers

How do I efficiently perform a metagenome screen of “all” species?

I’ve got an RNA-seq dataset with a large proportion of environmental RNA “contamination”. BLASTing random reads reveals that much of the data comes from bacterial, plant and viral RNA. My target organism only accounts for ~5% of the RNA-seq read…
Konrad Rudolph