Highest Voted Questions - Bioinformatics Stack Exchange

10 votes, 2 answers

What are the de facto required fields in a SAM/BAM read group?

The SAM specification indicates that each read group must have a unique ID field, but does not mark any other field as required. I have also discovered that htsjdk throws exceptions if the sample (SM) field is empty, though there is no indication…

asked Jun 09 '17 at 15:51 by mattm (reputation 754)
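The two constraints mentioned in the excerpt above (the specification requires `ID`, and htsjdk is reported to reject an empty `SM`) can be checked on a header line before a file reaches downstream tools. This is a minimal sketch: the function names `parse_read_group` and `check_read_group` are hypothetical, and the check covers only the two fields named in the question, not any full list of de facto requirements.

```python
def parse_read_group(line):
    """Split one tab-separated @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Each field looks like TAG:VALUE; values may themselves contain ':'.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags


def check_read_group(tags):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not tags.get("ID"):
        problems.append("missing ID (required by the SAM specification)")
    if not tags.get("SM"):
        # Assumption taken from the question: htsjdk rejects an empty SM.
        problems.append("missing or empty SM")
    return problems


header = "@RG\tID:rg1\tSM:\tPL:ILLUMINA"
print(check_read_group(parse_read_group(header)))
```

Here the sample header has an `ID` but an empty `SM`, so only the second check reports a problem.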

10 votes, 4 answers

How can I build a protein network pathway from a gene expression quantification file?

Assume I have found the top 0.01% most frequent genes from a gene expression file. Let's say these are 10 genes, and I want to study the protein-protein interactions, the protein network and pathway. I thought to use string-db or interactome, but I…

asked Jun 08 '17 at 07:21 by 0x90 (reputation 1,437)

10 votes, 4 answers

Annotation format design

Bashing file formats is a favorite pastime in bioinformatics, and annotation file formats such as GFF and BED seem to get special attention. A lot of this frustration stems from the community's shockingly inconsistent adherence to specifications and…

asked Jun 08 '17 at 07:06 by Daniel Standage (reputation 5,080)

10 votes, 2 answers

Building STAR Genome Index for nanopore RNA sequencing

I am aligning a dataset of 1,000,000 reads of human mRNA sequenced on Oxford Nanopore Technologies' MinION, and would like to use the STAR aligner, using the parameters recommended by Pacific Biosciences for long reads. According to this Google…

asked Jun 07 '17 at 04:20 by Scott Gigante (reputation 2,133)

10 votes, 2 answers

When to account for the blacklisted genomic regions in ChIP-seq data analyses?

We have heard in the group that it is important to keep track of and to filter artifact regions when analysing data from functional genomics experiments, especially ChIP-seq. Here, we have seen pipelines that remove the ENCODE tracks i) before…

chip-seq

asked Jun 05 '17 at 17:15 by olga (reputation 481)

10 votes, 4 answers

What methods exist to calculate RNA expression profile similarity

Some of the work in our lab requires a comparison of a strain across several experimental conditions. We are looking to identify the most similar experimental conditions based on the similarity of the cell's gene transcription response. While we could…

rna-seq

asked Jun 04 '17 at 13:50 by chiffa (reputation 201)

10 votes, 1 answer

Behavior of `--reference` flag with samtools sort

I'm about to run samtools sort (version 1.6) and I'm reviewing the available configuration options. $ samtools sort Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT …

samtools

asked May 02 '18 at 22:25 by Daniel Standage (reputation 5,080)

10 votes, 3 answers

How to calculate the memory usage of storing kmers in RAM

I want to write a program in C++ that stores kmers in a hash or in a trie. How can I calculate how much RAM I would need for each type of data structure? For this application my kmers are strand-specific, so I cannot reduce the size complexity by…

asked Feb 23 '18 at 05:09 by conchoecia (reputation 3,141)
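A rough lower bound for the hash-table case in the question above can be worked out from first principles. The sketch below is written in Python for illustration even though the question targets C++, and it rests on stated assumptions: k-mers are packed at 2 bits per base, the table is a flat open-addressing array with a fixed-size value per key, and allocator and per-node overhead are ignored. The function names, `value_bytes` and `load_factor` are all hypothetical parameters, and a real implementation will use more memory than this.

```python
import math


def packed_kmer_bytes(k):
    """Bytes needed to store one k-mer at 2 bits per base (A, C, G, T only)."""
    return math.ceil(2 * k / 8)


def hash_table_bytes(n_kmers, k, value_bytes=4, load_factor=0.5):
    """Lower-bound size of a flat hash table holding n_kmers distinct k-mers.

    Assumes one slot = packed key + fixed-size value, and that the table is
    kept at the given load factor (filled slots / total slots).
    """
    slot_bytes = packed_kmer_bytes(k) + value_bytes
    n_slots = math.ceil(n_kmers / load_factor)
    return n_slots * slot_bytes


# Example: one million distinct 31-mers with a 4-byte counter each.
print(packed_kmer_bytes(31))            # bytes per packed key
print(hash_table_bytes(1_000_000, 31))  # bytes for the whole table
```

Because the k-mers in the question are strand-specific, the number of distinct keys is bounded by the smaller of 4^k and the number of k-mers in the input, with no halving from canonical forms.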

10 votes, 1 answer

How to manipulate protein interaction network from String database in R?

How can I manipulate a protein-interaction network graph from the STRING database using the STRINGdb Bioconductor package and R? I have downloaded the whole graph for Homo sapiens from STRING, which has about 20,000 proteins. How do I read the file using…

asked May 30 '17 at 12:16 by A M (reputation 103)

10 votes, 6 answers

How do I find identical sequences in a FASTA file?

I want to create a database for a proteomics study. Therefore, the mapping from a given sequence to a protein ID has to be unique. I am wondering whether there is already a built-in function in Biopython for that, but I could not find any. The…

asked Nov 10 '17 at 15:28 by Cleb (reputation 743)
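One way to approach the uniqueness check described above is to group record IDs by their sequence and report every sequence that maps to more than one ID. This is a minimal sketch using only the standard library rather than Biopython; the helper names `read_fasta` and `group_identical` are hypothetical, and treating sequences case-insensitively is an assumption that may not suit every dataset.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""  # ID is the first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_identical(handle):
    """Return {sequence: [ids]} for sequences shared by two or more records."""
    groups = defaultdict(list)
    for name, seq in read_fasta(handle):
        groups[seq.upper()].append(name)  # assumption: case-insensitive match
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


example = ">p1\nMKV\nLA\n>p2 some description\nMKVLA\n>p3\nMKT\n"
print(group_identical(io.StringIO(example)))
```

In the example, `p1` (split across two lines) and `p2` carry the same sequence, so they are reported together, while `p3` is unique and is omitted.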

10 votes, 2 answers

How to resolve in snakemake error : "Target rules may not contain wildcards."

I would like to do easily reproducible analysis using publicly available data from NCBI, so I have chosen snakemake. I would like to write a single rule that would be able to download any genome given a species code name and separated table of…

snakemake

asked Nov 03 '17 at 11:47 by Kamil S Jaron (reputation 5,542)

10 votes, 2 answers

Is there a database of disease prevalence?

I want to collect data on how common different diseases are in different populations (or, at least, globally). Ideally, it would give me a way of querying the database with a disease name, and would return the number of cases per N population. My…

asked Oct 19 '17 at 11:01 by terdon (reputation 10,071)

10 votes, 1 answer

How to apply upperquartile normalization on RSEM expected counts?

I see that TCGA RNASeq V2 RSEM data is normalized with upper-quartile normalization. After quantifying my samples with RSEM, I got "genes.results" as output, which has gene id, transcript id(s), length, expected count, and FPKM.…

asked Sep 29 '17 at 15:39 by stack_learner (reputation 1,262)
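The general idea behind upper-quartile normalization of one sample's expected counts can be sketched as follows: divide every count by the sample's 75th percentile and rescale to a fixed target. Several details here are assumptions rather than a confirmed description of the TCGA pipeline: the quartile is taken over non-zero counts only, the percentile uses linear interpolation, and the target value of 1000 is a parameter. The function name `upper_quartile_normalize` is hypothetical.

```python
def upper_quartile_normalize(counts, target=1000.0):
    """Scale one sample's counts so its upper quartile equals `target`.

    counts: {gene_id: expected_count} for a single sample.
    Assumption: zeros are excluded when computing the 75th percentile.
    """
    nonzero = sorted(c for c in counts.values() if c > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    # 75th percentile with linear interpolation between neighbouring ranks.
    pos = 0.75 * (len(nonzero) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(nonzero) - 1)
    q75 = nonzero[lo] + (pos - lo) * (nonzero[hi] - nonzero[lo])
    return {gene: c / q75 * target for gene, c in counts.items()}


sample = {"g1": 0, "g2": 10, "g3": 20, "g4": 30, "g5": 40, "g6": 50}
print(upper_quartile_normalize(sample))
```

In practice the `expected_count` column of each "genes.results" file would supply the input dictionary, one sample at a time.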

10 votes, 2 answers

very unbalanced group sizes for DE

I downloaded some publicly available RNA-seq data and want to compare those samples carrying a mutation (~4) against the rest (~800!). I ran both edgeR and DESeq2, and the first results in an asymmetric volcano plot: skewed to one side,…

asked Sep 26 '17 at 23:50 by Kraken (reputation 405)

10 votes, 3 answers

Tools to create annotated table of variants from VCF

The problem: I have a VCF file, a reference genome, and a bunch of annotations for the reference (genes, repeat regions, etc.) as GFF or BED files. What I would like is a tool that takes all of this as input and outputs a tab- or comma-delimited…

asked Aug 10 '17 at 22:44 by roblanf (reputation 962)
