Most Popular

1500 questions
10
votes
2 answers

What are the de facto required fields in a SAM/BAM read group?

The SAM specification indicates that each read group must have a unique ID field, but does not mark any other field as required. I have also discovered that htsjdk throws exceptions if the sample (SM) field is empty, though there is no indication…
mattm
  • 754
  • 7
  • 19
10
votes
4 answers

How can I build a protein network pathway from a gene expression quantification file?

Assume I have found the top 0.01% most frequent genes from a gene expression file. Let's say these are 10 genes, and I want to study the protein-protein interactions, the protein network and pathway. I thought to use string-db or interactome, but I…
0x90
  • 1,437
  • 9
  • 18
10
votes
4 answers

Annotation format design

Bashing file formats is a favorite pastime in bioinformatics, and annotation file formats such as GFF and BED seem to get special attention. A lot of this frustration stems from the community's shockingly inconsistent adherence to specifications and…
Daniel Standage
  • 5,080
  • 15
  • 50
10
votes
2 answers

Building STAR Genome Index for nanopore RNA sequencing

I am aligning a dataset of 1,000,000 reads of human mRNA sequenced on Oxford Nanopore Technologies' MinION, and would like to use the STAR aligner, using the parameters recommended by Pacific Biosciences for long reads. According to this Google…
Scott Gigante
  • 2,133
  • 1
  • 13
  • 32
10
votes
2 answers

When to account for the blacklisted genomic regions in ChIP-seq data analyses?

We have heard in the group that it is important to keep track of and to filter artifact regions when analysing data from functional genomics experiments, especially ChIP-seq. Here, we have seen pipelines that remove the ENCODE tracks i) before…
olga
  • 481
  • 2
  • 8
10
votes
4 answers

What methods exist to calculate RNA expression profile similarity?

Some of the work in our lab requires a comparison of a strain across several experimental conditions. We are looking to identify the most similar experimental conditions based on the gene transcription response similarity from the cell. While we could…
chiffa
  • 201
  • 2
  • 4
10
votes
1 answer

Behavior of `--reference` flag with samtools sort

I'm about to run samtools sort (version 1.6) and I'm reviewing the available configuration options. $ samtools sort Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT …
Daniel Standage
  • 5,080
  • 15
  • 50
10
votes
3 answers

How to calculate the memory usage of storing kmers in RAM

I want to write a program in C++ that stores kmers in a hash or in a trie. How can I calculate how much RAM I would need for each type of data structure? For this application my kmers are strand-specific, so I cannot reduce the size complexity by…
conchoecia
  • 3,141
  • 2
  • 16
  • 40
10
votes
1 answer

How to manipulate a protein interaction network from the STRING database in R?

How can I manipulate a protein-interaction network graph from the STRING database using the STRINGdb Bioconductor package and R? I have downloaded the whole graph for Homo sapiens from STRING, which has about 20,000 proteins. How do I read the file using…
A M
  • 103
  • 7
10
votes
6 answers

How do I find identical sequences in a FASTA file?

I want to create a database for a proteomics study. Therefore, the mapping from a given sequence to a protein ID has to be unique. I am wondering whether there is already a built-in function in Biopython for that, but I could not find any. The…
Cleb
  • 743
  • 7
  • 18
10
votes
2 answers

How to resolve the snakemake error "Target rules may not contain wildcards."

I would like to do easily reproducible analysis using publicly available data from NCBI, so I have chosen snakemake. I would like to write a single rule that would be able to download any genome given a species code name and separated table of…
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
10
votes
2 answers

Is there a database of disease prevalence?

I want to collect data on how common different diseases are in different populations (or, at least, globally). Ideally, it would give me a way of querying the database with a disease name, and would return the number of cases per N population. My…
terdon
  • 10,071
  • 5
  • 22
  • 48
10
votes
1 answer

How to apply upper-quartile normalization on RSEM expected counts?

I see that TCGA RNASeq V2 RSEM data is normalized with upper-quartile normalization. After quantifying my samples with RSEM, I got "genes.results" as output, which has gene id, transcript id(s), length, expected count, and FPKM.…
stack_learner
  • 1,262
  • 14
  • 26
10
votes
2 answers

*very* unbalanced group sizes for DE

I downloaded some publicly available RNA-seq data and want to compare those samples carrying a mutation (~4) against the rest (~800!). I ran both edgeR and DESeq2, and the first results in an asymmetric volcano plot: skewed to one side,…
Kraken
  • 405
  • 2
  • 9
10
votes
3 answers

Tools to create annotated table of variants from VCF

The problem: I have a VCF file, a reference genome, and a bunch of annotations for the reference (genes, repeat regions, etc.) as GFF or BED files. What I would like is a tool that takes all of this as input and outputs a tab- or comma-delimited…
roblanf
  • 962
  • 7
  • 15