Most Popular

1500 questions
7 votes
1 answer

Where to download JASPAR TFBS motif bed file?

I am interested in determining if any transcription factor binding site motifs are enriched in some BED files from a DNA methylation experiment. I am looking for a database that has BED files containing regions enriched for specific transcription…
Reilstein
7 votes
1 answer

Using the t-SNE algorithm on microarray data + an error bonus

I'm trying to use the t-SNE algorithm on some microarray data. More specifically, my data frame has 18600 columns with genes (features) and 72 rows with conditions with replicates (10xWt, 10xTg, etc.). The expression values are in log2…
J. Doe
7 votes
2 answers

Filtering step for read counts data

I have around 1200 samples as columns and 60,000 genes with HTSeq-count data. Before normalization with the voom function, I want to do a filtering step. I want to remove genes whose expression is == 0 in at least 10 samples. Can I do this with read…
stack_learner
7 votes
1 answer

What will be an appropriate mathematical distribution for SNP data?

I found that several papers describe SNPs as following a binomial distribution with the probability of "success" equal to the minor allele frequency. However, in my experiments, when I generate SNP arrays following this distribution, the simulation results…
Haohan Wang
7 votes
1 answer

STAR-long parameters for aligning RNA ONT reads to genome

Are there any suggested parameters to align ONT reads to the reference genome using STAR-long? For now, I used the parameters suggested here, but I noticed a weird behaviour. I have RNA reads (D. melanogaster) from R7 and R9 flowcells, separately.…
aechchiki
7 votes
3 answers

RIP-seq analysis?

Given an experiment consisting of an input (baseline RNA) and IP (pulldown to find RNAs bound to a certain protein of interest)... Is a DE analysis performed over the RNA-seq data from the samples (let's say with edgeR or DESeq2) suitable to reveal the…
Kraken
7 votes
2 answers

Is it wise to use RepeatMasker on prokaryotes?

I'm looking for a way to identify low complexity regions and other repeats in the genome of Escherichia coli. I found that RepeatMasker may be used for example when drafting genomes of prokaryotes (E. coli example). But RepeatMasker works on a…
7 votes
1 answer

samtools mpileup empty when filtering out flags

I produced a BAM file by aligning reads to a small set of synthetic sequences using bwa-mem. I am heavily filtering reads that are not paired and of a certain orientation. Applying the filtering, I get a few thousand reads: samtools view -h…
719016
7 votes
3 answers

Is there a way to retrieve several SAM fields faster than `samtools view | cut -f`?

I am constructing a bit of software which pipes the output of a BAM file via samtools view into a script for parsing. My goal is to (somehow) make this process more efficient, and faster than samtools view. I am only using 3-4 fields in the BAM.…
ShanZhengYang
7 votes
2 answers

Correct for gene length or read counts in GO enrichment analysis

It is a well-reported fact that GO analysis of RNA-seq results is affected by a number of biases, including length bias and expression level bias. The Bioconductor package goseq allows you to correct for these biases. By default it corrects for…
Ian Sudbery
7 votes
1 answer

Split FASTQ and matching BAM into matching chunks

I am running a slow downstream analysis on a large set of nanopore reads (approx. 3 million), and would like to split them into smaller chunks, run the analysis massively in parallel, and then recombine. Originally I just split the FASTQ into chunks,…
Scott Gigante
7 votes
1 answer

Where can I find summary data for how common certain mutation *types* are?

I'd like to know how common certain mutation types are in public data sets like the 1000 Genomes, ExAC, and ESP6500. Specifically, I'd like to know the distribution of stop-gains, stop-losses, frameshift, and other mutation types. For example, what…
Mark Ebbert
7 votes
1 answer

The effects of incomplete bisulfite conversion upon mapping efficiency

This question has also been posted on Biostars. I have sequenced numerous multiplexed pools of BS amplicon-seq libraries derived from human samples on a MiSeq over the past few weeks. I have been utilising Trim Galore and Bismark for alignment and am…
David Ross
7 votes
1 answer

How to calculate overall reference coverage with MUMmer?

Is the MUMmer suite capable of calculating reference sequence coverage statistics for all query sequences collectively? It would be possible to achieve by parsing the output of nucmer / show-coords / show-tiling but it seems like there should be a…
bedeabc
7 votes
3 answers

Extract nanopore read ID & start times from fastq file

I have a fastq file from a MinION (Albacore) that contains information on the read ID and the start time of the read. I want to extract these two bits of information into a single CSV file. I've been trying to figure out a grep/awk/sed solution, but…
roblanf