Merging sequencing data for ChIP-seq experiments

Question

I need to merge sequencing data from different sequencing runs of the same ChIP-seq library (HiSeq 2000).

Are there any potential advantages or disadvantages to merging the files at the .fastq stage versus the .BAM stage (alignment with Bowtie/1.1.2)?

score 7 · Accepted Answer · answered Jun 03 '17 at 12:48

I don’t think it matters. Both are easy to merge (BAM via samtools merge, and (gzipped) FASTQ via cat), and neither method has specific disadvantages, unless your FASTQ files are sorted for some reason (but they generally shouldn’t be).
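As a minimal sketch of why plain concatenation is safe for gzipped FASTQ: the gzip format allows several compressed members back to back, so a byte-wise join (what `cat run1.fastq.gz run2.fastq.gz > merged.fastq.gz` does) decompresses as one continuous stream. The file names and reads below are made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_gzip(inputs, output):
    """Byte-wise concatenation, the same thing `cat a.gz b.gz > out.gz` does."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file (four lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def demo():
    """Write two tiny gzipped FASTQ files, merge them, and return the record counts."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        run1, run2, merged = tmp / "run1.fastq.gz", tmp / "run2.fastq.gz", tmp / "merged.fastq.gz"
        # Hypothetical reads, one per run.
        with gzip.open(run1, "wt") as fh:
            fh.write("@read1\nACGT\n+\nIIII\n")
        with gzip.open(run2, "wt") as fh:
            fh.write("@read2\nTTGA\n+\nIIII\n")
        concat_gzip([run1, run2], merged)
        return (count_fastq_records(run1), count_fastq_records(run2), count_fastq_records(merged))
```

Calling `demo()` returns the record counts for the two inputs and the merged file, so you can check that nothing was lost.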

One advantage of keeping the FASTQ files separate is that it makes it slightly easier to parallelise the mapping step: just run the mapper in parallel on the separate FASTQ files. Although Bowtie has a multithreading option (-p), its throughput is slightly worse than running separate mapping jobs on the split files.
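A rough sketch of running one Bowtie job per FASTQ file, assuming single-end, uncompressed reads and a prebuilt index. The index name, file names, and the helper functions `bowtie_command` and `map_in_parallel` are hypothetical; only `-p` (threads) and `-S` (SAM output) are real Bowtie 1 options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def bowtie_command(index, fastq, threads=1):
    """Build a Bowtie 1 command line that writes SAM next to the input FASTQ."""
    sam = str(Path(fastq).with_suffix(".sam"))
    return ["bowtie", "-p", str(threads), "-S", index, str(fastq), sam]


def map_in_parallel(index, fastqs, jobs=2, runner=subprocess.run):
    """Run one Bowtie process per FASTQ file, at most `jobs` at a time."""
    commands = [bowtie_command(index, fq) for fq in fastqs]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # The threads only wait on external processes, so the GIL is not a bottleneck.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

For example, `map_in_parallel("hg19_index", ["run1.fastq", "run2.fastq"])` would start two mapping jobs side by side; the resulting alignments can then be converted to BAM and merged.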

score 3 · Answer 2 · answered Jun 06 '17 at 08:19

For ChIP-seq it shouldn't really matter. But do be aware that by default, samtools merge retains the read group information (the @RG lines in the header) from each input file. This could pose a problem for some downstream analyses (e.g. the GATK HaplotypeCaller) if you want the merged data to be treated as a single sample. You can change this behaviour with the -c option, which combines @RG headers that share the same ID instead of keeping them as separate read groups.
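To see which read groups a merged file carries, you can look at the @RG lines in its header (for example the text printed by `samtools view -H merged.bam`). The small parser below is only an illustration; the function name and the read group IDs in the sample header are made up.

```python
def read_group_ids(sam_header_text):
    """Return the ID of every @RG line in a SAM header, in order of appearance."""
    ids = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


# Hypothetical header of a BAM merged from two runs of the same library.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:run1\tSM:chip_sample\n"
    "@RG\tID:run2\tSM:chip_sample\n"
    "@PG\tID:bowtie\n"
)
print(read_group_ids(example_header))
```

More than one ID in the result means downstream tools that are read-group aware will see the runs as distinct groups.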

Sarah Carl


I do not think one needs @RG information for ChIP-seq; it is very unlikely that someone would want to do variant calling with ChIP-seq data, so in any case it would hardly matter. I would just not mention @RG here, since people might get confused. – ivivek_ngs Jun 06 '17 at 14:12
Fair point. But since olga did ask about advantages or disadvantages, I thought it would be worth mentioning. It might be useful in the future, or for other users, when merging files for other applications besides ChIP-seq, since samtools merge is widely used. – Sarah Carl Jun 06 '17 at 20:19

score 2 · Answer 3 · answered Jun 06 '17 at 17:40

Agree with the others that it doesn't really matter. One thing to note though: if you're deduplicating your BAM files (you probably should for ChIP-seq data), make sure that you do this after merging. :)

ewels

Note that if, instead of one library, you have multiple technical replicates (i.e. multiple libraries from the same sample), you should align each technical replicate separately, then deduplicate, and THEN merge. This lets you keep fragments from different libraries that look like PCR duplicates (because they map to the same place) but are actually independent. – Daniel Kim Jun 06 '17 at 18:02
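A toy illustration of why the order matters, using made-up reads represented as (chromosome, start, strand) tuples. The `dedup` function is a crude stand-in for real duplicate marking, which uses more information than this; it only shows the difference between the two orders of operations.

```python
def dedup(reads):
    """Keep one read per (chrom, start, strand) position."""
    return sorted(set(reads))


# Hypothetical alignments from two inputs.
run1 = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-")]
run2 = [("chr1", 100, "+"), ("chr1", 400, "+")]

# Same library sequenced twice: merge first, so duplicates across runs collapse.
merge_then_dedup = dedup(run1 + run2)

# Separate libraries: deduplicate each, then merge, so a position seen in
# both libraries is kept once per library.
dedup_then_merge = dedup(run1) + dedup(run2)

print(len(merge_then_dedup), len(dedup_then_merge))
```

With these inputs the first order keeps three reads and the second keeps four, because the read at chr1:100 is retained once for each library.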
