8

I have a large collection of bam files and I want to post-process each of them into another bam where I can make queries about:

  • the reads position and pair-endness,
  • insert sizes,
  • MAPQ and other flags, etc. of the reads

but where I don't need to make use of:

  • the sequence string of the reads,
  • the Q-scores qualities or,
  • the read ids, apart from keeping pair-endness in the read id,

So I would like to know what's a good tool/parameters to process each bam and turn them into a small size bam that still retains the information needed above. CRAM format is not an option at this point as several of the tools I use are not CRAM-compatible yet.

Any ideas?

zx8754
  • 1,042
  • 8
  • 22
719016
  • 2,324
  • 13
  • 19

2 Answers2

12

You can just set the fields you don't need to *:

samtools view -h foo.bam | awk 'BEGIN{FS="\t"; OFS="\t"}{if($1~/^@/) {print $0} else {print "*", $2, $3, $4, $5, $6, $7, $8, $9, "*", "*"}}' | samtools view -bo smaller.bam -

This will set each read's name to *, but you can still see where its mate maps with the PNEXT and RNEXT fields.

The resulting BAM file looks like:

*   83  chr1    3003308 37  51M =   3003157 -202    *   *
*   99  chr1    3003364 33  51M =   3003547 234 *   *
*   99  chr1    3003410 37  50M =   3003636 269 *   *
*   99  chr1    3003438 255 50M =   3003760 373 *   *
Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
  • 2
    Why *, specifically? Is that part of the bam specs or will any character do? – terdon Mar 12 '18 at 09:31
  • 3
    * is defined in the BAM specification to mean "unknown" or "not available". You can get similar results if you convert to CRAM and don't decode the read name or seq/qual fields (i.e., you'll get a * in those fields). – Devon Ryan Mar 12 '18 at 09:34
3

I wrote http://lindenb.github.io/jvarkit/Biostar173114.html for biostars "What is the best way to make a bam file smaller by removing unwanted information?"

Usage: biostar173114 [options] Files

    -keepAtt, --keepAttributes
      keep Attributes
      Default: false
    -keepCigar, --keepCigar
      keep cigar : don't remove hard clip
      Default: false
    -keepName, --keepName, --name
      keep Read Name, do not try to create a shorter name
      Default: false
    -keepQuals, --keepQualities
      keep base qualities
      Default: false
    -keepRG, --keepReadGroup
      if attributes are removed, keep the RG
      Default: false
    -keepSeq, --keepSequence
      keep read sequence
      Default: false
    -mate, --mate
      keep Mate/Paired Information

e.g:

$ java -jar dist/biostar173114.jar src/test/resources/S1.bam --keepSequence | head -n 20
@HD VN:1.5  SO:coordinate
(...)
@SQ SN:RF11 LN:666
@RG ID:S1   SM:S1   LB:L1   CN:Nantes
R0  128 RF01    1   60  70M *   0   0   GGCTATTAAAGCTATACAATGGGGCCGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACT  *
R1  64  RF01    8   60  70M *   0   0   AAAGCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATG  *
R2  64  RF01    11  60  70M *   0   0   GCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGC  *
R3  64  RF01    12  60  70M *   0   0   CTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGCA  *
R4  128 RF01    27  60  70M *   0   0   GTATCATCTAATCTTGTCATAATATTTATCATATATATATAACTCACAATCCGCAGTTCAAATTCCAATA  *
R5  64  RF01    44  60  70M *   0   0   CAGAATATTTATCATTTATATATAACTCAGAATCCGCAGTTCAAATTCCAATATACTATTCTTCCAATAG  *
R6  128 RF01    44  60  70M *   0   0   CAGAATATTTATCATTTATATATAACTCACAATCCGCAGTTCAAATTCCAATATACTATTCTTCCAATAG  *
Pierre
  • 1,536
  • 7
  • 11