Can I index a compressed FASTA file using STAR?

Question

I am using STAR to align RNA-seq reads to a reference genome. Before the alignment, I need to generate an index of the reference genome. I use the following code to generate the index successfully:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir output/index/star --genomeFastaFiles ref.fa --sjdbGTFfile ref.gtf --sjdbOverhang 100

This works fine. However, I would like to keep my reference genome compressed to save disk space. So I am trying the following command:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir output/index/star --genomeFastaFiles ref.fa.gz --readFilesCommand "gunzip -c" --sjdbGTFfile ref.gtf --sjdbOverhang 100

but I get the following error:

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: ref.fa.gz is not fasta: the first character is '' (31), not '>'. Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

I use --readFilesCommand successfully with compressed RNA-seq FASTQ files. Does anybody know if there is a similar option for compressed references? Is there a workaround using Unix commands (piping, perhaps), or do I need to decompress the reference, index it, and then compress it again?

score 5 · Accepted Answer · answered Feb 26 '18 at 20:11

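I can think of two possible workarounds. The simplest is to try zcat instead of gunzip -c, in case that works, although the error message suggests STAR opens the genome FASTA directly rather than passing it through --readFilesCommand. If that doesn't help, you can use process substitution, provided your shell supports it (bash, zsh and ksh do). Something like this: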

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir output/index/star \
    --genomeFastaFiles <(zcat ref.fa.gz) --sjdbGTFfile ref.gtf --sjdbOverhang 100

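The <(...) construct is process substitution: the shell runs the enclosed command and replaces the construct with a file name (typically something like /dev/fd/63) that is connected to the command's output. It is very useful for passing streams to commands that expect files.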
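
To see what the receiving program is actually handed, here is a minimal sketch that does not need STAR at all. It assumes bash and uses a throwaway file, demo.fa.gz, which is a made-up name for illustration and not part of the question:

```shell
# Build a tiny gzipped FASTA to experiment with
# (demo.fa.gz is a hypothetical file name used only for this demo).
printf '>chr_demo\nACGTACGT\n' | gzip -c > demo.fa.gz

# Process substitution expands to a path; on Linux it is usually /dev/fd/NN.
echo <(gunzip -c demo.fa.gz)

# A program that opens that path reads the decompressed stream, so the
# first byte is '>' rather than the gzip magic byte (31) from the error.
first_char=$(head -c 1 <(gunzip -c demo.fa.gz))
echo "first character: $first_char"
```

Keep in mind that the path refers to a pipe, so it can be read only once and cannot be rewound. The trick therefore only works with programs that read the file in a single sequential pass.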

If your shell doesn't support this construct, assuming you are on a system that supports named pipes (Linux does), you can do the same thing a bit more manually:

Create a named pipe (FIFO)
```
mkfifo foo
```
Set it to read the uncompressed genome
```
zcat ref.fa.gz > foo &
```
The & runs the command in the background. That matters because opening a FIFO for writing blocks until another process opens it for reading, so without the & the command would hang. Alternatively, you can run this command in one terminal and the next one in another.

Read the input from the pipe

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir output/index/star \
    --genomeFastaFiles foo --sjdbGTFfile ref.gtf --sjdbOverhang 100
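
The three steps can also be put into one script that cleans up after itself. The following is only a sketch under assumptions: demo.fa.gz and genome.fifo are made-up names, and grep stands in for STAR so that the snippet runs anywhere. In real use you would replace the grep line with the STAR command above, passing --genomeFastaFiles "$fifo".

```shell
# Hypothetical file names for illustration only.
printf '>chr_demo\nACGTACGT\n' | gzip -c > demo.fa.gz
fifo=genome.fifo

# Create the named pipe, removing any stale one from an earlier run.
rm -f "$fifo"
mkfifo "$fifo"

# Writer: blocks until a reader opens the FIFO, hence the background job.
gunzip -c demo.fa.gz > "$fifo" &
writer_pid=$!

# Reader: grep is a stand-in for STAR here; it counts FASTA header lines.
n_seqs=$(grep -c '^>' "$fifo")

# Wait for the writer to finish, then remove the FIFO.
wait "$writer_pid"
rm -f "$fifo"

echo "sequences read through the FIFO: $n_seqs"
```

Using gunzip -c instead of zcat is a portability choice: on some systems (macOS, for example) zcat expects .Z files, while gunzip -c behaves the same everywhere gzip is installed.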
