5

I have a set of scRNA-seq samples enriched with FACS for cells expressing a specific gene reporter (TdTomato). In particular the gene I want to report has positive counts in the resulting matrix for 97% of cells.

I followed CellRanger documentation to create a custom reference. I added TdTomato on an additional chromosome and I added an entry in the gtf:

TdTomato    .   exon    1   1431    .   -   .   gene_id "TdTomato"; transcript_id "TdTomato"; gene_name "TdTomato"

The command I'm using is:

cellranger-2.1.1/cellranger mkref --genome=refdata-cellranger-mm10-2.1.0.tdtomato --fasta=genome.tdtomato.fa --genes=genes.tdtomato.gtf

where genome.tdtomato.fa is the original reference genome (mm10, as provided by cellranger) with an additional chromosome at the end with header:

>TdTomato

and genes.tdtomato.gtf the original reference genome gtf with the additional line indicated previously at the end of the file.

I actually tried both on the '+' and on the '-' strand, however I guess this should not affect the read mapping, since TdTomato is on an independent chromosome.

I ran cellranger count using the custom reference:

s="SAMPLE"
cellranger-2.1.1/cellranger count --id="$s"_tt --transcriptome=refdata-cellranger-mm10-2.1.0.tdtomato --fastqs=original_fastqs_directory/"$s"/fastqs --sample="$s"

I am using the default parameters of CellRanger.

However, I found that less than 1% of cells seems to express TdTomato. If I use the '+' strand I get slightly more counts (2.4% of cells have positive TdTomato).

What I checked:

  • With IGV I can see that there are reads mapping to TdTomato
  • Using samtools I am able to see the same number of reads mapped on TdTom either with the + or - strand
  • However, following this documentation I noticed that by executing the CellRanger filtering steps, more reads are filtered in the case of the - strand

This reduces the number of reads used for counts to 44 (2.4% of the original 1807 reads, - strand), 240 (13.3%, + strand) for TdTomato, 71620 (21.2%) for the gene reported by TdTomato. Interestingly, there are fewer reads for TdTom than counts, so there may be some kind of normalization afterwards?

The gene reported by TdTomato is longer, 11394 bp than TdTom, 1431 bp. The total length of the exons for the two genes is 5349 vs TdTom: 1431. Does this influence exponentially the amount of reads mapped to the two genes?

gc5
  • 1,783
  • 18
  • 32
  • Did you check the bam file for reads mapping to TdTomato? If you believe it is a mapping issue you can always run HTseq count on the cell ranger bamfile. It would help if you add the actual input command for cellranger mkref and cellranger count to trouble shoot your problem – Mack123456 Jun 29 '18 at 15:51
  • As an alternative you can make a reference for an alternative mapper such as bwa or STAR, count the mapped reads and see how this corresponds to the reads found in the cellranger bamfile. If there is a good correspondence you most likely did not enrich, or used the wrong sequence for your reference. – Mack123456 Jun 29 '18 at 16:00
  • The strand will not affect read mapping, but will affect read counting. How many TdT counts do you get with different strands? If the counts are similar, you are probably not capturing the real TdT. – burger Jun 29 '18 at 16:10
  • @Mack123456 I've added the issued input command, thanks. I'll also look at the bam file with HTseq. – gc5 Jun 29 '18 at 16:32
  • @burger The count is similar, more or less 1% of cells with >0 counts and maximum of 2 or 3 counts. – gc5 Jun 29 '18 at 16:33
  • 10X RNA-seq tends to only reliably capture the most abundant transcripts expressed in a cell. The TdT gene may not be expressed at a level high enough relative to the other transcripts to be reliably captured in your experiment. You may be able to identify additional transcripts if you increase the sequencing saturation. – GWW Jun 29 '18 at 20:56
  • @gc5 Do you have any update regarding this? I have done two experiments on cells sorted by expression of tdTomato, one with 10X and another using DropSeq, with a custom tdTomato reference and in both experiments the number of tdTomato transcripts is very small (1/2000 cells in the 10X experiment, and 91/900 in the DropSeq experiment). I can't figure out if it's a problem with my reference or a biological/technical problem on the sequencing side. – wflynny Jul 13 '18 at 15:40
  • @wflynny unfortunately not yet. I'm still checking I haven't done some mistakes during mapping, but the steps are pretty much consolidated so I don't think there's much room for mistakes. My guess is that TdTomato reads are filtered by cellranger during count, but I don't know why yet. Please update here if you find a possible solution. Thanks – gc5 Jul 13 '18 at 16:15
  • @gc5 With samtools idxstats possorted_genome_bam.bam | cut -f 1,3 for my 10X experiment, I get 130 reads mapping to my "tdTomato chromosome" so I know there are at least some reads mapping. Very puzzling. – wflynny Jul 13 '18 at 22:36
  • @wflynny I've updated my question with more details, hope this is interesting for your case – gc5 Jul 26 '18 at 16:55
  • @GWW I am thinking that your comment is on point, maybe there are very few reads for TdTomato to begin with and 10X scRNA-seq cannot reliably capture them. Is it basically dropout or another type of issue? Can you provide me some articles to read about this problem? Thanks – gc5 Jul 26 '18 at 16:57

1 Answers1

1

The first column of your GTF must correspond exactly to the name of the sequence in your FASTA file. Typically this is the chromosome so a GTF line:

chr1    hg38_knownCanonical exon    11106535    11262507    0.000000    .   .   gene_id "gene1"; transcript_id "tx1"; 

This must have a corresponding sequence in the FASTA file:

>chr1
ATCG ATCG ATCG ATCG ATCG ATCG...

Therefore you need to add an additional sequence to your FASTA file with >TdTomato. This must have exactly the same sequence as the construct that you are expecting reads for. The run cellranger mkref and cellranger count on these FASTA and GTF files. Cellranger also can take a GTF with “CDS” as a feature in addition to “exons”. The STAR alignment algorithm used by cellranger is splice aware so it can handle split reads but you will need to ensure the GTF corresponds to exons of the mRNA/cDNA rather than the full sequence.

10x Genomics is a platform with a high dropout rate, especially for lowly expressed genes. You should bear this in mind for what you expect to observe. It may be worth comparing the mapping rate of your construct to housekeeper genes to see if this is the reason for the low counts. Since whole mRNA molecules are captured on the beads before preparing the library, there should not be a difference in UMI counts between long and short genes.

Tom Kelly
  • 873
  • 7
  • 20