I have a barcode distribution from single-cell data, e.g.:
11612552 TCCTGAGCACTGCATAACTCAA
9349711 GCTACGCTACTGCATAAGTCCA
8343678 CAGAGAGGCTAAGCCTGCACAT
8161950 CGTACTAGTCTCTCCGCGGCTA
8102383 TCCTGAGCGTAAGGAGCAGATC
7110298 AAGAGGCAGTAAGGAGATAATC
6626630 TAAGGCGAACTGCATAACATGT
6390489 CGAGGCTGTATCCTCTACTAGA
6210446 AGGCAGAATCTCTCCGAGTGCT
5985219 TAAGGCGATCTCTCCGTACGTG
...
where the first column is the number of reads associated with the cell barcode in the second column.
I would like to collapse each set of cell barcodes that are within a Hamming distance of 1 of each other (i.e. allowing one mismatch between barcodes), keeping only the barcode with the highest read count and adding the read counts of its near-duplicates to it.
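To make the operation concrete, with hypothetical shortened barcodes (8 bp, made up for illustration), an input like

100 AAAACCCC
10 AAAACCCG
1 TTTTGGGG

should collapse to

110 AAAACCCC
1 TTTTGGGG

because AAAACCCG differs from AAAACCCC at a single position, while TTTTGGGG has no match within one mismatch.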
I was thinking of using awk and grep to filter the barcode list in an iterative procedure based on regular expressions, but I suspect this would take too long.
Is there a way to do this with standard Unix tools or an already published pipeline (e.g. UMI-tools)?
This is my code at the moment (not optimized yet):
#!/bin/awk -f
# NB: this assumes the input is already sorted by read count in descending
# order, so the first barcode seen in a cluster is the one that is kept.
{
    # check whether the current barcode matches the one-mismatch regex of
    # any barcode we have already decided to keep
    matching = 0;
    for (r in regexs) {
        if ($2 ~ regexs[r]) {
            matching = 1;
            counts[r] += int($1);   # merge its reads into the retained barcode
            break;
        }
    }
    # otherwise keep this barcode and build its one-mismatch regex:
    # an alternation of the barcode with each position replaced by "."
    if (!matching) {
        regex = "";
        for (i = 1; i <= length($2); i++) {
            l_regex = substr($2, 1, i-1) "." substr($2, i+1);
            if (length(regex) == 0)
                regex = l_regex;
            else
                regex = regex "|" l_regex;
        }
        regexs[$2] = regex;
        counts[$2] = int($1);
    }
    # progress indicator: print a dot to stderr every 100 records
    if (!(NR % 100)) printf "." > "/dev/stderr";
}
END {
    # print the collapsed counts, sorted by read count in descending order
    for (c in counts) {
        print counts[c], c | "sort -k1 -rn";
    }
}
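For reference, this is an untested sketch of a hash-based variant of the same idea: instead of matching every new barcode against all of the stored regexes, it indexes each retained barcode under its "masked" keys (the barcode with one position replaced by "."), so each input line needs only one array lookup per position rather than a scan over all retained barcodes. Like the script above, it assumes the list is already sorted by read count in descending order, so the first barcode seen in a cluster is the one that is kept.

#!/bin/awk -f
{
    # look the barcode up via each of its masked keys; a hit means it is
    # within one mismatch of a barcode we have already decided to keep
    hit = "";
    for (i = 1; i <= length($2); i++) {
        key = substr($2, 1, i-1) "." substr($2, i+1);
        if (key in seen) { hit = seen[key]; break; }
    }
    if (hit != "") {
        counts[hit] += int($1);   # merge the reads into the retained barcode
    } else {
        # new representative barcode: keep its count and index all of its masked keys
        counts[$2] = int($1);
        for (i = 1; i <= length($2); i++)
            seen[substr($2, 1, i-1) "." substr($2, i+1)] = $2;
    }
}
END {
    for (c in counts) print counts[c], c | "sort -k1 -rn";
}

Both scripts can be run in the same way, e.g. awk -f collapse.awk barcodes.txt (file names here are just placeholders for the script and the sorted count/barcode list shown above).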