I have a barcode distribution from single-cell data, e.g.:
11612552 TCCTGAGCACTGCATAACTCAA
9349711 GCTACGCTACTGCATAAGTCCA
8343678 CAGAGAGGCTAAGCCTGCACAT
8161950 CGTACTAGTCTCTCCGCGGCTA
8102383 TCCTGAGCGTAAGGAGCAGATC
7110298 AAGAGGCAGTAAGGAGATAATC
6626630 TAAGGCGAACTGCATAACATGT
6390489 CGAGGCTGTATCCTCTACTAGA
6210446 AGGCAGAATCTCTCCGAGTGCT
5985219 TAAGGCGATCTCTCCGTACGTG
...
where the first column is the number of reads associated with the cell barcode in the second column.
I would like to collapse each set of cell barcodes that are within a Hamming distance of 1 of each other (i.e. allowing one mismatch between barcodes), keeping only the barcode with the highest read count and adding the read counts of its near-duplicates to it.
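To make the operation concrete, with hypothetical shortened barcodes (8 bp, made up for illustration), an input like

100 AAAACCCC
10 AAAACCCG
1 TTTTGGGG

should collapse to

110 AAAACCCC
1 TTTTGGGG

because AAAACCCG differs from AAAACCCC at a single position, while TTTTGGGG has no match within one mismatch.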
I was thinking of using awk and grep to filter the barcode list in an iterative procedure based on regular expressions, but I suspect this would take too long.
Is there a way to do this with standard Unix tools or an already published pipeline (e.g. UMI-tools)?
This is my code at the moment (not optimized yet):
#!/bin/awk -f
# NB: this assumes the input is already sorted by read count in descending
# order, so the first barcode seen in a cluster is the one that is kept.
{
    # check whether the current barcode matches the one-mismatch regex of
    # any barcode we have already decided to keep
    matching = 0;
    for (r in regexs) {
        if ($2 ~ regexs[r]) {
            matching = 1;
            counts[r] += int($1);   # merge its reads into the retained barcode
            break;
        }
    }
    # otherwise keep this barcode and build its one-mismatch regex:
    # an alternation of the barcode with each position replaced by "."
    if (!matching) {
        regex = "";
        for (i = 1; i <= length($2); i++) {
            l_regex = substr($2, 1, i-1) "." substr($2, i+1);
            if (length(regex) == 0)
                regex = l_regex;
            else
                regex = regex "|" l_regex;
        }
        regexs[$2] = regex;
        counts[$2] = int($1);
    }
    # progress indicator: print a dot to stderr every 100 records
    if (!(NR % 100)) printf "." > "/dev/stderr";
}
END {
    # print the collapsed counts, sorted by read count in descending order
    for (c in counts) {
        print counts[c], c | "sort -k1 -rn";
    }
}
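For reference, this is an untested sketch of a hash-based variant of the same idea: instead of matching every new barcode against all of the stored regexes, it indexes each retained barcode under its "masked" keys (the barcode with one position replaced by "."), so each input line needs only one array lookup per position rather than a scan over all retained barcodes. Like the script above, it assumes the list is already sorted by read count in descending order, so the first barcode seen in a cluster is the one that is kept.

#!/bin/awk -f
{
    # look the barcode up via each of its masked keys; a hit means it is
    # within one mismatch of a barcode we have already decided to keep
    hit = "";
    for (i = 1; i <= length($2); i++) {
        key = substr($2, 1, i-1) "." substr($2, i+1);
        if (key in seen) { hit = seen[key]; break; }
    }
    if (hit != "") {
        counts[hit] += int($1);   # merge the reads into the retained barcode
    } else {
        # new representative barcode: keep its count and index all of its masked keys
        counts[$2] = int($1);
        for (i = 1; i <= length($2); i++)
            seen[substr($2, 1, i-1) "." substr($2, i+1)] = $2;
    }
}
END {
    for (c in counts) print counts[c], c | "sort -k1 -rn";
}

Both scripts can be run in the same way, e.g. awk -f collapse.awk barcodes.txt (file names here are just placeholders for the script and the sorted count/barcode list shown above).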