-1

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.

Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?

For example,

aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666 

must return 6 occurrences for aaa. (Except there are maybe 10 million more lines of this.)

Walter A
  • 17,923
  • 2
  • 22
  • 40
user3932000
  • 620
  • 8
  • 23

1 Answers1

3

Find 6 overlapping substrings aaa in the string

line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"

You don't want to see the strings, you want to count them. When you try

# wrong
grep -o -F "aaa" <<< "${line}" | wc -l

you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how handle ${line}?
Start with

grep -Eo "a{3,}" <<< "${line}"

Result

aaa
aaaa
aaaaa

Hom many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa. Compare the total count of characters with the number of lines (wc):

 match lines chars  add_to_total
   aaa     1     4      1
  aaaa     1     5      2
 aaaaa     1     6      3

For each line substract 3 from the total count of characters for that line.
When the result has 3 lines and 15 characters, calculate

15 characters - (3 lines * 3 characters) = 15 - 9 = 6

In code:

read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

Or for a file

read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

aaa was "easy", how about other searchstrings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are

abcdabc
abcdabcdabc
abcdabcdabcdabc

Use testline

line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"

you need "abc(dabc)+" and have

          match   lines    chars  add_to_total
abcdabcdabcdabc       1      16    3
abcdabcdabcdabc       1      16    3
        abcdabc       1       8    1
    abcdabcdabc       1      12    2

For each line substract 4 from the total count of characters and divide the answer by 4. Or (characters/4) - nr_line. When the result has 4 lines and 52 characters, calculate

(52 characters / fixed 4) / 4 lines = 13 - 4 = 9

In code:

read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"

When you have a large file, you might want to split it first.

Walter A
  • 17,923
  • 2
  • 22
  • 40
  • I was gonna post `grep -Eo 'AAA{1,}' file | wc -lc | awk '{print $1 + $2 - $1 * 4}'` This takes 8 seconds to process a ~1GB (10M rows, 100 cols) file on an *AMD A9-9410 RADEON R5 2C+3G (2) @ 2.900GHz* with GNU grep, GNU wc and GNU awk – oguz ismail Apr 11 '20 at 17:32
  • 1
    `$1 + $2 - $1 * 4 = $2 - $1 * 3`, so you found the same solution as I did. Nice job, I didn't expect that. You tested the performance, that seems good news for OP. Thanks. – Walter A Apr 11 '20 at 18:17
  • 1
    Oh yeah, I always sucked at maths – oguz ismail Apr 11 '20 at 18:18