
I'm trying to deduplicate a set of lines pulled from two files with egrep, using sort -u, and then count them. About 10% of the lines (all 100 characters long, from the alphabet [ATCG]) are duplicated. There are two files of about 3 GB each, and roughly 50% of the lines aren't relevant, so perhaps 300 million lines in total.

LC_ALL=C  grep -E  <files> |  sort --parallel=24  -u | wc -m

Between LC_ALL=C and using -x to accelerate grep, the slowest part by far is the sort. Reading the man pages led me to --parallel=n, but experimentation showed absolutely no improvement. A little digging with top showed that even with --parallel=24, the sort process only ever runs on one processor at a time.

I have 4 chips with 6 cores each and 2 threads per core, giving a total of 48 logical processors. Here is the output of lscpu (since /proc/cpuinfo would be too long):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               1400.000
BogoMIPS:              5199.96

What am I missing? Even if the process is I/O-bound, shouldn't I see parallel processing anyway? The sort process uses 99% of whichever processor it is on at any given time, so I should be able to see parallelization if it's happening. Memory isn't a concern: I have 256 GB to play with and none of it is used by anything else.

Something I discovered by piping grep's output to a file and then reading that file with sort:

 LC_ALL=C  grep -E  <files>  > reads.txt ; sort reads.txt  -u | wc -m

default, file           1m50s
--parallel=24, file     1m15s
--parallel=48, file     1m6s
--parallel=1, no file   10m53s
--parallel=2, no file   10m42s
--parallel=4, no file   10m56s

others still running

These benchmarks make it pretty clear that with piped input, sort isn't parallelizing at all. When allowed to read a file, sort splits the load as instructed.
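
To confirm that directly rather than inferring it from the timings, the thread count of the running sort can be inspected mid-benchmark (a sketch, assuming procps ps; NLWP is the number of threads):

ps -C sort -o pid,nlwp,pcpu,cmd

With piped input I'd expect NLWP to stay at 1; when sort reads a large file directly, it should climb toward the --parallel value.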

  • What sort is that on which distribution? The standard sort doesn't know that option. – ott-- Jul 09 '15 at 20:17
  • uname -a gives "3.13.0-46-generic #79-Ubuntu SMP", lsb_release -a claims 14.04.2, codename trusty, and the sort is the one that's part of GNU coreutils, according to man sort. – Jeremy Kemball Jul 09 '15 at 20:22
  • It seems to me there are portions here that need to be re-read:

    https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html

    – Hannu Jul 09 '15 at 20:29
  • I'm not sure I understand what you're getting at @Hannu, could you be more specific? sort --parallel=2 doesn't parallelize either. Neither does 4 or 8. nproc gives back 48 like it should. – Jeremy Kemball Jul 09 '15 at 20:36
  • I'd say... don't use coreutils for this. Amusingly, we had a very similar question, and well... every other method works better: http://superuser.com/a/485987/10165 – Journeyman Geek Jul 10 '15 at 02:23
  • @JeremyKemball - for starters, the fact that n>8 would not add much; it also seemed to me that your use of -u (unique) was not in line with what you asked for; better to use uniq, not sort -u – Hannu Jul 10 '15 at 08:19
  • sort | uniq and sort -u are explicitly equivalent if you don't use keys to sort the lines. Is there a reason to use an extra pipe? – Jeremy Kemball Jul 10 '15 at 13:40
  • From how I understand the text under the link above, there is a difference. But then, I'm NOT a native English speaker. – Hannu Jul 10 '15 at 18:22
  • They differ in last-resort comparison if you've specified ordering/sorting options, so that things that would otherwise sort identically are unique'd by those sorting options. I don't have -d, -f, -n, -h, -M, -v, or -k, so sort | uniq is explicitly identical to sort --unique, doubly so since I have very boring strings (a quick check is sketched below). – Jeremy Kemball Jul 11 '15 at 19:36
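
A quick sanity check of that equivalence on data shaped like these reads (a sketch, assuming bash and GNU coreutils; reads.txt is just a scratch file):

tr -dc 'ATCG' </dev/urandom | fold -w100 | head -n 1000000 > reads.txt
head -n 100000 reads.txt >> reads.txt   # re-append ~10% of the lines as duplicates
cmp <(LC_ALL=C sort -u reads.txt) <(LC_ALL=C sort reads.txt | uniq) && echo "identical"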

2 Answers


sort doesn't create threads unless it needs to, and for small files it's just too much overhead. Unfortunately, sort treats a pipe like a small file. If you want to feed enough data to 24 threads, you'll need to tell sort to use a large internal buffer (sort does that automatically when presented with large files). This is something we should improve upstream (at least in the documentation). So you'll want something like:

(export LC_ALL=C; grep -E  <files> | sort -S1G --parallel=24 -u | wc -m)

Note I've set LC_ALL=C for all processes, since they'll all benefit from it with this data.
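
As an aside, GNU sort's -S/--buffer-size also accepts a percentage of physical memory (that suffix is documented, though I haven't timed this exact variant), so on a 256 GB machine something like this should trigger the same threading:

(export LC_ALL=C; grep -E <files> | sort -S 25% --parallel=48 -u | wc -m)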

BTW you can monitor the sort threads with something like:

watch -n.1 ps -C sort -L -o pcpu
pixelbeat
  • Are the brackets only necessary for the LC_ALL=C invocation? Could I remove them if not using LC_ALL=C, or if prefixing the grep and sort commands with it? – Hashim Aziz Dec 02 '20 at 20:14

With parsort you can sort big files faster on a multi-core machine.

On a 48-core machine you should see a speedup of 3x over sort.

parsort is part of GNU Parallel and should be a drop-in replacement for sort.
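
For the pipeline in the question that would look something like this (a sketch; since parsort is documented as a drop-in for sort it should accept the same -u flag, but I haven't benchmarked this exact command):

LC_ALL=C grep -E <files> | parsort -u | wc -m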

Ole Tange
  • My experience with parsort for large files is that it slowly creeps up to 100% RAM usage, and once it gets there it bottlenecks hard. For context, the target file is 30 GB and the machine's RAM is 64 GB. So for my particular case parsort was not a good solution. – Do Not Track Me Jan 28 '24 at 18:53
  • @DoNotTrackMe I just tested sorting 25 GB on my 16 GB machine. The memory usage stays constant at ~100 MB/core, so I have the feeling there is something odd in your environment. See if you can reproduce it in VirtualBox. – Ole Tange Jan 29 '24 at 13:14
  • @DoNotTrackMe Is your $TMPDIR (typically /tmp) in RAM? Because that would explain it: parsort saves temporary results in $TMPDIR, so your $TMPDIR needs to be able to hold two full copies of the data being sorted. – Ole Tange Jan 29 '24 at 13:17
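
One way to check whether $TMPDIR is RAM-backed, and to point the temporary files at a disk-backed directory instead (a sketch; /scratch and bigfile.txt are hypothetical paths):

df -hT /tmp                                # a filesystem type of "tmpfs" means /tmp lives in RAM
TMPDIR=/scratch parsort -u bigfile.txt | wc -m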