144
 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful Linux server that is otherwise not overloaded. Is there any alternative to grep? Is there anything about my syntax that can be improved (are egrep or fgrep better)?

The file is actually in a directory that is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?

The grep is using up to 93% CPU.

zzapper
  • 4,455
  • 5
  • 47
  • 44
  • 8
    Depending on your locale, the `-i` switch may slow the process down, try without `-i` or with `LC_ALL=C grep ...`. Also, if you're only grepping for a fixed string, use `grep -F`. – Thor Dec 17 '12 at 11:19
  • 7
    As @dogbane mentioned, using the **LC_ALL=C** variable along with **fgrep** can speed up your search. I did some testing and was able to achieve a **1400%** performance increase, and wrote up a detailed article on why this is in my [speed up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) post – JacobN Aug 23 '13 at 17:57
  • I'm curious - what file is 80GB in size? I'd like to think that when a file gets that big, there may be a better storage strategy (e.g. rotating log files, or categorizing hierarchically into different files and folders). Also, if the changes only occur in certain places of the file (e.g. at the end), then just store some grep results from the earlier section that doesn't change and instead of grepping the original file, grep the stored result file. – Sridhar Sarnobat Nov 14 '16 at 20:03
  • I settled on https://github.com/google/codesearch — both indexing and searching are lightning fast (written in Go). `cindex .` to index your current folder, then `csearch db_pd.Clients`. – ccpizza Oct 28 '17 at 02:06
  • 1
    If your file were indexed or sorted, this could be made **vastly** faster. Searching every line is O(n) by definition, whereas a sorted file can be seeked by bisecting it -- at which point you'd be talking under a second to search your 80gb (hence why a 80gb indexed database takes no time at all for a simple SELECT, whereas your grep takes... well, as long as it takes). – Charles Duffy Jan 18 '18 at 03:23
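The caching/indexing idea in the comments above can be sketched cheaply without a database: pay for one full scan, save the matches (with line numbers), and query the small result file from then on. The file names and contents below are hypothetical stand-ins for the 80 GB dump.

```shell
# Stand-in for the real 80 GB file.
printf 'alpha\ndb_pd.Clients row\nomega\n' > /tmp/g_dump.sql

# One expensive pass; -n records line numbers for locating rows later.
LC_ALL=C grep -F -n 'db_pd.Clients' /tmp/g_dump.sql > /tmp/matches.idx

# Later lookups hit the tiny index instead of rescanning the big file.
grep -F 'db_pd.Clients' /tmp/matches.idx
```

This only helps if the big file (or at least the region you indexed) doesn't change between lookups.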

6 Answers

186

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to RAM disk.
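A minimal sketch of the RAM-disk suggestion: /dev/shm is a tmpfs mount present on most Linux systems, so copying the file there makes subsequent reads come from memory. The sample file below is a hypothetical stand-in for the real dump.

```shell
# Stand-in for the 80 GB file.
printf 'header\nINSERT INTO db_pd.Clients VALUES (1);\nfooter\n' > /tmp/sample.sql

# Use a tmpfs (RAM-backed) directory if one is available.
ramdir=/dev/shm
[ -d "$ramdir" ] || ramdir=/tmp   # fall back if no tmpfs is mounted

cp /tmp/sample.sql "$ramdir/sample.sql"

# The recommended command, run against the RAM-backed copy.
LC_ALL=C grep -F -A 5 -B 5 'db_pd.Clients' "$ramdir/sample.sql"
```

Note this only pays off if you search the same file repeatedly; the initial copy still has to read all 80 GB from disk, and you need enough RAM to hold the file.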

dogbane
  • 254,755
  • 72
  • 386
  • 405
  • My grep has finally returned a result . Will try your grep suggestions and report the result. Must first however cut everything up to db_pd.Clients (illegal mysql table name) : sigh! – zzapper Dec 17 '12 at 11:33
  • 9
    that was MUCH quicker, by an order of magnitude, thanks. BTW I added -n to get the line numbers. Also maybe -m to exit after a match – zzapper Dec 17 '12 at 12:55
  • 7
    Wow thanks so much @dogbane great tip! This led me down a research tunnel to find out [why LC_ALL=C speeds up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) and it was a very enlightening experience! – JacobN Aug 23 '13 at 18:06
  • weird that nobody has mentioned the --mmap flag. – sw. Mar 27 '14 at 19:20
  • 10
    Some people (not me) like `grep -F` more than `fgrep` – Walter Tross Jun 18 '14 at 09:21
  • 2
    My understanding is that `LANG=C` (instead of `LC_ALL=C`) is enough, and is easier to type. – Walter Tross Jun 18 '14 at 11:46
  • 1
    @WalterTross what's the diff between `grep` and `fgrep` ? – Bob Jun 07 '16 at 15:38
  • 3
    @Adrian `fgrep` is another way to write `grep -F`, as `man fgrep` will tell you. Some versions of the `man` also say that the former is deprecated for the latter, but the shorter form is too convenient to die. – Walter Tross Jun 07 '16 at 16:20
  • 1
    Why doesn't `LC_ALL=C` help for `bzgrep` then? – Emma He Jun 07 '17 at 02:05
  • 1
    Doesn't seem to help `zgrep` either. Also what's a RAM disk, and how do I copy a file to it and run `zgrep` over it? Any pointers? – amit_saxena Sep 04 '18 at 13:24
  • I tested fgrep on a 1.5 GB file, and to my surprise it took twice as long as a normal grep. LANG=C did not seem to make a significant difference. Is this answer still valid, or has grep been optimised since this answer? – Onnonymous Dec 05 '18 at 10:34
  • Setting LC_ALL to C went from taking ~5 minutes to ~2 seconds for me piping from `cut` | `grep` (with Python regex) | `sed` (regex) | `sort` | `uniq -c`. That's crazy. – scorgn Oct 06 '21 at 18:23
42

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include:

  • Dropping the -i flag.
  • Using the -F flag for a fixed string
  • Disabling NLS with LANG=C
  • Setting a max number of matches with the -m flag.
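The grep flags above combine straightforwardly. A sketch on a small stand-in file (the file name and contents are illustrative):

```shell
# Stand-in for eightygigsfile.sql.
printf 'one\nINSERT INTO db_pd.Clients (id) VALUES (7);\ntwo\n' > /tmp/stand_in.sql

# Fixed string (-F), C locale (LANG=C), -i dropped, stop after the first
# match (-m 1), with five lines of context (-C 5).
LANG=C grep -F -m 1 -C 5 'db_pd.Clients' /tmp/stand_in.sql
```

`-m 1` is only appropriate if you expect a single match; otherwise drop it.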
Steve
  • 45,652
  • 12
  • 89
  • 100
  • 2
    If it is an actual file, use `--pipepart` instead of `--pipe`. It is much faster. – Ole Tange Jul 05 '16 at 06:51
  • This usage doesn't support a pattern that includes spaces; you need to quote the grep command, like this: parallel --pipe --block 10M "/usr/bin/grep -F -C5 -e 'Animal Care & Pets'" – zw963 Jun 14 '17 at 07:40
  • What does the `<` mean? – elcortegano Oct 21 '19 at 15:22
  • 1
    @elcortegano: That's what's called [I/O redirection](https://www.tldp.org/LDP/abs/html/io-redirection.html). Basically, it reads input from the following filename. Similar to `cat file.sql | parallel ...` but avoids a [UUOC](https://stackoverflow.com/questions/11710552/useless-use-of-cat). GNU parallel also has a way to read input from a file using `parallel ... :::: file.sql`. HTH. – Steve Oct 21 '19 at 21:26
  • What if I wanna grep the whole file directory? – Yan Yang Apr 09 '21 at 11:19
  • @YanYang, `find /path/to/dir -type f | parallel grep "foo"` – Steve Apr 09 '21 at 13:12
10

Some trivial improvements:

  • Remove the -i option if you can; case-insensitive matching is quite slow.

  • Replace the . with \.

    A bare dot is the regex metacharacter that matches any character, which also slows matching down
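The difference is easy to see on a small example: the unescaped dot matches any character, so it also matches names that merely resemble the target.

```shell
# One near-miss line and one exact line.
printf 'db_pdXClients\ndb_pd.Clients\n' > /tmp/dot_demo.txt

# Unescaped: the dot matches any character, so both lines match.
grep -c 'db_pd.Clients' /tmp/dot_demo.txt     # prints 2

# Escaped (or use -F for a fixed string): only the literal dot matches.
grep -c 'db_pd\.Clients' /tmp/dot_demo.txt    # prints 1
```

For this search, `grep -F` avoids the issue entirely by disabling regex interpretation.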

BeniBela
  • 15,674
  • 4
  • 41
  • 51
3

Two lines of attack:

  • Are you sure you need the -i, or can you get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start several instances at different offsets.
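A hand-rolled sketch of the offsets idea, without GNU parallel. Caveat: a raw byte split can cut a line in two, so a match straddling the boundary would be missed — this is why tools like parallel's --pipepart split on line boundaries instead. The file below is a tiny stand-in; in practice you'd compute offsets from the real file size.

```shell
# Stand-in file; the match sits safely inside the first half.
printf 'db_pd.Clients in first half\npadding padding padding padding\n' > /tmp/big.sql

size=$(wc -c < /tmp/big.sql)
half=$((size / 2))

# Search the two halves concurrently, one grep per core.
head -c "$half" /tmp/big.sql          | grep -F 'db_pd.Clients' &
tail -c +"$((half + 1))" /tmp/big.sql | grep -F 'db_pd.Clients' &
wait
```

Output from the two halves can interleave; add a per-chunk prefix (or collect into per-chunk files) if ordering matters.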
Eugen Rieck
  • 62,299
  • 10
  • 67
  • 91
1
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
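A sketch of the multi-string search: with -f, grep reads one pattern per line from a file, and -F treats each as a fixed string, so all patterns are matched in a single pass over the big file. The file names and contents here are hypothetical.

```shell
# Hypothetical list of fixed strings to search for, one per line.
printf 'db_pd.Clients\ndb_pd.Orders\n' > /tmp/strings.txt

# Stand-in for the big dump.
printf 'one\nINSERT INTO db_pd.Orders VALUES (2);\ntwo\n' > /tmp/dump.sql

# One pass, many fixed-string patterns, C locale.
LC_ALL=C grep -F -f /tmp/strings.txt /tmp/dump.sql
```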

user584583
  • 1,242
  • 4
  • 17
  • 35
0

Try ripgrep

It is generally much faster than grep, especially on large files.
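A sketch of the equivalent ripgrep invocation for this question's search (assumes `rg` is on PATH; if it isn't, the example falls back to grep so it still runs; the sample file is a stand-in):

```shell
# Stand-in for the 80 GB file.
printf 'x\ndb_pd.Clients\ny\n' > /tmp/rg_sample.sql

if command -v rg >/dev/null 2>&1; then
    # -F: fixed string, -C 5: five lines of context, like the grep above.
    rg -F -C 5 'db_pd.Clients' /tmp/rg_sample.sql
else
    grep -F -C 5 'db_pd.Clients' /tmp/rg_sample.sql
fi
```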

Shailesh
  • 329
  • 1
  • 13