12

I've come across compression algorithms and tried writing a simple run-length encoding algorithm myself, but when I looked at other algorithms I was shocked to learn that some of them can process thousands of megabits per second. How is this possible? When I tried compressing a 1 GB text file, it took quite a while.

How is this achievable? Or am I missing something here?

[image: benchmark table of compression algorithms and their speeds]

I can't code well in C, so I wrote my simple program in Python. Are these speeds only achievable in C?

gushkash
  • From the tour it says "Computational Science is a question and answer site for questions and answers about computational methods used in technical disciplines". Given that compression is important for handling the large datasets required in a number of disciplines, I don't see why this is OT here. – Ian Bush Aug 15 '22 at 13:36
  • @Federico: Computational Science is not just numerical. – Richard Aug 15 '22 at 20:16
  • This is just something I was told once, so take it with a grain of salt, but I believe the computationally intensive parts of some of the common encryption and compression algorithms are implemented in hardware, which gives them a massive speedup. – craq Aug 16 '22 at 07:27
  • There is basically no chance Python code will ever compete with well-written C code, even with PyPy JIT compilation. And data compression happens to be one of those low-level byte-manipulation tasks very well suited to C. – qwr Aug 16 '22 at 07:36
  • @craq I don't know about compression, but both x86 and ARMv8-A have cryptographic extensions, mostly to support AES. And for architectures which do not support them directly, it's not uncommon to have a separate cryptographic co-processor included in the SoC, which is accessed through drivers (for example, the Linux Kernel Crypto API). Additionally, ARMv8-A commonly includes an extension for computing CRC32, which I could see being used as a checksum in compressed archives. – jaskij Aug 16 '22 at 13:03
  • https://cs.stackexchange.com/questions/tagged/data-compression – Rodrigo de Azevedo Aug 16 '22 at 14:39
  • @craq: Encryption yes: for example, x86 CPUs since Nehalem have had AES hardware support, which makes quite a big difference. Compression no: there aren't any x86 instructions that specifically accelerate lossless compression. (Maybe you're thinking of lossy video compression, where SIMD instructions like psadbw can accelerate motion search for similar blocks, giving sums of absolute differences for 2x 8-byte chunks at a time, i.e. rows of two 8x8 blocks. But lossless compression only wants exact matches; pcmpeqb, which does a[i] == b[i] for 16 bytes, may be useful to find a candidate start point.) – Peter Cordes Aug 16 '22 at 21:40

6 Answers

28

The short answer to your question is this:

If your goal is speed (as it is in typical applications of data compression), then (i) you need to choose a programming language that allows you to write algorithms close to the hardware, and (ii) you will spend a very large amount of time benchmarking and profiling your software to find where it is slow, and then fix these places by using better algorithms and better ways of implementing these algorithms.

The authors of the compression packages you mention have not chosen to use magic to achieve the speed shown in the table. They have just put a large effort (probably years of work) into making their software fast.
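
To make the benchmarking part concrete, here is a minimal C sketch of a throughput measurement, assuming a POSIX system; compress_block is a hypothetical placeholder for whatever routine you are profiling, and the numbers it prints mean nothing until a real implementation is dropped in.

    /* Minimal throughput-benchmark sketch (POSIX clock_gettime).
       compress_block() is a hypothetical placeholder for the routine under test;
       here it only copies its input, so it measures memory bandwidth, not compression. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static size_t compress_block(const unsigned char *in, size_t n, unsigned char *out)
    {
        memcpy(out, in, n);
        return n;
    }

    int main(void)
    {
        const size_t n = 64u * 1024 * 1024;          /* 64 MiB of test data */
        unsigned char *in = malloc(n), *out = malloc(n);
        if (!in || !out) return 1;
        for (size_t i = 0; i < n; i++)               /* fill with something non-trivial */
            in[i] = (unsigned char)((i * 2654435761u) >> 24);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t produced = compress_block(in, n, out);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu -> %zu bytes, %.1f MB/s\n", n, produced, n / 1e6 / secs);
        free(in);
        free(out);
        return 0;
    }

Real profiling goes much further than this (perf, repeated runs, varied inputs), but even a crude loop like this makes it obvious where time goes once you start swapping pieces of the implementation in and out.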

Wolfgang Bangerth
  • Also, with typical well-engineered packages there is almost always a trade-off available between compression efficiency and time taken. That is, they nearly always allow taking more time in exchange for a better-compressed result. – SoronelHaetir Aug 16 '22 at 04:03
  • Computers are really, really fast today, but you have to make an effort to use all of that speed, especially to use multiple CPU cores. I suppose a naive Python program will only use a single CPU core. – idspispopd Aug 16 '22 at 07:11
  • @idspispopd Yes, and you have to fight a battle with the GIL for multithreaded performance. – qwr Aug 16 '22 at 07:38
  • @idspispopd You also have to choose an algorithm that can take advantage of parallelism, e.g. split your data into chunks and compress them separately on different cores. This might be a trade-off between speed and best compression ratio. – Barmar Aug 16 '22 at 13:42
  • You also have to write to your language. Python is extremely inefficient at operating on single characters in a stream, doing hundreds of times more work than "necessary." However, often we can operate on more characters (a chunk or a stream of them) with one function. Doing so sidesteps what Python is poor at and leverages what it is good at. As an example, I had written a rather slow function to unescape strings. I looked at what the pros did. They didn't write a big loop, like I did. They called the string's unencode function with arguments that handled the entire string at once! – Cort Ammon Aug 17 '22 at 17:28
  • "The authors of the compression packages you mention have not chosen to use magic" – this is the problem with wizards moonlighting as developers. They like to do stuff the hard way, depriving us of faster compression libraries. – Igby Largeman Aug 18 '22 at 04:09
11

I have written compression software in Rust. The answer is not simple, but compression algorithms are usually designed to be reasonably quick. RFC 1951 (DEFLATE) has two stages. The first is to find patterns that occurred earlier in the input, which can then be compressed into a pointer back to the earlier occurrence. This is generally done using a hash lookup.
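
As a rough illustration of that first stage (a sketch in C, not the author's actual Rust code): hash the next four bytes, look up the last position where that hash was seen, and verify that the bytes there really match. The table size, window size and minimum match length are illustrative values only.

    /* Sketch of hash-based match finding (the LZ77-style stage of DEFLATE-like
       compressors). One entry per hash bucket, no collision chains, fixed
       illustrative parameters -- not production code. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_BITS 15
    #define HASH_SIZE (1u << HASH_BITS)
    #define MIN_MATCH 4
    #define WINDOW    32768                /* how far back a pointer may reach */

    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);                              /* read the next 4 bytes */
        return (v * 2654435761u) >> (32 - HASH_BITS);  /* multiplicative hash */
    }

    /* head[] remembers, for each hash value, the last position it was seen at.
       The caller guarantees pos + 4 <= end and zero-initializes head[].
       Returns the match length (0 if none) and its distance via *dist. */
    static size_t find_match(const uint8_t *buf, size_t pos, size_t end,
                             size_t *head, size_t *dist)
    {
        uint32_t h = hash4(buf + pos);
        size_t cand = head[h];             /* earlier position with the same hash */
        head[h] = pos;                     /* remember the current position */

        size_t len = 0;
        if (cand < pos && pos - cand <= WINDOW) {
            while (pos + len < end && buf[cand + len] == buf[pos + len])
                len++;                     /* verify -- hash collisions simply fail here */
        }
        if (len < MIN_MATCH)
            return 0;
        *dist = pos - cand;
        return len;
    }

A real match finder keeps chains of earlier positions per bucket and limits how far it walks them, but this is the core idea: most positions are resolved with a single table lookup instead of a scan over the whole window.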

The second stage is to use Huffman encoding to efficiently represent the values needed for the encoding, using variable-length bit codes: values which occur frequently get shorter codes, less frequent values get longer codes.
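
The output of that stage then has to be packed at the bit level. As a hedged sketch (DEFLATE happens to pack codes least-significant-bit first; the Huffman code construction itself is omitted), a minimal bit writer in C might look like this:

    /* Tiny bit writer: packs variable-length codes LSB-first into a byte buffer.
       How the (code, nbits) pairs are chosen -- the actual Huffman construction --
       is omitted; a frequent symbol might get a 2-bit code, a rare one 8 or more bits. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t *out;       /* output buffer (assumed large enough) */
        size_t   pos;       /* bytes written so far */
        uint64_t bitbuf;    /* pending bits, least significant bit first */
        unsigned bitcount;  /* number of pending bits in bitbuf */
    } bitwriter;

    static void put_bits(bitwriter *bw, uint32_t code, unsigned nbits)
    {
        bw->bitbuf  |= (uint64_t)code << bw->bitcount;
        bw->bitcount += nbits;
        while (bw->bitcount >= 8) {                 /* flush completed bytes */
            bw->out[bw->pos++] = (uint8_t)bw->bitbuf;
            bw->bitbuf >>= 8;
            bw->bitcount -= 8;
        }
    }

    static void flush_bits(bitwriter *bw)
    {
        if (bw->bitcount > 0) {                     /* final partial byte, zero-padded */
            bw->out[bw->pos++] = (uint8_t)bw->bitbuf;
            bw->bitbuf = 0;
            bw->bitcount = 0;
        }
    }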

My software is probably a bit unusual in that it runs the two stages in parallel, which can be faster if your computer has more than one CPU.

There tend to be trade-offs between how much compression is achieved and how fast the compression runs. Finally, the answer is partly that modern computers are very fast.

Glorfindel
George Barwood
  • Very fast, and if the software makes use of threading it just gets even faster. Modern computer speeds are absolutely insane, only surpassed by some applications' ability to still respond slowly. – Hobbamok Aug 19 '22 at 08:32
7

There are actually two questions that can be answered:

Why is decompression so much faster than compression?

The key point to understand here is that compression typically involves searching for matching byte sequences, while the compressed format is basically an extremely dense set of instructions on how to recreate the original byte stream. That is, while decompression simply means decoding and following the instructions in the compressed file, one step at a time, compression usually means considering many alternatives and selecting the one that leads to the smallest number of instructions needed for reconstruction.

A simple corollary of this is that decompression throughput is significantly higher for highly compressed data (fewer instructions to decode) than for virtually incompressible data.
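
To make "following the instructions" concrete, here is a toy decoder sketch in C for an invented two-command format (literal runs and copy-from-history references). Real formats such as DEFLATE or LZ4 encode these commands far more compactly, but the shape of the loop is the same: no searching, just decoding and copying.

    /* Toy decompressor for an invented command format (illustration only):
         0x00, n, <n bytes>        emit n literal bytes
         0x01, len, d_lo, d_hi     copy len bytes from `dist` bytes back in the output
       No bounds checking; a real decoder must validate everything it reads. */
    #include <stddef.h>
    #include <stdint.h>

    size_t toy_decompress(const uint8_t *in, size_t in_len, uint8_t *out)
    {
        size_t ip = 0, op = 0;
        while (ip < in_len) {
            uint8_t cmd = in[ip++];
            if (cmd == 0x00) {                          /* literal run */
                uint8_t n = in[ip++];
                for (uint8_t i = 0; i < n; i++)
                    out[op++] = in[ip++];
            } else {                                    /* back-reference */
                uint8_t  len  = in[ip++];
                uint16_t dist = (uint16_t)(in[ip] | (in[ip + 1] << 8));
                ip += 2;
                for (uint8_t i = 0; i < len; i++) {     /* byte by byte so overlapping
                                                           copies (i.e. runs) still work */
                    out[op] = out[op - dist];
                    op++;
                }
            }
        }
        return op;                                      /* decompressed size */
    }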

Why are the standard tools so much faster at compression than a naive implementation?

Now that we know that it is the search for matching byte sequences that consumes most of the time in a naive compressor implementation, it is clear how the standard compressors can be optimized: they achieve their speed by aggressively avoiding searching and considering alternatives. A key part of this is typically some form of hash table, or another quick lookup data structure, that allows matching sequences to be found rather quickly. On top of that, the standard tools employ heuristics for looking up good matches rather than just any matches, allowing them to cut down on the number of matches they consider before committing to one of them. One such heuristic is to select a match as soon as it is "good enough", whatever that means for the tool in question, and to keep searching longer only when the matches considered so far do not meet this criterion.
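
A sketch in C of what such a heuristic can look like: the candidate positions (however the lookup structure produced them) are examined up to a fixed limit, and the search stops as soon as a match is "good enough". The constants are made up for illustration and not taken from any particular tool.

    /* Sketch of heuristic match selection. cand[] holds earlier positions that
       might match buf + pos (e.g. taken from a hash chain); we look at a bounded
       number of them and stop early once a match is long enough. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_CANDIDATES  8      /* never examine more candidates than this */
    #define GOOD_ENOUGH_LEN 32     /* stop searching once a match reaches this length */

    static size_t match_len(const uint8_t *a, const uint8_t *b, size_t max)
    {
        size_t n = 0;
        while (n < max && a[n] == b[n])
            n++;
        return n;
    }

    static size_t pick_match(const uint8_t *buf, size_t pos, size_t end,
                             const size_t *cand, size_t ncand, size_t *best_pos)
    {
        size_t best = 0;
        if (ncand > MAX_CANDIDATES)
            ncand = MAX_CANDIDATES;                 /* cap the amount of searching */
        for (size_t i = 0; i < ncand; i++) {
            size_t len = match_len(buf + cand[i], buf + pos, end - pos);
            if (len > best) {
                best = len;
                *best_pos = cand[i];
            }
            if (best >= GOOD_ENOUGH_LEN)            /* "good enough" -- commit to it */
                break;
        }
        return best;
    }

Raising those limits buys slightly better matches at the cost of more searching, which is roughly what the higher compression levels of the standard tools do.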

As with decompression, each "instruction" in the compressed format requires the compressor to find a (good) match, so compression is also fastest for highly compressible data, as fewer matches need to be found. This effect is also quite visible when benchmarking compression algorithms.

  • An example of this can be found in the description of the xz --extreme option: «Use a slower variant of the selected compression preset level to hopefully get a little bit better compression ratio, but with bad luck this can also make it worse.» You use different algorithms, but there is no certainty you will find the best compression (or even a better one than with the standard options). – Ángel Aug 18 '22 at 23:14
  • I would like to add to "Why is decompression so much faster than compression?" that in most use cases you compress only once, but you decompress many times. Hence it is vital that decompression is fast. If you are producing digital media, you as the artist can afford to run an encoding that is x times slower than real time; however, the consumer will not accept slow or intermittent playback. – Dohn Joe Apr 18 '23 at 07:42
2

In low-level languages, for example C or C++, your code will be directly translated to machine instructions. And then you find that a modern processor can perform a few billion machine instructions per second. Take a statement x = y << (64 - n); which could be translated to (assuming x, y and n are in registers)

move 64 to register tmp
subtract register n from register tmp
move y to register x
shift register x to the left by (register tmp) bits

That will happen in two processor cycles, so at, say, 3 GHz it can be performed 1.5 billion times per second. That's the kind of speed you can get if you use a low-level language. zstd, for example, runs at 136 MB/s according to your table; at 3 GHz that is about 22 cycles per byte, or probably around 44 machine instructions per byte.

And then you can use vector instructions, which can operate on, say, 16 bytes at a time, and you can use multiple cores; I assume that lz4 does that. And the programmer tries out many possible ways to write their code to find the one that runs fastest; they may even have different code for different processors.
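
As a hedged example of the vector-instruction part (x86/SSE2-specific, GCC/Clang builtins assumed, and not code taken from lz4 itself), comparing 16 bytes per instruction while measuring a match length might look like this:

    /* Sketch: measure a match length 16 bytes at a time with SSE2 intrinsics.
       x86-specific; assumes it is safe to read 16 bytes at both pointers. */
    #include <emmintrin.h>          /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    static size_t match_len_sse2(const uint8_t *a, const uint8_t *b, size_t max)
    {
        size_t n = 0;
        while (n + 16 <= max) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + n));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + n));
            unsigned eq = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
            if (eq != 0xFFFF)                       /* some byte differs */
                return n + __builtin_ctz(~eq);      /* index of the first difference */
            n += 16;
        }
        while (n < max && a[n] == b[n])             /* scalar tail */
            n++;
        return n;
    }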

Python, on the other hand, doesn't even know that the variables x, y and n are integers, let alone that they are small integers. It has to do a lot of work before it can even start to perform one of these instructions.

gnasher729
1

I can't code well in C, so I wrote my simple program in Python. Are these speeds only achievable in C?

Yes.

C, and possibly C++, are two of the few languages that get about as close to the hardware as most programmers are able to get (assembly language is not very cross-platform). Aggressively churning through hundreds, thousands, millions, billions or more iterations to convert one long bitstream into a smaller bitstream is right at home there.

These tools are also tuned six ways to Sunday to figure out WHAT they are trying to compress. Is it text? Audio (and in what format)? Video (again, what format: mp4, vp6, something decades older? and what wrapper: avi, mkv, ogg)? Microsoft documents? Or all of these bundled into a tar (or similar) archive that the user did you the service of not gzipping first, which would have made it harder for you to "squeeze juice" out of it? If you are not working on the metal, or close to it, something slow like Python will suffer no matter how well you tune it.

Python, on the other hand, is not that. It is designed for little, short tasks where it doesn't really matter if they are done inefficiently, because they are (relatively) rare. Unlike in C or C++, things like finding some little item on some webpage (oh, and you have to describe how to find it... and it might miss) to populate part of a table (oops, the item is far too wide for that column because it wasn't the proper hit) are difficulties that are already handled for you, with the deal with the devil being speed. For little one-off tasks, speed isn't critical; over millions or trillions of iterations, it will choke.

Then, even if Python could "speed itself up" by summoning the functions of more hardware-centric protocols/methods, it would probably be doing so inefficiently. It would gain on the sub-task, but lose on calling into and returning from the sub-task, with maybe a marginal net gain. Then there is also the "misdiagnosis" case, where an audio block ends up being treated as a video block because Python misidentified it.

Sorry if this is a bit long-winded.

killermist
  • For the most performance-critical parts of performance-critical libraries such as compression libraries, assembly is sometimes chosen. (The rest is still written in C or something similar, to avoid uniformly slow Python code.) – user253751 Aug 16 '22 at 17:22
  • C, C++, Rust, Julia and others compile to native code with good ahead-of-time compilers (like the GCC or LLVM optimizers) and would be suitable for this. Many other languages come very close, having fairly good optimizing JITs, like C# and Java. Even JavaScript has good JITs, but to work with binary data I assume you'd have to be careful what types you used. Python is extremely slow compared to most languages for manual looping over elements of arrays, with the standard implementation being a pure interpreter. – Peter Cordes Aug 16 '22 at 21:46
  • e.g. a naive C program might be close to optimal with optimization enabled, or might be missing out on a factor of 10 or more if manual SIMD vectorization is possible, and/or if better algorithms are possible. But a Python version of the same C program might be 200x slower than the naive C. Not remotely in the same ballpark for looking at bytes one at a time. So there's plenty of room to be close to C and way faster than Python. – Peter Cordes Aug 16 '22 at 21:47
  • Lossless compressors don't generally try to figure out what the content is. E.g. zstd won't detect that it's video and call libx264 functions in lossless mode, even though that could potentially save more bits if your input was uncompressed video data. But if your input was already MP4 with h.264 video, recompressing it isn't going to save anything if you need a bit-exact reconstruction of the MP4 file. Also note the question talks about implementing a simple RLE algorithm, run-length encoding. That can be done even faster than lz4, especially the compression part, since there's no search. – Peter Cordes Aug 16 '22 at 21:52
  • That bit about lossless compression is fair. A number of compression algorithms (lossy and lossless) are specifically tuned for very specific types of bitstreams/formats/data types. FLAC and LAME both exclusively target audio, as an example. In these specialized types of algorithms it is generally conceded that speed takes second priority in favor of disk savings. These developers also know where shortcuts can be taken to increase speed by "trying less hard" to save disk/bandwidth, allowing the user to determine which is more important. General-purpose compressors tend to prioritize speed more. – killermist Aug 17 '22 at 01:36
  • @PeterCordes Python has a JIT implementation in PyPy. I do not know how it compares to JITs in other languages. – qwr Aug 17 '22 at 05:38
  • I don't think it's accurate to say compressors are tuned to figure out what they are trying to compress, or what "misdiagnosis" means in your answer. If a program does this (I'm not aware of any of the usual tools that bother detecting media type), it's usually very simple, or it is known a priori what media type is being worked on. – qwr Aug 17 '22 at 05:41
  • @qwr: Indeed, PyPy is potentially much faster if the JIT is any good for tight loops that use small integers and don't need Python's arbitrary-precision integers. That's why I said "the standard implementation" of Python (CPython) is purely an interpreter; guessing from the OP reporting very low speeds, I was guessing that's what they used. There's also Cython, which translates Python to C (at least in some limited cases?). Not sure how well it performs either; I haven't done much with Python myself. – Peter Cordes Aug 17 '22 at 05:57
1

This is not exactly an answer to your question, but testing zstd against lz4 with a 498,545-byte sample JSON file with 500 records (sjfw500r.json) on my Linux desktop, single-threaded, made me say WHAT!!!

lz4 Ubuntu Desktop Benchmarks (Levels: 1~9):

omer@redus:~/Downloads$ lz4 -b1e9 ./sjfw500r.json
Benchmarking levels from 1 to 9
 1#sjfw500r.json     :    498545 ->    144447 (3.451),2026.3 MB/s ,11339.8 MB/s
 2#sjfw500r.json     :    498545 ->    144447 (3.451),2022.0 MB/s ,11265.8 MB/s
 3#sjfw500r.json     :    498545 ->    115807 (4.305), 232.8 MB/s ,9314.1 MB/s 
 4#sjfw500r.json     :    498545 ->    114851 (4.341), 199.1 MB/s ,9400.1 MB/s 
 5#sjfw500r.json     :    498545 ->    114365 (4.359), 173.9 MB/s ,9482.2 MB/s 
 6#sjfw500r.json     :    498545 ->    114025 (4.372), 149.3 MB/s ,9404.4 MB/s 
 7#sjfw500r.json     :    498545 ->    113849 (4.379), 129.4 MB/s ,9513.5 MB/s 
 8#sjfw500r.json     :    498545 ->    113759 (4.382), 115.1 MB/s ,9535.0 MB/s 
 9#sjfw500r.json     :    498545 ->    113729 (4.384), 108.5 MB/s ,9485.1 MB/s

zstd Ubuntu Desktop Benchmarks (Levels: 1~22):

omer@redus:~/Downloads$ zstd -b1e22 ./sjfw500r.json
 1#sjfw500r.json     :    498545 ->     21223 (23.49),2862.9 MB/s ,11509.3 MB/s 
 2#sjfw500r.json     :    498545 ->     20911 (23.84),3185.9 MB/s ,10995.9 MB/s 
 3#sjfw500r.json     :    498545 ->     20330 (24.52),2287.7 MB/s ,10804.9 MB/s 
 4#sjfw500r.json     :    498545 ->     20329 (24.52),2422.0 MB/s ,10788.9 MB/s 
 5#sjfw500r.json     :    498545 ->     19748 (25.25),1113.6 MB/s ,10794.6 MB/s 
 6#sjfw500r.json     :    498545 ->     19660 (25.36),1070.4 MB/s ,10845.0 MB/s 
 7#sjfw500r.json     :    498545 ->     19299 (25.83), 784.1 MB/s ,11194.8 MB/s 
 8#sjfw500r.json     :    498545 ->     19195 (25.97), 697.6 MB/s ,11267.8 MB/s 
 9#sjfw500r.json     :    498545 ->     19126 (26.07), 605.5 MB/s ,11351.0 MB/s 
10#sjfw500r.json     :    498545 ->     19126 (26.07), 608.6 MB/s ,11370.9 MB/s 
11#sjfw500r.json     :    498545 ->     19126 (26.07), 609.8 MB/s ,11362.1 MB/s 
12#sjfw500r.json     :    498545 ->     19082 (26.13), 537.2 MB/s ,11399.8 MB/s 
13#sjfw500r.json     :    498545 ->     19049 (26.17), 297.9 MB/s ,11441.6 MB/s 
14#sjfw500r.json     :    498545 ->     19049 (26.17), 297.3 MB/s ,11401.5 MB/s 
15#sjfw500r.json     :    498545 ->     19050 (26.17), 272.5 MB/s ,11435.6 MB/s 
16#sjfw500r.json     :    498545 ->     18933 (26.33),  73.5 MB/s ,11578.4 MB/s 
17#sjfw500r.json     :    498545 ->     18723 (26.63),  65.8 MB/s ,10990.3 MB/s 
18#sjfw500r.json     :    498545 ->     18663 (26.71),  49.7 MB/s ,10308.6 MB/s 
19#sjfw500r.json     :    498545 ->     18424 (27.06), 10.13 MB/s ,10155.1 MB/s 
20#sjfw500r.json     :    498545 ->     18424 (27.06), 10.13 MB/s ,10153.4 MB/s 
21#sjfw500r.json     :    498545 ->     18423 (27.06),  8.64 MB/s ,10163.2 MB/s 
22#sjfw500r.json     :    498545 ->     18423 (27.06),  8.34 MB/s ,10164.5 MB/s

So, unlike in the chart in your question, zstd seems to outshine lz4 by a huge margin.

Redu