How is PTX compressed in a fatbinary?

Question

It is not clear how PTX is compressed in a fatbinary. I'm doing some research, and from looking at the binary it appears to be some variant of LZ77 (or LZSS?). I've prepared some tests:

Small PTX. A simple vec_add (poorly implemented): Input, Output and CUDA Source.

Larger PTX. lavaMD from Rodinia Benchmarks: Input, Output and CUDA Source

Note: This question is the same as the one in the following thread on the NVIDIA Developers Forum (I am the author of the thread).

Is your question, "How is PTX encoded in a fatbinary?" It is best to clearly ask your question when posting so that those responding know what to answer. — dingo_kinznerhook, Oct 16 '18 at 14:06
"I'm doing some research and by looking at the binary it seems a sort of LZ77 compression." Which binary are you looking at? vec_add.fatbin is clearly not compressed. — julian, Oct 16 '18 at 14:14
@SYS_V The binary is not compressed, but the PTX code inside it is. The PTX code is saved as a string inside the fatbin. You can try downloading the second link and opening it with a text editor. You can see fragments of strings from the PTX, but it is not in plaintext form. — fusiled, Oct 16 '18 at 14:18
encoding != compression. Are you able to provide larger samples (on the order of kilobytes or megabytes in size)? The tiny amount of data in the file provided is insufficient for meaningful analysis. — julian, Oct 16 '18 at 14:26
@SYS_V I've updated a larger example. I can try to cook a bigger one. I think that PTX maximum size is 2MB — fusiled, Oct 16 '18 at 15:10

nee · Answer 1 · 2023-04-27T09:40:03.103

I know this question is rather old, but I recently also needed to understand the compression of CUDA fatbinaries. It does indeed seem to be a compression scheme somewhat similar to LZ77. I wrote the following code, which seems to decompress the actual text sections of compressed fatbinaries.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

size_t decompress(const uint8_t* input, size_t input_size, uint8_t* output, size_t output_size)
{
    size_t ipos = 0, opos = 0;
    size_t next_nclen;   // length of next non-compressed (literal) segment
    size_t next_clen;    // length of next compressed (back-referenced) segment
    size_t back_offset;  // distance back from opos where the redundant data is located

    while (ipos < input_size) {
        // Token byte: high nibble = literal length, low nibble = match length - 4.
        uint8_t token = input[ipos++];
        next_nclen = (token & 0xf0) >> 4;
        next_clen = 4 + (token & 0xf);
        if (next_nclen == 0xf) {
            // Only one extension byte is read here; standard LZ4 would keep reading while it is 0xff.
            if (ipos >= input_size) {
                fprintf(stderr, "Error: truncated input\n");
                return 0;
            }
            next_nclen += input[ipos++];
        }
        if (next_nclen > input_size - ipos || next_nclen > output_size - opos) {
            fprintf(stderr, "Error copying data\n");
            return 0;
        }
        memcpy(output + opos, input + ipos, next_nclen);
        ipos += next_nclen;
        opos += next_nclen;
        if (ipos >= input_size || opos >= output_size) {
            break;
        }

        if (input_size - ipos < 2) {
            fprintf(stderr, "Error: truncated input\n");
            return 0;
        }
        back_offset = input[ipos] + ((size_t)input[ipos + 1] << 8);  // 16-bit little-endian
        ipos += 2;
        if (next_clen == 0xf + 4) {
            do {
                if (ipos >= input_size) {
                    fprintf(stderr, "Error: truncated input\n");
                    return 0;
                }
                next_clen += input[ipos++];
            } while (input[ipos - 1] == 0xff);
        }
        if (back_offset == 0 || back_offset > opos || next_clen > output_size - opos) {
            fprintf(stderr, "Error copying data\n");
            return 0;
        }
        if (next_clen <= back_offset) {
            memcpy(output + opos, output + opos - back_offset, next_clen);
        } else {
            // Source and destination overlap: copy byte by byte so the pattern repeats.
            for (size_t i = 0; i < next_clen; i++) {
                output[opos + i] = output[opos + i - back_offset];
            }
        }
        opos += next_clen;
    }
    return opos;
}

I am no compression expert, but I think this is a variant of LZ4 compression. There is some more code for decoding the fatbinary headers here: https://github.com/n-eiling/cuda-fatbin-decompression. The output is bit-identical to what nvcc produces with the --no-compress flag.
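
To size the output buffer before decompressing, the same token stream can be walked without copying anything. The following is only a sketch under the assumption that the layout is exactly what the decoder above implements (one token byte, an optional single literal-length extension byte, the literals, a 16-bit little-endian back offset, and optional match-length extension bytes). The name decompressed_size is hypothetical and is not part of the linked repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds up literal and match lengths without writing
 * any output. Returns the decompressed size, or 0 if the stream is
 * truncated or a back offset points before the start of the output
 * (so an empty stream and an invalid stream both yield 0). */
size_t decompressed_size(const uint8_t *input, size_t input_size)
{
    size_t ipos = 0, total = 0;

    assert(input != NULL || input_size == 0);

    while (ipos < input_size) {
        uint8_t token = input[ipos++];
        size_t literal_len = token >> 4;                /* high nibble */
        size_t match_len = 4 + (size_t)(token & 0x0f);  /* low nibble + 4 */

        if (literal_len == 0x0f) {
            /* Assumption taken from the decoder above: one extension byte. */
            if (ipos >= input_size) return 0;
            literal_len += input[ipos++];
        }
        if (literal_len > input_size - ipos) return 0;
        ipos += literal_len;
        total += literal_len;

        if (ipos >= input_size) break;  /* final token carries no match part */

        if (input_size - ipos < 2) return 0;
        size_t back_offset = (size_t)input[ipos] | ((size_t)input[ipos + 1] << 8);
        ipos += 2;

        if (match_len == 0x0f + 4) {
            uint8_t ext;
            do {
                if (ipos >= input_size) return 0;
                ext = input[ipos++];
                match_len += ext;
            } while (ext == 0xff);
        }
        if (back_offset == 0 || back_offset > total) return 0;
        total += match_len;
    }
    return total;
}
```

For instance, the 8-byte stream 31 61 62 63 03 00 10 64 (three literals "abc", a 5-byte match at offset 3, then one literal "d") is reported as 9 bytes, which corresponds to the text "abcabcabd".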
