It is not clear how PTX is compressed in a fatbinary. I'm doing some research, and from looking at the binary it seems to be some sort of LZ77 (or LZSS?) compression. I've prepared some tests:

Small PTX. A simple vec_add (poorly implemented): Input, Output and CUDA Source.

Larger PTX. lavaMD from Rodinia Benchmarks: Input, Output and CUDA Source.

Note: This question is the same as the one in the following thread on NVIDIA Developers Forum (I am the author of the thread).

fusiled
  • Is your question, "How is PTX encoded in a fatbinary?" It is best to clearly ask your question when posting so that those responding know what to answer. – dingo_kinznerhook Oct 16 '18 at 14:06
  • @dingo_kinznerhook Done. Thank you for your suggestion. – fusiled Oct 16 '18 at 14:10
  • "I'm doing some research and by looking at the binary it seems a sort of LZ77 compression." Which binary are you looking at? vec_add.fatbin is clearly not compressed. – julian Oct 16 '18 at 14:14
  • @SYS_V The binary itself is not compressed, but the PTX code inside it is. The PTX code is saved as a string inside the fatbin. You can try to download the second link and open it with a text editor. You can see pieces of strings from the PTX, but it is not in plaintext form. – fusiled Oct 16 '18 at 14:18
  • encoding != compression. Are you able to provide larger samples (on the order of kilobytes or megabytes in size)? The tiny amount of data in the file provided is insufficient for meaningful analysis. – julian Oct 16 '18 at 14:26
  • @SYS_V I've added a larger example. I can try to cook up a bigger one. I think that the maximum PTX size is 2MB. – fusiled Oct 16 '18 at 15:10

1 Answer

I know this question is rather old, but I recently also needed to understand the compression of CUDA fatbinaries. It does indeed seem to be a compression scheme somewhat similar to LZ77. I wrote the following code, which seems to decompress the actual text sections of compressed fatbinaries.

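#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

size_t decompress(const uint8_t* input, size_t input_size, uint8_t* output, size_t output_size)
{
    size_t ipos = 0, opos = 0;
    size_t next_nclen;   // length of next non-compressed segment
    size_t next_clen;    // length of next compressed segment
    size_t back_offset;  // negative offset where redundant data is located, relative to current opos

    while (ipos < input_size) {
        // High nibble: non-compressed length; low nibble: compressed length minus 4
        next_nclen = (input[ipos] & 0xf0) >> 4;
        next_clen = 4 + (input[ipos] & 0xf);
        ipos++;
        if (next_nclen == 0xf) {
            // Assumed to extend like next_clen below: keep adding bytes while they are 0xff
            do {
                if (ipos >= input_size) {
                    fprintf(stderr, "Error: truncated input\n");
                    return 0;
                }
                next_nclen += input[ipos++];
            } while (input[ipos - 1] == 0xff);
        }

        // Copy the non-compressed segment verbatim, refusing to run past either buffer
        if (next_nclen > input_size - ipos || next_nclen > output_size - opos) {
            fprintf(stderr, "Error copying data\n");
            return 0;
        }
        memcpy(output + opos, input + ipos, next_nclen);
        ipos += next_nclen;
        opos += next_nclen;
        if (ipos >= input_size || opos >= output_size) {
            break;
        }

        // 16-bit little-endian back offset
        if (input_size - ipos < 2) {
            fprintf(stderr, "Error: truncated input\n");
            return 0;
        }
        back_offset = input[ipos] + ((size_t)input[ipos + 1] << 8);
        ipos += 2;
        if (next_clen == 0xf + 4) {
            do {
                if (ipos >= input_size) {
                    fprintf(stderr, "Error: truncated input\n");
                    return 0;
                }
                next_clen += input[ipos++];
            } while (input[ipos - 1] == 0xff);
        }

        if (back_offset == 0 || back_offset > opos || next_clen > output_size - opos) {
            fprintf(stderr, "Error copying data\n");
            return 0;
        }
        if (next_clen <= back_offset) {
            // Source and destination do not overlap
            memcpy(output + opos, output + opos - back_offset, next_clen);
        } else {
            // Overlapping ranges: copy byte by byte so the pattern repeats
            for (size_t i = 0; i < next_clen; i++) {
                output[opos + i] = output[opos + i - back_offset];
            }
        }
        opos += next_clen;
    }
    return opos;
}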
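
The else branch in the code above covers segments that are longer than their back offset, so the source range overlaps the bytes being written. A minimal sketch of why that case is copied byte by byte rather than with a single memcpy (whose behavior is undefined for overlapping ranges); `copy_match` is a hypothetical helper name used only for illustration, not part of the code above:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: copy `len` bytes to `opos` from `back_offset` bytes
// behind it, one byte at a time, so that bytes written earlier in this same
// copy can be read again later in it.
static void copy_match(uint8_t* output, size_t opos, size_t back_offset, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        output[opos + i] = output[opos + i - back_offset];
    }
}
```

With an output buffer that already holds "ab", calling `copy_match(buf, 2, 2, 6)` repeats the two-byte pattern and leaves "abababab" in the buffer, which is how a short back offset can encode a long run.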

I am no compression expert, but I think this is a variant of LZ4 compression. There is some more code related to decoding the fatbinary headers here: https://github.com/n-eiling/cuda-fatbin-decompression. The output is bit-identical to what nvcc produces with the --no-compress flag.
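
To inspect a compressed section by hand, it can help to decode just the header of a single sequence without producing any output. The following is a rough sketch based only on the layout the decompress function reads; `sequence_t` and `parse_sequence` are hypothetical names, and the multi-byte extension of the literal length is an assumption borrowed from LZ4, not something confirmed for fatbinaries:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical description of one sequence in the compressed stream.
typedef struct {
    size_t literal_len;  // bytes copied verbatim from the input
    size_t match_len;    // bytes copied from earlier output (0 if absent)
    size_t back_offset;  // distance back into the output for the match
} sequence_t;

// Parse one sequence starting at in[0]. Returns the number of input bytes
// the sequence occupies, or 0 if the input is truncated.
static size_t parse_sequence(const uint8_t* in, size_t n, sequence_t* seq)
{
    size_t pos = 0;
    uint8_t b;
    if (n == 0) return 0;

    // Token byte: high nibble = literal length, low nibble = match length - 4
    uint8_t token = in[pos++];
    seq->literal_len = token >> 4;
    seq->match_len = 4 + (size_t)(token & 0x0f);
    seq->back_offset = 0;

    if (seq->literal_len == 0x0f) {
        // Assumption: extension bytes continue while they equal 0xff
        do {
            if (pos >= n) return 0;
            b = in[pos++];
            seq->literal_len += b;
        } while (b == 0xff);
    }
    if (seq->literal_len > n - pos) return 0;
    pos += seq->literal_len;  // skip over the literal bytes

    if (pos == n) {
        // The last sequence carries literals only
        seq->match_len = 0;
        return pos;
    }
    if (n - pos < 2) return 0;
    seq->back_offset = (size_t)in[pos] | ((size_t)in[pos + 1] << 8);
    pos += 2;

    if (seq->match_len == 0x0f + 4) {
        do {
            if (pos >= n) return 0;
            b = in[pos++];
            seq->match_len += b;
        } while (b == 0xff);
    }
    return pos;
}
```

For the five-byte stream `22 61 62 02 00`, this reports 2 literal bytes ("ab"), a match length of 6 and a back offset of 2, consuming all 5 bytes.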

nee