Convert FP32 to Bfloat16 in C++

Question

How can I convert from float (1-bit sign, 8-bit exponent, 23-bit mantissa) to Bfloat16 (1-bit sign, 8-bit exponent, 7-bit mantissa) in C++?

[`frexp`](https://en.cppreference.com/w/cpp/numeric/math/frexp) can be used to break a `float` down into components. Assembling it back into whatever structure you call `Bfloat16` is left as an exercise for the reader. — Igor Tandetnik, Mar 20 '19 at 03:44
I imagine you want to do this efficiently, since the only reason for such a small floating point format is when you have a very large number of them. I also imagine it needs to do proper rounding. — Mark Ransom, Mar 20 '19 at 03:49
@IgorTandetnik but that would be expensive. Bfloat16 is designed as the top half of float so that you can truncate it easily — phuclv, Mar 20 '19 at 05:02

score 2 · Answer 1 · answered Mar 21 '19 at 23:08

As demonstrated in the answer by Botje, it is sufficient to copy the upper half of the float value, since the bit patterns are the same. However, the way it is done in that answer violates the strict aliasing rules of C++. The way around that is to use `memcpy` to copy the bits.

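#include <cstring>  // std::memcpy

// Assumes tensorflow::bfloat16 is already declared and is 16 bits wide.
static inline tensorflow::bfloat16 FloatToBFloat16(float float_val)
{
    tensorflow::bfloat16 retval;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    // Big endian: the upper half of the float comes first in memory.
    std::memcpy(&retval, &float_val, sizeof(retval));
#else
    // Little endian: the upper half is the last sizeof(retval) bytes.
    std::memcpy(&retval, reinterpret_cast<char *>(&float_val) + sizeof(float_val) - sizeof(retval), sizeof(retval));
#endif
    return retval;
}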

If it's necessary to round the result rather than truncating it, you can multiply by a magic value to push some of those lower bits into the upper bits.

float_val *= 1.001957f;
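
The multiplication above nudges values upward before truncation, but it appears to only approximate rounding, since the amount added depends on the mantissa. A common alternative is to round on the integer bit pattern instead. The following is a minimal sketch of round-to-nearest, ties-to-even, assuming `float` is IEEE 754 binary32. The function name `FloatToBFloat16BitsRounded` is hypothetical, and it returns the raw 16-bit pattern rather than a `tensorflow::bfloat16`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of TensorFlow): returns the bfloat16 bit
// pattern for a float, rounding to nearest with ties going to even.
// Assumes float is IEEE 754 binary32.
inline std::uint16_t FloatToBFloat16BitsRounded(float float_val) {
    std::uint32_t bits;
    std::memcpy(&bits, &float_val, sizeof bits);  // bit copy, no aliasing issue

    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: truncate, and set a mantissa bit so the result stays a NaN
        // even if all surviving payload bits were zero.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }

    // Add 0x7FFF plus the lowest bit that will be kept. Anything above the
    // halfway point carries into the upper half, and an exact tie carries
    // only when the kept value is odd (ties to even).
    std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}
```

Because the work happens on the full 32-bit value with shifts, this sketch does not depend on byte order. Values that round past the largest finite bfloat16 become infinity, which matches the usual IEEE behaviour for round-to-nearest.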

score 2 · Answer 2 · answered Oct 23 '20 at 03:26

memcpy wouldn't compile for me in the little endian case for some reason. This is my solution. I have it as a struct here so that I can easily access the data and run through different ranges of values to confirm that it works properly.

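#include <cstdint>
#include <cstring>
#include <iostream>

struct bfloat16{
   std::uint16_t data = 0;   //raw bits: the upper 16 bits of the float

   //cast to float
   operator float() const{
      //widen before shifting so the sign bit is not shifted into a signed int
      std::uint32_t proc = static_cast<std::uint32_t>(data)<<16;
      float result;
      std::memcpy(&result, &proc, sizeof result);   //bit copy, avoids the aliasing violation
      return result;
   }
   //cast to bfloat16 (truncates the low 16 mantissa bits)
   bfloat16& operator =(float float_val){
      std::uint32_t bits;
      std::memcpy(&bits, &float_val, sizeof bits);
      data = static_cast<std::uint16_t>(bits>>16);
      return *this;
   }
};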

//an example that enumerates every bfloat16 value from 1.0f up to (but not including) 300.0f
using namespace std;

int main(){
   bfloat16 x;
   for(x = 1.0f; x < 300.0f; x.data++){
      cout<<x.data<<" "<<x<<endl;
   }
   
   return 0;
}

Works great :). Do check this answer, which also includes an `operator >>` overload for cin: https://stackoverflow.com/a/56017304/1413259 — wolfram77, May 12 '21 at 13:18

score 1 · Answer 3 · answered Mar 20 '19 at 05:49

From the TensorFlow implementation:

static inline tensorflow::bfloat16 FloatToBFloat16(float float_val) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return *reinterpret_cast<tensorflow::bfloat16*>(
        reinterpret_cast<uint16_t*>(&float_val));
#else
    return *reinterpret_cast<tensorflow::bfloat16*>(
        &(reinterpret_cast<uint16_t*>(&float_val)[1]));
#endif
}

Botje


I think that implementation is flawed because it violates the strict aliasing rules. If you copy the two bytes as individual bytes (e.g. as `unsigned char`s), it'll be right. – Alexey Frunze Mar 20 '19 at 06:07
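
The byte-wise copy suggested in the comment above could look roughly like the sketch below. It relies on the rule that `unsigned char` may alias any object type. The function name `FloatToBFloat16BitsBytewise` is hypothetical, the result is the raw 16-bit pattern rather than a `tensorflow::bfloat16`, and the sketch assumes a 4-byte IEEE 754 `float` and a compiler that defines `__BYTE_ORDER__` (GCC and Clang do).

```cpp
#include <cstdint>

// Hypothetical helper: copies the two most significant bytes of a float
// one byte at a time. Reading and writing through unsigned char pointers
// is allowed for any object type, so this does not violate strict aliasing.
// Assumes sizeof(float) == 4 and that __BYTE_ORDER__ is defined.
inline std::uint16_t FloatToBFloat16BitsBytewise(float float_val) {
    const unsigned char* src =
        reinterpret_cast<const unsigned char*>(&float_val);
    std::uint16_t result;
    unsigned char* dst = reinterpret_cast<unsigned char*>(&result);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    // Big endian: the most significant bytes come first.
    dst[0] = src[0];
    dst[1] = src[1];
#else
    // Little endian: the most significant bytes come last.
    dst[0] = src[2];
    dst[1] = src[3];
#endif
    return result;
}
```

Like the other truncating versions in this thread, this drops the low 16 bits without rounding.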
