Recently, I have been interested with using bit shiftings on floating point numbers to do some fast calculations.
To make them work in more generic ways, I would like to make my functions work with different floating point types, probably through templates, that is not limited to float and double, but also "halfwidth" or "quadruple width" floating point numbers and so on.
Then I noticed:
- Half --- 5 exponent bits --- 10 signicant bits
- Float --- 8 exponent bits --- 23 signicant bits
- Double --- 11 exponent bits --- 52 signicant bits
So far I thought exponent bits = logbase2(total byte) * 3 + 2,
which means 128bit float should have 14 exponent bits, and 256bit float should have 17 exponent bits.
However, then I learned:
- Quad --- 15 exponent bits --- 112 signicant bits
- Octuple--- 19 exponent bits --- 237 signicant bits
So, is there a formula to find it at all? Or, is there a way to call it through some builtin functions?
C or C++ are preferred, but open to other languages.
Thanks.