
Recently, I have been interested in using bit shifts on floating-point numbers to do some fast calculations.

To make them more generic, I would like my functions to work with different floating-point types, probably through templates: not limited to float and double, but also "half-width" or "quadruple-width" floating-point numbers and so on.


Then I noticed:

 - Half   ---  5 exponent bits  ---  10 significand bits
 - Float  ---  8 exponent bits  ---  23 significand bits
 - Double --- 11 exponent bits  ---  52 significand bits

So far I thought exponent bits = log2(total bytes) * 3 + 2,
which would mean a 128-bit float should have 14 exponent bits, and a 256-bit float should have 17 exponent bits.


However, then I learned:

 - Quad    --- 15 exponent bits  ---  112 significand bits
 - Octuple --- 19 exponent bits  ---  237 significand bits

So, is there a formula to find it at all? Or is there a way to get it through some built-in functions?
C or C++ preferred, but I'm open to other languages.

Thanks.

Ranoiaetep
    Note: No guarantees, at least none from the C++ Standard. What the implementation uses for the encoding of floating point is left up to the implementors, but will probably be defined in the IEEE floating-point standards. IEEE 754 is common. [wiki page for IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) – user4581301 Jun 29 '20 at 04:41
  • What would prevent you from hardcoding the counts directly into your program, given all of them are standardized? – vmt Jun 29 '20 at 04:41
    Anyone can define their own floating point formats and allocate the bits how they see fit. About IEEE-754 formats specifically, there are several good pointers [here](https://stackoverflow.com/questions/40775949/why-do-higher-precision-floating-point-formats-have-so-many-exponent-bits/) about the history and choices. – dxiv Jun 29 '20 at 04:51
  • In IEEE 754 double has 53 significant bits. You're ignoring the hidden bit. – john Jun 29 '20 at 04:54
  • What makes you think you can do these bit twiddling optimizations on floating point better than the compiler can with full retail optimizations turned on? – selbie Jun 29 '20 at 05:14
  • @vmt the reason I want to see if there's a formula is so that if a 512-bit float is put in as standard, it would automatically work with it, without the need of altering anything. – Ranoiaetep Jun 29 '20 at 05:32
    @Ranoiaetep given the rarity of even 256 bit floats, I'd say once you want/need to add support, it should be trivial to hardcode a new exponent bit count for the type in addition to the actual implementation – vmt Jun 29 '20 at 06:06
  • The newest hardware has `bfloat16`, which has only 7 significant bits. There's no logical pattern. – MSalters Jun 29 '20 at 06:54
  • Please note that, even when you get the bit counts right, you don't necessarily get the byte order right as well. If a CPU stores integers in little endian format, there is no guarantee that floats are not stored in big endian, or vice versa. Or mixed endianness. Even if you avoid the UB due to strict aliasing rules, you are going to have implementation defined behavior and will need to check every single architecture that you support. – cmaster - reinstate monica Jun 29 '20 at 12:31
    `DBL_MANT_DIG` provide number of `bits` in a base 2 `double` in the significand. Usable at pre-processor time. – chux - Reinstate Monica Jun 29 '20 at 14:52
  • C and C++ don't require IEEE-754. Other floating-point formats may have different number of bits for the exponent and significand. See [What uncommon floating-point sizes exist in C++ compilers?](https://stackoverflow.com/q/38509009/995714) – phuclv Sep 20 '21 at 10:55

4 Answers


Characteristics Provided Via Built-In Functions

C++ provides this information via the std::numeric_limits template:

#include <iostream>
#include <limits>
#include <cmath>


template<typename T> void ShowCharacteristics()
{
    int radix = std::numeric_limits<T>::radix;

    std::cout << "The floating-point radix is " << radix << ".\n";

    std::cout << "There are " << std::numeric_limits<T>::digits
        << " base-" << radix << " digits in the significand.\n";

    int min = std::numeric_limits<T>::min_exponent;
    int max = std::numeric_limits<T>::max_exponent;

    std::cout << "Exponents range from " << min << " to " << max << ".\n";
    std::cout << "So there must be " << std::ceil(std::log2(max-min+1))
        << " bits in the exponent field.\n";
}


int main()
{
    ShowCharacteristics<double>();
}

Sample output:

The floating-point radix is 2.
There are 53 base-2 digits in the significand.
Exponents range from -1021 to 1024.
So there must be 11 bits in the exponent field.

C also provides the information, via macro definitions like DBL_MANT_DIG defined in <float.h>, but the standard defines the names only for types float (prefix FLT), double (DBL), and long double (LDBL), so the names in a C implementation that supported additional floating-point types would not be predictable.

Note that the exponent as specified in the C and C++ standards is one off from the usual exponent described in IEEE-754: It is adjusted for a significand scaled to [½, 1) instead of [1, 2), so it is one greater than the usual IEEE-754 exponent. (The example above shows the exponent ranges from −1021 to 1024, but the IEEE-754 exponent range is −1022 to 1023.)

Formulas

IEEE-754 does provide formulas for recommended field widths, but it does not require IEEE-754 implementations to conform to these, and of course the C and C++ standards do not require C and C++ implementations to conform to IEEE-754. The interchange format parameters are specified in IEEE 754-2008 3.6, and the binary parameters are:

  • For a floating-point format of 16, 32, 64, or 128 bits, the significand width (including leading bit) should be 11, 24, 53, or 113 bits, and the exponent field width should be 5, 8, 11, or 15 bits.
  • Otherwise, for a floating-point format of k bits, k should be a multiple of 32, the significand width (including the leading bit) should be k − round(4·log2(k)) + 13, and the exponent field width should be round(4·log2(k)) − 13.
Eric Postpischil
    +1 for tracking down the reference. May be worth noting at the last point that the formula is only meant for `k >= 128` (it does in fact match the `11` bits for `k = 64`, too, but it is off by `1` and `2` for `k = 32, 16`). – dxiv Jun 29 '20 at 14:33

I want to see if there's a formula so that if a 512-bit float is put in as standard, it would automatically work with it, without the need of altering anything

I don't know of a published standard that guarantees the bit allocation for future formats (*). Past history shows that several considerations factor into the final choice, see for example the answer and links at Why do higher-precision floating point formats have so many exponent bits?.
(*) EDIT: see note added at the end.

For a guessing game, the existing 5 binary formats defined by IEEE-754 hint that the number of exponent bits grows slightly faster than linear. One (random) formula that fits these 5 data points could be for example (in WA notation) exponent_bits = round( (log2(total_bits) - 1)^(3/2) ).

(Plot: exponent bits vs. total format width, with the interpolation curve passing through the five IEEE-754 binary formats.)

This would foresee that a hypothetical binary512 format would assign 23 bits to the exponent, though of course IEEE is not bound in any way by such second-guesses.

The above is just an interpolation formula that happens to match the 5 known exponents, and it is certainly not the only such formula. For example, searching for the sequence 5,8,11,15,19 on oeis finds 18 listed integer sequences that contain this as a subsequence.


[ EDIT ]   As pointed out in @EricPostpischil's answer, IEEE 754-2008 does in fact list the formula exponent_bits = round( 4 * log2(total_bits) - 13 ) for total_bits >= 128 (the formula actually holds for total_bits = 64, too, though it does not for = 32 or = 16).

The empirical formula above matches the reference IEEE one for 128 <= total_bits <= 1472, in particular IEEE also gives 23 exponent bits for binary512 and 27 exponent bits for binary1024.

dxiv
  • The formula has to be asymptotically linear though, regardless of how it starts out... – Mad Physicist Dec 30 '21 at 00:18
  • @MadPhysicist It has to be *at least* linear for the product of 2 b-bits values to fit into a (b+1)-bit one, which was a rationale quoted in the [linked](https://stackoverflow.com/a/40789013/5538420) answer. Other than that, I don't see a hard requirement that it be strictly linear, unless additional constraints are imposed. – dxiv Dec 30 '21 at 00:57
  • It has to be at most linear asymptotically if you want to have any fractional part left as you go to infinity. – Mad Physicist Dec 30 '21 at 02:03
  • @MadPhysicist I meant linear in `log2(b)`. For example, the empirical formula above has faster than log-linear growth for both exponent and mantissa bits. – dxiv Dec 30 '21 at 02:20

The answer is no.

How many bits to use (or even which representation to use) is decided by compiler implementers and committees. And there's no way to guess what a committee will decide (and no, it's not the "best" solution for any reasonable definition of "best"... it's just what happened that day in that room: a historical accident).

If you really want to get down to that level you need to actually test your code on the platforms you want to deploy to and add in some #ifdef macrology (or ask the user) to find which kind of system your code is running on.

Also beware that in my experience one area in which compilers are extremely aggressive (to the point of being obnoxious) about type aliasing is with floating point numbers.

6502
  • The last paragraph cannot be stressed enough: You cannot manipulate floating point bits without either invoking UB or going through at least one `memcpy()` call, two if you actually need the result. Even `union` is not enough to make the type punning defined. – cmaster - reinstate monica Jun 29 '20 at 12:22

Similar to the concept mentioned above, here's an alternative formula (just re-arranging some terms) that calculates the unsigned integer range of the exponent field ([32, 256, 2048, 32768, 524288], i.e. 2 raised to [5, 8, 11, 15, 19]) without needing to call a rounding function:

uint_range =  ( 64 **  ( 1 + (k=log2(bits)-4)/2) )
              *
              (  2 ** -(  (3-k)**(2<k)         ) )

(a) x ** y means x raised to the power y.
(b) 2 < k is a boolean condition that evaluates to 0 or 1.

The formula is accurate from 16-bit to 256-bit, at least. Beyond that, it yields exponent sizes of

   –  512-bit : 23 
   – 1024-bit : 27 
   – 2048-bit : 31 
   – 4096-bit : 35 

(Beyond 256 bits these may be inaccurate. Even a 27-bit-wide exponent allows exponents of roughly ±67 million, a dynamic range spanning over 40 million decimal orders of magnitude once you raise 2 to those powers.)

From there, getting the IEEE 754 exponent width is just a matter of computing log2(uint_range).