
I am exploring Higham's excellent Accuracy and Stability of Numerical Algorithms and chapter 4 is dedicated to summation.

So I decided to test the most basic thing: summing positive random numbers uniformly distributed between $0$ and $1$ (exactly $2^{20}$ of them). I was expecting that sorting them first (in increasing order, of course) would give me a better relative error. I am also counting how many no-operations there are: with the sorted version I get 0, while I get thousands with the unsorted version. However, this is not the case: many times the unsorted version gives a smaller error by $1$-$2$ orders of magnitude (despite the many no-operations). Is this normal? Is there any clear manufactured example where this is the case?

I am accumulating IEEE floats (C++) for the approximate computations, while computing the "exact" solution in a 16-byte long double (casting each float to long double and accumulating into a long double result).
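Roughly, the experiment looks like the following sketch (a simplified version, not my exact code; the seed is arbitrary and the sorted variant is the commented-out line):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 20;
    std::mt19937 rng(12345);                        // arbitrary seed, for illustration only
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    std::vector<float> xs(n);
    for (auto& x : xs) x = dist(rng);
    // std::sort(xs.begin(), xs.end());             // enable for the sorted variant

    float sum = 0.0f;                               // recursive summation in float
    long double exact = 0.0L;                       // reference sum in long double
    std::size_t noops = 0;
    for (float x : xs) {
        float next = sum + x;
        if (next == sum) ++noops;                   // the addition changed nothing
        sum = next;
        exact += static_cast<long double>(x);
    }

    std::printf("Relative error: %Le\n", std::abs((static_cast<long double>(sum) - exact) / exact));
    std::printf("Noops: %zu\n", noops);
}

One run gives: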

test_recursive_sum
Relative error:
1.1244650891750177666438375817657311598468661273386715038213878870e-06
Noops:
10974
test_sort_recursive_sum
Relative error:
1.3188994459459007621524197348059082024462895788019523024559020996e-04
Noops:
0
Sample:
1048576
lucmobz
  • I am having trouble reproducing the data from the question. I used the MRG32k3a PRNG to generate $2^{20}$ random floats in $(0,1)$ and summed them in the given order using a float and a _Quad accumulator. If by "no-operation" we mean that the accumulator did not increase when adding a number, I am seeing 10773 of those (close to your result). However, with the float accumulator I am seeing a relative error of 7.7e-6 in the sum, much larger than shown in the sample output in the question. I used /fp:strict (equiv. Linux: -fp-model=strict) when building with the Intel classic compiler. – njuffa Mar 21 '24 at 22:53
  • I replaced MRG32k3a with Marsaglia's 64-bit KISS generator, and see a relative error of 6.9e-6 (and 10940 no-operations) when accumulating $2^{20}$ floats in a float accumulator. Based on that, the relative error of 1.12e-6 reported by the test output above seems unusually low, something that warrants further investigation. – njuffa Mar 21 '24 at 23:08
  • I can confirm the general observation: as the number $N$ of floats uniformly distributed in $(0,1)$ increases, the relative error of their sum accumulated in a float increases much faster when sorted in ascending order than when using the given order. For $N = 2^{24}$ I observe a relative error of 4.17e-2 when accumulating in a float in ascending order, with either PRNG, while the relative error when accumulating in native order is of order 1e-5. – njuffa Mar 22 '24 at 00:10
  • T. G. Robertazzi and S. C. Schwartz, "Best “Ordering” for Floating-Point Addition," ACM Transactions on Mathematical Software, Vol. 14, No. 1, Mar. 1988, pp. 101-110, looked into the question of the accuracy of summing positive finite-precision floating-point numbers. According to Table 1, for data with a uniform distribution in $(0, 2\mu)$, where $\mu$ is the average, summing the data in increasing order is more accurate than the given order, which in turn is more accurate than decreasing order. Quote: "Adding the numbers in accordance with increasing magnitude is the best policy." – njuffa Mar 22 '24 at 00:36
  • Nicholas J. Higham, "The Accuracy of Floating-Point Summation," SIAM J. Sci. Comput., Vol. 14, No. 4, Jul. 1993, pp. 783-799, experimentally evaluates the case of uniform random distribution on $[1,2]$ but only considers incr. and decr. ordering. He further cites Robertazzi & Schwartz: "The results show that for recursive summation the ordering affects only the constant in the mean square error, with the increasing ordering having the smallest constant and the decreasing ordering the largest; since the $x_{i}$ are nonnegative, this is precisely the ranking given by the rounding error bound" – njuffa Mar 22 '24 at 02:59
  • Not sure if I am allowed to post code, but this is what I am using to generate the sequence:

    std::mt19937 rng;
    std::uniform_real_distribution<float> dist{0, 1};

    Yes, a no-operation is when the sum does not change. – lucmobz Mar 22 '24 at 06:19
  • No idea what your background is, but if it is in academia, consider making this into a research project and publishing the results. – njuffa Mar 22 '24 at 06:35
  • @njuffa Have you tried with true random numbers like the ones in https://www.rngresearch.com/download/ ? – Federico Poloni Mar 22 '24 at 08:33
  • I am not in academia. I will try more tests with random numbers. I am thinking of simply getting integers from that suggested link and dividing them by the max to get floats between 0 and 1. – lucmobz Mar 22 '24 at 11:04
  • @Federico Poloni 2**20 nums in [0,1), native order: block0.rng sumf=524538.87500 sumq=524541.37541850912 relerr=-4.7669e-6; block1.rng sumf=524362.12500 sumq=524366.17823627125 relerr=-7.7298e-6; block2.rng sumf=524582.06250 sumq=524587.48543693894 relerr=-1.0338e-5; block3.rng sumf=524138.56250 sumq=524130.98043318139 relerr= 1.4466e-5; block4.rng sumf=523751.93750 sumq=523747.13998567569 relerr= 9.1600e-6; block5.rng sumf=524158.03125 sumq=524156.47622591048 relerr= 2.9667e-6; block6.rng sumf=523836.03125 sumq=523852.86839615623 relerr=-3.2141e-5; Hypothesis busted :-) – njuffa Mar 22 '24 at 23:04

1 Answer


Very interesting problem! I might have a partial answer.

To start, I wrote a simple C++ demo that reproduces the effect:

#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <iomanip>

#include <boost/multiprecision/cpp_dec_float.hpp>

using namespace boost::multiprecision;

// Kahan (compensated) summation
template <typename T>
T KahanSum(const std::vector<T>& numbers) {
    T sum = 0.0f;
    T c = 0.0f;
    for (const T& num : numbers) {
        T y = num - c;
        T t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}

int main() {
    // Define the number of random numbers
    const int N = 1048576;

    // Create a Mersenne Twister pseudo-random generator
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dis(0.0f, 1.0f);

    // Generate random numbers and store them in a vector
    std::vector<float> numbers(N);
    for (int i = 0; i < N; ++i) {
        numbers[i] = dis(gen);
    }

    float sum_float = std::accumulate(numbers.begin(), numbers.end(), 0.0f);
    double sum_double = std::accumulate(numbers.begin(), numbers.end(), 0.0);
    auto sum_k = KahanSum(numbers);

    // Calculate sum using higher precision (cpp_dec_float_100)
    using high_precision_float = cpp_dec_float_100;
    high_precision_float sum_high_precision =
        std::accumulate(numbers.begin(), numbers.end(), high_precision_float(0));

    std::sort(numbers.begin(), numbers.end());

    float sum_float_sorted = std::accumulate(numbers.begin(), numbers.end(), 0.0f);
    double sum_double_sorted = std::accumulate(numbers.begin(), numbers.end(), 0.0);
    auto sum_k_sorted = KahanSum(numbers);

    std::cout << std::scientific << std::setprecision(30);
    std::cout << "Sum using float accumulator:                " << sum_float << std::endl;
    std::cout << "Sum using float accumulator after sorting:  " << sum_float_sorted << std::endl;
    std::cout << "Sum using double accumulator:               " << sum_double << std::endl;
    std::cout << "Sum using double accumulator after sorting: " << sum_double_sorted << std::endl;
    std::cout << "Sum using high precision:                   " << sum_high_precision << std::endl;
    std::cout << "Sum using KahanSum:                         " << sum_k << std::endl;
    std::cout << "Sum using KahanSum after sorting:           " << sum_k_sorted << std::endl;

    double float_diff        = std::abs(static_cast<double>(sum_float)        - static_cast<double>(sum_high_precision));
    double float_sorted_diff = std::abs(static_cast<double>(sum_float_sorted) - static_cast<double>(sum_high_precision));
    std::cout << "Relative error of float accumulator:               " << float_diff / sum_high_precision << std::endl;
    std::cout << "Relative error of float accumulator after sorting: " << float_sorted_diff / sum_high_precision << std::endl;

    return 0;
}

I included an even higher-precision float type to confirm, but double is actually enough here for a "true" answer. I also included the Kahan summation algorithm.

  1. Using the Kahan summation completely removes this effect, and the final answer is more accurate.
  2. The unsorted sum is smaller than the true answer, and the sorted sum is larger.

Wait what? Larger? I admit I haven't read the floating point spec, but my gut feeling was that adding large_sum + small_term would always produce an underestimate.

Diving into the Kahan summation helps our understanding: we can modify it to just report the summation error at each step, without modifying the values:

template <typename T>
T KahanCheck(const std::vector<T>& numbers) {
    T sum = 0.0;
    for (const T& num : numbers) {
        T y = num;
        T t = sum + y;
        T c = (t - sum) - y;   // rounding error introduced by this addition
        std::cout << std::scientific << std::setprecision(10) << c << std::endl;
        sum = t;
    }
    return sum;
}

Before today, I would have assumed:

  1. y can be split into positive high-order and low-order parts: y = y_high + y_low
  2. Adding sum + y = sum + y_high + y_low = sum + y_high, as the addition will truncate and ignore the low-order bits y_low
  3. The compensation c = (t - sum) - y = (sum + y_high - sum) - y_high - y_low = -y_low will always be negative.

However, running KahanCheck on the sorted array, you will find that c can be both positive and negative, so clearly this assumption doesn't hold.

As a concrete example, consider the following floats (all decimals shown for 32-bit float precision):

sum   = 418.998487293720245361328125000e-05
y     =   9.554042480885982513427734375e-05
t     = 428.552553057670593261718750000e-05 # computed with float
exact = 428.552529774606227874755859375e-05 # computed exactly
                ^t is larger!

we can see that the computed sum t actually overshoots the exact answer. So a floating-point sum can come out larger, despite losing low-order bits.
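The same overshoot can be reproduced with a deliberately constructed pair of values (my own toy example, not taken from the data above), where round-to-nearest picks the larger neighbour:

#include <cstdio>

int main() {
    // a = 1 and b = 3 * 2^-25: the exact sum 1 + 1.5 * 2^-24 lies between the
    // adjacent floats 1 and 1 + 2^-23, and round-to-nearest picks the larger one.
    float a = 1.0f;
    float b = 3.0f / 33554432.0f;          // 3 * 2^-25, exactly representable as a float
    float t = a + b;                       // rounded float addition
    double exact = (double)a + (double)b;  // exact here, since both terms fit in a double
    std::printf("float sum: %.17g\n", (double)t);
    std::printf("exact sum: %.17g\n", exact);
    std::printf("float sum %s exact sum\n", (double)t > exact ? ">" : "<=");
}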

As such, what I think is happening here is that by sorting the data, we bias the summation towards overestimation by limiting how many low-order bits get truncated. I suspect that our standard expected behavior of rounding 0.5 up is what gives us this nudge.

You could compute the c terms from KahanCheck, add up the positive and negative contributions separately, and see how sorting nudges this balance one way or the other.
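For instance, a small variant of KahanCheck (my own sketch, using the same sign convention for c) could aggregate the per-step errors instead of printing them:

// Aggregate the per-step errors c = (t - sum) - num from KahanCheck:
// positive c means this addition overshot the exact partial sum,
// negative c means low-order bits were lost.
template <typename T>
void ErrorBudget(const std::vector<T>& numbers) {
    T sum = 0;
    double pos = 0.0, neg = 0.0;   // signed step errors, accumulated in double
    for (const T& num : numbers) {
        T t = sum + num;
        T c = (t - sum) - num;     // approximate rounding error of this addition
        if (c > 0) pos += c; else neg += c;
        sum = t;
    }
    std::cout << "sum of positive errors: " << pos
              << ", sum of negative errors: " << neg
              << ", net: " << (pos + neg) << std::endl;
}

Calling ErrorBudget(numbers) before and after sorting shows how the positive and negative contributions balance differently in the two orderings.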

Another thing to learn is that if you want to minimize the numerical inaccuracy, Kahan summation is a far better option than a costly sort. You could also consider pairwise (recursive) summation.
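For completeness, here is what a pairwise (recursive) summation could look like (my own sketch, not code from the question or from Higham):

// Pairwise summation: split the range in half, sum each half recursively, and
// add the two partial sums. The worst-case error bound grows roughly like
// log2(n) rounding errors instead of n for plain recursive summation.
template <typename T>
T PairwiseSum(const T* data, std::size_t n) {
    if (n <= 8) {                  // small base case: plain recursive sum
        T s = 0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    std::size_t half = n / 2;
    return PairwiseSum(data, half) + PairwiseSum(data + half, n - half);
}

// Usage: float s = PairwiseSum(numbers.data(), numbers.size());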


I have attempted to verify my "rounding up" speculation with many varied experiments, but I have been unable to do so. I think my theory is unfortunately simply wrong. A few more observations:

Experimenting with ranges other than [0,1) shows that the range greatly affects things:

  1. At [0,100) the errors are reversed; sorting does increase accuracy!
  2. At [0.1,1) the errors are the same, sorting or not.

So in my mind any satisfactory explanation would need to account for the distribution [0,1).
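If anyone wants to reproduce these range experiments, here is a small parameterized sketch (the seed, the double reference, and the function name are my own choices):

// Compare sorted vs. unsorted float accumulation over an arbitrary range [lo, hi).
void RangeTest(float lo, float hi, int n = 1 << 20) {
    std::mt19937 gen(0);
    std::uniform_real_distribution<float> dis(lo, hi);
    std::vector<float> xs(n);
    for (auto& x : xs) x = dis(gen);

    double ref = std::accumulate(xs.begin(), xs.end(), 0.0);   // double reference sum
    float unsorted = std::accumulate(xs.begin(), xs.end(), 0.0f);
    std::sort(xs.begin(), xs.end());
    float sorted = std::accumulate(xs.begin(), xs.end(), 0.0f);

    std::cout << std::scientific
              << "[" << lo << "," << hi << "): "
              << "unsorted relerr " << std::abs(unsorted - ref) / ref
              << ", sorted relerr " << std::abs(sorted - ref) / ref << std::endl;
}

// e.g. RangeTest(0.0f, 1.0f); RangeTest(0.0f, 100.0f); RangeTest(0.1f, 1.0f);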

Mikael Öhman
  • Summing a small term into a larger one can go both ways: 1000 + 11 to 3 significant digits gives 1010 (rounding down), while 1000 + 19 would give 1020 (rounding up). I am not understanding from your answer how sorting could bias the sum to over/underestimate, but it seems like a good explanation. – lucmobz Mar 27 '24 at 16:01
  • Re "I suspect that our standard expected behavior of rounding 0.5 up is what gives us this nudge": this could be empirically verified or falsified with the help of a compiler that supports setting the floating-point rounding mode (I know that the Intel classic compilers support it). The hypothesis offered ("by sorting the data, we bias the summation towards overestimation by limiting how many low-order bits get truncated") looks like a good start to a conclusive explanation. Appropriately labeled as a partial answer: +1. – njuffa Mar 27 '24 at 20:03