Very interesting problem! I might have a partial answer.
To start, I put together a simple C++ demo that reproduces the effect:
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <iomanip>
#include <boost/multiprecision/cpp_dec_float.hpp>
using namespace boost::multiprecision;
template <typename T>
T KahanSum(const std::vector<T>& numbers) {
    T sum = 0.0f;
    T c = 0.0f;  // running compensation for lost low-order bits
    for (const T& num : numbers) {
        T y = num - c;      // apply the compensation to the next term
        T t = sum + y;
        c = (t - sum) - y;  // recover what was lost (or gained) in this addition
        sum = t;
    }
    return sum;
}
int main() {
    // Number of random values to sum
    const int N = 1048576;

    // Mersenne Twister pseudo-random generator producing uniform floats in [0, 1)
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dis(0.0f, 1.0f);

    // Generate the random numbers and store them in a vector
    std::vector<float> numbers(N);
    for (int i = 0; i < N; ++i) {
        numbers[i] = dis(gen);
    }

    // Sums in the original (unsorted) order
    float sum_float = std::accumulate(numbers.begin(), numbers.end(), 0.0f);
    double sum_double = std::accumulate(numbers.begin(), numbers.end(), 0.0);
    auto sum_k = KahanSum(numbers);

    // Reference sum using higher precision (cpp_dec_float_100)
    using high_precision_float = cpp_dec_float_100;
    high_precision_float sum_high_precision =
        std::accumulate(numbers.begin(), numbers.end(), high_precision_float(0));

    // Sums after sorting in ascending order
    std::sort(numbers.begin(), numbers.end());
    float sum_float_sorted = std::accumulate(numbers.begin(), numbers.end(), 0.0f);
    double sum_double_sorted = std::accumulate(numbers.begin(), numbers.end(), 0.0);
    auto sum_k_sorted = KahanSum(numbers);
std::cout << "Sum using float accumulator: " << std::scientific << std::setprecision(30) << sum_float << std::endl;
std::cout << "Sum using float accumulator after sorting: " << std::scientific << std::setprecision(30) << sum_float_sorted << std::endl;
std::cout << "Sum using double accumulator: " << std::scientific << std::setprecision(30) << sum_double << std::endl;
std::cout << "Sum using double accumulator after sorting: " << std::scientific << std::setprecision(30) << sum_double_sorted << std::endl;
std::cout << "Sum using high precision: " << std::scientific << std::setprecision(30) << sum_high_precision << std::endl;
std::cout << "Sum using KahanSum: " << std::scientific << std::setprecision(30) << sum_k << std::endl;
std::cout << "Sum using KahanSum sorted: " << std::scientific << std::setprecision(30) << sum_k_sorted << std::endl;
    // Relative error of the float sums against the high-precision reference
    double float_diff = std::abs(static_cast<double>(sum_float) - static_cast<double>(sum_high_precision));
    double float_sorted_diff = std::abs(static_cast<double>(sum_float_sorted) - static_cast<double>(sum_high_precision));
    std::cout << "Error of float accumulator: " << float_diff / sum_high_precision << std::endl;
    std::cout << "Error of float accumulator after sorting: " << float_sorted_diff / sum_high_precision << std::endl;
std::cout << "Error of float accumulator: " << float_diff/sum_high_precision << std::endl;
std::cout << "Error of float accumulator after sorting: " << float_sorted_diff/sum_high_precision << std::endl;
    return 0;
}
I included an even higher-precision float type to confirm, but a double is actually enough here to serve as the "true" answer. I also included the Kahan summation algorithm. Two things stand out:
- Using the Kahan summation completely removes this effect, and the final answer is more accurate.
- The unsorted sum is smaller than the true answer, and the sorted sum is larger.
Wait what? Larger? I admit I haven't read the floating point spec, but my gut feeling was that adding large_sum + small_term would always produce an underestimate.
Diving into the Kahan summation helps our understanding: we can modify it to just report the per-step summation error, without feeding the compensation back into the values:
template <typename T>
T KahanCheck(const std::vector<T>& numbers) {
    T sum = 0.0;
    for (const T& num : numbers) {
        T y = num;
        T t = sum + y;
        T c = (t - sum) - y;  // per-step rounding error, printed but not compensated
        std::cout << std::scientific << std::setprecision(10) << c << std::endl;
        sum = t;
    }
    return sum;
}
Before today, I would have assumed:
- y can be split into positive high-order and low-order bits: y = y_high + y_low
- Adding: sum + y = sum + y_high + y_low = sum + y_high, as the addition will truncate and ignore the low-order bits y_low
- The compensation: c = (t - sum) - y = (sum + y_high - sum) - y_high - y_low = -y_low will always be negative
However, if you run KahanCheck on the sorted array, you will find that c can be both positive and negative, so clearly this assumption doesn't hold.
As a concrete example, consider the following floats (all decimals shown at 32-bit float precision):
sum = 418.998487293720245361328125000e-05
y = 9.554042480885982513427734375e-05
t = 428.552553057670593261718750000e-05 # computed with float
exact = 428.552529774606227874755859375e-05 # computed exactly
^t is larger!
We can see that the computed sum t actually comes out above what would be the exact answer. So a floating-point sum can be larger than the exact result, despite losing low-order bits.
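If you want to double-check this, here is a minimal standalone snippet (not part of the demo above; it assumes the literals below parse to exactly the floats listed, which they should since all of their decimals are shown):
#include <iostream>
#include <iomanip>
int main() {
    // The two exact float values from the example above (same values, written with exponent e-03)
    float sum = 4.18998487293720245361328125e-03f;
    float y = 9.554042480885982513427734375e-05f;
    float t = sum + y;  // rounded to float precision
    // Both operands fit comfortably in a double's 53-bit significand, so this sum is exact
    double exact = static_cast<double>(sum) + static_cast<double>(y);
    std::cout << std::scientific << std::setprecision(30);
    std::cout << "t     = " << t << std::endl;
    std::cout << "exact = " << exact << std::endl;
    std::cout << "t - exact = " << (static_cast<double>(t) - exact) << std::endl;  // positive: the float addition rounded up
    return 0;
}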
As such, what I think is happening here is that by sorting the data, we bias the summation towards overestimation by limiting how many low-order bits get truncated. I suspect our standard expected behavior of rounding 0.5 up is what gives us this nudge.
You could compute the c terms from KahanCheck, adding up all the positive and negative contributions separately; by sorting, you nudge this balance one way or the other.
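As a sketch of that idea (my own hypothetical variant, not part of the demo above), splitting the per-step errors into separate positive and negative totals could look like this:
#include <iostream>
#include <iomanip>
#include <vector>
// Hypothetical variant of KahanCheck: instead of printing every per-step
// rounding error c, accumulate the positive and negative contributions in
// separate double totals so you can see which side dominates.
template <typename T>
T KahanCheckSplit(const std::vector<T>& numbers) {
    T sum = T(0);
    double positive_c = 0.0;
    double negative_c = 0.0;
    for (const T& num : numbers) {
        T t = sum + num;
        T c = (t - sum) - num;  // per-step rounding error, as in KahanCheck
        if (c > T(0)) positive_c += c;
        else negative_c += c;
        sum = t;
    }
    std::cout << std::scientific << std::setprecision(10)
              << "positive error total: " << positive_c << std::endl
              << "negative error total: " << negative_c << std::endl;
    return sum;
}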
Another thing to learn is that if you want to minimize the numerical inaccuracy, Kahan summation is a far better choice than a costly sort. You could also consider pairwise (recursive) summation, sketched below.
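For reference, here is a minimal sketch of pairwise (recursive) summation (my own illustration, not from the original post); in the demo above you could call PairwiseSum(numbers) next to KahanSum(numbers):
#include <cstddef>
#include <vector>
// Pairwise (recursive) summation sketch: split the range in half, sum each
// half recursively, and add the two partial sums. The worst-case error grows
// roughly with log2(N) rather than N for naive left-to-right accumulation.
template <typename T>
T PairwiseSum(const std::vector<T>& numbers, std::size_t begin, std::size_t end) {
    if (end - begin <= 8) {  // small base case: plain accumulation
        T s = T(0);
        for (std::size_t i = begin; i < end; ++i) {
            s += numbers[i];
        }
        return s;
    }
    std::size_t mid = begin + (end - begin) / 2;
    return PairwiseSum(numbers, begin, mid) + PairwiseSum(numbers, mid, end);
}
template <typename T>
T PairwiseSum(const std::vector<T>& numbers) {
    return PairwiseSum(numbers, 0, numbers.size());
}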
I have attempted to verify my "rounding up" speculation with many varied experiments, but I have been unable to do so. I think my theory is unfortunately simply wrong. A few more observations:
Experimenting with ranges other than [0,1) greatly affects things:
- At [0,100) the errors are reversed; sorting does increase accuracy!
- At [0.1,1) the errors are the same, sorting or not.
So in my mind any satisfactory explanation would need to account for the distribution [0,1).
Comments:
- "… floats in $(0,1)$ and summed them in given order using a float and a _Quad accumulator. If by "no-operation" we mean that the accumulator did not increase when adding a number, I am seeing 10773 of those (close to your result). However, with the float accumulator I am seeing a relative error of 7.7e-6 in the sum, much larger than shown in the sample output in the question. I used /fp:strict (equiv. Linux: -fp-model=strict) when building with the Intel classic compiler." – njuffa Mar 21 '24 at 22:53
- "… floats in a float accumulator. Based on that, the relative error of 1.12e-6 reported by the test output above seems unusually low, something that warrants further investigation." – njuffa Mar 21 '24 at 23:08
- "… floats uniformly distributed in $(0,1)$ increases, the relative error of their sum accumulated in a float increases much faster when sorted in ascending order than when using the given order. For $N = 2^{24}$ I observe relative error of 4.17e-2 when accumulating in a float in ascending order, with either PRNG, while relative error when accumulating in native order is of order 1e-5." – njuffa Mar 22 '24 at 00:10
- "Yes, a no-operation is when the sum does not change." – lucmobz Mar 22 '24 at 06:19