Why does iteration over an inclusive range generate longer assembly in Rust?

Question

These two loops are equivalent in C++ and Rust:

#include <cstdint>
uint64_t sum1(uint64_t n) {  
    uint64_t sum = 0;
    for (uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for j in 0u64..=num {
        sum += 1;
    }
    return sum;
}

However the C++ version generates a very terse assembly:

sum1(unsigned long):                               # @sum1(unsigned long)
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

while Rust's version is very long with two checks in the loop instead of one:

example::sum1:
        xor     eax, eax
        xor     ecx, ecx
.LBB0_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, 1
        cmp     rdx, rdi
        jae     .LBB0_3
        cmp     rcx, rdi
        jbe     .LBB0_1
.LBB0_3:
        ret

Godbolt: https://godbolt.org/z/xYW94qxjK

What is Rust intrinsically trying to prevent that C++ is carefree about?

I think it's something about inclusive ranges... I don't know the details, but I saw it mentioned recently. Try seeing what happens if you change the Rust loop to `for j in 0..num+1` — Herohtar, Jan 11 '22 at 19:47
@Herohtar Well that optimizes to a closed formula and no loop. Same happens with C++ as well. — , Jan 11 '22 at 19:51
@user3840170 Sorry for the noob question (c++ dev transitioning to Rust apologies) but is there a way to constrain a variable like with C++ `__builtin_assume( num<100);`? — , Jan 11 '22 at 19:59
@Jellyboy There's [`core::intrinsics::assume`](https://doc.rust-lang.org/stable/std/intrinsics/fn.assume.html), but it's perma-unstable. — Chayim Friedman, Jan 11 '22 at 20:00
@Jellyboy On stable, you can do `if num < 100 { unsafe { core::hint::unreachable_unchecked(); } }`. — Chayim Friedman, Jan 11 '22 at 20:07
@Jellyboy: For just seeing what a compiler will do with a loop, it's often sufficient to do `num &= 0xffff` or a right-shift or something; the extra AND or MOVZX instruction will be outside the loop. (And as we saw on a question a couple days ago about sum(1..n), some clang optimizer passes don't benefit from the value-range info that `__builtin_assume( num<100)` gives, but do work when there's actual guaranteed truncation as part of the program logic. — Peter Cordes, Jan 13 '22 at 01:33
FWIW, I have a crate I use for stating these kinds of assumptions succinctly, https://crates.io/crates/assume. E.g. `assume!(unsafe: num < 100)`. — GManNickG, Jan 13 '22 at 21:57

user3840170 · Accepted Answer · 2022-01-12T10:20:33.483

44

Overflow in the iterator state.

The C++ version will loop forever when given a large enough input:

#include <cstdint>

[[gnu::noinline]]
std::uint64_t sum1(std::uint64_t n) {  
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}

#include <iostream>

int main() {
    std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;

    return 0;
}

This is because when the loop counter j finally reaches 0xffffffff'ffffffff, incrementing it wraps around to 0, which means the loop invariant j <= n is always fulfilled and the loop never exits.

Strictly speaking, invoking sum1 with 0xffffffff'ffffffff infamously triggers undefined behaviour, though not because of overflow itself, but since infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why I applied the [[gnu::noinline]] attribute to the function to prevent the compiler from taking ‘advantage’ of that in optimisation passes.

The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:

use std::num::Wrapping;

fn sum1(num: u64) -> u64 {
    let mut sum = Wrapping(0);
    for _ in 0..=num {
        sum += Wrapping(1);
    }
    return sum.0;
}

fn main() {
    println!("{}", sum1(0xffffffff_ffffffff));
}

The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18446744073709551616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.

edited Jan 12 '22 at 10:20

answered Jan 11 '22 at 20:09

user3840170

22,750
2
21
50

I used this version and it produces the same assembly as the one above in the question. https://godbolt.org/z/6xE49YMxd – Jan 11 '22 at 20:15
8

That's intentional. user3840170's changes were made to get a *debug* build to "work" (your version would cause a panic due to an overflow in `sum`). Release builds have overflow checks disabled (integer arithmetic wraps on overflow), so the `Wrapping` wrapper doesn't change the behavior there. – Francis Gagné Jan 11 '22 at 20:31
@FrancisGagné So I thought `-C opt-level=3` would be enough to get a release version in Godbolt. Is there anything else I'd need to pass? – Jan 11 '22 at 20:35
6

No, you're getting this backwards. You *are* getting a release build with `-C opt-level=3`. user3840170's changes are only important in a *debug* build (in which an overflow causes a panic, instead of just wrapping around; `Wrapping` replaces the panic with wrapping arithmetic). Your version has different behavior in debug vs. release mode when calling `sum1(0xffffffff_ffffffff)`, while user3840170's version has the same behavior in both modes, but they both behave the same in release mode (indeed, they generate the same assembly code, as you pointed out). – Francis Gagné Jan 11 '22 at 20:48
@Stargateur: A local variable being incremented doesn't count as a side-effect, as far as I know, as it's not observable (without further code). – Matthieu M. Jan 12 '22 at 08:36
1

@MatthieuM. that look like a strange rule of c++ it's look arbitrary and unclear – Stargateur Jan 12 '22 at 08:44
2

@MatthieuM. It was actually C++'s idea in the first place, developed as part of the addition of multithreading to C++11, and then bolted back into C11 with reservations -- compare C++11 [intro.multithread] paragraph 10 with the much weaker statement in C11 6.8.5p6. The previous versions of both standards (C++98, C99) don't contain any of this language. Anyway, the rationale appears to have been "allow compiler transformations such as removal of empty loops, even when termination cannot be proven" but I don't think that's an excuse for removal of loops that provably _don't_ terminate. – zwol Jan 12 '22 at 14:51
@zwol: Many thanks for the correction! And I'm still not sure I agree even with empty loops: I guess it's useful for loops containing only `assert`, but it's common enough in the embedded world to define `abort` as `while true {}` to be able to have the time to plug in a debugger and see where the code stopped... – Matthieu M. Jan 12 '22 at 15:21
Why doesn't the compiler do it this way with one check? `add rdx, -1` and `jc` or move the count into `rcx` and use `loop`. I thought the compiler would change counting up into down as it sees fit. – 93Iq2Gg2cZtLMO Jan 12 '22 at 21:33

score 18 · Answer 2 · answered Jan 12 '22 at 10:49

18

@user3840170 correctly identified the difference: overflow check in Rust, and not in C++.

Still, the question remains as to why there are 2 checks per loop in the Rust version instead of 1, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive design is not well adapted to LLVM¹.

The optimal code generation for short loops, is to split the loop, transforming:

for j in 0u64..=num {
    sum += 1;
}

Into:

for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
    sum += 1;
}

if 0 <= num {
    sum += 1;
}

This would lead to having a single branch in the main loop, and allow LLVM to switch this to a closed formula².

The Loop Splitting optimization applies to RangeInclusive and most other Chain iterators, as indeed a RangeInclusive can be thought of as a chain of a once iterator and half-exclusive range iterator (in either order). It is not always a win: like inlining, it implies duplicating the code of the loop body, which if large may lead to a significant overhead in code size.

Unfortunately, LLVM fails to split the loop. I am not sure if it's because the optimization is missing altogether, or it just fails to apply it here for some reason. I'm looking forward to the rustc_codegen_gcc backend, as GCC 7 added Loop Splitting to GCC, and it may be able to generate better code there.

¹ See this comment I left over performance issues with RangeInclusive. I spent significant time banging my head over the issue in 2019, and I dearly wish RangeInclusive (and all ranges) were NOT Iterator; it's a costly mistake.

² Chain iterators, in general, perform much better using .for_each(...), specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1) being reduced to num + 1.

answered Jan 12 '22 at 10:49

Matthieu M.

268,556
42
412
681

AND I realize now if you split that loop, not only it gets optimized as per number of checks, the loop will get eliminated completely too, returning N+1. – Jan 12 '22 at 12:20
the code of rangeinclusif is so complex I still don't understand why they did like this, https://doc.rust-lang.org/src/core/iter/range.rs.html#913 call is empty but still do manual check, and code is unnecessary complex call next_spec and duplicate code everywhere and is_empty a critical function check `is_exhausted` first when 99.9% of case it's should just check the value. https://doc.rust-lang.org/src/core/ops/range.rs.html#539 I'm not at all surprise that LLVM can't optimize even this simple loop. – Stargateur Jan 12 '22 at 18:00
also I disagree about overflow check, it's not obligatory a range can be that anything you could use the Generic to do a lot of range and not would make sense to call this behavior a "overflow check" I also do not understand why you said it shouldn't be a iterator – Stargateur Jan 12 '22 at 18:04
What is `if 0 <= num` supposed to do? `num` being an unsigned type, this always evaluates to true? – Will Jan 12 '22 at 23:27
Another transformation that could work is `do{}while(j != num+1)`, where `num+1` is a wrapping add. A normal `for` loop compiles to a `do{}while()` loop [with a branch to skip it if the condition is false before the first iteration](https://stackoverflow.com/q/47783926/224132), but here the inclusive range means the trip-count is guaranteed non-zero so it actually takes *less* asm to implelement. I think the only overhead is incrementing `num`. (Or a copy of it, and tying up a register if the original `num` is needed later, assuming a compiler would rather save/restore another reg than `dec`) – Peter Cordes Jan 13 '22 at 01:49
1

@PeterCordes: While this works in this particular implementation, it doesn't in general as the `start` and `end` bounds can be "crossed": that is `3..=2` should iterate 0 times. (And with a general type, there's the issue that not all types support wrapping adds) – Matthieu M. Jan 13 '22 at 10:14
@MatthieuM. ah yes.... now I see why being an iterator is a problem.... damm. We can't change that ? – Stargateur Jan 13 '22 at 14:40
@Stargateur: Could maybe be changed as part of edition; but it may break quite a bit of code so nobody's dared until now (and the next edition is 2024). – Matthieu M. Jan 13 '22 at 14:44
@Will: Disappointment that Loop Splitting doesn't occur -- which would allow having a closed formula instead of SIMD. – Matthieu M. Jan 14 '22 at 12:01

score 16 · Answer 3 · edited Jan 12 '22 at 17:44

16

These two loops are equivalent in C++ and Rust

Your two code snippets don't share the same behavior. for (uint64_t j = 0; j <= n; ++j) doesn't handle n == uint64_t::MAX (make it infinite looping) while for j in 0u64..=num do (will never go into an infinite loop).

A rust equivalent code could be:

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    let mut j = 0;
    while j <= num {
        sum = sum.wrapping_add(1);
        j = j.wrapping_add(1);
    }
    sum
}

currently produce the following asm godbolt:

example::sum1:
        xor     eax, eax
.LBB0_1:
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

edited Jan 12 '22 at 17:44

idmean

14,246
8
52
81

answered Jan 11 '22 at 20:23

Stargateur

20,831
8
51
78

1

Elaborating a bit, this more direct translation does have the same output: https://rust.godbolt.org/z/GbToeKec7 – GManNickG Jan 11 '22 at 21:38
Shouldnt it be `saturating_add()` instead for `sum`? – Jan 12 '22 at 00:24
@Jellyboy No, unsigned arithmetic in C++ is modulo 2^k (in this case, k==64). Saturating in C++ would look like: `sum += sum == static_cast(-1) ? 0 : 1;`. – GManNickG Jan 12 '22 at 02:47

Why does iteration over an inclusive range generate longer assembly in Rust?

3 Answers3

Linked

Related