
In languages like C, unsynchronized reads and writes to the same memory location from different threads are undefined behavior. But in the CPU, cache coherence guarantees that if one core writes to a memory location and another core later reads it, the reading core must observe the written value.

Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?
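
To make that concrete, here's a minimal C++ sketch of the kind of unsynchronized access I mean (a data race on a plain `int`, hence UB):

```cpp
#include <thread>

int shared = 0;  // plain int: no synchronization of any kind

int main() {
    std::thread writer([] { shared = 42; });               // unsynchronized write
    std::thread reader([] { int x = shared; (void)x; });   // unsynchronized read: a data race, so UB
    writer.join();
    reader.join();
}
```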

Dan
  • memory barrier and cache coherency are different things – Support Ukraine Oct 11 '21 at 12:14
  • Undefined behaviour is not about the kernel, but about how the compiler will interpret your code and generate the binary – Ôrel Oct 11 '21 at 12:15
  • `if the next layer up` Well, C is not necessarily "next layer up", and undefined behavior in C _only_ means that the C standard imposes no _requirement_ on the behavior of the program - there may be requirements from other standards, and specific C programs can depend on specific hardware & compiler behaviors. – KamilCuk Oct 11 '21 at 12:22
  • https://softwareengineering.stackexchange.com/ might be a better fit for this Q. – hyde Oct 11 '21 at 12:26
  • Memory barriers enforce ordering, not coherence. A barrier instruction only has a direct effect on the core that issues it; all other cores are oblivious to it, except that they see updates in a certain order. Cache coherence, in contrast, is a sort of broadcast mechanism ("I need X") which permits cores to update RAM lazily, presumably to maintain fast access and avoid redundant updates. – mevets Oct 11 '21 at 12:35
  • Suppose CPU A sets byte 0 of the cache line and CPU B sets byte 15 at more or less the same time. There's no way to resolve this without cache coherence. Doing two operations will always have a race. – stark Oct 11 '21 at 12:38
  • Ok, I guess I had the wrong idea of what a memory barrier is, so I edited the question. – Dan Oct 11 '21 at 12:53
  • @stark That's a good point, the language does say that you can do writes at any granularity down to a single byte without disturbing adjacent memory locations – Dan Oct 11 '21 at 13:07
  • Right, so in fact, standard C more or less *requires* that you have coherent caches. For a machine with incoherent caches to conform, it would have to have 1-byte cache lines. Moreover, every acquire barrier would have to invalidate the core's entire cache, and every release barrier would have to flush the whole thing. That would be prohibitively expensive. – Nate Eldredge Oct 11 '21 at 13:22
  • Cache coherence is more than just the same memory location. Cache coherence is what lets you write one memory location on one core and the *next* memory location on a different core! – user253751 Oct 11 '21 at 13:32

1 Answer


The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier primitives like pthread_mutex) would be very expensive to implement if you didn't have coherent caches. You'd have to write back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if you couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.
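
For reference, this is the pattern that would be prohibitively slow without coherence (a minimal C++11 sketch, nothing beyond the standard library):

```cpp
#include <mutex>
#include <thread>

std::mutex m;
int counter = 0;  // plain data, protected by the mutex

void increment() {
    // lock() is an acquire operation and unlock() a release operation.
    // Without coherent caches, every unlock() would have to write back
    // all dirty lines, and every lock() would have to discard all clean
    // ones, just in case they held shared data.
    std::lock_guard<std::mutex> g(m);
    ++counter;
}

int main() {
    std::thread t1(increment), t2(increment);
    t1.join();
    t2.join();
}
```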

But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
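
To illustrate where the cost actually lives, here's a hand-rolled spinlock sketch (illustrative only; use std::mutex in real code):

```cpp
#include <atomic>

// Minimal spinlock sketch: the expensive part is the atomic RMW
// (exchange), which needs exclusive ownership of the cache line.
// The acquire/release orderings themselves are cheap local operations.
class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomic RMW with acquire ordering: this core must get the
        // line in exclusive state, a cache miss if another core
        // owned the lock last.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the holder releases
        }
    }
    void unlock() {
        // Release store: just orders this core's earlier accesses
        // before the store; no flush or invalidate of anything.
        locked.store(false, std::memory_order_release);
    }
};
```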

C11 and C++11 added <stdatomic.h> and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher-level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches: it doesn't provide explicit flushes of ranges, but instead has release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
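
A minimal sketch of that release/acquire publication pattern (the producer/consumer names are mine, not from any standard):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain non-atomic data, published via the flag
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // older non-atomic store
    ready.store(true, std::memory_order_release);   // release: publishes payload
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // acquire syncs-with the release
    assert(payload == 42);  // guaranteed visible: no explicit flush needed anywhere
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```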

In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]
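
For instance, read-read coherence already constrains relaxed operations on one object (a small sketch; `observe` is a hypothetical helper):

```cpp
#include <atomic>

std::atomic<int> x{0};

int observe() {
    // Even with relaxed ordering, read-read coherence forbids the
    // second load from seeing an *older* value in x's modification
    // order than the first load did, so the compiler may not reorder
    // these two loads of the same object.
    int a = x.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_relaxed);
    // If another thread did x = 1 then x = 2, observing a == 2 while
    // b == 1 is impossible.
    return b - a;
}
```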

(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)


Also, as pointed out in comments, coherent caches are essential so that writes by different cores to different parts of the same line don't step on each other.

ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another thread writes arr[1]. If those were both in the same cache line of an incoherent system, and two conflicting dirty copies of the line existed, only one could "win" and be written back. (See C++ memory model and race conditions on char arrays.)
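
A minimal sketch of code that's legal under that guarantee:

```cpp
#include <thread>

char arr[16] = {};  // adjacent elements almost certainly share one cache line

int main() {
    // Well-defined in ISO C11 and C++11: the two threads touch
    // *different* array elements, so there is no data race. Coherent
    // caches are what let both byte stores survive even though they
    // hit the same cache line.
    std::thread t0([] { arr[0] = 1; });
    std::thread t1([] { arr[1] = 2; });
    t0.join();
    t1.join();
    // Guaranteed afterward: arr[0] == 1 && arr[1] == 2.
}
```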

ISO C effectively requires char to be as large as the smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha, and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.

The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.

Peter Cordes