mo_relaxed is fine for both load and store of a stop flag
There's also no meaningful latency benefit to stronger memory orders, even if latency of seeing a change to a keep_running or exit_now flag was important.
IDK why Herb thinks stop.store shouldn't be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn't say anything about the store side before moving on to "is it worth it".
Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn't his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think either that or a mistake on his part.
The ISO C++ standard doesn't say anything about how soon stores become visible or what might influence that, just Section 6.9.2.3 Forward progress
18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
Another thread can loop arbitrarily many times before its load actually sees this store value, even if they're both seq_cst, assuming there's no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not correctness / formal guarantee.
And non-infinite inter-thread latency is apparently only a "should" QOI (quality of implementation) issue. :P Nothing in the standard suggests that seq_cst would help on an implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)
On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don't make stores visible sooner in real time, they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)
Stronger orders, and barriers, don't make things happen sooner in an absolute sense, they just delay other things until they're allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn't fill up, and
See also (my similar answers on):
The second Q&A is about x86 where commit from the store buffer to L1d cache is in program order. That limits how far past a cache-miss store execution can get, and also any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we're waiting for a response to.) But these effects are extremely minor in terms of something like exiting another thread; only very small scale reordering.
because who cares if the thread stops with a slightly bigger delay.
More like, who cares if the thread gets more work done by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it's in the shadow of a a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it's a stop flag so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.
That phrasing makes it sound like an acquire load or release store would actually get the the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case).
The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces a false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions, not like ARM64 ldapr.
BTW, Herb is right that the dirty flag can also be relaxed, only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.
But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB. e.g. because of the theoretical possibility of hardware race-detection that traps on conflicting non-atomic accesses.