17

The first version of the DEC Alpha had no load/store instructions for 8 or 16-bit values; if you wanted to deal with data of such sizes, you had to do it by shifting and masking values in registers as necessary. (This restriction was abandoned later because too much existing technology was designed around the assumption that addressing bytes is a commonplace thing to do; I'm asking purely about the technical reasons for doing it that way in the first version.)

On the face of it, it makes sense that it would simplify the hardware. According to http://alasir.com/articles/alpha_history/press/alpha_intro.html:

Alpha is unconventional in the approach to byte manipulation. Single-byte stores found in conventional RISC architectures force cache and memory implementations to include byte shift-and-mask logic, and sequencer logic to perform read-modify-write on memory words. This approach is awkward to implement quickly, and tends to slow down cache access to normal 32- or 64-bit aligned quantities. It also makes it awkward to build a high-speed error-correcting write-back cache, which is often needed to keep a very fast RISC implementation busy. It also can make it difficult to pipeline multiple byte operations.

But hang on. Alpha did support storing 32-bit numbers. But it was a 64-bit architecture. So that's already supporting storing values smaller than a full word. Doesn't that already incur precisely the complexity that they were trying to avoid?

In other words, in a 64-bit CPU, isn't it pointless to refuse to support 8-bit stores if you are already supporting 32-bit stores that already need the same kind of hardware support? Or if not, why not?

rwallace
  • 5
    I guess Alpha is still early in the RISC game so there's desire to make load/store hard wired logic to make them fast so simplicity is key. Right now everything is micro coded then you can easily create a slow path in micro code without impacting the fast path. – user3528438 Feb 24 '20 at 17:14
  • @user3528438 There is truth in what you say, but then, if that was an argument against supporting 8-bit stores, why was it not an equally strong argument against supporting 32-bit stores? Don't those already break the simplicity? – rwallace Feb 24 '20 at 17:16
  • 3
    Having 32-bit integer and float register arithmetic, as well as 64 bit, would have been weird if there were no 32-bit loads and stores. Using "64 bit for everything" often wastes half the RAM, which was also a scarce resource back then. – alephzero Feb 24 '20 at 17:34
  • They did add them later on, starting with EV56 (21164A). – Brian Feb 24 '20 at 20:17
  • 2
    Another fun fact is that early ARM CPUs didn't have half-word (16 bit) load/store capability too, only bytes and (32bit) words were supported. https://en.wikichip.org/wiki/arm/armv1#load_instructions – lvd Feb 24 '20 at 23:21
  • 3
    @lvd: I suspect the C Standard's rules about struct operations disturbing padding are designed to allow implementations to use 32-bit writes when updating 16-bit structure members that are followed by 16 bits of padding, without having to worry about whether another struct with a Common Initial Sequence might use the padding bits for some other purpose. – supercat Feb 24 '20 at 23:34
  • 3
    @alephzero: And half the cache footprint and/or memory bandwidth, which are always scarce. – Peter Cordes Feb 25 '20 at 08:04
  • 2
    @lvd not only did it not have 16-bit load/store, it also didn't have 8-bit signed load/store. That's why char is usually unsigned on ARM. Halfword and signed-byte operations were introduced later, but they aren't as efficient due to the lack of encoding space – phuclv Feb 25 '20 at 14:00
  • 2
    If you are interested in the Alpha architecture, Raymond Chen has an interesting deep dive beginning here: https://devblogs.microsoft.com/oldnewthing/20170807-00/?p=96766 – Flydog57 Feb 25 '20 at 18:12
  • 1
    @phuclv: The half-word and signed-byte load/store use a shorter opcode format, but that only makes them "inefficient" in cases where the normal opcode formats (r1+shifted r2, or r1+/- up to 2048) would be adequate but the shorter forms (r1+r2, or r1+/-some smaller value) would not. – supercat Feb 25 '20 at 18:20
  • 1
    It did not help that DEC's C compiler would not automatically convert a loop that did raster processing byte by byte into a loop that did it word by word, and other systems had caches that made the speedup from writing more complex code hardly worth the effort. (Nearly all our customers had these other systems) – Ian Ringrose Dec 10 '20 at 16:27
  • @IanRingrose: The C language also specifies that a char store on one thread will not interfere with another thread performing a char load or store of the adjacent address above or below. There's no way a compiler can handle that efficiently. – supercat Sep 17 '23 at 18:50
  • @supercat My recall is the C standard did not cover threading at the time. – Ian Ringrose Sep 18 '23 at 21:51
  • @IanRingrose: The C Standard may not have covered it, but there were certainly well established conventions upon which many programs relied. People were writing multi-threaded code in C even before C89, despite the fact that it would take 20 years for the Standard to acknowledge the possibility of such constructs. – supercat Sep 18 '23 at 21:57

2 Answers

18

Support for byte writes throughout a memory system is expensive. Among other things, if one wishes to use error-corrected memory that can correct single-bit errors, a memory that can be written in independent 8-bit chunks will require four extra check bits per octet, or 16 bits per 32-bit word. A memory that is limited to writing 16-bit chunks will require five extra bits per 16-bit chunk, or 10 bits per 32-bit word; one that is limited to writing 32-bit chunks requires only six extra bits per word.

Worse, support for byte granularity increases the complexity of caching hardware. If one is using a 256-bit bus between the cache and main memory, limiting writes to multiples of 32 bits would require using eight read/write control circuits and keeping track of eight dirty bits per row. Allowing individual octets to be written would increase that to 32 separate read/write control circuits and 32 dirty bits.

For most tasks, having to use a read-modify-write sequence to update individual bytes would not be an issue. Unfortunately for DEC, many programming languages specify that adjacent bytes within any character array may be safely written by different threads without requiring synchronization, and offer no means by which programmers can indicate that they don't need such semantics. While the Alpha could efficiently process programs that wouldn't require it to accommodate the possibility of simultaneous writes to different parts of a word, languages provide no means of identifying such programs.

supercat
  • 1
    Are there 32 bytes in 256 bits? And why are 'dirty' bits needed per byte? Isn't it enough to have single 'dirty' bit for the whole cacheline? – lvd Feb 24 '20 at 23:16
  • 1
    @lvd: Mea culpa on 32 vs 64. On a system with weak memory coherency (which is what I think Alpha uses), there is no effort to negotiate ownership of cache lines. If multiple cores write the same word of memory, there's no guarantee of who will win, but if two cores each write a different word in a cache line, and each core eventually commits all writes, and each core flushes its read cache after the other core commits writes, then memory should ultimately end up consistent. That will only work, though, if cores limit their write-back to the portions of the cache that they wrote themselves. – supercat Feb 24 '20 at 23:30
  • 2
    To be fair, thread aware high-level-language memory models generally didn't appear until after first-gen Alpha was designed and released. Java was 1995, but IDK when the language added a thread-aware memory model. C/C++ wasn't formally until C11 / C++11, although it was widely understood as a de-facto standard before that. – Peter Cordes Feb 25 '20 at 08:05
  • 3
    Related: Can modern x86 hardware not store a single byte to memory? - my answer covers other ISAs, including Alpha as the only(?) modern byte-addressable ISA without byte loads/stores, making it notable and interesting. Also Advantage of byte addressable memory over word addressable memory mentions L1d ECC overhead as one of Alpha's main reasons for no byte stores. – Peter Cordes Feb 25 '20 at 08:07
  • 2
    Do you have more evidence about Alpha keeping per-word dirty status? Unless you RFO on a per-word basis, how can stores to the same line by different cores ensure ordering? Would memory barriers and atomic RMW operations have to flush to coherent L2 cache to make sure 2 cores weren't incrementing separate copies of the same word? If not, how could you get sequential consistency on Alpha with "weak coherency"? Linux managed to run on it, so I assume it was possible to write multi-threaded code for Alpha. I've only read about Alpha being as weak as C++11 (relaxed not consume) but coherent. – Peter Cordes Feb 25 '20 at 08:11
  • 1
    @PeterCordes: Stores to the same location by different cores wouldn't be ordered in the absence of barriers in both cores. Stores to different locations on the same line would end up being consistent because the write-back operations would be disjoint, rendering order irrelevant. It's been about 25 years since I studied the Alpha, and I never actually used one, so my memory may be faulty, but I think that's how it worked. – supercat Feb 25 '20 at 14:49
  • 1
    What does a barrier even do? In normal CPUs that use (some variant of) MESI to enforce coherency and consistency, a full barrier just has to ensure that the store buffer drains and commit to L1d; the line(s) can stay in Modified state. On your proposed system, a full barrier would have to do something more to make previous stores visible to other cores before later loads. And e.g. C++11 requires there to be a single modification order for any given atomic object. Also, atomic RMW via LL/SC would also have to do something make all cores act on a single copy of a shared counter. – Peter Cordes Feb 25 '20 at 15:02
  • 1
    So basically I don't think this can work without some kind of HW mechanism to enforce disjoint dirty maps. I've heard of tracking dirty status within a line on a per-word basis to optimize write-back, but not of doing that instead of MESI. Could that be what you're remembering? AFAIK, all single-system-image SMP systems are cache-coherent; as Wikipedia says, "non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model." – Peter Cordes Feb 25 '20 at 15:08
  • 1
    @PeterCordes: A release fence flushes the write cache out to memory. An acquire fence flushes the read cache (or indicated portions thereof). If one or more cores write to disjoint areas of memory and then perform release fences, and then other cores perform an acquire fence, they will see everything that was written by the previous cores. – supercat Feb 25 '20 at 15:44
  • 1
    @PeterCordes: I agree with you that such an architecture wouldn't work at all well with the threading semantics that would be standardized 15 years later, since all Atomic-qualified accesses would need to operate directly (and expensively) on main memory, but if one writes code to minimize the use of such accesses such a design could be just fine for many purposes. – supercat Feb 25 '20 at 15:52
  • 1
    The Linux kernel "standardized" stuff for its own use with inline asm macros like smp_mb() and mb() for full barriers. Alpha used to be one of the platforms that Linux supported well. Also, Alpha definitely had cache->cache transfers that didn't have to go through memory. This is how 21264 could violate causality in real life, making store1 / wmb() / store2 in the write side appear out of order to a reader that tried to use the data dependency (mo_consume style) to order the reads without a rmb(). Memory order consume usage in C11 has links / quotes – Peter Cordes Feb 25 '20 at 16:19
  • 1
    @PeterCordes: In any case, my point is that I don't remember the Alpha using anything like MESI for "ordinary" stores, and I think I would have remembered if it had [though it was 25 years ago, so...] If cache write-back only hits dirty lines, then operations on disjoint regions of storage won't require any coordination and false sharing won't be an issue. A core that has loaded a cache line would be able to write to it straight away without having to first check whether anything else may have written to other parts of the line. – supercat Feb 25 '20 at 17:12
  • 1
    https://news.ycombinator.com/item?id=17670095 has some people's comments about Alpha. Including one specific description of Alpha as coherent but ultra-weak consistency (e.g. lack of ordering between cache banks) https://news.ycombinator.com/item?id=17672467 leading to not getting mo_consume for free. So other people that remember very specific details of Alpha behaviour are saying it was always coherent. Also, https://www.kernel.org/doc/Documentation/memory-barriers.txt describe a model that covers Alpha, and it's 100% clear that it's coherent, just with reordering possible. – Peter Cordes Feb 25 '20 at 17:25
  • 1
    Perhaps Alpha used something other than MESI for coherency, but release stores were possible, making sure all previous stores (and loads) were visible to other cores before this store. Unless a write memory barrier was implemented as a very expensive cleaning write-back of all dirty cache lines (word), that would require some kind of coherency tracking mechanism to make it possible to guarantee global visibility of previous stores. (I think, unless I'm missing something major.) – Peter Cordes Feb 25 '20 at 17:29
  • 1
    @PeterCordes: A substantial number of applications require performing a large number of independent operations on a large shared collection of read-only data. If a core only needs to publish the results of its computations once a second or so, the performance cost of doing a full write-cache flush each time would be trivial, especially if much of the information being written would never be accessed again. Of course, performance with some other kinds of tasks would be abysmal, but there may be a reason the Alpha failed in the marketplace. – supercat Feb 25 '20 at 17:44
  • 1
    @supercat: I'm pretty sure the reason is that corporate politics killed Alpha, not technical problems. (DEC was bought by Compaq, Compaq bet on Itanium and killed the EV8 / EV9 projects. https://www.realworldtech.com/ev8-mckinley/ says the design team mostly moved to Intel). Alpha was all single-core per package, but I assume there must have been multi-socket systems. https://en.wikipedia.org/wiki/Alpha_21364#Memory_controller says coherency was MESI, but that's 21364 which never made it to the open market, only a few Compaq systems. Possibly 21064 didn't even do SMP at all, hence no MESI? – Peter Cordes Feb 25 '20 at 18:03
  • 1
    @PeterCordes: I don't remember multiple cores per socket, but think there was an intention to support multiple chips per system. As for whether the device was killed by politics or technical limitations, why not both? – supercat Feb 25 '20 at 18:09
  • 1
    21464 (EV8) was going to be super-wide with SMT (still single core per die), but was killed in 2001. https://www.realworldtech.com/compaq-sacrifices-alpha/ has the details. At least 21364 and later had some kind of "interconnect" between sockets, and on-die memory controllers. Nothing I've read has given any indication of a technical limitation, or suggested that Alpha couldn't scale well on tightly coupled parallel problems. Of course this is about later Alpha, like 21264, so possibly 21064 had some kind of weakness? I haven't found anything about SMP 21064 specifically yet. – Peter Cordes Feb 25 '20 at 18:09
  • 1
    https://www.hpl.hp.com/hpjournal/dtj/vol7num1/vol7num1art4.pdf talks about AlphaServer 8000 systems, with up to 12 21164 CPUs: Cache coherency information for each system bus transaction is broadcast on the system bus as each transaction's data bus transfer is initiated. It also talks about in order vs. out-of-order bus transfers, saying something complicates the business of maintaining coherent, ordered memory updates. Alpha 21164 at least had coherent shared memory. There are also some details about atomic RMW and sub-block writes, and bank lock/unlock. (Search for "cohere") – Peter Cordes Feb 25 '20 at 18:19
  • 1
    @PeterCordes: I seem to recall later Alphas supporting 8-bit read/write. Do you remember which generation that happened with? – supercat Feb 25 '20 at 18:21
  • 1
    https://en.wikipedia.org/wiki/DEC_Alpha#Byte-Word_Extensions_(BWX) were new in 21164A (EV56). So I think that post-dates the 21164 AlphaServer 8000 systems my previous comment was about. Presumably byte stores are implemented with a cache RMW cycle during commit from the store buffer into L1d, like many modern non-x86 CPUs. (With merging in the store buffer trying to form complete cache-word commits.) – Peter Cordes Feb 25 '20 at 18:25
  • 1
    @PeterCordes: That makes me wonder if my recollection about having to track dirty bits for each individually-writable value might have been accurate with respect to the first-generation Alpha I was learning about. Too bad I got rid of my Alpha manual when I moved, figuring I'd never need it again. – supercat Feb 25 '20 at 18:29
  • 2
    @PeterCordes I had access to a pre-release Alpha, and they were SMP from Day 1 (or before). DEC 4000/7000/10000 AXP. – richardb Feb 25 '20 at 18:41
  • 1
    Reading more of https://www.hpl.hp.com/hpjournal/dtj/vol7num1/vol7num1art4.pdf, and other stuff, it seems there is no shared last-level cache in that generation. But "When a processor reads a block of data from a second processor's cache" implies core->core without write-back to DRAM. Also, they talk about bank locking saying This approach provides for atomic read-modify-write sequences and coherent subblock writes - if block=word, this just means safe / coherent byte stores, I guess. I don't think they're saying that word stores aren't normally coherent (block=line) but not sure. – Peter Cordes Feb 25 '20 at 18:42
  • 1
    @richardb: thanks, yeah, https://en.wikipedia.org/wiki/DEC_4000_AXP dual-socket 21064 ruled out that unlikely guess about single-core-only early Alphas, but still perhaps supercat's recollection was based on a single-socket system that didn't have to do any cache coherency. (Although that's unlikely, probably lack of MESI in a single-core system wouldn't have been notable). It makes sense that Alpha was designed from the ground up for high-end SMP systems, of course. – Peter Cordes Feb 25 '20 at 18:48
16

Apart from anything else, Alpha was the VAX replacement - indeed, it was internally called EVAX. It would be necessary to take VMS source code written in VAX MACRO-32 assembler and compile it into Alpha machine code.

VMS had a lot of dependency on 32-bit words. MACRO-32 code was explicit about length. Higher-level code, in BLISS-32, tended to be explicit about operand length or perhaps merely assumed that a fullword was exactly 32 bits. Ergo, any credible hardware base for VMS requires efficient 32-bit operations.

I assume there was perceived to be no particular byte requirement for VMS since most (non-C) software would be using VMS conventions for strings (counted, not with a terminator character) and VAX machine instructions (MOVC3, CMPC3, etc., or the equivalent BLISS-32 'CH$' builtins), said instructions easily being emulated on Alpha. C was not seen as a major implementation language on VMS at the time.

(Warning: opinion of one DEC software guy only)

dave
  • Which site, @another-dave? My father was an Alpha software guy at ZKO. – T.J.L. Feb 25 '20 at 16:49
  • Various, most notably in REO, TW, and LKG. – dave Feb 25 '20 at 17:48
  • 1
    VAX MACRO was mostly compiled to Alpha instructions. PALcode was only for the hairy stuff like INSQTI, which absolutely had to appear atomic. – richardb Feb 26 '20 at 21:36
  • 1
    True, I'm not now sure what made me write the PALcode thing after the character-string instructions which would easily be implemented in normal Alpha code. I'll edit that out and forget I ever said it. Thanks. – dave Mar 03 '20 at 03:37
  • 1
    Also w.r.t. strings the Alpha had instructions which made word-length-based searching for a character within a string fast. – davidbak Jun 23 '20 at 16:00