Fastest 8-bit microprocessor for multiply-accumulate?

Question

I want to identify which 8-bit microprocessor would have the best performance for a multiply-accumulate operation.

By "operation", I mean the minimal implementation for 16-bit operands and 32-bit result in assembly language.

By "8-bit microprocessor", I mean an early microprocessor (pre-1990) having an 8-bit external data bus -- I don't care about ALU or register size.

By "best performance", I mean the lowest elapsed wall-clock time to compute and store the result.

Is there an objective answer, probably based on the CPU having a "fast" hardware multiplier, while sporting sufficient scratchpad registers and high clock rate?

I don’t have objective benchmarking results handy, but since you only limit the external data bus, the answer is likely to be something like the 80188, or if you allow later implementations of pre-1990 CPUs, something like the high-clock-frequency Z80s... — Stephen Kitt, May 09 '22 at 15:51
The 68008 was pretty fast internally, but was 8-bit externally. — Mark Williams, May 09 '22 at 15:52
@StephenKitt I wouldn't want the old core pushed to 100s of MHz. But a simple CMOS re-implementation running at 10s of MHz would be "retro enough". — Brian H, May 09 '22 at 15:58
Have you seen https://retrocomputing.stackexchange.com/questions/1711/what-api-did-the-math-box-provide ? I wouldn't be surprised if the combination of 6502 plus math box compared favorably with any other mass-produced 8-bit-data-bus machines from the early 1980s, though I suspect by 1990 higher-speed versions of the NEC v20 might have been faster. — supercat, May 09 '22 at 17:41
"Is there an objective answer" I doubt that as the 'rules' allow any kind of CPU able to run from an 8-bit bus, even without dedicated instructions, which includes all x86 up to at least Pentium. Thus it essentially comes down to the highest clock rate at the time. A somewhat more serious way may be to really restrict this to 8 bit CPUs with dedicated MAC support. Like some Mitsubishi 740 variants, essentially a 6502 or STM's ST7 and STM8 family (6800 offspring) and several others. — Raffzahn, May 09 '22 at 22:18
@Raffzahn can the 286 and later really run on an 8-bit bus on their own? I was under the impression that transparent 8-bit transfers, e.g. for 8-bit ISA cards, were mediated by the chipset. — Stephen Kitt, May 10 '22 at 09:02
@StephenKitt 286 works like 8086, so I guess the chipset may need to be involved (would have to check). Same with 386 which can be told by /BS16 to do 16 bit memory, while 486 and Pentium use /BS8, /BS16 to reconfigure the bus to 8/16 bit. With /BS8 and /BS16 pulled all transfers run thru D7..D0, essentially making it an 8 bit CPU according to above definition. Pentium works similarly (IIRC) but would need external selectors. I just remember this oddity as I once did build a 486 card for the Apple II, working much like the Z80 card. — Raffzahn, May 10 '22 at 10:47
@Raffzahn cool, I didn’t know about BS8/BS16! I see they are not supported on the Pentium though, and as you say external byte assembly logic is required to interface with buses narrower than 64 bits. — Stephen Kitt, May 10 '22 at 10:55
@StephenKitt Well, you're right. Then again Pentium was after 1990, so let me restrict my interjection to 486 - which ofc includes the genuine i486DX50 (50 MHz bus) of 1989. 68k CPUs btw had similar ways to adapt to smaller busses. The 68030 could be forced to do 8 bit by handling DSACK0/1 the right way (one high one low). Then all transfer was done 8 bit on D31..24 while A2..0 showed byte addresses. I think the 68040 worked alike, but I'm not sure. the interface changed with the 060. — Raffzahn, May 10 '22 at 11:17

score 49 · Accepted Answer · edited May 12 '22 at 07:44

And sorry to nerd-snipe this, but of course the ideal candidate for a fast multiply-add is a Digital Signal Processor CPU, and nothing in your requirements says that DSPs are excluded. It will also be hard to draw the line between "DSP" and "general-purpose CPU with hardware multiplier", because that's basically what DSPs are (I/O design aside).

As a data point, one of the first standalone DSPs was the NEC µPD7720, released in 1980 (pre-1990), with an external 8-bit databus (but internal 16-bit buses), a 16-bit ALU, and a 16-bit parallel multiplier, and a multiply-accumulate operation took 122 ns.

Later DSPs (still pre-1990) would be faster, but then they quickly get a 16-bit external databus (though one could argue that e.g. a TMS32010 in co-processor mode would still use an external 8-bit bus).

edited May 12 '22 at 07:44

Toby Speight

answered May 09 '22 at 17:21

dirkt

I don't think it's a snipe and I was hoping for some unconventional answers. – Brian H May 09 '22 at 17:28
And note: CD players were a big thing in the 1980's. A CD stores 16-bit samples, but it was quite common to use oversampling, to allow the digital part of the chain to work with fewer bits at a time. Philips originally used 14 bits, and some went all the way down to 1 bit with a lot of oversampling--but as I recall, there were a fair number of 8-bit DSP designs as well. – Jerry Coffin May 11 '22 at 00:15
I used to program DSP chips – JosephDoggie May 12 '22 at 14:36

Stephen Kitt · Answer 2 · 2022-05-10T13:24:19.997

21

There’s probably another CPU meeting your criteria which beats this, but as a data-point, a number of x86 CPUs fit the bill. One is the 80C188 (introduced in 1987) which was available at frequencies up to 25MHz, and the compatible Am188ER CPU was available at frequencies up to 50MHz. These have an 8-bit data bus, but a 16-bit ALU. Another is the 80486, introduced just before your cut-off; see below. Like all x86 CPUs these support the following minimal 16-to-32-bit multiplication:

MUL BX

This multiplies AX and BX, unsigned, and stores the result in DX:AX. If you want to read the operands from memory and write the result back:

MOV AX, [op1]
MOV BX, [op2]
MUL BX
MOV [resh], DX
MOV [resl], AX

where op1 and op2 point to the operands, resh points to the high word of the result, and resl to the low word.

The timings are as follows (MOV is slightly slower on the Am188ER):

MUL r16: 35–37 cycles
MOV: 12 cycles on the 80188, 13 on the Am188ER for a 16-bit transfer between a register and memory (in either direction)

resulting in a total of at most 89 cycles (four 13-cycle MOVs plus a 37-cycle MUL on the Am188ER), i.e. 1.8µs at 50MHz.
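The total can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not a benchmark: it only sums the worst-case cycle counts quoted above (four 16-bit MOVs plus one MUL) and converts them to time. The function name `sequence_time_us` is made up for illustration, and the 80C188 line assumes the same 35–37 cycle MUL range applies to that chip.

```python
def sequence_time_us(mov_cycles, mul_cycles, clock_mhz, n_movs=4):
    """Return (total cycles, microseconds) for n_movs MOVs plus one MUL."""
    total_cycles = n_movs * mov_cycles + mul_cycles
    # cycles divided by MHz gives microseconds
    return total_cycles, total_cycles / clock_mhz

# Am188ER: 13-cycle MOVs, 37-cycle worst-case MUL, 50 MHz clock
am188_cycles, am188_us = sequence_time_us(13, 37, 50)
print(am188_cycles, round(am188_us, 2))

# 80C188 (assumed same MUL timing): 12-cycle MOVs, 25 MHz clock
i188_cycles, i188_us = sequence_time_us(12, 37, 25)
print(i188_cycles, round(i188_us, 2))
```

On these figures the 50MHz part finishes the sequence in roughly half the time of the 25MHz part, since the per-instruction cycle counts are nearly identical.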

A complete multiply-accumulate would look like this, courtesy of supercat, multiplying ranges of words starting at ES:SI and DS:BX+SI+2 respectively, over CX words:

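    ES LODSW            ; AX = word at ES:SI, SI += 2 (direction flag clear)
lp: IMUL [BX+SI]        ; DX:AX = AX * word at DS:BX+SI (signed, 16-bit operand)
    ADD DI, AX          ; add low word of the product to the accumulator
    ADC BP, DX          ; add high word plus carry
    ES LODSW            ; fetch the next operand from the first array
    LOOP lp             ; decrement CX, repeat while non-zero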

The result is accumulated in BP (high word) and DI (low word).
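To make the data flow of the loop easier to follow, here is a minimal Python model of it. It is only a sketch of the arithmetic (signed 16×16 multiply, 32-bit accumulation split across two 16-bit words with an explicit carry), not of the timing; the function names `to_signed16` and `mac16` are invented for this illustration.

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement signed value."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word

def mac16(xs, ys):
    """Accumulate sum(x * y) over 16-bit words, keeping a 32-bit result.

    Returns (high, low), which play the roles of BP and DI in the loop.
    """
    low = high = 0                                  # DI and BP start at zero
    for x, y in zip(xs, ys):
        product = (to_signed16(x) * to_signed16(y)) & 0xFFFFFFFF  # IMUL -> DX:AX
        ax, dx = product & 0xFFFF, product >> 16
        low += ax                                   # ADD DI, AX
        carry = low >> 16                           # carry out of the low word
        low &= 0xFFFF
        high = (high + dx + carry) & 0xFFFF         # ADC BP, DX
    return high, low

high, low = mac16([3, 0xFFFF, 1000], [4, 5, 1000])  # 3*4 + (-1)*5 + 1000*1000
print(hex(high), hex(low))
```

The accumulator wraps modulo 2^32, just as the BP:DI pair would; the assembly loop does not detect overflow either.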

This takes advantage of prefetching to shave cycles off the overall execution time. The reference 80188 timings are 18 cycles per LODSW, 3 per addition, 34–37 per IMUL, 15 per loop taken (including instruction queue reinitialisation and target instruction prefetch). The effective address calculations would add a lot more cycles on an 8086/8088 but are handled by a dedicated hardware unit on the 80188, with no additional (visible) cost.

Raffzahn pointed out that the 80486 can also run with an 8-bit data bus; while it has a 32-bit data bus natively, if BS8 and BS16 are pulled low, it will split up data transfers into 8-bit units. The 486 greatly reduces the base number of cycles required for all the instructions above: 1 cycle per register-to-memory MOV, 13–26 cycles per 16-bit register-based multiplication, 1 cycle per register-based addition, 5 cycles per LODSW, 7 per loop taken. I imagine the transfer splits add a number of additional cycles, although those might be hidden by the cache. The 486’s clock maxed out at 50MHz externally (in 1989), but the CPU was clock-doubled and then tripled during the 90s, up to 100MHz in Intel variants and 160MHz in AMD variants.

edited May 10 '22 at 13:24

answered May 09 '22 at 16:16

Stephen Kitt


I would think for a multiply-accumulate dot product with operands placed arbitrarily in memory, optimal code would be something like: lodsw es:whatever/ lp: imul [bx+si] / add di,ax / adc bp,dx / lodsw es:whatever / loop lp. The instructions following the multiply could be fetched while the multiply was being processed, thus saving about 12 cycles from the effective execution time of the following instruction. – supercat May 09 '22 at 18:21
@StephenKitt: Was the Am188ER (at any clock rate) introduced prior to 1990, as the question asks? I worked at AMD in the second half of the 1990s, and seem to recall the part was introduced in that time frame. – njuffa May 10 '22 at 06:00
@njuffa I wasn’t sure about that, I picked that one because I could find timing information for it quickly. IIRC the 80C188’s cycle counts are in a similar ballpark, and it went up to 25MHz; I’ll adjust this to reference the Intel chip later today. – Stephen Kitt May 10 '22 at 06:20
@njuffa Not sure about the ER, or any of the E*, type but intel offered the 'plain' 188 with 10 MHz right from the start and soon after with 12.5 MHz. AMD had in 1985 as well 'plain' 188 with 10 and 12.5 MHz but outclassed Intel with versions of 16/20/25 MHz. So 80188@25 is quite standard prior to 1990. – Raffzahn May 10 '22 at 10:59
@Raffzahn when njuffa asked about this, my answer only mentioned the Am188ER ;-). – Stephen Kitt May 10 '22 at 11:06
@supercat: That loop looks good to me, good use of one pointer to index another array with BX holding the distance between arrays (GCC/clang miss that). The ES prefix is not necessary, though; it's totally reasonable to write a dot-product function that works for inputs that have to be within 64kiB of each other, both addressed using the same segment register. Also, the question just asked for MAC speed, not a full dot-product, so a microbenchmark of just mul / add/adc with no memory access would qualify. Or a sum-of-squares of an array would reduce to lodsw/mul ax/add/adc/loop. – Peter Cordes May 10 '22 at 21:13
486 is quite a different beast: the loop instruction is slower on 486 than dec/jnz (this 8086 answer quotes tables that include 486 timings). And 32-bit operand-size is available. So maybe movzx eax, word [ebx+ecx] / imul eax,eax / add edx, eax / add ecx, 2 / jnz .loop (with ECX counting up towards zero, and EBX pointing at the end of the array.) movzx might not be ideal, at 3 cycles. – Peter Cordes May 10 '22 at 21:22
Hmm, non-widening imul r,r on 486 isn't any faster than a full widening mul r32, and has data-dependent performance with the same fastest case (13c) for all operand-sizes. So I'd guess that the same input value give the same performance regardless of operand-size. So we're not gaining any speed by avoiding writing EDX (like we would on a modern x86), but it doesn't cost extra to get the 32-bit product in a single register. (Except in setup of the inputs; they need to be zero- or sign-extended. That prevents lodsw, or prevents memory-source [i]mul in a dot product loop) – Peter Cordes May 10 '22 at 21:29
lodsw is slow on 486 anyway (5 cycles), so it's good to avoid it. Apparently more efficient to do a zero-extending load as xor eax,eax (1c) / mov ax, [mem] (1c), instead of a 3c movzx reg,mem. (486 has cache, so 16-bit loads can still plausibly be single-cycle, despite an 8-bit bus.) So yeah, I think that as a replacement for movzx in my sum-of-squares loop in the previous comment should be quite good. – Peter Cordes May 10 '22 at 21:35
Running the 80486 with /BS8 asserted limits it to transfer 8 bits at a time. You still have to provide the 8 bits on different data lines, depending on the address requested, though. You don't get around routing all 32 data pins to somewhere, so calling it a processor that has an "8-bit data bus" (even if only in this special mode of operation) is borderline. – Michael Karcher May 11 '22 at 21:31

score 7 · Answer 3 · answered May 09 '22 at 22:05

I'll suggest the 8051. It's reasonably early (1980) and definitely 8-bit. There are modern derivatives claiming speeds of up to 430 MHz and including a 32-bit hardware multiplier, but if you like it a little more conventional, even Microchip's AT89LP, running at 20 MHz with 2-cycle hardware multipliers, will outperform most "classic" hardware.

Michael Graf


Yup, there's much hidden in microcontrollers, not least due to the fact that some of their tasks benefit quite well from MAC operations to generate/follow characteristics. – Raffzahn May 10 '22 at 10:48
I think the question was looking for chips that actually existed in 1989, not just modern implementations of ISAs or chips that existed then. Those are interesting side-notes, but for things like adding 32-bit multiplier HW into 8051, if that counts, then we're pretty close to modern x86 counting since it evolved out of 8088 / 8086, but now has 512-bit wide SIMD-integer stuff like vpmaddwd zmm0, zmm0 / vpaddd zmm1, zmm0 to do 32x 16-bit multiply-accumulate operations in parallel, with a throughput of 1 cycle on Skylake-X / Ice Lake (and room to load to get sum-of-squares of an array.) – Peter Cordes May 11 '22 at 02:03
Anyway, interesting so I'm not downvoting, but also not upvoting due to not mentioning actual MAC speed on an 8051 available in 1989, or release dates for either feature or chip you mention. – Peter Cordes May 11 '22 at 02:04

score 7 · Answer 4 · answered May 11 '22 at 10:14

The Motorola 6809 was one of the first (the first?) 8 bit microprocessors with a multiply instruction implemented in hardware.

It was even pre 1980, introduced in 1978.

One MUL instruction took 11 cycles (8 bit inputs, 16 bit output), i.e. 16 bit x 16 bit multiplication required 44 cycles + some cycles for adding.
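The "four multiplies plus some adds" decomposition can be sketched as follows. This is a minimal Python illustration of the schoolbook method for unsigned operands, with `mul8` standing in for the 6809's 8×8→16 MUL instruction; both function names are made up here, and the sketch says nothing about the actual cycle cost of the shifts and additions on a real 6809.

```python
def mul8(a, b):
    """Stand-in for the 6809 MUL: unsigned 8-bit x 8-bit -> 16-bit."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b

def mul16_from_mul8(x, y):
    """Unsigned 16x16 -> 32-bit product built from four 8x8 multiplies."""
    xh, xl = x >> 8, x & 0xFF
    yh, yl = y >> 8, y & 0xFF
    result = mul8(xl, yl)            # low  x low, weight 2^0
    result += mul8(xl, yh) << 8      # low  x high, weight 2^8
    result += mul8(xh, yl) << 8      # high x low, weight 2^8
    result += mul8(xh, yh) << 16     # high x high, weight 2^16
    return result

print(hex(mul16_from_mul8(0xFFFF, 0xFFFF)))
```

Signed operands would need an extra correction step, which is not shown.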

I don't know, however, how competitive this was with other 8 bit CPUs at that time.

Kevin Cozens · Answer 5 · 2022-05-19T02:50:50.127

My first thought was a 6809 as it is a true 8-bit microprocessor but you want "best performance" as measured in wall-clock time. A better choice would be a 68008 as it has 32-bit registers so it could easily handle 16x16 multiply and 32-bit add/subtracts.

I think of "best performance" in number of clock cycles to do the job. If you are only measuring "performance" in terms of the wall clock, the faster the clock the less time it will take. You could put a 68008 core into an FPGA and crank up the clock speed for the operation to take less time on the clock.

It could be argued that you don't have a real microprocessor if you just have a core in an FPGA, but it would behave the same as the real thing, only a whole lot faster.

If you broaden the definition of "microprocessor" to other devices that can have an 8-bit data bus then you open it up to the use of CPLDs, FPGAs, or DSPs.

Thanks for the response. I had the 6809 and 68008 in mind too, when I asked the question. You are right - A DSP coprocessor was the historical solution for this type of use-case, even with 8-bit systems. — Brian H, May 19 '22 at 12:27

score 1 · Answer 6 · answered May 15 '22 at 01:27

You could certainly consider a co-processor, e.g. the AMD 29516.

The MPY16 series of multipliers can get you down to the ~50ns per operation level.

Compatible multipliers were made by Analog Devices (ADSP1016, 40-50ns at 150mW) and LOGIC (LMU16/216) in CMOS, by Weitek (WTL1516/A/B, 50-100ns at 0.9-1.8W) in NMOS, by Synertek (SY66016, 100ns at 1.5W) in HMOS, by AMD (Am29516, 38ns at 4W) in ECL, as well as many others.

The interfaces on these chips are quite wide, but an 8-bit bus can still connect to them, although more slowly.

user3528438

The 29K series weren't contemporary with 8-bit machines. The 2900 series were, though. Atari used AMD2901s in their arcade machine for fast mathematics – scruss May 15 '22 at 15:31
@scruss - how wide was their multiplier of 2901s? Just curious. – davidbak May 18 '22 at 23:41
16 bits, so 4 chained 2901s: Atari Math Box – scruss May 18 '22 at 23:51

score 1 · Answer 7 · answered May 18 '22 at 14:26

Not sure about calling an 80486 an 8 bitter even if you can throttle external access to 8 bits. For a 'real' 8051 job the Silabs EFM8 series aren't too shabby. I've run them at 72 MHz and the multiply instruction takes 4 cycles, but that's only 8 by 8 bits.

score 1 · Answer 8 · answered May 20 '22 at 15:04

I grew up with the Z80 on CP/M systems but occasionally used the 6502 on Apple IIs and BBC micros. Over time I generally found the Z80 extensions to the 8080 instruction set saved a few bytes but oddly not CPU cycles, so I tended to stick to the 8080 set. And curiously, although the paucity of registers on the 6502 and its slower clock speed made it look a lot slower, in fact with the right architecture it could be astonishingly fast (there is an article on this about the history of ARM). The culmination of my Z80-CP/M experience used the Hitachi SB180 (which I think could run at 6MHz) - I suspect that would be one of the top contenders for this (apart from using a dedicated multiplication unit). I wrote a bunch of arithmetic routines for this as part of a project I was working on and spent a lot of time optimizing them. (If they're of interest, I still have them, but only as the scan of a printout.)

In those days we built our own computers and I designed (but sadly never implemented) a hardware multiplier based on quarter-squares stored in EPROMs which would have been lightning-fast. Happy days....
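For readers unfamiliar with quarter-square multiplication: it relies on the identity a·b = ⌊(a+b)²/4⌋ − ⌊(a−b)²/4⌋, so a multiply becomes two table lookups and a subtraction. The Python sketch below illustrates the idea only; the answer gives no details of the actual design, so the 8-bit operand width, the 511-entry table and the names `QUARTER_SQUARES` and `quarter_square_mul8` are assumptions made for this example.

```python
# Table of floor(n*n / 4) for n = 0..510 (the largest sum of two 8-bit values).
# The largest entry is 65025, so every entry fits in 16 bits, e.g. one pair of
# byte-wide EPROMs in a hardware version.
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]

def quarter_square_mul8(a, b):
    """Unsigned 8-bit x 8-bit multiply via two lookups and one subtraction."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]

print(quarter_square_mul8(200, 123))
```

The floors cancel exactly because a+b and a−b always have the same parity, so no rounding error creeps in.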

score 0 · Answer 9 · answered May 21 '22 at 14:40

Maybe you could have a look at the Motorola 68008; its architecture and machine language had a much better reputation than Intel's in the 80s. Another point is that it runs at a higher frequency than the 6809. The 68008 is the 8-bit-bus version of the 68k; it was used in the Sinclair QL, while the 68000 was used in Sega's Genesis (Mega Drive) console and in Amiga and Atari 16-bit machines.
