Motorola 6809 Stack Blasting

Question

I've seen several mentions of a "Stack Blasting" technique for copying memory quickly on the 6809. What is it? How does it work?

Stephen Kitt · Accepted Answer · 2016-05-02T16:28:12.407

The 6809 has a couple of instructions which provide very quick ways of pulling (popping) and pushing registers off and onto the stack, PULS/PULU and PSHS/PSHU. PSHx only takes 5 cycles plus one per byte, which is much faster than other methods of writing to memory. So by pointing the stack registers (U and/or S) at the area of memory you want to read from and/or write to, you can quickly access memory.

This works particularly well for copying memory, at least in cases where the layout is amenable: read into as many registers as possible using PULS or PULU (you get to specify which registers are used, the CPU doesn't force you to push or pull all of them), and write using PSHU or PSHS. You'd set S to point to the memory to be read, and U to the address to write to, or vice versa; the pull and push instructions increment and decrement the stack registers as appropriate. The 6809 has three general-purpose 16-bit registers, D (also known as A and B combined), X and Y, and one register that can often be spared in these kinds of loops, DP; using this technique you can copy 7 bytes in 24 cycles:

; point S at the memory to be read
; point U at the target
PULS D,X,Y,DP     ; read 7 bytes into D, X, Y, DP, and increment S
PSHU D,X,Y,DP     ; write D, X, Y, DP to memory and decrement U

This compares to the basic memory copying technique, via the U stack pointer, which takes 16 cycles per word (two bytes) — five cycles per load/store, and three cycles per index increment:

; point X at the memory to be read
; point Y at the target
LDU ,X++          ; read 2 bytes into U and increment X
STU ,Y++          ; write U to memory and increment Y

So stack blasting takes less than 4 cycles per byte, whereas LDU/STU takes 8 cycles per byte... Furthermore, since stack blasting copies more memory per instruction pair, there's less overhead involved with loops or loop unrolling!

There are of course some caveats. In particular, any interrupts which occur during this stack blast will affect the values being copied, since the interrupt will use the "stack" that's been set up for the copy. And since this is stack-based, reads and writes move in opposite directions... These two factors mean that stack blasting is only really useful for writing graphics to the screen: the source material can be laid out appropriately in advance, and changes caused by interrupts would only result in temporary artifacts on screen.

Tom Moertel has a great write-up of this technique, including more detailed comparisons with other memory-copying techniques (offset load/stores instead of incrementing load/stores for example). He shows how he went from 157 cycles to copy 14 bytes using a typical loop, to 120 cycles with an unrolled loop, to 98 cycles with an unrolled, fixed-offset copy, down to 56 cycles using stack blasting.

How about comparing the number of cycles it takes to copy 14 bytes with Stack Blasting vs. 14 bytes with normal auto-increment/-decrement indexed or indexed with offset? That would demonstrate the tremendous speed-up of this technique. — blubberdiblub, May 01 '16 at 21:36
I think the caveat regarding the opposite read and write direction is very serious thing and makes this optimization not generally usable as a replacement for common memory copy. As mentioned in the referenced article this special behavior fits well only in special cases like graphic manipulations where fixed regions (tiles) has to moved into graphic memory. Such tiles can be prepared in such that they are stored in opposite order. — Johann Klasek, May 02 '16 at 15:37
Thanks @JohannKlasek; I started out explaining stack blasting, and got carried away — it isn't generally applicable for any kind of memory copying, as my answer would have led readers to believe. I've updated my answer to clarify that. — Stephen Kitt, May 02 '16 at 16:29
the 65816 has essentially the same trick available, and the answer to the "render Super Hi-Res quickly" question has much detail. — peter ferrie, Feb 25 '17 at 00:43
If one needed in-order copy, how would time be affected if one used auto-patched code for something like pulu d,x,y / xcg d,y / leau u+offs1 / pshu d,x,y / leau u+offs2? It wouldn't be as fast as using just pull and push instructions, but it would still handle 6 bytes per pass. — supercat, Apr 27 '18 at 20:04
Actually, thinking about it, the xcg isn't needed. So for the code in the article, indicated code, each sequence of one short lea plus two pairs of pulu/pshs would need three extra lea instructions plus one extra cycle on what had been the short lea. So 56 cycles per 14 bytes would become 84 without having to disturb the stack pointer. That's an increase from 3.5 cycles/byte to 6, but still better than 8. — supercat, May 09 '18 at 19:01

Motorola 6809 Stack Blasting

1 Answers1

Linked