I've seen several mentions of a "Stack Blasting" technique for copying memory quickly on the 6809. What is it? How does it work?
1 Answers
The 6809 has a couple of instructions which provide very quick ways of pulling (popping) and pushing registers off and onto the stack, PULS/PULU and PSHS/PSHU. PSHx only takes 5 cycles plus one per byte, which is much faster than other methods of writing to memory. So by pointing the stack registers (U and/or S) at the area of memory you want to read from and/or write to, you can quickly access memory.
This works particularly well for copying memory, at least in cases where the layout is amenable: read into as many registers as possible using PULS or PULU (you get to specify which registers are used, the CPU doesn't force you to push or pull all of them), and write using PSHU or PSHS. You'd set S to point to the memory to be read, and U to the address to write to, or vice versa; the pull and push instructions increment and decrement the stack registers as appropriate. The 6809 has three general-purpose 16-bit registers, D (also known as A and B combined), X and Y, and one register that can often be spared in these kinds of loops, DP; using this technique you can copy 7 bytes in 24 cycles:
; point S at the memory to be read
; point U at the target
PULS D,X,Y,DP ; read 7 bytes into D, X, Y, DP, and increment S
PSHU D,X,Y,DP ; write D, X, Y, DP to memory and decrement U
This compares to the basic memory copying technique, via the U stack pointer, which takes 16 cycles per word (two bytes) — five cycles per load/store, and three cycles per index increment:
; point X at the memory to be read
; point Y at the target
LDU ,X++ ; read 2 bytes into U and increment X
STU ,Y++ ; write U to memory and increment Y
So stack blasting takes less than 4 cycles per byte, whereas LDU/STU takes 8 cycles per byte... Furthermore, since stack blasting copies more memory per instruction pair, there's less overhead involved with loops or loop unrolling!
There are of course some caveats. In particular, any interrupts which occur during this stack blast will affect the values being copied, since the interrupt will use the "stack" that's been set up for the copy. And since this is stack-based, reads and writes move in opposite directions... These two factors mean that stack blasting is only really useful for writing graphics to the screen: the source material can be laid out appropriately in advance, and changes caused by interrupts would only result in temporary artifacts on screen.
Tom Moertel has a great write-up of this technique, including more detailed comparisons with other memory-copying techniques (offset load/stores instead of incrementing load/stores for example). He shows how he went from 157 cycles to copy 14 bytes using a typical loop, to 120 cycles with an unrolled loop, to 98 cycles with an unrolled, fixed-offset copy, down to 56 cycles using stack blasting.
- 121,835
- 17
- 505
- 462
pulu d,x,y / xcg d,y / leau u+offs1 / pshu d,x,y / leau u+offs2? It wouldn't be as fast as using just pull and push instructions, but it would still handle 6 bytes per pass. – supercat Apr 27 '18 at 20:04xcgisn't needed. So for the code in the article, indicated code, each sequence of one shortleaplus two pairs of pulu/pshs would need three extraleainstructions plus one extra cycle on what had been the short lea. So 56 cycles per 14 bytes would become 84 without having to disturb the stack pointer. That's an increase from 3.5 cycles/byte to 6, but still better than 8. – supercat May 09 '18 at 19:01