Size-optimization of a 256-byte "LDIR" adapter

Question

I'm working on tweaking ABC-800 BASIC II machine code, and there's a routine there that does a "256-byte" equivalent of LDIR. B is not used, C=0 transfers 256 bytes, C=1 transfers 1 byte, etc.

The existing code is:

0000                             ; C - length (0=256 bytes)
0000                             ; DE - destination
0000                             ; HL - source
653D                          .ORG   653dh   
653D   ED A0        BLKTF:    LDI      
653F   AF                     XOR   A   
6540   B1                     OR   C   
6541   20 FA                  JR   NZ,BLKTF   
6543   C9                     RET

That is nice, compact code, but obviously using LDIR could be faster. The best I could come up with is one byte longer:

653D                          .ORG   653dh   
653D   AF           BLKTF:    XOR   A  ; clear A 
653E   B9                     CP   C   ; CF=0 if C was zero
653F   3F                     CCF      ; CF=1 if C was zero
6540   17                     RLA      ; rotate CF into A (A=1 if C was zero)
6541   47                     LD   B,A ; now BC = 1..256
6542   ED B0                  LDIR      
6544   C9                     RET

Is there a way to do better, retaining the same API?

LDIR is famously slow. It may make code clearer, quicker to write, or possibly a smaller binary, etc. I do not know if it would be slower than the code you're starting with though. — hippietrail, Jan 06 '24 at 06:19
It will not be slower. The original code takes 16+4+4+12 cycles per byte, while LDIR takes 21 (the rest is executed once and takes 4+4+8)... So it's slower only for C=1. — Selvin, Jan 06 '24 at 06:46
See How fast is memcpy on the Z80?. And from this article on hacker news: On the Z80, LDIR is not as fast as unrolling the loop to produce a block of LDI instructions. You can jump into the middle of the block at count % blocksize to copy sizes which are not a round number. This was a fairly common game trick - Oh, that's actually a circular link. — Greenonline, Jan 06 '24 at 23:43
Does this answer your question? How fast is memcpy on the Z80? — Greenonline, Jan 06 '24 at 23:45
@Greenonline Not really. Those cases are about LDIR and generic memory copy. This is about a very specific case with an 8-bit length and 00 equalling 256. Also, it's about code size, not/less about speed optimization. — Raffzahn, Jan 07 '24 at 01:10
@Raffzahn Yes, quite so. This code is a part of a heavily size-optimized tokenizing basic interpreter. Everything that’s not active at runtime had to be squeezed down so that the byte code vm could do its job fast. This particular basic doesn’t even store source code, just token-like byte code. It has fairly impressive runtime performance, handily beating every other Z80 built-in basic I know of. Such an engineering gem, and basically unknown outside of Scandinavia. I’m having fun reverse engineering it and improving it slightly. — Kuba hasn't forgotten Monica, Jan 07 '24 at 21:49
@Kubahasn'tforgottenMonica No doubt. The ABC 800 were quite nice machines. Just unobtainium down here. Otherwise I would for sure have one in my collection. I did work with one in a project ca. 1980. I remember the BASIC being rather comfortable. — Raffzahn, Jan 07 '24 at 23:26

score 12 · Accepted Answer · edited Jan 06 '24 at 17:59

You can go with:

653D                          .ORG   653dh   
653D   06 00                  LD    B, 0   
653F   0D                     DEC   C   
6540   03                     INC   BC   
6541   ED B0                  LDIR   
6543   C9                     RET

or

653D                          .ORG   653dh   
653D   AF                     XOR   A   
653E   47                     LD    B, A   
653F   0D                     DEC   C   
6540   03                     INC   BC   
6541   ED B0                  LDIR   
6543   C9                     RET

When C is 1-255, DEC C leaves it at 0-254, and INC BC then restores its previous value with B still 0. However, if C is 0, DEC C makes it 255. INC BC then overflows C and carries into B, which we set to 0 at the beginning, so BC becomes 0100h (256).
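
The reasoning above can be checked exhaustively for all 256 values of C. This is only a minimal Python sketch of the register arithmetic, not Z80 code. The helper name `bc_after_setup` is made up for illustration, and it models nothing beyond the 8-bit wrap of DEC C and the 16-bit increment of INC BC.

```python
def bc_after_setup(c: int) -> int:
    """Model LD B,0 / DEC C / INC BC and return the resulting BC.

    Hypothetical helper for illustration; flags are not modelled.
    """
    b = 0                      # LD B,0
    c = (c - 1) & 0xFF         # DEC C: 8-bit decrement, 0 wraps to 255
    bc = ((b << 8) | c) + 1    # INC BC: 16-bit increment, carries into B
    return bc & 0xFFFF


# Under the routine's API, C=0 means 256 bytes, otherwise C bytes.
for c in range(256):
    expected = 256 if c == 0 else c
    assert bc_after_setup(c) == expected
```

The loop passing for every value is what lets LDIR see exactly the byte count the original routine would have copied.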

Final note: are you sure that you can change B?

answered Jan 06 '24 at 05:45 by Selvin · edited Jan 06 '24 at 17:59 by Greenonline
Dang, this is awesome! Thank you so much! I assume that B could change since in the original code the C=0 case always modifies B. I'll look at the call sites to make double sure. Thankfully Ghidra makes that easy. I tried the DEC BC, INC C route - so close :) Now I know to always try it multiple ways :) – Kuba hasn't forgotten Monica Jan 06 '24 at 15:51

Oh yes, I forgot that LDI can also modify BC – Selvin Jan 06 '24 at 17:14

My attempt was `XOR A / LD B,A / LDI / RET PO / LD B,A / LDIR / RET`, which I think would be a little faster, since "LDI / RET PO" would take the same amount of time as one iteration of "LDIR", but only need one 4-cycle instruction as "cleanup", eliminating the need to do a six-cycle "INC BC". – supercat Jan 08 '24 at 17:53
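
The timing claims in this thread can be tallied with a rough Python sketch. It uses the documented Z80 T-state counts and assumes no wait states or other machine-specific slowdowns, which may not hold on real ABC 800 hardware. The function names (`original`, `dec_inc`, `ldi_ret_po`) are made up for illustration, and `n` is the number of bytes copied (1..256).

```python
# Documented Z80 T-state counts for the instructions involved.
LDI, LDIR_REPEAT, LDIR_LAST = 16, 21, 16
XOR_A, OR_C, LD_B_A, LD_B_N, DEC_C, INC_BC = 4, 4, 4, 7, 4, 6
JR_TAKEN, JR_NOT_TAKEN = 12, 7
RET, RET_CC_TAKEN, RET_CC_NOT_TAKEN = 10, 11, 5


def ldir_cost(n: int) -> int:
    """LDIR copying n bytes: 21 T-states per repeat, 16 for the final byte."""
    return LDIR_REPEAT * (n - 1) + LDIR_LAST


def original(n: int) -> int:
    """LDI / XOR A / OR C / JR NZ loop, then RET."""
    return (LDI + XOR_A + OR_C) * n + JR_TAKEN * (n - 1) + JR_NOT_TAKEN + RET


def dec_inc(n: int) -> int:
    """LD B,0 / DEC C / INC BC / LDIR / RET."""
    return LD_B_N + DEC_C + INC_BC + ldir_cost(n) + RET


def ldi_ret_po(n: int) -> int:
    """XOR A / LD B,A / LDI / RET PO / LD B,A / LDIR / RET."""
    head = XOR_A + LD_B_A + LDI
    if n == 1:
        return head + RET_CC_TAKEN          # BC hit 0, so RET PO returns
    return head + RET_CC_NOT_TAKEN + LD_B_A + ldir_cost(n - 1) + RET


for n in (1, 2, 16, 256):
    print(n, original(n), dec_inc(n), ldi_ret_po(n))
```

Under these assumptions the DEC C / INC BC version is slower than the original loop only for a single byte (43 versus 41 T-states), and the LDI / RET PO variant comes out 5 T-states ahead of it for two or more bytes, at the cost of a longer routine.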
