24

I just came across this amazing 1976 article by Woz. In it he describes a relatively complete floating-point system for the 6502 with a 32-bit format (similar to earlier MS code).

I understand the code, mostly, but I am curious about its performance. I know that a lot of it runs through FMUL and thus one would expect that newer designs using unwound loops and/or self-modifying code would improve on that.

But given the constraints of the time, mostly memory and a desire to be read-only (for some machines anyway), has this code been greatly improved upon?

I have poked about a bit for benchmarks comparing this code to MS's version, but have not found anything applicable - the Rugg/Feldmann would do it but the only numbers I see are for MS BASIC vs. Integer, so Woz's FP code is not being run in either case.

Greenonline
  • 4,296
  • 2
  • 19
  • 57
Maury Markowitz
  • 19,803
  • 1
  • 47
  • 138
  • 6
    The code includes logic to work around a bug in the ROR instruction of the very first 6502 microprocessors, and would probably be more than twice as fast without that. – supercat Jun 24 '19 at 18:39
  • Wow, that's interesting. I was aware of this bug, but had not seen an estimate on the difference in time! – Maury Markowitz Jun 24 '19 at 18:40
  • 6
    Addresses $1F66 to $1F75 could be replaced by 12 bytes worth of LSR and ROR instructions with total execution time of 30 cycles. Instead, they run six iterations of a loop with an execution time of 25 cycles/iteration. That function probably represents the majority of the execution time of a floating-point multiply. – supercat Jun 24 '19 at 19:36
  • Looking over the code for the 6502 versions of MS, this code was replaced in v1.1, which the Apple II used. – Maury Markowitz Jun 25 '19 at 10:56
  • Just to be petty about this: "how good" is to be interpreted in the objective sense of "how does its performance compare to other implementations?" – Tommy Jun 25 '19 at 22:51
  • 2
    … or how precise/repeatable/feature filled? The Rankin/Wozniak code doesn't handle NaNs or Infs and has no transcendental functions, so it's of limited general use today. It's also single precision, so cumulative rounding errors will build up quickly. – scruss Jun 26 '19 at 10:12
  • 1
    Good points scruss. Admittedly, I was mostly interested in pure performance against the other implementations like the one in MS. – Maury Markowitz Jun 26 '19 at 14:19

1 Answers1

10

Steve Wozniak wrote most of his software to be compact rather than fast, reflecting the constraints of affordable memory hardware of his time. That often resulted in contortions that made it run considerably slower than a speed-optimised implementation, such as the extensive reuse of the FMUL subroutine mentioned.

Home micros sold before 1980 typically came with 8K or 16K of ROM in total, which had to support all the features of both BASIC and native user programs. The early Apple machines were no exception.

FP routines written for a less space-constrained machine, such as the BBC Micro which often had 48K of ROM from the factory (16K MOS, 16K BASIC, 16K DFS), could be considerably faster due to the use of more specialised routines and more speed-optimised coding techniques that took up more space. The BBC Master capitalised on a 128K "MegaROM" (named because 128KB = 1 megabit) and the more capable 65C02 to further accelerate the FP and graphics routines.

It's hard to directly compare Woz FP and BBC Micro FP because they operate to different precisions - 4 and 5 bytes respectively - so the BBC Micro has to do more work to complete its calculations. Nevertheless, a Mandelbrot-based benchmark on different BASICs ported to a common (relatively fast) machine shows that the BBC implementation was still faster:

ehBasic: 172 seconds
Applesoft: 161 seconds
BBC Basic 1-3: 124 seconds
BBC Basic 4: 96 seconds

In the above table, ehBasic is effectively an expanded version of MS BASIC implementing a 5-byte FP format. BBC Basic 4 is the Master version using 65C02 instructions.

Kaz
  • 8,086
  • 2
  • 38
  • 81
Chromatix
  • 16,791
  • 1
  • 49
  • 69
  • Erm... the Question is not about Applesoft - and especially not about the BBC - but Woz' FP routines. So none of the text of this 'answer' nor the benchmarks are related to the question in any way. – Raffzahn Jun 29 '19 at 00:39
  • 1
    As far as I'm aware, Woz' FP routines were used in Applesoft BASIC. The question effectively asked for a comparison with other FP implementations. – Chromatix Jun 29 '19 at 00:44
  • 3
    Nop, Applesoft uses Microsoft's FP routines (1+4 byte format), while Woz' implementation is based on 32 bit (1+3 byte) representation. They where only used with Integer BASIC and Assembly routines. And while the question is about comparsion, it's useless to write mostly about a different BASIC when there is no relation and no comparsion at all. – Raffzahn Jun 29 '19 at 00:51
  • In that case, the answer could be completed with a direct comparison between the Woz routines and one of the others in the list. – Chromatix Jun 29 '19 at 00:56
  • 1
    @Chromatix Could you do that comparison, then? As it stands, this answer doesn't really answer the question. – wizzwizz4 Jun 29 '19 at 05:43
  • 1
    I don't really have time today, but I'll see what I can organise. – Chromatix Jun 29 '19 at 05:46
  • The MsBasic for 6502 has options for either 3+1 or 4+1 FP so a direct comparison should be possible. Just requires some work. – PeterI Jul 05 '19 at 09:03
  • Most excellent Chroma! And it is very interesting to see the improvement for the 65C02. Can you offer any insight on why this may be? With the exception of STZ and perhaps the zero-page jump table addressing, I'm not sure why it would be better? – Maury Markowitz Jul 06 '19 at 13:23
  • The 65C02 includes several extra instructions and addressing modes which are very useful in practice. Looking at http://8bs.com/basic/basic4-a6cf.htm (scroll down), the FMUL routine uses PHX and PLX, instructions which on the NMOS 6502 have to be emulated with TXA:PHA and PLA:TAX respectively, clobbering the accumulator which requires more instructions to save and restore. In other cases the indirect mode without indexing can save having to clobber the X or Y register, when the ZP pointer is correct as-is. – Chromatix Jul 06 '19 at 15:44
  • Ah yes chrom, that makes sense. And some real space savings too. I’ll be INCA makes it’s appearance too, it is amazing that wasn’t in the original. – Maury Markowitz Jul 11 '19 at 14:12
  • @Chromatix: On any particular invocation of that code, would any particular PHX instruction end up pushing things to more than one address, or could PHX/PLX be replaced with STX/LDX, which are larger, but only require six cycles per pair rather than seven? – supercat Apr 30 '21 at 16:20
  • @supercat More importantly, STX/LDX would require reserving space in scratchpad RAM which could be rather constrained. The six-cycle pairing implies zero page which was very crowded; there was more space elsewhere, but that would require eight cycles per pair. – Chromatix Apr 30 '21 at 18:29
  • @Chromatix: Most programs have some zero-page locations that are used for temporary storage, whose value wouldn't need to be preserved across a floating-point multiply. Even if no such storage is available, it may in many cases be reasonable to load and push the contents of a ZP address and restore it aftewards. That would add 13 cycles per byte to the outside of the function, but save a cycle for each loop iteration involving the byte. – supercat Apr 30 '21 at 19:31
  • @supercat That sounds… marginal. You need at last 14 iterations of that loop before it wins even a single cycle You have to weigh the cost in code size, too - which was also constrained. Some of the space saved by using 65C02 instructions was used to add new features. – Chromatix Apr 30 '21 at 19:39
  • @Chromatix: Maybe it's my Atari 2600 background, but I have a bit of a soft spot for the "classic" 6502, and also tended to favor load/store zp over pla/pha for performance reasons. I wonder how the 65C02's design might have been different if the designers had checked to see what the 6502 did with undocumented opcodes and sought to imitate the behavior of the more useful ones like LAX, SAX, and DCP? – supercat Apr 30 '21 at 19:49
  • @Chromatix: Also, whether or not the trick is worthwhile in this particular case, there are many cases where storing a bunch of zero-page locations on entry to a function and restoring them afterward may allow code to be more efficient than would be possible without using zero page. – supercat Apr 30 '21 at 22:21
  • As a minor added insight: many compilers from that area (e.g. small c) contained "optimizers" whose purpose was to reduce code size by e.g. replacing common 3+ insn sequences by subroutine calls, something we nowadays would call a definite deoptimisation. This shows that, at the time, space was at a premium. – Remember Monica Jul 19 '22 at 21:27