16

(question copied from SO (from the days before RC.SE existed): http://stackoverflow.com/q/31877835/477476)

Zero-page memory maps of the PET that I've found claim that the zero-page address range $00C2..$00D9 is used for static data, e.g. http://www.classiccmp.org/dunfield/pet/petmem.txt says:

 RIDATA 00C2        Cassette Temp (64#00AA) read flags: 0=scan,
                    1-15=count, $40=load, $80=end of tape marker
 RIPRTY 00C3        Cassette Short Cnt (64#00AB): counter of seconds
                    before tape write / checksum
 PNT    00C4-00C5   Pointer: Current Screen Line Address
 PNTR   00C6        Cursor Column on Current Line
 SAL    00C7-00C8   Pointer: Tape Buffer/ Screen Scrolling
 EAL    00C9-00CA   Tape End Addresses/End of Program
 CMP0   00CB-00CC   Tape Timing Constants
 QTSW   00CD        Flag: Editor in Quote Mode, $00 = NO
 BITTS  00CE        Cassette Temp (64#00B4): Tape read timer flag
                    =IRQ enabled for Timer 1
        00CF        End of tape read
        00D0        Read character error
 FNLEN  00D1        Length of Current File Name
 LA     00D2        Current Logical File Number
 SA     00D3        Current Secondary Address
 FA     00D4        Current Device Number
 LNMX   00D5        Physical Screen Line Length
        00D5        4.80: right side of window
 TAPE1  00D6-00D7   Pointer: Start of Tape Buffer
 TBLX   00D8        Current Cursor Physical Line Number
 DATAX  00D9        Current Character to Print

However, looking at the ROM disassembly, one can find places where the address $00C2 is jumped to, e.g. http://www.zimmers.net/anonftp/pub/cbm/firmware/computers/pet/d/rom-1.html#C70A :

 C70A  4C C2 00             JMP iC2       

Looking at a disassembly starting at $00C2 after booting up a PET, I can see reasonable-looking code:

.C:00c2  E6 C9       INC $C9
.C:00c4  D0 02       BNE $00C8
.C:00c6  E6 CA       INC $CA
.C:00c8  AD 00 04    LDA $0400
.C:00cb  C9 3A       CMP #$3A
.C:00cd  B0 0A       BCS $00D9
.C:00cf  C9 20       CMP #$20
.C:00d1  F0 EF       BEQ $00C2
.C:00d3  38          SEC
.C:00d4  E9 30       SBC #$30
.C:00d6  38          SEC
.C:00d7  E9 D0       SBC #$D0
.C:00d9  60          RTS

What is this area used for? Where is the code that assembles this program into this area? What is this code supposed to do? (It seems to be scanning the area starting at $0400 for ':' and digit characters?)

Cactus

2 Answers

14

It's part of the BASIC interpreter loop. It reads one byte of the tokenized BASIC program, setting the zero flag if it's a colon or a zero byte and clearing the carry if it's a digit. You can see it used in the main part of the interpreter loop at address C6B5.
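To make the flag behavior concrete, here is a rough Python model of what the routine at $00C2 computes (the function name and calling convention are mine, not from the ROM; it mimics the 6502 flag outcomes rather than the instruction sequence):

```python
# Rough model of the PET's zero-page CHRGET routine: advance the text
# pointer, skip spaces, and report the flag state the 6502 code leaves.

def chrget(mem, txtptr):
    """Return (char, new_txtptr, zero_flag, carry_flag).

    zero_flag is set for a statement terminator ($00 or ':'),
    carry_flag is clear only for an ASCII digit '0'..'9'."""
    while True:
        txtptr += 1                       # INC $C9 / INC $CA (16-bit increment)
        ch = mem[txtptr]                  # LDA abs (the self-modified operand)
        if ch == 0x20:                    # CMP #$20 / BEQ: skip spaces
            continue
        zero = ch in (0x00, 0x3A)         # zero byte or ':'
        carry = not (0x30 <= ch <= 0x39)  # carry clear only for '0'..'9'
        return ch, txtptr, zero, carry
```

For example, feeding it a space followed by the digit '5' skips the space and returns with carry clear, while a ':' or a zero byte returns with the zero flag set.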

I'm not sure why this routine was placed in zero page. It's a cycle (or rarely two) faster to use LDA $0400 over LDA ($C9),Y, but I can't see it actually making much difference.

I should also note that the ROM disassembly you're looking at appears to be for a BASIC 1.0 ROM, while the memory map you've referenced is for versions 2.0 and 4.0.

Here's what Mapping the Commodore 64 by Sheldon Leemon says about the equivalent C64 routine:

115-138   $73-$8A   CHRGET
Subroutine: Get Next BASIC Text Character

...

CHRGET is a crucial routine which BASIC uses to read text characters, such as the text of the BASIC program which is being interpreted. It is placed on zero page to make the routine run faster. Since it keeps track of the address of the character being read within the routine itself, the routine must be in RAM in order to update that pointer. The pointer to the address of the byte currently being read is really the operand of a LDA instruction. When entered from CHRGET, the routine increments the pointer by modifying the operand at TXTPTR (122, $7A), thus allowing the next character to be read.

Entry at CHRGOT (121, $79) allows the current character to be read again. The CHRGET routine skips spaces, sets the various flags of the status register (.P) to indicate whether the character read was a digit, statement terminator, or other type of character, and returns with the retrieved character in the Accumulator (.A).

...

As this is such a central routine, a disassembly listing is given below to provide a better understanding of how it works.

115 $73   CHRGET  INC TXTPTR   ; increment low byte of TXTPTR
117 $75           BNE CHRGOT   ; if low byte isn't 0, skip next
119 $77           INC TXTPTR+1 ; increment high byte of TXTPTR
121 $79   CHRGOT  LDA          ; load byte from where TXTPTR points
                               ; entry here does not update TXTPTR,
                               ; allowing you to read the old byte again
122 $7A   TXTPTR  $0207        ; pointer is really the LDA operand
                               ; TXTPTR+1 points to 512-580 ($200-$250)
                               ; when reading from the input buffer
                               ; in direct mode
124 $7C   POINTB  CMP #$3A     ; carry flag set if > ASCII numeral 9
126 $7E           BCS EXIT     ; character is not a numeral--exit
128 $80           CMP #$20     ; if it is an ASCII space...
130 $82           BEQ CHRGET   ; ignore it and get next character
132 $84           SEC          ; prepare to subtract
133 $85           SBC #$30     ; ASCII 0-9 are between 48-57 ($30-$39)
135 $87           SEC          ; prepare to subtract again
136 $88           SBC #$D0     ; if < ASCII 0 (57, $39) then carry is set
138 $8A   EXIT    RTS          ; carry is clear only for numeral on return

The Accumulator (.A register) holds the character that was read on exit from the routine. Status register (.P) bits which can be tested for on exit are:

Carry Clear if the character was an ASCII digit 0-9. Carry Set, otherwise. Zero Set only if the character was a statement terminator 0 or an ASCII colon, 58 ($3A). Otherwise, Zero Clear.
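The double subtraction at the end of the listing is worth a closer look: subtracting $30 and then $D0 (with SEC before each) subtracts $100 in total, a no-op on an 8-bit accumulator, while the second SBC's carry ends up clear exactly for the digit range. A quick exhaustive check in Python (hypothetical helper, not ROM code):

```python
# Model SEC / SBC #$30, SEC / SBC #$D0 on an 8-bit accumulator.
# Only characters below $3A ever reach this code (BCS EXIT filters
# out the rest), so checking $00..$39 covers every case.

def double_sbc(a):
    t = (a - 0x30) & 0xFF        # SBC #$30, carry preset by SEC
    result = (t - 0xD0) & 0xFF   # SBC #$D0, carry preset by SEC
    carry = t >= 0xD0            # 6502 carry = "no borrow occurred"
    return result, carry

for a in range(0x3A):
    result, carry = double_sbc(a)
    assert result == a                       # accumulator always restored
    assert carry == (not 0x30 <= a <= 0x39)  # carry clear for '0'..'9'
```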

Cactus
  • Answer copied from original SO question http://stackoverflow.com/a/31880470/477476 – Cactus Apr 20 '16 at 01:43
  • FWIW, Applesoft BASIC uses an identical routine on the Apple II (at address $B1). – fadden Apr 20 '16 at 17:30
  • And to be clear about the double-subtract: it restores the character to what it was originally, while setting the carry appropriately based on the range. – peter ferrie Feb 20 '17 at 18:57
7

In addition to the Community Wiki answer:

This routine, with entries CHRGET and CHRGOT, is not only part of the interpreter loop; it is used by every statement and function that has to "parse" further data from the BASIC text (program), e.g. for parameters. On some occasions the pointer TXTPTR will be stacked, typically while executing a DEFFN'd function.

Especially on older versions of CBM BASIC (which the PET uses), this routine is the place to extend (patch) the interpreter with additional commands (only with newer BASIC versions were hooks introduced to do this in a nicer way).

Regarding the placement of the routine in zero page: even though it uses self-modifying code (TXTPTR is the operand of the LDA instruction), I don't think speed was the reason for this. Maybe they wanted to keep the Y register untouched. Compare BASIC 7.0 on the C128, where this routine has been moved out of page 0 and contains additional bank-switching code. On the other hand, you may count on having Y=0 after a call to this routine, which may be valuable, too.

Cactus
Johann Klasek
  • I find the design a bit curious; if I were designing an interpreter, I'd use a two-byte pointer for the start of the current "line" and a byte for the offset, so as to take advantage of (zp),y mode. That would limit lines to 256 bytes, but that seems a common limit anyway. I wonder what motivated the MS-BASIC design? – supercat Jun 01 '16 at 18:25
  • To my knowledge, the MS interpreter was first written for the 8080. But even on that CPU you are not able to use a register exclusively for the TXTPTR handling. Too much overhead to save and restore the register for other needs - and the TXTPTR handling is rather small compared to the other stuff the interpreter has to do. Especially on a 6502 it would be a major design failure to have Y blocked for other routines. BTW, the interpreter has to be portable. It's not a good idea to rely on CPU specifics like indexed addressing modes not available on other CPUs (e.g. the 8080). – Johann Klasek Jun 03 '16 at 14:30
  • A design optimized for (ZP),Y mode would be a bit different, but the 6800, 6809, 8080, and Z80 have sufficiently varied register sets that an interpreter designed around one of those will be more efficient than one written for something else and then adapted. On the other hand, there are probably some more important speed-up opportunities available, like special-casing an exponent value of 00 for floating-point numbers as indicating 16-bit integers in the range -32768..32767. Having FP math routines check whether both arguments qualify and doing a conversion if only one does... – supercat Jun 03 '16 at 14:59
  • ...would slightly slow down computations in cases which actually use floating-point math, but would greatly speed up code that doesn't. Another big speed-up would be a simple generational GC: after the first GC cycle, keep a copy of the bottom-of-string pointer. On subsequent GC cycles, start at that mark rather than at top-of-memory, unless or until the amount of memory freed up is too small, (restart from top of memory in that case). That simple change could probably cut GC cost by 90% in many real-world programs. – supercat Jun 03 '16 at 15:02
  • GC does not address the overall performance of an interpreter (it mostly hits programs which use a lot of strings and have too little free space on the string heap). A constant loss of time is the inefficiency in looking up variables, jumping to a specific line, or translating constants into FP representation. Just replace the lame GC of CBM BASIC V2 with a buffering or a back-link version and everything is OK. I think the version you described above might be some kind of optimization, but it seems far from deterministic in its time behavior (maybe I didn't get what you mean). ;) – Johann Klasek Jun 05 '16 at 11:10
  • The programs I've seen that get royally globbered by GC performance typically have an array of strings that get assigned at program startup and never modified again. The effect of the enhancement would be to say that following a GC cycle, any strings that were alive at the end of that cycle will not be examined again unless or until it's necessary to reclaim memory from there. Ideally, the GC would count the total length of the strings in the upper area so it could determine whether the space that would be freed by processing it would be worthwhile, but... – supercat Jun 06 '16 at 15:13
  • ...even something simplistic like running a full GC cycle every 16 GC cycles would probably reduce GC overhead by more than 75%. – supercat Jun 06 '16 at 15:18
  • In my experience it is not typical to have large static string arrays (except for adventure games). My applications generally refill and scramble the array, so there is no easy assumption to be made about how the strings on the heap are used. Your proposal and its assumptions do not help in general situations. Why should I rely on a shaky assumption when a GC replacement on a buffer/back-link basis does well in all cases? – Johann Klasek Sep 14 '16 at 13:35
  • Even if the every-16th-GC-cycle strategy is used, running the full GC could take a lot of time, because a squared-order GC is not significantly faster if the heap is ordered (only if it is asserted that a given upper part of the heap is known to be "in order"). The computational effort to reorder the remaining heap will still be of roughly squared order. – Johann Klasek Sep 14 '16 at 13:50
  • GC efficiency is often a trade-off between the amount of space required versus performance. The 1980s MS-Basic GC approach is very light on space utilization, but the performance is horrible. A generational approach would improve GC performance in many (though not all) cases while using 2-6 extra bytes of RAM total (along with some ROM). Other approaches would be faster in almost every case except that they would need more RAM, and in some cases that could increase the frequency of GC enormously or prevent a program from working at all. – supercat Sep 14 '16 at 14:27
  • I remember a magazine article with a custom GC for the C64 and VIC-20 whose code fit in the tape buffer (192 bytes); it couldn't replace the slow GC that would occur if memory got full, but could be invoked before that. Supposedly the performance was very sensitive to the amount of slack space available at the time of invocation, so I would guess that it may have used an algorithm that on each step would copy a quantity of strings proportional to the slack space. Not sure exactly how it was implemented, though. – supercat Sep 14 '16 at 15:51
  • My replacement implementation for the C64 uses the buffer method and uses the 8 K beneath the Kernal ROM. The worst case, with over 2 hours runtime, takes only around 3 s. http://klasek.at/c64/gc/ (sorry, currently just in German). No restrictions in space for normal BASIC programs. Only if graphical data is placed at $E000 are you out of luck. – Johann Klasek Sep 15 '16 at 16:07
  • The back-link implementation needs 2 additional bytes on the heap for non-empty strings, but this version is generally twice as fast as the buffer method. It fits into the standard ROM, replacing the old routine! Both variants are faster in all cases - I never saw a situation where they were not. – Johann Klasek Sep 15 '16 at 16:09
  • By my understanding, given INPUT A$:B$=A$ the back-link approach would require that the second statement make a redundant copy of the string, while the MS approach would allow it to get by just copying the 3-byte descriptor. If one were willing to accept a change in how programs were stored, the efficiency of the MS approach could have been improved considerably if strings in memory were preceded by a length byte (descriptors could then omit the length byte; strings of 0 or 1 byte could be represented by pointers whose MSB was zero or 1 without using external memory). – supercat Sep 15 '16 at 16:52
  • If all strings in memory were at least 2 bytes and were preceded by a length byte, code which relocated each string could zero out the length byte and store the new address in the following two bytes. If a GC was started when there were 8K worth of strings and 1K of slack space, the system could find the ending address of the last string to fit entirely within the first 1K of utilized space and could scan all string descriptors and move to slack space any strings that were entirely within in the lowest 1K of used space to the slack space. Any unused slack space could then be merged... – supercat Sep 15 '16 at 17:01
  • ...with the space the strings had occupied, and the resulting slack space could be used to scan through the next bunch of strings. If one triggers a GC when slack space is 1/8 of the total, one would need 7 passes to move all the strings to the bottom of memory, and then one cleanup pass to relocate everything back up to the top (one could use a bulk move for that, and simply adjust all string pointers by a fixed displacement). – supercat Sep 15 '16 at 17:03
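For what it's worth, the generational idea discussed in these comments could be sketched as a toy model like this (Python; the data layout, function name, and policy threshold are entirely hypothetical, not from any actual implementation):

```python
# Toy model of the generational scheme: strings are kept oldest-first
# (top of memory first); entries that survived the last full pass are
# skipped on partial passes, and we restart from the top of memory only
# when a partial pass would reclaim too little.

def gc(heap, survivor_count, min_gain=2):
    """heap: list of (name, alive) pairs, oldest (top of memory) first.
    Returns (new_heap, new_survivor_count, did_full_pass)."""
    old = heap[:survivor_count]            # survived last full GC: skip
    young = heap[survivor_count:]
    kept = [s for s in young if s[1]]
    if len(young) - len(kept) < min_gain:  # not worth it: do a full pass
        kept = [s for s in heap if s[1]]
        return kept, len(kept), True
    return old + kept, survivor_count, False
```

A partial pass only compacts the "young" tail, so cost is proportional to the strings allocated since the last full collection rather than to the whole heap.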