24

I was looking over an old article on the 6809 and was perusing the opcodes and noticed that the branch instructions came in two flavors, long and short. That sparked a memory about one of the 6502-series CPUs that added similar long-branches - 65CE02 perhaps?

So, why are branches in most 6502 systems based on 8-bit relative addresses instead of absolute? One byte less in the opcode, but it would seem that would be just as useful to JMP, but BRA was only added sometime later. Performance? Isn't the add into the PC the same as reading another address byte?

Peter Cordes
Maury Markowitz
  • "One byte less in the opcode" - is very important when you only have a few K of memory. – tum_ Nov 23 '19 at 13:59
  • Related: Why do x86 jump/call instructions use relative displacements instead of absolute destinations? on SO. Interesting to note that MIPS does have section-absolute j instructions but most modern ISAs use relative branches for everything except indirect branches. On x86 there's a choice between short rel8 vs. near rel16 or rel32 jumps/branches just to save code-size. – Peter Cordes Nov 23 '19 at 23:05
  • @tum_ : not only memory, but time. From my experience working with 8 bit microcontrollers and having to decode weird and very fast signals at the inputs, the biggest bottlenecks were always branches. Having a branch take one clock cycle more or less can make all the difference between a certain CPU at a certain clock setting being suitable to a task or not. – vsz Nov 25 '19 at 07:05
  • Most processors have relative jumps, absolute is less useful and less used. This is not limited to 8 bit instruction sets. – old_timer Feb 13 '20 at 12:32
  • @vsz: If the 6502 had branch instructions which loaded the low byte of the PC while leaving the high byte unaffected, such an instruction could likely have been processed in two cycles rather than three. If the family of branch instructions were augmented with ones that would branch into the next or preceding page, those could execute in three cycles rather than four. – supercat Dec 02 '21 at 17:32

4 Answers

34

TL;DR:

It is all about making one of the most important instructions as performant as possible, while keeping everything manageable for the tools of the time (plus a little bit of dogma). Branching is thus the most optimized instruction of the whole 6502 design.

In addition, long branches are not really in demand (*1). Of the 116 branches used in the original KIM 6530-003 ROM, not a single one is followed by a JMP - which would be the case if any longer distance had to be covered. The same observation can be made in many other contemporary sources (*2).

So why add 8 (or 9) opcodes for something so exotic and of so little practical use?


The Long Read

Why were (are?) branches relative in most 8-bit CPUs?

Because it saves code space and covers the great majority of cases - long versions can easily be created by branching on the inverted condition around a long jump (absolute or relative), thus halving the number of code points needed (*3) at the expense of a few cycles in the rare cases where a long distance has to be covered.
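
As an illustration, a hedged sketch in 6502 assembly (label and target names are made up): a conditional branch to a target outside the short range is synthesized like this:

      BNE skip        ; inverted condition ("branch if NOT equal")
      JMP target      ; 3-byte absolute jump, reaches anywhere in the 64 KiB space
skip: ...             ; net effect: "branch if equal" to target, 5 bytes instead of 2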


[Insert]

Generic Considerations About Jumping and Branching

Beside the semantic separation of branching as the result of a decision and jumping as unconditional, the design of an instruction set is always about weighing alternatives and finding a middle ground.

Without any doubt, a system needs a way to set the PC to any location within the program address space. Usually called a Jump. Many systems offer in addition a Subroutine Call as special case with return address management, often accomplished by a specific Return instruction (*4).

In addition it makes quite a lot of sense to be able to easily change instruction flow based on conditions (*5). In a simple and symmetric world, this could apply to all three of the mentioned variations (Jump, Call, Return).

A classic example would be the 8080, which has all of them in both unconditional and conditional versions. Nice and symmetric, but it comes at a hefty cost of 27 opcodes. That's more than 10% of the instruction set.

Now, while conditional calling and returning is neat and may save code length, looking at real code shows that the vast majority of Calls and Returns are unconditional. So it might be acceptable to drop conditions from them and keep only jumps conditional - when needed, a Call or Return can be prefixed by a jump using the inverse condition. This reduces the needed code points to only 11.
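
Sketched in 6502-style mnemonics, since that's what the rest of this answer is about (labels are made up), a conditional Call then becomes:

      BCC skip        ; inverse of the intended "call if carry set" condition
      JSR routine     ; unconditional subroutine call
skip: ...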

Looking even closer reveals that the huge majority of conditional jumps only reach quite short distances, forward or back. Often just one or a few instructions. Thus a kind of short-distance jump could be used to save code size, an important measure. For most classic CPUs code length is also closely related to execution time (*6), so a shorter encoding speeds up execution - as well as reducing the penalty when prefixing unconditional instructions.

While there are CPUs with very short-distance branches, given the 8-bit nature of the code, an 8-bit signed offset seems like a good compromise here. And that's what set the 6800 apart from its 8080 counterpart: short branches that save code in most cases (as well as code points), and which can be used to synthesize other conditional instructions with an acceptable penalty.

Skips would be an even more generic approach, always skipping the next instruction when their condition is met. No offset is needed, so they'd be single-byte instructions needing only 8 code points in total (based on the 8080's conditions). Not only can any conditional jump be synthesized, but any instruction can be made conditional (*7). On the downside, all conditional instructions take additional time, and as soon as more than one instruction is to be skipped, it gets quite costly. So the trade-off is speed (in many standard cases) for simplicity and generality.

The last resort could be adding conditions to every opcode (like on ARM). It's like combining the original (8080) approach with the generality of skips - but now at the great expense of most of the code space.

Long story short, optimized short relative branches offer a trade-off among all of these, with an emphasis on the most common use cases.

[Back to the Question]


So, why are branches in most 6502 systems based on 8-bit relative addresses instead of absolute?

I guess you mean 16-bit relative as well.

It's all about the most-used case. Branches are for short-distance decision making: testing whether the upper half of a pointer needs to be incremented, a sign needs to be adjusted, or a compare succeeded.
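
A typical instance of that most-used case, as a small hedged sketch (ptr is assumed to be a 16-bit pointer in zero page):

      INC ptr         ; bump the low byte of the pointer
      BNE done        ; no wrap-around - the common case, a 2-byte forward branch
      INC ptr+1       ; low byte wrapped to $00, so carry into the high byte
done: ...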

Adding another set of long relative branches would have added at least 8 opcodes and some more decode-ROM space, but more important, it would always have required an additional cycle for the 16-bit offset calculation (plus the additional operand fetch).

Maybe most important for the 6502 philosophy, it would have required making the 8/16-bit offset decision detectable by the (very primitive) single-pass assembler they had in mind. The whole 6502 assembly syntax is designed so that the encoding can be determined entirely from the line being processed. With two encoding variations for relative branches, it would have required either:

  • a long opcode-name variant like BEL, BNL, BCSL ... oops, that breaks the 3-letter structure they wanted for their simple assembler, so it would end up as something quite ugly and hard to remember, or
  • some additional keyword like SHORT and LONG (*8), again breaking the syntax, or
  • adding some 'special' symbol for either case, like BNE !$02 for short or alike.

In all cases the burden of deciding which encoding is right is put on the programmer, who has to know ahead of time which to use. Of course, using

  • a two pass Assembler

could have resolved that - at the cost of making the assembler an order of magnitude more complex and costly - hard to do on a system as simple as the ones they were thinking about at the time. Simple, straightforward assembly was one of the major advantages compared to 'big' systems.

One byte less in the opcode,

That's the first and most important point, as it saves a whole byte from one of the most used instructions (see below) and a whole cycle of execution. And it's a cycle that more often than not gets saved in heavily utilized code sections - decisions when scanning a string, comparing a lot of alternatives in a case-like structure, or a tight counting loop.
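
For example, a hedged sketch of such a scanning loop (it assumes X starts at 0 and the string, terminator included, lies within a single 256-byte stretch):

scan:  LDA string,X   ; fetch the next character
       BEQ found      ; terminator? 2 cycles every time the branch is not taken
       INX
       BNE scan       ; keep scanning: 3 cycles when taken (no page crossing)
found: ...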

but it would seem that would be just as useful to JMP, but BRA was only added sometime later.

Admittedly the 6502 makers were somewhat into dogma (*9): branches had to be conditional, while jumps are unconditional (and absolute). So the handy BRA, which already existed on the 6800, got left out of the 6502 (*10). Sometimes one gets quite stuck during design.

Performance?

Yup, exactly that. Branches are essential to any computing - which is, after all, about decision making on many levels - and most of it comes down, at the lowest level, to sign, carry, and equal-or-not.

Looking at contemporary sources shows that branches are among the most used instructions. Within the KIM 6530-003 source there are 116 branches ...


Give yourself some time to let that sink in:

One Hundred and Sixteen Branch Instructions within a 1 KiB ROM.

That is 232 ($E8) bytes in length. Almost a fourth of the whole ROM is filled with branches.


... making branches the most prevalent group right after LDx (162, of which 108 are LDA) and STx (153, of which 115 are STA), on par with JSR and far ahead of everything else (*11). The combination of being one of the most used instructions (*12) and having good potential for optimization made branching the most desirable target for it.

In fact, the branch ended up being the most optimized instruction within the 6502. It's really worth studying its workings using the Visual 6502 emulator. Several special provisions have been added to make sure that its execution takes the minimum number of cycles: two when not taken and three when taken, unless a page is crossed. The latter optimization was, for example, not considered worthwhile for any of the indexed writes, where a page crossing could happen - those instructions always take the additional cycle.
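
The resulting timing, summarized as an annotated sketch (the label is arbitrary):

      BEQ near        ; 2 cycles if not taken (Z=0)
                      ; 3 cycles if taken and the target is in the same page
                      ;   as the following instruction
                      ; 4 cycles if taken across a page boundary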

Isn't the add into the PC the same as reading another address byte?

No, it's done in parallel, ahead of time (during PHI1), so no time is wasted calculating it.

Branches are one of the niftiest parts of the 6502 logic. I really suggest taking a few hours to go through their details on Visual 6502 to understand them.


*1 - This should already be clear from a design point of view: code needing long conditional branches is usually not well structured to start with.

*2 - For example, the much later (1978) and way bigger (8 KiB) MS-BASIC 1.1, used for the PET, OSI and Apple II, features only 23 cases of a jump following a branch - out of ~140 jumps in total.

*3 - Since a long jump (usually) always exists, branches are needed only once (as short).

*4 - Some designs go for the bare minimum, having only a single jump instruction that always delivers the previous PC (return address) in a special location (usually a register), and handle everything via this single instruction. Others have different jumps but no return, and so on. But for this answer we stay with what's common on mainstream 8-bit CPUs.

*5 - It is quite possible to do computation without changing instruction flow (branching), not least as used in Raúl Rojas' proof that the Z3 is a universal computer. Still, it easily gets out of hand even with small programs and is about as practical as a Turing machine for real-life computing.

*6 - This is not only true for the 6502, whose execution time corresponds quite closely to the number of memory cycles, but also for the Z80. Many of its really nice enhanced operations are not as great when it comes to execution time, due to the needed prefix adding 4 cycles just to mark them.

*7 - Quite handy for all the little cases where only a single instruction has to be bypassed, like incrementing a multi-byte pointer. With conditional jumps, incrementing a 16-bit pointer becomes something like INC; JNC +n; INC, costing 5 bytes of code with an absolute jump, 4 bytes with a relative one, but only 3 with a skip.

*8 - Oops, did anyone say x86?

*9 - Like the infamous 'real' indexing argument against the 6800's way, which brought two 8-bit 'real' indices at the cost of no base register, making structure handling on the 6502 rather clumsy.

*10 - I wouldn't have minded if it had simply been named JMP, as a short form, but then again, this would have collided with the single-pass assembly they had in mind.

*11 - The next most used groups are compares (CMx), way down at 39 instances (38 of them being CMP), and absolute jumps (JMP) at 31. It nicely reflects that most of a CPU's work is about shoveling data and branching accordingly.

*12 - Of course, this is just a quick static analysis; dynamic counts will differ - then again, since branches are part of each and every loop, they usually end up with an even bigger share.

Raffzahn
  • I don't know this particular CPU, but in general, "branches" contain a simple integer offset (from the current instruction), but "jumps" tend to support the full range of possibilities for memory operand specification in the particular instruction set - indexed, indirect, base-and-displacement, etc. – dave Nov 23 '19 at 19:45
  • I wonder if the Z80 instruction set was designed at a time when the part was expected to use an 8-bit ALU? The Z80's instruction set would make sense with an 8-bit ALU, and the decision to use a 4-bit ALU would have made sense without the added Z80 instructions, but the usefulness of many Z80 features is undermined by the ALU. JR, the Ix+d addressing modes, LDIR/LDDR, etc. all incur a two-cycle penalty as a consequence of the four-bit ALU. – supercat Nov 23 '19 at 20:22
  • @another-dave Well, I guess one could argue a lot about the semantics, and many have done so. But in general, branching is used if there are alternatives going from one point in different directions, like branches on a tree. A jump in contrast changes position without any doubt. Then again, this differentiation is only necessary on designs doing so. As usual, names are what people make of them. Hardware doesn't care for names. (And for the size argument, x86 uses JMP for everything from 8 bit REL to 32 bit absolute and 48 bit segmented, while on /370 everything is a branch.) – Raffzahn Nov 23 '19 at 20:30
  • I wonder whether branching could have been made faster if the 6502 had worked like the 1802, which leaves the upper 8 bits of PC unchanged while loading the lower 8. To be sure, relative branches offer many advantages, but if the data bus could be fed directly to the low byte of the PC, that could have saved a cycle, and would also have allowed for loops of up to 255 bytes (with the caveat that jumps would sometimes be necessary to go between loops). – supercat Nov 23 '19 at 22:20
  • @supercat for all cases where it doesn't change a page, the branches are already as fast as they can be. Reading two bytes takes two cycles, and that's the duration for not taken, and 3 when branching, which again is what it needs to fetch the next OP. Can't be any faster. – Raffzahn Nov 24 '19 at 01:22
  • @Raffzahn: x86 doesn't have any absolute direct near branches, only absolute segmented jumps (to a new CS:[E]IP) can have an absolute target as part of the machine code. (And yes I omitted CS:RIP on purpose.) It's only 32-bit and 48-bit absolute segmented that exist (jmp ptr16:16 and ptr16:32), or near jmp rel32 in 32 and 64-bit mode. Not jmp abs32. https://www.felixcloutier.com/x86/jmp. I know this is off topic and wasn't the point of your last comment; yes x86 calls everything JMP. MIPS is interesting where jumps are section-absolute, branches are relative – Peter Cordes Nov 24 '19 at 02:54
  • @Raffzahn: The 6507 requires that each half of the address bus come from a register, an ALU computation based upon previously-available values, or from a newly-fetched data bus value. When taking a relative branch, the computation of the address can't begin until the end of the second cycle. If a branch simply expressed the offset in the same page directly, the address LSB could be gated from the data bus directly without having to go through the ALU first, saving a cycle if the chip was designed for that. – supercat Nov 24 '19 at 06:53
  • @supercat Well, you for sure want to spend some time with the 6502 simulator to learn how the 2/3 cycle branch works :) – Raffzahn Nov 24 '19 at 11:09
  • @PeterCordes Jup. And then there is the indirect near jump, with target IP as 16 or 32 bit in reg or memory :)) I guess one could do a whole course about all the variations the execution sequence can change in an x86 system. – Raffzahn Nov 24 '19 at 11:18
  • @Raffzahn: I think I was going to mention that but ran out of space in a comment. Yes, if you want to jump to a specific absolute address you want mov reg, imm64 / jmp reg, if a call/jmp rel32 won't reach (64-bit mode problem only) or the machine code is PIC and can move relative to the destination. Not a whole course, but Call an absolute pointer in x86 machine code is a multi-page SO answer I wrote, only about user-space with only a small section on far jumps; nothing about trap gates and call gates :P – Peter Cordes Nov 24 '19 at 20:49
  • @PeterCordes Great write up. I like it. I guess it wasn't the right place to insert imm16 as well :)) – Raffzahn Nov 24 '19 at 22:04
  • I remember generating jump tables that could be invoked using rts. – Peter Smith Nov 25 '19 at 15:17
  • @Raffzahn: The design would need to be different, but that wouldn't necessarily imply more expensive. The instruction fetch following a two-cycle branch would need to have the low-order word of the address bus captured off the data bus even though PCL wouldn't hold the right value yet, but mechanisms already exist, used in ZP-addressing-mode instructions, to output a newly-fetched data byte directly on the low-order address bus. The difference here would be that the upper half of the address bus would come from PCH rather than being zeroed. – supercat Nov 25 '19 at 15:58
  • @Raffzahn: I would think that the circuitry required for 2/2 branching would be no more complicated than the circuitry for 2/3/4 cycle branching. At the start of cycle 2, the state of all flags and the opcode byte would be known, so a combinatorial branch/no-branch indicator could be computed well before the end of cycle 2 phase 1, and thus by the start of cycle 2 phase 2 the CPU would be able to know whether the low-order address bus for the next cycle should come from PCL or forwarded from the data bus (as it would be for a zero-page operation). – supercat Nov 25 '19 at 19:20
  • @Raffzahn: There is no way one could design a CPU with the 6502's bus timing and process constraints which could perform a relative jump faster than 2/3. I see no reason, however, that it would have been impractical to design a 6502-like processor whose branches specify the low-order portion of PC directly, and which could execute branches in 2/2 cycles, with the same worst-case internal timing constraints as e.g. lda zp or lda (zp),y, both of which must output a valid address on the third cycle using data that only became available at the end of the previous cycle. – supercat Nov 25 '19 at 20:46
  • @supercat Taking away the prefetch would just slow all execution. By now this gets rather useless, I'd say, go ahead, build your CPU and surprise us all with the result. – Raffzahn Nov 25 '19 at 20:51
14

On the 6502, the designers did this for efficiency. This is documented in the original MCS 6500 Microcomputer Family Programming Manual:

If one considers that the instruction JMP required three bytes, one for OP CODE, one for new program counter low (PCL) and one for new program counter high (PCH) it is seen that jump on carry set would also require three bytes. Because most programs for control require many continual jumps or branches, the MCS650X uses "relative" addressing for all conditional test instructions. To perform any branch, the program counter must be changed. In relative addressing, however, we add the value in the memory location following the OP CODE to the program counter. This allows us to specify a new program counter location with only two bytes, one for the OP CODE and one for the value to be added. (§4.1.1 p. 38)

This is obviously more space-efficient (one byte for the opcode and one byte for the 8-bit offset versus one byte for the opcode and two bytes for a 16-bit absolute address), but also (perhaps less obviously) more time-efficient when averaged out: a branch not taken is only two cycles (to read the opcode and offset) with the relative address, but still three cycles with an absolute address. (Taken is three cycles in both cases. The one exception is that a taken relative branch needs four cycles when it crosses a page boundary, due to the need for a second add for the carry from the low byte of the address, but that's relatively infrequent.)
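
Concretely, as a hedged sketch (the label and target address are arbitrary), the two encodings compare like this:

      BCS near        ; $B0 xx      - 2 bytes: opcode plus signed offset
      JMP $1234       ; $4C $34 $12 - 3 bytes: opcode plus PCL plus PCH;
                      ; a hypothetical "jump on carry set" with an absolute
                      ; operand would need the same 3 bytes as JMP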

It would be possible to try to achieve this with partial absolute addresses (e.g., by specifying only the lower eight bits of the absolute address and taking the upper eight bits from the PC, effectively using "the current page"), but that gets pretty complex for the programmer, because you'd need to know the absolute location of your code to avoid accidental jumps to the wrong page.¹

Another reason for having some sort of relative jumps is to allow creation of more easily relocatable code. Code that uses only relative jumps can be copied to another location and "just work"; code with absolute jumps must have those patched up for the new jump target locations. While having a limited set of relative branch instructions doesn't let you relocate arbitrary code, it still makes it easy to relocate small routines, which is valuable. There are, for example, not-infrequent cases where using self-modifying code on the 6502 can increase both speed and memory efficiency. Self-modifying code can't be run from ROM, but if you can easily copy small routines from ROM to RAM that opens up this technique for code intended to be in ROM.
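
A hedged sketch of that copy step (the names romsrc, ramdst and ROUTLEN are invented; the routine is assumed to be at most 255 bytes long and to use only relative branches internally):

      LDX #ROUTLEN    ; length of the routine, 1..255 bytes
copy: LDA romsrc-1,X  ; read a byte of the routine from its home in ROM
      STA ramdst-1,X  ; write it to the RAM copy
      DEX
      BNE copy        ; the copy loop itself needs only a relative branch
      JSR ramdst      ; the relocated routine "just works" at its new address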

Relocatable code wasn't a major priority on the 6502 (though I have little doubt that the designers did have in mind some support for this from the start—whatever they could fit in without adding cost to the design), but it was for the 6809, where you noticed that they'd added long branches. The MC6809 data sheet says in its very first paragraph that it "supports modern programming techniques such as position independence," and later in the discussion of long and short relative branches, "Position-independent code can be easily generated through the use of relative branching" (p. 20). Somewhere there's a larger discussion of Motorola's vision of building ROMs for specific machines from a large library of relocatable code, but I don't have a reference for that at the moment.


¹ That's more complex than it sounds on some CPUs. Consider a branch in the last two bytes of a page on a 6502: that puts the PC on the next page before it's used to calculate the branch address. There are ways of working around this, too, such as considering addresses in the "other half" of the page to be in the previous or next page, as appropriate, but now you're piling complexity on complexity.
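
A worked example of that wrinkle, showing two alternative encodings of a branch whose opcode sits at $10FE ($D0 is the BNE opcode; addresses are chosen arbitrarily):

      $10FE: D0 00    ; BNE with offset $00 - the offset is added to $1100
                      ;   (the PC after the operand), so the target is $1100
      $10FE: D0 FE    ; BNE with offset $FE (-2) - target is $1100 - 2 = $10FE,
                      ;   the branch's own opcode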

ecm
cjs
  • MIPS j instructions do work in this section-absolute way you're talking about as a hypothetical for 6502. (vs. relative bne etc). They replace the low 28 bits of the 32-bit PC, so they're absolute within a 256MB region. And yes, relative to the end of the jump = relative to the delay slot instruction address means it can jump into the next section. How to Calculate Jump Target Address and Branch Target Address?. But 256MiB is a lot less of a problem than 256B boundaries! And MIPS was designed to use virtual memory allowing easy choice of code address. – Peter Cordes Nov 24 '19 at 20:54
  • With a limited amount of memory space (64k addresses may sound a lot...), every byte was important and for that reason functions were kept small and efficient. In all the years I was writing assembly for the 6502 (and parts that used it as the core such as the CMD65151 communications processor) I never had the need for a conditional branch outside the relative address range. It was considered poor practice (function too large) if a conditional was outside this range at least in the places I worked at the time. – Peter Smith Dec 08 '19 at 14:42
  • "The one exception is that a taken relative branch needs four cycles when it crosses a page boundary, due to the need for a second add for the carry from the low bye of the address, but that's relatively infrequent." Which isn't an inherent issue at all, but only due implementation. – Raffzahn Dec 09 '19 at 00:16
  • @PeterSmith True enough, but the designers did address this anyway in the Programming Manual §4.1.4, stating that "longer programs will occasionally find it necessary to conditionally branch to a location that is significantly further away than the branch command will directly reach" and giving a technique to facilitate such branches. – cjs Dec 09 '19 at 00:17
  • @Raffzahn It was indeed an inherent issue unless you ignore the fact that the primary design goal of the 6502 was to produce an inexpensive processor. Going to significant extra expense to add a 16-bit adder for this would have worked directly against that design goal by adding significant expense for very little benefit. – cjs Dec 09 '19 at 00:26
  • @CurtJ.Sampson Erm, no. Working against a design goal is different from being inherent. For one, the 2/3cy is inherent to prefetching, thus can't be avoided unless the whole CPU is redesigned, but the case of page crossing doesn't require a 16 bit adder, as the PC already has two incrementers - separate for PCH and PCL. So all it needs is to feed the carry from the ALU into the PCH incrementer. Also, it's not really a minor benefit. Branches being the most important instruction group already justified several logic patches to make them work at 2/3. And they are worth it. – Raffzahn Dec 09 '19 at 01:11
  • @Raffzahn: The PCH incrementer would only have been useful for handling forward branches across page boundaries, and even there I'm not sure what the setup time was for the "increment/don't increment" input. If the 6502 had used branches in the style of the 1802, which specify the LSB of an address in the current page without wasting a cycle, it might not have been hard to make a "JNZ $FF" fetched from $1234 perform fetches from $1234 and $1235, and then either $1236 or $12FF based on the Z flag, but having the next cycle fetch from $1300 would have been tricky. – supercat Dec 09 '19 at 15:40
  • @supercat The difference between an up and an up/down counter is a mux per stage, which in turn is two transistors. So making the incrementer decrement in case of a back branch isn't a big deal at all - also, the decision of up or down can be taken directly from the high bit of the offset - ahead of time. There is also enough time to propagate the carry. – Raffzahn Dec 09 '19 at 18:21
  • @Raffzahn: When using a single-metal NMOS process, transistor count isn't nearly as important as routing. Further, while the upper bit of the operand could be used to distinguish between incrementing and decrementing, the carry out of the ALU would be needed to know whether or not the upper byte of the program counter would need to be modified at all, and that wouldn't be available until the end of the cycle before the code would need to be fetched using the new program counter value. – supercat Dec 09 '19 at 19:27
  • @Raffzahn This is yet again a situation where you should simply update your answer to include whatever you think is relevant about this rather than starting long arguments in comments on someone else's answer. Limit your need to always prove you're right and get in the last word to your own answers and we'll all be happier. – cjs Dec 09 '19 at 23:56
  • @CurtJ.Sampson Cool, except, it's not about my answer, but about a claim you added which simply doesn't add up. The same way my answer is not about you, but about the question asked. So, accept criticism as an offer and, please, don't start your usual fights. Keeping things straight and focused is for sure a great way to happiness, isn't it? – Raffzahn Dec 10 '19 at 00:08
  • Actually, your claim that they could have removed the extra cycle on cross-page branches at no extra cost is the one that doesn't stand up to scrutiny: the 6502 designers were not dumb and would not have left in the extra cycle if they hadn't gotten some corresponding benefit. Or perhaps you are just misusing "inherent," or deliberately misreading my answer. – cjs Dec 10 '19 at 04:55
  • @CurtJ.Sampson: The usefulness of making code relocatable would for many purposes be worth the extra cycle. The designers of the 6502 certainly didn't correctly predict how useful or useless all aspects of their design would end up being; the (zp,x) addressing mode, for example, required more circuitry and yet ended up being less useful than a non-indexed (zp) mode would have been. – supercat Dec 10 '19 at 06:22
  • "the (zp,x) addressing mode...ended up being less useful than a non-indexed (zp) mode would have been." @supercat I quite disagree with you there. (zp,x) is very useful for one or more stacks (8- or 16-bit) in the zero page. Very handy for interpreters (Forth interpreters seem almost invariably to use this), though useful for straight assembly code as well. Garth Wilson wrote a whole treatise on this. (I've also gotten "non-indexed" mode for free when I happened to have 0 in the X register anyway.) – cjs Dec 10 '19 at 09:06
  • @CurtJ.Sampson: His sequence to load a 16-bit value from the address on the top of the virtual stack is 16 bytes of code; 35 or 40 cycles. Using (zp),y, if y is left zero between operations: "lda 0,x / sta ptr / lda 1,x / sta ptr+1 / lda (ptr),y / sta 0,x / iny / lda (ptr),y / sta 1,x / dey" for 36 or 37. If straight (zp) existed and Y were left 1 between operations, that would save two bytes and four cycles. Note that even if X is zero, a load of (zp) would be a cycle faster than (zp,x) and a store of (zp) would be faster than both (zp,x) and (zp),y. – supercat Dec 10 '19 at 15:50
  • @CurtJ.Sampson: While (zp,x) might seem like it perhaps "should" be useful, I've never seen code using it that ends up being meaningfully faster than would be possible in its absence. Further, I've seen lots of code that ends up using indexed modes with an index that receives no benefit from indexing, but has to waste time ensuring that an index register is zero. – supercat Dec 10 '19 at 15:52
  • @supercat Like I said, look at any Forth interpreter. – cjs Dec 10 '19 at 16:13
  • @CurtJ.Sampson: How do they avoid having to spend about as much time incrementing the address at 0,x as would be necessary to copy that to a fixed location for (zp),y addressing? – supercat Dec 10 '19 at 16:14
  • All of these answers and comments seem to be missing the point though. JMP is also a very frequent instruction, looking over the MS BASIC code on the 6502, for example, while branches outnumber JMPs, it's not by a major amount. Moreover, it is common to see a branch-over-jmp. If the saved space, relocation and saved cycles are so important and useful, why isn't there a BRA in the original design? It would appear to be a zero-cost implementation, as BVS is widely used for this purpose. – Maury Markowitz Feb 11 '20 at 14:43
  • @Maury It was not a zero-cost implementation, and the cost may have been significant. The original decoder decoded nnn 100 00 as a branch instruction, and all eight nnn slots were already used by the existing branch instructions. Adding additional decoding to allow more than eight branch instructions probably wasn't seen as worthwhile, given that a primary objective of the CPU was to be cheap. – cjs Feb 12 '20 at 09:49
  • @cjs Where I can read some more about the original decoder of 6502 and how did it perform the decoding? – SasQ Apr 27 '20 at 21:27
  • @SasQ The details, and why they did things certain ways, are relatively deep, dark magic. (Or in other words, they don't make any sense unless you understand a reasonable amount about how you would go about designing a 5000-transistor chip in 1975.) I suppose the best place to get the details is with visual6502.org. However, as a software guy, I found that aiming to learn enough to design my own single-board computer first was the best (and most enjoyable!) way to work towards getting the background needed to understand what goes on in CPU design. – cjs May 02 '20 at 19:33
  • Well then maybe stop assuming things and try me. Let's say I have my share of knowledge of electronics and building computers out of 74xx chips under my belt, so that shouldn't be the problem. I know visual6502.org already. However, it would be a waste of time to try reverse-engineering the chip at the transistor level if someone else did that already before me. P.S. I don't believe in magic. Magic is how simple-minded people think of technology that they can't comprehend. – SasQ May 03 '20 at 20:56
  • @SasQ If you're not going to explain what level of knowledge you have, you're forcing people into making assumptions. Anyway, you want me to assume the other way, no problem. You've said you "know visual6502.org already", so I assume you've already read all the decode ROM material they've made available and it satisfied you. As for "magic," you clearly don't understand the hacker meaning of the term, but I'll also assume you can go find the resources to figure that out. – cjs May 05 '20 at 09:54
  • I asked a simple question: where to find more about the details of decoding (since you seemed like someone who might know something more). Do you really expect me to write elaborates and CVs about all my experience and credentials just to avoid people being an ASS and ASSume that I'm a total noob? ¬,¬ OK, I got the message, let me assume something too: that you don't really know much about it either and just trying to hide it by pretending to be smart. I don't need that kind of help then, thanks for nothing, I guess I need to find what I need myself, as always. Bye... – SasQ May 05 '20 at 20:41
4

Another possible reason: with PC-relative addresses, you can easily relocate your program in memory. Sometimes it is a good idea to have a program you can load and run at any address (well, not really ANY, but you know...).

When you program something like "jump 10 bytes forward", you can easily relocate. With "jump to $12A5", you have a fixed memory location your program has to be loaded at.
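
As a small hedged illustration (the operands are arbitrary):

      BNE *+12        ; assembles to $D0 $0A - "branch 10 bytes past the next
                      ;   instruction"; the same two bytes work at any load address
      JMP $12A5       ; assembles to $4C $A5 $12 - always lands at $12A5 and has
                      ;   to be patched if the code is moved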

Martin Maly
  • Except that the 6502 in question doesn't have any other relative instructions. Jumps and subroutine calls are all absolute, only branches are relative. – Raffzahn Dec 07 '19 at 20:38
  • @Raffzahn The relocation abilities provided by branch instructions alone are not useless without also having relative jumps and subroutine calls. It's perfectly reasonable to want to provide the ability to relocate individual subroutines (say, from ROM to RAM) without spending the additional cost required to provide general relocation facilities. – cjs Dec 10 '19 at 00:00
  • @CurtJ.Sampson Yes, it is and I love to use it - just this question is about the 6502 offering relative only for branches, but not any other execution transfer. Isn't it? – Raffzahn Dec 10 '19 at 00:11
  • @Raffzahn Right. And you can still get some reasonably useful relocation abilities by doing exactly that, without going to the extra expense of adding relative long branches. It could be that the original designers weren't thinking about relocatable code at all (they didn't mention it in the original Programming Manual), but I find that unlikely. – cjs Dec 10 '19 at 04:51
  • @CurtJ.Sampson Right, it seems rather unlikely that they didn't think about it, and looking at the ISA structure, a long relative wouldn't have been a big issue, just adding a clock. It's all there. Similarly there is no indirect JSR. What all of that has in common is that it's only needed in systems which have to load and configure code at runtime. The 6500 was, like many early micros, targeted at what we call embedded today. Here all code gets statically linked before being burned into (EP)ROM. And let's be honest, the overwhelming number of 6500 usages ended up not in consoles but in control. – Raffzahn Dec 10 '19 at 08:16
  • @Raffzahn _"The 6500 was....targeted at what we call embedded today." That's misleading; it was clearly targeted at both embedded and general purpose microcomputer systems. – cjs Dec 10 '19 at 08:57
  • @CurtJ.Sampson The 6502 as well as the 65816 fall short on essential functions needed for dynamic systems (desktop) - functions not needed on static (embedded) systems. Just because it was used as such doesn't change the way it was intended. – Raffzahn Dec 11 '19 at 22:59
  • @Raffzahn, That's unsupported opinion, far from fact. – cjs Dec 11 '19 at 23:20
  • @CurtJ.Sampson Works both ways, doesn't it? Except, I can (as mentioned) point out clear indications. So where are your facts? – Raffzahn Dec 12 '19 at 00:04
  • @Raffzahn Let's start with the fact, which you have denied, that there is no function essential to desktop microcomputer systems that the 6502 is missing. Were it missing, the large number of different and very successful (in terms of design) microcomputers based on the 6502 would not have been produced. Q.E.D. – cjs Dec 12 '19 at 00:31
  • @CurtJ.Sampson Sorry, but that proof is faulty from the start. The 6502 is still Turing complete, thus any function can be emulated by a series of other functions. As said before, just because it has been used for dynamic systems doesn't mean it was meant for such. That argument isn't reversible :) Be real. – Raffzahn Dec 12 '19 at 00:41
  • @Raffzahn Re-read my comment and note that I said the computers were "successful (in terms of design)." The PET, Apples, Ataris, etc. were not full of hacks to get around the 6502's unsuitability as a processor for a general purpose computer. That, and that the first releases of the 6502 made for excellent desktop CPUs but poor MCUs (no RAM, ROM or even I/O without the expense of external parts) makes it clear that "we had no idea this would be used for general purpose computers" is nonsense. – cjs Dec 12 '19 at 05:32
  • Relocatable code is indeed the right answer when it comes to branches being relative. However, absolute addresses are sometimes a good thing to have too: when you want to jump to some "library subroutine", you usually need a fixed, known location (e.g. somewhere in ROM or the operating system). Jumping there might get very cumbersome and confusing (you jump to the same library subroutine, and yet the address is different in each jump). And, ironically, it would break after relocation :q – SasQ Apr 27 '20 at 21:26
  • The Apple II exploited the relocatable capability with ROM code used in the peripheral cards and accessed by the PR#X command from BASIC. Each slot has 256 bytes of address space available, but the user can (within limits) put the card into any slot, so the executable address is not known until it is called. Having relative jumps makes that work well. It is not necessary that data is relocatable, as that is statically positioned. – Kevin White Apr 11 '21 at 16:27
  • @KevinWhite: I wonder to what extent that worked out better than would have been a design where card ROMs would get copied to RAM and then patched? A lot of I/O cards spend a lot of time and code space recomputing the addresses for I/O registers or screen holes every time a character is output. Loading code into RAM and patching such things would eliminate that need, and also allow cards that needed less than 256 bytes of code to use whatever amount of their RAM was convenient for persistent data that would otherwise have to go in screen holes. – supercat Dec 02 '21 at 19:05
4

Much discussion here about the benefits and costs of short vs long addressing, but the posted question asks about relative vs not. A machine like the RCA 1802, for example, has both short and long branches, for the reasons well covered here already, but they are absolute, and not relative.

What this means is that where your code ends up in memory determines where the short branches can reach, and this is not usually known when you are initially writing the code. In other words, memory is organized as a series of short (256-byte) pages, and short branches can only operate within the page they are in. So you write your code, as part of a bunch of other code, but you have no idea whether it will run properly until you've mapped (linked, etc.) all of the code into memory.

By using an adder inside the machine to implement relative branches, your smaller/faster code can be largely ignorant of exactly where it resides in memory. This is an advantage, especially with unsophisticated (or even no) development tools, as was appropriate for the time these machines were current.
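
In other words (a hedged sketch with arbitrary addresses): the assembler stores only the difference, and the CPU's adder reconstructs the target at run time, so the load address cancels out.

      ; branch opcode at $0803, operand at $0804, target at $0810:
      ;   stored offset = $0810 - $0805 = $0B
      ; load the same bytes at $2003 instead, and the CPU computes
      ;   $2005 + $0B = $2010 - still 13 bytes past the opcode, as intended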

user16234
  • When coding for the 1802, it's necessary to think of memory as being a bunch of 256-byte pages to a greater degree than one would typically do with the 6502, but on the 6502 there are many situations where manually dividing code into 256-byte chunks may be helpful for a variety of reasons, such as avoiding four-cycle branches and allowing more efficient jump tables. Keyword dispatch in a 6502 BASIC interpreter, for example, could be shrunk to lda keywordFunctionLo,x / sta keyJump / jmp (keyJump)--thirteen cycles if keyJump isn't in zero-page, but keyJump+1 permanently holds... – supercat Dec 02 '21 at 17:39
  • ...the upper byte of the address of keyword-processing routines. Some of the "routines" would need to simply be jumps to actual code elsewhere (adding three cycles), but the most commonly executed ones could be placed in the same page as all the other jumps. – supercat Dec 02 '21 at 17:39