25

Reading through the Z80 datasheet, I noticed something interesting. The Z80 divides its instruction execution into distinct phases that perform different kinds of operation (the official literature calls these "machine cycles", but I don't like that terminology because it is too easy to confuse with clock cycles; a machine cycle consists of multiple clock cycles, usually 3 or 4). In the first such phase, which is called M1, the processor fetches the instruction to execute in the remaining phases; once execution has finished, it cycles back to M1 again and fetches the next instruction.

Interestingly, the processor has a pin that provides a signal identifying whether or not it is in the M1 phase. Apparently this is because the memory timing for instruction fetch is different to memory accesses that are performed by the executing instruction, so if your RAM has an access time > 1.5 clock cycles but < 2 cycles you need to assert the /WAIT signal during instruction fetches but not data fetches. But I see no reason this couldn't be used as a seventeenth address line, effectively allowing the Z80 to address an entire 64K of data memory without the inconvenience of needing to fit its program in the same space.

Are there any examples of machines that did this? Or any reason why it wouldn't work?

Jules
  • 12,898
  • 2
  • 42
  • 65
  • 2
    As already answered, there is no direct way to use /M1 as an additional address line. However, you can use external hardware to trace the Z80's opcode flow and direct accesses to different memory areas depending on whether the Z80 is still reading an instruction's operands (for example, LD HL,#1234 has a single M1 cycle and then two ordinary read cycles to read the operand) or is making further accesses commanded by the instruction just read. Unfortunately, the complexity of such hardware is relatively high, and it could only easily be built with FPGA or CPLD devices. – lvd Jul 31 '17 at 15:44
  • 1
    CP/M Plus supported bank-switching architectures. One concept I remember was to have the Z80 address space divided into four 16KB logical banks, and have these mapped freely to 16KB-segments of a 1MB physical address space. A single application could still only use ~56K, but the O/S resided in its own bank. I used that for a memory-mapped color graphics card with 512 by 768 RGB pixels, occupying 384 KB of physical space and addressed by some tricky bank switching ASM code. – Ralf Kleberhoff Aug 04 '17 at 15:10
  • @lvd: One difficulty on the Z80 as compared with e.g. the 6502 is that some instructions include more than one M1 cycle, and the meaning of a byte may be affected by one or more previous bytes. For example, after two consecutive M1 cycles that fetch FD and 36, the next memory cycle would be some kind of opcode fetch, but the cycle after that may or may not be. If the FD was the start of a "LD (IY+d),n" instruction, the two following fetches could be operand values "d" and "n", but if FD was the second half of "set 7,l" the "36" would be the opcode "LD (HL),n", the next fetch... – supercat Feb 19 '18 at 21:03
  • ...would be the operand "n", and the fetch after that would be retrieving the value from (HL). Tracking which bytes are operand fetches and which are data fetches wouldn't require a humongous state machine, but it would be non-trivial. – supercat Feb 19 '18 at 21:04
  • There are even more funny things. Consider the BIT 7,(IX+1) command. First it fetches the DD prefix, then the CB prefix. Then it fetches 1, the displacement (not using an M1 cycle, just an ordinary memory read). And finally it fetches another opcode with an M1 cycle! Only then does the actual execution start. – lvd Feb 20 '18 at 08:58
  • Another example: consider a long sequence of DD bytes being executed by the Z80. Though formally undocumented, it is still a single command, just with lots of DD prefixes. When those end, another non-prefix byte (or DD/CB prefix pair) will be executed. Needless to say, such a prefix elongation is still a single command and can't be interrupted, even by NMI. – lvd Feb 20 '18 at 09:01
  • One of the reasons for the existence of M1 was to support Z80 peripheral chips during interrupt processing. An efficient mechanism was implemented to allow the peripherals to detect RETI instructions as they were fetched by the processor. The peripherals could then reset themselves and also participate in priority daisy chains (using the IEI and IEO signals). Not often fully exploited, but when used it allowed very efficient large-scale interrupt-driven IO. – Tim Ring May 16 '18 at 09:44
  • @lvd: The only documented instructions that start with DD/CB are those where (HL) is turned into (IX+d), which is probably why the Z80 fetches the +d before fetching the byte that indicates what should be done with (IX+d). Is the displacement ignored if the follow-on byte doesn't use (HL)? – supercat May 30 '18 at 15:13
  • @supercat: no, they change into undocumented OP r,(IX+d) group, that still operate over memory, then copy result also to a register. More here: http://www.z80.info/z80undoc.htm – lvd Jun 01 '18 at 10:52
  • 1
    @lvd: Interesting. I'm curious what the creators of the Z80 were thinking with the whole IX/IY design (intended usage patterns, etc.) If there had been plans for some instructions that got dropped (e.g. more 16-bit transfers to/from IX/IY) the design might make sense, but doing anything with the registers always seems unbearably awkward. A prefix to compute either "BC+n", "DE+n", "HL+n" or "SP+n" with a displacement +/-31 and use that in place of "(HL)" or "(SP)" in the next instruction would seem like it could have been more useful than all of the IX/IY functionality put together. – supercat Jun 01 '18 at 15:47
  • 1
    @lvd: I'm sure the designers of the Z80 were smart people, but IX/IY are so awkward to use it seems like there must be something "missing" from the intended design. Being able to use short displacements is sometimes useful, but the required six-byte 29-cycle ld ix,0 / add ix,de sequence to transfer a value into ix can add a displacement "for free", meaning the short displacements add cost but little value. If, however, there had been a two-byte 8-cycle ex de,ix/iy instruction, the IX/IY displacements could have been much more valuable. Maybe DD EB and FD EB had been planned? – supercat Jun 01 '18 at 16:18
  • 1
    @supercat ex de, i[xy] would have been tricky to implement: ex de, hl just toggles a flipflop that inverts the select lines to the de and hl registers. Extending that to IX and IY would have required either a much more sophisticated decoder or actual data swapping, which with only a 4-bit wide data path into or out of temporary storage would have taken at least 16 cycles. – Jules Jun 01 '18 at 19:31
  • @Jules: If one were to load DE, HL, DE', HL', and SP with five distinct values, then all 120 permutations of those values can be obtained using some sequence of the four-cycle instructions exx, ex de,hl and ex sp,hl. Were it not for ex sp,hl, there would be eight possible permutations which could be accommodated easily using three flops called "prime", "hsel", and "altsel". Accesses to HL use register (prime:hsel), accesses to DE use (prime:!hl), exchanging de and hl would invert the value of hsel, and exx would flip the value of hl while swapping the values of hsel and altsel. – supercat Jun 01 '18 at 19:48
  • @Jules: The existence of "ex sp,hl", however, complicates things so much that I'm not sure adding ix and iy to the mix would really make things worse. Simply have seven 16-bit registers with addresses 000 to 110, five three-bit registers which are initialized with 000, 001, 010, 011, and 100, and a 6-bit register initialized to 101110. The ex instructions could then shuffle the values of the appropriate 3-bit registers, while exx would swap the contents of the first two 3-bit registers and the 6-bit one. – supercat Jun 01 '18 at 19:53
  • In any case, the usefulness of IX/IY is severely undermined by the lack of an efficient means of getting data into them and out of them. I find it hard to imagine that the original design intention would have been to have programmers load IX with zero and then add BC, DE, or SP to that. It would seem much more plausible that the designers intended to provide some form of 16-bit load or exchange operations usable with ix/iy, but after difficulties arose the approach of "load zero and add" was deemed "good enough". – supercat Jun 01 '18 at 20:01
  • @Jules: BTW, how does the Z80 manage to efficiently include ex sp,hl as a 4-cycle instruction? In the absence of exx, there would be six permutations, which would be a bit awkward but not impossible. With 120 permutations, though, I can't see any approach that would be simpler than having a bunch of 3-bit registers to select which of five 16-bit registers is mapped to each of DE, HL, DE', HL', and SP. – supercat Jun 01 '18 at 20:19
  • @supercat - AFAIK, the way the Z80 register set works is using the set of three flipflops you describe. See Ken Shirriff's blog where he's gone into some detail on how the register file works. ex (sp), hl takes 19 cycles, and as far as I'm aware there isn't an ex sp, hl, so I think you're getting confused somewhere...? – Jules Jun 01 '18 at 20:59
  • @Jules: I was looking at https://sites.google.com/site/timeproofing/z80-instruction-set-1/timings which lists a 4-cycle ex sp,hl, and [now that you mention it] is missing ex (sp),hl. In any case, ix/iy seem like they would take up far more circuitry than they end up being worth, so the only plausible reason I see for their existence would be that they were intended to be far more usable than they ended up being. Do you know of anything that would indicate how the Z80's design evolved from conception to completion? – supercat Jun 01 '18 at 21:41
  • @supercat - there's a link on the site I linked above to an interview with key people at Zilog. I haven't read all of it, but it does go into a little depth about the design decisions made, so maybe there's something in there that would help...? – Jules Jun 01 '18 at 21:44
  • @Jules: Okay, thanks. I hope you can see why I was confused, given the page I linked. The page you linked seems to suggest that the registers sit on a 16-bit bus, so I don't see why more than 3 cycles would be needed to move reg1 to WZ, move reg2 to reg1, and move WZ to reg2, but I'll have to look some more to see what the interviews had to say. – supercat Jun 01 '18 at 22:03
  • @supercat - because the only place to move them to is the temporary storage of the ALU, which is only 4 bits wide. – Jules Jun 01 '18 at 22:08
  • @Jules: They could also be moved via the increment/decrement latch circuit, which is 16 bits wide (and is used for "LD SP,HL") though the design would require separating out register reads and writes, and could not overlap them with a succeeding instruction fetch, so the time required would be six cycles rather than 3. That's a really nice blog you pointed me to, btw. – supercat Jun 01 '18 at 22:16
  • 1
    @Jules: BTW, I wonder how much circuitry would have been required to use the increment circuit to expedite the processing of IX+n addressing mode or jr instructions by arranging for the incrementer to be capable of adding or subtracting 256 rather than just one? It really seems a waste that instructions which add an 8-bit signed displacement burn so many T-states on an increment or decrement operation. – supercat Jun 02 '18 at 20:08
  • @supercat, I remember reading a blog post that suggested that what is missing in the current design is a more efficient 8-bit adder circuit, which would have made the execution of commands manipulating the index registers substantially more efficient. Does this explanation make sense to you? – introspec May 10 '19 at 16:13
  • @introspec: The processing of IX+n, JR, and the register BC decrement of LDI/LDD/LDIR/LDDR, all require feeding four nybbles of a 16-bit value through a 4-bit adder despite the fact that the upper 8 bits will either be left as is, incremented, or decremented. Using an 8-bit adder would certainly help with that, but would represent a significant increase in circuit complexity. My question is whether the need to use the four-bit adder on the upper byte could have been dealt with in other ways (for the LDI or LDD, simply omitting operations on BC would have helped). – supercat May 10 '19 at 17:13

8 Answers

25

The M1 line literally means machine cycle 1 — and you're right about timing; the instruction fetch part of an M1 cycle is only two clock cycles long and will sample the data lines in the first clock cycle that it finds WAIT not asserted in, whereas a normal read is three clock cycles long and will sample the data lines in the clock cycle immediately after the first in which it finds WAIT not asserted.

However, when the Z80 reads an operation with an operand, such as LD A, 23, only the opcode itself is accessed in machine cycle 1. So if you were to use M1 naively as a seventeenth address line, you'd find that operands were no longer fetched from the same area as operations, which would be problematic for most execution flows.

However, what you could do is latch some part of the address fetched in each M1 cycle and use that to alter the memory map — so that code in one address range sees one memory map but code in another range sees a different memory map. This idea is lifted from an expansion of the Acorn Electron, which is a 6502 machine but the SYNC pin there does pretty much the same thing in indicating which bytes are opcodes. This particular expansion switches the RAM seen when the OS is in its drawing routines with a separate bank visible to the BASIC ROM. It thereby takes the area used for display out of the normal addressing range.
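To make the latching scheme concrete, here is a minimal Python sketch. It is my own illustration, not modelled on any specific machine: on each M1 (opcode fetch) cycle the hardware latches a bank from the top address bit, and every subsequent cycle, including operand fetches, goes to that bank's physical memory.

```python
# Hypothetical model of "latch a bank on each M1 cycle". The bank chosen by
# the region the current opcode came from also applies to its operand and
# data cycles, so code and its operands stay in the same physical bank.

class M1BankLatch:
    def __init__(self):
        self.bank = 0

    def access(self, addr, m1):
        if m1:                            # opcode fetch: latch bank from A15
            self.bank = addr >> 15
        return (self.bank << 16) | addr   # 17-bit physical address

latch = M1BankLatch()
# LD A,23h executed from 0x8000: opcode fetch (M1), then operand read (not M1)
opcode_phys  = latch.access(0x8000, m1=True)
operand_phys = latch.access(0x8001, m1=False)
assert opcode_phys >> 16 == operand_phys >> 16   # operand follows the code's bank
```

Unlike using M1 directly as an address line, this keeps operand fetches consistent with the code that issued them, at the cost of a latch and some decode logic.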

Tommy
  • 36,843
  • 2
  • 124
  • 171
12

In the '80s and '90s, some arcade games used the M1 signal for anti-bootlegging measures. The developers used a customized assembler that allowed them to mark code and data sections, and the assembled program was encrypted such that code sections were encrypted one way and data sections another. The resulting combination of encrypted code and data was written to a single EPROM.

The arcade game hardware had a custom module containing a Z80 and support circuitry to execute the game program by applying one of two different decryption functions based on the state of the M1 pin. In theory this made it very difficult to duplicate a program, and to modify it (often copyright messages were removed by bootleggers).

In practice bootleggers would duplicate games by dumping the Z80 address space twice after gaining physical access to the Z80 data/address/control bus in the module, once for each state of the M1 pin. This produced two copies of the EPROM contents, one with the content decrypted as if it were code, another with the content decrypted as if it were data.

The bootleg arcade game would have a regular Z80 and two EPROMs, and it would select which EPROM was being used based on M1 to pick the decrypted program or data. This essentially reversed the protection the original developers applied, by de-interleaving the encrypted code and data.

A more dedicated effort was to manually merge the decrypted code and decrypted data back together into a single unified EPROM, but this was time consuming and error prone and was not commonly done, if at all, until modern times when arcade collectors wanted to replace the short-lived custom modules with a plain Z80.
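For illustration, here is a small Python sketch of the protection and the dump-twice trick. The real per-game decryption functions were secret; simple XOR masks stand in for them here.

```python
# Hypothetical illustration only: XOR masks stand in for the real per-game
# decryption functions selected by the custom module from the M1 pin.

CODE_KEY, DATA_KEY = 0x5A, 0xA5          # made-up stand-ins

def protected_rom_read(encrypted, addr, m1):
    key = CODE_KEY if m1 else DATA_KEY   # M1 picks the decryption function
    return encrypted[addr] ^ key

# A tiny program encrypted as code: LD A,17h / RET
encrypted = bytes(b ^ CODE_KEY for b in b"\x3e\x17\xc9")

# Bootlegger dumps the whole space once per M1 state:
code_image = bytes(protected_rom_read(encrypted, a, True)  for a in range(3))
data_image = bytes(protected_rom_read(encrypted, a, False) for a in range(3))
assert code_image == b"\x3e\x17\xc9"     # correct when fetched as code
assert data_image != code_image          # wrong key gives garbage
```

The two dumped images are exactly what the bootleg board then selects between with M1.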

So M1 has some useful, if obscure, functionality.

raisin-wrangler
  • 1,500
  • 11
  • 11
  • 1
    If a machine used a CD4052 (4x2 MOSFET switch) with AB inputs connected to M1 and A1, and the other pins connected between the Z80's D0/D1 and the rest of the bus so that the pins would be swapped only on M1 cycles from even addresses, code could guard against the EPROM-clone trick by copying code to RAM and running it there (sometimes changing the code's address mod 4). Not sure if anyone ever took things that far, but I think it would offer a pretty good level of reverse-engineering protection for the cost of a single chip. – supercat Jun 01 '18 at 16:51
8

In the early '80s I worked on a "communications processor" which let you use ASCII terminals as an IBM 3270 display subsystem. The Z80 was critically short of both RAM and program memory (27256 EEPROM chips, IIRC) for the task of emulating 8 displays.

To handle the RAM shortage, the hardware got an 8kB bank set up in high RAM that was manually switched by the program. To handle the program space shortage, there was a "shadow PC counter" that worked with M1 and a handful of gates to detect the conditions in @Tommy's answer, so that program accesses went to the EEPROM chips and true data accesses went to RAM. There were a handful of IX and IY instructions that we weren't allowed to use.

Ugly, but it allowed the product to handle twice as many connections, which made it an economic winner.

Bob Jacobsen
  • 211
  • 2
  • 3
4

But I see no reason this couldn't be used as a seventeenth address line, effectively allowing the Z80 to address an entire 64K of data memory without the inconvenience of needing to fit its program in the same space.

The question as it applies to the Z80 has already been answered, but the more general answer is yes. The obvious example is the 8086. From the perspective of the programmer, addresses are 16 bits wide. However, the chip itself has 20 bits of address space. The extra four bits come from a segment register. Code accesses go through the code segment, data accesses through the data segment or the extra segment, and stack accesses through the stack segment. Thus, the programmer can access up to 256 kilobytes at a time.
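A quick worked example of the 8086 segment arithmetic: the segment register is shifted left four bits and added to the offset, truncated to the 20-bit bus.

```python
# 8086 segmented addressing: physical = segment * 16 + offset, mod 2^20.

def phys(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF   # 20-bit address bus

assert phys(0x1000, 0x0000) == 0x10000
assert phys(0x2000, 0x1234) == 0x21234   # CS/DS/ES/SS each select a 64K window
assert phys(0xFFFF, 0x0010) == 0x00000   # the famous wrap-around at 1 MB
```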

Taking the concept of keeping instructions and data separate to its maximum, you can have not only a separate address space but separate data paths too: when you fetch an instruction, it comes in on a different bus from the data. This is known as the Harvard architecture. The Harvard architecture has the advantage over the von Neumann architecture that you can be doing instruction fetches and data accesses at the same time. You can also design caching policies to optimise for instructions and data separately. The disadvantage is having more bus lines.

Internally, almost all modern CPUs use the Harvard Architecture. Instructions are fetched from a different cache to data and the fetch uses a different pathway. So, from the microcoder's point of view, your separation scheme is pretty much standard.

snips-n-snails
  • 17,548
  • 3
  • 63
  • 120
JeremyP
  • 11,631
  • 1
  • 37
  • 53
3

Another important function of M1 which isn't yet mentioned is to synchronize in-circuit emulators. An in-circuit emulator is a device which contains a processor, a connector that plugs into a processor socket, and some circuitry between them. The emulator can be set to behave like a processor, but at any point, or when a certain breakpoint is triggered, it can disconnect its bus from that of the host system, save the state of the registers, and then start executing its own code to provide a user interface with its own buttons and display. That interface may allow individual memory reads and writes to be performed on the host system, and may also allow for operations to copy or fill a range of memory, scan a range of memory for a certain value, configure breakpoints, etc.

While some low-cost debuggers may use interrupts to gain control, that could cause difficulties if a "real" interrupt happens at the same time as the debugger tries to gain control. A better ICE will do something like wait for the start of an instruction (perhaps figuring that a /M1 cycle which follows a write must be the start of an instruction), disconnect from the host bus and connect to its internal bus, replace the data from the bus with 0xFF (a RST 38h instruction), pretend that the upper address bits on the next two write cycles are all set (so the writes will go somewhere in the range 0xFFF0 to 0xFFFF regardless of the value of SP), and then start executing code from its own ROM.
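As a sketch of that synchronisation rule (my own simplification, not modelled on any particular ICE): treat the first M1 cycle after a write as a guaranteed instruction boundary, then force 0xFF (RST 38h) onto the bus.

```python
# Simplified model of ICE break-in: the first M1 cycle following a write
# cycle must be the start of an instruction, so hijack that opcode fetch.

RST38 = 0xFF

def ice_filter(cycles):
    """cycles: list of (kind, data) with kind in {'m1', 'read', 'write'}."""
    out, seen_write, armed = [], False, True
    for kind, data in cycles:
        if kind == 'write':
            seen_write = True
        if kind == 'm1' and seen_write and armed:
            data, armed = RST38, False    # force RST 38h once, then stand down
        out.append(data)
    return out

trace = [('m1', 0x32), ('read', 0x00), ('read', 0x40), ('write', 0x17),
         ('m1', 0x3E)]                    # next opcode fetch after the write
assert ice_filter(trace)[-1] == RST38
```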

A proper "start of instruction" indication would have been more convenient than the Z80's "some kind of opcode byte", but most code won't go very long without a write cycle, and once a write cycle occurs the state machine in the ICE should be able to follow things after that. Perhaps an ICE could track things even without M1, but it would require a more complicated state machine that understands almost all Z80 opcodes rather than just a few prefix bytes.

supercat
  • 35,993
  • 3
  • 63
  • 159
  • Interesting, yes, I'd not considered that use. And of course all of this must have been much simpler on the 8080 (where the M1 signal originated) due to the fact that it didn't have the system of opcode prefixes used by the Z80. – Jules May 31 '18 at 00:00
2

As already answered, M1 is not active for the whole of an instruction fetch, only during the opcode fetch, not operand fetches. Note that M1 has other purposes and shouldn't be messed with, especially if Z80 peripheral chips such as the PIO/SIO/CTC are used on the same system. Mode 2 interrupts use M1 along with IORQ as part of interrupt acknowledgement, vector placement on the bus, and end-of-interrupt processing (the peripheral chips detect RETI to end interrupt handling). It was a very sophisticated and powerful interrupt system, with interrupt chaining, vectored interrupts and so on (way ahead of the Intel 8086's, and it didn't require an external interrupt controller). The business I was in in the eighties built process controllers and communication controllers, and we stuck with the Z80 for years because of its peripheral chips and interrupt system.
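For reference, the mode 2 vector is formed from the I register (high byte) and the byte the acknowledging peripheral places on the bus during the M1+IORQ cycle (low byte, documented as even). A quick worked example:

```python
# Z80 mode 2 vectoring: the CPU reads the service-routine address from a
# table pointed to by I:bus_byte, where bus_byte is supplied by the
# interrupting peripheral during the M1+IORQ acknowledge cycle.

def im2_pointer(i_register, bus_byte):
    return (i_register << 8) | bus_byte

# Vector table at 0x8000, peripheral supplies offset 0x04: the CPU fetches
# the handler address from 0x8004/0x8005.
assert im2_pointer(0x80, 0x04) == 0x8004
```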

Tim Ring
  • 406
  • 3
  • 7
2

As indicated in other answers, M1 can't be used to implement a Harvard addressing scheme, as immediate operands are fetched from program memory in cycles subsequent to M1. But it can be used by hardware to detect parts of program execution.

One use for this was on the ZX Spectrum's Interface One (for Microdrives and serial ports): it detects execution of I/O operations (using M1 + address), and in response, asserts #ROMCS to disable the built-in ROM and enables its own Shadow ROM in the bottom 16K¹. I can't remember how the return to the Spectrum ROM is achieved - it may be the same mechanism, but I no longer have my copy of Spectrum Shadow ROM Disassembly with which to check.


¹ IIRC, an 8K ROM that appears twice in the address space - A13 not decoded.
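The mirroring in the footnote falls out of the wiring: the ROM only sees address lines A0–A12, so two CPU addresses differing only in A13 hit the same cell. A tiny sketch:

```python
# With A13 undecoded, the 8K ROM appears at both 0x0000-0x1FFF and
# 0x2000-0x3FFF of the 16K window: only 13 address lines reach the chip.

def rom_cell(cpu_addr):
    return cpu_addr & 0x1FFF

assert rom_cell(0x0123) == rom_cell(0x2123)   # two images of one byte
```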

Toby Speight
  • 1,611
  • 14
  • 31
1

M1 can be used with a helper state machine driven by the opcode contents to determine which accesses are code space and which are data space. Given the Z80 opcode map, such a state machine isn't terribly complex. There are big chunks of opcode space that are quite regular.

For example, the first opcode byte decodes as follows:

(all opcodes octal!)
1 byte: all except given below
2 bytes:
  020,030,...,070 - relative jumps
  0_6             - LD immediate
  3_6             - immediate arithmetic
  323,333         - IN/OUT (m):A
  313             - CB prefix - bit opcodes
  355 then ___    - ED prefix - all except those that are 3 or 4 bytes below
3 bytes:
  001,021,041,061 - LD ww,mn
  042,052,...,072 - LD (mn):HL/A
  3_2             - JP f, mn      
  303             - JP mn
  3_4             - CALL f,mn
  315             - CALL mn
  355 then 0_0    - Z180 IN0
  355 then 0_1    - Z180 OUT0
  355 then 144,164 - Z180 TST m, TSTIO m
4 bytes:
  355 then 1_3    - LD (mn):ww
state changes:
  335,375         - IX/IY prefix, don't count this byte, add/override state transitions per IX/IY prefix

IX/IY prefix modifies the expected opcode lengths as follows. Start with the table above, and add/override the following entries:

IX/IY + 1 byte:
  166             - HALT
  351             - JP (HL) (no displacement byte follows)
IX/IY + 2 bytes:
  064,065         - INC/DEC (HL)
  160,161,...,165,167 - LD (HL), s
  1_6             - LD g, (HL)
  2_6             - arithmetic with (HL) as source
IX/IY + 3 bytes:
  066             - LD (HL), m
IX/IY + 4 bytes:
  313 then __6    - CB instructions using (HL)
  355 then 064    - Z180 TST (HL)

At some point in the past I put together a state machine for the above, with a Johnson counter for the 1st, 2nd, etc. byte, steered by discrete logic that decoded all the cases. It's doable, if tedious. Trivial in a PLD or FPGA, though.
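As a toy software version of the same idea, here is a Python sketch covering only a handful of rows of the base table above (no CB/ED/index prefixes): given the first opcode byte, predict how many following bytes are operand fetches rather than data accesses. Octal values from the table are written in hex.

```python
# Subset of the base-table length decoder. The column-pattern rows (0_6,
# 3_6, 3_2, 3_4 in octal) are handled with a bitmask; the rest via sets.

ONE_OPERAND = {0x10, 0x18, 0x20, 0x28, 0x30, 0x38,   # relative jumps (020..070)
               0xD3, 0xDB}                            # OUT (m),A / IN A,(m)
TWO_OPERANDS = {0x01, 0x11, 0x21, 0x31,               # LD ww,mn (001..061)
                0x22, 0x2A, 0x32, 0x3A,               # LD (mn),HL/A and back
                0xC3, 0xCD}                           # JP mn / CALL mn

def operand_bytes(op):
    """Bytes fetched after this opcode as operands (subset only)."""
    if op in TWO_OPERANDS or (op & 0xC7) in (0xC2, 0xC4):  # JP f,mn / CALL f,mn
        return 2
    if op in ONE_OPERAND or (op & 0xC7) in (0x06, 0xC6):   # 0_6 and 3_6 rows
        return 1
    return 0

assert operand_bytes(0x3E) == 1   # LD A,m   (octal 076, the 0_6 row)
assert operand_bytes(0x21) == 2   # LD HL,mn (octal 041)
assert operand_bytes(0xC9) == 0   # RET      (octal 311)
```

A hardware implementation would load this count into the counter on each M1 cycle and treat the next N non-M1 reads as code-space accesses.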