
In the '70s and '80s, RAM chips worked at a lower frequency than the CPU.

That is, the processor ran at a higher frequency than the RAM. The CPU cannot receive one instruction from RAM per clock cycle, since the RAM lags behind, so the CPU sits idle while waiting for the next command.

There are different ways to deal with this.
One of them is to build a CPU with microcode: while the CPU waits for a command from RAM, the microcode runs at the CPU frequency (for example, executing a complex operation on registers). Or you can add cache memory to the CPU.

Is it possible to do so (approximately)?

Suppose the processor operates at 25 MHz and the memory has an access delay of 80 ns (12.5 MHz), so the RAM cannot deliver a command to the CPU on every CPU clock cycle. You could take two identical RAM chips and drive them with a phase-shifted clock on one of them, picking the shift so that while the 1st chip is recharging, the 2nd chip is delivering a command to the CPU, and while the 2nd chip recharges, the 1st delivers the next one. RAM1 would then hold only the even instructions and RAM2 only the odd ones, so the CPU could receive commands at its own frequency (25 million times per second), even though the RAM chips operate at 12.5 MHz.

--

I am asking about such a mechanism from the point of view of design in general, not about taking a C64 now and retrofitting it.

Maybe such methods were used in mainframes or supercomputers?

Alex
    Nothing requires that the CPU 'do nothing' while awaiting memory response. Microcode is not a requirement for 'not doing nothing'. – dave May 31 '23 at 00:54
  • Sounds pretty much like modern dual-channel memory systems, but those interleave at cache-line granularity because DDR SDRAM is designed around burst transfers of 32 or 64 bytes. – Peter Cordes Jun 01 '23 at 23:49
  • It was common for the CPU to take many clock cycles to perform an instruction so although the clock may have been faster than memory the instruction execution time may not have been. – Kevin White Jun 02 '23 at 00:57
  • Re: your description of microcode being made to be slow(?) - did you mean that packing more CISC complexity into microcode that you can access via a compact machine-code instruction lets the same program run with less code-fetch bandwidth? I've been debating with Raffzahn in comments that that's probably the idea behind what you wrote, and that the idea makes sense even though that's not a primary reason for choosing microcode. – Peter Cordes Jun 02 '23 at 14:45
  • i.e. that it's really CISC and multi-cycle instructions that are the key here, leading to higher code density that reduces code-fetch bandwidth for the same program. Microcode is one way to build such an implementation, and enables CISC with limited transistor counts. – Peter Cordes Jun 02 '23 at 14:50
  • Your premise is incorrect, at least for the 70's. It wasn't uncommon to have "zero wait state" RAM, meaning the CPU was never delayed waiting for the RAM. – Mark Ransom Jun 03 '23 at 02:44

3 Answers


You could take two identical RAM chips and drive them with a phase-shifted clock on one of them, picking the shift so that while the 1st chip is recharging, the 2nd chip is delivering a command to the CPU, and while the 2nd chip recharges, the 1st delivers the next one. RAM1 would then hold only the even instructions and RAM2 only the odd ones, so the CPU could receive commands at its own frequency (25 million times per second), even though the RAM chips operate at 12.5 MHz.

Yes. This is called interleaved memory and it was certainly found in the wild. From the linked Wikipedia article:

With interleaved memory, memory addresses are allocated to each memory bank in turn. For example, in an interleaved system with two memory banks (assuming word-addressable memory), if logical address 32 belongs to bank 0, then logical address 33 would belong to bank 1, logical address 34 would belong to bank 0, and so on. An interleaved memory is said to be n-way interleaved when there are n banks and memory location i resides in bank i mod n.

And, as with so much in computing, IBM gets the credit for inventing it:

Early research into interleaved memory was performed at IBM in the 60s and 70s in relation to the IBM 7030 Stretch computer,[4] but development went on for decades improving design, flexibility and performance to produce modern implementations.
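
To make the bank selection concrete, here is a minimal sketch (my illustration, not part of the answer or the article) of how an n-way interleaved controller might split a word address into a bank number and an offset within that bank, in C:

    #include <stdio.h>

    #define BANKS 2   /* n-way interleave; 2 matches the even/odd scheme above */

    /* Word address -> (bank, offset): bank = addr mod n, offset = addr / n. */
    static unsigned bank_of(unsigned addr)   { return addr % BANKS; }
    static unsigned offset_of(unsigned addr) { return addr / BANKS; }

    int main(void)
    {
        /* Sequential fetches alternate banks, so each bank gets a full
           cycle to recover while the other one delivers a word. */
        for (unsigned addr = 32; addr < 38; addr++)
            printf("address %u -> bank %u, offset %u\n",
                   addr, bank_of(addr), offset_of(addr));
        return 0;
    }

With BANKS set to 2 this reproduces the even/odd split from the question; with 4 it becomes the 4-way case described in the quote.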

Wayne Conrad
  • This is part of the reason for the four-plane memory layout of IBM's early graphics cards (CGA/EGA) -- the electron guns in the CRT needed data faster than a single bank of RAM chips could deliver it, so they designed it to read across four separate banks in parallel. – smitelli Jun 01 '23 at 18:31
  • The CGA is not planar because of read speed. The CGA uses banks of memory because of fundamental limitations in how many scan lines the 6845 CRTC controller can count. – Cody Gray - on strike Jun 02 '23 at 08:10

In the '70s and '80s, RAM chips worked at a lower frequency than the CPU. That is, the processor ran at a higher frequency than the RAM.

Not really. At least with microprocessors, RAM was usually as fast as or faster than the CPU. RAM being slower is a development of the 1990s and later - again, with micros. After all, mainframes had already been through the same cycle.

The CPU sits idle while waiting for the next command.

Not really. A CPU waiting for data is a relatively new thing - again for micros.

One of them is to build a CPU with microcode: while the CPU waits for a command from RAM, the microcode runs at the CPU frequency (for example, executing a complex operation on registers).

Nope. Microcode is not made to 'fill time' but to simplify CPU design (save on components) and/or run more complex operations.

Or you can add cache memory to the CPU.

Well, yes, but it doesn't have to be in the CPU. After all, a cache is nothing more than faster memory, which is usually only affordable in comparably small quantities - otherwise it would be used for all memory.

Is it possible to do so (approximately)?

Only the second and only in part.

What to do?

In the end it's all about storage attachment and storage hierarchy(*1).

On a more detailed level there are four basic methods to improve average RAM access time:

  1. Using regions of faster RAM for critical data/code.
  2. Using interleaved memory regions.
  3. Using a wider memory interface.
  4. Using a cache.

#1 Faster Regions

The first method was very common at a time when vastly different technologies were used as RAM. A good example is a rather early machine like the Zuse Z23, where main memory was a drum but the first 256 words were core. Likewise the (way bigger) Univac 1105, whose base memory had 8 KiWords of core plus 16 KiWords of drum (*2).

For program logic there was no difference between the two types of memory. It was one continuous address space, but code - and especially data - stored in the first 8 Ki could be accessed without any waiting for the drum to rotate and deliver a random word (*3).

In a way, many modern microprocessors support a similar mode of working. They allow locking cache lines to certain memory addresses. That is, those lines are taken out of regular cache operation, loaded once with the desired content, and never overwritten by other data, making sure those code/data sections are always ready at maximum speed.

Another more modern example, though for a different reason, was the Commodore Amiga. While all its memory was semiconductor RAM of equal speed, one portion (Fast RAM) could be accessed at full CPU speed, while another (Chip RAM) was shared with I/O. Thus it was advisable to store code - and non-I/O-related data - in Fast RAM whenever possible.

#2 Interleaved Memory Regions

If a memory device takes a certain time before it can be accessed again, why not use several memory devices, each with its own cool-down, independently? Of course, this requires an access scheme that (hopefully) produces predictable access patterns. Luckily, reading code for execution is (mostly) exactly that: strictly sequential RAM access.

Having two RAM devices (banks) with interleaved addresses doubles RAM speed - as long as access is sequential. But even a random access has a 50% chance of hitting the right (next) bank (*4). Having four banks quadruples RAM speed, and so on.
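
As a rough illustration of those numbers, here is a toy simulation (mine, not from the answer; the two-cycle bank recovery time is an assumption standing in for the question's 25 MHz CPU / 12.5 MHz RAM example):

    #include <stdio.h>
    #include <stdlib.h>

    #define BANKS    2
    #define RECOVERY 2   /* a bank stays busy for 2 CPU cycles per access */

    /* Cycles needed to issue n word accesses, stalling whenever the
       target bank is still busy from a previous access. */
    static unsigned long run(const unsigned *addr, unsigned n)
    {
        unsigned long cycle = 0;
        unsigned long free_at[BANKS] = {0};
        for (unsigned i = 0; i < n; i++) {
            unsigned b = addr[i] % BANKS;
            if (free_at[b] > cycle)
                cycle = free_at[b];       /* stall until the bank recovers */
            free_at[b] = cycle + RECOVERY;
            cycle++;                      /* one cycle to transfer the word */
        }
        return cycle;
    }

    int main(void)
    {
        enum { N = 100000 };
        static unsigned seq[N], rnd[N];
        for (unsigned i = 0; i < N; i++) {
            seq[i] = i;                    /* code fetch: strictly sequential */
            rnd[i] = (unsigned)rand();     /* random data access */
        }
        printf("sequential: %lu cycles\n", run(seq, N));  /* ~N, full speed */
        printf("random:     %lu cycles\n", run(rnd, N));  /* ~1.5 * N       */
        return 0;
    }

Sequential access finishes in about N cycles (full CPU speed); random access takes about 1.5 N, matching the 50% chance of hitting a still-busy bank.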

Interleaved memory blocks have been used since way back in core times - at least with mainframes, where one had to have multiple core blocks anyway. The technique can be traced back at least to the 1961 IBM 7030 and was common on /360(ish) mainframes throughout the 60s and later.

It has also been common with PCs, which have supported parallel banks since 286 times (*5).

Another modern analogue is striping in RAID arrays.

#3 A Wider Memory Interface

There is no reason to tie the width of a memory interface to the internal word width of a CPU. A very common example is 8-bit devices on a 16-bit CPU, where only half the CPU width is used to access data. The same can be done in the other direction - like giving that 16-bit CPU a 32-bit memory interface. Whenever it accesses a 16-bit item in RAM, 32 bits are read at once: one half is forwarded directly to the CPU, the other half is latched and kept in case the following access needs it - which is again the case for all sequential accesses.
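
A toy model of that latch (my sketch, not any specific chipset) - a 16-bit CPU reading through a 32-bit memory port:

    #include <stdint.h>
    #include <stdio.h>

    /* A 16-bit CPU behind a 32-bit memory interface: every RAM access
       reads a full 32-bit word; the half not requested is kept in a
       latch in case the next (sequential) access wants it. */
    static uint32_t ram[1024];            /* memory, in 32-bit words     */
    static uint32_t latch;                /* last word read from RAM     */
    static unsigned latched = 0xFFFFFFFF; /* word index held in the latch */
    static unsigned ram_reads;            /* slow RAM reads actually done */

    static uint16_t fetch16(unsigned addr16)  /* 16-bit word address */
    {
        unsigned word = addr16 / 2;
        if (word != latched) {            /* miss: one slow 32-bit RAM read */
            latch = ram[word];
            latched = word;
            ram_reads++;
        }
        return (addr16 & 1) ? (uint16_t)(latch >> 16) : (uint16_t)latch;
    }

    int main(void)
    {
        for (unsigned i = 0; i < 1024; i++)
            ram[i] = i | (i << 16);               /* some test pattern   */
        for (unsigned a = 0; a < 64; a++)         /* 64 sequential fetches */
            (void)fetch16(a);
        printf("64 CPU fetches, %u RAM reads\n", ram_reads);  /* prints 32 */
        return 0;
    }

For a sequential stream, only every second CPU access has to touch the slow RAM - the bandwidth doubling described above.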

As before, doubling the width doubles bandwidth. That method became quite popular with mainframes, reaching memory interface widths of up to 1024 or 2048 bits (128/256 bytes) by 1980 - all for a 32-bit CPU.

For microprocessors, the Pentium may qualify as the first mainstream CPU to use this. While a 32-bit CPU like the 486, it featured a 64-bit (8-byte) memory interface to double RAM bandwidth.

The method was (and is) quite popular with GPUs (graphics cards). Their access pattern is even more linear than a standard CPU's, so even the 1999 GeForce 256 featured a 128-bit interface. By 2003 Nvidia reached 256 bits (NV42), which is more or less the standard up to today (*6).

Such wide memory interfaces are not only a precursor of caches, in the sense that they prepare additional data for future access, but also an enabler of fast cache loads. Which brings us to the last item:

#4 Cache

In some ways a cache is the very same as item #1: a dedicated, faster storage area for the most important data. Except that now it is no longer the programmer who decides (*7), but the CPU hardware/software. This takes some additional hardware, so it took some time for the technology to be applied (for a more in-depth history see here).
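
For illustration, here is a minimal sketch (mine; the sizes are arbitrary assumptions) of the bookkeeping such hardware does, modeled as a direct-mapped cache of 64 lines of 16 bytes that tracks only tags, not data:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINES      64          /* cache lines                  */
    #define LINE_BYTES 16          /* bytes per line -> 1 KiB cache */

    struct line { bool valid; uint32_t tag; };  /* data itself omitted */
    static struct line cache[LINES];
    static unsigned hits, misses;

    static void cache_access(uint32_t addr)
    {
        uint32_t block = addr / LINE_BYTES;  /* which memory line          */
        uint32_t index = block % LINES;      /* the one slot it may occupy */
        uint32_t tag   = block / LINES;      /* identifies the line there  */
        if (cache[index].valid && cache[index].tag == tag)
            hits++;                          /* served at cache speed      */
        else {                               /* load the line from slow RAM */
            misses++;
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
    }

    int main(void)
    {
        /* Walk a 1 KiB working set twice; it fits in the cache, so the
           second pass is served entirely at cache speed. */
        for (int pass = 0; pass < 2; pass++)
            for (uint32_t a = 0; a < 1024; a += 4)
                cache_access(a);
        printf("hits %u, misses %u\n", hits, misses);  /* hits 448, misses 64 */
        return 0;
    }

The hardware decides automatically what stays in the fast storage - the same effect as #1, but without the programmer placing anything by hand.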

Caches do not make #2..#3 obsolete, but rather build upon them.


*1 - Think of the basic hierarchy of

Disk
    -> Drum
        -> Core
            -> Semiconductor

each faster than the previous, but more expensive and smaller in capacity. In the /370 world this is the basic way to look at all storage. It goes in many more stages from cache all the way to tape libraries and even offline storage like card stacks :))

*2 - Fully expanded: 16 Ki of core and 32 Ki of drum.

*3 - This is not entirely correct, as programs could be written (stored) in a way that spread code and data across the drum just right to eliminate any waiting. Of course, it was less effort if you could avoid needing such an arrangement.

*4 - With careful alignment, like back in the days of drum memory, this can be made 100% again :))

*5 - The most recent systems take this to the extreme by handling those blocks as independent memory channels, but that's a different story.

*6 - Further growth didn't stop for lack of performance gains - there are chips with 384-bit, 512-bit, or wider multiple-of-256 interfaces - but pin count and the resulting board design get dramatically more expensive, so such wide interfaces are only useful for extreme applications where money is of lesser concern.

*7 - Well, he still can, but that's very fine fine-tuning.

Raffzahn
  • Note also that the OP's question is about fetching instructions, which do tend to have adjacent-address access patterns, at least until you get to a branch/jump/call. Thus: separate I- and D-caches; lookahead fetch of the target; branch prediction; speculative execution. All of these mechanisms help with keeping the CPU supplied with things to do. – dave May 31 '23 at 12:10
  • @another-dave of course they do - and many more ways - but they are all fine tuning of caches. There are many more ways to improve memory speed way before adding the complexity of a cache and that's what this is about. – Raffzahn May 31 '23 at 12:47
  • Caches and memory width/interleave are the crucial things here. Looking at mainstream micros rather than larger systems, it was really only the '386 that started off significantly faster than the attached memory (i.e. wait states would be the norm rather than being needed occasionally), and that was when Intel also introduced a cache controller chip (delivered late IIRC, which slowed '386 acceptance). So performance comes from caching, pipelining, multiple (i.e. superscalar) ALUs, instruction reordering and finally instruction decomposition: all of which are by now in wide use. – Mark Morgan Lloyd Jun 01 '23 at 07:55
  • 8088 was heavily bottlenecked on memory most of the time, because of its 8-bit bus. Even 8086 was fairly often when working with registers. Each byte (or aligned 2-byte word on 8086) takes about 4 cycles to access, but most of the instruction cycle-counts for things like add ax, bx (a 2-byte instruction) are less than 4 cycles. e.g. 2 or 3 cycles. https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html Maybe 8088 was the exception, not the rule, among microprocessors, but it was very notably widespread. – Peter Cordes Jun 02 '23 at 00:01
  • Microcode is not made to 'fill time' - I think what the OP meant was the CISC benefit of having a compact machine-code instruction trigger a bunch of work inside the CPU, e.g. x86 push 123 or stosw which stores and updates the stack pointer or DI, vs. a RISC would need a store and then a pointer increment. (Not a good comparison for push, since normally you'd just addiu once to change the stack pointer and store at different offsets, but consider appending to an array vs. stosw.) Or x86 rep movsb which is memcpy in microcode, avoiding any code-fetch memory traffic during it. – Peter Cordes Jun 02 '23 at 00:13
  • So you're right that microcode is just an implementation strategy, but I think there's something interesting to be said about it allowing very CISCy instructions which do have some of the benefit the OP's thinking of, of easily allowing instruction prefetch while running slow instructions. Overall I agree with most of the points your answer makes, though. – Peter Cordes Jun 02 '23 at 00:17
  • @PeterCordes All x86 examples shown can be done without an explicit micro program (defining what one is, is rather blurry anyway; each and every implementation method needs to perform in stages). Microcode and prefetch are different issues. Prefetch is neither a feature of its own nor tied to microprogramming. It is as well tied to execution stages to identify information to be (pre)fetched. It's a method of parallel execution, so not 'using time in between' as asked by the OP - especially as the OP's assumption is about memory being slow. Prefetch does not tackle that in any way. – Raffzahn Jun 02 '23 at 09:26
  • @PeterCordes The use of microcode is simply a form of reducing hardware complexity, not doing anything different or with different results. It neither adds functionality nor optimizes execution - in reality it only carries the risk of slowing execution down by executing slower than memory would allow. The ability to crank up internal speed is thus just a way to cover up for the slowdown introduced by saving hardware through microcode. – Raffzahn Jun 02 '23 at 09:30
  • Do you agree or not that microcoded CPUs allow CISC ISAs with more complex instructions (and thus maybe smaller machine-code for the same program) to be implemented, for a fixed amount of transistors and/or implementation complexity for early CPU architects who didn't have modern design tools? The slowness of microcode makes it easier for memory to keep up with the CPU, especially with a simple prefetch unit decoupled from microcode execution. The more CISCy the CPU and the more cycles per instruction varies by insn, the more buffering benefit you get from a tiny prefetch buffer, maybe. – Peter Cordes Jun 02 '23 at 14:40
  • If we want to be generous to the OP, they might have been thinking in those terms. They didn't really express it clearly or accurately, but there is perhaps an underlying idea that's not totally wrong, IMO. – Peter Cordes Jun 02 '23 at 14:41
  • @PeterCordes Sorry, that question is meaningless if you tweak two parameters at once (complex instructions and transistor budget) and is only created to force a certain narrative which I do not see as existing. – Raffzahn Jun 02 '23 at 15:01

It is possible, was done, and is still done.

While interleaved memory could be useful in early systems to speed up memory bandwidth, the system still has to wait if two consecutive accesses hit even (or odd) addresses. Another way is to not use interleaving but make the data bus twice as wide, as in going from the 80286 with a 16-bit data bus to the 80386 with a 32-bit data bus.

Either way, you need double the memory banks, which basically means double the number of memory chips or memory modules. For example, a standard PC with a 386DX CPU has a 32-bit memory bus, so that alone required four 8-bit SIMM modules, and interleaving would require eight.

To be fair, the problem was not so large on early x86 hardware, as the CPUs had a prefetch queue: multiple bytes of executable code are already inside the CPU, and more code is fetched into the queue while the bus would otherwise sit unused during opcodes that take multiple cycles.
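
As a rough sketch of that effect (mine; the instruction length and cycle counts are simplified assumptions, with the 6-byte queue and 4-cycle bus access loosely following the 8086):

    #include <stdio.h>

    /* Toy model of an 8086-style prefetch queue: the bus unit fills a
       small byte queue whenever the bus is free, so the next opcodes are
       already inside the CPU when the execution unit asks for them. */
    enum { QUEUE  = 6,   /* queue depth in bytes (6 on the 8086)     */
           BUSCYC = 4,   /* bus cycles to fetch one code byte        */
           LEN    = 2,   /* assumed instruction length in bytes      */
           EXEC   = 9 }; /* assumed execution cycles per instruction */

    static unsigned queue, bus_timer;

    static void bus_tick(void)   /* one cycle of the bus unit */
    {
        if (queue < QUEUE && ++bus_timer == BUSCYC) { bus_timer = 0; queue++; }
    }

    int main(void)
    {
        unsigned stalls = 0, cycles = 0;
        for (int insn = 0; insn < 1000; insn++) {
            /* Decode: pull LEN bytes out of the queue, stalling if empty. */
            for (unsigned need = LEN; need > 0; cycles++) {
                if (queue > 0) { queue--; need--; } else stalls++;
                bus_tick();
            }
            /* Execute: the bus is otherwise idle, so the queue refills. */
            for (unsigned c = 0; c < EXEC; c++, cycles++)
                bus_tick();
        }
        printf("%u cycles, %u stall cycles\n", cycles, stalls);
        return 0;
    }

Once the queue has filled, the bus work hides entirely under the execution time of the instructions, so the stall count stays at the handful of cycles caused by the initially empty queue.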

Standard memory modules have long contained chips that are internally divided into multiple banks.

For your C64 example, it's actually the other way around: the DRAM is used by the CPU and the graphics chip in alternating fashion (with some exceptions, where the CPU is halted for a while so more data can be loaded into the graphics chip).

Justme