I have tried to research the history and development of memory caching online, but I find it hard to find good information. Many resources online would have you believe caching was introduced with the Intel 80486, and generally assume it's only a thing for microprocessors. Yet the Stanford Superfoonly design from the early 1970s included a cache that was projected to provide a ~10x speedup over a PDP-10, and I'm sure even earlier examples could be found elsewhere.
-
Even in the x86 world, “introduced with the Intel 80486” is a big simplification; see What was the first x86 CPU to use a cache of any kind? – Stephen Kitt Jun 21 '22 at 07:05
-
The IBM Stretch had a unit to accelerate memory accesses, but as far as I can see it's not what we would consider a cache today: http://www.bitsavers.org/pdf/ibm/7030/TR00.03000.703_StretchVM_59.pdf – Lars Brinkhoff Jun 21 '22 at 08:48
-
Is this specifically about on-chip CPU caching of main memory, or about caching of memory hierarchies as a general computer science principle? – Michael Kay Jun 22 '22 at 17:45
-
486? That wasn't released until 1989. (But yes, it had fully on-chip caches.) The MIPS R2000 from 1986 had split I/D caches, with the controllers on-chip but the actual data + tags off-chip. Classic RISC pipelines were built around 1 instruction per clock, with single-cycle latency D cache. (So the off-chip SRAM limited clock speeds to 15 MHz, but it was still a proper CPU cache with the access logic integrated into the pipeline. It stalled on miss instead of scoreboarding loads like modern in-order pipelines, but could read I-cache and D-cache in the same cycle.) – Peter Cordes Jun 23 '22 at 10:50
-
@MichaelKay, since early caches were used before VLSI CPUs were available, off-chip is on-topic. – Lars Brinkhoff Jun 27 '22 at 05:59
4 Answers
The concept of cache memory was formalised by Maurice Wilkes in his 1965 paper, Slave Memories and Dynamic Storage Allocation. This describes a hierarchical memory setup with a small amount of fast core memory serving a larger amount of slower core memory. It refers to system descriptions of “slave memories” in existing computer designs at the time, the ETL Mk-6 computers and the Atlas 2; these had very small, very-high-speed memories used as instruction caches (the Atlas 2’s cache was however never implemented). Wilkes’ paper discusses the practicalities of extending the concept to use larger amounts of cache for more general purposes.
It covers many concepts and concerns which will still be familiar to present-day readers: tag bits, cache coherency (which shows up in the paper as the need to write back dirty words in the cache on program switches), associativity…
The usefulness of cache memories quickly spread, and even an overview of their history and development would be quite long. One could start by looking at the citations of the Wilkes paper, and other articles published in the 60s and 70s such as DJ Kuck and DH Lawrie’s The use and performance of memory hierarchies: A survey (which features an extensive bibliography).
Caches appeared in general-purpose processors in the following years; early examples include DEC’s KL10, based on the Superfoonly design you mention, and the various cache-equipped System/360 models mentioned in Raffzahn’s answer.
It took a while for microprocessors to include cache, for a number of reasons, most importantly the available transistor budget (see Why did Intel abandon unified CPU cache? for some discussion of that), but also the fact that early microprocessors were slow enough that memory accesses weren’t necessarily a huge problem.
-
Data point: the Motorola 68020 had a 256-byte instruction cache in 1984. – Lars Brinkhoff Jun 21 '22 at 08:50
-
Yes, I mention that in Why did Intel abandon unified CPU cache? (albeit without mentioning the date), the last link in the answer above. – Stephen Kitt Jun 21 '22 at 08:51
-
One issue with searching for the history of cache memory has been different terminology: “cache” may be standard now, but I found a lot of older IBM documentation called it “buffer”, whereas nowadays we, or at least I, use “buffer” as something very similar to “queue”. – Krazy Glew Jun 21 '22 at 17:21
-
Also, hardware memory caching is related to “associative memory”, which was a popular early topic. IIRC there were quotes like “the most important topic in computer architecture is associative memory”. Not just the caches and TLBs we are familiar with: people also considered programmer-controlled associative memory, beyond software-controlled caching. IMHO transparent HW memory caching is the most successful survivor of the old research in associative memory. TCAMs (ternary content-addressable memories), e.g. as used in network routers, are a similar survivor. – Krazy Glew Jun 21 '22 at 17:29
-
@KrazyGlew: It's also possible for computers to include constructs which are like caches, but really small, and it's sometimes unclear whether such things should be referred to as caches. For example, a 32-bit machine that supports byte and half-word accesses might buffer the last 32 bit word which was involved with a byte or half-word load, so that if a subsequent request accesses the other half there would be no need to do a second load operation. For some kinds of code, such a buffer could have a 50% hit rate at very little cost. – supercat Jun 22 '22 at 16:03
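A minimal sketch of such a one-word buffer (the 32-bit word size, list-backed memory, and little-endian byte numbering are illustrative assumptions, not any specific machine's design):

```python
# Sketch of a one-word "almost a cache": the machine keeps the last
# 32-bit word fetched for a byte load, so a following access to another
# byte of the same word needs no second memory read.

class WordBuffer:
    def __init__(self, memory):
        self.memory = memory           # backing store: list of 32-bit words
        self.word_addr = None          # word address currently buffered
        self.word = 0
        self.hits = self.misses = 0

    def load_byte(self, byte_addr):
        word_addr, byte_in_word = divmod(byte_addr, 4)
        if word_addr == self.word_addr:
            self.hits += 1             # another byte of the buffered word
        else:
            self.misses += 1           # real memory access, refill buffer
            self.word_addr = word_addr
            self.word = self.memory[word_addr]
        # little-endian byte numbering within the word (an assumption)
        return (self.word >> (8 * byte_in_word)) & 0xFF

memory = [0x04030201, 0x08070605]      # toy two-word main memory
buf = WordBuffer(memory)
vals = [buf.load_byte(a) for a in [0, 1, 2, 3, 4]]
```

For byte-sequential code like this, only the first access to each word touches memory, which is where the 50% (or better) hit rate comes from.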
-
the need to flush the cache when switching programs - That's only a thing with virtually-tagged caches. Most modern CPUs use physically tagged caches. And usually with ways to avoid synonym/homonym problems, either inherently (PIPT, or VIPT with enough associativity that the index bits are from the byte-within-page part of the address so they translate for free), or via OS cooperation (like page coloring). But yes, virtually-tagged caches are a thing on some CPUs, and do require flush on context-switch. (And could be problematic if the same phys page were mapped twice in the same process?) – Peter Cordes Jun 23 '22 at 12:16
-
Or if you don't have cache-coherent DMA, you'd also have to flush before reading in a new program from disk, even without virtual memory. – Peter Cordes Jun 23 '22 at 12:17
Preface: This focuses on real machines, available as production units, not prototypes or experimental designs. Nor are the examples exhaustive. I will also spare any discussion of memory hierarchy in general and go with the meaning of CPU cache as it is understood today.
The development of cache is a continuation of storage hierarchy (*1), a principle still visible in IBM mainframes, and is interlinked with the development of virtual memory. Both are methods to increase the speed of the most active memory regions while at the same time providing ever larger amounts of usable address space.
The first step might have been machines like (*2) the Z23, a 1961 transistor-based reimplementation of the earlier Z22. While the Z22 only placed the first 16 words (the registers) in core, the Z23 had 256 additional words of core within the (drum) address space.
The mid 1960s also mark the point in time when core installations became large enough to completely replace drum as main memory, redesignating it as very fast external storage. Besides rapidly growing size and an independent address space layout, this also brought independence from drum timing (*3), which allowed the use of different cycle times depending on different memory types, etc.
The next step was virtual addressing, which allows arbitrary memory regions to be kept in limited core while the rest stays on larger but less expensive media than drums - aka magnetic disks :). The basis is a TLB (Translation Lookaside Buffer). The first production machines to implement a TLB might have been the IBM 360/67 of 1965 and the GE 645 (*4) of 1967, although the latter might still be counted as an SST built for the project.
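A TLB is itself a small cache, keyed by virtual page number; a minimal sketch (the page size, capacity, and crude oldest-entry eviction are illustrative assumptions, not the 360/67's or GE 645's actual design):

```python
# Illustrative TLB: recent virtual-page -> physical-frame translations
# are cached so most address translations avoid a full page-table walk.

PAGE_SIZE = 4096  # hypothetical page size

class TLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}              # virtual page number -> frame number
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:        # fast path: translation cached
            self.hits += 1
        else:                          # slow path: walk the page table
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # evict the oldest entry (dicts preserve insertion order)
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

page_table = {0: 7, 1: 3, 2: 9}        # toy page table
tlb = TLB()
phys = [tlb.translate(a, page_table) for a in [100, 200, 5000, 300, 4096]]
```

Because programs touch the same few pages repeatedly, even a handful of entries catches most translations, which is the same locality argument that makes data caches work.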
Core was, at the time, with below 1 µs access, already incredibly fast, but semiconductor memory became a possibility soon after. This created the same opportunity for a speed increase as with core vs. drum.
In January 1968 IBM introduced the /360 Model 85, providing cache as we know it today. The first units were eventually delivered in December 1969. The Model 85 used the 2385 Processor Storage in
- 2 or 4 way configuration (*5) with
- 512 KiB to 4 MiB and 960 ns cycle time
plus a
- 32 KiB cache at 240 ns
The Model 85 became especially influential due to the description of its cache design in John Liptay's article Structural aspects of the System/360 Model 85 – Part II: The cache (*6), published in the IBM Systems Journal Vol. 7 No. 1, March 1968, pp. 15-21 (scanned here).
In fact, IBM beat themselves to market in August 1969, when they delivered the /360 Model 195, which also included a 32 KiB cache, taking memory speed to the extreme:
- four megabyte Core at 754 ns
- one megabyte Thin-Film Memory at 120 ns (!) (*7)
- 32 KiB semiconductor RAM, acting as cache, at 54 ns
The 195 was the fastest general-purpose computer of its time, only beaten in pure FP power by Cray's CDC 6600 (*8). The Model 195's performance was comparable to an early 1990s Pentium.
Now, for the x86 timeline: the 486 was the first (Intel) implementation with an on-chip cache. Caches had been used before in 286 and 386 systems as well (*9). For the 386, Intel even offered a dedicated cache controller, the 82385, which was used in a series of motherboards.
After all, a cache isn't anything special: just logic that keeps some memory contents in a fast RAM and slows the CPU down when the desired content is not within that fast section. As a result, even 8-bit systems used caches. The best-known examples may be the Zip Chip and RocketChip for the Apple II, both utilizing 8 KiB of static RAM to have a 65C02 run at 4..10 MHz in a standard 1 MHz Apple II. But it was implemented as early as 1985 with the Speed Demon, running at 3.5 MHz with a 4 KiB cache.
In 1988 Apple introduced the Apple IIc Plus, which essentially included a Zip Chip-based design on the motherboard, running the 1 MHz base system at 4 MHz with cache, making it the most useful Apple II to date.
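That "just logic in front of a slow RAM" description can be sketched as a tiny direct-mapped cache (the line size, line count, and byte-array backing store are illustrative, not those of any of the accelerators mentioned):

```python
# Minimal direct-mapped cache in front of a slow backing store: a hit is
# served from the fast lines, a miss fetches a whole line from memory.

LINE_SIZE = 16    # bytes per cache line
NUM_LINES = 8     # total cache size: 128 bytes

class DirectMappedCache:
    def __init__(self, memory):
        self.memory = memory                 # backing store (slow RAM)
        self.tags = [None] * NUM_LINES       # which address each line holds
        self.lines = [None] * NUM_LINES
        self.hits = self.misses = 0

    def read(self, addr):
        line_no, offset = divmod(addr, LINE_SIZE)
        index = line_no % NUM_LINES          # which cache line to check
        tag = line_no // NUM_LINES           # distinguishes addresses sharing a line
        if self.tags[index] == tag:
            self.hits += 1                   # fast path: serve from cache
        else:
            self.misses += 1                 # slow path: refill line from memory
            base = line_no * LINE_SIZE
            self.lines[index] = self.memory[base:base + LINE_SIZE]
            self.tags[index] = tag
        return self.lines[index][offset]

ram = bytes(range(256))                      # toy 256-byte main memory
cache = DirectMappedCache(ram)
data = [cache.read(a) for a in [0, 1, 2, 17, 0, 200]]
```

The tag comparison is the whole trick: it is the part that makes the fast RAM transparent to the program, as opposed to the program-managed fast stores of the Z23 era.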
Bottom line: 1968/69 is a safe assumption for the first cache architecture, as we understand it today, delivered in production units. As expected, cache is far older and far more widely used than just with the 80486.
*1 - Think of the basic hierarchy of
Disk -> Drum -> Core -> Semiconductor
each stage faster than the previous, but also more expensive and smaller in capacity. The mainframe view of storage goes through many more stages, from cache all the way to tape libraries and even offline storage like card stacks :))
*2 - The Z23 is used as an example because it is a linear step from the Z22, with only tubes exchanged for transistors and more core available as an add-on; but there are several other machines based on the combination of core and drum within a single address space in the 1950..1965 time frame.
*3 - Drum-based computers (like the 1955 Z22) are deeply intertwined with the drum rotation. The system clock is derived from the drum's rotation, thus fixed and unchangeable - at least not without a huge speed penalty. In fact, programs for drum computers were often written with rotation speed and drum size in mind, but that's a different story.
*4 - Famous for being built to run Multics
*5 - That means the memory was organized interleaved, so 2/4 consecutive accesses could be done in a single cycle - a bit like today's burst accesses.
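The interleaving idea can be sketched like this (the bank count, cycle time, and simple pipelined timing model are illustrative assumptions):

```python
# 4-way interleaving: consecutive word addresses land in different
# banks, so sequential accesses can overlap instead of queueing up
# behind one slow bank.

BANKS = 4

def bank_of(word_addr):
    return word_addr % BANKS       # low address bits select the bank

def cycles_for_sequential(n_words, bank_cycle=4):
    # Interleaved: a new bank is started every cycle while the earlier
    # banks are still busy, so after the first full bank cycle one word
    # completes per cycle. Non-interleaved: every access pays the full
    # bank cycle time.
    interleaved = bank_cycle + (n_words - 1)
    non_interleaved = bank_cycle * n_words
    return interleaved, non_interleaved

banks = [bank_of(a) for a in range(8)]
fast, slow = cycles_for_sequential(8)
```

For a stream of 8 sequential words the interleaved arrangement finishes in roughly a quarter of the cycles, which is why it resembles modern burst access.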
*6 - Notably, he already describes it as a cache, while IBM continued for several years to call it 'buffer memory', in continuation of their hierarchical storage concept, where everything is memory of a different stage.
*7 - Thin-film memory is a kind of 'integrated' core memory, where the cores themselves are little (magnetic) metal dots placed on a glass plate, covered by 'wires' laid out in a fashion similar to how PCB traces are made today. The result is quite small boards with several dozen KiB each.
It took solid state a few years to surpass thin film, and core in general. In some way they are to core memory what the Bugatti Veyron's 16-cylinder engine is to combustion engines: a last glorious and magnificent uprising against time before being made obsolete by tiny, cheap, soulless electric motors that simply do the job far better than any fuel sucker ever could.
*8 - Until the 195, IBM tried to position the /360 for scientific number crunching as well. But the late 1960s are when general-purpose CPUs and high-performance FP finally parted ways, with IBM focusing on the former. A bit like today, with GPUs used for number crunching while standard GP CPUs do the data shovelling from and to disks or across the network.
*9 - There's also the Segment Descriptor Cache, starting with the 286, but that's a different beast.
-
I never saw core as slow as 8 µs. The slow model of the IBM 1130 had 3.2 µs core. – John Doty Jun 22 '22 at 00:22
-
@JohnDoty you're right, I intended to write .8. For clarity I changed it to below 1 µs. I hope that's fine. – Raffzahn Jun 22 '22 at 15:32
-
Excellent answer. BTW there's something missing from the end of *7, "the cores themselves are little metal dots controlled by". – Wayne Conrad Jun 22 '22 at 16:33
-
The Cambridge Titan I believe had cached memory in 1963 - https://en.wikipedia.org/wiki/Titan_(1963_computer) – Michael Kay Jun 22 '22 at 17:49
-
@MichaelKay Not really. Not a cache, but a higher-speed memory region called the Fast Operand Store. Anything stored at absolute addresses 0..8 would be accessed faster - essentially as Zuse's Z22 did in '55 (the Z23 expanded this to 256 words in '61). A planned 32-instruction Slave Store, to be used in loops, was never finished. If it had worked, code would have had to be loaded under program control, thus again not a cache as we know it today, but exactly like the Z23. Last, with 3 installations, all different, one of them being the engineering prototype, it doesn't really qualify as a production device. – Raffzahn Jun 22 '22 at 18:21
-
I simultaneously love and hate your analogy to the Bugatti Veyron's V16, as I simultaneously love and hate the soul-sucking electric motor as applied to road-going vehicles. – FreeMan Jun 23 '22 at 13:42
-
I can't hear about drum memory without thinking of The Story of Mel. – John Bollinger Jun 23 '22 at 15:28
-
@FreeMan Well, as one trained in electrics, I do know how superior electric motors are and how much I love them. If you ever get a chance to feel the punch a BMW i8 can deliver, try it. I'm fine with not hearing the roar of a gas guzzler. Still, the mechanical beauty of a combustion engine, a marvel one can touch with one's own hands in every detail, is simply not there with an electric motor. (BTW, it's rather a VVR16, as it's neither a V nor a W setup) – Raffzahn Jun 23 '22 at 20:49
-
@JohnBollinger Yeah. While I have drums, I'm still looking for an LGP-30. I had the chance to play with one many years ago; it's remarkable. Still, having one would be even better :))) – Raffzahn Jun 23 '22 at 20:53
-
@Raffzahn: How did the cost and benefits of a 32 KiB RAM cache compare with those of simply using interleaving with core memory, or else using a wider bus and some minimal buffering? Given only a 4:1 cycle time difference between cache and core, 32 KiB of semiconductor memory would seem outrageously expensive for the marginal benefit over using other techniques. – supercat Jun 28 '22 at 18:17
-
Even the BESM-6 in 1968 (the beginning of serial production) had what can (with some stretch) be called a cache. Associative, non-addressable super-fast RAM - 4 48-bit words of instruction cache (8 instructions), 4+8 words of data cache (separate for read and write data). https://besm-6.ru/besm6.html – Wheelmagister Sep 06 '23 at 18:40
Cache in minicomputers: I received a DEC PDP-11/45 in the summer of 1972. The 45 had core memory slots in the backplane as well as, I believe, 4 slots for fast and expensive (semiconductor) memory running at 300 ns. The Fabritek memory systems company in Minneapolis was making add-on core boxes for the 45. An engineer at Fabritek worked with me, as the customer, to make an add-in 256-word cache memory for the 45 that fit into a fast memory slot. This was roughly 1975. It worked very well and sped up the 45 nicely. His company, Minntronics, sold a bunch of cache memories for the DEC PDP-11/45.
I think we owe most naming in computing to IBM and the other mainframe producers, because I see very close relations between these names and enterprise accounting.
First example: "computer" was originally the name of an occupation - in an accounting department, this was a person who made computations.
Second, "registers": in large-scale accounting departments there were special accountants' desks with a few recesses to hold the books the accountant was currently working with.
And the experts who integrated the first computers into enterprise accounting departments were faced with slow memory and fast memory. In those times computer experts were extremely well paid, so they surely had bank accounts and used all the bank's services.
So they knew that you could spend money from an account with a check, but that could be slow - for example, the bank might process checks only on working days, 9-18 - while using cash (paper money, or even coins) is fast, though the volume of cash you can hold is small; so the name "cache" stuck.
As for the exact technology: the first computers were relay-based and slow, tens of Hertz at best; then came tubes, capable of megahertz speeds but unreliable; then came ferromagnetic cores and delay lines.
Before ICs, the most widespread technology was tubes; they were fast but expensive, and the first fast RAM was surely tube-based, while slow memory was built from delay lines and ferromagnetic cores.
But for tubes I'm not sure whether RAM was made from them, or only registers, because building memory from binary tubes is extremely expensive, and memory tubes appeared later.
When transistors appeared, they were used very early to build logic and RAM, but that was later.
And by the time universal ICs appeared, most computing terminology already existed.
One more note: the first widely known talks and even some work on parallel computing were done in the 1950s and early 1960s (IBM, Ferranti, Burroughs; see the parallel computing history on Wikipedia), so the cost of cache was known long before the i486 :)
-
Money is cash, not cache; the etymologies are very different. A cache is “a store of things that may be required in the future, which can be retrieved rapidly”, that’s where the name comes from; nothing to do with cash as liquid money. – Stephen Kitt Jun 24 '22 at 04:19
-
@Stephen Kitt, the exact letters of cash vs cache are not important; what matters is the pronunciation, which is extremely close for these words. – Serge Sergeev Dec 04 '22 at 17:39
-
The etymology of "cash" is from the French "caisse" meaning "money box". The etymology of "cache" is from the French "cacher" which means "to hide". – JeremyP Sep 07 '23 at 11:44