@pervognsen
Last active September 21, 2023 07:35

One thing that surprises newer programmers is that the older 8-bit microcomputers from the 70s and 80s were designed to run at the speed of random memory access to DRAM and ROM. The C64 was released in 1982 when I was born and its 6502 CPU ran at 1 MHz (give or take depending on NTSC vs PAL). It had a 2-stage pipelined design that overlapped execution of the current instruction with the instruction fetch for the next one. Cycle counting was simple to understand and master since it was based almost entirely on the number of memory accesses (1 cycle each), with a 1-cycle penalty for taken branches because of the pipelined instruction fetch for the next sequential instruction. So, the entire architecture was based on keeping the memory subsystem busy 100% of the time by issuing a read or write every cycle. One-byte instructions with no memory operands like INX still take the minimum 2 cycles per instruction and end up redundantly issuing the same memory request two cycles in a row.
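To make the cycle-counting model concrete, here's a small sketch of my own (not from the original post) that tallies the documented 6502 timings for a simple counting loop: 2 cycles for immediate/implied instructions, plus the 1-cycle penalty for a taken branch.

```python
# Toy 6502 cycle counter using the standard documented timings:
# 2 cycles for immediate/implied instructions, 3 for a taken BNE
# (2 + 1-cycle penalty for refetching after the branch).
TIMINGS = {
    "LDX #imm": 2,    # load X with an immediate value
    "INX": 2,         # increment X (one-byte, no memory operand)
    "CPX #imm": 2,    # compare X with an immediate value
    "BNE taken": 3,   # taken branch (ignoring page-crossing penalties)
    "BNE fallthrough": 2,
}

def loop_cycles(iterations):
    # Cycle count for: LDX #0 / loop: INX / CPX #n / BNE loop
    total = TIMINGS["LDX #imm"]
    total += iterations * (TIMINGS["INX"] + TIMINGS["CPX #imm"])
    total += (iterations - 1) * TIMINGS["BNE taken"]
    total += TIMINGS["BNE fallthrough"]  # final iteration falls through
    return total

print(loop_cycles(10))  # 71 cycles, i.e. 71 microseconds at 1 MHz
```

Every one of those cycles corresponds to a memory access, which is exactly why the architecture only works if DRAM keeps up with the clock.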

You can play around with a silicon-accurate 6502 simulator here and see how internal registers change as you step:

http://visual6502.org/JSSim/expert.html

For reference, halfcyc is the half-cycle counter, phi0 is the clock phase, and AB is the 16-bit value on the address bus.

Given everything I said, you can infer that a random-access read/write to memory (ROM as well as DRAM) must complete within a single 1 MHz machine cycle, i.e. the random-access read cycle time must be under 1 microsecond.

Obviously this is very different from modern computers. A modern Intel processor, which is designed for fast L1 access, still requires 3 cycles for a read/write to L1 cache, which is SRAM that's physically close to the load/store unit, and L2 and L3 accesses are progressively slower, to say nothing of DRAM. So I thought it'd be fun to calculate and compare the DRAM read cycle time on a C64 compared to DRAM in a modern high-end PC. You might be shocked by the result!

The DRAM read cycle on the C64 must fit within the 1 MHz machine cycle, so while it was likely quite a bit faster than that to provide some safety margin and account for wire delays, let's just use 1 microsecond as a conservative estimate.

Getting 1:1 comparison data from a modern DRAM datasheet is surprisingly hard. DRAMs are by their nature not designed for true random access. They ideally want to do burst accesses from within the same DRAM row, so a lot of the timing characteristics are broken down at that granularity, and it's hard to find the sum of all the durations we want.

What we're trying to measure is the following. Assume the DRAM currently has another row open in its SRAM row buffer.

  1. We first have to write back the buffered row from the SRAM to its original row of DRAM cells.
  2. We then have to precharge the row amplifier to prepare for buffering the new row. This precharging is necessary because the DRAM cells hold such a small charge that they won't be able to directly charge the row buffer's input gates to the logic-high voltage level. Instead we have a sensor array that is precharged to a high-gain metastable state between low and high where it is hyper-sensitive to tiny perturbations that will push it in either direction. Usually metastability is resolved over time by noise, but here we rely on the released DRAM charge to do so. (Metastability is often presented in digital logic textbooks targeting people with a non-analog background as a mysterious phenomenon, but it is in this state that a digital circuit most closely approximates a linear amplifier, so from the linear circuit design perspective the low and high states used in digital design are the annoying, degenerate ones!)
  3. After precharging, we may then read the addressed DRAM row into the row buffer.
  4. Finally, we may select the columns of bytes we're addressing from the row buffer.

These steps constitute a random-access read cycle, where consecutive accesses don't address the same DRAM row.
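As a toy model of those four steps (my own sketch, with placeholder latencies rather than datasheet values): a read that hits the currently open row needs only step 4, while a row miss pays for the write-back, precharge, and activate as well.

```python
# Placeholder per-step latencies in ns -- illustrative only, not datasheet values.
T_PRECHARGE = 15    # steps 1-2: write back the old row, precharge the sense amps
T_ACTIVATE = 15     # step 3: read the addressed row into the row buffer
T_COLUMN_READ = 15  # step 4: select the addressed columns from the row buffer

class Bank:
    """Tracks which row is open in one DRAM bank's row buffer."""
    def __init__(self):
        self.open_row = None

    def read_latency(self, row):
        if self.open_row == row:
            return T_COLUMN_READ  # row hit: the buffer already holds this row
        self.open_row = row       # row miss: full cycle, new row left open
        return T_PRECHARGE + T_ACTIVATE + T_COLUMN_READ

bank = Bank()
print(bank.read_latency(3))  # 45 (row miss: all four steps)
print(bank.read_latency(3))  # 15 (row hit: step 4 only)
```

This row-hit/row-miss asymmetry is why datasheets break timings down per step and why burst accesses within one row are so much cheaper than true random access.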

Let's start with a slightly older technology, an end-of-lifed SDR SDRAM part from Micron:

https://www.micron.com/~/media/documents/products/data-sheet/dram/512mb_sdr.pdf

In this datasheet, stages 1 and 2 correspond to the PRECHARGE command, stage 3 corresponds to the ACTIVE command, and stage 4 corresponds to the READ command. Let's look at some timings:

ACTIVE-to-PRECHARGE command:      t_RAS = 37 ns
PRECHARGE command period:         t_RP  = 15 ns
ACTIVE-to-ACTIVE command period:  t_RC  = 60 ns
ACTIVE-to-READ delay:             t_RCD = 15 ns
Total random-access read cycle:   t     = 127 ns
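Reproducing that arithmetic as a sketch (it interprets the total as the straight sum of the four datasheet values above):

```python
# Micron SDR SDRAM timings from the datasheet above, in ns.
t_RAS, t_RP, t_RC, t_RCD = 37, 15, 60, 15

total_ns = t_RAS + t_RP + t_RC + t_RCD
print(total_ns)                   # 127 ns random-access read cycle
print(round(1000 / total_ns, 1)) # ~7.9 MHz if a CPU were clocked against it
```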

So, this is only about 8x faster (1000 ns / 127 ns ≈ 7.9x) than the DRAM powering the C64!

Another way of looking at it: If you wanted to run a modern CPU with interlocked DRAM access in the manner of the 6502, you'd be limited to a clock rate of about 7.9 MHz with this SDR DRAM chip from Micron.

Let's look at something closer to the cutting edge with a DDR3 SDRAM from Micron:

https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr3/1gb_1_35v_ddr3l.pdf

Let's pull the same timings again. I picked the 800 MHz part, so 1 cycle = 1.25 ns:

t_RAS = 15 cycles
t_RP  = 5 cycles
t_RC  = 20 cycles
t_RCD = 5 cycles
Total = 45 cycles

Converting to units of time, that is 56.25 ns, so better than twice as fast as the SDR part. However, this still would only let us clock a DRAM-synchronous CPU at around 17.8 MHz!
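The same arithmetic for the DDR3 part, again as a sketch:

```python
# Micron DDR3 timings for the 800 MHz part, in clock cycles.
cycle_ns = 1.25                 # 1 / 800 MHz
total_cycles = 15 + 5 + 20 + 5  # t_RAS + t_RP + t_RC + t_RCD

total_ns = total_cycles * cycle_ns
print(total_ns)                   # 56.25 ns
print(round(1000 / total_ns, 1)) # ~17.8 MHz DRAM-synchronous clock
print(round(127 / total_ns, 2))  # ~2.26x faster than the SDR part
```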

Compare that to the fact that the modern CPU with that DDR3 RAM can peak at a rate of over 4 GHz, with multiple cores, hardware threads, and a high level of instruction-level parallelism per thread!

I hope I didn't screw up these calculations or misinterpret the datasheet timings. Corrections welcome!

@mattgodbolt

The British BBC Micro used a 6502 at 2MHz and also multiplexed RAM access between the video processor and the CPU. It's so crazy how far CPU speeds have moved on and how (relatively) little RAM speeds have. Another interesting point is that the 6502 had no "memory enable" pin, so it unconditionally accessed RAM on every processor cycle: many redundant reads and sometimes even writes were made!

Tons of extra info on this (and emulating it cycle-perfectly) at http://xania.org/Emulation-archive
