A "cache" is, in general, any structure that keeps the results of processes/computations for easy access later on. The idea is that any time a given result is required in the future, a quick and efficient lookup from the cache serves as a replacement for doing the whole (potentially expensive) process/computation over again, just to get the same result. A simple example of this is the memorization of times tables. Knowing that 12 times 12 is 144 virtually instantly is useful and worth the memorization effort, compared to requiring someone to recalculate every time, even for (commonly used) small integers like 12 or 5.
It turns out that caches are crucial to the performance of any CPU (Central Processing Unit) even approaching modern standards; caches became standard in mainstream microprocessors around the middle of the 1980s. CPUs contain several kinds of caches serving different purposes, but only one kind is usually meant when speaking of "CPU caches", namely those caches which act as caches of main memory/RAM.
Here, we will discuss the importance of CPU caches, how they are typically laid out, and some design considerations that factor into cache size, cache associativity, and cache algorithms.
The CPU that I am using as I write this is currently 6 years old and officially discontinued by Intel; its clock frequency is 3.4GHz. This means that at the nominal maximum speed (not counting overclocking), a typical clock cycle lasts roughly 0.3 nanoseconds. Most CPU instructions used for computation run in just a single clock cycle (including many which operate on multiple pieces of data at once), and most of the rest run in a handful of clock cycles at most. Compare this to the time it takes to access main memory (RAM), which is typically on the order of 100 nanoseconds. As such, it will not really matter how efficient your "business logic" computation code is if you have to resort to main memory on every memory access; the memory accesses alone will easily bottleneck the entire CPU.
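To make the gap concrete, here is that arithmetic spelled out as a quick sketch (the 3.4GHz and 100ns figures come from the paragraph above; the results are approximations):

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz  = 3.4e9;                 /* 3.4 GHz, from the example above   */
    const double memory_ns = 100.0;                 /* rough main memory latency         */
    const double cycle_ns  = 1e9 / clock_hz;        /* ~0.29 ns per clock cycle          */
    const double stall     = memory_ns / cycle_ns;  /* ~340 cycles spent per memory trip */

    printf("cycle time: %.2f ns, cycles per main memory access: %.0f\n",
           cycle_ns, stall);
    return 0;
}
```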
With CPU caches, the same memory operations that took 100ns when resorting to main memory instead take on the order of 0.5ns when there is a cache "hit" (i.e. we found the data we needed already in the cache) in the L1 cache (we'll see what L1 means in the next section). If we can keep most memory accesses as cache hits, memory accesses stop being a major bottleneck in computation speed. In addition, part of the cache is reserved for storing instructions. When a program's binary (essentially containing the machine code for the program) is executed, it is loaded into main memory, and then the CPU can pull a chunk of that machine code into its instruction cache to be read and executed. You can imagine how much slower execution would be if, every time the CPU needed to consult the machine code, it had to fetch it from main memory. This is why smaller code is often surprisingly faster in practice than equivalent (or even algorithmically better) code that is simply larger.
Of course, not all memory accesses can be hits. The cache will be "cold" when we first start running a program, because the program has not accessed memory at all yet and thus nothing useful is in the cache. Additionally, caches have a finite size, so we can only access so much memory before there are inevitably cache collisions, in which some data from main memory needs to be put into a spot in the cache where other data already lives. This requires evicting the existing data from the cache, which will result in a cache "miss" if we ever try to access that memory again.
In order to be much more efficient than accessing main memory, caches must be designed significantly differently. Perhaps most importantly, caches are small (typically the entire cache capacity of all cache levels on a CPU is thousands of times smaller than the amount of main memory available) and very close to the CPU, actually being built into the same "die" as the CPU itself. This means that there is guaranteed to be very low latency for data to be transmitted from anywhere in a cache to the destination CPU core.
Since CPU caches are highly specialized and much smaller than main memory, they are usually constructed with more costly physical memory, namely SRAM. SRAM is more expensive (in terms of production cost) and less space-efficient than the usual kind of main memory (DRAM), but is much faster and does not require periodic refreshes. However, the exact details of the physical memory construction used depend on design considerations arising from the use of multiple cache "levels".
Modern CPU caches are actually laid out in terms of multiple cache levels, named L1, L2, L3, and so on. Most CPUs only have three levels (L3 being the largest level in terms of capacity), although a few have L4 caches as well. The L1 cache is the smallest and fastest cache, typically weighing in at around 64KB and being the first resort for all memory accesses. If you can find what you want in the L1 cache, you're golden. The L1 is generally partitioned into the "data cache" (which is what it sounds like) and the "instruction cache" (used for caching those machine instructions from main memory to be executed). The L2 is the next largest, weighing in at more like 256KB or so. The L2 is not quite as fast as the L1, its latency typically being a few times higher (a dozen or so cycles rather than a handful), but it is still far faster than main memory. The L2 is not partitioned; it is a unified cache holding both data and instructions, acting as both a backup and a manager for its associated L1 caches. Finally, the L3 cache is the largest cache, usually weighing in at upwards of 2MB. This cache, unlike the L1 and L2, is not specific to a processor core and is shared amongst all of the cores on a multicore CPU chip.
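To see the sizes on a particular machine, many systems let you query them at runtime. The sketch below assumes a Linux system with glibc, which exposes the figures through sysconf (a value of 0 or -1 just means the information isn't reported):

```c
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch: print the cache sizes glibc reports for this machine.
 * The _SC_LEVEL* names are glibc extensions and may not exist elsewhere. */
int main(void)
{
    printf("L1 data cache:        %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 instruction cache: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 cache:             %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:             %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("Cache line size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
```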
The L1 cache usually makes use of SRAM that is optimized for speed rather than physical size, since it is small in capacity and can thus afford to be less dense. The L2 is generally also SRAM, but more optimized for physical size to accommodate its significantly larger capacity. L3 is similar to L2, although even more optimized for size, occasionally even constructed using specialized types of DRAM instead. L4 caches, if they exist at all, are generally not located directly on the chip and use some kind of DRAM.
The basic operation of the cache hierarchy is as follows: When accessing memory, we first check the L1 cache. If the data is there, we continue on very quickly. If not, we check the L2 cache. If it's not there, we check the L3 cache. If it's not there, we either have to resort to directly accessing main memory, or in the case of a CPU with an L4 cache, we check the L4 cache. Even then, if it's not in the L4, we must access main memory directly.
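The sketch below models that cascade in C. Everything in it is made up for illustration (the latencies, the always-hit last level, the function names); it shows the control flow described above, not how hardware is actually built:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the lookup cascade described above; the structures,
 * latencies, and lookup rule are hypothetical, not a real hardware interface. */
typedef struct cache_level {
    const char *name;            /* "L1", "L2", ...                                  */
    int         latency_ns;      /* assumed cost of a hit at this level              */
    struct cache_level *next;    /* next, larger level (NULL = main memory is next)  */
} cache_level;

/* Stand-in for a real lookup: pretend only the last level ever hits,
 * just so the cascade has something to find. */
static bool level_lookup(const cache_level *level, uintptr_t addr)
{
    (void)addr;
    return level->next == NULL;
}

static int memory_access_cost(const cache_level *first, uintptr_t addr)
{
    for (const cache_level *level = first; level != NULL; level = level->next) {
        if (level_lookup(level, addr))
            return level->latency_ns;   /* hit: served at this level               */
        /* miss: fall through to the next, larger and slower level                 */
    }
    return 100;                         /* missed everywhere: main memory, ~100ns  */
}

int main(void)
{
    cache_level l3 = { "L3", 20, NULL };
    cache_level l2 = { "L2",  4, &l3  };
    cache_level l1 = { "L1",  1, &l2  };
    printf("access served in about %d ns\n", memory_access_cost(&l1, 0x1000));
    return 0;
}
```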
Each cache is itself split into "blocks" (sometimes also called "lines"), which are in turn often grouped into "sets" of blocks for cache associativity and replacement purposes; these will be touched on later. Each block represents a single logical chunk of data that was retrieved from some location in main memory. Each block contains the main part, the data, as well as a few bits for the "tag" and the "flags". The tag is just a way of recording the address that the data came from (at least, part of it; we'll get to that). The flags are generally one or more bits indicating whether or not the block is in a certain state, such as "invalid" (doesn't contain anything useful) or "dirty" (has been written to and differs from the corresponding data in main memory). The flag bits may encode more or less information, depending on the complexity and the guarantees that the cache gives, especially when it comes to ensuring cache coherence (something not covered here).
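As a rough picture of what one block holds, here is a hypothetical layout in C. The field widths, block size, and set width are invented for illustration; real hardware packs the tag and flag bits far more tightly and is not addressed as a struct:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* assumed block ("line") size in bytes                  */
#define WAYS 4          /* assumed blocks per set (see associativity, later)     */

/* Hypothetical sketch of one cache block, matching the description above:
 * a tag recording where the data came from, a couple of flag bits, and the
 * cached data itself. */
typedef struct {
    uint64_t tag;               /* which region of main memory this block mirrors    */
    uint8_t  valid : 1;         /* flag: does this block hold anything useful?       */
    uint8_t  dirty : 1;         /* flag: written to, but not yet copied back to RAM? */
    uint8_t  data[BLOCK_SIZE];  /* the cached bytes themselves                       */
} cache_block;

/* Blocks are grouped into sets for N-way set associativity. */
typedef struct {
    cache_block ways[WAYS];
} cache_set;

int main(void)
{
    printf("one block occupies %zu bytes in this sketch\n", sizeof(cache_block));
    return 0;
}
```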
Cache size is, as with almost everything, a tradeoff. A larger cache means lower miss rates, since any given piece of data is more likely to already be in the cache. However, a larger cache also means that access times are slower, because it requires physically larger memory (and thus higher latencies) and/or a slower form of memory storage to keep the physical space & monetary cost down. Additionally, there is a cost associated with searching the cache for a given piece of data, and the larger the cache, the more work each lookup involves.
One way of quantifying cache efficiency is by the average access time, that is, the average time it takes to access memory, whether that's for a specific cache or for memory accesses by the CPU in general. Average access time is equal to the time taken for a cache hit, plus the product of the miss rate and the miss penalty. For example, in a simplified setup where there is only an L1 cache and the main memory, if an L1 cache hit takes 0.5ns, a main memory access takes 100ns, and the miss rate of the L1 cache is 50%, then the average access time for the L1 (and overall) is 0.5ns + (50% * 100ns), which is 50.5ns. In the usual case of a multi-level cache layout, the average access time of the L1 cache is computed in terms of its "miss penalty", which is just the average access time of the L2. Then the L2 average access time is computed in terms of its miss penalty, which is the average access time of the L3, and so on.
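A quick sketch of that calculation in C, using the made-up numbers from the example plus an assumed L2, just to show how the recursion through the levels plays out:

```c
#include <stdio.h>

/* Average memory access time for one level:
 * AMAT = hit time + miss rate * miss penalty,
 * where the miss penalty is the AMAT of the next level (or of main memory). */
static double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns)
{
    return hit_time_ns + miss_rate * miss_penalty_ns;
}

int main(void)
{
    /* The example from the text: L1 only, 0.5ns hit time, 50% miss rate,
     * 100ns main memory. */
    printf("L1 + memory:      %.1f ns\n", amat(0.5, 0.50, 100.0));   /* 50.5 ns */

    /* Hypothetical two-level version: an L2 with a 4ns hit time and a 20%
     * miss rate sits between the L1 and main memory. */
    double l2_amat = amat(4.0, 0.20, 100.0);          /* L2's miss penalty is memory */
    printf("L1 + L2 + memory: %.1f ns\n", amat(0.5, 0.50, l2_amat)); /* 12.5 ns */
    return 0;
}
```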
Cache size directly affects hit time (larger cache, longer hit time) and miss rate (larger cache, lower miss rate). Additionally, the miss penalty at a given level is the average access time of the next level in the hierarchy, which is itself directly affected by the size of that level's cache.
When we access main memory, we cache the results; but when we do that, how do we decide which block to put the data into? The general way in which the cache maps main memory addresses to cache blocks is called the cache's "associativity". The two simplest kinds of associativity are "full-associativity" and "direct-mapping". Full-associativity dictates that any given single address can be mapped to any single block in the cache. Direct-mapping dictates that any given single address can only be mapped to a single unique block in the cache, which cannot be changed.
Fully-associative caches offer the advantage that they are completely "fair"; there is no bias towards certain regions of main memory over others, nor towards particular memory access patterns. This is a very nice property, since we can't magically predict how accesses will be made by the software that runs on the CPU. However, there is a serious drawback: fully-associative caches are the slowest for lookups, since in order to check if a piece of data is in the cache, we have to check every single block.
On the other hand, direct-mapped caches have the fastest lookups, since all you have to do to check for a piece of data is look in a single block. But, again, there is a serious drawback: direct-mapped caches are extremely unfair, which can (and will) lead to pathological behavior where two or more pieces of data are being accessed often but all fight for the same block, constantly evicting one another and leading to cache misses all the time.
In reality, almost all CPUs take a compromise between these two extremes, using what is called "N-way set associativity". Under this regime, any given single address can be mapped to any of a set of N blocks, where N is greater than 1 and is usually a small power of two (such as 4 or 8). This is where the sets of blocks mentioned earlier come into play.
Of course, once we've figured out what kind of associativity we want, we need to figure out how exactly to map addresses to blocks (or sets, as the case may be). The goal here is to maximize "fairness", as mentioned before. We don't want to accidentally prioritize certain addresses over others. The easiest way to do this is using integer remainders.
Direct-mapping has the simplest mapping, so assume that for now; mapping a main memory address to a block works as follows. We divide the address (e.g. 1032) by the size of the cache (e.g. 32 bytes), and then divide the remainder of that division by the size of each block in the cache (e.g. 4 bytes). The result of that last division (rounding towards zero) is the block that the address maps to, which in this toy example is remainder(1032 / 32) / 4 = 2, thus the third block (the first block is block 0).
The quotient of that first division then serves as the aforementioned tag for the block. Since which block the address maps to depends only on the remainder of the division, the quotient is the one piece of information not captured by the block's position. Storing it lets us check, when we look up an address, that the block the address maps to actually holds the data we want (since it could instead hold data from some other address that happened to map to the same block).
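Here is the toy example above expressed as a short C sketch; the cache size, block size, and the address 1032 come straight from the example, while the variable names are arbitrary:

```c
#include <stdio.h>

int main(void)
{
    /* Toy direct-mapped cache from the example: 32 bytes total, 4-byte blocks. */
    const unsigned cache_size = 32;
    const unsigned block_size = 4;
    const unsigned address    = 1032;

    unsigned tag   = address / cache_size;   /* quotient: stored as the block's tag  */
    unsigned rem   = address % cache_size;   /* remainder: position within the cache */
    unsigned block = rem / block_size;       /* which block the address maps to      */

    printf("address %u -> block %u, tag %u\n", address, block, tag); /* block 2, tag 32 */
    return 0;
}
```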
N-way set associative caches are a bit more complicated, but the point is that instead of mapping addresses to individual blocks, the same process can be used to map addresses to sets of N blocks.
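A sketch of the same arithmetic adapted to sets, assuming a hypothetical 2-way version of the toy cache (so its 8 blocks form 4 sets of 2):

```c
#include <stdio.h>

int main(void)
{
    /* Same toy 32-byte cache with 4-byte blocks, now assumed 2-way set associative. */
    const unsigned cache_size = 32;
    const unsigned block_size = 4;
    const unsigned ways       = 2;
    const unsigned num_sets   = cache_size / (block_size * ways);  /* 4 sets            */
    const unsigned set_span   = num_sets * block_size;             /* 16 bytes per pass */
    const unsigned address    = 1032;

    unsigned tag = address / set_span;                /* quotient: stored as the tag   */
    unsigned set = (address % set_span) / block_size; /* which set the address maps to */

    printf("address %u -> set %u (any of %u blocks), tag %u\n",
           address, set, ways, tag);  /* set 2, tag 64 */
    return 0;
}
```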
One other consideration is the size chosen for the blocks. If the blocks are very large, then we have fewer blocks to search and we can pull more data from main memory at a time, thus resulting in potentially fewer cache misses. However, at the same time, larger blocks mean that there is very little granularity in what data we store in the cache. If we just needed to read a 2-byte integer, but our cache blocks are 64 bytes in size, then we are essentially wasting 62 bytes' worth of cache capacity, and are potentially evicting things from the cache unnecessarily. In essence, our cache can only represent relatively few distinct locations in memory at a time. Smaller cache blocks have, naturally enough, the opposite tradeoffs.
Finally, one last area of CPU cache design that we will consider is cache replacement algorithms. In reality, this is only one of a number of different kinds of algorithms that may be involved in managing caches in a real-world CPU like an x86_64 processor. However, since associativity is so important, almost no real caches are direct-mapped (they are N-way set associative, or possibly even fully-associative), and so we need some way of deciding where incoming data should go. The idea is that when we read from main memory, we use a mapping technique like the one described earlier to map the address to some set of blocks; but then we have to choose which block in the set to put the data in.
There are, as usual, tradeoffs to be had here. The main tradeoff is between one extreme of the perfect cache algorithm (usually known as "Bélády's optimal algorithm", or the "clairvoyant algorithm"), and the other extreme of very naïve algorithms. We could even include direct-mapping as the "most naïve" algorithm. More "perfect" algorithms are more difficult to design and are complex & costly to implement, in addition to often being somewhat specific to the use-case in mind, but they make very efficient use of the cache by keeping things in it that are more likely to be accessed in the future (more cache hits). More naïve algorithms are simple to design and implement, and are very efficient in their operation, but don't guarantee quite so efficient/smart use of the cache.
Two algorithms that could be seen as reasonable compromises between simplicity and smartness are FIFO and LRU, with LRU being on the more "smart/complex" side of things. FIFO stands for "First In, First Out", and does essentially what it sounds like. When a block of a set has to be evicted, it is always the one that has been sitting in the cache the longest. This can be implemented with a simple "circular buffer"-type setup, where new cache entries go into the first block first, the second block second, and so on, wrapping around to the first block when you run out of blocks in the set. LRU stands for "Least Recently Used", and improves on FIFO by keeping track of when blocks were last read from or written to. Then, when one needs to be evicted, the one that has gone unused for the longest time is the evictee.
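Here is a minimal sketch of LRU for a single set, using an access counter to track recency. This bookkeeping is just one illustrative way to do it; real hardware usually uses cheaper approximations such as pseudo-LRU:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS 4   /* assumed associativity of the set */

/* One block slot ("way") of a set, with the bookkeeping LRU needs. */
typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;   /* "time" of the most recent access */
} lru_way;

typedef struct {
    lru_way  ways[WAYS];
    uint64_t clock;       /* monotonically increasing access counter */
} lru_set;

/* Look up 'tag' in the set; on a miss, evict the least recently used way
 * and install the new tag. Returns true on a hit, false on a miss. */
static bool lru_access(lru_set *set, uint64_t tag)
{
    set->clock++;

    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set->ways[i].valid && set->ways[i].tag == tag) {
            set->ways[i].last_used = set->clock;   /* hit: refresh recency */
            return true;
        }
        /* track the eviction candidate: prefer an empty way, otherwise
         * the way that has gone unused the longest */
        if (!set->ways[i].valid ||
            (set->ways[victim].valid &&
             set->ways[i].last_used < set->ways[victim].last_used))
            victim = i;
    }

    set->ways[victim].valid     = true;            /* miss: evict and refill */
    set->ways[victim].tag       = tag;
    set->ways[victim].last_used = set->clock;
    return false;
}

int main(void)
{
    lru_set set = {0};
    printf("%d ",  lru_access(&set, 1));   /* 0: cold miss                */
    printf("%d ",  lru_access(&set, 1));   /* 1: hit                      */
    printf("%d\n", lru_access(&set, 2));   /* 0: miss, fills another way  */
    return 0;
}
```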
One example of a cache algorithm that is more naïve, but not as naïve as direct-mapping, is RR, which stands for "Random Replacement". This algorithm simply chooses a random evictee every time. Because of its simplicity (and thus power efficiency), it is actually used in some ARM processors. It has the additional advantage of being significantly fairer than direct-mapping.
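Under random replacement the eviction choice itself collapses to almost nothing; a sketch using the C standard library's rand purely for illustration (real hardware uses its own cheap pseudo-random source):

```c
#include <stdio.h>
#include <stdlib.h>

#define WAYS 4   /* assumed associativity of the set */

/* Random replacement: any way in the set may be the victim. */
static int rr_pick_victim(void)
{
    return rand() % WAYS;
}

int main(void)
{
    printf("evict way %d\n", rr_pick_victim());
    return 0;
}
```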
The design of CPU caches is one of the many crucial areas of improving CPU performance, especially now that CPUs are not getting (much) "faster" in terms of raw clock speed and latency due to fundamental physical limitations. As a result, careful and detailed consideration of cache design and the tradeoffs therein is extremely important to any future computer design.
Additionally, it behooves programmers writing software to know the inner workings, tradeoffs, and philosophy of caches, due to the direct impact caches have on the efficiency of their code. Hopefully this document serves as a reasonable introduction to these things, especially for programmers who want to know more.