@rygorous
Last active April 22, 2019 16:15
Anandtech on Zen.
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/6
"The High-Level Zen Overview"
- "Features such as the micro-op cache help most instruction streams improve in performance and bypass parts of potentially
long-cycle repetitive operations, but also the larger dispatch, larger retire, larger schedulers and better branch
prediction means that higher throughput can be maintained longer and in the fastest order possible."
Micro-op caches have nothing to do with "bypassing parts of potentially long-cycle repetitive operations" (what does
that even mean?). They reduce decode bottlenecks and decrease power consumption. Depending on the implementation,
they also may or may not reduce branch misprediction costs.
- "The improved branch predictor allows for 2 branches per Branch Target Buffer (BTB), but in the event of tagged
instructions will filter through the micro-op cache."
It's 2 branches per BTB *entry*. The image at the top of that page has the original slide from AMD that mentions
this. I can't parse what the second half of that sentence is trying to tell me.
- "On the FP side there are four pipes (compared to three in previous designs) which support combined 128-bit FMAC
instructions. These can be combined for one 256-bit AVX, but beyond that it has to be scheduled over multiple
instructions."
What is "it" and what is being scheduled over multiple instructions? Also, "combining" of the pipes for 256-bit
AVX instructions seems unlikely; AVX is very explicitly designed so the 128-bit halves are almost entirely
independent, so the far more likely scenario is that 256-bit AVX ops are just split into two independently
scheduled 128-bit ops for execution. (E.g. AMDs Jaguar/Puma cores also do this for AVX, and the earlier
Bobcat cores used this to to execute 128-bit float ops with a 64-bit wide FP/SIMD execution unit).
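
A quick illustration of the lane independence mentioned above - this is just the ISA-level behavior, not a claim
about AMD's implementation; it assumes an AVX-capable compiler (e.g. gcc -mavx):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m256 a = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
        __m256 b = _mm256_setr_ps(8, 7, 6, 5, 4, 3, 2, 1);

        /* One 256-bit add... */
        __m256 full = _mm256_add_ps(a, b);

        /* ...is semantically just two independent 128-bit adds on the halves,
           which is why splitting it into two 128-bit uops is straightforward. */
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a), _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1), _mm256_extractf128_ps(b, 1));
        __m256 split = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);

        float f[8], s[8];
        _mm256_storeu_ps(f, full);
        _mm256_storeu_ps(s, split);
        for (int i = 0; i < 8; i++)
            printf("%g %g\n", f[i], s[i]);   /* both columns come out identical */
        return 0;
    }
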
"Fetch"
- "The L1-Instruction Cache can also accept 32 Bytes/cycle from the L2 cache as other instructions are placed through the
load/store unit for another cycle around for execution."
Word salad again? The L1 I$ talks to the (unified) L2. The load/store units don't enter into it; those deal with the data cache.
- "The new Stack Engine comes into play between the queue and the dispatch"
AMD has had dedicated stack engines (using that exact name) since Bulldozer. They're possibly "new" as in
improved, but it's not something new in this microarchitecture generation. (Intel has had them since Core 2).
UPDATE: Actually, AMD already had Stack Engines in the K10 - a 10-year-old microarchitecture at this point.
"Execution, Load/Store, INT and FP Scheduling"
- "Only two ALUs are capable of branches, one of the ALUs can perform IMUL operations (signed multiply), and
only one can do CRC operations."
IMUL in this context means "integer multiplier" (as opposed to FMUL, which is floating-point), not the x86
IMUL instruction. All integer multiplications go through an IMUL pipe, not just signed ones.
- "The TLB buffer for the L2 cache for already decoded addresses is two level here, with the L1 TLB supporting
64-entry at all page sizes and the L2 TLB going for 1.5K-entry with no 1G pages. The TLB and data pipes are
split in this design, which relies on tags to determine if the data is in the cache or to start the data
prefetch earlier in the pipeline."
The L2 TLB is not a TLB for the level-2 cache, it's a second-level TLB. Both the actual data and the page
translation have their own separate multi-level cache hierarchy.
Splitting data and TLB lookup means they use a so-called VIPT (Virtually Indexed, Physically Tagged) cache.
That's a fairly standard design, and means the data fetch can start early, before the translated physical
address is known. Prefetching doesn't really enter into it. (Intel has been using VIPT for years; so have
AMD's Bobcat/Jaguar/Puma cores. There's a small sketch of how a VIPT lookup works after this list.)
- "We have two MUL and two ADD in the FP unit, capable of joining to form two 128-bit FMACs, but not one
256-bit AVX."
An FMAC (fused multiply-accumulate) is *not* something you can get by "joining" an FP multiplier and an FP adder.
It's different hardware, since an FMA operation only has one rounding step, whereas MUL+ADD rounds twice
(there's a tiny demo of that difference after this list). Most likely, the MUL units are actually
fused-multiply-add units with a multiply-only fast path (something like Quinnell's Bridge FMA). I'd like
to know how that's realized, but alas, no more details there.
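
For the VIPT point above, a minimal sketch of how a virtually indexed, physically tagged lookup works. The sizes
are illustrative, not Zen's actual cache parameters, and translate() is just a stand-in for the TLB/page walk:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BITS 6                        /* 64-byte lines */
    #define SET_BITS  6                        /* 64 sets: index bits stay inside the 4K page offset */
    #define NUM_SETS  (1u << SET_BITS)
    #define NUM_WAYS  8

    struct line { bool valid; uint64_t ptag; /* data payload omitted */ };
    static struct line cache[NUM_SETS][NUM_WAYS];

    /* Stand-in for the TLB / page-table walk; identity-maps for the sake of the sketch. */
    static uint64_t translate(uint64_t vaddr) { return vaddr; }

    static bool lookup(uint64_t vaddr)
    {
        /* The set index comes from virtual-address bits below the page offset, so it is
           identical in the physical address and can be computed before translation
           finishes -- that's the "data and TLB pipes split" part. */
        uint32_t set = (vaddr >> LINE_BITS) & (NUM_SETS - 1);

        /* In hardware the translation runs in parallel with the set access;
           only the final tag compare needs the physical address. */
        uint64_t ptag = translate(vaddr) >> (LINE_BITS + SET_BITS);

        for (int way = 0; way < NUM_WAYS; way++)
            if (cache[set][way].valid && cache[set][way].ptag == ptag)
                return true;                   /* hit */
        return false;                          /* miss: go to L2, etc. */
    }
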
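And for the single-rounding point, a tiny demo that fma() (one rounding) and mul-then-add (two roundings) can
differ in the last bit. Link with -lm and compile with -ffp-contract=off so the compiler doesn't fuse the
"separate" version itself:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Chosen so that a*b = 1 + 2^-26 + 2^-54, which doesn't fit in a double. */
        double a = 1.0 + 0x1p-27;
        double b = 1.0 + 0x1p-27;
        double c = -(1.0 + 0x1p-26);

        double fused    = fma(a, b, c);  /* exact product, rounded once: 2^-54 */
        double separate = a * b + c;     /* product rounded first, then the add: 0 */

        printf("fma:     %a\n", fused);
        printf("mul+add: %a\n", separate);
        return 0;
    }
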
"Simultaneous MultiThreading (SMT)"
- "The two new instructions are CLZERO and PTE Coalescing."
PTE coalescing isn't an instruction. It's an extension in the page table entry format.
- "The first, CLZERO, is aimed to clear a cache line and is more aimed at the data center and HPC crowds.
This allows a thread to clear a poisoned cache line atomically (in one cycle) in preparation for zero data
structures. It also allows a level of repeatability when the cache line is filled with expected data."
Word salad again? The purpose of CLZERO is the same as the purpose of equivalent instructions in other
ISAs, e.g. PowerPC "dcbz": to make a cache line writeable without requiring the previous contents to be
sent from whoever held the cache line previously. That's a bandwidth optimization. Such cache lines are
zero-initialized because you need to initialize them to _something_ - else you might leak previous
contents of that cache line, which might be data your process isn't supposed to be able to see, e.g.
encryption keys recently used by a different process. (A short usage sketch follows after this list.)
- "PTE (Page Table Entry) Coalescing is the ability to combine small 4K page tables into 32K page tables,
and is a software transparent implementation. This is useful for reducing the number of entries in the
TLBs and the queues, but requires certain criteria of the data to be used within the branch predictor
to be met."
Word salad again; haven't found any detailed description of this yet, but my guess is that there is
some way for the OS to tag that a group of 8 aligned, contiguous 4k page table entries maps a
physically contiguous 32k region. Which would be a backwards-compatible way to introduce 32k pages
(which in turn means fewer TLB misses); there's a sketch of that guess after this list. I have no idea why
the branch predictor is mentioned here; it has nothing to do with any of this.
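
For the CLZERO item above, a minimal usage sketch. Assumptions: a CPU and assembler that know the instruction;
CLZERO takes its address implicitly in rAX (hence the "a" constraint); the buffer is cache-line aligned and a
multiple of 64 bytes:

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    /* Zero a buffer one full cache line at a time without ever reading the old
       contents -- no read-for-ownership traffic, the lines just come back zeroed. */
    static void clzero_range(void *buf, size_t bytes)
    {
        uint8_t *p = (uint8_t *)buf;
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            __asm__ volatile("clzero" : : "a"(p + off) : "memory");
    }
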
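And a sketch of the PTE-coalescing *guess* above - emphasis on guess: check whether 8 aligned, contiguous 4K PTEs
map a physically contiguous, 32K-aligned region, in which case a single coalesced TLB entry could cover all of
them. The PTE layout here is a simplified stand-in, not the real x86-64 format:

    #include <stdbool.h>
    #include <stdint.h>

    #define GROUP 8                       /* 8 x 4K = 32K */

    /* Simplified page table entry: a present bit plus a physical frame number. */
    struct pte { bool present; uint64_t pfn; };

    /* group_start must point at the first PTE of an 8-entry-aligned group. */
    static bool coalescable(const struct pte *group_start)
    {
        uint64_t base_pfn = group_start[0].pfn;

        if (base_pfn % GROUP != 0)        /* physical region must be 32K-aligned */
            return false;

        for (int i = 0; i < GROUP; i++) {
            if (!group_start[i].present)
                return false;
            if (group_start[i].pfn != base_pfn + i)   /* must be physically contiguous */
                return false;
        }
        return true;                      /* one TLB entry could cover the whole 32K */
    }
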
"Power, Performance, and Pre-Fetch: AMD SenseMI"
- "Every generation of CPUs from the big companies come with promises of better prediction and better
pre-fetch models. These are both important to hide latency within a core which might be created by
instruction decode, queuing, or more usually, moving data between caches and main memory to be
ready for the instructions. With Ryzen, AMD is introducing its new Neural Net Prediction hardware
model along with Smart Pre-Fetch."
AMD have been using Neural Net branch predictors since at least Bobcat (2011). They didn't make
a big deal of it then but do now, because NNs weren't a very hot topic back then and they are now.
Deep Learning fans, sorry to disappoint you, but this is a Perceptron predictor. :)
(As per the Hot Chips slides on Zen that they're always showing cropped screenshots of. There's a
minimal perceptron-predictor sketch after this list.)
- "For Zen this means two branches can be predicted per cycle (so, one per thread per cycle)"
Two branches per cycle yes, one per thread per cycle: unlikely. This part of the pipeline (fetch)
usually works on one thread per cycle. And if it was one branch per thread per cycle, there would
be little point in storing information about 2 branches per BTB entry (as they misquoted earlier).
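
For the "Neural Net Prediction" item above, a minimal perceptron predictor sketch in the style of Jimenez & Lin -
just the textbook idea, not AMD's actual implementation; the table size, history length and indexing are made up:

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_LEN   16                 /* global history length (made up) */
    #define NUM_PERCEP 1024               /* perceptron table size (made up) */
    #define THETA      ((int)(1.93 * HIST_LEN + 14))   /* training threshold from the paper */

    static int8_t weights[NUM_PERCEP][HIST_LEN + 1];   /* [0] is the bias weight */
    static int8_t history[HIST_LEN];                   /* +1 = taken, -1 = not taken */

    static bool predict(uint64_t pc, int *out_sum)
    {
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int sum = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            sum += w[i + 1] * history[i];
        *out_sum = sum;
        return sum >= 0;                  /* predict taken if the dot product is >= 0 */
    }

    static void train(uint64_t pc, int sum, bool taken)
    {
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int t = taken ? 1 : -1;

        /* Train on a mispredict, or while confidence is still below THETA.
           (Weight saturation is omitted to keep the sketch short.) */
        if ((sum >= 0) != taken || (sum < THETA && sum > -THETA)) {
            w[0] += t;
            for (int i = 0; i < HIST_LEN; i++)
                w[i + 1] += t * history[i];
        }

        /* Shift the actual outcome into the global history register. */
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = t;
    }
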
"Thoughts and Comparisons"
Oh dear. The table at the top is dubious, to say the least.
- According to the table, Skylake/Kaby Lake/Broadwell have a 1536-entry L2 ITLB but no L2 DTLB.
That's just wrong. All of these have a 1536-entry *unified* (both data+instruction) TLB.
- Zen: "decode: 4 uops/cycle". The Hot Chips slides very clearly state (multiple times) that it's
a 4 *x86 instructions* per cycle, not 4 uops. It might be 4 uops/cycle when running from the
uop cache, but flagging that as "decode" seems weird.
- Skylake: "decode: 5 uops/cycle" / Broadwell: "decode: 4 uops/cycle". Er, citation needed. Both
of these decode up to 4 x86 instructions (modulo macro-fusion) as well. Skylake might have
higher rate fetching from the uop cache, maybe?
- It says that Zen can dispatch 6 uops/cycle. The Hot Chips slides say up to 6 uops/cycle
*to the integer pipe* plus up to 4 uops/cycle to the FP pipe, which makes 10 total. At this
point in the pipeline, every execution unit has its own queue that it pulls instructions
from. There's essentially no reason to limit dispatch width. I'm going with the Hot Chips
slides here. Nor would it make much sense to build 8-wide retire (as they noted earlier)
when you can dispatch at most 6 uops/cycle.
- Likewise, Skylake/Kaby Lake/Broadwell can all dispatch 8 uops/cycle, not 6 or 4. (Not that
it really matters, since the bottlenecks are elsewhere.)
- Retire rate: it depends on what you count. Retirement is on instructions, but macro-fused
pairs on Intel cores (for the purpose of rate limitations in the core) count as one
instruction, even though they show up as 2 (x86) instructions in the perf counters.
Skylake can retire up to 4 fused-domain operations per cycle, which can be up to 6 x86
instructions (but not 8 - I'd like to know where that number comes from).
- AGU: Skylake and Broadwell are both 2+1 (2 for loads, 1 for stores). 2+2 would make no
sense - the maximum is 2 loads and 1 store per cycle.