Anandtech on Zen.
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/6
"The High-Level Zen Overview" | |
- "Features such as the micro-op cache help most instruction streams improve in performance and bypass parts of potentially | |
long-cycle repetitive operations, but also the larger dispatch, larger retire, larger schedulers and better branch | |
prediction means that higher throughput can be maintained longer and in the fastest order possible." | |
Micro-op caches have nothing to with "bypassing parts of potentially long-cycle repetitive operations" (what does | |
that even mean?). They reduce decode bottlenecks and decrease power consumption. Depending on the implementation, | |
they also may or may not reduce branch misprediction costs. | |
- "The improved branch predictor allows for 2 branches per Branch Target Buffer (BTB), but in the event of tagged | |
instructions will filter through the micro-op cache." | |
It's 2 branches per BTB *entry*. The image at the top of that page has the original slide from AMD that mentions | |
this. I can't parse what the second half of that sentence is trying to tell me. | |
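  To make the "2 branches per BTB entry" point concrete, here's a purely conceptual sketch in C; AMD doesn't
  document the actual BTB layout, so every field name and size below is made up for illustration. The idea is
  just that one tagged entry covers a fetch block and can hold prediction info for up to two branches inside
  it, so a single lookup can predict both:

    #include <stdint.h>

    /* Conceptual only - not AMD's real BTB format. */
    struct btb_branch {
        uint8_t  valid;    /* is this slot in use?                   */
        uint8_t  offset;   /* branch position within the fetch block */
        uint64_t target;   /* predicted target address               */
    };

    struct btb_entry {
        uint64_t          tag;        /* identifies the fetch block this entry covers */
        struct btb_branch branch[2];  /* "2 branches per BTB entry"                   */
    };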
- "On the FP side there are four pipes (compared to three in previous designs) which support combined 128-bit FMAC | |
instructions. These can be combined for one 256-bit AVX, but beyond that it has to be scheduled over multiple | |
instructions." | |
What is "it" and what is being scheduled over multiple instructions? Also, "combining" of the pipes for 256-bit | |
AVX instructions seems unlikely; AVX is very explicitly designed so the 128-bit halves are almost entirely | |
independent, so the far more likely scenario is that 256-bit AVX ops are just split into two independently | |
scheduled 128-bit ops for execution. (E.g. AMDs Jaguar/Puma cores also do this for AVX, and the earlier | |
Bobcat cores used this to to execute 128-bit float ops with a 64-bit wide FP/SIMD execution unit). | |
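  To illustrate why that split is cheap (my example, not from the article): for nearly all AVX operations the
  two 128-bit lanes never interact, so one 256-bit op can be cracked into two independent 128-bit uops. In
  intrinsics terms (these are standard Intel intrinsics; the manual decomposition is only there to show the
  lane independence):

    #include <immintrin.h>

    /* One 256-bit add... */
    __m256 add256(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }

    /* ...computes exactly the same result as two independent 128-bit adds, one
       per lane. A core with 128-bit FP datapaths can therefore execute the
       256-bit op as two separately scheduled 128-bit uops, no cross-lane traffic. */
    __m256 add256_as_two_halves(__m256 a, __m256 b)
    {
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                               _mm256_extractf128_ps(b, 1));
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }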
"Fetch" | |
- "The L1-Instruction Cache can also accept 32 Bytes/cycle from the L2 cache as other instructions are placed through the | |
load/store unit for another cycle around for execution." | |
Word salad again? L1 I$ talks to the (unified) L2. The load/store units don't enter into it, that's data cache. | |
- "The new Stack Engine comes into play between the queue and the dispatch" | |
AMD has had dedicated stack engines (using that exact name) since Bulldozer. They're possibly "new" as in | |
improved, but it's not something new in this microarchitecture generation. (Intel has had them since Core 2). | |
UPDATE: Actually, AMD already had Stack Engines in the K10 - a 10-year old microarchitecture at this point. | |
"Execution, Load/Store, INT and FP Scheduling" | |
- "Only two ALUs are capable of branches, one of the ALUs can perform IMUL operations (signed multiply), and | |
only one can do CRC operations." | |
IMUL in this context means "integer multiplier" (as opposed to FMUL, which is floating-point), not the x86 | |
IMUL instruction. All integer multiplications go through an IMUL pipe, not just signed ones. | |
- "The TLB buffer for the L2 cache for already decoded addresses is two level here, with the L1 TLB supporting | |
64-entry at all page sizes and the L2 TLB going for 1.5K-entry with no 1G pages. The TLB and data pipes are | |
split in this design, which relies on tags to determine if the data is in the cache or to start the data | |
prefetch earlier in the pipeline." | |
The L2 TLB is not a TLB for the level-2 cache, it's a second-level TLB. Both the actual data and the page | |
translation have their own seperate multi-level cache hierarchy. | |
Splitting data and TLB lookup means they use a so-called VIPT (Virtually Indexed, Physically Tagged) cache. | |
That's a fairly standard design, and means the data fetch can start early, before the translated physical | |
address is known. Prefetching doesn't really enter into it. (Intel has been using VIPT for years; so have | |
AMDs Bobcat/Jaguar/Puma cores.) | |
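  Quick worked example of why VIPT lets the lookup start early. The numbers below are for a typical 32 KB,
  8-way, 64 B/line L1 data cache with 4 KB pages - reasonable assumptions, not figures quoted from the article:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Assumed (typical) parameters, not taken from the article. */
        const uint32_t cache_bytes = 32 * 1024;
        const uint32_t line_bytes  = 64;
        const uint32_t ways        = 8;

        uint32_t sets        = cache_bytes / (line_bytes * ways); /* 64         */
        uint32_t offset_bits = 6;                                 /* log2(64)   */
        uint32_t index_bits  = 6;                                 /* log2(sets) */
        uint32_t page_bits   = 12;                                /* log2(4096) */

        /* The set index lives entirely inside the page offset, which is identical
           in the virtual and physical address. So the cache can select its set and
           start reading data + tags from the virtual address alone while the TLB
           translates in parallel; the physical tag is only needed at the end to
           confirm the hit. */
        printf("sets=%u, index+offset bits=%u, page offset bits=%u\n",
               sets, index_bits + offset_bits, page_bits);
        printf("index fits in page offset (plain VIPT works): %s\n",
               (index_bits + offset_bits <= page_bits) ? "yes" : "no");
        return 0;
    }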
- "We have two MUL and two ADD in the FP unit, capable of joining to form two 128-bit FMACs, but not one | |
256-bit AVX." | |
A FMAC (fused-multiply-accumulate) is *not* something you can get by "joining" a FP multiplier and an FP adder. | |
It's different hardware, since a FMA operation only has one rounding step, whereas MUL+ADD rounds twice. | |
Most likely, the MUL units are actually fused-multiply-add units with a multiply-only fast path. (Something | |
like Quinnell's Bridge FMA). I'd like to know how that's realized, but alas, no more details there. | |
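  The single-rounding point is easy to demonstrate in plain C with fma() from math.h. My example values below
  are chosen so the intermediate product a*b isn't exactly representable as a double; compile with FP
  contraction disabled (e.g. -ffp-contract=off on GCC/Clang) so the compiler doesn't itself turn a*b + c into
  an FMA:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1.0p-30;   /* 1 + 2^-30 */
        double b = 1.0 - 0x1.0p-30;   /* 1 - 2^-30 */
        double c = -1.0;

        /* Exact product is 1 - 2^-60, which doesn't fit in a double and rounds to
           1.0. So the separate MUL+ADD yields 0.0 (two roundings), while the fused
           version keeps the exact product internally and yields -2^-60 (one rounding). */
        double mul_add = a * b + c;
        double fused   = fma(a, b, c);

        printf("mul+add = %a\n", mul_add);   /* 0x0p+0   */
        printf("fma     = %a\n", fused);     /* -0x1p-60 */
        return 0;
    }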
"Simultaneous MultiThreading (SMT)" | |
- "The two new instructions are CLZERO and PTE Coalescing." | |
PTE coalescing isn't an instruction. It's an extension in the page table entry format. | |
- "The first, CLZERO, is aimed to clear a cache line and is more aimed at the data center and HPC crowds. | |
This allows a thread to clear a poisoned cache line atomically (in one cycle) in preparation for zero data | |
structures. It also allows a level of repeatability when the cache line is filled with expected data." | |
Word salad again? The purpose of CLZERO is the same as the purpose of equivalent instructions in other | |
ISAs, e.g. PowerPC "dcbz": to make a cache line writeable without requiring the previous contents to be | |
sent from whoever held the cache line previously. That's a bandwidth optimization. Such cache lines are | |
zero-initialized because you need to initialize them to _something_ - else you might leak previous | |
contents of that cache line, which might be data your process isn't supposed to be able to see, e.g. | |
encryption keys recently used by a different processs. | |
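  A minimal sketch of how CLZERO gets used, assuming an AMD core that reports CLZERO support in CPUID and
  64-byte cache lines (real code should check both rather than assume them). The instruction takes its address
  implicitly in rAX and zeroes the whole containing cache line without fetching the old contents first; it's
  also weakly ordered, so real code would likely want a fence before other threads rely on the zeros:

    #include <stddef.h>

    /* Zero the cache line containing p. CLZERO reads its address from rAX. */
    static void clzero_line(void *p)
    {
        __asm__ volatile("clzero" : : "a"(p) : "memory");
    }

    /* Zero a buffer line by line; assumes buf is 64-byte aligned and bytes is a
       multiple of 64, so we touch exactly the lines that belong to the buffer. */
    static void clzero_buffer(void *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64)
            clzero_line((char *)buf + off);
    }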
- "PTE (Page Table Entry) Coalescing is the ability to combine small 4K page tables into 32K page tables, | |
and is a software transparent implementation. This is useful for reducing the number of entries in the | |
TLBs and the queues, but requires certain criteria of the data to be used within the branch predictor | |
to be met." | |
Word salad again; haven't found any detailed description of this yet, but my guess is that there is | |
some way for the OS to tag that a group of 8 aligned, contiguous 4k page table entries maps a | |
physically contiguous 32k region. Which would be a backwards-compatible way to introduce 32k pages | |
(which in turn means fewer TLB misses). I have no idea why the branch predictor is mentioned here, | |
it has nothing to do with any of this. | |
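  Spelling out that guess in code (to be clear: this is my speculation, not a documented mechanism, and the
  PTE field masks and the helper below are made up for illustration): a group of 8 PTEs would presumably be
  coalescable if the group is 32k-aligned on both the virtual and physical side, physically contiguous, and
  uniform in its permission bits.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified, made-up view of a PTE: frame address in the upper bits,
       permission/attribute flags in the low 12 bits. */
    #define PTE_FRAME(pte)  ((pte) & ~0xFFFull)
    #define PTE_FLAGS(pte)  ((pte) &  0xFFFull)

    static bool coalescable_32k(const uint64_t pte[8], uint64_t first_vaddr)
    {
        /* Group must start on a 32k virtual boundary... */
        if (first_vaddr & (32 * 1024 - 1))
            return false;
        /* ...map a 32k-aligned physical region... */
        if (PTE_FRAME(pte[0]) & (32 * 1024 - 1))
            return false;
        /* ...and the remaining 7 entries must continue it contiguously
           with identical flags. */
        for (int i = 1; i < 8; i++) {
            if (PTE_FRAME(pte[i]) != PTE_FRAME(pte[0]) + (uint64_t)i * 4096)
                return false;
            if (PTE_FLAGS(pte[i]) != PTE_FLAGS(pte[0]))
                return false;
        }
        return true;
    }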
"Power, Performance, and Pre-Fetch: AMD SenseMI" | |
- "Every generation of CPUs from the big companies come with promises of better prediction and better | |
pre-fetch models. These are both important to hide latency within a core which might be created by | |
instruction decode, queuing, or more usually, moving data between caches and main memory to be | |
ready for the instructions. With Ryzen, AMD is introducing its new Neural Net Prediction hardware | |
model along with Smart Pre-Fetch." | |
AMD have been using Neural Net branch predictors since at least Bobcat (2011). They didn't make | |
a big deal of it then and do announce it now because NNs weren't a very hot topic then and they | |
are now. Deep Learning fans, sorry to disappoint you, but this is a Perceptron predictor. :) | |
(As per the Hot Chips slides on Zen that they're always showing cropped screenshots of.) | |
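  For reference, here's roughly what a perceptron branch predictor does - a textbook Jimenez/Lin-style sketch,
  not a description of AMD's actual implementation; table size, history length and training threshold are
  arbitrary picks. Predict taken if a weighted sum of recent branch outcomes is non-negative, and adjust the
  weights when the prediction was wrong or the sum was small:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define HIST_LEN    16                            /* global history length        */
    #define NUM_PERCEP  1024                          /* number of weight vectors     */
    #define THRESHOLD   ((int)(1.93 * HIST_LEN + 14)) /* training threshold (Jimenez) */

    static int8_t weights[NUM_PERCEP][HIST_LEN + 1];  /* [0] is the bias weight       */
    static int    history[HIST_LEN];                  /* +1 = taken, -1 = not taken   */

    static int perceptron_sum(uint64_t pc)
    {
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int y = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            y += w[i + 1] * history[i];
        return y;
    }

    static bool predict(uint64_t pc)
    {
        return perceptron_sum(pc) >= 0;
    }

    static void train(uint64_t pc, bool taken)
    {
        int     y = perceptron_sum(pc);
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int     t = taken ? 1 : -1;

        /* Only train on a misprediction or a low-confidence correct prediction.
           (A real implementation would also saturate the weights; omitted here.) */
        if ((y >= 0) != taken || abs(y) <= THRESHOLD) {
            w[0] += t;
            for (int i = 0; i < HIST_LEN; i++)
                w[i + 1] += t * history[i];
        }

        /* Shift the new outcome into the global history register. */
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = t;
    }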
- "For Zen this means two branches can be predicted per cycle (so, one per thread per cycle)" | |
Two branches per cycle yes, one per thread per cycle: unlikely. This part of the pipeline (fetch) | |
usually works on one thread per cycle. And if it was one branch per thread per cycle, there would | |
be little point in storing information about 2 branches per BTB entry (as they misquoted earlier). | |
"Thoughts and Comparisons" | |
Oh dear. The table at the top is dubious, to say the least. | |
- According to the table, Skylake/Kaby Lake/Broadwell have a 1536-entry L2 ITLB but no L2 DTLB.
  That's just wrong. All of these have a 1536-entry *unified* (both data+instruction) TLB.
- Zen: "decode: 4 uops/cycle". The Hot Chips slides very clearly state (multiple times) that it's | |
a 4 *x86 instructions* per cycle, not 4 uops. It might be 4 uops/cycle when running from the | |
uop cache, but flagging that as "decode" seems weird. | |
- Skylake: "decode: 5 uops/cycle" / Broadwell: "decode: 4 uops/cycle". Er, citation needed. Both | |
of these decode up to 4 x86 instructions (modulo macro-fusion) as well. Skylake might have | |
higher rate fetching from the uop cache, maybe? | |
- It says that Zen can dispatch 6 uops/cycle. The Hot Chips slides say up to 6 uops/cycle
  *to the integer pipe* plus up to 4 uops/cycle to the FP pipe, which makes 10 total. At this
  point in the pipeline, every execution unit has its own queue that it pulls instructions
  from. There's essentially no reason to limit dispatch width. I'm going with the Hot Chips
  slides here. Nor would it make much sense to build 8-wide retire (as they noted earlier)
  when you can dispatch at most 6 uops/cycle.
- Likewise, Skylake/Kaby Lake/Broadwell can all dispatch 8 uops/cycle, not 6 or 4. (Not that
  it really matters, since the bottlenecks are elsewhere.)
- Retire rate: it depends on what you count. Retirement is on instructions, but macro-fused
  pairs on Intel cores (for the purpose of rate limitations in the core) count as one
  instruction, even though they show up as 2 (x86) instructions in the perf counters.
  Skylake can retire up to 4 fused-domain operations per cycle, which can be up to 6 x86
  instructions (but not 8 - I'd like to know where that number comes from).
- AGU: Skylake and Broadwell are both 2+1 (2 for loads, 1 for stores). 2+2 would make no
  sense - the maximum is 2 loads and 1 store per cycle.