Anandtech on Zen.
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/6
"The High-Level Zen Overview" | |
- "Features such as the micro-op cache help most instruction streams improve in performance and bypass parts of potentially | |
long-cycle repetitive operations, but also the larger dispatch, larger retire, larger schedulers and better branch | |
prediction means that higher throughput can be maintained longer and in the fastest order possible." | |
Micro-op caches have nothing to with "bypassing parts of potentially long-cycle repetitive operations" (what does | |
that even mean?). They reduce decode bottlenecks and decrease power consumption. Depending on the implementation, | |
they also may or may not reduce branch misprediction costs. | |
- "The improved branch predictor allows for 2 branches per Branch Target Buffer (BTB), but in the event of tagged | |
instructions will filter through the micro-op cache." | |
It's 2 branches per BTB *entry*. The image at the top of that page has the original slide from AMD that mentions | |
this. I can't parse what the second half of that sentence is trying to tell me. | |
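  To make the "2 branches per BTB entry" point concrete, here's a purely conceptual sketch in C; AMD doesn't
  document the actual BTB layout, so every field name and size below is made up for illustration. The idea is
  just that one tagged entry covers a fetch block and can hold prediction info for up to two branches inside
  it, so a single lookup can predict both:

    #include <stdint.h>

    /* Conceptual only - not AMD's real BTB format. */
    struct btb_branch {
        uint8_t  valid;    /* is this slot in use?                   */
        uint8_t  offset;   /* branch position within the fetch block */
        uint64_t target;   /* predicted target address               */
    };

    struct btb_entry {
        uint64_t          tag;        /* identifies the fetch block this entry covers */
        struct btb_branch branch[2];  /* "2 branches per BTB entry"                   */
    };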
- "On the FP side there are four pipes (compared to three in previous designs) which support combined 128-bit FMAC | |
instructions. These can be combined for one 256-bit AVX, but beyond that it has to be scheduled over multiple | |
instructions." | |
What is "it" and what is being scheduled over multiple instructions? Also, "combining" of the pipes for 256-bit | |
AVX instructions seems unlikely; AVX is very explicitly designed so the 128-bit halves are almost entirely | |
independent, so the far more likely scenario is that 256-bit AVX ops are just split into two independently | |
scheduled 128-bit ops for execution. (E.g. AMDs Jaguar/Puma cores also do this for AVX, and the earlier | |
Bobcat cores used this to to execute 128-bit float ops with a 64-bit wide FP/SIMD execution unit). | |
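  To illustrate why that split is cheap (my example, not from the article): for nearly all AVX operations the
  two 128-bit lanes never interact, so one 256-bit op can be cracked into two independent 128-bit uops. In
  intrinsics terms (these are standard Intel intrinsics; the manual decomposition is only there to show the
  lane independence):

    #include <immintrin.h>

    /* One 256-bit add... */
    __m256 add256(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);
    }

    /* ...computes exactly the same result as two independent 128-bit adds, one
       per lane. A core with 128-bit FP datapaths can therefore execute the
       256-bit op as two separately scheduled 128-bit uops, no cross-lane traffic. */
    __m256 add256_as_two_halves(__m256 a, __m256 b)
    {
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                               _mm256_extractf128_ps(b, 1));
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }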
"Fetch" | |
- "The L1-Instruction Cache can also accept 32 Bytes/cycle from the L2 cache as other instructions are placed through the | |
load/store unit for another cycle around for execution." | |
Word salad again? L1 I$ talks to the (unified) L2. The load/store units don't enter into it, that's data cache. | |
- "The new Stack Engine comes into play between the queue and the dispatch" | |
AMD has had dedicated stack engines (using that exact name) since Bulldozer. They're possibly "new" as in | |
improved, but it's not something new in this microarchitecture generation. (Intel has had them since Core 2). | |
UPDATE: Actually, AMD already had Stack Engines in the K10 - a 10-year old microarchitecture at this point. | |
"Execution, Load/Store, INT and FP Scheduling" | |
- "Only two ALUs are capable of branches, one of the ALUs can perform IMUL operations (signed multiply), and | |
only one can do CRC operations." | |
IMUL in this context means "integer multiplier" (as opposed to FMUL, which is floating-point), not the x86 | |
IMUL instruction. All integer multiplications go through an IMUL pipe, not just signed ones. | |
- "The TLB buffer for the L2 cache for already decoded addresses is two level here, with the L1 TLB supporting | |
64-entry at all page sizes and the L2 TLB going for 1.5K-entry with no 1G pages. The TLB and data pipes are | |
split in this design, which relies on tags to determine if the data is in the cache or to start the data | |
prefetch earlier in the pipeline." | |
The L2 TLB is not a TLB for the level-2 cache, it's a second-level TLB. Both the actual data and the page | |
translation have their own seperate multi-level cache hierarchy. | |
Splitting data and TLB lookup means they use a so-called VIPT (Virtually Indexed, Physically Tagged) cache. | |
That's a fairly standard design, and means the data fetch can start early, before the translated physical | |
address is known. Prefetching doesn't really enter into it. (Intel has been using VIPT for years; so have | |
AMDs Bobcat/Jaguar/Puma cores.) | |
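  Quick worked example of why VIPT lets the lookup start early. The numbers below are for a typical 32 KB,
  8-way, 64 B/line L1 data cache with 4 KB pages - reasonable assumptions, not figures quoted from the article:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Assumed (typical) parameters, not taken from the article. */
        const uint32_t cache_bytes = 32 * 1024;
        const uint32_t line_bytes  = 64;
        const uint32_t ways        = 8;

        uint32_t sets        = cache_bytes / (line_bytes * ways); /* 64         */
        uint32_t offset_bits = 6;                                 /* log2(64)   */
        uint32_t index_bits  = 6;                                 /* log2(sets) */
        uint32_t page_bits   = 12;                                /* log2(4096) */

        /* The set index lives entirely inside the page offset, which is identical
           in the virtual and physical address. So the cache can select its set and
           start reading data + tags from the virtual address alone while the TLB
           translates in parallel; the physical tag is only needed at the end to
           confirm the hit. */
        printf("sets=%u, index+offset bits=%u, page offset bits=%u\n",
               sets, index_bits + offset_bits, page_bits);
        printf("index fits in page offset (plain VIPT works): %s\n",
               (index_bits + offset_bits <= page_bits) ? "yes" : "no");
        return 0;
    }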
- "We have two MUL and two ADD in the FP unit, capable of joining to form two 128-bit FMACs, but not one | |
256-bit AVX." | |
A FMAC (fused-multiply-accumulate) is *not* something you can get by "joining" a FP multiplier and an FP adder. | |
It's different hardware, since a FMA operation only has one rounding step, whereas MUL+ADD rounds twice. | |
Most likely, the MUL units are actually fused-multiply-add units with a multiply-only fast path. (Something | |
like Quinnell's Bridge FMA). I'd like to know how that's realized, but alas, no more details there. | |
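  The single-rounding point is easy to demonstrate in plain C with fma() from math.h. My example values below
  are chosen so the intermediate product a*b isn't exactly representable as a double; compile with FP
  contraction disabled (e.g. -ffp-contract=off on GCC/Clang) so the compiler doesn't itself turn a*b + c into
  an FMA:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1.0p-30;   /* 1 + 2^-30 */
        double b = 1.0 - 0x1.0p-30;   /* 1 - 2^-30 */
        double c = -1.0;

        /* Exact product is 1 - 2^-60, which doesn't fit in a double and rounds to
           1.0. So the separate MUL+ADD yields 0.0 (two roundings), while the fused
           version keeps the exact product internally and yields -2^-60 (one rounding). */
        double mul_add = a * b + c;
        double fused   = fma(a, b, c);

        printf("mul+add = %a\n", mul_add);   /* 0x0p+0   */
        printf("fma     = %a\n", fused);     /* -0x1p-60 */
        return 0;
    }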
"Simultaneous MultiThreading (SMT)" | |
- "The two new instructions are CLZERO and PTE Coalescing." | |
PTE coalescing isn't an instruction. It's an extension in the page table entry format. | |
- "The first, CLZERO, is aimed to clear a cache line and is more aimed at the data center and HPC crowds. | |
This allows a thread to clear a poisoned cache line atomically (in one cycle) in preparation for zero data | |
structures. It also allows a level of repeatability when the cache line is filled with expected data." | |
Word salad again? The purpose of CLZERO is the same as the purpose of equivalent instructions in other | |
ISAs, e.g. PowerPC "dcbz": to make a cache line writeable without requiring the previous contents to be | |
sent from whoever held the cache line previously. That's a bandwidth optimization. Such cache lines are | |
zero-initialized because you need to initialize them to _something_ - else you might leak previous | |
contents of that cache line, which might be data your process isn't supposed to be able to see, e.g. | |
encryption keys recently used by a different processs. | |
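  A minimal sketch of how CLZERO gets used, assuming an AMD core that reports CLZERO support in CPUID and
  64-byte cache lines (real code should check both rather than assume them). The instruction takes its address
  implicitly in rAX and zeroes the whole containing cache line without fetching the old contents first; it's
  also weakly ordered, so real code would likely want a fence before other threads rely on the zeros:

    #include <stddef.h>

    /* Zero the cache line containing p. CLZERO reads its address from rAX. */
    static void clzero_line(void *p)
    {
        __asm__ volatile("clzero" : : "a"(p) : "memory");
    }

    /* Zero a buffer line by line; assumes buf is 64-byte aligned and bytes is a
       multiple of 64, so we touch exactly the lines that belong to the buffer. */
    static void clzero_buffer(void *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64)
            clzero_line((char *)buf + off);
    }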
- "PTE (Page Table Entry) Coalescing is the ability to combine small 4K page tables into 32K page tables, | |
and is a software transparent implementation. This is useful for reducing the number of entries in the | |
TLBs and the queues, but requires certain criteria of the data to be used within the branch predictor | |
to be met." | |
Word salad again; haven't found any detailed description of this yet, but my guess is that there is | |
some way for the OS to tag that a group of 8 aligned, contiguous 4k page table entries maps a | |
physically contiguous 32k region. Which would be a backwards-compatible way to introduce 32k pages | |
(which in turn means fewer TLB misses). I have no idea why the branch predictor is mentioned here, | |
it has nothing to do with any of this. | |
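  Spelling out that guess in code (to be clear: this is my speculation, not a documented mechanism, and the
  PTE field masks and the helper below are made up for illustration): a group of 8 PTEs would presumably be
  coalescable if the group is 32k-aligned on both the virtual and physical side, physically contiguous, and
  uniform in its permission bits.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified, made-up view of a PTE: frame address in the upper bits,
       permission/attribute flags in the low 12 bits. */
    #define PTE_FRAME(pte)  ((pte) & ~0xFFFull)
    #define PTE_FLAGS(pte)  ((pte) &  0xFFFull)

    static bool coalescable_32k(const uint64_t pte[8], uint64_t first_vaddr)
    {
        /* Group must start on a 32k virtual boundary... */
        if (first_vaddr & (32 * 1024 - 1))
            return false;
        /* ...map a 32k-aligned physical region... */
        if (PTE_FRAME(pte[0]) & (32 * 1024 - 1))
            return false;
        /* ...and the remaining 7 entries must continue it contiguously
           with identical flags. */
        for (int i = 1; i < 8; i++) {
            if (PTE_FRAME(pte[i]) != PTE_FRAME(pte[0]) + (uint64_t)i * 4096)
                return false;
            if (PTE_FLAGS(pte[i]) != PTE_FLAGS(pte[0]))
                return false;
        }
        return true;
    }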
"Power, Performance, and Pre-Fetch: AMD SenseMI" | |
- "Every generation of CPUs from the big companies come with promises of better prediction and better | |
pre-fetch models. These are both important to hide latency within a core which might be created by | |
instruction decode, queuing, or more usually, moving data between caches and main memory to be | |
ready for the instructions. With Ryzen, AMD is introducing its new Neural Net Prediction hardware | |
model along with Smart Pre-Fetch." | |
AMD have been using Neural Net branch predictors since at least Bobcat (2011). They didn't make | |
a big deal of it then and do announce it now because NNs weren't a very hot topic then and they | |
are now. Deep Learning fans, sorry to disappoint you, but this is a Perceptron predictor. :) | |
(As per the Hot Chips slides on Zen that they're always showing cropped screenshots of.) | |
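  For reference, here's roughly what a perceptron branch predictor does - a textbook Jimenez/Lin-style sketch,
  not a description of AMD's actual implementation; table size, history length and training threshold are
  arbitrary picks. Predict taken if a weighted sum of recent branch outcomes is non-negative, and adjust the
  weights when the prediction was wrong or the sum was small:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define HIST_LEN    16                            /* global history length        */
    #define NUM_PERCEP  1024                          /* number of weight vectors     */
    #define THRESHOLD   ((int)(1.93 * HIST_LEN + 14)) /* training threshold (Jimenez) */

    static int8_t weights[NUM_PERCEP][HIST_LEN + 1];  /* [0] is the bias weight       */
    static int    history[HIST_LEN];                  /* +1 = taken, -1 = not taken   */

    static int perceptron_sum(uint64_t pc)
    {
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int y = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            y += w[i + 1] * history[i];
        return y;
    }

    static bool predict(uint64_t pc)
    {
        return perceptron_sum(pc) >= 0;
    }

    static void train(uint64_t pc, bool taken)
    {
        int     y = perceptron_sum(pc);
        int8_t *w = weights[(pc >> 2) % NUM_PERCEP];
        int     t = taken ? 1 : -1;

        /* Only train on a misprediction or a low-confidence correct prediction.
           (A real implementation would also saturate the weights; omitted here.) */
        if ((y >= 0) != taken || abs(y) <= THRESHOLD) {
            w[0] += t;
            for (int i = 0; i < HIST_LEN; i++)
                w[i + 1] += t * history[i];
        }

        /* Shift the new outcome into the global history register. */
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = t;
    }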
- "For Zen this means two branches can be predicted per cycle (so, one per thread per cycle)" | |
Two branches per cycle yes, one per thread per cycle: unlikely. This part of the pipeline (fetch) | |
usually works on one thread per cycle. And if it was one branch per thread per cycle, there would | |
be little point in storing information about 2 branches per BTB entry (as they misquoted earlier). | |
"Thoughts and Comparisons" | |
Oh dear. The table at the top is dubious, to say the least. | |
- According to the table, Skylake/Kaby Lake/Broadwell have a 1536-entry L2 ITLB but no L2 DTLB.
  That's just wrong. All of these have a 1536-entry *unified* (both data+instruction) TLB.
- Zen: "decode: 4 uops/cycle". The Hot Chips slides very clearly state (multiple times) that it's | |
a 4 *x86 instructions* per cycle, not 4 uops. It might be 4 uops/cycle when running from the | |
uop cache, but flagging that as "decode" seems weird. | |
- Skylake: "decode: 5 uops/cycle" / Broadwell: "decode: 4 uops/cycle". Er, citation needed. Both | |
of these decode up to 4 x86 instructions (modulo macro-fusion) as well. Skylake might have | |
higher rate fetching from the uop cache, maybe? | |
- It says that Zen can dispatch 6 uops/cycle. The Hot Chips slides say up to 6 uops/cycle
  *to the integer pipe* plus up to 4 uops/cycle to the FP pipe, which makes 10 total. At this
  point in the pipeline, every execution unit has its own queue that it pulls instructions
  from. There's essentially no reason to limit dispatch width. I'm going with the Hot Chips
  slides here. Nor would it make much sense to build 8-wide retire (as they noted earlier)
  when you can dispatch at most 6 uops/cycle.
- Likewise, Skylake/Kaby Lake/Broadwell can all dispatch 8 uops/cycle, not 6 or 4. (Not that
  it really matters, since the bottlenecks are elsewhere.)
- Retire rate: it depends on what you count. Retirement is on instructions, but macro-fused
  pairs on Intel cores (for the purpose of rate limitations in the core) count as one
  instruction, even though they show up as 2 (x86) instructions in the perf counters.
  Skylake can retire up to 4 fused-domain operations per cycle, which can be up to 6 x86
  instructions (but not 8 - I'd like to know where that number comes from).
- AGU: Skylake and Broadwell are both 2+1 (2 for loads, 1 for stores). 2+2 would make no
  sense - the maximum is 2 loads and 1 store per cycle.