@Chubek
Last active October 12, 2025 10:02
x86-64 Dossier
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<!-- KaTeX CSS -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-n8MVd4RsNIU0tAv4ct0nTaAbDJwPJzDEaqSD1odI+WdtXRGWt2kTvGFasHpSy3SV" crossorigin="anonymous">
<!-- KaTeX JavaScript -->
<title>Dossier - x86asm-dossier</title>
<style>
:root {
/* Color palette: Tokyo Night base tones with Dracula accents */
--bg-color: #1a1b26;
--current-line: #24283b;
--selection: #364a82;
--foreground: #c0caf5;
--comment: #565f89;
--cyan: #8be9fd;
--green: #50fa7b;
--orange: #ffb86c;
--pink: #ff79c6;
--purple: #bd93f9;
--red: #ff5555;
--yellow: #f1fa8c;
/* Semantic colors */
--text-color: var(--foreground);
--heading-color: var(--purple);
--link-color: var(--cyan);
--link-hover: var(--pink);
--code-bg: var(--current-line);
--code-border: #6272a4;
--toc-bg: #21222c;
--toc-border: var(--current-line);
--blockquote-border: var(--purple);
--table-border: var(--current-line);
}
* {
box-sizing: border-box;
}
body {
font-family: Georgia, 'Times New Roman', serif;
line-height: 1.7;
max-width: 1400px;
margin: 0 auto;
padding: 20px;
background-color: var(--bg-color);
color: var(--text-color);
display: flex;
gap: 2rem;
}
/* Table of Contents */
#TOC {
position: sticky;
top: 20px;
min-width: 250px;
max-width: 300px;
height: fit-content;
max-height: calc(100vh - 40px);
overflow-y: auto;
background-color: var(--toc-bg);
border: 1px solid var(--toc-border);
border-radius: 8px;
padding: 1.5rem;
font-size: 0.9rem;
}
#TOC ul {
list-style: none;
padding-left: 0;
margin: 0;
}
#TOC > ul > li {
margin-bottom: 0.5rem;
}
#TOC ul ul {
padding-left: 1rem;
margin-top: 0.3rem;
}
#TOC a {
color: var(--link-color);
text-decoration: none;
display: block;
padding: 0.2rem 0;
transition: all 0.2s;
}
#TOC a:hover {
color: var(--link-hover);
transform: translateX(4px);
}
/* Main content */
main {
flex: 1;
min-width: 0;
max-width: 800px;
}
h1, h2, h3, h4, h5, h6 {
color: var(--heading-color);
margin-top: 2rem;
margin-bottom: 1rem;
line-height: 1.3;
font-weight: 600;
}
h1 {
font-size: 2.5rem;
border-bottom: 2px solid var(--purple);
padding-bottom: 0.5rem;
color: var(--pink);
}
h2 {
font-size: 2rem;
border-bottom: 1px solid var(--current-line);
padding-bottom: 0.3rem;
}
h3 { font-size: 1.5rem; color: var(--cyan); }
h4 { font-size: 1.25rem; color: var(--green); }
h5 { font-size: 1.1rem; color: var(--orange); }
h6 { font-size: 1rem; color: var(--yellow); }
a {
color: var(--link-color);
text-decoration: none;
transition: color 0.2s;
}
a:hover {
color: var(--link-hover);
text-decoration: underline;
}
/* Code blocks */
pre {
background-color: var(--code-bg);
border: 1px solid var(--code-border);
border-radius: 6px;
padding: 1rem;
overflow-x: auto;
font-size: 0.9rem;
}
code {
font-family: 'Consolas', 'Monaco', 'Courier New', monospace;
background-color: var(--code-bg);
padding: 0.2rem 0.4rem;
border-radius: 3px;
font-size: 0.9em;
color: var(--pink);
}
pre code {
background-color: transparent;
padding: 0;
color: var(--foreground);
}
/* Blockquotes */
blockquote {
margin: 1.5rem 0;
padding: 1rem;
padding-left: 1.5rem;
border-left: 4px solid var(--blockquote-border);
background-color: rgba(189, 147, 249, 0.1);
color: var(--foreground);
font-style: italic;
border-radius: 0 6px 6px 0;
}
blockquote p {
margin: 0.5rem 0;
}
/* Tables */
table {
border-collapse: collapse;
width: 100%;
margin: 1.5rem 0;
}
table th,
table td {
border: 1px solid var(--table-border);
padding: 0.75rem;
text-align: left;
}
table th {
background-color: var(--code-bg);
color: var(--purple);
font-weight: bold;
}
table tr:nth-child(even) {
background-color: rgba(68, 71, 90, 0.3);
}
table tr:hover {
background-color: rgba(68, 71, 90, 0.5);
}
/* Images */
img {
max-width: 100%;
height: auto;
border-radius: 6px;
border: 1px solid var(--current-line);
}
/* Lists */
ul, ol {
margin: 1rem 0;
padding-left: 2rem;
}
li {
margin: 0.5rem 0;
}
li::marker {
color: var(--purple);
}
/* Horizontal rule */
hr {
border: none;
border-top: 2px solid var(--current-line);
margin: 2rem 0;
}
/* Inline emphasis */
strong {
color: var(--orange);
font-weight: bold;
}
em {
color: var(--yellow);
font-style: italic;
}
/* Selection */
::selection {
background-color: var(--selection);
color: var(--foreground);
}
/* Responsive design */
@media (max-width: 900px) {
body {
flex-direction: column;
}
#TOC {
position: static;
max-width: 100%;
max-height: 300px;
margin-bottom: 2rem;
}
main {
max-width: 100%;
}
}
/* Scrollbar styling */
::-webkit-scrollbar {
width: 12px;
}
::-webkit-scrollbar-track {
background: var(--bg-color);
}
::-webkit-scrollbar-thumb {
background: var(--current-line);
border-radius: 6px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--comment);
}
/* Title block */
#title-block-header {
margin-bottom: 3rem;
padding-bottom: 1rem;
border-bottom: 2px solid var(--purple);
}
#title-block-header .title {
margin-top: 0;
color: var(--pink);
}
#title-block-header .subtitle {
color: var(--purple);
font-size: 1.3rem;
margin: 0.5rem 0;
}
#title-block-header .author {
color: var(--cyan);
margin: 0.3rem 0;
}
#title-block-header .date {
color: var(--comment);
font-size: 0.9rem;
margin: 0.3rem 0;
}
</style>
<style>
html { -webkit-text-size-adjust: 100%; }
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ background-color: #f8f8f8; }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ef2929; } /* Alert */
code span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #204a87; } /* Attribute */
code span.bn { color: #0000cf; } /* BaseN */
code span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4e9a06; } /* Char */
code span.cn { color: #8f5902; } /* Constant */
code span.co { color: #8f5902; font-style: italic; } /* Comment */
code span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */
code span.dt { color: #204a87; } /* DataType */
code span.dv { color: #0000cf; } /* DecVal */
code span.er { color: #a40000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #0000cf; } /* Float */
code span.fu { color: #204a87; font-weight: bold; } /* Function */
code span.im { } /* Import */
code span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #204a87; font-weight: bold; } /* Keyword */
code span.op { color: #ce5c00; font-weight: bold; } /* Operator */
code span.ot { color: #8f5902; } /* Other */
code span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */
code span.sc { color: #ce5c00; font-weight: bold; } /* SpecialChar */
code span.ss { color: #4e9a06; } /* SpecialString */
code span.st { color: #4e9a06; } /* String */
code span.va { color: #000000; } /* Variable */
code span.vs { color: #4e9a06; } /* VerbatimString */
code span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */
</style>
</head>
<body>
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#chapter-1-introduction-to-x86-64-architecture"
id="toc-chapter-1-introduction-to-x86-64-architecture"><strong>Chapter
1: Introduction to x86-64 Architecture</strong></a>
<ul>
<li><a href="#evolution-from-8086-to-x86-64"
id="toc-evolution-from-8086-to-x86-64"><strong>1.1 Evolution from 8086
to x86-64</strong></a>
<ul>
<li><a href="#the-journey-from-16-bit-to-64-bit"
id="toc-the-journey-from-16-bit-to-64-bit"><strong>The Journey from
16-bit to 64-bit</strong></a></li>
<li><a href="#the-32-bit-revolution-80386-and-ia-32"
id="toc-the-32-bit-revolution-80386-and-ia-32"><strong>The 32-bit
Revolution: 80386 and IA-32</strong></a></li>
<li><a href="#the-64-bit-extension-amd64-and-intel-64"
id="toc-the-64-bit-extension-amd64-and-intel-64"><strong>The 64-bit
Extension: AMD64 and Intel 64</strong></a></li>
<li><a href="#compiler-perspective-evolutionary-complexity"
id="toc-compiler-perspective-evolutionary-complexity"><strong>Compiler
Perspective: Evolutionary Complexity</strong></a></li>
</ul></li>
<li><a href="#x86-64-execution-environment-and-modes"
id="toc-x86-64-execution-environment-and-modes"><strong>1.2 x86-64
Execution Environment and Modes</strong></a>
<ul>
<li><a href="#operating-modes"
id="toc-operating-modes"><strong>Operating Modes</strong></a></li>
<li><a href="#execution-state"
id="toc-execution-state"><strong>Execution State</strong></a></li>
<li><a href="#privilege-levels-and-protection"
id="toc-privilege-levels-and-protection"><strong>Privilege Levels and
Protection</strong></a></li>
</ul></li>
<li><a
href="#register-architecture-general-purpose-segment-and-system-registers"
id="toc-register-architecture-general-purpose-segment-and-system-registers"><strong>1.3
Register Architecture: General Purpose, Segment, and System
Registers</strong></a>
<ul>
<li><a href="#general-purpose-registers"
id="toc-general-purpose-registers"><strong>General-Purpose
Registers</strong></a></li>
<li><a href="#special-purpose-registers"
id="toc-special-purpose-registers"><strong>Special-Purpose
Registers</strong></a></li>
<li><a href="#segment-registers-in-64-bit-mode"
id="toc-segment-registers-in-64-bit-mode"><strong>Segment Registers in
64-bit Mode</strong></a></li>
<li><a href="#control-registers"
id="toc-control-registers"><strong>Control Registers</strong></a></li>
<li><a href="#model-specific-registers-msrs"
id="toc-model-specific-registers-msrs"><strong>Model-Specific Registers
(MSRs)</strong></a></li>
<li><a href="#compiler-register-usage-conventions"
id="toc-compiler-register-usage-conventions"><strong>Compiler Register
Usage Conventions</strong></a></li>
</ul></li>
<li><a href="#memory-models-and-addressing"
id="toc-memory-models-and-addressing"><strong>1.4 Memory Models and
Addressing</strong></a>
<ul>
<li><a href="#virtual-address-space"
id="toc-virtual-address-space"><strong>Virtual Address
Space</strong></a></li>
<li><a href="#memory-segmentation-in-64-bit-mode"
id="toc-memory-segmentation-in-64-bit-mode"><strong>Memory Segmentation
in 64-bit Mode</strong></a></li>
<li><a href="#addressing-modes"
id="toc-addressing-modes"><strong>Addressing Modes</strong></a></li>
</ul></li>
</ul></li>
<li><a
href="#chapter-2-x86-64-instruction-set-architecture-fundamentals"
id="toc-chapter-2-x86-64-instruction-set-architecture-fundamentals"><strong>Chapter
2: x86-64 Instruction Set Architecture Fundamentals</strong></a>
<ul>
<li><a href="#instruction-format-and-prefixes-rex-vex-evex"
id="toc-instruction-format-and-prefixes-rex-vex-evex"><strong>2.1
Instruction Format and Prefixes (REX, VEX, EVEX)</strong></a>
<ul>
<li><a href="#basic-instruction-format"
id="toc-basic-instruction-format"><strong>Basic Instruction
Format</strong></a></li>
<li><a href="#legacy-prefixes" id="toc-legacy-prefixes"><strong>Legacy
Prefixes</strong></a></li>
<li><a href="#rex-prefix" id="toc-rex-prefix"><strong>REX
Prefix</strong></a></li>
<li><a href="#vex-prefix-avx" id="toc-vex-prefix-avx"><strong>VEX Prefix
(AVX)</strong></a></li>
<li><a href="#evex-prefix-avx-512"
id="toc-evex-prefix-avx-512"><strong>EVEX Prefix
(AVX-512)</strong></a></li>
<li><a href="#compiler-encoding-decisions"
id="toc-compiler-encoding-decisions"><strong>Compiler Encoding
Decisions</strong></a></li>
</ul></li>
<li><a href="#data-movement-instructions"
id="toc-data-movement-instructions"><strong>2.2 Data Movement
Instructions</strong></a>
<ul>
<li><a href="#basic-move-instructions"
id="toc-basic-move-instructions"><strong>Basic Move
Instructions</strong></a></li>
<li><a href="#zero-and-sign-extension"
id="toc-zero-and-sign-extension"><strong>Zero and Sign
Extension</strong></a></li>
<li><a href="#conditional-moves"
id="toc-conditional-moves"><strong>Conditional Moves</strong></a></li>
<li><a href="#special-data-movement"
id="toc-special-data-movement"><strong>Special Data
Movement</strong></a></li>
<li><a href="#compiler-optimization-patterns"
id="toc-compiler-optimization-patterns"><strong>Compiler Optimization
Patterns</strong></a></li>
</ul></li>
<li><a href="#arithmetic-and-logic-operations"
id="toc-arithmetic-and-logic-operations"><strong>2.3 Arithmetic and
Logic Operations</strong></a>
<ul>
<li><a href="#integer-arithmetic"
id="toc-integer-arithmetic"><strong>Integer Arithmetic</strong></a></li>
<li><a href="#logical-operations"
id="toc-logical-operations"><strong>Logical Operations</strong></a></li>
<li><a href="#flag-manipulation" id="toc-flag-manipulation"><strong>Flag
Manipulation</strong></a></li>
</ul></li>
<li><a href="#bit-manipulation-and-shifts"
id="toc-bit-manipulation-and-shifts"><strong>2.4 Bit Manipulation and
Shifts</strong></a>
<ul>
<li><a href="#shift-operations" id="toc-shift-operations"><strong>Shift
Operations</strong></a></li>
<li><a href="#bit-scanning-and-manipulation"
id="toc-bit-scanning-and-manipulation"><strong>Bit Scanning and
Manipulation</strong></a></li>
<li><a href="#compiler-bit-manipulation-patterns"
id="toc-compiler-bit-manipulation-patterns"><strong>Compiler Bit
Manipulation Patterns</strong></a></li>
</ul></li>
<li><a href="#control-flow-branches-loops-and-calls"
id="toc-control-flow-branches-loops-and-calls"><strong>2.5 Control Flow:
Branches, Loops, and Calls</strong></a>
<ul>
<li><a href="#unconditional-jumps"
id="toc-unconditional-jumps"><strong>Unconditional
Jumps</strong></a></li>
<li><a href="#conditional-branches"
id="toc-conditional-branches"><strong>Conditional
Branches</strong></a></li>
<li><a href="#loop-instructions" id="toc-loop-instructions"><strong>Loop
Instructions</strong></a></li>
<li><a href="#compiler-control-flow-patterns"
id="toc-compiler-control-flow-patterns"><strong>Compiler Control Flow
Patterns</strong></a></li>
<li><a href="#branch-prediction-considerations"
id="toc-branch-prediction-considerations"><strong>Branch Prediction
Considerations</strong></a></li>
</ul></li>
<li><a href="#string-operations" id="toc-string-operations"><strong>2.6
String Operations</strong></a>
<ul>
<li><a href="#basic-string-instructions"
id="toc-basic-string-instructions"><strong>Basic String
Instructions</strong></a></li>
<li><a href="#rep-prefixes" id="toc-rep-prefixes"><strong>REP
Prefixes</strong></a></li>
<li><a href="#optimized-string-operations"
id="toc-optimized-string-operations"><strong>Optimized String
Operations</strong></a></li>
<li><a href="#compiler-string-intrinsics"
id="toc-compiler-string-intrinsics"><strong>Compiler String
Intrinsics</strong></a></li>
</ul></li>
<li><a href="#compiler-perspective-instruction-selection-patterns"
id="toc-compiler-perspective-instruction-selection-patterns"><strong>2.7
Compiler Perspective: Instruction Selection Patterns</strong></a>
<ul>
<li><a href="#instruction-selection-overview"
id="toc-instruction-selection-overview"><strong>Instruction Selection
Overview</strong></a></li>
<li><a href="#common-optimization-patterns"
id="toc-common-optimization-patterns"><strong>Common Optimization
Patterns</strong></a></li>
<li><a href="#peephole-optimizations"
id="toc-peephole-optimizations"><strong>Peephole
Optimizations</strong></a></li>
<li><a href="#code-generation-examples"
id="toc-code-generation-examples"><strong>Code Generation
Examples</strong></a></li>
<li><a href="#compiler-instruction-costs"
id="toc-compiler-instruction-costs"><strong>Compiler Instruction
Costs</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-3-memory-architecture-and-addressing-modes"
id="toc-chapter-3-memory-architecture-and-addressing-modes"><strong>Chapter
3: Memory Architecture and Addressing Modes</strong></a>
<ul>
<li><a href="#x86-64-memory-organization"
id="toc-x86-64-memory-organization"><strong>3.1 x86-64 Memory
Organization</strong></a>
<ul>
<li><a href="#virtual-address-space-layout"
id="toc-virtual-address-space-layout"><strong>Virtual Address Space
Layout</strong></a></li>
<li><a href="#memory-segmentation-in-64-bit-mode-1"
id="toc-memory-segmentation-in-64-bit-mode-1"><strong>Memory
Segmentation in 64-bit Mode</strong></a></li>
<li><a href="#page-table-structure"
id="toc-page-table-structure"><strong>Page Table
Structure</strong></a></li>
<li><a href="#memory-types-and-caching"
id="toc-memory-types-and-caching"><strong>Memory Types and
Caching</strong></a></li>
</ul></li>
<li><a href="#complex-addressing-modes"
id="toc-complex-addressing-modes"><strong>3.2 Complex Addressing
Modes</strong></a>
<ul>
<li><a href="#general-addressing-mode-format"
id="toc-general-addressing-mode-format"><strong>General Addressing Mode
Format</strong></a></li>
<li><a href="#addressing-mode-examples"
id="toc-addressing-mode-examples"><strong>Addressing Mode
Examples</strong></a></li>
<li><a href="#rip-relative-addressing"
id="toc-rip-relative-addressing"><strong>RIP-Relative
Addressing</strong></a></li>
<li><a href="#addressing-mode-encoding"
id="toc-addressing-mode-encoding"><strong>Addressing Mode
Encoding</strong></a></li>
</ul></li>
<li><a href="#memory-access-patterns-and-optimization"
id="toc-memory-access-patterns-and-optimization"><strong>3.3 Memory
Access Patterns and Optimization</strong></a>
<ul>
<li><a href="#cache-friendly-access-patterns"
id="toc-cache-friendly-access-patterns"><strong>Cache-Friendly Access
Patterns</strong></a></li>
<li><a href="#prefetching"
id="toc-prefetching"><strong>Prefetching</strong></a></li>
<li><a href="#non-temporal-memory-access"
id="toc-non-temporal-memory-access"><strong>Non-Temporal Memory
Access</strong></a></li>
</ul></li>
<li><a href="#stack-operations-and-management"
id="toc-stack-operations-and-management"><strong>3.4 Stack Operations
and Management</strong></a>
<ul>
<li><a href="#stack-frame-layout"
id="toc-stack-frame-layout"><strong>Stack Frame Layout</strong></a></li>
<li><a href="#stack-alignment" id="toc-stack-alignment"><strong>Stack
Alignment</strong></a></li>
<li><a href="#red-zone" id="toc-red-zone"><strong>Red
Zone</strong></a></li>
</ul></li>
<li><a href="#memory-barriers-and-atomics"
id="toc-memory-barriers-and-atomics"><strong>3.5 Memory Barriers and
Atomics</strong></a>
<ul>
<li><a href="#memory-ordering" id="toc-memory-ordering"><strong>Memory
Ordering</strong></a></li>
<li><a href="#atomic-operations"
id="toc-atomic-operations"><strong>Atomic Operations</strong></a></li>
<li><a href="#transactional-memory-tsx"
id="toc-transactional-memory-tsx"><strong>Transactional Memory
(TSX)</strong></a></li>
</ul></li>
<li><a href="#effective-address-calculation-lea"
id="toc-effective-address-calculation-lea"><strong>3.6 Effective Address
Calculation (LEA)</strong></a>
<ul>
<li><a href="#lea-instruction-capabilities"
id="toc-lea-instruction-capabilities"><strong>LEA Instruction
Capabilities</strong></a></li>
<li><a href="#compiler-lea-patterns"
id="toc-compiler-lea-patterns"><strong>Compiler LEA
Patterns</strong></a></li>
<li><a href="#lea-vs-other-instructions"
id="toc-lea-vs-other-instructions"><strong>LEA vs Other
Instructions</strong></a></li>
</ul></li>
<li><a href="#compiler-memory-optimization-strategies"
id="toc-compiler-memory-optimization-strategies"><strong>3.7 Compiler
Memory Optimization Strategies</strong></a>
<ul>
<li><a href="#structure-layout-and-padding"
id="toc-structure-layout-and-padding"><strong>Structure Layout and
Padding</strong></a></li>
<li><a href="#loop-optimization-and-memory-access"
id="toc-loop-optimization-and-memory-access"><strong>Loop Optimization
and Memory Access</strong></a></li>
<li><a href="#alias-analysis-and-optimization"
id="toc-alias-analysis-and-optimization"><strong>Alias Analysis and
Optimization</strong></a></li>
<li><a href="#memory-access-coalescing"
id="toc-memory-access-coalescing"><strong>Memory Access
Coalescing</strong></a></li>
<li><a href="#summary-and-key-takeaways"
id="toc-summary-and-key-takeaways"><strong>Summary and Key
Takeaways</strong></a></li>
<li><a href="#looking-ahead" id="toc-looking-ahead"><strong>Looking
Ahead</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-4-stack-operations-and-calling-conventions"
id="toc-chapter-4-stack-operations-and-calling-conventions"><strong>Chapter
4: Stack Operations and Calling Conventions</strong></a>
<ul>
<li><a href="#stack-architecture-fundamentals"
id="toc-stack-architecture-fundamentals"><strong>4.1 Stack Architecture
Fundamentals</strong></a>
<ul>
<li><a href="#stack-layout-and-growth-direction"
id="toc-stack-layout-and-growth-direction"><strong>Stack Layout and
Growth Direction</strong></a></li>
<li><a href="#stack-pointer-alignment-requirements"
id="toc-stack-pointer-alignment-requirements"><strong>Stack Pointer
Alignment Requirements</strong></a></li>
<li><a href="#stack-frame-structure"
id="toc-stack-frame-structure"><strong>Stack Frame
Structure</strong></a></li>
</ul></li>
<li><a href="#system-v-amd64-abi"
id="toc-system-v-amd64-abi"><strong>4.2 System V AMD64 ABI</strong></a>
<ul>
<li><a href="#register-usage-convention"
id="toc-register-usage-convention"><strong>Register Usage
Convention</strong></a></li>
<li><a href="#function-calling-examples"
id="toc-function-calling-examples"><strong>Function Calling
Examples</strong></a></li>
<li><a href="#floating-point-and-mixed-arguments"
id="toc-floating-point-and-mixed-arguments"><strong>Floating-Point and
Mixed Arguments</strong></a></li>
<li><a href="#red-zone-usage" id="toc-red-zone-usage"><strong>Red Zone
Usage</strong></a></li>
<li><a href="#variable-argument-functions"
id="toc-variable-argument-functions"><strong>Variable Argument
Functions</strong></a></li>
</ul></li>
<li><a href="#microsoft-x64-abi" id="toc-microsoft-x64-abi"><strong>4.3
Microsoft x64 ABI</strong></a>
<ul>
<li><a href="#register-convention-differences"
id="toc-register-convention-differences"><strong>Register Convention
Differences</strong></a></li>
<li><a href="#function-prologue-and-epilogue-windows"
id="toc-function-prologue-and-epilogue-windows"><strong>Function
Prologue and Epilogue (Windows)</strong></a></li>
<li><a href="#floating-point-parameter-passing-windows"
id="toc-floating-point-parameter-passing-windows"><strong>Floating-Point
Parameter Passing (Windows)</strong></a></li>
</ul></li>
<li><a href="#stack-frame-management"
id="toc-stack-frame-management"><strong>4.4 Stack Frame
Management</strong></a>
<ul>
<li><a href="#frame-pointer-vs-frame-pointer-omission"
id="toc-frame-pointer-vs-frame-pointer-omission"><strong>Frame Pointer
vs Frame Pointer Omission</strong></a></li>
<li><a href="#dynamic-stack-allocation-alloca"
id="toc-dynamic-stack-allocation-alloca"><strong>Dynamic Stack
Allocation (alloca)</strong></a></li>
<li><a href="#stack-unwinding-support"
id="toc-stack-unwinding-support"><strong>Stack Unwinding
Support</strong></a></li>
</ul></li>
<li><a href="#leaf-vs-non-leaf-functions"
id="toc-leaf-vs-non-leaf-functions"><strong>4.5 Leaf vs Non-Leaf
Functions</strong></a>
<ul>
<li><a href="#leaf-function-optimization"
id="toc-leaf-function-optimization"><strong>Leaf Function
Optimization</strong></a></li>
<li><a href="#tail-call-optimization"
id="toc-tail-call-optimization"><strong>Tail Call
Optimization</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-5-exception-handling-and-stack-unwinding"
id="toc-chapter-5-exception-handling-and-stack-unwinding"><strong>Chapter
5: Exception Handling and Stack Unwinding</strong></a>
<ul>
<li><a href="#exception-handling-fundamentals"
id="toc-exception-handling-fundamentals"><strong>5.1 Exception Handling
Fundamentals</strong></a>
<ul>
<li><a href="#types-of-exceptions-in-x86-64"
id="toc-types-of-exceptions-in-x86-64"><strong>Types of Exceptions in
x86-64</strong></a></li>
<li><a href="#exception-frame-layout"
id="toc-exception-frame-layout"><strong>Exception Frame
Layout</strong></a></li>
</ul></li>
<li><a href="#stack-unwinding-mechanisms"
id="toc-stack-unwinding-mechanisms"><strong>5.2 Stack Unwinding
Mechanisms</strong></a>
<ul>
<li><a href="#dwarf-cfi-call-frame-information"
id="toc-dwarf-cfi-call-frame-information"><strong>DWARF CFI (Call Frame
Information)</strong></a></li>
<li><a href="#manual-stack-walking"
id="toc-manual-stack-walking"><strong>Manual Stack
Walking</strong></a></li>
</ul></li>
<li><a href="#seh-structured-exception-handling-on-windows"
id="toc-seh-structured-exception-handling-on-windows"><strong>5.3 SEH
(Structured Exception Handling) on Windows</strong></a>
<ul>
<li><a href="#seh-frame-setup" id="toc-seh-frame-setup"><strong>SEH
Frame Setup</strong></a></li>
<li><a href="#unwind-information-structure"
id="toc-unwind-information-structure"><strong>Unwind Information
Structure</strong></a></li>
</ul></li>
<li><a href="#c-exception-handling-implementation"
id="toc-c-exception-handling-implementation"><strong>5.4 C++ Exception
Handling Implementation</strong></a>
<ul>
<li><a href="#itanium-abi-exception-model-gccclang"
id="toc-itanium-abi-exception-model-gccclang"><strong>Itanium ABI
Exception Model (GCC/Clang)</strong></a></li>
<li><a href="#raii-and-destructor-calls-during-unwinding"
id="toc-raii-and-destructor-calls-during-unwinding"><strong>RAII and
Destructor Calls During Unwinding</strong></a></li>
</ul></li>
<li><a href="#signal-handling-and-asynchronous-exceptions"
id="toc-signal-handling-and-asynchronous-exceptions"><strong>5.5 Signal
Handling and Asynchronous Exceptions</strong></a>
<ul>
<li><a href="#posix-signal-frame"
id="toc-posix-signal-frame"><strong>POSIX Signal Frame</strong></a></li>
</ul></li>
<li><a href="#stack-unwinding-fundamentals"
id="toc-stack-unwinding-fundamentals"><strong>5.2 Stack Unwinding
Fundamentals</strong></a>
<ul>
<li><a href="#frame-pointer-chaining"
id="toc-frame-pointer-chaining">Frame Pointer Chaining</a></li>
</ul></li>
<li><a href="#dwarf-cfi-system-v-amd64"
id="toc-dwarf-cfi-system-v-amd64"><strong>5.3 DWARF CFI (System V
AMD64)</strong></a></li>
<li><a href="#windows-x64-seh-and-unwind-info"
id="toc-windows-x64-seh-and-unwind-info"><strong>5.4 Windows x64 SEH and
Unwind Info</strong></a></li>
<li><a href="#language-level-exception-flow-itanium-c-abi"
id="toc-language-level-exception-flow-itanium-c-abi"><strong>5.5
Language-Level Exception Flow (Itanium C++ ABI)</strong></a></li>
<li><a href="#signals-posix-asynchronous-exceptions"
id="toc-signals-posix-asynchronous-exceptions"><strong>5.6 Signals
(POSIX Asynchronous Exceptions)</strong></a></li>
<li><a href="#practical-stack-unwinding-example"
id="toc-practical-stack-unwinding-example"><strong>5.7 Practical Stack
Unwinding Example</strong></a></li>
<li><a href="#key-points" id="toc-key-points"><strong>Key
Points:</strong></a></li>
</ul></li>
<li><a href="#chapter-6-x87-fpu-and-legacy-floating-point"
id="toc-chapter-6-x87-fpu-and-legacy-floating-point"><strong>Chapter 6:
x87 FPU and Legacy Floating Point</strong></a>
<ul>
<li><a href="#x87-fpu-architecture-overview"
id="toc-x87-fpu-architecture-overview"><strong>6.1 x87 FPU Architecture
Overview</strong></a>
<ul>
<li><a href="#x87-register-stack-model"
id="toc-x87-register-stack-model"><strong>x87 Register Stack
Model</strong></a></li>
<li><a href="#x87-control-and-status-words"
id="toc-x87-control-and-status-words"><strong>x87 Control and Status
Words</strong></a></li>
</ul></li>
<li><a href="#x87-instruction-categories"
id="toc-x87-instruction-categories"><strong>6.2 x87 Instruction
Categories</strong></a>
<ul>
<li><a href="#data-transfer-instructions"
id="toc-data-transfer-instructions"><strong>Data Transfer
Instructions</strong></a></li>
<li><a href="#arithmetic-operations"
id="toc-arithmetic-operations"><strong>Arithmetic
Operations</strong></a></li>
<li><a href="#transcendental-functions"
id="toc-transcendental-functions"><strong>Transcendental
Functions</strong></a></li>
</ul></li>
<li><a href="#comparison-and-conditional-operations"
id="toc-comparison-and-conditional-operations"><strong>6.3 Comparison
and Conditional Operations</strong></a>
<ul>
<li><a href="#comparison-instructions"
id="toc-comparison-instructions"><strong>Comparison
Instructions</strong></a></li>
<li><a href="#conditional-move-fcmovcc"
id="toc-conditional-move-fcmovcc"><strong>Conditional Move
(FCMOVcc)</strong></a></li>
</ul></li>
<li><a href="#exception-handling"
id="toc-exception-handling"><strong>6.4 Exception Handling</strong></a>
<ul>
<li><a href="#x87-exception-types"
id="toc-x87-exception-types"><strong>x87 Exception
Types</strong></a></li>
<li><a href="#exception-service"
id="toc-exception-service"><strong>Exception Service</strong></a></li>
<li><a href="#flag-testing-in-software"
id="toc-flag-testing-in-software"><strong>Flag Testing in
Software</strong></a></li>
<li><a href="#cooperating-with-os-exception-models"
id="toc-cooperating-with-os-exception-models"><strong>Cooperating with
OS Exception Models</strong></a></li>
<li><a href="#modern-context-why-care-in-x8664"
id="toc-modern-context-why-care-in-x8664"><strong>6.5 Modern Context:
Why Care in x86‑64</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-7-sse-and-sse2-programming"
id="toc-chapter-7-sse-and-sse2-programming"><strong>Chapter 7: SSE and
SSE2 Programming</strong></a>
<ul>
<li><a href="#ssesse2-architecture-overview"
id="toc-ssesse2-architecture-overview"><strong>7.1 SSE/SSE2 Architecture
Overview</strong></a>
<ul>
<li><a href="#introduction-to-streaming-simd-extensions"
id="toc-introduction-to-streaming-simd-extensions"><strong>Introduction
to Streaming SIMD Extensions</strong></a></li>
<li><a href="#mxcsr-controlstatus-register"
id="toc-mxcsr-controlstatus-register"><strong>MXCSR Control/Status
Register</strong></a></li>
</ul></li>
<li><a href="#sse-floating-point-operations"
id="toc-sse-floating-point-operations"><strong>7.2 SSE Floating-Point
Operations</strong></a>
<ul>
<li><a href="#single-precision-scalar-operations"
id="toc-single-precision-scalar-operations"><strong>Single-Precision
Scalar Operations</strong></a></li>
<li><a href="#single-precision-packed-operations"
id="toc-single-precision-packed-operations"><strong>Single-Precision
Packed Operations</strong></a></li>
<li><a href="#shuffle-and-permute-operations"
id="toc-shuffle-and-permute-operations"><strong>Shuffle and Permute
Operations</strong></a></li>
</ul></li>
<li><a href="#sse2-double-precision-operations"
id="toc-sse2-double-precision-operations"><strong>7.3 SSE2
Double-Precision Operations</strong></a>
<ul>
<li><a href="#double-precision-scalar-and-packed"
id="toc-double-precision-scalar-and-packed"><strong>Double-Precision
Scalar and Packed</strong></a></li>
</ul></li>
<li><a href="#sse2-integer-operations"
id="toc-sse2-integer-operations"><strong>7.4 SSE2 Integer
Operations</strong></a>
<ul>
<li><a href="#integer-data-movement"
id="toc-integer-data-movement"><strong>Integer Data
Movement</strong></a></li>
<li><a href="#integer-arithmetic-1"
id="toc-integer-arithmetic-1"><strong>Integer
Arithmetic</strong></a></li>
<li><a href="#logical-and-bitwise-operations"
id="toc-logical-and-bitwise-operations"><strong>Logical and Bitwise
Operations</strong></a></li>
<li><a href="#packing-and-unpacking-integers"
id="toc-packing-and-unpacking-integers"><strong>Packing and Unpacking
Integers</strong></a></li>
<li><a href="#conversion-between-integer-and-floating-point"
id="toc-conversion-between-integer-and-floating-point"><strong>Conversion
Between Integer and Floating Point</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-8-advanced-sse-extensions-sse3-ssse3-sse4"
id="toc-chapter-8-advanced-sse-extensions-sse3-ssse3-sse4"><strong>Chapter
8: Advanced SSE Extensions (SSE3, SSSE3, SSE4)</strong></a>
<ul>
<li><a href="#sse3-extensions" id="toc-sse3-extensions"><strong>8.1 SSE3
Extensions</strong></a>
<ul>
<li><a href="#horizontal-arithmetic-operations"
id="toc-horizontal-arithmetic-operations"><strong>Horizontal Arithmetic
Operations</strong></a></li>
<li><a href="#special-move-operations"
id="toc-special-move-operations"><strong>Special Move
Operations</strong></a></li>
<li><a href="#x87-fpu-integration-instructions"
id="toc-x87-fpu-integration-instructions"><strong>x87 FPU Integration
Instructions</strong></a></li>
</ul></li>
<li><a href="#ssse3-extensions" id="toc-ssse3-extensions"><strong>8.2
SSSE3 Extensions</strong></a>
<ul>
<li><a href="#absolute-value-and-sign-operations"
id="toc-absolute-value-and-sign-operations"><strong>Absolute Value and
Sign Operations</strong></a></li>
<li><a href="#horizontal-addition-with-saturation"
id="toc-horizontal-addition-with-saturation"><strong>Horizontal Addition
with Saturation</strong></a></li>
<li><a href="#multiply-and-add-packed"
id="toc-multiply-and-add-packed"><strong>Multiply and Add
Packed</strong></a></li>
<li><a href="#byte-shuffle-pshufb"
id="toc-byte-shuffle-pshufb"><strong>Byte Shuffle
(PSHUFB)</strong></a></li>
<li><a href="#alignment-operations"
id="toc-alignment-operations"><strong>Alignment
Operations</strong></a></li>
</ul></li>
<li><a href="#sse4.1-extensions" id="toc-sse4.1-extensions"><strong>8.3
SSE4.1 Extensions</strong></a>
<ul>
<li><a href="#blending-operations"
id="toc-blending-operations"><strong>Blending
Operations</strong></a></li>
<li><a href="#dot-product-instructions"
id="toc-dot-product-instructions"><strong>Dot Product
Instructions</strong></a></li>
<li><a href="#rounding-operations"
id="toc-rounding-operations"><strong>Rounding
Operations</strong></a></li>
<li><a href="#integer-minmax-operations"
id="toc-integer-minmax-operations"><strong>Integer Min/Max
Operations</strong></a></li>
<li><a href="#enhanced-integer-operations"
id="toc-enhanced-integer-operations"><strong>Enhanced Integer
Operations</strong></a></li>
</ul></li>
<li><a href="#compiler-mapping-and-usecases"
id="toc-compiler-mapping-and-usecases"><strong>8.5 Compiler Mapping and
Use‑Cases</strong></a></li>
</ul></li>
<li><a href="#chapter-9-avx-and-avx2-vector-extensions"
id="toc-chapter-9-avx-and-avx2-vector-extensions"><strong>Chapter 9: AVX
and AVX2 Vector Extensions</strong></a>
<ul>
<li><a href="#introduction-to-avx-architecture"
id="toc-introduction-to-avx-architecture"><strong>9.1 Introduction to
AVX Architecture</strong></a>
<ul>
<li><a href="#evolution-from-sse-to-avx"
id="toc-evolution-from-sse-to-avx"><strong>Evolution from SSE to
AVX</strong></a></li>
<li><a href="#ymm-register-architecture"
id="toc-ymm-register-architecture"><strong>YMM Register
Architecture</strong></a></li>
<li><a href="#vex-encoding-prefix"
id="toc-vex-encoding-prefix"><strong>VEX Encoding
Prefix</strong></a></li>
<li><a href="#state-management" id="toc-state-management"><strong>State
Management</strong></a></li>
</ul></li>
<li><a href="#avx-floating-point-operations"
id="toc-avx-floating-point-operations"><strong>9.2 AVX Floating-Point
Operations</strong></a>
<ul>
<li><a href="#bit-packed-operations"
id="toc-bit-packed-operations"><strong>256-bit Packed
Operations</strong></a></li>
<li><a href="#comparison-and-masking"
id="toc-comparison-and-masking"><strong>Comparison and
Masking</strong></a></li>
<li><a href="#broadcast-operations"
id="toc-broadcast-operations"><strong>Broadcast
Operations</strong></a></li>
</ul></li>
<li><a href="#avx-permutation-and-shuffle"
id="toc-avx-permutation-and-shuffle"><strong>9.3 AVX Permutation and
Shuffle</strong></a>
<ul>
<li><a href="#cross-lane-permutation"
id="toc-cross-lane-permutation"><strong>Cross-Lane
Permutation</strong></a></li>
<li><a href="#unpack-and-shuffle"
id="toc-unpack-and-shuffle"><strong>Unpack and Shuffle</strong></a></li>
</ul></li>
<li><a href="#avx2-integer-operations"
id="toc-avx2-integer-operations"><strong>9.4 AVX2 Integer
Operations</strong></a>
<ul>
<li><a href="#bit-integer-arithmetic"
id="toc-bit-integer-arithmetic"><strong>256-bit Integer
Arithmetic</strong></a></li>
<li><a href="#gather-operations"
id="toc-gather-operations"><strong>Gather Operations</strong></a></li>
<li><a href="#variable-shifts" id="toc-variable-shifts"><strong>Variable
Shifts</strong></a></li>
<li><a href="#cross-lane-permutation-1"
id="toc-cross-lane-permutation-1"><strong>Cross-Lane
Permutation</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-10-avx-512-and-future-extensions"
id="toc-chapter-10-avx-512-and-future-extensions"><strong>Chapter 10:
AVX-512 and Future Extensions</strong></a>
<ul>
<li><a href="#avx-512-architecture-overview"
id="toc-avx-512-architecture-overview"><strong>10.1 AVX-512 Architecture
Overview</strong></a>
<ul>
<li><a href="#introduction-to-avx-512"
id="toc-introduction-to-avx-512"><strong>Introduction to
AVX-512</strong></a></li>
<li><a href="#evex-encoding-structure"
id="toc-evex-encoding-structure"><strong>EVEX Encoding
Structure</strong></a></li>
<li><a href="#opmask-registers" id="toc-opmask-registers"><strong>Opmask
Registers</strong></a></li>
</ul></li>
<li><a href="#avx-512-foundation-instructions"
id="toc-avx-512-foundation-instructions"><strong>10.2 AVX-512 Foundation
Instructions</strong></a>
<ul>
<li><a href="#bit-arithmetic-operations"
id="toc-bit-arithmetic-operations"><strong>512-bit Arithmetic
Operations</strong></a></li>
<li><a href="#advanced-permutation"
id="toc-advanced-permutation"><strong>Advanced
Permutation</strong></a></li>
<li><a href="#scatter-operations"
id="toc-scatter-operations"><strong>Scatter Operations</strong></a></li>
</ul></li>
<li><a href="#avx-512-extension-sets"
id="toc-avx-512-extension-sets"><strong>10.3 AVX-512 Extension
Sets</strong></a>
<ul>
<li><a href="#avx-512bw-byte-and-word"
id="toc-avx-512bw-byte-and-word"><strong>AVX-512BW (Byte and
Word)</strong></a></li>
<li><a href="#avx-512dq-doubleword-and-quadword"
id="toc-avx-512dq-doubleword-and-quadword"><strong>AVX-512DQ (Doubleword
and Quadword)</strong></a></li>
<li><a href="#avx-512vnni-vector-neural-network-instructions"
id="toc-avx-512vnni-vector-neural-network-instructions"><strong>AVX-512VNNI
(Vector Neural Network Instructions)</strong></a></li>
<li><a href="#avx-512ifma-integer-fused-multiply-add"
id="toc-avx-512ifma-integer-fused-multiply-add"><strong>AVX-512IFMA
(Integer Fused Multiply-Add)</strong></a></li>
</ul></li>
<li><a href="#avx-512-optimization-patterns"
id="toc-avx-512-optimization-patterns"><strong>10.4 AVX-512 Optimization
Patterns</strong></a>
<ul>
<li><a href="#conditional-execution-with-masks"
id="toc-conditional-execution-with-masks"><strong>Conditional Execution
with Masks</strong></a></li>
<li><a href="#vectorizing-loops"
id="toc-vectorizing-loops"><strong>Vectorizing Loops</strong></a></li>
<li><a href="#reduction-strategies"
id="toc-reduction-strategies"><strong>Reduction
Strategies</strong></a></li>
<li><a href="#scatter-gather-performance"
id="toc-scatter-gather-performance"><strong>Scatter &amp; Gather
Performance</strong></a></li>
<li><a href="#evex-broadcast-for-loop-invariants"
id="toc-evex-broadcast-for-loop-invariants"><strong>EVEX Broadcast for
Loop Invariants</strong></a></li>
</ul></li>
<li><a href="#practical-considerations-future-trends"
id="toc-practical-considerations-future-trends"><strong>10.5 Practical
Considerations &amp; Future Trends</strong></a></li>
</ul></li>
<li><a href="#chapter-11-system-level-architecture-and-protection"
id="toc-chapter-11-system-level-architecture-and-protection"><strong>Chapter
11: System-Level Architecture and Protection</strong></a>
<ul>
<li><a href="#privilege-levels-and-protection-rings"
id="toc-privilege-levels-and-protection-rings"><strong>11.1 Privilege
Levels and Protection Rings</strong></a>
<ul>
<li><a href="#x86-64-protection-model"
id="toc-x86-64-protection-model"><strong>x86-64 Protection
Model</strong></a></li>
<li><a href="#segment-descriptors-and-gates"
id="toc-segment-descriptors-and-gates"><strong>Segment Descriptors and
Gates</strong></a></li>
<li><a href="#global-and-local-descriptor-tables"
id="toc-global-and-local-descriptor-tables"><strong>Global and Local
Descriptor Tables</strong></a></li>
</ul></li>
<li><a href="#control-registers-and-system-structures"
id="toc-control-registers-and-system-structures"><strong>11.2 Control
Registers and System Structures</strong></a>
<ul>
<li><a href="#control-register-programming"
id="toc-control-register-programming"><strong>Control Register
Programming</strong></a></li>
<li><a href="#model-specific-registers-msrs-1"
id="toc-model-specific-registers-msrs-1"><strong>Model-Specific
Registers (MSRs)</strong></a></li>
<li><a href="#task-state-segment-tss"
id="toc-task-state-segment-tss"><strong>Task State Segment
(TSS)</strong></a></li>
</ul></li>
<li><a href="#interrupt-and-exception-handling"
id="toc-interrupt-and-exception-handling"><strong>11.3 Interrupt and
Exception Handling</strong></a>
<ul>
<li><a href="#interrupt-descriptor-table-management"
id="toc-interrupt-descriptor-table-management"><strong>Interrupt
Descriptor Table Management</strong></a></li>
<li><a href="#system-call-mechanisms"
id="toc-system-call-mechanisms"><strong>System Call
Mechanisms</strong></a></li>
</ul></li>
<li><a href="#memory-protection-mechanisms"
id="toc-memory-protection-mechanisms"><strong>11.4 Memory Protection
Mechanisms</strong></a>
<ul>
<li><a href="#page-table-protection-attributes"
id="toc-page-table-protection-attributes"><strong>Page Table Protection
Attributes</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-12-virtual-memory-and-paging-mechanisms"
id="toc-chapter-12-virtual-memory-and-paging-mechanisms"><strong>Chapter
12: Virtual Memory and Paging Mechanisms</strong></a>
<ul>
<li><a href="#x86-64-paging-architecture"
id="toc-x86-64-paging-architecture"><strong>12.1 x86-64 Paging
Architecture</strong></a>
<ul>
<li><a href="#four-level-page-tables-pml4"
id="toc-four-level-page-tables-pml4"><strong>Four-Level Page Tables
(PML4)</strong></a></li>
<li><a href="#large-pages-2mb-and-1gb"
id="toc-large-pages-2mb-and-1gb"><strong>Large Pages (2MB and
1GB)</strong></a></li>
<li><a href="#five-level-paging-la57"
id="toc-five-level-paging-la57"><strong>Five-Level Paging
(LA57)</strong></a></li>
</ul></li>
<li><a href="#translation-lookaside-buffer-tlb-management"
id="toc-translation-lookaside-buffer-tlb-management"><strong>12.2
Translation Lookaside Buffer (TLB) Management</strong></a>
<ul>
<li><a href="#tlb-invalidation-techniques"
id="toc-tlb-invalidation-techniques"><strong>TLB Invalidation
Techniques</strong></a></li>
<li><a href="#page-attribute-table-pat"
id="toc-page-attribute-table-pat"><strong>Page Attribute Table
(PAT)</strong></a></li>
</ul></li>
<li><a href="#memory-protection-extensions"
id="toc-memory-protection-extensions"><strong>12.3 Memory Protection
Extensions</strong></a>
<ul>
<li><a href="#nx-bit-and-dep" id="toc-nx-bit-and-dep"><strong>NX Bit and
DEP</strong></a></li>
<li><a href="#memory-type-range-registers-mtrrs"
id="toc-memory-type-range-registers-mtrrs"><strong>Memory Type Range
Registers (MTRRs)</strong></a></li>
</ul></li>
<li><a href="#virtual-memory-operations"
id="toc-virtual-memory-operations"><strong>12.4 Virtual Memory
Operations</strong></a>
<ul>
<li><a href="#page-fault-handling"
id="toc-page-fault-handling"><strong>Page Fault
Handling</strong></a></li>
<li><a href="#memory-mapping-and-unmapping"
id="toc-memory-mapping-and-unmapping"><strong>Memory Mapping and
Unmapping</strong></a></li>
<li><a href="#copy-on-write-implementation"
id="toc-copy-on-write-implementation"><strong>Copy-on-Write
Implementation</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-13-interrupts-apic-and-multi-core-programming"
id="toc-chapter-13-interrupts-apic-and-multi-core-programming"><strong>Chapter
13: Interrupts, APIC, and Multi-Core Programming</strong></a>
<ul>
<li><a href="#interrupt-architecture"
id="toc-interrupt-architecture"><strong>13.1 Interrupt
Architecture</strong></a>
<ul>
<li><a href="#interrupt-descriptor-table-idt"
id="toc-interrupt-descriptor-table-idt"><strong>Interrupt Descriptor
Table (IDT)</strong></a></li>
<li><a href="#exception-handling-1"
id="toc-exception-handling-1"><strong>Exception
Handling</strong></a></li>
<li><a href="#hardware-vs-software-interrupts"
id="toc-hardware-vs-software-interrupts"><strong>Hardware vs Software
Interrupts</strong></a></li>
</ul></li>
<li><a href="#advanced-programmable-interrupt-controller-apic"
id="toc-advanced-programmable-interrupt-controller-apic"><strong>13.2
Advanced Programmable Interrupt Controller (APIC)</strong></a>
<ul>
<li><a href="#local-apic-programming"
id="toc-local-apic-programming"><strong>Local APIC
Programming</strong></a></li>
<li><a href="#io-apic-configuration"
id="toc-io-apic-configuration"><strong>I/O APIC
Configuration</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-14-security-extensions-and-virtualization"
id="toc-chapter-14-security-extensions-and-virtualization"><strong>Chapter
14: Security Extensions and Virtualization</strong></a>
<ul>
<li><a href="#hardware-assisted-security-features"
id="toc-hardware-assisted-security-features"><strong>14.1
Hardware-Assisted Security Features</strong></a>
<ul>
<li><a href="#nx-bit-no-execute" id="toc-nx-bit-no-execute"><strong>NX
Bit (No-Execute)</strong></a></li>
<li><a href="#smapsmep-supervisor-mode-accessexecution-prevention"
id="toc-smapsmep-supervisor-mode-accessexecution-prevention"><strong>SMAP/SMEP
(Supervisor Mode Access/Execution Prevention)</strong></a></li>
<li><a href="#intel-cet-control-flow-enforcement-technology"
id="toc-intel-cet-control-flow-enforcement-technology"><strong>Intel CET
(Control-flow Enforcement Technology)</strong></a></li>
<li><a href="#intel-sgx-software-guard-extensions"
id="toc-intel-sgx-software-guard-extensions"><strong>Intel SGX (Software
Guard Extensions)</strong></a></li>
</ul></li>
<li><a href="#virtualization-architecture"
id="toc-virtualization-architecture"><strong>14.2 Virtualization
Architecture</strong></a>
<ul>
<li><a href="#intel-vt-x-vmx-fundamentals"
id="toc-intel-vt-x-vmx-fundamentals"><strong>Intel VT-x (VMX)
Fundamentals</strong></a></li>
<li><a href="#extended-page-tables-ept"
id="toc-extended-page-tables-ept"><strong>Extended Page Tables
(EPT)</strong></a></li>
</ul></li>
<li><a href="#multi-core-and-multi-threading-security"
id="toc-multi-core-and-multi-threading-security"><strong>14.3 Multi-Core
and Multi-Threading Security</strong></a>
<ul>
<li><a href="#per-cpu-security-state"
id="toc-per-cpu-security-state"><strong>Per-CPU Security
State</strong></a></li>
<li><a href="#speculation-control"
id="toc-speculation-control"><strong>Speculation
Control</strong></a></li>
</ul></li>
<li><a href="#secure-coding-practices"
id="toc-secure-coding-practices"><strong>14.4 Secure Coding
Practices</strong></a>
<ul>
<li><a href="#stack-protection" id="toc-stack-protection"><strong>Stack
Protection</strong></a></li>
<li><a href="#secure-memory-operations"
id="toc-secure-memory-operations"><strong>Secure Memory
Operations</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#chapter-15-performance-optimization-techniques"
id="toc-chapter-15-performance-optimization-techniques">Chapter 15:
Performance Optimization Techniques</a>
<ul>
<li><a href="#microarchitectural-optimization-fundamentals"
id="toc-microarchitectural-optimization-fundamentals">15.1
Microarchitectural Optimization Fundamentals</a>
<ul>
<li><a href="#understanding-the-modern-x86-64-pipeline"
id="toc-understanding-the-modern-x86-64-pipeline">Understanding the
Modern x86-64 Pipeline</a></li>
<li><a href="#execution-ports-and-throughput"
id="toc-execution-ports-and-throughput">Execution Ports and
Throughput</a></li>
</ul></li>
<li><a href="#branch-prediction-optimization"
id="toc-branch-prediction-optimization">15.2 Branch Prediction
Optimization</a>
<ul>
<li><a href="#static-branch-prediction"
id="toc-static-branch-prediction">Static Branch Prediction</a></li>
<li><a href="#loop-optimization-and-unrolling"
id="toc-loop-optimization-and-unrolling">Loop Optimization and
Unrolling</a></li>
</ul></li>
<li><a href="#memory-access-optimization"
id="toc-memory-access-optimization">15.3 Memory Access Optimization</a>
<ul>
<li><a href="#cache-line-optimization"
id="toc-cache-line-optimization">Cache Line Optimization</a></li>
<li><a href="#non-temporal-stores-streaming-stores"
id="toc-non-temporal-stores-streaming-stores">Non-Temporal Stores
(Streaming Stores)</a></li>
</ul></li>
<li><a href="#simd-vectorization-techniques"
id="toc-simd-vectorization-techniques">15.4 SIMD Vectorization
Techniques</a>
<ul>
<li><a href="#auto-vectorization-patterns"
id="toc-auto-vectorization-patterns">Auto-Vectorization
Patterns</a></li>
<li><a href="#fma-fused-multiply-add-optimization"
id="toc-fma-fused-multiply-add-optimization">FMA (Fused Multiply-Add)
Optimization</a></li>
</ul></li>
<li><a href="#instruction-level-parallelism"
id="toc-instruction-level-parallelism">15.5 Instruction-Level
Parallelism</a>
<ul>
<li><a href="#dependency-chain-breaking"
id="toc-dependency-chain-breaking">Dependency Chain Breaking</a></li>
<li><a href="#software-pipelining" id="toc-software-pipelining">Software
Pipelining</a></li>
</ul></li>
<li><a href="#code-size-and-alignment-optimization"
id="toc-code-size-and-alignment-optimization">15.6 Code Size and
Alignment Optimization</a>
<ul>
<li><a href="#function-and-loop-alignment"
id="toc-function-and-loop-alignment">Function and Loop
Alignment</a></li>
<li><a href="#instruction-selection-for-size"
id="toc-instruction-selection-for-size">Instruction Selection for
Size</a></li>
</ul></li>
<li><a href="#profile-guided-optimization"
id="toc-profile-guided-optimization">15.7 Profile-Guided
Optimization</a>
<ul>
<li><a href="#using-performance-counters"
id="toc-using-performance-counters">Using Performance Counters</a></li>
</ul></li>
<li><a href="#practical-optimization-example"
id="toc-practical-optimization-example">15.8 Practical Optimization
Example</a></li>
<li><a href="#performance-analysis-tools"
id="toc-performance-analysis-tools">15.9 Performance Analysis Tools</a>
<ul>
<li><a href="#intel-vtune-profiler-integration"
id="toc-intel-vtune-profiler-integration">Intel VTune Profiler
Integration</a></li>
</ul></li>
<li><a href="#summary" id="toc-summary">Summary</a></li>
<li><a href="#exercises" id="toc-exercises">Exercises</a></li>
</ul></li>
<li><a href="#chapter-16-code-generation-and-compiler-backend"
id="toc-chapter-16-code-generation-and-compiler-backend">Chapter 16:
Code Generation and Compiler Backend</a>
<ul>
<li><a href="#compiler-architecture-overview"
id="toc-compiler-architecture-overview">16.1 Compiler Architecture
Overview</a>
<ul>
<li><a href="#compilation-pipeline"
id="toc-compilation-pipeline">Compilation Pipeline</a></li>
</ul></li>
<li><a href="#register-allocation" id="toc-register-allocation">16.2
Register Allocation</a>
<ul>
<li><a href="#graph-coloring-algorithm"
id="toc-graph-coloring-algorithm">Graph Coloring Algorithm</a></li>
<li><a href="#spill-code-generation"
id="toc-spill-code-generation">Spill Code Generation</a></li>
</ul></li>
<li><a href="#instruction-selection" id="toc-instruction-selection">16.3
Instruction Selection</a>
<ul>
<li><a href="#pattern-matching-and-tiling"
id="toc-pattern-matching-and-tiling">Pattern Matching and
Tiling</a></li>
<li><a href="#peephole-optimization"
id="toc-peephole-optimization">Peephole Optimization</a></li>
</ul></li>
<li><a href="#jit-compilation-implementation"
id="toc-jit-compilation-implementation">16.4 JIT Compilation
Implementation</a>
<ul>
<li><a href="#basic-jit-compiler-structure"
id="toc-basic-jit-compiler-structure">Basic JIT Compiler
Structure</a></li>
<li><a href="#advanced-jit-with-templates"
id="toc-advanced-jit-with-templates">Advanced JIT with
Templates</a></li>
</ul></li>
<li><a href="#dynamic-binary-translation"
id="toc-dynamic-binary-translation">16.5 Dynamic Binary Translation</a>
<ul>
<li><a href="#self-modifying-code"
id="toc-self-modifying-code">Self-Modifying Code</a></li>
</ul></li>
<li><a href="#machine-code-encoding" id="toc-machine-code-encoding">16.6
Machine Code Encoding</a>
<ul>
<li><a href="#x86-64-instruction-encoding"
id="toc-x86-64-instruction-encoding">x86-64 Instruction
Encoding</a></li>
<li><a href="#building-an-assembler"
id="toc-building-an-assembler">Building an Assembler</a></li>
</ul></li>
<li><a href="#optimization-pass-implementation"
id="toc-optimization-pass-implementation">16.7 Optimization Pass
Implementation</a>
<ul>
<li><a href="#dead-code-elimination" id="toc-dead-code-elimination">Dead
Code Elimination</a></li>
<li><a href="#constant-propagation"
id="toc-constant-propagation">Constant Propagation</a></li>
</ul></li>
<li><a href="#llvm-integration" id="toc-llvm-integration">16.8 LLVM
Integration</a>
<ul>
<li><a href="#llvm-ir-to-x86-64" id="toc-llvm-ir-to-x86-64">LLVM IR to
x86-64</a></li>
</ul></li>
<li><a href="#register-allocation-1" id="toc-register-allocation-1">16.2
Register Allocation</a>
<ul>
<li><a href="#graph-coloring-allocation"
id="toc-graph-coloring-allocation">Graph Coloring Allocation</a></li>
</ul></li>
<li><a href="#instruction-selection-1"
id="toc-instruction-selection-1">16.3 Instruction Selection</a>
<ul>
<li><a href="#matching-and-tiling" id="toc-matching-and-tiling">Matching
and Tiling</a></li>
</ul></li>
<li><a href="#late-stage-peephole-optimization"
id="toc-late-stage-peephole-optimization">16.4 Late-stage (Peephole)
Optimization</a></li>
<li><a href="#jit-compilation" id="toc-jit-compilation">16.5 JIT
Compilation</a></li>
<li><a href="#dynamic-binary-translation-self-modifying-code"
id="toc-dynamic-binary-translation-self-modifying-code">16.6 Dynamic
Binary Translation &amp; Self-modifying Code</a></li>
<li><a href="#machine-code-encoding-1"
id="toc-machine-code-encoding-1">16.7 Machine Code Encoding</a>
<ul>
<li><a href="#encoding-format" id="toc-encoding-format">Encoding
Format</a></li>
</ul></li>
<li><a href="#building-an-assembler-backend-emitter"
id="toc-building-an-assembler-backend-emitter">16.8 Building an
Assembler (Backend-emitter)</a></li>
<li><a href="#backend-optimization-passes"
id="toc-backend-optimization-passes">16.9 Backend Optimization
Passes</a></li>
<li><a href="#llvm-backend-integration"
id="toc-llvm-backend-integration">16.10 LLVM Backend
Integration</a></li>
<li><a href="#summary-1" id="toc-summary-1">Summary</a>
<ul>
<li><a href="#exercises-1" id="toc-exercises-1">Exercises</a></li>
</ul></li>
</ul></li>
<li><a href="#appendix-a-x86-64-instruction-reference-quick-guide"
id="toc-appendix-a-x86-64-instruction-reference-quick-guide">Appendix A:
x86-64 Instruction Reference Quick Guide</a>
<ul>
<li><a href="#a.1-instruction-format-overview"
id="toc-a.1-instruction-format-overview">A.1 Instruction Format
Overview</a>
<ul>
<li><a href="#general-encoding-structure"
id="toc-general-encoding-structure">General Encoding Structure</a></li>
<li><a href="#rex-prefix-40h-4fh" id="toc-rex-prefix-40h-4fh">REX Prefix
(40h-4Fh)</a></li>
<li><a href="#modrm-byte" id="toc-modrm-byte">ModR/M Byte</a></li>
</ul></li>
<li><a href="#a.2-data-movement-instructions"
id="toc-a.2-data-movement-instructions">A.2 Data Movement
Instructions</a>
<ul>
<li><a href="#basic-moves" id="toc-basic-moves">Basic Moves</a></li>
<li><a href="#stack-operations" id="toc-stack-operations">Stack
Operations</a></li>
<li><a href="#conditional-moves-cmovcc"
id="toc-conditional-moves-cmovcc">Conditional Moves (CMOVcc)</a></li>
</ul></li>
<li><a href="#a.3-arithmetic-instructions"
id="toc-a.3-arithmetic-instructions">A.3 Arithmetic Instructions</a>
<ul>
<li><a href="#integer-arithmetic-2"
id="toc-integer-arithmetic-2">Integer Arithmetic</a></li>
<li><a href="#bcd-and-ascii-adjust-legacy"
id="toc-bcd-and-ascii-adjust-legacy">BCD and ASCII Adjust
(Legacy)</a></li>
</ul></li>
<li><a href="#a.4-logical-instructions"
id="toc-a.4-logical-instructions">A.4 Logical Instructions</a></li>
<li><a href="#a.5-shift-and-rotate-instructions"
id="toc-a.5-shift-and-rotate-instructions">A.5 Shift and Rotate
Instructions</a></li>
<li><a href="#a.6-bit-manipulation-instructions"
id="toc-a.6-bit-manipulation-instructions">A.6 Bit Manipulation
Instructions</a></li>
<li><a href="#a.7-control-transfer-instructions"
id="toc-a.7-control-transfer-instructions">A.7 Control Transfer
Instructions</a>
<ul>
<li><a href="#unconditional-jumps-1"
id="toc-unconditional-jumps-1">Unconditional Jumps</a></li>
<li><a href="#conditional-jumps-jcc"
id="toc-conditional-jumps-jcc">Conditional Jumps (Jcc)</a></li>
<li><a href="#loop-instructions-1" id="toc-loop-instructions-1">Loop
Instructions</a></li>
</ul></li>
<li><a href="#a.8-string-instructions"
id="toc-a.8-string-instructions">A.8 String Instructions</a></li>
<li><a href="#a.9-flag-control-instructions"
id="toc-a.9-flag-control-instructions">A.9 Flag Control
Instructions</a></li>
<li><a href="#a.10-system-instructions"
id="toc-a.10-system-instructions">A.10 System Instructions</a></li>
<li><a href="#a.11-simd-instructions-sseavx"
id="toc-a.11-simd-instructions-sseavx">A.11 SIMD Instructions
(SSE/AVX)</a>
<ul>
<li><a href="#data-movement" id="toc-data-movement">Data
Movement</a></li>
<li><a href="#arithmetic-packed" id="toc-arithmetic-packed">Arithmetic
(Packed)</a></li>
<li><a href="#logical" id="toc-logical">Logical</a></li>
<li><a href="#comparison" id="toc-comparison">Comparison</a></li>
<li><a href="#shufflepermute"
id="toc-shufflepermute">Shuffle/Permute</a></li>
</ul></li>
<li><a href="#a.12-avxavx2-instructions"
id="toc-a.12-avxavx2-instructions">A.12 AVX/AVX2 Instructions</a>
<ul>
<li><a href="#three-operand-form"
id="toc-three-operand-form">Three-Operand Form</a></li>
<li><a href="#fma-fused-multiply-add"
id="toc-fma-fused-multiply-add">FMA (Fused Multiply-Add)</a></li>
<li><a href="#gatherscatter-avx2avx-512"
id="toc-gatherscatter-avx2avx-512">Gather/Scatter
(AVX2/AVX-512)</a></li>
</ul></li>
<li><a href="#a.13-avx-512-instructions"
id="toc-a.13-avx-512-instructions">A.13 AVX-512 Instructions</a>
<ul>
<li><a href="#mask-operations" id="toc-mask-operations">Mask
Operations</a></li>
<li><a href="#masked-operations" id="toc-masked-operations">Masked
Operations</a></li>
<li><a href="#special-avx-512-instructions"
id="toc-special-avx-512-instructions">Special AVX-512
Instructions</a></li>
</ul></li>
<li><a href="#a.14-transactional-memory-tsx"
id="toc-a.14-transactional-memory-tsx">A.14 Transactional Memory
(TSX)</a></li>
<li><a href="#a.15-security-extensions"
id="toc-a.15-security-extensions">A.15 Security Extensions</a>
<ul>
<li><a href="#intel-cet-control-flow-enforcement"
id="toc-intel-cet-control-flow-enforcement">Intel CET (Control-flow
Enforcement)</a></li>
<li><a href="#intel-sgx" id="toc-intel-sgx">Intel SGX</a></li>
</ul></li>
<li><a href="#a.16-common-instruction-patterns"
id="toc-a.16-common-instruction-patterns">A.16 Common Instruction
Patterns</a>
<ul>
<li><a href="#function-prologueepilogue"
id="toc-function-prologueepilogue">Function Prologue/Epilogue</a></li>
<li><a href="#system-v-amd64-abi-registers"
id="toc-system-v-amd64-abi-registers">System V AMD64 ABI
Registers</a></li>
<li><a href="#windows-x64-abi-registers"
id="toc-windows-x64-abi-registers">Windows x64 ABI Registers</a></li>
</ul></li>
<li><a href="#a.17-optimization-guidelines"
id="toc-a.17-optimization-guidelines">A.17 Optimization Guidelines</a>
<ul>
<li><a href="#alignment" id="toc-alignment">Alignment</a></li>
<li><a href="#instruction-selection-2"
id="toc-instruction-selection-2">Instruction Selection</a></li>
<li><a href="#pipeline-optimization"
id="toc-pipeline-optimization">Pipeline Optimization</a></li>
</ul></li>
<li><a href="#summary-2" id="toc-summary-2">Summary</a></li>
<li><a href="#instruction-extension-quick-map"
id="toc-instruction-extension-quick-map"><strong>Instruction &amp;
Extension Quick Map</strong></a>
<ul>
<li><a href="#encoding-basics" id="toc-encoding-basics"><strong>Encoding
Basics</strong></a></li>
<li><a href="#scalar-and-general-purpose-ops"
id="toc-scalar-and-general-purpose-ops"><strong>Scalar and General
Purpose Ops</strong></a></li>
<li><a href="#sse3-ssse3-sse4-highlights"
id="toc-sse3-ssse3-sse4-highlights"><strong>SSE3 / SSSE3 / SSE4
Highlights</strong></a></li>
<li><a href="#avx-avx2" id="toc-avx-avx2"><strong>AVX /
AVX2</strong></a></li>
<li><a href="#avx512" id="toc-avx512"><strong>AVX‑512</strong></a></li>
<li><a href="#systemarch"
id="toc-systemarch"><strong>System/Arch</strong></a></li>
</ul></li>
</ul></li>
<li><a href="#appendix-b-system-v-amd64-abi-summary"
id="toc-appendix-b-system-v-amd64-abi-summary">Appendix B: System V
AMD64 ABI Summary</a>
<ul>
<li><a href="#b.1-register-usage-conventions"
id="toc-b.1-register-usage-conventions">B.1 Register Usage
Conventions</a>
<ul>
<li><a href="#general-purpose-registers-1"
id="toc-general-purpose-registers-1">General Purpose Registers</a></li>
<li><a href="#floating-point-registers"
id="toc-floating-point-registers">Floating-Point Registers</a></li>
<li><a href="#special-registers" id="toc-special-registers">Special
Registers</a></li>
</ul></li>
<li><a href="#b.2-function-calling-convention"
id="toc-b.2-function-calling-convention">B.2 Function Calling
Convention</a>
<ul>
<li><a href="#argument-passing" id="toc-argument-passing">Argument
Passing</a></li>
<li><a href="#classification-rules"
id="toc-classification-rules">Classification Rules</a></li>
<li><a href="#aggregate-structunion-passing"
id="toc-aggregate-structunion-passing">Aggregate (Struct/Union)
Passing</a></li>
<li><a href="#variable-arguments-va_args"
id="toc-variable-arguments-va_args">Variable Arguments
(varargs)</a></li>
</ul></li>
<li><a href="#b.3-stack-frame-layout"
id="toc-b.3-stack-frame-layout">B.3 Stack Frame Layout</a>
<ul>
<li><a href="#stack-organization-high-to-low-address"
id="toc-stack-organization-high-to-low-address">Stack Organization (High
to Low Address)</a></li>
<li><a href="#red-zone-1" id="toc-red-zone-1">Red Zone</a></li>
<li><a href="#stack-alignment-1" id="toc-stack-alignment-1">Stack
Alignment</a></li>
</ul></li>
<li><a href="#b.4-return-values" id="toc-b.4-return-values">B.4 Return
Values</a>
<ul>
<li><a href="#scalar-returns" id="toc-scalar-returns">Scalar
Returns</a></li>
<li><a href="#aggregate-returns" id="toc-aggregate-returns">Aggregate
Returns</a></li>
</ul></li>
<li><a href="#b.5-function-prologue-and-epilogue"
id="toc-b.5-function-prologue-and-epilogue">B.5 Function Prologue and
Epilogue</a>
<ul>
<li><a href="#standard-prologue" id="toc-standard-prologue">Standard
Prologue</a></li>
<li><a href="#standard-epilogue" id="toc-standard-epilogue">Standard
Epilogue</a></li>
<li><a href="#leaf-function-optimization-1"
id="toc-leaf-function-optimization-1">Leaf Function
Optimization</a></li>
</ul></li>
<li><a href="#b.6-system-calls" id="toc-b.6-system-calls">B.6 System
Calls</a>
<ul>
<li><a href="#linux-system-call-convention"
id="toc-linux-system-call-convention">Linux System Call
Convention</a></li>
<li><a href="#system-call-example" id="toc-system-call-example">System
Call Example</a></li>
<li><a href="#common-system-call-numbers"
id="toc-common-system-call-numbers">Common System Call Numbers</a></li>
</ul></li>
<li><a href="#b.7-thread-local-storage-tls"
id="toc-b.7-thread-local-storage-tls">B.7 Thread-Local Storage (TLS)</a>
<ul>
<li><a href="#tls-access-models" id="toc-tls-access-models">TLS Access
Models</a></li>
</ul></li>
<li><a href="#b.8-exception-handling"
id="toc-b.8-exception-handling">B.8 Exception Handling</a>
<ul>
<li><a href="#stack-unwinding-dwarf"
id="toc-stack-unwinding-dwarf">Stack Unwinding (DWARF)</a></li>
<li><a href="#c-exception-handling" id="toc-c-exception-handling">C++
Exception Handling</a></li>
</ul></li>
<li><a href="#b.9-data-alignment-requirements"
id="toc-b.9-data-alignment-requirements">B.9 Data Alignment
Requirements</a>
<ul>
<li><a href="#structure-padding" id="toc-structure-padding">Structure
Padding</a></li>
</ul></li>
<li><a href="#b.10-executable-file-format-elf"
id="toc-b.10-executable-file-format-elf">B.10 Executable File Format
(ELF)</a>
<ul>
<li><a href="#program-headers" id="toc-program-headers">Program
Headers</a></li>
</ul></li>
<li><a href="#register-usage-and-preservation-rules"
id="toc-register-usage-and-preservation-rules">1️⃣ Register Usage and
Preservation Rules</a></li>
<li><a href="#calling-convention-essentials"
id="toc-calling-convention-essentials">2️⃣ Calling Convention
Essentials</a></li>
<li><a href="#stack-frame-and-alignment"
id="toc-stack-frame-and-alignment">3️⃣ Stack Frame and Alignment</a></li>
<li><a href="#system-calls-linux-amd64"
id="toc-system-calls-linux-amd64">4️⃣ System Calls (Linux AMD64)</a></li>
<li><a href="#data-alignment-rules" id="toc-data-alignment-rules">5️⃣
Data Alignment Rules</a></li>
<li><a href="#threadlocal-storage-tls"
id="toc-threadlocal-storage-tls">6️⃣ Thread‑Local Storage (TLS)</a></li>
<li><a href="#exceptionunwind-info" id="toc-exceptionunwind-info">7️⃣
Exception/Unwind Info</a></li>
</ul></li>
<li><a href="#appendix-c-nasmgasmasm-syntax-comparison"
id="toc-appendix-c-nasmgasmasm-syntax-comparison">Appendix C:
NASM/GAS/MASM Syntax Comparison</a>
<ul>
<li><a href="#c.1-basic-syntax-differences"
id="toc-c.1-basic-syntax-differences">C.1 Basic Syntax Differences</a>
<ul>
<li><a href="#instruction-format"
id="toc-instruction-format">Instruction Format</a></li>
<li><a href="#basic-instruction-examples"
id="toc-basic-instruction-examples">Basic Instruction Examples</a></li>
</ul></li>
<li><a href="#c.2-memory-addressing" id="toc-c.2-memory-addressing">C.2
Memory Addressing</a>
<ul>
<li><a href="#direct-memory-access" id="toc-direct-memory-access">Direct
Memory Access</a></li>
<li><a href="#complex-addressing-modes-1"
id="toc-complex-addressing-modes-1">Complex Addressing Modes</a></li>
</ul></li>
<li><a href="#c.3-data-definitions" id="toc-c.3-data-definitions">C.3
Data Definitions</a>
<ul>
<li><a href="#basic-data-types" id="toc-basic-data-types">Basic Data
Types</a></li>
<li><a href="#string-definitions" id="toc-string-definitions">String
Definitions</a></li>
</ul></li>
<li><a href="#c.4-sections-and-segments"
id="toc-c.4-sections-and-segments">C.4 Sections and Segments</a></li>
<li><a href="#c.5-macros-and-directives"
id="toc-c.5-macros-and-directives">C.5 Macros and Directives</a>
<ul>
<li><a href="#macro-definitions" id="toc-macro-definitions">Macro
Definitions</a></li>
<li><a href="#conditional-assembly"
id="toc-conditional-assembly">Conditional Assembly</a></li>
</ul></li>
<li><a href="#c.6-symbols-and-labels"
id="toc-c.6-symbols-and-labels">C.6 Symbols and Labels</a>
<ul>
<li><a href="#global-and-external-symbols"
id="toc-global-and-external-symbols">Global and External
Symbols</a></li>
<li><a href="#alignment-directives"
id="toc-alignment-directives">Alignment Directives</a></li>
</ul></li>
<li><a href="#c.7-procedure-definitions"
id="toc-c.7-procedure-definitions">C.7 Procedure Definitions</a>
<ul>
<li><a href="#function-declaration"
id="toc-function-declaration">Function Declaration</a></li>
</ul></li>
<li><a href="#c.8-simd-instructions" id="toc-c.8-simd-instructions">C.8
SIMD Instructions</a>
<ul>
<li><a href="#sseavx-instructions" id="toc-sseavx-instructions">SSE/AVX
Instructions</a></li>
<li><a href="#avx-512-with-masking"
id="toc-avx-512-with-masking">AVX-512 with Masking</a></li>
</ul></li>
<li><a href="#c.9-system-instructions"
id="toc-c.9-system-instructions">C.9 System Instructions</a>
<ul>
<li><a href="#privileged-instructions"
id="toc-privileged-instructions">Privileged Instructions</a></li>
</ul></li>
</ul></li>
</ul>
</nav>
<main>
<header id="title-block-header">
<h1 class="title">Dossier - x86asm-dossier</h1>
</header>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-XjKyOOlGwcjNTAIQHIpgOno0Hl1YQqzUOEleOLALmuqehneUG+vnGctmUb0ZY0l8" crossorigin="anonymous"></script>
<!-- Auto-render extension for automatic math rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous"></script>
<!-- Initialize KaTeX auto-render -->
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "\\[", right: "\\]", display: true},
{left: "\\(", right: "\\)", display: false}
],
throwOnError: false
});
});
</script>
<h2 id="chapter-1-introduction-to-x86-64-architecture"><strong>Chapter
1: Introduction to x86-64 Architecture</strong></h2>
<h3 id="evolution-from-8086-to-x86-64"><strong>1.1 Evolution from 8086
to x86-64</strong></h3>
<h4 id="the-journey-from-16-bit-to-64-bit"><strong>The Journey from
16-bit to 64-bit</strong></h4>
<p>The x86-64 architecture, also known as AMD64 or Intel 64, represents
the culmination of over four decades of evolutionary development that
began with Intel’s 8086 processor in 1978. Understanding this evolution
is crucial for both assembly programmers and compiler engineers, as many
architectural decisions in modern x86-64 CPUs stem from maintaining
backward compatibility while extending capabilities.</p>
<p>The 8086 introduced a 16-bit architecture with segmented memory
addressing, allowing access to 1MB of memory through 20-bit addresses
formed by combining 16-bit segment and offset values. This seemingly
simple design decision would influence x86 architecture for decades to
come:</p>
<pre class="assembly"><code>; 8086 segmented addressing example
mov ax, 0x1234 ; Load segment value
mov ds, ax ; Set data segment
mov bx, [0x5678] ; Access memory at DS:0x5678 (physical: 0x179B8)</code></pre>
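<p>The physical address in the comment above follows from the 8086 rule
<code>segment * 16 + offset</code>. As a minimal sketch in C (the function
name <code>real_mode_physical</code> is illustrative only), the calculation
can be modeled as follows, including the wrap at 1MB that results from the
20-bit address bus:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Real-mode address translation: physical = segment * 16 + offset.
 * The 8086 has 20 address lines, so the sum wraps at 1 MB. */
uint32_t real_mode_physical(uint16_t segment, uint16_t offset) {
    uint32_t linear = ((uint32_t)segment &lt;&lt; 4) + offset;
    return linear &amp; 0xFFFFF;  /* keep the low 20 bits */
}</code></pre></div>
<p>With the values from the listing, <code>real_mode_physical(0x1234,
0x5678)</code> yields 0x12340 + 0x5678 = 0x179B8.</p>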
<h4 id="the-32-bit-revolution-80386-and-ia-32"><strong>The 32-bit
Revolution: 80386 and IA-32</strong></h4>
<p>The 80386, introduced in 1985, brought true 32-bit computing to the
x86 family. This processor introduced:</p>
<ul>
<li><p><strong>32-bit general-purpose registers</strong> (EAX, EBX, ECX,
EDX, ESI, EDI, EBP, ESP)</p></li>
<li><p><strong>Flat memory model</strong> with 4GB address
space</p></li>
<li><p><strong>Protected mode</strong> with privilege levels and memory
protection</p></li>
<li><p><strong>Virtual memory</strong> support with paging</p></li>
</ul>
<pre class="assembly"><code>; 32-bit code example
mov eax, [ebx + ecx*4 + 0x1000] ; Complex addressing modes
push ebp ; 32-bit stack operations
mov ebp, esp</code></pre>
<p>The IA-32 architecture maintained full backward compatibility,
running 16-bit code in “real mode” or “virtual 8086 mode” while offering
protected mode for modern operating systems.</p>
<h4 id="the-64-bit-extension-amd64-and-intel-64"><strong>The 64-bit
Extension: AMD64 and Intel 64</strong></h4>
<p>In 2003, AMD introduced the x86-64 architecture with the Opteron and
Athlon 64 processors, later adopted by Intel as Intel 64. This extension
brought revolutionary changes while maintaining the x86 legacy:</p>
<p><strong>Key Enhancements:</strong></p>
<ul>
<li><p><strong>64-bit general-purpose registers</strong> (RAX, RBX, RCX,
RDX, RSI, RDI, RBP, RSP)</p></li>
<li><p><strong>Eight new general-purpose registers</strong>
(R8-R15)</p></li>
<li><p><strong>64-bit instruction pointer</strong> (RIP) with
RIP-relative addressing</p></li>
<li><p><strong>Larger virtual address space</strong> (48-bit in initial
implementations, up to 57-bit in recent CPUs)</p></li>
<li><p><strong>SSE2 as baseline</strong> floating-point
architecture</p></li>
<li><p><strong>NX bit</strong> for enhanced security</p></li>
</ul>
<pre class="assembly"><code>; 64-bit code showcasing new features
mov rax, 0x123456789ABCDEF0 ; 64-bit immediate
mov r10, [rip + data_label] ; RIP-relative addressing
add r8d, r9d ; New registers (32-bit portion)
movaps xmm0, [rsp + 16] ; SSE mandatory in 64-bit mode</code></pre>
<h4 id="compiler-perspective-evolutionary-complexity"><strong>Compiler
Perspective: Evolutionary Complexity</strong></h4>
<p>From a compiler’s viewpoint, this evolution presents both
opportunities and challenges:</p>
<ol type="1">
<li><p><strong>Register Allocation</strong>: The increase from 8 to 16
general-purpose registers dramatically improves register allocation
algorithms’ effectiveness, reducing memory traffic.</p></li>
<li><p><strong>Addressing Modes</strong>: The addition of RIP-relative
addressing enables position-independent code without the extra
instructions that 32-bit code needs to obtain its own address.</p></li>
<li><p><strong>Compatibility Burden</strong>: Compilers must handle
multiple target modes (16-bit, 32-bit, 64-bit) with different
instruction encodings and constraints.</p></li>
</ol>
<h3 id="x86-64-execution-environment-and-modes"><strong>1.2 x86-64
Execution Environment and Modes</strong></h3>
<h4 id="operating-modes"><strong>Operating Modes</strong></h4>
<p>The x86-64 architecture supports several operating modes, each with
distinct characteristics:</p>
<h5 id="long-mode-64-bit-mode"><strong>Long Mode (64-bit
Mode)</strong></h5>
<p>The primary operating mode for modern operating systems, consisting
of two sub-modes:</p>
<ul>
<li><p><strong>64-bit Mode</strong>: Full 64-bit operation with all
architectural enhancements</p></li>
<li><p><strong>Compatibility Mode</strong>: Runs legacy 32-bit and
16-bit protected mode applications without modification</p></li>
</ul>
<pre class="assembly"><code>; 64-bit mode characteristics
; Default operand size: 32-bit
mov eax, ebx ; 32-bit operation (default)
mov rax, rbx ; 64-bit operation (REX prefix required)
; Default address size: 64-bit
mov rax, [rbx] ; 64-bit addressing
mov rax, [ebx] ; 32-bit addressing (0x67 prefix)</code></pre>
<h5 id="legacy-modes"><strong>Legacy Modes</strong></h5>
<ul>
<li><p><strong>Protected Mode</strong>: 32-bit operation, used by 32-bit
operating systems</p></li>
<li><p><strong>Real Mode</strong>: 16-bit operation, used during system
boot</p></li>
<li><p><strong>System Management Mode (SMM)</strong>: Special mode for
system firmware</p></li>
</ul>
<h4 id="execution-state"><strong>Execution State</strong></h4>
<p>The processor execution state in 64-bit mode includes:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Conceptual representation of CPU state</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> X86_64_State <span class="op">{</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> <span class="co">// General-purpose registers</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> rax<span class="op">,</span> rbx<span class="op">,</span> rcx<span class="op">,</span> rdx<span class="op">;</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> rsi<span class="op">,</span> rdi<span class="op">,</span> rbp<span class="op">,</span> rsp<span class="op">;</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> r8<span class="op">,</span> r9<span class="op">,</span> r10<span class="op">,</span> r11<span class="op">;</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> r12<span class="op">,</span> r13<span class="op">,</span> r14<span class="op">,</span> r15<span class="op">;</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a> <span class="co">// Instruction pointer</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> rip<span class="op">;</span></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a> <span class="co">// Flags register</span></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> rflags<span class="op">;</span></span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a> <span class="co">// Segment registers (mostly unused in 64-bit)</span></span>
<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint16_t</span> cs<span class="op">,</span> ds<span class="op">,</span> es<span class="op">,</span> fs<span class="op">,</span> gs<span class="op">,</span> ss<span class="op">;</span></span>
<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a> <span class="co">// Control registers</span></span>
<span id="cb5-19"><a href="#cb5-19" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> cr0<span class="op">,</span> cr2<span class="op">,</span> cr3<span class="op">,</span> cr4<span class="op">,</span> cr8<span class="op">;</span></span>
<span id="cb5-20"><a href="#cb5-20" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-21"><a href="#cb5-21" aria-hidden="true" tabindex="-1"></a> <span class="co">// Debug registers</span></span>
<span id="cb5-22"><a href="#cb5-22" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> dr0<span class="op">,</span> dr1<span class="op">,</span> dr2<span class="op">,</span> dr3<span class="op">;</span></span>
<span id="cb5-23"><a href="#cb5-23" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> dr6<span class="op">,</span> dr7<span class="op">;</span></span>
<span id="cb5-24"><a href="#cb5-24" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-25"><a href="#cb5-25" aria-hidden="true" tabindex="-1"></a> <span class="co">// SIMD registers: XMMn and YMMn alias the low 128/256 bits of ZMMn</span></span>
<span id="cb5-26"><a href="#cb5-26" aria-hidden="true" tabindex="-1"></a> <span class="kw">union</span> <span class="op">{</span></span>
<span id="cb5-27"><a href="#cb5-27" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint8_t</span> xmm<span class="op">[</span><span class="dv">16</span><span class="op">];</span> <span class="co">// SSE view: low 128 bits</span></span>
<span id="cb5-28"><a href="#cb5-28" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint8_t</span> ymm<span class="op">[</span><span class="dv">32</span><span class="op">];</span> <span class="co">// AVX view: low 256 bits</span></span>
<span id="cb5-29"><a href="#cb5-29" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint8_t</span> zmm<span class="op">[</span><span class="dv">64</span><span class="op">];</span> <span class="co">// AVX-512 view: all 512 bits</span></span>
<span id="cb5-30"><a href="#cb5-30" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span> simd<span class="op">[</span><span class="dv">32</span><span class="op">];</span> <span class="co">// 32 registers with AVX-512, 16 without</span></span>
<span id="cb5-31"><a href="#cb5-31" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-32"><a href="#cb5-32" aria-hidden="true" tabindex="-1"></a> <span class="co">// x87 FPU state (legacy)</span></span>
<span id="cb5-33"><a href="#cb5-33" aria-hidden="true" tabindex="-1"></a> <span class="dt">long</span> <span class="dt">double</span> st<span class="op">[</span><span class="dv">8</span><span class="op">];</span></span>
<span id="cb5-34"><a href="#cb5-34" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint16_t</span> fpu_control<span class="op">,</span> fpu_status<span class="op">,</span> fpu_tag<span class="op">;</span></span>
<span id="cb5-35"><a href="#cb5-35" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span></code></pre></div>
<h4 id="privilege-levels-and-protection"><strong>Privilege Levels and
Protection</strong></h4>
<p>x86-64 maintains the four privilege levels (rings) from IA-32:</p>
<ul>
<li><p><strong>Ring 0</strong>: Kernel mode (highest privilege)</p></li>
<li><p><strong>Ring 1-2</strong>: Rarely used (device drivers in some
systems)</p></li>
<li><p><strong>Ring 3</strong>: User mode (lowest privilege)</p></li>
</ul>
<pre class="assembly"><code>; Checking current privilege level
mov rax, cs
and rax, 3 ; Extract CPL (Current Privilege Level)
jz kernel_mode ; Jump if in ring 0</code></pre>
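<p>The same check can be modeled in C. The sketch below assumes only the
standard selector layout (bits 1:0 hold the privilege level, bit 2 the
table indicator, bits 15:3 the descriptor index); the type and function
names are hypothetical:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

typedef struct {
    unsigned index; /* descriptor table index (bits 15:3) */
    unsigned ti;    /* table indicator: 0 = GDT, 1 = LDT (bit 2) */
    unsigned rpl;   /* privilege level (bits 1:0); for CS this is the CPL */
} Selector;

/* Split a raw 16-bit segment selector into its three fields. */
Selector decode_selector(uint16_t raw) {
    Selector s = { raw &gt;&gt; 3, (raw &gt;&gt; 2) &amp; 1u, raw &amp; 3u };
    return s;
}</code></pre></div>
<p>For example, a CS value of 0x33 decodes to index 6 in the GDT with
privilege level 3, while 0x10 decodes to index 2 with privilege level
0.</p>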
<h3
id="register-architecture-general-purpose-segment-and-system-registers"><strong>1.3
Register Architecture: General Purpose, Segment, and System
Registers</strong></h3>
<h4 id="general-purpose-registers"><strong>General-Purpose
Registers</strong></h4>
<p>The x86-64 architecture provides 16 general-purpose registers, each
64 bits wide, with accessible sub-registers:</p>
<pre class="assembly"><code>; Register naming conventions and sub-registers
; 64-bit | 32-bit | 16-bit | 8-bit high | 8-bit low
; RAX | EAX | AX | AH | AL
; RBX | EBX | BX | BH | BL
; RCX | ECX | CX | CH | CL
; RDX | EDX | DX | DH | DL
; RSI | ESI | SI | - | SIL
; RDI | EDI | DI | - | DIL
; RBP | EBP | BP | - | BPL
; RSP | ESP | SP | - | SPL
; R8 | R8D | R8W | - | R8B
; R9 | R9D | R9W | - | R9B
; R10 | R10D | R10W | - | R10B
; R11 | R11D | R11W | - | R11B
; R12 | R12D | R12W | - | R12B
; R13 | R13D | R13W | - | R13B
; R14 | R14D | R14W | - | R14B
; R15 | R15D | R15W | - | R15B</code></pre>
<p><strong>Important Behavior</strong>: Writing a 32-bit sub-register
zero-extends the result into the full 64-bit register:</p>
<pre class="assembly"><code> mov rax, 0xFFFFFFFFFFFFFFFF
mov eax, 0x12345678 ; RAX now contains 0x0000000012345678
; But 8-bit and 16-bit operations don&#39;t zero-extend
mov rax, 0xFFFFFFFFFFFFFFFF
mov ax, 0x1234 ; RAX now contains 0xFFFFFFFFFFFF1234
mov al, 0x56 ; RAX now contains 0xFFFFFFFFFFFF1256</code></pre>
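<p>The three write rules can be captured in a small C model. This is a
sketch for reasoning about register contents, not a description of how any
CPU implements them; the helper names are made up for illustration:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Writing a 32-bit sub-register (e.g. EAX) clears bits 63:32. */
uint64_t write_reg32(uint64_t reg, uint32_t value) {
    (void)reg;  /* the previous contents do not survive */
    return (uint64_t)value;
}

/* Writing a 16-bit sub-register (e.g. AX) replaces bits 15:0 only. */
uint64_t write_reg16(uint64_t reg, uint16_t value) {
    return (reg &amp; ~0xFFFFULL) | value;
}

/* Writing a low 8-bit sub-register (e.g. AL) replaces bits 7:0 only. */
uint64_t write_reg8(uint64_t reg, uint8_t value) {
    return (reg &amp; ~0xFFULL) | value;
}</code></pre></div>
<p>Feeding in the values from the listing reproduces the same results:
the 32-bit write leaves 0x0000000012345678, while the 16-bit and 8-bit
writes leave the upper bits untouched.</p>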
<h4 id="special-purpose-registers"><strong>Special-Purpose
Registers</strong></h4>
<pre class="assembly"><code>; RFLAGS register (selected bits)
; Bit | Name | Description
; 0 | CF | Carry Flag
; 2 | PF | Parity Flag
; 4 | AF | Auxiliary Carry Flag
; 6 | ZF | Zero Flag
; 7 | SF | Sign Flag
; 8 | TF | Trap Flag
; 9 | IF | Interrupt Enable Flag
; 10 | DF | Direction Flag
; 11 | OF | Overflow Flag
; 12-13| IOPL| I/O Privilege Level
; 14 | NT | Nested Task
; 16 | RF | Resume Flag
; 17 | VM | Virtual-8086 Mode
; 18 | AC | Alignment Check
; 19 | VIF | Virtual Interrupt Flag
; 20 | VIP | Virtual Interrupt Pending
; 21 | ID | CPUID available
pushfq ; Push RFLAGS
pop rax ; Read RFLAGS into RAX
or rax, 0x200 ; Set IF (enable interrupts)
push rax
popfq ; Write modified RFLAGS back (IF changes only when CPL &lt;= IOPL)</code></pre>
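<p>When a saved RFLAGS value is inspected from C (for example in a
debugger script or an emulator), the bit positions in the table translate
directly into masks. The macro and function names below are illustrative
assumptions, not taken from any particular header:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Masks for a few RFLAGS bits, using the bit numbers from the table */
#define RFLAGS_CF (1ULL &lt;&lt; 0)
#define RFLAGS_ZF (1ULL &lt;&lt; 6)
#define RFLAGS_SF (1ULL &lt;&lt; 7)
#define RFLAGS_IF (1ULL &lt;&lt; 9)
#define RFLAGS_OF (1ULL &lt;&lt; 11)

/* True if any bit of mask is set in rflags. */
bool rflags_test(uint64_t rflags, uint64_t mask) {
    return (rflags &amp; mask) != 0;
}

/* IOPL is a 2-bit field occupying bits 13:12. */
unsigned rflags_iopl(uint64_t rflags) {
    return (unsigned)((rflags &gt;&gt; 12) &amp; 3);
}</code></pre></div>
<p>Note that <code>RFLAGS_IF</code> evaluates to 0x200, the constant used
in the assembly listing.</p>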
<h4 id="segment-registers-in-64-bit-mode"><strong>Segment Registers in
64-bit Mode</strong></h4>
<p>While segmentation is largely disabled in 64-bit mode, segment
registers still serve important purposes:</p>
<pre class="assembly"><code>; CS (Code Segment) - determines execution mode and privilege
; SS (Stack Segment) - largely ignored, but checked for NULL
; DS, ES - completely ignored in 64-bit mode
; FS, GS - used for thread-local storage and special OS purposes
; Typical FS/GS usage in Linux
mov rax, fs:[0] ; Access thread-local storage
; Windows uses GS for TEB (Thread Environment Block)
mov rax, gs:[0x60] ; Get PEB pointer from TEB</code></pre>
<h4 id="control-registers"><strong>Control Registers</strong></h4>
<p>Control registers govern fundamental CPU behavior:</p>
<pre class="assembly"><code>; CR0 - System control flags
; Bit 0 (PE): Protected Mode Enable
; Bit 16 (WP): Write Protect
; Bit 31 (PG): Paging Enable
; CR3 - Page Directory Base (top-level page table pointer)
mov rax, cr3 ; Read current page table base
and rax, ~0xFFF ; Mask out PCID and flags
mov cr3, rax ; Reloading CR3 flushes non-global TLB entries
; CR4 - Architecture extensions
; Bit 5 (PAE): Physical Address Extension
; Bit 7 (PGE): Page Global Enable
; Bit 10 (OSXMMEXCPT): OS SIMD exception support
; Bit 18 (OSXSAVE): XSAVE enabled</code></pre>
<h4 id="model-specific-registers-msrs"><strong>Model-Specific Registers
(MSRs)</strong></h4>
<p>MSRs provide access to processor-specific features:</p>
<pre class="assembly"><code>; Reading MSRs
mov ecx, 0xC0000080 ; EFER MSR (Extended Feature Enable)
rdmsr ; Read MSR into EDX:EAX
; Bit 8 (LME): Long Mode Enable
; Bit 10 (LMA): Long Mode Active
; Bit 11 (NXE): No-Execute Enable
; Writing MSRs (privileged operation)
mov ecx, 0x277 ; IA32_PAT MSR (Page Attribute Table)
mov edx, 0x00070406
mov eax, 0x00070406
wrmsr ; Write EDX:EAX to MSR</code></pre>
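<p>Because RDMSR and WRMSR split the 64-bit MSR value across EDX (high
half) and EAX (low half), code that handles MSR values usually recombines
them. A hedged C sketch of that bookkeeping, with hypothetical helper
names and the EFER bit positions listed above:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

#define EFER_LME (1ULL &lt;&lt; 8)   /* Long Mode Enable */
#define EFER_LMA (1ULL &lt;&lt; 10)  /* Long Mode Active */
#define EFER_NXE (1ULL &lt;&lt; 11)  /* No-Execute Enable */

/* Recombine the EDX:EAX pair produced by RDMSR into one 64-bit value. */
uint64_t msr_combine(uint32_t edx, uint32_t eax) {
    return ((uint64_t)edx &lt;&lt; 32) | eax;
}

/* Split a 64-bit value into the EDX:EAX pair expected by WRMSR. */
void msr_split(uint64_t value, uint32_t *edx, uint32_t *eax) {
    *edx = (uint32_t)(value &gt;&gt; 32);
    *eax = (uint32_t)value;
}

/* True if the given EFER value reports long mode as active. */
bool efer_long_mode_active(uint64_t efer) {
    return (efer &amp; EFER_LMA) != 0;
}</code></pre></div>
<p>The two 0x00070406 halves written in the listing therefore form the
64-bit PAT value 0x0007040600070406.</p>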
<h4 id="compiler-register-usage-conventions"><strong>Compiler Register
Usage Conventions</strong></h4>
<p>Different ABIs specify register usage:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co">// System V AMD64 ABI (Linux, macOS, BSD)</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Function parameters: RDI, RSI, RDX, RCX, R8, R9</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="co">// Return value: RAX (RDX:RAX for 128-bit)</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a><span class="co">// Callee-saved: RBX, RBP, R12-R15</span></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11</span></span>
<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a><span class="co">// Microsoft x64 ABI (Windows)</span></span>
<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a><span class="co">// Function parameters: RCX, RDX, R8, R9</span></span>
<span id="cb13-9"><a href="#cb13-9" aria-hidden="true" tabindex="-1"></a><span class="co">// Return value: RAX</span></span>
<span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a><span class="co">// Callee-saved: RBX, RBP, RDI, RSI, RSP, R12-R15</span></span>
<span id="cb13-11"><a href="#cb13-11" aria-hidden="true" tabindex="-1"></a><span class="co">// Caller-saved: RAX, RCX, RDX, R8-R11</span></span></code></pre></div>
<h3 id="memory-models-and-addressing"><strong>1.4 Memory Models and
Addressing</strong></h3>
<h4 id="virtual-address-space"><strong>Virtual Address
Space</strong></h4>
<p>The x86-64 architecture defines a 64-bit virtual address format,
though current implementations translate only 48 bits (4-level paging) or
57 bits (5-level paging):</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Canonical address form (48-bit implementation)</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 63:48 must be copies of bit 47</span></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="co">// Valid ranges:</span></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a><span class="co">// 0x0000000000000000 - 0x00007FFFFFFFFFFF (user space)</span></span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a><span class="co">// 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF (kernel space)</span></span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> is_canonical_address<span class="op">(</span><span class="dt">uint64_t</span> addr<span class="op">)</span> <span class="op">{</span></span>
<span id="cb14-8"><a href="#cb14-8" aria-hidden="true" tabindex="-1"></a> <span class="co">// Check if address is in canonical form</span></span>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> high_bits <span class="op">=</span> addr <span class="op">&gt;&gt;</span> <span class="dv">47</span><span class="op">;</span></span>
<span id="cb14-10"><a href="#cb14-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> high_bits <span class="op">==</span> <span class="dv">0</span> <span class="op">||</span> high_bits <span class="op">==</span> <span class="bn">0x1FFFF</span><span class="op">;</span></span>
<span id="cb14-11"><a href="#cb14-11" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
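<p>The check above is hard-wired to 48-bit addresses. A possible
generalization, sketched here under the assumption that the caller knows
the implemented address width (48, or 57 with 5-level paging), takes the
width as a parameter; the name <code>is_canonical</code> is
illustrative:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* va_bits: implemented virtual address width, 1..64 (48 or 57 in practice).
 * An address is canonical when bits 63 down to va_bits-1 are all equal. */
bool is_canonical(uint64_t addr, unsigned va_bits) {
    uint64_t upper = addr &gt;&gt; (va_bits - 1);          /* sign bit and above */
    uint64_t all_ones = UINT64_MAX &gt;&gt; (va_bits - 1); /* same field, all set */
    return upper == 0 || upper == all_ones;
}</code></pre></div>
<p>With <code>va_bits = 48</code> the constant <code>all_ones</code>
evaluates to 0x1FFFF, matching the original function. An address such as
0x0000800000000000 is non-canonical for 48 bits but canonical for 57.</p>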
<h4 id="memory-segmentation-in-64-bit-mode"><strong>Memory Segmentation
in 64-bit Mode</strong></h4>
<p>Segmentation is largely disabled in 64-bit mode:</p>
<pre class="assembly"><code>; Segment registers in 64-bit mode:
; - Base addresses forced to 0 (except FS/GS)
; - Limits not checked (except for FS/GS in some cases)
; - CS still determines privilege level and operating mode
; Setting up FS base for thread-local storage
mov ecx, 0xC0000100 ; FS_BASE MSR
mov edx, 0 ; High 32 bits of base
mov eax, thread_data ; Low 32 bits of base
wrmsr
; Now FS-relative addressing uses thread_data as base
mov rax, fs:[0] ; Load from thread_data + 0</code></pre>
<h4 id="addressing-modes"><strong>Addressing Modes</strong></h4>
<p>x86-64 supports complex addressing modes with the general form:</p>
<p><strong>[base + index*scale + displacement]</strong></p>
<pre class="assembly"><code>; Direct addressing
mov rax, [0x401000] ; Absolute address (rare in 64-bit)
; Register indirect
mov rax, [rbx] ; Address in RBX
; Register + displacement
mov rax, [rbx + 8] ; RBX + 8
mov rax, [rbx - 16] ; RBX - 16
; Indexed addressing
mov rax, [rbx + rcx*8] ; RBX + RCX*8 (scale: 1, 2, 4, or 8)
; Full complex addressing
mov rax, [rbx + rcx*8 + 16] ; base + index*scale + displacement</code></pre>
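<p>The general form can be written out as plain arithmetic. The following
C sketch (function name hypothetical) computes the effective address the
way the addressing mode defines it, with the displacement sign-extended to
64 bits and the sum wrapping modulo 2^64:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Effective address = base + index * scale + displacement.
 * scale must be 1, 2, 4, or 8; disp is a signed 32-bit displacement. */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int32_t disp) {
    return base + index * scale + (uint64_t)(int64_t)disp;
}</code></pre></div>
<p>For instance, with RBX = 0x1000 and RCX = 3, the operand
<code>[rbx + rcx*8 + 16]</code> refers to address 0x1028, and
<code>[rbx - 16]</code> refers to 0xFF0.</p>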
<hr />
<h2
id="chapter-2-x86-64-instruction-set-architecture-fundamentals"><strong>Chapter
2: x86-64 Instruction Set Architecture Fundamentals</strong></h2>
<h3 id="instruction-format-and-prefixes-rex-vex-evex"><strong>2.1
Instruction Format and Prefixes (REX, VEX, EVEX)</strong></h3>
<h4 id="basic-instruction-format"><strong>Basic Instruction
Format</strong></h4>
<p>x86-64 instructions consist of several optional and mandatory
components that can create instructions from 1 to 15 bytes in
length:</p>
<p>[Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement]
[Immediate]</p>
<p>Let’s examine each component:</p>
<pre class="assembly"><code>; Example: mov rax, [rbx + rcx*8 + 0x1000]
; Encoding: 48 8B 84 CB 00 10 00 00
; 48 - REX.W prefix (64-bit operand)
; 8B - Opcode (MOV r64, r/m64)
; 84 - ModR/M byte (mod=10, reg=000, r/m=100)
; CB - SIB byte (scale=11, index=001, base=011)
; 00 10 00 00 - 32-bit displacement (0x1000)</code></pre>
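<p>The field breakdown of the ModR/M and SIB bytes can be reproduced with
a few shifts and masks. This is a minimal decoding sketch in C; the struct
and function names are assumptions made for the example:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

typedef struct { unsigned mod, reg, rm; } ModRM;
typedef struct { unsigned scale, index, base; } SIB;

/* ModR/M layout: mod (bits 7:6), reg (bits 5:3), r/m (bits 2:0) */
ModRM decode_modrm(uint8_t b) {
    ModRM m = { (b &gt;&gt; 6) &amp; 3u, (b &gt;&gt; 3) &amp; 7u, b &amp; 7u };
    return m;
}

/* SIB layout: scale (bits 7:6, stored as log2), index (5:3), base (2:0) */
SIB decode_sib(uint8_t b) {
    SIB s = { 1u &lt;&lt; ((b &gt;&gt; 6) &amp; 3), (b &gt;&gt; 3) &amp; 7u, b &amp; 7u };
    return s;
}</code></pre></div>
<p>Decoding 0x84 gives mod=2, reg=0 (RAX), r/m=4 (SIB follows), and
decoding 0xCB gives a scale factor of 8, index=1 (RCX), base=3 (RBX), in
agreement with the annotated encoding.</p>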
<h4 id="legacy-prefixes"><strong>Legacy Prefixes</strong></h4>
<p>Legacy prefixes modify instruction behavior. They can appear in any
order relative to each other, but must precede any REX prefix:</p>
<pre class="assembly"><code>; Prefix groups (max one from each group):
; Group 1: Lock and repeat
lock add [rax], rbx ; F0 - LOCK prefix for atomic operations
rep movsb ; F3 - REP prefix for string operations
; Group 2: Segment override (only FS and GS have an effect in 64-bit mode)
mov rax, fs:[rbx] ; 64 - FS segment override
mov rax, gs:[0] ; 65 - GS segment override
; Group 3: Operand size override
mov ax, bx ; 66 - 16-bit operation in 64-bit mode
; Group 4: Address size override
mov rax, [ebx] ; 67 - 32-bit addressing in 64-bit mode</code></pre>
<h4 id="rex-prefix"><strong>REX Prefix</strong></h4>
<p>The REX (Register Extension) prefix is crucial for 64-bit operations
and accessing extended registers:</p>
<p>REX = 0100WRXB</p>
<ul>
<li><p>W: 64-bit operand size</p></li>
<li><p>R: Extension of ModR/M reg field</p></li>
<li><p>X: Extension of SIB index field</p></li>
<li><p>B: Extension of ModR/M r/m field, SIB base, or opcode
reg</p></li>
</ul>
<pre class="assembly"><code>; REX prefix examples
mov r8, rax ; 49 89 C0 - REX.W + REX.B (r8 in r/m field)
mov rax, r9 ; 4C 89 C8 - REX.W + REX.R (r9 in reg field)
mov r10, r11 ; 4D 89 DA - REX.W + REX.R + REX.B
mov eax, ebx ; 89 D8 - No REX (32-bit)
mov rax, rbx ; 48 89 D8 - REX.W (64-bit)</code></pre>
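<p>To see how these bytes arise, the sketch below assembles the
register-to-register form <code>MOV r/m64, r64</code> (opcode 0x89) by
hand. It is a toy encoder under the assumption that registers are numbered
0-15 in the usual order (RAX=0, RBX=3, R8=8, and so on); the function
names are hypothetical:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Build a REX prefix: 0100WRXB */
uint8_t rex_byte(int w, int r, int x, int b) {
    return (uint8_t)(0x40 | (w ? 8 : 0) | (r ? 4 : 0) | (x ? 2 : 0) | (b ? 1 : 0));
}

/* Encode "mov dst, src" for two 64-bit registers numbered 0-15.
 * Opcode 0x89 places src in ModR/M.reg and dst in ModR/M.r/m. */
void encode_mov_r64_r64(unsigned dst, unsigned src, uint8_t out[3]) {
    out[0] = rex_byte(1, (int)(src &gt;&gt; 3), 0, (int)(dst &gt;&gt; 3)); /* W=1, R and B from bit 3 */
    out[1] = 0x89;
    out[2] = (uint8_t)(0xC0 | ((src &amp; 7) &lt;&lt; 3) | (dst &amp; 7));   /* mod=11: register direct */
}</code></pre></div>
<p>Calling <code>encode_mov_r64_r64(10, 11, out)</code> produces the bytes
4D 89 DA shown above for <code>mov r10, r11</code>.</p>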
<h4 id="vex-prefix-avx"><strong>VEX Prefix (AVX)</strong></h4>
<p>VEX encoding enables three-operand forms and accesses YMM
registers:</p>
<pre class="assembly"><code>; 2-byte VEX: C5 [R vvvv L pp]
; 3-byte VEX: C4 [RXB map] [W vvvv L pp]
vaddps ymm0, ymm1, ymm2 ; C5 F4 58 C2
; Non-destructive: ymm0 = ymm1 + ymm2
; Compare with legacy SSE:
addps xmm0, xmm1 ; 0F 58 C1
; Destructive: xmm0 = xmm0 + xmm1</code></pre>
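<p>The second byte of the 2-byte VEX prefix packs four fields, two of
which (R and vvvv) are stored inverted. A small C sketch of that packing
follows; the function name and parameter conventions are assumptions for
illustration:</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Second byte of a 2-byte VEX prefix: [R vvvv L pp], R and vvvv inverted.
 *   r    : 1 if ModR/M.reg names register 8-15, else 0
 *   vvvv : number of the first source register (0-15)
 *   l    : 0 = 128-bit, 1 = 256-bit
 *   pp   : implied legacy prefix (0 = none, 1 = 66, 2 = F3, 3 = F2) */
uint8_t vex2_byte(unsigned r, unsigned vvvv, unsigned l, unsigned pp) {
    return (uint8_t)(((~r &amp; 1u) &lt;&lt; 7) | ((~vvvv &amp; 0xFu) &lt;&lt; 3) |
                     ((l &amp; 1u) &lt;&lt; 2) | (pp &amp; 3u));
}</code></pre></div>
<p>For <code>vaddps ymm0, ymm1, ymm2</code> the inputs are r=0, vvvv=1
(ymm1), l=1, pp=0, which gives 0xF4, the byte following C5 in the
listing.</p>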
<h4 id="evex-prefix-avx-512"><strong>EVEX Prefix (AVX-512)</strong></h4>
<p>EVEX extends VEX with masking, broadcasting, and 512-bit
operations:</p>
<pre class="assembly"><code>; 4-byte EVEX: 62 [RXBR&#39;00mm] [Wvvvv1pp] [zLLb Vaaa]
vaddps zmm0{k1}, zmm1, zmm2 ; Masked addition
vbroadcastss zmm0, [rax] ; Broadcast single value
vaddps zmm0, zmm1, [rax]{1to16} ; Memory broadcast</code></pre>
<h4 id="compiler-encoding-decisions"><strong>Compiler Encoding
Decisions</strong></h4>
<p>Compilers must choose optimal encodings:</p>
<div class="sourceCode" id="cb22"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler&#39;s encoding selection logic</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> select_encoding<span class="op">(</span>Instruction<span class="op">*</span> insn<span class="op">)</span> <span class="op">{</span></span>
<span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> <span class="op">(</span>insn<span class="op">-&gt;</span>needs_evex_features<span class="op">())</span> <span class="op">{</span></span>
<span id="cb22-4"><a href="#cb22-4" aria-hidden="true" tabindex="-1"></a> <span class="co">// EVEX is required for AVX-512 features (masking, broadcast, ZMM, registers 16-31)</span></span>
<span id="cb22-5"><a href="#cb22-5" aria-hidden="true" tabindex="-1"></a> emit_evex<span class="op">(</span>insn<span class="op">);</span></span>
<span id="cb22-6"><a href="#cb22-6" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>insn<span class="op">-&gt;</span>is_vector<span class="op">()</span> <span class="op">&amp;&amp;</span> insn<span class="op">-&gt;</span>has_avx<span class="op">())</span> <span class="op">{</span></span>
<span id="cb22-7"><a href="#cb22-7" aria-hidden="true" tabindex="-1"></a> <span class="co">// Prefer VEX for AVX instructions; VEX carries the REX bits itself</span></span>
<span id="cb22-8"><a href="#cb22-8" aria-hidden="true" tabindex="-1"></a> emit_vex<span class="op">(</span>insn<span class="op">);</span></span>
<span id="cb22-9"><a href="#cb22-9" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>insn<span class="op">-&gt;</span>needs_rex<span class="op">())</span> <span class="op">{</span></span>
<span id="cb22-10"><a href="#cb22-10" aria-hidden="true" tabindex="-1"></a> <span class="co">// Legacy encoding: REX only for extended registers or 64-bit ops</span></span>
<span id="cb22-11"><a href="#cb22-11" aria-hidden="true" tabindex="-1"></a> emit_rex<span class="op">(</span>insn<span class="op">);</span></span>
<span id="cb22-12"><a href="#cb22-12" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb22-13"><a href="#cb22-13" aria-hidden="true" tabindex="-1"></a> <span class="co">// Minimize instruction size when possible</span></span>
<span id="cb22-14"><a href="#cb22-14" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h3 id="data-movement-instructions"><strong>2.2 Data Movement
Instructions</strong></h3>
<h4 id="basic-move-instructions"><strong>Basic Move
Instructions</strong></h4>
<p>The MOV instruction family forms the foundation of data movement:</p>
<pre class="assembly"><code>; Register to register
mov rax, rbx ; 64-bit
mov eax, ebx ; 32-bit (zero-extends to 64-bit)
mov ax, bx ; 16-bit (preserves upper bits)
mov al, bl ; 8-bit (preserves upper bits)
; Immediate to register
mov rax, 0x123456789 ; 64-bit immediate (10-byte encoding)
mov eax, 0x12345678 ; 32-bit immediate (5-byte encoding)
mov rax, -1 ; Fits a sign-extended imm32 (7-byte encoding, not the 10-byte imm64 form)
; Memory operations
mov rax, [rbx] ; Load
mov [rbx], rax ; Store
mov qword [rbx], 100 ; Immediate to memory</code></pre>
<h4 id="zero-and-sign-extension"><strong>Zero and Sign
Extension</strong></h4>
<pre class="assembly"><code>; Zero extension
movzx eax, byte [rbx] ; Zero-extend byte to 32-bit
movzx rax, word [rbx] ; Zero-extend word to 64-bit
; Sign extension
movsx eax, byte [rbx] ; Sign-extend byte to 32-bit
movsxd rax, dword [rbx] ; Sign-extend dword to 64-bit
; Implicit zero extension with 32-bit ops
mov eax, [rbx] ; Zeros bits 63:32</code></pre>
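<p>The difference between zero and sign extension can be modelled in portable C. The sketch below is illustrative only: <code>zero_extend</code> and <code>sign_extend</code> are hypothetical helper names, and the width is limited to 32 bits to keep every conversion well defined.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Zero extension (MOVZX-like): keep the low `bits` bits, clear the rest. */
uint64_t zero_extend(uint32_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    return value & ((1ULL << bits) - 1);
}

/* Sign extension (MOVSX/MOVSXD-like): copy bit `bits - 1` into all upper bits. */
int64_t sign_extend(uint32_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    uint64_t sign = 1ULL << (bits - 1);            /* position of the sign bit */
    uint64_t low = value & ((sign << 1) - 1);      /* keep only the low `bits` bits */
    /* Flipping the sign bit and subtracting it back yields the signed value. */
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

<p>With these helpers, the byte <code>0xF0</code> zero-extends to 240 and sign-extends to -16, which is the difference between <code>movzx</code> and <code>movsx</code> on the same input.</p>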
<h4 id="conditional-moves"><strong>Conditional Moves</strong></h4>
<p>Conditional moves eliminate branches for simple selections:</p>
<pre class="assembly"><code>; cmovcc reg, reg/mem
cmp rax, rbx
cmovl rax, rcx ; Move if less (signed)
cmovb rax, rcx ; Move if below (unsigned)
; Compiler pattern for: x = (a &lt; b) ? c : d
cmp rdi, rsi ; Compare a, b
mov rax, r8 ; rax = d
cmovl rax, rdx ; rax = c if a &lt; b</code></pre>
<h4 id="special-data-movement"><strong>Special Data
Movement</strong></h4>
<pre class="assembly"><code>; Exchange
xchg rax, rbx ; Register exchange (no memory access, so no locking involved)
xchg [mem], rax ; Memory form is always atomic (implicit LOCK prefix)
; Load effective address
lea rax, [rbx + rcx*8 + 16] ; Address calculation
lea rdi, [rip + data] ; RIP-relative addressing
; Stack operations
push rax ; RSP -= 8; [RSP] = RAX
pop rbx ; RBX = [RSP]; RSP += 8
; Special moves
bswap rax ; Byte swap (endianness conversion)
cmpxchg [rbx], rcx ; If RAX == [RBX] then [RBX] = RCX, else RAX = [RBX] (atomic only with LOCK)</code></pre>
<h4 id="compiler-optimization-patterns"><strong>Compiler Optimization
Patterns</strong></h4>
<div class="sourceCode" id="cb27"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Structure copy optimization</span></span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Point <span class="op">{</span> <span class="dt">long</span> x<span class="op">,</span> y<span class="op">,</span> z<span class="op">;</span> <span class="op">};</span></span>
<span id="cb27-3"><a href="#cb27-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-4"><a href="#cb27-4" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive approach: multiple loads/stores</span></span>
<span id="cb27-5"><a href="#cb27-5" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> copy_naive<span class="op">(</span><span class="kw">struct</span> Point<span class="op">*</span> dst<span class="op">,</span> <span class="kw">struct</span> Point<span class="op">*</span> src<span class="op">)</span> <span class="op">{</span></span>
<span id="cb27-6"><a href="#cb27-6" aria-hidden="true" tabindex="-1"></a> dst<span class="op">-&gt;</span>x <span class="op">=</span> src<span class="op">-&gt;</span>x<span class="op">;</span></span>
<span id="cb27-7"><a href="#cb27-7" aria-hidden="true" tabindex="-1"></a> dst<span class="op">-&gt;</span>y <span class="op">=</span> src<span class="op">-&gt;</span>y<span class="op">;</span></span>
<span id="cb27-8"><a href="#cb27-8" aria-hidden="true" tabindex="-1"></a> dst<span class="op">-&gt;</span>z <span class="op">=</span> src<span class="op">-&gt;</span>z<span class="op">;</span></span>
<span id="cb27-9"><a href="#cb27-9" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb27-10"><a href="#cb27-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-11"><a href="#cb27-11" aria-hidden="true" tabindex="-1"></a><span class="co">// Optimized assembly:</span></span>
<span id="cb27-12"><a href="#cb27-12" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, [rsi]</span></span>
<span id="cb27-13"><a href="#cb27-13" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rdx, [rsi+8]</span></span>
<span id="cb27-14"><a href="#cb27-14" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rcx, [rsi+16]</span></span>
<span id="cb27-15"><a href="#cb27-15" aria-hidden="true" tabindex="-1"></a><span class="co">// mov [rdi], rax</span></span>
<span id="cb27-16"><a href="#cb27-16" aria-hidden="true" tabindex="-1"></a><span class="co">// mov [rdi+8], rdx</span></span>
<span id="cb27-17"><a href="#cb27-17" aria-hidden="true" tabindex="-1"></a><span class="co">// mov [rdi+16], rcx</span></span>
<span id="cb27-18"><a href="#cb27-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-19"><a href="#cb27-19" aria-hidden="true" tabindex="-1"></a><span class="co">// Or with SIMD (24 bytes = one 16-byte vector plus one 8-byte scalar):</span></span>
<span id="cb27-20"><a href="#cb27-20" aria-hidden="true" tabindex="-1"></a><span class="co">// movups xmm0, [rsi]</span></span>
<span id="cb27-21"><a href="#cb27-21" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, [rsi+16]</span></span>
<span id="cb27-22"><a href="#cb27-22" aria-hidden="true" tabindex="-1"></a><span class="co">// movups [rdi], xmm0</span></span>
<span id="cb27-23"><a href="#cb27-23" aria-hidden="true" tabindex="-1"></a><span class="co">// mov [rdi+16], rax</span></span></code></pre></div>
<h3 id="arithmetic-and-logic-operations"><strong>2.3 Arithmetic and
Logic Operations</strong></h3>
<h4 id="integer-arithmetic"><strong>Integer Arithmetic</strong></h4>
<pre class="assembly"><code>; Addition and subtraction
add rax, rbx ; rax += rbx, sets flags
adc rax, rbx ; rax += rbx + CF (multi-precision)
sub rax, rbx ; rax -= rbx
sbb rax, rbx ; rax -= rbx + CF
; Increment/decrement (don&#39;t affect CF)
inc rax ; rax++
dec rbx ; rbx--
; Multiplication
mul rbx ; RDX:RAX = RAX * RBX (unsigned)
imul rbx ; RDX:RAX = RAX * RBX (signed)
imul rax, rbx ; RAX = RAX * RBX (truncated)
imul rax, rbx, 5 ; RAX = RBX * 5
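; Division
xor edx, edx ; Zero the high half of the dividend (unsigned case)
div rbx ; RAX = RDX:RAX / RBX, RDX = remainder
cqo ; Sign-extend RAX into RDX:RAX (signed case)
idiv rbx ; Signed division, same register usage as DIV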
; LEA for arithmetic
lea rax, [rbx + rcx] ; Addition without flags
lea rax, [rbx + rbx*4] ; Multiply by 5
lea rax, [rbx + rbx*2 + 7] ; rax = rbx*3 + 7</code></pre>
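<p>The <code>add</code>/<code>adc</code> pairing above is what multi-precision addition reduces to: the first limb uses <code>add</code>, and every later limb uses <code>adc</code> to absorb the carry. A minimal C model of that carry chain follows; <code>add_limbs</code> is a hypothetical name, and limbs are assumed to be stored least significant first.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds two n-limb integers (least significant limb first).
   Returns the carry out of the top limb, as CF would hold after the last ADC. */
unsigned add_limbs(uint64_t *sum, const uint64_t *a, const uint64_t *b, size_t n) {
    assert(sum != NULL && a != NULL && b != NULL);
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;     /* wraps only when a[i] is all ones and carry is 1 */
        unsigned c1 = t < a[i];
        uint64_t s = t + b[i];
        unsigned c2 = s < t;           /* unsigned wraparound signals a carry */
        sum[i] = s;
        carry = c1 | c2;               /* at most one of the two can be set */
    }
    return carry;
}
```

<p>A compiler targeting x86-64 can often turn a loop of this shape into an <code>add</code> followed by a chain of <code>adc</code> instructions, although whether it does so depends on the compiler and optimization level.</p>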
<h4 id="logical-operations"><strong>Logical Operations</strong></h4>
<pre class="assembly"><code>; Bitwise operations
and rax, rbx ; Bitwise AND
or rax, rbx ; Bitwise OR
xor rax, rbx ; Bitwise XOR
not rax ; Bitwise NOT
; Testing without modifying
test rax, rbx ; AND but only set flags
test rax, rax ; Common idiom to check zero/sign
; Bit manipulation
bt rax, 5 ; Test bit 5
bts rax, 5 ; Test and set bit 5
btr rax, 5 ; Test and reset bit 5
btc rax, 5 ; Test and complement bit 5</code></pre>
<h4 id="flag-manipulation"><strong>Flag Manipulation</strong></h4>
<pre class="assembly"><code>; Direct flag operations
clc ; Clear carry flag
stc ; Set carry flag
cmc ; Complement carry flag
cld ; Clear direction flag
std ; Set direction flag
; Flag-based byte set
cmp rax, rbx
setl al ; AL = 1 if less, 0 otherwise
sete al ; AL = 1 if equal
; Compiler pattern for: bool result = (a &lt; b)
cmp rdi, rsi
setl al
movzx eax, al ; Zero-extend to full register</code></pre>
<h3 id="bit-manipulation-and-shifts"><strong>2.4 Bit Manipulation and
Shifts</strong></h3>
<h4 id="shift-operations"><strong>Shift Operations</strong></h4>
<pre class="assembly"><code>; Logical shifts (fill with zeros)
shl rax, 5 ; Shift left by 5
shr rax, cl ; Shift right by CL bits
; Arithmetic shifts (SAR preserves the sign bit)
sal rax, 5 ; Alias of SHL (identical encoding)
sar rax, cl ; Arithmetic right shift
; Rotates
rol rax, 8 ; Rotate left
ror rax, cl ; Rotate right
rcl rax, 1 ; Rotate through carry left
rcr rax, 1 ; Rotate through carry right
; Double-precision shifts
shld rax, rbx, 5 ; Shift RAX left, fill from RBX
shrd rax, rbx, cl ; Shift RAX right, fill from RBX</code></pre>
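<p>C has no rotate or double-shift operators, so these instructions are usually reached through idioms that compilers recognize. The helpers below are a sketch with hypothetical names (<code>rotl64</code>, <code>rotr64</code>, <code>shld64</code>); the masking avoids the undefined shift by 64.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left, as ROL does; the count is reduced modulo 64. */
uint64_t rotl64(uint64_t x, unsigned n) {
    n &= 63;
    return (x << n) | (x >> ((64 - n) & 63));
}

/* Rotate right, as ROR does. */
uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* SHLD model: shift hi left by n, filling the vacated low bits from the top of lo. */
uint64_t shld64(uint64_t hi, uint64_t lo, unsigned n) {
    assert(n < 64);                /* the hardware masks the count to 6 bits */
    if (n == 0)
        return hi;                 /* a zero count leaves the destination unchanged */
    return (hi << n) | (lo >> (64 - n));
}
```

<p>Mainstream compilers commonly compile the rotate idiom to a single <code>rol</code> or <code>ror</code>, but that is an optimization rather than a guarantee.</p>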
<h4 id="bit-scanning-and-manipulation"><strong>Bit Scanning and
Manipulation</strong></h4>
<pre class="assembly"><code>; Find first set bit
bsf rax, rbx ; Scan forward (LSB to MSB)
bsr rax, rbx ; Scan reverse (MSB to LSB)
; Leading/trailing zeros (LZCNT needs the LZCNT/ABM feature, TZCNT needs BMI1)
lzcnt rax, rbx ; Count leading zeros
tzcnt rax, rbx ; Count trailing zeros
; Population count
popcnt rax, rbx ; Count set bits
; BMI extensions
andn rax, rbx, rcx ; RAX = ~RBX &amp; RCX
blsi rax, rbx ; Extract lowest set bit
blsr rax, rbx ; Reset lowest set bit
blsmsk rax, rbx ; Mask up to lowest set bit
; BMI2 advanced operations
pdep rax, rbx, rcx ; Parallel bit deposit
pext rax, rbx, rcx ; Parallel bit extract
bzhi rax, rbx, rcx ; Zero high bits
mulx rdx, rax, rbx ; RDX:RAX = RDX * RBX (unsigned), flags unchanged</code></pre>
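<p>Several of these instructions correspond to short bit tricks that compilers pattern-match. The following portable C sketch (all <code>_soft</code> names are hypothetical) shows the identities behind <code>bsf</code>, <code>tzcnt</code>, <code>popcnt</code>, <code>blsi</code>, <code>blsr</code>, <code>blsmsk</code>, and <code>andn</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* BSF: index of the lowest set bit. The destination is undefined for a zero
   source, so this model rejects it. */
unsigned bsf_soft(uint64_t x) {
    assert(x != 0);
    unsigned index = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        index++;
    }
    return index;
}

/* TZCNT: like BSF, but defined for zero (returns the operand width). */
unsigned tzcnt_soft(uint64_t x) { return x == 0 ? 64 : bsf_soft(x); }

/* BLSI, BLSR, BLSMSK, ANDN as plain expressions. */
uint64_t blsi_soft(uint64_t x)   { return x & ((uint64_t)0 - x); } /* isolate lowest set bit */
uint64_t blsr_soft(uint64_t x)   { return x & (x - 1); }           /* clear lowest set bit */
uint64_t blsmsk_soft(uint64_t x) { return x ^ (x - 1); }           /* mask up to lowest set bit */
uint64_t andn_soft(uint64_t a, uint64_t b) { return ~a & b; }

/* POPCNT by clearing one set bit per iteration. */
unsigned popcnt_soft(uint64_t x) {
    unsigned count = 0;
    while (x != 0) {
        x = blsr_soft(x);
        count++;
    }
    return count;
}
```

<p>For the input 40 (binary <code>101000</code>), <code>blsi_soft</code> gives 8, <code>blsr_soft</code> gives 32, and <code>blsmsk_soft</code> gives 15.</p>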
<h4 id="compiler-bit-manipulation-patterns"><strong>Compiler Bit
Manipulation Patterns</strong></h4>
<div class="sourceCode" id="cb33"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Efficient bit field extraction</span></span>
<span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a><span class="dt">uint64_t</span> extract_bits<span class="op">(</span><span class="dt">uint64_t</span> value<span class="op">,</span> <span class="dt">int</span> start<span class="op">,</span> <span class="dt">int</span> length<span class="op">)</span> <span class="op">{</span></span>
<span id="cb33-3"><a href="#cb33-3" aria-hidden="true" tabindex="-1"></a> <span class="co">// Compiler may generate (BZHI requires BMI2):</span></span>
<span id="cb33-4"><a href="#cb33-4" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov ecx, esi ; SHR takes a variable count only in CL</span></span>
<span id="cb33-5"><a href="#cb33-5" aria-hidden="true" tabindex="-1"></a> <span class="co">// shr rdi, cl ; Shift by start</span></span>
<span id="cb33-6"><a href="#cb33-6" aria-hidden="true" tabindex="-1"></a> <span class="co">// bzhi rax, rdi, rdx ; Zero bits at and above length</span></span>
<span id="cb33-7"><a href="#cb33-7" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> <span class="op">(</span>value <span class="op">&gt;&gt;</span> start<span class="op">)</span> <span class="op">&amp;</span> <span class="op">((</span><span class="dv">1</span><span class="bu">ULL</span> <span class="op">&lt;&lt;</span> length<span class="op">)</span> <span class="op">-</span> <span class="dv">1</span><span class="op">);</span></span>
<span id="cb33-8"><a href="#cb33-8" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb33-9"><a href="#cb33-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-10"><a href="#cb33-10" aria-hidden="true" tabindex="-1"></a><span class="co">// Bit packing with PDEP</span></span>
<span id="cb33-11"><a href="#cb33-11" aria-hidden="true" tabindex="-1"></a><span class="dt">uint64_t</span> pack_rgb_to_565<span class="op">(</span><span class="dt">uint8_t</span> r<span class="op">,</span> <span class="dt">uint8_t</span> g<span class="op">,</span> <span class="dt">uint8_t</span> b<span class="op">)</span> <span class="op">{</span></span>
<span id="cb33-12"><a href="#cb33-12" aria-hidden="true" tabindex="-1"></a> <span class="co">// With BMI2 (PDEP reads its mask from a register and deposits the low</span></span>
<span id="cb33-13"><a href="#cb33-13" aria-hidden="true" tabindex="-1"></a> <span class="co">// source bits, so each channel is first shifted down to its top bits):</span></span>
<span id="cb33-14"><a href="#cb33-14" aria-hidden="true" tabindex="-1"></a> <span class="co">// shr edi, 3</span></span>
<span id="cb33-15"><a href="#cb33-15" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov eax, 0xF800</span></span>
<span id="cb33-16"><a href="#cb33-16" aria-hidden="true" tabindex="-1"></a> <span class="co">// pdep eax, edi, eax ; R in bits 15:11</span></span>
<span id="cb33-17"><a href="#cb33-17" aria-hidden="true" tabindex="-1"></a> <span class="co">// shr esi, 2</span></span>
<span id="cb33-18"><a href="#cb33-18" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov ecx, 0x07E0</span></span>
<span id="cb33-19"><a href="#cb33-19" aria-hidden="true" tabindex="-1"></a> <span class="co">// pdep ecx, esi, ecx ; G in bits 10:5</span></span>
<span id="cb33-20"><a href="#cb33-20" aria-hidden="true" tabindex="-1"></a> <span class="co">// or eax, ecx</span></span>
<span id="cb33-21"><a href="#cb33-21" aria-hidden="true" tabindex="-1"></a> <span class="co">// movzx edx, dl</span></span>
<span id="cb33-22"><a href="#cb33-22" aria-hidden="true" tabindex="-1"></a> <span class="co">// shr edx, 3 ; B in bits 4:0</span></span>
<span id="cb33-23"><a href="#cb33-23" aria-hidden="true" tabindex="-1"></a> <span class="co">// or eax, edx</span></span>
<span id="cb33-24"><a href="#cb33-24" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> <span class="op">((</span>r <span class="op">&amp;</span> <span class="bn">0xF8</span><span class="op">)</span> <span class="op">&lt;&lt;</span> <span class="dv">8</span><span class="op">)</span> <span class="op">|</span> <span class="op">((</span>g <span class="op">&amp;</span> <span class="bn">0xFC</span><span class="op">)</span> <span class="op">&lt;&lt;</span> <span class="dv">3</span><span class="op">)</span> <span class="op">|</span> <span class="op">(</span>b <span class="op">&gt;&gt;</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb33-25"><a href="#cb33-25" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
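<p>To make the semantics of <code>bzhi</code> and <code>pdep</code> concrete, here is a software model in portable C. It is a sketch under stated assumptions: <code>bzhi_soft</code>, <code>pdep_soft</code>, <code>extract_field</code>, and <code>pack_rgb565_pdep</code> are hypothetical names, and the loop-based <code>pdep_soft</code> only illustrates what the instruction computes, not how fast it is.</p>

```c
#include <assert.h>
#include <stdint.h>

/* BZHI model: keep the low n bits of x; n >= 64 leaves x unchanged. */
uint64_t bzhi_soft(uint64_t x, unsigned n) {
    return n >= 64 ? x : x & ((1ULL << n) - 1);
}

/* Bit-field extraction that stays defined when length == 64. */
uint64_t extract_field(uint64_t value, unsigned start, unsigned length) {
    assert(start < 64 && length <= 64 - start);
    return bzhi_soft(value >> start, length);
}

/* PDEP model: scatter the LOW bits of src to the set-bit positions of mask. */
uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    while (mask != 0) {
        uint64_t lowest = mask & ((uint64_t)0 - mask);  /* lowest remaining mask bit */
        if (src & 1)
            out |= lowest;
        src >>= 1;
        mask &= mask - 1;                               /* drop that mask bit */
    }
    return out;
}

/* RGB565 packing: shift each channel down first, because PDEP consumes low bits. */
uint16_t pack_rgb565_pdep(uint8_t r, uint8_t g, uint8_t b) {
    uint64_t packed = pdep_soft((uint64_t)(r >> 3), 0xF800)
                    | pdep_soft((uint64_t)(g >> 2), 0x07E0)
                    | (uint64_t)(b >> 3);
    return (uint16_t)packed;
}
```

<p>The result of <code>pack_rgb565_pdep</code> matches the shift-and-mask expression in <code>pack_rgb_to_565</code>; for contiguous masks like these, plain shifts are normally the cheaper choice, and <code>pdep</code> pays off mainly for scattered masks.</p>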
<h3 id="control-flow-branches-loops-and-calls"><strong>2.5 Control Flow:
Branches, Loops, and Calls</strong></h3>
<h4 id="unconditional-jumps"><strong>Unconditional Jumps</strong></h4>
<pre class="assembly"><code>; Direct jump
jmp label ; RIP = label
; Indirect jump
jmp rax ; RIP = RAX
jmp qword [rbx] ; RIP = memory[RBX]
; Function calls
call function ; Push return address, jump
call rax ; Indirect call
call qword [rbx + rax*8] ; Call through function table
; Returns
ret ; Pop return address to RIP
ret 16 ; Return and adjust RSP by 16</code></pre>
<h4 id="conditional-branches"><strong>Conditional Branches</strong></h4>
<pre class="assembly"><code>; Based on single flag
je label ; Jump if equal (ZF=1)
jne label ; Jump if not equal (ZF=0)
jc label ; Jump if carry (CF=1)
jnc label ; Jump if not carry (CF=0)
; Based on comparisons (signed)
jl label ; Jump if less
jle label ; Jump if less or equal
jg label ; Jump if greater
jge label ; Jump if greater or equal
; Based on comparisons (unsigned)
jb label ; Jump if below
jbe label ; Jump if below or equal
ja label ; Jump if above
jae label ; Jump if above or equal
; Special conditions
jo label ; Jump if overflow
js label ; Jump if sign (negative)
jp label ; Jump if parity even
jrcxz label ; Jump if RCX is zero (JECXZ tests ECX; JCXZ is not encodable in 64-bit mode)</code></pre>
<h4 id="loop-instructions"><strong>Loop Instructions</strong></h4>
<pre class="assembly"><code>; Traditional LOOP instruction (microcoded and slow on most Intel CPUs)
mov rcx, 100
.loop:
; ... loop body ...
loop .loop ; Decrement RCX and jump if non-zero
; Preferred pattern for modern CPUs
mov rcx, 100
.loop:
; ... loop body ...
dec rcx
jnz .loop
; String operation loops
mov rcx, string_length
rep movsb ; Repeat MOVSB RCX times</code></pre>
<h4 id="compiler-control-flow-patterns"><strong>Compiler Control Flow
Patterns</strong></h4>
<div class="sourceCode" id="cb37"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a><span class="co">// If-else pattern</span></span>
<span id="cb37-2"><a href="#cb37-2" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="op">(</span>a <span class="op">&lt;</span> b<span class="op">)</span> <span class="op">{</span></span>
<span id="cb37-3"><a href="#cb37-3" aria-hidden="true" tabindex="-1"></a> x <span class="op">=</span> y<span class="op">;</span></span>
<span id="cb37-4"><a href="#cb37-4" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb37-5"><a href="#cb37-5" aria-hidden="true" tabindex="-1"></a> x <span class="op">=</span> z<span class="op">;</span></span>
<span id="cb37-6"><a href="#cb37-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb37-7"><a href="#cb37-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-8"><a href="#cb37-8" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler generates:</span></span>
<span id="cb37-9"><a href="#cb37-9" aria-hidden="true" tabindex="-1"></a><span class="co">// cmp rdi, rsi</span></span>
<span id="cb37-10"><a href="#cb37-10" aria-hidden="true" tabindex="-1"></a><span class="co">// jge .else</span></span>
<span id="cb37-11"><a href="#cb37-11" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, rdx</span></span>
<span id="cb37-12"><a href="#cb37-12" aria-hidden="true" tabindex="-1"></a><span class="co">// jmp .end</span></span>
<span id="cb37-13"><a href="#cb37-13" aria-hidden="true" tabindex="-1"></a><span class="co">// .else:</span></span>
<span id="cb37-14"><a href="#cb37-14" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, rcx</span></span>
<span id="cb37-15"><a href="#cb37-15" aria-hidden="true" tabindex="-1"></a><span class="co">// .end:</span></span>
<span id="cb37-16"><a href="#cb37-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-17"><a href="#cb37-17" aria-hidden="true" tabindex="-1"></a><span class="co">// Switch statement (jump table)</span></span>
<span id="cb37-18"><a href="#cb37-18" aria-hidden="true" tabindex="-1"></a><span class="cf">switch</span> <span class="op">(</span>x<span class="op">)</span> <span class="op">{</span></span>
<span id="cb37-19"><a href="#cb37-19" aria-hidden="true" tabindex="-1"></a> <span class="cf">case</span> <span class="dv">0</span><span class="op">:</span> <span class="cf">return</span> a<span class="op">;</span></span>
<span id="cb37-20"><a href="#cb37-20" aria-hidden="true" tabindex="-1"></a> <span class="cf">case</span> <span class="dv">1</span><span class="op">:</span> <span class="cf">return</span> b<span class="op">;</span></span>
<span id="cb37-21"><a href="#cb37-21" aria-hidden="true" tabindex="-1"></a> <span class="cf">case</span> <span class="dv">2</span><span class="op">:</span> <span class="cf">return</span> c<span class="op">;</span></span>
<span id="cb37-22"><a href="#cb37-22" aria-hidden="true" tabindex="-1"></a> <span class="cf">default</span><span class="op">:</span> <span class="cf">return</span> d<span class="op">;</span></span>
<span id="cb37-23"><a href="#cb37-23" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb37-24"><a href="#cb37-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-25"><a href="#cb37-25" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler may generate:</span></span>
<span id="cb37-26"><a href="#cb37-26" aria-hidden="true" tabindex="-1"></a><span class="co">// cmp rdi, 2 ; x is 64-bit here (a 32-bit x must be zero-extended before indexing)</span></span>
<span id="cb37-27"><a href="#cb37-27" aria-hidden="true" tabindex="-1"></a><span class="co">// ja .default</span></span>
<span id="cb37-28"><a href="#cb37-28" aria-hidden="true" tabindex="-1"></a><span class="co">// lea rax, [rip + .jump_table]</span></span>
<span id="cb37-29"><a href="#cb37-29" aria-hidden="true" tabindex="-1"></a><span class="co">// movsxd rdi, dword [rax + rdi*4]</span></span>
<span id="cb37-30"><a href="#cb37-30" aria-hidden="true" tabindex="-1"></a><span class="co">// add rax, rdi</span></span>
<span id="cb37-31"><a href="#cb37-31" aria-hidden="true" tabindex="-1"></a><span class="co">// jmp rax</span></span>
<span id="cb37-32"><a href="#cb37-32" aria-hidden="true" tabindex="-1"></a><span class="co">// .jump_table:</span></span>
<span id="cb37-33"><a href="#cb37-33" aria-hidden="true" tabindex="-1"></a><span class="co">// dd .case0 - .jump_table</span></span>
<span id="cb37-34"><a href="#cb37-34" aria-hidden="true" tabindex="-1"></a><span class="co">// dd .case1 - .jump_table</span></span>
<span id="cb37-35"><a href="#cb37-35" aria-hidden="true" tabindex="-1"></a><span class="co">// dd .case2 - .jump_table</span></span></code></pre></div>
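<p>The same bounds-check-then-index structure can be written by hand in C with a table of function pointers. This is only an analogy to what the compiler emits: the handler names and the <code>dispatch</code> function below are hypothetical, and a real jump table holds code offsets rather than function pointers.</p>

```c
#include <assert.h>
#include <stddef.h>

typedef long (*case_handler)(long);

static long handle_case0(long v)   { return v + 1; }
static long handle_case1(long v)   { return v * 2; }
static long handle_case2(long v)   { return -v; }
static long handle_default(long v) { return v; }

/* One entry per dense case value, like the dd entries in .jump_table. */
static const case_handler jump_table[] = {
    handle_case0, handle_case1, handle_case2
};

long dispatch(unsigned selector, long v) {
    size_t count = sizeof jump_table / sizeof jump_table[0];
    if (selector >= count)             /* corresponds to: cmp ..., 2 ; ja .default */
        return handle_default(v);
    assert(jump_table[selector] != NULL);
    return jump_table[selector](v);    /* corresponds to the indexed indirect jump */
}
```

<p>The single unsigned comparison covers both out-of-range directions, which is why the generated code uses <code>ja</code> rather than a pair of signed checks.</p>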
<h4 id="branch-prediction-considerations"><strong>Branch Prediction
Considerations</strong></h4>
<pre class="assembly"><code>; Predictable branches (favor forward not-taken, backward taken)
.loop:
; ... work ...
dec rcx
jnz .loop ; Backward branch, predicted taken
test rax, rax
jz .skip ; Forward branch, predicted not-taken
; ... common case ...
.skip:
; Branch hints (legacy prefixes, ignored by most modern CPUs)
2E jz .unlikely ; 2E prefix for &quot;not taken&quot; hint
3E jnz .likely ; 3E prefix for &quot;taken&quot; hint</code></pre>
<h3 id="string-operations"><strong>2.6 String Operations</strong></h3>
<h4 id="basic-string-instructions"><strong>Basic String
Instructions</strong></h4>
<pre class="assembly"><code>; String move operations
movsb ; Move byte [RSI] to [RDI], adjust pointers
movsw ; Move word
movsd ; Move dword
movsq ; Move qword
; String compare
cmpsb ; Compare bytes at [RSI] and [RDI]
; String scan
scasb ; Compare AL with [RDI]
; String store
stosb ; Store AL at [RDI]
; String load
lodsb ; Load [RSI] into AL
; Direction flag controls pointer adjustment
cld ; Clear DF: increment pointers
std ; Set DF: decrement pointers</code></pre>
<h4 id="rep-prefixes"><strong>REP Prefixes</strong></h4>
<pre class="assembly"><code>; Repeat string operations
mov rcx, 1000
rep movsb ; Copy RCX bytes
mov rcx, 1000
mov al, 0
rep stosb ; Fill RCX bytes with zero
; Conditional repeats
mov rcx, 1000
repne scasb ; Scan while not equal
; RCX now contains remaining count
mov rcx, 1000
repe cmpsb ; Compare while equal</code></pre>
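<p>The "remaining count" left in RCX is easy to get wrong by one, so a small C model of <code>repne scasb</code> may help. It is a sketch: <code>repne_scasb</code> is a hypothetical name, the direction flag is assumed clear, and the model reports ZF through an output parameter.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Model of: mov rcx, count ; repne scasb (with DF = 0).
   Returns the value left in RCX and stores the final ZF in *zf.
   Simplification: real hardware leaves ZF untouched when RCX starts at 0. */
size_t repne_scasb(const unsigned char *rdi, unsigned char al, size_t rcx, int *zf) {
    assert(rdi != NULL && zf != NULL);
    *zf = 0;
    while (rcx != 0) {
        rcx--;                     /* the count is decremented for every byte examined */
        if (*rdi++ == al) {
            *zf = 1;               /* a match stops the repeat */
            break;
        }
    }
    return rcx;
}
```

<p>Because the matching byte is counted as examined, the index of the match is <code>count - rcx - 1</code>, not <code>count - rcx</code>.</p>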
<h4 id="optimized-string-operations"><strong>Optimized String
Operations</strong></h4>
<pre class="assembly"><code>; Fast memory copy pattern
memcpy:
mov rax, rdi ; Save destination
cmp rdx, 32
jb .small
; Large copy with SIMD (unaligned moves: the pointers may not be 16-byte aligned)
.large_loop:
movdqu xmm0, [rsi]
movdqu xmm1, [rsi+16]
movdqu [rdi], xmm0
movdqu [rdi+16], xmm1
add rsi, 32
add rdi, 32
sub rdx, 32
cmp rdx, 32
jae .large_loop
.small:
; Handle remaining bytes
test rdx, rdx
jz .done
mov rcx, rdx ; REP MOVSB takes its count from RCX
rep movsb
.done:
ret</code></pre>
<h4 id="compiler-string-intrinsics"><strong>Compiler String
Intrinsics</strong></h4>
<div class="sourceCode" id="cb42"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler recognition of patterns</span></span>
<span id="cb42-2"><a href="#cb42-2" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span><span class="op">*</span> memset_pattern<span class="op">(</span><span class="dt">void</span><span class="op">*</span> s<span class="op">,</span> <span class="dt">int</span> c<span class="op">,</span> <span class="dt">size_t</span> n<span class="op">)</span> <span class="op">{</span></span>
<span id="cb42-3"><a href="#cb42-3" aria-hidden="true" tabindex="-1"></a> <span class="co">// Compiler may replace with:</span></span>
<span id="cb42-4"><a href="#cb42-4" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov r8, rdi ; Save s (RAX is needed for the fill byte)</span></span>
<span id="cb42-5"><a href="#cb42-5" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov eax, esi ; STOSB stores AL, so c must be in AL</span></span>
<span id="cb42-6"><a href="#cb42-6" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov rcx, rdx</span></span>
<span id="cb42-7"><a href="#cb42-7" aria-hidden="true" tabindex="-1"></a> <span class="co">// rep stosb</span></span>
<span id="cb42-8"><a href="#cb42-8" aria-hidden="true" tabindex="-1"></a> <span class="co">// mov rax, r8 ; Return the original pointer</span></span>
<span id="cb42-9"><a href="#cb42-9" aria-hidden="true" tabindex="-1"></a> <span class="co">// ret</span></span>
<span id="cb42-10"><a href="#cb42-10" aria-hidden="true" tabindex="-1"></a> <span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*</span> p <span class="op">=</span> <span class="op">(</span><span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*)</span>s<span class="op">;</span></span>
<span id="cb42-11"><a href="#cb42-11" aria-hidden="true" tabindex="-1"></a> <span class="cf">while</span> <span class="op">(</span>n<span class="op">--)</span> <span class="op">*</span>p<span class="op">++</span> <span class="op">=</span> c<span class="op">;</span></span>
<span id="cb42-12"><a href="#cb42-12" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> s<span class="op">;</span></span>
<span id="cb42-13"><a href="#cb42-13" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb42-14"><a href="#cb42-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-15"><a href="#cb42-15" aria-hidden="true" tabindex="-1"></a><span class="co">// Modern compilers typically recognize this loop and emit a memset call, or:</span></span>
<span id="cb42-16"><a href="#cb42-16" aria-hidden="true" tabindex="-1"></a><span class="co">// - Inline scalar or vector stores for small constant sizes</span></span>
<span id="cb42-17"><a href="#cb42-17" aria-hidden="true" tabindex="-1"></a><span class="co">// - SIMD loops or REP STOSB (fast with ERMSB) for larger sizes</span></span>
<span id="cb42-18"><a href="#cb42-18" aria-hidden="true" tabindex="-1"></a><span class="co">// - Non-temporal stores for very large sizes</span></span></code></pre></div>
<h3 id="compiler-perspective-instruction-selection-patterns"><strong>2.7
Compiler Perspective: Instruction Selection Patterns</strong></h3>
<h4 id="instruction-selection-overview"><strong>Instruction Selection
Overview</strong></h4>
<p>Modern compilers pattern-match on their intermediate representation (IR) to pick
cheaper instruction sequences:</p>
<div class="sourceCode" id="cb43"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb43-1"><a href="#cb43-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler&#39;s instruction selection process</span></span>
<span id="cb43-2"><a href="#cb43-2" aria-hidden="true" tabindex="-1"></a>class InstructionSelector <span class="op">{</span></span>
<span id="cb43-3"><a href="#cb43-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">void</span> select<span class="op">(</span>IR_Node<span class="op">*</span> node<span class="op">)</span> <span class="op">{</span></span>
<span id="cb43-4"><a href="#cb43-4" aria-hidden="true" tabindex="-1"></a> <span class="cf">switch</span> <span class="op">(</span>node<span class="op">-&gt;</span>type<span class="op">)</span> <span class="op">{</span></span>
<span id="cb43-5"><a href="#cb43-5" aria-hidden="true" tabindex="-1"></a> <span class="cf">case</span> IR_ADD<span class="op">:</span></span>
<span id="cb43-6"><a href="#cb43-6" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> <span class="op">(</span>is_constant<span class="op">(</span>node<span class="op">-&gt;</span>right<span class="op">,</span> <span class="dv">1</span><span class="op">))</span></span>
<span id="cb43-7"><a href="#cb43-7" aria-hidden="true" tabindex="-1"></a> emit_inc<span class="op">(</span>node<span class="op">-&gt;</span>left<span class="op">);</span></span>
<span id="cb43-8"><a href="#cb43-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>is_lea_candidate<span class="op">(</span>node<span class="op">))</span></span>
<span id="cb43-9"><a href="#cb43-9" aria-hidden="true" tabindex="-1"></a> emit_lea<span class="op">(</span>node<span class="op">);</span></span>
<span id="cb43-10"><a href="#cb43-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">else</span></span>
<span id="cb43-11"><a href="#cb43-11" aria-hidden="true" tabindex="-1"></a> emit_add<span class="op">(</span>node<span class="op">);</span></span>
<span id="cb43-12"><a href="#cb43-12" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span><span class="op">;</span></span>
<span id="cb43-13"><a href="#cb43-13" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb43-14"><a href="#cb43-14" aria-hidden="true" tabindex="-1"></a> <span class="cf">case</span> IR_MULTIPLY<span class="op">:</span></span>
<span id="cb43-15"><a href="#cb43-15" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> <span class="op">(</span>is_power_of_two<span class="op">(</span>node<span class="op">-&gt;</span>right<span class="op">))</span></span>
<span id="cb43-16"><a href="#cb43-16" aria-hidden="true" tabindex="-1"></a> emit_shift<span class="op">(</span>node<span class="op">);</span></span>
<span id="cb43-17"><a href="#cb43-17" aria-hidden="true" tabindex="-1"></a> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>is_lea_multiply<span class="op">(</span>node<span class="op">))</span></span>
<span id="cb43-18"><a href="#cb43-18" aria-hidden="true" tabindex="-1"></a> emit_lea<span class="op">(</span>node<span class="op">);</span></span>
<span id="cb43-19"><a href="#cb43-19" aria-hidden="true" tabindex="-1"></a> <span class="cf">else</span></span>
<span id="cb43-20"><a href="#cb43-20" aria-hidden="true" tabindex="-1"></a> emit_imul<span class="op">(</span>node<span class="op">);</span></span>
<span id="cb43-21"><a href="#cb43-21" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span><span class="op">;</span></span>
<span id="cb43-22"><a href="#cb43-22" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb43-23"><a href="#cb43-23" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb43-24"><a href="#cb43-24" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span></code></pre></div>
<h4 id="common-optimization-patterns"><strong>Common Optimization
Patterns</strong></h4>
<pre class="assembly"><code>; Strength reduction
; Multiply by constant → LEA/shift
; x * 5 becomes:
lea rax, [rdi + rdi*4]
; x * 100 becomes:
lea rax, [rdi + rdi*4] ; x * 5
lea rax, [rax + rax*4] ; x * 25
shl rax, 2 ; x * 100
; Division by constant → multiply by reciprocal
; x / 10 becomes (for unsigned):
mov rax, 0xCCCCCCCCCCCCCCCD ; Reciprocal constant: ceil(2^67 / 10)
mul rdi ; RDX:RAX = x * constant
shr rdx, 3 ; Quotient = high half &gt;&gt; 3, left in RDX
; Conditional to branchless
; x = (a &lt; b) ? c : d becomes:
cmp rdi, rsi
mov rax, r8 ; d
cmovl rax, rdx ; c if less</code></pre>
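<p>The division-by-reciprocal pattern can be checked in portable C. The sketch below models the high half of <code>mul</code> with 32-bit pieces so that it does not depend on a 128-bit integer type; <code>mulhi64</code>, <code>div_by_reciprocal</code>, and <code>div10</code> are hypothetical names, while the constant and the shift are the ones used in the assembly above.</p>

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, i.e. what MUL leaves in RDX. */
uint64_t mulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Sum of the middle terms plus the carry out of the low word; cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Quotient = (x * magic) >> (64 + shift), computed from the high half only. */
uint64_t div_by_reciprocal(uint64_t x, uint64_t magic, unsigned shift) {
    assert(shift < 64);
    return mulhi64(x, magic) >> shift;
}

/* Unsigned x / 10 with magic = ceil(2^67 / 10) and a total shift of 67 bits. */
uint64_t div10(uint64_t x) {
    return div_by_reciprocal(x, 0xCCCCCCCCCCCCCCCDULL, 3);
}
```

<p>The multiply and shift replace a <code>div</code> that typically costs tens of cycles, which is why compilers apply this transformation whenever the divisor is a compile-time constant.</p>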
<h4 id="peephole-optimizations"><strong>Peephole
Optimizations</strong></h4>
<pre class="assembly"><code>; Before optimization:
mov rax, 0
add rax, rbx
; After: Eliminate redundant move
mov rax, rbx
; Before:
cmp rax, 0
je label
; After: Use TEST for zero comparison
test rax, rax
je label
; Before:
mov [rsp+8], rax
mov rbx, [rsp+8]
; After: Eliminate store-load
mov [rsp+8], rax
mov rbx, rax</code></pre>
<h4 id="code-generation-examples"><strong>Code Generation
Examples</strong></h4>
<div class="sourceCode" id="cb46"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb46-1"><a href="#cb46-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Structure field access</span></span>
<span id="cb46-2"><a href="#cb46-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Point <span class="op">{</span> <span class="dt">long</span> x<span class="op">,</span> y<span class="op">,</span> z<span class="op">;</span> <span class="op">};</span></span>
<span id="cb46-3"><a href="#cb46-3" aria-hidden="true" tabindex="-1"></a><span class="dt">long</span> get_y<span class="op">(</span><span class="kw">struct</span> Point<span class="op">*</span> p<span class="op">)</span> <span class="op">{</span></span>
<span id="cb46-4"><a href="#cb46-4" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> p<span class="op">-&gt;</span>y<span class="op">;</span></span>
<span id="cb46-5"><a href="#cb46-5" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb46-6"><a href="#cb46-6" aria-hidden="true" tabindex="-1"></a><span class="co">// Generates:</span></span>
<span id="cb46-7"><a href="#cb46-7" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, [rdi + 8]</span></span>
<span id="cb46-8"><a href="#cb46-8" aria-hidden="true" tabindex="-1"></a><span class="co">// ret</span></span>
<span id="cb46-9"><a href="#cb46-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb46-10"><a href="#cb46-10" aria-hidden="true" tabindex="-1"></a><span class="co">// Array indexing</span></span>
<span id="cb46-11"><a href="#cb46-11" aria-hidden="true" tabindex="-1"></a><span class="dt">long</span> array_access<span class="op">(</span><span class="dt">long</span><span class="op">*</span> arr<span class="op">,</span> <span class="dt">long</span> idx<span class="op">)</span> <span class="op">{</span></span>
<span id="cb46-12"><a href="#cb46-12" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> arr<span class="op">[</span>idx<span class="op">];</span></span>
<span id="cb46-13"><a href="#cb46-13" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb46-14"><a href="#cb46-14" aria-hidden="true" tabindex="-1"></a><span class="co">// Generates:</span></span>
<span id="cb46-15"><a href="#cb46-15" aria-hidden="true" tabindex="-1"></a><span class="co">// mov rax, [rdi + rsi*8]</span></span>
<span id="cb46-16"><a href="#cb46-16" aria-hidden="true" tabindex="-1"></a><span class="co">// ret</span></span>
<span id="cb46-17"><a href="#cb46-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb46-18"><a href="#cb46-18" aria-hidden="true" tabindex="-1"></a><span class="co">// Complex expression</span></span>
<span id="cb46-19"><a href="#cb46-19" aria-hidden="true" tabindex="-1"></a><span class="dt">long</span> expr<span class="op">(</span><span class="dt">long</span> a<span class="op">,</span> <span class="dt">long</span> b<span class="op">,</span> <span class="dt">long</span> c<span class="op">)</span> <span class="op">{</span></span>
<span id="cb46-20"><a href="#cb46-20" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> <span class="op">(</span>a <span class="op">+</span> b<span class="op">)</span> <span class="op">*</span> c <span class="op">-</span> <span class="op">(</span>a <span class="op">&lt;&lt;</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb46-21"><a href="#cb46-21" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb46-22"><a href="#cb46-22" aria-hidden="true" tabindex="-1"></a><span class="co">// Generates:</span></span>
<span id="cb46-23"><a href="#cb46-23" aria-hidden="true" tabindex="-1"></a><span class="co">// lea rax, [rdi + rsi] ; a + b</span></span>
<span id="cb46-24"><a href="#cb46-24" aria-hidden="true" tabindex="-1"></a><span class="co">// imul rax, rdx ; * c</span></span>
<span id="cb46-25"><a href="#cb46-25" aria-hidden="true" tabindex="-1"></a><span class="co">// lea rcx, [rdi*8] ; a &lt;&lt; 3</span></span>
<span id="cb46-26"><a href="#cb46-26" aria-hidden="true" tabindex="-1"></a><span class="co">// sub rax, rcx ; final result</span></span>
<span id="cb46-27"><a href="#cb46-27" aria-hidden="true" tabindex="-1"></a><span class="co">// ret</span></span></code></pre></div>
<h4 id="compiler-instruction-costs"><strong>Compiler Instruction
Costs</strong></h4>
<div class="sourceCode" id="cb47"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb47-1"><a href="#cb47-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Simplified cost model for instruction selection</span></span>
<span id="cb47-2"><a href="#cb47-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> InsnCost <span class="op">{</span></span>
<span id="cb47-3"><a href="#cb47-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>name<span class="op">;</span> <span class="co">// Instruction form</span></span>
<span id="cb47-4"><a href="#cb47-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> latency<span class="op">;</span> <span class="co">// Cycles to produce result</span></span>
<span id="cb47-5"><a href="#cb47-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> throughput<span class="op">;</span> <span class="co">// Reciprocal throughput (cycles per instruction)</span></span>
<span id="cb47-6"><a href="#cb47-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> size<span class="op">;</span> <span class="co">// Encoding size in bytes</span></span>
<span id="cb47-7"><a href="#cb47-7" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb47-8"><a href="#cb47-8" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> InsnCost costs<span class="op">[]</span> <span class="op">=</span> <span class="op">{</span></span>
<span id="cb47-9"><a href="#cb47-9" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;mov r,r&quot;</span><span class="op">,</span> <span class="dv">0</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">2</span><span class="op">},</span> <span class="co">// Zero latency (move elimination)</span></span>
<span id="cb47-10"><a href="#cb47-10" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;add r,r&quot;</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">3</span><span class="op">},</span></span>
<span id="cb47-11"><a href="#cb47-11" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;lea simple&quot;</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">3</span><span class="op">},</span></span>
<span id="cb47-12"><a href="#cb47-12" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;lea complex&quot;</span><span class="op">,</span> <span class="dv">3</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">4</span><span class="op">},</span> <span class="co">// 3-component LEA</span></span>
<span id="cb47-13"><a href="#cb47-13" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;imul r,r&quot;</span><span class="op">,</span> <span class="dv">3</span><span class="op">,</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">3</span><span class="op">},</span></span>
<span id="cb47-14"><a href="#cb47-14" aria-hidden="true" tabindex="-1"></a> <span class="op">{</span><span class="st">&quot;div&quot;</span><span class="op">,</span> <span class="dv">20</span><span class="op">,</span> <span class="dv">20</span><span class="op">,</span> <span class="dv">2</span><span class="op">},</span> <span class="co">// Very expensive</span></span>
<span id="cb47-15"><a href="#cb47-15" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb47-16"><a href="#cb47-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb47-17"><a href="#cb47-17" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler chooses based on optimization goals:</span></span>
<span id="cb47-18"><a href="#cb47-18" aria-hidden="true" tabindex="-1"></a><span class="co">// -Os: Minimize size</span></span>
<span id="cb47-19"><a href="#cb47-19" aria-hidden="true" tabindex="-1"></a><span class="co">// -O2: Balance latency/throughput</span></span>
<span id="cb47-20"><a href="#cb47-20" aria-hidden="true" tabindex="-1"></a><span class="co">// -O3: Aggressive optimization</span></span></code></pre></div>
<p>This foundation in instruction encoding and core operations provides
the basis for understanding how compilers transform high-level code into
efficient x86-64 machine code. The next chapter will explore memory
architecture and addressing modes in greater detail.</p>
<hr />
<h2
id="chapter-3-memory-architecture-and-addressing-modes"><strong>Chapter
3: Memory Architecture and Addressing Modes</strong></h2>
<h3 id="x86-64-memory-organization"><strong>3.1 x86-64 Memory
Organization</strong></h3>
<h4 id="virtual-address-space-layout"><strong>Virtual Address Space
Layout</strong></h4>
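<p>The x86-64 architecture defines a 64-bit virtual address format, but
current implementations translate only the low 48 bits (4-level paging) or
57 bits (5-level paging). The remaining upper bits must be copies of the
highest implemented bit, which is what makes an address canonical:</p>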
<p>Canonical 48-bit Address Space:</p>
<pre><code>0x0000000000000000 - 0x00007FFFFFFFFFFF  User space (128 TB)
0x0000800000000000 - 0xFFFF7FFFFFFFFFFF  Non-canonical (invalid)
0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF  Kernel space (128 TB)</code></pre>
<p>With 57-bit addressing (Intel LA57):</p>
<pre><code>0x0000000000000000 - 0x00FFFFFFFFFFFFFF  User space (64 PB)
0x0100000000000000 - 0xFEFFFFFFFFFFFFFF  Non-canonical
0xFF00000000000000 - 0xFFFFFFFFFFFFFFFF  Kernel space (64 PB)</code></pre>
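<p>The canonical-address rule can be checked with a few shifts. The sketch below is illustrative only; <code>is_canonical</code> and its <code>va_bits</code> parameter are names chosen here, with <code>va_bits</code> assumed to be 48 or 57.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `addr` is canonical for `va_bits` implemented address bits
 * (48 for 4-level paging, 57 for 5-level paging). An address is canonical
 * when bits 63 down to (va_bits - 1) are all zero or all one. */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    uint64_t upper    = addr >> (va_bits - 1);        /* sign bit and above */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* same width, all 1s */
    return upper == 0 || upper == all_ones;
}
```

<p>For example, <code>is_canonical(0x0000800000000000, 48)</code> is false because bit 47 is set while bits 63:48 are clear.</p>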
<h4 id="memory-segmentation-in-64-bit-mode-1"><strong>Memory
Segmentation in 64-bit Mode</strong></h4>
<p>While segmentation is largely disabled in 64-bit mode, some aspects
remain:</p>
<pre class="assembly"><code>; Segment registers in 64-bit mode
; CS, DS, ES, SS - Base forced to 0, limits ignored
; FS, GS - Base addresses can be set via MSRs
; Thread-local storage using FS/GS
mov rax, fs:[0] ; Read thread-local variable
mov rax, gs:[0x10] ; Access per-CPU data (kernel)
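; Setting the FS base (WRMSR is privileged: ring 0 only)
mov ecx, 0xC0000100 ; IA32_FS_BASE MSR (IA32_GS_BASE is 0xC0000101)
mov eax, edi ; Low 32 bits of the new base
mov edx, esi ; High 32 bits of the new base
wrmsr</code></pre>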
<h4 id="page-table-structure"><strong>Page Table Structure</strong></h4>
<p>Modern x86-64 uses 4-level or 5-level paging:</p>
<div class="sourceCode" id="cb49"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb49-1"><a href="#cb49-1" aria-hidden="true" tabindex="-1"></a><span class="co">// 4-level paging structure (48-bit addresses)</span></span>
<span id="cb49-2"><a href="#cb49-2" aria-hidden="true" tabindex="-1"></a><span class="kw">typedef</span> <span class="kw">struct</span> <span class="op">{</span></span>
<span id="cb49-3"><a href="#cb49-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint64_t</span> entries<span class="op">[</span><span class="dv">512</span><span class="op">];</span></span>
<span id="cb49-4"><a href="#cb49-4" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> PageTable<span class="op">;</span></span>
<span id="cb49-5"><a href="#cb49-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb49-6"><a href="#cb49-6" aria-hidden="true" tabindex="-1"></a><span class="co">// Virtual address breakdown (4-level):</span></span>
<span id="cb49-7"><a href="#cb49-7" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 47:39 - PML4 index (9 bits)</span></span>
<span id="cb49-8"><a href="#cb49-8" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 38:30 - PDPT index (9 bits)</span></span>
<span id="cb49-9"><a href="#cb49-9" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 29:21 - PD index (9 bits)</span></span>
<span id="cb49-10"><a href="#cb49-10" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 20:12 - PT index (9 bits)</span></span>
<span id="cb49-11"><a href="#cb49-11" aria-hidden="true" tabindex="-1"></a><span class="co">// Bits 11:0 - Page offset (12 bits)</span></span>
<span id="cb49-12"><a href="#cb49-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb49-13"><a href="#cb49-13" aria-hidden="true" tabindex="-1"></a><span class="co">// Page table entry format</span></span>
<span id="cb49-14"><a href="#cb49-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_PRESENT </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">0</span><span class="op">)</span></span>
<span id="cb49-15"><a href="#cb49-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_WRITABLE </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb49-16"><a href="#cb49-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_USER </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">2</span><span class="op">)</span></span>
<span id="cb49-17"><a href="#cb49-17" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_PWT </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">3</span><span class="op">)</span></span>
<span id="cb49-18"><a href="#cb49-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_PCD </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">4</span><span class="op">)</span></span>
<span id="cb49-19"><a href="#cb49-19" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_ACCESSED </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">5</span><span class="op">)</span></span>
<span id="cb49-20"><a href="#cb49-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_DIRTY </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">6</span><span class="op">)</span></span>
<span id="cb49-21"><a href="#cb49-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_HUGE </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">7</span><span class="op">)</span><span class="pp"> </span><span class="co">// PS bit</span></span>
<span id="cb49-22"><a href="#cb49-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_GLOBAL </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">8</span><span class="op">)</span></span>
<span id="cb49-23"><a href="#cb49-23" aria-hidden="true" tabindex="-1"></a><span class="pp">#define PTE_NX </span><span class="op">(</span><span class="dv">1</span><span class="bu">ULL</span><span class="pp"> </span><span class="op">&lt;&lt;</span><span class="pp"> </span><span class="dv">63</span><span class="op">)</span></span></code></pre></div>
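<p>As a minimal sketch of the address breakdown above, the following helper splits a 48-bit virtual address into its four table indices and page offset. <code>struct va_parts</code> and <code>split_va</code> are hypothetical names, and only 4-level paging with 4 KB pages is assumed.</p>

```c
#include <stdint.h>

/* Table indices and page offset of a virtual address under 4-level paging. */
struct va_parts {
    unsigned pml4;    /* bits 47:39 */
    unsigned pdpt;    /* bits 38:30 */
    unsigned pd;      /* bits 29:21 */
    unsigned pt;      /* bits 20:12 */
    unsigned offset;  /* bits 11:0  */
};

struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);  /* 9 bits: 512 entries */
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);          /* 12 bits: 4 KB page */
    return p;
}
```

<p>Each 9-bit index selects one of the 512 entries of a <code>PageTable</code>, which is why every level of the hierarchy fits exactly in one 4 KB page.</p>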
<h4 id="memory-types-and-caching"><strong>Memory Types and
Caching</strong></h4>
<pre class="assembly"><code>; Memory types (set via PAT/MTRR)
; UC - Uncacheable
; WC - Write Combining
; WT - Write Through
; WP - Write Protected
; WB - Write Back (normal cacheable)
; Cache control instructions
clflush [rax] ; Flush cache line
clflushopt [rax] ; Optimized flush
clwb [rax] ; Write back without invalidate
; Memory fences
mfence ; Full memory fence
sfence ; Store fence
lfence ; Load fence</code></pre>
<h3 id="complex-addressing-modes"><strong>3.2 Complex Addressing
Modes</strong></h3>
<h4 id="general-addressing-mode-format"><strong>General Addressing Mode
Format</strong></h4>
<p>x86-64 memory operands share one general addressing form: [base +
index*scale + displacement]</p>
<p>Where:</p>
<ul>
<li><p>base: any general-purpose register</p></li>
<li><p>index: any GPR except RSP</p></li>
<li><p>scale: 1, 2, 4, or 8</p></li>
<li><p>displacement: none, 8-bit, or 32-bit, sign-extended to 64 bits</p></li>
</ul>
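<p>The address computation itself is plain 64-bit wraparound arithmetic. A minimal model in C, where <code>effective_address</code> is a name introduced only for illustration and callers are assumed to pass a legal scale:</p>

```c
#include <stdint.h>

/* Models base + index*scale + disp as the hardware computes it:
 * modulo 2^64, with the displacement sign-extended from 32 bits.
 * `scale` is assumed to be 1, 2, 4, or 8. */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

<p>With <code>base = 0x1000</code>, <code>index = 3</code>, <code>scale = 8</code>, and <code>disp = 16</code>, the result is <code>0x1028</code>, the address of <code>[rbx + rcx*8 + 16]</code> for those register values.</p>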
<h4 id="addressing-mode-examples"><strong>Addressing Mode
Examples</strong></h4>
<pre class="assembly"><code>; Direct addressing
mov rax, [0x1000] ; Absolute address (rare in 64-bit)
mov rax, [label] ; RIP-relative (preferred)
; Register indirect
mov rax, [rbx] ; [base]
; Displacement
mov rax, [rbx + 8] ; [base + disp8]
mov rax, [rbx + 1000] ; [base + disp32]
; Scaled index
mov rax, [rbx + rcx*8] ; [base + index*scale]
; Full addressing mode
mov rax, [rbx + rcx*8 + 16] ; [base + index*scale + disp]
; Special cases
mov rax, [rcx*2 + 100] ; [index*scale + disp] - no base
mov rax, [rsp + 8] ; RSP requires SIB byte</code></pre>
<h4 id="rip-relative-addressing"><strong>RIP-Relative
Addressing</strong></h4>
<p>RIP-relative addressing is crucial for position-independent code:</p>
<pre class="assembly"><code>; RIP-relative data access
data: dq 0x123456789ABCDEF0
func:
mov rax, [rip + data] ; Load from data
lea rbx, [rip + data] ; Get address of data
; Compiler-generated RIP-relative
; C code: extern int global_var;
; int x = global_var;
; Generates:
mov rax, [rip + global_var@GOTPCREL] ; Load the address of global_var from the GOT
mov eax, [rax] ; Then load its value
; or
mov eax, [rip + global_var] ; Direct</code></pre>
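<p>The displacement in a RIP-relative operand is measured from the address of the <em>next</em> instruction, not the current one. A sketch of that calculation, as an assembler or linker might perform it (<code>rip_relative_disp</code> is a hypothetical helper, not part of any toolchain API):</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Computes disp32 = target - (insn_addr + insn_len).
 * Returns false when the target is outside the signed 32-bit range
 * (roughly +/- 2 GiB), in which case RIP-relative addressing cannot reach it. */
bool rip_relative_disp(uint64_t insn_addr, unsigned insn_len,
                       uint64_t target, int32_t *disp_out)
{
    int64_t delta = (int64_t)(target - (insn_addr + insn_len));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;
    *disp_out = (int32_t)delta;
    return true;
}
```

<p>For a 7-byte <code>mov rax, [rip + data]</code> at <code>0x401000</code> with <code>data</code> at <code>0x404010</code>, the encoded displacement is <code>0x3009</code>.</p>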
<h4 id="addressing-mode-encoding"><strong>Addressing Mode
Encoding</strong></h4>
<pre class="assembly"><code>; ModR/M byte: [mod][reg][r/m]
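; mod: 00 = no disp, 01 = disp8, 10 = disp32, 11 = register
; (exception: mod=00 with r/m=101 means RIP + disp32 in 64-bit mode)
; reg: register operand or opcode extension
; r/m: register or memory operand
; SIB byte: [scale][index][base]
; Required when:
; - Using RSP or R12 as base (r/m=100 selects the SIB byte)
; - Using a scaled index
; - Using absolute [disp32] addressing (no base, no RIP)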
; Examples with encoding details:
mov rax, [rbx] ; ModR/M: 03 (no SIB needed)
mov rax, [rsp] ; ModR/M: 04, SIB: 24
mov rax, [rbx + rcx*8] ; ModR/M: 04, SIB: CB</code></pre>
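<p>The byte values in the examples above follow directly from the field layout. The sketch below packs the fields by hand; the helper names <code>modrm</code> and <code>sib</code> and the register enum are illustrative, and REX prefixes (needed for R8 to R15) are deliberately left out.</p>

```c
#include <stdint.h>

/* Low three bits of the register numbers used in the examples. */
enum { REG_RAX = 0, REG_RCX = 1, REG_RDX = 2, REG_RBX = 3, REG_RSP = 4 };

/* Packs mod (2 bits), reg (3 bits), and r/m (3 bits) into a ModR/M byte. */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Packs a SIB byte. `scale` is the multiplier (1, 2, 4, or 8); the
 * encoding stores its base-2 logarithm in the top two bits. */
uint8_t sib(unsigned scale, unsigned index, unsigned base)
{
    unsigned ss = (scale == 8) ? 3u : (scale == 4) ? 2u : (scale == 2) ? 1u : 0u;
    return (uint8_t)((ss << 6) | ((index & 7u) << 3) | (base & 7u));
}
```

<p>Here <code>modrm(0, REG_RAX, REG_RBX)</code> gives <code>0x03</code>, and <code>sib(8, REG_RCX, REG_RBX)</code> gives <code>0xCB</code>. An index field of 100 (the RSP slot) means "no index", which is how <code>[rsp]</code> ends up with SIB byte <code>0x24</code>.</p>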
<h3 id="memory-access-patterns-and-optimization"><strong>3.3 Memory
Access Patterns and Optimization</strong></h3>
<h4 id="cache-friendly-access-patterns"><strong>Cache-Friendly Access
Patterns</strong></h4>
<pre class="assembly"><code>; Sequential access (prefetcher-friendly)
process_array:
xor rdx, rdx ; Sum
.loop:
add rdx, [rdi] ; Sequential read
add rdi, 8
dec rsi
jnz .loop
; Strided access (less efficient)
process_strided:
xor rdx, rdx
.loop:
add rdx, [rdi]
add rdi, 64 ; 64-byte stride: every access touches a new cache line
dec rsi
jnz .loop</code></pre>
<h4 id="prefetching"><strong>Prefetching</strong></h4>
<pre class="assembly"><code>; Software prefetch instructions
prefetchnta [rax] ; Non-temporal (minimize cache pollution)
prefetcht0 [rax] ; To L1 cache
prefetcht1 [rax] ; To L2 cache
prefetcht2 [rax] ; To L3 cache
prefetchw [rax] ; For write
; Compiler prefetch pattern (rdi = data, rsi = number of 64-byte blocks)
process_with_prefetch:
.loop:
prefetcht0 [rdi + 512] ; Prefetch 8 iterations (8 cache lines) ahead
; Process current data
movaps xmm0, [rdi]
movaps xmm1, [rdi + 16]
; ... processing ...
add rdi, 64
dec rsi
jnz .loop</code></pre>
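<p>From C, the same hint is usually expressed with the GCC/Clang builtin <code>__builtin_prefetch</code>. The sketch below assumes one of those compilers; the function name and the prefetch distance of 64 elements are arbitrary choices for illustration, and a useful distance has to be found by measurement.</p>

```c
#include <stddef.h>

/* Sums an array while asking the CPU to start fetching data a fixed
 * distance ahead of the element currently being read. */
long sum_with_prefetch(const long *data, size_t n)
{
    const size_t ahead = 64;  /* assumed tuning value, in elements */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n) {
            /* Arguments: address, 0 = read access, 3 = keep in all cache levels */
            __builtin_prefetch(&data[i + ahead], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

<p>For a simple sequential scan like this one, the hardware prefetcher typically does the job already; explicit prefetching tends to pay off for access patterns the hardware cannot predict.</p>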
<h4 id="non-temporal-memory-access"><strong>Non-Temporal Memory
Access</strong></h4>
<pre class="assembly"><code>; Non-temporal stores (bypass cache)
movnti [rax], rbx ; NT store integer
movntdq [rax], xmm0 ; NT store 128-bit
movntpd [rax], xmm0 ; NT store packed double
; Non-temporal loads (SSE4.1)
movntdqa xmm0, [rax] ; NT load 128-bit
; Example: Large memory copy bypassing cache
; rdi = dst, rsi = src (both 16-byte aligned), rdx = byte count (nonzero multiple of 64)
large_memcpy:
.loop:
movdqa xmm0, [rsi]
movdqa xmm1, [rsi + 16]
movdqa xmm2, [rsi + 32]
movdqa xmm3, [rsi + 48]
movntdq [rdi], xmm0
movntdq [rdi + 16], xmm1
movntdq [rdi + 32], xmm2
movntdq [rdi + 48], xmm3
add rsi, 64
add rdi, 64
sub rdx, 64
jnz .loop
sfence ; Order the NT stores before any later stores
ret</code></pre>
<h3 id="stack-operations-and-management"><strong>3.4 Stack Operations
and Management</strong></h3>
<h4 id="stack-frame-layout"><strong>Stack Frame Layout</strong></h4>
<pre class="assembly"><code>; Typical stack frame structure
; Higher addresses
; ...
; [rbp + 24] - Argument 8 (if passed on stack)
; [rbp + 16] - Argument 7 (if passed on stack)
; [rbp + 8] - Return address
; [rbp + 0] - Saved RBP (frame pointer)
; [rbp - 8] - Local variable 1
; [rbp - 16] - Local variable 2
; [rsp] - Top of stack
; Lower addresses
; Function prologue
function:
push rbp ; Save frame pointer
mov rbp, rsp ; Establish frame
sub rsp, 32 ; Allocate locals
; Function epilogue
mov rsp, rbp ; Restore stack
pop rbp ; Restore frame pointer
ret</code></pre>
<h4 id="stack-alignment"><strong>Stack Alignment</strong></h4>
<pre class="assembly"><code>; System V AMD64 ABI requires 16-byte alignment before CALL
align_stack:
test rsp, 15 ; Check alignment
jz .aligned
sub rsp, 8 ; Align if needed
.aligned:
call function
; Compiler ensures alignment
; Before CALL: RSP mod 16 = 0
; CALL pushes the 8-byte return address
; At function entry: RSP mod 16 = 8 (push rbp restores 16-byte alignment)</code></pre>
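<p>The alignment arithmetic a compiler applies to frame sizes and stack pointers is a pair of mask operations. A small sketch, with helper names invented for this example:</p>

```c
#include <stdint.h>

/* Rounds a frame size up to the next multiple of 16, as a compiler does
 * when sizing the `sub rsp, N` in a prologue. */
uint64_t align_up_16(uint64_t n)
{
    return (n + 15u) & ~(uint64_t)15;
}

/* Rounds a stack pointer value down to a 16-byte boundary. The stack grows
 * toward lower addresses, so this is the `and rsp, -16` idiom. */
uint64_t align_down_16(uint64_t sp)
{
    return sp & ~(uint64_t)15;
}
```

<p>A function needing 24 bytes of locals would therefore reserve <code>align_up_16(24) == 32</code> bytes so that RSP stays 16-byte aligned at its own call sites.</p>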
<h4 id="red-zone"><strong>Red Zone</strong></h4>
<pre class="assembly"><code>; 128-byte red zone below RSP (System V AMD64)
; Can be used without adjusting RSP
leaf_function:
mov [rsp - 8], rdi ; Use red zone
mov [rsp - 16], rsi
; ... computation ...
mov rax, [rsp - 8]
ret
; Signal delivery skips the red zone; kernel code is built without one (-mno-red-zone)
; Windows x64 has no red zone!</code></pre>
<h3 id="memory-barriers-and-atomics"><strong>3.5 Memory Barriers and
Atomics</strong></h3>
<h4 id="memory-ordering"><strong>Memory Ordering</strong></h4>
<pre class="assembly"><code>; x86-64 memory model (Total Store Order)
; Guarantees:
; - Loads are not reordered with loads
; - Stores are not reordered with stores
; - Stores are not reordered with older loads
; - Loads may be reordered with older stores
; Memory barriers
mfence ; Full barrier
sfence ; Store barrier
lfence ; Load barrier + speculation barrier</code></pre>
<h4 id="atomic-operations"><strong>Atomic Operations</strong></h4>
<pre class="assembly"><code>; LOCK prefix for atomicity
lock add [rax], rbx ; Atomic add
lock xchg [rax], rbx ; XCHG is implicitly locked
lock cmpxchg [rdi], rbx ; If [rdi] == rax then [rdi] = rbx, else rax = [rdi]
; Lock-free patterns
atomic_increment:
mov rax, 1
lock xadd [rdi], rax ; Fetch-and-add
inc rax ; Return old + 1
ret
; Compare-and-swap loop
cas_loop:
mov rax, [rdi] ; Load current value
.retry:
mov rdx, rax
add rdx, 1 ; Compute new value
lock cmpxchg [rdi], rdx ; Try to update
jnz .retry ; Retry if changed
ret</code></pre>
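<p>The two lock-free patterns above map onto C11 <code>&lt;stdatomic.h&gt;</code>. This is a sketch rather than a statement of what any particular compiler emits; on x86-64 a fetch-and-add typically becomes <code>lock xadd</code> and a compare-exchange becomes <code>lock cmpxchg</code>. The function names are chosen for this example.</p>

```c
#include <stdatomic.h>

/* Fetch-and-add: returns the value after the increment. */
long counter_increment(atomic_long *counter)
{
    return atomic_fetch_add(counter, 1) + 1;
}

/* Compare-and-swap loop: increments *value and returns the value it
 * replaced. On failure, atomic_compare_exchange_weak stores the current
 * contents of *value into `old`, so the loop retries with fresh data,
 * just as CMPXCHG reloads RAX. */
long counter_cas_increment(atomic_long *value)
{
    long old = atomic_load(value);
    while (!atomic_compare_exchange_weak(value, &old, old + 1)) {
        /* `old` was refreshed by the failed exchange; try again */
    }
    return old;
}
```

<p>The CAS form is the more general one: replacing <code>old + 1</code> with any pure function of <code>old</code> gives an atomic read-modify-write that <code>lock xadd</code> alone cannot express.</p>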
<h4 id="transactional-memory-tsx"><strong>Transactional Memory
(TSX)</strong></h4>
<pre class="assembly"><code>; Hardware Lock Elision (HLE)
xacquire lock add [rax], rbx ; Begin transaction
xrelease lock sub [rax], rbx ; End transaction
; Restricted Transactional Memory (RTM)
transaction:
xbegin .abort ; Start transaction
; ... transactional code ...
mov rax, [shared_data]
add rax, 1
mov [shared_data], rax
xend ; Commit transaction
jmp .done
.abort:
; Handle abort (EAX holds the abort status bits)
test eax, 1 ; Bit 0: explicit XABORT (its code is in EAX[31:24])
jnz .fallback
test eax, 2 ; Bit 1: a retry may succeed
jz .fallback
; Retry logic...
.fallback:
; Non-transactional path
.done:</code></pre>
<h3 id="effective-address-calculation-lea"><strong>3.6 Effective Address
Calculation (LEA)</strong></h3>
<h4 id="lea-instruction-capabilities"><strong>LEA Instruction
Capabilities</strong></h4>
<pre class="assembly"><code>; LEA performs address calculation without memory access
; Useful for arithmetic and address computation
; Simple arithmetic
lea rax, [rbx + 5] ; rax = rbx + 5
lea rax, [rbx + rcx] ; rax = rbx + rcx
; Scaled arithmetic
lea rax, [rbx*2] ; rax = rbx * 2
lea rax, [rbx + rbx*2] ; rax = rbx * 3
lea rax, [rbx + rbx*4] ; rax = rbx * 5
lea rax, [rbx + rbx*8] ; rax = rbx * 9
; Complex calculations
lea rax, [rbx + rcx*4 + 10] ; rax = rbx + rcx*4 + 10
; Three-operand arithmetic
lea rax, [rdi + rsi] ; rax = rdi + rsi (preserves both)</code></pre>
<h4 id="compiler-lea-patterns"><strong>Compiler LEA
Patterns</strong></h4>
<div class="sourceCode" id="cb64"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb64-1"><a href="#cb64-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Array indexing</span></span>
<span id="cb64-2"><a href="#cb64-2" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span><span class="op">*</span> array_element<span class="op">(</span><span class="dt">int</span><span class="op">*</span> base<span class="op">,</span> <span class="dt">long</span> i<span class="op">,</span> <span class="dt">long</span> j<span class="op">,</span> <span class="dt">long</span> stride<span class="op">)</span> <span class="op">{</span></span>
<span id="cb64-3"><a href="#cb64-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> <span class="op">&amp;</span>base<span class="op">[</span>i <span class="op">*</span> stride <span class="op">+</span> j<span class="op">];</span></span>
<span id="cb64-4"><a href="#cb64-4" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb64-5"><a href="#cb64-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Generates:</span></span>
<span id="cb64-6"><a href="#cb64-6" aria-hidden="true" tabindex="-1"></a><span class="co">// imul rsi, rcx ; i * stride</span></span>
<span id="cb64-7"><a href="#cb64-7" aria-hidden="true" tabindex="-1"></a><span class="co">// add rsi, rdx ; + j</span></span>
<span id="cb64-8"><a href="#cb64-8" aria-hidden="true" tabindex="-1"></a><span class="co">// lea rax, [rdi + rsi*4] ; base + 4 * (i * stride + j)</span></span>
<span id="cb64-9"><a href="#cb64-9" aria-hidden="true" tabindex="-1"></a><span class="co">// ret</span></span>
<span id="cb64-10"><a href="#cb64-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-11"><a href="#cb64-11" aria-hidden="true" tabindex="-1"></a><span class="co">// Structure offset calculation</span></span>
<span id="cb64-12"><a href="#cb64-12" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Large <span class="op">{</span> <span class="dt">char</span> data<span class="op">[</span><span class="dv">1024</span><span class="op">];</span> <span class="op">};</span></span>
<span id="cb64-13"><a href="#cb64-13" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Large<span class="op">*</span> next_element<span class="op">(</span><span class="kw">struct</span> Large<span class="op">*</span> ptr<span class="op">,</span> <span class="dt">long</span> offset<span class="op">)</span> <span class="op">{</span></span>
<span id="cb64-14"><a href="#cb64-14" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> ptr <span class="op">+</span> offset<span class="op">;</span></span>
<span id="cb64-15"><a href="#cb64-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb64-16"><a href="#cb64-16" aria-hidden="true" tabindex="-1"></a><span class="co">// Generates:</span></span>
<span id="cb64-17"><a href="#cb64-17" aria-hidden="true" tabindex="-1"></a><span class="co">// shl rsi, 10 ; offset * 1024</span></span>
<span id="cb64-18"><a href="#cb64-18" aria-hidden="true" tabindex="-1"></a><span class="co">// lea rax, [rdi + rsi]</span></span>
<span id="cb64-19"><a href="#cb64-19" aria-hidden="true" tabindex="-1"></a><span class="co">// ret</span></span></code></pre></div>
<h4 id="lea-vs-other-instructions"><strong>LEA vs Other
Instructions</strong></h4>
<pre class="assembly"><code>; LEA advantages:
; - No flags modification
; - Three-operand form
; - Single-cycle execution (simple forms)
; Comparison: x = y * 5
; Using IMUL:
mov rax, rdi
imul rax, 5 ; 3-cycle latency
; Using LEA:
lea rax, [rdi + rdi*4] ; 1-cycle latency
; Complex LEA can be slower
lea rax, [rbx + rcx*8 + 1000] ; 3-cycle latency on some CPUs</code></pre>
<h3 id="compiler-memory-optimization-strategies"><strong>3.7 Compiler
Memory Optimization Strategies</strong></h3>
<h4 id="structure-layout-and-padding"><strong>Structure Layout and
Padding</strong></h4>
<div class="sourceCode" id="cb66"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb66-1"><a href="#cb66-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler structure padding</span></span>
<span id="cb66-2"><a href="#cb66-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Unoptimized <span class="op">{</span></span>
<span id="cb66-3"><a href="#cb66-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> a<span class="op">;</span> <span class="co">// Offset 0</span></span>
<span id="cb66-4"><a href="#cb66-4" aria-hidden="true" tabindex="-1"></a> <span class="co">// 7 bytes padding</span></span>
<span id="cb66-5"><a href="#cb66-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">double</span> b<span class="op">;</span> <span class="co">// Offset 8</span></span>
<span id="cb66-6"><a href="#cb66-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> c<span class="op">;</span> <span class="co">// Offset 16</span></span>
<span id="cb66-7"><a href="#cb66-7" aria-hidden="true" tabindex="-1"></a> <span class="co">// 3 bytes padding</span></span>
<span id="cb66-8"><a href="#cb66-8" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> d<span class="op">;</span> <span class="co">// Offset 20</span></span>
<span id="cb66-9"><a href="#cb66-9" aria-hidden="true" tabindex="-1"></a><span class="op">};</span> <span class="co">// Size: 24 bytes</span></span>
<span id="cb66-10"><a href="#cb66-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb66-11"><a href="#cb66-11" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Optimized <span class="op">{</span></span>
<span id="cb66-12"><a href="#cb66-12" aria-hidden="true" tabindex="-1"></a> <span class="dt">double</span> b<span class="op">;</span> <span class="co">// Offset 0</span></span>
<span id="cb66-13"><a href="#cb66-13" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> d<span class="op">;</span> <span class="co">// Offset 8</span></span>
<span id="cb66-14"><a href="#cb66-14" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> a<span class="op">;</span> <span class="co">// Offset 12</span></span>
<span id="cb66-15"><a href="#cb66-15" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> c<span class="op">;</span> <span class="co">// Offset 13</span></span>
<span id="cb66-16"><a href="#cb66-16" aria-hidden="true" tabindex="-1"></a> <span class="co">// 2 bytes padding</span></span>
<span id="cb66-17"><a href="#cb66-17" aria-hidden="true" tabindex="-1"></a><span class="op">};</span> <span class="co">// Size: 16 bytes</span></span>
<span id="cb66-18"><a href="#cb66-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb66-19"><a href="#cb66-19" aria-hidden="true" tabindex="-1"></a><span class="co">// Assembly access patterns</span></span>
<span id="cb66-20"><a href="#cb66-20" aria-hidden="true" tabindex="-1"></a><span class="co">// Unoptimized:</span></span>
<span id="cb66-21"><a href="#cb66-21" aria-hidden="true" tabindex="-1"></a><span class="co">// movzx eax, byte [rdi] ; a</span></span>
<span id="cb66-22"><a href="#cb66-22" aria-hidden="true" tabindex="-1"></a><span class="co">// movsd xmm0, [rdi + 8] ; b</span></span>
<span id="cb66-23"><a href="#cb66-23" aria-hidden="true" tabindex="-1"></a><span class="co">// movzx ecx, byte [rdi + 16] ; c</span></span>
<span id="cb66-24"><a href="#cb66-24" aria-hidden="true" tabindex="-1"></a><span class="co">// mov edx, [rdi + 20] ; d</span></span>
<span id="cb66-25"><a href="#cb66-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb66-26"><a href="#cb66-26" aria-hidden="true" tabindex="-1"></a><span class="co">// Optimized:</span></span>
<span id="cb66-27"><a href="#cb66-27" aria-hidden="true" tabindex="-1"></a><span class="co">// movsd xmm0, [rdi] ; b</span></span>
<span id="cb66-28"><a href="#cb66-28" aria-hidden="true" tabindex="-1"></a><span class="co">// mov edx, [rdi + 8] ; d</span></span>
<span id="cb66-29"><a href="#cb66-29" aria-hidden="true" tabindex="-1"></a><span class="co">// movzx eax, byte [rdi + 12] ; a</span></span>
<span id="cb66-30"><a href="#cb66-30" aria-hidden="true" tabindex="-1"></a><span class="co">// movzx ecx, byte [rdi + 13] ; c</span></span></code></pre></div>
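<p>The offsets in the comments above can be confirmed with <code>offsetof</code> and <code>sizeof</code>. The sketch below repeats the two structures and adds two hypothetical helpers that report the padding in each; the numbers assume the x86-64 System V or Windows x64 ABI, where <code>double</code> is 8-byte aligned and <code>int</code> is 4-byte aligned.</p>

```c
#include <stddef.h>

struct Unoptimized { char a; double b; char c; int d; };
struct Optimized   { double b; int d; char a; char c; };

/* Padding = total size minus the sum of the member sizes. */
size_t padding_unoptimized(void)
{
    return sizeof(struct Unoptimized)
         - (sizeof(char) + sizeof(double) + sizeof(char) + sizeof(int));
}

size_t padding_optimized(void)
{
    return sizeof(struct Optimized)
         - (sizeof(double) + sizeof(int) + sizeof(char) + sizeof(char));
}
```

<p>Ordering members from largest to smallest alignment is a common rule of thumb; C compilers never reorder members themselves, so the layout is entirely determined by the declaration order.</p>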
<h4 id="loop-optimization-and-memory-access"><strong>Loop Optimization
and Memory Access</strong></h4>
<pre class="assembly"><code>; Original loop
.loop1:
mov rax, [rdi]
add rax, [rsi]
mov [rdx], rax
add rdi, 8
add rsi, 8
add rdx, 8
dec rcx
jnz .loop1
; Unrolled and optimized (assumes rcx is a nonzero multiple of 4)
.loop2:
; Prefetch two iterations (64 bytes) ahead
prefetcht0 [rdi + 64]
prefetcht0 [rsi + 64]
; Process 4 elements at once
mov rax, [rdi]
mov r10, [rdi + 8] ; r10 is caller-saved; rbx would have to be preserved
mov r8, [rdi + 16]
mov r9, [rdi + 24]
add rax, [rsi]
add r10, [rsi + 8]
add r8, [rsi + 16]
add r9, [rsi + 24]
mov [rdx], rax
mov [rdx + 8], r10
mov [rdx + 16], r8
mov [rdx + 24], r9
add rdi, 32
add rsi, 32
add rdx, 32
sub rcx, 4
jnz .loop2</code></pre>
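<p>The unrolled assembly only works when the element count is a multiple of four. A source-level sketch of the same transformation with a scalar tail loop for the remainder (the function name is illustrative; an optimizing compiler may perform this unrolling on its own):</p>

```c
#include <stddef.h>

/* dst[i] = a[i] + b[i], unrolled by four with a tail loop for n % 4 elements. */
void add_arrays_unrolled(long *dst, const long *a, const long *b, size_t n)
{
    size_t i = 0;

    /* Main loop: four independent additions per iteration */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Tail loop: at most three leftover elements */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

<p>Unrolling reduces loop overhead (one counter update and branch per four elements) and exposes independent operations that the out-of-order core can overlap.</p>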
<h4 id="alias-analysis-and-optimization"><strong>Alias Analysis and
Optimization</strong></h4>
<div class="sourceCode" id="cb68"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb68-1"><a href="#cb68-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Compiler must assume pointers may alias</span></span>
<span id="cb68-2"><a href="#cb68-2" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> may_alias<span class="op">(</span><span class="dt">int</span><span class="op">*</span> a<span class="op">,</span> <span class="dt">int</span><span class="op">*</span> b<span class="op">,</span> <span class="dt">int</span><span class="op">*</span> c<span class="op">,</span> <span class="dt">int</span> n<span class="op">)</span> <span class="op">{</span></span>
<span id="cb68-3"><a href="#cb68-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">;</span> i<span class="op">++)</span> <span class="op">{</span></span>
<span id="cb68-4"><a href="#cb68-4" aria-hidden="true" tabindex="-1"></a> a<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> b<span class="op">[</span>i<span class="op">]</span> <span class="op">+</span> c<span class="op">[</span>i<span class="op">];</span> <span class="co">// a[i] may overlap b[] or c[]: no reordering or vectorizing without an overlap check</span></span>
<span id="cb68-5"><a href="#cb68-5" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb68-6"><a href="#cb68-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb68-7"><a href="#cb68-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb68-8"><a href="#cb68-8" aria-hidden="true" tabindex="-1"></a><span class="co">// With restrict keyword</span></span>
<span id="cb68-9"><a href="#cb68-9" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> no_alias<span class="op">(</span><span class="dt">int</span><span class="op">*</span> <span class="dt">restrict</span> a<span class="op">,</span> <span class="dt">int</span><span class="op">*</span> <span class="dt">restrict</span> b<span class="op">,</span> </span>
<span id="cb68-10"><a href="#cb68-10" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span><span class="op">*</span> <span class="dt">restrict</span> c<span class="op">,</span> <span class="dt">int</span> n<span class="op">)</span> <span class="op">{</span></span>
<span id="cb68-11"><a href="#cb68-11" aria-hidden="true" tabindex="-1"></a> <span class="co">// Compiler can optimize more aggressively</span></span>
<span id="cb68-12"><a href="#cb68-12" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">;</span> i<span class="op">++)</span> <span class="op">{</span></span>
<span id="cb68-13"><a href="#cb68-13" aria-hidden="true" tabindex="-1"></a> a<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> b<span class="op">[</span>i<span class="op">]</span> <span class="op">+</span> c<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb68-14"><a href="#cb68-14" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb68-15"><a href="#cb68-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb68-16"><a href="#cb68-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb68-17"><a href="#cb68-17" aria-hidden="true" tabindex="-1"></a><span class="co">// Assembly difference:</span></span>
<span id="cb68-18"><a href="#cb68-18" aria-hidden="true" tabindex="-1"></a><span class="co">// may_alias inner loop:</span></span>
<span id="cb68-19"><a href="#cb68-19" aria-hidden="true" tabindex="-1"></a><span class="co">// mov eax, [rsi + rcx*4]</span></span>
<span id="cb68-20"><a href="#cb68-20" aria-hidden="true" tabindex="-1"></a><span class="co">// add eax, [rdx + rcx*4] ; Load cannot be hoisted above earlier stores</span></span>
<span id="cb68-21"><a href="#cb68-21" aria-hidden="true" tabindex="-1"></a><span class="co">// mov [rdi + rcx*4], eax</span></span>
<span id="cb68-22"><a href="#cb68-22" aria-hidden="true" tabindex="-1"></a><span class="co">// inc rcx</span></span>
<span id="cb68-23"><a href="#cb68-23" aria-hidden="true" tabindex="-1"></a><span class="co">// cmp rcx, r8</span></span>
<span id="cb68-24"><a href="#cb68-24" aria-hidden="true" tabindex="-1"></a><span class="co">// jl .loop</span></span>
<span id="cb68-25"><a href="#cb68-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb68-26"><a href="#cb68-26" aria-hidden="true" tabindex="-1"></a><span class="co">// no_alias can use vector instructions:</span></span>
<span id="cb68-27"><a href="#cb68-27" aria-hidden="true" tabindex="-1"></a><span class="co">// movdqu xmm0, [rsi + rcx*4]</span></span>
<span id="cb68-28"><a href="#cb68-28" aria-hidden="true" tabindex="-1"></a><span class="co">// paddd xmm0, [rdx + rcx*4]</span></span>
<span id="cb68-29"><a href="#cb68-29" aria-hidden="true" tabindex="-1"></a><span class="co">// movdqu [rdi + rcx*4], xmm0</span></span>
<span id="cb68-30"><a href="#cb68-30" aria-hidden="true" tabindex="-1"></a><span class="co">// add rcx, 4</span></span>
<span id="cb68-31"><a href="#cb68-31" aria-hidden="true" tabindex="-1"></a><span class="co">// cmp rcx, r8</span></span>
<span id="cb68-32"><a href="#cb68-32" aria-hidden="true" tabindex="-1"></a><span class="co">// jl .loop</span></span></code></pre></div>
<h4 id="memory-access-coalescing"><strong>Memory Access
Coalescing</strong></h4>
<pre class="assembly"><code>; Inefficient: Multiple small accesses
load_bytes:
movzx eax, byte [rdi]
movzx ecx, byte [rdi + 1]
movzx edx, byte [rdi + 2]
movzx esi, byte [rdi + 3]
; Efficient: Single coalesced access
load_dword:
mov eax, [rdi] ; Load all 4 bytes
movzx ecx, al ; Extract byte 0
movzx edx, ah ; Extract byte 1
shr eax, 16 ; Bytes 2 and 3 are now in AL and AH
movzx esi, al ; Extract byte 2
movzx eax, ah ; Extract byte 3</code></pre>
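<p>The same idea can be written in C. The sketch below reads four
bytes with one 32‑bit load and splits them with shifts; it assumes a
little‑endian target such as x86‑64, and the name <code>load4</code> is
hypothetical.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Hypothetical helper: one 32-bit load, then shift out each byte.
 * Assumes little-endian byte order (true on x86-64). */
void load4(const uint8_t *p, uint8_t out[4])
{
    uint32_t word;
    memcpy(&amp;word, p, sizeof word);   /* single load; safe for unaligned p */

    out[0] = (uint8_t)(word);            /* byte at p[0] */
    out[1] = (uint8_t)(word &gt;&gt; 8);       /* byte at p[1] */
    out[2] = (uint8_t)(word &gt;&gt; 16);      /* byte at p[2] */
    out[3] = (uint8_t)(word &gt;&gt; 24);      /* byte at p[3] */
}</code></pre>
<p>Compilers typically turn the <code>memcpy</code> into a single
<code>mov</code>, which avoids the undefined behaviour of casting the
byte pointer to <code>uint32_t *</code>.</p>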
<hr />
<h4 id="summary-and-key-takeaways"><strong>Summary and Key
Takeaways</strong></h4>
<p>In moving from the “big picture” of x86‑64’s virtual address space
down through the mechanics of base/index/scale encoding, this chapter
shows that memory architecture is one of the richest areas where the
assembler’s low‑level control and the compiler’s high‑level choices
meet.</p>
<ul>
<li><p><strong>Addressing</strong> — The
<code>[base + index*scale + displacement]</code> model, with
RIP‑relative addressing in long mode, is central to both hand‑written
position‑independent code and compiler‑generated relocatable
binaries.</p></li>
<li><p><strong>Segmentation and paging</strong> — While segmentation is
largely gone in 64‑bit mode, the FS/GS bases still provide an
indirection point for per‑thread/per‑CPU data in systems programming,
and 4‑/5‑level page tables define how every virtual address is
translated.</p></li>
<li><p><strong>Caching and access patterns</strong> — The architecture’s
total store order model, cache‑line granularity, and cache‑control
instructions mean that the performance of both inline assembly loops and
compiler auto‑vectorized code depends strongly on how data is laid out
and traversed.</p></li>
<li><p><strong>Stack discipline</strong> — ABI‑mandated alignment,
red‑zone usage, and prologue/epilogue conventions are the groundwork on
which safe interoperability with C/C++ runtimes depends.</p></li>
<li><p><strong>Atomicity and ordering</strong> — LOCK‑prefixed
instructions, fences, and transactional execution influence everything
from spinlocks to lock‑free data structures.</p></li>
<li><p><strong>LEA as a computational tool</strong> — Beyond its name,
LEA is more than “load effective address” — it’s a flexible
three‑operand, flag‑neutral arithmetic builder that compilers lean on
heavily.</p></li>
<li><p><strong>Compiler optimisation levers</strong> — Structure
padding, alias analysis, unrolling, and access coalescing are examples
of how high‑level awareness directly affects instruction choices and
addressing modes.</p></li>
</ul>
<p>From an <em>assembly programmer’s</em> perspective, this means
knowing when to pick a particular addressing mode or cache hint and how
to encode it. From a <em>compiler engineer’s</em> perspective, it means
recognising the patterns in the front‑end’s IR that can be lowered to
the most efficient form — or, conversely, where a less‑fancy form will
avoid microarchitectural pitfalls.</p>
<hr />
<h4 id="looking-ahead"><strong>Looking Ahead</strong></h4>
<p>The interplay between memory and computation becomes even tighter
once <strong>vector and floating‑point extensions</strong> enter the
mix. Chapter 4 turns first to stack operations and calling conventions;
the SIMD material (SSE, AVX, and beyond) then builds on the memory
foundation laid here to explore:</p>
<ul>
<li><p>How aligned vs. unaligned loads/stores affect SIMD
throughput</p></li>
<li><p>Layout of vector registers and their preservation across
calls</p></li>
<li><p>Instruction encoding with VEX/EVEX and the role of
masking/broadcasting</p></li>
<li><p>Compiler vectorisation strategies and how they map to real
encodings</p></li>
<li><p>The marriage of cache‑friendly access patterns with vector
loops</p></li>
</ul>
<p>This transition is natural: the same addressing forms, alignment
rules, and prefetching hints you’ve just seen in scalar code have direct
consequences for vectorised kernels in both hand‑coded assembly and
compiler‑generated loops.</p>
<hr />
<h2
id="chapter-4-stack-operations-and-calling-conventions"><strong>Chapter
4: Stack Operations and Calling Conventions</strong></h2>
<h3 id="stack-architecture-fundamentals"><strong>4.1 Stack Architecture
Fundamentals</strong></h3>
<h4 id="stack-layout-and-growth-direction"><strong>Stack Layout and
Growth Direction</strong></h4>
<p>The x86-64 stack grows downward from high to low addresses, with RSP
pointing to the top (lowest address) of the stack:</p>
<pre class="assembly"><code>; Stack memory layout (addresses decrease downward)
; Higher addresses (0x7FFFFFFFFFFF)
; ┌─────────────────────┐
; │ Environment vars │
; ├─────────────────────┤
; │ Program arguments │
; ├─────────────────────┤
; │ Stack frames │ ← Stack grows down
; │ ↓ │
; │ [unused space] │
; │ ↑ │
; │ Heap │ ← Heap grows up
; ├─────────────────────┤
; │ .bss/.data │
; ├─────────────────────┤
; │ .text │
; └─────────────────────┘
; Lower addresses (0x400000)
; Basic stack operations
push rax ; RSP -= 8; [RSP] = RAX
pop rbx ; RBX = [RSP]; RSP += 8
; Equivalent manual operations
sub rsp, 8
mov [rsp], rax ; Same as push rax
mov rbx, [rsp]
add rsp, 8 ; Same as pop rbx</code></pre>
<h4 id="stack-pointer-alignment-requirements"><strong>Stack Pointer
Alignment Requirements</strong></h4>
<pre class="assembly"><code>; System V AMD64 ABI: RSP must be 16-byte aligned before CALL
; Windows x64 ABI: Same requirement
check_alignment:
; CALL pushed an 8-byte return address, so at function entry
; RSP mod 16 = 8, i.e. (RSP + 8) mod 16 = 0
push rbp ; RSP now 16-byte aligned
mov rbp, rsp
; Maintain alignment for calls
sub rsp, 24 ; Local space (leaves RSP misaligned by 8)
and rsp, -16 ; Round RSP down to a 16-byte boundary
call some_function ; Callee again sees (RSP + 8) mod 16 = 0
mov rsp, rbp ; Undo the allocation and the realignment
pop rbp
ret</code></pre>
<h4 id="stack-frame-structure"><strong>Stack Frame
Structure</strong></h4>
<pre class="assembly"><code>; Standard stack frame layout
; ┌─────────────────────┐ Higher addresses
; │ Caller&#39;s frame │
; ├─────────────────────┤
; │ Arguments 7+ │ [rbp + 16 + n*8]
; ├─────────────────────┤
; │ Return address │ [rbp + 8]
; ├─────────────────────┤
; │ Saved RBP │ [rbp] ← RBP points here
; ├─────────────────────┤
; │ Local variables │ [rbp - n]
; ├─────────────────────┤
; │ Saved registers │
; ├─────────────────────┤
; │ Stack arguments │
; └─────────────────────┘ ← RSP points here
; Lower addresses
function_with_frame:
push rbp ; Save caller&#39;s frame pointer
mov rbp, rsp ; Establish new frame
sub rsp, 48 ; Allocate local space
; Save callee-saved registers if used
push rbx
push r12
push r13
; Function body
mov [rbp - 8], rdi ; Store first param as local
mov [rbp - 16], rsi ; Store second param
; Restore and return
pop r13
pop r12
pop rbx
mov rsp, rbp ; Restore stack pointer
pop rbp ; Restore frame pointer
ret</code></pre>
<h3 id="system-v-amd64-abi"><strong>4.2 System V AMD64 ABI</strong></h3>
<h4 id="register-usage-convention"><strong>Register Usage
Convention</strong></h4>
<pre class="assembly"><code>; Parameter passing registers (in order)
; Integer/Pointer: RDI, RSI, RDX, RCX, R8, R9
; Floating-point: XMM0-XMM7
; Register preservation rules:
; Caller-saved (volatile): RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
; Callee-saved (non-volatile): RBX, RBP, R12-R15
; Special registers:
; RAX: Return value (also RDX for 128-bit returns)
; RSP: Stack pointer (must maintain alignment)
; RBP: Optional frame pointer</code></pre>
<h4 id="function-calling-examples"><strong>Function Calling
Examples</strong></h4>
<pre class="assembly"><code>; C prototype: long sum6(long a, long b, long c, long d, long e, long f)
call_sum6:
; First 6 arguments in registers
mov rdi, 1 ; a
mov rsi, 2 ; b
mov rdx, 3 ; c
mov rcx, 4 ; d
mov r8, 5 ; e
mov r9, 6 ; f
call sum6
; Result in RAX
; C prototype: long sum8(long a, long b, long c, long d,
; long e, long f, long g, long h)
call_sum8:
; First 6 in registers, rest on stack
push 8 ; h (8th argument)
push 7 ; g (7th argument)
mov rdi, 1 ; a
mov rsi, 2 ; b
mov rdx, 3 ; c
mov rcx, 4 ; d
mov r8, 5 ; e
mov r9, 6 ; f
call sum8
add rsp, 16 ; Clean up stack arguments</code></pre>
<h4 id="floating-point-and-mixed-arguments"><strong>Floating-Point and
Mixed Arguments</strong></h4>
<pre class="assembly"><code>; C: double compute(int a, double b, float c, long d, double e)
call_compute:
mov edi, 42 ; a (int in EDI)
movsd xmm0, [double_b] ; b (double in XMM0)
movss xmm1, [float_c] ; c (float in XMM1)
mov rsi, 100 ; d (long in RSI)
movsd xmm2, [double_e] ; e (double in XMM2)
call compute
; Result in XMM0
; Structure passing (≤16 bytes passed in registers)
; struct Point { double x, y; }; // 16 bytes
; void process_point(Point p);
pass_struct:
movsd xmm0, [point_x] ; First 8 bytes in XMM0
movsd xmm1, [point_y] ; Second 8 bytes in XMM1
call process_point</code></pre>
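<p>The register choices in these examples follow from one rule:
integer‑class and SSE‑class arguments draw from two independent register
sequences. A minimal C sketch of that rule for scalar arguments follows.
The names <code>arg_class</code> and <code>assign_sysv</code> are
hypothetical, and aggregates, x87 types, and varargs are deliberately
not modelled.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;

enum arg_class { ARG_INTEGER, ARG_SSE };

/* Hypothetical model of System V AMD64 scalar argument passing.
 * out[i] receives the register name for argument i, or "stack"
 * once the matching register sequence is exhausted. */
void assign_sysv(const enum arg_class *args, size_t n, const char **out)
{
    static const char *const int_regs[] = {
        "rdi", "rsi", "rdx", "rcx", "r8", "r9"
    };
    static const char *const sse_regs[] = {
        "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
    };
    size_t next_int = 0, next_sse = 0;

    for (size_t i = 0; i &lt; n; i++) {
        if (args[i] == ARG_INTEGER)
            out[i] = next_int &lt; 6 ? int_regs[next_int++] : "stack";
        else
            out[i] = next_sse &lt; 8 ? sse_regs[next_sse++] : "stack";
    }
}</code></pre>
<p>For <code>compute(int, double, float, long, double)</code> this
model yields RDI, XMM0, XMM1, RSI, XMM2, matching the assembly
above.</p>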
<h4 id="red-zone-usage"><strong>Red Zone Usage</strong></h4>
<pre class="assembly"><code>; 128-byte red zone below RSP (System V AMD64 only!)
; Leaf functions can use without adjusting RSP
leaf_function:
; Can use [rsp-128] to [rsp-1] without adjusting RSP
mov [rsp - 8], rdi ; Save parameter
mov [rsp - 16], rsi
; Computation
add rdi, rsi
imul rdi, [rsp - 8]
mov rax, rdi ; Return value
ret
; Non-leaf functions CANNOT rely on red zone
non_leaf_function:
sub rsp, 24 ; Allocate real stack space; 24 + 8 (return address) keeps RSP 16-byte aligned
mov [rsp], rdi ; Save parameters
mov [rsp + 8], rsi
call other_function ; CALL pushes a return address into what was the red zone
add rsp, 24
ret</code></pre>
<h4 id="variable-argument-functions"><strong>Variable Argument
Functions</strong></h4>
<pre class="assembly"><code>; C: int printf(const char* format, ...);
; Requires special handling for variable arguments
call_printf:
; For varargs, AL = number of vector registers used
lea rdi, [format_string] ; First fixed argument
mov rsi, 42 ; First variable argument
movsd xmm0, [double_val] ; FP argument
mov al, 1 ; 1 XMM register used
call printf
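; Implementing varargs function
varargs_function:
; Reserve the 176-byte register save area (6 GPRs + 8 XMM registers)
; plus 8 bytes of padding: entry RSP = 16k + 8, so RSP ends up aligned
sub rsp, 184
; Save all potential integer argument registers
mov [rsp], rdi
mov [rsp + 8], rsi
mov [rsp + 16], rdx
mov [rsp + 24], rcx
mov [rsp + 32], r8
mov [rsp + 40], r9
; Save XMM registers if AL &gt; 0
test al, al
jz .no_xmm
; Save XMM0-XMM7 (MOVAPS needs the 16-byte alignment established above)
movaps [rsp + 48], xmm0
movaps [rsp + 64], xmm1
; ... save XMM2-XMM7 at [rsp + 80] through [rsp + 160]
.no_xmm:
; Process arguments using va_list (reg_save_area = RSP)</code></pre>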
<h3 id="microsoft-x64-abi"><strong>4.3 Microsoft x64 ABI</strong></h3>
<h4 id="register-convention-differences"><strong>Register Convention
Differences</strong></h4>
<pre class="assembly"><code>; Microsoft x64 calling convention
; Parameter registers: RCX, RDX, R8, R9 (first 4)
; Floating-point: XMM0-XMM3 (correspond to parameter position)
;
; Caller-saved: RAX, RCX, RDX, R8-R11, XMM0-XMM5
; Callee-saved: RBX, RBP, RDI, RSI, R12-R15, XMM6-XMM15
; Shadow space requirement (32 bytes always reserved)
ms_x64_caller:
sub rsp, 40 ; 32-byte shadow space (mandatory) + 8 bytes to realign RSP to 16
mov rcx, 1 ; First argument
mov rdx, 2 ; Second argument
mov r8, 3 ; Third argument
mov r9, 4 ; Fourth argument
call function
add rsp, 40 ; Release shadow space and padding
ret</code></pre>
<h4 id="function-prologue-and-epilogue-windows"><strong>Function
Prologue and Epilogue (Windows)</strong></h4>
<pre class="assembly"><code>; Windows x64 function with frame pointer
windows_function:
push rbp ; Save frame pointer
push rdi ; Save non-volatile registers
push rsi
push rbx
sub rsp, 28h ; Allocate locals; return address + 4 pushes + 28h = 50h keeps RSP 16-byte aligned
lea rbp, [rsp + 20h] ; Frame pointer setup (offset must be a multiple of 16)
; Function body
mov [rbp - 8], rcx ; Save first parameter
; Epilogue
lea rsp, [rbp + 8] ; Undo the allocation: RBP + 8 = RSP + 28h
pop rbx
pop rsi
pop rdi
pop rbp
ret
; Windows unwind information (for SEH), MASM syntax
windows_function PROC FRAME
push rbp
.pushreg rbp
push rdi
.pushreg rdi
push rsi
.pushreg rsi
push rbx
.pushreg rbx
sub rsp, 28h
.allocstack 28h
lea rbp, [rsp + 20h]
.setframe rbp, 20h
.endprolog
; Function body
windows_function ENDP</code></pre>
<h4 id="floating-point-parameter-passing-windows"><strong>Floating-Point
Parameter Passing (Windows)</strong></h4>
<pre class="assembly"><code>; Windows x64: FP args go in XMM registers matching position
; void process(int a, double b, int c, float d)
windows_fp_call:
sub rsp, 40 ; Shadow space (32) + 8 bytes to realign RSP to 16
mov ecx, 10 ; a (integer in RCX)
movsd xmm1, [double_b] ; b (double in XMM1 - 2nd position)
mov r8d, 20 ; c (integer in R8)
movss xmm3, [float_d] ; d (float in XMM3 - 4th position)
call process
add rsp, 40</code></pre>
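<p>Unlike System V, the Microsoft convention ties each of the first
four arguments to a register by <em>position</em>, whatever its type. A
small C sketch of that rule for scalar arguments is given below. The
names <code>win_arg_class</code> and <code>assign_win64</code> are
hypothetical, and varargs (where floating‑point values are duplicated
into the integer register) are not modelled.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;

enum win_arg_class { WIN_INTEGER, WIN_FLOAT };

/* Hypothetical model of Microsoft x64 scalar argument passing.
 * Slot i (0..3) uses either the i-th integer register or the i-th
 * XMM register; later arguments go on the stack. */
void assign_win64(const enum win_arg_class *args, size_t n,
                  const char **out)
{
    static const char *const int_regs[] = {"rcx", "rdx", "r8", "r9"};
    static const char *const fp_regs[]  = {"xmm0", "xmm1", "xmm2", "xmm3"};

    for (size_t i = 0; i &lt; n; i++) {
        if (i &gt;= 4)
            out[i] = "stack";
        else if (args[i] == WIN_INTEGER)
            out[i] = int_regs[i];
        else
            out[i] = fp_regs[i];
    }
}</code></pre>
<p>For <code>process(int, double, int, float)</code> the model gives
RCX, XMM1, R8, XMM3: XMM0 and XMM2 are skipped because their slots hold
integers.</p>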
<h3 id="stack-frame-management"><strong>4.4 Stack Frame
Management</strong></h3>
<h4 id="frame-pointer-vs-frame-pointer-omission"><strong>Frame Pointer
vs Frame Pointer Omission</strong></h4>
<pre class="assembly"><code>; With frame pointer (traditional, easier debugging)
with_frame_pointer:
push rbp
mov rbp, rsp
sub rsp, 32 ; Locals
mov [rbp - 8], rdi ; Access locals via RBP
mov [rbp - 16], rsi
; RBP provides stable reference point
leave ; mov rsp, rbp; pop rbp
ret
; Without frame pointer (optimization)
without_frame_pointer:
sub rsp, 32 ; Locals
mov [rsp + 24], rdi ; Access locals via RSP
mov [rsp + 16], rsi
; All offsets relative to RSP
; One more register available (RBP)
add rsp, 32
ret
; Compiler chooses based on:
; -fomit-frame-pointer (GCC/Clang)
; /Oy (MSVC, 32-bit x86 targets only)</code></pre>
<h4 id="dynamic-stack-allocation-alloca"><strong>Dynamic Stack
Allocation (alloca)</strong></h4>
<pre class="assembly"><code>; Implementing variable-size stack allocation
; C: void* alloca(size_t size)
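; Compilers expand alloca inline. As a callable routine it must move the
; return address out of the way, because RET would otherwise pop its
; target from the newly allocated block.
my_alloca:
; RDI contains size (System V AMD64)
pop rdx ; Take the return address off the stack
add rdi, 15 ; Round up to 16-byte boundary
and rdi, -16
sub rsp, rdi ; Allocate space
mov rax, rsp ; Return pointer
jmp rdx ; Return without RET; caller restores RSP via its frame pointer
; Using dynamic allocation
function_with_vla:
push rbp
mov rbp, rsp
; Allocate variable-length array
; RDI = element count (first integer argument, System V AMD64)
shl rdi, 3 ; Multiply by 8 (sizeof(long))
add rdi, 15
and rdi, -16 ; Align to 16 bytes
sub rsp, rdi ; Allocate
mov rax, rsp ; RAX points to array
; Use array...
leave
ret</code></pre>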
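<p>The rounding idiom <code>add rdi, 15</code> /
<code>and rdi, -16</code> is easy to get wrong by one. A brief C sketch
of the two alignment operations used in this chapter follows; the
function names are illustrative.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Round a byte count up to the next multiple of 16
 * (mirrors: add rdi, 15 / and rdi, -16). */
uint64_t round_up_16(uint64_t size)
{
    return (size + 15) &amp; ~(uint64_t)15;
}

/* Round an address down to a 16-byte boundary
 * (mirrors: and rsp, -16). */
uint64_t align_down_16(uint64_t addr)
{
    return addr &amp; ~(uint64_t)15;
}</code></pre>
<p>Sizes are rounded up so the request still fits; the stack pointer is
rounded down because the stack grows toward lower addresses.</p>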
<h4 id="stack-unwinding-support"><strong>Stack Unwinding
Support</strong></h4>
<pre class="assembly"><code>; DWARF CFI directives (Linux/Unix)
function_with_cfi:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
sub rsp, 32
; Function body
leave
.cfi_def_cfa rsp, 8
ret
.cfi_endproc
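; Exception handling frame setup (conceptual sketch only)
; x86-64 ABIs do not store handler addresses in the frame: handlers are
; located through unwind tables (.eh_frame/LSDA on System V,
; .pdata/.xdata on Windows). The store below only illustrates the idea.
exception_aware_function:
push rbp
mov rbp, rsp
sub rsp, 32
; Record a handler address in a local slot (illustrative)
lea rax, [exception_handler]
mov [rsp], rax ; Handler address
; Code that might throw
call potentially_throwing_function
; Clean up
add rsp, 32
pop rbp
ret
exception_handler:
; Handle exception
; RSP points to exception record</code></pre>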
<h3 id="leaf-vs-non-leaf-functions"><strong>4.5 Leaf vs Non-Leaf
Functions</strong></h3>
<h4 id="leaf-function-optimization"><strong>Leaf Function
Optimization</strong></h4>
<pre class="assembly"><code>; Leaf function (doesn&#39;t call other functions)
; Can use red zone, minimal prologue/epilogue
leaf_strlen:
; RDI = string pointer (System V AMD64)
xor rax, rax ; Counter
.loop:
cmp byte [rdi + rax], 0
je .done
inc rax
jmp .loop
.done:
ret ; No stack frame needed
; Non-leaf equivalent
non_leaf_strlen:
push rbp
mov rbp, rsp
sub rsp, 16 ; Space for locals
mov [rsp], rdi ; Save string pointer
call some_validation ; Calls another function
mov rdi, [rsp] ; Restore pointer
; ... strlen logic ...
leave
ret</code></pre>
<h4 id="tail-call-optimization"><strong>Tail Call
Optimization</strong></h4>
<pre class="assembly"><code>; Recursive function with tail call
; long factorial(long n, long acc)
factorial_tail:
test rdi, rdi ; if (n == 0)
jz .base_case
; Tail call: factorial(n - 1, acc * n)
imul rsi, rdi ; acc *= n
dec rdi ; n--
jmp factorial_tail ; Jump, don&#39;t push return address
; No new frame built
.base_case:
mov rax, rsi ; Return accumulated result
ret</code></pre>
<p><strong>Explanation:</strong><br />
Unlike a typical recursive call (<code>call factorial_tail</code>), here
the function jumps directly to the next invocation without creating a
fresh stack frame. This is possible because there’s no work left to do
after the recursive step — the “tail” position. Tail‑call optimization
(TCO) eliminates stack growth in deep recursion, turning recursion into
iteration at the assembly level.</p>
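<p>For comparison, here is a C sketch of the source a compiler would
start from, next to the loop it effectively becomes. The C standard does
not guarantee this transformation; GCC and Clang usually perform it at
<code>-O2</code>. The name <code>factorial_loop</code> is
illustrative.</p>
<pre class="sourceCode c"><code class="sourceCode c">/* Tail-recursive form: the recursive call is the last action,
 * so nothing from this frame is needed afterwards. */
long factorial_tail(long n, long acc)
{
    if (n == 0)
        return acc;
    return factorial_tail(n - 1, acc * n);
}

/* Equivalent loop: what the jmp-based assembly computes. */
long factorial_loop(long n, long acc)
{
    while (n != 0) {
        acc *= n;   /* imul rsi, rdi */
        n--;        /* dec rdi */
    }
    return acc;     /* mov rax, rsi */
}</code></pre>
<p>Call either one with an initial accumulator of 1, for example
<code>factorial_tail(5, 1)</code>.</p>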
<hr />
<p>✅ <strong>Key Takeaways from Chapter 4:</strong></p>
<ol type="1">
<li><p><strong>Stack discipline matters:</strong> Both the downward
growth and strict 16‑byte alignment rules are central to correct ABI
conformance on x86‑64.</p></li>
<li><p><strong>Know your ABI differences:</strong> System V AMD64 and
Microsoft x64 rules differ in parameter registers, shadow space, and
red‑zone usage.</p></li>
<li><p><strong>Frame management impacts debugging:</strong> Using a
frame pointer helps tooling and stack traces; omitting it saves a
register and can improve performance.</p></li>
<li><p><strong>Leaf functions can skip overhead:</strong> Red‑zone and
no‑frame-pointer optimizations keep them minimal.</p></li>
<li><p><strong>Tail calls avoid extra frames:</strong> When the last
action is a call, using a jump preserves stack height.</p></li>
</ol>
<hr />
<h2
id="chapter-5-exception-handling-and-stack-unwinding"><strong>Chapter 5:
Exception Handling and Stack Unwinding</strong></h2>
<h3 id="exception-handling-fundamentals"><strong>5.1 Exception Handling
Fundamentals</strong></h3>
<h4 id="types-of-exceptions-in-x86-64"><strong>Types of Exceptions in
x86-64</strong></h4>
<p>The x86-64 architecture supports multiple exception types, each
requiring different handling mechanisms:</p>
<pre class="assembly"><code>; Hardware exceptions (interrupts/traps/faults)
; ┌────────────────┬────────┬──────────────────────────────┐
; │ Exception │ Vector │ Type &amp; Description │
; ├────────────────┼────────┼──────────────────────────────┤
; │ Divide Error │ 0 │ Fault - DIV/IDIV by zero │
; │ Debug │ 1 │ Fault/Trap - Debug exception │
; │ NMI │ 2 │ Interrupt - Non-maskable │
; │ Breakpoint │ 3 │ Trap - INT3 instruction │
; │ Overflow │ 4 │ Trap - INTO instruction │
; │ BOUND Range │ 5 │ Fault - BOUND instruction │
; │ Invalid Opcode │ 6 │ Fault - UD2 or invalid │
; │ Device Not Av. │ 7 │ Fault - No FPU │
; │ Double Fault │ 8 │ Abort - Exception during exc │
; │ Invalid TSS │ 10 │ Fault - Task switch error │
; │ Segment Not P. │ 11 │ Fault - Segment not present │
; │ Stack-Segment │ 12 │ Fault - Stack exception │
; │ General Prot. │ 13 │ Fault - Protection violation │
; │ Page Fault │ 14 │ Fault - Page not present │
; │ x87 FPU Error │ 16 │ Fault - FPU error │
; │ Alignment │ 17 │ Fault - Unaligned access │
; │ Machine Check │ 18 │ Abort - Hardware error │
; │ SIMD FP │ 19 │ Fault - SSE/AVX exception │
; └────────────────┴────────┴──────────────────────────────┘
; Exception handler entry point
exception_handler:
; CPU automatically pushes (in order):
; SS, RSP, RFLAGS, CS, RIP
; (Error code for some exceptions)
push rax ; Save all registers
push rbx
push rcx
push rdx
push rsi
push rdi
push rbp
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
mov rdi, rsp ; Pass exception frame to handler
call handle_exception
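pop r15 ; Restore registers in reverse push order
pop r14
; ... restore the remaining registers, ending with pop rax
add rsp, 8 ; Drop the error code (stubs for vectors without one push a dummy value first)
iretq ; Return from interrupt</code></pre>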
<h4 id="exception-frame-layout"><strong>Exception Frame
Layout</strong></h4>
<pre class="assembly"><code>; Exception stack frame (after CPU push)
; ┌─────────────────────┐ Higher addresses
; │ Old SS │ [RSP + 32]
; ├─────────────────────┤
; │ Old RSP │ [RSP + 24]
; ├─────────────────────┤
; │ Old RFLAGS │ [RSP + 16]
; ├─────────────────────┤
; │ Old CS │ [RSP + 8]
; ├─────────────────────┤
; │ Old RIP │ [RSP] ← RSP after exception
; ├─────────────────────┤
; │ Error Code │ (only for some exceptions)
; └─────────────────────┘
; Accessing exception information
get_fault_address:
push rbp
mov rbp, rsp
; For page fault, CR2 contains fault address
mov rax, cr2
; Error code at [rbp + 16] for faults with error code
mov rdx, [rbp + 16]
test rdx, 1 ; Check present bit
jz .not_present
test rdx, 2 ; Check write bit
jnz .write_fault
; ... handle different fault types</code></pre>
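<p>The page‑fault error code is a bit field, and the
<code>test</code> instructions above probe its two lowest bits. A small
C sketch that decodes the low five bits, as defined in the Intel and AMD
manuals, is shown here; the struct and function names are
illustrative.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical decoded view of a page-fault (#PF) error code. */
struct pf_info {
    bool present;       /* bit 0: 0 = page not present, 1 = protection violation */
    bool write;         /* bit 1: 0 = read access, 1 = write access */
    bool user;          /* bit 2: 0 = supervisor mode, 1 = user mode */
    bool reserved_bit;  /* bit 3: reserved bit set in a paging structure */
    bool instr_fetch;   /* bit 4: fault caused by an instruction fetch */
};

struct pf_info decode_pf_error(uint64_t code)
{
    struct pf_info info;
    info.present      = (code &gt;&gt; 0) % 2 != 0;
    info.write        = (code &gt;&gt; 1) % 2 != 0;
    info.user         = (code &gt;&gt; 2) % 2 != 0;
    info.reserved_bit = (code &gt;&gt; 3) % 2 != 0;
    info.instr_fetch  = (code &gt;&gt; 4) % 2 != 0;
    return info;
}</code></pre>
<p>The faulting address itself is not part of the error code; it is
read from CR2, as in the assembly above.</p>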
<h3 id="stack-unwinding-mechanisms"><strong>5.2 Stack Unwinding
Mechanisms</strong></h3>
<h4 id="dwarf-cfi-call-frame-information"><strong>DWARF CFI (Call Frame
Information)</strong></h4>
<p>The DWARF format provides detailed unwinding information for
debuggers and exception handlers:</p>
<pre class="assembly"><code>; DWARF CFI directives for stack unwinding
complex_function:
.cfi_startproc
.cfi_personality 0x3, __gxx_personality_v0
.cfi_lsda 0x3, .LLSDA0
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
push rbx
.cfi_offset rbx, -24
push r12
.cfi_offset r12, -32
sub rsp, 48 ; Local variables
; Function body that might throw
call potentially_throwing_function
; Cleanup
add rsp, 48
pop r12
.cfi_restore r12
pop rbx
.cfi_restore rbx
pop rbp
.cfi_def_cfa rsp, 8
ret
.cfi_endproc
; Exception handling table (LSDA - Language Specific Data Area)
.LLSDA0:
.byte 0xff ; LPStart encoding (omitted)
.byte 0x00 ; TType encoding (absolute)
.uleb128 .LLSDATT0-.LLSDATTD0
.LLSDATTD0:
.byte 0x01 ; Call site encoding (uleb128)
.uleb128 .LLSDACSE0-.LLSDACSB0
.LLSDACSB0:
; Call site table
.uleb128 .LEHB0-.LFB0 ; Start of try block
.uleb128 .LEHE0-.LEHB0 ; Length of try block
.uleb128 .L1-.LFB0 ; Landing pad
.uleb128 0x01 ; Action record
.LLSDACSE0:
.LLSDATT0:
; Type table and action records</code></pre>
<h4 id="manual-stack-walking"><strong>Manual Stack Walking</strong></h4>
<pre class="assembly"><code>; Stack unwinding without debug info
; Walk the chain of frame pointers
walk_stack:
push rbp
mov rbp, rsp
sub rsp, 32
; RDI = output buffer for addresses
; RSI = max frames to capture
mov rcx, rsi ; Frame counter
mov r8, rbp ; Current frame pointer
xor rdx, rdx ; Frame index
.walk_loop:
test rcx, rcx
jz .done
; Validate frame pointer
mov rax, r8
and rax, 7 ; Check alignment
jnz .invalid_frame
; Check if readable (simplified)
cmp r8, 0x1000 ; Too low?
jb .invalid_frame
mov r9, 0x7fffffffffff
cmp r8, r9 ; Too high?
ja .invalid_frame
; Get return address
mov rax, [r8 + 8]
mov [rdi + rdx*8], rax
; Move to next frame
mov r8, [r8] ; Follow chain
inc rdx
dec rcx
; Check for end of chain
test r8, r8
jnz .walk_loop
.done:
mov rax, rdx ; Return frame count
leave
ret
.invalid_frame:
mov rax, -1 ; Error
leave
ret</code></pre>
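<p>The frame‑pointer chain is a singly linked list, which the
following C sketch makes explicit. It walks a chain of
<code>struct frame</code> records laid out the way
<code>push rbp; mov rbp, rsp</code> leaves them. The struct and function
names are hypothetical, and the sketch operates on data you hand it
rather than on the live stack.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Layout at the address held in RBP:
 * [rbp]     saved RBP of the caller (next link in the chain)
 * [rbp + 8] return address into the caller */
struct frame {
    const struct frame *prev;
    uintptr_t return_address;
};

/* Copies up to max_frames return addresses into out.
 * Returns the number captured, or -1 on a misaligned frame pointer. */
long walk_frames(const struct frame *fp, uintptr_t *out, size_t max_frames)
{
    size_t count = 0;

    while (fp != NULL &amp;&amp; count &lt; max_frames) {
        if ((uintptr_t)fp % 8 != 0)
            return -1;                  /* same check as "and rax, 7" */
        out[count++] = fp-&gt;return_address;
        fp = fp-&gt;prev;                  /* follow the chain */
    }
    return (long)count;
}</code></pre>
<p>On a live stack the starting pointer could come from
<code>__builtin_frame_address(0)</code> on GCC or Clang, provided the
code is built with frame pointers enabled.</p>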
<h3 id="seh-structured-exception-handling-on-windows"><strong>5.3 SEH
(Structured Exception Handling) on Windows</strong></h3>
<h4 id="seh-frame-setup"><strong>SEH Frame Setup</strong></h4>
<pre class="assembly"><code>; Windows x64 Structured Exception Handling
; Uses function tables and unwind info
seh_protected_function:
; Function prolog with SEH markers
push rbp
.pushreg rbp
mov rbp, rsp
.setframe rbp, 0
sub rsp, 64
.allocstack 64
; Save non-volatile registers
mov [rbp - 8], rbx
.savereg rbx, -8
mov [rbp - 16], rsi
.savereg rsi, -16
mov [rbp - 24], rdi
.savereg rdi, -24
.endprolog
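; Set up exception handler
; NOTE: __C_specific_handler_install is a placeholder, not a real API.
; On x64 the handler is attached statically through the function&#39;s
; unwind data (for example MASM&#39;s PROC FRAME:handler).
lea rcx, [exception_filter]
lea rdx, [exception_handler]
call __C_specific_handler_install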
; Protected code block
.try_begin:
call risky_operation
test rax, rax
jz .error_path
; Normal execution continues
jmp .try_end
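.error_path:
; Trigger exception: RaiseException(code, flags, nargs, args)
mov ecx, 0xC0000005 ; Access violation code (32-bit write zero-extends into RCX)
xor edx, edx ; dwExceptionFlags = 0
xor r8d, r8d ; nNumberOfArguments = 0
xor r9d, r9d ; lpArguments = NULL
call RaiseException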
.try_end:
; Cleanup and return
mov rbx, [rbp - 8]
mov rsi, [rbp - 16]
mov rdi, [rbp - 24]
leave
ret
exception_filter:
; RCX = EXCEPTION_POINTERS
mov rax, [rcx] ; EXCEPTION_RECORD
mov edx, [rax] ; Exception code (a 32-bit DWORD)
cmp edx, 0xC0000005 ; Access violation? (not encodable as a 64-bit compare)
je .handle_it
mov eax, 0 ; EXCEPTION_CONTINUE_SEARCH
ret
.handle_it:
mov eax, 1 ; EXCEPTION_EXECUTE_HANDLER
ret
exception_handler:
; Handle the exception
; Can modify context to resume execution
ret</code></pre>
<h4 id="unwind-information-structure"><strong>Unwind Information
Structure</strong></h4>
<pre class="assembly"><code>; Windows x64 unwind information
; Located in .pdata and .xdata sections
; .pdata entry (RUNTIME_FUNCTION)
.section .pdata
.long function_start ; Begin address (RVA)
.long function_end ; End address (RVA)
.long unwind_info ; Unwind info address (RVA)
; .xdata entry (UNWIND_INFO)
.section .xdata
unwind_info:
.byte 0x01 ; Version (low 3 bits) | Flags (high 5 bits): version 1, no flags
.byte prolog_size ; Size of prolog
.byte unwind_code_count ; Count of unwind codes
.byte frame_register | (frame_offset &lt;&lt; 4) ; Low nibble: frame register; high nibble: scaled offset
; Unwind codes array
.byte prolog_offset_1 ; Offset in prolog
.byte unwind_op_1 | (op_info_1 &lt;&lt; 4) ; Low nibble: operation; high nibble: operation info
; UWOP_PUSH_NONVOL = 0
; UWOP_ALLOC_LARGE = 1
; UWOP_ALLOC_SMALL = 2
; UWOP_SET_FPREG = 3
; UWOP_SAVE_NONVOL = 4
; UWOP_SAVE_XMM128 = 8</code></pre>
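<p>Each unwind code occupies two bytes, with the operation and its
info sharing the second byte. A minimal C sketch of that packing
follows. The helper names <code>pack_unwind_code</code> and
<code>alloc_small_info</code> are hypothetical; the opcode values are
the ones listed above.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

enum {
    UWOP_PUSH_NONVOL = 0,
    UWOP_ALLOC_LARGE = 1,
    UWOP_ALLOC_SMALL = 2,
    UWOP_SET_FPREG   = 3,
    UWOP_SAVE_NONVOL = 4,
    UWOP_SAVE_XMM128 = 8
};

/* Pack one UNWIND_CODE slot as a little-endian 16-bit value:
 * low byte  = offset in prolog,
 * high byte = operation (low nibble) | operation info (high nibble). */
uint16_t pack_unwind_code(uint8_t prolog_offset, uint8_t op, uint8_t info)
{
    uint8_t op_byte = (uint8_t)((op % 16) | ((info % 16) &lt;&lt; 4));
    return (uint16_t)(prolog_offset | (op_byte &lt;&lt; 8));
}

/* UWOP_ALLOC_SMALL stores sizes from 8 to 128 bytes as (size - 8) / 8. */
uint8_t alloc_small_info(unsigned size)
{
    return (uint8_t)((size - 8) / 8);
}</code></pre>
<p>For example, <code>push rbp</code> ending at prolog offset 1 would
be packed as <code>pack_unwind_code(1, UWOP_PUSH_NONVOL, 5)</code>,
since RBP is register number 5.</p>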
<h3 id="c-exception-handling-implementation"><strong>5.4 C++ Exception
Handling Implementation</strong></h3>
<h4 id="itanium-abi-exception-model-gccclang"><strong>Itanium ABI
Exception Model (GCC/Clang)</strong></h4>
<pre class="assembly"><code>; C++ try/catch implementation details
cpp_exception_example:
.cfi_startproc
.cfi_personality 0x3, __gxx_personality_v0
.cfi_lsda 0x3, .LLSDA1
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
sub rsp, 32 ; Space for exception object
.LEHB0: ; Begin exception region
; Allocate exception object
mov edi, 16 ; Size of exception
call __cxa_allocate_exception
mov [rbp - 8], rax ; Save exception pointer in a local (RBX is callee-saved and was not preserved)
; Construct exception object
mov rdi, rax
lea rsi, [exception_message]
call std::runtime_error::runtime_error ; Shown unmangled for readability
; Throw exception
mov rdi, [rbp - 8] ; Exception object
lea rsi, [_ZTISt13runtime_error] ; Type info
lea rdx, [_ZNSt13runtime_errorD1Ev] ; Destructor, run when the exception object is released
call __cxa_throw ; Never returns
.LEHE0: ; End exception region
; Normal return path (unreachable after throw)
xor eax, eax
leave
.cfi_def_cfa rsp, 8
ret
.L1: ; Landing pad (catch handler)
.cfi_def_cfa rbp, 16
mov rdi, rax ; Exception object
call __cxa_begin_catch
; Handle exception
mov rdi, rax
call process_exception
call __cxa_end_catch
; Continue execution
xor eax, eax
leave
ret
.cfi_endproc
; Personality routine (called during unwinding)
; Determines if frame can handle exception
__gxx_personality_v0:
; Complex logic to:
; 1. Parse LSDA (Language Specific Data Area)
; 2. Check type matching
; 3. Find appropriate catch handler
; 4. Execute cleanup code</code></pre>
<h4 id="raii-and-destructor-calls-during-unwinding"><strong>RAII and
Destructor Calls During Unwinding</strong></h4>
<pre class="assembly"><code>; Automatic destructor calls during stack unwinding
function_with_raii:
push rbp
mov rbp, rsp
sub rsp, 64
; Construct local object with destructor
lea rdi, [rbp - 32] ; Object address
call MyClass::MyClass
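; No runtime registration is needed: the function's LSDA call-site table
; maps the code range starting at .try_block to .cleanup_landing_pad,
; and the personality routine transfers control there during unwinding.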
.try_block:
; Code that might throw
lea rdi, [rbp - 32]
call MyClass::riskyOperation
; Normal cleanup
lea rdi, [rbp - 32]
call MyClass::~MyClass
leave
ret
.cleanup_landing_pad:
; Called during exception unwinding
mov [rbp - 40], rax ; Save exception pointer in the frame (a push would misalign RSP for the call)
; Call destructor
lea rdi, [rbp - 32]
call MyClass::~MyClass
mov rdi, [rbp - 40] ; Restore exception pointer
call _Unwind_Resume ; Continue unwinding</code></pre>
<h3 id="signal-handling-and-asynchronous-exceptions"><strong>5.5 Signal
Handling and Asynchronous Exceptions</strong></h3>
<h4 id="posix-signal-frame"><strong>POSIX Signal Frame</strong></h4>
<pre class="assembly"><code>; Signal handler with siginfo
signal_handler:
; RDI = signal number
; RSI = siginfo_t*
; RDX = ucontext_t*
push rbp
mov rbp, rsp
sub rsp, 32
; Save parameters
mov [rbp - 8], rdi ; Signal number
mov [rbp - 16], rsi ; siginfo_t
mov [rbp - 24], rdx ; ucontext_t
; Access
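</code></pre>
<h2 id="chapter-5-exception-handling-and-stack-unwinding-1"><strong>Chapter 5: Exception Handling and Stack Unwinding</strong></h2>
<p>This chapter covers how the x86‑64 architecture and its ABI conventions implement <strong>hardware exceptions</strong>, <strong>OS-level signal delivery</strong>, and <strong>language/runtime-level exception unwinding</strong>. It draws on <strong>Intel’s architecture manuals</strong>, the <strong>System V AMD64 ABI</strong>, the <strong>Windows x64 ABI</strong>, and <strong>C++ runtime conventions (Itanium ABI for GCC/Clang)</strong>.</p>
<hr />
<h3 id="exception-basics-in-x8664"><strong>5.1 Exception Basics in x86‑64</strong></h3>
<h4 id="hardware-exceptions">Hardware Exceptions</h4>
<p>On Intel and AMD 64-bit CPUs, exceptions are synchronous events raised by the instruction stream (divide error, page fault, invalid opcode, general-protection fault, alignment check, and so on). They are dispatched through the <strong>IDT (Interrupt Descriptor Table)</strong>: each exception vector has a gate that points to its handler.</p>
<p><strong>Fault vs Trap vs Abort:</strong></p>
<ul>
<li><p><strong>Fault:</strong> reported before the faulting instruction completes; the saved RIP points at that instruction, so it can be restarted once the cause is fixed (e.g., page fault).</p></li>
<li><p><strong>Trap:</strong> reported after the trapping instruction completes; the saved RIP points at the next instruction (e.g., breakpoint <code>INT3</code>).</p></li>
<li><p><strong>Abort:</strong> not reliably restartable (e.g., machine check, double fault).</p></li>
</ul>
<p>Example: divide‑by‑zero (<code>#DE</code>) handler skeleton in NASM syntax:</p>
<pre class="assembly"><code>section .text
global div_by_zero_handler
extern handle_div_error ; hypothetical C or asm routine
%macro multipush 1-* ; push each argument, left to right
%rep %0
push %1
%rotate 1
%endrep
%endmacro
%macro multipop 1-* ; pop the same list in reverse order
%rep %0
%rotate -1
pop %1
%endrep
%endmacro
div_by_zero_handler: ; CPU has pushed SS, RSP, RFLAGS, CS, RIP; #DE pushes no error code
multipush rax, rbx, rcx, rdx, rsi, rdi, rbp, r8, r9, r10, r11, r12, r13, r14, r15
mov rdi, rsp ; pointer to saved registers
call handle_div_error ; must fix the cause or advance the saved RIP, or the DIV faults again
multipop rax, rbx, rcx, rdx, rsi, rdi, rbp, r8, r9, r10, r11, r12, r13, r14, r15
iretq ; for vectors that push an error code, add rsp, 8 first</code></pre>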
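<p><strong>Exception frame</strong> layout follows the architecture’s
push order: SS, RSP, RFLAGS, CS, RIP. For exceptions that define an error
code (such as <code>#PF</code> and <code>#GP</code>), the code is pushed
last, after RIP, so it sits on top of the stack at handler entry and must
be removed before <code>IRETQ</code>.</p>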
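<p>On <strong>page faults</strong>, CR2 holds the faulting linear
address. The low bits of the error code are:</p>
<ul>
<li><p>Bit 0 (P): 0 = the page was not present, 1 = the page was present and the access violated its protection.</p></li>
<li><p>Bit 1 (W/R): 1 = the access was a write, 0 = a read.</p></li>
<li><p>Bit 2 (U/S): 1 = the access came from user mode.</p></li>
<li><p>Bit 3 (RSVD): 1 = a reserved bit was set in a paging-structure entry.</p></li>
<li><p>Bit 4 (I/D): 1 = the access was an instruction fetch.</p></li>
</ul>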
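<p>As a minimal sketch of how a handler might interpret those bits, the C helper below unpacks an error code into named fields. The names <code>pf_info</code> and <code>decode_pf_error</code> are hypothetical and not part of any kernel API; only the bit positions come from the list above.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the low five bits of a page-fault error code. */
struct pf_info {
    bool protection_violation; /* bit 0: 1 = page present, access not permitted */
    bool write;                /* bit 1: 1 = write access                       */
    bool user_mode;            /* bit 2: 1 = access from user mode              */
    bool reserved_bit;         /* bit 3: 1 = reserved bit set in a paging entry */
    bool instruction_fetch;    /* bit 4: 1 = instruction fetch                  */
};

/* Hypothetical helper: split the error code into its flag bits. */
static struct pf_info decode_pf_error(uint64_t err)
{
    struct pf_info info = {
        .protection_violation = (err >> 0) & 1,
        .write                = (err >> 1) & 1,
        .user_mode            = (err >> 2) & 1,
        .reserved_bit         = (err >> 3) & 1,
        .instruction_fetch    = (err >> 4) & 1,
    };
    return info;
}
```

<p>For example, an error code of <code>0x6</code> decodes as a user-mode write to a page that was not present.</p>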
<hr />
<h3 id="stack-unwinding-fundamentals"><strong>5.2 Stack Unwinding
Fundamentals</strong></h3>
<p>Occasionally, we need to <strong>walk up the stack</strong> to find
calling functions — either to produce a backtrace (debugging) or to run
cleanups during exceptions.</p>
<h4 id="frame-pointer-chaining">Frame Pointer Chaining</h4>
<p>Many compilers emit a standard frame pointer chain
(<code>RBP</code>), allowing manual stack walking:</p>
<div class="sourceCode" id="cb95"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb95-1"><a href="#cb95-1" aria-hidden="true" tabindex="-1"></a><span class="fu">walk_stack:</span></span>
<span id="cb95-2"><a href="#cb95-2" aria-hidden="true" tabindex="-1"></a> <span class="bu">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="kw">rbp</span> <span class="co">; current frame pointer</span></span>
<span id="cb95-3"><a href="#cb95-3" aria-hidden="true" tabindex="-1"></a><span class="fu">.loop:</span></span>
<span id="cb95-4"><a href="#cb95-4" aria-hidden="true" tabindex="-1"></a> <span class="bu">test</span> <span class="kw">rax</span><span class="op">,</span> <span class="kw">rax</span></span>
<span id="cb95-5"><a href="#cb95-5" aria-hidden="true" tabindex="-1"></a> <span class="cf">jz</span> <span class="op">.</span>done</span>
<span id="cb95-6"><a href="#cb95-6" aria-hidden="true" tabindex="-1"></a> <span class="bu">mov</span> <span class="kw">rcx</span><span class="op">,</span> <span class="op">[</span><span class="kw">rax</span><span class="op">+</span><span class="dv">8</span><span class="op">]</span> <span class="co">; return address</span></span>
<span id="cb95-7"><a href="#cb95-7" aria-hidden="true" tabindex="-1"></a> <span class="bu">mov</span> <span class="kw">rdx</span><span class="op">,</span> <span class="op">[</span><span class="kw">rax</span><span class="op">]</span> <span class="co">; previous frame pointer</span></span>
<span id="cb95-8"><a href="#cb95-8" aria-hidden="true" tabindex="-1"></a> <span class="co">; save rcx somewhere...</span></span>
<span id="cb95-9"><a href="#cb95-9" aria-hidden="true" tabindex="-1"></a> <span class="bu">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="kw">rdx</span></span>
<span id="cb95-10"><a href="#cb95-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">jmp</span> <span class="op">.</span>loop</span>
<span id="cb95-11"><a href="#cb95-11" aria-hidden="true" tabindex="-1"></a><span class="fu">.done:</span></span>
<span id="cb95-12"><a href="#cb95-12" aria-hidden="true" tabindex="-1"></a> <span class="cf">ret</span></span></code></pre></div>
<p>If <strong>frame pointer omission</strong> is enabled, DWARF Call
Frame Information (CFI) or Windows unwind info is needed.</p>
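<p>The same walk can be sketched in C. This is an illustration under the assumption that every frame was built with <code>push rbp; mov rbp, rsp</code>, so that <code>[rbp]</code> holds the caller's RBP and <code>[rbp+8]</code> the return address. The names <code>frame_record</code> and <code>walk_frames</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two words a standard prologue leaves at the frame pointer. */
struct frame_record {
    const struct frame_record *prev; /* saved RBP of the caller, at [rbp]     */
    uintptr_t return_address;        /* return address,          at [rbp + 8] */
};

/* Follow the chain from fp, storing up to max return addresses in out.
 * Returns the number of frames recorded. A null saved RBP ends the walk. */
static size_t walk_frames(const struct frame_record *fp,
                          uintptr_t *out, size_t max)
{
    assert(out != NULL || max == 0);
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        fp = fp->prev;
    }
    return n;
}
```

<p>With GCC or Clang and <code>-fno-omit-frame-pointer</code>, a live walk could start from <code>__builtin_frame_address(0)</code>. A production walker would also validate each pointer before dereferencing it, which this sketch does not do.</p>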
<hr />
<h3 id="dwarf-cfi-system-v-amd64"><strong>5.3 DWARF CFI (System V
AMD64)</strong></h3>
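<p>The <strong>System V AMD64 ABI</strong> specifies DWARF Call Frame
Information, stored in <code>.eh_frame</code>, as the way unwinders
locate a function’s call frame. In GNU assembler syntax the information
is written with <code>.cfi_*</code> directives:</p>
<ul>
<li><p><strong>.cfi_startproc / .cfi_endproc</strong> delimit a
function’s unwind metadata.</p></li>
<li><p><code>.cfi_def_cfa_register</code> changes the register used to
compute the CFA (Canonical Frame Address) while keeping the current
offset.</p></li>
<li><p><code>.cfi_offset</code> declares where each saved register lives
relative to the CFA.</p></li>
</ul>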
<p>Example with possible throw:</p>
<div class="sourceCode" id="cb96"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb96-1"><a href="#cb96-1" aria-hidden="true" tabindex="-1"></a>.cfi_startproc</span>
<span id="cb96-2"><a href="#cb96-2" aria-hidden="true" tabindex="-1"></a>.cfi_personality <span class="bn">0x3</span><span class="op">,</span> __gxx_personality_v0</span>
<span id="cb96-3"><a href="#cb96-3" aria-hidden="true" tabindex="-1"></a><span class="bu">push</span> <span class="kw">rbp</span></span>
<span id="cb96-4"><a href="#cb96-4" aria-hidden="true" tabindex="-1"></a>.cfi_def_cfa_offset <span class="dv">16</span></span>
<span id="cb96-5"><a href="#cb96-5" aria-hidden="true" tabindex="-1"></a>.cfi_offset <span class="kw">rbp</span><span class="op">,</span> <span class="op">-</span><span class="dv">16</span></span>
<span id="cb96-6"><a href="#cb96-6" aria-hidden="true" tabindex="-1"></a><span class="bu">mov</span> <span class="kw">rbp</span><span class="op">,</span> <span class="kw">rsp</span></span>
<span id="cb96-7"><a href="#cb96-7" aria-hidden="true" tabindex="-1"></a>.cfi_def_cfa_register <span class="kw">rbp</span></span>
<span id="cb96-8"><a href="#cb96-8" aria-hidden="true" tabindex="-1"></a><span class="bu">push</span> <span class="kw">rbx</span></span>
<span id="cb96-9"><a href="#cb96-9" aria-hidden="true" tabindex="-1"></a>.cfi_offset <span class="kw">rbx</span><span class="op">,</span> <span class="op">-</span><span class="dv">24</span></span>
<span id="cb96-10"><a href="#cb96-10" aria-hidden="true" tabindex="-1"></a><span class="co">; body...</span></span>
<span id="cb96-11"><a href="#cb96-11" aria-hidden="true" tabindex="-1"></a><span class="bu">pop</span> <span class="kw">rbx</span></span>
<span id="cb96-12"><a href="#cb96-12" aria-hidden="true" tabindex="-1"></a>.cfi_restore <span class="kw">rbx</span></span>
<span id="cb96-13"><a href="#cb96-13" aria-hidden="true" tabindex="-1"></a><span class="bu">pop</span> <span class="kw">rbp</span></span>
<span id="cb96-14"><a href="#cb96-14" aria-hidden="true" tabindex="-1"></a>.cfi_def_cfa <span class="kw">rsp</span><span class="op">,</span> <span class="dv">8</span></span>
<span id="cb96-15"><a href="#cb96-15" aria-hidden="true" tabindex="-1"></a><span class="cf">ret</span></span>
<span id="cb96-16"><a href="#cb96-16" aria-hidden="true" tabindex="-1"></a>.cfi_endproc</span></code></pre></div>
<p>This data is consumed by <code>_Unwind_RaiseException</code>, provided
by libgcc_s or LLVM’s libunwind, which the C++ runtime (libsupc++ or
libc++abi) calls to unwind the stack.</p>
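<p>One way to read the CFI directives in the example above is as updates to a small rule of the form "CFA = register + offset". The C sketch below models only that rule. The type and function names are hypothetical, and a real unwinder tracks rules for every saved register as well.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum cfa_reg { CFA_RSP, CFA_RBP };

/* Current rule for computing the Canonical Frame Address. */
struct cfa_rule {
    enum cfa_reg reg;
    int64_t offset;
};

/* .cfi_def_cfa_offset N: keep the register, replace the offset. */
static void def_cfa_offset(struct cfa_rule *r, int64_t offset)
{
    assert(r != NULL);
    r->offset = offset;
}

/* .cfi_def_cfa_register R: keep the offset, replace the register. */
static void def_cfa_register(struct cfa_rule *r, enum cfa_reg reg)
{
    assert(r != NULL);
    r->reg = reg;
}

/* Evaluate the rule against the current register values. */
static uint64_t cfa_value(const struct cfa_rule *r, uint64_t rsp, uint64_t rbp)
{
    assert(r != NULL);
    return (r->reg == CFA_RSP ? rsp : rbp) + (uint64_t)r->offset;
}
```

<p>At function entry the rule is RSP + 8. After <code>push rbp</code> it becomes RSP + 16, and after <code>mov rbp, rsp</code> it becomes RBP + 16, so the saved RBP declared by <code>.cfi_offset rbp, -16</code> lives exactly at the address RBP points to.</p>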
<hr />
<h3 id="windows-x64-seh-and-unwind-info"><strong>5.4 Windows x64 SEH and
Unwind Info</strong></h3>
<p>Windows uses <strong>Structured Exception Handling (SEH)</strong> and
<strong>unwind metadata</strong> in <code>.pdata</code> and
<code>.xdata</code> sections.</p>
<ul>
<li><p><strong>.pdata</strong>: runtime function table entries
(start/end RVA, pointer to unwind info).</p></li>
<li><p><strong>UNWIND_INFO</strong>: describes prolog, saved registers,
and optional exception handler pointer.</p></li>
</ul>
<p>Example SEH-protected function:</p>
<div class="sourceCode" id="cb97"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb97-1"><a href="#cb97-1" aria-hidden="true" tabindex="-1"></a>seh_fn PROC FRAME</span>
<span id="cb97-2"><a href="#cb97-2" aria-hidden="true" tabindex="-1"></a>    <span class="bu">push</span> <span class="kw">rbp</span></span>
<span id="cb97-3"><a href="#cb97-3" aria-hidden="true" tabindex="-1"></a>    .pushreg <span class="kw">rbp</span></span>
<span id="cb97-4"><a href="#cb97-4" aria-hidden="true" tabindex="-1"></a>    <span class="bu">sub</span> <span class="kw">rsp</span><span class="op">,</span> <span class="bn">40h</span></span>
<span id="cb97-5"><a href="#cb97-5" aria-hidden="true" tabindex="-1"></a>    .allocstack <span class="bn">40h</span></span>
<span id="cb97-6"><a href="#cb97-6" aria-hidden="true" tabindex="-1"></a>    <span class="bu">mov</span> <span class="kw">rbp</span><span class="op">,</span> <span class="kw">rsp</span></span>
<span id="cb97-7"><a href="#cb97-7" aria-hidden="true" tabindex="-1"></a>    .setframe <span class="kw">rbp</span><span class="op">,</span><span class="dv">0</span></span>
<span id="cb97-8"><a href="#cb97-8" aria-hidden="true" tabindex="-1"></a>    <span class="bu">mov</span> <span class="op">[</span><span class="kw">rbp</span><span class="op">+</span><span class="bn">38h</span><span class="op">],</span> <span class="kw">rbx</span></span>
<span id="cb97-9"><a href="#cb97-9" aria-hidden="true" tabindex="-1"></a>    .savereg <span class="kw">rbx</span><span class="op">,</span> <span class="bn">38h</span> <span class="co">; offset from the frame base, must be non-negative</span></span>
<span id="cb97-10"><a href="#cb97-10" aria-hidden="true" tabindex="-1"></a>    .endprolog</span>
<span id="cb97-11"><a href="#cb97-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb97-12"><a href="#cb97-12" aria-hidden="true" tabindex="-1"></a>    <span class="co">; code that may fault</span></span>
<span id="cb97-13"><a href="#cb97-13" aria-hidden="true" tabindex="-1"></a>    <span class="cf">call</span> may_fault</span>
<span id="cb97-14"><a href="#cb97-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb97-15"><a href="#cb97-15" aria-hidden="true" tabindex="-1"></a>    <span class="bu">mov</span> <span class="kw">rbx</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbp</span><span class="op">+</span><span class="bn">38h</span><span class="op">]</span></span>
<span id="cb97-16"><a href="#cb97-16" aria-hidden="true" tabindex="-1"></a>    <span class="bu">lea</span> <span class="kw">rsp</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbp</span><span class="op">+</span><span class="bn">40h</span><span class="op">]</span> <span class="co">; epilog form the unwinder recognizes</span></span>
<span id="cb97-17"><a href="#cb97-17" aria-hidden="true" tabindex="-1"></a>    <span class="bu">pop</span> <span class="kw">rbp</span></span>
<span id="cb97-18"><a href="#cb97-18" aria-hidden="true" tabindex="-1"></a>    <span class="cf">ret</span></span>
<span id="cb97-19"><a href="#cb97-19" aria-hidden="true" tabindex="-1"></a>seh_fn ENDP</span></code></pre></div>
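<p>The Windows unwind codes (UWOP_PUSH_NONVOL, UWOP_ALLOC_SMALL, etc.)
tell the unwinder (<code>RtlVirtualUnwind</code>, which
<code>RtlUnwindEx</code> calls for each frame) how to restore registers
and the stack pointer.</p>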
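<p>To make the encoding concrete, here is a small C sketch that decodes one two-byte unwind-code slot: a prolog offset byte followed by a byte whose low nibble is the operation and whose high nibble is the operation info. The struct and function names are hypothetical. For <code>UWOP_ALLOC_SMALL</code> the allocation size is <code>info * 8 + 8</code> bytes.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    UWOP_PUSH_NONVOL = 0,
    UWOP_ALLOC_LARGE = 1,
    UWOP_ALLOC_SMALL = 2,
    UWOP_SET_FPREG   = 3,
    UWOP_SAVE_NONVOL = 4
};

/* One decoded unwind-code slot. */
struct unwind_slot {
    uint8_t prolog_offset; /* offset of the end of the instruction in the prolog */
    uint8_t op;            /* low nibble of the second byte                      */
    uint8_t info;          /* high nibble of the second byte                     */
};

/* Hypothetical helper: split the raw two bytes into fields. */
static struct unwind_slot decode_slot(const uint8_t raw[2])
{
    assert(raw != NULL);
    struct unwind_slot s;
    s.prolog_offset = raw[0];
    s.op = raw[1] & 0x0F;
    s.info = raw[1] >> 4;
    return s;
}

/* Size in bytes described by a UWOP_ALLOC_SMALL slot (8 to 128). */
static unsigned alloc_small_size(struct unwind_slot s)
{
    assert(s.op == UWOP_ALLOC_SMALL);
    return s.info * 8u + 8u;
}
```

<p>A <code>sub rsp, 40h</code> in the prolog is therefore encoded with info 7, and a <code>push rbp</code> as <code>UWOP_PUSH_NONVOL</code> with info 5, the register number of RBP.</p>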
<hr />
<h3 id="language-level-exception-flow-itanium-c-abi"><strong>5.5
Language-Level Exception Flow (Itanium C++ ABI)</strong></h3>
<p>GCC and Clang on AMD64 Linux/macOS implement the <strong>Itanium
ABI</strong> personality function model:</p>
<ul>
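<li><p>Each try block is covered by one or more call-site entries in the
function’s LSDA (Language Specific Data Area), which map code ranges to
landing pads and actions.</p></li>
<li><p>On throw, the unwinder runs in two phases. The search phase calls
each frame’s <strong>personality function</strong> with that frame’s LSDA
until a matching handler is found; the cleanup phase then unwinds frame
by frame, entering cleanup landing pads and finally the handler.</p></li>
<li><p>RAII destructors run in those cleanup landing pads during the
second phase; no runtime registration is involved.</p></li>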
</ul>
<p>Example throw/catch at asm level:</p>
<div class="sourceCode" id="cb98"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb98-1"><a href="#cb98-1" aria-hidden="true" tabindex="-1"></a><span class="fu">try_block_start:</span></span>
<span id="cb98-2"><a href="#cb98-2" aria-hidden="true" tabindex="-1"></a> <span class="co">; might throw</span></span>
<span id="cb98-3"><a href="#cb98-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">call</span> risky_op</span>
<span id="cb98-4"><a href="#cb98-4" aria-hidden="true" tabindex="-1"></a><span class="fu">try_block_end:</span></span>
<span id="cb98-5"><a href="#cb98-5" aria-hidden="true" tabindex="-1"></a> <span class="co">; normal path</span></span>
<span id="cb98-6"><a href="#cb98-6" aria-hidden="true" tabindex="-1"></a> <span class="cf">ret</span></span>
<span id="cb98-7"><a href="#cb98-7" aria-hidden="true" tabindex="-1"></a><span class="fu">catch_lpad:</span></span>
<span id="cb98-8"><a href="#cb98-8" aria-hidden="true" tabindex="-1"></a>    <span class="bu">mov</span> <span class="kw">rdi</span><span class="op">,</span> <span class="kw">rax</span> <span class="co">; exception object pointer passed in by the unwinder</span></span>
<span id="cb98-9"><a href="#cb98-9" aria-hidden="true" tabindex="-1"></a>    <span class="cf">call</span> __cxa_begin_catch</span>
<span id="cb98-10"><a href="#cb98-10" aria-hidden="true" tabindex="-1"></a>    <span class="co">; handle</span></span>
<span id="cb98-11"><a href="#cb98-11" aria-hidden="true" tabindex="-1"></a>    <span class="cf">call</span> __cxa_end_catch</span>
<span id="cb98-12"><a href="#cb98-12" aria-hidden="true" tabindex="-1"></a>    <span class="cf">ret</span></span></code></pre></div>
<p>Throwing (<code>__cxa_throw</code>) never returns; stack unwinding
proceeds via <code>_Unwind_RaiseException</code>.</p>
<hr />
<h3 id="signals-posix-asynchronous-exceptions"><strong>5.6 Signals
(POSIX Asynchronous Exceptions)</strong></h3>
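<p>Unix signals report both fault-derived events (SIGSEGV, SIGFPE,
SIGILL), which the kernel delivers to the thread that caused the fault,
and truly asynchronous events such as SIGINT. In either case the kernel
sets up a <strong>signal frame</strong> containing register context
(<code>ucontext_t</code>), so a handler can inspect or modify the
interrupted state.</p>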
<div class="sourceCode" id="cb99"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb99-1"><a href="#cb99-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> handler<span class="op">(</span><span class="dt">int</span> sig<span class="op">,</span> siginfo_t <span class="op">*</span>si<span class="op">,</span> <span class="dt">void</span> <span class="op">*</span>ctx<span class="op">)</span> <span class="op">{</span></span>
<span id="cb99-2"><a href="#cb99-2" aria-hidden="true" tabindex="-1"></a> ucontext_t <span class="op">*</span>uc <span class="op">=</span> ctx<span class="op">;</span></span>
<span id="cb99-3"><a href="#cb99-3" aria-hidden="true" tabindex="-1"></a> printf<span class="op">(</span><span class="st">&quot;FaultAddr=</span><span class="sc">%p</span><span class="st"> RIP=</span><span class="sc">%llx\n</span><span class="st">&quot;</span><span class="op">,</span> si<span class="op">-&gt;</span>si_addr<span class="op">,</span></span>
<span id="cb99-4"><a href="#cb99-4" aria-hidden="true" tabindex="-1"></a> <span class="op">(</span><span class="dt">unsigned</span> <span class="dt">long</span> <span class="dt">long</span><span class="op">)</span>uc<span class="op">-&gt;</span>uc_mcontext<span class="op">.</span>gregs<span class="op">[</span>REG_RIP<span class="op">]);</span></span>
<span id="cb99-5"><a href="#cb99-5" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
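<p>At the assembly level, the kernel saves the interrupted context in the
signal frame on the user stack (or on the alternate stack set with
<code>sigaltstack</code>) and enters the handler with its three arguments
in RDI, RSI, and RDX. Note that <code>REG_RIP</code> is a glibc extension
that requires <code>_GNU_SOURCE</code>, and that <code>printf</code> is
not async-signal-safe, so the handler above is for illustration
only.</p>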
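<p>A handler with this three-argument signature has to be installed with <code>sigaction</code> and the <code>SA_SIGINFO</code> flag. The sketch below shows one way to do that while keeping the handler async-signal-safe by only writing a <code>volatile sig_atomic_t</code>. The names <code>on_signal</code>, <code>install_handler</code>, and <code>last_signal</code> are hypothetical.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler, read by normal code. */
static volatile sig_atomic_t last_signal = 0;

/* Three-argument handler form selected by SA_SIGINFO. */
static void on_signal(int sig, siginfo_t *si, void *ctx)
{
    (void)si;  /* siginfo_t: si_code, si_addr, ... */
    (void)ctx; /* really a ucontext_t * with the saved registers */
    last_signal = sig;
}

/* Install on_signal for signo; returns 0 on success, -1 on error. */
static int install_handler(int signo)
{
    assert(signo > 0);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

<p>After <code>install_handler(SIGUSR1)</code>, calling <code>raise(SIGUSR1)</code> runs the handler before <code>raise</code> returns, and <code>last_signal</code> then holds <code>SIGUSR1</code>.</p>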
<hr />
<h3 id="practical-stack-unwinding-example"><strong>5.7 Practical Stack
Unwinding Example</strong></h3>
<p>A DWARF-based backtrace proceeds as follows:</p>
<ol type="1">
<li><p>Locate the unwind entry (FDE) that covers the current
RIP.</p></li>
<li><p>Apply its unwind rules to recover the caller’s CFA, return
address, and saved registers.</p></li>
<li><p>Repeat with the caller’s RIP until no unwind entry is found, which
marks the outermost frame.</p></li>
</ol>
<p>On Windows: use <code>RtlCaptureContext</code> and
<code>RtlLookupFunctionEntry</code>.</p>
<p>On Linux: <code>libunwind</code> or
<code>_Unwind_Backtrace</code>.</p>
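<p>On Linux with GCC or Clang, the steps above are available through <code>_Unwind_Backtrace</code> from <code>&lt;unwind.h&gt;</code>. The following is a minimal sketch; <code>trace_state</code>, <code>record_frame</code>, and <code>capture_backtrace</code> are hypothetical names, while <code>_Unwind_Backtrace</code> and <code>_Unwind_GetIP</code> are the unwinder's own entry points.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unwind.h>

/* Caller-provided buffer that the callback fills in. */
struct trace_state {
    uintptr_t *pcs;
    size_t count;
    size_t max;
};

/* Called once per frame, innermost first. */
static _Unwind_Reason_Code record_frame(struct _Unwind_Context *ctx, void *arg)
{
    struct trace_state *st = arg;
    if (st->count == st->max)
        return _URC_END_OF_STACK; /* any value other than _URC_NO_REASON stops the walk */
    st->pcs[st->count++] = (uintptr_t)_Unwind_GetIP(ctx);
    return _URC_NO_REASON;
}

/* Store up to max program-counter values in pcs; returns how many were stored. */
static size_t capture_backtrace(uintptr_t *pcs, size_t max)
{
    assert(pcs != NULL);
    struct trace_state st = { pcs, 0, max };
    _Unwind_Backtrace(record_frame, &st);
    return st.count;
}
```

<p>The recorded addresses can be turned into symbol names afterwards, for example with <code>dladdr</code> or an external symbolizer; doing that inside the callback is best avoided.</p>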
<hr />
<h3 id="key-points"><strong>Key Points:</strong></h3>
<ul>
<li><p><strong>Hardware exceptions</strong> are defined in the CPU docs;
handling them requires saving/restoring full CPU state.</p></li>
<li><p><strong>Stack unwinding</strong> depends on ABI-provided metadata
(DWARF or Windows unwind info) if frame pointers aren’t
available.</p></li>
<li><p><strong>Language runtimes</strong> such as C++ and Rust integrate
with these mechanisms via personality functions and metadata
tables.</p></li>
<li><p><strong>Asynchronous signals</strong> use an OS-delivered stack
frame; care must be taken for async-signal safety.</p></li>
<li><p>RAII destructors and cleanup handlers run <em>during</em>
unwinding automatically.</p></li>
<li><p>In cross-platform code, <strong>System V AMD64 ABI</strong> and
<strong>Microsoft x64 ABI</strong> have distinct unwind metadata
formats.</p></li>
</ul>
<hr />
<h2 id="chapter-6-x87-fpu-and-legacy-floating-point"><strong>Chapter 6:
x87 FPU and Legacy Floating Point</strong></h2>
<h3 id="x87-fpu-architecture-overview"><strong>6.1 x87 FPU Architecture
Overview</strong></h3>
<h4 id="x87-register-stack-model"><strong>x87 Register Stack
Model</strong></h4>
<p>The x87 FPU uses a unique stack-based architecture with eight 80-bit
floating-point registers (ST0-ST7), unlike the flat register model of
modern SSE/AVX:</p>
<pre class="assembly"><code>; x87 FPU Register Stack
; ┌─────────────────────────────────────┐
; │ ST(0) - Top of Stack (TOS) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(1) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(2) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(3) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(4) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(5) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(6) │ 80 bits
; ├─────────────────────────────────────┤
; │ ST(7) │ 80 bits
; └─────────────────────────────────────┘
; 80-bit Extended Precision Format
; ┌───┬────────────────┬──────────────────────────────────┐
; │ S │ Exponent (15) │ Significand (64 bits) │
; └───┴────────────────┴──────────────────────────────────┘
; Bit: 79 78-64 63-0
; Basic stack operations
fld_example:
fld dword [float_value] ; Push 32-bit float onto stack
fld qword [double_value] ; Push 64-bit double
fld tbyte [extended_value] ; Push 80-bit extended
; Stack now: ST(0) = extended, ST(1) = double, ST(2) = float
faddp ; ST(1) = ST(1) + ST(0), pop
fstp qword [result] ; Store and pop</code></pre>
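<p>To illustrate the 80-bit layout shown in the diagram, the C sketch below splits the ten bytes written by <code>fstp tbyte</code> into sign, exponent, and significand. The names <code>x87_fields</code> and <code>x87_split</code> are hypothetical, and the bytes are assumed to be in little-endian memory order.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an x87 extended-precision value. */
struct x87_fields {
    unsigned sign;        /* bit 79                                     */
    unsigned exponent;    /* bits 78-64, biased by 16383                */
    uint64_t significand; /* bits 63-0, with an explicit integer bit 63 */
};

/* Split a 10-byte little-endian extended value into its fields. */
static struct x87_fields x87_split(const uint8_t bytes[10])
{
    assert(bytes != NULL);
    struct x87_fields f;
    f.significand = 0;
    for (int i = 7; i >= 0; i--)
        f.significand = (f.significand << 8) | bytes[i];
    unsigned top = (unsigned)bytes[8] | ((unsigned)bytes[9] << 8);
    f.exponent = top & 0x7FFF;
    f.sign = top >> 15;
    return f;
}
```

<p>Unlike the 32-bit and 64-bit formats, the leading 1 of a normal number is stored explicitly, so 1.0 has a significand of <code>0x8000000000000000</code> and a biased exponent of <code>0x3FFF</code>.</p>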
<h4 id="x87-control-and-status-words"><strong>x87 Control and Status
Words</strong></h4>
<pre class="assembly"><code>; x87 Control Word (FCW) - 16 bits
; ┌────────┬────┬───────┬───────┬───────┬────┬────┬────┬────┬────┬────┐
; │ Rsvd   │ X  │  RC   │  PC   │ Rsvd  │ PM │ UM │ OM │ ZM │ DM │ IM │
; └────────┴────┴───────┴───────┴───────┴────┴────┴────┴────┴────┴────┘
; Bits: 15-13 12   11-10    9-8     7-6    5    4    3    2    1    0
;
; RC = Rounding Control (00=nearest, 01=down, 10=up, 11=truncate)
; PC = Precision Control (00=single, 10=double, 11=extended)
; Exception Masks: PM=Precision, UM=Underflow, OM=Overflow,
; ZM=Zero divide, DM=Denormal, IM=Invalid
; x87 Status Word (FSW) - 16 bits
; ┌────┬────┬───────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
; │ B  │ C3 │  TOP  │ C2 │ C1 │ C0 │ ES │ SF │ PE │ UE │ OE │ ZE │ DE │ IE │
; └────┴────┴───────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
; Bits: 15   14   13-11   10    9    8    7    6    5    4    3    2    1    0
;
; B = Busy, TOP = Stack top pointer
; C0-C3 = Condition codes
; ES = Error summary, SF = Stack fault
; Exception flags: PE=Precision, UE=Underflow, OE=Overflow,
; ZE=Zero divide, DE=Denormal, IE=Invalid
control_word_setup:
sub rsp, 16
; Get current control word
fstcw [rsp]
mov ax, [rsp]
; Set rounding mode to truncate (11b)
or ax, 0x0C00 ; Set bits 11-10 to 11
mov [rsp + 2], ax
fldcw [rsp + 2] ; Load new control word
; Do computation with truncation
fld qword [value]
frndint ; Round to integer using current mode
; Restore original control word
fldcw [rsp]
add rsp, 16
ret</code></pre>
<h3 id="x87-instruction-categories"><strong>6.2 x87 Instruction
Categories</strong></h3>
<h4 id="data-transfer-instructions"><strong>Data Transfer
Instructions</strong></h4>
<pre class="assembly"><code>; Loading values onto the stack
load_operations:
fld dword [mem32] ; Load 32-bit float
fld qword [mem64] ; Load 64-bit double
fld tbyte [mem80] ; Load 80-bit extended
fld st(2) ; Duplicate ST(2) to ST(0)
fld1 ; Push 1.0
fldz ; Push 0.0
fldpi ; Push π
fldl2e ; Push log₂(e)
fldl2t ; Push log₂(10)
fldlg2 ; Push log₁₀(2)
fldln2 ; Push ln(2)
; Storing values from the stack
store_operations:
fst dword [mem32] ; Store ST(0) as float (no pop)
fstp qword [mem64] ; Store ST(0) as double and pop
fstp tbyte [mem80] ; Store ST(0) as extended and pop
fist word [mem16] ; Store ST(0) as 16-bit int
fistp dword [mem32] ; Store as 32-bit int and pop
fistp qword [mem64] ; Store as 64-bit int and pop
; Integer loading
integer_load:
fild word [int16] ; Load 16-bit integer
fild dword [int32] ; Load 32-bit integer
fild qword [int64] ; Load 64-bit integer
; Exchange operations
fxch ; Exchange ST(0) with ST(1)
fxch st(3) ; Exchange ST(0) with ST(3)</code></pre>
<h4 id="arithmetic-operations"><strong>Arithmetic
Operations</strong></h4>
<pre class="assembly"><code>; Basic arithmetic
arithmetic_ops:
; Addition
fadd st(0), st(1) ; ST(0) = ST(0) + ST(1)
fadd dword [mem32] ; ST(0) = ST(0) + mem32
faddp st(2), st(0) ; ST(2) = ST(2) + ST(0), pop
fiadd word [int16] ; ST(0) = ST(0) + (float)int16
; Subtraction
fsub st(0), st(1) ; ST(0) = ST(0) - ST(1)
fsubr st(0), st(1) ; ST(0) = ST(1) - ST(0) (reverse)
fsubp ; ST(1) = ST(1) - ST(0), pop
fisubr dword [int32] ; ST(0) = int32 - ST(0)
; Multiplication
fmul st(0), st(2) ; ST(0) = ST(0) * ST(2)
fmulp st(1), st(0) ; ST(1) = ST(1) * ST(0), pop
fimul word [int16] ; ST(0) = ST(0) * int16
; Division
fdiv st(0), st(1) ; ST(0) = ST(0) / ST(1)
fdivr st(0), st(1) ; ST(0) = ST(1) / ST(0) (reverse)
fdivp ; ST(1) = ST(1) / ST(0), pop
fidiv dword [int32] ; ST(0) = ST(0) / int32
; Other operations
fsqrt ; ST(0) = sqrt(ST(0))
fabs ; ST(0) = |ST(0)|
fchs ; ST(0) = -ST(0)
frndint ; ST(0) = round(ST(0))</code></pre>
<h4 id="transcendental-functions"><strong>Transcendental
Functions</strong></h4>
<pre class="assembly"><code>; Trigonometric functions
trig_operations:
; Calculate sin(x)
fld qword [angle] ; Load angle in radians
fsin ; ST(0) = sin(ST(0))
fstp qword [result]
; Calculate cos(x)
fld qword [angle]
fcos ; ST(0) = cos(ST(0))
; Calculate both sin and cos
fld qword [angle]
fsincos ; ST(0) = cos, ST(1) = sin
; Calculate tan(x)
fld qword [angle]
fptan ; ST(0) = 1.0, ST(1) = tan
fstp st(0) ; Pop the 1.0
; Calculate arctan(y/x)
fld qword [y]
fld qword [x]
fpatan ; ST(0) = arctan(ST(1)/ST(0))
; Logarithmic and exponential
log_exp_operations:
; Calculate log₂(x)
fld1
fld qword [x]
fyl2x ; ST(0) = ST(1) * log₂(ST(0))
; Calculate log₁₀(x) = log₂(x) * log₁₀(2)
fldlg2 ; Load log₁₀(2)
fld qword [x]
fyl2x
; Calculate ln(x) = log₂(x) * ln(2)
fldln2 ; Load ln(2)
fld qword [x]
fyl2x
; Calculate 2^x
fld qword [x]
f2xm1 ; ST(0) = 2^ST(0) - 1 (for |x| &lt; 1)
fld1
faddp ; Add 1 to get 2^x
; Calculate x^y using: x^y = 2^(y*log₂(x))
fld qword [y]
fld qword [x]
fyl2x ; ST(0) = y * log₂(x)
fld st(0)
frndint ; Get integer part
fsub st(1), st(0) ; Fractional part in ST(1)
fxch
f2xm1 ; 2^frac - 1
fld1
faddp ; 2^frac
fscale ; Scale by 2^int (ST(1) still holds the integer part)
fstp st(1) ; Discard the integer part, leaving x^y in ST(0)</code></pre>
<h3 id="comparison-and-conditional-operations"><strong>6.3 Comparison
and Conditional Operations</strong></h3>
<h4 id="comparison-instructions"><strong>Comparison
Instructions</strong></h4>
<pre class="assembly"><code>; Comparison operations
comparison_ops:
; Compare and set flags
fcom st(1) ; Compare ST(0) with ST(1)
fcomp dword [mem32] ; Compare with memory and pop
fcompp ; Compare ST(0), ST(1) and pop both
ficom word [int16] ; Compare with integer
; Unordered compare (handles NaN)
fucom st(1) ; Unordered compare
fucomp st(2) ; Compare and pop
fucompp ; Compare and pop both
; Test and classify
ftst ; Compare ST(0) with 0.0
fxam ; Examine ST(0) and set condition codes
; Transfer flags to CPU
transfer_flags:
; Method 1: Via AX register
fstsw ax ; Store status word in AX
sahf ; Store AH into FLAGS
ja .greater ; Now can use CPU conditional jumps
jb .less
je .equal
; Method 2: Via memory
fstsw [status_word]
mov ax, [status_word]
test ax, 0x4500 ; Check C3, C2, C0 bits
; Floating compare and set EFLAGS directly (P6+)
fcomi st(0), st(1) ; Compare and set ZF, PF, CF
jae .greater_equal ; Can use CPU jumps directly
fucomi st(0), st(1) ; Unordered compare version
jp .unordered ; Jump if unordered (NaN)</code></pre>
<h4 id="conditional-move-fcmovcc"><strong>Conditional Move
(FCMOVcc)</strong></h4>
<pre class="assembly"><code>; Conditional moves based on EFLAGS (P6+)
conditional_moves:
; Setup comparison in integer unit
cmp eax, ebx
; Conditional FP moves based on integer flags
fcmovb st(0), st(1) ; Move if below (CF=1)
fcmove st(0), st(2) ; Move if equal (ZF=1)
fcmovbe st(0), st(3) ; Move if below or equal
fcmovnb st(0), st(4) ; Move if not below
fcmovne st(0), st(5) ; Move if not equal
fcmovnbe st(0), st(6) ; Move if not below or equal
fcmovu st(0), st(7) ; Move if unordered (PF=1)
fcmovnu st(0), st(1) ; Move if not unordered
; Example: Conditional selection
fmax_implementation:
fld qword [value1]
fld qword [value2] ; ST(0) = value2, ST(1) = value1
fcomi st(0), st(1) ; Compare ST(0) with ST(1)
fcmovb st(0), st(1) ; If ST(0) &lt; ST(1), move ST(1) to ST(0)
fstp qword [maximum] ; Store maximum
fstp st(0) ; Clean stack
ret</code></pre>
<h3 id="exception-handling"><strong>6.4 Exception Handling</strong></h3>
<h4 id="x87-exception-types"><strong>x87 Exception Types</strong></h4>
<pre class="assembly"><code>; x87 exceptions and their handling
exception_handling:
; Save current control word
sub rsp, 16
fstcw [rsp]
; Unmask all exceptions for testing
mov ax, [rsp]
and ax, 0xFFC0 ; Clear all exception mask bits
mov [rsp + 2], ax
fldcw [rsp + 2]
; Operations that might raise exceptions:
; 1. Invalid Operation (IE)
fldz
fldz
fdiv st(0), st(1) ; 0/0 = NaN
; 2. Divide by Zero (ZE)
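</code></pre>
<p>This section covers the x87 exception <em>types</em>, their <em>masking</em>, and <em>recovery</em>, following Intel SDM Vol. 1, Ch. 8 and the legacy floating‑point sections.</p>
<p>The low six bits of the status word (IE, DE, ZE, OE, UE, PE) record which exception occurred:</p>
<table>
<thead>
<tr class="header"><th>Bit</th><th>Flag</th><th>Description</th></tr>
</thead>
<tbody>
<tr class="odd"><td>0</td><td><strong>IE</strong></td><td>Invalid Operation: e.g., 0/0, ∞ − ∞, square root of a negative number, an operation on a signaling NaN; x87 stack overflow/underflow is also reported here, together with SF</td></tr>
<tr class="even"><td>1</td><td><strong>DE</strong></td><td>Denormal Operand: an operand is denormalized (subnormal)</td></tr>
<tr class="odd"><td>2</td><td><strong>ZE</strong></td><td>Divide‑by‑Zero: a finite nonzero value is divided by 0.0</td></tr>
<tr class="even"><td>3</td><td><strong>OE</strong></td><td>Overflow: the rounded result is too large for the destination format</td></tr>
<tr class="odd"><td>4</td><td><strong>UE</strong></td><td>Underflow: the result is too small for the destination format; gradual underflow is possible</td></tr>
<tr class="even"><td>5</td><td><strong>PE</strong></td><td>Precision: the result had to be rounded (inexact)</td></tr>
</tbody>
</table>
<hr />
<h4 id="masking-vs.-unmasking"><strong>Masking vs. Unmasking</strong></h4>
<p>The <strong>Control Word</strong> contains one <em>exception mask bit</em> for each of the flags above.</p>
<ul>
<li><p><strong>Mask bit = 1</strong>: the exception is <em>masked</em>. The FPU sets the flag, substitutes a default result (such as a QNaN, ±∞, or a denormal), and continues.</p></li>
<li><p><strong>Mask bit = 0</strong>: the exception is <em>unmasked</em>. The processor raises <code>#MF</code> (x87 Floating-Point Exception).</p></li>
</ul>
<p>Example: Unmask only Divide-by-Zero:</p>
<pre class="assembly"><code>sub rsp, 16
fstcw [rsp] ; Save CW
fnclex ; Clear stale exception flags before unmasking
mov ax, [rsp]
and ax, 0xFFFB ; Clear bit 2 (ZM) to unmask divide-by-zero
mov [rsp+2], ax
fldcw [rsp+2]
fldz
fld1 ; ST(0) = 1.0, ST(1) = 0.0
fdiv st(0), st(1) ; 1.0 / 0.0 sets ZE
fwait ; #MF is delivered here, at the next waiting instruction
fldcw [rsp] ; Restore CW
add rsp, 16</code></pre>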
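<p>The bit manipulation on the control word can also be written as small C helpers, which makes the masks easier to check. This is a sketch with hypothetical names (<code>fcw_unmask</code>, <code>fcw_set_rounding</code>); it only computes new control-word values and does not load them into the FPU. The value <code>0x037F</code> used in the usage note is the control word after <code>FNINIT</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Exception mask bits in the x87 control word (bits 5-0). */
enum {
    FCW_IM = 0x01, FCW_DM = 0x02, FCW_ZM = 0x04,
    FCW_OM = 0x08, FCW_UM = 0x10, FCW_PM = 0x20,
    FCW_ALL_MASKS = 0x3F
};

/* Rounding-control field values (bits 11-10). */
enum fcw_rounding {
    FCW_RC_NEAREST = 0, FCW_RC_DOWN = 1, FCW_RC_UP = 2, FCW_RC_TRUNCATE = 3
};

/* Clear the given mask bits, which unmasks those exceptions. */
static uint16_t fcw_unmask(uint16_t cw, uint16_t exceptions)
{
    assert((exceptions & ~FCW_ALL_MASKS) == 0);
    return (uint16_t)(cw & ~exceptions);
}

/* Replace the RC field, leaving every other bit untouched. */
static uint16_t fcw_set_rounding(uint16_t cw, enum fcw_rounding rc)
{
    return (uint16_t)((cw & ~0x0C00u) | ((unsigned)rc << 10));
}
```

<p>Starting from <code>0x037F</code>, unmasking divide-by-zero gives <code>0x037B</code>, and selecting truncation gives <code>0x0F7F</code>. Clearing the RC field before setting it matters when the previous mode was not round-to-nearest.</p>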
<hr />
<h4 id="exception-service"><strong>Exception Service</strong></h4>
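<p>When an unmasked FP exception occurs:</p>
<ol type="1">
<li><p>Depending on the exception type, the faulting instruction is
either suppressed before it writes its result (as for invalid operation
and divide-by-zero) or allowed to complete (as for inexact
results).</p></li>
<li><p>The FPU sets the corresponding flag in the <strong>Status
Word</strong> and sets the <strong>ES</strong> (Error Summary)
bit.</p></li>
<li><p>The processor signals the exception handler:</p>
<ul>
<li>At the <em>hardware</em> level, <code>#MF</code> is raised not by the
faulting instruction itself but by the next waiting x87 instruction or
<code>WAIT</code>/<code>FWAIT</code> (deferred reporting; see Intel SDM
“Deferred Floating Point Exceptions”).</li>
<li>At the <em>OS</em> level, Linux delivers <code>SIGFPE</code>; in the
<code>siginfo_t</code>, <code>si_code</code> identifies the cause (e.g.,
<code>FPE_FLTDIV</code>) and <code>si_addr</code> holds the address of
the faulting instruction.</li>
</ul></li>
</ol>
<p>Because reporting is deferred, insert <code>fwait</code> directly
after the instruction that may fault if the exception must be delivered
at that point.</p>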
<hr />
<h4 id="flag-testing-in-software"><strong>Flag Testing in
Software</strong></h4>
<p>You can interrogate the <strong>Status Word</strong> directly:</p>
<div class="sourceCode" id="cb108"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb108-1"><a href="#cb108-1" aria-hidden="true" tabindex="-1"></a><span class="bu">fstsw</span> <span class="kw">ax</span> <span class="co">; Store SW in AX</span></span>
<span id="cb108-2"><a href="#cb108-2" aria-hidden="true" tabindex="-1"></a><span class="bu">test</span> <span class="kw">ax</span><span class="op">,</span> <span class="bn">0x0004</span> <span class="co">; Check ZE (bit 2)</span></span>
<span id="cb108-3"><a href="#cb108-3" aria-hidden="true" tabindex="-1"></a><span class="cf">jnz</span> <span class="op">.</span>div_by_zero</span>
<span id="cb108-4"><a href="#cb108-4" aria-hidden="true" tabindex="-1"></a><span class="bu">test</span> <span class="kw">ax</span><span class="op">,</span> <span class="bn">0x0020</span> <span class="co">; Check PE</span></span>
<span id="cb108-5"><a href="#cb108-5" aria-hidden="true" tabindex="-1"></a><span class="cf">jnz</span> <span class="op">.</span>inexact</span></code></pre></div>
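<p>The flags are sticky: they stay set until cleared with
<code>fnclex</code>, so clear them before the computation whose outcome
you want to test.</p>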
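<p>The same tests can be expressed on a status-word value saved by <code>fstsw</code>. The C sketch below extracts the TOP field and computes which exception flags are both set and unmasked, which is the condition under which ES is set. The function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>

/* TOP field of the status word: bits 13-11. */
static unsigned fsw_top(uint16_t sw)
{
    return (sw >> 11) & 7u;
}

/* Exception flags (bits 5-0 of sw) whose mask bit in cw is clear. */
static unsigned fsw_unmasked_pending(uint16_t sw, uint16_t cw)
{
    unsigned flags = sw & 0x3Fu;
    unsigned masks = cw & 0x3Fu;
    return flags & ~masks & 0x3Fu;
}
```

<p>With ZE set in the status word (<code>0x0004</code>), the default control word <code>0x037F</code> yields no pending unmasked exception, while <code>0x037B</code> (ZM cleared) yields <code>0x04</code>.</p>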
<hr />
<h4 id="cooperating-with-os-exception-models"><strong>Cooperating with
OS Exception Models</strong></h4>
<h5 id="unix-like-systems-linuxmacosbsd"><strong>Unix-like systems
(Linux/macOS/BSD)</strong></h5>
<ul>
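<li><p>The kernel keeps FPU state per thread and saves/restores it on
context switches (<code>FXSAVE</code>/<code>FXRSTOR</code>, or
<code>XSAVE</code>/<code>XRSTOR</code> on current CPUs).</p></li>
<li><p>With exceptions masked, which is the default, no
<code>SIGFPE</code> is delivered; code polls the sticky status flags with
the <code>fenv.h</code> functions instead:</p>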
<div class="sourceCode" id="cb109"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb109-1"><a href="#cb109-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;fenv.h&gt;</span></span>
<span id="cb109-2"><a href="#cb109-2" aria-hidden="true" tabindex="-1"></a>feclearexcept<span class="op">(</span>FE_ALL_EXCEPT<span class="op">);</span></span>
<span id="cb109-3"><a href="#cb109-3" aria-hidden="true" tabindex="-1"></a><span class="co">// ...</span></span>
<span id="cb109-4"><a href="#cb109-4" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="op">(</span>fetestexcept<span class="op">(</span>FE_DIVBYZERO<span class="op">))</span> <span class="op">{</span> <span class="co">/* handle */</span> <span class="op">}</span></span></code></pre></div></li>
</ul>
<h5 id="windows-structured-exception-handling"><strong>Windows
(Structured Exception Handling)</strong></h5>
<ul>
<li><p>Unmasked FPU exceptions show as
<code>EXCEPTION_FLT_DIVIDE_BY_ZERO</code>,
<code>EXCEPTION_FLT_OVERFLOW</code>, etc.</p></li>
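<li><p>The SEH handler receives an <code>EXCEPTION_RECORD</code> whose exception code identifies the fault;
the floating-point register state at the time of the fault is available through the accompanying <code>CONTEXT</code> record.</p></li>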
</ul>
<hr />
<h4 id="modern-context-why-care-in-x8664"><strong>6.5 Modern Context:
Why Care in x86‑64</strong></h4>
<p>Even though SSE/AVX have largely replaced x87 for new code:</p>
<ul>
<li><p>Legacy libraries (especially math and scientific code) may still
use x87 for extended precision.</p></li>
<li><p>Some ABIs still rely on x87: the System V x86-64 ABI, for example, returns
<code>long double</code> values in <code>st(0)</code>.</p></li>
<li><p>OS signal/exception machinery is designed to work with both x87
and SIMD FP exceptions.</p></li>
</ul>
<hr />
<h2 id="chapter-7-sse-and-sse2-programming"><strong>Chapter 7: SSE and
SSE2 Programming</strong></h2>
<h3 id="ssesse2-architecture-overview"><strong>7.1 SSE/SSE2 Architecture
Overview</strong></h3>
<h4 id="introduction-to-streaming-simd-extensions"><strong>Introduction
to Streaming SIMD Extensions</strong></h4>
<p>SSE (Streaming SIMD Extensions) and SSE2 represent the modern
approach to floating-point and integer SIMD operations in x86-64,
replacing the legacy x87 FPU for most applications:</p>
<pre class="assembly"><code>; SSE/SSE2 Register Architecture
; ┌─────────────────────────────────────┐
; │ XMM0 │ 128 bits (16 bytes) │
; ├─────────────────────────────────────┤
; │ XMM1 │ 128 bits │
; ├─────────────────────────────────────┤
; │ XMM2 │ 128 bits │
; ├─────────────────────────────────────┤
; │ ... │ │
; ├─────────────────────────────────────┤
; │ XMM15 │ 128 bits (x86-64 only) │
; └─────────────────────────────────────┘
; Data Types and Packing
; ┌────────────────────────────────────────────────┐
; │ 4 × 32-bit single-precision floats (SSE) │
; ├────────────────────────────────────────────────┤
; │ 2 × 64-bit double-precision floats (SSE2) │
; ├────────────────────────────────────────────────┤
; │ 16 × 8-bit integers (SSE2) │
; ├────────────────────────────────────────────────┤
; │ 8 × 16-bit integers (SSE2) │
; ├────────────────────────────────────────────────┤
; │ 4 × 32-bit integers (SSE2) │
; ├────────────────────────────────────────────────┤
; │ 2 × 64-bit integers (SSE2) │
; └────────────────────────────────────────────────┘
; Basic SSE/SSE2 operation example
sse_intro:
movaps xmm0, [aligned_floats] ; Load 4 floats (must be 16-byte aligned)
movups xmm1, [unaligned_floats] ; Load 4 floats (no alignment required)
addps xmm0, xmm1 ; Add 4 float pairs in parallel
mulps xmm0, xmm2 ; Multiply 4 floats in parallel
movaps [result], xmm0 ; Store result
ret</code></pre>
<h4 id="mxcsr-controlstatus-register"><strong>MXCSR Control/Status
Register</strong></h4>
<pre class="assembly"><code>; MXCSR - 32-bit control and status register
; Bit:   31-16  15  14-13  12 11 10  9  8  7   6    5  4  3  2  1  0
; Field: Res    FZ   RC    PM UM OM ZM DM IM  DAZ  PE UE OE ZE DE IE
; (bits 12-7 are the exception masks, bit 6 is DAZ, bits 5-0 are the sticky flags)
;
; FZ = Flush to Zero
; RC = Rounding Control (00=nearest, 01=down, 10=up, 11=truncate)
; DAZ = Denormals Are Zero
; Exception Masks: PM=Precision, UM=Underflow, OM=Overflow,
; ZM=Zero divide, DM=Denormal, IM=Invalid
; Exception Flags: PE, UE, OE, ZE, DE, IE (same meanings)
mxcsr_setup:
sub rsp, 8
; Get current MXCSR
stmxcsr [rsp]
mov eax, [rsp]
; Enable flush-to-zero mode for performance
or eax, 0x8000 ; Set FZ bit
or eax, 0x0040 ; Set DAZ bit
mov [rsp], eax
ldmxcsr [rsp]
; Perform computations
; ...
add rsp, 8
ret</code></pre>
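<p>The same MXCSR manipulation can be sketched from C with the <code>_mm_getcsr</code>/<code>_mm_setcsr</code> intrinsics from <code>&lt;xmmintrin.h&gt;</code>. The helper names and the <code>MXCSR_*</code> constants below are illustrative, not part of any library, and the sketch assumes an x86-64 target where scalar <code>float</code> arithmetic is compiled to SSE instructions.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;xmmintrin.h&gt;  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_FLAGS 0x003Fu  /* sticky exception flags, bits 5-0 */
#define MXCSR_ZE    0x0004u  /* divide-by-zero flag, bit 2 */
#define MXCSR_DAZ   0x0040u  /* denormals-are-zero, bit 6 */
#define MXCSR_FZ    0x8000u  /* flush-to-zero, bit 15 */

/* Hypothetical helper: set FZ and DAZ, return the previous MXCSR value
   so the caller can restore it later with _mm_setcsr(). */
unsigned int enable_ftz_daz(void) {
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FZ | MXCSR_DAZ);
    return old;
}

/* Hypothetical helper: clear the sticky flags, divide 1.0f by 0.0f with the
   exception masked, and report whether the ZE flag was raised. */
unsigned int divide_by_zero_sets_ze(void) {
    volatile float one = 1.0f, zero = 0.0f, q;
    _mm_setcsr(_mm_getcsr() &amp; ~MXCSR_FLAGS);  /* clear sticky flags */
    q = one / zero;                              /* masked: result is +inf */
    (void)q;
    return _mm_getcsr() &amp; MXCSR_ZE;
}</code></pre>
<p>Restoring the saved value after the hot loop keeps the FZ/DAZ setting from leaking into code that expects IEEE 754 denormal handling.</p>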
<h3 id="sse-floating-point-operations"><strong>7.2 SSE Floating-Point
Operations</strong></h3>
<h4 id="single-precision-scalar-operations"><strong>Single-Precision
Scalar Operations</strong></h4>
<pre class="assembly"><code>; Scalar operations (operate on lowest element only)
scalar_float_ops:
; Load operations
movss xmm0, [float_val] ; Load single float to low 32 bits, zero the upper 96 bits
movss xmm1, xmm2 ; Copy low 32 bits; the upper 96 bits of xmm1 are unchanged
; Arithmetic (suffix &#39;ss&#39; = scalar single)
addss xmm0, xmm1 ; xmm0[31:0] += xmm1[31:0]
subss xmm0, [memory] ; Subtract from memory
mulss xmm0, xmm2 ; Multiply scalar
divss xmm0, xmm3 ; Divide scalar
sqrtss xmm0, xmm1 ; Square root of scalar
; Min/Max operations
maxss xmm0, xmm1 ; Maximum of two scalars
minss xmm0, [value] ; Minimum with memory
; Comparisons (set all bits to 1 or 0)
cmpss xmm0, xmm1, 0 ; Equal (EQ)
cmpss xmm0, xmm1, 1 ; Less than (LT)
cmpss xmm0, xmm1, 2 ; Less than or equal (LE)
cmpss xmm0, xmm1, 3 ; Unordered (NaN check)
cmpss xmm0, xmm1, 4 ; Not equal (NEQ)
cmpss xmm0, xmm1, 5 ; Not less than (NLT)
cmpss xmm0, xmm1, 6 ; Not less than or equal (NLE)
cmpss xmm0, xmm1, 7 ; Ordered (not NaN)
; Scalar compare and set EFLAGS
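ucomiss xmm0, xmm1 ; Compare and set ZF, PF, CF (an unordered result sets all three)
jp .is_nan ; Test PF first: it is set only when an operand is NaN
jbe .less_or_equal ; CF=1 or ZF=1: xmm0 &lt;= xmm1
comiss xmm0, [value] ; Same flags, but signals invalid for quiet NaNs as well
jp .is_nan ; Jump if unordered</code></pre>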
<h4 id="single-precision-packed-operations"><strong>Single-Precision
Packed Operations</strong></h4>
<pre class="assembly"><code>; Packed operations (operate on all 4 floats)
packed_float_ops:
; Aligned loads/stores (16-byte alignment required)
movaps xmm0, [aligned_array] ; Load 4 floats
movaps [result], xmm0 ; Store 4 floats
; Unaligned loads/stores (slower but no alignment requirement)
movups xmm1, [unaligned_array] ; Load 4 floats unaligned
movups [result], xmm1 ; Store unaligned
; Arithmetic (suffix &#39;ps&#39; = packed single)
addps xmm0, xmm1 ; Add 4 float pairs
subps xmm0, [memory] ; Subtract 4 floats
mulps xmm0, xmm2 ; Multiply 4 pairs
divps xmm0, xmm3 ; Divide 4 pairs
sqrtps xmm0, xmm1 ; Square root of 4 floats
rcpps xmm0, xmm1 ; Reciprocal approximation (fast 1/x)
rsqrtps xmm0, xmm1 ; Reciprocal sqrt approximation
; Min/Max operations
maxps xmm0, xmm1 ; Element-wise maximum
minps xmm0, [array] ; Element-wise minimum
; Horizontal operations (SSE3)
haddps xmm0, xmm1 ; Horizontal add
; xmm0[31:0] = xmm0[31:0] + xmm0[63:32]
; xmm0[63:32] = xmm0[95:64] + xmm0[127:96]
; xmm0[95:64] = xmm1[31:0] + xmm1[63:32]
; xmm0[127:96]= xmm1[95:64] + xmm1[127:96]
; Example: Vector dot product
dot_product_4:
movaps xmm0, [vector_a] ; Load a0, a1, a2, a3
movaps xmm1, [vector_b] ; Load b0, b1, b2, b3
mulps xmm0, xmm1 ; a0*b0, a1*b1, a2*b2, a3*b3
; Sum all elements (SSE3 version)
haddps xmm0, xmm0 ; a0*b0+a1*b1, a2*b2+a3*b3, ...
haddps xmm0, xmm0 ; Final sum in all positions
movss [result], xmm0 ; Store scalar result
ret</code></pre>
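<p>For comparison, the reduction can also be written without SSE3. The sketch below uses only SSE intrinsics from <code>&lt;xmmintrin.h&gt;</code>; the function name <code>dot4</code> is hypothetical, and unaligned loads are used so the caller does not need 16-byte aligned arrays.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;xmmintrin.h&gt;

/* Dot product of two 4-element float vectors (illustrative helper). */
float dot4(const float *a, const float *b) {
    /* prod = [p3, p2, p1, p0] with p_i = a[i] * b[i] */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* hi = [p3, p2, p3, p2]: upper half copied into the lower half */
    __m128 hi = _mm_movehl_ps(prod, prod);
    /* sum2: element 0 = p0 + p2, element 1 = p1 + p3 */
    __m128 sum2 = _mm_add_ps(prod, hi);
    /* Broadcast element 1 so it can be added to element 0 */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* Low element = (p0 + p2) + (p1 + p3) */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}</code></pre>
<p>Note that the summation order differs from a left-to-right scalar loop, so results can differ in the last bit for inputs that are not exactly representable.</p>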
<h4 id="shuffle-and-permute-operations"><strong>Shuffle and Permute
Operations</strong></h4>
<pre class="assembly"><code>; Shuffling elements within and between registers
shuffle_operations:
; shufps - Shuffle packed singles
; Immediate byte holds four 2-bit selectors: [d3 d2 | d1 d0]
; d0, d1 pick the two low result elements from xmm0 (destination)
; d2, d3 pick the two high result elements from xmm1 (source)
movaps xmm0, [array_a] ; a3, a2, a1, a0
movaps xmm1, [array_b] ; b3, b2, b1, b0
shufps xmm0, xmm1, 0xE4 ; 11 10 01 00 binary
; Result: xmm0 = [b3, b2, a1, a0]
; movhlps - Move high to low packed single
movhlps xmm0, xmm1 ; xmm0[63:0] = xmm1[127:64]
; movlhps - Move low to high packed single
movlhps xmm0, xmm1 ; xmm0[127:64] = xmm1[63:0]
; unpcklps - Unpack and interleave low singles
unpcklps xmm0, xmm1
; Result: xmm0 = [b1, a1, b0, a0]
; unpckhps - Unpack and interleave high singles
unpckhps xmm0, xmm1
; Result: xmm0 = [b3, a3, b2, a2]</code></pre>
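<p>The operand roles of <code>SHUFPS</code> are easy to get backwards, so a small check with the <code>_mm_shuffle_ps</code> intrinsic can help. This is a minimal sketch; the two function names are hypothetical, and arrays are indexed from element 0 (the lowest lane).</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;xmmintrin.h&gt;

/* Selector 0xE4: low two lanes from a (lanes 0, 1), high two from b (lanes 2, 3). */
void shuffle_low_a_high_b(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* _MM_SHUFFLE(3, 2, 1, 0) expands to 0xE4 */
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 2, 1, 0)));
}

/* Selector 0x4E: low two lanes = a[2], a[3]; high two lanes = b[0], b[1]. */
void shuffle_cross_halves(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(1, 0, 3, 2)));
}</code></pre>
<p>In both cases the two low result lanes are always drawn from the first operand and the two high lanes from the second; the immediate only chooses which lane within each.</p>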
<h3 id="sse2-double-precision-operations"><strong>7.3 SSE2
Double-Precision Operations</strong></h3>
<h4 id="double-precision-scalar-and-packed"><strong>Double-Precision
Scalar and Packed</strong></h4>
<pre class="assembly"><code>; Scalar double operations (suffix &#39;sd&#39;)
scalar_double_ops:
movsd xmm0, [double_val] ; Load scalar double
addsd xmm0, xmm1 ; Add scalar doubles
subsd xmm0, [memory] ; Subtract
mulsd xmm0, xmm2 ; Multiply
divsd xmm0, xmm3 ; Divide
sqrtsd xmm0, xmm1 ; Square root
maxsd xmm0, xmm1 ; Maximum
minsd xmm0, xmm1 ; Minimum
ucomisd xmm0, xmm1 ; Compare and set flags
jae .greater_or_equal
; Packed double operations (suffix &#39;pd&#39;)
packed_double_ops:
movapd xmm0, [aligned_doubles] ; Load 2 doubles (aligned)
movupd xmm1, [unaligned_doubles] ; Load 2 doubles (unaligned)
addpd xmm0, xmm1 ; Add 2 double pairs
subpd xmm0, [memory] ; Subtract 2 doubles
mulpd xmm0, xmm2 ; Multiply 2 pairs
divpd xmm0, xmm3 ; Divide 2 pairs
sqrtpd xmm0, xmm1 ; Square root of 2 doubles
maxpd xmm0, xmm1 ; Element-wise maximum
minpd xmm0, [array] ; Element-wise minimum
; Horizontal add (SSE3)
haddpd xmm0, xmm1
; xmm0[63:0] = xmm0[63:0] + xmm0[127:64]
; xmm0[127:64]= xmm1[63:0] + xmm1[127:64]</code></pre>
<h3 id="sse2-integer-operations"><strong>7.4 SSE2 Integer
Operations</strong></h3>
<h4 id="integer-data-movement"><strong>Integer Data
Movement</strong></h4>
<pre class="assembly"><code>; Integer move operations
integer_moves:
; Move aligned/unaligned 128-bit integer data
movdqa xmm0, [aligned_ints] ; Aligned 128-bit move
movdqu xmm1, [unaligned_ints] ; Unaligned 128-bit move
; Move 32/64-bit integers to/from XMM
movd xmm0, eax ; Move 32-bit int to XMM[31:0]
movd ebx, xmm1 ; Extract XMM[31:0] to 32-bit reg
movq xmm0, rax ; Move 64-bit int to XMM[63:0]
movq rcx, xmm2 ; Extract XMM[63:0] to 64-bit reg
; Broadcast operations
; (Later SSE versions add more broadcast instructions)
movd xmm0, eax
pshufd xmm0, xmm0, 0x00 ; Broadcast to all 4 dwords</code></pre>
<h4 id="integer-arithmetic-1"><strong>Integer Arithmetic</strong></h4>
<pre class="assembly"><code>; Packed integer arithmetic
integer_arithmetic:
movdqa xmm0, [int_array_a]
movdqa xmm1, [int_array_b]
; Addition (b=byte, w=word, d=dword, q=qword)
paddb xmm0, xmm1 ; Add 16 bytes
paddw xmm0, xmm1 ; Add 8 words
paddd xmm0, xmm1 ; Add 4 dwords
paddq xmm0, xmm1 ; Add 2 qwords
; Subtraction
psubb xmm0, xmm1 ; Subtract 16 bytes
psubw xmm0, xmm1 ; Subtract 8 words
psubd xmm0, xmm1 ; Subtract 4 dwords
psubq xmm0, xmm1 ; Subtract 2 qwords
; Multiplication
pmullw xmm0, xmm1 ; Multiply 8 signed/unsigned words -&gt; low 16 bits
pmulhw xmm0, xmm1 ; Multiply 8 signed words -&gt; high 16 bits
pmulhuw xmm0, xmm1 ; Multiply 8 unsigned words -&gt; high 16 bits
pmuludq xmm0, xmm1 ; Multiply 2 unsigned doublewords -&gt; quadword results
; Saturating arithmetic
paddsb xmm0, xmm1 ; Add signed bytes with saturation
paddusb xmm0, xmm1 ; Add unsigned bytes with saturation
paddsw xmm0, xmm1 ; Add signed words with saturation
paddusw xmm0, xmm1 ; Add unsigned words with saturation
psubsb xmm0, xmm1 ; Subtract signed bytes with saturation
psubusb xmm0, xmm1 ; Subtract unsigned bytes with saturation
psubsw xmm0, xmm1 ; Subtract signed words with saturation
psubusw xmm0, xmm1 ; Subtract unsigned words with saturation</code></pre>
<hr />
<h4 id="logical-and-bitwise-operations"><strong>Logical and Bitwise
Operations</strong></h4>
<pre class="assembly"><code> ; Bitwise logical ops
pand xmm0, xmm1 ; Bitwise AND
por xmm0, xmm1 ; Bitwise OR
pxor xmm0, xmm1 ; Bitwise XOR
pandn xmm0, xmm1 ; bitwise AND NOT (¬xmm0 AND xmm1)
; Compare packed integers
pcmpeqb xmm0, xmm1 ; Compare equal (bytes)
pcmpeqw xmm0, xmm1 ; Compare equal (words)
pcmpeqd xmm0, xmm1 ; Compare equal (dwords)
pcmpgtb xmm0, xmm1 ; Compare greater than (signed bytes)
pcmpgtw xmm0, xmm1 ; Compare greater than (signed words)
pcmpgtd xmm0, xmm1 ; Compare greater than (signed dwords)
; Shifts
psllw xmm0, xmm1 ; Shift left words by count in xmm1[63:0]
pslld xmm0, xmm1 ; Shift left dwords
psllq xmm0, xmm1 ; Shift left qwords
psraw xmm0, xmm1 ; Shift right arithmetic (signed) words
psrad xmm0, xmm1 ; Shift right arithmetic dwords
psrlw xmm0, xmm1 ; Shift right logical words
psrld xmm0, xmm1 ; Shift right logical dwords
psrlq xmm0, xmm1 ; Shift right logical qwords
; Shift by immediate
psllw xmm0, 4 ; Shift left words by imm8</code></pre>
<hr />
<h4 id="packing-and-unpacking-integers"><strong>Packing and Unpacking
Integers</strong></h4>
<pre class="assembly"><code> ; Pack with signed saturation
packsswb xmm0, xmm1 ; Pack words to bytes with saturation
packssdw xmm0, xmm1 ; Pack dwords to words with saturation
; Pack with unsigned saturation
packuswb xmm0, xmm1 ; Pack words to bytes, unsigned saturation
packusdw xmm0, xmm1 ; Pack dwords to words, unsigned saturation (SSE4.1, not SSE2)
; Unpack and interleave
punpcklbw xmm0, xmm1 ; Unpack and interleave low-order bytes
punpckhbw xmm0, xmm1 ; Unpack and interleave high-order bytes
punpcklwd xmm0, xmm1 ; Unpack and interleave low-order words
punpckhwd xmm0, xmm1 ; Unpack and interleave high-order words
punpckldq xmm0, xmm1 ; Unpack and interleave low-order dwords
punpckhdq xmm0, xmm1 ; Unpack and interleave high-order dwords</code></pre>
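<p>The difference between signed and unsigned saturation when packing is easiest to see on concrete values. The sketch below uses the SSE2 intrinsics <code>_mm_packs_epi16</code> (<code>PACKSSWB</code>) and <code>_mm_packus_epi16</code> (<code>PACKUSWB</code>); the wrapper names are hypothetical.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;emmintrin.h&gt;
#include &lt;stdint.h&gt;

/* Pack 8 signed words to 8 signed bytes, clamping to [-128, 127]. */
void pack_words_signed(const int16_t *in, int8_t *out) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i packed = _mm_packs_epi16(v, v);      /* PACKSSWB */
    _mm_storel_epi64((__m128i *)out, packed);    /* keep the low 8 bytes */
}

/* Pack 8 signed words to 8 unsigned bytes, clamping to [0, 255]. */
void pack_words_unsigned(const int16_t *in, uint8_t *out) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i packed = _mm_packus_epi16(v, v);     /* PACKUSWB */
    _mm_storel_epi64((__m128i *)out, packed);
}</code></pre>
<p>Both instructions read their inputs as signed words; only the clamping range of the output differs.</p>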
<hr />
<h4
id="conversion-between-integer-and-floating-point"><strong>Conversion
Between Integer and Floating Point</strong></h4>
<p>SSE2 adds rich conversion instructions to move between integer and FP
domains:</p>
<pre class="assembly"><code> ; Integer → float
cvtdq2ps xmm0, xmm1 ; Convert 4 signed dwords to 4 single floats
cvtpi2ps xmm0, mm1 ; Convert 2 signed dwords (MMX register) to 2 single floats (SSE)
; Float ↔ float (precision change)
cvtpd2ps xmm0, xmm1 ; Convert 2 doubles to 2 single floats (upper 64 bits zeroed)
cvtps2pd xmm0, xmm1 ; Convert the 2 low single floats to 2 doubles
; Single → integer (rounds according to MXCSR.RC)
cvtps2dq xmm0, xmm1 ; Convert 4 single floats to signed dwords
cvtps2pi mm0, xmm1 ; Convert 2 single floats to signed dwords (MMX destination)
; Double → integer (rounds according to MXCSR.RC)
cvtpd2dq xmm0, xmm1 ; Convert 2 doubles to 2 signed dwords (upper 64 bits zeroed)
cvtpd2pi mm0, xmm1 ; Convert 2 doubles to signed dwords (MMX destination)
; With truncation (round toward zero, regardless of MXCSR.RC)
cvttps2dq xmm0, xmm1 ; Convert 4 single floats to signed dwords (truncate)
cvttpd2dq xmm0, xmm1 ; Convert 2 doubles to signed dwords (truncate)</code></pre>
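<p>The rounding and truncating conversions give different answers on half-way and negative inputs. The following sketch contrasts <code>_mm_cvtps_epi32</code> (<code>CVTPS2DQ</code>) with <code>_mm_cvttps_epi32</code> (<code>CVTTPS2DQ</code>); the wrapper names are hypothetical, and the rounding results assume MXCSR.RC is at its default of round-to-nearest-even.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;emmintrin.h&gt;
#include &lt;stdint.h&gt;

/* Convert 4 floats to int32 using the current MXCSR rounding mode. */
void convert_round(const float *in, int32_t *out) {
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(_mm_loadu_ps(in)));
}

/* Convert 4 floats to int32, always rounding toward zero. */
void convert_trunc(const float *in, int32_t *out) {
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(_mm_loadu_ps(in)));
}</code></pre>
<p>With inputs <code>{2.5, 3.5, -1.5, 1.7}</code>, the rounding form yields <code>{2, 4, -2, 2}</code> under the default mode, while the truncating form yields <code>{2, 3, -1, 1}</code>, which matches a C cast to <code>int</code>.</p>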
<hr />
<p><strong>Performance Notes:</strong></p>
<ul>
<li><p>Use <code>movaps</code>/<code>movapd</code> for aligned data to
avoid penalties; align dynamic allocations to 16 bytes.</p></li>
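<li><p>Interleave independent SIMD instructions between dependent ones to hide latency, and avoid
alternating integer-domain and floating-point-domain instructions on the same data, which can add bypass delays on some microarchitectures.</p></li>
<li><p>Setting DAZ/FZ in MXCSR can significantly speed up streaming
computations on denormal-heavy data, at the cost of strict IEEE 754 behavior for denormals.</p></li>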
</ul>
<hr />
<p><strong>Compiler Perspective:</strong></p>
<ul>
<li><p>The GCC, Clang, and ICC auto-vectorizers often emit these SSE2
ops at higher optimization levels such as <code>-O3</code>. SSE2 is part of the x86-64 baseline, so
<code>-msse2</code> is only needed when targeting 32-bit x86.</p></li>
<li><p>Intrinsics (<code>_mm_*</code>) map directly to these
instructions and avoid manual inline assembly.</p></li>
<li><p>Knowledge of exact ops helps in verifying compiler output and
hand‑tuning bottlenecks.</p></li>
</ul>
<hr />
<p>This wraps up <strong>Chapter 7</strong> with a complete set of
SSE/SSE2 float and integer instruction families, their usage patterns,
and tuning considerations.</p>
<p>Next, in <strong>Chapter 8</strong> we will push deeper into
<strong>Advanced SSE Extensions</strong> — SSE3, SSSE3, and SSE4 —
adding horizontal reductions, absolute values, byte shuffles, blends,
and text-processing/vector-masking features.</p>
<hr />
<h2
id="chapter-8-advanced-sse-extensions-sse3-ssse3-sse4"><strong>Chapter
8: Advanced SSE Extensions (SSE3, SSSE3, SSE4)</strong></h2>
<h3 id="sse3-extensions"><strong>8.1 SSE3 Extensions</strong></h3>
<h4 id="horizontal-arithmetic-operations"><strong>Horizontal Arithmetic
Operations</strong></h4>
<p>SSE3 (introduced with Pentium 4 Prescott) adds horizontal operations
that combine elements within the same register, solving common reduction
patterns:</p>
<pre class="assembly"><code>; Horizontal operations in SSE3
sse3_horizontal:
; HADDPS - Horizontal add packed single-precision
movaps xmm0, [vec_a] ; a3, a2, a1, a0
movaps xmm1, [vec_b] ; b3, b2, b1, b0
haddps xmm0, xmm1
; Result: xmm0 = [b3+b2, b1+b0, a3+a2, a1+a0]
; HADDPD - Horizontal add packed double-precision
movapd xmm2, [doubles_a] ; a1, a0
movapd xmm3, [doubles_b] ; b1, b0
haddpd xmm2, xmm3
; Result: xmm2 = [b1+b0, a1+a0]
; HSUBPS - Horizontal subtract packed single
hsubps xmm0, xmm1
; Result: xmm0 = [b2-b3, b0-b1, a2-a3, a0-a1] (a = xmm0 before, b = xmm1)
; HSUBPD - Horizontal subtract packed double
hsubpd xmm2, xmm3
; Result: xmm2 = [b0-b1, a0-a1]
; Practical example: Sum all elements in a vector
sum_vector_elements:
movaps xmm0, [vector] ; Load 4 floats
haddps xmm0, xmm0 ; [sum23, sum01, sum23, sum01]
haddps xmm0, xmm0 ; total in all four elements
movss [sum], xmm0 ; Store scalar result
ret
; Alternative mixed add/subtract
addsubps_example:
movaps xmm0, [array_a]
movaps xmm1, [array_b]
addsubps xmm0, xmm1 ; Alternating subtract/add
; xmm0[31:0] = a0 - b0
; xmm0[63:32] = a1 + b1
; xmm0[95:64] = a2 - b2
; xmm0[127:96]= a3 + b3
ret</code></pre>
<h4 id="special-move-operations"><strong>Special Move
Operations</strong></h4>
<pre class="assembly"><code>; SSE3 move operations
sse3_moves:
; MOVSLDUP - Move/duplicate low (even-indexed) singles
movsldup xmm0, [memory]
; Result: xmm0 = [src[2], src[2], src[0], src[0]]
; MOVSHDUP - Move/duplicate high (odd-indexed) singles
movshdup xmm0, [memory]
; Result: xmm0 = [src[3], src[3], src[1], src[1]]
; MOVDDUP - Move/duplicate double
movddup xmm0, [double_val]
; Result: xmm0 = [src[0], src[0]] (duplicate 64-bit value)
; LDDQU - Load unaligned 128 bits (optimized for cache-line splits)
lddqu xmm0, [unaligned_data]
; Helped on Pentium 4 when a load crossed a cache line; later CPUs treat it like MOVDQU</code></pre>
<h4 id="x87-fpu-integration-instructions"><strong>MONITOR/MWAIT Synchronization
Instructions</strong></h4>
<pre class="assembly"><code>; MONITOR/MWAIT (added alongside SSE3) for waiting on a store to a monitored address
; Normally usable only at privilege level 0 (kernel code)
monitor_wait:
; Set up monitor address
lea rax, [monitor_addr]
xor ecx, ecx ; No extensions
xor edx, edx ; No hints
monitor ; Set up address monitoring
; Wait for event or store to monitored address
xor eax, eax ; No hints
xor ecx, ecx ; No extensions
mwait ; Enter optimized waiting state</code></pre>
<h3 id="ssse3-extensions"><strong>8.2 SSSE3 Extensions</strong></h3>
<h4 id="absolute-value-and-sign-operations"><strong>Absolute Value and
Sign Operations</strong></h4>
<p>SSSE3 (Supplemental SSE3, Core 2) adds critical byte-manipulation and
absolute value operations:</p>
<pre class="assembly"><code>; Absolute value operations
ssse3_absolute:
movdqa xmm0, [signed_bytes]
pabsb xmm0, xmm0 ; Absolute value of 16 signed bytes
movdqa xmm1, [signed_words]
pabsw xmm1, xmm1 ; Absolute value of 8 signed words
movdqa xmm2, [signed_dwords]
pabsd xmm2, xmm2 ; Absolute value of 4 signed dwords
; Sign operations
ssse3_sign:
movdqa xmm0, [data_bytes]
movdqa xmm1, [sign_bytes]
psignb xmm0, xmm1 ; Negate/zero/keep based on sign
; If xmm1[i] &lt; 0: xmm0[i] = -xmm0[i]
; If xmm1[i] = 0: xmm0[i] = 0
; If xmm1[i] &gt; 0: xmm0[i] = xmm0[i]
psignw xmm2, xmm3 ; Sign operation on words
psignd xmm4, xmm5 ; Sign operation on dwords</code></pre>
<h4 id="horizontal-addition-with-saturation"><strong>Horizontal Addition
with Saturation</strong></h4>
<pre class="assembly"><code>; Horizontal add with saturation
ssse3_hadd:
movdqa xmm0, [bytes_a]
movdqa xmm1, [bytes_b]
; PHADDW - Horizontal add adjacent pairs of words
phaddw xmm0, xmm1
; Each pair of adjacent words is summed
; PHADDD - Horizontal add adjacent pairs of dwords
phaddd xmm0, xmm1
; PHADDSW - Horizontal add words with signed saturation
phaddsw xmm0, xmm1
; Horizontal subtract variants
phsubw xmm0, xmm1 ; Horizontal subtract words
phsubd xmm0, xmm1 ; Horizontal subtract dwords
phsubsw xmm0, xmm1 ; Horizontal subtract with saturation</code></pre>
<h4 id="multiply-and-add-packed"><strong>Multiply and Add
Packed</strong></h4>
<pre class="assembly"><code>; Multiply-add operations
ssse3_madd:
movdqa xmm0, [bytes_a]
movdqa xmm1, [bytes_b]
; PMADDUBSW - Multiply unsigned/signed bytes, add pairs
pmaddubsw xmm0, xmm1
; Multiply unsigned bytes from xmm0 with signed bytes from xmm1
; Add adjacent products with signed saturation
; Store 8 word results
; PMULHRSW - Multiply high with round and scale
pmulhrsw xmm0, xmm1
; Multiply signed words, shift right 15, round</code></pre>
<h4 id="byte-shuffle-pshufb"><strong>Byte Shuffle (PSHUFB)</strong></h4>
<p>The most powerful SSSE3 instruction for byte manipulation:</p>
<pre class="assembly"><code>; PSHUFB - Shuffle bytes
byte_shuffle:
movdqa xmm0, [source_bytes]
movdqa xmm1, [shuffle_mask]
pshufb xmm0, xmm1
; For each byte position i in result:
; If xmm1[i] &amp; 0x80: result[i] = 0
; Else: result[i] = xmm0[xmm1[i] &amp; 0x0F]
; Example: Reverse byte order (endian swap)
reverse_bytes:
movdqa xmm0, [data]
movdqa xmm1, [reverse_mask] ; 0F 0E 0D 0C 0B 0A 09 08 07 06 05 04 03 02 01 00
pshufb xmm0, xmm1
movdqa [result], xmm0
ret
; Example: Extract specific bytes
extract_bytes:
movdqa xmm0, [source]
; Extract bytes 0, 4, 8, 12, zero the rest
movdqa xmm1, [.mask]
pshufb xmm0, xmm1
ret
align 16 ; movdqa requires a 16-byte aligned operand
.mask:
db 0x00, 0x04, 0x08, 0x0C ; Positions to extract
db 0x80, 0x80, 0x80, 0x80 ; Zero these positions
db 0x80, 0x80, 0x80, 0x80
db 0x80, 0x80, 0x80, 0x80</code></pre>
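<p>The selection rule is compact enough to model in plain C, which can be handy for checking a mask before committing it to assembly. The function below is a scalar reference model of the <code>PSHUFB</code> semantics described above, not the instruction itself; the name <code>pshufb_model</code> is hypothetical.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Scalar model of PSHUFB on 16-byte vectors.
   For each output byte i: zero if mask[i] has bit 7 set,
   otherwise src[mask[i] &amp; 0x0F]. Safe when out aliases src. */
void pshufb_model(const uint8_t src[16], const uint8_t mask[16], uint8_t out[16]) {
    uint8_t tmp[16];
    for (int i = 0; i &lt; 16; i++) {
        tmp[i] = (mask[i] &amp; 0x80) ? 0 : src[mask[i] &amp; 0x0F];
    }
    for (int i = 0; i &lt; 16; i++) {
        out[i] = tmp[i];
    }
}</code></pre>
<p>Feeding it the reversal mask (<code>0x0F</code> down to <code>0x00</code>) reverses the 16 bytes, and the extraction mask from the example keeps bytes 0, 4, 8, and 12 while zeroing the remaining positions.</p>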
<h4 id="alignment-operations"><strong>Alignment Operations</strong></h4>
<pre class="assembly"><code>; PALIGNR - Concatenate and extract aligned result
alignment_ops:
movdqa xmm0, [buffer_high] ; Bytes 16..31 of the stream
movdqa xmm1, [buffer_low] ; Bytes 0..15 of the stream
; Extract 16 bytes starting at byte offset 3
palignr xmm0, xmm1, 3
; Concatenates xmm0:xmm1 (xmm0 in the upper half), shifts right 3 bytes,
; and keeps bytes [18:3] in xmm0
; Use case: Sliding window operations
; Process overlapping 16-byte windows from a stream
sliding_window:
movdqa xmm0, [window_prev]
movdqa xmm1, [window_curr]
movdqa xmm2, xmm1 ; Keep an unshifted copy of the current block
palignr xmm1, xmm0, 4 ; xmm1 = bytes [19:4] of curr:prev
; Process xmm1...
movdqa xmm0, [window_next]
palignr xmm0, xmm2, 4 ; xmm0 = bytes [19:4] of next:curr
ret</code></pre>
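<p>Because the operand order of <code>PALIGNR</code> is a frequent source of bugs, a scalar model can make it explicit: the destination supplies the upper 16 bytes of the 32-byte intermediate and the source supplies the lower 16. The function below is an illustrative sketch with a hypothetical name, not a library routine.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Scalar model of PALIGNR dst, src, imm8 on 16-byte vectors. */
void palignr_model(uint8_t dst[16], const uint8_t src[16], unsigned imm) {
    uint8_t cat[32];
    memcpy(cat, src, 16);        /* source forms bytes 0..15 */
    memcpy(cat + 16, dst, 16);   /* destination forms bytes 16..31 */
    for (unsigned i = 0; i &lt; 16; i++) {
        unsigned k = i + imm;    /* byte shifted into position i */
        dst[i] = (k &lt; 32) ? cat[k] : 0;  /* zeros shift in past the end */
    }
}</code></pre>
<p>With <code>src</code> holding stream bytes 0..15 and <code>dst</code> holding bytes 16..31, an immediate of 3 leaves stream bytes 3..18 in <code>dst</code>.</p>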
<h3 id="sse4.1-extensions"><strong>8.3 SSE4.1 Extensions</strong></h3>
<h4 id="blending-operations"><strong>Blending Operations</strong></h4>
<p>SSE4.1 (Penryn) adds flexible blending and improved integer
operations:</p>
<pre class="assembly"><code>; Blend operations
sse41_blending:
; BLENDPS - Blend packed single-precision using immediate mask
movaps xmm0, [array_a]
movaps xmm1, [array_b]
blendps xmm0, xmm1, 0b1010 ; Blend using immediate
; Bit i=0: select from xmm0
; Bit i=1: select from xmm1
; Result: [b3, a2, b1, a0]
; BLENDPD - Blend packed double-precision
blendpd xmm2, xmm3, 0b01 ; Select xmm3[63:0], xmm2[127:64]
; BLENDVPS - Variable blend; the mask is implicitly XMM0
movaps xmm0, [mask] ; Sign bit of each element controls the selection
blendvps xmm2, xmm1, xmm0 ; Per element: sign bit set -&gt; take xmm1, else keep xmm2
; PBLENDVB - Variable byte blend; the mask is implicitly XMM0
movdqa xmm0, [byte_mask]
movdqa xmm2, [bytes_a]
movdqa xmm1, [bytes_b]
pblendvb xmm2, xmm1, xmm0 ; Per byte: high bit of xmm0 set -&gt; take xmm1, else keep xmm2
; PBLENDW - Blend words with immediate
pblendw xmm0, xmm1, 0xF0 ; Blend high 4 words from xmm1</code></pre>
<h4 id="dot-product-instructions"><strong>Dot Product
Instructions</strong></h4>
<pre class="assembly"><code>; Dot product operations
sse41_dot_product:
; DPPS - Dot product of packed singles
movaps xmm0, [vector_a]
movaps xmm1, [vector_b]
dpps xmm0, xmm1, 0xF1
; Immediate byte: [mask_in:mask_out] (high nibble : low nibble)
; mask_in (bits 4-7): Which products to sum
; mask_out (bits 0-3): Which result elements receive the sum (the others are zeroed)
; Example: Full 4-element dot product
dpps xmm0, xmm1, 0xFF ; All products, broadcast to all
; Example: 3D dot product (ignore 4th element)
dpps xmm0, xmm1, 0x71 ; Products 0,1,2; store in position 0
; DPPD - Dot product of packed doubles
dppd xmm2, xmm3, 0x31 ; Dot product, result in low element</code></pre>
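<p>The two nibbles of the <code>DPPS</code> immediate can be modelled in a few lines of C. This is a scalar sketch of the masking logic only, with a hypothetical function name; the hardware may add the partial products in a different order, so the last bit of the result can differ for inexact inputs.</p>
<pre class="sourceCode c"><code class="sourceCode c">/* Scalar model of DPPS dst, src, imm8. */
void dpps_model(float dst[4], const float src[4], unsigned imm8) {
    float sum = 0.0f;
    /* High nibble: bit (4 + i) enables product i */
    for (int i = 0; i &lt; 4; i++) {
        if (imm8 &amp; (0x10u &lt;&lt; i)) {
            sum += dst[i] * src[i];
        }
    }
    /* Low nibble: bit i writes the sum to element i, otherwise zero */
    for (int i = 0; i &lt; 4; i++) {
        dst[i] = (imm8 &amp; (1u &lt;&lt; i)) ? sum : 0.0f;
    }
}</code></pre>
<p>With <code>0x71</code> only the first three products contribute and only element 0 receives the result, which is the usual 3D dot product setup; <code>0xFF</code> sums all four products and broadcasts the sum.</p>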
<h4 id="rounding-operations"><strong>Rounding Operations</strong></h4>
<pre class="assembly"><code>; Rounding with selectable modes
sse41_rounding:
; ROUNDPS - Round packed singles
movaps xmm0, [floats]
roundps xmm1, xmm0, 0x00 ; Round to nearest (even)
roundps xmm2, xmm0, 0x01 ; Round down (floor)
roundps xmm3, xmm0, 0x02 ; Round up (ceil)
roundps xmm4, xmm0, 0x03 ; Truncate (toward zero)
roundps xmm5, xmm0, 0x04 ; Use MXCSR.RC field
; ROUNDPD - Round packed doubles
roundpd xmm1, xmm0, 0x02 ; Ceiling for doubles
; ROUNDSS/ROUNDSD - Scalar versions
roundss xmm1, xmm0, 0x01 ; Floor single scalar
roundsd xmm1, xmm0, 0x03 ; Truncate double scalar</code></pre>
<h4 id="integer-minmax-operations"><strong>Integer Min/Max
Operations</strong></h4>
<pre class="assembly"><code>; Min/Max for more integer types
sse41_minmax:
movdqa xmm0, [ints_a]
movdqa xmm1, [ints_b]
; Signed operations
pminsb xmm0, xmm1 ; Min of signed bytes
pmaxsb xmm0, xmm1 ; Max of signed bytes
pminsw xmm0, xmm1 ; Min of signed words (SSE2)
pmaxsw xmm0, xmm1 ; Max of signed words (SSE2)
pminsd xmm0, xmm1 ; Min of signed dwords
pmaxsd xmm0, xmm1 ; Max of signed dwords
; Unsigned operations
pminub xmm0, xmm1 ; Min of unsigned bytes (SSE2)
pmaxub xmm0, xmm1 ; Max of unsigned bytes (SSE2)
pminuw xmm0, xmm1 ; Min of unsigned words
pmaxuw xmm0, xmm1 ; Max of unsigned words
pminud xmm0, xmm1 ; Min of unsigned dwords
pmaxud xmm0, xmm1 ; Max of unsigned dwords</code></pre>
<h4 id="enhanced-integer-operations"><strong>Enhanced Integer
Operations</strong></h4>
<pre class="assembly"><code>; Extended integer multiply
sse41_multiply:
; PMULLD - Multiply packed signed dwords, low 32 bits
movdqa xmm0, [dwords_a]
movdqa xmm1, [dwords_b]
pmulld xmm0, xmm1 ; 4 × 32-bit multiplies
; PMULDQ - Multiply packed signed dwords, 64-bit results
pmuldq xmm0, xmm1 ; Two signed 64‑bit products, from dwords 0 and 2 of each operand
; PMINUW / PMAXUW are covered above
; PHMINPOSUW – horizontal minimum + position for unsigned words
mix_operations:
phminposuw xmm0, xmm1
; Finds the lowest unsigned 16‑bit value in xmm1 and returns it in the low word of xmm0,
; its index in bits 18:16, rest zeroed</code></pre>
<h4 id="testzmask-generation"><strong>Test/Z‑mask Generation</strong></h4>
<pre class="assembly"><code>; PTEST – bit test for AND/ANDN results
ptest_example:
movdqa xmm0, [mask_a]
movdqa xmm1, [mask_b]
ptest xmm0, xmm1
; Sets ZF=1 if (xmm0 AND xmm1)=0
; Sets CF=1 if (NOT xmm0 AND xmm1)=0
; Useful for mask emptiness checks before blending</code></pre>
<hr />
<h3 id="sse4.2-extensions"><strong>8.4 SSE4.2 Extensions</strong></h3>
<p>SSE4.2 (Nehalem) is more specialised: it adds string/text processing, CRC accumulation, and compare–aggregation instructions aimed at accelerating data parsing and matching.</p>
<h4 id="stringtext-comparison"><strong>String/Text Comparison</strong></h4>
<p>These instructions treat each XMM operand as a string of up to 16 bytes (or 8 words), compare the two according to the immediate, and report the result in ECX or XMM0 and in the flags:</p>
<pre class="assembly"><code>; Packed compare explicit length
pcmpestri_example:
; EAX = length of string A (first operand), EDX = length of string B (second operand)
; Immediate controls comparison mode (equal‑any, equal‑each, ranges…)
movdqa xmm0, [strA]
movdqa xmm1, [strB]
mov eax, lenA
mov edx, lenB
pcmpestri xmm0, xmm1, 0x00 ; Equal‑any, unsigned bytes
; Result: ECX = index of the first byte of B that occurs in A (16 if none),
; CF=1 if a match was found
pcmpistrm_example:
; Packed compare implicit length (up to first NUL)
movdqa xmm0, [strA]
movdqa xmm1, [strB]
pcmpistrm xmm0, xmm1, 0x18 ; Equal‑each, negated: bit mask of differing positions in xmm0
; Useful for quickly locating the first mismatch between two strings</code></pre>
<p>Control immediates choose between <strong>equal‑any</strong>, <strong>equal‑each</strong>, <strong>ranges</strong>, and <strong>equal‑ordered</strong> (substring search), and between signed/unsigned bytes or words.</p>
<h4 id="comparemask-extraction"><strong>Compare‑mask extraction</strong></h4>
<pre class="assembly"><code>; PCMPxSTRM – output comparison mask
membership_test:
pcmpistrm xmm0, xmm1, 0x40 ; Equal‑any, unsigned bytes, implicit length, byte‑mask output
pmovmskb eax, xmm0 ; Extract high bit of each mask byte to an integer
test eax, eax
jz no_members</code></pre>
<hr />
<h4 id="crc32-hardware-accumulation"><strong>CRC32 Hardware Accumulation</strong></h4>
<pre class="assembly"><code>crc_accumulation:
xor eax, eax ; Starting value of the running CRC
mov ecx, [data32]
crc32 eax, ecx ; Accumulate over 32‑bit chunk
mov cl, [next_byte]
crc32 eax, cl ; Accumulate over single byte
; Byte, word, dword and qword source forms exist, with register or memory source</code></pre>
<p>Ideal for fast checksums over large buffers. The instruction implements CRC‑32C (the Castagnoli polynomial), not the CRC‑32 used by zlib or Ethernet; compilers emit it for the <code>_mm_crc32_u8</code>/<code>_mm_crc32_u16</code>/<code>_mm_crc32_u32</code>/<code>_mm_crc32_u64</code> intrinsics in <code>&lt;nmmintrin.h&gt;</code>.</p>
<hr />
<h4 id="other-integer-operations"><strong>Other Integer Operations</strong></h4>
<pre class="assembly"><code>; POPCNT – population count (number of set bits); has its own CPUID feature bit
mov rcx, [bitmask]
popcnt rax, rcx ; Count 1 bits in 64‑bit mask
; PCMPGTQ – packed signed 64‑bit greater-than compare (SSE4.2)
pcmpgtq xmm0, xmm1 ; Each qword: all ones if xmm0 &gt; xmm1, else zero
; Packed 64‑bit min/max (VPMINSQ, VPMAXUQ, ...) only arrive with AVX-512</code></pre>
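<p>To make concrete what the <code>CRC32</code> instruction accumulates, the byte step can be modelled bit by bit in portable C. This is a reference sketch for checking results, not a fast implementation; the function names are hypothetical, and the initial value and final inversion follow the common CRC‑32C convention rather than anything the instruction itself enforces.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* One byte of accumulation, modelling the byte form of CRC32.
   0x82F63B78 is the bit-reflected Castagnoli polynomial. */
uint32_t crc32c_update_byte(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int bit = 0; bit &lt; 8; bit++) {
        /* If the low bit is set, shift and XOR in the polynomial */
        crc = (crc &gt;&gt; 1) ^ (0x82F63B78u &amp; (0u - (crc &amp; 1u)));
    }
    return crc;
}

/* Conventional CRC-32C: start from all ones, invert at the end. */
uint32_t crc32c(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc = crc32c_update_byte(crc, *p++);
    }
    return crc ^ 0xFFFFFFFFu;
}</code></pre>
<p>The standard check string <code>"123456789"</code> produces <code>0xE3069283</code>, which is a convenient way to validate a hardware-accelerated loop against this model.</p>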
<hr />
<h3 id="compiler-mapping-and-usecases"><strong>8.5 Compiler Mapping and
Use‑Cases</strong></h3>
<p>Compilers map source code onto these instructions when you:</p>
<ul>
<li><p>Enable the appropriate target flags on GCC/Clang:<br />
<code>-msse3 -mssse3 -msse4.1 -msse4.2</code> (MSVC exposes the intrinsics without per-extension flags)</p></li>
<li><p>Use intrinsics from <code>&lt;pmmintrin.h&gt;</code> (SSE3), <code>&lt;tmmintrin.h&gt;</code> (SSSE3),
<code>&lt;smmintrin.h&gt;</code> (SSE4.1), and <code>&lt;nmmintrin.h&gt;</code> (SSE4.2)</p></li>
<li><p>Rely on auto‑vectorisation: byte shuffles and blends come from mask‑driven select and permute patterns;<br />
reductions are usually built from shuffles and adds rather than <code>HADDPS</code>;<br />
the string instructions typically appear in hand‑written library routines (<code>strchr</code>‐style code) rather than in auto‑vectorised loops</p></li>
</ul>
<p><strong>Micro‑optimisation opportunities:</strong></p>
<ul>
<li><p><strong><code>PSHUFB</code></strong> replaces multiple
shifts/masks when rearranging bytes.</p></li>
<li><p><strong>Blends</strong> avoid branches in element‑selection
code.</p></li>
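<li><p><strong>Horizontal adds</strong> shorten reduction code, but on most microarchitectures they decode into
shuffles plus an add, so keep them out of inner loops and use them once at the end of a reduction.</p></li>
<li><p><strong>PTEST</strong> checks a whole 128‑bit mask for emptiness in one instruction, replacing a
slower scalar loop or a <code>PMOVMSKB</code>/<code>TEST</code> pair.</p></li>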
</ul>
<hr />
<p><strong>Next Steps:</strong><br />
Chapter 9 will extend these concepts into <strong>AVX and AVX2</strong>
with 256‑bit YMM registers, non‑destructive three‑operand form,
gather‑load, and richer permute/shuffle capabilities — vastly increasing
parallelism beyond SSE’s 128‑bit lanes.</p>
<hr />
<h2 id="chapter-9-avx-and-avx2-vector-extensions"><strong>Chapter 9: AVX
and AVX2 Vector Extensions</strong></h2>
<h3 id="introduction-to-avx-architecture"><strong>9.1 Introduction to
AVX Architecture</strong></h3>
<h4 id="evolution-from-sse-to-avx"><strong>Evolution from SSE to
AVX</strong></h4>
<p>AVX (Advanced Vector Extensions, introduced with Sandy Bridge 2011)
represents a major architectural enhancement over SSE:</p>
<pre class="assembly"><code>; Key AVX improvements:
; 1. 256-bit YMM registers (YMM0-YMM15, YMM16-YMM31 with AVX-512)
; 2. Three-operand non-destructive syntax (VEX encoding)
; 3. Explicit zeroing of upper bits
; 4. New instructions for permutation and broadcast
; SSE (destructive two-operand):
movaps xmm0, [src1]
addps xmm0, [src2] ; xmm0 destroyed
; AVX (non-destructive three-operand):
vmovaps ymm0, [src1]
vaddps ymm2, ymm0, [src2] ; ymm0 preserved, result in ymm2</code></pre>
<h4 id="ymm-register-architecture"><strong>YMM Register
Architecture</strong></h4>
<pre class="assembly"><code>; YMM register layout (256 bits)
; YMM0 = [255:128 upper lane | 127:0 lower lane]
; XMM0 aliases the lower 128 bits of YMM0
avx_register_demo:
; Load 256-bit data
vmovaps ymm0, [aligned_256bit_data]
; Extract lanes
vextractf128 xmm1, ymm0, 1 ; Extract upper 128 bits
; xmm1 = ymm0[255:128]
; Insert lanes
vinsertf128 ymm2, ymm1, xmm3, 0 ; Insert into lower lane
; ymm2 = [ymm1[255:128] | xmm3]
; Zero upper bits when using legacy SSE
vzeroupper ; Clear ymm[255:128] for all registers
; Critical for SSE/AVX transition performance</code></pre>
<h4 id="vex-encoding-prefix"><strong>VEX Encoding Prefix</strong></h4>
<pre class="assembly"><code>; VEX prefix enables:
; - 3-operand instructions
; - Access to YMM registers
; - Explicit vector length (128/256 bit)
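; 2-byte VEX (0xC5): enough when no X/B/W bits and no 0F38/0F3A opcode map are needed
vaddps xmm0, xmm1, xmm2 ; C5 F0 58 C2
vaddps ymm0, ymm1, ymm2 ; C5 F4 58 C2 (same, with VEX.L=1)
; 3-byte VEX (0xC4): needed e.g. when the r/m operand is ymm8-ymm15
vaddps ymm0, ymm1, ymm8 ; C4 C1 74 58 C0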
; VEX.L bit controls vector length:
; L=0: 128-bit operation
; L=1: 256-bit operation</code></pre>
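<p>The payload byte that follows <code>0xC5</code> packs four fields, two of them stored inverted. As a rough aid for reading disassembly, the sketch below computes that byte; the function name and parameter layout are hypothetical, and it covers only the 2-byte form (no X, B, or W bits).</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Second byte of a 2-byte VEX prefix: [~R | ~vvvv | L | pp].
   reg      - register number encoded in ModRM.reg (0..15)
   vvvv_reg - first source register number (0..15)
   l        - 0 for 128-bit, 1 for 256-bit
   pp       - implied legacy prefix: 0 none, 1 = 66, 2 = F3, 3 = F2 */
uint8_t vex2_payload(unsigned reg, unsigned vvvv_reg, unsigned l, unsigned pp) {
    unsigned r_inv    = (~(reg &gt;&gt; 3)) &amp; 1u;   /* inverted high bit of ModRM.reg */
    unsigned vvvv_inv = (~vvvv_reg) &amp; 0xFu;     /* inverted source register */
    return (uint8_t)((r_inv &lt;&lt; 7) | (vvvv_inv &lt;&lt; 3) | ((l &amp; 1u) &lt;&lt; 2) | (pp &amp; 3u));
}</code></pre>
<p>For <code>vaddps xmm0, xmm1, xmm2</code> the destination is register 0 and the first source is register 1, giving <code>0xF0</code>; the 256-bit form of the same instruction gives <code>0xF4</code>.</p>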
<h4 id="state-management"><strong>State Management</strong></h4>
<pre class="assembly"><code>; AVX state transitions
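avx_state_management:
; Check AVX support (user mode): CPUID.1:ECX bit 27 = OSXSAVE, bit 28 = AVX
mov eax, 1
cpuid
test ecx, 1 &lt;&lt; 27 ; OS has enabled XSAVE/XGETBV
jz no_avx
test ecx, 1 &lt;&lt; 28 ; Check AVX bit
jz no_avx
xor ecx, ecx
xgetbv ; Get XCR0 into EDX:EAX
and eax, 0x06
cmp eax, 0x06 ; OS must have enabled SSE (bit 1) and AVX (bit 2) state
jne no_avx
; Enable AVX in XCR0 (ring 0 only: XSETBV is privileged)
xor ecx, ecx
xgetbv ; Get XCR0
or eax, 0x07 ; x87 (must stay set), SSE, AVX
xsetbv ; Set XCR0 from EDX:EAX
; Save AVX state (save_area must be 64-byte aligned)
mov eax, 7 ; Save x87, SSE, AVX
xor edx, edx
xsave [save_area]
; Clean upper state before SSE code
vzeroupper ; Avoid transition penalties
call sse_function
; Restore AVX state
mov eax, 7
xor edx, edx
xrstor [save_area]</code></pre>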
<h3 id="avx-floating-point-operations"><strong>9.2 AVX Floating-Point
Operations</strong></h3>
<h4 id="bit-packed-operations"><strong>256-bit Packed
Operations</strong></h4>
<pre class="assembly"><code>; AVX packed single-precision (8 floats)
avx_packed_single:
vmovaps ymm0, [vec_a_256] ; Load 8 floats
vmovaps ymm1, [vec_b_256]
; Arithmetic operations
vaddps ymm2, ymm0, ymm1 ; Add 8 floats
vsubps ymm3, ymm0, ymm1 ; Subtract 8 floats
vmulps ymm4, ymm0, ymm1 ; Multiply 8 floats
vdivps ymm5, ymm0, ymm1 ; Divide 8 floats
vsqrtps ymm6, ymm0 ; Square root of 8 floats
; FMA (Fused Multiply-Add) - requires FMA3
vfmadd213ps ymm0, ymm1, ymm2 ; ymm0 = ymm0*ymm1 + ymm2
vfmsub132ps ymm0, ymm1, ymm2 ; ymm0 = ymm0*ymm2 - ymm1
vfnmadd231ps ymm0, ymm1, ymm2; ymm0 = -(ymm1*ymm2) + ymm0
; AVX packed double-precision (4 doubles)
avx_packed_double:
vmovapd ymm0, [vec_a_dp] ; Load 4 doubles
vmovapd ymm1, [vec_b_dp]
vaddpd ymm2, ymm0, ymm1 ; Add 4 doubles
vmulpd ymm3, ymm0, ymm1 ; Multiply 4 doubles
vmaxpd ymm4, ymm0, ymm1 ; Maximum of 4 doubles
vminpd ymm5, ymm0, ymm1 ; Minimum of 4 doubles</code></pre>
<h4 id="comparison-and-masking"><strong>Comparison and
Masking</strong></h4>
<pre class="assembly"><code>; AVX comparisons with predicates
avx_compare:
vcmpps ymm2, ymm0, ymm1, 0 ; EQ (equal)
vcmpps ymm3, ymm0, ymm1, 1 ; LT (less than)
vcmpps ymm4, ymm0, ymm1, 2 ; LE (less or equal)
vcmpps ymm5, ymm0, ymm1, 3 ; UNORD (unordered)
vcmpps ymm6, ymm0, ymm1, 4 ; NEQ (not equal)
vcmpps ymm7, ymm0, ymm1, 5 ; NLT (not less than)
; Use comparison mask for blending
vcmpps ymm2, ymm0, ymm1, 1 ; Create mask
vblendvps ymm3, ymm4, ymm5, ymm2 ; Conditional select
; Masked operations using AND/ANDN/OR
masked_operations:
vcmpps ymm2, ymm0, ymm1, 0 ; Generate mask
vandps ymm3, ymm0, ymm2 ; Keep where mask=1
vandnps ymm4, ymm2, ymm1 ; Keep where mask=0
vorps ymm5, ymm3, ymm4 ; Combine results</code></pre>
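<p>A comparison mask can also be reduced to a scalar bitmask. The
following is a minimal sketch, not taken from the text above: a
hypothetical routine <code>count_less_than</code> (NASM syntax, System V
AMD64 calling convention) that counts how many floats in an array fall
below a threshold. It assumes the element count is a multiple of 8 and
that the CPU supports AVX and POPCNT.</p>
<pre class="assembly"><code>; size_t count_less_than(const float *a, size_t n, float t)
; RDI = a, RSI = n (assumed to be a multiple of 8), XMM0 = t
global count_less_than
section .text
count_less_than:
vshufps xmm0, xmm0, xmm0, 0 ; Splat t across the low 128-bit lane
vinsertf128 ymm1, ymm0, xmm0, 1 ; Copy that lane into the high half
xor eax, eax ; Running count (also clears RAX)
test rsi, rsi
jz .done
.loop:
vmovups ymm2, [rdi] ; Load 8 floats (no alignment required)
vcmpps ymm2, ymm2, ymm1, 1 ; Predicate 1 (LT): all-ones where a[i] &lt; t
vmovmskps edx, ymm2 ; Sign bit of each element -&gt; EDX[7:0]
popcnt edx, edx ; Number of elements that passed
add rax, rdx
add rdi, 32
sub rsi, 8
jnz .loop
.done:
vzeroupper ; Leave clean upper halves for SSE callers
ret</code></pre>
<p>Because LT is an ordered predicate, NaN elements never count as
"less than" here.</p>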
<h4 id="broadcast-operations"><strong>Broadcast Operations</strong></h4>
<pre class="assembly"><code>; Broadcast scalar to all elements
avx_broadcast:
; Broadcast single float to 8 positions
vbroadcastss ymm0, dword [scalar_float]
; Broadcast double to 4 positions
vbroadcastsd ymm1, qword [scalar_double]
; Broadcast from register (the register-source form requires AVX2)
vbroadcastss ymm2, xmm0 ; Lowest float to all 8
; Broadcast 128-bit to both lanes
vbroadcastf128 ymm3, oword [data_128]
; ymm3[127:0] = ymm3[255:128] = mem[127:0]
; Practical use: scalar-vector multiply
scalar_vector_mul:
vbroadcastss ymm0, dword [scalar]
vmulps ymm1, ymm0, [vector_256]
vmovaps [result_256], ymm1</code></pre>
<h3 id="avx-permutation-and-shuffle"><strong>9.3 AVX Permutation and
Shuffle</strong></h3>
<h4 id="cross-lane-permutation"><strong>Cross-Lane
Permutation</strong></h4>
<pre class="assembly"><code>; VPERM2F128 - Permute 128-bit lanes
lane_permutation:
vperm2f128 ymm2, ymm0, ymm1, 0x20
; Immediate selects which 128-bit chunks:
; Bits [1:0]: Source for dest[127:0]
; Bits [5:4]: Source for dest[255:128]
; Sources: ymm0_lo, ymm0_hi, ymm1_lo, ymm1_hi
; Example: Swap lanes within register
vperm2f128 ymm1, ymm0, ymm0, 0x01
; ymm1 = [ymm0_lo | ymm0_hi]
; Example: Broadcast upper lane
vperm2f128 ymm2, ymm0, ymm0, 0x11
; ymm2 = [ymm0_hi | ymm0_hi]
; VPERMILPS - Permute within lanes
within_lane_permute:
; Each lane permuted independently
vpermilps ymm1, ymm0, 0b10110001
; Control: 2 bits per element select source position
; Lower lane: ymm0[127:0] permuted
; Upper lane: ymm0[255:128] permuted separately
; Variable permute using register control
vmovaps ymm2, [permute_indices]
vpermilps ymm3, ymm0, ymm2</code></pre>
<h4 id="unpack-and-shuffle"><strong>Unpack and Shuffle</strong></h4>
<pre class="assembly"><code>; Unpack operations (256-bit)
avx_unpack:
vunpcklps ymm2, ymm0, ymm1
; Lower lane: interleave low halves of ymm0[127:0], ymm1[127:0]
; Upper lane: interleave low halves of ymm0[255:128], ymm1[255:128]
vunpckhps ymm3, ymm0, ymm1
; Similar but high halves
; Shuffle within lanes
vshufps ymm4, ymm0, ymm1, 0b10110001
; Each 128-bit lane shuffled independently
; Blend operations
avx_blending:
vblendps ymm2, ymm0, ymm1, 0b10101010
; Immediate mask selects per-element
; Variable blend
vblendvps ymm3, ymm0, ymm1, ymm2
; Sign bit of ymm2 elements controls selection</code></pre>
<h3 id="avx2-integer-operations"><strong>9.4 AVX2 Integer
Operations</strong></h3>
<p>AVX2 (Haswell 2013) extends integer SIMD to 256 bits:</p>
<h4 id="bit-integer-arithmetic"><strong>256-bit Integer
Arithmetic</strong></h4>
<pre class="assembly"><code>; AVX2 integer operations
avx2_integer_ops:
; Load 256-bit integer data
vmovdqa ymm0, [int_array_a]
vmovdqa ymm1, [int_array_b]
; Packed integer arithmetic
vpaddb ymm2, ymm0, ymm1 ; Add 32 bytes
vpaddw ymm3, ymm0, ymm1 ; Add 16 words
vpaddd ymm4, ymm0, ymm1 ; Add 8 dwords
vpaddq ymm5, ymm0, ymm1 ; Add 4 qwords
; Saturating arithmetic
vpaddsb ymm6, ymm0, ymm1 ; Add with signed saturation
vpaddusw ymm7, ymm0, ymm1 ; Add with unsigned saturation
; Multiplication
vpmullw ymm8, ymm0, ymm1 ; Multiply 16 words (low)
vpmulhw ymm9, ymm0, ymm1 ; Multiply 16 words (high)
vpmulld ymm10, ymm0, ymm1 ; Multiply 8 dwords (low)
; Min/Max
vpmaxsb ymm11, ymm0, ymm1 ; Max of signed bytes
vpminsw ymm12, ymm0, ymm1 ; Min of signed words
vpmaxud ymm13, ymm0, ymm1 ; Max of unsigned dwords</code></pre>
<h4 id="gather-operations"><strong>Gather Operations</strong></h4>
<p>AVX2’s gather instructions enable vectorized indirect memory
access:</p>
<pre class="assembly"><code>; Gather instructions - load from non-contiguous memory
avx2_gather:
; VPGATHERDD - Gather 32-bit ints using 32-bit indices
; dst[i] = mem[base + index[i] * scale]
lea rsi, [base_array]
vmovdqa ymm1, [indices] ; 8 x 32-bit indices
vpcmpeqd ymm2, ymm2, ymm2 ; All-ones mask
vpgatherdd ymm0, [rsi + ymm1*4], ymm2
; ymm0[i] = mem[rsi + ymm1[i]*4]
; ymm2 is zeroed after gather
; VPGATHERDQ - Gather 64-bit values using 32-bit indices
vmovdqa xmm3, [indices_32] ; 4 x 32-bit indices
vpcmpeqq ymm4, ymm4, ymm4 ; All-ones mask
vpgatherdq ymm5, [rsi + xmm3*8], ymm4
; VGATHERDPS - Gather single-precision floats
vmovaps ymm6, [float_indices]
vpcmpeqd ymm7, ymm7, ymm7
vgatherdps ymm8, [rsi + ymm6*4], ymm7
; Practical example: Indexed lookup table
lookup_table_gather:
lea rax, [lookup_table]
vmovdqa ymm0, [indices_8x] ; 8 indices
vpcmpeqd ymm1, ymm1, ymm1 ; Mask
vpgatherdd ymm2, [rax + ymm0*4], ymm1
vmovdqa [results], ymm2</code></pre>
<h4 id="variable-shifts"><strong>Variable Shifts</strong></h4>
<pre class="assembly"><code>; Per-element variable shifts
avx2_variable_shifts:
vmovdqa ymm0, [data_to_shift]
vmovdqa ymm1, [shift_counts]
; Variable logical shifts
vpsllvd ymm2, ymm0, ymm1 ; Left shift dwords
vpsrlvd ymm3, ymm0, ymm1 ; Right shift dwords
vpsllvq ymm4, ymm0, ymm1 ; Left shift qwords
vpsrlvq ymm5, ymm0, ymm1 ; Right shift qwords
; Variable arithmetic shift
vpsravd ymm6, ymm0, ymm1 ; Arithmetic right shift dwords
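; Bit manipulation
bit_manipulation:
; Bit field extract/deposit (BMI2): scalar, general-purpose registers only
pext rax, rbx, rcx ; Gather the bits of rbx selected by mask rcx into the low bits of rax
pdep rdx, rbx, rcx ; Scatter the low bits of rbx to the positions set in mask rcx</code></pre>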
<h4 id="cross-lane-permutation-1"><strong>Cross-Lane
Permutation</strong></h4>
<pre class="assembly"><code>; Full 256-bit permutation
avx2_permute:
; VPERMQ - Permute qwords across lanes
vpermq ymm1, ymm0, 0b10110001
; Each 2-bit field selects source qword
; VPERMD - Permute dwords with indices
vmovdqa ymm2, [permute_indices]
vpermd ymm3, ymm2, ymm0 ; ymm3[i] = ymm0[ymm2[i] &amp; 7]</code></pre>
<p>This chapter builds on the earlier chapters as follows:</p>
<ul>
<li><p><strong>Instruction encoding</strong>: builds directly on the
REX/VEX fundamentals from Chapter 2.</p></li>
<li><p><strong>State management and XCR0 enabling</strong>: complements
the Chapter 5 coverage of CPU state saving/restoring in exception
handling.</p></li>
<li><p><strong>YMM register handling,
<code>vzeroupper</code>/<code>vzeroall</code></strong>: critical for
avoiding the transition penalties mentioned briefly under compiler
patterns in Chapter 2, expanded here into practical sequences.</p></li>
<li><p><strong>FMA operations</strong>: adds the precision and
performance aspects that were missing from the SSE sections.</p></li>
<li><p><strong>Permutation and gather</strong>: new hardware
capabilities that rely on the memory addressing knowledge from
Chapter 3.</p></li>
<li><p><strong>Integer SIMD</strong>: fills the gap between
floating‑point AVX and the later AVX‑512 mask/ternary forms.</p></li>
</ul>
<hr />
<h2 id="chapter-10-avx-512-and-future-extensions"><strong>Chapter 10:
AVX-512 and Future Extensions</strong></h2>
<h3 id="avx-512-architecture-overview"><strong>10.1 AVX-512 Architecture
Overview</strong></h3>
<h4 id="introduction-to-avx-512"><strong>Introduction to
AVX-512</strong></h4>
<p>AVX-512 (introduced with Knights Landing 2016, Skylake-X 2017)
represents the most significant SIMD expansion in x86-64 history:</p>
<pre class="assembly"><code>; AVX-512 key features:
; 1. 512-bit ZMM registers (ZMM0-ZMM31)
; 2. 8 opmask registers (k0-k7) for predication
; 3. EVEX prefix encoding (4-byte)
; 4. Embedded rounding and broadcast
; 5. Scatter/gather enhancements
; 6. New instruction families (conflict detection, compress/expand)
; Register hierarchy:
; ZMM0[511:0] contains YMM0[255:0] contains XMM0[127:0]
; k0-k7: opmask registers (16 bits used by AVX-512F, 64 bits with AVX-512BW)
; k0 cannot be selected as a write mask: that encoding means "no masking"
avx512_basic_example:
; 512-bit operation with masking
vmovaps zmm0, [aligned_512_data]
vcmpps k1, zmm0, zmm1, 0x01 ; Compare, result in k1
vaddps zmm2{k1}, zmm0, zmm1 ; Masked add
; Only elements where k1[i]=1 are updated</code></pre>
<h4 id="evex-encoding-structure"><strong>EVEX Encoding
Structure</strong></h4>
<pre class="assembly"><code>; EVEX prefix format (4 bytes):
; Byte 0: 0x62
; Byte 1: P0 - R, X, B, R&#39;, mmmm fields
; Byte 2: P1 - W, vvvv, pp fields
; Byte 3: P2 - z, L&#39;L, b, V&#39;, aaa fields
; EVEX enables:
; - 32 vector registers (via R&#39; and V&#39; bits)
; - Opmask registers (aaa field)
; - Embedded broadcast (b bit)
; - Zeroing vs merging (z bit)
; - Rounding control (L&#39;L bits with b=1)
evex_encoding_examples:
; Static rounding mode embedded in instruction
vaddps zmm0, zmm1, zmm2, {rn-sae} ; Round to nearest
vaddps zmm0, zmm1, zmm2, {rd-sae} ; Round down
vaddps zmm0, zmm1, zmm2, {ru-sae} ; Round up
vaddps zmm0, zmm1, zmm2, {rz-sae} ; Round toward zero
; Broadcast from memory (NASM syntax: no "ptr" keyword)
vaddps zmm0, zmm1, [rax]{1to16} ; Broadcast one dword to 16 floats
vaddpd zmm0, zmm1, [rax]{1to8} ; Broadcast one qword to 8 doubles</code></pre>
<h4 id="opmask-registers"><strong>Opmask Registers</strong></h4>
<pre class="assembly"><code>; Opmask register operations
opmask_operations:
; Generate masks from comparisons
vcmpps k1, zmm0, zmm1, 0x00 ; k1 = (zmm0 == zmm1)
vcmpps k2, zmm0, zmm1, 0x01 ; k2 = (zmm0 &lt; zmm1)
; Mask logic operations
kandw k3, k1, k2 ; k3 = k1 &amp; k2
korw k4, k1, k2 ; k4 = k1 | k2
kxnorw k5, k1, k2 ; k5 = ~(k1 ^ k2)
knotw k6, k1 ; k6 = ~k1
; Mask register shifts
kshiftlw k7, k1, 3 ; Shift left by 3
kshiftrw k0, k1, 5 ; Note: k0 write allowed here
; Test and set flags
kortestw k1, k2 ; ZF=1 if (k1|k2) is all zeros, CF=1 if it is all ones
ktestw k1, k2 ; ZF=1 if (k1&amp;k2)==0, CF=1 if (~k1&amp;k2)==0 (AVX-512DQ)
; Merging vs Zeroing masking
masking_modes:
; Merging: preserve destination where mask=0
vaddps zmm0{k1}, zmm1, zmm2
; zmm0[i] = (k1[i]) ? (zmm1[i]+zmm2[i]) : zmm0[i]
; Zeroing: zero destination where mask=0
vaddps zmm0{k1}{z}, zmm1, zmm2
; zmm0[i] = (k1[i]) ? (zmm1[i]+zmm2[i]) : 0</code></pre>
<h3 id="avx-512-foundation-instructions"><strong>10.2 AVX-512 Foundation
Instructions</strong></h3>
<h4 id="bit-arithmetic-operations"><strong>512-bit Arithmetic
Operations</strong></h4>
<pre class="assembly"><code>; AVX-512F (Foundation) - Core operations
avx512_arithmetic:
; Load 512-bit data
vmovaps zmm0, [aligned_512_array] ; 16 floats
vmovaps zmm1, [aligned_512_array_b] ; 16 floats (second operand for the ps ops below)
; Arithmetic with embedded rounding
vaddps zmm2, zmm0, zmm1, {rn-sae}
vmulps zmm3, zmm0, zmm1, {rd-sae}
vfmadd213ps zmm4, zmm0, zmm1, {ru-sae}
; Reduction and approximation
vreduceps zmm5, zmm0, 0x08 ; x - round(x) per element (AVX-512DQ, not F)
vrcp14ps zmm6, zmm0 ; 14-bit reciprocal
vrsqrt14ps zmm7, zmm0 ; 14-bit reciprocal sqrt
; Min/max with SAE (Suppress All Exceptions)
vmaxps zmm8, zmm0, zmm1, {sae}
vminps zmm9, zmm0, zmm1, {sae}
; Integer operations
avx512_integer:
vmovdqa64 zmm0, [int64_array] ; 8 × 64-bit
vmovdqa32 zmm1, [int32_array] ; 16 × 32-bit
vpaddd zmm2, zmm0, zmm1
vpmuludq zmm3, zmm0, zmm1 ; Multiply unsigned
vpsllvq zmm4, zmm0, zmm1 ; Variable shift
; Conflict detection (AVX-512CD)
vpconflictd zmm5, zmm0 ; Find duplicate indices
vplzcntd zmm6, zmm0 ; Leading zero count</code></pre>
<h4 id="advanced-permutation"><strong>Advanced Permutation</strong></h4>
<pre class="assembly"><code>; Two-source permutation with indices
avx512_permute:
; VPERMI2PS/PD/D/Q - Permute using indices in zmm1
vpermi2ps zmm1, zmm0, zmm2
; zmm1[i] = select(zmm0, zmm2)[zmm1[i] &amp; 0x1F]
; VPERMT2PS/PD/D/Q - Indices in zmm1; the first table (zmm2) is overwritten
vpermt2ps zmm2, zmm1, zmm0
; zmm2[i] = select(zmm2_old, zmm0)[zmm1[i] &amp; 0x1F]
; VPERMPS/PD - Full cross-lane permutation
vpermps zmm3, zmm1, zmm0
; zmm3[i] = zmm0[zmm1[i] &amp; 0x0F]
; Compress and expand
vcompressps zmm4{k1}, zmm0 ; Pack masked elements
vexpandps zmm5{k1}, [mem] ; Expand to mask positions
; Ternary logic (VPTERNLOGD/Q)
ternary_logic:
; Perform arbitrary 3-input boolean function
vpternlogd zmm0, zmm1, zmm2, 0xE8
; Immediate encodes truth table for function
; 0xE8 = A&amp;B | A&amp;C | B&amp;C (majority function)
; Common patterns:
vpternlogd zmm3, zmm3, zmm3, 0xFF ; Set all ones
vpternlogd zmm4, zmm4, zmm4, 0x00 ; Clear to zero
vpternlogd zmm5, zmm6, zmm7, 0x96 ; XOR (A^B^C)</code></pre>
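<p>The immediate does not have to be worked out by hand. A common trick,
sketched here with hypothetical NASM macro names <code>TL_A</code>,
<code>TL_B</code> and <code>TL_C</code>, is to write the desired boolean
expression over three constants whose bits enumerate the truth-table
columns of the destination, first source and second source. The
assembler then folds the expression into the 8-bit immediate.</p>
<pre class="assembly"><code>; Truth-table columns: bit n of each constant is that input's value
; for table index n = (A &lt;&lt; 2) | (B &lt;&lt; 1) | C
%define TL_A 0xF0 ; Destination operand
%define TL_B 0xCC ; First source
%define TL_C 0xAA ; Second source
section .text
ternlog_examples:
; Majority: the expression evaluates to 0xE8
vpternlogd zmm0, zmm1, zmm2, (TL_A &amp; TL_B) | (TL_A &amp; TL_C) | (TL_B &amp; TL_C)
; Three-way XOR: evaluates to 0x96
vpternlogd zmm5, zmm6, zmm7, TL_A ^ TL_B ^ TL_C
; Bitwise select (A ? B : C): evaluates to 0xCA
vpternlogd zmm0, zmm1, zmm2, (TL_A &amp; TL_B) | (~TL_A &amp; TL_C)
ret</code></pre>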
<h4 id="scatter-operations"><strong>Scatter Operations</strong></h4>
<pre class="assembly"><code>; Scatter stores - opposite of gather
avx512_scatter:
; VPSCATTERDD - Scatter 32-bit values
lea rax, [base_array]
vmovdqa32 zmm0, [values_to_scatter]
vmovdqa32 zmm1, [scatter_indices]
kxnorw k1, k1, k1 ; All-ones mask
vpscatterdd [rax + zmm1*4]{k1}, zmm0
; mem[rax + zmm1[i]*4] = zmm0[i]
; VPSCATTERDQ - Scatter 64-bit values
vmovdqa64 zmm2, [qword_values]
vmovdqa32 ymm3, [dword_indices]
mov eax, 0xFF
kmovw k2, eax ; 8-element mask (KMOVW has no immediate form)
vpscatterdq [rax + ymm3*8]{k2}, zmm2
; Conflict-free scatter pattern
vpconflictd zmm4, zmm1 ; Check for conflicts
vptestmd k3, zmm4, zmm4 ; Create conflict mask
; Handle conflicts with sequential stores
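; Practical scatter example: Histogram update
; Correct only if bin_indices holds no duplicates; otherwise use
; vpconflictd (above) to serialize the colliding lanes
histogram_scatter:
vmovdqa32 zmm0, [bin_indices] ; Which bins
vmovdqa32 zmm1, [increments] ; How much to add
kxnorw k1, k1, k1 ; All-ones mask (gather clears it as elements complete)
vpgatherdd zmm2{k1}, [histogram + zmm0*4] ; Gather current
vpaddd zmm2, zmm2, zmm1 ; Add increments
kxnorw k2, k2, k2 ; Fresh mask: scatter also clears its mask
vpscatterdd [histogram + zmm0*4]{k2}, zmm2 ; Scatter back</code></pre>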
<h3 id="avx-512-extension-sets"><strong>10.3 AVX-512 Extension
Sets</strong></h3>
<h4 id="avx-512bw-byte-and-word"><strong>AVX-512BW (Byte and
Word)</strong></h4>
<pre class="assembly"><code>; Byte and word operations on ZMM registers
avx512bw_operations:
; 64 byte operations
vmovdqu8 zmm0, [byte_array]
vpaddb zmm1, zmm0, zmm0 ; 64 parallel adds
vpcmpub k1, zmm0, zmm1, 0x02 ; Unsigned compare
; 32 word operations
vmovdqu16 zmm2, [word_array]
vpmullw zmm3, zmm2, zmm2 ; 32 multiplies
vpacksswb zmm4, zmm2, zmm3 ; Pack to bytes
; Mask operations for bytes/words
kunpckdq k2, k1, k1 ; Unpack 32→64 bit mask
kaddb k3, k1, k2 ; Byte mask add (AVX-512DQ)
; String operations
vpcmpb k4, zmm0, zmm1, 0x00 ; Per-byte equality compare
vpcompressb zmm5{k4}, zmm0 ; Compress matching bytes (AVX-512_VBMI2)</code></pre>
<h4 id="avx-512dq-doubleword-and-quadword"><strong>AVX-512DQ (Doubleword
and Quadword)</strong></h4>
<pre class="assembly"><code>; Enhanced DQ operations
avx512dq_operations:
; Floating-point to integer conversions
vcvttpd2qq zmm0, zmm1 ; Double to quad with truncation
vcvtqq2pd zmm2, zmm0 ; Quad to double
; Logical operations on FP data
vandpd zmm3, zmm1, zmm2 ; AND on double data
vxorpd zmm4, zmm1, zmm2 ; XOR on double data
; Range restriction
vrangeps zmm5, zmm0, zmm1, 0x08
; Flexible min/max/clamp operations
; Reduction with masking
vreducepd zmm6{k1}, zmm0, 0x04
; Extract/insert 128/256-bit chunks
vextractf64x2 xmm7, zmm0, 2 ; Extract 2 doubles
vinsertf64x4 zmm8, zmm1, ymm2, 1 ; Insert 4 doubles</code></pre>
<h4
id="avx-512vnni-vector-neural-network-instructions"><strong>AVX-512VNNI
(Vector Neural Network Instructions)</strong></h4>
<pre class="assembly"><code>; VNNI - Optimized for deep learning inference
avx512_vnni:
; Dot product of bytes with dword accumulation
vpdpbusd zmm0, zmm1, zmm2
; zmm0[i] += sum(zmm1.byte[4i+j] * zmm2.byte[4i+j])
; for j in 0..3, unsigned × signed
; Word dot product with dword accumulation
vpdpwssd zmm3, zmm4, zmm5
; zmm3[i] += zmm4.word[2i] * zmm5.word[2i]
; + zmm4.word[2i+1] * zmm5.word[2i+1]
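; Optimized convolution kernel
; Assumes kernel_size is an assemble-time constant and a multiple of 64
vnni_convolution:
vpxord zmm16, zmm16, zmm16 ; Clear accumulator (vzeroall only clears zmm0-zmm15)
vpxord zmm31, zmm31, zmm31 ; Zero vector for ReLU
xor ecx, ecx ; Byte offset into input/weights
.loop:
vmovdqu8 zmm0, [input + rcx]
vmovdqu8 zmm1, [weights + rcx]
vpdpbusd zmm16, zmm0, zmm1 ; Accumulate (input unsigned, weights signed)
add rcx, 64
cmp rcx, kernel_size
jb .loop
; Apply bias and activation
vpaddd zmm16, zmm16, [bias]
vpmaxsd zmm16, zmm16, zmm31 ; ReLU (zmm31 = 0)</code></pre>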
<h4 id="avx-512ifma-integer-fused-multiply-add"><strong>AVX-512IFMA
(Integer Fused Multiply-Add)</strong></h4>
<pre class="assembly"><code>; 52-bit integer FMA operations
avx512_ifma:
; VPMADD52LUQ - Multiply and add low 52 bits
vpmadd52luq zmm0, zmm1, zmm2
; zmm0[i] += (zmm1[i] * zmm2[i]) &amp; ((1&lt;&lt;52)-1)
; VPMADD52HUQ - Multiply and add high 52 bits
vpmadd52huq zmm3, zmm1, zmm2
; zmm3[i] += (zmm1[i] * zmm2[i]) &gt;&gt; 52
; Large integer multiplication (up to 104 bits)
big_int_multiply:
vpmadd52luq zmm0, zmm4, zmm5 ; Low 52 bits
vpmadd52huq zmm1, zmm4, zmm5 ; High 52 bits
; Combine zmm0 and zmm1 for full result</code></pre>
<h3 id="avx-512-optimization-patterns"><strong>10.4 AVX-512 Optimization
Patterns</strong></h3>
<h4 id="conditional-execution-with-masks"><strong>Conditional Execution
with Masks</strong></h4>
<pre class="assembly"><code>; Replace branches with masked operations
conditional_without_branch:
; Traditional branching code:
; if (a[i] &gt; b[i]) c[i] = a[i] + b[i]
; else c[i] = a[i] - b[i]
; AVX-512 branchless version:
vmovaps zmm0, [array_a] ; Load a[]
vmovaps zmm1, [array_b] ; Load b[]
vcmpps k1, zmm0, zmm1, 0x0E ; k1 = (a &gt; b), predicate GT_OS
knotw k2, k1 ; k2 = !(a &gt; b); masks cannot be inverted in the operand
vaddps zmm2{k1}, zmm0, zmm1 ; Add where k1=1 (merge)
vsubps zmm2{k2}, zmm0, zmm1 ; Sub where k1=0 (merge)
vmovaps [array_c], zmm2 ; Store result</code></pre>
<p>This eliminates branch misprediction penalties — ideal for
data-parallel IF/ELSE patterns.</p>
<hr />
<h4 id="vectorizing-loops"><strong>Vectorizing Loops</strong></h4>
<p>Using 512-bit registers:</p>
<pre class="assembly"><code>vectorized_loop:
mov rsi, ptrA
mov rdi, ptrB
mov rcx, ptrC
mov rdx, length
.loop:
vmovaps zmm0, [rsi]
vmovaps zmm1, [rdi]
vfmadd132ps zmm0, zmm1, [coeff] ; zmm0 = zmm0 * coeff + zmm1 (coeff: 16 floats)
vmovaps [rcx], zmm0
add rsi, 64
add rdi, 64
add rcx, 64
sub rdx, 16
jg .loop</code></pre>
<p>Processes 16 single-precision floats per iteration.</p>
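<p>The loop above only works when the length is a multiple of 16. One
way to handle an arbitrary length, sketched here under assumptions, is
to finish with a single masked iteration. The routine name
<code>add_arrays_masked_tail</code> and its signature are hypothetical;
it uses NASM syntax and the System V AMD64 calling convention and
requires AVX-512F.</p>
<pre class="assembly"><code>; void add_arrays_masked_tail(float *dst, const float *a, const float *b, size_t n)
; RDI = dst, RSI = a, RDX = b, RCX = n
global add_arrays_masked_tail
section .text
add_arrays_masked_tail:
xor eax, eax ; RAX = index of the next element
.full:
lea r8, [rax + 16]
cmp r8, rcx
ja .tail ; Fewer than 16 elements left
vmovups zmm0, [rsi + rax*4]
vaddps zmm0, zmm0, [rdx + rax*4]
vmovups [rdi + rax*4], zmm0
mov rax, r8
jmp .full
.tail:
sub rcx, rax ; RCX = remaining elements (0..15)
jz .done
mov r8d, 1
shl r8d, cl ; 1 shifted left by the remainder
dec r8d ; Low "remainder" bits set
kmovw k1, r8d
vmovups zmm0{k1}{z}, [rsi + rax*4] ; Masked-out lanes are not read
vmovups zmm1{k1}{z}, [rdx + rax*4]
vaddps zmm0, zmm0, zmm1
vmovups [rdi + rax*4]{k1}, zmm0 ; Masked-out lanes are not written
.done:
vzeroupper
ret</code></pre>
<p>Masked loads and stores suppress faults for disabled lanes, so the
tail iteration does not touch memory past the end of the arrays.</p>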
<hr />
<h4 id="reduction-strategies"><strong>Reduction Strategies</strong></h4>
<p>For sums, dot products, etc.:</p>
<pre class="assembly"><code>horizontal_sum512:
vaddps zmm0, zmm0, zmm1
vaddps zmm0, zmm0, zmm2
vaddps zmm0, zmm0, zmm3
; Collapse lanes
vextractf64x4 ymm1, zmm0, 1
vaddps ymm0, ymm0, ymm1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
; Horizontal add of the final 4 floats in xmm0 (VEX forms avoid SSE/AVX transitions)
vhaddps xmm0, xmm0, xmm0
vhaddps xmm0, xmm0, xmm0
vmovss [result], xmm0</code></pre>
<hr />
<h4 id="scatter-gather-performance"><strong>Scatter &amp; Gather
Performance</strong></h4>
<ul>
<li><p><strong>Gather</strong>: Load non-contiguous memory — costly if
cache misses.</p></li>
<li><p><strong>Scatter</strong>: Write to indexed locations — risk of
write-combining penalties.</p></li>
<li><p><em>Optimization</em>: Use contiguous access where possible, or
reorder data so that accesses become contiguous; software prefetch can
hide some of the remaining latency.</p></li>
</ul>
<hr />
<h4 id="evex-broadcast-for-loop-invariants"><strong>EVEX Broadcast for
Loop Invariants</strong></h4>
<p>Example: applying a scalar multiply:</p>
<pre class="assembly"><code> vbroadcastss zmm1, [scalar] ; Fill all lanes with scalar
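vmulps zmm0, zmm0, zmm1
; Alternative: fold the broadcast into the multiply (EVEX embedded broadcast)
; vmulps zmm0, zmm0, [scalar]{1to16}</code></pre>
<p>Hoisting the broadcast out of the loop avoids reloading the scalar on
every iteration; the embedded-broadcast form saves a register instead.</p>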
<hr />
<h3 id="practical-considerations-future-trends"><strong>10.5 Practical
Considerations &amp; Future Trends</strong></h3>
<ol type="1">
<li><strong>State Save/Restore:</strong>
<ul>
<li>AVX-512 state uses <code>XSAVE</code>/<code>XRSTOR</code> with XCR0
bits 5–7 enabling the opmask and ZMM state.</li>
<li>The XSAVE area exceeds 2 KB, which adds to context-switch cost.</li>
</ul></li>
<li><strong>Downclocking:</strong>
<ul>
<li>AVX-512 heavy workloads may lower CPU frequency — balance
scalar/SIMD mix.</li>
</ul></li>
<li><strong>Hybrid Code Paths:</strong>
<ul>
<li>Detect CPU capabilities via CPUID (<code>leaf 7</code>, EBX bit 16
for AVX-512F), confirm OS support with <code>XGETBV</code>, and dispatch
optimized kernels.</li>
</ul></li>
<li><strong>Upcoming Extensions:</strong>
<ul>
<li><strong>AMX (Tile register) matrices</strong></li>
<li><strong>AVX10</strong> (possible unified 512-bit and 256-bit
subset)</li>
<li><strong>Enhanced Memory Operations</strong> for AI workloads.</li>
</ul></li>
</ol>
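<p>The detection step for hybrid code paths can be sketched as follows.
This is a minimal example under assumptions, not a complete dispatcher:
the routine name <code>avx512f_usable</code> is hypothetical, and it only
tests the Foundation subset (extensions such as BW, DQ or VNNI have their
own CPUID bits).</p>
<pre class="assembly"><code>; int avx512f_usable(void) - returns 1 in EAX when AVX-512F may be used
global avx512f_usable
section .text
avx512f_usable:
push rbx ; CPUID overwrites RBX, which is callee-saved
xor eax, eax
cpuid ; EAX = highest standard leaf
cmp eax, 7
jb .no
mov eax, 1
cpuid
bt ecx, 27 ; OSXSAVE: XGETBV is available
jnc .no
xor ecx, ecx
xgetbv ; EDX:EAX = XCR0
and eax, 0xE6 ; SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM
cmp eax, 0xE6
jne .no
mov eax, 7
xor ecx, ecx
cpuid
bt ebx, 16 ; CPUID.(EAX=7,ECX=0):EBX bit 16 = AVX512F
jnc .no
mov eax, 1
pop rbx
ret
.no:
xor eax, eax
pop rbx
ret</code></pre>
<p>A dispatcher would typically call this once at startup and store a
function pointer to the selected kernel.</p>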
<hr />
<h2
id="chapter-11-system-level-architecture-and-protection"><strong>Chapter
11: System-Level Architecture and Protection</strong></h2>
<h3 id="privilege-levels-and-protection-rings"><strong>11.1 Privilege
Levels and Protection Rings</strong></h3>
<h4 id="x86-64-protection-model"><strong>x86-64 Protection
Model</strong></h4>
<p>The x86-64 architecture implements a hierarchical protection
mechanism through privilege levels:</p>
<pre class="assembly"><code>; Protection rings (0-3)
; Ring 0: Kernel/Supervisor mode - full hardware access
; Ring 1: Device drivers (rarely used in modern OS)
; Ring 2: Device drivers (rarely used in modern OS)
; Ring 3: User mode - restricted access
; Current Privilege Level (CPL) stored in CS.RPL
; CPL checks occur on:
; - Segment loads
; - Privileged instruction execution
; - I/O port access
; - Control register access
check_privilege_level:
mov ax, cs
and ax, 3 ; Extract CPL from CS selector
test ax, ax
jz kernel_mode ; CPL=0
; User mode (CPL=3)
; Cannot execute: HLT, LGDT, LIDT, MOV CRx, etc.</code></pre>
<h4 id="segment-descriptors-and-gates"><strong>Segment Descriptors and
Gates</strong></h4>
<pre class="assembly"><code>; 64-bit Segment Descriptor Format (8 bytes)
; Bits 63-56: Base[31:24]
; Bits 55: G (Granularity)
; Bits 54: D/B (Default operation size)
; Bits 53: L (64-bit code segment)
; Bits 52: AVL (Available)
; Bits 51-48: Limit[19:16]
; Bits 47: P (Present)
; Bits 46-45: DPL (Descriptor Privilege Level)
; Bits 44: S (System/Code/Data)
; Bits 43-40: Type
; Bits 39-16: Base[23:0]
; Bits 15-0: Limit[15:0]
; Gate Descriptors (Call, Interrupt, Trap)
; 128-bit structure in 64-bit mode
gate_descriptor_example:
; Interrupt Gate Descriptor (16 bytes)
dq gate_low_qword ; Offset[15:0], Selector, IST, Type/DPL/P, Offset[31:16]
dq gate_high_qword ; Offset[63:32], reserved
; System Segment Descriptors (TSS, LDT)
tss_descriptor:
; 16-byte TSS descriptor in GDT
dq tss_low_qword ; Limit, Base[31:0], Type, DPL, P
dq tss_high_qword ; Base[63:32], reserved</code></pre>
<h4 id="global-and-local-descriptor-tables"><strong>Global and Local
Descriptor Tables</strong></h4>
<pre class="assembly"><code>; GDT (Global Descriptor Table)
gdt_setup:
; Minimal 64-bit GDT
gdt_start:
dq 0 ; Null descriptor
gdt_code_64:
dq 0x00209A0000000000 ; 64-bit code, DPL=0
gdt_data:
dq 0x0000920000000000 ; Data segment, DPL=0
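; User data precedes user code: SYSRET loads SS = base+8 and CS = base+16
gdt_user_data:
dq 0x0000F20000000000 ; Data segment, DPL=3 (selector 0x18)
gdt_user_code_64:
dq 0x0020FA0000000000 ; 64-bit code, DPL=3 (selector 0x20)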
gdt_tss:
dq 0 ; TSS descriptor (16 bytes)
dq 0
gdt_end:
gdt_ptr:
dw gdt_end - gdt_start - 1 ; Limit
dq gdt_start ; Base
; Load GDT
lgdt [gdt_ptr]
; IDT (Interrupt Descriptor Table)
idt_setup:
; Each entry is 16 bytes in 64-bit mode
idt_start:
times 256 dq 0, 0 ; 256 interrupt gates
idt_end:
idt_ptr:
dw idt_end - idt_start - 1
dq idt_start
lidt [idt_ptr]</code></pre>
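<p>The magic constants in the GDT can be derived from the descriptor
layout instead of being written by hand. The following sketch uses a
hypothetical NASM macro, <code>SEG_DESC</code>, that packs the fields at
assemble time; it only works for constant base and limit values, not for
relocatable symbols.</p>
<pre class="assembly"><code>; Hypothetical helper macro: pack the fields from the descriptor layout
; access = P/DPL/S/Type byte (bits 47:40), flags = G/D-B/L/AVL nibble (bits 55:52)
%define SEG_DESC(base, limit, access, flags) \
(((limit) &amp; 0xFFFF) | \
(((base) &amp; 0xFFFFFF) &lt;&lt; 16) | \
(((access) &amp; 0xFF) &lt;&lt; 40) | \
((((limit) &gt;&gt; 16) &amp; 0xF) &lt;&lt; 48) | \
(((flags) &amp; 0xF) &lt;&lt; 52) | \
((((base) &gt;&gt; 24) &amp; 0xFF) &lt;&lt; 56))

; Assemble-time check against the hand-written kernel code descriptor
%if SEG_DESC(0, 0, 0x9A, 0x2) != 0x00209A0000000000
%error "SEG_DESC does not reproduce the kernel code descriptor"
%endif

section .data
gdt_from_fields:
dq 0 ; Null descriptor
dq SEG_DESC(0, 0, 0x9A, 0x2) ; 64-bit code, DPL=0
dq SEG_DESC(0, 0, 0x92, 0x0) ; Data, DPL=0
dq SEG_DESC(0, 0, 0xF2, 0x0) ; Data, DPL=3
dq SEG_DESC(0, 0, 0xFA, 0x2) ; 64-bit code, DPL=3</code></pre>
<p>In the access byte, 0x9A is P=1, DPL=0, S=1, Type=1010 (execute/read
code); changing DPL to 3 gives 0xFA. The flags value 0x2 sets only the L
bit.</p>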
<h3 id="control-registers-and-system-structures"><strong>11.2 Control
Registers and System Structures</strong></h3>
<h4 id="control-register-programming"><strong>Control Register
Programming</strong></h4>
<pre class="assembly"><code>; CR0 - System Control
cr0_bits:
; Bit 0: PE (Protected Mode Enable)
; Bit 1: MP (Monitor Coprocessor)
; Bit 2: EM (Emulation)
; Bit 3: TS (Task Switched)
; Bit 4: ET (Extension Type)
; Bit 5: NE (Numeric Error)
; Bit 16: WP (Write Protect)
; Bit 18: AM (Alignment Mask)
; Bit 29: NW (Not Write-through)
; Bit 30: CD (Cache Disable)
; Bit 31: PG (Paging)
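; Long mode entry also needs CR3, CR4.PAE and EFER.LME set before PG
mov rax, cr0
or eax, 0x80000001 ; Enable paging and protection; 32-bit OR because
; 0x80000001 is not encodable as a sign-extended imm32 (CR0[63:32] is 0)
mov cr0, rax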
; CR3 - Page Directory Base
cr3_management:
; Bits 51:12 - Physical address of PML4
; Bit 3: PWT (Page-level Write-Through)
; Bit 4: PCD (Page-level Cache Disable)
mov rax, pml4_table
mov cr3, rax ; Load new page tables
; CR4 - Architecture Extensions
cr4_features:
; Bit 5: PAE (Physical Address Extension)
; Bit 7: PGE (Page Global Enable)
; Bit 9: OSFXSR (OS FXSAVE/FXRSTOR support)
; Bit 10: OSXMMEXCPT (OS XMM exceptions)
; Bit 18: OSXSAVE (XSAVE enabled)
; Bit 20: SMEP (Supervisor Mode Execution Prevention)
; Bit 21: SMAP (Supervisor Mode Access Prevention)
mov rax, cr4
or rax, 0x003406A0 ; PAE, PGE, OSFXSR, OSXMMEXCPT, OSXSAVE, SMEP, SMAP
; Setting a bit the CPU does not support raises #GP: check CPUID first
mov cr4, rax</code></pre>
<h4 id="model-specific-registers-msrs-1"><strong>Model-Specific
Registers (MSRs)</strong></h4>
<pre class="assembly"><code>; MSR Access via RDMSR/WRMSR
msr_operations:
; IA32_EFER (0xC0000080) - Extended Features
mov ecx, 0xC0000080
rdmsr ; Read into EDX:EAX
or eax, 0x100 ; Set LME (Long Mode Enable)
or eax, 0x800 ; Set NXE (No-Execute Enable)
wrmsr
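; IA32_STAR (0xC0000081) - SYSCALL/SYSRET segment selectors
; Assumes GDT order: 0x08 kernel code, 0x10 kernel data,
; 0x18 user data, 0x20 user code (64-bit)
mov ecx, 0xC0000081
mov edx, 0x00130008 ; [63:48] SYSRET base 0x10 | RPL 3, [47:32] SYSCALL CS 0x08
mov eax, 0 ; Low 32 bits: legacy-mode SYSCALL EIP, unused here
wrmsr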
; IA32_LSTAR (0xC0000082) - Long mode SYSCALL target
mov ecx, 0xC0000082
mov rax, syscall_handler
mov rdx, rax
shr rdx, 32
wrmsr
; FS/GS Base MSRs
mov ecx, 0xC0000100 ; IA32_FS_BASE
mov rax, thread_local_storage
mov rdx, rax
shr rdx, 32
wrmsr
mov ecx, 0xC0000101 ; IA32_GS_BASE
wrmsr ; Set GS base (EDX:EAX must hold the desired base)
mov ecx, 0xC0000102 ; IA32_KERNEL_GS_BASE
wrmsr ; Value exchanged with the GS base by SWAPGS</code></pre>
<h4 id="task-state-segment-tss"><strong>Task State Segment
(TSS)</strong></h4>
<pre class="assembly"><code>; 64-bit TSS Structure (104 bytes minimum)
struc TSS64
.reserved0 resd 1 ; Reserved
.rsp0 resq 1 ; Ring 0 stack pointer
.rsp1 resq 1 ; Ring 1 stack pointer
.rsp2 resq 1 ; Ring 2 stack pointer
.reserved1 resq 1
.ist1 resq 1 ; Interrupt Stack Table 1
.ist2 resq 1 ; IST 2
.ist3 resq 1 ; IST 3
.ist4 resq 1 ; IST 4
.ist5 resq 1 ; IST 5
.ist6 resq 1 ; IST 6
.ist7 resq 1 ; IST 7
.reserved2 resq 1
.reserved3 resw 1
.iopb_offset resw 1 ; I/O Permission Bitmap offset
endstruc
; TSS Setup
setup_tss:
; Initialize TSS
mov rdi, tss_base
xor rax, rax
mov rcx, 104/8
rep stosq ; Clear TSS
; Set ring 0 stack
mov rax, kernel_stack_top
mov [tss_base + TSS64.rsp0], rax
; Set IST entries for critical interrupts
mov rax, nmi_stack_top
mov [tss_base + TSS64.ist1], rax
mov rax, df_stack_top
mov [tss_base + TSS64.ist2], rax
; Load TSS
mov ax, tss_selector
ltr ax</code></pre>
<h3 id="interrupt-and-exception-handling"><strong>11.3 Interrupt and
Exception Handling</strong></h3>
<h4 id="interrupt-descriptor-table-management"><strong>Interrupt
Descriptor Table Management</strong></h4>
<pre class="assembly"><code>; IDT Gate Types
; 0x8E: Interrupt Gate (disables interrupts)
; 0x8F: Trap Gate (leaves interrupts enabled)
; DPL in bits 6-5 of type byte
; Create IDT entry
create_idt_entry:
; Input: RDI = handler address, RSI = selector, RDX = type
mov rax, rdi
mov rcx, rdi ; RCX is caller-saved, unlike RBX
shr rcx, 16
; Entry structure:
mov [idt_entry], ax ; Offset[15:0]
mov [idt_entry+2], si ; Selector
mov byte [idt_entry+4], 0 ; IST index (0 = none)
mov [idt_entry+5], dl ; Type and attributes
mov [idt_entry+6], cx ; Offset[31:16]
shr rax, 32
mov [idt_entry+8], eax ; Offset[63:32]
mov dword [idt_entry+12], 0 ; Reserved
ret
; Exception handlers with error codes
exception_with_error_code:
; CPU pushes: SS, RSP, RFLAGS, CS, RIP, Error Code
push rax ; Save registers
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
push r8
push r9
push r10
push r11
mov rdi, [rsp + 88] ; Error code
mov rsi, [rsp + 96] ; RIP
call handle_page_fault
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
pop rax
add rsp, 8 ; Remove error code
iretq</code></pre>
<h4 id="system-call-mechanisms"><strong>System Call
Mechanisms</strong></h4>
<pre class="assembly"><code>; SYSCALL/SYSRET (AMD64 fast system call)
syscall_setup:
; Set SYSCALL entry point in LSTAR
mov ecx, 0xC0000082
mov rax, syscall_entry
mov rdx, rax
shr rdx, 32
wrmsr
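; Set SYSCALL/SYSRET CS/SS in STAR
; (assumes GDT: 0x08 kernel code, 0x10 kernel data, 0x18 user data, 0x20 user code)
mov ecx, 0xC0000081
mov edx, 0x00130008 ; SYSRET base (0x10 | RPL 3) | SYSCALL CS (0x08)
xor eax, eax
wrmsr
; Set SYSCALL flags mask in SFMASK
mov ecx, 0xC0000084
mov eax, 0x47700 ; Clear TF, IF, DF, IOPL, NT, AC on entry
xor edx, edx
wrmsr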
syscall_entry:
; SYSCALL: RCX=RIP, R11=RFLAGS
; RAX=syscall number, RDI/RSI/RDX/R10/R8/R9=args
swapgs ; Switch to kernel GS
mov [gs:saved_user_rsp], rsp
mov rsp, [gs:kernel_stack]
push rcx ; Save user RIP
push r11 ; Save user RFLAGS
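; Dispatch system call
cmp rax, max_syscall
ja invalid_syscall
mov rcx, r10 ; 4th argument: the C ABI expects RCX, which SYSCALL used for RIP
lea r11, [syscall_table] ; R11 is already saved; user RBX stays intact
call [r11 + rax*8]
pop r11 ; Restore RFLAGS
pop rcx ; Restore RIP
mov rsp, [gs:saved_user_rsp]
swapgs
o64 sysret ; 64-bit SYSRET (NASM spelling; GAS uses sysretq)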
; SYSENTER/SYSEXIT (Intel fast system call - legacy)
sysenter_setup:
; Less commonly used in 64-bit mode
mov ecx, 0x174 ; IA32_SYSENTER_CS
mov eax, kernel_cs
xor edx, edx
wrmsr
mov ecx, 0x175 ; IA32_SYSENTER_ESP
mov rax, kernel_stack
mov rdx, rax
shr rdx, 32
wrmsr
mov ecx, 0x176 ; IA32_SYSENTER_EIP
mov rax, sysenter_entry
mov rdx, rax
shr rdx, 32
wrmsr</code></pre>
<h3 id="memory-protection-mechanisms"><strong>11.4 Memory Protection
Mechanisms</strong></h3>
<h4 id="page-table-protection-attributes"><strong>Page Table Protection
Attributes</strong></h4>
<pre class="assembly"><code>; Page Table Entry (PTE) Protection Bits
pte_protection_bits:
; Bit 0: P (Present)
; Bit 1: R/W (Read/Write)
; Bit 2: U/S (User/Supervisor)
; Bit 3: PWT (Write-Through)
; Bit 4: PCD (Cache Disable)
; Bit 5: A (Accessed)
; Bit 6: D (Dirty)
; Bit 7: PAT (Page Attribute Table)
; Bit 8: G (Global)
; Bit 63: NX (No-Execute)
; Set up protected page
setup_protected_page:
mov rax, physical_address
or rax, 0x03 ; Present, R/W
bts rax, 63 ; Set NX bit (OR cannot take a 64-bit immediate)
mov [pte_entry], rax
; SMEP (Supervisor Mode Execution Prevention)
enable_smep:
mov rax, cr4
or rax, (1 &lt;&lt; 20) ; Set SMEP bit
mov cr4, rax
; Now kernel cannot execute user pages
; SMAP (Supervisor Mode Access Prevention)
enable_smap:
mov rax, cr4
or rax, (1 &lt;&lt; 21) ; Set SMAP bit
mov cr4, rax
; Kernel needs STAC/CLAC to access user memory
; Accessing user memory with SMAP
kernel_copy_from_user:
stac ; Allow user access
mov rax, [user_ptr] ; Read through the user pointer
clac ; Close the access window again
ret</code></pre>
<hr />
<p><strong>11.1 Privilege Levels and Rings</strong><br />
We start with the CPU’s hierarchical protection model—rings 0–3—with
<code>CPL</code> taken from the CS selector and enforced for privileged
instructions, control register writes, and I/O. The chapter lays out how
segment descriptors encode DPL and type, covering code/data vs. system
segments (TSS, LDT). Both GDT and IDT setup sequences are shown,
including minimal 64‑bit GDT entries, selector values, and loading with
<code>LGDT</code>/<code>LIDT</code>.</p>
<p><strong>11.2 Control Registers and System Structures</strong><br />
We walk through CR0, CR3, and CR4 bitfields in detail (PE, PG, WP; PWT,
PCD; PAE, SMEP, SMAP, OSXSAVE) and show how to enable long mode and
paging. MSRs are programmed via <code>RDMSR</code>/<code>WRMSR</code>,
including <code>IA32_EFER</code> for LME/NXE,
<code>IA32_STAR</code>/<code>LSTAR</code> for SYSCALL, and FS/GS base
for TLS. A full 64‑bit <code>TSS</code> layout with RSP0–RSP2, IST1–IST7
is given, along with clearing, stack pointer initialization, and loading
via <code>LTR</code>.</p>
<p><strong>11.3 Interrupt and Exception Handling</strong><br />
IDT entry format in 64‑bit mode is documented (Interrupt Gate vs. Trap
Gate attributes) with a helper to fill entries given a handler,
selector, and type. There’s a complete handler skeleton for exceptions
with error codes, preserving registers, retrieving the error code and
RIP, and returning with <code>IRETQ</code>. We cover both AMD64
SYSCALL/SYSRET fast paths — STAR/LSTAR/SFMASK setup — and Intel’s
SYSENTER/SYSEXIT legacy setup.</p>
<p><strong>11.4 Memory Protection Mechanisms</strong><br />
PTE bitfields are defined (Present, R/W, U/S, PAT, Global, NX) and
examples given for creating a no‑execute page. We also walk through
enabling Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode
Access Prevention (SMAP) via CR4, plus the use of
<code>STAC</code>/<code>CLAC</code> when the kernel needs to touch user
pointers.</p>
<hr />
<p>This chapter’s code samples are low‑level and complete enough to
assemble for experimentation, bridging CPU architecture control with OS
privilege enforcement. It’s the natural continuation after Ch.10’s
EVEX/ZMM vector work, moving firmly into the kernel and protection
space.</p>
<hr />
<h2 id="chapter-12-virtual-memory-and-paging-mechanisms"><strong>Chapter
12: Virtual Memory and Paging Mechanisms</strong></h2>
<h3 id="x86-64-paging-architecture"><strong>12.1 x86-64 Paging
Architecture</strong></h3>
<h4 id="four-level-page-tables-pml4"><strong>Four-Level Page Tables
(PML4)</strong></h4>
<p>The x86-64 architecture uses a 4-level hierarchical paging structure
to translate 48-bit virtual addresses:</p>
<pre class="assembly"><code>; Virtual Address Layout (48-bit canonical)
; Bits 47-39: PML4 index (9 bits, 512 entries)
; Bits 38-30: PDPT index (9 bits, 512 entries)
; Bits 29-21: PD index (9 bits, 512 entries)
; Bits 20-12: PT index (9 bits, 512 entries)
; Bits 11-0: Page offset (12 bits, 4KB)
; Page Table Entry Format (8 bytes)
PTE_PRESENT equ 0x001 ; Page present
PTE_WRITABLE equ 0x002 ; Read/Write
PTE_USER equ 0x004 ; User accessible
PTE_PWT equ 0x008 ; Write-through
PTE_PCD equ 0x010 ; Cache disable
PTE_ACCESSED equ 0x020 ; Accessed flag
PTE_DIRTY equ 0x040 ; Dirty flag
PTE_LARGE equ 0x080 ; Large page (2MB/1GB)
PTE_GLOBAL equ 0x100 ; Global page
PTE_NX equ 0x8000000000000000 ; No-execute
; Create page table hierarchy
create_page_tables:
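; Zero the PML4, PDPT, and PD (each 4KB and 4KB aligned);
; stale entries in any of them would be treated as live mappings
xor rax, rax
mov rdi, pml4_base
mov rcx, 512
rep stosq ; Clear PML4
mov rdi, pdpt_base
mov rcx, 512
rep stosq ; Clear PDPT
mov rdi, pd_base
mov rcx, 512
rep stosq ; Clear PD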
; Map first 2MB using 4KB pages
mov rax, pdpt_base
or rax, PTE_PRESENT | PTE_WRITABLE
mov [pml4_base], rax
mov rax, pd_base
or rax, PTE_PRESENT | PTE_WRITABLE
mov [pdpt_base], rax
mov rax, pt_base
or rax, PTE_PRESENT | PTE_WRITABLE
mov [pd_base], rax
; Fill page table with 4KB pages
mov rcx, 512
xor rbx, rbx ; Physical address
mov rdi, pt_base
.fill_pt:
mov rax, rbx
or rax, PTE_PRESENT | PTE_WRITABLE
stosq
add rbx, 0x1000 ; Next 4KB page
loop .fill_pt
; Load page tables
mov rax, pml4_base
mov cr3, rax</code></pre>
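<p>As a small illustration of the index layout listed at the top of the
block above, the routine below splits a 48-bit virtual address into its
four table indices and page offset. It is a sketch only: the name
<code>split_vaddr</code> and the choice of output registers are made up
for this example.</p>
<pre class="assembly"><code>; Input: RDI = virtual address
; Output: R8 = PML4 index, R9 = PDPT index, R10 = PD index,
; R11 = PT index, RAX = page offset
split_vaddr:
mov r8, rdi
shr r8, 39
and r8d, 0x1FF ; Bits 47-39
mov r9, rdi
shr r9, 30
and r9d, 0x1FF ; Bits 38-30
mov r10, rdi
shr r10, 21
and r10d, 0x1FF ; Bits 29-21
mov r11, rdi
shr r11, 12
and r11d, 0x1FF ; Bits 20-12
mov eax, edi
and eax, 0xFFF ; Bits 11-0
ret</code></pre>
<p>For the address <code>0x201000</code>, this yields PML4 = 0, PDPT = 0,
PD = 1, PT = 1, and offset 0.</p>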
<h4 id="large-pages-2mb-and-1gb"><strong>Large Pages (2MB and
1GB)</strong></h4>
<pre class="assembly"><code>; 2MB Large Pages (at PD level)
setup_2mb_pages:
; PDPTE points to PD
mov rax, pd_base
or rax, PTE_PRESENT | PTE_WRITABLE
mov [pdpt_base], rax
; PD entries directly map 2MB pages
mov rcx, 512
xor rbx, rbx
mov rdi, pd_base
.map_2mb:
mov rax, rbx
or rax, PTE_PRESENT | PTE_WRITABLE | PTE_LARGE
stosq
add rbx, 0x200000 ; Next 2MB
loop .map_2mb
; 1GB Huge Pages (at PDPT level)
setup_1gb_pages:
; Check CPU support
mov eax, 0x80000001
cpuid
test edx, (1 &lt;&lt; 26) ; Check PDPE1GB bit
jz .no_1gb_support
; PDPT entries directly map 1GB pages
mov rcx, 4 ; Map first 4GB
xor rbx, rbx
mov rdi, pdpt_base
.map_1gb:
mov rax, rbx
or rax, PTE_PRESENT | PTE_WRITABLE | PTE_LARGE
stosq
add rbx, 0x40000000 ; Next 1GB
loop .map_1gb
.no_1gb_support:</code></pre>
<h4 id="five-level-paging-la57"><strong>Five-Level Paging
(LA57)</strong></h4>
<pre class="assembly"><code>; 57-bit Virtual Addressing with PML5
; Bits 56-48: PML5 index (9 bits)
; Bits 47-39: PML4 index
; Bits 38-30: PDPT index
; Bits 29-21: PD index
; Bits 20-12: PT index
; Bits 11-0: Offset
check_la57_support:
mov eax, 7
xor ecx, ecx
cpuid
test ecx, (1 &lt;&lt; 16) ; LA57 in ECX bit 16
jz .no_la57
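; Enable 5-level paging
; Note: CR4.LA57 can only be changed while paging is disabled (for
; example during early boot); writing it in long mode raises #GP
mov rax, cr4
or rax, (1 &lt;&lt; 12) ; Set LA57 bit
mov cr4, rax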
; Set up PML5 table
mov rdi, pml5_base
xor rax, rax
mov rcx, 512
rep stosq
; PML5[0] -&gt; PML4
mov rax, pml4_base
or rax, PTE_PRESENT | PTE_WRITABLE
mov [pml5_base], rax
; Load PML5
mov rax, pml5_base
mov cr3, rax
.no_la57:</code></pre>
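<p>Both the 48-bit and the 57-bit layouts require canonical addresses,
meaning the unused upper bits must all copy the top implemented bit. A
minimal check for the 48-bit case might look like the following; the
routine name <code>is_canonical_48</code> is hypothetical. For 57-bit
addressing the same idea applies with a shift count of 7 instead of
16.</p>
<pre class="assembly"><code>; Input: RDI = virtual address
; Output: EAX = 1 if canonical under 48-bit addressing, else 0
is_canonical_48:
mov rax, rdi
shl rax, 16 ; Drop bits 63-48
sar rax, 16 ; Sign-extend from bit 47
cmp rax, rdi ; Unchanged only if the upper bits already matched
sete al
movzx eax, al
ret</code></pre>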
<h3 id="translation-lookaside-buffer-tlb-management"><strong>12.2
Translation Lookaside Buffer (TLB) Management</strong></h3>
<h4 id="tlb-invalidation-techniques"><strong>TLB Invalidation
Techniques</strong></h4>
<pre class="assembly"><code>; Single page invalidation
invalidate_page:
; Input: RDI = virtual address
invlpg [rdi] ; Invalidate single TLB entry
ret
; TLB flush via CR3 reload
flush_tlb:
mov rax, cr3
mov cr3, rax ; Reloading CR3 flushes non-global entries only
ret
; Process Context ID (PCID) - tags TLB entries per address space
pcid_operations:
; Check PCID support
mov eax, 1
cpuid
test ecx, (1 &lt;&lt; 17) ; PCID bit
jz .no_pcid
; Enable PCID
mov rax, cr4
or rax, (1 &lt;&lt; 17) ; Set PCIDE
mov cr4, rax
; Use PCID in CR3 (bits 11:0)
mov rax, pml4_base
or rax, 0x001 ; PCID = 1
mov cr3, rax
; INVPCID instruction for targeted flush
; (a separate feature: CPUID.(EAX=7,ECX=0):EBX bit 10)
; Type 0: Individual address
; Type 1: Single PCID
; Type 2: All PCIDs, including globals
; Type 3: All PCIDs, except globals
mov rax, 1 ; Type: single PCID
mov rcx, pcid_descriptor ; 128-bit descriptor
invpcid rax, [rcx]
.no_pcid:
; Global page optimization
mark_global_pages:
; Set G bit for kernel pages
mov rax, [kernel_pte]
or rax, PTE_GLOBAL
mov [kernel_pte], rax
; Global pages survive CR3 reload (unless CR4.PGE cleared)</code></pre>
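<p>The 128-bit descriptor that <code>INVPCID</code> reads holds the PCID
in bits 11:0, reserved zero bits in 63:12, and a linear address in bits
127:64. A sketch of filling it is shown below; it reuses the
<code>pcid_descriptor</code> label from the block above, while the
routine name <code>fill_invpcid_descriptor</code> is an assumption of
this example.</p>
<pre class="assembly"><code>section .bss
alignb 16
pcid_descriptor: resq 2 ; Low qword = PCID, high qword = linear address
section .text
; Input: RDI = PCID (0-4095), RSI = linear address (only used by type 0)
fill_invpcid_descriptor:
and edi, 0xFFF ; Keep the 12-bit PCID, zero the reserved bits
mov [rel pcid_descriptor], rdi
mov [rel pcid_descriptor + 8], rsi
ret</code></pre>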
<h4 id="page-attribute-table-pat"><strong>Page Attribute Table
(PAT)</strong></h4>
<pre class="assembly"><code>; PAT MSR Configuration (0x277)
setup_pat:
mov ecx, 0x277
rdmsr
; Default PAT values:
; PAT0: WB (Write-Back)
; PAT1: WT (Write-Through)
; PAT2: UC- (Uncached minus)
; PAT3: UC (Uncached)
; PAT4: WB
; PAT5: WT
; PAT6: UC-
; PAT7: UC
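; Write the PAT: one byte per entry (bits 2:0 = PAT0, 10:8 = PAT1, etc.)
; Encodings: 00 = UC, 01 = WC, 04 = WT, 05 = WP, 06 = WB, 07 = UC-
; The values below reproduce the defaults; change a byte to customize
mov eax, 0x00070406 ; PAT3..PAT0
mov edx, 0x00070406 ; PAT7..PAT4
wrmsr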
; Use PAT in page table entry
set_page_caching:
; PAT index = PTE bits: PAT(7) | PCD(4) | PWT(3)
mov rax, physical_addr
or rax, PTE_PRESENT | PTE_WRITABLE
or rax, 0x08 ; PWT=1 -&gt; PAT index 1 (WT)
mov [pte_entry], rax</code></pre>
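<p>The PAT index formula in the comment above (PAT, PCD, PWT as a 3-bit
number) can be written out as a short routine. This is a sketch for 4KB
page table entries, where the PAT bit is bit 7; the name
<code>pat_index_from_pte</code> is hypothetical.</p>
<pre class="assembly"><code>; Input: RDI = 4KB page table entry
; Output: EAX = PAT index (0-7)
pat_index_from_pte:
mov rax, rdi
shr rax, 3 ; PWT (bit 3) -&gt; bit 0, PCD (bit 4) -&gt; bit 1
and eax, 3
bt rdi, 7 ; PAT bit of a 4KB PTE
jnc .done
or eax, 4 ; PAT -&gt; bit 2
.done:
ret</code></pre>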
<h3 id="memory-protection-extensions"><strong>12.3 Memory Protection
Extensions</strong></h3>
<h4 id="nx-bit-and-dep"><strong>NX Bit and DEP</strong></h4>
<pre class="assembly"><code>; Enable NX (No-Execute) bit support
enable_nx:
; Check NX support
mov eax, 0x80000001
cpuid
test edx, (1 &lt;&lt; 20) ; NX bit
jz .no_nx
; Enable in EFER MSR
mov ecx, 0xC0000080 ; IA32_EFER
rdmsr
or eax, (1 &lt;&lt; 11) ; Set NXE
wrmsr
; Mark data pages as non-executable
mov rax, [data_pte]
bts rax, 63 ; Set bit 63 (PTE_NX does not fit an OR immediate)
mov [data_pte], rax
.no_nx:
; Protection Keys (PKU)
setup_protection_keys:
; Check PKU support
mov eax, 7
xor ecx, ecx
cpuid
test ecx, (1 &lt;&lt; 3) ; PKU bit
jz .no_pku
; Enable in CR4
mov rax, cr4
or rax, (1 &lt;&lt; 22) ; Set PKE
mov cr4, rax
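; Set protection key in PTE (bits 62:59)
mov rax, [user_pte]
mov rdx, ~(0xF &lt;&lt; 59) ; 64-bit masks must go through a register
and rax, rdx ; Clear key bits
mov rdx, (2 &lt;&lt; 59)
or rax, rdx ; Set key = 2
mov [user_pte], rax
; Configure PKRU register (two bits per key: AD = bit 2k, WD = bit 2k+1)
xor ecx, ecx ; WRPKRU requires ECX = 0
mov eax, (1 &lt;&lt; 4) ; AD for key 2: deny data access to key 2 only
xor edx, edx ; WRPKRU requires EDX = 0
wrpkru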
.no_pku:</code></pre>
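<p>PKRU holds two bits per key: access-disable (AD) at bit 2k and
write-disable (WD) at bit 2k+1. The helper below, whose name
<code>pkru_deny_mask</code> is invented for this sketch, computes the
pair of bits that belong to a given key.</p>
<pre class="assembly"><code>; Input: EDI = protection key (0-15)
; Output: EAX = mask with the AD and WD bits of that key set
pkru_deny_mask:
mov ecx, edi
and ecx, 0xF ; Keys are 4 bits wide
add ecx, ecx ; Bit position of AD = 2 * key
mov eax, 3 ; AD | WD
shl eax, cl
ret</code></pre>
<p>For key 2 the result is <code>0x30</code>; OR-ing that into the
current PKRU value blocks both reads and writes for pages tagged with
key 2.</p>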
<h4 id="memory-type-range-registers-mtrrs"><strong>Memory Type Range
Registers (MTRRs)</strong></h4>
<pre class="assembly"><code>; MTRR Configuration
configure_mtrrs:
; Disable MTRRs during setup
mov ecx, 0x2FF ; IA32_MTRR_DEF_TYPE
rdmsr
and eax, ~(1 &lt;&lt; 11) ; Clear E bit
wrmsr
; Set variable MTRR for framebuffer
mov ecx, 0x200 ; IA32_MTRR_PHYSBASE0
mov rax, 0xF0000000 ; Physical base
or rax, 0x01 ; Type = WC (Write-Combining)
xor rdx, rdx
wrmsr
mov ecx, 0x201 ; IA32_MTRR_PHYSMASK0
mov rax, 0xFFF00000 ; 1MB size: mask = ~(size - 1)
or rax, (1 &lt;&lt; 11) ; Valid bit
mov rdx, 0x0F ; High mask bits, assuming 36-bit physical addresses
wrmsr
; Enable MTRRs
mov ecx, 0x2FF
rdmsr
or eax, (1 &lt;&lt; 11) ; Set E bit
or eax, (1 &lt;&lt; 10) ; Set FE (Fixed MTRRs)
wrmsr</code></pre>
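<p>The mask value used above follows from the range size and the
physical address width reported by the CPU. A sketch of that
calculation for power-of-two sizes is given below; the routine name
<code>mtrr_physmask</code> and its calling convention are assumptions of
this example.</p>
<pre class="assembly"><code>; Input: RDI = range size in bytes (power of two, at least 4KB)
; ESI = physical address width in bits (for example 36)
; Output: RAX = PHYSMASK value with the valid bit set
mtrr_physmask:
mov rax, rdi
dec rax
not rax ; ~(size - 1)
mov ecx, esi
mov edx, 1
shl rdx, cl
dec rdx ; (1 &lt;&lt; width) - 1
and rax, rdx ; Trim to the implemented address bits
bts rax, 11 ; Valid bit
ret</code></pre>
<p>With a 1MB size and a 36-bit width the result is
<code>0xFFFF00800</code>, which splits into <code>EDX = 0x0F</code> and
<code>EAX = 0xFFF00800</code> for <code>WRMSR</code>.</p>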
<h3 id="virtual-memory-operations"><strong>12.4 Virtual Memory
Operations</strong></h3>
<h4 id="page-fault-handling"><strong>Page Fault Handling</strong></h4>
<pre class="assembly"><code>; Page Fault Handler (Exception 14)
page_fault_handler:
push rax
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
push r8
push r9
push r10
push r11
; Get fault address from CR2
mov rdi, cr2
; Get error code (on stack)
mov rsi, [rsp + 88]
; Analyze error code
test rsi, 0x01 ; Present bit
jz .not_present
test rsi, 0x02 ; Write access
jnz .write_fault
test rsi, 0x04 ; User mode
jnz .user_fault
test rsi, 0x10 ; Instruction fetch
jnz .exec_fault
jmp .done ; No case matched: do not fall into demand paging
.not_present:
; Handle demand paging
mov rbp, rdi ; Keep fault address in a callee-saved register
call allocate_page
mov rbx, rax ; Physical page
; Calculate PTE address (simplified: assumes the single PT at pt_base)
mov rax, rbp ; Fault address
shr rax, 12 ; Page number
and rax, 0x1FF ; PT index
shl rax, 3 ; *8 for entry size
add rax, pt_base
; Install PTE
mov rdx, rbx
or rdx, PTE_PRESENT | PTE_WRITABLE | PTE_USER
mov [rax], rdx
; Invalidate TLB
invlpg [rbp]
jmp .done
.write_fault:
; Handle copy-on-write
call handle_cow
jmp .done
.user_fault:
; Check user permissions
call check_user_access
jmp .done
.exec_fault:
; Check NX violation
call handle_nx_violation
.done:
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
pop rax
add rsp, 8 ; Remove error code
iretq</code></pre>
<h4 id="memory-mapping-and-unmapping"><strong>Memory Mapping and
Unmapping</strong></h4>
<pre class="assembly"><code>; Map virtual to physical address
map_page:
; Input: RDI = virtual, RSI = physical, RDX = flags
push rbx
push rcx
; Walk page tables
mov rax, rdi
shr rax, 39
and rax, 0x1FF ; PML4 index
shl rax, 3
add rax, pml4_base
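; Check PML4E (keep its address in RBX across the call)
mov rbx, rax
mov rax, [rbx]
test rax, PTE_PRESENT
jnz .pdpt_exists
; Allocate PDPT (allocate_page is assumed to return a zeroed physical
; page in RAX and to preserve RDI, RSI, and RDX)
call allocate_page
or rax, PTE_PRESENT | PTE_WRITABLE | PTE_USER
mov [rbx], rax ; Install the new PML4E
.pdpt_exists:
mov rbx, rax ; PML4E value
and rbx, ~0xFFF ; Clear flags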
mov rax, rdi
shr rax, 30
and rax, 0x1FF ; PDPT index
shl rax, 3
add rax, rbx
; Continue for PD and PT...
; Install final PTE
mov rax, rsi ; Physical address
or rax, rdx ; Flags
mov [final_pte], rax
invlpg [rdi] ; Flush TLB
pop rcx
pop rbx
ret
; Unmap pages
unmap_pages:
; Input: RDI = start address, RSI = page count
push rcx
mov rcx, rsi
.unmap_loop:
; Clear PTE
call get_pte_address
mov qword [rax], 0
invlpg [rdi]
add rdi, 0x1000
loop .unmap_loop
pop rcx
ret</code></pre>
<h4 id="copy-on-write-implementation"><strong>Copy-on-Write
Implementation</strong></h4>
<pre class="assembly"><code>; Mark pages for COW
setup_cow:
; Input: RDI = start, RSI = end
push rax
push rbx
.cow_loop:
call get_pte_address
mov rbx, [rax]
and rbx, ~PTE_WRITABLE ; Clear write bit
or rbx, (1 &lt;&lt; 9) ; Set available bit for COW
mov [rax], rbx
add rdi, 0x1000
cmp rdi, rsi
jb .cow_loop
pop rbx
pop rax
ret
; Handle COW fault
handle_cow:
; Input: RDI = fault address
; Helpers are assumed to preserve every register except RAX
push rax
push rbx
push rcx
push rdx
push rsi
push rdi ; Fault address, reloaded after the copy
; Get PTE
call get_pte_address
mov rbx, [rax]
; Check if COW page
test rbx, (1 &lt;&lt; 9) ; COW bit
jz .not_cow
; Allocate new page
call allocate_page
mov rdx, rax ; New physical page
; Copy old page content (assumes physical memory is identity-mapped)
mov rsi, 0x000FFFFFFFFFF000 ; Physical-address field, bits 51:12
and rsi, rbx ; Old physical address
mov rdi, rdx
mov rcx, 512 ; 4KB / 8
rep movsq
; Update PTE
mov rdi, [rsp] ; Restore the fault address clobbered by the copy
call get_pte_address
mov rbx, rdx
or rbx, PTE_PRESENT | PTE_WRITABLE | PTE_USER ; COW bit stays clear
mov [rax], rbx
invlpg [rdi]
.not_cow:
pop rdi
pop rsi
pop rdx
pop rcx
pop rbx
pop rax
ret</code></pre>
<hr />
<p>This chapter provides comprehensive coverage of x86-64’s virtual
memory architecture, from the 4-level page table hierarchy through TLB
management, protection mechanisms, and practical implementations of page
fault handling and COW. The code examples are detailed enough to
understand the low-level mechanics while being practical for real
implementation.</p>
<hr />
<h2
id="chapter-13-interrupts-apic-and-multi-core-programming"><strong>Chapter
13: Interrupts, APIC, and Multi-Core Programming</strong></h2>
<h3 id="interrupt-architecture"><strong>13.1 Interrupt
Architecture</strong></h3>
<h4 id="interrupt-descriptor-table-idt"><strong>Interrupt Descriptor
Table (IDT)</strong></h4>
<p>The IDT contains up to 256 gate descriptors that specify handlers for
interrupts and exceptions:</p>
<pre class="assembly"><code>; IDT Gate Descriptor (16 bytes in 64-bit mode)
struc idt_gate
.offset_low resw 1 ; Offset bits 0-15
.selector resw 1 ; Code segment selector
.ist resb 1 ; Interrupt Stack Table (bits 0-2)
.attributes resb 1 ; Type and attributes
.offset_mid resw 1 ; Offset bits 16-31
.offset_high resd 1 ; Offset bits 32-63
.reserved resd 1 ; Reserved
endstruc
; Gate types
GATE_INTERRUPT equ 0x8E ; Interrupt gate (IF cleared)
GATE_TRAP equ 0x8F ; Trap gate (IF unchanged)
GATE_CALL equ 0x8C ; Call gate (not for interrupts)
; Build IDT
build_idt:
mov rdi, idt_base
mov rcx, 256
.fill_idt:
; Default handler for all vectors
mov rax, default_handler
mov [rdi + idt_gate.offset_low], ax
shr rax, 16
mov [rdi + idt_gate.offset_mid], ax
shr rax, 16
mov [rdi + idt_gate.offset_high], eax
mov word [rdi + idt_gate.selector], 0x08 ; Kernel CS
mov byte [rdi + idt_gate.ist], 0 ; No IST
mov byte [rdi + idt_gate.attributes], GATE_INTERRUPT
mov dword [rdi + idt_gate.reserved], 0
add rdi, 16
loop .fill_idt
; Set specific handlers
mov rsi, exception_handlers
xor rcx, rcx
.set_exceptions:
mov rax, [rsi + rcx*8]
test rax, rax
jz .skip
; Calculate IDT entry address
mov rdi, idt_base
shl rcx, 4 ; *16 for entry size
add rdi, rcx
shr rcx, 4
; Install handler
mov [rdi + idt_gate.offset_low], ax
shr rax, 16
mov [rdi + idt_gate.offset_mid], ax
shr rax, 16
mov [rdi + idt_gate.offset_high], eax
.skip:
inc rcx
cmp rcx, 32 ; First 32 are exceptions
jb .set_exceptions
; Load IDT
lidt [idt_descriptor]
ret
idt_descriptor:
dw idt_size - 1 ; Limit
dq idt_base ; Base address</code></pre>
<h4 id="exception-handling-1"><strong>Exception Handling</strong></h4>
<pre class="assembly"><code>; Exception handlers with error codes
; Stack frame on entry:
; [RSP+40] SS
; [RSP+32] RSP
; [RSP+24] RFLAGS
; [RSP+16] CS
; [RSP+8] RIP
; [RSP] Error code (for some exceptions)
; Page Fault Handler (#PF, vector 14)
page_fault_handler:
; Error code already on stack
push rax
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
mov rdi, cr2 ; Faulting address
mov rsi, [rsp + 120] ; Error code
mov rdx, [rsp + 128] ; RIP
; Check error code bits
test rsi, 0x01 ; Present
jz .not_present
test rsi, 0x02 ; Write
jnz .write_violation
test rsi, 0x04 ; User mode
jnz .user_violation
test rsi, 0x08 ; Reserved bit set
jnz .reserved_violation
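test rsi, 0x10 ; Instruction fetch
jnz .nx_violation
jmp .done ; No case matched
.not_present:
call handle_page_not_present
jmp .done
.write_violation:
call handle_write_protection
jmp .done
.user_violation:
call handle_user_access
jmp .done
.reserved_violation:
call handle_reserved_bit ; Hypothetical handler name, not defined in this text
jmp .done
.nx_violation:
call handle_nx_fault
.done: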
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
pop rax
add rsp, 8 ; Remove error code
iretq
; General Protection Fault (#GP, vector 13)
general_protection_handler:
; Has error code
push rax
push rcx
push rdx
mov rax, [rsp + 24] ; Error code
test rax, 0x01 ; External event
jnz .external
; Decode selector index
and rax, 0xFFF8 ; Selector index
shr rax, 3
; Check if GDT or LDT
test dword [rsp + 24], 0x04 ; TI bit
jnz .ldt_error
; Handle GDT selector error
call handle_gdt_error
jmp .done
.ldt_error:
call handle_ldt_error
jmp .done
.external:
call handle_external_gp
.done:
pop rdx
pop rcx
pop rax
add rsp, 8 ; Remove error code
iretq
; Double Fault (#DF, vector 8)
; Uses IST to ensure valid stack
double_fault_handler:
; Critical error - system is likely corrupted
cli
; Save minimal state
push rax
push rcx
push rdx
; Log error
mov rdi, double_fault_msg
call panic_print
; Attempt to get fault information
; (#DF pushes an error code, always 0, at [rsp + 24])
mov rax, [rsp + 32] ; RIP
mov rcx, [rsp + 40] ; CS
mov rdx, [rsp + 48] ; RFLAGS
; System halt
hlt
jmp $</code></pre>
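<p>The selector error code examined by the #GP handler above has a fixed
layout: bit 0 is EXT, bit 1 selects the IDT, bit 2 (TI) selects the LDT
when bit 1 is clear, and bits 15:3 hold the descriptor index. The
decoder below is a sketch; the routine name
<code>decode_selector_error</code> and the numeric table codes are
choices made for this example.</p>
<pre class="assembly"><code>; Input: EDI = selector error code
; Output: EAX = descriptor index
; ECX = table (0 = GDT, 1 = IDT, 2 = LDT)
; EDX = EXT flag (1 = caused by an external event)
decode_selector_error:
mov edx, edi
and edx, 1 ; Bit 0: EXT
mov eax, edi
shr eax, 3
and eax, 0x1FFF ; Bits 15-3: index
mov ecx, 1 ; Assume IDT
test edi, 0x02 ; Bit 1: IDT
jnz .done
mov ecx, edi
shr ecx, 1
and ecx, 2 ; TI set -&gt; 2 (LDT), clear -&gt; 0 (GDT)
.done:
ret</code></pre>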
<h4 id="hardware-vs-software-interrupts"><strong>Hardware vs Software
Interrupts</strong></h4>
<pre class="assembly"><code>; Software interrupt handling
; In 64-bit mode, INT pushes: SS, RSP, RFLAGS, CS, RIP
; No error code pushed
; System call via INT 0x80 (legacy)
int80_handler:
; Save the caller-saved registers plus RBX and RBP
push rax
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
push r8
push r9
push r10
push r11
; System call number in RAX
cmp rax, MAX_SYSCALL
ja .invalid
; Call handler
mov rcx, syscall_table
call [rcx + rax*8]
; Result in RAX
mov [rsp + 80], rax ; Store in saved RAX
.done:
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
pop rax
iretq
.invalid:
mov qword [rsp + 80], -ENOSYS ; Write the error into the saved RAX slot
jmp .done
; Hardware interrupt from APIC
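apic_timer_handler:
; Save every caller-saved register: schedule may clobber any of them
push rax
push rcx
push rdx
push rsi
push rdi
push r8
push r9
push r10
push r11
; Increment tick counter
lock inc qword [system_ticks]
; Send EOI first so a task switch cannot delay it
mov rax, [apic_base]
mov dword [rax + APIC_EOI], 0
; Check for scheduler quantum
mov rax, [current_task]
dec dword [rax + task.quantum]
jnz .no_schedule
; Need reschedule
call schedule
.no_schedule:
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rdx
pop rcx
pop rax
iretq</code></pre>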
<h3 id="advanced-programmable-interrupt-controller-apic"><strong>13.2
Advanced Programmable Interrupt Controller (APIC)</strong></h3>
<h4 id="local-apic-programming"><strong>Local APIC
Programming</strong></h4>
<pre class="assembly"><code>; APIC Register Offsets
APIC_ID equ 0x020 ; APIC ID
APIC_VERSION equ 0x030 ; Version
APIC_TPR equ 0x080 ; Task Priority
APIC_EOI equ 0x0B0 ; End of Interrupt
APIC_LDR equ 0x0D0 ; Logical Destination
APIC_DFR equ 0x0E0 ; Destination Format
APIC_SPURIOUS equ 0x0F0 ; Spurious Interrupt Vector
APIC_ISR equ 0x100 ; In-Service (8 registers)
APIC_TMR equ 0x180 ; Trigger Mode (8 registers)
APIC_IRR equ 0x200 ; Interrupt Request (8 registers)
APIC_ESR equ 0x280 ; Error Status
APIC_ICR_LOW equ 0x300 ; Interrupt Command (low)
APIC_ICR_HIGH equ 0x310 ; Interrupt Command (high)
APIC_TIMER_LVT equ 0x320 ; Timer Local Vector Table
APIC_THERMAL_LVT equ 0x330 ; Thermal LVT
APIC_PERF_LVT equ 0x340 ; Performance Counter LVT
APIC_LINT0_LVT equ 0x350 ; LINT0 LVT
APIC_LINT1_LVT equ 0x360 ; LINT1 LVT
APIC_ERROR_LVT equ 0x370 ; Error LVT
APIC_TIMER_INIT equ 0x380 ; Timer Initial Count
APIC_TIMER_CURR equ 0x390 ; Timer Current Count
APIC_TIMER_DIV equ 0x3E0 ; Timer Divide Configuration
; Initialize Local APIC
init_local_apic:
; Get APIC base from MSR
mov ecx, 0x1B ; IA32_APIC_BASE MSR
rdmsr
; Enable APIC (set bit 11), leaving the BSP flag and base untouched
or eax, 0x800
wrmsr
and eax, 0xFFFFF000 ; Clear lower 12 bits
mov [apic_base], rax ; Assumes the base lies below 4GB (EDX ignored)
; Map APIC to virtual memory (if needed)
mov rdi, APIC_VIRT_BASE
mov rsi, rax ; Physical address
mov rdx, PAGE_SIZE
mov rcx, PAGE_PRESENT | PAGE_WRITE | PAGE_NO_CACHE
call map_pages
; Set up spurious interrupt vector
mov rax, [apic_base]
mov dword [rax + APIC_SPURIOUS], 0x1FF ; Enable APIC, vector 0xFF
; Set task priority to accept all interrupts
mov dword [rax + APIC_TPR], 0
; Configure Local Vector Table entries
; Timer (periodic mode, vector 0x20)
mov dword [rax + APIC_TIMER_LVT], 0x20020
; Error interrupt (vector 0x21)
mov dword [rax + APIC_ERROR_LVT], 0x21
; Performance counter (vector 0x22)
mov dword [rax + APIC_PERF_LVT], 0x22
; Thermal sensor (vector 0x23)
mov dword [rax + APIC_THERMAL_LVT], 0x23
; LINT0 and LINT1 (masked)
mov dword [rax + APIC_LINT0_LVT], 0x10000
mov dword [rax + APIC_LINT1_LVT], 0x10000
ret
; APIC Timer Programming
setup_apic_timer:
; Input: RDI = frequency in Hz
push rax
push rcx
push rdx
mov rcx, [apic_base] ; Keep the base in RCX: DIV needs RAX and RDX
; Set divide value to 16
mov dword [rcx + APIC_TIMER_DIV], 0x03
; Calculate initial count
; count = (bus_frequency / 16) / desired_frequency
mov rax, [apic_bus_frequency]
shr rax, 4 ; Divide by 16
xor edx, edx ; DIV divides RDX:RAX
div rdi ; RAX = initial count
; Set initial count
mov [rcx + APIC_TIMER_INIT], eax
; Start timer in periodic mode
mov dword [rcx + APIC_TIMER_LVT], 0x20020
pop rdx
pop rcx
pop rax
ret</code></pre>
<h4 id="io-apic-configuration"><strong>I/O APIC
Configuration</strong></h4>
<pre class="assembly"><code>; I/O APIC Registers
IOAPIC_ID equ 0x00
IOAPIC_VERSION equ 0x01
IOAPIC_ARB equ 0x02
IOAPIC_REDTBL equ 0x10 ; Redirection table base
; I/O APIC indirect access
ioapic_read:
; Input: RDI = register
; Output: RAX = value
mov rax, [ioapic_base]
mov dword [rax], edi ; Select register
mov eax, dword [rax + 0x10] ; Read data
ret
ioapic_write:
; Input: RDI = register, RSI = value
mov rax, [ioapic_base]
mov dword [rax], edi ; Select register
mov dword [rax + 0x10], esi ; Write data
ret
; Configure I/O APIC redirection entry
setup_ioapic_entry:
; Input: RDI = IRQ, RSI = vector, RDX = destination APIC ID
push rax
push rcx
; Calculate redirection table register
shl rdi, 1 ; Each entry is 2 registers
add rdi, IOAPIC_REDTBL
; Low 32 bits: vector, delivery mode, destination mode
mov rax, rsi ; Vector
or rax, (0 &lt;&lt; 8) ; Fixed delivery
or rax, (0 &lt;&lt; 11) ; Physical destination
or rax, (0 &lt;&lt; 13) ; Active high
or rax, (0 &lt;&lt; 15) ; Edge triggered
or rax, (0 &lt;&lt; 16) ; Unmask interrupt
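push rdi
mov rsi, rax
call ioapic_write ; Write low dword
pop rdi
; High 32 bits: destination APIC ID in bits 31:24
inc rdi ; Second register of the entry
mov rsi, rdx
shl rsi, 24
call ioapic_write ; Write high dword
pop rcx
pop rax
ret</code></pre>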
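<p>Each redirection entry is a 64-bit value split across two 32-bit
registers, at index <code>0x10 + 2*IRQ</code> for the low half and the
next index for the high half. The sketch below composes such a value
for the fixed, physical, active-high, edge-triggered case used above;
the routine name <code>make_redir_entry</code> is hypothetical.</p>
<pre class="assembly"><code>; Input: EDI = vector, ESI = destination APIC ID, EDX = 1 to mask the IRQ
; Output: RAX = 64-bit redirection entry
make_redir_entry:
movzx eax, dil ; Bits 7-0: vector
and edx, 1
shl edx, 16 ; Bit 16: mask
or eax, edx
mov ecx, esi
and ecx, 0xFF
shl rcx, 56 ; Bits 63-56: destination field
or rax, rcx
ret</code></pre>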
<hr />
<h2
id="chapter-14-security-extensions-and-virtualization"><strong>Chapter
14: Security Extensions and Virtualization</strong></h2>
<h3 id="hardware-assisted-security-features"><strong>14.1
Hardware-Assisted Security Features</strong></h3>
<h4 id="nx-bit-no-execute"><strong>NX Bit (No-Execute)</strong></h4>
<p>The NX bit, enabled via <code>IA32_EFER.NXE</code> (MSR
<code>0xC0000080</code>), prevents code execution from data pages. When
set in a Page Table Entry (PTE), it marks the page as
non-executable.</p>
<pre class="assembly"><code>; Enable NX support
enable_nx:
mov ecx, 0xC0000080 ; IA32_EFER MSR
rdmsr
or eax, (1 &lt;&lt; 11) ; Set NXE bit
wrmsr
; Mark data pages non-executable in page tables
mov rdx, [data_page_pte_addr] ; RDX = address of the PTE
bts qword [rdx], 63 ; Set bit 63 (NX bit) in the PTE itself
ret
; Page fault handler for NX violations (entered from the #PF gate)
nx_fault_handler:
push rdi
push rsi
; CR2 contains faulting address
mov rdi, cr2
; Error code bit 4 indicates instruction fetch
mov rsi, [rsp + 16] ; Error code sits above the two saved registers
test rsi, 0x10
jz .not_nx_fault
; Handle NX violation
call log_nx_violation
; Terminate offending process (assumed not to return)
call terminate_current_process
.not_nx_fault:
pop rsi
pop rdi
add rsp, 8 ; Remove error code
iretq</code></pre>
<h4
id="smapsmep-supervisor-mode-accessexecution-prevention"><strong>SMAP/SMEP
(Supervisor Mode Access/Execution Prevention)</strong></h4>
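<p>These features harden the kernel against exploitation: SMEP blocks
instruction fetches from user-mode pages while running in supervisor
mode, and SMAP blocks supervisor data accesses to user-mode pages unless
<code>EFLAGS.AC</code> is set.</p>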
<pre class="assembly"><code>; Enable SMAP and SMEP
enable_smap_smep:
; Check CPU support via CPUID
mov eax, 7
xor ecx, ecx
cpuid
; SMEP support is EBX bit 7, SMAP support is EBX bit 20
; Enable each feature independently in CR4
mov rax, cr4
test ebx, (1 &lt;&lt; 7)
jz .no_smep
or rax, (1 &lt;&lt; 20) ; SMEP bit
.no_smep:
test ebx, (1 &lt;&lt; 20)
jz .no_smap
or rax, (1 &lt;&lt; 21) ; SMAP bit
.no_smap:
mov cr4, rax
ret
; Safe user memory access with SMAP
copy_from_user:
; Input: RDI = kernel buffer, RSI = user buffer, RDX = size
; Returns: RAX = 0 on success, -EFAULT on failure
; Verify user address range before opening the SMAP window
mov rax, USER_SPACE_END
cmp rsi, rax
jae .fault ; Source starts outside user space
sub rax, rsi ; Bytes available below USER_SPACE_END
cmp rax, rdx
jb .fault ; Range would run past user space
; Temporarily allow user access
stac ; Set AC flag
; Copy (a real kernel pairs this with a fault fixup entry)
cld
mov rcx, rdx
rep movsb
; Disable user access
clac ; Clear AC flag
xor eax, eax ; Success
ret
.fault:
mov rax, -EFAULT
ret</code></pre>
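<p>Because <code>CPUID</code> is unprivileged, the feature check (unlike
the CR4 write) can be tried from user mode. The routine below is a
sketch along those lines; the name <code>query_smep_smap</code> and the
result encoding are assumptions of this example.</p>
<pre class="assembly"><code>; Output: EAX bit 0 = SMEP supported, bit 1 = SMAP supported
query_smep_smap:
push rbx ; CPUID overwrites RBX, which is callee-saved
xor eax, eax
cpuid ; EAX = highest basic leaf
xor r8d, r8d ; Result accumulator
cmp eax, 7
jb .done ; Leaf 7 not available
mov eax, 7
xor ecx, ecx
cpuid
bt ebx, 7 ; SMEP
adc r8d, 0 ; Bit 0 = carry from BT
bt ebx, 20 ; SMAP
jnc .done
or r8d, 2
.done:
mov eax, r8d
pop rbx
ret</code></pre>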
<h4 id="intel-cet-control-flow-enforcement-technology"><strong>Intel CET
(Control-flow Enforcement Technology)</strong></h4>
<p>CET provides hardware-based protection against ROP/JOP attacks
through shadow stacks and indirect branch tracking.</p>
<pre class="assembly"><code>; Shadow Stack Setup
setup_shadow_stack:
; Check CET support
mov eax, 7
xor ecx, ecx
cpuid
test ecx, (1 &lt;&lt; 7) ; CET_SS bit
jz .no_cet
; Enable CET in CR4
mov rax, cr4
or rax, (1 &lt;&lt; 23) ; CET bit
mov cr4, rax
; Allocate shadow stack
mov rdi, SHADOW_STACK_SIZE
call allocate_shadow_stack_memory
; Load shadow stack pointer
mov ecx, 0x6A7 ; IA32_PL3_SSP MSR (0x6A4 is IA32_PL0_SSP)
mov rdx, rax
shr rdx, 32
wrmsr
; Enable shadow stack in IA32_U_CET
mov ecx, 0x6A0 ; IA32_U_CET MSR
rdmsr
or eax, 0x01 ; SH_STK_EN
wrmsr
.no_cet:
ret
; Indirect Branch Tracking
; Valid indirect targets must begin with ENDBRANCH
valid_indirect_target:
endbr64 ; Mark as valid branch target
; Function code follows...
ret
; CET-aware exception handler (#CP, Control Protection, vector 21)
cet_exception_handler:
; #CP pushes an error code; register save/restore omitted for brevity
mov rax, [rsp] ; Error code
test rax, CET_FAULT_FLAG
jz .not_cet
; Log CET violation
mov rdi, [rsp + 8] ; RIP of violation
call log_cet_violation
; Terminate process (assumed not to return)
call terminate_current_process
.not_cet:
add rsp, 8 ; Remove error code
iretq</code></pre>
<h4 id="intel-sgx-software-guard-extensions"><strong>Intel SGX (Software
Guard Extensions)</strong></h4>
<p>SGX creates secure enclaves for protecting sensitive code and
data.</p>
<pre class="assembly"><code>; Check SGX support
check_sgx_support:
mov eax, 7
xor ecx, ecx
cpuid
test ebx, (1 &lt;&lt; 2) ; SGX bit
jz .no_sgx
; Check SGX leaf functions
mov eax, 0x12
xor ecx, ecx
cpuid
; EAX bit 0 = SGX1 supported, bit 1 = SGX2 supported
; EBX contains MISCSELECT
; EDX[7:0] and EDX[15:8] = log2 of the maximum enclave size
; (outside and inside 64-bit mode, respectively)
.no_sgx:
ret
; Enclave creation flow (simplified)
; ENCLS takes its leaf number in EAX and its operands in RBX/RCX/RDX
create_enclave:
; 1. Reserve memory region for enclave
mov rdi, enclave_size
call reserve_memory_region
mov [enclave_base], rax
; 2. Create SECS (SGX Enclave Control Structure)
xor eax, eax ; Leaf 0 = ECREATE
lea rbx, [page_info] ; PAGEINFO describing the SECS
lea rcx, [secs_page] ; Target EPC page
encls
; 3. Add pages to enclave
mov rcx, [enclave_pages]
.add_pages:
push rcx
; Add regular page
mov eax, 1 ; Leaf 1 = EADD
lea rbx, [page_info]
mov rcx, [epc_page_addr] ; Hypothetical variable: target EPC page address
encls
; Extend measurement (256 bytes per call; a 4KB page needs 16 calls)
mov eax, 6 ; Leaf 6 = EEXTEND
lea rbx, [secs_page]
mov rcx, [epc_page_addr]
encls
pop rcx
loop .add_pages
; 4. Initialize enclave
mov eax, 2 ; Leaf 2 = EINIT
lea rbx, [sigstruct]
lea rcx, [secs_page]
lea rdx, [einittoken]
encls
ret
; Enter enclave
enter_enclave:
; Save state
push rbx
push rcx
; Set up parameters
mov eax, 2 ; ENCLU leaf 2 = EENTER
lea rbx, [tcs_page] ; Thread Control Structure
lea rcx, [aep_handler] ; Asynchronous Exit Pointer
; Enter enclave
enclu ; EENTER instruction
; Returns here after enclave exit
pop rcx
pop rbx
ret</code></pre>
<h3 id="virtualization-architecture"><strong>14.2 Virtualization
Architecture</strong></h3>
<h4 id="intel-vt-x-vmx-fundamentals"><strong>Intel VT-x (VMX)
Fundamentals</strong></h4>
<pre class="assembly"><code>; VMX capability checking
check_vmx_capability:
; Check CPUID for VMX support
mov eax, 1
cpuid
test ecx, (1 &lt;&lt; 5) ; VMX bit
jz .no_vmx
; Read VMX capability MSRs
mov ecx, 0x480 ; IA32_VMX_BASIC
rdmsr
; EAX[30:0] = VMCS revision ID
; EDX[12:0] = VMCS region size (MSR bits 44:32)
; Remaining EDX bits = VMX capabilities
mov [vmcs_revision_id], eax
mov eax, edx ; RDMSR returns the high half in EDX, not in RAX[63:32]
and eax, 0x1FFF
mov [vmcs_region_size], rax
.no_vmx:
ret
; VMCS (Virtual Machine Control Structure) setup
setup_vmcs:
; Allocate 4KB aligned VMCS region
mov rdi, 4096
mov rsi, 4096 ; Alignment
call allocate_aligned_memory
mov [vmcs_region], rax
; Write VMCS revision identifier
mov rdi, rax
mov eax, [vmcs_revision_id]
mov [rdi], eax
; Clear VMCS
vmclear [vmcs_region]
jc .vmclear_failed
jz .vmclear_failed
; Load VMCS
vmptrld [vmcs_region]
jc .vmptrld_failed
jz .vmptrld_failed
; Configure VMCS fields
call setup_vmcs_host_state
call setup_vmcs_guest_state
call setup_vmcs_controls
ret
.vmclear_failed:
.vmptrld_failed:
mov rax, -1
ret
; VMCS host state setup
setup_vmcs_host_state:
; Host CR0
mov rax, cr0
mov rcx, 0x6C00 ; HOST_CR0
vmwrite rcx, rax
; Host CR3
mov rax, cr3
mov rcx, 0x6C02 ; HOST_CR3
vmwrite rcx, rax
; Host CR4
mov rax, cr4
mov rcx, 0x6C04 ; HOST_CR4
vmwrite rcx, rax
; Host RSP
lea rax, [host_stack_top]
mov rcx, 0x6C14 ; HOST_RSP
vmwrite rcx, rax
; Host RIP (VM exit handler)
lea rax, [vm_exit_handler]
mov rcx, 0x6C16 ; HOST_RIP
vmwrite rcx, rax
; Host segment selectors
mov ax, cs
mov rcx, 0x0C02 ; HOST_CS_SELECTOR
vmwrite rcx, rax
mov ax, ds
mov rcx, 0x0C06 ; HOST_DS_SELECTOR
vmwrite rcx, rax
; Continue with other segments...
ret
; VM entry
vm_enter:
; Enter VMX root operation (requires CR4.VMXE = 1; normally done once
; per CPU, before any VMCLEAR/VMPTRLD, not on every entry)
vmxon [vmxon_region]
jc .vmxon_failed
; Load guest VMCS
vmptrld [guest_vmcs]
; Launch or resume VM
mov rax, [vm_launched]
test rax, rax
jnz .vm_resume
; First launch
vmlaunch
jmp .vm_entry_failed
.vm_resume:
vmresume
.vm_entry_failed:
; VM entry failed, check error
mov rcx, 0x4400 ; VM_INSTRUCTION_ERROR
vmread rax, rcx
; Handle error...
.vmxon_failed:
ret
; VM exit handler
vm_exit_handler:
; Save guest general purpose registers
push rax
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
; Read exit reason
mov rcx, 0x4402 ; EXIT_REASON
vmread rax, rcx
; Dispatch based on exit reason
and rax, 0xFFFF ; Basic exit reason
cmp rax, EXIT_REASON_CPUID
je .handle_cpuid
cmp rax, EXIT_REASON_IO
je .handle_io
cmp rax, EXIT_REASON_MSR_READ
je .handle_msr_read
cmp rax, EXIT_REASON_MSR_WRITE
je .handle_msr_write
cmp rax, EXIT_REASON_EPT_VIOLATION
je .handle_ept_violation
; ... other exit reasons ...
.exit_dispatch_done:
; Restore registers
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
pop rax
; Resume guest
vmresume
; Should not reach here
jmp vm_resume_failed
.handle_cpuid:
; Emulate CPUID instruction
mov rcx, 0x681E ; GUEST_RIP
vmread rsi, rcx
add rsi, 2 ; CPUID is 2 bytes
vmwrite rcx, rsi
; Guest GPRs are not VMCS fields (0x6800-0x6804 are GUEST_CR0/CR3/CR4);
; they are the values pushed on entry to this handler
mov eax, [rsp + 112] ; Guest EAX (leaf)
mov ecx, [rsp + 104] ; Guest ECX (subleaf)
; Execute CPUID
cpuid
; Write results back into the saved guest registers
mov [rsp + 112], rax ; Guest RAX
mov [rsp + 88], rbx ; Guest RBX
mov [rsp + 104], rcx ; Guest RCX
mov [rsp + 96], rdx ; Guest RDX
jmp .exit_dispatch_done</code></pre>
<h4 id="extended-page-tables-ept"><strong>Extended Page Tables
(EPT)</strong></h4>
<pre class="assembly"><code>; EPT structure setup
setup_ept:
; Allocate EPT PML4 table
mov rdi, 4096
call allocate_page
mov [ept_pml4], rax
; Clear EPT PML4
mov rdi, rax
xor eax, eax
mov rcx, 512
rep stosq
; Set up identity mapping for first 1GB
call setup_ept_identity_1gb
; Set EPTP in VMCS
mov rax, [ept_pml4]
or rax, 0x1E ; Memory type = WB, page walk = 4
mov rcx, 0x201A ; EPT_POINTER
vmwrite rcx, rax
; Enable EPT in secondary controls
mov rcx, 0x401E ; SECONDARY_VM_EXEC_CONTROL
vmread rax, rcx
or rax, (1 &lt;&lt; 1) ; Enable EPT
vmwrite rcx, rax
ret
; EPT violation handler
handle_ept_violation:
; Read guest physical address
mov rcx, 0x2400 ; GUEST_PHYSICAL_ADDRESS
vmread rdi, rcx
; Read exit qualification
mov rcx, 0x6400 ; EXIT_QUALIFICATION (0x640A is GUEST_LINEAR_ADDRESS)
vmread rsi, rcx
; Check violation type
test rsi, 0x01 ; Read access
jnz .handle_read
test rsi, 0x02 ; Write access
jnz .handle_write
test rsi, 0x04 ; Execute access
jnz .handle_execute
.handle_read:
.handle_write:
.handle_execute:
; Map page if valid access
call validate_guest_access
test rax, rax
jz .invalid_access
; Add EPT mapping
mov rsi, rdi ; GPA
mov rdx, [host_page] ; HPA
mov rcx, EPT_READ | EPT_WRITE | EPT_EXECUTE
call add_ept_mapping
jmp .done
.invalid_access:
; Inject exception to guest
call inject_guest_page_fault
.done:
ret</code></pre>
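<p>The listing above calls <code>setup_ept_identity_1gb</code> and
<code>add_ept_mapping</code> without defining them. Both would need to
split a guest-physical address into the four 9-bit table indices used by
a 4-level EPT walk. The following is a minimal sketch of that step only;
the helper name <code>ept_indices</code> and its output layout are
assumptions made for illustration, not part of the original code.</p>
<pre class="assembly"><code>; ept_indices (hypothetical helper)
; In:  RDI = guest-physical address (GPA)
;      RSI = pointer to four qwords of output
; Out: [RSI+0]  = PML4 index (GPA bits 47:39)
;      [RSI+8]  = PDPT index (GPA bits 38:30)
;      [RSI+16] = PD index   (GPA bits 29:21)
;      [RSI+24] = PT index   (GPA bits 20:12)
section .text
global ept_indices
ept_indices:
    mov rax, rdi
    shr rax, 39
    and eax, 0x1FF          ; Keep 9 bits; writing EAX zero-extends into RAX
    mov [rsi], rax
    mov rax, rdi
    shr rax, 30
    and eax, 0x1FF
    mov [rsi + 8], rax
    mov rax, rdi
    shr rax, 21
    and eax, 0x1FF
    mov [rsi + 16], rax
    mov rax, rdi
    shr rax, 12
    and eax, 0x1FF
    mov [rsi + 24], rax
    ret</code></pre>
<p>For example, the GPA <code>0x40201000</code> (1 GiB + 2 MiB + 4 KiB)
yields the indices 0, 1, 1, 1. An identity map of the first 1 GiB only
ever uses PML4 index 0 and PDPT index 0.</p>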
<h3 id="multi-core-and-multi-threading-security"><strong>14.3 Multi-Core
and Multi-Threading Security</strong></h3>
<h4 id="per-cpu-security-state"><strong>Per-CPU Security
State</strong></h4>
<pre class="assembly"><code>; Per-CPU security context
struc cpu_security_context
.cr3_kernel resq 1 ; Kernel page tables
.cr3_user resq 1 ; User page tables (KPTI)
.shadow_stack resq 1 ; CET shadow stack
.gs_base_kernel resq 1 ; Kernel GS base
.gs_base_user resq 1 ; User GS base
.tss_addr resq 1 ; Task State Segment
.ist_stacks resq 7 ; Interrupt Stack Table
endstruc
; Initialize per-CPU security
init_cpu_security:
push rbx ; RBX (callee-saved) holds the context pointer across calls
; Get CPU ID
mov rax, [gs:cpu_id]
; Calculate per-CPU data offset
mov rcx, cpu_security_context_size
mul rcx
lea rbx, [cpu_security_contexts + rax]
; Set up kernel page tables with KPTI
call create_kernel_page_tables
mov [rbx + cpu_security_context.cr3_kernel], rax
; Create user page tables (minimal kernel mapping)
call create_user_page_tables
mov [rbx + cpu_security_context.cr3_user], rax
; Allocate IST stacks for critical exceptions
mov rcx, 7
.alloc_ist:
push rcx
mov rdi, IST_STACK_SIZE ; RDI is an argument register, so it cannot hold the context pointer
call allocate_stack
pop rcx
mov [rbx + cpu_security_context.ist_stacks + rcx*8 - 8], rax
loop .alloc_ist
; Set up TSS with IST
call setup_tss_with_ist
pop rbx
ret
; KPTI (Kernel Page Table Isolation) switching
switch_to_user_cr3:
push rax
mov rax, [gs:cpu_security + cpu_security_context.cr3_user]
mov cr3, rax
pop rax
ret
switch_to_kernel_cr3:
push rax
mov rax, [gs:cpu_security + cpu_security_context.cr3_kernel]
mov cr3, rax
pop rax
ret</code></pre>
<h4 id="speculation-control"><strong>Speculation Control</strong></h4>
<pre class="assembly"><code>; Mitigation for speculative execution vulnerabilities
; Indirect Branch Prediction Barrier
ibpb_barrier:
mov ecx, 0x49 ; IA32_PRED_CMD
mov eax, 1 ; IBPB command
xor edx, edx
wrmsr
ret
; Indirect Branch Restricted Speculation
enable_ibrs:
mov ecx, 0x48 ; IA32_SPEC_CTRL
rdmsr
or eax, 0x01 ; Set IBRS
wrmsr
ret
; Single Thread Indirect Branch Predictors
enable_stibp:
mov ecx, 0x48 ; IA32_SPEC_CTRL
rdmsr
or eax, 0x02 ; Set STIBP
wrmsr
ret
; Speculative Store Bypass Disable
enable_ssbd:
mov ecx, 0x48 ; IA32_SPEC_CTRL
rdmsr
or eax, 0x04 ; Set SSBD
wrmsr
ret
; Return Stack Buffer clearing
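rsb_clear:
; Stuff the RSB with 32 entries that all point at a speculation trap
mov ecx, 32 ; RSB depth (16 or 32 entries, depending on the CPU)
.fill_loop:
call .next ; Pushes the address of .trap onto the stack and into the RSB
.trap:
pause ; Only reached speculatively, by a RET that consumes a stuffed entry
lfence
jmp .trap
.next:
dec ecx
jnz .fill_loop
; Discard the 32 return addresses the calls pushed (no RET is executed for them)
add rsp, 32 * 8
ret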
; Microarchitectural Data Sampling mitigation
mds_clear:
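; Flush the L1 data cache (an L1TF mitigation; requires CPUID-enumerated L1D_FLUSH support)
mov ecx, 0x10B ; IA32_FLUSH_CMD
mov eax, 1 ; L1D_FLUSH command
xor edx, edx
wrmsr
; VERW with a memory operand overwrites the store, fill, and load-port buffers (requires MD_CLEAR)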
verw word [ds:verw_operand]
ret
verw_operand:
dw 0
; Context switch security
secure_context_switch:
; Clear speculation state
call ibpb_barrier
; Clear microarchitectural state
call mds_clear
; Switch page tables (KPTI)
mov rax, [next_task]
mov rax, [rax + task.cr3]
mov cr3, rax
; Clear sensitive registers
xor rax, rax
xor rcx, rcx
xor rdx, rdx
xor rsi, rsi
xor rdi, rdi
xor r8, r8
xor r9, r9
xor r10, r10
xor r11, r11
ret</code></pre>
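<p>Writing <code>IA32_SPEC_CTRL</code>, <code>IA32_PRED_CMD</code>, or
<code>IA32_FLUSH_CMD</code> on a processor that does not implement them
raises a general-protection fault, so the routines above should be gated
on a feature check. The sketch below reads the Intel enumeration in
CPUID leaf 7, subleaf 0, register EDX. The function name and the layout
of the returned bitmask are assumptions made for this example; AMD
processors report some of these features through a different leaf, which
is not covered here.</p>
<pre class="assembly"><code>; spec_ctrl_features (hypothetical helper)
; Out: EAX bit 0 = IBRS and IBPB supported
;      EAX bit 1 = STIBP supported
;      EAX bit 2 = SSBD supported
;      EAX bit 3 = MD_CLEAR supported (VERW clears CPU buffers)
;      EAX bit 4 = L1D_FLUSH supported
section .text
global spec_ctrl_features
spec_ctrl_features:
    push rbx                ; CPUID overwrites RBX, which is callee-saved
    xor eax, eax
    cpuid                   ; EAX = highest supported basic leaf
    cmp eax, 7
    jb .none
    mov eax, 7
    xor ecx, ecx            ; Subleaf 0
    cpuid
    mov r8d, edx            ; Feature flags of interest are in EDX
    xor eax, eax
    bt r8d, 26              ; IBRS and IBPB
    setc al                 ; Result bit 0
    bt r8d, 27              ; STIBP
    setc cl
    movzx ecx, cl
    shl ecx, 1
    or eax, ecx             ; Result bit 1
    bt r8d, 31              ; SSBD
    setc cl
    movzx ecx, cl
    shl ecx, 2
    or eax, ecx             ; Result bit 2
    bt r8d, 10              ; MD_CLEAR
    setc cl
    movzx ecx, cl
    shl ecx, 3
    or eax, ecx             ; Result bit 3
    bt r8d, 28              ; L1D_FLUSH
    setc cl
    movzx ecx, cl
    shl ecx, 4
    or eax, ecx             ; Result bit 4
    pop rbx
    ret
.none:
    xor eax, eax
    pop rbx
    ret</code></pre>
<p>CPUID is unprivileged, so this check can run once at boot and cache
its result. The MSR writes themselves still require ring 0.</p>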
<h3 id="secure-coding-practices"><strong>14.4 Secure Coding
Practices</strong></h3>
<h4 id="stack-protection"><strong>Stack Protection</strong></h4>
<pre class="assembly"><code>; Function with stack canary
secure_function:
push rbp
mov rbp, rsp
sub rsp, 0x100 ; Local variables
; Place canary
mov rax, [gs:stack_canary]
mov [rbp-8], rax
; Function body
; ... code ...
; Check canary before return
mov rax, [rbp-8]
cmp rax, [gs:stack_canary]
jne .stack_smash_detected
leave
ret
.stack_smash_detected:
call __stack_chk_fail
; Never returns
; ROP protection with CET
cet_protected_function:
endbr64 ; Valid indirect branch target
; Function prologue with shadow stack
push rbp
mov rbp, rsp
; Body
; ... code ...
; Return - automatically verified against shadow stack
leave
ret
; Control Flow Guard implementation
cfg_check:
; Before indirect call
; RCX = target address
push rax
push rdx
; Hash target address
mov rax, rcx
shr rax, 4 ; Align to 16 bytes
; Check in CFG bitmap
mov rdx, [cfg_bitmap_base]
bt [rdx], rax
jnc .invalid_target
pop rdx
pop rax
; Safe to call
call rcx
ret
.invalid_target:
; CFG violation
call cfg_violation_handler
; Never returns</code></pre>
<h4 id="secure-memory-operations"><strong>Secure Memory
Operations</strong></h4>
<pre class="assembly"><code>; Constant-time memory comparison
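constant_time_compare:
; RDI = ptr1, RSI = ptr2, RDX = length
; Returns RAX = 1 if the buffers are equal, 0 otherwise
xor eax, eax ; Difference accumulator
xor ecx, ecx ; Counter
test rdx, rdx
jz .finish ; Empty buffers compare equal (the length is not secret)
.loop:
mov r8b, [rdi + rcx]
xor r8b, [rsi + rcx]
or al, r8b ; Accumulate differences without branching on data
inc rcx
cmp rcx, rdx
jb .loop
.finish:
; AL = 0 if every byte matched, non-zero otherwise
test al, al
setz al
movzx eax, al
ret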
; Secure memory clear
secure_zero_memory:
; RDI = buffer, RSI = length
push rcx
; Clear in multiple passes
mov rax, 0x5555555555555555 ; Initial pattern (set once so the rotate below takes effect)
mov rcx, 3 ; Number of passes
.pass_loop:
push rcx
push rdi
push rsi
; Overwrite with pattern
mov rcx, rsi
shr rcx, 3 ; Convert to qwords
rep stosq
; Handle remaining bytes (AL already holds the low byte of the pattern)
mov rcx, rsi
and rcx, 7
rep stosb
pop rsi
pop rdi
pop rcx
; Different pattern each pass (alternates between 0x55.. and 0xAA..)
ror rax, 1
loop .pass_loop
; Final zero pass
xor rax, rax
mov rcx, rsi
shr rcx, 3
rep stosq
mov rcx, rsi
and rcx, 7
rep stosb
; Memory barrier
mfence
pop rcx
ret
; Bounds checking
checked_array_access:
; RDI = array base, RSI = index, RDX = element size
; RCX = array bound
; Check bounds
cmp rsi, rcx
jae .out_of_bounds
; Calculate address with overflow check
mov rax, rsi
mul rdx
jo .overflow
add rax, rdi
jc .overflow
; Safe to access
ret
.out_of_bounds:
.overflow:
; Return error
mov rax, -1
ret</code></pre>
<hr />
<p>This chapter covered hardware security features including NX,
SMAP/SMEP, CET, and SGX, along with virtualization technologies like
VT-x and EPT. We explored secure coding practices, speculation
mitigations, and memory protection techniques essential for writing
secure x86-64 code. The examples demonstrate practical implementations
of these security features at the assembly level.</p>
<hr />
<h2 id="chapter-15-performance-optimization-techniques">Chapter 15:
Performance Optimization Techniques</h2>
<h3 id="microarchitectural-optimization-fundamentals">15.1
Microarchitectural Optimization Fundamentals</h3>
<h4 id="understanding-the-modern-x86-64-pipeline">Understanding the
Modern x86-64 Pipeline</h4>
<p>Modern x86-64 processors are superscalar, out-of-order execution
machines with deep pipelines (typically 14-20 stages). Key components
affecting performance:</p>
<pre class="assembly"><code>; Pipeline stages example - Intel Skylake
; 1. Instruction Fetch (IF)
; 2. Instruction Length Decode (ILD)
; 3. Instruction Queue (IQ)
; 4. Instruction Decode (ID) - up to 4 instructions/cycle
; 5. Micro-op Queue
; 6. Register Rename/Allocate
; 7. Scheduler (RS - Reservation Station)
; 8. Execution Units (Multiple ports)
; 9. Writeback
; 10. Retire/Commit</code></pre>
<h4 id="execution-ports-and-throughput">Execution Ports and
Throughput</h4>
<p>Intel processors have multiple execution ports (Skylake has 8
ports):</p>
<pre class="assembly"><code>; Port utilization example
; Port 0: ALU, MUL, DIV, Branch
; Port 1: ALU, MUL, Fast LEA
; Port 2/3: Load
; Port 4: Store Data
; Port 5: ALU, Shuffle
; Port 6: ALU, Branch
; Port 7: Store Address
; Optimized code distributes across ports
vmulps ymm0, ymm1, ymm2 ; Port 0/1
vaddps ymm3, ymm4, ymm5 ; Port 0/1 on Skylake (port 1 only on Haswell)
vmovaps [rdi], ymm6 ; Store data on port 4, store address on port 2/3/7
; These can execute in parallel because none of them reads another&#39;s result</code></pre>
<h3 id="branch-prediction-optimization">15.2 Branch Prediction
Optimization</h3>
<h4 id="static-branch-prediction">Static Branch Prediction</h4>
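<pre class="assembly"><code>; Code layout for likely/unlikely branches
; Forward conditional branches are statically predicted not-taken on CPUs that
; still use static prediction, so the hot path should be the fall-through.
section .text
cmp rax, rbx
jle .unlikely_path ; Forward branch, rarely taken
; hot path code...
jmp .continue
.unlikely_path:
; cold code...
.continue: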
; Using conditional moves to avoid branches
cmp eax, ebx
mov ecx, 0 ; Default value
mov edx, 1 ; Alternative value
cmovg ecx, edx ; No branch, no misprediction penalty</code></pre>
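<p>A slightly larger use of the same idea is a loop whose only
data-dependent decision is replaced by <code>cmovg</code>. The routine
below is a sketch written for this section (the name
<code>max_i64</code> and its calling convention are assumptions); it
returns the largest element of a signed 64-bit array.</p>
<pre class="assembly"><code>; max_i64 (hypothetical example, System V AMD64 calling convention)
; In:  RDI = pointer to an array of signed 64-bit integers
;      RSI = element count (must be at least 1)
; Out: RAX = largest element
section .text
global max_i64
max_i64:
    mov rax, [rdi]          ; Current maximum = first element
    mov ecx, 1              ; Index of the next element to examine
    jmp .check
.loop:
    mov rdx, [rdi + rcx*8]
    cmp rdx, rax
    cmovg rax, rdx          ; Take RDX if it is greater (signed); no branch on data
    inc rcx
.check:
    cmp rcx, rsi
    jb .loop                ; Loop branch is highly predictable
    ret</code></pre>
<p>The trade-off is that <code>cmovg</code> makes every iteration depend
on the previous maximum. When the comparison is already well predicted,
a plain branch can be faster, so this is worth measuring rather than
assuming.</p>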
<h4 id="loop-optimization-and-unrolling">Loop Optimization and
Unrolling</h4>
<pre class="assembly"><code>; Original loop - 1 iteration per cycle best case
.loop:
movsd xmm0, [rsi]
mulsd xmm0, xmm7
movsd [rdi], xmm0
add rsi, 8
add rdi, 8
dec rcx
jnz .loop
; Unrolled 4x (assumes RCX is a multiple of 4); independent registers let loads, multiplies, and stores overlap
.loop_unrolled:
movsd xmm0, [rsi]
movsd xmm1, [rsi+8]
movsd xmm2, [rsi+16]
movsd xmm3, [rsi+24]
mulsd xmm0, xmm7
mulsd xmm1, xmm7
mulsd xmm2, xmm7
mulsd xmm3, xmm7
movsd [rdi], xmm0
movsd [rdi+8], xmm1
movsd [rdi+16], xmm2
movsd [rdi+24], xmm3
add rsi, 32
add rdi, 32
sub rcx, 4
jnz .loop_unrolled</code></pre>
<h3 id="memory-access-optimization">15.3 Memory Access Optimization</h3>
<h4 id="cache-line-optimization">Cache Line Optimization</h4>
<pre class="assembly"><code>; Cache line is 64 bytes on modern x86-64
; Align data structures to cache line boundaries
section .data
align 64
matrix: times 1024 dq 0.0 ; 64-byte aligned
section .text
; Prefetching for sequential access
.prefetch_loop:
prefetcht0 [rsi + 256] ; Prefetch 4 cache lines ahead
prefetcht0 [rsi + 320]
; Process current cache line
vmovaps ymm0, [rsi]
vmovaps ymm1, [rsi+32]
; ... processing ...
add rsi, 64
dec rcx
jnz .prefetch_loop</code></pre>
<h4 id="non-temporal-stores-streaming-stores">Non-Temporal Stores
(Streaming Stores)</h4>
<pre class="assembly"><code>; Bypass cache for large streaming writes
; Reduces cache pollution
.streaming_copy:
vmovaps ymm0, [rsi]
vmovaps ymm1, [rsi+32]
vmovaps ymm2, [rsi+64]
vmovaps ymm3, [rsi+96]
; Non-temporal stores bypass cache
vmovntps [rdi], ymm0
vmovntps [rdi+32], ymm1
vmovntps [rdi+64], ymm2
vmovntps [rdi+96], ymm3
add rsi, 128
add rdi, 128
sub rcx, 128
jnz .streaming_copy
sfence ; Ensure stores complete before continuing</code></pre>
<h3 id="simd-vectorization-techniques">15.4 SIMD Vectorization
Techniques</h3>
<h4 id="auto-vectorization-patterns">Auto-Vectorization Patterns</h4>
<pre class="assembly"><code>; Scalar addition loop
scalar_add:
movss xmm0, [rsi]
addss xmm0, [rdx]
movss [rdi], xmm0
add rsi, 4
add rdx, 4
add rdi, 4
dec rcx
jnz scalar_add
; Vectorized with AVX2
vector_add_avx2:
vmovaps ymm0, [rsi] ; Load 8 floats
vaddps ymm0, ymm0, [rdx] ; Add 8 floats
vmovaps [rdi], ymm0 ; Store 8 floats
add rsi, 32
add rdx, 32
add rdi, 32
sub rcx, 8
jnz vector_add_avx2
; AVX-512 version with masking for remainder
vector_add_avx512:
mov rax, rcx
and rax, ~15 ; Process 16 elements at a time
jz .remainder
.main_loop:
vmovaps zmm0, [rsi] ; Load 16 floats
vaddps zmm0, zmm0, [rdx]
vmovaps [rdi], zmm0
add rsi, 64
add rdx, 64
add rdi, 64
sub rax, 16
jnz .main_loop
.remainder:
and rcx, 15 ; Remainder count
jz .done
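mov eax, -1
bzhi eax, eax, ecx ; Keep the low RCX bits set (BMI2); avoids clobbering callee-saved RBX
kmovw k1, eax ; A 16-lane mask needs only AVX-512F (kmovq requires AVX-512BW)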
vmovaps zmm0{k1}{z}, [rsi]
vaddps zmm0{k1}, zmm0, [rdx]
vmovaps [rdi]{k1}, zmm0
.done:</code></pre>
<h4 id="fma-fused-multiply-add-optimization">FMA (Fused Multiply-Add)
Optimization</h4>
<pre class="assembly"><code>; SAXPY: Y = a*X + Y
; Scalar version
saxpy_scalar:
movss xmm0, [rsi]
mulss xmm0, xmm7 ; xmm7 contains scalar &#39;a&#39;
addss xmm0, [rdi]
movss [rdi], xmm0
; FMA version - single instruction, better accuracy
saxpy_fma:
vmovaps ymm0, [rdi]
vfmadd231ps ymm0, ymm7, [rsi] ; ymm0 = ymm7 * [rsi] + ymm0 (ymm7 = a broadcast to all lanes)
vmovaps [rdi], ymm0</code></pre>
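<p>The fragments above process a single element or a single vector. A
complete routine also has to broadcast the scalar, loop over the array,
and handle a tail that is not a multiple of eight. The following is one
possible sketch under stated assumptions: the name
<code>saxpy_fma</code>, the argument order, and the use of unaligned
loads are choices made for this example, and the CPU is assumed to
support AVX2 and FMA3.</p>
<pre class="assembly"><code>; saxpy_fma (hypothetical routine, System V AMD64 calling convention)
; C view: void saxpy_fma(float a, float *y, const float *x, size_t n);
; In:  XMM0 = a
;      RDI  = y (updated in place), RSI = x, RDX = n
; Effect: y[i] = a * x[i] + y[i] for 0 &lt;= i &lt; n
section .text
global saxpy_fma
saxpy_fma:
    vbroadcastss ymm1, xmm0         ; ymm1 = a in all 8 lanes (register source needs AVX2)
    mov rcx, rdx
    shr rcx, 3                      ; Number of full 8-float blocks
    jz .tail
.vec_loop:
    vmovups ymm2, [rdi]             ; ymm2 = y[i..i+7]
    vfmadd231ps ymm2, ymm1, [rsi]   ; ymm2 = ymm1 * x[i..i+7] + ymm2
    vmovups [rdi], ymm2
    add rsi, 32
    add rdi, 32
    dec rcx
    jnz .vec_loop
.tail:
    and edx, 7                      ; Remaining 0..7 elements
    jz .done
.tail_loop:
    vmovss xmm2, [rdi]              ; xmm2 = y[i]
    vfmadd231ss xmm2, xmm0, [rsi]   ; xmm2 = a * x[i] + y[i]
    vmovss [rdi], xmm2
    add rsi, 4
    add rdi, 4
    dec edx
    jnz .tail_loop
.done:
    vzeroupper                      ; Avoid AVX/SSE transition penalties in the caller
    ret</code></pre>
<p>The 231 form is used because the accumulator (the loaded
<code>y</code> values) is the destination operand, which matches the
in-place update that SAXPY performs.</p>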
<h3 id="instruction-level-parallelism">15.5 Instruction-Level
Parallelism</h3>
<h4 id="dependency-chain-breaking">Dependency Chain Breaking</h4>
<pre class="assembly"><code>; Poor: Long dependency chain
mov rax, [rsi]
add rax, 1
add rax, [rsi+8]
add rax, [rsi+16]
add rax, [rsi+24]
; Better: Multiple accumulator chains
mov rax, [rsi]
mov rbx, [rsi+8]
mov rcx, [rsi+16]
mov rdx, [rsi+24]
add rax, 1
add rax, rbx
add rcx, rdx
add rax, rcx ; Shorter critical path</code></pre>
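<p>The same technique applies to reductions over an array, where a
single accumulator would serialize every addition. The sketch below
keeps two independent running sums and combines them at the end; the
function name <code>sum_i64</code> is an assumption for this
example.</p>
<pre class="assembly"><code>; sum_i64 (hypothetical example, System V AMD64 calling convention)
; In:  RDI = pointer to an array of 64-bit integers
;      RSI = element count
; Out: RAX = sum of all elements (wraps on overflow)
section .text
global sum_i64
sum_i64:
    xor eax, eax            ; Accumulator 0
    xor edx, edx            ; Accumulator 1
    mov rcx, rsi
    shr rcx, 1              ; Number of element pairs
    jz .odd
.pair_loop:
    add rax, [rdi]          ; Chain 0
    add rdx, [rdi + 8]      ; Chain 1, independent of chain 0
    add rdi, 16
    dec rcx
    jnz .pair_loop
.odd:
    test esi, 1             ; One element left over?
    jz .combine
    add rax, [rdi]
.combine:
    add rax, rdx            ; Merge the two chains once, outside the loop
    ret</code></pre>
<p>Two chains are enough to illustrate the idea. For operations with
longer latency, such as floating-point adds or FMAs, more accumulators
are usually needed to keep the execution units busy.</p>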
<h4 id="software-pipelining">Software Pipelining</h4>
<pre class="assembly"><code>; Matrix multiplication with software pipelining
; Overlap loads with computation
matrix_multiply_optimized:
; Preload first iteration
vmovaps zmm0, [rsi]
vmovaps zmm1, [rdx]
.loop:
; Current iteration computation
vfmadd231ps zmm16, zmm0, zmm1
; Prefetch next iteration while computing
vmovaps zmm2, [rsi + 64]
vmovaps zmm3, [rdx + 64]
vfmadd231ps zmm17, zmm0, zmm3
vfmadd231ps zmm18, zmm2, zmm1
; Move to next iteration
vmovaps zmm0, zmm2
vmovaps zmm1, zmm3
add rsi, 64
add rdx, 64
dec rcx
jnz .loop</code></pre>
<h3 id="code-size-and-alignment-optimization">15.6 Code Size and
Alignment Optimization</h3>
<h4 id="function-and-loop-alignment">Function and Loop Alignment</h4>
<pre class="assembly"><code>; Align functions to 16-byte boundaries
align 16
optimized_function:
; Function code...
; Align hot loops to 32-byte boundaries for fetch
align 32
.hot_loop:
; Critical loop body
; Keep under 32 bytes if possible for μop cache</code></pre>
<h4 id="instruction-selection-for-size">Instruction Selection for
Size</h4>
<pre class="assembly"><code>; Prefer shorter encodings
xor eax, eax ; 2 bytes - better than mov eax, 0 (5 bytes)
inc rsi ; 3 bytes - leaves CF unchanged (partial flag update)
add rsi, 1 ; 4 bytes - writes all flags; unlike inc, can macro-fuse with CF-based branches (jb, jae)
; Use VEX encoding when beneficial
vxorps xmm0, xmm0, xmm0 ; Clears upper bits, avoids transition penalty</code></pre>
<h3 id="profile-guided-optimization">15.7 Profile-Guided
Optimization</h3>
<h4 id="using-performance-counters">Using Performance Counters</h4>
<pre class="assembly"><code>; Example: Measuring cache misses
; Use rdpmc instruction to read performance counters
read_perf_counter:
mov ecx, 0 ; Counter index
rdpmc ; Read counter into EDX:EAX
shl rdx, 32
or rax, rdx ; Full 64-bit count in RAX
ret
; Instrument critical section
call read_perf_counter
mov [start_count], rax
; Critical code section
call optimized_function
call read_perf_counter
sub rax, [start_count]
; RAX now contains event count</code></pre>
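<p><code>rdpmc</code> only works from user mode when the operating
system has set <code>CR4.PCE</code> and programmed the selected counter.
When that is not available, the time-stamp counter is a common fallback.
The sketch below times one call with <code>rdtsc</code> and
<code>rdtscp</code>, fenced so that the measured code does not overlap
the reads. The name <code>time_call</code> is an assumption, and
<code>rdtscp</code> support is assumed as well.</p>
<pre class="assembly"><code>; time_call (hypothetical helper, System V AMD64 calling convention)
; In:  RDI = address of a function that takes no arguments
; Out: RAX = elapsed time-stamp counter ticks around the call
section .text
global time_call
time_call:
    push rbx
    push r12
    sub rsp, 8              ; Keep RSP 16-byte aligned at the call below
    mov rbx, rdi
    lfence                  ; Wait for earlier instructions to finish
    rdtsc                   ; EDX:EAX = start timestamp
    shl rdx, 32
    or rax, rdx
    mov r12, rax
    lfence                  ; Keep the measured code from starting early
    call rbx
    rdtscp                  ; Waits for earlier instructions; also writes ECX
    shl rdx, 32
    or rax, rdx
    lfence                  ; Keep later instructions from starting early
    sub rax, r12
    add rsp, 8
    pop r12
    pop rbx
    ret</code></pre>
<p>The TSC ticks at a fixed reference frequency on current processors,
so the result measures elapsed time rather than core clock cycles.
Repeating the measurement and taking the minimum helps filter out
interrupts and cold-cache effects.</p>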
<h3 id="practical-optimization-example">15.8 Practical Optimization
Example</h3>
<p>Here’s a complete optimized memory copy function using various
techniques:</p>
<pre class="assembly"><code>; High-performance memory copy
; RDI = destination, RSI = source, RDX = size in bytes
global fast_memcpy
fast_memcpy:
cmp rdx, 64
jb .small_copy
cmp rdx, 2048
jb .medium_copy
.large_copy:
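; Use non-temporal stores for large copies
; (vmovntdq faults unless RDI is 32-byte aligned; the caller must guarantee this)
mov rcx, rdx
shr rcx, 6 ; Divide by 64
.large_loop:
prefetchnta [rsi + 256]
vmovdqu ymm0, [rsi] ; Unaligned loads: the source alignment is unknown
vmovdqu ymm1, [rsi + 32]
vmovntdq [rdi], ymm0
vmovntdq [rdi + 32], ymm1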
add rsi, 64
add rdi, 64
dec rcx
jnz .large_loop
sfence
and rdx, 63 ; Handle remainder
jz .done
.medium_copy:
; Use regular moves for medium sizes
mov rcx, rdx
shr rcx, 5 ; Divide by 32
jz .small_copy
.medium_loop:
vmovdqu ymm0, [rsi]
vmovdqu [rdi], ymm0
add rsi, 32
add rdi, 32
dec rcx
jnz .medium_loop
and rdx, 31
.small_copy:
; Handle remainder with scalar moves
mov rcx, rdx
rep movsb
.done:
vzeroupper ; Clear upper YMM bits
ret</code></pre>
<h3 id="performance-analysis-tools">15.9 Performance Analysis Tools</h3>
<h4 id="intel-vtune-profiler-integration">Intel VTune Profiler
Integration</h4>
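<pre class="assembly"><code>; Markers for VTune analysis (sketch)
; ittnotify.h is a C header, so NASM cannot %include it. The ITT calls below are
; assumed to be reachable through C wrapper functions exported under these names.
extern __itt_domain_create
extern __itt_task_begin
extern __itt_task_end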
section .data
domain_name db &quot;MyApp&quot;, 0
task_name db &quot;CriticalLoop&quot;, 0
section .text
; Create domain
lea rdi, [domain_name]
call __itt_domain_create
mov [domain_handle], rax
; Start task
lea rdi, [task_name]
mov rsi, [domain_handle]
call __itt_task_begin
; Critical code here
call optimized_function
; End task
mov rdi, [domain_handle]
call __itt_task_end</code></pre>
<h3 id="summary">Summary</h3>
<p>Performance optimization in x86-64 assembly requires
understanding:</p>
<ol type="1">
<li><p><strong>Microarchitecture</strong>: Pipeline stages, execution
ports, and μop cache</p></li>
<li><p><strong>Memory Hierarchy</strong>: Cache lines, prefetching, and
bandwidth limitations</p></li>
<li><p><strong>Instruction Selection</strong>: Choosing optimal
instructions for specific scenarios</p></li>
<li><p><strong>Parallelism</strong>: Both instruction-level and
data-level (SIMD)</p></li>
<li><p><strong>Measurement</strong>: Using performance counters and
profiling tools</p></li>
</ol>
<p>Key principles:</p>
<ul>
<li><p>Minimize branch mispredictions</p></li>
<li><p>Maximize instruction-level parallelism</p></li>
<li><p>Optimize memory access patterns</p></li>
<li><p>Use SIMD effectively</p></li>
<li><p>Profile and measure actual performance</p></li>
</ul>
<p>Modern compilers handle many optimizations automatically, but
critical hot paths often benefit from hand-tuned assembly, especially
when using advanced features like AVX-512 or specialized
instructions.</p>
<h3 id="exercises">Exercises</h3>
<ol type="1">
<li><p>Implement an optimized matrix transpose using AVX-512</p></li>
<li><p>Write a branch-free binary search using conditional
moves</p></li>
<li><p>Create a high-performance string comparison function using
SIMD</p></li>
<li><p>Optimize a hash table lookup to minimize cache misses</p></li>
<li><p>Profile and optimize a real-world function using performance
counters</p></li>
</ol>
<p>Next, Chapter 16: Code Generation and Compiler Backend covers how
compilers generate x86-64 machine code, along with JIT compilation and
dynamic code generation.</p>
<hr />
<h2 id="chapter-16-code-generation-and-compiler-backend">Chapter 16:
Code Generation and Compiler Backend</h2>
<p>Building on our performance optimization techniques, this chapter
explores how compilers generate x86-64 machine code and how to implement
compiler backends, JIT compilation, and dynamic code generation.</p>
<h3 id="compiler-architecture-overview">16.1 Compiler Architecture
Overview</h3>
<h4 id="compilation-pipeline">Compilation Pipeline</h4>
<div class="sourceCode" id="cb210"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb210-1"><a href="#cb210-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Typical compiler phases</span></span>
<span id="cb210-2"><a href="#cb210-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Source Code → Lexer → Parser → AST → IR → Optimization → Code Gen → Assembly</span></span></code></pre></div>
<p>The backend focuses on the final stages: IR (Intermediate
Representation) to assembly code generation.</p>
<pre class="assembly"><code>; Example: Simple expression tree to x86-64
; Expression: (a + b) * (c - d)
; Assuming a=RDI, b=RSI, c=RDX, d=RCX
code_gen_expression:
; Generate code for left subtree (a + b)
mov rax, rdi ; Load a
add rax, rsi ; Add b
; Generate code for right subtree (c - d)
mov r10, rdx ; Load c
sub r10, rcx ; Subtract d
; Combine results
imul rax, r10 ; Multiply results
ret</code></pre>
<h3 id="register-allocation">16.2 Register Allocation</h3>
<h4 id="graph-coloring-algorithm">Graph Coloring Algorithm</h4>
<pre class="assembly"><code>; Register allocation example
; Variables: v1, v2, v3, v4, v5
; Live ranges determine interference graph
section .text
; Before register allocation (pseudo-code)
; LOAD v1, [mem1]
; LOAD v2, [mem2]
; ADD v3, v1, v2
; LOAD v4, [mem3]
; MUL v5, v3, v4
; STORE [result], v5
; After register allocation
register_allocated:
mov rax, [mem1] ; v1 → RAX
mov rbx, [mem2] ; v2 → RBX
add rax, rbx ; v3 → RAX (reuse v1&#39;s register)
mov rcx, [mem3] ; v4 → RCX
imul rax, rcx ; v5 → RAX (reuse v3&#39;s register)
mov [result], rax
ret</code></pre>
<h4 id="spill-code-generation">Spill Code Generation</h4>
<pre class="assembly"><code>; When registers are exhausted, spill to stack
spill_example:
push rbp
mov rbp, rsp
sub rsp, 32 ; Allocate spill slots
; Too many live variables for available registers
mov rax, [input1]
mov rbx, [input2]
mov rcx, [input3]
mov rdx, [input4]
mov rsi, [input5]
mov rdi, [input6]
mov r8, [input7]
mov r9, [input8]
; Need more variables - spill
mov [rbp-8], rax ; Spill v1
mov rax, [input9] ; Load v9
; ... use rax ...
mov [rbp-16], rax ; Spill v9
mov rax, [rbp-8] ; Reload v1
mov rsp, rbp
pop rbp
ret</code></pre>
<h3 id="instruction-selection">16.3 Instruction Selection</h3>
<h4 id="pattern-matching-and-tiling">Pattern Matching and Tiling</h4>
<pre class="assembly"><code>; Instruction selector patterns
; Pattern: memory_operand + register → single instruction
; Naive code generation
naive_add:
mov rax, [mem_addr]
add rax, rbx
mov [mem_addr], rax
; Optimized selection using memory operands
optimized_add:
add [mem_addr], rbx ; Single instruction
; Complex addressing mode selection
; Array access: arr[i*8 + j]
array_access:
; Could generate:
mov rax, rsi ; i
shl rax, 3 ; i*8
add rax, rdx ; i*8 + j
mov rcx, [rdi + rax]; arr[i*8 + j]
; Better selection:
mov rcx, [rdi + rsi*8 + rdx] ; Single instruction</code></pre>
<h4 id="peephole-optimization">Peephole Optimization</h4>
<pre class="assembly"><code>; Common peephole patterns
section .text
; Pattern: Push followed by pop
; Before:
push rax
pop rbx
; After:
mov rbx, rax
; Pattern: Redundant moves
; Before:
mov rax, rbx
mov rbx, rax
; After:
mov rax, rbx
; Pattern: Constant folding
; Before:
mov rax, 5
add rax, 3
; After:
mov rax, 8
; Pattern: Strength reduction
; Before:
imul rax, 2
; After:
add rax, rax ; Or: shl rax, 1</code></pre>
<h3 id="jit-compilation-implementation">16.4 JIT Compilation
Implementation</h3>
<h4 id="basic-jit-compiler-structure">Basic JIT Compiler Structure</h4>
<div class="sourceCode" id="cb216"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb216-1"><a href="#cb216-1" aria-hidden="true" tabindex="-1"></a><span class="co">// C structure for JIT compiler</span></span>
<span id="cb216-2"><a href="#cb216-2" aria-hidden="true" tabindex="-1"></a><span class="kw">typedef</span> <span class="kw">struct</span> <span class="op">{</span></span>
<span id="cb216-3"><a href="#cb216-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">uint8_t</span><span class="op">*</span> code_buffer<span class="op">;</span></span>
<span id="cb216-4"><a href="#cb216-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> buffer_size<span class="op">;</span></span>
<span id="cb216-5"><a href="#cb216-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> current_pos<span class="op">;</span></span>
<span id="cb216-6"><a href="#cb216-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> JitCompiler<span class="op">;</span></span></code></pre></div>
<pre class="assembly"><code>; Runtime code generation example
; Generate function dynamically that adds constant
section .bss
alignb 4096 ; mprotect requires a page-aligned address
code_buffer: resb 4096 ; resb belongs in .bss (uninitialized data)
section .text
generate_adder:
; Input: RDI = constant to add
; Output: RAX = pointer to generated function
lea rax, [code_buffer]
; Generate: mov rax, rdi
mov byte [rax], 0x48
mov byte [rax+1], 0x89
mov byte [rax+2], 0xF8
; Generate: add rax, CONSTANT
mov byte [rax+3], 0x48
mov byte [rax+4], 0x05
mov dword [rax+5], edi ; Embed constant
; Generate: ret
mov byte [rax+9], 0xC3
; Make code executable
mov rdi, rax
mov rsi, 4096
mov rdx, 7 ; PROT_READ | PROT_WRITE | PROT_EXEC
call mprotect
lea rax, [code_buffer]
ret</code></pre>
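<p>Some hardened systems refuse mappings that are writable and
executable at the same time. A common alternative is to write the code
into a read-write page and then switch it to read-execute. The following
sketch does this with raw Linux x86-64 system calls, so it needs no
libc. The name <code>make_adder</code> is an assumption for this
example, and the system call numbers and flag values are specific to
Linux on x86-64.</p>
<pre class="assembly"><code>; make_adder (hypothetical, Linux x86-64 only)
; In:  EDI = 32-bit constant
; Out: RAX = pointer to a generated function f(x) = x + constant,
;            or 0 on failure
section .text
global make_adder
make_adder:
    push rbx
    push r12
    mov ebx, edi                ; Keep the constant in a callee-saved register
    ; mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
    mov eax, 9                  ; SYS_mmap
    xor edi, edi
    mov esi, 4096
    mov edx, 3                  ; PROT_READ | PROT_WRITE
    mov r10d, 0x22              ; MAP_PRIVATE | MAP_ANONYMOUS
    mov r8, -1
    xor r9d, r9d
    syscall
    cmp rax, -4096              ; Errors come back as -errno in [-4095, -1]
    ja .fail
    mov r12, rax
    ; Emit: mov rax, rdi (48 89 F8)
    mov byte [r12], 0x48
    mov byte [r12 + 1], 0x89
    mov byte [r12 + 2], 0xF8
    ; Emit: add rax, imm32 (48 05 id); the immediate is sign-extended to 64 bits
    mov byte [r12 + 3], 0x48
    mov byte [r12 + 4], 0x05
    mov dword [r12 + 5], ebx
    ; Emit: ret (C3)
    mov byte [r12 + 9], 0xC3
    ; mprotect(buf, 4096, PROT_READ|PROT_EXEC): drop write, add execute
    mov eax, 10                 ; SYS_mprotect
    mov rdi, r12
    mov esi, 4096
    mov edx, 5                  ; PROT_READ | PROT_EXEC
    syscall
    test rax, rax
    jnz .unmap
    mov rax, r12
    jmp .out
.unmap:
    mov eax, 11                 ; SYS_munmap
    mov rdi, r12
    mov esi, 4096
    syscall
.fail:
    xor eax, eax
.out:
    pop r12
    pop rbx
    ret</code></pre>
<p>A caller would load the constant into EDI, call
<code>make_adder</code>, check that RAX is non-zero, and then call
through RAX with the argument in RDI. With a constant of 5 and an
argument of 37, the generated code computes 42.</p>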
<h4 id="advanced-jit-with-templates">Advanced JIT with Templates</h4>
<pre class="assembly"><code>; Template-based code generation
section .data
; Template for comparison function
template_start:
cmp rdi, 0x12345678 ; Placeholder for constant
jg .greater
jl .less
xor eax, eax ; Equal
ret
.greater:
mov eax, 1
ret
.less:
mov eax, -1
ret
template_end:
section .text
generate_comparator:
; Input: RDI = constant to compare against
; Copy template
mov rsi, template_start
mov rcx, template_end - template_start
lea rdx, [code_buffer]
.copy_loop:
mov al, [rsi]
mov [rdx], al
inc rsi
inc rdx
dec rcx
jnz .copy_loop
; Patch constant
lea rax, [code_buffer + 3] ; cmp rdi, imm32 encodes as 48 81 FF id, so the constant starts at offset 3
mov [rax], edi
; Return pointer to generated code
lea rax, [code_buffer]
ret</code></pre>
<h3 id="dynamic-binary-translation">16.5 Dynamic Binary Translation</h3>
<h4 id="self-modifying-code">Self-Modifying Code</h4>
<pre class="assembly"><code>; Self-modifying code example
; Dynamically optimize based on runtime behavior
section .data
branch_counter: dq 0
threshold: dq 1000
section .text
adaptive_branch:
; Increment counter
inc qword [branch_counter]
; Check if we should optimize
mov rax, [branch_counter]
cmp rax, [threshold]
jl .normal_path
; Rewrite this code based on statistics
call optimize_hot_path
.normal_path:
; Original code
test rdi, rdi
jz .zero_case
jmp .nonzero_case
.zero_case:
; Handle zero
ret
.nonzero_case:
; Handle non-zero
ret
optimize_hot_path:
; Analyze branch statistics and rewrite
; the jump instruction to use likely path
push rbp
mov rbp, rsp
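; Change page permissions for writing (mprotect needs a page-aligned address)
lea rdi, [adaptive_branch]
and rdi, -4096
mov rsi, 8192 ; Two pages, in case the code straddles a page boundary
mov rdx, 7 ; PROT_READ | PROT_WRITE | PROT_EXEC
call mprotect
; Rewrite branch (example: change jz to jnz)
; test rdi, rdi is 3 bytes (48 85 FF), so the short JZ opcode (0x74) sits at offset 3
lea rax, [adaptive_branch.normal_path]
mov byte [rax+3], 0x75 ; Change to JNZ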
mov rsp, rbp
pop rbp
ret</code></pre>
<h3 id="machine-code-encoding">16.6 Machine Code Encoding</h3>
<h4 id="x86-64-instruction-encoding">x86-64 Instruction Encoding</h4>
<pre class="assembly"><code>; Understanding x86-64 encoding
; Format: [Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]
section .text
encode_instruction:
; Example: Encode &quot;add rax, rbx&quot; manually
; REX prefix: 0x48 (W=1 for 64-bit)
; Opcode: 0x01 (ADD r/m64, r64)
; ModR/M: 0xD8 (mod=11, reg=011, r/m=000)
db 0x48, 0x01, 0xD8
; Example: Encode &quot;mov r13, [r14 + rax*8 + 0x100]&quot;
; REX: 0x4D (W=1, R=1, B=1)
; Opcode: 0x8B (MOV r64, r/m64)
; ModR/M: 0xAC (mod=10, reg=101, r/m=100)
; SIB: 0xC6 (scale=11, index=000, base=110)
; Disp32: 0x00000100 (stored little-endian as 00 01 00 00)
db 0x4D, 0x8B, 0xAC, 0xC6
dd 0x100</code></pre>
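<p>A practical way to check hand-derived encodings is to place the raw
bytes next to the same instructions written as mnemonics and compare the
two in a disassembly. The sketch below adds two more worked examples in
that form. The labels are assumptions for this example, and the code is
meant to be assembled and inspected (for instance with
<code>objdump -d</code>), not called.</p>
<pre class="assembly"><code>; Hand-encoded bytes followed by the assembler&#39;s own encoding
section .text
hand_encoded:
    ; add r8, r9
    ;   REX    = 0x4D (W=1, R=1 extends reg to r9, X=0, B=1 extends r/m to r8)
    ;   Opcode = 0x01 (ADD r/m64, r64)
    ;   ModR/M = 0xC8 (mod=11, reg=001, r/m=000)
    db 0x4D, 0x01, 0xC8
    ; mov rax, [rbx + 8]
    ;   REX    = 0x48 (W=1)
    ;   Opcode = 0x8B (MOV r64, r/m64)
    ;   ModR/M = 0x43 (mod=01 for disp8, reg=000 for rax, r/m=011 for rbx)
    ;   Disp8  = 0x08
    db 0x48, 0x8B, 0x43, 0x08
assembled:
    add r8, r9
    mov rax, [rbx + 8]</code></pre>
<p>If the derivation is right, the bytes at <code>hand_encoded</code>
and <code>assembled</code> are identical. Note that an assembler is free
to pick a different but equivalent encoding where one exists, such as
opcode 0x03 with the operands swapped for a register-to-register
ADD.</p>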
<h4 id="building-an-assembler">Building an Assembler</h4>
<pre class="assembly"><code>; Simple assembler implementation
section .data
mnemonic_table:
db &quot;ADD&quot;, 0, 0x01
db &quot;SUB&quot;, 0, 0x29
db &quot;MOV&quot;, 0, 0x89
db &quot;CMP&quot;, 0, 0x39
section .text
assemble_instruction:
; Input: RSI = instruction string
; Output: RDI = encoded bytes
push rbp
mov rbp, rsp
sub rsp, 32
; Parse mnemonic
call parse_mnemonic
mov [rbp-8], rax ; Store opcode
; Parse operands
call parse_operands
mov [rbp-16], rax ; Store ModR/M byte
; Generate REX prefix if needed
call generate_rex
mov [rbp-24], rax
; Emit instruction
mov rdi, [output_buffer]
; Write REX if present
cmp byte [rbp-24], 0
je .no_rex
mov al, [rbp-24]
stosb
.no_rex:
; Write opcode
mov al, [rbp-8]
stosb
; Write ModR/M
mov al, [rbp-16]
stosb
mov rsp, rbp
pop rbp
ret</code></pre>
<h3 id="optimization-pass-implementation">16.7 Optimization Pass
Implementation</h3>
<h4 id="dead-code-elimination">Dead Code Elimination</h4>
<pre class="assembly"><code>; Dead code elimination pass
; Analyze and remove unreachable code
dead_code_elimination:
push rbp
mov rbp, rsp
; Build control flow graph
call build_cfg
; Mark reachable blocks
lea rdi, [entry_block]
call mark_reachable
; Remove unmarked blocks
lea rsi, [block_list]
.remove_loop:
mov rax, [rsi]
test rax, rax
jz .done
; Check if marked
test byte [rax + block.flags], REACHABLE
jnz .keep
; Remove block
call remove_block
.keep:
add rsi, 8
jmp .remove_loop
.done:
mov rsp, rbp
pop rbp
ret</code></pre>
<h4 id="constant-propagation">Constant Propagation</h4>
<pre class="assembly"><code>; Constant propagation implementation
constant_propagation:
push rbp
mov rbp, rsp
sub rsp, 256 ; Space for constant table
; Initialize constant table
lea rdi, [rbp-256]
mov rcx, 32
xor eax, eax
rep stosq
; Scan instructions
lea rsi, [instruction_list]
.scan_loop:
mov rax, [rsi]
test rax, rax
jz .done
; Check if MOV immediate
cmp byte [rax], 0xB8 ; MOV reg, imm
jb .not_const
cmp byte [rax], 0xBF
ja .not_const
; Record constant
movzx rcx, byte [rax]
and rcx, 7 ; Extract register
mov edx, [rax+1] ; Get imm32 (B8+r without REX.W carries a 32-bit immediate)
mov [rbp-256+rcx*8], rdx
.not_const:
; Check for uses and replace
call replace_with_constants
add rsi, 8
jmp .scan_loop
.done:
mov rsp, rbp
pop rbp
ret</code></pre>
<h3 id="llvm-integration">16.8 LLVM Integration</h3>
<h4 id="llvm-ir-to-x86-64">LLVM IR to x86-64</h4>
<pre class="assembly"><code>; Example LLVM IR to x86-64 lowering
; LLVM IR: %result = add i64 %a, %b
llvm_lower_add:
; Input: RDI = LLVM instruction pointer
; Output: Machine code in buffer
; Extract operands from LLVM instruction
mov rsi, [rdi + llvm_inst.op1]
mov rdx, [rdi + llvm_inst.op2]
; Map virtual registers to physical
; (get_physical_reg is assumed to take its argument in RDI and to preserve RDX and R8)
mov rdi, rsi
call get_physical_reg
mov r8, rax ; Physical reg for op1
mov rdi, rdx
call get_physical_reg
mov r9, rax ; Physical reg for op2
; Generate x86-64 add instruction
call emit_add_instruction
ret
emit_add_instruction:
; Emit: add r8, r9
; Calculate REX prefix
mov al, 0x48 ; REX.W
test r8, 8
jz .no_rex_r
or al, 0x04 ; REX.R
.no_rex_r:
test r9, 8
jz .no_rex_b
or al, 0x01 ; REX.B
.no_rex_b:
; Emit REX prefix
mov rdi, [code_ptr]
stosb
; Emit opcode
mov al, 0x01
stosb
; Emit ModR/M byte
mov al, 0xC0 ; mod=11 (register)
and r9b, 7
or al, r9b ; r/m field
and r8b, 7
shl r8b, 3
or al, r8b ; reg field
stosb
; Publish the advanced output pointer and return
mov [code_ptr], rdi
ret</code></pre>
<hr />
<h2 id="chapter-16-code-generation-and-compiler-backend-1">Chapter 16:
Code Generation and Compiler Backend</h2>
<p>This chapter moves from hand-tuned optimization into the compiler’s
domain: how to turn intermediate representation (IR) into machine code,
how register allocation and instruction selection work, and how modern
backends, JIT compilers, and binary translators produce efficient x86-64
programs.</p>
<hr />
<h3 id="compilation-pipeline-and-backend-scope">16.1 Compilation
Pipeline and Backend Scope</h3>
<p>A typical compiler architecture has the following stages:</p>
<pre class="text"><code>Source → Lexical Analysis → Parsing → AST → IR → Optimization → Code Generation → Assembler → Linker</code></pre>
<p><strong>Backend responsibilities:</strong></p>
<ul>
<li><p>Translate an optimized IR into target instructions</p></li>
<li><p>Select efficient opcodes and addressing modes</p></li>
<li><p>Allocate hardware registers and handle spills</p></li>
<li><p>Emit correct encodings</p></li>
<li><p>Optionally perform late-stage optimizations (peephole,
instruction scheduling)</p></li>
</ul>
<p>Example: IR lowering for <code>(a + b) * (c - d)</code>:</p>
<pre class="assembly"><code>; Assuming:
; a → RDI, b → RSI, c → RDX, d → RCX
mov rax, rdi ; a
add rax, rsi ; a + b
mov r10, rdx ; c
sub r10, rcx ; c - d
imul rax, r10 ; (a+b) * (c-d)
ret</code></pre>
<hr />
<h3 id="register-allocation-1">16.2 Register Allocation</h3>
<h4 id="graph-coloring-allocation">Graph Coloring Allocation</h4>
<p>Modern backends build <strong>interference graphs</strong> of
variables whose live ranges overlap, then color them using available
physical registers.</p>
<p>Before allocation:</p>
<pre class="text"><code>v1 = load mem1
v2 = load mem2
v3 = v1 + v2
v4 = load mem3
v5 = v3 * v4
store result, v5</code></pre>
<p>After allocation:</p>
<pre class="assembly"><code>mov rax, [mem1] ; v1
mov rcx, [mem2] ; v2 (caller-saved register, so no save/restore is needed)
add rax, rcx ; v3 reuses RAX because v1 is dead after this point
mov rdx, [mem3] ; v4
imul rax, rdx ; v5 in RAX
mov [result], rax
ret</code></pre>
<p>When registers run out, <em>spill code</em> saves values to the
stack:</p>
<pre class="assembly"><code>sub rsp, 32
mov [rsp], rax ; spill value
...
mov rax, [rsp] ; reload later
add rsp, 32</code></pre>
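<p>The coloring step can be sketched in C as a greedy pass over an interference matrix. This is only an illustration under simplifying assumptions: the names <code>color_registers</code> and <code>MAX_VREGS</code> are hypothetical, virtual registers are colored in index order, and a production allocator would also order nodes by degree and weigh spill costs.</p>

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VREGS 16

/* Greedy coloring sketch (hypothetical helper, not taken from any real backend).
 * interferes[i][j] is true when virtual registers i and j are live at the
 * same time. color[v] receives a physical register index in [0, num_phys),
 * or -1 when v has to be spilled. Returns the number of spilled vregs. */
int color_registers(size_t n, bool interferes[MAX_VREGS][MAX_VREGS],
                    int num_phys, int color[MAX_VREGS])
{
    int spills = 0;
    if (num_phys > MAX_VREGS)
        num_phys = MAX_VREGS;           /* keep used[] indexing in bounds */

    for (size_t v = 0; v < n && v < MAX_VREGS; v++) {
        bool used[MAX_VREGS] = { false };

        /* Mark the colors already taken by earlier-colored neighbours. */
        for (size_t u = 0; u < v; u++) {
            if (interferes[v][u] && color[u] >= 0)
                used[color[u]] = true;
        }

        /* Pick the lowest free color, or spill if none is left. */
        color[v] = -1;
        for (int c = 0; c < num_phys; c++) {
            if (!used[c]) {
                color[v] = c;
                break;
            }
        }
        if (color[v] < 0)
            spills++;
    }
    return spills;
}
```

<p>For the five-value IR above, only v1/v2 and v3/v4 overlap, so two physical registers are enough; with a single register, v2 and v4 would be marked for spilling.</p>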
<hr />
<h3 id="instruction-selection-1">16.3 Instruction Selection</h3>
<h4 id="matching-and-tiling">Matching and Tiling</h4>
<p>Backends prefer single instructions over multi‑step sequences if the
ISA supports it.</p>
<p>Naive:</p>
<pre class="assembly"><code>mov rax, [mem]
add rax, rbx
mov [mem], rax</code></pre>
<p>Selected pattern:</p>
<pre class="assembly"><code>add [mem], rbx ; fewer instructions, less register pressure</code></pre>
<p>x86‑64 addressing modes (base + index × scale + displacement) allow direct array indexing:</p>
<pre class="assembly"><code>mov rcx, [rdi + rsi*8 + 16] ; base + index*8 + disp, e.g. arr[i + 2] for 8-byte elements</code></pre>
<hr />
<h3 id="late-stage-peephole-optimization">16.4 Late-stage (Peephole)
Optimization</h3>
<p>Peephole passes remove redundancy:</p>
<pre class="assembly"><code>; push + pop with different regs
push rax
pop rbx
; → mov rbx, rax
; Constant folding
mov rax, 5
add rax, 3
; → mov rax, 8</code></pre>
<p>Strength reduction:</p>
<pre class="assembly"><code>imul rax, 2 ; → add rax, rax or shl rax, 1</code></pre>
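<p>A peephole pass of this kind can be sketched in C over a toy instruction list. The <code>Insn</code> type and <code>peephole_fold</code> function are hypothetical names introduced for illustration, and the sketch assumes the flags written by the folded <code>add</code> are dead (a real pass has to check that).</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Toy IR for illustration only: just enough to show constant folding. */
typedef enum { OP_MOV_IMM, OP_ADD_IMM, OP_OTHER } Opcode;

typedef struct {
    Opcode  op;
    int     reg;   /* destination register number */
    int64_t imm;   /* immediate operand (unused for OP_OTHER) */
} Insn;

/* Folds "mov r, A ; add r, B" into "mov r, A+B" in place.
 * Returns the new instruction count. Assumes the flags produced by the
 * add are never read afterwards. */
size_t peephole_fold(Insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            code[i].op == OP_ADD_IMM &&
            code[out - 1].op == OP_MOV_IMM &&
            code[out - 1].reg == code[i].reg) {
            /* Wrap like the hardware does, without signed-overflow UB. */
            code[out - 1].imm =
                (int64_t)((uint64_t)code[out - 1].imm + (uint64_t)code[i].imm);
            continue;                   /* drop the add */
        }
        code[out++] = code[i];
    }
    return out;
}
```

<p>Because the window always looks at the last instruction kept, a chain such as <code>mov</code>, <code>add</code>, <code>add</code> collapses into a single <code>mov</code>.</p>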
<hr />
<h3 id="jit-compilation">16.5 JIT Compilation</h3>
<p>JIT compilers emit machine code directly into memory at runtime and mark
it executable before jumping to it.</p>
<p>Simple constant adder generator:</p>
<pre class="assembly"><code>; Generates: rax = rax + CONST; ret
mov byte [buf+0], 0x48
mov byte [buf+1], 0x05
mov dword [buf+2], edi ; constant
mov byte [buf+6], 0xC3
; mprotect(buf, size, PROT_READ | PROT_EXEC) before jumping</code></pre>
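<p>The same generator can be written from the host side in C. This is a minimal sketch: <code>emit_add_rax_imm32</code> is a hypothetical helper that only fills a byte buffer, and it writes the immediate byte by byte so the output does not depend on the host's endianness.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Writes "add rax, imm32 ; ret" into buf (hypothetical helper).
 * Returns the number of bytes emitted (7), or 0 if buf is too small. */
size_t emit_add_rax_imm32(uint8_t *buf, size_t cap, int32_t imm)
{
    uint32_t u = (uint32_t)imm;

    if (cap < 7)
        return 0;

    buf[0] = 0x48;                          /* REX.W */
    buf[1] = 0x05;                          /* ADD RAX, imm32 */
    for (int i = 0; i < 4; i++)             /* imm32, little-endian */
        buf[2 + i] = (uint8_t)(u >> (8 * i));
    buf[6] = 0xC3;                          /* RET */
    return 7;
}
```

<p>To execute the result, the bytes would have to live in a page that is first writable and later remapped as readable and executable; that step is platform-specific and is left out of the sketch.</p>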
<p>Templates allow patching constants:</p>
<pre class="assembly"><code>cmp rdi, 0x12345678 ; imm32 placeholder (sign-extended to 64 bits), patched at runtime
jg greater
jl less
xor eax, eax ; equal: return 0
ret</code></pre>
<hr />
<h3 id="dynamic-binary-translation-self-modifying-code">16.6 Dynamic
Binary Translation &amp; Self-modifying Code</h3>
<p>Runtime adaptation for hot paths:</p>
<pre class="assembly"><code>inc qword [branch_counter]
cmp qword [branch_counter], 1000
jl normal_path
; Hot path detected: rewrite the branch opcode in place
; (the code page must first be made writable, e.g. with mprotect)
mov byte [branch+offset], 0x75 ; 0x75 = JNZ rel8</code></pre>
<hr />
<h3 id="machine-code-encoding-1">16.7 Machine Code Encoding</h3>
<h4 id="encoding-format">Encoding format:</h4>
<p>[Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement]
[Immediate]</p>
<p>Encoding of <code>add rax, rbx</code>:</p>
<pre class="assembly"><code>; REX.W Opcode ModR/M
db 0x48, 0x01, 0xD8</code></pre>
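<p>The three bytes can be derived mechanically. The sketch below encodes only the register-to-register form of <code>ADD r/m64, r64</code> (opcode <code>0x01</code>); the function name and the numeric register convention (0 = RAX, 3 = RBX, 8 = R8, and so on) are assumptions made for the example.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes "add dst, src" for two 64-bit registers numbered 0-15.
 * With opcode 0x01 the destination sits in ModR/M.r/m and the source
 * in ModR/M.reg. Always emits 3 bytes: REX, opcode, ModR/M. */
size_t encode_add_r64_r64(uint8_t out[3], unsigned dst, unsigned src)
{
    uint8_t rex = 0x48;                 /* 0100 1000: REX.W */

    if (src & 8)
        rex |= 0x04;                    /* REX.R extends the reg field */
    if (dst & 8)
        rex |= 0x01;                    /* REX.B extends the r/m field */

    out[0] = rex;
    out[1] = 0x01;                      /* ADD r/m64, r64 */
    out[2] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7));  /* mod = 11 */
    return 3;
}
```

<p>Calling it with <code>dst = 0</code>, <code>src = 3</code> reproduces <code>48 01 D8</code>; with <code>dst = 8</code>, <code>src = 9</code> (that is, <code>add r8, r9</code>) it yields <code>4D 01 C8</code>.</p>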
<hr />
<h3 id="building-an-assembler-backend-emitter">16.8 Building an
Assembler (Backend-emitter)</h3>
<p>Minimal assembler loop:</p>
<pre class="assembly"><code>; parse mnemonic → get opcode
; parse operands → make ModR/M
; emit optional REX byte
; emit opcode + ModR/M + displacement/immediate</code></pre>
<hr />
<h3 id="backend-optimization-passes">16.9 Backend Optimization
Passes</h3>
<p>Late passes may include:</p>
<ul>
<li><p><strong>Dead code elimination</strong>: remove unreachable basic
blocks and instructions whose results are never used</p></li>
<li><p><strong>Constant propagation</strong>: replace runtime loads with
compile‑time constants, folding dependent operations</p></li>
<li><p><strong>Instruction scheduling</strong>: rearrange independent
instructions to avoid stalls</p></li>
</ul>
<hr />
<h3 id="llvm-backend-integration">16.10 LLVM Backend Integration</h3>
<p>LLVM IR:</p>
<div class="sourceCode" id="cb238"><pre
class="sourceCode llvm"><code class="sourceCode llvm"><span id="cb238-1"><a href="#cb238-1" aria-hidden="true" tabindex="-1"></a><span class="fu">%res</span> = <span class="kw">add</span> <span class="dt">i64</span> <span class="fu">%a</span>, <span class="fu">%b</span></span></code></pre></div>
<p>Lower to:</p>
<pre class="assembly"><code>add r8, r9 ; after reg assignment</code></pre>
<p>Both operands are extended registers, so the encoding needs REX.W, REX.R, and REX.B set: <code>4D 01 C8</code>.</p>
<hr />
<h3 id="summary-1">Summary</h3>
<p>A compiler backend for x86‑64:</p>
<ol type="1">
<li><p>Maps IR operations to machine instructions</p></li>
<li><p>Allocates registers efficiently, spilling only when
necessary</p></li>
<li><p>Selects optimal encodings and complex addressing modes</p></li>
<li><p>Performs late optimizations to reduce size and improve
performance</p></li>
<li><p>Emits correct binary sequences</p></li>
<li><p>Optionally supports runtime generation and optimization
(JIT/DBT)</p></li>
</ol>
<p>Understanding these mechanics allows developers to:</p>
<ul>
<li><p>Integrate with LLVM or GCC backends</p></li>
<li><p>Write domain-specific JIT engines</p></li>
<li><p>Hand-optimize IR for extreme performance</p></li>
</ul>
<hr />
<h4 id="exercises-1">Exercises</h4>
<ol type="1">
<li><p>Implement a peephole optimizer that folds constants and removes
redundant moves.</p></li>
<li><p>Write a register allocator using a graph coloring algorithm for a
small IR.</p></li>
<li><p>Implement a JIT generator for a branchless min/max function in
x86‑64.</p></li>
<li><p>Produce a machine code encoder for <code>mov rax, imm64</code>
and explain each encoded byte.</p></li>
<li><p>Modify a compiled function at runtime to optimize its hot path
using <code>mprotect</code>.</p></li>
</ol>
<hr />
<h2 id="appendix-a-x86-64-instruction-reference-quick-guide">Appendix A:
x86-64 Instruction Reference Quick Guide</h2>
<h3 id="a.1-instruction-format-overview">A.1 Instruction Format
Overview</h3>
<h4 id="general-encoding-structure">General Encoding Structure</h4>
<p>[Legacy Prefixes] [REX/VEX/EVEX] [Opcode] [ModR/M] [SIB]
[Displacement] [Immediate]</p>
<h4 id="rex-prefix-40h-4fh">REX Prefix (40h-4Fh)</h4>
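<p>Bit layout <code>0100 WRXB</code>:</p>
<ul>
<li><p>W: 64-bit operand size</p></li>
<li><p>R: Extension of ModR/M reg field</p></li>
<li><p>X: Extension of SIB index field</p></li>
<li><p>B: Extension of ModR/M r/m field</p></li>
</ul>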
<h4 id="modrm-byte">ModR/M Byte</h4>
<ul>
<li><p>[7:6] MOD: Addressing mode</p></li>
<li><p>[5:3] REG: Register/Opcode</p></li>
<li><p>[2:0] R/M: Register/Memory</p></li>
</ul>
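<p>Reading these fields back out of an instruction is a matter of shifts and masks. The sketch below is illustrative only: <code>ModRMFields</code> and <code>decode_modrm</code> are hypothetical names, a <code>rex</code> value of 0 stands for "no REX prefix", and the case where R/M = 100 selects a SIB byte (so REX.B extends the SIB base instead) is not handled.</p>

```c
#include <stdint.h>

typedef struct {
    unsigned mod;   /* addressing mode, ModR/M bits 7:6 */
    unsigned reg;   /* 4-bit register number: REX.R followed by bits 5:3 */
    unsigned rm;    /* 4-bit register number: REX.B followed by bits 2:0 */
    int      w;     /* REX.W: 1 selects a 64-bit operand size */
} ModRMFields;

/* Splits a REX prefix and a ModR/M byte into their fields.
 * Pass rex = 0 when the instruction has no REX prefix. */
ModRMFields decode_modrm(uint8_t rex, uint8_t modrm)
{
    ModRMFields f;

    f.w   = (rex >> 3) & 1;
    f.mod = modrm >> 6;
    f.reg = ((modrm >> 3) & 7u) | (((rex >> 2) & 1u) << 3);
    f.rm  = (modrm & 7u) | ((rex & 1u) << 3);
    return f;
}
```

<p>For example, the pair <code>0x4D</code>, <code>0xC8</code> decodes to mod = 3, reg = 9 (R9), and r/m = 8 (R8) with a 64-bit operand size.</p>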
<h3 id="a.2-data-movement-instructions">A.2 Data Movement
Instructions</h3>
<h4 id="basic-moves">Basic Moves</h4>
<pre class="assembly"><code>MOV dst, src ; Move data
MOVZX dst, src ; Move with zero extend
MOVSX dst, src ; Move with sign extend
MOVSXD r64, r/m32 ; Sign extend doubleword to quadword
LEA reg, mem ; Load effective address
XCHG op1, op2 ; Exchange values
BSWAP reg ; Byte swap</code></pre>
<h4 id="stack-operations">Stack Operations</h4>
<pre class="assembly"><code>PUSH op ; Push to stack
POP op ; Pop from stack
PUSHF/PUSHFQ ; Push FLAGS/RFLAGS
POPF/POPFQ ; Pop FLAGS/RFLAGS</code></pre>
<h4 id="conditional-moves-cmovcc">Conditional Moves (CMOVcc)</h4>
<pre class="assembly"><code>CMOVE/CMOVZ ; Move if equal/zero (ZF=1)
CMOVNE/CMOVNZ ; Move if not equal/not zero (ZF=0)
CMOVL/CMOVNGE ; Move if less (SF≠OF)
CMOVLE/CMOVNG ; Move if less or equal (ZF=1 or SF≠OF)
CMOVG/CMOVNLE ; Move if greater (ZF=0 and SF=OF)
CMOVGE/CMOVNL ; Move if greater or equal (SF=OF)
CMOVA/CMOVNBE ; Move if above (CF=0 and ZF=0)
CMOVAE/CMOVNB/CMOVNC ; Move if above or equal (CF=0)
CMOVB/CMOVNAE/CMOVC ; Move if below (CF=1)
CMOVBE/CMOVNA ; Move if below or equal (CF=1 or ZF=1)</code></pre>
<h3 id="a.3-arithmetic-instructions">A.3 Arithmetic Instructions</h3>
<h4 id="integer-arithmetic-2">Integer Arithmetic</h4>
<pre class="assembly"><code>ADD dst, src ; Addition
ADC dst, src ; Add with carry
SUB dst, src ; Subtraction
SBB dst, src ; Subtract with borrow
INC op ; Increment
DEC op ; Decrement
NEG op ; Two&#39;s complement negation
CMP op1, op2 ; Compare (sets flags)
MUL src ; Unsigned multiply (RDX:RAX = RAX * src)
IMUL src ; Signed multiply
IMUL dst, src ; Signed multiply (dst = dst * src)
IMUL dst, src, imm ; Signed multiply (dst = src * imm)
DIV src ; Unsigned divide of RDX:RAX by src (RAX = quotient, RDX = remainder)
IDIV src ; Signed divide</code></pre>
<h4 id="bcd-and-ascii-adjust-legacy">BCD and ASCII Adjust (Legacy)</h4>
<pre class="assembly"><code>AAA, AAS, AAD, AAM ; ASCII adjust (not in 64-bit mode)
DAA, DAS ; Decimal adjust (not in 64-bit mode)</code></pre>
<h3 id="a.4-logical-instructions">A.4 Logical Instructions</h3>
<pre class="assembly"><code>AND dst, src ; Logical AND
OR dst, src ; Logical OR
XOR dst, src ; Logical XOR
NOT op ; One&#39;s complement
TEST op1, op2 ; Logical compare (AND without storing)</code></pre>
<h3 id="a.5-shift-and-rotate-instructions">A.5 Shift and Rotate
Instructions</h3>
<pre class="assembly"><code>SHL/SAL op, count ; Shift left
SHR op, count ; Logical shift right
SAR op, count ; Arithmetic shift right
ROL op, count ; Rotate left
ROR op, count ; Rotate right
RCL op, count ; Rotate left through carry
RCR op, count ; Rotate right through carry
SHLD dst, src, count ; Double precision shift left
SHRD dst, src, count ; Double precision shift right</code></pre>
<h3 id="a.6-bit-manipulation-instructions">A.6 Bit Manipulation
Instructions</h3>
<pre class="assembly"><code>BT op, bit ; Bit test
BTS op, bit ; Bit test and set
BTR op, bit ; Bit test and reset
BTC op, bit ; Bit test and complement
BSF dst, src ; Bit scan forward
BSR dst, src ; Bit scan reverse
LZCNT dst, src ; Leading zero count (ABM/LZCNT)
TZCNT dst, src ; Trailing zero count (BMI1)
POPCNT dst, src ; Population count
; BMI Instructions
ANDN dst, src1, src2 ; Logical AND NOT
BEXTR dst, src, ctrl ; Bit field extract
BLSI dst, src ; Extract lowest set bit
BLSMSK dst, src ; Mask up to lowest set bit
BLSR dst, src ; Reset lowest set bit</code></pre>
<h3 id="a.7-control-transfer-instructions">A.7 Control Transfer
Instructions</h3>
<h4 id="unconditional-jumps-1">Unconditional Jumps</h4>
<pre class="assembly"><code>JMP target ; Near/far jump
CALL target ; Call procedure
RET [imm16] ; Return from procedure</code></pre>
<h4 id="conditional-jumps-jcc">Conditional Jumps (Jcc)</h4>
<pre class="assembly"><code>JE/JZ ; Jump if equal/zero (ZF=1)
JNE/JNZ ; Jump if not equal/not zero (ZF=0)
JL/JNGE ; Jump if less (SF≠OF)
JLE/JNG ; Jump if less or equal (ZF=1 or SF≠OF)
JG/JNLE ; Jump if greater (ZF=0 and SF=OF)
JGE/JNL ; Jump if greater or equal (SF=OF)
JA/JNBE ; Jump if above (CF=0 and ZF=0)
JAE/JNB/JNC ; Jump if above or equal (CF=0)
JB/JNAE/JC ; Jump if below (CF=1)
JBE/JNA ; Jump if below or equal (CF=1 or ZF=1)
JO, JNO ; Jump if overflow/not overflow
JS, JNS ; Jump if sign/not sign
JP/JPE, JNP/JPO ; Jump if parity even/odd</code></pre>
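<p>The flag expressions in the table can be checked against a small model of <code>cmp a, b</code>. This is a sketch for 64-bit operands only; the <code>Flags</code> struct and the helper names are introduced here for illustration.</p>

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, sf, cf, of; } Flags;

/* Flags produced by "cmp a, b", i.e. by the 64-bit subtraction a - b. */
Flags cmp_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    Flags f;

    f.zf = (r == 0);
    f.sf = (r >> 63) & 1;
    f.cf = (a < b);                             /* unsigned borrow */
    f.of = (((a ^ b) & (a ^ r)) >> 63) & 1;     /* signed overflow */
    return f;
}

/* Signed conditions use SF and OF; unsigned conditions use CF. */
bool cond_l(Flags f) { return f.sf != f.of; }           /* JL / CMOVL */
bool cond_g(Flags f) { return !f.zf && f.sf == f.of; }  /* JG / CMOVG */
bool cond_b(Flags f) { return f.cf; }                   /* JB / CMOVB */
bool cond_a(Flags f) { return !f.cf && !f.zf; }         /* JA / CMOVA */
```

<p>Comparing the bit pattern of -1 with 1 shows why both families exist: the signed test reports "less", while the unsigned test reports "above".</p>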
<h4 id="loop-instructions-1">Loop Instructions</h4>
<pre class="assembly"><code>LOOP target ; Decrement RCX and jump if not zero
LOOPE/LOOPZ ; Loop while equal/zero
LOOPNE/LOOPNZ ; Loop while not equal/not zero
JRCXZ target ; Jump if RCX is zero</code></pre>
<h3 id="a.8-string-instructions">A.8 String Instructions</h3>
<pre class="assembly"><code>MOVS[B/W/D/Q] ; Move string
CMPS[B/W/D/Q] ; Compare string
SCAS[B/W/D/Q] ; Scan string
LODS[B/W/D/Q] ; Load string
STOS[B/W/D/Q] ; Store string
REP ; Repeat while RCX != 0
REPE/REPZ ; Repeat while equal/zero
REPNE/REPNZ ; Repeat while not equal/not zero</code></pre>
<h3 id="a.9-flag-control-instructions">A.9 Flag Control
Instructions</h3>
<pre class="assembly"><code>CLC, STC, CMC ; Clear/Set/Complement carry
CLD, STD ; Clear/Set direction flag
CLI, STI ; Clear/Set interrupt flag (privileged)
LAHF, SAHF ; Load/Store AH from/to FLAGS</code></pre>
<h3 id="a.10-system-instructions">A.10 System Instructions</h3>
<pre class="assembly"><code>RDMSR, WRMSR ; Read/Write Model-Specific Register
RDTSC, RDTSCP ; Read Time-Stamp Counter
CPUID ; CPU Identification
RDPMC ; Read Performance Counter
XGETBV, XSETBV ; Get/Set Extended Control Register
; Descriptor-table instructions (loads are privileged; stores are unprivileged unless CR4.UMIP is set)
LGDT, SGDT ; Load/Store Global Descriptor Table
LIDT, SIDT ; Load/Store Interrupt Descriptor Table
LLDT, SLDT ; Load/Store Local Descriptor Table
LTR, STR ; Load/Store Task Register</code></pre>
<h3 id="a.11-simd-instructions-sseavx">A.11 SIMD Instructions
(SSE/AVX)</h3>
<h4 id="data-movement">Data Movement</h4>
<pre class="assembly"><code>MOVAPS/MOVUPS ; Move aligned/unaligned packed single
MOVAPD/MOVUPD ; Move aligned/unaligned packed double
MOVDQA/MOVDQU ; Move aligned/unaligned integer
MOVSS/MOVSD ; Move scalar single/double
MOVHPS/MOVLPS ; Move high/low packed single
MOVHPD/MOVLPD ; Move high/low packed double</code></pre>
<h4 id="arithmetic-packed">Arithmetic (Packed)</h4>
<pre class="assembly"><code>ADDPS/ADDPD ; Add packed single/double
SUBPS/SUBPD ; Subtract packed single/double
MULPS/MULPD ; Multiply packed single/double
DIVPS/DIVPD ; Divide packed single/double
SQRTPS/SQRTPD ; Square root packed single/double
MAXPS/MAXPD ; Maximum packed single/double
MINPS/MINPD ; Minimum packed single/double</code></pre>
<h4 id="logical">Logical</h4>
<pre class="assembly"><code>ANDPS/ANDPD ; Bitwise AND
ORPS/ORPD ; Bitwise OR
XORPS/XORPD ; Bitwise XOR
ANDNPS/ANDNPD ; Bitwise AND NOT</code></pre>
<h4 id="comparison">Comparison</h4>
<pre class="assembly"><code>CMPPS/CMPPD ; Compare packed
COMISS/COMISD ; Compare scalar (sets EFLAGS)
UCOMISS/UCOMISD ; Unordered compare scalar</code></pre>
<h4 id="shufflepermute">Shuffle/Permute</h4>
<pre class="assembly"><code>SHUFPS/SHUFPD ; Shuffle packed
UNPCKHPS/UNPCKLPS ; Unpack high/low single
UNPCKHPD/UNPCKLPD ; Unpack high/low double</code></pre>
<h3 id="a.12-avxavx2-instructions">A.12 AVX/AVX2 Instructions</h3>
<h4 id="three-operand-form">Three-Operand Form</h4>
<pre class="assembly"><code>VADDPS dst, src1, src2 ; dst = src1 + src2
VMULPS dst, src1, src2 ; dst = src1 * src2
VSUBPS dst, src1, src2 ; dst = src1 - src2</code></pre>
<h4 id="fma-fused-multiply-add">FMA (Fused Multiply-Add)</h4>
<pre class="assembly"><code>VFMADD132PS/PD ; dst = dst * src3 + src2
VFMADD213PS/PD ; dst = src2 * dst + src3
VFMADD231PS/PD ; dst = src2 * src3 + dst
VFMSUB###PS/PD ; Variants that subtract the addend
VFNMADD###PS/PD ; Variants that negate the product</code></pre>
<h4 id="gatherscatter-avx2avx-512">Gather/Scatter (AVX2/AVX-512)</h4>
<pre class="assembly"><code>VGATHERDPS/VGATHERQPS ; Gather single precision
VGATHERDPD/VGATHERQPD ; Gather double precision
VPGATHERDD/VPGATHERQD ; Gather doublewords
VPGATHERDQ/VPGATHERQQ ; Gather quadwords</code></pre>
<h3 id="a.13-avx-512-instructions">A.13 AVX-512 Instructions</h3>
<h4 id="mask-operations">Mask Operations</h4>
<pre class="assembly"><code>KMOVB/KMOVW/KMOVD/KMOVQ ; Move mask register
KANDW/KANDD/KANDQ ; AND mask registers
KORW/KORD/KORQ ; OR mask registers
KXORW/KXORD/KXORQ ; XOR mask registers
KNOTW/KNOTD/KNOTQ ; NOT mask register
KORTESTW/KORTESTD ; OR and test mask</code></pre>
<h4 id="masked-operations">Masked Operations</h4>
<pre class="assembly"><code>VADDPS zmm1{k1}, zmm2, zmm3 ; Masked addition
VMOVAPS zmm1{k1}{z}, [mem] ; Masked move with zeroing</code></pre>
<h4 id="special-avx-512-instructions">Special AVX-512 Instructions</h4>
<pre class="assembly"><code>VCOMPRESS## ; Compress packed data
VEXPAND## ; Expand packed data
VPERMI2## ; Two-source permute, overwriting the index operand
VPERMT2## ; Two-source permute, overwriting the first table operand
VPCONFLICT## ; Detect conflicts</code></pre>
<h3 id="a.14-transactional-memory-tsx">A.14 Transactional Memory
(TSX)</h3>
<pre class="assembly"><code>XBEGIN target ; Begin transaction
XEND ; End transaction
XABORT imm8 ; Abort transaction
XTEST ; Test if in transaction</code></pre>
<h3 id="a.15-security-extensions">A.15 Security Extensions</h3>
<h4 id="intel-cet-control-flow-enforcement">Intel CET (Control-flow
Enforcement)</h4>
<pre class="assembly"><code>ENDBR32/ENDBR64 ; End branch markers
INCSSPD/INCSSPQ ; Increment shadow stack pointer
RDSSPD/RDSSPQ ; Read shadow stack pointer
SAVEPREVSSP ; Save previous shadow stack pointer
RSTORSSP ; Restore shadow stack pointer</code></pre>
<h4 id="intel-sgx">Intel SGX</h4>
<pre class="assembly"><code>ENCLS ; SGX Supervisor instructions
ENCLU ; SGX User instructions
ENCLV ; SGX Virtualization instructions</code></pre>
<h3 id="a.16-common-instruction-patterns">A.16 Common Instruction
Patterns</h3>
<h4 id="function-prologueepilogue">Function Prologue/Epilogue</h4>
<pre class="assembly"><code>; Prologue
push rbp
mov rbp, rsp
sub rsp, N ; Allocate stack space
; Epilogue
mov rsp, rbp
pop rbp
ret</code></pre>
<h4 id="system-v-amd64-abi-registers">System V AMD64 ABI Registers</h4>
<ul>
<li><p>Arguments: RDI, RSI, RDX, RCX, R8, R9</p></li>
<li><p>Return: RAX (RDX:RAX for 128-bit)</p></li>
<li><p>Preserved: RBX, RBP, RSP, R12-R15</p></li>
<li><p>Scratch: RAX, RCX, RDX, RSI, RDI, R8-R11</p></li>
</ul>
<h4 id="windows-x64-abi-registers">Windows x64 ABI Registers</h4>
<ul>
<li><p>Arguments: RCX, RDX, R8, R9</p></li>
<li><p>Return: RAX</p></li>
<li><p>Preserved: RBX, RBP, RDI, RSI, RSP, R12-R15</p></li>
<li><p>Scratch: RAX, RCX, RDX, R8-R11</p></li>
</ul>
<h3 id="a.17-optimization-guidelines">A.17 Optimization Guidelines</h3>
<h4 id="alignment">Alignment</h4>
<ul>
<li><p>Functions: 16-byte boundary</p></li>
<li><p>Loops: 16 or 32-byte boundary</p></li>
<li><p>Data: Natural alignment (4-byte for DWORD, 8-byte for
QWORD)</p></li>
<li><p>SIMD data: 16-byte (SSE), 32-byte (AVX), 64-byte
(AVX-512)</p></li>
</ul>
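<p>When a code generator or allocator has to honor these boundaries, the usual power-of-two rounding trick applies. A minimal sketch, with hypothetical helper names and the assumption that <code>align</code> is always a nonzero power of two:</p>

```c
#include <stdint.h>

/* Rounds addr up to the next multiple of align.
 * align must be a nonzero power of two (16, 32, 64, ...). */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Returns 1 when addr already satisfies the power-of-two alignment. */
int is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}
```

<p>For instance, <code>align_up(0x1001, 16)</code> gives <code>0x1010</code>, the next address suitable for an aligned SSE load.</p>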
<h4 id="instruction-selection-2">Instruction Selection</h4>
<ul>
<li><p>Prefer <code>XOR reg,reg</code> over <code>MOV reg,0</code> for
zeroing</p></li>
<li><p>Use <code>TEST</code> instead of <code>CMP</code> with 0</p></li>
<li><p>Prefer <code>LEA</code> for complex address arithmetic</p></li>
<li><p>Use conditional moves to avoid branches</p></li>
</ul>
<h4 id="pipeline-optimization">Pipeline Optimization</h4>
<ul>
<li><p>Avoid partial register updates</p></li>
<li><p>Minimize dependency chains</p></li>
<li><p>Unroll small loops</p></li>
<li><p>Align branch targets</p></li>
</ul>
<h3 id="summary-2">Summary</h3>
<p>This quick reference covers the most commonly used x86-64
instructions. For complete details including:</p>
<ul>
<li><p>Exact encoding formats</p></li>
<li><p>All flag effects</p></li>
<li><p>Timing information</p></li>
<li><p>Exception conditions</p></li>
<li><p>Detailed operand forms</p></li>
</ul>
<p>Consult the Intel® 64 and IA-32 Architectures Software Developer’s
Manual volumes, particularly Volume 2 (Instruction Set Reference).</p>
<p>Key patterns to remember:</p>
<ol type="1">
<li><p>Most instructions support multiple operand sizes
(8/16/32/64-bit)</p></li>
<li><p>Memory operands are encoded through the ModR/M byte</p></li>
</ol>
<hr />
<h3 id="instruction-extension-quick-map"><strong>Instruction &amp;
Extension Quick Map</strong></h3>
<h4 id="encoding-basics"><strong>Encoding Basics</strong></h4>
<ul>
<li><p><strong>REX</strong>: <code>0100WRXB</code> — W(64‑bit), R(ext
ModR/M reg), X(ext SIB index), B(ext ModR/M r/m)</p></li>
<li><p><strong>ModR/M</strong>: MOD (addr mode) | REG (reg/opcode) | R/M
(reg/mem)</p></li>
<li><p><strong>VEX/EVEX</strong>: enable 3‑operand forms, wider regs,
masking, embedded rounding</p></li>
</ul>
<hr />
<h4 id="scalar-and-general-purpose-ops"><strong>Scalar and General
Purpose Ops</strong></h4>
<ul>
<li><p><strong>MOV</strong>/<strong>MOVSX</strong>/<strong>MOVZX</strong>
– transfer and extend</p></li>
<li><p><strong>LEA</strong> – addr calc</p></li>
<li><p>Stack:
<code>PUSH</code>/<code>POP</code>/<code>PUSHFQ</code>/<code>POPFQ</code></p></li>
<li><p>Conditional moves: <code>CMOVcc</code> variants</p></li>
<li><p>Arithmetic: <code>ADD</code>, <code>SUB</code>, <code>ADC</code>,
<code>SBB</code>, <code>INC</code>, <code>DEC</code>, <code>NEG</code>,
<code>CMP</code></p></li>
<li><p>Multiply/divide: <code>MUL</code>, <code>IMUL</code>,
<code>DIV</code>, <code>IDIV</code></p></li>
<li><p>Logic: <code>AND</code>, <code>OR</code>, <code>XOR</code>,
<code>NOT</code>, <code>TEST</code></p></li>
<li><p>Shifts/rotates: <code>SHL</code>, <code>SHR</code>,
<code>SAR</code>, <code>ROL</code>, <code>ROR</code>, <code>SHLD</code>,
<code>SHRD</code></p></li>
<li><p>Bit ops: <code>BT</code>, <code>BTS</code>, <code>BTR</code>,
<code>BTC</code>, <code>BSF</code>, <code>BSR</code>,
<code>POPCNT</code></p></li>
<li><p>Control transfer: <code>Jcc</code>, <code>CALL</code>,
<code>RET</code>, <code>LOOPcc</code></p></li>
</ul>
<hr />
<h4 id="sse3-ssse3-sse4-highlights"><strong>SSE3 / SSSE3 / SSE4
Highlights</strong></h4>
<ul>
<li><p><strong>Horizontal arithmetic</strong>: <code>HADDPS/PD</code>,
<code>HSUBPS/PD</code>, <code>ADDSUBPS/PD</code></p></li>
<li><p><strong>Data move/shuffle</strong>: <code>MOVSLDUP</code>,
<code>MOVSHDUP</code>, <code>MOVDDUP</code>, <code>LDDQU</code></p></li>
<li><p><strong>Absolute values</strong>: <code>PABSB/W/D</code></p></li>
<li><p><strong>Sign ops</strong>: <code>PSIGNB/W/D</code></p></li>
<li><p><strong>Horiz add/sub saturating</strong>:
<code>PHADDW/D/SW</code>, <code>PHSUBW/D/SW</code></p></li>
<li><p><strong>Mult/add</strong>: <code>PMADDUBSW</code>,
<code>PMULHRSW</code></p></li>
<li><p><strong>Shuffles</strong>: <code>PSHUFB</code></p></li>
<li><p><strong>Align concat</strong>: <code>PALIGNR</code></p></li>
<li><p><strong>Blend</strong>: <code>BLENDPS/PD</code>,
<code>BLENDVPS/PD</code>, <code>PBLENDVB</code>,
<code>PBLENDW</code></p></li>
<li><p><strong>Dot product</strong>: <code>DPPS</code>,
<code>DPPD</code></p></li>
<li><p><strong>Rounding</strong>: <code>ROUNDPS/PD/SS/SD</code></p></li>
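<li><p><strong>Min/max int</strong>: <code>PMINSB/PMAXSB</code> plus
word/dword variants (<code>PMINUW</code>, <code>PMINSD</code>, <code>PMINUD</code> and their <code>PMAX</code> counterparts)</p></li>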
<li><p><strong>String ops</strong>: <code>PCMPxSTRx</code></p></li>
<li><p><strong>CRC</strong>: <code>CRC32</code></p></li>
<li><p><strong>POPCNT</strong> – population count</p></li>
</ul>
<hr />
<h4 id="avx-avx2"><strong>AVX / AVX2</strong></h4>
<ul>
<li><p><strong>YMM regs</strong> – 256‑bit, XMM alias</p></li>
<li><p><strong>VEX</strong> – 3‑operand, non‑destructive</p></li>
<li><p><strong>State mgmt</strong>: <code>vzeroupper</code>,
<code>XSAVE</code>/<code>XRSTOR</code>,
<code>XGETBV</code>/<code>XSETBV</code></p></li>
<li><p><strong>FP ops</strong>: <code>VADDPS/PD</code>,
<code>VMULPS/PD</code>, <code>VDIVPS/PD</code>,
<code>VSQRTPS/PD</code></p></li>
<li><p><strong>FMA3</strong>: <code>VFMADD*</code>,
<code>VFMSUB*</code>, <code>VFNMADD*</code>,
<code>VFNMSUB*</code></p></li>
<li><p><strong>Cmp/mask</strong>: <code>VCMP*</code>,
<code>VBLENDV*</code></p></li>
<li><p><strong>Broadcast</strong>: <code>VBROADCASTSS/SD</code></p></li>
<li><p><strong>Permute/shuffle</strong>: <code>VPERM2F128</code>,
<code>VPERMILPS/PD</code></p></li>
<li><p><strong>Blend imm/var</strong>: <code>VBLENDPS/PD</code>,
<code>VBLENDVPS/PD</code></p></li>
<li><p><strong>Integer ops 256‑bit</strong>: add/sub/mul/shift
(<code>VPADD*</code>, <code>VPSUB*</code>, <code>VPMUL*</code>,
<code>VPSLLV*</code>, <code>VPSRAV*</code>)</p></li>
<li><p><strong>Gather</strong>: <code>VGATHER*</code>,
<code>VPGATHER*</code></p></li>
</ul>
<hr />
<h4 id="avx512"><strong>AVX‑512</strong></h4>
<ul>
<li><p><strong>ZMM regs</strong> – 512‑bit, k0‑k7 masks</p></li>
<li><p><strong>EVEX</strong> – 512‑bit, predicated exec, embedded
broadcast/rounding</p></li>
<li><p><strong>Foundation</strong>: 512‑bit FP/INT ops,
<code>VRCP14*</code>, <code>VRSQRT14*</code>
(<code>VREDUCE*</code> belongs to AVX‑512DQ)</p></li>
<li><p><strong>Permute</strong>: <code>VPERMI2*</code>,
<code>VPERMT2*</code>, <code>VPERM*</code></p></li>
<li><p><strong>Compress/expand</strong>: <code>VCOMPRESS*</code>,
<code>VEXPAND*</code></p></li>
<li><p><strong>Ternary logic</strong>:
<code>VPTERNLOGD/Q</code></p></li>
<li><p><strong>Scatter</strong>: <code>VPSCATTER*</code></p></li>
<li><p><strong>BW/DQ/VNNI/IFMA</strong>: byte/word/dword/qword
extensions, DL‑oriented dot products, integer FMA</p></li>
<li><p><strong>Mask forms</strong>: <code>{k}{z}</code> zeroing,
merging</p></li>
<li><p><strong>Conflict detect</strong>:
<code>VPCONFLICT*</code></p></li>
</ul>
<hr />
<h4 id="systemarch"><strong>System/Arch</strong></h4>
<ul>
<li><p>Rings 0–3, GDT/IDT/TSS (<code>LGDT</code>, <code>LIDT</code>,
<code>LTR</code>), control regs (CR0/3/4)</p></li>
<li><p>MSRs (<code>RDMSR</code>/<code>WRMSR</code>), syscall/sysret,
sysenter/sysexit</p></li>
<li><p>Paging: 4‑/5‑level, PTE flags (NX, G, U/S, R/W, P)</p></li>
<li><p>TLB ops: <code>INVLPG</code>, <code>INVPCID</code>, <code>INVVPID</code> (VMX)</p></li>
<li><p>Prot keys: <code>WRPKRU</code>, <code>RDPKRU</code></p></li>
<li><p>MTRRs – cache type controls</p></li>
<li><p>SMEP/SMAP – user/kernel exec/data prevent</p></li>
</ul>
<hr />
<h2 id="appendix-b-system-v-amd64-abi-summary">Appendix B: System V
AMD64 ABI Summary</h2>
<h3 id="b.1-register-usage-conventions">B.1 Register Usage
Conventions</h3>
<h4 id="general-purpose-registers-1">General Purpose Registers</h4>
<table>
<colgroup>
<col style="width: 28%" />
<col style="width: 20%" />
<col style="width: 31%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr>
<th>Register</th>
<th>Usage</th>
<th>Preserved</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>RAX</strong></td>
<td>Return value, 1st return register</td>
<td>No</td>
<td>Also used for syscall number</td>
</tr>
<tr>
<td><strong>RBX</strong></td>
<td>Callee-saved</td>
<td>Yes</td>
<td>Must be preserved across calls</td>
</tr>
<tr>
<td><strong>RCX</strong></td>
<td>4th argument</td>
<td>No</td>
<td>Clobbered by SYSCALL (receives the return RIP)</td>
</tr>
<tr>
<td><strong>RDX</strong></td>
<td>3rd argument, 2nd return register</td>
<td>No</td>
<td>High 64 bits of 128-bit return</td>
</tr>
<tr>
<td><strong>RSI</strong></td>
<td>2nd argument</td>
<td>No</td>
<td>Source index for string ops</td>
</tr>
<tr>
<td><strong>RDI</strong></td>
<td>1st argument</td>
<td>No</td>
<td>Destination index for string ops</td>
</tr>
<tr>
<td><strong>RBP</strong></td>
<td>Frame pointer (optional)</td>
<td>Yes</td>
<td>Can be used as general register with
<code>-fomit-frame-pointer</code></td>
</tr>
<tr>
<td><strong>RSP</strong></td>
<td>Stack pointer</td>
<td>Yes</td>
<td>Must be 16-byte aligned before CALL</td>
</tr>
<tr>
<td><strong>R8</strong></td>
<td>5th argument</td>
<td>No</td>
<td>Additional scratch register</td>
</tr>
<tr>
<td><strong>R9</strong></td>
<td>6th argument</td>
<td>No</td>
<td>Additional scratch register</td>
</tr>
<tr>
<td><strong>R10</strong></td>
<td>Scratch, static chain pointer</td>
<td>No</td>
<td>Used for nested functions</td>
</tr>
<tr>
<td><strong>R11</strong></td>
<td>Scratch</td>
<td>No</td>
<td>Used by syscall/sysret</td>
</tr>
<tr>
<td><strong>R12</strong></td>
<td>Callee-saved</td>
<td>Yes</td>
<td>Must be preserved</td>
</tr>
<tr>
<td><strong>R13</strong></td>
<td>Callee-saved</td>
<td>Yes</td>
<td>Must be preserved</td>
</tr>
<tr>
<td><strong>R14</strong></td>
<td>Callee-saved</td>
<td>Yes</td>
<td>Must be preserved</td>
</tr>
<tr>
<td><strong>R15</strong></td>
<td>Callee-saved</td>
<td>Yes</td>
<td>Must be preserved</td>
</tr>
</tbody>
</table>
<h4 id="floating-point-registers">Floating-Point Registers</h4>
<table>
<colgroup>
<col style="width: 28%" />
<col style="width: 20%" />
<col style="width: 31%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr>
<th>Register</th>
<th>Usage</th>
<th>Preserved</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>XMM0</strong></td>
<td>1st FP arg, FP return value</td>
<td>No</td>
<td>Also used for complex returns</td>
</tr>
<tr>
<td><strong>XMM1</strong></td>
<td>2nd FP arg, 2nd FP return</td>
<td>No</td>
<td>Imaginary part of complex</td>
</tr>
<tr>
<td><strong>XMM2-XMM7</strong></td>
<td>3rd-8th FP arguments</td>
<td>No</td>
<td>Scratch registers</td>
</tr>
<tr>
<td><strong>XMM8-XMM15</strong></td>
<td>Scratch</td>
<td>No</td>
<td>Additional temporaries</td>
</tr>
<tr>
<td><strong>YMM0-YMM15</strong></td>
<td>AVX extension of XMM</td>
<td>No</td>
<td>Upper 128 bits not preserved</td>
</tr>
<tr>
<td><strong>ZMM0-ZMM31</strong></td>
<td>AVX-512 extension</td>
<td>No</td>
<td>Upper bits not preserved</td>
</tr>
<tr>
<td><strong>K0-K7</strong></td>
<td>AVX-512 mask registers</td>
<td>No</td>
<td>K0 usually means no masking</td>
</tr>
</tbody>
</table>
<h4 id="special-registers">Special Registers</h4>
<table>
<colgroup>
<col style="width: 41%" />
<col style="width: 29%" />
<col style="width: 29%" />
</colgroup>
<thead>
<tr>
<th>Register</th>
<th>Usage</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>RFLAGS</strong></td>
<td>Status flags</td>
<td>DF must be clear on entry/exit</td>
</tr>
<tr>
<td><strong>MXCSR</strong></td>
<td>SSE control/status</td>
<td>Must preserve rounding mode, exception masks</td>
</tr>
<tr>
<td><strong>x87 FPU</strong></td>
<td>Legacy floating-point</td>
<td>Not preserved, should be empty on entry</td>
</tr>
<tr>
<td><strong>FS</strong></td>
<td>Thread-local storage</td>
<td>Reserved for system use</td>
</tr>
<tr>
<td><strong>GS</strong></td>
<td>Thread-local storage (kernel)</td>
<td>Reserved for system use</td>
</tr>
</tbody>
</table>
<h3 id="b.2-function-calling-convention">B.2 Function Calling
Convention</h3>
<h4 id="argument-passing">Argument Passing</h4>
<div class="sourceCode" id="cb269"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb269-1"><a href="#cb269-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Function prototype</span></span>
<span id="cb269-2"><a href="#cb269-2" aria-hidden="true" tabindex="-1"></a><span class="dt">long</span> func<span class="op">(</span><span class="dt">int</span> a<span class="op">,</span> <span class="dt">long</span> b<span class="op">,</span> <span class="dt">char</span> <span class="op">*</span>c<span class="op">,</span> <span class="dt">double</span> d<span class="op">,</span> <span class="dt">float</span> e<span class="op">,</span> <span class="dt">short</span> f<span class="op">,</span> </span>
<span id="cb269-3"><a href="#cb269-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">long</span> g<span class="op">,</span> <span class="dt">double</span> h<span class="op">,</span> <span class="dt">int</span> i<span class="op">);</span></span>
<span id="cb269-4"><a href="#cb269-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb269-5"><a href="#cb269-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Register assignments:</span></span>
<span id="cb269-6"><a href="#cb269-6" aria-hidden="true" tabindex="-1"></a><span class="co">// EDI = a (int; upper 32 bits of RDI are unspecified)</span></span>
<span id="cb269-7"><a href="#cb269-7" aria-hidden="true" tabindex="-1"></a><span class="co">// RSI = b (long)</span></span>
<span id="cb269-8"><a href="#cb269-8" aria-hidden="true" tabindex="-1"></a><span class="co">// RDX = c (pointer)</span></span>
<span id="cb269-9"><a href="#cb269-9" aria-hidden="true" tabindex="-1"></a><span class="co">// XMM0 = d (double)</span></span>
<span id="cb269-10"><a href="#cb269-10" aria-hidden="true" tabindex="-1"></a><span class="co">// XMM1 = e (float)</span></span>
<span id="cb269-11"><a href="#cb269-11" aria-hidden="true" tabindex="-1"></a><span class="co">// CX = f (short; upper bits of RCX are not guaranteed by the ABI text)</span></span>
<span id="cb269-12"><a href="#cb269-12" aria-hidden="true" tabindex="-1"></a><span class="co">// R8 = g (long)</span></span>
<span id="cb269-13"><a href="#cb269-13" aria-hidden="true" tabindex="-1"></a><span class="co">// XMM2 = h (double)</span></span>
<span id="cb269-14"><a href="#cb269-14" aria-hidden="true" tabindex="-1"></a><span class="co">// R9D = i (int; upper 32 bits of R9 are unspecified)</span></span></code></pre></div>
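<p>The assignment above follows from two independent counters, one for INTEGER-class and one for SSE-class arguments. The sketch below models only that rule for scalar arguments; the <code>ArgClass</code> enum and <code>assign_arg_registers</code> are hypothetical names, and aggregates, varargs, and x87 types are deliberately ignored.</p>

```c
#include <stddef.h>

typedef enum { ARG_INTEGER, ARG_SSE } ArgClass;

static const char *const INT_REGS[6] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};
static const char *const SSE_REGS[8] = {
    "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
};

/* Fills where[i] with the register that carries scalar argument i under
 * the System V AMD64 convention, or "stack" once the matching register
 * sequence is exhausted. */
void assign_arg_registers(const ArgClass *classes, size_t n,
                          const char **where)
{
    size_t next_int = 0, next_sse = 0;

    for (size_t i = 0; i < n; i++) {
        if (classes[i] == ARG_INTEGER && next_int < 6)
            where[i] = INT_REGS[next_int++];
        else if (classes[i] == ARG_SSE && next_sse < 8)
            where[i] = SSE_REGS[next_sse++];
        else
            where[i] = "stack";
    }
}
```

<p>Feeding it the classes of <code>func</code> (three integers, two floating-point values, two integers, one double, one integer) reproduces the register list shown in the comments; a seventh integer argument would land on the stack.</p>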
<h4 id="classification-rules">Classification Rules</h4>
<ol type="1">
<li><p><strong>INTEGER class</strong>: Integers and pointers (≤8
bytes; __int128 occupies two INTEGER eightbytes)</p></li>
<li><p><strong>SSE class</strong>: float, double, __m64, and the low
eightbyte of __m128 and __float128</p></li>
<li><p><strong>SSEUP class</strong>: Upper eightbytes of __m128 (and
wider vector types) and of __float128</p></li>
<li><p><strong>X87/X87UP class</strong>: long double (the low 64 bits
are X87, the upper eightbyte is X87UP)</p></li>
<li><p><strong>MEMORY class</strong>: Aggregates &gt;16 bytes or
with unaligned fields; passed and returned in memory</p></li>
<li><p><strong>NO_CLASS</strong>: Padding bytes</p></li>
</ol>
<h4 id="aggregate-structunion-passing">Aggregate (Struct/Union)
Passing</h4>
<div class="sourceCode" id="cb270"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb270-1"><a href="#cb270-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Passed in registers (≤16 bytes, proper alignment)</span></span>
<span id="cb270-2"><a href="#cb270-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Point <span class="op">{</span> <span class="dt">double</span> x<span class="op">,</span> y<span class="op">;</span> <span class="op">};</span> <span class="co">// XMM0 (x), XMM1 (y)</span></span>
<span id="cb270-3"><a href="#cb270-3" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Small <span class="op">{</span> <span class="dt">int</span> a<span class="op">;</span> <span class="dt">char</span> b<span class="op">;</span> <span class="op">};</span> <span class="co">// RDI (packed in single register)</span></span>
<span id="cb270-4"><a href="#cb270-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb270-5"><a href="#cb270-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Passed in memory (&gt;16 bytes): the caller copies the struct onto the stack</span></span>
<span id="cb270-6"><a href="#cb270-6" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Large <span class="op">{</span> <span class="dt">double</span> arr<span class="op">[</span><span class="dv">10</span><span class="op">];</span> <span class="op">};</span> <span class="co">// 80 bytes, copied to the stack</span></span>
<span id="cb270-7"><a href="#cb270-7" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Mixed <span class="op">{</span> <span class="dt">int</span> x<span class="op">;</span> <span class="dt">double</span> y<span class="op">[</span><span class="dv">3</span><span class="op">];</span> <span class="op">};</span> <span class="co">// 32 bytes, copied to the stack</span></span></code></pre></div>
<h4 id="variable-arguments-va_args">Variable Arguments (va_args)</h4>
<pre class="assembly"><code>; Before calling variadic function:
mov eax, num_fp_args ; Upper bound on vector registers used (0-8)
; Only AL is read by the callee; it may exceed the real count but must not be lower</code></pre>
<h3 id="b.3-stack-frame-layout">B.3 Stack Frame Layout</h3>
<h4 id="stack-organization-high-to-low-address">Stack Organization (High
to Low Address)</h4>
<pre><code>+------------------------+  Higher addresses
| Previous frame         |
+------------------------+
| Return address         | ← Pushed by CALL
+------------------------+
| Previous RBP           | ← Optional frame pointer
+------------------------+
| Callee-saved regs      | ← If used (RBX, R12-R15, etc.)
+------------------------+
| Local variables        | ← Locals and temporaries
+------------------------+
| Alloca() space         | ← Dynamic allocations
+------------------------+
| Padding                | ← For 16-byte alignment
+------------------------+
| Outgoing args          | ← Args 7+ for calls
+------------------------+ ← RSP (must be 16-byte aligned before CALL)
| Red zone (128 bytes)   | ← Leaf function scratch space, below RSP
+------------------------+  Lower addresses</code></pre>
<h4 id="red-zone-1">Red Zone</h4>
<ul>
<li><p>128 bytes below RSP</p></li>
<li><p>Available for leaf functions (functions that don’t call
others)</p></li>
<li><p>Not preserved across function calls</p></li>
<li><p>Not modified by signal or interrupt handlers: the kernel skips the red zone when it builds a signal frame</p></li>
</ul>
<h4 id="stack-alignment-1">Stack Alignment</h4>
<pre class="assembly"><code>; Stack must be 16-byte aligned before CALL
; (RSP + 8) % 16 == 0 at function entry
; RSP % 16 == 0 before making a call
; Typical prologue ensuring alignment:
push rbp ; RSP now 16-byte aligned
mov rbp, rsp
sub rsp, N ; N must maintain 16-byte alignment
and rsp, -16 ; Force alignment if needed</code></pre>
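<p>The alignment arithmetic can be checked mechanically. The sketch below computes the smallest <code>N</code> for <code>sub rsp, N</code> that holds the requested locals and leaves RSP 16-byte aligned, assuming the function was entered with <code>(RSP + 8) % 16 == 0</code> and has executed only 8-byte pushes so far. The helper names <code>align_up_16</code> and <code>frame_adjust</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of 16. */
static uint64_t align_up_16(uint64_t n)
{
    return (n + 15u) & ~(uint64_t)15;
}

/*
 * Hypothetical helper: smallest N >= locals such that RSP is 16-byte aligned
 * after `sub rsp, N`. Assumes RSP % 16 == 8 at function entry (the return
 * address was just pushed) and `pushes` 8-byte pushes, including `push rbp`.
 */
uint64_t frame_adjust(unsigned pushes, uint64_t locals)
{
    uint64_t pushed = 8u * (uint64_t)pushes;

    assert(locals < (UINT64_MAX >> 1)); /* keep the additions from wrapping */

    /* Return address + pushes + N must add up to a multiple of 16. */
    uint64_t total = align_up_16(8 + pushed + locals);
    return total - 8 - pushed;
}
```

<p>For example, <code>frame_adjust(1, 20)</code> yields 32 (only <code>push rbp</code>, 20 bytes of locals), while <code>frame_adjust(6, 32)</code> yields 40, because six pushes leave RSP 8 bytes off a 16-byte boundary.</p>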
<h3 id="b.4-return-values">B.4 Return Values</h3>
<h4 id="scalar-returns">Scalar Returns</h4>
<ul>
<li><p><strong>Integers/Pointers</strong>: RAX (up to 64 bits)</p></li>
<li><p><strong>128-bit integers</strong>: RDX:RAX (high:low)</p></li>
<li><p><strong>Floating-point</strong>: XMM0 (float/double)</p></li>
<li><p><strong>Long double</strong>: ST(0) (x87 stack)</p></li>
<li><p><strong>Complex float</strong>: XMM0 only (real in bits 0-31,
imaginary in bits 32-63)</p></li>
<li><p><strong>Complex double</strong>: XMM0 (real), XMM1
(imaginary)</p></li>
</ul>
<h4 id="aggregate-returns">Aggregate Returns</h4>
<div class="sourceCode" id="cb273"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb273-1"><a href="#cb273-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Small struct (≤16 bytes) - returned in registers</span></span>
<span id="cb273-2"><a href="#cb273-2" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Pair <span class="op">{</span> <span class="dt">long</span> a<span class="op">,</span> b<span class="op">;</span> <span class="op">};</span> <span class="co">// RAX (a), RDX (b)</span></span>
<span id="cb273-3"><a href="#cb273-3" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> FPPair <span class="op">{</span> <span class="dt">double</span> x<span class="op">,</span> y<span class="op">;</span> <span class="op">};</span> <span class="co">// XMM0 (x), XMM1 (y)</span></span>
<span id="cb273-4"><a href="#cb273-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb273-5"><a href="#cb273-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Large struct (&gt;16 bytes) - returned via hidden pointer</span></span>
<span id="cb273-6"><a href="#cb273-6" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Large <span class="op">{</span> <span class="dt">double</span> data<span class="op">[</span><span class="dv">10</span><span class="op">];</span> <span class="op">};</span> <span class="co">// Caller allocates, address in RDI</span></span>
<span id="cb273-7"><a href="#cb273-7" aria-hidden="true" tabindex="-1"></a><span class="co">// Other args shift: 1st visible arg → RSI, 2nd → RDX, etc.</span></span></code></pre></div>
<h3 id="b.5-function-prologue-and-epilogue">B.5 Function Prologue and
Epilogue</h3>
<h4 id="standard-prologue">Standard Prologue</h4>
<pre class="assembly"><code>func:
push rbp ; Save frame pointer (optional)
mov rbp, rsp ; Establish frame pointer (optional)
push rbx ; Save callee-saved registers
push r12
push r13
push r14
push r15
sub rsp, N ; Allocate locals (N % 16 == 8 here, since six pushes leave RSP % 16 == 8)
; Function body...</code></pre>
<h4 id="standard-epilogue">Standard Epilogue</h4>
<pre class="assembly"><code> ; Function body ends
add rsp, N ; Deallocate locals
pop r15 ; Restore callee-saved registers
pop r14
pop r13
pop r12
pop rbx
pop rbp ; Restore frame pointer
ret ; Return to caller</code></pre>
<h4 id="leaf-function-optimization-1">Leaf Function Optimization</h4>
<pre class="assembly"><code>leaf_func:
; No prologue needed if:
; - No calls to other functions
; - Uses only red zone (128 bytes)
; - Doesn&#39;t need callee-saved registers
mov [rsp-8], rdi ; Can use red zone
; ... computation ...
mov rax, [rsp-8] ; Return value
ret</code></pre>
<h3 id="b.6-system-calls">B.6 System Calls</h3>
<h4 id="linux-system-call-convention">Linux System Call Convention</h4>
<table>
<thead>
<tr>
<th>Register</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>RAX</strong></td>
<td>System call number</td>
</tr>
<tr>
<td><strong>RDI</strong></td>
<td>1st argument</td>
</tr>
<tr>
<td><strong>RSI</strong></td>
<td>2nd argument</td>
</tr>
<tr>
<td><strong>RDX</strong></td>
<td>3rd argument</td>
</tr>
<tr>
<td><strong>R10</strong></td>
<td>4th argument (not RCX!)</td>
</tr>
<tr>
<td><strong>R8</strong></td>
<td>5th argument</td>
</tr>
<tr>
<td><strong>R9</strong></td>
<td>6th argument</td>
</tr>
<tr>
<td><strong>RAX</strong></td>
<td>Return value (-errno on error)</td>
</tr>
<tr>
<td><strong>RCX</strong></td>
<td>Destroyed (stores return address)</td>
</tr>
<tr>
<td><strong>R11</strong></td>
<td>Destroyed (stores RFLAGS)</td>
</tr>
</tbody>
</table>
<h4 id="system-call-example">System Call Example</h4>
<pre class="assembly"><code>; write(1, &quot;Hello\n&quot;, 6)
mov rax, 1 ; sys_write
mov rdi, 1 ; fd = stdout
lea rsi, [msg] ; buffer
mov rdx, 6 ; count
syscall ; Make system call
; RAX = bytes written or -errno</code></pre>
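<p>The raw value left in RAX needs decoding before use. Linux reserves the range -4095 to -1 for <code>-errno</code>; anything else is a successful result. A minimal sketch of that check in C follows. The function names are hypothetical, and the code operates on a raw return value you already have rather than issuing a system call itself.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Linux encodes failure as -errno, with errno in the range 1..4095. */
#define MAX_ERRNO 4095L

/* True if a raw syscall return value (as left in RAX) signals an error. */
bool syscall_failed(long raw)
{
    return raw < 0 && raw >= -MAX_ERRNO;
}

/* The positive errno value for a failed call, or 0 for a successful one. */
int syscall_errno(long raw)
{
    return syscall_failed(raw) ? (int)-raw : 0;
}

/* The result of a call that is known to have succeeded. */
long syscall_value(long raw)
{
    assert(!syscall_failed(raw));
    return raw;
}
```

<p>Note that the libc wrappers behave differently: they return -1 and store the error code in <code>errno</code>, so this decoding applies only to code that executes <code>syscall</code> directly.</p>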
<h4 id="common-system-call-numbers">Common System Call Numbers</h4>
<pre class="assembly"><code>; Linux x86-64 system calls (selection)
SYS_read equ 0
SYS_write equ 1
SYS_open equ 2
SYS_close equ 3
SYS_mmap equ 9
SYS_mprotect equ 10
SYS_munmap equ 11
SYS_brk equ 12
SYS_ioctl equ 16
SYS_access equ 21
SYS_pipe equ 22
SYS_select equ 23
SYS_mremap equ 25
SYS_fork equ 57
SYS_vfork equ 58
SYS_execve equ 59
SYS_exit equ 60
SYS_wait4 equ 61
SYS_kill equ 62
SYS_uname equ 63
SYS_fcntl equ 72
SYS_flock equ 73
SYS_fsync equ 74
SYS_fdatasync equ 75
SYS_truncate equ 76
SYS_getdents equ 78
SYS_getcwd equ 79
SYS_chdir equ 80
SYS_fchdir equ 81
SYS_rename equ 82
SYS_mkdir equ 83
SYS_rmdir equ 84
SYS_creat equ 85
SYS_link equ 86
SYS_unlink equ 87
SYS_symlink equ 88
SYS_readlink equ 89
SYS_chmod equ 90
SYS_fchmod equ 91
SYS_chown equ 92
SYS_fchown equ 93
SYS_lchown equ 94
SYS_getuid equ 102
SYS_syslog equ 103
SYS_getgid equ 104
SYS_setuid equ 105
SYS_setgid equ 106
SYS_geteuid equ 107
SYS_getegid equ 108
SYS_setpgid equ 109
SYS_getppid equ 110
SYS_getpgrp equ 111
SYS_setsid equ 112
SYS_getsid equ 124
SYS_clone equ 56
SYS_exit_group equ 231</code></pre>
<h3 id="b.7-thread-local-storage-tls">B.7 Thread-Local Storage
(TLS)</h3>
<h4 id="tls-access-models">TLS Access Models</h4>
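<pre class="assembly"><code>; Initial Exec (IE) - static TLS, offset loaded from the GOT
mov rax, QWORD PTR variable@gottpoff[rip]
mov rax, QWORD PTR fs:[rax]
; Local Exec (LE) - executable&#39;s own TLS, offset known at link time
mov rax, QWORD PTR fs:variable@tpoff
; General Dynamic (GD) - dlopen&#39;ed libraries
lea rdi, variable@tlsgd[rip]
call __tls_get_addr@plt ; RAX = address of variable
; Local Dynamic (LD) - multiple TLS vars in one module
lea rdi, variable@tlsld[rip]
call __tls_get_addr@plt ; RAX = base of the module&#39;s TLS block
lea rax, variable@dtpoff[rax] ; RAX = address of variable</code></pre>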
<h3 id="b.8-exception-handling">B.8 Exception Handling</h3>
<h4 id="stack-unwinding-dwarf">Stack Unwinding (DWARF)</h4>
<pre class="assembly"><code>.cfi_startproc ; Start of function
.cfi_def_cfa_offset 16 ; Define CFA offset
.cfi_offset rbp, -16 ; RBP saved at CFA-16
.cfi_def_cfa_register rbp ; Use RBP as frame base
.cfi_endproc ; End of function</code></pre>
<h4 id="c-exception-handling">C++ Exception Handling</h4>
<ul>
<li><p>Landing pads for catch blocks</p></li>
<li><p>Personality routine: <code>__gxx_personality_v0</code></p></li>
<li><p>Unwinding library: <code>libgcc_s.so</code> /
<code>libunwind</code></p></li>
</ul>
<h3 id="b.9-data-alignment-requirements">B.9 Data Alignment
Requirements</h3>
<table>
<thead>
<tr>
<th>Type</th>
<th>Size</th>
<th>Alignment</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>char</code></td>
<td>1</td>
<td>1</td>
<td>No alignment requirement</td>
</tr>
<tr>
<td><code>short</code></td>
<td>2</td>
<td>2</td>
<td>Natural alignment</td>
</tr>
<tr>
<td><code>int</code></td>
<td>4</td>
<td>4</td>
<td>Natural alignment</td>
</tr>
<tr>
<td><code>long</code></td>
<td>8</td>
<td>8</td>
<td>Natural alignment</td>
</tr>
<tr>
<td><code>float</code></td>
<td>4</td>
<td>4</td>
<td>Natural alignment</td>
</tr>
<tr>
<td><code>double</code></td>
<td>8</td>
<td>8</td>
<td>Natural alignment</td>
</tr>
<tr>
<td><code>long double</code></td>
<td>16</td>
<td>16</td>
<td>Extended precision</td>
</tr>
<tr>
<td><code>__int128</code></td>
<td>16</td>
<td>16</td>
<td>GCC extension</td>
</tr>
<tr>
<td><code>pointer</code></td>
<td>8</td>
<td>8</td>
<td>All pointers are 64-bit</td>
</tr>
<tr>
<td><code>__m64</code></td>
<td>8</td>
<td>8</td>
<td>MMX type</td>
</tr>
<tr>
<td><code>__m128</code></td>
<td>16</td>
<td>16</td>
<td>SSE type</td>
</tr>
<tr>
<td><code>__m256</code></td>
<td>32</td>
<td>32</td>
<td>AVX type</td>
</tr>
<tr>
<td><code>__m512</code></td>
<td>64</td>
<td>64</td>
<td>AVX-512 type</td>
</tr>
</tbody>
</table>
<h4 id="structure-padding">Structure Padding</h4>
<div class="sourceCode" id="cb281"><pre
class="sourceCode c"><code class="sourceCode c"><span id="cb281-1"><a href="#cb281-1" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Example <span class="op">{</span></span>
<span id="cb281-2"><a href="#cb281-2" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> a<span class="op">;</span> <span class="co">// offset 0</span></span>
<span id="cb281-3"><a href="#cb281-3" aria-hidden="true" tabindex="-1"></a> <span class="co">// 3 bytes padding</span></span>
<span id="cb281-4"><a href="#cb281-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> b<span class="op">;</span> <span class="co">// offset 4</span></span>
<span id="cb281-5"><a href="#cb281-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> c<span class="op">;</span> <span class="co">// offset 8</span></span>
<span id="cb281-6"><a href="#cb281-6" aria-hidden="true" tabindex="-1"></a> <span class="co">// 7 bytes padding</span></span>
<span id="cb281-7"><a href="#cb281-7" aria-hidden="true" tabindex="-1"></a> <span class="dt">double</span> d<span class="op">;</span> <span class="co">// offset 16</span></span>
<span id="cb281-8"><a href="#cb281-8" aria-hidden="true" tabindex="-1"></a><span class="op">};</span> <span class="co">// Total size: 24 bytes (aligned to 8)</span></span></code></pre></div>
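<p>The offsets in the comments can be verified at compile time with <code>offsetof</code> and <code>static_assert</code>. The sketch below assumes an x86-64 System V target (other ABIs may lay the struct out differently). <code>struct Reordered</code> is a hypothetical variant added here only to show how member order affects padding.</p>

```c
#include <assert.h>
#include <stddef.h>

struct Example {
    char a;    /* offset 0, followed by 3 bytes of padding */
    int b;     /* offset 4 */
    char c;    /* offset 8, followed by 7 bytes of padding */
    double d;  /* offset 16 */
};

/* These checks hold under the x86-64 System V ABI alignment rules. */
static_assert(offsetof(struct Example, a) == 0, "a at offset 0");
static_assert(offsetof(struct Example, b) == 4, "b at offset 4");
static_assert(offsetof(struct Example, c) == 8, "c at offset 8");
static_assert(offsetof(struct Example, d) == 16, "d at offset 16");
static_assert(sizeof(struct Example) == 24, "total size 24");

/* Hypothetical variant: same members, ordered from largest to smallest. */
struct Reordered {
    double d;  /* offset 0 */
    int b;     /* offset 8 */
    char a;    /* offset 12 */
    char c;    /* offset 13, followed by 2 bytes of tail padding */
};

static_assert(sizeof(struct Reordered) == 16, "reordering saves 8 bytes");
```

<p>Sorting members by decreasing alignment is a common way to reduce padding; here it shrinks the struct from 24 to 16 bytes.</p>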
<h3 id="b.10-executable-file-format-elf">B.10 Executable File Format
(ELF)</h3>
<h4 id="program-headers">Program Headers</h4>
<pre><code>Type     Offset   VirtAddr   PhysAddr   FileSize  MemSize  Flg  Align
PHDR     0x40     0x400040   0x400040   0x1f8     0x1f8    R    0x8
INTERP   0x238    0x400238   0x400238   0x1c      0x1c     R    0x1
LOAD     0x0      0x400000   0x400000   0x1234    0x1234   R E  0x200000
LOAD     0x1e10   0x601e10   0x601e10   0x230     0x238    RW   0x200000
DYNAMIC  0</code></pre>
<p>The following <strong>System V AMD64 ABI Summary</strong> pulls the
key details together into one structured reference.</p>
<hr />
<h3 id="register-usage-and-preservation-rules">1️⃣ Register Usage and
Preservation Rules</h3>
<p><strong>General-purpose registers (64‑bit)</strong></p>
<table>
<colgroup>
<col style="width: 7%" />
<col style="width: 58%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr>
<th>Reg</th>
<th>Purpose</th>
<th>Preserved across calls?</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAX</td>
<td>Return value / syscall number</td>
<td>No</td>
</tr>
<tr>
<td>RDI</td>
<td>1ˢᵗ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>RSI</td>
<td>2ⁿᵈ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>RDX</td>
<td>3ʳᵈ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>RCX</td>
<td>4ᵗʰ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>R8</td>
<td>5ᵗʰ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>R9</td>
<td>6ᵗʰ integer/pointer arg</td>
<td>No</td>
</tr>
<tr>
<td>R10</td>
<td>Scratch / 4ᵗʰ syscall arg</td>
<td>No</td>
</tr>
<tr>
<td>R11</td>
<td>Scratch</td>
<td>No</td>
</tr>
<tr>
<td>RBX</td>
<td>Callee-saved</td>
<td>Yes</td>
</tr>
<tr>
<td>RBP</td>
<td>Frame pointer (opt.) / callee-saved</td>
<td>Yes</td>
</tr>
<tr>
<td>RSP</td>
<td>Stack pointer</td>
<td>Yes</td>
</tr>
<tr>
<td>R12–R15</td>
<td>Callee-saved</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<p><strong>Vector/Floating registers</strong></p>
<ul>
<li><p>XMM0–XMM7: FP args/returns, caller‑saved</p></li>
<li><p>XMM8–XMM15: Caller‑saved temps</p></li>
<li><p>YMM/ZMM extend XMM — upper bits are <strong>not</strong>
preserved</p></li>
<li><p>K0–K7 (AVX‑512 masks): Caller‑saved</p></li>
</ul>
<hr />
<h3 id="calling-convention-essentials">2️⃣ Calling Convention
Essentials</h3>
<p><strong>Integer/pointer args</strong>:<br />
1 → RDI<br />
2 → RSI<br />
3 → RDX<br />
4 → RCX<br />
5 → R8<br />
6 → R9<br />
More → pushed on stack, right‑to‑left</p>
<p><strong>Floating‑point args</strong>:<br />
1 → XMM0<br />
2 → XMM1<br />
… up to XMM7<br />
More → stack</p>
<p><strong>Return values</strong>:</p>
<ul>
<li><p>Integer/pointer ≤64‑bit → RAX</p></li>
<li><p>128‑bit integer → RDX:RAX</p></li>
<li><p>Float/double → XMM0</p></li>
<li><p>Complex float → XMM0 (real and imag packed); complex double → XMM0 (real), XMM1 (imag)</p></li>
<li><p>Small structs ≤16 bytes → registers per classification
rules</p></li>
<li><p>Larger aggregates → hidden pointer in RDI</p></li>
</ul>
<p><strong>Variadic functions</strong>: AL holds an upper bound on the
number of vector registers used for arguments.</p>
<hr />
<h3 id="stack-frame-and-alignment">3️⃣ Stack Frame and Alignment</h3>
<ul>
<li><p><strong>16‑byte stack alignment before call</strong></p></li>
<li><p>Red zone: 128 bytes below RSP usable in leaf functions</p></li>
<li><p>Typical frame layout:</p></li>
</ul>
<table style="width:35%;">
<colgroup>
<col style="width: 34%" />
</colgroup>
<tbody>
<tr>
<td>args &gt; 6 (overflow)</td>
</tr>
<tr>
<td>return address</td>
</tr>
<tr>
<td>saved RBP</td>
</tr>
<tr>
<td>callee-saved registers</td>
</tr>
<tr>
<td>local variables</td>
</tr>
</tbody>
</table>
<pre><code> RSP → low addresses</code></pre>
<p>Example prologue/epilogue:</p>
<div class="sourceCode" id="cb283"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb283-1"><a href="#cb283-1" aria-hidden="true" tabindex="-1"></a><span class="fu">func:</span></span>
<span id="cb283-2"><a href="#cb283-2" aria-hidden="true" tabindex="-1"></a> <span class="bu">push</span> <span class="kw">rbp</span></span>
<span id="cb283-3"><a href="#cb283-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">mov</span> <span class="kw">rbp</span><span class="op">,</span> <span class="kw">rsp</span></span>
<span id="cb283-4"><a href="#cb283-4" aria-hidden="true" tabindex="-1"></a> <span class="bu">push</span> <span class="kw">rbx</span> <span class="kw">r12</span> <span class="kw">r13</span> <span class="kw">r14</span> <span class="kw">r15</span></span>
<span id="cb283-5"><a href="#cb283-5" aria-hidden="true" tabindex="-1"></a> <span class="bu">sub</span> <span class="kw">rsp</span><span class="op">,</span> <span class="dv">40</span> <span class="co">; locals (40 keeps RSP 16-byte aligned after six pushes)</span></span>
<span id="cb283-6"><a href="#cb283-6" aria-hidden="true" tabindex="-1"></a> <span class="co">; body</span></span>
<span id="cb283-7"><a href="#cb283-7" aria-hidden="true" tabindex="-1"></a> <span class="bu">add</span> <span class="kw">rsp</span><span class="op">,</span> <span class="dv">40</span></span>
<span id="cb283-8"><a href="#cb283-8" aria-hidden="true" tabindex="-1"></a> <span class="bu">pop</span> <span class="kw">r15</span> <span class="kw">r14</span> <span class="kw">r13</span> <span class="kw">r12</span> <span class="kw">rbx</span></span>
<span id="cb283-9"><a href="#cb283-9" aria-hidden="true" tabindex="-1"></a> <span class="bu">pop</span> <span class="kw">rbp</span></span>
<span id="cb283-10"><a href="#cb283-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">ret</span></span></code></pre></div>
<hr />
<h3 id="system-calls-linux-amd64">4️⃣ System Calls (Linux AMD64)</h3>
<p>Registers:</p>
<table>
<thead>
<tr>
<th>Reg</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAX</td>
<td>syscall number</td>
</tr>
<tr>
<td>RDI</td>
<td>arg1</td>
</tr>
<tr>
<td>RSI</td>
<td>arg2</td>
</tr>
<tr>
<td>RDX</td>
<td>arg3</td>
</tr>
<tr>
<td>R10</td>
<td>arg4</td>
</tr>
<tr>
<td>R8</td>
<td>arg5</td>
</tr>
<tr>
<td>R9</td>
<td>arg6</td>
</tr>
<tr>
<td>RAX</td>
<td>return value</td>
</tr>
<tr>
<td>RCX,R11</td>
<td>trashed</td>
</tr>
</tbody>
</table>
<p>Example:</p>
<div class="sourceCode" id="cb284"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb284-1"><a href="#cb284-1" aria-hidden="true" tabindex="-1"></a><span class="co">; write(1, msg, len)</span></span>
<span id="cb284-2"><a href="#cb284-2" aria-hidden="true" tabindex="-1"></a><span class="bu">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="dv">1</span> <span class="co">; SYS_write</span></span>
<span id="cb284-3"><a href="#cb284-3" aria-hidden="true" tabindex="-1"></a><span class="bu">mov</span> <span class="kw">rdi</span><span class="op">,</span> <span class="dv">1</span> <span class="co">; fd = stdout</span></span>
<span id="cb284-4"><a href="#cb284-4" aria-hidden="true" tabindex="-1"></a><span class="bu">lea</span> <span class="kw">rsi</span><span class="op">,</span> <span class="op">[</span>rel msg<span class="op">]</span></span>
<span id="cb284-5"><a href="#cb284-5" aria-hidden="true" tabindex="-1"></a><span class="bu">mov</span> <span class="kw">rdx</span><span class="op">,</span> len</span>
<span id="cb284-6"><a href="#cb284-6" aria-hidden="true" tabindex="-1"></a><span class="cf">syscall</span></span></code></pre></div>
<hr />
<h3 id="data-alignment-rules">5️⃣ Data Alignment Rules</h3>
<table>
<thead>
<tr>
<th>Type</th>
<th>Size</th>
<th>Align</th>
</tr>
</thead>
<tbody>
<tr>
<td>char</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>short</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>int/float</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>long/double</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>long double</td>
<td>16</td>
<td>16</td>
</tr>
<tr>
<td>__m128</td>
<td>16</td>
<td>16</td>
</tr>
<tr>
<td>__m256</td>
<td>32</td>
<td>32</td>
</tr>
<tr>
<td>__m512</td>
<td>64</td>
<td>64</td>
</tr>
</tbody>
</table>
<p>Structs are padded automatically so that every member meets its alignment and the total size is a multiple of the largest member alignment.</p>
<hr />
<h3 id="threadlocal-storage-tls">6️⃣ Thread‑Local Storage (TLS)</h3>
<p>Access via the <strong>FS</strong> segment register in Linux user
space (the kernel uses <strong>GS</strong> for its per-CPU data):</p>
<div class="sourceCode" id="cb285"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb285-1"><a href="#cb285-1" aria-hidden="true" tabindex="-1"></a><span class="bu">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="dt">qword</span> <span class="dt">ptr</span> <span class="kw">fs</span><span class="op">:[</span>var<span class="fu">@</span><span class="er">t</span>poff<span class="op">]</span></span></code></pre></div>
<p>Dynamic TLS via <code>__tls_get_addr</code> helper for libraries.</p>
<hr />
<h3 id="exceptionunwind-info">7️⃣ Exception/Unwind Info</h3>
<p>DWARF <code>.cfi_*</code> directives record stack frame layout for
debuggers/exception handlers.</p>
<hr />
<h2 id="appendix-c-nasmgasmasm-syntax-comparison">Appendix C:
NASM/GAS/MASM Syntax Comparison</h2>
<h3 id="c.1-basic-syntax-differences">C.1 Basic Syntax Differences</h3>
<h4 id="instruction-format">Instruction Format</h4>
<table style="width:100%;">
<colgroup>
<col style="width: 19%" />
<col style="width: 13%" />
<col style="width: 26%" />
<col style="width: 28%" />
<col style="width: 13%" />
</colgroup>
<thead>
<tr>
<th>Feature</th>
<th>NASM</th>
<th>GAS (AT&amp;T)</th>
<th>GAS (Intel)</th>
<th>MASM</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Operand Order</strong></td>
<td><code>dest, src</code></td>
<td><code>src, dest</code></td>
<td><code>dest, src</code></td>
<td><code>dest, src</code></td>
</tr>
<tr>
<td><strong>Register Prefix</strong></td>
<td>None</td>
<td><code>%</code></td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td><strong>Immediate Prefix</strong></td>
<td>None</td>
<td><code>$</code></td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td><strong>Size Suffixes</strong></td>
<td>Use directives</td>
<td><code>b/w/l/q</code> suffix</td>
<td>Use PTR</td>
<td>Use PTR</td>
</tr>
<tr>
<td><strong>Comments</strong></td>
<td><code>;</code></td>
<td><code>#</code> or <code>/* */</code></td>
<td><code>#</code> or <code>/* */</code></td>
<td><code>;</code></td>
</tr>
<tr>
<td><strong>Hex Numbers</strong></td>
<td><code>0x123</code> or <code>123h</code></td>
<td><code>0x123</code> (<code>$0x123</code> as an immediate)</td>
<td><code>0x123</code></td>
<td><code>123h</code> or <code>0123h</code></td>
</tr>
<tr>
<td><strong>Binary Numbers</strong></td>
<td><code>0b1010</code> or <code>1010b</code></td>
<td><code>0b1010</code></td>
<td><code>0b1010</code></td>
<td><code>1010b</code></td>
</tr>
</tbody>
</table>
<h4 id="basic-instruction-examples">Basic Instruction Examples</h4>
<div class="sourceCode" id="cb286"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb286-1"><a href="#cb286-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb286-2"><a href="#cb286-2" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="kw">rbx</span> <span class="co">; move rbx to rax</span></span>
<span id="cb286-3"><a href="#cb286-3" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="dv">123</span> <span class="co">; move immediate</span></span>
<span id="cb286-4"><a href="#cb286-4" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbx</span><span class="op">]</span> <span class="co">; load from memory</span></span>
<span id="cb286-5"><a href="#cb286-5" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="dt">byte</span> <span class="op">[</span><span class="kw">rax</span><span class="op">],</span> <span class="dv">5</span> <span class="co">; store byte</span></span>
<span id="cb286-6"><a href="#cb286-6" aria-hidden="true" tabindex="-1"></a><span class="kw">add</span> <span class="kw">rax</span><span class="op">,</span> <span class="kw">rbx</span> <span class="co">; add registers</span></span>
<span id="cb286-7"><a href="#cb286-7" aria-hidden="true" tabindex="-1"></a><span class="kw">lea</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbx</span><span class="op">+</span><span class="kw">rcx</span><span class="op">*</span><span class="dv">4</span><span class="op">+</span><span class="dv">10</span><span class="op">]</span> <span class="co">; load effective address</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T syntax)
movq %rbx, %rax # move rbx to rax
movq $123, %rax # move immediate
movq (%rbx), %rax # load from memory
movb $5, (%rax) # store byte
addq %rbx, %rax # add registers
leaq 10(%rbx,%rcx,4), %rax # load effective address</code></pre>
<pre class="gas"><code># GAS (Intel syntax)
.intel_syntax noprefix
mov rax, rbx # move rbx to rax
mov rax, 123 # move immediate
mov rax, [rbx] # load from memory
mov byte ptr [rax], 5 # store byte
add rax, rbx # add registers
lea rax, [rbx+rcx*4+10] # load effective address</code></pre>
<pre class="masm"><code>; MASM
mov rax, rbx ; move rbx to rax
mov rax, 123 ; move immediate
mov rax, [rbx] ; load from memory
mov byte ptr [rax], 5 ; store byte
add rax, rbx ; add registers
lea rax, [rbx+rcx*4+10] ; load effective address</code></pre>
<h3 id="c.2-memory-addressing">C.2 Memory Addressing</h3>
<h4 id="direct-memory-access">Direct Memory Access</h4>
<div class="sourceCode" id="cb290"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb290-1"><a href="#cb290-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb290-2"><a href="#cb290-2" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="bn">0x401000</span><span class="op">]</span> <span class="co">; absolute address</span></span>
<span id="cb290-3"><a href="#cb290-3" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span>myvar<span class="op">]</span> <span class="co">; labeled address</span></span>
<span id="cb290-4"><a href="#cb290-4" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="kw">rel</span> myvar<span class="op">]</span> <span class="co">; RIP-relative (explicit)</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T)
movq 0x401000, %rax # absolute address
movq myvar, %rax # labeled address
movq myvar(%rip), %rax # RIP-relative</code></pre>
<pre class="gas"><code># GAS (Intel)
mov rax, qword ptr [0x401000] # absolute
mov rax, qword ptr [myvar] # labeled
mov rax, qword ptr [rip+myvar] # RIP-relative</code></pre>
<pre class="masm"><code>; MASM
mov rax, qword ptr [401000h] ; absolute
mov rax, qword ptr [myvar] ; labeled
mov rax, qword ptr myvar ; also valid</code></pre>
<h4 id="complex-addressing-modes-1">Complex Addressing Modes</h4>
<div class="sourceCode" id="cb294"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb294-1"><a href="#cb294-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM - [base + index*scale + disp]</span></span>
<span id="cb294-2"><a href="#cb294-2" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbx</span> <span class="op">+</span> <span class="kw">rcx</span><span class="op">*</span><span class="dv">8</span> <span class="op">+</span> <span class="dv">16</span><span class="op">]</span></span>
<span id="cb294-3"><a href="#cb294-3" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">[</span><span class="kw">rbx</span> <span class="op">+</span> <span class="dv">4</span><span class="op">*</span><span class="kw">rcx</span> <span class="op">-</span> <span class="dv">32</span><span class="op">]</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T) - disp(base, index, scale)
movq 16(%rbx, %rcx, 8), %rax
movq -32(%rbx, %rcx, 4), %rax</code></pre>
<pre class="gas"><code># GAS (Intel)
mov rax, [rbx + rcx*8 + 16]
mov rax, [rbx + rcx*4 - 32]</code></pre>
<pre class="masm"><code>; MASM
mov rax, [rbx + rcx*8 + 16]
mov rax, [rbx + rcx*4 - 32]</code></pre>
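<p>All four syntaxes describe the same computation: <code>base + index*scale + disp</code>, with a scale of 1, 2, 4, or 8 and a sign-extended 32-bit displacement. The C sketch below spells out that arithmetic. The function name <code>effective_address</code> is hypothetical, and the model ignores segment bases and RIP-relative forms.</p>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of x86-64 effective address calculation.
 * The displacement is a signed 32-bit value that is sign-extended to
 * 64 bits; the sum wraps modulo 2^64, as the hardware calculation does.
 */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int32_t disp)
{
    /* Only these four scale factors can be encoded in the SIB byte. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

<p>With <code>rbx = 0x1000</code> and <code>rcx = 3</code>, the operand <code>[rbx + rcx*8 + 16]</code> corresponds to <code>effective_address(0x1000, 3, 8, 16)</code>, which is <code>0x1028</code>.</p>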
<h3 id="c.3-data-definitions">C.3 Data Definitions</h3>
<h4 id="basic-data-types">Basic Data Types</h4>
<div class="sourceCode" id="cb298"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb298-1"><a href="#cb298-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb298-2"><a href="#cb298-2" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.data</span></span>
<span id="cb298-3"><a href="#cb298-3" aria-hidden="true" tabindex="-1"></a> byte_val <span class="dt">db</span> <span class="bn">0x12</span> <span class="co">; 1 byte</span></span>
<span id="cb298-4"><a href="#cb298-4" aria-hidden="true" tabindex="-1"></a> word_val <span class="dt">dw</span> <span class="bn">0x1234</span> <span class="co">; 2 bytes</span></span>
<span id="cb298-5"><a href="#cb298-5" aria-hidden="true" tabindex="-1"></a> dword_val <span class="dt">dd</span> <span class="bn">0x12345678</span> <span class="co">; 4 bytes</span></span>
<span id="cb298-6"><a href="#cb298-6" aria-hidden="true" tabindex="-1"></a> qword_val <span class="dt">dq</span> <span class="bn">0x123456789ABCDEF</span> <span class="co">; 8 bytes</span></span>
<span id="cb298-7"><a href="#cb298-7" aria-hidden="true" tabindex="-1"></a> float_val <span class="dt">dd</span> <span class="fl">3.14</span> <span class="co">; 32-bit float</span></span>
<span id="cb298-8"><a href="#cb298-8" aria-hidden="true" tabindex="-1"></a> double_val <span class="dt">dq</span> <span class="fl">3.14159</span> <span class="co">; 64-bit double</span></span>
<span id="cb298-9"><a href="#cb298-9" aria-hidden="true" tabindex="-1"></a> string_val <span class="dt">db</span> <span class="st">&quot;Hello&quot;</span><span class="op">,</span> <span class="dv">0</span> <span class="co">; null-terminated</span></span>
<span id="cb298-10"><a href="#cb298-10" aria-hidden="true" tabindex="-1"></a> array_val <span class="dt">dd</span> <span class="dv">1</span><span class="op">,</span> <span class="dv">2</span><span class="op">,</span> <span class="dv">3</span><span class="op">,</span> <span class="dv">4</span><span class="op">,</span> <span class="dv">5</span> <span class="co">; array</span></span>
<span id="cb298-11"><a href="#cb298-11" aria-hidden="true" tabindex="-1"></a> buffer <span class="dt">resb</span> <span class="dv">256</span> <span class="co">; uninitialized (belongs in section .bss)</span></span></code></pre></div>
<pre class="gas"><code># GAS
.section .data
byte_val: .byte 0x12
word_val: .word 0x1234
dword_val: .long 0x12345678
qword_val: .quad 0x123456789ABCDEF
float_val: .float 3.14
double_val: .double 3.14159
string_val: .asciz &quot;Hello&quot; # null-terminated
string2: .ascii &quot;World&quot; # not null-terminated
array_val: .long 1, 2, 3, 4, 5
.section .bss
buffer: .skip 256 # uninitialized</code></pre>
<pre class="masm"><code>; MASM
.data
byte_val BYTE 12h
word_val WORD 1234h
dword_val DWORD 12345678h
qword_val QWORD 123456789ABCDEFh
float_val REAL4 3.14
double_val REAL8 3.14159
string_val BYTE &quot;Hello&quot;, 0
array_val DWORD 1, 2, 3, 4, 5
buffer BYTE 256 DUP(?) ; uninitialized</code></pre>
<h4 id="string-definitions">String Definitions</h4>
<div class="sourceCode" id="cb301"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb301-1"><a href="#cb301-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb301-2"><a href="#cb301-2" aria-hidden="true" tabindex="-1"></a>str1 <span class="dt">db</span> <span class="st">&#39;Hello World&#39;</span><span class="op">,</span> <span class="bn">0x0A</span><span class="op">,</span> <span class="dv">0</span> <span class="co">; with newline</span></span>
<span id="cb301-3"><a href="#cb301-3" aria-hidden="true" tabindex="-1"></a>str2 <span class="dt">db</span> <span class="st">`Hello</span><span class="ch">\n</span><span class="st">World</span><span class="ch">\0</span><span class="st">`</span> <span class="co">; C-style escapes</span></span>
<span id="cb301-4"><a href="#cb301-4" aria-hidden="true" tabindex="-1"></a>str3 <span class="dt">times</span> <span class="dv">64</span> <span class="dt">db</span> <span class="dv">0</span> <span class="co">; 64 zeros</span></span></code></pre></div>
<pre class="gas"><code># GAS
str1: .asciz &quot;Hello World\n&quot; # null-terminated
str2: .string &quot;Hello World&quot; # same as .asciz
str3: .fill 64, 1, 0 # 64 bytes of 0</code></pre>
<pre class="masm"><code>; MASM
str1 BYTE &quot;Hello World&quot;, 0Ah, 0
str2 BYTE &quot;Hello&quot;, 0Dh, 0Ah, &quot;World&quot;, 0
str3 BYTE 64 DUP(0)</code></pre>
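<p>The string definitions above do not record their own length. A common
NASM idiom computes it at assembly time from the current position
<code>$</code>. A minimal sketch, where the names <code>msg</code> and
<code>msg_len</code> are hypothetical and not used elsewhere in this
appendix:</p>
<pre class="nasm"><code>; NASM (illustrative sketch)
section .data
msg      db  'Hello World', 0x0A    ; 11 characters plus a newline
msg_len  equ $ - msg                ; assemble-time constant: 12</code></pre>
<p>The <code>equ</code> line has to follow the data directly, because
<code>$</code> is the address of the current line.</p>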
<h3 id="c.4-sections-and-segments">C.4 Sections and Segments</h3>
<div class="sourceCode" id="cb304"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb304-1"><a href="#cb304-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb304-2"><a href="#cb304-2" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.text</span></span>
<span id="cb304-3"><a href="#cb304-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">global</span> _start</span>
<span id="cb304-4"><a href="#cb304-4" aria-hidden="true" tabindex="-1"></a><span class="fu">_start:</span></span>
<span id="cb304-5"><a href="#cb304-5" aria-hidden="true" tabindex="-1"></a> <span class="co">; code here</span></span>
<span id="cb304-6"><a href="#cb304-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb304-7"><a href="#cb304-7" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.data</span></span>
<span id="cb304-8"><a href="#cb304-8" aria-hidden="true" tabindex="-1"></a> <span class="co">; initialized data</span></span>
<span id="cb304-9"><a href="#cb304-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb304-10"><a href="#cb304-10" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.bss</span></span>
<span id="cb304-11"><a href="#cb304-11" aria-hidden="true" tabindex="-1"></a> <span class="co">; uninitialized data</span></span>
<span id="cb304-12"><a href="#cb304-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb304-13"><a href="#cb304-13" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.rodata</span></span>
<span id="cb304-14"><a href="#cb304-14" aria-hidden="true" tabindex="-1"></a> <span class="co">; read-only data</span></span></code></pre></div>
<pre class="gas"><code># GAS
.section .text
.global _start
_start:
# code here
.section .data
# initialized data
.section .bss
# uninitialized data
.section .rodata
# read-only data</code></pre>
<pre class="masm"><code>; MASM (Windows)
.code
main PROC
; code here
main ENDP
.data
; initialized data
.data?
; uninitialized data
.const
; read-only data</code></pre>
<h3 id="c.5-macros-and-directives">C.5 Macros and Directives</h3>
<h4 id="macro-definitions">Macro Definitions</h4>
<div class="sourceCode" id="cb307"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb307-1"><a href="#cb307-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb307-2"><a href="#cb307-2" aria-hidden="true" tabindex="-1"></a><span class="ot">%macro</span> pushall <span class="dv">0</span></span>
<span id="cb307-3"><a href="#cb307-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">push</span> <span class="kw">rax</span></span>
<span id="cb307-4"><a href="#cb307-4" aria-hidden="true" tabindex="-1"></a> <span class="kw">push</span> <span class="kw">rbx</span></span>
<span id="cb307-5"><a href="#cb307-5" aria-hidden="true" tabindex="-1"></a> <span class="kw">push</span> <span class="kw">rcx</span></span>
<span id="cb307-6"><a href="#cb307-6" aria-hidden="true" tabindex="-1"></a> <span class="kw">push</span> <span class="kw">rdx</span></span>
<span id="cb307-7"><a href="#cb307-7" aria-hidden="true" tabindex="-1"></a><span class="ot">%endmacro</span></span>
<span id="cb307-8"><a href="#cb307-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb307-9"><a href="#cb307-9" aria-hidden="true" tabindex="-1"></a><span class="ot">%macro</span> add3 <span class="dv">3</span> <span class="co">; 3 parameters</span></span>
<span id="cb307-10"><a href="#cb307-10" aria-hidden="true" tabindex="-1"></a> <span class="kw">mov</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">%</span><span class="dv">1</span></span>
<span id="cb307-11"><a href="#cb307-11" aria-hidden="true" tabindex="-1"></a> <span class="kw">add</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">%</span><span class="dv">2</span></span>
<span id="cb307-12"><a href="#cb307-12" aria-hidden="true" tabindex="-1"></a> <span class="kw">add</span> <span class="kw">rax</span><span class="op">,</span> <span class="op">%</span><span class="dv">3</span></span>
<span id="cb307-13"><a href="#cb307-13" aria-hidden="true" tabindex="-1"></a><span class="ot">%endmacro</span></span>
<span id="cb307-14"><a href="#cb307-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb307-15"><a href="#cb307-15" aria-hidden="true" tabindex="-1"></a><span class="co">; Usage</span></span>
<span id="cb307-16"><a href="#cb307-16" aria-hidden="true" tabindex="-1"></a>pushall</span>
<span id="cb307-17"><a href="#cb307-17" aria-hidden="true" tabindex="-1"></a>add3 <span class="kw">rdi</span><span class="op">,</span> <span class="kw">rsi</span><span class="op">,</span> <span class="kw">rdx</span></span></code></pre></div>
<pre class="gas"><code># GAS
.macro pushall
push %rax
push %rbx
push %rcx
push %rdx
.endm
.macro add3 p1, p2, p3
movq \p1, %rax
addq \p2, %rax
addq \p3, %rax
.endm
# Usage
pushall
add3 %rdi, %rsi, %rdx</code></pre>
<pre class="masm"><code>; MASM
pushall MACRO
push rax
push rbx
push rcx
push rdx
ENDM
add3 MACRO p1, p2, p3
mov rax, p1
add rax, p2
add rax, p3
ENDM
; Usage
pushall
add3 rdi, rsi, rdx</code></pre>
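<p>A register-saving macro such as <code>pushall</code> is normally paired
with one that restores the registers in reverse order, since the stack is
last-in, first-out. A minimal NASM sketch; the name <code>popall</code> is
an assumption and is not defined elsewhere in this appendix:</p>
<pre class="nasm"><code>; NASM (illustrative sketch)
%macro popall 0
    pop rdx                         ; pushed last by pushall, so popped first
    pop rcx
    pop rbx
    pop rax
%endmacro

; Usage
pushall
    ; code that clobbers rax, rbx, rcx, rdx
popall</code></pre>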
<h4 id="conditional-assembly">Conditional Assembly</h4>
<div class="sourceCode" id="cb310"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb310-1"><a href="#cb310-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb310-2"><a href="#cb310-2" aria-hidden="true" tabindex="-1"></a><span class="ot">%ifdef</span> DEBUG</span>
<span id="cb310-3"><a href="#cb310-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">mov</span> <span class="kw">rdi</span><span class="op">,</span> debug_msg</span>
<span id="cb310-4"><a href="#cb310-4" aria-hidden="true" tabindex="-1"></a> <span class="cf">call</span> print_debug</span>
<span id="cb310-5"><a href="#cb310-5" aria-hidden="true" tabindex="-1"></a><span class="ot">%endif</span></span>
<span id="cb310-6"><a href="#cb310-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb310-7"><a href="#cb310-7" aria-hidden="true" tabindex="-1"></a><span class="ot">%if</span> BUFFER_SIZE <span class="op">&gt;</span> <span class="dv">1024</span></span>
<span id="cb310-8"><a href="#cb310-8" aria-hidden="true" tabindex="-1"></a> <span class="ot">%error</span> <span class="st">&quot;Buffer too large&quot;</span></span>
<span id="cb310-9"><a href="#cb310-9" aria-hidden="true" tabindex="-1"></a><span class="ot">%elif</span> BUFFER_SIZE <span class="op">&lt;</span> <span class="dv">16</span></span>
<span id="cb310-10"><a href="#cb310-10" aria-hidden="true" tabindex="-1"></a> <span class="ot">%error</span> <span class="st">&quot;Buffer too small&quot;</span></span>
<span id="cb310-11"><a href="#cb310-11" aria-hidden="true" tabindex="-1"></a><span class="ot">%endif</span></span></code></pre></div>
<pre class="gas"><code># GAS
.ifdef DEBUG
movq $debug_msg, %rdi
call print_debug
.endif
.if BUFFER_SIZE &gt; 1024
.error &quot;Buffer too large&quot;
.elseif BUFFER_SIZE &lt; 16
.error &quot;Buffer too small&quot;
.endif</code></pre>
<pre class="masm"><code>; MASM
IFDEF DEBUG
mov rdi, OFFSET debug_msg
call print_debug
ENDIF
IF BUFFER_SIZE GT 1024
.ERR &lt;Buffer too large&gt;
ELSEIF BUFFER_SIZE LT 16
.ERR &lt;Buffer too small&gt;
ENDIF</code></pre>
<h3 id="c.6-symbols-and-labels">C.6 Symbols and Labels</h3>
<h4 id="global-and-external-symbols">Global and External Symbols</h4>
<div class="sourceCode" id="cb313"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb313-1"><a href="#cb313-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb313-2"><a href="#cb313-2" aria-hidden="true" tabindex="-1"></a><span class="kw">global</span> main <span class="co">; export symbol</span></span>
<span id="cb313-3"><a href="#cb313-3" aria-hidden="true" tabindex="-1"></a><span class="kw">extern</span> printf <span class="co">; import symbol</span></span>
<span id="cb313-4"><a href="#cb313-4" aria-hidden="true" tabindex="-1"></a><span class="kw">extern</span> data_var</span>
<span id="cb313-5"><a href="#cb313-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb313-6"><a href="#cb313-6" aria-hidden="true" tabindex="-1"></a><span class="fu">main:</span></span>
<span id="cb313-7"><a href="#cb313-7" aria-hidden="true" tabindex="-1"></a> <span class="co">; function code</span></span>
<span id="cb313-8"><a href="#cb313-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">ret</span></span>
<span id="cb313-9"><a href="#cb313-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb313-10"><a href="#cb313-10" aria-hidden="true" tabindex="-1"></a><span class="fu">.loop_label:</span> <span class="co">; local label</span></span>
<span id="cb313-11"><a href="#cb313-11" aria-hidden="true" tabindex="-1"></a> <span class="co">; loop code</span></span></code></pre></div>
<pre class="gas"><code># GAS
.global main # export symbol
.extern printf # import (optional)
main:
# function code
ret
.Lloop_label: # local label
# loop code</code></pre>
<pre class="masm"><code>; MASM
PUBLIC main ; export symbol
EXTERN printf:PROC ; import function
EXTERN data_var:QWORD ; import variable
main PROC
; function code
ret
main ENDP
@@: ; anonymous label
; loop code</code></pre>
<h4 id="alignment-directives">Alignment Directives</h4>
<div class="sourceCode" id="cb316"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb316-1"><a href="#cb316-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb316-2"><a href="#cb316-2" aria-hidden="true" tabindex="-1"></a><span class="kw">align</span> <span class="dv">16</span> <span class="co">; align to 16 bytes</span></span>
<span id="cb316-3"><a href="#cb316-3" aria-hidden="true" tabindex="-1"></a><span class="kw">alignb</span> <span class="dv">16</span> <span class="co">; align reserved space in .bss (pads with resb)</span></span>
<span id="cb316-4"><a href="#cb316-4" aria-hidden="true" tabindex="-1"></a><span class="dt">times</span> <span class="op">(</span><span class="dv">16</span><span class="op">-($-</span>$$<span class="op">)</span> <span class="op">%</span> <span class="dv">16</span><span class="op">)</span> <span class="op">%</span> <span class="dv">16</span> nop <span class="co">; manual alignment (emits nothing if already aligned)</span></span></code></pre></div>
<pre class="gas"><code># GAS
.align 16 # 16 bytes on x86 ELF; means 2^16 on some other targets
.p2align 4 # align to 2^4 = 16 bytes (portable)
.balign 16 # byte align to 16</code></pre>
<pre class="masm"><code>; MASM
ALIGN 16 ; align to 16 bytes</code></pre>
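<p>Alignment matters most for data accessed with aligned SIMD loads such
as <code>movaps</code> (see C.8), which fault on a misaligned memory
operand. A minimal NASM sketch, assuming a hypothetical label
<code>vec_a</code>:</p>
<pre class="nasm"><code>; NASM (illustrative sketch)
section .data
    align 16, db 0                  ; pad with zero bytes instead of NOPs
vec_a:  dd 1.0, 2.0, 3.0, 4.0       ; four packed singles on a 16-byte boundary

section .text
    movaps xmm0, [rel vec_a]        ; aligned load is safe here</code></pre>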
<h3 id="c.7-procedure-definitions">C.7 Procedure Definitions</h3>
<h4 id="function-declaration">Function Declaration</h4>
<div class="sourceCode" id="cb319"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb319-1"><a href="#cb319-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb319-2"><a href="#cb319-2" aria-hidden="true" tabindex="-1"></a><span class="kw">section</span> <span class="fu">.text</span></span>
<span id="cb319-3"><a href="#cb319-3" aria-hidden="true" tabindex="-1"></a><span class="kw">global</span> my_function</span>
<span id="cb319-4"><a href="#cb319-4" aria-hidden="true" tabindex="-1"></a><span class="fu">my_function:</span></span>
<span id="cb319-5"><a href="#cb319-5" aria-hidden="true" tabindex="-1"></a> <span class="kw">push</span> <span class="kw">rbp</span></span>
<span id="cb319-6"><a href="#cb319-6" aria-hidden="true" tabindex="-1"></a> <span class="kw">mov</span> <span class="kw">rbp</span><span class="op">,</span> <span class="kw">rsp</span></span>
<span id="cb319-7"><a href="#cb319-7" aria-hidden="true" tabindex="-1"></a> <span class="co">; function body</span></span>
<span id="cb319-8"><a href="#cb319-8" aria-hidden="true" tabindex="-1"></a> <span class="kw">pop</span> <span class="kw">rbp</span></span>
<span id="cb319-9"><a href="#cb319-9" aria-hidden="true" tabindex="-1"></a> <span class="cf">ret</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T)
.section .text
.global my_function
.type my_function, @function
my_function:
pushq %rbp
movq %rsp, %rbp
# function body
popq %rbp
ret
.size my_function, .-my_function</code></pre>
<pre class="gas"><code># GAS (Intel)
.intel_syntax noprefix
.global my_function
.type my_function, @function
my_function:
push rbp
mov rbp, rsp
# function body
pop rbp
ret</code></pre>
<pre class="masm"><code>; MASM
.code
my_function PROC
push rbp
mov rbp, rsp
; function body
pop rbp
ret
my_function ENDP</code></pre>
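<p>The skeletons above leave the function body empty. As a hedged
illustration, the NASM sketch below fills one in for a hypothetical
<code>add_two</code> routine. It assumes the System V AMD64 calling
convention (first two integer arguments in <code>rdi</code> and
<code>rsi</code>, result in <code>rax</code>); under the Windows x64
convention the arguments would arrive in <code>rcx</code> and
<code>rdx</code> instead.</p>
<pre class="nasm"><code>; NASM (illustrative sketch, System V AMD64 assumed)
section .text
global add_two
add_two:                            ; hypothetical: long add_two(long a, long b)
    push rbp
    mov  rbp, rsp
    lea  rax, [rdi + rsi]           ; rax = a + b
    pop  rbp
    ret</code></pre>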
<h3 id="c.8-simd-instructions">C.8 SIMD Instructions</h3>
<h4 id="sseavx-instructions">SSE/AVX Instructions</h4>
<div class="sourceCode" id="cb323"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb323-1"><a href="#cb323-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb323-2"><a href="#cb323-2" aria-hidden="true" tabindex="-1"></a><span class="kw">movaps</span> <span class="kw">xmm0</span><span class="op">,</span> <span class="op">[</span><span class="kw">rsi</span><span class="op">]</span> <span class="co">; aligned move</span></span>
<span id="cb323-3"><a href="#cb323-3" aria-hidden="true" tabindex="-1"></a><span class="kw">movups</span> <span class="kw">xmm1</span><span class="op">,</span> <span class="op">[</span><span class="kw">rdi</span><span class="op">]</span> <span class="co">; unaligned move</span></span>
<span id="cb323-4"><a href="#cb323-4" aria-hidden="true" tabindex="-1"></a><span class="kw">addps</span> <span class="kw">xmm0</span><span class="op">,</span> <span class="kw">xmm1</span> <span class="co">; packed single add</span></span>
<span id="cb323-5"><a href="#cb323-5" aria-hidden="true" tabindex="-1"></a><span class="kw">vmovaps</span> <span class="kw">ymm0</span><span class="op">,</span> <span class="op">[</span><span class="kw">rsi</span><span class="op">]</span> <span class="co">; AVX 256-bit</span></span>
<span id="cb323-6"><a href="#cb323-6" aria-hidden="true" tabindex="-1"></a><span class="kw">vaddps</span> <span class="kw">ymm0</span><span class="op">,</span> <span class="kw">ymm1</span><span class="op">,</span> <span class="kw">ymm2</span> <span class="co">; AVX 3-operand</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T)
movaps (%rsi), %xmm0 # aligned move
movups (%rdi), %xmm1 # unaligned move
addps %xmm1, %xmm0 # packed single add
vmovaps (%rsi), %ymm0 # AVX 256-bit
vaddps %ymm2, %ymm1, %ymm0 # AVX 3-operand (src2, src1, dest)</code></pre>
<pre class="gas"><code># GAS (Intel)
movaps xmm0, [rsi]
movups xmm1, [rdi]
addps xmm0, xmm1
vmovaps ymm0, [rsi]
vaddps ymm0, ymm1, ymm2</code></pre>
<pre class="masm"><code>; MASM
movaps xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rdi]
addps xmm0, xmm1
vmovaps ymm0, ymmword ptr [rsi]
vaddps ymm0, ymm1, ymm2</code></pre>
<h4 id="avx-512-with-masking">AVX-512 with Masking</h4>
<div class="sourceCode" id="cb327"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb327-1"><a href="#cb327-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb327-2"><a href="#cb327-2" aria-hidden="true" tabindex="-1"></a><span class="kw">vmovaps</span> <span class="kw">zmm0</span> <span class="op">{</span>k1<span class="op">},</span> <span class="op">[</span><span class="kw">rsi</span><span class="op">]</span> <span class="co">; masked load</span></span>
<span id="cb327-3"><a href="#cb327-3" aria-hidden="true" tabindex="-1"></a><span class="kw">vaddps</span> <span class="kw">zmm0</span> <span class="op">{</span>k1<span class="op">}{</span>z<span class="op">},</span> <span class="kw">zmm1</span><span class="op">,</span> <span class="kw">zmm2</span> <span class="co">; masked add with zeroing</span></span></code></pre></div>
<pre class="gas"><code># GAS (AT&amp;T)
vmovaps (%rsi), %zmm0{%k1}
vaddps %zmm2, %zmm1, %zmm0{%k1}{z}</code></pre>
<pre class="gas"><code># GAS (Intel)
vmovaps zmm0{k1}, [rsi]
vaddps zmm0{k1}{z}, zmm1, zmm2</code></pre>
<pre class="masm"><code>; MASM
vmovaps zmm0{k1}, zmmword ptr [rsi]
vaddps zmm0{k1}{z}, zmm1, zmm2</code></pre>
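<p>The examples above assume that <code>k1</code> already holds a mask.
Each bit of the mask selects one 32-bit lane. With <code>{z}</code> the
unselected lanes are zeroed; without it they keep the destination's
previous contents (merge-masking). A minimal NASM sketch; the mask value
<code>0x00FF</code> is an arbitrary choice for illustration:</p>
<pre class="nasm"><code>; NASM (illustrative sketch)
    mov     eax, 0x00FF             ; bits 0-7 set: select the low 8 of 16 float lanes
    kmovw   k1, eax                 ; load the 16-bit mask into k1
    vmovaps zmm0{k1}{z}, [rsi]      ; load lanes 0-7, zero lanes 8-15 (rsi must be 64-byte aligned)
    vaddps  zmm0{k1}, zmm1, zmm2    ; merge-masking: lanes 8-15 of zmm0 stay unchanged</code></pre>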
<h3 id="c.9-system-instructions">C.9 System Instructions</h3>
<h4 id="privileged-instructions">Privileged Instructions</h4>
<div class="sourceCode" id="cb331"><pre
class="sourceCode nasm"><code class="sourceCode nasm"><span id="cb331-1"><a href="#cb331-1" aria-hidden="true" tabindex="-1"></a><span class="co">; NASM</span></span>
<span id="cb331-2"><a href="#cb331-2" aria-hidden="true" tabindex="-1"></a><span class="kw">lgdt</span> <span class="op">[</span>gdtr<span class="op">]</span> <span class="co">; load GDT</span></span>
<span id="cb331-3"><a href="#cb331-3" aria-hidden="true" tabindex="-1"></a><span class="kw">lidt</span> <span class="op">[</span>idtr<span class="op">]</span> <span class="co">; load IDT</span></span>
<span id="cb331-4"><a href="#cb331-4" aria-hidden="true" tabindex="-1"></a><span class="kw">mov</span> <span class="kw">cr3</span><span class="op">,</span> <span class="kw">rax</span> <span class="co">; load page-table base (CR3)</span></span></code></pre></div>
<p>This appendix is organized into the following sub‑sections:</p>
<ul>
<li><p><strong>Basic Syntax Differences</strong>: operand order,
register/immediate prefixes, size suffixes, and comment/number
formats.</p></li>
<li><p><strong>Memory Addressing</strong>: absolute, label, and
RIP‑relative addressing, plus
base+index*scale+displacement.</p></li>
<li><p><strong>Data Definitions</strong>: maps NASM
<code>db/dw</code>‑style directives to GAS <code>.byte/.word</code> and MASM
<code>BYTE/WORD</code>, including strings, arrays, and
uninitialized buffers.</p></li>
<li><p><strong>Sections/Segments</strong>: how the three syntaxes
declare <code>.text</code>, <code>.data</code>, <code>.bss</code>, and
platform‑specific segments.</p></li>
<li><p><strong>Macros/Directives</strong>: macro definitions in
each assembler’s style, with conditional assembly
patterns.</p></li>
<li><p><strong>Symbols/Labels</strong>: global/external symbol
exports/imports, local labels, and alignment directives.</p></li>
<li><p><strong>Procedure Definitions</strong>: function
prologues/epilogues and declaration differences.</p></li>
<li><p><strong>SIMD Instructions</strong>: SSE, AVX, and AVX‑512 usage,
operand order, and how masking/zeroing is written in each
dialect.</p></li>
<li><p><strong>System Instructions</strong>: privileged instruction
usage.</p></li>
</ul>
<p>Together these sub‑sections form Appendix C: NASM/GAS/MASM Syntax
Comparison, covering operand order, register and immediate formatting,
addressing modes, directives, and SIMD masking syntax.</p>
<hr />
</main>
</body>
</html>