Meng, Hengyu (airMeng)
airMeng / part_3_vectorization_techniques.md
Created December 26, 2023 07:16 — forked from mingfeima/part_3_vectorization_techniques.md
PyTorch CPU Performance Optimization Tutorial - Section III
airMeng / LLM Int4 Inference on Arc.md
Last active January 31, 2024 05:56
LLM Int4 Inference on Arc

IPEX

Intel® Extension for PyTorch* (IPEX) extends PyTorch* with up-to-date features and optimizations for an extra performance boost on Intel hardware. The optimizations take advantage of the Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.
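
As a minimal sketch of the xpu device in use (assuming a recent IPEX XPU build; the `nn.Linear` below is just a placeholder for any model):

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Placeholder module; any nn.Module is moved to the GPU the same way.
model = nn.Linear(1024, 1024).eval().to("xpu")
model = ipex.optimize(model)  # apply IPEX operator/kernel optimizations

with torch.no_grad():
    x = torch.randn(1, 1024, device="xpu")
    y = model(x)
```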

XeTLA

Intel® Xe Templates for Linear Algebra (Intel® XeTLA) is a collection of SYCL/ESIMD templates that enable high-performance General Matrix Multiply (GEMM), Convolution (CONV), and related computations on the Intel Xe GPU architecture. Intel® XeTLA offers reusable C++ templates at the kernel, group, and subgroup levels, allowing developers to optimize and specialize kernels based on data types, tiling policies, algorithms, fusion policies, and more.

Thanks to XeTLA's template design, users can easily define a new compression/decompression prologue and insert it right before the BRGEMM compute to fully accelerate weight-only-quantized (WOQ) GEMM.
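
For reference, the math such a prologue fuses is just per-group int4 dequantization ahead of the GEMM. A plain-PyTorch sketch of the unfused semantics (layouts and names like `group_size` are illustrative, not the XeTLA API):

```python
import torch

def woq_gemm_reference(x, qweight, scales, zeros, group_size=128):
    """Unfused reference for weight-only-quantized GEMM.

    x:       (M, K) activation, fp16/fp32
    qweight: (K, N) int4 weight codes stored as int8 in [0, 15]
    scales:  (K // group_size, N) per-group scales
    zeros:   (K // group_size, N) per-group zero points
    """
    # De-compression "prologue": map int4 codes back to floating point.
    w = (qweight.float() - zeros.repeat_interleave(group_size, dim=0)) \
        * scales.repeat_interleave(group_size, dim=0)
    # The GEMM itself; XeTLA fuses the step above into the BRGEMM tiles.
    return x.float() @ w
```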

airMeng / XeTLA.md
Last active June 29, 2024 02:25
XeTLA

HW Target

| Device   | PVC    | MTL    | DG2    | LNL/BMG (TODO) | ARL (TODO) |
|----------|--------|--------|--------|----------------|------------|
| ISA      | Xe     | Xe-lpg | Xe-hpg | Xe2            | Xe-lpg+    |
| DPAS     | 8,8,16 | NA     | 8,8,8  | 8,4,16         | 8,8,8      |
| 2D Block | 32, 64 | NA     | NA     | 32, 64         | NA         |
| 1D Block | 64     | 32     | 32     | 64             | 32         |

How to Add a new HW

FP8 Linear in TorchAO

```mermaid
%%{init: {'themeVariables': {'fontSize': '24px', 'fontFamily': 'Arial'}}}%%
graph TD
    E{Scaling Recipes}
    E -->|Delayed| F[Record History and Register Scale Buffer]
    E -->|Static| G[Register Scale Buffer]
    E -->|Dynamic| A
    F --> A
```
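
The dynamic branch needs no persistent buffer because the scale is recomputed from each tensor's current amax. A minimal sketch of that recipe in plain PyTorch (assuming a PyTorch build with the `float8_e4m3fn` dtype; this is an illustration, not the TorchAO API itself):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def dynamic_scale(t: torch.Tensor) -> torch.Tensor:
    # Dynamic recipe: derive the scale from the current tensor's amax,
    # so no amax history or registered scale buffer is needed.
    amax = t.abs().max().float().clamp(min=1e-12)
    return FP8_MAX / amax

def to_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale into the representable range, then cast down to fp8.
    return (t.float() * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

x = torch.randn(16, 32)
x_fp8 = to_fp8(x, dynamic_scale(x))
```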
airMeng / static.md
Last active April 17, 2025 03:09
Static quantization in AO