- Exciting times ahead for Generative AI (GenAI) on edge devices.
- Growing compute and rapid pace of model innovation for the edge.
- The PyTorch ecosystem provides key tools to support development, including:
- AI Edge Generative API: AI Edge Torch (see the sketch after this list)
- Visualize and Debug: Model Explorer
- Tune and experiment with new models (not covered today but worth noting).
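For context, a minimal conversion sketch assuming the `ai_edge_torch` package's `convert`/`export` API; the torchvision model and output file name are illustrative.

```python
import torch
import torchvision

import ai_edge_torch  # Google AI Edge Torch package

# Hedged sketch: convert a PyTorch module to an on-device (LiteRT/TFLite) model.
# The model choice and file name are illustrative.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("mobilenet_v2.tflite")
```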
- Scientific Machine Learning:
- Has both overlaps and differences with broader trends in machine learning.
- PyTorch, alongside the Linux Foundation, contributes to stability in this domain.
- Transformers are becoming increasingly important in scientific applications.
- Solving equations (linear, differential, non-linear) is crucial for scientific computation.
- Modular frameworks are necessary to support various loss functions during pretraining.
- Scientific machine learning is still an emerging field. Most scientific codes are not yet differentiable.
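As a small, generic illustration (not from the talk) of what "differentiable scientific code" means in PyTorch: solve a linear system and backpropagate through the solver.

```python
import torch

# Solve A x = b and differentiate through the solve; the diagonal shift is only
# there to keep the illustrative random system well conditioned.
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3)

x = torch.linalg.solve(A + 3.0 * torch.eye(3), b)
loss = x.square().sum()
loss.backward()            # gradients flow through the solver into A
print(A.grad.shape)        # torch.Size([3, 3])
```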
- Triton and Kernel Development:
- Prefer writing operators over using raw kernels, especially for library authors.
- Use `torch.library.custom_op` to wrap kernels into Python operators.
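A minimal sketch of `torch.library.custom_op` (PyTorch 2.4+); the operator name `mylib::numpy_sin` and the NumPy "kernel" are illustrative stand-ins for a real CUDA/Triton kernel.

```python
import numpy as np
import torch

# Wrap a hand-written "kernel" (here: NumPy on CPU) as a PyTorch operator.
@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.from_numpy(np.sin(x.numpy(force=True)))

# A fake/meta implementation tells the compiler the output shape and dtype,
# so the operator composes with torch.compile.
@numpy_sin.register_fake
def _(x):
    return torch.empty_like(x)

print(numpy_sin(torch.linspace(0.0, 3.14, 5)))
```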
- Triton Kernels:
- Triton kernels integrate seamlessly with `torch.compile`, offering a hackable and performant alternative to CUDA kernels.
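For reference, the classic Triton elementwise-add kernel wrapped in a Python function; block size and names are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# torch.compile can trace through user-defined Triton kernels, so the wrapper
# composes with the rest of a compiled model.
compiled_add = torch.compile(add)
```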
- Choosing Battles Wisely:
- Models like Llama, Mistral, Pixtral, Gemma, Qwen, GPT, etc., represent a diverse set of tools.
- Focus on a subset of features that deliver the most significant impact for users, as it’s impractical to troubleshoot everything.
- Frameworks:
- PyTorch, SageMaker, and TensorFlow dominate the ecosystem.
- PyTorch versions: focus on v2.4.x, v2.3.x, and <2.2.x for compatibility.
- Model Types:
- CausalLM, classification, and other common model architectures.
- Communication and Automation:
- Open communication channels.
- Pin dependencies to specific versions for stability.
- Automate CI to re-run tests against the primary upstream dependencies.
- Early detection of upstream breaking changes helps surface new bugs quickly.
- A modular library with:
- Training recipes.
- Full training loops designed for easy copying and modification.
- Memory efficiency: From single-device to single-node setups.
- Futures: References to objects that may not yet exist.
- Actors: Remote class instances.
- Shared In-Memory Distributed Object Store: Used for efficient data sharing.
- Tasks: Remote functions executed on distributed systems.
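A minimal sketch tying these four concepts together with the classic Ray API:

```python
import ray

ray.init()

# Task: a remote function. Calling .remote() returns a future (ObjectRef).
@ray.remote
def square(x):
    return x * x

# Actor: a remote class instance with its own state.
@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]   # futures, not values
counter = Counter.remote()
count_ref = counter.incr.remote()

# ray.put / ray.get move data through the shared in-memory object store.
data_ref = ray.put([1, 2, 3])
print(ray.get(futures), ray.get(count_ref), ray.get(data_ref))
```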
- Ray Classic API:
- Weak GPU support.
- CPU → Accelerator transition is accelerating.
- High flexibility, but significant overhead for small tasks:
- Costs associated with RPC.
- Dynamic memory allocation challenges.
- Difficult to efficiently support p2p protocols like RDMA or NCCL.
- Goals:
- Run tasks that are 1-10ms with < 1% system overhead.
- Achieve GPU-GPU data transfers with < 10us overhead per operation.
- What are aDAGs (accelerated DAGs)?
- Static task graphs using a Ray Core-like API.
- Resources are allocated once and reused across multiple executions.
- Predefine p2p communication schedules to avoid deadlocks.
- Most Ray training and serving tasks leverage PyTorch.
- Ray libraries can scale nearly any AI workload.
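A hedged sketch in the spirit of the above, using Ray's DAG API; `experimental_compile()` follows the experimental aDAG/Compiled Graphs interface and may change between Ray versions.

```python
import ray
from ray.dag import InputNode

@ray.remote
class Worker:
    def fwd(self, x):
        return x + 1

ray.init()
worker = Worker.remote()

# Declare a static task graph over the actor.
with InputNode() as inp:
    dag = worker.fwd.bind(inp)

# Compile once: resources and communication schedules are set up ahead of time
# and reused across executions.
compiled_dag = dag.experimental_compile()
for i in range(3):
    print(ray.get(compiled_dag.execute(i)))
```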
- aDAG Expansion:
- Will extend support for complex workloads such as:
- Pipeline parallelism in vLLM (already integrated).
- 4D model parallelism.
- Pickles are unsafe:
- Only load models from trusted sources.
- Use `weights_only=True` when possible.
- Scan pickle files for viruses and unwanted imports.
- Prefer alternative serialization formats when available.
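A minimal sketch of the `weights_only=True` path, which restricts `torch.load` to tensors and other allowed primitives instead of arbitrary pickled objects; the file name is illustrative.

```python
import torch

model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "checkpoint.pt")

# weights_only=True refuses to unpickle arbitrary Python objects.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
```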
- Model, Implementation, and Hardware Specificity:
- Optimize based on specific hardware, model architectures, and topology.
- Effective control over fusions, memory allocations, and communication overlaps can significantly boost performance.
- Thunder:
- A framework that allows manipulation of computations just-in-time, keeping the software stack thin and efficient.
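A hedged sketch assuming the lightning-thunder package's `thunder.jit` entry point; the toy module is illustrative.

```python
import torch
import thunder  # lightning-thunder

# thunder.jit traces the module just in time into a program that can be
# inspected and transformed (e.g., to swap in fused kernels) before execution.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
tmodel = thunder.jit(model)

x = torch.randn(8, 64)
y = tmodel(x)

# The last traced program is available as readable Python for inspection.
print(thunder.last_traces(tmodel)[-1])
```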
- Optimized CUDA/Triton Kernels:
- Quantized GEMM → CUTLASS kernels.
- Grouped GEMM (e.g., Mixture of Experts) → Triton kernels.
- All-reduce → CUDA kernels.
- Attention Mechanisms:
- FlashAttention, xFormers → CUTLASS kernels.
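Not Thunder-specific, but a plain-PyTorch illustration of routing attention to a fused kernel: `scaled_dot_product_attention` can dispatch to FlashAttention or memory-efficient (xFormers-style) backends when the inputs allow it.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Half-precision CUDA tensors in (batch, heads, seq, head_dim) layout.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```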
- Other optimizations:
- RoPE, CUDA graphs, and `torch.compile` for minimizing host overheads.
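A minimal sketch of the `torch.compile` side of this: `mode="reduce-overhead"` captures CUDA graphs where possible to cut per-step host launch overhead.

```python
import torch

model = torch.nn.Linear(256, 256).cuda()
step = torch.compile(model, mode="reduce-overhead")   # uses CUDA graphs when it can

x = torch.randn(32, 256, device="cuda")
for _ in range(3):   # early calls warm up / capture; later calls replay the graph
    y = step(x)
```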
- Device Generalization:
- Categorize devices as CPU or accelerators.
- Apply generalization concepts across accelerator devices, including CUDA, HIP, and XPU.
- Manage devices, streams/queues, events, guards, generators, and allocators effectively for enhanced performance.
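A minimal sketch of device-generic code along these lines: pick an accelerator when present, otherwise fall back to CPU; streams, events, and allocators follow the same per-backend pattern.

```python
import torch

if torch.cuda.is_available():                          # covers CUDA and ROCm/HIP builds
    device = torch.device("cuda")
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")

x = torch.randn(1024, device=device)
print((x * 2).sum(), device)
```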