- Exciting times ahead for Generative AI (GenAI) on edge devices.
- Growing compute and rapid pace of model innovation for the edge.
- The PyTorch ecosystem provides key tools to support development, including:
- AI Edge Generative API: AI Edge Torch (see the sketch after this list)
- Visualize and Debug: Model Explorer
- Tune and experiment with new models (not covered today but worth noting).
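For context, a minimal conversion sketch assuming the `ai_edge_torch` package's `convert`/`export` API; the torchvision model and output file name are illustrative.

```python
import torch
import torchvision

import ai_edge_torch  # Google AI Edge Torch package

# Hedged sketch: convert a PyTorch module to an on-device (LiteRT/TFLite) model.
# The model choice and file name are illustrative.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("mobilenet_v2.tflite")
```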
- Scientific Machine Learning:
- Has both overlaps and differences with broader trends in machine learning.
- PyTorch, alongside the Linux Foundation, contributes to stability in this domain.
- Transformers are becoming increasingly important in scientific applications.
- Solving equations (linear, differential, non-linear) is crucial for scientific computation.
- Modular frameworks are necessary to support various loss functions during pretraining.
- Scientific machine learning is still an emerging field. Most scientific codes are not yet differentiable.
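As a small, generic illustration (not from the talk) of what "differentiable scientific code" means in PyTorch: solve a linear system and backpropagate through the solver.

```python
import torch

# Solve A x = b and differentiate through the solve; the diagonal shift is only
# there to keep the illustrative random system well conditioned.
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3)

x = torch.linalg.solve(A + 3.0 * torch.eye(3), b)
loss = x.square().sum()
loss.backward()            # gradients flow through the solver into A
print(A.grad.shape)        # torch.Size([3, 3])
```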
- Triton and Kernel Development:
- Prefer writing operators over using raw kernels, especially for library authors.
- Use `torch.library.custom_op` to wrap kernels into Python operators.
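A minimal sketch of `torch.library.custom_op` (PyTorch 2.4+); the operator name `mylib::numpy_sin` and the NumPy "kernel" are illustrative stand-ins for a real CUDA/Triton kernel.

```python
import numpy as np
import torch

# Wrap a hand-written "kernel" (here: NumPy on CPU) as a PyTorch operator.
@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.from_numpy(np.sin(x.numpy(force=True)))

# A fake/meta implementation tells the compiler the output shape and dtype,
# so the operator composes with torch.compile.
@numpy_sin.register_fake
def _(x):
    return torch.empty_like(x)

print(numpy_sin(torch.linspace(0.0, 3.14, 5)))
```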
- Triton Kernels:
- Triton kernels integrate seamlessly with `torch.compile`, offering a hackable and performant alternative to CUDA kernels.
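For reference, the classic Triton elementwise-add kernel wrapped in a Python function; block size and names are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# torch.compile can trace through user-defined Triton kernels, so the wrapper
# composes with the rest of a compiled model.
compiled_add = torch.compile(add)
```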
- Choosing Battles Wisely:
- Models like Llama, Mistral, Pixtral, Gemma, Qwen, GPT, etc., represent a diverse set of tools.
- Focus on a subset of features that deliver the most significant impact for users, as it’s impractical to troubleshoot everything.
- Frameworks:
- PyTorch, SageMaker, and TensorFlow dominate the ecosystem.
- PyTorch versions: focus on v2.4.x, v2.3.x, and <2.2.x for compatibility.
- Model Types:
- CausalLM, classification, and other common model architectures.
- Communication and Automation:
- Open communication channels.
- Pin dependencies to specific versions for stability.
- Automate CI to re-run tests against the primary upstream dependencies.
- Early detection of upstream breaking changes helps surface new bugs quickly.
- A modular library with:
- Training recipes.
- Full training loops designed for easy copying and modification.
- Memory efficiency: From single-device to single-node setups.
- Futures: References to objects that may not yet exist.
- Actors: Remote class instances.
- Shared In-Memory Distributed Object Store: Used for efficient data sharing.
- Tasks: Remote functions executed on distributed systems.
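A minimal sketch tying these four concepts together with the classic Ray API:

```python
import ray

ray.init()

# Task: a remote function. Calling .remote() returns a future (ObjectRef).
@ray.remote
def square(x):
    return x * x

# Actor: a remote class instance with its own state.
@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]   # futures, not values
counter = Counter.remote()
count_ref = counter.incr.remote()

# ray.put / ray.get move data through the shared in-memory object store.
data_ref = ray.put([1, 2, 3])
print(ray.get(futures), ray.get(count_ref), ray.get(data_ref))
```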
- Ray Classic API:
- Weak GPU support.
- CPU → Accelerator transition is accelerating.
- High flexibility, but significant overhead for small tasks:
- Costs associated with RPC.
- Dynamic memory allocation challenges.
- Difficult to efficiently support p2p protocols like RDMA or NCCL.
- Goals:
- Run tasks that are 1-10ms with < 1% system overhead.
- Achieve GPU-GPU data transfers with < 10us overhead per operation.
- What are aDAGs (accelerated DAGs)?
- Static task graphs using a Ray Core-like API.
- Resources are allocated once and reused across multiple executions.
- Predefine p2p communication schedules to avoid deadlocks.
- Most Ray training and serving tasks leverage PyTorch.
- Ray libraries can scale nearly any AI workload.
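A hedged sketch in the spirit of the above, using Ray's DAG API; `experimental_compile()` follows the experimental aDAG/Compiled Graphs interface and may change between Ray versions.

```python
import ray
from ray.dag import InputNode

@ray.remote
class Worker:
    def fwd(self, x):
        return x + 1

ray.init()
worker = Worker.remote()

# Declare a static task graph over the actor.
with InputNode() as inp:
    dag = worker.fwd.bind(inp)

# Compile once: resources and communication schedules are set up ahead of time
# and reused across executions.
compiled_dag = dag.experimental_compile()
for i in range(3):
    print(ray.get(compiled_dag.execute(i)))
```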
- aDAG Expansion:
- Will extend support for complex workloads such as:
- Pipeline parallelism in vLLM (already integrated).
- 4D model parallelism.
- Pickles are unsafe:
- Only load models from trusted sources.
- Use `weights_only=True` when possible.
- Scan pickle files for viruses and unwanted imports.
- Prefer alternative serialization formats when available.
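A minimal sketch of the `weights_only=True` path, which restricts `torch.load` to tensors and other allowed primitives instead of arbitrary pickled objects; the file name is illustrative.

```python
import torch

model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "checkpoint.pt")

# weights_only=True refuses to unpickle arbitrary Python objects.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
```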
- Model, Implementation, and Hardware Specificity:
- Optimize based on specific hardware, model architectures, and topology.
- Effective control over fusions, memory allocations, and communication overlaps can significantly boost performance.
- Thunder:
- A framework that allows manipulation of computations just-in-time, keeping the software stack thin and efficient.
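A hedged sketch assuming the lightning-thunder package's `thunder.jit` entry point; the toy module is illustrative.

```python
import torch
import thunder  # lightning-thunder

# thunder.jit traces the module just in time into a program that can be
# inspected and transformed (e.g., to swap in fused kernels) before execution.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
tmodel = thunder.jit(model)

x = torch.randn(8, 64)
y = tmodel(x)

# The last traced program is available as readable Python for inspection.
print(thunder.last_traces(tmodel)[-1])
```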
- Optimized CUDA/Triton Kernels:
- Quantized GEMM → CUTLASS kernels.
- Grouped GEMM (e.g., Mixture of Experts) → Triton kernels.
- All-reduce → CUDA kernels.
- Attention Mechanisms:
- FlashAttention, xFormers → CUTLASS kernels.
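Not Thunder-specific, but a plain-PyTorch illustration of routing attention to a fused kernel: `scaled_dot_product_attention` can dispatch to FlashAttention or memory-efficient (xFormers-style) backends when the inputs allow it.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Half-precision CUDA tensors in (batch, heads, seq, head_dim) layout.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```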
- Other optimizations:
- RoPE, CUDA graphs, and `torch.compile` for minimizing host overheads.
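A minimal sketch of the `torch.compile` side of this: `mode="reduce-overhead"` captures CUDA graphs where possible to cut per-step host launch overhead.

```python
import torch

model = torch.nn.Linear(256, 256).cuda()
step = torch.compile(model, mode="reduce-overhead")   # uses CUDA graphs when it can

x = torch.randn(32, 256, device="cuda")
for _ in range(3):   # early calls warm up / capture; later calls replay the graph
    y = step(x)
```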
- Device Generalization:
- Categorize devices as CPU or accelerators.
- Apply generalization concepts across accelerator devices, including CUDA, HIP, and XPU.
- Manage devices, streams/queues, events, guards, generators, and allocators effectively for enhanced performance.
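A minimal sketch of device-generic code along these lines: pick an accelerator when present, otherwise fall back to CPU; streams, events, and allocators follow the same per-backend pattern.

```python
import torch

if torch.cuda.is_available():                          # covers CUDA and ROCm/HIP builds
    device = torch.device("cuda")
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")

x = torch.randn(1024, device=device)
print((x * 2).sum(), device)
```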