@gengwg
Last active September 23, 2024 00:59
PyTorch Conference 2024 Notes - San Francisco


Key Takeaways

  • Exciting times ahead for Generative AI (GenAI) on edge devices.
  • Growing compute and rapid pace of model innovation for the edge.
  • The PyTorch ecosystem provides key tools to support this development.

Lessons for the Broader PyTorch Community

  • Scientific Machine Learning:

    • Has both overlaps and differences with broader trends in machine learning.
    • PyTorch, alongside the Linux Foundation, contributes to stability in this domain.
    • Transformers are becoming increasingly important in scientific applications.
    • Solving equations (linear, differential, non-linear) is crucial for scientific computation.
    • Modular frameworks are necessary to support various loss functions during pretraining.
    • Scientific machine learning is still an emerging field. Most scientific codes are not yet differentiable.
  • Triton and Kernel Development:

    • Prefer writing operators over using raw kernels, especially for library authors.
    • Use torch.library.custom_op to wrap kernels into Python operators.
  • Triton Kernels:

    • Triton kernels integrate seamlessly with torch.compile, offering a hackable and performant alternative to CUDA kernels.

Optimization and Best Practices

  • Choosing Battles Wisely:

    • Models like Llama, Mistral, Pixtral, Gemma, Qwen, GPT, etc., represent a diverse set of tools.
    • Focus on a subset of features that deliver the most significant impact for users, as it’s impractical to troubleshoot everything.
  • Frameworks:

    • PyTorch, SageMaker, and TensorFlow dominate the ecosystem.
    • PyTorch versions: Focus on v2.4.x, v2.3.x, and earlier (<v2.2) for compatibility.
  • Model Types:

    • CausalLM, classification, and other common model architectures.

Dependencies & CI/CD

  • Communication and Automation:
    • Open communication channels.
    • Pin dependencies to specific versions for stability.
    • Automate CI to re-run tests against the primary upstream dependencies.
    • Detecting upstream breaking changes early keeps new bugs from reaching users.
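The pinning-plus-canary practice above could look like this (a hypothetical GitHub Actions job; the schedule, package names, and project layout are assumptions, though the PyTorch nightly index URL is the documented one):

```yaml
# Hypothetical nightly job: re-run the test suite against upstream
# PyTorch nightlies so breaking changes surface before a release.
name: upstream-canary
on:
  schedule:
    - cron: "0 6 * * *"   # once a day
jobs:
  test-against-upstream:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
      - run: pip install -e . && pytest
```

Regular releases stay pinned to exact versions; only this scheduled job floats to upstream `main`.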

Torchtune Basics

  • A modular library with:
    • Training recipes.
    • Full training loops designed for easy copying and modification.
    • Memory efficiency: From single-device to single-node setups.
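The copy-and-modify workflow above maps onto torchtune's CLI (a sketch based on torchtune's documented commands; the specific recipe and config names are examples, not from the talk):

```shell
# List the built-in recipes and configs.
tune ls

# Copy a recipe locally so it can be freely modified.
tune cp full_finetune_single_device .

# Launch a full fine-tune from a built-in config.
tune run full_finetune_single_device \
    --config llama3/8B_full_single_device
```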

The FAST Compute Model

  • Futures: Reference to objects that may not yet exist.

  • Actors: Remote class instances.

  • Shared In-Memory Distributed Object Store: Used for efficient data sharing.

  • Tasks: Remote functions executed on distributed systems.

  • Ray Classic API:

    • Weak GPU support.
    • CPU → Accelerator transition is accelerating.
    • High flexibility, but significant overhead for small tasks:
      • Costs associated with RPC.
      • Dynamic memory allocation challenges.
      • Difficult to efficiently support p2p protocols like RDMA or NCCL.
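The futures/tasks pieces of this model can be illustrated with Python's standard library (a stdlib analogy only, not Ray's actual API, which uses `@ray.remote`, `.remote()`, and `ray.get`):

```python
from concurrent.futures import ThreadPoolExecutor

# A "task" is a function submitted for execution elsewhere; the returned
# Future is a reference to a result that may not exist yet.
def square(x: int) -> int:
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(square, i) for i in range(4)]  # submit tasks
    results = [f.result() for f in futures]               # block on futures
```

An "actor" in this model is the same idea applied to a class: a remote instance that owns state, with method calls returning futures.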

Accelerated Directed Acyclic Graphs (aDAGs)

  • Goals:

    • Run tasks that are 1-10ms with < 1% system overhead.
    • Achieve GPU-GPU data transfers with < 10us overhead per operation.
  • What are aDAGs?:

    • Static task graphs using a Ray Core-like API.
    • Resources are allocated once and reused across multiple executions.
    • Predefine p2p communication schedules to avoid deadlocks.
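The "resources allocated once, reused across executions" idea can be sketched with a toy class (invented for illustration; this is not the Ray aDAG API):

```python
# Toy sketch: a static task graph whose structure and buffers are set up
# once, then reused for many executions (no per-call RPC or allocation).
class StaticGraph:
    def __init__(self, tasks):
        self.tasks = tasks    # fixed (name, fn) pipeline, resolved once
        self.buffers = {}     # preallocated slots, reused every execution

    def execute(self, x):
        for name, fn in self.tasks:
            x = fn(x)
            self.buffers[name] = x  # overwrite the same slot, no realloc
        return x

graph = StaticGraph([("double", lambda x: x * 2), ("inc", lambda x: x + 1)])
first = graph.execute(3)     # 3 -> 6 -> 7
second = graph.execute(10)   # same graph and buffers, new input
```

Because the graph is static, communication schedules between tasks can also be fixed up front, which is what makes the deadlock-free p2p scheduling above possible.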

Ray Integration with PyTorch

  • Most Ray training and serving tasks leverage PyTorch.
  • Ray libraries can scale nearly any AI workload.
  • aDAG Expansion:
    • Will extend support for complex workloads such as:
      • Pipeline parallelism in vLLM (already integrated).
      • 4D model parallelism.

Security: Avoiding Pickle Vulnerabilities

  • Pickles are unsafe:
    • Only load models from trusted sources.
    • Use weights_only=True when possible.
    • Scan pickle files for viruses and unwanted imports.
    • Prefer alternative serialization formats when available.
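The risk is easy to demonstrate with the standard library alone: a pickle can smuggle an arbitrary callable that runs at load time, and `pickletools` can inspect the opcode stream without executing it (the `Evil` class and `eval` payload are illustrative):

```python
import pickle
import pickletools

class Evil:
    def __reduce__(self):
        # Whatever callable is returned here runs during unpickling.
        return (eval, ("1 + 1",))

payload = pickle.dumps(Evil())

# Scan the opcodes WITHOUT executing them: global/callable references
# (GLOBAL / STACK_GLOBAL, consumed by REDUCE) are the red flag.
opcodes = [op.name for op, arg, pos in pickletools.genops(payload)]
suspicious = any(name in ("GLOBAL", "STACK_GLOBAL", "REDUCE")
                 for name in opcodes)

result = pickle.loads(payload)  # silently runs eval("1 + 1")
```

This is exactly why `torch.load` defaults should be tightened with `weights_only=True` and why checkpoint files deserve the same scrutiny as executables.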

Performance Optimization

  • Model, Implementation, and Hardware Specificity:

    • Optimize based on specific hardware, model architectures, and topology.
    • Effective control over fusions, memory allocations, and communication overlaps can significantly boost performance.
  • Thunder:

    • A framework that allows manipulation of computations just-in-time, keeping the software stack thin and efficient.

High Performance & Efficiency

  • Optimized CUDA/Triton Kernels:

    • Quantized GEMM → CUTLASS kernels.
    • Grouped GEMM (e.g., Mixture of Experts) → Triton kernels.
    • All-reduce → CUDA kernels.
  • Attention Mechanisms:

    • FlashAttention, xFormers → CUTLASS kernels.
  • Other optimizations:

    • RoPE, CUDA graphs, and torch.compile for minimizing host overheads.

Runtime Generalization

  • Device Generalization:
    • Categorize devices as CPU or accelerators.
    • Apply generalization concepts across accelerator devices, including CUDA, HIP, and XPU.
    • Manage devices, streams/queues, events, guards, generators, and allocators effectively for enhanced performance.
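The CPU-vs-accelerator categorization above can be sketched in a few lines (device-type names assumed from common PyTorch usage; the function is invented for illustration):

```python
# Sketch: classify a PyTorch-style device string as "cpu" or "accelerator",
# so code can treat CUDA, HIP, and XPU uniformly.
ACCELERATOR_TYPES = {"cuda", "hip", "xpu"}

def categorize(device: str) -> str:
    kind = device.split(":")[0]  # "cuda:0" -> "cuda"
    return "accelerator" if kind in ACCELERATOR_TYPES else "cpu"
```

Dispatching on this single category, rather than on each backend by name, is what lets streams, events, guards, and allocators be managed through one generalized interface.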