- RDMA Aware Networks Programming User Manual
- On the Impact of Cluster Configuration on RoCE Application Design
- RDMA over Commodity Ethernet at Scale
- Design Guidelines for High Performance RDMA Systems
- FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs
- RDMA [1]: A short history of remote DMA networking
- [Slide] RDMA Tutorial
- Understanding the concepts and mechanisms of RDMA
- [InfiniBand RDMA over PCI Express Networks]
 
  
    
    
  
  
    
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
The CUDA 12.1.1 toolkit will offer to install Nvidia driver 530 for us. It comes from the New Feature branch, so it's likely to be newer than the default Nvidia driver you would have installed via apt-get (apt would prefer to give you 525, i.e. the Production branch).
If you're confident that you already have a new enough Nvidia driver for CUDA 12.1.1 and you'd like to keep it, feel free to skip this "uninstall driver" step.
But if you're not sure, or you know your driver is too old, let's uninstall it. CUDA will install a new driver for us later.
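To check which Nvidia driver you currently have:
nvidia-smi --query-gpu=driver_version --format=csv,noheader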
Single-process:
python main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/
Multi-process:
python -m torch.distributed.launch  --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/
  
    
    
  
  
    
try:
    import cPickle  # Python 2 only; on Python 3 the C-accelerated pickle is built in
except ImportError:
    cPickle = None
import pickle
import json
import random
from time import time
from hashlib import md5

test_runs = 1000

def float_list():
    # A list of random floats to serialize on each test run.
    return [random.random() for _ in range(1000)]
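The gist is truncated at this point. A minimal timing loop consistent with the imports above (pickle vs. json on the float list, with md5 only there to make sure the serialized output is actually consumed) could look roughly like this; the `bench` helper is an illustration, not the original code:

def bench(name, dumps, data):
    # Serialize `data` test_runs times and report elapsed wall-clock time.
    start = time()
    digest = None
    for _ in range(test_runs):
        blob = dumps(data)
        digest = md5(blob if isinstance(blob, bytes) else blob.encode()).hexdigest()
    print("%s: %.3f s (md5 %s)" % (name, time() - start, digest))

data = float_list()
bench("pickle", pickle.dumps, data)
bench("json", json.dumps, data)
if cPickle is not None:
    bench("cPickle", cPickle.dumps, data)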
  
    
    
  
  
    
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (the command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
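The note is cut off above. As a rough sketch of how those NVTX ranges are typically used from a PyTorch training step (the region names and the nsys flags below are illustrative, not taken from the original gist):

import torch

# Annotate regions of a training step so they show up as named ranges
# on the nsys timeline (requires profiling with `-t cuda,nvtx`).
def training_step(model, optimizer, loss_fn, inputs, targets):
    torch.cuda.nvtx.range_push("forward")
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
    return loss

# Then profile the script with something like:
#   nsys profile -t cuda,nvtx -o my_report python train.py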
  
    
    
  
  
    
import networkx as nx
from itertools import product
"""
Compared with Airflow, the strengths of this code lie in its simplicity, its lightweight nature, and how easily it integrates with existing Python code:

Simplicity: it provides a simple and straightforward way to model and work with DAGs without needing to set up and configure a comprehensive system like Airflow. For smaller teams or projects with less complexity, this can be an advantage.

Lightweight and easy to incorporate: it is a compact, single-file solution that can be dropped into an existing Python project without standing up an entire Airflow environment. When the primary focus is on creating task dependencies with parameter combinations, rather than scheduling and monitoring, it is easier to incorporate.

Focused on task generation: it emphasizes creating a Cartesian product of tasks from each node's parameters. It is geared towards tackling
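The snippet above is cut off, but the idea it describes (expanding each node's parameter grid into concrete tasks and wiring their dependencies with networkx) can be sketched roughly as follows; the node names and parameter keys are invented for illustration:

import networkx as nx
from itertools import product

# Each node carries a parameter grid; one concrete task = one combination.
dag = nx.DiGraph()
dag.add_node("train", params={"lr": [0.1, 0.01], "batch": [32, 64]})
dag.add_node("evaluate", params={"split": ["val", "test"]})
dag.add_edge("train", "evaluate")  # evaluate depends on train

def expand_tasks(graph):
    """Return {node: [task dicts]}, the Cartesian product of each node's params."""
    tasks = {}
    for node, data in graph.nodes(data=True):
        keys = sorted(data["params"])
        tasks[node] = [dict(zip(keys, combo))
                       for combo in product(*(data["params"][k] for k in keys))]
    return tasks

all_tasks = expand_tasks(dag)
for node in nx.topological_sort(dag):   # visit nodes in dependency order
    for task in all_tasks[node]:
        print(node, task)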
  
    
    
  
  
    
// Thank you to the folks at the C++ slack channel,
// along with @lewissbaker for the excellent literature
// (even though it took me a few days to be convinced
// it really was so).
#include <uv.h>
#include <iostream>
#include <experimental/coroutine>
  
    
    
  
  
    
"""
This model integrates the MoE concept within a Transformer architecture. Each token's
representation is processed by a subset of experts, determined by the gating mechanism.
This architecture allows for efficient and specialized handling of different aspects of the
data, aiming for the adaptability and efficiency noted in the Mixtral 8x7B model's design
philosophy. The model activates only a fraction of the available experts for each token,
significantly reducing the computational resources needed compared to activating all experts
for all tokens.
"""
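Only the docstring survives above. A compact sketch of the kind of layer it describes, a feed-forward block where a softmax gate routes each token to its top-k experts, might look like this (the layer sizes, top_k = 2, and the class name are assumptions for illustration, not the original implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-k gated mixture-of-experts feed-forward block (sketch)."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.gate(x)                  # (batch, seq, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest are skipped.
        for i, expert in enumerate(self.experts):
            mask = (top_idx == i)              # (batch, seq, top_k)
            if mask.any():
                token_mask = mask.any(dim=-1)                      # tokens routed to expert i
                weight = (top_w * mask).sum(dim=-1, keepdim=True)  # gate weight for expert i
                out[token_mask] += weight[token_mask] * expert(x[token_mask])
        return out

# Example: route a batch of 4 sequences of 16 tokens through the block.
if __name__ == "__main__":
    layer = MoEFeedForward()
    y = layer(torch.randn(4, 16, 512))
    print(y.shape)  # torch.Size([4, 16, 512])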