relyt0925 · July 27, 2024 20:09
diff --git a/gistfile1.txt b/gistfile1.txt
 [root@tyler-rhel-newimage instructlab]# /root/ilab model train --data-path /var/instructlabbigdisk/instructlab/generateddata/messages_combined.jsonl  --model-path /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ --device cuda --max-batch-len 2 --effective-batch-size 16 --save-samples 185 --num-epochs 10 --ckpt-output-dir /var/instructlabbigdisk/instructlab/skillscheckpoints/ --gpus 8
 [2024-07-27 20:03:08,445] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 INFO 2024-07-27 20:03:11,898 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
 INFO 2024-07-27 20:03:11,898 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
 INFO 2024-07-27 20:03:11,898 numexpr.utils:161: NumExpr defaulting to 16 threads.
 INFO 2024-07-27 20:03:12,324 datasets:58: PyTorch version 2.3.1 available.
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 INFO 2024-07-27 20:03:12,773 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
 tokenizing the dataset with /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ tokenizer...
 ten largest length percentiles:
 quantile 90th: 90.0
 quantile 91th: 92.44
 quantile 92th: 93.0
 quantile 93th: 94.0
 quantile 94th: 96.87999999999994
 quantile 95th: 99.59999999999997
 quantile 96th: 102.91999999999996
 quantile 97th: 107.0
 quantile 98th: 109.59999999999997
 quantile 99th: 115.27999999999997
 quantile 100th: 141.0

 at 4096 max sequence length, the number of samples to be dropped is 0
 (0.00% of total)
 quantile 0th: 43.0
 quantile 1th: 44.0
 quantile 2th: 44.68
 quantile 3th: 45.0
 quantile 4th: 45.36
 quantile 5th: 46.0
 quantile 6th: 48.0
 quantile 7th: 48.0
 quantile 8th: 49.0
 quantile 9th: 49.56
 quantile 10th: 50.0
 at 20 min sequence length, the number of samples to be dropped is 0
 checking the validity of the samples...
 INFO 2024-07-27 20:03:13,126 root:611: number of dropped samples: 0 -- out of 185
 Categorizing training data type...
 Data type sorting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 185/185 [00:00<00:00, 648242.47it/s]
 unmasking the appropriate message content...
 The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.

 Instruction ex sample 17: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 Answer: Based on the provided text, there are 8 villages named Qarah Tappeh in different districts and provinces of Iran according to the 2006 census.<|endoftext|>
 Original Input: <|user|> 
 Question: How many villages named Qarah Tappeh were there in different districts and provinces of Iran according to the 2006 census?
 <|assistant|> 
 Answer: Based on the provided text, there are 8 villages named Qarah Tappeh in different districts and provinces of Iran according to the 2006 census.<|endoftext|>

 Instruction ex sample 99: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 Answer: There are three items in this list that are types of fruits: "apple," "banana," and "orange."<|endoftext|>
 Original Input: <|user|> 
 Question: How many items in this list are types of fruits and what are they?
 <|assistant|> 
 Answer: There are three items in this list that are types of fruits: "apple," "banana," and "orange."<|endoftext|>

 Creating json from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.89ba/s]
 Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ --data_path=/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/instructlabbigdisk/instructlab/skillscheckpoints/ --num_epochs=10 --effective_batch_size=16 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=185 --log_level=INFO --max_batch_len=2 --seed=42 --chat-tmpl-path=/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
 W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] 
 W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] *****************************************
 W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
 W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] *****************************************
 [2024-07-27 20:03:17,567] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,805] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,843] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,879] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,908] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,949] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,978] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-07-27 20:03:17,981] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 [2024-07-27 20:03:21,555] [INFO] [comm.py:637:init_distributed] cdb=None
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 model_name_or_path: /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/
 data_path: /var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl
 output_dir: /var/instructlabbigdisk/instructlab/skillscheckpoints/
 num_epochs: 10
 last_step: 0
 effective_batch_size: 16
 learning_rate: 2.0e-05
 lr_scheduler: cosine
 num_warmup_steps: 25
 save_samples: 185
 save_samples_ds: null
 save_last: false
 log_level: INFO
 seed: 42
 mock_data: false
 mock_len: 2600
 sharding_strategy: FULL_SHARD
 is_granite: false
 lora_r: 0
 lora_alpha: 32
 lora_dropout: 0.1
 lora_quant_bits: null
 lora_target_modules: null
 max_batch_len: 2
 cpu_offload_optimizer: false
 cpu_offload_optimizer_pin_memory: false
 cpu_offload_optimizer_ratio: 1.0
 NEFTune_alpha: null
 chat_tmpl_path: /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
 disable_flash_attn: false

 {
    "script_params": {
        "model_name_or_path": "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/",
        "data_path": "/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl",
        "output_dir": "/var/instructlabbigdisk/instructlab/skillscheckpoints/",
        "num_epochs": 10,
        "last_step": 0,
        "effective_batch_size": 16,
        "learning_rate": 2e-05,
        "lr_scheduler": "cosine",
        "num_warmup_steps": 25,
        "save_samples": 185,
        "save_samples_ds": null,
        "save_last": false,
        "log_level": "INFO",
        "seed": 42,
        "mock_data": false,
        "mock_len": 2600,
        "sharding_strategy": "FULL_SHARD",
        "is_granite": false,
        "lora_r": 0,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
        "lora_quant_bits": null,
        "lora_target_modules": null,
        "max_batch_len": 2,
        "cpu_offload_optimizer": false,
        "cpu_offload_optimizer_pin_memory": false,
        "cpu_offload_optimizer_ratio": 1.0,
        "NEFTune_alpha": null,
        "chat_tmpl_path": "/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
        "disable_flash_attn": false
    },
    "timestamp": "2024-07-27T20:03:21.897629"
 }
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 [2024-07-27 20:03:21,973] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:21,973] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 [2024-07-27 20:03:22,374] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:22,515] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:22,529] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:22,538] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:22,664] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-07-27 20:03:22,682] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-rhel-newimage:260:260 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:260:260 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:260:260 [0] NCCL INFO cudaDriverVersion 12040
 NCCL version 2.20.5+cuda12.4
 tyler-rhel-newimage:265:265 [5] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:265:265 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:265:265 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:267:267 [7] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:267:267 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:267:267 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:262:262 [2] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:262:262 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:263:263 [3] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:262:262 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:263:263 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:263:263 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:266:266 [6] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:266:266 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:266:266 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:261:261 [1] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:261:261 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:261:261 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:264:264 [4] NCCL INFO cudaDriverVersion 12040
 tyler-rhel-newimage:264:264 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:264:264 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
 tyler-rhel-newimage:260:1022 [0] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:260:1022 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Using network Socket
 tyler-rhel-newimage:265:1026 [5] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:262:1023 [2] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:266:1024 [6] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:265:1026 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:262:1023 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:266:1024 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Using network Socket
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Using network Socket
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Using network Socket
 tyler-rhel-newimage:263:1025 [3] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:263:1025 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Using network Socket
 tyler-rhel-newimage:264:1028 [4] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:264:1028 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Using network Socket
 tyler-rhel-newimage:261:1027 [1] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:261:1027 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Using network Socket
 tyler-rhel-newimage:267:1029 [7] NCCL INFO NET/IB : No device found.
 tyler-rhel-newimage:267:1029 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Using network Socket
 tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xbba7bcd413cc6af1 - Init START
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-rhel-newimage:266:1024 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-rhel-newimage:265:1026 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-rhel-newimage:267:1029 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-rhel-newimage:263:1025 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-rhel-newimage:262:1023 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-rhel-newimage:261:1027 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-rhel-newimage:264:1028 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-rhel-newimage:266:1024 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:267:1029 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-rhel-newimage:265:1026 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:264:1028 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-rhel-newimage:260:1022 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-rhel-newimage:262:1023 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-rhel-newimage:261:1027 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:263:1025 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Connected all rings
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Connected all rings
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Connected all rings
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Connected all rings
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Connected all rings
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Connected all rings
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Connected all rings
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Connected all rings
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1022 [0] NCCL INFO Connected all trees
 tyler-rhel-newimage:260:1022 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:260:1022 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:260:1022 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:261:1027 [1] NCCL INFO Connected all trees
 tyler-rhel-newimage:261:1027 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:261:1027 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:262:1023 [2] NCCL INFO Connected all trees
 tyler-rhel-newimage:262:1023 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:262:1023 [2] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:262:1023 [2] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:261:1027 [1] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:263:1025 [3] NCCL INFO Connected all trees
 tyler-rhel-newimage:263:1025 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:263:1025 [3] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:267:1029 [7] NCCL INFO Connected all trees
 tyler-rhel-newimage:267:1029 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:267:1029 [7] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:263:1025 [3] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:267:1029 [7] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:264:1028 [4] NCCL INFO Connected all trees
 tyler-rhel-newimage:264:1028 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:264:1028 [4] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:266:1024 [6] NCCL INFO Connected all trees
 tyler-rhel-newimage:266:1024 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:266:1024 [6] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:265:1026 [5] NCCL INFO Connected all trees
 tyler-rhel-newimage:265:1026 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:265:1026 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:265:1026 [5] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:264:1028 [4] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:266:1024 [6] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
 tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xbba7bcd413cc6af1 - Init COMPLETE
 Generating train split: 185 examples [00:00, 25776.38 examples/s]
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12894.40it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11066.61it/s]
 Effective batch size is too low for multipack sampling, max sample length=141 and min packing length=135. Switching to naive distributed sampling.
 {
    "num_gpus": 8,
    "avg_sample_len": 67.78918918918919,
    "effective_batch_size": 16,
    "max_batch_len_per_gpu": 2,
    "packing_max_batch_len": null,
    "grad_accum": 1,
    "num_batches": 12,
    "avg_samples_per_batch": 15.416666666666666,
    "samples_per_gpu": 2,
    "timestamp": "2024-07-27T20:03:33.017444"
 }
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11659.95it/s]
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12065.72it/s]
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11400.91it/s]
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11540.81it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Data length calculation:   0%|          | 0/185 [00:00<?, ?it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12968.10it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11126.75it/s]
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
 /opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.15251493453979492 seconds
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
 /opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.1219935417175293 seconds
 [2024-07-27 20:03:39,014] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4+d254d75, git-hash=d254d75, git-branch=HEAD
 [2024-07-27 20:03:39,014] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.20261573791503906 seconds
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
 /opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.12093877792358398 seconds
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
 /opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.12085723876953125 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.10260534286499023 seconds
 Loading extension module fused_adam...
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.2024221420288086 seconds
 Time to load fused_adam op: 0.20228958129882812 seconds
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Using network Socket
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Using network Socket
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Using network Socket
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Using network Socket
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Using network Socket
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Using network Socket
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Using network Socket
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Using non-device net plugin version 0
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Using network Socket
 tyler-rhel-newimage:265:1132 [5] NCCL INFO bootstrapSplit: comm 0x56464ab274c0 parent 0x56464a4e7a70 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
 tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:264:1129 [4] NCCL INFO bootstrapSplit: comm 0x55b22abe29d0 parent 0x55b22a5ae220 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
 tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:263:1138 [3] NCCL INFO bootstrapSplit: comm 0x560000574c60 parent 0x55fffff3ce80 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
 tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:262:1141 [2] NCCL INFO bootstrapSplit: comm 0x55f25fc9eb30 parent 0x55f25f665d50 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
 tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:261:1135 [1] NCCL INFO bootstrapSplit: comm 0x55fca66893d0 parent 0x55fca60525d0 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
 tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:260:1124 [0] NCCL INFO bootstrapSplit: comm 0x558210f77210 parent 0x558210938950 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
 tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:267:1125 [7] NCCL INFO bootstrapSplit: comm 0x564fb4716ac0 parent 0x564fb40d9fa0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
 tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:266:1126 [6] NCCL INFO bootstrapSplit: comm 0x55f35a3e2520 parent 0x55f359e7d980 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
 tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x358cd0e27660cbba - Init START
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-rhel-newimage:264:1129 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-rhel-newimage:267:1125 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-rhel-newimage:261:1135 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-rhel-newimage:260:1124 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-rhel-newimage:266:1126 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-rhel-newimage:263:1138 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-rhel-newimage:265:1132 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-rhel-newimage:262:1141 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-rhel-newimage:262:1141 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-rhel-newimage:267:1125 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-rhel-newimage:266:1126 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-rhel-newimage:265:1132 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:264:1129 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-rhel-newimage:260:1124 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:263:1138 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:261:1135 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Connected all rings
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Connected all rings
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Connected all rings
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Connected all rings
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Connected all rings
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Connected all rings
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Connected all rings
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Connected all rings
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 tyler-rhel-newimage:260:1124 [0] NCCL INFO Connected all trees
 tyler-rhel-newimage:260:1124 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:260:1124 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:261:1135 [1] NCCL INFO Connected all trees
 tyler-rhel-newimage:261:1135 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:261:1135 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:262:1141 [2] NCCL INFO Connected all trees
 tyler-rhel-newimage:262:1141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:262:1141 [2] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:263:1138 [3] NCCL INFO Connected all trees
 tyler-rhel-newimage:263:1138 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:263:1138 [3] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:267:1125 [7] NCCL INFO Connected all trees
 tyler-rhel-newimage:267:1125 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:267:1125 [7] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:264:1129 [4] NCCL INFO Connected all trees
 tyler-rhel-newimage:264:1129 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:264:1129 [4] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:266:1126 [6] NCCL INFO Connected all trees
 tyler-rhel-newimage:265:1132 [5] NCCL INFO Connected all trees
 tyler-rhel-newimage:266:1126 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:265:1132 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-rhel-newimage:266:1126 [6] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:265:1132 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x358cd0e27660cbba - Init COMPLETE
 tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x358cd0e27660cbba - Init COMPLETE
 [2024-07-27 20:03:47,872] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
 [2024-07-27 20:03:47,874] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
 [2024-07-27 20:03:47,874] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
 [2024-07-27 20:03:47,886] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
 [2024-07-27 20:03:47,887] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
 [2024-07-27 20:03:47,887] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
 [2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
 [2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
 [2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
 [2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
 [2024-07-27 20:04:00,524] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:01,385] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
 [2024-07-27 20:04:01,386] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 17.26 GB         CA 17.26 GB         Max_CA 17 GB 
 [2024-07-27 20:04:01,386] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 138.49 GB, percent = 11.0%
 [2024-07-27 20:04:01,578] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
 [2024-07-27 20:04:01,579] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 18.83 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-07-27 20:04:01,579] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 138.49 GB, percent = 11.0%
 [2024-07-27 20:04:01,579] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
 [2024-07-27 20:04:01,777] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
 [2024-07-27 20:04:01,778] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 15.69 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-07-27 20:04:01,778] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 138.49 GB, percent = 11.0%
 [2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
 [2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
 [2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fbf02112310>
 [2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:01,781] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
 }
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   amp_enabled .................. False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   amp_params ................... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
 }
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fbef4750bd0>
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   communication_data_type ...... None
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   disable_allgather ............ False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   dump_state ................... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
 [2024-07-27 20:04:01,782] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
 }
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   fp16_enabled ................. False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   global_rank .................. 0
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 1
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   graph_harvesting ............. False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   memory_breakdown ............. False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
 }
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   optimizer_name ............... None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   optimizer_params ............. None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   pld_enabled .................. False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   pld_params ................... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   prescale_gradients ........... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   scheduler_name ............... None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   scheduler_params ............. None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   sparse_attention ............. None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   steps_per_print .............. 1
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   train_batch_size ............. 16
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  2
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   weight_quantization_config ... None
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   world_size ................... 8
 [2024-07-27 20:04:01,783] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  False
 [2024-07-27 20:04:01,784] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
 [2024-07-27 20:04:01,784] [INFO] [config.py:1001:print]   zero_enabled ................. True
 [2024-07-27 20:04:01,784] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
 [2024-07-27 20:04:01,784] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
 [2024-07-27 20:04:01,784] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 16, 
    "gradient_accumulation_steps": 1, 
    "train_micro_batch_size_per_gpu": 2, 
    "steps_per_print": 1, 
    "zero_optimization": {
        "stage": 2, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none"
        }
    }, 
    "bf16": {
        "enabled": true
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false
 }
 [2024-07-27 20:04:01,784] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 Number of samples per save: 176
 [2024-07-27 20:04:01,865] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:01,875] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:01,984] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:02,237] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:02,285] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-07-27 20:04:02,433] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 Epoch 0:   0%|          | 0/12 [00:00<?, ?it/s]
 total tokens: 118 num samples: 2 num padding tokens: 13 - rank: 7 max len: 59 min len: 46 avg len: 52.5 num_loss_counted_tokens: 52
 total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 68
 total tokens: 282 num samples: 2 num padding tokens: 83 - rank: 7 max len: 141 min len: 58 avg len: 99.5 num_loss_counted_tokens: 150
 total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 7 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 56
 total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 7 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 53
 total tokens: 186 num samples: 2 num padding tokens: 5 - rank: 7 max len: 93 min len: 88 avg len: 90.5 num_loss_counted_tokens: 121
 total tokens: 156 num samples: 2 num padding tokens: 21 - rank: 6 max len: 78 min len: 57 avg len: 67.5 num_loss_counted_tokens: 70
 total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 61
 total tokens: 184 num samples: 2 num padding tokens: 27 - rank: 7 max len: 92 min len: 65 avg len: 78.5 num_loss_counted_tokens: 90
 total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 7 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 66
 total tokens: 142 num samples: 2 num padding tokens: 14 - rank: 7 max len: 71 min len: 57 avg len: 64.0 num_loss_counted_tokens: 58
 total tokens: 114 num samples: 2 num padding tokens: 13 - rank: 7 max len: 57 min len: 44 avg len: 50.5 num_loss_counted_tokens: 55
 total tokens: 140 num samples: 2 num padding tokens: 3 - rank: 7 max len: 70 min len: 67 avg len: 68.5 num_loss_counted_tokens: 70
 total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 1 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 74
 total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 1 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 71
 total tokens: 158 num samples: 2 num padding tokens: 36 - rank: 3 max len: 79 min len: 43 avg len: 61.0 num_loss_counted_tokens: 56
 total tokens: 202 num samples: 2 num padding tokens: 51 - rank: 4 max len: 101 min len: 50 avg len: 75.5 num_loss_counted_tokens: 106
 total tokens: 106 num samples: 2 num padding tokens: 7 - rank: 0 max len: 53 min len: 46 avg len: 49.5 num_loss_counted_tokens: 56
 total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 7 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 59
 total tokens: 126 num samples: 2 num padding tokens: 19 - rank: 0 max len: 63 min len: 44 avg len: 53.5 num_loss_counted_tokens: 53
 total tokens: 134 num samples: 2 num padding tokens: 0 - rank: 4 max len: 67 min len: 67 avg len: 67.0 num_loss_counted_tokens: 52
 total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 3 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 88
 total tokens: 188 num samples: 2 num padding tokens: 4 - rank: 6 max len: 94 min len: 90 avg len: 92.0 num_loss_counted_tokens: 121
 total tokens: 168 num samples: 2 num padding tokens: 20 - rank: 3 max len: 84 min len: 64 avg len: 74.0 num_loss_counted_tokens: 95
 total tokens: 106 num samples: 2 num padding tokens: 2 - rank: 6 max len: 53 min len: 51 avg len: 52.0 num_loss_counted_tokens: 51
 total tokens: 110 num samples: 2 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 64
 total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 4 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 58
 total tokens: 244 num samples: 2 num padding tokens: 63 - rank: 4 max len: 122 min len: 59 avg len: 90.5 num_loss_counted_tokens: 120
 total tokens: 128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 64 min len: 62 avg len: 63.0 num_loss_counted_tokens: 65
 total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 69
 total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 6 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 77
 total tokens: 180 num samples: 2 num padding tokens: 20 - rank: 4 max len: 90 min len: 70 avg len: 80.0 num_loss_counted_tokens: 111
 total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 4 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 59
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 57
 total tokens: 144 num samples: 2 num padding tokens: 12 - rank: 4 max len: 72 min len: 60 avg len: 66.0 num_loss_counted_tokens: 68
 total tokens: 200 num samples: 2 num padding tokens: 50 - rank: 6 max len: 100 min len: 50 avg len: 75.0 num_loss_counted_tokens: 91
 total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 3 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 86
 total tokens: 154 num samples: 2 num padding tokens: 33 - rank: 4 max len: 77 min len: 44 avg len: 60.5 num_loss_counted_tokens: 77
 total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 6 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 58
 total tokens: 158 num samples: 2 num padding tokens: 5 - rank: 6 max len: 79 min len: 74 avg len: 76.5 num_loss_counted_tokens: 82
 total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 6 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 72
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 0 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 4 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 66
 total tokens: 148 num samples: 2 num padding tokens: 16 - rank: 0 max len: 74 min len: 58 avg len: 66.0 num_loss_counted_tokens: 73
 total tokens: 134 num samples: 2 num padding tokens: 13 - rank: 6 max len: 67 min len: 54 avg len: 60.5 num_loss_counted_tokens: 67
 total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 0 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 92
 total tokens: 140 num samples: 2 num padding tokens: 13 - rank: 3 max len: 70 min len: 57 avg len: 63.5 num_loss_counted_tokens: 67
 total tokens: 162 num samples: 2 num padding tokens: 27 - rank: 3 max len: 81 min len: 54 avg len: 67.5 num_loss_counted_tokens: 86
 total tokens: 102 num samples: 2 num padding tokens: 5 - rank: 0 max len: 51 min len: 46 avg len: 48.5 num_loss_counted_tokens: 49
 total tokens: 160 num samples: 2 num padding tokens: 31 - rank: 0 max len: 80 min len: 49 avg len: 64.5 num_loss_counted_tokens: 70
 total tokens: 146 num samples: 2 num padding tokens: 3 - rank: 0 max len: 73 min len: 70 avg len: 71.5 num_loss_counted_tokens: 87
 total tokens: 130 num samples: 2 num padding tokens: 5 - rank: 0 max len: 65 min len: 60 avg len: 62.5 num_loss_counted_tokens: 57
 total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 0 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 71
 total tokens: 214 num samples: 2 num padding tokens: 26 - rank: 3 max len: 107 min len: 81 avg len: 94.0 num_loss_counted_tokens: 116
 total tokens: 196 num samples: 2 num padding tokens: 32 - rank: 3 max len: 98 min len: 66 avg len: 82.0 num_loss_counted_tokens: 101
 total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 1 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 64
 total tokens: 166 num samples: 2 num padding tokens: 8 - rank: 6 max len: 83 min len: 75 avg len: 79.0 num_loss_counted_tokens: 77
 total tokens: 228 num samples: 2 num padding tokens: 54 - rank: 1 max len: 114 min len: 60 avg len: 87.0 num_loss_counted_tokens: 125
 total tokens: 152 num samples: 2 num padding tokens: 10 - rank: 2 max len: 76 min len: 66 avg len: 71.0 num_loss_counted_tokens: 77
 total tokens: 126 num samples: 2 num padding tokens: 12 - rank: 2 max len: 63 min len: 51 avg len: 57.0 num_loss_counted_tokens: 56
 total tokens: 154 num samples: 2 num padding tokens: 24 - rank: 0 max len: 77 min len: 53 avg len: 65.0 num_loss_counted_tokens: 66
 total tokens: 146 num samples: 2 num padding tokens: 19 - rank: 3 max len: 73 min len: 54 avg len: 63.5 num_loss_counted_tokens: 71
 total tokens: 116 num samples: 2 num padding tokens: 10 - rank: 3 max len: 58 min len: 48 avg len: 53.0 num_loss_counted_tokens: 52
 total tokens: 138 num samples: 2 num padding tokens: 24 - rank: 4 max len: 69 min len: 45 avg len: 57.0 num_loss_counted_tokens: 68
 total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 1 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 42
 total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 2 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 51
 total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 1 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 65
 total tokens: 216 num samples: 2 num padding tokens: 49 - rank: 1 max len: 108 min len: 59 avg len: 83.5 num_loss_counted_tokens: 103
 total tokens: 214 num samples: 2 num padding tokens: 21 - rank: 2 max len: 107 min len: 86 avg len: 96.5 num_loss_counted_tokens: 106
 total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 2 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 83
 total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 1 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 63
 total tokens: 164 num samples: 2 num padding tokens: 29 - rank: 1 max len: 82 min len: 53 avg len: 67.5 num_loss_counted_tokens: 83
 total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 1 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 86
 total tokens: 168 num samples: 2 num padding tokens: 18 - rank: 2 max len: 84 min len: 66 avg len: 75.0 num_loss_counted_tokens: 81
 total tokens: 226 num samples: 2 num padding tokens: 44 - rank: 2 max len: 113 min len: 69 avg len: 91.0 num_loss_counted_tokens: 97
 total tokens: 180 num samples: 2 num padding tokens: 24 - rank: 2 max len: 90 min len: 66 avg len: 78.0 num_loss_counted_tokens: 103
 total tokens: 186 num samples: 2 num padding tokens: 48 - rank: 1 max len: 93 min len: 45 avg len: 69.0 num_loss_counted_tokens: 110
 total tokens: 208 num samples: 2 num padding tokens: 33 - rank: 2 max len: 104 min len: 71 avg len: 87.5 num_loss_counted_tokens: 113
 total tokens: 188 num samples: 2 num padding tokens: 32 - rank: 3 max len: 94 min len: 62 avg len: 78.0 num_loss_counted_tokens: 89
 total tokens: 116 num samples: 2 num padding tokens: 10 - rank: 2 max len: 58 min len: 48 avg len: 53.0 num_loss_counted_tokens: 62
 total tokens: 194 num samples: 2 num padding tokens: 48 - rank: 1 max len: 97 min len: 49 avg len: 73.0 num_loss_counted_tokens: 95
 total tokens: 120 num samples: 2 num padding tokens: 9 - rank: 2 max len: 60 min len: 51 avg len: 55.5 num_loss_counted_tokens: 56
 total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 6 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 69
 total tokens: 162 num samples: 2 num padding tokens: 17 - rank: 2 max len: 81 min len: 64 avg len: 72.5 num_loss_counted_tokens: 91
 total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 5 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 70
 total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 5 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 87
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 5 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 72
 total tokens: 172 num samples: 2 num padding tokens: 36 - rank: 5 max len: 86 min len: 50 avg len: 68.0 num_loss_counted_tokens: 70
 total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 5 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 5 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 81
 total tokens: 174 num samples: 2 num padding tokens: 19 - rank: 5 max len: 87 min len: 68 avg len: 77.5 num_loss_counted_tokens: 80
 total tokens: 202 num samples: 2 num padding tokens: 18 - rank: 5 max len: 101 min len: 83 avg len: 92.0 num_loss_counted_tokens: 124
 total tokens: 122 num samples: 2 num padding tokens: 11 - rank: 5 max len: 61 min len: 50 avg len: 55.5 num_loss_counted_tokens: 65
 total tokens: 174 num samples: 2 num padding tokens: 15 - rank: 5 max len: 87 min len: 72 avg len: 79.5 num_loss_counted_tokens: 98
 total tokens: 160 num samples: 2 num padding tokens: 7 - rank: 5 max len: 80 min len: 73 avg len: 76.5 num_loss_counted_tokens: 104
 total tokens: 124 num samples: 2 num padding tokens: 1 - rank: 5 max len: 62 min len: 61 avg len: 61.5 num_loss_counted_tokens: 59
 Per-token loss scaled by world size: 0.0017695350106805563
 Per-token loss scaled by world size: 0.0008982627186924219
 Epoch: 0, Step: 1, Rank: 0, loss = 0.12254030257463455
 Epoch: 0, Step: 1, Rank: 6, loss = 0.062204692512750626
 Per-token loss scaled by world size: 0.0019191226456314325
 Epoch: 0, Step: 1, Rank: 7, loss = 0.1328992396593094
 Per-token loss scaled by world size: 0.002525273710489273
 Epoch: 0, Step: 1, Rank: 3, loss = 0.1748751997947693
 Per-token loss scaled by world size: 0.002455754904076457
 Epoch: 0, Step: 1, Rank: 4, loss = 0.17006102204322815
 Per-token loss scaled by world size: 0.0006225108518265188
 Epoch: 0, Step: 1, Rank: 2, loss = 0.043108876794576645
 Per-token loss scaled by world size: 0.004392423201352358
 Epoch: 0, Step: 1, Rank: 5, loss = 0.30417531728744507
 Per-token loss scaled by world size: 0.0016390715027227998
 Epoch: 0, Step: 1, Rank: 1, loss = 0.11350569874048233
 [2024-07-27 20:04:03,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
 Epoch 0:   8%|▊         | 1/12 [00:01<00:13,  1.27s/it]
 {
    "epoch": 0,
    "step": 1,
    "rank": 0,
    "loss": 0.12254030257463455,
    "overall_throughput": 19.157316171863247,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 21.99594497680664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 554,
    "batch_size": 16,
    "total_loss": 0.1404212862253189,
    "gradnorm": 2.950925588607788,
    "weight_norm": 393.455078125,
    "timestamp": "2024-07-27T20:04:03.739792"
 }
 Per-token loss scaled by world size: 0.001750853261910379
 Per-token loss scaled by world size: 0.0013746594777330756
 Per-token loss scaled by world size: 0.006612943951040506
 Per-token loss scaled by world size: 0.00497298501431942
 Per-token loss scaled by world size: 0.0029370656702667475
 Per-token loss scaled by world size: 0.00024299396318383515
 Epoch: 0, Step: 2, Rank: 2, loss = 0.09450783580541611
 Epoch: 0, Step: 2, Rank: 1, loss = 0.12037116289138794
 Epoch: 0, Step: 2, Rank: 6, loss = 0.45463991165161133
 Epoch: 0, Step: 2, Rank: 5, loss = 0.34189271926879883
 Epoch: 0, Step: 2, Rank: 3, loss = 0.20192326605319977
 Epoch: 0, Step: 2, Rank: 4, loss = 0.016705835238099098
 Per-token loss scaled by world size: 0.0014473804039880633
 Per-token loss scaled by world size: 0.0009000621503219008
 Epoch: 0, Step: 2, Rank: 0, loss = 0.0995073989033699
 Epoch: 0, Step: 2, Rank: 7, loss = 0.061879273504018784
 [2024-07-27 20:04:04,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
 Epoch 0:  17%|█▋        | 2/12 [00:01<00:08,  1.20it/s]
 {
    "epoch": 0,
    "step": 2,
    "rank": 0,
    "loss": 0.0995073989033699,
    "overall_throughput": 38.74044042813212,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 21.998329639434814,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 550,
    "batch_size": 16,
    "total_loss": 0.1739284247159958,
    "gradnorm": 4.509922981262207,
    "weight_norm": 393.4551086425781,
    "timestamp": "2024-07-27T20:04:04.235419"
 }
 Per-token loss scaled by world size: 0.0008535730303265154
 Per-token loss scaled by world size: 0.0017103212885558605
 Per-token loss scaled by world size: 0.003972221631556749
 Per-token loss scaled by world size: 0.0014537357492372394
 Per-token loss scaled by world size: 0.0024689952842891216
 Per-token loss scaled by world size: 0.0007754238904453814
 Epoch: 0, Step: 3, Rank: 1, loss = 0.3440936803817749
 Epoch: 0, Step: 3, Rank: 6, loss = 0.14815658330917358
 Epoch: 0, Step: 3, Rank: 2, loss = 0.07394076138734818
 Epoch: 0, Step: 3, Rank: 5, loss = 0.12592986226081848
 Epoch: 0, Step: 3, Rank: 7, loss = 0.21387672424316406
 Per-token loss scaled by world size: 0.0015765568241477013
 Epoch: 0, Step: 3, Rank: 4, loss = 0.06717109680175781
 Per-token loss scaled by world size: 0.0009588321554474533
 Epoch: 0, Step: 3, Rank: 0, loss = 0.08305883407592773
 Epoch: 0, Step: 3, Rank: 3, loss = 0.13656923174858093
 [2024-07-27 20:04:04,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:04,785] [INFO] [timer.py:258:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=31.96463872821357, CurrSamplesPerSec=31.96463872821357, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  25%|██▌       | 3/12 [00:02<00:06,  1.42it/s]{
    "epoch": 0,
    "step": 3,
    "rank": 0,
    "loss": 0.08305883407592773,
    "overall_throughput": 31.90019931426007,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 21.998568058013916,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 693,
    "batch_size": 16,
    "total_loss": 0.14909958839416504,
    "gradnorm": 4.072885990142822,
    "weight_norm": 393.4551696777344,
    "timestamp": "2024-07-27T20:04:04.830984"
 }
 Per-token loss scaled by world size: 0.0011740017216652632
 Per-token loss scaled by world size: 0.001550567802041769
 Per-token loss scaled by world size: 0.002323366003111005
 Per-token loss scaled by world size: 0.0024958737194538116
 Per-token loss scaled by world size: 0.0009869500063359737
 Per-token loss scaled by world size: 0.0016128732822835445
 Per-token loss scaled by world size: 0.0012467901688069105
 Epoch: 0, Step: 4, Rank: 6, loss = 0.1122223436832428
 Epoch: 0, Step: 4, Rank: 5, loss = 0.16815361380577087
 Epoch: 0, Step: 4, Rank: 1, loss = 0.08496837317943573
 Epoch: 0, Step: 4, Rank: 3, loss = 0.1806388646364212
 Epoch: 0, Step: 4, Rank: 2, loss = 0.071430504322052
 Epoch: 0, Step: 4, Rank: 4, loss = 0.11673170328140259
 Epoch: 0, Step: 4, Rank: 7, loss = 0.09023644030094147
 Per-token loss scaled by world size: 0.0015552268596366048
 Epoch: 0, Step: 4, Rank: 0, loss = 0.11255954205989838
 [2024-07-27 20:04:05,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:05,334] [INFO] [timer.py:258:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=32.182993941908016, CurrSamplesPerSec=32.404352908739476, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  33%|███▎      | 4/12 [00:02<00:05,  1.55it/s]{
    "epoch": 0,
    "step": 4,
    "rank": 0,
    "loss": 0.11255954205989838,
    "overall_throughput": 32.34236936565279,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 21.996421813964844,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 579,
    "batch_size": 16,
    "total_loss": 0.11711767315864563,
    "gradnorm": 2.631800889968872,
    "weight_norm": 393.4552001953125,
    "timestamp": "2024-07-27T20:04:05.382886"
 }
 Per-token loss scaled by world size: 0.002091767033562064
 Per-token loss scaled by world size: 0.00198046350851655
 Per-token loss scaled by world size: 0.0025131492875516415
 Per-token loss scaled by world size: 0.0008698466117493808
 Per-token loss scaled by world size: 0.001278402516618371
 Per-token loss scaled by world size: 0.0014365998795256019
 Per-token loss scaled by world size: 0.0012164696818217635
 Epoch: 0, Step: 5, Rank: 5, loss = 0.1651211529970169
 Epoch: 0, Step: 5, Rank: 4, loss = 0.20953382551670074
 Epoch: 0, Step: 5, Rank: 2, loss = 0.07252345979213715
 Epoch: 0, Step: 5, Rank: 6, loss = 0.17440107464790344
 Epoch: 0, Step: 5, Rank: 0, loss = 0.10658681392669678
 Epoch: 0, Step: 5, Rank: 3, loss = 0.11977651715278625
 Epoch: 0, Step: 5, Rank: 1, loss = 0.10142315924167633
 Per-token loss scaled by world size: 0.000835251237731427
 Epoch: 0, Step: 5, Rank: 7, loss = 0.06963907182216644
 [2024-07-27 20:04:05,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:05,909] [INFO] [timer.py:258:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=31.583034761692534, CurrSamplesPerSec=30.44781135920859, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  42%|████▏     | 5/12 [00:03<00:04,  1.62it/s]{
    "epoch": 0,
    "step": 5,
    "rank": 0,
    "loss": 0.10658681392669678,
    "overall_throughput": 30.372502350050485,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 22.000954627990723,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 667,
    "batch_size": 16,
    "total_loss": 0.12737563252449036,
    "gradnorm": 2.452970027923584,
    "weight_norm": 393.45526123046875,
    "timestamp": "2024-07-27T20:04:05.943274"
 }
 Per-token loss scaled by world size: 0.002868585754185915
 Per-token loss scaled by world size: 0.00250813621096313
 Per-token loss scaled by world size: 0.0015482519520446658
 Per-token loss scaled by world size: 0.0010162107646465302
 Per-token loss scaled by world size: 0.0008416934870183468
 Per-token loss scaled by world size: 0.002133122645318508
 Per-token loss scaled by world size: 0.001864621532149613
 Epoch: 0, Step: 6, Rank: 6, loss = 0.05828727409243584
 Epoch: 0, Step: 6, Rank: 4, loss = 0.1986495554447174
 Epoch: 0, Step: 6, Rank: 2, loss = 0.10721644759178162
 Epoch: 0, Step: 6, Rank: 5, loss = 0.07037259638309479
 Epoch: 0, Step: 6, Rank: 3, loss = 0.17368842661380768
 Epoch: 0, Step: 6, Rank: 0, loss = 0.14771874248981476
 Epoch: 0, Step: 6, Rank: 7, loss = 0.12912504374980927
 Per-token loss scaled by world size: 0.0008939910912886262
 Epoch: 0, Step: 6, Rank: 1, loss = 0.06190888211131096
 [2024-07-27 20:04:06,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:06,460] [INFO] [timer.py:258:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=31.51022949423371, CurrSamplesPerSec=31.293813829665694, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  50%|█████     | 6/12 [00:04<00:03,  1.68it/s]{
    "epoch": 0,
    "step": 6,
    "rank": 0,
    "loss": 0.14771874248981476,
    "overall_throughput": 31.243491527428024,
    "lr": 4.800000000000001e-06,
    "cuda_mem_allocated": 21.996244430541992,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 554,
    "batch_size": 16,
    "total_loss": 0.11837086826562881,
    "gradnorm": 2.045849323272705,
    "weight_norm": 393.4552917480469,
    "timestamp": "2024-07-27T20:04:06.510648"
 }
 Per-token loss scaled by world size: 0.001680628745816648
 Per-token loss scaled by world size: 0.0018293416360393167
 Per-token loss scaled by world size: 0.0013819513842463493
 Per-token loss scaled by world size: 0.0005512057687155902
 Per-token loss scaled by world size: 0.0015243319794535637
 Per-token loss scaled by world size: 0.0011020175879821181
 Per-token loss scaled by world size: 0.001964986091479659
 Epoch: 0, Step: 7, Rank: 5, loss = 0.11867507547140121
 Epoch: 0, Step: 7, Rank: 7, loss = 0.15709471702575684
 Epoch: 0, Step: 7, Rank: 6, loss = 0.09463576227426529
 Epoch: 0, Step: 7, Rank: 4, loss = 0.1309020072221756
 Epoch: 0, Step: 7, Rank: 2, loss = 0.16874317824840546
 Epoch: 0, Step: 7, Rank: 1, loss = 0.14432398974895477
 Epoch: 0, Step: 7, Rank: 0, loss = 0.04733479768037796
 Per-token loss scaled by world size: 0.0013999826041981578
 Epoch: 0, Step: 7, Rank: 3, loss = 0.1202235072851181
 [2024-07-27 20:04:06,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[5.600000000000001e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:07,012] [INFO] [timer.py:258:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=31.621939391152157, CurrSamplesPerSec=32.07681358232997, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  58%|█████▊    | 7/12 [00:04<00:02,  1.72it/s]{
    "epoch": 0,
    "step": 7,
    "rank": 0,
    "loss": 0.04733479768037796,
    "overall_throughput": 32.023072654142574,
    "lr": 5.600000000000001e-06,
    "cuda_mem_allocated": 21.99880838394165,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 687,
    "batch_size": 16,
    "total_loss": 0.12274163216352463,
    "gradnorm": 2.7485461235046387,
    "weight_norm": 393.4553527832031,
    "timestamp": "2024-07-27T20:04:07.057260"
 }
 Per-token loss scaled by world size: 0.0043866513296961784
 Per-token loss scaled by world size: 0.0009822045685723424
 Per-token loss scaled by world size: 0.003587431972846389
 Per-token loss scaled by world size: 0.002129745902493596
 Per-token loss scaled by world size: 0.0025288627948611975
 Per-token loss scaled by world size: 0.0017483173869550228
 Per-token loss scaled by world size: 0.0015334823401644826
 Epoch: 0, Step: 8, Rank: 6, loss = 0.2569498121738434
 Epoch: 0, Step: 8, Rank: 3, loss = 0.1811297982931137
 Epoch: 0, Step: 8, Rank: 0, loss = 0.07035040110349655
 Epoch: 0, Step: 8, Rank: 4, loss = 0.12522323429584503
 Epoch: 0, Step: 8, Rank: 5, loss = 0.3141939043998718
 Epoch: 0, Step: 8, Rank: 1, loss = 0.1525430530309677
 Epoch: 0, Step: 8, Rank: 7, loss = 0.1098356693983078
 Per-token loss scaled by world size: 0.004045259207487106
 Epoch: 0, Step: 8, Rank: 2, loss = 0.2897416949272156
 [2024-07-27 20:04:07,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[6.4000000000000006e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:07,557] [INFO] [timer.py:258:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=31.719761022460865, CurrSamplesPerSec=32.21809006047175, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  67%|██████▋   | 8/12 [00:05<00:02,  1.76it/s]{
    "epoch": 0,
    "step": 8,
    "rank": 0,
    "loss": 0.07035040110349655,
    "overall_throughput": 32.162302804268634,
    "lr": 6.4000000000000006e-06,
    "cuda_mem_allocated": 22.001669883728027,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 573,
    "batch_size": 16,
    "total_loss": 0.18749594688415527,
    "gradnorm": 3.855632781982422,
    "weight_norm": 393.45538330078125,
    "timestamp": "2024-07-27T20:04:07.599450"
 }
 Per-token loss scaled by world size: 0.0026937518268823624
 Per-token loss scaled by world size: 0.0037981150671839714
 Per-token loss scaled by world size: 0.0015854539815336466
 Per-token loss scaled by world size: 0.002551022917032242
 Per-token loss scaled by world size: 0.0024539916776120663
 Per-token loss scaled by world size: 0.002055267570540309
 Epoch: 0, Step: 9, Rank: 0, loss = 0.2181939035654068
 Epoch: 0, Step: 9, Rank: 2, loss = 0.1987733244895935
 Epoch: 0, Step: 9, Rank: 3, loss = 0.30764731764793396
 Epoch: 0, Step: 9, Rank: 4, loss = 0.16647666692733765
 Epoch: 0, Step: 9, Rank: 7, loss = 0.2066328525543213
 Epoch: 0, Step: 9, Rank: 1, loss = 0.12842176854610443
 Per-token loss scaled by world size: 0.0031997160986065865
 Per-token loss scaled by world size: 0.002269922522827983
 Epoch: 0, Step: 9, Rank: 5, loss = 0.25917699933052063
 Epoch: 0, Step: 9, Rank: 6, loss = 0.18386372923851013
 [2024-07-27 20:04:08,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[7.2000000000000005e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:08,097] [INFO] [timer.py:258:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=31.789200407919928, CurrSamplesPerSec=32.21230625969002, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  75%|███████▌  | 9/12 [00:05<00:01,  1.78it/s]{
    "epoch": 0,
    "step": 9,
    "rank": 0,
    "loss": 0.2181939035654068,
    "overall_throughput": 32.12612073517451,
    "lr": 7.2000000000000005e-06,
    "cuda_mem_allocated": 22.002385139465332,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 648,
    "batch_size": 16,
    "total_loss": 0.20864830911159515,
    "gradnorm": 35.085845947265625,
    "weight_norm": 393.4554138183594,
    "timestamp": "2024-07-27T20:04:08.140249"
 }
 Per-token loss scaled by world size: 0.0028963200747966766
 Per-token loss scaled by world size: 0.0014150061178952456
 Per-token loss scaled by world size: 0.004510107450187206
 Per-token loss scaled by world size: 0.0027439305558800697
 Per-token loss scaled by world size: 0.003027191385626793
 Per-token loss scaled by world size: 0.002273061079904437
 Per-token loss scaled by world size: 0.0028788307681679726
 Epoch: 0, Step: 10, Rank: 5, loss = 0.2164275199174881
 Epoch: 0, Step: 10, Rank: 2, loss = 0.3557347357273102
 Epoch: 0, Step: 10, Rank: 6, loss = 0.1116086095571518
 Epoch: 0, Step: 10, Rank: 0, loss = 0.23876972496509552
 Epoch: 0, Step: 10, Rank: 1, loss = 0.22844724357128143
 Epoch: 0, Step: 10, Rank: 3, loss = 0.22706778347492218
 Epoch: 0, Step: 10, Rank: 4, loss = 0.17928770184516907
 Per-token loss scaled by world size: 0.004227162804454565
 Epoch: 0, Step: 10, Rank: 7, loss = 0.33341747522354126
 [2024-07-27 20:04:08,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:08,643] [INFO] [timer.py:258:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=31.799625325743182, CurrSamplesPerSec=31.872791640267828, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0:  83%|████████▎ | 10/12 [00:06<00:01,  1.80it/s]{
    "epoch": 0,
    "step": 10,
    "rank": 0,
    "loss": 0.23876972496509552,
    "overall_throughput": 31.789585477820573,
    "lr": 8.000000000000001e-06,
    "cuda_mem_allocated": 22.002862453460693,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 631,
    "batch_size": 16,
    "total_loss": 0.23634512722492218,
    "gradnorm": 5.703427791595459,
    "weight_norm": 393.4554748535156,
    "timestamp": "2024-07-27T20:04:08.686452"
 }
 Per-token loss scaled by world size: 0.0033281107898801565
 Per-token loss scaled by world size: 0.0010645152069628239
 Per-token loss scaled by world size: 0.004243766888976097
 Per-token loss scaled by world size: 0.003650533501058817
 Per-token loss scaled by world size: 0.0036266690585762262
 Per-token loss scaled by world size: 0.0018828274914994836
 Per-token loss scaled by world size: 0.0036798259243369102
 Epoch: 0, Step: 11, Rank: 4, loss = 0.07890719175338745
 Epoch: 0, Step: 11, Rank: 6, loss = 0.2705957889556885
 Epoch: 0, Step: 11, Rank: 2, loss = 0.3145692050457001
 Epoch: 0, Step: 11, Rank: 1, loss = 0.26882684230804443
 Epoch: 0, Step: 11, Rank: 5, loss = 0.2727670967578888
 Epoch: 0, Step: 11, Rank: 0, loss = 0.24669620394706726
 Epoch: 0, Step: 11, Rank: 3, loss = 0.13956458866596222
 Per-token loss scaled by world size: 0.002425282960757613
 Epoch: 0, Step: 11, Rank: 7, loss = 0.17977410554885864
 [2024-07-27 20:04:09,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[8.8e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:09,202] [INFO] [timer.py:258:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=31.7205989700859, CurrSamplesPerSec=31.10225264577545, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Saving model in huggingface format at samples_seen: 176
 {
    "epoch": 0,
    "step": 11,
    "rank": 0,
    "loss": 0.24669620394706726,
    "overall_throughput": 31.02962868774954,
    "lr": 8.8e-06,
    "cuda_mem_allocated": 22.00071620941162,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 593,
    "batch_size": 16,
    "total_loss": 0.22146263718605042,
    "gradnorm": 4.970978736877441,
    "weight_norm": 393.4555358886719,
    "timestamp": "2024-07-27T20:04:09.205377"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_176
 [20:04:26] INFO     saving took 17.77474355697632 seconds                                                                                                                                                                         utils.py:611
 Epoch 0:  92%|█████████▏| 11/12 [00:24<00:05,  6.00s/it]
 Per-token loss scaled by world size: 0.0032385066151618958
 Per-token loss scaled by world size: 0.0007423295173794031
 Per-token loss scaled by world size: 0.004228494130074978
 Per-token loss scaled by world size: 0.002175833098590374
 Per-token loss scaled by world size: 0.0016533531015738845
 Per-token loss scaled by world size: 0.0016122134402394295
 Per-token loss scaled by world size: 0.0011377736227586865
 Epoch: 0, Step: 12, Rank: 2, loss = 0.3493793308734894
 Epoch: 0, Step: 12, Rank: 5, loss = 0.06133497506380081
 Epoch: 0, Step: 12, Rank: 6, loss = 0.17977821826934814
 Epoch: 0, Step: 12, Rank: 0, loss = 0.26758161187171936
 Epoch: 0, Step: 12, Rank: 1, loss = 0.13320913910865784
 Epoch: 0, Step: 12, Rank: 3, loss = 0.1366083025932312
 Epoch: 0, Step: 12, Rank: 7, loss = 0.09400854259729385
 Per-token loss scaled by world size: 0.0017149074701592326
 Epoch: 0, Step: 12, Rank: 4, loss = 0.14169423282146454
 [2024-07-27 20:04:27,462] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[9.600000000000001e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:27,540] [INFO] [timer.py:258:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=31.65103378148296, CurrSamplesPerSec=31.038411783233425, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
 Epoch 0: 100%|██████████| 12/12 [00:25<00:00,  4.34s/it]{
    "epoch": 0,
    "step": 12,
    "rank": 0,
    "loss": 0.26758161187171936,
    "overall_throughput": 30.984142462059825,
    "lr": 9.600000000000001e-06,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 661,
    "batch_size": 16,
    "total_loss": 0.17044928669929504,
    "gradnorm": 3.941415309906006,
    "weight_norm": 393.4555969238281,
    "timestamp": "2024-07-27T20:04:27.583911"
 }
 Epoch 0: 100%|██████████| 12/12 [00:25<00:00,  2.10s/it]
 total tokens: 164 num samples: 2 num padding tokens: 22 - rank: 5 max len: 82 min len: 60 avg len: 71.0 num_loss_counted_tokens: 84
 total tokens: 186 num samples: 2 num padding tokens: 38 - rank: 5 max len: 93 min len: 55 avg len: 74.0 num_loss_counted_tokens: 82
 total tokens: 152 num samples: 2 num padding tokens: 28 - rank: 5 max len: 76 min len: 48 avg len: 62.0 num_loss_counted_tokens: 69
 total tokens: 200 num samples: 2 num padding tokens: 36 - rank: 5 max len: 100 min len: 64 avg len: 82.0 num_loss_counted_tokens: 102
 total tokens: 196 num samples: 2 num padding tokens: 28 - rank: 5 max len: 98 min len: 70 avg len: 84.0 num_loss_counted_tokens: 101
 total tokens: 214 num samples: 2 num padding tokens: 42 - rank: 5 max len: 107 min len: 65 avg len: 86.0 num_loss_counted_tokens: 106
 total tokens: 138 num samples: 2 num padding tokens: 12 - rank: 5 max len: 69 min len: 57 avg len: 63.0 num_loss_counted_tokens: 63
 total tokens: 202 num samples: 2 num padding tokens: 39 - rank: 5 max len: 101 min len: 62 avg len: 81.5 num_loss_counted_tokens: 105
 total tokens: 104 num samples: 2 num padding tokens: 0 - rank: 5 max len: 52 min len: 52 avg len: 52.0 num_loss_counted_tokens: 50
 total tokens: 180 num samples: 2 num padding tokens: 31 - rank: 5 max len: 90 min len: 59 avg len: 74.5 num_loss_counted_tokens: 95
 total tokens: 174 num samples: 2 num padding tokens: 7 - rank: 5 max len: 87 min len: 80 avg len: 83.5 num_loss_counted_tokens: 97
 total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 5 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 72
 total tokens: 146 num samples: 2 num padding tokens: 16 - rank: 2 max len: 73 min len: 57 avg len: 65.0 num_loss_counted_tokens: 75
 total tokens: 214 num samples: 2 num padding tokens: 53 - rank: 2 max len: 107 min len: 54 avg len: 80.5 num_loss_counted_tokens: 108
 total tokens: 136 num samples: 2 num padding tokens: 8 - rank: 2 max len: 68 min len: 60 avg len: 64.0 num_loss_counted_tokens: 62
 total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 2 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 80
 total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 7 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 119
 total tokens: 130 num samples: 2 num padding tokens: 7 - rank: 2 max len: 65 min len: 58 avg len: 61.5 num_loss_counted_tokens: 59
 total tokens: 194 num samples: 2 num padding tokens: 23 - rank: 0 max len: 97 min len: 74 avg len: 85.5 num_loss_counted_tokens: 109
 total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 2 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
 total tokens: 118 num samples: 2 num padding tokens: 5 - rank: 2 max len: 59 min len: 54 avg len: 56.5 num_loss_counted_tokens: 71
 total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 7 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 47
 total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 7 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 62
 total tokens: 150 num samples: 2 num padding tokens: 29 - rank: 2 max len: 75 min len: 46 avg len: 60.5 num_loss_counted_tokens: 64
 total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 0 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 78
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 2 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 73
 total tokens: 136 num samples: 2 num padding tokens: 23 - rank: 0 max len: 68 min len: 45 avg len: 56.5 num_loss_counted_tokens: 51
 total tokens: 132 num samples: 2 num padding tokens: 14 - rank: 2 max len: 66 min len: 52 avg len: 59.0 num_loss_counted_tokens: 58
 total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 2 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 70
 total tokens: 114 num samples: 2 num padding tokens: 2 - rank: 4 max len: 57 min len: 55 avg len: 56.0 num_loss_counted_tokens: 66
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 49
 total tokens: 166 num samples: 2 num padding tokens: 7 - rank: 0 max len: 83 min len: 76 avg len: 79.5 num_loss_counted_tokens: 90
 total tokens: 188 num samples: 2 num padding tokens: 22 - rank: 4 max len: 94 min len: 72 avg len: 83.0 num_loss_counted_tokens: 98
 total tokens: 156 num samples: 2 num padding tokens: 27 - rank: 7 max len: 78 min len: 51 avg len: 64.5 num_loss_counted_tokens: 71
 total tokens: 118 num samples: 2 num padding tokens: 8 - rank: 0 max len: 59 min len: 51 avg len: 55.0 num_loss_counted_tokens: 60
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 4 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 59
 total tokens: 166 num samples: 2 num padding tokens: 16 - rank: 7 max len: 83 min len: 67 avg len: 75.0 num_loss_counted_tokens: 75
 total tokens: 168 num samples: 2 num padding tokens: 13 - rank: 4 max len: 84 min len: 71 avg len: 77.5 num_loss_counted_tokens: 88
 total tokens: 174 num samples: 2 num padding tokens: 41 - rank: 3 max len: 87 min len: 46 avg len: 66.5 num_loss_counted_tokens: 70
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 4 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 68
 total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 4 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 84
 total tokens: 174 num samples: 2 num padding tokens: 6 - rank: 4 max len: 87 min len: 81 avg len: 84.0 num_loss_counted_tokens: 100
 total tokens: 104 num samples: 2 num padding tokens: 8 - rank: 4 max len: 52 min len: 44 avg len: 48.0 num_loss_counted_tokens: 52
 total tokens: 152 num samples: 2 num padding tokens: 12 - rank: 3 max len: 76 min len: 64 avg len: 70.0 num_loss_counted_tokens: 81
 total tokens: 128 num samples: 2 num padding tokens: 14 - rank: 4 max len: 64 min len: 50 avg len: 57.0 num_loss_counted_tokens: 51
 total tokens: 180 num samples: 2 num padding tokens: 9 - rank: 3 max len: 90 min len: 81 avg len: 85.5 num_loss_counted_tokens: 135
 total tokens: 154 num samples: 2 num padding tokens: 23 - rank: 2 max len: 77 min len: 54 avg len: 65.5 num_loss_counted_tokens: 75
 total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 3 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 57
 total tokens: 122 num samples: 2 num padding tokens: 3 - rank: 3 max len: 61 min len: 58 avg len: 59.5 num_loss_counted_tokens: 60
 total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 3 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 80
 total tokens: 124 num samples: 2 num padding tokens: 5 - rank: 0 max len: 62 min len: 57 avg len: 59.5 num_loss_counted_tokens: 73
 total tokens: 244 num samples: 2 num padding tokens: 34 - rank: 3 max len: 122 min len: 88 avg len: 105.0 num_loss_counted_tokens: 147
 total tokens: 186 num samples: 2 num padding tokens: 30 - rank: 7 max len: 93 min len: 63 avg len: 78.0 num_loss_counted_tokens: 117
 total tokens: 138 num samples: 2 num padding tokens: 25 - rank: 0 max len: 69 min len: 44 avg len: 56.5 num_loss_counted_tokens: 69
 total tokens: 118 num samples: 2 num padding tokens: 6 - rank: 0 max len: 59 min len: 53 avg len: 56.0 num_loss_counted_tokens: 50
 total tokens: 148 num samples: 2 num padding tokens: 8 - rank: 0 max len: 74 min len: 66 avg len: 70.0 num_loss_counted_tokens: 76
 total tokens: 96 num samples: 2 num padding tokens: 5 - rank: 0 max len: 48 min len: 43 avg len: 45.5 num_loss_counted_tokens: 39
 total tokens: 166 num samples: 2 num padding tokens: 3 - rank: 3 max len: 83 min len: 80 avg len: 81.5 num_loss_counted_tokens: 111
 total tokens: 164 num samples: 2 num padding tokens: 29 - rank: 0 max len: 82 min len: 53 avg len: 67.5 num_loss_counted_tokens: 85
 total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 1 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 68
 total tokens: 128 num samples: 2 num padding tokens: 7 - rank: 3 max len: 64 min len: 57 avg len: 60.5 num_loss_counted_tokens: 66
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 3 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 73
 total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 4 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
 total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 3 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 61
 total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 7 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 61
 total tokens: 282 num samples: 2 num padding tokens: 70 - rank: 7 max len: 141 min len: 71 avg len: 106.0 num_loss_counted_tokens: 151
 total tokens: 132 num samples: 2 num padding tokens: 15 - rank: 7 max len: 66 min len: 51 avg len: 58.5 num_loss_counted_tokens: 61
 total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 1 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 99
 total tokens: 188 num samples: 2 num padding tokens: 8 - rank: 7 max len: 94 min len: 86 avg len: 90.0 num_loss_counted_tokens: 85
 total tokens: 168 num samples: 2 num padding tokens: 39 - rank: 7 max len: 84 min len: 45 avg len: 64.5 num_loss_counted_tokens: 71
 total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 7 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 88
 total tokens: 126 num samples: 2 num padding tokens: 2 - rank: 1 max len: 63 min len: 61 avg len: 62.0 num_loss_counted_tokens: 56
 total tokens: 146 num samples: 2 num padding tokens: 11 - rank: 1 max len: 73 min len: 62 avg len: 67.5 num_loss_counted_tokens: 75
 total tokens: 208 num samples: 2 num padding tokens: 38 - rank: 0 max len: 104 min len: 66 avg len: 85.0 num_loss_counted_tokens: 110
 total tokens: 132 num samples: 2 num padding tokens: 16 - rank: 4 max len: 66 min len: 50 avg len: 58.0 num_loss_counted_tokens: 61
 total tokens: 172 num samples: 2 num padding tokens: 28 - rank: 1 max len: 86 min len: 58 avg len: 72.0 num_loss_counted_tokens: 78
 total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 1 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 89
 total tokens: 226 num samples: 2 num padding tokens: 39 - rank: 1 max len: 113 min len: 74 avg len: 93.5 num_loss_counted_tokens: 109
 total tokens: 134 num samples: 2 num padding tokens: 23 - rank: 1 max len: 67 min len: 44 avg len: 55.5 num_loss_counted_tokens: 47
 total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 1 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 62
 total tokens: 162 num samples: 2 num padding tokens: 12 - rank: 1 max len: 81 min len: 69 avg len: 75.0 num_loss_counted_tokens: 71
 total tokens: 158 num samples: 2 num padding tokens: 12 - rank: 1 max len: 79 min len: 67 avg len: 73.0 num_loss_counted_tokens: 66
 total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 6 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 76
 total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 6 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 63
 total tokens: 98 num samples: 2 num padding tokens: 4 - rank: 6 max len: 49 min len: 45 avg len: 47.0 num_loss_counted_tokens: 48
 total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 6 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 76
 total tokens: 124 num samples: 2 num padding tokens: 1 - rank: 3 max len: 62 min len: 61 avg len: 61.5 num_loss_counted_tokens: 57
 total tokens: 228 num samples: 2 num padding tokens: 46 - rank: 6 max len: 114 min len: 68 avg len: 91.0 num_loss_counted_tokens: 118
 total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 6 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 68
 total tokens: 216 num samples: 2 num padding tokens: 60 - rank: 6 max len: 108 min len: 48 avg len: 78.0 num_loss_counted_tokens: 102
 total tokens: 126 num samples: 2 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 63
 total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 6 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 57
 total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 6 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 91
 total tokens: 124 num samples: 2 num padding tokens: 11 - rank: 6 max len: 62 min len: 51 avg len: 56.5 num_loss_counted_tokens: 57
 total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 1 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 66
 total tokens: 116 num samples: 2 num padding tokens: 8 - rank: 6 max len: 58 min len: 50 avg len: 54.0 num_loss_counted_tokens: 59
 Per-token loss scaled by world size: 0.0013802563771605492
 Per-token loss scaled by world size: 0.0019055134616792202
 Per-token loss scaled by world size: 0.003680554451420903
 Per-token loss scaled by world size: 0.001587073435075581
 Per-token loss scaled by world size: 0.0016849382082000375
 Per-token loss scaled by world size: 0.00304134888574481
 Per-token loss scaled by world size: 0.00129329867195338
 Epoch: 1, Step: 13, Rank: 1, loss = 0.1150788739323616
 Epoch: 1, Step: 13, Rank: 3, loss = 0.15887218713760376
 Epoch: 1, Step: 13, Rank: 7, loss = 0.30686622858047485
 Epoch: 1, Step: 13, Rank: 4, loss = 0.1323222517967224
 Epoch: 1, Step: 13, Rank: 0, loss = 0.2535724639892578
 Epoch: 1, Step: 13, Rank: 2, loss = 0.14048172533512115
 Epoch: 1, Step: 13, Rank: 5, loss = 0.1078287735581398
 Per-token loss scaled by world size: 0.000824308895971626
 Epoch: 1, Step: 13, Rank: 6, loss = 0.06872675567865372
 [2024-07-27 20:04:28,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[1.04e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:28,578] [INFO] [timer.py:258:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=31.312819818036353, CurrSamplesPerSec=28.289847056874475, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 | 1/12 [00:00<00:10,  1.05it/s]{
    "epoch": 1,
    "step": 13,
    "rank": 0,
    "loss": 0.2535724639892578,
    "overall_throughput": 28.191588421887058,
    "lr": 1.04e-05,
    "cuda_mem_allocated": 22.006441116333008,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 667,
    "batch_size": 16,
    "total_loss": 0.16046865284442902,
    "gradnorm": 3.25748348236084,
    "weight_norm": 393.4556884765625,
    "timestamp": "2024-07-27T20:04:28.620919"
 }
 Per-token loss scaled by world size: 0.003146873554214835
 Per-token loss scaled by world size: 0.0015134336426854134
 Per-token loss scaled by world size: 0.0021054281387478113
 Per-token loss scaled by world size: 0.005117365624755621
 Per-token loss scaled by world size: 0.0010033146245405078
 Per-token loss scaled by world size: 0.0036201237235218287
 Per-token loss scaled by world size: 0.003257090924307704
 Epoch: 1, Step: 14, Rank: 0, loss = 0.2533233165740967
 Epoch: 1, Step: 14, Rank: 5, loss = 0.41194793581962585
 Epoch: 1, Step: 14, Rank: 2, loss = 0.12183140963315964
 Epoch: 1, Step: 14, Rank: 6, loss = 0.16948696970939636
 Epoch: 1, Step: 14, Rank: 1, loss = 0.29141995310783386
 Epoch: 1, Step: 14, Rank: 3, loss = 0.2621958255767822
 Epoch: 1, Step: 14, Rank: 7, loss = 0.08076682686805725
 Per-token loss scaled by world size: 0.0011276104487478733
 Epoch: 1, Step: 14, Rank: 4, loss = 0.09077264368534088
 [2024-07-27 20:04:29,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[1.1200000000000001e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:29,123] [INFO] [timer.py:258:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=31.374398887750708, CurrSamplesPerSec=32.06810729498475, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 14,
    "rank": 0,
    "loss": 0.2533233165740967,
    "overall_throughput": 32.01091375509972,
    "lr": 1.1200000000000001e-05,
    "cuda_mem_allocated": 22.00023889541626,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 644,
    "batch_size": 16,
    "total_loss": 0.21021811664104462,
    "gradnorm": 6.8222336769104,
    "weight_norm": 393.4558410644531,
    "timestamp": "2024-07-27T20:04:29.126718"
 }
 Per-token loss scaled by world size: 0.00125453295186162
 Per-token loss scaled by world size: 0.002432552631944418
 Per-token loss scaled by world size: 0.0022791901137679815
 Per-token loss scaled by world size: 0.0012238719500601292
 Per-token loss scaled by world size: 0.0040193116292357445
 Per-token loss scaled by world size: 0.002601771615445614
 Per-token loss scaled by world size: 0.0017355632735416293
 Epoch: 1, Step: 15, Rank: 0, loss = 0.08985592424869537
 Epoch: 1, Step: 15, Rank: 1, loss = 0.17423158884048462
 Epoch: 1, Step: 15, Rank: 2, loss = 0.1632469892501831
 Epoch: 1, Step: 15, Rank: 6, loss = 0.08765982836484909
 Epoch: 1, Step: 15, Rank: 4, loss = 0.2878831923007965
 Epoch: 1, Step: 15, Rank: 3, loss = 0.18635189533233643
 Epoch: 1, Step: 15, Rank: 5, loss = 0.1243097186088562
 Per-token loss scaled by world size: 0.0024993098340928555
 Epoch: 1, Step: 15, Rank: 7, loss = 0.17901305854320526
 [2024-07-27 20:04:29,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:29,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=31.399353099680013, CurrSamplesPerSec=31.70192973588364, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 15,
    "rank": 0,
    "loss": 0.08985592424869537,
    "overall_throughput": 31.64709333197519,
    "lr": 1.2e-05,
    "cuda_mem_allocated": 21.999523639678955,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 573,
    "batch_size": 16,
    "total_loss": 0.1615689992904663,
    "gradnorm": 3.001760244369507,
    "weight_norm": 393.45599365234375,
    "timestamp": "2024-07-27T20:04:29.719105"
 }
 Per-token loss scaled by world size: 0.0014579611597582698
 Per-token loss scaled by world size: 0.003502971027046442
 Per-token loss scaled by world size: 0.0018768769223242998
 Per-token loss scaled by world size: 0.0020246703643351793
 Per-token loss scaled by world size: 0.001514959498308599
 Per-token loss scaled by world size: 0.006437234580516815
 Epoch: 1, Step: 16, Rank: 0, loss = 0.10005258768796921
 Epoch: 1, Step: 16, Rank: 4, loss = 0.1288006752729416
 Epoch: 1, Step: 16, Rank: 2, loss = 0.24039138853549957
 Epoch: 1, Step: 16, Rank: 7, loss = 0.4417552351951599
 Epoch: 1, Step: 16, Rank: 1, loss = 0.13894300162792206
 Epoch: 1, Step: 16, Rank: 3, loss = 0.10396409779787064
 Per-token loss scaled by world size: 0.002007455099374056
 Per-token loss scaled by world size: 0.0018041662406176329
 Epoch: 1, Step: 16, Rank: 5, loss = 0.13776160776615143
 Epoch: 1, Step: 16, Rank: 6, loss = 0.12381090968847275
 [2024-07-27 20:04:30,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[1.2800000000000001e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:30,218] [INFO] [timer.py:258:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=31.464876520188444, CurrSamplesPerSec=32.342260256708684, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 16,
    "rank": 0,
    "loss": 0.10005258768796921,
    "overall_throughput": 32.288201736596754,
    "lr": 1.2800000000000001e-05,
    "cuda_mem_allocated": 22.003100872039795,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 549,
    "batch_size": 16,
    "total_loss": 0.17693495750427246,
    "gradnorm": 2.5745155811309814,
    "weight_norm": 393.4561462402344,
    "timestamp": "2024-07-27T20:04:30.261395"
 }
 Per-token loss scaled by world size: 0.004123833030462265
 Per-token loss scaled by world size: 0.002096434822306037
 Per-token loss scaled by world size: 0.002511914586648345
 Per-token loss scaled by world size: 0.004808654077351093
 Per-token loss scaled by world size: 0.0011069930624216795
 Per-token loss scaled by world size: 0.002304441062733531
 Epoch: 1, Step: 17, Rank: 1, loss = 0.20597699284553528
 Epoch: 1, Step: 17, Rank: 7, loss = 0.17190766334533691
 Epoch: 1, Step: 17, Rank: 3, loss = 0.3943096399307251
 Epoch: 1, Step: 17, Rank: 6, loss = 0.33815431594848633
 Epoch: 1, Step: 17, Rank: 2, loss = 0.09077343344688416
 Epoch: 1, Step: 17, Rank: 4, loss = 0.18896417319774628
 Per-token loss scaled by world size: 0.0022304877638816833
 Per-token loss scaled by world size: 0.0029599058907479048
 Epoch: 1, Step: 17, Rank: 0, loss = 0.24271227419376373
 Epoch: 1, Step: 17, Rank: 5, loss = 0.18289999663829803
 [2024-07-27 20:04:30,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.3600000000000002e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:30,758] [INFO] [timer.py:258:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=31.51917479452854, CurrSamplesPerSec=32.29951508996705, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 17,
    "rank": 0,
    "loss": 0.24271227419376373,
    "overall_throughput": 32.21476489448058,
    "lr": 1.3600000000000002e-05,
    "cuda_mem_allocated": 21.997375965118408,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 656,
    "batch_size": 16,
    "total_loss": 0.22696229815483093,
    "gradnorm": 4.573685169219971,
    "weight_norm": 393.4563293457031,
    "timestamp": "2024-07-27T20:04:30.804653"
 }
 Per-token loss scaled by world size: 0.0030115304980427027
 Per-token loss scaled by world size: 0.006417228374630213
 Per-token loss scaled by world size: 0.007109665311872959
 Per-token loss scaled by world size: 0.000538784428499639
 Per-token loss scaled by world size: 0.002789800288155675
 Per-token loss scaled by world size: 0.003157705068588257
 Per-token loss scaled by world size: 0.0017401942750439048
 Epoch: 1, Step: 18, Rank: 0, loss = 0.23715803027153015
 Epoch: 1, Step: 18, Rank: 7, loss = 0.5053567290306091
 Epoch: 1, Step: 18, Rank: 6, loss = 0.5598861575126648
 Epoch: 1, Step: 18, Rank: 2, loss = 0.2196967750787735
 Epoch: 1, Step: 18, Rank: 1, loss = 0.042429275810718536
 Epoch: 1, Step: 18, Rank: 5, loss = 0.24866926670074463
 Epoch: 1, Step: 18, Rank: 3, loss = 0.13704030215740204
 Per-token loss scaled by world size: 0.0006516931462101638
 Epoch: 1, Step: 18, Rank: 4, loss = 0.05132083594799042
 [2024-07-27 20:04:31,227] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.4400000000000001e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:31,304] [INFO] [timer.py:258:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=31.586850556537332, CurrSamplesPerSec=32.638021628709105, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 18,
    "rank": 0,
    "loss": 0.23715803027153015,
    "overall_throughput": 32.58399904834273,
    "lr": 1.4400000000000001e-05,
    "cuda_mem_allocated": 21.999762058258057,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 630,
    "batch_size": 16,
    "total_loss": 0.2501946687698364,
    "gradnorm": 4.389204025268555,
    "weight_norm": 393.45654296875,
    "timestamp": "2024-07-27T20:04:31.348530"
 }
 Per-token loss scaled by world size: 0.003954596351832151
 Per-token loss scaled by world size: 0.0016637382796034217
 Per-token loss scaled by world size: 0.003214797005057335
 Per-token loss scaled by world size: 0.006215415894985199
 Per-token loss scaled by world size: 0.0025190163869410753
 Per-token loss scaled by world size: 0.0015009477501735091
 Per-token loss scaled by world size: 0.0017105289734899998
 Epoch: 1, Step: 19, Rank: 0, loss = 0.34849879145622253
 Epoch: 1, Step: 19, Rank: 7, loss = 0.22198832035064697
 Epoch: 1, Step: 19, Rank: 1, loss = 0.28330397605895996
 Epoch: 1, Step: 19, Rank: 2, loss = 0.14661693572998047
 Epoch: 1, Step: 19, Rank: 5, loss = 0.5477335453033447
 Epoch: 1, Step: 19, Rank: 3, loss = 0.1507403701543808
 Epoch: 1, Step: 19, Rank: 6, loss = 0.13227102160453796
 Per-token loss scaled by world size: 0.0012170596746727824
 Epoch: 1, Step: 19, Rank: 4, loss = 0.10725338757038116
 [2024-07-27 20:04:31,783] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.5200000000000002e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:31,860] [INFO] [timer.py:258:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=31.571700427209116, CurrSamplesPerSec=31.331259798479305, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 19,
    "rank": 0,
    "loss": 0.34849879145622253,
    "overall_throughput": 31.250591517446562,
    "lr": 1.5200000000000002e-05,
    "cuda_mem_allocated": 22.002862453460693,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 705,
    "batch_size": 16,
    "total_loss": 0.24230077862739563,
    "gradnorm": 3.7223894596099854,
    "weight_norm": 393.456787109375,
    "timestamp": "2024-07-27T20:04:31.903580"
 }
 Per-token loss scaled by world size: 0.003282753750681877
 Per-token loss scaled by world size: 0.006474556401371956
 Per-token loss scaled by world size: 0.001697456929832697
 Per-token loss scaled by world size: 0.0012144312495365739
 Per-token loss scaled by world size: 0.0008425723062828183
 Per-token loss scaled by world size: 0.0018245026003569365
 Per-token loss scaled by world size: 0.0037540853954851627
 Epoch: 1, Step: 20, Rank: 6, loss = 0.12476308643817902
 Epoch: 1, Step: 20, Rank: 2, loss = 0.4758799076080322
 Epoch: 1, Step: 20, Rank: 5, loss = 0.061929065734148026
 Epoch: 1, Step: 20, Rank: 4, loss = 0.2412824034690857
 Epoch: 1, Step: 20, Rank: 0, loss = 0.08926069736480713
 Epoch: 1, Step: 20, Rank: 7, loss = 0.27592527866363525
 Epoch: 1, Step: 20, Rank: 1, loss = 0.13410094380378723
 Per-token loss scaled by world size: 0.0009085916099138558
 Epoch: 1, Step: 20, Rank: 3, loss = 0.06678148359060287
 [2024-07-27 20:04:32,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:32,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=31.62285860527634, CurrSamplesPerSec=32.51863226575504, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 20,
    "rank": 0,
    "loss": 0.08926069736480713,
    "overall_throughput": 32.4628994682786,
    "lr": 1.6000000000000003e-05,
    "cuda_mem_allocated": 22.00811004638672,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 588,
    "batch_size": 16,
    "total_loss": 0.18374036252498627,
    "gradnorm": 4.014389514923096,
    "weight_norm": 393.4570617675781,
    "timestamp": "2024-07-27T20:04:32.401998"
 }
 Per-token loss scaled by world size: 0.0046243746764957905
 Per-token loss scaled by world size: 0.0019434256246313453
 Per-token loss scaled by world size: 0.0029365788213908672
 Per-token loss scaled by world size: 0.0035864808596670628
 Per-token loss scaled by world size: 0.003000351833179593
 Per-token loss scaled by world size: 0.002845433074980974
 Per-token loss scaled by world size: 0.0026900055818259716
 Epoch: 1, Step: 21, Rank: 0, loss = 0.33989155292510986
 Epoch: 1, Step: 21, Rank: 3, loss = 0.14284178614616394
 Epoch: 1, Step: 21, Rank: 2, loss = 0.220525860786438
 Epoch: 1, Step: 21, Rank: 5, loss = 0.26360633969306946
 Epoch: 1, Step: 21, Rank: 6, loss = 0.21583855152130127
 Epoch: 1, Step: 21, Rank: 7, loss = 0.1977154165506363
 Epoch: 1, Step: 21, Rank: 4, loss = 0.20913933217525482
 Per-token loss scaled by world size: 0.003519931575283408
 Epoch: 1, Step: 21, Rank: 1, loss = 0.2587149739265442
 [2024-07-27 20:04:32,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.6800000000000002e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:32,944] [INFO] [timer.py:258:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=31.62433633690537, CurrSamplesPerSec=31.650959142641135, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 21,
    "rank": 0,
    "loss": 0.33989155292510986,
    "overall_throughput": 31.58514555251878,
    "lr": 1.6800000000000002e-05,
    "cuda_mem_allocated": 21.998091220855713,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 588,
    "batch_size": 16,
    "total_loss": 0.23103423416614532,
    "gradnorm": 4.073541164398193,
    "weight_norm": 393.4573669433594,
    "timestamp": "2024-07-27T20:04:32.946992"
 }
 Per-token loss scaled by world size: 0.0026941639371216297
 Per-token loss scaled by world size: 0.007396150380373001
 Per-token loss scaled by world size: 0.0018774037016555667
 Per-token loss scaled by world size: 0.0010539990616962314
 Per-token loss scaled by world size: 0.0033142913598567247
 Per-token loss scaled by world size: 0.0031690315809100866
 Per-token loss scaled by world size: 0.00544370012357831
 Epoch: 1, Step: 22, Rank: 2, loss = 0.527900218963623
 Epoch: 1, Step: 22, Rank: 6, loss = 0.13399969041347504
 Epoch: 1, Step: 22, Rank: 5, loss = 0.07522918283939362
 Epoch: 1, Step: 22, Rank: 1, loss = 0.19229595363140106
 Epoch: 1, Step: 22, Rank: 4, loss = 0.23655754327774048
 Epoch: 1, Step: 22, Rank: 7, loss = 0.22618962824344635
 Epoch: 1, Step: 22, Rank: 3, loss = 0.38854408264160156
 Per-token loss scaled by world size: 0.0007584551349282265
 Epoch: 1, Step: 22, Rank: 0, loss = 0.054134733974933624
 [2024-07-27 20:04:33,400] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.76e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:33,478] [INFO] [timer.py:258:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=31.68561296315261, CurrSamplesPerSec=32.896711596691546, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 352
 {
    "epoch": 1,
    "step": 22,
    "rank": 0,
    "loss": 0.054134733974933624,
    "overall_throughput": 32.789902630272564,
    "lr": 1.76e-05,
    "cuda_mem_allocated": 21.997375965118408,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 571,
    "batch_size": 16,
    "total_loss": 0.22935637831687927,
    "gradnorm": 3.735788106918335,
    "weight_norm": 393.4576721191406,
    "timestamp": "2024-07-27T20:04:33.482264"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_352
 [20:04:51] INFO     saving took 17.896338939666748 seconds                                                                                                                                                                        utils.py:611
 Per-token loss scaled by world size: 0.001945212366990745
 Per-token loss scaled by world size: 0.0022036610171198845
 Per-token loss scaled by world size: 0.0031597877386957407
 Per-token loss scaled by world size: 0.0022363392636179924
 Per-token loss scaled by world size: 0.017282620072364807
 Per-token loss scaled by world size: 0.0026094394270330667
 Per-token loss scaled by world size: 0.0019479849142953753
 Epoch: 1, Step: 23, Rank: 1, loss = 0.1749155968427658
 Epoch: 1, Step: 23, Rank: 5, loss = 0.25080814957618713
 Epoch: 1, Step: 23, Rank: 4, loss = 0.17750942707061768
 Epoch: 1, Step: 23, Rank: 3, loss = 0.207124263048172
 Epoch: 1, Step: 23, Rank: 7, loss = 1.3718079328536987
 Epoch: 1, Step: 23, Rank: 0, loss = 0.15440122783184052
 Epoch: 1, Step: 23, Rank: 2, loss = 0.15462130308151245
 Per-token loss scaled by world size: 0.004577424377202988
 Epoch: 1, Step: 23, Rank: 6, loss = 0.3633330762386322
 [2024-07-27 20:04:51,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.8400000000000003e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:51,943] [INFO] [timer.py:258:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=31.661565651394987, CurrSamplesPerSec=31.188169951680987, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 23,
    "rank": 0,
    "loss": 0.15440122783184052,
    "overall_throughput": 31.12553529081371,
    "lr": 1.8400000000000003e-05,
    "cuda_mem_allocated": 22.000954627990723,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 635,
    "batch_size": 16,
    "total_loss": 0.3568150997161865,
    "gradnorm": 3.5464768409729004,
    "weight_norm": 393.4579772949219,
    "timestamp": "2024-07-27T20:04:51.986553"
 }
 Per-token loss scaled by world size: 0.002529617166146636
 Per-token loss scaled by world size: 0.004327333997935057
 Per-token loss scaled by world size: 0.002556184073910117
 Per-token loss scaled by world size: 0.005085375625640154
 Per-token loss scaled by world size: 0.008069510571658611
 Per-token loss scaled by world size: 0.002654892858117819
 Per-token loss scaled by world size: 0.001250272849574685
 Epoch: 1, Step: 24, Rank: 3, loss = 0.3591546416282654
 Epoch: 1, Step: 24, Rank: 6, loss = 0.17865420877933502
 Epoch: 1, Step: 24, Rank: 5, loss = 0.18750180304050446
 Epoch: 1, Step: 24, Rank: 1, loss = 0.30561795830726624
 Epoch: 1, Step: 24, Rank: 4, loss = 0.5699091553688049
 Epoch: 1, Step: 24, Rank: 0, loss = 0.18053050339221954
 Epoch: 1, Step: 24, Rank: 2, loss = 0.08830051869153976
 Per-token loss scaled by world size: 0.0018133769044652581
 Epoch: 1, Step: 24, Rank: 7, loss = 0.12806974351406097
 [2024-07-27 20:04:52,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.9200000000000003e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:52,496] [INFO] [timer.py:258:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=31.65142326624155, CurrSamplesPerSec=31.439924179355366, MemAllocated=21.99GB, MaxMemAllocated=28.3GB
 {
    "epoch": 1,
    "step": 24,
    "rank": 0,
    "loss": 0.18053050339221954,
    "overall_throughput": 31.36277025201868,
    "lr": 1.9200000000000003e-05,
    "cuda_mem_allocated": 21.994752407073975,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 565,
    "batch_size": 16,
    "total_loss": 0.2497173249721527,
    "gradnorm": 2.950968027114868,
    "weight_norm": 393.4583435058594,
    "timestamp": "2024-07-27T20:04:52.547761"
 }
 Epoch 1: 100%|██████████| 12/12 [00:24<00:00,  2.08s/it]
 total tokens: 154 num samples: 2 num padding tokens: 26 - rank: 1 max len: 77 min len: 51 avg len: 64.0 num_loss_counted_tokens: 64
 total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 1 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
 total tokens: 214 num samples: 2 num padding tokens: 37 - rank: 1 max len: 107 min len: 70 avg len: 88.5 num_loss_counted_tokens: 106
 total tokens: 166 num samples: 2 num padding tokens: 22 - rank: 1 max len: 83 min len: 61 avg len: 72.0 num_loss_counted_tokens: 86
 total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 1 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 76
 total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 1 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 58
 total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 1 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 75
 total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 1 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 64
 total tokens: 188 num samples: 2 num padding tokens: 27 - rank: 1 max len: 94 min len: 67 avg len: 80.5 num_loss_counted_tokens: 82
 total tokens: 154 num samples: 2 num padding tokens: 17 - rank: 1 max len: 77 min len: 60 avg len: 68.5 num_loss_counted_tokens: 88
 total tokens: 120 num samples: 2 num padding tokens: 16 - rank: 5 max len: 60 min len: 44 avg len: 52.0 num_loss_counted_tokens: 57
 total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 1 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 62
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 5 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 59
 total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 5 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 54
 total tokens: 194 num samples: 2 num padding tokens: 36 - rank: 5 max len: 97 min len: 61 avg len: 79.0 num_loss_counted_tokens: 99
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
 total tokens: 120 num samples: 2 num padding tokens: 3 - rank: 5 max len: 60 min len: 57 avg len: 58.5 num_loss_counted_tokens: 80
 total tokens: 146 num samples: 2 num padding tokens: 18 - rank: 0 max len: 73 min len: 55 avg len: 64.0 num_loss_counted_tokens: 82
 total tokens: 130 num samples: 2 num padding tokens: 8 - rank: 0 max len: 65 min len: 57 avg len: 61.0 num_loss_counted_tokens: 60
 total tokens: 174 num samples: 2 num padding tokens: 29 - rank: 0 max len: 87 min len: 58 avg len: 72.5 num_loss_counted_tokens: 80

 total tokens: 172 num samples: 2 num padding tokens: 10 - rank: 5 max len: 86 min len: 76 avg len: 81.0 num_loss_counted_tokens: 84
 total tokens: 228 num samples: 2 num padding tokens: 44 - rank: 5 max len: 114 min len: 70 avg len: 92.0 num_loss_counted_tokens: 115
 total tokens: 200 num samples: 2 num padding tokens: 32 - rank: 0 max len: 100 min len: 68 avg len: 84.0 num_loss_counted_tokens: 95
 total tokens: 136 num samples: 2 num padding tokens: 4 - rank: 5 max len: 68 min len: 64 avg len: 66.0 num_loss_counted_tokens: 65
 total tokens: 168 num samples: 2 num padding tokens: 32 - rank: 5 max len: 84 min len: 52 avg len: 68.0 num_loss_counted_tokens: 80
 total tokens: 226 num samples: 2 num padding tokens: 64 - rank: 5 max len: 113 min len: 49 avg len: 81.0 num_loss_counted_tokens: 93
 total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 0 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 60
 total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 0 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 52
 total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 0 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 78
 total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 1 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
 total tokens: 226 num samples: 2 num padding tokens: 55 - rank: 4 max len: 113 min len: 58 avg len: 85.5 num_loss_counted_tokens: 102
 total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 0 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 74
 total tokens: 160 num samples: 2 num padding tokens: 31 - rank: 0 max len: 80 min len: 49 avg len: 64.5 num_loss_counted_tokens: 78
 total tokens: 174 num samples: 2 num padding tokens: 13 - rank: 0 max len: 87 min len: 74 avg len: 80.5 num_loss_counted_tokens: 90
 total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 2 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 62
 total tokens: 188 num samples: 2 num padding tokens: 28 - rank: 4 max len: 94 min len: 66 avg len: 80.0 num_loss_counted_tokens: 99
 total tokens: 180 num samples: 2 num padding tokens: 28 - rank: 2 max len: 90 min len: 62 avg len: 76.0 num_loss_counted_tokens: 93
 total tokens: 120 num samples: 2 num padding tokens: 12 - rank: 5 max len: 60 min len: 48 avg len: 54.0 num_loss_counted_tokens: 58
 total tokens: 110 num samples: 2 num padding tokens: 7 - rank: 2 max len: 55 min len: 48 avg len: 51.5 num_loss_counted_tokens: 53
 total tokens: 142 num samples: 2 num padding tokens: 27 - rank: 6 max len: 71 min len: 44 avg len: 57.5 num_loss_counted_tokens: 66
 total tokens: 208 num samples: 2 num padding tokens: 39 - rank: 2 max len: 104 min len: 65 avg len: 84.5 num_loss_counted_tokens: 111
 total tokens: 166 num samples: 2 num padding tokens: 1 - rank: 5 max len: 83 min len: 82 avg len: 82.5 num_loss_counted_tokens: 96
 total tokens: 168 num samples: 2 num padding tokens: 20 - rank: 6 max len: 84 min len: 64 avg len: 74.0 num_loss_counted_tokens: 100
 total tokens: 166 num samples: 2 num padding tokens: 33 - rank: 6 max len: 83 min len: 50 avg len: 66.5 num_loss_counted_tokens: 86
 total tokens: 134 num samples: 2 num padding tokens: 12 - rank: 2 max len: 67 min len: 55 avg len: 61.0 num_loss_counted_tokens: 67
 total tokens: 160 num samples: 2 num padding tokens: 16 - rank: 6 max len: 80 min len: 64 avg len: 72.0 num_loss_counted_tokens: 72
 total tokens: 180 num samples: 2 num padding tokens: 36 - rank: 6 max len: 90 min len: 54 avg len: 72.0 num_loss_counted_tokens: 95
 total tokens: 184 num samples: 2 num padding tokens: 26 - rank: 6 max len: 92 min len: 66 avg len: 79.0 num_loss_counted_tokens: 96
 total tokens: 150 num samples: 2 num padding tokens: 30 - rank: 2 max len: 75 min len: 45 avg len: 60.0 num_loss_counted_tokens: 65
 total tokens: 154 num samples: 2 num padding tokens: 23 - rank: 2 max len: 77 min len: 54 avg len: 65.5 num_loss_counted_tokens: 74
 total tokens: 152 num samples: 2 num padding tokens: 25 - rank: 2 max len: 76 min len: 51 avg len: 63.5 num_loss_counted_tokens: 70
 total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 2 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 59
 total tokens: 126 num samples: 2 num padding tokens: 9 - rank: 6 max len: 63 min len: 54 avg len: 58.5 num_loss_counted_tokens: 65
 total tokens: 180 num samples: 2 num padding tokens: 19 - rank: 6 max len: 90 min len: 71 avg len: 80.5 num_loss_counted_tokens: 131
 total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 66
 total tokens: 196 num samples: 2 num padding tokens: 29 - rank: 6 max len: 98 min len: 69 avg len: 83.5 num_loss_counted_tokens: 118
 total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 6 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 59
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 2 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 61
 total tokens: 126 num samples: 2 num padding tokens: 10 - rank: 7 max len: 63 min len: 53 avg len: 58.0 num_loss_counted_tokens: 56
 total tokens: 122 num samples: 2 num padding tokens: 16 - rank: 7 max len: 61 min len: 45 avg len: 53.0 num_loss_counted_tokens: 53
 total tokens: 214 num samples: 2 num padding tokens: 55 - rank: 4 max len: 107 min len: 52 avg len: 79.5 num_loss_counted_tokens: 104
 total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 7 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 72
 total tokens: 148 num samples: 2 num padding tokens: 10 - rank: 4 max len: 74 min len: 64 avg len: 69.0 num_loss_counted_tokens: 73
 total tokens: 244 num samples: 2 num padding tokens: 29 - rank: 4 max len: 122 min len: 93 avg len: 107.5 num_loss_counted_tokens: 144
 total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 7 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 62
 total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 53
 total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 3 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 71
 total tokens: 158 num samples: 2 num padding tokens: 7 - rank: 7 max len: 79 min len: 72 avg len: 75.5 num_loss_counted_tokens: 91
 total tokens: 116 num samples: 2 num padding tokens: 15 - rank: 0 max len: 58 min len: 43 avg len: 50.5 num_loss_counted_tokens: 49
 total tokens: 282 num samples: 2 num padding tokens: 60 - rank: 3 max len: 141 min len: 81 avg len: 111.0 num_loss_counted_tokens: 169
 total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 3 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 62
 total tokens: 134 num samples: 2 num padding tokens: 8 - rank: 4 max len: 67 min len: 59 avg len: 63.0 num_loss_counted_tokens: 56
 total tokens: 216 num samples: 2 num padding tokens: 27 - rank: 4 max len: 108 min len: 81 avg len: 94.5 num_loss_counted_tokens: 123
 total tokens: 126 num samples: 2 num padding tokens: 14 - rank: 7 max len: 63 min len: 49 avg len: 56.0 num_loss_counted_tokens: 58
 total tokens: 156 num samples: 2 num padding tokens: 28 - rank: 3 max len: 78 min len: 50 avg len: 64.0 num_loss_counted_tokens: 70
 total tokens: 162 num samples: 2 num padding tokens: 31 - rank: 4 max len: 81 min len: 50 avg len: 65.5 num_loss_counted_tokens: 75
 total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 4 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 68
 total tokens: 164 num samples: 2 num padding tokens: 20 - rank: 7 max len: 82 min len: 62 avg len: 72.0 num_loss_counted_tokens: 79
 total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 2 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 55
 total tokens: 128 num samples: 2 num padding tokens: 19 - rank: 3 max len: 64 min len: 45 avg len: 54.5 num_loss_counted_tokens: 63
 total tokens: 110 num samples: 2 num padding tokens: 7 - rank: 4 max len: 55 min len: 48 avg len: 51.5 num_loss_counted_tokens: 45
 total tokens: 176 num samples: 2 num padding tokens: 18 - rank: 6 max len: 88 min len: 70 avg len: 79.0 num_loss_counted_tokens: 90
 total tokens: 146 num samples: 2 num padding tokens: 21 - rank: 6 max len: 73 min len: 52 avg len: 62.5 num_loss_counted_tokens: 71
 total tokens: 118 num samples: 2 num padding tokens: 6 - rank: 7 max len: 59 min len: 53 avg len: 56.0 num_loss_counted_tokens: 57
 total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 7 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 57
 total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 4 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 54
 total tokens: 142 num samples: 2 num padding tokens: 26 - rank: 7 max len: 71 min len: 45 avg len: 58.0 num_loss_counted_tokens: 67
 total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 4 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 79
 total tokens: 122 num samples: 2 num padding tokens: 9 - rank: 3 max len: 61 min len: 52 avg len: 56.5 num_loss_counted_tokens: 59
 total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 3 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 66
 total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 3 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 76
 total tokens: 122 num samples: 2 num padding tokens: 4 - rank: 3 max len: 61 min len: 57 avg len: 59.0 num_loss_counted_tokens: 64
 total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 3 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 93
 total tokens: 186 num samples: 2 num padding tokens: 42 - rank: 7 max len: 93 min len: 51 avg len: 72.0 num_loss_counted_tokens: 94
 total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 7 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 131
 total tokens: 202 num samples: 2 num padding tokens: 40 - rank: 3 max len: 101 min len: 61 avg len: 81.0 num_loss_counted_tokens: 104
 Per-token loss scaled by world size: 0.000771758146584034
 Per-token loss scaled by world size: 0.0032502268441021442
 Per-token loss scaled by world size: 0.001562815043143928
 Per-token loss scaled by world size: 0.004182006698101759
 Per-token loss scaled by world size: 0.0015922324964776635
 Per-token loss scaled by world size: 0.0030361246317625046
 Per-token loss scaled by world size: 0.0017774869920685887
 Epoch: 2, Step: 25, Rank: 3, loss = 0.05045368894934654
 Epoch: 2, Step: 25, Rank: 1, loss = 0.2733986973762512
 Epoch: 2, Step: 25, Rank: 4, loss = 0.21248358488082886
 Epoch: 2, Step: 25, Rank: 2, loss = 0.10216903686523438
 Epoch: 2, Step: 25, Rank: 0, loss = 0.19848664104938507
 Epoch: 2, Step: 25, Rank: 7, loss = 0.10409220308065414
 Epoch: 2, Step: 25, Rank: 5, loss = 0.11620321124792099
 Per-token loss scaled by world size: 0.0023637553676962852
 Epoch: 2, Step: 25, Rank: 6, loss = 0.15453051030635834
 [2024-07-27 20:04:53,438] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[2e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:53,514] [INFO] [timer.py:258:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=31.498630726223958, CurrSamplesPerSec=28.474580988875164, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:   8%|▊         | 1/12 [00:00<00:10,  1.09it/s]{
    "epoch": 2,
    "step": 25,
    "rank": 0,
    "loss": 0.19848664104938507,
    "overall_throughput": 28.3802930607719,
    "lr": 2e-05,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 523,
    "batch_size": 16,
    "total_loss": 0.15147720277309418,
    "gradnorm": 2.086557149887085,
    "weight_norm": 393.458740234375,
    "timestamp": "2024-07-27T20:04:53.559486"
 }
 Per-token loss scaled by world size: 0.004568464122712612
 Per-token loss scaled by world size: 0.000755587185267359
 Per-token loss scaled by world size: 0.0012551499530673027
 Per-token loss scaled by world size: 0.0036310378927737474
 Per-token loss scaled by world size: 0.0022255314979702234
 Per-token loss scaled by world size: 0.003098478075116873
 Per-token loss scaled by world size: 0.003910013008862734
 Epoch: 2, Step: 26, Rank: 1, loss = 0.057235728949308395
 Epoch: 2, Step: 26, Rank: 3, loss = 0.09507761150598526
 Epoch: 2, Step: 26, Rank: 0, loss = 0.3460611402988434
 Epoch: 2, Step: 26, Rank: 6, loss = 0.2750511169433594
 Epoch: 2, Step: 26, Rank: 2, loss = 0.2347097098827362
 Epoch: 2, Step: 26, Rank: 5, loss = 0.16858400404453278
 Epoch: 2, Step: 26, Rank: 4, loss = 0.2961834967136383
 Per-token loss scaled by world size: 0.000992890098132193
 Epoch: 2, Step: 26, Rank: 7, loss = 0.07521142810583115
 [2024-07-27 20:04:53,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.999453257340926e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:54,074] [INFO] [timer.py:258:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=31.501866831712768, CurrSamplesPerSec=31.576481216592637, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  17%|█▋        | 2/12 [00:01<00:07,  1.42it/s]{
    "epoch": 2,
    "step": 26,
    "rank": 0,
    "loss": 0.3460611402988434,
    "overall_throughput": 31.521633384656862,
    "lr": 1.999453257340926e-05,
    "cuda_mem_allocated": 22.0040545463562,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 606,
    "batch_size": 16,
    "total_loss": 0.19351428747177124,
    "gradnorm": 2.7967944145202637,
    "weight_norm": 393.45916748046875,
    "timestamp": "2024-07-27T20:04:54.115593"
 }
 Per-token loss scaled by world size: 0.001778947887942195
 Per-token loss scaled by world size: 0.0023961372207850218
 Per-token loss scaled by world size: 0.0019206402357667685
 Per-token loss scaled by world size: 0.0016144964611157775
 Per-token loss scaled by world size: 0.0014130653580650687
 Per-token loss scaled by world size: 0.0023006140254437923
 Per-token loss scaled by world size: 0.0029887459240853786
 Epoch: 2, Step: 27, Rank: 3, loss = 0.132994145154953
 Epoch: 2, Step: 27, Rank: 2, loss = 0.15821273624897003
 Epoch: 2, Step: 27, Rank: 5, loss = 0.19738179445266724
 Epoch: 2, Step: 27, Rank: 6, loss = 0.11640125513076782
 Epoch: 2, Step: 27, Rank: 1, loss = 0.1465408354997635
 Epoch: 2, Step: 27, Rank: 4, loss = 0.24619793891906738
 Epoch: 2, Step: 27, Rank: 0, loss = 0.18951308727264404
 Per-token loss scaled by world size: 0.0023933870252221823
 Epoch: 2, Step: 27, Rank: 7, loss = 0.19715525209903717
 [2024-07-27 20:04:54,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.9978136272187745e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:54,634] [INFO] [timer.py:258:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=31.47828132560415, CurrSamplesPerSec=30.922637265012085, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  25%|██▌       | 3/12 [00:02<00:05,  1.56it/s]{
    "epoch": 2,
    "step": 27,
    "rank": 0,
    "loss": 0.18951308727264404,
    "overall_throughput": 30.84563042714866,
    "lr": 1.9978136272187745e-05,
    "cuda_mem_allocated": 22.00071620941162,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 659,
    "batch_size": 16,
    "total_loss": 0.17304962873458862,
    "gradnorm": 2.572604179382324,
    "weight_norm": 393.4596252441406,
    "timestamp": "2024-07-27T20:04:54.678862"
 }
 Per-token loss scaled by world size: 0.0017004203982651234
 Per-token loss scaled by world size: 0.003374457824975252
 Per-token loss scaled by world size: 0.00338700320571661
 Per-token loss scaled by world size: 0.0010560491355136037
 Per-token loss scaled by world size: 0.0003863103629555553
 Per-token loss scaled by world size: 0.0015272889286279678
 Per-token loss scaled by world size: 0.00571776507422328
 Epoch: 2, Step: 28, Rank: 5, loss = 0.32557567954063416
 Epoch: 2, Step: 28, Rank: 6, loss = 0.10151272267103195
 Epoch: 2, Step: 28, Rank: 0, loss = 0.1634529083967209
 Epoch: 2, Step: 28, Rank: 4, loss = 0.32436975836753845
 Epoch: 2, Step: 28, Rank: 2, loss = 0.1468106508255005
 Epoch: 2, Step: 28, Rank: 3, loss = 0.5496201515197754
 Epoch: 2, Step: 28, Rank: 1, loss = 0.0371340848505497
 Per-token loss scaled by world size: 0.00041141020483337343
 Epoch: 2, Step: 28, Rank: 7, loss = 0.03954680636525154
 [2024-07-27 20:04:55,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.9950829025450116e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:55,174] [INFO] [timer.py:258:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=31.512036329377096, CurrSamplesPerSec=32.38008718791239, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  33%|███▎      | 4/12 [00:02<00:04,  1.67it/s]{
    "epoch": 2,
    "step": 28,
    "rank": 0,
    "loss": 0.1634529083967209,
    "overall_throughput": 32.299561727331444,
    "lr": 1.9950829025450116e-05,
    "cuda_mem_allocated": 21.99880838394165,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 769,
    "batch_size": 16,
    "total_loss": 0.21100284159183502,
    "gradnorm": 2.3949108123779297,
    "weight_norm": 393.4600830078125,
    "timestamp": "2024-07-27T20:04:55.217257"
 }
 Per-token loss scaled by world size: 0.0007814434356987476
 Per-token loss scaled by world size: 0.00027669736300595105
 Per-token loss scaled by world size: 0.0012405101442709565
 Per-token loss scaled by world size: 0.0030604854691773653
 Per-token loss scaled by world size: 0.0021558511070907116
 Per-token loss scaled by world size: 0.0016599269583821297
 Per-token loss scaled by world size: 0.0017815420869737864
 Epoch: 2, Step: 29, Rank: 3, loss = 0.023380927741527557
 Epoch: 2, Step: 29, Rank: 6, loss = 0.2586110234260559
 Epoch: 2, Step: 29, Rank: 1, loss = 0.10482310503721237
 Epoch: 2, Step: 29, Rank: 2, loss = 0.18216942250728607
 Epoch: 2, Step: 29, Rank: 4, loss = 0.06603197008371353
 Epoch: 2, Step: 29, Rank: 0, loss = 0.14026382565498352
 Epoch: 2, Step: 29, Rank: 5, loss = 0.1505403071641922
 Per-token loss scaled by world size: 0.004114873707294464
 Epoch: 2, Step: 29, Rank: 7, loss = 0.3477068245410919
 [2024-07-27 20:04:55,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.9912640693269754e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:55,714] [INFO] [timer.py:258:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=31.539610429600238, CurrSamplesPerSec=32.27386940956719, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 2:  42%|████▏     | 5/12 [00:03<00:04,  1.73it/s]{
    "epoch": 2,
    "step": 29,
    "rank": 0,
    "loss": 0.14026382565498352,
    "overall_throughput": 32.193407411564976,
    "lr": 1.9912640693269754e-05,
    "cuda_mem_allocated": 22.007156372070312,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 676,
    "batch_size": 16,
    "total_loss": 0.15919092297554016,
    "gradnorm": 2.4766182899475098,
    "weight_norm": 393.4605407714844,
    "timestamp": "2024-07-27T20:04:55.749943"
 }
 Per-token loss scaled by world size: 0.0029893971513956785
 Per-token loss scaled by world size: 0.005299379117786884
 Per-token loss scaled by world size: 0.0023671372327953577
 Per-token loss scaled by world size: 0.004149050917476416
 Per-token loss scaled by world size: 0.008750627748668194
 Per-token loss scaled by world size: 0.006499007809907198
 Epoch: 2, Step: 30, Rank: 0, loss = 0.1615571230649948
 Epoch: 2, Step: 30, Rank: 3, loss = 0.36168262362480164
 Epoch: 2, Step: 30, Rank: 6, loss = 0.20402635633945465
 Epoch: 2, Step: 30, Rank: 5, loss = 0.5972303748130798
 Per-token loss scaled by world size: 0.0007520572980865836
 Epoch: 2, Step: 30, Rank: 4, loss = 0.28317272663116455
 Epoch: 2, Step: 30, Rank: 7, loss = 0.4435572922229767
 Epoch: 2, Step: 30, Rank: 1, loss = 0.0513279102742672
 Per-token loss scaled by world size: 0.0032296415884047747
 Epoch: 2, Step: 30, Rank: 2, loss = 0.22042304277420044
 [2024-07-27 20:04:56,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.9863613034027224e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:56,256] [INFO] [timer.py:258:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=31.5444892468786, CurrSamplesPerSec=31.676790257487433, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  50%|█████     | 6/12 [00:03<00:03,  1.77it/s]{
    "epoch": 2,
    "step": 30,
    "rank": 0,
    "loss": 0.1615571230649948,
    "overall_throughput": 31.593056094983204,
    "lr": 1.9863613034027224e-05,
    "cuda_mem_allocated": 21.999285221099854,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 546,
    "batch_size": 16,
    "total_loss": 0.29037219285964966,
    "gradnorm": 6.375105857849121,
    "weight_norm": 393.4609375,
    "timestamp": "2024-07-27T20:04:56.301457"
 }
 Per-token loss scaled by world size: 0.002896310528740287
 Per-token loss scaled by world size: 0.0031822924502193928
 Per-token loss scaled by world size: 0.0018208534456789494
 Per-token loss scaled by world size: 0.0022670035250484943
 Per-token loss scaled by world size: 0.008491733111441135
 Per-token loss scaled by world size: 0.003121417947113514
 Epoch: 2, Step: 31, Rank: 6, loss = 0.2474232316017151
 Epoch: 2, Step: 31, Rank: 2, loss = 0.14157135784626007
 Per-token loss scaled by world size: 0.0018668599659577012
 Epoch: 2, Step: 31, Rank: 1, loss = 0.22518813610076904
 Epoch: 2, Step: 31, Rank: 0, loss = 0.17625951766967773
 Epoch: 2, Step: 31, Rank: 7, loss = 0.6602322459220886
 Epoch: 2, Step: 31, Rank: 4, loss = 0.24269025027751923
 Epoch: 2, Step: 31, Rank: 5, loss = 0.145148366689682
 Per-token loss scaled by world size: 0.0017641705926507711
 Epoch: 2, Step: 31, Rank: 3, loss = 0.13716426491737366
 [2024-07-27 20:04:56,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.9803799658748096e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:56,801] [INFO] [timer.py:258:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=31.57118154258231, CurrSamplesPerSec=32.3373511160454, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  58%|█████▊    | 7/12 [00:04<00:02,  1.79it/s]{
    "epoch": 2,
    "step": 31,
    "rank": 0,
    "loss": 0.17625951766967773,
    "overall_throughput": 32.279737447919125,
    "lr": 1.9803799658748096e-05,
    "cuda_mem_allocated": 22.0040545463562,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 622,
    "batch_size": 16,
    "total_loss": 0.24695967137813568,
    "gradnorm": 4.441169261932373,
    "weight_norm": 393.4613952636719,
    "timestamp": "2024-07-27T20:04:56.844460"
 }
 Per-token loss scaled by world size: 0.0011196950217708945
 Per-token loss scaled by world size: 0.000792959937825799
 Per-token loss scaled by world size: 0.0029141369741410017
 Per-token loss scaled by world size: 0.004119256976991892
 Per-token loss scaled by world size: 0.0033213391434401274
 Per-token loss scaled by world size: 0.004044802393764257
 Per-token loss scaled by world size: 0.0030391488689929247
 Epoch: 2, Step: 32, Rank: 5, loss = 0.2269384115934372
 Epoch: 2, Step: 32, Rank: 2, loss = 0.06175175681710243
 Epoch: 2, Step: 32, Rank: 0, loss = 0.08719625324010849
 Epoch: 2, Step: 32, Rank: 3, loss = 0.2586492896080017
 Epoch: 2, Step: 32, Rank: 6, loss = 0.32078713178634644
 Epoch: 2, Step: 32, Rank: 4, loss = 0.23667371273040771
 Epoch: 2, Step: 32, Rank: 7, loss = 0.31498900055885315
 Per-token loss scaled by world size: 0.0011721396585926414
 Epoch: 2, Step: 32, Rank: 1, loss = 0.09128037840127945
 [2024-07-27 20:04:57,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.973326597248006e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:57,345] [INFO] [timer.py:258:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=31.5894998258649, CurrSamplesPerSec=32.130135235160566, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  67%|██████▋   | 8/12 [00:04<00:02,  1.80it/s]{
    "epoch": 2,
    "step": 32,
    "rank": 0,
    "loss": 0.08719625324010849,
    "overall_throughput": 32.07842353700362,
    "lr": 1.973326597248006e-05,
    "cuda_mem_allocated": 21.997137546539307,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 623,
    "batch_size": 16,
    "total_loss": 0.19978323578834534,
    "gradnorm": 2.9166290760040283,
    "weight_norm": 393.46185302734375,
    "timestamp": "2024-07-27T20:04:57.393136"
 }
 Per-token loss scaled by world size: 0.0031772709917277098
 Per-token loss scaled by world size: 0.002447927137836814
 Per-token loss scaled by world size: 0.0009199947817251086
 Per-token loss scaled by world size: 0.004487441387027502
 Per-token loss scaled by world size: 0.0024654706940054893
 Per-token loss scaled by world size: 0.00025754657690413296
 Per-token loss scaled by world size: 0.004304614849388599
 Epoch: 2, Step: 33, Rank: 4, loss = 0.06888461112976074
 Epoch: 2, Step: 33, Rank: 0, loss = 0.23789817094802856
 Epoch: 2, Step: 33, Rank: 5, loss = 0.33599716424942017
 Epoch: 2, Step: 33, Rank: 1, loss = 0.1832885444164276
 Epoch: 2, Step: 33, Rank: 2, loss = 0.18460211157798767
 Epoch: 2, Step: 33, Rank: 3, loss = 0.019283799454569817
 Epoch: 2, Step: 33, Rank: 6, loss = 0.3223080337047577
 Per-token loss scaled by world size: 0.0012406171299517155
 Epoch: 2, Step: 33, Rank: 7, loss = 0.09289120882749557
 [2024-07-27 20:04:57,818] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.9652089102773487e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:04:57,896] [INFO] [timer.py:258:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=31.59934389707566, CurrSamplesPerSec=31.897545876966834, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 528
 {
    "epoch": 2,
    "step": 33,
    "rank": 0,
    "loss": 0.23789817094802856,
    "overall_throughput": 31.819460032023883,
    "lr": 1.9652089102773487e-05,
    "cuda_mem_allocated": 22.002385139465332,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 599,
    "batch_size": 16,
    "total_loss": 0.1806442141532898,
    "gradnorm": 2.8334124088287354,
    "weight_norm": 393.4622802734375,
    "timestamp": "2024-07-27T20:04:57.899380"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_528
 [20:05:15] INFO     saving took 17.999591588974 seconds                                                                                                                                                                           utils.py:611
 Epoch 2:  75%|███████▌  | 9/12 [00:23<00:18,  6.18s/it]
 Per-token loss scaled by world size: 0.0037983357906341553
 Per-token loss scaled by world size: 0.004671666771173477
 Per-token loss scaled by world size: 0.001915755565278232
 Per-token loss scaled by world size: 0.001806297223083675
 Per-token loss scaled by world size: 0.00687358109280467
 Per-token loss scaled by world size: 0.0015339453238993883
 Per-token loss scaled by world size: 0.0011989163467660546
 Epoch: 2, Step: 34, Rank: 0, loss = 0.33425354957580566
 Epoch: 2, Step: 34, Rank: 4, loss = 0.411106675863266
 Epoch: 2, Step: 34, Rank: 1, loss = 0.16858649253845215
 Epoch: 2, Step: 34, Rank: 6, loss = 0.6048751473426819
 Epoch: 2, Step: 34, Rank: 2, loss = 0.15895415842533112
 Epoch: 2, Step: 34, Rank: 5, loss = 0.10550463944673538
 Epoch: 2, Step: 34, Rank: 3, loss = 0.13498719036579132
 Per-token loss scaled by world size: 0.0012863841839134693
 Epoch: 2, Step: 34, Rank: 7, loss = 0.113201804459095
 [2024-07-27 20:05:16,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.9560357815343577e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:16,461] [INFO] [timer.py:258:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=31.575999746037336, CurrSamplesPerSec=30.869055674257183, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  83%|████████▎ | 10/12 [00:23<00:08,  4.45s/it]{
    "epoch": 2,
    "step": 34,
    "rank": 0,
    "loss": 0.33425354957580566,
    "overall_throughput": 30.809901389200697,
    "lr": 1.9560357815343577e-05,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 704,
    "batch_size": 16,
    "total_loss": 0.25393369793891907,
    "gradnorm": 3.9217264652252197,
    "weight_norm": 393.4627685546875,
    "timestamp": "2024-07-27T20:05:16.505094"
 }
 Per-token loss scaled by world size: 0.0034846733324229717
 Per-token loss scaled by world size: 0.0033776769414544106
 Per-token loss scaled by world size: 0.0047375233843922615
 Per-token loss scaled by world size: 0.002200631657615304
 Per-token loss scaled by world size: 0.0065323468297719955
 Per-token loss scaled by world size: 0.000672308262437582
 Per-token loss scaled by world size: 0.0031945251394063234
 Epoch: 2, Step: 35, Rank: 3, loss = 0.22757098078727722
 Epoch: 2, Step: 35, Rank: 1, loss = 0.14826755225658417
 Epoch: 2, Step: 35, Rank: 4, loss = 0.31919065117836
 Epoch: 2, Step: 35, Rank: 5, loss = 0.44011685252189636
 Epoch: 2, Step: 35, Rank: 6, loss = 0.21523113548755646
 Epoch: 2, Step: 35, Rank: 2, loss = 0.045296769589185715
 Epoch: 2, Step: 35, Rank: 0, loss = 0.23477986454963684
 Per-token loss scaled by world size: 0.0005250901449471712
 Epoch: 2, Step: 35, Rank: 7, loss = 0.035377949476242065
 [2024-07-27 20:05:16,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.9458172417006347e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:17,013] [INFO] [timer.py:258:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=31.574778541658905, CurrSamplesPerSec=31.53574981496928, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2:  92%|█████████▏| 11/12 [00:24<00:03,  3.25s/it]{
    "epoch": 2,
    "step": 35,
    "rank": 0,
    "loss": 0.23477986454963684,
    "overall_throughput": 31.455751447078118,
    "lr": 1.9458172417006347e-05,
    "cuda_mem_allocated": 22.0038161277771,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 539,
    "batch_size": 16,
    "total_loss": 0.2082289606332779,
    "gradnorm": 3.2071847915649414,
    "weight_norm": 393.4632263183594,
    "timestamp": "2024-07-27T20:05:17.057249"
 }
 Per-token loss scaled by world size: 0.005230115260928869
 Per-token loss scaled by world size: 0.0020361002534627914
 Per-token loss scaled by world size: 0.0016461815685033798
 Per-token loss scaled by world size: 0.0032435867469757795
 Per-token loss scaled by world size: 0.0013154539046809077
 Per-token loss scaled by world size: 0.007001029327511787
 Per-token loss scaled by world size: 0.0026217142585664988
 Epoch: 2, Step: 36, Rank: 3, loss = 0.2177257537841797
 Epoch: 2, Step: 36, Rank: 5, loss = 0.1366732269525528
 Epoch: 2, Step: 36, Rank: 0, loss = 0.08829984068870544
 Epoch: 2, Step: 36, Rank: 6, loss = 0.35107147693634033
 Epoch: 2, Step: 36, Rank: 2, loss = 0.11049994081258774
 Epoch: 2, Step: 36, Rank: 1, loss = 0.4699440896511078
 Epoch: 2, Step: 36, Rank: 7, loss = 0.17598256468772888
 Per-token loss scaled by world size: 0.00114376877900213
 Epoch: 2, Step: 36, Rank: 4, loss = 0.07677547633647919
 [2024-07-27 20:05:17,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.934564464599461e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:17,568] [INFO] [timer.py:258:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=31.570395158941494, CurrSamplesPerSec=31.426423180739413, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 2: 100%|██████████| 12/12 [00:24<00:00,  2.43s/it]{
    "epoch": 2,
    "step": 36,
    "rank": 0,
    "loss": 0.08829984068870544,
    "overall_throughput": 31.342732108746315,
    "lr": 1.934564464599461e-05,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 537,
    "batch_size": 16,
    "total_loss": 0.2033715397119522,
    "gradnorm": 3.2520809173583984,
    "weight_norm": 393.4635925292969,
    "timestamp": "2024-07-27T20:05:17.613214"
 }
 Epoch 2: 100%|██████████| 12/12 [00:25<00:00,  2.09s/it]
 total tokens: 118 num samples: 2 num padding tokens: 8 - rank: 2 max len: 59 min len: 51 avg len: 55.0 num_loss_counted_tokens: 60
 total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 2 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 64
 total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 2 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 70
 total tokens: 180 num samples: 2 num padding tokens: 33 - rank: 2 max len: 90 min len: 57 avg len: 73.5 num_loss_counted_tokens: 99
 total tokens: 214 num samples: 2 num padding tokens: 40 - rank: 2 max len: 107 min len: 67 avg len: 87.0 num_loss_counted_tokens: 103
 total tokens: 126 num samples: 2 num padding tokens: 17 - rank: 2 max len: 63 min len: 46 avg len: 54.5 num_loss_counted_tokens: 54
 total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 2 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 72
 total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 2 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 59
 total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 72
 total tokens: 154 num samples: 2 num padding tokens: 22 - rank: 2 max len: 77 min len: 55 avg len: 66.0 num_loss_counted_tokens: 75
 total tokens: 128 num samples: 2 num padding tokens: 21 - rank: 2 max len: 64 min len: 43 avg len: 53.5 num_loss_counted_tokens: 49
 total tokens: 126 num samples: 2 num padding tokens: 11 - rank: 2 max len: 63 min len: 52 avg len: 57.5 num_loss_counted_tokens: 55
 total tokens: 152 num samples: 2 num padding tokens: 17 - rank: 4 max len: 76 min len: 59 avg len: 67.5 num_loss_counted_tokens: 71
 total tokens: 152 num samples: 2 num padding tokens: 7 - rank: 4 max len: 76 min len: 69 avg len: 72.5 num_loss_counted_tokens: 84
 total tokens: 146 num samples: 2 num padding tokens: 21 - rank: 7 max len: 73 min len: 52 avg len: 62.5 num_loss_counted_tokens: 71
 total tokens: 184 num samples: 2 num padding tokens: 32 - rank: 5 max len: 92 min len: 60 avg len: 76.0 num_loss_counted_tokens: 89
 total tokens: 168 num samples: 2 num padding tokens: 33 - rank: 7 max len: 84 min len: 51 avg len: 67.5 num_loss_counted_tokens: 85
 total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 4 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 58
 total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 7 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 83
 total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 5 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 73
 total tokens: 138 num samples: 2 num padding tokens: 12 - rank: 5 max len: 69 min len: 57 avg len: 63.0 num_loss_counted_tokens: 53
 total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 4 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 54
 total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 5 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 79
 total tokens: 152 num samples: 2 num padding tokens: 9 - rank: 5 max len: 76 min len: 67 avg len: 71.5 num_loss_counted_tokens: 81
 total tokens: 138 num samples: 2 num padding tokens: 14 - rank: 5 max len: 69 min len: 55 avg len: 62.0 num_loss_counted_tokens: 78
 total tokens: 96 num samples: 2 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 46
 total tokens: 174 num samples: 2 num padding tokens: 27 - rank: 4 max len: 87 min len: 60 avg len: 73.5 num_loss_counted_tokens: 85
 total tokens: 136 num samples: 2 num padding tokens: 4 - rank: 7 max len: 68 min len: 64 avg len: 66.0 num_loss_counted_tokens: 65
 total tokens: 148 num samples: 2 num padding tokens: 2 - rank: 7 max len: 74 min len: 72 avg len: 73.0 num_loss_counted_tokens: 75
 total tokens: 186 num samples: 2 num padding tokens: 43 - rank: 7 max len: 93 min len: 50 avg len: 71.5 num_loss_counted_tokens: 120
 total tokens: 188 num samples: 2 num padding tokens: 23 - rank: 7 max len: 94 min len: 71 avg len: 82.5 num_loss_counted_tokens: 97
 total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 5 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 71
 total tokens: 140 num samples: 2 num padding tokens: 24 - rank: 3 max len: 70 min len: 46 avg len: 58.0 num_loss_counted_tokens: 67
 total tokens: 166 num samples: 2 num padding tokens: 34 - rank: 7 max len: 83 min len: 49 avg len: 66.0 num_loss_counted_tokens: 74
 total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 7 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 52
 total tokens: 142 num samples: 2 num padding tokens: 22 - rank: 7 max len: 71 min len: 49 avg len: 60.0 num_loss_counted_tokens: 70
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 69
 total tokens: 202 num samples: 2 num padding tokens: 46 - rank: 7 max len: 101 min len: 55 avg len: 78.0 num_loss_counted_tokens: 96
 total tokens: 180 num samples: 2 num padding tokens: 45 - rank: 3 max len: 90 min len: 45 avg len: 67.5 num_loss_counted_tokens: 86
 total tokens: 172 num samples: 2 num padding tokens: 23 - rank: 3 max len: 86 min len: 63 avg len: 74.5 num_loss_counted_tokens: 81
 total tokens: 186 num samples: 2 num padding tokens: 13 - rank: 3 max len: 93 min len: 80 avg len: 86.5 num_loss_counted_tokens: 122
 total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 6 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 59
 total tokens: 132 num samples: 2 num padding tokens: 21 - rank: 6 max len: 66 min len: 45 avg len: 55.5 num_loss_counted_tokens: 56
 total tokens: 134 num samples: 2 num padding tokens: 23 - rank: 6 max len: 67 min len: 44 avg len: 55.5 num_loss_counted_tokens: 47
 total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 3 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 59
 total tokens: 128 num samples: 2 num padding tokens: 11 - rank: 6 max len: 64 min len: 53 avg len: 58.5 num_loss_counted_tokens: 59
 total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 6 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 81
 total tokens: 106 num samples: 2 num padding tokens: 4 - rank: 4 max len: 53 min len: 49 avg len: 51.0 num_loss_counted_tokens: 57
 total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 3 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 67
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 4 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 62
 total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 4 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 73
 total tokens: 282 num samples: 2 num padding tokens: 86 - rank: 4 max len: 141 min len: 55 avg len: 98.0 num_loss_counted_tokens: 139
 total tokens: 174 num samples: 2 num padding tokens: 25 - rank: 5 max len: 87 min len: 62 avg len: 74.5 num_loss_counted_tokens: 74
 total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 4 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 70
 total tokens: 208 num samples: 2 num padding tokens: 44 - rank: 4 max len: 104 min len: 60 avg len: 82.0 num_loss_counted_tokens: 109
 total tokens: 172 num samples: 2 num padding tokens: 29 - rank: 6 max len: 86 min len: 57 avg len: 71.5 num_loss_counted_tokens: 54
 total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 6 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 98
 total tokens: 188 num samples: 2 num padding tokens: 39 - rank: 6 max len: 94 min len: 55 avg len: 74.5 num_loss_counted_tokens: 86
 total tokens: 148 num samples: 2 num padding tokens: 23 - rank: 6 max len: 74 min len: 51 avg len: 62.5 num_loss_counted_tokens: 62
 total tokens: 172 num samples: 2 num padding tokens: 21 - rank: 6 max len: 86 min len: 65 avg len: 75.5 num_loss_counted_tokens: 71
 total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 5 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 65
 total tokens: 110 num samples: 2 num padding tokens: 9 - rank: 5 max len: 55 min len: 46 avg len: 50.5 num_loss_counted_tokens: 53
 total tokens: 164 num samples: 2 num padding tokens: 24 - rank: 3 max len: 82 min len: 58 avg len: 70.0 num_loss_counted_tokens: 96
 total tokens: 118 num samples: 2 num padding tokens: 15 - rank: 3 max len: 59 min len: 44 avg len: 51.5 num_loss_counted_tokens: 51
 total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 5 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 65
 total tokens: 130 num samples: 2 num padding tokens: 14 - rank: 3 max len: 65 min len: 51 avg len: 58.0 num_loss_counted_tokens: 61
 total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 1 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 60
 total tokens: 228 num samples: 2 num padding tokens: 34 - rank: 0 max len: 114 min len: 80 avg len: 97.0 num_loss_counted_tokens: 135
 total tokens: 164 num samples: 2 num padding tokens: 1 - rank: 3 max len: 82 min len: 81 avg len: 81.5 num_loss_counted_tokens: 105
 total tokens: 216 num samples: 2 num padding tokens: 47 - rank: 4 max len: 108 min len: 61 avg len: 84.5 num_loss_counted_tokens: 109
 total tokens: 194 num samples: 2 num padding tokens: 19 - rank: 6 max len: 97 min len: 78 avg len: 87.5 num_loss_counted_tokens: 114
 total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 0 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 62
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 1 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 62
 total tokens: 180 num samples: 2 num padding tokens: 3 - rank: 1 max len: 90 min len: 87 avg len: 88.5 num_loss_counted_tokens: 136
 total tokens: 244 num samples: 2 num padding tokens: 70 - rank: 1 max len: 122 min len: 52 avg len: 87.0 num_loss_counted_tokens: 125
 total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 1 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
 total tokens: 106 num samples: 2 num padding tokens: 9 - rank: 1 max len: 53 min len: 44 avg len: 48.5 num_loss_counted_tokens: 47
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 1 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 158 num samples: 2 num padding tokens: 25 - rank: 1 max len: 79 min len: 54 avg len: 66.5 num_loss_counted_tokens: 74
 total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 0 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 101
 total tokens: 140 num samples: 2 num padding tokens: 3 - rank: 7 max len: 70 min len: 67 avg len: 68.5 num_loss_counted_tokens: 81
 total tokens: 196 num samples: 2 num padding tokens: 53 - rank: 1 max len: 98 min len: 45 avg len: 71.5 num_loss_counted_tokens: 94
 total tokens: 228 num samples: 2 num padding tokens: 60 - rank: 1 max len: 114 min len: 54 avg len: 84.0 num_loss_counted_tokens: 122
 total tokens: 114 num samples: 2 num padding tokens: 5 - rank: 6 max len: 57 min len: 52 avg len: 54.5 num_loss_counted_tokens: 53
 total tokens: 226 num samples: 2 num padding tokens: 6 - rank: 0 max len: 113 min len: 107 avg len: 110.0 num_loss_counted_tokens: 142
 total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 0 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 85
 total tokens: 158 num samples: 2 num padding tokens: 29 - rank: 1 max len: 79 min len: 50 avg len: 64.5 num_loss_counted_tokens: 69
 total tokens: 154 num samples: 2 num padding tokens: 14 - rank: 0 max len: 77 min len: 63 avg len: 70.0 num_loss_counted_tokens: 86
 total tokens: 150 num samples: 2 num padding tokens: 21 - rank: 0 max len: 75 min len: 54 avg len: 64.5 num_loss_counted_tokens: 75
 total tokens: 122 num samples: 2 num padding tokens: 1 - rank: 0 max len: 61 min len: 60 avg len: 60.5 num_loss_counted_tokens: 59
 total tokens: 200 num samples: 2 num padding tokens: 52 - rank: 0 max len: 100 min len: 48 avg len: 74.0 num_loss_counted_tokens: 95
 total tokens: 168 num samples: 2 num padding tokens: 32 - rank: 0 max len: 84 min len: 52 avg len: 68.0 num_loss_counted_tokens: 80
 total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 0 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 84
 total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 3 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 60
 total tokens: 176 num samples: 2 num padding tokens: 27 - rank: 1 max len: 88 min len: 61 avg len: 74.5 num_loss_counted_tokens: 83
 total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 0 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 57
 Per-token loss scaled by world size: 0.0011273464187979698
 Per-token loss scaled by world size: 0.011045603081583977
 Per-token loss scaled by world size: 0.0008964258013293147
 Per-token loss scaled by world size: 0.005275155883282423
 Per-token loss scaled by world size: 0.0004854030557908118
 Per-token loss scaled by world size: 0.001362472539767623
 Per-token loss scaled by world size: 0.0016097185434773564
 Epoch: 3, Step: 37, Rank: 5, loss = 0.08314179629087448
 Epoch: 3, Step: 37, Rank: 3, loss = 0.06611140072345734
 Epoch: 3, Step: 37, Rank: 0, loss = 0.8146132230758667
 Epoch: 3, Step: 37, Rank: 7, loss = 0.3890427350997925
 Epoch: 3, Step: 37, Rank: 6, loss = 0.03579847514629364
 Epoch: 3, Step: 37, Rank: 4, loss = 0.10048235207796097
 Epoch: 3, Step: 37, Rank: 2, loss = 0.11871673911809921
 Per-token loss scaled by world size: 0.0010947352275252342
 Epoch: 3, Step: 37, Rank: 1, loss = 0.08073671907186508
 [2024-07-27 20:05:18,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.922289754977385e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:18,600] [INFO] [timer.py:258:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=31.510911148754047, CurrSamplesPerSec=29.613797942311468, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 37,
    "rank": 0,
    "loss": 0.8146132230758667,
    "overall_throughput": 29.510024167760136,
    "lr": 1.922289754977385e-05,
    "cuda_mem_allocated": 22.01049518585205,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 590,
    "batch_size": 16,
    "total_loss": 0.21108044683933258,
    "gradnorm": 3.543294906616211,
    "weight_norm": 393.4640197753906,
    "timestamp": "2024-07-27T20:05:18.643289"
 }
 Per-token loss scaled by world size: 0.0009516195859760046
 Per-token loss scaled by world size: 0.005429701413959265
 Per-token loss scaled by world size: 0.000703756813891232
 Per-token loss scaled by world size: 0.0006457434501498938
 Per-token loss scaled by world size: 0.0020593185909092426
 Per-token loss scaled by world size: 0.0010209670290350914
 Per-token loss scaled by world size: 0.0010425481013953686
 Epoch: 3, Step: 38, Rank: 1, loss = 0.4642394781112671
 Epoch: 3, Step: 38, Rank: 6, loss = 0.060171205550432205
 Epoch: 3, Step: 38, Rank: 5, loss = 0.05521106347441673
 Epoch: 3, Step: 38, Rank: 7, loss = 0.17607174813747406
 Epoch: 3, Step: 38, Rank: 4, loss = 0.08729267865419388
 Epoch: 3, Step: 38, Rank: 2, loss = 0.08913786709308624
 Epoch: 3, Step: 38, Rank: 3, loss = 0.08136347681283951
 Per-token loss scaled by world size: 0.0009524038759991527
 Epoch: 3, Step: 38, Rank: 0, loss = 0.08143053203821182
 [2024-07-27 20:05:19,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.909006535049163e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:19,141] [INFO] [timer.py:258:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=31.53368842638859, CurrSamplesPerSec=32.35217658966323, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 38,
    "rank": 0,
    "loss": 0.08143053203821182,
    "overall_throughput": 32.29991928493285,
    "lr": 1.909006535049163e-05,
    "cuda_mem_allocated": 21.99785280227661,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 684,
    "batch_size": 16,
    "total_loss": 0.13686475157737732,
    "gradnorm": 7.216549396514893,
    "weight_norm": 393.4645080566406,
    "timestamp": "2024-07-27T20:05:19.186626"
 }
 Per-token loss scaled by world size: 0.0020264529157429934
 Per-token loss scaled by world size: 0.0008862476679496467
 Per-token loss scaled by world size: 0.0010139633668586612
 Per-token loss scaled by world size: 0.00098150665871799
 Per-token loss scaled by world size: 0.0027166178915649652
 Per-token loss scaled by world size: 0.0003891867527272552
 Per-token loss scaled by world size: 0.0019367473432794213
 Epoch: 3, Step: 39, Rank: 0, loss = 0.14565131068229675
 Epoch: 3, Step: 39, Rank: 2, loss = 0.07287861406803131
 Epoch: 3, Step: 39, Rank: 4, loss = 0.06369905173778534
 Epoch: 3, Step: 39, Rank: 5, loss = 0.19525690376758575
 Epoch: 3, Step: 39, Rank: 3, loss = 0.13920371234416962
 Epoch: 3, Step: 39, Rank: 7, loss = 0.07054579257965088
 Epoch: 3, Step: 39, Rank: 1, loss = 0.02797279693186283
 Per-token loss scaled by world size: 0.0002103938313666731
 Epoch: 3, Step: 39, Rank: 6, loss = 0.015122056938707829
 [2024-07-27 20:05:19,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[1.8947293298207637e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:19,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=31.57390143166257, CurrSamplesPerSec=33.093162948245876, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 39,
    "rank": 0,
    "loss": 0.14565131068229675,
    "overall_throughput": 33.040161880332626,
    "lr": 1.8947293298207637e-05,
    "cuda_mem_allocated": 22.00071620941162,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 575,
    "batch_size": 16,
    "total_loss": 0.09129128605127335,
    "gradnorm": 1.8486279249191284,
    "weight_norm": 393.4649658203125,
    "timestamp": "2024-07-27T20:05:19.677166"
 }
 Per-token loss scaled by world size: 0.0030472958460450172
 Per-token loss scaled by world size: 0.0027008713223040104
 Per-token loss scaled by world size: 0.0013511140132322907
 Per-token loss scaled by world size: 0.00032714917324483395
 Per-token loss scaled by world size: 0.0010533123277127743
 Per-token loss scaled by world size: 0.0011340089840814471
 Per-token loss scaled by world size: 0.000277617946267128
 Epoch: 3, Step: 40, Rank: 0, loss = 0.2605437934398651
 Epoch: 3, Step: 40, Rank: 5, loss = 0.11552024632692337
 Epoch: 3, Step: 40, Rank: 3, loss = 0.027971254661679268
 Epoch: 3, Step: 40, Rank: 2, loss = 0.23092450201511383
 Epoch: 3, Step: 40, Rank: 7, loss = 0.09695777297019958
 Epoch: 3, Step: 40, Rank: 1, loss = 0.09005820006132126
 Epoch: 3, Step: 40, Rank: 4, loss = 0.023736335337162018
 Per-token loss scaled by world size: 0.0006618773913942277
 Epoch: 3, Step: 40, Rank: 6, loss = 0.05659051612019539
 [2024-07-27 20:05:20,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.879473751206489e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:20,211] [INFO] [timer.py:258:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=31.60944150873908, CurrSamplesPerSec=32.98311497397824, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 40,
    "rank": 0,
    "loss": 0.2605437934398651,
    "overall_throughput": 32.92414856200329,
    "lr": 1.879473751206489e-05,
    "cuda_mem_allocated": 22.01025676727295,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 684,
    "batch_size": 16,
    "total_loss": 0.11278782784938812,
    "gradnorm": 1.686691403388977,
    "weight_norm": 393.46551513671875,
    "timestamp": "2024-07-27T20:05:20.214384"
 }
 Per-token loss scaled by world size: 0.0014961636625230312
 Per-token loss scaled by world size: 0.0015210546553134918
 Per-token loss scaled by world size: 0.0010414053685963154
 Per-token loss scaled by world size: 0.001773377531208098
 Per-token loss scaled by world size: 0.0008040367974899709
 Per-token loss scaled by world size: 0.0032368989195674658
 Per-token loss scaled by world size: 0.0005034140776842833
 Epoch: 3, Step: 41, Rank: 6, loss = 0.12187450379133224
 Epoch: 3, Step: 41, Rank: 3, loss = 0.11988011747598648
 Epoch: 3, Step: 41, Rank: 0, loss = 0.08344260603189468
 Epoch: 3, Step: 41, Rank: 1, loss = 0.14209187030792236
 Epoch: 3, Step: 41, Rank: 7, loss = 0.2593565285205841
 Epoch: 3, Step: 41, Rank: 5, loss = 0.06442344933748245
 Epoch: 3, Step: 41, Rank: 4, loss = 0.04033605381846428
 Per-token loss scaled by world size: 0.0004881576751358807
 Epoch: 3, Step: 41, Rank: 2, loss = 0.03911363333463669
 [2024-07-27 20:05:20,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[1.863256480957574e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:20,751] [INFO] [timer.py:258:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=31.626163332071105, CurrSamplesPerSec=32.274971444510975, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 41,
    "rank": 0,
    "loss": 0.08344260603189468,
    "overall_throughput": 32.196172088118544,
    "lr": 1.863256480957574e-05,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 641,
    "batch_size": 16,
    "total_loss": 0.10881484299898148,
    "gradnorm": 1.5060667991638184,
    "weight_norm": 393.46600341796875,
    "timestamp": "2024-07-27T20:05:20.793861"
 }
 Per-token loss scaled by world size: 0.001642214716412127
 Per-token loss scaled by world size: 0.0013352860696613789
 Per-token loss scaled by world size: 0.001149832969531417
 Per-token loss scaled by world size: 0.0007547381101176143
 Per-token loss scaled by world size: 0.002172367414459586
 Per-token loss scaled by world size: 0.0017613072413951159
 Per-token loss scaled by world size: 0.0012277569621801376
 Epoch: 3, Step: 42, Rank: 0, loss = 0.10571756958961487
 Epoch: 3, Step: 42, Rank: 5, loss = 0.0859590396285057
 Epoch: 3, Step: 42, Rank: 6, loss = 0.04858626425266266
 Epoch: 3, Step: 42, Rank: 7, loss = 0.13984614610671997
 Epoch: 3, Step: 42, Rank: 2, loss = 0.07402049750089645
 Epoch: 3, Step: 42, Rank: 3, loss = 0.11338414996862411
 Epoch: 3, Step: 42, Rank: 4, loss = 0.07903685420751572
 Per-token loss scaled by world size: 0.0004327835631556809
 Epoch: 3, Step: 42, Rank: 1, loss = 0.02786044217646122
 [2024-07-27 20:05:21,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[1.8460952524209355e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:21,294] [INFO] [timer.py:258:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=31.636401598459667, CurrSamplesPerSec=32.040930582537946, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 42,
    "rank": 0,
    "loss": 0.10571756958961487,
    "overall_throughput": 31.961989783039616,
    "lr": 1.8460952524209355e-05,
    "cuda_mem_allocated": 22.001193046569824,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 515,
    "batch_size": 16,
    "total_loss": 0.08430136740207672,
    "gradnorm": 2.007233142852783,
    "weight_norm": 393.46649169921875,
    "timestamp": "2024-07-27T20:05:21.336433"
 }
 Per-token loss scaled by world size: 0.001508427201770246
 Per-token loss scaled by world size: 0.0023370874114334583
 Per-token loss scaled by world size: 0.0023123989813029766
 Per-token loss scaled by world size: 0.0019151513697579503
 Per-token loss scaled by world size: 0.0032233409583568573
 Per-token loss scaled by world size: 0.0005486609297804534
 Per-token loss scaled by world size: 0.0025942232459783554
 Epoch: 3, Step: 43, Rank: 1, loss = 0.20537155866622925
 Epoch: 3, Step: 43, Rank: 7, loss = 0.20320206880569458
 Epoch: 3, Step: 43, Rank: 6, loss = 0.16829392313957214
 Epoch: 3, Step: 43, Rank: 4, loss = 0.13255304098129272
 Epoch: 3, Step: 43, Rank: 5, loss = 0.04821357876062393
 Epoch: 3, Step: 43, Rank: 3, loss = 0.2832510769367218
 Epoch: 3, Step: 43, Rank: 0, loss = 0.22796736657619476
 Per-token loss scaled by world size: 0.0013315769610926509
 Epoch: 3, Step: 43, Rank: 2, loss = 0.11701232939958572
 [2024-07-27 20:05:21,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[1.8280088311480203e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:21,830] [INFO] [timer.py:258:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=31.65571342945945, CurrSamplesPerSec=32.44800374432416, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 43,
    "rank": 0,
    "loss": 0.22796736657619476,
    "overall_throughput": 32.36756205093528,
    "lr": 1.8280088311480203e-05,
    "cuda_mem_allocated": 22.007156372070312,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 703,
    "batch_size": 16,
    "total_loss": 0.17323312163352966,
    "gradnorm": 3.137274742126465,
    "weight_norm": 393.4668884277344,
    "timestamp": "2024-07-27T20:05:21.873356"
 }
 Per-token loss scaled by world size: 0.0031055721919983625
 Per-token loss scaled by world size: 0.005434826016426086
 Per-token loss scaled by world size: 0.0022665681317448616
 Per-token loss scaled by world size: 0.0013299736892804503
 Per-token loss scaled by world size: 0.002249634126201272
 Per-token loss scaled by world size: 0.002406098647043109
 Per-token loss scaled by world size: 0.0007422761409543455
 Epoch: 3, Step: 44, Rank: 4, loss = 0.382475882768631
 Epoch: 3, Step: 44, Rank: 6, loss = 0.09359689801931381
 Epoch: 3, Step: 44, Rank: 5, loss = 0.15950973331928253
 Epoch: 3, Step: 44, Rank: 1, loss = 0.15831799805164337
 Epoch: 3, Step: 44, Rank: 7, loss = 0.1693291962146759
 Epoch: 3, Step: 44, Rank: 0, loss = 0.21855464577674866
 Epoch: 3, Step: 44, Rank: 3, loss = 0.05223768204450607
 Per-token loss scaled by world size: 0.000683724822010845
 Epoch: 3, Step: 44, Rank: 2, loss = 0.04811713472008705
 [2024-07-27 20:05:22,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[1.8090169943749477e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:22,384] [INFO] [timer.py:258:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=31.653925419020233, CurrSamplesPerSec=31.580790497837636, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 704
 {
    "epoch": 3,
    "step": 44,
    "rank": 0,
    "loss": 0.21855464577674866,
    "overall_throughput": 31.53099355086062,
    "lr": 1.8090169943749477e-05,
    "cuda_mem_allocated": 21.998329639434814,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 563,
    "batch_size": 16,
    "total_loss": 0.1602673977613449,
    "gradnorm": 2.924142599105835,
    "weight_norm": 393.46728515625,
    "timestamp": "2024-07-27T20:05:22.387393"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_704
 [20:05:40] INFO     saving took 17.951613903045654 seconds                                                                                                                                                                        utils.py:611
 Per-token loss scaled by world size: 0.002263416536152363
 Per-token loss scaled by world size: 0.0010527893900871277
 Per-token loss scaled by world size: 0.004291217308491468
 Per-token loss scaled by world size: 0.002417604671791196
 Per-token loss scaled by world size: 0.002519844565540552
 Per-token loss scaled by world size: 0.0017610156210139394
 Per-token loss scaled by world size: 0.003203270025551319
 Epoch: 3, Step: 45, Rank: 0, loss = 0.16494648158550262
 Epoch: 3, Step: 45, Rank: 2, loss = 0.31272247433662415
 Epoch: 3, Step: 45, Rank: 6, loss = 0.07672202587127686
 Epoch: 3, Step: 45, Rank: 7, loss = 0.17618294060230255
 Epoch: 3, Step: 45, Rank: 3, loss = 0.18363367021083832
 Epoch: 3, Step: 45, Rank: 1, loss = 0.12833401560783386
 Epoch: 3, Step: 45, Rank: 4, loss = 0.23343829810619354
 Per-token loss scaled by world size: 0.0014321228954941034
 Epoch: 3, Step: 45, Rank: 5, loss = 0.10436595976352692
 [2024-07-27 20:05:40,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[1.789140509396394e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:40,889] [INFO] [timer.py:258:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=31.661507538235384, CurrSamplesPerSec=31.983269859773554, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 45,
    "rank": 0,
    "loss": 0.16494648158550262,
    "overall_throughput": 31.91971183779508,
    "lr": 1.789140509396394e-05,
    "cuda_mem_allocated": 22.001669883728027,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 583,
    "batch_size": 16,
    "total_loss": 0.1725432276725769,
    "gradnorm": 4.5381975173950195,
    "weight_norm": 393.46771240234375,
    "timestamp": "2024-07-27T20:05:40.931934"
 }
 Per-token loss scaled by world size: 0.005079582799226046
 Per-token loss scaled by world size: 0.003907069563865662
 Per-token loss scaled by world size: 0.0016345379408448935
 Per-token loss scaled by world size: 0.00215306063182652
 Per-token loss scaled by world size: 0.005338544957339764
 Per-token loss scaled by world size: 0.0032401932403445244
 Per-token loss scaled by world size: 0.0003386051394045353
 Epoch: 3, Step: 46, Rank: 5, loss = 0.18193362653255463
 Epoch: 3, Step: 46, Rank: 6, loss = 0.1381184607744217
 Epoch: 3, Step: 46, Rank: 2, loss = 0.330147385597229
 Epoch: 3, Step: 46, Rank: 7, loss = 0.028612133115530014
 Epoch: 3, Step: 46, Rank: 3, loss = 0.42922475934028625
 Epoch: 3, Step: 46, Rank: 1, loss = 0.27379631996154785
 Epoch: 3, Step: 46, Rank: 4, loss = 0.45110705494880676
 Per-token loss scaled by world size: 0.0007393484702333808
 Epoch: 3, Step: 46, Rank: 0, loss = 0.062474943697452545
 [2024-07-27 20:05:41,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[1.7684011108568593e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:41,446] [INFO] [timer.py:258:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=31.649546499309086, CurrSamplesPerSec=31.143634404390532, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 46,
    "rank": 0,
    "loss": 0.062474943697452545,
    "overall_throughput": 31.079148738311545,
    "lr": 1.7684011108568593e-05,
    "cuda_mem_allocated": 21.99785280227661,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 676,
    "batch_size": 16,
    "total_loss": 0.2369268238544464,
    "gradnorm": 3.1312427520751953,
    "weight_norm": 393.46807861328125,
    "timestamp": "2024-07-27T20:05:41.491963"
 }
 Per-token loss scaled by world size: 0.003072180086746812
 Per-token loss scaled by world size: 0.0009212895529344678
 Per-token loss scaled by world size: 0.001889892853796482
 Per-token loss scaled by world size: 0.005953831598162651
 Per-token loss scaled by world size: 0.0021448375191539526
 Per-token loss scaled by world size: 0.0008049519965425134
 Per-token loss scaled by world size: 0.005051196087151766
 Epoch: 3, Step: 47, Rank: 3, loss = 0.06748446077108383
 Epoch: 3, Step: 47, Rank: 2, loss = 0.13843464851379395
 Epoch: 3, Step: 47, Rank: 0, loss = 0.22503718733787537
 Epoch: 3, Step: 47, Rank: 7, loss = 0.43611815571784973
 Epoch: 3, Step: 47, Rank: 5, loss = 0.1571093499660492
 Epoch: 3, Step: 47, Rank: 4, loss = 0.058962732553482056
 Epoch: 3, Step: 47, Rank: 6, loss = 0.37000012397766113
 Per-token loss scaled by world size: 0.002140582073479891
 Epoch: 3, Step: 47, Rank: 1, loss = 0.1567976325750351
 [2024-07-27 20:05:41,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[1.7468214769841542e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:41,981] [INFO] [timer.py:258:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=31.673656937700265, CurrSamplesPerSec=32.7721445241366, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 47,
    "rank": 0,
    "loss": 0.22503718733787537,
    "overall_throughput": 32.68064947498726,
    "lr": 1.7468214769841542e-05,
    "cuda_mem_allocated": 22.002624034881592,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 586,
    "batch_size": 16,
    "total_loss": 0.20124304294586182,
    "gradnorm": 2.9233510494232178,
    "weight_norm": 393.46844482421875,
    "timestamp": "2024-07-27T20:05:41.984708"
 }
 Per-token loss scaled by world size: 0.0007923523080535233
 Per-token loss scaled by world size: 0.005117433145642281
 Per-token loss scaled by world size: 0.001681540277786553
 Per-token loss scaled by world size: 0.0009754404309205711
 Per-token loss scaled by world size: 0.0007582003017887473
 Per-token loss scaled by world size: 0.0003797741374000907
 Per-token loss scaled by world size: 0.0006013160455040634
 Epoch: 3, Step: 48, Rank: 5, loss = 0.12737667560577393
 Epoch: 3, Step: 48, Rank: 1, loss = 0.38764557242393494
 Epoch: 3, Step: 48, Rank: 2, loss = 0.06002068519592285
 Epoch: 3, Step: 48, Rank: 6, loss = 0.028767891228199005
 Epoch: 3, Step: 48, Rank: 7, loss = 0.07388961315155029
 Epoch: 3, Step: 48, Rank: 0, loss = 0.05743367224931717
 Epoch: 3, Step: 48, Rank: 4, loss = 0.04554969072341919
 Per-token loss scaled by world size: 0.001107058022171259
 Epoch: 3, Step: 48, Rank: 3, loss = 0.0838596448302269
 [2024-07-27 20:05:42,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[1.7244252047910893e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:42,546] [INFO] [timer.py:258:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=31.653254285775276, CurrSamplesPerSec=30.761573372705392, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 3,
    "step": 48,
    "rank": 0,
    "loss": 0.05743367224931717,
    "overall_throughput": 30.690515825250888,
    "lr": 1.7244252047910893e-05,
    "cuda_mem_allocated": 22.003339290618896,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 606,
    "batch_size": 16,
    "total_loss": 0.10806792974472046,
    "gradnorm": 1.4744030237197876,
    "weight_norm": 393.4688415527344,
    "timestamp": "2024-07-27T20:05:42.548994"
 }
 Epoch 3: 100%|██████████| 12/12 [00:24<00:00,  2.08s/it]
 total tokens: 196 num samples: 2 num padding tokens: 44 - rank: 6 max len: 98 min len: 54 avg len: 76.0 num_loss_counted_tokens: 104
 total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 0 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
 total tokens: 136 num samples: 2 num padding tokens: 0 - rank: 0 max len: 68 min len: 68 avg len: 68.0 num_loss_counted_tokens: 53
 total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 6 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 75
 total tokens: 130 num samples: 2 num padding tokens: 15 - rank: 2 max len: 65 min len: 50 avg len: 57.5 num_loss_counted_tokens: 65
 total tokens: 110 num samples: 2 num padding tokens: 11 - rank: 7 max len: 55 min len: 44 avg len: 49.5 num_loss_counted_tokens: 53
 total tokens: 142 num samples: 2 num padding tokens: 0 - rank: 7 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 75
 total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 67
 total tokens: 120 num samples: 2 num padding tokens: 15 - rank: 6 max len: 60 min len: 45 avg len: 52.5 num_loss_counted_tokens: 59
 total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 6 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 70
 total tokens: 166 num samples: 2 num padding tokens: 28 - rank: 7 max len: 83 min len: 55 avg len: 69.0 num_loss_counted_tokens: 89
 total tokens: 126 num samples: 2 num padding tokens: 6 - rank: 7 max len: 63 min len: 57 avg len: 60.0 num_loss_counted_tokens: 57
 total tokens: 154 num samples: 2 num padding tokens: 22 - rank: 6 max len: 77 min len: 55 avg len: 66.0 num_loss_counted_tokens: 75
 total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 7 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 70
 total tokens: 282 num samples: 2 num padding tokens: 81 - rank: 6 max len: 141 min len: 60 avg len: 100.5 num_loss_counted_tokens: 156
 total tokens: 194 num samples: 2 num padding tokens: 10 - rank: 7 max len: 97 min len: 87 avg len: 92.0 num_loss_counted_tokens: 116
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 7 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 62
 total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 7 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 61
 total tokens: 164 num samples: 2 num padding tokens: 30 - rank: 0 max len: 82 min len: 52 avg len: 67.0 num_loss_counted_tokens: 81
 total tokens: 150 num samples: 2 num padding tokens: 13 - rank: 7 max len: 75 min len: 62 avg len: 68.5 num_loss_counted_tokens: 70
 total tokens: 180 num samples: 2 num padding tokens: 6 - rank: 0 max len: 90 min len: 84 avg len: 87.0 num_loss_counted_tokens: 114
 total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 7 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 72
 total tokens: 106 num samples: 2 num padding tokens: 8 - rank: 6 max len: 53 min len: 45 avg len: 49.0 num_loss_counted_tokens: 45
 total tokens: 186 num samples: 2 num padding tokens: 16 - rank: 6 max len: 93 min len: 77 avg len: 85.0 num_loss_counted_tokens: 122
 total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 4 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 59
 total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 6 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 80
 total tokens: 118 num samples: 2 num padding tokens: 14 - rank: 4 max len: 59 min len: 45 avg len: 52.0 num_loss_counted_tokens: 52
 total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 6 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 59
 total tokens: 186 num samples: 2 num padding tokens: 30 - rank: 5 max len: 93 min len: 63 avg len: 78.0 num_loss_counted_tokens: 100
 total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 2 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 70
 total tokens: 114 num samples: 2 num padding tokens: 7 - rank: 5 max len: 57 min len: 50 avg len: 53.5 num_loss_counted_tokens: 59
 total tokens: 148 num samples: 2 num padding tokens: 16 - rank: 4 max len: 74 min len: 58 avg len: 66.0 num_loss_counted_tokens: 68
 total tokens: 146 num samples: 2 num padding tokens: 22 - rank: 4 max len: 73 min len: 51 avg len: 62.0 num_loss_counted_tokens: 72
 total tokens: 186 num samples: 2 num padding tokens: 45 - rank: 0 max len: 93 min len: 48 avg len: 70.5 num_loss_counted_tokens: 111
 total tokens: 180 num samples: 2 num padding tokens: 26 - rank: 0 max len: 90 min len: 64 avg len: 77.0 num_loss_counted_tokens: 118
 total tokens: 166 num samples: 2 num padding tokens: 16 - rank: 4 max len: 83 min len: 67 avg len: 75.0 num_loss_counted_tokens: 85
 total tokens: 158 num samples: 2 num padding tokens: 20 - rank: 1 max len: 79 min len: 59 avg len: 69.0 num_loss_counted_tokens: 66
 total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 0 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 66
 total tokens: 132 num samples: 2 num padding tokens: 23 - rank: 0 max len: 66 min len: 43 avg len: 54.5 num_loss_counted_tokens: 45
 total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 4 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 78
 total tokens: 114 num samples: 2 num padding tokens: 11 - rank: 6 max len: 57 min len: 46 avg len: 51.5 num_loss_counted_tokens: 59
 total tokens: 100 num samples: 2 num padding tokens: 1 - rank: 4 max len: 50 min len: 49 avg len: 49.5 num_loss_counted_tokens: 49
 total tokens: 126 num samples: 2 num padding tokens: 9 - rank: 1 max len: 63 min len: 54 avg len: 58.5 num_loss_counted_tokens: 63
 total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 4 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 100
 total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 0 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 57
 total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 2 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 73
 total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 1 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 87
 total tokens: 202 num samples: 2 num padding tokens: 46 - rank: 4 max len: 101 min len: 55 avg len: 78.0 num_loss_counted_tokens: 106
 total tokens: 162 num samples: 2 num padding tokens: 2 - rank: 0 max len: 81 min len: 79 avg len: 80.0 num_loss_counted_tokens: 93
 total tokens: 174 num samples: 2 num padding tokens: 32 - rank: 2 max len: 87 min len: 55 avg len: 71.0 num_loss_counted_tokens: 73
 total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 2 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 73
 total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 0 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 87
 total tokens: 138 num samples: 2 num padding tokens: 7 - rank: 4 max len: 69 min len: 62 avg len: 65.5 num_loss_counted_tokens: 77
 total tokens: 208 num samples: 2 num padding tokens: 43 - rank: 1 max len: 104 min len: 61 avg len: 82.5 num_loss_counted_tokens: 108
 total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 4 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 65
 total tokens: 146 num samples: 2 num padding tokens: 6 - rank: 2 max len: 73 min len: 67 avg len: 70.0 num_loss_counted_tokens: 66
 total tokens: 124 num samples: 2 num padding tokens: 13 - rank: 1 max len: 62 min len: 49 avg len: 55.5 num_loss_counted_tokens: 54
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 69
 total tokens: 228 num samples: 2 num padding tokens: 38 - rank: 4 max len: 114 min len: 76 avg len: 95.0 num_loss_counted_tokens: 129
 total tokens: 132 num samples: 2 num padding tokens: 1 - rank: 2 max len: 66 min len: 65 avg len: 65.5 num_loss_counted_tokens: 52
 total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 2 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 60
 total tokens: 126 num samples: 2 num padding tokens: 3 - rank: 2 max len: 63 min len: 60 avg len: 61.5 num_loss_counted_tokens: 61
 total tokens: 216 num samples: 2 num padding tokens: 49 - rank: 2 max len: 108 min len: 59 avg len: 83.5 num_loss_counted_tokens: 103
 total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 2 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 64
 total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 7 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 64
 total tokens: 244 num samples: 2 num padding tokens: 78 - rank: 6 max len: 122 min len: 44 avg len: 83.0 num_loss_counted_tokens: 115
 total tokens: 214 num samples: 2 num padding tokens: 62 - rank: 5 max len: 107 min len: 45 avg len: 76.0 num_loss_counted_tokens: 99
 total tokens: 200 num samples: 2 num padding tokens: 47 - rank: 1 max len: 100 min len: 53 avg len: 76.5 num_loss_counted_tokens: 90
 total tokens: 144 num samples: 2 num padding tokens: 14 - rank: 1 max len: 72 min len: 58 avg len: 65.0 num_loss_counted_tokens: 84
 total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 1 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 69
 total tokens: 96 num samples: 2 num padding tokens: 4 - rank: 1 max len: 48 min len: 44 avg len: 46.0 num_loss_counted_tokens: 46
 total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 1 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 60
 total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 5 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 73
 total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 5 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 82
 total tokens: 226 num samples: 2 num padding tokens: 6 - rank: 5 max len: 113 min len: 107 avg len: 110.0 num_loss_counted_tokens: 142
 total tokens: 148 num samples: 2 num padding tokens: 28 - rank: 5 max len: 74 min len: 46 avg len: 60.0 num_loss_counted_tokens: 63
 total tokens: 168 num samples: 2 num padding tokens: 33 - rank: 5 max len: 84 min len: 51 avg len: 67.5 num_loss_counted_tokens: 89
 total tokens: 162 num samples: 2 num padding tokens: 29 - rank: 5 max len: 81 min len: 52 avg len: 66.5 num_loss_counted_tokens: 72
 total tokens: 186 num samples: 2 num padding tokens: 29 - rank: 1 max len: 93 min len: 64 avg len: 78.5 num_loss_counted_tokens: 76
 total tokens: 140 num samples: 2 num padding tokens: 6 - rank: 5 max len: 70 min len: 64 avg len: 67.0 num_loss_counted_tokens: 59
 total tokens: 156 num samples: 2 num padding tokens: 6 - rank: 5 max len: 78 min len: 72 avg len: 75.0 num_loss_counted_tokens: 81
 total tokens: 174 num samples: 2 num padding tokens: 29 - rank: 1 max len: 87 min len: 58 avg len: 72.5 num_loss_counted_tokens: 84
 total tokens: 124 num samples: 2 num padding tokens: 13 - rank: 3 max len: 62 min len: 49 avg len: 55.5 num_loss_counted_tokens: 55
 total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 3 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 69
 total tokens: 132 num samples: 2 num padding tokens: 16 - rank: 3 max len: 66 min len: 50 avg len: 58.0 num_loss_counted_tokens: 61
 total tokens: 140 num samples: 2 num padding tokens: 19 - rank: 3 max len: 70 min len: 51 avg len: 60.5 num_loss_counted_tokens: 55
 total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 3 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 71
 total tokens: 104 num samples: 2 num padding tokens: 6 - rank: 5 max len: 52 min len: 46 avg len: 49.0 num_loss_counted_tokens: 52
 total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 3 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 100
 total tokens: 160 num samples: 2 num padding tokens: 14 - rank: 3 max len: 80 min len: 66 avg len: 73.0 num_loss_counted_tokens: 83
 total tokens: 130 num samples: 2 num padding tokens: 12 - rank: 3 max len: 65 min len: 53 avg len: 59.0 num_loss_counted_tokens: 64
 total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 3 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 77
 total tokens: 188 num samples: 2 num padding tokens: 26 - rank: 3 max len: 94 min len: 68 avg len: 81.0 num_loss_counted_tokens: 85
 total tokens: 188 num samples: 2 num padding tokens: 14 - rank: 2 max len: 94 min len: 80 avg len: 87.0 num_loss_counted_tokens: 116
 total tokens: 176 num samples: 2 num padding tokens: 25 - rank: 3 max len: 88 min len: 63 avg len: 75.5 num_loss_counted_tokens: 82
 total tokens: 162 num samples: 2 num padding tokens: 26 - rank: 3 max len: 81 min len: 55 avg len: 68.0 num_loss_counted_tokens: 85
 Per-token loss scaled by world size: 0.0007222609710879624
 Per-token loss scaled by world size: 0.0011062632547691464
 Per-token loss scaled by world size: 0.0014970493502914906
 Per-token loss scaled by world size: 0.0006512971594929695
 Per-token loss scaled by world size: 0.005336429923772812
 Per-token loss scaled by world size: 0.0008554465603083372
 Per-token loss scaled by world size: 0.002156679518520832
 Epoch: 4, Step: 49, Rank: 5, loss = 0.36687955260276794
 Epoch: 4, Step: 49, Rank: 1, loss = 0.10292214155197144
 Epoch: 4, Step: 49, Rank: 4, loss = 0.0760556012392044
 Epoch: 4, Step: 49, Rank: 6, loss = 0.14827170968055725
 Epoch: 4, Step: 49, Rank: 0, loss = 0.04965544119477272
 Epoch: 4, Step: 49, Rank: 7, loss = 0.05881195142865181
 Epoch: 4, Step: 49, Rank: 2, loss = 0.04477667808532715
 Per-token loss scaled by world size: 0.0012891377555206418
 Epoch: 4, Step: 49, Rank: 3, loss = 0.08862821757793427
 [2024-07-27 20:05:43,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[1.7012367842724887e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:43,557] [INFO] [timer.py:258:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=31.626881887830034, CurrSamplesPerSec=30.459502835854497, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:   8%|▊         | 1/12 [00:00<00:10,  1.09it/s]
 {
    "epoch": 4,
    "step": 49,
    "rank": 0,
    "loss": 0.04965544119477272,
    "overall_throughput": 30.35780108269395,
    "lr": 1.7012367842724887e-05,
    "cuda_mem_allocated": 21.996244430541992,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 550,
    "batch_size": 16,
    "total_loss": 0.11700016260147095,
    "gradnorm": 1.7901870012283325,
    "weight_norm": 393.46917724609375,
    "timestamp": "2024-07-27T20:05:43.608424"
 }
 Per-token loss scaled by world size: 0.0014621953014284372
 Per-token loss scaled by world size: 0.0015464453026652336
 Per-token loss scaled by world size: 0.0014793629525229335
 Per-token loss scaled by world size: 0.0010739548597484827
 Per-token loss scaled by world size: 0.002221300033852458
 Per-token loss scaled by world size: 0.001030008657835424
 Per-token loss scaled by world size: 0.0031245022546499968
 Epoch: 4, Step: 50, Rank: 4, loss = 0.0959736704826355
 Epoch: 4, Step: 50, Rank: 5, loss = 0.10032563656568527
 Epoch: 4, Step: 50, Rank: 0, loss = 0.14410683512687683
 Epoch: 4, Step: 50, Rank: 1, loss = 0.06682181358337402
 Epoch: 4, Step: 50, Rank: 3, loss = 0.06967282295227051
 Epoch: 4, Step: 50, Rank: 6, loss = 0.09485992044210434
 Epoch: 4, Step: 50, Rank: 7, loss = 0.2027020901441574
 Per-token loss scaled by world size: 0.0011166962794959545
 Epoch: 4, Step: 50, Rank: 2, loss = 0.07244566828012466
 [2024-07-27 20:05:44,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[1.6772815716257414e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:44,101] [INFO] [timer.py:258:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=31.64645160925237, CurrSamplesPerSec=32.594364979528, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  17%|█▋        | 2/12 [00:01<00:06,  1.43it/s]
 {
    "epoch": 4,
    "step": 50,
    "rank": 0,
    "loss": 0.14410683512687683,
    "overall_throughput": 32.48167592990112,
    "lr": 1.6772815716257414e-05,
    "cuda_mem_allocated": 21.999523639678955,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 519,
    "batch_size": 16,
    "total_loss": 0.105863556265831,
    "gradnorm": 2.59075927734375,
    "weight_norm": 393.4695129394531,
    "timestamp": "2024-07-27T20:05:44.148907"
 }
 Per-token loss scaled by world size: 0.0009595813462510705
 Per-token loss scaled by world size: 0.0007476079626940191
 Per-token loss scaled by world size: 0.002177697606384754
 Per-token loss scaled by world size: 0.00161154440138489
 Per-token loss scaled by world size: 0.002184153301641345
 Per-token loss scaled by world size: 0.0022782967425882816
 Per-token loss scaled by world size: 0.002697640098631382
 Epoch: 4, Step: 51, Rank: 1, loss = 0.12066438794136047
 Epoch: 4, Step: 51, Rank: 7, loss = 0.16305510699748993
 Epoch: 4, Step: 51, Rank: 3, loss = 0.055977147072553635
 Epoch: 4, Step: 51, Rank: 2, loss = 0.16353848576545715
 Epoch: 4, Step: 51, Rank: 0, loss = 0.07184865325689316
 Epoch: 4, Step: 51, Rank: 6, loss = 0.20198580622673035
 Epoch: 4, Step: 51, Rank: 5, loss = 0.1705874651670456
 Per-token loss scaled by world size: 0.001197479316033423
 Epoch: 4, Step: 51, Rank: 4, loss = 0.08966126292943954
 [2024-07-27 20:05:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[1.6525857615241686e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:44,647] [INFO] [timer.py:258:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=31.66021704434534, CurrSamplesPerSec=32.33534113615524, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  25%|██▌       | 3/12 [00:02<00:05,  1.59it/s]
 {
    "epoch": 4,
    "step": 51,
    "rank": 0,
    "loss": 0.07184865325689316,
    "overall_throughput": 32.279271652634336,
    "lr": 1.6525857615241686e-05,
    "cuda_mem_allocated": 22.002862453460693,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 599,
    "batch_size": 16,
    "total_loss": 0.12966477870941162,
    "gradnorm": 2.6400396823883057,
    "weight_norm": 393.4698486328125,
    "timestamp": "2024-07-27T20:05:44.689190"
 }
 Per-token loss scaled by world size: 0.001371016027405858
 Per-token loss scaled by world size: 0.0010015949374064803
 Per-token loss scaled by world size: 0.0019693197682499886
 Per-token loss scaled by world size: 0.00044975956552661955
 Per-token loss scaled by world size: 0.0015600252663716674
 Per-token loss scaled by world size: 0.0014032198814675212
 Per-token loss scaled by world size: 0.0006641225190833211
 Epoch: 4, Step: 52, Rank: 4, loss = 0.08450957387685776
 Epoch: 4, Step: 52, Rank: 0, loss = 0.11567948013544083
 Epoch: 4, Step: 52, Rank: 6, loss = 0.16616135835647583
 Epoch: 4, Step: 52, Rank: 5, loss = 0.1316271275281906
 Epoch: 4, Step: 52, Rank: 2, loss = 0.03794846311211586
 Epoch: 4, Step: 52, Rank: 1, loss = 0.11839667707681656
 Epoch: 4, Step: 52, Rank: 7, loss = 0.05603533610701561
 Per-token loss scaled by world size: 0.0015486030606552958
 Epoch: 4, Step: 52, Rank: 3, loss = 0.13066338002681732
 [2024-07-27 20:05:45,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[1.6271763584735373e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:45,202] [INFO] [timer.py:258:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=31.653990957078555, CurrSamplesPerSec=31.351883784434047, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  33%|███▎      | 4/12 [00:02<00:04,  1.67it/s]
 {
    "epoch": 4,
    "step": 52,
    "rank": 0,
    "loss": 0.11567948013544083,
    "overall_throughput": 31.298703164249382,
    "lr": 1.6271763584735373e-05,
    "cuda_mem_allocated": 22.004770278930664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 675,
    "batch_size": 16,
    "total_loss": 0.10512767732143402,
    "gradnorm": 2.028604745864868,
    "weight_norm": 393.47015380859375,
    "timestamp": "2024-07-27T20:05:45.245073"
 }
 Per-token loss scaled by world size: 0.0023022103123366833
 Per-token loss scaled by world size: 0.002660317113623023
 Per-token loss scaled by world size: 0.0015502030728384852
 Per-token loss scaled by world size: 0.001655052648857236
 Per-token loss scaled by world size: 0.0008553997613489628
 Per-token loss scaled by world size: 0.002113129710778594
 Per-token loss scaled by world size: 0.0022639036178588867
 Epoch: 4, Step: 53, Rank: 6, loss = 0.19387060403823853
 Epoch: 4, Step: 53, Rank: 0, loss = 0.16777357459068298
 Epoch: 4, Step: 53, Rank: 1, loss = 0.06233725696802139
 Epoch: 4, Step: 53, Rank: 2, loss = 0.11297105252742767
 Epoch: 4, Step: 53, Rank: 5, loss = 0.12061195820569992
 Epoch: 4, Step: 53, Rank: 3, loss = 0.16498197615146637
 Epoch: 4, Step: 53, Rank: 4, loss = 0.15399432182312012
 Per-token loss scaled by world size: 0.0017511562909930944
 Epoch: 4, Step: 53, Rank: 7, loss = 0.12761551141738892
 [2024-07-27 20:05:45,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[1.6010811472830253e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:45,749] [INFO] [timer.py:258:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=31.66602722465284, CurrSamplesPerSec=32.279737447919125, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 4:  42%|████▏     | 5/12 [00:03<00:04,  1.72it/s]
 {
    "epoch": 4,
    "step": 53,
    "rank": 0,
    "loss": 0.16777357459068298,
    "overall_throughput": 32.227156557192046,
    "lr": 1.6010811472830253e-05,
    "cuda_mem_allocated": 22.00548553466797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 583,
    "batch_size": 16,
    "total_loss": 0.13801953196525574,
    "gradnorm": 2.3451616764068604,
    "weight_norm": 393.4704895019531,
    "timestamp": "2024-07-27T20:05:45.792018"
 }
 Per-token loss scaled by world size: 0.0016146524576470256
 Per-token loss scaled by world size: 0.00018823673599399626
 Per-token loss scaled by world size: 0.00015145067300181836
 Per-token loss scaled by world size: 0.002239079447463155
 Per-token loss scaled by world size: 0.0013640215620398521
 Per-token loss scaled by world size: 0.0009340652031823993
 Per-token loss scaled by world size: 0.002048594644293189
 Epoch: 4, Step: 54, Rank: 4, loss = 0.017576605081558228
 Epoch: 4, Step: 54, Rank: 0, loss = 0.15076817572116852
 Epoch: 4, Step: 54, Rank: 6, loss = 0.12736551463603973
 Epoch: 4, Step: 54, Rank: 5, loss = 0.20907405018806458
 Epoch: 4, Step: 54, Rank: 2, loss = 0.014141706749796867
 Epoch: 4, Step: 54, Rank: 7, loss = 0.08721833676099777
 Epoch: 4, Step: 54, Rank: 3, loss = 0.19128753244876862
 Per-token loss scaled by world size: 0.001376173458993435
 Epoch: 4, Step: 54, Rank: 1, loss = 0.12850019335746765
 [2024-07-27 20:05:46,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[1.5743286626829437e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:46,290] [INFO] [timer.py:258:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=31.675406097739614, CurrSamplesPerSec=32.16120844994824, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  50%|█████     | 6/12 [00:03<00:03,  1.76it/s]
 {
    "epoch": 4,
    "step": 54,
    "rank": 0,
    "loss": 0.15076817572116852,
    "overall_throughput": 32.07902156132976,
    "lr": 1.5743286626829437e-05,
    "cuda_mem_allocated": 22.004770278930664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 747,
    "batch_size": 16,
    "total_loss": 0.11574152112007141,
    "gradnorm": 1.6529176235198975,
    "weight_norm": 393.4708251953125,
    "timestamp": "2024-07-27T20:05:46.332858"
 }
 Per-token loss scaled by world size: 0.0008495299844071269
 Per-token loss scaled by world size: 0.002507910830900073
 Per-token loss scaled by world size: 0.0028947019018232822
 Per-token loss scaled by world size: 0.001476020785048604
 Per-token loss scaled by world size: 0.001191351911984384
 Per-token loss scaled by world size: 0.0018832029309123755
 Per-token loss scaled by world size: 0.0013536742189899087
 Epoch: 4, Step: 55, Rank: 1, loss = 0.22397755086421967
 Epoch: 4, Step: 55, Rank: 6, loss = 0.19404959678649902
 Epoch: 4, Step: 55, Rank: 7, loss = 0.09218085557222366
 Epoch: 4, Step: 55, Rank: 0, loss = 0.06573238223791122
 Epoch: 4, Step: 55, Rank: 2, loss = 0.14571282267570496
 Epoch: 4, Step: 55, Rank: 3, loss = 0.11420710384845734
 Epoch: 4, Step: 55, Rank: 4, loss = 0.10474054515361786
 Per-token loss scaled by world size: 0.0010037233587354422
 Epoch: 4, Step: 55, Rank: 5, loss = 0.07766309380531311
 [2024-07-27 20:05:46,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[1.5469481581224274e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:05:46,825] [INFO] [timer.py:258:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=31.69138798284937, CurrSamplesPerSec=32.545268319935445, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 880
 {
    "epoch": 4,
    "step": 55,
    "rank": 0,
    "loss": 0.06573238223791122,
    "overall_throughput": 32.46340198605274,
    "lr": 1.5469481581224274e-05,
    "cuda_mem_allocated": 22.000000476837158,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 619,
    "batch_size": 16,
    "total_loss": 0.1272830069065094,
    "gradnorm": 1.8899047374725342,
    "weight_norm": 393.4711608886719,
    "timestamp": "2024-07-27T20:05:46.828726"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_880
 [20:06:04] INFO     saving took 17.93557572364807 seconds                                                                                                                                                                         utils.py:611
 Epoch 4:  58%|█████▊    | 7/12 [00:22<00:32,  6.42s/it]
 Per-token loss scaled by world size: 0.0008630760130472481
 Per-token loss scaled by world size: 0.0010983350221067667
 Per-token loss scaled by world size: 0.0021769509185105562
 Per-token loss scaled by world size: 0.0004714219248853624
 Per-token loss scaled by world size: 0.0017523688729852438
 Per-token loss scaled by world size: 0.0024742181412875652
 Epoch: 4, Step: 56, Rank: 0, loss = 0.042015478014945984
 Epoch: 4, Step: 56, Rank: 4, loss = 0.09788911044597626
 Per-token loss scaled by world size: 0.00015522913599852473
 Epoch: 4, Step: 56, Rank: 1, loss = 0.2205146849155426
 Epoch: 4, Step: 56, Rank: 2, loss = 0.19402074813842773
 Epoch: 4, Step: 56, Rank: 7, loss = 0.07692164927721024
 Epoch: 4, Step: 56, Rank: 3, loss = 0.15617987513542175
 Epoch: 4, Step: 56, Rank: 5, loss = 0.013834796845912933
 Per-token loss scaled by world size: 0.0007872319547459483
 Epoch: 4, Step: 56, Rank: 6, loss = 0.07016205042600632
 [2024-07-27 20:06:05,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[1.5189695737812153e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:05,304] [INFO] [timer.py:258:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=31.708933325115336, CurrSamplesPerSec=32.66747732319786, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  67%|██████▋   | 8/12 [00:22<00:18,  4.55s/it]
 {
    "epoch": 4,
    "step": 56,
    "rank": 0,
    "loss": 0.042015478014945984,
    "overall_throughput": 32.5988932413107,
    "lr": 1.5189695737812153e-05,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 713,
    "batch_size": 16,
    "total_loss": 0.10894230008125305,
    "gradnorm": 1.9275243282318115,
    "weight_norm": 393.47149658203125,
    "timestamp": "2024-07-27T20:06:05.346935"
 }
 Per-token loss scaled by world size: 0.0021620304323732853
 Per-token loss scaled by world size: 0.0008936038357205689
 Per-token loss scaled by world size: 0.0009625108214095235
 Per-token loss scaled by world size: 0.0029216075781732798
 Per-token loss scaled by world size: 0.0011535611702129245
 Per-token loss scaled by world size: 0.0010705487802624702
 Per-token loss scaled by world size: 0.001004268298856914
 Epoch: 4, Step: 57, Rank: 0, loss = 0.060765061527490616
 Epoch: 4, Step: 57, Rank: 1, loss = 0.07844215631484985
 Epoch: 4, Step: 57, Rank: 4, loss = 0.14701807498931885
 Epoch: 4, Step: 57, Rank: 6, loss = 0.06545073539018631
 Epoch: 4, Step: 57, Rank: 5, loss = 0.07279732078313828
 Epoch: 4, Step: 57, Rank: 2, loss = 0.19866931438446045
 Epoch: 4, Step: 57, Rank: 7, loss = 0.06829024106264114
 Per-token loss scaled by world size: 0.004088845103979111
 Epoch: 4, Step: 57, Rank: 3, loss = 0.27804145216941833
 [2024-07-27 20:06:05,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[1.4904235038305084e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:05,837] [INFO] [timer.py:258:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=31.726304642319306, CurrSamplesPerSec=32.69348184898873, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  75%|███████▌  | 9/12 [00:23<00:09,  3.29s/it]
 {
    "epoch": 4,
    "step": 57,
    "rank": 0,
    "loss": 0.060765061527490616,
    "overall_throughput": 32.6126440646996,
    "lr": 1.4904235038305084e-05,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 544,
    "batch_size": 16,
    "total_loss": 0.12118428945541382,
    "gradnorm": 1.7047128677368164,
    "weight_norm": 393.4718322753906,
    "timestamp": "2024-07-27T20:06:05.881923"
 }
 Per-token loss scaled by world size: 0.0010402144398540258
 Per-token loss scaled by world size: 0.001066502882167697
 Per-token loss scaled by world size: 0.0007541680242866278
 Per-token loss scaled by world size: 0.001964687602594495
 Per-token loss scaled by world size: 0.00040768564213067293
 Per-token loss scaled by world size: 0.002232564380392432
 Per-token loss scaled by world size: 0.004068903159350157
 Epoch: 4, Step: 58, Rank: 7, loss = 0.08758654445409775
 Epoch: 4, Step: 58, Rank: 1, loss = 0.06193605065345764
 Epoch: 4, Step: 58, Rank: 0, loss = 0.08542761206626892
 Epoch: 4, Step: 58, Rank: 3, loss = 0.16134996712207794
 Epoch: 4, Step: 58, Rank: 5, loss = 0.33415865898132324
 Epoch: 4, Step: 58, Rank: 4, loss = 0.1833493560552597
 Epoch: 4, Step: 58, Rank: 2, loss = 0.033481184393167496
 Per-token loss scaled by world size: 0.000749716826248914
 Epoch: 4, Step: 58, Rank: 6, loss = 0.06157049536705017
 [2024-07-27 20:06:06,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[1.461341162978688e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:06,388] [INFO] [timer.py:258:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=31.725638447536326, CurrSamplesPerSec=31.68904077052279, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  83%|████████▎ | 10/12 [00:23<00:04,  2.45s/it]
 {
    "epoch": 4,
    "step": 58,
    "rank": 0,
    "loss": 0.08542761206626892,
    "overall_throughput": 31.617913823118545,
    "lr": 1.461341162978688e-05,
    "cuda_mem_allocated": 22.002624034881592,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 657,
    "batch_size": 16,
    "total_loss": 0.12610748410224915,
    "gradnorm": 2.7373974323272705,
    "weight_norm": 393.47216796875,
    "timestamp": "2024-07-27T20:06:06.428516"
 }
 Per-token loss scaled by world size: 0.0009614942828193307
 Per-token loss scaled by world size: 0.0012739634839817882
 Per-token loss scaled by world size: 0.0009918607538565993
 Per-token loss scaled by world size: 0.001772751216776669
 Per-token loss scaled by world size: 0.0035334480926394463
 Per-token loss scaled by world size: 0.0007813825504854321
 Per-token loss scaled by world size: 0.00042521810973994434
 Epoch: 4, Step: 59, Rank: 0, loss = 0.07290176302194595
 Epoch: 4, Step: 59, Rank: 6, loss = 0.09363631904125214
 Epoch: 4, Step: 59, Rank: 4, loss = 0.1302972137928009
 Epoch: 4, Step: 59, Rank: 7, loss = 0.07066982984542847
 Epoch: 4, Step: 59, Rank: 5, loss = 0.259708434343338
 Epoch: 4, Step: 59, Rank: 3, loss = 0.05743161588907242
 Epoch: 4, Step: 59, Rank: 2, loss = 0.03125353157520294
 Per-token loss scaled by world size: 0.0010259401751682162
 Epoch: 4, Step: 59, Rank: 1, loss = 0.07540660351514816
 [2024-07-27 20:06:06,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[1.4317543523384928e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:06,915] [INFO] [timer.py:258:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=31.746165669393985, CurrSamplesPerSec=32.93967877502177, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4:  92%|█████████▏| 11/12 [00:24<00:01,  1.86s/it]
 {
    "epoch": 4,
    "step": 59,
    "rank": 0,
    "loss": 0.07290176302194595,
    "overall_throughput": 32.85643000105753,
    "lr": 1.4317543523384928e-05,
    "cuda_mem_allocated": 21.999285221099854,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 588,
    "batch_size": 16,
    "total_loss": 0.09891317039728165,
    "gradnorm": 1.6408778429031372,
    "weight_norm": 393.47247314453125,
    "timestamp": "2024-07-27T20:06:06.958893"
 }
 Per-token loss scaled by world size: 0.0008049748139455914
 Per-token loss scaled by world size: 0.0006352822529152036
 Per-token loss scaled by world size: 0.004135269671678543
 Per-token loss scaled by world size: 0.0009236105252057314
 Per-token loss scaled by world size: 0.0006417850963771343
 Per-token loss scaled by world size: 0.0002449562889523804
 Per-token loss scaled by world size: 0.002001277171075344
 Epoch: 4, Step: 60, Rank: 3, loss = 0.04661383479833603
 Epoch: 4, Step: 60, Rank: 4, loss = 0.05906502529978752
 Epoch: 4, Step: 60, Rank: 1, loss = 0.047090981155633926
 Epoch: 4, Step: 60, Rank: 7, loss = 0.06776992231607437
 Epoch: 4, Step: 60, Rank: 5, loss = 0.01797366701066494
 Epoch: 4, Step: 60, Rank: 0, loss = 0.3034254014492035
 Epoch: 4, Step: 60, Rank: 6, loss = 0.14684371650218964
 Per-token loss scaled by world size: 0.0006847438053227961
 Epoch: 4, Step: 60, Rank: 2, loss = 0.0502430759370327
 [2024-07-27 20:06:07,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[1.4016954246529697e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:07,452] [INFO] [timer.py:258:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=31.759205664862847, CurrSamplesPerSec=32.52061781981693, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 4: 100%|██████████| 12/12 [00:24<00:00,  1.46s/it]
 {
    "epoch": 4,
    "step": 60,
    "rank": 0,
    "loss": 0.3034254014492035,
    "overall_throughput": 32.436209882408896,
    "lr": 1.4016954246529697e-05,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 587,
    "batch_size": 16,
    "total_loss": 0.09237820655107498,
    "gradnorm": 1.7326298952102661,
    "weight_norm": 393.4727478027344,
    "timestamp": "2024-07-27T20:06:07.493757"
 }
 Epoch 4: 100%|██████████| 12/12 [00:24<00:00,  2.08s/it]
 total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 1 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 55
 total tokens: 144 num samples: 2 num padding tokens: 12 - rank: 4 max len: 72 min len: 60 avg len: 66.0 num_loss_counted_tokens: 69
 total tokens: 132 num samples: 2 num padding tokens: 3 - rank: 7 max len: 66 min len: 63 avg len: 64.5 num_loss_counted_tokens: 61
 total tokens: 152 num samples: 2 num padding tokens: 11 - rank: 1 max len: 76 min len: 65 avg len: 70.5 num_loss_counted_tokens: 66
 total tokens: 200 num samples: 2 num padding tokens: 54 - rank: 7 max len: 100 min len: 46 avg len: 73.0 num_loss_counted_tokens: 89
 total tokens: 168 num samples: 2 num padding tokens: 11 - rank: 7 max len: 84 min len: 73 avg len: 78.5 num_loss_counted_tokens: 96
 total tokens: 90 num samples: 2 num padding tokens: 2 - rank: 7 max len: 45 min len: 43 avg len: 44.0 num_loss_counted_tokens: 38
 total tokens: 154 num samples: 2 num padding tokens: 11 - rank: 4 max len: 77 min len: 66 avg len: 71.5 num_loss_counted_tokens: 80
 total tokens: 144 num samples: 2 num padding tokens: 14 - rank: 7 max len: 72 min len: 58 avg len: 65.0 num_loss_counted_tokens: 84
 total tokens: 148 num samples: 2 num padding tokens: 25 - rank: 7 max len: 74 min len: 49 avg len: 61.5 num_loss_counted_tokens: 64
 total tokens: 138 num samples: 2 num padding tokens: 15 - rank: 7 max len: 69 min len: 54 avg len: 61.5 num_loss_counted_tokens: 79
 total tokens: 160 num samples: 2 num padding tokens: 10 - rank: 1 max len: 80 min len: 70 avg len: 75.0 num_loss_counted_tokens: 94
 total tokens: 134 num samples: 2 num padding tokens: 7 - rank: 7 max len: 67 min len: 60 avg len: 63.5 num_loss_counted_tokens: 75
 total tokens: 140 num samples: 2 num padding tokens: 17 - rank: 1 max len: 70 min len: 53 avg len: 61.5 num_loss_counted_tokens: 68
 total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 7 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 51
 total tokens: 166 num samples: 2 num padding tokens: 14 - rank: 7 max len: 83 min len: 69 avg len: 76.0 num_loss_counted_tokens: 85
 total tokens: 156 num samples: 2 num padding tokens: 14 - rank: 7 max len: 78 min len: 64 avg len: 71.0 num_loss_counted_tokens: 86
 total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 4 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 65
 total tokens: 158 num samples: 2 num padding tokens: 19 - rank: 4 max len: 79 min len: 60 avg len: 69.5 num_loss_counted_tokens: 69
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 1 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 4 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 75
 total tokens: 118 num samples: 2 num padding tokens: 7 - rank: 4 max len: 59 min len: 52 avg len: 55.5 num_loss_counted_tokens: 60
 total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 1 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 75
 total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 1 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 58
 total tokens: 134 num samples: 2 num padding tokens: 7 - rank: 4 max len: 67 min len: 60 avg len: 63.5 num_loss_counted_tokens: 55
 total tokens: 128 num samples: 2 num padding tokens: 16 - rank: 4 max len: 64 min len: 48 avg len: 56.0 num_loss_counted_tokens: 49
 total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 4 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 62
 total tokens: 140 num samples: 2 num padding tokens: 7 - rank: 4 max len: 70 min len: 63 avg len: 66.5 num_loss_counted_tokens: 58
 total tokens: 162 num samples: 2 num padding tokens: 24 - rank: 4 max len: 81 min len: 57 avg len: 69.0 num_loss_counted_tokens: 87
 total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 5 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 60
 total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 4 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 73
 total tokens: 158 num samples: 2 num padding tokens: 13 - rank: 7 max len: 79 min len: 66 avg len: 72.5 num_loss_counted_tokens: 72
 total tokens: 180 num samples: 2 num padding tokens: 32 - rank: 0 max len: 90 min len: 58 avg len: 74.0 num_loss_counted_tokens: 118
 total tokens: 152 num samples: 2 num padding tokens: 24 - rank: 0 max len: 76 min len: 52 avg len: 64.0 num_loss_counted_tokens: 71
 total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 2 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 61
 total tokens: 172 num samples: 2 num padding tokens: 42 - rank: 2 max len: 86 min len: 44 avg len: 65.0 num_loss_counted_tokens: 70
 total tokens: 188 num samples: 2 num padding tokens: 42 - rank: 0 max len: 94 min len: 52 avg len: 73.0 num_loss_counted_tokens: 87
 total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 5 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 62
 total tokens: 214 num samples: 2 num padding tokens: 47 - rank: 5 max len: 107 min len: 60 avg len: 83.5 num_loss_counted_tokens: 128
 total tokens: 214 num samples: 2 num padding tokens: 59 - rank: 5 max len: 107 min len: 48 avg len: 77.5 num_loss_counted_tokens: 106
 total tokens: 208 num samples: 2 num padding tokens: 58 - rank: 5 max len: 104 min len: 46 avg len: 75.0 num_loss_counted_tokens: 99
 total tokens: 186 num samples: 2 num padding tokens: 43 - rank: 1 max len: 93 min len: 50 avg len: 71.5 num_loss_counted_tokens: 79
 total tokens: 120 num samples: 2 num padding tokens: 1 - rank: 1 max len: 60 min len: 59 avg len: 59.5 num_loss_counted_tokens: 64
 total tokens: 116 num samples: 2 num padding tokens: 8 - rank: 5 max len: 58 min len: 50 avg len: 54.0 num_loss_counted_tokens: 58
 total tokens: 164 num samples: 2 num padding tokens: 24 - rank: 1 max len: 82 min len: 58 avg len: 70.0 num_loss_counted_tokens: 95
 total tokens: 180 num samples: 2 num padding tokens: 15 - rank: 1 max len: 90 min len: 75 avg len: 82.5 num_loss_counted_tokens: 107
 total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 2 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 71
 total tokens: 132 num samples: 2 num padding tokens: 17 - rank: 2 max len: 66 min len: 49 avg len: 57.5 num_loss_counted_tokens: 61
 total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 2 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 78
 total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 61
 total tokens: 160 num samples: 2 num padding tokens: 22 - rank: 5 max len: 80 min len: 58 avg len: 69.0 num_loss_counted_tokens: 75
 total tokens: 228 num samples: 2 num padding tokens: 54 - rank: 5 max len: 114 min len: 60 avg len: 87.0 num_loss_counted_tokens: 122
 total tokens: 118 num samples: 2 num padding tokens: 10 - rank: 5 max len: 59 min len: 49 avg len: 54.0 num_loss_counted_tokens: 54
 total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 5 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 64
 total tokens: 130 num samples: 2 num padding tokens: 10 - rank: 0 max len: 65 min len: 55 avg len: 60.0 num_loss_counted_tokens: 63
 total tokens: 188 num samples: 2 num padding tokens: 46 - rank: 0 max len: 94 min len: 48 avg len: 71.0 num_loss_counted_tokens: 85
 total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 5 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 99
 total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 0 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 52
 total tokens: 138 num samples: 2 num padding tokens: 24 - rank: 0 max len: 69 min len: 45 avg len: 57.0 num_loss_counted_tokens: 56
 total tokens: 136 num samples: 2 num padding tokens: 18 - rank: 0 max len: 68 min len: 50 avg len: 59.0 num_loss_counted_tokens: 61
 total tokens: 216 num samples: 2 num padding tokens: 45 - rank: 0 max len: 108 min len: 63 avg len: 85.5 num_loss_counted_tokens: 104
 total tokens: 122 num samples: 2 num padding tokens: 4 - rank: 3 max len: 61 min len: 57 avg len: 59.0 num_loss_counted_tokens: 56
 total tokens: 180 num samples: 2 num padding tokens: 24 - rank: 0 max len: 90 min len: 66 avg len: 78.0 num_loss_counted_tokens: 97
 total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 3 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 80
 total tokens: 282 num samples: 2 num padding tokens: 48 - rank: 3 max len: 141 min len: 93 avg len: 117.0 num_loss_counted_tokens: 204
 total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 6 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 67
 total tokens: 168 num samples: 2 num padding tokens: 24 - rank: 3 max len: 84 min len: 60 avg len: 72.0 num_loss_counted_tokens: 95
 total tokens: 226 num samples: 2 num padding tokens: 32 - rank: 0 max len: 113 min len: 81 avg len: 97.0 num_loss_counted_tokens: 119
 total tokens: 174 num samples: 2 num padding tokens: 32 - rank: 3 max len: 87 min len: 55 avg len: 71.0 num_loss_counted_tokens: 83
 total tokens: 172 num samples: 2 num padding tokens: 4 - rank: 3 max len: 86 min len: 82 avg len: 84.0 num_loss_counted_tokens: 81
 total tokens: 122 num samples: 2 num padding tokens: 8 - rank: 3 max len: 61 min len: 53 avg len: 57.0 num_loss_counted_tokens: 54
 total tokens: 136 num samples: 2 num padding tokens: 2 - rank: 3 max len: 68 min len: 66 avg len: 67.0 num_loss_counted_tokens: 63
 total tokens: 122 num samples: 2 num padding tokens: 8 - rank: 3 max len: 61 min len: 53 avg len: 57.0 num_loss_counted_tokens: 53
 total tokens: 116 num samples: 2 num padding tokens: 1 - rank: 6 max len: 58 min len: 57 avg len: 57.5 num_loss_counted_tokens: 67
 total tokens: 154 num samples: 2 num padding tokens: 4 - rank: 6 max len: 77 min len: 73 avg len: 75.0 num_loss_counted_tokens: 92
 total tokens: 194 num samples: 2 num padding tokens: 45 - rank: 2 max len: 97 min len: 52 avg len: 74.5 num_loss_counted_tokens: 88
 total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 2 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 79
 total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 3 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 62
 total tokens: 196 num samples: 2 num padding tokens: 35 - rank: 2 max len: 98 min len: 63 avg len: 80.5 num_loss_counted_tokens: 104
 total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 6 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 90
 total tokens: 148 num samples: 2 num padding tokens: 13 - rank: 2 max len: 74 min len: 61 avg len: 67.5 num_loss_counted_tokens: 69
 total tokens: 202 num samples: 2 num padding tokens: 50 - rank: 5 max len: 101 min len: 51 avg len: 76.0 num_loss_counted_tokens: 96
 total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 2 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 53
 total tokens: 128 num samples: 2 num padding tokens: 14 - rank: 6 max len: 64 min len: 50 avg len: 57.0 num_loss_counted_tokens: 58
 total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 6 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 55
 total tokens: 176 num samples: 2 num padding tokens: 17 - rank: 3 max len: 88 min len: 71 avg len: 79.5 num_loss_counted_tokens: 98
 total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 6 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
 total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 6 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 68
 total tokens: 128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 64 min len: 62 avg len: 63.0 num_loss_counted_tokens: 80
 total tokens: 134 num samples: 2 num padding tokens: 22 - rank: 6 max len: 67 min len: 45 avg len: 56.0 num_loss_counted_tokens: 57
 total tokens: 166 num samples: 2 num padding tokens: 23 - rank: 6 max len: 83 min len: 60 avg len: 71.5 num_loss_counted_tokens: 90
 total tokens: 142 num samples: 2 num padding tokens: 3 - rank: 6 max len: 71 min len: 68 avg len: 69.5 num_loss_counted_tokens: 70
 total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 1 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 60
 total tokens: 146 num samples: 2 num padding tokens: 29 - rank: 2 max len: 73 min len: 44 avg len: 58.5 num_loss_counted_tokens: 61
 total tokens: 152 num samples: 2 num padding tokens: 22 - rank: 3 max len: 76 min len: 54 avg len: 65.0 num_loss_counted_tokens: 86
 total tokens: 244 num samples: 2 num padding tokens: 36 - rank: 6 max len: 122 min len: 86 avg len: 104.0 num_loss_counted_tokens: 139
 Per-token loss scaled by world size: 0.0025800205767154694
 Per-token loss scaled by world size: 0.0006805358571000397
 Per-token loss scaled by world size: 0.0009809250477701426
 Per-token loss scaled by world size: 0.0011542694410309196
 Per-token loss scaled by world size: 0.0011356660397723317
 Per-token loss scaled by world size: 8.372703450731933e-05
 Per-token loss scaled by world size: 0.0002341267536394298
 Epoch: 5, Step: 61, Rank: 1, loss = 0.07038137316703796
 Epoch: 5, Step: 61, Rank: 5, loss = 0.08281882852315903
 Epoch: 5, Step: 61, Rank: 3, loss = 0.1851164698600769
 Epoch: 5, Step: 61, Rank: 2, loss = 0.048828449100255966
 Epoch: 5, Step: 61, Rank: 0, loss = 0.08148403465747833
 Epoch: 5, Step: 61, Rank: 7, loss = 0.006007414776831865
 Epoch: 5, Step: 61, Rank: 4, loss = 0.0167985949665308
 Per-token loss scaled by world size: 0.0015900362050160766
 Epoch: 5, Step: 61, Rank: 6, loss = 0.11408510059118271
 [2024-07-27 20:06:08,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=61, skipped=0, lr=[1.3711972489182208e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:08,486] [INFO] [timer.py:258:stop] epoch=0/micro_step=61/global_step=61, RunningAvgSamplesPerSec=31.720545490554542, CurrSamplesPerSec=29.62867677087431, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 61,
    "rank": 0,
    "loss": 0.08148403465747833,
    "overall_throughput": 29.523344117535782,
    "lr": 1.3711972489182208e-05,
    "cuda_mem_allocated": 22.004770278930664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 574,
    "batch_size": 16,
    "total_loss": 0.07569002360105515,
    "gradnorm": 1.5541268587112427,
    "weight_norm": 393.4730224609375,
    "timestamp": "2024-07-27T20:06:08.490189"
 }
 Per-token loss scaled by world size: 0.0006772524793632329
 Per-token loss scaled by world size: 0.0011761346831917763
 Per-token loss scaled by world size: 0.0012851222418248653
 Per-token loss scaled by world size: 0.0015470795333385468
 Per-token loss scaled by world size: 0.001160036656074226
 Per-token loss scaled by world size: 0.0007557208882644773
 Per-token loss scaled by world size: 0.0015825566370040178
 Epoch: 5, Step: 62, Rank: 1, loss = 0.0887981727719307
 Epoch: 5, Step: 62, Rank: 4, loss = 0.09702672809362411
 Epoch: 5, Step: 62, Rank: 2, loss = 0.11680450290441513
 Epoch: 5, Step: 62, Rank: 0, loss = 0.05113256350159645
 Epoch: 5, Step: 62, Rank: 7, loss = 0.05705692619085312
 Epoch: 5, Step: 62, Rank: 5, loss = 0.11948302388191223
 Epoch: 5, Step: 62, Rank: 6, loss = 0.08758276700973511
 Per-token loss scaled by world size: 0.000659986340906471
 Epoch: 5, Step: 62, Rank: 3, loss = 0.049828968942165375
 [2024-07-27 20:06:08,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=62, skipped=0, lr=[1.3402931744416432e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:09,058] [INFO] [timer.py:258:stop] epoch=0/micro_step=62/global_step=62, RunningAvgSamplesPerSec=31.69959766825067, CurrSamplesPerSec=30.510810812039587, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 62,
    "rank": 0,
    "loss": 0.05113256350159645,
    "overall_throughput": 30.460830095500928,
    "lr": 1.3402931744416432e-05,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 604,
    "batch_size": 16,
    "total_loss": 0.08346420526504517,
    "gradnorm": 1.3599183559417725,
    "weight_norm": 393.4732971191406,
    "timestamp": "2024-07-27T20:06:09.061604"
 }
 Per-token loss scaled by world size: 0.0010878611356019974
 Per-token loss scaled by world size: 0.0003478115249890834
 Per-token loss scaled by world size: 0.0022740615531802177
 Per-token loss scaled by world size: 0.0003918901493307203
 Per-token loss scaled by world size: 0.002286511706188321
 Per-token loss scaled by world size: 0.0006958367303013802
 Per-token loss scaled by world size: 0.0007736408151686192
 Epoch: 5, Step: 63, Rank: 2, loss = 0.03065088950097561
 Epoch: 5, Step: 63, Rank: 5, loss = 0.2004016786813736
 Epoch: 5, Step: 63, Rank: 0, loss = 0.2014988511800766
 Epoch: 5, Step: 63, Rank: 1, loss = 0.06132061034440994
 Epoch: 5, Step: 63, Rank: 4, loss = 0.03453531861305237
 Epoch: 5, Step: 63, Rank: 3, loss = 0.09586776047945023
 Epoch: 5, Step: 63, Rank: 6, loss = 0.06817709654569626
 Per-token loss scaled by world size: 0.0005193906254135072
 Epoch: 5, Step: 63, Rank: 7, loss = 0.04577130079269409
 [2024-07-27 20:06:09,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=63, skipped=0, lr=[1.3090169943749475e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:09,591] [INFO] [timer.py:258:stop] epoch=0/micro_step=63/global_step=63, RunningAvgSamplesPerSec=31.71584398677769, CurrSamplesPerSec=32.72206448467118, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 63,
    "rank": 0,
    "loss": 0.2014988511800766,
    "overall_throughput": 32.65121769063287,
    "lr": 1.3090169943749475e-05,
    "cuda_mem_allocated": 22.00572395324707,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 705,
    "batch_size": 16,
    "total_loss": 0.09227793663740158,
    "gradnorm": 2.188631534576416,
    "weight_norm": 393.4734802246094,
    "timestamp": "2024-07-27T20:06:09.633806"
 }
 Per-token loss scaled by world size: 0.0009381945710629225
 Per-token loss scaled by world size: 0.00045316756586544216
 Per-token loss scaled by world size: 0.0005594724207185209
 Per-token loss scaled by world size: 0.0003057016583625227
 Per-token loss scaled by world size: 0.0005305999657139182
 Per-token loss scaled by world size: 0.00392846018075943
 Per-token loss scaled by world size: 0.0012796723749488592
 Epoch: 5, Step: 64, Rank: 0, loss = 0.0768146812915802
 Epoch: 5, Step: 64, Rank: 6, loss = 0.03710309416055679
 Epoch: 5, Step: 64, Rank: 4, loss = 0.043442871421575546
 Epoch: 5, Step: 64, Rank: 5, loss = 0.045806802809238434
 Epoch: 5, Step: 64, Rank: 3, loss = 0.32164266705513
 Epoch: 5, Step: 64, Rank: 1, loss = 0.025029323995113373
 Epoch: 5, Step: 64, Rank: 2, loss = 0.10477317124605179
 Per-token loss scaled by world size: 0.00109212682582438
 Epoch: 5, Step: 64, Rank: 7, loss = 0.08941788226366043
 [2024-07-27 20:06:10,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=64, skipped=0, lr=[1.2774029087618448e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:10,141] [INFO] [timer.py:258:stop] epoch=0/micro_step=64/global_step=64, RunningAvgSamplesPerSec=31.71981083700255, CurrSamplesPerSec=31.963679576668167, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 64,
    "rank": 0,
    "loss": 0.0768146812915802,
    "overall_throughput": 31.90590190730254,
    "lr": 1.2774029087618448e-05,
    "cuda_mem_allocated": 21.99880838394165,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 655,
    "batch_size": 16,
    "total_loss": 0.09300381690263748,
    "gradnorm": 1.6255645751953125,
    "weight_norm": 393.4736633300781,
    "timestamp": "2024-07-27T20:06:10.186141"
 }
 Per-token loss scaled by world size: 0.0019039374310523272
 Per-token loss scaled by world size: 0.0005504547152668238
 Per-token loss scaled by world size: 0.001485039945691824
 Per-token loss scaled by world size: 0.0013535844627767801
 Per-token loss scaled by world size: 0.000591020449064672
 Per-token loss scaled by world size: 0.0014079039683565497
 Epoch: 5, Step: 65, Rank: 3, loss = 0.042040977627038956
 Epoch: 5, Step: 65, Rank: 0, loss = 0.11341992765665054
 Epoch: 5, Step: 65, Rank: 7, loss = 0.10338001698255539
 Epoch: 5, Step: 65, Rank: 4, loss = 0.04513918608427048
 Epoch: 5, Step: 65, Rank: 1, loss = 0.14541321992874146
 Per-token loss scaled by world size: 0.0026307932566851377
 Epoch: 5, Step: 65, Rank: 5, loss = 0.20092684030532837
 Epoch: 5, Step: 65, Rank: 2, loss = 0.10752866417169571
 Per-token loss scaled by world size: 0.002622765488922596
 Epoch: 5, Step: 65, Rank: 6, loss = 0.2003137171268463
 [2024-07-27 20:06:10,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=65, skipped=0, lr=[1.2454854871407993e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:10,698] [INFO] [timer.py:258:stop] epoch=0/micro_step=65/global_step=65, RunningAvgSamplesPerSec=31.71541920662633, CurrSamplesPerSec=31.44549285353818, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 65,
    "rank": 0,
    "loss": 0.11341992765665054,
    "overall_throughput": 31.39161267024543,
    "lr": 1.2454854871407993e-05,
    "cuda_mem_allocated": 22.00572395324707,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 611,
    "batch_size": 16,
    "total_loss": 0.11977030336856842,
    "gradnorm": 1.5310957431793213,
    "weight_norm": 393.4738464355469,
    "timestamp": "2024-07-27T20:06:10.740662"
 }
 Per-token loss scaled by world size: 0.0006399019039236009
 Per-token loss scaled by world size: 0.0005316854221746325
 Per-token loss scaled by world size: 0.0012345308205112815
 Per-token loss scaled by world size: 0.00044449279084801674
 Per-token loss scaled by world size: 0.0006190972053445876
 Per-token loss scaled by world size: 0.0016892498824745417
 Epoch: 5, Step: 66, Rank: 0, loss = 0.03701859712600708
 Epoch: 5, Step: 66, Rank: 4, loss = 0.030947810038924217
 Epoch: 5, Step: 66, Rank: 7, loss = 0.0859542116522789
 Epoch: 5, Step: 66, Rank: 5, loss = 0.1176140233874321
 Epoch: 5, Step: 66, Rank: 2, loss = 0.04310464486479759
 Epoch: 5, Step: 66, Rank: 1, loss = 0.04455316811800003
 Per-token loss scaled by world size: 0.0005175694241188467
 Epoch: 5, Step: 66, Rank: 3, loss = 0.036035772413015366
 Per-token loss scaled by world size: 0.0009789945324882865
 Epoch: 5, Step: 66, Rank: 6, loss = 0.06816249340772629
 [2024-07-27 20:06:11,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=66, skipped=0, lr=[1.213299630743747e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:11,253] [INFO] [timer.py:258:stop] epoch=0/micro_step=66/global_step=66, RunningAvgSamplesPerSec=31.711479482763888, CurrSamplesPerSec=31.465234804674058, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 1056
 {
    "epoch": 5,
    "step": 66,
    "rank": 0,
    "loss": 0.03701859712600708,
    "overall_throughput": 31.4168749534778,
    "lr": 1.213299630743747e-05,
    "cuda_mem_allocated": 22.009064197540283,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 557,
    "batch_size": 16,
    "total_loss": 0.05792384222149849,
    "gradnorm": 1.2862759828567505,
    "weight_norm": 393.4739685058594,
    "timestamp": "2024-07-27T20:06:11.256198"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1056
 [20:06:29] INFO     saving took 17.98075246810913 seconds                                                                                                                                                                         utils.py:611
 Per-token loss scaled by world size: 0.0013491392601281404
 Per-token loss scaled by world size: 0.0005269849789328873
 Per-token loss scaled by world size: 0.004546341486275196
 Per-token loss scaled by world size: 0.000828504154924303369s/it]
 Per-token loss scaled by world size: 0.0013991020387038589
 Per-token loss scaled by world size: 0.0006301040411926806
 Per-token loss scaled by world size: 0.0007652370841242373
 Epoch: 5, Step: 67, Rank: 1, loss = 0.06306988000869751
 Epoch: 5, Step: 67, Rank: 2, loss = 0.34609025716781616
 Epoch: 5, Step: 67, Rank: 4, loss = 0.10650664567947388
 Epoch: 5, Step: 67, Rank: 0, loss = 0.10270322859287262
 Epoch: 5, Step: 67, Rank: 3, loss = 0.04011673107743263
 Epoch: 5, Step: 67, Rank: 7, loss = 0.047966670244932175
 Epoch: 5, Step: 67, Rank: 5, loss = 0.05825367197394371
 Per-token loss scaled by world size: 0.0012170199770480394
 Epoch: 5, Step: 67, Rank: 6, loss = 0.09264564514160156
 [2024-07-27 20:06:29,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=67, skipped=0, lr=[1.1808805343321102e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:29,799] [INFO] [timer.py:258:stop] epoch=0/micro_step=67/global_step=67, RunningAvgSamplesPerSec=31.70099434878959, CurrSamplesPerSec=31.044068891151483, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 67,
    "rank": 0,
    "loss": 0.10270322859287262,
    "overall_throughput": 30.986774873645142,
    "lr": 1.1808805343321102e-05,
    "cuda_mem_allocated": 22.004770278930664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 609,
    "batch_size": 16,
    "total_loss": 0.10716909170150757,
    "gradnorm": 1.8347676992416382,
    "weight_norm": 393.4740905761719,
    "timestamp": "2024-07-27T20:06:29.841377"
 }
 Per-token loss scaled by world size: 0.0020153727382421494
 Per-token loss scaled by world size: 0.0002941747079603374
 Per-token loss scaled by world size: 0.0007493247976526618
 Per-token loss scaled by world size: 0.0010664670262485743
 Per-token loss scaled by world size: 0.0009130059042945504
 Per-token loss scaled by world size: 0.0006243651150725782
 Per-token loss scaled by world size: 0.0005120610003359616
 Epoch: 5, Step: 68, Rank: 0, loss = 0.168031707406044
 Epoch: 5, Step: 68, Rank: 5, loss = 0.06247495487332344
 Epoch: 5, Step: 68, Rank: 3, loss = 0.08891668915748596
 Epoch: 5, Step: 68, Rank: 7, loss = 0.07612186670303345
 Epoch: 5, Step: 68, Rank: 2, loss = 0.024526815861463547
 Epoch: 5, Step: 68, Rank: 4, loss = 0.042693085968494415
 Epoch: 5, Step: 68, Rank: 6, loss = 0.052056439220905304
 Per-token loss scaled by world size: 0.0004039716732222587
 Epoch: 5, Step: 68, Rank: 1, loss = 0.03368113934993744
 [2024-07-27 20:06:30,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=68, skipped=0, lr=[1.148263647711842e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:30,343] [INFO] [timer.py:258:stop] epoch=0/micro_step=68/global_step=68, RunningAvgSamplesPerSec=31.70601140764558, CurrSamplesPerSec=32.035561937422905, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 68,
    "rank": 0,
    "loss": 0.168031707406044,
    "overall_throughput": 31.980130143367383,
    "lr": 1.148263647711842e-05,
    "cuda_mem_allocated": 21.998568058013916,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 667,
    "batch_size": 16,
    "total_loss": 0.06856284290552139,
    "gradnorm": 1.0227607488632202,
    "weight_norm": 393.47418212890625,
    "timestamp": "2024-07-27T20:06:30.390363"
 }
 Per-token loss scaled by world size: 0.0027562566101551056
 Per-token loss scaled by world size: 0.0018840961856767535
 Per-token loss scaled by world size: 0.0018555921269580722
 Per-token loss scaled by world size: 0.0010745518375188112
 Per-token loss scaled by world size: 0.0009415823733434081
 Per-token loss scaled by world size: 0.0031567809637635946
 Per-token loss scaled by world size: 0.0009981651091948152
 Epoch: 5, Step: 69, Rank: 3, loss = 0.12411483377218246
 Epoch: 5, Step: 69, Rank: 6, loss = 0.07078610360622406
 Epoch: 5, Step: 69, Rank: 5, loss = 0.12223713099956512
 Epoch: 5, Step: 69, Rank: 1, loss = 0.20795294642448425
 Epoch: 5, Step: 69, Rank: 0, loss = 0.0620267391204834
 Epoch: 5, Step: 69, Rank: 7, loss = 0.18156839907169342
 Epoch: 5, Step: 69, Rank: 4, loss = 0.06575412303209305
 Per-token loss scaled by world size: 0.00018431547505315393
 Epoch: 5, Step: 69, Rank: 2, loss = 0.012141781859099865
 [2024-07-27 20:06:30,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=69, skipped=0, lr=[1.1154846369695864e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:30,877] [INFO] [timer.py:258:stop] epoch=0/micro_step=69/global_step=69, RunningAvgSamplesPerSec=31.723357888730654, CurrSamplesPerSec=32.911763984671325, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 69,
    "rank": 0,
    "loss": 0.0620267391204834,
    "overall_throughput": 32.829137857318266,
    "lr": 1.1154846369695864e-05,
    "cuda_mem_allocated": 21.999523639678955,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 527,
    "batch_size": 16,
    "total_loss": 0.10582275688648224,
    "gradnorm": 2.0553536415100098,
    "weight_norm": 393.4742431640625,
    "timestamp": "2024-07-27T20:06:30.921842"
 }
 Per-token loss scaled by world size: 0.0010255835950374603
 Per-token loss scaled by world size: 0.0015445285243913531
 Per-token loss scaled by world size: 0.0007211874471977353
 Per-token loss scaled by world size: 0.0005934142973273993
 Per-token loss scaled by world size: 0.0017620512517169118
 Per-token loss scaled by world size: 0.000368919427273795
 Per-token loss scaled by world size: 0.0008395504555664957
 Epoch: 5, Step: 70, Rank: 7, loss = 0.11294364929199219
 Epoch: 5, Step: 70, Rank: 1, loss = 0.04339342191815376
 Epoch: 5, Step: 70, Rank: 6, loss = 0.052736829966306686
 Epoch: 5, Step: 70, Rank: 0, loss = 0.026977233588695526
 Epoch: 5, Step: 70, Rank: 5, loss = 0.12884999811649323
 Epoch: 5, Step: 70, Rank: 2, loss = 0.07499580085277557
 Epoch: 5, Step: 70, Rank: 4, loss = 0.061392128467559814
 Per-token loss scaled by world size: 0.0018547051586210728
 Epoch: 5, Step: 70, Rank: 3, loss = 0.13562531769275665
 [2024-07-27 20:06:31,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[1.0825793454723325e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:31,432] [INFO] [timer.py:258:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=31.71917367216476, CurrSamplesPerSec=31.441323528309383, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 70,
    "rank": 0,
    "loss": 0.026977233588695526,
    "overall_throughput": 31.36866352554035,
    "lr": 1.0825793454723325e-05,
    "cuda_mem_allocated": 21.999762058258057,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 585,
    "batch_size": 16,
    "total_loss": 0.0796142965555191,
    "gradnorm": 1.7012439966201782,
    "weight_norm": 393.4743347167969,
    "timestamp": "2024-07-27T20:06:31.476165"
 }
 Per-token loss scaled by world size: 0.0009147366508841515
 Per-token loss scaled by world size: 0.0017351489514112473
 Per-token loss scaled by world size: 0.0008338880725204945
 Per-token loss scaled by world size: 0.00024312795721925795
 Per-token loss scaled by world size: 0.0006241968367248774
 Per-token loss scaled by world size: 0.00024290102010127157
 Per-token loss scaled by world size: 0.0020128381438553333
 Epoch: 5, Step: 71, Rank: 5, loss = 0.05847639963030815
 Epoch: 5, Step: 71, Rank: 1, loss = 0.12167732417583466
 Epoch: 5, Step: 71, Rank: 3, loss = 0.01704934798181057
 Epoch: 5, Step: 71, Rank: 2, loss = 0.04377180337905884
 Epoch: 5, Step: 71, Rank: 6, loss = 0.06414590775966644
 Epoch: 5, Step: 71, Rank: 7, loss = 0.01703343354165554
 Epoch: 5, Step: 71, Rank: 4, loss = 0.14115028083324432
 Per-token loss scaled by world size: 0.0004984893603250384
 Epoch: 5, Step: 71, Rank: 0, loss = 0.034956566989421844
 [2024-07-27 20:06:31,890] [INFO] [logging.py:96:log_dist] [Rank 0] step=71, skipped=0, lr=[1.0495837546732224e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:31,968] [INFO] [timer.py:258:stop] epoch=0/micro_step=71/global_step=71, RunningAvgSamplesPerSec=31.731589241737456, CurrSamplesPerSec=32.599273292528906, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 71,
    "rank": 0,
    "loss": 0.034956566989421844,
    "overall_throughput": 32.517245673379065,
    "lr": 1.0495837546732224e-05,
    "cuda_mem_allocated": 21.998329639434814,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 561,
    "batch_size": 16,
    "total_loss": 0.062282636761665344,
    "gradnorm": 0.9715697765350342,
    "weight_norm": 393.47442626953125,
    "timestamp": "2024-07-27T20:06:32.016067"
 }
 Per-token loss scaled by world size: 0.0011429809965193272
 Per-token loss scaled by world size: 0.0009149574325419962
 Per-token loss scaled by world size: 0.0004773043910972774
 Per-token loss scaled by world size: 0.0027895078528672457
 Per-token loss scaled by world size: 0.004009348340332508
 Per-token loss scaled by world size: 0.0015114195412024856
 Per-token loss scaled by world size: 0.0012063757749274373
 Epoch: 5, Step: 72, Rank: 6, loss = 0.34730979800224304
 Epoch: 5, Step: 72, Rank: 2, loss = 0.07925818860530853
 Epoch: 5, Step: 72, Rank: 0, loss = 0.04134649410843849
 Epoch: 5, Step: 72, Rank: 5, loss = 0.09901072829961777
 Epoch: 5, Step: 72, Rank: 3, loss = 0.10450230538845062
 Epoch: 5, Step: 72, Rank: 7, loss = 0.130926713347435
 Epoch: 5, Step: 72, Rank: 1, loss = 0.2416411191225052
 Per-token loss scaled by world size: 0.0014137992402538657
 Epoch: 5, Step: 72, Rank: 4, loss = 0.12247036397457123
 [2024-07-27 20:06:32,427] [INFO] [logging.py:96:log_dist] [Rank 0] step=72, skipped=0, lr=[1.0165339447663586e-05], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:32,504] [INFO] [timer.py:258:stop] epoch=0/micro_step=72/global_step=72, RunningAvgSamplesPerSec=31.74744391309924, CurrSamplesPerSec=32.88104464616879, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 5,
    "step": 72,
    "rank": 0,
    "loss": 0.04134649410843849,
    "overall_throughput": 32.79027112653174,
    "lr": 1.0165339447663586e-05,
    "cuda_mem_allocated": 22.01025676727295,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 693,
    "batch_size": 16,
    "total_loss": 0.14580821990966797,
    "gradnorm": 1.6911654472351074,
    "weight_norm": 393.4745178222656,
    "timestamp": "2024-07-27T20:06:32.547687"
 }
 Epoch 5: 100%|██████████| 12/12 [00:25<00:00,  2.09s/it]
 total tokens: 214 num samples: 2 num padding tokens: 23 - rank: 1 max len: 107 min len: 84 avg len: 95.5 num_loss_counted_tokens: 132
 total tokens: 282 num samples: 2 num padding tokens: 83 - rank: 6 max len: 141 min len: 58 avg len: 99.5 num_loss_counted_tokens: 145
 total tokens: 144 num samples: 2 num padding tokens: 27 - rank: 7 max len: 72 min len: 45 avg len: 58.5 num_loss_counted_tokens: 73
 total tokens: 118 num samples: 2 num padding tokens: 9 - rank: 1 max len: 59 min len: 50 avg len: 54.5 num_loss_counted_tokens: 51
 total tokens: 172 num samples: 2 num padding tokens: 19 - rank: 7 max len: 86 min len: 67 avg len: 76.5 num_loss_counted_tokens: 75
 total tokens: 148 num samples: 2 num padding tokens: 17 - rank: 0 max len: 74 min len: 57 avg len: 65.5 num_loss_counted_tokens: 73
 total tokens: 138 num samples: 2 num padding tokens: 1 - rank: 7 max len: 69 min len: 68 avg len: 68.5 num_loss_counted_tokens: 57
 total tokens: 106 num samples: 2 num padding tokens: 5 - rank: 1 max len: 53 min len: 48 avg len: 50.5 num_loss_counted_tokens: 46
 total tokens: 160 num samples: 2 num padding tokens: 18 - rank: 0 max len: 80 min len: 62 avg len: 71.0 num_loss_counted_tokens: 81
 total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 7 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 77
 total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 7 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 92
 total tokens: 188 num samples: 2 num padding tokens: 19 - rank: 0 max len: 94 min len: 75 avg len: 84.5 num_loss_counted_tokens: 99
 total tokens: 138 num samples: 2 num padding tokens: 5 - rank: 2 max len: 69 min len: 64 avg len: 66.5 num_loss_counted_tokens: 70
 total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 3 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 128
 total tokens: 162 num samples: 2 num padding tokens: 18 - rank: 0 max len: 81 min len: 63 avg len: 72.0 num_loss_counted_tokens: 82
 total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 6 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 54
 total tokens: 128 num samples: 2 num padding tokens: 11 - rank: 3 max len: 64 min len: 53 avg len: 58.5 num_loss_counted_tokens: 67
 total tokens: 214 num samples: 2 num padding tokens: 31 - rank: 1 max len: 107 min len: 76 avg len: 91.5 num_loss_counted_tokens: 117
 total tokens: 200 num samples: 2 num padding tokens: 10 - rank: 7 max len: 100 min len: 90 avg len: 95.0 num_loss_counted_tokens: 151
 total tokens: 244 num samples: 2 num padding tokens: 70 - rank: 6 max len: 122 min len: 52 avg len: 87.0 num_loss_counted_tokens: 113
 total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 0 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 75
 total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 6 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 62
 total tokens: 176 num samples: 2 num padding tokens: 8 - rank: 2 max len: 88 min len: 80 avg len: 84.0 num_loss_counted_tokens: 99
 total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 7 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 65
 total tokens: 152 num samples: 2 num padding tokens: 14 - rank: 7 max len: 76 min len: 62 avg len: 69.0 num_loss_counted_tokens: 83
 total tokens: 208 num samples: 2 num padding tokens: 46 - rank: 7 max len: 104 min len: 58 avg len: 81.0 num_loss_counted_tokens: 107
 total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 7 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 55
 total tokens: 148 num samples: 2 num padding tokens: 10 - rank: 1 max len: 74 min len: 64 avg len: 69.0 num_loss_counted_tokens: 73
 total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 0 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 91
 total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 6 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 78
 total tokens: 168 num samples: 2 num padding tokens: 36 - rank: 2 max len: 84 min len: 48 avg len: 66.0 num_loss_counted_tokens: 72
 total tokens: 130 num samples: 2 num padding tokens: 17 - rank: 0 max len: 65 min len: 48 avg len: 56.5 num_loss_counted_tokens: 62
 total tokens: 186 num samples: 2 num padding tokens: 42 - rank: 0 max len: 93 min len: 51 avg len: 72.0 num_loss_counted_tokens: 96
 total tokens: 140 num samples: 2 num padding tokens: 17 - rank: 0 max len: 70 min len: 53 avg len: 61.5 num_loss_counted_tokens: 61
 total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 0 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 76
 total tokens: 104 num samples: 2 num padding tokens: 2 - rank: 0 max len: 52 min len: 50 avg len: 51.0 num_loss_counted_tokens: 61
 total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 7 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 62
 total tokens: 188 num samples: 2 num padding tokens: 39 - rank: 2 max len: 94 min len: 55 avg len: 74.5 num_loss_counted_tokens: 95
 total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 2 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 53
 total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 6 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 72
 total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 7 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 71
 total tokens: 216 num samples: 2 num padding tokens: 21 - rank: 0 max len: 108 min len: 87 avg len: 97.5 num_loss_counted_tokens: 124
 total tokens: 104 num samples: 2 num padding tokens: 8 - rank: 6 max len: 52 min len: 44 avg len: 48.0 num_loss_counted_tokens: 52
 total tokens: 134 num samples: 2 num padding tokens: 8 - rank: 2 max len: 67 min len: 59 avg len: 63.0 num_loss_counted_tokens: 71
 total tokens: 168 num samples: 2 num padding tokens: 25 - rank: 2 max len: 84 min len: 59 avg len: 71.5 num_loss_counted_tokens: 89
 total tokens: 146 num samples: 2 num padding tokens: 19 - rank: 2 max len: 73 min len: 54 avg len: 63.5 num_loss_counted_tokens: 80
 total tokens: 154 num samples: 2 num padding tokens: 27 - rank: 2 max len: 77 min len: 50 avg len: 63.5 num_loss_counted_tokens: 83
 total tokens: 164 num samples: 2 num padding tokens: 36 - rank: 6 max len: 82 min len: 46 avg len: 64.0 num_loss_counted_tokens: 75
 total tokens: 122 num samples: 2 num padding tokens: 10 - rank: 2 max len: 61 min len: 51 avg len: 56.0 num_loss_counted_tokens: 56
 total tokens: 156 num samples: 2 num padding tokens: 20 - rank: 4 max len: 78 min len: 58 avg len: 68.0 num_loss_counted_tokens: 78
 total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 2 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 60
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 124 num samples: 2 num padding tokens: 18 - rank: 4 max len: 62 min len: 44 avg len: 53.0 num_loss_counted_tokens: 60
 total tokens: 162 num samples: 2 num padding tokens: 19 - rank: 6 max len: 81 min len: 62 avg len: 71.5 num_loss_counted_tokens: 82
 total tokens: 104 num samples: 2 num padding tokens: 6 - rank: 4 max len: 52 min len: 46 avg len: 49.0 num_loss_counted_tokens: 56
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 66
 total tokens: 226 num samples: 2 num padding tokens: 48 - rank: 4 max len: 113 min len: 65 avg len: 89.0 num_loss_counted_tokens: 95
 total tokens: 132 num samples: 2 num padding tokens: 12 - rank: 4 max len: 66 min len: 54 avg len: 60.0 num_loss_counted_tokens: 69
 total tokens: 228 num samples: 2 num padding tokens: 17 - rank: 4 max len: 114 min len: 97 avg len: 105.5 num_loss_counted_tokens: 158
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 6 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 67
 total tokens: 98 num samples: 2 num padding tokens: 3 - rank: 4 max len: 49 min len: 46 avg len: 47.5 num_loss_counted_tokens: 47
 total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 3 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 70
 total tokens: 196 num samples: 2 num padding tokens: 38 - rank: 3 max len: 98 min len: 60 avg len: 79.0 num_loss_counted_tokens: 112
 total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 3 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 75
 total tokens: 120 num samples: 2 num padding tokens: 3 - rank: 3 max len: 60 min len: 57 avg len: 58.5 num_loss_counted_tokens: 59
 total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 59
 total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 3 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
 total tokens: 126 num samples: 2 num padding tokens: 17 - rank: 5 max len: 63 min len: 46 avg len: 54.5 num_loss_counted_tokens: 52
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 5 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
 total tokens: 180 num samples: 2 num padding tokens: 32 - rank: 6 max len: 90 min len: 58 avg len: 74.0 num_loss_counted_tokens: 99
 total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 3 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 45
 total tokens: 166 num samples: 2 num padding tokens: 38 - rank: 1 max len: 83 min len: 45 avg len: 64.0 num_loss_counted_tokens: 56
 total tokens: 144 num samples: 2 num padding tokens: 4 - rank: 3 max len: 72 min len: 68 avg len: 70.0 num_loss_counted_tokens: 60
 total tokens: 138 num samples: 2 num padding tokens: 0 - rank: 3 max len: 69 min len: 69 avg len: 69.0 num_loss_counted_tokens: 75
 total tokens: 186 num samples: 2 num padding tokens: 12 - rank: 4 max len: 93 min len: 81 avg len: 87.0 num_loss_counted_tokens: 131
 total tokens: 184 num samples: 2 num padding tokens: 31 - rank: 1 max len: 92 min len: 61 avg len: 76.5 num_loss_counted_tokens: 87
 total tokens: 142 num samples: 2 num padding tokens: 4 - rank: 4 max len: 71 min len: 67 avg len: 69.0 num_loss_counted_tokens: 59
 total tokens: 126 num samples: 2 num padding tokens: 20 - rank: 5 max len: 63 min len: 43 avg len: 53.0 num_loss_counted_tokens: 42
 total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 5 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 85
 total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 1 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 63
 total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 1 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 75
 total tokens: 148 num samples: 2 num padding tokens: 13 - rank: 1 max len: 74 min len: 61 avg len: 67.5 num_loss_counted_tokens: 69
 total tokens: 202 num samples: 2 num padding tokens: 11 - rank: 1 max len: 101 min len: 90 avg len: 95.5 num_loss_counted_tokens: 138
 total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 5 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 69
 total tokens: 174 num samples: 2 num padding tokens: 27 - rank: 1 max len: 87 min len: 60 avg len: 73.5 num_loss_counted_tokens: 76
 total tokens: 186 num samples: 2 num padding tokens: 31 - rank: 5 max len: 93 min len: 62 avg len: 77.5 num_loss_counted_tokens: 81
 total tokens: 166 num samples: 2 num padding tokens: 33 - rank: 5 max len: 83 min len: 50 avg len: 66.5 num_loss_counted_tokens: 82
 total tokens: 142 num samples: 2 num padding tokens: 16 - rank: 5 max len: 71 min len: 55 avg len: 63.0 num_loss_counted_tokens: 61
 total tokens: 146 num samples: 2 num padding tokens: 28 - rank: 5 max len: 73 min len: 45 avg len: 59.0 num_loss_counted_tokens: 72
 total tokens: 120 num samples: 2 num padding tokens: 11 - rank: 5 max len: 60 min len: 49 avg len: 54.5 num_loss_counted_tokens: 79
 total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 5 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 69
 total tokens: 126 num samples: 2 num padding tokens: 6 - rank: 6 max len: 63 min len: 57 avg len: 60.0 num_loss_counted_tokens: 66
 total tokens: 122 num samples: 2 num padding tokens: 3 - rank: 3 max len: 61 min len: 58 avg len: 59.5 num_loss_counted_tokens: 69
 total tokens: 132 num samples: 2 num padding tokens: 3 - rank: 2 max len: 66 min len: 63 avg len: 64.5 num_loss_counted_tokens: 70
 total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 4 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 55
 total tokens: 134 num samples: 2 num padding tokens: 3 - rank: 5 max len: 67 min len: 64 avg len: 65.5 num_loss_counted_tokens: 61
 Per-token loss scaled by world size: 0.0006204941309988499
 Per-token loss scaled by world size: 0.0005415144260041416
 Per-token loss scaled by world size: 0.0004509067512117326
 Per-token loss scaled by world size: 7.763502799207345e-05
 Per-token loss scaled by world size: 0.0008618941647000611
 Per-token loss scaled by world size: 0.0005943336291238666
 Per-token loss scaled by world size: 0.0004708097840193659
 Epoch: 6, Step: 73, Rank: 6, loss = 0.04921012371778488
 Epoch: 6, Step: 73, Rank: 5, loss = 0.00705508328974247
 Epoch: 6, Step: 73, Rank: 3, loss = 0.05638740584254265
 Epoch: 6, Step: 73, Rank: 2, loss = 0.0409761518239975
 Epoch: 6, Step: 73, Rank: 1, loss = 0.04278483986854553
 Epoch: 6, Step: 73, Rank: 7, loss = 0.07832463085651398
 Epoch: 6, Step: 73, Rank: 0, loss = 0.054010067135095596
 Per-token loss scaled by world size: 0.00044023498776368797
 Epoch: 6, Step: 73, Rank: 4, loss = 0.040006354451179504
 [2024-07-27 20:06:33,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=73, skipped=0, lr=[9.834660552336415e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:33,537] [INFO] [timer.py:258:stop] epoch=0/micro_step=73/global_step=73, RunningAvgSamplesPerSec=31.690802156326086, CurrSamplesPerSec=28.172368613829672, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:   8%|▊         | 1/12 [00:00<00:10,  1.06it/s]
 {
    "epoch": 6,
    "step": 73,
    "rank": 0,
    "loss": 0.054010067135095596,
    "overall_throughput": 28.063288365518915,
    "lr": 9.834660552336415e-06,
    "cuda_mem_allocated": 22.000954627990723,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 727,
    "batch_size": 16,
    "total_loss": 0.04609433189034462,
    "gradnorm": 0.7181567549705505,
    "weight_norm": 393.4746398925781,
    "timestamp": "2024-07-27T20:06:33.581427"
 }
 Per-token loss scaled by world size: 0.00021588351228274405
 Per-token loss scaled by world size: 0.0018644272349774837
 Per-token loss scaled by world size: 0.0009222680237144232
 Per-token loss scaled by world size: 0.0011992761865258217
 Per-token loss scaled by world size: 0.00015600323968101293
 Per-token loss scaled by world size: 0.0007281338912434876
 Per-token loss scaled by world size: 0.0013443040661513805
 Epoch: 6, Step: 74, Rank: 4, loss = 0.1323743313550949
 Epoch: 6, Step: 74, Rank: 7, loss = 0.0516975075006485
 Epoch: 6, Step: 74, Rank: 3, loss = 0.011076229624450207
 Epoch: 6, Step: 74, Rank: 5, loss = 0.015327729284763336
 Epoch: 6, Step: 74, Rank: 2, loss = 0.08514861017465591
 Epoch: 6, Step: 74, Rank: 6, loss = 0.09544558823108673
 Epoch: 6, Step: 74, Rank: 0, loss = 0.0654810294508934
 Per-token loss scaled by world size: 0.0015726651763543487
 Epoch: 6, Step: 74, Rank: 1, loss = 0.1116592288017273
 [2024-07-27 20:06:34,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=74, skipped=0, lr=[9.504162453267776e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:34,080] [INFO] [timer.py:258:stop] epoch=0/micro_step=74/global_step=74, RunningAvgSamplesPerSec=31.69924367021435, CurrSamplesPerSec=32.31030745624361, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  17%|█▋        | 2/12 [00:01<00:07,  1.41it/s]
 {
    "epoch": 6,
    "step": 74,
    "rank": 0,
    "loss": 0.0654810294508934,
    "overall_throughput": 32.25508431115172,
    "lr": 9.504162453267776e-06,
    "cuda_mem_allocated": 22.002385139465332,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 568,
    "batch_size": 16,
    "total_loss": 0.0710262879729271,
    "gradnorm": 1.143301010131836,
    "weight_norm": 393.4747314453125,
    "timestamp": "2024-07-27T20:06:34.123115"
 }
 Per-token loss scaled by world size: 0.0001349089725408703
 Per-token loss scaled by world size: 0.0011630249209702015
 Per-token loss scaled by world size: 0.0005098663968965411
 Per-token loss scaled by world size: 0.001282830722630024
 Per-token loss scaled by world size: 0.0009069825755432248
 Per-token loss scaled by world size: 0.00048159470316022635
 Per-token loss scaled by world size: 0.0003646048135124147
 Epoch: 6, Step: 75, Rank: 5, loss = 0.035117048770189285
 Epoch: 6, Step: 75, Rank: 3, loss = 0.08010333776473999
 Epoch: 6, Step: 75, Rank: 4, loss = 0.08835496753454208
 Epoch: 6, Step: 75, Rank: 6, loss = 0.009291855618357658
 Epoch: 6, Step: 75, Rank: 2, loss = 0.06246842443943024
 Epoch: 6, Step: 75, Rank: 7, loss = 0.025112155824899673
 Epoch: 6, Step: 75, Rank: 0, loss = 0.033169835805892944
 Per-token loss scaled by world size: 0.0011549023911356926
 Epoch: 6, Step: 75, Rank: 1, loss = 0.07954390347003937
 [2024-07-27 20:06:34,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=75, skipped=0, lr=[9.174206545276678e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:34,640] [INFO] [timer.py:258:stop] epoch=0/micro_step=75/global_step=75, RunningAvgSamplesPerSec=31.692720007171083, CurrSamplesPerSec=31.229969737456262, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 6:  25%|██▌       | 3/12 [00:02<00:05,  1.56it/s]
 {
    "epoch": 6,
    "step": 75,
    "rank": 0,
    "loss": 0.033169835805892944,
    "overall_throughput": 31.178925272918942,
    "lr": 9.174206545276678e-06,
    "cuda_mem_allocated": 22.00572395324707,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 551,
    "batch_size": 16,
    "total_loss": 0.0516451895236969,
    "gradnorm": 1.016838788986206,
    "weight_norm": 393.4748229980469,
    "timestamp": "2024-07-27T20:06:34.682642"
 }
 Per-token loss scaled by world size: 0.0003609564446378499
 Per-token loss scaled by world size: 0.00041217487887479365
 Per-token loss scaled by world size: 0.0004959891666658223
 Per-token loss scaled by world size: 0.00047398614697158337
 Per-token loss scaled by world size: 0.0007203637505881488
 Per-token loss scaled by world size: 0.0001487391273258254
 Per-token loss scaled by world size: 0.0008504824945703149
 Epoch: 6, Step: 76, Rank: 0, loss = 0.042779065668582916
 Epoch: 6, Step: 76, Rank: 3, loss = 0.04088130593299866
 Epoch: 6, Step: 76, Rank: 6, loss = 0.031132493168115616
 Epoch: 6, Step: 76, Rank: 7, loss = 0.0621313713490963
 Epoch: 6, Step: 76, Rank: 2, loss = 0.012828749604523182
 Epoch: 6, Step: 76, Rank: 5, loss = 0.035550083965063095
 Epoch: 6, Step: 76, Rank: 1, loss = 0.07335411757230759
 Per-token loss scaled by world size: 0.0007280391291715205
 Epoch: 6, Step: 76, Rank: 4, loss = 0.06279337406158447
 [2024-07-27 20:06:35,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=76, skipped=0, lr=[8.84515363030414e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:35,189] [INFO] [timer.py:258:stop] epoch=0/micro_step=76/global_step=76, RunningAvgSamplesPerSec=31.692239680284427, CurrSamplesPerSec=31.657215099110317, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  33%|███▎      | 4/12 [00:02<00:04,  1.66it/s]
 {
    "epoch": 6,
    "step": 76,
    "rank": 0,
    "loss": 0.042779065668582916,
    "overall_throughput": 31.57946786919451,
    "lr": 8.84515363030414e-06,
    "cuda_mem_allocated": 22.002624034881592,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 690,
    "batch_size": 16,
    "total_loss": 0.04518131911754608,
    "gradnorm": 1.2256078720092773,
    "weight_norm": 393.47491455078125,
    "timestamp": "2024-07-27T20:06:35.231428"
 }
 Per-token loss scaled by world size: 0.0005857766373082995
 Per-token loss scaled by world size: 0.001119819818995893
 Per-token loss scaled by world size: 0.0010905693052336574
 Per-token loss scaled by world size: 0.00018508221546653658
 Per-token loss scaled by world size: 0.0016458512982353568
 Per-token loss scaled by world size: 0.00018191162962466478
 Per-token loss scaled by world size: 0.00047674551024101675
 Epoch: 6, Step: 77, Rank: 7, loss = 0.085386261343956
 Epoch: 6, Step: 77, Rank: 3, loss = 0.12549616396427155
 Epoch: 6, Step: 77, Rank: 6, loss = 0.01387076172977686
 Epoch: 6, Step: 77, Rank: 0, loss = 0.04466547071933746
 Epoch: 6, Step: 77, Rank: 4, loss = 0.014112519100308418
 Epoch: 6, Step: 77, Rank: 1, loss = 0.08315590769052505
 Epoch: 6, Step: 77, Rank: 5, loss = 0.03635184466838837
 Per-token loss scaled by world size: 0.0008063243585638702
 Epoch: 6, Step: 77, Rank: 2, loss = 0.06148223206400871
 [2024-07-27 20:06:35,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=77, skipped=0, lr=[8.51736352288158e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:35,734] [INFO] [timer.py:258:stop] epoch=0/micro_step=77/global_step=77, RunningAvgSamplesPerSec=31.702193912096924, CurrSamplesPerSec=32.45657221649108, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 1232
 {
    "epoch": 6,
    "step": 77,
    "rank": 0,
    "loss": 0.04466547071933746,
    "overall_throughput": 32.40263184704888,
    "lr": 8.51736352288158e-06,
    "cuda_mem_allocated": 22.000000476837158,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 610,
    "batch_size": 16,
    "total_loss": 0.05806514620780945,
    "gradnorm": 1.030696988105774,
    "weight_norm": 393.47503662109375,
    "timestamp": "2024-07-27T20:06:35.737358"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1232
 [20:06:53] INFO     saving took 18.036810636520386 seconds                                                                                                                                                                        utils.py:611
 Epoch 6:  42%|████▏     | 5/12 [00:21<00:49,  7.09s/it]
 Per-token loss scaled by world size: 0.0021851949859410524
 Per-token loss scaled by world size: 0.0004372596740722656
 Per-token loss scaled by world size: 0.0008908362942747772
 Per-token loss scaled by world size: 0.00043337256647646427
 Per-token loss scaled by world size: 0.0002932958595920354
 Per-token loss scaled by world size: 0.0002709754917304963
 Per-token loss scaled by world size: 0.0006071476964280009
 Epoch: 6, Step: 78, Rank: 1, loss = 0.07071013003587723
 Epoch: 6, Step: 78, Rank: 0, loss = 0.034707486629486084
 Epoch: 6, Step: 78, Rank: 4, loss = 0.03439894691109657
 Epoch: 6, Step: 78, Rank: 3, loss = 0.02328035794198513
 Epoch: 6, Step: 78, Rank: 5, loss = 0.021508680656552315
 Epoch: 6, Step: 78, Rank: 6, loss = 0.048192348331213
 Epoch: 6, Step: 78, Rank: 7, loss = 0.17344985902309418
 Per-token loss scaled by world size: 0.0011103027500212193
 Epoch: 6, Step: 78, Rank: 2, loss = 0.08813028037548065
 [2024-07-27 20:06:54,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=78, skipped=0, lr=[8.191194656678905e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:54,330] [INFO] [timer.py:258:stop] epoch=0/micro_step=78/global_step=78, RunningAvgSamplesPerSec=31.696677826343805, CurrSamplesPerSec=31.288371681003333, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  50%|█████     | 6/12 [00:21<00:29,  4.87s/it]
 {
    "epoch": 6,
    "step": 78,
    "rank": 0,
    "loss": 0.034707486629486084,
    "overall_throughput": 31.230609214279298,
    "lr": 8.191194656678905e-06,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 635,
    "batch_size": 16,
    "total_loss": 0.061797261238098145,
    "gradnorm": 1.2869224548339844,
    "weight_norm": 393.47509765625,
    "timestamp": "2024-07-27T20:06:54.372620"
 }
 Per-token loss scaled by world size: 0.0011720252223312855
 Per-token loss scaled by world size: 0.0006744764395989478
 Per-token loss scaled by world size: 0.0009440272697247565
 Per-token loss scaled by world size: 0.0029248518403619528
 Per-token loss scaled by world size: 0.0009407943580299616
 Per-token loss scaled by world size: 0.0013611947651952505
 Per-token loss scaled by world size: 0.0014305550139397383
 Epoch: 6, Step: 79, Rank: 7, loss = 0.04670749232172966
 Epoch: 6, Step: 79, Rank: 0, loss = 0.08116274327039719
 Epoch: 6, Step: 79, Rank: 4, loss = 0.06537389010190964
 Epoch: 6, Step: 79, Rank: 1, loss = 0.20254598557949066
 Epoch: 6, Step: 79, Rank: 2, loss = 0.06515000760555267
 Epoch: 6, Step: 79, Rank: 5, loss = 0.0942627340555191
 Epoch: 6, Step: 79, Rank: 6, loss = 0.09906593710184097
 Per-token loss scaled by world size: 0.000943321269005537
 Epoch: 6, Step: 79, Rank: 3, loss = 0.06532499939203262
 [2024-07-27 20:06:54,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=79, skipped=0, lr=[7.867003692562533e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:54,883] [INFO] [timer.py:258:stop] epoch=0/micro_step=79/global_step=79, RunningAvgSamplesPerSec=31.69457460454822, CurrSamplesPerSec=31.53554234673331, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  58%|█████▊    | 7/12 [00:22<00:17,  3.46s/it]
 {
    "epoch": 6,
    "step": 79,
    "rank": 0,
    "loss": 0.08116274327039719,
    "overall_throughput": 31.485415170212256,
    "lr": 7.867003692562533e-06,
    "cuda_mem_allocated": 21.996094703674316,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 554,
    "batch_size": 16,
    "total_loss": 0.08994922786951065,
    "gradnorm": 1.41256582736969,
    "weight_norm": 393.4751892089844,
    "timestamp": "2024-07-27T20:06:54.931617"
 }
 Per-token loss scaled by world size: 0.0004768831713590771
 Per-token loss scaled by world size: 0.002107172505930066
 Per-token loss scaled by world size: 0.0008781441720202565
 Per-token loss scaled by world size: 0.0014709294773638248
 Per-token loss scaled by world size: 0.00031639524968340993
 Per-token loss scaled by world size: 0.0003654623869806528
 Per-token loss scaled by world size: 0.000409139902330935
 Epoch: 6, Step: 80, Rank: 6, loss = 0.155930757522583
 Epoch: 6, Step: 80, Rank: 3, loss = 0.10884878039360046
 Epoch: 6, Step: 80, Rank: 5, loss = 0.023413248360157013
 Epoch: 6, Step: 80, Rank: 1, loss = 0.06498266756534576
 Epoch: 6, Step: 80, Rank: 4, loss = 0.02704421617090702
 Epoch: 6, Step: 80, Rank: 0, loss = 0.035289354622364044
 Epoch: 6, Step: 80, Rank: 2, loss = 0.030276352539658546
 Per-token loss scaled by world size: 0.002671802882105112
 Epoch: 6, Step: 80, Rank: 7, loss = 0.19771341979503632
 [2024-07-27 20:06:55,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[7.545145128592009e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:55,440] [INFO] [timer.py:258:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=31.6941207935923, CurrSamplesPerSec=31.65921633267696, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 6:  67%|██████▋   | 8/12 [00:22<00:10,  2.53s/it]
 {
    "epoch": 6,
    "step": 80,
    "rank": 0,
    "loss": 0.035289354622364044,
    "overall_throughput": 31.609841938694426,
    "lr": 7.545145128592009e-06,
    "cuda_mem_allocated": 22.009064197540283,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 592,
    "batch_size": 16,
    "total_loss": 0.08043734729290009,
    "gradnorm": 1.2677600383758545,
    "weight_norm": 393.4752502441406,
    "timestamp": "2024-07-27T20:06:55.481943"
 }
 Per-token loss scaled by world size: 0.0005759032792411745
 Per-token loss scaled by world size: 0.0009630320128053427
 Per-token loss scaled by world size: 0.0008893606718629599
 Per-token loss scaled by world size: 0.0010249739279970527
 Per-token loss scaled by world size: 0.0008383162785321474
 Per-token loss scaled by world size: 0.0007667160243727267
 Per-token loss scaled by world size: 7.463712972821668e-05
 Epoch: 6, Step: 81, Rank: 3, loss = 0.0855894684791565
 Epoch: 6, Step: 81, Rank: 5, loss = 0.07904192805290222
 Epoch: 6, Step: 81, Rank: 7, loss = 0.05118340253829956
 Epoch: 6, Step: 81, Rank: 6, loss = 0.006633374840021133
 Epoch: 6, Step: 81, Rank: 4, loss = 0.0681418851017952
 Epoch: 6, Step: 81, Rank: 2, loss = 0.09109456092119217
 Epoch: 6, Step: 81, Rank: 0, loss = 0.07450535893440247
 Per-token loss scaled by world size: 0.0009560537873767316
 Epoch: 6, Step: 81, Rank: 1, loss = 0.08496928215026855
 [2024-07-27 20:06:55,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=81, skipped=0, lr=[7.225970912381557e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:55,984] [INFO] [timer.py:258:stop] epoch=0/micro_step=81/global_step=81, RunningAvgSamplesPerSec=31.696829857403074, CurrSamplesPerSec=31.90957327177327, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 6:  75%|███████▌  | 9/12 [00:23<00:05,  1.91s/it]
 {
    "epoch": 6,
    "step": 81,
    "rank": 0,
    "loss": 0.07450535893440247,
    "overall_throughput": 31.833316019435205,
    "lr": 7.225970912381557e-06,
    "cuda_mem_allocated": 22.00548553466797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 711,
    "batch_size": 16,
    "total_loss": 0.06764490157365799,
    "gradnorm": 1.3872599601745605,
    "weight_norm": 393.4753112792969,
    "timestamp": "2024-07-27T20:06:56.027764"
 }
 Per-token loss scaled by world size: 0.001566625782288611
 Per-token loss scaled by world size: 0.0001653370854910463
 Per-token loss scaled by world size: 0.00041765952482819557
 Per-token loss scaled by world size: 0.0008047792944125831
 Per-token loss scaled by world size: 0.0015484013129025698
 Per-token loss scaled by world size: 7.262427970999852e-05
 Per-token loss scaled by world size: 0.00017705872596707195
 Epoch: 6, Step: 82, Rank: 6, loss = 0.02996707148849964
 Epoch: 6, Step: 82, Rank: 5, loss = 0.01186293549835682
 Epoch: 6, Step: 82, Rank: 2, loss = 0.05774291232228279
 Epoch: 6, Step: 82, Rank: 1, loss = 0.11240539699792862
 Epoch: 6, Step: 82, Rank: 0, loss = 0.1110977977514267
 Epoch: 6, Step: 82, Rank: 7, loss = 0.005210792180150747
 Epoch: 6, Step: 82, Rank: 4, loss = 0.012703963555395603
 Per-token loss scaled by world size: 5.9409892855910584e-05
 Epoch: 6, Step: 82, Rank: 3, loss = 0.004262659698724747
 [2024-07-27 20:06:56,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=82, skipped=0, lr=[6.909830056250527e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:56,522] [INFO] [timer.py:258:stop] epoch=0/micro_step=82/global_step=82, RunningAvgSamplesPerSec=31.70569941858613, CurrSamplesPerSec=32.42243510088761, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  83%|████████▎ | 10/12 [00:23<00:02,  1.49s/it]
 {
    "epoch": 6,
    "step": 82,
    "rank": 0,
    "loss": 0.1110977977514267,
    "overall_throughput": 32.336618768855516,
    "lr": 6.909830056250527e-06,
    "cuda_mem_allocated": 21.99880838394165,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 574,
    "batch_size": 16,
    "total_loss": 0.0431566946208477,
    "gradnorm": 1.0986570119857788,
    "weight_norm": 393.4753723144531,
    "timestamp": "2024-07-27T20:06:56.567461"
 }
 Per-token loss scaled by world size: 0.0015930738300085068
 Per-token loss scaled by world size: 0.0009168770629912615
 Per-token loss scaled by world size: 0.0008305592346005142
 Per-token loss scaled by world size: 0.0003735376812983304
 Per-token loss scaled by world size: 0.0023468886502087116
 Per-token loss scaled by world size: 0.0006343711283989251
 Per-token loss scaled by world size: 0.000816680898424238
 Epoch: 6, Step: 83, Rank: 1, loss = 0.0600554458796978
 Epoch: 6, Step: 83, Rank: 7, loss = 0.05349259823560715
 Epoch: 6, Step: 83, Rank: 5, loss = 0.05440162867307663
 Epoch: 6, Step: 83, Rank: 0, loss = 0.02446671761572361
 Epoch: 6, Step: 83, Rank: 2, loss = 0.15372121334075928
 Epoch: 6, Step: 83, Rank: 3, loss = 0.10434633493423462
 Epoch: 6, Step: 83, Rank: 6, loss = 0.04155131056904793
 Per-token loss scaled by world size: 0.00215042638592422
 Epoch: 6, Step: 83, Rank: 4, loss = 0.1408529281616211
 [2024-07-27 20:06:56,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=83, skipped=0, lr=[6.59706825558357e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:57,057] [INFO] [timer.py:258:stop] epoch=0/micro_step=83/global_step=83, RunningAvgSamplesPerSec=31.71805120615546, CurrSamplesPerSec=32.73837880082133, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6:  92%|█████████▏| 11/12 [00:24<00:01,  1.20s/it]{
    "epoch": 6,
    "step": 83,
    "rank": 0,
    "loss": 0.02446671761572361,
    "overall_throughput": 32.651551303359504,
    "lr": 6.59706825558357e-06,
    "cuda_mem_allocated": 22.003100872039795,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 524,
    "batch_size": 16,
    "total_loss": 0.0791110172867775,
    "gradnorm": 1.3195643424987793,
    "weight_norm": 393.4754333496094,
    "timestamp": "2024-07-27T20:06:57.100239"
 }
 Per-token loss scaled by world size: 0.001030595856718719
 Per-token loss scaled by world size: 0.00038883870001882315
 Per-token loss scaled by world size: 0.00021640512568410486
 Per-token loss scaled by world size: 0.0008497635717503726
 Per-token loss scaled by world size: 0.0006636562757194042
 Per-token loss scaled by world size: 0.0012220889329910278
 Epoch: 6, Step: 84, Rank: 4, loss = 0.03300268575549126
 Epoch: 6, Step: 84, Rank: 6, loss = 0.018367385491728783
 Epoch: 6, Step: 84, Rank: 7, loss = 0.10372480005025864
 Epoch: 6, Step: 84, Rank: 1, loss = 0.072123683989048
 Epoch: 6, Step: 84, Rank: 2, loss = 0.05632782727479935
 Epoch: 6, Step: 84, Rank: 3, loss = 0.08747182786464691
 Per-token loss scaled by world size: 9.578206663718447e-05
 Epoch: 6, Step: 84, Rank: 0, loss = 0.008129502646625042
 Per-token loss scaled by world size: 0.0018002043943852186
 Epoch: 6, Step: 84, Rank: 5, loss = 0.15279234945774078
 [2024-07-27 20:06:57,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=84, skipped=0, lr=[6.2880275108177915e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:57,582] [INFO] [timer.py:258:stop] epoch=0/micro_step=84/global_step=84, RunningAvgSamplesPerSec=31.736034255118497, CurrSamplesPerSec=33.26364124820817, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 6: 100%|██████████| 12/12 [00:24<00:00,  1.01it/s]{
    "epoch": 6,
    "step": 84,
    "rank": 0,
    "loss": 0.008129502646625042,
    "overall_throughput": 33.17494381766968,
    "lr": 6.2880275108177915e-06,
    "cuda_mem_allocated": 22.000000476837158,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 679,
    "batch_size": 16,
    "total_loss": 0.06649251282215118,
    "gradnorm": 1.081682801246643,
    "weight_norm": 393.4754638671875,
    "timestamp": "2024-07-27T20:06:57.629230"
 }
 Epoch 6: 100%|██████████| 12/12 [00:25<00:00,  2.09s/it]
 total tokens: 160 num samples: 2 num padding tokens: 1 - rank: 1 max len: 80 min len: 79 avg len: 79.5 num_loss_counted_tokens: 86
 total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 1 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 57
 total tokens: 162 num samples: 2 num padding tokens: 29 - rank: 5 max len: 81 min len: 52 avg len: 66.5 num_loss_counted_tokens: 74
 total tokens: 116 num samples: 2 num padding tokens: 13 - rank: 5 max len: 58 min len: 45 avg len: 51.5 num_loss_counted_tokens: 60
 total tokens: 138 num samples: 2 num padding tokens: 16 - rank: 2 max len: 69 min len: 53 avg len: 61.0 num_loss_counted_tokens: 51
 total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 2 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 62
 total tokens: 144 num samples: 2 num padding tokens: 10 - rank: 0 max len: 72 min len: 62 avg len: 67.0 num_loss_counted_tokens: 80
 total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 2 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
 total tokens: 150 num samples: 2 num padding tokens: 12 - rank: 5 max len: 75 min len: 63 avg len: 69.0 num_loss_counted_tokens: 72
 total tokens: 214 num samples: 2 num padding tokens: 46 - rank: 2 max len: 107 min len: 61 avg len: 84.0 num_loss_counted_tokens: 107
 total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 2 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 62
 total tokens: 180 num samples: 2 num padding tokens: 38 - rank: 7 max len: 90 min len: 52 avg len: 71.0 num_loss_counted_tokens: 95
 total tokens: 180 num samples: 2 num padding tokens: 7 - rank: 6 max len: 90 min len: 83 avg len: 86.5 num_loss_counted_tokens: 135
 total tokens: 152 num samples: 2 num padding tokens: 7 - rank: 5 max len: 76 min len: 69 avg len: 72.5 num_loss_counted_tokens: 101
 total tokens: 102 num samples: 2 num padding tokens: 8 - rank: 1 max len: 51 min len: 43 avg len: 47.0 num_loss_counted_tokens: 44
 total tokens: 106 num samples: 2 num padding tokens: 8 - rank: 1 max len: 53 min len: 45 avg len: 49.0 num_loss_counted_tokens: 46
 total tokens: 194 num samples: 2 num padding tokens: 53 - rank: 0 max len: 97 min len: 44 avg len: 70.5 num_loss_counted_tokens: 90
 total tokens: 122 num samples: 2 num padding tokens: 10 - rank: 6 max len: 61 min len: 51 avg len: 56.0 num_loss_counted_tokens: 56
 total tokens: 140 num samples: 2 num padding tokens: 8 - rank: 2 max len: 70 min len: 62 avg len: 66.0 num_loss_counted_tokens: 72
 total tokens: 208 num samples: 2 num padding tokens: 47 - rank: 6 max len: 104 min len: 57 avg len: 80.5 num_loss_counted_tokens: 112
 total tokens: 114 num samples: 2 num padding tokens: 12 - rank: 1 max len: 57 min len: 45 avg len: 51.0 num_loss_counted_tokens: 47
 total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 0 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 79
 total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 2 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 59
 total tokens: 282 num samples: 2 num padding tokens: 77 - rank: 7 max len: 141 min len: 64 avg len: 102.5 num_loss_counted_tokens: 152
 total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 5 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 68
 total tokens: 172 num samples: 2 num padding tokens: 35 - rank: 0 max len: 86 min len: 51 avg len: 68.5 num_loss_counted_tokens: 71
 total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 2 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 105
 total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 2 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 63
 total tokens: 134 num samples: 2 num padding tokens: 4 - rank: 3 max len: 67 min len: 63 avg len: 65.0 num_loss_counted_tokens: 61
 total tokens: 146 num samples: 2 num padding tokens: 28 - rank: 5 max len: 73 min len: 45 avg len: 59.0 num_loss_counted_tokens: 72
 total tokens: 102 num samples: 2 num padding tokens: 1 - rank: 7 max len: 51 min len: 50 avg len: 50.5 num_loss_counted_tokens: 62
 total tokens: 136 num samples: 2 num padding tokens: 2 - rank: 6 max len: 68 min len: 66 avg len: 67.0 num_loss_counted_tokens: 58
 total tokens: 168 num samples: 2 num padding tokens: 14 - rank: 6 max len: 84 min len: 70 avg len: 77.0 num_loss_counted_tokens: 86
 total tokens: 132 num samples: 2 num padding tokens: 12 - rank: 1 max len: 66 min len: 54 avg len: 60.0 num_loss_counted_tokens: 57
 total tokens: 226 num samples: 2 num padding tokens: 27 - rank: 6 max len: 113 min len: 86 avg len: 99.5 num_loss_counted_tokens: 114
 total tokens: 174 num samples: 2 num padding tokens: 28 - rank: 4 max len: 87 min len: 59 avg len: 73.0 num_loss_counted_tokens: 90
 total tokens: 184 num samples: 2 num padding tokens: 14 - rank: 3 max len: 92 min len: 78 avg len: 85.0 num_loss_counted_tokens: 102
 total tokens: 200 num samples: 2 num padding tokens: 48 - rank: 3 max len: 100 min len: 52 avg len: 76.0 num_loss_counted_tokens: 85
 total tokens: 176 num samples: 2 num padding tokens: 44 - rank: 0 max len: 88 min len: 44 avg len: 66.0 num_loss_counted_tokens: 74
 total tokens: 110 num samples: 2 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 54
 total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 0 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 104
 total tokens: 146 num samples: 2 num padding tokens: 24 - rank: 7 max len: 73 min len: 49 avg len: 61.0 num_loss_counted_tokens: 64
 total tokens: 214 num samples: 2 num padding tokens: 25 - rank: 0 max len: 107 min len: 82 avg len: 94.5 num_loss_counted_tokens: 135
 total tokens: 142 num samples: 2 num padding tokens: 3 - rank: 0 max len: 71 min len: 68 avg len: 69.5 num_loss_counted_tokens: 59
 total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 0 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 77
 total tokens: 108 num samples: 2 num padding tokens: 6 - rank: 3 max len: 54 min len: 48 avg len: 51.0 num_loss_counted_tokens: 55
 total tokens: 166 num samples: 2 num padding tokens: 12 - rank: 6 max len: 83 min len: 71 avg len: 77.0 num_loss_counted_tokens: 79
 total tokens: 196 num samples: 2 num padding tokens: 28 - rank: 6 max len: 98 min len: 70 avg len: 84.0 num_loss_counted_tokens: 105
 total tokens: 186 num samples: 2 num padding tokens: 20 - rank: 0 max len: 93 min len: 73 avg len: 83.0 num_loss_counted_tokens: 135
 total tokens: 228 num samples: 2 num padding tokens: 52 - rank: 4 max len: 114 min len: 62 avg len: 88.0 num_loss_counted_tokens: 120
 total tokens: 244 num samples: 2 num padding tokens: 64 - rank: 3 max len: 122 min len: 58 avg len: 90.0 num_loss_counted_tokens: 127
 total tokens: 162 num samples: 2 num padding tokens: 2 - rank: 6 max len: 81 min len: 79 avg len: 80.0 num_loss_counted_tokens: 86
 total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 6 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 71
 total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 3 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 59
 total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 3 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 71
 total tokens: 164 num samples: 2 num padding tokens: 19 - rank: 7 max len: 82 min len: 63 avg len: 72.5 num_loss_counted_tokens: 84
 total tokens: 122 num samples: 2 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 61
 total tokens: 118 num samples: 2 num padding tokens: 11 - rank: 1 max len: 59 min len: 48 avg len: 53.5 num_loss_counted_tokens: 55
 total tokens: 142 num samples: 2 num padding tokens: 1 - rank: 1 max len: 71 min len: 70 avg len: 70.5 num_loss_counted_tokens: 72
 total tokens: 128 num samples: 2 num padding tokens: 15 - rank: 2 max len: 64 min len: 49 avg len: 56.5 num_loss_counted_tokens: 52
 total tokens: 172 num samples: 2 num padding tokens: 22 - rank: 3 max len: 86 min len: 64 avg len: 75.0 num_loss_counted_tokens: 70
 total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 1 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 64
 total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 3 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 65
 total tokens: 134 num samples: 2 num padding tokens: 17 - rank: 4 max len: 67 min len: 50 avg len: 58.5 num_loss_counted_tokens: 66
 total tokens: 126 num samples: 2 num padding tokens: 5 - rank: 4 max len: 63 min len: 58 avg len: 60.5 num_loss_counted_tokens: 61
 total tokens: 132 num samples: 2 num padding tokens: 14 - rank: 4 max len: 66 min len: 52 avg len: 59.0 num_loss_counted_tokens: 66
 total tokens: 216 num samples: 2 num padding tokens: 7 - rank: 0 max len: 108 min len: 101 avg len: 104.5 num_loss_counted_tokens: 147
 total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 7 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 57
 total tokens: 152 num samples: 2 num padding tokens: 12 - rank: 7 max len: 76 min len: 64 avg len: 70.0 num_loss_counted_tokens: 81
 total tokens: 130 num samples: 2 num padding tokens: 0 - rank: 7 max len: 65 min len: 65 avg len: 65.0 num_loss_counted_tokens: 59
 total tokens: 154 num samples: 2 num padding tokens: 10 - rank: 7 max len: 77 min len: 67 avg len: 72.0 num_loss_counted_tokens: 80
 total tokens: 180 num samples: 2 num padding tokens: 35 - rank: 7 max len: 90 min len: 55 avg len: 72.5 num_loss_counted_tokens: 97
 total tokens: 148 num samples: 2 num padding tokens: 14 - rank: 4 max len: 74 min len: 60 avg len: 67.0 num_loss_counted_tokens: 68
 total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 1 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 61
 total tokens: 132 num samples: 2 num padding tokens: 7 - rank: 4 max len: 66 min len: 59 avg len: 62.5 num_loss_counted_tokens: 67
 total tokens: 152 num samples: 2 num padding tokens: 21 - rank: 5 max len: 76 min len: 55 avg len: 65.5 num_loss_counted_tokens: 63
 total tokens: 144 num samples: 2 num padding tokens: 11 - rank: 3 max len: 72 min len: 61 avg len: 66.5 num_loss_counted_tokens: 66
 total tokens: 138 num samples: 2 num padding tokens: 15 - rank: 5 max len: 69 min len: 54 avg len: 61.5 num_loss_counted_tokens: 62
 total tokens: 188 num samples: 2 num padding tokens: 50 - rank: 1 max len: 94 min len: 44 avg len: 69.0 num_loss_counted_tokens: 79
 total tokens: 148 num samples: 2 num padding tokens: 15 - rank: 5 max len: 74 min len: 59 avg len: 66.5 num_loss_counted_tokens: 71
 total tokens: 116 num samples: 2 num padding tokens: 6 - rank: 3 max len: 58 min len: 52 avg len: 55.0 num_loss_counted_tokens: 60
 total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 5 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 65
 total tokens: 180 num samples: 2 num padding tokens: 35 - rank: 7 max len: 90 min len: 55 avg len: 72.5 num_loss_counted_tokens: 117
 total tokens: 168 num samples: 2 num padding tokens: 24 - rank: 4 max len: 84 min len: 60 avg len: 72.0 num_loss_counted_tokens: 91
 total tokens: 174 num samples: 2 num padding tokens: 10 - rank: 4 max len: 87 min len: 77 avg len: 82.0 num_loss_counted_tokens: 93
 total tokens: 116 num samples: 2 num padding tokens: 1 - rank: 4 max len: 58 min len: 57 avg len: 57.5 num_loss_counted_tokens: 68
 total tokens: 128 num samples: 2 num padding tokens: 16 - rank: 4 max len: 64 min len: 48 avg len: 56.0 num_loss_counted_tokens: 64
 total tokens: 160 num samples: 2 num padding tokens: 22 - rank: 2 max len: 80 min len: 58 avg len: 69.0 num_loss_counted_tokens: 79
 total tokens: 174 num samples: 2 num padding tokens: 19 - rank: 5 max len: 87 min len: 68 avg len: 77.5 num_loss_counted_tokens: 76
 total tokens: 166 num samples: 2 num padding tokens: 23 - rank: 4 max len: 83 min len: 60 avg len: 71.5 num_loss_counted_tokens: 86
 total tokens: 188 num samples: 2 num padding tokens: 13 - rank: 7 max len: 94 min len: 81 avg len: 87.5 num_loss_counted_tokens: 115
 total tokens: 128 num samples: 2 num padding tokens: 15 - rank: 5 max len: 64 min len: 49 avg len: 56.5 num_loss_counted_tokens: 57
 total tokens: 134 num samples: 2 num padding tokens: 17 - rank: 2 max len: 67 min len: 50 avg len: 58.5 num_loss_counted_tokens: 55
 total tokens: 162 num samples: 2 num padding tokens: 19 - rank: 6 max len: 81 min len: 62 avg len: 71.5 num_loss_counted_tokens: 77
 total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 1 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 58
 total tokens: 160 num samples: 2 num padding tokens: 12 - rank: 3 max len: 80 min len: 68 avg len: 74.0 num_loss_counted_tokens: 80
 Per-token loss scaled by world size: 0.00021730510343331844
 Per-token loss scaled by world size: 0.00023930655152071267
 Per-token loss scaled by world size: 0.00019531356520019472
 Per-token loss scaled by world size: 0.0005758063634857535
 Per-token loss scaled by world size: 0.00014575273962691426
 Per-token loss scaled by world size: 0.0007938417256809771
 Per-token loss scaled by world size: 0.00033632898703217506
 Epoch: 7, Step: 85, Rank: 3, loss = 0.0187968909740448
 Epoch: 7, Step: 85, Rank: 5, loss = 0.016894623637199402
 Epoch: 7, Step: 85, Rank: 6, loss = 0.020700016990303993
 Epoch: 7, Step: 85, Rank: 0, loss = 0.04980725049972534
 Epoch: 7, Step: 85, Rank: 4, loss = 0.01260761171579361
 Epoch: 7, Step: 85, Rank: 2, loss = 0.06866730749607086
 Epoch: 7, Step: 85, Rank: 1, loss = 0.0290924571454525
 Per-token loss scaled by world size: 0.00042542771552689373
 Epoch: 7, Step: 85, Rank: 7, loss = 0.03679949790239334
 [2024-07-27 20:06:58,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=85, skipped=0, lr=[5.983045753470308e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:58,596] [INFO] [timer.py:258:stop] epoch=0/micro_step=85/global_step=85, RunningAvgSamplesPerSec=31.69756640961793, CurrSamplesPerSec=28.831859851847014, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:   8%|▊         | 1/12 [00:00<00:10,  1.08it/s]{
    "epoch": 7,
    "step": 85,
    "rank": 0,
    "loss": 0.04980725049972534,
    "overall_throughput": 28.716406669884535,
    "lr": 5.983045753470308e-06,
    "cuda_mem_allocated": 22.00047731399536,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 692,
    "batch_size": 16,
    "total_loss": 0.0316707044839859,
    "gradnorm": 0.7268746495246887,
    "weight_norm": 393.4754943847656,
    "timestamp": "2024-07-27T20:06:58.639043"
 }
 Per-token loss scaled by world size: 0.000342040992109105
 Per-token loss scaled by world size: 0.00027503896853886545
 Per-token loss scaled by world size: 0.00036574419937096536
 Per-token loss scaled by world size: 0.0006328842719085515
 Per-token loss scaled by world size: 0.0005108661716803908
 Per-token loss scaled by world size: 0.0006690495647490025
 Per-token loss scaled by world size: 0.0002407751599093899
 Epoch: 7, Step: 86, Rank: 0, loss = 0.032139770686626434
 Epoch: 7, Step: 86, Rank: 5, loss = 0.030056850984692574
 Epoch: 7, Step: 86, Rank: 4, loss = 0.058792732656002045
 Epoch: 7, Step: 86, Rank: 2, loss = 0.02416904829442501
 Epoch: 7, Step: 86, Rank: 6, loss = 0.05561470612883568
 Epoch: 7, Step: 86, Rank: 7, loss = 0.044892363250255585
 Epoch: 7, Step: 86, Rank: 3, loss = 0.021158117800951004
 Per-token loss scaled by world size: 0.0008176557603292167
 Epoch: 7, Step: 86, Rank: 1, loss = 0.07185149937868118
 [2024-07-27 20:06:59,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=86, skipped=0, lr=[5.6824564766150724e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:59,144] [INFO] [timer.py:258:stop] epoch=0/micro_step=86/global_step=86, RunningAvgSamplesPerSec=31.703864056372694, CurrSamplesPerSec=32.23543844733132, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 7:  17%|█▋        | 2/12 [00:01<00:07,  1.42it/s]{
    "epoch": 7,
    "step": 86,
    "rank": 0,
    "loss": 0.032139770686626434,
    "overall_throughput": 32.183202299620085,
    "lr": 5.6824564766150724e-06,
    "cuda_mem_allocated": 22.006441116333008,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 703,
    "batch_size": 16,
    "total_loss": 0.042334385216236115,
    "gradnorm": 0.6796127557754517,
    "weight_norm": 393.4755554199219,
    "timestamp": "2024-07-27T20:06:59.187763"
 }
 Per-token loss scaled by world size: 0.00016505751409567893
 Per-token loss scaled by world size: 0.0008360829087905586
 Per-token loss scaled by world size: 0.0005081766867078841
 Per-token loss scaled by world size: 0.0005767009570263326
 Per-token loss scaled by world size: 0.0008457532385364175
 Per-token loss scaled by world size: 0.003279536496847868
 Per-token loss scaled by world size: 0.0016091869911178946
 Epoch: 7, Step: 87, Rank: 7, loss = 0.05246420204639435
 Epoch: 7, Step: 87, Rank: 0, loss = 0.010357359424233437
 Epoch: 7, Step: 87, Rank: 3, loss = 0.03188808634877205
 Epoch: 7, Step: 87, Rank: 6, loss = 0.2057909220457077
 Epoch: 7, Step: 87, Rank: 5, loss = 0.05307101458311081
 Epoch: 7, Step: 87, Rank: 2, loss = 0.03618798404932022
 Epoch: 7, Step: 87, Rank: 4, loss = 0.10097648203372955
 Per-token loss scaled by world size: 0.0013951770961284637
 Epoch: 7, Step: 87, Rank: 1, loss = 0.08754736185073853
 [2024-07-27 20:06:59,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=87, skipped=0, lr=[5.386588370213124e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:06:59,697] [INFO] [timer.py:258:stop] epoch=0/micro_step=87/global_step=87, RunningAvgSamplesPerSec=31.702355408140964, CurrSamplesPerSec=31.576139496344755, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:  25%|██▌       | 3/12 [00:02<00:05,  1.57it/s]{
    "epoch": 7,
    "step": 87,
    "rank": 0,
    "loss": 0.010357359424233437,
    "overall_throughput": 31.529719530621602,
    "lr": 5.386588370213124e-06,
    "cuda_mem_allocated": 22.000000476837158,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 502,
    "batch_size": 16,
    "total_loss": 0.07228542864322662,
    "gradnorm": 1.0233722925186157,
    "weight_norm": 393.4755859375,
    "timestamp": "2024-07-27T20:06:59.740017"
 }
 Per-token loss scaled by world size: 0.0004632726195268333
 Per-token loss scaled by world size: 0.0006792055210098624
 Per-token loss scaled by world size: 0.0006460993899963796
 Per-token loss scaled by world size: 7.74235013523139e-05
 Per-token loss scaled by world size: 0.00012206515384605154
 Per-token loss scaled by world size: 0.0019949208945035934
 Epoch: 7, Step: 88, Rank: 2, loss = 0.05039575323462486
 Epoch: 7, Step: 88, Rank: 0, loss = 0.009521082043647766
 Epoch: 7, Step: 88, Rank: 7, loss = 0.03613526374101639
 Epoch: 7, Step: 88, Rank: 3, loss = 0.0529780313372612
 Epoch: 7, Step: 88, Rank: 4, loss = 0.15560382604599
 Per-token loss scaled by world size: 0.0007654842338524759
 Epoch: 7, Step: 88, Rank: 6, loss = 0.006039033178240061
 Epoch: 7, Step: 88, Rank: 5, loss = 0.05970776826143265
 Per-token loss scaled by world size: 0.0008809524588286877
 Epoch: 7, Step: 88, Rank: 1, loss = 0.06871429085731506
 [2024-07-27 20:07:00,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=88, skipped=0, lr=[5.095764961694923e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:00,266] [INFO] [timer.py:258:stop] epoch=0/micro_step=88/global_step=88, RunningAvgSamplesPerSec=31.68774037603646, CurrSamplesPerSec=30.492857616709514, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 1408
 {
    "epoch": 7,
    "step": 88,
    "rank": 0,
    "loss": 0.009521082043647766,
    "overall_throughput": 30.41481690296598,
    "lr": 5.095764961694923e-06,
    "cuda_mem_allocated": 22.0038161277771,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 624,
    "batch_size": 16,
    "total_loss": 0.05488688498735428,
    "gradnorm": 1.1451473236083984,
    "weight_norm": 393.47564697265625,
    "timestamp": "2024-07-27T20:07:00.269082"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1408
 [20:07:18] INFO     saving took 17.875807285308838 seconds                                                                                                                                                                        utils.py:611
 Per-token loss scaled by world size: 0.00015760907263029367
 Per-token loss scaled by world size: 0.00012432184303179383
 Per-token loss scaled by world size: 0.0010254974476993084
 Per-token loss scaled by world size: 0.0010104298125952482
 Per-token loss scaled by world size: 0.00047610432375222445
 Per-token loss scaled by world size: 0.00011171086953254417
 Per-token loss scaled by world size: 0.000618505000602454
 Epoch: 7, Step: 89, Rank: 2, loss = 0.010038988664746284
 Epoch: 7, Step: 89, Rank: 5, loss = 0.08159220963716507
 Epoch: 7, Step: 89, Rank: 7, loss = 0.012726932764053345
 Epoch: 7, Step: 89, Rank: 3, loss = 0.038445424288511276
 Epoch: 7, Step: 89, Rank: 4, loss = 0.08280891925096512
 Epoch: 7, Step: 89, Rank: 0, loss = 0.009020652621984482
 Epoch: 7, Step: 89, Rank: 6, loss = 0.049944277852773666
 Per-token loss scaled by world size: 0.0007019841577857733
 Epoch: 7, Step: 89, Rank: 1, loss = 0.05668522045016289
 [2024-07-27 20:07:18,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=89, skipped=0, lr=[4.8103042621878515e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:18,713] [INFO] [timer.py:258:stop] epoch=0/micro_step=89/global_step=89, RunningAvgSamplesPerSec=31.681179911181776, CurrSamplesPerSec=31.12696454313878, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 7:  42%|████▏     | 5/12 [00:21<00:35,  5.11s/it]{
    "epoch": 7,
    "step": 89,
    "rank": 0,
    "loss": 0.009020652621984482,
    "overall_throughput": 31.06243279759643,
    "lr": 4.8103042621878515e-06,
    "cuda_mem_allocated": 22.00548553466797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 646,
    "batch_size": 16,
    "total_loss": 0.04265782982110977,
    "gradnorm": 0.9962098598480225,
    "weight_norm": 393.4757080078125,
    "timestamp": "2024-07-27T20:07:18.755800"
 }
 Per-token loss scaled by world size: 0.0009453526581637561
 Per-token loss scaled by world size: 0.0016993889585137367
 Per-token loss scaled by world size: 0.0008407846908085048
 Per-token loss scaled by world size: 6.15180833847262e-05
 Per-token loss scaled by world size: 0.0012258148053660989
 Per-token loss scaled by world size: 0.0002534937229938805
 Per-token loss scaled by world size: 7.776251732138917e-05
 Epoch: 7, Step: 90, Rank: 0, loss = 0.057784680277109146
 Epoch: 7, Step: 90, Rank: 1, loss = 0.10387515276670456
 Epoch: 7, Step: 90, Rank: 5, loss = 0.05139296501874924
 Epoch: 7, Step: 90, Rank: 3, loss = 0.0749279335141182
 Epoch: 7, Step: 90, Rank: 2, loss = 0.0037602928932756186
 Epoch: 7, Step: 90, Rank: 6, loss = 0.015494802966713905
 Epoch: 7, Step: 90, Rank: 4, loss = 0.00475323386490345
 Per-token loss scaled by world size: 0.0009510749368928373
 Epoch: 7, Step: 90, Rank: 7, loss = 0.05813445523381233
 [2024-07-27 20:07:19,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[4.530518418775734e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:19,255] [INFO] [timer.py:258:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=31.685890209714664, CurrSamplesPerSec=32.10111808111374, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:  50%|█████     | 6/12 [00:21<00:21,  3.56s/it]{
    "epoch": 7,
    "step": 90,
    "rank": 0,
    "loss": 0.057784680277109146,
    "overall_throughput": 32.01665600380905,
    "lr": 4.530518418775734e-06,
    "cuda_mem_allocated": 21.996421813964844,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 489,
    "batch_size": 16,
    "total_loss": 0.04626544192433357,
    "gradnorm": 0.9134754538536072,
    "weight_norm": 393.4757385253906,
    "timestamp": "2024-07-27T20:07:19.305311"
 }
 Per-token loss scaled by world size: 0.00019140614313073456
 Per-token loss scaled by world size: 0.0003604689263738692
 Per-token loss scaled by world size: 0.00012782825797330588
 Per-token loss scaled by world size: 0.00011688289669109508
 Per-token loss scaled by world size: 0.0008099116967059672
 Per-token loss scaled by world size: 0.0005937899113632739
 Per-token loss scaled by world size: 0.0016323667950928211
 Epoch: 7, Step: 91, Rank: 4, loss = 0.029107866808772087
 Epoch: 7, Step: 91, Rank: 0, loss = 0.015456045977771282
 Epoch: 7, Step: 91, Rank: 1, loss = 0.010322132147848606
 Epoch: 7, Step: 91, Rank: 7, loss = 0.009438293986022472
 Epoch: 7, Step: 91, Rank: 2, loss = 0.0654003694653511
 Epoch: 7, Step: 91, Rank: 6, loss = 0.04794853553175926
 Epoch: 7, Step: 91, Rank: 3, loss = 0.13181361556053162
 Per-token loss scaled by world size: 8.413568866671994e-05
 Epoch: 7, Step: 91, Rank: 5, loss = 0.006793956737965345
 [2024-07-27 20:07:19,714] [INFO] [logging.py:96:log_dist] [Rank 0] step=91, skipped=0, lr=[4.256713373170565e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:19,792] [INFO] [timer.py:258:stop] epoch=0/micro_step=91/global_step=91, RunningAvgSamplesPerSec=31.69971548381683, CurrSamplesPerSec=32.965470896955004, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:  58%|█████▊    | 7/12 [00:22<00:12,  2.57s/it]{
    "epoch": 7,
    "step": 91,
    "rank": 0,
    "loss": 0.015456045977771282,
    "overall_throughput": 32.88040023537493,
    "lr": 4.256713373170565e-06,
    "cuda_mem_allocated": 22.004292964935303,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 646,
    "batch_size": 16,
    "total_loss": 0.039535101503133774,
    "gradnorm": 1.6763972043991089,
    "weight_norm": 393.47576904296875,
    "timestamp": "2024-07-27T20:07:19.833342"
 }
 Per-token loss scaled by world size: 0.00016448293172288686
 Per-token loss scaled by world size: 0.00010030974954133853
 Per-token loss scaled by world size: 0.0006337311351671815
 Per-token loss scaled by world size: 0.0002874261699616909
 Per-token loss scaled by world size: 0.0004495856410358101
 Per-token loss scaled by world size: 0.0012448193738237023
 Per-token loss scaled by world size: 8.349026757059619e-05
 Epoch: 7, Step: 92, Rank: 0, loss = 0.013878247700631618
 Epoch: 7, Step: 92, Rank: 4, loss = 0.024251583963632584
 Epoch: 7, Step: 92, Rank: 7, loss = 0.008463635109364986
 Epoch: 7, Step: 92, Rank: 5, loss = 0.10503163933753967
 Epoch: 7, Step: 92, Rank: 2, loss = 0.05347106233239174
 Epoch: 7, Step: 92, Rank: 6, loss = 0.007044491358101368
 Epoch: 7, Step: 92, Rank: 3, loss = 0.03793378919363022
 Per-token loss scaled by world size: 0.0010255238739773631
 Epoch: 7, Step: 92, Rank: 1, loss = 0.08652857691049576
 [2024-07-27 20:07:20,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=92, skipped=0, lr=[3.989188527169749e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:20,327] [INFO] [timer.py:258:stop] epoch=0/micro_step=92/global_step=92, RunningAvgSamplesPerSec=31.708216014036797, CurrSamplesPerSec=32.48346829214222, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 7:  67%|██████▋   | 8/12 [00:22<00:07,  1.92s/it]{
    "epoch": 7,
    "step": 92,
    "rank": 0,
    "loss": 0.013878247700631618,
    "overall_throughput": 32.40163058618926,
    "lr": 3.989188527169749e-06,
    "cuda_mem_allocated": 22.009064197540283,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 675,
    "batch_size": 16,
    "total_loss": 0.04207538068294525,
    "gradnorm": 0.6942251920700073,
    "weight_norm": 393.4757995605469,
    "timestamp": "2024-07-27T20:07:20.370030"
 }
 Per-token loss scaled by world size: 0.001151230651885271
 Per-token loss scaled by world size: 0.0008526312303729355
 Per-token loss scaled by world size: 0.00011098023969680071
 Per-token loss scaled by world size: 0.0004092410672456026
 Per-token loss scaled by world size: 0.0007324064499698579
 Per-token loss scaled by world size: 0.000303772249026224
 Per-token loss scaled by world size: 0.0005547546315938234
 Epoch: 7, Step: 93, Rank: 6, loss = 0.0076160188764333725
 Epoch: 7, Step: 93, Rank: 0, loss = 0.05851181969046593
 Epoch: 7, Step: 93, Rank: 1, loss = 0.02808416821062565
 Epoch: 7, Step: 93, Rank: 3, loss = 0.05026139318943024
 Epoch: 7, Step: 93, Rank: 5, loss = 0.020846370607614517
 Epoch: 7, Step: 93, Rank: 7, loss = 0.07900319993495941
 Epoch: 7, Step: 93, Rank: 2, loss = 0.038070037961006165
 Per-token loss scaled by world size: 0.002183598466217518
 Epoch: 7, Step: 93, Rank: 4, loss = 0.14984944462776184
 [2024-07-27 20:07:20,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=93, skipped=0, lr=[3.72823641526463e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:20,868] [INFO] [timer.py:258:stop] epoch=0/micro_step=93/global_step=93, RunningAvgSamplesPerSec=31.713426964551296, CurrSamplesPerSec=32.18953148593345, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:  75%|███████▌  | 9/12 [00:23<00:04,  1.49s/it]{
    "epoch": 7,
    "step": 93,
    "rank": 0,
    "loss": 0.05851181969046593,
    "overall_throughput": 32.11137875033854,
    "lr": 3.72823641526463e-06,
    "cuda_mem_allocated": 22.001431465148926,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 549,
    "batch_size": 16,
    "total_loss": 0.05403030663728714,
    "gradnorm": 2.058638572692871,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:20.910467"
 }
 Per-token loss scaled by world size: 0.0005728427204303443
 Per-token loss scaled by world size: 0.00010026186646427959
 Per-token loss scaled by world size: 0.0007008212269283831
 Per-token loss scaled by world size: 0.001267179031856358
 Per-token loss scaled by world size: 0.0009045311016961932
 Per-token loss scaled by world size: 0.000113489935756661
 Per-token loss scaled by world size: 0.00015748964506201446
 Epoch: 7, Step: 94, Rank: 3, loss = 0.06035822629928589
 Epoch: 7, Step: 94, Rank: 1, loss = 0.008635053411126137
 Epoch: 7, Step: 94, Rank: 0, loss = 0.10913579910993576
 Epoch: 7, Step: 94, Rank: 2, loss = 0.04933607950806618
 Epoch: 7, Step: 94, Rank: 5, loss = 0.013563795946538448
 Epoch: 7, Step: 94, Rank: 7, loss = 0.07790274173021317
 Epoch: 7, Step: 94, Rank: 4, loss = 0.009774320758879185
 Per-token loss scaled by world size: 3.7123980291653425e-05
 Epoch: 7, Step: 94, Rank: 6, loss = 0.003197302808985114
 [2024-07-27 20:07:21,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=94, skipped=0, lr=[3.4741423847583134e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:21,427] [INFO] [timer.py:258:stop] epoch=0/micro_step=94/global_step=94, RunningAvgSamplesPerSec=31.70583386752702, CurrSamplesPerSec=31.029757814905818, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 7:  83%|████████▎ | 10/12 [00:23<00:02,  1.20s/it]{
    "epoch": 7,
    "step": 94,
    "rank": 0,
    "loss": 0.10913579910993576,
    "overall_throughput": 30.958899761772514,
    "lr": 3.4741423847583134e-06,
    "cuda_mem_allocated": 22.00882577896118,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 689,
    "batch_size": 16,
    "total_loss": 0.0414879135787487,
    "gradnorm": 0.7960036993026733,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:21.470074"
 }
 Per-token loss scaled by world size: 0.00030438421526923776
 Per-token loss scaled by world size: 0.00036696376628242433
 Per-token loss scaled by world size: 0.00027681011124514043
 Per-token loss scaled by world size: 7.804056804161519e-05
 Per-token loss scaled by world size: 0.001165422610938549
 Per-token loss scaled by world size: 0.00014700897736474872
 Per-token loss scaled by world size: 0.0005056550144217908
 Epoch: 7, Step: 95, Rank: 7, loss = 0.02559572272002697
 Epoch: 7, Step: 95, Rank: 1, loss = 0.005443329457193613
 Epoch: 7, Step: 95, Rank: 6, loss = 0.01930750533938408
 Epoch: 7, Step: 95, Rank: 0, loss = 0.021230798214673996
 Epoch: 7, Step: 95, Rank: 3, loss = 0.08128822594881058
 Epoch: 7, Step: 95, Rank: 5, loss = 0.03526943549513817
 Epoch: 7, Step: 95, Rank: 4, loss = 0.010253876447677612
 Per-token loss scaled by world size: 0.001073820167221129
 Epoch: 7, Step: 95, Rank: 2, loss = 0.07489895820617676
 [2024-07-27 20:07:21,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=95, skipped=0, lr=[3.2271842837425917e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:21,957] [INFO] [timer.py:258:stop] epoch=0/micro_step=95/global_step=95, RunningAvgSamplesPerSec=31.718594052770808, CurrSamplesPerSec=32.93815904428149, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 7:  92%|█████████▏| 11/12 [00:24<00:00,  1.00it/s]{
    "epoch": 7,
    "step": 95,
    "rank": 0,
    "loss": 0.021230798214673996,
    "overall_throughput": 32.85461233006348,
    "lr": 3.2271842837425917e-06,
    "cuda_mem_allocated": 22.00023889541626,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 558,
    "batch_size": 16,
    "total_loss": 0.034160979092121124,
    "gradnorm": 0.7562242150306702,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:21.999178"
 }
 Per-token loss scaled by world size: 0.0006009905482642353
 Per-token loss scaled by world size: 0.0001355827844236046
 Per-token loss scaled by world size: 0.0003012406814377755
 Per-token loss scaled by world size: 0.0010038167238235474
 Per-token loss scaled by world size: 0.0006891617667861283
 Per-token loss scaled by world size: 0.0006996800657361746
 Per-token loss scaled by world size: 0.0006351979682222009
 Epoch: 7, Step: 96, Rank: 5, loss = 0.0112872663885355
 Epoch: 7, Step: 96, Rank: 1, loss = 0.08356773853302002
 Epoch: 7, Step: 96, Rank: 0, loss = 0.05003246292471886
 Epoch: 7, Step: 96, Rank: 2, loss = 0.025078287348151207
 Epoch: 7, Step: 96, Rank: 3, loss = 0.057372719049453735
 Epoch: 7, Step: 96, Rank: 7, loss = 0.05288023129105568
 Epoch: 7, Step: 96, Rank: 6, loss = 0.0582483634352684
 Per-token loss scaled by world size: 0.0005330585991032422
 Epoch: 7, Step: 96, Rank: 4, loss = 0.04437712952494621
 [2024-07-27 20:07:22,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=96, skipped=0, lr=[2.9876321572751143e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:22,503] [INFO] [timer.py:258:stop] epoch=0/micro_step=96/global_step=96, RunningAvgSamplesPerSec=31.71950152323151, CurrSamplesPerSec=31.804123848141387, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 7: 100%|██████████| 12/12 [00:24<00:00,  1.16it/s]{
    "epoch": 7,
    "step": 96,
    "rank": 0,
    "loss": 0.05003246292471886,
    "overall_throughput": 31.730529437408332,
    "lr": 2.9876321572751143e-06,
    "cuda_mem_allocated": 22.00548553466797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 666,
    "batch_size": 16,
    "total_loss": 0.047855526208877563,
    "gradnorm": 1.0569850206375122,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:22.546236"
 }
 Epoch 7: 100%|██████████| 12/12 [00:24<00:00,  2.08s/it]
 total tokens: 166 num samples: 2 num padding tokens: 3 - rank: 6 max len: 83 min len: 80 avg len: 81.5 num_loss_counted_tokens: 88
 total tokens: 174 num samples: 2 num padding tokens: 37 - rank: 6 max len: 87 min len: 50 avg len: 68.5 num_loss_counted_tokens: 83
 total tokens: 132 num samples: 2 num padding tokens: 17 - rank: 6 max len: 66 min len: 49 avg len: 57.5 num_loss_counted_tokens: 50
 total tokens: 194 num samples: 2 num padding tokens: 18 - rank: 6 max len: 97 min len: 79 avg len: 88.0 num_loss_counted_tokens: 112
 total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 6 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 72
 total tokens: 140 num samples: 2 num padding tokens: 6 - rank: 6 max len: 70 min len: 64 avg len: 67.0 num_loss_counted_tokens: 81
 total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 6 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 48
 total tokens: 144 num samples: 2 num padding tokens: 18 - rank: 6 max len: 72 min len: 54 avg len: 63.0 num_loss_counted_tokens: 82
 total tokens: 154 num samples: 2 num padding tokens: 19 - rank: 3 max len: 77 min len: 58 avg len: 67.5 num_loss_counted_tokens: 87
 total tokens: 196 num samples: 2 num padding tokens: 34 - rank: 6 max len: 98 min len: 64 avg len: 81.0 num_loss_counted_tokens: 113
 total tokens: 228 num samples: 2 num padding tokens: 70 - rank: 6 max len: 114 min len: 44 avg len: 79.0 num_loss_counted_tokens: 110
 total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 3 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 60
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
 total tokens: 186 num samples: 2 num padding tokens: 48 - rank: 3 max len: 93 min len: 45 avg len: 69.0 num_loss_counted_tokens: 110
 total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 3 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 61
 total tokens: 104 num samples: 2 num padding tokens: 1 - rank: 3 max len: 52 min len: 51 avg len: 51.5 num_loss_counted_tokens: 59
 total tokens: 152 num samples: 2 num padding tokens: 2 - rank: 3 max len: 76 min len: 74 avg len: 75.0 num_loss_counted_tokens: 79
 total tokens: 108 num samples: 2 num padding tokens: 1 - rank: 3 max len: 54 min len: 53 avg len: 53.5 num_loss_counted_tokens: 64
 total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 7 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 61
 total tokens: 188 num samples: 2 num padding tokens: 35 - rank: 3 max len: 94 min len: 59 avg len: 76.5 num_loss_counted_tokens: 93
 total tokens: 162 num samples: 2 num padding tokens: 37 - rank: 7 max len: 81 min len: 44 avg len: 62.5 num_loss_counted_tokens: 66
 total tokens: 172 num samples: 2 num padding tokens: 11 - rank: 7 max len: 86 min len: 75 avg len: 80.5 num_loss_counted_tokens: 72
 total tokens: 166 num samples: 2 num padding tokens: 37 - rank: 7 max len: 83 min len: 46 avg len: 64.5 num_loss_counted_tokens: 80
 total tokens: 142 num samples: 2 num padding tokens: 6 - rank: 3 max len: 71 min len: 65 avg len: 68.0 num_loss_counted_tokens: 78
 total tokens: 158 num samples: 2 num padding tokens: 22 - rank: 7 max len: 79 min len: 57 avg len: 68.0 num_loss_counted_tokens: 65
 total tokens: 128 num samples: 2 num padding tokens: 19 - rank: 7 max len: 64 min len: 45 avg len: 54.5 num_loss_counted_tokens: 58
 total tokens: 126 num samples: 2 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 64
 total tokens: 134 num samples: 2 num padding tokens: 24 - rank: 7 max len: 67 min len: 43 avg len: 55.0 num_loss_counted_tokens: 57
 total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 7 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 72
 total tokens: 202 num samples: 2 num padding tokens: 21 - rank: 7 max len: 101 min len: 80 avg len: 90.5 num_loss_counted_tokens: 128
 total tokens: 138 num samples: 2 num padding tokens: 17 - rank: 6 max len: 69 min len: 52 avg len: 60.5 num_loss_counted_tokens: 60
 total tokens: 196 num samples: 2 num padding tokens: 53 - rank: 5 max len: 98 min len: 45 avg len: 71.5 num_loss_counted_tokens: 99
 total tokens: 110 num samples: 2 num padding tokens: 5 - rank: 5 max len: 55 min len: 50 avg len: 52.5 num_loss_counted_tokens: 57
 total tokens: 176 num samples: 2 num padding tokens: 17 - rank: 5 max len: 88 min len: 71 avg len: 79.5 num_loss_counted_tokens: 86
 total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 5 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 63
 total tokens: 172 num samples: 2 num padding tokens: 18 - rank: 5 max len: 86 min len: 68 avg len: 77.0 num_loss_counted_tokens: 74
 total tokens: 282 num samples: 2 num padding tokens: 81 - rank: 5 max len: 141 min len: 60 avg len: 100.5 num_loss_counted_tokens: 152
 total tokens: 162 num samples: 2 num padding tokens: 11 - rank: 5 max len: 81 min len: 70 avg len: 75.5 num_loss_counted_tokens: 86
 total tokens: 166 num samples: 2 num padding tokens: 24 - rank: 5 max len: 83 min len: 59 avg len: 71.0 num_loss_counted_tokens: 80
 total tokens: 216 num samples: 2 num padding tokens: 47 - rank: 5 max len: 108 min len: 61 avg len: 84.5 num_loss_counted_tokens: 103
 total tokens: 226 num samples: 2 num padding tokens: 40 - rank: 7 max len: 113 min len: 73 avg len: 93.0 num_loss_counted_tokens: 109
 total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 5 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 55
 total tokens: 180 num samples: 2 num padding tokens: 22 - rank: 4 max len: 90 min len: 68 avg len: 79.0 num_loss_counted_tokens: 111
 total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 5 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 71
 total tokens: 152 num samples: 2 num padding tokens: 24 - rank: 4 max len: 76 min len: 52 avg len: 64.0 num_loss_counted_tokens: 59
 total tokens: 140 num samples: 2 num padding tokens: 19 - rank: 4 max len: 70 min len: 51 avg len: 60.5 num_loss_counted_tokens: 55
 total tokens: 102 num samples: 2 num padding tokens: 6 - rank: 3 max len: 51 min len: 45 avg len: 48.0 num_loss_counted_tokens: 50
 total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 4 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 83
 total tokens: 118 num samples: 2 num padding tokens: 11 - rank: 6 max len: 59 min len: 48 avg len: 53.5 num_loss_counted_tokens: 53
 total tokens: 114 num samples: 2 num padding tokens: 8 - rank: 3 max len: 57 min len: 49 avg len: 53.0 num_loss_counted_tokens: 58
 total tokens: 148 num samples: 2 num padding tokens: 14 - rank: 4 max len: 74 min len: 60 avg len: 67.0 num_loss_counted_tokens: 74
 total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 4 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 53
 total tokens: 162 num samples: 2 num padding tokens: 15 - rank: 4 max len: 81 min len: 66 avg len: 73.5 num_loss_counted_tokens: 87
 total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 7 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 75
 total tokens: 168 num samples: 2 num padding tokens: 34 - rank: 4 max len: 84 min len: 50 avg len: 67.0 num_loss_counted_tokens: 88
 total tokens: 160 num samples: 2 num padding tokens: 2 - rank: 4 max len: 80 min len: 78 avg len: 79.0 num_loss_counted_tokens: 91
 total tokens: 128 num samples: 2 num padding tokens: 20 - rank: 4 max len: 64 min len: 44 avg len: 54.0 num_loss_counted_tokens: 47
 total tokens: 186 num samples: 2 num padding tokens: 3 - rank: 4 max len: 93 min len: 90 avg len: 91.5 num_loss_counted_tokens: 114
 total tokens: 126 num samples: 2 num padding tokens: 3 - rank: 5 max len: 63 min len: 60 avg len: 61.5 num_loss_counted_tokens: 65
 total tokens: 136 num samples: 2 num padding tokens: 15 - rank: 4 max len: 68 min len: 53 avg len: 60.5 num_loss_counted_tokens: 52
 total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 0 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 54
 total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 0 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 51
 total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 0 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 78
 total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 87
 total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 0 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 73
 total tokens: 118 num samples: 2 num padding tokens: 2 - rank: 1 max len: 59 min len: 57 avg len: 58.0 num_loss_counted_tokens: 60
 total tokens: 152 num samples: 2 num padding tokens: 27 - rank: 1 max len: 76 min len: 49 avg len: 62.5 num_loss_counted_tokens: 72
 total tokens: 130 num samples: 2 num padding tokens: 7 - rank: 1 max len: 65 min len: 58 avg len: 61.5 num_loss_counted_tokens: 59
 total tokens: 188 num samples: 2 num padding tokens: 34 - rank: 0 max len: 94 min len: 60 avg len: 77.0 num_loss_counted_tokens: 90
 total tokens: 164 num samples: 2 num padding tokens: 12 - rank: 1 max len: 82 min len: 70 avg len: 76.0 num_loss_counted_tokens: 96
 total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 0 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 60
 total tokens: 244 num samples: 2 num padding tokens: 67 - rank: 0 max len: 122 min len: 55 avg len: 88.5 num_loss_counted_tokens: 127
 total tokens: 106 num samples: 2 num padding tokens: 7 - rank: 0 max len: 53 min len: 46 avg len: 49.5 num_loss_counted_tokens: 45
 total tokens: 128 num samples: 2 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 59
 total tokens: 110 num samples: 2 num padding tokens: 5 - rank: 0 max len: 55 min len: 50 avg len: 52.5 num_loss_counted_tokens: 57
 total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 1 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 60
 total tokens: 208 num samples: 2 num padding tokens: 33 - rank: 0 max len: 104 min len: 71 avg len: 87.5 num_loss_counted_tokens: 124
 total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 1 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 86
 total tokens: 154 num samples: 2 num padding tokens: 10 - rank: 1 max len: 77 min len: 67 avg len: 72.0 num_loss_counted_tokens: 68
 total tokens: 132 num samples: 2 num padding tokens: 18 - rank: 1 max len: 66 min len: 48 avg len: 57.0 num_loss_counted_tokens: 68
 total tokens: 126 num samples: 2 num padding tokens: 2 - rank: 1 max len: 63 min len: 61 avg len: 62.0 num_loss_counted_tokens: 68
 total tokens: 138 num samples: 2 num padding tokens: 8 - rank: 1 max len: 69 min len: 61 avg len: 65.0 num_loss_counted_tokens: 58
 total tokens: 146 num samples: 2 num padding tokens: 2 - rank: 2 max len: 73 min len: 71 avg len: 72.0 num_loss_counted_tokens: 79
 total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 2 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 64
 total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 2 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 59
 total tokens: 180 num samples: 2 num padding tokens: 31 - rank: 0 max len: 90 min len: 59 avg len: 74.5 num_loss_counted_tokens: 104
 total tokens: 186 num samples: 2 num padding tokens: 1 - rank: 2 max len: 93 min len: 92 avg len: 92.5 num_loss_counted_tokens: 125
 total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 2 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 50
 total tokens: 214 num samples: 2 num padding tokens: 31 - rank: 2 max len: 107 min len: 76 avg len: 91.5 num_loss_counted_tokens: 132
 total tokens: 174 num samples: 2 num padding tokens: 5 - rank: 2 max len: 87 min len: 82 avg len: 84.5 num_loss_counted_tokens: 109
 total tokens: 168 num samples: 2 num padding tokens: 29 - rank: 2 max len: 84 min len: 55 avg len: 69.5 num_loss_counted_tokens: 81
 total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 2 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 65
 total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 2 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 75
 total tokens: 200 num samples: 2 num padding tokens: 30 - rank: 1 max len: 100 min len: 70 avg len: 85.0 num_loss_counted_tokens: 92
 total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 2 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 62
 total tokens: 214 num samples: 2 num padding tokens: 49 - rank: 2 max len: 107 min len: 58 avg len: 82.5 num_loss_counted_tokens: 102
 Per-token loss scaled by world size: 0.0008622364257462323
 Per-token loss scaled by world size: 7.275798998307437e-05
 Per-token loss scaled by world size: 0.00035221243160776794
 Per-token loss scaled by world size: 0.0006777397356927395
 Per-token loss scaled by world size: 0.0003756655496545136
 Per-token loss scaled by world size: 0.0009425911703146994
 Per-token loss scaled by world size: 0.0004384875064715743
 Epoch: 8, Step: 97, Rank: 1, loss = 0.061649903655052185
 Epoch: 8, Step: 97, Rank: 0, loss = 0.005202196538448334
 Epoch: 8, Step: 97, Rank: 4, loss = 0.025183189660310745
 Epoch: 8, Step: 97, Rank: 5, loss = 0.048458389937877655
 Epoch: 8, Step: 97, Rank: 3, loss = 0.026860086247324944
 Epoch: 8, Step: 97, Rank: 7, loss = 0.06739526987075806
 Epoch: 8, Step: 97, Rank: 6, loss = 0.031351856887340546
 Per-token loss scaled by world size: 0.0008824544493108988
 Epoch: 8, Step: 97, Rank: 2, loss = 0.06309549510478973
 [2024-07-27 20:07:23,455] [INFO] [logging.py:96:log_dist] [Rank 0] step=97, skipped=0, lr=[2.7557479520891104e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:23,531] [INFO] [timer.py:258:stop] epoch=0/micro_step=97/global_step=97, RunningAvgSamplesPerSec=31.709145954932833, CurrSamplesPerSec=30.76501430086227, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:   8%|▊         | 1/12 [00:00<00:10,  1.07it/s]{
    "epoch": 8,
    "step": 97,
    "rank": 0,
    "loss": 0.005202196538448334,
    "overall_throughput": 30.654808948925602,
    "lr": 2.7557479520891104e-06,
    "cuda_mem_allocated": 22.001669883728027,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 572,
    "batch_size": 16,
    "total_loss": 0.041149549186229706,
    "gradnorm": 0.7585266828536987,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:23.574672"
 }
 Per-token loss scaled by world size: 0.00014853040920570493
 Per-token loss scaled by world size: 0.0014420171501114964
 Per-token loss scaled by world size: 0.00023432534362655133
 Per-token loss scaled by world size: 0.0010252870852127671
 Per-token loss scaled by world size: 0.00022051780251786113
 Per-token loss scaled by world size: 0.0014104293659329414
 Per-token loss scaled by world size: 0.0005214783013798296
 Epoch: 8, Step: 98, Rank: 5, loss = 0.10472649335861206
 Epoch: 8, Step: 98, Rank: 6, loss = 0.017017878592014313
 Epoch: 8, Step: 98, Rank: 7, loss = 0.01078702136874199
 Epoch: 8, Step: 98, Rank: 4, loss = 0.07446147501468658
 Epoch: 8, Step: 98, Rank: 1, loss = 0.016015104949474335
 Epoch: 8, Step: 98, Rank: 3, loss = 0.10243242979049683
 Epoch: 8, Step: 98, Rank: 2, loss = 0.03787236288189888
 Per-token loss scaled by world size: 0.0007809263770468533
 Epoch: 8, Step: 98, Rank: 0, loss = 0.056714776903390884
 [2024-07-27 20:07:23,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=98, skipped=0, lr=[2.5317852301584642e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:24,076] [INFO] [timer.py:258:stop] epoch=0/micro_step=98/global_step=98, RunningAvgSamplesPerSec=31.716539168080143, CurrSamplesPerSec=32.43497139719714, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  17%|█▋        | 2/12 [00:01<00:07,  1.42it/s]{
    "epoch": 8,
    "step": 98,
    "rank": 0,
    "loss": 0.056714776903390884,
    "overall_throughput": 32.37466676829651,
    "lr": 2.5317852301584642e-06,
    "cuda_mem_allocated": 21.996094703674316,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 581,
    "batch_size": 16,
    "total_loss": 0.052503444254398346,
    "gradnorm": 0.8310177326202393,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:24.122854"
 }
 Per-token loss scaled by world size: 1.7735277651809156e-05
 Per-token loss scaled by world size: 0.0005863551050424576
 Per-token loss scaled by world size: 0.0004776878922712058
 Per-token loss scaled by world size: 0.00042502893484197557
 Per-token loss scaled by world size: 0.000244389520958066
 Per-token loss scaled by world size: 4.276382242096588e-05
 Per-token loss scaled by world size: 0.00011345247185090557
 Epoch: 8, Step: 99, Rank: 3, loss = 0.03155839815735817
 Epoch: 8, Step: 99, Rank: 2, loss = 0.04353686794638634
 Epoch: 8, Step: 99, Rank: 1, loss = 0.03546832501888275
 Epoch: 8, Step: 99, Rank: 0, loss = 0.0013168443692848086
 Epoch: 8, Step: 99, Rank: 4, loss = 0.01814592257142067
 Epoch: 8, Step: 99, Rank: 6, loss = 0.003175213700160384
 Epoch: 8, Step: 99, Rank: 7, loss = 0.00842384621500969
 Per-token loss scaled by world size: 0.0002905700821429491
 Epoch: 8, Step: 99, Rank: 5, loss = 0.021574828773736954
 [2024-07-27 20:07:24,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=99, skipped=0, lr=[2.315988891431412e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:24,631] [INFO] [timer.py:258:stop] epoch=0/micro_step=99/global_step=99, RunningAvgSamplesPerSec=31.71474915120891, CurrSamplesPerSec=31.54384320597289, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 1584
 {
    "epoch": 8,
    "step": 99,
    "rank": 0,
    "loss": 0.0013168443692848086,
    "overall_throughput": 31.43285568885302,
    "lr": 2.315988891431412e-06,
    "cuda_mem_allocated": 21.998091220855713,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 594,
    "batch_size": 16,
    "total_loss": 0.02040003053843975,
    "gradnorm": 0.5513115525245667,
    "weight_norm": 393.475830078125,
    "timestamp": "2024-07-27T20:07:24.635155"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1584
 [20:07:42] INFO     saving took 17.857797861099243 seconds                                                                                                                                                                        utils.py:611
 Epoch 8:  25%|██▌       | 3/12 [00:19<01:19,  8.79s/it]
 Per-token loss scaled by world size: 0.00023032784520182759
 Per-token loss scaled by world size: 0.0005766816902905703
 Per-token loss scaled by world size: 0.00042750773718580604
 Per-token loss scaled by world size: 0.0003948273661080748
 Per-token loss scaled by world size: 0.00027044868329539895
 Per-token loss scaled by world size: 0.00016059860354289412
 Per-token loss scaled by world size: 1.1637920579232741e-05
 Epoch: 8, Step: 100, Rank: 0, loss = 0.020211268216371536
 Epoch: 8, Step: 100, Rank: 3, loss = 0.05060381814837456
 Epoch: 8, Step: 100, Rank: 1, loss = 0.03464610129594803
 Epoch: 8, Step: 100, Rank: 2, loss = 0.03751380369067192
 Epoch: 8, Step: 100, Rank: 4, loss = 0.02373187243938446
 Epoch: 8, Step: 100, Rank: 7, loss = 0.00102122756652534
 Epoch: 8, Step: 100, Rank: 5, loss = 0.014092527329921722
 Per-token loss scaled by world size: 3.5370929253986105e-05
 Epoch: 8, Step: 100, Rank: 6, loss = 0.0031037991866469383
 [2024-07-27 20:07:42,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[2.1085949060360654e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:43,046] [INFO] [timer.py:258:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=31.71639245876496, CurrSamplesPerSec=31.876606801027897, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  33%|███▎      | 4/12 [00:20<00:44,  5.54s/it]{
    "epoch": 8,
    "step": 100,
    "rank": 0,
    "loss": 0.020211268216371536,
    "overall_throughput": 31.806867290460243,
    "lr": 2.1085949060360654e-06,
    "cuda_mem_allocated": 21.999046802520752,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 702,
    "batch_size": 16,
    "total_loss": 0.023115552961826324,
    "gradnorm": 0.8944979310035706,
    "weight_norm": 393.4758605957031,
    "timestamp": "2024-07-27T20:07:43.089296"
 }
 Per-token loss scaled by world size: 0.002073504263535142
 Per-token loss scaled by world size: 0.0009401860297657549
 Per-token loss scaled by world size: 0.0007578051881864667
 Per-token loss scaled by world size: 0.00018321115931030363
 Per-token loss scaled by world size: 0.00033954239916056395
 Per-token loss scaled by world size: 0.00022701223497278988
 Per-token loss scaled by world size: 0.00016321164730470628
 Epoch: 8, Step: 101, Rank: 0, loss = 0.1342594027519226
 Epoch: 8, Step: 101, Rank: 4, loss = 0.04906788468360901
 Epoch: 8, Step: 101, Rank: 5, loss = 0.02198537066578865
 Epoch: 8, Step: 101, Rank: 2, loss = 0.011862922459840775
 Epoch: 8, Step: 101, Rank: 6, loss = 0.014699041843414307
 Epoch: 8, Step: 101, Rank: 1, loss = 0.06087704375386238
 Epoch: 8, Step: 101, Rank: 7, loss = 0.010567953810095787
 Per-token loss scaled by world size: 0.00039684344665147364
 Epoch: 8, Step: 101, Rank: 3, loss = 0.025695612654089928
 [2024-07-27 20:07:43,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=101, skipped=0, lr=[1.9098300562505266e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:43,592] [INFO] [timer.py:258:stop] epoch=0/micro_step=101/global_step=101, RunningAvgSamplesPerSec=31.718083745763387, CurrSamplesPerSec=31.884709476489913, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  42%|████▏     | 5/12 [00:20<00:26,  3.74s/it]{
    "epoch": 8,
    "step": 101,
    "rank": 0,
    "loss": 0.1342594027519226,
    "overall_throughput": 31.800326016907388,
    "lr": 1.9098300562505266e-06,
    "cuda_mem_allocated": 21.998091220855713,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 518,
    "batch_size": 16,
    "total_loss": 0.041126906871795654,
    "gradnorm": 0.7889791131019592,
    "weight_norm": 393.4758605957031,
    "timestamp": "2024-07-27T20:07:43.642554"
 }
 Per-token loss scaled by world size: 0.00021082548482809216
 Per-token loss scaled by world size: 0.0001290614891331643
 Per-token loss scaled by world size: 0.0004096523334737867
 Per-token loss scaled by world size: 0.00011912157060578465
 Per-token loss scaled by world size: 0.00024137772561516613
 Per-token loss scaled by world size: 0.00010579593799775466
 Per-token loss scaled by world size: 0.00036702080979011953
 Epoch: 8, Step: 102, Rank: 2, loss = 0.011470340192317963
 Epoch: 8, Step: 102, Rank: 1, loss = 0.03640785068273544
 Epoch: 8, Step: 102, Rank: 0, loss = 0.01873711496591568
 Epoch: 8, Step: 102, Rank: 4, loss = 0.009402614086866379
 Epoch: 8, Step: 102, Rank: 3, loss = 0.010586929507553577
 Epoch: 8, Step: 102, Rank: 6, loss = 0.03261897340416908
 Epoch: 8, Step: 102, Rank: 5, loss = 0.021452445536851883
 Per-token loss scaled by world size: 0.0002314479643246159
 Epoch: 8, Step: 102, Rank: 7, loss = 0.0205699373036623
 [2024-07-27 20:07:44,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=102, skipped=0, lr=[1.7199116885197996e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:44,136] [INFO] [timer.py:258:stop] epoch=0/micro_step=102/global_step=102, RunningAvgSamplesPerSec=31.727033550532532, CurrSamplesPerSec=32.638783565843816, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 8:  50%|█████     | 6/12 [00:21<00:15,  2.65s/it]{
    "epoch": 8,
    "step": 102,
    "rank": 0,
    "loss": 0.01873711496591568,
    "overall_throughput": 32.58409397351569,
    "lr": 1.7199116885197996e-06,
    "cuda_mem_allocated": 22.00572395324707,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 711,
    "batch_size": 16,
    "total_loss": 0.0201557744294405,
    "gradnorm": 0.5014692544937134,
    "weight_norm": 393.4758605957031,
    "timestamp": "2024-07-27T20:07:44.178902"
 }
 Per-token loss scaled by world size: 0.0001110902740038
 Per-token loss scaled by world size: 0.0010643235873430967
 Per-token loss scaled by world size: 0.00038977997610345483
 Per-token loss scaled by world size: 0.0005851351888850331
 Per-token loss scaled by world size: 0.0006453694077208638
 Per-token loss scaled by world size: 2.1985697458148934e-05
 Per-token loss scaled by world size: 0.0008211812237277627
 Epoch: 8, Step: 103, Rank: 6, loss = 0.0468108169734478
 Epoch: 8, Step: 103, Rank: 1, loss = 0.08514588326215744
 Epoch: 8, Step: 103, Rank: 5, loss = 0.03118239715695381
 Epoch: 8, Step: 103, Rank: 2, loss = 0.051629554480314255
 Epoch: 8, Step: 103, Rank: 0, loss = 0.008887222036719322
 Epoch: 8, Step: 103, Rank: 7, loss = 0.0017588557675480843
 Epoch: 8, Step: 103, Rank: 3, loss = 0.06569449603557587
 Per-token loss scaled by world size: 0.0005115483654662967
 Epoch: 8, Step: 103, Rank: 4, loss = 0.04092387109994888
 [2024-07-27 20:07:44,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=103, skipped=0, lr=[1.5390474757906449e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:44,676] [INFO] [timer.py:258:stop] epoch=0/micro_step=103/global_step=103, RunningAvgSamplesPerSec=31.73624582311991, CurrSamplesPerSec=32.68529726054485, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 8:  58%|█████▊    | 7/12 [00:22<00:09,  1.96s/it]{
    "epoch": 8,
    "step": 103,
    "rank": 0,
    "loss": 0.008887222036719322,
    "overall_throughput": 32.6317687389074,
    "lr": 1.5390474757906449e-06,
    "cuda_mem_allocated": 22.01240301132202,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 640,
    "batch_size": 16,
    "total_loss": 0.0415041409432888,
    "gradnorm": 0.7137126326560974,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:44.718888"
 }
 Per-token loss scaled by world size: 0.000524764705915004
 Per-token loss scaled by world size: 0.00015330749738495797
 Per-token loss scaled by world size: 0.001214228686876595
 Per-token loss scaled by world size: 0.00014493752678390592
 Per-token loss scaled by world size: 0.0008454297785647213
 Per-token loss scaled by world size: 0.0007223724969662726
 Per-token loss scaled by world size: 0.0003260647936258465
 Epoch: 8, Step: 104, Rank: 7, loss = 0.01151722576469183
 Epoch: 8, Step: 104, Rank: 5, loss = 0.06351291388273239
 Epoch: 8, Step: 104, Rank: 4, loss = 0.01088843122124672
 Epoch: 8, Step: 104, Rank: 0, loss = 0.03942294791340828
 Epoch: 8, Step: 104, Rank: 3, loss = 0.09121893346309662
 Epoch: 8, Step: 104, Rank: 2, loss = 0.054268233478069305
 Epoch: 8, Step: 104, Rank: 1, loss = 0.024495618417859077
 Per-token loss scaled by world size: 0.0009017937700264156
 Epoch: 8, Step: 104, Rank: 6, loss = 0.06774725764989853
 [2024-07-27 20:07:45,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=104, skipped=0, lr=[1.367435190424261e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:45,219] [INFO] [timer.py:258:stop] epoch=0/micro_step=104/global_step=104, RunningAvgSamplesPerSec=31.74047431733202, CurrSamplesPerSec=32.17343553961532, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  67%|██████▋   | 8/12 [00:22<00:06,  1.51s/it]{
    "epoch": 8,
    "step": 104,
    "rank": 0,
    "loss": 0.03942294791340828,
    "overall_throughput": 32.1218766079758,
    "lr": 1.367435190424261e-06,
    "cuda_mem_allocated": 22.004770278930664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 601,
    "batch_size": 16,
    "total_loss": 0.04538394883275032,
    "gradnorm": 0.7771502137184143,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:45.261450"
 }
 Per-token loss scaled by world size: 0.0007052936707623303
 Per-token loss scaled by world size: 0.000300221232464537Per-token loss scaled by world size: 0.0001537334028398618Per-token loss scaled by world size: 0.0005797221674583852Per-token loss scaled by world size: 2.4881815988919698e-05Per-token loss scaled by world size: 8.731409616302699e-05




 Per-token loss scaled by world size: 8.151983638526872e-05
 Epoch: 8, Step: 105, Rank: 0, loss = 0.04989952594041824
 Epoch: 8, Step: 105, Rank: 6, loss = 0.010876637883484364Epoch: 8, Step: 105, Rank: 5, loss = 0.0017603884916752577
 Epoch: 8, Step: 105, Rank: 1, loss = 0.0061774724163115025Epoch: 8, Step: 105, Rank: 2, loss = 0.021240651607513428


 Epoch: 8, Step: 105, Rank: 7, loss = 0.04101534187793732
 Epoch: 8, Step: 105, Rank: 4, loss = 0.005767528433352709
 Per-token loss scaled by world size: 0.0010399594902992249
 Epoch: 8, Step: 105, Rank: 3, loss = 0.07357713580131531
 [2024-07-27 20:07:45,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=105, skipped=0, lr=[1.2052624879351105e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:45,759] [INFO] [timer.py:258:stop] epoch=0/micro_step=105/global_step=105, RunningAvgSamplesPerSec=31.74492270946544, CurrSamplesPerSec=32.205303527286674, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  75%|███████▌  | 9/12 [00:23<00:03,  1.21s/it]{
    "epoch": 8,
    "step": 105,
    "rank": 0,
    "loss": 0.04989952594041824,
    "overall_throughput": 32.11246971610297,
    "lr": 1.2052624879351105e-06,
    "cuda_mem_allocated": 21.996421813964844,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 566,
    "batch_size": 16,
    "total_loss": 0.026289334520697594,
    "gradnorm": 0.5574781894683838,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:45.808243"
 }
 Per-token loss scaled by world size: 0.0008091902709566057Per-token loss scaled by world size: 0.0007261958089657128Per-token loss scaled by world size: 0.0015269122086465359Per-token loss scaled by world size: 0.0014011193998157978Per-token loss scaled by world size: 3.460650987108238e-05



 Per-token loss scaled by world size: 0.00045884415158070624Per-token loss scaled by world size: 5.5351883929688483e-05


 Epoch: 8, Step: 106, Rank: 0, loss = 0.05917203798890114Epoch: 8, Step: 106, Rank: 1, loss = 0.11165545880794525Epoch: 8, Step: 106, Rank: 2, loss = 0.053103066980838776
 Epoch: 8, Step: 106, Rank: 6, loss = 0.10245685279369354


 Epoch: 8, Step: 106, Rank: 3, loss = 0.0025306011084467173
 Epoch: 8, Step: 106, Rank: 5, loss = 0.004047606606036425Epoch: 8, Step: 106, Rank: 4, loss = 0.033552978187799454

 Per-token loss scaled by world size: 0.00016226798470597714
 Epoch: 8, Step: 106, Rank: 7, loss = 0.011865845881402493
 [2024-07-27 20:07:46,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=106, skipped=0, lr=[1.0527067017923654e-06], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:46,311] [INFO] [timer.py:258:stop] epoch=0/micro_step=106/global_step=106, RunningAvgSamplesPerSec=31.74620798139189, CurrSamplesPerSec=31.879150748989836, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8:  83%|████████▎ | 10/12 [00:23<00:02,  1.00s/it]{
    "epoch": 8,
    "step": 106,
    "rank": 0,
    "loss": 0.05917203798890114,
    "overall_throughput": 31.805977880940024,
    "lr": 1.0527067017923654e-06,
    "cuda_mem_allocated": 21.998091220855713,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 585,
    "batch_size": 16,
    "total_loss": 0.04729805514216423,
    "gradnorm": 0.8684948086738586,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:46.362067"
 }
 Per-token loss scaled by world size: 3.8418351323343813e-05Per-token loss scaled by world size: 0.00024928394122980535Per-token loss scaled by world size: 0.0008044589194469154Per-token loss scaled by world size: 0.0006902430322952569Per-token loss scaled by world size: 0.0003151585115119815Per-token loss scaled by world size: 0.000329785660142079



 Per-token loss scaled by world size: 2.906994086515624e-05


 Epoch: 8, Step: 107, Rank: 4, loss = 0.05936089903116226Epoch: 8, Step: 107, Rank: 5, loss = 0.06918346881866455
 Epoch: 8, Step: 107, Rank: 0, loss = 0.0033039783593267202Epoch: 8, Step: 107, Rank: 3, loss = 0.021438419818878174

 Epoch: 8, Step: 107, Rank: 2, loss = 0.028361566364765167Epoch: 8, Step: 107, Rank: 6, loss = 0.02710363268852234
 Epoch: 8, Step: 107, Rank: 1, loss = 0.0025000148452818394


 Per-token loss scaled by world size: 3.5674460377776995e-05
 Epoch: 8, Step: 107, Rank: 7, loss = 0.0030680035706609488
 [2024-07-27 20:07:46,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=107, skipped=0, lr=[9.09934649508375e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:46,845] [INFO] [timer.py:258:stop] epoch=0/micro_step=107/global_step=107, RunningAvgSamplesPerSec=31.759325648755453, CurrSamplesPerSec=33.18541023815175, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 Epoch 8:  92%|█████████▏| 11/12 [00:24<00:00,  1.16it/s]{
    "epoch": 8,
    "step": 107,
    "rank": 0,
    "loss": 0.0033039783593267202,
    "overall_throughput": 33.10411670052279,
    "lr": 9.09934649508375e-07,
    "cuda_mem_allocated": 22.00811004638672,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 688,
    "batch_size": 16,
    "total_loss": 0.02678999863564968,
    "gradnorm": 0.5617575645446777,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:46.887834"
 }
 Per-token loss scaled by world size: 0.0007775372941978276Per-token loss scaled by world size: 0.0009101248579099774Per-token loss scaled by world size: 8.433926268480718e-06Per-token loss scaled by world size: 3.585006925277412e-05Per-token loss scaled by world size: 3.69123590644449e-05


 Per-token loss scaled by world size: 0.0004430596309248358Per-token loss scaled by world size: 0.0002181310555897653



 Epoch: 8, Step: 108, Rank: 6, loss = 0.07588165998458862
 Epoch: 8, Step: 108, Rank: 3, loss = 0.0029889994766563177
 Epoch: 8, Step: 108, Rank: 7, loss = 0.03694009780883789Epoch: 8, Step: 108, Rank: 1, loss = 0.0030775677878409624

 Epoch: 8, Step: 108, Rank: 5, loss = 0.06482717394828796Epoch: 8, Step: 108, Rank: 2, loss = 0.0007031786371953785

 Epoch: 8, Step: 108, Rank: 4, loss = 0.018186677247285843
 Per-token loss scaled by world size: 0.0005422345129773021
 Epoch: 8, Step: 108, Rank: 0, loss = 0.045208804309368134
 [2024-07-27 20:07:47,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=108, skipped=0, lr=[7.771024502261526e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:47,383] [INFO] [timer.py:258:stop] epoch=0/micro_step=108/global_step=108, RunningAvgSamplesPerSec=31.765144134069732, CurrSamplesPerSec=32.38818214329323, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 8: 100%|██████████| 12/12 [00:24<00:00,  1.31it/s]{
    "epoch": 8,
    "step": 108,
    "rank": 0,
    "loss": 0.045208804309368134,
    "overall_throughput": 32.30654330502832,
    "lr": 7.771024502261526e-07,
    "cuda_mem_allocated": 21.99594497680664,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 667,
    "batch_size": 16,
    "total_loss": 0.03097676858305931,
    "gradnorm": 0.6119153499603271,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:47.430973"
 }
 Epoch 8: 100%|██████████| 12/12 [00:24<00:00,  2.07s/it]
 total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 5 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 64 total tokens: 140 num samples: 2 num padding tokens: 25 - rank: 5 max len: 70 min len: 45 avg len: 57.5 num_loss_counted_tokens: 67

 total tokens: 188 num samples: 2 num padding tokens: 28 - rank: 5 max len: 94 min len: 66 avg len: 80.0 num_loss_counted_tokens: 82
 total tokens: 166 num samples: 2 num padding tokens: 24 - rank: 5 max len: 83 min len: 59 avg len: 71.0 num_loss_counted_tokens: 64
 total tokens: 122 num samples: 2 num padding tokens: 1 - rank: 5 max len: 61 min len: 60 avg len: 60.5 num_loss_counted_tokens: 66
 total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 5 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 68
 total tokens: 164 num samples: 2 num padding tokens: 19 - rank: 5 max len: 82 min len: 63 avg len: 72.5 num_loss_counted_tokens: 78
 total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 5 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 66
 total tokens: 282 num samples: 2 num padding tokens: 80 - rank: 5 max len: 141 min len: 61 avg len: 101.0 num_loss_counted_tokens: 146
 total tokens: 124 num samples: 2 num padding tokens: 17 - rank: 1 max len: 62 min len: 45 avg len: 53.5 num_loss_counted_tokens: 53
 total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 4 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 57
 total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 5 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 48
 total tokens: 160 num samples: 2 num padding tokens: 6 - rank: 7 max len: 80 min len: 74 avg len: 77.0 num_loss_counted_tokens: 94
 total tokens: 200 num samples: 2 num padding tokens: 45 - rank: 4 max len: 100 min len: 55 avg len: 77.5 num_loss_counted_tokens: 99
 total tokens: 96 num samples: 2 num padding tokens: 5 - rank: 5 max len: 48 min len: 43 avg len: 45.5 num_loss_counted_tokens: 39
 total tokens: 140 num samples: 2 num padding tokens: 22 - rank: 3 max len: 70 min len: 48 avg len: 59.0 num_loss_counted_tokens: 73
 total tokens: 216 num samples: 2 num padding tokens: 42 - rank: 2 max len: 108 min len: 66 avg len: 87.0 num_loss_counted_tokens: 105
 total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 68
 total tokens: 104 num samples: 2 num padding tokens: 0 - rank: 1 max len: 52 min len: 52 avg len: 52.0 num_loss_counted_tokens: 50
 total tokens: 176 num samples: 2 num padding tokens: 24 - rank: 1 max len: 88 min len: 64 avg len: 76.0 num_loss_counted_tokens: 94
 total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 3 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 69
 total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 3 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 71
 total tokens: 186 num samples: 2 num padding tokens: 3 - rank: 7 max len: 93 min len: 90 avg len: 91.5 num_loss_counted_tokens: 173
 total tokens: 168 num samples: 2 num padding tokens: 18 - rank: 1 max len: 84 min len: 66 avg len: 75.0 num_loss_counted_tokens: 96
 total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 6 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 57
 total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 6 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 64
 total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 6 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 90
 total tokens: 168 num samples: 2 num padding tokens: 31 - rank: 6 max len: 84 min len: 53 avg len: 68.5 num_loss_counted_tokens: 73
 total tokens: 138 num samples: 2 num padding tokens: 6 - rank: 1 max len: 69 min len: 63 avg len: 66.0 num_loss_counted_tokens: 78
 total tokens: 114 num samples: 2 num padding tokens: 6 - rank: 1 max len: 57 min len: 51 avg len: 54.0 num_loss_counted_tokens: 56
 total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 6 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 80 total tokens: 146 num samples: 2 num padding tokens: 27 - rank: 3 max len: 73 min len: 46 avg len: 59.5 num_loss_counted_tokens: 65

 total tokens: 116 num samples: 2 num padding tokens: 5 - rank: 2 max len: 58 min len: 53 avg len: 55.5 num_loss_counted_tokens: 64
 total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 1 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 56
 total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 73
 total tokens: 128 num samples: 2 num padding tokens: 18 - rank: 4 max len: 64 min len: 46 avg len: 55.0 num_loss_counted_tokens: 59
 total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 6 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 73
 total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 2 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 52
 total tokens: 152 num samples: 2 num padding tokens: 11 - rank: 5 max len: 76 min len: 65 avg len: 70.5 num_loss_counted_tokens: 79
 total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 4 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 84
 total tokens: 142 num samples: 2 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 74
 total tokens: 146 num samples: 2 num padding tokens: 16 - rank: 6 max len: 73 min len: 57 avg len: 65.0 num_loss_counted_tokens: 72
 total tokens: 154 num samples: 2 num padding tokens: 28 - rank: 4 max len: 77 min len: 49 avg len: 63.0 num_loss_counted_tokens: 80
 total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 4 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 99
 total tokens: 226 num samples: 2 num padding tokens: 35 - rank: 2 max len: 113 min len: 78 avg len: 95.5 num_loss_counted_tokens: 114
 total tokens: 166 num samples: 2 num padding tokens: 17 - rank: 2 max len: 83 min len: 66 avg len: 74.5 num_loss_counted_tokens: 86
 total tokens: 150 num samples: 2 num padding tokens: 9 - rank: 2 max len: 75 min len: 66 avg len: 70.5 num_loss_counted_tokens: 82
 total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 63
 total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 6 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 52
 total tokens: 116 num samples: 2 num padding tokens: 7 - rank: 4 max len: 58 min len: 51 avg len: 54.5 num_loss_counted_tokens: 59
 total tokens: 194 num samples: 2 num padding tokens: 27 - rank: 4 max len: 97 min len: 70 avg len: 83.5 num_loss_counted_tokens: 113
 total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 6 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 60
 total tokens: 124 num samples: 2 num padding tokens: 8 - rank: 2 max len: 62 min len: 54 avg len: 58.0 num_loss_counted_tokens: 64
 total tokens: 152 num samples: 2 num padding tokens: 27 - rank: 7 max len: 76 min len: 49 avg len: 62.5 num_loss_counted_tokens: 64
 total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 1 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 82
 total tokens: 208 num samples: 2 num padding tokens: 34 - rank: 2 max len: 104 min len: 70 avg len: 87.0 num_loss_counted_tokens: 107
 total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 1 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 97
 total tokens: 172 num samples: 2 num padding tokens: 42 - rank: 1 max len: 86 min len: 44 avg len: 65.0 num_loss_counted_tokens: 68
 total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 4 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 69
 total tokens: 162 num samples: 2 num padding tokens: 18 - rank: 2 max len: 81 min len: 63 avg len: 72.0 num_loss_counted_tokens: 79
 total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 3 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 57
 total tokens: 214 num samples: 2 num padding tokens: 37 - rank: 2 max len: 107 min len: 70 avg len: 88.5 num_loss_counted_tokens: 99
 total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 3 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 54
 total tokens: 114 num samples: 2 num padding tokens: 2 - rank: 7 max len: 57 min len: 55 avg len: 56.0 num_loss_counted_tokens: 64
 total tokens: 146 num samples: 2 num padding tokens: 12 - rank: 7 max len: 73 min len: 61 avg len: 67.0 num_loss_counted_tokens: 86
 total tokens: 184 num samples: 2 num padding tokens: 42 - rank: 1 max len: 92 min len: 50 avg len: 71.0 num_loss_counted_tokens: 86
 total tokens: 164 num samples: 2 num padding tokens: 32 - rank: 6 max len: 82 min len: 50 avg len: 66.0 num_loss_counted_tokens: 94
 total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 2 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 63
 total tokens: 136 num samples: 2 num padding tokens: 16 - rank: 1 max len: 68 min len: 52 avg len: 60.0 num_loss_counted_tokens: 53
 total tokens: 160 num samples: 2 num padding tokens: 14 - rank: 7 max len: 80 min len: 66 avg len: 73.0 num_loss_counted_tokens: 81
 total tokens: 196 num samples: 2 num padding tokens: 49 - rank: 7 max len: 98 min len: 49 avg len: 73.5 num_loss_counted_tokens: 96
 total tokens: 174 num samples: 2 num padding tokens: 23 - rank: 7 max len: 87 min len: 64 avg len: 75.5 num_loss_counted_tokens: 77
 total tokens: 180 num samples: 2 num padding tokens: 28 - rank: 7 max len: 90 min len: 62 avg len: 76.0 num_loss_counted_tokens: 92
 total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 3 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 65
 total tokens: 228 num samples: 2 num padding tokens: 45 - rank: 3 max len: 114 min len: 69 avg len: 91.5 num_loss_counted_tokens: 117
 total tokens: 188 num samples: 2 num padding tokens: 41 - rank: 3 max len: 94 min len: 53 avg len: 73.5 num_loss_counted_tokens: 85
 total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 7 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 60
 total tokens: 130 num samples: 2 num padding tokens: 2 - rank: 3 max len: 65 min len: 63 avg len: 64.0 num_loss_counted_tokens: 57
 total tokens: 142 num samples: 2 num padding tokens: 1 - rank: 6 max len: 71 min len: 70 avg len: 70.5 num_loss_counted_tokens: 66
 total tokens: 124 num samples: 2 num padding tokens: 18 - rank: 0 max len: 62 min len: 44 avg len: 53.0 num_loss_counted_tokens: 50
 total tokens: 154 num samples: 2 num padding tokens: 3 - rank: 2 max len: 77 min len: 74 avg len: 75.5 num_loss_counted_tokens: 81
 total tokens: 134 num samples: 2 num padding tokens: 19 - rank: 0 max len: 67 min len: 48 avg len: 57.5 num_loss_counted_tokens: 64
 total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 4 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 75
 total tokens: 144 num samples: 2 num padding tokens: 28 - rank: 7 max len: 72 min len: 44 avg len: 58.0 num_loss_counted_tokens: 72
 total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 0 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 75
 total tokens: 162 num samples: 2 num padding tokens: 0 - rank: 0 max len: 81 min len: 81 avg len: 81.0 num_loss_counted_tokens: 96
 total tokens: 120 num samples: 2 num padding tokens: 16 - rank: 0 max len: 60 min len: 44 avg len: 52.0 num_loss_counted_tokens: 55
 total tokens: 134 num samples: 2 num padding tokens: 13 - rank: 0 max len: 67 min len: 54 avg len: 60.5 num_loss_counted_tokens: 58
 total tokens: 244 num samples: 2 num padding tokens: 46 - rank: 0 max len: 122 min len: 76 avg len: 99.0 num_loss_counted_tokens: 149
 total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 6 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 83
 total tokens: 214 num samples: 2 num padding tokens: 62 - rank: 0 max len: 107 min len: 45 avg len: 76.0 num_loss_counted_tokens: 99
 total tokens: 202 num samples: 2 num padding tokens: 40 - rank: 0 max len: 101 min len: 61 avg len: 81.0 num_loss_counted_tokens: 104
 total tokens: 116 num samples: 2 num padding tokens: 13 - rank: 3 max len: 58 min len: 45 avg len: 51.5 num_loss_counted_tokens: 51
 total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 0 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 54
 total tokens: 166 num samples: 2 num padding tokens: 31 - rank: 0 max len: 83 min len: 52 avg len: 67.5 num_loss_counted_tokens: 81
 total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 0 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 66
 Per-token loss scaled by world size: 0.00020837620832026005Per-token loss scaled by world size: 0.00101291888859123Per-token loss scaled by world size: 0.0005203241598792374
 Per-token loss scaled by world size: 0.0005080624832771719
 Per-token loss scaled by world size: 0.0008971338393166661

 Per-token loss scaled by world size: 9.295487870986108e-06
 Per-token loss scaled by world size: 3.1353247322840616e-05

 Epoch: 9, Step: 109, Rank: 1, loss = 0.07001801580190659
 Epoch: 9, Step: 109, Rank: 6, loss = 0.014404005371034145
 Epoch: 9, Step: 109, Rank: 4, loss = 0.03596740588545799
 Epoch: 9, Step: 109, Rank: 5, loss = 0.0620143748819828Epoch: 9, Step: 109, Rank: 7, loss = 0.03511982038617134

 Epoch: 9, Step: 109, Rank: 3, loss = 0.0021672931034117937
 Epoch: 9, Step: 109, Rank: 2, loss = 0.0006425505853258073
 Per-token loss scaled by world size: 1.730842632241547e-05
 Epoch: 9, Step: 109, Rank: 0, loss = 0.0011964449658989906
 [2024-07-27 20:07:48,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=109, skipped=0, lr=[6.543553540053926e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:48,396] [INFO] [timer.py:258:stop] epoch=0/micro_step=109/global_step=109, RunningAvgSamplesPerSec=31.767697068441695, CurrSamplesPerSec=32.0406552236319, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 9:   8%|▊         | 1/12 [00:00<00:10,  1.08it/s]{
    "epoch": 9,
    "step": 109,
    "rank": 0,
    "loss": 0.0011964449658989906,
    "overall_throughput": 31.927760494448584,
    "lr": 6.543553540053926e-07,
    "cuda_mem_allocated": 21.998091220855713,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 553,
    "batch_size": 16,
    "total_loss": 0.02769123949110508,
    "gradnorm": 0.690696120262146,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:48.445399"
 }
 Per-token loss scaled by world size: 0.0014274526620283723Per-token loss scaled by world size: 0.0002300855703651905Per-token loss scaled by world size: 0.0005493586650118232Per-token loss scaled by world size: 0.0009031961672008038Per-token loss scaled by world size: 0.0002857441722881049Per-token loss scaled by world size: 0.0005578985437750816





 Per-token loss scaled by world size: 0.00028736007516272366
 Epoch: 9, Step: 110, Rank: 2, loss = 0.03742505982518196Epoch: 9, Step: 110, Rank: 5, loss = 0.01567457988858223Epoch: 9, Step: 110, Rank: 3, loss = 0.06153023988008499


 Epoch: 9, Step: 110, Rank: 6, loss = 0.09724520891904831
 Epoch: 9, Step: 110, Rank: 7, loss = 0.019466321915388107
 Epoch: 9, Step: 110, Rank: 0, loss = 0.03800683841109276
 Epoch: 9, Step: 110, Rank: 4, loss = 0.019576406106352806
 Per-token loss scaled by world size: 0.0003402826841920614
 Epoch: 9, Step: 110, Rank: 1, loss = 0.02318175695836544
 [2024-07-27 20:07:48,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[5.418275829936537e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:07:48,932] [INFO] [timer.py:258:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=31.778440450016202, CurrSamplesPerSec=32.971544549678505, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Saving model in huggingface format at samples_seen: 1760
 {
    "epoch": 9,
    "step": 110,
    "rank": 0,
    "loss": 0.03800683841109276,
    "overall_throughput": 32.86280148005085,
    "lr": 5.418275829936537e-07,
    "cuda_mem_allocated": 21.999285221099854,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 545,
    "batch_size": 16,
    "total_loss": 0.03901330381631851,
    "gradnorm": 0.746061384677887,
    "weight_norm": 393.47589111328125,
    "timestamp": "2024-07-27T20:07:48.935854"
 }
 Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1760
 [20:08:06] INFO     saving took 18.007025003433228 seconds                                                                                                                                                                        utils.py:611
 Per-token loss scaled by world size: 0.000437290029367432Per-token loss scaled by world size: 0.00030764410621486604Per-token loss scaled by world size: 0.00027671968564391136
 Epoch 9:  17%|█▋        | 2/12 [00:19<01:52, 11.29s/it]

 Per-token loss scaled by world size: 3.6100764191360213e-06Per-token loss scaled by world size: 8.117486140690744e-05

 Per-token loss scaled by world size: 1.315043573413277e-05
 Epoch: 9, Step: 111, Rank: 1, loss = 0.024212971329689026
 Epoch: 9, Step: 111, Rank: 0, loss = 0.038262877613306046Epoch: 9, Step: 111, Rank: 6, loss = 0.0003158816834911704Epoch: 9, Step: 111, Rank: 7, loss = 0.026918860152363777
 Epoch: 9, Step: 111, Rank: 5, loss = 0.007102800067514181


 Epoch: 9, Step: 111, Rank: 4, loss = 0.0011506631271913648
 Per-token loss scaled by world size: 4.370940587250516e-05
 Epoch: 9, Step: 111, Rank: 2, loss = 0.0038245730102062225
 Per-token loss scaled by world size: 0.0007190197939053178
 Epoch: 9, Step: 111, Rank: 3, loss = 0.06291422992944717
 [2024-07-27 20:08:07,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=111, skipped=0, lr=[4.396421846564236e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:07,493] [INFO] [timer.py:258:stop] epoch=0/micro_step=111/global_step=111, RunningAvgSamplesPerSec=31.77947002508808, CurrSamplesPerSec=31.891058187078368, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 9:  25%|██▌       | 3/12 [00:20<00:57,  6.39s/it]{
    "epoch": 9,
    "step": 111,
    "rank": 0,
    "loss": 0.038262877613306046,
    "overall_throughput": 31.83512815148673,
    "lr": 4.396421846564236e-07,
    "cuda_mem_allocated": 22.00214672088623,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 700,
    "batch_size": 16,
    "total_loss": 0.020587855949997902,
    "gradnorm": 0.4546854794025421,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:07.536418"
 }
 Per-token loss scaled by world size: 1.5196498679870274e-05Per-token loss scaled by world size: 0.00038456235779449344Per-token loss scaled by world size: 2.9230683139758185e-05Per-token loss scaled by world size: 4.643391366698779e-05Per-token loss scaled by world size: 0.000584576278924942




 Per-token loss scaled by world size: 2.605171175673604e-05Per-token loss scaled by world size: 0.000400967663154006
 Epoch: 9, Step: 112, Rank: 7, loss = 0.0473506785929203
 Epoch: 9, Step: 112, Rank: 5, loss = 0.031149551272392273
 Epoch: 9, Step: 112, Rank: 6, loss = 0.002367685316130519

 Epoch: 9, Step: 112, Rank: 1, loss = 0.003761146916076541
 Epoch: 9, Step: 112, Rank: 2, loss = 0.0012309163575991988
 Epoch: 9, Step: 112, Rank: 3, loss = 0.03247838094830513
 Epoch: 9, Step: 112, Rank: 0, loss = 0.0021101885940879583
 Per-token loss scaled by world size: 2.4041009965003468e-05
 Epoch: 9, Step: 112, Rank: 4, loss = 0.0019473218126222491
 [2024-07-27 20:08:07,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=112, skipped=0, lr=[3.4791089722651437e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:08,033] [INFO] [timer.py:258:stop] epoch=0/micro_step=112/global_step=112, RunningAvgSamplesPerSec=31.783663551587292, CurrSamplesPerSec=32.24748961705518, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 9:  33%|███▎      | 4/12 [00:20<00:32,  4.08s/it]{
    "epoch": 9,
    "step": 112,
    "rank": 0,
    "loss": 0.0021101885940879583,
    "overall_throughput": 32.15704750085054,
    "lr": 3.4791089722651437e-07,
    "cuda_mem_allocated": 22.002624034881592,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 648,
    "batch_size": 16,
    "total_loss": 0.015299483202397823,
    "gradnorm": 0.40167224407196045,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:08.078974"
 }
 Per-token loss scaled by world size: 0.0005526405875571072Per-token loss scaled by world size: 0.0009644478559494019Per-token loss scaled by world size: 0.00029839982744306326Per-token loss scaled by world size: 0.0004884039517492056Per-token loss scaled by world size: 6.763617875549244e-06Per-token loss scaled by world size: 0.00016722115105949342

 Per-token loss scaled by world size: 9.371204214403406e-05




 Epoch: 9, Step: 113, Rank: 4, loss = 0.020887987688183784Epoch: 9, Step: 113, Rank: 1, loss = 0.03418827801942825Epoch: 9, Step: 113, Rank: 6, loss = 0.00047345325583592057Epoch: 9, Step: 113, Rank: 3, loss = 0.06751134991645813

 Epoch: 9, Step: 113, Rank: 7, loss = 0.011705480515956879Epoch: 9, Step: 113, Rank: 5, loss = 0.03868484124541283

 Epoch: 9, Step: 113, Rank: 2, loss = 0.006559843197464943


 Per-token loss scaled by world size: 0.0006510451785288751
 Epoch: 9, Step: 113, Rank: 0, loss = 0.0455731637775898
 [2024-07-27 20:08:08,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=113, skipped=0, lr=[2.667340275199426e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:08,574] [INFO] [timer.py:258:stop] epoch=0/micro_step=113/global_step=113, RunningAvgSamplesPerSec=31.789040793376557, CurrSamplesPerSec=32.391855899896804, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 Epoch 9:  42%|████▏     | 5/12 [00:21<00:19,  2.80s/it]{
    "epoch": 9,
    "step": 113,
    "rank": 0,
    "loss": 0.0455731637775898,
    "overall_throughput": 32.30453715301278,
    "lr": 2.667340275199426e-07,
    "cuda_mem_allocated": 21.99761438369751,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 560,
    "batch_size": 16,
    "total_loss": 0.02819805033504963,
    "gradnorm": 0.569709300994873,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:08.621715"
 }
 Per-token loss scaled by world size: 0.00019008330127689987Per-token loss scaled by world size: 0.00028144754469394684Per-token loss scaled by world size: 0.00048485625302419066Per-token loss scaled by world size: 0.0003446684859227389Per-token loss scaled by world size: 0.0005211489042267203


 Per-token loss scaled by world size: 0.000750985462218523


 Per-token loss scaled by world size: 0.00011730282858479768
 Epoch: 9, Step: 114, Rank: 1, loss = 0.041273389011621475Epoch: 9, Step: 114, Rank: 2, loss = 0.023958221077919006Epoch: 9, Step: 114, Rank: 5, loss = 0.029339905828237534


 Epoch: 9, Step: 114, Rank: 7, loss = 0.04436279833316803Epoch: 9, Step: 114, Rank: 0, loss = 0.016180841252207756

 Epoch: 9, Step: 114, Rank: 6, loss = 0.06392763555049896
 Epoch: 9, Step: 114, Rank: 4, loss = 0.009985403157770634
 Per-token loss scaled by world size: 0.0006613648729398847
 Epoch: 9, Step: 114, Rank: 3, loss = 0.05629868432879448
 [2024-07-27 20:08:09,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=114, skipped=0, lr=[1.9620034125190645e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:09,125] [INFO] [timer.py:258:stop] epoch=0/micro_step=114/global_step=114, RunningAvgSamplesPerSec=31.789347765969946, CurrSamplesPerSec=31.82345861552571, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 114,
    "rank": 0,
    "loss": 0.016180841252207756,
    "overall_throughput": 31.73594640312759,
    "lr": 1.9620034125190645e-07,
    "cuda_mem_allocated": 22.01240301132202,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 681,
    "batch_size": 16,
    "total_loss": 0.035665858536958694,
    "gradnorm": 0.562235951423645,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:09.167828"
 }
 Per-token loss scaled by world size: 0.00038479233626276255
 Per-token loss scaled by world size: 0.00017321300401818007
 Per-token loss scaled by world size: 0.00037739198887720704
 Per-token loss scaled by world size: 0.0002703580248635262
 Per-token loss scaled by world size: 0.0003972994163632393
 Per-token loss scaled by world size: 0.0005138221313245595
 Per-token loss scaled by world size: 0.0007780570886097848
 Epoch: 9, Step: 115, Rank: 2, loss = 0.028068529441952705
 Epoch: 9, Step: 115, Rank: 5, loss = 0.038215521723032
 Epoch: 9, Step: 115, Rank: 7, loss = 0.012882716953754425
 Epoch: 9, Step: 115, Rank: 0, loss = 0.028618929907679558
 Epoch: 9, Step: 115, Rank: 4, loss = 0.020107878372073174
 Epoch: 9, Step: 115, Rank: 6, loss = 0.029549144208431244
 Epoch: 9, Step: 115, Rank: 1, loss = 0.05786799639463425
 Per-token loss scaled by world size: 0.0008164329337887466
 Epoch: 9, Step: 115, Rank: 3, loss = 0.06072219833731651
 [2024-07-27 20:08:09,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=115, skipped=0, lr=[1.3638696597277678e-07], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:09,670] [INFO] [timer.py:258:stop] epoch=0/micro_step=115/global_step=115, RunningAvgSamplesPerSec=31.790592582420803, CurrSamplesPerSec=31.930631657680326, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 115,
    "rank": 0,
    "loss": 0.028618929907679558,
    "overall_throughput": 31.843164422360154,
    "lr": 1.3638696597277678e-07,
    "cuda_mem_allocated": 22.00882577896118,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 595,
    "batch_size": 16,
    "total_loss": 0.03450411558151245,
    "gradnorm": 0.7144197821617126,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:09.673464"
 }
 Per-token loss scaled by world size: 0.0001559390948386863
 Per-token loss scaled by world size: 0.0005046449950896204
 Per-token loss scaled by world size: 7.42613265174441e-05
 Per-token loss scaled by world size: 0.00044146235450170934
 Per-token loss scaled by world size: 1.3344148101168685e-05
 Per-token loss scaled by world size: 8.937142411014065e-06
 Per-token loss scaled by world size: 4.871577039011754e-05
 Epoch: 9, Step: 116, Rank: 1, loss = 0.037406809628009796
 Epoch: 9, Step: 116, Rank: 5, loss = 0.005504620727151632
 Epoch: 9, Step: 116, Rank: 0, loss = 0.011558985337615013
 Epoch: 9, Step: 116, Rank: 7, loss = 0.03272339701652527
 Epoch: 9, Step: 116, Rank: 6, loss = 0.0006624656962230802
 Epoch: 9, Step: 116, Rank: 2, loss = 0.0009891350055113435
 Epoch: 9, Step: 116, Rank: 4, loss = 0.0036110563669353724
 Per-token loss scaled by world size: 0.0008237811853177845
 Epoch: 9, Step: 116, Rank: 3, loss = 0.061062779277563095
 [2024-07-27 20:08:10,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=116, skipped=0, lr=[8.735930673024806e-08], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:10,215] [INFO] [timer.py:258:stop] epoch=0/micro_step=116/global_step=116, RunningAvgSamplesPerSec=31.793855612274417, CurrSamplesPerSec=32.166943077303586, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 116,
    "rank": 0,
    "loss": 0.011558985337615013,
    "overall_throughput": 32.10873615463745,
    "lr": 8.735930673024806e-08,
    "cuda_mem_allocated": 22.000000476837158,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 593,
    "batch_size": 16,
    "total_loss": 0.019189907237887383,
    "gradnorm": 0.4252445697784424,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:10.257978"
 }
 Per-token loss scaled by world size: 0.000993338762782514
 Per-token loss scaled by world size: 0.0006966108339838684
 Per-token loss scaled by world size: 0.0003090917889494449
 Per-token loss scaled by world size: 3.207974077668041e-05
 Per-token loss scaled by world size: 0.0003707126888912171
 Per-token loss scaled by world size: 3.2455467589898035e-05
 Per-token loss scaled by world size: 9.373086504638195e-05
 Epoch: 9, Step: 117, Rank: 1, loss = 0.025500072166323662
 Epoch: 9, Step: 117, Rank: 0, loss = 0.08195044845342636
 Epoch: 9, Step: 117, Rank: 5, loss = 0.05747039616107941
 Epoch: 9, Step: 117, Rank: 3, loss = 0.0026775761507451534
 Epoch: 9, Step: 117, Rank: 6, loss = 0.002646578708663583
 Epoch: 9, Step: 117, Rank: 4, loss = 0.03058379702270031
 Epoch: 9, Step: 117, Rank: 2, loss = 0.007732796482741833
 Per-token loss scaled by world size: 0.0008588652708567679
 Epoch: 9, Step: 117, Rank: 7, loss = 0.07085638493299484
 [2024-07-27 20:08:10,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=117, skipped=0, lr=[4.9170974549885844e-08], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:10,765] [INFO] [timer.py:258:stop] epoch=0/micro_step=117/global_step=117, RunningAvgSamplesPerSec=31.79231829238237, CurrSamplesPerSec=31.618032996197385, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 117,
    "rank": 0,
    "loss": 0.08195044845342636,
    "overall_throughput": 31.54031480690336,
    "lr": 4.9170974549885844e-08,
    "cuda_mem_allocated": 21.997137546539307,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 660,
    "batch_size": 16,
    "total_loss": 0.034927256405353546,
    "gradnorm": 0.6421816945075989,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:10.768286"
 }
 Per-token loss scaled by world size: 0.00030643673380836844
 Per-token loss scaled by world size: 0.000586669659242034
 Per-token loss scaled by world size: 0.001160175772383809
 Per-token loss scaled by world size: 0.000358547898940742
 Per-token loss scaled by world size: 0.00025170366279780865
 Per-token loss scaled by world size: 0.00010790762462420389
 Per-token loss scaled by world size: 4.243180592311546e-05
 Epoch: 9, Step: 118, Rank: 7, loss = 0.0846928283572197
 Epoch: 9, Step: 118, Rank: 4, loss = 0.04282688349485397
 Epoch: 9, Step: 118, Rank: 6, loss = 0.026173997670412064
 Epoch: 9, Step: 118, Rank: 1, loss = 0.007877256721258163
 Epoch: 9, Step: 118, Rank: 0, loss = 0.022369882091879845
 Epoch: 9, Step: 118, Rank: 5, loss = 0.0183743666857481
 Epoch: 9, Step: 118, Rank: 3, loss = 0.0030975218396633863
 Per-token loss scaled by world size: 0.00044214868103154004
 Epoch: 9, Step: 118, Rank: 2, loss = 0.032276853919029236
 [2024-07-27 20:08:11,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=118, skipped=0, lr=[2.1863727812254653e-08], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:11,300] [INFO] [timer.py:258:stop] epoch=0/micro_step=118/global_step=118, RunningAvgSamplesPerSec=31.800842267046153, CurrSamplesPerSec=32.81255650372894, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 118,
    "rank": 0,
    "loss": 0.022369882091879845,
    "overall_throughput": 32.728399914556505,
    "lr": 2.1863727812254653e-08,
    "cuda_mem_allocated": 21.999285221099854,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 584,
    "batch_size": 16,
    "total_loss": 0.029711198061704636,
    "gradnorm": 0.6892233490943909,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:11.344526"
 }
 Per-token loss scaled by world size: 0.00011293171701254323
 Per-token loss scaled by world size: 0.00029302676557563245
 Per-token loss scaled by world size: 0.0008117702673189342
 Per-token loss scaled by world size: 0.001312798005528748
 Per-token loss scaled by world size: 0.0006738528027199209
 Per-token loss scaled by world size: 0.0004890891723334789
 Per-token loss scaled by world size: 3.5650893551064655e-05
 Epoch: 9, Step: 119, Rank: 4, loss = 0.058345988392829895
 Epoch: 9, Step: 119, Rank: 6, loss = 0.09435735642910004
 Epoch: 9, Step: 119, Rank: 0, loss = 0.021061299368739128
 Epoch: 9, Step: 119, Rank: 3, loss = 0.04843316972255707
 Epoch: 9, Step: 119, Rank: 5, loss = 0.0025624081026762724
 Epoch: 9, Step: 119, Rank: 2, loss = 0.008116967044770718
 Epoch: 9, Step: 119, Rank: 1, loss = 0.035153284668922424
 Per-token loss scaled by world size: 0.0010237974347546697
 Epoch: 9, Step: 119, Rank: 7, loss = 0.07358544319868088
 [2024-07-27 20:08:11,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=119, skipped=0, lr=[5.467426590739511e-09], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:11,848] [INFO] [timer.py:258:stop] epoch=0/micro_step=119/global_step=119, RunningAvgSamplesPerSec=31.801217171186128, CurrSamplesPerSec=31.84476611898689, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 119,
    "rank": 0,
    "loss": 0.021061299368739128,
    "overall_throughput": 31.765043565848615,
    "lr": 5.467426590739511e-09,
    "cuda_mem_allocated": 22.003100872039795,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 575,
    "batch_size": 16,
    "total_loss": 0.04270198941230774,
    "gradnorm": 0.709683358669281,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:11.891436"
 }
 Per-token loss scaled by world size: 0.0007179552922025323
 Per-token loss scaled by world size: 0.00047626314335502684
 Per-token loss scaled by world size: 0.000766461540479213
 Per-token loss scaled by world size: 0.000950768415350467
 Per-token loss scaled by world size: 0.00014302438648883253
 Per-token loss scaled by world size: 1.124614391301293e-05
 Per-token loss scaled by world size: 0.00010744491737568751
 Epoch: 9, Step: 120, Rank: 6, loss = 0.03857731446623802
 Epoch: 9, Step: 120, Rank: 7, loss = 0.058154378086328506
 Epoch: 9, Step: 120, Rank: 0, loss = 0.07701224088668823
 Epoch: 9, Step: 120, Rank: 4, loss = 0.0009109376696869731
 Epoch: 9, Step: 120, Rank: 3, loss = 0.06208338588476181
 Epoch: 9, Step: 120, Rank: 5, loss = 0.008703038096427917
 Epoch: 9, Step: 120, Rank: 1, loss = 0.011584974825382233
 Per-token loss scaled by world size: 0.00025203556288033724
 Epoch: 9, Step: 120, Rank: 2, loss = 0.02041487954556942
 [2024-07-27 20:08:12,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
 [2024-07-27 20:08:12,384] [INFO] [timer.py:258:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=31.807319807184527, CurrSamplesPerSec=32.53786766934063, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
 {
    "epoch": 9,
    "step": 120,
    "rank": 0,
    "loss": 0.07701224088668823,
    "overall_throughput": 32.45850310128811,
    "lr": 0.0,
    "cuda_mem_allocated": 22.007394790649414,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 648,
    "batch_size": 16,
    "total_loss": 0.034680142998695374,
    "gradnorm": 0.5826724767684937,
    "weight_norm": 393.4759216308594,
    "timestamp": "2024-07-27T20:08:12.387320"
 }
 Epoch 9: 100%|██████████| 12/12 [00:24<00:00,  2.08s/it]
 tyler-rhel-newimage:260:1034 [0] NCCL INFO [Service thread] Connection closed by localRank 0
 tyler-rhel-newimage:261:1036 [1] NCCL INFO [Service thread] Connection closed by localRank 1
 tyler-rhel-newimage:266:1030 [6] NCCL INFO [Service thread] Connection closed by localRank 6
 tyler-rhel-newimage:263:1038 [3] NCCL INFO [Service thread] Connection closed by localRank 3
 tyler-rhel-newimage:262:1040 [2] NCCL INFO [Service thread] Connection closed by localRank 2
 tyler-rhel-newimage:267:1044 [7] NCCL INFO [Service thread] Connection closed by localRank 7
 tyler-rhel-newimage:265:1042 [5] NCCL INFO [Service thread] Connection closed by localRank 5
 tyler-rhel-newimage:264:1032 [4] NCCL INFO [Service thread] Connection closed by localRank 4
 tyler-rhel-newimage:260:43471 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
 tyler-rhel-newimage:267:43476 [0] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
 tyler-rhel-newimage:266:43475 [0] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
 tyler-rhel-newimage:262:43470 [0] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
 tyler-rhel-newimage:261:43477 [0] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
 tyler-rhel-newimage:263:43473 [0] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
 tyler-rhel-newimage:265:43474 [0] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
 tyler-rhel-newimage:264:43472 [0] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
 Terminating process 🤖
 [root@tyler-rhel-newimage instructlab]#