@relyt0925
Created July 27, 2024 04:44
newtraining output
[root@tyler-rhel-newimage instructlab]# /root/ilab model train --data-path /var/instructlabbigdisk/instructlab/generateddata/messages_Mixtral-8x7B-Instruct-v0_2024-07-27T04_27_23.jsonl --model-path /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --ckpt-output-dir /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --device cuda --gpus 8 --max-batch-len 1 --effective-batch-size 8 --save-samples 46
[2024-07-27 04:38:32,852] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
INFO 2024-07-27 04:38:36,486 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-07-27 04:38:36,486 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-07-27 04:38:36,486 numexpr.utils:161: NumExpr defaulting to 16 threads.
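Note on the NumExpr lines above: the library caps itself at 16 threads because NUMEXPR_MAX_THREADS is unset, even though it detected 80 virtual cores and supports up to 64 here. A minimal sketch of raising the cap (the variable must be set before numexpr is first imported; 64 mirrors the maximum NumExpr itself reports above):

import os

# Must be set before numexpr is first imported; 64 matches the maximum
# NumExpr reports for this 80-core machine in the log above.
os.environ["NUMEXPR_MAX_THREADS"] = "64"

import numexpr  # now initializes with up to 64 threads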
INFO 2024-07-27 04:38:36,869 datasets:58: PyTorch version 2.3.1 available.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
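The transformers message above refers to the `legacy` flag on LlamaTokenizerFast. A minimal sketch of opting into the new behaviour, using the model path from this run (only do this after reading the linked PR, as the warning advises):

from transformers import AutoTokenizer

# Opts out of the legacy behaviour described in the warning above; see
# https://github.com/huggingface/transformers/pull/24565 before changing this.
tok = AutoTokenizer.from_pretrained(
    "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
    legacy=False,
)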
INFO 2024-07-27 04:38:37,191 root:611: !!!!!!!! tokenizer has add_bos_token or add_eos_token
INFO 2024-07-27 04:38:37,196 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
tokenizing the dataset with /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ tokenizer...
ten largest length percentiles:
quantile 90th: 78.0
quantile 91th: 79.80000000000001
quantile 92th: 83.19999999999999
quantile 93th: 86.80000000000001
quantile 94th: 89.19999999999999
quantile 95th: 91.0
quantile 96th: 93.59999999999997
quantile 97th: 97.19999999999999
quantile 98th: 100.70000000000002
quantile 99th: 103.84999999999998
quantile 100th: 107.0
at 4096 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 44.0
quantile 1th: 44.45
quantile 2th: 44.9
quantile 3th: 45.0
quantile 4th: 45.0
quantile 5th: 45.0
quantile 6th: 45.0
quantile 7th: 45.15
quantile 8th: 45.6
quantile 9th: 46.1
quantile 10th: 47.0
at 20 min sequence length, the number of samples to be dropped is 0
checking the validity of the samples...
INFO 2024-07-27 04:38:42,745 root:611: number of dropped samples: 0 -- out of 46
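The percentile tables and drop counts above are standard token-length statistics over the 46 samples. A minimal sketch of the same computation (a hypothetical helper, not InstructLab's code; assumes `lengths` holds per-sample token counts and uses this run's 4096/20 bounds, with the exact >/>= drop semantics unknown):

import numpy as np

def length_report(lengths, max_seq_len=4096, min_seq_len=20):
    lengths = np.asarray(lengths)
    # Largest length percentiles, matching the log's 90th-100th table.
    for q in range(90, 101):
        print(f"quantile {q}th: {np.percentile(lengths, q)}")
    too_long = int((lengths > max_seq_len).sum())   # assumed strict comparison
    too_short = int((lengths < min_seq_len).sum())
    print(f"at {max_seq_len} max sequence length, samples dropped: {too_long}")
    print(f"at {min_seq_len} min sequence length, samples dropped: {too_short}")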
Categorizing training data type...
Data type sorting: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 506398.91it/s]
unmasking the appropriate message content...
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.
Instruction ex sample 16: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|>
Original Input: <|user|>
Question: How many villages named "Qarah Tappeh" are mentioned in the text?
<|assistant|>
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|>
Instruction ex sample 39: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|>
Original Input: <|user|>
Question: How many villages named "Qarah Tappeh" are mentioned in the text, each with a different location?
<|assistant|>
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|>
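The <mask> rendering in the two samples above reflects instruction-tuning loss masking: prompt tokens are excluded from the loss, and only the assistant's answer (plus the end-of-text token) is learned. A minimal sketch of the usual mechanism (not InstructLab's exact code), where masked positions get the label -100 that PyTorch cross-entropy ignores by default:

import torch

def mask_labels(input_ids, answer_start):
    # -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so
    # the model is only trained to predict the unmasked answer tokens.
    labels = input_ids.clone()
    labels[:answer_start] = -100
    return labels

# Example: a 10-token sample whose answer begins at position 6.
ids = torch.arange(10)
print(mask_labels(ids, answer_start=6))
# tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])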
Creating json from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 172.75ba/s]
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --data_path=/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --num_epochs=10 --effective_batch_size=8 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=46 --log_level=INFO --max_batch_len=1 --seed=42 --chat-tmpl-path=/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757]
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] *****************************************
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] *****************************************
[2024-07-27 04:38:47,197] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,436] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,460] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,488] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,592] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,593] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,603] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,623] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:50] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
model_name_or_path: /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/
data_path: /var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl
output_dir: /var/instructlabbigdisk/instructlab/knowledgecheckpoints/
num_epochs: 10
last_step: 0
effective_batch_size: 8
learning_rate: 2.0e-05
lr_scheduler: cosine
num_warmup_steps: 25
save_samples: 46
save_samples_ds: null
save_last: false
log_level: INFO
seed: 42
mock_data: false
mock_len: 2600
sharding_strategy: FULL_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 1
cpu_offload_optimizer: false
cpu_offload_optimizer_pin_memory: false
cpu_offload_optimizer_ratio: 1.0
NEFTune_alpha: null
chat_tmpl_path: /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
disable_flash_attn: false
{
"script_params": {
"model_name_or_path": "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
"data_path": "/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl",
"output_dir": "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/",
"num_epochs": 10,
"last_step": 0,
"effective_batch_size": 8,
"learning_rate": 2e-05,
"lr_scheduler": "cosine",
"num_warmup_steps": 25,
"save_samples": 46,
"save_samples_ds": null,
"save_last": false,
"log_level": "INFO",
"seed": 42,
"mock_data": false,
"mock_len": 2600,
"sharding_strategy": "FULL_SHARD",
"is_granite": false,
"lora_r": 0,
"lora_alpha": 32,
"lora_dropout": 0.1,
"lora_quant_bits": null,
"lora_target_modules": null,
"max_batch_len": 1,
"cpu_offload_optimizer": false,
"cpu_offload_optimizer_pin_memory": false,
"cpu_offload_optimizer_ratio": 1.0,
"NEFTune_alpha": null,
"chat_tmpl_path": "/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
"disable_flash_attn": false
},
"timestamp": "2024-07-27T04:38:51.166561"
}
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[2024-07-27 04:38:51,196] [INFO] [comm.py:637:init_distributed] cdb=None
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[2024-07-27 04:38:51,244] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:51,244] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[2024-07-27 04:38:51,961] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,111] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:260:260 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:260 [0] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:260:260 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-07-27 04:38:52,191] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,200] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:265:265 [5] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:263:263 [3] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:265:265 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:265 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:263:263 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:263 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-07-27 04:38:52,222] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,228] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:264:264 [4] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:264:264 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:264 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:267:267 [7] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:267:267 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:267 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:261:261 [1] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:261:261 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:261 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:266:266 [6] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:266:266 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:266 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:262:262 [2] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:262:262 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:262:262 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:1019 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:1024 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:1025 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:1027 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:1028 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:1029 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:1026 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:262:1030 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:263:1024 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:261:1028 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:263:1024 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:261:1028 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:260:1019 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:262:1030 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:260:1019 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:262:1030 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:264:1026 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1026 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:265:1025 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1025 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:267:1027 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1027 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:266:1029 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1029 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:266:1029 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:267:1027 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:262:1030 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:264:1026 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:261:1028 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:260:1019 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:261:1028 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:264:1026 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:262:1030 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1029 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1027 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:261:1028 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:264:1026 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1030 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1029 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1027 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1019 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1024 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:265:1025 [5] NCCL INFO comm 0x558446a0b990 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:263:1024 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:263:1024 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:265:1025 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:265:1025 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1029 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1029 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1025 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1025 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1030 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1030 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1027 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1027 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1019 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1019 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1019 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-rhel-newimage:264:1026 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1026 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1024 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1024 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:261:1028 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1028 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1027 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:266:1029 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:265:1025 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.77 (kernels 0.15, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1029 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1027 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.76 (kernels 0.20, bootstrap 0.22, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:260:1019 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:262:1030 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.81 (kernels 0.13, bootstrap 0.35, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:262:1030 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.75 (kernels 0.24, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:264:1026 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:263:1024 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:261:1028 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:264:1026 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.76 (kernels 0.23, bootstrap 0.19, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:263:1024 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.78 (kernels 0.16, bootstrap 0.29, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:261:1028 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1054 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:261:1053 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:260:1048 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1049 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1050 [7] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1051 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1047 [6] NCCL INFO Connected all rings
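
The NCCL channel/ring chatter above and below is governed by the NCCL_DEBUG environment variable; the sheer volume suggests it is set to INFO in this environment (an assumption, the launch command does not show it). A minimal sketch of quieting it on a future run:

    import os

    # must be set before the first torch.distributed / NCCL initialization;
    # "WARN" keeps real problems visible while dropping the INFO topology dump
    os.environ["NCCL_DEBUG"] = "WARN"
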
Generating train split: 46 examples [00:00, 10554.02 examples/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11446.25it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11298.78it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11851.23it/s]
Effective batch size is too low for multipack sampling, max sample length=107 and min packing length=61. Switching to naive distributed sampling.
{
"num_gpus": 8,
"avg_sample_len": 61.52173913043478,
"effective_batch_size": 8,
"max_batch_len_per_gpu": 1,
"packing_max_batch_len": null,
"grad_accum": 1,
"num_batches": 6,
"avg_samples_per_batch": 7.666666666666667,
"samples_per_gpu": 1,
"timestamp": "2024-07-27T04:38:53.790452"
}
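
The fallback message and the JSON summary fit together: with --max-batch-len 1, the per-GPU packing budget implied by the effective batch size (~61 tokens) is smaller than the longest sample (107 tokens), so multipack packing cannot be guaranteed to work. A rough reconstruction of the check, using only numbers from the log (hypothetical variable names; the real logic lives in the InstructLab training sampler):

    num_gpus = 8
    effective_batch_size = 8          # --effective-batch-size
    avg_sample_len = 61.52            # "avg_sample_len" in the JSON above
    max_sample_len = 107              # longest tokenized sample

    # packing budget per GPU implied by the requested effective batch size
    min_packing_len = int(avg_sample_len * effective_batch_size / num_gpus)  # 61

    if min_packing_len < max_sample_len:
        # a pack cannot be guaranteed to hold the longest sample, so fall
        # back to naive distributed sampling: one sample per GPU per step
        samples_per_gpu = 1
        grad_accum = 1
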
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11743.03it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11754.48it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11646.62it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11752.33it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11259.88it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
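
The Flash Attention 2.0 warning repeats once per rank; transformers emits it whenever a model is instantiated on CPU with flash attention enabled, and it is benign here because DeepSpeed moves each replica to its GPU afterwards. For reference, a minimal sketch of the pattern the warning asks for (illustrative load only; the trainer's own loading code may differ):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )
    model.to("cuda")  # move to GPU after CPU init, as the warning suggests
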
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.82s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.93s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.77s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.78s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
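
Two warnings recur in this stretch: the tokenizer carries six extra special tokens (32006 vs the model's 32000-entry embedding table), and the model's configured bos/eos ids differ from the tokenizer's. The trainer reports fixing the id mismatch itself; for reference, the standard Hugging Face remedies look like this (a sketch assuming `model` and `tokenizer` are already loaded, not the trainer's actual code):

    # grow the embedding table to cover the added special tokens
    model.resize_token_embeddings(len(tokenizer))       # 32000 -> 32006

    # align the model config with the tokenizer's special-token ids
    model.config.bos_token_id = tokenizer.bos_token_id  # 32000
    model.config.eos_token_id = tokenizer.eos_token_id  # 32001
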
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
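
The UserWarning above means the fused_adam JIT build targets every architecture of the visible cards. The nvcc command further down shows -gencode flags only for compute_80/sm_80, so pinning the list to that one arch should be harmless here; a sketch (set before the first deepspeed import, and "8.0" is an inference from those flags, A100-class hardware assumed):

    import os

    # restrict the JIT build to the sm_80 architecture seen in the nvcc line
    os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
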
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/python3.11/venv/lib64/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 30.80458378791809 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.133174657821655 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 30.7347309589386 seconds
Time to load fused_adam op: 30.73425841331482 seconds
Time to load fused_adam op: 26.12966561317444 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 26.22968363761902 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.43332004547119 seconds
[2024-07-27 04:39:49,755] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4+d254d75, git-hash=d254d75, git-branch=HEAD
[2024-07-27 04:39:49,756] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 30.431302785873413 seconds
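
The ~26-31 s load times above are dominated by the one-time ninja build of fused_adam; the compiled .so is cached under the extensions root printed earlier, so later runs in the same environment should load it in well under a second. Assuming a warm cache, a subsequent import is a cache hit:

    # reuses /var/instructlabbigdisk/instructlab/.cache/torch_extensions/...
    # rather than recompiling
    from deepspeed.ops.adam import FusedAdam
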
tyler-rhel-newimage:266:1135 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1144 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:261:1129 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1147 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1132 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1138 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1141 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:260:1128 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1147 [4] NCCL INFO bootstrapSplit: comm 0x556733b2a550 parent 0x5567334e8a90 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
tyler-rhel-newimage:263:1144 [3] NCCL INFO bootstrapSplit: comm 0x55eb051067f0 parent 0x55eb04aea420 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
tyler-rhel-newimage:266:1135 [6] NCCL INFO bootstrapSplit: comm 0x565389f45580 parent 0x5653898b69a0 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
tyler-rhel-newimage:267:1138 [7] NCCL INFO bootstrapSplit: comm 0x560b41bec990 parent 0x560b415bb0d0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
tyler-rhel-newimage:261:1129 [1] NCCL INFO bootstrapSplit: comm 0x55b80d3c14e0 parent 0x55b80cd4f580 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
tyler-rhel-newimage:262:1141 [2] NCCL INFO bootstrapSplit: comm 0x55bdb15c28a0 parent 0x55bdb0f52ee0 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
tyler-rhel-newimage:260:1128 [0] NCCL INFO bootstrapSplit: comm 0x5572495c2d50 parent 0x557248f8b2d0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:265:1132 [5] NCCL INFO bootstrapSplit: comm 0x5584470a63b0 parent 0x558446a0b990 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:263:1144 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:263:1144 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:264:1147 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1147 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:261:1129 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:261:1129 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:260:1128 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:260:1128 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:262:1141 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:266:1135 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1135 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:267:1138 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1138 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1132 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:266:1135 [6] NCCL INFO comm 0x565389f45580 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x5584470a63b0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:261:1129 [1] NCCL INFO comm 0x55b80d3c14e0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:260:1128 [0] NCCL INFO comm 0x5572495c2d50 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:264:1147 [4] NCCL INFO comm 0x556733b2a550 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:267:1138 [7] NCCL INFO comm 0x560b41bec990 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:263:1144 [3] NCCL INFO comm 0x55eb051067f0 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55bdb15c28a0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1135 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1138 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1135 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1138 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1129 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:264:1147 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:261:1129 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:263:1144 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:264:1147 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:263:1144 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1141 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1128 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1129 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1129 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1138 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1138 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1128 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1128 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:264:1147 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1147 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1141 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1132 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1132 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1144 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1144 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:266:1135 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1135 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1128 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:263:1144 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:260:1128 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.56 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:264:1147 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:267:1138 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.34 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:261:1129 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.55 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:262:1141 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.31 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:265:1132 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.52 (kernels 0.00, bootstrap 0.22, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1135 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.42 (kernels 0.00, bootstrap 0.12, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:260:1168 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1164 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1171 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:263:1167 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1166 [6] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1169 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1165 [7] NCCL INFO Connected all rings
[2024-07-27 04:39:54,216] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-27 04:39:54,243] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-07-27 04:39:54,243] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-07-27 04:39:54,244] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
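
The optimizer prints above pin down most of the DeepSpeed runtime config in effect for this run. Reconstructed as a config dict (illustrative only; the trainer assembles its own config internally):

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,      # "grad_accum" from the JSON above
        "gradient_clipping": 1.0,              # per the engine config dump below
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "reduce_bucket_size": 5e8,         # "Reduce bucket size 500,000,000"
            "allgather_bucket_size": 5e8,      # "Allgather bucket size 500,000,000"
            "offload_optimizer": {"device": "none"},  # "CPU Offload: False"
        },
    }
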
[2024-07-27 04:40:05,980] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
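
This warning is expected on a fresh run: nothing has been checkpointed yet, so no "latest" file exists under ds_native. Resuming a later run explicitly would look roughly like this (`engine` being the DeepSpeed engine; the tag name is hypothetical):

    load_path, client_state = engine.load_checkpoint(
        "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native",
        tag="global_step100",  # omit to follow the "latest" file once it exists
    )
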
[2024-07-27 04:40:07,679] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-07-27 04:40:07,680] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-07-27 04:40:07,680] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4%
[2024-07-27 04:40:07,885] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-07-27 04:40:07,886] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 04:40:07,886] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4%
[2024-07-27 04:40:07,886] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-07-27 04:40:08,052] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,080] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-07-27 04:40:08,081] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,081] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 04:40:08,081] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 67.95 GB, percent = 5.4%
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fa89082c350>
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 04:40:08,084] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_params ................... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_enabled ............. True
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa871e917d0>
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dump_state ................... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] loss_scale ................... 1.0
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_params ................... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] steps_per_print .............. 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_batch_size ............. 8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] world_size ................... 8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-07-27 04:40:08,086] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 8,
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
}
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
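For readers following along: a config dict like the JSON echoed above is what gets handed to deepspeed.initialize, which returns the engine that owns the ZeRO stage-2 sharding, bf16, and gradient clipping shown in the dump. A minimal sketch, not the InstructLab source: the toy model, the AdamW stand-in (the log only shows the client optimizer's betas (0.9, 0.95) and starting lr), and the launcher invocation are all assumptions.

# Sketch of wiring the config above into DeepSpeed; run under a
# distributed launcher (e.g. `deepspeed sketch.py`), not as a bare script.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "none"},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
}

model = torch.nn.Linear(16, 16)  # toy stand-in for the granite-7b model
optimizer = torch.optim.AdamW(model.parameters(), lr=8.0e-07, betas=(0.9, 0.95))

# engine.backward(loss) / engine.step() then apply ZeRO-2 gradient
# sharding and the 1.0 gradient clipping from the config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)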
[2024-07-27 04:40:08,087] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Number of samples per save: 40
[2024-07-27 04:40:08,101] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,148] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,457] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,652] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
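These five warnings (one per rank that logged) are expected on a first run: DeepSpeed resolves the checkpoint tag by reading a small "latest" file that save_checkpoint writes into the checkpoint directory, and no such file exists yet. The "Number of samples per save: 40" line shows the requested --save-samples 46 apparently rounded down to a multiple of the effective batch size of 8. A sketch of the lookup the warnings refer to (illustrative, not DeepSpeed's own code):

# How the tag is resolved when load_checkpoint is called without one.
import os

ckpt_dir = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native"
latest_file = os.path.join(ckpt_dir, "latest")

if os.path.isfile(latest_file):
    with open(latest_file) as f:
        tag = f.read().strip()  # save_checkpoint writes the tag here
    print("resuming from", os.path.join(ckpt_dir, tag))
else:
    # First run: nothing saved yet, hence the warning on every rank.
    print("no 'latest' file; training starts from the base model")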
Epoch 0: 0%| | 0/6 [00:00<?, ?it/s]
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 7 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 1 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 0 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
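Each "total tokens" line above is one single-sample micro-batch assigned to one rank; the dump lists 48 samples across the 8 ranks. With train_micro_batch_size_per_gpu=1 and gradient_accumulation_steps=1 that is 8 samples per optimizer step, which is where the 6 steps per epoch in the progress bar come from. As plain arithmetic (a consistency check, not InstructLab code):

# Step-count check from the sampler dump above.
num_micro_batches = 48   # "total tokens" records printed for epoch 0
world_size = 8
micro_batch_per_gpu = 1  # train_micro_batch_size_per_gpu in the config
grad_accum = 1

samples_per_step = world_size * micro_batch_per_gpu * grad_accum
print(samples_per_step)                       # 8, the logged batch_size
print(num_micro_batches // samples_per_step)  # 6 steps per epoch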
Per-token loss scaled by world size: 0.07893893122673035
Per-token loss scaled by world size: 0.03717859089374542
Per-token loss scaled by world size: 0.04773510619997978
Per-token loss scaled by world size: 0.0572475828230381
Per-token loss scaled by world size: 0.07484958320856094
Per-token loss scaled by world size: 0.053953204303979874
Epoch: 0, Step: 1, Rank: 0, loss = 2.230024814605713
Epoch: 0, Step: 1, Rank: 1, loss = 1.0502952337265015
Per-token loss scaled by world size: 0.054488085210323334
Epoch: 0, Step: 1, Rank: 3, loss = 1.3485167026519775
Epoch: 0, Step: 1, Rank: 7, loss = 1.6172442436218262
Epoch: 0, Step: 1, Rank: 6, loss = 2.1145007610321045
Epoch: 0, Step: 1, Rank: 2, loss = 1.5241780281066895
Epoch: 0, Step: 1, Rank: 5, loss = 1.5392884016036987
Per-token loss scaled by world size: 0.034635160118341446
Epoch: 0, Step: 1, Rank: 4, loss = 0.9784433245658875
[2024-07-27 04:40:09,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
Epoch 0: 17%|█▋ | 1/6 [00:01<00:06, 1.27s/it]
{
"epoch": 0,
"step": 1,
"rank": 0,
"loss": 2.230024814605713,
"overall_throughput": 9.740397396362653,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 226,
"batch_size": 8,
"total_loss": 1.5503114461898804,
"gradnorm": 27.33059310913086,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:09.962829"
}
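The step-1 numbers above are self-consistent: each rank's reported loss is its "Per-token loss scaled by world size" value times the step's num_loss_counted_tokens (226), divided by the world size (8), and total_loss is the mean of the eight rank losses. A quick check against the logged values (a consistency check, not InstructLab code):

# Reproducing the step-1 losses from the log.
world_size = 8
num_loss_counted_tokens = 226                 # from the metrics block above
per_token_scaled_rank0 = 0.07893893122673035  # rank 0's logged value

print(per_token_scaled_rank0 * num_loss_counted_tokens / world_size)
# ~2.2300, matching "Rank: 0, loss = 2.230024814605713"

rank_losses = [2.230024814605713, 1.0502952337265015, 1.5241780281066895,
               1.3485167026519775, 0.9784433245658875, 1.5392884016036987,
               2.1145007610321045, 1.6172442436218262]
print(sum(rank_losses) / world_size)
# ~1.5503, matching "total_loss": 1.5503114461898804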
Per-token loss scaled by world size: 0.0815284326672554
Per-token loss scaled by world size: 0.049197420477867126
Per-token loss scaled by world size: 0.05212152749300003
Per-token loss scaled by world size: 0.04615860432386398
Per-token loss scaled by world size: 0.031173814088106155
Per-token loss scaled by world size: 0.031103696674108505
Per-token loss scaled by world size: 0.059471674263477325
Epoch: 0, Step: 2, Rank: 5, loss = 1.5189703702926636
Epoch: 0, Step: 2, Rank: 0, loss = 2.517190456390381
Epoch: 0, Step: 2, Rank: 2, loss = 1.6092522144317627
Epoch: 0, Step: 2, Rank: 4, loss = 1.4251469373703003
Epoch: 0, Step: 2, Rank: 3, loss = 0.962491512298584
Epoch: 0, Step: 2, Rank: 1, loss = 0.960326611995697
Epoch: 0, Step: 2, Rank: 7, loss = 1.8361879587173462
Per-token loss scaled by world size: 0.0653553158044815
Epoch: 0, Step: 2, Rank: 6, loss = 2.017845392227173
[2024-07-27 04:40:10,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
Epoch 0: 33%|███▎ | 2/6 [00:01<00:03, 1.28it/s]
{
"epoch": 0,
"step": 2,
"rank": 0,
"loss": 2.517190456390381,
"overall_throughput": 25.079475366313783,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 247,
"batch_size": 8,
"total_loss": 1.6059263944625854,
"gradnorm": 24.998506546020508,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:10.428087"
}
Per-token loss scaled by world size: 0.049534447491168976
Per-token loss scaled by world size: 0.034937091171741486
Per-token loss scaled by world size: 0.03569952771067619
Per-token loss scaled by world size: 0.027379106730222702
Per-token loss scaled by world size: 0.07303578406572342
Per-token loss scaled by world size: 0.060139141976833344
Per-token loss scaled by world size: 0.06049950420856476
Epoch: 0, Step: 3, Rank: 4, loss = 1.1223540306091309
Epoch: 0, Step: 3, Rank: 3, loss = 1.9319698810577393
Epoch: 0, Step: 3, Rank: 1, loss = 2.3462746143341064
Epoch: 0, Step: 3, Rank: 7, loss = 0.8795537948608398
Epoch: 0, Step: 3, Rank: 0, loss = 1.5912941694259644
Epoch: 0, Step: 3, Rank: 2, loss = 1.1468473672866821
Epoch: 0, Step: 3, Rank: 5, loss = 1.9435465335845947
Per-token loss scaled by world size: 0.07311909645795822
Epoch: 0, Step: 3, Rank: 6, loss = 2.3489508628845215
[2024-07-27 04:40:10,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:10,835] [INFO] [timer.py:258:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=19.69948717647646, CurrSamplesPerSec=19.69948717647646, MemAllocated=21.99GB, MaxMemAllocated=28.28GB
Epoch 0: 50%|█████ | 3/6 [00:02<00:01, 1.56it/s]
{
"epoch": 0,
"step": 3,
"rank": 0,
"loss": 1.5912941694259644,
"overall_throughput": 19.626204980478747,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 257,
"batch_size": 8,
"total_loss": 1.6638489961624146,
"gradnorm": 23.093570709228516,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:10.899000"
}
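Note the lr in the step logs climbs by exactly 8.0e-07 per step (8.0e-07, 1.6e-06, 2.4e-06, ...), i.e. a linear warmup. A sketch of the implied schedule, assuming a plain linear warmup; this is an inference from the logged values, not a statement about the trainer's actual scheduler:

# Implied LR warmup; the constant 8.0e-07 per step matches every
# step logged in this run.
def warmup_lr(step: int, lr_per_step: float = 8.0e-07) -> float:
    return lr_per_step * step

for step in (1, 2, 3, 18):
    print(step, warmup_lr(step))  # 8e-07, 1.6e-06, 2.4e-06, 1.44e-05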
Per-token loss scaled by world size: 0.022453732788562775
Per-token loss scaled by world size: 0.02703114040195942
Per-token loss scaled by world size: 0.0565037876367569
Per-token loss scaled by world size: 0.013839970342814922
Per-token loss scaled by world size: 0.03342469781637192
Per-token loss scaled by world size: 0.019963225349783897
Per-token loss scaled by world size: 0.024931060150265694
Epoch: 0, Step: 4, Rank: 0, loss = 1.2197802066802979
Epoch: 0, Step: 4, Rank: 6, loss = 2.5497334003448486
Epoch: 0, Step: 4, Rank: 5, loss = 0.9008405208587646
Epoch: 0, Step: 4, Rank: 1, loss = 1.5082894563674927
Epoch: 0, Step: 4, Rank: 2, loss = 0.6245286464691162
Epoch: 0, Step: 4, Rank: 3, loss = 1.013224720954895
Epoch: 0, Step: 4, Rank: 7, loss = 1.125014066696167
Per-token loss scaled by world size: 0.04781263321638107
Epoch: 0, Step: 4, Rank: 4, loss = 2.1575450897216797
[2024-07-27 04:40:11,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:11,315] [INFO] [timer.py:258:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=19.53033393623941, CurrSamplesPerSec=19.36406089495735, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 0: 67%|██████▋ | 4/6 [00:02<00:01, 1.73it/s]
{
"epoch": 0,
"step": 4,
"rank": 0,
"loss": 1.2197802066802979,
"overall_throughput": 19.323057857283946,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 21.994319915771484,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 361,
"batch_size": 8,
"total_loss": 1.3873695135116577,
"gradnorm": 13.594513893127441,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:11.384294"
}
Per-token loss scaled by world size: 0.039435986429452896
Per-token loss scaled by world size: 0.04640277475118637
Per-token loss scaled by world size: 0.04766402393579483
Per-token loss scaled by world size: 0.03315580636262894
Per-token loss scaled by world size: 0.05420012027025223
Per-token loss scaled by world size: 0.039042871445417404
Per-token loss scaled by world size: 0.037364713847637177
Epoch: 0, Step: 5, Rank: 7, loss = 1.5848288536071777
Epoch: 0, Step: 5, Rank: 0, loss = 1.3112465143203735
Epoch: 0, Step: 5, Rank: 6, loss = 1.1024305820465088
Epoch: 0, Step: 5, Rank: 1, loss = 1.2981754541397095
Epoch: 0, Step: 5, Rank: 4, loss = 1.242376685142517
Epoch: 0, Step: 5, Rank: 3, loss = 1.5428922176361084
Epoch: 0, Step: 5, Rank: 2, loss = 1.8021539449691772
Per-token loss scaled by world size: 0.041017867624759674
Epoch: 0, Step: 5, Rank: 5, loss = 1.3638441562652588
[2024-07-27 04:40:11,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:11,795] [INFO] [timer.py:258:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=19.56114133365461, CurrSamplesPerSec=19.62304862715284, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 40
{
"epoch": 0,
"step": 5,
"rank": 0,
"loss": 1.3112465143203735,
"overall_throughput": 19.583411102181554,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 266,
"batch_size": 8,
"total_loss": 1.4059934616088867,
"gradnorm": 16.828536987304688,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:11.799662"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_40
[04:40:29] INFO saving took 17.8935489654541 seconds utils.py:611
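Each checkpoint under hf_format/ is a regular Hugging Face model directory named for the number of samples seen, so it can be loaded back directly. A sketch using standard transformers APIs (a smoke test, not part of the training run or an InstructLab command):

# Load the checkpoint saved above and generate a few tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_40"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0]))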
Epoch 0: 83%|████████▎ | 5/6 [00:21<00:06, 7.00s/it]
Per-token loss scaled by world size: 0.023287693038582802
Per-token loss scaled by world size: 0.022923028096556664
Per-token loss scaled by world size: 0.035226039588451385
Per-token loss scaled by world size: 0.02928866446018219
Per-token loss scaled by world size: 0.02294014021754265
Per-token loss scaled by world size: 0.021732352674007416
Per-token loss scaled by world size: 0.05167709290981293
Epoch: 0, Step: 6, Rank: 1, loss = 0.8420491218566895
Epoch: 0, Step: 6, Rank: 2, loss = 1.0127485990524292
Epoch: 0, Step: 6, Rank: 5, loss = 0.6595290303230286
Epoch: 0, Step: 6, Rank: 4, loss = 1.485716462135315
Epoch: 0, Step: 6, Rank: 0, loss = 0.6590370535850525
Epoch: 0, Step: 6, Rank: 7, loss = 0.669521152973175
Epoch: 0, Step: 6, Rank: 3, loss = 0.6248051524162292
Per-token loss scaled by world size: 0.05193476006388664
Epoch: 0, Step: 6, Rank: 6, loss = 1.4931243658065796
[2024-07-27 04:40:30,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:30,197] [INFO] [timer.py:258:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=19.219145353583418, CurrSamplesPerSec=18.261332776041684, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 4.79s/it]
{
"epoch": 0,
"step": 6,
"rank": 0,
"loss": 0.6590370535850525,
"overall_throughput": 18.218836059734688,
"lr": 4.800000000000001e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 230,
"batch_size": 8,
"total_loss": 0.9308162927627563,
"gradnorm": 13.859025001525879,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:30.261588"
}
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 3.61s/it]
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 0 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 0 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 2 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 7 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 3 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
Per-token loss scaled by world size: 0.04258072003722191
Per-token loss scaled by world size: 0.03144193813204765
Per-token loss scaled by world size: 0.03665884956717491
Per-token loss scaled by world size: 0.0183926559984684
Per-token loss scaled by world size: 0.02085307240486145
Per-token loss scaled by world size: 0.011860878206789494
Epoch: 1, Step: 7, Rank: 6, loss = 1.2830597162246704
Per-token loss scaled by world size: 0.017801115289330482
Epoch: 1, Step: 7, Rank: 7, loss = 0.6437429785728455
Epoch: 1, Step: 7, Rank: 2, loss = 1.1004678010940552
Epoch: 1, Step: 7, Rank: 3, loss = 1.4903252124786377
Epoch: 1, Step: 7, Rank: 0, loss = 0.7298575639724731
Epoch: 1, Step: 7, Rank: 1, loss = 0.41513073444366455
Epoch: 1, Step: 7, Rank: 5, loss = 0.6230390071868896
Per-token loss scaled by world size: 0.014800201170146465
Epoch: 1, Step: 7, Rank: 4, loss = 0.5180070400238037
[2024-07-27 04:40:31,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[5.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:31,117] [INFO] [timer.py:258:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=18.856037574472914, CurrSamplesPerSec=17.531170274406254, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 7,
"rank": 0,
"loss": 0.7298575639724731,
"overall_throughput": 17.461878806108718,
"lr": 5.600000000000001e-06,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 280,
"batch_size": 8,
"total_loss": 0.850453794002533,
"gradnorm": 11.859759330749512,
"weight_norm": 393.45489501953125,
"timestamp": "2024-07-27T04:40:31.182263"
}
Per-token loss scaled by world size: 0.01552379410713911
Per-token loss scaled by world size: 0.006790023762732744
Per-token loss scaled by world size: 0.025048548355698586
Per-token loss scaled by world size: 0.013136875815689564
Per-token loss scaled by world size: 0.011116808280348778
Per-token loss scaled by world size: 0.013115502893924713
Per-token loss scaled by world size: 0.004034785088151693
Epoch: 1, Step: 8, Rank: 6, loss = 0.2682059407234192
Epoch: 1, Step: 8, Rank: 3, loss = 0.9894176721572876
Epoch: 1, Step: 8, Rank: 7, loss = 0.43911394476890564
Epoch: 1, Step: 8, Rank: 0, loss = 0.5180623531341553
Epoch: 1, Step: 8, Rank: 4, loss = 0.5189065933227539
Epoch: 1, Step: 8, Rank: 2, loss = 0.6131898760795593
Epoch: 1, Step: 8, Rank: 5, loss = 0.15937401354312897
Per-token loss scaled by world size: 0.0399574413895607
Epoch: 1, Step: 8, Rank: 1, loss = 1.5783189535140991
[2024-07-27 04:40:31,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[6.4000000000000006e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:31,596] [INFO] [timer.py:258:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=18.947101184672665, CurrSamplesPerSec=19.41593921964599, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,██▎ | 2/6 [00:01<00:02, 1.62it/s]
"step": 8,
"rank": 0,
"loss": 0.5180623531341553,
"overall_throughput": 19.33400249148091,
"lr": 6.4000000000000006e-06,
"cuda_mem_allocated": 21.992404460906982,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 316,
"batch_size": 8,
"total_loss": 0.63557368516922,
"gradnorm": 13.164900779724121,
"weight_norm": 393.4549255371094,
"timestamp": "2024-07-27T04:40:31.599119"
}
Per-token loss scaled by world size: 0.005403513088822365
Per-token loss scaled by world size: 0.009131929837167263
Per-token loss scaled by world size: 0.0029204594902694225
Per-token loss scaled by world size: 0.014552305452525616
Per-token loss scaled by world size: 0.002935068914666772
Per-token loss scaled by world size: 0.005613779183477163
Per-token loss scaled by world size: 0.008262179791927338
Epoch: 1, Step: 9, Rank: 6, loss = 0.5002354979515076
Epoch: 1, Step: 9, Rank: 7, loss = 0.10039079189300537
Epoch: 1, Step: 9, Rank: 2, loss = 0.3139100968837738
Epoch: 1, Step: 9, Rank: 1, loss = 0.10089299082756042
Epoch: 1, Step: 9, Rank: 0, loss = 0.18574576079845428
Epoch: 1, Step: 9, Rank: 4, loss = 0.28401243686676025
Epoch: 1, Step: 9, Rank: 3, loss = 0.19297365844249725
Per-token loss scaled by world size: 0.03853427246212959
Epoch: 1, Step: 9, Rank: 5, loss = 1.3246155977249146
[2024-07-27 04:40:31,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[7.2000000000000005e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:32,073] [INFO] [timer.py:258:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=19.017175783584705, CurrSamplesPerSec=19.44875538610099, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,████ | 3/6 [00:01<00:01, 1.81it/s]
"step": 9,
"rank": 0,
"loss": 0.18574576079845428,
"overall_throughput": 19.4072473458094,
"lr": 7.2000000000000005e-06,
"cuda_mem_allocated": 21.990368366241455,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 275,
"batch_size": 8,
"total_loss": 0.3753471374511719,
"gradnorm": 6.857061386108398,
"weight_norm": 393.4549255371094,
"timestamp": "2024-07-27T04:40:32.076959"
}
Per-token loss scaled by world size: 0.009762527421116829
Per-token loss scaled by world size: 0.011909930035471916
Per-token loss scaled by world size: 0.011925755999982357
Per-token loss scaled by world size: 0.00691488292068243
Per-token loss scaled by world size: 0.014653326012194157
Per-token loss scaled by world size: 0.014235693961381912
Per-token loss scaled by world size: 0.006737567484378815
Epoch: 1, Step: 10, Rank: 1, loss = 0.32094308733940125
Epoch: 1, Step: 10, Rank: 4, loss = 0.39205923676490784
Epoch: 1, Step: 10, Rank: 6, loss = 0.22732678055763245
Epoch: 1, Step: 10, Rank: 2, loss = 0.48172810673713684
Epoch: 1, Step: 10, Rank: 7, loss = 0.2214975357055664
Epoch: 1, Step: 10, Rank: 3, loss = 0.39153894782066345
Epoch: 1, Step: 10, Rank: 5, loss = 0.4679984450340271
Per-token loss scaled by world size: 0.013581880368292332
Epoch: 1, Step: 10, Rank: 0, loss = 0.4465043246746063
[2024-07-27 04:40:32,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:32,552] [INFO] [timer.py:258:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=19.068379525424866, CurrSamplesPerSec=19.434674525231042, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 80
{
"epoch": 1,
"step": 10,
"rank": 0,
"loss": 0.4465043246746063,
"overall_throughput": 19.395367576038005,
"lr": 8.000000000000001e-06,
"cuda_mem_allocated": 21.99384117126465,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 263,
"batch_size": 8,
"total_loss": 0.3686995804309845,
"gradnorm": 7.663094520568848,
"weight_norm": 393.4549560546875,
"timestamp": "2024-07-27T04:40:32.556073"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_80
[04:40:50] INFO saving took 17.83508801460266 seconds utils.py:611
Per-token loss scaled by world size: 0.008527813479304314
Per-token loss scaled by world size: 0.021125635132193565
Per-token loss scaled by world size: 0.028058378025889397
Per-token loss scaled by world size: 0.010279231704771519
Per-token loss scaled by world size: 0.007951636798679829
Per-token loss scaled by world size: 0.010785719379782677
Per-token loss scaled by world size: 0.0015928453067317605
Epoch: 1, Step: 11, Rank: 0, loss = 0.6364097595214844
Epoch: 1, Step: 11, Rank: 3, loss = 0.8452586531639099
Epoch: 1, Step: 11, Rank: 7, loss = 0.309661865234375
Epoch: 1, Step: 11, Rank: 5, loss = 0.23954305052757263
Epoch: 1, Step: 11, Rank: 1, loss = 0.32491979002952576
Epoch: 1, Step: 11, Rank: 6, loss = 0.2569003701210022
Epoch: 1, Step: 11, Rank: 4, loss = 0.04798446595668793
Per-token loss scaled by world size: 0.011055735871195793
Epoch: 1, Step: 11, Rank: 2, loss = 0.3330540359020233
[2024-07-27 04:40:50,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[8.8e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:50,878] [INFO] [timer.py:258:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=19.06728482937713, CurrSamplesPerSec=19.05853178378495, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,███████▎ | 5/6 [00:20<00:05, 5.01s/it]
"step": 11,
"rank": 0,
"loss": 0.6364097595214844,
"overall_throughput": 19.020988915430987,
"lr": 8.8e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 241,
"batch_size": 8,
"total_loss": 0.3742165267467499,
"gradnorm": 10.62493896484375,
"weight_norm": 393.45501708984375,
"timestamp": "2024-07-27T04:40:50.881434"
}
Per-token loss scaled by world size: 0.01114068366587162
Per-token loss scaled by world size: 0.017769839614629745
Per-token loss scaled by world size: 0.01096078846603632
Per-token loss scaled by world size: 0.01160360500216484
Per-token loss scaled by world size: 0.01375030167400837
Per-token loss scaled by world size: 0.012371961027383804
Per-token loss scaled by world size: 0.026108525693416595
Epoch: 1, Step: 12, Rank: 4, loss = 0.4709007441997528
Epoch: 1, Step: 12, Rank: 2, loss = 0.29046088457107544
Epoch: 1, Step: 12, Rank: 7, loss = 0.3074955344200134
Epoch: 1, Step: 12, Rank: 0, loss = 0.29522812366485596
Epoch: 1, Step: 12, Rank: 3, loss = 0.3643829822540283
Epoch: 1, Step: 12, Rank: 5, loss = 0.32785695791244507
Epoch: 1, Step: 12, Rank: 1, loss = 0.6918759346008301
Per-token loss scaled by world size: 0.008026043884456158
Epoch: 1, Step: 12, Rank: 6, loss = 0.21269017457962036
[2024-07-27 04:40:51,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[9.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:51,345] [INFO] [timer.py:258:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=19.154108730048822, CurrSamplesPerSec=19.972626532644533, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,█████████| 6/6 [00:21<00:00, 3.47s/it]
"step": 12,
"rank": 0,
"loss": 0.29522812366485596,
"overall_throughput": 19.934561488662563,
"lr": 9.600000000000001e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 212,
"batch_size": 8,
"total_loss": 0.3701114356517792,
"gradnorm": 11.501402854919434,
"weight_norm": 393.4551086425781,
"timestamp": "2024-07-27T04:40:51.407469"
}
Epoch 1: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 3 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 6 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 7 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 4 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 4 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
Per-token loss scaled by world size: 0.005579258780926466
Per-token loss scaled by world size: 0.003397347405552864
Per-token loss scaled by world size: 0.04219382628798485
Per-token loss scaled by world size: 0.006288798991590738
Per-token loss scaled by world size: 0.003948549274355173
Per-token loss scaled by world size: 0.005502534564584494
Per-token loss scaled by world size: 0.018879147246479988
Epoch: 2, Step: 13, Rank: 4, loss = 0.20595817267894745
Epoch: 2, Step: 13, Rank: 2, loss = 0.11126312613487244
Epoch: 2, Step: 13, Rank: 6, loss = 0.18272072076797485
Epoch: 2, Step: 13, Rank: 1, loss = 1.381847858428955
Epoch: 2, Step: 13, Rank: 3, loss = 0.12931498885154724
Epoch: 2, Step: 13, Rank: 5, loss = 0.1802080124616623
Epoch: 2, Step: 13, Rank: 0, loss = 0.6182920932769775
Per-token loss scaled by world size: 0.005611070431768894
Epoch: 2, Step: 13, Rank: 7, loss = 0.1837625503540039
[2024-07-27 04:40:52,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[1.04e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:52,284] [INFO] [timer.py:258:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=18.955822461559762, CurrSamplesPerSec=17.17757371046992, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 17%|█▋ | 1/6 [00:00<00:04, 1.22it/s]
{
"epoch": 2,
"step": 13,
"rank": 0,
"loss": 0.6182920932769775,
"overall_throughput": 17.11025862908584,
"lr": 1.04e-05,
"cuda_mem_allocated": 21.990307807922363,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 262,
"batch_size": 8,
"total_loss": 0.3741708993911743,
"gradnorm": 5.893370628356934,
"weight_norm": 393.4551696777344,
"timestamp": "2024-07-27T04:40:52.348171"
}
Per-token loss scaled by world size: 0.010395266115665436
Per-token loss scaled by world size: 0.009872684255242348
Per-token loss scaled by world size: 0.003835555398836732
Per-token loss scaled by world size: 0.008823958225548267
Per-token loss scaled by world size: 0.0053123775869607925
Per-token loss scaled by world size: 0.0028427704237401485
Per-token loss scaled by world size: 0.010287421755492687
Epoch: 2, Step: 14, Rank: 0, loss = 0.323552668094635
Epoch: 2, Step: 14, Rank: 4, loss = 0.3072873055934906
Epoch: 2, Step: 14, Rank: 2, loss = 0.27464568614959717
Epoch: 2, Step: 14, Rank: 5, loss = 0.11938165873289108
Epoch: 2, Step: 14, Rank: 1, loss = 0.08848123252391815
Epoch: 2, Step: 14, Rank: 3, loss = 0.16534775495529175
Epoch: 2, Step: 14, Rank: 7, loss = 0.3201960027217865
Per-token loss scaled by world size: 0.016205936670303345
Epoch: 2, Step: 14, Rank: 6, loss = 0.5044097900390625
[2024-07-27 04:40:52,683] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[1.1200000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:52,760] [INFO] [timer.py:258:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=19.0045240462792, CurrSamplesPerSec=19.557238311503617, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s]
{
"epoch": 2,
"step": 14,
"rank": 0,
"loss": 0.323552668094635,
"overall_throughput": 19.51458544398396,
"lr": 1.1200000000000001e-05,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 249,
"batch_size": 8,
"total_loss": 0.262912780046463,
"gradnorm": 8.123388290405273,
"weight_norm": 393.45526123046875,
"timestamp": "2024-07-27T04:40:52.825508"
}
Per-token loss scaled by world size: 0.00476957717910409
Per-token loss scaled by world size: 0.005131879821419716
Per-token loss scaled by world size: 0.002688762964680791
Per-token loss scaled by world size: 0.00911777000874281
Per-token loss scaled by world size: 0.006320170592516661
Per-token loss scaled by world size: 0.007136975880712271
Per-token loss scaled by world size: 0.007083716802299023
Epoch: 2, Step: 15, Rank: 0, loss = 0.16742758452892303
Epoch: 2, Step: 15, Rank: 1, loss = 0.1556074619293213
Epoch: 2, Step: 15, Rank: 2, loss = 0.2974672317504883
Epoch: 2, Step: 15, Rank: 6, loss = 0.20619556307792664
Epoch: 2, Step: 15, Rank: 7, loss = 0.23110626637935638
Epoch: 2, Step: 15, Rank: 5, loss = 0.23284383118152618
Epoch: 2, Step: 15, Rank: 4, loss = 0.08772089332342148
Per-token loss scaled by world size: 0.007475144695490599
Epoch: 2, Step: 15, Rank: 3, loss = 0.2438765913248062
[2024-07-27 04:40:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:53,243] [INFO] [timer.py:258:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=19.026022320074304, CurrSamplesPerSec=19.287847616814023, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 120
{
"epoch": 2,
"step": 15,
"rank": 0,
"loss": 0.16742758452892303,
"overall_throughput": 19.246835159868162,
"lr": 1.2e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 261,
"batch_size": 8,
"total_loss": 0.20278067886829376,
"gradnorm": 4.634181499481201,
"weight_norm": 393.455322265625,
"timestamp": "2024-07-27T04:40:53.247897"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_120
[04:41:11] INFO saving took 17.93269968032837 seconds utils.py:611
Epoch 2: 50%|█████ | 3/6 [00:19<00:26, 8.75s/it]
Per-token loss scaled by world size: 0.022889601066708565
Per-token loss scaled by world size: 0.006112583447247744
Per-token loss scaled by world size: 0.008243937976658344
Per-token loss scaled by world size: 0.005759389605373144
Per-token loss scaled by world size: 0.0023497489746659994
Per-token loss scaled by world size: 0.003176590893417597
Per-token loss scaled by world size: 0.0032468584831804037
Epoch: 2, Step: 16, Rank: 0, loss = 0.18261343240737915
Epoch: 2, Step: 16, Rank: 5, loss = 0.6838268041610718
Epoch: 2, Step: 16, Rank: 4, loss = 0.24628764390945435
Epoch: 2, Step: 16, Rank: 3, loss = 0.1720617711544037
Epoch: 2, Step: 16, Rank: 6, loss = 0.07019875198602676
Epoch: 2, Step: 16, Rank: 1, loss = 0.09699989855289459
Epoch: 2, Step: 16, Rank: 2, loss = 0.09490065276622772
Per-token loss scaled by world size: 0.002544187940657139
Epoch: 2, Step: 16, Rank: 7, loss = 0.07600761204957962
[2024-07-27 04:41:11,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[1.2800000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:11,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=19.0047652088737, CurrSamplesPerSec=18.732683349486162, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 67%|██████▋ | 4/6 [00:20<00:10, 5.49s/it]
{
"epoch": 2,
"step": 16,
"rank": 0,
"loss": 0.18261343240737915,
"overall_throughput": 18.691548302496816,
"lr": 1.2800000000000001e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 239,
"batch_size": 8,
"total_loss": 0.20286208391189575,
"gradnorm": 3.438565492630005,
"weight_norm": 393.4554138183594,
"timestamp": "2024-07-27T04:41:11.737319"
}
Per-token loss scaled by world size: 0.006436600815504789
Per-token loss scaled by world size: 0.007636873982846737
Per-token loss scaled by world size: 0.011849365197122097
Per-token loss scaled by world size: 0.0030969511717557907
Per-token loss scaled by world size: 0.0029933564364910126
Per-token loss scaled by world size: 0.005698263645172119
Per-token loss scaled by world size: 0.0030969511717557907
Epoch: 2, Step: 17, Rank: 2, loss = 0.24247075617313385
Epoch: 2, Step: 17, Rank: 6, loss = 0.09832820296287537
Epoch: 2, Step: 17, Rank: 3, loss = 0.09503906965255737
Epoch: 2, Step: 17, Rank: 0, loss = 0.20436207950115204
Epoch: 2, Step: 17, Rank: 7, loss = 0.18091987073421478
Epoch: 2, Step: 17, Rank: 1, loss = 0.3762173354625702
Epoch: 2, Step: 17, Rank: 5, loss = 0.09832820296287537
Per-token loss scaled by world size: 0.0246181171387434
Epoch: 2, Step: 17, Rank: 4, loss = 0.7816252112388611
[2024-07-27 04:41:12,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.3600000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:12,143] [INFO] [timer.py:258:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=19.058863757063296, CurrSamplesPerSec=19.849924810962573, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 83%|████████▎ | 5/6 [00:20<00:03, 3.68s/it]
{
"epoch": 2,
"step": 17,
"rank": 0,
"loss": 0.20436207950115204,
"overall_throughput": 19.81138977879121,
"lr": 1.3600000000000002e-05,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 254,
"batch_size": 8,
"total_loss": 0.25966137647628784,
"gradnorm": 4.596966743469238,
"weight_norm": 393.4555358886719,
"timestamp": "2024-07-27T04:41:12.206412"
}
Per-token loss scaled by world size: 0.003973593469709158
Per-token loss scaled by world size: 0.003631346160545945
Per-token loss scaled by world size: 0.0038163107819855213
Per-token loss scaled by world size: 0.00382098532281816
Per-token loss scaled by world size: 0.00380203640088439
Per-token loss scaled by world size: 0.001392068457789719
Per-token loss scaled by world size: 0.004007395356893539
Epoch: 2, Step: 18, Rank: 0, loss = 0.18179190158843994
Epoch: 2, Step: 18, Rank: 3, loss = 0.17459622025489807
Epoch: 2, Step: 18, Rank: 1, loss = 0.1661340892314911
Epoch: 2, Step: 18, Rank: 7, loss = 0.18333832919597626
Epoch: 2, Step: 18, Rank: 6, loss = 0.1739431619644165
Epoch: 2, Step: 18, Rank: 4, loss = 0.06368713080883026
Epoch: 2, Step: 18, Rank: 2, loss = 0.17481008172035217
Per-token loss scaled by world size: 0.009031021036207676
Epoch: 2, Step: 18, Rank: 5, loss = 0.4131692051887512
[2024-07-27 04:41:12,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.4400000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:12,623] [INFO] [timer.py:258:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=19.075929261101155, CurrSamplesPerSec=19.33562909999493, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 2.59s/it]
{
"epoch": 2,
"step": 18,
"rank": 0,
"loss": 0.18179190158843994,
"overall_throughput": 19.297919953254016,
"lr": 1.4400000000000001e-05,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 366,
"batch_size": 8,
"total_loss": 0.19143378734588623,
"gradnorm": 3.664649486541748,
"weight_norm": 393.4555969238281,
"timestamp": "2024-07-27T04:41:12.687682"
}
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 3.55s/it]
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 1 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 1 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 2 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 6 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 4 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 7 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
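Each "total tokens" line above is a per-rank micro-batch report: every micro-batch here holds a single sample (num samples: 1), so padding is always 0 and num_loss_counted_tokens is the subset of tokens that actually contribute to the loss (unmasked labels). A hypothetical helper with the same output shape, assuming the usual -100 label mask; this is a sketch, not the trainer's code:

# Hypothetical per-rank batch reporter mirroring the log lines above.
import torch

def log_batch_stats(input_ids: torch.Tensor, labels: torch.Tensor,
                    rank: int, pad_id: int) -> None:
    lens = (input_ids != pad_id).sum(dim=1)    # real (non-pad) tokens per sample
    total = int(input_ids.numel())             # includes padding tokens
    padding = total - int(lens.sum())
    counted = int((labels != -100).sum())      # tokens that enter the loss
    print(f"total tokens: {total} num samples: {input_ids.shape[0]} "
          f"num padding tokens: {padding} - rank: {rank} "
          f"max len: {int(lens.max())} min len: {int(lens.min())} "
          f"avg len: {float(lens.float().mean())} "
          f"num_loss_counted_tokens: {counted}")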
Per-token loss scaled by world size: 0.008028822019696236
Per-token loss scaled by world size: 0.0042352983728051186
Per-token loss scaled by world size: 0.006302641239017248
Per-token loss scaled by world size: 0.00753552932292223
Per-token loss scaled by world size: 0.006594506558030844
Per-token loss scaled by world size: 0.010208014398813248
Per-token loss scaled by world size: 0.0070448205806314945
Epoch: 3, Step: 19, Rank: 7, loss = 0.12917660176753998
Epoch: 3, Step: 19, Rank: 6, loss = 0.2448790818452835
Epoch: 3, Step: 19, Rank: 2, loss = 0.19223055243492126
Epoch: 3, Step: 19, Rank: 4, loss = 0.20113244652748108
Epoch: 3, Step: 19, Rank: 5, loss = 0.22983364760875702
Epoch: 3, Step: 19, Rank: 0, loss = 0.3113444447517395
Epoch: 3, Step: 19, Rank: 1, loss = 0.2148670256137848
Per-token loss scaled by world size: 0.007540326565504074
Epoch: 3, Step: 19, Rank: 3, loss = 0.2299799621105194
[2024-07-27 04:41:13,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.5200000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:13,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=18.855222324067338, CurrSamplesPerSec=15.90998650082005, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 19,
"rank": 0,
"loss": 0.3113444447517395,
"overall_throughput": 15.851661275609098,
"lr": 1.5200000000000002e-05,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 244,
"batch_size": 8,
"total_loss": 0.21918047964572906,
"gradnorm": 7.666770935058594,
"weight_norm": 393.4556884765625,
"timestamp": "2024-07-27T04:41:13.615788"
}
Per-token loss scaled by world size: 0.0015976645518094301
Per-token loss scaled by world size: 0.01189976092427969
Per-token loss scaled by world size: 0.006761615164577961
Per-token loss scaled by world size: 0.0026721509639173746
Per-token loss scaled by world size: 0.001967529533430934
Per-token loss scaled by world size: 0.005321608856320381
Per-token loss scaled by world size: 0.0015923914033919573
Epoch: 3, Step: 20, Rank: 0, loss = 0.057715632021427155
Epoch: 3, Step: 20, Rank: 4, loss = 0.0965314507484436
Epoch: 3, Step: 20, Rank: 6, loss = 0.07107700407505035
Epoch: 3, Step: 20, Rank: 3, loss = 0.4298788607120514
Epoch: 3, Step: 20, Rank: 2, loss = 0.24426335096359253
Epoch: 3, Step: 20, Rank: 1, loss = 0.19224311411380768
Epoch: 3, Step: 20, Rank: 7, loss = 0.057525139302015305
Per-token loss scaled by world size: 0.018431704491376877
Epoch: 3, Step: 20, Rank: 5, loss = 0.6658453345298767
[2024-07-27 04:41:13,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:14,028] [INFO] [timer.py:258:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=18.89434286309558, CurrSamplesPerSec=19.58513710703571, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 160
{
"epoch": 3,
"step": 20,
"rank": 0,
"loss": 0.057715632021427155,
"overall_throughput": 19.54536780630332,
"lr": 1.6000000000000003e-05,
"cuda_mem_allocated": 21.990726947784424,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 289,
"batch_size": 8,
"total_loss": 0.22688499093055725,
"gradnorm": 5.258148193359375,
"weight_norm": 393.4558410644531,
"timestamp": "2024-07-27T04:41:14.031924"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_160
[04:41:31] INFO saving took 17.931371927261353 seconds utils.py:611
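Checkpoints in this run land at samples_seen 160, 200, 240, and 280: one hf_format save every 40 samples, i.e. every 5 global steps at the logged batch_size of 8. A sketch of that cadence (the 40-sample interval is read off this log, not derived from the trainer's config handling):

# Checkpoint cadence as observed in this log.
BATCH_SIZE = 8    # "batch_size" from the JSON records
SAVE_EVERY = 40   # 200 - 160 = 240 - 200 = 40 samples between saves

for step in range(1, 41):
    samples_seen = step * BATCH_SIZE
    if samples_seen % SAVE_EVERY == 0:
        print(f"save hf_format checkpoint at samples_seen={samples_seen}")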
Per-token loss scaled by world size: 0.0045623015612363815
Per-token loss scaled by world size: 0.0081652095541358
Per-token loss scaled by world size: 0.0009351570624858141
Per-token loss scaled by world size: 0.002664643106982112
Per-token loss scaled by world size: 0.0031791036017239094
Epoch: 3, Step: 21, Rank: 2, loss = 0.2602660655975342
Epoch: 3, Step: 21, Rank: 1, loss = 0.0849355012178421
Epoch: 3, Step: 21, Rank: 4, loss = 0.10133392363786697
Epoch: 3, Step: 21, Rank: 0, loss = 0.029808131977915764
Epoch: 3, Step: 21, Rank: 6, loss = 0.14542336761951447
Per-token loss scaled by world size: 0.0044220853596925735
Per-token loss scaled by world size: 0.01251036673784256
Epoch: 3, Step: 21, Rank: 3, loss = 0.14095397293567657
Epoch: 3, Step: 21, Rank: 7, loss = 0.39876794815063477
Per-token loss scaled by world size: 0.009876725263893604
Epoch: 3, Step: 21, Rank: 5, loss = 0.31482061743736267
[2024-07-27 04:41:32,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.6800000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:32,449] [INFO] [timer.py:258:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=18.90515878235841, CurrSamplesPerSec=19.101984863890006, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,████ | 3/6 [00:19<00:18, 6.29s/it]
"step": 21,
"rank": 0,
"loss": 0.029808131977915764,
"overall_throughput": 19.05711381074895,
"lr": 1.6800000000000002e-05,
"cuda_mem_allocated": 21.990726947784424,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 255,
"batch_size": 8,
"total_loss": 0.18453869223594666,
"gradnorm": 5.108468055725098,
"weight_norm": 393.4559631347656,
"timestamp": "2024-07-27T04:41:32.452565"
}
Per-token loss scaled by world size: 0.0014227991923689842
Per-token loss scaled by world size: 0.0022042023483663797
Per-token loss scaled by world size: 0.0035717289429157972
Per-token loss scaled by world size: 0.0031726094894111156
Per-token loss scaled by world size: 0.0027486486360430717
Per-token loss scaled by world size: 0.002677777549251914
Per-token loss scaled by world size: 0.00375761860050261
Epoch: 3, Step: 22, Rank: 0, loss = 0.048019472509622574
Epoch: 3, Step: 22, Rank: 6, loss = 0.12054584920406342
Epoch: 3, Step: 22, Rank: 5, loss = 0.0927668884396553
Epoch: 3, Step: 22, Rank: 3, loss = 0.10707557201385498
Epoch: 3, Step: 22, Rank: 1, loss = 0.12681962549686432
Epoch: 3, Step: 22, Rank: 7, loss = 0.09037499129772186
Epoch: 3, Step: 22, Rank: 4, loss = 0.07439182698726654
Per-token loss scaled by world size: 0.008931323885917664
Epoch: 3, Step: 22, Rank: 2, loss = 0.30143219232559204
[2024-07-27 04:41:32,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.76e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:32,927] [INFO] [timer.py:258:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=18.93573071473506, CurrSamplesPerSec=19.53597958978115, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,█████▋ | 4/6 [00:20<00:07, 4.00s/it]
"step": 22,
"rank": 0,
"loss": 0.048019472509622574,
"overall_throughput": 19.499003967857334,
"lr": 1.76e-05,
"cuda_mem_allocated": 21.989171504974365,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 270,
"batch_size": 8,
"total_loss": 0.12017828971147537,
"gradnorm": 3.623103380203247,
"weight_norm": 393.4561462402344,
"timestamp": "2024-07-27T04:41:32.991116"
}
Per-token loss scaled by world size: 0.009149123914539814
Per-token loss scaled by world size: 0.0035603949800133705
Per-token loss scaled by world size: 0.004935313016176224
Per-token loss scaled by world size: 0.0064824605360627174
Per-token loss scaled by world size: 0.005307480692863464
Per-token loss scaled by world size: 0.0033412924967706203
Per-token loss scaled by world size: 0.00997106358408928
Epoch: 3, Step: 23, Rank: 6, loss = 0.10814699530601501
Epoch: 3, Step: 23, Rank: 1, loss = 0.14991013705730438
Epoch: 3, Step: 23, Rank: 4, loss = 0.30287104845046997
Epoch: 3, Step: 23, Rank: 0, loss = 0.2779046297073364
Epoch: 3, Step: 23, Rank: 2, loss = 0.16121472418308258
Epoch: 3, Step: 23, Rank: 5, loss = 0.1969047337770462
Epoch: 3, Step: 23, Rank: 3, loss = 0.10149175673723221
Per-token loss scaled by world size: 0.00799593236297369
Epoch: 3, Step: 23, Rank: 7, loss = 0.24287645518779755
[2024-07-27 04:41:33,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.8400000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:33,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=18.969543790437204, CurrSamplesPerSec=19.672103775255234, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,███████▎ | 5/6 [00:20<00:02, 2.72s/it]
"step": 23,
"rank": 0,
"loss": 0.2779046297073364,
"overall_throughput": 19.604933597424527,
"lr": 1.8400000000000003e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 243,
"batch_size": 8,
"total_loss": 0.19266505539417267,
"gradnorm": 3.409485101699829,
"weight_norm": 393.4563293457031,
"timestamp": "2024-07-27T04:41:33.402199"
}
Per-token loss scaled by world size: 0.00869434978812933
Per-token loss scaled by world size: 0.010431548580527306
Per-token loss scaled by world size: 0.00882460456341505
Per-token loss scaled by world size: 0.014862887561321259
Per-token loss scaled by world size: 0.007030695676803589
Per-token loss scaled by world size: 0.009925030171871185
Per-token loss scaled by world size: 0.013269560411572456
Epoch: 3, Step: 24, Rank: 1, loss = 0.31989189982414246
Epoch: 3, Step: 24, Rank: 6, loss = 0.5387796759605408
Epoch: 3, Step: 24, Rank: 0, loss = 0.37814363837242126
Epoch: 3, Step: 24, Rank: 7, loss = 0.359782338142395
Epoch: 3, Step: 24, Rank: 2, loss = 0.2548627257347107
Epoch: 3, Step: 24, Rank: 5, loss = 0.48102155327796936
Epoch: 3, Step: 24, Rank: 3, loss = 0.31517016887664795
Per-token loss scaled by world size: 0.008658657781779766
Epoch: 3, Step: 24, Rank: 4, loss = 0.31387636065483093
[2024-07-27 04:41:33,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.9200000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:33,879] [INFO] [timer.py:258:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=18.986021230980416, CurrSamplesPerSec=19.338782826201598, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,█████████| 6/6 [00:21<00:00, 1.96s/it]
"step": 24,
"rank": 0,
"loss": 0.37814363837242126,
"overall_throughput": 19.301816374521977,
"lr": 1.9200000000000003e-05,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 290,
"batch_size": 8,
"total_loss": 0.3701910078525543,
"gradnorm": 41.655189514160156,
"weight_norm": 393.4565124511719,
"timestamp": "2024-07-27T04:41:33.942020"
}
Epoch 3: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it]
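Worth flagging: gradnorm at step 24 above spikes to 41.66 while neighbouring steps sit in the 2-8 range. The field reports a global gradient norm over all parameters; a minimal sketch of how such a value is typically obtained (assuming clip_grad_norm_-style accounting; whether and where this run clips is not visible in the log):

# Sketch: global gradient norm of the kind reported in "gradnorm".
import torch

def global_grad_norm(model: torch.nn.Module, max_norm: float = 1.0) -> float:
    # clip_grad_norm_ rescales gradients in place if they exceed max_norm and
    # returns the total norm computed over all parameters before clipping.
    return float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))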
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 7 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 1 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 2 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 4 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 3 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
Per-token loss scaled by world size: 0.011749987490475178
Per-token loss scaled by world size: 0.0038406068924814463
Per-token loss scaled by world size: 0.01040646806359291
Per-token loss scaled by world size: 0.0029110456816852093
Per-token loss scaled by world size: 0.001697335857897997
Per-token loss scaled by world size: 0.0049619837664067745
Per-token loss scaled by world size: 0.008784592151641846
Epoch: 4, Step: 25, Rank: 3, loss = 0.13202086091041565
Epoch: 4, Step: 25, Rank: 6, loss = 0.40390580892562866
Epoch: 4, Step: 25, Rank: 0, loss = 0.10006719827651978
Epoch: 4, Step: 25, Rank: 1, loss = 0.35772234201431274
Epoch: 4, Step: 25, Rank: 5, loss = 0.30197036266326904
Epoch: 4, Step: 25, Rank: 7, loss = 0.05834592133760452
Epoch: 4, Step: 25, Rank: 2, loss = 0.17056819796562195
Per-token loss scaled by world size: 0.0024965431075543165
Epoch: 4, Step: 25, Rank: 4, loss = 0.08581867069005966
[2024-07-27 04:41:34,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[2e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:34,808] [INFO] [timer.py:258:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=18.909596063618803, CurrSamplesPerSec=17.371243026535403, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 200
{
"epoch": 4,
"step": 25,
"rank": 0,
"loss": 0.10006719827651978,
"overall_throughput": 17.301396406941095,
"lr": 2e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 275,
"batch_size": 8,
"total_loss": 0.2013024240732193,
"gradnorm": 2.961458921432495,
"weight_norm": 393.45672607421875,
"timestamp": "2024-07-27T04:41:34.811952"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_200
[04:41:52] INFO saving took 17.899128675460815 seconds utils.py:611
Epoch 4: 17%|█▋ | 1/6 [00:18<01:33, 18.72s/it]
Per-token loss scaled by world size: 0.005662889685481787
Per-token loss scaled by world size: 0.003961643204092979
Per-token loss scaled by world size: 0.0033513393718749285
Per-token loss scaled by world size: 0.0048882560804486275
Per-token loss scaled by world size: 0.005098323803395033
Per-token loss scaled by world size: 0.0037976952735334635
Per-token loss scaled by world size: 0.0018476687837392092
Epoch: 4, Step: 26, Rank: 5, loss = 0.12182052433490753
Epoch: 4, Step: 26, Rank: 1, loss = 0.10305368900299072
Epoch: 4, Step: 26, Rank: 0, loss = 0.17413385212421417
Epoch: 4, Step: 26, Rank: 7, loss = 0.1503138691186905
Epoch: 4, Step: 26, Rank: 3, loss = 0.11677912622690201
Epoch: 4, Step: 26, Rank: 2, loss = 0.15677346289157867
Epoch: 4, Step: 26, Rank: 6, loss = 0.05681581422686577
Per-token loss scaled by world size: 0.0031180845107883215
Epoch: 4, Step: 26, Rank: 4, loss = 0.09588109701871872
[2024-07-27 04:41:53,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.9959742939952393e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:53,198] [INFO] [timer.py:258:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=18.9111373084012, CurrSamplesPerSec=18.946655411223635, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 33%|███▎ | 2/6 [00:19<00:31, 7.99s/it]
{
"epoch": 4,
"step": 26,
"rank": 0,
"loss": 0.17413385212421417,
"overall_throughput": 18.900859917782263,
"lr": 1.9959742939952393e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.12194641679525375,
"gradnorm": 2.005527973175049,
"weight_norm": 393.4569091796875,
"timestamp": "2024-07-27T04:41:53.262256"
}
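The lr column climbed linearly by 8e-7 per step to its 2e-05 peak at step 25 and begins decaying here at step 26. The logged values match a cosine schedule with 25 warmup steps over 60 total steps to six significant figures; a sketch of that fit (the warmup and total-step counts are a numerical inference from this log, not read from the training code):

# Cosine-with-linear-warmup schedule that reproduces the logged lr values.
import math

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 2e-05, 25, 60

def lr_at(step: int) -> float:
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS                    # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))   # cosine decay

print(lr_at(16))  # 1.28e-05, as logged at step 16
print(lr_at(26))  # ~1.9959743e-05, as logged at step 26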
Per-token loss scaled by world size: 0.007557791192084551
Per-token loss scaled by world size: 0.008283982053399086
Per-token loss scaled by world size: 0.003100305562838912
Per-token loss scaled by world size: 0.011851347051560879
Per-token loss scaled by world size: 0.013045835308730602
Per-token loss scaled by world size: 0.009396737441420555
Per-token loss scaled by world size: 0.0076859793625772
Epoch: 4, Step: 27, Rank: 0, loss = 0.20972870290279388
Epoch: 4, Step: 27, Rank: 2, loss = 0.22988051176071167
Epoch: 4, Step: 27, Rank: 3, loss = 0.3288748860359192
Epoch: 4, Step: 27, Rank: 6, loss = 0.36202192306518555
Epoch: 4, Step: 27, Rank: 5, loss = 0.26075947284698486
Epoch: 4, Step: 27, Rank: 4, loss = 0.08603347837924957
Epoch: 4, Step: 27, Rank: 7, loss = 0.2132859230041504
Per-token loss scaled by world size: 0.004431413020938635
Epoch: 4, Step: 27, Rank: 1, loss = 0.12297171354293823
[2024-07-27 04:41:53,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.98392958859863e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:53,665] [INFO] [timer.py:258:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=18.95425254539218, CurrSamplesPerSec=20.05141088310167, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 50%|█████ | 3/6 [00:19<00:13, 4.56s/it]
{
"epoch": 4,
"step": 27,
"rank": 0,
"loss": 0.20972870290279388,
"overall_throughput": 20.01441802121427,
"lr": 1.98392958859863e-05,
"cuda_mem_allocated": 21.988572120666504,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 222,
"batch_size": 8,
"total_loss": 0.22669458389282227,
"gradnorm": 4.566909313201904,
"weight_norm": 393.4571838378906,
"timestamp": "2024-07-27T04:41:53.729196"
}
Per-token loss scaled by world size: 0.002589393639937043
Per-token loss scaled by world size: 0.0016010800609365106
Per-token loss scaled by world size: 0.009488740935921669
Per-token loss scaled by world size: 0.007330995053052902
Per-token loss scaled by world size: 0.006591046694666147
Per-token loss scaled by world size: 0.0028418628498911858
Per-token loss scaled by world size: 0.0009722260874696076
Epoch: 4, Step: 28, Rank: 0, loss = 0.055437397211790085
Epoch: 4, Step: 28, Rank: 5, loss = 0.3285476565361023
Epoch: 4, Step: 28, Rank: 1, loss = 0.0896577537059784
Epoch: 4, Step: 28, Rank: 4, loss = 0.22821499407291412
Epoch: 4, Step: 28, Rank: 2, loss = 0.2538357079029083
Epoch: 4, Step: 28, Rank: 6, loss = 0.09839949756860733
Epoch: 4, Step: 28, Rank: 3, loss = 0.03366332873702049
Per-token loss scaled by world size: 0.017863700166344643
Epoch: 4, Step: 28, Rank: 7, loss = 0.6185306310653687
[2024-07-27 04:41:54,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.9639628606958535e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:54,145] [INFO] [timer.py:258:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=18.966307174026092, CurrSamplesPerSec=19.272736671546916, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 67%|██████▋ | 4/6 [00:20<00:05, 2.95s/it]
{
"epoch": 4,
"step": 28,
"rank": 0,
"loss": 0.055437397211790085,
"overall_throughput": 19.215202457388543,
"lr": 1.9639628606958535e-05,
"cuda_mem_allocated": 21.989171504974365,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 277,
"batch_size": 8,
"total_loss": 0.21328586339950562,
"gradnorm": 8.249006271362305,
"weight_norm": 393.4573974609375,
"timestamp": "2024-07-27T04:41:54.208735"
}
Per-token loss scaled by world size: 0.0066835153847932816
Per-token loss scaled by world size: 0.004529251717031002
Per-token loss scaled by world size: 0.0037545531522482634
Per-token loss scaled by world size: 0.003318126080557704
Per-token loss scaled by world size: 0.002113455906510353
Per-token loss scaled by world size: 0.0010128725552931428
Per-token loss scaled by world size: 0.0017812160076573491
Epoch: 4, Step: 29, Rank: 0, loss = 0.2815430760383606
Epoch: 4, Step: 29, Rank: 6, loss = 0.1397760659456253
Epoch: 4, Step: 29, Rank: 3, loss = 0.19079472124576569
Epoch: 4, Step: 29, Rank: 7, loss = 0.15816055238246918
Epoch: 4, Step: 29, Rank: 5, loss = 0.07503372430801392
Epoch: 4, Step: 29, Rank: 1, loss = 0.04266725853085518
Epoch: 4, Step: 29, Rank: 4, loss = 0.08902932703495026
Per-token loss scaled by world size: 0.00729252677410841
Epoch: 4, Step: 29, Rank: 2, loss = 0.3071976900100708
[2024-07-27 04:41:54,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.9362348706397374e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:54,626] [INFO] [timer.py:258:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=18.978206465919676, CurrSamplesPerSec=19.292915749104477, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 83%|████████▎ | 5/6 [00:20<00:02, 2.06s/it]
{
"epoch": 4,
"step": 29,
"rank": 0,
"loss": 0.2815430760383606,
"overall_throughput": 19.249871636665503,
"lr": 1.9362348706397374e-05,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 337,
"batch_size": 8,
"total_loss": 0.16052529215812683,
"gradnorm": 3.410759210586548,
"weight_norm": 393.4576110839844,
"timestamp": "2024-07-27T04:41:54.689974"
}
Per-token loss scaled by world size: 0.005897491704672575
Per-token loss scaled by world size: 0.007752169389277697
Per-token loss scaled by world size: 0.007537755649536848
Per-token loss scaled by world size: 0.012558677233755589
Per-token loss scaled by world size: 0.00658394442871213
Per-token loss scaled by world size: 0.003483764361590147
Per-token loss scaled by world size: 0.0014572414802387357
Epoch: 4, Step: 30, Rank: 7, loss = 0.21899878978729248
Epoch: 4, Step: 30, Rank: 0, loss = 0.1859964281320572
Epoch: 4, Step: 30, Rank: 4, loss = 0.21294160187244415
Epoch: 4, Step: 30, Rank: 6, loss = 0.166604146361351
Epoch: 4, Step: 30, Rank: 1, loss = 0.3547826409339905
Epoch: 4, Step: 30, Rank: 3, loss = 0.09841634333133698
Epoch: 4, Step: 30, Rank: 2, loss = 0.04116707295179367
Per-token loss scaled by world size: 0.013426681980490685
Epoch: 4, Step: 30, Rank: 5, loss = 0.3793037533760071
[2024-07-27 04:41:55,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.900968867902419e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:55,093] [INFO] [timer.py:258:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=19.013068675506908, CurrSamplesPerSec=20.005289522667084, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 240
{
"epoch": 4,
"step": 30,
"rank": 0,
"loss": 0.1859964281320572,
"overall_throughput": 19.962075281886168,
"lr": 1.900968867902419e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 226,
"batch_size": 8,
"total_loss": 0.2072763293981552,
"gradnorm": 3.050539255142212,
"weight_norm": 393.45782470703125,
"timestamp": "2024-07-27T04:41:55.095878"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_240
[04:42:12] INFO saving took 17.86995029449463 seconds utils.py:611
Epoch 4: 100%|██████████| 6/6 [00:39<00:00, 6.51s/it]
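Each hf_format save blocks the training loop for roughly 18 seconds (the "saving took" lines), which dominates the per-iteration averages on the progress bars (6.51 s/it for this epoch, which contains two saves, against ~3.5 s/it elsewhere) and shows up as a lower CurrSamplesPerSec on the step timed right after the pause (14.3 at step 31 below, versus ~19 in steady state). A sketch of the throughput bookkeeping, assuming a plain wall-clock timer around each global step in the spirit of DeepSpeed's throughput timer (not the verbatim implementation):

# Wall-clock throughput tracking with a running average, as a sketch.
import time

class ThroughputTracker:
    def __init__(self) -> None:
        self.rates: list[float] = []

    def timed_step(self, step_fn, batch_size: int) -> float:
        start = time.perf_counter()
        step_fn()  # forward + backward + optimizer step
        curr = batch_size / (time.perf_counter() - start)
        self.rates.append(curr)
        avg = sum(self.rates) / len(self.rates)
        print(f"RunningAvgSamplesPerSec={avg}, CurrSamplesPerSec={curr}")
        return curr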
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 5 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 5 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 2 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 6 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 6 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 3 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 4 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 4 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
Per-token loss scaled by world size: 0.005013823974877596
Per-token loss scaled by world size: 0.002962352242320776
Per-token loss scaled by world size: 0.004985218867659569
Per-token loss scaled by world size: 0.0022716219536960125
Per-token loss scaled by world size: 0.0034899800084531307
Per-token loss scaled by world size: 0.0017849565483629704
Per-token loss scaled by world size: 0.0013634071219712496
Epoch: 5, Step: 31, Rank: 6, loss = 0.17323635518550873
Epoch: 5, Step: 31, Rank: 4, loss = 0.17423038184642792
Epoch: 5, Step: 31, Rank: 3, loss = 0.10294174402952194
Epoch: 5, Step: 31, Rank: 2, loss = 0.07893886417150497
Epoch: 5, Step: 31, Rank: 0, loss = 0.06202723830938339
Epoch: 5, Step: 31, Rank: 1, loss = 0.12127680331468582
Epoch: 5, Step: 31, Rank: 5, loss = 0.04737839847803116
Per-token loss scaled by world size: 0.011078107170760632
Epoch: 5, Step: 31, Rank: 7, loss = 0.3849642276763916
[2024-07-27 04:42:13,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.8584487936018663e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:13,922] [INFO] [timer.py:258:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=18.801550768968706, CurrSamplesPerSec=14.335953625150017, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,▋ | 1/6 [00:00<00:04, 1.19it/s]
"step": 31,
"rank": 0,
"loss": 0.06202723830938339,
"overall_throughput": 14.285813059808566,
"lr": 1.8584487936018663e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 278,
"batch_size": 8,
"total_loss": 0.14312423765659332,
"gradnorm": 4.453860282897949,
"weight_norm": 393.4580078125,
"timestamp": "2024-07-27T04:42:13.987803"
}
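The cuda_mem_allocated (~21.99) and MaxMemAllocated (28.29GB) figures track the PyTorch caching allocator, not total device memory; the equivalents are one-liners (a sketch of the likely calls, not confirmed against the trainer):

# Allocator statistics matching the shape of the logged memory fields.
import torch

def cuda_mem_allocated_gib() -> float:
    return torch.cuda.memory_allocated() / (1024 ** 3)      # ~21.99 in this run

def cuda_max_mem_allocated_gib() -> float:
    return torch.cuda.max_memory_allocated() / (1024 ** 3)  # ~28.29 in this run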
Per-token loss scaled by world size: 0.0028378700371831656
Per-token loss scaled by world size: 0.005622998811304569
Per-token loss scaled by world size: 0.0031444875057786703
Per-token loss scaled by world size: 0.0035572010092437267
Per-token loss scaled by world size: 0.004025444388389587
Per-token loss scaled by world size: 0.005346423946321011
Per-token loss scaled by world size: 0.0037831738591194153
Epoch: 5, Step: 32, Rank: 0, loss = 0.0971970483660698
Epoch: 5, Step: 32, Rank: 6, loss = 0.10769869387149811
Epoch: 5, Step: 32, Rank: 7, loss = 0.12183413654565811
Epoch: 5, Step: 32, Rank: 3, loss = 0.1925877034664154
Epoch: 5, Step: 32, Rank: 2, loss = 0.13787147402763367
Epoch: 5, Step: 32, Rank: 1, loss = 0.18311502039432526
Epoch: 5, Step: 32, Rank: 5, loss = 0.12957370281219482
Per-token loss scaled by world size: 0.008308484219014645
Epoch: 5, Step: 32, Rank: 4, loss = 0.28456559777259827
[2024-07-27 04:42:14,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.8090169943749477e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:14,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=18.82734153699244, CurrSamplesPerSec=19.607327906336685, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,██▎ | 2/6 [00:01<00:02, 1.60it/s]
"step": 32,
"rank": 0,
"loss": 0.0971970483660698,
"overall_throughput": 19.569625420939243,
"lr": 1.8090169943749477e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 274,
"batch_size": 8,
"total_loss": 0.15680542588233948,
"gradnorm": 2.596428394317627,
"weight_norm": 393.45819091796875,
"timestamp": "2024-07-27T04:42:14.402141"
}
Per-token loss scaled by world size: 0.0021558639127761126
Per-token loss scaled by world size: 0.004672932904213667
Per-token loss scaled by world size: 0.0039972770027816296
Per-token loss scaled by world size: 0.0053141191601753235
Per-token loss scaled by world size: 0.0033407120499759912
Per-token loss scaled by world size: 0.006172977387905121
Per-token loss scaled by world size: 0.003799165366217494
Epoch: 5, Step: 33, Rank: 0, loss = 0.12193598598241806
Epoch: 5, Step: 33, Rank: 1, loss = 0.193965345621109
Epoch: 5, Step: 33, Rank: 2, loss = 0.1705620437860489
Epoch: 5, Step: 33, Rank: 7, loss = 0.14590060710906982
Epoch: 5, Step: 33, Rank: 3, loss = 0.13866953551769257
Epoch: 5, Step: 33, Rank: 6, loss = 0.2253136783838272
Epoch: 5, Step: 33, Rank: 4, loss = 0.07868903130292892
Per-token loss scaled by world size: 0.003766376990824938
Epoch: 5, Step: 33, Rank: 5, loss = 0.13747276365756989
[2024-07-27 04:42:14,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.7530714660036112e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:14,872] [INFO] [timer.py:258:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=18.85669114464766, CurrSamplesPerSec=19.781816809788317, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,████ | 3/6 [00:01<00:01, 1.80it/s]
"step": 33,
"rank": 0,
"loss": 0.12193598598241806,
"overall_throughput": 19.744405215835805,
"lr": 1.7530714660036112e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 292,
"batch_size": 8,
"total_loss": 0.1515636295080185,
"gradnorm": 2.53242564201355,
"weight_norm": 393.4584655761719,
"timestamp": "2024-07-27T04:42:14.936193"
}
Per-token loss scaled by world size: 0.004924851469695568
Per-token loss scaled by world size: 0.004273226950317621
Per-token loss scaled by world size: 0.002622528001666069
Per-token loss scaled by world size: 0.0037059050519019365
Per-token loss scaled by world size: 0.0047779749147593975
Per-token loss scaled by world size: 0.005559505894780159
Per-token loss scaled by world size: 0.007279254496097565
Epoch: 5, Step: 34, Rank: 7, loss = 0.12646400928497314
Epoch: 5, Step: 34, Rank: 2, loss = 0.0894937664270401
Epoch: 5, Step: 34, Rank: 3, loss = 0.1630484014749527
Epoch: 5, Step: 34, Rank: 1, loss = 0.1680605560541153
Epoch: 5, Step: 34, Rank: 0, loss = 0.1458238661289215
Epoch: 5, Step: 34, Rank: 6, loss = 0.18971814215183258
Epoch: 5, Step: 34, Rank: 5, loss = 0.24840456247329712
Per-token loss scaled by world size: 0.010788660496473312
Epoch: 5, Step: 34, Rank: 4, loss = 0.3681630492210388
[2024-07-27 04:42:15,270] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.691062648986865e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:15,347] [INFO] [timer.py:258:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=18.876022011428752, CurrSamplesPerSec=19.495582553322528, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,█████▋ | 4/6 [00:02<00:01, 1.91it/s]
"step": 34,
"rank": 0,
"loss": 0.1458238661289215,
"overall_throughput": 19.439414681510893,
"lr": 1.691062648986865e-05,
"cuda_mem_allocated": 21.988572120666504,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 273,
"batch_size": 8,
"total_loss": 0.1873970627784729,
"gradnorm": 2.919456958770752,
"weight_norm": 393.45867919921875,
"timestamp": "2024-07-27T04:42:15.410652"
}
Per-token loss scaled by world size: 0.0070740398950874805
Per-token loss scaled by world size: 0.006351261865347624
Per-token loss scaled by world size: 0.009431459940969944
Per-token loss scaled by world size: 0.0034575308673083782
Per-token loss scaled by world size: 0.0034287304151803255
Per-token loss scaled by world size: 0.006853340193629265
Per-token loss scaled by world size: 0.004821010399609804
Epoch: 5, Step: 35, Rank: 0, loss = 0.20337864756584167
Epoch: 5, Step: 35, Rank: 1, loss = 0.2711544632911682
Epoch: 5, Step: 35, Rank: 4, loss = 0.1825987845659256
Epoch: 5, Step: 35, Rank: 6, loss = 0.09940401464700699
Epoch: 5, Step: 35, Rank: 2, loss = 0.19703352451324463
Epoch: 5, Step: 35, Rank: 7, loss = 0.09857600182294846
Epoch: 5, Step: 35, Rank: 5, loss = 0.1386040449142456
Per-token loss scaled by world size: 0.0033929902128875256
Epoch: 5, Step: 35, Rank: 3, loss = 0.0975484699010849
[2024-07-27 04:42:15,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.6234898018587336e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:15,825] [INFO] [timer.py:258:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=18.89253145148775, CurrSamplesPerSec=19.43652077202901, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 280
{
"epoch": 5,
"step": 35,
"rank": 0,
"loss": 0.20337864756584167,
"overall_throughput": 19.39916886552688,
"lr": 1.6234898018587336e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 230,
"batch_size": 8,
"total_loss": 0.16103725135326385,
"gradnorm": 3.5732498168945312,
"weight_norm": 393.45892333984375,
"timestamp": "2024-07-27T04:42:15.828221"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_280
[04:42:33] INFO saving took 17.876163959503174 seconds utils.py:611
Per-token loss scaled by world size: 0.004089208785444498
Per-token loss scaled by world size: 0.0032626439351588488
Per-token loss scaled by world size: 0.007577312644571066
Per-token loss scaled by world size: 0.00760306091979146
Per-token loss scaled by world size: 0.0089601781219244
Per-token loss scaled by world size: 0.0050941589288413525
Per-token loss scaled by world size: 0.004234898369759321
Epoch: 5, Step: 36, Rank: 1, loss = 0.09910281002521515
Epoch: 5, Step: 36, Rank: 4, loss = 0.2309429794549942
Epoch: 5, Step: 36, Rank: 0, loss = 0.12420971691608429
Epoch: 5, Step: 36, Rank: 7, loss = 0.2721654176712036
Epoch: 5, Step: 36, Rank: 5, loss = 0.2301608771085739
Epoch: 5, Step: 36, Rank: 2, loss = 0.15473507344722748
Epoch: 5, Step: 36, Rank: 6, loss = 0.12863503396511078
Per-token loss scaled by world size: 0.003793718060478568
Epoch: 5, Step: 36, Rank: 3, loss = 0.11523418873548508
[2024-07-27 04:42:34,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.5508969814521026e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:34,189] [INFO] [timer.py:258:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=18.899391419587044, CurrSamplesPerSec=19.128599036570417, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,█████████| 6/6 [00:21<00:00, 4.75s/it]
"step": 36,
"rank": 0,
"loss": 0.12420971691608429,
"overall_throughput": 19.082094487965083,
"lr": 1.5508969814521026e-05,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 243,
"batch_size": 8,
"total_loss": 0.1693982630968094,
"gradnorm": 3.1067850589752197,
"weight_norm": 393.45916748046875,
"timestamp": "2024-07-27T04:42:34.192324"
}
Epoch 5: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it]
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 0 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 1 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 1 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 5 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 3 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
Per-token loss scaled by world size: 0.007784623187035322
Per-token loss scaled by world size: 0.003072483232244849
Per-token loss scaled by world size: 0.009150322526693344
Per-token loss scaled by world size: 0.0048055145889520645
Per-token loss scaled by world size: 0.007070611696690321
Per-token loss scaled by world size: 0.0026875571347773075
Per-token loss scaled by world size: 0.0032222422305494547
Epoch: 6, Step: 37, Rank: 6, loss = 0.08487734943628311
Epoch: 6, Step: 37, Rank: 3, loss = 0.21505022048950195
Epoch: 6, Step: 37, Rank: 0, loss = 0.25277766585350037
Epoch: 6, Step: 37, Rank: 4, loss = 0.19532564282417297
Epoch: 6, Step: 37, Rank: 2, loss = 0.07424376904964447
Epoch: 6, Step: 37, Rank: 5, loss = 0.0890144407749176
Epoch: 6, Step: 37, Rank: 1, loss = 0.13275234401226044
Per-token loss scaled by world size: 0.003815547563135624
Epoch: 6, Step: 37, Rank: 7, loss = 0.10540450364351273
[2024-07-27 04:42:35,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.4738686624729987e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:35,123] [INFO] [timer.py:258:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=18.884730321309164, CurrSamplesPerSec=18.399439371025178, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s]
{
"epoch": 6,
"step": 37,
"rank": 0,
"loss": 0.25277766585350037,
"overall_throughput": 18.323549504455777,
"lr": 1.4738686624729987e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 221,
"batch_size": 8,
"total_loss": 0.14368075132369995,
"gradnorm": 2.0474841594696045,
"weight_norm": 393.4593811035156,
"timestamp": "2024-07-27T04:42:35.186644"
}
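The numbers in these metric blocks are internally consistent: each rank's printed loss equals its "Per-token loss scaled by world size" value times num_loss_counted_tokens divided by the world size, and total_loss is the mean of the eight per-rank losses. A minimal sketch checking this against the step-37 values above; the reduction formula is inferred from the log, not taken from the training code:

```python
# Sketch (inferred from the log, not the training library's own code):
# rank_loss = per_token_loss_scaled_by_world_size * num_loss_counted_tokens / world_size
world_size = 8                    # --gpus 8
num_loss_counted_tokens = 221     # from the step-37 JSON block

scaled_rank0 = 0.009150322526693344   # rank 0 "Per-token loss scaled by world size"
print(scaled_rank0 * num_loss_counted_tokens / world_size)  # -> 0.2527776... (rank 0 loss)

rank_losses = [0.08487734943628311, 0.21505022048950195, 0.25277766585350037,
               0.19532564282417297, 0.07424376904964447, 0.0890144407749176,
               0.13275234401226044, 0.10540450364351273]
print(sum(rank_losses) / len(rank_losses))  # -> 0.1436807... (the "total_loss" field)
```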
Per-token loss scaled by world size: 0.0039473348297178745
Per-token loss scaled by world size: 0.0038144520949572325
Per-token loss scaled by world size: 0.0010828088270500302
Per-token loss scaled by world size: 0.0007635311339981854
Per-token loss scaled by world size: 0.0021416409872472286
Per-token loss scaled by world size: 0.0017905712593346834
Per-token loss scaled by world size: 0.005295279435813427
Epoch: 6, Step: 38, Rank: 0, loss = 0.14901189506053925
Epoch: 6, Step: 38, Rank: 1, loss = 0.14399556815624237
Epoch: 6, Step: 38, Rank: 7, loss = 0.040876034647226334
Epoch: 6, Step: 38, Rank: 4, loss = 0.02882329933345318
Epoch: 6, Step: 38, Rank: 3, loss = 0.08084695041179657
Epoch: 6, Step: 38, Rank: 2, loss = 0.06759406626224518
Epoch: 6, Step: 38, Rank: 5, loss = 0.19989679753780365
Per-token loss scaled by world size: 0.006602860987186432
Epoch: 6, Step: 38, Rank: 6, loss = 0.24925799667835236
[2024-07-27 04:42:35,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.3930250316539237e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:35,599] [INFO] [timer.py:258:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=18.899852383768156, CurrSamplesPerSec=19.44482195705551, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s]
{
"epoch": 6,
"step": 38,
"rank": 0,
"loss": 0.14901189506053925,
"overall_throughput": 19.37470523157654,
"lr": 1.3930250316539237e-05,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 302,
"batch_size": 8,
"total_loss": 0.12003782391548157,
"gradnorm": 1.780216097831726,
"weight_norm": 393.4596252441406,
"timestamp": "2024-07-27T04:42:35.663666"
}
Per-token loss scaled by world size: 0.0038089316803961992
Per-token loss scaled by world size: 0.0020512850023806095
Per-token loss scaled by world size: 0.008632734417915344
Per-token loss scaled by world size: 0.0009830425260588527
Per-token loss scaled by world size: 0.002817384200170636
Per-token loss scaled by world size: 0.007761223241686821
Per-token loss scaled by world size: 0.007352802902460098
Epoch: 6, Step: 39, Rank: 6, loss = 0.06435906887054443
Epoch: 6, Step: 39, Rank: 2, loss = 0.2708520293235779
Epoch: 6, Step: 39, Rank: 7, loss = 0.030842959880828857
Epoch: 6, Step: 39, Rank: 4, loss = 0.24350838363170624
Epoch: 6, Step: 39, Rank: 0, loss = 0.11950523406267166
Epoch: 6, Step: 39, Rank: 3, loss = 0.08839543163776398
Epoch: 6, Step: 39, Rank: 5, loss = 0.23069418966770172
Per-token loss scaled by world size: 0.003210328985005617
Epoch: 6, Step: 39, Rank: 1, loss = 0.10072407126426697
[2024-07-27 04:42:35,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[1.3090169943749475e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:36,075] [INFO] [timer.py:258:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=18.915587653609716, CurrSamplesPerSec=19.50004649173377, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 50%|█████ | 3/6 [00:01<00:01, 1.81it/s]
{
"epoch": 6,
"step": 39,
"rank": 0,
"loss": 0.11950523406267166,
"overall_throughput": 19.43697112933871,
"lr": 1.3090169943749475e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 251,
"batch_size": 8,
"total_loss": 0.1436101794242859,
"gradnorm": 2.214144706726074,
"weight_norm": 393.4598388671875,
"timestamp": "2024-07-27T04:42:36.138875"
}
Per-token loss scaled by world size: 0.001895732944831252
Per-token loss scaled by world size: 0.0019446390215307474
Per-token loss scaled by world size: 0.0018286737613379955
Per-token loss scaled by world size: 0.002989412285387516
Per-token loss scaled by world size: 0.0028383415192365646
Per-token loss scaled by world size: 0.002208298072218895
Per-token loss scaled by world size: 0.005300204269587994
Epoch: 6, Step: 40, Rank: 4, loss = 0.11060825735330582
Epoch: 6, Step: 40, Rank: 3, loss = 0.06766092777252197
Epoch: 6, Step: 40, Rank: 0, loss = 0.07014212012290955
Epoch: 6, Step: 40, Rank: 7, loss = 0.10501863807439804
Epoch: 6, Step: 40, Rank: 1, loss = 0.07195164263248444
Epoch: 6, Step: 40, Rank: 2, loss = 0.08170703053474426
Epoch: 6, Step: 40, Rank: 5, loss = 0.19610755145549774
Per-token loss scaled by world size: 0.0030284025706350803
Epoch: 6, Step: 40, Rank: 6, loss = 0.11205089092254639
[2024-07-27 04:42:36,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.2225209339563144e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:36,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=18.92908988177896, CurrSamplesPerSec=19.44259109142837, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 320
{
"epoch": 6,
"step": 40,
"rank": 0,
"loss": 0.07014212012290955,
"overall_throughput": 19.380199690766368,
"lr": 1.2225209339563144e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 296,
"batch_size": 8,
"total_loss": 0.10190588980913162,
"gradnorm": 1.372182011604309,
"weight_norm": 393.4600830078125,
"timestamp": "2024-07-27T04:42:36.554966"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_320
[04:42:54] INFO saving took 17.879958391189575 seconds utils.py:611
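Checkpoints in this run land every 40 samples (samples_seen 320, 360, 400, ...), i.e. every 5 optimizer steps at an effective batch size of 8. That cadence is consistent with --save-samples 46 being rounded down to a multiple of the effective batch size; the rounding rule in this sketch is an assumption inferred from the log, not confirmed trainer code:

```python
# Assumption: --save-samples is rounded down to a multiple of the effective
# batch size (inferred from the 320/360/400/... cadence in this log).
effective_batch_size = 8    # --effective-batch-size 8
save_samples = 46           # --save-samples 46
save_every = (save_samples // effective_batch_size) * effective_batch_size
print(save_every)           # -> 40 samples, i.e. a checkpoint every 5 steps
```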
Epoch 6: 67%|██████▋ | 4/6 [00:20<00:15, 7.58s/it]
Per-token loss scaled by world size: 0.008056806400418282
Per-token loss scaled by world size: 0.007982512935996056
Per-token loss scaled by world size: 0.00242948392406106
Per-token loss scaled by world size: 0.004318062216043472
Per-token loss scaled by world size: 0.0034818260464817286
Per-token loss scaled by world size: 0.014020812697708607
Epoch: 6, Step: 41, Rank: 3, loss = 0.07318820059299469
Epoch: 6, Step: 41, Rank: 0, loss = 0.24271129071712494
Epoch: 6, Step: 41, Rank: 7, loss = 0.2404731959104538
Per-token loss scaled by world size: 0.0027995144482702017
Epoch: 6, Step: 41, Rank: 1, loss = 0.42237699031829834
Epoch: 6, Step: 41, Rank: 2, loss = 0.10489001125097275
Epoch: 6, Step: 41, Rank: 5, loss = 0.13008162379264832
Epoch: 6, Step: 41, Rank: 4, loss = 0.08433537185192108
Per-token loss scaled by world size: 0.002146774670109153
Epoch: 6, Step: 41, Rank: 6, loss = 0.0646715834736824
[2024-07-27 04:42:54,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[1.1342332658176556e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:54,909] [INFO] [timer.py:258:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=18.942455198097623, CurrSamplesPerSec=19.46470827097328, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 83%|████████▎ | 5/6 [00:20<00:05, 5.02s/it]
{
"epoch": 6,
"step": 41,
"rank": 0,
"loss": 0.24271129071712494,
"overall_throughput": 19.42185054501338,
"lr": 1.1342332658176556e-05,
"cuda_mem_allocated": 21.990966320037842,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 241,
"batch_size": 8,
"total_loss": 0.17034101486206055,
"gradnorm": 2.2789089679718018,
"weight_norm": 393.46026611328125,
"timestamp": "2024-07-27T04:42:54.973034"
}
Per-token loss scaled by world size: 0.0011888241861015558
Per-token loss scaled by world size: 0.0031213611364364624
Per-token loss scaled by world size: 0.002157441573217511
Per-token loss scaled by world size: 0.0022118226625025272
Per-token loss scaled by world size: 0.006297964137047529
Per-token loss scaled by world size: 0.0018200232880190015
Per-token loss scaled by world size: 0.002669830108061433
Epoch: 6, Step: 42, Rank: 1, loss = 0.10846729576587677
Epoch: 6, Step: 42, Rank: 0, loss = 0.04131164029240608
Epoch: 6, Step: 42, Rank: 4, loss = 0.0927765965461731
Epoch: 6, Step: 42, Rank: 6, loss = 0.07686083763837814
Epoch: 6, Step: 42, Rank: 3, loss = 0.07497109472751617
Epoch: 6, Step: 42, Rank: 7, loss = 0.06324581056833267
Epoch: 6, Step: 42, Rank: 2, loss = 0.21885424852371216
Per-token loss scaled by world size: 0.0031561183277517557
Epoch: 6, Step: 42, Rank: 5, loss = 0.10967510938644409
[2024-07-27 04:42:55,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[1.044864830350515e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:55,387] [INFO] [timer.py:258:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=18.95295051613209, CurrSamplesPerSec=19.371539779153203, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.47s/it]
{
"epoch": 6,
"step": 42,
"rank": 0,
"loss": 0.04131164029240608,
"overall_throughput": 19.31602776765571,
"lr": 1.044864830350515e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 278,
"batch_size": 8,
"total_loss": 0.09827032685279846,
"gradnorm": 1.404802680015564,
"weight_norm": 393.4604797363281,
"timestamp": "2024-07-27T04:42:55.450259"
}
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
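The learning rates logged above decay smoothly to 0.0 at step 60, which fits a cosine schedule. The sketch below reproduces the logged values with a peak LR of 2e-5, 25 warmup steps, and 60 total steps; those three constants are inferred by fitting the logged lr values, not read from an InstructLab config:

```python
import math

# Cosine-decay sketch; peak_lr / warmup_steps / total_steps are inferred by
# fitting the lr values in this log (e.g. 1.3090169943749475e-05 at step 39).
peak_lr, warmup_steps, total_steps = 2e-5, 25, 60

def lr_at(step: int) -> float:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(39))  # -> 1.3090169943749475e-05, matching the step-39 log line
print(lr_at(60))  # -> 0.0, matching the final step
```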
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 4 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 7 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 7 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
Per-token loss scaled by world size: 0.003743554465472698
Per-token loss scaled by world size: 0.005117448978126049
Per-token loss scaled by world size: 0.0018975065322592854
Per-token loss scaled by world size: 0.009965005330741405
Per-token loss scaled by world size: 0.0038619362749159336
Per-token loss scaled by world size: 0.004172571934759617
Per-token loss scaled by world size: 0.00353407533839345
Epoch: 7, Step: 43, Rank: 6, loss = 0.056213632225990295
Epoch: 7, Step: 43, Rank: 7, loss = 0.2952132821083069
Epoch: 7, Step: 43, Rank: 2, loss = 0.110902801156044
Epoch: 7, Step: 43, Rank: 0, loss = 0.15160442888736725
Epoch: 7, Step: 43, Rank: 1, loss = 0.11440986394882202
Epoch: 7, Step: 43, Rank: 5, loss = 0.12361244112253189
Epoch: 7, Step: 43, Rank: 4, loss = 0.10469698160886765
Per-token loss scaled by world size: 0.003125852905213833
Epoch: 7, Step: 43, Rank: 3, loss = 0.09260339289903641
[2024-07-27 04:42:56,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[9.551351696494854e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:56,318] [INFO] [timer.py:258:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=18.947759228108367, CurrSamplesPerSec=18.74241437439884, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 43,
"rank": 0,
"loss": 0.15160442888736725,
"overall_throughput": 18.665014201337474,
"lr": 9.551351696494854e-06,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 237,
"batch_size": 8,
"total_loss": 0.13115710020065308,
"gradnorm": 1.5875235795974731,
"weight_norm": 393.4606628417969,
"timestamp": "2024-07-27T04:42:56.382842"
}
Per-token loss scaled by world size: 0.0007441109046339989
Per-token loss scaled by world size: 0.002569864271208644
Per-token loss scaled by world size: 0.0021702933590859175
Per-token loss scaled by world size: 0.0034706422593444586
Per-token loss scaled by world size: 0.003474967321380973
Per-token loss scaled by world size: 0.0027420881669968367
Per-token loss scaled by world size: 0.002911260584369302
Epoch: 7, Step: 44, Rank: 7, loss = 0.07596027106046677
Epoch: 7, Step: 44, Rank: 6, loss = 0.0899452492594719
Epoch: 7, Step: 44, Rank: 4, loss = 0.12147247791290283
Epoch: 7, Step: 44, Rank: 3, loss = 0.12162385880947113
Epoch: 7, Step: 44, Rank: 5, loss = 0.09597308933734894
Epoch: 7, Step: 44, Rank: 2, loss = 0.026043880730867386
Epoch: 7, Step: 44, Rank: 1, loss = 0.10189411789178848
Per-token loss scaled by world size: 0.0022017783485352993
Epoch: 7, Step: 44, Rank: 0, loss = 0.07706224173307419
[2024-07-27 04:42:56,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[8.657667341823449e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:56,789] [INFO] [timer.py:258:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=18.96843313737227, CurrSamplesPerSec=19.85672616190888, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,██▎ | 2/6 [00:01<00:02, 1.64it/s]
"step": 44,
"rank": 0,
"loss": 0.07706224173307419,
"overall_throughput": 19.817860414460462,
"lr": 8.657667341823449e-06,
"cuda_mem_allocated": 21.992404460906982,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 280,
"batch_size": 8,
"total_loss": 0.08874689787626266,
"gradnorm": 1.267701268196106,
"weight_norm": 393.4607849121094,
"timestamp": "2024-07-27T04:42:56.852698"
}
Per-token loss scaled by world size: 0.0025115651078522205
Per-token loss scaled by world size: 0.004961833357810974
Per-token loss scaled by world size: 0.0043532936833798885
Per-token loss scaled by world size: 0.0021706747356802225
Per-token loss scaled by world size: 0.0033806730061769485
Per-token loss scaled by world size: 0.002844580914825201
Per-token loss scaled by world size: 0.0033718389458954334
Epoch: 7, Step: 45, Rank: 4, loss = 0.1349520981311798
Epoch: 7, Step: 45, Rank: 5, loss = 0.1538168340921402
Epoch: 7, Step: 45, Rank: 6, loss = 0.06729091703891754
Epoch: 7, Step: 45, Rank: 7, loss = 0.0881820097565651
Epoch: 7, Step: 45, Rank: 0, loss = 0.07785851508378983
Epoch: 7, Step: 45, Rank: 2, loss = 0.10480086505413055
Epoch: 7, Step: 45, Rank: 1, loss = 0.10452700406312943
Per-token loss scaled by world size: 0.0024816528894007206
Epoch: 7, Step: 45, Rank: 3, loss = 0.07693123817443848
[2024-07-27 04:42:57,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[7.774790660436857e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:57,264] [INFO] [timer.py:258:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=18.984207118032753, CurrSamplesPerSec=19.671261884005887, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 360
{
"epoch": 7,
"step": 45,
"rank": 0,
"loss": 0.07785851508378983,
"overall_throughput": 19.632336945869298,
"lr": 7.774790660436857e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 248,
"batch_size": 8,
"total_loss": 0.10104493051767349,
"gradnorm": 1.2592891454696655,
"weight_norm": 393.46087646484375,
"timestamp": "2024-07-27T04:42:57.267026"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_360
[04:43:15] INFO saving took 17.815489530563354 seconds utils.py:611
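For post-hoc analysis, the pretty-printed JSON metric blocks are easy to scrape out of the raw log. A small sketch; the regex and residue handling are assumptions of mine, covering the tqdm progress-bar fragments that race with these prints for the terminal in a capture like this one:

```python
import json
import re

# Scrape the JSON metric blocks out of a raw training log like this one.
# The second regex strips tqdm progress-bar fragments that can end up
# inside a block when the bar and the print write to the same line.
def parse_metric_blocks(log_text: str) -> list[dict]:
    blocks = []
    for raw in re.findall(r'\{[^{}]*"timestamp"[^{}]*\}', log_text):
        cleaned = re.sub(r'[█▋▎ ]*\|\s*\d+/\d+ \[[^\]]*\]', '', raw)
        blocks.append(json.loads(cleaned))
    return blocks

# e.g. plot the loss curve:
#   metrics = parse_metric_blocks(open("train.log").read())
#   xs = [m["step"] for m in metrics]; ys = [m["total_loss"] for m in metrics]
```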
Per-token loss scaled by world size: 0.0031208053696900606
Per-token loss scaled by world size: 0.002719326876103878
Per-token loss scaled by world size: 0.00290810433216393
Per-token loss scaled by world size: 0.00502825528383255
Per-token loss scaled by world size: 0.0031488884706050158
Per-token loss scaled by world size: 0.0032260508742183447
Per-token loss scaled by world size: 0.0017572520300745964
Epoch: 7, Step: 46, Rank: 5, loss = 0.08837812393903732
Epoch: 7, Step: 46, Rank: 6, loss = 0.09451339393854141
Epoch: 7, Step: 46, Rank: 3, loss = 0.10233887284994125
Epoch: 7, Step: 46, Rank: 0, loss = 0.10142617672681808
Epoch: 7, Step: 46, Rank: 7, loss = 0.10484665632247925
Epoch: 7, Step: 46, Rank: 1, loss = 0.05711068958044052
Epoch: 7, Step: 46, Rank: 4, loss = 0.16341829299926758
Per-token loss scaled by world size: 0.00243758293800056
Epoch: 7, Step: 46, Rank: 2, loss = 0.0792214423418045
[2024-07-27 04:43:15,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[6.909830056250527e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:15,568] [INFO] [timer.py:258:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=18.98530026908578, CurrSamplesPerSec=19.0324251537424, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,█████▋ | 4/6 [00:20<00:10, 5.45s/it]
"step": 46,
"rank": 0,
"loss": 0.10142617672681808,
"overall_throughput": 18.986955901758453,
"lr": 6.909830056250527e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 260,
"batch_size": 8,
"total_loss": 0.09890670329332352,
"gradnorm": 1.4150254726409912,
"weight_norm": 393.46099853515625,
"timestamp": "2024-07-27T04:43:15.632493"
}
Per-token loss scaled by world size: 0.0018586666556075215
Per-token loss scaled by world size: 0.002927313791587949
Per-token loss scaled by world size: 0.002946708584204316
Per-token loss scaled by world size: 0.0019047270761802793
Per-token loss scaled by world size: 0.0012054119724780321
Per-token loss scaled by world size: 0.0014979788102209568
Per-token loss scaled by world size: 0.0022586516570299864
Epoch: 7, Step: 47, Rank: 0, loss = 0.10757878422737122
Epoch: 7, Step: 47, Rank: 5, loss = 0.06999871879816055
Epoch: 7, Step: 47, Rank: 4, loss = 0.10829153656959534
Epoch: 7, Step: 47, Rank: 7, loss = 0.055050719529390335
Epoch: 7, Step: 47, Rank: 2, loss = 0.06830599904060364
Epoch: 7, Step: 47, Rank: 3, loss = 0.044298890978097916
Epoch: 7, Step: 47, Rank: 1, loss = 0.08300545066595078
Per-token loss scaled by world size: 0.002645065076649189
Epoch: 7, Step: 47, Rank: 6, loss = 0.09720613807439804
[2024-07-27 04:43:15,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[6.069749683460765e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:16,046] [INFO] [timer.py:258:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=18.996249360616417, CurrSamplesPerSec=19.490837611941338, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,███████▎ | 5/6 [00:20<00:03, 3.66s/it]
"step": 47,
"rank": 0,
"loss": 0.10757878422737122,
"overall_throughput": 19.4505930343033,
"lr": 6.069749683460765e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 294,
"batch_size": 8,
"total_loss": 0.07921702414751053,
"gradnorm": 1.5372004508972168,
"weight_norm": 393.4610900878906,
"timestamp": "2024-07-27T04:43:16.111313"
}
Per-token loss scaled by world size: 0.0035249628126621246
Per-token loss scaled by world size: 0.0036447104066610336
Per-token loss scaled by world size: 0.0025723562575876713
Per-token loss scaled by world size: 0.0031749033369123936
Per-token loss scaled by world size: 0.00402703694999218
Per-token loss scaled by world size: 0.0017748093232512474
Per-token loss scaled by world size: 0.00937521830201149
Epoch: 7, Step: 48, Rank: 3, loss = 0.08971092104911804
Epoch: 7, Step: 48, Rank: 7, loss = 0.12710927426815033
Epoch: 7, Step: 48, Rank: 0, loss = 0.11072475463151932
Epoch: 7, Step: 48, Rank: 4, loss = 0.14044290781021118
Epoch: 7, Step: 48, Rank: 6, loss = 0.12293307483196259
Epoch: 7, Step: 48, Rank: 5, loss = 0.3269607424736023
Epoch: 7, Step: 48, Rank: 2, loss = 0.06189647689461708
Per-token loss scaled by world size: 0.004552490543574095
Epoch: 7, Step: 48, Rank: 1, loss = 0.15876810252666473
[2024-07-27 04:43:16,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[5.2613133752700145e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:16,523] [INFO] [timer.py:258:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=19.007811150191724, CurrSamplesPerSec=19.543068281625303, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,█████████| 6/6 [00:21<00:00, 2.57s/it]
"step": 48,
"rank": 0,
"loss": 0.11072475463151932,
"overall_throughput": 19.50403630817306,
"lr": 5.2613133752700145e-06,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 279,
"batch_size": 8,
"total_loss": 0.14231827855110168,
"gradnorm": 2.0794081687927246,
"weight_norm": 393.461181640625,
"timestamp": "2024-07-27T04:43:16.587633"
}
Epoch 7: 100%|██████████| 6/6 [00:21<00:00, 3.52s/it]
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 0 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 0 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 6 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
Per-token loss scaled by world size: 0.007703100331127644
Per-token loss scaled by world size: 0.002897256053984165
Per-token loss scaled by world size: 0.0018762396648526192
Per-token loss scaled by world size: 0.0031769108027219772
Per-token loss scaled by world size: 0.0031007928773760796
Per-token loss scaled by world size: 0.0032394849695265293
Per-token loss scaled by world size: 0.0033565827179700136
Epoch: 8, Step: 49, Rank: 6, loss = 0.08872846513986588
Epoch: 8, Step: 49, Rank: 2, loss = 0.2359074503183365
Epoch: 8, Step: 49, Rank: 7, loss = 0.057459838688373566
Epoch: 8, Step: 49, Rank: 0, loss = 0.09729289263486862
Epoch: 8, Step: 49, Rank: 5, loss = 0.09496178478002548
Epoch: 8, Step: 49, Rank: 4, loss = 0.10279534757137299
Epoch: 8, Step: 49, Rank: 1, loss = 0.09920922666788101
Per-token loss scaled by world size: 0.001880081370472908
Epoch: 8, Step: 49, Rank: 3, loss = 0.05757749080657959
[2024-07-27 04:43:17,376] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[4.491030185478976e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:17,453] [INFO] [timer.py:258:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=18.97237396051445, CurrSamplesPerSec=17.473818511885575, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s]
{
"epoch": 8,
"step": 49,
"rank": 0,
"loss": 0.09729289263486862,
"overall_throughput": 17.406956156346215,
"lr": 4.491030185478976e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 245,
"batch_size": 8,
"total_loss": 0.10424157232046127,
"gradnorm": 2.468442678451538,
"weight_norm": 393.4612731933594,
"timestamp": "2024-07-27T04:43:17.517240"
}
Per-token loss scaled by world size: 0.0022910190746188164
Per-token loss scaled by world size: 0.003779459511861205
Per-token loss scaled by world size: 0.0047139013186097145
Per-token loss scaled by world size: 0.0016656998777762055
Per-token loss scaled by world size: 0.0011160913854837418
Per-token loss scaled by world size: 0.002607797970995307
Per-token loss scaled by world size: 0.0011160913854837418
Epoch: 8, Step: 50, Rank: 4, loss = 0.03934222087264061
Epoch: 8, Step: 50, Rank: 1, loss = 0.05871592089533806
Epoch: 8, Step: 50, Rank: 5, loss = 0.16616502404212952
Epoch: 8, Step: 50, Rank: 0, loss = 0.1332259476184845
Epoch: 8, Step: 50, Rank: 2, loss = 0.08075842261314392
Epoch: 8, Step: 50, Rank: 6, loss = 0.03934222087264061
Epoch: 8, Step: 50, Rank: 7, loss = 0.09192487597465515
Per-token loss scaled by world size: 0.0013738555135205388
Epoch: 8, Step: 50, Rank: 3, loss = 0.048428408801555634
[2024-07-27 04:43:17,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[3.7651019814126656e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:17,936] [INFO] [timer.py:258:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=18.977750407321444, CurrSamplesPerSec=19.233927031934993, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 400
{
"epoch": 8,
"step": 50,
"rank": 0,
"loss": 0.1332259476184845,
"overall_throughput": 19.198084907835618,
"lr": 3.7651019814126656e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 282,
"batch_size": 8,
"total_loss": 0.08223787695169449,
"gradnorm": 1.6959415674209595,
"weight_norm": 393.4613342285156,
"timestamp": "2024-07-27T04:43:17.940040"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_400
[04:43:35] INFO saving took 17.87660813331604 seconds utils.py:611
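The DeepSpeed timer lines report throughput in samples per second; with an effective batch size of 8 that converts directly to per-step wall time. Quick arithmetic using the step-52 timer line below; treating CurrSamplesPerSec as whole-batch throughput is an assumption:

```python
# CurrSamplesPerSec from the step-52 timer line; assuming it measures
# whole-batch throughput, this gives the per-optimizer-step wall time.
curr_samples_per_sec = 20.052273645410278
batch_size = 8                              # --effective-batch-size 8
print(batch_size / curr_samples_per_sec)    # -> ~0.40 s per optimizer step
```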
Epoch 8: 33%|███▎ | 2/6 [00:19<00:44, 11.14s/it]
Per-token loss scaled by world size: 0.0034444250632077456
Per-token loss scaled by world size: 0.0036195043940097094
Per-token loss scaled by world size: 0.0021303421817719936
Per-token loss scaled by world size: 0.002188930055126548
Per-token loss scaled by world size: 0.0012819116236642003
Per-token loss scaled by world size: 0.0027832810301333666
Per-token loss scaled by world size: 0.0016897486057132483
Epoch: 8, Step: 51, Rank: 1, loss = 0.11129976063966751
Epoch: 8, Step: 51, Rank: 7, loss = 0.06550802290439606
Epoch: 8, Step: 51, Rank: 5, loss = 0.08558589220046997
Epoch: 8, Step: 51, Rank: 2, loss = 0.06730959564447403
Epoch: 8, Step: 51, Rank: 0, loss = 0.10591606795787811
Epoch: 8, Step: 51, Rank: 3, loss = 0.0394187830388546
Epoch: 8, Step: 51, Rank: 6, loss = 0.05195976793766022
Per-token loss scaled by world size: 0.003074637847021222
Epoch: 8, Step: 51, Rank: 4, loss = 0.09454511106014252
[2024-07-27 04:43:36,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[3.089373510131354e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:36,312] [INFO] [timer.py:258:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=18.971761108564685, CurrSamplesPerSec=18.68865417133589, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 50%|█████ | 3/6 [00:19<00:18, 6.28s/it]
{
"epoch": 8,
"step": 51,
"rank": 0,
"loss": 0.10591606795787811,
"overall_throughput": 18.651619864047372,
"lr": 3.089373510131354e-06,
"cuda_mem_allocated": 21.988811492919922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.07769287377595901,
"gradnorm": 0.9065935611724854,
"weight_norm": 393.46136474609375,
"timestamp": "2024-07-27T04:43:36.375570"
}
Per-token loss scaled by world size: 0.005752637051045895
Per-token loss scaled by world size: 0.00271693360991776
Per-token loss scaled by world size: 0.004330337047576904
Per-token loss scaled by world size: 0.005382548552006483
Per-token loss scaled by world size: 0.0025455320719629526
Per-token loss scaled by world size: 0.0023602889850735664
Per-token loss scaled by world size: 0.00044713294482789934
Epoch: 8, Step: 52, Rank: 1, loss = 0.08286647498607635
Epoch: 8, Step: 52, Rank: 0, loss = 0.17545543611049652
Epoch: 8, Step: 52, Rank: 6, loss = 0.13207527995109558
Epoch: 8, Step: 52, Rank: 2, loss = 0.16416773200035095
Epoch: 8, Step: 52, Rank: 7, loss = 0.07763873040676117
Epoch: 8, Step: 52, Rank: 5, loss = 0.07198881357908249
Epoch: 8, Step: 52, Rank: 4, loss = 0.013637554831802845
Per-token loss scaled by world size: 0.001792231691069901
Epoch: 8, Step: 52, Rank: 3, loss = 0.05466306582093239
[2024-07-27 04:43:36,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[2.469285339963892e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:36,777] [INFO] [timer.py:258:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=18.99222895360799, CurrSamplesPerSec=20.052273645410278, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 67%|██████▋ | 4/6 [00:20<00:07, 3.98s/it]
{
"epoch": 8,
"step": 52,
"rank": 0,
"loss": 0.17545543611049652,
"overall_throughput": 20.013534639121023,
"lr": 2.469285339963892e-06,
"cuda_mem_allocated": 21.989290714263916,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 244,
"batch_size": 8,
"total_loss": 0.09656163305044174,
"gradnorm": 1.5890859365463257,
"weight_norm": 393.4613952636719,
"timestamp": "2024-07-27T04:43:36.840099"
}
Per-token loss scaled by world size: 0.0025300777051597834
Per-token loss scaled by world size: 0.0022664989810436964
Per-token loss scaled by world size: 0.006000937893986702
Per-token loss scaled by world size: 0.002840510569512844
Per-token loss scaled by world size: 0.004035668447613716
Per-token loss scaled by world size: 0.0041307490319013596
Per-token loss scaled by world size: 0.003075978020206094
Epoch: 8, Step: 53, Rank: 5, loss = 0.07309459149837494
Epoch: 8, Step: 53, Rank: 4, loss = 0.09160646796226501
Epoch: 8, Step: 53, Rank: 6, loss = 0.19353024661540985
Epoch: 8, Step: 53, Rank: 1, loss = 0.13015030324459076
Epoch: 8, Step: 53, Rank: 3, loss = 0.13321664929389954
Epoch: 8, Step: 53, Rank: 2, loss = 0.08159500360488892
Epoch: 8, Step: 53, Rank: 7, loss = 0.0992002934217453
Per-token loss scaled by world size: 0.0016319038113579154
Epoch: 8, Step: 53, Rank: 0, loss = 0.05262889713048935
[2024-07-27 04:43:37,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[1.9098300562505266e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:37,244] [INFO] [timer.py:258:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=19.00944478158272, CurrSamplesPerSec=19.91191964124113, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 83%|████████▎ | 5/6 [00:20<00:02, 2.71s/it]
{
"epoch": 8,
"step": 53,
"rank": 0,
"loss": 0.05262889713048935,
"overall_throughput": 19.87502717277025,
"lr": 1.9098300562505266e-06,
"cuda_mem_allocated": 21.99288320541382,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 258,
"batch_size": 8,
"total_loss": 0.10687780380249023,
"gradnorm": 1.6277161836624146,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:37.309449"
}
Per-token loss scaled by world size: 0.004982769954949617
Per-token loss scaled by world size: 0.0023371989373117685
Per-token loss scaled by world size: 0.001956745982170105
Per-token loss scaled by world size: 0.0019846318755298853
Per-token loss scaled by world size: 0.001973965670913458
Per-token loss scaled by world size: 0.001133645768277347
Per-token loss scaled by world size: 0.0006779870600439608
Epoch: 8, Step: 54, Rank: 0, loss = 0.19868795573711395
Epoch: 8, Step: 54, Rank: 6, loss = 0.09319580346345901
Epoch: 8, Step: 54, Rank: 5, loss = 0.07802524417638779
Epoch: 8, Step: 54, Rank: 3, loss = 0.07871188223361969
Epoch: 8, Step: 54, Rank: 7, loss = 0.045204125344753265
Epoch: 8, Step: 54, Rank: 4, loss = 0.027034733444452286
Epoch: 8, Step: 54, Rank: 2, loss = 0.07913719862699509
Per-token loss scaled by world size: 0.0017750355182215571
Epoch: 8, Step: 54, Rank: 1, loss = 0.07077953964471817
[2024-07-27 04:43:37,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[1.4155120639813392e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:37,723] [INFO] [timer.py:258:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=19.017755360925854, CurrSamplesPerSec=19.451449970580306, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 1.95s/it]
{
"epoch": 8,
"step": 54,
"rank": 0,
"loss": 0.19868795573711395,
"overall_throughput": 19.41544489924397,
"lr": 1.4155120639813392e-06,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 319,
"batch_size": 8,
"total_loss": 0.0838470607995987,
"gradnorm": 0.9820513129234314,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:37.787702"
}
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 7 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
Per-token loss scaled by world size: 0.0020672364626079798
Per-token loss scaled by world size: 0.005803861655294895
Per-token loss scaled by world size: 0.0010450059780851007
Per-token loss scaled by world size: 0.00481435377150774
Per-token loss scaled by world size: 0.004757868126034737
Per-token loss scaled by world size: 0.0012225221144035459
Per-token loss scaled by world size: 0.003656236920505762
Epoch: 9, Step: 55, Rank: 4, loss = 0.14262522757053375
Epoch: 9, Step: 55, Rank: 5, loss = 0.1719394028186798
Epoch: 9, Step: 55, Rank: 2, loss = 0.030958302319049835
Epoch: 9, Step: 55, Rank: 1, loss = 0.06124188005924225
Epoch: 9, Step: 55, Rank: 0, loss = 0.14095184206962585
Epoch: 9, Step: 55, Rank: 7, loss = 0.03621721640229225
Epoch: 9, Step: 55, Rank: 3, loss = 0.10831601917743683
Per-token loss scaled by world size: 0.003545596729964018
Epoch: 9, Step: 55, Rank: 6, loss = 0.10503830015659332
[2024-07-27 04:43:38,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[9.903113209758098e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:38,646] [INFO] [timer.py:258:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=18.958935135030256, CurrSamplesPerSec=16.33220426430826, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 440
{
"epoch": 9,
"step": 55,
"rank": 0,
"loss": 0.14095184206962585,
"overall_throughput": 16.273810095413527,
"lr": 9.903113209758098e-07,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 237,
"batch_size": 8,
"total_loss": 0.09966102987527847,
"gradnorm": 1.0968877077102661,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:38.650582"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_440
[04:43:56] INFO saving took 17.79723310470581 seconds utils.py:611
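Each hf_format/samples_* directory written above is a standard Hugging Face checkpoint directory, so it should load with transformers in the usual way. A sketch, not run as part of this log; the path is one of the checkpoints printed above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the checkpoints written by this run (path taken from the log).
ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_440"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
```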
Per-token loss scaled by world size: 0.005642743315547705
Per-token loss scaled by world size: 0.002617186401039362
Per-token loss scaled by world size: 0.0045571294613182545
Per-token loss scaled by world size: 0.002132992260158062
Per-token loss scaled by world size: 0.0015219022752717137
Per-token loss scaled by world size: 0.003468153765425086
Per-token loss scaled by world size: 0.0018528653308749199
Epoch: 9, Step: 56, Rank: 0, loss = 0.14867635071277618
Epoch: 9, Step: 56, Rank: 7, loss = 0.08538571000099182
Epoch: 9, Step: 56, Rank: 5, loss = 0.06958886981010437
Epoch: 9, Step: 56, Rank: 4, loss = 0.1131485179066658
Epoch: 9, Step: 56, Rank: 2, loss = 0.049652062356472015
Epoch: 9, Step: 56, Rank: 1, loss = 0.06044973060488701
Epoch: 9, Step: 56, Rank: 6, loss = 0.18409450352191925
Per-token loss scaled by world size: 0.001767554902471602
Epoch: 9, Step: 56, Rank: 3, loss = 0.05766648054122925
[2024-07-27 04:43:56,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[6.37651293602628e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:56,928] [INFO] [timer.py:258:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=18.96440824470298, CurrSamplesPerSec=19.25907525027751, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,██▎ | 2/6 [00:19<00:31, 7.95s/it]
"step": 56,
"rank": 0,
"loss": 0.14867635071277618,
"overall_throughput": 19.213232991090933,
"lr": 6.37651293602628e-07,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 261,
"batch_size": 8,
"total_loss": 0.09608278423547745,
"gradnorm": 1.2486889362335205,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:56.991985"
}
Per-token loss scaled by world size: 0.006159770302474499
Per-token loss scaled by world size: 0.004680618178099394
Per-token loss scaled by world size: 0.003500568214803934
Per-token loss scaled by world size: 0.002827225485816598
Per-token loss scaled by world size: 0.001885988749563694
Per-token loss scaled by world size: 0.002812023274600506
Per-token loss scaled by world size: 0.0035085994750261307
Epoch: 9, Step: 57, Rank: 6, loss = 0.10764247179031372
Epoch: 9, Step: 57, Rank: 4, loss = 0.08693718165159225
Epoch: 9, Step: 57, Rank: 3, loss = 0.14392900466918945
Epoch: 9, Step: 57, Rank: 0, loss = 0.05799415335059166
Epoch: 9, Step: 57, Rank: 5, loss = 0.08646971732378006
Epoch: 9, Step: 57, Rank: 7, loss = 0.10788943618535995
Epoch: 9, Step: 57, Rank: 2, loss = 0.1894129365682602
Per-token loss scaled by world size: 0.0016236526425927877
Epoch: 9, Step: 57, Rank: 1, loss = 0.04992732033133507
[2024-07-27 04:43:57,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[3.603713930414676e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:57,395] [INFO] [timer.py:258:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=18.980723963154006, CurrSamplesPerSec=19.905493724517065, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,████ | 3/6 [00:19<00:13, 4.53s/it]
"step": 57,
"rank": 0,
"loss": 0.05799415335059166,
"overall_throughput": 19.839444119579092,
"lr": 3.603713930414676e-07,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.1037752702832222,
"gradnorm": 1.5781608819961548,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:57.458487"
}
Per-token loss scaled by world size: 0.00027386093279346824
Per-token loss scaled by world size: 0.002793475054204464
Per-token loss scaled by world size: 0.0012327907606959343
Per-token loss scaled by world size: 0.0018183693755418062
Per-token loss scaled by world size: 0.0011149498168379068
Per-token loss scaled by world size: 0.0009586562518961728
Per-token loss scaled by world size: 0.006267122458666563
Epoch: 9, Step: 58, Rank: 3, loss = 0.09497815370559692
Epoch: 9, Step: 58, Rank: 4, loss = 0.04191488400101662
Epoch: 9, Step: 58, Rank: 7, loss = 0.032594311982393265
Epoch: 9, Step: 58, Rank: 5, loss = 0.06182456016540527
Epoch: 9, Step: 58, Rank: 2, loss = 0.037908293306827545
Epoch: 9, Step: 58, Rank: 6, loss = 0.009311271831393242
Epoch: 9, Step: 58, Rank: 1, loss = 0.21308216452598572
Per-token loss scaled by world size: 0.0013049639528617263
Epoch: 9, Step: 58, Rank: 0, loss = 0.04436877369880676
[2024-07-27 04:43:57,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[1.6070411401370335e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:57,863] [INFO] [timer.py:258:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=18.99479756047954, CurrSamplesPerSec=19.802352008035566, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,█████▋ | 4/6 [00:20<00:05, 2.93s/it]
"step": 58,
"rank": 0,
"loss": 0.04436877369880676,
"overall_throughput": 19.7424303222504,
"lr": 1.6070411401370335e-07,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 272,
"batch_size": 8,
"total_loss": 0.06699780374765396,
"gradnorm": 1.2012358903884888,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:57.924678"
}
Per-token loss scaled by world size: 0.0033104075118899345
Per-token loss scaled by world size: 0.005029842257499695
Per-token loss scaled by world size: 0.0013227150775492191
Per-token loss scaled by world size: 0.0013601266546174884
Per-token loss scaled by world size: 0.0020338338799774647
Per-token loss scaled by world size: 0.002029073191806674
Per-token loss scaled by world size: 0.0012528002262115479
Epoch: 9, Step: 59, Rank: 1, loss = 0.181074321269989
Epoch: 9, Step: 59, Rank: 0, loss = 0.11917466670274734
Epoch: 9, Step: 59, Rank: 4, loss = 0.04761774465441704
Epoch: 9, Step: 59, Rank: 6, loss = 0.07321801781654358
Epoch: 9, Step: 59, Rank: 2, loss = 0.04896456003189087
Epoch: 9, Step: 59, Rank: 7, loss = 0.04510080814361572
Epoch: 9, Step: 59, Rank: 3, loss = 0.07304663211107254
Per-token loss scaled by world size: 0.0016570077277719975
Epoch: 9, Step: 59, Rank: 5, loss = 0.05965227633714676
[2024-07-27 04:43:58,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[4.025706004760932e-08], mom=[(0.9, 0.95)]
[2024-07-27 04:43:58,339] [INFO] [timer.py:258:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=19.001181258681584, CurrSamplesPerSec=19.36564785840185, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,███████▎ | 5/6 [00:20<00:02, 2.04s/it]
"step": 59,
"rank": 0,
"loss": 0.11917466670274734,
"overall_throughput": 19.311280902601126,
"lr": 4.025706004760932e-08,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 288,
"batch_size": 8,
"total_loss": 0.08098112046718597,
"gradnorm": 1.2536462545394897,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:58.402303"
}
Per-token loss scaled by world size: 0.0025534227024763823
Per-token loss scaled by world size: 0.002198881469666958
Per-token loss scaled by world size: 0.003101743757724762
Per-token loss scaled by world size: 0.0017734984867274761
Per-token loss scaled by world size: 0.001557655748911202
Per-token loss scaled by world size: 0.0014592667575925589
Per-token loss scaled by world size: 0.00225572707131505
Epoch: 9, Step: 60, Rank: 6, loss = 0.11088734120130539
Epoch: 9, Step: 60, Rank: 0, loss = 0.0912848636507988
Epoch: 9, Step: 60, Rank: 2, loss = 0.05568619444966316
Epoch: 9, Step: 60, Rank: 7, loss = 0.05216878652572632
Epoch: 9, Step: 60, Rank: 4, loss = 0.07861001044511795
Epoch: 9, Step: 60, Rank: 5, loss = 0.06340257078409195
Epoch: 9, Step: 60, Rank: 3, loss = 0.08064224570989609
Per-token loss scaled by world size: 0.0007612230838276446
Epoch: 9, Step: 60, Rank: 1, loss = 0.027213726192712784
[2024-07-27 04:43:58,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 04:43:58,814] [INFO] [timer.py:258:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=19.00919914338529, CurrSamplesPerSec=19.4776793799544, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 480
{
"epoch": 9,
"step": 60,
"rank": 0,
"loss": 0.0912848636507988,
"overall_throughput": 19.42523488114542,
"lr": 0.0,
"cuda_mem_allocated": 21.988811492919922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 286,
"batch_size": 8,
"total_loss": 0.06998696178197861,
"gradnorm": 1.0276967287063599,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:58.817101"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_480
[04:44:16] INFO saving took 17.839636087417603 seconds utils.py:611
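The samples_seen count is consistent with the run flags: 60 optimizer steps at an effective batch size of 8 gives 60 × 8 = 480. Since the checkpoint is written in Hugging Face format, it should load with the stock transformers API; a minimal sketch (the checkpoint path is copied from the log above, the prompt is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_480"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("What is InstructLab?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))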
Epoch 9: 100%|██████████| 6/6 [00:38<00:00, 6.49s/it]
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:267:1037 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 4, res=3, closed=0
tyler-rhel-newimage:265:1035 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0
tyler-rhel-newimage:260:1039 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:264:1041 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Close from rank 4, retcode 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] proxy.cc:1521 NCCL WARN [Proxy Service 5] Failed to execute operation Close from rank 5, retcode 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:260:1039 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
tyler-rhel-newimage:267:1037 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
tyler-rhel-newimage:262:1033 [2] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 2, res=3, closed=0
tyler-rhel-newimage:260:1039 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1521 NCCL WARN [Proxy Service 6] Failed to execute operation Close from rank 6, retcode 3
tyler-rhel-newimage:263:1043 [3] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:1037 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Close from rank 7, retcode 3
tyler-rhel-newimage:262:1033 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Close from rank 2, retcode 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:263:1043 [3] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 3, res=3, closed=0
tyler-rhel-newimage:263:1043 [3] proxy.cc:1521 NCCL WARN [Proxy Service 3] Failed to execute operation Close from rank 3, retcode 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
tyler-rhel-newimage:266:43159 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
tyler-rhel-newimage:267:43156 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
tyler-rhel-newimage:262:43163 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
tyler-rhel-newimage:263:43158 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
tyler-rhel-newimage:264:43161 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
tyler-rhel-newimage:265:43162 [5] NCCL INFO comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
tyler-rhel-newimage:261:43157 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
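The NCCL INFO socket traces and the "Accept failed" / "Failed to execute operation Close" warnings above are emitted while the communicators are torn down after the final checkpoint; training itself completed. If the teardown noise is unwanted on future runs, NCCL's standard NCCL_DEBUG environment variable can be lowered before the process group is created (for a CLI launch like this one, exporting the variable in the shell has the same effect); a minimal sketch:

import os

# NCCL reads NCCL_DEBUG when the communicators are initialized, so this
# must run before torch.distributed / DeepSpeed set up the process group.
os.environ["NCCL_DEBUG"] = "WARN"  # keep warnings, drop the INFO socket traces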
Terminating process 🤖
[root@tyler-rhel-newimage instructlab]#