@relyt0925
Created July 27, 2024 04:44
newtraining output
[root@tyler-rhel-newimage instructlab]# /root/ilab model train --data-path /var/instructlabbigdisk/instructlab/generateddata/messages_Mixtral-8x7B-Instruct-v0_2024-07-27T04_27_23.jsonl --model-path /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --ckpt-output-dir /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --device cuda --gpus 8 --max-batch-len 1 --effective-batch-size 8 --save-samples 46
[2024-07-27 04:38:32,852] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
INFO 2024-07-27 04:38:36,486 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-07-27 04:38:36,486 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-07-27 04:38:36,486 numexpr.utils:161: NumExpr defaulting to 16 threads.
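Note on the NumExpr lines above: the library caps itself at 16 threads because NUMEXPR_MAX_THREADS is unset, even though it detected 80 virtual cores and supports up to 64 here. A minimal sketch of raising the cap (the variable must be set before numexpr is first imported; 64 mirrors the maximum NumExpr itself reports above):

import os

# Must be set before numexpr is first imported; 64 matches the maximum
# NumExpr reports for this 80-core machine in the log above.
os.environ["NUMEXPR_MAX_THREADS"] = "64"

import numexpr  # now initializes with up to 64 threads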
INFO 2024-07-27 04:38:36,869 datasets:58: PyTorch version 2.3.1 available.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
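The transformers message above refers to the `legacy` flag on LlamaTokenizerFast. A minimal sketch of opting into the new behaviour, using the model path from this run (only do this after reading the linked PR, as the warning advises):

from transformers import AutoTokenizer

# Opts out of the legacy behaviour described in the warning above; see
# https://github.com/huggingface/transformers/pull/24565 before changing this.
tok = AutoTokenizer.from_pretrained(
    "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
    legacy=False,
)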
INFO 2024-07-27 04:38:37,191 root:611: !!!!!!!! tokenizer has add_bos_token or add_eos_token
INFO 2024-07-27 04:38:37,196 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
tokenizing the dataset with /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ tokenizer...
ten largest length percentiles:
quantile 90th: 78.0
quantile 91th: 79.80000000000001
quantile 92th: 83.19999999999999
quantile 93th: 86.80000000000001
quantile 94th: 89.19999999999999
quantile 95th: 91.0
quantile 96th: 93.59999999999997
quantile 97th: 97.19999999999999
quantile 98th: 100.70000000000002
quantile 99th: 103.84999999999998
quantile 100th: 107.0
at 4096 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 44.0
quantile 1th: 44.45
quantile 2th: 44.9
quantile 3th: 45.0
quantile 4th: 45.0
quantile 5th: 45.0
quantile 6th: 45.0
quantile 7th: 45.15
quantile 8th: 45.6
quantile 9th: 46.1
quantile 10th: 47.0
at 20 min sequence length, the number of samples to be dropped is 0
checking the validity of the samples...
INFO 2024-07-27 04:38:42,745 root:611: number of dropped samples: 0 -- out of 46
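The percentile tables and drop counts above are standard token-length statistics over the 46 samples. A minimal sketch of the same computation (a hypothetical helper, not InstructLab's code; assumes `lengths` holds per-sample token counts and uses this run's 4096/20 bounds, with the exact >/>= drop semantics unknown):

import numpy as np

def length_report(lengths, max_seq_len=4096, min_seq_len=20):
    lengths = np.asarray(lengths)
    # Largest length percentiles, matching the log's 90th-100th table.
    for q in range(90, 101):
        print(f"quantile {q}th: {np.percentile(lengths, q)}")
    too_long = int((lengths > max_seq_len).sum())   # assumed strict comparison
    too_short = int((lengths < min_seq_len).sum())
    print(f"at {max_seq_len} max sequence length, samples dropped: {too_long}")
    print(f"at {min_seq_len} min sequence length, samples dropped: {too_short}")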
Categorizing training data type...
Data type sorting: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 506398.91it/s]
unmasking the appropriate message content...
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.
Instruction ex sample 16: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|>
Original Input: <|user|>
Question: How many villages named "Qarah Tappeh" are mentioned in the text?
<|assistant|>
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|>
Instruction ex sample 39: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|>
Original Input: <|user|>
Question: How many villages named "Qarah Tappeh" are mentioned in the text, each with a different location?
<|assistant|>
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|>
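The <mask> rendering in the two samples above reflects instruction-tuning loss masking: prompt tokens are excluded from the loss, and only the assistant's answer (plus the end-of-text token) is learned. A minimal sketch of the usual mechanism (not InstructLab's exact code), where masked positions get the label -100 that PyTorch cross-entropy ignores by default:

import torch

def mask_labels(input_ids, answer_start):
    # -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so
    # the model is only trained to predict the unmasked answer tokens.
    labels = input_ids.clone()
    labels[:answer_start] = -100
    return labels

# Example: a 10-token sample whose answer begins at position 6.
ids = torch.arange(10)
print(mask_labels(ids, answer_start=6))
# tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])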
Creating json from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 172.75ba/s]
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --data_path=/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --num_epochs=10 --effective_batch_size=8 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=46 --log_level=INFO --max_batch_len=1 --seed=42 --chat-tmpl-path=/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757]
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] *****************************************
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] *****************************************
[2024-07-27 04:38:47,197] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,436] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,460] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,488] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,592] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,593] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,603] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 04:38:47,623] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:50] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
model_name_or_path: /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/
data_path: /var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl
output_dir: /var/instructlabbigdisk/instructlab/knowledgecheckpoints/
num_epochs: 10
last_step: 0
effective_batch_size: 8
learning_rate: 2.0e-05
lr_scheduler: cosine
num_warmup_steps: 25
save_samples: 46
save_samples_ds: null
save_last: false
log_level: INFO
seed: 42
mock_data: false
mock_len: 2600
sharding_strategy: FULL_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 1
cpu_offload_optimizer: false
cpu_offload_optimizer_pin_memory: false
cpu_offload_optimizer_ratio: 1.0
NEFTune_alpha: null
chat_tmpl_path: /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
disable_flash_attn: false
{
"script_params": {
"model_name_or_path": "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
"data_path": "/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl",
"output_dir": "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/",
"num_epochs": 10,
"last_step": 0,
"effective_batch_size": 8,
"learning_rate": 2e-05,
"lr_scheduler": "cosine",
"num_warmup_steps": 25,
"save_samples": 46,
"save_samples_ds": null,
"save_last": false,
"log_level": "INFO",
"seed": 42,
"mock_data": false,
"mock_len": 2600,
"sharding_strategy": "FULL_SHARD",
"is_granite": false,
"lora_r": 0,
"lora_alpha": 32,
"lora_dropout": 0.1,
"lora_quant_bits": null,
"lora_target_modules": null,
"max_batch_len": 1,
"cpu_offload_optimizer": false,
"cpu_offload_optimizer_pin_memory": false,
"cpu_offload_optimizer_ratio": 1.0,
"NEFTune_alpha": null,
"chat_tmpl_path": "/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
"disable_flash_attn": false
},
"timestamp": "2024-07-27T04:38:51.166561"
}
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[2024-07-27 04:38:51,196] [INFO] [comm.py:637:init_distributed] cdb=None
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[2024-07-27 04:38:51,244] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:51,244] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611
[2024-07-27 04:38:51,961] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,111] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:260:260 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:260 [0] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:260:260 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-07-27 04:38:52,191] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,200] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:265:265 [5] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:263:263 [3] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:265:265 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:265 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:263:263 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:263 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-07-27 04:38:52,222] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 04:38:52,228] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:264:264 [4] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:264:264 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:264 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:267:267 [7] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:267:267 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:267 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:261:261 [1] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:261:261 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:261 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:266:266 [6] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:266:266 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:266 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:262:262 [2] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:262:262 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:262:262 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:1019 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:1024 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:1025 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:1027 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:1028 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:1029 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:1026 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:262:1030 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init START
tyler-rhel-newimage:263:1024 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:261:1028 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:263:1024 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:261:1028 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:260:1019 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:262:1030 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:260:1019 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:262:1030 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:264:1026 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1026 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:265:1025 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1025 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:267:1027 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1027 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:266:1029 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1029 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:266:1029 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:267:1027 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:262:1030 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:264:1026 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:261:1028 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:260:1019 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:261:1028 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:264:1026 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:262:1030 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1029 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1027 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:261:1028 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:264:1026 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1030 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1029 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1027 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1019 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1019 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1024 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:265:1025 [5] NCCL INFO comm 0x558446a0b990 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:263:1024 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:263:1024 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:265:1025 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:265:1025 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1029 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1029 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1025 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1025 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1030 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1030 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1027 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1027 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1019 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1019 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1019 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-rhel-newimage:264:1026 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1026 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1024 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1024 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:261:1028 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1028 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1027 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:266:1029 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:265:1025 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.77 (kernels 0.15, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1029 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1027 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.76 (kernels 0.20, bootstrap 0.22, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:260:1019 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:262:1030 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:260:1019 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.81 (kernels 0.13, bootstrap 0.35, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:262:1030 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.75 (kernels 0.24, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:264:1026 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:263:1024 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:261:1028 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init COMPLETE
tyler-rhel-newimage:264:1026 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.76 (kernels 0.23, bootstrap 0.19, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:263:1024 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.78 (kernels 0.16, bootstrap 0.29, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:261:1028 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03)
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1052 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1054 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:261:1053 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:260:1048 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1049 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1050 [7] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1051 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1047 [6] NCCL INFO Connected all rings
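
The NCCL channel/ring chatter above and below is governed by the NCCL_DEBUG environment variable; the sheer volume suggests it is set to INFO in this environment (an assumption, the launch command does not show it). A minimal sketch of quieting it on a future run:

    import os

    # must be set before the first torch.distributed / NCCL initialization;
    # "WARN" keeps real problems visible while dropping the INFO topology dump
    os.environ["NCCL_DEBUG"] = "WARN"
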
Generating train split: 46 examples [00:00, 10554.02 examples/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11446.25it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11298.78it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11851.23it/s]
Effective batch size is too low for multipack sampling, max sample length=107 and min packing length=61. Switching to naive distributed sampling.
{
"num_gpus": 8,
"avg_sample_len": 61.52173913043478,
"effective_batch_size": 8,
"max_batch_len_per_gpu": 1,
"packing_max_batch_len": null,
"grad_accum": 1,
"num_batches": 6,
"avg_samples_per_batch": 7.666666666666667,
"samples_per_gpu": 1,
"timestamp": "2024-07-27T04:38:53.790452"
}
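
The fallback message and the JSON summary fit together: with --max-batch-len 1, the per-GPU packing budget implied by the effective batch size (~61 tokens) is smaller than the longest sample (107 tokens), so multipack packing cannot be guaranteed to work. A rough reconstruction of the check, using only numbers from the log (hypothetical variable names; the real logic lives in the InstructLab training sampler):

    num_gpus = 8
    effective_batch_size = 8          # --effective-batch-size
    avg_sample_len = 61.52            # "avg_sample_len" in the JSON above
    max_sample_len = 107              # longest tokenized sample

    # packing budget per GPU implied by the requested effective batch size
    min_packing_len = int(avg_sample_len * effective_batch_size / num_gpus)  # 61

    if min_packing_len < max_sample_len:
        # a pack cannot be guaranteed to hold the longest sample, so fall
        # back to naive distributed sampling: one sample per GPU per step
        samples_per_gpu = 1
        grad_accum = 1
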
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11743.03it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11754.48it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11646.62it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11752.33it/s]
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11259.88it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
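
The Flash Attention 2.0 warning repeats once per rank; transformers emits it whenever a model is instantiated on CPU with flash attention enabled, and it is benign here because DeepSpeed moves each replica to its GPU afterwards. For reference, a minimal sketch of the pattern the warning asks for (illustrative load only; the trainer's own loading code may differ):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )
    model.to("cuda")  # move to GPU after CPU init, as the warning suggests
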
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.82s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.93s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.77s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.78s/it]
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
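
Two warnings recur in this stretch: the tokenizer carries six extra special tokens (32006 vs the model's 32000-entry embedding table), and the model's configured bos/eos ids differ from the tokenizer's. The trainer reports fixing the id mismatch itself; for reference, the standard Hugging Face remedies look like this (a sketch assuming `model` and `tokenizer` are already loaded, not the trainer's actual code):

    # grow the embedding table to cover the added special tokens
    model.resize_token_embeddings(len(tokenizer))       # 32000 -> 32006

    # align the model config with the tokenizer's special-token ids
    model.config.bos_token_id = tokenizer.bos_token_id  # 32000
    model.config.eos_token_id = tokenizer.eos_token_id  # 32001
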
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
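
The UserWarning above means the fused_adam JIT build targets every architecture of the visible cards. The nvcc command further down shows -gencode flags only for compute_80/sm_80, so pinning the list to that one arch should be harmless here; a sketch (set before the first deepspeed import, and "8.0" is an inference from those flags, A100-class hardware assumed):

    import os

    # restrict the JIT build to the sm_80 architecture seen in the nvcc line
    os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
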
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/python3.11/venv/lib64/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 30.80458378791809 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.133174657821655 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 30.7347309589386 seconds
Time to load fused_adam op: 30.73425841331482 seconds
Time to load fused_adam op: 26.12966561317444 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 26.22968363761902 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.43332004547119 seconds
[2024-07-27 04:39:49,755] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4+d254d75, git-hash=d254d75, git-branch=HEAD
[2024-07-27 04:39:49,756] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 30.431302785873413 seconds
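
The ~26-31 s load times above are dominated by the one-time ninja build of fused_adam; the compiled .so is cached under the extensions root printed earlier, so later runs in the same environment should load it in well under a second. Assuming a warm cache, a subsequent import is a cache hit:

    # reuses /var/instructlabbigdisk/instructlab/.cache/torch_extensions/...
    # rather than recompiling
    from deepspeed.ops.adam import FusedAdam
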
tyler-rhel-newimage:266:1135 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1144 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:261:1129 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1147 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1132 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1138 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1141 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:260:1128 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1147 [4] NCCL INFO bootstrapSplit: comm 0x556733b2a550 parent 0x5567334e8a90 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
tyler-rhel-newimage:263:1144 [3] NCCL INFO bootstrapSplit: comm 0x55eb051067f0 parent 0x55eb04aea420 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
tyler-rhel-newimage:266:1135 [6] NCCL INFO bootstrapSplit: comm 0x565389f45580 parent 0x5653898b69a0 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
tyler-rhel-newimage:267:1138 [7] NCCL INFO bootstrapSplit: comm 0x560b41bec990 parent 0x560b415bb0d0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
tyler-rhel-newimage:261:1129 [1] NCCL INFO bootstrapSplit: comm 0x55b80d3c14e0 parent 0x55b80cd4f580 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
tyler-rhel-newimage:262:1141 [2] NCCL INFO bootstrapSplit: comm 0x55bdb15c28a0 parent 0x55bdb0f52ee0 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
tyler-rhel-newimage:260:1128 [0] NCCL INFO bootstrapSplit: comm 0x5572495c2d50 parent 0x557248f8b2d0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:265:1132 [5] NCCL INFO bootstrapSplit: comm 0x5584470a63b0 parent 0x558446a0b990 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init START
tyler-rhel-newimage:263:1144 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:263:1144 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:264:1147 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1147 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:261:1129 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:261:1129 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:260:1128 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:260:1128 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:262:1141 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:266:1135 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1135 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:267:1138 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1138 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1132 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:266:1135 [6] NCCL INFO comm 0x565389f45580 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x5584470a63b0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:261:1129 [1] NCCL INFO comm 0x55b80d3c14e0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:260:1128 [0] NCCL INFO comm 0x5572495c2d50 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:264:1147 [4] NCCL INFO comm 0x556733b2a550 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:267:1138 [7] NCCL INFO comm 0x560b41bec990 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:263:1144 [3] NCCL INFO comm 0x55eb051067f0 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55bdb15c28a0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:266:1135 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1138 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1135 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1138 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1129 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:264:1147 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:261:1129 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:263:1144 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:264:1147 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:263:1144 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:262:1141 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1128 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1128 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1129 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1129 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1138 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1138 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1128 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1128 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:264:1147 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1147 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1141 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1132 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1132 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1144 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1144 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:266:1135 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1135 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1128 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:263:1144 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:260:1128 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.56 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:264:1147 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:267:1138 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.34 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:261:1129 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.55 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE
tyler-rhel-newimage:262:1141 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.31 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:265:1132 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.52 (kernels 0.00, bootstrap 0.22, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:266:1135 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.42 (kernels 0.00, bootstrap 0.12, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02)
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1170 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:260:1168 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1164 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1171 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:263:1167 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1166 [6] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1169 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1165 [7] NCCL INFO Connected all rings
[2024-07-27 04:39:54,216] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-27 04:39:54,243] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-07-27 04:39:54,243] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-07-27 04:39:54,244] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
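
The optimizer prints above pin down most of the DeepSpeed runtime config in effect for this run. Reconstructed as a config dict (illustrative only; the trainer assembles its own config internally):

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,      # "grad_accum" from the JSON above
        "gradient_clipping": 1.0,              # per the engine config dump below
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "reduce_bucket_size": 5e8,         # "Reduce bucket size 500,000,000"
            "allgather_bucket_size": 5e8,      # "Allgather bucket size 500,000,000"
            "offload_optimizer": {"device": "none"},  # "CPU Offload: False"
        },
    }
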
[2024-07-27 04:40:05,980] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
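
This warning is expected on a fresh run: nothing has been checkpointed yet, so no "latest" file exists under ds_native. Resuming a later run explicitly would look roughly like this (`engine` being the DeepSpeed engine; the tag name is hypothetical):

    load_path, client_state = engine.load_checkpoint(
        "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native",
        tag="global_step100",  # omit to follow the "latest" file once it exists
    )
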
[2024-07-27 04:40:07,679] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-07-27 04:40:07,680] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-07-27 04:40:07,680] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4%
[2024-07-27 04:40:07,885] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-07-27 04:40:07,886] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 04:40:07,886] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4%
[2024-07-27 04:40:07,886] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-07-27 04:40:08,052] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,080] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-07-27 04:40:08,081] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,081] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 04:40:08,081] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 67.95 GB, percent = 5.4%
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fa89082c350>
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 04:40:08,084] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_params ................... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_enabled ............. True
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa871e917d0>
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dump_state ................... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] loss_scale ................... 1.0
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_params ................... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] steps_per_print .............. 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_batch_size ............. 8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] world_size ................... 8
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-07-27 04:40:08,086] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 8,
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
}
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
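For readers following along: a config dict like the JSON echoed above is what gets handed to deepspeed.initialize, which returns the engine that owns the ZeRO stage-2 sharding, bf16, and gradient clipping shown in the dump. A minimal sketch, not the InstructLab source: the toy model, the AdamW stand-in (the log only shows the client optimizer's betas (0.9, 0.95) and starting lr), and the launcher invocation are all assumptions.

# Sketch of wiring the config above into DeepSpeed; run under a
# distributed launcher (e.g. `deepspeed sketch.py`), not as a bare script.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "none"},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
}

model = torch.nn.Linear(16, 16)  # toy stand-in for the granite-7b model
optimizer = torch.optim.AdamW(model.parameters(), lr=8.0e-07, betas=(0.9, 0.95))

# engine.backward(loss) / engine.step() then apply ZeRO-2 gradient
# sharding and the 1.0 gradient clipping from the config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)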
[2024-07-27 04:40:08,087] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Number of samples per save: 40
[2024-07-27 04:40:08,101] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,148] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,457] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 04:40:08,652] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
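These five warnings (one per rank that logged) are expected on a first run: DeepSpeed resolves the checkpoint tag by reading a small "latest" file that save_checkpoint writes into the checkpoint directory, and no such file exists yet. The "Number of samples per save: 40" line shows the requested --save-samples 46 apparently rounded down to a multiple of the effective batch size of 8. A sketch of the lookup the warnings refer to (illustrative, not DeepSpeed's own code):

# How the tag is resolved when load_checkpoint is called without one.
import os

ckpt_dir = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native"
latest_file = os.path.join(ckpt_dir, "latest")

if os.path.isfile(latest_file):
    with open(latest_file) as f:
        tag = f.read().strip()  # save_checkpoint writes the tag here
    print("resuming from", os.path.join(ckpt_dir, tag))
else:
    # First run: nothing saved yet, hence the warning on every rank.
    print("no 'latest' file; training starts from the base model")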
Epoch 0: 0%| | 0/6 [00:00<?, ?it/s]
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 7 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 1 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 0 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
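Each "total tokens" line above is one single-sample micro-batch assigned to one rank; the dump lists 48 samples across the 8 ranks. With train_micro_batch_size_per_gpu=1 and gradient_accumulation_steps=1 that is 8 samples per optimizer step, which is where the 6 steps per epoch in the progress bar come from. As plain arithmetic (a consistency check, not InstructLab code):

# Step-count check from the sampler dump above.
num_micro_batches = 48   # "total tokens" records printed for epoch 0
world_size = 8
micro_batch_per_gpu = 1  # train_micro_batch_size_per_gpu in the config
grad_accum = 1

samples_per_step = world_size * micro_batch_per_gpu * grad_accum
print(samples_per_step)                       # 8, the logged batch_size
print(num_micro_batches // samples_per_step)  # 6 steps per epoch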
Per-token loss scaled by world size: 0.07893893122673035
Per-token loss scaled by world size: 0.03717859089374542
Per-token loss scaled by world size: 0.04773510619997978
Per-token loss scaled by world size: 0.0572475828230381
Per-token loss scaled by world size: 0.07484958320856094
Per-token loss scaled by world size: 0.053953204303979874
Epoch: 0, Step: 1, Rank: 0, loss = 2.230024814605713
Epoch: 0, Step: 1, Rank: 1, loss = 1.0502952337265015
Per-token loss scaled by world size: 0.054488085210323334
Epoch: 0, Step: 1, Rank: 3, loss = 1.3485167026519775
Epoch: 0, Step: 1, Rank: 7, loss = 1.6172442436218262
Epoch: 0, Step: 1, Rank: 6, loss = 2.1145007610321045
Epoch: 0, Step: 1, Rank: 2, loss = 1.5241780281066895
Epoch: 0, Step: 1, Rank: 5, loss = 1.5392884016036987
Per-token loss scaled by world size: 0.034635160118341446
Epoch: 0, Step: 1, Rank: 4, loss = 0.9784433245658875
[2024-07-27 04:40:09,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
Epoch 0: 17%|█▋ | 1/6 [00:01<00:06, 1.27s/it]
{
"epoch": 0,
"step": 1,
"rank": 0,
"loss": 2.230024814605713,
"overall_throughput": 9.740397396362653,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 226,
"batch_size": 8,
"total_loss": 1.5503114461898804,
"gradnorm": 27.33059310913086,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:09.962829"
}
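The step-1 numbers above are self-consistent: each rank's reported loss is its "Per-token loss scaled by world size" value times the step's num_loss_counted_tokens (226), divided by the world size (8), and total_loss is the mean of the eight rank losses. A quick check against the logged values (a consistency check, not InstructLab code):

# Reproducing the step-1 losses from the log.
world_size = 8
num_loss_counted_tokens = 226                 # from the metrics block above
per_token_scaled_rank0 = 0.07893893122673035  # rank 0's logged value

print(per_token_scaled_rank0 * num_loss_counted_tokens / world_size)
# ~2.2300, matching "Rank: 0, loss = 2.230024814605713"

rank_losses = [2.230024814605713, 1.0502952337265015, 1.5241780281066895,
               1.3485167026519775, 0.9784433245658875, 1.5392884016036987,
               2.1145007610321045, 1.6172442436218262]
print(sum(rank_losses) / world_size)
# ~1.5503, matching "total_loss": 1.5503114461898804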
Per-token loss scaled by world size: 0.0815284326672554
Per-token loss scaled by world size: 0.049197420477867126
Per-token loss scaled by world size: 0.05212152749300003
Per-token loss scaled by world size: 0.04615860432386398
Per-token loss scaled by world size: 0.031173814088106155
Per-token loss scaled by world size: 0.031103696674108505
Per-token loss scaled by world size: 0.059471674263477325
Epoch: 0, Step: 2, Rank: 5, loss = 1.5189703702926636
Epoch: 0, Step: 2, Rank: 0, loss = 2.517190456390381
Epoch: 0, Step: 2, Rank: 2, loss = 1.6092522144317627
Epoch: 0, Step: 2, Rank: 4, loss = 1.4251469373703003
Epoch: 0, Step: 2, Rank: 3, loss = 0.962491512298584
Epoch: 0, Step: 2, Rank: 1, loss = 0.960326611995697
Epoch: 0, Step: 2, Rank: 7, loss = 1.8361879587173462
Per-token loss scaled by world size: 0.0653553158044815
Epoch: 0, Step: 2, Rank: 6, loss = 2.017845392227173
[2024-07-27 04:40:10,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
Epoch 0: 33%|███▎ | 2/6 [00:01<00:03, 1.28it/s]
{
"epoch": 0,
"step": 2,
"rank": 0,
"loss": 2.517190456390381,
"overall_throughput": 25.079475366313783,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 247,
"batch_size": 8,
"total_loss": 1.6059263944625854,
"gradnorm": 24.998506546020508,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:10.428087"
}
Per-token loss scaled by world size: 0.049534447491168976
Per-token loss scaled by world size: 0.034937091171741486
Per-token loss scaled by world size: 0.03569952771067619
Per-token loss scaled by world size: 0.027379106730222702
Per-token loss scaled by world size: 0.07303578406572342
Per-token loss scaled by world size: 0.060139141976833344
Per-token loss scaled by world size: 0.06049950420856476
Epoch: 0, Step: 3, Rank: 4, loss = 1.1223540306091309
Epoch: 0, Step: 3, Rank: 3, loss = 1.9319698810577393
Epoch: 0, Step: 3, Rank: 1, loss = 2.3462746143341064
Epoch: 0, Step: 3, Rank: 7, loss = 0.8795537948608398
Epoch: 0, Step: 3, Rank: 0, loss = 1.5912941694259644
Epoch: 0, Step: 3, Rank: 2, loss = 1.1468473672866821
Epoch: 0, Step: 3, Rank: 5, loss = 1.9435465335845947
Per-token loss scaled by world size: 0.07311909645795822
Epoch: 0, Step: 3, Rank: 6, loss = 2.3489508628845215
[2024-07-27 04:40:10,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:10,835] [INFO] [timer.py:258:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=19.69948717647646, CurrSamplesPerSec=19.69948717647646, MemAllocated=21.99GB, MaxMemAllocated=28.28GB
Epoch 0: 50%|█████ | 3/6 [00:02<00:01, 1.56it/s]
{
"epoch": 0,
"step": 3,
"rank": 0,
"loss": 1.5912941694259644,
"overall_throughput": 19.626204980478747,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 257,
"batch_size": 8,
"total_loss": 1.6638489961624146,
"gradnorm": 23.093570709228516,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:10.899000"
}
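Note the lr in the step logs climbs by exactly 8.0e-07 per step (8.0e-07, 1.6e-06, 2.4e-06, ...), i.e. a linear warmup. A sketch of the implied schedule, assuming a plain linear warmup; this is an inference from the logged values, not a statement about the trainer's actual scheduler:

# Implied LR warmup; the constant 8.0e-07 per step matches every
# step logged in this run.
def warmup_lr(step: int, lr_per_step: float = 8.0e-07) -> float:
    return lr_per_step * step

for step in (1, 2, 3, 18):
    print(step, warmup_lr(step))  # 8e-07, 1.6e-06, 2.4e-06, 1.44e-05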
Per-token loss scaled by world size: 0.022453732788562775
Per-token loss scaled by world size: 0.02703114040195942
Per-token loss scaled by world size: 0.0565037876367569
Per-token loss scaled by world size: 0.013839970342814922
Per-token loss scaled by world size: 0.03342469781637192
Per-token loss scaled by world size: 0.019963225349783897
Per-token loss scaled by world size: 0.024931060150265694
Epoch: 0, Step: 4, Rank: 0, loss = 1.2197802066802979
Epoch: 0, Step: 4, Rank: 6, loss = 2.5497334003448486
Epoch: 0, Step: 4, Rank: 5, loss = 0.9008405208587646
Epoch: 0, Step: 4, Rank: 1, loss = 1.5082894563674927
Epoch: 0, Step: 4, Rank: 2, loss = 0.6245286464691162
Epoch: 0, Step: 4, Rank: 3, loss = 1.013224720954895
Epoch: 0, Step: 4, Rank: 7, loss = 1.125014066696167
Per-token loss scaled by world size: 0.04781263321638107
Epoch: 0, Step: 4, Rank: 4, loss = 2.1575450897216797
[2024-07-27 04:40:11,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:11,315] [INFO] [timer.py:258:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=19.53033393623941, CurrSamplesPerSec=19.36406089495735, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 0: 67%|██████▋ | 4/6 [00:02<00:01, 1.73it/s]
{
"epoch": 0,
"step": 4,
"rank": 0,
"loss": 1.2197802066802979,
"overall_throughput": 19.323057857283946,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 21.994319915771484,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 361,
"batch_size": 8,
"total_loss": 1.3873695135116577,
"gradnorm": 13.594513893127441,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:11.384294"
}
Per-token loss scaled by world size: 0.039435986429452896
Per-token loss scaled by world size: 0.04640277475118637
Per-token loss scaled by world size: 0.04766402393579483
Per-token loss scaled by world size: 0.03315580636262894
Per-token loss scaled by world size: 0.05420012027025223
Per-token loss scaled by world size: 0.039042871445417404
Per-token loss scaled by world size: 0.037364713847637177
Epoch: 0, Step: 5, Rank: 7, loss = 1.5848288536071777
Epoch: 0, Step: 5, Rank: 0, loss = 1.3112465143203735
Epoch: 0, Step: 5, Rank: 6, loss = 1.1024305820465088
Epoch: 0, Step: 5, Rank: 1, loss = 1.2981754541397095
Epoch: 0, Step: 5, Rank: 4, loss = 1.242376685142517
Epoch: 0, Step: 5, Rank: 3, loss = 1.5428922176361084
Epoch: 0, Step: 5, Rank: 2, loss = 1.8021539449691772
Per-token loss scaled by world size: 0.041017867624759674
Epoch: 0, Step: 5, Rank: 5, loss = 1.3638441562652588
[2024-07-27 04:40:11,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:11,795] [INFO] [timer.py:258:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=19.56114133365461, CurrSamplesPerSec=19.62304862715284, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 40
{
"epoch": 0,
"step": 5,
"rank": 0,
"loss": 1.3112465143203735,
"overall_throughput": 19.583411102181554,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 266,
"batch_size": 8,
"total_loss": 1.4059934616088867,
"gradnorm": 16.828536987304688,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:11.799662"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_40
[04:40:29] INFO saving took 17.8935489654541 seconds utils.py:611
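Each checkpoint under hf_format/ is a regular Hugging Face model directory named for the number of samples seen, so it can be loaded back directly. A sketch using standard transformers APIs (a smoke test, not part of the training run or an InstructLab command):

# Load the checkpoint saved above and generate a few tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_40"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0]))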
Epoch 0: 83%|████████▎ | 5/6 [00:21<00:06, 7.00s/it]
Per-token loss scaled by world size: 0.023287693038582802
Per-token loss scaled by world size: 0.022923028096556664
Per-token loss scaled by world size: 0.035226039588451385
Per-token loss scaled by world size: 0.02928866446018219
Per-token loss scaled by world size: 0.02294014021754265
Per-token loss scaled by world size: 0.021732352674007416
Per-token loss scaled by world size: 0.05167709290981293
Epoch: 0, Step: 6, Rank: 1, loss = 0.8420491218566895
Epoch: 0, Step: 6, Rank: 2, loss = 1.0127485990524292
Epoch: 0, Step: 6, Rank: 5, loss = 0.6595290303230286
Epoch: 0, Step: 6, Rank: 4, loss = 1.485716462135315
Epoch: 0, Step: 6, Rank: 0, loss = 0.6590370535850525
Epoch: 0, Step: 6, Rank: 7, loss = 0.669521152973175
Epoch: 0, Step: 6, Rank: 3, loss = 0.6248051524162292
Per-token loss scaled by world size: 0.05193476006388664
Epoch: 0, Step: 6, Rank: 6, loss = 1.4931243658065796
[2024-07-27 04:40:30,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:30,197] [INFO] [timer.py:258:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=19.219145353583418, CurrSamplesPerSec=18.261332776041684, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 4.79s/it]
{
"epoch": 0,
"step": 6,
"rank": 0,
"loss": 0.6590370535850525,
"overall_throughput": 18.218836059734688,
"lr": 4.800000000000001e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 230,
"batch_size": 8,
"total_loss": 0.9308162927627563,
"gradnorm": 13.859025001525879,
"weight_norm": 393.4548645019531,
"timestamp": "2024-07-27T04:40:30.261588"
}
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 3.61s/it]
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 0 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 0 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 2 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 7 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 3 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
Per-token loss scaled by world size: 0.04258072003722191
Per-token loss scaled by world size: 0.03144193813204765
Per-token loss scaled by world size: 0.03665884956717491
Per-token loss scaled by world size: 0.0183926559984684
Per-token loss scaled by world size: 0.02085307240486145
Per-token loss scaled by world size: 0.011860878206789494
Epoch: 1, Step: 7, Rank: 6, loss = 1.2830597162246704
Per-token loss scaled by world size: 0.017801115289330482
Epoch: 1, Step: 7, Rank: 7, loss = 0.6437429785728455
Epoch: 1, Step: 7, Rank: 2, loss = 1.1004678010940552
Epoch: 1, Step: 7, Rank: 3, loss = 1.4903252124786377
Epoch: 1, Step: 7, Rank: 0, loss = 0.7298575639724731
Epoch: 1, Step: 7, Rank: 1, loss = 0.41513073444366455
Epoch: 1, Step: 7, Rank: 5, loss = 0.6230390071868896
Per-token loss scaled by world size: 0.014800201170146465
Epoch: 1, Step: 7, Rank: 4, loss = 0.5180070400238037
[2024-07-27 04:40:31,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[5.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:31,117] [INFO] [timer.py:258:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=18.856037574472914, CurrSamplesPerSec=17.531170274406254, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 7,
"rank": 0,
"loss": 0.7298575639724731,
"overall_throughput": 17.461878806108718,
"lr": 5.600000000000001e-06,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 280,
"batch_size": 8,
"total_loss": 0.850453794002533,
"gradnorm": 11.859759330749512,
"weight_norm": 393.45489501953125,
"timestamp": "2024-07-27T04:40:31.182263"
}
Per-token loss scaled by world size: 0.01552379410713911
Per-token loss scaled by world size: 0.006790023762732744
Per-token loss scaled by world size: 0.025048548355698586
Per-token loss scaled by world size: 0.013136875815689564
Per-token loss scaled by world size: 0.011116808280348778
Per-token loss scaled by world size: 0.013115502893924713
Per-token loss scaled by world size: 0.004034785088151693
Epoch: 1, Step: 8, Rank: 6, loss = 0.2682059407234192
Epoch: 1, Step: 8, Rank: 3, loss = 0.9894176721572876
Epoch: 1, Step: 8, Rank: 7, loss = 0.43911394476890564
Epoch: 1, Step: 8, Rank: 0, loss = 0.5180623531341553
Epoch: 1, Step: 8, Rank: 4, loss = 0.5189065933227539
Epoch: 1, Step: 8, Rank: 2, loss = 0.6131898760795593
Epoch: 1, Step: 8, Rank: 5, loss = 0.15937401354312897
Per-token loss scaled by world size: 0.0399574413895607
Epoch: 1, Step: 8, Rank: 1, loss = 1.5783189535140991
[2024-07-27 04:40:31,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[6.4000000000000006e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:31,596] [INFO] [timer.py:258:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=18.947101184672665, CurrSamplesPerSec=19.41593921964599, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,██▎ | 2/6 [00:01<00:02, 1.62it/s]
"step": 8,
"rank": 0,
"loss": 0.5180623531341553,
"overall_throughput": 19.33400249148091,
"lr": 6.4000000000000006e-06,
"cuda_mem_allocated": 21.992404460906982,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 316,
"batch_size": 8,
"total_loss": 0.63557368516922,
"gradnorm": 13.164900779724121,
"weight_norm": 393.4549255371094,
"timestamp": "2024-07-27T04:40:31.599119"
}
Per-token loss scaled by world size: 0.005403513088822365
Per-token loss scaled by world size: 0.009131929837167263
Per-token loss scaled by world size: 0.0029204594902694225
Per-token loss scaled by world size: 0.014552305452525616
Per-token loss scaled by world size: 0.002935068914666772
Per-token loss scaled by world size: 0.005613779183477163
Per-token loss scaled by world size: 0.008262179791927338
Epoch: 1, Step: 9, Rank: 6, loss = 0.5002354979515076
Epoch: 1, Step: 9, Rank: 7, loss = 0.10039079189300537
Epoch: 1, Step: 9, Rank: 2, loss = 0.3139100968837738
Epoch: 1, Step: 9, Rank: 1, loss = 0.10089299082756042
Epoch: 1, Step: 9, Rank: 0, loss = 0.18574576079845428
Epoch: 1, Step: 9, Rank: 4, loss = 0.28401243686676025
Epoch: 1, Step: 9, Rank: 3, loss = 0.19297365844249725
Per-token loss scaled by world size: 0.03853427246212959
Epoch: 1, Step: 9, Rank: 5, loss = 1.3246155977249146
[2024-07-27 04:40:31,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[7.2000000000000005e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:32,073] [INFO] [timer.py:258:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=19.017175783584705, CurrSamplesPerSec=19.44875538610099, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,████ | 3/6 [00:01<00:01, 1.81it/s]
"step": 9,
"rank": 0,
"loss": 0.18574576079845428,
"overall_throughput": 19.4072473458094,
"lr": 7.2000000000000005e-06,
"cuda_mem_allocated": 21.990368366241455,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 275,
"batch_size": 8,
"total_loss": 0.3753471374511719,
"gradnorm": 6.857061386108398,
"weight_norm": 393.4549255371094,
"timestamp": "2024-07-27T04:40:32.076959"
}
Per-token loss scaled by world size: 0.009762527421116829
Per-token loss scaled by world size: 0.011909930035471916
Per-token loss scaled by world size: 0.011925755999982357
Per-token loss scaled by world size: 0.00691488292068243
Per-token loss scaled by world size: 0.014653326012194157
Per-token loss scaled by world size: 0.014235693961381912
Per-token loss scaled by world size: 0.006737567484378815
Epoch: 1, Step: 10, Rank: 1, loss = 0.32094308733940125
Epoch: 1, Step: 10, Rank: 4, loss = 0.39205923676490784
Epoch: 1, Step: 10, Rank: 6, loss = 0.22732678055763245
Epoch: 1, Step: 10, Rank: 2, loss = 0.48172810673713684
Epoch: 1, Step: 10, Rank: 7, loss = 0.2214975357055664
Epoch: 1, Step: 10, Rank: 3, loss = 0.39153894782066345
Epoch: 1, Step: 10, Rank: 5, loss = 0.4679984450340271
Per-token loss scaled by world size: 0.013581880368292332
Epoch: 1, Step: 10, Rank: 0, loss = 0.4465043246746063
[2024-07-27 04:40:32,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:32,552] [INFO] [timer.py:258:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=19.068379525424866, CurrSamplesPerSec=19.434674525231042, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 80
{
"epoch": 1,
"step": 10,
"rank": 0,
"loss": 0.4465043246746063,
"overall_throughput": 19.395367576038005,
"lr": 8.000000000000001e-06,
"cuda_mem_allocated": 21.99384117126465,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 263,
"batch_size": 8,
"total_loss": 0.3686995804309845,
"gradnorm": 7.663094520568848,
"weight_norm": 393.4549560546875,
"timestamp": "2024-07-27T04:40:32.556073"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_80
[04:40:50] INFO saving took 17.83508801460266 seconds utils.py:611
Per-token loss scaled by world size: 0.008527813479304314
Per-token loss scaled by world size: 0.021125635132193565
Per-token loss scaled by world size: 0.028058378025889397
Per-token loss scaled by world size: 0.010279231704771519
Per-token loss scaled by world size: 0.007951636798679829
Per-token loss scaled by world size: 0.010785719379782677
Per-token loss scaled by world size: 0.0015928453067317605
Epoch: 1, Step: 11, Rank: 0, loss = 0.6364097595214844
Epoch: 1, Step: 11, Rank: 3, loss = 0.8452586531639099
Epoch: 1, Step: 11, Rank: 7, loss = 0.309661865234375
Epoch: 1, Step: 11, Rank: 5, loss = 0.23954305052757263
Epoch: 1, Step: 11, Rank: 1, loss = 0.32491979002952576
Epoch: 1, Step: 11, Rank: 6, loss = 0.2569003701210022
Epoch: 1, Step: 11, Rank: 4, loss = 0.04798446595668793
Per-token loss scaled by world size: 0.011055735871195793
Epoch: 1, Step: 11, Rank: 2, loss = 0.3330540359020233
[2024-07-27 04:40:50,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[8.8e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:50,878] [INFO] [timer.py:258:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=19.06728482937713, CurrSamplesPerSec=19.05853178378495, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,███████▎ | 5/6 [00:20<00:05, 5.01s/it]
"step": 11,
"rank": 0,
"loss": 0.6364097595214844,
"overall_throughput": 19.020988915430987,
"lr": 8.8e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 241,
"batch_size": 8,
"total_loss": 0.3742165267467499,
"gradnorm": 10.62493896484375,
"weight_norm": 393.45501708984375,
"timestamp": "2024-07-27T04:40:50.881434"
}
Per-token loss scaled by world size: 0.01114068366587162
Per-token loss scaled by world size: 0.017769839614629745
Per-token loss scaled by world size: 0.01096078846603632
Per-token loss scaled by world size: 0.01160360500216484
Per-token loss scaled by world size: 0.01375030167400837
Per-token loss scaled by world size: 0.012371961027383804
Per-token loss scaled by world size: 0.026108525693416595
Epoch: 1, Step: 12, Rank: 4, loss = 0.4709007441997528
Epoch: 1, Step: 12, Rank: 2, loss = 0.29046088457107544
Epoch: 1, Step: 12, Rank: 7, loss = 0.3074955344200134
Epoch: 1, Step: 12, Rank: 0, loss = 0.29522812366485596
Epoch: 1, Step: 12, Rank: 3, loss = 0.3643829822540283
Epoch: 1, Step: 12, Rank: 5, loss = 0.32785695791244507
Epoch: 1, Step: 12, Rank: 1, loss = 0.6918759346008301
Per-token loss scaled by world size: 0.008026043884456158
Epoch: 1, Step: 12, Rank: 6, loss = 0.21269017457962036
[2024-07-27 04:40:51,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[9.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:40:51,345] [INFO] [timer.py:258:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=19.154108730048822, CurrSamplesPerSec=19.972626532644533, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 1,█████████| 6/6 [00:21<00:00, 3.47s/it]
"step": 12,
"rank": 0,
"loss": 0.29522812366485596,
"overall_throughput": 19.934561488662563,
"lr": 9.600000000000001e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 212,
"batch_size": 8,
"total_loss": 0.3701114356517792,
"gradnorm": 11.501402854919434,
"weight_norm": 393.4551086425781,
"timestamp": "2024-07-27T04:40:51.407469"
}
Epoch 1: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 3 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 6 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 7 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 4 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 4 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
Per-token loss scaled by world size: 0.005579258780926466
Per-token loss scaled by world size: 0.003397347405552864
Per-token loss scaled by world size: 0.04219382628798485
Per-token loss scaled by world size: 0.006288798991590738
Per-token loss scaled by world size: 0.003948549274355173
Per-token loss scaled by world size: 0.005502534564584494
Per-token loss scaled by world size: 0.018879147246479988
Epoch: 2, Step: 13, Rank: 4, loss = 0.20595817267894745
Epoch: 2, Step: 13, Rank: 2, loss = 0.11126312613487244
Epoch: 2, Step: 13, Rank: 6, loss = 0.18272072076797485
Epoch: 2, Step: 13, Rank: 1, loss = 1.381847858428955
Epoch: 2, Step: 13, Rank: 3, loss = 0.12931498885154724
Epoch: 2, Step: 13, Rank: 5, loss = 0.1802080124616623
Epoch: 2, Step: 13, Rank: 0, loss = 0.6182920932769775
Per-token loss scaled by world size: 0.005611070431768894
Epoch: 2, Step: 13, Rank: 7, loss = 0.1837625503540039
[2024-07-27 04:40:52,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[1.04e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:52,284] [INFO] [timer.py:258:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=18.955822461559762, CurrSamplesPerSec=17.17757371046992, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 17%|█▋ | 1/6 [00:00<00:04, 1.22it/s]
{
"epoch": 2,
"step": 13,
"rank": 0,
"loss": 0.6182920932769775,
"overall_throughput": 17.11025862908584,
"lr": 1.04e-05,
"cuda_mem_allocated": 21.990307807922363,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 262,
"batch_size": 8,
"total_loss": 0.3741708993911743,
"gradnorm": 5.893370628356934,
"weight_norm": 393.4551696777344,
"timestamp": "2024-07-27T04:40:52.348171"
}
Per-token loss scaled by world size: 0.010395266115665436
Per-token loss scaled by world size: 0.009872684255242348
Per-token loss scaled by world size: 0.003835555398836732
Per-token loss scaled by world size: 0.008823958225548267
Per-token loss scaled by world size: 0.0053123775869607925
Per-token loss scaled by world size: 0.0028427704237401485
Per-token loss scaled by world size: 0.010287421755492687
Epoch: 2, Step: 14, Rank: 0, loss = 0.323552668094635
Epoch: 2, Step: 14, Rank: 4, loss = 0.3072873055934906
Epoch: 2, Step: 14, Rank: 2, loss = 0.27464568614959717
Epoch: 2, Step: 14, Rank: 5, loss = 0.11938165873289108
Epoch: 2, Step: 14, Rank: 1, loss = 0.08848123252391815
Epoch: 2, Step: 14, Rank: 3, loss = 0.16534775495529175
Epoch: 2, Step: 14, Rank: 7, loss = 0.3201960027217865
Per-token loss scaled by world size: 0.016205936670303345
Epoch: 2, Step: 14, Rank: 6, loss = 0.5044097900390625
[2024-07-27 04:40:52,683] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[1.1200000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:52,760] [INFO] [timer.py:258:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=19.0045240462792, CurrSamplesPerSec=19.557238311503617, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s]
{
"epoch": 2,
"step": 14,
"rank": 0,
"loss": 0.323552668094635,
"overall_throughput": 19.51458544398396,
"lr": 1.1200000000000001e-05,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 249,
"batch_size": 8,
"total_loss": 0.262912780046463,
"gradnorm": 8.123388290405273,
"weight_norm": 393.45526123046875,
"timestamp": "2024-07-27T04:40:52.825508"
}
Per-token loss scaled by world size: 0.00476957717910409
Per-token loss scaled by world size: 0.005131879821419716
Per-token loss scaled by world size: 0.002688762964680791
Per-token loss scaled by world size: 0.00911777000874281
Per-token loss scaled by world size: 0.006320170592516661
Per-token loss scaled by world size: 0.007136975880712271
Per-token loss scaled by world size: 0.007083716802299023
Epoch: 2, Step: 15, Rank: 0, loss = 0.16742758452892303
Epoch: 2, Step: 15, Rank: 1, loss = 0.1556074619293213
Epoch: 2, Step: 15, Rank: 2, loss = 0.2974672317504883
Epoch: 2, Step: 15, Rank: 6, loss = 0.20619556307792664
Epoch: 2, Step: 15, Rank: 7, loss = 0.23110626637935638
Epoch: 2, Step: 15, Rank: 5, loss = 0.23284383118152618
Epoch: 2, Step: 15, Rank: 4, loss = 0.08772089332342148
Per-token loss scaled by world size: 0.007475144695490599
Epoch: 2, Step: 15, Rank: 3, loss = 0.2438765913248062
[2024-07-27 04:40:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:40:53,243] [INFO] [timer.py:258:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=19.026022320074304, CurrSamplesPerSec=19.287847616814023, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 120
{
"epoch": 2,
"step": 15,
"rank": 0,
"loss": 0.16742758452892303,
"overall_throughput": 19.246835159868162,
"lr": 1.2e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 261,
"batch_size": 8,
"total_loss": 0.20278067886829376,
"gradnorm": 4.634181499481201,
"weight_norm": 393.455322265625,
"timestamp": "2024-07-27T04:40:53.247897"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_120
[04:41:11] INFO saving took 17.93269968032837 seconds utils.py:611
Epoch 2: 50%|█████ | 3/6 [00:19<00:26, 8.75s/it]
Per-token loss scaled by world size: 0.022889601066708565
Per-token loss scaled by world size: 0.006112583447247744
Per-token loss scaled by world size: 0.008243937976658344
Per-token loss scaled by world size: 0.005759389605373144
Per-token loss scaled by world size: 0.0023497489746659994
Per-token loss scaled by world size: 0.003176590893417597
Per-token loss scaled by world size: 0.0032468584831804037
Epoch: 2, Step: 16, Rank: 0, loss = 0.18261343240737915
Epoch: 2, Step: 16, Rank: 5, loss = 0.6838268041610718
Epoch: 2, Step: 16, Rank: 4, loss = 0.24628764390945435
Epoch: 2, Step: 16, Rank: 3, loss = 0.1720617711544037
Epoch: 2, Step: 16, Rank: 6, loss = 0.07019875198602676
Epoch: 2, Step: 16, Rank: 1, loss = 0.09699989855289459
Epoch: 2, Step: 16, Rank: 2, loss = 0.09490065276622772
Per-token loss scaled by world size: 0.002544187940657139
Epoch: 2, Step: 16, Rank: 7, loss = 0.07600761204957962
[2024-07-27 04:41:11,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[1.2800000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:11,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=19.0047652088737, CurrSamplesPerSec=18.732683349486162, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 67%|██████▋ | 4/6 [00:20<00:10, 5.49s/it]
{
"epoch": 2,
"step": 16,
"rank": 0,
"loss": 0.18261343240737915,
"overall_throughput": 18.691548302496816,
"lr": 1.2800000000000001e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 239,
"batch_size": 8,
"total_loss": 0.20286208391189575,
"gradnorm": 3.438565492630005,
"weight_norm": 393.4554138183594,
"timestamp": "2024-07-27T04:41:11.737319"
}
Per-token loss scaled by world size: 0.006436600815504789
Per-token loss scaled by world size: 0.007636873982846737
Per-token loss scaled by world size: 0.011849365197122097
Per-token loss scaled by world size: 0.0030969511717557907
Per-token loss scaled by world size: 0.0029933564364910126
Per-token loss scaled by world size: 0.005698263645172119
Per-token loss scaled by world size: 0.0030969511717557907
Epoch: 2, Step: 17, Rank: 2, loss = 0.24247075617313385
Epoch: 2, Step: 17, Rank: 6, loss = 0.09832820296287537
Epoch: 2, Step: 17, Rank: 3, loss = 0.09503906965255737
Epoch: 2, Step: 17, Rank: 0, loss = 0.20436207950115204
Epoch: 2, Step: 17, Rank: 7, loss = 0.18091987073421478
Epoch: 2, Step: 17, Rank: 1, loss = 0.3762173354625702
Epoch: 2, Step: 17, Rank: 5, loss = 0.09832820296287537
Per-token loss scaled by world size: 0.0246181171387434
Epoch: 2, Step: 17, Rank: 4, loss = 0.7816252112388611
[2024-07-27 04:41:12,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.3600000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:12,143] [INFO] [timer.py:258:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=19.058863757063296, CurrSamplesPerSec=19.849924810962573, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 83%|████████▎ | 5/6 [00:20<00:03, 3.68s/it]
{
"epoch": 2,
"step": 17,
"rank": 0,
"loss": 0.20436207950115204,
"overall_throughput": 19.81138977879121,
"lr": 1.3600000000000002e-05,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 254,
"batch_size": 8,
"total_loss": 0.25966137647628784,
"gradnorm": 4.596966743469238,
"weight_norm": 393.4555358886719,
"timestamp": "2024-07-27T04:41:12.206412"
}
Per-token loss scaled by world size: 0.003973593469709158
Per-token loss scaled by world size: 0.003631346160545945
Per-token loss scaled by world size: 0.0038163107819855213
Per-token loss scaled by world size: 0.00382098532281816
Per-token loss scaled by world size: 0.00380203640088439
Per-token loss scaled by world size: 0.001392068457789719
Per-token loss scaled by world size: 0.004007395356893539
Epoch: 2, Step: 18, Rank: 0, loss = 0.18179190158843994
Epoch: 2, Step: 18, Rank: 3, loss = 0.17459622025489807
Epoch: 2, Step: 18, Rank: 1, loss = 0.1661340892314911
Epoch: 2, Step: 18, Rank: 7, loss = 0.18333832919597626
Epoch: 2, Step: 18, Rank: 6, loss = 0.1739431619644165
Epoch: 2, Step: 18, Rank: 4, loss = 0.06368713080883026
Epoch: 2, Step: 18, Rank: 2, loss = 0.17481008172035217
Per-token loss scaled by world size: 0.009031021036207676
Epoch: 2, Step: 18, Rank: 5, loss = 0.4131692051887512
[2024-07-27 04:41:12,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.4400000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:12,623] [INFO] [timer.py:258:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=19.075929261101155, CurrSamplesPerSec=19.33562909999493, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 2.59s/it]
{
"epoch": 2,
"step": 18,
"rank": 0,
"loss": 0.18179190158843994,
"overall_throughput": 19.297919953254016,
"lr": 1.4400000000000001e-05,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 366,
"batch_size": 8,
"total_loss": 0.19143378734588623,
"gradnorm": 3.664649486541748,
"weight_norm": 393.4555969238281,
"timestamp": "2024-07-27T04:41:12.687682"
}
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 3.55s/it]
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 1 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 1 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 2 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 6 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 4 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 7 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
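Each "total tokens" line above is a per-rank micro-batch report: every micro-batch here holds a single sample (num samples: 1), so padding is always 0 and num_loss_counted_tokens is the subset of tokens that actually contribute to the loss (unmasked labels). A hypothetical helper with the same output shape, assuming the usual -100 label mask; this is a sketch, not the trainer's code:

# Hypothetical per-rank batch reporter mirroring the log lines above.
import torch

def log_batch_stats(input_ids: torch.Tensor, labels: torch.Tensor,
                    rank: int, pad_id: int) -> None:
    lens = (input_ids != pad_id).sum(dim=1)    # real (non-pad) tokens per sample
    total = int(input_ids.numel())             # includes padding tokens
    padding = total - int(lens.sum())
    counted = int((labels != -100).sum())      # tokens that enter the loss
    print(f"total tokens: {total} num samples: {input_ids.shape[0]} "
          f"num padding tokens: {padding} - rank: {rank} "
          f"max len: {int(lens.max())} min len: {int(lens.min())} "
          f"avg len: {float(lens.float().mean())} "
          f"num_loss_counted_tokens: {counted}")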
Per-token loss scaled by world size: 0.008028822019696236
Per-token loss scaled by world size: 0.0042352983728051186
Per-token loss scaled by world size: 0.006302641239017248
Per-token loss scaled by world size: 0.00753552932292223
Per-token loss scaled by world size: 0.006594506558030844
Per-token loss scaled by world size: 0.010208014398813248
Per-token loss scaled by world size: 0.0070448205806314945
Epoch: 3, Step: 19, Rank: 7, loss = 0.12917660176753998
Epoch: 3, Step: 19, Rank: 6, loss = 0.2448790818452835
Epoch: 3, Step: 19, Rank: 2, loss = 0.19223055243492126
Epoch: 3, Step: 19, Rank: 4, loss = 0.20113244652748108
Epoch: 3, Step: 19, Rank: 5, loss = 0.22983364760875702
Epoch: 3, Step: 19, Rank: 0, loss = 0.3113444447517395
Epoch: 3, Step: 19, Rank: 1, loss = 0.2148670256137848
Per-token loss scaled by world size: 0.007540326565504074
Epoch: 3, Step: 19, Rank: 3, loss = 0.2299799621105194
[2024-07-27 04:41:13,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.5200000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:13,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=18.855222324067338, CurrSamplesPerSec=15.90998650082005, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 19,
"rank": 0,
"loss": 0.3113444447517395,
"overall_throughput": 15.851661275609098,
"lr": 1.5200000000000002e-05,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 244,
"batch_size": 8,
"total_loss": 0.21918047964572906,
"gradnorm": 7.666770935058594,
"weight_norm": 393.4556884765625,
"timestamp": "2024-07-27T04:41:13.615788"
}
Per-token loss scaled by world size: 0.0015976645518094301
Per-token loss scaled by world size: 0.01189976092427969
Per-token loss scaled by world size: 0.006761615164577961
Per-token loss scaled by world size: 0.0026721509639173746
Per-token loss scaled by world size: 0.001967529533430934
Per-token loss scaled by world size: 0.005321608856320381
Per-token loss scaled by world size: 0.0015923914033919573
Epoch: 3, Step: 20, Rank: 0, loss = 0.057715632021427155
Epoch: 3, Step: 20, Rank: 4, loss = 0.0965314507484436
Epoch: 3, Step: 20, Rank: 6, loss = 0.07107700407505035
Epoch: 3, Step: 20, Rank: 3, loss = 0.4298788607120514
Epoch: 3, Step: 20, Rank: 2, loss = 0.24426335096359253
Epoch: 3, Step: 20, Rank: 1, loss = 0.19224311411380768
Epoch: 3, Step: 20, Rank: 7, loss = 0.057525139302015305
Per-token loss scaled by world size: 0.018431704491376877
Epoch: 3, Step: 20, Rank: 5, loss = 0.6658453345298767
[2024-07-27 04:41:13,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:14,028] [INFO] [timer.py:258:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=18.89434286309558, CurrSamplesPerSec=19.58513710703571, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 160
{
"epoch": 3,
"step": 20,
"rank": 0,
"loss": 0.057715632021427155,
"overall_throughput": 19.54536780630332,
"lr": 1.6000000000000003e-05,
"cuda_mem_allocated": 21.990726947784424,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 289,
"batch_size": 8,
"total_loss": 0.22688499093055725,
"gradnorm": 5.258148193359375,
"weight_norm": 393.4558410644531,
"timestamp": "2024-07-27T04:41:14.031924"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_160
[04:41:31] INFO saving took 17.931371927261353 seconds utils.py:611
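Checkpoints in this run land at samples_seen 160, 200, 240, and 280: one hf_format save every 40 samples, i.e. every 5 global steps at the logged batch_size of 8. A sketch of that cadence (the 40-sample interval is read off this log, not derived from the trainer's config handling):

# Checkpoint cadence as observed in this log.
BATCH_SIZE = 8    # "batch_size" from the JSON records
SAVE_EVERY = 40   # 200 - 160 = 240 - 200 = 40 samples between saves

for step in range(1, 41):
    samples_seen = step * BATCH_SIZE
    if samples_seen % SAVE_EVERY == 0:
        print(f"save hf_format checkpoint at samples_seen={samples_seen}")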
Per-token loss scaled by world size: 0.0045623015612363815
Per-token loss scaled by world size: 0.0081652095541358
Per-token loss scaled by world size: 0.0009351570624858141
Per-token loss scaled by world size: 0.002664643106982112
Per-token loss scaled by world size: 0.0031791036017239094
Epoch: 3, Step: 21, Rank: 2, loss = 0.2602660655975342
Epoch: 3, Step: 21, Rank: 1, loss = 0.0849355012178421
Epoch: 3, Step: 21, Rank: 4, loss = 0.10133392363786697
Epoch: 3, Step: 21, Rank: 0, loss = 0.029808131977915764
Epoch: 3, Step: 21, Rank: 6, loss = 0.14542336761951447
Per-token loss scaled by world size: 0.0044220853596925735
Per-token loss scaled by world size: 0.01251036673784256
Epoch: 3, Step: 21, Rank: 3, loss = 0.14095397293567657
Epoch: 3, Step: 21, Rank: 7, loss = 0.39876794815063477
Per-token loss scaled by world size: 0.009876725263893604
Epoch: 3, Step: 21, Rank: 5, loss = 0.31482061743736267
[2024-07-27 04:41:32,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.6800000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:32,449] [INFO] [timer.py:258:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=18.90515878235841, CurrSamplesPerSec=19.101984863890006, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,████ | 3/6 [00:19<00:18, 6.29s/it]
"step": 21,
"rank": 0,
"loss": 0.029808131977915764,
"overall_throughput": 19.05711381074895,
"lr": 1.6800000000000002e-05,
"cuda_mem_allocated": 21.990726947784424,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 255,
"batch_size": 8,
"total_loss": 0.18453869223594666,
"gradnorm": 5.108468055725098,
"weight_norm": 393.4559631347656,
"timestamp": "2024-07-27T04:41:32.452565"
}
Per-token loss scaled by world size: 0.0014227991923689842
Per-token loss scaled by world size: 0.0022042023483663797
Per-token loss scaled by world size: 0.0035717289429157972
Per-token loss scaled by world size: 0.0031726094894111156
Per-token loss scaled by world size: 0.0027486486360430717
Per-token loss scaled by world size: 0.002677777549251914
Per-token loss scaled by world size: 0.00375761860050261
Epoch: 3, Step: 22, Rank: 0, loss = 0.048019472509622574
Epoch: 3, Step: 22, Rank: 6, loss = 0.12054584920406342
Epoch: 3, Step: 22, Rank: 5, loss = 0.0927668884396553
Epoch: 3, Step: 22, Rank: 3, loss = 0.10707557201385498
Epoch: 3, Step: 22, Rank: 1, loss = 0.12681962549686432
Epoch: 3, Step: 22, Rank: 7, loss = 0.09037499129772186
Epoch: 3, Step: 22, Rank: 4, loss = 0.07439182698726654
Per-token loss scaled by world size: 0.008931323885917664
Epoch: 3, Step: 22, Rank: 2, loss = 0.30143219232559204
[2024-07-27 04:41:32,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.76e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:32,927] [INFO] [timer.py:258:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=18.93573071473506, CurrSamplesPerSec=19.53597958978115, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,█████▋ | 4/6 [00:20<00:07, 4.00s/it]
"step": 22,
"rank": 0,
"loss": 0.048019472509622574,
"overall_throughput": 19.499003967857334,
"lr": 1.76e-05,
"cuda_mem_allocated": 21.989171504974365,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 270,
"batch_size": 8,
"total_loss": 0.12017828971147537,
"gradnorm": 3.623103380203247,
"weight_norm": 393.4561462402344,
"timestamp": "2024-07-27T04:41:32.991116"
}
Per-token loss scaled by world size: 0.009149123914539814
Per-token loss scaled by world size: 0.0035603949800133705
Per-token loss scaled by world size: 0.004935313016176224
Per-token loss scaled by world size: 0.0064824605360627174
Per-token loss scaled by world size: 0.005307480692863464
Per-token loss scaled by world size: 0.0033412924967706203
Per-token loss scaled by world size: 0.00997106358408928
Epoch: 3, Step: 23, Rank: 6, loss = 0.10814699530601501
Epoch: 3, Step: 23, Rank: 1, loss = 0.14991013705730438
Epoch: 3, Step: 23, Rank: 4, loss = 0.30287104845046997
Epoch: 3, Step: 23, Rank: 0, loss = 0.2779046297073364
Epoch: 3, Step: 23, Rank: 2, loss = 0.16121472418308258
Epoch: 3, Step: 23, Rank: 5, loss = 0.1969047337770462
Epoch: 3, Step: 23, Rank: 3, loss = 0.10149175673723221
Per-token loss scaled by world size: 0.00799593236297369
Epoch: 3, Step: 23, Rank: 7, loss = 0.24287645518779755
[2024-07-27 04:41:33,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.8400000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:33,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=18.969543790437204, CurrSamplesPerSec=19.672103775255234, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,███████▎ | 5/6 [00:20<00:02, 2.72s/it]
"step": 23,
"rank": 0,
"loss": 0.2779046297073364,
"overall_throughput": 19.604933597424527,
"lr": 1.8400000000000003e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 243,
"batch_size": 8,
"total_loss": 0.19266505539417267,
"gradnorm": 3.409485101699829,
"weight_norm": 393.4563293457031,
"timestamp": "2024-07-27T04:41:33.402199"
}
Per-token loss scaled by world size: 0.00869434978812933
Per-token loss scaled by world size: 0.010431548580527306
Per-token loss scaled by world size: 0.00882460456341505
Per-token loss scaled by world size: 0.014862887561321259
Per-token loss scaled by world size: 0.007030695676803589
Per-token loss scaled by world size: 0.009925030171871185
Per-token loss scaled by world size: 0.013269560411572456
Epoch: 3, Step: 24, Rank: 1, loss = 0.31989189982414246
Epoch: 3, Step: 24, Rank: 6, loss = 0.5387796759605408
Epoch: 3, Step: 24, Rank: 0, loss = 0.37814363837242126
Epoch: 3, Step: 24, Rank: 7, loss = 0.359782338142395
Epoch: 3, Step: 24, Rank: 2, loss = 0.2548627257347107
Epoch: 3, Step: 24, Rank: 5, loss = 0.48102155327796936
Epoch: 3, Step: 24, Rank: 3, loss = 0.31517016887664795
Per-token loss scaled by world size: 0.008658657781779766
Epoch: 3, Step: 24, Rank: 4, loss = 0.31387636065483093
[2024-07-27 04:41:33,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.9200000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:33,879] [INFO] [timer.py:258:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=18.986021230980416, CurrSamplesPerSec=19.338782826201598, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 3,█████████| 6/6 [00:21<00:00, 1.96s/it]
"step": 24,
"rank": 0,
"loss": 0.37814363837242126,
"overall_throughput": 19.301816374521977,
"lr": 1.9200000000000003e-05,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 290,
"batch_size": 8,
"total_loss": 0.3701910078525543,
"gradnorm": 41.655189514160156,
"weight_norm": 393.4565124511719,
"timestamp": "2024-07-27T04:41:33.942020"
}
Epoch 3: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it]
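Worth flagging: gradnorm at step 24 above spikes to 41.66 while neighbouring steps sit in the 2-8 range. The field reports a global gradient norm over all parameters; a minimal sketch of how such a value is typically obtained (assuming clip_grad_norm_-style accounting; whether and where this run clips is not visible in the log):

# Sketch: global gradient norm of the kind reported in "gradnorm".
import torch

def global_grad_norm(model: torch.nn.Module, max_norm: float = 1.0) -> float:
    # clip_grad_norm_ rescales gradients in place if they exceed max_norm and
    # returns the total norm computed over all parameters before clipping.
    return float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))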
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 7 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 1 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 2 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 4 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 3 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
Per-token loss scaled by world size: 0.011749987490475178
Per-token loss scaled by world size: 0.0038406068924814463
Per-token loss scaled by world size: 0.01040646806359291
Per-token loss scaled by world size: 0.0029110456816852093
Per-token loss scaled by world size: 0.001697335857897997
Per-token loss scaled by world size: 0.0049619837664067745
Per-token loss scaled by world size: 0.008784592151641846
Epoch: 4, Step: 25, Rank: 3, loss = 0.13202086091041565
Epoch: 4, Step: 25, Rank: 6, loss = 0.40390580892562866
Epoch: 4, Step: 25, Rank: 0, loss = 0.10006719827651978
Epoch: 4, Step: 25, Rank: 1, loss = 0.35772234201431274
Epoch: 4, Step: 25, Rank: 5, loss = 0.30197036266326904
Epoch: 4, Step: 25, Rank: 7, loss = 0.05834592133760452
Epoch: 4, Step: 25, Rank: 2, loss = 0.17056819796562195
Per-token loss scaled by world size: 0.0024965431075543165
Epoch: 4, Step: 25, Rank: 4, loss = 0.08581867069005966
[2024-07-27 04:41:34,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[2e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:34,808] [INFO] [timer.py:258:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=18.909596063618803, CurrSamplesPerSec=17.371243026535403, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 200
{
"epoch": 4,
"step": 25,
"rank": 0,
"loss": 0.10006719827651978,
"overall_throughput": 17.301396406941095,
"lr": 2e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 275,
"batch_size": 8,
"total_loss": 0.2013024240732193,
"gradnorm": 2.961458921432495,
"weight_norm": 393.45672607421875,
"timestamp": "2024-07-27T04:41:34.811952"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_200
[04:41:52] INFO saving took 17.899128675460815 seconds utils.py:611
Epoch 4: 17%|█▋ | 1/6 [00:18<01:33, 18.72s/it]
Per-token loss scaled by world size: 0.005662889685481787
Per-token loss scaled by world size: 0.003961643204092979
Per-token loss scaled by world size: 0.0033513393718749285
Per-token loss scaled by world size: 0.0048882560804486275
Per-token loss scaled by world size: 0.005098323803395033
Per-token loss scaled by world size: 0.0037976952735334635
Per-token loss scaled by world size: 0.0018476687837392092
Epoch: 4, Step: 26, Rank: 5, loss = 0.12182052433490753
Epoch: 4, Step: 26, Rank: 1, loss = 0.10305368900299072
Epoch: 4, Step: 26, Rank: 0, loss = 0.17413385212421417
Epoch: 4, Step: 26, Rank: 7, loss = 0.1503138691186905
Epoch: 4, Step: 26, Rank: 3, loss = 0.11677912622690201
Epoch: 4, Step: 26, Rank: 2, loss = 0.15677346289157867
Epoch: 4, Step: 26, Rank: 6, loss = 0.05681581422686577
Per-token loss scaled by world size: 0.0031180845107883215
Epoch: 4, Step: 26, Rank: 4, loss = 0.09588109701871872
[2024-07-27 04:41:53,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.9959742939952393e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:53,198] [INFO] [timer.py:258:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=18.9111373084012, CurrSamplesPerSec=18.946655411223635, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 33%|███▎ | 2/6 [00:19<00:31, 7.99s/it]
{
"epoch": 4,
"step": 26,
"rank": 0,
"loss": 0.17413385212421417,
"overall_throughput": 18.900859917782263,
"lr": 1.9959742939952393e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.12194641679525375,
"gradnorm": 2.005527973175049,
"weight_norm": 393.4569091796875,
"timestamp": "2024-07-27T04:41:53.262256"
}
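The lr column climbed linearly by 8e-7 per step to its 2e-05 peak at step 25 and begins decaying here at step 26. The logged values match a cosine schedule with 25 warmup steps over 60 total steps to six significant figures; a sketch of that fit (the warmup and total-step counts are a numerical inference from this log, not read from the training code):

# Cosine-with-linear-warmup schedule that reproduces the logged lr values.
import math

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 2e-05, 25, 60

def lr_at(step: int) -> float:
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS                    # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))   # cosine decay

print(lr_at(16))  # 1.28e-05, as logged at step 16
print(lr_at(26))  # ~1.9959743e-05, as logged at step 26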
Per-token loss scaled by world size: 0.007557791192084551
Per-token loss scaled by world size: 0.008283982053399086
Per-token loss scaled by world size: 0.003100305562838912
Per-token loss scaled by world size: 0.011851347051560879
Per-token loss scaled by world size: 0.013045835308730602
Per-token loss scaled by world size: 0.009396737441420555
Per-token loss scaled by world size: 0.0076859793625772
Epoch: 4, Step: 27, Rank: 0, loss = 0.20972870290279388
Epoch: 4, Step: 27, Rank: 2, loss = 0.22988051176071167
Epoch: 4, Step: 27, Rank: 3, loss = 0.3288748860359192
Epoch: 4, Step: 27, Rank: 6, loss = 0.36202192306518555
Epoch: 4, Step: 27, Rank: 5, loss = 0.26075947284698486
Epoch: 4, Step: 27, Rank: 4, loss = 0.08603347837924957
Epoch: 4, Step: 27, Rank: 7, loss = 0.2132859230041504
Per-token loss scaled by world size: 0.004431413020938635
Epoch: 4, Step: 27, Rank: 1, loss = 0.12297171354293823
[2024-07-27 04:41:53,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.98392958859863e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:53,665] [INFO] [timer.py:258:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=18.95425254539218, CurrSamplesPerSec=20.05141088310167, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 50%|█████ | 3/6 [00:19<00:13, 4.56s/it]
{
"epoch": 4,
"step": 27,
"rank": 0,
"loss": 0.20972870290279388,
"overall_throughput": 20.01441802121427,
"lr": 1.98392958859863e-05,
"cuda_mem_allocated": 21.988572120666504,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 222,
"batch_size": 8,
"total_loss": 0.22669458389282227,
"gradnorm": 4.566909313201904,
"weight_norm": 393.4571838378906,
"timestamp": "2024-07-27T04:41:53.729196"
}
Per-token loss scaled by world size: 0.002589393639937043
Per-token loss scaled by world size: 0.0016010800609365106
Per-token loss scaled by world size: 0.009488740935921669
Per-token loss scaled by world size: 0.007330995053052902
Per-token loss scaled by world size: 0.006591046694666147
Per-token loss scaled by world size: 0.0028418628498911858
Per-token loss scaled by world size: 0.0009722260874696076
Epoch: 4, Step: 28, Rank: 0, loss = 0.055437397211790085
Epoch: 4, Step: 28, Rank: 5, loss = 0.3285476565361023
Epoch: 4, Step: 28, Rank: 1, loss = 0.0896577537059784
Epoch: 4, Step: 28, Rank: 4, loss = 0.22821499407291412
Epoch: 4, Step: 28, Rank: 2, loss = 0.2538357079029083
Epoch: 4, Step: 28, Rank: 6, loss = 0.09839949756860733
Epoch: 4, Step: 28, Rank: 3, loss = 0.03366332873702049
Per-token loss scaled by world size: 0.017863700166344643
Epoch: 4, Step: 28, Rank: 7, loss = 0.6185306310653687
[2024-07-27 04:41:54,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.9639628606958535e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:54,145] [INFO] [timer.py:258:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=18.966307174026092, CurrSamplesPerSec=19.272736671546916, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 67%|██████▋ | 4/6 [00:20<00:05, 2.95s/it]
{
"epoch": 4,
"step": 28,
"rank": 0,
"loss": 0.055437397211790085,
"overall_throughput": 19.215202457388543,
"lr": 1.9639628606958535e-05,
"cuda_mem_allocated": 21.989171504974365,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 277,
"batch_size": 8,
"total_loss": 0.21328586339950562,
"gradnorm": 8.249006271362305,
"weight_norm": 393.4573974609375,
"timestamp": "2024-07-27T04:41:54.208735"
}
Per-token loss scaled by world size: 0.0066835153847932816
Per-token loss scaled by world size: 0.004529251717031002
Per-token loss scaled by world size: 0.0037545531522482634
Per-token loss scaled by world size: 0.003318126080557704
Per-token loss scaled by world size: 0.002113455906510353
Per-token loss scaled by world size: 0.0010128725552931428
Per-token loss scaled by world size: 0.0017812160076573491
Epoch: 4, Step: 29, Rank: 0, loss = 0.2815430760383606
Epoch: 4, Step: 29, Rank: 6, loss = 0.1397760659456253
Epoch: 4, Step: 29, Rank: 3, loss = 0.19079472124576569
Epoch: 4, Step: 29, Rank: 7, loss = 0.15816055238246918
Epoch: 4, Step: 29, Rank: 5, loss = 0.07503372430801392
Epoch: 4, Step: 29, Rank: 1, loss = 0.04266725853085518
Epoch: 4, Step: 29, Rank: 4, loss = 0.08902932703495026
Per-token loss scaled by world size: 0.00729252677410841
Epoch: 4, Step: 29, Rank: 2, loss = 0.3071976900100708
[2024-07-27 04:41:54,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.9362348706397374e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:54,626] [INFO] [timer.py:258:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=18.978206465919676, CurrSamplesPerSec=19.292915749104477, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 4: 83%|████████▎ | 5/6 [00:20<00:02, 2.06s/it]
{
"epoch": 4,
"step": 29,
"rank": 0,
"loss": 0.2815430760383606,
"overall_throughput": 19.249871636665503,
"lr": 1.9362348706397374e-05,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 337,
"batch_size": 8,
"total_loss": 0.16052529215812683,
"gradnorm": 3.410759210586548,
"weight_norm": 393.4576110839844,
"timestamp": "2024-07-27T04:41:54.689974"
}
Per-token loss scaled by world size: 0.005897491704672575
Per-token loss scaled by world size: 0.007752169389277697
Per-token loss scaled by world size: 0.007537755649536848
Per-token loss scaled by world size: 0.012558677233755589
Per-token loss scaled by world size: 0.00658394442871213
Per-token loss scaled by world size: 0.003483764361590147
Per-token loss scaled by world size: 0.0014572414802387357
Epoch: 4, Step: 30, Rank: 7, loss = 0.21899878978729248
Epoch: 4, Step: 30, Rank: 0, loss = 0.1859964281320572
Epoch: 4, Step: 30, Rank: 4, loss = 0.21294160187244415
Epoch: 4, Step: 30, Rank: 6, loss = 0.166604146361351
Epoch: 4, Step: 30, Rank: 1, loss = 0.3547826409339905
Epoch: 4, Step: 30, Rank: 3, loss = 0.09841634333133698
Epoch: 4, Step: 30, Rank: 2, loss = 0.04116707295179367
Per-token loss scaled by world size: 0.013426681980490685
Epoch: 4, Step: 30, Rank: 5, loss = 0.3793037533760071
[2024-07-27 04:41:55,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.900968867902419e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:41:55,093] [INFO] [timer.py:258:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=19.013068675506908, CurrSamplesPerSec=20.005289522667084, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 240
{
"epoch": 4,
"step": 30,
"rank": 0,
"loss": 0.1859964281320572,
"overall_throughput": 19.962075281886168,
"lr": 1.900968867902419e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 226,
"batch_size": 8,
"total_loss": 0.2072763293981552,
"gradnorm": 3.050539255142212,
"weight_norm": 393.45782470703125,
"timestamp": "2024-07-27T04:41:55.095878"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_240
[04:42:12] INFO saving took 17.86995029449463 seconds utils.py:611
Epoch 4: 100%|██████████| 6/6 [00:39<00:00, 6.51s/it]
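Each hf_format save blocks the training loop for roughly 18 seconds (the "saving took" lines), which dominates the per-iteration averages on the progress bars (6.51 s/it for this epoch, which contains two saves, against ~3.5 s/it elsewhere) and shows up as a lower CurrSamplesPerSec on the step timed right after the pause (14.3 at step 31 below, versus ~19 in steady state). A sketch of the throughput bookkeeping, assuming a plain wall-clock timer around each global step in the spirit of DeepSpeed's throughput timer (not the verbatim implementation):

# Wall-clock throughput tracking with a running average, as a sketch.
import time

class ThroughputTracker:
    def __init__(self) -> None:
        self.rates: list[float] = []

    def timed_step(self, step_fn, batch_size: int) -> float:
        start = time.perf_counter()
        step_fn()  # forward + backward + optimizer step
        curr = batch_size / (time.perf_counter() - start)
        self.rates.append(curr)
        avg = sum(self.rates) / len(self.rates)
        print(f"RunningAvgSamplesPerSec={avg}, CurrSamplesPerSec={curr}")
        return curr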
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 5 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 5 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 2 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 6 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 6 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 3 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 4 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 4 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
Per-token loss scaled by world size: 0.005013823974877596
Per-token loss scaled by world size: 0.002962352242320776
Per-token loss scaled by world size: 0.004985218867659569
Per-token loss scaled by world size: 0.0022716219536960125
Per-token loss scaled by world size: 0.0034899800084531307
Per-token loss scaled by world size: 0.0017849565483629704
Per-token loss scaled by world size: 0.0013634071219712496
Epoch: 5, Step: 31, Rank: 6, loss = 0.17323635518550873
Epoch: 5, Step: 31, Rank: 4, loss = 0.17423038184642792
Epoch: 5, Step: 31, Rank: 3, loss = 0.10294174402952194
Epoch: 5, Step: 31, Rank: 2, loss = 0.07893886417150497
Epoch: 5, Step: 31, Rank: 0, loss = 0.06202723830938339
Epoch: 5, Step: 31, Rank: 1, loss = 0.12127680331468582
Epoch: 5, Step: 31, Rank: 5, loss = 0.04737839847803116
Per-token loss scaled by world size: 0.011078107170760632
Epoch: 5, Step: 31, Rank: 7, loss = 0.3849642276763916
[2024-07-27 04:42:13,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.8584487936018663e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:13,922] [INFO] [timer.py:258:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=18.801550768968706, CurrSamplesPerSec=14.335953625150017, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,▋ | 1/6 [00:00<00:04, 1.19it/s]
"step": 31,
"rank": 0,
"loss": 0.06202723830938339,
"overall_throughput": 14.285813059808566,
"lr": 1.8584487936018663e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 278,
"batch_size": 8,
"total_loss": 0.14312423765659332,
"gradnorm": 4.453860282897949,
"weight_norm": 393.4580078125,
"timestamp": "2024-07-27T04:42:13.987803"
}
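The cuda_mem_allocated (~21.99) and MaxMemAllocated (28.29GB) figures track the PyTorch caching allocator, not total device memory; the equivalents are one-liners (a sketch of the likely calls, not confirmed against the trainer):

# Allocator statistics matching the shape of the logged memory fields.
import torch

def cuda_mem_allocated_gib() -> float:
    return torch.cuda.memory_allocated() / (1024 ** 3)      # ~21.99 in this run

def cuda_max_mem_allocated_gib() -> float:
    return torch.cuda.max_memory_allocated() / (1024 ** 3)  # ~28.29 in this run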
Per-token loss scaled by world size: 0.0028378700371831656
Per-token loss scaled by world size: 0.005622998811304569
Per-token loss scaled by world size: 0.0031444875057786703
Per-token loss scaled by world size: 0.0035572010092437267
Per-token loss scaled by world size: 0.004025444388389587
Per-token loss scaled by world size: 0.005346423946321011
Per-token loss scaled by world size: 0.0037831738591194153
Epoch: 5, Step: 32, Rank: 0, loss = 0.0971970483660698
Epoch: 5, Step: 32, Rank: 6, loss = 0.10769869387149811
Epoch: 5, Step: 32, Rank: 7, loss = 0.12183413654565811
Epoch: 5, Step: 32, Rank: 3, loss = 0.1925877034664154
Epoch: 5, Step: 32, Rank: 2, loss = 0.13787147402763367
Epoch: 5, Step: 32, Rank: 1, loss = 0.18311502039432526
Epoch: 5, Step: 32, Rank: 5, loss = 0.12957370281219482
Per-token loss scaled by world size: 0.008308484219014645
Epoch: 5, Step: 32, Rank: 4, loss = 0.28456559777259827
[2024-07-27 04:42:14,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.8090169943749477e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:14,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=18.82734153699244, CurrSamplesPerSec=19.607327906336685, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,██▎ | 2/6 [00:01<00:02, 1.60it/s]
"step": 32,
"rank": 0,
"loss": 0.0971970483660698,
"overall_throughput": 19.569625420939243,
"lr": 1.8090169943749477e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 274,
"batch_size": 8,
"total_loss": 0.15680542588233948,
"gradnorm": 2.596428394317627,
"weight_norm": 393.45819091796875,
"timestamp": "2024-07-27T04:42:14.402141"
}
Per-token loss scaled by world size: 0.0021558639127761126
Per-token loss scaled by world size: 0.004672932904213667
Per-token loss scaled by world size: 0.0039972770027816296
Per-token loss scaled by world size: 0.0053141191601753235
Per-token loss scaled by world size: 0.0033407120499759912
Per-token loss scaled by world size: 0.006172977387905121
Per-token loss scaled by world size: 0.003799165366217494
Epoch: 5, Step: 33, Rank: 0, loss = 0.12193598598241806
Epoch: 5, Step: 33, Rank: 1, loss = 0.193965345621109
Epoch: 5, Step: 33, Rank: 2, loss = 0.1705620437860489
Epoch: 5, Step: 33, Rank: 7, loss = 0.14590060710906982
Epoch: 5, Step: 33, Rank: 3, loss = 0.13866953551769257
Epoch: 5, Step: 33, Rank: 6, loss = 0.2253136783838272
Epoch: 5, Step: 33, Rank: 4, loss = 0.07868903130292892
Per-token loss scaled by world size: 0.003766376990824938
Epoch: 5, Step: 33, Rank: 5, loss = 0.13747276365756989
[2024-07-27 04:42:14,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.7530714660036112e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:14,872] [INFO] [timer.py:258:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=18.85669114464766, CurrSamplesPerSec=19.781816809788317, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,████ | 3/6 [00:01<00:01, 1.80it/s]
"step": 33,
"rank": 0,
"loss": 0.12193598598241806,
"overall_throughput": 19.744405215835805,
"lr": 1.7530714660036112e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 292,
"batch_size": 8,
"total_loss": 0.1515636295080185,
"gradnorm": 2.53242564201355,
"weight_norm": 393.4584655761719,
"timestamp": "2024-07-27T04:42:14.936193"
}
Per-token loss scaled by world size: 0.004924851469695568
Per-token loss scaled by world size: 0.004273226950317621
Per-token loss scaled by world size: 0.002622528001666069
Per-token loss scaled by world size: 0.0037059050519019365
Per-token loss scaled by world size: 0.0047779749147593975
Per-token loss scaled by world size: 0.005559505894780159
Per-token loss scaled by world size: 0.007279254496097565
Epoch: 5, Step: 34, Rank: 7, loss = 0.12646400928497314
Epoch: 5, Step: 34, Rank: 2, loss = 0.0894937664270401
Epoch: 5, Step: 34, Rank: 3, loss = 0.1630484014749527
Epoch: 5, Step: 34, Rank: 1, loss = 0.1680605560541153
Epoch: 5, Step: 34, Rank: 0, loss = 0.1458238661289215
Epoch: 5, Step: 34, Rank: 6, loss = 0.18971814215183258
Epoch: 5, Step: 34, Rank: 5, loss = 0.24840456247329712
Per-token loss scaled by world size: 0.010788660496473312
Epoch: 5, Step: 34, Rank: 4, loss = 0.3681630492210388
[2024-07-27 04:42:15,270] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.691062648986865e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:15,347] [INFO] [timer.py:258:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=18.876022011428752, CurrSamplesPerSec=19.495582553322528, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,█████▋ | 4/6 [00:02<00:01, 1.91it/s]
"step": 34,
"rank": 0,
"loss": 0.1458238661289215,
"overall_throughput": 19.439414681510893,
"lr": 1.691062648986865e-05,
"cuda_mem_allocated": 21.988572120666504,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 273,
"batch_size": 8,
"total_loss": 0.1873970627784729,
"gradnorm": 2.919456958770752,
"weight_norm": 393.45867919921875,
"timestamp": "2024-07-27T04:42:15.410652"
}
Per-token loss scaled by world size: 0.0070740398950874805
Per-token loss scaled by world size: 0.006351261865347624
Per-token loss scaled by world size: 0.009431459940969944
Per-token loss scaled by world size: 0.0034575308673083782
Per-token loss scaled by world size: 0.0034287304151803255
Per-token loss scaled by world size: 0.006853340193629265
Per-token loss scaled by world size: 0.004821010399609804
Epoch: 5, Step: 35, Rank: 0, loss = 0.20337864756584167
Epoch: 5, Step: 35, Rank: 1, loss = 0.2711544632911682
Epoch: 5, Step: 35, Rank: 4, loss = 0.1825987845659256
Epoch: 5, Step: 35, Rank: 6, loss = 0.09940401464700699
Epoch: 5, Step: 35, Rank: 2, loss = 0.19703352451324463
Epoch: 5, Step: 35, Rank: 7, loss = 0.09857600182294846
Epoch: 5, Step: 35, Rank: 5, loss = 0.1386040449142456
Per-token loss scaled by world size: 0.0033929902128875256
Epoch: 5, Step: 35, Rank: 3, loss = 0.0975484699010849
[2024-07-27 04:42:15,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.6234898018587336e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:15,825] [INFO] [timer.py:258:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=18.89253145148775, CurrSamplesPerSec=19.43652077202901, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 280
{
"epoch": 5,
"step": 35,
"rank": 0,
"loss": 0.20337864756584167,
"overall_throughput": 19.39916886552688,
"lr": 1.6234898018587336e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 230,
"batch_size": 8,
"total_loss": 0.16103725135326385,
"gradnorm": 3.5732498168945312,
"weight_norm": 393.45892333984375,
"timestamp": "2024-07-27T04:42:15.828221"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_280
[04:42:33] INFO saving took 17.876163959503174 seconds utils.py:611
Per-token loss scaled by world size: 0.004089208785444498
Per-token loss scaled by world size: 0.0032626439351588488
Per-token loss scaled by world size: 0.007577312644571066
Per-token loss scaled by world size: 0.00760306091979146
Per-token loss scaled by world size: 0.0089601781219244
Per-token loss scaled by world size: 0.0050941589288413525
Per-token loss scaled by world size: 0.004234898369759321
Epoch: 5, Step: 36, Rank: 1, loss = 0.09910281002521515
Epoch: 5, Step: 36, Rank: 4, loss = 0.2309429794549942
Epoch: 5, Step: 36, Rank: 0, loss = 0.12420971691608429
Epoch: 5, Step: 36, Rank: 7, loss = 0.2721654176712036
Epoch: 5, Step: 36, Rank: 5, loss = 0.2301608771085739
Epoch: 5, Step: 36, Rank: 2, loss = 0.15473507344722748
Epoch: 5, Step: 36, Rank: 6, loss = 0.12863503396511078
Per-token loss scaled by world size: 0.003793718060478568
Epoch: 5, Step: 36, Rank: 3, loss = 0.11523418873548508
[2024-07-27 04:42:34,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.5508969814521026e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:34,189] [INFO] [timer.py:258:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=18.899391419587044, CurrSamplesPerSec=19.128599036570417, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 5,█████████| 6/6 [00:21<00:00, 4.75s/it]
"step": 36,
"rank": 0,
"loss": 0.12420971691608429,
"overall_throughput": 19.082094487965083,
"lr": 1.5508969814521026e-05,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 243,
"batch_size": 8,
"total_loss": 0.1693982630968094,
"gradnorm": 3.1067850589752197,
"weight_norm": 393.45916748046875,
"timestamp": "2024-07-27T04:42:34.192324"
}
Epoch 5: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it]
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 0 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 1 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 1 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 5 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 3 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
Per-token loss scaled by world size: 0.007784623187035322
Per-token loss scaled by world size: 0.003072483232244849
Per-token loss scaled by world size: 0.009150322526693344
Per-token loss scaled by world size: 0.0048055145889520645
Per-token loss scaled by world size: 0.007070611696690321
Per-token loss scaled by world size: 0.0026875571347773075
Per-token loss scaled by world size: 0.0032222422305494547
Epoch: 6, Step: 37, Rank: 6, loss = 0.08487734943628311
Epoch: 6, Step: 37, Rank: 3, loss = 0.21505022048950195
Epoch: 6, Step: 37, Rank: 0, loss = 0.25277766585350037
Epoch: 6, Step: 37, Rank: 4, loss = 0.19532564282417297
Epoch: 6, Step: 37, Rank: 2, loss = 0.07424376904964447
Epoch: 6, Step: 37, Rank: 5, loss = 0.0890144407749176
Epoch: 6, Step: 37, Rank: 1, loss = 0.13275234401226044
Per-token loss scaled by world size: 0.003815547563135624
Epoch: 6, Step: 37, Rank: 7, loss = 0.10540450364351273
[2024-07-27 04:42:35,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.4738686624729987e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:35,123] [INFO] [timer.py:258:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=18.884730321309164, CurrSamplesPerSec=18.399439371025178, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s]
{
"epoch": 6,
"step": 37,
"rank": 0,
"loss": 0.25277766585350037,
"overall_throughput": 18.323549504455777,
"lr": 1.4738686624729987e-05,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 221,
"batch_size": 8,
"total_loss": 0.14368075132369995,
"gradnorm": 2.0474841594696045,
"weight_norm": 393.4593811035156,
"timestamp": "2024-07-27T04:42:35.186644"
}
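The numbers in these metric blocks are internally consistent: each rank's printed loss equals its "Per-token loss scaled by world size" value times num_loss_counted_tokens divided by the world size, and total_loss is the mean of the eight per-rank losses. A minimal sketch checking this against the step-37 values above; the reduction formula is inferred from the log, not taken from the training code:

```python
# Sketch (inferred from the log, not the training library's own code):
# rank_loss = per_token_loss_scaled_by_world_size * num_loss_counted_tokens / world_size
world_size = 8                    # --gpus 8
num_loss_counted_tokens = 221     # from the step-37 JSON block

scaled_rank0 = 0.009150322526693344   # rank 0 "Per-token loss scaled by world size"
print(scaled_rank0 * num_loss_counted_tokens / world_size)  # -> 0.2527776... (rank 0 loss)

rank_losses = [0.08487734943628311, 0.21505022048950195, 0.25277766585350037,
               0.19532564282417297, 0.07424376904964447, 0.0890144407749176,
               0.13275234401226044, 0.10540450364351273]
print(sum(rank_losses) / len(rank_losses))  # -> 0.1436807... (the "total_loss" field)
```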
Per-token loss scaled by world size: 0.0039473348297178745
Per-token loss scaled by world size: 0.0038144520949572325
Per-token loss scaled by world size: 0.0010828088270500302
Per-token loss scaled by world size: 0.0007635311339981854
Per-token loss scaled by world size: 0.0021416409872472286
Per-token loss scaled by world size: 0.0017905712593346834
Per-token loss scaled by world size: 0.005295279435813427
Epoch: 6, Step: 38, Rank: 0, loss = 0.14901189506053925
Epoch: 6, Step: 38, Rank: 1, loss = 0.14399556815624237
Epoch: 6, Step: 38, Rank: 7, loss = 0.040876034647226334
Epoch: 6, Step: 38, Rank: 4, loss = 0.02882329933345318
Epoch: 6, Step: 38, Rank: 3, loss = 0.08084695041179657
Epoch: 6, Step: 38, Rank: 2, loss = 0.06759406626224518
Epoch: 6, Step: 38, Rank: 5, loss = 0.19989679753780365
Per-token loss scaled by world size: 0.006602860987186432
Epoch: 6, Step: 38, Rank: 6, loss = 0.24925799667835236
[2024-07-27 04:42:35,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.3930250316539237e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:35,599] [INFO] [timer.py:258:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=18.899852383768156, CurrSamplesPerSec=19.44482195705551, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s]
{
"epoch": 6,
"step": 38,
"rank": 0,
"loss": 0.14901189506053925,
"overall_throughput": 19.37470523157654,
"lr": 1.3930250316539237e-05,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 302,
"batch_size": 8,
"total_loss": 0.12003782391548157,
"gradnorm": 1.780216097831726,
"weight_norm": 393.4596252441406,
"timestamp": "2024-07-27T04:42:35.663666"
}
Per-token loss scaled by world size: 0.0038089316803961992
Per-token loss scaled by world size: 0.0020512850023806095
Per-token loss scaled by world size: 0.008632734417915344
Per-token loss scaled by world size: 0.0009830425260588527
Per-token loss scaled by world size: 0.002817384200170636
Per-token loss scaled by world size: 0.007761223241686821
Per-token loss scaled by world size: 0.007352802902460098
Epoch: 6, Step: 39, Rank: 6, loss = 0.06435906887054443
Epoch: 6, Step: 39, Rank: 2, loss = 0.2708520293235779
Epoch: 6, Step: 39, Rank: 7, loss = 0.030842959880828857
Epoch: 6, Step: 39, Rank: 4, loss = 0.24350838363170624
Epoch: 6, Step: 39, Rank: 0, loss = 0.11950523406267166
Epoch: 6, Step: 39, Rank: 3, loss = 0.08839543163776398
Epoch: 6, Step: 39, Rank: 5, loss = 0.23069418966770172
Per-token loss scaled by world size: 0.003210328985005617
Epoch: 6, Step: 39, Rank: 1, loss = 0.10072407126426697
[2024-07-27 04:42:35,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[1.3090169943749475e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:36,075] [INFO] [timer.py:258:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=18.915587653609716, CurrSamplesPerSec=19.50004649173377, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 50%|█████ | 3/6 [00:01<00:01, 1.81it/s]
{
"epoch": 6,
"step": 39,
"rank": 0,
"loss": 0.11950523406267166,
"overall_throughput": 19.43697112933871,
"lr": 1.3090169943749475e-05,
"cuda_mem_allocated": 21.98988962173462,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 251,
"batch_size": 8,
"total_loss": 0.1436101794242859,
"gradnorm": 2.214144706726074,
"weight_norm": 393.4598388671875,
"timestamp": "2024-07-27T04:42:36.138875"
}
Per-token loss scaled by world size: 0.001895732944831252
Per-token loss scaled by world size: 0.0019446390215307474
Per-token loss scaled by world size: 0.0018286737613379955
Per-token loss scaled by world size: 0.002989412285387516
Per-token loss scaled by world size: 0.0028383415192365646
Per-token loss scaled by world size: 0.002208298072218895
Per-token loss scaled by world size: 0.005300204269587994
Epoch: 6, Step: 40, Rank: 4, loss = 0.11060825735330582
Epoch: 6, Step: 40, Rank: 3, loss = 0.06766092777252197
Epoch: 6, Step: 40, Rank: 0, loss = 0.07014212012290955
Epoch: 6, Step: 40, Rank: 7, loss = 0.10501863807439804
Epoch: 6, Step: 40, Rank: 1, loss = 0.07195164263248444
Epoch: 6, Step: 40, Rank: 2, loss = 0.08170703053474426
Epoch: 6, Step: 40, Rank: 5, loss = 0.19610755145549774
Per-token loss scaled by world size: 0.0030284025706350803
Epoch: 6, Step: 40, Rank: 6, loss = 0.11205089092254639
[2024-07-27 04:42:36,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.2225209339563144e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:36,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=18.92908988177896, CurrSamplesPerSec=19.44259109142837, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 320
{
"epoch": 6,
"step": 40,
"rank": 0,
"loss": 0.07014212012290955,
"overall_throughput": 19.380199690766368,
"lr": 1.2225209339563144e-05,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 296,
"batch_size": 8,
"total_loss": 0.10190588980913162,
"gradnorm": 1.372182011604309,
"weight_norm": 393.4600830078125,
"timestamp": "2024-07-27T04:42:36.554966"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_320
[04:42:54] INFO saving took 17.879958391189575 seconds utils.py:611
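Checkpoints in this run land every 40 samples (samples_seen 320, 360, 400, ...), i.e. every 5 optimizer steps at an effective batch size of 8. That cadence is consistent with --save-samples 46 being rounded down to a multiple of the effective batch size; the rounding rule in this sketch is an assumption inferred from the log, not confirmed trainer code:

```python
# Assumption: --save-samples is rounded down to a multiple of the effective
# batch size (inferred from the 320/360/400/... cadence in this log).
effective_batch_size = 8    # --effective-batch-size 8
save_samples = 46           # --save-samples 46
save_every = (save_samples // effective_batch_size) * effective_batch_size
print(save_every)           # -> 40 samples, i.e. a checkpoint every 5 steps
```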
Epoch 6: 67%|██████▋ | 4/6 [00:20<00:15, 7.58s/it]
Per-token loss scaled by world size: 0.008056806400418282
Per-token loss scaled by world size: 0.007982512935996056
Per-token loss scaled by world size: 0.00242948392406106
Per-token loss scaled by world size: 0.004318062216043472
Per-token loss scaled by world size: 0.0034818260464817286
Per-token loss scaled by world size: 0.014020812697708607
Epoch: 6, Step: 41, Rank: 3, loss = 0.07318820059299469
Epoch: 6, Step: 41, Rank: 0, loss = 0.24271129071712494
Epoch: 6, Step: 41, Rank: 7, loss = 0.2404731959104538
Per-token loss scaled by world size: 0.0027995144482702017
Epoch: 6, Step: 41, Rank: 1, loss = 0.42237699031829834
Epoch: 6, Step: 41, Rank: 2, loss = 0.10489001125097275
Epoch: 6, Step: 41, Rank: 5, loss = 0.13008162379264832
Epoch: 6, Step: 41, Rank: 4, loss = 0.08433537185192108
Per-token loss scaled by world size: 0.002146774670109153
Epoch: 6, Step: 41, Rank: 6, loss = 0.0646715834736824
[2024-07-27 04:42:54,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[1.1342332658176556e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:54,909] [INFO] [timer.py:258:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=18.942455198097623, CurrSamplesPerSec=19.46470827097328, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 83%|████████▎ | 5/6 [00:20<00:05, 5.02s/it]
{
"epoch": 6,
"step": 41,
"rank": 0,
"loss": 0.24271129071712494,
"overall_throughput": 19.42185054501338,
"lr": 1.1342332658176556e-05,
"cuda_mem_allocated": 21.990966320037842,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 241,
"batch_size": 8,
"total_loss": 0.17034101486206055,
"gradnorm": 2.2789089679718018,
"weight_norm": 393.46026611328125,
"timestamp": "2024-07-27T04:42:54.973034"
}
Per-token loss scaled by world size: 0.0011888241861015558
Per-token loss scaled by world size: 0.0031213611364364624
Per-token loss scaled by world size: 0.002157441573217511
Per-token loss scaled by world size: 0.0022118226625025272
Per-token loss scaled by world size: 0.006297964137047529
Per-token loss scaled by world size: 0.0018200232880190015
Per-token loss scaled by world size: 0.002669830108061433
Epoch: 6, Step: 42, Rank: 1, loss = 0.10846729576587677
Epoch: 6, Step: 42, Rank: 0, loss = 0.04131164029240608
Epoch: 6, Step: 42, Rank: 4, loss = 0.0927765965461731
Epoch: 6, Step: 42, Rank: 6, loss = 0.07686083763837814
Epoch: 6, Step: 42, Rank: 3, loss = 0.07497109472751617
Epoch: 6, Step: 42, Rank: 7, loss = 0.06324581056833267
Epoch: 6, Step: 42, Rank: 2, loss = 0.21885424852371216
Per-token loss scaled by world size: 0.0031561183277517557
Epoch: 6, Step: 42, Rank: 5, loss = 0.10967510938644409
[2024-07-27 04:42:55,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[1.044864830350515e-05], mom=[(0.9, 0.95)]
[2024-07-27 04:42:55,387] [INFO] [timer.py:258:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=18.95295051613209, CurrSamplesPerSec=19.371539779153203, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.47s/it]
{
"epoch": 6,
"step": 42,
"rank": 0,
"loss": 0.04131164029240608,
"overall_throughput": 19.31602776765571,
"lr": 1.044864830350515e-05,
"cuda_mem_allocated": 21.990487575531006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 278,
"batch_size": 8,
"total_loss": 0.09827032685279846,
"gradnorm": 1.404802680015564,
"weight_norm": 393.4604797363281,
"timestamp": "2024-07-27T04:42:55.450259"
}
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
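The learning rates logged above decay smoothly to 0.0 at step 60, which fits a cosine schedule. The sketch below reproduces the logged values with a peak LR of 2e-5, 25 warmup steps, and 60 total steps; those three constants are inferred by fitting the logged lr values, not read from an InstructLab config:

```python
import math

# Cosine-decay sketch; peak_lr / warmup_steps / total_steps are inferred by
# fitting the lr values in this log (e.g. 1.3090169943749475e-05 at step 39).
peak_lr, warmup_steps, total_steps = 2e-5, 25, 60

def lr_at(step: int) -> float:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(39))  # -> 1.3090169943749475e-05, matching the step-39 log line
print(lr_at(60))  # -> 0.0, matching the final step
```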
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 4 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 7 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 7 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
Per-token loss scaled by world size: 0.003743554465472698
Per-token loss scaled by world size: 0.005117448978126049
Per-token loss scaled by world size: 0.0018975065322592854
Per-token loss scaled by world size: 0.009965005330741405
Per-token loss scaled by world size: 0.0038619362749159336
Per-token loss scaled by world size: 0.004172571934759617
Per-token loss scaled by world size: 0.00353407533839345
Epoch: 7, Step: 43, Rank: 6, loss = 0.056213632225990295
Epoch: 7, Step: 43, Rank: 7, loss = 0.2952132821083069
Epoch: 7, Step: 43, Rank: 2, loss = 0.110902801156044
Epoch: 7, Step: 43, Rank: 0, loss = 0.15160442888736725
Epoch: 7, Step: 43, Rank: 1, loss = 0.11440986394882202
Epoch: 7, Step: 43, Rank: 5, loss = 0.12361244112253189
Epoch: 7, Step: 43, Rank: 4, loss = 0.10469698160886765
Per-token loss scaled by world size: 0.003125852905213833
Epoch: 7, Step: 43, Rank: 3, loss = 0.09260339289903641
[2024-07-27 04:42:56,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[9.551351696494854e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:56,318] [INFO] [timer.py:258:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=18.947759228108367, CurrSamplesPerSec=18.74241437439884, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,▋ | 1/6 [00:00<00:04, 1.23it/s]
"step": 43,
"rank": 0,
"loss": 0.15160442888736725,
"overall_throughput": 18.665014201337474,
"lr": 9.551351696494854e-06,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 237,
"batch_size": 8,
"total_loss": 0.13115710020065308,
"gradnorm": 1.5875235795974731,
"weight_norm": 393.4606628417969,
"timestamp": "2024-07-27T04:42:56.382842"
}
Per-token loss scaled by world size: 0.0007441109046339989
Per-token loss scaled by world size: 0.002569864271208644
Per-token loss scaled by world size: 0.0021702933590859175
Per-token loss scaled by world size: 0.0034706422593444586
Per-token loss scaled by world size: 0.003474967321380973
Per-token loss scaled by world size: 0.0027420881669968367
Per-token loss scaled by world size: 0.002911260584369302
Epoch: 7, Step: 44, Rank: 7, loss = 0.07596027106046677
Epoch: 7, Step: 44, Rank: 6, loss = 0.0899452492594719
Epoch: 7, Step: 44, Rank: 4, loss = 0.12147247791290283
Epoch: 7, Step: 44, Rank: 3, loss = 0.12162385880947113
Epoch: 7, Step: 44, Rank: 5, loss = 0.09597308933734894
Epoch: 7, Step: 44, Rank: 2, loss = 0.026043880730867386
Epoch: 7, Step: 44, Rank: 1, loss = 0.10189411789178848
Per-token loss scaled by world size: 0.0022017783485352993
Epoch: 7, Step: 44, Rank: 0, loss = 0.07706224173307419
[2024-07-27 04:42:56,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[8.657667341823449e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:56,789] [INFO] [timer.py:258:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=18.96843313737227, CurrSamplesPerSec=19.85672616190888, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,██▎ | 2/6 [00:01<00:02, 1.64it/s]
"step": 44,
"rank": 0,
"loss": 0.07706224173307419,
"overall_throughput": 19.817860414460462,
"lr": 8.657667341823449e-06,
"cuda_mem_allocated": 21.992404460906982,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 280,
"batch_size": 8,
"total_loss": 0.08874689787626266,
"gradnorm": 1.267701268196106,
"weight_norm": 393.4607849121094,
"timestamp": "2024-07-27T04:42:56.852698"
}
Per-token loss scaled by world size: 0.0025115651078522205
Per-token loss scaled by world size: 0.004961833357810974
Per-token loss scaled by world size: 0.0043532936833798885
Per-token loss scaled by world size: 0.0021706747356802225
Per-token loss scaled by world size: 0.0033806730061769485
Per-token loss scaled by world size: 0.002844580914825201
Per-token loss scaled by world size: 0.0033718389458954334
Epoch: 7, Step: 45, Rank: 4, loss = 0.1349520981311798
Epoch: 7, Step: 45, Rank: 5, loss = 0.1538168340921402
Epoch: 7, Step: 45, Rank: 6, loss = 0.06729091703891754
Epoch: 7, Step: 45, Rank: 7, loss = 0.0881820097565651
Epoch: 7, Step: 45, Rank: 0, loss = 0.07785851508378983
Epoch: 7, Step: 45, Rank: 2, loss = 0.10480086505413055
Epoch: 7, Step: 45, Rank: 1, loss = 0.10452700406312943
Per-token loss scaled by world size: 0.0024816528894007206
Epoch: 7, Step: 45, Rank: 3, loss = 0.07693123817443848
[2024-07-27 04:42:57,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[7.774790660436857e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:42:57,264] [INFO] [timer.py:258:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=18.984207118032753, CurrSamplesPerSec=19.671261884005887, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 360
{
"epoch": 7,
"step": 45,
"rank": 0,
"loss": 0.07785851508378983,
"overall_throughput": 19.632336945869298,
"lr": 7.774790660436857e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 248,
"batch_size": 8,
"total_loss": 0.10104493051767349,
"gradnorm": 1.2592891454696655,
"weight_norm": 393.46087646484375,
"timestamp": "2024-07-27T04:42:57.267026"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_360
[04:43:15] INFO saving took 17.815489530563354 seconds utils.py:611
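For post-hoc analysis, the pretty-printed JSON metric blocks are easy to scrape out of the raw log. A small sketch; the regex and residue handling are assumptions of mine, covering the tqdm progress-bar fragments that race with these prints for the terminal in a capture like this one:

```python
import json
import re

# Scrape the JSON metric blocks out of a raw training log like this one.
# The second regex strips tqdm progress-bar fragments that can end up
# inside a block when the bar and the print write to the same line.
def parse_metric_blocks(log_text: str) -> list[dict]:
    blocks = []
    for raw in re.findall(r'\{[^{}]*"timestamp"[^{}]*\}', log_text):
        cleaned = re.sub(r'[█▋▎ ]*\|\s*\d+/\d+ \[[^\]]*\]', '', raw)
        blocks.append(json.loads(cleaned))
    return blocks

# e.g. plot the loss curve:
#   metrics = parse_metric_blocks(open("train.log").read())
#   xs = [m["step"] for m in metrics]; ys = [m["total_loss"] for m in metrics]
```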
Per-token loss scaled by world size: 0.0031208053696900606
Per-token loss scaled by world size: 0.002719326876103878
Per-token loss scaled by world size: 0.00290810433216393
Per-token loss scaled by world size: 0.00502825528383255
Per-token loss scaled by world size: 0.0031488884706050158
Per-token loss scaled by world size: 0.0032260508742183447
Per-token loss scaled by world size: 0.0017572520300745964
Epoch: 7, Step: 46, Rank: 5, loss = 0.08837812393903732
Epoch: 7, Step: 46, Rank: 6, loss = 0.09451339393854141
Epoch: 7, Step: 46, Rank: 3, loss = 0.10233887284994125
Epoch: 7, Step: 46, Rank: 0, loss = 0.10142617672681808
Epoch: 7, Step: 46, Rank: 7, loss = 0.10484665632247925
Epoch: 7, Step: 46, Rank: 1, loss = 0.05711068958044052
Epoch: 7, Step: 46, Rank: 4, loss = 0.16341829299926758
Per-token loss scaled by world size: 0.00243758293800056
Epoch: 7, Step: 46, Rank: 2, loss = 0.0792214423418045
[2024-07-27 04:43:15,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[6.909830056250527e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:15,568] [INFO] [timer.py:258:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=18.98530026908578, CurrSamplesPerSec=19.0324251537424, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,█████▋ | 4/6 [00:20<00:10, 5.45s/it]
"step": 46,
"rank": 0,
"loss": 0.10142617672681808,
"overall_throughput": 18.986955901758453,
"lr": 6.909830056250527e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 260,
"batch_size": 8,
"total_loss": 0.09890670329332352,
"gradnorm": 1.4150254726409912,
"weight_norm": 393.46099853515625,
"timestamp": "2024-07-27T04:43:15.632493"
}
Per-token loss scaled by world size: 0.0018586666556075215
Per-token loss scaled by world size: 0.002927313791587949
Per-token loss scaled by world size: 0.002946708584204316
Per-token loss scaled by world size: 0.0019047270761802793
Per-token loss scaled by world size: 0.0012054119724780321
Per-token loss scaled by world size: 0.0014979788102209568
Per-token loss scaled by world size: 0.0022586516570299864
Epoch: 7, Step: 47, Rank: 0, loss = 0.10757878422737122
Epoch: 7, Step: 47, Rank: 5, loss = 0.06999871879816055
Epoch: 7, Step: 47, Rank: 4, loss = 0.10829153656959534
Epoch: 7, Step: 47, Rank: 7, loss = 0.055050719529390335
Epoch: 7, Step: 47, Rank: 2, loss = 0.06830599904060364
Epoch: 7, Step: 47, Rank: 3, loss = 0.044298890978097916
Epoch: 7, Step: 47, Rank: 1, loss = 0.08300545066595078
Per-token loss scaled by world size: 0.002645065076649189
Epoch: 7, Step: 47, Rank: 6, loss = 0.09720613807439804
[2024-07-27 04:43:15,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[6.069749683460765e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:16,046] [INFO] [timer.py:258:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=18.996249360616417, CurrSamplesPerSec=19.490837611941338, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,███████▎ | 5/6 [00:20<00:03, 3.66s/it]
"step": 47,
"rank": 0,
"loss": 0.10757878422737122,
"overall_throughput": 19.4505930343033,
"lr": 6.069749683460765e-06,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 294,
"batch_size": 8,
"total_loss": 0.07921702414751053,
"gradnorm": 1.5372004508972168,
"weight_norm": 393.4610900878906,
"timestamp": "2024-07-27T04:43:16.111313"
}
Per-token loss scaled by world size: 0.0035249628126621246
Per-token loss scaled by world size: 0.0036447104066610336
Per-token loss scaled by world size: 0.0025723562575876713
Per-token loss scaled by world size: 0.0031749033369123936
Per-token loss scaled by world size: 0.00402703694999218
Per-token loss scaled by world size: 0.0017748093232512474
Per-token loss scaled by world size: 0.00937521830201149
Epoch: 7, Step: 48, Rank: 3, loss = 0.08971092104911804
Epoch: 7, Step: 48, Rank: 7, loss = 0.12710927426815033
Epoch: 7, Step: 48, Rank: 0, loss = 0.11072475463151932
Epoch: 7, Step: 48, Rank: 4, loss = 0.14044290781021118
Epoch: 7, Step: 48, Rank: 6, loss = 0.12293307483196259
Epoch: 7, Step: 48, Rank: 5, loss = 0.3269607424736023
Epoch: 7, Step: 48, Rank: 2, loss = 0.06189647689461708
Per-token loss scaled by world size: 0.004552490543574095
Epoch: 7, Step: 48, Rank: 1, loss = 0.15876810252666473
[2024-07-27 04:43:16,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[5.2613133752700145e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:16,523] [INFO] [timer.py:258:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=19.007811150191724, CurrSamplesPerSec=19.543068281625303, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 7,█████████| 6/6 [00:21<00:00, 2.57s/it]
"step": 48,
"rank": 0,
"loss": 0.11072475463151932,
"overall_throughput": 19.50403630817306,
"lr": 5.2613133752700145e-06,
"cuda_mem_allocated": 21.99084711074829,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 279,
"batch_size": 8,
"total_loss": 0.14231827855110168,
"gradnorm": 2.0794081687927246,
"weight_norm": 393.461181640625,
"timestamp": "2024-07-27T04:43:16.587633"
}
Epoch 7: 100%|██████████| 6/6 [00:21<00:00, 3.52s/it]
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 0 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 0 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 6 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
Per-token loss scaled by world size: 0.007703100331127644
Per-token loss scaled by world size: 0.002897256053984165
Per-token loss scaled by world size: 0.0018762396648526192
Per-token loss scaled by world size: 0.0031769108027219772
Per-token loss scaled by world size: 0.0031007928773760796
Per-token loss scaled by world size: 0.0032394849695265293
Per-token loss scaled by world size: 0.0033565827179700136
Epoch: 8, Step: 49, Rank: 6, loss = 0.08872846513986588
Epoch: 8, Step: 49, Rank: 2, loss = 0.2359074503183365
Epoch: 8, Step: 49, Rank: 7, loss = 0.057459838688373566
Epoch: 8, Step: 49, Rank: 0, loss = 0.09729289263486862
Epoch: 8, Step: 49, Rank: 5, loss = 0.09496178478002548
Epoch: 8, Step: 49, Rank: 4, loss = 0.10279534757137299
Epoch: 8, Step: 49, Rank: 1, loss = 0.09920922666788101
Per-token loss scaled by world size: 0.001880081370472908
Epoch: 8, Step: 49, Rank: 3, loss = 0.05757749080657959
[2024-07-27 04:43:17,376] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[4.491030185478976e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:17,453] [INFO] [timer.py:258:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=18.97237396051445, CurrSamplesPerSec=17.473818511885575, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s]
{
"epoch": 8,
"step": 49,
"rank": 0,
"loss": 0.09729289263486862,
"overall_throughput": 17.406956156346215,
"lr": 4.491030185478976e-06,
"cuda_mem_allocated": 21.990607738494873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 245,
"batch_size": 8,
"total_loss": 0.10424157232046127,
"gradnorm": 2.468442678451538,
"weight_norm": 393.4612731933594,
"timestamp": "2024-07-27T04:43:17.517240"
}
Per-token loss scaled by world size: 0.0022910190746188164
Per-token loss scaled by world size: 0.003779459511861205
Per-token loss scaled by world size: 0.0047139013186097145
Per-token loss scaled by world size: 0.0016656998777762055
Per-token loss scaled by world size: 0.0011160913854837418
Per-token loss scaled by world size: 0.002607797970995307
Per-token loss scaled by world size: 0.0011160913854837418
Epoch: 8, Step: 50, Rank: 4, loss = 0.03934222087264061
Epoch: 8, Step: 50, Rank: 1, loss = 0.05871592089533806
Epoch: 8, Step: 50, Rank: 5, loss = 0.16616502404212952
Epoch: 8, Step: 50, Rank: 0, loss = 0.1332259476184845
Epoch: 8, Step: 50, Rank: 2, loss = 0.08075842261314392
Epoch: 8, Step: 50, Rank: 6, loss = 0.03934222087264061
Epoch: 8, Step: 50, Rank: 7, loss = 0.09192487597465515
Per-token loss scaled by world size: 0.0013738555135205388
Epoch: 8, Step: 50, Rank: 3, loss = 0.048428408801555634
[2024-07-27 04:43:17,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[3.7651019814126656e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:17,936] [INFO] [timer.py:258:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=18.977750407321444, CurrSamplesPerSec=19.233927031934993, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 400
{
"epoch": 8,
"step": 50,
"rank": 0,
"loss": 0.1332259476184845,
"overall_throughput": 19.198084907835618,
"lr": 3.7651019814126656e-06,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 282,
"batch_size": 8,
"total_loss": 0.08223787695169449,
"gradnorm": 1.6959415674209595,
"weight_norm": 393.4613342285156,
"timestamp": "2024-07-27T04:43:17.940040"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_400
[04:43:35] INFO saving took 17.87660813331604 seconds utils.py:611
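The DeepSpeed timer lines report throughput in samples per second; with an effective batch size of 8 that converts directly to per-step wall time. Quick arithmetic using the step-52 timer line below; treating CurrSamplesPerSec as whole-batch throughput is an assumption:

```python
# CurrSamplesPerSec from the step-52 timer line; assuming it measures
# whole-batch throughput, this gives the per-optimizer-step wall time.
curr_samples_per_sec = 20.052273645410278
batch_size = 8                              # --effective-batch-size 8
print(batch_size / curr_samples_per_sec)    # -> ~0.40 s per optimizer step
```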
Epoch 8: 33%|███▎ | 2/6 [00:19<00:44, 11.14s/it]
Per-token loss scaled by world size: 0.0034444250632077456
Per-token loss scaled by world size: 0.0036195043940097094
Per-token loss scaled by world size: 0.0021303421817719936
Per-token loss scaled by world size: 0.002188930055126548
Per-token loss scaled by world size: 0.0012819116236642003
Per-token loss scaled by world size: 0.0027832810301333666
Per-token loss scaled by world size: 0.0016897486057132483
Epoch: 8, Step: 51, Rank: 1, loss = 0.11129976063966751
Epoch: 8, Step: 51, Rank: 7, loss = 0.06550802290439606
Epoch: 8, Step: 51, Rank: 5, loss = 0.08558589220046997
Epoch: 8, Step: 51, Rank: 2, loss = 0.06730959564447403
Epoch: 8, Step: 51, Rank: 0, loss = 0.10591606795787811
Epoch: 8, Step: 51, Rank: 3, loss = 0.0394187830388546
Epoch: 8, Step: 51, Rank: 6, loss = 0.05195976793766022
Per-token loss scaled by world size: 0.003074637847021222
Epoch: 8, Step: 51, Rank: 4, loss = 0.09454511106014252
[2024-07-27 04:43:36,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[3.089373510131354e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:36,312] [INFO] [timer.py:258:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=18.971761108564685, CurrSamplesPerSec=18.68865417133589, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 50%|█████ | 3/6 [00:19<00:18, 6.28s/it]
{
"epoch": 8,
"step": 51,
"rank": 0,
"loss": 0.10591606795787811,
"overall_throughput": 18.651619864047372,
"lr": 3.089373510131354e-06,
"cuda_mem_allocated": 21.988811492919922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.07769287377595901,
"gradnorm": 0.9065935611724854,
"weight_norm": 393.46136474609375,
"timestamp": "2024-07-27T04:43:36.375570"
}
Per-token loss scaled by world size: 0.005752637051045895
Per-token loss scaled by world size: 0.00271693360991776
Per-token loss scaled by world size: 0.004330337047576904
Per-token loss scaled by world size: 0.005382548552006483
Per-token loss scaled by world size: 0.0025455320719629526
Per-token loss scaled by world size: 0.0023602889850735664
Per-token loss scaled by world size: 0.00044713294482789934
Epoch: 8, Step: 52, Rank: 1, loss = 0.08286647498607635
Epoch: 8, Step: 52, Rank: 0, loss = 0.17545543611049652
Epoch: 8, Step: 52, Rank: 6, loss = 0.13207527995109558
Epoch: 8, Step: 52, Rank: 2, loss = 0.16416773200035095
Epoch: 8, Step: 52, Rank: 7, loss = 0.07763873040676117
Epoch: 8, Step: 52, Rank: 5, loss = 0.07198881357908249
Epoch: 8, Step: 52, Rank: 4, loss = 0.013637554831802845
Per-token loss scaled by world size: 0.001792231691069901
Epoch: 8, Step: 52, Rank: 3, loss = 0.05466306582093239
[2024-07-27 04:43:36,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[2.469285339963892e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:36,777] [INFO] [timer.py:258:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=18.99222895360799, CurrSamplesPerSec=20.052273645410278, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 67%|██████▋ | 4/6 [00:20<00:07, 3.98s/it]
{
"epoch": 8,
"step": 52,
"rank": 0,
"loss": 0.17545543611049652,
"overall_throughput": 20.013534639121023,
"lr": 2.469285339963892e-06,
"cuda_mem_allocated": 21.989290714263916,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 244,
"batch_size": 8,
"total_loss": 0.09656163305044174,
"gradnorm": 1.5890859365463257,
"weight_norm": 393.4613952636719,
"timestamp": "2024-07-27T04:43:36.840099"
}
Per-token loss scaled by world size: 0.0025300777051597834
Per-token loss scaled by world size: 0.0022664989810436964
Per-token loss scaled by world size: 0.006000937893986702
Per-token loss scaled by world size: 0.002840510569512844
Per-token loss scaled by world size: 0.004035668447613716
Per-token loss scaled by world size: 0.0041307490319013596
Per-token loss scaled by world size: 0.003075978020206094
Epoch: 8, Step: 53, Rank: 5, loss = 0.07309459149837494
Epoch: 8, Step: 53, Rank: 4, loss = 0.09160646796226501
Epoch: 8, Step: 53, Rank: 6, loss = 0.19353024661540985
Epoch: 8, Step: 53, Rank: 1, loss = 0.13015030324459076
Epoch: 8, Step: 53, Rank: 3, loss = 0.13321664929389954
Epoch: 8, Step: 53, Rank: 2, loss = 0.08159500360488892
Epoch: 8, Step: 53, Rank: 7, loss = 0.0992002934217453
Per-token loss scaled by world size: 0.0016319038113579154
Epoch: 8, Step: 53, Rank: 0, loss = 0.05262889713048935
[2024-07-27 04:43:37,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[1.9098300562505266e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:37,244] [INFO] [timer.py:258:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=19.00944478158272, CurrSamplesPerSec=19.91191964124113, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 83%|████████▎ | 5/6 [00:20<00:02, 2.71s/it]
{
"epoch": 8,
"step": 53,
"rank": 0,
"loss": 0.05262889713048935,
"overall_throughput": 19.87502717277025,
"lr": 1.9098300562505266e-06,
"cuda_mem_allocated": 21.99288320541382,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 258,
"batch_size": 8,
"total_loss": 0.10687780380249023,
"gradnorm": 1.6277161836624146,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:37.309449"
}
Per-token loss scaled by world size: 0.004982769954949617
Per-token loss scaled by world size: 0.0023371989373117685
Per-token loss scaled by world size: 0.001956745982170105
Per-token loss scaled by world size: 0.0019846318755298853
Per-token loss scaled by world size: 0.001973965670913458
Per-token loss scaled by world size: 0.001133645768277347
Per-token loss scaled by world size: 0.0006779870600439608
Epoch: 8, Step: 54, Rank: 0, loss = 0.19868795573711395
Epoch: 8, Step: 54, Rank: 6, loss = 0.09319580346345901
Epoch: 8, Step: 54, Rank: 5, loss = 0.07802524417638779
Epoch: 8, Step: 54, Rank: 3, loss = 0.07871188223361969
Epoch: 8, Step: 54, Rank: 7, loss = 0.045204125344753265
Epoch: 8, Step: 54, Rank: 4, loss = 0.027034733444452286
Epoch: 8, Step: 54, Rank: 2, loss = 0.07913719862699509
Per-token loss scaled by world size: 0.0017750355182215571
Epoch: 8, Step: 54, Rank: 1, loss = 0.07077953964471817
[2024-07-27 04:43:37,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[1.4155120639813392e-06], mom=[(0.9, 0.95)]
[2024-07-27 04:43:37,723] [INFO] [timer.py:258:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=19.017755360925854, CurrSamplesPerSec=19.451449970580306, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 1.95s/it]
{
"epoch": 8,
"step": 54,
"rank": 0,
"loss": 0.19868795573711395,
"overall_throughput": 19.41544489924397,
"lr": 1.4155120639813392e-06,
"cuda_mem_allocated": 21.990128993988037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 319,
"batch_size": 8,
"total_loss": 0.0838470607995987,
"gradnorm": 0.9820513129234314,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:37.787702"
}
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it]
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 7 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26
Per-token loss scaled by world size: 0.0020672364626079798
Per-token loss scaled by world size: 0.005803861655294895
Per-token loss scaled by world size: 0.0010450059780851007
Per-token loss scaled by world size: 0.00481435377150774
Per-token loss scaled by world size: 0.004757868126034737
Per-token loss scaled by world size: 0.0012225221144035459
Per-token loss scaled by world size: 0.003656236920505762
Epoch: 9, Step: 55, Rank: 4, loss = 0.14262522757053375
Epoch: 9, Step: 55, Rank: 5, loss = 0.1719394028186798
Epoch: 9, Step: 55, Rank: 2, loss = 0.030958302319049835
Epoch: 9, Step: 55, Rank: 1, loss = 0.06124188005924225
Epoch: 9, Step: 55, Rank: 0, loss = 0.14095184206962585
Epoch: 9, Step: 55, Rank: 7, loss = 0.03621721640229225
Epoch: 9, Step: 55, Rank: 3, loss = 0.10831601917743683
Per-token loss scaled by world size: 0.003545596729964018
Epoch: 9, Step: 55, Rank: 6, loss = 0.10503830015659332
[2024-07-27 04:43:38,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[9.903113209758098e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:38,646] [INFO] [timer.py:258:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=18.958935135030256, CurrSamplesPerSec=16.33220426430826, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 440
{
"epoch": 9,
"step": 55,
"rank": 0,
"loss": 0.14095184206962585,
"overall_throughput": 16.273810095413527,
"lr": 9.903113209758098e-07,
"cuda_mem_allocated": 21.989410877227783,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 237,
"batch_size": 8,
"total_loss": 0.09966102987527847,
"gradnorm": 1.0968877077102661,
"weight_norm": 393.4614562988281,
"timestamp": "2024-07-27T04:43:38.650582"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_440
[04:43:56] INFO saving took 17.79723310470581 seconds utils.py:611
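Each hf_format/samples_* directory written above is a standard Hugging Face checkpoint directory, so it should load with transformers in the usual way. A sketch, not run as part of this log; the path is one of the checkpoints printed above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the checkpoints written by this run (path taken from the log).
ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_440"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
```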
Per-token loss scaled by world size: 0.005642743315547705
Per-token loss scaled by world size: 0.002617186401039362
Per-token loss scaled by world size: 0.0045571294613182545
Per-token loss scaled by world size: 0.002132992260158062
Per-token loss scaled by world size: 0.0015219022752717137
Per-token loss scaled by world size: 0.003468153765425086
Per-token loss scaled by world size: 0.0018528653308749199
Epoch: 9, Step: 56, Rank: 0, loss = 0.14867635071277618
Epoch: 9, Step: 56, Rank: 7, loss = 0.08538571000099182
Epoch: 9, Step: 56, Rank: 5, loss = 0.06958886981010437
Epoch: 9, Step: 56, Rank: 4, loss = 0.1131485179066658
Epoch: 9, Step: 56, Rank: 2, loss = 0.049652062356472015
Epoch: 9, Step: 56, Rank: 1, loss = 0.06044973060488701
Epoch: 9, Step: 56, Rank: 6, loss = 0.18409450352191925
Per-token loss scaled by world size: 0.001767554902471602
Epoch: 9, Step: 56, Rank: 3, loss = 0.05766648054122925
[2024-07-27 04:43:56,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[6.37651293602628e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:56,928] [INFO] [timer.py:258:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=18.96440824470298, CurrSamplesPerSec=19.25907525027751, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,██▎ | 2/6 [00:19<00:31, 7.95s/it]
"step": 56,
"rank": 0,
"loss": 0.14867635071277618,
"overall_throughput": 19.213232991090933,
"lr": 6.37651293602628e-07,
"cuda_mem_allocated": 21.990248203277588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 261,
"batch_size": 8,
"total_loss": 0.09608278423547745,
"gradnorm": 1.2486889362335205,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:56.991985"
}
Per-token loss scaled by world size: 0.006159770302474499
Per-token loss scaled by world size: 0.004680618178099394
Per-token loss scaled by world size: 0.003500568214803934
Per-token loss scaled by world size: 0.002827225485816598
Per-token loss scaled by world size: 0.001885988749563694
Per-token loss scaled by world size: 0.002812023274600506
Per-token loss scaled by world size: 0.0035085994750261307
Epoch: 9, Step: 57, Rank: 6, loss = 0.10764247179031372
Epoch: 9, Step: 57, Rank: 4, loss = 0.08693718165159225
Epoch: 9, Step: 57, Rank: 3, loss = 0.14392900466918945
Epoch: 9, Step: 57, Rank: 0, loss = 0.05799415335059166
Epoch: 9, Step: 57, Rank: 5, loss = 0.08646971732378006
Epoch: 9, Step: 57, Rank: 7, loss = 0.10788943618535995
Epoch: 9, Step: 57, Rank: 2, loss = 0.1894129365682602
Per-token loss scaled by world size: 0.0016236526425927877
Epoch: 9, Step: 57, Rank: 1, loss = 0.04992732033133507
[2024-07-27 04:43:57,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[3.603713930414676e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:57,395] [INFO] [timer.py:258:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=18.980723963154006, CurrSamplesPerSec=19.905493724517065, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,████ | 3/6 [00:19<00:13, 4.53s/it]
"step": 57,
"rank": 0,
"loss": 0.05799415335059166,
"overall_throughput": 19.839444119579092,
"lr": 3.603713930414676e-07,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 246,
"batch_size": 8,
"total_loss": 0.1037752702832222,
"gradnorm": 1.5781608819961548,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:57.458487"
}
Per-token loss scaled by world size: 0.00027386093279346824
Per-token loss scaled by world size: 0.002793475054204464
Per-token loss scaled by world size: 0.0012327907606959343
Per-token loss scaled by world size: 0.0018183693755418062
Per-token loss scaled by world size: 0.0011149498168379068
Per-token loss scaled by world size: 0.0009586562518961728
Per-token loss scaled by world size: 0.006267122458666563
Epoch: 9, Step: 58, Rank: 3, loss = 0.09497815370559692
Epoch: 9, Step: 58, Rank: 4, loss = 0.04191488400101662
Epoch: 9, Step: 58, Rank: 7, loss = 0.032594311982393265
Epoch: 9, Step: 58, Rank: 5, loss = 0.06182456016540527
Epoch: 9, Step: 58, Rank: 2, loss = 0.037908293306827545
Epoch: 9, Step: 58, Rank: 6, loss = 0.009311271831393242
Epoch: 9, Step: 58, Rank: 1, loss = 0.21308216452598572
Per-token loss scaled by world size: 0.0013049639528617263
Epoch: 9, Step: 58, Rank: 0, loss = 0.04436877369880676
[2024-07-27 04:43:57,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[1.6070411401370335e-07], mom=[(0.9, 0.95)]
[2024-07-27 04:43:57,863] [INFO] [timer.py:258:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=18.99479756047954, CurrSamplesPerSec=19.802352008035566, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,█████▋ | 4/6 [00:20<00:05, 2.93s/it]
"step": 58,
"rank": 0,
"loss": 0.04436877369880676,
"overall_throughput": 19.7424303222504,
"lr": 1.6070411401370335e-07,
"cuda_mem_allocated": 21.992165088653564,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 272,
"batch_size": 8,
"total_loss": 0.06699780374765396,
"gradnorm": 1.2012358903884888,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:57.924678"
}
Per-token loss scaled by world size: 0.0033104075118899345
Per-token loss scaled by world size: 0.005029842257499695
Per-token loss scaled by world size: 0.0013227150775492191
Per-token loss scaled by world size: 0.0013601266546174884
Per-token loss scaled by world size: 0.0020338338799774647
Per-token loss scaled by world size: 0.002029073191806674
Per-token loss scaled by world size: 0.0012528002262115479
Epoch: 9, Step: 59, Rank: 1, loss = 0.181074321269989
Epoch: 9, Step: 59, Rank: 0, loss = 0.11917466670274734
Epoch: 9, Step: 59, Rank: 4, loss = 0.04761774465441704
Epoch: 9, Step: 59, Rank: 6, loss = 0.07321801781654358
Epoch: 9, Step: 59, Rank: 2, loss = 0.04896456003189087
Epoch: 9, Step: 59, Rank: 7, loss = 0.04510080814361572
Epoch: 9, Step: 59, Rank: 3, loss = 0.07304663211107254
Per-token loss scaled by world size: 0.0016570077277719975
Epoch: 9, Step: 59, Rank: 5, loss = 0.05965227633714676
[2024-07-27 04:43:58,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[4.025706004760932e-08], mom=[(0.9, 0.95)]
[2024-07-27 04:43:58,339] [INFO] [timer.py:258:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=19.001181258681584, CurrSamplesPerSec=19.36564785840185, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
{
"epoch": 9,███████▎ | 5/6 [00:20<00:02, 2.04s/it]
"step": 59,
"rank": 0,
"loss": 0.11917466670274734,
"overall_throughput": 19.311280902601126,
"lr": 4.025706004760932e-08,
"cuda_mem_allocated": 21.98869228363037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 288,
"batch_size": 8,
"total_loss": 0.08098112046718597,
"gradnorm": 1.2536462545394897,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:58.402303"
}
Per-token loss scaled by world size: 0.0025534227024763823
Per-token loss scaled by world size: 0.002198881469666958
Per-token loss scaled by world size: 0.003101743757724762
Per-token loss scaled by world size: 0.0017734984867274761
Per-token loss scaled by world size: 0.001557655748911202
Per-token loss scaled by world size: 0.0014592667575925589
Per-token loss scaled by world size: 0.00225572707131505
Epoch: 9, Step: 60, Rank: 6, loss = 0.11088734120130539
Epoch: 9, Step: 60, Rank: 0, loss = 0.0912848636507988
Epoch: 9, Step: 60, Rank: 2, loss = 0.05568619444966316
Epoch: 9, Step: 60, Rank: 7, loss = 0.05216878652572632
Epoch: 9, Step: 60, Rank: 4, loss = 0.07861001044511795
Epoch: 9, Step: 60, Rank: 5, loss = 0.06340257078409195
Epoch: 9, Step: 60, Rank: 3, loss = 0.08064224570989609
Per-token loss scaled by world size: 0.0007612230838276446
Epoch: 9, Step: 60, Rank: 1, loss = 0.027213726192712784
[2024-07-27 04:43:58,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 04:43:58,814] [INFO] [timer.py:258:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=19.00919914338529, CurrSamplesPerSec=19.4776793799544, MemAllocated=21.99GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 480
{
"epoch": 9,
"step": 60,
"rank": 0,
"loss": 0.0912848636507988,
"overall_throughput": 19.42523488114542,
"lr": 0.0,
"cuda_mem_allocated": 21.988811492919922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 286,
"batch_size": 8,
"total_loss": 0.06998696178197861,
"gradnorm": 1.0276967287063599,
"weight_norm": 393.46148681640625,
"timestamp": "2024-07-27T04:43:58.817101"
}
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_480
[04:44:16] INFO saving took 17.839636087417603 seconds utils.py:611
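The samples_seen count is consistent with the run flags: 60 optimizer steps at an effective batch size of 8 gives 60 × 8 = 480. Since the checkpoint is written in Hugging Face format, it should load with the stock transformers API; a minimal sketch (the checkpoint path is copied from the log above, the prompt is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_480"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("What is InstructLab?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))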
Epoch 9: 100%|██████████| 6/6 [00:38<00:00, 6.49s/it]
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:550 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:573 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:621 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:262:1033 [2] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-rhel-newimage:265:1035 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:752 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:428 -> 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:261:1045 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:58 -> 3
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:267:1037 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:564 -> 3
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:264:1041 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 4, res=3, closed=0
tyler-rhel-newimage:265:1035 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0
tyler-rhel-newimage:260:1039 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:264:1041 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Close from rank 4, retcode 3
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:775 -> 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:668 -> 3
tyler-rhel-newimage:265:1035 [5] proxy.cc:1521 NCCL WARN [Proxy Service 5] Failed to execute operation Close from rank 5, retcode 3
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:260:1039 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
tyler-rhel-newimage:267:1037 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
tyler-rhel-newimage:262:1033 [2] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 2, res=3, closed=0
tyler-rhel-newimage:260:1039 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
tyler-rhel-newimage:266:1031 [6] proxy.cc:1521 NCCL WARN [Proxy Service 6] Failed to execute operation Close from rank 6, retcode 3
tyler-rhel-newimage:263:1043 [3] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-rhel-newimage:267:1037 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Close from rank 7, retcode 3
tyler-rhel-newimage:262:1033 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Close from rank 2, retcode 3
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:826 -> 3
tyler-rhel-newimage:263:1043 [3] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 3, res=3, closed=0
tyler-rhel-newimage:263:1043 [3] proxy.cc:1521 NCCL WARN [Proxy Service 3] Failed to execute operation Close from rank 3, retcode 3
tyler-rhel-newimage:260:43160 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
tyler-rhel-newimage:266:43159 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
tyler-rhel-newimage:267:43156 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
tyler-rhel-newimage:262:43163 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
tyler-rhel-newimage:263:43158 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
tyler-rhel-newimage:264:43161 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
tyler-rhel-newimage:265:43162 [5] NCCL INFO comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
tyler-rhel-newimage:261:43157 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
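The NCCL INFO socket traces and the "Accept failed" / "Failed to execute operation Close" warnings above are emitted while the communicators are torn down after the final checkpoint; training itself completed. If the teardown noise is unwanted on future runs, NCCL's standard NCCL_DEBUG environment variable can be lowered before the process group is created (for a CLI launch like this one, exporting the variable in the shell has the same effect); a minimal sketch:

import os

# NCCL reads NCCL_DEBUG when the communicators are initialized, so this
# must run before torch.distributed / DeepSpeed set up the process group.
os.environ["NCCL_DEBUG"] = "WARN"  # keep warnings, drop the INFO socket traces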
Terminating process 🤖
[root@tyler-rhel-newimage instructlab]#