Created July 27, 2024 04:44
newtraining output
[root@tyler-rhel-newimage instructlab]# /root/ilab model train --data-path /var/instructlabbigdisk/instructlab/generateddata/messages_Mixtral-8x7B-Instruct-v0_2024-07-27T04_27_23.jsonl --model-path /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --ckpt-output-dir /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --device cuda --gpus 8 --max-batch-len 1 --effective-batch-size 8 --save-samples 46 | |
[2024-07-27 04:38:32,852] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
INFO 2024-07-27 04:38:36,486 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable. | |
INFO 2024-07-27 04:38:36,486 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16. | |
INFO 2024-07-27 04:38:36,486 numexpr.utils:161: NumExpr defaulting to 16 threads. | |
INFO 2024-07-27 04:38:36,869 datasets:58: PyTorch version 2.3.1 available. | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
INFO 2024-07-27 04:38:37,191 root:611: !!!!!!!! tokenizer has add_bos_token or add_eos_token | |
INFO 2024-07-27 04:38:37,196 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005 | |
tokenizing the dataset with /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ tokenizer... | |
ten largest length percentiles: | |
quantile 90th: 78.0 | |
quantile 91th: 79.80000000000001 | |
quantile 92th: 83.19999999999999 | |
quantile 93th: 86.80000000000001 | |
quantile 94th: 89.19999999999999 | |
quantile 95th: 91.0 | |
quantile 96th: 93.59999999999997 | |
quantile 97th: 97.19999999999999 | |
quantile 98th: 100.70000000000002 | |
quantile 99th: 103.84999999999998 | |
quantile 100th: 107.0 | |
at 4096 max sequence length, the number of samples to be dropped is 0 | |
(0.00% of total) | |
quantile 0th: 44.0 | |
quantile 1th: 44.45 | |
quantile 2th: 44.9 | |
quantile 3th: 45.0 | |
quantile 4th: 45.0 | |
quantile 5th: 45.0 | |
quantile 6th: 45.0 | |
quantile 7th: 45.15 | |
quantile 8th: 45.6 | |
quantile 9th: 46.1 | |
quantile 10th: 47.0 | |
at 20 min sequence length, the number of samples to be dropped is 0 | |
checking the validity of the samples... | |
INFO 2024-07-27 04:38:42,745 root:611: number of dropped samples: 0 -- out of 46 | |
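The percentile report and drop counts above can be approximated with a few lines of Python. This is a minimal sketch, not the InstructLab implementation: it assumes linear interpolation between sorted lengths (fractional values such as 79.80000000000001 suggest it) and assumes a sample is dropped when it is strictly longer than the maximum or strictly shorter than the minimum. The names `percentile` and `length_report` are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def length_report(lengths, max_seq_len=4096, min_seq_len=20):
    """Return top/bottom percentile tables and the number of samples to drop."""
    vals = sorted(lengths)
    top = {q: percentile(vals, q) for q in range(90, 101)}
    bottom = {q: percentile(vals, q) for q in range(0, 11)}
    # Assumption: strict comparisons decide which samples are dropped.
    too_long = sum(1 for n in vals if n > max_seq_len)
    too_short = sum(1 for n in vals if n < min_seq_len)
    return top, bottom, too_long, too_short


if __name__ == "__main__":
    # Illustrative token counts, not the 46 samples from this run.
    top, bottom, too_long, too_short = length_report([44, 45, 47, 78, 107])
    print(top[100], bottom[0], too_long, too_short)
```

With token lengths between 44 and 107, as in this run, neither threshold removes any sample, which is consistent with the two "samples to be dropped is 0" lines.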
Categorizing training data type... | |
Data type sorting: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 506398.91it/s] | |
unmasking the appropriate message content... | |
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens. | |
Instruction ex sample 16: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> | |
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|> | |
Original Input: <|user|> | |
Question: How many villages named "Qarah Tappeh" are mentioned in the text? | |
<|assistant|> | |
Answer: There are 7 villages named "Qarah Tappeh" mentioned in the text.<|endoftext|> | |
Instruction ex sample 39: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> | |
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|> | |
Original Input: <|user|> | |
Question: How many villages named "Qarah Tappeh" are mentioned in the text, each with a different location? | |
<|assistant|> | |
Answer: There are eight villages named "Qarah Tappeh" in the text, each located in a different rural district or county.<|endoftext|> | |
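The masking shown in the two samples above, where the user turn is replaced by `<mask>` and only the assistant answer is learned, can be sketched as follows. This is an illustration under stated assumptions rather than the training library's code: the special-token IDs are taken from the tokenizer line earlier in this log (eos 32001, user 32004, assistant 32005), the role markers themselves are assumed to be masked, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `mask_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
EOS_ID, USER_ID, ASSISTANT_ID = 32001, 32004, 32005  # IDs printed earlier in this log


def mask_labels(input_ids):
    """Copy input_ids into labels, masking everything outside assistant turns."""
    labels = []
    learning = False
    for tok in input_ids:
        if tok == ASSISTANT_ID:
            learning = True
            labels.append(IGNORE_INDEX)  # assumption: the role marker is masked
        elif tok == USER_ID:
            learning = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if learning else IGNORE_INDEX)
    return labels


if __name__ == "__main__":
    # <|user|> q q <|assistant|> a a <|endoftext|>, with made-up content IDs
    print(mask_labels([USER_ID, 10, 11, ASSISTANT_ID, 20, 21, EOS_ID]))
```

The end-of-text token stays unmasked in this sketch, matching the samples above where `<|endoftext|>` appears after the answer.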
Creating json from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 172.75ba/s] | |
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ --data_path=/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/instructlabbigdisk/instructlab/knowledgecheckpoints/ --num_epochs=10 --effective_batch_size=8 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=46 --log_level=INFO --max_batch_len=1 --seed=42 --chat-tmpl-path=/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py | |
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] | |
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] ***************************************** | |
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. | |
W0727 04:38:44.209000 140472589177280 torch/distributed/run.py:757] ***************************************** | |
[2024-07-27 04:38:47,197] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,436] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,460] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,488] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,592] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,593] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,603] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[2024-07-27 04:38:47,623] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io requires the dev libaio .so object and headers but these were not found. | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] async_io: please install the libaio-devel package with yum | |
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. | |
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 | |
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
[04:38:50] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
model_name_or_path: /var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/ | |
data_path: /var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl | |
output_dir: /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ | |
num_epochs: 10 | |
last_step: 0 | |
effective_batch_size: 8 | |
learning_rate: 2.0e-05 | |
lr_scheduler: cosine | |
num_warmup_steps: 25 | |
save_samples: 46 | |
save_samples_ds: null | |
save_last: false | |
log_level: INFO | |
seed: 42 | |
mock_data: false | |
mock_len: 2600 | |
sharding_strategy: FULL_SHARD | |
is_granite: false | |
lora_r: 0 | |
lora_alpha: 32 | |
lora_dropout: 0.1 | |
lora_quant_bits: null | |
lora_target_modules: null | |
max_batch_len: 1 | |
cpu_offload_optimizer: false | |
cpu_offload_optimizer_pin_memory: false | |
cpu_offload_optimizer_ratio: 1.0 | |
NEFTune_alpha: null | |
chat_tmpl_path: /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py | |
disable_flash_attn: false | |
{ | |
"script_params": { | |
"model_name_or_path": "/var/instructlabbigdisk/instructlab/models/ibm-granite/granite-7b-base/", | |
"data_path": "/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl", | |
"output_dir": "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/", | |
"num_epochs": 10, | |
"last_step": 0, | |
"effective_batch_size": 8, | |
"learning_rate": 2e-05, | |
"lr_scheduler": "cosine", | |
"num_warmup_steps": 25, | |
"save_samples": 46, | |
"save_samples_ds": null, | |
"save_last": false, | |
"log_level": "INFO", | |
"seed": 42, | |
"mock_data": false, | |
"mock_len": 2600, | |
"sharding_strategy": "FULL_SHARD", | |
"is_granite": false, | |
"lora_r": 0, | |
"lora_alpha": 32, | |
"lora_dropout": 0.1, | |
"lora_quant_bits": null, | |
"lora_target_modules": null, | |
"max_batch_len": 1, | |
"cpu_offload_optimizer": false, | |
"cpu_offload_optimizer_pin_memory": false, | |
"cpu_offload_optimizer_ratio": 1.0, | |
"NEFTune_alpha": null, | |
"chat_tmpl_path": "/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py", | |
"disable_flash_attn": false | |
}, | |
"timestamp": "2024-07-27T04:38:51.166561" | |
} | |
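The parameters above combine `lr_scheduler: cosine`, `learning_rate: 2e-05`, and `num_warmup_steps: 25`. A common form of such a schedule is a linear warmup followed by a cosine decay to zero; the sketch below shows that form. It is an assumption that this run uses exactly this shape, and `total_steps=60` is a purely illustrative value, since the log does not state the total number of optimizer steps. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, base_lr=2e-5, warmup_steps=25, total_steps=60):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 12, 25, 42, 60):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at 2e-05 exactly when warmup ends at step 25, so a run with few total steps spends a large share of training below the nominal learning rate.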
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
[2024-07-27 04:38:51,196] [INFO] [comm.py:637:init_distributed] cdb=None | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
[2024-07-27 04:38:51,244] [INFO] [comm.py:637:init_distributed] cdb=None | |
[2024-07-27 04:38:51,244] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message. | |
[04:38:51] INFO !!!!!!!! tokenizer has add_bos_token or add_eos_token utils.py:611 | |
[2024-07-27 04:38:51,961] [INFO] [comm.py:637:init_distributed] cdb=None | |
[2024-07-27 04:38:52,111] [INFO] [comm.py:637:init_distributed] cdb=None | |
tyler-rhel-newimage:260:260 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:260:260 [0] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:260:260 [0] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
[2024-07-27 04:38:52,191] [INFO] [comm.py:637:init_distributed] cdb=None | |
[2024-07-27 04:38:52,200] [INFO] [comm.py:637:init_distributed] cdb=None | |
tyler-rhel-newimage:265:265 [5] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:263:263 [3] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:265:265 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:265:265 [5] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:263:263 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:263:263 [3] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
[2024-07-27 04:38:52,222] [INFO] [comm.py:637:init_distributed] cdb=None | |
[2024-07-27 04:38:52,228] [INFO] [comm.py:637:init_distributed] cdb=None | |
tyler-rhel-newimage:264:264 [4] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:264:264 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:264:264 [4] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:267:267 [7] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:267:267 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:267:267 [7] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:261:261 [1] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:261:261 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:261:261 [1] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:266:266 [6] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:266:266 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:266:266 [6] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:262:262 [2] NCCL INFO cudaDriverVersion 12040 | |
tyler-rhel-newimage:262:262 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:262:262 [2] NCCL INFO NCCL version 2.22.3+cuda12.5 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Using network Socket | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO Using network Socket | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO Using network Socket | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO Using network Socket | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO Using network Socket | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO Using network Socket | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO Using network Socket | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin. | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/IB : No device found. | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0> | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO Using network Socket | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init START | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO NVLS multicast support is not available on dev 3 | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO NVLS multicast support is not available on dev 1 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO NVLS multicast support is not available on dev 0 | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO NVLS multicast support is not available on dev 2 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO NVLS multicast support is not available on dev 4 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO NVLS multicast support is not available on dev 5 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO NVLS multicast support is not available on dev 7 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO NVLS multicast support is not available on dev 6 | |
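The affinity values in the lines above look like Linux-style CPU masks: comma-separated groups of hexadecimal digits in which each set bit selects one CPU. Assuming that format, the sketch below decodes a mask into a list of CPU indices. The helper name `cpus_in_mask` is hypothetical.

```python
def cpus_in_mask(mask):
    """Decode a comma-separated hex CPU mask into a sorted list of CPU indices."""
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    first = cpus_in_mask("ff,ffffffff")                # mask logged for GPUs 0-3
    second = cpus_in_mask("ffff,ffffff00,00000000")    # mask logged for GPUs 4-7
    print(first[0], first[-1], second[0], second[-1])
```

Read this way, GPUs 0-3 are bound to CPUs 0-39 and GPUs 4-7 to CPUs 40-79, which agrees with the 80 virtual cores reported by NumExpr earlier in the log.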
tyler-rhel-newimage:266:1029 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO comm 0x558446a0b990 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO ncclCommInitRank comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO ncclCommInitRank comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO ncclCommInitRank comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:265:1025 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.77 (kernels 0.15, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:266:1029 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:267:1027 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.76 (kernels 0.20, bootstrap 0.22, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO ncclCommInitRank comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:260:1019 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.81 (kernels 0.13, bootstrap 0.35, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO ncclCommInitRank comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:262:1030 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.75 (kernels 0.24, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin. | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO ncclCommInitRank comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO ncclCommInitRank comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO ncclCommInitRank comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x84db589751fa0528 - Init COMPLETE | |
tyler-rhel-newimage:264:1026 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.76 (kernels 0.23, bootstrap 0.19, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03) | |
tyler-rhel-newimage:263:1024 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.78 (kernels 0.16, bootstrap 0.29, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.04, rest 0.03) | |
tyler-rhel-newimage:261:1028 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.75 (kernels 0.21, bootstrap 0.21, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.04, rest 0.03) | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1052 [3] NCCL INFO Connected all rings | |
tyler-rhel-newimage:262:1054 [2] NCCL INFO Connected all rings | |
tyler-rhel-newimage:261:1053 [1] NCCL INFO Connected all rings | |
tyler-rhel-newimage:260:1048 [0] NCCL INFO Connected all rings | |
tyler-rhel-newimage:264:1049 [4] NCCL INFO Connected all rings | |
tyler-rhel-newimage:267:1050 [7] NCCL INFO Connected all rings | |
tyler-rhel-newimage:265:1051 [5] NCCL INFO Connected all rings | |
tyler-rhel-newimage:266:1047 [6] NCCL INFO Connected all rings | |
Generating train split: 46 examples [00:00, 10554.02 examples/s] | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11446.25it/s] | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11298.78it/s] | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11851.23it/s] | |
Effective batch size is too low for multipack sampling, max sample length=107 and min packing length=61. Switching to naive distributed sampling. | |
{ | |
"num_gpus": 8, | |
"avg_sample_len": 61.52173913043478, | |
"effective_batch_size": 8, | |
"max_batch_len_per_gpu": 1, | |
"packing_max_batch_len": null, | |
"grad_accum": 1, | |
"num_batches": 6, | |
"avg_samples_per_batch": 7.666666666666667, | |
"samples_per_gpu": 1, | |
"timestamp": "2024-07-27T04:38:53.790452" | |
} | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11743.03it/s] | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11754.48it/s] | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11646.62it/s] | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11752.33it/s] | |
Data length calculation: 100%|██████████| 46/46 [00:00<00:00, 11259.88it/s] | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.82s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it] | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.84s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it] | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.91s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:11<00:00, 1.93s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.77s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
Loading checkpoint shards: 100%|██████████| 6/6 [00:10<00:00, 1.78s/it] | |
WARNING: tokenizer has 32006 tokens but model has 32000 vocab size | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
Creating extension directory /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam... | |
Detected CUDA files, patching ldflags | |
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... | |
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. | |
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. | |
warnings.warn( | |
Building extension module fused_adam... | |
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
WARNING: There is a mismatch between bos token id of model(1) and tokenizer(32000). Fixing model bos token id to be same as tokenizer's bos token id | |
WARNING: There is a mismatch between eos token id of model(2) and tokenizer(32001). Fixing model eos token id to be same as tokenizer's eos token id | |
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... | |
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o | |
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/python3.11/venv/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/python3.11/venv/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o | |
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/python3.11/venv/lib64/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 30.80458378791809 seconds | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 30.133174657821655 seconds | |
Loading extension module fused_adam... | |
Loading extension module fused_adam... | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 30.7347309589386 seconds | |
Time to load fused_adam op: 30.73425841331482 seconds | |
Time to load fused_adam op: 26.12966561317444 seconds | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 26.22968363761902 seconds | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 30.43332004547119 seconds | |
[2024-07-27 04:39:49,755] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4+d254d75, git-hash=d254d75, git-branch=HEAD | |
[2024-07-27 04:39:49,756] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized | |
Loading extension module fused_adam... | |
Time to load fused_adam op: 30.431302785873413 seconds | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO Using network Socket | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO Using network Socket | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO Using network Socket | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO Using network Socket | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO Using network Socket | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO Using network Socket | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO Using network Socket | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Using network Socket | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO bootstrapSplit: comm 0x556733b2a550 parent 0x5567334e8a90 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO bootstrapSplit: comm 0x55eb051067f0 parent 0x55eb04aea420 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO bootstrapSplit: comm 0x565389f45580 parent 0x5653898b69a0 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO bootstrapSplit: comm 0x560b41bec990 parent 0x560b415bb0d0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO bootstrapSplit: comm 0x55b80d3c14e0 parent 0x55b80cd4f580 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO bootstrapSplit: comm 0x55bdb15c28a0 parent 0x55bdb0f52ee0 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO bootstrapSplit: comm 0x5572495c2d50 parent 0x557248f8b2d0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO bootstrapSplit: comm 0x5584470a63b0 parent 0x558446a0b990 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init START | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO NVLS multicast support is not available on dev 3 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO NVLS multicast support is not available on dev 4 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO NVLS multicast support is not available on dev 1 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO NVLS multicast support is not available on dev 0 | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO NVLS multicast support is not available on dev 2 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO NVLS multicast support is not available on dev 6 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO NVLS multicast support is not available on dev 7 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO NVLS multicast support is not available on dev 5 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO comm 0x565389f45580 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x5584470a63b0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO comm 0x55b80d3c14e0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO comm 0x5572495c2d50 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO comm 0x556733b2a550 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO comm 0x560b41bec990 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0 | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO comm 0x55eb051067f0 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0 | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55bdb15c28a0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO P2P Chunksize set to 524288 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576 | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO ncclCommSplit comm 0x55eb051067f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55eb04aea420 color -934961569 key 3 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO ncclCommSplit comm 0x5572495c2d50 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x557248f8b2d0 color -934961569 key 0 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO ncclCommSplit comm 0x556733b2a550 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x5567334e8a90 color -934961569 key 4 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:263:1144 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:260:1128 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.56 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:264:1147 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.30 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.22, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO ncclCommSplit comm 0x560b41bec990 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x560b415bb0d0 color -934961569 key 7 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:267:1138 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.34 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO ncclCommSplit comm 0x55b80d3c14e0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55b80cd4f580 color -934961569 key 1 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO ncclCommSplit comm 0x55bdb15c28a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55bdb0f52ee0 color -934961569 key 2 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO ncclCommSplit comm 0x5584470a63b0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x558446a0b990 color -934961569 key 5 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:261:1129 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.55 (kernels 0.00, bootstrap 0.26, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO ncclCommSplit comm 0x565389f45580 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5653898b69a0 color -934961569 key 6 commId 0xc4e6ae1bfc2b17b0 - Init COMPLETE | |
tyler-rhel-newimage:262:1141 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.31 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:265:1132 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.52 (kernels 0.00, bootstrap 0.22, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:266:1135 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.42 (kernels 0.00, bootstrap 0.12, allgathers 0.00, topo 0.23, graphs 0.00, connections 0.05, rest 0.02) | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read | |
tyler-rhel-newimage:261:1170 [1] NCCL INFO Connected all rings | |
tyler-rhel-newimage:260:1168 [0] NCCL INFO Connected all rings | |
tyler-rhel-newimage:264:1164 [4] NCCL INFO Connected all rings | |
tyler-rhel-newimage:262:1171 [2] NCCL INFO Connected all rings | |
tyler-rhel-newimage:263:1167 [3] NCCL INFO Connected all rings | |
tyler-rhel-newimage:266:1166 [6] NCCL INFO Connected all rings | |
tyler-rhel-newimage:265:1169 [5] NCCL INFO Connected all rings | |
tyler-rhel-newimage:267:1165 [7] NCCL INFO Connected all rings | |
[2024-07-27 04:39:54,216] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False | |
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer | |
[2024-07-27 04:39:54,218] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer | |
[2024-07-27 04:39:54,243] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam | |
[2024-07-27 04:39:54,243] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'> | |
[2024-07-27 04:39:54,244] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer | |
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000 | |
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000 | |
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False | |
[2024-07-27 04:39:54,244] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False | |
[2024-07-27 04:40:05,980] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:07,679] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states | |
[2024-07-27 04:40:07,680] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB | |
[2024-07-27 04:40:07,680] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4% | |
[2024-07-27 04:40:07,885] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states | |
[2024-07-27 04:40:07,886] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB | |
[2024-07-27 04:40:07,886] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.65 GB, percent = 8.4% | |
[2024-07-27 04:40:07,886] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized | |
[2024-07-27 04:40:08,052] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:08,080] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer | |
[2024-07-27 04:40:08,081] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:08,081] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB | |
[2024-07-27 04:40:08,081] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 67.95 GB, percent = 5.4% | |
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer | |
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler | |
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fa89082c350> | |
[2024-07-27 04:40:08,083] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:08,084] [INFO] [config.py:997:print] DeepSpeedEngine configuration: | |
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] activation_checkpointing_config { | |
"partition_activations": false, | |
"contiguous_memory_optimization": false, | |
"cpu_checkpointing": false, | |
"number_checkpoints": null, | |
"synchronize_checkpoint_boundary": false, | |
"profile": false | |
} | |
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} | |
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_enabled .................. False | |
[2024-07-27 04:40:08,084] [INFO] [config.py:1001:print] amp_params ................... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] autotuning_config ............ { | |
"enabled": false, | |
"start_step": null, | |
"end_step": null, | |
"metric_path": null, | |
"arg_mappings": null, | |
"metric": "throughput", | |
"model_info": null, | |
"results_dir": "autotuning_results", | |
"exps_dir": "autotuning_exps", | |
"overwrite": true, | |
"fast": true, | |
"start_profile_step": 3, | |
"end_profile_step": 5, | |
"tuner_type": "gridsearch", | |
"tuner_early_stopping": 5, | |
"tuner_num_trials": 50, | |
"model_info_path": null, | |
"mp_size": 1, | |
"max_train_batch_size": null, | |
"min_train_batch_size": 1, | |
"max_train_micro_batch_size_per_gpu": 1.024000e+03, | |
"min_train_micro_batch_size_per_gpu": 1, | |
"num_tuning_micro_batch_sizes": 3 | |
} | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_enabled ............. True | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa871e917d0> | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] communication_data_type ...... None | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dataloader_drop_last ......... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] disable_allgather ............ False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dump_state ................... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] elasticity_enabled ........... False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] flops_profiler_config ........ { | |
"enabled": false, | |
"recompute_fwd_factor": 0.0, | |
"profile_step": 1, | |
"module_depth": -1, | |
"top_modules": 1, | |
"detailed": true, | |
"output_file": null | |
} | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_auto_cast ............... None | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_enabled ................. False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] global_rank .................. 0 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] grad_accum_dtype ............. None | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 | |
[2024-07-27 04:40:08,085] [INFO] [config.py:1001:print] graph_harvesting ............. False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] load_universal_checkpoint .... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] loss_scale ................... 1.0 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] memory_breakdown ............. False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] mics_shard_size .............. -1 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] nebula_config ................ { | |
"enabled": false, | |
"persistent_storage_path": null, | |
"persistent_time_interval": 100, | |
"num_of_version_in_retention": 2, | |
"enable_nebula_load": true, | |
"load_path": null | |
} | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_name ............... None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] optimizer_params ............. None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_enabled .................. False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] pld_params ................... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] prescale_gradients ........... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_name ............... None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] scheduler_params ............. None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_attention ............. None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] steps_per_print .............. 1 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_batch_size ............. 8 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] use_node_local_storage ....... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] weight_quantization_config ... None | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] world_size ................... 8 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_enabled ................. True | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True | |
[2024-07-27 04:40:08,086] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2 | |
[2024-07-27 04:40:08,086] [INFO] [config.py:987:print_user_config] json = { | |
"train_batch_size": 8, | |
"gradient_accumulation_steps": 1, | |
"train_micro_batch_size_per_gpu": 1, | |
"steps_per_print": 1, | |
"zero_optimization": { | |
"stage": 2, | |
"offload_param": { | |
"device": "none" | |
}, | |
"offload_optimizer": { | |
"device": "none" | |
} | |
}, | |
"bf16": { | |
"enabled": true | |
}, | |
"gradient_clipping": 1.0, | |
"prescale_gradients": false, | |
"wall_clock_breakdown": false | |
} | |
[2024-07-27 04:40:08,087] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
Number of samples per save: 40 | |
[2024-07-27 04:40:08,101] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:08,148] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:08,457] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
[2024-07-27 04:40:08,652] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/knowledgecheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. | |
Epoch 0: 0%| | 0/6 [00:00<?, ?it/s]
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 7 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 1 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 0 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
Per-token loss scaled by world size: 0.07893893122673035 | |
Per-token loss scaled by world size: 0.03717859089374542 | |
Per-token loss scaled by world size: 0.04773510619997978 | |
Per-token loss scaled by world size: 0.0572475828230381 | |
Per-token loss scaled by world size: 0.07484958320856094 | |
Per-token loss scaled by world size: 0.053953204303979874 | |
Epoch: 0, Step: 1, Rank: 0, loss = 2.230024814605713 | |
Epoch: 0, Step: 1, Rank: 1, loss = 1.0502952337265015 | |
Per-token loss scaled by world size: 0.054488085210323334 | |
Epoch: 0, Step: 1, Rank: 3, loss = 1.3485167026519775 | |
Epoch: 0, Step: 1, Rank: 7, loss = 1.6172442436218262 | |
Epoch: 0, Step: 1, Rank: 6, loss = 2.1145007610321045 | |
Epoch: 0, Step: 1, Rank: 2, loss = 1.5241780281066895 | |
Epoch: 0, Step: 1, Rank: 5, loss = 1.5392884016036987 | |
Per-token loss scaled by world size: 0.034635160118341446 | |
Epoch: 0, Step: 1, Rank: 4, loss = 0.9784433245658875 | |
[2024-07-27 04:40:09,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)] | |
Epoch 0: 17%|█▋ | 1/6 [00:01<00:06, 1.27s/it]{ | |
"epoch": 0, | |
"step": 1, | |
"rank": 0, | |
"loss": 2.230024814605713, | |
"overall_throughput": 9.740397396362653, | |
"lr": 8.000000000000001e-07, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 226, | |
"batch_size": 8, | |
"total_loss": 1.5503114461898804, | |
"gradnorm": 27.33059310913086, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:09.962829" | |
} | |
Per-token loss scaled by world size: 0.0815284326672554 | |
Per-token loss scaled by world size: 0.049197420477867126 | |
Per-token loss scaled by world size: 0.05212152749300003 | |
Per-token loss scaled by world size: 0.04615860432386398 | |
Per-token loss scaled by world size: 0.031173814088106155 | |
Per-token loss scaled by world size: 0.031103696674108505 | |
Per-token loss scaled by world size: 0.059471674263477325 | |
Epoch: 0, Step: 2, Rank: 5, loss = 1.5189703702926636 | |
Epoch: 0, Step: 2, Rank: 0, loss = 2.517190456390381 | |
Epoch: 0, Step: 2, Rank: 2, loss = 1.6092522144317627 | |
Epoch: 0, Step: 2, Rank: 4, loss = 1.4251469373703003 | |
Epoch: 0, Step: 2, Rank: 3, loss = 0.962491512298584 | |
Epoch: 0, Step: 2, Rank: 1, loss = 0.960326611995697 | |
Epoch: 0, Step: 2, Rank: 7, loss = 1.8361879587173462 | |
Per-token loss scaled by world size: 0.0653553158044815 | |
Epoch: 0, Step: 2, Rank: 6, loss = 2.017845392227173 | |
[2024-07-27 04:40:10,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)] | |
Epoch 0: 33%|███▎ | 2/6 [00:01<00:03, 1.28it/s]{ | |
"epoch": 0, | |
"step": 2, | |
"rank": 0, | |
"loss": 2.517190456390381, | |
"overall_throughput": 25.079475366313783, | |
"lr": 1.6000000000000001e-06, | |
"cuda_mem_allocated": 21.990607738494873, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 247, | |
"batch_size": 8, | |
"total_loss": 1.6059263944625854, | |
"gradnorm": 24.998506546020508, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:10.428087" | |
} | |
Per-token loss scaled by world size: 0.049534447491168976 | |
Per-token loss scaled by world size: 0.034937091171741486 | |
Per-token loss scaled by world size: 0.03569952771067619 | |
Per-token loss scaled by world size: 0.027379106730222702 | |
Per-token loss scaled by world size: 0.07303578406572342 | |
Per-token loss scaled by world size: 0.060139141976833344 | |
Per-token loss scaled by world size: 0.06049950420856476 | |
Epoch: 0, Step: 3, Rank: 4, loss = 1.1223540306091309 | |
Epoch: 0, Step: 3, Rank: 3, loss = 1.9319698810577393 | |
Epoch: 0, Step: 3, Rank: 1, loss = 2.3462746143341064 | |
Epoch: 0, Step: 3, Rank: 7, loss = 0.8795537948608398 | |
Epoch: 0, Step: 3, Rank: 0, loss = 1.5912941694259644 | |
Epoch: 0, Step: 3, Rank: 2, loss = 1.1468473672866821 | |
Epoch: 0, Step: 3, Rank: 5, loss = 1.9435465335845947 | |
Per-token loss scaled by world size: 0.07311909645795822 | |
Epoch: 0, Step: 3, Rank: 6, loss = 2.3489508628845215 | |
[2024-07-27 04:40:10,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:10,835] [INFO] [timer.py:258:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=19.69948717647646, CurrSamplesPerSec=19.69948717647646, MemAllocated=21.99GB, MaxMemAllocated=28.28GB | |
Epoch 0: 50%|█████ | 3/6 [00:02<00:01, 1.56it/s]{ | |
"epoch": 0, | |
"step": 3, | |
"rank": 0, | |
"loss": 1.5912941694259644, | |
"overall_throughput": 19.626204980478747, | |
"lr": 2.4000000000000003e-06, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 257, | |
"batch_size": 8, | |
"total_loss": 1.6638489961624146, | |
"gradnorm": 23.093570709228516, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:10.899000" | |
} | |
Per-token loss scaled by world size: 0.022453732788562775 | |
Per-token loss scaled by world size: 0.02703114040195942 | |
Per-token loss scaled by world size: 0.0565037876367569 | |
Per-token loss scaled by world size: 0.013839970342814922 | |
Per-token loss scaled by world size: 0.03342469781637192 | |
Per-token loss scaled by world size: 0.019963225349783897 | |
Per-token loss scaled by world size: 0.024931060150265694 | |
Epoch: 0, Step: 4, Rank: 0, loss = 1.2197802066802979 | |
Epoch: 0, Step: 4, Rank: 6, loss = 2.5497334003448486 | |
Epoch: 0, Step: 4, Rank: 5, loss = 0.9008405208587646 | |
Epoch: 0, Step: 4, Rank: 1, loss = 1.5082894563674927 | |
Epoch: 0, Step: 4, Rank: 2, loss = 0.6245286464691162 | |
Epoch: 0, Step: 4, Rank: 3, loss = 1.013224720954895 | |
Epoch: 0, Step: 4, Rank: 7, loss = 1.125014066696167 | |
Per-token loss scaled by world size: 0.04781263321638107 | |
Epoch: 0, Step: 4, Rank: 4, loss = 2.1575450897216797 | |
[2024-07-27 04:40:11,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:11,315] [INFO] [timer.py:258:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=19.53033393623941, CurrSamplesPerSec=19.36406089495735, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 0: 67%|██████▋ | 4/6 [00:02<00:01, 1.73it/s]{ | |
"epoch": 0, | |
"step": 4, | |
"rank": 0, | |
"loss": 1.2197802066802979, | |
"overall_throughput": 19.323057857283946, | |
"lr": 3.2000000000000003e-06, | |
"cuda_mem_allocated": 21.994319915771484, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 361, | |
"batch_size": 8, | |
"total_loss": 1.3873695135116577, | |
"gradnorm": 13.594513893127441, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:11.384294" | |
} | |
Per-token loss scaled by world size: 0.039435986429452896 | |
Per-token loss scaled by world size: 0.04640277475118637 | |
Per-token loss scaled by world size: 0.04766402393579483 | |
Per-token loss scaled by world size: 0.03315580636262894 | |
Per-token loss scaled by world size: 0.05420012027025223 | |
Per-token loss scaled by world size: 0.039042871445417404 | |
Per-token loss scaled by world size: 0.037364713847637177 | |
Epoch: 0, Step: 5, Rank: 7, loss = 1.5848288536071777 | |
Epoch: 0, Step: 5, Rank: 0, loss = 1.3112465143203735 | |
Epoch: 0, Step: 5, Rank: 6, loss = 1.1024305820465088 | |
Epoch: 0, Step: 5, Rank: 1, loss = 1.2981754541397095 | |
Epoch: 0, Step: 5, Rank: 4, loss = 1.242376685142517 | |
Epoch: 0, Step: 5, Rank: 3, loss = 1.5428922176361084 | |
Epoch: 0, Step: 5, Rank: 2, loss = 1.8021539449691772 | |
Per-token loss scaled by world size: 0.041017867624759674 | |
Epoch: 0, Step: 5, Rank: 5, loss = 1.3638441562652588 | |
[2024-07-27 04:40:11,718] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:11,795] [INFO] [timer.py:258:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=19.56114133365461, CurrSamplesPerSec=19.62304862715284, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 40 | |
{ | |
"epoch": 0, | |
"step": 5, | |
"rank": 0, | |
"loss": 1.3112465143203735, | |
"overall_throughput": 19.583411102181554, | |
"lr": 4.000000000000001e-06, | |
"cuda_mem_allocated": 21.990607738494873, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 266, | |
"batch_size": 8, | |
"total_loss": 1.4059934616088867, | |
"gradnorm": 16.828536987304688, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:11.799662" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_40 | |
[04:40:29] INFO saving took 17.8935489654541 seconds utils.py:611 | |
Epoch 0: 83%|████████▎ | 5/6 [00:21<00:06, 7.00s/it] | |
Per-token loss scaled by world size: 0.023287693038582802 | |
Per-token loss scaled by world size: 0.022923028096556664 | |
Per-token loss scaled by world size: 0.035226039588451385 | |
Per-token loss scaled by world size: 0.02928866446018219 | |
Per-token loss scaled by world size: 0.02294014021754265 | |
Per-token loss scaled by world size: 0.021732352674007416 | |
Per-token loss scaled by world size: 0.05167709290981293 | |
Epoch: 0, Step: 6, Rank: 1, loss = 0.8420491218566895 | |
Epoch: 0, Step: 6, Rank: 2, loss = 1.0127485990524292 | |
Epoch: 0, Step: 6, Rank: 5, loss = 0.6595290303230286 | |
Epoch: 0, Step: 6, Rank: 4, loss = 1.485716462135315 | |
Epoch: 0, Step: 6, Rank: 0, loss = 0.6590370535850525 | |
Epoch: 0, Step: 6, Rank: 7, loss = 0.669521152973175 | |
Epoch: 0, Step: 6, Rank: 3, loss = 0.6248051524162292 | |
Per-token loss scaled by world size: 0.05193476006388664 | |
Epoch: 0, Step: 6, Rank: 6, loss = 1.4931243658065796 | |
[2024-07-27 04:40:30,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:30,197] [INFO] [timer.py:258:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=19.219145353583418, CurrSamplesPerSec=18.261332776041684, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 4.79s/it]{ | |
"epoch": 0, | |
"step": 6, | |
"rank": 0, | |
"loss": 0.6590370535850525, | |
"overall_throughput": 18.218836059734688, | |
"lr": 4.800000000000001e-06, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 230, | |
"batch_size": 8, | |
"total_loss": 0.9308162927627563, | |
"gradnorm": 13.859025001525879, | |
"weight_norm": 393.4548645019531, | |
"timestamp": "2024-07-27T04:40:30.261588" | |
} | |
Epoch 0: 100%|██████████| 6/6 [00:21<00:00, 3.61s/it] | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 0 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 0 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 2 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 7 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 3 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 3 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
Per-token loss scaled by world size: 0.04258072003722191 | |
Per-token loss scaled by world size: 0.03144193813204765 | |
Per-token loss scaled by world size: 0.03665884956717491 | |
Per-token loss scaled by world size: 0.0183926559984684 | |
Per-token loss scaled by world size: 0.02085307240486145 | |
Per-token loss scaled by world size: 0.011860878206789494 | |
Epoch: 1, Step: 7, Rank: 6, loss = 1.2830597162246704 | |
Per-token loss scaled by world size: 0.017801115289330482 | |
Epoch: 1, Step: 7, Rank: 7, loss = 0.6437429785728455 | |
Epoch: 1, Step: 7, Rank: 2, loss = 1.1004678010940552 | |
Epoch: 1, Step: 7, Rank: 3, loss = 1.4903252124786377 | |
Epoch: 1, Step: 7, Rank: 0, loss = 0.7298575639724731 | |
Epoch: 1, Step: 7, Rank: 1, loss = 0.41513073444366455 | |
Epoch: 1, Step: 7, Rank: 5, loss = 0.6230390071868896 | |
Per-token loss scaled by world size: 0.014800201170146465 | |
Epoch: 1, Step: 7, Rank: 4, loss = 0.5180070400238037 | |
[2024-07-27 04:40:31,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[5.600000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:31,117] [INFO] [timer.py:258:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=18.856037574472914, CurrSamplesPerSec=17.531170274406254, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 1: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s]{ | |
"epoch": 1, | |
"step": 7, | |
"rank": 0, | |
"loss": 0.7298575639724731, | |
"overall_throughput": 17.461878806108718, | |
"lr": 5.600000000000001e-06, | |
"cuda_mem_allocated": 21.99084711074829, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 280, | |
"batch_size": 8, | |
"total_loss": 0.850453794002533, | |
"gradnorm": 11.859759330749512, | |
"weight_norm": 393.45489501953125, | |
"timestamp": "2024-07-27T04:40:31.182263" | |
} | |
Per-token loss scaled by world size: 0.01552379410713911 | |
Per-token loss scaled by world size: 0.006790023762732744 | |
Per-token loss scaled by world size: 0.025048548355698586 | |
Per-token loss scaled by world size: 0.013136875815689564 | |
Per-token loss scaled by world size: 0.011116808280348778 | |
Per-token loss scaled by world size: 0.013115502893924713 | |
Per-token loss scaled by world size: 0.004034785088151693 | |
Epoch: 1, Step: 8, Rank: 6, loss = 0.2682059407234192 | |
Epoch: 1, Step: 8, Rank: 3, loss = 0.9894176721572876 | |
Epoch: 1, Step: 8, Rank: 7, loss = 0.43911394476890564 | |
Epoch: 1, Step: 8, Rank: 0, loss = 0.5180623531341553 | |
Epoch: 1, Step: 8, Rank: 4, loss = 0.5189065933227539 | |
Epoch: 1, Step: 8, Rank: 2, loss = 0.6131898760795593 | |
Epoch: 1, Step: 8, Rank: 5, loss = 0.15937401354312897 | |
Per-token loss scaled by world size: 0.0399574413895607 | |
Epoch: 1, Step: 8, Rank: 1, loss = 1.5783189535140991 | |
[2024-07-27 04:40:31,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[6.4000000000000006e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:31,596] [INFO] [timer.py:258:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=18.947101184672665, CurrSamplesPerSec=19.41593921964599, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 1: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s]{ | |
"epoch": 1, | |
"step": 8, | |
"rank": 0, | |
"loss": 0.5180623531341553, | |
"overall_throughput": 19.33400249148091, | |
"lr": 6.4000000000000006e-06, | |
"cuda_mem_allocated": 21.992404460906982, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 316, | |
"batch_size": 8, | |
"total_loss": 0.63557368516922, | |
"gradnorm": 13.164900779724121, | |
"weight_norm": 393.4549255371094, | |
"timestamp": "2024-07-27T04:40:31.599119" | |
} | |
Per-token loss scaled by world size: 0.005403513088822365 | |
Per-token loss scaled by world size: 0.009131929837167263 | |
Per-token loss scaled by world size: 0.0029204594902694225 | |
Per-token loss scaled by world size: 0.014552305452525616 | |
Per-token loss scaled by world size: 0.002935068914666772 | |
Per-token loss scaled by world size: 0.005613779183477163 | |
Per-token loss scaled by world size: 0.008262179791927338 | |
Epoch: 1, Step: 9, Rank: 6, loss = 0.5002354979515076 | |
Epoch: 1, Step: 9, Rank: 7, loss = 0.10039079189300537 | |
Epoch: 1, Step: 9, Rank: 2, loss = 0.3139100968837738 | |
Epoch: 1, Step: 9, Rank: 1, loss = 0.10089299082756042 | |
Epoch: 1, Step: 9, Rank: 0, loss = 0.18574576079845428 | |
Epoch: 1, Step: 9, Rank: 4, loss = 0.28401243686676025 | |
Epoch: 1, Step: 9, Rank: 3, loss = 0.19297365844249725 | |
Per-token loss scaled by world size: 0.03853427246212959 | |
Epoch: 1, Step: 9, Rank: 5, loss = 1.3246155977249146 | |
[2024-07-27 04:40:31,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[7.2000000000000005e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:32,073] [INFO] [timer.py:258:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=19.017175783584705, CurrSamplesPerSec=19.44875538610099, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 1: 50%|█████ | 3/6 [00:01<00:01, 1.81it/s]{ | |
"epoch": 1, | |
"step": 9, | |
"rank": 0, | |
"loss": 0.18574576079845428, | |
"overall_throughput": 19.4072473458094, | |
"lr": 7.2000000000000005e-06, | |
"cuda_mem_allocated": 21.990368366241455, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 275, | |
"batch_size": 8, | |
"total_loss": 0.3753471374511719, | |
"gradnorm": 6.857061386108398, | |
"weight_norm": 393.4549255371094, | |
"timestamp": "2024-07-27T04:40:32.076959" | |
} | |
Per-token loss scaled by world size: 0.009762527421116829 | |
Per-token loss scaled by world size: 0.011909930035471916 | |
Per-token loss scaled by world size: 0.011925755999982357 | |
Per-token loss scaled by world size: 0.00691488292068243 | |
Per-token loss scaled by world size: 0.014653326012194157 | |
Per-token loss scaled by world size: 0.014235693961381912 | |
Per-token loss scaled by world size: 0.006737567484378815 | |
Epoch: 1, Step: 10, Rank: 1, loss = 0.32094308733940125 | |
Epoch: 1, Step: 10, Rank: 4, loss = 0.39205923676490784 | |
Epoch: 1, Step: 10, Rank: 6, loss = 0.22732678055763245 | |
Epoch: 1, Step: 10, Rank: 2, loss = 0.48172810673713684 | |
Epoch: 1, Step: 10, Rank: 7, loss = 0.2214975357055664 | |
Epoch: 1, Step: 10, Rank: 3, loss = 0.39153894782066345 | |
Epoch: 1, Step: 10, Rank: 5, loss = 0.4679984450340271 | |
Per-token loss scaled by world size: 0.013581880368292332 | |
Epoch: 1, Step: 10, Rank: 0, loss = 0.4465043246746063 | |
[2024-07-27 04:40:32,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:32,552] [INFO] [timer.py:258:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=19.068379525424866, CurrSamplesPerSec=19.434674525231042, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 80 | |
{ | |
"epoch": 1, | |
"step": 10, | |
"rank": 0, | |
"loss": 0.4465043246746063, | |
"overall_throughput": 19.395367576038005, | |
"lr": 8.000000000000001e-06, | |
"cuda_mem_allocated": 21.99384117126465, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 263, | |
"batch_size": 8, | |
"total_loss": 0.3686995804309845, | |
"gradnorm": 7.663094520568848, | |
"weight_norm": 393.4549560546875, | |
"timestamp": "2024-07-27T04:40:32.556073" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_80 | |
[04:40:50] INFO saving took 17.83508801460266 seconds utils.py:611 | |
Per-token loss scaled by world size: 0.008527813479304314 | |
Per-token loss scaled by world size: 0.021125635132193565 | |
Per-token loss scaled by world size: 0.028058378025889397 | |
Per-token loss scaled by world size: 0.010279231704771519 | |
Per-token loss scaled by world size: 0.007951636798679829 | |
Per-token loss scaled by world size: 0.010785719379782677 | |
Per-token loss scaled by world size: 0.0015928453067317605 | |
Epoch: 1, Step: 11, Rank: 0, loss = 0.6364097595214844 | |
Epoch: 1, Step: 11, Rank: 3, loss = 0.8452586531639099 | |
Epoch: 1, Step: 11, Rank: 7, loss = 0.309661865234375 | |
Epoch: 1, Step: 11, Rank: 5, loss = 0.23954305052757263 | |
Epoch: 1, Step: 11, Rank: 1, loss = 0.32491979002952576 | |
Epoch: 1, Step: 11, Rank: 6, loss = 0.2569003701210022 | |
Epoch: 1, Step: 11, Rank: 4, loss = 0.04798446595668793 | |
Per-token loss scaled by world size: 0.011055735871195793 | |
Epoch: 1, Step: 11, Rank: 2, loss = 0.3330540359020233 | |
[2024-07-27 04:40:50,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[8.8e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:50,878] [INFO] [timer.py:258:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=19.06728482937713, CurrSamplesPerSec=19.05853178378495, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 1: 83%|████████▎ | 5/6 [00:20<00:05, 5.01s/it]{ | |
"epoch": 1, | |
"step": 11, | |
"rank": 0, | |
"loss": 0.6364097595214844, | |
"overall_throughput": 19.020988915430987, | |
"lr": 8.8e-06, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 241, | |
"batch_size": 8, | |
"total_loss": 0.3742165267467499, | |
"gradnorm": 10.62493896484375, | |
"weight_norm": 393.45501708984375, | |
"timestamp": "2024-07-27T04:40:50.881434" | |
} | |
Per-token loss scaled by world size: 0.01114068366587162 | |
Per-token loss scaled by world size: 0.017769839614629745 | |
Per-token loss scaled by world size: 0.01096078846603632 | |
Per-token loss scaled by world size: 0.01160360500216484 | |
Per-token loss scaled by world size: 0.01375030167400837 | |
Per-token loss scaled by world size: 0.012371961027383804 | |
Per-token loss scaled by world size: 0.026108525693416595 | |
Epoch: 1, Step: 12, Rank: 4, loss = 0.4709007441997528 | |
Epoch: 1, Step: 12, Rank: 2, loss = 0.29046088457107544 | |
Epoch: 1, Step: 12, Rank: 7, loss = 0.3074955344200134 | |
Epoch: 1, Step: 12, Rank: 0, loss = 0.29522812366485596 | |
Epoch: 1, Step: 12, Rank: 3, loss = 0.3643829822540283 | |
Epoch: 1, Step: 12, Rank: 5, loss = 0.32785695791244507 | |
Epoch: 1, Step: 12, Rank: 1, loss = 0.6918759346008301 | |
Per-token loss scaled by world size: 0.008026043884456158 | |
Epoch: 1, Step: 12, Rank: 6, loss = 0.21269017457962036 | |
[2024-07-27 04:40:51,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[9.600000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:51,345] [INFO] [timer.py:258:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=19.154108730048822, CurrSamplesPerSec=19.972626532644533, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 1: 100%|██████████| 6/6 [00:21<00:00, 3.47s/it] | |
{ | |
"epoch": 1, | |
"step": 12, | |
"rank": 0, | |
"loss": 0.29522812366485596, | |
"overall_throughput": 19.934561488662563, | |
"lr": 9.600000000000001e-06, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 212, | |
"batch_size": 8, | |
"total_loss": 0.3701114356517792, | |
"gradnorm": 11.501402854919434, | |
"weight_norm": 393.4551086425781, | |
"timestamp": "2024-07-27T04:40:51.407469" | |
} | |
Epoch 1: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it] | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 3 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 6 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 5 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 7 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 4 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 4 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
Per-token loss scaled by world size: 0.005579258780926466 | |
Per-token loss scaled by world size: 0.003397347405552864 | |
Per-token loss scaled by world size: 0.04219382628798485 | |
Per-token loss scaled by world size: 0.006288798991590738 | |
Per-token loss scaled by world size: 0.003948549274355173 | |
Per-token loss scaled by world size: 0.005502534564584494 | |
Per-token loss scaled by world size: 0.018879147246479988 | |
Epoch: 2, Step: 13, Rank: 4, loss = 0.20595817267894745 | |
Epoch: 2, Step: 13, Rank: 2, loss = 0.11126312613487244 | |
Epoch: 2, Step: 13, Rank: 6, loss = 0.18272072076797485 | |
Epoch: 2, Step: 13, Rank: 1, loss = 1.381847858428955 | |
Epoch: 2, Step: 13, Rank: 3, loss = 0.12931498885154724 | |
Epoch: 2, Step: 13, Rank: 5, loss = 0.1802080124616623 | |
Epoch: 2, Step: 13, Rank: 0, loss = 0.6182920932769775 | |
Per-token loss scaled by world size: 0.005611070431768894 | |
Epoch: 2, Step: 13, Rank: 7, loss = 0.1837625503540039 | |
[2024-07-27 04:40:52,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[1.04e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:52,284] [INFO] [timer.py:258:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=18.955822461559762, CurrSamplesPerSec=17.17757371046992, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 2: 17%|█▋ | 1/6 [00:00<00:04, 1.22it/s] | |
{ | |
"epoch": 2, | |
"step": 13, | |
"rank": 0, | |
"loss": 0.6182920932769775, | |
"overall_throughput": 17.11025862908584, | |
"lr": 1.04e-05, | |
"cuda_mem_allocated": 21.990307807922363, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 262, | |
"batch_size": 8, | |
"total_loss": 0.3741708993911743, | |
"gradnorm": 5.893370628356934, | |
"weight_norm": 393.4551696777344, | |
"timestamp": "2024-07-27T04:40:52.348171" | |
} | |
Per-token loss scaled by world size: 0.010395266115665436 | |
Per-token loss scaled by world size: 0.009872684255242348 | |
Per-token loss scaled by world size: 0.003835555398836732 | |
Per-token loss scaled by world size: 0.008823958225548267 | |
Per-token loss scaled by world size: 0.0053123775869607925 | |
Per-token loss scaled by world size: 0.0028427704237401485 | |
Per-token loss scaled by world size: 0.010287421755492687 | |
Epoch: 2, Step: 14, Rank: 0, loss = 0.323552668094635 | |
Epoch: 2, Step: 14, Rank: 4, loss = 0.3072873055934906 | |
Epoch: 2, Step: 14, Rank: 2, loss = 0.27464568614959717 | |
Epoch: 2, Step: 14, Rank: 5, loss = 0.11938165873289108 | |
Epoch: 2, Step: 14, Rank: 1, loss = 0.08848123252391815 | |
Epoch: 2, Step: 14, Rank: 3, loss = 0.16534775495529175 | |
Epoch: 2, Step: 14, Rank: 7, loss = 0.3201960027217865 | |
Per-token loss scaled by world size: 0.016205936670303345 | |
Epoch: 2, Step: 14, Rank: 6, loss = 0.5044097900390625 | |
[2024-07-27 04:40:52,683] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[1.1200000000000001e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:52,760] [INFO] [timer.py:258:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=19.0045240462792, CurrSamplesPerSec=19.557238311503617, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 2: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s] | |
{ | |
"epoch": 2, | |
"step": 14, | |
"rank": 0, | |
"loss": 0.323552668094635, | |
"overall_throughput": 19.51458544398396, | |
"lr": 1.1200000000000001e-05, | |
"cuda_mem_allocated": 21.989410877227783, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 249, | |
"batch_size": 8, | |
"total_loss": 0.262912780046463, | |
"gradnorm": 8.123388290405273, | |
"weight_norm": 393.45526123046875, | |
"timestamp": "2024-07-27T04:40:52.825508" | |
} | |
Per-token loss scaled by world size: 0.00476957717910409 | |
Per-token loss scaled by world size: 0.005131879821419716 | |
Per-token loss scaled by world size: 0.002688762964680791 | |
Per-token loss scaled by world size: 0.00911777000874281 | |
Per-token loss scaled by world size: 0.006320170592516661 | |
Per-token loss scaled by world size: 0.007136975880712271 | |
Per-token loss scaled by world size: 0.007083716802299023 | |
Epoch: 2, Step: 15, Rank: 0, loss = 0.16742758452892303 | |
Epoch: 2, Step: 15, Rank: 1, loss = 0.1556074619293213 | |
Epoch: 2, Step: 15, Rank: 2, loss = 0.2974672317504883 | |
Epoch: 2, Step: 15, Rank: 6, loss = 0.20619556307792664 | |
Epoch: 2, Step: 15, Rank: 7, loss = 0.23110626637935638 | |
Epoch: 2, Step: 15, Rank: 5, loss = 0.23284383118152618 | |
Epoch: 2, Step: 15, Rank: 4, loss = 0.08772089332342148 | |
Per-token loss scaled by world size: 0.007475144695490599 | |
Epoch: 2, Step: 15, Rank: 3, loss = 0.2438765913248062 | |
[2024-07-27 04:40:53,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:40:53,243] [INFO] [timer.py:258:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=19.026022320074304, CurrSamplesPerSec=19.287847616814023, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 120 | |
{ | |
"epoch": 2, | |
"step": 15, | |
"rank": 0, | |
"loss": 0.16742758452892303, | |
"overall_throughput": 19.246835159868162, | |
"lr": 1.2e-05, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 261, | |
"batch_size": 8, | |
"total_loss": 0.20278067886829376, | |
"gradnorm": 4.634181499481201, | |
"weight_norm": 393.455322265625, | |
"timestamp": "2024-07-27T04:40:53.247897" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_120 | |
[04:41:11] INFO saving took 17.93269968032837 seconds utils.py:611 | |
Epoch 2: 50%|█████ | 3/6 [00:19<00:26, 8.75s/it] | |
Per-token loss scaled by world size: 0.022889601066708565 | |
Per-token loss scaled by world size: 0.006112583447247744 | |
Per-token loss scaled by world size: 0.008243937976658344 | |
Per-token loss scaled by world size: 0.005759389605373144 | |
Per-token loss scaled by world size: 0.0023497489746659994 | |
Per-token loss scaled by world size: 0.003176590893417597 | |
Per-token loss scaled by world size: 0.0032468584831804037 | |
Epoch: 2, Step: 16, Rank: 0, loss = 0.18261343240737915 | |
Epoch: 2, Step: 16, Rank: 5, loss = 0.6838268041610718 | |
Epoch: 2, Step: 16, Rank: 4, loss = 0.24628764390945435 | |
Epoch: 2, Step: 16, Rank: 3, loss = 0.1720617711544037 | |
Epoch: 2, Step: 16, Rank: 6, loss = 0.07019875198602676 | |
Epoch: 2, Step: 16, Rank: 1, loss = 0.09699989855289459 | |
Epoch: 2, Step: 16, Rank: 2, loss = 0.09490065276622772 | |
Per-token loss scaled by world size: 0.002544187940657139 | |
Epoch: 2, Step: 16, Rank: 7, loss = 0.07600761204957962 | |
[2024-07-27 04:41:11,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[1.2800000000000001e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:11,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=19.0047652088737, CurrSamplesPerSec=18.732683349486162, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 2: 67%|██████▋ | 4/6 [00:20<00:10, 5.49s/it] | |
{ | |
"epoch": 2, | |
"step": 16, | |
"rank": 0, | |
"loss": 0.18261343240737915, | |
"overall_throughput": 18.691548302496816, | |
"lr": 1.2800000000000001e-05, | |
"cuda_mem_allocated": 21.98988962173462, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 239, | |
"batch_size": 8, | |
"total_loss": 0.20286208391189575, | |
"gradnorm": 3.438565492630005, | |
"weight_norm": 393.4554138183594, | |
"timestamp": "2024-07-27T04:41:11.737319" | |
} | |
Per-token loss scaled by world size: 0.006436600815504789 | |
Per-token loss scaled by world size: 0.007636873982846737 | |
Per-token loss scaled by world size: 0.011849365197122097 | |
Per-token loss scaled by world size: 0.0030969511717557907 | |
Per-token loss scaled by world size: 0.0029933564364910126 | |
Per-token loss scaled by world size: 0.005698263645172119 | |
Per-token loss scaled by world size: 0.0030969511717557907 | |
Epoch: 2, Step: 17, Rank: 2, loss = 0.24247075617313385 | |
Epoch: 2, Step: 17, Rank: 6, loss = 0.09832820296287537 | |
Epoch: 2, Step: 17, Rank: 3, loss = 0.09503906965255737 | |
Epoch: 2, Step: 17, Rank: 0, loss = 0.20436207950115204 | |
Epoch: 2, Step: 17, Rank: 7, loss = 0.18091987073421478 | |
Epoch: 2, Step: 17, Rank: 1, loss = 0.3762173354625702 | |
Epoch: 2, Step: 17, Rank: 5, loss = 0.09832820296287537 | |
Per-token loss scaled by world size: 0.0246181171387434 | |
Epoch: 2, Step: 17, Rank: 4, loss = 0.7816252112388611 | |
[2024-07-27 04:41:12,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.3600000000000002e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:12,143] [INFO] [timer.py:258:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=19.058863757063296, CurrSamplesPerSec=19.849924810962573, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 2: 83%|████████▎ | 5/6 [00:20<00:03, 3.68s/it] | |
{ | |
"epoch": 2, | |
"step": 17, | |
"rank": 0, | |
"loss": 0.20436207950115204, | |
"overall_throughput": 19.81138977879121, | |
"lr": 1.3600000000000002e-05, | |
"cuda_mem_allocated": 21.990607738494873, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 254, | |
"batch_size": 8, | |
"total_loss": 0.25966137647628784, | |
"gradnorm": 4.596966743469238, | |
"weight_norm": 393.4555358886719, | |
"timestamp": "2024-07-27T04:41:12.206412" | |
} | |
Per-token loss scaled by world size: 0.003973593469709158 | |
Per-token loss scaled by world size: 0.003631346160545945 | |
Per-token loss scaled by world size: 0.0038163107819855213 | |
Per-token loss scaled by world size: 0.00382098532281816 | |
Per-token loss scaled by world size: 0.00380203640088439 | |
Per-token loss scaled by world size: 0.001392068457789719 | |
Per-token loss scaled by world size: 0.004007395356893539 | |
Epoch: 2, Step: 18, Rank: 0, loss = 0.18179190158843994 | |
Epoch: 2, Step: 18, Rank: 3, loss = 0.17459622025489807 | |
Epoch: 2, Step: 18, Rank: 1, loss = 0.1661340892314911 | |
Epoch: 2, Step: 18, Rank: 7, loss = 0.18333832919597626 | |
Epoch: 2, Step: 18, Rank: 6, loss = 0.1739431619644165 | |
Epoch: 2, Step: 18, Rank: 4, loss = 0.06368713080883026 | |
Epoch: 2, Step: 18, Rank: 2, loss = 0.17481008172035217 | |
Per-token loss scaled by world size: 0.009031021036207676 | |
Epoch: 2, Step: 18, Rank: 5, loss = 0.4131692051887512 | |
[2024-07-27 04:41:12,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.4400000000000001e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:12,623] [INFO] [timer.py:258:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=19.075929261101155, CurrSamplesPerSec=19.33562909999493, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 2.59s/it] | |
{ | |
"epoch": 2, | |
"step": 18, | |
"rank": 0, | |
"loss": 0.18179190158843994, | |
"overall_throughput": 19.297919953254016, | |
"lr": 1.4400000000000001e-05, | |
"cuda_mem_allocated": 21.992165088653564, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 366, | |
"batch_size": 8, | |
"total_loss": 0.19143378734588623, | |
"gradnorm": 3.664649486541748, | |
"weight_norm": 393.4555969238281, | |
"timestamp": "2024-07-27T04:41:12.687682" | |
} | |
Epoch 2: 100%|██████████| 6/6 [00:21<00:00, 3.55s/it] | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 1 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 1 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 0 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 2 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 6 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 4 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 7 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 7 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
Per-token loss scaled by world size: 0.008028822019696236 | |
Per-token loss scaled by world size: 0.0042352983728051186 | |
Per-token loss scaled by world size: 0.006302641239017248 | |
Per-token loss scaled by world size: 0.00753552932292223 | |
Per-token loss scaled by world size: 0.006594506558030844 | |
Per-token loss scaled by world size: 0.010208014398813248 | |
Per-token loss scaled by world size: 0.0070448205806314945 | |
Epoch: 3, Step: 19, Rank: 7, loss = 0.12917660176753998 | |
Epoch: 3, Step: 19, Rank: 6, loss = 0.2448790818452835 | |
Epoch: 3, Step: 19, Rank: 2, loss = 0.19223055243492126 | |
Epoch: 3, Step: 19, Rank: 4, loss = 0.20113244652748108 | |
Epoch: 3, Step: 19, Rank: 5, loss = 0.22983364760875702 | |
Epoch: 3, Step: 19, Rank: 0, loss = 0.3113444447517395 | |
Epoch: 3, Step: 19, Rank: 1, loss = 0.2148670256137848 | |
Per-token loss scaled by world size: 0.007540326565504074 | |
Epoch: 3, Step: 19, Rank: 3, loss = 0.2299799621105194 | |
[2024-07-27 04:41:13,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.5200000000000002e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:13,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=18.855222324067338, CurrSamplesPerSec=15.90998650082005, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 3: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s] | |
{ | |
"epoch": 3, | |
"step": 19, | |
"rank": 0, | |
"loss": 0.3113444447517395, | |
"overall_throughput": 15.851661275609098, | |
"lr": 1.5200000000000002e-05, | |
"cuda_mem_allocated": 21.989410877227783, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 244, | |
"batch_size": 8, | |
"total_loss": 0.21918047964572906, | |
"gradnorm": 7.666770935058594, | |
"weight_norm": 393.4556884765625, | |
"timestamp": "2024-07-27T04:41:13.615788" | |
} | |
Per-token loss scaled by world size: 0.0015976645518094301 | |
Per-token loss scaled by world size: 0.01189976092427969 | |
Per-token loss scaled by world size: 0.006761615164577961 | |
Per-token loss scaled by world size: 0.0026721509639173746 | |
Per-token loss scaled by world size: 0.001967529533430934 | |
Per-token loss scaled by world size: 0.005321608856320381 | |
Per-token loss scaled by world size: 0.0015923914033919573 | |
Epoch: 3, Step: 20, Rank: 0, loss = 0.057715632021427155 | |
Epoch: 3, Step: 20, Rank: 4, loss = 0.0965314507484436 | |
Epoch: 3, Step: 20, Rank: 6, loss = 0.07107700407505035 | |
Epoch: 3, Step: 20, Rank: 3, loss = 0.4298788607120514 | |
Epoch: 3, Step: 20, Rank: 2, loss = 0.24426335096359253 | |
Epoch: 3, Step: 20, Rank: 1, loss = 0.19224311411380768 | |
Epoch: 3, Step: 20, Rank: 7, loss = 0.057525139302015305 | |
Per-token loss scaled by world size: 0.018431704491376877 | |
Epoch: 3, Step: 20, Rank: 5, loss = 0.6658453345298767 | |
[2024-07-27 04:41:13,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:14,028] [INFO] [timer.py:258:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=18.89434286309558, CurrSamplesPerSec=19.58513710703571, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 160 | |
{ | |
"epoch": 3, | |
"step": 20, | |
"rank": 0, | |
"loss": 0.057715632021427155, | |
"overall_throughput": 19.54536780630332, | |
"lr": 1.6000000000000003e-05, | |
"cuda_mem_allocated": 21.990726947784424, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 289, | |
"batch_size": 8, | |
"total_loss": 0.22688499093055725, | |
"gradnorm": 5.258148193359375, | |
"weight_norm": 393.4558410644531, | |
"timestamp": "2024-07-27T04:41:14.031924" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_160 | |
[04:41:31] INFO saving took 17.931371927261353 seconds utils.py:611 | |
Per-token loss scaled by world size: 0.0045623015612363815 | |
Per-token loss scaled by world size: 0.0081652095541358 | |
Per-token loss scaled by world size: 0.0009351570624858141 | |
Per-token loss scaled by world size: 0.002664643106982112 | |
Per-token loss scaled by world size: 0.0031791036017239094 | |
Epoch: 3, Step: 21, Rank: 2, loss = 0.2602660655975342 | |
Epoch: 3, Step: 21, Rank: 1, loss = 0.0849355012178421 | |
Epoch: 3, Step: 21, Rank: 4, loss = 0.10133392363786697 | |
Epoch: 3, Step: 21, Rank: 0, loss = 0.029808131977915764 | |
Epoch: 3, Step: 21, Rank: 6, loss = 0.14542336761951447 | |
Per-token loss scaled by world size: 0.0044220853596925735 | |
Per-token loss scaled by world size: 0.01251036673784256 | |
Epoch: 3, Step: 21, Rank: 3, loss = 0.14095397293567657 | |
Epoch: 3, Step: 21, Rank: 7, loss = 0.39876794815063477 | |
Per-token loss scaled by world size: 0.009876725263893604 | |
Epoch: 3, Step: 21, Rank: 5, loss = 0.31482061743736267 | |
[2024-07-27 04:41:32,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.6800000000000002e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:32,449] [INFO] [timer.py:258:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=18.90515878235841, CurrSamplesPerSec=19.101984863890006, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 3: 50%|█████ | 3/6 [00:19<00:18, 6.29s/it] | |
{ | |
"epoch": 3, | |
"step": 21, | |
"rank": 0, | |
"loss": 0.029808131977915764, | |
"overall_throughput": 19.05711381074895, | |
"lr": 1.6800000000000002e-05, | |
"cuda_mem_allocated": 21.990726947784424, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 255, | |
"batch_size": 8, | |
"total_loss": 0.18453869223594666, | |
"gradnorm": 5.108468055725098, | |
"weight_norm": 393.4559631347656, | |
"timestamp": "2024-07-27T04:41:32.452565" | |
} | |
Per-token loss scaled by world size: 0.0014227991923689842 | |
Per-token loss scaled by world size: 0.0022042023483663797 | |
Per-token loss scaled by world size: 0.0035717289429157972 | |
Per-token loss scaled by world size: 0.0031726094894111156 | |
Per-token loss scaled by world size: 0.0027486486360430717 | |
Per-token loss scaled by world size: 0.002677777549251914 | |
Per-token loss scaled by world size: 0.00375761860050261 | |
Epoch: 3, Step: 22, Rank: 0, loss = 0.048019472509622574 | |
Epoch: 3, Step: 22, Rank: 6, loss = 0.12054584920406342 | |
Epoch: 3, Step: 22, Rank: 5, loss = 0.0927668884396553 | |
Epoch: 3, Step: 22, Rank: 3, loss = 0.10707557201385498 | |
Epoch: 3, Step: 22, Rank: 1, loss = 0.12681962549686432 | |
Epoch: 3, Step: 22, Rank: 7, loss = 0.09037499129772186 | |
Epoch: 3, Step: 22, Rank: 4, loss = 0.07439182698726654 | |
Per-token loss scaled by world size: 0.008931323885917664 | |
Epoch: 3, Step: 22, Rank: 2, loss = 0.30143219232559204 | |
[2024-07-27 04:41:32,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.76e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:32,927] [INFO] [timer.py:258:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=18.93573071473506, CurrSamplesPerSec=19.53597958978115, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 3: 67%|██████▋ | 4/6 [00:20<00:07, 4.00s/it] | |
{ | |
"epoch": 3, | |
"step": 22, | |
"rank": 0, | |
"loss": 0.048019472509622574, | |
"overall_throughput": 19.499003967857334, | |
"lr": 1.76e-05, | |
"cuda_mem_allocated": 21.989171504974365, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 270, | |
"batch_size": 8, | |
"total_loss": 0.12017828971147537, | |
"gradnorm": 3.623103380203247, | |
"weight_norm": 393.4561462402344, | |
"timestamp": "2024-07-27T04:41:32.991116" | |
} | |
Per-token loss scaled by world size: 0.009149123914539814 | |
Per-token loss scaled by world size: 0.0035603949800133705 | |
Per-token loss scaled by world size: 0.004935313016176224 | |
Per-token loss scaled by world size: 0.0064824605360627174 | |
Per-token loss scaled by world size: 0.005307480692863464 | |
Per-token loss scaled by world size: 0.0033412924967706203 | |
Per-token loss scaled by world size: 0.00997106358408928 | |
Epoch: 3, Step: 23, Rank: 6, loss = 0.10814699530601501 | |
Epoch: 3, Step: 23, Rank: 1, loss = 0.14991013705730438 | |
Epoch: 3, Step: 23, Rank: 4, loss = 0.30287104845046997 | |
Epoch: 3, Step: 23, Rank: 0, loss = 0.2779046297073364 | |
Epoch: 3, Step: 23, Rank: 2, loss = 0.16121472418308258 | |
Epoch: 3, Step: 23, Rank: 5, loss = 0.1969047337770462 | |
Epoch: 3, Step: 23, Rank: 3, loss = 0.10149175673723221 | |
Per-token loss scaled by world size: 0.00799593236297369 | |
Epoch: 3, Step: 23, Rank: 7, loss = 0.24287645518779755 | |
[2024-07-27 04:41:33,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.8400000000000003e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:33,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=18.969543790437204, CurrSamplesPerSec=19.672103775255234, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 3, | |
"step": 23, | |
"rank": 0, | |
"loss": 0.2779046297073364, | |
"overall_throughput": 19.604933597424527, | |
"lr": 1.8400000000000003e-05, | |
"cuda_mem_allocated": 21.990487575531006, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 243, | |
"batch_size": 8, | |
"total_loss": 0.19266505539417267, | |
"gradnorm": 3.409485101699829, | |
"weight_norm": 393.4563293457031, | |
"timestamp": "2024-07-27T04:41:33.402199" | |
} | |
Per-token loss scaled by world size: 0.00869434978812933 | |
Per-token loss scaled by world size: 0.010431548580527306 | |
Per-token loss scaled by world size: 0.00882460456341505 | |
Per-token loss scaled by world size: 0.014862887561321259 | |
Per-token loss scaled by world size: 0.007030695676803589 | |
Per-token loss scaled by world size: 0.009925030171871185 | |
Per-token loss scaled by world size: 0.013269560411572456 | |
Epoch: 3, Step: 24, Rank: 1, loss = 0.31989189982414246 | |
Epoch: 3, Step: 24, Rank: 6, loss = 0.5387796759605408 | |
Epoch: 3, Step: 24, Rank: 0, loss = 0.37814363837242126 | |
Epoch: 3, Step: 24, Rank: 7, loss = 0.359782338142395 | |
Epoch: 3, Step: 24, Rank: 2, loss = 0.2548627257347107 | |
Epoch: 3, Step: 24, Rank: 5, loss = 0.48102155327796936 | |
Epoch: 3, Step: 24, Rank: 3, loss = 0.31517016887664795 | |
Per-token loss scaled by world size: 0.008658657781779766 | |
Epoch: 3, Step: 24, Rank: 4, loss = 0.31387636065483093 | |
[2024-07-27 04:41:33,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.9200000000000003e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:33,879] [INFO] [timer.py:258:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=18.986021230980416, CurrSamplesPerSec=19.338782826201598, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 3, | |
"step": 24, | |
"rank": 0, | |
"loss": 0.37814363837242126, | |
"overall_throughput": 19.301816374521977, | |
"lr": 1.9200000000000003e-05, | |
"cuda_mem_allocated": 21.990128993988037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 290, | |
"batch_size": 8, | |
"total_loss": 0.3701910078525543, | |
"gradnorm": 41.655189514160156, | |
"weight_norm": 393.4565124511719, | |
"timestamp": "2024-07-27T04:41:33.942020" | |
} | |
Epoch 3: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it] | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 7 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 0 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 7 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 1 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 1 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 2 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 4 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 6 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 6 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 6 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 3 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
Per-token loss scaled by world size: 0.011749987490475178 | |
Per-token loss scaled by world size: 0.0038406068924814463 | |
Per-token loss scaled by world size: 0.01040646806359291 | |
Per-token loss scaled by world size: 0.0029110456816852093 | |
Per-token loss scaled by world size: 0.001697335857897997 | |
Per-token loss scaled by world size: 0.0049619837664067745 | |
Per-token loss scaled by world size: 0.008784592151641846 | |
Epoch: 4, Step: 25, Rank: 3, loss = 0.13202086091041565 | |
Epoch: 4, Step: 25, Rank: 6, loss = 0.40390580892562866 | |
Epoch: 4, Step: 25, Rank: 0, loss = 0.10006719827651978 | |
Epoch: 4, Step: 25, Rank: 1, loss = 0.35772234201431274 | |
Epoch: 4, Step: 25, Rank: 5, loss = 0.30197036266326904 | |
Epoch: 4, Step: 25, Rank: 7, loss = 0.05834592133760452 | |
Epoch: 4, Step: 25, Rank: 2, loss = 0.17056819796562195 | |
Per-token loss scaled by world size: 0.0024965431075543165 | |
Epoch: 4, Step: 25, Rank: 4, loss = 0.08581867069005966 | |
[2024-07-27 04:41:34,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[2e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:34,808] [INFO] [timer.py:258:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=18.909596063618803, CurrSamplesPerSec=17.371243026535403, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 200 | |
{ | |
"epoch": 4, | |
"step": 25, | |
"rank": 0, | |
"loss": 0.10006719827651978, | |
"overall_throughput": 17.301396406941095, | |
"lr": 2e-05, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 275, | |
"batch_size": 8, | |
"total_loss": 0.2013024240732193, | |
"gradnorm": 2.961458921432495, | |
"weight_norm": 393.45672607421875, | |
"timestamp": "2024-07-27T04:41:34.811952" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_200 | |
[04:41:52] INFO saving took 17.899128675460815 seconds utils.py:611 | |
Epoch 4: 17%|█▋ | 1/6 [00:18<01:33, 18.72s/it] | |
Per-token loss scaled by world size: 0.005662889685481787 | |
Per-token loss scaled by world size: 0.003961643204092979 | |
Per-token loss scaled by world size: 0.0033513393718749285 | |
Per-token loss scaled by world size: 0.0048882560804486275 | |
Per-token loss scaled by world size: 0.005098323803395033 | |
Per-token loss scaled by world size: 0.0037976952735334635 | |
Per-token loss scaled by world size: 0.0018476687837392092 | |
Epoch: 4, Step: 26, Rank: 5, loss = 0.12182052433490753 | |
Epoch: 4, Step: 26, Rank: 1, loss = 0.10305368900299072 | |
Epoch: 4, Step: 26, Rank: 0, loss = 0.17413385212421417 | |
Epoch: 4, Step: 26, Rank: 7, loss = 0.1503138691186905 | |
Epoch: 4, Step: 26, Rank: 3, loss = 0.11677912622690201 | |
Epoch: 4, Step: 26, Rank: 2, loss = 0.15677346289157867 | |
Epoch: 4, Step: 26, Rank: 6, loss = 0.05681581422686577 | |
Per-token loss scaled by world size: 0.0031180845107883215 | |
Epoch: 4, Step: 26, Rank: 4, loss = 0.09588109701871872 | |
[2024-07-27 04:41:53,120] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.9959742939952393e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:53,198] [INFO] [timer.py:258:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=18.9111373084012, CurrSamplesPerSec=18.946655411223635, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 4: 33%|███▎ | 2/6 [00:19<00:31, 7.99s/it] | |
{ | |
"epoch": 4, | |
"step": 26, | |
"rank": 0, | |
"loss": 0.17413385212421417, | |
"overall_throughput": 18.900859917782263, | |
"lr": 1.9959742939952393e-05, | |
"cuda_mem_allocated": 21.990487575531006, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 246, | |
"batch_size": 8, | |
"total_loss": 0.12194641679525375, | |
"gradnorm": 2.005527973175049, | |
"weight_norm": 393.4569091796875, | |
"timestamp": "2024-07-27T04:41:53.262256" | |
} | |
Per-token loss scaled by world size: 0.007557791192084551 | |
Per-token loss scaled by world size: 0.008283982053399086 | |
Per-token loss scaled by world size: 0.003100305562838912 | |
Per-token loss scaled by world size: 0.011851347051560879 | |
Per-token loss scaled by world size: 0.013045835308730602 | |
Per-token loss scaled by world size: 0.009396737441420555 | |
Per-token loss scaled by world size: 0.0076859793625772 | |
Epoch: 4, Step: 27, Rank: 0, loss = 0.20972870290279388 | |
Epoch: 4, Step: 27, Rank: 2, loss = 0.22988051176071167 | |
Epoch: 4, Step: 27, Rank: 3, loss = 0.3288748860359192 | |
Epoch: 4, Step: 27, Rank: 6, loss = 0.36202192306518555 | |
Epoch: 4, Step: 27, Rank: 5, loss = 0.26075947284698486 | |
Epoch: 4, Step: 27, Rank: 4, loss = 0.08603347837924957 | |
Epoch: 4, Step: 27, Rank: 7, loss = 0.2132859230041504 | |
Per-token loss scaled by world size: 0.004431413020938635 | |
Epoch: 4, Step: 27, Rank: 1, loss = 0.12297171354293823 | |
[2024-07-27 04:41:53,588] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.98392958859863e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:53,665] [INFO] [timer.py:258:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=18.95425254539218, CurrSamplesPerSec=20.05141088310167, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 4: 50%|█████ | 3/6 [00:19<00:13, 4.56s/it] | |
{ | |
"epoch": 4, | |
"step": 27, | |
"rank": 0, | |
"loss": 0.20972870290279388, | |
"overall_throughput": 20.01441802121427, | |
"lr": 1.98392958859863e-05, | |
"cuda_mem_allocated": 21.988572120666504, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 222, | |
"batch_size": 8, | |
"total_loss": 0.22669458389282227, | |
"gradnorm": 4.566909313201904, | |
"weight_norm": 393.4571838378906, | |
"timestamp": "2024-07-27T04:41:53.729196" | |
} | |
Per-token loss scaled by world size: 0.002589393639937043 | |
Per-token loss scaled by world size: 0.0016010800609365106 | |
Per-token loss scaled by world size: 0.009488740935921669 | |
Per-token loss scaled by world size: 0.007330995053052902 | |
Per-token loss scaled by world size: 0.006591046694666147 | |
Per-token loss scaled by world size: 0.0028418628498911858 | |
Per-token loss scaled by world size: 0.0009722260874696076 | |
Epoch: 4, Step: 28, Rank: 0, loss = 0.055437397211790085 | |
Epoch: 4, Step: 28, Rank: 5, loss = 0.3285476565361023 | |
Epoch: 4, Step: 28, Rank: 1, loss = 0.0896577537059784 | |
Epoch: 4, Step: 28, Rank: 4, loss = 0.22821499407291412 | |
Epoch: 4, Step: 28, Rank: 2, loss = 0.2538357079029083 | |
Epoch: 4, Step: 28, Rank: 6, loss = 0.09839949756860733 | |
Epoch: 4, Step: 28, Rank: 3, loss = 0.03366332873702049 | |
Per-token loss scaled by world size: 0.017863700166344643 | |
Epoch: 4, Step: 28, Rank: 7, loss = 0.6185306310653687 | |
[2024-07-27 04:41:54,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.9639628606958535e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:54,145] [INFO] [timer.py:258:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=18.966307174026092, CurrSamplesPerSec=19.272736671546916, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 4: 67%|██████▋ | 4/6 [00:20<00:05, 2.95s/it] | |
{ | |
"epoch": 4, | |
"step": 28, | |
"rank": 0, | |
"loss": 0.055437397211790085, | |
"overall_throughput": 19.215202457388543, | |
"lr": 1.9639628606958535e-05, | |
"cuda_mem_allocated": 21.989171504974365, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 277, | |
"batch_size": 8, | |
"total_loss": 0.21328586339950562, | |
"gradnorm": 8.249006271362305, | |
"weight_norm": 393.4573974609375, | |
"timestamp": "2024-07-27T04:41:54.208735" | |
} | |
Per-token loss scaled by world size: 0.0066835153847932816 | |
Per-token loss scaled by world size: 0.004529251717031002 | |
Per-token loss scaled by world size: 0.0037545531522482634 | |
Per-token loss scaled by world size: 0.003318126080557704 | |
Per-token loss scaled by world size: 0.002113455906510353 | |
Per-token loss scaled by world size: 0.0010128725552931428 | |
Per-token loss scaled by world size: 0.0017812160076573491 | |
Epoch: 4, Step: 29, Rank: 0, loss = 0.2815430760383606 | |
Epoch: 4, Step: 29, Rank: 6, loss = 0.1397760659456253 | |
Epoch: 4, Step: 29, Rank: 3, loss = 0.19079472124576569 | |
Epoch: 4, Step: 29, Rank: 7, loss = 0.15816055238246918 | |
Epoch: 4, Step: 29, Rank: 5, loss = 0.07503372430801392 | |
Epoch: 4, Step: 29, Rank: 1, loss = 0.04266725853085518 | |
Epoch: 4, Step: 29, Rank: 4, loss = 0.08902932703495026 | |
Per-token loss scaled by world size: 0.00729252677410841 | |
Epoch: 4, Step: 29, Rank: 2, loss = 0.3071976900100708 | |
[2024-07-27 04:41:54,548] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.9362348706397374e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:54,626] [INFO] [timer.py:258:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=18.978206465919676, CurrSamplesPerSec=19.292915749104477, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 4: 83%|████████▎ | 5/6 [00:20<00:02, 2.06s/it] | |
{ | |
"epoch": 4, | |
"step": 29, | |
"rank": 0, | |
"loss": 0.2815430760383606, | |
"overall_throughput": 19.249871636665503, | |
"lr": 1.9362348706397374e-05, | |
"cuda_mem_allocated": 21.99084711074829, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 337, | |
"batch_size": 8, | |
"total_loss": 0.16052529215812683, | |
"gradnorm": 3.410759210586548, | |
"weight_norm": 393.4576110839844, | |
"timestamp": "2024-07-27T04:41:54.689974" | |
} | |
Per-token loss scaled by world size: 0.005897491704672575 | |
Per-token loss scaled by world size: 0.007752169389277697 | |
Per-token loss scaled by world size: 0.007537755649536848 | |
Per-token loss scaled by world size: 0.012558677233755589 | |
Per-token loss scaled by world size: 0.00658394442871213 | |
Per-token loss scaled by world size: 0.003483764361590147 | |
Per-token loss scaled by world size: 0.0014572414802387357 | |
Epoch: 4, Step: 30, Rank: 7, loss = 0.21899878978729248 | |
Epoch: 4, Step: 30, Rank: 0, loss = 0.1859964281320572 | |
Epoch: 4, Step: 30, Rank: 4, loss = 0.21294160187244415 | |
Epoch: 4, Step: 30, Rank: 6, loss = 0.166604146361351 | |
Epoch: 4, Step: 30, Rank: 1, loss = 0.3547826409339905 | |
Epoch: 4, Step: 30, Rank: 3, loss = 0.09841634333133698 | |
Epoch: 4, Step: 30, Rank: 2, loss = 0.04116707295179367 | |
Per-token loss scaled by world size: 0.013426681980490685 | |
Epoch: 4, Step: 30, Rank: 5, loss = 0.3793037533760071 | |
[2024-07-27 04:41:55,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.900968867902419e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:41:55,093] [INFO] [timer.py:258:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=19.013068675506908, CurrSamplesPerSec=20.005289522667084, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 240 | |
{ | |
"epoch": 4, | |
"step": 30, | |
"rank": 0, | |
"loss": 0.1859964281320572, | |
"overall_throughput": 19.962075281886168, | |
"lr": 1.900968867902419e-05, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 226, | |
"batch_size": 8, | |
"total_loss": 0.2072763293981552, | |
"gradnorm": 3.050539255142212, | |
"weight_norm": 393.45782470703125, | |
"timestamp": "2024-07-27T04:41:55.095878" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_240 | |
[04:42:12] INFO saving took 17.86995029449463 seconds utils.py:611 | |
Epoch 4: 100%|██████████| 6/6 [00:39<00:00, 6.51s/it] | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 0 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 5 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 5 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 1 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 1 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 1 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 3 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 2 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 6 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 6 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 3 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 3 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 4 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 4 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 4 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 4 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
Per-token loss scaled by world size: 0.005013823974877596 | |
Per-token loss scaled by world size: 0.002962352242320776 | |
Per-token loss scaled by world size: 0.004985218867659569 | |
Per-token loss scaled by world size: 0.0022716219536960125 | |
Per-token loss scaled by world size: 0.0034899800084531307 | |
Per-token loss scaled by world size: 0.0017849565483629704 | |
Per-token loss scaled by world size: 0.0013634071219712496 | |
Epoch: 5, Step: 31, Rank: 6, loss = 0.17323635518550873 | |
Epoch: 5, Step: 31, Rank: 4, loss = 0.17423038184642792 | |
Epoch: 5, Step: 31, Rank: 3, loss = 0.10294174402952194 | |
Epoch: 5, Step: 31, Rank: 2, loss = 0.07893886417150497 | |
Epoch: 5, Step: 31, Rank: 0, loss = 0.06202723830938339 | |
Epoch: 5, Step: 31, Rank: 1, loss = 0.12127680331468582 | |
Epoch: 5, Step: 31, Rank: 5, loss = 0.04737839847803116 | |
Per-token loss scaled by world size: 0.011078107170760632 | |
Epoch: 5, Step: 31, Rank: 7, loss = 0.3849642276763916 | |
[2024-07-27 04:42:13,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.8584487936018663e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:13,922] [INFO] [timer.py:258:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=18.801550768968706, CurrSamplesPerSec=14.335953625150017, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 5, | |
"step": 31, | |
"rank": 0, | |
"loss": 0.06202723830938339, | |
"overall_throughput": 14.285813059808566, | |
"lr": 1.8584487936018663e-05, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 278, | |
"batch_size": 8, | |
"total_loss": 0.14312423765659332, | |
"gradnorm": 4.453860282897949, | |
"weight_norm": 393.4580078125, | |
"timestamp": "2024-07-27T04:42:13.987803" | |
} | |
Per-token loss scaled by world size: 0.0028378700371831656 | |
Per-token loss scaled by world size: 0.005622998811304569 | |
Per-token loss scaled by world size: 0.0031444875057786703 | |
Per-token loss scaled by world size: 0.0035572010092437267 | |
Per-token loss scaled by world size: 0.004025444388389587 | |
Per-token loss scaled by world size: 0.005346423946321011 | |
Per-token loss scaled by world size: 0.0037831738591194153 | |
Epoch: 5, Step: 32, Rank: 0, loss = 0.0971970483660698 | |
Epoch: 5, Step: 32, Rank: 6, loss = 0.10769869387149811 | |
Epoch: 5, Step: 32, Rank: 7, loss = 0.12183413654565811 | |
Epoch: 5, Step: 32, Rank: 3, loss = 0.1925877034664154 | |
Epoch: 5, Step: 32, Rank: 2, loss = 0.13787147402763367 | |
Epoch: 5, Step: 32, Rank: 1, loss = 0.18311502039432526 | |
Epoch: 5, Step: 32, Rank: 5, loss = 0.12957370281219482 | |
Per-token loss scaled by world size: 0.008308484219014645 | |
Epoch: 5, Step: 32, Rank: 4, loss = 0.28456559777259827 | |
[2024-07-27 04:42:14,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.8090169943749477e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:14,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=18.82734153699244, CurrSamplesPerSec=19.607327906336685, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 5, | |
"step": 32, | |
"rank": 0, | |
"loss": 0.0971970483660698, | |
"overall_throughput": 19.569625420939243, | |
"lr": 1.8090169943749477e-05, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 274, | |
"batch_size": 8, | |
"total_loss": 0.15680542588233948, | |
"gradnorm": 2.596428394317627, | |
"weight_norm": 393.45819091796875, | |
"timestamp": "2024-07-27T04:42:14.402141" | |
} | |
Per-token loss scaled by world size: 0.0021558639127761126 | |
Per-token loss scaled by world size: 0.004672932904213667 | |
Per-token loss scaled by world size: 0.0039972770027816296 | |
Per-token loss scaled by world size: 0.0053141191601753235 | |
Per-token loss scaled by world size: 0.0033407120499759912 | |
Per-token loss scaled by world size: 0.006172977387905121 | |
Per-token loss scaled by world size: 0.003799165366217494 | |
Epoch: 5, Step: 33, Rank: 0, loss = 0.12193598598241806 | |
Epoch: 5, Step: 33, Rank: 1, loss = 0.193965345621109 | |
Epoch: 5, Step: 33, Rank: 2, loss = 0.1705620437860489 | |
Epoch: 5, Step: 33, Rank: 7, loss = 0.14590060710906982 | |
Epoch: 5, Step: 33, Rank: 3, loss = 0.13866953551769257 | |
Epoch: 5, Step: 33, Rank: 6, loss = 0.2253136783838272 | |
Epoch: 5, Step: 33, Rank: 4, loss = 0.07868903130292892 | |
Per-token loss scaled by world size: 0.003766376990824938 | |
Epoch: 5, Step: 33, Rank: 5, loss = 0.13747276365756989 | |
[2024-07-27 04:42:14,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.7530714660036112e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:14,872] [INFO] [timer.py:258:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=18.85669114464766, CurrSamplesPerSec=19.781816809788317, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 5: 50%|█████ | 3/6 [00:01<00:01, 1.80it/s] | |
{ | |
"epoch": 5, | |
"step": 33, | |
"rank": 0, | |
"loss": 0.12193598598241806, | |
"overall_throughput": 19.744405215835805, | |
"lr": 1.7530714660036112e-05, | |
"cuda_mem_allocated": 21.98988962173462, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 292, | |
"batch_size": 8, | |
"total_loss": 0.1515636295080185, | |
"gradnorm": 2.53242564201355, | |
"weight_norm": 393.4584655761719, | |
"timestamp": "2024-07-27T04:42:14.936193" | |
} | |
Per-token loss scaled by world size: 0.004924851469695568 | |
Per-token loss scaled by world size: 0.004273226950317621 | |
Per-token loss scaled by world size: 0.002622528001666069 | |
Per-token loss scaled by world size: 0.0037059050519019365 | |
Per-token loss scaled by world size: 0.0047779749147593975 | |
Per-token loss scaled by world size: 0.005559505894780159 | |
Per-token loss scaled by world size: 0.007279254496097565 | |
Epoch: 5, Step: 34, Rank: 7, loss = 0.12646400928497314 | |
Epoch: 5, Step: 34, Rank: 2, loss = 0.0894937664270401 | |
Epoch: 5, Step: 34, Rank: 3, loss = 0.1630484014749527 | |
Epoch: 5, Step: 34, Rank: 1, loss = 0.1680605560541153 | |
Epoch: 5, Step: 34, Rank: 0, loss = 0.1458238661289215 | |
Epoch: 5, Step: 34, Rank: 6, loss = 0.18971814215183258 | |
Epoch: 5, Step: 34, Rank: 5, loss = 0.24840456247329712 | |
Per-token loss scaled by world size: 0.010788660496473312 | |
Epoch: 5, Step: 34, Rank: 4, loss = 0.3681630492210388 | |
[2024-07-27 04:42:15,270] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.691062648986865e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:15,347] [INFO] [timer.py:258:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=18.876022011428752, CurrSamplesPerSec=19.495582553322528, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 5: 67%|██████▋ | 4/6 [00:02<00:01, 1.91it/s] | |
{ | |
"epoch": 5, | |
"step": 34, | |
"rank": 0, | |
"loss": 0.1458238661289215, | |
"overall_throughput": 19.439414681510893, | |
"lr": 1.691062648986865e-05, | |
"cuda_mem_allocated": 21.988572120666504, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 273, | |
"batch_size": 8, | |
"total_loss": 0.1873970627784729, | |
"gradnorm": 2.919456958770752, | |
"weight_norm": 393.45867919921875, | |
"timestamp": "2024-07-27T04:42:15.410652" | |
} | |
Per-token loss scaled by world size: 0.0070740398950874805 | |
Per-token loss scaled by world size: 0.006351261865347624 | |
Per-token loss scaled by world size: 0.009431459940969944 | |
Per-token loss scaled by world size: 0.0034575308673083782 | |
Per-token loss scaled by world size: 0.0034287304151803255 | |
Per-token loss scaled by world size: 0.006853340193629265 | |
Per-token loss scaled by world size: 0.004821010399609804 | |
Epoch: 5, Step: 35, Rank: 0, loss = 0.20337864756584167 | |
Epoch: 5, Step: 35, Rank: 1, loss = 0.2711544632911682 | |
Epoch: 5, Step: 35, Rank: 4, loss = 0.1825987845659256 | |
Epoch: 5, Step: 35, Rank: 6, loss = 0.09940401464700699 | |
Epoch: 5, Step: 35, Rank: 2, loss = 0.19703352451324463 | |
Epoch: 5, Step: 35, Rank: 7, loss = 0.09857600182294846 | |
Epoch: 5, Step: 35, Rank: 5, loss = 0.1386040449142456 | |
Per-token loss scaled by world size: 0.0033929902128875256 | |
Epoch: 5, Step: 35, Rank: 3, loss = 0.0975484699010849 | |
[2024-07-27 04:42:15,747] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.6234898018587336e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:15,825] [INFO] [timer.py:258:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=18.89253145148775, CurrSamplesPerSec=19.43652077202901, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 280 | |
{ | |
"epoch": 5, | |
"step": 35, | |
"rank": 0, | |
"loss": 0.20337864756584167, | |
"overall_throughput": 19.39916886552688, | |
"lr": 1.6234898018587336e-05, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 230, | |
"batch_size": 8, | |
"total_loss": 0.16103725135326385, | |
"gradnorm": 3.5732498168945312, | |
"weight_norm": 393.45892333984375, | |
"timestamp": "2024-07-27T04:42:15.828221" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_280 | |
[04:42:33] INFO saving took 17.876163959503174 seconds utils.py:611 | |
Per-token loss scaled by world size: 0.004089208785444498 | |
Per-token loss scaled by world size: 0.0032626439351588488 | |
Per-token loss scaled by world size: 0.007577312644571066 | |
Per-token loss scaled by world size: 0.00760306091979146 | |
Per-token loss scaled by world size: 0.0089601781219244 | |
Per-token loss scaled by world size: 0.0050941589288413525 | |
Per-token loss scaled by world size: 0.004234898369759321 | |
Epoch: 5, Step: 36, Rank: 1, loss = 0.09910281002521515 | |
Epoch: 5, Step: 36, Rank: 4, loss = 0.2309429794549942 | |
Epoch: 5, Step: 36, Rank: 0, loss = 0.12420971691608429 | |
Epoch: 5, Step: 36, Rank: 7, loss = 0.2721654176712036 | |
Epoch: 5, Step: 36, Rank: 5, loss = 0.2301608771085739 | |
Epoch: 5, Step: 36, Rank: 2, loss = 0.15473507344722748 | |
Epoch: 5, Step: 36, Rank: 6, loss = 0.12863503396511078 | |
Per-token loss scaled by world size: 0.003793718060478568 | |
Epoch: 5, Step: 36, Rank: 3, loss = 0.11523418873548508 | |
[2024-07-27 04:42:34,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.5508969814521026e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:34,189] [INFO] [timer.py:258:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=18.899391419587044, CurrSamplesPerSec=19.128599036570417, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 5: 100%|██████████| 6/6 [00:21<00:00, 4.75s/it] | |
{ | |
"epoch": 5, | |
"step": 36, | |
"rank": 0, | |
"loss": 0.12420971691608429, | |
"overall_throughput": 19.082094487965083, | |
"lr": 1.5508969814521026e-05, | |
"cuda_mem_allocated": 21.992165088653564, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 243, | |
"batch_size": 8, | |
"total_loss": 0.1693982630968094, | |
"gradnorm": 3.1067850589752197, | |
"weight_norm": 393.45916748046875, | |
"timestamp": "2024-07-27T04:42:34.192324" | |
} | |
Epoch 5: 100%|██████████| 6/6 [00:21<00:00, 3.54s/it] | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 4 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 0 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 0 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 1 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 1 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 4 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 5 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 6 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 2 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 5 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 2 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 3 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 3 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 3 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 5 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 7 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 7 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 7 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
Per-token loss scaled by world size: 0.007784623187035322 | |
Per-token loss scaled by world size: 0.003072483232244849 | |
Per-token loss scaled by world size: 0.009150322526693344 | |
Per-token loss scaled by world size: 0.0048055145889520645 | |
Per-token loss scaled by world size: 0.007070611696690321 | |
Per-token loss scaled by world size: 0.0026875571347773075 | |
Per-token loss scaled by world size: 0.0032222422305494547 | |
Epoch: 6, Step: 37, Rank: 6, loss = 0.08487734943628311 | |
Epoch: 6, Step: 37, Rank: 3, loss = 0.21505022048950195 | |
Epoch: 6, Step: 37, Rank: 0, loss = 0.25277766585350037 | |
Epoch: 6, Step: 37, Rank: 4, loss = 0.19532564282417297 | |
Epoch: 6, Step: 37, Rank: 2, loss = 0.07424376904964447 | |
Epoch: 6, Step: 37, Rank: 5, loss = 0.0890144407749176 | |
Epoch: 6, Step: 37, Rank: 1, loss = 0.13275234401226044 | |
Per-token loss scaled by world size: 0.003815547563135624 | |
Epoch: 6, Step: 37, Rank: 7, loss = 0.10540450364351273 | |
[2024-07-27 04:42:35,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.4738686624729987e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:35,123] [INFO] [timer.py:258:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=18.884730321309164, CurrSamplesPerSec=18.399439371025178, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 6: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s] | |
{ | |
"epoch": 6, | |
"step": 37, | |
"rank": 0, | |
"loss": 0.25277766585350037, | |
"overall_throughput": 18.323549504455777, | |
"lr": 1.4738686624729987e-05, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 221, | |
"batch_size": 8, | |
"total_loss": 0.14368075132369995, | |
"gradnorm": 2.0474841594696045, | |
"weight_norm": 393.4593811035156, | |
"timestamp": "2024-07-27T04:42:35.186644" | |
} | |
Per-token loss scaled by world size: 0.0039473348297178745 | |
Per-token loss scaled by world size: 0.0038144520949572325 | |
Per-token loss scaled by world size: 0.0010828088270500302 | |
Per-token loss scaled by world size: 0.0007635311339981854 | |
Per-token loss scaled by world size: 0.0021416409872472286 | |
Per-token loss scaled by world size: 0.0017905712593346834 | |
Per-token loss scaled by world size: 0.005295279435813427 | |
Epoch: 6, Step: 38, Rank: 0, loss = 0.14901189506053925 | |
Epoch: 6, Step: 38, Rank: 1, loss = 0.14399556815624237 | |
Epoch: 6, Step: 38, Rank: 7, loss = 0.040876034647226334 | |
Epoch: 6, Step: 38, Rank: 4, loss = 0.02882329933345318 | |
Epoch: 6, Step: 38, Rank: 3, loss = 0.08084695041179657 | |
Epoch: 6, Step: 38, Rank: 2, loss = 0.06759406626224518 | |
Epoch: 6, Step: 38, Rank: 5, loss = 0.19989679753780365 | |
Per-token loss scaled by world size: 0.006602860987186432 | |
Epoch: 6, Step: 38, Rank: 6, loss = 0.24925799667835236 | |
[2024-07-27 04:42:35,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.3930250316539237e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:35,599] [INFO] [timer.py:258:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=18.899852383768156, CurrSamplesPerSec=19.44482195705551, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 6: 33%|███▎ | 2/6 [00:01<00:02, 1.62it/s] | |
{ | |
"epoch": 6, | |
"step": 38, | |
"rank": 0, | |
"loss": 0.14901189506053925, | |
"overall_throughput": 19.37470523157654, | |
"lr": 1.3930250316539237e-05, | |
"cuda_mem_allocated": 21.990607738494873, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 302, | |
"batch_size": 8, | |
"total_loss": 0.12003782391548157, | |
"gradnorm": 1.780216097831726, | |
"weight_norm": 393.4596252441406, | |
"timestamp": "2024-07-27T04:42:35.663666" | |
} | |
Per-token loss scaled by world size: 0.0038089316803961992 | |
Per-token loss scaled by world size: 0.0020512850023806095 | |
Per-token loss scaled by world size: 0.008632734417915344 | |
Per-token loss scaled by world size: 0.0009830425260588527 | |
Per-token loss scaled by world size: 0.002817384200170636 | |
Per-token loss scaled by world size: 0.007761223241686821 | |
Per-token loss scaled by world size: 0.007352802902460098 | |
Epoch: 6, Step: 39, Rank: 6, loss = 0.06435906887054443 | |
Epoch: 6, Step: 39, Rank: 2, loss = 0.2708520293235779 | |
Epoch: 6, Step: 39, Rank: 7, loss = 0.030842959880828857 | |
Epoch: 6, Step: 39, Rank: 4, loss = 0.24350838363170624 | |
Epoch: 6, Step: 39, Rank: 0, loss = 0.11950523406267166 | |
Epoch: 6, Step: 39, Rank: 3, loss = 0.08839543163776398 | |
Epoch: 6, Step: 39, Rank: 5, loss = 0.23069418966770172 | |
Per-token loss scaled by world size: 0.003210328985005617 | |
Epoch: 6, Step: 39, Rank: 1, loss = 0.10072407126426697 | |
[2024-07-27 04:42:35,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[1.3090169943749475e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:36,075] [INFO] [timer.py:258:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=18.915587653609716, CurrSamplesPerSec=19.50004649173377, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 6: 50%|█████ | 3/6 [00:01<00:01, 1.81it/s] | |
{ | |
"epoch": 6, | |
"step": 39, | |
"rank": 0, | |
"loss": 0.11950523406267166, | |
"overall_throughput": 19.43697112933871, | |
"lr": 1.3090169943749475e-05, | |
"cuda_mem_allocated": 21.98988962173462, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 251, | |
"batch_size": 8, | |
"total_loss": 0.1436101794242859, | |
"gradnorm": 2.214144706726074, | |
"weight_norm": 393.4598388671875, | |
"timestamp": "2024-07-27T04:42:36.138875" | |
} | |
Per-token loss scaled by world size: 0.001895732944831252 | |
Per-token loss scaled by world size: 0.0019446390215307474 | |
Per-token loss scaled by world size: 0.0018286737613379955 | |
Per-token loss scaled by world size: 0.002989412285387516 | |
Per-token loss scaled by world size: 0.0028383415192365646 | |
Per-token loss scaled by world size: 0.002208298072218895 | |
Per-token loss scaled by world size: 0.005300204269587994 | |
Epoch: 6, Step: 40, Rank: 4, loss = 0.11060825735330582 | |
Epoch: 6, Step: 40, Rank: 3, loss = 0.06766092777252197 | |
Epoch: 6, Step: 40, Rank: 0, loss = 0.07014212012290955 | |
Epoch: 6, Step: 40, Rank: 7, loss = 0.10501863807439804 | |
Epoch: 6, Step: 40, Rank: 1, loss = 0.07195164263248444 | |
Epoch: 6, Step: 40, Rank: 2, loss = 0.08170703053474426 | |
Epoch: 6, Step: 40, Rank: 5, loss = 0.19610755145549774 | |
Per-token loss scaled by world size: 0.0030284025706350803 | |
Epoch: 6, Step: 40, Rank: 6, loss = 0.11205089092254639 | |
[2024-07-27 04:42:36,474] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.2225209339563144e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:36,551] [INFO] [timer.py:258:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=18.92908988177896, CurrSamplesPerSec=19.44259109142837, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 320 | |
{ | |
"epoch": 6, | |
"step": 40, | |
"rank": 0, | |
"loss": 0.07014212012290955, | |
"overall_throughput": 19.380199690766368, | |
"lr": 1.2225209339563144e-05, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 296, | |
"batch_size": 8, | |
"total_loss": 0.10190588980913162, | |
"gradnorm": 1.372182011604309, | |
"weight_norm": 393.4600830078125, | |
"timestamp": "2024-07-27T04:42:36.554966" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_320 | |
[04:42:54] INFO saving took 17.879958391189575 seconds utils.py:611 | |
Epoch 6: 67%|██████▋ | 4/6 [00:20<00:15, 7.58s/it] | |
Per-token loss scaled by world size: 0.008056806400418282 | |
Per-token loss scaled by world size: 0.007982512935996056 | |
Per-token loss scaled by world size: 0.00242948392406106 | |
Per-token loss scaled by world size: 0.004318062216043472 | |
Per-token loss scaled by world size: 0.0034818260464817286 | |
Per-token loss scaled by world size: 0.014020812697708607 | |
Epoch: 6, Step: 41, Rank: 3, loss = 0.07318820059299469 | |
Epoch: 6, Step: 41, Rank: 0, loss = 0.24271129071712494 | |
Epoch: 6, Step: 41, Rank: 7, loss = 0.2404731959104538 | |
Per-token loss scaled by world size: 0.0027995144482702017 | |
Epoch: 6, Step: 41, Rank: 1, loss = 0.42237699031829834 | |
Epoch: 6, Step: 41, Rank: 2, loss = 0.10489001125097275 | |
Epoch: 6, Step: 41, Rank: 5, loss = 0.13008162379264832 | |
Epoch: 6, Step: 41, Rank: 4, loss = 0.08433537185192108 | |
Per-token loss scaled by world size: 0.002146774670109153 | |
Epoch: 6, Step: 41, Rank: 6, loss = 0.0646715834736824 | |
[2024-07-27 04:42:54,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[1.1342332658176556e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:54,909] [INFO] [timer.py:258:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=18.942455198097623, CurrSamplesPerSec=19.46470827097328, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 6: 83%|████████▎ | 5/6 [00:20<00:05, 5.02s/it] | |
{ | |
"epoch": 6, | |
"step": 41, | |
"rank": 0, | |
"loss": 0.24271129071712494, | |
"overall_throughput": 19.42185054501338, | |
"lr": 1.1342332658176556e-05, | |
"cuda_mem_allocated": 21.990966320037842, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 241, | |
"batch_size": 8, | |
"total_loss": 0.17034101486206055, | |
"gradnorm": 2.2789089679718018, | |
"weight_norm": 393.46026611328125, | |
"timestamp": "2024-07-27T04:42:54.973034" | |
} | |
Per-token loss scaled by world size: 0.0011888241861015558 | |
Per-token loss scaled by world size: 0.0031213611364364624 | |
Per-token loss scaled by world size: 0.002157441573217511 | |
Per-token loss scaled by world size: 0.0022118226625025272 | |
Per-token loss scaled by world size: 0.006297964137047529 | |
Per-token loss scaled by world size: 0.0018200232880190015 | |
Per-token loss scaled by world size: 0.002669830108061433 | |
Epoch: 6, Step: 42, Rank: 1, loss = 0.10846729576587677 | |
Epoch: 6, Step: 42, Rank: 0, loss = 0.04131164029240608 | |
Epoch: 6, Step: 42, Rank: 4, loss = 0.0927765965461731 | |
Epoch: 6, Step: 42, Rank: 6, loss = 0.07686083763837814 | |
Epoch: 6, Step: 42, Rank: 3, loss = 0.07497109472751617 | |
Epoch: 6, Step: 42, Rank: 7, loss = 0.06324581056833267 | |
Epoch: 6, Step: 42, Rank: 2, loss = 0.21885424852371216 | |
Per-token loss scaled by world size: 0.0031561183277517557 | |
Epoch: 6, Step: 42, Rank: 5, loss = 0.10967510938644409 | |
[2024-07-27 04:42:55,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[1.044864830350515e-05], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:55,387] [INFO] [timer.py:258:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=18.95295051613209, CurrSamplesPerSec=19.371539779153203, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.47s/it] | |
{ | |
"epoch": 6, | |
"step": 42, | |
"rank": 0, | |
"loss": 0.04131164029240608, | |
"overall_throughput": 19.31602776765571, | |
"lr": 1.044864830350515e-05, | |
"cuda_mem_allocated": 21.990487575531006, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 278, | |
"batch_size": 8, | |
"total_loss": 0.09827032685279846, | |
"gradnorm": 1.404802680015564, | |
"weight_norm": 393.4604797363281, | |
"timestamp": "2024-07-27T04:42:55.450259" | |
} | |
Epoch 6: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it] | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 4 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 4 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 4 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 5 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 5 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 5 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 4 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 5 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 1 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 1 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 1 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 0 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 2 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 0 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 2 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 7 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 7 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 7 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 6 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 3 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 6 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 6 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
Per-token loss scaled by world size: 0.003743554465472698 | |
Per-token loss scaled by world size: 0.005117448978126049 | |
Per-token loss scaled by world size: 0.0018975065322592854 | |
Per-token loss scaled by world size: 0.009965005330741405 | |
Per-token loss scaled by world size: 0.0038619362749159336 | |
Per-token loss scaled by world size: 0.004172571934759617 | |
Per-token loss scaled by world size: 0.00353407533839345 | |
Epoch: 7, Step: 43, Rank: 6, loss = 0.056213632225990295 | |
Epoch: 7, Step: 43, Rank: 7, loss = 0.2952132821083069 | |
Epoch: 7, Step: 43, Rank: 2, loss = 0.110902801156044 | |
Epoch: 7, Step: 43, Rank: 0, loss = 0.15160442888736725 | |
Epoch: 7, Step: 43, Rank: 1, loss = 0.11440986394882202 | |
Epoch: 7, Step: 43, Rank: 5, loss = 0.12361244112253189 | |
Epoch: 7, Step: 43, Rank: 4, loss = 0.10469698160886765 | |
Per-token loss scaled by world size: 0.003125852905213833 | |
Epoch: 7, Step: 43, Rank: 3, loss = 0.09260339289903641 | |
[2024-07-27 04:42:56,241] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[9.551351696494854e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:56,318] [INFO] [timer.py:258:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=18.947759228108367, CurrSamplesPerSec=18.74241437439884, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 7: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s] | |
{ | |
"epoch": 7, | |
"step": 43, | |
"rank": 0, | |
"loss": 0.15160442888736725, | |
"overall_throughput": 18.665014201337474, | |
"lr": 9.551351696494854e-06, | |
"cuda_mem_allocated": 21.990128993988037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 237, | |
"batch_size": 8, | |
"total_loss": 0.13115710020065308, | |
"gradnorm": 1.5875235795974731, | |
"weight_norm": 393.4606628417969, | |
"timestamp": "2024-07-27T04:42:56.382842" | |
} | |
Per-token loss scaled by world size: 0.0007441109046339989 | |
Per-token loss scaled by world size: 0.002569864271208644 | |
Per-token loss scaled by world size: 0.0021702933590859175 | |
Per-token loss scaled by world size: 0.0034706422593444586 | |
Per-token loss scaled by world size: 0.003474967321380973 | |
Per-token loss scaled by world size: 0.0027420881669968367 | |
Per-token loss scaled by world size: 0.002911260584369302 | |
Epoch: 7, Step: 44, Rank: 7, loss = 0.07596027106046677 | |
Epoch: 7, Step: 44, Rank: 6, loss = 0.0899452492594719 | |
Epoch: 7, Step: 44, Rank: 4, loss = 0.12147247791290283 | |
Epoch: 7, Step: 44, Rank: 3, loss = 0.12162385880947113 | |
Epoch: 7, Step: 44, Rank: 5, loss = 0.09597308933734894 | |
Epoch: 7, Step: 44, Rank: 2, loss = 0.026043880730867386 | |
Epoch: 7, Step: 44, Rank: 1, loss = 0.10189411789178848 | |
Per-token loss scaled by world size: 0.0022017783485352993 | |
Epoch: 7, Step: 44, Rank: 0, loss = 0.07706224173307419 | |
[2024-07-27 04:42:56,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[8.657667341823449e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:56,789] [INFO] [timer.py:258:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=18.96843313737227, CurrSamplesPerSec=19.85672616190888, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 7: 33%|███▎ | 2/6 [00:01<00:02, 1.64it/s] | |
{ | |
"epoch": 7, | |
"step": 44, | |
"rank": 0, | |
"loss": 0.07706224173307419, | |
"overall_throughput": 19.817860414460462, | |
"lr": 8.657667341823449e-06, | |
"cuda_mem_allocated": 21.992404460906982, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 280, | |
"batch_size": 8, | |
"total_loss": 0.08874689787626266, | |
"gradnorm": 1.267701268196106, | |
"weight_norm": 393.4607849121094, | |
"timestamp": "2024-07-27T04:42:56.852698" | |
} | |
Per-token loss scaled by world size: 0.0025115651078522205 | |
Per-token loss scaled by world size: 0.004961833357810974 | |
Per-token loss scaled by world size: 0.0043532936833798885 | |
Per-token loss scaled by world size: 0.0021706747356802225 | |
Per-token loss scaled by world size: 0.0033806730061769485 | |
Per-token loss scaled by world size: 0.002844580914825201 | |
Per-token loss scaled by world size: 0.0033718389458954334 | |
Epoch: 7, Step: 45, Rank: 4, loss = 0.1349520981311798 | |
Epoch: 7, Step: 45, Rank: 5, loss = 0.1538168340921402 | |
Epoch: 7, Step: 45, Rank: 6, loss = 0.06729091703891754 | |
Epoch: 7, Step: 45, Rank: 7, loss = 0.0881820097565651 | |
Epoch: 7, Step: 45, Rank: 0, loss = 0.07785851508378983 | |
Epoch: 7, Step: 45, Rank: 2, loss = 0.10480086505413055 | |
Epoch: 7, Step: 45, Rank: 1, loss = 0.10452700406312943 | |
Per-token loss scaled by world size: 0.0024816528894007206 | |
Epoch: 7, Step: 45, Rank: 3, loss = 0.07693123817443848 | |
[2024-07-27 04:42:57,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[7.774790660436857e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:42:57,264] [INFO] [timer.py:258:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=18.984207118032753, CurrSamplesPerSec=19.671261884005887, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 360 | |
{ | |
"epoch": 7, | |
"step": 45, | |
"rank": 0, | |
"loss": 0.07785851508378983, | |
"overall_throughput": 19.632336945869298, | |
"lr": 7.774790660436857e-06, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 248, | |
"batch_size": 8, | |
"total_loss": 0.10104493051767349, | |
"gradnorm": 1.2592891454696655, | |
"weight_norm": 393.46087646484375, | |
"timestamp": "2024-07-27T04:42:57.267026" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_360 | |
[04:43:15] INFO saving took 17.815489530563354 seconds utils.py:611 | |
Per-token loss scaled by world size: 0.0031208053696900606 | |
Per-token loss scaled by world size: 0.002719326876103878 | |
Per-token loss scaled by world size: 0.00290810433216393 | |
Per-token loss scaled by world size: 0.00502825528383255 | |
Per-token loss scaled by world size: 0.0031488884706050158 | |
Per-token loss scaled by world size: 0.0032260508742183447 | |
Per-token loss scaled by world size: 0.0017572520300745964 | |
Epoch: 7, Step: 46, Rank: 5, loss = 0.08837812393903732 | |
Epoch: 7, Step: 46, Rank: 6, loss = 0.09451339393854141 | |
Epoch: 7, Step: 46, Rank: 3, loss = 0.10233887284994125 | |
Epoch: 7, Step: 46, Rank: 0, loss = 0.10142617672681808 | |
Epoch: 7, Step: 46, Rank: 7, loss = 0.10484665632247925 | |
Epoch: 7, Step: 46, Rank: 1, loss = 0.05711068958044052 | |
Epoch: 7, Step: 46, Rank: 4, loss = 0.16341829299926758 | |
Per-token loss scaled by world size: 0.00243758293800056 | |
Epoch: 7, Step: 46, Rank: 2, loss = 0.0792214423418045 | |
[2024-07-27 04:43:15,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[6.909830056250527e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:15,568] [INFO] [timer.py:258:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=18.98530026908578, CurrSamplesPerSec=19.0324251537424, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 7: 67%|██████▋ | 4/6 [00:20<00:10, 5.45s/it] | |
{ | |
"epoch": 7, | |
"step": 46, | |
"rank": 0, | |
"loss": 0.10142617672681808, | |
"overall_throughput": 18.986955901758453, | |
"lr": 6.909830056250527e-06, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 260, | |
"batch_size": 8, | |
"total_loss": 0.09890670329332352, | |
"gradnorm": 1.4150254726409912, | |
"weight_norm": 393.46099853515625, | |
"timestamp": "2024-07-27T04:43:15.632493" | |
} | |
Per-token loss scaled by world size: 0.0018586666556075215 | |
Per-token loss scaled by world size: 0.002927313791587949 | |
Per-token loss scaled by world size: 0.002946708584204316 | |
Per-token loss scaled by world size: 0.0019047270761802793 | |
Per-token loss scaled by world size: 0.0012054119724780321 | |
Per-token loss scaled by world size: 0.0014979788102209568 | |
Per-token loss scaled by world size: 0.0022586516570299864 | |
Epoch: 7, Step: 47, Rank: 0, loss = 0.10757878422737122 | |
Epoch: 7, Step: 47, Rank: 5, loss = 0.06999871879816055 | |
Epoch: 7, Step: 47, Rank: 4, loss = 0.10829153656959534 | |
Epoch: 7, Step: 47, Rank: 7, loss = 0.055050719529390335 | |
Epoch: 7, Step: 47, Rank: 2, loss = 0.06830599904060364 | |
Epoch: 7, Step: 47, Rank: 3, loss = 0.044298890978097916 | |
Epoch: 7, Step: 47, Rank: 1, loss = 0.08300545066595078 | |
Per-token loss scaled by world size: 0.002645065076649189 | |
Epoch: 7, Step: 47, Rank: 6, loss = 0.09720613807439804 | |
[2024-07-27 04:43:15,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[6.069749683460765e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:16,046] [INFO] [timer.py:258:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=18.996249360616417, CurrSamplesPerSec=19.490837611941338, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 7: 83%|████████▎ | 5/6 [00:20<00:03, 3.66s/it] | |
{ | |
"epoch": 7, | |
"step": 47, | |
"rank": 0, | |
"loss": 0.10757878422737122, | |
"overall_throughput": 19.4505930343033, | |
"lr": 6.069749683460765e-06, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 294, | |
"batch_size": 8, | |
"total_loss": 0.07921702414751053, | |
"gradnorm": 1.5372004508972168, | |
"weight_norm": 393.4610900878906, | |
"timestamp": "2024-07-27T04:43:16.111313" | |
} | |
Per-token loss scaled by world size: 0.0035249628126621246 | |
Per-token loss scaled by world size: 0.0036447104066610336 | |
Per-token loss scaled by world size: 0.0025723562575876713 | |
Per-token loss scaled by world size: 0.0031749033369123936 | |
Per-token loss scaled by world size: 0.00402703694999218 | |
Per-token loss scaled by world size: 0.0017748093232512474 | |
Per-token loss scaled by world size: 0.00937521830201149 | |
Epoch: 7, Step: 48, Rank: 3, loss = 0.08971092104911804 | |
Epoch: 7, Step: 48, Rank: 7, loss = 0.12710927426815033 | |
Epoch: 7, Step: 48, Rank: 0, loss = 0.11072475463151932 | |
Epoch: 7, Step: 48, Rank: 4, loss = 0.14044290781021118 | |
Epoch: 7, Step: 48, Rank: 6, loss = 0.12293307483196259 | |
Epoch: 7, Step: 48, Rank: 5, loss = 0.3269607424736023 | |
Epoch: 7, Step: 48, Rank: 2, loss = 0.06189647689461708 | |
Per-token loss scaled by world size: 0.004552490543574095 | |
Epoch: 7, Step: 48, Rank: 1, loss = 0.15876810252666473 | |
[2024-07-27 04:43:16,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[5.2613133752700145e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:16,523] [INFO] [timer.py:258:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=19.007811150191724, CurrSamplesPerSec=19.543068281625303, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 7: 100%|██████████| 6/6 [00:21<00:00, 2.57s/it] | |
{ | |
"epoch": 7, | |
"step": 48, | |
"rank": 0, | |
"loss": 0.11072475463151932, | |
"overall_throughput": 19.50403630817306, | |
"lr": 5.2613133752700145e-06, | |
"cuda_mem_allocated": 21.99084711074829, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 279, | |
"batch_size": 8, | |
"total_loss": 0.14231827855110168, | |
"gradnorm": 2.0794081687927246, | |
"weight_norm": 393.461181640625, | |
"timestamp": "2024-07-27T04:43:16.587633" | |
} | |
Epoch 7: 100%|██████████| 6/6 [00:21<00:00, 3.52s/it] | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 5 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 5 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 0 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 5 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 5 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 4 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 0 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 0 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 1 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 0 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 1 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 7 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 7 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 4 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 7 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 2 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 2 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 2 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 2 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 6 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 6 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 3 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 3 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 3 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
Per-token loss scaled by world size: 0.007703100331127644 | |
Per-token loss scaled by world size: 0.002897256053984165 | |
Per-token loss scaled by world size: 0.0018762396648526192 | |
Per-token loss scaled by world size: 0.0031769108027219772 | |
Per-token loss scaled by world size: 0.0031007928773760796 | |
Per-token loss scaled by world size: 0.0032394849695265293 | |
Per-token loss scaled by world size: 0.0033565827179700136 | |
Epoch: 8, Step: 49, Rank: 6, loss = 0.08872846513986588 | |
Epoch: 8, Step: 49, Rank: 2, loss = 0.2359074503183365 | |
Epoch: 8, Step: 49, Rank: 7, loss = 0.057459838688373566 | |
Epoch: 8, Step: 49, Rank: 0, loss = 0.09729289263486862 | |
Epoch: 8, Step: 49, Rank: 5, loss = 0.09496178478002548 | |
Epoch: 8, Step: 49, Rank: 4, loss = 0.10279534757137299 | |
Epoch: 8, Step: 49, Rank: 1, loss = 0.09920922666788101 | |
Per-token loss scaled by world size: 0.001880081370472908 | |
Epoch: 8, Step: 49, Rank: 3, loss = 0.05757749080657959 | |
[2024-07-27 04:43:17,376] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[4.491030185478976e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:17,453] [INFO] [timer.py:258:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=18.97237396051445, CurrSamplesPerSec=17.473818511885575, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 8: 17%|█▋ | 1/6 [00:00<00:04, 1.23it/s] | |
{ | |
"epoch": 8, | |
"step": 49, | |
"rank": 0, | |
"loss": 0.09729289263486862, | |
"overall_throughput": 17.406956156346215, | |
"lr": 4.491030185478976e-06, | |
"cuda_mem_allocated": 21.990607738494873, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 245, | |
"batch_size": 8, | |
"total_loss": 0.10424157232046127, | |
"gradnorm": 2.468442678451538, | |
"weight_norm": 393.4612731933594, | |
"timestamp": "2024-07-27T04:43:17.517240" | |
} | |
Per-token loss scaled by world size: 0.0022910190746188164 | |
Per-token loss scaled by world size: 0.003779459511861205 | |
Per-token loss scaled by world size: 0.0047139013186097145 | |
Per-token loss scaled by world size: 0.0016656998777762055 | |
Per-token loss scaled by world size: 0.0011160913854837418 | |
Per-token loss scaled by world size: 0.002607797970995307 | |
Per-token loss scaled by world size: 0.0011160913854837418 | |
Epoch: 8, Step: 50, Rank: 4, loss = 0.03934222087264061 | |
Epoch: 8, Step: 50, Rank: 1, loss = 0.05871592089533806 | |
Epoch: 8, Step: 50, Rank: 5, loss = 0.16616502404212952 | |
Epoch: 8, Step: 50, Rank: 0, loss = 0.1332259476184845 | |
Epoch: 8, Step: 50, Rank: 2, loss = 0.08075842261314392 | |
Epoch: 8, Step: 50, Rank: 6, loss = 0.03934222087264061 | |
Epoch: 8, Step: 50, Rank: 7, loss = 0.09192487597465515 | |
Per-token loss scaled by world size: 0.0013738555135205388 | |
Epoch: 8, Step: 50, Rank: 3, loss = 0.048428408801555634 | |
[2024-07-27 04:43:17,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[3.7651019814126656e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:17,936] [INFO] [timer.py:258:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=18.977750407321444, CurrSamplesPerSec=19.233927031934993, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 400 | |
{ | |
"epoch": 8, | |
"step": 50, | |
"rank": 0, | |
"loss": 0.1332259476184845, | |
"overall_throughput": 19.198084907835618, | |
"lr": 3.7651019814126656e-06, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 282, | |
"batch_size": 8, | |
"total_loss": 0.08223787695169449, | |
"gradnorm": 1.6959415674209595, | |
"weight_norm": 393.4613342285156, | |
"timestamp": "2024-07-27T04:43:17.940040" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_400 | |
[04:43:35] INFO saving took 17.87660813331604 seconds utils.py:611 | |
Epoch 8: 33%|███▎ | 2/6 [00:19<00:44, 11.14s/it] | |
Per-token loss scaled by world size: 0.0034444250632077456 | |
Per-token loss scaled by world size: 0.0036195043940097094 | |
Per-token loss scaled by world size: 0.0021303421817719936 | |
Per-token loss scaled by world size: 0.002188930055126548 | |
Per-token loss scaled by world size: 0.0012819116236642003 | |
Per-token loss scaled by world size: 0.0027832810301333666 | |
Per-token loss scaled by world size: 0.0016897486057132483 | |
Epoch: 8, Step: 51, Rank: 1, loss = 0.11129976063966751 | |
Epoch: 8, Step: 51, Rank: 7, loss = 0.06550802290439606 | |
Epoch: 8, Step: 51, Rank: 5, loss = 0.08558589220046997 | |
Epoch: 8, Step: 51, Rank: 2, loss = 0.06730959564447403 | |
Epoch: 8, Step: 51, Rank: 0, loss = 0.10591606795787811 | |
Epoch: 8, Step: 51, Rank: 3, loss = 0.0394187830388546 | |
Epoch: 8, Step: 51, Rank: 6, loss = 0.05195976793766022 | |
Per-token loss scaled by world size: 0.003074637847021222 | |
Epoch: 8, Step: 51, Rank: 4, loss = 0.09454511106014252 | |
[2024-07-27 04:43:36,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[3.089373510131354e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:36,312] [INFO] [timer.py:258:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=18.971761108564685, CurrSamplesPerSec=18.68865417133589, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 8: 50%|█████ | 3/6 [00:19<00:18, 6.28s/it] | |
{ | |
"epoch": 8, | |
"step": 51, | |
"rank": 0, | |
"loss": 0.10591606795787811, | |
"overall_throughput": 18.651619864047372, | |
"lr": 3.089373510131354e-06, | |
"cuda_mem_allocated": 21.988811492919922, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 246, | |
"batch_size": 8, | |
"total_loss": 0.07769287377595901, | |
"gradnorm": 0.9065935611724854, | |
"weight_norm": 393.46136474609375, | |
"timestamp": "2024-07-27T04:43:36.375570" | |
} | |
Per-token loss scaled by world size: 0.005752637051045895 | |
Per-token loss scaled by world size: 0.00271693360991776 | |
Per-token loss scaled by world size: 0.004330337047576904 | |
Per-token loss scaled by world size: 0.005382548552006483 | |
Per-token loss scaled by world size: 0.0025455320719629526 | |
Per-token loss scaled by world size: 0.0023602889850735664 | |
Per-token loss scaled by world size: 0.00044713294482789934 | |
Epoch: 8, Step: 52, Rank: 1, loss = 0.08286647498607635 | |
Epoch: 8, Step: 52, Rank: 0, loss = 0.17545543611049652 | |
Epoch: 8, Step: 52, Rank: 6, loss = 0.13207527995109558 | |
Epoch: 8, Step: 52, Rank: 2, loss = 0.16416773200035095 | |
Epoch: 8, Step: 52, Rank: 7, loss = 0.07763873040676117 | |
Epoch: 8, Step: 52, Rank: 5, loss = 0.07198881357908249 | |
Epoch: 8, Step: 52, Rank: 4, loss = 0.013637554831802845 | |
Per-token loss scaled by world size: 0.001792231691069901 | |
Epoch: 8, Step: 52, Rank: 3, loss = 0.05466306582093239 | |
[2024-07-27 04:43:36,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[2.469285339963892e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:36,777] [INFO] [timer.py:258:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=18.99222895360799, CurrSamplesPerSec=20.052273645410278, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 8: 67%|██████▋ | 4/6 [00:20<00:07, 3.98s/it] | |
{ | |
"epoch": 8, | |
"step": 52, | |
"rank": 0, | |
"loss": 0.17545543611049652, | |
"overall_throughput": 20.013534639121023, | |
"lr": 2.469285339963892e-06, | |
"cuda_mem_allocated": 21.989290714263916, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 244, | |
"batch_size": 8, | |
"total_loss": 0.09656163305044174, | |
"gradnorm": 1.5890859365463257, | |
"weight_norm": 393.4613952636719, | |
"timestamp": "2024-07-27T04:43:36.840099" | |
} | |
Per-token loss scaled by world size: 0.0025300777051597834 | |
Per-token loss scaled by world size: 0.0022664989810436964 | |
Per-token loss scaled by world size: 0.006000937893986702 | |
Per-token loss scaled by world size: 0.002840510569512844 | |
Per-token loss scaled by world size: 0.004035668447613716 | |
Per-token loss scaled by world size: 0.0041307490319013596 | |
Per-token loss scaled by world size: 0.003075978020206094 | |
Epoch: 8, Step: 53, Rank: 5, loss = 0.07309459149837494 | |
Epoch: 8, Step: 53, Rank: 4, loss = 0.09160646796226501 | |
Epoch: 8, Step: 53, Rank: 6, loss = 0.19353024661540985 | |
Epoch: 8, Step: 53, Rank: 1, loss = 0.13015030324459076 | |
Epoch: 8, Step: 53, Rank: 3, loss = 0.13321664929389954 | |
Epoch: 8, Step: 53, Rank: 2, loss = 0.08159500360488892 | |
Epoch: 8, Step: 53, Rank: 7, loss = 0.0992002934217453 | |
Per-token loss scaled by world size: 0.0016319038113579154 | |
Epoch: 8, Step: 53, Rank: 0, loss = 0.05262889713048935 | |
[2024-07-27 04:43:37,166] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[1.9098300562505266e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:37,244] [INFO] [timer.py:258:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=19.00944478158272, CurrSamplesPerSec=19.91191964124113, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 8: 83%|████████▎ | 5/6 [00:20<00:02, 2.71s/it] | |
{ | |
"epoch": 8, | |
"step": 53, | |
"rank": 0, | |
"loss": 0.05262889713048935, | |
"overall_throughput": 19.87502717277025, | |
"lr": 1.9098300562505266e-06, | |
"cuda_mem_allocated": 21.99288320541382, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 258, | |
"batch_size": 8, | |
"total_loss": 0.10687780380249023, | |
"gradnorm": 1.6277161836624146, | |
"weight_norm": 393.4614562988281, | |
"timestamp": "2024-07-27T04:43:37.309449" | |
} | |
Per-token loss scaled by world size: 0.004982769954949617 | |
Per-token loss scaled by world size: 0.0023371989373117685 | |
Per-token loss scaled by world size: 0.001956745982170105 | |
Per-token loss scaled by world size: 0.0019846318755298853 | |
Per-token loss scaled by world size: 0.001973965670913458 | |
Per-token loss scaled by world size: 0.001133645768277347 | |
Per-token loss scaled by world size: 0.0006779870600439608 | |
Epoch: 8, Step: 54, Rank: 0, loss = 0.19868795573711395 | |
Epoch: 8, Step: 54, Rank: 6, loss = 0.09319580346345901 | |
Epoch: 8, Step: 54, Rank: 5, loss = 0.07802524417638779 | |
Epoch: 8, Step: 54, Rank: 3, loss = 0.07871188223361969 | |
Epoch: 8, Step: 54, Rank: 7, loss = 0.045204125344753265 | |
Epoch: 8, Step: 54, Rank: 4, loss = 0.027034733444452286 | |
Epoch: 8, Step: 54, Rank: 2, loss = 0.07913719862699509 | |
Per-token loss scaled by world size: 0.0017750355182215571 | |
Epoch: 8, Step: 54, Rank: 1, loss = 0.07077953964471817 | |
[2024-07-27 04:43:37,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[1.4155120639813392e-06], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:37,723] [INFO] [timer.py:258:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=19.017755360925854, CurrSamplesPerSec=19.451449970580306, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 1.95s/it] | |
{ | |
"epoch": 8, | |
"step": 54, | |
"rank": 0, | |
"loss": 0.19868795573711395, | |
"overall_throughput": 19.41544489924397, | |
"lr": 1.4155120639813392e-06, | |
"cuda_mem_allocated": 21.990128993988037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 319, | |
"batch_size": 8, | |
"total_loss": 0.0838470607995987, | |
"gradnorm": 0.9820513129234314, | |
"weight_norm": 393.4614562988281, | |
"timestamp": "2024-07-27T04:43:37.787702" | |
} | |
Epoch 8: 100%|██████████| 6/6 [00:21<00:00, 3.53s/it] | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 0 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 3 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 0 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 64 num samples: 1 num padding tokens: 0 - rank: 3 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 33 | |
total tokens: 88 num samples: 1 num padding tokens: 0 - rank: 3 max len: 88 min len: 88 avg len: 88.0 num_loss_counted_tokens: 53 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 74 num samples: 1 num padding tokens: 0 - rank: 0 max len: 74 min len: 74 avg len: 74.0 num_loss_counted_tokens: 40 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 3 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 0 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 46 num samples: 1 num padding tokens: 0 - rank: 0 max len: 46 min len: 46 avg len: 46.0 num_loss_counted_tokens: 21 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 3 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 3 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 31 | |
total tokens: 71 num samples: 1 num padding tokens: 0 - rank: 7 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 35 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 7 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 7 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 80 num samples: 1 num padding tokens: 0 - rank: 7 max len: 80 min len: 80 avg len: 80.0 num_loss_counted_tokens: 46 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 7 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 50 num samples: 1 num padding tokens: 0 - rank: 5 max len: 50 min len: 50 avg len: 50.0 num_loss_counted_tokens: 25 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 1 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 44 num samples: 1 num padding tokens: 0 - rank: 5 max len: 44 min len: 44 avg len: 44.0 num_loss_counted_tokens: 21 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 1 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 45 num samples: 1 num padding tokens: 0 - rank: 5 max len: 45 min len: 45 avg len: 45.0 num_loss_counted_tokens: 22 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 1 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 1 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 76 num samples: 1 num padding tokens: 0 - rank: 5 max len: 76 min len: 76 avg len: 76.0 num_loss_counted_tokens: 40 | |
total tokens: 100 num samples: 1 num padding tokens: 0 - rank: 1 max len: 100 min len: 100 avg len: 100.0 num_loss_counted_tokens: 66 | |
total tokens: 107 num samples: 1 num padding tokens: 0 - rank: 5 max len: 107 min len: 107 avg len: 107.0 num_loss_counted_tokens: 73 | |
total tokens: 48 num samples: 1 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 23 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 4 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 63 num samples: 1 num padding tokens: 0 - rank: 4 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 32 | |
total tokens: 55 num samples: 1 num padding tokens: 0 - rank: 4 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 32 | |
total tokens: 62 num samples: 1 num padding tokens: 0 - rank: 2 max len: 62 min len: 62 avg len: 62.0 num_loss_counted_tokens: 31 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 29 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 4 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 34 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 2 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 36 | |
total tokens: 59 num samples: 1 num padding tokens: 0 - rank: 2 max len: 59 min len: 59 avg len: 59.0 num_loss_counted_tokens: 30 | |
total tokens: 49 num samples: 1 num padding tokens: 0 - rank: 2 max len: 49 min len: 49 avg len: 49.0 num_loss_counted_tokens: 24 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 2 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 92 num samples: 1 num padding tokens: 0 - rank: 2 max len: 92 min len: 92 avg len: 92.0 num_loss_counted_tokens: 57 | |
total tokens: 57 num samples: 1 num padding tokens: 0 - rank: 6 max len: 57 min len: 57 avg len: 57.0 num_loss_counted_tokens: 34 | |
total tokens: 61 num samples: 1 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 30 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 29 | |
total tokens: 60 num samples: 1 num padding tokens: 0 - rank: 6 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 36 | |
total tokens: 58 num samples: 1 num padding tokens: 0 - rank: 6 max len: 58 min len: 58 avg len: 58.0 num_loss_counted_tokens: 33 | |
total tokens: 51 num samples: 1 num padding tokens: 0 - rank: 6 max len: 51 min len: 51 avg len: 51.0 num_loss_counted_tokens: 26 | |
Per-token loss scaled by world size: 0.0020672364626079798 | |
Per-token loss scaled by world size: 0.005803861655294895 | |
Per-token loss scaled by world size: 0.0010450059780851007 | |
Per-token loss scaled by world size: 0.00481435377150774 | |
Per-token loss scaled by world size: 0.004757868126034737 | |
Per-token loss scaled by world size: 0.0012225221144035459 | |
Per-token loss scaled by world size: 0.003656236920505762 | |
Epoch: 9, Step: 55, Rank: 4, loss = 0.14262522757053375 | |
Epoch: 9, Step: 55, Rank: 5, loss = 0.1719394028186798 | |
Epoch: 9, Step: 55, Rank: 2, loss = 0.030958302319049835 | |
Epoch: 9, Step: 55, Rank: 1, loss = 0.06124188005924225 | |
Epoch: 9, Step: 55, Rank: 0, loss = 0.14095184206962585 | |
Epoch: 9, Step: 55, Rank: 7, loss = 0.03621721640229225 | |
Epoch: 9, Step: 55, Rank: 3, loss = 0.10831601917743683 | |
Per-token loss scaled by world size: 0.003545596729964018 | |
Epoch: 9, Step: 55, Rank: 6, loss = 0.10503830015659332 | |
[2024-07-27 04:43:38,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[9.903113209758098e-07], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:38,646] [INFO] [timer.py:258:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=18.958935135030256, CurrSamplesPerSec=16.33220426430826, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 440 | |
{ | |
"epoch": 9, | |
"step": 55, | |
"rank": 0, | |
"loss": 0.14095184206962585, | |
"overall_throughput": 16.273810095413527, | |
"lr": 9.903113209758098e-07, | |
"cuda_mem_allocated": 21.989410877227783, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 237, | |
"batch_size": 8, | |
"total_loss": 0.09966102987527847, | |
"gradnorm": 1.0968877077102661, | |
"weight_norm": 393.4614562988281, | |
"timestamp": "2024-07-27T04:43:38.650582" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_440 | |
[04:43:56] INFO saving took 17.79723310470581 seconds utils.py:611 | |
Per-token loss scaled by world size: 0.005642743315547705 | |
Per-token loss scaled by world size: 0.002617186401039362 | |
Per-token loss scaled by world size: 0.0045571294613182545 | |
Per-token loss scaled by world size: 0.002132992260158062 | |
Per-token loss scaled by world size: 0.0015219022752717137 | |
Per-token loss scaled by world size: 0.003468153765425086 | |
Per-token loss scaled by world size: 0.0018528653308749199 | |
Epoch: 9, Step: 56, Rank: 0, loss = 0.14867635071277618 | |
Epoch: 9, Step: 56, Rank: 7, loss = 0.08538571000099182 | |
Epoch: 9, Step: 56, Rank: 5, loss = 0.06958886981010437 | |
Epoch: 9, Step: 56, Rank: 4, loss = 0.1131485179066658 | |
Epoch: 9, Step: 56, Rank: 2, loss = 0.049652062356472015 | |
Epoch: 9, Step: 56, Rank: 1, loss = 0.06044973060488701 | |
Epoch: 9, Step: 56, Rank: 6, loss = 0.18409450352191925 | |
Per-token loss scaled by world size: 0.001767554902471602 | |
Epoch: 9, Step: 56, Rank: 3, loss = 0.05766648054122925 | |
[2024-07-27 04:43:56,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[6.37651293602628e-07], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:56,928] [INFO] [timer.py:258:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=18.96440824470298, CurrSamplesPerSec=19.25907525027751, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 9, | |
"step": 56, | |
"rank": 0, | |
"loss": 0.14867635071277618, | |
"overall_throughput": 19.213232991090933, | |
"lr": 6.37651293602628e-07, | |
"cuda_mem_allocated": 21.990248203277588, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 261, | |
"batch_size": 8, | |
"total_loss": 0.09608278423547745, | |
"gradnorm": 1.2486889362335205, | |
"weight_norm": 393.46148681640625, | |
"timestamp": "2024-07-27T04:43:56.991985" | |
} | |
Per-token loss scaled by world size: 0.006159770302474499 | |
Per-token loss scaled by world size: 0.004680618178099394 | |
Per-token loss scaled by world size: 0.003500568214803934 | |
Per-token loss scaled by world size: 0.002827225485816598 | |
Per-token loss scaled by world size: 0.001885988749563694 | |
Per-token loss scaled by world size: 0.002812023274600506 | |
Per-token loss scaled by world size: 0.0035085994750261307 | |
Epoch: 9, Step: 57, Rank: 6, loss = 0.10764247179031372 | |
Epoch: 9, Step: 57, Rank: 4, loss = 0.08693718165159225 | |
Epoch: 9, Step: 57, Rank: 3, loss = 0.14392900466918945 | |
Epoch: 9, Step: 57, Rank: 0, loss = 0.05799415335059166 | |
Epoch: 9, Step: 57, Rank: 5, loss = 0.08646971732378006 | |
Epoch: 9, Step: 57, Rank: 7, loss = 0.10788943618535995 | |
Epoch: 9, Step: 57, Rank: 2, loss = 0.1894129365682602 | |
Per-token loss scaled by world size: 0.0016236526425927877 | |
Epoch: 9, Step: 57, Rank: 1, loss = 0.04992732033133507 | |
[2024-07-27 04:43:57,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[3.603713930414676e-07], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:57,395] [INFO] [timer.py:258:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=18.980723963154006, CurrSamplesPerSec=19.905493724517065, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 9, | |
"step": 57, | |
"rank": 0, | |
"loss": 0.05799415335059166, | |
"overall_throughput": 19.839444119579092, | |
"lr": 3.603713930414676e-07, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 246, | |
"batch_size": 8, | |
"total_loss": 0.1037752702832222, | |
"gradnorm": 1.5781608819961548, | |
"weight_norm": 393.46148681640625, | |
"timestamp": "2024-07-27T04:43:57.458487" | |
} | |
Per-token loss scaled by world size: 0.00027386093279346824 | |
Per-token loss scaled by world size: 0.002793475054204464 | |
Per-token loss scaled by world size: 0.0012327907606959343 | |
Per-token loss scaled by world size: 0.0018183693755418062 | |
Per-token loss scaled by world size: 0.0011149498168379068 | |
Per-token loss scaled by world size: 0.0009586562518961728 | |
Per-token loss scaled by world size: 0.006267122458666563 | |
Epoch: 9, Step: 58, Rank: 3, loss = 0.09497815370559692 | |
Epoch: 9, Step: 58, Rank: 4, loss = 0.04191488400101662 | |
Epoch: 9, Step: 58, Rank: 7, loss = 0.032594311982393265 | |
Epoch: 9, Step: 58, Rank: 5, loss = 0.06182456016540527 | |
Epoch: 9, Step: 58, Rank: 2, loss = 0.037908293306827545 | |
Epoch: 9, Step: 58, Rank: 6, loss = 0.009311271831393242 | |
Epoch: 9, Step: 58, Rank: 1, loss = 0.21308216452598572 | |
Per-token loss scaled by world size: 0.0013049639528617263 | |
Epoch: 9, Step: 58, Rank: 0, loss = 0.04436877369880676 | |
[2024-07-27 04:43:57,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[1.6070411401370335e-07], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:57,863] [INFO] [timer.py:258:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=18.99479756047954, CurrSamplesPerSec=19.802352008035566, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 9, | |
"step": 58, | |
"rank": 0, | |
"loss": 0.04436877369880676, | |
"overall_throughput": 19.7424303222504, | |
"lr": 1.6070411401370335e-07, | |
"cuda_mem_allocated": 21.992165088653564, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 272, | |
"batch_size": 8, | |
"total_loss": 0.06699780374765396, | |
"gradnorm": 1.2012358903884888, | |
"weight_norm": 393.46148681640625, | |
"timestamp": "2024-07-27T04:43:57.924678" | |
} | |
Per-token loss scaled by world size: 0.0033104075118899345 | |
Per-token loss scaled by world size: 0.005029842257499695 | |
Per-token loss scaled by world size: 0.0013227150775492191 | |
Per-token loss scaled by world size: 0.0013601266546174884 | |
Per-token loss scaled by world size: 0.0020338338799774647 | |
Per-token loss scaled by world size: 0.002029073191806674 | |
Per-token loss scaled by world size: 0.0012528002262115479 | |
Epoch: 9, Step: 59, Rank: 1, loss = 0.181074321269989 | |
Epoch: 9, Step: 59, Rank: 0, loss = 0.11917466670274734 | |
Epoch: 9, Step: 59, Rank: 4, loss = 0.04761774465441704 | |
Epoch: 9, Step: 59, Rank: 6, loss = 0.07321801781654358 | |
Epoch: 9, Step: 59, Rank: 2, loss = 0.04896456003189087 | |
Epoch: 9, Step: 59, Rank: 7, loss = 0.04510080814361572 | |
Epoch: 9, Step: 59, Rank: 3, loss = 0.07304663211107254 | |
Per-token loss scaled by world size: 0.0016570077277719975 | |
Epoch: 9, Step: 59, Rank: 5, loss = 0.05965227633714676 | |
[2024-07-27 04:43:58,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[4.025706004760932e-08], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:58,339] [INFO] [timer.py:258:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=19.001181258681584, CurrSamplesPerSec=19.36564785840185, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
{ | |
"epoch": 9, | |
"step": 59, | |
"rank": 0, | |
"loss": 0.11917466670274734, | |
"overall_throughput": 19.311280902601126, | |
"lr": 4.025706004760932e-08, | |
"cuda_mem_allocated": 21.98869228363037, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 288, | |
"batch_size": 8, | |
"total_loss": 0.08098112046718597, | |
"gradnorm": 1.2536462545394897, | |
"weight_norm": 393.46148681640625, | |
"timestamp": "2024-07-27T04:43:58.402303" | |
} | |
Per-token loss scaled by world size: 0.0025534227024763823 | |
Per-token loss scaled by world size: 0.002198881469666958 | |
Per-token loss scaled by world size: 0.003101743757724762 | |
Per-token loss scaled by world size: 0.0017734984867274761 | |
Per-token loss scaled by world size: 0.001557655748911202 | |
Per-token loss scaled by world size: 0.0014592667575925589 | |
Per-token loss scaled by world size: 0.00225572707131505 | |
Epoch: 9, Step: 60, Rank: 6, loss = 0.11088734120130539 | |
Epoch: 9, Step: 60, Rank: 0, loss = 0.0912848636507988 | |
Epoch: 9, Step: 60, Rank: 2, loss = 0.05568619444966316 | |
Epoch: 9, Step: 60, Rank: 7, loss = 0.05216878652572632 | |
Epoch: 9, Step: 60, Rank: 4, loss = 0.07861001044511795 | |
Epoch: 9, Step: 60, Rank: 5, loss = 0.06340257078409195 | |
Epoch: 9, Step: 60, Rank: 3, loss = 0.08064224570989609 | |
Per-token loss scaled by world size: 0.0007612230838276446 | |
Epoch: 9, Step: 60, Rank: 1, loss = 0.027213726192712784 | |
[2024-07-27 04:43:58,733] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[0.0], mom=[(0.9, 0.95)] | |
[2024-07-27 04:43:58,814] [INFO] [timer.py:258:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=19.00919914338529, CurrSamplesPerSec=19.4776793799544, MemAllocated=21.99GB, MaxMemAllocated=28.29GB | |
Saving model in huggingface format at samples_seen: 480 | |
{ | |
"epoch": 9, | |
"step": 60, | |
"rank": 0, | |
"loss": 0.0912848636507988, | |
"overall_throughput": 19.42523488114542, | |
"lr": 0.0, | |
"cuda_mem_allocated": 21.988811492919922, | |
"cuda_malloc_retries": 0, | |
"num_loss_counted_tokens": 286, | |
"batch_size": 8, | |
"total_loss": 0.06998696178197861, | |
"gradnorm": 1.0276967287063599, | |
"weight_norm": 393.46148681640625, | |
"timestamp": "2024-07-27T04:43:58.817101" | |
} | |
Model saved in /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_480 | |
[04:44:16] INFO saving took 17.839636087417603 seconds utils.py:611 | |
Epoch 9: 100%|██████████| 6/6 [00:38<00:00, 6.49s/it] | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:550 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:573 -> 3 | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:621 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:261:1045 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:264:1041 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:261:1045 [1] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:266:1031 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:262:1033 [2] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:47 -> 3 | |
tyler-rhel-newimage:265:1035 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:261:1045 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0 | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:752 -> 3 | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:266:1031 [6] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:428 -> 3 | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:261:1045 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:58 -> 3 | |
tyler-rhel-newimage:265:1035 [5] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:264:1041 [4] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:267:1037 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:564 -> 3 | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:264:1041 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 4, res=3, closed=0 | |
tyler-rhel-newimage:265:1035 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0 | |
tyler-rhel-newimage:260:1039 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:264:1041 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Close from rank 4, retcode 3 | |
tyler-rhel-newimage:262:1033 [2] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO misc/socket.cc:775 -> 3 | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:668 -> 3 | |
tyler-rhel-newimage:265:1035 [5] proxy.cc:1521 NCCL WARN [Proxy Service 5] Failed to execute operation Close from rank 5, retcode 3 | |
tyler-rhel-newimage:260:1039 [0] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:266:1031 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0 | |
tyler-rhel-newimage:267:1037 [7] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:260:1039 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0 | |
tyler-rhel-newimage:267:1037 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0 | |
tyler-rhel-newimage:262:1033 [2] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 2, res=3, closed=0 | |
tyler-rhel-newimage:260:1039 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3 | |
tyler-rhel-newimage:266:1031 [6] proxy.cc:1521 NCCL WARN [Proxy Service 6] Failed to execute operation Close from rank 6, retcode 3 | |
tyler-rhel-newimage:263:1043 [3] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable | |
tyler-rhel-newimage:267:1037 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Close from rank 7, retcode 3 | |
tyler-rhel-newimage:262:1033 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Close from rank 2, retcode 3 | |
tyler-rhel-newimage:263:1043 [3] NCCL INFO misc/socket.cc:826 -> 3 | |
tyler-rhel-newimage:263:1043 [3] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 3, res=3, closed=0 | |
tyler-rhel-newimage:263:1043 [3] proxy.cc:1521 NCCL WARN [Proxy Service 3] Failed to execute operation Close from rank 3, retcode 3 | |
tyler-rhel-newimage:260:43160 [0] NCCL INFO comm 0x557248f8b2d0 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE | |
tyler-rhel-newimage:266:43159 [6] NCCL INFO comm 0x5653898b69a0 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE | |
tyler-rhel-newimage:267:43156 [7] NCCL INFO comm 0x560b415bb0d0 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE | |
tyler-rhel-newimage:262:43163 [2] NCCL INFO comm 0x55bdb0f52ee0 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE | |
tyler-rhel-newimage:263:43158 [3] NCCL INFO comm 0x55eb04aea420 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE | |
tyler-rhel-newimage:264:43161 [4] NCCL INFO comm 0x5567334e8a90 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE | |
tyler-rhel-newimage:265:43162 [5] NCCL INFO comm 0x558446a0b990 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE | |
tyler-rhel-newimage:261:43157 [1] NCCL INFO comm 0x55b80cd4f580 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE | |
Terminating process 🤖 | |
[root@tyler-rhel-newimage instructlab]# |