@relyt0925
Created July 27, 2024 20:09
new skills training log
[root@tyler-rhel-newimage instructlab]# /root/ilab model train --data-path /var/instructlabbigdisk/instructlab/generateddata/messages_combined.jsonl --model-path /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ --device cuda --max-batch-len 2 --effective-batch-size 16 --save-samples 185 --num-epochs 10 --ckpt-output-dir /var/instructlabbigdisk/instructlab/skillscheckpoints/ --gpus 8
[2024-07-27 20:03:08,445] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
INFO 2024-07-27 20:03:11,898 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-07-27 20:03:11,898 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-07-27 20:03:11,898 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-07-27 20:03:12,324 datasets:58: PyTorch version 2.3.1 available.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 2024-07-27 20:03:12,773 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
tokenizing the dataset with /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ tokenizer...
ten largest length percentiles:
quantile 90th: 90.0
quantile 91th: 92.44
quantile 92th: 93.0
quantile 93th: 94.0
quantile 94th: 96.87999999999994
quantile 95th: 99.59999999999997
quantile 96th: 102.91999999999996
quantile 97th: 107.0
quantile 98th: 109.59999999999997
quantile 99th: 115.27999999999997
quantile 100th: 141.0
at 4096 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 43.0
quantile 1th: 44.0
quantile 2th: 44.68
quantile 3th: 45.0
quantile 4th: 45.36
quantile 5th: 46.0
quantile 6th: 48.0
quantile 7th: 48.0
quantile 8th: 49.0
quantile 9th: 49.56
quantile 10th: 50.0
at 20 min sequence length, the number of samples to be dropped is 0
checking the validity of the samples...
INFO 2024-07-27 20:03:13,126 root:611: number of dropped samples: 0 -- out of 185
Categorizing training data type...
Data type sorting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 185/185 [00:00<00:00, 648242.47it/s]
unmasking the appropriate message content...
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.
Instruction ex sample 17: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: Based on the provided text, there are 8 villages named Qarah Tappeh in different districts and provinces of Iran according to the 2006 census.<|endoftext|>
Original Input: <|user|>
Question: How many villages named Qarah Tappeh were there in different districts and provinces of Iran according to the 2006 census?
<|assistant|>
Answer: Based on the provided text, there are 8 villages named Qarah Tappeh in different districts and provinces of Iran according to the 2006 census.<|endoftext|>
Instruction ex sample 99: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Answer: There are three items in this list that are types of fruits: "apple," "banana," and "orange."<|endoftext|>
Original Input: <|user|>
Question: How many items in this list are types of fruits and what are they?
<|assistant|>
Answer: There are three items in this list that are types of fruits: "apple," "banana," and "orange."<|endoftext|>
Creating json from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.89ba/s]
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/ --data_path=/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/instructlabbigdisk/instructlab/skillscheckpoints/ --num_epochs=10 --effective_batch_size=16 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=185 --log_level=INFO --max_batch_len=2 --seed=42 --chat-tmpl-path=/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757]
W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] *****************************************
W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 20:03:14.580000 140296119280064 torch/distributed/run.py:757] *****************************************
[2024-07-27 20:03:17,567] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,805] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,843] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,879] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,908] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,949] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,978] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-27 20:03:17,981] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-07-27 20:03:21,555] [INFO] [comm.py:637:init_distributed] cdb=None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model_name_or_path: /var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/
data_path: /var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl
output_dir: /var/instructlabbigdisk/instructlab/skillscheckpoints/
num_epochs: 10
last_step: 0
effective_batch_size: 16
learning_rate: 2.0e-05
lr_scheduler: cosine
num_warmup_steps: 25
save_samples: 185
save_samples_ds: null
save_last: false
log_level: INFO
seed: 42
mock_data: false
mock_len: 2600
sharding_strategy: FULL_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 2
cpu_offload_optimizer: false
cpu_offload_optimizer_pin_memory: false
cpu_offload_optimizer_ratio: 1.0
NEFTune_alpha: null
chat_tmpl_path: /opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
disable_flash_attn: false
{
"script_params": {
"model_name_or_path": "/var/instructlabbigdisk/instructlab/knowledgecheckpoints/hf_format/samples_1024/",
"data_path": "/var/instructlabbigdisk/instructlab/.local/share/instructlab/internal/data.jsonl",
"output_dir": "/var/instructlabbigdisk/instructlab/skillscheckpoints/",
"num_epochs": 10,
"last_step": 0,
"effective_batch_size": 16,
"learning_rate": 2e-05,
"lr_scheduler": "cosine",
"num_warmup_steps": 25,
"save_samples": 185,
"save_samples_ds": null,
"save_last": false,
"log_level": "INFO",
"seed": 42,
"mock_data": false,
"mock_len": 2600,
"sharding_strategy": "FULL_SHARD",
"is_granite": false,
"lora_r": 0,
"lora_alpha": 32,
"lora_dropout": 0.1,
"lora_quant_bits": null,
"lora_target_modules": null,
"max_batch_len": 2,
"cpu_offload_optimizer": false,
"cpu_offload_optimizer_pin_memory": false,
"cpu_offload_optimizer_ratio": 1.0,
"NEFTune_alpha": null,
"chat_tmpl_path": "/opt/python3.11/venv/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
"disable_flash_attn": false
},
"timestamp": "2024-07-27T20:03:21.897629"
}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-07-27 20:03:21,973] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:21,973] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-07-27 20:03:22,374] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:22,515] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:22,529] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:22,538] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:22,664] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-27 20:03:22,682] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-rhel-newimage:260:260 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:260 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:260:260 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
tyler-rhel-newimage:265:265 [5] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:265:265 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:265 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:267:267 [7] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:267:267 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:267 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:262:262 [2] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:262:262 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:263 [3] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:262:262 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:263:263 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:263 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:266:266 [6] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:266:266 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:266 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:261:261 [1] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:261:261 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:261 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:264:264 [4] NCCL INFO cudaDriverVersion 12040
tyler-rhel-newimage:264:264 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:264 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
tyler-rhel-newimage:260:1022 [0] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:260:1022 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:260:1022 [0] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:260:1022 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1026 [5] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:262:1023 [2] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:266:1024 [6] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:265:1026 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:262:1023 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:266:1024 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:265:1026 [5] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:265:1026 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:266:1024 [6] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:266:1024 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1023 [2] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:262:1023 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1025 [3] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:263:1025 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:263:1025 [3] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:263:1025 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1028 [4] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:264:1028 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:264:1028 [4] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:264:1028 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:261:1027 [1] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:261:1027 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:261:1027 [1] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:261:1027 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1029 [7] NCCL INFO NET/IB : No device found.
tyler-rhel-newimage:267:1029 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.11<0>
tyler-rhel-newimage:267:1029 [7] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:267:1029 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xbba7bcd413cc6af1 - Init START
tyler-rhel-newimage:266:1024 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1024 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:265:1026 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1026 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:260:1022 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:267:1029 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1029 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:263:1025 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:263:1025 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:262:1023 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:262:1023 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:261:1027 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:261:1027 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:264:1028 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1028 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:266:1024 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1029 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:266:1024 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1029 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:265:1026 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:264:1028 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:265:1026 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:264:1028 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1022 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1022 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:263:1025 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:262:1023 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:262:1023 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1027 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:261:1027 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1025 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1023 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1028 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1026 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1024 [6] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1029 [7] NCCL INFO Connected all rings
tyler-rhel-newimage:260:1022 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1029 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1025 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1023 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1028 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1024 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1026 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:261:1027 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1022 [0] NCCL INFO Connected all trees
tyler-rhel-newimage:260:1022 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1022 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:260:1022 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:261:1027 [1] NCCL INFO Connected all trees
tyler-rhel-newimage:261:1027 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1027 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1023 [2] NCCL INFO Connected all trees
tyler-rhel-newimage:262:1023 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1023 [2] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1023 [2] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:261:1027 [1] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:263:1025 [3] NCCL INFO Connected all trees
tyler-rhel-newimage:263:1025 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1025 [3] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1029 [7] NCCL INFO Connected all trees
tyler-rhel-newimage:267:1029 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1029 [7] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1025 [3] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:267:1029 [7] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:264:1028 [4] NCCL INFO Connected all trees
tyler-rhel-newimage:264:1028 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1028 [4] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:266:1024 [6] NCCL INFO Connected all trees
tyler-rhel-newimage:266:1024 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1024 [6] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1026 [5] NCCL INFO Connected all trees
tyler-rhel-newimage:265:1026 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1026 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1026 [5] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:264:1028 [4] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:266:1024 [6] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
tyler-rhel-newimage:265:1026 [5] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:266:1024 [6] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:267:1029 [7] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:260:1022 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:261:1027 [1] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:264:1028 [4] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:262:1023 [2] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xbba7bcd413cc6af1 - Init COMPLETE
tyler-rhel-newimage:263:1025 [3] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xbba7bcd413cc6af1 - Init COMPLETE
Generating train split: 185 examples [00:00, 25776.38 examples/s]
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12894.40it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11066.61it/s]
Effective batch size is too low for multipack sampling, max sample length=141 and min packing length=135. Switching to naive distributed sampling.
{
"num_gpus": 8,
"avg_sample_len": 67.78918918918919,
"effective_batch_size": 16,
"max_batch_len_per_gpu": 2,
"packing_max_batch_len": null,
"grad_accum": 1,
"num_batches": 12,
"avg_samples_per_batch": 15.416666666666666,
"samples_per_gpu": 2,
"timestamp": "2024-07-27T20:03:33.017444"
}
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11659.95it/s]
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12065.72it/s]
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11400.91it/s]
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11540.81it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 0%| | 0/185 [00:00<?, ?it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 12968.10it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Data length calculation: 100%|██████████| 185/185 [00:00<00:00, 11126.75it/s]
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.15251493453979492 seconds
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1219935417175293 seconds
[2024-07-27 20:03:39,014] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4+d254d75, git-hash=d254d75, git-branch=HEAD
[2024-07-27 20:03:39,014] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 0.20261573791503906 seconds
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.12093877792358398 seconds
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
/opt/python3.11/venv/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /var/instructlabbigdisk/instructlab/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.12085723876953125 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10260534286499023 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.2024221420288086 seconds
Time to load fused_adam op: 0.20228958129882812 seconds
tyler-rhel-newimage:261:1135 [1] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:261:1135 [1] NCCL INFO Using network Socket
tyler-rhel-newimage:267:1125 [7] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:267:1125 [7] NCCL INFO Using network Socket
tyler-rhel-newimage:264:1129 [4] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:264:1129 [4] NCCL INFO Using network Socket
tyler-rhel-newimage:263:1138 [3] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:263:1138 [3] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1132 [5] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:265:1132 [5] NCCL INFO Using network Socket
tyler-rhel-newimage:260:1124 [0] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:260:1124 [0] NCCL INFO Using network Socket
tyler-rhel-newimage:262:1141 [2] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO Using network Socket
tyler-rhel-newimage:266:1126 [6] NCCL INFO Using non-device net plugin version 0
tyler-rhel-newimage:266:1126 [6] NCCL INFO Using network Socket
tyler-rhel-newimage:265:1132 [5] NCCL INFO bootstrapSplit: comm 0x56464ab274c0 parent 0x56464a4e7a70 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:264:1129 [4] NCCL INFO bootstrapSplit: comm 0x55b22abe29d0 parent 0x55b22a5ae220 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:263:1138 [3] NCCL INFO bootstrapSplit: comm 0x560000574c60 parent 0x55fffff3ce80 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:262:1141 [2] NCCL INFO bootstrapSplit: comm 0x55f25fc9eb30 parent 0x55f25f665d50 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:261:1135 [1] NCCL INFO bootstrapSplit: comm 0x55fca66893d0 parent 0x55fca60525d0 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:260:1124 [0] NCCL INFO bootstrapSplit: comm 0x558210f77210 parent 0x558210938950 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:267:1125 [7] NCCL INFO bootstrapSplit: comm 0x564fb4716ac0 parent 0x564fb40d9fa0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:266:1126 [6] NCCL INFO bootstrapSplit: comm 0x55f35a3e2520 parent 0x55f359e7d980 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x358cd0e27660cbba - Init START
tyler-rhel-newimage:264:1129 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-rhel-newimage:264:1129 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-rhel-newimage:267:1125 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-rhel-newimage:267:1125 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-rhel-newimage:261:1135 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-rhel-newimage:261:1135 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-rhel-newimage:260:1124 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-rhel-newimage:260:1124 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-rhel-newimage:266:1126 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-rhel-newimage:266:1126 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-rhel-newimage:263:1138 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-rhel-newimage:263:1138 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-rhel-newimage:265:1132 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-rhel-newimage:265:1132 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-rhel-newimage:262:1141 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-rhel-newimage:262:1141 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-rhel-newimage:262:1141 [2] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:267:1125 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:266:1126 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-rhel-newimage:267:1125 [7] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:265:1132 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-rhel-newimage:266:1126 [6] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:263:1138 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-rhel-newimage:265:1132 [5] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:261:1135 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:264:1129 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:264:1129 [4] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-rhel-newimage:260:1124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-rhel-newimage:260:1124 [0] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:263:1138 [3] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:261:1135 [1] NCCL INFO P2P Chunksize set to 524288
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Connected all rings
tyler-rhel-newimage:264:1129 [4] NCCL INFO Connected all rings
tyler-rhel-newimage:263:1138 [3] NCCL INFO Connected all rings
tyler-rhel-newimage:261:1135 [1] NCCL INFO Connected all rings
tyler-rhel-newimage:262:1141 [2] NCCL INFO Connected all rings
tyler-rhel-newimage:265:1132 [5] NCCL INFO Connected all rings
tyler-rhel-newimage:266:1126 [6] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1125 [7] NCCL INFO Connected all rings
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:267:1125 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:262:1141 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:264:1129 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:261:1135 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:263:1138 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:266:1126 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:265:1132 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
tyler-rhel-newimage:260:1124 [0] NCCL INFO Connected all trees
tyler-rhel-newimage:260:1124 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:260:1124 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:261:1135 [1] NCCL INFO Connected all trees
tyler-rhel-newimage:261:1135 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:261:1135 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:262:1141 [2] NCCL INFO Connected all trees
tyler-rhel-newimage:262:1141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:262:1141 [2] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:263:1138 [3] NCCL INFO Connected all trees
tyler-rhel-newimage:263:1138 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:263:1138 [3] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:267:1125 [7] NCCL INFO Connected all trees
tyler-rhel-newimage:267:1125 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:267:1125 [7] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:264:1129 [4] NCCL INFO Connected all trees
tyler-rhel-newimage:264:1129 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:264:1129 [4] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:266:1126 [6] NCCL INFO Connected all trees
tyler-rhel-newimage:265:1132 [5] NCCL INFO Connected all trees
tyler-rhel-newimage:266:1126 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:265:1132 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-rhel-newimage:266:1126 [6] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1132 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-rhel-newimage:265:1132 [5] NCCL INFO comm 0x56464ab274c0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:266:1126 [6] NCCL INFO comm 0x55f35a3e2520 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:267:1125 [7] NCCL INFO comm 0x564fb4716ac0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:260:1124 [0] NCCL INFO comm 0x558210f77210 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:261:1135 [1] NCCL INFO comm 0x55fca66893d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:262:1141 [2] NCCL INFO comm 0x55f25fc9eb30 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:264:1129 [4] NCCL INFO comm 0x55b22abe29d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x358cd0e27660cbba - Init COMPLETE
tyler-rhel-newimage:263:1138 [3] NCCL INFO comm 0x560000574c60 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x358cd0e27660cbba - Init COMPLETE
[2024-07-27 20:03:47,872] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-27 20:03:47,874] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-27 20:03:47,874] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-27 20:03:47,886] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-07-27 20:03:47,887] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-07-27 20:03:47,887] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-07-27 20:03:47,887] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-07-27 20:04:00,524] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:01,385] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-07-27 20:04:01,386] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-07-27 20:04:01,386] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 138.49 GB, percent = 11.0%
[2024-07-27 20:04:01,578] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-07-27 20:04:01,579] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 20:04:01,579] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 138.49 GB, percent = 11.0%
[2024-07-27 20:04:01,579] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-07-27 20:04:01,777] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-07-27 20:04:01,778] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB
[2024-07-27 20:04:01,778] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 138.49 GB, percent = 11.0%
[2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fbf02112310>
[2024-07-27 20:04:01,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 20:04:01,781] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] amp_params ................... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] bfloat16_enabled ............. True
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fbef4750bd0>
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] dump_state ................... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-07-27 20:04:01,782] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] loss_scale ................... 1.0
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] pld_params ................... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] steps_per_print .............. 1
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] train_batch_size ............. 16
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] world_size ................... 8
[2024-07-27 20:04:01,783] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False
[2024-07-27 20:04:01,784] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-27 20:04:01,784] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-07-27 20:04:01,784] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-07-27 20:04:01,784] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-07-27 20:04:01,784] [INFO] [config.py:987:print_user_config] json = {
    "train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 2,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {
            "device": "none"
        },
        "offload_optimizer": {
            "device": "none"
        }
    },
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false
}
[2024-07-27 20:04:01,784] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Number of samples per save: 176
[2024-07-27 20:04:01,865] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:01,875] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:01,984] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:02,237] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:02,285] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-07-27 20:04:02,433] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/instructlabbigdisk/instructlab/skillscheckpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Epoch 0: 0%| | 0/12 [00:00<?, ?it/s]
total tokens: 118 num samples: 2 num padding tokens: 13 - rank: 7 max len: 59 min len: 46 avg len: 52.5 num_loss_counted_tokens: 52
total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 68
total tokens: 282 num samples: 2 num padding tokens: 83 - rank: 7 max len: 141 min len: 58 avg len: 99.5 num_loss_counted_tokens: 150
total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 7 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 56
total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 7 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 53
total tokens: 186 num samples: 2 num padding tokens: 5 - rank: 7 max len: 93 min len: 88 avg len: 90.5 num_loss_counted_tokens: 121
total tokens: 156 num samples: 2 num padding tokens: 21 - rank: 6 max len: 78 min len: 57 avg len: 67.5 num_loss_counted_tokens: 70
total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 61
total tokens: 184 num samples: 2 num padding tokens: 27 - rank: 7 max len: 92 min len: 65 avg len: 78.5 num_loss_counted_tokens: 90
total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 7 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 66
total tokens: 142 num samples: 2 num padding tokens: 14 - rank: 7 max len: 71 min len: 57 avg len: 64.0 num_loss_counted_tokens: 58
total tokens: 114 num samples: 2 num padding tokens: 13 - rank: 7 max len: 57 min len: 44 avg len: 50.5 num_loss_counted_tokens: 55
total tokens: 140 num samples: 2 num padding tokens: 3 - rank: 7 max len: 70 min len: 67 avg len: 68.5 num_loss_counted_tokens: 70
total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 1 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 74
total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 1 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 71
total tokens: 158 num samples: 2 num padding tokens: 36 - rank: 3 max len: 79 min len: 43 avg len: 61.0 num_loss_counted_tokens: 56
total tokens: 202 num samples: 2 num padding tokens: 51 - rank: 4 max len: 101 min len: 50 avg len: 75.5 num_loss_counted_tokens: 106
total tokens: 106 num samples: 2 num padding tokens: 7 - rank: 0 max len: 53 min len: 46 avg len: 49.5 num_loss_counted_tokens: 56
total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 7 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 59
total tokens: 126 num samples: 2 num padding tokens: 19 - rank: 0 max len: 63 min len: 44 avg len: 53.5 num_loss_counted_tokens: 53
total tokens: 134 num samples: 2 num padding tokens: 0 - rank: 4 max len: 67 min len: 67 avg len: 67.0 num_loss_counted_tokens: 52
total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 3 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 88
total tokens: 188 num samples: 2 num padding tokens: 4 - rank: 6 max len: 94 min len: 90 avg len: 92.0 num_loss_counted_tokens: 121
total tokens: 168 num samples: 2 num padding tokens: 20 - rank: 3 max len: 84 min len: 64 avg len: 74.0 num_loss_counted_tokens: 95
total tokens: 106 num samples: 2 num padding tokens: 2 - rank: 6 max len: 53 min len: 51 avg len: 52.0 num_loss_counted_tokens: 51
total tokens: 110 num samples: 2 num padding tokens: 0 - rank: 6 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 64
total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 4 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 58
total tokens: 244 num samples: 2 num padding tokens: 63 - rank: 4 max len: 122 min len: 59 avg len: 90.5 num_loss_counted_tokens: 120
total tokens: 128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 64 min len: 62 avg len: 63.0 num_loss_counted_tokens: 65
total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 69
total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 6 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 77
total tokens: 180 num samples: 2 num padding tokens: 20 - rank: 4 max len: 90 min len: 70 avg len: 80.0 num_loss_counted_tokens: 111
total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 4 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 59
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 57
total tokens: 144 num samples: 2 num padding tokens: 12 - rank: 4 max len: 72 min len: 60 avg len: 66.0 num_loss_counted_tokens: 68
total tokens: 200 num samples: 2 num padding tokens: 50 - rank: 6 max len: 100 min len: 50 avg len: 75.0 num_loss_counted_tokens: 91
total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 3 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 86
total tokens: 154 num samples: 2 num padding tokens: 33 - rank: 4 max len: 77 min len: 44 avg len: 60.5 num_loss_counted_tokens: 77
total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 6 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 58
total tokens: 158 num samples: 2 num padding tokens: 5 - rank: 6 max len: 79 min len: 74 avg len: 76.5 num_loss_counted_tokens: 82
total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 6 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 72
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 0 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 4 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 66
total tokens: 148 num samples: 2 num padding tokens: 16 - rank: 0 max len: 74 min len: 58 avg len: 66.0 num_loss_counted_tokens: 73
total tokens: 134 num samples: 2 num padding tokens: 13 - rank: 6 max len: 67 min len: 54 avg len: 60.5 num_loss_counted_tokens: 67
total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 0 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 92
total tokens: 140 num samples: 2 num padding tokens: 13 - rank: 3 max len: 70 min len: 57 avg len: 63.5 num_loss_counted_tokens: 67
total tokens: 162 num samples: 2 num padding tokens: 27 - rank: 3 max len: 81 min len: 54 avg len: 67.5 num_loss_counted_tokens: 86
total tokens: 102 num samples: 2 num padding tokens: 5 - rank: 0 max len: 51 min len: 46 avg len: 48.5 num_loss_counted_tokens: 49
total tokens: 160 num samples: 2 num padding tokens: 31 - rank: 0 max len: 80 min len: 49 avg len: 64.5 num_loss_counted_tokens: 70
total tokens: 146 num samples: 2 num padding tokens: 3 - rank: 0 max len: 73 min len: 70 avg len: 71.5 num_loss_counted_tokens: 87
total tokens: 130 num samples: 2 num padding tokens: 5 - rank: 0 max len: 65 min len: 60 avg len: 62.5 num_loss_counted_tokens: 57
total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 0 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 71
total tokens: 214 num samples: 2 num padding tokens: 26 - rank: 3 max len: 107 min len: 81 avg len: 94.0 num_loss_counted_tokens: 116
total tokens: 196 num samples: 2 num padding tokens: 32 - rank: 3 max len: 98 min len: 66 avg len: 82.0 num_loss_counted_tokens: 101
total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 1 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 64
total tokens: 166 num samples: 2 num padding tokens: 8 - rank: 6 max len: 83 min len: 75 avg len: 79.0 num_loss_counted_tokens: 77
total tokens: 228 num samples: 2 num padding tokens: 54 - rank: 1 max len: 114 min len: 60 avg len: 87.0 num_loss_counted_tokens: 125
total tokens: 152 num samples: 2 num padding tokens: 10 - rank: 2 max len: 76 min len: 66 avg len: 71.0 num_loss_counted_tokens: 77
total tokens: 126 num samples: 2 num padding tokens: 12 - rank: 2 max len: 63 min len: 51 avg len: 57.0 num_loss_counted_tokens: 56
total tokens: 154 num samples: 2 num padding tokens: 24 - rank: 0 max len: 77 min len: 53 avg len: 65.0 num_loss_counted_tokens: 66
total tokens: 146 num samples: 2 num padding tokens: 19 - rank: 3 max len: 73 min len: 54 avg len: 63.5 num_loss_counted_tokens: 71
total tokens: 116 num samples: 2 num padding tokens: 10 - rank: 3 max len: 58 min len: 48 avg len: 53.0 num_loss_counted_tokens: 52
total tokens: 138 num samples: 2 num padding tokens: 24 - rank: 4 max len: 69 min len: 45 avg len: 57.0 num_loss_counted_tokens: 68
total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 1 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 42
total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 2 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 51
total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 1 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 65
total tokens: 216 num samples: 2 num padding tokens: 49 - rank: 1 max len: 108 min len: 59 avg len: 83.5 num_loss_counted_tokens: 103
total tokens: 214 num samples: 2 num padding tokens: 21 - rank: 2 max len: 107 min len: 86 avg len: 96.5 num_loss_counted_tokens: 106
total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 2 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 83
total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 1 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 63
total tokens: 164 num samples: 2 num padding tokens: 29 - rank: 1 max len: 82 min len: 53 avg len: 67.5 num_loss_counted_tokens: 83
total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 1 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 86
total tokens: 168 num samples: 2 num padding tokens: 18 - rank: 2 max len: 84 min len: 66 avg len: 75.0 num_loss_counted_tokens: 81
total tokens: 226 num samples: 2 num padding tokens: 44 - rank: 2 max len: 113 min len: 69 avg len: 91.0 num_loss_counted_tokens: 97
total tokens: 180 num samples: 2 num padding tokens: 24 - rank: 2 max len: 90 min len: 66 avg len: 78.0 num_loss_counted_tokens: 103
total tokens: 186 num samples: 2 num padding tokens: 48 - rank: 1 max len: 93 min len: 45 avg len: 69.0 num_loss_counted_tokens: 110
total tokens: 208 num samples: 2 num padding tokens: 33 - rank: 2 max len: 104 min len: 71 avg len: 87.5 num_loss_counted_tokens: 113
total tokens: 188 num samples: 2 num padding tokens: 32 - rank: 3 max len: 94 min len: 62 avg len: 78.0 num_loss_counted_tokens: 89
total tokens: 116 num samples: 2 num padding tokens: 10 - rank: 2 max len: 58 min len: 48 avg len: 53.0 num_loss_counted_tokens: 62
total tokens: 194 num samples: 2 num padding tokens: 48 - rank: 1 max len: 97 min len: 49 avg len: 73.0 num_loss_counted_tokens: 95
total tokens: 120 num samples: 2 num padding tokens: 9 - rank: 2 max len: 60 min len: 51 avg len: 55.5 num_loss_counted_tokens: 56
total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 6 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 69
total tokens: 162 num samples: 2 num padding tokens: 17 - rank: 2 max len: 81 min len: 64 avg len: 72.5 num_loss_counted_tokens: 91
total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 5 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 70
total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 5 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 87
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 5 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 72
total tokens: 172 num samples: 2 num padding tokens: 36 - rank: 5 max len: 86 min len: 50 avg len: 68.0 num_loss_counted_tokens: 70
total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 5 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 5 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 81
total tokens: 174 num samples: 2 num padding tokens: 19 - rank: 5 max len: 87 min len: 68 avg len: 77.5 num_loss_counted_tokens: 80
total tokens: 202 num samples: 2 num padding tokens: 18 - rank: 5 max len: 101 min len: 83 avg len: 92.0 num_loss_counted_tokens: 124
total tokens: 122 num samples: 2 num padding tokens: 11 - rank: 5 max len: 61 min len: 50 avg len: 55.5 num_loss_counted_tokens: 65
total tokens: 174 num samples: 2 num padding tokens: 15 - rank: 5 max len: 87 min len: 72 avg len: 79.5 num_loss_counted_tokens: 98
total tokens: 160 num samples: 2 num padding tokens: 7 - rank: 5 max len: 80 min len: 73 avg len: 76.5 num_loss_counted_tokens: 104
total tokens: 124 num samples: 2 num padding tokens: 1 - rank: 5 max len: 62 min len: 61 avg len: 61.5 num_loss_counted_tokens: 59
Per-token loss scaled by world size: 0.0017695350106805563
Per-token loss scaled by world size: 0.0008982627186924219
Epoch: 0, Step: 1, Rank: 0, loss = 0.12254030257463455
Epoch: 0, Step: 1, Rank: 6, loss = 0.062204692512750626
Per-token loss scaled by world size: 0.0019191226456314325
Epoch: 0, Step: 1, Rank: 7, loss = 0.1328992396593094
Per-token loss scaled by world size: 0.002525273710489273
Epoch: 0, Step: 1, Rank: 3, loss = 0.1748751997947693
Per-token loss scaled by world size: 0.002455754904076457
Epoch: 0, Step: 1, Rank: 4, loss = 0.17006102204322815
Per-token loss scaled by world size: 0.0006225108518265188
Epoch: 0, Step: 1, Rank: 2, loss = 0.043108876794576645
Per-token loss scaled by world size: 0.004392423201352358
Epoch: 0, Step: 1, Rank: 5, loss = 0.30417531728744507
Per-token loss scaled by world size: 0.0016390715027227998
Epoch: 0, Step: 1, Rank: 1, loss = 0.11350569874048233
[2024-07-27 20:04:03,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
Epoch 0: 8%|▊ | 1/12 [00:01<00:13, 1.27s/it]
{
"epoch": 0,
"step": 1,
"rank": 0,
"loss": 0.12254030257463455,
"overall_throughput": 19.157316171863247,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 21.99594497680664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 554,
"batch_size": 16,
"total_loss": 0.1404212862253189,
"gradnorm": 2.950925588607788,
"weight_norm": 393.455078125,
"timestamp": "2024-07-27T20:04:03.739792"
}
Per-token loss scaled by world size: 0.001750853261910379
Per-token loss scaled by world size: 0.0013746594777330756
Per-token loss scaled by world size: 0.006612943951040506
Per-token loss scaled by world size: 0.00497298501431942
Per-token loss scaled by world size: 0.0029370656702667475
Per-token loss scaled by world size: 0.00024299396318383515
Epoch: 0, Step: 2, Rank: 2, loss = 0.09450783580541611
Epoch: 0, Step: 2, Rank: 1, loss = 0.12037116289138794
Epoch: 0, Step: 2, Rank: 6, loss = 0.45463991165161133
Epoch: 0, Step: 2, Rank: 5, loss = 0.34189271926879883
Epoch: 0, Step: 2, Rank: 3, loss = 0.20192326605319977
Epoch: 0, Step: 2, Rank: 4, loss = 0.016705835238099098
Per-token loss scaled by world size: 0.0014473804039880633
Per-token loss scaled by world size: 0.0009000621503219008
Epoch: 0, Step: 2, Rank: 0, loss = 0.0995073989033699
Epoch: 0, Step: 2, Rank: 7, loss = 0.061879273504018784
[2024-07-27 20:04:04,156] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
Epoch 0: 17%|█▋ | 2/12 [00:01<00:08, 1.20it/s]
{
"epoch": 0,
"step": 2,
"rank": 0,
"loss": 0.0995073989033699,
"overall_throughput": 38.74044042813212,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 21.998329639434814,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 550,
"batch_size": 16,
"total_loss": 0.1739284247159958,
"gradnorm": 4.509922981262207,
"weight_norm": 393.4551086425781,
"timestamp": "2024-07-27T20:04:04.235419"
}
Per-token loss scaled by world size: 0.0008535730303265154
Per-token loss scaled by world size: 0.0017103212885558605
Per-token loss scaled by world size: 0.003972221631556749
Per-token loss scaled by world size: 0.0014537357492372394
Per-token loss scaled by world size: 0.0024689952842891216
Per-token loss scaled by world size: 0.0007754238904453814
Epoch: 0, Step: 3, Rank: 1, loss = 0.3440936803817749
Epoch: 0, Step: 3, Rank: 6, loss = 0.14815658330917358
Epoch: 0, Step: 3, Rank: 2, loss = 0.07394076138734818
Epoch: 0, Step: 3, Rank: 5, loss = 0.12592986226081848
Epoch: 0, Step: 3, Rank: 7, loss = 0.21387672424316406
Per-token loss scaled by world size: 0.0015765568241477013
Epoch: 0, Step: 3, Rank: 4, loss = 0.06717109680175781
Per-token loss scaled by world size: 0.0009588321554474533
Epoch: 0, Step: 3, Rank: 0, loss = 0.08305883407592773
Epoch: 0, Step: 3, Rank: 3, loss = 0.13656923174858093
[2024-07-27 20:04:04,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:04,785] [INFO] [timer.py:258:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=31.96463872821357, CurrSamplesPerSec=31.96463872821357, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 25%|██▌ | 3/12 [00:02<00:06, 1.42it/s]
{
"epoch": 0,
"step": 3,
"rank": 0,
"loss": 0.08305883407592773,
"overall_throughput": 31.90019931426007,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 21.998568058013916,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 693,
"batch_size": 16,
"total_loss": 0.14909958839416504,
"gradnorm": 4.072885990142822,
"weight_norm": 393.4551696777344,
"timestamp": "2024-07-27T20:04:04.830984"
}
Per-token loss scaled by world size: 0.0011740017216652632
Per-token loss scaled by world size: 0.001550567802041769
Per-token loss scaled by world size: 0.002323366003111005
Per-token loss scaled by world size: 0.0024958737194538116
Per-token loss scaled by world size: 0.0009869500063359737
Per-token loss scaled by world size: 0.0016128732822835445
Per-token loss scaled by world size: 0.0012467901688069105
Epoch: 0, Step: 4, Rank: 6, loss = 0.1122223436832428
Epoch: 0, Step: 4, Rank: 5, loss = 0.16815361380577087
Epoch: 0, Step: 4, Rank: 1, loss = 0.08496837317943573
Epoch: 0, Step: 4, Rank: 3, loss = 0.1806388646364212
Epoch: 0, Step: 4, Rank: 2, loss = 0.071430504322052
Epoch: 0, Step: 4, Rank: 4, loss = 0.11673170328140259
Epoch: 0, Step: 4, Rank: 7, loss = 0.09023644030094147
Per-token loss scaled by world size: 0.0015552268596366048
Epoch: 0, Step: 4, Rank: 0, loss = 0.11255954205989838
[2024-07-27 20:04:05,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:05,334] [INFO] [timer.py:258:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=32.182993941908016, CurrSamplesPerSec=32.404352908739476, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 33%|███▎ | 4/12 [00:02<00:05, 1.55it/s]
{
"epoch": 0,
"step": 4,
"rank": 0,
"loss": 0.11255954205989838,
"overall_throughput": 32.34236936565279,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 21.996421813964844,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 579,
"batch_size": 16,
"total_loss": 0.11711767315864563,
"gradnorm": 2.631800889968872,
"weight_norm": 393.4552001953125,
"timestamp": "2024-07-27T20:04:05.382886"
}
Per-token loss scaled by world size: 0.002091767033562064
Per-token loss scaled by world size: 0.00198046350851655
Per-token loss scaled by world size: 0.0025131492875516415
Per-token loss scaled by world size: 0.0008698466117493808
Per-token loss scaled by world size: 0.001278402516618371
Per-token loss scaled by world size: 0.0014365998795256019
Per-token loss scaled by world size: 0.0012164696818217635
Epoch: 0, Step: 5, Rank: 5, loss = 0.1651211529970169
Epoch: 0, Step: 5, Rank: 4, loss = 0.20953382551670074
Epoch: 0, Step: 5, Rank: 2, loss = 0.07252345979213715
Epoch: 0, Step: 5, Rank: 6, loss = 0.17440107464790344
Epoch: 0, Step: 5, Rank: 0, loss = 0.10658681392669678
Epoch: 0, Step: 5, Rank: 3, loss = 0.11977651715278625
Epoch: 0, Step: 5, Rank: 1, loss = 0.10142315924167633
Per-token loss scaled by world size: 0.000835251237731427
Epoch: 0, Step: 5, Rank: 7, loss = 0.06963907182216644
[2024-07-27 20:04:05,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:05,909] [INFO] [timer.py:258:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=31.583034761692534, CurrSamplesPerSec=30.44781135920859, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 42%|████▏ | 5/12 [00:03<00:04, 1.62it/s]
{
"epoch": 0,
"step": 5,
"rank": 0,
"loss": 0.10658681392669678,
"overall_throughput": 30.372502350050485,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 22.000954627990723,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 667,
"batch_size": 16,
"total_loss": 0.12737563252449036,
"gradnorm": 2.452970027923584,
"weight_norm": 393.45526123046875,
"timestamp": "2024-07-27T20:04:05.943274"
}
Per-token loss scaled by world size: 0.002868585754185915
Per-token loss scaled by world size: 0.00250813621096313
Per-token loss scaled by world size: 0.0015482519520446658
Per-token loss scaled by world size: 0.0010162107646465302
Per-token loss scaled by world size: 0.0008416934870183468
Per-token loss scaled by world size: 0.002133122645318508
Per-token loss scaled by world size: 0.001864621532149613
Epoch: 0, Step: 6, Rank: 6, loss = 0.05828727409243584
Epoch: 0, Step: 6, Rank: 4, loss = 0.1986495554447174
Epoch: 0, Step: 6, Rank: 2, loss = 0.10721644759178162
Epoch: 0, Step: 6, Rank: 5, loss = 0.07037259638309479
Epoch: 0, Step: 6, Rank: 3, loss = 0.17368842661380768
Epoch: 0, Step: 6, Rank: 0, loss = 0.14771874248981476
Epoch: 0, Step: 6, Rank: 7, loss = 0.12912504374980927
Per-token loss scaled by world size: 0.0008939910912886262
Epoch: 0, Step: 6, Rank: 1, loss = 0.06190888211131096
[2024-07-27 20:04:06,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:06,460] [INFO] [timer.py:258:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=31.51022949423371, CurrSamplesPerSec=31.293813829665694, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 50%|█████ | 6/12 [00:04<00:03, 1.68it/s]
{
"epoch": 0,
"step": 6,
"rank": 0,
"loss": 0.14771874248981476,
"overall_throughput": 31.243491527428024,
"lr": 4.800000000000001e-06,
"cuda_mem_allocated": 21.996244430541992,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 554,
"batch_size": 16,
"total_loss": 0.11837086826562881,
"gradnorm": 2.045849323272705,
"weight_norm": 393.4552917480469,
"timestamp": "2024-07-27T20:04:06.510648"
}
Per-token loss scaled by world size: 0.001680628745816648
Per-token loss scaled by world size: 0.0018293416360393167
Per-token loss scaled by world size: 0.0013819513842463493
Per-token loss scaled by world size: 0.0005512057687155902
Per-token loss scaled by world size: 0.0015243319794535637
Per-token loss scaled by world size: 0.0011020175879821181
Per-token loss scaled by world size: 0.001964986091479659
Epoch: 0, Step: 7, Rank: 5, loss = 0.11867507547140121
Epoch: 0, Step: 7, Rank: 7, loss = 0.15709471702575684
Epoch: 0, Step: 7, Rank: 6, loss = 0.09463576227426529
Epoch: 0, Step: 7, Rank: 4, loss = 0.1309020072221756
Epoch: 0, Step: 7, Rank: 2, loss = 0.16874317824840546
Epoch: 0, Step: 7, Rank: 1, loss = 0.14432398974895477
Epoch: 0, Step: 7, Rank: 0, loss = 0.04733479768037796
Per-token loss scaled by world size: 0.0013999826041981578
Epoch: 0, Step: 7, Rank: 3, loss = 0.1202235072851181
[2024-07-27 20:04:06,935] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[5.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:07,012] [INFO] [timer.py:258:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=31.621939391152157, CurrSamplesPerSec=32.07681358232997, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 58%|█████▊ | 7/12 [00:04<00:02, 1.72it/s]
{
"epoch": 0,
"step": 7,
"rank": 0,
"loss": 0.04733479768037796,
"overall_throughput": 32.023072654142574,
"lr": 5.600000000000001e-06,
"cuda_mem_allocated": 21.99880838394165,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 687,
"batch_size": 16,
"total_loss": 0.12274163216352463,
"gradnorm": 2.7485461235046387,
"weight_norm": 393.4553527832031,
"timestamp": "2024-07-27T20:04:07.057260"
}
Per-token loss scaled by world size: 0.0043866513296961784
Per-token loss scaled by world size: 0.0009822045685723424
Per-token loss scaled by world size: 0.003587431972846389
Per-token loss scaled by world size: 0.002129745902493596
Per-token loss scaled by world size: 0.0025288627948611975
Per-token loss scaled by world size: 0.0017483173869550228
Per-token loss scaled by world size: 0.0015334823401644826
Epoch: 0, Step: 8, Rank: 6, loss = 0.2569498121738434
Epoch: 0, Step: 8, Rank: 3, loss = 0.1811297982931137
Epoch: 0, Step: 8, Rank: 0, loss = 0.07035040110349655
Epoch: 0, Step: 8, Rank: 4, loss = 0.12522323429584503
Epoch: 0, Step: 8, Rank: 5, loss = 0.3141939043998718
Epoch: 0, Step: 8, Rank: 1, loss = 0.1525430530309677
Epoch: 0, Step: 8, Rank: 7, loss = 0.1098356693983078
Per-token loss scaled by world size: 0.004045259207487106
Epoch: 0, Step: 8, Rank: 2, loss = 0.2897416949272156
[2024-07-27 20:04:07,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[6.4000000000000006e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:07,557] [INFO] [timer.py:258:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=31.719761022460865, CurrSamplesPerSec=32.21809006047175, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 67%|██████▋ | 8/12 [00:05<00:02, 1.76it/s]
{
"epoch": 0,
"step": 8,
"rank": 0,
"loss": 0.07035040110349655,
"overall_throughput": 32.162302804268634,
"lr": 6.4000000000000006e-06,
"cuda_mem_allocated": 22.001669883728027,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 573,
"batch_size": 16,
"total_loss": 0.18749594688415527,
"gradnorm": 3.855632781982422,
"weight_norm": 393.45538330078125,
"timestamp": "2024-07-27T20:04:07.599450"
}
Per-token loss scaled by world size: 0.0026937518268823624
Per-token loss scaled by world size: 0.0037981150671839714
Per-token loss scaled by world size: 0.0015854539815336466
Per-token loss scaled by world size: 0.002551022917032242
Per-token loss scaled by world size: 0.0024539916776120663
Per-token loss scaled by world size: 0.002055267570540309
Epoch: 0, Step: 9, Rank: 0, loss = 0.2181939035654068
Epoch: 0, Step: 9, Rank: 2, loss = 0.1987733244895935
Epoch: 0, Step: 9, Rank: 3, loss = 0.30764731764793396
Epoch: 0, Step: 9, Rank: 4, loss = 0.16647666692733765
Epoch: 0, Step: 9, Rank: 7, loss = 0.2066328525543213
Epoch: 0, Step: 9, Rank: 1, loss = 0.12842176854610443
Per-token loss scaled by world size: 0.0031997160986065865
Per-token loss scaled by world size: 0.002269922522827983
Epoch: 0, Step: 9, Rank: 5, loss = 0.25917699933052063
Epoch: 0, Step: 9, Rank: 6, loss = 0.18386372923851013
[2024-07-27 20:04:08,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[7.2000000000000005e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:08,097] [INFO] [timer.py:258:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=31.789200407919928, CurrSamplesPerSec=32.21230625969002, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 75%|███████▌ | 9/12 [00:05<00:01, 1.78it/s]
{
"epoch": 0,
"step": 9,
"rank": 0,
"loss": 0.2181939035654068,
"overall_throughput": 32.12612073517451,
"lr": 7.2000000000000005e-06,
"cuda_mem_allocated": 22.002385139465332,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 648,
"batch_size": 16,
"total_loss": 0.20864830911159515,
"gradnorm": 35.085845947265625,
"weight_norm": 393.4554138183594,
"timestamp": "2024-07-27T20:04:08.140249"
}
Per-token loss scaled by world size: 0.0028963200747966766
Per-token loss scaled by world size: 0.0014150061178952456
Per-token loss scaled by world size: 0.004510107450187206
Per-token loss scaled by world size: 0.0027439305558800697
Per-token loss scaled by world size: 0.003027191385626793
Per-token loss scaled by world size: 0.002273061079904437
Per-token loss scaled by world size: 0.0028788307681679726
Epoch: 0, Step: 10, Rank: 5, loss = 0.2164275199174881
Epoch: 0, Step: 10, Rank: 2, loss = 0.3557347357273102
Epoch: 0, Step: 10, Rank: 6, loss = 0.1116086095571518
Epoch: 0, Step: 10, Rank: 0, loss = 0.23876972496509552
Epoch: 0, Step: 10, Rank: 1, loss = 0.22844724357128143
Epoch: 0, Step: 10, Rank: 3, loss = 0.22706778347492218
Epoch: 0, Step: 10, Rank: 4, loss = 0.17928770184516907
Per-token loss scaled by world size: 0.004227162804454565
Epoch: 0, Step: 10, Rank: 7, loss = 0.33341747522354126
[2024-07-27 20:04:08,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:08,643] [INFO] [timer.py:258:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=31.799625325743182, CurrSamplesPerSec=31.872791640267828, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 83%|████████▎ | 10/12 [00:06<00:01, 1.80it/s]
{
"epoch": 0,
"step": 10,
"rank": 0,
"loss": 0.23876972496509552,
"overall_throughput": 31.789585477820573,
"lr": 8.000000000000001e-06,
"cuda_mem_allocated": 22.002862453460693,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 631,
"batch_size": 16,
"total_loss": 0.23634512722492218,
"gradnorm": 5.703427791595459,
"weight_norm": 393.4554748535156,
"timestamp": "2024-07-27T20:04:08.686452"
}
Per-token loss scaled by world size: 0.0033281107898801565
Per-token loss scaled by world size: 0.0010645152069628239
Per-token loss scaled by world size: 0.004243766888976097
Per-token loss scaled by world size: 0.003650533501058817
Per-token loss scaled by world size: 0.0036266690585762262
Per-token loss scaled by world size: 0.0018828274914994836
Per-token loss scaled by world size: 0.0036798259243369102
Epoch: 0, Step: 11, Rank: 4, loss = 0.07890719175338745
Epoch: 0, Step: 11, Rank: 6, loss = 0.2705957889556885
Epoch: 0, Step: 11, Rank: 2, loss = 0.3145692050457001
Epoch: 0, Step: 11, Rank: 1, loss = 0.26882684230804443
Epoch: 0, Step: 11, Rank: 5, loss = 0.2727670967578888
Epoch: 0, Step: 11, Rank: 0, loss = 0.24669620394706726
Epoch: 0, Step: 11, Rank: 3, loss = 0.13956458866596222
Per-token loss scaled by world size: 0.002425282960757613
Epoch: 0, Step: 11, Rank: 7, loss = 0.17977410554885864
[2024-07-27 20:04:09,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[8.8e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:09,202] [INFO] [timer.py:258:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=31.7205989700859, CurrSamplesPerSec=31.10225264577545, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Saving model in huggingface format at samples_seen: 176
{
"epoch": 0,
"step": 11,
"rank": 0,
"loss": 0.24669620394706726,
"overall_throughput": 31.02962868774954,
"lr": 8.8e-06,
"cuda_mem_allocated": 22.00071620941162,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 593,
"batch_size": 16,
"total_loss": 0.22146263718605042,
"gradnorm": 4.970978736877441,
"weight_norm": 393.4555358886719,
"timestamp": "2024-07-27T20:04:09.205377"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_176
[20:04:26] INFO saving took 17.77474355697632 seconds utils.py:611
Epoch 0: 92%|█████████▏| 11/12 [00:24<00:05, 6.00s/it]
Per-token loss scaled by world size: 0.0032385066151618958
Per-token loss scaled by world size: 0.0007423295173794031
Per-token loss scaled by world size: 0.004228494130074978
Per-token loss scaled by world size: 0.002175833098590374
Per-token loss scaled by world size: 0.0016533531015738845
Per-token loss scaled by world size: 0.0016122134402394295
Per-token loss scaled by world size: 0.0011377736227586865
Epoch: 0, Step: 12, Rank: 2, loss = 0.3493793308734894
Epoch: 0, Step: 12, Rank: 5, loss = 0.06133497506380081
Epoch: 0, Step: 12, Rank: 6, loss = 0.17977821826934814
Epoch: 0, Step: 12, Rank: 0, loss = 0.26758161187171936
Epoch: 0, Step: 12, Rank: 1, loss = 0.13320913910865784
Epoch: 0, Step: 12, Rank: 3, loss = 0.1366083025932312
Epoch: 0, Step: 12, Rank: 7, loss = 0.09400854259729385
Per-token loss scaled by world size: 0.0017149074701592326
Epoch: 0, Step: 12, Rank: 4, loss = 0.14169423282146454
[2024-07-27 20:04:27,462] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[9.600000000000001e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:04:27,540] [INFO] [timer.py:258:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=31.65103378148296, CurrSamplesPerSec=31.038411783233425, MemAllocated=22.0GB, MaxMemAllocated=28.29GB
Epoch 0: 100%|██████████| 12/12 [00:25<00:00, 4.34s/it]
{
"epoch": 0,
"step": 12,
"rank": 0,
"loss": 0.26758161187171936,
"overall_throughput": 30.984142462059825,
"lr": 9.600000000000001e-06,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 661,
"batch_size": 16,
"total_loss": 0.17044928669929504,
"gradnorm": 3.941415309906006,
"weight_norm": 393.4555969238281,
"timestamp": "2024-07-27T20:04:27.583911"
}
Epoch 0: 100%|██████████| 12/12 [00:25<00:00, 2.10s/it]
total tokens: 164 num samples: 2 num padding tokens: 22 - rank: 5 max len: 82 min len: 60 avg len: 71.0 num_loss_counted_tokens: 84
total tokens: 186 num samples: 2 num padding tokens: 38 - rank: 5 max len: 93 min len: 55 avg len: 74.0 num_loss_counted_tokens: 82
total tokens: 152 num samples: 2 num padding tokens: 28 - rank: 5 max len: 76 min len: 48 avg len: 62.0 num_loss_counted_tokens: 69
total tokens: 200 num samples: 2 num padding tokens: 36 - rank: 5 max len: 100 min len: 64 avg len: 82.0 num_loss_counted_tokens: 102
total tokens: 196 num samples: 2 num padding tokens: 28 - rank: 5 max len: 98 min len: 70 avg len: 84.0 num_loss_counted_tokens: 101
total tokens: 214 num samples: 2 num padding tokens: 42 - rank: 5 max len: 107 min len: 65 avg len: 86.0 num_loss_counted_tokens: 106
total tokens: 138 num samples: 2 num padding tokens: 12 - rank: 5 max len: 69 min len: 57 avg len: 63.0 num_loss_counted_tokens: 63
total tokens: 202 num samples: 2 num padding tokens: 39 - rank: 5 max len: 101 min len: 62 avg len: 81.5 num_loss_counted_tokens: 105
total tokens: 104 num samples: 2 num padding tokens: 0 - rank: 5 max len: 52 min len: 52 avg len: 52.0 num_loss_counted_tokens: 50
total tokens: 180 num samples: 2 num padding tokens: 31 - rank: 5 max len: 90 min len: 59 avg len: 74.5 num_loss_counted_tokens: 95
total tokens: 174 num samples: 2 num padding tokens: 7 - rank: 5 max len: 87 min len: 80 avg len: 83.5 num_loss_counted_tokens: 97
total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 5 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 72
total tokens: 146 num samples: 2 num padding tokens: 16 - rank: 2 max len: 73 min len: 57 avg len: 65.0 num_loss_counted_tokens: 75
total tokens: 214 num samples: 2 num padding tokens: 53 - rank: 2 max len: 107 min len: 54 avg len: 80.5 num_loss_counted_tokens: 108
total tokens: 136 num samples: 2 num padding tokens: 8 - rank: 2 max len: 68 min len: 60 avg len: 64.0 num_loss_counted_tokens: 62
total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 2 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 80
total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 7 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 119
total tokens: 130 num samples: 2 num padding tokens: 7 - rank: 2 max len: 65 min len: 58 avg len: 61.5 num_loss_counted_tokens: 59
total tokens: 194 num samples: 2 num padding tokens: 23 - rank: 0 max len: 97 min len: 74 avg len: 85.5 num_loss_counted_tokens: 109
total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 2 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
total tokens: 118 num samples: 2 num padding tokens: 5 - rank: 2 max len: 59 min len: 54 avg len: 56.5 num_loss_counted_tokens: 71
total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 7 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 47
total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 7 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 62
total tokens: 150 num samples: 2 num padding tokens: 29 - rank: 2 max len: 75 min len: 46 avg len: 60.5 num_loss_counted_tokens: 64
total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 0 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 78
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 2 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 73
total tokens: 136 num samples: 2 num padding tokens: 23 - rank: 0 max len: 68 min len: 45 avg len: 56.5 num_loss_counted_tokens: 51
total tokens: 132 num samples: 2 num padding tokens: 14 - rank: 2 max len: 66 min len: 52 avg len: 59.0 num_loss_counted_tokens: 58
total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 2 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 70
total tokens: 114 num samples: 2 num padding tokens: 2 - rank: 4 max len: 57 min len: 55 avg len: 56.0 num_loss_counted_tokens: 66
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 49
total tokens: 166 num samples: 2 num padding tokens: 7 - rank: 0 max len: 83 min len: 76 avg len: 79.5 num_loss_counted_tokens: 90
total tokens: 188 num samples: 2 num padding tokens: 22 - rank: 4 max len: 94 min len: 72 avg len: 83.0 num_loss_counted_tokens: 98
total tokens: 156 num samples: 2 num padding tokens: 27 - rank: 7 max len: 78 min len: 51 avg len: 64.5 num_loss_counted_tokens: 71
total tokens: 118 num samples: 2 num padding tokens: 8 - rank: 0 max len: 59 min len: 51 avg len: 55.0 num_loss_counted_tokens: 60
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 4 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 59
total tokens: 166 num samples: 2 num padding tokens: 16 - rank: 7 max len: 83 min len: 67 avg len: 75.0 num_loss_counted_tokens: 75
total tokens: 168 num samples: 2 num padding tokens: 13 - rank: 4 max len: 84 min len: 71 avg len: 77.5 num_loss_counted_tokens: 88
total tokens: 174 num samples: 2 num padding tokens: 41 - rank: 3 max len: 87 min len: 46 avg len: 66.5 num_loss_counted_tokens: 70
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 4 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 68
total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 4 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 84
total tokens: 174 num samples: 2 num padding tokens: 6 - rank: 4 max len: 87 min len: 81 avg len: 84.0 num_loss_counted_tokens: 100
total tokens: 104 num samples: 2 num padding tokens: 8 - rank: 4 max len: 52 min len: 44 avg len: 48.0 num_loss_counted_tokens: 52
total tokens: 152 num samples: 2 num padding tokens: 12 - rank: 3 max len: 76 min len: 64 avg len: 70.0 num_loss_counted_tokens: 81
total tokens: 128 num samples: 2 num padding tokens: 14 - rank: 4 max len: 64 min len: 50 avg len: 57.0 num_loss_counted_tokens: 51
total tokens: 180 num samples: 2 num padding tokens: 9 - rank: 3 max len: 90 min len: 81 avg len: 85.5 num_loss_counted_tokens: 135
total tokens: 154 num samples: 2 num padding tokens: 23 - rank: 2 max len: 77 min len: 54 avg len: 65.5 num_loss_counted_tokens: 75
total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 3 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 57
total tokens: 122 num samples: 2 num padding tokens: 3 - rank: 3 max len: 61 min len: 58 avg len: 59.5 num_loss_counted_tokens: 60
total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 3 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 80
total tokens: 124 num samples: 2 num padding tokens: 5 - rank: 0 max len: 62 min len: 57 avg len: 59.5 num_loss_counted_tokens: 73
total tokens: 244 num samples: 2 num padding tokens: 34 - rank: 3 max len: 122 min len: 88 avg len: 105.0 num_loss_counted_tokens: 147
total tokens: 186 num samples: 2 num padding tokens: 30 - rank: 7 max len: 93 min len: 63 avg len: 78.0 num_loss_counted_tokens: 117
total tokens: 138 num samples: 2 num padding tokens: 25 - rank: 0 max len: 69 min len: 44 avg len: 56.5 num_loss_counted_tokens: 69
total tokens: 118 num samples: 2 num padding tokens: 6 - rank: 0 max len: 59 min len: 53 avg len: 56.0 num_loss_counted_tokens: 50
total tokens: 148 num samples: 2 num padding tokens: 8 - rank: 0 max len: 74 min len: 66 avg len: 70.0 num_loss_counted_tokens: 76
total tokens: 96 num samples: 2 num padding tokens: 5 - rank: 0 max len: 48 min len: 43 avg len: 45.5 num_loss_counted_tokens: 39
total tokens: 166 num samples: 2 num padding tokens: 3 - rank: 3 max len: 83 min len: 80 avg len: 81.5 num_loss_counted_tokens: 111
total tokens: 164 num samples: 2 num padding tokens: 29 - rank: 0 max len: 82 min len: 53 avg len: 67.5 num_loss_counted_tokens: 85
total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 1 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 68
total tokens: 128 num samples: 2 num padding tokens: 7 - rank: 3 max len: 64 min len: 57 avg len: 60.5 num_loss_counted_tokens: 66
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 3 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 73
total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 4 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 3 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 61
total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 7 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 61
total tokens: 282 num samples: 2 num padding tokens: 70 - rank: 7 max len: 141 min len: 71 avg len: 106.0 num_loss_counted_tokens: 151
total tokens: 132 num samples: 2 num padding tokens: 15 - rank: 7 max len: 66 min len: 51 avg len: 58.5 num_loss_counted_tokens: 61
total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 1 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 99
total tokens: 188 num samples: 2 num padding tokens: 8 - rank: 7 max len: 94 min len: 86 avg len: 90.0 num_loss_counted_tokens: 85
total tokens: 168 num samples: 2 num padding tokens: 39 - rank: 7 max len: 84 min len: 45 avg len: 64.5 num_loss_counted_tokens: 71
total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 7 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 88
total tokens: 126 num samples: 2 num padding tokens: 2 - rank: 1 max len: 63 min len: 61 avg len: 62.0 num_loss_counted_tokens: 56
total tokens: 146 num samples: 2 num padding tokens: 11 - rank: 1 max len: 73 min len: 62 avg len: 67.5 num_loss_counted_tokens: 75
total tokens: 208 num samples: 2 num padding tokens: 38 - rank: 0 max len: 104 min len: 66 avg len: 85.0 num_loss_counted_tokens: 110
total tokens: 132 num samples: 2 num padding tokens: 16 - rank: 4 max len: 66 min len: 50 avg len: 58.0 num_loss_counted_tokens: 61
total tokens: 172 num samples: 2 num padding tokens: 28 - rank: 1 max len: 86 min len: 58 avg len: 72.0 num_loss_counted_tokens: 78
total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 1 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 89
total tokens: 226 num samples: 2 num padding tokens: 39 - rank: 1 max len: 113 min len: 74 avg len: 93.5 num_loss_counted_tokens: 109
total tokens: 134 num samples: 2 num padding tokens: 23 - rank: 1 max len: 67 min len: 44 avg len: 55.5 num_loss_counted_tokens: 47
total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 1 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 62
total tokens: 162 num samples: 2 num padding tokens: 12 - rank: 1 max len: 81 min len: 69 avg len: 75.0 num_loss_counted_tokens: 71
total tokens: 158 num samples: 2 num padding tokens: 12 - rank: 1 max len: 79 min len: 67 avg len: 73.0 num_loss_counted_tokens: 66
total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 6 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 76
total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 6 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 63
total tokens: 98 num samples: 2 num padding tokens: 4 - rank: 6 max len: 49 min len: 45 avg len: 47.0 num_loss_counted_tokens: 48
total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 6 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 76
total tokens: 124 num samples: 2 num padding tokens: 1 - rank: 3 max len: 62 min len: 61 avg len: 61.5 num_loss_counted_tokens: 57
total tokens: 228 num samples: 2 num padding tokens: 46 - rank: 6 max len: 114 min len: 68 avg len: 91.0 num_loss_counted_tokens: 118
total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 6 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 68
total tokens: 216 num samples: 2 num padding tokens: 60 - rank: 6 max len: 108 min len: 48 avg len: 78.0 num_loss_counted_tokens: 102
total tokens: 126 num samples: 2 num padding tokens: 0 - rank: 6 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 63
total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 6 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 57
total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 6 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 91
total tokens: 124 num samples: 2 num padding tokens: 11 - rank: 6 max len: 62 min len: 51 avg len: 56.5 num_loss_counted_tokens: 57
total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 1 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 66
total tokens: 116 num samples: 2 num padding tokens: 8 - rank: 6 max len: 58 min len: 50 avg len: 54.0 num_loss_counted_tokens: 59
Per-token loss scaled by world size: 0.0013802563771605492
Per-token loss scaled by world size: 0.0019055134616792202
Per-token loss scaled by world size: 0.003680554451420903
Per-token loss scaled by world size: 0.001587073435075581
Per-token loss scaled by world size: 0.0016849382082000375
Per-token loss scaled by world size: 0.00304134888574481
Per-token loss scaled by world size: 0.00129329867195338
Epoch: 1, Step: 13, Rank: 1, loss = 0.1150788739323616
Epoch: 1, Step: 13, Rank: 3, loss = 0.15887218713760376
Epoch: 1, Step: 13, Rank: 7, loss = 0.30686622858047485
Epoch: 1, Step: 13, Rank: 4, loss = 0.1323222517967224
Epoch: 1, Step: 13, Rank: 0, loss = 0.2535724639892578
Epoch: 1, Step: 13, Rank: 2, loss = 0.14048172533512115
Epoch: 1, Step: 13, Rank: 5, loss = 0.1078287735581398
Per-token loss scaled by world size: 0.000824308895971626
Epoch: 1, Step: 13, Rank: 6, loss = 0.06872675567865372
[2024-07-27 20:04:28,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[1.04e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:28,578] [INFO] [timer.py:258:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=31.312819818036353, CurrSamplesPerSec=28.289847056874475, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 1,
"step": 13,
"rank": 0,
"loss": 0.2535724639892578,
"overall_throughput": 28.191588421887058,
"lr": 1.04e-05,
"cuda_mem_allocated": 22.006441116333008,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 667,
"batch_size": 16,
"total_loss": 0.16046865284442902,
"gradnorm": 3.25748348236084,
"weight_norm": 393.4556884765625,
"timestamp": "2024-07-27T20:04:28.620919"
}
Per-token loss scaled by world size: 0.003146873554214835
Per-token loss scaled by world size: 0.0015134336426854134
Per-token loss scaled by world size: 0.0021054281387478113
Per-token loss scaled by world size: 0.005117365624755621
Per-token loss scaled by world size: 0.0010033146245405078
Per-token loss scaled by world size: 0.0036201237235218287
Per-token loss scaled by world size: 0.003257090924307704
Epoch: 1, Step: 14, Rank: 0, loss = 0.2533233165740967
Epoch: 1, Step: 14, Rank: 5, loss = 0.41194793581962585
Epoch: 1, Step: 14, Rank: 2, loss = 0.12183140963315964
Epoch: 1, Step: 14, Rank: 6, loss = 0.16948696970939636
Epoch: 1, Step: 14, Rank: 1, loss = 0.29141995310783386
Epoch: 1, Step: 14, Rank: 3, loss = 0.2621958255767822
Epoch: 1, Step: 14, Rank: 7, loss = 0.08076682686805725
Per-token loss scaled by world size: 0.0011276104487478733
Epoch: 1, Step: 14, Rank: 4, loss = 0.09077264368534088
[2024-07-27 20:04:29,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[1.1200000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:29,123] [INFO] [timer.py:258:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=31.374398887750708, CurrSamplesPerSec=32.06810729498475, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 1,
"step": 14,
"rank": 0,
"loss": 0.2533233165740967,
"overall_throughput": 32.01091375509972,
"lr": 1.1200000000000001e-05,
"cuda_mem_allocated": 22.00023889541626,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 644,
"batch_size": 16,
"total_loss": 0.21021811664104462,
"gradnorm": 6.8222336769104,
"weight_norm": 393.4558410644531,
"timestamp": "2024-07-27T20:04:29.126718"
}
Per-token loss scaled by world size: 0.00125453295186162
Per-token loss scaled by world size: 0.002432552631944418
Per-token loss scaled by world size: 0.0022791901137679815
Per-token loss scaled by world size: 0.0012238719500601292
Per-token loss scaled by world size: 0.0040193116292357445
Per-token loss scaled by world size: 0.002601771615445614
Per-token loss scaled by world size: 0.0017355632735416293
Epoch: 1, Step: 15, Rank: 0, loss = 0.08985592424869537
Epoch: 1, Step: 15, Rank: 1, loss = 0.17423158884048462
Epoch: 1, Step: 15, Rank: 2, loss = 0.1632469892501831
Epoch: 1, Step: 15, Rank: 6, loss = 0.08765982836484909
Epoch: 1, Step: 15, Rank: 4, loss = 0.2878831923007965
Epoch: 1, Step: 15, Rank: 3, loss = 0.18635189533233643
Epoch: 1, Step: 15, Rank: 5, loss = 0.1243097186088562
Per-token loss scaled by world size: 0.0024993098340928555
Epoch: 1, Step: 15, Rank: 7, loss = 0.17901305854320526
[2024-07-27 20:04:29,597] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:29,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=31.399353099680013, CurrSamplesPerSec=31.70192973588364, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 1,
"step": 15,
"rank": 0,
"loss": 0.08985592424869537,
"overall_throughput": 31.64709333197519,
"lr": 1.2e-05,
"cuda_mem_allocated": 21.999523639678955,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 573,
"batch_size": 16,
"total_loss": 0.1615689992904663,
"gradnorm": 3.001760244369507,
"weight_norm": 393.45599365234375,
"timestamp": "2024-07-27T20:04:29.719105"
}
Per-token loss scaled by world size: 0.0014579611597582698
Per-token loss scaled by world size: 0.003502971027046442
Per-token loss scaled by world size: 0.0018768769223242998
Per-token loss scaled by world size: 0.0020246703643351793
Per-token loss scaled by world size: 0.001514959498308599
Per-token loss scaled by world size: 0.006437234580516815
Epoch: 1, Step: 16, Rank: 0, loss = 0.10005258768796921
Epoch: 1, Step: 16, Rank: 4, loss = 0.1288006752729416
Epoch: 1, Step: 16, Rank: 2, loss = 0.24039138853549957
Epoch: 1, Step: 16, Rank: 7, loss = 0.4417552351951599
Epoch: 1, Step: 16, Rank: 1, loss = 0.13894300162792206
Epoch: 1, Step: 16, Rank: 3, loss = 0.10396409779787064
Per-token loss scaled by world size: 0.002007455099374056
Per-token loss scaled by world size: 0.0018041662406176329
Epoch: 1, Step: 16, Rank: 5, loss = 0.13776160776615143
Epoch: 1, Step: 16, Rank: 6, loss = 0.12381090968847275
[2024-07-27 20:04:30,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[1.2800000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:30,218] [INFO] [timer.py:258:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=31.464876520188444, CurrSamplesPerSec=32.342260256708684, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 33%|███▎ | 4/12 [00:02<00:04, 1.66it/s]
{
"epoch": 1,
"step": 16,
"rank": 0,
"loss": 0.10005258768796921,
"overall_throughput": 32.288201736596754,
"lr": 1.2800000000000001e-05,
"cuda_mem_allocated": 22.003100872039795,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 549,
"batch_size": 16,
"total_loss": 0.17693495750427246,
"gradnorm": 2.5745155811309814,
"weight_norm": 393.4561462402344,
"timestamp": "2024-07-27T20:04:30.261395"
}
Per-token loss scaled by world size: 0.004123833030462265
Per-token loss scaled by world size: 0.002096434822306037
Per-token loss scaled by world size: 0.002511914586648345
Per-token loss scaled by world size: 0.004808654077351093
Per-token loss scaled by world size: 0.0011069930624216795
Per-token loss scaled by world size: 0.002304441062733531
Epoch: 1, Step: 17, Rank: 1, loss = 0.20597699284553528
Epoch: 1, Step: 17, Rank: 7, loss = 0.17190766334533691
Epoch: 1, Step: 17, Rank: 3, loss = 0.3943096399307251
Epoch: 1, Step: 17, Rank: 6, loss = 0.33815431594848633
Epoch: 1, Step: 17, Rank: 2, loss = 0.09077343344688416
Epoch: 1, Step: 17, Rank: 4, loss = 0.18896417319774628
Per-token loss scaled by world size: 0.0022304877638816833
Per-token loss scaled by world size: 0.0029599058907479048
Epoch: 1, Step: 17, Rank: 0, loss = 0.24271227419376373
Epoch: 1, Step: 17, Rank: 5, loss = 0.18289999663829803
[2024-07-27 20:04:30,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[1.3600000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:30,758] [INFO] [timer.py:258:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=31.51917479452854, CurrSamplesPerSec=32.29951508996705, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 42%|████▏ | 5/12 [00:03<00:04, 1.73it/s]
{
"epoch": 1,
"step": 17,
"rank": 0,
"loss": 0.24271227419376373,
"overall_throughput": 32.21476489448058,
"lr": 1.3600000000000002e-05,
"cuda_mem_allocated": 21.997375965118408,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 656,
"batch_size": 16,
"total_loss": 0.22696229815483093,
"gradnorm": 4.573685169219971,
"weight_norm": 393.4563293457031,
"timestamp": "2024-07-27T20:04:30.804653"
}
Per-token loss scaled by world size: 0.0030115304980427027
Per-token loss scaled by world size: 0.006417228374630213
Per-token loss scaled by world size: 0.007109665311872959
Per-token loss scaled by world size: 0.000538784428499639
Per-token loss scaled by world size: 0.002789800288155675
Per-token loss scaled by world size: 0.003157705068588257
Per-token loss scaled by world size: 0.0017401942750439048
Epoch: 1, Step: 18, Rank: 0, loss = 0.23715803027153015
Epoch: 1, Step: 18, Rank: 7, loss = 0.5053567290306091
Epoch: 1, Step: 18, Rank: 6, loss = 0.5598861575126648
Epoch: 1, Step: 18, Rank: 2, loss = 0.2196967750787735
Epoch: 1, Step: 18, Rank: 1, loss = 0.042429275810718536
Epoch: 1, Step: 18, Rank: 5, loss = 0.24866926670074463
Epoch: 1, Step: 18, Rank: 3, loss = 0.13704030215740204
Per-token loss scaled by world size: 0.0006516931462101638
Epoch: 1, Step: 18, Rank: 4, loss = 0.05132083594799042
[2024-07-27 20:04:31,227] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[1.4400000000000001e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:31,304] [INFO] [timer.py:258:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=31.586850556537332, CurrSamplesPerSec=32.638021628709105, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 50%|█████ | 6/12 [00:03<00:03, 1.76it/s]
{
"epoch": 1,
"step": 18,
"rank": 0,
"loss": 0.23715803027153015,
"overall_throughput": 32.58399904834273,
"lr": 1.4400000000000001e-05,
"cuda_mem_allocated": 21.999762058258057,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 630,
"batch_size": 16,
"total_loss": 0.2501946687698364,
"gradnorm": 4.389204025268555,
"weight_norm": 393.45654296875,
"timestamp": "2024-07-27T20:04:31.348530"
}
Per-token loss scaled by world size: 0.003954596351832151
Per-token loss scaled by world size: 0.0016637382796034217
Per-token loss scaled by world size: 0.003214797005057335
Per-token loss scaled by world size: 0.006215415894985199
Per-token loss scaled by world size: 0.0025190163869410753
Per-token loss scaled by world size: 0.0015009477501735091
Per-token loss scaled by world size: 0.0017105289734899998
Epoch: 1, Step: 19, Rank: 0, loss = 0.34849879145622253
Epoch: 1, Step: 19, Rank: 7, loss = 0.22198832035064697
Epoch: 1, Step: 19, Rank: 1, loss = 0.28330397605895996
Epoch: 1, Step: 19, Rank: 2, loss = 0.14661693572998047
Epoch: 1, Step: 19, Rank: 5, loss = 0.5477335453033447
Epoch: 1, Step: 19, Rank: 3, loss = 0.1507403701543808
Epoch: 1, Step: 19, Rank: 6, loss = 0.13227102160453796
Per-token loss scaled by world size: 0.0012170596746727824
Epoch: 1, Step: 19, Rank: 4, loss = 0.10725338757038116
[2024-07-27 20:04:31,783] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[1.5200000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:31,860] [INFO] [timer.py:258:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=31.571700427209116, CurrSamplesPerSec=31.331259798479305, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 58%|█████▊ | 7/12 [00:04<00:02, 1.77it/s]
{
"epoch": 1,
"step": 19,
"rank": 0,
"loss": 0.34849879145622253,
"overall_throughput": 31.250591517446562,
"lr": 1.5200000000000002e-05,
"cuda_mem_allocated": 22.002862453460693,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 705,
"batch_size": 16,
"total_loss": 0.24230077862739563,
"gradnorm": 3.7223894596099854,
"weight_norm": 393.456787109375,
"timestamp": "2024-07-27T20:04:31.903580"
}
Per-token loss scaled by world size: 0.003282753750681877
Per-token loss scaled by world size: 0.006474556401371956
Per-token loss scaled by world size: 0.001697456929832697
Per-token loss scaled by world size: 0.0012144312495365739
Per-token loss scaled by world size: 0.0008425723062828183
Per-token loss scaled by world size: 0.0018245026003569365
Per-token loss scaled by world size: 0.0037540853954851627
Epoch: 1, Step: 20, Rank: 6, loss = 0.12476308643817902
Epoch: 1, Step: 20, Rank: 2, loss = 0.4758799076080322
Epoch: 1, Step: 20, Rank: 5, loss = 0.061929065734148026
Epoch: 1, Step: 20, Rank: 4, loss = 0.2412824034690857
Epoch: 1, Step: 20, Rank: 0, loss = 0.08926069736480713
Epoch: 1, Step: 20, Rank: 7, loss = 0.27592527866363525
Epoch: 1, Step: 20, Rank: 1, loss = 0.13410094380378723
Per-token loss scaled by world size: 0.0009085916099138558
Epoch: 1, Step: 20, Rank: 3, loss = 0.06678148359060287
[2024-07-27 20:04:32,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:32,399] [INFO] [timer.py:258:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=31.62285860527634, CurrSamplesPerSec=32.51863226575504, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 1: 67%|██████▋ | 8/12 [00:04<00:02, 1.80it/s]
{
"epoch": 1,
"step": 20,
"rank": 0,
"loss": 0.08926069736480713,
"overall_throughput": 32.4628994682786,
"lr": 1.6000000000000003e-05,
"cuda_mem_allocated": 22.00811004638672,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 588,
"batch_size": 16,
"total_loss": 0.18374036252498627,
"gradnorm": 4.014389514923096,
"weight_norm": 393.4570617675781,
"timestamp": "2024-07-27T20:04:32.401998"
}
Per-token loss scaled by world size: 0.0046243746764957905
Per-token loss scaled by world size: 0.0019434256246313453
Per-token loss scaled by world size: 0.0029365788213908672
Per-token loss scaled by world size: 0.0035864808596670628
Per-token loss scaled by world size: 0.003000351833179593
Per-token loss scaled by world size: 0.002845433074980974
Per-token loss scaled by world size: 0.0026900055818259716
Epoch: 1, Step: 21, Rank: 0, loss = 0.33989155292510986
Epoch: 1, Step: 21, Rank: 3, loss = 0.14284178614616394
Epoch: 1, Step: 21, Rank: 2, loss = 0.220525860786438
Epoch: 1, Step: 21, Rank: 5, loss = 0.26360633969306946
Epoch: 1, Step: 21, Rank: 6, loss = 0.21583855152130127
Epoch: 1, Step: 21, Rank: 7, loss = 0.1977154165506363
Epoch: 1, Step: 21, Rank: 4, loss = 0.20913933217525482
Per-token loss scaled by world size: 0.003519931575283408
Epoch: 1, Step: 21, Rank: 1, loss = 0.2587149739265442
[2024-07-27 20:04:32,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.6800000000000002e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:32,944] [INFO] [timer.py:258:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=31.62433633690537, CurrSamplesPerSec=31.650959142641135, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 75%|███████▌ | 9/12 [00:05<00:01, 1.81it/s]
{
"epoch": 1,
"step": 21,
"rank": 0,
"loss": 0.33989155292510986,
"overall_throughput": 31.58514555251878,
"lr": 1.6800000000000002e-05,
"cuda_mem_allocated": 21.998091220855713,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 588,
"batch_size": 16,
"total_loss": 0.23103423416614532,
"gradnorm": 4.073541164398193,
"weight_norm": 393.4573669433594,
"timestamp": "2024-07-27T20:04:32.946992"
}
Per-token loss scaled by world size: 0.0026941639371216297
Per-token loss scaled by world size: 0.007396150380373001
Per-token loss scaled by world size: 0.0018774037016555667
Per-token loss scaled by world size: 0.0010539990616962314
Per-token loss scaled by world size: 0.0033142913598567247
Per-token loss scaled by world size: 0.0031690315809100866
Per-token loss scaled by world size: 0.00544370012357831
Epoch: 1, Step: 22, Rank: 2, loss = 0.527900218963623
Epoch: 1, Step: 22, Rank: 6, loss = 0.13399969041347504
Epoch: 1, Step: 22, Rank: 5, loss = 0.07522918283939362
Epoch: 1, Step: 22, Rank: 1, loss = 0.19229595363140106
Epoch: 1, Step: 22, Rank: 4, loss = 0.23655754327774048
Epoch: 1, Step: 22, Rank: 7, loss = 0.22618962824344635
Epoch: 1, Step: 22, Rank: 3, loss = 0.38854408264160156
Per-token loss scaled by world size: 0.0007584551349282265
Epoch: 1, Step: 22, Rank: 0, loss = 0.054134733974933624
[2024-07-27 20:04:33,400] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.76e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:33,478] [INFO] [timer.py:258:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=31.68561296315261, CurrSamplesPerSec=32.896711596691546, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 352
{
"epoch": 1,
"step": 22,
"rank": 0,
"loss": 0.054134733974933624,
"overall_throughput": 32.789902630272564,
"lr": 1.76e-05,
"cuda_mem_allocated": 21.997375965118408,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 571,
"batch_size": 16,
"total_loss": 0.22935637831687927,
"gradnorm": 3.735788106918335,
"weight_norm": 393.4576721191406,
"timestamp": "2024-07-27T20:04:33.482264"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_352
[20:04:51] INFO saving took 17.896338939666748 seconds utils.py:611
Per-token loss scaled by world size: 0.001945212366990745
Per-token loss scaled by world size: 0.0022036610171198845
Per-token loss scaled by world size: 0.0031597877386957407
Per-token loss scaled by world size: 0.0022363392636179924
Per-token loss scaled by world size: 0.017282620072364807
Per-token loss scaled by world size: 0.0026094394270330667
Per-token loss scaled by world size: 0.0019479849142953753
Epoch: 1, Step: 23, Rank: 1, loss = 0.1749155968427658
Epoch: 1, Step: 23, Rank: 5, loss = 0.25080814957618713
Epoch: 1, Step: 23, Rank: 4, loss = 0.17750942707061768
Epoch: 1, Step: 23, Rank: 3, loss = 0.207124263048172
Epoch: 1, Step: 23, Rank: 7, loss = 1.3718079328536987
Epoch: 1, Step: 23, Rank: 0, loss = 0.15440122783184052
Epoch: 1, Step: 23, Rank: 2, loss = 0.15462130308151245
Per-token loss scaled by world size: 0.004577424377202988
Epoch: 1, Step: 23, Rank: 6, loss = 0.3633330762386322
[2024-07-27 20:04:51,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.8400000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:51,943] [INFO] [timer.py:258:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=31.661565651394987, CurrSamplesPerSec=31.188169951680987, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 1: 92%|█████████▏| 11/12 [00:24<00:04, 4.39s/it]
{
"epoch": 1,
"step": 23,
"rank": 0,
"loss": 0.15440122783184052,
"overall_throughput": 31.12553529081371,
"lr": 1.8400000000000003e-05,
"cuda_mem_allocated": 22.000954627990723,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 635,
"batch_size": 16,
"total_loss": 0.3568150997161865,
"gradnorm": 3.5464768409729004,
"weight_norm": 393.4579772949219,
"timestamp": "2024-07-27T20:04:51.986553"
}
Per-token loss scaled by world size: 0.002529617166146636
Per-token loss scaled by world size: 0.004327333997935057
Per-token loss scaled by world size: 0.002556184073910117
Per-token loss scaled by world size: 0.005085375625640154
Per-token loss scaled by world size: 0.008069510571658611
Per-token loss scaled by world size: 0.002654892858117819
Per-token loss scaled by world size: 0.001250272849574685
Epoch: 1, Step: 24, Rank: 3, loss = 0.3591546416282654
Epoch: 1, Step: 24, Rank: 6, loss = 0.17865420877933502
Epoch: 1, Step: 24, Rank: 5, loss = 0.18750180304050446
Epoch: 1, Step: 24, Rank: 1, loss = 0.30561795830726624
Epoch: 1, Step: 24, Rank: 4, loss = 0.5699091553688049
Epoch: 1, Step: 24, Rank: 0, loss = 0.18053050339221954
Epoch: 1, Step: 24, Rank: 2, loss = 0.08830051869153976
Per-token loss scaled by world size: 0.0018133769044652581
Epoch: 1, Step: 24, Rank: 7, loss = 0.12806974351406097
[2024-07-27 20:04:52,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.9200000000000003e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:52,496] [INFO] [timer.py:258:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=31.65142326624155, CurrSamplesPerSec=31.439924179355366, MemAllocated=21.99GB, MaxMemAllocated=28.3GB
Epoch 1: 100%|██████████| 12/12 [00:24<00:00, 3.22s/it]
{
"epoch": 1,
"step": 24,
"rank": 0,
"loss": 0.18053050339221954,
"overall_throughput": 31.36277025201868,
"lr": 1.9200000000000003e-05,
"cuda_mem_allocated": 21.994752407073975,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 565,
"batch_size": 16,
"total_loss": 0.2497173249721527,
"gradnorm": 2.950968027114868,
"weight_norm": 393.4583435058594,
"timestamp": "2024-07-27T20:04:52.547761"
}
Epoch 1: 100%|██████████| 12/12 [00:24<00:00, 2.08s/it]
total tokens: 154 num samples: 2 num padding tokens: 26 - rank: 1 max len: 77 min len: 51 avg len: 64.0 num_loss_counted_tokens: 64
total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 1 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
total tokens: 214 num samples: 2 num padding tokens: 37 - rank: 1 max len: 107 min len: 70 avg len: 88.5 num_loss_counted_tokens: 106
total tokens: 166 num samples: 2 num padding tokens: 22 - rank: 1 max len: 83 min len: 61 avg len: 72.0 num_loss_counted_tokens: 86
total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 1 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 76
total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 1 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 58
total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 1 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 75
total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 1 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 64
total tokens: 188 num samples: 2 num padding tokens: 27 - rank: 1 max len: 94 min len: 67 avg len: 80.5 num_loss_counted_tokens: 82
total tokens: 154 num samples: 2 num padding tokens: 17 - rank: 1 max len: 77 min len: 60 avg len: 68.5 num_loss_counted_tokens: 88
total tokens: 120 num samples: 2 num padding tokens: 16 - rank: 5 max len: 60 min len: 44 avg len: 52.0 num_loss_counted_tokens: 57
total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 1 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 62
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 5 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 59
total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 5 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 54
total tokens: 194 num samples: 2 num padding tokens: 36 - rank: 5 max len: 97 min len: 61 avg len: 79.0 num_loss_counted_tokens: 99
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
total tokens: 120 num samples: 2 num padding tokens: 3 - rank: 5 max len: 60 min len: 57 avg len: 58.5 num_loss_counted_tokens: 80
total tokens: 146 num samples: 2 num padding tokens: 18 - rank: 0 max len: 73 min len: 55 avg len: 64.0 num_loss_counted_tokens: 82
total tokens: 130 num samples: 2 num padding tokens: 8 - rank: 0 max len: 65 min len: 57 avg len: 61.0 num_loss_counted_tokens: 60
total tokens: 174 num samples: 2 num padding tokens: 29 - rank: 0 max len: 87 min len: 58 avg len: 72.5 num_loss_counted_tokens: 80
total tokens: 172 num samples: 2 num padding tokens: 10 - rank: 5 max len: 86 min len: 76 avg len: 81.0 num_loss_counted_tokens: 84
total tokens: 228 num samples: 2 num padding tokens: 44 - rank: 5 max len: 114 min len: 70 avg len: 92.0 num_loss_counted_tokens: 115
total tokens: 200 num samples: 2 num padding tokens: 32 - rank: 0 max len: 100 min len: 68 avg len: 84.0 num_loss_counted_tokens: 95
total tokens: 136 num samples: 2 num padding tokens: 4 - rank: 5 max len: 68 min len: 64 avg len: 66.0 num_loss_counted_tokens: 65
total tokens: 168 num samples: 2 num padding tokens: 32 - rank: 5 max len: 84 min len: 52 avg len: 68.0 num_loss_counted_tokens: 80
total tokens: 226 num samples: 2 num padding tokens: 64 - rank: 5 max len: 113 min len: 49 avg len: 81.0 num_loss_counted_tokens: 93
total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 0 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 60
total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 0 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 52
total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 0 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 78
total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 1 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
total tokens: 226 num samples: 2 num padding tokens: 55 - rank: 4 max len: 113 min len: 58 avg len: 85.5 num_loss_counted_tokens: 102
total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 0 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 74
total tokens: 160 num samples: 2 num padding tokens: 31 - rank: 0 max len: 80 min len: 49 avg len: 64.5 num_loss_counted_tokens: 78
total tokens: 174 num samples: 2 num padding tokens: 13 - rank: 0 max len: 87 min len: 74 avg len: 80.5 num_loss_counted_tokens: 90
total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 2 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 62
total tokens: 188 num samples: 2 num padding tokens: 28 - rank: 4 max len: 94 min len: 66 avg len: 80.0 num_loss_counted_tokens: 99
total tokens: 180 num samples: 2 num padding tokens: 28 - rank: 2 max len: 90 min len: 62 avg len: 76.0 num_loss_counted_tokens: 93
total tokens: 120 num samples: 2 num padding tokens: 12 - rank: 5 max len: 60 min len: 48 avg len: 54.0 num_loss_counted_tokens: 58
total tokens: 110 num samples: 2 num padding tokens: 7 - rank: 2 max len: 55 min len: 48 avg len: 51.5 num_loss_counted_tokens: 53
total tokens: 142 num samples: 2 num padding tokens: 27 - rank: 6 max len: 71 min len: 44 avg len: 57.5 num_loss_counted_tokens: 66
total tokens: 208 num samples: 2 num padding tokens: 39 - rank: 2 max len: 104 min len: 65 avg len: 84.5 num_loss_counted_tokens: 111
total tokens: 166 num samples: 2 num padding tokens: 1 - rank: 5 max len: 83 min len: 82 avg len: 82.5 num_loss_counted_tokens: 96
total tokens: 168 num samples: 2 num padding tokens: 20 - rank: 6 max len: 84 min len: 64 avg len: 74.0 num_loss_counted_tokens: 100
total tokens: 166 num samples: 2 num padding tokens: 33 - rank: 6 max len: 83 min len: 50 avg len: 66.5 num_loss_counted_tokens: 86
total tokens: 134 num samples: 2 num padding tokens: 12 - rank: 2 max len: 67 min len: 55 avg len: 61.0 num_loss_counted_tokens: 67
total tokens: 160 num samples: 2 num padding tokens: 16 - rank: 6 max len: 80 min len: 64 avg len: 72.0 num_loss_counted_tokens: 72
total tokens: 180 num samples: 2 num padding tokens: 36 - rank: 6 max len: 90 min len: 54 avg len: 72.0 num_loss_counted_tokens: 95
total tokens: 184 num samples: 2 num padding tokens: 26 - rank: 6 max len: 92 min len: 66 avg len: 79.0 num_loss_counted_tokens: 96
total tokens: 150 num samples: 2 num padding tokens: 30 - rank: 2 max len: 75 min len: 45 avg len: 60.0 num_loss_counted_tokens: 65
total tokens: 154 num samples: 2 num padding tokens: 23 - rank: 2 max len: 77 min len: 54 avg len: 65.5 num_loss_counted_tokens: 74
total tokens: 152 num samples: 2 num padding tokens: 25 - rank: 2 max len: 76 min len: 51 avg len: 63.5 num_loss_counted_tokens: 70
total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 2 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 59
total tokens: 126 num samples: 2 num padding tokens: 9 - rank: 6 max len: 63 min len: 54 avg len: 58.5 num_loss_counted_tokens: 65
total tokens: 180 num samples: 2 num padding tokens: 19 - rank: 6 max len: 90 min len: 71 avg len: 80.5 num_loss_counted_tokens: 131
total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 66
total tokens: 196 num samples: 2 num padding tokens: 29 - rank: 6 max len: 98 min len: 69 avg len: 83.5 num_loss_counted_tokens: 118
total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 6 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 59
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 2 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 61
total tokens: 126 num samples: 2 num padding tokens: 10 - rank: 7 max len: 63 min len: 53 avg len: 58.0 num_loss_counted_tokens: 56
total tokens: 122 num samples: 2 num padding tokens: 16 - rank: 7 max len: 61 min len: 45 avg len: 53.0 num_loss_counted_tokens: 53
total tokens: 214 num samples: 2 num padding tokens: 55 - rank: 4 max len: 107 min len: 52 avg len: 79.5 num_loss_counted_tokens: 104
total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 7 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 72
total tokens: 148 num samples: 2 num padding tokens: 10 - rank: 4 max len: 74 min len: 64 avg len: 69.0 num_loss_counted_tokens: 73
total tokens: 244 num samples: 2 num padding tokens: 29 - rank: 4 max len: 122 min len: 93 avg len: 107.5 num_loss_counted_tokens: 144
total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 7 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 62
total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 53
total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 3 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 71
total tokens: 158 num samples: 2 num padding tokens: 7 - rank: 7 max len: 79 min len: 72 avg len: 75.5 num_loss_counted_tokens: 91
total tokens: 116 num samples: 2 num padding tokens: 15 - rank: 0 max len: 58 min len: 43 avg len: 50.5 num_loss_counted_tokens: 49
total tokens: 282 num samples: 2 num padding tokens: 60 - rank: 3 max len: 141 min len: 81 avg len: 111.0 num_loss_counted_tokens: 169
total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 3 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 62
total tokens: 134 num samples: 2 num padding tokens: 8 - rank: 4 max len: 67 min len: 59 avg len: 63.0 num_loss_counted_tokens: 56
total tokens: 216 num samples: 2 num padding tokens: 27 - rank: 4 max len: 108 min len: 81 avg len: 94.5 num_loss_counted_tokens: 123
total tokens: 126 num samples: 2 num padding tokens: 14 - rank: 7 max len: 63 min len: 49 avg len: 56.0 num_loss_counted_tokens: 58
total tokens: 156 num samples: 2 num padding tokens: 28 - rank: 3 max len: 78 min len: 50 avg len: 64.0 num_loss_counted_tokens: 70
total tokens: 162 num samples: 2 num padding tokens: 31 - rank: 4 max len: 81 min len: 50 avg len: 65.5 num_loss_counted_tokens: 75
total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 4 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 68
total tokens: 164 num samples: 2 num padding tokens: 20 - rank: 7 max len: 82 min len: 62 avg len: 72.0 num_loss_counted_tokens: 79
total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 2 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 55
total tokens: 128 num samples: 2 num padding tokens: 19 - rank: 3 max len: 64 min len: 45 avg len: 54.5 num_loss_counted_tokens: 63
total tokens: 110 num samples: 2 num padding tokens: 7 - rank: 4 max len: 55 min len: 48 avg len: 51.5 num_loss_counted_tokens: 45
total tokens: 176 num samples: 2 num padding tokens: 18 - rank: 6 max len: 88 min len: 70 avg len: 79.0 num_loss_counted_tokens: 90
total tokens: 146 num samples: 2 num padding tokens: 21 - rank: 6 max len: 73 min len: 52 avg len: 62.5 num_loss_counted_tokens: 71
total tokens: 118 num samples: 2 num padding tokens: 6 - rank: 7 max len: 59 min len: 53 avg len: 56.0 num_loss_counted_tokens: 57
total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 7 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 57
total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 4 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 54
total tokens: 142 num samples: 2 num padding tokens: 26 - rank: 7 max len: 71 min len: 45 avg len: 58.0 num_loss_counted_tokens: 67
total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 4 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 79
total tokens: 122 num samples: 2 num padding tokens: 9 - rank: 3 max len: 61 min len: 52 avg len: 56.5 num_loss_counted_tokens: 59
total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 3 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 66
total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 3 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 76
total tokens: 122 num samples: 2 num padding tokens: 4 - rank: 3 max len: 61 min len: 57 avg len: 59.0 num_loss_counted_tokens: 64
total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 3 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 93
total tokens: 186 num samples: 2 num padding tokens: 42 - rank: 7 max len: 93 min len: 51 avg len: 72.0 num_loss_counted_tokens: 94
total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 7 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 131
total tokens: 202 num samples: 2 num padding tokens: 40 - rank: 3 max len: 101 min len: 61 avg len: 81.0 num_loss_counted_tokens: 104
Per-token loss scaled by world size: 0.000771758146584034
Per-token loss scaled by world size: 0.0032502268441021442
Per-token loss scaled by world size: 0.001562815043143928
Per-token loss scaled by world size: 0.004182006698101759
Per-token loss scaled by world size: 0.0015922324964776635
Per-token loss scaled by world size: 0.0030361246317625046
Per-token loss scaled by world size: 0.0017774869920685887
Epoch: 2, Step: 25, Rank: 3, loss = 0.05045368894934654
Epoch: 2, Step: 25, Rank: 1, loss = 0.2733986973762512
Epoch: 2, Step: 25, Rank: 4, loss = 0.21248358488082886
Epoch: 2, Step: 25, Rank: 2, loss = 0.10216903686523438
Epoch: 2, Step: 25, Rank: 0, loss = 0.19848664104938507
Epoch: 2, Step: 25, Rank: 7, loss = 0.10409220308065414
Epoch: 2, Step: 25, Rank: 5, loss = 0.11620321124792099
Per-token loss scaled by world size: 0.0023637553676962852
Epoch: 2, Step: 25, Rank: 6, loss = 0.15453051030635834
[2024-07-27 20:04:53,438] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[2e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:53,514] [INFO] [timer.py:258:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=31.498630726223958, CurrSamplesPerSec=28.474580988875164, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 8%|▊ | 1/12 [00:00<00:10, 1.09it/s]
{
"epoch": 2,
"step": 25,
"rank": 0,
"loss": 0.19848664104938507,
"overall_throughput": 28.3802930607719,
"lr": 2e-05,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 523,
"batch_size": 16,
"total_loss": 0.15147720277309418,
"gradnorm": 2.086557149887085,
"weight_norm": 393.458740234375,
"timestamp": "2024-07-27T20:04:53.559486"
}
Per-token loss scaled by world size: 0.004568464122712612
Per-token loss scaled by world size: 0.000755587185267359
Per-token loss scaled by world size: 0.0012551499530673027
Per-token loss scaled by world size: 0.0036310378927737474
Per-token loss scaled by world size: 0.0022255314979702234
Per-token loss scaled by world size: 0.003098478075116873
Per-token loss scaled by world size: 0.003910013008862734
Epoch: 2, Step: 26, Rank: 1, loss = 0.057235728949308395
Epoch: 2, Step: 26, Rank: 3, loss = 0.09507761150598526
Epoch: 2, Step: 26, Rank: 0, loss = 0.3460611402988434
Epoch: 2, Step: 26, Rank: 6, loss = 0.2750511169433594
Epoch: 2, Step: 26, Rank: 2, loss = 0.2347097098827362
Epoch: 2, Step: 26, Rank: 5, loss = 0.16858400404453278
Epoch: 2, Step: 26, Rank: 4, loss = 0.2961834967136383
Per-token loss scaled by world size: 0.000992890098132193
Epoch: 2, Step: 26, Rank: 7, loss = 0.07521142810583115
[2024-07-27 20:04:53,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.999453257340926e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:54,074] [INFO] [timer.py:258:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=31.501866831712768, CurrSamplesPerSec=31.576481216592637, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 17%|█▋ | 2/12 [00:01<00:07, 1.42it/s]
{
"epoch": 2,
"step": 26,
"rank": 0,
"loss": 0.3460611402988434,
"overall_throughput": 31.521633384656862,
"lr": 1.999453257340926e-05,
"cuda_mem_allocated": 22.0040545463562,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 606,
"batch_size": 16,
"total_loss": 0.19351428747177124,
"gradnorm": 2.7967944145202637,
"weight_norm": 393.45916748046875,
"timestamp": "2024-07-27T20:04:54.115593"
}
Per-token loss scaled by world size: 0.001778947887942195
Per-token loss scaled by world size: 0.0023961372207850218
Per-token loss scaled by world size: 0.0019206402357667685
Per-token loss scaled by world size: 0.0016144964611157775
Per-token loss scaled by world size: 0.0014130653580650687
Per-token loss scaled by world size: 0.0023006140254437923
Per-token loss scaled by world size: 0.0029887459240853786
Epoch: 2, Step: 27, Rank: 3, loss = 0.132994145154953
Epoch: 2, Step: 27, Rank: 2, loss = 0.15821273624897003
Epoch: 2, Step: 27, Rank: 5, loss = 0.19738179445266724
Epoch: 2, Step: 27, Rank: 6, loss = 0.11640125513076782
Epoch: 2, Step: 27, Rank: 1, loss = 0.1465408354997635
Epoch: 2, Step: 27, Rank: 4, loss = 0.24619793891906738
Epoch: 2, Step: 27, Rank: 0, loss = 0.18951308727264404
Per-token loss scaled by world size: 0.0023933870252221823
Epoch: 2, Step: 27, Rank: 7, loss = 0.19715525209903717
[2024-07-27 20:04:54,557] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.9978136272187745e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:54,634] [INFO] [timer.py:258:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=31.47828132560415, CurrSamplesPerSec=30.922637265012085, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 25%|██▌ | 3/12 [00:02<00:05, 1.56it/s]
{
"epoch": 2,
"step": 27,
"rank": 0,
"loss": 0.18951308727264404,
"overall_throughput": 30.84563042714866,
"lr": 1.9978136272187745e-05,
"cuda_mem_allocated": 22.00071620941162,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 659,
"batch_size": 16,
"total_loss": 0.17304962873458862,
"gradnorm": 2.572604179382324,
"weight_norm": 393.4596252441406,
"timestamp": "2024-07-27T20:04:54.678862"
}
Per-token loss scaled by world size: 0.0017004203982651234
Per-token loss scaled by world size: 0.003374457824975252
Per-token loss scaled by world size: 0.00338700320571661
Per-token loss scaled by world size: 0.0010560491355136037
Per-token loss scaled by world size: 0.0003863103629555553
Per-token loss scaled by world size: 0.0015272889286279678
Per-token loss scaled by world size: 0.00571776507422328
Epoch: 2, Step: 28, Rank: 5, loss = 0.32557567954063416
Epoch: 2, Step: 28, Rank: 6, loss = 0.10151272267103195
Epoch: 2, Step: 28, Rank: 0, loss = 0.1634529083967209
Epoch: 2, Step: 28, Rank: 4, loss = 0.32436975836753845
Epoch: 2, Step: 28, Rank: 2, loss = 0.1468106508255005
Epoch: 2, Step: 28, Rank: 3, loss = 0.5496201515197754
Epoch: 2, Step: 28, Rank: 1, loss = 0.0371340848505497
Per-token loss scaled by world size: 0.00041141020483337343
Epoch: 2, Step: 28, Rank: 7, loss = 0.03954680636525154
[2024-07-27 20:04:55,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.9950829025450116e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:55,174] [INFO] [timer.py:258:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=31.512036329377096, CurrSamplesPerSec=32.38008718791239, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 33%|███▎ | 4/12 [00:02<00:04, 1.67it/s]
{
"epoch": 2,
"step": 28,
"rank": 0,
"loss": 0.1634529083967209,
"overall_throughput": 32.299561727331444,
"lr": 1.9950829025450116e-05,
"cuda_mem_allocated": 21.99880838394165,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 769,
"batch_size": 16,
"total_loss": 0.21100284159183502,
"gradnorm": 2.3949108123779297,
"weight_norm": 393.4600830078125,
"timestamp": "2024-07-27T20:04:55.217257"
}
Per-token loss scaled by world size: 0.0007814434356987476
Per-token loss scaled by world size: 0.00027669736300595105
Per-token loss scaled by world size: 0.0012405101442709565
Per-token loss scaled by world size: 0.0030604854691773653
Per-token loss scaled by world size: 0.0021558511070907116
Per-token loss scaled by world size: 0.0016599269583821297
Per-token loss scaled by world size: 0.0017815420869737864
Epoch: 2, Step: 29, Rank: 3, loss = 0.023380927741527557
Epoch: 2, Step: 29, Rank: 6, loss = 0.2586110234260559
Epoch: 2, Step: 29, Rank: 1, loss = 0.10482310503721237
Epoch: 2, Step: 29, Rank: 2, loss = 0.18216942250728607
Epoch: 2, Step: 29, Rank: 4, loss = 0.06603197008371353
Epoch: 2, Step: 29, Rank: 0, loss = 0.14026382565498352
Epoch: 2, Step: 29, Rank: 5, loss = 0.1505403071641922
Per-token loss scaled by world size: 0.004114873707294464
Epoch: 2, Step: 29, Rank: 7, loss = 0.3477068245410919
[2024-07-27 20:04:55,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.9912640693269754e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:55,714] [INFO] [timer.py:258:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=31.539610429600238, CurrSamplesPerSec=32.27386940956719, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 2: 42%|████▏ | 5/12 [00:03<00:04, 1.73it/s]
{
"epoch": 2,
"step": 29,
"rank": 0,
"loss": 0.14026382565498352,
"overall_throughput": 32.193407411564976,
"lr": 1.9912640693269754e-05,
"cuda_mem_allocated": 22.007156372070312,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 676,
"batch_size": 16,
"total_loss": 0.15919092297554016,
"gradnorm": 2.4766182899475098,
"weight_norm": 393.4605407714844,
"timestamp": "2024-07-27T20:04:55.749943"
}
Per-token loss scaled by world size: 0.0029893971513956785
Per-token loss scaled by world size: 0.005299379117786884
Per-token loss scaled by world size: 0.0023671372327953577
Per-token loss scaled by world size: 0.004149050917476416
Per-token loss scaled by world size: 0.008750627748668194
Per-token loss scaled by world size: 0.006499007809907198
Epoch: 2, Step: 30, Rank: 0, loss = 0.1615571230649948
Epoch: 2, Step: 30, Rank: 3, loss = 0.36168262362480164
Epoch: 2, Step: 30, Rank: 6, loss = 0.20402635633945465
Epoch: 2, Step: 30, Rank: 5, loss = 0.5972303748130798
Per-token loss scaled by world size: 0.0007520572980865836
Epoch: 2, Step: 30, Rank: 4, loss = 0.28317272663116455
Epoch: 2, Step: 30, Rank: 7, loss = 0.4435572922229767
Epoch: 2, Step: 30, Rank: 1, loss = 0.0513279102742672
Per-token loss scaled by world size: 0.0032296415884047747
Epoch: 2, Step: 30, Rank: 2, loss = 0.22042304277420044
[2024-07-27 20:04:56,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.9863613034027224e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:56,256] [INFO] [timer.py:258:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=31.5444892468786, CurrSamplesPerSec=31.676790257487433, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 50%|█████ | 6/12 [00:03<00:03, 1.77it/s]
{
"epoch": 2,
"step": 30,
"rank": 0,
"loss": 0.1615571230649948,
"overall_throughput": 31.593056094983204,
"lr": 1.9863613034027224e-05,
"cuda_mem_allocated": 21.999285221099854,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 546,
"batch_size": 16,
"total_loss": 0.29037219285964966,
"gradnorm": 6.375105857849121,
"weight_norm": 393.4609375,
"timestamp": "2024-07-27T20:04:56.301457"
}
Per-token loss scaled by world size: 0.002896310528740287
Per-token loss scaled by world size: 0.0031822924502193928
Per-token loss scaled by world size: 0.0018208534456789494
Per-token loss scaled by world size: 0.0022670035250484943
Per-token loss scaled by world size: 0.008491733111441135
Per-token loss scaled by world size: 0.003121417947113514
Epoch: 2, Step: 31, Rank: 6, loss = 0.2474232316017151
Epoch: 2, Step: 31, Rank: 2, loss = 0.14157135784626007
Per-token loss scaled by world size: 0.0018668599659577012
Epoch: 2, Step: 31, Rank: 1, loss = 0.22518813610076904
Epoch: 2, Step: 31, Rank: 0, loss = 0.17625951766967773
Epoch: 2, Step: 31, Rank: 7, loss = 0.6602322459220886
Epoch: 2, Step: 31, Rank: 4, loss = 0.24269025027751923
Epoch: 2, Step: 31, Rank: 5, loss = 0.145148366689682
Per-token loss scaled by world size: 0.0017641705926507711
Epoch: 2, Step: 31, Rank: 3, loss = 0.13716426491737366
[2024-07-27 20:04:56,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.9803799658748096e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:56,801] [INFO] [timer.py:258:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=31.57118154258231, CurrSamplesPerSec=32.3373511160454, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 58%|█████▊ | 7/12 [00:04<00:02, 1.79it/s]
{
"epoch": 2,
"step": 31,
"rank": 0,
"loss": 0.17625951766967773,
"overall_throughput": 32.279737447919125,
"lr": 1.9803799658748096e-05,
"cuda_mem_allocated": 22.0040545463562,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 622,
"batch_size": 16,
"total_loss": 0.24695967137813568,
"gradnorm": 4.441169261932373,
"weight_norm": 393.4613952636719,
"timestamp": "2024-07-27T20:04:56.844460"
}
Per-token loss scaled by world size: 0.0011196950217708945
Per-token loss scaled by world size: 0.000792959937825799
Per-token loss scaled by world size: 0.0029141369741410017
Per-token loss scaled by world size: 0.004119256976991892
Per-token loss scaled by world size: 0.0033213391434401274
Per-token loss scaled by world size: 0.004044802393764257
Per-token loss scaled by world size: 0.0030391488689929247
Epoch: 2, Step: 32, Rank: 5, loss = 0.2269384115934372
Epoch: 2, Step: 32, Rank: 2, loss = 0.06175175681710243
Epoch: 2, Step: 32, Rank: 0, loss = 0.08719625324010849
Epoch: 2, Step: 32, Rank: 3, loss = 0.2586492896080017
Epoch: 2, Step: 32, Rank: 6, loss = 0.32078713178634644
Epoch: 2, Step: 32, Rank: 4, loss = 0.23667371273040771
Epoch: 2, Step: 32, Rank: 7, loss = 0.31498900055885315
Per-token loss scaled by world size: 0.0011721396585926414
Epoch: 2, Step: 32, Rank: 1, loss = 0.09128037840127945
[2024-07-27 20:04:57,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.973326597248006e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:57,345] [INFO] [timer.py:258:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=31.5894998258649, CurrSamplesPerSec=32.130135235160566, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 67%|██████▋ | 8/12 [00:04<00:02, 1.80it/s]
{
"epoch": 2,
"step": 32,
"rank": 0,
"loss": 0.08719625324010849,
"overall_throughput": 32.07842353700362,
"lr": 1.973326597248006e-05,
"cuda_mem_allocated": 21.997137546539307,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 623,
"batch_size": 16,
"total_loss": 0.19978323578834534,
"gradnorm": 2.9166290760040283,
"weight_norm": 393.46185302734375,
"timestamp": "2024-07-27T20:04:57.393136"
}
Per-token loss scaled by world size: 0.0031772709917277098
Per-token loss scaled by world size: 0.002447927137836814
Per-token loss scaled by world size: 0.0009199947817251086
Per-token loss scaled by world size: 0.004487441387027502
Per-token loss scaled by world size: 0.0024654706940054893
Per-token loss scaled by world size: 0.00025754657690413296
Per-token loss scaled by world size: 0.004304614849388599
Epoch: 2, Step: 33, Rank: 4, loss = 0.06888461112976074
Epoch: 2, Step: 33, Rank: 0, loss = 0.23789817094802856
Epoch: 2, Step: 33, Rank: 5, loss = 0.33599716424942017
Epoch: 2, Step: 33, Rank: 1, loss = 0.1832885444164276
Epoch: 2, Step: 33, Rank: 2, loss = 0.18460211157798767
Epoch: 2, Step: 33, Rank: 3, loss = 0.019283799454569817
Epoch: 2, Step: 33, Rank: 6, loss = 0.3223080337047577
Per-token loss scaled by world size: 0.0012406171299517155
Epoch: 2, Step: 33, Rank: 7, loss = 0.09289120882749557
[2024-07-27 20:04:57,818] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.9652089102773487e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:04:57,896] [INFO] [timer.py:258:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=31.59934389707566, CurrSamplesPerSec=31.897545876966834, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 528
{
"epoch": 2,
"step": 33,
"rank": 0,
"loss": 0.23789817094802856,
"overall_throughput": 31.819460032023883,
"lr": 1.9652089102773487e-05,
"cuda_mem_allocated": 22.002385139465332,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 599,
"batch_size": 16,
"total_loss": 0.1806442141532898,
"gradnorm": 2.8334124088287354,
"weight_norm": 393.4622802734375,
"timestamp": "2024-07-27T20:04:57.899380"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_528
[20:05:15] INFO saving took 17.999591588974 seconds utils.py:611
Epoch 2: 75%|███████▌ | 9/12 [00:23<00:18, 6.18s/it]
Per-token loss scaled by world size: 0.0037983357906341553
Per-token loss scaled by world size: 0.004671666771173477
Per-token loss scaled by world size: 0.001915755565278232
Per-token loss scaled by world size: 0.001806297223083675
Per-token loss scaled by world size: 0.00687358109280467
Per-token loss scaled by world size: 0.0015339453238993883
Per-token loss scaled by world size: 0.0011989163467660546
Epoch: 2, Step: 34, Rank: 0, loss = 0.33425354957580566
Epoch: 2, Step: 34, Rank: 4, loss = 0.411106675863266
Epoch: 2, Step: 34, Rank: 1, loss = 0.16858649253845215
Epoch: 2, Step: 34, Rank: 6, loss = 0.6048751473426819
Epoch: 2, Step: 34, Rank: 2, loss = 0.15895415842533112
Epoch: 2, Step: 34, Rank: 5, loss = 0.10550463944673538
Epoch: 2, Step: 34, Rank: 3, loss = 0.13498719036579132
Per-token loss scaled by world size: 0.0012863841839134693
Epoch: 2, Step: 34, Rank: 7, loss = 0.113201804459095
[2024-07-27 20:05:16,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.9560357815343577e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:16,461] [INFO] [timer.py:258:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=31.575999746037336, CurrSamplesPerSec=30.869055674257183, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 83%|████████▎ | 10/12 [00:23<00:08, 4.45s/it]
{
"epoch": 2,
"step": 34,
"rank": 0,
"loss": 0.33425354957580566,
"overall_throughput": 30.809901389200697,
"lr": 1.9560357815343577e-05,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 704,
"batch_size": 16,
"total_loss": 0.25393369793891907,
"gradnorm": 3.9217264652252197,
"weight_norm": 393.4627685546875,
"timestamp": "2024-07-27T20:05:16.505094"
}
Per-token loss scaled by world size: 0.0034846733324229717
Per-token loss scaled by world size: 0.0033776769414544106
Per-token loss scaled by world size: 0.0047375233843922615
Per-token loss scaled by world size: 0.002200631657615304
Per-token loss scaled by world size: 0.0065323468297719955
Per-token loss scaled by world size: 0.000672308262437582
Per-token loss scaled by world size: 0.0031945251394063234
Epoch: 2, Step: 35, Rank: 3, loss = 0.22757098078727722
Epoch: 2, Step: 35, Rank: 1, loss = 0.14826755225658417
Epoch: 2, Step: 35, Rank: 4, loss = 0.31919065117836
Epoch: 2, Step: 35, Rank: 5, loss = 0.44011685252189636
Epoch: 2, Step: 35, Rank: 6, loss = 0.21523113548755646
Epoch: 2, Step: 35, Rank: 2, loss = 0.045296769589185715
Epoch: 2, Step: 35, Rank: 0, loss = 0.23477986454963684
Per-token loss scaled by world size: 0.0005250901449471712
Epoch: 2, Step: 35, Rank: 7, loss = 0.035377949476242065
[2024-07-27 20:05:16,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.9458172417006347e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:17,013] [INFO] [timer.py:258:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=31.574778541658905, CurrSamplesPerSec=31.53574981496928, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 92%|█████████▏| 11/12 [00:24<00:03, 3.25s/it]
{
"epoch": 2,
"step": 35,
"rank": 0,
"loss": 0.23477986454963684,
"overall_throughput": 31.455751447078118,
"lr": 1.9458172417006347e-05,
"cuda_mem_allocated": 22.0038161277771,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 539,
"batch_size": 16,
"total_loss": 0.2082289606332779,
"gradnorm": 3.2071847915649414,
"weight_norm": 393.4632263183594,
"timestamp": "2024-07-27T20:05:17.057249"
}
Per-token loss scaled by world size: 0.005230115260928869
Per-token loss scaled by world size: 0.0020361002534627914
Per-token loss scaled by world size: 0.0016461815685033798
Per-token loss scaled by world size: 0.0032435867469757795
Per-token loss scaled by world size: 0.0013154539046809077
Per-token loss scaled by world size: 0.007001029327511787
Per-token loss scaled by world size: 0.0026217142585664988
Epoch: 2, Step: 36, Rank: 3, loss = 0.2177257537841797
Epoch: 2, Step: 36, Rank: 5, loss = 0.1366732269525528
Epoch: 2, Step: 36, Rank: 0, loss = 0.08829984068870544
Epoch: 2, Step: 36, Rank: 6, loss = 0.35107147693634033
Epoch: 2, Step: 36, Rank: 2, loss = 0.11049994081258774
Epoch: 2, Step: 36, Rank: 1, loss = 0.4699440896511078
Epoch: 2, Step: 36, Rank: 7, loss = 0.17598256468772888
Per-token loss scaled by world size: 0.00114376877900213
Epoch: 2, Step: 36, Rank: 4, loss = 0.07677547633647919
[2024-07-27 20:05:17,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.934564464599461e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:17,568] [INFO] [timer.py:258:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=31.570395158941494, CurrSamplesPerSec=31.426423180739413, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 2: 100%|██████████| 12/12 [00:24<00:00, 2.43s/it]
{
"epoch": 2,
"step": 36,
"rank": 0,
"loss": 0.08829984068870544,
"overall_throughput": 31.342732108746315,
"lr": 1.934564464599461e-05,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 537,
"batch_size": 16,
"total_loss": 0.2033715397119522,
"gradnorm": 3.2520809173583984,
"weight_norm": 393.4635925292969,
"timestamp": "2024-07-27T20:05:17.613214"
}
Epoch 2: 100%|██████████| 12/12 [00:25<00:00, 2.09s/it]
total tokens: 118 num samples: 2 num padding tokens: 8 - rank: 2 max len: 59 min len: 51 avg len: 55.0 num_loss_counted_tokens: 60
total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 2 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 64
total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 2 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 70
total tokens: 180 num samples: 2 num padding tokens: 33 - rank: 2 max len: 90 min len: 57 avg len: 73.5 num_loss_counted_tokens: 99
total tokens: 214 num samples: 2 num padding tokens: 40 - rank: 2 max len: 107 min len: 67 avg len: 87.0 num_loss_counted_tokens: 103
total tokens: 126 num samples: 2 num padding tokens: 17 - rank: 2 max len: 63 min len: 46 avg len: 54.5 num_loss_counted_tokens: 54
total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 2 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 72
total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 2 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 59
total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 72
total tokens: 154 num samples: 2 num padding tokens: 22 - rank: 2 max len: 77 min len: 55 avg len: 66.0 num_loss_counted_tokens: 75
total tokens: 128 num samples: 2 num padding tokens: 21 - rank: 2 max len: 64 min len: 43 avg len: 53.5 num_loss_counted_tokens: 49
total tokens: 126 num samples: 2 num padding tokens: 11 - rank: 2 max len: 63 min len: 52 avg len: 57.5 num_loss_counted_tokens: 55
total tokens: 152 num samples: 2 num padding tokens: 17 - rank: 4 max len: 76 min len: 59 avg len: 67.5 num_loss_counted_tokens: 71
total tokens: 152 num samples: 2 num padding tokens: 7 - rank: 4 max len: 76 min len: 69 avg len: 72.5 num_loss_counted_tokens: 84
total tokens: 146 num samples: 2 num padding tokens: 21 - rank: 7 max len: 73 min len: 52 avg len: 62.5 num_loss_counted_tokens: 71
total tokens: 184 num samples: 2 num padding tokens: 32 - rank: 5 max len: 92 min len: 60 avg len: 76.0 num_loss_counted_tokens: 89
total tokens: 168 num samples: 2 num padding tokens: 33 - rank: 7 max len: 84 min len: 51 avg len: 67.5 num_loss_counted_tokens: 85
total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 4 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 58
total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 7 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 83
total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 5 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 73
total tokens: 138 num samples: 2 num padding tokens: 12 - rank: 5 max len: 69 min len: 57 avg len: 63.0 num_loss_counted_tokens: 53
total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 4 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 54
total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 5 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 79
total tokens: 152 num samples: 2 num padding tokens: 9 - rank: 5 max len: 76 min len: 67 avg len: 71.5 num_loss_counted_tokens: 81
total tokens: 138 num samples: 2 num padding tokens: 14 - rank: 5 max len: 69 min len: 55 avg len: 62.0 num_loss_counted_tokens: 78
total tokens: 96 num samples: 2 num padding tokens: 0 - rank: 5 max len: 48 min len: 48 avg len: 48.0 num_loss_counted_tokens: 46
total tokens: 174 num samples: 2 num padding tokens: 27 - rank: 4 max len: 87 min len: 60 avg len: 73.5 num_loss_counted_tokens: 85
total tokens: 136 num samples: 2 num padding tokens: 4 - rank: 7 max len: 68 min len: 64 avg len: 66.0 num_loss_counted_tokens: 65
total tokens: 148 num samples: 2 num padding tokens: 2 - rank: 7 max len: 74 min len: 72 avg len: 73.0 num_loss_counted_tokens: 75
total tokens: 186 num samples: 2 num padding tokens: 43 - rank: 7 max len: 93 min len: 50 avg len: 71.5 num_loss_counted_tokens: 120
total tokens: 188 num samples: 2 num padding tokens: 23 - rank: 7 max len: 94 min len: 71 avg len: 82.5 num_loss_counted_tokens: 97
total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 5 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 71
total tokens: 140 num samples: 2 num padding tokens: 24 - rank: 3 max len: 70 min len: 46 avg len: 58.0 num_loss_counted_tokens: 67
total tokens: 166 num samples: 2 num padding tokens: 34 - rank: 7 max len: 83 min len: 49 avg len: 66.0 num_loss_counted_tokens: 74
total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 7 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 52
total tokens: 142 num samples: 2 num padding tokens: 22 - rank: 7 max len: 71 min len: 49 avg len: 60.0 num_loss_counted_tokens: 70
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 69
total tokens: 202 num samples: 2 num padding tokens: 46 - rank: 7 max len: 101 min len: 55 avg len: 78.0 num_loss_counted_tokens: 96
total tokens: 180 num samples: 2 num padding tokens: 45 - rank: 3 max len: 90 min len: 45 avg len: 67.5 num_loss_counted_tokens: 86
total tokens: 172 num samples: 2 num padding tokens: 23 - rank: 3 max len: 86 min len: 63 avg len: 74.5 num_loss_counted_tokens: 81
total tokens: 186 num samples: 2 num padding tokens: 13 - rank: 3 max len: 93 min len: 80 avg len: 86.5 num_loss_counted_tokens: 122
total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 6 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 59
total tokens: 132 num samples: 2 num padding tokens: 21 - rank: 6 max len: 66 min len: 45 avg len: 55.5 num_loss_counted_tokens: 56
total tokens: 134 num samples: 2 num padding tokens: 23 - rank: 6 max len: 67 min len: 44 avg len: 55.5 num_loss_counted_tokens: 47
total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 3 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 59
total tokens: 128 num samples: 2 num padding tokens: 11 - rank: 6 max len: 64 min len: 53 avg len: 58.5 num_loss_counted_tokens: 59
total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 6 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 81
total tokens: 106 num samples: 2 num padding tokens: 4 - rank: 4 max len: 53 min len: 49 avg len: 51.0 num_loss_counted_tokens: 57
total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 3 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 67
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 4 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 62
total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 4 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 73
total tokens: 282 num samples: 2 num padding tokens: 86 - rank: 4 max len: 141 min len: 55 avg len: 98.0 num_loss_counted_tokens: 139
total tokens: 174 num samples: 2 num padding tokens: 25 - rank: 5 max len: 87 min len: 62 avg len: 74.5 num_loss_counted_tokens: 74
total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 4 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 70
total tokens: 208 num samples: 2 num padding tokens: 44 - rank: 4 max len: 104 min len: 60 avg len: 82.0 num_loss_counted_tokens: 109
total tokens: 172 num samples: 2 num padding tokens: 29 - rank: 6 max len: 86 min len: 57 avg len: 71.5 num_loss_counted_tokens: 54
total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 6 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 98
total tokens: 188 num samples: 2 num padding tokens: 39 - rank: 6 max len: 94 min len: 55 avg len: 74.5 num_loss_counted_tokens: 86
total tokens: 148 num samples: 2 num padding tokens: 23 - rank: 6 max len: 74 min len: 51 avg len: 62.5 num_loss_counted_tokens: 62
total tokens: 172 num samples: 2 num padding tokens: 21 - rank: 6 max len: 86 min len: 65 avg len: 75.5 num_loss_counted_tokens: 71
total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 5 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 65
total tokens: 110 num samples: 2 num padding tokens: 9 - rank: 5 max len: 55 min len: 46 avg len: 50.5 num_loss_counted_tokens: 53
total tokens: 164 num samples: 2 num padding tokens: 24 - rank: 3 max len: 82 min len: 58 avg len: 70.0 num_loss_counted_tokens: 96
total tokens: 118 num samples: 2 num padding tokens: 15 - rank: 3 max len: 59 min len: 44 avg len: 51.5 num_loss_counted_tokens: 51
total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 5 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 65
total tokens: 130 num samples: 2 num padding tokens: 14 - rank: 3 max len: 65 min len: 51 avg len: 58.0 num_loss_counted_tokens: 61
total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 1 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 60
total tokens: 228 num samples: 2 num padding tokens: 34 - rank: 0 max len: 114 min len: 80 avg len: 97.0 num_loss_counted_tokens: 135
total tokens: 164 num samples: 2 num padding tokens: 1 - rank: 3 max len: 82 min len: 81 avg len: 81.5 num_loss_counted_tokens: 105
total tokens: 216 num samples: 2 num padding tokens: 47 - rank: 4 max len: 108 min len: 61 avg len: 84.5 num_loss_counted_tokens: 109
total tokens: 194 num samples: 2 num padding tokens: 19 - rank: 6 max len: 97 min len: 78 avg len: 87.5 num_loss_counted_tokens: 114
total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 0 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 62
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 1 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 62
total tokens: 180 num samples: 2 num padding tokens: 3 - rank: 1 max len: 90 min len: 87 avg len: 88.5 num_loss_counted_tokens: 136
total tokens: 244 num samples: 2 num padding tokens: 70 - rank: 1 max len: 122 min len: 52 avg len: 87.0 num_loss_counted_tokens: 125
total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 1 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
total tokens: 106 num samples: 2 num padding tokens: 9 - rank: 1 max len: 53 min len: 44 avg len: 48.5 num_loss_counted_tokens: 47
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 1 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 158 num samples: 2 num padding tokens: 25 - rank: 1 max len: 79 min len: 54 avg len: 66.5 num_loss_counted_tokens: 74
total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 0 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 101
total tokens: 140 num samples: 2 num padding tokens: 3 - rank: 7 max len: 70 min len: 67 avg len: 68.5 num_loss_counted_tokens: 81
total tokens: 196 num samples: 2 num padding tokens: 53 - rank: 1 max len: 98 min len: 45 avg len: 71.5 num_loss_counted_tokens: 94
total tokens: 228 num samples: 2 num padding tokens: 60 - rank: 1 max len: 114 min len: 54 avg len: 84.0 num_loss_counted_tokens: 122
total tokens: 114 num samples: 2 num padding tokens: 5 - rank: 6 max len: 57 min len: 52 avg len: 54.5 num_loss_counted_tokens: 53
total tokens: 226 num samples: 2 num padding tokens: 6 - rank: 0 max len: 113 min len: 107 avg len: 110.0 num_loss_counted_tokens: 142
total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 0 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 85
total tokens: 158 num samples: 2 num padding tokens: 29 - rank: 1 max len: 79 min len: 50 avg len: 64.5 num_loss_counted_tokens: 69
total tokens: 154 num samples: 2 num padding tokens: 14 - rank: 0 max len: 77 min len: 63 avg len: 70.0 num_loss_counted_tokens: 86
total tokens: 150 num samples: 2 num padding tokens: 21 - rank: 0 max len: 75 min len: 54 avg len: 64.5 num_loss_counted_tokens: 75
total tokens: 122 num samples: 2 num padding tokens: 1 - rank: 0 max len: 61 min len: 60 avg len: 60.5 num_loss_counted_tokens: 59
total tokens: 200 num samples: 2 num padding tokens: 52 - rank: 0 max len: 100 min len: 48 avg len: 74.0 num_loss_counted_tokens: 95
total tokens: 168 num samples: 2 num padding tokens: 32 - rank: 0 max len: 84 min len: 52 avg len: 68.0 num_loss_counted_tokens: 80
total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 0 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 84
total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 3 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 60
total tokens: 176 num samples: 2 num padding tokens: 27 - rank: 1 max len: 88 min len: 61 avg len: 74.5 num_loss_counted_tokens: 83
total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 0 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 57
Per-token loss scaled by world size: 0.0011273464187979698
Per-token loss scaled by world size: 0.011045603081583977
Per-token loss scaled by world size: 0.0008964258013293147
Per-token loss scaled by world size: 0.005275155883282423
Per-token loss scaled by world size: 0.0004854030557908118
Per-token loss scaled by world size: 0.001362472539767623
Per-token loss scaled by world size: 0.0016097185434773564
Epoch: 3, Step: 37, Rank: 5, loss = 0.08314179629087448
Epoch: 3, Step: 37, Rank: 3, loss = 0.06611140072345734
Epoch: 3, Step: 37, Rank: 0, loss = 0.8146132230758667
Epoch: 3, Step: 37, Rank: 7, loss = 0.3890427350997925
Epoch: 3, Step: 37, Rank: 6, loss = 0.03579847514629364
Epoch: 3, Step: 37, Rank: 4, loss = 0.10048235207796097
Epoch: 3, Step: 37, Rank: 2, loss = 0.11871673911809921
Per-token loss scaled by world size: 0.0010947352275252342
Epoch: 3, Step: 37, Rank: 1, loss = 0.08073671907186508
[2024-07-27 20:05:18,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.922289754977385e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:18,600] [INFO] [timer.py:258:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=31.510911148754047, CurrSamplesPerSec=29.613797942311468, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 3: 8%|▊ | 1/12 [00:00<00:10, 1.06it/s]
{
"epoch": 3,
"step": 37,
"rank": 0,
"loss": 0.8146132230758667,
"overall_throughput": 29.510024167760136,
"lr": 1.922289754977385e-05,
"cuda_mem_allocated": 22.01049518585205,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 590,
"batch_size": 16,
"total_loss": 0.21108044683933258,
"gradnorm": 3.543294906616211,
"weight_norm": 393.4640197753906,
"timestamp": "2024-07-27T20:05:18.643289"
}
Per-token loss scaled by world size: 0.0009516195859760046
Per-token loss scaled by world size: 0.005429701413959265
Per-token loss scaled by world size: 0.000703756813891232
Per-token loss scaled by world size: 0.0006457434501498938
Per-token loss scaled by world size: 0.0020593185909092426
Per-token loss scaled by world size: 0.0010209670290350914
Per-token loss scaled by world size: 0.0010425481013953686
Epoch: 3, Step: 38, Rank: 1, loss = 0.4642394781112671
Epoch: 3, Step: 38, Rank: 6, loss = 0.060171205550432205
Epoch: 3, Step: 38, Rank: 5, loss = 0.05521106347441673
Epoch: 3, Step: 38, Rank: 7, loss = 0.17607174813747406
Epoch: 3, Step: 38, Rank: 4, loss = 0.08729267865419388
Epoch: 3, Step: 38, Rank: 2, loss = 0.08913786709308624
Epoch: 3, Step: 38, Rank: 3, loss = 0.08136347681283951
Per-token loss scaled by world size: 0.0009524038759991527
Epoch: 3, Step: 38, Rank: 0, loss = 0.08143053203821182
[2024-07-27 20:05:19,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.909006535049163e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:19,141] [INFO] [timer.py:258:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=31.53368842638859, CurrSamplesPerSec=32.35217658966323, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 17%|█▋ | 2/12 [00:01<00:07, 1.42it/s]
{
"epoch": 3,
"step": 38,
"rank": 0,
"loss": 0.08143053203821182,
"overall_throughput": 32.29991928493285,
"lr": 1.909006535049163e-05,
"cuda_mem_allocated": 21.99785280227661,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 684,
"batch_size": 16,
"total_loss": 0.13686475157737732,
"gradnorm": 7.216549396514893,
"weight_norm": 393.4645080566406,
"timestamp": "2024-07-27T20:05:19.186626"
}
Per-token loss scaled by world size: 0.0020264529157429934
Per-token loss scaled by world size: 0.0008862476679496467
Per-token loss scaled by world size: 0.0010139633668586612
Per-token loss scaled by world size: 0.00098150665871799
Per-token loss scaled by world size: 0.0027166178915649652
Per-token loss scaled by world size: 0.0003891867527272552
Per-token loss scaled by world size: 0.0019367473432794213
Epoch: 3, Step: 39, Rank: 0, loss = 0.14565131068229675
Epoch: 3, Step: 39, Rank: 2, loss = 0.07287861406803131
Epoch: 3, Step: 39, Rank: 4, loss = 0.06369905173778534
Epoch: 3, Step: 39, Rank: 5, loss = 0.19525690376758575
Epoch: 3, Step: 39, Rank: 3, loss = 0.13920371234416962
Epoch: 3, Step: 39, Rank: 7, loss = 0.07054579257965088
Epoch: 3, Step: 39, Rank: 1, loss = 0.02797279693186283
Per-token loss scaled by world size: 0.0002103938313666731
Epoch: 3, Step: 39, Rank: 6, loss = 0.015122056938707829
[2024-07-27 20:05:19,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[1.8947293298207637e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:19,674] [INFO] [timer.py:258:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=31.57390143166257, CurrSamplesPerSec=33.093162948245876, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 3/12 [00:02<00:05, 1.60it/s]
{
"epoch": 3,
"step": 39,
"rank": 0,
"loss": 0.14565131068229675,
"overall_throughput": 33.040161880332626,
"lr": 1.8947293298207637e-05,
"cuda_mem_allocated": 22.00071620941162,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 575,
"batch_size": 16,
"total_loss": 0.09129128605127335,
"gradnorm": 1.8486279249191284,
"weight_norm": 393.4649658203125,
"timestamp": "2024-07-27T20:05:19.677166"
}
Per-token loss scaled by world size: 0.0030472958460450172
Per-token loss scaled by world size: 0.0027008713223040104
Per-token loss scaled by world size: 0.0013511140132322907
Per-token loss scaled by world size: 0.00032714917324483395
Per-token loss scaled by world size: 0.0010533123277127743
Per-token loss scaled by world size: 0.0011340089840814471
Per-token loss scaled by world size: 0.000277617946267128
Epoch: 3, Step: 40, Rank: 0, loss = 0.2605437934398651
Epoch: 3, Step: 40, Rank: 5, loss = 0.11552024632692337
Epoch: 3, Step: 40, Rank: 3, loss = 0.027971254661679268
Epoch: 3, Step: 40, Rank: 2, loss = 0.23092450201511383
Epoch: 3, Step: 40, Rank: 7, loss = 0.09695777297019958
Epoch: 3, Step: 40, Rank: 1, loss = 0.09005820006132126
Epoch: 3, Step: 40, Rank: 4, loss = 0.023736335337162018
Per-token loss scaled by world size: 0.0006618773913942277
Epoch: 3, Step: 40, Rank: 6, loss = 0.05659051612019539
[2024-07-27 20:05:20,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.879473751206489e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:20,211] [INFO] [timer.py:258:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=31.60944150873908, CurrSamplesPerSec=32.98311497397824, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 3: 4/12 [00:02<00:04, 1.69it/s]
{
"epoch": 3,
"step": 40,
"rank": 0,
"loss": 0.2605437934398651,
"overall_throughput": 32.92414856200329,
"lr": 1.879473751206489e-05,
"cuda_mem_allocated": 22.01025676727295,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 684,
"batch_size": 16,
"total_loss": 0.11278782784938812,
"gradnorm": 1.686691403388977,
"weight_norm": 393.46551513671875,
"timestamp": "2024-07-27T20:05:20.214384"
}
Per-token loss scaled by world size: 0.0014961636625230312
Per-token loss scaled by world size: 0.0015210546553134918
Per-token loss scaled by world size: 0.0010414053685963154
Per-token loss scaled by world size: 0.001773377531208098
Per-token loss scaled by world size: 0.0008040367974899709
Per-token loss scaled by world size: 0.0032368989195674658
Per-token loss scaled by world size: 0.0005034140776842833
Epoch: 3, Step: 41, Rank: 6, loss = 0.12187450379133224
Epoch: 3, Step: 41, Rank: 3, loss = 0.11988011747598648
Epoch: 3, Step: 41, Rank: 0, loss = 0.08344260603189468
Epoch: 3, Step: 41, Rank: 1, loss = 0.14209187030792236
Epoch: 3, Step: 41, Rank: 7, loss = 0.2593565285205841
Epoch: 3, Step: 41, Rank: 5, loss = 0.06442344933748245
Epoch: 3, Step: 41, Rank: 4, loss = 0.04033605381846428
Per-token loss scaled by world size: 0.0004881576751358807
Epoch: 3, Step: 41, Rank: 2, loss = 0.03911363333463669
[2024-07-27 20:05:20,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[1.863256480957574e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:20,751] [INFO] [timer.py:258:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=31.626163332071105, CurrSamplesPerSec=32.274971444510975, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 5/12 [00:03<00:04, 1.75it/s]
{
"epoch": 3,
"step": 41,
"rank": 0,
"loss": 0.08344260603189468,
"overall_throughput": 32.196172088118544,
"lr": 1.863256480957574e-05,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 641,
"batch_size": 16,
"total_loss": 0.10881484299898148,
"gradnorm": 1.5060667991638184,
"weight_norm": 393.46600341796875,
"timestamp": "2024-07-27T20:05:20.793861"
}
Per-token loss scaled by world size: 0.001642214716412127
Per-token loss scaled by world size: 0.0013352860696613789
Per-token loss scaled by world size: 0.001149832969531417
Per-token loss scaled by world size: 0.0007547381101176143
Per-token loss scaled by world size: 0.002172367414459586
Per-token loss scaled by world size: 0.0017613072413951159
Per-token loss scaled by world size: 0.0012277569621801376
Epoch: 3, Step: 42, Rank: 0, loss = 0.10571756958961487
Epoch: 3, Step: 42, Rank: 5, loss = 0.0859590396285057
Epoch: 3, Step: 42, Rank: 6, loss = 0.04858626425266266
Epoch: 3, Step: 42, Rank: 7, loss = 0.13984614610671997
Epoch: 3, Step: 42, Rank: 2, loss = 0.07402049750089645
Epoch: 3, Step: 42, Rank: 3, loss = 0.11338414996862411
Epoch: 3, Step: 42, Rank: 4, loss = 0.07903685420751572
Per-token loss scaled by world size: 0.0004327835631556809
Epoch: 3, Step: 42, Rank: 1, loss = 0.02786044217646122
[2024-07-27 20:05:21,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[1.8460952524209355e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:21,294] [INFO] [timer.py:258:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=31.636401598459667, CurrSamplesPerSec=32.040930582537946, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 6/12 [00:03<00:03, 1.78it/s]
{
"epoch": 3,
"step": 42,
"rank": 0,
"loss": 0.10571756958961487,
"overall_throughput": 31.961989783039616,
"lr": 1.8460952524209355e-05,
"cuda_mem_allocated": 22.001193046569824,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 515,
"batch_size": 16,
"total_loss": 0.08430136740207672,
"gradnorm": 2.007233142852783,
"weight_norm": 393.46649169921875,
"timestamp": "2024-07-27T20:05:21.336433"
}
Per-token loss scaled by world size: 0.001508427201770246
Per-token loss scaled by world size: 0.0023370874114334583
Per-token loss scaled by world size: 0.0023123989813029766
Per-token loss scaled by world size: 0.0019151513697579503
Per-token loss scaled by world size: 0.0032233409583568573
Per-token loss scaled by world size: 0.0005486609297804534
Per-token loss scaled by world size: 0.0025942232459783554
Epoch: 3, Step: 43, Rank: 1, loss = 0.20537155866622925
Epoch: 3, Step: 43, Rank: 7, loss = 0.20320206880569458
Epoch: 3, Step: 43, Rank: 6, loss = 0.16829392313957214
Epoch: 3, Step: 43, Rank: 4, loss = 0.13255304098129272
Epoch: 3, Step: 43, Rank: 5, loss = 0.04821357876062393
Epoch: 3, Step: 43, Rank: 3, loss = 0.2832510769367218
Epoch: 3, Step: 43, Rank: 0, loss = 0.22796736657619476
Per-token loss scaled by world size: 0.0013315769610926509
Epoch: 3, Step: 43, Rank: 2, loss = 0.11701232939958572
[2024-07-27 20:05:21,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[1.8280088311480203e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:21,830] [INFO] [timer.py:258:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=31.65571342945945, CurrSamplesPerSec=32.44800374432416, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 3: 7/12 [00:04<00:02, 1.80it/s]
{
"epoch": 3,
"step": 43,
"rank": 0,
"loss": 0.22796736657619476,
"overall_throughput": 32.36756205093528,
"lr": 1.8280088311480203e-05,
"cuda_mem_allocated": 22.007156372070312,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 703,
"batch_size": 16,
"total_loss": 0.17323312163352966,
"gradnorm": 3.137274742126465,
"weight_norm": 393.4668884277344,
"timestamp": "2024-07-27T20:05:21.873356"
}
Per-token loss scaled by world size: 0.0031055721919983625
Per-token loss scaled by world size: 0.005434826016426086
Per-token loss scaled by world size: 0.0022665681317448616
Per-token loss scaled by world size: 0.0013299736892804503
Per-token loss scaled by world size: 0.002249634126201272
Per-token loss scaled by world size: 0.002406098647043109
Per-token loss scaled by world size: 0.0007422761409543455
Epoch: 3, Step: 44, Rank: 4, loss = 0.382475882768631
Epoch: 3, Step: 44, Rank: 6, loss = 0.09359689801931381
Epoch: 3, Step: 44, Rank: 5, loss = 0.15950973331928253
Epoch: 3, Step: 44, Rank: 1, loss = 0.15831799805164337
Epoch: 3, Step: 44, Rank: 7, loss = 0.1693291962146759
Epoch: 3, Step: 44, Rank: 0, loss = 0.21855464577674866
Epoch: 3, Step: 44, Rank: 3, loss = 0.05223768204450607
Per-token loss scaled by world size: 0.000683724822010845
Epoch: 3, Step: 44, Rank: 2, loss = 0.04811713472008705
[2024-07-27 20:05:22,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[1.8090169943749477e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:22,384] [INFO] [timer.py:258:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=31.653925419020233, CurrSamplesPerSec=31.580790497837636, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 704
{
"epoch": 3,
"step": 44,
"rank": 0,
"loss": 0.21855464577674866,
"overall_throughput": 31.53099355086062,
"lr": 1.8090169943749477e-05,
"cuda_mem_allocated": 21.998329639434814,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 563,
"batch_size": 16,
"total_loss": 0.1602673977613449,
"gradnorm": 2.924142599105835,
"weight_norm": 393.46728515625,
"timestamp": "2024-07-27T20:05:22.387393"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_704
[20:05:40] INFO saving took 17.951613903045654 seconds utils.py:611
Per-token loss scaled by world size: 0.002263416536152363
Per-token loss scaled by world size: 0.0010527893900871277
Per-token loss scaled by world size: 0.004291217308491468
Per-token loss scaled by world size: 0.002417604671791196
Per-token loss scaled by world size: 0.002519844565540552
Per-token loss scaled by world size: 0.0017610156210139394
Per-token loss scaled by world size: 0.003203270025551319
Epoch: 3, Step: 45, Rank: 0, loss = 0.16494648158550262
Epoch: 3, Step: 45, Rank: 2, loss = 0.31272247433662415
Epoch: 3, Step: 45, Rank: 6, loss = 0.07672202587127686
Epoch: 3, Step: 45, Rank: 7, loss = 0.17618294060230255
Epoch: 3, Step: 45, Rank: 3, loss = 0.18363367021083832
Epoch: 3, Step: 45, Rank: 1, loss = 0.12833401560783386
Epoch: 3, Step: 45, Rank: 4, loss = 0.23343829810619354
Per-token loss scaled by world size: 0.0014321228954941034
Epoch: 3, Step: 45, Rank: 5, loss = 0.10436595976352692
[2024-07-27 20:05:40,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[1.789140509396394e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:40,889] [INFO] [timer.py:258:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=31.661507538235384, CurrSamplesPerSec=31.983269859773554, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 9/12 [00:23<00:13, 4.48s/it]
{
"epoch": 3,
"step": 45,
"rank": 0,
"loss": 0.16494648158550262,
"overall_throughput": 31.91971183779508,
"lr": 1.789140509396394e-05,
"cuda_mem_allocated": 22.001669883728027,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 583,
"batch_size": 16,
"total_loss": 0.1725432276725769,
"gradnorm": 4.5381975173950195,
"weight_norm": 393.46771240234375,
"timestamp": "2024-07-27T20:05:40.931934"
}
Per-token loss scaled by world size: 0.005079582799226046
Per-token loss scaled by world size: 0.003907069563865662
Per-token loss scaled by world size: 0.0016345379408448935
Per-token loss scaled by world size: 0.00215306063182652
Per-token loss scaled by world size: 0.005338544957339764
Per-token loss scaled by world size: 0.0032401932403445244
Per-token loss scaled by world size: 0.0003386051394045353
Epoch: 3, Step: 46, Rank: 5, loss = 0.18193362653255463
Epoch: 3, Step: 46, Rank: 6, loss = 0.1381184607744217
Epoch: 3, Step: 46, Rank: 2, loss = 0.330147385597229
Epoch: 3, Step: 46, Rank: 7, loss = 0.028612133115530014
Epoch: 3, Step: 46, Rank: 3, loss = 0.42922475934028625
Epoch: 3, Step: 46, Rank: 1, loss = 0.27379631996154785
Epoch: 3, Step: 46, Rank: 4, loss = 0.45110705494880676
Per-token loss scaled by world size: 0.0007393484702333808
Epoch: 3, Step: 46, Rank: 0, loss = 0.062474943697452545
[2024-07-27 20:05:41,368] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[1.7684011108568593e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:41,446] [INFO] [timer.py:258:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=31.649546499309086, CurrSamplesPerSec=31.143634404390532, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 10/12 [00:23<00:06, 3.27s/it]
{
"epoch": 3,
"step": 46,
"rank": 0,
"loss": 0.062474943697452545,
"overall_throughput": 31.079148738311545,
"lr": 1.7684011108568593e-05,
"cuda_mem_allocated": 21.99785280227661,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 676,
"batch_size": 16,
"total_loss": 0.2369268238544464,
"gradnorm": 3.1312427520751953,
"weight_norm": 393.46807861328125,
"timestamp": "2024-07-27T20:05:41.491963"
}
Per-token loss scaled by world size: 0.003072180086746812
Per-token loss scaled by world size: 0.0009212895529344678
Per-token loss scaled by world size: 0.001889892853796482
Per-token loss scaled by world size: 0.005953831598162651
Per-token loss scaled by world size: 0.0021448375191539526
Per-token loss scaled by world size: 0.0008049519965425134
Per-token loss scaled by world size: 0.005051196087151766
Epoch: 3, Step: 47, Rank: 3, loss = 0.06748446077108383
Epoch: 3, Step: 47, Rank: 2, loss = 0.13843464851379395
Epoch: 3, Step: 47, Rank: 0, loss = 0.22503718733787537
Epoch: 3, Step: 47, Rank: 7, loss = 0.43611815571784973
Epoch: 3, Step: 47, Rank: 5, loss = 0.1571093499660492
Epoch: 3, Step: 47, Rank: 4, loss = 0.058962732553482056
Epoch: 3, Step: 47, Rank: 6, loss = 0.37000012397766113
Per-token loss scaled by world size: 0.002140582073479891
Epoch: 3, Step: 47, Rank: 1, loss = 0.1567976325750351
[2024-07-27 20:05:41,903] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[1.7468214769841542e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:41,981] [INFO] [timer.py:258:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=31.673656937700265, CurrSamplesPerSec=32.7721445241366, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 11/12 [00:24<00:02, 2.43s/it]
{
"epoch": 3,
"step": 47,
"rank": 0,
"loss": 0.22503718733787537,
"overall_throughput": 32.68064947498726,
"lr": 1.7468214769841542e-05,
"cuda_mem_allocated": 22.002624034881592,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 586,
"batch_size": 16,
"total_loss": 0.20124304294586182,
"gradnorm": 2.9233510494232178,
"weight_norm": 393.46844482421875,
"timestamp": "2024-07-27T20:05:41.984708"
}
Per-token loss scaled by world size: 0.0007923523080535233
Per-token loss scaled by world size: 0.005117433145642281
Per-token loss scaled by world size: 0.001681540277786553
Per-token loss scaled by world size: 0.0009754404309205711
Per-token loss scaled by world size: 0.0007582003017887473
Per-token loss scaled by world size: 0.0003797741374000907
Per-token loss scaled by world size: 0.0006013160455040634
Epoch: 3, Step: 48, Rank: 5, loss = 0.12737667560577393
Epoch: 3, Step: 48, Rank: 1, loss = 0.38764557242393494
Epoch: 3, Step: 48, Rank: 2, loss = 0.06002068519592285
Epoch: 3, Step: 48, Rank: 6, loss = 0.028767891228199005
Epoch: 3, Step: 48, Rank: 7, loss = 0.07388961315155029
Epoch: 3, Step: 48, Rank: 0, loss = 0.05743367224931717
Epoch: 3, Step: 48, Rank: 4, loss = 0.04554969072341919
Per-token loss scaled by world size: 0.001107058022171259
Epoch: 3, Step: 48, Rank: 3, loss = 0.0838596448302269
[2024-07-27 20:05:42,468] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[1.7244252047910893e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:42,546] [INFO] [timer.py:258:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=31.653254285775276, CurrSamplesPerSec=30.761573372705392, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 3: 12/12 [00:24<00:00, 1.86s/it]
{
"epoch": 3,
"step": 48,
"rank": 0,
"loss": 0.05743367224931717,
"overall_throughput": 30.690515825250888,
"lr": 1.7244252047910893e-05,
"cuda_mem_allocated": 22.003339290618896,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 606,
"batch_size": 16,
"total_loss": 0.10806792974472046,
"gradnorm": 1.4744030237197876,
"weight_norm": 393.4688415527344,
"timestamp": "2024-07-27T20:05:42.548994"
}
Epoch 3: 100%|██████████| 12/12 [00:24<00:00, 2.08s/it]
total tokens: 196 num samples: 2 num padding tokens: 44 - rank: 6 max len: 98 min len: 54 avg len: 76.0 num_loss_counted_tokens: 104
total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 0 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
total tokens: 136 num samples: 2 num padding tokens: 0 - rank: 0 max len: 68 min len: 68 avg len: 68.0 num_loss_counted_tokens: 53
total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 6 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 75
total tokens: 130 num samples: 2 num padding tokens: 15 - rank: 2 max len: 65 min len: 50 avg len: 57.5 num_loss_counted_tokens: 65
total tokens: 110 num samples: 2 num padding tokens: 11 - rank: 7 max len: 55 min len: 44 avg len: 49.5 num_loss_counted_tokens: 53
total tokens: 142 num samples: 2 num padding tokens: 0 - rank: 7 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 75
total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 67
total tokens: 120 num samples: 2 num padding tokens: 15 - rank: 6 max len: 60 min len: 45 avg len: 52.5 num_loss_counted_tokens: 59
total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 6 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 70
total tokens: 166 num samples: 2 num padding tokens: 28 - rank: 7 max len: 83 min len: 55 avg len: 69.0 num_loss_counted_tokens: 89
total tokens: 126 num samples: 2 num padding tokens: 6 - rank: 7 max len: 63 min len: 57 avg len: 60.0 num_loss_counted_tokens: 57
total tokens: 154 num samples: 2 num padding tokens: 22 - rank: 6 max len: 77 min len: 55 avg len: 66.0 num_loss_counted_tokens: 75
total tokens: 152 num samples: 2 num padding tokens: 15 - rank: 7 max len: 76 min len: 61 avg len: 68.5 num_loss_counted_tokens: 70
total tokens: 282 num samples: 2 num padding tokens: 81 - rank: 6 max len: 141 min len: 60 avg len: 100.5 num_loss_counted_tokens: 156
total tokens: 194 num samples: 2 num padding tokens: 10 - rank: 7 max len: 97 min len: 87 avg len: 92.0 num_loss_counted_tokens: 116
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 7 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 62
total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 7 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 61
total tokens: 164 num samples: 2 num padding tokens: 30 - rank: 0 max len: 82 min len: 52 avg len: 67.0 num_loss_counted_tokens: 81
total tokens: 150 num samples: 2 num padding tokens: 13 - rank: 7 max len: 75 min len: 62 avg len: 68.5 num_loss_counted_tokens: 70
total tokens: 180 num samples: 2 num padding tokens: 6 - rank: 0 max len: 90 min len: 84 avg len: 87.0 num_loss_counted_tokens: 114
total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 7 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 72
total tokens: 106 num samples: 2 num padding tokens: 8 - rank: 6 max len: 53 min len: 45 avg len: 49.0 num_loss_counted_tokens: 45
total tokens: 186 num samples: 2 num padding tokens: 16 - rank: 6 max len: 93 min len: 77 avg len: 85.0 num_loss_counted_tokens: 122
total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 4 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 59
total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 6 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 80
total tokens: 118 num samples: 2 num padding tokens: 14 - rank: 4 max len: 59 min len: 45 avg len: 52.0 num_loss_counted_tokens: 52
total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 6 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 59
total tokens: 186 num samples: 2 num padding tokens: 30 - rank: 5 max len: 93 min len: 63 avg len: 78.0 num_loss_counted_tokens: 100
total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 2 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 70
total tokens: 114 num samples: 2 num padding tokens: 7 - rank: 5 max len: 57 min len: 50 avg len: 53.5 num_loss_counted_tokens: 59
total tokens: 148 num samples: 2 num padding tokens: 16 - rank: 4 max len: 74 min len: 58 avg len: 66.0 num_loss_counted_tokens: 68
total tokens: 146 num samples: 2 num padding tokens: 22 - rank: 4 max len: 73 min len: 51 avg len: 62.0 num_loss_counted_tokens: 72
total tokens: 186 num samples: 2 num padding tokens: 45 - rank: 0 max len: 93 min len: 48 avg len: 70.5 num_loss_counted_tokens: 111
total tokens: 180 num samples: 2 num padding tokens: 26 - rank: 0 max len: 90 min len: 64 avg len: 77.0 num_loss_counted_tokens: 118
total tokens: 166 num samples: 2 num padding tokens: 16 - rank: 4 max len: 83 min len: 67 avg len: 75.0 num_loss_counted_tokens: 85
total tokens: 158 num samples: 2 num padding tokens: 20 - rank: 1 max len: 79 min len: 59 avg len: 69.0 num_loss_counted_tokens: 66
total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 0 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 66
total tokens: 132 num samples: 2 num padding tokens: 23 - rank: 0 max len: 66 min len: 43 avg len: 54.5 num_loss_counted_tokens: 45
total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 4 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 78
total tokens: 114 num samples: 2 num padding tokens: 11 - rank: 6 max len: 57 min len: 46 avg len: 51.5 num_loss_counted_tokens: 59
total tokens: 100 num samples: 2 num padding tokens: 1 - rank: 4 max len: 50 min len: 49 avg len: 49.5 num_loss_counted_tokens: 49
total tokens: 126 num samples: 2 num padding tokens: 9 - rank: 1 max len: 63 min len: 54 avg len: 58.5 num_loss_counted_tokens: 63
total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 4 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 100
total tokens: 134 num samples: 2 num padding tokens: 15 - rank: 0 max len: 67 min len: 52 avg len: 59.5 num_loss_counted_tokens: 57
total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 2 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 73
total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 1 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 87
total tokens: 202 num samples: 2 num padding tokens: 46 - rank: 4 max len: 101 min len: 55 avg len: 78.0 num_loss_counted_tokens: 106
total tokens: 162 num samples: 2 num padding tokens: 2 - rank: 0 max len: 81 min len: 79 avg len: 80.0 num_loss_counted_tokens: 93
total tokens: 174 num samples: 2 num padding tokens: 32 - rank: 2 max len: 87 min len: 55 avg len: 71.0 num_loss_counted_tokens: 73
total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 2 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 73
total tokens: 152 num samples: 2 num padding tokens: 13 - rank: 0 max len: 76 min len: 63 avg len: 69.5 num_loss_counted_tokens: 87
total tokens: 138 num samples: 2 num padding tokens: 7 - rank: 4 max len: 69 min len: 62 avg len: 65.5 num_loss_counted_tokens: 77
total tokens: 208 num samples: 2 num padding tokens: 43 - rank: 1 max len: 104 min len: 61 avg len: 82.5 num_loss_counted_tokens: 108
total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 4 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 65
total tokens: 146 num samples: 2 num padding tokens: 6 - rank: 2 max len: 73 min len: 67 avg len: 70.0 num_loss_counted_tokens: 66
total tokens: 124 num samples: 2 num padding tokens: 13 - rank: 1 max len: 62 min len: 49 avg len: 55.5 num_loss_counted_tokens: 54
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 69
total tokens: 228 num samples: 2 num padding tokens: 38 - rank: 4 max len: 114 min len: 76 avg len: 95.0 num_loss_counted_tokens: 129
total tokens: 132 num samples: 2 num padding tokens: 1 - rank: 2 max len: 66 min len: 65 avg len: 65.5 num_loss_counted_tokens: 52
total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 2 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 60
total tokens: 126 num samples: 2 num padding tokens: 3 - rank: 2 max len: 63 min len: 60 avg len: 61.5 num_loss_counted_tokens: 61
total tokens: 216 num samples: 2 num padding tokens: 49 - rank: 2 max len: 108 min len: 59 avg len: 83.5 num_loss_counted_tokens: 103
total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 2 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 64
total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 7 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 64
total tokens: 244 num samples: 2 num padding tokens: 78 - rank: 6 max len: 122 min len: 44 avg len: 83.0 num_loss_counted_tokens: 115
total tokens: 214 num samples: 2 num padding tokens: 62 - rank: 5 max len: 107 min len: 45 avg len: 76.0 num_loss_counted_tokens: 99
total tokens: 200 num samples: 2 num padding tokens: 47 - rank: 1 max len: 100 min len: 53 avg len: 76.5 num_loss_counted_tokens: 90
total tokens: 144 num samples: 2 num padding tokens: 14 - rank: 1 max len: 72 min len: 58 avg len: 65.0 num_loss_counted_tokens: 84
total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 1 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 69
total tokens: 96 num samples: 2 num padding tokens: 4 - rank: 1 max len: 48 min len: 44 avg len: 46.0 num_loss_counted_tokens: 46
total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 1 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 60
total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 5 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 73
total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 5 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 82
total tokens: 226 num samples: 2 num padding tokens: 6 - rank: 5 max len: 113 min len: 107 avg len: 110.0 num_loss_counted_tokens: 142
total tokens: 148 num samples: 2 num padding tokens: 28 - rank: 5 max len: 74 min len: 46 avg len: 60.0 num_loss_counted_tokens: 63
total tokens: 168 num samples: 2 num padding tokens: 33 - rank: 5 max len: 84 min len: 51 avg len: 67.5 num_loss_counted_tokens: 89
total tokens: 162 num samples: 2 num padding tokens: 29 - rank: 5 max len: 81 min len: 52 avg len: 66.5 num_loss_counted_tokens: 72
total tokens: 186 num samples: 2 num padding tokens: 29 - rank: 1 max len: 93 min len: 64 avg len: 78.5 num_loss_counted_tokens: 76
total tokens: 140 num samples: 2 num padding tokens: 6 - rank: 5 max len: 70 min len: 64 avg len: 67.0 num_loss_counted_tokens: 59
total tokens: 156 num samples: 2 num padding tokens: 6 - rank: 5 max len: 78 min len: 72 avg len: 75.0 num_loss_counted_tokens: 81
total tokens: 174 num samples: 2 num padding tokens: 29 - rank: 1 max len: 87 min len: 58 avg len: 72.5 num_loss_counted_tokens: 84
total tokens: 124 num samples: 2 num padding tokens: 13 - rank: 3 max len: 62 min len: 49 avg len: 55.5 num_loss_counted_tokens: 55
total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 3 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 69
total tokens: 132 num samples: 2 num padding tokens: 16 - rank: 3 max len: 66 min len: 50 avg len: 58.0 num_loss_counted_tokens: 61
total tokens: 140 num samples: 2 num padding tokens: 19 - rank: 3 max len: 70 min len: 51 avg len: 60.5 num_loss_counted_tokens: 55
total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 3 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 71
total tokens: 104 num samples: 2 num padding tokens: 6 - rank: 5 max len: 52 min len: 46 avg len: 49.0 num_loss_counted_tokens: 52
total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 3 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 100
total tokens: 160 num samples: 2 num padding tokens: 14 - rank: 3 max len: 80 min len: 66 avg len: 73.0 num_loss_counted_tokens: 83
total tokens: 130 num samples: 2 num padding tokens: 12 - rank: 3 max len: 65 min len: 53 avg len: 59.0 num_loss_counted_tokens: 64
total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 3 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 77
total tokens: 188 num samples: 2 num padding tokens: 26 - rank: 3 max len: 94 min len: 68 avg len: 81.0 num_loss_counted_tokens: 85
total tokens: 188 num samples: 2 num padding tokens: 14 - rank: 2 max len: 94 min len: 80 avg len: 87.0 num_loss_counted_tokens: 116
total tokens: 176 num samples: 2 num padding tokens: 25 - rank: 3 max len: 88 min len: 63 avg len: 75.5 num_loss_counted_tokens: 82
total tokens: 162 num samples: 2 num padding tokens: 26 - rank: 3 max len: 81 min len: 55 avg len: 68.0 num_loss_counted_tokens: 85
Per-token loss scaled by world size: 0.0007222609710879624
Per-token loss scaled by world size: 0.0011062632547691464
Per-token loss scaled by world size: 0.0014970493502914906
Per-token loss scaled by world size: 0.0006512971594929695
Per-token loss scaled by world size: 0.005336429923772812
Per-token loss scaled by world size: 0.0008554465603083372
Per-token loss scaled by world size: 0.002156679518520832
Epoch: 4, Step: 49, Rank: 5, loss = 0.36687955260276794
Epoch: 4, Step: 49, Rank: 1, loss = 0.10292214155197144
Epoch: 4, Step: 49, Rank: 4, loss = 0.0760556012392044
Epoch: 4, Step: 49, Rank: 6, loss = 0.14827170968055725
Epoch: 4, Step: 49, Rank: 0, loss = 0.04965544119477272
Epoch: 4, Step: 49, Rank: 7, loss = 0.05881195142865181
Epoch: 4, Step: 49, Rank: 2, loss = 0.04477667808532715
Per-token loss scaled by world size: 0.0012891377555206418
Epoch: 4, Step: 49, Rank: 3, loss = 0.08862821757793427
[2024-07-27 20:05:43,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[1.7012367842724887e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:43,557] [INFO] [timer.py:258:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=31.626881887830034, CurrSamplesPerSec=30.459502835854497, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 8%|▊ | 1/12 [00:00<00:10, 1.09it/s]
{
"epoch": 4,
"step": 49,
"rank": 0,
"loss": 0.04965544119477272,
"overall_throughput": 30.35780108269395,
"lr": 1.7012367842724887e-05,
"cuda_mem_allocated": 21.996244430541992,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 550,
"batch_size": 16,
"total_loss": 0.11700016260147095,
"gradnorm": 1.7901870012283325,
"weight_norm": 393.46917724609375,
"timestamp": "2024-07-27T20:05:43.608424"
}
Per-token loss scaled by world size: 0.0014621953014284372
Per-token loss scaled by world size: 0.0015464453026652336
Per-token loss scaled by world size: 0.0014793629525229335
Per-token loss scaled by world size: 0.0010739548597484827
Per-token loss scaled by world size: 0.002221300033852458
Per-token loss scaled by world size: 0.001030008657835424
Per-token loss scaled by world size: 0.0031245022546499968
Epoch: 4, Step: 50, Rank: 4, loss = 0.0959736704826355Epoch: 4, Step: 50, Rank: 5, loss = 0.10032563656568527Epoch: 4, Step: 50, Rank: 0, loss = 0.14410683512687683
Epoch: 4, Step: 50, Rank: 1, loss = 0.06682181358337402Epoch: 4, Step: 50, Rank: 3, loss = 0.06967282295227051
Epoch: 4, Step: 50, Rank: 6, loss = 0.09485992044210434
Epoch: 4, Step: 50, Rank: 7, loss = 0.2027020901441574
Per-token loss scaled by world size: 0.0011166962794959545
Epoch: 4, Step: 50, Rank: 2, loss = 0.07244566828012466
[2024-07-27 20:05:44,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[1.6772815716257414e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:44,101] [INFO] [timer.py:258:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=31.64645160925237, CurrSamplesPerSec=32.594364979528, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 17%|█▋ | 2/12 [00:01<00:06, 1.43it/s]
{
"epoch": 4,
"step": 50,
"rank": 0,
"loss": 0.14410683512687683,
"overall_throughput": 32.48167592990112,
"lr": 1.6772815716257414e-05,
"cuda_mem_allocated": 21.999523639678955,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 519,
"batch_size": 16,
"total_loss": 0.105863556265831,
"gradnorm": 2.59075927734375,
"weight_norm": 393.4695129394531,
"timestamp": "2024-07-27T20:05:44.148907"
}
Per-token loss scaled by world size: 0.0009595813462510705
Per-token loss scaled by world size: 0.0007476079626940191
Per-token loss scaled by world size: 0.002177697606384754
Per-token loss scaled by world size: 0.00161154440138489
Per-token loss scaled by world size: 0.002184153301641345
Per-token loss scaled by world size: 0.0022782967425882816
Per-token loss scaled by world size: 0.002697640098631382
Epoch: 4, Step: 51, Rank: 1, loss = 0.12066438794136047
Epoch: 4, Step: 51, Rank: 7, loss = 0.16305510699748993
Epoch: 4, Step: 51, Rank: 3, loss = 0.055977147072553635
Epoch: 4, Step: 51, Rank: 2, loss = 0.16353848576545715
Epoch: 4, Step: 51, Rank: 0, loss = 0.07184865325689316
Epoch: 4, Step: 51, Rank: 6, loss = 0.20198580622673035
Epoch: 4, Step: 51, Rank: 5, loss = 0.1705874651670456
Per-token loss scaled by world size: 0.001197479316033423
Epoch: 4, Step: 51, Rank: 4, loss = 0.08966126292943954
[2024-07-27 20:05:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[1.6525857615241686e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:44,647] [INFO] [timer.py:258:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=31.66021704434534, CurrSamplesPerSec=32.33534113615524, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 25%|██▌ | 3/12 [00:02<00:05, 1.59it/s]
{
"epoch": 4,
"step": 51,
"rank": 0,
"loss": 0.07184865325689316,
"overall_throughput": 32.279271652634336,
"lr": 1.6525857615241686e-05,
"cuda_mem_allocated": 22.002862453460693,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 599,
"batch_size": 16,
"total_loss": 0.12966477870941162,
"gradnorm": 2.6400396823883057,
"weight_norm": 393.4698486328125,
"timestamp": "2024-07-27T20:05:44.689190"
}
Per-token loss scaled by world size: 0.001371016027405858
Per-token loss scaled by world size: 0.0010015949374064803
Per-token loss scaled by world size: 0.0019693197682499886
Per-token loss scaled by world size: 0.00044975956552661955
Per-token loss scaled by world size: 0.0015600252663716674
Per-token loss scaled by world size: 0.0014032198814675212
Per-token loss scaled by world size: 0.0006641225190833211
Epoch: 4, Step: 52, Rank: 4, loss = 0.08450957387685776
Epoch: 4, Step: 52, Rank: 0, loss = 0.11567948013544083
Epoch: 4, Step: 52, Rank: 6, loss = 0.16616135835647583
Epoch: 4, Step: 52, Rank: 5, loss = 0.1316271275281906
Epoch: 4, Step: 52, Rank: 2, loss = 0.03794846311211586
Epoch: 4, Step: 52, Rank: 1, loss = 0.11839667707681656
Epoch: 4, Step: 52, Rank: 7, loss = 0.05603533610701561
Per-token loss scaled by world size: 0.0015486030606552958
Epoch: 4, Step: 52, Rank: 3, loss = 0.13066338002681732
[2024-07-27 20:05:45,125] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[1.6271763584735373e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:45,202] [INFO] [timer.py:258:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=31.653990957078555, CurrSamplesPerSec=31.351883784434047, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 33%|███▎ | 4/12 [00:02<00:04, 1.67it/s]
{
"epoch": 4,
"step": 52,
"rank": 0,
"loss": 0.11567948013544083,
"overall_throughput": 31.298703164249382,
"lr": 1.6271763584735373e-05,
"cuda_mem_allocated": 22.004770278930664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 675,
"batch_size": 16,
"total_loss": 0.10512767732143402,
"gradnorm": 2.028604745864868,
"weight_norm": 393.47015380859375,
"timestamp": "2024-07-27T20:05:45.245073"
}
Per-token loss scaled by world size: 0.0023022103123366833
Per-token loss scaled by world size: 0.002660317113623023
Per-token loss scaled by world size: 0.0015502030728384852
Per-token loss scaled by world size: 0.001655052648857236
Per-token loss scaled by world size: 0.0008553997613489628
Per-token loss scaled by world size: 0.002113129710778594
Per-token loss scaled by world size: 0.0022639036178588867
Epoch: 4, Step: 53, Rank: 6, loss = 0.19387060403823853
Epoch: 4, Step: 53, Rank: 0, loss = 0.16777357459068298
Epoch: 4, Step: 53, Rank: 1, loss = 0.06233725696802139
Epoch: 4, Step: 53, Rank: 2, loss = 0.11297105252742767
Epoch: 4, Step: 53, Rank: 5, loss = 0.12061195820569992
Epoch: 4, Step: 53, Rank: 3, loss = 0.16498197615146637
Epoch: 4, Step: 53, Rank: 4, loss = 0.15399432182312012
Per-token loss scaled by world size: 0.0017511562909930944
Epoch: 4, Step: 53, Rank: 7, loss = 0.12761551141738892
[2024-07-27 20:05:45,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[1.6010811472830253e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:45,749] [INFO] [timer.py:258:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=31.66602722465284, CurrSamplesPerSec=32.279737447919125, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 4: 42%|████▏ | 5/12 [00:03<00:04, 1.72it/s]
{
"epoch": 4,
"step": 53,
"rank": 0,
"loss": 0.16777357459068298,
"overall_throughput": 32.227156557192046,
"lr": 1.6010811472830253e-05,
"cuda_mem_allocated": 22.00548553466797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 583,
"batch_size": 16,
"total_loss": 0.13801953196525574,
"gradnorm": 2.3451616764068604,
"weight_norm": 393.4704895019531,
"timestamp": "2024-07-27T20:05:45.792018"
}
Per-token loss scaled by world size: 0.0016146524576470256
Per-token loss scaled by world size: 0.00018823673599399626
Per-token loss scaled by world size: 0.00015145067300181836
Per-token loss scaled by world size: 0.002239079447463155
Per-token loss scaled by world size: 0.0013640215620398521
Per-token loss scaled by world size: 0.0009340652031823993
Per-token loss scaled by world size: 0.002048594644293189
Epoch: 4, Step: 54, Rank: 4, loss = 0.017576605081558228
Epoch: 4, Step: 54, Rank: 0, loss = 0.15076817572116852
Epoch: 4, Step: 54, Rank: 6, loss = 0.12736551463603973
Epoch: 4, Step: 54, Rank: 5, loss = 0.20907405018806458
Epoch: 4, Step: 54, Rank: 2, loss = 0.014141706749796867
Epoch: 4, Step: 54, Rank: 7, loss = 0.08721833676099777
Epoch: 4, Step: 54, Rank: 3, loss = 0.19128753244876862
Per-token loss scaled by world size: 0.001376173458993435
Epoch: 4, Step: 54, Rank: 1, loss = 0.12850019335746765
[2024-07-27 20:05:46,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[1.5743286626829437e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:46,290] [INFO] [timer.py:258:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=31.675406097739614, CurrSamplesPerSec=32.16120844994824, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 50%|█████ | 6/12 [00:03<00:03, 1.76it/s]
{
"epoch": 4,
"step": 54,
"rank": 0,
"loss": 0.15076817572116852,
"overall_throughput": 32.07902156132976,
"lr": 1.5743286626829437e-05,
"cuda_mem_allocated": 22.004770278930664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 747,
"batch_size": 16,
"total_loss": 0.11574152112007141,
"gradnorm": 1.6529176235198975,
"weight_norm": 393.4708251953125,
"timestamp": "2024-07-27T20:05:46.332858"
}
Per-token loss scaled by world size: 0.0008495299844071269
Per-token loss scaled by world size: 0.002507910830900073
Per-token loss scaled by world size: 0.0028947019018232822
Per-token loss scaled by world size: 0.001476020785048604
Per-token loss scaled by world size: 0.001191351911984384
Per-token loss scaled by world size: 0.0018832029309123755
Per-token loss scaled by world size: 0.0013536742189899087
Epoch: 4, Step: 55, Rank: 1, loss = 0.22397755086421967
Epoch: 4, Step: 55, Rank: 6, loss = 0.19404959678649902
Epoch: 4, Step: 55, Rank: 7, loss = 0.09218085557222366
Epoch: 4, Step: 55, Rank: 0, loss = 0.06573238223791122
Epoch: 4, Step: 55, Rank: 2, loss = 0.14571282267570496
Epoch: 4, Step: 55, Rank: 3, loss = 0.11420710384845734
Epoch: 4, Step: 55, Rank: 4, loss = 0.10474054515361786
Per-token loss scaled by world size: 0.0010037233587354422
Epoch: 4, Step: 55, Rank: 5, loss = 0.07766309380531311
[2024-07-27 20:05:46,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[1.5469481581224274e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:05:46,825] [INFO] [timer.py:258:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=31.69138798284937, CurrSamplesPerSec=32.545268319935445, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 880
{
"epoch": 4,
"step": 55,
"rank": 0,
"loss": 0.06573238223791122,
"overall_throughput": 32.46340198605274,
"lr": 1.5469481581224274e-05,
"cuda_mem_allocated": 22.000000476837158,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 619,
"batch_size": 16,
"total_loss": 0.1272830069065094,
"gradnorm": 1.8899047374725342,
"weight_norm": 393.4711608886719,
"timestamp": "2024-07-27T20:05:46.828726"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_880
[20:06:04] INFO saving took 17.93557572364807 seconds utils.py:611
Epoch 4: 58%|█████▊ | 7/12 [00:22<00:32, 6.42s/it]
Per-token loss scaled by world size: 0.0008630760130472481
Per-token loss scaled by world size: 0.0010983350221067667
Per-token loss scaled by world size: 0.0021769509185105562
Per-token loss scaled by world size: 0.0004714219248853624
Per-token loss scaled by world size: 0.0017523688729852438
Per-token loss scaled by world size: 0.0024742181412875652
Epoch: 4, Step: 56, Rank: 0, loss = 0.042015478014945984
Epoch: 4, Step: 56, Rank: 4, loss = 0.09788911044597626
Per-token loss scaled by world size: 0.00015522913599852473
Epoch: 4, Step: 56, Rank: 1, loss = 0.2205146849155426
Epoch: 4, Step: 56, Rank: 2, loss = 0.19402074813842773
Epoch: 4, Step: 56, Rank: 7, loss = 0.07692164927721024
Epoch: 4, Step: 56, Rank: 3, loss = 0.15617987513542175
Epoch: 4, Step: 56, Rank: 5, loss = 0.013834796845912933
Per-token loss scaled by world size: 0.0007872319547459483
Epoch: 4, Step: 56, Rank: 6, loss = 0.07016205042600632
[2024-07-27 20:06:05,226] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[1.5189695737812153e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:05,304] [INFO] [timer.py:258:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=31.708933325115336, CurrSamplesPerSec=32.66747732319786, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 67%|██████▋ | 8/12 [00:22<00:18, 4.55s/it]
{
"epoch": 4,
"step": 56,
"rank": 0,
"loss": 0.042015478014945984,
"overall_throughput": 32.5988932413107,
"lr": 1.5189695737812153e-05,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 713,
"batch_size": 16,
"total_loss": 0.10894230008125305,
"gradnorm": 1.9275243282318115,
"weight_norm": 393.47149658203125,
"timestamp": "2024-07-27T20:06:05.346935"
}
Per-token loss scaled by world size: 0.0021620304323732853
Per-token loss scaled by world size: 0.0008936038357205689
Per-token loss scaled by world size: 0.0009625108214095235
Per-token loss scaled by world size: 0.0029216075781732798
Per-token loss scaled by world size: 0.0011535611702129245
Per-token loss scaled by world size: 0.0010705487802624702
Per-token loss scaled by world size: 0.001004268298856914
Epoch: 4, Step: 57, Rank: 0, loss = 0.060765061527490616
Epoch: 4, Step: 57, Rank: 1, loss = 0.07844215631484985
Epoch: 4, Step: 57, Rank: 4, loss = 0.14701807498931885
Epoch: 4, Step: 57, Rank: 6, loss = 0.06545073539018631
Epoch: 4, Step: 57, Rank: 5, loss = 0.07279732078313828
Epoch: 4, Step: 57, Rank: 2, loss = 0.19866931438446045
Epoch: 4, Step: 57, Rank: 7, loss = 0.06829024106264114
Per-token loss scaled by world size: 0.004088845103979111
Epoch: 4, Step: 57, Rank: 3, loss = 0.27804145216941833
[2024-07-27 20:06:05,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[1.4904235038305084e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:05,837] [INFO] [timer.py:258:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=31.726304642319306, CurrSamplesPerSec=32.69348184898873, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 75%|███████▌ | 9/12 [00:23<00:09, 3.29s/it]
{
"epoch": 4,
"step": 57,
"rank": 0,
"loss": 0.060765061527490616,
"overall_throughput": 32.6126440646996,
"lr": 1.4904235038305084e-05,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 544,
"batch_size": 16,
"total_loss": 0.12118428945541382,
"gradnorm": 1.7047128677368164,
"weight_norm": 393.4718322753906,
"timestamp": "2024-07-27T20:06:05.881923"
}
Per-token loss scaled by world size: 0.0010402144398540258
Per-token loss scaled by world size: 0.001066502882167697
Per-token loss scaled by world size: 0.0007541680242866278
Per-token loss scaled by world size: 0.001964687602594495
Per-token loss scaled by world size: 0.00040768564213067293
Per-token loss scaled by world size: 0.002232564380392432
Per-token loss scaled by world size: 0.004068903159350157
Epoch: 4, Step: 58, Rank: 7, loss = 0.08758654445409775
Epoch: 4, Step: 58, Rank: 1, loss = 0.06193605065345764
Epoch: 4, Step: 58, Rank: 0, loss = 0.08542761206626892
Epoch: 4, Step: 58, Rank: 3, loss = 0.16134996712207794
Epoch: 4, Step: 58, Rank: 5, loss = 0.33415865898132324
Epoch: 4, Step: 58, Rank: 4, loss = 0.1833493560552597
Epoch: 4, Step: 58, Rank: 2, loss = 0.033481184393167496
Per-token loss scaled by world size: 0.000749716826248914
Epoch: 4, Step: 58, Rank: 6, loss = 0.06157049536705017
[2024-07-27 20:06:06,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[1.461341162978688e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:06,388] [INFO] [timer.py:258:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=31.725638447536326, CurrSamplesPerSec=31.68904077052279, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 83%|████████▎ | 10/12 [00:23<00:04, 2.45s/it]
{
"epoch": 4,
"step": 58,
"rank": 0,
"loss": 0.08542761206626892,
"overall_throughput": 31.617913823118545,
"lr": 1.461341162978688e-05,
"cuda_mem_allocated": 22.002624034881592,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 657,
"batch_size": 16,
"total_loss": 0.12610748410224915,
"gradnorm": 2.7373974323272705,
"weight_norm": 393.47216796875,
"timestamp": "2024-07-27T20:06:06.428516"
}
Per-token loss scaled by world size: 0.0009614942828193307
Per-token loss scaled by world size: 0.0012739634839817882
Per-token loss scaled by world size: 0.0009918607538565993
Per-token loss scaled by world size: 0.001772751216776669
Per-token loss scaled by world size: 0.0035334480926394463
Per-token loss scaled by world size: 0.0007813825504854321
Per-token loss scaled by world size: 0.00042521810973994434
Epoch: 4, Step: 59, Rank: 0, loss = 0.07290176302194595
Epoch: 4, Step: 59, Rank: 6, loss = 0.09363631904125214
Epoch: 4, Step: 59, Rank: 4, loss = 0.1302972137928009
Epoch: 4, Step: 59, Rank: 7, loss = 0.07066982984542847
Epoch: 4, Step: 59, Rank: 5, loss = 0.259708434343338
Epoch: 4, Step: 59, Rank: 3, loss = 0.05743161588907242
Epoch: 4, Step: 59, Rank: 2, loss = 0.03125353157520294
Per-token loss scaled by world size: 0.0010259401751682162
Epoch: 4, Step: 59, Rank: 1, loss = 0.07540660351514816
[2024-07-27 20:06:06,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[1.4317543523384928e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:06,915] [INFO] [timer.py:258:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=31.746165669393985, CurrSamplesPerSec=32.93967877502177, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 92%|█████████▏| 11/12 [00:24<00:01, 1.86s/it]
{
"epoch": 4,
"step": 59,
"rank": 0,
"loss": 0.07290176302194595,
"overall_throughput": 32.85643000105753,
"lr": 1.4317543523384928e-05,
"cuda_mem_allocated": 21.999285221099854,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 588,
"batch_size": 16,
"total_loss": 0.09891317039728165,
"gradnorm": 1.6408778429031372,
"weight_norm": 393.47247314453125,
"timestamp": "2024-07-27T20:06:06.958893"
}
Per-token loss scaled by world size: 0.0008049748139455914
Per-token loss scaled by world size: 0.0006352822529152036
Per-token loss scaled by world size: 0.004135269671678543
Per-token loss scaled by world size: 0.0009236105252057314
Per-token loss scaled by world size: 0.0006417850963771343
Per-token loss scaled by world size: 0.0002449562889523804
Per-token loss scaled by world size: 0.002001277171075344
Epoch: 4, Step: 60, Rank: 3, loss = 0.04661383479833603
Epoch: 4, Step: 60, Rank: 4, loss = 0.05906502529978752
Epoch: 4, Step: 60, Rank: 1, loss = 0.047090981155633926
Epoch: 4, Step: 60, Rank: 7, loss = 0.06776992231607437
Epoch: 4, Step: 60, Rank: 5, loss = 0.01797366701066494
Epoch: 4, Step: 60, Rank: 0, loss = 0.3034254014492035
Epoch: 4, Step: 60, Rank: 6, loss = 0.14684371650218964
Per-token loss scaled by world size: 0.0006847438053227961
Epoch: 4, Step: 60, Rank: 2, loss = 0.0502430759370327
[2024-07-27 20:06:07,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[1.4016954246529697e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:07,452] [INFO] [timer.py:258:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=31.759205664862847, CurrSamplesPerSec=32.52061781981693, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 4: 100%|██████████| 12/12 [00:24<00:00, 1.46s/it]
{
"epoch": 4,
"step": 60,
"rank": 0,
"loss": 0.3034254014492035,
"overall_throughput": 32.436209882408896,
"lr": 1.4016954246529697e-05,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 587,
"batch_size": 16,
"total_loss": 0.09237820655107498,
"gradnorm": 1.7326298952102661,
"weight_norm": 393.4727478027344,
"timestamp": "2024-07-27T20:06:07.493757"
}
Epoch 4: 100%|██████████| 12/12 [00:24<00:00, 2.08s/it]
total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 1 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 55
total tokens: 144 num samples: 2 num padding tokens: 12 - rank: 4 max len: 72 min len: 60 avg len: 66.0 num_loss_counted_tokens: 69
total tokens: 132 num samples: 2 num padding tokens: 3 - rank: 7 max len: 66 min len: 63 avg len: 64.5 num_loss_counted_tokens: 61
total tokens: 152 num samples: 2 num padding tokens: 11 - rank: 1 max len: 76 min len: 65 avg len: 70.5 num_loss_counted_tokens: 66
total tokens: 200 num samples: 2 num padding tokens: 54 - rank: 7 max len: 100 min len: 46 avg len: 73.0 num_loss_counted_tokens: 89
total tokens: 168 num samples: 2 num padding tokens: 11 - rank: 7 max len: 84 min len: 73 avg len: 78.5 num_loss_counted_tokens: 96
total tokens: 90 num samples: 2 num padding tokens: 2 - rank: 7 max len: 45 min len: 43 avg len: 44.0 num_loss_counted_tokens: 38
total tokens: 154 num samples: 2 num padding tokens: 11 - rank: 4 max len: 77 min len: 66 avg len: 71.5 num_loss_counted_tokens: 80
total tokens: 144 num samples: 2 num padding tokens: 14 - rank: 7 max len: 72 min len: 58 avg len: 65.0 num_loss_counted_tokens: 84
total tokens: 148 num samples: 2 num padding tokens: 25 - rank: 7 max len: 74 min len: 49 avg len: 61.5 num_loss_counted_tokens: 64
total tokens: 138 num samples: 2 num padding tokens: 15 - rank: 7 max len: 69 min len: 54 avg len: 61.5 num_loss_counted_tokens: 79
total tokens: 160 num samples: 2 num padding tokens: 10 - rank: 1 max len: 80 min len: 70 avg len: 75.0 num_loss_counted_tokens: 94
total tokens: 134 num samples: 2 num padding tokens: 7 - rank: 7 max len: 67 min len: 60 avg len: 63.5 num_loss_counted_tokens: 75
total tokens: 140 num samples: 2 num padding tokens: 17 - rank: 1 max len: 70 min len: 53 avg len: 61.5 num_loss_counted_tokens: 68
total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 7 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 51
total tokens: 166 num samples: 2 num padding tokens: 14 - rank: 7 max len: 83 min len: 69 avg len: 76.0 num_loss_counted_tokens: 85
total tokens: 156 num samples: 2 num padding tokens: 14 - rank: 7 max len: 78 min len: 64 avg len: 71.0 num_loss_counted_tokens: 86
total tokens: 166 num samples: 2 num padding tokens: 21 - rank: 4 max len: 83 min len: 62 avg len: 72.5 num_loss_counted_tokens: 65
total tokens: 158 num samples: 2 num padding tokens: 19 - rank: 4 max len: 79 min len: 60 avg len: 69.5 num_loss_counted_tokens: 69
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 1 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 132 num samples: 2 num padding tokens: 2 - rank: 4 max len: 66 min len: 64 avg len: 65.0 num_loss_counted_tokens: 75
total tokens: 118 num samples: 2 num padding tokens: 7 - rank: 4 max len: 59 min len: 52 avg len: 55.5 num_loss_counted_tokens: 60
total tokens: 162 num samples: 2 num padding tokens: 21 - rank: 1 max len: 81 min len: 60 avg len: 70.5 num_loss_counted_tokens: 75
total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 1 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 58
total tokens: 134 num samples: 2 num padding tokens: 7 - rank: 4 max len: 67 min len: 60 avg len: 63.5 num_loss_counted_tokens: 55
total tokens: 128 num samples: 2 num padding tokens: 16 - rank: 4 max len: 64 min len: 48 avg len: 56.0 num_loss_counted_tokens: 49
total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 4 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 62
total tokens: 140 num samples: 2 num padding tokens: 7 - rank: 4 max len: 70 min len: 63 avg len: 66.5 num_loss_counted_tokens: 58
total tokens: 162 num samples: 2 num padding tokens: 24 - rank: 4 max len: 81 min len: 57 avg len: 69.0 num_loss_counted_tokens: 87
total tokens: 136 num samples: 2 num padding tokens: 7 - rank: 5 max len: 68 min len: 61 avg len: 64.5 num_loss_counted_tokens: 60
total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 4 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 73
total tokens: 158 num samples: 2 num padding tokens: 13 - rank: 7 max len: 79 min len: 66 avg len: 72.5 num_loss_counted_tokens: 72
total tokens: 180 num samples: 2 num padding tokens: 32 - rank: 0 max len: 90 min len: 58 avg len: 74.0 num_loss_counted_tokens: 118
total tokens: 152 num samples: 2 num padding tokens: 24 - rank: 0 max len: 76 min len: 52 avg len: 64.0 num_loss_counted_tokens: 71
total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 2 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 61
total tokens: 172 num samples: 2 num padding tokens: 42 - rank: 2 max len: 86 min len: 44 avg len: 65.0 num_loss_counted_tokens: 70
total tokens: 188 num samples: 2 num padding tokens: 42 - rank: 0 max len: 94 min len: 52 avg len: 73.0 num_loss_counted_tokens: 87
total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 5 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 62
total tokens: 214 num samples: 2 num padding tokens: 47 - rank: 5 max len: 107 min len: 60 avg len: 83.5 num_loss_counted_tokens: 128
total tokens: 214 num samples: 2 num padding tokens: 59 - rank: 5 max len: 107 min len: 48 avg len: 77.5 num_loss_counted_tokens: 106
total tokens: 208 num samples: 2 num padding tokens: 58 - rank: 5 max len: 104 min len: 46 avg len: 75.0 num_loss_counted_tokens: 99
total tokens: 186 num samples: 2 num padding tokens: 43 - rank: 1 max len: 93 min len: 50 avg len: 71.5 num_loss_counted_tokens: 79
total tokens: 120 num samples: 2 num padding tokens: 1 - rank: 1 max len: 60 min len: 59 avg len: 59.5 num_loss_counted_tokens: 64
total tokens: 116 num samples: 2 num padding tokens: 8 - rank: 5 max len: 58 min len: 50 avg len: 54.0 num_loss_counted_tokens: 58
total tokens: 164 num samples: 2 num padding tokens: 24 - rank: 1 max len: 82 min len: 58 avg len: 70.0 num_loss_counted_tokens: 95
total tokens: 180 num samples: 2 num padding tokens: 15 - rank: 1 max len: 90 min len: 75 avg len: 82.5 num_loss_counted_tokens: 107
total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 2 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 71
total tokens: 132 num samples: 2 num padding tokens: 17 - rank: 2 max len: 66 min len: 49 avg len: 57.5 num_loss_counted_tokens: 61
total tokens: 140 num samples: 2 num padding tokens: 12 - rank: 2 max len: 70 min len: 58 avg len: 64.0 num_loss_counted_tokens: 78
total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 2 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 61
total tokens: 160 num samples: 2 num padding tokens: 22 - rank: 5 max len: 80 min len: 58 avg len: 69.0 num_loss_counted_tokens: 75
total tokens: 228 num samples: 2 num padding tokens: 54 - rank: 5 max len: 114 min len: 60 avg len: 87.0 num_loss_counted_tokens: 122
total tokens: 118 num samples: 2 num padding tokens: 10 - rank: 5 max len: 59 min len: 49 avg len: 54.0 num_loss_counted_tokens: 54
total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 5 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 64
total tokens: 130 num samples: 2 num padding tokens: 10 - rank: 0 max len: 65 min len: 55 avg len: 60.0 num_loss_counted_tokens: 63
total tokens: 188 num samples: 2 num padding tokens: 46 - rank: 0 max len: 94 min len: 48 avg len: 71.0 num_loss_counted_tokens: 85
total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 5 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 99
total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 0 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 52
total tokens: 138 num samples: 2 num padding tokens: 24 - rank: 0 max len: 69 min len: 45 avg len: 57.0 num_loss_counted_tokens: 56
total tokens: 136 num samples: 2 num padding tokens: 18 - rank: 0 max len: 68 min len: 50 avg len: 59.0 num_loss_counted_tokens: 61
total tokens: 216 num samples: 2 num padding tokens: 45 - rank: 0 max len: 108 min len: 63 avg len: 85.5 num_loss_counted_tokens: 104
total tokens: 122 num samples: 2 num padding tokens: 4 - rank: 3 max len: 61 min len: 57 avg len: 59.0 num_loss_counted_tokens: 56
total tokens: 180 num samples: 2 num padding tokens: 24 - rank: 0 max len: 90 min len: 66 avg len: 78.0 num_loss_counted_tokens: 97
total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 3 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 80
total tokens: 282 num samples: 2 num padding tokens: 48 - rank: 3 max len: 141 min len: 93 avg len: 117.0 num_loss_counted_tokens: 204
total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 6 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 67
total tokens: 168 num samples: 2 num padding tokens: 24 - rank: 3 max len: 84 min len: 60 avg len: 72.0 num_loss_counted_tokens: 95
total tokens: 226 num samples: 2 num padding tokens: 32 - rank: 0 max len: 113 min len: 81 avg len: 97.0 num_loss_counted_tokens: 119
total tokens: 174 num samples: 2 num padding tokens: 32 - rank: 3 max len: 87 min len: 55 avg len: 71.0 num_loss_counted_tokens: 83
total tokens: 172 num samples: 2 num padding tokens: 4 - rank: 3 max len: 86 min len: 82 avg len: 84.0 num_loss_counted_tokens: 81
total tokens: 122 num samples: 2 num padding tokens: 8 - rank: 3 max len: 61 min len: 53 avg len: 57.0 num_loss_counted_tokens: 54
total tokens: 136 num samples: 2 num padding tokens: 2 - rank: 3 max len: 68 min len: 66 avg len: 67.0 num_loss_counted_tokens: 63
total tokens: 122 num samples: 2 num padding tokens: 8 - rank: 3 max len: 61 min len: 53 avg len: 57.0 num_loss_counted_tokens: 53
total tokens: 116 num samples: 2 num padding tokens: 1 - rank: 6 max len: 58 min len: 57 avg len: 57.5 num_loss_counted_tokens: 67
total tokens: 154 num samples: 2 num padding tokens: 4 - rank: 6 max len: 77 min len: 73 avg len: 75.0 num_loss_counted_tokens: 92
total tokens: 194 num samples: 2 num padding tokens: 45 - rank: 2 max len: 97 min len: 52 avg len: 74.5 num_loss_counted_tokens: 88
total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 2 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 79
total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 3 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 62
total tokens: 196 num samples: 2 num padding tokens: 35 - rank: 2 max len: 98 min len: 63 avg len: 80.5 num_loss_counted_tokens: 104
total tokens: 184 num samples: 2 num padding tokens: 37 - rank: 6 max len: 92 min len: 55 avg len: 73.5 num_loss_counted_tokens: 90
total tokens: 148 num samples: 2 num padding tokens: 13 - rank: 2 max len: 74 min len: 61 avg len: 67.5 num_loss_counted_tokens: 69
total tokens: 202 num samples: 2 num padding tokens: 50 - rank: 5 max len: 101 min len: 51 avg len: 76.0 num_loss_counted_tokens: 96
total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 2 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 53
total tokens: 128 num samples: 2 num padding tokens: 14 - rank: 6 max len: 64 min len: 50 avg len: 57.0 num_loss_counted_tokens: 58
total tokens: 128 num samples: 2 num padding tokens: 6 - rank: 6 max len: 64 min len: 58 avg len: 61.0 num_loss_counted_tokens: 55
total tokens: 176 num samples: 2 num padding tokens: 17 - rank: 3 max len: 88 min len: 71 avg len: 79.5 num_loss_counted_tokens: 98
total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 6 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 51
total tokens: 128 num samples: 2 num padding tokens: 9 - rank: 6 max len: 64 min len: 55 avg len: 59.5 num_loss_counted_tokens: 68
total tokens: 128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 64 min len: 62 avg len: 63.0 num_loss_counted_tokens: 80
total tokens: 134 num samples: 2 num padding tokens: 22 - rank: 6 max len: 67 min len: 45 avg len: 56.0 num_loss_counted_tokens: 57
total tokens: 166 num samples: 2 num padding tokens: 23 - rank: 6 max len: 83 min len: 60 avg len: 71.5 num_loss_counted_tokens: 90
total tokens: 142 num samples: 2 num padding tokens: 3 - rank: 6 max len: 71 min len: 68 avg len: 69.5 num_loss_counted_tokens: 70
total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 1 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 60
total tokens: 146 num samples: 2 num padding tokens: 29 - rank: 2 max len: 73 min len: 44 avg len: 58.5 num_loss_counted_tokens: 61
total tokens: 152 num samples: 2 num padding tokens: 22 - rank: 3 max len: 76 min len: 54 avg len: 65.0 num_loss_counted_tokens: 86
total tokens: 244 num samples: 2 num padding tokens: 36 - rank: 6 max len: 122 min len: 86 avg len: 104.0 num_loss_counted_tokens: 139
Per-token loss scaled by world size: 0.0025800205767154694
Per-token loss scaled by world size: 0.0006805358571000397
Per-token loss scaled by world size: 0.0009809250477701426
Per-token loss scaled by world size: 0.0011542694410309196
Per-token loss scaled by world size: 0.0011356660397723317
Per-token loss scaled by world size: 8.372703450731933e-05
Per-token loss scaled by world size: 0.0002341267536394298
Epoch: 5, Step: 61, Rank: 1, loss = 0.07038137316703796
Epoch: 5, Step: 61, Rank: 5, loss = 0.08281882852315903
Epoch: 5, Step: 61, Rank: 3, loss = 0.1851164698600769
Epoch: 5, Step: 61, Rank: 2, loss = 0.048828449100255966
Epoch: 5, Step: 61, Rank: 0, loss = 0.08148403465747833
Epoch: 5, Step: 61, Rank: 7, loss = 0.006007414776831865
Epoch: 5, Step: 61, Rank: 4, loss = 0.0167985949665308
Per-token loss scaled by world size: 0.0015900362050160766
Epoch: 5, Step: 61, Rank: 6, loss = 0.11408510059118271
[2024-07-27 20:06:08,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=61, skipped=0, lr=[1.3711972489182208e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:08,486] [INFO] [timer.py:258:stop] epoch=0/micro_step=61/global_step=61, RunningAvgSamplesPerSec=31.720545490554542, CurrSamplesPerSec=29.62867677087431, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 61,
"rank": 0,
"loss": 0.08148403465747833,
"overall_throughput": 29.523344117535782,
"lr": 1.3711972489182208e-05,
"cuda_mem_allocated": 22.004770278930664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 574,
"batch_size": 16,
"total_loss": 0.07569002360105515,
"gradnorm": 1.5541268587112427,
"weight_norm": 393.4730224609375,
"timestamp": "2024-07-27T20:06:08.490189"
}
Per-token loss scaled by world size: 0.0006772524793632329
Per-token loss scaled by world size: 0.0011761346831917763
Per-token loss scaled by world size: 0.0012851222418248653
Per-token loss scaled by world size: 0.0015470795333385468
Per-token loss scaled by world size: 0.001160036656074226
Per-token loss scaled by world size: 0.0007557208882644773
Per-token loss scaled by world size: 0.0015825566370040178
Epoch: 5, Step: 62, Rank: 1, loss = 0.0887981727719307
Epoch: 5, Step: 62, Rank: 4, loss = 0.09702672809362411
Epoch: 5, Step: 62, Rank: 2, loss = 0.11680450290441513
Epoch: 5, Step: 62, Rank: 0, loss = 0.05113256350159645
Epoch: 5, Step: 62, Rank: 7, loss = 0.05705692619085312
Epoch: 5, Step: 62, Rank: 5, loss = 0.11948302388191223
Epoch: 5, Step: 62, Rank: 6, loss = 0.08758276700973511
Per-token loss scaled by world size: 0.000659986340906471
Epoch: 5, Step: 62, Rank: 3, loss = 0.049828968942165375
[2024-07-27 20:06:08,981] [INFO] [logging.py:96:log_dist] [Rank 0] step=62, skipped=0, lr=[1.3402931744416432e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:09,058] [INFO] [timer.py:258:stop] epoch=0/micro_step=62/global_step=62, RunningAvgSamplesPerSec=31.69959766825067, CurrSamplesPerSec=30.510810812039587, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 62,
"rank": 0,
"loss": 0.05113256350159645,
"overall_throughput": 30.460830095500928,
"lr": 1.3402931744416432e-05,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 604,
"batch_size": 16,
"total_loss": 0.08346420526504517,
"gradnorm": 1.3599183559417725,
"weight_norm": 393.4732971191406,
"timestamp": "2024-07-27T20:06:09.061604"
}
Per-token loss scaled by world size: 0.0010878611356019974
Per-token loss scaled by world size: 0.0003478115249890834
Per-token loss scaled by world size: 0.0022740615531802177
Per-token loss scaled by world size: 0.0003918901493307203
Per-token loss scaled by world size: 0.002286511706188321
Per-token loss scaled by world size: 0.0006958367303013802
Per-token loss scaled by world size: 0.0007736408151686192
Epoch: 5, Step: 63, Rank: 2, loss = 0.03065088950097561
Epoch: 5, Step: 63, Rank: 5, loss = 0.2004016786813736
Epoch: 5, Step: 63, Rank: 0, loss = 0.2014988511800766
Epoch: 5, Step: 63, Rank: 1, loss = 0.06132061034440994
Epoch: 5, Step: 63, Rank: 4, loss = 0.03453531861305237
Epoch: 5, Step: 63, Rank: 3, loss = 0.09586776047945023
Epoch: 5, Step: 63, Rank: 6, loss = 0.06817709654569626
Per-token loss scaled by world size: 0.0005193906254135072
Epoch: 5, Step: 63, Rank: 7, loss = 0.04577130079269409
[2024-07-27 20:06:09,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=63, skipped=0, lr=[1.3090169943749475e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:09,591] [INFO] [timer.py:258:stop] epoch=0/micro_step=63/global_step=63, RunningAvgSamplesPerSec=31.71584398677769, CurrSamplesPerSec=32.72206448467118, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 63,
"rank": 0,
"loss": 0.2014988511800766,
"overall_throughput": 32.65121769063287,
"lr": 1.3090169943749475e-05,
"cuda_mem_allocated": 22.00572395324707,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 705,
"batch_size": 16,
"total_loss": 0.09227793663740158,
"gradnorm": 2.188631534576416,
"weight_norm": 393.4734802246094,
"timestamp": "2024-07-27T20:06:09.633806"
}
Per-token loss scaled by world size: 0.0009381945710629225
Per-token loss scaled by world size: 0.00045316756586544216
Per-token loss scaled by world size: 0.0005594724207185209
Per-token loss scaled by world size: 0.0003057016583625227
Per-token loss scaled by world size: 0.0005305999657139182
Per-token loss scaled by world size: 0.00392846018075943
Per-token loss scaled by world size: 0.0012796723749488592
Epoch: 5, Step: 64, Rank: 0, loss = 0.0768146812915802
Epoch: 5, Step: 64, Rank: 6, loss = 0.03710309416055679
Epoch: 5, Step: 64, Rank: 4, loss = 0.043442871421575546
Epoch: 5, Step: 64, Rank: 5, loss = 0.045806802809238434
Epoch: 5, Step: 64, Rank: 3, loss = 0.32164266705513
Epoch: 5, Step: 64, Rank: 1, loss = 0.025029323995113373
Epoch: 5, Step: 64, Rank: 2, loss = 0.10477317124605179
Per-token loss scaled by world size: 0.00109212682582438
Epoch: 5, Step: 64, Rank: 7, loss = 0.08941788226366043
[2024-07-27 20:06:10,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=64, skipped=0, lr=[1.2774029087618448e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:10,141] [INFO] [timer.py:258:stop] epoch=0/micro_step=64/global_step=64, RunningAvgSamplesPerSec=31.71981083700255, CurrSamplesPerSec=31.963679576668167, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 64,
"rank": 0,
"loss": 0.0768146812915802,
"overall_throughput": 31.90590190730254,
"lr": 1.2774029087618448e-05,
"cuda_mem_allocated": 21.99880838394165,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 655,
"batch_size": 16,
"total_loss": 0.09300381690263748,
"gradnorm": 1.6255645751953125,
"weight_norm": 393.4736633300781,
"timestamp": "2024-07-27T20:06:10.186141"
}
Per-token loss scaled by world size: 0.0019039374310523272
Per-token loss scaled by world size: 0.0005504547152668238
Per-token loss scaled by world size: 0.001485039945691824
Per-token loss scaled by world size: 0.0013535844627767801
Per-token loss scaled by world size: 0.000591020449064672
Per-token loss scaled by world size: 0.0014079039683565497
Epoch: 5, Step: 65, Rank: 3, loss = 0.042040977627038956
Epoch: 5, Step: 65, Rank: 0, loss = 0.11341992765665054
Epoch: 5, Step: 65, Rank: 7, loss = 0.10338001698255539
Epoch: 5, Step: 65, Rank: 4, loss = 0.04513918608427048
Epoch: 5, Step: 65, Rank: 1, loss = 0.14541321992874146
Per-token loss scaled by world size: 0.0026307932566851377
Epoch: 5, Step: 65, Rank: 5, loss = 0.20092684030532837
Epoch: 5, Step: 65, Rank: 2, loss = 0.10752866417169571
Per-token loss scaled by world size: 0.002622765488922596
Epoch: 5, Step: 65, Rank: 6, loss = 0.2003137171268463
[2024-07-27 20:06:10,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=65, skipped=0, lr=[1.2454854871407993e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:10,698] [INFO] [timer.py:258:stop] epoch=0/micro_step=65/global_step=65, RunningAvgSamplesPerSec=31.71541920662633, CurrSamplesPerSec=31.44549285353818, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 65,
"rank": 0,
"loss": 0.11341992765665054,
"overall_throughput": 31.39161267024543,
"lr": 1.2454854871407993e-05,
"cuda_mem_allocated": 22.00572395324707,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 611,
"batch_size": 16,
"total_loss": 0.11977030336856842,
"gradnorm": 1.5310957431793213,
"weight_norm": 393.4738464355469,
"timestamp": "2024-07-27T20:06:10.740662"
}
Per-token loss scaled by world size: 0.0006399019039236009
Per-token loss scaled by world size: 0.0005316854221746325
Per-token loss scaled by world size: 0.0012345308205112815
Per-token loss scaled by world size: 0.00044449279084801674
Per-token loss scaled by world size: 0.0006190972053445876
Per-token loss scaled by world size: 0.0016892498824745417
Epoch: 5, Step: 66, Rank: 0, loss = 0.03701859712600708
Epoch: 5, Step: 66, Rank: 4, loss = 0.030947810038924217
Epoch: 5, Step: 66, Rank: 7, loss = 0.0859542116522789
Epoch: 5, Step: 66, Rank: 5, loss = 0.1176140233874321
Epoch: 5, Step: 66, Rank: 2, loss = 0.04310464486479759
Epoch: 5, Step: 66, Rank: 1, loss = 0.04455316811800003
Per-token loss scaled by world size: 0.0005175694241188467
Epoch: 5, Step: 66, Rank: 3, loss = 0.036035772413015366
Per-token loss scaled by world size: 0.0009789945324882865
Epoch: 5, Step: 66, Rank: 6, loss = 0.06816249340772629
[2024-07-27 20:06:11,175] [INFO] [logging.py:96:log_dist] [Rank 0] step=66, skipped=0, lr=[1.213299630743747e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:11,253] [INFO] [timer.py:258:stop] epoch=0/micro_step=66/global_step=66, RunningAvgSamplesPerSec=31.711479482763888, CurrSamplesPerSec=31.465234804674058, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 1056
{
"epoch": 5,
"step": 66,
"rank": 0,
"loss": 0.03701859712600708,
"overall_throughput": 31.4168749534778,
"lr": 1.213299630743747e-05,
"cuda_mem_allocated": 22.009064197540283,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 557,
"batch_size": 16,
"total_loss": 0.05792384222149849,
"gradnorm": 1.2862759828567505,
"weight_norm": 393.4739685058594,
"timestamp": "2024-07-27T20:06:11.256198"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1056
[20:06:29] INFO saving took 17.98075246810913 seconds utils.py:611
Per-token loss scaled by world size: 0.0013491392601281404
Per-token loss scaled by world size: 0.0005269849789328873
Per-token loss scaled by world size: 0.004546341486275196
Per-token loss scaled by world size: 0.000828504154924303369s/it]
Per-token loss scaled by world size: 0.0013991020387038589
Per-token loss scaled by world size: 0.0006301040411926806
Per-token loss scaled by world size: 0.0007652370841242373
Epoch: 5, Step: 67, Rank: 1, loss = 0.06306988000869751
Epoch: 5, Step: 67, Rank: 2, loss = 0.34609025716781616
Epoch: 5, Step: 67, Rank: 4, loss = 0.10650664567947388
Epoch: 5, Step: 67, Rank: 0, loss = 0.10270322859287262
Epoch: 5, Step: 67, Rank: 3, loss = 0.04011673107743263
Epoch: 5, Step: 67, Rank: 7, loss = 0.047966670244932175
Epoch: 5, Step: 67, Rank: 5, loss = 0.05825367197394371
Per-token loss scaled by world size: 0.0012170199770480394
Epoch: 5, Step: 67, Rank: 6, loss = 0.09264564514160156
[2024-07-27 20:06:29,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=67, skipped=0, lr=[1.1808805343321102e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:29,799] [INFO] [timer.py:258:stop] epoch=0/micro_step=67/global_step=67, RunningAvgSamplesPerSec=31.70099434878959, CurrSamplesPerSec=31.044068891151483, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 67,
"rank": 0,
"loss": 0.10270322859287262,
"overall_throughput": 30.986774873645142,
"lr": 1.1808805343321102e-05,
"cuda_mem_allocated": 22.004770278930664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 609,
"batch_size": 16,
"total_loss": 0.10716909170150757,
"gradnorm": 1.8347676992416382,
"weight_norm": 393.4740905761719,
"timestamp": "2024-07-27T20:06:29.841377"
}
Per-token loss scaled by world size: 0.0020153727382421494
Per-token loss scaled by world size: 0.0002941747079603374
Per-token loss scaled by world size: 0.0007493247976526618
Per-token loss scaled by world size: 0.0010664670262485743
Per-token loss scaled by world size: 0.0009130059042945504
Per-token loss scaled by world size: 0.0006243651150725782
Per-token loss scaled by world size: 0.0005120610003359616
Epoch: 5, Step: 68, Rank: 0, loss = 0.168031707406044
Epoch: 5, Step: 68, Rank: 5, loss = 0.06247495487332344
Epoch: 5, Step: 68, Rank: 3, loss = 0.08891668915748596
Epoch: 5, Step: 68, Rank: 7, loss = 0.07612186670303345
Epoch: 5, Step: 68, Rank: 2, loss = 0.024526815861463547
Epoch: 5, Step: 68, Rank: 4, loss = 0.042693085968494415
Epoch: 5, Step: 68, Rank: 6, loss = 0.052056439220905304
Per-token loss scaled by world size: 0.0004039716732222587
Epoch: 5, Step: 68, Rank: 1, loss = 0.03368113934993744
[2024-07-27 20:06:30,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=68, skipped=0, lr=[1.148263647711842e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:30,343] [INFO] [timer.py:258:stop] epoch=0/micro_step=68/global_step=68, RunningAvgSamplesPerSec=31.70601140764558, CurrSamplesPerSec=32.035561937422905, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 68,
"rank": 0,
"loss": 0.168031707406044,
"overall_throughput": 31.980130143367383,
"lr": 1.148263647711842e-05,
"cuda_mem_allocated": 21.998568058013916,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 667,
"batch_size": 16,
"total_loss": 0.06856284290552139,
"gradnorm": 1.0227607488632202,
"weight_norm": 393.47418212890625,
"timestamp": "2024-07-27T20:06:30.390363"
}
Per-token loss scaled by world size: 0.0027562566101551056
Per-token loss scaled by world size: 0.0018840961856767535
Per-token loss scaled by world size: 0.0018555921269580722
Per-token loss scaled by world size: 0.0010745518375188112
Per-token loss scaled by world size: 0.0009415823733434081
Per-token loss scaled by world size: 0.0031567809637635946
Per-token loss scaled by world size: 0.0009981651091948152
Epoch: 5, Step: 69, Rank: 3, loss = 0.12411483377218246
Epoch: 5, Step: 69, Rank: 6, loss = 0.07078610360622406
Epoch: 5, Step: 69, Rank: 5, loss = 0.12223713099956512
Epoch: 5, Step: 69, Rank: 1, loss = 0.20795294642448425
Epoch: 5, Step: 69, Rank: 0, loss = 0.0620267391204834
Epoch: 5, Step: 69, Rank: 7, loss = 0.18156839907169342
Epoch: 5, Step: 69, Rank: 4, loss = 0.06575412303209305
Per-token loss scaled by world size: 0.00018431547505315393
Epoch: 5, Step: 69, Rank: 2, loss = 0.012141781859099865
[2024-07-27 20:06:30,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=69, skipped=0, lr=[1.1154846369695864e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:30,877] [INFO] [timer.py:258:stop] epoch=0/micro_step=69/global_step=69, RunningAvgSamplesPerSec=31.723357888730654, CurrSamplesPerSec=32.911763984671325, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 69,
"rank": 0,
"loss": 0.0620267391204834,
"overall_throughput": 32.829137857318266,
"lr": 1.1154846369695864e-05,
"cuda_mem_allocated": 21.999523639678955,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 527,
"batch_size": 16,
"total_loss": 0.10582275688648224,
"gradnorm": 2.0553536415100098,
"weight_norm": 393.4742431640625,
"timestamp": "2024-07-27T20:06:30.921842"
}
Per-token loss scaled by world size: 0.0010255835950374603
Per-token loss scaled by world size: 0.0015445285243913531
Per-token loss scaled by world size: 0.0007211874471977353
Per-token loss scaled by world size: 0.0005934142973273993
Per-token loss scaled by world size: 0.0017620512517169118
Per-token loss scaled by world size: 0.000368919427273795
Per-token loss scaled by world size: 0.0008395504555664957
Epoch: 5, Step: 70, Rank: 7, loss = 0.11294364929199219
Epoch: 5, Step: 70, Rank: 1, loss = 0.04339342191815376
Epoch: 5, Step: 70, Rank: 6, loss = 0.052736829966306686
Epoch: 5, Step: 70, Rank: 0, loss = 0.026977233588695526
Epoch: 5, Step: 70, Rank: 5, loss = 0.12884999811649323
Epoch: 5, Step: 70, Rank: 2, loss = 0.07499580085277557
Epoch: 5, Step: 70, Rank: 4, loss = 0.061392128467559814
Per-token loss scaled by world size: 0.0018547051586210728
Epoch: 5, Step: 70, Rank: 3, loss = 0.13562531769275665
[2024-07-27 20:06:31,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[1.0825793454723325e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:31,432] [INFO] [timer.py:258:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=31.71917367216476, CurrSamplesPerSec=31.441323528309383, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 70,
"rank": 0,
"loss": 0.026977233588695526,
"overall_throughput": 31.36866352554035,
"lr": 1.0825793454723325e-05,
"cuda_mem_allocated": 21.999762058258057,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 585,
"batch_size": 16,
"total_loss": 0.0796142965555191,
"gradnorm": 1.7012439966201782,
"weight_norm": 393.4743347167969,
"timestamp": "2024-07-27T20:06:31.476165"
}
Per-token loss scaled by world size: 0.0009147366508841515
Per-token loss scaled by world size: 0.0017351489514112473
Per-token loss scaled by world size: 0.0008338880725204945
Per-token loss scaled by world size: 0.00024312795721925795
Per-token loss scaled by world size: 0.0006241968367248774
Per-token loss scaled by world size: 0.00024290102010127157
Per-token loss scaled by world size: 0.0020128381438553333
Epoch: 5, Step: 71, Rank: 5, loss = 0.05847639963030815
Epoch: 5, Step: 71, Rank: 1, loss = 0.12167732417583466
Epoch: 5, Step: 71, Rank: 3, loss = 0.01704934798181057
Epoch: 5, Step: 71, Rank: 2, loss = 0.04377180337905884
Epoch: 5, Step: 71, Rank: 6, loss = 0.06414590775966644
Epoch: 5, Step: 71, Rank: 7, loss = 0.01703343354165554
Epoch: 5, Step: 71, Rank: 4, loss = 0.14115028083324432
Per-token loss scaled by world size: 0.0004984893603250384
Epoch: 5, Step: 71, Rank: 0, loss = 0.034956566989421844
[2024-07-27 20:06:31,890] [INFO] [logging.py:96:log_dist] [Rank 0] step=71, skipped=0, lr=[1.0495837546732224e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:31,968] [INFO] [timer.py:258:stop] epoch=0/micro_step=71/global_step=71, RunningAvgSamplesPerSec=31.731589241737456, CurrSamplesPerSec=32.599273292528906, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 71,
"rank": 0,
"loss": 0.034956566989421844,
"overall_throughput": 32.517245673379065,
"lr": 1.0495837546732224e-05,
"cuda_mem_allocated": 21.998329639434814,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 561,
"batch_size": 16,
"total_loss": 0.062282636761665344,
"gradnorm": 0.9715697765350342,
"weight_norm": 393.47442626953125,
"timestamp": "2024-07-27T20:06:32.016067"
}
Per-token loss scaled by world size: 0.0011429809965193272
Per-token loss scaled by world size: 0.0009149574325419962
Per-token loss scaled by world size: 0.0004773043910972774
Per-token loss scaled by world size: 0.0027895078528672457
Per-token loss scaled by world size: 0.004009348340332508
Per-token loss scaled by world size: 0.0015114195412024856
Per-token loss scaled by world size: 0.0012063757749274373
Epoch: 5, Step: 72, Rank: 6, loss = 0.34730979800224304
Epoch: 5, Step: 72, Rank: 2, loss = 0.07925818860530853
Epoch: 5, Step: 72, Rank: 0, loss = 0.04134649410843849
Epoch: 5, Step: 72, Rank: 5, loss = 0.09901072829961777
Epoch: 5, Step: 72, Rank: 3, loss = 0.10450230538845062
Epoch: 5, Step: 72, Rank: 7, loss = 0.130926713347435
Epoch: 5, Step: 72, Rank: 1, loss = 0.2416411191225052
Per-token loss scaled by world size: 0.0014137992402538657
Epoch: 5, Step: 72, Rank: 4, loss = 0.12247036397457123
[2024-07-27 20:06:32,427] [INFO] [logging.py:96:log_dist] [Rank 0] step=72, skipped=0, lr=[1.0165339447663586e-05], mom=[(0.9, 0.95)]
[2024-07-27 20:06:32,504] [INFO] [timer.py:258:stop] epoch=0/micro_step=72/global_step=72, RunningAvgSamplesPerSec=31.74744391309924, CurrSamplesPerSec=32.88104464616879, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 5,
"step": 72,
"rank": 0,
"loss": 0.04134649410843849,
"overall_throughput": 32.79027112653174,
"lr": 1.0165339447663586e-05,
"cuda_mem_allocated": 22.01025676727295,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 693,
"batch_size": 16,
"total_loss": 0.14580821990966797,
"gradnorm": 1.6911654472351074,
"weight_norm": 393.4745178222656,
"timestamp": "2024-07-27T20:06:32.547687"
}
Epoch 5: 100%|██████████| 12/12 [00:25<00:00, 2.09s/it]
total tokens: 214 num samples: 2 num padding tokens: 23 - rank: 1 max len: 107 min len: 84 avg len: 95.5 num_loss_counted_tokens: 132
total tokens: 282 num samples: 2 num padding tokens: 83 - rank: 6 max len: 141 min len: 58 avg len: 99.5 num_loss_counted_tokens: 145
total tokens: 144 num samples: 2 num padding tokens: 27 - rank: 7 max len: 72 min len: 45 avg len: 58.5 num_loss_counted_tokens: 73
total tokens: 118 num samples: 2 num padding tokens: 9 - rank: 1 max len: 59 min len: 50 avg len: 54.5 num_loss_counted_tokens: 51
total tokens: 172 num samples: 2 num padding tokens: 19 - rank: 7 max len: 86 min len: 67 avg len: 76.5 num_loss_counted_tokens: 75
total tokens: 148 num samples: 2 num padding tokens: 17 - rank: 0 max len: 74 min len: 57 avg len: 65.5 num_loss_counted_tokens: 73
total tokens: 138 num samples: 2 num padding tokens: 1 - rank: 7 max len: 69 min len: 68 avg len: 68.5 num_loss_counted_tokens: 57
total tokens: 106 num samples: 2 num padding tokens: 5 - rank: 1 max len: 53 min len: 48 avg len: 50.5 num_loss_counted_tokens: 46
total tokens: 160 num samples: 2 num padding tokens: 18 - rank: 0 max len: 80 min len: 62 avg len: 71.0 num_loss_counted_tokens: 81
total tokens: 174 num samples: 2 num padding tokens: 17 - rank: 7 max len: 87 min len: 70 avg len: 78.5 num_loss_counted_tokens: 77
total tokens: 164 num samples: 2 num padding tokens: 21 - rank: 7 max len: 82 min len: 61 avg len: 71.5 num_loss_counted_tokens: 92
total tokens: 188 num samples: 2 num padding tokens: 19 - rank: 0 max len: 94 min len: 75 avg len: 84.5 num_loss_counted_tokens: 99
total tokens: 138 num samples: 2 num padding tokens: 5 - rank: 2 max len: 69 min len: 64 avg len: 66.5 num_loss_counted_tokens: 70
total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 3 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 128
total tokens: 162 num samples: 2 num padding tokens: 18 - rank: 0 max len: 81 min len: 63 avg len: 72.0 num_loss_counted_tokens: 82
total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 6 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 54
total tokens: 128 num samples: 2 num padding tokens: 11 - rank: 3 max len: 64 min len: 53 avg len: 58.5 num_loss_counted_tokens: 67
total tokens: 214 num samples: 2 num padding tokens: 31 - rank: 1 max len: 107 min len: 76 avg len: 91.5 num_loss_counted_tokens: 117
total tokens: 200 num samples: 2 num padding tokens: 10 - rank: 7 max len: 100 min len: 90 avg len: 95.0 num_loss_counted_tokens: 151
total tokens: 244 num samples: 2 num padding tokens: 70 - rank: 6 max len: 122 min len: 52 avg len: 87.0 num_loss_counted_tokens: 113
total tokens: 140 num samples: 2 num padding tokens: 18 - rank: 0 max len: 70 min len: 52 avg len: 61.0 num_loss_counted_tokens: 75
total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 6 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 62
total tokens: 176 num samples: 2 num padding tokens: 8 - rank: 2 max len: 88 min len: 80 avg len: 84.0 num_loss_counted_tokens: 99
total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 7 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 65
total tokens: 152 num samples: 2 num padding tokens: 14 - rank: 7 max len: 76 min len: 62 avg len: 69.0 num_loss_counted_tokens: 83
total tokens: 208 num samples: 2 num padding tokens: 46 - rank: 7 max len: 104 min len: 58 avg len: 81.0 num_loss_counted_tokens: 107
total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 7 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 55
total tokens: 148 num samples: 2 num padding tokens: 10 - rank: 1 max len: 74 min len: 64 avg len: 69.0 num_loss_counted_tokens: 73
total tokens: 152 num samples: 2 num padding tokens: 5 - rank: 0 max len: 76 min len: 71 avg len: 73.5 num_loss_counted_tokens: 91
total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 6 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 78
total tokens: 168 num samples: 2 num padding tokens: 36 - rank: 2 max len: 84 min len: 48 avg len: 66.0 num_loss_counted_tokens: 72
total tokens: 130 num samples: 2 num padding tokens: 17 - rank: 0 max len: 65 min len: 48 avg len: 56.5 num_loss_counted_tokens: 62
total tokens: 186 num samples: 2 num padding tokens: 42 - rank: 0 max len: 93 min len: 51 avg len: 72.0 num_loss_counted_tokens: 96
total tokens: 140 num samples: 2 num padding tokens: 17 - rank: 0 max len: 70 min len: 53 avg len: 61.5 num_loss_counted_tokens: 61
total tokens: 166 num samples: 2 num padding tokens: 19 - rank: 0 max len: 83 min len: 64 avg len: 73.5 num_loss_counted_tokens: 76
total tokens: 104 num samples: 2 num padding tokens: 2 - rank: 0 max len: 52 min len: 50 avg len: 51.0 num_loss_counted_tokens: 61
total tokens: 122 num samples: 2 num padding tokens: 2 - rank: 7 max len: 61 min len: 59 avg len: 60.0 num_loss_counted_tokens: 62
total tokens: 188 num samples: 2 num padding tokens: 39 - rank: 2 max len: 94 min len: 55 avg len: 74.5 num_loss_counted_tokens: 95
total tokens: 102 num samples: 2 num padding tokens: 7 - rank: 2 max len: 51 min len: 44 avg len: 47.5 num_loss_counted_tokens: 53
total tokens: 146 num samples: 2 num padding tokens: 10 - rank: 6 max len: 73 min len: 63 avg len: 68.0 num_loss_counted_tokens: 72
total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 7 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 71
total tokens: 216 num samples: 2 num padding tokens: 21 - rank: 0 max len: 108 min len: 87 avg len: 97.5 num_loss_counted_tokens: 124
total tokens: 104 num samples: 2 num padding tokens: 8 - rank: 6 max len: 52 min len: 44 avg len: 48.0 num_loss_counted_tokens: 52
total tokens: 134 num samples: 2 num padding tokens: 8 - rank: 2 max len: 67 min len: 59 avg len: 63.0 num_loss_counted_tokens: 71
total tokens: 168 num samples: 2 num padding tokens: 25 - rank: 2 max len: 84 min len: 59 avg len: 71.5 num_loss_counted_tokens: 89
total tokens: 146 num samples: 2 num padding tokens: 19 - rank: 2 max len: 73 min len: 54 avg len: 63.5 num_loss_counted_tokens: 80
total tokens: 154 num samples: 2 num padding tokens: 27 - rank: 2 max len: 77 min len: 50 avg len: 63.5 num_loss_counted_tokens: 83
total tokens: 164 num samples: 2 num padding tokens: 36 - rank: 6 max len: 82 min len: 46 avg len: 64.0 num_loss_counted_tokens: 75
total tokens: 122 num samples: 2 num padding tokens: 10 - rank: 2 max len: 61 min len: 51 avg len: 56.0 num_loss_counted_tokens: 56
total tokens: 156 num samples: 2 num padding tokens: 20 - rank: 4 max len: 78 min len: 58 avg len: 68.0 num_loss_counted_tokens: 78
total tokens: 172 num samples: 2 num padding tokens: 32 - rank: 2 max len: 86 min len: 54 avg len: 70.0 num_loss_counted_tokens: 60
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 124 num samples: 2 num padding tokens: 18 - rank: 4 max len: 62 min len: 44 avg len: 53.0 num_loss_counted_tokens: 60
total tokens: 162 num samples: 2 num padding tokens: 19 - rank: 6 max len: 81 min len: 62 avg len: 71.5 num_loss_counted_tokens: 82
total tokens: 104 num samples: 2 num padding tokens: 6 - rank: 4 max len: 52 min len: 46 avg len: 49.0 num_loss_counted_tokens: 56
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 66
total tokens: 226 num samples: 2 num padding tokens: 48 - rank: 4 max len: 113 min len: 65 avg len: 89.0 num_loss_counted_tokens: 95
total tokens: 132 num samples: 2 num padding tokens: 12 - rank: 4 max len: 66 min len: 54 avg len: 60.0 num_loss_counted_tokens: 69
total tokens: 228 num samples: 2 num padding tokens: 17 - rank: 4 max len: 114 min len: 97 avg len: 105.5 num_loss_counted_tokens: 158
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 6 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 67
total tokens: 98 num samples: 2 num padding tokens: 3 - rank: 4 max len: 49 min len: 46 avg len: 47.5 num_loss_counted_tokens: 47
total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 3 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 70
total tokens: 196 num samples: 2 num padding tokens: 38 - rank: 3 max len: 98 min len: 60 avg len: 79.0 num_loss_counted_tokens: 112
total tokens: 142 num samples: 2 num padding tokens: 5 - rank: 3 max len: 71 min len: 66 avg len: 68.5 num_loss_counted_tokens: 75
total tokens: 120 num samples: 2 num padding tokens: 3 - rank: 3 max len: 60 min len: 57 avg len: 58.5 num_loss_counted_tokens: 59
total tokens: 110 num samples: 2 num padding tokens: 10 - rank: 4 max len: 55 min len: 45 avg len: 50.0 num_loss_counted_tokens: 59
total tokens: 116 num samples: 2 num padding tokens: 9 - rank: 3 max len: 58 min len: 49 avg len: 53.5 num_loss_counted_tokens: 57
total tokens: 126 num samples: 2 num padding tokens: 17 - rank: 5 max len: 63 min len: 46 avg len: 54.5 num_loss_counted_tokens: 52
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 5 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
total tokens: 180 num samples: 2 num padding tokens: 32 - rank: 6 max len: 90 min len: 58 avg len: 74.0 num_loss_counted_tokens: 99
total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 3 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 45
total tokens: 166 num samples: 2 num padding tokens: 38 - rank: 1 max len: 83 min len: 45 avg len: 64.0 num_loss_counted_tokens: 56
total tokens: 144 num samples: 2 num padding tokens: 4 - rank: 3 max len: 72 min len: 68 avg len: 70.0 num_loss_counted_tokens: 60
total tokens: 138 num samples: 2 num padding tokens: 0 - rank: 3 max len: 69 min len: 69 avg len: 69.0 num_loss_counted_tokens: 75
total tokens: 186 num samples: 2 num padding tokens: 12 - rank: 4 max len: 93 min len: 81 avg len: 87.0 num_loss_counted_tokens: 131
total tokens: 184 num samples: 2 num padding tokens: 31 - rank: 1 max len: 92 min len: 61 avg len: 76.5 num_loss_counted_tokens: 87
total tokens: 142 num samples: 2 num padding tokens: 4 - rank: 4 max len: 71 min len: 67 avg len: 69.0 num_loss_counted_tokens: 59
total tokens: 126 num samples: 2 num padding tokens: 20 - rank: 5 max len: 63 min len: 43 avg len: 53.0 num_loss_counted_tokens: 42
total tokens: 172 num samples: 2 num padding tokens: 16 - rank: 5 max len: 86 min len: 70 avg len: 78.0 num_loss_counted_tokens: 85
total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 1 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 63
total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 1 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 75
total tokens: 148 num samples: 2 num padding tokens: 13 - rank: 1 max len: 74 min len: 61 avg len: 67.5 num_loss_counted_tokens: 69
total tokens: 202 num samples: 2 num padding tokens: 11 - rank: 1 max len: 101 min len: 90 avg len: 95.5 num_loss_counted_tokens: 138
total tokens: 140 num samples: 2 num padding tokens: 10 - rank: 5 max len: 70 min len: 60 avg len: 65.0 num_loss_counted_tokens: 69
total tokens: 174 num samples: 2 num padding tokens: 27 - rank: 1 max len: 87 min len: 60 avg len: 73.5 num_loss_counted_tokens: 76
total tokens: 186 num samples: 2 num padding tokens: 31 - rank: 5 max len: 93 min len: 62 avg len: 77.5 num_loss_counted_tokens: 81
total tokens: 166 num samples: 2 num padding tokens: 33 - rank: 5 max len: 83 min len: 50 avg len: 66.5 num_loss_counted_tokens: 82
total tokens: 142 num samples: 2 num padding tokens: 16 - rank: 5 max len: 71 min len: 55 avg len: 63.0 num_loss_counted_tokens: 61
total tokens: 146 num samples: 2 num padding tokens: 28 - rank: 5 max len: 73 min len: 45 avg len: 59.0 num_loss_counted_tokens: 72
total tokens: 120 num samples: 2 num padding tokens: 11 - rank: 5 max len: 60 min len: 49 avg len: 54.5 num_loss_counted_tokens: 79
total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 5 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 69
total tokens: 126 num samples: 2 num padding tokens: 6 - rank: 6 max len: 63 min len: 57 avg len: 60.0 num_loss_counted_tokens: 66
total tokens: 122 num samples: 2 num padding tokens: 3 - rank: 3 max len: 61 min len: 58 avg len: 59.5 num_loss_counted_tokens: 69
total tokens: 132 num samples: 2 num padding tokens: 3 - rank: 2 max len: 66 min len: 63 avg len: 64.5 num_loss_counted_tokens: 70
total tokens: 140 num samples: 2 num padding tokens: 2 - rank: 4 max len: 70 min len: 68 avg len: 69.0 num_loss_counted_tokens: 55
total tokens: 134 num samples: 2 num padding tokens: 3 - rank: 5 max len: 67 min len: 64 avg len: 65.5 num_loss_counted_tokens: 61
Per-token loss scaled by world size: 0.0006204941309988499
Per-token loss scaled by world size: 0.0005415144260041416
Per-token loss scaled by world size: 0.0004509067512117326
Per-token loss scaled by world size: 7.763502799207345e-05
Per-token loss scaled by world size: 0.0008618941647000611
Per-token loss scaled by world size: 0.0005943336291238666
Per-token loss scaled by world size: 0.0004708097840193659
Epoch: 6, Step: 73, Rank: 6, loss = 0.04921012371778488
Epoch: 6, Step: 73, Rank: 5, loss = 0.00705508328974247
Epoch: 6, Step: 73, Rank: 3, loss = 0.05638740584254265
Epoch: 6, Step: 73, Rank: 2, loss = 0.0409761518239975
Epoch: 6, Step: 73, Rank: 1, loss = 0.04278483986854553
Epoch: 6, Step: 73, Rank: 7, loss = 0.07832463085651398
Epoch: 6, Step: 73, Rank: 0, loss = 0.054010067135095596
Per-token loss scaled by world size: 0.00044023498776368797
Epoch: 6, Step: 73, Rank: 4, loss = 0.040006354451179504
[2024-07-27 20:06:33,460] [INFO] [logging.py:96:log_dist] [Rank 0] step=73, skipped=0, lr=[9.834660552336415e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:33,537] [INFO] [timer.py:258:stop] epoch=0/micro_step=73/global_step=73, RunningAvgSamplesPerSec=31.690802156326086, CurrSamplesPerSec=28.172368613829672, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 8%|▊ | 1/12 [00:00<00:10, 1.06it/s]
{
"epoch": 6,
"step": 73,
"rank": 0,
"loss": 0.054010067135095596,
"overall_throughput": 28.063288365518915,
"lr": 9.834660552336415e-06,
"cuda_mem_allocated": 22.000954627990723,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 727,
"batch_size": 16,
"total_loss": 0.04609433189034462,
"gradnorm": 0.7181567549705505,
"weight_norm": 393.4746398925781,
"timestamp": "2024-07-27T20:06:33.581427"
}
Per-token loss scaled by world size: 0.00021588351228274405
Per-token loss scaled by world size: 0.0018644272349774837
Per-token loss scaled by world size: 0.0009222680237144232
Per-token loss scaled by world size: 0.0011992761865258217
Per-token loss scaled by world size: 0.00015600323968101293
Per-token loss scaled by world size: 0.0007281338912434876
Per-token loss scaled by world size: 0.0013443040661513805
Epoch: 6, Step: 74, Rank: 4, loss = 0.1323743313550949
Epoch: 6, Step: 74, Rank: 7, loss = 0.0516975075006485
Epoch: 6, Step: 74, Rank: 3, loss = 0.011076229624450207
Epoch: 6, Step: 74, Rank: 5, loss = 0.015327729284763336
Epoch: 6, Step: 74, Rank: 2, loss = 0.08514861017465591
Epoch: 6, Step: 74, Rank: 6, loss = 0.09544558823108673
Epoch: 6, Step: 74, Rank: 0, loss = 0.0654810294508934
Per-token loss scaled by world size: 0.0015726651763543487
Epoch: 6, Step: 74, Rank: 1, loss = 0.1116592288017273
[2024-07-27 20:06:34,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=74, skipped=0, lr=[9.504162453267776e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:34,080] [INFO] [timer.py:258:stop] epoch=0/micro_step=74/global_step=74, RunningAvgSamplesPerSec=31.69924367021435, CurrSamplesPerSec=32.31030745624361, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 17%|█▋ | 2/12 [00:01<00:07, 1.41it/s]
{
"epoch": 6,
"step": 74,
"rank": 0,
"loss": 0.0654810294508934,
"overall_throughput": 32.25508431115172,
"lr": 9.504162453267776e-06,
"cuda_mem_allocated": 22.002385139465332,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 568,
"batch_size": 16,
"total_loss": 0.0710262879729271,
"gradnorm": 1.143301010131836,
"weight_norm": 393.4747314453125,
"timestamp": "2024-07-27T20:06:34.123115"
}
Per-token loss scaled by world size: 0.0001349089725408703
Per-token loss scaled by world size: 0.0011630249209702015
Per-token loss scaled by world size: 0.0005098663968965411
Per-token loss scaled by world size: 0.001282830722630024
Per-token loss scaled by world size: 0.0009069825755432248
Per-token loss scaled by world size: 0.00048159470316022635
Per-token loss scaled by world size: 0.0003646048135124147
Epoch: 6, Step: 75, Rank: 5, loss = 0.035117048770189285
Epoch: 6, Step: 75, Rank: 3, loss = 0.08010333776473999
Epoch: 6, Step: 75, Rank: 4, loss = 0.08835496753454208
Epoch: 6, Step: 75, Rank: 6, loss = 0.009291855618357658
Epoch: 6, Step: 75, Rank: 2, loss = 0.06246842443943024
Epoch: 6, Step: 75, Rank: 7, loss = 0.025112155824899673
Epoch: 6, Step: 75, Rank: 0, loss = 0.033169835805892944
Per-token loss scaled by world size: 0.0011549023911356926
Epoch: 6, Step: 75, Rank: 1, loss = 0.07954390347003937
[2024-07-27 20:06:34,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=75, skipped=0, lr=[9.174206545276678e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:34,640] [INFO] [timer.py:258:stop] epoch=0/micro_step=75/global_step=75, RunningAvgSamplesPerSec=31.692720007171083, CurrSamplesPerSec=31.229969737456262, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 6: 25%|██▌ | 3/12 [00:02<00:05, 1.56it/s]
{
"epoch": 6,
"step": 75,
"rank": 0,
"loss": 0.033169835805892944,
"overall_throughput": 31.178925272918942,
"lr": 9.174206545276678e-06,
"cuda_mem_allocated": 22.00572395324707,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 551,
"batch_size": 16,
"total_loss": 0.0516451895236969,
"gradnorm": 1.016838788986206,
"weight_norm": 393.4748229980469,
"timestamp": "2024-07-27T20:06:34.682642"
}
Per-token loss scaled by world size: 0.0003609564446378499
Per-token loss scaled by world size: 0.00041217487887479365
Per-token loss scaled by world size: 0.0004959891666658223
Per-token loss scaled by world size: 0.00047398614697158337
Per-token loss scaled by world size: 0.0007203637505881488
Per-token loss scaled by world size: 0.0001487391273258254
Per-token loss scaled by world size: 0.0008504824945703149
Epoch: 6, Step: 76, Rank: 0, loss = 0.042779065668582916
Epoch: 6, Step: 76, Rank: 3, loss = 0.04088130593299866
Epoch: 6, Step: 76, Rank: 6, loss = 0.031132493168115616
Epoch: 6, Step: 76, Rank: 7, loss = 0.0621313713490963
Epoch: 6, Step: 76, Rank: 2, loss = 0.012828749604523182
Epoch: 6, Step: 76, Rank: 5, loss = 0.035550083965063095
Epoch: 6, Step: 76, Rank: 1, loss = 0.07335411757230759
Per-token loss scaled by world size: 0.0007280391291715205
Epoch: 6, Step: 76, Rank: 4, loss = 0.06279337406158447
[2024-07-27 20:06:35,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=76, skipped=0, lr=[8.84515363030414e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:35,189] [INFO] [timer.py:258:stop] epoch=0/micro_step=76/global_step=76, RunningAvgSamplesPerSec=31.692239680284427, CurrSamplesPerSec=31.657215099110317, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 33%|███▎ | 4/12 [00:02<00:04, 1.66it/s]
{
"epoch": 6,
"step": 76,
"rank": 0,
"loss": 0.042779065668582916,
"overall_throughput": 31.57946786919451,
"lr": 8.84515363030414e-06,
"cuda_mem_allocated": 22.002624034881592,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 690,
"batch_size": 16,
"total_loss": 0.04518131911754608,
"gradnorm": 1.2256078720092773,
"weight_norm": 393.47491455078125,
"timestamp": "2024-07-27T20:06:35.231428"
}
Per-token loss scaled by world size: 0.0005857766373082995
Per-token loss scaled by world size: 0.001119819818995893
Per-token loss scaled by world size: 0.0010905693052336574
Per-token loss scaled by world size: 0.00018508221546653658
Per-token loss scaled by world size: 0.0016458512982353568
Per-token loss scaled by world size: 0.00018191162962466478
Per-token loss scaled by world size: 0.00047674551024101675
Epoch: 6, Step: 77, Rank: 7, loss = 0.085386261343956
Epoch: 6, Step: 77, Rank: 3, loss = 0.12549616396427155
Epoch: 6, Step: 77, Rank: 6, loss = 0.01387076172977686
Epoch: 6, Step: 77, Rank: 0, loss = 0.04466547071933746
Epoch: 6, Step: 77, Rank: 4, loss = 0.014112519100308418
Epoch: 6, Step: 77, Rank: 1, loss = 0.08315590769052505
Epoch: 6, Step: 77, Rank: 5, loss = 0.03635184466838837
Per-token loss scaled by world size: 0.0008063243585638702
Epoch: 6, Step: 77, Rank: 2, loss = 0.06148223206400871
[2024-07-27 20:06:35,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=77, skipped=0, lr=[8.51736352288158e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:35,734] [INFO] [timer.py:258:stop] epoch=0/micro_step=77/global_step=77, RunningAvgSamplesPerSec=31.702193912096924, CurrSamplesPerSec=32.45657221649108, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 1232
{
"epoch": 6,
"step": 77,
"rank": 0,
"loss": 0.04466547071933746,
"overall_throughput": 32.40263184704888,
"lr": 8.51736352288158e-06,
"cuda_mem_allocated": 22.000000476837158,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 610,
"batch_size": 16,
"total_loss": 0.05806514620780945,
"gradnorm": 1.030696988105774,
"weight_norm": 393.47503662109375,
"timestamp": "2024-07-27T20:06:35.737358"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1232
[20:06:53] INFO saving took 18.036810636520386 seconds utils.py:611
Epoch 6: 42%|████▏ | 5/12 [00:21<00:49, 7.09s/it]
Per-token loss scaled by world size: 0.0021851949859410524
Per-token loss scaled by world size: 0.0004372596740722656
Per-token loss scaled by world size: 0.0008908362942747772
Per-token loss scaled by world size: 0.00043337256647646427
Per-token loss scaled by world size: 0.0002932958595920354
Per-token loss scaled by world size: 0.0002709754917304963
Per-token loss scaled by world size: 0.0006071476964280009
Epoch: 6, Step: 78, Rank: 1, loss = 0.07071013003587723
Epoch: 6, Step: 78, Rank: 0, loss = 0.034707486629486084
Epoch: 6, Step: 78, Rank: 4, loss = 0.03439894691109657
Epoch: 6, Step: 78, Rank: 3, loss = 0.02328035794198513
Epoch: 6, Step: 78, Rank: 5, loss = 0.021508680656552315
Epoch: 6, Step: 78, Rank: 6, loss = 0.048192348331213
Epoch: 6, Step: 78, Rank: 7, loss = 0.17344985902309418
Per-token loss scaled by world size: 0.0011103027500212193
Epoch: 6, Step: 78, Rank: 2, loss = 0.08813028037548065
[2024-07-27 20:06:54,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=78, skipped=0, lr=[8.191194656678905e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:54,330] [INFO] [timer.py:258:stop] epoch=0/micro_step=78/global_step=78, RunningAvgSamplesPerSec=31.696677826343805, CurrSamplesPerSec=31.288371681003333, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 50%|█████ | 6/12 [00:21<00:29, 4.87s/it]
{
"epoch": 6,
"step": 78,
"rank": 0,
"loss": 0.034707486629486084,
"overall_throughput": 31.230609214279298,
"lr": 8.191194656678905e-06,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 635,
"batch_size": 16,
"total_loss": 0.061797261238098145,
"gradnorm": 1.2869224548339844,
"weight_norm": 393.47509765625,
"timestamp": "2024-07-27T20:06:54.372620"
}
Per-token loss scaled by world size: 0.0011720252223312855
Per-token loss scaled by world size: 0.0006744764395989478
Per-token loss scaled by world size: 0.0009440272697247565
Per-token loss scaled by world size: 0.0029248518403619528
Per-token loss scaled by world size: 0.0009407943580299616
Per-token loss scaled by world size: 0.0013611947651952505
Per-token loss scaled by world size: 0.0014305550139397383
Epoch: 6, Step: 79, Rank: 7, loss = 0.04670749232172966
Epoch: 6, Step: 79, Rank: 0, loss = 0.08116274327039719
Epoch: 6, Step: 79, Rank: 4, loss = 0.06537389010190964
Epoch: 6, Step: 79, Rank: 1, loss = 0.20254598557949066
Epoch: 6, Step: 79, Rank: 2, loss = 0.06515000760555267
Epoch: 6, Step: 79, Rank: 5, loss = 0.0942627340555191
Epoch: 6, Step: 79, Rank: 6, loss = 0.09906593710184097
Per-token loss scaled by world size: 0.000943321269005537
Epoch: 6, Step: 79, Rank: 3, loss = 0.06532499939203262
[2024-07-27 20:06:54,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=79, skipped=0, lr=[7.867003692562533e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:54,883] [INFO] [timer.py:258:stop] epoch=0/micro_step=79/global_step=79, RunningAvgSamplesPerSec=31.69457460454822, CurrSamplesPerSec=31.53554234673331, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 58%|█████▊ | 7/12 [00:22<00:17, 3.46s/it]
{
"epoch": 6,
"step": 79,
"rank": 0,
"loss": 0.08116274327039719,
"overall_throughput": 31.485415170212256,
"lr": 7.867003692562533e-06,
"cuda_mem_allocated": 21.996094703674316,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 554,
"batch_size": 16,
"total_loss": 0.08994922786951065,
"gradnorm": 1.41256582736969,
"weight_norm": 393.4751892089844,
"timestamp": "2024-07-27T20:06:54.931617"
}
Per-token loss scaled by world size: 0.0004768831713590771
Per-token loss scaled by world size: 0.002107172505930066
Per-token loss scaled by world size: 0.0008781441720202565
Per-token loss scaled by world size: 0.0014709294773638248
Per-token loss scaled by world size: 0.00031639524968340993
Per-token loss scaled by world size: 0.0003654623869806528
Per-token loss scaled by world size: 0.000409139902330935
Epoch: 6, Step: 80, Rank: 6, loss = 0.155930757522583
Epoch: 6, Step: 80, Rank: 3, loss = 0.10884878039360046
Epoch: 6, Step: 80, Rank: 5, loss = 0.023413248360157013
Epoch: 6, Step: 80, Rank: 1, loss = 0.06498266756534576
Epoch: 6, Step: 80, Rank: 4, loss = 0.02704421617090702
Epoch: 6, Step: 80, Rank: 0, loss = 0.035289354622364044
Epoch: 6, Step: 80, Rank: 2, loss = 0.030276352539658546
Per-token loss scaled by world size: 0.002671802882105112
Epoch: 6, Step: 80, Rank: 7, loss = 0.19771341979503632
[2024-07-27 20:06:55,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[7.545145128592009e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:55,440] [INFO] [timer.py:258:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=31.6941207935923, CurrSamplesPerSec=31.65921633267696, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 6: 67%|██████▋ | 8/12 [00:22<00:10, 2.53s/it]
{
"epoch": 6,
"step": 80,
"rank": 0,
"loss": 0.035289354622364044,
"overall_throughput": 31.609841938694426,
"lr": 7.545145128592009e-06,
"cuda_mem_allocated": 22.009064197540283,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 592,
"batch_size": 16,
"total_loss": 0.08043734729290009,
"gradnorm": 1.2677600383758545,
"weight_norm": 393.4752502441406,
"timestamp": "2024-07-27T20:06:55.481943"
}
Per-token loss scaled by world size: 0.0005759032792411745
Per-token loss scaled by world size: 0.0009630320128053427
Per-token loss scaled by world size: 0.0008893606718629599
Per-token loss scaled by world size: 0.0010249739279970527
Per-token loss scaled by world size: 0.0008383162785321474
Per-token loss scaled by world size: 0.0007667160243727267
Per-token loss scaled by world size: 7.463712972821668e-05
Epoch: 6, Step: 81, Rank: 3, loss = 0.0855894684791565
Epoch: 6, Step: 81, Rank: 5, loss = 0.07904192805290222
Epoch: 6, Step: 81, Rank: 7, loss = 0.05118340253829956
Epoch: 6, Step: 81, Rank: 6, loss = 0.006633374840021133
Epoch: 6, Step: 81, Rank: 4, loss = 0.0681418851017952
Epoch: 6, Step: 81, Rank: 2, loss = 0.09109456092119217
Epoch: 6, Step: 81, Rank: 0, loss = 0.07450535893440247
Per-token loss scaled by world size: 0.0009560537873767316
Epoch: 6, Step: 81, Rank: 1, loss = 0.08496928215026855
[2024-07-27 20:06:55,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=81, skipped=0, lr=[7.225970912381557e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:55,984] [INFO] [timer.py:258:stop] epoch=0/micro_step=81/global_step=81, RunningAvgSamplesPerSec=31.696829857403074, CurrSamplesPerSec=31.90957327177327, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 6: 75%|███████▌ | 9/12 [00:23<00:05, 1.91s/it]
{
"epoch": 6,
"step": 81,
"rank": 0,
"loss": 0.07450535893440247,
"overall_throughput": 31.833316019435205,
"lr": 7.225970912381557e-06,
"cuda_mem_allocated": 22.00548553466797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 711,
"batch_size": 16,
"total_loss": 0.06764490157365799,
"gradnorm": 1.3872599601745605,
"weight_norm": 393.4753112792969,
"timestamp": "2024-07-27T20:06:56.027764"
}
Per-token loss scaled by world size: 0.001566625782288611
Per-token loss scaled by world size: 0.0001653370854910463
Per-token loss scaled by world size: 0.00041765952482819557
Per-token loss scaled by world size: 0.0008047792944125831
Per-token loss scaled by world size: 0.0015484013129025698
Per-token loss scaled by world size: 7.262427970999852e-05
Per-token loss scaled by world size: 0.00017705872596707195
Epoch: 6, Step: 82, Rank: 6, loss = 0.02996707148849964
Epoch: 6, Step: 82, Rank: 5, loss = 0.01186293549835682
Epoch: 6, Step: 82, Rank: 2, loss = 0.05774291232228279
Epoch: 6, Step: 82, Rank: 1, loss = 0.11240539699792862
Epoch: 6, Step: 82, Rank: 0, loss = 0.1110977977514267
Epoch: 6, Step: 82, Rank: 7, loss = 0.005210792180150747
Epoch: 6, Step: 82, Rank: 4, loss = 0.012703963555395603
Per-token loss scaled by world size: 5.9409892855910584e-05
Epoch: 6, Step: 82, Rank: 3, loss = 0.004262659698724747
[2024-07-27 20:06:56,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=82, skipped=0, lr=[6.909830056250527e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:56,522] [INFO] [timer.py:258:stop] epoch=0/micro_step=82/global_step=82, RunningAvgSamplesPerSec=31.70569941858613, CurrSamplesPerSec=32.42243510088761, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 83%|████████▎ | 10/12 [00:23<00:02, 1.49s/it]
{
"epoch": 6,
"step": 82,
"rank": 0,
"loss": 0.1110977977514267,
"overall_throughput": 32.336618768855516,
"lr": 6.909830056250527e-06,
"cuda_mem_allocated": 21.99880838394165,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 574,
"batch_size": 16,
"total_loss": 0.0431566946208477,
"gradnorm": 1.0986570119857788,
"weight_norm": 393.4753723144531,
"timestamp": "2024-07-27T20:06:56.567461"
}
Per-token loss scaled by world size: 0.0015930738300085068
Per-token loss scaled by world size: 0.0009168770629912615
Per-token loss scaled by world size: 0.0008305592346005142
Per-token loss scaled by world size: 0.0003735376812983304
Per-token loss scaled by world size: 0.0023468886502087116
Per-token loss scaled by world size: 0.0006343711283989251
Per-token loss scaled by world size: 0.000816680898424238
Epoch: 6, Step: 83, Rank: 1, loss = 0.0600554458796978
Epoch: 6, Step: 83, Rank: 7, loss = 0.05349259823560715
Epoch: 6, Step: 83, Rank: 5, loss = 0.05440162867307663
Epoch: 6, Step: 83, Rank: 0, loss = 0.02446671761572361
Epoch: 6, Step: 83, Rank: 2, loss = 0.15372121334075928
Epoch: 6, Step: 83, Rank: 3, loss = 0.10434633493423462
Epoch: 6, Step: 83, Rank: 6, loss = 0.04155131056904793
Per-token loss scaled by world size: 0.00215042638592422
Epoch: 6, Step: 83, Rank: 4, loss = 0.1408529281616211
[2024-07-27 20:06:56,979] [INFO] [logging.py:96:log_dist] [Rank 0] step=83, skipped=0, lr=[6.59706825558357e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:57,057] [INFO] [timer.py:258:stop] epoch=0/micro_step=83/global_step=83, RunningAvgSamplesPerSec=31.71805120615546, CurrSamplesPerSec=32.73837880082133, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 92%|█████████▏| 11/12 [00:24<00:01, 1.20s/it]
{
"epoch": 6,
"step": 83,
"rank": 0,
"loss": 0.02446671761572361,
"overall_throughput": 32.651551303359504,
"lr": 6.59706825558357e-06,
"cuda_mem_allocated": 22.003100872039795,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 524,
"batch_size": 16,
"total_loss": 0.0791110172867775,
"gradnorm": 1.3195643424987793,
"weight_norm": 393.4754333496094,
"timestamp": "2024-07-27T20:06:57.100239"
}
Per-token loss scaled by world size: 0.001030595856718719
Per-token loss scaled by world size: 0.00038883870001882315
Per-token loss scaled by world size: 0.00021640512568410486
Per-token loss scaled by world size: 0.0008497635717503726
Per-token loss scaled by world size: 0.0006636562757194042
Per-token loss scaled by world size: 0.0012220889329910278
Epoch: 6, Step: 84, Rank: 4, loss = 0.03300268575549126
Epoch: 6, Step: 84, Rank: 6, loss = 0.018367385491728783
Epoch: 6, Step: 84, Rank: 7, loss = 0.10372480005025864
Epoch: 6, Step: 84, Rank: 1, loss = 0.072123683989048
Epoch: 6, Step: 84, Rank: 2, loss = 0.05632782727479935
Epoch: 6, Step: 84, Rank: 3, loss = 0.08747182786464691
Per-token loss scaled by world size: 9.578206663718447e-05
Epoch: 6, Step: 84, Rank: 0, loss = 0.008129502646625042
Per-token loss scaled by world size: 0.0018002043943852186
Epoch: 6, Step: 84, Rank: 5, loss = 0.15279234945774078
[2024-07-27 20:06:57,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=84, skipped=0, lr=[6.2880275108177915e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:57,582] [INFO] [timer.py:258:stop] epoch=0/micro_step=84/global_step=84, RunningAvgSamplesPerSec=31.736034255118497, CurrSamplesPerSec=33.26364124820817, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 6: 100%|██████████| 12/12 [00:24<00:00, 1.01it/s]
{
"epoch": 6,
"step": 84,
"rank": 0,
"loss": 0.008129502646625042,
"overall_throughput": 33.17494381766968,
"lr": 6.2880275108177915e-06,
"cuda_mem_allocated": 22.000000476837158,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 679,
"batch_size": 16,
"total_loss": 0.06649251282215118,
"gradnorm": 1.081682801246643,
"weight_norm": 393.4754638671875,
"timestamp": "2024-07-27T20:06:57.629230"
}
Epoch 6: 100%|██████████| 12/12 [00:25<00:00, 2.09s/it]
total tokens: 160 num samples: 2 num padding tokens: 1 - rank: 1 max len: 80 min len: 79 avg len: 79.5 num_loss_counted_tokens: 86
total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 1 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 57
total tokens: 162 num samples: 2 num padding tokens: 29 - rank: 5 max len: 81 min len: 52 avg len: 66.5 num_loss_counted_tokens: 74
total tokens: 116 num samples: 2 num padding tokens: 13 - rank: 5 max len: 58 min len: 45 avg len: 51.5 num_loss_counted_tokens: 60
total tokens: 138 num samples: 2 num padding tokens: 16 - rank: 2 max len: 69 min len: 53 avg len: 61.0 num_loss_counted_tokens: 51
total tokens: 118 num samples: 2 num padding tokens: 4 - rank: 2 max len: 59 min len: 55 avg len: 57.0 num_loss_counted_tokens: 62
total tokens: 144 num samples: 2 num padding tokens: 10 - rank: 0 max len: 72 min len: 62 avg len: 67.0 num_loss_counted_tokens: 80
total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 2 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 56
total tokens: 150 num samples: 2 num padding tokens: 12 - rank: 5 max len: 75 min len: 63 avg len: 69.0 num_loss_counted_tokens: 72
total tokens: 214 num samples: 2 num padding tokens: 46 - rank: 2 max len: 107 min len: 61 avg len: 84.0 num_loss_counted_tokens: 107
total tokens: 140 num samples: 2 num padding tokens: 15 - rank: 2 max len: 70 min len: 55 avg len: 62.5 num_loss_counted_tokens: 62
total tokens: 180 num samples: 2 num padding tokens: 38 - rank: 7 max len: 90 min len: 52 avg len: 71.0 num_loss_counted_tokens: 95
total tokens: 180 num samples: 2 num padding tokens: 7 - rank: 6 max len: 90 min len: 83 avg len: 86.5 num_loss_counted_tokens: 135
total tokens: 152 num samples: 2 num padding tokens: 7 - rank: 5 max len: 76 min len: 69 avg len: 72.5 num_loss_counted_tokens: 101
total tokens: 102 num samples: 2 num padding tokens: 8 - rank: 1 max len: 51 min len: 43 avg len: 47.0 num_loss_counted_tokens: 44
total tokens: 106 num samples: 2 num padding tokens: 8 - rank: 1 max len: 53 min len: 45 avg len: 49.0 num_loss_counted_tokens: 46
total tokens: 194 num samples: 2 num padding tokens: 53 - rank: 0 max len: 97 min len: 44 avg len: 70.5 num_loss_counted_tokens: 90
total tokens: 122 num samples: 2 num padding tokens: 10 - rank: 6 max len: 61 min len: 51 avg len: 56.0 num_loss_counted_tokens: 56
total tokens: 140 num samples: 2 num padding tokens: 8 - rank: 2 max len: 70 min len: 62 avg len: 66.0 num_loss_counted_tokens: 72
total tokens: 208 num samples: 2 num padding tokens: 47 - rank: 6 max len: 104 min len: 57 avg len: 80.5 num_loss_counted_tokens: 112
total tokens: 114 num samples: 2 num padding tokens: 12 - rank: 1 max len: 57 min len: 45 avg len: 51.0 num_loss_counted_tokens: 47
total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 0 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 79
total tokens: 132 num samples: 2 num padding tokens: 11 - rank: 2 max len: 66 min len: 55 avg len: 60.5 num_loss_counted_tokens: 59
total tokens: 282 num samples: 2 num padding tokens: 77 - rank: 7 max len: 141 min len: 64 avg len: 102.5 num_loss_counted_tokens: 152
total tokens: 128 num samples: 2 num padding tokens: 1 - rank: 5 max len: 64 min len: 63 avg len: 63.5 num_loss_counted_tokens: 68
total tokens: 172 num samples: 2 num padding tokens: 35 - rank: 0 max len: 86 min len: 51 avg len: 68.5 num_loss_counted_tokens: 71
total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 2 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 105
total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 2 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 63
total tokens: 134 num samples: 2 num padding tokens: 4 - rank: 3 max len: 67 min len: 63 avg len: 65.0 num_loss_counted_tokens: 61
total tokens: 146 num samples: 2 num padding tokens: 28 - rank: 5 max len: 73 min len: 45 avg len: 59.0 num_loss_counted_tokens: 72
total tokens: 102 num samples: 2 num padding tokens: 1 - rank: 7 max len: 51 min len: 50 avg len: 50.5 num_loss_counted_tokens: 62
total tokens: 136 num samples: 2 num padding tokens: 2 - rank: 6 max len: 68 min len: 66 avg len: 67.0 num_loss_counted_tokens: 58
total tokens: 168 num samples: 2 num padding tokens: 14 - rank: 6 max len: 84 min len: 70 avg len: 77.0 num_loss_counted_tokens: 86
total tokens: 132 num samples: 2 num padding tokens: 12 - rank: 1 max len: 66 min len: 54 avg len: 60.0 num_loss_counted_tokens: 57
total tokens: 226 num samples: 2 num padding tokens: 27 - rank: 6 max len: 113 min len: 86 avg len: 99.5 num_loss_counted_tokens: 114
total tokens: 174 num samples: 2 num padding tokens: 28 - rank: 4 max len: 87 min len: 59 avg len: 73.0 num_loss_counted_tokens: 90
total tokens: 184 num samples: 2 num padding tokens: 14 - rank: 3 max len: 92 min len: 78 avg len: 85.0 num_loss_counted_tokens: 102
total tokens: 200 num samples: 2 num padding tokens: 48 - rank: 3 max len: 100 min len: 52 avg len: 76.0 num_loss_counted_tokens: 85
total tokens: 176 num samples: 2 num padding tokens: 44 - rank: 0 max len: 88 min len: 44 avg len: 66.0 num_loss_counted_tokens: 74
total tokens: 110 num samples: 2 num padding tokens: 0 - rank: 0 max len: 55 min len: 55 avg len: 55.0 num_loss_counted_tokens: 54
total tokens: 186 num samples: 2 num padding tokens: 33 - rank: 0 max len: 93 min len: 60 avg len: 76.5 num_loss_counted_tokens: 104
total tokens: 146 num samples: 2 num padding tokens: 24 - rank: 7 max len: 73 min len: 49 avg len: 61.0 num_loss_counted_tokens: 64
total tokens: 214 num samples: 2 num padding tokens: 25 - rank: 0 max len: 107 min len: 82 avg len: 94.5 num_loss_counted_tokens: 135
total tokens: 142 num samples: 2 num padding tokens: 3 - rank: 0 max len: 71 min len: 68 avg len: 69.5 num_loss_counted_tokens: 59
total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 0 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 77
total tokens: 108 num samples: 2 num padding tokens: 6 - rank: 3 max len: 54 min len: 48 avg len: 51.0 num_loss_counted_tokens: 55
total tokens: 166 num samples: 2 num padding tokens: 12 - rank: 6 max len: 83 min len: 71 avg len: 77.0 num_loss_counted_tokens: 79
total tokens: 196 num samples: 2 num padding tokens: 28 - rank: 6 max len: 98 min len: 70 avg len: 84.0 num_loss_counted_tokens: 105
total tokens: 186 num samples: 2 num padding tokens: 20 - rank: 0 max len: 93 min len: 73 avg len: 83.0 num_loss_counted_tokens: 135
total tokens: 228 num samples: 2 num padding tokens: 52 - rank: 4 max len: 114 min len: 62 avg len: 88.0 num_loss_counted_tokens: 120
total tokens: 244 num samples: 2 num padding tokens: 64 - rank: 3 max len: 122 min len: 58 avg len: 90.0 num_loss_counted_tokens: 127
total tokens: 162 num samples: 2 num padding tokens: 2 - rank: 6 max len: 81 min len: 79 avg len: 80.0 num_loss_counted_tokens: 86
total tokens: 120 num samples: 2 num padding tokens: 7 - rank: 6 max len: 60 min len: 53 avg len: 56.5 num_loss_counted_tokens: 71
total tokens: 142 num samples: 2 num padding tokens: 12 - rank: 3 max len: 71 min len: 59 avg len: 65.0 num_loss_counted_tokens: 59
total tokens: 132 num samples: 2 num padding tokens: 4 - rank: 3 max len: 66 min len: 62 avg len: 64.0 num_loss_counted_tokens: 71
total tokens: 164 num samples: 2 num padding tokens: 19 - rank: 7 max len: 82 min len: 63 avg len: 72.5 num_loss_counted_tokens: 84
total tokens: 122 num samples: 2 num padding tokens: 0 - rank: 6 max len: 61 min len: 61 avg len: 61.0 num_loss_counted_tokens: 61
total tokens: 118 num samples: 2 num padding tokens: 11 - rank: 1 max len: 59 min len: 48 avg len: 53.5 num_loss_counted_tokens: 55
total tokens: 142 num samples: 2 num padding tokens: 1 - rank: 1 max len: 71 min len: 70 avg len: 70.5 num_loss_counted_tokens: 72
total tokens: 128 num samples: 2 num padding tokens: 15 - rank: 2 max len: 64 min len: 49 avg len: 56.5 num_loss_counted_tokens: 52
total tokens: 172 num samples: 2 num padding tokens: 22 - rank: 3 max len: 86 min len: 64 avg len: 75.0 num_loss_counted_tokens: 70
total tokens: 142 num samples: 2 num padding tokens: 8 - rank: 1 max len: 71 min len: 63 avg len: 67.0 num_loss_counted_tokens: 64
total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 3 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 65
total tokens: 134 num samples: 2 num padding tokens: 17 - rank: 4 max len: 67 min len: 50 avg len: 58.5 num_loss_counted_tokens: 66
total tokens: 126 num samples: 2 num padding tokens: 5 - rank: 4 max len: 63 min len: 58 avg len: 60.5 num_loss_counted_tokens: 61
total tokens: 132 num samples: 2 num padding tokens: 14 - rank: 4 max len: 66 min len: 52 avg len: 59.0 num_loss_counted_tokens: 66
total tokens: 216 num samples: 2 num padding tokens: 7 - rank: 0 max len: 108 min len: 101 avg len: 104.5 num_loss_counted_tokens: 147
total tokens: 120 num samples: 2 num padding tokens: 14 - rank: 7 max len: 60 min len: 46 avg len: 53.0 num_loss_counted_tokens: 57
total tokens: 152 num samples: 2 num padding tokens: 12 - rank: 7 max len: 76 min len: 64 avg len: 70.0 num_loss_counted_tokens: 81
total tokens: 130 num samples: 2 num padding tokens: 0 - rank: 7 max len: 65 min len: 65 avg len: 65.0 num_loss_counted_tokens: 59
total tokens: 154 num samples: 2 num padding tokens: 10 - rank: 7 max len: 77 min len: 67 avg len: 72.0 num_loss_counted_tokens: 80
total tokens: 180 num samples: 2 num padding tokens: 35 - rank: 7 max len: 90 min len: 55 avg len: 72.5 num_loss_counted_tokens: 97
total tokens: 148 num samples: 2 num padding tokens: 14 - rank: 4 max len: 74 min len: 60 avg len: 67.0 num_loss_counted_tokens: 68
total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 1 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 61
total tokens: 132 num samples: 2 num padding tokens: 7 - rank: 4 max len: 66 min len: 59 avg len: 62.5 num_loss_counted_tokens: 67
total tokens: 152 num samples: 2 num padding tokens: 21 - rank: 5 max len: 76 min len: 55 avg len: 65.5 num_loss_counted_tokens: 63
total tokens: 144 num samples: 2 num padding tokens: 11 - rank: 3 max len: 72 min len: 61 avg len: 66.5 num_loss_counted_tokens: 66
total tokens: 138 num samples: 2 num padding tokens: 15 - rank: 5 max len: 69 min len: 54 avg len: 61.5 num_loss_counted_tokens: 62
total tokens: 188 num samples: 2 num padding tokens: 50 - rank: 1 max len: 94 min len: 44 avg len: 69.0 num_loss_counted_tokens: 79
total tokens: 148 num samples: 2 num padding tokens: 15 - rank: 5 max len: 74 min len: 59 avg len: 66.5 num_loss_counted_tokens: 71
total tokens: 116 num samples: 2 num padding tokens: 6 - rank: 3 max len: 58 min len: 52 avg len: 55.0 num_loss_counted_tokens: 60
total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 5 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 65
total tokens: 180 num samples: 2 num padding tokens: 35 - rank: 7 max len: 90 min len: 55 avg len: 72.5 num_loss_counted_tokens: 117
total tokens: 168 num samples: 2 num padding tokens: 24 - rank: 4 max len: 84 min len: 60 avg len: 72.0 num_loss_counted_tokens: 91
total tokens: 174 num samples: 2 num padding tokens: 10 - rank: 4 max len: 87 min len: 77 avg len: 82.0 num_loss_counted_tokens: 93
total tokens: 116 num samples: 2 num padding tokens: 1 - rank: 4 max len: 58 min len: 57 avg len: 57.5 num_loss_counted_tokens: 68
total tokens: 128 num samples: 2 num padding tokens: 16 - rank: 4 max len: 64 min len: 48 avg len: 56.0 num_loss_counted_tokens: 64
total tokens: 160 num samples: 2 num padding tokens: 22 - rank: 2 max len: 80 min len: 58 avg len: 69.0 num_loss_counted_tokens: 79
total tokens: 174 num samples: 2 num padding tokens: 19 - rank: 5 max len: 87 min len: 68 avg len: 77.5 num_loss_counted_tokens: 76
total tokens: 166 num samples: 2 num padding tokens: 23 - rank: 4 max len: 83 min len: 60 avg len: 71.5 num_loss_counted_tokens: 86
total tokens: 188 num samples: 2 num padding tokens: 13 - rank: 7 max len: 94 min len: 81 avg len: 87.5 num_loss_counted_tokens: 115
total tokens: 128 num samples: 2 num padding tokens: 15 - rank: 5 max len: 64 min len: 49 avg len: 56.5 num_loss_counted_tokens: 57
total tokens: 134 num samples: 2 num padding tokens: 17 - rank: 2 max len: 67 min len: 50 avg len: 58.5 num_loss_counted_tokens: 55
total tokens: 162 num samples: 2 num padding tokens: 19 - rank: 6 max len: 81 min len: 62 avg len: 71.5 num_loss_counted_tokens: 77
total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 1 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 58
total tokens: 160 num samples: 2 num padding tokens: 12 - rank: 3 max len: 80 min len: 68 avg len: 74.0 num_loss_counted_tokens: 80
Per-token loss scaled by world size: 0.00021730510343331844
Per-token loss scaled by world size: 0.00023930655152071267
Per-token loss scaled by world size: 0.00019531356520019472
Per-token loss scaled by world size: 0.0005758063634857535
Per-token loss scaled by world size: 0.00014575273962691426
Per-token loss scaled by world size: 0.0007938417256809771
Per-token loss scaled by world size: 0.00033632898703217506
Epoch: 7, Step: 85, Rank: 3, loss = 0.0187968909740448
Epoch: 7, Step: 85, Rank: 5, loss = 0.016894623637199402
Epoch: 7, Step: 85, Rank: 6, loss = 0.020700016990303993
Epoch: 7, Step: 85, Rank: 0, loss = 0.04980725049972534
Epoch: 7, Step: 85, Rank: 4, loss = 0.01260761171579361
Epoch: 7, Step: 85, Rank: 2, loss = 0.06866730749607086
Epoch: 7, Step: 85, Rank: 1, loss = 0.0290924571454525
Per-token loss scaled by world size: 0.00042542771552689373
Epoch: 7, Step: 85, Rank: 7, loss = 0.03679949790239334
[2024-07-27 20:06:58,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=85, skipped=0, lr=[5.983045753470308e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:58,596] [INFO] [timer.py:258:stop] epoch=0/micro_step=85/global_step=85, RunningAvgSamplesPerSec=31.69756640961793, CurrSamplesPerSec=28.831859851847014, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 85,
"rank": 0,
"loss": 0.04980725049972534,
"overall_throughput": 28.716406669884535,
"lr": 5.983045753470308e-06,
"cuda_mem_allocated": 22.00047731399536,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 692,
"batch_size": 16,
"total_loss": 0.0316707044839859,
"gradnorm": 0.7268746495246887,
"weight_norm": 393.4754943847656,
"timestamp": "2024-07-27T20:06:58.639043"
}
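Note on reading these records: the step-85 numbers above appear to be related by simple arithmetic. The sketch below checks two relationships that are inferred from the values in this log, not confirmed against the training code: each per-rank `loss` matches the "Per-token loss scaled by world size" value times `num_loss_counted_tokens / world_size`, and `total_loss` matches the mean of the eight per-rank losses. The names `WORLD_SIZE`, `rank_loss`, and `mean_rank_loss` are hypothetical, and the pairing of per-token values with ranks is an assumption made by matching values.

```python
import math

WORLD_SIZE = 8                  # assumption: taken from --gpus 8 in the launch command
NUM_LOSS_COUNTED_TOKENS = 692   # "num_loss_counted_tokens" in the step-85 record

# (per-token loss scaled by world size, logged per-rank loss) for step 85.
# The log does not label which per-token line belongs to which rank;
# these pairs were matched by value.
pairs = [
    (0.00021730510343331844, 0.0187968909740448),    # rank 3
    (0.0007938417256809771, 0.06866730749607086),    # rank 2
    (0.00042542771552689373, 0.03679949790239334),   # rank 7
]


def rank_loss(per_token_scaled: float) -> float:
    """Inferred relationship: undo the world-size scaling per counted token."""
    return per_token_scaled * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE


# All eight per-rank losses logged for step 85 (ranks 3, 5, 6, 0, 4, 2, 1, 7).
rank_losses = [
    0.0187968909740448, 0.016894623637199402, 0.020700016990303993,
    0.04980725049972534, 0.01260761171579361, 0.06866730749607086,
    0.0290924571454525, 0.03679949790239334,
]
mean_rank_loss = sum(rank_losses) / len(rank_losses)

for per_token, logged in pairs:
    # Values agree only to float32 precision, so compare with a tolerance.
    print(logged, rank_loss(per_token), math.isclose(rank_loss(per_token), logged, rel_tol=1e-6))

# Compare against "total_loss" from the step-85 record.
print(mean_rank_loss, math.isclose(mean_rank_loss, 0.0316707044839859, rel_tol=1e-6))
```

The tolerance is needed because the logged values look like float32 results printed at double precision, so recomputing in Python floats agrees only to about seven significant digits.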
Per-token loss scaled by world size: 0.000342040992109105
Per-token loss scaled by world size: 0.00027503896853886545
Per-token loss scaled by world size: 0.00036574419937096536
Per-token loss scaled by world size: 0.0006328842719085515
Per-token loss scaled by world size: 0.0005108661716803908
Per-token loss scaled by world size: 0.0006690495647490025
Per-token loss scaled by world size: 0.0002407751599093899
Epoch: 7, Step: 86, Rank: 0, loss = 0.032139770686626434
Epoch: 7, Step: 86, Rank: 5, loss = 0.030056850984692574
Epoch: 7, Step: 86, Rank: 4, loss = 0.058792732656002045
Epoch: 7, Step: 86, Rank: 2, loss = 0.02416904829442501
Epoch: 7, Step: 86, Rank: 6, loss = 0.05561470612883568
Epoch: 7, Step: 86, Rank: 7, loss = 0.044892363250255585
Epoch: 7, Step: 86, Rank: 3, loss = 0.021158117800951004
Per-token loss scaled by world size: 0.0008176557603292167
Epoch: 7, Step: 86, Rank: 1, loss = 0.07185149937868118
[2024-07-27 20:06:59,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=86, skipped=0, lr=[5.6824564766150724e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:59,144] [INFO] [timer.py:258:stop] epoch=0/micro_step=86/global_step=86, RunningAvgSamplesPerSec=31.703864056372694, CurrSamplesPerSec=32.23543844733132, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 86,
"rank": 0,
"loss": 0.032139770686626434,
"overall_throughput": 32.183202299620085,
"lr": 5.6824564766150724e-06,
"cuda_mem_allocated": 22.006441116333008,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 703,
"batch_size": 16,
"total_loss": 0.042334385216236115,
"gradnorm": 0.6796127557754517,
"weight_norm": 393.4755554199219,
"timestamp": "2024-07-27T20:06:59.187763"
}
Per-token loss scaled by world size: 0.00016505751409567893
Per-token loss scaled by world size: 0.0008360829087905586
Per-token loss scaled by world size: 0.0005081766867078841
Per-token loss scaled by world size: 0.0005767009570263326
Per-token loss scaled by world size: 0.0008457532385364175
Per-token loss scaled by world size: 0.003279536496847868
Per-token loss scaled by world size: 0.0016091869911178946
Epoch: 7, Step: 87, Rank: 7, loss = 0.05246420204639435
Epoch: 7, Step: 87, Rank: 0, loss = 0.010357359424233437
Epoch: 7, Step: 87, Rank: 3, loss = 0.03188808634877205
Epoch: 7, Step: 87, Rank: 6, loss = 0.2057909220457077
Epoch: 7, Step: 87, Rank: 5, loss = 0.05307101458311081
Epoch: 7, Step: 87, Rank: 2, loss = 0.03618798404932022
Epoch: 7, Step: 87, Rank: 4, loss = 0.10097648203372955
Per-token loss scaled by world size: 0.0013951770961284637
Epoch: 7, Step: 87, Rank: 1, loss = 0.08754736185073853
[2024-07-27 20:06:59,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=87, skipped=0, lr=[5.386588370213124e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:06:59,697] [INFO] [timer.py:258:stop] epoch=0/micro_step=87/global_step=87, RunningAvgSamplesPerSec=31.702355408140964, CurrSamplesPerSec=31.576139496344755, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 87,
"rank": 0,
"loss": 0.010357359424233437,
"overall_throughput": 31.529719530621602,
"lr": 5.386588370213124e-06,
"cuda_mem_allocated": 22.000000476837158,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 502,
"batch_size": 16,
"total_loss": 0.07228542864322662,
"gradnorm": 1.0233722925186157,
"weight_norm": 393.4755859375,
"timestamp": "2024-07-27T20:06:59.740017"
}
Per-token loss scaled by world size: 0.0004632726195268333
Per-token loss scaled by world size: 0.0006792055210098624
Per-token loss scaled by world size: 0.0006460993899963796
Per-token loss scaled by world size: 7.74235013523139e-05
Per-token loss scaled by world size: 0.00012206515384605154
Per-token loss scaled by world size: 0.0019949208945035934
Epoch: 7, Step: 88, Rank: 2, loss = 0.05039575323462486
Epoch: 7, Step: 88, Rank: 0, loss = 0.009521082043647766
Epoch: 7, Step: 88, Rank: 7, loss = 0.03613526374101639
Epoch: 7, Step: 88, Rank: 3, loss = 0.0529780313372612
Epoch: 7, Step: 88, Rank: 4, loss = 0.15560382604599
Per-token loss scaled by world size: 0.0007654842338524759
Epoch: 7, Step: 88, Rank: 6, loss = 0.006039033178240061
Epoch: 7, Step: 88, Rank: 5, loss = 0.05970776826143265
Per-token loss scaled by world size: 0.0008809524588286877
Epoch: 7, Step: 88, Rank: 1, loss = 0.06871429085731506
[2024-07-27 20:07:00,188] [INFO] [logging.py:96:log_dist] [Rank 0] step=88, skipped=0, lr=[5.095764961694923e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:00,266] [INFO] [timer.py:258:stop] epoch=0/micro_step=88/global_step=88, RunningAvgSamplesPerSec=31.68774037603646, CurrSamplesPerSec=30.492857616709514, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 1408
{
"epoch": 7,
"step": 88,
"rank": 0,
"loss": 0.009521082043647766,
"overall_throughput": 30.41481690296598,
"lr": 5.095764961694923e-06,
"cuda_mem_allocated": 22.0038161277771,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 624,
"batch_size": 16,
"total_loss": 0.05488688498735428,
"gradnorm": 1.1451473236083984,
"weight_norm": 393.47564697265625,
"timestamp": "2024-07-27T20:07:00.269082"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1408
[20:07:18] INFO saving took 17.875807285308838 seconds utils.py:611
Per-token loss scaled by world size: 0.00015760907263029367
Per-token loss scaled by world size: 0.00012432184303179383
Per-token loss scaled by world size: 0.0010254974476993084
Per-token loss scaled by world size: 0.0010104298125952482
Per-token loss scaled by world size: 0.00047610432375222445
Per-token loss scaled by world size: 0.00011171086953254417
Per-token loss scaled by world size: 0.000618505000602454
Epoch: 7, Step: 89, Rank: 2, loss = 0.010038988664746284
Epoch: 7, Step: 89, Rank: 5, loss = 0.08159220963716507
Epoch: 7, Step: 89, Rank: 7, loss = 0.012726932764053345
Epoch: 7, Step: 89, Rank: 3, loss = 0.038445424288511276
Epoch: 7, Step: 89, Rank: 4, loss = 0.08280891925096512
Epoch: 7, Step: 89, Rank: 0, loss = 0.009020652621984482
Epoch: 7, Step: 89, Rank: 6, loss = 0.049944277852773666
Per-token loss scaled by world size: 0.0007019841577857733
Epoch: 7, Step: 89, Rank: 1, loss = 0.05668522045016289
[2024-07-27 20:07:18,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=89, skipped=0, lr=[4.8103042621878515e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:18,713] [INFO] [timer.py:258:stop] epoch=0/micro_step=89/global_step=89, RunningAvgSamplesPerSec=31.681179911181776, CurrSamplesPerSec=31.12696454313878, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 89,
"rank": 0,
"loss": 0.009020652621984482,
"overall_throughput": 31.06243279759643,
"lr": 4.8103042621878515e-06,
"cuda_mem_allocated": 22.00548553466797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 646,
"batch_size": 16,
"total_loss": 0.04265782982110977,
"gradnorm": 0.9962098598480225,
"weight_norm": 393.4757080078125,
"timestamp": "2024-07-27T20:07:18.755800"
}
Per-token loss scaled by world size: 0.0009453526581637561
Per-token loss scaled by world size: 0.0016993889585137367
Per-token loss scaled by world size: 0.0008407846908085048
Per-token loss scaled by world size: 6.15180833847262e-05
Per-token loss scaled by world size: 0.0012258148053660989
Per-token loss scaled by world size: 0.0002534937229938805
Per-token loss scaled by world size: 7.776251732138917e-05
Epoch: 7, Step: 90, Rank: 0, loss = 0.057784680277109146
Epoch: 7, Step: 90, Rank: 1, loss = 0.10387515276670456
Epoch: 7, Step: 90, Rank: 5, loss = 0.05139296501874924
Epoch: 7, Step: 90, Rank: 3, loss = 0.0749279335141182
Epoch: 7, Step: 90, Rank: 2, loss = 0.0037602928932756186
Epoch: 7, Step: 90, Rank: 6, loss = 0.015494802966713905
Epoch: 7, Step: 90, Rank: 4, loss = 0.00475323386490345
Per-token loss scaled by world size: 0.0009510749368928373
Epoch: 7, Step: 90, Rank: 7, loss = 0.05813445523381233
[2024-07-27 20:07:19,178] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[4.530518418775734e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:19,255] [INFO] [timer.py:258:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=31.685890209714664, CurrSamplesPerSec=32.10111808111374, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 90,
"rank": 0,
"loss": 0.057784680277109146,
"overall_throughput": 32.01665600380905,
"lr": 4.530518418775734e-06,
"cuda_mem_allocated": 21.996421813964844,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 489,
"batch_size": 16,
"total_loss": 0.04626544192433357,
"gradnorm": 0.9134754538536072,
"weight_norm": 393.4757385253906,
"timestamp": "2024-07-27T20:07:19.305311"
}
Per-token loss scaled by world size: 0.00019140614313073456
Per-token loss scaled by world size: 0.0003604689263738692
Per-token loss scaled by world size: 0.00012782825797330588
Per-token loss scaled by world size: 0.00011688289669109508
Per-token loss scaled by world size: 0.0008099116967059672
Per-token loss scaled by world size: 0.0005937899113632739
Per-token loss scaled by world size: 0.0016323667950928211
Epoch: 7, Step: 91, Rank: 4, loss = 0.029107866808772087
Epoch: 7, Step: 91, Rank: 0, loss = 0.015456045977771282
Epoch: 7, Step: 91, Rank: 1, loss = 0.010322132147848606
Epoch: 7, Step: 91, Rank: 7, loss = 0.009438293986022472
Epoch: 7, Step: 91, Rank: 2, loss = 0.0654003694653511
Epoch: 7, Step: 91, Rank: 6, loss = 0.04794853553175926
Epoch: 7, Step: 91, Rank: 3, loss = 0.13181361556053162
Per-token loss scaled by world size: 8.413568866671994e-05
Epoch: 7, Step: 91, Rank: 5, loss = 0.006793956737965345
[2024-07-27 20:07:19,714] [INFO] [logging.py:96:log_dist] [Rank 0] step=91, skipped=0, lr=[4.256713373170565e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:19,792] [INFO] [timer.py:258:stop] epoch=0/micro_step=91/global_step=91, RunningAvgSamplesPerSec=31.69971548381683, CurrSamplesPerSec=32.965470896955004, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 91,
"rank": 0,
"loss": 0.015456045977771282,
"overall_throughput": 32.88040023537493,
"lr": 4.256713373170565e-06,
"cuda_mem_allocated": 22.004292964935303,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 646,
"batch_size": 16,
"total_loss": 0.039535101503133774,
"gradnorm": 1.6763972043991089,
"weight_norm": 393.47576904296875,
"timestamp": "2024-07-27T20:07:19.833342"
}
Per-token loss scaled by world size: 0.00016448293172288686
Per-token loss scaled by world size: 0.00010030974954133853
Per-token loss scaled by world size: 0.0006337311351671815
Per-token loss scaled by world size: 0.0002874261699616909
Per-token loss scaled by world size: 0.0004495856410358101
Per-token loss scaled by world size: 0.0012448193738237023
Per-token loss scaled by world size: 8.349026757059619e-05
Epoch: 7, Step: 92, Rank: 0, loss = 0.013878247700631618
Epoch: 7, Step: 92, Rank: 4, loss = 0.024251583963632584
Epoch: 7, Step: 92, Rank: 7, loss = 0.008463635109364986
Epoch: 7, Step: 92, Rank: 5, loss = 0.10503163933753967
Epoch: 7, Step: 92, Rank: 2, loss = 0.05347106233239174
Epoch: 7, Step: 92, Rank: 6, loss = 0.007044491358101368
Epoch: 7, Step: 92, Rank: 3, loss = 0.03793378919363022
Per-token loss scaled by world size: 0.0010255238739773631
Epoch: 7, Step: 92, Rank: 1, loss = 0.08652857691049576
[2024-07-27 20:07:20,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=92, skipped=0, lr=[3.989188527169749e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:20,327] [INFO] [timer.py:258:stop] epoch=0/micro_step=92/global_step=92, RunningAvgSamplesPerSec=31.708216014036797, CurrSamplesPerSec=32.48346829214222, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 92,
"rank": 0,
"loss": 0.013878247700631618,
"overall_throughput": 32.40163058618926,
"lr": 3.989188527169749e-06,
"cuda_mem_allocated": 22.009064197540283,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 675,
"batch_size": 16,
"total_loss": 0.04207538068294525,
"gradnorm": 0.6942251920700073,
"weight_norm": 393.4757995605469,
"timestamp": "2024-07-27T20:07:20.370030"
}
Per-token loss scaled by world size: 0.001151230651885271
Per-token loss scaled by world size: 0.0008526312303729355
Per-token loss scaled by world size: 0.00011098023969680071
Per-token loss scaled by world size: 0.0004092410672456026
Per-token loss scaled by world size: 0.0007324064499698579
Per-token loss scaled by world size: 0.000303772249026224
Per-token loss scaled by world size: 0.0005547546315938234
Epoch: 7, Step: 93, Rank: 6, loss = 0.0076160188764333725
Epoch: 7, Step: 93, Rank: 0, loss = 0.05851181969046593
Epoch: 7, Step: 93, Rank: 1, loss = 0.02808416821062565
Epoch: 7, Step: 93, Rank: 3, loss = 0.05026139318943024
Epoch: 7, Step: 93, Rank: 5, loss = 0.020846370607614517
Epoch: 7, Step: 93, Rank: 7, loss = 0.07900319993495941
Epoch: 7, Step: 93, Rank: 2, loss = 0.038070037961006165
Per-token loss scaled by world size: 0.002183598466217518
Epoch: 7, Step: 93, Rank: 4, loss = 0.14984944462776184
[2024-07-27 20:07:20,790] [INFO] [logging.py:96:log_dist] [Rank 0] step=93, skipped=0, lr=[3.72823641526463e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:20,868] [INFO] [timer.py:258:stop] epoch=0/micro_step=93/global_step=93, RunningAvgSamplesPerSec=31.713426964551296, CurrSamplesPerSec=32.18953148593345, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 93,
"rank": 0,
"loss": 0.05851181969046593,
"overall_throughput": 32.11137875033854,
"lr": 3.72823641526463e-06,
"cuda_mem_allocated": 22.001431465148926,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 549,
"batch_size": 16,
"total_loss": 0.05403030663728714,
"gradnorm": 2.058638572692871,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:20.910467"
}
Per-token loss scaled by world size: 0.0005728427204303443
Per-token loss scaled by world size: 0.00010026186646427959
Per-token loss scaled by world size: 0.0007008212269283831
Per-token loss scaled by world size: 0.001267179031856358
Per-token loss scaled by world size: 0.0009045311016961932
Per-token loss scaled by world size: 0.000113489935756661
Per-token loss scaled by world size: 0.00015748964506201446
Epoch: 7, Step: 94, Rank: 3, loss = 0.06035822629928589
Epoch: 7, Step: 94, Rank: 1, loss = 0.008635053411126137
Epoch: 7, Step: 94, Rank: 0, loss = 0.10913579910993576
Epoch: 7, Step: 94, Rank: 2, loss = 0.04933607950806618
Epoch: 7, Step: 94, Rank: 5, loss = 0.013563795946538448
Epoch: 7, Step: 94, Rank: 7, loss = 0.07790274173021317
Epoch: 7, Step: 94, Rank: 4, loss = 0.009774320758879185
Per-token loss scaled by world size: 3.7123980291653425e-05
Epoch: 7, Step: 94, Rank: 6, loss = 0.003197302808985114
[2024-07-27 20:07:21,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=94, skipped=0, lr=[3.4741423847583134e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:21,427] [INFO] [timer.py:258:stop] epoch=0/micro_step=94/global_step=94, RunningAvgSamplesPerSec=31.70583386752702, CurrSamplesPerSec=31.029757814905818, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 94,
"rank": 0,
"loss": 0.10913579910993576,
"overall_throughput": 30.958899761772514,
"lr": 3.4741423847583134e-06,
"cuda_mem_allocated": 22.00882577896118,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 689,
"batch_size": 16,
"total_loss": 0.0414879135787487,
"gradnorm": 0.7960036993026733,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:21.470074"
}
Per-token loss scaled by world size: 0.00030438421526923776
Per-token loss scaled by world size: 0.00036696376628242433
Per-token loss scaled by world size: 0.00027681011124514043
Per-token loss scaled by world size: 7.804056804161519e-05
Per-token loss scaled by world size: 0.001165422610938549
Per-token loss scaled by world size: 0.00014700897736474872
Per-token loss scaled by world size: 0.0005056550144217908
Epoch: 7, Step: 95, Rank: 7, loss = 0.02559572272002697
Epoch: 7, Step: 95, Rank: 1, loss = 0.005443329457193613
Epoch: 7, Step: 95, Rank: 6, loss = 0.01930750533938408
Epoch: 7, Step: 95, Rank: 0, loss = 0.021230798214673996
Epoch: 7, Step: 95, Rank: 3, loss = 0.08128822594881058
Epoch: 7, Step: 95, Rank: 5, loss = 0.03526943549513817
Epoch: 7, Step: 95, Rank: 4, loss = 0.010253876447677612
Per-token loss scaled by world size: 0.001073820167221129
Epoch: 7, Step: 95, Rank: 2, loss = 0.07489895820617676
[2024-07-27 20:07:21,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=95, skipped=0, lr=[3.2271842837425917e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:21,957] [INFO] [timer.py:258:stop] epoch=0/micro_step=95/global_step=95, RunningAvgSamplesPerSec=31.718594052770808, CurrSamplesPerSec=32.93815904428149, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 95,
"rank": 0,
"loss": 0.021230798214673996,
"overall_throughput": 32.85461233006348,
"lr": 3.2271842837425917e-06,
"cuda_mem_allocated": 22.00023889541626,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 558,
"batch_size": 16,
"total_loss": 0.034160979092121124,
"gradnorm": 0.7562242150306702,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:21.999178"
}
Per-token loss scaled by world size: 0.0006009905482642353
Per-token loss scaled by world size: 0.0001355827844236046
Per-token loss scaled by world size: 0.0003012406814377755
Per-token loss scaled by world size: 0.0010038167238235474
Per-token loss scaled by world size: 0.0006891617667861283
Per-token loss scaled by world size: 0.0006996800657361746
Per-token loss scaled by world size: 0.0006351979682222009
Epoch: 7, Step: 96, Rank: 5, loss = 0.0112872663885355
Epoch: 7, Step: 96, Rank: 1, loss = 0.08356773853302002
Epoch: 7, Step: 96, Rank: 0, loss = 0.05003246292471886
Epoch: 7, Step: 96, Rank: 2, loss = 0.025078287348151207
Epoch: 7, Step: 96, Rank: 3, loss = 0.057372719049453735
Epoch: 7, Step: 96, Rank: 7, loss = 0.05288023129105568
Epoch: 7, Step: 96, Rank: 6, loss = 0.0582483634352684
Per-token loss scaled by world size: 0.0005330585991032422
Epoch: 7, Step: 96, Rank: 4, loss = 0.04437712952494621
[2024-07-27 20:07:22,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=96, skipped=0, lr=[2.9876321572751143e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:22,503] [INFO] [timer.py:258:stop] epoch=0/micro_step=96/global_step=96, RunningAvgSamplesPerSec=31.71950152323151, CurrSamplesPerSec=31.804123848141387, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
{
"epoch": 7,
"step": 96,
"rank": 0,
"loss": 0.05003246292471886,
"overall_throughput": 31.730529437408332,
"lr": 2.9876321572751143e-06,
"cuda_mem_allocated": 22.00548553466797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 666,
"batch_size": 16,
"total_loss": 0.047855526208877563,
"gradnorm": 1.0569850206375122,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:22.546236"
}
Epoch 7: 100%|██████████| 12/12 [00:24<00:00, 2.08s/it]
total tokens: 166 num samples: 2 num padding tokens: 3 - rank: 6 max len: 83 min len: 80 avg len: 81.5 num_loss_counted_tokens: 88
total tokens: 174 num samples: 2 num padding tokens: 37 - rank: 6 max len: 87 min len: 50 avg len: 68.5 num_loss_counted_tokens: 83
total tokens: 132 num samples: 2 num padding tokens: 17 - rank: 6 max len: 66 min len: 49 avg len: 57.5 num_loss_counted_tokens: 50
total tokens: 194 num samples: 2 num padding tokens: 18 - rank: 6 max len: 97 min len: 79 avg len: 88.0 num_loss_counted_tokens: 112
total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 6 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 72
total tokens: 140 num samples: 2 num padding tokens: 6 - rank: 6 max len: 70 min len: 64 avg len: 67.0 num_loss_counted_tokens: 81
total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 6 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 48
total tokens: 144 num samples: 2 num padding tokens: 18 - rank: 6 max len: 72 min len: 54 avg len: 63.0 num_loss_counted_tokens: 82
total tokens: 154 num samples: 2 num padding tokens: 19 - rank: 3 max len: 77 min len: 58 avg len: 67.5 num_loss_counted_tokens: 87
total tokens: 196 num samples: 2 num padding tokens: 34 - rank: 6 max len: 98 min len: 64 avg len: 81.0 num_loss_counted_tokens: 113
total tokens: 228 num samples: 2 num padding tokens: 70 - rank: 6 max len: 114 min len: 44 avg len: 79.0 num_loss_counted_tokens: 110
total tokens: 126 num samples: 2 num padding tokens: 1 - rank: 3 max len: 63 min len: 62 avg len: 62.5 num_loss_counted_tokens: 60
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 3 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 63
total tokens: 186 num samples: 2 num padding tokens: 48 - rank: 3 max len: 93 min len: 45 avg len: 69.0 num_loss_counted_tokens: 110
total tokens: 126 num samples: 2 num padding tokens: 8 - rank: 3 max len: 63 min len: 55 avg len: 59.0 num_loss_counted_tokens: 61
total tokens: 104 num samples: 2 num padding tokens: 1 - rank: 3 max len: 52 min len: 51 avg len: 51.5 num_loss_counted_tokens: 59
total tokens: 152 num samples: 2 num padding tokens: 2 - rank: 3 max len: 76 min len: 74 avg len: 75.0 num_loss_counted_tokens: 79
total tokens: 108 num samples: 2 num padding tokens: 1 - rank: 3 max len: 54 min len: 53 avg len: 53.5 num_loss_counted_tokens: 64
total tokens: 120 num samples: 2 num padding tokens: 5 - rank: 7 max len: 60 min len: 55 avg len: 57.5 num_loss_counted_tokens: 61
total tokens: 188 num samples: 2 num padding tokens: 35 - rank: 3 max len: 94 min len: 59 avg len: 76.5 num_loss_counted_tokens: 93
total tokens: 162 num samples: 2 num padding tokens: 37 - rank: 7 max len: 81 min len: 44 avg len: 62.5 num_loss_counted_tokens: 66
total tokens: 172 num samples: 2 num padding tokens: 11 - rank: 7 max len: 86 min len: 75 avg len: 80.5 num_loss_counted_tokens: 72
total tokens: 166 num samples: 2 num padding tokens: 37 - rank: 7 max len: 83 min len: 46 avg len: 64.5 num_loss_counted_tokens: 80
total tokens: 142 num samples: 2 num padding tokens: 6 - rank: 3 max len: 71 min len: 65 avg len: 68.0 num_loss_counted_tokens: 78
total tokens: 158 num samples: 2 num padding tokens: 22 - rank: 7 max len: 79 min len: 57 avg len: 68.0 num_loss_counted_tokens: 65
total tokens: 128 num samples: 2 num padding tokens: 19 - rank: 7 max len: 64 min len: 45 avg len: 54.5 num_loss_counted_tokens: 58
total tokens: 126 num samples: 2 num padding tokens: 0 - rank: 7 max len: 63 min len: 63 avg len: 63.0 num_loss_counted_tokens: 64
total tokens: 134 num samples: 2 num padding tokens: 24 - rank: 7 max len: 67 min len: 43 avg len: 55.0 num_loss_counted_tokens: 57
total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 7 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 72
total tokens: 202 num samples: 2 num padding tokens: 21 - rank: 7 max len: 101 min len: 80 avg len: 90.5 num_loss_counted_tokens: 128
total tokens: 138 num samples: 2 num padding tokens: 17 - rank: 6 max len: 69 min len: 52 avg len: 60.5 num_loss_counted_tokens: 60
total tokens: 196 num samples: 2 num padding tokens: 53 - rank: 5 max len: 98 min len: 45 avg len: 71.5 num_loss_counted_tokens: 99
total tokens: 110 num samples: 2 num padding tokens: 5 - rank: 5 max len: 55 min len: 50 avg len: 52.5 num_loss_counted_tokens: 57
total tokens: 176 num samples: 2 num padding tokens: 17 - rank: 5 max len: 88 min len: 71 avg len: 79.5 num_loss_counted_tokens: 86
total tokens: 120 num samples: 2 num padding tokens: 8 - rank: 5 max len: 60 min len: 52 avg len: 56.0 num_loss_counted_tokens: 63
total tokens: 172 num samples: 2 num padding tokens: 18 - rank: 5 max len: 86 min len: 68 avg len: 77.0 num_loss_counted_tokens: 74
total tokens: 282 num samples: 2 num padding tokens: 81 - rank: 5 max len: 141 min len: 60 avg len: 100.5 num_loss_counted_tokens: 152
total tokens: 162 num samples: 2 num padding tokens: 11 - rank: 5 max len: 81 min len: 70 avg len: 75.5 num_loss_counted_tokens: 86
total tokens: 166 num samples: 2 num padding tokens: 24 - rank: 5 max len: 83 min len: 59 avg len: 71.0 num_loss_counted_tokens: 80
total tokens: 216 num samples: 2 num padding tokens: 47 - rank: 5 max len: 108 min len: 61 avg len: 84.5 num_loss_counted_tokens: 103
total tokens: 226 num samples: 2 num padding tokens: 40 - rank: 7 max len: 113 min len: 73 avg len: 93.0 num_loss_counted_tokens: 109
total tokens: 122 num samples: 2 num padding tokens: 15 - rank: 5 max len: 61 min len: 46 avg len: 53.5 num_loss_counted_tokens: 55
total tokens: 180 num samples: 2 num padding tokens: 22 - rank: 4 max len: 90 min len: 68 avg len: 79.0 num_loss_counted_tokens: 111
total tokens: 152 num samples: 2 num padding tokens: 16 - rank: 5 max len: 76 min len: 60 avg len: 68.0 num_loss_counted_tokens: 71
total tokens: 152 num samples: 2 num padding tokens: 24 - rank: 4 max len: 76 min len: 52 avg len: 64.0 num_loss_counted_tokens: 59
total tokens: 140 num samples: 2 num padding tokens: 19 - rank: 4 max len: 70 min len: 51 avg len: 60.5 num_loss_counted_tokens: 55
total tokens: 102 num samples: 2 num padding tokens: 6 - rank: 3 max len: 51 min len: 45 avg len: 48.0 num_loss_counted_tokens: 50
total tokens: 146 num samples: 2 num padding tokens: 1 - rank: 4 max len: 73 min len: 72 avg len: 72.5 num_loss_counted_tokens: 83
total tokens: 118 num samples: 2 num padding tokens: 11 - rank: 6 max len: 59 min len: 48 avg len: 53.5 num_loss_counted_tokens: 53
total tokens: 114 num samples: 2 num padding tokens: 8 - rank: 3 max len: 57 min len: 49 avg len: 53.0 num_loss_counted_tokens: 58
total tokens: 148 num samples: 2 num padding tokens: 14 - rank: 4 max len: 74 min len: 60 avg len: 67.0 num_loss_counted_tokens: 74
total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 4 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 53
total tokens: 162 num samples: 2 num padding tokens: 15 - rank: 4 max len: 81 min len: 66 avg len: 73.5 num_loss_counted_tokens: 87
total tokens: 138 num samples: 2 num padding tokens: 9 - rank: 7 max len: 69 min len: 60 avg len: 64.5 num_loss_counted_tokens: 75
total tokens: 168 num samples: 2 num padding tokens: 34 - rank: 4 max len: 84 min len: 50 avg len: 67.0 num_loss_counted_tokens: 88
total tokens: 160 num samples: 2 num padding tokens: 2 - rank: 4 max len: 80 min len: 78 avg len: 79.0 num_loss_counted_tokens: 91
total tokens: 128 num samples: 2 num padding tokens: 20 - rank: 4 max len: 64 min len: 44 avg len: 54.0 num_loss_counted_tokens: 47
total tokens: 186 num samples: 2 num padding tokens: 3 - rank: 4 max len: 93 min len: 90 avg len: 91.5 num_loss_counted_tokens: 114
total tokens: 126 num samples: 2 num padding tokens: 3 - rank: 5 max len: 63 min len: 60 avg len: 61.5 num_loss_counted_tokens: 65
total tokens: 136 num samples: 2 num padding tokens: 15 - rank: 4 max len: 68 min len: 53 avg len: 60.5 num_loss_counted_tokens: 52
total tokens: 104 num samples: 2 num padding tokens: 4 - rank: 0 max len: 52 min len: 48 avg len: 50.0 num_loss_counted_tokens: 54
total tokens: 124 num samples: 2 num padding tokens: 7 - rank: 0 max len: 62 min len: 55 avg len: 58.5 num_loss_counted_tokens: 51
total tokens: 154 num samples: 2 num padding tokens: 13 - rank: 0 max len: 77 min len: 64 avg len: 70.5 num_loss_counted_tokens: 78
total tokens: 132 num samples: 2 num padding tokens: 6 - rank: 0 max len: 66 min len: 60 avg len: 63.0 num_loss_counted_tokens: 87
total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 0 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 73
total tokens: 118 num samples: 2 num padding tokens: 2 - rank: 1 max len: 59 min len: 57 avg len: 58.0 num_loss_counted_tokens: 60
total tokens: 152 num samples: 2 num padding tokens: 27 - rank: 1 max len: 76 min len: 49 avg len: 62.5 num_loss_counted_tokens: 72
total tokens: 130 num samples: 2 num padding tokens: 7 - rank: 1 max len: 65 min len: 58 avg len: 61.5 num_loss_counted_tokens: 59
total tokens: 188 num samples: 2 num padding tokens: 34 - rank: 0 max len: 94 min len: 60 avg len: 77.0 num_loss_counted_tokens: 90
total tokens: 164 num samples: 2 num padding tokens: 12 - rank: 1 max len: 82 min len: 70 avg len: 76.0 num_loss_counted_tokens: 96
total tokens: 124 num samples: 2 num padding tokens: 4 - rank: 0 max len: 62 min len: 58 avg len: 60.0 num_loss_counted_tokens: 60
total tokens: 244 num samples: 2 num padding tokens: 67 - rank: 0 max len: 122 min len: 55 avg len: 88.5 num_loss_counted_tokens: 127
total tokens: 106 num samples: 2 num padding tokens: 7 - rank: 0 max len: 53 min len: 46 avg len: 49.5 num_loss_counted_tokens: 45
total tokens: 128 num samples: 2 num padding tokens: 0 - rank: 1 max len: 64 min len: 64 avg len: 64.0 num_loss_counted_tokens: 59
total tokens: 110 num samples: 2 num padding tokens: 5 - rank: 0 max len: 55 min len: 50 avg len: 52.5 num_loss_counted_tokens: 57
total tokens: 132 num samples: 2 num padding tokens: 5 - rank: 1 max len: 66 min len: 61 avg len: 63.5 num_loss_counted_tokens: 60
total tokens: 208 num samples: 2 num padding tokens: 33 - rank: 0 max len: 104 min len: 71 avg len: 87.5 num_loss_counted_tokens: 124
total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 1 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 86
total tokens: 154 num samples: 2 num padding tokens: 10 - rank: 1 max len: 77 min len: 67 avg len: 72.0 num_loss_counted_tokens: 68
total tokens: 132 num samples: 2 num padding tokens: 18 - rank: 1 max len: 66 min len: 48 avg len: 57.0 num_loss_counted_tokens: 68
total tokens: 126 num samples: 2 num padding tokens: 2 - rank: 1 max len: 63 min len: 61 avg len: 62.0 num_loss_counted_tokens: 68
total tokens: 138 num samples: 2 num padding tokens: 8 - rank: 1 max len: 69 min len: 61 avg len: 65.0 num_loss_counted_tokens: 58
total tokens: 146 num samples: 2 num padding tokens: 2 - rank: 2 max len: 73 min len: 71 avg len: 72.0 num_loss_counted_tokens: 79
total tokens: 118 num samples: 2 num padding tokens: 1 - rank: 2 max len: 59 min len: 58 avg len: 58.5 num_loss_counted_tokens: 64
total tokens: 124 num samples: 2 num padding tokens: 10 - rank: 2 max len: 62 min len: 52 avg len: 57.0 num_loss_counted_tokens: 59
total tokens: 180 num samples: 2 num padding tokens: 31 - rank: 0 max len: 90 min len: 59 avg len: 74.5 num_loss_counted_tokens: 104
total tokens: 186 num samples: 2 num padding tokens: 1 - rank: 2 max len: 93 min len: 92 avg len: 92.5 num_loss_counted_tokens: 125
total tokens: 136 num samples: 2 num padding tokens: 5 - rank: 2 max len: 68 min len: 63 avg len: 65.5 num_loss_counted_tokens: 50
total tokens: 214 num samples: 2 num padding tokens: 31 - rank: 2 max len: 107 min len: 76 avg len: 91.5 num_loss_counted_tokens: 132
total tokens: 174 num samples: 2 num padding tokens: 5 - rank: 2 max len: 87 min len: 82 avg len: 84.5 num_loss_counted_tokens: 109
total tokens: 168 num samples: 2 num padding tokens: 29 - rank: 2 max len: 84 min len: 55 avg len: 69.5 num_loss_counted_tokens: 81
total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 2 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 65
total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 2 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 75
total tokens: 200 num samples: 2 num padding tokens: 30 - rank: 1 max len: 100 min len: 70 avg len: 85.0 num_loss_counted_tokens: 92
total tokens: 142 num samples: 2 num padding tokens: 9 - rank: 2 max len: 71 min len: 62 avg len: 66.5 num_loss_counted_tokens: 62
total tokens: 214 num samples: 2 num padding tokens: 49 - rank: 2 max len: 107 min len: 58 avg len: 82.5 num_loss_counted_tokens: 102
Per-token loss scaled by world size: 0.0008622364257462323
Per-token loss scaled by world size: 7.275798998307437e-05
Per-token loss scaled by world size: 0.00035221243160776794
Per-token loss scaled by world size: 0.0006777397356927395
Per-token loss scaled by world size: 0.0003756655496545136
Per-token loss scaled by world size: 0.0009425911703146994
Per-token loss scaled by world size: 0.0004384875064715743
Epoch: 8, Step: 97, Rank: 1, loss = 0.061649903655052185
Epoch: 8, Step: 97, Rank: 0, loss = 0.005202196538448334
Epoch: 8, Step: 97, Rank: 4, loss = 0.025183189660310745
Epoch: 8, Step: 97, Rank: 5, loss = 0.048458389937877655
Epoch: 8, Step: 97, Rank: 3, loss = 0.026860086247324944
Epoch: 8, Step: 97, Rank: 7, loss = 0.06739526987075806
Epoch: 8, Step: 97, Rank: 6, loss = 0.031351856887340546
Per-token loss scaled by world size: 0.0008824544493108988
Epoch: 8, Step: 97, Rank: 2, loss = 0.06309549510478973
[2024-07-27 20:07:23,455] [INFO] [logging.py:96:log_dist] [Rank 0] step=97, skipped=0, lr=[2.7557479520891104e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:23,531] [INFO] [timer.py:258:stop] epoch=0/micro_step=97/global_step=97, RunningAvgSamplesPerSec=31.709145954932833, CurrSamplesPerSec=30.76501430086227, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 8%|▊ | 1/12 [00:00<00:10, 1.07it/s]
{
"epoch": 8,
"step": 97,
"rank": 0,
"loss": 0.005202196538448334,
"overall_throughput": 30.654808948925602,
"lr": 2.7557479520891104e-06,
"cuda_mem_allocated": 22.001669883728027,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 572,
"batch_size": 16,
"total_loss": 0.041149549186229706,
"gradnorm": 0.7585266828536987,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:23.574672"
}
Per-token loss scaled by world size: 0.00014853040920570493
Per-token loss scaled by world size: 0.0014420171501114964
Per-token loss scaled by world size: 0.00023432534362655133
Per-token loss scaled by world size: 0.0010252870852127671
Per-token loss scaled by world size: 0.00022051780251786113
Per-token loss scaled by world size: 0.0014104293659329414
Per-token loss scaled by world size: 0.0005214783013798296
Epoch: 8, Step: 98, Rank: 5, loss = 0.10472649335861206
Epoch: 8, Step: 98, Rank: 6, loss = 0.017017878592014313
Epoch: 8, Step: 98, Rank: 7, loss = 0.01078702136874199
Epoch: 8, Step: 98, Rank: 4, loss = 0.07446147501468658
Epoch: 8, Step: 98, Rank: 1, loss = 0.016015104949474335
Epoch: 8, Step: 98, Rank: 3, loss = 0.10243242979049683
Epoch: 8, Step: 98, Rank: 2, loss = 0.03787236288189888
Per-token loss scaled by world size: 0.0007809263770468533
Epoch: 8, Step: 98, Rank: 0, loss = 0.056714776903390884
[2024-07-27 20:07:23,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=98, skipped=0, lr=[2.5317852301584642e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:24,076] [INFO] [timer.py:258:stop] epoch=0/micro_step=98/global_step=98, RunningAvgSamplesPerSec=31.716539168080143, CurrSamplesPerSec=32.43497139719714, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 17%|█▋ | 2/12 [00:01<00:07, 1.42it/s]
{
"epoch": 8,
"step": 98,
"rank": 0,
"loss": 0.056714776903390884,
"overall_throughput": 32.37466676829651,
"lr": 2.5317852301584642e-06,
"cuda_mem_allocated": 21.996094703674316,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 581,
"batch_size": 16,
"total_loss": 0.052503444254398346,
"gradnorm": 0.8310177326202393,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:24.122854"
}
Per-token loss scaled by world size: 1.7735277651809156e-05
Per-token loss scaled by world size: 0.0005863551050424576
Per-token loss scaled by world size: 0.0004776878922712058
Per-token loss scaled by world size: 0.00042502893484197557
Per-token loss scaled by world size: 0.000244389520958066
Per-token loss scaled by world size: 4.276382242096588e-05
Per-token loss scaled by world size: 0.00011345247185090557
Epoch: 8, Step: 99, Rank: 3, loss = 0.03155839815735817
Epoch: 8, Step: 99, Rank: 2, loss = 0.04353686794638634
Epoch: 8, Step: 99, Rank: 1, loss = 0.03546832501888275
Epoch: 8, Step: 99, Rank: 0, loss = 0.0013168443692848086
Epoch: 8, Step: 99, Rank: 4, loss = 0.01814592257142067
Epoch: 8, Step: 99, Rank: 6, loss = 0.003175213700160384
Epoch: 8, Step: 99, Rank: 7, loss = 0.00842384621500969
Per-token loss scaled by world size: 0.0002905700821429491
Epoch: 8, Step: 99, Rank: 5, loss = 0.021574828773736954
[2024-07-27 20:07:24,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=99, skipped=0, lr=[2.315988891431412e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:24,631] [INFO] [timer.py:258:stop] epoch=0/micro_step=99/global_step=99, RunningAvgSamplesPerSec=31.71474915120891, CurrSamplesPerSec=31.54384320597289, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 1584
{
"epoch": 8,
"step": 99,
"rank": 0,
"loss": 0.0013168443692848086,
"overall_throughput": 31.43285568885302,
"lr": 2.315988891431412e-06,
"cuda_mem_allocated": 21.998091220855713,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 594,
"batch_size": 16,
"total_loss": 0.02040003053843975,
"gradnorm": 0.5513115525245667,
"weight_norm": 393.475830078125,
"timestamp": "2024-07-27T20:07:24.635155"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1584
[20:07:42] INFO saving took 17.857797861099243 seconds utils.py:611
Epoch 8: 25%|██▌ | 3/12 [00:19<01:19, 8.79s/it]
Per-token loss scaled by world size: 0.00023032784520182759
Per-token loss scaled by world size: 0.0005766816902905703
Per-token loss scaled by world size: 0.00042750773718580604
Per-token loss scaled by world size: 0.0003948273661080748
Per-token loss scaled by world size: 0.00027044868329539895
Per-token loss scaled by world size: 0.00016059860354289412
Per-token loss scaled by world size: 1.1637920579232741e-05
Epoch: 8, Step: 100, Rank: 0, loss = 0.020211268216371536
Epoch: 8, Step: 100, Rank: 3, loss = 0.05060381814837456
Epoch: 8, Step: 100, Rank: 1, loss = 0.03464610129594803
Epoch: 8, Step: 100, Rank: 2, loss = 0.03751380369067192
Epoch: 8, Step: 100, Rank: 4, loss = 0.02373187243938446
Epoch: 8, Step: 100, Rank: 7, loss = 0.00102122756652534
Epoch: 8, Step: 100, Rank: 5, loss = 0.014092527329921722
Per-token loss scaled by world size: 3.5370929253986105e-05
Epoch: 8, Step: 100, Rank: 6, loss = 0.0031037991866469383
[2024-07-27 20:07:42,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[2.1085949060360654e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:43,046] [INFO] [timer.py:258:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=31.71639245876496, CurrSamplesPerSec=31.876606801027897, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 33%|███▎ | 4/12 [00:20<00:44, 5.54s/it]
{
"epoch": 8,
"step": 100,
"rank": 0,
"loss": 0.020211268216371536,
"overall_throughput": 31.806867290460243,
"lr": 2.1085949060360654e-06,
"cuda_mem_allocated": 21.999046802520752,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 702,
"batch_size": 16,
"total_loss": 0.023115552961826324,
"gradnorm": 0.8944979310035706,
"weight_norm": 393.4758605957031,
"timestamp": "2024-07-27T20:07:43.089296"
}
Per-token loss scaled by world size: 0.002073504263535142
Per-token loss scaled by world size: 0.0009401860297657549
Per-token loss scaled by world size: 0.0007578051881864667
Per-token loss scaled by world size: 0.00018321115931030363
Per-token loss scaled by world size: 0.00033954239916056395
Per-token loss scaled by world size: 0.00022701223497278988
Per-token loss scaled by world size: 0.00016321164730470628
Epoch: 8, Step: 101, Rank: 0, loss = 0.1342594027519226
Epoch: 8, Step: 101, Rank: 4, loss = 0.04906788468360901
Epoch: 8, Step: 101, Rank: 5, loss = 0.02198537066578865
Epoch: 8, Step: 101, Rank: 2, loss = 0.011862922459840775
Epoch: 8, Step: 101, Rank: 6, loss = 0.014699041843414307
Epoch: 8, Step: 101, Rank: 1, loss = 0.06087704375386238
Epoch: 8, Step: 101, Rank: 7, loss = 0.010567953810095787
Per-token loss scaled by world size: 0.00039684344665147364
Epoch: 8, Step: 101, Rank: 3, loss = 0.025695612654089928
[2024-07-27 20:07:43,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=101, skipped=0, lr=[1.9098300562505266e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:43,592] [INFO] [timer.py:258:stop] epoch=0/micro_step=101/global_step=101, RunningAvgSamplesPerSec=31.718083745763387, CurrSamplesPerSec=31.884709476489913, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 42%|████▏ | 5/12 [00:20<00:26, 3.74s/it]
{
"epoch": 8,
"step": 101,
"rank": 0,
"loss": 0.1342594027519226,
"overall_throughput": 31.800326016907388,
"lr": 1.9098300562505266e-06,
"cuda_mem_allocated": 21.998091220855713,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 518,
"batch_size": 16,
"total_loss": 0.041126906871795654,
"gradnorm": 0.7889791131019592,
"weight_norm": 393.4758605957031,
"timestamp": "2024-07-27T20:07:43.642554"
}
Per-token loss scaled by world size: 0.00021082548482809216
Per-token loss scaled by world size: 0.0001290614891331643
Per-token loss scaled by world size: 0.0004096523334737867
Per-token loss scaled by world size: 0.00011912157060578465
Per-token loss scaled by world size: 0.00024137772561516613
Per-token loss scaled by world size: 0.00010579593799775466
Per-token loss scaled by world size: 0.00036702080979011953
Epoch: 8, Step: 102, Rank: 2, loss = 0.011470340192317963
Epoch: 8, Step: 102, Rank: 1, loss = 0.03640785068273544
Epoch: 8, Step: 102, Rank: 0, loss = 0.01873711496591568
Epoch: 8, Step: 102, Rank: 4, loss = 0.009402614086866379
Epoch: 8, Step: 102, Rank: 3, loss = 0.010586929507553577
Epoch: 8, Step: 102, Rank: 6, loss = 0.03261897340416908
Epoch: 8, Step: 102, Rank: 5, loss = 0.021452445536851883
Per-token loss scaled by world size: 0.0002314479643246159
Epoch: 8, Step: 102, Rank: 7, loss = 0.0205699373036623
[2024-07-27 20:07:44,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=102, skipped=0, lr=[1.7199116885197996e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:44,136] [INFO] [timer.py:258:stop] epoch=0/micro_step=102/global_step=102, RunningAvgSamplesPerSec=31.727033550532532, CurrSamplesPerSec=32.638783565843816, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 8: 50%|█████ | 6/12 [00:21<00:15, 2.65s/it]
{
"epoch": 8,
"step": 102,
"rank": 0,
"loss": 0.01873711496591568,
"overall_throughput": 32.58409397351569,
"lr": 1.7199116885197996e-06,
"cuda_mem_allocated": 22.00572395324707,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 711,
"batch_size": 16,
"total_loss": 0.0201557744294405,
"gradnorm": 0.5014692544937134,
"weight_norm": 393.4758605957031,
"timestamp": "2024-07-27T20:07:44.178902"
}
Per-token loss scaled by world size: 0.0001110902740038
Per-token loss scaled by world size: 0.0010643235873430967
Per-token loss scaled by world size: 0.00038977997610345483
Per-token loss scaled by world size: 0.0005851351888850331
Per-token loss scaled by world size: 0.0006453694077208638
Per-token loss scaled by world size: 2.1985697458148934e-05
Per-token loss scaled by world size: 0.0008211812237277627
Epoch: 8, Step: 103, Rank: 6, loss = 0.0468108169734478
Epoch: 8, Step: 103, Rank: 1, loss = 0.08514588326215744
Epoch: 8, Step: 103, Rank: 5, loss = 0.03118239715695381
Epoch: 8, Step: 103, Rank: 2, loss = 0.051629554480314255
Epoch: 8, Step: 103, Rank: 0, loss = 0.008887222036719322
Epoch: 8, Step: 103, Rank: 7, loss = 0.0017588557675480843
Epoch: 8, Step: 103, Rank: 3, loss = 0.06569449603557587
Per-token loss scaled by world size: 0.0005115483654662967
Epoch: 8, Step: 103, Rank: 4, loss = 0.04092387109994888
[2024-07-27 20:07:44,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=103, skipped=0, lr=[1.5390474757906449e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:44,676] [INFO] [timer.py:258:stop] epoch=0/micro_step=103/global_step=103, RunningAvgSamplesPerSec=31.73624582311991, CurrSamplesPerSec=32.68529726054485, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 8: 58%|█████▊ | 7/12 [00:22<00:09, 1.96s/it]
{
"epoch": 8,
"step": 103,
"rank": 0,
"loss": 0.008887222036719322,
"overall_throughput": 32.6317687389074,
"lr": 1.5390474757906449e-06,
"cuda_mem_allocated": 22.01240301132202,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 640,
"batch_size": 16,
"total_loss": 0.0415041409432888,
"gradnorm": 0.7137126326560974,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:44.718888"
}
Per-token loss scaled by world size: 0.000524764705915004
Per-token loss scaled by world size: 0.00015330749738495797
Per-token loss scaled by world size: 0.001214228686876595
Per-token loss scaled by world size: 0.00014493752678390592
Per-token loss scaled by world size: 0.0008454297785647213
Per-token loss scaled by world size: 0.0007223724969662726
Per-token loss scaled by world size: 0.0003260647936258465
Epoch: 8, Step: 104, Rank: 7, loss = 0.01151722576469183
Epoch: 8, Step: 104, Rank: 5, loss = 0.06351291388273239
Epoch: 8, Step: 104, Rank: 4, loss = 0.01088843122124672
Epoch: 8, Step: 104, Rank: 0, loss = 0.03942294791340828
Epoch: 8, Step: 104, Rank: 3, loss = 0.09121893346309662
Epoch: 8, Step: 104, Rank: 2, loss = 0.054268233478069305
Epoch: 8, Step: 104, Rank: 1, loss = 0.024495618417859077
Per-token loss scaled by world size: 0.0009017937700264156
Epoch: 8, Step: 104, Rank: 6, loss = 0.06774725764989853
[2024-07-27 20:07:45,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=104, skipped=0, lr=[1.367435190424261e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:45,219] [INFO] [timer.py:258:stop] epoch=0/micro_step=104/global_step=104, RunningAvgSamplesPerSec=31.74047431733202, CurrSamplesPerSec=32.17343553961532, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 67%|██████▋ | 8/12 [00:22<00:06, 1.51s/it]
{
"epoch": 8,
"step": 104,
"rank": 0,
"loss": 0.03942294791340828,
"overall_throughput": 32.1218766079758,
"lr": 1.367435190424261e-06,
"cuda_mem_allocated": 22.004770278930664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 601,
"batch_size": 16,
"total_loss": 0.04538394883275032,
"gradnorm": 0.7771502137184143,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:45.261450"
}
Per-token loss scaled by world size: 0.0007052936707623303
Per-token loss scaled by world size: 0.000300221232464537
Per-token loss scaled by world size: 0.0001537334028398618
Per-token loss scaled by world size: 0.0005797221674583852
Per-token loss scaled by world size: 2.4881815988919698e-05
Per-token loss scaled by world size: 8.731409616302699e-05
Per-token loss scaled by world size: 8.151983638526872e-05
Epoch: 8, Step: 105, Rank: 0, loss = 0.04989952594041824
Epoch: 8, Step: 105, Rank: 6, loss = 0.010876637883484364
Epoch: 8, Step: 105, Rank: 5, loss = 0.0017603884916752577
Epoch: 8, Step: 105, Rank: 1, loss = 0.0061774724163115025
Epoch: 8, Step: 105, Rank: 2, loss = 0.021240651607513428
Epoch: 8, Step: 105, Rank: 7, loss = 0.04101534187793732
Epoch: 8, Step: 105, Rank: 4, loss = 0.005767528433352709
Per-token loss scaled by world size: 0.0010399594902992249
Epoch: 8, Step: 105, Rank: 3, loss = 0.07357713580131531
[2024-07-27 20:07:45,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=105, skipped=0, lr=[1.2052624879351105e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:45,759] [INFO] [timer.py:258:stop] epoch=0/micro_step=105/global_step=105, RunningAvgSamplesPerSec=31.74492270946544, CurrSamplesPerSec=32.205303527286674, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 75%|███████▌ | 9/12 [00:23<00:03, 1.21s/it]
{
"epoch": 8,
"step": 105,
"rank": 0,
"loss": 0.04989952594041824,
"overall_throughput": 32.11246971610297,
"lr": 1.2052624879351105e-06,
"cuda_mem_allocated": 21.996421813964844,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 566,
"batch_size": 16,
"total_loss": 0.026289334520697594,
"gradnorm": 0.5574781894683838,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:45.808243"
}
Per-token loss scaled by world size: 0.0008091902709566057
Per-token loss scaled by world size: 0.0007261958089657128
Per-token loss scaled by world size: 0.0015269122086465359
Per-token loss scaled by world size: 0.0014011193998157978
Per-token loss scaled by world size: 3.460650987108238e-05
Per-token loss scaled by world size: 0.00045884415158070624
Per-token loss scaled by world size: 5.5351883929688483e-05
Epoch: 8, Step: 106, Rank: 0, loss = 0.05917203798890114
Epoch: 8, Step: 106, Rank: 1, loss = 0.11165545880794525
Epoch: 8, Step: 106, Rank: 2, loss = 0.053103066980838776
Epoch: 8, Step: 106, Rank: 6, loss = 0.10245685279369354
Epoch: 8, Step: 106, Rank: 3, loss = 0.0025306011084467173
Epoch: 8, Step: 106, Rank: 5, loss = 0.004047606606036425
Epoch: 8, Step: 106, Rank: 4, loss = 0.033552978187799454
Per-token loss scaled by world size: 0.00016226798470597714
Epoch: 8, Step: 106, Rank: 7, loss = 0.011865845881402493
[2024-07-27 20:07:46,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=106, skipped=0, lr=[1.0527067017923654e-06], mom=[(0.9, 0.95)]
[2024-07-27 20:07:46,311] [INFO] [timer.py:258:stop] epoch=0/micro_step=106/global_step=106, RunningAvgSamplesPerSec=31.74620798139189, CurrSamplesPerSec=31.879150748989836, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 83%|████████▎ | 10/12 [00:23<00:02, 1.00s/it]
{
"epoch": 8,
"step": 106,
"rank": 0,
"loss": 0.05917203798890114,
"overall_throughput": 31.805977880940024,
"lr": 1.0527067017923654e-06,
"cuda_mem_allocated": 21.998091220855713,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 585,
"batch_size": 16,
"total_loss": 0.04729805514216423,
"gradnorm": 0.8684948086738586,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:46.362067"
}
Per-token loss scaled by world size: 3.8418351323343813e-05
Per-token loss scaled by world size: 0.00024928394122980535
Per-token loss scaled by world size: 0.0008044589194469154
Per-token loss scaled by world size: 0.0006902430322952569
Per-token loss scaled by world size: 0.0003151585115119815
Per-token loss scaled by world size: 0.000329785660142079
Per-token loss scaled by world size: 2.906994086515624e-05
Epoch: 8, Step: 107, Rank: 4, loss = 0.05936089903116226
Epoch: 8, Step: 107, Rank: 5, loss = 0.06918346881866455
Epoch: 8, Step: 107, Rank: 0, loss = 0.0033039783593267202
Epoch: 8, Step: 107, Rank: 3, loss = 0.021438419818878174
Epoch: 8, Step: 107, Rank: 2, loss = 0.028361566364765167
Epoch: 8, Step: 107, Rank: 6, loss = 0.02710363268852234
Epoch: 8, Step: 107, Rank: 1, loss = 0.0025000148452818394
Per-token loss scaled by world size: 3.5674460377776995e-05
Epoch: 8, Step: 107, Rank: 7, loss = 0.0030680035706609488
[2024-07-27 20:07:46,768] [INFO] [logging.py:96:log_dist] [Rank 0] step=107, skipped=0, lr=[9.09934649508375e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:07:46,845] [INFO] [timer.py:258:stop] epoch=0/micro_step=107/global_step=107, RunningAvgSamplesPerSec=31.759325648755453, CurrSamplesPerSec=33.18541023815175, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 8: 92%|█████████▏| 11/12 [00:24<00:00, 1.16it/s]
{
"epoch": 8,
"step": 107,
"rank": 0,
"loss": 0.0033039783593267202,
"overall_throughput": 33.10411670052279,
"lr": 9.09934649508375e-07,
"cuda_mem_allocated": 22.00811004638672,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 688,
"batch_size": 16,
"total_loss": 0.02678999863564968,
"gradnorm": 0.5617575645446777,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:46.887834"
}
Per-token loss scaled by world size: 0.0007775372941978276
Per-token loss scaled by world size: 0.0009101248579099774
Per-token loss scaled by world size: 8.433926268480718e-06
Per-token loss scaled by world size: 3.585006925277412e-05
Per-token loss scaled by world size: 3.69123590644449e-05
Per-token loss scaled by world size: 0.0004430596309248358
Per-token loss scaled by world size: 0.0002181310555897653
Epoch: 8, Step: 108, Rank: 6, loss = 0.07588165998458862
Epoch: 8, Step: 108, Rank: 3, loss = 0.0029889994766563177
Epoch: 8, Step: 108, Rank: 7, loss = 0.03694009780883789
Epoch: 8, Step: 108, Rank: 1, loss = 0.0030775677878409624
Epoch: 8, Step: 108, Rank: 5, loss = 0.06482717394828796
Epoch: 8, Step: 108, Rank: 2, loss = 0.0007031786371953785
Epoch: 8, Step: 108, Rank: 4, loss = 0.018186677247285843
Per-token loss scaled by world size: 0.0005422345129773021
Epoch: 8, Step: 108, Rank: 0, loss = 0.045208804309368134
[2024-07-27 20:07:47,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=108, skipped=0, lr=[7.771024502261526e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:07:47,383] [INFO] [timer.py:258:stop] epoch=0/micro_step=108/global_step=108, RunningAvgSamplesPerSec=31.765144134069732, CurrSamplesPerSec=32.38818214329323, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 8: 100%|██████████| 12/12 [00:24<00:00, 1.31it/s]
{
"epoch": 8,
"step": 108,
"rank": 0,
"loss": 0.045208804309368134,
"overall_throughput": 32.30654330502832,
"lr": 7.771024502261526e-07,
"cuda_mem_allocated": 21.99594497680664,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 667,
"batch_size": 16,
"total_loss": 0.03097676858305931,
"gradnorm": 0.6119153499603271,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:47.430973"
}
Epoch 8: 100%|██████████| 12/12 [00:24<00:00, 2.07s/it]
total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 5 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 64
total tokens: 140 num samples: 2 num padding tokens: 25 - rank: 5 max len: 70 min len: 45 avg len: 57.5 num_loss_counted_tokens: 67
total tokens: 188 num samples: 2 num padding tokens: 28 - rank: 5 max len: 94 min len: 66 avg len: 80.0 num_loss_counted_tokens: 82
total tokens: 166 num samples: 2 num padding tokens: 24 - rank: 5 max len: 83 min len: 59 avg len: 71.0 num_loss_counted_tokens: 64
total tokens: 122 num samples: 2 num padding tokens: 1 - rank: 5 max len: 61 min len: 60 avg len: 60.5 num_loss_counted_tokens: 66
total tokens: 144 num samples: 2 num padding tokens: 17 - rank: 5 max len: 72 min len: 55 avg len: 63.5 num_loss_counted_tokens: 68
total tokens: 164 num samples: 2 num padding tokens: 19 - rank: 5 max len: 82 min len: 63 avg len: 72.5 num_loss_counted_tokens: 78
total tokens: 128 num samples: 2 num padding tokens: 3 - rank: 5 max len: 64 min len: 61 avg len: 62.5 num_loss_counted_tokens: 66
total tokens: 282 num samples: 2 num padding tokens: 80 - rank: 5 max len: 141 min len: 61 avg len: 101.0 num_loss_counted_tokens: 146
total tokens: 124 num samples: 2 num padding tokens: 17 - rank: 1 max len: 62 min len: 45 avg len: 53.5 num_loss_counted_tokens: 53
total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 4 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 57
total tokens: 136 num samples: 2 num padding tokens: 13 - rank: 5 max len: 68 min len: 55 avg len: 61.5 num_loss_counted_tokens: 48
total tokens: 160 num samples: 2 num padding tokens: 6 - rank: 7 max len: 80 min len: 74 avg len: 77.0 num_loss_counted_tokens: 94
total tokens: 200 num samples: 2 num padding tokens: 45 - rank: 4 max len: 100 min len: 55 avg len: 77.5 num_loss_counted_tokens: 99
total tokens: 96 num samples: 2 num padding tokens: 5 - rank: 5 max len: 48 min len: 43 avg len: 45.5 num_loss_counted_tokens: 39
total tokens: 140 num samples: 2 num padding tokens: 22 - rank: 3 max len: 70 min len: 48 avg len: 59.0 num_loss_counted_tokens: 73
total tokens: 216 num samples: 2 num padding tokens: 42 - rank: 2 max len: 108 min len: 66 avg len: 87.0 num_loss_counted_tokens: 105
total tokens: 138 num samples: 2 num padding tokens: 10 - rank: 7 max len: 69 min len: 59 avg len: 64.0 num_loss_counted_tokens: 68
total tokens: 104 num samples: 2 num padding tokens: 0 - rank: 1 max len: 52 min len: 52 avg len: 52.0 num_loss_counted_tokens: 50
total tokens: 176 num samples: 2 num padding tokens: 24 - rank: 1 max len: 88 min len: 64 avg len: 76.0 num_loss_counted_tokens: 94
total tokens: 142 num samples: 2 num padding tokens: 13 - rank: 3 max len: 71 min len: 58 avg len: 64.5 num_loss_counted_tokens: 69
total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 3 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 71
total tokens: 186 num samples: 2 num padding tokens: 3 - rank: 7 max len: 93 min len: 90 avg len: 91.5 num_loss_counted_tokens: 173
total tokens: 168 num samples: 2 num padding tokens: 18 - rank: 1 max len: 84 min len: 66 avg len: 75.0 num_loss_counted_tokens: 96
total tokens: 136 num samples: 2 num padding tokens: 6 - rank: 6 max len: 68 min len: 62 avg len: 65.0 num_loss_counted_tokens: 57
total tokens: 128 num samples: 2 num padding tokens: 4 - rank: 6 max len: 64 min len: 60 avg len: 62.0 num_loss_counted_tokens: 64
total tokens: 186 num samples: 2 num padding tokens: 14 - rank: 6 max len: 93 min len: 79 avg len: 86.0 num_loss_counted_tokens: 90
total tokens: 168 num samples: 2 num padding tokens: 31 - rank: 6 max len: 84 min len: 53 avg len: 68.5 num_loss_counted_tokens: 73
total tokens: 138 num samples: 2 num padding tokens: 6 - rank: 1 max len: 69 min len: 63 avg len: 66.0 num_loss_counted_tokens: 78
total tokens: 114 num samples: 2 num padding tokens: 6 - rank: 1 max len: 57 min len: 51 avg len: 54.0 num_loss_counted_tokens: 56
total tokens: 172 num samples: 2 num padding tokens: 27 - rank: 6 max len: 86 min len: 59 avg len: 72.5 num_loss_counted_tokens: 80
total tokens: 146 num samples: 2 num padding tokens: 27 - rank: 3 max len: 73 min len: 46 avg len: 59.5 num_loss_counted_tokens: 65
total tokens: 116 num samples: 2 num padding tokens: 5 - rank: 2 max len: 58 min len: 53 avg len: 55.5 num_loss_counted_tokens: 64
total tokens: 134 num samples: 2 num padding tokens: 16 - rank: 1 max len: 67 min len: 51 avg len: 59.0 num_loss_counted_tokens: 56
total tokens: 120 num samples: 2 num padding tokens: 0 - rank: 4 max len: 60 min len: 60 avg len: 60.0 num_loss_counted_tokens: 73
total tokens: 128 num samples: 2 num padding tokens: 18 - rank: 4 max len: 64 min len: 46 avg len: 55.0 num_loss_counted_tokens: 59
total tokens: 124 num samples: 2 num padding tokens: 2 - rank: 6 max len: 62 min len: 60 avg len: 61.0 num_loss_counted_tokens: 73
total tokens: 128 num samples: 2 num padding tokens: 5 - rank: 2 max len: 64 min len: 59 avg len: 61.5 num_loss_counted_tokens: 52
total tokens: 152 num samples: 2 num padding tokens: 11 - rank: 5 max len: 76 min len: 65 avg len: 70.5 num_loss_counted_tokens: 79
total tokens: 172 num samples: 2 num padding tokens: 26 - rank: 4 max len: 86 min len: 60 avg len: 73.0 num_loss_counted_tokens: 84
total tokens: 142 num samples: 2 num padding tokens: 0 - rank: 4 max len: 71 min len: 71 avg len: 71.0 num_loss_counted_tokens: 74
total tokens: 146 num samples: 2 num padding tokens: 16 - rank: 6 max len: 73 min len: 57 avg len: 65.0 num_loss_counted_tokens: 72
total tokens: 154 num samples: 2 num padding tokens: 28 - rank: 4 max len: 77 min len: 49 avg len: 63.0 num_loss_counted_tokens: 80
total tokens: 186 num samples: 2 num padding tokens: 41 - rank: 4 max len: 93 min len: 52 avg len: 72.5 num_loss_counted_tokens: 99
total tokens: 226 num samples: 2 num padding tokens: 35 - rank: 2 max len: 113 min len: 78 avg len: 95.5 num_loss_counted_tokens: 114
total tokens: 166 num samples: 2 num padding tokens: 17 - rank: 2 max len: 83 min len: 66 avg len: 74.5 num_loss_counted_tokens: 86
total tokens: 150 num samples: 2 num padding tokens: 9 - rank: 2 max len: 75 min len: 66 avg len: 70.5 num_loss_counted_tokens: 82
total tokens: 110 num samples: 2 num padding tokens: 3 - rank: 3 max len: 55 min len: 52 avg len: 53.5 num_loss_counted_tokens: 63
total tokens: 116 num samples: 2 num padding tokens: 3 - rank: 6 max len: 58 min len: 55 avg len: 56.5 num_loss_counted_tokens: 52
total tokens: 116 num samples: 2 num padding tokens: 7 - rank: 4 max len: 58 min len: 51 avg len: 54.5 num_loss_counted_tokens: 59
total tokens: 194 num samples: 2 num padding tokens: 27 - rank: 4 max len: 97 min len: 70 avg len: 83.5 num_loss_counted_tokens: 113
total tokens: 120 num samples: 2 num padding tokens: 6 - rank: 6 max len: 60 min len: 54 avg len: 57.0 num_loss_counted_tokens: 60
total tokens: 124 num samples: 2 num padding tokens: 8 - rank: 2 max len: 62 min len: 54 avg len: 58.0 num_loss_counted_tokens: 64
total tokens: 152 num samples: 2 num padding tokens: 27 - rank: 7 max len: 76 min len: 49 avg len: 62.5 num_loss_counted_tokens: 64
total tokens: 174 num samples: 2 num padding tokens: 20 - rank: 1 max len: 87 min len: 67 avg len: 77.0 num_loss_counted_tokens: 82
total tokens: 208 num samples: 2 num padding tokens: 34 - rank: 2 max len: 104 min len: 70 avg len: 87.0 num_loss_counted_tokens: 107
total tokens: 180 num samples: 2 num padding tokens: 30 - rank: 1 max len: 90 min len: 60 avg len: 75.0 num_loss_counted_tokens: 97
total tokens: 172 num samples: 2 num padding tokens: 42 - rank: 1 max len: 86 min len: 44 avg len: 65.0 num_loss_counted_tokens: 68
total tokens: 152 num samples: 2 num padding tokens: 8 - rank: 4 max len: 76 min len: 68 avg len: 72.0 num_loss_counted_tokens: 69
total tokens: 162 num samples: 2 num padding tokens: 18 - rank: 2 max len: 81 min len: 63 avg len: 72.0 num_loss_counted_tokens: 79
total tokens: 126 num samples: 2 num padding tokens: 13 - rank: 3 max len: 63 min len: 50 avg len: 56.5 num_loss_counted_tokens: 57
total tokens: 214 num samples: 2 num padding tokens: 37 - rank: 2 max len: 107 min len: 70 avg len: 88.5 num_loss_counted_tokens: 99
total tokens: 124 num samples: 2 num padding tokens: 12 - rank: 3 max len: 62 min len: 50 avg len: 56.0 num_loss_counted_tokens: 54
total tokens: 114 num samples: 2 num padding tokens: 2 - rank: 7 max len: 57 min len: 55 avg len: 56.0 num_loss_counted_tokens: 64
total tokens: 146 num samples: 2 num padding tokens: 12 - rank: 7 max len: 73 min len: 61 avg len: 67.0 num_loss_counted_tokens: 86
total tokens: 184 num samples: 2 num padding tokens: 42 - rank: 1 max len: 92 min len: 50 avg len: 71.0 num_loss_counted_tokens: 86
total tokens: 164 num samples: 2 num padding tokens: 32 - rank: 6 max len: 82 min len: 50 avg len: 66.0 num_loss_counted_tokens: 94
total tokens: 122 num samples: 2 num padding tokens: 6 - rank: 2 max len: 61 min len: 55 avg len: 58.0 num_loss_counted_tokens: 63
total tokens: 136 num samples: 2 num padding tokens: 16 - rank: 1 max len: 68 min len: 52 avg len: 60.0 num_loss_counted_tokens: 53
total tokens: 160 num samples: 2 num padding tokens: 14 - rank: 7 max len: 80 min len: 66 avg len: 73.0 num_loss_counted_tokens: 81
total tokens: 196 num samples: 2 num padding tokens: 49 - rank: 7 max len: 98 min len: 49 avg len: 73.5 num_loss_counted_tokens: 96
total tokens: 174 num samples: 2 num padding tokens: 23 - rank: 7 max len: 87 min len: 64 avg len: 75.5 num_loss_counted_tokens: 77
total tokens: 180 num samples: 2 num padding tokens: 28 - rank: 7 max len: 90 min len: 62 avg len: 76.0 num_loss_counted_tokens: 92
total tokens: 132 num samples: 2 num padding tokens: 8 - rank: 3 max len: 66 min len: 58 avg len: 62.0 num_loss_counted_tokens: 65
total tokens: 228 num samples: 2 num padding tokens: 45 - rank: 3 max len: 114 min len: 69 avg len: 91.5 num_loss_counted_tokens: 117
total tokens: 188 num samples: 2 num padding tokens: 41 - rank: 3 max len: 94 min len: 53 avg len: 73.5 num_loss_counted_tokens: 85
total tokens: 110 num samples: 2 num padding tokens: 4 - rank: 7 max len: 55 min len: 51 avg len: 53.0 num_loss_counted_tokens: 60
total tokens: 130 num samples: 2 num padding tokens: 2 - rank: 3 max len: 65 min len: 63 avg len: 64.0 num_loss_counted_tokens: 57
total tokens: 142 num samples: 2 num padding tokens: 1 - rank: 6 max len: 71 min len: 70 avg len: 70.5 num_loss_counted_tokens: 66
total tokens: 124 num samples: 2 num padding tokens: 18 - rank: 0 max len: 62 min len: 44 avg len: 53.0 num_loss_counted_tokens: 50
total tokens: 154 num samples: 2 num padding tokens: 3 - rank: 2 max len: 77 min len: 74 avg len: 75.5 num_loss_counted_tokens: 81
total tokens: 134 num samples: 2 num padding tokens: 19 - rank: 0 max len: 67 min len: 48 avg len: 57.5 num_loss_counted_tokens: 64
total tokens: 142 num samples: 2 num padding tokens: 11 - rank: 4 max len: 71 min len: 60 avg len: 65.5 num_loss_counted_tokens: 75
total tokens: 144 num samples: 2 num padding tokens: 28 - rank: 7 max len: 72 min len: 44 avg len: 58.0 num_loss_counted_tokens: 72
total tokens: 158 num samples: 2 num padding tokens: 24 - rank: 0 max len: 79 min len: 55 avg len: 67.0 num_loss_counted_tokens: 75
total tokens: 162 num samples: 2 num padding tokens: 0 - rank: 0 max len: 81 min len: 81 avg len: 81.0 num_loss_counted_tokens: 96
total tokens: 120 num samples: 2 num padding tokens: 16 - rank: 0 max len: 60 min len: 44 avg len: 52.0 num_loss_counted_tokens: 55
total tokens: 134 num samples: 2 num padding tokens: 13 - rank: 0 max len: 67 min len: 54 avg len: 60.5 num_loss_counted_tokens: 58
total tokens: 244 num samples: 2 num padding tokens: 46 - rank: 0 max len: 122 min len: 76 avg len: 99.0 num_loss_counted_tokens: 149
total tokens: 174 num samples: 2 num padding tokens: 24 - rank: 6 max len: 87 min len: 63 avg len: 75.0 num_loss_counted_tokens: 83
total tokens: 214 num samples: 2 num padding tokens: 62 - rank: 0 max len: 107 min len: 45 avg len: 76.0 num_loss_counted_tokens: 99
total tokens: 202 num samples: 2 num padding tokens: 40 - rank: 0 max len: 101 min len: 61 avg len: 81.0 num_loss_counted_tokens: 104
total tokens: 116 num samples: 2 num padding tokens: 13 - rank: 3 max len: 58 min len: 45 avg len: 51.5 num_loss_counted_tokens: 51
total tokens: 116 num samples: 2 num padding tokens: 12 - rank: 0 max len: 58 min len: 46 avg len: 52.0 num_loss_counted_tokens: 54
total tokens: 166 num samples: 2 num padding tokens: 31 - rank: 0 max len: 83 min len: 52 avg len: 67.5 num_loss_counted_tokens: 81
total tokens: 140 num samples: 2 num padding tokens: 4 - rank: 0 max len: 70 min len: 66 avg len: 68.0 num_loss_counted_tokens: 66
Per-token loss scaled by world size: 0.00020837620832026005
Per-token loss scaled by world size: 0.00101291888859123
Per-token loss scaled by world size: 0.0005203241598792374
Per-token loss scaled by world size: 0.0005080624832771719
Per-token loss scaled by world size: 0.0008971338393166661
Per-token loss scaled by world size: 9.295487870986108e-06
Per-token loss scaled by world size: 3.1353247322840616e-05
Epoch: 9, Step: 109, Rank: 1, loss = 0.07001801580190659
Epoch: 9, Step: 109, Rank: 6, loss = 0.014404005371034145
Epoch: 9, Step: 109, Rank: 4, loss = 0.03596740588545799
Epoch: 9, Step: 109, Rank: 5, loss = 0.0620143748819828
Epoch: 9, Step: 109, Rank: 7, loss = 0.03511982038617134
Epoch: 9, Step: 109, Rank: 3, loss = 0.0021672931034117937
Epoch: 9, Step: 109, Rank: 2, loss = 0.0006425505853258073
Per-token loss scaled by world size: 1.730842632241547e-05
Epoch: 9, Step: 109, Rank: 0, loss = 0.0011964449658989906
[2024-07-27 20:07:48,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=109, skipped=0, lr=[6.543553540053926e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:07:48,396] [INFO] [timer.py:258:stop] epoch=0/micro_step=109/global_step=109, RunningAvgSamplesPerSec=31.767697068441695, CurrSamplesPerSec=32.0406552236319, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 1/12 [00:00<00:10, 1.08it/s]
{
"epoch": 9,
"step": 109,
"rank": 0,
"loss": 0.0011964449658989906,
"overall_throughput": 31.927760494448584,
"lr": 6.543553540053926e-07,
"cuda_mem_allocated": 21.998091220855713,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 553,
"batch_size": 16,
"total_loss": 0.02769123949110508,
"gradnorm": 0.690696120262146,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:48.445399"
}
Per-token loss scaled by world size: 0.0014274526620283723
Per-token loss scaled by world size: 0.0002300855703651905
Per-token loss scaled by world size: 0.0005493586650118232
Per-token loss scaled by world size: 0.0009031961672008038
Per-token loss scaled by world size: 0.0002857441722881049
Per-token loss scaled by world size: 0.0005578985437750816
Per-token loss scaled by world size: 0.00028736007516272366
Epoch: 9, Step: 110, Rank: 2, loss = 0.03742505982518196
Epoch: 9, Step: 110, Rank: 5, loss = 0.01567457988858223
Epoch: 9, Step: 110, Rank: 3, loss = 0.06153023988008499
Epoch: 9, Step: 110, Rank: 6, loss = 0.09724520891904831
Epoch: 9, Step: 110, Rank: 7, loss = 0.019466321915388107
Epoch: 9, Step: 110, Rank: 0, loss = 0.03800683841109276
Epoch: 9, Step: 110, Rank: 4, loss = 0.019576406106352806
Per-token loss scaled by world size: 0.0003402826841920614
Epoch: 9, Step: 110, Rank: 1, loss = 0.02318175695836544
[2024-07-27 20:07:48,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[5.418275829936537e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:07:48,932] [INFO] [timer.py:258:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=31.778440450016202, CurrSamplesPerSec=32.971544549678505, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Saving model in huggingface format at samples_seen: 1760
{
"epoch": 9,
"step": 110,
"rank": 0,
"loss": 0.03800683841109276,
"overall_throughput": 32.86280148005085,
"lr": 5.418275829936537e-07,
"cuda_mem_allocated": 21.999285221099854,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 545,
"batch_size": 16,
"total_loss": 0.03901330381631851,
"gradnorm": 0.746061384677887,
"weight_norm": 393.47589111328125,
"timestamp": "2024-07-27T20:07:48.935854"
}
Model saved in /var/instructlabbigdisk/instructlab/skillscheckpoints/hf_format/samples_1760
[20:08:06] INFO saving took 18.007025003433228 seconds utils.py:611
Per-token loss scaled by world size: 0.000437290029367432
Per-token loss scaled by world size: 0.00030764410621486604
Per-token loss scaled by world size: 0.00027671968564391136
Epoch 9: 17%|█▋ | 2/12 [00:19<01:52, 11.29s/it]
Per-token loss scaled by world size: 3.6100764191360213e-06
Per-token loss scaled by world size: 8.117486140690744e-05
Per-token loss scaled by world size: 1.315043573413277e-05
Epoch: 9, Step: 111, Rank: 1, loss = 0.024212971329689026
Epoch: 9, Step: 111, Rank: 0, loss = 0.038262877613306046
Epoch: 9, Step: 111, Rank: 6, loss = 0.0003158816834911704
Epoch: 9, Step: 111, Rank: 7, loss = 0.026918860152363777
Epoch: 9, Step: 111, Rank: 5, loss = 0.007102800067514181
Epoch: 9, Step: 111, Rank: 4, loss = 0.0011506631271913648
Per-token loss scaled by world size: 4.370940587250516e-05
Epoch: 9, Step: 111, Rank: 2, loss = 0.0038245730102062225
Per-token loss scaled by world size: 0.0007190197939053178
Epoch: 9, Step: 111, Rank: 3, loss = 0.06291422992944717
[2024-07-27 20:08:07,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=111, skipped=0, lr=[4.396421846564236e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:08:07,493] [INFO] [timer.py:258:stop] epoch=0/micro_step=111/global_step=111, RunningAvgSamplesPerSec=31.77947002508808, CurrSamplesPerSec=31.891058187078368, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 3/12 [00:20<00:57, 6.39s/it]
{
"epoch": 9,
"step": 111,
"rank": 0,
"loss": 0.038262877613306046,
"overall_throughput": 31.83512815148673,
"lr": 4.396421846564236e-07,
"cuda_mem_allocated": 22.00214672088623,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 700,
"batch_size": 16,
"total_loss": 0.020587855949997902,
"gradnorm": 0.4546854794025421,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:07.536418"
}
Per-token loss scaled by world size: 1.5196498679870274e-05
Per-token loss scaled by world size: 0.00038456235779449344
Per-token loss scaled by world size: 2.9230683139758185e-05
Per-token loss scaled by world size: 4.643391366698779e-05
Per-token loss scaled by world size: 0.000584576278924942
Per-token loss scaled by world size: 2.605171175673604e-05
Per-token loss scaled by world size: 0.000400967663154006
Epoch: 9, Step: 112, Rank: 7, loss = 0.0473506785929203
Epoch: 9, Step: 112, Rank: 5, loss = 0.031149551272392273
Epoch: 9, Step: 112, Rank: 6, loss = 0.002367685316130519
Epoch: 9, Step: 112, Rank: 1, loss = 0.003761146916076541
Epoch: 9, Step: 112, Rank: 2, loss = 0.0012309163575991988
Epoch: 9, Step: 112, Rank: 3, loss = 0.03247838094830513
Epoch: 9, Step: 112, Rank: 0, loss = 0.0021101885940879583
Per-token loss scaled by world size: 2.4041009965003468e-05
Epoch: 9, Step: 112, Rank: 4, loss = 0.0019473218126222491
[2024-07-27 20:08:07,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=112, skipped=0, lr=[3.4791089722651437e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:08:08,033] [INFO] [timer.py:258:stop] epoch=0/micro_step=112/global_step=112, RunningAvgSamplesPerSec=31.783663551587292, CurrSamplesPerSec=32.24748961705518, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 4/12 [00:20<00:32, 4.08s/it]
{
"epoch": 9,
"step": 112,
"rank": 0,
"loss": 0.0021101885940879583,
"overall_throughput": 32.15704750085054,
"lr": 3.4791089722651437e-07,
"cuda_mem_allocated": 22.002624034881592,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 648,
"batch_size": 16,
"total_loss": 0.015299483202397823,
"gradnorm": 0.40167224407196045,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:08.078974"
}
Per-token loss scaled by world size: 0.0005526405875571072
Per-token loss scaled by world size: 0.0009644478559494019
Per-token loss scaled by world size: 0.00029839982744306326
Per-token loss scaled by world size: 0.0004884039517492056
Per-token loss scaled by world size: 6.763617875549244e-06
Per-token loss scaled by world size: 0.00016722115105949342
Per-token loss scaled by world size: 9.371204214403406e-05
Epoch: 9, Step: 113, Rank: 4, loss = 0.020887987688183784
Epoch: 9, Step: 113, Rank: 1, loss = 0.03418827801942825
Epoch: 9, Step: 113, Rank: 6, loss = 0.00047345325583592057
Epoch: 9, Step: 113, Rank: 3, loss = 0.06751134991645813
Epoch: 9, Step: 113, Rank: 7, loss = 0.011705480515956879
Epoch: 9, Step: 113, Rank: 5, loss = 0.03868484124541283
Epoch: 9, Step: 113, Rank: 2, loss = 0.006559843197464943
Per-token loss scaled by world size: 0.0006510451785288751
Epoch: 9, Step: 113, Rank: 0, loss = 0.0455731637775898
[2024-07-27 20:08:08,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=113, skipped=0, lr=[2.667340275199426e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:08:08,574] [INFO] [timer.py:258:stop] epoch=0/micro_step=113/global_step=113, RunningAvgSamplesPerSec=31.789040793376557, CurrSamplesPerSec=32.391855899896804, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 5/12 [00:21<00:19, 2.80s/it]
{
"epoch": 9,
"step": 113,
"rank": 0,
"loss": 0.0455731637775898,
"overall_throughput": 32.30453715301278,
"lr": 2.667340275199426e-07,
"cuda_mem_allocated": 21.99761438369751,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 560,
"batch_size": 16,
"total_loss": 0.02819805033504963,
"gradnorm": 0.569709300994873,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:08.621715"
}
Per-token loss scaled by world size: 0.00019008330127689987
Per-token loss scaled by world size: 0.00028144754469394684
Per-token loss scaled by world size: 0.00048485625302419066
Per-token loss scaled by world size: 0.0003446684859227389
Per-token loss scaled by world size: 0.0005211489042267203
Per-token loss scaled by world size: 0.000750985462218523
Per-token loss scaled by world size: 0.00011730282858479768
Epoch: 9, Step: 114, Rank: 1, loss = 0.041273389011621475
Epoch: 9, Step: 114, Rank: 2, loss = 0.023958221077919006
Epoch: 9, Step: 114, Rank: 5, loss = 0.029339905828237534
Epoch: 9, Step: 114, Rank: 7, loss = 0.04436279833316803
Epoch: 9, Step: 114, Rank: 0, loss = 0.016180841252207756
Epoch: 9, Step: 114, Rank: 6, loss = 0.06392763555049896
Epoch: 9, Step: 114, Rank: 4, loss = 0.009985403157770634
Per-token loss scaled by world size: 0.0006613648729398847
Epoch: 9, Step: 114, Rank: 3, loss = 0.05629868432879448
[2024-07-27 20:08:09,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=114, skipped=0, lr=[1.9620034125190645e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:08:09,125] [INFO] [timer.py:258:stop] epoch=0/micro_step=114/global_step=114, RunningAvgSamplesPerSec=31.789347765969946, CurrSamplesPerSec=31.82345861552571, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 9: 6/12 [00:21<00:12, 2.04s/it]
{
"epoch": 9,
"step": 114,
"rank": 0,
"loss": 0.016180841252207756,
"overall_throughput": 31.73594640312759,
"lr": 1.9620034125190645e-07,
"cuda_mem_allocated": 22.01240301132202,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 681,
"batch_size": 16,
"total_loss": 0.035665858536958694,
"gradnorm": 0.562235951423645,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:09.167828"
}
Per-token loss scaled by world size: 0.00038479233626276255
Per-token loss scaled by world size: 0.00017321300401818007
Per-token loss scaled by world size: 0.00037739198887720704
Per-token loss scaled by world size: 0.0002703580248635262
Per-token loss scaled by world size: 0.0003972994163632393
Per-token loss scaled by world size: 0.0005138221313245595
Per-token loss scaled by world size: 0.0007780570886097848
Epoch: 9, Step: 115, Rank: 2, loss = 0.028068529441952705
Epoch: 9, Step: 115, Rank: 5, loss = 0.038215521723032
Epoch: 9, Step: 115, Rank: 7, loss = 0.012882716953754425
Epoch: 9, Step: 115, Rank: 0, loss = 0.028618929907679558
Epoch: 9, Step: 115, Rank: 4, loss = 0.020107878372073174
Epoch: 9, Step: 115, Rank: 6, loss = 0.029549144208431244
Epoch: 9, Step: 115, Rank: 1, loss = 0.05786799639463425
Per-token loss scaled by world size: 0.0008164329337887466
Epoch: 9, Step: 115, Rank: 3, loss = 0.06072219833731651
[2024-07-27 20:08:09,592] [INFO] [logging.py:96:log_dist] [Rank 0] step=115, skipped=0, lr=[1.3638696597277678e-07], mom=[(0.9, 0.95)]
[2024-07-27 20:08:09,670] [INFO] [timer.py:258:stop] epoch=0/micro_step=115/global_step=115, RunningAvgSamplesPerSec=31.790592582420803, CurrSamplesPerSec=31.930631657680326, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 9: 7/12 [00:22<00:07, 1.55s/it]
{
"epoch": 9,
"step": 115,
"rank": 0,
"loss": 0.028618929907679558,
"overall_throughput": 31.843164422360154,
"lr": 1.3638696597277678e-07,
"cuda_mem_allocated": 22.00882577896118,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 595,
"batch_size": 16,
"total_loss": 0.03450411558151245,
"gradnorm": 0.7144197821617126,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:09.673464"
}
Per-token loss scaled by world size: 0.0001559390948386863
Per-token loss scaled by world size: 0.0005046449950896204
Per-token loss scaled by world size: 7.42613265174441e-05
Per-token loss scaled by world size: 0.00044146235450170934
Per-token loss scaled by world size: 1.3344148101168685e-05
Per-token loss scaled by world size: 8.937142411014065e-06
Per-token loss scaled by world size: 4.871577039011754e-05
Epoch: 9, Step: 116, Rank: 1, loss = 0.037406809628009796
Epoch: 9, Step: 116, Rank: 5, loss = 0.005504620727151632
Epoch: 9, Step: 116, Rank: 0, loss = 0.011558985337615013
Epoch: 9, Step: 116, Rank: 7, loss = 0.03272339701652527
Epoch: 9, Step: 116, Rank: 6, loss = 0.0006624656962230802
Epoch: 9, Step: 116, Rank: 2, loss = 0.0009891350055113435
Epoch: 9, Step: 116, Rank: 4, loss = 0.0036110563669353724
Per-token loss scaled by world size: 0.0008237811853177845
Epoch: 9, Step: 116, Rank: 3, loss = 0.061062779277563095
[2024-07-27 20:08:10,138] [INFO] [logging.py:96:log_dist] [Rank 0] step=116, skipped=0, lr=[8.735930673024806e-08], mom=[(0.9, 0.95)]
[2024-07-27 20:08:10,215] [INFO] [timer.py:258:stop] epoch=0/micro_step=116/global_step=116, RunningAvgSamplesPerSec=31.793855612274417, CurrSamplesPerSec=32.166943077303586, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 8/12 [00:22<00:04, 1.23s/it]
{
"epoch": 9,
"step": 116,
"rank": 0,
"loss": 0.011558985337615013,
"overall_throughput": 32.10873615463745,
"lr": 8.735930673024806e-08,
"cuda_mem_allocated": 22.000000476837158,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 593,
"batch_size": 16,
"total_loss": 0.019189907237887383,
"gradnorm": 0.4252445697784424,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:10.257978"
}
Per-token loss scaled by world size: 0.000993338762782514
Per-token loss scaled by world size: 0.0006966108339838684
Per-token loss scaled by world size: 0.0003090917889494449
Per-token loss scaled by world size: 3.207974077668041e-05
Per-token loss scaled by world size: 0.0003707126888912171
Per-token loss scaled by world size: 3.2455467589898035e-05
Per-token loss scaled by world size: 9.373086504638195e-05
Epoch: 9, Step: 117, Rank: 1, loss = 0.025500072166323662
Epoch: 9, Step: 117, Rank: 0, loss = 0.08195044845342636
Epoch: 9, Step: 117, Rank: 5, loss = 0.05747039616107941
Epoch: 9, Step: 117, Rank: 3, loss = 0.0026775761507451534
Epoch: 9, Step: 117, Rank: 6, loss = 0.002646578708663583
Epoch: 9, Step: 117, Rank: 4, loss = 0.03058379702270031
Epoch: 9, Step: 117, Rank: 2, loss = 0.007732796482741833
Per-token loss scaled by world size: 0.0008588652708567679
Epoch: 9, Step: 117, Rank: 7, loss = 0.07085638493299484
[2024-07-27 20:08:10,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=117, skipped=0, lr=[4.9170974549885844e-08], mom=[(0.9, 0.95)]
[2024-07-27 20:08:10,765] [INFO] [timer.py:258:stop] epoch=0/micro_step=117/global_step=117, RunningAvgSamplesPerSec=31.79231829238237, CurrSamplesPerSec=31.618032996197385, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 9/12 [00:23<00:03, 1.02s/it]
{
"epoch": 9,
"step": 117,
"rank": 0,
"loss": 0.08195044845342636,
"overall_throughput": 31.54031480690336,
"lr": 4.9170974549885844e-08,
"cuda_mem_allocated": 21.997137546539307,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 660,
"batch_size": 16,
"total_loss": 0.034927256405353546,
"gradnorm": 0.6421816945075989,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:10.768286"
}
Per-token loss scaled by world size: 0.00030643673380836844
Per-token loss scaled by world size: 0.000586669659242034
Per-token loss scaled by world size: 0.001160175772383809
Per-token loss scaled by world size: 0.000358547898940742
Per-token loss scaled by world size: 0.00025170366279780865
Per-token loss scaled by world size: 0.00010790762462420389
Per-token loss scaled by world size: 4.243180592311546e-05
Epoch: 9, Step: 118, Rank: 7, loss = 0.0846928283572197
Epoch: 9, Step: 118, Rank: 4, loss = 0.04282688349485397
Epoch: 9, Step: 118, Rank: 6, loss = 0.026173997670412064
Epoch: 9, Step: 118, Rank: 1, loss = 0.007877256721258163
Epoch: 9, Step: 118, Rank: 0, loss = 0.022369882091879845
Epoch: 9, Step: 118, Rank: 5, loss = 0.0183743666857481
Epoch: 9, Step: 118, Rank: 3, loss = 0.0030975218396633863
Per-token loss scaled by world size: 0.00044214868103154004
Epoch: 9, Step: 118, Rank: 2, loss = 0.032276853919029236
[2024-07-27 20:08:11,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=118, skipped=0, lr=[2.1863727812254653e-08], mom=[(0.9, 0.95)]
[2024-07-27 20:08:11,300] [INFO] [timer.py:258:stop] epoch=0/micro_step=118/global_step=118, RunningAvgSamplesPerSec=31.800842267046153, CurrSamplesPerSec=32.81255650372894, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 10/12 [00:23<00:01, 1.15it/s]
{
"epoch": 9,
"step": 118,
"rank": 0,
"loss": 0.022369882091879845,
"overall_throughput": 32.728399914556505,
"lr": 2.1863727812254653e-08,
"cuda_mem_allocated": 21.999285221099854,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 584,
"batch_size": 16,
"total_loss": 0.029711198061704636,
"gradnorm": 0.6892233490943909,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:11.344526"
}
Per-token loss scaled by world size: 0.00011293171701254323
Per-token loss scaled by world size: 0.00029302676557563245
Per-token loss scaled by world size: 0.0008117702673189342
Per-token loss scaled by world size: 0.001312798005528748
Per-token loss scaled by world size: 0.0006738528027199209
Per-token loss scaled by world size: 0.0004890891723334789
Per-token loss scaled by world size: 3.5650893551064655e-05
Epoch: 9, Step: 119, Rank: 4, loss = 0.058345988392829895
Epoch: 9, Step: 119, Rank: 6, loss = 0.09435735642910004
Epoch: 9, Step: 119, Rank: 0, loss = 0.021061299368739128
Epoch: 9, Step: 119, Rank: 3, loss = 0.04843316972255707
Epoch: 9, Step: 119, Rank: 5, loss = 0.0025624081026762724
Epoch: 9, Step: 119, Rank: 2, loss = 0.008116967044770718
Epoch: 9, Step: 119, Rank: 1, loss = 0.035153284668922424
Per-token loss scaled by world size: 0.0010237974347546697
Epoch: 9, Step: 119, Rank: 7, loss = 0.07358544319868088
[2024-07-27 20:08:11,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=119, skipped=0, lr=[5.467426590739511e-09], mom=[(0.9, 0.95)]
[2024-07-27 20:08:11,848] [INFO] [timer.py:258:stop] epoch=0/micro_step=119/global_step=119, RunningAvgSamplesPerSec=31.801217171186128, CurrSamplesPerSec=31.84476611898689, MemAllocated=22.0GB, MaxMemAllocated=28.3GB
Epoch 9: 11/12 [00:24<00:00, 1.30it/s]
{
"epoch": 9,
"step": 119,
"rank": 0,
"loss": 0.021061299368739128,
"overall_throughput": 31.765043565848615,
"lr": 5.467426590739511e-09,
"cuda_mem_allocated": 22.003100872039795,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 575,
"batch_size": 16,
"total_loss": 0.04270198941230774,
"gradnorm": 0.709683358669281,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:11.891436"
}
Per-token loss scaled by world size: 0.0007179552922025323
Per-token loss scaled by world size: 0.00047626314335502684
Per-token loss scaled by world size: 0.000766461540479213
Per-token loss scaled by world size: 0.000950768415350467
Per-token loss scaled by world size: 0.00014302438648883253
Per-token loss scaled by world size: 1.124614391301293e-05
Per-token loss scaled by world size: 0.00010744491737568751
Epoch: 9, Step: 120, Rank: 6, loss = 0.03857731446623802
Epoch: 9, Step: 120, Rank: 7, loss = 0.058154378086328506
Epoch: 9, Step: 120, Rank: 0, loss = 0.07701224088668823
Epoch: 9, Step: 120, Rank: 4, loss = 0.0009109376696869731
Epoch: 9, Step: 120, Rank: 3, loss = 0.06208338588476181
Epoch: 9, Step: 120, Rank: 5, loss = 0.008703038096427917
Epoch: 9, Step: 120, Rank: 1, loss = 0.011584974825382233
Per-token loss scaled by world size: 0.00025203556288033724
Epoch: 9, Step: 120, Rank: 2, loss = 0.02041487954556942
[2024-07-27 20:08:12,306] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-07-27 20:08:12,384] [INFO] [timer.py:258:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=31.807319807184527, CurrSamplesPerSec=32.53786766934063, MemAllocated=22.01GB, MaxMemAllocated=28.3GB
Epoch 9: 12/12 [00:24<00:00, 1.43it/s]
{
"epoch": 9,
"step": 120,
"rank": 0,
"loss": 0.07701224088668823,
"overall_throughput": 32.45850310128811,
"lr": 0.0,
"cuda_mem_allocated": 22.007394790649414,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 648,
"batch_size": 16,
"total_loss": 0.034680142998695374,
"gradnorm": 0.5826724767684937,
"weight_norm": 393.4759216308594,
"timestamp": "2024-07-27T20:08:12.387320"
}
Epoch 9: 100%|██████████| 12/12 [00:24<00:00, 2.08s/it]
tyler-rhel-newimage:260:1034 [0] NCCL INFO [Service thread] Connection closed by localRank 0
tyler-rhel-newimage:261:1036 [1] NCCL INFO [Service thread] Connection closed by localRank 1
tyler-rhel-newimage:266:1030 [6] NCCL INFO [Service thread] Connection closed by localRank 6
tyler-rhel-newimage:263:1038 [3] NCCL INFO [Service thread] Connection closed by localRank 3
tyler-rhel-newimage:262:1040 [2] NCCL INFO [Service thread] Connection closed by localRank 2
tyler-rhel-newimage:267:1044 [7] NCCL INFO [Service thread] Connection closed by localRank 7
tyler-rhel-newimage:265:1042 [5] NCCL INFO [Service thread] Connection closed by localRank 5
tyler-rhel-newimage:264:1032 [4] NCCL INFO [Service thread] Connection closed by localRank 4
tyler-rhel-newimage:260:43471 [0] NCCL INFO comm 0x558210938950 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
tyler-rhel-newimage:267:43476 [0] NCCL INFO comm 0x564fb40d9fa0 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
tyler-rhel-newimage:266:43475 [0] NCCL INFO comm 0x55f359e7d980 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
tyler-rhel-newimage:262:43470 [0] NCCL INFO comm 0x55f25f665d50 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
tyler-rhel-newimage:261:43477 [0] NCCL INFO comm 0x55fca60525d0 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
tyler-rhel-newimage:263:43473 [0] NCCL INFO comm 0x55fffff3ce80 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
tyler-rhel-newimage:265:43474 [0] NCCL INFO comm 0x56464a4e7a70 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
tyler-rhel-newimage:264:43472 [0] NCCL INFO comm 0x55b22a5ae220 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
Terminating process 🤖
[root@tyler-rhel-newimage instructlab]#