Created July 7, 2024 02:40
ilab train --num-epochs 10 log file
This file has been truncated, but you can view the full file.
nohup: ignoring input
Converting /instructlab/generated/train_combinedknowlegeskills.jsonl
Converting /instructlab/generated/test_combinedknowlegeskills.jsonl
[16:39:00] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
INFO eos: 32000, pad: 32001, system: 32002, user: utils.py:192
32003, assistant: 32004
Generating train split: 32962 examples [00:00, 295454.57 examples/s]
removing pretraining samples system msg
Map (num_proc=72): 100%|██████████| 32962/32962 [00:00<00:00, 71731.50 examples/s]
[16:39:02] INFO tokenizing the dataset with data_process.py:208
/instructlab/models/ibm/granite-7b-base
tokenizer...
Map (num_proc=72): 100%|██████████| 32962/32962 [00:02<00:00, 11879.44 examples/s]
ten largest length percentiles:
Map (num_proc=72): 100%|██████████| 32962/32962 [00:00<00:00, 63203.15 examples/s]
quantile 90th: 349.0
quantile 91th: 357.51000000000204
quantile 92th: 369.0
quantile 93th: 382.0
quantile 94th: 398.0
quantile 95th: 418.0
quantile 96th: 443.0
quantile 97th: 478.0
quantile 98th: 527.0
quantile 99th: 628.3899999999994
quantile 100th: 3125.0
at 4096 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 68.0
quantile 1th: 84.0
quantile 2th: 88.0
quantile 3th: 91.0
quantile 4th: 94.0
quantile 5th: 96.0
quantile 6th: 98.0
quantile 7th: 100.0
quantile 8th: 102.0
quantile 9th: 104.0
quantile 10th: 106.0
at 20 min sequence length, the number of samples to be dropped is 0
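The drop counts above follow directly from the token-length distribution: nothing is dropped because the longest sample (3125 tokens, the 100th percentile) fits under the 4096-token cap and the shortest (68 tokens, the 0th percentile) clears the 20-token floor. A minimal sketch of that check, assuming a plain list of per-sample token counts (the `lengths` values here are illustrative stand-ins taken from the percentiles above, not the real dataset):

```python
# Sketch: reproduce the "samples to be dropped" counts from a list of
# tokenized sample lengths. `lengths` is illustrative, not the real data.
lengths = [68, 84, 106, 349, 418, 628, 3125]  # token counts per sample

max_seq_len = 4096
min_seq_len = 20

too_long = sum(1 for n in lengths if n > max_seq_len)
too_short = sum(1 for n in lengths if n < min_seq_len)

# Matches the log: 0 dropped at the 4096 cap, 0 dropped at the 20 floor.
print(too_long, too_short)  # → 0 0
```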
[16:39:07] INFO checking the validity of the samples... data_process.py:241
Filter (num_proc=72): 100%|██████████| 32962/32962 [00:02<00:00, 13609.22 examples/s]
[16:39:11] INFO number of dropped samples: 0 -- out of utils.py:192
32962
INFO unmasking the assistant responses... data_process.py:258
Map (num_proc=72): 100%|██████████| 32962/32962 [00:00<00:00, 33140.56 examples/s]
Creating json from Arrow format: 100%|██████████| 33/33 [00:01<00:00, 22.86ba/s]
time="2024-06-27T16:39:21Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
[2024-06-27 16:39:26,118] torch.distributed.run: [WARNING]
[2024-06-27 16:39:26,118] torch.distributed.run: [WARNING] *****************************************
[2024-06-27 16:39:26,118] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-27 16:39:26,118] torch.distributed.run: [WARNING] *****************************************
[2024-06-27 16:39:36,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,019] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,020] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,021] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,021] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,022] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,023] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 16:39:36,023] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
df: /root/.triton/autotune: No such file or directory
df: /root/.triton/autotune: No such file or directory
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
model_name_or_path: /instructlab/models/ibm/granite-7b-base
data_path: /instructlab/training/data.jsonl
output_dir: /instructlab/training_output
num_epochs: 10
last_step: 0
effective_batch_size: 96
learning_rate: 2.0e-05
num_warmup_steps: 385
save_samples: 4999
save_samples_ds: null
log_level: INFO
seed: 19347
mock_data: false
mock_len: 2600
sharding_strategy: HYBRID_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 60000
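A rough sense of how long this run is, derived from the parameters above and the batch count reported later in this log. This arithmetic is our inference, not something the trainer prints directly:

```python
# Sketch: training-length arithmetic from the run config above.
# num_batches_per_epoch is the "num batches: 213" reported later.
num_epochs = 10
num_batches_per_epoch = 213
num_warmup_steps = 385

total_steps = num_epochs * num_batches_per_epoch   # 2130 optimizer steps
warmup_fraction = num_warmup_steps / total_steps   # ~0.18 of the run
```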
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[2024-06-27 16:39:40,622] [INFO] [comm.py:637:init_distributed] cdb=None
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[2024-06-27 16:39:40,622] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[16:39:40] INFO !!!!!!!! tokenizer has add_bos_token or utils.py:192
add_eos_token
[2024-06-27 16:39:41,670] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,719] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,720] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,726] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,748] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,756] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 16:39:41,758] [INFO] [comm.py:637:init_distributed] cdb=None
Generating train split: 32962 examples [00:01, 23030.19 examples/s]
Map (num_proc=72): 100% 32962/32962 [00:00<00:00, 45017.44 examples/s]
Map (num_proc=72): 100% 32962/32962 [00:00<00:00, 40468.05 examples/s]
Map (num_proc=72): 100% 32962/32962 [00:00<00:00, 47269.97 examples/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 0% 0/6 [00:00<?, ?it/s]You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 83% 5/6 [00:09<00:01, 1.88s/it]num_gpus: 8
avg_sample_len: 211.24461501122505
effective_batch_size: 96
max_batch_len_per_gpu: 60000
packing_max_batch_len: 2534
grad_accum: 1
num batches: 213
avg_samples_per_batch: 154.7511737089202
samples_per_gpu: 12
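The batch-sizing numbers printed above are internally consistent, and the relationships can be reproduced with a few lines of arithmetic. The formulas below are inferred from the printed values (the variable names are ours, not the trainer's):

```python
# Sketch: how the printed batch-sizing values relate to each other.
num_gpus = 8
effective_batch_size = 96
grad_accum = 1
avg_sample_len = 211.24461501122505
num_samples = 32962          # dataset size from the "Generating train split" line
num_batches = 213

# 96 / (8 GPUs x 1 grad-accum step) -> 12 samples per GPU per step
samples_per_gpu = effective_batch_size // (num_gpus * grad_accum)

# packing budget per GPU: samples_per_gpu x average sample length -> 2534 tokens
packing_max_batch_len = int(samples_per_gpu * avg_sample_len)

# packing overfills batches relative to the nominal 96: ~154.75 samples/batch
avg_samples_per_batch = num_samples / num_batches
```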
Loading checkpoint shards: 67% 4/6 [00:08<00:03, 1.95s/it]You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100% 6/6 [00:10<00:00, 1.81s/it]
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.95s/it]
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.95s/it]
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.84s/it]
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.84s/it]
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.89s/it]
Loading checkpoint shards: 100% 6/6 [00:11<00:00, 1.89s/it]
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
Loading checkpoint shards: 100% 6/6 [00:10<00:00, 1.68s/it]
WARNING: tokenizer has 32005 tokens but model has 32000 vocab size
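The 32005-vs-32000 warning comes from the five chat special tokens logged earlier (eos 32000, pad 32001, system 32002, user 32003, assistant 32004): they sit past the end of the base vocabulary, so the model's embedding matrix needs five extra rows before training. A sketch of that arithmetic, using the token IDs from this log:

```python
# Sketch: why the tokenizer reports 32005 tokens against a 32000-entry
# model vocabulary. IDs taken from the earlier log lines.
base_vocab_size = 32000
special_tokens = {"eos": 32000, "pad": 32001, "system": 32002,
                  "user": 32003, "assistant": 32004}

tokenizer_size = base_vocab_size + len(special_tokens)  # 32005

# Number of embedding rows the model must grow by; with Hugging Face
# transformers this is typically model.resize_token_embeddings(len(tokenizer)),
# sketched here as plain arithmetic only.
rows_to_add = tokenizer_size - base_vocab_size  # 5
```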
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu121/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib64/python3.9/site-packages/torch/include -isystem /usr/local/lib64/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib64/python3.9/site-packages/torch/include/TH -isystem /usr/local/lib64/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib64/python3.9/site-packages/torch/include -isystem /usr/local/lib64/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib64/python3.9/site-packages/torch/include/TH -isystem /usr/local/lib64/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /usr/local/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib64/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 29.981279850006104 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 27.43861722946167 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 27.945048570632935 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 27.940150499343872 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 29.24064588546753 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 20.32837748527527 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 29.240574836730957 seconds
[2024-06-27 16:40:43,126] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
[2024-06-27 16:40:43,126] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 27.638421297073364 seconds
[2024-06-27 16:40:56,015] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-06-27 16:40:56,017] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-06-27 16:40:56,017] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-06-27 16:40:56,033] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-06-27 16:40:56,034] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-06-27 16:40:56,034] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-06-27 16:40:56,034] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-06-27 16:40:56,034] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-06-27 16:40:56,034] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-06-27 16:40:56,034] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-06-27 16:41:09,394] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:09,685] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:09,789] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:10,053] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:10,169] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:10,327] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:10,363] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-06-27 16:41:10,387] [INFO] [utils.py:779:see_memory_usage] Before initializing optimizer states
[2024-06-27 16:41:10,388] [INFO] [utils.py:780:see_memory_usage] MA 15.72 GB Max_MA 17.29 GB CA 17.29 GB Max_CA 17 GB
[2024-06-27 16:41:10,389] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 29.02 GB, percent = 2.3%
[2024-06-27 16:41:10,560] [INFO] [utils.py:779:see_memory_usage] After initializing optimizer states
[2024-06-27 16:41:10,561] [INFO] [utils.py:780:see_memory_usage] MA 15.72 GB Max_MA 18.86 GB CA 20.43 GB Max_CA 20 GB
[2024-06-27 16:41:10,561] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 29.02 GB, percent = 2.3%
[2024-06-27 16:41:10,561] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-06-27 16:41:10,713] [INFO] [utils.py:779:see_memory_usage] After initializing ZeRO optimizer
[2024-06-27 16:41:10,714] [INFO] [utils.py:780:see_memory_usage] MA 15.72 GB Max_MA 15.72 GB CA 20.43 GB Max_CA 20 GB
[2024-06-27 16:41:10,714] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 29.02 GB, percent = 2.3%
[2024-06-27 16:41:10,716] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2024-06-27 16:41:10,716] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-06-27 16:41:10,716] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7ff089b46a60>
[2024-06-27 16:41:10,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-06-27 16:41:10,718] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-06-27 16:41:10,718] [INFO] [config.py:1000:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-06-27 16:41:10,718] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-06-27 16:41:10,718] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-06-27 16:41:10,718] [INFO] [config.py:1000:print] amp_params ................... False
[2024-06-27 16:41:10,718] [INFO] [config.py:1000:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] bfloat16_enabled ............. True
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7ff089b469d0>
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] dump_state ................... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-06-27 16:41:10,719] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] fp16_auto_cast ............... None | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] fp16_enabled ................. False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] global_rank .................. 0 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] grad_accum_dtype ............. None | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] graph_harvesting ............. False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] load_universal_checkpoint .... False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] loss_scale ................... 1.0 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] memory_breakdown ............. False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] mics_shard_size .............. -1 | |
[2024-06-27 16:41:10,720] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False | |
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] nebula_config ................ { | |
"enabled": false, | |
"persistent_storage_path": null, | |
"persistent_time_interval": 100, | |
"num_of_version_in_retention": 2, | |
"enable_nebula_load": true, | |
"load_path": null | |
} | |
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] optimizer_name ............... None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] optimizer_params ............. None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] pld_params ................... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] steps_per_print .............. 1
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] train_batch_size ............. 96
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 12
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] wall_clock_breakdown ......... False
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] world_size ................... 8
[2024-06-27 16:41:10,721] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False
[2024-06-27 16:41:10,722] [INFO] [config.py:1000:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-06-27 16:41:10,722] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-06-27 16:41:10,722] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-06-27 16:41:10,722] [INFO] [config.py:1000:print] zero_optimization_stage ...... 2
[2024-06-27 16:41:10,722] [INFO] [config.py:986:print_user_config] json = {
    "train_batch_size": 96,
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 12,
    "steps_per_print": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {
            "device": "none"
        },
        "offload_optimizer": {
            "device": "none"
        }
    },
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false
}
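A note on the batch-size numbers in this user config: DeepSpeed requires that `train_batch_size` equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. The values printed above are consistent with the `world_size ... 8` line earlier in the log. A minimal sketch checking that invariant (the variable names below simply mirror the config keys):

```python
# Values copied from the DeepSpeed user config and world_size printed above.
train_micro_batch_size_per_gpu = 12
gradient_accumulation_steps = 1
world_size = 8

# DeepSpeed invariant:
#   train_batch_size == micro_batch * grad_accum_steps * world_size
train_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
)
print(train_batch_size)  # 96, matching "train_batch_size": 96 in the config
```

If any of the three factors changed (e.g. fewer GPUs), DeepSpeed would reject the config unless `train_batch_size` were adjusted to match.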
[2024-06-27 16:41:10,722] [WARNING] [engine.py:2754:load_checkpoint] Unable to find latest file at /instructlab/training_output/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Number of samples per save: 4992
Epoch 0: 0% 0/213 [00:00<?, ?it/s]
total tokens: 2396 num samples: 4 num padding tokens: 366 - rank: 0 max len: 599 min len: 454 avg len: 507.5 num_loss_counted_tokens: 1026
total tokens: 1995 num samples: 3 num padding tokens: 127 - rank: 0 max len: 665 min len: 559 avg len: 622.6666666666666 num_loss_counted_tokens: 766
total tokens: 2460 num samples: 15 num padding tokens: 207 - rank: 6 max len: 164 min len: 135 avg len: 150.2 num_loss_counted_tokens: 851
total tokens: 2268 num samples: 3 num padding tokens: 600 - rank: 0 max len: 756 min len: 385 avg len: 556.0 num_loss_counted_tokens: 210
total tokens: 2163 num samples: 3 num padding tokens: 302 - rank: 0 max len: 721 min len: 459 avg len: 620.3333333333334 num_loss_counted_tokens: 1362
total tokens: 2500 num samples: 5 num padding tokens: 269 - rank: 0 max len: 500 min len: 377 avg len: 446.2 num_loss_counted_tokens: 1289
total tokens: 2081 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2081 min len: 2081 avg len: 2081.0 num_loss_counted_tokens: 42
total tokens: 2328 num samples: 4 num padding tokens: 337 - rank: 0 max len: 582 min len: 449 avg len: 497.75 num_loss_counted_tokens: 839
total tokens: 2028 num samples: 4 num padding tokens: 127 - rank: 0 max len: 507 min len: 444 avg len: 475.25 num_loss_counted_tokens: 1252
total tokens: 1923 num samples: 3 num padding tokens: 334 - rank: 0 max len: 641 min len: 427 avg len: 529.6666666666666 num_loss_counted_tokens: 1099
total tokens: 2274 num samples: 3 num padding tokens: 499 - rank: 0 max len: 758 min len: 497 avg len: 591.6666666666666 num_loss_counted_tokens: 975
total tokens: 2043 num samples: 3 num padding tokens: 297 - rank: 0 max len: 681 min len: 523 avg len: 582.0 num_loss_counted_tokens: 948
total tokens: 2254 num samples: 7 num padding tokens: 79 - rank: 2 max len: 322 min len: 295 avg len: 310.7142857142857 num_loss_counted_tokens: 841
total tokens: 2445 num samples: 15 num padding tokens: 357 - rank: 6 max len: 163 min len: 127 avg len: 139.2 num_loss_counted_tokens: 823
total tokens: 2244 num samples: 6 num padding tokens: 111 - rank: 2 max len: 374 min len: 336 avg len: 355.5 num_loss_counted_tokens: 1110
total tokens: 2505 num samples: 15 num padding tokens: 274 - rank: 6 max len: 167 min len: 130 avg len: 148.73333333333332 num_loss_counted_tokens: 871
total tokens: 2320 num samples: 5 num padding tokens: 259 - rank: 0 max len: 464 min len: 335 avg len: 412.2 num_loss_counted_tokens: 1281
total tokens: 2226 num samples: 7 num padding tokens: 138 - rank: 2 max len: 318 min len: 276 avg len: 298.2857142857143 num_loss_counted_tokens: 971
total tokens: 2160 num samples: 5 num padding tokens: 85 - rank: 0 max len: 432 min len: 393 avg len: 415.0 num_loss_counted_tokens: 1348
total tokens: 2466 num samples: 3 num padding tokens: 615 - rank: 0 max len: 822 min len: 497 avg len: 617.0 num_loss_counted_tokens: 1501
total tokens: 2312 num samples: 8 num padding tokens: 151 - rank: 2 max len: 289 min len: 261 avg len: 270.125 num_loss_counted_tokens: 1056
total tokens: 2312 num samples: 8 num padding tokens: 312 - rank: 3 max len: 289 min len: 232 avg len: 250.0 num_loss_counted_tokens: 778
total tokens: 2180 num samples: 4 num padding tokens: 241 - rank: 0 max len: 545 min len: 426 avg len: 484.75 num_loss_counted_tokens: 1313
total tokens: 1698 num samples: 2 num padding tokens: 76 - rank: 0 max len: 849 min len: 773 avg len: 811.0 num_loss_counted_tokens: 827
total tokens: 2445 num samples: 15 num padding tokens: 266 - rank: 6 max len: 163 min len: 129 avg len: 145.26666666666668 num_loss_counted_tokens: 782
total tokens: 2457 num samples: 7 num padding tokens: 187 - rank: 1 max len: 351 min len: 295 avg len: 324.2857142857143 num_loss_counted_tokens: 1156
total tokens: 2232 num samples: 6 num padding tokens: 142 - rank: 1 max len: 372 min len: 329 avg len: 348.3333333333333 num_loss_counted_tokens: 800
total tokens: 2185 num samples: 5 num padding tokens: 346 - rank: 1 max len: 437 min len: 323 avg len: 367.8 num_loss_counted_tokens: 1226
total tokens: 2150 num samples: 5 num padding tokens: 96 - rank: 1 max len: 430 min len: 388 avg len: 410.8 num_loss_counted_tokens: 1098
total tokens: 2160 num samples: 5 num padding tokens: 182 - rank: 2 max len: 432 min len: 362 avg len: 395.6 num_loss_counted_tokens: 1303
total tokens: 2060 num samples: 4 num padding tokens: 146 - rank: 1 max len: 515 min len: 457 avg len: 478.5 num_loss_counted_tokens: 1005
total tokens: 2431 num samples: 11 num padding tokens: 255 - rank: 5 max len: 221 min len: 182 avg len: 197.8181818181818 num_loss_counted_tokens: 833
total tokens: 2464 num samples: 11 num padding tokens: 346 - rank: 5 max len: 224 min len: 168 avg len: 192.54545454545453 num_loss_counted_tokens: 903
total tokens: 2496 num samples: 13 num padding tokens: 181 - rank: 5 max len: 192 min len: 168 avg len: 178.07692307692307 num_loss_counted_tokens: 826
total tokens: 2484 num samples: 12 num padding tokens: 287 - rank: 5 max len: 207 min len: 164 avg len: 183.08333333333334 num_loss_counted_tokens: 1091
total tokens: 2275 num samples: 7 num padding tokens: 78 - rank: 2 max len: 325 min len: 295 avg len: 313.85714285714283 num_loss_counted_tokens: 914
total tokens: 2214 num samples: 6 num padding tokens: 132 - rank: 2 max len: 369 min len: 329 avg len: 347.0 num_loss_counted_tokens: 913
total tokens: 2275 num samples: 5 num padding tokens: 255 - rank: 1 max len: 455 min len: 382 avg len: 404.0 num_loss_counted_tokens: 1080
total tokens: 2534 num samples: 7 num padding tokens: 221 - rank: 2 max len: 362 min len: 287 avg len: 330.42857142857144 num_loss_counted_tokens: 632
total tokens: 2457 num samples: 13 num padding tokens: 163 - rank: 5 max len: 189 min len: 168 avg len: 176.46153846153845 num_loss_counted_tokens: 850
total tokens: 2502 num samples: 6 num padding tokens: 214 - rank: 1 max len: 417 min len: 347 avg len: 381.3333333333333 num_loss_counted_tokens: 1494
total tokens: 2300 num samples: 5 num padding tokens: 193 - rank: 1 max len: 460 min len: 382 avg len: 421.4 num_loss_counted_tokens: 981
total tokens: 2275 num samples: 7 num padding tokens: 167 - rank: 3 max len: 325 min len: 274 avg len: 301.14285714285717 num_loss_counted_tokens: 961
total tokens: 2496 num samples: 16 num padding tokens: 331 - rank: 6 max len: 156 min len: 121 avg len: 135.3125 num_loss_counted_tokens: 721
total tokens: 2506 num samples: 14 num padding tokens: 178 - rank: 6 max len: 179 min len: 144 avg len: 166.28571428571428 num_loss_counted_tokens: 898
total tokens: 2380 num samples: 14 num padding tokens: 257 - rank: 6 max len: 170 min len: 130 avg len: 151.64285714285714 num_loss_counted_tokens: 833
total tokens: 2345 num samples: 7 num padding tokens: 192 - rank: 2 max len: 335 min len: 287 avg len: 307.57142857142856 num_loss_counted_tokens: 926
total tokens: 2261 num samples: 7 num padding tokens: 114 - rank: 2 max len: 323 min len: 275 avg len: 306.7142857142857 num_loss_counted_tokens: 1451
total tokens: 2439 num samples: 9 num padding tokens: 98 - rank: 3 max len: 271 min len: 238 avg len: 260.1111111111111 num_loss_counted_tokens: 1042
total tokens: 2520 num samples: 14 num padding tokens: 166 - rank: 5 max len: 180 min len: 158 avg len: 168.14285714285714 num_loss_counted_tokens: 820
total tokens: 2520 num samples: 15 num padding tokens: 444 - rank: 6 max len: 168 min len: 115 avg len: 138.4 num_loss_counted_tokens: 696
total tokens: 2520 num samples: 12 num padding tokens: 191 - rank: 5 max len: 210 min len: 174 avg len: 194.08333333333334 num_loss_counted_tokens: 969
total tokens: 2410 num samples: 10 num padding tokens: 162 - rank: 3 max len: 241 min len: 211 avg len: 224.8 num_loss_counted_tokens: 957
total tokens: 2520 num samples: 12 num padding tokens: 119 - rank: 5 max len: 210 min len: 176 avg len: 200.08333333333334 num_loss_counted_tokens: 1003
total tokens: 2401 num samples: 7 num padding tokens: 159 - rank: 3 max len: 343 min len: 295 avg len: 320.2857142857143 num_loss_counted_tokens: 1284
total tokens: 2288 num samples: 8 num padding tokens: 118 - rank: 3 max len: 286 min len: 250 avg len: 271.25 num_loss_counted_tokens: 1249
total tokens: 2275 num samples: 7 num padding tokens: 128 - rank: 3 max len: 325 min len: 284 avg len: 306.7142857142857 num_loss_counted_tokens: 954
total tokens: 2304 num samples: 9 num padding tokens: 127 - rank: 3 max len: 256 min len: 226 avg len: 241.88888888888889 num_loss_counted_tokens: 828
total tokens: 2496 num samples: 16 num padding tokens: 165 - rank: 6 max len: 156 min len: 133 avg len: 145.6875 num_loss_counted_tokens: 816
total tokens: 2421 num samples: 9 num padding tokens: 143 - rank: 3 max len: 269 min len: 234 avg len: 253.11111111111111 num_loss_counted_tokens: 854
total tokens: 2394 num samples: 14 num padding tokens: 197 - rank: 6 max len: 171 min len: 134 avg len: 156.92857142857142 num_loss_counted_tokens: 863
total tokens: 2421 num samples: 9 num padding tokens: 90 - rank: 2 max len: 269 min len: 247 avg len: 259.0 num_loss_counted_tokens: 949
total tokens: 2495 num samples: 5 num padding tokens: 345 - rank: 1 max len: 499 min len: 372 avg len: 430.0 num_loss_counted_tokens: 1697
total tokens: 2448 num samples: 9 num padding tokens: 228 - rank: 4 max len: 272 min len: 227 avg len: 246.66666666666666 num_loss_counted_tokens: 675
total tokens: 2519 num samples: 11 num padding tokens: 242 - rank: 4 max len: 229 min len: 190 avg len: 207.0 num_loss_counted_tokens: 1072
total tokens: 2350 num samples: 10 num padding tokens: 181 - rank: 4 max len: 235 min len: 207 avg len: 216.9 num_loss_counted_tokens: 645
total tokens: 2288 num samples: 8 num padding tokens: 189 - rank: 4 max len: 286 min len: 224 avg len: 262.375 num_loss_counted_tokens: 883
total tokens: 2464 num samples: 11 num padding tokens: 199 - rank: 4 max len: 224 min len: 193 avg len: 205.9090909090909 num_loss_counted_tokens: 1028
total tokens: 2464 num samples: 7 num padding tokens: 265 - rank: 2 max len: 352 min len: 285 avg len: 314.14285714285717 num_loss_counted_tokens: 1055
total tokens: 2526 num samples: 6 num padding tokens: 323 - rank: 2 max len: 421 min len: 301 avg len: 367.1666666666667 num_loss_counted_tokens: 1004
total tokens: 2450 num samples: 7 num padding tokens: 309 - rank: 2 max len: 350 min len: 285 avg len: 305.85714285714283 num_loss_counted_tokens: 1280
total tokens: 2380 num samples: 5 num padding tokens: 282 - rank: 1 max len: 476 min len: 382 avg len: 419.6 num_loss_counted_tokens: 1094
total tokens: 2420 num samples: 4 num padding tokens: 291 - rank: 1 max len: 605 min len: 460 avg len: 532.25 num_loss_counted_tokens: 434
total tokens: 2368 num samples: 8 num padding tokens: 67 - rank: 1 max len: 296 min len: 278 avg len: 287.625 num_loss_counted_tokens: 998
total tokens: 2370 num samples: 6 num padding tokens: 315 - rank: 1 max len: 395 min len: 304 avg len: 342.5 num_loss_counted_tokens: 1097
total tokens: 2376 num samples: 8 num padding tokens: 151 - rank: 2 max len: 297 min len: 260 avg len: 278.125 num_loss_counted_tokens: 931
total tokens: 2290 num samples: 5 num padding tokens: 193 - rank: 1 max len: 458 min len: 395 avg len: 419.4 num_loss_counted_tokens: 1368
total tokens: 2286 num samples: 6 num padding tokens: 194 - rank: 2 max len: 381 min len: 329 avg len: 348.6666666666667 num_loss_counted_tokens: 1008
total tokens: 2496 num samples: 13 num padding tokens: 152 - rank: 5 max len: 192 min len: 162 avg len: 180.30769230769232 num_loss_counted_tokens: 1022
total tokens: 2379 num samples: 13 num padding tokens: 133 - rank: 5 max len: 183 min len: 157 avg len: 172.76923076923077 num_loss_counted_tokens: 674
total tokens: 2388 num samples: 12 num padding tokens: 197 - rank: 5 max len: 199 min len: 173 avg len: 182.58333333333334 num_loss_counted_tokens: 933
total tokens: 2484 num samples: 9 num padding tokens: 207 - rank: 3 max len: 276 min len: 237 avg len: 253.0 num_loss_counted_tokens: 510
total tokens: 2264 num samples: 8 num padding tokens: 123 - rank: 3 max len: 283 min len: 255 avg len: 267.625 num_loss_counted_tokens: 859
total tokens: 2400 num samples: 15 num padding tokens: 224 - rank: 6 max len: 160 min len: 125 avg len: 145.06666666666666 num_loss_counted_tokens: 725
total tokens: 2520 num samples: 9 num padding tokens: 325 - rank: 3 max len: 280 min len: 212 avg len: 243.88888888888889 num_loss_counted_tokens: 956
total tokens: 2470 num samples: 13 num padding tokens: 257 - rank: 5 max len: 190 min len: 158 avg len: 170.23076923076923 num_loss_counted_tokens: 946
total tokens: 2529 num samples: 9 num padding tokens: 315 - rank: 4 max len: 281 min len: 228 avg len: 246.0 num_loss_counted_tokens: 1086
total tokens: 2360 num samples: 10 num padding tokens: 101 - rank: 4 max len: 236 min len: 217 avg len: 225.9 num_loss_counted_tokens: 939
total tokens: 2496 num samples: 13 num padding tokens: 205 - rank: 5 max len: 192 min len: 163 avg len: 176.23076923076923 num_loss_counted_tokens: 950
total tokens: 2398 num samples: 11 num padding tokens: 173 - rank: 4 max len: 218 min len: 188 avg len: 202.27272727272728 num_loss_counted_tokens: 983
total tokens: 2504 num samples: 8 num padding tokens: 347 - rank: 3 max len: 313 min len: 230 avg len: 269.625 num_loss_counted_tokens: 767
total tokens: 2496 num samples: 13 num padding tokens: 134 - rank: 5 max len: 192 min len: 166 avg len: 181.69230769230768 num_loss_counted_tokens: 1086
total tokens: 2512 num samples: 16 num padding tokens: 251 - rank: 6 max len: 157 min len: 127 avg len: 141.3125 num_loss_counted_tokens: 935
total tokens: 2512 num samples: 16 num padding tokens: 275 - rank: 6 max len: 157 min len: 125 avg len: 139.8125 num_loss_counted_tokens: 754
total tokens: 2400 num samples: 15 num padding tokens: 231 - rank: 6 max len: 160 min len: 129 avg len: 144.6 num_loss_counted_tokens: 944
total tokens: 2331 num samples: 9 num padding tokens: 163 - rank: 3 max len: 259 min len: 224 avg len: 240.88888888888889 num_loss_counted_tokens: 867
total tokens: 2261 num samples: 17 num padding tokens: 330 - rank: 7 max len: 133 min len: 82 avg len: 113.58823529411765 num_loss_counted_tokens: 564
total tokens: 2340 num samples: 10 num padding tokens: 223 - rank: 4 max len: 234 min len: 193 avg len: 211.7 num_loss_counted_tokens: 986
total tokens: 2460 num samples: 15 num padding tokens: 190 - rank: 6 max len: 164 min len: 131 avg len: 151.33333333333334 num_loss_counted_tokens: 728
total tokens: 2360 num samples: 20 num padding tokens: 292 - rank: 7 max len: 118 min len: 86 avg len: 103.4 num_loss_counted_tokens: 595
total tokens: 2160 num samples: 3 num padding tokens: 344 - rank: 0 max len: 720 min len: 524 avg len: 605.3333333333334 num_loss_counted_tokens: 1144
total tokens: 2508 num samples: 12 num padding tokens: 166 - rank: 4 max len: 209 min len: 183 avg len: 195.16666666666666 num_loss_counted_tokens: 993
total tokens: 2480 num samples: 10 num padding tokens: 220 - rank: 4 max len: 248 min len: 203 avg len: 226.0 num_loss_counted_tokens: 806
total tokens: 1770 num samples: 15 num padding tokens: 268 - rank: 7 max len: 118 min len: 86 avg len: 100.13333333333334 num_loss_counted_tokens: 335
total tokens: 1610 num samples: 14 num padding tokens: 284 - rank: 7 max len: 115 min len: 76 avg len: 94.71428571428571 num_loss_counted_tokens: 319
total tokens: 2394 num samples: 19 num padding tokens: 404 - rank: 7 max len: 126 min len: 72 avg len: 104.73684210526316 num_loss_counted_tokens: 602
total tokens: 2413 num samples: 19 num padding tokens: 350 - rank: 7 max len: 127 min len: 85 avg len: 108.57894736842105 num_loss_counted_tokens: 532
total tokens: 2431 num samples: 17 num padding tokens: 334 - rank: 7 max len: 143 min len: 81 avg len: 123.3529411764706 num_loss_counted_tokens: 651
total tokens: 2394 num samples: 18 num padding tokens: 439 - rank: 7 max len: 133 min len: 75 avg len: 108.61111111111111 num_loss_counted_tokens: 516
total tokens: 2420 num samples: 11 num padding tokens: 137 - rank: 4 max len: 220 min len: 195 avg len: 207.54545454545453 num_loss_counted_tokens: 921
total tokens: 2350 num samples: 10 num padding tokens: 127 - rank: 4 max len: 235 min len: 209 avg len: 222.3 num_loss_counted_tokens: 854
total tokens: 2420 num samples: 11 num padding tokens: 87 - rank: 4 max len: 220 min len: 202 avg len: 212.0909090909091 num_loss_counted_tokens: 917
total tokens: 2508 num samples: 12 num padding tokens: 82 - rank: 4 max len: 209 min len: 193 avg len: 202.16666666666666 num_loss_counted_tokens: 961
total tokens: 2338 num samples: 7 num padding tokens: 198 - rank: 2 max len: 334 min len: 281 avg len: 305.7142857142857 num_loss_counted_tokens: 1035
total tokens: 2520 num samples: 21 num padding tokens: 359 - rank: 7 max len: 120 min len: 87 avg len: 102.9047619047619 num_loss_counted_tokens: 510
total tokens: 2356 num samples: 19 num padding tokens: 231 - rank: 7 max len: 124 min len: 88 avg len: 111.84210526315789 num_loss_counted_tokens: 624
total tokens: 2300 num samples: 20 num padding tokens: 287 - rank: 7 max len: 115 min len: 75 avg len: 100.65 num_loss_counted_tokens: 523
total tokens: 2337 num samples: 19 num padding tokens: 396 - rank: 7 max len: 123 min len: 79 avg len: 102.15789473684211 num_loss_counted_tokens: 506
total tokens: 2385 num samples: 15 num padding tokens: 185 - rank: 6 max len: 159 min len: 130 avg len: 146.66666666666666 num_loss_counted_tokens: 950
total tokens: 2304 num samples: 6 num padding tokens: 106 - rank: 1 max len: 384 min len: 337 avg len: 366.3333333333333 num_loss_counted_tokens: 1223
total tokens: 2470 num samples: 19 num padding tokens: 412 - rank: 7 max len: 130 min len: 89 avg len: 108.3157894736842 num_loss_counted_tokens: 609
total tokens: 2376 num samples: 12 num padding tokens: 226 - rank: 5 max len: 198 min len: 166 avg len: 179.16666666666666 num_loss_counted_tokens: 798
total tokens: 2520 num samples: 20 num padding tokens: 349 - rank: 7 max len: 126 min len: 78 avg len: 108.55 num_loss_counted_tokens: 650
total tokens: 2288 num samples: 8 num padding tokens: 143 - rank: 3 max len: 286 min len: 246 avg len: 268.125 num_loss_counted_tokens: 1136
total tokens: 2140 num samples: 5 num padding tokens: 182 - rank: 1 max len: 428 min len: 366 avg len: 391.6 num_loss_counted_tokens: 1277
total tokens: 2450 num samples: 14 num padding tokens: 254 - rank: 6 max len: 175 min len: 135 avg len: 156.85714285714286 num_loss_counted_tokens: 929
total tokens: 2431 num samples: 11 num padding tokens: 267 - rank: 5 max len: 221 min len: 175 avg len: 196.72727272727272 num_loss_counted_tokens: 731
total tokens: 2432 num samples: 19 num padding tokens: 341 - rank: 7 max len: 128 min len: 86 avg len: 110.05263157894737 num_loss_counted_tokens: 562
total tokens: 2520 num samples: 9 num padding tokens: 184 - rank: 3 max len: 280 min len: 237 avg len: 259.55555555555554 num_loss_counted_tokens: 1300
total tokens: 2336 num samples: 8 num padding tokens: 176 - rank: 3 max len: 292 min len: 250 avg len: 270.0 num_loss_counted_tokens: 1062
total tokens: 2480 num samples: 10 num padding tokens: 145 - rank: 4 max len: 248 min len: 221 avg len: 233.5 num_loss_counted_tokens: 912
total tokens: 2410 num samples: 10 num padding tokens: 223 - rank: 4 max len: 241 min len: 200 avg len: 218.7 num_loss_counted_tokens: 925
total tokens: 2436 num samples: 12 num padding tokens: 256 - rank: 5 max len: 203 min len: 162 avg len: 181.66666666666666 num_loss_counted_tokens: 1065
total tokens: 2240 num samples: 5 num padding tokens: 216 - rank: 1 max len: 448 min len: 341 avg len: 404.8 num_loss_counted_tokens: 1265
total tokens: 2430 num samples: 18 num padding tokens: 393 - rank: 7 max len: 135 min len: 90 avg len: 113.16666666666667 num_loss_counted_tokens: 586
total tokens: 2451 num samples: 19 num padding tokens: 413 - rank: 7 max len: 129 min len: 87 avg len: 107.26315789473684 num_loss_counted_tokens: 549
Per-token loss scaled by world size: 0.001983513357117772
Per-token loss scaled by world size: 0.0013691497733816504
Per-token loss scaled by world size: 0.0015049705980345607
Per-token loss scaled by world size: 0.0012615423183888197
Per-token loss scaled by world size: 0.002342585939913988
Per-token loss scaled by world size: 0.0015739978989586234
Epoch: 0, Step: 1, Rank: 4, loss = 1.7871454954147339
Epoch: 0, Step: 1, Rank: 6, loss = 1.2336039543151855
Epoch: 0, Step: 1, Rank: 5, loss = 1.3559784889221191
Epoch: 0, Step: 1, Rank: 1, loss = 2.1106698513031006
Epoch: 0, Step: 1, Rank: 7, loss = 1.1366496086120605
Epoch: 0, Step: 1, Rank: 2, loss = 1.4181721210479736
Per-token loss scaled by world size: 0.001367097138427198
Epoch: 0, Step: 1, Rank: 3, loss = 1.2317545413970947
Per-token loss scaled by world size: 0.0019534588791429996
Epoch: 0, Step: 1, Rank: 0, loss = 1.7600665092468262
[2024-06-27 16:41:12,448] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[5.194805194805195e-08], mom=[(0.9, 0.95)]
throughput: 72.52539926929414 samples/s, lr: 5.194805194805195e-08, loss: 1.7600665092468262 cuda_mem_allocated: 22.30030393600464 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7208.0 batch_size: 80.0 total loss: 1.5042551755905151
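For readers checking these numbers: the step-1 values are internally consistent if each rank's reported loss is its "per-token loss scaled by world size" multiplied by the step's total `num_loss_counted_tokens` divided by `world_size`, and "total loss" is the mean of the eight per-rank losses. This is an inference from the printed values, not a claim about the training code's exact implementation; a minimal sketch:

```python
# Inferred relationship (not taken from the training source):
#   loss_rank = per_token_loss_scaled * num_loss_counted_tokens / world_size
world_size = 8
num_loss_counted_tokens = 7208.0  # from the step-1 summary line above

per_token_rank0 = 0.0019534588791429996  # Rank 0's per-token value
loss_rank0 = per_token_rank0 * num_loss_counted_tokens / world_size
print(loss_rank0)  # ~1.76007, matching Rank 0's reported loss

# "total loss" looks like the mean of the per-rank losses:
rank_losses = [
    1.7871454954147339, 1.2336039543151855, 1.3559784889221191,
    2.1106698513031006, 1.1366496086120605, 1.4181721210479736,
    1.2317545413970947, 1.7600665092468262,
]
total_loss = sum(rank_losses) / world_size
print(total_loss)  # ~1.50426, matching the summary line's "total loss"
```

The same arithmetic reproduces the later steps' summary lines as well, up to floating-point rounding.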
Epoch 0: 0% 1/213 [00:01<06:17, 1.78s/it]
total tokens: 2382 num samples: 3 num padding tokens: 468 - rank: 1 max len: 794 min len: 436 avg len: 638.0 num_loss_counted_tokens: 1091
total tokens: 2440 num samples: 10 num padding tokens: 222 - rank: 4 max len: 244 min len: 205 avg len: 221.8 num_loss_counted_tokens: 985
total tokens: 2298 num samples: 6 num padding tokens: 309 - rank: 2 max len: 383 min len: 300 avg len: 331.5 num_loss_counted_tokens: 882
total tokens: 2376 num samples: 12 num padding tokens: 211 - rank: 5 max len: 198 min len: 161 avg len: 180.41666666666666 num_loss_counted_tokens: 864
total tokens: 2458 num samples: 2 num padding tokens: 373 - rank: 0 max len: 1229 min len: 856 avg len: 1042.5 num_loss_counted_tokens: 794
total tokens: 2392 num samples: 8 num padding tokens: 217 - rank: 3 max len: 299 min len: 248 avg len: 271.875 num_loss_counted_tokens: 850
total tokens: 2385 num samples: 15 num padding tokens: 283 - rank: 6 max len: 159 min len: 124 avg len: 140.13333333333333 num_loss_counted_tokens: 736
total tokens: 2520 num samples: 21 num padding tokens: 394 - rank: 7 max len: 120 min len: 77 avg len: 101.23809523809524 num_loss_counted_tokens: 565
Per-token loss scaled by world size: 0.002336997538805008
Per-token loss scaled by world size: 0.00141163042280823
Per-token loss scaled by world size: 0.0022019841708242893
Per-token loss scaled by world size: 0.0013177217915654182
Per-token loss scaled by world size: 0.001291085034608841
Per-token loss scaled by world size: 0.0017191915540024638
Per-token loss scaled by world size: 0.0018813223578035831
Epoch: 0, Step: 2, Rank: 7, loss = 1.2286478281021118
Epoch: 0, Step: 2, Rank: 2, loss = 1.9165520668029785
Epoch: 0, Step: 2, Rank: 3, loss = 2.034064292907715
Epoch: 0, Step: 2, Rank: 0, loss = 1.1237281560897827
Epoch: 0, Step: 2, Rank: 5, loss = 1.496341347694397
Epoch: 0, Step: 2, Rank: 4, loss = 1.1469120979309082
Epoch: 0, Step: 2, Rank: 1, loss = 1.637455940246582
Per-token loss scaled by world size: 0.0013686075108125806
Epoch: 0, Step: 2, Rank: 6, loss = 1.1912018060684204
[2024-06-27 16:41:13,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.038961038961039e-07], mom=[(0.9, 0.95)]
throughput: 103.27660727036942 samples/s, lr: 1.038961038961039e-07, loss: 1.1237281560897827 cuda_mem_allocated: 22.292194366455078 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6963.0 batch_size: 77.0 total loss: 1.471863031387329
Epoch 0: 1% 2/213 [00:02<04:43, 1.34s/it]
total tokens: 2210 num samples: 5 num padding tokens: 279 - rank: 1 max len: 442 min len: 342 avg len: 386.2 num_loss_counted_tokens: 1121
total tokens: 2380 num samples: 10 num padding tokens: 117 - rank: 4 max len: 238 min len: 209 avg len: 226.3 num_loss_counted_tokens: 740
total tokens: 2344 num samples: 8 num padding tokens: 177 - rank: 3 max len: 293 min len: 238 avg len: 270.875 num_loss_counted_tokens: 672
total tokens: 2472 num samples: 12 num padding tokens: 168 - rank: 5 max len: 206 min len: 173 avg len: 192.0 num_loss_counted_tokens: 1281
total tokens: 2338 num samples: 7 num padding tokens: 111 - rank: 2 max len: 334 min len: 301 avg len: 318.14285714285717 num_loss_counted_tokens: 983
total tokens: 2408 num samples: 14 num padding tokens: 225 - rank: 6 max len: 172 min len: 136 avg len: 155.92857142857142 num_loss_counted_tokens: 854
total tokens: 2032 num samples: 4 num padding tokens: 89 - rank: 0 max len: 508 min len: 448 avg len: 485.75 num_loss_counted_tokens: 860
total tokens: 2508 num samples: 19 num padding tokens: 315 - rank: 7 max len: 132 min len: 80 avg len: 115.42105263157895 num_loss_counted_tokens: 648
Per-token loss scaled by world size: 0.0019702184945344925
Per-token loss scaled by world size: 0.0019760041031986475
Per-token loss scaled by world size: 0.0029752030968666077
Per-token loss scaled by world size: 0.0018035820685327053
Per-token loss scaled by world size: 0.0020922920666635036
Per-token loss scaled by world size: 0.0003915991692338139
Per-token loss scaled by world size: 0.0013368118088692427
Epoch: 0, Step: 3, Rank: 3, loss = 1.505739450454712
Epoch: 0, Step: 3, Rank: 5, loss = 2.273798942565918
Epoch: 0, Step: 3, Rank: 6, loss = 1.5101611614227295
Per-token loss scaled by world size: 0.0017627485794946551
Epoch: 0, Step: 3, Rank: 2, loss = 1.5990341901779175
Epoch: 0, Step: 3, Rank: 0, loss = 0.29927965998649597
Epoch: 0, Step: 3, Rank: 4, loss = 1.0216584205627441
Epoch: 0, Step: 3, Rank: 1, loss = 1.3783875703811646
Epoch: 0, Step: 3, Rank: 7, loss = 1.3471806049346924
[2024-06-27 16:41:14,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.5584415584415585e-07], mom=[(0.9, 0.95)]
[2024-06-27 16:41:14,601] [INFO] [timer.py:260:stop] epoch=0/micro_step=3/global_step=3, RunningAvgSamplesPerSec=96.2315216744463, CurrSamplesPerSec=96.2315216744463, MemAllocated=22.29GB, MaxMemAllocated=28.58GB
throughput: 96.10978268834587 samples/s, lr: 1.5584415584415585e-07, loss: 0.29927965998649597 cuda_mem_allocated: 22.285038471221924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6114.0 batch_size: 81.0 total loss: 1.3669049739837646
Epoch 0: 1% 3/213 [00:03<04:14, 1.21s/it] total tokens: 2457 num samples: 9 num padding tokens: 126 - rank: 3 max len: 273 min len: 248 avg len: 259.0 num_loss_counted_tokens: 1047 | |
total tokens: 2470 num samples: 10 num padding tokens: 129 - rank: 4 max len: 247 min len: 220 avg len: 234.1 num_loss_counted_tokens: 909 | |
total tokens: 2420 num samples: 11 num padding tokens: 268 - rank: 5 max len: 220 min len: 174 avg len: 195.63636363636363 num_loss_counted_tokens: 1018 | |
total tokens: 2475 num samples: 15 num padding tokens: 210 - rank: 6 max len: 165 min len: 130 avg len: 151.0 num_loss_counted_tokens: 835 | |
total tokens: 2360 num samples: 8 num padding tokens: 62 - rank: 2 max len: 295 min len: 277 avg len: 287.25 num_loss_counted_tokens: 927 | |
total tokens: 2261 num samples: 7 num padding tokens: 103 - rank: 1 max len: 323 min len: 295 avg len: 308.2857142857143 num_loss_counted_tokens: 1369 | |
total tokens: 2304 num samples: 18 num padding tokens: 409 - rank: 7 max len: 128 min len: 84 avg len: 105.27777777777777 num_loss_counted_tokens: 522 | |
total tokens: 2135 num samples: 5 num padding tokens: 232 - rank: 0 max len: 427 min len: 329 avg len: 380.6 num_loss_counted_tokens: 1496 | |
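As an aside on reading these lines: the numbers in this log are consistent with each rank's printed loss being its "Per-token loss scaled by world size" value multiplied by the step's global `num_loss_counted_tokens` and divided by the world size (8 here), with "total loss" the mean of the eight per-rank losses. A minimal sketch, inferred from the step-3 values above rather than taken from the InstructLab training code:

```python
# Sketch of the apparent relationship between the logged quantities for step 3.
# This is an inference from the log's numbers, not the actual training code.
world_size = 8
num_loss_counted_tokens = 6114.0  # reported for step 3 above

# Rank 3's "Per-token loss scaled by world size" value:
scaled_per_token_loss = 0.0019702184945344925
rank_loss = scaled_per_token_loss * num_loss_counted_tokens / world_size
print(rank_loss)  # ~1.5057, matching "Rank: 3, loss = 1.505739450454712"

# The eight per-rank losses reported for step 3:
rank_losses = [
    0.29927965998649597, 1.3783875703811646, 1.5990341901779175,
    1.505739450454712, 1.0216584205627441, 2.273798942565918,
    1.5101611614227295, 1.3471806049346924,
]
total_loss = sum(rank_losses) / len(rank_losses)
print(total_loss)  # ~1.3669, matching "total loss: 1.3669049739837646"
```

(The `loss:` field in the throughput line, by contrast, is just rank 0's own loss.)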
Per-token loss scaled by world size: 0.001208532601594925
Per-token loss scaled by world size: 0.001757761579938233
Per-token loss scaled by world size: 0.0013055852614343166
Per-token loss scaled by world size: 0.001869880361482501
Per-token loss scaled by world size: 0.0014144543092697859
Per-token loss scaled by world size: 0.001308830687776208
Per-token loss scaled by world size: 0.0012367891613394022 | |
Epoch: 0, Step: 4, Rank: 4, loss = 1.3536328077316284 | |
Epoch: 0, Step: 4, Rank: 5, loss = 1.1565656661987305
Epoch: 0, Step: 4, Rank: 7, loss = 1.2494450807571411
Epoch: 0, Step: 4, Rank: 1, loss = 1.7894755601882935
Epoch: 0, Step: 4, Rank: 2, loss = 1.6821777820587158
Epoch: 0, Step: 4, Rank: 6, loss = 1.2525509595870972 | |
Epoch: 0, Step: 4, Rank: 3, loss = 1.1836072206497192 | |
Per-token loss scaled by world size: 0.0015714052133262157 | |
Epoch: 0, Step: 4, Rank: 0, loss = 1.503834843635559 | |
[2024-06-27 16:41:15,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[2.077922077922078e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:15,663] [INFO] [timer.py:260:stop] epoch=0/micro_step=4/global_step=4, RunningAvgSamplesPerSec=95.70544300942215, CurrSamplesPerSec=95.18508500635784, MemAllocated=22.31GB, MaxMemAllocated=28.6GB | |
throughput: 95.07239730007387 samples/s, lr: 2.077922077922078e-07, loss: 1.503834843635559 cuda_mem_allocated: 22.312707901000977 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7656.0 batch_size: 87.0 total loss: 1.3964112997055054 | |
Epoch 0: 2% 4/213 [00:04<04:00, 1.15s/it] total tokens: 2280 num samples: 6 num padding tokens: 192 - rank: 2 max len: 380 min len: 304 avg len: 348.0 num_loss_counted_tokens: 1276 | |
total tokens: 2502 num samples: 9 num padding tokens: 204 - rank: 3 max len: 278 min len: 235 avg len: 255.33333333333334 num_loss_counted_tokens: 1202 | |
total tokens: 2412 num samples: 12 num padding tokens: 165 - rank: 5 max len: 201 min len: 171 avg len: 187.25 num_loss_counted_tokens: 1087 | |
total tokens: 2505 num samples: 15 num padding tokens: 200 - rank: 6 max len: 167 min len: 125 avg len: 153.66666666666666 num_loss_counted_tokens: 850 | |
total tokens: 2195 num samples: 5 num padding tokens: 59 - rank: 1 max len: 439 min len: 418 avg len: 427.2 num_loss_counted_tokens: 1231 | |
total tokens: 1845 num samples: 15 num padding tokens: 234 - rank: 7 max len: 123 min len: 91 avg len: 107.4 num_loss_counted_tokens: 395 | |
total tokens: 2497 num samples: 11 num padding tokens: 115 - rank: 4 max len: 227 min len: 205 avg len: 216.54545454545453 num_loss_counted_tokens: 827 | |
total tokens: 1986 num samples: 3 num padding tokens: 245 - rank: 0 max len: 662 min len: 485 avg len: 580.3333333333334 num_loss_counted_tokens: 476 | |
Per-token loss scaled by world size: 0.0015053263632580638
Per-token loss scaled by world size: 0.0005892603076063097
Per-token loss scaled by world size: 0.0021721089724451303
Per-token loss scaled by world size: 0.0016074457671493292
Per-token loss scaled by world size: 0.0011634426191449165
Per-token loss scaled by world size: 0.002225358271971345 | |
Epoch: 0, Step: 5, Rank: 0, loss = 0.5614914298057556
Epoch: 0, Step: 5, Rank: 5, loss = 1.4343878030776978
Epoch: 0, Step: 5, Rank: 1, loss = 1.5316948890686035
Epoch: 0, Step: 5, Rank: 3, loss = 2.0697484016418457
Epoch: 0, Step: 5, Rank: 7, loss = 1.1086153984069824
Epoch: 0, Step: 5, Rank: 2, loss = 2.120488166809082 | |
Per-token loss scaled by world size: 0.0013402224285528064 | |
Per-token loss scaled by world size: 0.001313359010964632 | |
Epoch: 0, Step: 5, Rank: 6, loss = 1.2770644426345825 | |
Epoch: 0, Step: 5, Rank: 4, loss = 1.251466989517212 | |
[2024-06-27 16:41:16,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[2.597402597402598e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:16,709] [INFO] [timer.py:260:stop] epoch=0/micro_step=5/global_step=5, RunningAvgSamplesPerSec=96.04795179630791, CurrSamplesPerSec=96.74037697335353, MemAllocated=22.25GB, MaxMemAllocated=28.6GB | |
throughput: 96.61398644655536 samples/s, lr: 2.597402597402598e-07, loss: 0.5614914298057556 cuda_mem_allocated: 22.252480506896973 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7623.0 batch_size: 69.0 total loss: 1.4193696975708008 | |
Epoch 0: 2% 5/213 [00:05<03:51, 1.11s/it] total tokens: 2490 num samples: 6 num padding tokens: 154 - rank: 1 max len: 415 min len: 363 avg len: 389.3333333333333 num_loss_counted_tokens: 1546 | |
total tokens: 2534 num samples: 7 num padding tokens: 226 - rank: 2 max len: 362 min len: 314 avg len: 329.7142857142857 num_loss_counted_tokens: 1456 | |
total tokens: 2520 num samples: 14 num padding tokens: 336 - rank: 6 max len: 180 min len: 136 avg len: 156.0 num_loss_counted_tokens: 769 | |
total tokens: 2295 num samples: 9 num padding tokens: 146 - rank: 4 max len: 255 min len: 222 avg len: 238.77777777777777 num_loss_counted_tokens: 773 | |
total tokens: 2408 num samples: 8 num padding tokens: 115 - rank: 3 max len: 301 min len: 266 avg len: 286.625 num_loss_counted_tokens: 1149 | |
total tokens: 2398 num samples: 11 num padding tokens: 241 - rank: 5 max len: 218 min len: 181 avg len: 196.0909090909091 num_loss_counted_tokens: 922 | |
total tokens: 2118 num samples: 3 num padding tokens: 513 - rank: 0 max len: 706 min len: 441 avg len: 535.0 num_loss_counted_tokens: 749 | |
total tokens: 2412 num samples: 18 num padding tokens: 462 - rank: 7 max len: 134 min len: 76 avg len: 108.33333333333333 num_loss_counted_tokens: 561 | |
Per-token loss scaled by world size: 0.0016384563641622663
Per-token loss scaled by world size: 0.0014407404232770205
Per-token loss scaled by world size: 0.0006767849554307759
Per-token loss scaled by world size: 0.0013280463172122836
Per-token loss scaled by world size: 0.0022748447954654694
Per-token loss scaled by world size: 0.0014405775582417846
Per-token loss scaled by world size: 0.0011797224869951606
Epoch: 0, Step: 6, Rank: 4, loss = 1.5426067113876343
Epoch: 0, Step: 6, Rank: 2, loss = 1.3564571142196655
Epoch: 0, Step: 6, Rank: 7, loss = 0.6371930241584778
Epoch: 0, Step: 6, Rank: 3, loss = 1.1107087135314941
Epoch: 0, Step: 6, Rank: 1, loss = 2.141766309738159
Epoch: 0, Step: 6, Rank: 6, loss = 1.25035560131073
Epoch: 0, Step: 6, Rank: 0, loss = 1.3563038110733032
Per-token loss scaled by world size: 0.001480351435020566 | |
Epoch: 0, Step: 6, Rank: 5, loss = 1.3937509059906006 | |
[2024-06-27 16:41:17,684] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[3.116883116883117e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:17,757] [INFO] [timer.py:260:stop] epoch=0/micro_step=6/global_step=6, RunningAvgSamplesPerSec=96.11403261537606, CurrSamplesPerSec=96.31282176277023, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 96.2162989336599 samples/s, lr: 3.116883116883117e-07, loss: 1.3563038110733032 cuda_mem_allocated: 22.272515773773193 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7532.0 batch_size: 71.0 total loss: 1.3486427068710327 | |
Epoch 0: 3% 6/213 [00:07<03:45, 1.09s/it] total tokens: 2408 num samples: 14 num padding tokens: 164 - rank: 5 max len: 172 min len: 152 avg len: 160.28571428571428 num_loss_counted_tokens: 726 | |
total tokens: 2296 num samples: 8 num padding tokens: 117 - rank: 2 max len: 287 min len: 259 avg len: 272.375 num_loss_counted_tokens: 825 | |
total tokens: 2398 num samples: 11 num padding tokens: 160 - rank: 4 max len: 218 min len: 181 avg len: 203.45454545454547 num_loss_counted_tokens: 851 | |
total tokens: 2295 num samples: 9 num padding tokens: 136 - rank: 3 max len: 255 min len: 223 avg len: 239.88888888888889 num_loss_counted_tokens: 975 | |
total tokens: 2310 num samples: 7 num padding tokens: 103 - rank: 1 max len: 330 min len: 302 avg len: 315.2857142857143 num_loss_counted_tokens: 816 | |
total tokens: 2482 num samples: 17 num padding tokens: 205 - rank: 6 max len: 146 min len: 123 avg len: 133.94117647058823 num_loss_counted_tokens: 855 | |
total tokens: 2500 num samples: 5 num padding tokens: 160 - rank: 0 max len: 500 min len: 425 avg len: 468.0 num_loss_counted_tokens: 1269 | |
total tokens: 2337 num samples: 19 num padding tokens: 358 - rank: 7 max len: 123 min len: 78 avg len: 104.15789473684211 num_loss_counted_tokens: 530 | |
Per-token loss scaled by world size: 0.0019260219996795058
Per-token loss scaled by world size: 0.0012690859148278832
Per-token loss scaled by world size: 0.0011052560294046998
Per-token loss scaled by world size: 0.0006829628837294877
Per-token loss scaled by world size: 0.0015464224852621555
Per-token loss scaled by world size: 0.0016165170818567276
Per-token loss scaled by world size: 0.0011752048740163445
Epoch: 0, Step: 7, Rank: 2, loss = 1.1348072290420532
Epoch: 0, Step: 7, Rank: 0, loss = 1.5609493255615234
Epoch: 0, Step: 7, Rank: 4, loss = 1.2254611253738403
Epoch: 0, Step: 7, Rank: 3, loss = 1.859815001487732
Epoch: 0, Step: 7, Rank: 6, loss = 1.067262887954712
Epoch: 0, Step: 7, Rank: 5, loss = 1.4932641983032227
Epoch: 0, Step: 7, Rank: 7, loss = 0.6594860553741455
Per-token loss scaled by world size: 0.0023967588786035776 | |
Epoch: 0, Step: 7, Rank: 1, loss = 2.3143703937530518 | |
[2024-06-27 16:41:18,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=7, skipped=0, lr=[3.6363636363636366e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:18,812] [INFO] [timer.py:260:stop] epoch=0/micro_step=7/global_step=7, RunningAvgSamplesPerSec=96.05955535596553, CurrSamplesPerSec=95.8422624736177, MemAllocated=22.24GB, MaxMemAllocated=28.6GB | |
throughput: 95.74697602218868 samples/s, lr: 3.6363636363636366e-07, loss: 1.5609493255615234 cuda_mem_allocated: 22.24389410018921 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7725.0 batch_size: 75.0 total loss: 1.4144270420074463 | |
Epoch 0: 3% 7/213 [00:08<03:42, 1.08s/it] total tokens: 2431 num samples: 17 num padding tokens: 480 - rank: 7 max len: 143 min len: 79 avg len: 114.76470588235294 num_loss_counted_tokens: 617 | |
total tokens: 2208 num samples: 4 num padding tokens: 342 - rank: 1 max len: 552 min len: 424 avg len: 466.5 num_loss_counted_tokens: 899 | |
total tokens: 2422 num samples: 7 num padding tokens: 216 - rank: 3 max len: 346 min len: 292 avg len: 315.14285714285717 num_loss_counted_tokens: 742 | |
total tokens: 2280 num samples: 8 num padding tokens: 297 - rank: 4 max len: 285 min len: 229 avg len: 247.875 num_loss_counted_tokens: 870 | |
total tokens: 2431 num samples: 11 num padding tokens: 253 - rank: 5 max len: 221 min len: 178 avg len: 198.0 num_loss_counted_tokens: 644 | |
total tokens: 2464 num samples: 14 num padding tokens: 213 - rank: 6 max len: 176 min len: 148 avg len: 160.78571428571428 num_loss_counted_tokens: 912 | |
total tokens: 2532 num samples: 6 num padding tokens: 160 - rank: 2 max len: 422 min len: 353 avg len: 395.3333333333333 num_loss_counted_tokens: 1173 | |
total tokens: 1754 num samples: 2 num padding tokens: 265 - rank: 0 max len: 877 min len: 612 avg len: 744.5 num_loss_counted_tokens: 642 | |
Per-token loss scaled by world size: 0.00137547985650599
Per-token loss scaled by world size: 0.0016671305056661367
Per-token loss scaled by world size: 0.0002774471649900079
Per-token loss scaled by world size: 0.0019099974306300282
Per-token loss scaled by world size: 0.0011880508391186595
Per-token loss scaled by world size: 0.0027690879069268703
Per-token loss scaled by world size: 0.001052290783263743 | |
Epoch: 0, Step: 8, Rank: 3, loss = 1.0983530282974243 | |
Epoch: 0, Step: 8, Rank: 1, loss = 1.541262149810791
Epoch: 0, Step: 8, Rank: 5, loss = 1.271631121635437
Epoch: 0, Step: 8, Rank: 0, loss = 0.25649991631507874
Epoch: 0, Step: 8, Rank: 4, loss = 1.7657926082611084
Epoch: 0, Step: 8, Rank: 7, loss = 0.9728428721427917
Epoch: 0, Step: 8, Rank: 2, loss = 2.5600218772888184
Per-token loss scaled by world size: 0.0012348840245977044 | |
Epoch: 0, Step: 8, Rank: 6, loss = 1.1416503190994263 | |
[2024-06-27 16:41:19,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=8, skipped=0, lr=[4.155844155844156e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:19,866] [INFO] [timer.py:260:stop] epoch=0/micro_step=8/global_step=8, RunningAvgSamplesPerSec=96.01011764335593, CurrSamplesPerSec=95.7636904249434, MemAllocated=22.29GB, MaxMemAllocated=28.6GB | |
throughput: 95.66008237179061 samples/s, lr: 4.155844155844156e-07, loss: 0.25649991631507874 cuda_mem_allocated: 22.285754203796387 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7396.0 batch_size: 83.0 total loss: 1.3260066509246826 | |
Epoch 0: 4% 8/213 [00:09<03:39, 1.07s/it] total tokens: 2530 num samples: 10 num padding tokens: 237 - rank: 3 max len: 253 min len: 214 avg len: 229.3 num_loss_counted_tokens: 1138 | |
total tokens: 2464 num samples: 8 num padding tokens: 266 - rank: 2 max len: 308 min len: 258 avg len: 274.75 num_loss_counted_tokens: 939 | |
total tokens: 2448 num samples: 12 num padding tokens: 161 - rank: 4 max len: 204 min len: 182 avg len: 190.58333333333334 num_loss_counted_tokens: 950 | |
total tokens: 2280 num samples: 6 num padding tokens: 212 - rank: 1 max len: 380 min len: 324 avg len: 344.6666666666667 num_loss_counted_tokens: 1046 | |
total tokens: 2448 num samples: 17 num padding tokens: 145 - rank: 6 max len: 144 min len: 127 avg len: 135.47058823529412 num_loss_counted_tokens: 774 | |
total tokens: 2534 num samples: 14 num padding tokens: 209 - rank: 5 max len: 181 min len: 147 avg len: 166.07142857142858 num_loss_counted_tokens: 891 | |
total tokens: 2500 num samples: 20 num padding tokens: 355 - rank: 7 max len: 125 min len: 90 avg len: 107.25 num_loss_counted_tokens: 582 | |
total tokens: 2040 num samples: 3 num padding tokens: 319 - rank: 0 max len: 680 min len: 452 avg len: 573.6666666666666 num_loss_counted_tokens: 1064 | |
Per-token loss scaled by world size: 0.0014683828921988606
Per-token loss scaled by world size: 0.0016043963842093945
Per-token loss scaled by world size: 0.0018583537312224507
Per-token loss scaled by world size: 0.001770748058333993
Per-token loss scaled by world size: 0.00180139543954283
Per-token loss scaled by world size: 0.0018689936259761453
Per-token loss scaled by world size: 0.0016208573943004012
Epoch: 0, Step: 9, Rank: 0, loss = 1.5706535577774048
Epoch: 0, Step: 9, Rank: 1, loss = 1.4230996370315552
Epoch: 0, Step: 9, Rank: 4, loss = 1.5978378057479858
Epoch: 0, Step: 9, Rank: 5, loss = 1.3024556636810303
Epoch: 0, Step: 9, Rank: 2, loss = 1.648359775543213
Epoch: 0, Step: 9, Rank: 6, loss = 1.4377005100250244
Epoch: 0, Step: 9, Rank: 3, loss = 1.6577973365783691
Per-token loss scaled by world size: 0.0012299425434321165 | |
Epoch: 0, Step: 9, Rank: 7, loss = 1.090959072113037 | |
[2024-06-27 16:41:20,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=9, skipped=0, lr=[4.675324675324676e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:20,924] [INFO] [timer.py:260:stop] epoch=0/micro_step=9/global_step=9, RunningAvgSamplesPerSec=95.94776163885263, CurrSamplesPerSec=95.57531994870092, MemAllocated=22.29GB, MaxMemAllocated=28.6GB | |
throughput: 95.48201234130524 samples/s, lr: 4.675324675324676e-07, loss: 1.5706535577774048 cuda_mem_allocated: 22.291240215301514 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7096.0 batch_size: 93.0 total loss: 1.466107964515686 | |
Epoch 0: 4% 9/213 [00:10<03:37, 1.07s/it] total tokens: 2410 num samples: 5 num padding tokens: 277 - rank: 1 max len: 482 min len: 379 avg len: 426.6 num_loss_counted_tokens: 1506 | |
total tokens: 2352 num samples: 12 num padding tokens: 211 - rank: 5 max len: 196 min len: 162 avg len: 178.41666666666666 num_loss_counted_tokens: 782 | |
total tokens: 2403 num samples: 9 num padding tokens: 194 - rank: 3 max len: 267 min len: 231 avg len: 245.44444444444446 num_loss_counted_tokens: 679 | |
total tokens: 2184 num samples: 6 num padding tokens: 270 - rank: 2 max len: 364 min len: 277 avg len: 319.0 num_loss_counted_tokens: 930 | |
total tokens: 2430 num samples: 15 num padding tokens: 230 - rank: 6 max len: 162 min len: 136 avg len: 146.66666666666666 num_loss_counted_tokens: 838 | |
total tokens: 2497 num samples: 11 num padding tokens: 185 - rank: 4 max len: 227 min len: 197 avg len: 210.1818181818182 num_loss_counted_tokens: 928 | |
total tokens: 2527 num samples: 19 num padding tokens: 360 - rank: 7 max len: 133 min len: 85 avg len: 114.05263157894737 num_loss_counted_tokens: 708 | |
total tokens: 2224 num samples: 4 num padding tokens: 134 - rank: 0 max len: 556 min len: 485 avg len: 522.5 num_loss_counted_tokens: 1323 | |
Per-token loss scaled by world size: 0.0011083846911787987
Per-token loss scaled by world size: 0.0011594243114814162
Per-token loss scaled by world size: 0.001401674235239625
Per-token loss scaled by world size: 0.00240886933170259
Per-token loss scaled by world size: 0.0019389678491279483
Per-token loss scaled by world size: 0.0018687343690544367
Per-token loss scaled by world size: 0.0014738640747964382
Epoch: 0, Step: 10, Rank: 7, loss = 1.0675129890441895 | |
Epoch: 0, Step: 10, Rank: 5, loss = 1.1166704893112183
Epoch: 0, Step: 10, Rank: 2, loss = 1.349987506866455
Epoch: 0, Step: 10, Rank: 0, loss = 2.320042371749878
Epoch: 0, Step: 10, Rank: 3, loss = 1.8674683570861816
Epoch: 0, Step: 10, Rank: 6, loss = 1.799824833869934 | |
Per-token loss scaled by world size: 0.0015482893213629723
Epoch: 0, Step: 10, Rank: 1, loss = 1.4195153713226318
Epoch: 0, Step: 10, Rank: 4, loss = 1.4911961555480957 | |
[2024-06-27 16:41:21,905] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[5.194805194805196e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:21,979] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=95.97138399789038, CurrSamplesPerSec=96.13706675987818, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 96.04621058773185 samples/s, lr: 5.194805194805196e-07, loss: 2.320042371749878 cuda_mem_allocated: 22.272157669067383 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7705.0 batch_size: 79.0 total loss: 1.5540271997451782 | |
Epoch 0: 5% 10/213 [00:11<03:35, 1.06s/it] total tokens: 2500 num samples: 10 num padding tokens: 295 - rank: 4 max len: 250 min len: 200 avg len: 220.5 num_loss_counted_tokens: 907 | |
total tokens: 2272 num samples: 8 num padding tokens: 91 - rank: 3 max len: 284 min len: 256 avg len: 272.625 num_loss_counted_tokens: 757 | |
total tokens: 2232 num samples: 6 num padding tokens: 185 - rank: 2 max len: 372 min len: 314 avg len: 341.1666666666667 num_loss_counted_tokens: 1263 | |
total tokens: 2352 num samples: 12 num padding tokens: 260 - rank: 5 max len: 196 min len: 163 avg len: 174.33333333333334 num_loss_counted_tokens: 739 | |
total tokens: 2150 num samples: 5 num padding tokens: 130 - rank: 1 max len: 430 min len: 384 avg len: 404.0 num_loss_counted_tokens: 1214 | |
total tokens: 2445 num samples: 15 num padding tokens: 266 - rank: 6 max len: 163 min len: 135 avg len: 145.26666666666668 num_loss_counted_tokens: 668 | |
total tokens: 2484 num samples: 4 num padding tokens: 329 - rank: 0 max len: 621 min len: 442 avg len: 538.75 num_loss_counted_tokens: 860 | |
total tokens: 2508 num samples: 19 num padding tokens: 441 - rank: 7 max len: 132 min len: 77 avg len: 108.78947368421052 num_loss_counted_tokens: 497 | |
Per-token loss scaled by world size: 0.0010614210041239858
Per-token loss scaled by world size: 0.0014034874038770795
Per-token loss scaled by world size: 0.0013799527660012245
Per-token loss scaled by world size: 0.0014280208852142096
Per-token loss scaled by world size: 0.002117231721058488
Per-token loss scaled by world size: 0.0011735373409464955
Per-token loss scaled by world size: 0.0021539500448852777
Epoch: 0, Step: 11, Rank: 2, loss = 0.9681486487388611 | |
Epoch: 0, Step: 11, Rank: 4, loss = 1.2586894035339355
Epoch: 0, Step: 11, Rank: 6, loss = 1.280155897140503
Epoch: 0, Step: 11, Rank: 5, loss = 1.302533507347107
Epoch: 0, Step: 11, Rank: 0, loss = 1.0704127550125122
Per-token loss scaled by world size: 0.001191622344776988
Epoch: 0, Step: 11, Rank: 1, loss = 1.9311798810958862
Epoch: 0, Step: 11, Rank: 3, loss = 1.9646717309951782
Epoch: 0, Step: 11, Rank: 7, loss = 1.0869085788726807 | |
[2024-06-27 16:41:22,961] [INFO] [logging.py:96:log_dist] [Rank 0] step=11, skipped=0, lr=[5.714285714285715e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:23,034] [INFO] [timer.py:260:stop] epoch=0/micro_step=11/global_step=11, RunningAvgSamplesPerSec=95.92757797123055, CurrSamplesPerSec=95.5785641751634, MemAllocated=22.26GB, MaxMemAllocated=28.6GB | |
throughput: 95.4869032323241 samples/s, lr: 5.714285714285715e-07, loss: 1.0704127550125122 cuda_mem_allocated: 22.256415367126465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7297.0 batch_size: 80.0 total loss: 1.3578375577926636 | |
Epoch 0: 5% 11/213 [00:12<03:34, 1.06s/it] total tokens: 2364 num samples: 12 num padding tokens: 133 - rank: 5 max len: 197 min len: 167 avg len: 185.91666666666666 num_loss_counted_tokens: 835 | |
total tokens: 2376 num samples: 11 num padding tokens: 75 - rank: 4 max len: 216 min len: 198 avg len: 209.1818181818182 num_loss_counted_tokens: 785 | |
total tokens: 2331 num samples: 9 num padding tokens: 187 - rank: 3 max len: 259 min len: 219 avg len: 238.22222222222223 num_loss_counted_tokens: 912 | |
total tokens: 2520 num samples: 9 num padding tokens: 110 - rank: 2 max len: 280 min len: 259 avg len: 267.77777777777777 num_loss_counted_tokens: 1000 | |
total tokens: 2490 num samples: 15 num padding tokens: 306 - rank: 6 max len: 166 min len: 127 avg len: 145.6 num_loss_counted_tokens: 960 | |
total tokens: 2515 num samples: 5 num padding tokens: 636 - rank: 0 max len: 503 min len: 323 avg len: 375.8 num_loss_counted_tokens: 958 | |
total tokens: 2240 num samples: 7 num padding tokens: 131 - rank: 1 max len: 320 min len: 284 avg len: 301.2857142857143 num_loss_counted_tokens: 1323 | |
total tokens: 2413 num samples: 19 num padding tokens: 324 - rank: 7 max len: 127 min len: 79 avg len: 109.94736842105263 num_loss_counted_tokens: 578 | |
Per-token loss scaled by world size: 0.0013383155455812812
Per-token loss scaled by world size: 0.0017805329989641905
Per-token loss scaled by world size: 0.0012576576555147767
Per-token loss scaled by world size: 0.0017124736914411187
Per-token loss scaled by world size: 0.0028887703083455563
Per-token loss scaled by world size: 0.0011990099446848035
Per-token loss scaled by world size: 0.0013487422838807106
Epoch: 0, Step: 12, Rank: 2, loss = 1.7293426990509033
Epoch: 0, Step: 12, Rank: 6, loss = 1.2998390197753906
Epoch: 0, Step: 12, Rank: 4, loss = 1.221500039100647
Epoch: 0, Step: 12, Rank: 1, loss = 2.805718183517456
Epoch: 0, Step: 12, Rank: 0, loss = 1.6632400751113892
Epoch: 0, Step: 12, Rank: 3, loss = 1.1645383834838867
Per-token loss scaled by world size: 0.001060148817487061
Epoch: 0, Step: 12, Rank: 5, loss = 1.309965968132019 | |
Epoch: 0, Step: 12, Rank: 7, loss = 1.0296695232391357 | |
[2024-06-27 16:41:24,015] [INFO] [logging.py:96:log_dist] [Rank 0] step=12, skipped=0, lr=[6.233766233766234e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:24,088] [INFO] [timer.py:260:stop] epoch=0/micro_step=12/global_step=12, RunningAvgSamplesPerSec=95.90573154462837, CurrSamplesPerSec=95.70956056431233, MemAllocated=22.26GB, MaxMemAllocated=28.6GB | |
throughput: 95.58278426781294 samples/s, lr: 6.233766233766234e-07, loss: 1.6632400751113892 cuda_mem_allocated: 22.258203983306885 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7770.0 batch_size: 78.0 total loss: 1.5279767513275146 | |
Epoch 0: 6% 12/213 [00:13<03:32, 1.06s/it] total tokens: 2509 num samples: 13 num padding tokens: 237 - rank: 5 max len: 193 min len: 157 avg len: 174.76923076923077 num_loss_counted_tokens: 879 | |
total tokens: 2408 num samples: 8 num padding tokens: 256 - rank: 3 max len: 301 min len: 249 avg len: 269.0 num_loss_counted_tokens: 906 | |
total tokens: 2390 num samples: 10 num padding tokens: 236 - rank: 4 max len: 239 min len: 196 avg len: 215.4 num_loss_counted_tokens: 1115 | |
total tokens: 2408 num samples: 7 num padding tokens: 151 - rank: 2 max len: 344 min len: 305 avg len: 322.42857142857144 num_loss_counted_tokens: 838 | |
total tokens: 2496 num samples: 16 num padding tokens: 272 - rank: 6 max len: 156 min len: 117 avg len: 139.0 num_loss_counted_tokens: 745 | |
total tokens: 2170 num samples: 5 num padding tokens: 219 - rank: 1 max len: 434 min len: 367 avg len: 390.2 num_loss_counted_tokens: 811 | |
total tokens: 1986 num samples: 3 num padding tokens: 183 - rank: 0 max len: 662 min len: 562 avg len: 601.0 num_loss_counted_tokens: 651 | |
total tokens: 2415 num samples: 21 num padding tokens: 249 - rank: 7 max len: 115 min len: 83 avg len: 103.14285714285714 num_loss_counted_tokens: 507 | |
Per-token loss scaled by world size: 0.00319823925383389
Per-token loss scaled by world size: 0.003556205425411463
Per-token loss scaled by world size: 0.0001228039327543229
Per-token loss scaled by world size: 0.0017982145072892308
Per-token loss scaled by world size: 0.0036040786653757095
Per-token loss scaled by world size: 0.001817232114262879
Per-token loss scaled by world size: 0.003064669668674469
Epoch: 0, Step: 13, Rank: 4, loss = 2.3244247436523438 | |
Epoch: 0, Step: 13, Rank: 3, loss = 1.1753579378128052
Epoch: 0, Step: 13, Rank: 2, loss = 2.090449094772339
Epoch: 0, Step: 13, Rank: 5, loss = 2.3557159900665283
Epoch: 0, Step: 13, Rank: 7, loss = 1.1877883672714233
Epoch: 0, Step: 13, Rank: 0, loss = 0.08026771992444992
Epoch: 0, Step: 13, Rank: 6, loss = 2.0031447410583496 | |
Per-token loss scaled by world size: 0.0008352873846888542 | |
Epoch: 0, Step: 13, Rank: 1, loss = 0.5459647178649902 | |
[2024-06-27 16:41:25,068] [INFO] [logging.py:96:log_dist] [Rank 0] step=13, skipped=0, lr=[6.753246753246753e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:25,140] [INFO] [timer.py:260:stop] epoch=0/micro_step=13/global_step=13, RunningAvgSamplesPerSec=95.93293456876103, CurrSamplesPerSec=96.20581597966803, MemAllocated=22.26GB, MaxMemAllocated=28.6GB | |
throughput: 96.10228171004144 samples/s, lr: 6.753246753246753e-07, loss: 0.08026771992444992 cuda_mem_allocated: 22.262736797332764 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5229.0 batch_size: 77.0 total loss: 1.4703891277313232 | |
Epoch 0: 6% 13/213 [00:14<03:31, 1.06s/it] total tokens: 2261 num samples: 7 num padding tokens: 220 - rank: 2 max len: 323 min len: 262 avg len: 291.57142857142856 num_loss_counted_tokens: 721 | |
total tokens: 2431 num samples: 11 num padding tokens: 187 - rank: 4 max len: 221 min len: 190 avg len: 204.0 num_loss_counted_tokens: 1057 | |
total tokens: 2349 num samples: 9 num padding tokens: 158 - rank: 3 max len: 261 min len: 222 avg len: 243.44444444444446 num_loss_counted_tokens: 898 | |
total tokens: 2480 num samples: 16 num padding tokens: 169 - rank: 6 max len: 155 min len: 131 avg len: 144.4375 num_loss_counted_tokens: 873 | |
total tokens: 2470 num samples: 13 num padding tokens: 172 - rank: 5 max len: 190 min len: 161 avg len: 176.76923076923077 num_loss_counted_tokens: 823 | |
total tokens: 2358 num samples: 6 num padding tokens: 212 - rank: 1 max len: 393 min len: 327 avg len: 357.6666666666667 num_loss_counted_tokens: 1096 | |
total tokens: 2337 num samples: 3 num padding tokens: 722 - rank: 0 max len: 779 min len: 413 avg len: 538.3333333333334 num_loss_counted_tokens: 676 | |
total tokens: 2413 num samples: 19 num padding tokens: 335 - rank: 7 max len: 127 min len: 83 avg len: 109.36842105263158 num_loss_counted_tokens: 586 | |
Per-token loss scaled by world size: 0.0013381103053689003
Per-token loss scaled by world size: 0.0018794572679325938
Per-token loss scaled by world size: 0.0011530570918694139
Per-token loss scaled by world size: 0.0015317659126594663
Per-token loss scaled by world size: 0.0009951952379196882
Per-token loss scaled by world size: 0.001884638681076467
Epoch: 0, Step: 14, Rank: 5, loss = 1.3391138315200806
Epoch: 0, Step: 14, Rank: 4, loss = 1.1539218425750732
Epoch: 0, Step: 14, Rank: 1, loss = 1.8808668851852417
Epoch: 0, Step: 14, Rank: 3, loss = 1.5329147577285767 | |
Epoch: 0, Step: 14, Rank: 2, loss = 1.886052131652832 | |
Epoch: 0, Step: 14, Rank: 6, loss = 0.9959416389465332
Per-token loss scaled by world size: 0.001999085070565343
Per-token loss scaled by world size: 0.0008474871865473688 | |
Epoch: 0, Step: 14, Rank: 0, loss = 2.000584363937378 | |
Epoch: 0, Step: 14, Rank: 7, loss = 0.848122775554657 | |
[2024-06-27 16:41:26,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=14, skipped=0, lr=[7.272727272727273e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:26,206] [INFO] [timer.py:260:stop] epoch=0/micro_step=14/global_step=14, RunningAvgSamplesPerSec=95.84903652127285, CurrSamplesPerSec=94.93575094375034, MemAllocated=22.31GB, MaxMemAllocated=28.6GB | |
throughput: 94.79494390156565 samples/s, lr: 7.272727272727273e-07, loss: 2.000584363937378 cuda_mem_allocated: 22.30865240097046 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8006.0 batch_size: 86.0 total loss: 1.4546897411346436 | |
Epoch 0: 7% 14/213 [00:15<03:30, 1.06s/it] total tokens: 2424 num samples: 8 num padding tokens: 156 - rank: 3 max len: 303 min len: 257 avg len: 283.5 num_loss_counted_tokens: 979 | |
total tokens: 2480 num samples: 16 num padding tokens: 241 - rank: 6 max len: 155 min len: 123 avg len: 139.9375 num_loss_counted_tokens: 795 | |
total tokens: 2184 num samples: 6 num padding tokens: 110 - rank: 2 max len: 364 min len: 308 avg len: 345.6666666666667 num_loss_counted_tokens: 1168 | |
total tokens: 2532 num samples: 6 num padding tokens: 191 - rank: 1 max len: 422 min len: 368 avg len: 390.1666666666667 num_loss_counted_tokens: 1426 | |
total tokens: 2412 num samples: 12 num padding tokens: 247 - rank: 5 max len: 201 min len: 166 avg len: 180.41666666666666 num_loss_counted_tokens: 759 | |
total tokens: 2261 num samples: 19 num padding tokens: 299 - rank: 7 max len: 119 min len: 91 avg len: 103.26315789473684 num_loss_counted_tokens: 498 | |
total tokens: 2500 num samples: 10 num padding tokens: 278 - rank: 4 max len: 250 min len: 203 avg len: 222.2 num_loss_counted_tokens: 827 | |
total tokens: 2264 num samples: 4 num padding tokens: 267 - rank: 0 max len: 566 min len: 445 avg len: 499.25 num_loss_counted_tokens: 1267 | |
Per-token loss scaled by world size: 0.001072316663339734
Per-token loss scaled by world size: 0.0011827138951048255
Per-token loss scaled by world size: 0.0016549252904951572
Per-token loss scaled by world size: 0.0016529569402337074
Per-token loss scaled by world size: 0.0012709875591099262
Per-token loss scaled by world size: 0.0011481934925541282
Per-token loss scaled by world size: 0.001602635602466762
Epoch: 0, Step: 15, Rank: 2, loss = 1.5783849954605103
Epoch: 0, Step: 15, Rank: 7, loss = 1.0227220058441162
Epoch: 0, Step: 15, Rank: 5, loss = 1.5765076875686646
Epoch: 0, Step: 15, Rank: 1, loss = 1.1280133724212646
Epoch: 0, Step: 15, Rank: 6, loss = 1.5285136699676514
Epoch: 0, Step: 15, Rank: 0, loss = 1.2122043371200562
Epoch: 0, Step: 15, Rank: 3, loss = 1.0950895547866821
Per-token loss scaled by world size: 0.0016304106684401631 | |
Epoch: 0, Step: 15, Rank: 4, loss = 1.5550041198730469 | |
[2024-06-27 16:41:27,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=15, skipped=0, lr=[7.792207792207792e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:27,259] [INFO] [timer.py:260:stop] epoch=0/micro_step=15/global_step=15, RunningAvgSamplesPerSec=95.86242641431832, CurrSamplesPerSec=96.02339742473795, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 95.92409418377973 samples/s, lr: 7.792207792207792e-07, loss: 1.2122043371200562 cuda_mem_allocated: 22.27454423904419 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7630.0 batch_size: 86.0 total loss: 1.337054967880249 | |
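The per-rank losses and the "total loss" in the step-15 block above can be reproduced from the logged numbers: each rank's loss appears to equal its "Per-token loss scaled by world size" value times num_loss_counted_tokens / world size, and the total loss is the mean over the eight ranks. A sketch under those assumptions (the rank-to-value pairing below is read off the logged lines, not taken from the training code):

```python
# Reproduce the step-15 per-rank and total losses from the logged values.
# Assumption (inferred from the log, not from InstructLab source):
#   rank_loss = per_token_loss_scaled * num_loss_counted_tokens / world_size
#   total_loss = mean(rank_losses)
WORLD_SIZE = 8
NUM_LOSS_COUNTED_TOKENS = 7630.0  # from the step-15 throughput line above

per_token_scaled = [            # the eight logged values, ordered by rank
    0.0012709875591099262,      # rank 0
    0.0011827138951048255,      # rank 1
    0.0016549252904951572,      # rank 2
    0.0011481934925541282,      # rank 3
    0.0016304106684401631,      # rank 4
    0.0016529569402337074,      # rank 5
    0.001602635602466762,       # rank 6
    0.001072316663339734,       # rank 7
]

rank_losses = [p * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE for p in per_token_scaled]
total_loss = sum(rank_losses) / WORLD_SIZE

print(round(total_loss, 6))  # matches the logged "total loss: 1.337054967880249"
```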
Epoch 0: 7% 15/213 [00:16<03:29, 1.06s/it]
total tokens: 2385 num samples: 15 num padding tokens: 180 - rank: 6 max len: 159 min len: 136 avg len: 147.0 num_loss_counted_tokens: 903
total tokens: 2418 num samples: 13 num padding tokens: 220 - rank: 5 max len: 186 min len: 159 avg len: 169.07692307692307 num_loss_counted_tokens: 1015 | |
total tokens: 2453 num samples: 11 num padding tokens: 199 - rank: 4 max len: 223 min len: 188 avg len: 204.9090909090909 num_loss_counted_tokens: 715 | |
total tokens: 2422 num samples: 7 num padding tokens: 354 - rank: 2 max len: 346 min len: 261 avg len: 295.42857142857144 num_loss_counted_tokens: 1106 | |
total tokens: 2235 num samples: 5 num padding tokens: 247 - rank: 1 max len: 447 min len: 355 avg len: 397.6 num_loss_counted_tokens: 786 | |
total tokens: 2304 num samples: 9 num padding tokens: 177 - rank: 3 max len: 256 min len: 223 avg len: 236.33333333333334 num_loss_counted_tokens: 788 | |
total tokens: 2430 num samples: 18 num padding tokens: 300 - rank: 7 max len: 135 min len: 96 avg len: 118.33333333333333 num_loss_counted_tokens: 642 | |
total tokens: 2132 num samples: 4 num padding tokens: 209 - rank: 0 max len: 533 min len: 448 avg len: 480.75 num_loss_counted_tokens: 1024 | |
Per-token loss scaled by world size: 0.0017991786589846015
Per-token loss scaled by world size: 0.0020480118691921234
Per-token loss scaled by world size: 0.002071066526696086
Per-token loss scaled by world size: 0.001926294295117259
Per-token loss scaled by world size: 0.0016120504587888718
Per-token loss scaled by world size: 0.0010637413943186402
Per-token loss scaled by world size: 0.0019166981801390648
Epoch: 0, Step: 16, Rank: 5, loss = 1.9381872415542603
Epoch: 0, Step: 16, Rank: 3, loss = 1.0066982507705688
Epoch: 0, Step: 16, Rank: 1, loss = 1.9600056409835815
Epoch: 0, Step: 16, Rank: 6, loss = 1.70269775390625
Epoch: 0, Step: 16, Rank: 2, loss = 1.822996735572815
Epoch: 0, Step: 16, Rank: 4, loss = 1.525604248046875
Epoch: 0, Step: 16, Rank: 0, loss = 1.8139152526855469
Per-token loss scaled by world size: 0.0011032491456717253 | |
Epoch: 0, Step: 16, Rank: 7, loss = 1.0440874099731445 | |
[2024-06-27 16:41:28,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=0, lr=[8.311688311688312e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:28,312] [INFO] [timer.py:260:stop] epoch=0/micro_step=16/global_step=16, RunningAvgSamplesPerSec=95.8717084742141, CurrSamplesPerSec=95.99253903204811, MemAllocated=22.22GB, MaxMemAllocated=28.6GB | |
throughput: 95.90423990327956 samples/s, lr: 8.311688311688312e-07, loss: 1.8139152526855469 cuda_mem_allocated: 22.21705961227417 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7571.0 batch_size: 80.0 total loss: 1.6017740964889526 | |
Epoch 0: 8% 16/213 [00:17<03:28, 1.06s/it]
total tokens: 2508 num samples: 12 num padding tokens: 153 - rank: 4 max len: 209 min len: 184 avg len: 196.25 num_loss_counted_tokens: 829
total tokens: 2506 num samples: 14 num padding tokens: 161 - rank: 5 max len: 179 min len: 158 avg len: 167.5 num_loss_counted_tokens: 809 | |
total tokens: 2344 num samples: 8 num padding tokens: 216 - rank: 2 max len: 293 min len: 244 avg len: 266.0 num_loss_counted_tokens: 853 | |
total tokens: 2496 num samples: 16 num padding tokens: 194 - rank: 6 max len: 156 min len: 131 avg len: 143.875 num_loss_counted_tokens: 824 | |
total tokens: 2178 num samples: 6 num padding tokens: 121 - rank: 1 max len: 363 min len: 326 avg len: 342.8333333333333 num_loss_counted_tokens: 699 | |
total tokens: 2430 num samples: 10 num padding tokens: 171 - rank: 3 max len: 243 min len: 210 avg len: 225.9 num_loss_counted_tokens: 858 | |
total tokens: 2004 num samples: 3 num padding tokens: 480 - rank: 0 max len: 668 min len: 375 avg len: 508.0 num_loss_counted_tokens: 1082 | |
total tokens: 2451 num samples: 19 num padding tokens: 331 - rank: 7 max len: 129 min len: 83 avg len: 111.57894736842105 num_loss_counted_tokens: 522 | |
Per-token loss scaled by world size: 0.001671988982707262
Per-token loss scaled by world size: 0.0018468183698132634
Per-token loss scaled by world size: 0.0012665605172514915
Per-token loss scaled by world size: 0.0014818720519542694
Per-token loss scaled by world size: 0.0011362729128450155
Per-token loss scaled by world size: 0.0014294381253421307
Per-token loss scaled by world size: 0.0016345924232155085
Epoch: 0, Step: 17, Rank: 3, loss = 1.8872175216674805
Epoch: 0, Step: 17, Rank: 5, loss = 1.5142879486083984
Epoch: 0, Step: 17, Rank: 1, loss = 1.7085636854171753
Epoch: 0, Step: 17, Rank: 4, loss = 1.1611288785934448
Epoch: 0, Step: 17, Rank: 2, loss = 1.2942665815353394
Epoch: 0, Step: 17, Rank: 6, loss = 1.4607070684432983
Epoch: 0, Step: 17, Rank: 0, loss = 1.67034912109375 | |
Per-token loss scaled by world size: 0.0008283466449938715 | |
Epoch: 0, Step: 17, Rank: 7, loss = 0.8464667201042175 | |
[2024-06-27 16:41:29,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=0, lr=[8.831168831168832e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:29,363] [INFO] [timer.py:260:stop] epoch=0/micro_step=17/global_step=17, RunningAvgSamplesPerSec=95.8893184295598, CurrSamplesPerSec=96.13653883023541, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 95.98745891392592 samples/s, lr: 8.831168831168832e-07, loss: 1.67034912109375 cuda_mem_allocated: 22.272157669067383 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8175.0 batch_size: 80.0 total loss: 1.442873477935791 | |
Epoch 0: 8% 17/213 [00:18<03:26, 1.05s/it]
total tokens: 2324 num samples: 7 num padding tokens: 157 - rank: 2 max len: 332 min len: 274 avg len: 309.57142857142856 num_loss_counted_tokens: 937
total tokens: 2519 num samples: 11 num padding tokens: 181 - rank: 4 max len: 229 min len: 193 avg len: 212.54545454545453 num_loss_counted_tokens: 1209 | |
total tokens: 2394 num samples: 14 num padding tokens: 336 - rank: 6 max len: 171 min len: 129 avg len: 147.0 num_loss_counted_tokens: 747 | |
total tokens: 2439 num samples: 9 num padding tokens: 198 - rank: 3 max len: 271 min len: 230 avg len: 249.0 num_loss_counted_tokens: 889 | |
total tokens: 2444 num samples: 13 num padding tokens: 115 - rank: 5 max len: 188 min len: 172 avg len: 179.15384615384616 num_loss_counted_tokens: 931 | |
total tokens: 2322 num samples: 6 num padding tokens: 71 - rank: 1 max len: 387 min len: 354 avg len: 375.1666666666667 num_loss_counted_tokens: 962 | |
total tokens: 2428 num samples: 4 num padding tokens: 448 - rank: 0 max len: 607 min len: 393 avg len: 495.0 num_loss_counted_tokens: 1317 | |
total tokens: 2375 num samples: 19 num padding tokens: 331 - rank: 7 max len: 125 min len: 78 avg len: 107.57894736842105 num_loss_counted_tokens: 577 | |
Per-token loss scaled by world size: 0.002093401039019227
Per-token loss scaled by world size: 0.0014379583299160004
Per-token loss scaled by world size: 0.0016214289935305715
Per-token loss scaled by world size: 0.0013020787155255675
Per-token loss scaled by world size: 0.002127719344571233
Per-token loss scaled by world size: 0.002393938135355711
Per-token loss scaled by world size: 0.0013802594039589167
Epoch: 0, Step: 18, Rank: 2, loss = 1.7997846603393555 | |
Epoch: 0, Step: 18, Rank: 3, loss = 1.3715262413024902
Epoch: 0, Step: 18, Rank: 6, loss = 1.216333031654358
Epoch: 0, Step: 18, Rank: 4, loss = 1.770755648612976
Epoch: 0, Step: 18, Rank: 7, loss = 1.1675269603729248
Epoch: 0, Step: 18, Rank: 1, loss = 2.024972438812256
Epoch: 0, Step: 18, Rank: 5, loss = 1.101395845413208 | |
Per-token loss scaled by world size: 0.0016835287678986788 | |
Epoch: 0, Step: 18, Rank: 0, loss = 1.4240548610687256 | |
[2024-06-27 16:41:30,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=18, skipped=0, lr=[9.350649350649352e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:30,421] [INFO] [timer.py:260:stop] epoch=0/micro_step=18/global_step=18, RunningAvgSamplesPerSec=95.8767895965041, CurrSamplesPerSec=95.6892492142802, MemAllocated=22.31GB, MaxMemAllocated=28.6GB | |
throughput: 95.5992144140585 samples/s, lr: 9.350649350649352e-07, loss: 1.4240548610687256 cuda_mem_allocated: 22.307698249816895 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6767.0 batch_size: 77.0 total loss: 1.4845436811447144 | |
Epoch 0: 8% 18/213 [00:19<03:25, 1.06s/it]
total tokens: 2457 num samples: 9 num padding tokens: 119 - rank: 4 max len: 273 min len: 234 avg len: 259.77777777777777 num_loss_counted_tokens: 1300
total tokens: 2320 num samples: 10 num padding tokens: 241 - rank: 5 max len: 232 min len: 190 avg len: 207.9 num_loss_counted_tokens: 739 | |
total tokens: 2196 num samples: 6 num padding tokens: 174 - rank: 2 max len: 366 min len: 305 avg len: 337.0 num_loss_counted_tokens: 882 | |
total tokens: 2115 num samples: 5 num padding tokens: 173 - rank: 1 max len: 423 min len: 366 avg len: 388.4 num_loss_counted_tokens: 804 | |
total tokens: 2336 num samples: 8 num padding tokens: 58 - rank: 3 max len: 292 min len: 274 avg len: 284.75 num_loss_counted_tokens: 1323 | |
total tokens: 2431 num samples: 13 num padding tokens: 440 - rank: 6 max len: 187 min len: 129 avg len: 153.15384615384616 num_loss_counted_tokens: 763 | |
total tokens: 2356 num samples: 19 num padding tokens: 338 - rank: 7 max len: 124 min len: 78 avg len: 106.21052631578948 num_loss_counted_tokens: 597 | |
total tokens: 2252 num samples: 4 num padding tokens: 113 - rank: 0 max len: 563 min len: 459 avg len: 534.75 num_loss_counted_tokens: 1563 | |
Per-token loss scaled by world size: 0.0012483698083087802
Per-token loss scaled by world size: 0.002057483186945319
Per-token loss scaled by world size: 0.0013635861687362194
Per-token loss scaled by world size: 0.002183597767725587
Per-token loss scaled by world size: 0.0012890166835859418
Per-token loss scaled by world size: 0.002308495109900832
Per-token loss scaled by world size: 0.0014407304115593433 | |
Epoch: 0, Step: 19, Rank: 3, loss = 1.117134928703308 | |
Epoch: 0, Step: 19, Rank: 6, loss = 1.2202391624450684 | |
Epoch: 0, Step: 19, Rank: 5, loss = 2.06581449508667
Epoch: 0, Step: 19, Rank: 2, loss = 1.8411903381347656
Epoch: 0, Step: 19, Rank: 0, loss = 1.1535087823867798
Epoch: 0, Step: 19, Rank: 1, loss = 1.9540469646453857
Per-token loss scaled by world size: 0.001318735652603209 | |
Epoch: 0, Step: 19, Rank: 4, loss = 1.2892736196517944 | |
Epoch: 0, Step: 19, Rank: 7, loss = 1.1801035404205322 | |
[2024-06-27 16:41:31,401] [INFO] [logging.py:96:log_dist] [Rank 0] step=19, skipped=0, lr=[9.870129870129872e-07], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:31,474] [INFO] [timer.py:260:stop] epoch=0/micro_step=19/global_step=19, RunningAvgSamplesPerSec=95.89438502997807, CurrSamplesPerSec=96.176792877456, MemAllocated=22.26GB, MaxMemAllocated=28.6GB | |
throughput: 96.07118922617792 samples/s, lr: 9.870129870129872e-07, loss: 1.1535087823867798 cuda_mem_allocated: 22.256892204284668 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7159.0 batch_size: 79.0 total loss: 1.4776639938354492 | |
Epoch 0: 9% 19/213 [00:20<03:24, 1.05s/it]
total tokens: 2506 num samples: 7 num padding tokens: 244 - rank: 3 max len: 358 min len: 276 avg len: 323.14285714285717 num_loss_counted_tokens: 1048
total tokens: 2420 num samples: 10 num padding tokens: 320 - rank: 5 max len: 242 min len: 189 avg len: 210.0 num_loss_counted_tokens: 770 | |
total tokens: 2418 num samples: 6 num padding tokens: 145 - rank: 2 max len: 403 min len: 360 avg len: 378.8333333333333 num_loss_counted_tokens: 1049 | |
total tokens: 2484 num samples: 9 num padding tokens: 169 - rank: 4 max len: 276 min len: 243 avg len: 257.22222222222223 num_loss_counted_tokens: 1004 | |
total tokens: 2148 num samples: 4 num padding tokens: 157 - rank: 1 max len: 537 min len: 436 avg len: 497.75 num_loss_counted_tokens: 1372 | |
total tokens: 2457 num samples: 13 num padding tokens: 241 - rank: 6 max len: 189 min len: 149 avg len: 170.46153846153845 num_loss_counted_tokens: 909 | |
total tokens: 2430 num samples: 3 num padding tokens: 476 - rank: 0 max len: 810 min len: 564 avg len: 651.3333333333334 num_loss_counted_tokens: 1234 | |
total tokens: 2516 num samples: 17 num padding tokens: 507 - rank: 7 max len: 148 min len: 81 avg len: 118.17647058823529 num_loss_counted_tokens: 632 | |
Per-token loss scaled by world size: 0.00189284048974514
Per-token loss scaled by world size: 0.0013468463439494371
Per-token loss scaled by world size: 0.0011827753623947501
Per-token loss scaled by world size: 0.0007272624061442912
Per-token loss scaled by world size: 0.0018043563468381763
Per-token loss scaled by world size: 0.00139837886672467
Per-token loss scaled by world size: 0.0013607203727588058
Epoch: 0, Step: 20, Rank: 0, loss = 1.921942949295044 | |
Epoch: 0, Step: 20, Rank: 2, loss = 1.4198789596557617 | |
Epoch: 0, Step: 20, Rank: 5, loss = 1.2009605169296265
Epoch: 0, Step: 20, Rank: 3, loss = 1.3675540685653687
Epoch: 0, Step: 20, Rank: 1, loss = 1.832098364830017
Epoch: 0, Step: 20, Rank: 7, loss = 0.7384440898895264
Epoch: 0, Step: 20, Rank: 4, loss = 1.3816415071487427 | |
Per-token loss scaled by world size: 0.0010783413890749216 | |
Epoch: 0, Step: 20, Rank: 6, loss = 1.0949208736419678 | |
[2024-06-27 16:41:32,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.0389610389610392e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:32,530] [INFO] [timer.py:260:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=95.8777511390329, CurrSamplesPerSec=95.59585530608001, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 95.50808023042254 samples/s, lr: 1.0389610389610392e-06, loss: 1.921942949295044 cuda_mem_allocated: 22.269176959991455 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8123.0 batch_size: 83.0 total loss: 1.3696801662445068 | |
Epoch 0: 9% 20/213 [00:21<03:23, 1.06s/it]
total tokens: 2528 num samples: 16 num padding tokens: 235 - rank: 6 max len: 158 min len: 127 avg len: 143.3125 num_loss_counted_tokens: 823
total tokens: 2430 num samples: 10 num padding tokens: 247 - rank: 4 max len: 243 min len: 192 avg len: 218.3 num_loss_counted_tokens: 964 | |
total tokens: 2282 num samples: 7 num padding tokens: 129 - rank: 2 max len: 326 min len: 281 avg len: 307.57142857142856 num_loss_counted_tokens: 951 | |
total tokens: 2496 num samples: 6 num padding tokens: 309 - rank: 1 max len: 416 min len: 332 avg len: 364.5 num_loss_counted_tokens: 1473 | |
total tokens: 2483 num samples: 13 num padding tokens: 232 - rank: 5 max len: 191 min len: 158 avg len: 173.15384615384616 num_loss_counted_tokens: 800 | |
total tokens: 2385 num samples: 9 num padding tokens: 83 - rank: 3 max len: 265 min len: 246 avg len: 255.77777777777777 num_loss_counted_tokens: 1230 | |
total tokens: 2132 num samples: 4 num padding tokens: 241 - rank: 0 max len: 533 min len: 417 avg len: 472.75 num_loss_counted_tokens: 1386 | |
total tokens: 2356 num samples: 19 num padding tokens: 389 - rank: 7 max len: 124 min len: 72 avg len: 103.52631578947368 num_loss_counted_tokens: 485 | |
Per-token loss scaled by world size: 0.0015498390421271324
Per-token loss scaled by world size: 0.0014788935659453273
Per-token loss scaled by world size: 0.0012543796328827739
Per-token loss scaled by world size: 0.0015987858641892672
Per-token loss scaled by world size: 0.0020744511857628822
Per-token loss scaled by world size: 0.0006408541230484843
Per-token loss scaled by world size: 0.000580018968321383
Epoch: 0, Step: 21, Rank: 1, loss = 1.4227522611618042
Epoch: 0, Step: 21, Rank: 6, loss = 1.3576242923736572
Epoch: 0, Step: 21, Rank: 4, loss = 1.1515204906463623
Epoch: 0, Step: 21, Rank: 5, loss = 1.4676854610443115 | |
Epoch: 0, Step: 21, Rank: 2, loss = 1.9043461084365845
Epoch: 0, Step: 21, Rank: 7, loss = 0.5324574112892151
Epoch: 0, Step: 21, Rank: 0, loss = 0.5883041024208069 | |
Per-token loss scaled by world size: 0.0017352607101202011 | |
Epoch: 0, Step: 21, Rank: 3, loss = 1.592969298362732 | |
[2024-06-27 16:41:33,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=21, skipped=0, lr=[1.090909090909091e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:33,589] [INFO] [timer.py:260:stop] epoch=0/micro_step=21/global_step=21, RunningAvgSamplesPerSec=95.86535362154734, CurrSamplesPerSec=95.64274528256846, MemAllocated=22.25GB, MaxMemAllocated=28.6GB | |
throughput: 95.55155079901728 samples/s, lr: 1.090909090909091e-06, loss: 0.5883041024208069 cuda_mem_allocated: 22.251407623291016 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7344.0 batch_size: 76.0 total loss: 1.2522073984146118 | |
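Across the logged steps, the learning rate rises by a constant amount each step, consistent with a linear warmup phase. A quick check against a few of the logged lr values (the per-step slope of about 5.1948e-08 is inferred from the log itself, not read from the training config):

```python
# Learning rates copied from the [logging.py] lines for a few steps.
# Assumption: lr = global_step * slope during warmup (slope inferred
# from the difference between consecutive logged values).
lrs = {
    15: 7.792207792207792e-07,
    16: 8.311688311688312e-07,
    20: 1.0389610389610392e-06,
    21: 1.090909090909091e-06,
    27: 1.4025974025974026e-06,
}
slope = lrs[16] - lrs[15]  # per-step increment, ~5.1948e-08

for n, lr in lrs.items():
    assert abs(lr - n * slope) < 1e-12  # each logged lr lies on the same line

print("logged lrs are consistent with linear warmup")
```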
Epoch 0: 10% 21/213 [00:22<03:22, 1.06s/it]
total tokens: 2380 num samples: 14 num padding tokens: 230 - rank: 6 max len: 170 min len: 139 avg len: 153.57142857142858 num_loss_counted_tokens: 873
total tokens: 2376 num samples: 12 num padding tokens: 116 - rank: 5 max len: 198 min len: 173 avg len: 188.33333333333334 num_loss_counted_tokens: 709 | |
total tokens: 2530 num samples: 11 num padding tokens: 215 - rank: 4 max len: 230 min len: 198 avg len: 210.45454545454547 num_loss_counted_tokens: 1045 | |
total tokens: 2385 num samples: 9 num padding tokens: 140 - rank: 3 max len: 265 min len: 231 avg len: 249.44444444444446 num_loss_counted_tokens: 1016 | |
total tokens: 2268 num samples: 6 num padding tokens: 304 - rank: 1 max len: 378 min len: 298 avg len: 327.3333333333333 num_loss_counted_tokens: 804 | |
total tokens: 2344 num samples: 8 num padding tokens: 111 - rank: 2 max len: 293 min len: 269 avg len: 279.125 num_loss_counted_tokens: 985 | |
total tokens: 2322 num samples: 18 num padding tokens: 356 - rank: 7 max len: 129 min len: 84 avg len: 109.22222222222223 num_loss_counted_tokens: 597 | |
total tokens: 2313 num samples: 3 num padding tokens: 645 - rank: 0 max len: 771 min len: 444 avg len: 556.0 num_loss_counted_tokens: 1142 | |
Per-token loss scaled by world size: 0.0009969336679205298
Per-token loss scaled by world size: 0.0010108622955158353
Per-token loss scaled by world size: 0.0007795466226525605
Per-token loss scaled by world size: 0.0006575710722245276
Per-token loss scaled by world size: 0.0012540360912680626
Per-token loss scaled by world size: 0.0018550005042925477
Per-token loss scaled by world size: 0.0019121961668133736
Epoch: 0, Step: 22, Rank: 6, loss = 0.9875874519348145
Epoch: 0, Step: 22, Rank: 7, loss = 1.0013854503631592
Epoch: 0, Step: 22, Rank: 4, loss = 0.7722383737564087
Epoch: 0, Step: 22, Rank: 5, loss = 1.2422795295715332
Epoch: 0, Step: 22, Rank: 0, loss = 0.6514063477516174
Epoch: 0, Step: 22, Rank: 1, loss = 1.894269347190857
Epoch: 0, Step: 22, Rank: 3, loss = 1.837609887123108
Per-token loss scaled by world size: 0.0017476376378908753 | |
Epoch: 0, Step: 22, Rank: 2, loss = 1.7312535047531128 | |
[2024-06-27 16:41:34,576] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.142857142857143e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:34,649] [INFO] [timer.py:260:stop] epoch=0/micro_step=22/global_step=22, RunningAvgSamplesPerSec=95.84430542402474, CurrSamplesPerSec=95.44613848576427, MemAllocated=22.27GB, MaxMemAllocated=28.6GB | |
throughput: 95.3410939478241 samples/s, lr: 1.142857142857143e-06, loss: 0.6514063477516174 cuda_mem_allocated: 22.267149925231934 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7925.0 batch_size: 76.0 total loss: 1.264753818511963 | |
Epoch 0: 10% 22/213 [00:23<03:21, 1.06s/it]
total tokens: 2475 num samples: 15 num padding tokens: 195 - rank: 5 max len: 165 min len: 143 avg len: 152.0 num_loss_counted_tokens: 971
total tokens: 2444 num samples: 13 num padding tokens: 114 - rank: 4 max len: 188 min len: 166 avg len: 179.23076923076923 num_loss_counted_tokens: 728 | |
total tokens: 2490 num samples: 10 num padding tokens: 294 - rank: 3 max len: 249 min len: 192 avg len: 219.6 num_loss_counted_tokens: 1091 | |
total tokens: 2286 num samples: 6 num padding tokens: 267 - rank: 1 max len: 381 min len: 311 avg len: 336.5 num_loss_counted_tokens: 751 | |
total tokens: 2397 num samples: 17 num padding tokens: 166 - rank: 6 max len: 141 min len: 121 avg len: 131.23529411764707 num_loss_counted_tokens: 790 | |
total tokens: 2464 num samples: 8 num padding tokens: 164 - rank: 2 max len: 308 min len: 255 avg len: 287.5 num_loss_counted_tokens: 1046 | |
total tokens: 2184 num samples: 4 num padding tokens: 197 - rank: 0 max len: 546 min len: 401 avg len: 496.75 num_loss_counted_tokens: 1258 | |
total tokens: 2499 num samples: 21 num padding tokens: 343 - rank: 7 max len: 119 min len: 83 avg len: 102.66666666666667 num_loss_counted_tokens: 571 | |
Per-token loss scaled by world size: 0.0014706471702083945
Per-token loss scaled by world size: 0.0017663220642134547
Per-token loss scaled by world size: 0.001505438587628305
Per-token loss scaled by world size: 0.0021361932158470154
Per-token loss scaled by world size: 0.0012977442238479853
Per-token loss scaled by world size: 0.0013161160750314593
Per-token loss scaled by world size: 0.0012715160846710205
Epoch: 0, Step: 23, Rank: 4, loss = 1.2586901187896729 | |
Epoch: 0, Step: 23, Rank: 6, loss = 1.5117509365081787
Epoch: 0, Step: 23, Rank: 5, loss = 1.126430869102478
Epoch: 0, Step: 23, Rank: 2, loss = 1.288467288017273
Epoch: 0, Step: 23, Rank: 3, loss = 1.8283144235610962
Epoch: 0, Step: 23, Rank: 7, loss = 1.1107068061828613
Epoch: 0, Step: 23, Rank: 1, loss = 1.0882588624954224
Per-token loss scaled by world size: 0.002270130207762122 | |
Epoch: 0, Step: 23, Rank: 0, loss = 1.9429476261138916 | |
[2024-06-27 16:41:35,633] [INFO] [logging.py:96:log_dist] [Rank 0] step=23, skipped=0, lr=[1.1948051948051948e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:35,706] [INFO] [timer.py:260:stop] epoch=0/micro_step=23/global_step=23, RunningAvgSamplesPerSec=95.83486836329531, CurrSamplesPerSec=95.64651664388255, MemAllocated=22.31GB, MaxMemAllocated=28.6GB | |
throughput: 95.53180508961152 samples/s, lr: 1.1948051948051948e-06, loss: 1.9429476261138916 cuda_mem_allocated: 22.312707901000977 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6847.0 batch_size: 90.0 total loss: 1.394445776939392 | |
Epoch 0: 11% 23/213 [00:24<03:20, 1.06s/it]
total tokens: 2261 num samples: 7 num padding tokens: 194 - rank: 3 max len: 323 min len: 274 avg len: 295.2857142857143 num_loss_counted_tokens: 680
total tokens: 2376 num samples: 11 num padding tokens: 268 - rank: 5 max len: 216 min len: 176 avg len: 191.63636363636363 num_loss_counted_tokens: 873 | |
total tokens: 2505 num samples: 5 num padding tokens: 249 - rank: 1 max len: 501 min len: 394 avg len: 451.2 num_loss_counted_tokens: 1236 | |
total tokens: 2436 num samples: 14 num padding tokens: 140 - rank: 6 max len: 174 min len: 152 avg len: 164.0 num_loss_counted_tokens: 837 | |
total tokens: 2268 num samples: 6 num padding tokens: 116 - rank: 2 max len: 378 min len: 335 avg len: 358.6666666666667 num_loss_counted_tokens: 1144 | |
total tokens: 2448 num samples: 9 num padding tokens: 280 - rank: 4 max len: 272 min len: 218 avg len: 240.88888888888889 num_loss_counted_tokens: 771 | |
total tokens: 2516 num samples: 17 num padding tokens: 456 - rank: 7 max len: 148 min len: 90 avg len: 121.17647058823529 num_loss_counted_tokens: 602 | |
total tokens: 2306 num samples: 2 num padding tokens: 108 - rank: 0 max len: 1153 min len: 1045 avg len: 1099.0 num_loss_counted_tokens: 1954 | |
Per-token loss scaled by world size: 0.00243198755197227
Per-token loss scaled by world size: 0.0016992025775834918
Per-token loss scaled by world size: 0.001585247926414013
Per-token loss scaled by world size: 0.0016948279226198792
Per-token loss scaled by world size: 0.0009180614724755287
Per-token loss scaled by world size: 0.0012782311532646418
Per-token loss scaled by world size: 0.0011824824614450336
Epoch: 0, Step: 24, Rank: 1, loss = 1.9756858348846436 | |
Epoch: 0, Step: 24, Rank: 6, loss = 1.38038969039917
Epoch: 0, Step: 24, Rank: 4, loss = 1.287815809249878
Epoch: 0, Step: 24, Rank: 7, loss = 1.038403034210205
Epoch: 0, Step: 24, Rank: 0, loss = 1.376835823059082
Epoch: 0, Step: 24, Rank: 5, loss = 0.7458102107048035
Epoch: 0, Step: 24, Rank: 3, loss = 0.9606191515922546
Per-token loss scaled by world size: 0.0015160423936322331 | |
Epoch: 0, Step: 24, Rank: 2, loss = 1.2315949201583862 | |
[2024-06-27 16:41:36,691] [INFO] [logging.py:96:log_dist] [Rank 0] step=24, skipped=0, lr=[1.2467532467532468e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:36,764] [INFO] [timer.py:260:stop] epoch=0/micro_step=24/global_step=24, RunningAvgSamplesPerSec=95.82915506967412, CurrSamplesPerSec=95.70933306584531, MemAllocated=22.22GB, MaxMemAllocated=28.6GB | |
throughput: 95.6186019870654 samples/s, lr: 1.2467532467532468e-06, loss: 1.376835823059082 cuda_mem_allocated: 22.223738193511963 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6499.0 batch_size: 69.0 total loss: 1.2496442794799805 | |
Epoch 0: 11% 24/213 [00:26<03:19, 1.06s/it]
total tokens: 2520 num samples: 7 num padding tokens: 235 - rank: 2 max len: 360 min len: 285 avg len: 326.42857142857144 num_loss_counted_tokens: 706
total tokens: 2125 num samples: 5 num padding tokens: 203 - rank: 1 max len: 425 min len: 362 avg len: 384.4 num_loss_counted_tokens: 1095 | |
total tokens: 2522 num samples: 13 num padding tokens: 269 - rank: 5 max len: 194 min len: 153 avg len: 173.30769230769232 num_loss_counted_tokens: 748 | |
total tokens: 2475 num samples: 9 num padding tokens: 198 - rank: 3 max len: 275 min len: 236 avg len: 253.0 num_loss_counted_tokens: 1090 | |
total tokens: 2448 num samples: 16 num padding tokens: 193 - rank: 6 max len: 153 min len: 122 avg len: 140.9375 num_loss_counted_tokens: 884 | |
total tokens: 2519 num samples: 11 num padding tokens: 235 - rank: 4 max len: 229 min len: 197 avg len: 207.63636363636363 num_loss_counted_tokens: 1174 | |
total tokens: 1902 num samples: 3 num padding tokens: 311 - rank: 0 max len: 634 min len: 478 avg len: 530.3333333333334 num_loss_counted_tokens: 659 | |
total tokens: 1708 num samples: 14 num padding tokens: 210 - rank: 7 max len: 122 min len: 86 avg len: 107.0 num_loss_counted_tokens: 390 | |
Per-token loss scaled by world size: 0.0014107475290074944
Per-token loss scaled by world size: 0.0015696436166763306
Per-token loss scaled by world size: 0.0014085440197959542
Per-token loss scaled by world size: 0.001544936210848391
Per-token loss scaled by world size: 0.0017203099559992552
Per-token loss scaled by world size: 0.001288165687583387
Per-token loss scaled by world size: 0.001661831745877862
Epoch: 0, Step: 25, Rank: 5, loss = 1.3021199703216553
Epoch: 0, Step: 25, Rank: 6, loss = 1.3000861406326294
Epoch: 0, Step: 25, Rank: 4, loss = 1.4487810134887695
Epoch: 0, Step: 25, Rank: 0, loss = 1.4259761571884155 | |
Epoch: 0, Step: 25, Rank: 1, loss = 1.587846040725708 | |
Epoch: 0, Step: 25, Rank: 2, loss = 1.1889768838882446
Epoch: 0, Step: 25, Rank: 3, loss = 1.5338706970214844
Per-token loss scaled by world size: 0.0009627352119423449 | |
Epoch: 0, Step: 25, Rank: 7, loss = 0.8886045813560486 | |
[2024-06-27 16:41:37,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=25, skipped=0, lr=[1.2987012987012986e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:37,827] [INFO] [timer.py:260:stop] epoch=0/micro_step=25/global_step=25, RunningAvgSamplesPerSec=95.79422050433683, CurrSamplesPerSec=95.03205291448599, MemAllocated=22.26GB, MaxMemAllocated=28.6GB | |
throughput: 94.87960153003193 samples/s, lr: 1.2987012987012986e-06, loss: 1.4259761571884155 cuda_mem_allocated: 22.257846355438232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7384.0 batch_size: 90.0 total loss: 1.334532618522644 | |
Epoch 0: 12% 25/213 [00:27<03:19, 1.06s/it]
total tokens: 2373 num samples: 7 num padding tokens: 192 - rank: 2 max len: 339 min len: 281 avg len: 311.57142857142856 num_loss_counted_tokens: 1240
total tokens: 2413 num samples: 19 num padding tokens: 335 - rank: 7 max len: 127 min len: 88 avg len: 109.36842105263158 num_loss_counted_tokens: 610 | |
total tokens: 2470 num samples: 10 num padding tokens: 191 - rank: 4 max len: 247 min len: 205 avg len: 227.9 num_loss_counted_tokens: 991 | |
total tokens: 2188 num samples: 4 num padding tokens: 441 - rank: 0 max len: 547 min len: 393 avg len: 436.75 num_loss_counted_tokens: 825 | |
total tokens: 2464 num samples: 16 num padding tokens: 243 - rank: 6 max len: 154 min len: 127 avg len: 138.8125 num_loss_counted_tokens: 793 | |
total tokens: 2310 num samples: 6 num padding tokens: 57 - rank: 1 max len: 385 min len: 365 avg len: 375.5 num_loss_counted_tokens: 1448 | |
total tokens: 2483 num samples: 13 num padding tokens: 197 - rank: 5 max len: 191 min len: 159 avg len: 175.84615384615384 num_loss_counted_tokens: 942 | |
total tokens: 2520 num samples: 9 num padding tokens: 163 - rank: 3 max len: 280 min len: 248 avg len: 261.8888888888889 num_loss_counted_tokens: 1139 | |
Per-token loss scaled by world size: 0.0011576360557228327
Per-token loss scaled by world size: 0.0017824555980041623
Per-token loss scaled by world size: 0.0012702572857961059
Per-token loss scaled by world size: 0.001084008952602744
Per-token loss scaled by world size: 0.00111071125138551
Per-token loss scaled by world size: 0.0014819849748164415
Per-token loss scaled by world size: 0.0011219846783205867
Epoch: 0, Step: 26, Rank: 5, loss = 1.2216699123382568 | |
Epoch: 0, Step: 26, Rank: 1, loss = 1.714276671409607
Epoch: 0, Step: 26, Rank: 4, loss = 1.1133564710617065
Epoch: 0, Step: 26, Rank: 6, loss = 1.0425455570220947
Epoch: 0, Step: 26, Rank: 7, loss = 1.0790687799453735
Epoch: 0, Step: 26, Rank: 2, loss = 1.0682265758514404
Epoch: 0, Step: 26, Rank: 0, loss = 1.425299048423767
Per-token loss scaled by world size: 0.0009582243510521948 | |
Epoch: 0, Step: 26, Rank: 3, loss = 0.9215722680091858 | |
[2024-06-27 16:41:38,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=26, skipped=0, lr=[1.3506493506493506e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:38,886] [INFO] [timer.py:260:stop] epoch=0/micro_step=26/global_step=26, RunningAvgSamplesPerSec=95.73713854441172, CurrSamplesPerSec=94.44277537831931, MemAllocated=22.28GB, MaxMemAllocated=28.6GB | |
throughput: 94.34945411780033 samples/s, lr: 1.3506493506493506e-06, loss: 1.425299048423767 cuda_mem_allocated: 22.27979040145874 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7694.0 batch_size: 81.0 total loss: 1.1982518434524536 | |
Epoch 0: 12% 26/213 [00:28<03:18, 1.06s/it]
total tokens: 2478 num samples: 14 num padding tokens: 194 - rank: 5 max len: 177 min len: 154 avg len: 163.14285714285714 num_loss_counted_tokens: 822
total tokens: 2448 num samples: 8 num padding tokens: 210 - rank: 2 max len: 306 min len: 253 avg len: 279.75 num_loss_counted_tokens: 944 | |
total tokens: 2510 num samples: 10 num padding tokens: 183 - rank: 3 max len: 251 min len: 214 avg len: 232.7 num_loss_counted_tokens: 991 | |
total tokens: 2464 num samples: 16 num padding tokens: 293 - rank: 6 max len: 154 min len: 114 avg len: 135.6875 num_loss_counted_tokens: 812 | |
total tokens: 1332 num samples: 12 num padding tokens: 160 - rank: 7 max len: 111 min len: 77 avg len: 97.66666666666667 num_loss_counted_tokens: 259 | |
total tokens: 2343 num samples: 11 num padding tokens: 196 - rank: 4 max len: 213 min len: 177 avg len: 195.1818181818182 num_loss_counted_tokens: 752 | |
total tokens: 2506 num samples: 7 num padding tokens: 138 - rank: 1 max len: 358 min len: 308 avg len: 338.2857142857143 num_loss_counted_tokens: 1246 | |
total tokens: 2120 num samples: 4 num padding tokens: 283 - rank: 0 max len: 530 min len: 386 avg len: 459.25 num_loss_counted_tokens: 1173 | |
Per-token loss scaled by world size: 0.0016214739298447967
Per-token loss scaled by world size: 0.0010020462796092033
Per-token loss scaled by world size: 0.002396175405010581
Per-token loss scaled by world size: 0.002153418492525816
Per-token loss scaled by world size: 0.0007948008133098483
Per-token loss scaled by world size: 0.0009105096105486155
Per-token loss scaled by world size: 0.0012255565961822867
Epoch: 0, Step: 27, Rank: 4, loss = 1.3995347023010254
Epoch: 0, Step: 27, Rank: 3, loss = 0.8648911714553833
Epoch: 0, Step: 27, Rank: 1, loss = 1.858669400215149
Epoch: 0, Step: 27, Rank: 5, loss = 0.7858836054801941
Epoch: 0, Step: 27, Rank: 6, loss = 1.057808518409729
Epoch: 0, Step: 27, Rank: 2, loss = 2.0681989192962646
Epoch: 0, Step: 27, Rank: 7, loss = 0.6860124468803406
Per-token loss scaled by world size: 0.0010213142959401011 | |
Epoch: 0, Step: 27, Rank: 0, loss = 0.8815219402313232 | |
[2024-06-27 16:41:39,871] [INFO] [logging.py:96:log_dist] [Rank 0] step=27, skipped=0, lr=[1.4025974025974026e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:39,944] [INFO] [timer.py:260:stop] epoch=0/micro_step=27/global_step=27, RunningAvgSamplesPerSec=95.73464844936348, CurrSamplesPerSec=95.67492500395029, MemAllocated=22.31GB, MaxMemAllocated=28.6GB | |
throughput: 95.57318750022252 samples/s, lr: 1.4025974025974026e-06, loss: 0.8815219402313232 cuda_mem_allocated: 22.31079864501953 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6905.0 batch_size: 79.0 total loss: 1.200314998626709 | |
Epoch 0: 13% 27/213 [00:29<03:16, 1.06s/it]
total tokens: 2415 num samples: 5 num padding tokens: 630 - rank: 2 max len: 483 min len: 289 avg len: 357.0 num_loss_counted_tokens: 943
total tokens: 2296 num samples: 8 num padding tokens: 186 - rank: 3 max len: 287 min len: 242 avg len: 263.75 num_loss_counted_tokens: 1105 | |
total tokens: 2370 num samples: 10 num padding tokens: 116 - rank: 4 max len: 237 min len: 196 avg len: 225.4 num_loss_counted_tokens: 1098 | |
total tokens: 2352 num samples: 12 num padding tokens: 181 - rank: 5 max len: 196 min len: 168 avg len: 180.91666666666666 num_loss_counted_tokens: 861 | |
total tokens: 2150 num samples: 2 num padding tokens: 203 - rank: 1 max len: 1075 min len: 872 avg len: 973.5 num_loss_counted_tokens: 186 | |
total tokens: 2505 num samples: 15 num padding tokens: 307 - rank: 6 max len: 167 min len: 132 avg len: 146.53333333333333 num_loss_counted_tokens: 861 | |
total tokens: 2508 num samples: 19 num padding tokens: 434 - rank: 7 max len: 132 min len: 72 avg len: 109.15789473684211 num_loss_counted_tokens: 479 | |
total tokens: 2426 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2426 min len: 2426 avg len: 2426.0 num_loss_counted_tokens: 41 | |
Per-token loss scaled by world size: 0.0008533939253538847
Per-token loss scaled by world size: 0.001326940837316215
Per-token loss scaled by world size: 0.0022073008585721254
Per-token loss scaled by world size: 0.0009769187308847904
Per-token loss scaled by world size: 0.0015909703215584159
Per-token loss scaled by world size: 0.0010183891281485558
Per-token loss scaled by world size: 0.0011881894897669554
Epoch: 0, Step: 28, Rank: 7, loss = 0.7841623425483704
Epoch: 0, Step: 28, Rank: 2, loss = 1.2192927598953247
Epoch: 0, Step: 28, Rank: 1, loss = 2.028233528137207
Epoch: 0, Step: 28, Rank: 3, loss = 0.9357722997665405
Epoch: 0, Step: 28, Rank: 5, loss = 1.0917975902557373
Epoch: 0, Step: 28, Rank: 4, loss = 0.8976662158966064
Epoch: 0, Step: 28, Rank: 6, loss = 1.4619028568267822
Per-token loss scaled by world size: 0.0013041283236816525
Epoch: 0, Step: 28, Rank: 0, loss = 1.1983308792114258 | |
[2024-06-27 16:41:40,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=28, skipped=0, lr=[1.4545454545454546e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:41,006] [INFO] [timer.py:260:stop] epoch=0/micro_step=28/global_step=28, RunningAvgSamplesPerSec=95.71717101122674, CurrSamplesPerSec=95.28229959225224, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.19415387452392 samples/s, lr: 1.4545454545454546e-06, loss: 1.1983308792114258 cuda_mem_allocated: 22.314496517181396 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7351.0 batch_size: 87.0 total loss: 1.2021448612213135 | |
Epoch 0: 13% 28/213 [00:30<03:16, 1.06s/it]
total tokens: 2184 num samples: 6 num padding tokens: 196 - rank: 1 max len: 364 min len: 301 avg len: 331.3333333333333 num_loss_counted_tokens: 1004
total tokens: 2400 num samples: 8 num padding tokens: 181 - rank: 2 max len: 300 min len: 243 avg len: 277.375 num_loss_counted_tokens: 1182 | |
total tokens: 2472 num samples: 12 num padding tokens: 222 - rank: 4 max len: 206 min len: 177 avg len: 187.5 num_loss_counted_tokens: 834 | |
total tokens: 2420 num samples: 10 num padding tokens: 212 - rank: 3 max len: 242 min len: 210 avg len: 220.8 num_loss_counted_tokens: 967 | |
total tokens: 2450 num samples: 14 num padding tokens: 145 - rank: 5 max len: 175 min len: 149 avg len: 164.64285714285714 num_loss_counted_tokens: 968 | |
total tokens: 2420 num samples: 20 num padding tokens: 366 - rank: 7 max len: 121 min len: 87 avg len: 102.7 num_loss_counted_tokens: 463 | |
total tokens: 2533 num samples: 17 num padding tokens: 279 - rank: 6 max len: 149 min len: 121 avg len: 132.58823529411765 num_loss_counted_tokens: 757 | |
total tokens: 1956 num samples: 3 num padding tokens: 471 - rank: 0 max len: 652 min len: 384 avg len: 495.0 num_loss_counted_tokens: 1121 | |
Per-token loss scaled by world size: 0.0012882179580628872
Per-token loss scaled by world size: 0.0010056280298158526
Per-token loss scaled by world size: 0.001909442711621523
Per-token loss scaled by world size: 0.0009883438469842076
Per-token loss scaled by world size: 0.001602682750672102
Per-token loss scaled by world size: 0.0014511172194033861
Per-token loss scaled by world size: 0.0013251092750579119
Epoch: 0, Step: 29, Rank: 1, loss = 1.0389478206634521
Epoch: 0, Step: 29, Rank: 4, loss = 1.5399655103683472
Epoch: 0, Step: 29, Rank: 7, loss = 0.8110390305519104
Epoch: 0, Step: 29, Rank: 2, loss = 1.2925636768341064
Per-token loss scaled by world size: 0.0012569926911965013
Epoch: 0, Step: 29, Rank: 0, loss = 0.7970992922782898
Epoch: 0, Step: 29, Rank: 3, loss = 1.1703259944915771
Epoch: 0, Step: 29, Rank: 5, loss = 1.0687006711959839
Epoch: 0, Step: 29, Rank: 6, loss = 1.0137646198272705 | |
[2024-06-27 16:41:41,993] [INFO] [logging.py:96:log_dist] [Rank 0] step=29, skipped=0, lr=[1.5064935064935066e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:42,066] [INFO] [timer.py:260:stop] epoch=0/micro_step=29/global_step=29, RunningAvgSamplesPerSec=95.7113291911325, CurrSamplesPerSec=95.55969176220978, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.46315540159762 samples/s, lr: 1.5064935064935066e-06, loss: 0.7970992922782898 cuda_mem_allocated: 22.251407623291016 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6452.0 batch_size: 83.0 total loss: 1.0915507078170776 | |
Epoch 0: 14% 29/213 [00:31<03:14, 1.06s/it]
total tokens: 2500 num samples: 10 num padding tokens: 221 - rank: 4 max len: 250 min len: 211 avg len: 227.9 num_loss_counted_tokens: 1054
total tokens: 2036 num samples: 4 num padding tokens: 263 - rank: 1 max len: 509 min len: 386 avg len: 443.25 num_loss_counted_tokens: 940 | |
total tokens: 2238 num samples: 6 num padding tokens: 296 - rank: 2 max len: 373 min len: 288 avg len: 323.6666666666667 num_loss_counted_tokens: 670 | |
total tokens: 2288 num samples: 8 num padding tokens: 127 - rank: 3 max len: 286 min len: 253 avg len: 270.125 num_loss_counted_tokens: 934 | |
total tokens: 2450 num samples: 14 num padding tokens: 241 - rank: 6 max len: 175 min len: 144 avg len: 157.78571428571428 num_loss_counted_tokens: 939 | |
total tokens: 2520 num samples: 12 num padding tokens: 238 - rank: 5 max len: 210 min len: 175 avg len: 190.16666666666666 num_loss_counted_tokens: 772 | |
total tokens: 1762 num samples: 2 num padding tokens: 314 - rank: 0 max len: 881 min len: 567 avg len: 724.0 num_loss_counted_tokens: 1000 | |
total tokens: 2329 num samples: 17 num padding tokens: 430 - rank: 7 max len: 137 min len: 81 avg len: 111.70588235294117 num_loss_counted_tokens: 536 | |
Per-token loss scaled by world size: 0.0018716076156124473
Per-token loss scaled by world size: 0.0014090798795223236
Per-token loss scaled by world size: 0.0007029715925455093
Per-token loss scaled by world size: 0.001286323182284832
Per-token loss scaled by world size: 0.0013113869354128838
Per-token loss scaled by world size: 0.0017444791737943888
Per-token loss scaled by world size: 0.001640479895286262
Epoch: 0, Step: 30, Rank: 2, loss = 1.1032042503356934
Epoch: 0, Step: 30, Rank: 5, loss = 1.185388445854187
Epoch: 0, Step: 30, Rank: 3, loss = 1.467543125152588
Epoch: 0, Step: 30, Rank: 0, loss = 0.5913748741149902
Epoch: 0, Step: 30, Rank: 1, loss = 1.574489951133728
Epoch: 0, Step: 30, Rank: 6, loss = 1.082119345664978
Per-token loss scaled by world size: 0.0009769739117473364
Epoch: 0, Step: 30, Rank: 4, loss = 1.3800537586212158 | |
Epoch: 0, Step: 30, Rank: 7, loss = 0.8218793272972107 | |
[2024-06-27 16:41:43,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.5584415584415584e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:43,120] [INFO] [timer.py:260:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=95.71987713122812, CurrSamplesPerSec=95.95125004914885, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.85403544321636 samples/s, lr: 1.5584415584415584e-06, loss: 0.5913748741149902 cuda_mem_allocated: 22.293267726898193 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6730.0 batch_size: 84.0 total loss: 1.1507567167282104 | |
Epoch 0: 14% 30/213 [00:32<03:13, 1.06s/it]
total tokens: 2416 num samples: 16 num padding tokens: 183 - rank: 6 max len: 151 min len: 126 avg len: 139.5625 num_loss_counted_tokens: 806
total tokens: 2394 num samples: 9 num padding tokens: 155 - rank: 3 max len: 266 min len: 236 avg len: 248.77777777777777 num_loss_counted_tokens: 993 | |
total tokens: 2519 num samples: 11 num padding tokens: 187 - rank: 4 max len: 229 min len: 199 avg len: 212.0 num_loss_counted_tokens: 921 | |
total tokens: 2534 num samples: 7 num padding tokens: 335 - rank: 2 max len: 362 min len: 274 avg len: 314.14285714285717 num_loss_counted_tokens: 1184 | |
total tokens: 2405 num samples: 5 num padding tokens: 269 - rank: 1 max len: 481 min len: 380 avg len: 427.2 num_loss_counted_tokens: 1378 | |
total tokens: 2364 num samples: 12 num padding tokens: 252 - rank: 5 max len: 197 min len: 152 avg len: 176.0 num_loss_counted_tokens: 930 | |
total tokens: 2024 num samples: 2 num padding tokens: 399 - rank: 0 max len: 1012 min len: 613 avg len: 812.5 num_loss_counted_tokens: 843 | |
total tokens: 2500 num samples: 20 num padding tokens: 378 - rank: 7 max len: 125 min len: 81 avg len: 106.1 num_loss_counted_tokens: 467 | |
Per-token loss scaled by world size: 0.0010956590995192528
Per-token loss scaled by world size: 0.0009007177432067692
Per-token loss scaled by world size: 0.0011096697999164462
Per-token loss scaled by world size: 0.001485607703216374
Per-token loss scaled by world size: 0.0014966772869229317
Per-token loss scaled by world size: 0.0019373169634491205
Per-token loss scaled by world size: 0.0005619669100269675
Epoch: 0, Step: 31, Rank: 5, loss = 0.8690800070762634
Epoch: 0, Step: 31, Rank: 4, loss = 1.0571740865707397
Epoch: 0, Step: 31, Rank: 6, loss = 1.0706926584243774
Epoch: 0, Step: 31, Rank: 0, loss = 1.8692686557769775
Epoch: 0, Step: 31, Rank: 3, loss = 1.433425784111023
Epoch: 0, Step: 31, Rank: 2, loss = 1.4441064596176147
Per-token loss scaled by world size: 0.0018234321614727378
Epoch: 0, Step: 31, Rank: 7, loss = 0.5422278046607971
Epoch: 0, Step: 31, Rank: 1, loss = 1.7593841552734375 | |
[2024-06-27 16:41:44,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=31, skipped=0, lr=[1.6103896103896105e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:44,178] [INFO] [timer.py:260:stop] epoch=0/micro_step=31/global_step=31, RunningAvgSamplesPerSec=95.71728674047134, CurrSamplesPerSec=95.64481267874619, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.55220837407094 samples/s, lr: 1.6103896103896105e-06, loss: 1.8692686557769775 cuda_mem_allocated: 22.28456163406372 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7719.0 batch_size: 81.0 total loss: 1.2556699514389038 | |
Epoch 0: 15% 31/213 [00:33<03:12, 1.06s/it]
total tokens: 2534 num samples: 7 num padding tokens: 136 - rank: 1 max len: 362 min len: 323 avg len: 342.57142857142856 num_loss_counted_tokens: 1412
total tokens: 2529 num samples: 9 num padding tokens: 228 - rank: 3 max len: 281 min len: 236 avg len: 255.66666666666666 num_loss_counted_tokens: 1308 | |
total tokens: 2530 num samples: 11 num padding tokens: 236 - rank: 4 max len: 230 min len: 196 avg len: 208.54545454545453 num_loss_counted_tokens: 866 | |
total tokens: 2400 num samples: 16 num padding tokens: 267 - rank: 6 max len: 150 min len: 118 avg len: 133.3125 num_loss_counted_tokens: 595 | |
total tokens: 2219 num samples: 7 num padding tokens: 119 - rank: 2 max len: 317 min len: 282 avg len: 300.0 num_loss_counted_tokens: 1212 | |
total tokens: 2522 num samples: 13 num padding tokens: 285 - rank: 5 max len: 194 min len: 152 avg len: 172.07692307692307 num_loss_counted_tokens: 909 | |
total tokens: 1744 num samples: 16 num padding tokens: 216 - rank: 7 max len: 109 min len: 72 avg len: 95.5 num_loss_counted_tokens: 304 | |
total tokens: 2456 num samples: 4 num padding tokens: 581 - rank: 0 max len: 614 min len: 376 avg len: 468.75 num_loss_counted_tokens: 1390 | |
Per-token loss scaled by world size: 0.0013062176294624805
Per-token loss scaled by world size: 0.001880783005617559
Per-token loss scaled by world size: 0.0009372846689075232
Per-token loss scaled by world size: 0.0010156083153560758
Per-token loss scaled by world size: 0.0012389987241476774
Per-token loss scaled by world size: 0.002112359507009387
Per-token loss scaled by world size: 0.0011788855772465467
Epoch: 0, Step: 32, Rank: 7, loss = 0.8176637291908264
Epoch: 0, Step: 32, Rank: 0, loss = 1.6407480239868164
Epoch: 0, Step: 32, Rank: 1, loss = 1.1395115852355957
Epoch: 0, Step: 32, Rank: 2, loss = 1.8427696228027344
Epoch: 0, Step: 32, Rank: 6, loss = 1.0808714628219604
Epoch: 0, Step: 32, Rank: 4, loss = 0.8859912753105164
Per-token loss scaled by world size: 0.0017476618522778153
Epoch: 0, Step: 32, Rank: 3, loss = 1.0284303426742554 | |
Epoch: 0, Step: 32, Rank: 5, loss = 1.5246164798736572 | |
[2024-06-27 16:41:45,155] [INFO] [logging.py:96:log_dist] [Rank 0] step=32, skipped=0, lr=[1.6623376623376625e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:45,227] [INFO] [timer.py:260:stop] epoch=0/micro_step=32/global_step=32, RunningAvgSamplesPerSec=95.74206604188649, CurrSamplesPerSec=96.46628893419549, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 96.36828212599697 samples/s, lr: 1.6623376623376625e-06, loss: 1.6407480239868164 cuda_mem_allocated: 22.268819332122803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6979.0 batch_size: 82.0 total loss: 1.2450753450393677 | |
Epoch 0: 15% 32/213 [00:34<03:11, 1.06s/it]
total tokens: 2403 num samples: 9 num padding tokens: 332 - rank: 4 max len: 267 min len: 210 avg len: 230.11111111111111 num_loss_counted_tokens: 504
total tokens: 2108 num samples: 4 num padding tokens: 249 - rank: 1 max len: 527 min len: 434 avg len: 464.75 num_loss_counted_tokens: 672 | |
total tokens: 2522 num samples: 13 num padding tokens: 212 - rank: 5 max len: 194 min len: 164 avg len: 177.69230769230768 num_loss_counted_tokens: 965 | |
total tokens: 2261 num samples: 7 num padding tokens: 139 - rank: 3 max len: 323 min len: 288 avg len: 303.14285714285717 num_loss_counted_tokens: 1081 | |
total tokens: 2430 num samples: 15 num padding tokens: 278 - rank: 6 max len: 162 min len: 123 avg len: 143.46666666666667 num_loss_counted_tokens: 871 | |
total tokens: 2130 num samples: 5 num padding tokens: 261 - rank: 2 max len: 426 min len: 326 avg len: 373.8 num_loss_counted_tokens: 666 | |
total tokens: 2313 num samples: 3 num padding tokens: 289 - rank: 0 max len: 771 min len: 596 avg len: 674.6666666666666 num_loss_counted_tokens: 979 | |
total tokens: 2196 num samples: 18 num padding tokens: 207 - rank: 7 max len: 122 min len: 89 avg len: 110.5 num_loss_counted_tokens: 602 | |
Per-token loss scaled by world size: 0.001472035888582468
Per-token loss scaled by world size: 0.0011673582484945655
Per-token loss scaled by world size: 0.0008859222289174795
Per-token loss scaled by world size: 0.0016290945932269096
Per-token loss scaled by world size: 0.0014870993327349424
Per-token loss scaled by world size: 0.0014220958109945059
Per-token loss scaled by world size: 0.0014339566696435213
Epoch: 0, Step: 33, Rank: 5, loss = 1.1916130781173706
Epoch: 0, Step: 33, Rank: 1, loss = 0.7171540260314941
Epoch: 0, Step: 33, Rank: 2, loss = 1.2038068771362305
Epoch: 0, Step: 33, Rank: 6, loss = 1.1607879400253296
Epoch: 0, Step: 33, Rank: 0, loss = 0.9449765086174011
Epoch: 0, Step: 33, Rank: 3, loss = 1.1511865854263306
Epoch: 0, Step: 33, Rank: 4, loss = 1.3187520503997803
Per-token loss scaled by world size: 0.000958455668296665 | |
Epoch: 0, Step: 33, Rank: 7, loss = 0.7758698463439941 | |
[2024-06-27 16:41:46,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=33, skipped=0, lr=[1.7142857142857145e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:46,287] [INFO] [timer.py:260:stop] epoch=0/micro_step=33/global_step=33, RunningAvgSamplesPerSec=95.73160222058533, CurrSamplesPerSec=95.41874766283384, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.30760431028871 samples/s, lr: 1.7142857142857145e-06, loss: 0.9449765086174011 cuda_mem_allocated: 22.253553867340088 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6476.0 batch_size: 88.0 total loss: 1.0580183267593384 | |
Epoch 0: 15% 33/213 [00:35<03:10, 1.06s/it]
total tokens: 2520 num samples: 14 num padding tokens: 124 - rank: 5 max len: 180 min len: 162 avg len: 171.14285714285714 num_loss_counted_tokens: 946
total tokens: 2520 num samples: 12 num padding tokens: 158 - rank: 4 max len: 210 min len: 181 avg len: 196.83333333333334 num_loss_counted_tokens: 818 | |
total tokens: 2254 num samples: 7 num padding tokens: 167 - rank: 2 max len: 322 min len: 280 avg len: 298.14285714285717 num_loss_counted_tokens: 1100 | |
total tokens: 2226 num samples: 6 num padding tokens: 107 - rank: 1 max len: 371 min len: 328 avg len: 353.1666666666667 num_loss_counted_tokens: 1386 | |
total tokens: 2400 num samples: 15 num padding tokens: 221 - rank: 6 max len: 160 min len: 121 avg len: 145.26666666666668 num_loss_counted_tokens: 886 | |
total tokens: 2439 num samples: 9 num padding tokens: 235 - rank: 3 max len: 271 min len: 216 avg len: 244.88888888888889 num_loss_counted_tokens: 1100 | |
total tokens: 2010 num samples: 3 num padding tokens: 470 - rank: 0 max len: 670 min len: 374 avg len: 513.3333333333334 num_loss_counted_tokens: 954 | |
total tokens: 2520 num samples: 21 num padding tokens: 400 - rank: 7 max len: 120 min len: 82 avg len: 100.95238095238095 num_loss_counted_tokens: 498 | |
Per-token loss scaled by world size: 0.001454559387639165
Per-token loss scaled by world size: 0.001217790530063212
Per-token loss scaled by world size: 0.000821855734102428
Per-token loss scaled by world size: 0.001820175675675273
Per-token loss scaled by world size: 0.0009127333760261536
Per-token loss scaled by world size: 0.0013246271992102265
Per-token loss scaled by world size: 0.0014604143798351288
Epoch: 0, Step: 34, Rank: 2, loss = 1.376194953918457
Epoch: 0, Step: 34, Rank: 7, loss = 0.7775782346725464
Epoch: 0, Step: 34, Rank: 5, loss = 1.1521821022033691
Epoch: 0, Step: 34, Rank: 1, loss = 1.3817346096038818
Epoch: 0, Step: 34, Rank: 4, loss = 1.7221137285232544
Epoch: 0, Step: 34, Rank: 3, loss = 1.2532628774642944
Epoch: 0, Step: 34, Rank: 6, loss = 0.8635598421096802 | |
Per-token loss scaled by world size: 0.001401647343300283 | |
Epoch: 0, Step: 34, Rank: 0, loss = 1.3261336088180542 | |
[2024-06-27 16:41:47,271] [INFO] [logging.py:96:log_dist] [Rank 0] step=34, skipped=0, lr=[1.7662337662337665e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:47,345] [INFO] [timer.py:260:stop] epoch=0/micro_step=34/global_step=34, RunningAvgSamplesPerSec=95.7273962549508, CurrSamplesPerSec=95.59719438164073, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.4970715490705 samples/s, lr: 1.7662337662337665e-06, loss: 1.3261336088180542 cuda_mem_allocated: 22.30411958694458 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7569.0 batch_size: 83.0 total loss: 1.2315949201583862 | |
Epoch 0: 16% 34/213 [00:36<03:09, 1.06s/it]
total tokens: 2366 num samples: 13 num padding tokens: 165 - rank: 5 max len: 182 min len: 160 avg len: 169.30769230769232 num_loss_counted_tokens: 992
total tokens: 2496 num samples: 16 num padding tokens: 321 - rank: 6 max len: 156 min len: 122 avg len: 135.9375 num_loss_counted_tokens: 839 | |
total tokens: 2484 num samples: 12 num padding tokens: 140 - rank: 4 max len: 207 min len: 183 avg len: 195.33333333333334 num_loss_counted_tokens: 861 | |
total tokens: 2448 num samples: 8 num padding tokens: 236 - rank: 2 max len: 306 min len: 257 avg len: 276.5 num_loss_counted_tokens: 1338 | |
total tokens: 2520 num samples: 10 num padding tokens: 210 - rank: 3 max len: 252 min len: 210 avg len: 231.0 num_loss_counted_tokens: 631 | |
total tokens: 2465 num samples: 5 num padding tokens: 517 - rank: 1 max len: 493 min len: 341 avg len: 389.6 num_loss_counted_tokens: 754 | |
total tokens: 2520 num samples: 21 num padding tokens: 376 - rank: 7 max len: 120 min len: 84 avg len: 102.0952380952381 num_loss_counted_tokens: 566 | |
total tokens: 2208 num samples: 3 num padding tokens: 241 - rank: 0 max len: 736 min len: 592 avg len: 655.6666666666666 num_loss_counted_tokens: 690 | |
Per-token loss scaled by world size: 0.000942899496294558
Per-token loss scaled by world size: 0.0009772757766768336
Per-token loss scaled by world size: 0.0008537264075130224
Per-token loss scaled by world size: 0.0011113828513771296
Per-token loss scaled by world size: 0.0014757602475583553
Per-token loss scaled by world size: 0.0019430734682828188
Per-token loss scaled by world size: 0.0006698836805298924
Epoch: 0, Step: 35, Rank: 6, loss = 0.9394814968109131
Epoch: 0, Step: 35, Rank: 7, loss = 0.8506316542625427
Epoch: 0, Step: 35, Rank: 3, loss = 1.9360297918319702
Epoch: 0, Step: 35, Rank: 4, loss = 1.4704105854034424
Epoch: 0, Step: 35, Rank: 2, loss = 1.1073540449142456
Epoch: 0, Step: 35, Rank: 1, loss = 0.6674553751945496 | |
Per-token loss scaled by world size: 0.0020391889847815037 | |
Epoch: 0, Step: 35, Rank: 0, loss = 2.031796932220459 | |
Epoch: 0, Step: 35, Rank: 5, loss = 0.973733127117157 | |
[2024-06-27 16:41:48,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=35, skipped=0, lr=[1.8181818181818183e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:48,392] [INFO] [timer.py:260:stop] epoch=0/micro_step=35/global_step=35, RunningAvgSamplesPerSec=95.75808104411513, CurrSamplesPerSec=96.75048855426623, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.65930510995165 samples/s, lr: 1.8181818181818183e-06, loss: 2.031796932220459 cuda_mem_allocated: 22.283130645751953 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7971.0 batch_size: 74.0 total loss: 1.2471115589141846 | |
Epoch 0: 16% 35/213 [00:37<03:07, 1.05s/it]
total tokens: 2260 num samples: 5 num padding tokens: 248 - rank: 1 max len: 452 min len: 371 avg len: 402.4 num_loss_counted_tokens: 1352
total tokens: 2478 num samples: 7 num padding tokens: 281 - rank: 2 max len: 354 min len: 297 avg len: 313.85714285714283 num_loss_counted_tokens: 852 | |
total tokens: 2360 num samples: 8 num padding tokens: 140 - rank: 3 max len: 295 min len: 256 avg len: 277.5 num_loss_counted_tokens: 998 | |
total tokens: 2480 num samples: 10 num padding tokens: 268 - rank: 4 max len: 248 min len: 198 avg len: 221.2 num_loss_counted_tokens: 926 | |
total tokens: 2509 num samples: 13 num padding tokens: 187 - rank: 5 max len: 193 min len: 166 avg len: 178.6153846153846 num_loss_counted_tokens: 1112 | |
total tokens: 2475 num samples: 15 num padding tokens: 191 - rank: 6 max len: 165 min len: 133 avg len: 152.26666666666668 num_loss_counted_tokens: 905 | |
total tokens: 2217 num samples: 3 num padding tokens: 445 - rank: 0 max len: 739 min len: 514 avg len: 590.6666666666666 num_loss_counted_tokens: 1414 | |
total tokens: 2470 num samples: 19 num padding tokens: 384 - rank: 7 max len: 130 min len: 84 avg len: 109.78947368421052 num_loss_counted_tokens: 597 | |
Per-token loss scaled by world size: 0.0010562499519437551
Per-token loss scaled by world size: 0.0007037639152258635
Per-token loss scaled by world size: 0.0008446157444268465
Per-token loss scaled by world size: 0.0009504646295681596
Per-token loss scaled by world size: 0.0012687207199633121
Per-token loss scaled by world size: 0.0014610859798267484
Per-token loss scaled by world size: 0.0020083144772797823
Epoch: 0, Step: 36, Rank: 7, loss = 0.7053473591804504
Epoch: 0, Step: 36, Rank: 6, loss = 1.0586265325546265
Epoch: 0, Step: 36, Rank: 5, loss = 0.8465161323547363
Epoch: 0, Step: 36, Rank: 4, loss = 0.9526031613349915
Per-token loss scaled by world size: 0.0008157128468155861
Epoch: 0, Step: 36, Rank: 2, loss = 1.2715753316879272
Epoch: 0, Step: 36, Rank: 3, loss = 1.4643734693527222
Epoch: 0, Step: 36, Rank: 1, loss = 2.0128331184387207 | |
Epoch: 0, Step: 36, Rank: 0, loss = 0.8175482153892517 | |
[2024-06-27 16:41:49,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=36, skipped=0, lr=[1.8701298701298703e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:49,454] [INFO] [timer.py:260:stop] epoch=0/micro_step=36/global_step=36, RunningAvgSamplesPerSec=95.74451449625428, CurrSamplesPerSec=95.29896491444816, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.21396287004042 samples/s, lr: 1.8701298701298703e-06, loss: 0.8175482153892517 cuda_mem_allocated: 22.30435800552368 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8018.0 batch_size: 69.0 total loss: 1.141178011894226 | |
Epoch 0: 17% 36/213 [00:38<03:06, 1.06s/it]
total tokens: 2412 num samples: 6 num padding tokens: 124 - rank: 1 max len: 402 min len: 351 avg len: 381.3333333333333 num_loss_counted_tokens: 1346
total tokens: 2335 num samples: 5 num padding tokens: 157 - rank: 0 max len: 467 min len: 412 avg len: 435.6 num_loss_counted_tokens: 1137 | |
total tokens: 2528 num samples: 16 num padding tokens: 198 - rank: 6 max len: 158 min len: 126 avg len: 145.625 num_loss_counted_tokens: 809 | |
total tokens: 2529 num samples: 9 num padding tokens: 143 - rank: 3 max len: 281 min len: 247 avg len: 265.1111111111111 num_loss_counted_tokens: 1313 | |
total tokens: 2522 num samples: 13 num padding tokens: 187 - rank: 5 max len: 194 min len: 160 avg len: 179.6153846153846 num_loss_counted_tokens: 874 | |
total tokens: 2420 num samples: 10 num padding tokens: 161 - rank: 4 max len: 242 min len: 204 avg len: 225.9 num_loss_counted_tokens: 1080 | |
total tokens: 2366 num samples: 7 num padding tokens: 186 - rank: 2 max len: 338 min len: 288 avg len: 311.42857142857144 num_loss_counted_tokens: 1093 | |
total tokens: 2196 num samples: 18 num padding tokens: 284 - rank: 7 max len: 122 min len: 88 avg len: 106.22222222222223 num_loss_counted_tokens: 545 | |
Per-token loss scaled by world size: 0.0008984538726508617
Per-token loss scaled by world size: 0.0017901754472404718
Per-token loss scaled by world size: 0.002019608626142144
Per-token loss scaled by world size: 0.0005305592785589397
Per-token loss scaled by world size: 0.0008083579014055431
Per-token loss scaled by world size: 0.0014550243504345417
Per-token loss scaled by world size: 0.0008187826606445014
Epoch: 0, Step: 37, Rank: 5, loss = 0.8302456140518188
Epoch: 0, Step: 37, Rank: 2, loss = 0.9110321998596191
Epoch: 0, Step: 37, Rank: 0, loss = 1.8152378797531128
Epoch: 0, Step: 37, Rank: 1, loss = 2.0478830337524414
Epoch: 0, Step: 37, Rank: 3, loss = 1.4753947257995605
Epoch: 0, Step: 37, Rank: 4, loss = 0.8196749091148376
Epoch: 0, Step: 37, Rank: 7, loss = 0.5379871129989624
Per-token loss scaled by world size: 0.0010051794815808535 | |
Epoch: 0, Step: 37, Rank: 6, loss = 1.0192519426345825 | |
[2024-06-27 16:41:50,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=37, skipped=0, lr=[1.9220779220779223e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:50,513] [INFO] [timer.py:260:stop] epoch=0/micro_step=37/global_step=37, RunningAvgSamplesPerSec=95.71033406718995, CurrSamplesPerSec=94.56254605968631, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 94.47749979762523 samples/s, lr: 1.9220779220779223e-06, loss: 1.8152378797531128 cuda_mem_allocated: 22.268819332122803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8112.0 batch_size: 84.0 total loss: 1.1820884943008423 | |
Epoch 0: 17% 37/213 [00:39<03:06, 1.06s/it]
total tokens: 2215 num samples: 5 num padding tokens: 199 - rank: 1 max len: 443 min len: 369 avg len: 403.2 num_loss_counted_tokens: 1274
total tokens: 2420 num samples: 10 num padding tokens: 189 - rank: 4 max len: 242 min len: 210 avg len: 223.1 num_loss_counted_tokens: 951 | |
total tokens: 2376 num samples: 8 num padding tokens: 262 - rank: 3 max len: 297 min len: 242 avg len: 264.25 num_loss_counted_tokens: 1064 | |
total tokens: 2490 num samples: 15 num padding tokens: 239 - rank: 6 max len: 166 min len: 134 avg len: 150.06666666666666 num_loss_counted_tokens: 921 | |
total tokens: 2506 num samples: 7 num padding tokens: 196 - rank: 2 max len: 358 min len: 298 avg len: 330.0 num_loss_counted_tokens: 1123 | |
total tokens: 2484 num samples: 12 num padding tokens: 264 - rank: 5 max len: 207 min len: 167 avg len: 185.0 num_loss_counted_tokens: 829 | |
total tokens: 2532 num samples: 3 num padding tokens: 746 - rank: 0 max len: 844 min len: 450 avg len: 595.3333333333334 num_loss_counted_tokens: 1414 | |
total tokens: 2376 num samples: 18 num padding tokens: 327 - rank: 7 max len: 132 min len: 94 avg len: 113.83333333333333 num_loss_counted_tokens: 635 | |
Per-token loss scaled by world size: 0.0012521593598648906
Per-token loss scaled by world size: 0.0016965597169473767
Per-token loss scaled by world size: 0.0017084602732211351
Per-token loss scaled by world size: 0.0015294712502509356
Per-token loss scaled by world size: 0.0012999814935028553 | |
Per-token loss scaled by world size: 0.0010105837136507034 | |
Epoch: 0, Step: 38, Rank: 6, loss = 1.1224043369293213 | |
Epoch: 0, Step: 38, Rank: 2, loss = 1.5314210653305054
Epoch: 0, Step: 38, Rank: 4, loss = 1.5207537412643433
Epoch: 0, Step: 38, Rank: 3, loss = 1.3709797859191895
Epoch: 0, Step: 38, Rank: 5, loss = 0.9058619737625122
Epoch: 0, Step: 38, Rank: 1, loss = 1.1652709245681763
Per-token loss scaled by world size: 0.0014145182212814689
Per-token loss scaled by world size: 0.0007597419898957014 | |
Epoch: 0, Step: 38, Rank: 0, loss = 1.2679387331008911 | |
Epoch: 0, Step: 38, Rank: 7, loss = 0.6810137033462524 | |
[2024-06-27 16:41:51,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=38, skipped=0, lr=[1.9740259740259743e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:51,568] [INFO] [timer.py:260:stop] epoch=0/micro_step=38/global_step=38, RunningAvgSamplesPerSec=95.70948093972585, CurrSamplesPerSec=95.67963105712741, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.5758190461293 samples/s, lr: 1.9740259740259743e-06, loss: 1.2679387331008911 cuda_mem_allocated: 22.290405750274658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7171.0 batch_size: 81.0 total loss: 1.195705533027649 | |
Epoch 0: 18% 38/213 [00:40<03:04, 1.06s/it]
total tokens: 2460 num samples: 10 num padding tokens: 221 - rank: 4 max len: 246 min len: 208 avg len: 223.9 num_loss_counted_tokens: 887
total tokens: 2493 num samples: 9 num padding tokens: 136 - rank: 3 max len: 277 min len: 248 avg len: 261.8888888888889 num_loss_counted_tokens: 900 | |
total tokens: 2484 num samples: 12 num padding tokens: 123 - rank: 5 max len: 207 min len: 181 avg len: 196.75 num_loss_counted_tokens: 866 | |
total tokens: 2190 num samples: 5 num padding tokens: 140 - rank: 1 max len: 438 min len: 384 avg len: 410.0 num_loss_counted_tokens: 1305 | |
total tokens: 2443 num samples: 7 num padding tokens: 176 - rank: 2 max len: 349 min len: 297 avg len: 323.85714285714283 num_loss_counted_tokens: 1199 | |
total tokens: 2506 num samples: 14 num padding tokens: 302 - rank: 6 max len: 179 min len: 139 avg len: 157.42857142857142 num_loss_counted_tokens: 952 | |
total tokens: 2313 num samples: 3 num padding tokens: 484 - rank: 0 max len: 771 min len: 527 avg len: 609.6666666666666 num_loss_counted_tokens: 272 | |
total tokens: 2527 num samples: 19 num padding tokens: 389 - rank: 7 max len: 133 min len: 90 avg len: 112.52631578947368 num_loss_counted_tokens: 662 | |
Per-token loss scaled by world size: 0.0014128703624010086
Per-token loss scaled by world size: 0.0008425601990893483
Per-token loss scaled by world size: 0.0006428177002817392
Per-token loss scaled by world size: 0.0016167518915608525
Per-token loss scaled by world size: 0.0014310558326542377
Per-token loss scaled by world size: 0.0014991330681368709
Per-token loss scaled by world size: 0.0007515820907428861
Epoch: 0, Step: 39, Rank: 3, loss = 1.272642970085144
Epoch: 0, Step: 39, Rank: 4, loss = 0.7589361071586609
Epoch: 0, Step: 39, Rank: 2, loss = 1.456289291381836
Epoch: 0, Step: 39, Rank: 1, loss = 0.5790180563926697
Epoch: 0, Step: 39, Rank: 5, loss = 1.289023518562317
Epoch: 0, Step: 39, Rank: 0, loss = 1.3503440618515015
Per-token loss scaled by world size: 0.001208451227284968
Epoch: 0, Step: 39, Rank: 7, loss = 0.6769875884056091
Epoch: 0, Step: 39, Rank: 6, loss = 1.0885124206542969
[2024-06-27 16:41:52,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=39, skipped=0, lr=[2.0259740259740263e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:52,626] [INFO] [timer.py:260:stop] epoch=0/micro_step=39/global_step=39, RunningAvgSamplesPerSec=95.70580636308607, CurrSamplesPerSec=95.57370926073793, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.47888786767321 samples/s, lr: 2.0259740259740263e-06, loss: 1.3503440618515015 cuda_mem_allocated: 22.275021076202393 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7206.0 batch_size: 94.0 total loss: 1.058969259262085 | |
Epoch 0: 18% 39/213 [00:41<03:03, 1.06s/it]
total tokens: 2340 num samples: 12 num padding tokens: 188 - rank: 4 max len: 195 min len: 168 avg len: 179.33333333333334 num_loss_counted_tokens: 931
total tokens: 2527 num samples: 19 num padding tokens: 379 - rank: 6 max len: 133 min len: 101 avg len: 113.05263157894737 num_loss_counted_tokens: 663 | |
total tokens: 2490 num samples: 6 num padding tokens: 237 - rank: 1 max len: 415 min len: 357 avg len: 375.5 num_loss_counted_tokens: 1416 | |
total tokens: 2370 num samples: 10 num padding tokens: 234 - rank: 3 max len: 237 min len: 203 avg len: 213.6 num_loss_counted_tokens: 848 | |
total tokens: 2420 num samples: 4 num padding tokens: 375 - rank: 0 max len: 605 min len: 424 avg len: 511.25 num_loss_counted_tokens: 964 | |
total tokens: 2505 num samples: 15 num padding tokens: 309 - rank: 5 max len: 167 min len: 133 avg len: 146.4 num_loss_counted_tokens: 860 | |
total tokens: 600 num samples: 6 num padding tokens: 35 - rank: 7 max len: 100 min len: 87 avg len: 94.16666666666667 num_loss_counted_tokens: 121 | |
total tokens: 2324 num samples: 7 num padding tokens: 298 - rank: 2 max len: 332 min len: 237 avg len: 289.42857142857144 num_loss_counted_tokens: 1135 | |
Per-token loss scaled by world size: 0.0008073790231719613
Per-token loss scaled by world size: 0.0008199179428629577
Per-token loss scaled by world size: 0.0009082254837267101
Per-token loss scaled by world size: 0.0015512213576585054
Per-token loss scaled by world size: 0.0006398948607966304
Per-token loss scaled by world size: 0.0017390016000717878
Per-token loss scaled by world size: 0.0010933319572359324
Epoch: 0, Step: 40, Rank: 4, loss = 0.8171684741973877
Epoch: 0, Step: 40, Rank: 6, loss = 0.8298594355583191
Epoch: 0, Step: 40, Rank: 5, loss = 0.9192377328872681
Epoch: 0, Step: 40, Rank: 1, loss = 1.760087013244629
Epoch: 0, Step: 40, Rank: 3, loss = 0.6476535797119141
Epoch: 0, Step: 40, Rank: 2, loss = 1.5700299739837646
Epoch: 0, Step: 40, Rank: 0, loss = 1.10658860206604
Per-token loss scaled by world size: 0.0007266728789545596
Epoch: 0, Step: 40, Rank: 7, loss = 0.7354837656021118
[2024-06-27 16:41:53,614] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[2.0779220779220784e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:53,687] [INFO] [timer.py:260:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=95.68684120252419, CurrSamplesPerSec=94.99037576871649, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.89711033336869 samples/s, lr: 2.0779220779220784e-06, loss: 1.10658860206604 cuda_mem_allocated: 22.289571285247803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8097.0 batch_size: 71.0 total loss: 1.0482635498046875 | |
Epoch 0: 19% 40/213 [00:42<03:03, 1.06s/it]
total tokens: 2380 num samples: 14 num padding tokens: 231 - rank: 6 max len: 170 min len: 141 avg len: 153.5 num_loss_counted_tokens: 899
total tokens: 2310 num samples: 10 num padding tokens: 173 - rank: 4 max len: 231 min len: 197 avg len: 213.7 num_loss_counted_tokens: 926 | |
total tokens: 2316 num samples: 4 num padding tokens: 429 - rank: 1 max len: 579 min len: 364 avg len: 471.75 num_loss_counted_tokens: 1290 | |
total tokens: 2492 num samples: 7 num padding tokens: 256 - rank: 2 max len: 356 min len: 267 avg len: 319.42857142857144 num_loss_counted_tokens: 622 | |
total tokens: 2385 num samples: 9 num padding tokens: 185 - rank: 3 max len: 265 min len: 236 avg len: 244.44444444444446 num_loss_counted_tokens: 911 | |
total tokens: 2522 num samples: 13 num padding tokens: 182 - rank: 5 max len: 194 min len: 170 avg len: 180.0 num_loss_counted_tokens: 821 | |
total tokens: 2240 num samples: 16 num padding tokens: 443 - rank: 7 max len: 140 min len: 86 avg len: 112.3125 num_loss_counted_tokens: 526 | |
total tokens: 2208 num samples: 2 num padding tokens: 253 - rank: 0 max len: 1104 min len: 851 avg len: 977.5 num_loss_counted_tokens: 152 | |
Per-token loss scaled by world size: 0.0020663461182266474
Per-token loss scaled by world size: 0.0014136169338598847
Per-token loss scaled by world size: 0.001264766207896173
Per-token loss scaled by world size: 0.0011986184399574995
Per-token loss scaled by world size: 0.001326727564446628
Per-token loss scaled by world size: 0.0004439983458723873
Per-token loss scaled by world size: 0.0006016112747602165
Epoch: 0, Step: 41, Rank: 2, loss = 1.010735034942627
Epoch: 0, Step: 41, Rank: 6, loss = 1.1920324563980103
Epoch: 0, Step: 41, Rank: 5, loss = 1.0665141344070435
Epoch: 0, Step: 41, Rank: 4, loss = 1.7424464225769043
Epoch: 0, Step: 41, Rank: 1, loss = 1.1187629699707031
Epoch: 0, Step: 41, Rank: 0, loss = 0.37440159916877747
Epoch: 0, Step: 41, Rank: 7, loss = 0.5073087215423584
Per-token loss scaled by world size: 0.0018865406746044755
Epoch: 0, Step: 41, Rank: 3, loss = 1.5908254384994507
[2024-06-27 16:41:54,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=41, skipped=0, lr=[2.12987012987013e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:54,748] [INFO] [timer.py:260:stop] epoch=0/micro_step=41/global_step=41, RunningAvgSamplesPerSec=95.67992312399365, CurrSamplesPerSec=95.41777536276359, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 95.32775394767184 samples/s, lr: 2.12987012987013e-06, loss: 0.37440159916877747 cuda_mem_allocated: 22.24138832092285 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6746.0 batch_size: 78.0 total loss: 1.07537841796875 | |
Epoch 0: 19% 41/213 [00:44<03:02, 1.06s/it]
total tokens: 2376 num samples: 9 num padding tokens: 169 - rank: 3 max len: 264 min len: 218 avg len: 245.22222222222223 num_loss_counted_tokens: 869
total tokens: 2496 num samples: 16 num padding tokens: 198 - rank: 6 max len: 156 min len: 129 avg len: 143.625 num_loss_counted_tokens: 848 | |
total tokens: 2387 num samples: 11 num padding tokens: 183 - rank: 4 max len: 217 min len: 187 avg len: 200.36363636363637 num_loss_counted_tokens: 725 | |
total tokens: 2418 num samples: 13 num padding tokens: 133 - rank: 5 max len: 186 min len: 161 avg len: 175.76923076923077 num_loss_counted_tokens: 786 | |
total tokens: 2254 num samples: 7 num padding tokens: 134 - rank: 2 max len: 322 min len: 266 avg len: 302.85714285714283 num_loss_counted_tokens: 1362 | |
total tokens: 2355 num samples: 5 num padding tokens: 385 - rank: 1 max len: 471 min len: 349 avg len: 394.0 num_loss_counted_tokens: 1283 | |
total tokens: 1880 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1880 min len: 1880 avg len: 1880.0 num_loss_counted_tokens: 42 | |
total tokens: 2375 num samples: 19 num padding tokens: 337 - rank: 7 max len: 125 min len: 87 avg len: 107.26315789473684 num_loss_counted_tokens: 574 | |
Per-token loss scaled by world size: 0.0016895646695047617
Per-token loss scaled by world size: 0.0022693488281220198
Per-token loss scaled by world size: 0.0007895386079326272
Per-token loss scaled by world size: 0.0010653851786628366
Per-token loss scaled by world size: 0.0013372855028137565
Per-token loss scaled by world size: 0.0008948277099989355
Per-token loss scaled by world size: 0.0014117223909124732
Epoch: 0, Step: 42, Rank: 0, loss = 1.0637871026992798
Epoch: 0, Step: 42, Rank: 2, loss = 1.68703031539917
Epoch: 0, Step: 42, Rank: 7, loss = 0.7883542776107788
Epoch: 0, Step: 42, Rank: 1, loss = 2.265944719314575
Epoch: 0, Step: 42, Rank: 4, loss = 1.3352795839309692
Epoch: 0, Step: 42, Rank: 3, loss = 1.409604787826538
Epoch: 0, Step: 42, Rank: 6, loss = 0.8934854865074158
Per-token loss scaled by world size: 0.000884220760781318
Epoch: 0, Step: 42, Rank: 5, loss = 0.8828944563865662
[2024-06-27 16:41:55,737] [INFO] [logging.py:96:log_dist] [Rank 0] step=42, skipped=0, lr=[2.181818181818182e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:55,811] [INFO] [timer.py:260:stop] epoch=0/micro_step=42/global_step=42, RunningAvgSamplesPerSec=95.6659256823357, CurrSamplesPerSec=95.1232018218908, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.03297251063492 samples/s, lr: 2.181818181818182e-06, loss: 1.0637871026992798 cuda_mem_allocated: 22.275497913360596 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7988.0 batch_size: 84.0 total loss: 1.2907975912094116 | |
Epoch 0: 20% 42/213 [00:45<03:01, 1.06s/it]
total tokens: 2410 num samples: 5 num padding tokens: 384 - rank: 1 max len: 482 min len: 371 avg len: 405.2 num_loss_counted_tokens: 806
total tokens: 2448 num samples: 16 num padding tokens: 244 - rank: 6 max len: 153 min len: 122 avg len: 137.75 num_loss_counted_tokens: 655 | |
total tokens: 2522 num samples: 13 num padding tokens: 263 - rank: 5 max len: 194 min len: 156 avg len: 173.76923076923077 num_loss_counted_tokens: 880 | |
total tokens: 2460 num samples: 10 num padding tokens: 307 - rank: 4 max len: 246 min len: 195 avg len: 215.3 num_loss_counted_tokens: 910 | |
total tokens: 2366 num samples: 7 num padding tokens: 182 - rank: 3 max len: 338 min len: 274 avg len: 312.0 num_loss_counted_tokens: 1331 | |
total tokens: 2196 num samples: 6 num padding tokens: 83 - rank: 2 max len: 366 min len: 341 avg len: 352.1666666666667 num_loss_counted_tokens: 1194 | |
total tokens: 1862 num samples: 2 num padding tokens: 385 - rank: 0 max len: 931 min len: 546 avg len: 738.5 num_loss_counted_tokens: 184 | |
total tokens: 2160 num samples: 18 num padding tokens: 318 - rank: 7 max len: 120 min len: 89 avg len: 102.33333333333333 num_loss_counted_tokens: 421 | |
Per-token loss scaled by world size: 0.0012832334032282233
Per-token loss scaled by world size: 0.0014733066782355309
Per-token loss scaled by world size: 0.0017722281627357006
Per-token loss scaled by world size: 0.0008605036418884993
Per-token loss scaled by world size: 0.0012608567485585809
Per-token loss scaled by world size: 0.0025709623005241156
Per-token loss scaled by world size: 0.0003666903648991138
Epoch: 0, Step: 43, Rank: 5, loss = 1.1226688623428345
Epoch: 0, Step: 43, Rank: 3, loss = 1.2889591455459595
Epoch: 0, Step: 43, Rank: 4, loss = 0.7528331279754639
Epoch: 0, Step: 43, Rank: 6, loss = 1.103092074394226
Epoch: 0, Step: 43, Rank: 1, loss = 1.5504781007766724
Epoch: 0, Step: 43, Rank: 7, loss = 0.3208082318305969
Epoch: 0, Step: 43, Rank: 0, loss = 2.2492706775665283
Per-token loss scaled by world size: 0.001783867715857923
Epoch: 0, Step: 43, Rank: 2, loss = 1.5606613159179688
[2024-06-27 16:41:56,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=43, skipped=0, lr=[2.233766233766234e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:56,871] [INFO] [timer.py:260:stop] epoch=0/micro_step=43/global_step=43, RunningAvgSamplesPerSec=95.65676997727431, CurrSamplesPerSec=95.29197333881123, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.19937544596648 samples/s, lr: 2.233766233766234e-06, loss: 2.2492706775665283 cuda_mem_allocated: 22.267388343811035 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6999.0 batch_size: 82.0 total loss: 1.2435964345932007 | |
Epoch 0: 20% 43/213 [00:46<03:00, 1.06s/it]
total tokens: 2360 num samples: 20 num padding tokens: 273 - rank: 7 max len: 118 min len: 84 avg len: 104.35 num_loss_counted_tokens: 527
total tokens: 2420 num samples: 10 num padding tokens: 345 - rank: 4 max len: 242 min len: 185 avg len: 207.5 num_loss_counted_tokens: 879 | |
total tokens: 2336 num samples: 8 num padding tokens: 203 - rank: 3 max len: 292 min len: 243 avg len: 266.625 num_loss_counted_tokens: 1089 | |
total tokens: 2392 num samples: 13 num padding tokens: 234 - rank: 5 max len: 184 min len: 155 avg len: 166.0 num_loss_counted_tokens: 931 | |
total tokens: 2448 num samples: 16 num padding tokens: 268 - rank: 6 max len: 153 min len: 121 avg len: 136.25 num_loss_counted_tokens: 877 | |
total tokens: 2492 num samples: 7 num padding tokens: 163 - rank: 2 max len: 356 min len: 306 avg len: 332.7142857142857 num_loss_counted_tokens: 1319 | |
total tokens: 2500 num samples: 5 num padding tokens: 366 - rank: 1 max len: 500 min len: 374 avg len: 426.8 num_loss_counted_tokens: 985 | |
total tokens: 2388 num samples: 3 num padding tokens: 571 - rank: 0 max len: 796 min len: 510 avg len: 605.6666666666666 num_loss_counted_tokens: 1083 | |
Per-token loss scaled by world size: 0.0029433947056531906 | |
Per-token loss scaled by world size: 0.0010018106549978256
Per-token loss scaled by world size: 0.002652614377439022
Per-token loss scaled by world size: 0.0030629446264356375
Per-token loss scaled by world size: 0.001765502616763115
Per-token loss scaled by world size: 0.00018463570449966937
Per-token loss scaled by world size: 5.102176146465354e-05
Epoch: 0, Step: 44, Rank: 4, loss = 2.0508103370666504
Epoch: 0, Step: 44, Rank: 3, loss = 2.1341066360473633
Epoch: 0, Step: 44, Rank: 5, loss = 1.2301139831542969
Epoch: 0, Step: 44, Rank: 2, loss = 1.848209023475647
Epoch: 0, Step: 44, Rank: 1, loss = 0.1286449283361435
Epoch: 0, Step: 44, Rank: 7, loss = 0.698011577129364
Epoch: 0, Step: 44, Rank: 0, loss = 0.03554941341280937 | |
Per-token loss scaled by world size: 0.0018017509719356894 | |
Epoch: 0, Step: 44, Rank: 6, loss = 1.255370020866394 | |
[2024-06-27 16:41:57,855] [INFO] [logging.py:96:log_dist] [Rank 0] step=44, skipped=0, lr=[2.285714285714286e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:57,928] [INFO] [timer.py:260:stop] epoch=0/micro_step=44/global_step=44, RunningAvgSamplesPerSec=95.6550720074, CurrSamplesPerSec=95.58550710600755, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.4858616116909 samples/s, lr: 2.285714285714286e-06, loss: 0.03554941341280937 cuda_mem_allocated: 22.30388116836548 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5574.0 batch_size: 72.0 total loss: 1.1726019382476807 | |
Epoch 0: 21% 44/213 [00:47<02:59, 1.06s/it]
total tokens: 2530 num samples: 11 num padding tokens: 208 - rank: 4 max len: 230 min len: 196 avg len: 211.0909090909091 num_loss_counted_tokens: 663
total tokens: 2534 num samples: 7 num padding tokens: 144 - rank: 2 max len: 362 min len: 324 avg len: 341.42857142857144 num_loss_counted_tokens: 1483 | |
total tokens: 2305 num samples: 5 num padding tokens: 216 - rank: 1 max len: 461 min len: 379 avg len: 417.8 num_loss_counted_tokens: 1235 | |
total tokens: 2528 num samples: 8 num padding tokens: 462 - rank: 3 max len: 316 min len: 237 avg len: 258.25 num_loss_counted_tokens: 834 | |
total tokens: 2522 num samples: 13 num padding tokens: 215 - rank: 5 max len: 194 min len: 159 avg len: 177.46153846153845 num_loss_counted_tokens: 971 | |
total tokens: 2400 num samples: 16 num padding tokens: 303 - rank: 6 max len: 150 min len: 112 avg len: 131.0625 num_loss_counted_tokens: 792 | |
total tokens: 1744 num samples: 16 num padding tokens: 202 - rank: 7 max len: 109 min len: 78 avg len: 96.375 num_loss_counted_tokens: 403 | |
total tokens: 2364 num samples: 3 num padding tokens: 429 - rank: 0 max len: 788 min len: 480 avg len: 645.0 num_loss_counted_tokens: 591 | |
Per-token loss scaled by world size: 0.0011890575988218188
Per-token loss scaled by world size: 0.0011890768073499203
Per-token loss scaled by world size: 0.0010824004421010613
Per-token loss scaled by world size: 0.0019755843095481396
Per-token loss scaled by world size: 0.0018149197567254305
Per-token loss scaled by world size: 0.0006511726533062756
Per-token loss scaled by world size: 0.0012727526482194662
Epoch: 0, Step: 45, Rank: 3, loss = 1.0844380855560303
Epoch: 0, Step: 45, Rank: 5, loss = 1.0844205617904663
Epoch: 0, Step: 45, Rank: 0, loss = 0.9871492385864258
Epoch: 0, Step: 45, Rank: 2, loss = 1.8017328977584839
Epoch: 0, Step: 45, Rank: 7, loss = 0.5938694477081299
Epoch: 0, Step: 45, Rank: 1, loss = 1.6552067995071411
Epoch: 0, Step: 45, Rank: 4, loss = 1.160750389099121
Per-token loss scaled by world size: 0.001169984694570303
Epoch: 0, Step: 45, Rank: 6, loss = 1.0670260190963745 | |
[2024-06-27 16:41:58,917] [INFO] [logging.py:96:log_dist] [Rank 0] step=45, skipped=0, lr=[2.337662337662338e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:41:58,990] [INFO] [timer.py:260:stop] epoch=0/micro_step=45/global_step=45, RunningAvgSamplesPerSec=95.6536079043272, CurrSamplesPerSec=95.59215602111865, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.50273413401776 samples/s, lr: 2.337662337662338e-06, loss: 0.9871492385864258 cuda_mem_allocated: 22.2478289604187 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7296.0 batch_size: 90.0 total loss: 1.1793242692947388 | |
Epoch 0: 21% 45/213 [00:48<02:58, 1.06s/it]
total tokens: 2475 num samples: 9 num padding tokens: 74 - rank: 3 max len: 275 min len: 250 avg len: 266.77777777777777 num_loss_counted_tokens: 842
total tokens: 2445 num samples: 5 num padding tokens: 314 - rank: 1 max len: 489 min len: 387 avg len: 426.2 num_loss_counted_tokens: 1142 | |
total tokens: 2475 num samples: 15 num padding tokens: 195 - rank: 6 max len: 165 min len: 133 avg len: 152.0 num_loss_counted_tokens: 842 | |
total tokens: 2492 num samples: 7 num padding tokens: 342 - rank: 2 max len: 356 min len: 288 avg len: 307.14285714285717 num_loss_counted_tokens: 827 | |
total tokens: 2490 num samples: 10 num padding tokens: 300 - rank: 4 max len: 249 min len: 205 avg len: 219.0 num_loss_counted_tokens: 890 | |
total tokens: 1584 num samples: 12 num padding tokens: 295 - rank: 7 max len: 132 min len: 84 avg len: 107.41666666666667 num_loss_counted_tokens: 352 | |
total tokens: 2436 num samples: 12 num padding tokens: 228 - rank: 5 max len: 203 min len: 166 avg len: 184.0 num_loss_counted_tokens: 776 | |
total tokens: 2055 num samples: 3 num padding tokens: 308 - rank: 0 max len: 685 min len: 529 avg len: 582.3333333333334 num_loss_counted_tokens: 1372 | |
Per-token loss scaled by world size: 0.0009048162610270083
Per-token loss scaled by world size: 0.0003582813369575888
Per-token loss scaled by world size: 0.0019048351095989347
Per-token loss scaled by world size: 0.0017809424316510558
Per-token loss scaled by world size: 0.0013538796920329332
Per-token loss scaled by world size: 0.0016357492422685027
Per-token loss scaled by world size: 0.0009523280314169824
Epoch: 0, Step: 46, Rank: 7, loss = 0.7741833925247192
Epoch: 0, Step: 46, Rank: 2, loss = 0.3065544664859772
Epoch: 0, Step: 46, Rank: 1, loss = 1.523818850517273
Epoch: 0, Step: 46, Rank: 4, loss = 1.6298245191574097
Epoch: 0, Step: 46, Rank: 0, loss = 1.3995879888534546
Epoch: 0, Step: 46, Rank: 6, loss = 1.158413290977478
Epoch: 0, Step: 46, Rank: 3, loss = 0.8148356676101685
Per-token loss scaled by world size: 0.0010868724202737212 | |
Epoch: 0, Step: 46, Rank: 5, loss = 0.929955244064331 | |
[2024-06-27 16:41:59,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=46, skipped=0, lr=[2.3896103896103896e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:00,048] [INFO] [timer.py:260:stop] epoch=0/micro_step=46/global_step=46, RunningAvgSamplesPerSec=95.65518766878829, CurrSamplesPerSec=95.72316693902721, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 95.6221443603923 samples/s, lr: 2.3896103896103896e-06, loss: 1.3995879888534546 cuda_mem_allocated: 22.224692344665527 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6845.0 batch_size: 73.0 total loss: 1.0671466588974 | |
Epoch 0: 22% 46/213 [00:49<02:56, 1.06s/it]
total tokens: 2464 num samples: 7 num padding tokens: 161 - rank: 2 max len: 352 min len: 293 avg len: 329.0 num_loss_counted_tokens: 807
total tokens: 2519 num samples: 11 num padding tokens: 330 - rank: 4 max len: 229 min len: 179 avg len: 199.0 num_loss_counted_tokens: 1026 | |
total tokens: 2520 num samples: 3 num padding tokens: 536 - rank: 0 max len: 840 min len: 551 avg len: 661.3333333333334 num_loss_counted_tokens: 947 | |
total tokens: 2076 num samples: 4 num padding tokens: 339 - rank: 1 max len: 519 min len: 365 avg len: 434.25 num_loss_counted_tokens: 1213 | |
total tokens: 2436 num samples: 14 num padding tokens: 156 - rank: 5 max len: 174 min len: 147 avg len: 162.85714285714286 num_loss_counted_tokens: 954 | |
total tokens: 2499 num samples: 17 num padding tokens: 211 - rank: 6 max len: 147 min len: 126 avg len: 134.58823529411765 num_loss_counted_tokens: 789 | |
total tokens: 2448 num samples: 9 num padding tokens: 229 - rank: 3 max len: 272 min len: 232 avg len: 246.55555555555554 num_loss_counted_tokens: 824 | |
total tokens: 2440 num samples: 20 num padding tokens: 264 - rank: 7 max len: 122 min len: 85 avg len: 108.8 num_loss_counted_tokens: 582 | |
Per-token loss scaled by world size: 0.0008768007974140346
Per-token loss scaled by world size: 0.0009633160661906004
Per-token loss scaled by world size: 0.0014150552451610565
Per-token loss scaled by world size: 0.001330382889136672
Per-token loss scaled by world size: 0.001794004230760038
Per-token loss scaled by world size: 0.0012995372526347637
Per-token loss scaled by world size: 0.001373719540424645
Epoch: 0, Step: 47, Rank: 5, loss = 1.221889853477478
Epoch: 0, Step: 47, Rank: 0, loss = 1.291639804840088
Epoch: 0, Step: 47, Rank: 6, loss = 0.824411928653717
Epoch: 0, Step: 47, Rank: 4, loss = 0.9057579040527344
Epoch: 0, Step: 47, Rank: 3, loss = 1.3305057287216187
Epoch: 0, Step: 47, Rank: 1, loss = 1.2508925199508667
Epoch: 0, Step: 47, Rank: 2, loss = 1.6868125200271606
Per-token loss scaled by world size: 0.0005604384932667017 | |
Epoch: 0, Step: 47, Rank: 7, loss = 0.5269522666931152 | |
[2024-06-27 16:42:01,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=47, skipped=0, lr=[2.4415584415584416e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:01,110] [INFO] [timer.py:260:stop] epoch=0/micro_step=47/global_step=47, RunningAvgSamplesPerSec=95.61542568032957, CurrSamplesPerSec=93.89803637706858, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 93.79973489816409 samples/s, lr: 2.4415584415584416e-06, loss: 1.291639804840088 cuda_mem_allocated: 22.25593852996826 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7522.0 batch_size: 82.0 total loss: 1.1298578977584839 | |
Epoch 0: 22% 47/213 [00:50<02:56, 1.06s/it]
total tokens: 2343 num samples: 11 num padding tokens: 71 - rank: 4 max len: 213 min len: 200 avg len: 206.54545454545453 num_loss_counted_tokens: 841
total tokens: 2520 num samples: 10 num padding tokens: 155 - rank: 3 max len: 252 min len: 213 avg len: 236.5 num_loss_counted_tokens: 751 | |
total tokens: 2352 num samples: 8 num padding tokens: 135 - rank: 2 max len: 294 min len: 260 avg len: 277.125 num_loss_counted_tokens: 994 | |
total tokens: 2412 num samples: 6 num padding tokens: 209 - rank: 1 max len: 402 min len: 300 avg len: 367.1666666666667 num_loss_counted_tokens: 1140 | |
total tokens: 2388 num samples: 12 num padding tokens: 174 - rank: 5 max len: 199 min len: 168 avg len: 184.5 num_loss_counted_tokens: 900 | |
total tokens: 2460 num samples: 15 num padding tokens: 268 - rank: 6 max len: 164 min len: 128 avg len: 146.13333333333333 num_loss_counted_tokens: 784 | |
total tokens: 2444 num samples: 4 num padding tokens: 384 - rank: 0 max len: 611 min len: 457 avg len: 515.0 num_loss_counted_tokens: 1313 | |
total tokens: 2286 num samples: 18 num padding tokens: 294 - rank: 7 max len: 127 min len: 85 avg len: 110.66666666666667 num_loss_counted_tokens: 582 | |
Per-token loss scaled by world size: 0.0016289016930386424
Per-token loss scaled by world size: 0.0006295717321336269
Per-token loss scaled by world size: 0.0008202112512663007
Per-token loss scaled by world size: 0.0009915928822010756
Per-token loss scaled by world size: 0.0016317907720804214
Per-token loss scaled by world size: 0.0015985325444489717
Per-token loss scaled by world size: 0.0002903067215811461
Epoch: 0, Step: 48, Rank: 6, loss = 0.6292569637298584
Epoch: 0, Step: 48, Rank: 3, loss = 1.6280872821807861
Epoch: 0, Step: 48, Rank: 5, loss = 0.8198011517524719
Epoch: 0, Step: 48, Rank: 4, loss = 0.9910970330238342
Epoch: 0, Step: 48, Rank: 1, loss = 1.5977332592010498
Epoch: 0, Step: 48, Rank: 2, loss = 1.6309748888015747
Epoch: 0, Step: 48, Rank: 7, loss = 0.2901615798473358
Per-token loss scaled by world size: 0.0015108458464965224 | |
Epoch: 0, Step: 48, Rank: 0, loss = 1.5100904703140259 | |
[2024-06-27 16:42:02,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=48, skipped=0, lr=[2.4935064935064936e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:02,174] [INFO] [timer.py:260:stop] epoch=0/micro_step=48/global_step=48, RunningAvgSamplesPerSec=95.60537421432693, CurrSamplesPerSec=95.15523520987631, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.05349983675903 samples/s, lr: 2.4935064935064936e-06, loss: 1.5100904703140259 cuda_mem_allocated: 22.307459831237793 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7996.0 batch_size: 83.0 total loss: 1.1371504068374634 | |
Epoch 0: 23% 48/213 [00:51<02:55, 1.06s/it]
total tokens: 2431 num samples: 13 num padding tokens: 139 - rank: 5 max len: 187 min len: 165 avg len: 176.30769230769232 num_loss_counted_tokens: 843
total tokens: 2527 num samples: 19 num padding tokens: 491 - rank: 7 max len: 133 min len: 76 avg len: 107.15789473684211 num_loss_counted_tokens: 614 | |
total tokens: 2429 num samples: 7 num padding tokens: 224 - rank: 1 max len: 347 min len: 294 avg len: 315.0 num_loss_counted_tokens: 1113 | |
total tokens: 2460 num samples: 15 num padding tokens: 217 - rank: 6 max len: 164 min len: 133 avg len: 149.53333333333333 num_loss_counted_tokens: 858 | |
total tokens: 2409 num samples: 11 num padding tokens: 186 - rank: 4 max len: 219 min len: 189 avg len: 202.0909090909091 num_loss_counted_tokens: 993 | |
total tokens: 2340 num samples: 9 num padding tokens: 160 - rank: 3 max len: 260 min len: 222 avg len: 242.22222222222223 num_loss_counted_tokens: 968 | |
total tokens: 2272 num samples: 8 num padding tokens: 44 - rank: 2 max len: 284 min len: 274 avg len: 278.5 num_loss_counted_tokens: 1146 | |
total tokens: 2245 num samples: 5 num padding tokens: 308 - rank: 0 max len: 449 min len: 348 avg len: 387.4 num_loss_counted_tokens: 1009 | |
Per-token loss scaled by world size: 0.0012428463669493794
Per-token loss scaled by world size: 0.000941501755733043
Per-token loss scaled by world size: 0.0021380565594881773
Per-token loss scaled by world size: 0.00040824158350005746
Per-token loss scaled by world size: 0.0023744963109493256
Per-token loss scaled by world size: 0.0010642549023032188
Per-token loss scaled by world size: 0.0017014429904520512
Epoch: 0, Step: 49, Rank: 7, loss = 0.843421995639801
Epoch: 0, Step: 49, Rank: 2, loss = 0.9849557280540466
Epoch: 0, Step: 49, Rank: 1, loss = 0.7461401224136353
Epoch: 0, Step: 49, Rank: 5, loss = 1.6944098472595215
Epoch: 0, Step: 49, Rank: 4, loss = 0.32353144884109497
Epoch: 0, Step: 49, Rank: 3, loss = 1.8817883729934692
Epoch: 0, Step: 49, Rank: 6, loss = 1.3483935594558716
Per-token loss scaled by world size: 0.0019839173182845116
Epoch: 0, Step: 49, Rank: 0, loss = 1.5722544193267822 | |
[2024-06-27 16:42:03,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=49, skipped=0, lr=[2.5454545454545456e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:03,225] [INFO] [timer.py:260:stop] epoch=0/micro_step=49/global_step=49, RunningAvgSamplesPerSec=95.62340396637218, CurrSamplesPerSec=96.46018799825791, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 96.3483588365929 samples/s, lr: 2.5454545454545456e-06, loss: 1.5722544193267822 cuda_mem_allocated: 22.290405750274658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6340.0 batch_size: 74.0 total loss: 1.1743619441986084 | |
Epoch 0: 23% 49/213 [00:52<02:53, 1.06s/it]
total tokens: 2410 num samples: 10 num padding tokens: 142 - rank: 4 max len: 241 min len: 213 avg len: 226.8 num_loss_counted_tokens: 1173
total tokens: 2529 num samples: 9 num padding tokens: 154 - rank: 3 max len: 281 min len: 243 avg len: 263.8888888888889 num_loss_counted_tokens: 1264 | |
total tokens: 2532 num samples: 12 num padding tokens: 319 - rank: 5 max len: 211 min len: 163 avg len: 184.41666666666666 num_loss_counted_tokens: 1060 | |
total tokens: 2445 num samples: 15 num padding tokens: 259 - rank: 6 max len: 163 min len: 125 avg len: 145.73333333333332 num_loss_counted_tokens: 759 | |
total tokens: 2387 num samples: 7 num padding tokens: 204 - rank: 2 max len: 341 min len: 293 avg len: 311.85714285714283 num_loss_counted_tokens: 933 | |
total tokens: 2135 num samples: 5 num padding tokens: 183 - rank: 1 max len: 427 min len: 344 avg len: 390.4 num_loss_counted_tokens: 1374 | |
total tokens: 2232 num samples: 18 num padding tokens: 297 - rank: 7 max len: 124 min len: 81 avg len: 107.5 num_loss_counted_tokens: 507 | |
total tokens: 2256 num samples: 3 num padding tokens: 598 - rank: 0 max len: 752 min len: 443 avg len: 552.6666666666666 num_loss_counted_tokens: 702 | |
Per-token loss scaled by world size: 0.0013103155652061105Per-token loss scaled by world size: 0.0011389543069526553Per-token loss scaled by world size: 0.0009859923738986254Per-token loss scaled by world size: 0.0015023931628093123Per-token loss scaled by world size: 0.0010871132835745811Per-token loss scaled by world size: 0.000928265624679625Per-token loss scaled by world size: 0.0014505306025967002 | |
Epoch: 0, Step: 50, Rank: 6, loss = 1.0945351123809814
Epoch: 0, Step: 50, Rank: 3, loss = 1.2592132091522217
Epoch: 0, Step: 50, Rank: 4, loss = 0.9475386738777161
Epoch: 0, Step: 50, Rank: 1, loss = 1.4437998533248901 | |
Epoch: 0, Step: 50, Rank: 5, loss = 1.0447158813476562
Epoch: 0, Step: 50, Rank: 2, loss = 1.393959879875183
Epoch: 0, Step: 50, Rank: 0, loss = 0.8920632600784302
Per-token loss scaled by world size: 0.0004599055682774633 | |
Epoch: 0, Step: 50, Rank: 7, loss = 0.44196924567222595 | |
[2024-06-27 16:42:04,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[2.597402597402597e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:04,284] [INFO] [timer.py:260:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=95.62010348986574, CurrSamplesPerSec=95.46523767491054, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.37546559274087 samples/s, lr: 2.597402597402597e-06, loss: 0.8920632600784302 cuda_mem_allocated: 22.254269123077393 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7688.0 batch_size: 87.0 total loss: 1.064724326133728 | |
Epoch 0: 23% 50/213 [00:53<02:52, 1.06s/it]
total tokens: 2409 num samples: 11 num padding tokens: 283 - rank: 4 max len: 219 min len: 184 avg len: 193.27272727272728 num_loss_counted_tokens: 725
total tokens: 2500 num samples: 10 num padding tokens: 121 - rank: 3 max len: 250 min len: 220 avg len: 237.9 num_loss_counted_tokens: 829 | |
total tokens: 2400 num samples: 16 num padding tokens: 235 - rank: 6 max len: 150 min len: 116 avg len: 135.3125 num_loss_counted_tokens: 801 | |
total tokens: 2226 num samples: 7 num padding tokens: 288 - rank: 2 max len: 318 min len: 250 avg len: 276.85714285714283 num_loss_counted_tokens: 764 | |
total tokens: 2534 num samples: 14 num padding tokens: 203 - rank: 5 max len: 181 min len: 150 avg len: 166.5 num_loss_counted_tokens: 980 | |
total tokens: 2238 num samples: 6 num padding tokens: 114 - rank: 1 max len: 373 min len: 327 avg len: 354.0 num_loss_counted_tokens: 1115 | |
total tokens: 2325 num samples: 3 num padding tokens: 702 - rank: 0 max len: 775 min len: 412 avg len: 541.0 num_loss_counted_tokens: 483 | |
total tokens: 2415 num samples: 21 num padding tokens: 285 - rank: 7 max len: 115 min len: 91 avg len: 101.42857142857143 num_loss_counted_tokens: 506 | |
Per-token loss scaled by world size: 0.0010477951727807522
Per-token loss scaled by world size: 0.0024228140246123075
Per-token loss scaled by world size: 0.0011054554488509893
Per-token loss scaled by world size: 0.001343426643870771
Per-token loss scaled by world size: 0.0017975677037611604
Per-token loss scaled by world size: 0.0014382523950189352
Per-token loss scaled by world size: 0.0010343580506742
Epoch: 0, Step: 51, Rank: 4, loss = 0.8737302422523499
Epoch: 0, Step: 51, Rank: 2, loss = 2.0203239917755127
Epoch: 0, Step: 51, Rank: 1, loss = 0.9218116402626038
Epoch: 0, Step: 51, Rank: 6, loss = 1.1993227005004883
Epoch: 0, Step: 51, Rank: 0, loss = 1.12024986743927
Epoch: 0, Step: 51, Rank: 5, loss = 1.4989467859268188 | |
Epoch: 0, Step: 51, Rank: 3, loss = 0.8625253438949585 | |
Per-token loss scaled by world size: 0.0011430381564423442 | |
Epoch: 0, Step: 51, Rank: 7, loss = 0.9531509876251221 | |
[2024-06-27 16:42:05,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=51, skipped=0, lr=[2.649350649350649e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:05,347] [INFO] [timer.py:260:stop] epoch=0/micro_step=51/global_step=51, RunningAvgSamplesPerSec=95.60684554303046, CurrSamplesPerSec=94.97475906227525, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.88340238138608 samples/s, lr: 2.649350649350649e-06, loss: 1.12024986743927 cuda_mem_allocated: 22.27788257598877 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6671.0 batch_size: 88.0 total loss: 1.181257724761963 | |
Epoch 0: 24% 51/213 [00:54<02:51, 1.06s/it]
total tokens: 2472 num samples: 6 num padding tokens: 232 - rank: 1 max len: 412 min len: 342 avg len: 373.3333333333333 num_loss_counted_tokens: 1330
total tokens: 2529 num samples: 9 num padding tokens: 159 - rank: 3 max len: 281 min len: 237 avg len: 263.3333333333333 num_loss_counted_tokens: 1231 | |
total tokens: 2320 num samples: 10 num padding tokens: 161 - rank: 4 max len: 232 min len: 204 avg len: 215.9 num_loss_counted_tokens: 872 | |
total tokens: 2400 num samples: 12 num padding tokens: 288 - rank: 5 max len: 200 min len: 164 avg len: 176.0 num_loss_counted_tokens: 874 | |
total tokens: 2345 num samples: 7 num padding tokens: 209 - rank: 2 max len: 335 min len: 290 avg len: 305.14285714285717 num_loss_counted_tokens: 858 | |
total tokens: 2415 num samples: 15 num padding tokens: 249 - rank: 6 max len: 161 min len: 125 avg len: 144.4 num_loss_counted_tokens: 684 | |
total tokens: 2480 num samples: 20 num padding tokens: 349 - rank: 7 max len: 124 min len: 82 avg len: 106.55 num_loss_counted_tokens: 599 | |
total tokens: 2244 num samples: 4 num padding tokens: 104 - rank: 0 max len: 561 min len: 501 avg len: 535.0 num_loss_counted_tokens: 1269 | |
Per-token loss scaled by world size: 0.0014943027636036277
Per-token loss scaled by world size: 0.0017698021838441491
Per-token loss scaled by world size: 0.0012235771864652634
Per-token loss scaled by world size: 0.0015416607493534684
Per-token loss scaled by world size: 0.000856109953019768
Per-token loss scaled by world size: 0.000686082465108484
Epoch: 0, Step: 52, Rank: 0, loss = 1.5234416723251343
Per-token loss scaled by world size: 0.0009099164744839072
Epoch: 0, Step: 52, Rank: 5, loss = 1.2474370002746582
Epoch: 0, Step: 52, Rank: 1, loss = 1.8043133020401
Epoch: 0, Step: 52, Rank: 4, loss = 0.8728041052818298
Epoch: 0, Step: 52, Rank: 3, loss = 1.5717231035232544
Epoch: 0, Step: 52, Rank: 7, loss = 0.6994611024856567
Per-token loss scaled by world size: 0.0009161165216937661
Epoch: 0, Step: 52, Rank: 2, loss = 0.9276598691940308 | |
Epoch: 0, Step: 52, Rank: 6, loss = 0.9339808225631714 | |
[2024-06-27 16:42:06,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=52, skipped=0, lr=[2.7012987012987012e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:06,411] [INFO] [timer.py:260:stop] epoch=0/micro_step=52/global_step=52, RunningAvgSamplesPerSec=95.59802508126657, CurrSamplesPerSec=95.16780718230595, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.07589931786268 samples/s, lr: 2.7012987012987012e-06, loss: 1.5234416723251343 cuda_mem_allocated: 22.278955936431885 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8156.0 batch_size: 80.0 total loss: 1.1976025104522705 | |
Saving model in huggingface format at samples_seen: 4992 | |
Model saved in /instructlab/training_output/hf_format/samples_4992 | |
[16:42:24] INFO saving took 18.348140239715576 seconds utils.py:192 | |
Epoch 0: 24% 52/213 [01:14<17:37, 6.57s/it]
total tokens: 2334 num samples: 6 num padding tokens: 77 - rank: 2 max len: 389 min len: 355 avg len: 376.1666666666667 num_loss_counted_tokens: 1411
total tokens: 2464 num samples: 7 num padding tokens: 162 - rank: 3 max len: 352 min len: 291 avg len: 328.85714285714283 num_loss_counted_tokens: 1075 | |
total tokens: 2270 num samples: 5 num padding tokens: 111 - rank: 1 max len: 454 min len: 411 avg len: 431.8 num_loss_counted_tokens: 994 | |
total tokens: 2394 num samples: 14 num padding tokens: 232 - rank: 6 max len: 171 min len: 140 avg len: 154.42857142857142 num_loss_counted_tokens: 990 | |
total tokens: 2493 num samples: 9 num padding tokens: 299 - rank: 4 max len: 277 min len: 226 avg len: 243.77777777777777 num_loss_counted_tokens: 705 | |
total tokens: 2502 num samples: 18 num padding tokens: 487 - rank: 7 max len: 139 min len: 79 avg len: 111.94444444444444 num_loss_counted_tokens: 609 | |
total tokens: 2088 num samples: 3 num padding tokens: 367 - rank: 0 max len: 696 min len: 488 avg len: 573.6666666666666 num_loss_counted_tokens: 937 | |
total tokens: 2365 num samples: 11 num padding tokens: 168 - rank: 5 max len: 215 min len: 174 avg len: 199.72727272727272 num_loss_counted_tokens: 925 | |
Per-token loss scaled by world size: 0.0009469238575547934
Per-token loss scaled by world size: 0.0005469601019285619
Per-token loss scaled by world size: 0.0013078921474516392
Per-token loss scaled by world size: 0.0017431610031053424
Per-token loss scaled by world size: 0.0016096167964860797
Per-token loss scaled by world size: 0.0010699580889195204 | |
Per-token loss scaled by world size: 0.0009220929350703955 | |
Epoch: 0, Step: 53, Rank: 6, loss = 0.9702418446540833 | |
Epoch: 0, Step: 53, Rank: 1, loss = 1.7860863208770752 | |
Epoch: 0, Step: 53, Rank: 4, loss = 1.3400989770889282
Epoch: 0, Step: 53, Rank: 7, loss = 0.5604289770126343
Epoch: 0, Step: 53, Rank: 3, loss = 1.6492536067962646
Epoch: 0, Step: 53, Rank: 5, loss = 0.9447994828224182
Epoch: 0, Step: 53, Rank: 2, loss = 1.0963058471679688
Per-token loss scaled by world size: 0.0013132268795743585 | |
Epoch: 0, Step: 53, Rank: 0, loss = 1.3455650806427002 | |
[2024-06-27 16:42:25,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=53, skipped=0, lr=[2.7532467532467532e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:25,847] [INFO] [timer.py:260:stop] epoch=0/micro_step=53/global_step=53, RunningAvgSamplesPerSec=95.5470489427253, CurrSamplesPerSec=93.06575662566783, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 92.9692535458029 samples/s, lr: 2.7532467532467532e-06, loss: 1.3455650806427002 cuda_mem_allocated: 22.293028831481934 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8197.0 batch_size: 84.0 total loss: 1.2115974426269531 | |
Epoch 0: 25% 53/213 [01:15<13:07, 4.92s/it]
total tokens: 2232 num samples: 6 num padding tokens: 95 - rank: 2 max len: 372 min len: 340 avg len: 356.1666666666667 num_loss_counted_tokens: 1037
total tokens: 2315 num samples: 5 num padding tokens: 204 - rank: 1 max len: 463 min len: 375 avg len: 422.2 num_loss_counted_tokens: 1006 | |
total tokens: 2421 num samples: 9 num padding tokens: 297 - rank: 4 max len: 269 min len: 209 avg len: 236.0 num_loss_counted_tokens: 925 | |
total tokens: 2412 num samples: 12 num padding tokens: 115 - rank: 5 max len: 201 min len: 176 avg len: 191.41666666666666 num_loss_counted_tokens: 838 | |
total tokens: 2380 num samples: 14 num padding tokens: 376 - rank: 6 max len: 170 min len: 120 avg len: 143.14285714285714 num_loss_counted_tokens: 883 | |
total tokens: 2310 num samples: 7 num padding tokens: 303 - rank: 3 max len: 330 min len: 275 avg len: 286.7142857142857 num_loss_counted_tokens: 964 | |
total tokens: 2520 num samples: 21 num padding tokens: 210 - rank: 7 max len: 120 min len: 82 avg len: 110.0 num_loss_counted_tokens: 690 | |
total tokens: 2388 num samples: 4 num padding tokens: 297 - rank: 0 max len: 597 min len: 473 avg len: 522.75 num_loss_counted_tokens: 1384 | |
Per-token loss scaled by world size: 0.001039287424646318
Per-token loss scaled by world size: 0.000878677936270833
Per-token loss scaled by world size: 0.001104795839637518
Per-token loss scaled by world size: 0.0010850143153220415
Per-token loss scaled by world size: 0.001828947919420898
Per-token loss scaled by world size: 0.0006685541593469679
Per-token loss scaled by world size: 0.001402237918227911 | |
Epoch: 0, Step: 54, Rank: 4, loss = 1.066698670387268 | |
Epoch: 0, Step: 54, Rank: 5, loss = 1.1136316061019897 | |
Epoch: 0, Step: 54, Rank: 6, loss = 0.901853084564209
Epoch: 0, Step: 54, Rank: 7, loss = 0.6861872673034668
Epoch: 0, Step: 54, Rank: 1, loss = 1.8771864175796509
Epoch: 0, Step: 54, Rank: 3, loss = 1.1339348554611206
Epoch: 0, Step: 54, Rank: 2, loss = 1.439221978187561
Per-token loss scaled by world size: 0.0010170326568186283 | |
Epoch: 0, Step: 54, Rank: 0, loss = 1.0438568592071533 | |
[2024-06-27 16:42:26,834] [INFO] [logging.py:96:log_dist] [Rank 0] step=54, skipped=0, lr=[2.8051948051948052e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:26,907] [INFO] [timer.py:260:stop] epoch=0/micro_step=54/global_step=54, RunningAvgSamplesPerSec=95.54578824729565, CurrSamplesPerSec=95.48153686473505, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 95.39688700373695 samples/s, lr: 2.8051948051948052e-06, loss: 1.0438568592071533 cuda_mem_allocated: 22.316523551940918 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8211.0 batch_size: 78.0 total loss: 1.1578212976455688 | |
Epoch 0: 25% 54/213 [01:16<09:58, 3.76s/it]
total tokens: 2504 num samples: 8 num padding tokens: 211 - rank: 2 max len: 313 min len: 253 avg len: 286.625 num_loss_counted_tokens: 1052
total tokens: 2440 num samples: 10 num padding tokens: 171 - rank: 3 max len: 244 min len: 206 avg len: 226.9 num_loss_counted_tokens: 703 | |
total tokens: 2520 num samples: 14 num padding tokens: 182 - rank: 5 max len: 180 min len: 147 avg len: 167.0 num_loss_counted_tokens: 819 | |
total tokens: 2460 num samples: 12 num padding tokens: 115 - rank: 4 max len: 205 min len: 181 avg len: 195.41666666666666 num_loss_counted_tokens: 722 | |
total tokens: 2208 num samples: 6 num padding tokens: 133 - rank: 1 max len: 368 min len: 319 avg len: 345.8333333333333 num_loss_counted_tokens: 1199 | |
total tokens: 2482 num samples: 17 num padding tokens: 204 - rank: 6 max len: 146 min len: 120 avg len: 134.0 num_loss_counted_tokens: 853 | |
total tokens: 2156 num samples: 4 num padding tokens: 238 - rank: 0 max len: 539 min len: 388 avg len: 479.5 num_loss_counted_tokens: 760 | |
total tokens: 2520 num samples: 21 num padding tokens: 304 - rank: 7 max len: 120 min len: 72 avg len: 105.52380952380952 num_loss_counted_tokens: 578 | |
Per-token loss scaled by world size: 0.001082298462279141
Per-token loss scaled by world size: 0.0014010192826390266
Per-token loss scaled by world size: 0.00193775596562773
Per-token loss scaled by world size: 0.001309810089878738
Per-token loss scaled by world size: 0.0008593339007347822
Per-token loss scaled by world size: 0.0012067538918927312
Per-token loss scaled by world size: 5.4270854889182374e-05
Epoch: 0, Step: 55, Rank: 5, loss = 1.2334223985671997
Epoch: 0, Step: 55, Rank: 4, loss = 0.9528285264968872
Epoch: 0, Step: 55, Rank: 1, loss = 1.7059519290924072
Epoch: 0, Step: 55, Rank: 2, loss = 1.1531240940093994
Epoch: 0, Step: 55, Rank: 7, loss = 0.756536066532135
Epoch: 0, Step: 55, Rank: 3, loss = 1.0623959302902222 | |
Epoch: 0, Step: 55, Rank: 0, loss = 0.04777870327234268 | |
Per-token loss scaled by world size: 0.0013464824296534061 | |
Epoch: 0, Step: 55, Rank: 6, loss = 1.185409426689148 | |
[2024-06-27 16:42:27,895] [INFO] [logging.py:96:log_dist] [Rank 0] step=55, skipped=0, lr=[2.8571428571428573e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:27,968] [INFO] [timer.py:260:stop] epoch=0/micro_step=55/global_step=55, RunningAvgSamplesPerSec=95.54029556140715, CurrSamplesPerSec=95.25554353781472, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.17010153209297 samples/s, lr: 2.8571428571428573e-06, loss: 0.04777870327234268 cuda_mem_allocated: 22.290405750274658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7043.0 batch_size: 79.0 total loss: 1.0121809244155884 | |
Epoch 0: 26% 55/213 [01:17<07:46, 2.95s/it]
total tokens: 2486 num samples: 11 num padding tokens: 201 - rank: 5 max len: 226 min len: 195 avg len: 207.72727272727272 num_loss_counted_tokens: 728
total tokens: 2522 num samples: 13 num padding tokens: 345 - rank: 6 max len: 194 min len: 138 avg len: 167.46153846153845 num_loss_counted_tokens: 956 | |
total tokens: 2220 num samples: 5 num padding tokens: 421 - rank: 2 max len: 444 min len: 331 avg len: 359.8 num_loss_counted_tokens: 878 | |
total tokens: 2412 num samples: 9 num padding tokens: 144 - rank: 4 max len: 268 min len: 229 avg len: 252.0 num_loss_counted_tokens: 798 | |
total tokens: 2060 num samples: 4 num padding tokens: 133 - rank: 1 max len: 515 min len: 453 avg len: 481.75 num_loss_counted_tokens: 1494 | |
total tokens: 2282 num samples: 7 num padding tokens: 177 - rank: 3 max len: 326 min len: 289 avg len: 300.7142857142857 num_loss_counted_tokens: 1259 | |
total tokens: 1738 num samples: 2 num padding tokens: 351 - rank: 0 max len: 869 min len: 518 avg len: 693.5 num_loss_counted_tokens: 470 | |
total tokens: 2329 num samples: 17 num padding tokens: 491 - rank: 7 max len: 137 min len: 85 avg len: 108.11764705882354 num_loss_counted_tokens: 598 | |
Per-token loss scaled by world size: 0.0010528620332479477
Per-token loss scaled by world size: 0.0011164682218804955
Per-token loss scaled by world size: 0.0017543130088597536
Per-token loss scaled by world size: 0.0010380293242633343
Per-token loss scaled by world size: 0.0021178482566028833
Per-token loss scaled by world size: 0.001553378184325993
Per-token loss scaled by world size: 0.00011049106979044154 | |
Epoch: 0, Step: 56, Rank: 4, loss = 0.9130945801734924
Epoch: 0, Step: 56, Rank: 5, loss = 0.9682570695877075
Epoch: 0, Step: 56, Rank: 2, loss = 1.5214279890060425
Epoch: 0, Step: 56, Rank: 6, loss = 0.900230884552002
Epoch: 0, Step: 56, Rank: 1, loss = 1.836703896522522
Epoch: 0, Step: 56, Rank: 3, loss = 1.3471672534942627 | |
Epoch: 0, Step: 56, Rank: 7, loss = 0.09582337737083435
Per-token loss scaled by world size: 0.0011272500269114971
Epoch: 0, Step: 56, Rank: 0, loss = 0.9776076078414917 | |
[2024-06-27 16:42:28,949] [INFO] [logging.py:96:log_dist] [Rank 0] step=56, skipped=0, lr=[2.9090909090909093e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:29,023] [INFO] [timer.py:260:stop] epoch=0/micro_step=56/global_step=56, RunningAvgSamplesPerSec=95.54890673322127, CurrSamplesPerSec=96.00753080311638, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.91088754445059 samples/s, lr: 2.9090909090909093e-06, loss: 0.9776076078414917 cuda_mem_allocated: 22.303165912628174 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6938.0 batch_size: 79.0 total loss: 1.0700390338897705 | |
Epoch 0: 26% 56/213 [01:18<06:14, 2.38s/it]
total tokens: 2232 num samples: 18 num padding tokens: 286 - rank: 7 max len: 124 min len: 79 avg len: 108.11111111111111 num_loss_counted_tokens: 486
total tokens: 2400 num samples: 12 num padding tokens: 199 - rank: 5 max len: 200 min len: 162 avg len: 183.41666666666666 num_loss_counted_tokens: 855 | |
total tokens: 2512 num samples: 8 num padding tokens: 172 - rank: 2 max len: 314 min len: 276 avg len: 292.5 num_loss_counted_tokens: 907 | |
total tokens: 2385 num samples: 15 num padding tokens: 278 - rank: 6 max len: 159 min len: 126 avg len: 140.46666666666667 num_loss_counted_tokens: 881 | |
total tokens: 2475 num samples: 9 num padding tokens: 139 - rank: 3 max len: 275 min len: 234 avg len: 259.55555555555554 num_loss_counted_tokens: 700 | |
total tokens: 2310 num samples: 10 num padding tokens: 182 - rank: 4 max len: 231 min len: 201 avg len: 212.8 num_loss_counted_tokens: 1038 | |
total tokens: 2310 num samples: 6 num padding tokens: 219 - rank: 1 max len: 385 min len: 318 avg len: 348.5 num_loss_counted_tokens: 1234 | |
total tokens: 1935 num samples: 3 num padding tokens: 418 - rank: 0 max len: 645 min len: 399 avg len: 505.6666666666667 num_loss_counted_tokens: 607 | |
Per-token loss scaled by world size: 0.0011813545133918524
Per-token loss scaled by world size: 0.0009591743582859635
Per-token loss scaled by world size: 0.0010182595578953624
Per-token loss scaled by world size: 0.0018032032530754805
Per-token loss scaled by world size: 0.0012723897816613317
Per-token loss scaled by world size: 0.0014724277425557375
Per-token loss scaled by world size: 0.000225590803893283
Epoch: 0, Step: 57, Rank: 2, loss = 0.7370055913925171 | |
Epoch: 0, Step: 57, Rank: 1, loss = 0.9077232480049133 | |
Epoch: 0, Step: 57, Rank: 4, loss = 1.3855363130569458
Epoch: 0, Step: 57, Rank: 7, loss = 0.7824051976203918
Epoch: 0, Step: 57, Rank: 0, loss = 0.17333833873271942
Epoch: 0, Step: 57, Rank: 5, loss = 1.1313766241073608
Epoch: 0, Step: 57, Rank: 3, loss = 0.9776724576950073
Per-token loss scaled by world size: 0.0017048887675628066 | |
Epoch: 0, Step: 57, Rank: 6, loss = 1.309993863105774 | |
[2024-06-27 16:42:30,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=57, skipped=0, lr=[2.9610389610389613e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:30,078] [INFO] [timer.py:260:stop] epoch=0/micro_step=57/global_step=57, RunningAvgSamplesPerSec=95.55474805925606, CurrSamplesPerSec=95.87124378294246, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.77241429013496 samples/s, lr: 2.9610389610389613e-06, loss: 0.17333833873271942 cuda_mem_allocated: 22.27788257598877 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6147.0 batch_size: 75.0 total loss: 0.9256315231323242 | |
Epoch 0: 27% 57/213 [01:19<05:09, 1.98s/it]
total tokens: 2512 num samples: 16 num padding tokens: 286 - rank: 6 max len: 157 min len: 120 avg len: 139.125 num_loss_counted_tokens: 805
total tokens: 2492 num samples: 14 num padding tokens: 127 - rank: 5 max len: 178 min len: 158 avg len: 168.92857142857142 num_loss_counted_tokens: 898 | |
total tokens: 2330 num samples: 5 num padding tokens: 362 - rank: 1 max len: 466 min len: 361 avg len: 393.6 num_loss_counted_tokens: 1050 | |
total tokens: 2398 num samples: 11 num padding tokens: 222 - rank: 4 max len: 218 min len: 179 avg len: 197.8181818181818 num_loss_counted_tokens: 755 | |
total tokens: 2484 num samples: 9 num padding tokens: 299 - rank: 3 max len: 276 min len: 224 avg len: 242.77777777777777 num_loss_counted_tokens: 926 | |
total tokens: 2443 num samples: 7 num padding tokens: 267 - rank: 2 max len: 349 min len: 277 avg len: 310.85714285714283 num_loss_counted_tokens: 1243 | |
total tokens: 2520 num samples: 21 num padding tokens: 387 - rank: 7 max len: 120 min len: 77 avg len: 101.57142857142857 num_loss_counted_tokens: 577 | |
total tokens: 2284 num samples: 4 num padding tokens: 191 - rank: 0 max len: 571 min len: 482 avg len: 523.25 num_loss_counted_tokens: 1093 | |
Per-token loss scaled by world size: 0.001231023226864636
Per-token loss scaled by world size: 0.00134338962379843
Per-token loss scaled by world size: 0.002999958349391818
Per-token loss scaled by world size: 0.0032275605481117964 | |
Per-token loss scaled by world size: 3.873389869113453e-05 | |
Epoch: 0, Step: 58, Rank: 5, loss = 1.089656949043274
Per-token loss scaled by world size: 0.0014052734477445483
Per-token loss scaled by world size: 0.0007985016563907266
Epoch: 0, Step: 58, Rank: 2, loss = 2.617954969406128
Epoch: 0, Step: 58, Rank: 4, loss = 0.9985136985778809
Epoch: 0, Step: 58, Rank: 1, loss = 2.4333412647247314
Epoch: 0, Step: 58, Rank: 0, loss = 0.03141803294420242 | |
Per-token loss scaled by world size: 0.0015811807243153453 | |
Epoch: 0, Step: 58, Rank: 6, loss = 1.1398524045944214 | |
Epoch: 0, Step: 58, Rank: 7, loss = 0.647684633731842 | |
Epoch: 0, Step: 58, Rank: 3, loss = 1.282535195350647 | |
[2024-06-27 16:42:31,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=58, skipped=0, lr=[3.0129870129870133e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:31,129] [INFO] [timer.py:260:stop] epoch=0/micro_step=58/global_step=58, RunningAvgSamplesPerSec=95.56768586613828, CurrSamplesPerSec=96.28470107620886, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 96.18543133682957 samples/s, lr: 3.0129870129870133e-06, loss: 0.03141803294420242 cuda_mem_allocated: 22.238765239715576 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6489.0 batch_size: 81.0 total loss: 1.280119776725769 | |
Epoch 0: 27% 58/213 [01:20<04:24, 1.70s/it]
total tokens: 2457 num samples: 13 num padding tokens: 294 - rank: 5 max len: 189 min len: 150 avg len: 166.3846153846154 num_loss_counted_tokens: 777
total tokens: 2519 num samples: 11 num padding tokens: 203 - rank: 4 max len: 229 min len: 196 avg len: 210.54545454545453 num_loss_counted_tokens: 912 | |
total tokens: 2457 num samples: 9 num padding tokens: 237 - rank: 3 max len: 273 min len: 234 avg len: 246.66666666666666 num_loss_counted_tokens: 1090 | |
total tokens: 2261 num samples: 7 num padding tokens: 204 - rank: 2 max len: 323 min len: 275 avg len: 293.85714285714283 num_loss_counted_tokens: 1104 | |
total tokens: 2448 num samples: 6 num padding tokens: 96 - rank: 1 max len: 408 min len: 370 avg len: 392.0 num_loss_counted_tokens: 1416 | |
total tokens: 2431 num samples: 17 num padding tokens: 203 - rank: 6 max len: 143 min len: 120 avg len: 131.05882352941177 num_loss_counted_tokens: 746 | |
total tokens: 2520 num samples: 21 num padding tokens: 376 - rank: 7 max len: 120 min len: 79 avg len: 102.0952380952381 num_loss_counted_tokens: 531 | |
total tokens: 2420 num samples: 5 num padding tokens: 116 - rank: 0 max len: 484 min len: 418 avg len: 460.8 num_loss_counted_tokens: 1511 | |
Per-token loss scaled by world size: 0.0022523868829011917
Per-token loss scaled by world size: 0.0011322196805849671
Per-token loss scaled by world size: 0.001769682508893311
Per-token loss scaled by world size: 0.0030189810786396265
Per-token loss scaled by world size: 0.0001382523769279942
Per-token loss scaled by world size: 0.0014469276648014784
Per-token loss scaled by world size: 0.00046376517275348306 | |
Epoch: 0, Step: 59, Rank: 2, loss = 1.7965601682662964
Epoch: 0, Step: 59, Rank: 5, loss = 0.9030867218971252
Epoch: 0, Step: 59, Rank: 4, loss = 1.4115430116653442 | |
Epoch: 0, Step: 59, Rank: 7, loss = 0.36991068720817566 | |
Epoch: 0, Step: 59, Rank: 3, loss = 2.4080147743225098
Epoch: 0, Step: 59, Rank: 1, loss = 1.1541056632995605
Epoch: 0, Step: 59, Rank: 0, loss = 0.11027354747056961
Per-token loss scaled by world size: 0.00110234331805259 | |
Epoch: 0, Step: 59, Rank: 6, loss = 0.879256546497345 | |
[2024-06-27 16:42:32,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=59, skipped=0, lr=[3.0649350649350653e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:32,179] [INFO] [timer.py:260:stop] epoch=0/micro_step=59/global_step=59, RunningAvgSamplesPerSec=95.58806806892989, CurrSamplesPerSec=96.74351482399739, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 96.64512976243223 samples/s, lr: 3.0649350649350653e-06, loss: 0.11027354747056961 cuda_mem_allocated: 22.236618995666504 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6381.0 batch_size: 77.0 total loss: 1.129093885421753 | |
Epoch 0: 28% 59/213 [01:21<03:52, 1.51s/it]
total tokens: 2490 num samples: 10 num padding tokens: 229 - rank: 4 max len: 249 min len: 202 avg len: 226.1 num_loss_counted_tokens: 798
total tokens: 2384 num samples: 8 num padding tokens: 165 - rank: 3 max len: 298 min len: 252 avg len: 277.375 num_loss_counted_tokens: 760 | |
total tokens: 2505 num samples: 15 num padding tokens: 328 - rank: 6 max len: 167 min len: 126 avg len: 145.13333333333333 num_loss_counted_tokens: 796 | |
total tokens: 2345 num samples: 5 num padding tokens: 246 - rank: 1 max len: 469 min len: 397 avg len: 419.8 num_loss_counted_tokens: 1084 | |
total tokens: 2388 num samples: 12 num padding tokens: 204 - rank: 5 max len: 199 min len: 172 avg len: 182.0 num_loss_counted_tokens: 885 | |
total tokens: 2324 num samples: 7 num padding tokens: 147 - rank: 2 max len: 332 min len: 298 avg len: 311.0 num_loss_counted_tokens: 965 | |
total tokens: 1947 num samples: 3 num padding tokens: 184 - rank: 0 max len: 649 min len: 491 avg len: 587.6666666666666 num_loss_counted_tokens: 1032 | |
total tokens: 1890 num samples: 15 num padding tokens: 262 - rank: 7 max len: 126 min len: 76 avg len: 108.53333333333333 num_loss_counted_tokens: 425 | |
Per-token loss scaled by world size: 0.001511856447905302
Per-token loss scaled by world size: 0.0010480601340532303
Per-token loss scaled by world size: 0.0011559822596609592
Per-token loss scaled by world size: 0.0005917920498177409
Per-token loss scaled by world size: 0.0014987518079578876
Per-token loss scaled by world size: 0.0011865315027534962
Per-token loss scaled by world size: 0.0012093138648197055
Epoch: 0, Step: 60, Rank: 3, loss = 1.4532719850540161 | |
Epoch: 0, Step: 60, Rank: 6, loss = 1.0074478387832642
Epoch: 0, Step: 60, Rank: 4, loss = 1.1111879348754883
Epoch: 0, Step: 60, Rank: 1, loss = 1.14055335521698
Epoch: 0, Step: 60, Rank: 0, loss = 1.440675139427185
Epoch: 0, Step: 60, Rank: 5, loss = 1.1624529361724854
Epoch: 0, Step: 60, Rank: 7, loss = 0.5688601136207581 | |
Per-token loss scaled by world size: 0.002087149303406477 | |
Epoch: 0, Step: 60, Rank: 2, loss = 2.006272315979004 | |
[2024-06-27 16:42:33,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[3.116883116883117e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:33,237] [INFO] [timer.py:260:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=95.5896800783353, CurrSamplesPerSec=95.68165457496794, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.57792892466956 samples/s, lr: 3.116883116883117e-06, loss: 1.440675139427185 cuda_mem_allocated: 22.299350261688232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7690.0 batch_size: 82.0 total loss: 1.2363401651382446 | |
Epoch 0: 28% 60/213 [01:22<03:30, 1.37s/it]
total tokens: 2368 num samples: 8 num padding tokens: 288 - rank: 4 max len: 296 min len: 233 avg len: 260.0 num_loss_counted_tokens: 681
total tokens: 2282 num samples: 7 num padding tokens: 112 - rank: 3 max len: 326 min len: 301 avg len: 310.0 num_loss_counted_tokens: 930 | |
total tokens: 2230 num samples: 5 num padding tokens: 308 - rank: 1 max len: 446 min len: 355 avg len: 384.4 num_loss_counted_tokens: 1270 | |
total tokens: 2475 num samples: 11 num padding tokens: 258 - rank: 5 max len: 225 min len: 181 avg len: 201.54545454545453 num_loss_counted_tokens: 1132 | |
total tokens: 2130 num samples: 15 num padding tokens: 275 - rank: 7 max len: 142 min len: 93 avg len: 123.66666666666667 num_loss_counted_tokens: 696 | |
total tokens: 2046 num samples: 3 num padding tokens: 260 - rank: 0 max len: 682 min len: 524 avg len: 595.3333333333334 num_loss_counted_tokens: 1198 | |
total tokens: 2478 num samples: 14 num padding tokens: 241 - rank: 6 max len: 177 min len: 143 avg len: 159.78571428571428 num_loss_counted_tokens: 758 | |
total tokens: 2436 num samples: 7 num padding tokens: 94 - rank: 2 max len: 348 min len: 327 avg len: 334.57142857142856 num_loss_counted_tokens: 1529 | |
Per-token loss scaled by world size: 0.0009685418335720897
Per-token loss scaled by world size: 0.0015601901104673743
Per-token loss scaled by world size: 0.0011223881738260388
Per-token loss scaled by world size: 0.0008867046562954783
Per-token loss scaled by world size: 0.0015343008562922478
Per-token loss scaled by world size: 0.0010373590048402548
Per-token loss scaled by world size: 0.0005826232372783124
Epoch: 0, Step: 61, Rank: 0, loss = 0.8440842032432556 | |
Epoch: 0, Step: 61, Rank: 1, loss = 1.3597056865692139 | |
Epoch: 0, Step: 61, Rank: 5, loss = 1.3371431827545166
Epoch: 0, Step: 61, Rank: 4, loss = 0.7727631330490112
Epoch: 0, Step: 61, Rank: 6, loss = 0.9781612753868103
Epoch: 0, Step: 61, Rank: 7, loss = 0.5077561736106873
Epoch: 0, Step: 61, Rank: 3, loss = 0.9040583968162537
Per-token loss scaled by world size: 0.0024577686563134193 | |
Epoch: 0, Step: 61, Rank: 2, loss = 2.1419453620910645 | |
[2024-06-27 16:42:34,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=61, skipped=0, lr=[3.168831168831169e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:34,304] [INFO] [timer.py:260:stop] epoch=0/micro_step=61/global_step=61, RunningAvgSamplesPerSec=95.57210285639464, CurrSamplesPerSec=94.56356763337668, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.48000484304453 samples/s, lr: 3.168831168831169e-06, loss: 0.8440842032432556 cuda_mem_allocated: 22.296486854553223 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6972.0 batch_size: 79.0 total loss: 1.10570228099823 | |
Epoch 0: 29% 61/213 [01:23<03:14, 1.28s/it]
total tokens: 2322 num samples: 18 num padding tokens: 309 - rank: 7 max len: 129 min len: 93 avg len: 111.83333333333333 num_loss_counted_tokens: 633
total tokens: 2148 num samples: 4 num padding tokens: 246 - rank: 1 max len: 537 min len: 391 avg len: 475.5 num_loss_counted_tokens: 942 | |
total tokens: 2238 num samples: 6 num padding tokens: 82 - rank: 2 max len: 373 min len: 336 avg len: 359.3333333333333 num_loss_counted_tokens: 1197 | |
total tokens: 2508 num samples: 11 num padding tokens: 172 - rank: 5 max len: 228 min len: 190 avg len: 212.36363636363637 num_loss_counted_tokens: 706 | |
total tokens: 2534 num samples: 14 num padding tokens: 253 - rank: 6 max len: 181 min len: 139 avg len: 162.92857142857142 num_loss_counted_tokens: 932 | |
total tokens: 2338 num samples: 7 num padding tokens: 130 - rank: 3 max len: 334 min len: 283 avg len: 315.42857142857144 num_loss_counted_tokens: 1117 | |
total tokens: 2439 num samples: 9 num padding tokens: 214 - rank: 4 max len: 271 min len: 228 avg len: 247.22222222222223 num_loss_counted_tokens: 1136 | |
total tokens: 2000 num samples: 2 num padding tokens: 44 - rank: 0 max len: 1000 min len: 956 avg len: 978.0 num_loss_counted_tokens: 770 | |
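The per-rank batch lines above are internally consistent: samples in a rank's micro-batch are padded to that batch's max length, so total tokens = num samples × max len, and num padding tokens = total tokens − sum of the raw lengths. A small illustrative sketch (not the training code itself; the four lengths below are hypothetical, chosen only to reproduce the rank-1 step-62 line "total tokens: 2148 num samples: 4 ... max len: 537 min len: 391 avg len: 475.5"):

```python
# Derive the logged batch statistics from a list of per-sample token lengths.
def batch_stats(lengths):
    total_tokens = len(lengths) * max(lengths)      # padded to the batch max
    padding = total_tokens - sum(lengths)           # tokens added as padding
    return (total_tokens, padding, max(lengths), min(lengths),
            sum(lengths) / len(lengths))

# Any four lengths with max 537, min 391 and sum 1902 reproduce the logged line.
print(batch_stats([537, 500, 474, 391]))  # (2148, 246, 537, 391, 475.5)
```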
Per-token loss scaled by world size: 0.0025360658764839172
Per-token loss scaled by world size: 0.0011244043707847595
Per-token loss scaled by world size: 0.0009880692232400179
Per-token loss scaled by world size: 0.0014139594277366996
Per-token loss scaled by world size: 0.001011083135381341
Per-token loss scaled by world size: 0.0010678773978725076
Per-token loss scaled by world size: 0.0004979260265827179
Epoch: 0, Step: 62, Rank: 2, loss = 0.9898974895477295
Epoch: 0, Step: 62, Rank: 5, loss = 0.8698714375495911
Epoch: 0, Step: 62, Rank: 0, loss = 2.2326889038085938
Epoch: 0, Step: 62, Rank: 1, loss = 1.2448145151138306
Epoch: 0, Step: 62, Rank: 3, loss = 0.8901323080062866
Epoch: 0, Step: 62, Rank: 6, loss = 0.9401325583457947
Epoch: 0, Step: 62, Rank: 7, loss = 0.43836164474487305
Per-token loss scaled by world size: 0.001049467478878796 | |
Epoch: 0, Step: 62, Rank: 4, loss = 0.92392498254776 | |
[2024-06-27 16:42:35,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=62, skipped=0, lr=[3.220779220779221e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:35,367] [INFO] [timer.py:260:stop] epoch=0/micro_step=62/global_step=62, RunningAvgSamplesPerSec=95.56527634285517, CurrSamplesPerSec=95.1642309164906, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.07971591307566 samples/s, lr: 3.220779220779221e-06, loss: 2.2326889038085938 cuda_mem_allocated: 22.259636402130127 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7043.0 batch_size: 73.0 total loss: 1.066227912902832 | |
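The lr values logged each step increase by a constant ~5.1948e-08, consistent with a linear warmup schedule. Assuming (not confirmed anywhere in this log) a peak lr of 2e-5 reached after 385 warmup steps, the logged values can be reproduced:

```python
# Linear warmup sketch; peak_lr and warmup_steps are assumed values that
# happen to fit the logged lr sequence, not settings read from the config.
def warmup_lr(step, peak_lr=2e-5, warmup_steps=385):
    return peak_lr * min(step, warmup_steps) / warmup_steps

print(warmup_lr(61))  # ≈ 3.1688e-06, matching the step-61 lr line
print(warmup_lr(73))  # ≈ 3.7922e-06, matching the step-73 lr line
```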
Epoch 0: 29% 62/213 [01:24<03:03, 1.22s/it]
total tokens: 2405 num samples: 13 num padding tokens: 210 - rank: 5 max len: 185 min len: 151 avg len: 168.84615384615384 num_loss_counted_tokens: 833
total tokens: 2280 num samples: 20 num padding tokens: 266 - rank: 7 max len: 114 min len: 83 avg len: 100.7 num_loss_counted_tokens: 480 | |
total tokens: 2320 num samples: 10 num padding tokens: 233 - rank: 4 max len: 232 min len: 186 avg len: 208.7 num_loss_counted_tokens: 880 | |
total tokens: 2448 num samples: 17 num padding tokens: 203 - rank: 6 max len: 144 min len: 117 avg len: 132.05882352941177 num_loss_counted_tokens: 758 | |
total tokens: 2366 num samples: 7 num padding tokens: 199 - rank: 2 max len: 338 min len: 272 avg len: 309.57142857142856 num_loss_counted_tokens: 785 | |
total tokens: 2421 num samples: 9 num padding tokens: 148 - rank: 3 max len: 269 min len: 237 avg len: 252.55555555555554 num_loss_counted_tokens: 1164 | |
total tokens: 2255 num samples: 5 num padding tokens: 390 - rank: 1 max len: 451 min len: 343 avg len: 373.0 num_loss_counted_tokens: 1205 | |
total tokens: 2019 num samples: 3 num padding tokens: 310 - rank: 0 max len: 673 min len: 505 avg len: 569.6666666666666 num_loss_counted_tokens: 1049 | |
Per-token loss scaled by world size: 0.0011152056977152824
Per-token loss scaled by world size: 0.0011478060623630881
Per-token loss scaled by world size: 0.0007925205281935632
Per-token loss scaled by world size: 0.0011192952515557408
Per-token loss scaled by world size: 0.0007736588595435023
Per-token loss scaled by world size: 0.002162716817110777
Per-token loss scaled by world size: 0.0014523111749440432
Epoch: 0, Step: 63, Rank: 7, loss = 0.6906839609146118
Epoch: 0, Step: 63, Rank: 2, loss = 1.024703860282898
Epoch: 0, Step: 63, Rank: 3, loss = 0.7075226902961731
Epoch: 0, Step: 63, Rank: 6, loss = 0.9955999255180359
Epoch: 0, Step: 63, Rank: 1, loss = 1.9307655096054077
Epoch: 0, Step: 63, Rank: 4, loss = 0.9992508292198181
Epoch: 0, Step: 63, Rank: 5, loss = 1.2965507507324219
Per-token loss scaled by world size: 0.001413009944371879 | |
Epoch: 0, Step: 63, Rank: 0, loss = 1.2614645957946777 | |
[2024-06-27 16:42:36,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=63, skipped=0, lr=[3.272727272727273e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:36,431] [INFO] [timer.py:260:stop] epoch=0/micro_step=63/global_step=63, RunningAvgSamplesPerSec=95.55721092448304, CurrSamplesPerSec=95.07576462010017, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.98867269330214 samples/s, lr: 3.272727272727273e-06, loss: 1.2614645957946777 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7142.0 batch_size: 85.0 total loss: 1.1133177280426025 | |
Epoch 0: 30% 63/213 [01:25<02:55, 1.17s/it]
total tokens: 2394 num samples: 9 num padding tokens: 240 - rank: 4 max len: 266 min len: 211 avg len: 239.33333333333334 num_loss_counted_tokens: 953
total tokens: 2520 num samples: 12 num padding tokens: 280 - rank: 5 max len: 210 min len: 161 avg len: 186.66666666666666 num_loss_counted_tokens: 1082 | |
total tokens: 2385 num samples: 15 num padding tokens: 230 - rank: 6 max len: 159 min len: 123 avg len: 143.66666666666666 num_loss_counted_tokens: 767 | |
total tokens: 2408 num samples: 8 num padding tokens: 86 - rank: 3 max len: 301 min len: 270 avg len: 290.25 num_loss_counted_tokens: 860 | |
total tokens: 2464 num samples: 7 num padding tokens: 116 - rank: 2 max len: 352 min len: 311 avg len: 335.42857142857144 num_loss_counted_tokens: 1287 | |
total tokens: 1904 num samples: 16 num padding tokens: 210 - rank: 7 max len: 119 min len: 88 avg len: 105.875 num_loss_counted_tokens: 437 | |
total tokens: 2320 num samples: 5 num padding tokens: 246 - rank: 1 max len: 464 min len: 384 avg len: 414.8 num_loss_counted_tokens: 873 | |
total tokens: 2396 num samples: 4 num padding tokens: 319 - rank: 0 max len: 599 min len: 477 avg len: 519.25 num_loss_counted_tokens: 1446 | |
Per-token loss scaled by world size: 0.0006784854340367019
Per-token loss scaled by world size: 0.001498804660513997
Per-token loss scaled by world size: 0.0007464307127520442
Per-token loss scaled by world size: 0.00103193789254874
Per-token loss scaled by world size: 0.001229585730470717
Per-token loss scaled by world size: 0.0012099394807592034
Per-token loss scaled by world size: 0.0007858126773498952
Per-token loss scaled by world size: 0.0013852602569386363
Epoch: 0, Step: 64, Rank: 1, loss = 1.3685959577560425
Epoch: 0, Step: 64, Rank: 7, loss = 0.6195420026779175
Epoch: 0, Step: 64, Rank: 6, loss = 1.1048259735107422
Epoch: 0, Step: 64, Rank: 4, loss = 0.6815845370292664
Epoch: 0, Step: 64, Rank: 2, loss = 0.942288339138031
Epoch: 0, Step: 64, Rank: 5, loss = 1.1227654218673706
Epoch: 0, Step: 64, Rank: 3, loss = 0.717545211315155
Epoch: 0, Step: 64, Rank: 0, loss = 1.2649158239364624 | |
[2024-06-27 16:42:37,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=64, skipped=0, lr=[3.324675324675325e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:37,490] [INFO] [timer.py:260:stop] epoch=0/micro_step=64/global_step=64, RunningAvgSamplesPerSec=95.55867818094919, CurrSamplesPerSec=95.6482661112547, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.55481608839433 samples/s, lr: 3.324675324675325e-06, loss: 1.2649158239364624 cuda_mem_allocated: 22.306028842926025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7305.0 batch_size: 84.0 total loss: 0.9777579307556152 | |
Epoch 0: 30% 64/213 [01:26<02:49, 1.14s/it]
total tokens: 2409 num samples: 11 num padding tokens: 207 - rank: 4 max len: 219 min len: 181 avg len: 200.1818181818182 num_loss_counted_tokens: 810
total tokens: 2392 num samples: 8 num padding tokens: 328 - rank: 3 max len: 299 min len: 223 avg len: 258.0 num_loss_counted_tokens: 998 | |
total tokens: 2380 num samples: 7 num padding tokens: 85 - rank: 2 max len: 340 min len: 315 avg len: 327.85714285714283 num_loss_counted_tokens: 1024 | |
total tokens: 2478 num samples: 14 num padding tokens: 108 - rank: 5 max len: 177 min len: 164 avg len: 169.28571428571428 num_loss_counted_tokens: 1027 | |
total tokens: 2512 num samples: 16 num padding tokens: 231 - rank: 6 max len: 157 min len: 126 avg len: 142.5625 num_loss_counted_tokens: 917 | |
total tokens: 2304 num samples: 6 num padding tokens: 187 - rank: 1 max len: 384 min len: 340 avg len: 352.8333333333333 num_loss_counted_tokens: 758 | |
total tokens: 2520 num samples: 20 num padding tokens: 335 - rank: 7 max len: 126 min len: 87 avg len: 109.25 num_loss_counted_tokens: 581 | |
total tokens: 2088 num samples: 4 num padding tokens: 253 - rank: 0 max len: 522 min len: 413 avg len: 458.75 num_loss_counted_tokens: 671 | |
Per-token loss scaled by world size: 0.0012893658131361008
Per-token loss scaled by world size: 0.0011579877464100718
Per-token loss scaled by world size: 0.0006967362714931369
Per-token loss scaled by world size: 0.001236456329934299
Per-token loss scaled by world size: 0.0015634155133739114
Per-token loss scaled by world size: 0.0014415328623726964
Per-token loss scaled by world size: 0.0014332120772451162
Epoch: 0, Step: 65, Rank: 3, loss = 1.0919824838638306
Epoch: 0, Step: 65, Rank: 5, loss = 0.6570222973823547
Epoch: 0, Step: 65, Rank: 6, loss = 1.2158719301223755
Epoch: 0, Step: 65, Rank: 4, loss = 1.1659783124923706
Epoch: 0, Step: 65, Rank: 1, loss = 1.351518988609314
Epoch: 0, Step: 65, Rank: 2, loss = 1.4743008613586426
Epoch: 0, Step: 65, Rank: 0, loss = 1.359365463256836 | |
Per-token loss scaled by world size: 0.0008278127643279731 | |
Epoch: 0, Step: 65, Rank: 7, loss = 0.780627429485321 | |
[2024-06-27 16:42:38,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=65, skipped=0, lr=[3.376623376623377e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:38,544] [INFO] [timer.py:260:stop] epoch=0/micro_step=65/global_step=65, RunningAvgSamplesPerSec=95.56746759253778, CurrSamplesPerSec=96.11558700104553, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.01785610384012 samples/s, lr: 3.376623376623377e-06, loss: 1.359365463256836 cuda_mem_allocated: 22.282296180725098 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7544.0 batch_size: 87.0 total loss: 1.1370835304260254 | |
Epoch 0: 31% 65/213 [01:27<02:44, 1.11s/it]
total tokens: 2128 num samples: 4 num padding tokens: 100 - rank: 1 max len: 532 min len: 473 avg len: 507.0 num_loss_counted_tokens: 1435
total tokens: 2360 num samples: 5 num padding tokens: 217 - rank: 2 max len: 472 min len: 390 avg len: 428.6 num_loss_counted_tokens: 844 | |
total tokens: 2450 num samples: 14 num padding tokens: 307 - rank: 6 max len: 175 min len: 138 avg len: 153.07142857142858 num_loss_counted_tokens: 890 | |
total tokens: 2453 num samples: 11 num padding tokens: 285 - rank: 5 max len: 223 min len: 176 avg len: 197.0909090909091 num_loss_counted_tokens: 1149 | |
total tokens: 2331 num samples: 7 num padding tokens: 242 - rank: 3 max len: 333 min len: 266 avg len: 298.42857142857144 num_loss_counted_tokens: 921 | |
total tokens: 2331 num samples: 9 num padding tokens: 213 - rank: 4 max len: 259 min len: 224 avg len: 235.33333333333334 num_loss_counted_tokens: 877 | |
total tokens: 2192 num samples: 16 num padding tokens: 309 - rank: 7 max len: 137 min len: 89 avg len: 117.6875 num_loss_counted_tokens: 480 | |
total tokens: 1995 num samples: 3 num padding tokens: 191 - rank: 0 max len: 665 min len: 534 avg len: 601.3333333333334 num_loss_counted_tokens: 1274 | |
Per-token loss scaled by world size: 0.0007761853048577905
Per-token loss scaled by world size: 0.001305400743149221
Per-token loss scaled by world size: 0.00045607821084558964
Per-token loss scaled by world size: 0.0012145814253017306
Per-token loss scaled by world size: 0.0018463066080585122
Per-token loss scaled by world size: 0.0007220458355732262
Per-token loss scaled by world size: 0.0017517150845378637
Epoch: 0, Step: 66, Rank: 7, loss = 0.44307997822761536
Epoch: 0, Step: 66, Rank: 0, loss = 0.7540640234947205
Epoch: 0, Step: 66, Rank: 5, loss = 1.2681968212127686
Epoch: 0, Step: 66, Rank: 2, loss = 1.179965853691101
Epoch: 0, Step: 66, Rank: 6, loss = 0.7014675140380859
Epoch: 0, Step: 66, Rank: 1, loss = 1.793686866760254
Epoch: 0, Step: 66, Rank: 3, loss = 1.7017911672592163
Per-token loss scaled by world size: 0.0013509176205843687 | |
Epoch: 0, Step: 66, Rank: 4, loss = 1.312416434288025 | |
[2024-06-27 16:42:39,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=66, skipped=0, lr=[3.428571428571429e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:39,602] [INFO] [timer.py:260:stop] epoch=0/micro_step=66/global_step=66, RunningAvgSamplesPerSec=95.56982206910908, CurrSamplesPerSec=95.71838833996887, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.628730239102 samples/s, lr: 3.428571428571429e-06, loss: 0.7540640234947205 cuda_mem_allocated: 22.283607482910156 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7772.0 batch_size: 79.0 total loss: 1.1443336009979248 | |
Epoch 0: 31% 66/213 [01:28<02:41, 1.10s/it]
total tokens: 2303 num samples: 7 num padding tokens: 136 - rank: 2 max len: 329 min len: 290 avg len: 309.57142857142856 num_loss_counted_tokens: 973
total tokens: 2520 num samples: 15 num padding tokens: 343 - rank: 6 max len: 168 min len: 130 avg len: 145.13333333333333 num_loss_counted_tokens: 877 | |
total tokens: 2332 num samples: 11 num padding tokens: 244 - rank: 5 max len: 212 min len: 170 avg len: 189.8181818181818 num_loss_counted_tokens: 811 | |
total tokens: 2120 num samples: 5 num padding tokens: 262 - rank: 1 max len: 424 min len: 331 avg len: 371.6 num_loss_counted_tokens: 1017 | |
total tokens: 2360 num samples: 10 num padding tokens: 113 - rank: 4 max len: 236 min len: 212 avg len: 224.7 num_loss_counted_tokens: 1101 | |
total tokens: 2451 num samples: 19 num padding tokens: 419 - rank: 7 max len: 129 min len: 87 avg len: 106.94736842105263 num_loss_counted_tokens: 575 | |
total tokens: 2280 num samples: 8 num padding tokens: 160 - rank: 3 max len: 285 min len: 240 avg len: 265.0 num_loss_counted_tokens: 747 | |
total tokens: 2139 num samples: 3 num padding tokens: 305 - rank: 0 max len: 713 min len: 559 avg len: 611.3333333333334 num_loss_counted_tokens: 513 | |
Per-token loss scaled by world size: 0.002457298571243882 | |
Per-token loss scaled by world size: 0.0017972920322790742
Per-token loss scaled by world size: 0.0014236713759601116
Per-token loss scaled by world size: 0.0008009669836610556
Per-token loss scaled by world size: 0.0004675793170463294
Per-token loss scaled by world size: 0.001420770538970828
Per-token loss scaled by world size: 0.0010510666761547327
Per-token loss scaled by world size: 0.0012896271655336022
Epoch: 0, Step: 67, Rank: 1, loss = 1.9053277969360352
Epoch: 0, Step: 67, Rank: 5, loss = 1.1038792133331299
Epoch: 0, Step: 67, Rank: 7, loss = 0.6210497617721558
Epoch: 0, Step: 67, Rank: 2, loss = 1.3935753107070923
Epoch: 0, Step: 67, Rank: 0, loss = 0.3625493049621582
Epoch: 0, Step: 67, Rank: 4, loss = 0.814970850944519
Epoch: 0, Step: 67, Rank: 6, loss = 1.1016299724578857
Epoch: 0, Step: 67, Rank: 3, loss = 0.9999446868896484 | |
[2024-06-27 16:42:40,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=67, skipped=0, lr=[3.480519480519481e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:40,661] [INFO] [timer.py:260:stop] epoch=0/micro_step=67/global_step=67, RunningAvgSamplesPerSec=95.5742345170444, CurrSamplesPerSec=95.85748118099102, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.76974912876737 samples/s, lr: 3.480519480519481e-06, loss: 0.3625493049621582 cuda_mem_allocated: 22.291836738586426 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6203.0 batch_size: 88.0 total loss: 1.0378658771514893 | |
Epoch 0: 31% 67/213 [01:29<02:38, 1.08s/it]
total tokens: 2508 num samples: 12 num padding tokens: 128 - rank: 5 max len: 209 min len: 184 avg len: 198.33333333333334 num_loss_counted_tokens: 1092
total tokens: 2170 num samples: 5 num padding tokens: 345 - rank: 1 max len: 434 min len: 328 avg len: 365.0 num_loss_counted_tokens: 1110 | |
total tokens: 2233 num samples: 7 num padding tokens: 226 - rank: 2 max len: 319 min len: 276 avg len: 286.7142857142857 num_loss_counted_tokens: 1051 | |
total tokens: 2460 num samples: 10 num padding tokens: 147 - rank: 4 max len: 246 min len: 217 avg len: 231.3 num_loss_counted_tokens: 980 | |
total tokens: 2457 num samples: 9 num padding tokens: 89 - rank: 3 max len: 273 min len: 253 avg len: 263.1111111111111 num_loss_counted_tokens: 751 | |
total tokens: 2366 num samples: 13 num padding tokens: 194 - rank: 6 max len: 182 min len: 145 avg len: 167.07692307692307 num_loss_counted_tokens: 744 | |
total tokens: 1810 num samples: 2 num padding tokens: 371 - rank: 0 max len: 905 min len: 534 avg len: 719.5 num_loss_counted_tokens: 380 | |
total tokens: 2448 num samples: 17 num padding tokens: 406 - rank: 7 max len: 144 min len: 76 avg len: 120.11764705882354 num_loss_counted_tokens: 755 | |
Per-token loss scaled by world size: 0.0008922195993363857
Per-token loss scaled by world size: 0.0015631833812221885
Per-token loss scaled by world size: 0.001041628886014223
Per-token loss scaled by world size: 0.0008396353223361075
Per-token loss scaled by world size: 0.0019877408631145954
Per-token loss scaled by world size: 0.0017637622077018023
Per-token loss scaled by world size: 0.0009744731360115111
Epoch: 0, Step: 68, Rank: 4, loss = 0.8606573343276978
Epoch: 0, Step: 68, Rank: 0, loss = 1.5078858137130737
Epoch: 0, Step: 68, Rank: 6, loss = 0.8099332451820374
Epoch: 0, Step: 68, Rank: 5, loss = 1.0047812461853027
Epoch: 0, Step: 68, Rank: 3, loss = 1.7013691663742065
Epoch: 0, Step: 68, Rank: 1, loss = 1.9174244403839111
Epoch: 0, Step: 68, Rank: 2, loss = 0.9400011301040649 | |
Per-token loss scaled by world size: 0.0007082439842633903 | |
Epoch: 0, Step: 68, Rank: 7, loss = 0.683189868927002 | |
[2024-06-27 16:42:41,645] [INFO] [logging.py:96:log_dist] [Rank 0] step=68, skipped=0, lr=[3.532467532467533e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:41,718] [INFO] [timer.py:260:stop] epoch=0/micro_step=68/global_step=68, RunningAvgSamplesPerSec=95.57679078865242, CurrSamplesPerSec=95.74324226641355, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.61876093409043 samples/s, lr: 3.532467532467533e-06, loss: 1.5078858137130737 cuda_mem_allocated: 22.282176971435547 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7717.0 batch_size: 83.0 total loss: 1.1781553030014038 | |
Epoch 0: 32% 68/213 [01:30<02:36, 1.08s/it]
total tokens: 2317 num samples: 7 num padding tokens: 82 - rank: 1 max len: 331 min len: 310 avg len: 319.2857142857143 num_loss_counted_tokens: 933
total tokens: 2528 num samples: 16 num padding tokens: 183 - rank: 6 max len: 158 min len: 132 avg len: 146.5625 num_loss_counted_tokens: 1054 | |
total tokens: 2354 num samples: 11 num padding tokens: 169 - rank: 4 max len: 214 min len: 184 avg len: 198.63636363636363 num_loss_counted_tokens: 775 | |
total tokens: 2530 num samples: 10 num padding tokens: 235 - rank: 3 max len: 253 min len: 216 avg len: 229.5 num_loss_counted_tokens: 1046 | |
total tokens: 2379 num samples: 13 num padding tokens: 160 - rank: 5 max len: 183 min len: 159 avg len: 170.69230769230768 num_loss_counted_tokens: 842 | |
total tokens: 2432 num samples: 8 num padding tokens: 180 - rank: 2 max len: 304 min len: 264 avg len: 281.5 num_loss_counted_tokens: 1323 | |
total tokens: 2340 num samples: 18 num padding tokens: 397 - rank: 7 max len: 130 min len: 83 avg len: 107.94444444444444 num_loss_counted_tokens: 582 | |
total tokens: 2460 num samples: 4 num padding tokens: 643 - rank: 0 max len: 615 min len: 342 avg len: 454.25 num_loss_counted_tokens: 505 | |
Per-token loss scaled by world size: 0.0014589219354093075
Per-token loss scaled by world size: 0.0011279700556769967
Per-token loss scaled by world size: 0.0012790620094165206
Per-token loss scaled by world size: 0.0013136304914951324
Per-token loss scaled by world size: 0.0009520920575596392
Per-token loss scaled by world size: 0.001424268470145762
Per-token loss scaled by world size: 0.0013369007501751184
Epoch: 0, Step: 69, Rank: 3, loss = 1.2224634885787964
Epoch: 0, Step: 69, Rank: 1, loss = 1.361244559288025
Epoch: 0, Step: 69, Rank: 4, loss = 1.0780574083328247
Epoch: 0, Step: 69, Rank: 6, loss = 1.394364595413208
Epoch: 0, Step: 69, Rank: 0, loss = 1.277742862701416
Epoch: 0, Step: 69, Rank: 5, loss = 0.909961998462677
Epoch: 0, Step: 69, Rank: 2, loss = 1.2555023431777954
Per-token loss scaled by world size: 0.0006965146749280393 | |
Epoch: 0, Step: 69, Rank: 7, loss = 0.6656938791275024 | |
[2024-06-27 16:42:42,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=69, skipped=0, lr=[3.584415584415585e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:42,774] [INFO] [timer.py:260:stop] epoch=0/micro_step=69/global_step=69, RunningAvgSamplesPerSec=95.58289771630727, CurrSamplesPerSec=95.9876877365107, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.89055917102058 samples/s, lr: 3.584415584415585e-06, loss: 1.277742862701416 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7646.0 batch_size: 73.0 total loss: 1.1456290483474731 | |
Epoch 0: 32% 69/213 [01:32<02:34, 1.07s/it]
total tokens: 2520 num samples: 14 num padding tokens: 122 - rank: 5 max len: 180 min len: 162 avg len: 171.28571428571428 num_loss_counted_tokens: 1083
total tokens: 2430 num samples: 15 num padding tokens: 222 - rank: 6 max len: 162 min len: 135 avg len: 147.2 num_loss_counted_tokens: 907 | |
total tokens: 2448 num samples: 8 num padding tokens: 337 - rank: 3 max len: 306 min len: 231 avg len: 263.875 num_loss_counted_tokens: 849 | |
total tokens: 2280 num samples: 6 num padding tokens: 180 - rank: 2 max len: 380 min len: 320 avg len: 350.0 num_loss_counted_tokens: 921 | |
total tokens: 2180 num samples: 5 num padding tokens: 130 - rank: 1 max len: 436 min len: 389 avg len: 410.0 num_loss_counted_tokens: 867 | |
total tokens: 2464 num samples: 11 num padding tokens: 252 - rank: 4 max len: 224 min len: 185 avg len: 201.0909090909091 num_loss_counted_tokens: 643 | |
total tokens: 2358 num samples: 18 num padding tokens: 424 - rank: 7 max len: 131 min len: 89 avg len: 107.44444444444444 num_loss_counted_tokens: 530 | |
total tokens: 1983 num samples: 3 num padding tokens: 409 - rank: 0 max len: 661 min len: 455 avg len: 524.6666666666666 num_loss_counted_tokens: 522 | |
Per-token loss scaled by world size: 0.001113825710490346
Per-token loss scaled by world size: 0.0016974156023934484
Per-token loss scaled by world size: 0.0010379968443885446
Per-token loss scaled by world size: 0.0011254666605964303
Per-token loss scaled by world size: 0.001076821587048471
Per-token loss scaled by world size: 0.0011489738244563341
Per-token loss scaled by world size: 0.0011098581599071622
Epoch: 0, Step: 70, Rank: 3, loss = 1.0758163928985596
Epoch: 0, Step: 70, Rank: 0, loss = 1.639491319656372
Epoch: 0, Step: 70, Rank: 1, loss = 1.002575159072876
Epoch: 0, Step: 70, Rank: 4, loss = 1.0400750637054443
Epoch: 0, Step: 70, Rank: 5, loss = 1.0870600938796997
Epoch: 0, Step: 70, Rank: 6, loss = 1.1097650527954102 | |
Epoch: 0, Step: 70, Rank: 2, loss = 1.0719842910766602 | |
Per-token loss scaled by world size: 0.0008070787880569696 | |
Epoch: 0, Step: 70, Rank: 7, loss = 0.7795372009277344 | |
[2024-06-27 16:42:43,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[3.6363636363636366e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:43,826] [INFO] [timer.py:260:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=95.59229452220809, CurrSamplesPerSec=96.22611727998532, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 96.09443790269731 samples/s, lr: 3.6363636363636366e-06, loss: 1.639491319656372 cuda_mem_allocated: 22.299350261688232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7727.0 batch_size: 78.0 total loss: 1.1007881164550781 | |
Epoch 0: 33% 70/213 [01:33<02:32, 1.06s/it]
total tokens: 2464 num samples: 14 num padding tokens: 241 - rank: 5 max len: 176 min len: 140 avg len: 158.78571428571428 num_loss_counted_tokens: 929
total tokens: 2448 num samples: 18 num padding tokens: 148 - rank: 6 max len: 136 min len: 116 avg len: 127.77777777777777 num_loss_counted_tokens: 758 | |
total tokens: 2509 num samples: 13 num padding tokens: 97 - rank: 4 max len: 193 min len: 179 avg len: 185.53846153846155 num_loss_counted_tokens: 909 | |
total tokens: 2530 num samples: 11 num padding tokens: 194 - rank: 3 max len: 230 min len: 194 avg len: 212.36363636363637 num_loss_counted_tokens: 682 | |
total tokens: 2472 num samples: 8 num padding tokens: 388 - rank: 2 max len: 309 min len: 234 avg len: 260.5 num_loss_counted_tokens: 899 | |
total tokens: 2261 num samples: 7 num padding tokens: 51 - rank: 1 max len: 323 min len: 312 avg len: 315.7142857142857 num_loss_counted_tokens: 707 | |
total tokens: 2442 num samples: 6 num padding tokens: 225 - rank: 0 max len: 407 min len: 331 avg len: 369.5 num_loss_counted_tokens: 1024 | |
total tokens: 2436 num samples: 21 num padding tokens: 275 - rank: 7 max len: 116 min len: 80 avg len: 102.9047619047619 num_loss_counted_tokens: 491 | |
Per-token loss scaled by world size: 0.0010819354793056846
Per-token loss scaled by world size: 0.0018861504504457116
Per-token loss scaled by world size: 0.0007152463658712804
Per-token loss scaled by world size: 0.0019872556440532207
Per-token loss scaled by world size: 0.0009638365008868277
Per-token loss scaled by world size: 0.001101154019124806
Per-token loss scaled by world size: 0.0011619852157309651
Epoch: 0, Step: 71, Rank: 3, loss = 0.597767174243927
Epoch: 0, Step: 71, Rank: 2, loss = 1.576350212097168
Epoch: 0, Step: 71, Rank: 1, loss = 1.6608489751815796
Epoch: 0, Step: 71, Rank: 5, loss = 0.904227614402771
Epoch: 0, Step: 71, Rank: 6, loss = 0.9711291790008545
Epoch: 0, Step: 71, Rank: 0, loss = 0.8055263757705688
Epoch: 0, Step: 71, Rank: 4, loss = 0.9202894568443298
Per-token loss scaled by world size: 0.0008045323193073273 | |
Epoch: 0, Step: 71, Rank: 7, loss = 0.6723878979682922 | |
[2024-06-27 16:42:44,814] [INFO] [logging.py:96:log_dist] [Rank 0] step=71, skipped=0, lr=[3.6883116883116886e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:44,887] [INFO] [timer.py:260:stop] epoch=0/micro_step=71/global_step=71, RunningAvgSamplesPerSec=95.58762990933738, CurrSamplesPerSec=95.27150068995282, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.16252160135527 samples/s, lr: 3.6883116883116886e-06, loss: 0.8055263757705688 cuda_mem_allocated: 22.27168083190918 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6686.0 batch_size: 92.0 total loss: 1.0135657787322998 | |
Epoch 0: 33% 71/213 [01:34<02:31, 1.06s/it]
total tokens: 2365 num samples: 11 num padding tokens: 193 - rank: 4 max len: 215 min len: 176 avg len: 197.45454545454547 num_loss_counted_tokens: 826
total tokens: 2280 num samples: 6 num padding tokens: 245 - rank: 1 max len: 380 min len: 301 avg len: 339.1666666666667 num_loss_counted_tokens: 915 | |
total tokens: 2304 num samples: 9 num padding tokens: 191 - rank: 3 max len: 256 min len: 218 avg len: 234.77777777777777 num_loss_counted_tokens: 689 | |
total tokens: 2422 num samples: 14 num padding tokens: 217 - rank: 5 max len: 173 min len: 145 avg len: 157.5 num_loss_counted_tokens: 839 | |
total tokens: 2412 num samples: 18 num padding tokens: 384 - rank: 6 max len: 134 min len: 96 avg len: 112.66666666666667 num_loss_counted_tokens: 608 | |
total tokens: 186 num samples: 2 num padding tokens: 2 - rank: 7 max len: 93 min len: 91 avg len: 92.0 num_loss_counted_tokens: 34 | |
total tokens: 1742 num samples: 2 num padding tokens: 430 - rank: 0 max len: 871 min len: 441 avg len: 656.0 num_loss_counted_tokens: 1062 | |
total tokens: 2360 num samples: 8 num padding tokens: 153 - rank: 2 max len: 295 min len: 257 avg len: 275.875 num_loss_counted_tokens: 907 | |
Per-token loss scaled by world size: 0.002451099455356598
Per-token loss scaled by world size: 0.0010685380548238754
Per-token loss scaled by world size: 0.0009499501320533454
Per-token loss scaled by world size: 0.002362190280109644
Per-token loss scaled by world size: 0.0006211479776538908
Per-token loss scaled by world size: 0.0009166905074380338
Per-token loss scaled by world size: 0.0010880791814997792
Epoch: 0, Step: 72, Rank: 3, loss = 2.120361089706421
Epoch: 0, Step: 72, Rank: 2, loss = 0.9591464996337891
Epoch: 0, Step: 72, Rank: 1, loss = 2.2001681327819824
Epoch: 0, Step: 72, Rank: 5, loss = 0.5575579404830933
Epoch: 0, Step: 72, Rank: 4, loss = 0.8526989817619324
Epoch: 0, Step: 72, Rank: 7, loss = 0.8228443264961243
Per-token loss scaled by world size: 0.0014481379184871912 | |
Epoch: 0, Step: 72, Rank: 0, loss = 0.9766870737075806 | |
Epoch: 0, Step: 72, Rank: 6, loss = 1.2998847961425781 | |
[2024-06-27 16:42:45,868] [INFO] [logging.py:96:log_dist] [Rank 0] step=72, skipped=0, lr=[3.7402597402597406e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:45,941] [INFO] [timer.py:260:stop] epoch=0/micro_step=72/global_step=72, RunningAvgSamplesPerSec=95.59109881094948, CurrSamplesPerSec=95.8310625846979, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 95.73720967888046 samples/s, lr: 3.7402597402597406e-06, loss: 0.9766870737075806 cuda_mem_allocated: 22.221830368041992 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7181.0 batch_size: 68.0 total loss: 1.2236685752868652 | |
Epoch 0: 34% 72/213 [01:35<02:29, 1.06s/it]
total tokens: 2472 num samples: 12 num padding tokens: 280 - rank: 6 max len: 206 min len: 159 avg len: 182.66666666666666 num_loss_counted_tokens: 951
total tokens: 2366 num samples: 7 num padding tokens: 117 - rank: 3 max len: 338 min len: 306 avg len: 321.2857142857143 num_loss_counted_tokens: 1090 | |
total tokens: 2530 num samples: 10 num padding tokens: 232 - rank: 5 max len: 253 min len: 211 avg len: 229.8 num_loss_counted_tokens: 1094 | |
total tokens: 2120 num samples: 5 num padding tokens: 243 - rank: 2 max len: 424 min len: 345 avg len: 375.4 num_loss_counted_tokens: 1073 | |
total tokens: 2196 num samples: 4 num padding tokens: 208 - rank: 1 max len: 549 min len: 454 avg len: 497.0 num_loss_counted_tokens: 329 | |
total tokens: 2432 num samples: 8 num padding tokens: 182 - rank: 4 max len: 304 min len: 255 avg len: 281.25 num_loss_counted_tokens: 1085 | |
total tokens: 1306 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1306 min len: 1306 avg len: 1306.0 num_loss_counted_tokens: 119 | |
total tokens: 2528 num samples: 16 num padding tokens: 508 - rank: 7 max len: 158 min len: 88 avg len: 126.25 num_loss_counted_tokens: 640 | |
Per-token loss scaled by world size: 0.001568732550367713Per-token loss scaled by world size: 0.0006801890558563173Per-token loss scaled by world size: 0.0012422791915014386Per-token loss scaled by world size: 0.0012018597917631269Per-token loss scaled by world size: 0.0011505482252687216 | |
Per-token loss scaled by world size: 0.0017177362460643053Per-token loss scaled by world size: 0.00045712426071986556 | |
Epoch: 0, Step: 73, Rank: 6, loss = 1.00775945186615 | |
Epoch: 0, Step: 73, Rank: 4, loss = 1.3153822422027588 | |
Epoch: 0, Step: 73, Rank: 5, loss = 1.0416511297225952
Epoch: 0, Step: 73, Rank: 7, loss = 0.5703385472297668
Epoch: 0, Step: 73, Rank: 3, loss = 0.9647347331047058
Epoch: 0, Step: 73, Rank: 1, loss = 1.4403218030929565
Epoch: 0, Step: 73, Rank: 0, loss = 0.38329869508743286
Per-token loss scaled by world size: 0.0015694088069722056 | |
Epoch: 0, Step: 73, Rank: 2, loss = 1.3159493207931519 | |
[2024-06-27 16:42:46,921] [INFO] [logging.py:96:log_dist] [Rank 0] step=73, skipped=0, lr=[3.7922077922077926e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:46,995] [INFO] [timer.py:260:stop] epoch=0/micro_step=73/global_step=73, RunningAvgSamplesPerSec=95.59696390069647, CurrSamplesPerSec=96.0093163947517, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.87711063700091 samples/s, lr: 3.7922077922077926e-06, loss: 0.38329869508743286 cuda_mem_allocated: 22.24532461166382 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6708.0 batch_size: 81.0 total loss: 1.004929542541504 | |
Epoch 0: 34% 73/213 [01:36<02:28, 1.06s/it] total tokens: 2422 num samples: 14 num padding tokens: 403 - rank: 6 max len: 173 min len: 132 avg len: 144.21428571428572 num_loss_counted_tokens: 897 | |
total tokens: 2330 num samples: 10 num padding tokens: 196 - rank: 4 max len: 233 min len: 197 avg len: 213.4 num_loss_counted_tokens: 856 | |
total tokens: 2352 num samples: 12 num padding tokens: 96 - rank: 5 max len: 196 min len: 175 avg len: 188.0 num_loss_counted_tokens: 746 | |
total tokens: 2322 num samples: 9 num padding tokens: 84 - rank: 3 max len: 258 min len: 235 avg len: 248.66666666666666 num_loss_counted_tokens: 931 | |
total tokens: 2401 num samples: 7 num padding tokens: 115 - rank: 1 max len: 343 min len: 314 avg len: 326.57142857142856 num_loss_counted_tokens: 1266 | |
total tokens: 2368 num samples: 8 num padding tokens: 136 - rank: 2 max len: 296 min len: 263 avg len: 279.0 num_loss_counted_tokens: 989 | |
total tokens: 2270 num samples: 5 num padding tokens: 242 - rank: 0 max len: 454 min len: 363 avg len: 405.6 num_loss_counted_tokens: 907 | |
total tokens: 2520 num samples: 20 num padding tokens: 312 - rank: 7 max len: 126 min len: 90 avg len: 110.4 num_loss_counted_tokens: 667 | |
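[Editor's note: a minimal sketch, not part of ilab or DeepSpeed, checking that the step-73 numbers above are self-consistent. Assumption: world_size = 8, inferred from ranks 0-7 appearing in the log; the rank-to-value mapping below is matched via the per-rank loss lines.]

```python
# Sanity check of the loss arithmetic printed for step 73:
#   per-rank loss  = scaled_per_token_loss * num_loss_counted_tokens / world_size
#   "total loss"   = mean of the per-rank losses
WORLD_SIZE = 8                       # assumed: ranks 0-7 appear in the log
NUM_LOSS_COUNTED_TOKENS = 6708.0     # from the step-73 throughput line

# rank -> "Per-token loss scaled by world size", copied from the log
scaled = {
    0: 0.00045712426071986556,
    1: 0.0017177362460643053,
    2: 0.0015694088069722056,
    3: 0.0011505482252687216,
    4: 0.001568732550367713,
    5: 0.0012422791915014386,
    6: 0.0012018597917631269,
    7: 0.0006801890558563173,
}

per_rank_loss = {r: s * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE
                 for r, s in scaled.items()}
total_loss = sum(per_rank_loss.values()) / WORLD_SIZE

print(round(per_rank_loss[2], 4))  # 1.3159, as logged for rank 2
print(round(total_loss, 4))        # 1.0049, the logged "total loss"
```

The same relation holds at every step in this log, which is why the `loss:` field on the throughput line (a single rank's value) differs from `total loss:` (the cross-rank mean).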
Per-token loss scaled by world size: 0.000789767480455339
Per-token loss scaled by world size: 0.0023510174360126257
Per-token loss scaled by world size: 0.0017637384589761496
Per-token loss scaled by world size: 0.0011712139239534736
Per-token loss scaled by world size: 0.0013083573430776596
Per-token loss scaled by world size: 0.0006746900617145002
Per-token loss scaled by world size: 0.0010104808025062084
Epoch: 0, Step: 74, Rank: 4, loss = 0.7253026962280273 | |
Epoch: 0, Step: 74, Rank: 1, loss = 1.619773268699646
Epoch: 0, Step: 74, Rank: 2, loss = 2.1591155529022217
Epoch: 0, Step: 74, Rank: 6, loss = 1.0756136178970337 | |
Epoch: 0, Step: 74, Rank: 3, loss = 1.2015626430511475
Epoch: 0, Step: 74, Rank: 5, loss = 0.9280003309249878
Epoch: 0, Step: 74, Rank: 0, loss = 0.6196184754371643
Per-token loss scaled by world size: 0.0007146446732804179 | |
Epoch: 0, Step: 74, Rank: 7, loss = 0.6563118100166321 | |
[2024-06-27 16:42:47,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=74, skipped=0, lr=[3.844155844155845e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:48,057] [INFO] [timer.py:260:stop] epoch=0/micro_step=74/global_step=74, RunningAvgSamplesPerSec=95.59046723007711, CurrSamplesPerSec=95.13144976206104, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.02099672496966 samples/s, lr: 3.844155844155845e-06, loss: 0.6196184754371643 cuda_mem_allocated: 22.286946296691895 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7347.0 batch_size: 87.0 total loss: 1.1231622695922852 | |
Epoch 0: 35% 74/213 [01:37<02:27, 1.06s/it] total tokens: 2432 num samples: 8 num padding tokens: 150 - rank: 3 max len: 304 min len: 265 avg len: 285.25 num_loss_counted_tokens: 807 | |
total tokens: 2200 num samples: 5 num padding tokens: 105 - rank: 1 max len: 440 min len: 391 avg len: 419.0 num_loss_counted_tokens: 1284 | |
total tokens: 2214 num samples: 6 num padding tokens: 172 - rank: 2 max len: 369 min len: 309 avg len: 340.3333333333333 num_loss_counted_tokens: 995 | |
total tokens: 2331 num samples: 9 num padding tokens: 311 - rank: 4 max len: 259 min len: 194 avg len: 224.44444444444446 num_loss_counted_tokens: 831 | |
total tokens: 2496 num samples: 13 num padding tokens: 147 - rank: 5 max len: 192 min len: 170 avg len: 180.69230769230768 num_loss_counted_tokens: 846 | |
total tokens: 2505 num samples: 15 num padding tokens: 304 - rank: 6 max len: 167 min len: 127 avg len: 146.73333333333332 num_loss_counted_tokens: 816 | |
total tokens: 1856 num samples: 2 num padding tokens: 473 - rank: 0 max len: 928 min len: 455 avg len: 691.5 num_loss_counted_tokens: 635 | |
total tokens: 2520 num samples: 20 num padding tokens: 452 - rank: 7 max len: 126 min len: 79 avg len: 103.4 num_loss_counted_tokens: 524 | |
Per-token loss scaled by world size: 0.0008282875642180443
Per-token loss scaled by world size: 0.0007471221615560353
Per-token loss scaled by world size: 0.0008397966739721596
Per-token loss scaled by world size: 0.0005125603638589382
Per-token loss scaled by world size: 0.0010294088860973716
Per-token loss scaled by world size: 0.001366117037832737
Per-token loss scaled by world size: 0.0016779708676040173
Epoch: 0, Step: 75, Rank: 5, loss = 0.8489294648170471 | |
Epoch: 0, Step: 75, Rank: 4, loss = 0.8372951745986938
Epoch: 0, Step: 75, Rank: 6, loss = 0.7552471160888672
Epoch: 0, Step: 75, Rank: 7, loss = 0.5181344747543335
Epoch: 0, Step: 75, Rank: 3, loss = 1.040603756904602
Epoch: 0, Step: 75, Rank: 1, loss = 1.6962188482284546
Epoch: 0, Step: 75, Rank: 2, loss = 1.3809735774993896
Per-token loss scaled by world size: 0.0018006821628659964 | |
Epoch: 0, Step: 75, Rank: 0, loss = 1.8202645778656006 | |
[2024-06-27 16:42:49,047] [INFO] [logging.py:96:log_dist] [Rank 0] step=75, skipped=0, lr=[3.896103896103897e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:49,120] [INFO] [timer.py:260:stop] epoch=0/micro_step=75/global_step=75, RunningAvgSamplesPerSec=95.58685459530183, CurrSamplesPerSec=95.32746055477232, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.19611189668922 samples/s, lr: 3.896103896103897e-06, loss: 1.8202645778656006 cuda_mem_allocated: 22.303165912628174 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8087.0 batch_size: 89.0 total loss: 1.112208366394043 | |
Epoch 0: 35% 75/213 [01:38<02:26, 1.06s/it] total tokens: 2304 num samples: 8 num padding tokens: 244 - rank: 3 max len: 288 min len: 232 avg len: 257.5 num_loss_counted_tokens: 775 | |
total tokens: 2320 num samples: 10 num padding tokens: 132 - rank: 4 max len: 232 min len: 203 avg len: 218.8 num_loss_counted_tokens: 888 | |
total tokens: 2366 num samples: 14 num padding tokens: 250 - rank: 6 max len: 169 min len: 136 avg len: 151.14285714285714 num_loss_counted_tokens: 754 | |
total tokens: 2400 num samples: 12 num padding tokens: 170 - rank: 5 max len: 200 min len: 171 avg len: 185.83333333333334 num_loss_counted_tokens: 781 | |
total tokens: 2395 num samples: 5 num padding tokens: 248 - rank: 1 max len: 479 min len: 389 avg len: 429.4 num_loss_counted_tokens: 1245 | |
total tokens: 2429 num samples: 7 num padding tokens: 176 - rank: 2 max len: 347 min len: 300 avg len: 321.85714285714283 num_loss_counted_tokens: 1317 | |
total tokens: 2489 num samples: 19 num padding tokens: 422 - rank: 7 max len: 131 min len: 79 avg len: 108.78947368421052 num_loss_counted_tokens: 559 | |
total tokens: 1965 num samples: 3 num padding tokens: 32 - rank: 0 max len: 655 min len: 634 avg len: 644.3333333333334 num_loss_counted_tokens: 1061 | |
Per-token loss scaled by world size: 0.001695839804597199 | |
Per-token loss scaled by world size: 0.0007643710705451667
Per-token loss scaled by world size: 0.0016604990232735872
Per-token loss scaled by world size: 0.0012597354361787438
Per-token loss scaled by world size: 0.0008550084894523025
Per-token loss scaled by world size: 0.001196287921629846
Per-token loss scaled by world size: 0.0007154389168135822
Epoch: 0, Step: 76, Rank: 2, loss = 1.4298049211502075 | |
Epoch: 0, Step: 76, Rank: 1, loss = 1.400008201599121
Epoch: 0, Step: 76, Rank: 5, loss = 1.0621144771575928
Epoch: 0, Step: 76, Rank: 3, loss = 0.7208790183067322
Epoch: 0, Step: 76, Rank: 0, loss = 0.6444603800773621
Epoch: 0, Step: 76, Rank: 4, loss = 1.008620262145996 | |
Epoch: 0, Step: 76, Rank: 7, loss = 0.6032044291496277 | |
Per-token loss scaled by world size: 0.0013651588233187795 | |
Epoch: 0, Step: 76, Rank: 6, loss = 1.1509995460510254 | |
[2024-06-27 16:42:50,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=76, skipped=0, lr=[3.948051948051949e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:50,169] [INFO] [timer.py:260:stop] epoch=0/micro_step=76/global_step=76, RunningAvgSamplesPerSec=95.59920643927133, CurrSamplesPerSec=96.50959539633816, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 96.4055681826072 samples/s, lr: 3.948051948051949e-06, loss: 0.6444603800773621 cuda_mem_allocated: 22.246755599975586 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6745.0 batch_size: 75.0 total loss: 1.0025113821029663 | |
Epoch 0: 36% 76/213 [01:39<02:24, 1.06s/it] total tokens: 2262 num samples: 6 num padding tokens: 305 - rank: 2 max len: 377 min len: 275 avg len: 326.1666666666667 num_loss_counted_tokens: 962 | |
total tokens: 2412 num samples: 12 num padding tokens: 245 - rank: 5 max len: 201 min len: 169 avg len: 180.58333333333334 num_loss_counted_tokens: 785 | |
total tokens: 1423 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1423 min len: 1423 avg len: 1423.0 num_loss_counted_tokens: 53 | |
total tokens: 2530 num samples: 10 num padding tokens: 198 - rank: 3 max len: 253 min len: 216 avg len: 233.2 num_loss_counted_tokens: 1036 | |
total tokens: 2365 num samples: 11 num padding tokens: 82 - rank: 4 max len: 215 min len: 201 avg len: 207.54545454545453 num_loss_counted_tokens: 1109 | |
total tokens: 2512 num samples: 16 num padding tokens: 257 - rank: 6 max len: 157 min len: 122 avg len: 140.9375 num_loss_counted_tokens: 890 | |
total tokens: 2216 num samples: 4 num padding tokens: 100 - rank: 1 max len: 554 min len: 499 avg len: 529.0 num_loss_counted_tokens: 1627 | |
total tokens: 1560 num samples: 13 num padding tokens: 214 - rank: 7 max len: 120 min len: 82 avg len: 103.53846153846153 num_loss_counted_tokens: 313 | |
Per-token loss scaled by world size: 0.0004683450679294765
Per-token loss scaled by world size: 0.002021264052018523
Per-token loss scaled by world size: 0.0007273624651134014
Per-token loss scaled by world size: 0.0011792100267484784
Per-token loss scaled by world size: 0.0008225165074691176
Per-token loss scaled by world size: 0.0016094164457172155 | |
Per-token loss scaled by world size: 0.0012427542824298143 | |
Epoch: 0, Step: 77, Rank: 5, loss = 1.207805871963501
Epoch: 0, Step: 77, Rank: 2, loss = 2.070279598236084
Epoch: 0, Step: 77, Rank: 4, loss = 0.4797024428844452
Epoch: 0, Step: 77, Rank: 7, loss = 0.7450010180473328
Epoch: 0, Step: 77, Rank: 3, loss = 0.8424625396728516 | |
Epoch: 0, Step: 77, Rank: 1, loss = 1.6484447717666626 | |
Per-token loss scaled by world size: 0.0008087762980721891 | |
Epoch: 0, Step: 77, Rank: 6, loss = 0.8283891081809998 | |
Epoch: 0, Step: 77, Rank: 0, loss = 1.2728910446166992 | |
[2024-06-27 16:42:51,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=77, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:51,223] [INFO] [timer.py:260:stop] epoch=0/micro_step=77/global_step=77, RunningAvgSamplesPerSec=95.5900993727807, CurrSamplesPerSec=94.92095774052896, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.82430007432339 samples/s, lr: 4.000000000000001e-06, loss: 1.2728910446166992 cuda_mem_allocated: 22.258561611175537 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8194.0 batch_size: 70.0 total loss: 1.1368720531463623 | |
Epoch 0: 36% 77/213 [01:40<02:23, 1.06s/it] total tokens: 2530 num samples: 11 num padding tokens: 165 - rank: 4 max len: 230 min len: 200 avg len: 215.0 num_loss_counted_tokens: 925 | |
total tokens: 2408 num samples: 8 num padding tokens: 351 - rank: 3 max len: 301 min len: 231 avg len: 257.125 num_loss_counted_tokens: 956 | |
total tokens: 2440 num samples: 5 num padding tokens: 201 - rank: 1 max len: 488 min len: 413 avg len: 447.8 num_loss_counted_tokens: 1365 | |
total tokens: 2527 num samples: 19 num padding tokens: 471 - rank: 7 max len: 133 min len: 90 avg len: 108.21052631578948 num_loss_counted_tokens: 493 | |
total tokens: 2340 num samples: 12 num padding tokens: 180 - rank: 5 max len: 195 min len: 169 avg len: 180.0 num_loss_counted_tokens: 860 | |
total tokens: 2452 num samples: 4 num padding tokens: 187 - rank: 0 max len: 613 min len: 517 avg len: 566.25 num_loss_counted_tokens: 1685 | |
total tokens: 2520 num samples: 15 num padding tokens: 293 - rank: 6 max len: 168 min len: 133 avg len: 148.46666666666667 num_loss_counted_tokens: 822 | |
total tokens: 2412 num samples: 6 num padding tokens: 334 - rank: 2 max len: 402 min len: 304 avg len: 346.3333333333333 num_loss_counted_tokens: 1354 | |
Per-token loss scaled by world size: 0.00172456877771765 | |
Per-token loss scaled by world size: 0.001724603003822267
Per-token loss scaled by world size: 0.0007516497280448675
Per-token loss scaled by world size: 0.0008890745812095702
Per-token loss scaled by world size: 0.0010419513564556837
Per-token loss scaled by world size: 0.00047058751806616783
Per-token loss scaled by world size: 0.001333838328719139
Epoch: 0, Step: 78, Rank: 4, loss = 1.6023399829864502 | |
Epoch: 0, Step: 78, Rank: 7, loss = 0.6983765363693237
Epoch: 0, Step: 78, Rank: 5, loss = 0.8260614275932312
Epoch: 0, Step: 78, Rank: 0, loss = 0.43723464012145996
Epoch: 0, Step: 78, Rank: 2, loss = 1.6023718118667603
Epoch: 0, Step: 78, Rank: 3, loss = 1.2393025159835815
Epoch: 0, Step: 78, Rank: 1, loss = 0.9681030511856079
Per-token loss scaled by world size: 0.001018296810798347 | |
Epoch: 0, Step: 78, Rank: 6, loss = 0.9461250305175781 | |
[2024-06-27 16:42:52,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=78, skipped=0, lr=[4.051948051948053e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:52,284] [INFO] [timer.py:260:stop] epoch=0/micro_step=78/global_step=78, RunningAvgSamplesPerSec=95.57840715287408, CurrSamplesPerSec=94.70956840157199, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 94.60407082103654 samples/s, lr: 4.051948051948053e-06, loss: 0.43723464012145996 cuda_mem_allocated: 22.253076553344727 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7433.0 batch_size: 71.0 total loss: 1.0399892330169678 | |
Epoch 0: 37% 78/213 [01:41<02:22, 1.06s/it] total tokens: 2496 num samples: 6 num padding tokens: 302 - rank: 1 max len: 416 min len: 327 avg len: 365.6666666666667 num_loss_counted_tokens: 1387 | |
total tokens: 2310 num samples: 10 num padding tokens: 291 - rank: 4 max len: 231 min len: 182 avg len: 201.9 num_loss_counted_tokens: 709 | |
total tokens: 2421 num samples: 9 num padding tokens: 181 - rank: 3 max len: 269 min len: 234 avg len: 248.88888888888889 num_loss_counted_tokens: 951 | |
total tokens: 2268 num samples: 7 num padding tokens: 214 - rank: 2 max len: 324 min len: 271 avg len: 293.42857142857144 num_loss_counted_tokens: 1113 | |
total tokens: 2328 num samples: 3 num padding tokens: 633 - rank: 0 max len: 776 min len: 454 avg len: 565.0 num_loss_counted_tokens: 566 | |
total tokens: 2534 num samples: 14 num padding tokens: 170 - rank: 5 max len: 181 min len: 158 avg len: 168.85714285714286 num_loss_counted_tokens: 1027 | |
total tokens: 2496 num samples: 16 num padding tokens: 219 - rank: 6 max len: 156 min len: 130 avg len: 142.3125 num_loss_counted_tokens: 813 | |
total tokens: 2394 num samples: 19 num padding tokens: 447 - rank: 7 max len: 126 min len: 79 avg len: 102.47368421052632 num_loss_counted_tokens: 489 | |
Per-token loss scaled by world size: 0.0009908507345244288
Per-token loss scaled by world size: 0.0009623213554732502
Per-token loss scaled by world size: 0.0011945352889597416
Per-token loss scaled by world size: 0.0004565715789794922
Per-token loss scaled by world size: 0.0010352329118177295
Per-token loss scaled by world size: 0.0012398260878399014
Per-token loss scaled by world size: 0.0018819449469447136
Epoch: 0, Step: 79, Rank: 4, loss = 0.8860682249069214
Epoch: 0, Step: 79, Rank: 2, loss = 0.86055588722229
Epoch: 0, Step: 79, Rank: 1, loss = 1.682929277420044
Epoch: 0, Step: 79, Rank: 3, loss = 1.0682132244110107
Epoch: 0, Step: 79, Rank: 5, loss = 0.9257569909095764
Epoch: 0, Step: 79, Rank: 7, loss = 0.4082891345024109
Per-token loss scaled by world size: 0.0011221279855817556
Epoch: 0, Step: 79, Rank: 0, loss = 1.1087144613265991
Epoch: 0, Step: 79, Rank: 6, loss = 1.0034629106521606 | |
[2024-06-27 16:42:53,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=79, skipped=0, lr=[4.103896103896105e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:53,334] [INFO] [timer.py:260:stop] epoch=0/micro_step=79/global_step=79, RunningAvgSamplesPerSec=95.58351417361247, CurrSamplesPerSec=95.97325116858099, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.87338955289519 samples/s, lr: 4.103896103896105e-06, loss: 1.1087144613265991 cuda_mem_allocated: 22.255342483520508 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7154.0 batch_size: 84.0 total loss: 0.9929987788200378 | |
Epoch 0: 37% 79/213 [01:42<02:21, 1.06s/it] total tokens: 2484 num samples: 6 num padding tokens: 188 - rank: 1 max len: 414 min len: 365 avg len: 382.6666666666667 num_loss_counted_tokens: 1555 | |
total tokens: 2483 num samples: 13 num padding tokens: 218 - rank: 5 max len: 191 min len: 146 avg len: 174.23076923076923 num_loss_counted_tokens: 919 | |
total tokens: 2465 num samples: 17 num padding tokens: 238 - rank: 6 max len: 145 min len: 119 avg len: 131.0 num_loss_counted_tokens: 704 | |
total tokens: 2450 num samples: 7 num padding tokens: 317 - rank: 2 max len: 350 min len: 274 avg len: 304.7142857142857 num_loss_counted_tokens: 764 | |
total tokens: 2530 num samples: 11 num padding tokens: 244 - rank: 4 max len: 230 min len: 193 avg len: 207.8181818181818 num_loss_counted_tokens: 1023 | |
total tokens: 2124 num samples: 18 num padding tokens: 242 - rank: 7 max len: 118 min len: 80 avg len: 104.55555555555556 num_loss_counted_tokens: 525 | |
total tokens: 2104 num samples: 4 num padding tokens: 87 - rank: 0 max len: 526 min len: 483 avg len: 504.25 num_loss_counted_tokens: 1045 | |
total tokens: 2448 num samples: 9 num padding tokens: 142 - rank: 3 max len: 272 min len: 237 avg len: 256.22222222222223 num_loss_counted_tokens: 841 | |
Per-token loss scaled by world size: 0.001160839106887579
Per-token loss scaled by world size: 0.0007664890144951642
Per-token loss scaled by world size: 0.0015949681401252747
Per-token loss scaled by world size: 0.0011137451510876417
Per-token loss scaled by world size: 0.0023581674322485924
Per-token loss scaled by world size: 0.0005378788919188082
Per-token loss scaled by world size: 0.0016982860397547483
Epoch: 0, Step: 80, Rank: 6, loss = 0.7382247447967529
Epoch: 0, Step: 80, Rank: 4, loss = 1.1180331707000732
Epoch: 0, Step: 80, Rank: 2, loss = 2.271209955215454
Epoch: 0, Step: 80, Rank: 0, loss = 1.6356617212295532
Epoch: 0, Step: 80, Rank: 1, loss = 1.0726758241653442
Epoch: 0, Step: 80, Rank: 5, loss = 1.5361536741256714
Per-token loss scaled by world size: 0.0009483452304266393
Epoch: 0, Step: 80, Rank: 7, loss = 0.5180445909500122 | |
Epoch: 0, Step: 80, Rank: 3, loss = 0.9133750200271606 | |
[2024-06-27 16:42:54,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[4.155844155844157e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:54,392] [INFO] [timer.py:260:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=95.58115587525431, CurrSamplesPerSec=95.39991570092737, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.30848412776561 samples/s, lr: 4.155844155844157e-06, loss: 1.6356617212295532 cuda_mem_allocated: 22.30030393600464 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7705.0 batch_size: 76.0 total loss: 1.2254222631454468 | |
Epoch 0: 38% 80/213 [01:43<02:20, 1.06s/it] total tokens: 2380 num samples: 10 num padding tokens: 99 - rank: 4 max len: 238 min len: 214 avg len: 228.1 num_loss_counted_tokens: 1040 | |
total tokens: 2532 num samples: 12 num padding tokens: 317 - rank: 5 max len: 211 min len: 156 avg len: 184.58333333333334 num_loss_counted_tokens: 1056 | |
total tokens: 2520 num samples: 6 num padding tokens: 178 - rank: 1 max len: 420 min len: 358 avg len: 390.3333333333333 num_loss_counted_tokens: 801 | |
total tokens: 2288 num samples: 8 num padding tokens: 211 - rank: 3 max len: 286 min len: 239 avg len: 259.625 num_loss_counted_tokens: 672 | |
total tokens: 2415 num samples: 7 num padding tokens: 247 - rank: 2 max len: 345 min len: 294 avg len: 309.7142857142857 num_loss_counted_tokens: 1129 | |
total tokens: 2496 num samples: 16 num padding tokens: 201 - rank: 6 max len: 156 min len: 133 avg len: 143.4375 num_loss_counted_tokens: 900 | |
total tokens: 2376 num samples: 18 num padding tokens: 257 - rank: 7 max len: 132 min len: 84 avg len: 117.72222222222223 num_loss_counted_tokens: 526 | |
total tokens: 2076 num samples: 3 num padding tokens: 483 - rank: 0 max len: 692 min len: 425 avg len: 531.0 num_loss_counted_tokens: 782 | |
Per-token loss scaled by world size: 0.0015214183367788792
Per-token loss scaled by world size: 0.0013134705368429422
Per-token loss scaled by world size: 0.0019505138043314219
Per-token loss scaled by world size: 0.0011554965749382973
Per-token loss scaled by world size: 0.001456272671930492
Per-token loss scaled by world size: 0.0008338597253896296
Per-token loss scaled by world size: 0.000560227083042264
Epoch: 0, Step: 81, Rank: 1, loss = 1.1141513586044312 | |
Epoch: 0, Step: 81, Rank: 7, loss = 0.7073215246200562
Epoch: 0, Step: 81, Rank: 2, loss = 1.6545233726501465
Epoch: 0, Step: 81, Rank: 4, loss = 0.9801499247550964
Epoch: 0, Step: 81, Rank: 3, loss = 1.2905430793762207
Epoch: 0, Step: 81, Rank: 5, loss = 1.2352832555770874
Epoch: 0, Step: 81, Rank: 0, loss = 0.47521260380744934
Per-token loss scaled by world size: 0.0015612902352586389 | |
Epoch: 0, Step: 81, Rank: 6, loss = 1.324364423751831 | |
[2024-06-27 16:42:55,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=81, skipped=0, lr=[4.207792207792208e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:55,451] [INFO] [timer.py:260:stop] epoch=0/micro_step=81/global_step=81, RunningAvgSamplesPerSec=95.5792442356836, CurrSamplesPerSec=95.43037157470125, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.30841644892146 samples/s, lr: 4.207792207792208e-06, loss: 0.47521260380744934 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6786.0 batch_size: 86.0 total loss: 1.097693681716919 | |
Epoch 0: 38% 81/213 [01:44<02:19, 1.06s/it] total tokens: 2296 num samples: 7 num padding tokens: 246 - rank: 2 max len: 328 min len: 278 avg len: 292.85714285714283 num_loss_counted_tokens: 874 | |
total tokens: 2519 num samples: 11 num padding tokens: 349 - rank: 4 max len: 229 min len: 185 avg len: 197.27272727272728 num_loss_counted_tokens: 983 | |
total tokens: 2448 num samples: 9 num padding tokens: 185 - rank: 3 max len: 272 min len: 234 avg len: 251.44444444444446 num_loss_counted_tokens: 892 | |
total tokens: 2405 num samples: 13 num padding tokens: 204 - rank: 5 max len: 185 min len: 158 avg len: 169.30769230769232 num_loss_counted_tokens: 926 | |
total tokens: 2255 num samples: 5 num padding tokens: 197 - rank: 1 max len: 451 min len: 366 avg len: 411.6 num_loss_counted_tokens: 1065 | |
total tokens: 2480 num samples: 16 num padding tokens: 278 - rank: 6 max len: 155 min len: 125 avg len: 137.625 num_loss_counted_tokens: 774 | |
total tokens: 2125 num samples: 17 num padding tokens: 286 - rank: 7 max len: 125 min len: 81 avg len: 108.17647058823529 num_loss_counted_tokens: 525 | |
total tokens: 1902 num samples: 3 num padding tokens: 133 - rank: 0 max len: 634 min len: 536 avg len: 589.6666666666666 num_loss_counted_tokens: 707 | |
Per-token loss scaled by world size: 0.0011229239171370864
Per-token loss scaled by world size: 0.0008722442435100675
Per-token loss scaled by world size: 0.0006010057404637337
Per-token loss scaled by world size: 0.0011305308435112238
Per-token loss scaled by world size: 0.0013219299726188183
Per-token loss scaled by world size: 0.0013634813949465752
Per-token loss scaled by world size: 0.0010684422450140119
Epoch: 0, Step: 82, Rank: 4, loss = 1.104676365852356
Epoch: 0, Step: 82, Rank: 2, loss = 0.8580702543258667
Epoch: 0, Step: 82, Rank: 7, loss = 0.5912393927574158
Epoch: 0, Step: 82, Rank: 5, loss = 1.3004486560821533
Epoch: 0, Step: 82, Rank: 3, loss = 1.1121597290039062
Epoch: 0, Step: 82, Rank: 1, loss = 1.341324806213379 | |
Per-token loss scaled by world size: 0.000902304716873914 | |
Epoch: 0, Step: 82, Rank: 0, loss = 1.0510801076889038 | |
Epoch: 0, Step: 82, Rank: 6, loss = 0.8876422643661499 | |
[2024-06-27 16:42:56,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=82, skipped=0, lr=[4.25974025974026e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:56,498] [INFO] [timer.py:260:stop] epoch=0/micro_step=82/global_step=82, RunningAvgSamplesPerSec=95.594075131298, CurrSamplesPerSec=96.7804405307944, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 96.68151610017244 samples/s, lr: 4.25974025974026e-06, loss: 1.0510801076889038 cuda_mem_allocated: 22.252480506896973 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7870.0 batch_size: 69.0 total loss: 1.0308302640914917 | |
Epoch 0: 38% 82/213 [01:45<02:18, 1.05s/it] total tokens: 2025 num samples: 3 num padding tokens: 448 - rank: 1 max len: 675 min len: 415 avg len: 525.6666666666666 num_loss_counted_tokens: 884 | |
total tokens: 2457 num samples: 13 num padding tokens: 381 - rank: 6 max len: 189 min len: 135 avg len: 159.69230769230768 num_loss_counted_tokens: 747 | |
total tokens: 2322 num samples: 6 num padding tokens: 203 - rank: 2 max len: 387 min len: 327 avg len: 353.1666666666667 num_loss_counted_tokens: 1098 | |
total tokens: 2380 num samples: 10 num padding tokens: 198 - rank: 5 max len: 238 min len: 199 avg len: 218.2 num_loss_counted_tokens: 749 | |
total tokens: 2304 num samples: 8 num padding tokens: 209 - rank: 4 max len: 288 min len: 239 avg len: 261.875 num_loss_counted_tokens: 991 | |
total tokens: 2520 num samples: 8 num padding tokens: 126 - rank: 3 max len: 315 min len: 288 avg len: 299.25 num_loss_counted_tokens: 877 | |
total tokens: 2112 num samples: 2 num padding tokens: 349 - rank: 0 max len: 1056 min len: 707 avg len: 881.5 num_loss_counted_tokens: 67 | |
total tokens: 2430 num samples: 18 num padding tokens: 432 - rank: 7 max len: 135 min len: 85 avg len: 111.0 num_loss_counted_tokens: 615 | |
Per-token loss scaled by world size: 0.0019169342704117298
Per-token loss scaled by world size: 0.0010238218819722533
Per-token loss scaled by world size: 0.001062315539456904
Per-token loss scaled by world size: 0.0011165173491463065
Per-token loss scaled by world size: 0.00128621282055974
Per-token loss scaled by world size: 0.002304702065885067
Per-token loss scaled by world size: 0.0016947559779509902
Epoch: 0, Step: 83, Rank: 4, loss = 1.5848253965377808
Epoch: 0, Step: 83, Rank: 3, loss = 0.878269374370575
Epoch: 0, Step: 83, Rank: 6, loss = 0.9230807423591614
Epoch: 0, Step: 83, Rank: 0, loss = 0.8464447855949402
Epoch: 0, Step: 83, Rank: 5, loss = 1.0633764266967773
Epoch: 0, Step: 83, Rank: 2, loss = 1.401139497756958 | |
Epoch: 0, Step: 83, Rank: 1, loss = 1.9054124355316162 | |
Per-token loss scaled by world size: 0.0007454563747160137 | |
Epoch: 0, Step: 83, Rank: 7, loss = 0.6163060665130615 | |
[2024-06-27 16:42:57,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=83, skipped=0, lr=[4.311688311688312e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:57,550] [INFO] [timer.py:260:stop] epoch=0/micro_step=83/global_step=83, RunningAvgSamplesPerSec=95.60148813846139, CurrSamplesPerSec=96.19827702161118, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 96.09326832183318 samples/s, lr: 4.311688311688312e-06, loss: 0.8464447855949402 cuda_mem_allocated: 22.269653797149658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6614.0 batch_size: 78.0 total loss: 1.1523569822311401 | |
Epoch 0: 39% 83/213 [01:46<02:16, 1.05s/it] total tokens: 2366 num samples: 13 num padding tokens: 130 - rank: 5 max len: 182 min len: 156 avg len: 172.0 num_loss_counted_tokens: 768 | |
total tokens: 2490 num samples: 10 num padding tokens: 192 - rank: 3 max len: 249 min len: 205 avg len: 229.8 num_loss_counted_tokens: 879 | |
total tokens: 2480 num samples: 16 num padding tokens: 224 - rank: 6 max len: 155 min len: 123 avg len: 141.0 num_loss_counted_tokens: 785 | |
total tokens: 2443 num samples: 7 num padding tokens: 169 - rank: 1 max len: 349 min len: 306 avg len: 324.85714285714283 num_loss_counted_tokens: 982 | |
total tokens: 2460 num samples: 12 num padding tokens: 116 - rank: 4 max len: 205 min len: 182 avg len: 195.33333333333334 num_loss_counted_tokens: 1069 | |
total tokens: 2424 num samples: 8 num padding tokens: 169 - rank: 2 max len: 303 min len: 256 avg len: 281.875 num_loss_counted_tokens: 1202 | |
total tokens: 2074 num samples: 17 num padding tokens: 268 - rank: 7 max len: 122 min len: 90 avg len: 106.23529411764706 num_loss_counted_tokens: 510 | |
total tokens: 2368 num samples: 4 num padding tokens: 687 - rank: 0 max len: 592 min len: 356 avg len: 420.25 num_loss_counted_tokens: 483 | |
Per-token loss scaled by world size: 0.0016588460421189666
Per-token loss scaled by world size: 0.0019453763961791992
Per-token loss scaled by world size: 0.0014229760272428393
Per-token loss scaled by world size: 0.0010197835508733988
Per-token loss scaled by world size: 0.0008193189860321581
Per-token loss scaled by world size: 0.0015579606406390667
Per-token loss scaled by world size: 0.00024786771973595023
Epoch: 0, Step: 84, Rank: 4, loss = 1.4230825901031494 | |
Epoch: 0, Step: 84, Rank: 2, loss = 1.6688897609710693
Epoch: 0, Step: 84, Rank: 3, loss = 0.7028732895851135
Epoch: 0, Step: 84, Rank: 1, loss = 1.2207355499267578
Epoch: 0, Step: 84, Rank: 6, loss = 0.8748468160629272
Epoch: 0, Step: 84, Rank: 5, loss = 1.3365354537963867
Per-token loss scaled by world size: 0.001019681105390191
Epoch: 0, Step: 84, Rank: 0, loss = 0.21263952553272247
Epoch: 0, Step: 84, Rank: 7, loss = 0.8747588992118835 | |
[2024-06-27 16:42:58,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=84, skipped=0, lr=[4.363636363636364e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:58,605] [INFO] [timer.py:260:stop] epoch=0/micro_step=84/global_step=84, RunningAvgSamplesPerSec=95.60652494337405, CurrSamplesPerSec=96.01627625761489, MemAllocated=22.23GB, MaxMemAllocated=28.61GB | |
throughput: 95.903303368105 samples/s, lr: 4.363636363636364e-06, loss: 0.21263952553272247 cuda_mem_allocated: 22.230417251586914 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6863.0 batch_size: 75.0 total loss: 1.0392951965332031 | |
Epoch 0: 39% 84/213 [01:47<02:15, 1.05s/it] total tokens: 2513 num samples: 7 num padding tokens: 109 - rank: 1 max len: 359 min len: 333 avg len: 343.42857142857144 num_loss_counted_tokens: 1106 | |
total tokens: 2432 num samples: 16 num padding tokens: 245 - rank: 6 max len: 152 min len: 121 avg len: 136.6875 num_loss_counted_tokens: 844 | |
total tokens: 2480 num samples: 8 num padding tokens: 137 - rank: 2 max len: 310 min len: 263 avg len: 292.875 num_loss_counted_tokens: 1178 | |
total tokens: 2530 num samples: 10 num padding tokens: 217 - rank: 3 max len: 253 min len: 215 avg len: 231.3 num_loss_counted_tokens: 1144 | |
total tokens: 2365 num samples: 11 num padding tokens: 181 - rank: 4 max len: 215 min len: 183 avg len: 198.54545454545453 num_loss_counted_tokens: 1043 | |
total tokens: 2085 num samples: 3 num padding tokens: 332 - rank: 0 max len: 695 min len: 523 avg len: 584.3333333333334 num_loss_counted_tokens: 1090 | |
total tokens: 2261 num samples: 19 num padding tokens: 228 - rank: 7 max len: 119 min len: 87 avg len: 107.0 num_loss_counted_tokens: 525 | |
total tokens: 2379 num samples: 13 num padding tokens: 178 - rank: 5 max len: 183 min len: 158 avg len: 169.30769230769232 num_loss_counted_tokens: 871 | |
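The per-rank losses in each step block can be reproduced from the "Per-token loss scaled by world size" values and that step's `num_loss_counted_tokens`. A minimal sketch of the arithmetic, assuming a world size of 8 (eight ranks appear in the log) and using step 84's numbers; this is an inferred relationship, not the InstructLab training code itself:

```python
# Sketch only (not the training code): each rank's reported loss appears to
# equal its scaled per-token loss times num_loss_counted_tokens / world_size.
# world_size = 8 is assumed from the eight ranks visible above.

world_size = 8
num_loss_counted_tokens = 6863.0  # from step 84's throughput line


def rank_loss(per_token_loss_scaled: float) -> float:
    """Reconstruct a rank's reported loss from its scaled per-token loss."""
    return per_token_loss_scaled * num_loss_counted_tokens / world_size


# Rank 0's scaled value from step 84 reproduces its reported loss of ~0.2126.
print(rank_loss(0.00024786771973595023))
```

The "total loss" on each throughput line then matches the mean of the eight per-rank losses (1.0392951965332031 for step 84, up to float32 rounding).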
Per-token loss scaled by world size: 0.0008963113650679588
Per-token loss scaled by world size: 0.0012215877650305629
Per-token loss scaled by world size: 0.0016951096476987004
Per-token loss scaled by world size: 0.0009873019298538566
Per-token loss scaled by world size: 0.0014749136753380299
Per-token loss scaled by world size: 0.0014574956148862839
Per-token loss scaled by world size: 0.0008935710648074746
Epoch: 0, Step: 85, Rank: 5, loss = 1.078051209449768
Epoch: 0, Step: 85, Rank: 7, loss = 0.7909947633743286
Epoch: 0, Step: 85, Rank: 2, loss = 1.495934247970581
Epoch: 0, Step: 85, Rank: 4, loss = 0.8712939023971558
Epoch: 0, Step: 85, Rank: 6, loss = 1.301611304283142
Epoch: 0, Step: 85, Rank: 3, loss = 1.2862398624420166
Epoch: 0, Step: 85, Rank: 1, loss = 0.7885764837265015
Per-token loss scaled by world size: 0.0007795032579451799 | |
Epoch: 0, Step: 85, Rank: 0, loss = 0.6879116296768188 | |
[2024-06-27 16:42:59,599] [INFO] [logging.py:96:log_dist] [Rank 0] step=85, skipped=0, lr=[4.415584415584416e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:42:59,672] [INFO] [timer.py:260:stop] epoch=0/micro_step=85/global_step=85, RunningAvgSamplesPerSec=95.597124104535, CurrSamplesPerSec=94.83249625936865, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.72786134576255 samples/s, lr: 4.415584415584416e-06, loss: 0.6879116296768188 cuda_mem_allocated: 22.307936668395996 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7060.0 batch_size: 87.0 total loss: 1.037576675415039 | |
Epoch 0: 40% 85/213 [01:48<02:15, 1.06s/it]
total tokens: 2418 num samples: 13 num padding tokens: 198 - rank: 5 max len: 186 min len: 147 avg len: 170.76923076923077 num_loss_counted_tokens: 794
total tokens: 2508 num samples: 11 num padding tokens: 213 - rank: 4 max len: 228 min len: 190 avg len: 208.63636363636363 num_loss_counted_tokens: 865 | |
total tokens: 2490 num samples: 10 num padding tokens: 51 - rank: 3 max len: 249 min len: 237 avg len: 243.9 num_loss_counted_tokens: 834 | |
total tokens: 2424 num samples: 8 num padding tokens: 126 - rank: 1 max len: 303 min len: 277 avg len: 287.25 num_loss_counted_tokens: 1198 | |
total tokens: 2482 num samples: 17 num padding tokens: 281 - rank: 6 max len: 146 min len: 119 avg len: 129.47058823529412 num_loss_counted_tokens: 684 | |
total tokens: 2466 num samples: 9 num padding tokens: 110 - rank: 2 max len: 274 min len: 249 avg len: 261.77777777777777 num_loss_counted_tokens: 936 | |
total tokens: 2478 num samples: 21 num padding tokens: 345 - rank: 7 max len: 118 min len: 83 avg len: 101.57142857142857 num_loss_counted_tokens: 516 | |
total tokens: 2448 num samples: 6 num padding tokens: 367 - rank: 0 max len: 408 min len: 304 avg len: 346.8333333333333 num_loss_counted_tokens: 922 | |
Per-token loss scaled by world size: 0.0009426996111869812
Per-token loss scaled by world size: 0.0017545504961162806
Per-token loss scaled by world size: 0.0008898203959688544
Per-token loss scaled by world size: 0.0015400846023112535
Per-token loss scaled by world size: 0.0008276195731014013
Per-token loss scaled by world size: 0.0017029775772243738
Per-token loss scaled by world size: 0.0019132475135847926
Epoch: 0, Step: 86, Rank: 4, loss = 0.7449683547019958
Epoch: 0, Step: 86, Rank: 5, loss = 1.386533498764038
Epoch: 0, Step: 86, Rank: 1, loss = 1.2170518636703491
Epoch: 0, Step: 86, Rank: 6, loss = 1.345777988433838
Epoch: 0, Step: 86, Rank: 7, loss = 0.6540263891220093
Epoch: 0, Step: 86, Rank: 0, loss = 0.7031805515289307
Per-token loss scaled by world size: 0.0018686041003093123
Epoch: 0, Step: 86, Rank: 2, loss = 1.5119438171386719
Epoch: 0, Step: 86, Rank: 3, loss = 1.476664423942566 | |
[2024-06-27 16:43:00,650] [INFO] [logging.py:96:log_dist] [Rank 0] step=86, skipped=0, lr=[4.467532467532468e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:00,724] [INFO] [timer.py:260:stop] epoch=0/micro_step=86/global_step=86, RunningAvgSamplesPerSec=95.6053182623432, CurrSamplesPerSec=96.29036534808219, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 96.17814827775237 samples/s, lr: 4.467532467532468e-06, loss: 0.7031805515289307 cuda_mem_allocated: 22.25104808807373 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6322.0 batch_size: 80.0 total loss: 1.1300183534622192 | |
Epoch 0: 40% 86/213 [01:49<02:14, 1.06s/it]
total tokens: 2415 num samples: 15 num padding tokens: 127 - rank: 6 max len: 161 min len: 137 avg len: 152.53333333333333 num_loss_counted_tokens: 803
total tokens: 2349 num samples: 9 num padding tokens: 205 - rank: 3 max len: 261 min len: 231 avg len: 238.22222222222223 num_loss_counted_tokens: 961 | |
total tokens: 2443 num samples: 7 num padding tokens: 312 - rank: 2 max len: 349 min len: 262 avg len: 304.42857142857144 num_loss_counted_tokens: 1160 | |
total tokens: 2470 num samples: 5 num padding tokens: 306 - rank: 1 max len: 494 min len: 362 avg len: 432.8 num_loss_counted_tokens: 1010 | |
total tokens: 2310 num samples: 10 num padding tokens: 170 - rank: 4 max len: 231 min len: 195 avg len: 214.0 num_loss_counted_tokens: 1026 | |
total tokens: 1774 num samples: 2 num padding tokens: 191 - rank: 0 max len: 887 min len: 696 avg len: 791.5 num_loss_counted_tokens: 657 | |
total tokens: 2522 num samples: 13 num padding tokens: 176 - rank: 5 max len: 194 min len: 168 avg len: 180.46153846153845 num_loss_counted_tokens: 1085 | |
total tokens: 2470 num samples: 19 num padding tokens: 472 - rank: 7 max len: 130 min len: 86 avg len: 105.15789473684211 num_loss_counted_tokens: 532 | |
Per-token loss scaled by world size: 0.0015821176348254085
Per-token loss scaled by world size: 0.0015043980674818158
Per-token loss scaled by world size: 0.001299357390962541
Per-token loss scaled by world size: 0.0015812250785529613
Per-token loss scaled by world size: 0.0013089682906866074
Per-token loss scaled by world size: 0.0008274533902294934
Per-token loss scaled by world size: 0.0008156728581525385
Epoch: 0, Step: 87, Rank: 2, loss = 1.203330397605896
Epoch: 0, Step: 87, Rank: 4, loss = 1.2654963731765747
Epoch: 0, Step: 87, Rank: 5, loss = 1.264782428741455
Epoch: 0, Step: 87, Rank: 1, loss = 1.0470110177993774
Epoch: 0, Step: 87, Rank: 7, loss = 0.6524363160133362
Epoch: 0, Step: 87, Rank: 3, loss = 0.6618592739105225 | |
Epoch: 0, Step: 87, Rank: 6, loss = 1.0393234491348267 | |
Per-token loss scaled by world size: 0.002122139558196068 | |
Epoch: 0, Step: 87, Rank: 0, loss = 1.697446346282959 | |
[2024-06-27 16:43:01,720] [INFO] [logging.py:96:log_dist] [Rank 0] step=87, skipped=0, lr=[4.51948051948052e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:01,791] [INFO] [timer.py:260:stop] epoch=0/micro_step=87/global_step=87, RunningAvgSamplesPerSec=95.59505615084385, CurrSamplesPerSec=94.7408333409803, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.6213224226909 samples/s, lr: 4.51948051948052e-06, loss: 1.697446346282959 cuda_mem_allocated: 22.305790424346924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6399.0 batch_size: 98.0 total loss: 1.103960633277893 | |
Epoch 0: 41% 87/213 [01:51<02:13, 1.06s/it]
total tokens: 2533 num samples: 17 num padding tokens: 527 - rank: 6 max len: 149 min len: 103 avg len: 118.0 num_loss_counted_tokens: 636
total tokens: 2442 num samples: 11 num padding tokens: 152 - rank: 4 max len: 222 min len: 187 avg len: 208.1818181818182 num_loss_counted_tokens: 811 | |
total tokens: 2418 num samples: 13 num padding tokens: 144 - rank: 5 max len: 186 min len: 154 avg len: 174.92307692307693 num_loss_counted_tokens: 824 | |
total tokens: 2220 num samples: 5 num padding tokens: 231 - rank: 1 max len: 444 min len: 361 avg len: 397.8 num_loss_counted_tokens: 997 | |
total tokens: 2492 num samples: 7 num padding tokens: 337 - rank: 2 max len: 356 min len: 278 avg len: 307.85714285714283 num_loss_counted_tokens: 922 | |
total tokens: 2466 num samples: 9 num padding tokens: 249 - rank: 3 max len: 274 min len: 224 avg len: 246.33333333333334 num_loss_counted_tokens: 1177 | |
total tokens: 582 num samples: 6 num padding tokens: 48 - rank: 7 max len: 97 min len: 81 avg len: 89.0 num_loss_counted_tokens: 113 | |
total tokens: 2346 num samples: 3 num padding tokens: 111 - rank: 0 max len: 782 min len: 680 avg len: 745.0 num_loss_counted_tokens: 1849 | |
Per-token loss scaled by world size: 0.001375011052004993
Per-token loss scaled by world size: 0.0013483152724802494
Per-token loss scaled by world size: 0.0020987342577427626
Per-token loss scaled by world size: 0.0017517569940537214
Per-token loss scaled by world size: 0.001406427356414497
Per-token loss scaled by world size: 0.001403752132318914
Per-token loss scaled by world size: 0.00018736824858933687
Epoch: 0, Step: 88, Rank: 3, loss = 1.010633111000061
Epoch: 0, Step: 88, Rank: 6, loss = 0.9910117387771606
Epoch: 0, Step: 88, Rank: 4, loss = 1.542569637298584
Epoch: 0, Step: 88, Rank: 1, loss = 1.287541389465332
Epoch: 0, Step: 88, Rank: 5, loss = 1.033724069595337
Epoch: 0, Step: 88, Rank: 0, loss = 1.0317578315734863
Per-token loss scaled by world size: 0.00247918046079576
Epoch: 0, Step: 88, Rank: 7, loss = 0.1377156674861908
Epoch: 0, Step: 88, Rank: 2, loss = 1.822197675704956 | |
[2024-06-27 16:43:02,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=88, skipped=0, lr=[4.571428571428572e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:02,832] [INFO] [timer.py:260:stop] epoch=0/micro_step=88/global_step=88, RunningAvgSamplesPerSec=95.61524178052744, CurrSamplesPerSec=97.36274753977538, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 97.2533504015458 samples/s, lr: 4.571428571428572e-06, loss: 1.0317578315734863 cuda_mem_allocated: 22.222307205200195 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5880.0 batch_size: 70.0 total loss: 1.1071438789367676 | |
Epoch 0: 41% 88/213 [01:52<02:11, 1.05s/it]
total tokens: 2451 num samples: 19 num padding tokens: 299 - rank: 7 max len: 129 min len: 92 avg len: 113.26315789473684 num_loss_counted_tokens: 649
total tokens: 2400 num samples: 10 num padding tokens: 254 - rank: 4 max len: 240 min len: 189 avg len: 214.6 num_loss_counted_tokens: 1011 | |
total tokens: 2511 num samples: 9 num padding tokens: 158 - rank: 3 max len: 279 min len: 242 avg len: 261.44444444444446 num_loss_counted_tokens: 1064 | |
total tokens: 2400 num samples: 16 num padding tokens: 136 - rank: 6 max len: 150 min len: 132 avg len: 141.5 num_loss_counted_tokens: 836 | |
total tokens: 2366 num samples: 13 num padding tokens: 202 - rank: 5 max len: 182 min len: 151 avg len: 166.46153846153845 num_loss_counted_tokens: 719 | |
total tokens: 2450 num samples: 5 num padding tokens: 375 - rank: 1 max len: 490 min len: 368 avg len: 415.0 num_loss_counted_tokens: 700 | |
total tokens: 2480 num samples: 8 num padding tokens: 102 - rank: 2 max len: 310 min len: 285 avg len: 297.25 num_loss_counted_tokens: 1358 | |
total tokens: 2028 num samples: 2 num padding tokens: 281 - rank: 0 max len: 1014 min len: 733 avg len: 873.5 num_loss_counted_tokens: 716 | |
Per-token loss scaled by world size: 0.002198912436142564
Per-token loss scaled by world size: 0.001618594047613442
Per-token loss scaled by world size: 0.001716773840598762
Per-token loss scaled by world size: 0.0017455482156947255
Per-token loss scaled by world size: 0.0003748796880245209
Per-token loss scaled by world size: 0.0010032765567302704
Epoch: 0, Step: 89, Rank: 4, loss = 1.2910311222076416
Epoch: 0, Step: 89, Rank: 3, loss = 1.7539074420928955
Epoch: 0, Step: 89, Rank: 6, loss = 1.3693417310714722
Epoch: 0, Step: 89, Rank: 1, loss = 0.2990134060382843
Epoch: 0, Step: 89, Rank: 2, loss = 1.392292857170105 | |
Epoch: 0, Step: 89, Rank: 7, loss = 0.8002384305000305 | |
Per-token loss scaled by world size: 0.00016662481357343495 | |
Per-token loss scaled by world size: 0.0017450222512707114 | |
Epoch: 0, Step: 89, Rank: 5, loss = 1.3918733596801758 | |
Epoch: 0, Step: 89, Rank: 0, loss = 0.13290411233901978 | |
[2024-06-27 16:43:03,812] [INFO] [logging.py:96:log_dist] [Rank 0] step=89, skipped=0, lr=[4.623376623376624e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:03,888] [INFO] [timer.py:260:stop] epoch=0/micro_step=89/global_step=89, RunningAvgSamplesPerSec=95.61529867072625, CurrSamplesPerSec=95.62019148109555, MemAllocated=22.17GB, MaxMemAllocated=28.61GB | |
throughput: 95.508261464518 samples/s, lr: 4.623376623376624e-06, loss: 0.13290411233901978 cuda_mem_allocated: 22.17030906677246 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6381.0 batch_size: 63.0 total loss: 1.0538253784179688 | |
Epoch 0: 42% 89/213 [01:53<02:10, 1.05s/it]
total tokens: 2484 num samples: 9 num padding tokens: 183 - rank: 3 max len: 276 min len: 238 avg len: 255.66666666666666 num_loss_counted_tokens: 937
total tokens: 2505 num samples: 15 num padding tokens: 382 - rank: 6 max len: 167 min len: 120 avg len: 141.53333333333333 num_loss_counted_tokens: 766 | |
total tokens: 2345 num samples: 7 num padding tokens: 187 - rank: 2 max len: 335 min len: 299 avg len: 308.2857142857143 num_loss_counted_tokens: 1223 | |
total tokens: 2115 num samples: 5 num padding tokens: 175 - rank: 1 max len: 423 min len: 353 avg len: 388.0 num_loss_counted_tokens: 1037 | |
total tokens: 2360 num samples: 10 num padding tokens: 132 - rank: 4 max len: 236 min len: 211 avg len: 222.8 num_loss_counted_tokens: 832 | |
total tokens: 2508 num samples: 12 num padding tokens: 194 - rank: 5 max len: 209 min len: 168 avg len: 192.83333333333334 num_loss_counted_tokens: 1088 | |
total tokens: 1800 num samples: 15 num padding tokens: 210 - rank: 7 max len: 120 min len: 86 avg len: 106.0 num_loss_counted_tokens: 370 | |
total tokens: 2028 num samples: 3 num padding tokens: 342 - rank: 0 max len: 676 min len: 427 avg len: 562.0 num_loss_counted_tokens: 1008 | |
Per-token loss scaled by world size: 0.0015429630875587463
Per-token loss scaled by world size: 0.0017726727528497577
Per-token loss scaled by world size: 0.0008002962567843497
Per-token loss scaled by world size: 0.0008999903802759945
Per-token loss scaled by world size: 0.001257456955499947
Per-token loss scaled by world size: 0.0010317731648683548
Per-token loss scaled by world size: 0.0012257853522896767
Epoch: 0, Step: 90, Rank: 5, loss = 0.7261688113212585
Epoch: 0, Step: 90, Rank: 1, loss = 1.6084789037704468
Epoch: 0, Step: 90, Rank: 0, loss = 0.9362052083015442
Epoch: 0, Step: 90, Rank: 2, loss = 1.4000461101531982
Epoch: 0, Step: 90, Rank: 3, loss = 1.1122469902038574
Epoch: 0, Step: 90, Rank: 4, loss = 0.8166287541389465
Epoch: 0, Step: 90, Rank: 6, loss = 1.1409850120544434
Per-token loss scaled by world size: 0.0009240962681360543 | |
Epoch: 0, Step: 90, Rank: 7, loss = 0.8385018706321716 | |
[2024-06-27 16:43:04,868] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[4.675324675324676e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:04,942] [INFO] [timer.py:260:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=95.62186205499108, CurrSamplesPerSec=96.19634650287894, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 96.10113487466258 samples/s, lr: 4.675324675324676e-06, loss: 0.9362052083015442 cuda_mem_allocated: 22.285276889801025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7259.0 batch_size: 85.0 total loss: 1.0724077224731445 | |
Epoch 0: 42% 90/213 [01:54<02:09, 1.05s/it]
total tokens: 2443 num samples: 7 num padding tokens: 193 - rank: 2 max len: 349 min len: 299 avg len: 321.42857142857144 num_loss_counted_tokens: 717
total tokens: 1302 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1302 min len: 1302 avg len: 1302.0 num_loss_counted_tokens: 1174 | |
total tokens: 2398 num samples: 11 num padding tokens: 204 - rank: 5 max len: 218 min len: 169 avg len: 199.45454545454547 num_loss_counted_tokens: 1233 | |
total tokens: 2510 num samples: 10 num padding tokens: 171 - rank: 4 max len: 251 min len: 220 avg len: 233.9 num_loss_counted_tokens: 1092 | |
total tokens: 2115 num samples: 5 num padding tokens: 201 - rank: 1 max len: 423 min len: 350 avg len: 382.8 num_loss_counted_tokens: 1102 | |
total tokens: 2520 num samples: 15 num padding tokens: 182 - rank: 6 max len: 168 min len: 138 avg len: 155.86666666666667 num_loss_counted_tokens: 950 | |
total tokens: 2312 num samples: 8 num padding tokens: 158 - rank: 3 max len: 289 min len: 252 avg len: 269.25 num_loss_counted_tokens: 990 | |
total tokens: 2430 num samples: 18 num padding tokens: 492 - rank: 7 max len: 135 min len: 88 avg len: 107.66666666666667 num_loss_counted_tokens: 517 | |
Per-token loss scaled by world size: 0.0012386260787025094
Per-token loss scaled by world size: 0.0010084491223096848
Per-token loss scaled by world size: 0.0012216198956593871
Per-token loss scaled by world size: 0.002200382761657238
Per-token loss scaled by world size: 0.0009146772790700197
Per-token loss scaled by world size: 0.0025044328067451715 | |
Per-token loss scaled by world size: 0.0003536655567586422 | |
Epoch: 0, Step: 91, Rank: 6, loss = 1.043232798576355 | |
Epoch: 0, Step: 91, Rank: 3, loss = 0.8493662476539612
Epoch: 0, Step: 91, Rank: 4, loss = 1.0289093255996704
Epoch: 0, Step: 91, Rank: 5, loss = 0.7703869342803955
Epoch: 0, Step: 91, Rank: 1, loss = 2.109358549118042 | |
Epoch: 0, Step: 91, Rank: 2, loss = 1.8532724380493164 | |
Per-token loss scaled by world size: 0.0006543058552779257 | |
Epoch: 0, Step: 91, Rank: 0, loss = 0.2978748083114624 | |
Epoch: 0, Step: 91, Rank: 7, loss = 0.5510891079902649 | |
[2024-06-27 16:43:05,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=91, skipped=0, lr=[4.727272727272728e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:05,996] [INFO] [timer.py:260:stop] epoch=0/micro_step=91/global_step=91, RunningAvgSamplesPerSec=95.60866132169923, CurrSamplesPerSec=94.46109835817516, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 94.37583622681522 samples/s, lr: 4.727272727272728e-06, loss: 0.2978748083114624 cuda_mem_allocated: 22.235901832580566 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6738.0 batch_size: 78.0 total loss: 1.0629363059997559 | |
Epoch 0: 43% 91/213 [01:55<02:08, 1.05s/it]
total tokens: 2387 num samples: 7 num padding tokens: 115 - rank: 1 max len: 341 min len: 313 avg len: 324.57142857142856 num_loss_counted_tokens: 1116
total tokens: 2431 num samples: 17 num padding tokens: 205 - rank: 6 max len: 143 min len: 121 avg len: 130.94117647058823 num_loss_counted_tokens: 845 | |
total tokens: 2392 num samples: 8 num padding tokens: 133 - rank: 2 max len: 299 min len: 265 avg len: 282.375 num_loss_counted_tokens: 1023 | |
total tokens: 2365 num samples: 11 num padding tokens: 201 - rank: 4 max len: 215 min len: 176 avg len: 196.72727272727272 num_loss_counted_tokens: 970 | |
total tokens: 2358 num samples: 9 num padding tokens: 176 - rank: 3 max len: 262 min len: 229 avg len: 242.44444444444446 num_loss_counted_tokens: 859 | |
total tokens: 2436 num samples: 14 num padding tokens: 232 - rank: 5 max len: 174 min len: 145 avg len: 157.42857142857142 num_loss_counted_tokens: 909 | |
total tokens: 2060 num samples: 4 num padding tokens: 423 - rank: 0 max len: 515 min len: 347 avg len: 409.25 num_loss_counted_tokens: 692 | |
total tokens: 2478 num samples: 21 num padding tokens: 374 - rank: 7 max len: 118 min len: 78 avg len: 100.19047619047619 num_loss_counted_tokens: 529 | |
Per-token loss scaled by world size: 0.0010861437767744064
Per-token loss scaled by world size: 0.0011819093488156796
Per-token loss scaled by world size: 0.000861181877553463
Per-token loss scaled by world size: 0.001792823662981391
Per-token loss scaled by world size: 0.001157134771347046
Per-token loss scaled by world size: 0.0018526121275499463
Per-token loss scaled by world size: 0.0008318768814206123
Epoch: 0, Step: 92, Rank: 4, loss = 1.0903114080429077
Epoch: 0, Step: 92, Rank: 6, loss = 1.0019676685333252
Epoch: 0, Step: 92, Rank: 5, loss = 0.7944402694702148
Epoch: 0, Step: 92, Rank: 2, loss = 1.7090346813201904
Epoch: 0, Step: 92, Rank: 1, loss = 1.6538798809051514
Epoch: 0, Step: 92, Rank: 3, loss = 0.7674064040184021
Epoch: 0, Step: 92, Rank: 0, loss = 1.067456841468811
Per-token loss scaled by world size: 0.0004880076739937067 | |
Epoch: 0, Step: 92, Rank: 7, loss = 0.450187087059021 | |
[2024-06-27 16:43:06,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=92, skipped=0, lr=[4.779220779220779e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:07,048] [INFO] [timer.py:260:stop] epoch=0/micro_step=92/global_step=92, RunningAvgSamplesPerSec=95.6158212565109, CurrSamplesPerSec=96.25737917601084, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 96.16243716891807 samples/s, lr: 4.779220779220779e-06, loss: 1.067456841468811 cuda_mem_allocated: 22.248901844024658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7380.0 batch_size: 78.0 total loss: 1.0668354034423828 | |
Epoch 0: 43% 92/213 [01:56<02:07, 1.05s/it]
total tokens: 2445 num samples: 15 num padding tokens: 186 - rank: 6 max len: 163 min len: 138 avg len: 150.6 num_loss_counted_tokens: 1027
total tokens: 2358 num samples: 9 num padding tokens: 142 - rank: 3 max len: 262 min len: 231 avg len: 246.22222222222223 num_loss_counted_tokens: 922 | |
total tokens: 2519 num samples: 11 num padding tokens: 129 - rank: 4 max len: 229 min len: 198 avg len: 217.27272727272728 num_loss_counted_tokens: 862 | |
total tokens: 2436 num samples: 6 num padding tokens: 238 - rank: 1 max len: 406 min len: 319 avg len: 366.3333333333333 num_loss_counted_tokens: 1240 | |
total tokens: 2496 num samples: 8 num padding tokens: 176 - rank: 2 max len: 312 min len: 275 avg len: 290.0 num_loss_counted_tokens: 1057 | |
total tokens: 2522 num samples: 13 num padding tokens: 172 - rank: 5 max len: 194 min len: 164 avg len: 180.76923076923077 num_loss_counted_tokens: 983 | |
total tokens: 2248 num samples: 4 num padding tokens: 262 - rank: 0 max len: 562 min len: 445 avg len: 496.5 num_loss_counted_tokens: 1124 | |
total tokens: 2376 num samples: 18 num padding tokens: 392 - rank: 7 max len: 132 min len: 89 avg len: 110.22222222222223 num_loss_counted_tokens: 506 | |
Per-token loss scaled by world size: 0.0023113649804145098
Per-token loss scaled by world size: 0.0009054794791154563
Per-token loss scaled by world size: 0.0015215310268104076
Per-token loss scaled by world size: 0.0003785519802477211
Per-token loss scaled by world size: 0.001724438858218491
Per-token loss scaled by world size: 0.0016183024272322655
Per-token loss scaled by world size: 2.7260410206508823e-05
Epoch: 0, Step: 93, Rank: 5, loss = 0.7668279409408569
Epoch: 0, Step: 93, Rank: 2, loss = 1.2885465621948242
Epoch: 0, Step: 93, Rank: 3, loss = 1.370499849319458
Epoch: 0, Step: 93, Rank: 4, loss = 1.4603841304779053
Epoch: 0, Step: 93, Rank: 1, loss = 1.95743727684021 | |
Epoch: 0, Step: 93, Rank: 7, loss = 0.3205862045288086 | |
Per-token loss scaled by world size: 0.001355792977847159 | |
Epoch: 0, Step: 93, Rank: 0, loss = 0.02308616042137146 | |
Epoch: 0, Step: 93, Rank: 6, loss = 1.1481871604919434 | |
[2024-06-27 16:43:08,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=93, skipped=0, lr=[4.831168831168831e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:08,137] [INFO] [timer.py:260:stop] epoch=0/micro_step=93/global_step=93, RunningAvgSamplesPerSec=95.58558191144411, CurrSamplesPerSec=92.94019790462077, MemAllocated=22.18GB, MaxMemAllocated=28.61GB | |
throughput: 92.83682658960338 samples/s, lr: 4.831168831168831e-06, loss: 0.02308616042137146 cuda_mem_allocated: 22.18426275253296 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6775.0 batch_size: 73.0 total loss: 1.0419445037841797 | |
Epoch 0: 44% 93/213 [01:57<02:07, 1.06s/it]
total tokens: 2497 num samples: 11 num padding tokens: 129 - rank: 4 max len: 227 min len: 197 avg len: 215.27272727272728 num_loss_counted_tokens: 785
total tokens: 2349 num samples: 9 num padding tokens: 138 - rank: 3 max len: 261 min len: 232 avg len: 245.66666666666666 num_loss_counted_tokens: 734 | |
total tokens: 2232 num samples: 18 num padding tokens: 324 - rank: 7 max len: 124 min len: 80 avg len: 106.0 num_loss_counted_tokens: 516 | |
total tokens: 2490 num samples: 15 num padding tokens: 302 - rank: 6 max len: 166 min len: 124 avg len: 145.86666666666667 num_loss_counted_tokens: 902 | |
total tokens: 2416 num samples: 8 num padding tokens: 121 - rank: 2 max len: 302 min len: 264 avg len: 286.875 num_loss_counted_tokens: 847 | |
total tokens: 2331 num samples: 7 num padding tokens: 102 - rank: 1 max len: 333 min len: 303 avg len: 318.42857142857144 num_loss_counted_tokens: 1131 | |
total tokens: 2340 num samples: 12 num padding tokens: 173 - rank: 5 max len: 195 min len: 167 avg len: 180.58333333333334 num_loss_counted_tokens: 859 | |
total tokens: 2360 num samples: 5 num padding tokens: 428 - rank: 0 max len: 472 min len: 338 avg len: 386.4 num_loss_counted_tokens: 1056 | |
Per-token loss scaled by world size: 0.0018750750459730625
Per-token loss scaled by world size: 0.0012753132032230496
Per-token loss scaled by world size: 0.0008121962891891599
Per-token loss scaled by world size: 0.0009522992768324912
Per-token loss scaled by world size: 0.0004623029672075063
Per-token loss scaled by world size: 0.0008761505596339703
Per-token loss scaled by world size: 0.0009124188218265772 | |
Epoch: 0, Step: 94, Rank: 2, loss = 1.9828919172286987 | |
Epoch: 0, Step: 94, Rank: 6, loss = 0.8588975667953491 | |
Epoch: 0, Step: 94, Rank: 7, loss = 0.48888537287712097 | |
Epoch: 0, Step: 94, Rank: 5, loss = 1.007056474685669
Epoch: 0, Step: 94, Rank: 1, loss = 1.3486436605453491
Epoch: 0, Step: 94, Rank: 4, loss = 0.9265292286872864
Epoch: 0, Step: 94, Rank: 3, loss = 0.9648829102516174
Per-token loss scaled by world size: 0.0020132879726588726 | |
Epoch: 0, Step: 94, Rank: 0, loss = 2.129051923751831 | |
[2024-06-27 16:43:09,132] [INFO] [logging.py:96:log_dist] [Rank 0] step=94, skipped=0, lr=[4.883116883116883e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:09,206] [INFO] [timer.py:260:stop] epoch=0/micro_step=94/global_step=94, RunningAvgSamplesPerSec=95.57660419808835, CurrSamplesPerSec=94.76663186543692, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.65764693270745 samples/s, lr: 4.883116883116883e-06, loss: 2.129051923751831 cuda_mem_allocated: 22.30698299407959 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8460.0 batch_size: 80.0 total loss: 1.2133549451828003 | |
Epoch 0: 44% 94/213 [01:58<02:06, 1.07s/it]
total tokens: 2475 num samples: 15 num padding tokens: 306 - rank: 6 max len: 165 min len: 129 avg len: 144.6 num_loss_counted_tokens: 785
total tokens: 2312 num samples: 8 num padding tokens: 177 - rank: 3 max len: 289 min len: 250 avg len: 266.875 num_loss_counted_tokens: 821 | |
total tokens: 2430 num samples: 10 num padding tokens: 151 - rank: 4 max len: 243 min len: 195 avg len: 227.9 num_loss_counted_tokens: 613 | |
total tokens: 2340 num samples: 12 num padding tokens: 168 - rank: 5 max len: 195 min len: 165 avg len: 181.0 num_loss_counted_tokens: 686 | |
total tokens: 2432 num samples: 19 num padding tokens: 317 - rank: 7 max len: 128 min len: 83 avg len: 111.3157894736842 num_loss_counted_tokens: 581 | |
total tokens: 2385 num samples: 5 num padding tokens: 213 - rank: 1 max len: 477 min len: 391 avg len: 434.4 num_loss_counted_tokens: 1148 | |
total tokens: 2516 num samples: 4 num padding tokens: 235 - rank: 0 max len: 629 min len: 513 avg len: 570.25 num_loss_counted_tokens: 1321 | |
total tokens: 2506 num samples: 7 num padding tokens: 226 - rank: 2 max len: 358 min len: 298 avg len: 325.7142857142857 num_loss_counted_tokens: 1306 | |
Per-token loss scaled by world size: 0.0012495614355430007
Per-token loss scaled by world size: 0.0014387929113581777
Per-token loss scaled by world size: 0.0009725113050080836
Per-token loss scaled by world size: 0.0006293528131209314
Per-token loss scaled by world size: 0.0007363572949543595
Per-token loss scaled by world size: 0.0011649337830021977
Per-token loss scaled by world size: 0.00047067031846381724
Epoch: 0, Step: 95, Rank: 5, loss = 1.1019569635391235
Epoch: 0, Step: 95, Rank: 2, loss = 1.2688355445861816
Epoch: 0, Step: 95, Rank: 3, loss = 1.027325987815857
Epoch: 0, Step: 95, Rank: 6, loss = 0.8576334118843079
Epoch: 0, Step: 95, Rank: 0, loss = 0.6493750810623169
Epoch: 0, Step: 95, Rank: 4, loss = 0.5550104975700378
Epoch: 0, Step: 95, Rank: 7, loss = 0.41507238149642944
Per-token loss scaled by world size: 0.002537165768444538 | |
Epoch: 0, Step: 95, Rank: 1, loss = 2.2374629974365234 | |
[2024-06-27 16:43:10,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=95, skipped=0, lr=[4.935064935064935e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:10,267] [INFO] [timer.py:260:stop] epoch=0/micro_step=95/global_step=95, RunningAvgSamplesPerSec=95.57520554244903, CurrSamplesPerSec=95.4467041102086, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.3548215672358 samples/s, lr: 4.935064935064935e-06, loss: 0.6493750810623169 cuda_mem_allocated: 22.292194366455078 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7055.0 batch_size: 84.0 total loss: 1.014083981513977 | |
Epoch 0: 45% 95/213 [01:59<02:05, 1.06s/it]
total tokens: 2440 num samples: 10 num padding tokens: 108 - rank: 4 max len: 244 min len: 222 avg len: 233.2 num_loss_counted_tokens: 891
total tokens: 2490 num samples: 15 num padding tokens: 163 - rank: 6 max len: 166 min len: 139 avg len: 155.13333333333333 num_loss_counted_tokens: 952 | |
total tokens: 2254 num samples: 7 num padding tokens: 173 - rank: 2 max len: 322 min len: 283 avg len: 297.2857142857143 num_loss_counted_tokens: 1395 | |
total tokens: 2275 num samples: 5 num padding tokens: 351 - rank: 1 max len: 455 min len: 334 avg len: 384.8 num_loss_counted_tokens: 1157 | |
total tokens: 2520 num samples: 9 num padding tokens: 178 - rank: 3 max len: 280 min len: 244 avg len: 260.22222222222223 num_loss_counted_tokens: 1321 | |
total tokens: 2431 num samples: 11 num padding tokens: 280 - rank: 5 max len: 221 min len: 173 avg len: 195.54545454545453 num_loss_counted_tokens: 1052 | |
total tokens: 2484 num samples: 18 num padding tokens: 383 - rank: 7 max len: 138 min len: 87 avg len: 116.72222222222223 num_loss_counted_tokens: 669 | |
total tokens: 2368 num samples: 4 num padding tokens: 331 - rank: 0 max len: 592 min len: 457 avg len: 509.25 num_loss_counted_tokens: 1247 | |
Per-token loss scaled by world size: 0.00098507571965456
Per-token loss scaled by world size: 0.0031795003451406956
Per-token loss scaled by world size: 0.0007708168122917414
Per-token loss scaled by world size: 0.001233852468430996
Per-token loss scaled by world size: 0.0006469864747487009
Per-token loss scaled by world size: 0.0011521325213834643
Per-token loss scaled by world size: 0.0006869107601232827
Epoch: 0, Step: 96, Rank: 1, loss = 2.9314992427825928
Epoch: 0, Step: 96, Rank: 2, loss = 0.9082397818565369
Epoch: 0, Step: 96, Rank: 4, loss = 1.137611985206604
Epoch: 0, Step: 96, Rank: 6, loss = 0.7106931209564209
Epoch: 0, Step: 96, Rank: 3, loss = 1.062266230583191
Epoch: 0, Step: 96, Rank: 7, loss = 0.5965215563774109
Per-token loss scaled by world size: 0.000977267394773662
Epoch: 0, Step: 96, Rank: 0, loss = 0.6333317160606384 | |
Epoch: 0, Step: 96, Rank: 5, loss = 0.9010405540466309 | |
[2024-06-27 16:43:11,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=96, skipped=0, lr=[4.987012987012987e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:11,325] [INFO] [timer.py:260:stop] epoch=0/micro_step=96/global_step=96, RunningAvgSamplesPerSec=95.57639972406895, CurrSamplesPerSec=95.68758920512795, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.59140712097529 samples/s, lr: 4.987012987012987e-06, loss: 0.6333317160606384 cuda_mem_allocated: 22.26547908782959 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7376.0 batch_size: 85.0 total loss: 1.1101503372192383 | |
Epoch 0: 45% 96/213 [02:00<02:04, 1.06s/it]
total tokens: 2421 num samples: 9 num padding tokens: 218 - rank: 3 max len: 269 min len: 228 avg len: 244.77777777777777 num_loss_counted_tokens: 1099
total tokens: 2366 num samples: 13 num padding tokens: 100 - rank: 5 max len: 182 min len: 166 avg len: 174.30769230769232 num_loss_counted_tokens: 804 | |
total tokens: 2512 num samples: 8 num padding tokens: 151 - rank: 2 max len: 314 min len: 277 avg len: 295.125 num_loss_counted_tokens: 1281 | |
total tokens: 2275 num samples: 5 num padding tokens: 269 - rank: 1 max len: 455 min len: 354 avg len: 401.2 num_loss_counted_tokens: 1162 | |
total tokens: 2415 num samples: 15 num padding tokens: 283 - rank: 6 max len: 161 min len: 127 avg len: 142.13333333333333 num_loss_counted_tokens: 786 | |
total tokens: 2464 num samples: 11 num padding tokens: 128 - rank: 4 max len: 224 min len: 196 avg len: 212.36363636363637 num_loss_counted_tokens: 860 | |
total tokens: 1750 num samples: 14 num padding tokens: 213 - rank: 7 max len: 125 min len: 86 avg len: 109.78571428571429 num_loss_counted_tokens: 457 | |
total tokens: 2187 num samples: 3 num padding tokens: 393 - rank: 0 max len: 729 min len: 459 avg len: 598.0 num_loss_counted_tokens: 973 | |
Per-token loss scaled by world size: 0.0006518478039652109
Per-token loss scaled by world size: 0.0007151501486077905
Per-token loss scaled by world size: 0.0022212897893041372
Per-token loss scaled by world size: 0.0016506633255630732
Per-token loss scaled by world size: 0.001024365541525185
Per-token loss scaled by world size: 0.0012011848157271743
Per-token loss scaled by world size: 0.0010668985778465867
Epoch: 0, Step: 97, Rank: 1, loss = 0.5627076029777527
Epoch: 0, Step: 97, Rank: 4, loss = 1.036922812461853
Epoch: 0, Step: 97, Rank: 7, loss = 0.6173533797264099
Epoch: 0, Step: 97, Rank: 5, loss = 1.4249351024627686
Epoch: 0, Step: 97, Rank: 2, loss = 1.9175283908843994
Epoch: 0, Step: 97, Rank: 3, loss = 0.8842835426330566
Epoch: 0, Step: 97, Rank: 0, loss = 0.9210001826286316 | |
Per-token loss scaled by world size: 0.0012455853866413236 | |
Epoch: 0, Step: 97, Rank: 6, loss = 1.075251579284668 | |
[2024-06-27 16:43:12,312] [INFO] [logging.py:96:log_dist] [Rank 0] step=97, skipped=0, lr=[5.038961038961039e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:12,385] [INFO] [timer.py:260:stop] epoch=0/micro_step=97/global_step=97, RunningAvgSamplesPerSec=95.57569892794656, CurrSamplesPerSec=95.50986994725591, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.41442899305576 samples/s, lr: 5.038961038961039e-06, loss: 0.9210001826286316 cuda_mem_allocated: 22.26214027404785 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6906.0 batch_size: 80.0 total loss: 1.0549978017807007 | |
Epoch 0: 46% 97/213 [02:01<02:03, 1.06s/it]
total tokens: 2506 num samples: 14 num padding tokens: 136 - rank: 5 max len: 179 min len: 155 avg len: 169.28571428571428 num_loss_counted_tokens: 768
total tokens: 2508 num samples: 12 num padding tokens: 149 - rank: 4 max len: 209 min len: 182 avg len: 196.58333333333334 num_loss_counted_tokens: 1038 | |
total tokens: 2448 num samples: 16 num padding tokens: 204 - rank: 6 max len: 153 min len: 130 avg len: 140.25 num_loss_counted_tokens: 1011 | |
total tokens: 2422 num samples: 7 num padding tokens: 95 - rank: 1 max len: 346 min len: 317 avg len: 332.42857142857144 num_loss_counted_tokens: 1122 | |
total tokens: 2520 num samples: 9 num padding tokens: 349 - rank: 3 max len: 280 min len: 210 avg len: 241.22222222222223 num_loss_counted_tokens: 1049 | |
total tokens: 2528 num samples: 8 num padding tokens: 166 - rank: 2 max len: 316 min len: 282 avg len: 295.25 num_loss_counted_tokens: 1074 | |
total tokens: 1764 num samples: 14 num padding tokens: 211 - rank: 7 max len: 126 min len: 90 avg len: 110.92857142857143 num_loss_counted_tokens: 444 | |
total tokens: 2515 num samples: 5 num padding tokens: 417 - rank: 0 max len: 503 min len: 353 avg len: 419.6 num_loss_counted_tokens: 1118 | |
Per-token loss scaled by world size: 0.001600139308720827
Per-token loss scaled by world size: 0.001376742497086525
Per-token loss scaled by world size: 0.0014025341952219605
Per-token loss scaled by world size: 0.001431688666343689
Per-token loss scaled by world size: 0.0016798594733700156
Per-token loss scaled by world size: 0.0009573631105013192
Per-token loss scaled by world size: 0.0009599090553820133
Epoch: 0, Step: 98, Rank: 6, loss = 1.1826869249343872
Epoch: 0, Step: 98, Rank: 4, loss = 1.3493174314498901
Epoch: 0, Step: 98, Rank: 5, loss = 1.1609381437301636
Epoch: 0, Step: 98, Rank: 2, loss = 1.2072714567184448
Epoch: 0, Step: 98, Rank: 7, loss = 0.8094432950019836 | |
Epoch: 0, Step: 98, Rank: 1, loss = 1.4165414571762085 | |
Epoch: 0, Step: 98, Rank: 0, loss = 0.8072964549064636
Per-token loss scaled by world size: 0.0014877779176458716
Epoch: 0, Step: 98, Rank: 3, loss = 1.2545686960220337 | |
[2024-06-27 16:43:13,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=98, skipped=0, lr=[5.090909090909091e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:13,437] [INFO] [timer.py:260:stop] epoch=0/micro_step=98/global_step=98, RunningAvgSamplesPerSec=95.5826592384301, CurrSamplesPerSec=96.24854373387852, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 96.14877451577362 samples/s, lr: 5.090909090909091e-06, loss: 0.8072964549064636 cuda_mem_allocated: 22.24138832092285 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6746.0 batch_size: 81.0 total loss: 1.1485079526901245 | |
Epoch 0: 46% 98/213 [02:02<02:01, 1.06s/it]
total tokens: 2300 num samples: 5 num padding tokens: 193 - rank: 1 max len: 460 min len: 385 avg len: 421.4 num_loss_counted_tokens: 1540
total tokens: 2298 num samples: 6 num padding tokens: 80 - rank: 2 max len: 383 min len: 344 avg len: 369.6666666666667 num_loss_counted_tokens: 1249 | |
total tokens: 2368 num samples: 8 num padding tokens: 237 - rank: 4 max len: 296 min len: 224 avg len: 266.375 num_loss_counted_tokens: 1129 | |
total tokens: 2478 num samples: 14 num padding tokens: 330 - rank: 6 max len: 177 min len: 134 avg len: 153.42857142857142 num_loss_counted_tokens: 887 | |
total tokens: 2453 num samples: 11 num padding tokens: 192 - rank: 5 max len: 223 min len: 184 avg len: 205.54545454545453 num_loss_counted_tokens: 876 | |
total tokens: 2366 num samples: 7 num padding tokens: 130 - rank: 3 max len: 338 min len: 304 avg len: 319.42857142857144 num_loss_counted_tokens: 1116 | |
total tokens: 1995 num samples: 3 num padding tokens: 217 - rank: 0 max len: 665 min len: 499 avg len: 592.6666666666666 num_loss_counted_tokens: 967 | |
total tokens: 2394 num samples: 18 num padding tokens: 351 - rank: 7 max len: 133 min len: 92 avg len: 113.5 num_loss_counted_tokens: 643 | |
Per-token loss scaled by world size: 0.0013148096622899175
Per-token loss scaled by world size: 0.0010535691399127245
Per-token loss scaled by world size: 0.002054269891232252
Per-token loss scaled by world size: 0.0012908512726426125
Per-token loss scaled by world size: 0.001670012716203928
Per-token loss scaled by world size: 0.0011821951484307647
Per-token loss scaled by world size: 4.973548129783012e-05 | |
Epoch: 0, Step: 99, Rank: 7, loss = 0.793864369392395 | |
Epoch: 0, Step: 99, Rank: 4, loss = 1.2583545446395874
Epoch: 0, Step: 99, Rank: 5, loss = 0.9907090663909912
Epoch: 0, Step: 99, Rank: 2, loss = 1.5478923320770264
Epoch: 0, Step: 99, Rank: 6, loss = 0.9726564288139343 | |
Epoch: 0, Step: 99, Rank: 1, loss = 0.8907840847969055 | |
Epoch: 0, Step: 99, Rank: 0, loss = 0.03747568652033806
Per-token loss scaled by world size: 0.00165279780048877
Epoch: 0, Step: 99, Rank: 3, loss = 1.2453831434249878 | |
[2024-06-27 16:43:14,420] [INFO] [logging.py:96:log_dist] [Rank 0] step=99, skipped=0, lr=[5.142857142857142e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:14,494] [INFO] [timer.py:260:stop] epoch=0/micro_step=99/global_step=99, RunningAvgSamplesPerSec=95.58188604594227, CurrSamplesPerSec=95.50771776429504, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.40360012169106 samples/s, lr: 5.142857142857142e-06, loss: 0.03747568652033806 cuda_mem_allocated: 22.266432762145996 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6028.0 batch_size: 68.0 total loss: 0.9671398997306824 | |
Epoch 0: 46% 99/213 [02:03<02:00, 1.06s/it]
total tokens: 2445 num samples: 5 num padding tokens: 263 - rank: 1 max len: 489 min len: 389 avg len: 436.4 num_loss_counted_tokens: 1284
total tokens: 2343 num samples: 11 num padding tokens: 204 - rank: 5 max len: 213 min len: 182 avg len: 194.45454545454547 num_loss_counted_tokens: 831 | |
total tokens: 2520 num samples: 14 num padding tokens: 401 - rank: 6 max len: 180 min len: 134 avg len: 151.35714285714286 num_loss_counted_tokens: 849 | |
total tokens: 2331 num samples: 7 num padding tokens: 231 - rank: 3 max len: 333 min len: 276 avg len: 300.0 num_loss_counted_tokens: 871 | |
total tokens: 2298 num samples: 6 num padding tokens: 151 - rank: 2 max len: 383 min len: 340 avg len: 357.8333333333333 num_loss_counted_tokens: 1372 | |
total tokens: 2520 num samples: 10 num padding tokens: 217 - rank: 4 max len: 252 min len: 214 avg len: 230.3 num_loss_counted_tokens: 883 | |
total tokens: 2508 num samples: 19 num padding tokens: 416 - rank: 7 max len: 132 min len: 86 avg len: 110.10526315789474 num_loss_counted_tokens: 609 | |
total tokens: 2312 num samples: 4 num padding tokens: 185 - rank: 0 max len: 578 min len: 506 avg len: 531.75 num_loss_counted_tokens: 425 | |
Per-token loss scaled by world size: 0.001400453969836235
Per-token loss scaled by world size: 0.002045578323304653
Per-token loss scaled by world size: 0.000271871336735785
Per-token loss scaled by world size: 0.0013699863338842988
Per-token loss scaled by world size: 0.0006936364807188511
Per-token loss scaled by world size: 0.0012198326876387
Per-token loss scaled by world size: 0.0007138267392292619
Epoch: 0, Step: 100, Rank: 2, loss = 1.707546591758728
Epoch: 0, Step: 100, Rank: 1, loss = 1.1690289974212646
Epoch: 0, Step: 100, Rank: 0, loss = 0.22694461047649384
Epoch: 0, Step: 100, Rank: 5, loss = 0.5790130496025085
Epoch: 0, Step: 100, Rank: 3, loss = 1.1435960531234741
Epoch: 0, Step: 100, Rank: 6, loss = 1.018255352973938
Per-token loss scaled by world size: 0.0017443158430978656
Epoch: 0, Step: 100, Rank: 7, loss = 0.595866858959198
Epoch: 0, Step: 100, Rank: 4, loss = 1.456067681312561 | |
[2024-06-27 16:43:15,479] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[5.194805194805194e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:15,552] [INFO] [timer.py:260:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=95.58238974698239, CurrSamplesPerSec=95.63127399372422, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.53538636895325 samples/s, lr: 5.194805194805194e-06, loss: 0.22694461047649384 cuda_mem_allocated: 22.296963691711426 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6678.0 batch_size: 87.0 total loss: 0.9870399236679077 | |
Epoch 0: 47% 100/213 [02:04<01:59, 1.06s/it]
total tokens: 2508 num samples: 11 num padding tokens: 167 - rank: 4 max len: 228 min len: 193 avg len: 212.8181818181818 num_loss_counted_tokens: 1125
total tokens: 2238 num samples: 6 num padding tokens: 117 - rank: 1 max len: 373 min len: 338 avg len: 353.5 num_loss_counted_tokens: 925 | |
total tokens: 2432 num samples: 8 num padding tokens: 121 - rank: 2 max len: 304 min len: 265 avg len: 288.875 num_loss_counted_tokens: 1284 | |
total tokens: 2480 num samples: 16 num padding tokens: 209 - rank: 6 max len: 155 min len: 131 avg len: 141.9375 num_loss_counted_tokens: 766 | |
total tokens: 2385 num samples: 9 num padding tokens: 189 - rank: 3 max len: 265 min len: 230 avg len: 244.0 num_loss_counted_tokens: 976 | |
total tokens: 2444 num samples: 13 num padding tokens: 158 - rank: 5 max len: 188 min len: 157 avg len: 175.84615384615384 num_loss_counted_tokens: 931 | |
total tokens: 2484 num samples: 4 num padding tokens: 625 - rank: 0 max len: 621 min len: 380 avg len: 464.75 num_loss_counted_tokens: 1367 | |
total tokens: 2470 num samples: 19 num padding tokens: 358 - rank: 7 max len: 130 min len: 77 avg len: 111.15789473684211 num_loss_counted_tokens: 647 | |
Per-token loss scaled by world size: 0.0013672267086803913
Per-token loss scaled by world size: 0.0015457401750609279
Per-token loss scaled by world size: 0.0010128431022167206
Per-token loss scaled by world size: 0.0012039744760841131
Per-token loss scaled by world size: 0.0007222264539450407
Per-token loss scaled by world size: 0.0010944722453132272
Per-token loss scaled by world size: 0.0010803642217069864
Epoch: 0, Step: 101, Rank: 3, loss = 1.333216905593872
Epoch: 0, Step: 101, Rank: 5, loss = 0.987648606300354
Epoch: 0, Step: 101, Rank: 6, loss = 0.704261064529419
Epoch: 0, Step: 101, Rank: 2, loss = 1.5072898864746094
Epoch: 0, Step: 101, Rank: 1, loss = 1.1740256547927856
Epoch: 0, Step: 101, Rank: 4, loss = 1.0672472715377808 | |
Epoch: 0, Step: 101, Rank: 0, loss = 1.053490161895752 | |
Per-token loss scaled by world size: 0.000558714207727462 | |
Epoch: 0, Step: 101, Rank: 7, loss = 0.5448161959648132 | |
[2024-06-27 16:43:16,537] [INFO] [logging.py:96:log_dist] [Rank 0] step=101, skipped=0, lr=[5.246753246753246e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:16,611] [INFO] [timer.py:260:stop] epoch=0/micro_step=101/global_step=101, RunningAvgSamplesPerSec=95.58266990283569, CurrSamplesPerSec=95.61013314552595, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.5216067145714 samples/s, lr: 5.246753246753246e-06, loss: 1.053490161895752 cuda_mem_allocated: 22.263213634490967 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7801.0 batch_size: 87.0 total loss: 1.046499490737915 | |
Epoch 0: 47% 101/213 [02:05<01:58, 1.06s/it]
total tokens: 2331 num samples: 9 num padding tokens: 179 - rank: 4 max len: 259 min len: 210 avg len: 239.11111111111111 num_loss_counted_tokens: 956
total tokens: 2415 num samples: 7 num padding tokens: 376 - rank: 3 max len: 345 min len: 262 avg len: 291.2857142857143 num_loss_counted_tokens: 894 | |
total tokens: 2508 num samples: 12 num padding tokens: 165 - rank: 5 max len: 209 min len: 175 avg len: 195.25 num_loss_counted_tokens: 838 | |
total tokens: 2244 num samples: 6 num padding tokens: 76 - rank: 2 max len: 374 min len: 350 avg len: 361.3333333333333 num_loss_counted_tokens: 1413
total tokens: 2490 num samples: 15 num padding tokens: 266 - rank: 6 max len: 166 min len: 129 avg len: 148.26666666666668 num_loss_counted_tokens: 813
total tokens: 2380 num samples: 5 num padding tokens: 293 - rank: 1 max len: 476 min len: 380 avg len: 417.4 num_loss_counted_tokens: 935 | |
total tokens: 1360 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1360 min len: 1360 avg len: 1360.0 num_loss_counted_tokens: 78 | |
total tokens: 2413 num samples: 19 num padding tokens: 286 - rank: 7 max len: 127 min len: 90 avg len: 111.94736842105263 num_loss_counted_tokens: 514 | |
Per-token loss scaled by world size: 0.0008656862191855907
Per-token loss scaled by world size: 0.0011554027441889048
Per-token loss scaled by world size: 0.0014031647006049752
Per-token loss scaled by world size: 0.0007795292185619473
Per-token loss scaled by world size: 0.0012363907881081104
Per-token loss scaled by world size: 0.0020034904591739178
Per-token loss scaled by world size: 0.0012333495542407036
Epoch: 0, Step: 102, Rank: 5, loss = 0.7303145527839661
Epoch: 0, Step: 102, Rank: 2, loss = 1.1837447881698608
Epoch: 0, Step: 102, Rank: 4, loss = 0.974726676940918
Epoch: 0, Step: 102, Rank: 6, loss = 0.6576303243637085
Epoch: 0, Step: 102, Rank: 0, loss = 1.0430501699447632
Epoch: 0, Step: 102, Rank: 1, loss = 1.6901947259902954
Epoch: 0, Step: 102, Rank: 3, loss = 1.0404845476150513
Per-token loss scaled by world size: 0.0007255471427924931
Epoch: 0, Step: 102, Rank: 7, loss = 0.6120896935462952 | |
[2024-06-27 16:43:17,602] [INFO] [logging.py:96:log_dist] [Rank 0] step=102, skipped=0, lr=[5.298701298701298e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:17,675] [INFO] [timer.py:260:stop] epoch=0/micro_step=102/global_step=102, RunningAvgSamplesPerSec=95.57705841833213, CurrSamplesPerSec=95.02476406244469, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.92926016588501 samples/s, lr: 5.298701298701298e-06, loss: 1.0430501699447632 cuda_mem_allocated: 22.30650568008423 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6749.0 batch_size: 95.0 total loss: 0.9915294647216797 | |
Epoch 0: 48% 102/213 [02:06<01:57, 1.06s/it]
total tokens: 2331 num samples: 9 num padding tokens: 69 - rank: 3 max len: 259 min len: 243 avg len: 251.33333333333334 num_loss_counted_tokens: 971
total tokens: 2370 num samples: 6 num padding tokens: 254 - rank: 1 max len: 395 min len: 326 avg len: 352.6666666666667 num_loss_counted_tokens: 900 | |
total tokens: 2508 num samples: 12 num padding tokens: 192 - rank: 5 max len: 209 min len: 162 avg len: 193.0 num_loss_counted_tokens: 1038 | |
total tokens: 2496 num samples: 16 num padding tokens: 236 - rank: 6 max len: 156 min len: 124 avg len: 141.25 num_loss_counted_tokens: 909 | |
total tokens: 2528 num samples: 8 num padding tokens: 190 - rank: 2 max len: 316 min len: 269 avg len: 292.25 num_loss_counted_tokens: 1195 | |
total tokens: 2360 num samples: 10 num padding tokens: 125 - rank: 4 max len: 236 min len: 216 avg len: 223.5 num_loss_counted_tokens: 917 | |
total tokens: 2148 num samples: 4 num padding tokens: 242 - rank: 0 max len: 537 min len: 404 avg len: 476.5 num_loss_counted_tokens: 1283 | |
total tokens: 2091 num samples: 17 num padding tokens: 254 - rank: 7 max len: 123 min len: 88 avg len: 108.05882352941177 num_loss_counted_tokens: 524 | |
Per-token loss scaled by world size: 0.0016331078950315714
Per-token loss scaled by world size: 0.0015577217563986778
Per-token loss scaled by world size: 0.0023325905203819275
Per-token loss scaled by world size: 0.0015249855350703
Per-token loss scaled by world size: 0.0009915867121890187
Per-token loss scaled by world size: 0.0011698987800627947
Per-token loss scaled by world size: 0.00047059752978384495 | |
Epoch: 0, Step: 103, Rank: 4, loss = 1.4767378568649292 | |
Epoch: 0, Step: 103, Rank: 3, loss = 1.0578809976577759
Epoch: 0, Step: 103, Rank: 1, loss = 1.3789681196212769
Epoch: 0, Step: 103, Rank: 5, loss = 1.4085699319839478
Epoch: 0, Step: 103, Rank: 2, loss = 2.1092450618743896
Epoch: 0, Step: 103, Rank: 6, loss = 0.8966423273086548
Epoch: 0, Step: 103, Rank: 0, loss = 0.4255378246307373
Per-token loss scaled by world size: 0.0006349242175929248 | |
Epoch: 0, Step: 103, Rank: 7, loss = 0.5741302371025085 | |
[2024-06-27 16:43:18,658] [INFO] [logging.py:96:log_dist] [Rank 0] step=103, skipped=0, lr=[5.3506493506493504e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:18,732] [INFO] [timer.py:260:stop] epoch=0/micro_step=103/global_step=103, RunningAvgSamplesPerSec=95.57862505003528, CurrSamplesPerSec=95.73554800529733, MemAllocated=22.23GB, MaxMemAllocated=28.61GB | |
throughput: 95.6420864608076 samples/s, lr: 5.3506493506493504e-06, loss: 0.4255378246307373 cuda_mem_allocated: 22.226122856140137 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7234.0 batch_size: 80.0 total loss: 1.1659640073776245 | |
Epoch 0: 48% 103/213 [02:07<01:56, 1.06s/it]
total tokens: 2500 num samples: 10 num padding tokens: 221 - rank: 3 max len: 250 min len: 207 avg len: 227.9 num_loss_counted_tokens: 805
total tokens: 2492 num samples: 7 num padding tokens: 142 - rank: 1 max len: 356 min len: 316 avg len: 335.7142857142857 num_loss_counted_tokens: 1404 | |
total tokens: 2460 num samples: 20 num padding tokens: 437 - rank: 7 max len: 123 min len: 79 avg len: 101.15 num_loss_counted_tokens: 531 | |
total tokens: 2444 num samples: 13 num padding tokens: 168 - rank: 5 max len: 188 min len: 162 avg len: 175.07692307692307 num_loss_counted_tokens: 906 | |
total tokens: 2436 num samples: 12 num padding tokens: 101 - rank: 4 max len: 203 min len: 189 avg len: 194.58333333333334 num_loss_counted_tokens: 1115 | |
total tokens: 2512 num samples: 16 num padding tokens: 232 - rank: 6 max len: 157 min len: 126 avg len: 142.5 num_loss_counted_tokens: 835 | |
total tokens: 2528 num samples: 8 num padding tokens: 187 - rank: 2 max len: 316 min len: 255 avg len: 292.625 num_loss_counted_tokens: 826 | |
total tokens: 2465 num samples: 5 num padding tokens: 347 - rank: 0 max len: 493 min len: 357 avg len: 423.6 num_loss_counted_tokens: 1178 | |
Per-token loss scaled by world size: 0.003020831849426031
Per-token loss scaled by world size: 0.0009701667004264891
Per-token loss scaled by world size: 0.001339601119980216
Per-token loss scaled by world size: 0.0008365093963220716
Per-token loss scaled by world size: 0.0007713797385804355
Per-token loss scaled by world size: 0.0012270438019186258
Per-token loss scaled by world size: 8.1786020018626e-05 | |
Epoch: 0, Step: 104, Rank: 0, loss = 2.7674596309661865
Epoch: 0, Step: 104, Rank: 6, loss = 0.7066802382469177
Epoch: 0, Step: 104, Rank: 5, loss = 0.8887939453125 | |
Epoch: 0, Step: 104, Rank: 1, loss = 1.227242112159729 | |
Epoch: 0, Step: 104, Rank: 4, loss = 0.7663471698760986 | |
Epoch: 0, Step: 104, Rank: 3, loss = 1.1241254806518555 | |
Epoch: 0, Step: 104, Rank: 7, loss = 0.0749262198805809 | |
Per-token loss scaled by world size: 0.001221204991452396 | |
Epoch: 0, Step: 104, Rank: 2, loss = 1.1187764406204224 | |
[2024-06-27 16:43:19,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=104, skipped=0, lr=[5.4025974025974024e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:19,797] [INFO] [timer.py:260:stop] epoch=0/micro_step=104/global_step=104, RunningAvgSamplesPerSec=95.57386472245344, CurrSamplesPerSec=95.09550191086954, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.99230300665424 samples/s, lr: 5.4025974025974024e-06, loss: 2.7674596309661865 cuda_mem_allocated: 22.29434061050415 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7329.0 batch_size: 71.0 total loss: 1.0842937231063843 | |
Saving model in huggingface format at samples_seen: 9984 | |
Model saved in /instructlab/training_output/hf_format/samples_9984 | |
[16:43:39] INFO saving took 19.706674814224243 seconds utils.py:192 | |
Epoch 0: 49% 104/213 [02:28<12:40, 6.97s/it]
total tokens: 2412 num samples: 18 num padding tokens: 446 - rank: 7 max len: 134 min len: 88 avg len: 109.22222222222223 num_loss_counted_tokens: 493
total tokens: 2450 num samples: 7 num padding tokens: 192 - rank: 2 max len: 350 min len: 262 avg len: 322.57142857142856 num_loss_counted_tokens: 1051 | |
total tokens: 2260 num samples: 5 num padding tokens: 207 - rank: 1 max len: 452 min len: 354 avg len: 410.6 num_loss_counted_tokens: 1263 | |
total tokens: 2358 num samples: 9 num padding tokens: 120 - rank: 3 max len: 262 min len: 232 avg len: 248.66666666666666 num_loss_counted_tokens: 644 | |
total tokens: 2352 num samples: 12 num padding tokens: 241 - rank: 5 max len: 196 min len: 163 avg len: 175.91666666666666 num_loss_counted_tokens: 730 | |
total tokens: 2310 num samples: 10 num padding tokens: 127 - rank: 4 max len: 231 min len: 200 avg len: 218.3 num_loss_counted_tokens: 937 | |
total tokens: 2415 num samples: 15 num padding tokens: 180 - rank: 6 max len: 161 min len: 137 avg len: 149.0 num_loss_counted_tokens: 941 | |
total tokens: 2412 num samples: 4 num padding tokens: 281 - rank: 0 max len: 603 min len: 469 avg len: 532.75 num_loss_counted_tokens: 1316 | |
Per-token loss scaled by world size: 0.0023011916782706976
Per-token loss scaled by world size: 0.001061311224475503
Per-token loss scaled by world size: 0.001045188750140369
Per-token loss scaled by world size: 0.0008734570001251996
Per-token loss scaled by world size: 0.001185024157166481
Per-token loss scaled by world size: 0.000813312828540802
Per-token loss scaled by world size: 0.0021490640938282013
Epoch: 0, Step: 105, Rank: 2, loss = 2.0287880897521973 | |
Epoch: 0, Step: 105, Rank: 3, loss = 1.894668698310852
Epoch: 0, Step: 105, Rank: 5, loss = 0.9214645624160767
Epoch: 0, Step: 105, Rank: 7, loss = 0.7700615525245667
Epoch: 0, Step: 105, Rank: 6, loss = 0.9356785416603088
Epoch: 0, Step: 105, Rank: 4, loss = 1.0447468757629395
Epoch: 0, Step: 105, Rank: 1, loss = 0.7170369029045105
Per-token loss scaled by world size: 0.0007719391142018139 | |
Epoch: 0, Step: 105, Rank: 0, loss = 0.680560827255249 | |
[2024-06-27 16:43:40,493] [INFO] [logging.py:96:log_dist] [Rank 0] step=105, skipped=0, lr=[5.4545454545454545e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:40,566] [INFO] [timer.py:260:stop] epoch=0/micro_step=105/global_step=105, RunningAvgSamplesPerSec=95.57250805279344, CurrSamplesPerSec=95.43432977852252, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.31719294092582 samples/s, lr: 5.4545454545454545e-06, loss: 0.680560827255249 cuda_mem_allocated: 22.256415367126465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7053.0 batch_size: 82.0 total loss: 1.1241257190704346 | |
Epoch 0: 49% 105/213 [02:29<09:21, 5.20s/it]
total tokens: 2400 num samples: 12 num padding tokens: 122 - rank: 5 max len: 200 min len: 181 avg len: 189.83333333333334 num_loss_counted_tokens: 790
total tokens: 2380 num samples: 5 num padding tokens: 553 - rank: 2 max len: 476 min len: 316 avg len: 365.4 num_loss_counted_tokens: 1103 | |
total tokens: 2512 num samples: 4 num padding tokens: 392 - rank: 1 max len: 628 min len: 488 avg len: 530.0 num_loss_counted_tokens: 1675 | |
total tokens: 2313 num samples: 9 num padding tokens: 265 - rank: 4 max len: 257 min len: 205 avg len: 227.55555555555554 num_loss_counted_tokens: 704 | |
total tokens: 2464 num samples: 8 num padding tokens: 179 - rank: 3 max len: 308 min len: 261 avg len: 285.625 num_loss_counted_tokens: 1170 | |
total tokens: 2478 num samples: 14 num padding tokens: 292 - rank: 6 max len: 177 min len: 138 avg len: 156.14285714285714 num_loss_counted_tokens: 851 | |
total tokens: 2083 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2083 min len: 2083 avg len: 2083.0 num_loss_counted_tokens: 124 | |
total tokens: 2466 num samples: 18 num padding tokens: 366 - rank: 7 max len: 137 min len: 76 avg len: 116.66666666666667 num_loss_counted_tokens: 634 | |
Per-token loss scaled by world size: 0.0007911288412287831
Per-token loss scaled by world size: 0.0009512954275123775
Per-token loss scaled by world size: 0.0015752391191199422
Per-token loss scaled by world size: 0.0010100816143676639
Per-token loss scaled by world size: 0.0018828223692253232
Per-token loss scaled by world size: 0.0008825026452541351
Per-token loss scaled by world size: 0.0005050410982221365
Epoch: 0, Step: 106, Rank: 5, loss = 0.8634195327758789
Epoch: 0, Step: 106, Rank: 4, loss = 0.7180483341217041
Epoch: 0, Step: 106, Rank: 3, loss = 0.8009814620018005
Epoch: 0, Step: 106, Rank: 2, loss = 1.4297263622283936
Epoch: 0, Step: 106, Rank: 0, loss = 0.9167752861976624
Epoch: 0, Step: 106, Rank: 1, loss = 1.7088966369628906
Epoch: 0, Step: 106, Rank: 7, loss = 0.45838794112205505
Per-token loss scaled by world size: 0.0009067204082384706
Epoch: 0, Step: 106, Rank: 6, loss = 0.8229621052742004 | |
[2024-06-27 16:43:41,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=106, skipped=0, lr=[5.5064935064935065e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:41,616] [INFO] [timer.py:260:stop] epoch=0/micro_step=106/global_step=106, RunningAvgSamplesPerSec=95.5807034083944, CurrSamplesPerSec=96.43242000471321, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 96.2917930339483 samples/s, lr: 5.5064935064935065e-06, loss: 0.9167752861976624 cuda_mem_allocated: 22.256415367126465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7261.0 batch_size: 76.0 total loss: 0.9648997783660889 | |
Epoch 0: 50% 106/213 [02:30<07:03, 3.95s/it]
total tokens: 2387 num samples: 11 num padding tokens: 217 - rank: 4 max len: 217 min len: 187 avg len: 197.27272727272728 num_loss_counted_tokens: 850
total tokens: 2418 num samples: 13 num padding tokens: 250 - rank: 5 max len: 186 min len: 155 avg len: 166.76923076923077 num_loss_counted_tokens: 930 | |
total tokens: 2430 num samples: 9 num padding tokens: 223 - rank: 3 max len: 270 min len: 220 avg len: 245.22222222222223 num_loss_counted_tokens: 1140 | |
total tokens: 2324 num samples: 7 num padding tokens: 280 - rank: 2 max len: 332 min len: 278 avg len: 292.0 num_loss_counted_tokens: 985 | |
total tokens: 2270 num samples: 5 num padding tokens: 197 - rank: 1 max len: 454 min len: 354 avg len: 414.6 num_loss_counted_tokens: 914 | |
total tokens: 2400 num samples: 16 num padding tokens: 163 - rank: 6 max len: 150 min len: 126 avg len: 139.8125 num_loss_counted_tokens: 831 | |
total tokens: 2500 num samples: 20 num padding tokens: 321 - rank: 7 max len: 125 min len: 79 avg len: 108.95 num_loss_counted_tokens: 643 | |
total tokens: 2295 num samples: 3 num padding tokens: 389 - rank: 0 max len: 765 min len: 468 avg len: 635.3333333333334 num_loss_counted_tokens: 896 | |
Per-token loss scaled by world size: 0.0005507110035978258
Per-token loss scaled by world size: 0.0011924388818442822
Per-token loss scaled by world size: 0.001110840355977416
Per-token loss scaled by world size: 0.001158279599621892
Per-token loss scaled by world size: 0.0006834580563008785
Per-token loss scaled by world size: 0.0017679422162473202
Per-token loss scaled by world size: 0.001754762139171362
Epoch: 0, Step: 107, Rank: 7, loss = 0.5352222323417664 | |
Epoch: 0, Step: 107, Rank: 4, loss = 1.0795979499816895
Epoch: 0, Step: 107, Rank: 3, loss = 1.158901572227478
Epoch: 0, Step: 107, Rank: 2, loss = 0.6642357707023621
Epoch: 0, Step: 107, Rank: 5, loss = 1.7182188034057617
Epoch: 0, Step: 107, Rank: 1, loss = 1.125702977180481 | |
Epoch: 0, Step: 107, Rank: 0, loss = 1.7054094076156616 | |
Per-token loss scaled by world size: 0.0011686119250953197 | |
Epoch: 0, Step: 107, Rank: 6, loss = 1.1357446908950806 | |
[2024-06-27 16:43:42,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=107, skipped=0, lr=[5.5584415584415585e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:42,674] [INFO] [timer.py:260:stop] epoch=0/micro_step=107/global_step=107, RunningAvgSamplesPerSec=95.5802584338448, CurrSamplesPerSec=95.53400369131076, MemAllocated=22.17GB, MaxMemAllocated=28.61GB | |
throughput: 95.44396655019426 samples/s, lr: 5.5584415584415585e-06, loss: 1.7054094076156616 cuda_mem_allocated: 22.169832229614258 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7775.0 batch_size: 75.0 total loss: 1.1403791904449463 | |
Epoch 0: 50% 107/213 [02:31<05:27, 3.09s/it]
total tokens: 2510 num samples: 10 num padding tokens: 114 - rank: 3 max len: 251 min len: 222 avg len: 239.6 num_loss_counted_tokens: 1049
total tokens: 2415 num samples: 5 num padding tokens: 259 - rank: 0 max len: 483 min len: 383 avg len: 431.2 num_loss_counted_tokens: 991 | |
total tokens: 2379 num samples: 13 num padding tokens: 228 - rank: 5 max len: 183 min len: 157 avg len: 165.46153846153845 num_loss_counted_tokens: 855 | |
total tokens: 2250 num samples: 6 num padding tokens: 223 - rank: 1 max len: 375 min len: 311 avg len: 337.8333333333333 num_loss_counted_tokens: 1194 | |
total tokens: 2464 num samples: 8 num padding tokens: 186 - rank: 2 max len: 308 min len: 255 avg len: 284.75 num_loss_counted_tokens: 1265 | |
total tokens: 2512 num samples: 16 num padding tokens: 238 - rank: 6 max len: 157 min len: 129 avg len: 142.125 num_loss_counted_tokens: 826 | |
total tokens: 2431 num samples: 11 num padding tokens: 184 - rank: 4 max len: 221 min len: 191 avg len: 204.27272727272728 num_loss_counted_tokens: 866 | |
total tokens: 2451 num samples: 19 num padding tokens: 426 - rank: 7 max len: 129 min len: 88 avg len: 106.57894736842105 num_loss_counted_tokens: 507 | |
Per-token loss scaled by world size: 0.0014553734799847007 | |
Per-token loss scaled by world size: 0.0009343641577288508
Per-token loss scaled by world size: 0.0014532349305227399
Per-token loss scaled by world size: 0.0017517812084406614
Per-token loss scaled by world size: 0.0013760587899014354
Per-token loss scaled by world size: 0.000980113516561687
Per-token loss scaled by world size: 0.0011582722654566169
Epoch: 0, Step: 108, Rank: 4, loss = 1.2630822658538818 | |
Epoch: 0, Step: 108, Rank: 2, loss = 1.520327091217041 | |
Epoch: 0, Step: 108, Rank: 1, loss = 1.2612262964248657
Epoch: 0, Step: 108, Rank: 0, loss = 0.8109112977981567
Epoch: 0, Step: 108, Rank: 5, loss = 1.1942470073699951 | |
Epoch: 0, Step: 108, Rank: 6, loss = 1.0052355527877808
Per-token loss scaled by world size: 0.0007734561804682016
Epoch: 0, Step: 108, Rank: 3, loss = 0.8506160378456116
Epoch: 0, Step: 108, Rank: 7, loss = 0.6712632775306702 | |
[2024-06-27 16:43:43,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=108, skipped=0, lr=[5.6103896103896105e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:43,727] [INFO] [timer.py:260:stop] epoch=0/micro_step=108/global_step=108, RunningAvgSamplesPerSec=95.57805006680081, CurrSamplesPerSec=95.3467380496661, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.23866813093137 samples/s, lr: 5.6103896103896105e-06, loss: 0.8109112977981567 cuda_mem_allocated: 22.26023244857788 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6943.0 batch_size: 91.0 total loss: 1.0721136331558228 | |
Epoch 0: 51% 108/213 [02:32<04:19, 2.48s/it] total tokens: 2305 num samples: 5 num padding tokens: 315 - rank: 1 max len: 461 min len: 345 avg len: 398.0 num_loss_counted_tokens: 1261 | |
total tokens: 2412 num samples: 12 num padding tokens: 220 - rank: 5 max len: 201 min len: 171 avg len: 182.66666666666666 num_loss_counted_tokens: 847 | |
total tokens: 2394 num samples: 7 num padding tokens: 158 - rank: 2 max len: 342 min len: 291 avg len: 319.42857142857144 num_loss_counted_tokens: 1335 | |
total tokens: 2380 num samples: 14 num padding tokens: 347 - rank: 6 max len: 170 min len: 127 avg len: 145.21428571428572 num_loss_counted_tokens: 787 | |
total tokens: 2409 num samples: 11 num padding tokens: 107 - rank: 4 max len: 219 min len: 204 avg len: 209.27272727272728 num_loss_counted_tokens: 847 | |
total tokens: 2264 num samples: 8 num padding tokens: 192 - rank: 3 max len: 283 min len: 229 avg len: 259.0 num_loss_counted_tokens: 1150 | |
total tokens: 2349 num samples: 3 num padding tokens: 414 - rank: 0 max len: 783 min len: 560 avg len: 645.0 num_loss_counted_tokens: 418 | |
total tokens: 2413 num samples: 19 num padding tokens: 444 - rank: 7 max len: 127 min len: 77 avg len: 103.63157894736842 num_loss_counted_tokens: 572 | |
Per-token loss scaled by world size: 0.0013497844338417053
Per-token loss scaled by world size: 0.0008786096586845815
Per-token loss scaled by world size: 0.0011327891843393445
Per-token loss scaled by world size: 0.0009134126012213528
Per-token loss scaled by world size: 0.0012686452828347683
Per-token loss scaled by world size: 0.0007940421928651631
Per-token loss scaled by world size: 0.0012829240877181292
Epoch: 0, Step: 109, Rank: 2, loss = 1.3027106523513794
Epoch: 0, Step: 109, Rank: 0, loss = 0.8479681611061096
Epoch: 0, Step: 109, Rank: 3, loss = 0.8815573453903198
Epoch: 0, Step: 109, Rank: 5, loss = 1.0932831764221191
Epoch: 0, Step: 109, Rank: 4, loss = 0.7663499712944031 | |
Epoch: 0, Step: 109, Rank: 6, loss = 1.2244012355804443 | |
Epoch: 0, Step: 109, Rank: 1, loss = 1.2381820678710938 | |
Per-token loss scaled by world size: 0.0005081428098492324 | |
Epoch: 0, Step: 109, Rank: 7, loss = 0.4904213547706604 | |
[2024-06-27 16:43:44,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=109, skipped=0, lr=[5.6623376623376625e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:44,789] [INFO] [timer.py:260:stop] epoch=0/micro_step=109/global_step=109, RunningAvgSamplesPerSec=95.5746145637315, CurrSamplesPerSec=95.2118465158864, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.12144903796158 samples/s, lr: 5.6623376623376625e-06, loss: 0.8479681611061096 cuda_mem_allocated: 22.28265380859375 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7721.0 batch_size: 84.0 total loss: 0.9806092977523804 | |
Epoch 0: 51% 109/213 [02:34<03:33, 2.05s/it] total tokens: 2480 num samples: 16 num padding tokens: 232 - rank: 6 max len: 155 min len: 124 avg len: 140.5 num_loss_counted_tokens: 853 | |
total tokens: 2421 num samples: 9 num padding tokens: 168 - rank: 3 max len: 269 min len: 227 avg len: 250.33333333333334 num_loss_counted_tokens: 996 | |
total tokens: 2387 num samples: 7 num padding tokens: 50 - rank: 1 max len: 341 min len: 328 avg len: 333.85714285714283 num_loss_counted_tokens: 1025 | |
total tokens: 2464 num samples: 11 num padding tokens: 196 - rank: 4 max len: 224 min len: 188 avg len: 206.1818181818182 num_loss_counted_tokens: 1166 | |
total tokens: 2405 num samples: 13 num padding tokens: 189 - rank: 5 max len: 185 min len: 156 avg len: 170.46153846153845 num_loss_counted_tokens: 845 | |
total tokens: 2480 num samples: 8 num padding tokens: 155 - rank: 2 max len: 310 min len: 272 avg len: 290.625 num_loss_counted_tokens: 1037 | |
total tokens: 2220 num samples: 4 num padding tokens: 559 - rank: 0 max len: 555 min len: 346 avg len: 415.25 num_loss_counted_tokens: 833 | |
total tokens: 2520 num samples: 21 num padding tokens: 360 - rank: 7 max len: 120 min len: 79 avg len: 102.85714285714286 num_loss_counted_tokens: 549 | |
Per-token loss scaled by world size: 0.0010499960044398904
Per-token loss scaled by world size: 0.0006242187810130417
Per-token loss scaled by world size: 0.0012215960305184126
Per-token loss scaled by world size: 0.0008489690371789038
Per-token loss scaled by world size: 0.002225430915132165
Per-token loss scaled by world size: 0.0013001501793041825
Per-token loss scaled by world size: 0.0010214060312137008
Epoch: 0, Step: 110, Rank: 5, loss = 0.8964341282844543
Epoch: 0, Step: 110, Rank: 7, loss = 0.5329267978668213
Epoch: 0, Step: 110, Rank: 2, loss = 1.0429376363754272
Epoch: 0, Step: 110, Rank: 1, loss = 1.8999615907669067
Epoch: 0, Step: 110, Rank: 4, loss = 0.7248073220252991 | |
Epoch: 0, Step: 110, Rank: 6, loss = 1.1100032329559326
Epoch: 0, Step: 110, Rank: 3, loss = 0.8720254302024841
Per-token loss scaled by world size: 0.00103566306643188 | |
Epoch: 0, Step: 110, Rank: 0, loss = 0.8841972947120667 | |
[2024-06-27 16:43:45,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[5.7142857142857145e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:45,848] [INFO] [timer.py:260:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=95.57409197275712, CurrSamplesPerSec=95.51820774008056, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.42229785693144 samples/s, lr: 5.7142857142857145e-06, loss: 0.8841972947120667 cuda_mem_allocated: 22.29601001739502 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6830.0 batch_size: 85.0 total loss: 0.9954116344451904 | |
Epoch 0: 52% 110/213 [02:35<03:00, 1.75s/it] total tokens: 2480 num samples: 10 num padding tokens: 253 - rank: 4 max len: 248 min len: 204 avg len: 222.7 num_loss_counted_tokens: 1133 | |
total tokens: 2388 num samples: 12 num padding tokens: 182 - rank: 5 max len: 199 min len: 164 avg len: 183.83333333333334 num_loss_counted_tokens: 1049 | |
total tokens: 2430 num samples: 9 num padding tokens: 123 - rank: 3 max len: 270 min len: 249 avg len: 256.3333333333333 num_loss_counted_tokens: 931 | |
total tokens: 2430 num samples: 15 num padding tokens: 245 - rank: 6 max len: 162 min len: 129 avg len: 145.66666666666666 num_loss_counted_tokens: 816 | |
total tokens: 2178 num samples: 6 num padding tokens: 400 - rank: 2 max len: 363 min len: 272 avg len: 296.3333333333333 num_loss_counted_tokens: 1065 | |
total tokens: 2180 num samples: 5 num padding tokens: 104 - rank: 1 max len: 436 min len: 380 avg len: 415.2 num_loss_counted_tokens: 1275 | |
total tokens: 2148 num samples: 3 num padding tokens: 341 - rank: 0 max len: 716 min len: 461 avg len: 602.3333333333334 num_loss_counted_tokens: 512 | |
total tokens: 2451 num samples: 19 num padding tokens: 438 - rank: 7 max len: 129 min len: 87 avg len: 105.94736842105263 num_loss_counted_tokens: 517 | |
Per-token loss scaled by world size: 0.000539909116923809
Per-token loss scaled by world size: 0.0014843979151919484
Per-token loss scaled by world size: 0.0006432944792322814
Per-token loss scaled by world size: 0.0007334641413763165
Per-token loss scaled by world size: 0.0016622388502582908
Per-token loss scaled by world size: 0.0009984803618863225
Per-token loss scaled by world size: 0.0008674445562064648
Epoch: 0, Step: 111, Rank: 4, loss = 0.4900350272655487
Epoch: 0, Step: 111, Rank: 1, loss = 1.3472766876220703
Epoch: 0, Step: 111, Rank: 5, loss = 0.6657103896141052
Epoch: 0, Step: 111, Rank: 6, loss = 0.7873143553733826
Epoch: 0, Step: 111, Rank: 7, loss = 0.5838701725006104
Epoch: 0, Step: 111, Rank: 2, loss = 1.508689522743225
Per-token loss scaled by world size: 0.0011880651582032442
Epoch: 0, Step: 111, Rank: 3, loss = 0.906245768070221
Epoch: 0, Step: 111, Rank: 0, loss = 1.078317642211914 | |
[2024-06-27 16:43:46,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=111, skipped=0, lr=[5.7662337662337665e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:46,910] [INFO] [timer.py:260:stop] epoch=0/micro_step=111/global_step=111, RunningAvgSamplesPerSec=95.57033467189922, CurrSamplesPerSec=95.16627767723796, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.07026478742348 samples/s, lr: 5.7662337662337665e-06, loss: 1.078317642211914 cuda_mem_allocated: 22.314615726470947 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7261.0 batch_size: 80.0 total loss: 0.920932412147522 | |
Epoch 0: 52% 111/213 [02:36<02:37, 1.55s/it] total tokens: 2483 num samples: 13 num padding tokens: 231 - rank: 5 max len: 191 min len: 163 avg len: 173.23076923076923 num_loss_counted_tokens: 834 | |
total tokens: 2352 num samples: 8 num padding tokens: 323 - rank: 3 max len: 294 min len: 228 avg len: 253.625 num_loss_counted_tokens: 751 | |
total tokens: 2486 num samples: 11 num padding tokens: 205 - rank: 4 max len: 226 min len: 192 avg len: 207.36363636363637 num_loss_counted_tokens: 1021 | |
total tokens: 2415 num samples: 15 num padding tokens: 212 - rank: 6 max len: 161 min len: 131 avg len: 146.86666666666667 num_loss_counted_tokens: 755 | |
total tokens: 2387 num samples: 7 num padding tokens: 121 - rank: 2 max len: 341 min len: 299 avg len: 323.7142857142857 num_loss_counted_tokens: 1326 | |
total tokens: 2460 num samples: 6 num padding tokens: 89 - rank: 1 max len: 410 min len: 382 avg len: 395.1666666666667 num_loss_counted_tokens: 1718 | |
total tokens: 2229 num samples: 3 num padding tokens: 421 - rank: 0 max len: 743 min len: 502 avg len: 602.6666666666666 num_loss_counted_tokens: 753 | |
total tokens: 2451 num samples: 19 num padding tokens: 460 - rank: 7 max len: 129 min len: 84 avg len: 104.78947368421052 num_loss_counted_tokens: 550 | |
Per-token loss scaled by world size: 0.0012570393737405539
Per-token loss scaled by world size: 0.0013796824496239424
Per-token loss scaled by world size: 0.001287271617911756
Per-token loss scaled by world size: 0.0007230991614051163
Per-token loss scaled by world size: 0.0010709468042477965
Per-token loss scaled by world size: 0.0010025746887549758
Per-token loss scaled by world size: 0.001407001749612391
Epoch: 0, Step: 112, Rank: 4, loss = 0.7849241495132446 | |
Epoch: 0, Step: 112, Rank: 0, loss = 1.4976452589035034
Epoch: 0, Step: 112, Rank: 6, loss = 1.0882948637008667
Epoch: 0, Step: 112, Rank: 1, loss = 1.364516258239746
Epoch: 0, Step: 112, Rank: 2, loss = 1.3973333835601807
Epoch: 0, Step: 112, Rank: 5, loss = 1.1625127792358398
Epoch: 0, Step: 112, Rank: 3, loss = 1.5273003578186035
Per-token loss scaled by world size: 0.0005925724981352687 | |
Epoch: 0, Step: 112, Rank: 7, loss = 0.6432374715805054 | |
[2024-06-27 16:43:47,897] [INFO] [logging.py:96:log_dist] [Rank 0] step=112, skipped=0, lr=[5.8181818181818185e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:47,970] [INFO] [timer.py:260:stop] epoch=0/micro_step=112/global_step=112, RunningAvgSamplesPerSec=95.57025007421866, CurrSamplesPerSec=95.56102982482079, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.46677680487447 samples/s, lr: 5.8181818181818185e-06, loss: 1.4976452589035034 cuda_mem_allocated: 22.296963691711426 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8684.0 batch_size: 79.0 total loss: 1.183220624923706 | |
Epoch 0: 53% 112/213 [02:37<02:21, 1.40s/it] total tokens: 1995 num samples: 3 num padding tokens: 369 - rank: 1 max len: 665 min len: 468 avg len: 542.0 num_loss_counted_tokens: 1207 | |
total tokens: 2304 num samples: 6 num padding tokens: 372 - rank: 2 max len: 384 min len: 285 avg len: 322.0 num_loss_counted_tokens: 1377 | |
total tokens: 2475 num samples: 9 num padding tokens: 278 - rank: 3 max len: 275 min len: 217 avg len: 244.11111111111111 num_loss_counted_tokens: 704 | |
total tokens: 2496 num samples: 12 num padding tokens: 153 - rank: 4 max len: 208 min len: 184 avg len: 195.25 num_loss_counted_tokens: 780 | |
total tokens: 2482 num samples: 17 num padding tokens: 199 - rank: 6 max len: 146 min len: 116 avg len: 134.2941176470588 num_loss_counted_tokens: 762 | |
total tokens: 2366 num samples: 13 num padding tokens: 229 - rank: 5 max len: 182 min len: 149 avg len: 164.3846153846154 num_loss_counted_tokens: 889 | |
total tokens: 2352 num samples: 21 num padding tokens: 182 - rank: 7 max len: 112 min len: 89 avg len: 103.33333333333333 num_loss_counted_tokens: 492 | |
total tokens: 2448 num samples: 3 num padding tokens: 158 - rank: 0 max len: 816 min len: 692 avg len: 763.3333333333334 num_loss_counted_tokens: 1948 | |
Per-token loss scaled by world size: 0.0007658984395675361
Per-token loss scaled by world size: 0.0015837273094803095
Per-token loss scaled by world size: 0.0005065425066277385
Per-token loss scaled by world size: 0.0018945533083751798
Per-token loss scaled by world size: 0.0015129988314583898
Per-token loss scaled by world size: 0.0009648238192312419
Per-token loss scaled by world size: 0.0017755450680851936
Epoch: 0, Step: 113, Rank: 5, loss = 0.895115315914154
Epoch: 0, Step: 113, Rank: 4, loss = 0.7105622887611389
Epoch: 0, Step: 113, Rank: 1, loss = 1.469303011894226
Epoch: 0, Step: 113, Rank: 2, loss = 1.75767183303833
Epoch: 0, Step: 113, Rank: 3, loss = 1.4036846160888672
Epoch: 0, Step: 113, Rank: 7, loss = 0.4699448347091675
Epoch: 0, Step: 113, Rank: 0, loss = 1.6472619771957397
Per-token loss scaled by world size: 0.0010901163332164288 | |
Epoch: 0, Step: 113, Rank: 6, loss = 1.0113554000854492 | |
[2024-06-27 16:43:48,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=113, skipped=0, lr=[5.8701298701298705e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:49,025] [INFO] [timer.py:260:stop] epoch=0/micro_step=113/global_step=113, RunningAvgSamplesPerSec=95.5734168043772, CurrSamplesPerSec=95.92304300315246, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.82390150242313 samples/s, lr: 5.8701298701298705e-06, loss: 1.6472619771957397 cuda_mem_allocated: 22.275378704071045 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7422.0 batch_size: 78.0 total loss: 1.1706123352050781 | |
Epoch 0: 53% 113/213 [02:38<02:09, 1.30s/it] total tokens: 2436 num samples: 12 num padding tokens: 177 - rank: 5 max len: 203 min len: 171 avg len: 188.25 num_loss_counted_tokens: 939 | |
total tokens: 2508 num samples: 11 num padding tokens: 112 - rank: 4 max len: 228 min len: 205 avg len: 217.8181818181818 num_loss_counted_tokens: 935 | |
total tokens: 2432 num samples: 8 num padding tokens: 146 - rank: 2 max len: 304 min len: 263 avg len: 285.75 num_loss_counted_tokens: 1449 | |
total tokens: 2125 num samples: 5 num padding tokens: 337 - rank: 1 max len: 425 min len: 308 avg len: 357.6 num_loss_counted_tokens: 821 | |
total tokens: 2349 num samples: 9 num padding tokens: 146 - rank: 3 max len: 261 min len: 228 avg len: 244.77777777777777 num_loss_counted_tokens: 1024 | |
total tokens: 2490 num samples: 15 num padding tokens: 217 - rank: 6 max len: 166 min len: 140 avg len: 151.53333333333333 num_loss_counted_tokens: 897 | |
total tokens: 2502 num samples: 18 num padding tokens: 471 - rank: 7 max len: 139 min len: 86 avg len: 112.83333333333333 num_loss_counted_tokens: 571 | |
total tokens: 2288 num samples: 4 num padding tokens: 211 - rank: 0 max len: 572 min len: 438 avg len: 519.25 num_loss_counted_tokens: 1454 | |
Per-token loss scaled by world size: 0.0011834506876766682
Per-token loss scaled by world size: 0.001648307079449296
Per-token loss scaled by world size: 0.0008010825840756297
Per-token loss scaled by world size: 0.0013708685291931033
Per-token loss scaled by world size: 0.0017046001739799976
Per-token loss scaled by world size: 0.0010962167289108038
Per-token loss scaled by world size: 0.0005097519024275243
Epoch: 0, Step: 114, Rank: 2, loss = 1.5708366632461548
Epoch: 0, Step: 114, Rank: 6, loss = 1.1278284788131714
Epoch: 0, Step: 114, Rank: 4, loss = 1.3064377307891846
Epoch: 0, Step: 114, Rank: 5, loss = 0.7634317278862
Epoch: 0, Step: 114, Rank: 3, loss = 1.6244839429855347
Epoch: 0, Step: 114, Rank: 1, loss = 1.0446945428848267 | |
Per-token loss scaled by world size: 0.0016073103761300445 | |
Epoch: 0, Step: 114, Rank: 7, loss = 0.4857935905456543 | |
Epoch: 0, Step: 114, Rank: 0, loss = 1.5317667722702026 | |
[2024-06-27 16:43:50,017] [INFO] [logging.py:96:log_dist] [Rank 0] step=114, skipped=0, lr=[5.9220779220779226e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:50,091] [INFO] [timer.py:260:stop] epoch=0/micro_step=114/global_step=114, RunningAvgSamplesPerSec=95.56755037887156, CurrSamplesPerSec=94.92082348143207, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.82528264846462 samples/s, lr: 5.9220779220779226e-06, loss: 1.5317667722702026 cuda_mem_allocated: 22.314496517181396 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7624.0 batch_size: 85.0 total loss: 1.181909203529358 | |
Epoch 0: 54% 114/213 [02:39<02:01, 1.23s/it] total tokens: 2366 num samples: 14 num padding tokens: 229 - rank: 6 max len: 169 min len: 132 avg len: 152.64285714285714 num_loss_counted_tokens: 778 | |
total tokens: 2364 num samples: 12 num padding tokens: 134 - rank: 5 max len: 197 min len: 172 avg len: 185.83333333333334 num_loss_counted_tokens: 858 | |
total tokens: 2520 num samples: 7 num padding tokens: 114 - rank: 1 max len: 360 min len: 319 avg len: 343.7142857142857 num_loss_counted_tokens: 926 | |
total tokens: 2504 num samples: 8 num padding tokens: 152 - rank: 2 max len: 313 min len: 273 avg len: 294.0 num_loss_counted_tokens: 1042 | |
total tokens: 2310 num samples: 10 num padding tokens: 183 - rank: 4 max len: 231 min len: 198 avg len: 212.7 num_loss_counted_tokens: 1059 | |
total tokens: 2448 num samples: 9 num padding tokens: 108 - rank: 3 max len: 272 min len: 233 avg len: 260.0 num_loss_counted_tokens: 815 | |
total tokens: 2470 num samples: 19 num padding tokens: 342 - rank: 7 max len: 130 min len: 91 avg len: 112.0 num_loss_counted_tokens: 625 | |
total tokens: 2400 num samples: 5 num padding tokens: 184 - rank: 0 max len: 480 min len: 408 avg len: 443.2 num_loss_counted_tokens: 1499 | |
Per-token loss scaled by world size: 0.0014064004644751549
Per-token loss scaled by world size: 0.0015796622028574347
Per-token loss scaled by world size: 0.001168790739029646
Per-token loss scaled by world size: 0.0011280010221526027
Per-token loss scaled by world size: 0.00075917859794572
Per-token loss scaled by world size: 0.000771721126511693
Per-token loss scaled by world size: 0.0015636840835213661
Epoch: 0, Step: 115, Rank: 0, loss = 1.6600275039672852
Epoch: 0, Step: 115, Rank: 4, loss = 1.2282530069351196
Epoch: 0, Step: 115, Rank: 3, loss = 1.1853880882263184
Epoch: 0, Step: 115, Rank: 2, loss = 1.4779510498046875 | |
Epoch: 0, Step: 115, Rank: 5, loss = 0.7978017926216125 | |
Epoch: 0, Step: 115, Rank: 6, loss = 0.8109824657440186 | |
Per-token loss scaled by world size: 0.0007565103587694466 | |
Epoch: 0, Step: 115, Rank: 1, loss = 1.643236517906189 | |
Epoch: 0, Step: 115, Rank: 7, loss = 0.7949978113174438 | |
[2024-06-27 16:43:51,064] [INFO] [logging.py:96:log_dist] [Rank 0] step=115, skipped=0, lr=[5.9740259740259746e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:51,138] [INFO] [timer.py:260:stop] epoch=0/micro_step=115/global_step=115, RunningAvgSamplesPerSec=95.57696341630651, CurrSamplesPerSec=96.64308848371037, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 96.54710665487602 samples/s, lr: 5.9740259740259746e-06, loss: 1.6600275039672852 cuda_mem_allocated: 22.252480506896973 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8407.0 batch_size: 72.0 total loss: 1.1998296976089478 | |
Epoch 0: 54% 115/213 [02:40<01:54, 1.17s/it] total tokens: 2400 num samples: 8 num padding tokens: 310 - rank: 3 max len: 300 min len: 241 avg len: 261.25 num_loss_counted_tokens: 936 | |
total tokens: 2370 num samples: 10 num padding tokens: 231 - rank: 4 max len: 237 min len: 202 avg len: 213.9 num_loss_counted_tokens: 1037 | |
total tokens: 2366 num samples: 14 num padding tokens: 244 - rank: 6 max len: 169 min len: 134 avg len: 151.57142857142858 num_loss_counted_tokens: 808 | |
total tokens: 2522 num samples: 13 num padding tokens: 151 - rank: 5 max len: 194 min len: 172 avg len: 182.3846153846154 num_loss_counted_tokens: 946 | |
total tokens: 2425 num samples: 5 num padding tokens: 284 - rank: 1 max len: 485 min len: 384 avg len: 428.2 num_loss_counted_tokens: 1170 | |
total tokens: 2286 num samples: 6 num padding tokens: 222 - rank: 2 max len: 381 min len: 301 avg len: 344.0 num_loss_counted_tokens: 1070 | |
total tokens: 2415 num samples: 3 num padding tokens: 448 - rank: 0 max len: 805 min len: 565 avg len: 655.6666666666666 num_loss_counted_tokens: 656 | |
total tokens: 2358 num samples: 18 num padding tokens: 289 - rank: 7 max len: 131 min len: 92 avg len: 114.94444444444444 num_loss_counted_tokens: 595 | |
Per-token loss scaled by world size: 0.0012289314763620496
Per-token loss scaled by world size: 0.002072478411719203
Per-token loss scaled by world size: 0.001205145730637014
Per-token loss scaled by world size: 0.0019482182106003165
Per-token loss scaled by world size: 0.0002463744021952152
Per-token loss scaled by world size: 0.0011704021599143744
Per-token loss scaled by world size: 0.000930821755900979 | |
Epoch: 0, Step: 116, Rank: 0, loss = 0.21939639747142792
Epoch: 0, Step: 116, Rank: 2, loss = 1.845542073249817
Epoch: 0, Step: 116, Rank: 5, loss = 1.073182225227356
Epoch: 0, Step: 116, Rank: 4, loss = 1.0943634510040283
Epoch: 0, Step: 116, Rank: 1, loss = 1.7348883152008057
Epoch: 0, Step: 116, Rank: 6, loss = 1.0422431230545044
Epoch: 0, Step: 116, Rank: 3, loss = 0.8288967609405518 | |
Per-token loss scaled by world size: 0.0006563226343132555 | |
Epoch: 0, Step: 116, Rank: 7, loss = 0.5844553112983704 | |
[2024-06-27 16:43:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] step=116, skipped=0, lr=[6.025974025974027e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:52,198] [INFO] [timer.py:260:stop] epoch=0/micro_step=116/global_step=116, RunningAvgSamplesPerSec=95.57454524359996, CurrSamplesPerSec=95.30207762330119, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.21207165547335 samples/s, lr: 6.025974025974027e-06, loss: 0.21939639747142792 cuda_mem_allocated: 22.290286540985107 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7124.0 batch_size: 76.0 total loss: 1.0528709888458252 | |
Epoch 0: 54% 116/213 [02:41<01:50, 1.14s/it] total tokens: 2513 num samples: 7 num padding tokens: 263 - rank: 3 max len: 359 min len: 276 avg len: 321.42857142857144 num_loss_counted_tokens: 1420 | |
total tokens: 2412 num samples: 12 num padding tokens: 253 - rank: 5 max len: 201 min len: 154 avg len: 179.91666666666666 num_loss_counted_tokens: 854 | |
total tokens: 2413 num samples: 19 num padding tokens: 400 - rank: 7 max len: 127 min len: 81 avg len: 105.94736842105263 num_loss_counted_tokens: 552 | |
total tokens: 2349 num samples: 9 num padding tokens: 171 - rank: 4 max len: 261 min len: 207 avg len: 242.0 num_loss_counted_tokens: 845 | |
total tokens: 2464 num samples: 16 num padding tokens: 181 - rank: 6 max len: 154 min len: 129 avg len: 142.6875 num_loss_counted_tokens: 897 | |
total tokens: 2140 num samples: 4 num padding tokens: 229 - rank: 2 max len: 535 min len: 419 avg len: 477.75 num_loss_counted_tokens: 1069 | |
total tokens: 2336 num samples: 2 num padding tokens: 474 - rank: 1 max len: 1168 min len: 694 avg len: 931.0 num_loss_counted_tokens: 1281 | |
total tokens: 1368 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1368 min len: 1368 avg len: 1368.0 num_loss_counted_tokens: 33 | |
Per-token loss scaled by world size: 0.0012666051043197513
Per-token loss scaled by world size: 0.0005617947317659855
Per-token loss scaled by world size: 0.0015880611026659608
Per-token loss scaled by world size: 0.0012504234910011292
Per-token loss scaled by world size: 0.0009642968652769923 | |
Per-token loss scaled by world size: 0.0006882617017254233
Per-token loss scaled by world size: 0.001057146699167788
Epoch: 0, Step: 117, Rank: 1, loss = 1.2537058591842651 | |
Epoch: 0, Step: 117, Rank: 5, loss = 0.9668281674385071 | |
Epoch: 0, Step: 117, Rank: 6, loss = 0.5632694363594055
Epoch: 0, Step: 117, Rank: 4, loss = 1.2699298858642578
Epoch: 0, Step: 117, Rank: 2, loss = 1.5922297239303589
Epoch: 0, Step: 117, Rank: 7, loss = 0.6900683641433716
Per-token loss scaled by world size: 0.0014010658487677574 | |
Epoch: 0, Step: 117, Rank: 3, loss = 1.0599217414855957 | |
Epoch: 0, Step: 117, Rank: 0, loss = 1.4047436714172363 | |
[2024-06-27 16:43:53,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=117, skipped=0, lr=[6.077922077922079e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:53,263] [INFO] [timer.py:260:stop] epoch=0/micro_step=117/global_step=117, RunningAvgSamplesPerSec=95.57029493959251, CurrSamplesPerSec=95.088225778494, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.98546838800567 samples/s, lr: 6.077922077922079e-06, loss: 1.4047436714172363 cuda_mem_allocated: 22.31079864501953 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8021.0 batch_size: 86.0 total loss: 1.1000871658325195 | |
Epoch 0: 55% 117/213 [02:42<01:47, 1.12s/it] total tokens: 2392 num samples: 13 num padding tokens: 192 - rank: 5 max len: 184 min len: 148 avg len: 169.23076923076923 num_loss_counted_tokens: 840 | |
total tokens: 2499 num samples: 17 num padding tokens: 238 - rank: 6 max len: 147 min len: 121 avg len: 133.0 num_loss_counted_tokens: 794 | |
total tokens: 2415 num samples: 5 num padding tokens: 372 - rank: 1 max len: 483 min len: 357 avg len: 408.6 num_loss_counted_tokens: 1253 | |
total tokens: 2450 num samples: 7 num padding tokens: 279 - rank: 2 max len: 350 min len: 281 avg len: 310.14285714285717 num_loss_counted_tokens: 880 | |
total tokens: 2520 num samples: 9 num padding tokens: 242 - rank: 3 max len: 280 min len: 229 avg len: 253.11111111111111 num_loss_counted_tokens: 1085 | |
total tokens: 2497 num samples: 11 num padding tokens: 293 - rank: 4 max len: 227 min len: 187 avg len: 200.36363636363637 num_loss_counted_tokens: 869 | |
total tokens: 2439 num samples: 3 num padding tokens: 468 - rank: 0 max len: 813 min len: 576 avg len: 657.0 num_loss_counted_tokens: 859 | |
total tokens: 2520 num samples: 21 num padding tokens: 309 - rank: 7 max len: 120 min len: 88 avg len: 105.28571428571429 num_loss_counted_tokens: 549 | |
Per-token loss scaled by world size: 0.0007097934721969068
Per-token loss scaled by world size: 0.0010727191111072898
Per-token loss scaled by world size: 0.002075920579954982
Per-token loss scaled by world size: 0.0016427291557192802
Per-token loss scaled by world size: 0.00012205814709886909
Per-token loss scaled by world size: 0.0014300370821729302
Per-token loss scaled by world size: 0.0013806807110086083 | |
Epoch: 0, Step: 118, Rank: 7, loss = 0.5714724659919739 | |
Epoch: 0, Step: 118, Rank: 2, loss = 1.6713756322860718
Epoch: 0, Step: 118, Rank: 5, loss = 0.8636729717254639
Epoch: 0, Step: 118, Rank: 1, loss = 1.3226022720336914
Epoch: 0, Step: 118, Rank: 6, loss = 1.1513586044311523
Epoch: 0, Step: 118, Rank: 0, loss = 0.09827206283807755
Per-token loss scaled by world size: 0.0016008660895749927
Epoch: 0, Step: 118, Rank: 3, loss = 1.111620545387268
Epoch: 0, Step: 118, Rank: 4, loss = 1.2888972759246826 | |
[2024-06-27 16:43:54,242] [INFO] [logging.py:96:log_dist] [Rank 0] step=118, skipped=0, lr=[6.129870129870131e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:54,315] [INFO] [timer.py:260:stop] epoch=0/micro_step=118/global_step=118, RunningAvgSamplesPerSec=95.57504243557545, CurrSamplesPerSec=96.12416857369716, MemAllocated=22.18GB, MaxMemAllocated=28.61GB | |
throughput: 96.01829114308913 samples/s, lr: 6.129870129870131e-06, loss: 0.09827206283807755 cuda_mem_allocated: 22.176749229431152 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6441.0 batch_size: 74.0 total loss: 1.0099090337753296 | |
Epoch 0: 55% 118/213 [02:43<01:44, 1.10s/it] total tokens: 2520 num samples: 10 num padding tokens: 230 - rank: 4 max len: 252 min len: 212 avg len: 229.0 num_loss_counted_tokens: 1025 | |
total tokens: 2484 num samples: 6 num padding tokens: 193 - rank: 1 max len: 414 min len: 346 avg len: 381.8333333333333 num_loss_counted_tokens: 1235 | |
total tokens: 2448 num samples: 8 num padding tokens: 218 - rank: 3 max len: 306 min len: 260 avg len: 278.75 num_loss_counted_tokens: 908 | |
total tokens: 2460 num samples: 12 num padding tokens: 221 - rank: 5 max len: 205 min len: 167 avg len: 186.58333333333334 num_loss_counted_tokens: 855 | |
total tokens: 2275 num samples: 7 num padding tokens: 59 - rank: 2 max len: 325 min len: 308 avg len: 316.57142857142856 num_loss_counted_tokens: 1228 | |
total tokens: 2460 num samples: 15 num padding tokens: 277 - rank: 6 max len: 164 min len: 132 avg len: 145.53333333333333 num_loss_counted_tokens: 930 | |
total tokens: 2520 num samples: 4 num padding tokens: 520 - rank: 0 max len: 630 min len: 426 avg len: 500.0 num_loss_counted_tokens: 512 | |
total tokens: 2091 num samples: 17 num padding tokens: 321 - rank: 7 max len: 123 min len: 77 avg len: 104.11764705882354 num_loss_counted_tokens: 466 | |
Per-token loss scaled by world size: 0.0021853686776012182
Per-token loss scaled by world size: 0.0009460271103307605
Per-token loss scaled by world size: 0.0014799173222854733
Per-token loss scaled by world size: 0.0004427703679539263
Per-token loss scaled by world size: 0.0010059248888865113
Per-token loss scaled by world size: 0.0009903759928420186
Per-token loss scaled by world size: 0.0009024076862260699
Epoch: 0, Step: 119, Rank: 2, loss = 0.9149264693260193
Epoch: 0, Step: 119, Rank: 1, loss = 1.4312649965286255
Epoch: 0, Step: 119, Rank: 0, loss = 2.1135246753692627
Epoch: 0, Step: 119, Rank: 7, loss = 0.42821428179740906
Epoch: 0, Step: 119, Rank: 3, loss = 0.9728551506996155
Epoch: 0, Step: 119, Rank: 4, loss = 0.9578173756599426
Epoch: 0, Step: 119, Rank: 5, loss = 0.8727410435676575
Per-token loss scaled by world size: 0.000973792455624789 | |
Epoch: 0, Step: 119, Rank: 6, loss = 0.9417790174484253 | |
[2024-06-27 16:43:55,303] [INFO] [logging.py:96:log_dist] [Rank 0] step=119, skipped=0, lr=[6.181818181818182e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:55,376] [INFO] [timer.py:260:stop] epoch=0/micro_step=119/global_step=119, RunningAvgSamplesPerSec=95.57361871326114, CurrSamplesPerSec=95.40875426747318, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.32032940589379 samples/s, lr: 6.181818181818182e-06, loss: 2.1135246753692627 cuda_mem_allocated: 22.270727157592773 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7737.0 batch_size: 82.0 total loss: 1.0791404247283936 | |
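The per-rank losses and the "total loss" in the summary line above can be reconstructed from the "Per-token loss scaled by world size" values: each rank's loss is its scaled per-token loss times the global num_loss_counted_tokens, divided by the world size (8 GPUs here), and total loss is the plain mean over ranks. A sketch using the step-119 numbers from the log (variable names are mine):

```python
world_size = 8
num_loss_counted_tokens = 7737.0  # from the step-119 summary line

# the eight "Per-token loss scaled by world size" values logged for step 119
per_token_scaled = [
    0.0021853686776012182, 0.0009460271103307605, 0.0014799173222854733,
    0.0004427703679539263, 0.0010059248888865113, 0.0009903759928420186,
    0.0009024076862260699, 0.000973792455624789,
]

# per-rank loss = scaled per-token loss * total counted tokens / world size
rank_losses = [p * num_loss_counted_tokens / world_size for p in per_token_scaled]

# the summary's "total loss" is the mean across ranks
total_loss = sum(rank_losses) / world_size

assert abs(max(rank_losses) - 2.1135246753692627) < 1e-6  # rank 0's logged loss
assert abs(total_loss - 1.0791404247283936) < 1e-6
```

(The `loss:` field in the summary line echoes rank 0's loss, which is why it swings between ~0.05 and ~2.9 across steps while `total loss` stays near 1.)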
Epoch 0: 56% 119/213 [02:44<01:42, 1.09s/it]
total tokens: 2416 num samples: 8 num padding tokens: 87 - rank: 3 max len: 302 min len: 264 avg len: 291.125 num_loss_counted_tokens: 984
total tokens: 2358 num samples: 9 num padding tokens: 197 - rank: 4 max len: 262 min len: 223 avg len: 240.11111111111111 num_loss_counted_tokens: 805 | |
total tokens: 2520 num samples: 14 num padding tokens: 400 - rank: 6 max len: 180 min len: 125 avg len: 151.42857142857142 num_loss_counted_tokens: 834 | |
total tokens: 2340 num samples: 6 num padding tokens: 174 - rank: 1 max len: 390 min len: 345 avg len: 361.0 num_loss_counted_tokens: 1464 | |
total tokens: 2398 num samples: 11 num padding tokens: 202 - rank: 5 max len: 218 min len: 181 avg len: 199.63636363636363 num_loss_counted_tokens: 932 | |
total tokens: 2440 num samples: 20 num padding tokens: 311 - rank: 7 max len: 122 min len: 77 avg len: 106.45 num_loss_counted_tokens: 573 | |
total tokens: 2394 num samples: 7 num padding tokens: 168 - rank: 2 max len: 342 min len: 306 avg len: 318.0 num_loss_counted_tokens: 1248 | |
total tokens: 2400 num samples: 5 num padding tokens: 215 - rank: 0 max len: 480 min len: 405 avg len: 437.0 num_loss_counted_tokens: 1430 | |
Per-token loss scaled by world size: 0.001979653723537922
Per-token loss scaled by world size: 0.0022558097261935472
Per-token loss scaled by world size: 0.0005044273566454649
Per-token loss scaled by world size: 0.001412503537721932
Per-token loss scaled by world size: 0.0010639942483976483
Per-token loss scaled by world size: 0.001233058050274849
Per-token loss scaled by world size: 0.0008246272918768227
Epoch: 0, Step: 120, Rank: 1, loss = 2.143019199371338
Epoch: 0, Step: 120, Rank: 7, loss = 0.4792059659957886
Epoch: 0, Step: 120, Rank: 4, loss = 1.3418784141540527
Epoch: 0, Step: 120, Rank: 0, loss = 1.8806711435317993
Epoch: 0, Step: 120, Rank: 5, loss = 1.0107945203781128
Epoch: 0, Step: 120, Rank: 6, loss = 1.1714051961898804
Epoch: 0, Step: 120, Rank: 3, loss = 0.7833959460258484
Per-token loss scaled by world size: 0.000949969282373786
Epoch: 0, Step: 120, Rank: 2, loss = 0.9024708271026611
[2024-06-27 16:43:56,373] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[6.233766233766234e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:56,446] [INFO] [timer.py:260:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=95.56467845568022, CurrSamplesPerSec=94.53008927758125, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.44339562896951 samples/s, lr: 6.233766233766234e-06, loss: 1.8806711435317993 cuda_mem_allocated: 22.308533191680908 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7600.0 batch_size: 91.0 total loss: 1.2141051292419434 | |
Epoch 0: 56% 120/213 [02:45<01:40, 1.08s/it]
total tokens: 2506 num samples: 7 num padding tokens: 181 - rank: 2 max len: 358 min len: 317 avg len: 332.14285714285717 num_loss_counted_tokens: 1002
total tokens: 2520 num samples: 10 num padding tokens: 167 - rank: 4 max len: 252 min len: 221 avg len: 235.3 num_loss_counted_tokens: 1140 | |
total tokens: 2336 num samples: 8 num padding tokens: 51 - rank: 3 max len: 292 min len: 272 avg len: 285.625 num_loss_counted_tokens: 1150 | |
total tokens: 2420 num samples: 11 num padding tokens: 394 - rank: 5 max len: 220 min len: 163 avg len: 184.1818181818182 num_loss_counted_tokens: 823 | |
total tokens: 2270 num samples: 5 num padding tokens: 272 - rank: 1 max len: 454 min len: 360 avg len: 399.6 num_loss_counted_tokens: 1175 | |
total tokens: 2430 num samples: 15 num padding tokens: 181 - rank: 6 max len: 162 min len: 140 avg len: 149.93333333333334 num_loss_counted_tokens: 726 | |
total tokens: 2502 num samples: 18 num padding tokens: 278 - rank: 7 max len: 139 min len: 92 avg len: 123.55555555555556 num_loss_counted_tokens: 620 | |
total tokens: 2082 num samples: 3 num padding tokens: 257 - rank: 0 max len: 694 min len: 551 avg len: 608.3333333333334 num_loss_counted_tokens: 836 | |
Per-token loss scaled by world size: 0.0017989190528169274
Per-token loss scaled by world size: 0.0009142596973106265
Per-token loss scaled by world size: 0.0005701335612684488
Per-token loss scaled by world size: 0.0016012239502742887
Per-token loss scaled by world size: 0.0012194853043183684
Per-token loss scaled by world size: 0.0010350669035688043
Per-token loss scaled by world size: 0.0006449150387197733
Epoch: 0, Step: 121, Rank: 1, loss = 1.6583784818649292
Epoch: 0, Step: 121, Rank: 5, loss = 0.8428331613540649
Epoch: 0, Step: 121, Rank: 2, loss = 1.476128339767456
Epoch: 0, Step: 121, Rank: 7, loss = 0.5255918502807617
Epoch: 0, Step: 121, Rank: 3, loss = 0.5945310592651367
Epoch: 0, Step: 121, Rank: 6, loss = 0.9542023539543152
Epoch: 0, Step: 121, Rank: 4, loss = 1.1242129802703857
Per-token loss scaled by world size: 0.0020224445033818483
Epoch: 0, Step: 121, Rank: 0, loss = 1.8644410371780396
[2024-06-27 16:43:57,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=121, skipped=0, lr=[6.285714285714286e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:57,500] [INFO] [timer.py:260:stop] epoch=0/micro_step=121/global_step=121, RunningAvgSamplesPerSec=95.56904640689498, CurrSamplesPerSec=96.08728326540975, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.99139481438394 samples/s, lr: 6.285714285714286e-06, loss: 1.8644410371780396 cuda_mem_allocated: 22.30221176147461 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7375.0 batch_size: 80.0 total loss: 1.1300398111343384 | |
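The learning rate in these [logging.py] lines climbs by a constant ≈5.1948e-08 per step (6.1818e-06 → 6.2338e-06 → 6.2857e-06 → …), which is the signature of a linear warmup schedule, lr(step) = step × Δ. A sketch checking the logged values under that assumption (Δ is inferred from the log itself, not read from the training config):

```python
# Inferred per-step increment: consecutive logged lrs differ by this amount.
delta = 6.233766233766234e-06 - 6.181818181818182e-06  # ~5.1948e-08

def warmup_lr(step):
    # Assumed linear warmup: lr grows proportionally with the global step.
    return step * delta

# Check against the lrs logged for several steps in this section.
for step, logged_lr in [(119, 6.181818181818182e-06),
                        (120, 6.233766233766234e-06),
                        (121, 6.285714285714286e-06),
                        (131, 6.805194805194806e-06)]:
    assert abs(warmup_lr(step) - logged_lr) < 1e-12
```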
Epoch 0: 57% 121/213 [02:46<01:38, 1.07s/it]
total tokens: 2416 num samples: 16 num padding tokens: 249 - rank: 6 max len: 151 min len: 119 avg len: 135.4375 num_loss_counted_tokens: 716
total tokens: 2436 num samples: 12 num padding tokens: 102 - rank: 4 max len: 203 min len: 185 avg len: 194.5 num_loss_counted_tokens: 857 | |
total tokens: 2500 num samples: 10 num padding tokens: 230 - rank: 3 max len: 250 min len: 206 avg len: 227.0 num_loss_counted_tokens: 881 | |
total tokens: 2520 num samples: 14 num padding tokens: 135 - rank: 5 max len: 180 min len: 153 avg len: 170.35714285714286 num_loss_counted_tokens: 949 | |
total tokens: 2504 num samples: 8 num padding tokens: 216 - rank: 2 max len: 313 min len: 250 avg len: 286.0 num_loss_counted_tokens: 1043 | |
total tokens: 2364 num samples: 6 num padding tokens: 231 - rank: 1 max len: 394 min len: 317 avg len: 355.5 num_loss_counted_tokens: 1054 | |
total tokens: 2380 num samples: 20 num padding tokens: 226 - rank: 7 max len: 119 min len: 86 avg len: 107.7 num_loss_counted_tokens: 517 | |
total tokens: 2500 num samples: 4 num padding tokens: 464 - rank: 0 max len: 625 min len: 450 avg len: 509.0 num_loss_counted_tokens: 1365 | |
Per-token loss scaled by world size: 0.0009591860580258071
Per-token loss scaled by world size: 0.0020830880384892225
Per-token loss scaled by world size: 0.0009858061093837023
Per-token loss scaled by world size: 0.001744576497003436
Per-token loss scaled by world size: 0.0008211091626435518
Per-token loss scaled by world size: 0.0009824485750868917
Per-token loss scaled by world size: 6.285587733145803e-05
Epoch: 0, Step: 122, Rank: 2, loss = 1.8359817266464233
Epoch: 0, Step: 122, Rank: 4, loss = 0.8454025983810425
Epoch: 0, Step: 122, Rank: 6, loss = 0.868864893913269
Epoch: 0, Step: 122, Rank: 3, loss = 1.5376261472702026
Epoch: 0, Step: 122, Rank: 5, loss = 0.8659056425094604
Epoch: 0, Step: 122, Rank: 7, loss = 0.7237051129341125
Epoch: 0, Step: 122, Rank: 0, loss = 0.05539959669113159
Per-token loss scaled by world size: 0.0016570407897233963
Epoch: 0, Step: 122, Rank: 1, loss = 1.4604743719100952
[2024-06-27 16:43:58,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=122, skipped=0, lr=[6.337662337662338e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:58,558] [INFO] [timer.py:260:stop] epoch=0/micro_step=122/global_step=122, RunningAvgSamplesPerSec=95.56903762750007, CurrSamplesPerSec=95.56799289102226, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.45656967859614 samples/s, lr: 6.337662337662338e-06, loss: 0.05539959669113159 cuda_mem_allocated: 22.262975215911865 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7051.0 batch_size: 71.0 total loss: 1.024169921875 | |
Epoch 0: 57% 122/213 [02:47<01:37, 1.07s/it]
total tokens: 2261 num samples: 7 num padding tokens: 279 - rank: 2 max len: 323 min len: 266 avg len: 283.14285714285717 num_loss_counted_tokens: 802
total tokens: 2445 num samples: 5 num padding tokens: 564 - rank: 1 max len: 489 min len: 327 avg len: 376.2 num_loss_counted_tokens: 1151 | |
total tokens: 2506 num samples: 14 num padding tokens: 208 - rank: 6 max len: 179 min len: 146 avg len: 164.14285714285714 num_loss_counted_tokens: 872 | |
total tokens: 2370 num samples: 10 num padding tokens: 200 - rank: 4 max len: 237 min len: 200 avg len: 217.0 num_loss_counted_tokens: 827 | |
total tokens: 2482 num samples: 17 num padding tokens: 398 - rank: 7 max len: 146 min len: 95 avg len: 122.58823529411765 num_loss_counted_tokens: 564 | |
total tokens: 2394 num samples: 9 num padding tokens: 156 - rank: 3 max len: 266 min len: 237 avg len: 248.66666666666666 num_loss_counted_tokens: 844 | |
total tokens: 2388 num samples: 12 num padding tokens: 132 - rank: 5 max len: 199 min len: 180 avg len: 188.0 num_loss_counted_tokens: 1013 | |
total tokens: 1614 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1614 min len: 1614 avg len: 1614.0 num_loss_counted_tokens: 52 | |
Per-token loss scaled by world size: 0.001413169433362782
Per-token loss scaled by world size: 0.0011624034959822893
Per-token loss scaled by world size: 0.0012983427150174975
Per-token loss scaled by world size: 0.0008783380035310984
Per-token loss scaled by world size: 0.0015020800055935979
Per-token loss scaled by world size: 0.001309256418608129
Per-token loss scaled by world size: 0.0017879215301945806
Epoch: 0, Step: 123, Rank: 3, loss = 1.269909381866455
Epoch: 0, Step: 123, Rank: 5, loss = 1.0445648431777954
Epoch: 0, Step: 123, Rank: 1, loss = 1.3498066663742065
Epoch: 0, Step: 123, Rank: 4, loss = 1.1667232513427734
Epoch: 0, Step: 123, Rank: 7, loss = 0.7892965078353882
Per-token loss scaled by world size: 0.0008320391061715782
Epoch: 0, Step: 123, Rank: 6, loss = 1.1765305995941162
Epoch: 0, Step: 123, Rank: 2, loss = 1.6066709756851196
Epoch: 0, Step: 123, Rank: 0, loss = 0.7476911544799805
[2024-06-27 16:43:59,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=123, skipped=0, lr=[6.38961038961039e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:43:59,612] [INFO] [timer.py:260:stop] epoch=0/micro_step=123/global_step=123, RunningAvgSamplesPerSec=95.57401954502487, CurrSamplesPerSec=96.17564426304219, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 96.08469226534486 samples/s, lr: 6.38961038961039e-06, loss: 0.7476911544799805 cuda_mem_allocated: 22.28825807571411 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7189.0 batch_size: 84.0 total loss: 1.1438992023468018 | |
Epoch 0: 58% 123/213 [02:48<01:35, 1.06s/it]
total tokens: 2522 num samples: 13 num padding tokens: 280 - rank: 5 max len: 194 min len: 156 avg len: 172.46153846153845 num_loss_counted_tokens: 938
total tokens: 2431 num samples: 11 num padding tokens: 166 - rank: 4 max len: 221 min len: 195 avg len: 205.9090909090909 num_loss_counted_tokens: 802 | |
total tokens: 2392 num samples: 8 num padding tokens: 350 - rank: 3 max len: 299 min len: 232 avg len: 255.25 num_loss_counted_tokens: 853 | |
total tokens: 2533 num samples: 17 num padding tokens: 303 - rank: 6 max len: 149 min len: 115 avg len: 131.1764705882353 num_loss_counted_tokens: 861 | |
total tokens: 2464 num samples: 7 num padding tokens: 200 - rank: 2 max len: 352 min len: 304 avg len: 323.42857142857144 num_loss_counted_tokens: 1108 | |
total tokens: 2390 num samples: 5 num padding tokens: 335 - rank: 1 max len: 478 min len: 356 avg len: 411.0 num_loss_counted_tokens: 1253 | |
total tokens: 1998 num samples: 18 num padding tokens: 199 - rank: 7 max len: 111 min len: 84 avg len: 99.94444444444444 num_loss_counted_tokens: 447 | |
total tokens: 2229 num samples: 3 num padding tokens: 296 - rank: 0 max len: 743 min len: 555 avg len: 644.3333333333334 num_loss_counted_tokens: 622 | |
Per-token loss scaled by world size: 0.0009348822059109807
Per-token loss scaled by world size: 0.0005714487633667886
Per-token loss scaled by world size: 0.0007352223037742078
Per-token loss scaled by world size: 0.001046256278641522
Per-token loss scaled by world size: 0.0018581245094537735
Per-token loss scaled by world size: 0.0011434057960286736
Per-token loss scaled by world size: 0.0018057546112686396
Epoch: 0, Step: 124, Rank: 6, loss = 0.8826456665992737
Epoch: 0, Step: 124, Rank: 4, loss = 0.6941417455673218
Epoch: 0, Step: 124, Rank: 7, loss = 0.5395190715789795
Epoch: 0, Step: 124, Rank: 2, loss = 1.7543017864227295
Epoch: 0, Step: 124, Rank: 1, loss = 1.7048580646514893
Epoch: 0, Step: 124, Rank: 3, loss = 1.0795179605484009
Epoch: 0, Step: 124, Rank: 5, loss = 0.9877967238426208
Per-token loss scaled by world size: 0.0012136552250012755
Epoch: 0, Step: 124, Rank: 0, loss = 1.14584219455719
[2024-06-27 16:44:00,602] [INFO] [logging.py:96:log_dist] [Rank 0] step=124, skipped=0, lr=[6.441558441558442e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:00,676] [INFO] [timer.py:260:stop] epoch=0/micro_step=124/global_step=124, RunningAvgSamplesPerSec=95.57015467084068, CurrSamplesPerSec=95.1048008117561, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.01835079747293 samples/s, lr: 6.441558441558442e-06, loss: 1.14584219455719 cuda_mem_allocated: 22.30256938934326 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7553.0 batch_size: 88.0 total loss: 1.098577857017517 | |
Epoch 0: 58% 124/213 [02:49<01:34, 1.06s/it]
total tokens: 2320 num samples: 10 num padding tokens: 186 - rank: 3 max len: 232 min len: 198 avg len: 213.4 num_loss_counted_tokens: 751
total tokens: 2520 num samples: 15 num padding tokens: 137 - rank: 5 max len: 168 min len: 150 avg len: 158.86666666666667 num_loss_counted_tokens: 941 | |
total tokens: 2340 num samples: 12 num padding tokens: 130 - rank: 4 max len: 195 min len: 172 avg len: 184.16666666666666 num_loss_counted_tokens: 1045 | |
total tokens: 2208 num samples: 6 num padding tokens: 174 - rank: 1 max len: 368 min len: 309 avg len: 339.0 num_loss_counted_tokens: 756 | |
total tokens: 2533 num samples: 17 num padding tokens: 109 - rank: 6 max len: 149 min len: 131 avg len: 142.58823529411765 num_loss_counted_tokens: 915 | |
total tokens: 2304 num samples: 8 num padding tokens: 142 - rank: 2 max len: 288 min len: 237 avg len: 270.25 num_loss_counted_tokens: 873 | |
total tokens: 2304 num samples: 18 num padding tokens: 320 - rank: 7 max len: 128 min len: 85 avg len: 110.22222222222223 num_loss_counted_tokens: 572 | |
total tokens: 2475 num samples: 5 num padding tokens: 424 - rank: 0 max len: 495 min len: 380 avg len: 410.2 num_loss_counted_tokens: 926 | |
Per-token loss scaled by world size: 0.0008412335882894695
Per-token loss scaled by world size: 0.0020697233267128468
Per-token loss scaled by world size: 0.0022198939695954323
Per-token loss scaled by world size: 0.001070388127118349
Per-token loss scaled by world size: 0.0009610801353119314
Per-token loss scaled by world size: 0.0020381328649818897
Per-token loss scaled by world size: 0.0007252657669596374
Epoch: 0, Step: 125, Rank: 2, loss = 1.8671491146087646
Epoch: 0, Step: 125, Rank: 5, loss = 0.7588978409767151
Epoch: 0, Step: 125, Rank: 1, loss = 2.00262188911438
Epoch: 0, Step: 125, Rank: 4, loss = 0.9656239151954651
Epoch: 0, Step: 125, Rank: 6, loss = 0.8670144081115723
Epoch: 0, Step: 125, Rank: 3, loss = 1.8386505842208862
Epoch: 0, Step: 125, Rank: 7, loss = 0.6542803645133972
Per-token loss scaled by world size: 0.0004296208207961172
Epoch: 0, Step: 125, Rank: 0, loss = 0.38757169246673584
[2024-06-27 16:44:01,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=125, skipped=0, lr=[6.493506493506494e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:01,727] [INFO] [timer.py:260:stop] epoch=0/micro_step=125/global_step=125, RunningAvgSamplesPerSec=95.5765014863784, CurrSamplesPerSec=96.35718955593482, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 96.26715987705444 samples/s, lr: 6.493506493506494e-06, loss: 0.38757169246673584 cuda_mem_allocated: 22.294698238372803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7217.0 batch_size: 79.0 total loss: 1.1677261590957642 | |
Epoch 0: 59% 125/213 [02:50<01:33, 1.06s/it]
total tokens: 2509 num samples: 13 num padding tokens: 237 - rank: 5 max len: 193 min len: 161 avg len: 174.76923076923077 num_loss_counted_tokens: 826
total tokens: 2280 num samples: 5 num padding tokens: 344 - rank: 1 max len: 456 min len: 334 avg len: 387.2 num_loss_counted_tokens: 1200 | |
total tokens: 2324 num samples: 7 num padding tokens: 121 - rank: 2 max len: 332 min len: 303 avg len: 314.7142857142857 num_loss_counted_tokens: 1307 | |
total tokens: 2508 num samples: 11 num padding tokens: 229 - rank: 4 max len: 228 min len: 194 avg len: 207.1818181818182 num_loss_counted_tokens: 867 | |
total tokens: 2496 num samples: 16 num padding tokens: 252 - rank: 6 max len: 156 min len: 130 avg len: 140.25 num_loss_counted_tokens: 708 | |
total tokens: 2037 num samples: 3 num padding tokens: 334 - rank: 0 max len: 679 min len: 456 avg len: 567.6666666666666 num_loss_counted_tokens: 543 | |
total tokens: 2451 num samples: 19 num padding tokens: 379 - rank: 7 max len: 129 min len: 95 avg len: 109.05263157894737 num_loss_counted_tokens: 580 | |
total tokens: 2376 num samples: 8 num padding tokens: 251 - rank: 3 max len: 297 min len: 232 avg len: 265.625 num_loss_counted_tokens: 947 | |
Per-token loss scaled by world size: 0.0015425544697791338
Per-token loss scaled by world size: 0.0013749138452112675
Per-token loss scaled by world size: 0.0004735948459710926
Per-token loss scaled by world size: 0.001547633670270443
Per-token loss scaled by world size: 0.0006732430192641914
Per-token loss scaled by world size: 0.0009818142279982567
Per-token loss scaled by world size: 0.0009908878710120916
Epoch: 0, Step: 126, Rank: 4, loss = 1.408352255821228
Epoch: 0, Step: 126, Rank: 0, loss = 0.43239209055900574
Epoch: 0, Step: 126, Rank: 2, loss = 1.2552963495254517
Epoch: 0, Step: 126, Rank: 5, loss = 0.904680609703064
Epoch: 0, Step: 126, Rank: 1, loss = 1.4129894971847534
Epoch: 0, Step: 126, Rank: 6, loss = 0.8963963985443115
Epoch: 0, Step: 126, Rank: 7, loss = 0.6146708726882935
Per-token loss scaled by world size: 0.0013469000114127994
Epoch: 0, Step: 126, Rank: 3, loss = 1.2297197580337524
[2024-06-27 16:44:02,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=126, skipped=0, lr=[6.545454545454546e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:02,791] [INFO] [timer.py:260:stop] epoch=0/micro_step=126/global_step=126, RunningAvgSamplesPerSec=95.57143016782707, CurrSamplesPerSec=94.95173547007046, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.85834473556251 samples/s, lr: 6.545454545454546e-06, loss: 0.43239209055900574 cuda_mem_allocated: 22.279313564300537 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7304.0 batch_size: 89.0 total loss: 1.0193121433258057 | |
Epoch 0: 59% 126/213 [02:52<01:32, 1.06s/it]
total tokens: 2313 num samples: 9 num padding tokens: 191 - rank: 3 max len: 257 min len: 216 avg len: 235.77777777777777 num_loss_counted_tokens: 1037
total tokens: 2394 num samples: 14 num padding tokens: 198 - rank: 5 max len: 171 min len: 145 avg len: 156.85714285714286 num_loss_counted_tokens: 772 | |
total tokens: 2520 num samples: 18 num padding tokens: 234 - rank: 6 max len: 140 min len: 114 avg len: 127.0 num_loss_counted_tokens: 806 | |
total tokens: 2232 num samples: 6 num padding tokens: 90 - rank: 1 max len: 372 min len: 329 avg len: 357.0 num_loss_counted_tokens: 1425 | |
total tokens: 2343 num samples: 11 num padding tokens: 241 - rank: 4 max len: 213 min len: 172 avg len: 191.0909090909091 num_loss_counted_tokens: 1037 | |
total tokens: 2512 num samples: 8 num padding tokens: 144 - rank: 2 max len: 314 min len: 267 avg len: 296.0 num_loss_counted_tokens: 790 | |
total tokens: 2058 num samples: 3 num padding tokens: 408 - rank: 0 max len: 686 min len: 418 avg len: 550.0 num_loss_counted_tokens: 1292 | |
total tokens: 2052 num samples: 18 num padding tokens: 244 - rank: 7 max len: 114 min len: 82 avg len: 100.44444444444444 num_loss_counted_tokens: 443 | |
Per-token loss scaled by world size: 0.0011991484789177775
Per-token loss scaled by world size: 0.0010116973426192999
Per-token loss scaled by world size: 0.0012707292335107923
Per-token loss scaled by world size: 0.0016641626134514809
Per-token loss scaled by world size: 0.001668070675805211
Per-token loss scaled by world size: 0.0005764652159996331
Per-token loss scaled by world size: 0.0011702096089720726
Epoch: 0, Step: 127, Rank: 6, loss = 0.9229209423065186
Epoch: 0, Step: 127, Rank: 5, loss = 1.067523717880249
Epoch: 0, Step: 127, Rank: 4, loss = 1.1592227220535278
Epoch: 0, Step: 127, Rank: 3, loss = 1.0939232110977173
Epoch: 0, Step: 127, Rank: 1, loss = 1.5181323289871216
Epoch: 0, Step: 127, Rank: 2, loss = 1.5216975212097168
Epoch: 0, Step: 127, Rank: 0, loss = 0.5258803963661194
Per-token loss scaled by world size: 0.000593892065808177
Epoch: 0, Step: 127, Rank: 7, loss = 0.541778028011322
[2024-06-27 16:44:03,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=127, skipped=0, lr=[6.597402597402598e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:03,843] [INFO] [timer.py:260:stop] epoch=0/micro_step=127/global_step=127, RunningAvgSamplesPerSec=95.57515624587967, CurrSamplesPerSec=96.03945254724954, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.94013897330925 samples/s, lr: 6.597402597402598e-06, loss: 0.5258803963661194 cuda_mem_allocated: 22.270727157592773 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7298.0 batch_size: 79.0 total loss: 1.0438848733901978 | |
Epoch 0: 60% 127/213 [02:53<01:31, 1.06s/it]
total tokens: 2480 num samples: 5 num padding tokens: 180 - rank: 1 max len: 496 min len: 375 avg len: 460.0 num_loss_counted_tokens: 1043
total tokens: 2340 num samples: 9 num padding tokens: 330 - rank: 4 max len: 260 min len: 197 avg len: 223.33333333333334 num_loss_counted_tokens: 759 | |
total tokens: 2471 num samples: 7 num padding tokens: 180 - rank: 2 max len: 353 min len: 291 avg len: 327.2857142857143 num_loss_counted_tokens: 1128 | |
total tokens: 2352 num samples: 12 num padding tokens: 287 - rank: 5 max len: 196 min len: 154 avg len: 172.08333333333334 num_loss_counted_tokens: 655 | |
total tokens: 2272 num samples: 8 num padding tokens: 117 - rank: 3 max len: 284 min len: 261 avg len: 269.375 num_loss_counted_tokens: 794 | |
total tokens: 2448 num samples: 16 num padding tokens: 143 - rank: 6 max len: 153 min len: 137 avg len: 144.0625 num_loss_counted_tokens: 846 | |
total tokens: 2466 num samples: 18 num padding tokens: 356 - rank: 7 max len: 137 min len: 92 avg len: 117.22222222222223 num_loss_counted_tokens: 661 | |
total tokens: 2085 num samples: 3 num padding tokens: 156 - rank: 0 max len: 695 min len: 578 avg len: 643.0 num_loss_counted_tokens: 1464 | |
Per-token loss scaled by world size: 0.000901527819223702
Per-token loss scaled by world size: 0.0006376776145771146
Per-token loss scaled by world size: 0.0008352432632818818
Per-token loss scaled by world size: 0.0021958923898637295
Per-token loss scaled by world size: 0.0014212250243872404
Per-token loss scaled by world size: 0.0005086821620352566
Epoch: 0, Step: 128, Rank: 3, loss = 0.8686220645904541
Per-token loss scaled by world size: 0.0005391775048337877
Epoch: 0, Step: 128, Rank: 6, loss = 0.6144023537635803
Epoch: 0, Step: 128, Rank: 5, loss = 0.8047568798065186
Epoch: 0, Step: 128, Rank: 2, loss = 1.3693503141403198
Epoch: 0, Step: 128, Rank: 1, loss = 2.1157422065734863
Epoch: 0, Step: 128, Rank: 0, loss = 0.49011528491973877
Per-token loss scaled by world size: 0.0014169786591082811
Epoch: 0, Step: 128, Rank: 7, loss = 0.5194975137710571
Epoch: 0, Step: 128, Rank: 4, loss = 1.3652589321136475
[2024-06-27 16:44:04,831] [INFO] [logging.py:96:log_dist] [Rank 0] step=128, skipped=0, lr=[6.64935064935065e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:04,904] [INFO] [timer.py:260:stop] epoch=0/micro_step=128/global_step=128, RunningAvgSamplesPerSec=95.57294372586809, CurrSamplesPerSec=95.29718309292473, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.19363624994179 samples/s, lr: 6.64935064935065e-06, loss: 0.49011528491973877 cuda_mem_allocated: 22.280386924743652 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7708.0 batch_size: 82.0 total loss: 1.0184682607650757 | |
Epoch 0: 60% 128/213 [02:54<01:30, 1.06s/it]
total tokens: 2365 num samples: 11 num padding tokens: 254 - rank: 5 max len: 215 min len: 171 avg len: 191.9090909090909 num_loss_counted_tokens: 832
total tokens: 2394 num samples: 14 num padding tokens: 216 - rank: 6 max len: 171 min len: 138 avg len: 155.57142857142858 num_loss_counted_tokens: 696 | |
total tokens: 2475 num samples: 5 num padding tokens: 143 - rank: 1 max len: 495 min len: 391 avg len: 466.4 num_loss_counted_tokens: 1176 | |
total tokens: 2268 num samples: 7 num padding tokens: 192 - rank: 3 max len: 324 min len: 280 avg len: 296.57142857142856 num_loss_counted_tokens: 760 | |
total tokens: 2511 num samples: 9 num padding tokens: 330 - rank: 4 max len: 279 min len: 225 avg len: 242.33333333333334 num_loss_counted_tokens: 889 | |
total tokens: 2250 num samples: 6 num padding tokens: 166 - rank: 2 max len: 375 min len: 328 avg len: 347.3333333333333 num_loss_counted_tokens: 741 | |
total tokens: 1914 num samples: 3 num padding tokens: 152 - rank: 0 max len: 638 min len: 503 avg len: 587.3333333333334 num_loss_counted_tokens: 978 | |
total tokens: 2484 num samples: 18 num padding tokens: 473 - rank: 7 max len: 138 min len: 80 avg len: 111.72222222222223 num_loss_counted_tokens: 562 | |
Per-token loss scaled by world size: 0.0006010799552313983
Per-token loss scaled by world size: 0.0014900823589414358
Per-token loss scaled by world size: 0.000691260036546737
Per-token loss scaled by world size: 0.0006552952690981328
Per-token loss scaled by world size: 0.0016043368959799409
Per-token loss scaled by world size: 0.0003588599502108991
Per-token loss scaled by world size: 0.0010356663260608912
Epoch: 0, Step: 129, Rank: 1, loss = 1.5196977853775024
Epoch: 0, Step: 129, Rank: 4, loss = 0.6130264401435852
Epoch: 0, Step: 129, Rank: 6, loss = 0.7049988508224487
Epoch: 0, Step: 129, Rank: 2, loss = 1.6362230777740479
Epoch: 0, Step: 129, Rank: 3, loss = 0.6683192849159241
Epoch: 0, Step: 129, Rank: 5, loss = 1.0562502145767212
Epoch: 0, Step: 129, Rank: 7, loss = 0.3659922778606415
Per-token loss scaled by world size: 0.002812962979078293
Epoch: 0, Step: 129, Rank: 0, loss = 2.868870735168457
[2024-06-27 16:44:05,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=129, skipped=0, lr=[6.701298701298702e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:05,969] [INFO] [timer.py:260:stop] epoch=0/micro_step=129/global_step=129, RunningAvgSamplesPerSec=95.56814434706736, CurrSamplesPerSec=94.9672550116559, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.87946738790043 samples/s, lr: 6.701298701298702e-06, loss: 2.868870735168457 cuda_mem_allocated: 22.30650568008423 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8159.0 batch_size: 84.0 total loss: 1.179172396659851 | |
Epoch 0: 61% 129/213 [02:55<01:29, 1.06s/it]
total tokens: 1784 num samples: 2 num padding tokens: 290 - rank: 1 max len: 892 min len: 602 avg len: 747.0 num_loss_counted_tokens: 285
total tokens: 2496 num samples: 13 num padding tokens: 146 - rank: 5 max len: 192 min len: 169 avg len: 180.76923076923077 num_loss_counted_tokens: 932 | |
total tokens: 2530 num samples: 10 num padding tokens: 346 - rank: 4 max len: 253 min len: 192 avg len: 218.4 num_loss_counted_tokens: 776 | |
total tokens: 2285 num samples: 5 num padding tokens: 274 - rank: 2 max len: 457 min len: 360 avg len: 402.2 num_loss_counted_tokens: 1063 | |
total tokens: 2520 num samples: 15 num padding tokens: 294 - rank: 6 max len: 168 min len: 138 avg len: 148.4 num_loss_counted_tokens: 874 | |
total tokens: 2443 num samples: 7 num padding tokens: 322 - rank: 3 max len: 349 min len: 266 avg len: 303.0 num_loss_counted_tokens: 811 | |
total tokens: 2412 num samples: 18 num padding tokens: 367 - rank: 7 max len: 134 min len: 99 avg len: 113.61111111111111 num_loss_counted_tokens: 626 | |
total tokens: 2007 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2007 min len: 2007 avg len: 2007.0 num_loss_counted_tokens: 31 | |
Per-token loss scaled by world size: 0.0005227336660027504
Per-token loss scaled by world size: 0.0014124942244961858
Per-token loss scaled by world size: 0.000678228447213769
Per-token loss scaled by world size: 0.0007176249637268484
Per-token loss scaled by world size: 0.0009657181799411774
Per-token loss scaled by world size: 0.0009431529324501753
Per-token loss scaled by world size: 0.001532725989818573
Epoch: 0, Step: 130, Rank: 7, loss = 0.5286144018173218
Epoch: 0, Step: 130, Rank: 2, loss = 1.428384780883789
Epoch: 0, Step: 130, Rank: 4, loss = 0.7256982326507568
Epoch: 0, Step: 130, Rank: 1, loss = 0.6858584880828857
Epoch: 0, Step: 130, Rank: 0, loss = 1.54996919631958
Epoch: 0, Step: 130, Rank: 3, loss = 0.9537634253501892
Epoch: 0, Step: 130, Rank: 5, loss = 0.9765825271606445
Per-token loss scaled by world size: 0.0011094971559941769
Epoch: 0, Step: 130, Rank: 6, loss = 1.121978998184204
[2024-06-27 16:44:06,957] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=0, lr=[6.753246753246754e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:07,032] [INFO] [timer.py:260:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=95.56551080018184, CurrSamplesPerSec=95.23222596147365, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.116280948571 samples/s, lr: 6.753246753246754e-06, loss: 1.54996919631958 cuda_mem_allocated: 22.287423133850098 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8090.0 batch_size: 82.0 total loss: 0.9963563084602356 | |
Epoch 0: 61% 130/213 [02:56<01:28, 1.06s/it]
total tokens: 2410 num samples: 10 num padding tokens: 151 - rank: 4 max len: 241 min len: 208 avg len: 225.9 num_loss_counted_tokens: 1235
total tokens: 2478 num samples: 6 num padding tokens: 200 - rank: 1 max len: 413 min len: 357 avg len: 379.6666666666667 num_loss_counted_tokens: 1177 | |
total tokens: 2505 num samples: 15 num padding tokens: 262 - rank: 6 max len: 167 min len: 131 avg len: 149.53333333333333 num_loss_counted_tokens: 940 | |
total tokens: 2448 num samples: 12 num padding tokens: 236 - rank: 5 max len: 204 min len: 167 avg len: 184.33333333333334 num_loss_counted_tokens: 1105 | |
total tokens: 2219 num samples: 7 num padding tokens: 303 - rank: 3 max len: 317 min len: 255 avg len: 273.7142857142857 num_loss_counted_tokens: 729 | |
total tokens: 2478 num samples: 7 num padding tokens: 125 - rank: 2 max len: 354 min len: 319 avg len: 336.14285714285717 num_loss_counted_tokens: 1040 | |
total tokens: 2048 num samples: 16 num padding tokens: 292 - rank: 7 max len: 128 min len: 83 avg len: 109.75 num_loss_counted_tokens: 512 | |
total tokens: 2154 num samples: 3 num padding tokens: 459 - rank: 0 max len: 718 min len: 441 avg len: 565.0 num_loss_counted_tokens: 1443 | |
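The per-rank loss lines, the "Per-token loss scaled by world size" lines, and the "total loss" field are related by simple arithmetic. A minimal sketch below reconstructs that relationship from the logged numbers themselves; the formula is inferred from this log, not taken from the training code, so treat it as an assumption.

```python
# Hypothetical reconstruction of how the loss values in this log fit together.
# Inferred relationship: each rank's reported loss is its "Per-token loss scaled
# by world size" times the step's num_loss_counted_tokens, divided by the world
# size, and "total loss" is the mean of the eight per-rank losses.
WORLD_SIZE = 8  # eight ranks report per step in this log

def rank_loss(per_token_loss_scaled, num_loss_counted_tokens):
    """Per-rank loss from the scaled per-token loss (inferred formula)."""
    return per_token_loss_scaled * num_loss_counted_tokens / WORLD_SIZE

# Step 131 values copied from the log: the per-token line printed just before
# the Rank 2 loss, and that step's num_loss_counted_tokens.
print(round(rank_loss(0.001123271300457418, 7602.0), 4))  # 1.0674, the Rank 2 loss

# Step 130: "total loss" equals the mean of the eight rank losses.
rank_losses = [0.5286144018173218, 1.428384780883789, 0.7256982326507568,
               0.6858584880828857, 1.54996919631958, 0.9537634253501892,
               0.9765825271606445, 1.121978998184204]
total_loss = sum(rank_losses) / WORLD_SIZE
print(round(total_loss, 6))  # 0.996356, matching the logged total loss
```

Checking a couple of steps this way is a quick sanity test that the interleaved lines were untangled correctly.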
Per-token loss scaled by world size: 0.0014059062814339995
Per-token loss scaled by world size: 0.0007113716565072536
Per-token loss scaled by world size: 0.0020565642043948174
Per-token loss scaled by world size: 0.0005713762366212904
Per-token loss scaled by world size: 0.0008375751203857362
Per-token loss scaled by world size: 0.000900787184946239
Per-token loss scaled by world size: 0.0008407584973610938
Epoch: 0, Step: 131, Rank: 4, loss = 1.3359624147415161
Epoch: 0, Step: 131, Rank: 0, loss = 1.9542502164840698
Epoch: 0, Step: 131, Rank: 7, loss = 0.6759809255599976
Epoch: 0, Step: 131, Rank: 3, loss = 0.8559730052947998
Epoch: 0, Step: 131, Rank: 6, loss = 0.7959057688713074
Epoch: 0, Step: 131, Rank: 1, loss = 0.5429502725601196
Epoch: 0, Step: 131, Rank: 5, loss = 0.7989307641983032
Per-token loss scaled by world size: 0.001123271300457418
Epoch: 0, Step: 131, Rank: 2, loss = 1.0673885345458984
[2024-06-27 16:44:08,022] [INFO] [logging.py:96:log_dist] [Rank 0] step=131, skipped=0, lr=[6.805194805194806e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:08,096] [INFO] [timer.py:260:stop] epoch=0/micro_step=131/global_step=131, RunningAvgSamplesPerSec=95.56185156256835, CurrSamplesPerSec=95.09577141863868, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 94.99521642176269 samples/s, lr: 6.805194805194806e-06, loss: 1.9542502164840698 cuda_mem_allocated: 22.300780773162842 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7602.0 batch_size: 84.0 total loss: 1.003417730331421
Epoch 0: 62% 131/213 [02:57<01:27, 1.06s/it]
total tokens: 2388 num samples: 12 num padding tokens: 113 - rank: 4 max len: 199 min len: 176 avg len: 189.58333333333334 num_loss_counted_tokens: 988
total tokens: 2464 num samples: 14 num padding tokens: 152 - rank: 5 max len: 176 min len: 153 avg len: 165.14285714285714 num_loss_counted_tokens: 1024
total tokens: 2506 num samples: 7 num padding tokens: 306 - rank: 1 max len: 358 min len: 276 avg len: 314.2857142857143 num_loss_counted_tokens: 1167
total tokens: 2400 num samples: 16 num padding tokens: 179 - rank: 6 max len: 150 min len: 127 avg len: 138.8125 num_loss_counted_tokens: 909
total tokens: 2394 num samples: 9 num padding tokens: 100 - rank: 2 max len: 266 min len: 243 avg len: 254.88888888888889 num_loss_counted_tokens: 1039
total tokens: 2530 num samples: 11 num padding tokens: 131 - rank: 3 max len: 230 min len: 200 avg len: 218.0909090909091 num_loss_counted_tokens: 1239
total tokens: 2261 num samples: 19 num padding tokens: 251 - rank: 7 max len: 119 min len: 83 avg len: 105.78947368421052 num_loss_counted_tokens: 530
total tokens: 2112 num samples: 4 num padding tokens: 290 - rank: 0 max len: 528 min len: 379 avg len: 455.5 num_loss_counted_tokens: 1238
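The per-rank "total tokens / num padding tokens" lines are internally consistent: each rank pads its micro-batch up to its longest sample. A quick check of that relationship, using the rank-4 numbers from step 131 above (the relationship is inferred from the logged values, not read from the training code):

```python
# Consistency check on a per-rank batching line from this log (a sketch).
# Inferred relationship: each micro-batch is padded to its longest sample, so
#   total_tokens       = num_samples * max_len
#   num_padding_tokens = total_tokens - num_samples * avg_len
num_samples, max_len, avg_len = 12, 199, 189.58333333333334  # rank 4, step 131

total_tokens = num_samples * max_len
num_padding_tokens = total_tokens - num_samples * avg_len

print(total_tokens)               # 2388, as logged
print(round(num_padding_tokens))  # 113, as logged
```

The same identity holds for every rank line in the log, which makes it easy to spot a mis-transcribed value.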
Per-token loss scaled by world size: 0.0010354174301028252
Per-token loss scaled by world size: 0.000709801446646452
Per-token loss scaled by world size: 0.0011828482383862138
Per-token loss scaled by world size: 0.0009205403621308506
Per-token loss scaled by world size: 0.0012205899693071842
Per-token loss scaled by world size: 0.001384471426717937
Per-token loss scaled by world size: 0.0011119794799014926
Epoch: 0, Step: 132, Rank: 3, loss = 1.0672248601913452
Epoch: 0, Step: 132, Rank: 2, loss = 0.9342054128646851
Epoch: 0, Step: 132, Rank: 7, loss = 0.6404183506965637
Epoch: 0, Step: 132, Rank: 6, loss = 0.8305575251579285
Epoch: 0, Step: 132, Rank: 1, loss = 1.0032835006713867
Epoch: 0, Step: 132, Rank: 0, loss = 1.2491393089294434
Epoch: 0, Step: 132, Rank: 4, loss = 1.1012773513793945
Per-token loss scaled by world size: 0.0014845936093479395
Epoch: 0, Step: 132, Rank: 5, loss = 1.3394745588302612
[2024-06-27 16:44:09,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=132, skipped=0, lr=[6.857142857142858e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:09,151] [INFO] [timer.py:260:stop] epoch=0/micro_step=132/global_step=132, RunningAvgSamplesPerSec=95.56408360896346, CurrSamplesPerSec=95.85289452503737, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 95.7535563371992 samples/s, lr: 6.857142857142858e-06, loss: 1.2491393089294434 cuda_mem_allocated: 22.30256938934326 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7218.0 batch_size: 77.0 total loss: 1.0206975936889648
Epoch 0: 62% 132/213 [02:58<01:25, 1.06s/it]
total tokens: 2345 num samples: 7 num padding tokens: 64 - rank: 2 max len: 335 min len: 310 avg len: 325.85714285714283 num_loss_counted_tokens: 967
total tokens: 2492 num samples: 14 num padding tokens: 243 - rank: 6 max len: 178 min len: 137 avg len: 160.64285714285714 num_loss_counted_tokens: 851
total tokens: 2475 num samples: 11 num padding tokens: 214 - rank: 5 max len: 225 min len: 179 avg len: 205.54545454545453 num_loss_counted_tokens: 980
total tokens: 2430 num samples: 9 num padding tokens: 247 - rank: 4 max len: 270 min len: 229 avg len: 242.55555555555554 num_loss_counted_tokens: 652
total tokens: 2352 num samples: 8 num padding tokens: 98 - rank: 3 max len: 294 min len: 271 avg len: 281.75 num_loss_counted_tokens: 1010
total tokens: 2115 num samples: 5 num padding tokens: 275 - rank: 1 max len: 423 min len: 338 avg len: 368.0 num_loss_counted_tokens: 1207
total tokens: 2430 num samples: 18 num padding tokens: 443 - rank: 7 max len: 135 min len: 83 avg len: 110.38888888888889 num_loss_counted_tokens: 560
total tokens: 2310 num samples: 3 num padding tokens: 394 - rank: 0 max len: 770 min len: 530 avg len: 638.6666666666666 num_loss_counted_tokens: 670
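The learning rate logged by DeepSpeed climbs by a constant 5.1948e-08 per step, which is consistent with a linear warmup schedule lr(step) = peak_lr * step / warmup_steps. A small sketch below reproduces the logged values under that assumption; the peak of 2e-5 over 385 warmup steps is inferred from the numbers, not taken from the command line, so both constants are hypothetical.

```python
# Linear warmup consistent with the lr values in this log.
# PEAK_LR and WARMUP_STEPS are inferred from the logged numbers (assumptions).
PEAK_LR = 2e-5
WARMUP_STEPS = 385

def warmup_lr(step):
    """Learning rate after `step` optimizer steps under linear warmup."""
    return PEAK_LR * step / WARMUP_STEPS

print(warmup_lr(131))  # ~6.8052e-06, the value logged at step 131
print(warmup_lr(132))  # ~6.8571e-06, the value logged at step 132
```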
Per-token loss scaled by world size: 0.0013352029491215944
Per-token loss scaled by world size: 0.0022145186085253954
Per-token loss scaled by world size: 0.0012434078380465508
Per-token loss scaled by world size: 0.0011681333417072892
Per-token loss scaled by world size: 0.0012957679573446512
Per-token loss scaled by world size: 0.001793670584447682
Per-token loss scaled by world size: 2.943790832432569e-06
Epoch: 0, Step: 133, Rank: 4, loss = 1.014961838722229
Epoch: 0, Step: 133, Rank: 5, loss = 1.16012442111969
Epoch: 0, Step: 133, Rank: 3, loss = 1.9241398572921753
Epoch: 0, Step: 133, Rank: 1, loss = 1.125860333442688
Epoch: 0, Step: 133, Rank: 6, loss = 1.0803660154342651
Epoch: 0, Step: 133, Rank: 2, loss = 1.5584754943847656
Per-token loss scaled by world size: 0.0006679170182906091
Epoch: 0, Step: 133, Rank: 0, loss = 0.0025577861815690994
Epoch: 0, Step: 133, Rank: 7, loss = 0.5803363919258118
[2024-06-27 16:44:10,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=133, skipped=0, lr=[6.90909090909091e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:10,206] [INFO] [timer.py:260:stop] epoch=0/micro_step=133/global_step=133, RunningAvgSamplesPerSec=95.56742149943724, CurrSamplesPerSec=96.00334179112947, MemAllocated=22.18GB, MaxMemAllocated=28.61GB
throughput: 95.9086487170161 samples/s, lr: 6.90909090909091e-06, loss: 0.0025577861815690994 cuda_mem_allocated: 22.177703380584717 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6951.0 batch_size: 70.0 total loss: 1.0558527708053589
Epoch 0: 62% 133/213 [02:59<01:24, 1.06s/it]
total tokens: 2390 num samples: 10 num padding tokens: 215 - rank: 4 max len: 239 min len: 202 avg len: 217.5 num_loss_counted_tokens: 780
total tokens: 2484 num samples: 9 num padding tokens: 134 - rank: 3 max len: 276 min len: 245 avg len: 261.1111111111111 num_loss_counted_tokens: 868
total tokens: 2534 num samples: 7 num padding tokens: 198 - rank: 2 max len: 362 min len: 283 avg len: 333.7142857142857 num_loss_counted_tokens: 1084
total tokens: 2388 num samples: 12 num padding tokens: 128 - rank: 5 max len: 199 min len: 172 avg len: 188.33333333333334 num_loss_counted_tokens: 870
total tokens: 2154 num samples: 3 num padding tokens: 706 - rank: 1 max len: 718 min len: 362 avg len: 482.6666666666667 num_loss_counted_tokens: 870
total tokens: 2520 num samples: 15 num padding tokens: 181 - rank: 6 max len: 168 min len: 135 avg len: 155.93333333333334 num_loss_counted_tokens: 803
total tokens: 2527 num samples: 19 num padding tokens: 418 - rank: 7 max len: 133 min len: 79 avg len: 111.0 num_loss_counted_tokens: 650
total tokens: 2458 num samples: 2 num padding tokens: 495 - rank: 0 max len: 1229 min len: 734 avg len: 981.5 num_loss_counted_tokens: 34
Per-token loss scaled by world size: 0.0009803920984268188
Per-token loss scaled by world size: 0.0010018316097557545
Per-token loss scaled by world size: 0.0011422885581851006
Per-token loss scaled by world size: 0.0007027192623354495
Per-token loss scaled by world size: 0.0018464797176420689
Per-token loss scaled by world size: 0.0011519595282152295
Per-token loss scaled by world size: 0.0018283786484971642
Epoch: 0, Step: 134, Rank: 5, loss = 0.8736518621444702
Epoch: 0, Step: 134, Rank: 2, loss = 0.8927571773529053
Epoch: 0, Step: 134, Rank: 7, loss = 0.6262106895446777
Epoch: 0, Step: 134, Rank: 4, loss = 1.0179219245910645
Epoch: 0, Step: 134, Rank: 1, loss = 1.6454442739486694
Epoch: 0, Step: 134, Rank: 6, loss = 1.026539921760559
Epoch: 0, Step: 134, Rank: 3, loss = 1.6293139457702637
Per-token loss scaled by world size: 0.0005575825343839824
Epoch: 0, Step: 134, Rank: 0, loss = 0.49687573313713074
[2024-06-27 16:44:11,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=134, skipped=0, lr=[6.961038961038962e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:11,276] [INFO] [timer.py:260:stop] epoch=0/micro_step=134/global_step=134, RunningAvgSamplesPerSec=95.56025907693834, CurrSamplesPerSec=94.63117377450484, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 94.54169745440304 samples/s, lr: 6.961038961038962e-06, loss: 0.49687573313713074 cuda_mem_allocated: 22.30543279647827 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7129.0 batch_size: 86.0 total loss: 1.0260894298553467
Epoch 0: 63% 134/213 [03:00<01:23, 1.06s/it]
total tokens: 2136 num samples: 4 num padding tokens: 503 - rank: 1 max len: 534 min len: 339 avg len: 408.25 num_loss_counted_tokens: 912
total tokens: 2296 num samples: 8 num padding tokens: 165 - rank: 3 max len: 287 min len: 251 avg len: 266.375 num_loss_counted_tokens: 601
total tokens: 2385 num samples: 15 num padding tokens: 151 - rank: 6 max len: 159 min len: 135 avg len: 148.93333333333334 num_loss_counted_tokens: 986
total tokens: 2345 num samples: 7 num padding tokens: 175 - rank: 2 max len: 335 min len: 287 avg len: 310.0 num_loss_counted_tokens: 1201
total tokens: 2431 num samples: 13 num padding tokens: 189 - rank: 5 max len: 187 min len: 160 avg len: 172.46153846153845 num_loss_counted_tokens: 863
total tokens: 2453 num samples: 11 num padding tokens: 163 - rank: 4 max len: 223 min len: 188 avg len: 208.1818181818182 num_loss_counted_tokens: 1139
total tokens: 2271 num samples: 3 num padding tokens: 353 - rank: 0 max len: 757 min len: 546 avg len: 639.3333333333334 num_loss_counted_tokens: 1205
total tokens: 2508 num samples: 19 num padding tokens: 472 - rank: 7 max len: 132 min len: 89 avg len: 107.15789473684211 num_loss_counted_tokens: 609
Per-token loss scaled by world size: 0.0002656845317687839
Per-token loss scaled by world size: 0.0014110160991549492
Per-token loss scaled by world size: 0.0017205494223162532
Per-token loss scaled by world size: 0.0004115866613574326
Per-token loss scaled by world size: 0.0012919787550345063
Per-token loss scaled by world size: 0.0014171176590025425
Per-token loss scaled by world size: 0.0011637437855824828
Epoch: 0, Step: 135, Rank: 1, loss = 1.2626830339431763
Epoch: 0, Step: 135, Rank: 0, loss = 0.2377544492483139
Epoch: 0, Step: 135, Rank: 7, loss = 0.3683186173439026
Epoch: 0, Step: 135, Rank: 2, loss = 1.5396766662597656
Epoch: 0, Step: 135, Rank: 6, loss = 1.156159520149231
Epoch: 0, Step: 135, Rank: 5, loss = 1.041405200958252
Epoch: 0, Step: 135, Rank: 3, loss = 1.2681431770324707
Per-token loss scaled by world size: 0.0011596310650929809
Epoch: 0, Step: 135, Rank: 4, loss = 1.0377248525619507
[2024-06-27 16:44:12,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=135, skipped=0, lr=[7.012987012987014e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:12,342] [INFO] [timer.py:260:stop] epoch=0/micro_step=135/global_step=135, RunningAvgSamplesPerSec=95.55506672289751, CurrSamplesPerSec=94.87459381436206, MemAllocated=22.32GB, MaxMemAllocated=28.61GB
throughput: 94.78998974774268 samples/s, lr: 7.012987012987014e-06, loss: 0.2377544492483139 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7159.0 batch_size: 79.0 total loss: 0.9889832139015198
Epoch 0: 63% 135/213 [03:01<01:22, 1.06s/it]
total tokens: 2496 num samples: 16 num padding tokens: 145 - rank: 6 max len: 156 min len: 136 avg len: 146.9375 num_loss_counted_tokens: 982
total tokens: 2317 num samples: 7 num padding tokens: 214 - rank: 2 max len: 331 min len: 277 avg len: 300.42857142857144 num_loss_counted_tokens: 950
total tokens: 2418 num samples: 13 num padding tokens: 145 - rank: 5 max len: 186 min len: 162 avg len: 174.84615384615384 num_loss_counted_tokens: 896
total tokens: 2475 num samples: 9 num padding tokens: 263 - rank: 3 max len: 275 min len: 221 avg len: 245.77777777777777 num_loss_counted_tokens: 656
total tokens: 2420 num samples: 11 num padding tokens: 169 - rank: 4 max len: 220 min len: 191 avg len: 204.63636363636363 num_loss_counted_tokens: 766
total tokens: 2235 num samples: 5 num padding tokens: 295 - rank: 1 max len: 447 min len: 349 avg len: 388.0 num_loss_counted_tokens: 697
total tokens: 2227 num samples: 17 num padding tokens: 330 - rank: 7 max len: 131 min len: 82 avg len: 111.58823529411765 num_loss_counted_tokens: 571
total tokens: 2436 num samples: 3 num padding tokens: 584 - rank: 0 max len: 812 min len: 517 avg len: 617.3333333333334 num_loss_counted_tokens: 840
Per-token loss scaled by world size: 0.0007394844433292747
Per-token loss scaled by world size: 0.001284346915781498
Per-token loss scaled by world size: 0.0008952285861596465
Per-token loss scaled by world size: 0.000706278660800308
Per-token loss scaled by world size: 0.002014349214732647
Per-token loss scaled by world size: 0.0013594189658761024
Per-token loss scaled by world size: 0.0014989018673077226
Epoch: 0, Step: 136, Rank: 5, loss = 0.7644420266151428
Epoch: 0, Step: 136, Rank: 2, loss = 1.3276935815811157
Epoch: 0, Step: 136, Rank: 4, loss = 0.7301155924797058
Epoch: 0, Step: 136, Rank: 6, loss = 0.9254425764083862
Epoch: 0, Step: 136, Rank: 0, loss = 2.082333564758301
Epoch: 0, Step: 136, Rank: 3, loss = 1.4052993059158325
Epoch: 0, Step: 136, Rank: 1, loss = 1.5494898557662964
Per-token loss scaled by world size: 0.0005571566289290786
Epoch: 0, Step: 136, Rank: 7, loss = 0.575960636138916
[2024-06-27 16:44:13,328] [INFO] [logging.py:96:log_dist] [Rank 0] step=136, skipped=0, lr=[7.064935064935066e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:13,401] [INFO] [timer.py:260:stop] epoch=0/micro_step=136/global_step=136, RunningAvgSamplesPerSec=95.55470304500388, CurrSamplesPerSec=95.50635854081955, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 95.40421045199179 samples/s, lr: 7.064935064935066e-06, loss: 2.082333564758301 cuda_mem_allocated: 22.300780773162842 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8270.0 batch_size: 80.0 total loss: 1.1700971126556396
Epoch 0: 64% 136/213 [03:02<01:21, 1.06s/it]
total tokens: 2216 num samples: 4 num padding tokens: 444 - rank: 1 max len: 554 min len: 386 avg len: 443.0 num_loss_counted_tokens: 1361
total tokens: 2508 num samples: 11 num padding tokens: 157 - rank: 4 max len: 228 min len: 199 avg len: 213.72727272727272 num_loss_counted_tokens: 984
total tokens: 2340 num samples: 12 num padding tokens: 105 - rank: 5 max len: 195 min len: 174 avg len: 186.25 num_loss_counted_tokens: 738
total tokens: 2457 num samples: 9 num padding tokens: 200 - rank: 3 max len: 273 min len: 231 avg len: 250.77777777777777 num_loss_counted_tokens: 770
total tokens: 2268 num samples: 7 num padding tokens: 185 - rank: 2 max len: 324 min len: 275 avg len: 297.57142857142856 num_loss_counted_tokens: 1022
total tokens: 2366 num samples: 14 num padding tokens: 231 - rank: 6 max len: 169 min len: 139 avg len: 152.5 num_loss_counted_tokens: 833
total tokens: 2304 num samples: 3 num padding tokens: 237 - rank: 0 max len: 768 min len: 622 avg len: 689.0 num_loss_counted_tokens: 1623
total tokens: 2502 num samples: 18 num padding tokens: 417 - rank: 7 max len: 139 min len: 82 avg len: 115.83333333333333 num_loss_counted_tokens: 620
Per-token loss scaled by world size: 0.0006578860920853913
Per-token loss scaled by world size: 0.0017411127919331193
Per-token loss scaled by world size: 0.001466740621253848
Per-token loss scaled by world size: 0.0010080640204250813
Per-token loss scaled by world size: 0.0013916424941271544
Per-token loss scaled by world size: 0.0007880296907387674
Per-token loss scaled by world size: 0.0011377192568033934
Epoch: 0, Step: 137, Rank: 0, loss = 0.614465594291687
Epoch: 0, Step: 137, Rank: 2, loss = 1.3699357509613037
Epoch: 0, Step: 137, Rank: 1, loss = 1.6261993646621704
Epoch: 0, Step: 137, Rank: 5, loss = 0.9415318369865417
Epoch: 0, Step: 137, Rank: 6, loss = 0.7360197305679321
Epoch: 0, Step: 137, Rank: 4, loss = 1.29979407787323
Epoch: 0, Step: 137, Rank: 3, loss = 1.0626298189163208
Per-token loss scaled by world size: 0.0006538264569826424
Epoch: 0, Step: 137, Rank: 7, loss = 0.6106739044189453
[2024-06-27 16:44:14,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=137, skipped=0, lr=[7.116883116883118e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:14,466] [INFO] [timer.py:260:stop] epoch=0/micro_step=137/global_step=137, RunningAvgSamplesPerSec=95.54978723244932, CurrSamplesPerSec=94.8956118815484, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 94.8050546599265 samples/s, lr: 7.116883116883118e-06, loss: 0.614465594291687 cuda_mem_allocated: 22.262856006622314 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7472.0 batch_size: 77.0 total loss: 1.0326563119888306
Epoch 0: 64% 137/213 [03:03<01:20, 1.06s/it]
total tokens: 2392 num samples: 8 num padding tokens: 219 - rank: 3 max len: 299 min len: 248 avg len: 271.625 num_loss_counted_tokens: 834
total tokens: 2440 num samples: 10 num padding tokens: 191 - rank: 4 max len: 244 min len: 211 avg len: 224.9 num_loss_counted_tokens: 1017
total tokens: 2328 num samples: 6 num padding tokens: 290 - rank: 2 max len: 388 min len: 306 avg len: 339.6666666666667 num_loss_counted_tokens: 1437
total tokens: 2295 num samples: 5 num padding tokens: 220 - rank: 1 max len: 459 min len: 393 avg len: 415.0 num_loss_counted_tokens: 1082
total tokens: 2366 num samples: 13 num padding tokens: 227 - rank: 6 max len: 182 min len: 146 avg len: 164.53846153846155 num_loss_counted_tokens: 733
total tokens: 2484 num samples: 12 num padding tokens: 126 - rank: 5 max len: 207 min len: 186 avg len: 196.5 num_loss_counted_tokens: 926
total tokens: 2482 num samples: 17 num padding tokens: 573 - rank: 7 max len: 146 min len: 90 avg len: 112.29411764705883 num_loss_counted_tokens: 530
total tokens: 2380 num samples: 4 num padding tokens: 274 - rank: 0 max len: 595 min len: 470 avg len: 526.5 num_loss_counted_tokens: 1084
Per-token loss scaled by world size: 0.0008470662287436426
Per-token loss scaled by world size: 0.0013530355645343661
Per-token loss scaled by world size: 0.0015552501427009702
Per-token loss scaled by world size: 0.0004919858183711767
Per-token loss scaled by world size: 0.001061309245415032
Per-token loss scaled by world size: 0.0008762083598412573
Per-token loss scaled by world size: 0.0012540918542072177
Epoch: 0, Step: 138, Rank: 3, loss = 0.7816303372383118
Epoch: 0, Step: 138, Rank: 5, loss = 0.979323148727417
Epoch: 0, Step: 138, Rank: 2, loss = 1.4351071119308472
Epoch: 0, Step: 138, Rank: 7, loss = 0.4539799094200134
Epoch: 0, Step: 138, Rank: 1, loss = 1.2485135793685913
Epoch: 0, Step: 138, Rank: 4, loss = 1.1572132110595703
Epoch: 0, Step: 138, Rank: 6, loss = 0.8085212707519531
Per-token loss scaled by world size: 0.0023880566004663706
Epoch: 0, Step: 138, Rank: 0, loss = 2.2035791873931885
[2024-06-27 16:44:15,458] [INFO] [logging.py:96:log_dist] [Rank 0] step=138, skipped=0, lr=[7.16883116883117e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:15,531] [INFO] [timer.py:260:stop] epoch=0/micro_step=138/global_step=138, RunningAvgSamplesPerSec=95.54469298402175, CurrSamplesPerSec=94.86192039624711, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 94.74756591392382 samples/s, lr: 7.16883116883117e-06, loss: 2.2035791873931885 cuda_mem_allocated: 22.312707901000977 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7382.0 batch_size: 90.0 total loss: 1.1334835290908813
Epoch 0: 65% 138/213 [03:04<01:19, 1.06s/it]
total tokens: 2380 num samples: 14 num padding tokens: 317 - rank: 6 max len: 170 min len: 128 avg len: 147.35714285714286 num_loss_counted_tokens: 849
total tokens: 2398 num samples: 11 num padding tokens: 305 - rank: 5 max len: 218 min len: 177 avg len: 190.27272727272728 num_loss_counted_tokens: 935
total tokens: 2510 num samples: 10 num padding tokens: 142 - rank: 4 max len: 251 min len: 219 avg len: 236.8 num_loss_counted_tokens: 966
total tokens: 2328 num samples: 8 num padding tokens: 181 - rank: 3 max len: 291 min len: 254 avg len: 268.375 num_loss_counted_tokens: 804
total tokens: 2394 num samples: 7 num padding tokens: 197 - rank: 2 max len: 342 min len: 298 avg len: 313.85714285714283 num_loss_counted_tokens: 1020
total tokens: 2500 num samples: 20 num padding tokens: 419 - rank: 7 max len: 125 min len: 83 avg len: 104.05 num_loss_counted_tokens: 500
total tokens: 2514 num samples: 6 num padding tokens: 145 - rank: 1 max len: 419 min len: 351 avg len: 394.8333333333333 num_loss_counted_tokens: 1133
total tokens: 2418 num samples: 3 num padding tokens: 642 - rank: 0 max len: 806 min len: 484 avg len: 592.0 num_loss_counted_tokens: 241
Per-token loss scaled by world size: 0.0013513833982869983
Per-token loss scaled by world size: 0.0017933724448084831
Per-token loss scaled by world size: 0.0018549689557403326
Per-token loss scaled by world size: 0.001310093211941421
Per-token loss scaled by world size: 0.0008366020629182458
Per-token loss scaled by world size: 0.0021304823458194733
Per-token loss scaled by world size: 2.506226701370906e-05
Epoch: 0, Step: 139, Rank: 4, loss = 1.0346529483795166
Epoch: 0, Step: 139, Rank: 2, loss = 1.3730508089065552
Epoch: 0, Step: 139, Rank: 3, loss = 1.420210599899292
Epoch: 0, Step: 139, Rank: 6, loss = 1.003040075302124
Epoch: 0, Step: 139, Rank: 7, loss = 0.6405234336853027
Per-token loss scaled by world size: 0.0022204299457371235
Epoch: 0, Step: 139, Rank: 5, loss = 1.631150484085083
Epoch: 0, Step: 139, Rank: 0, loss = 0.019188297912478447
Epoch: 0, Step: 139, Rank: 1, loss = 1.700016736984253
[2024-06-27 16:44:16,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=139, skipped=0, lr=[7.220779220779221e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:16,590] [INFO] [timer.py:260:stop] epoch=0/micro_step=139/global_step=139, RunningAvgSamplesPerSec=95.5456173093053, CurrSamplesPerSec=95.6714923779423, MemAllocated=22.21GB, MaxMemAllocated=28.61GB
throughput: 95.57035194485849 samples/s, lr: 7.220779220779221e-06, loss: 0.019188297912478447 cuda_mem_allocated: 22.20704174041748 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6125.0 batch_size: 75.0 total loss: 1.1027292013168335
Epoch 0: 65% 139/213 [03:05<01:18, 1.06s/it]
total tokens: 2390 num samples: 10 num padding tokens: 130 - rank: 4 max len: 239 min len: 208 avg len: 226.0 num_loss_counted_tokens: 898
total tokens: 2400 num samples: 12 num padding tokens: 351 - rank: 5 max len: 200 min len: 149 avg len: 170.75 num_loss_counted_tokens: 807
total tokens: 2240 num samples: 7 num padding tokens: 84 - rank: 2 max len: 320 min len: 295 avg len: 308.0 num_loss_counted_tokens: 698
total tokens: 2346 num samples: 6 num padding tokens: 215 - rank: 1 max len: 391 min len: 326 avg len: 355.1666666666667 num_loss_counted_tokens: 751
total tokens: 2516 num samples: 17 num padding tokens: 410 - rank: 6 max len: 148 min len: 107 avg len: 123.88235294117646 num_loss_counted_tokens: 645
total tokens: 2344 num samples: 8 num padding tokens: 186 - rank: 3 max len: 293 min len: 241 avg len: 269.75 num_loss_counted_tokens: 1187
total tokens: 1060 num samples: 10 num padding tokens: 77 - rank: 7 max len: 106 min len: 86 avg len: 98.3 num_loss_counted_tokens: 222
total tokens: 1959 num samples: 3 num padding tokens: 389 - rank: 0 max len: 653 min len: 442 avg len: 523.3333333333334 num_loss_counted_tokens: 640
Per-token loss scaled by world size: 0.0007017544703558087
Per-token loss scaled by world size: 0.0018811069894582033
Per-token loss scaled by world size: 0.0012803279096260667
Per-token loss scaled by world size: 0.0008335780003108084
Per-token loss scaled by world size: 0.0009041429730132222
Per-token loss scaled by world size: 0.001534727867692709
Per-token loss scaled by world size: 0.001060348586179316
Epoch: 0, Step: 140, Rank: 1, loss = 1.1017221212387085
Epoch: 0, Step: 140, Rank: 2, loss = 1.6186925172805786
Epoch: 0, Step: 140, Rank: 7, loss = 0.6038597226142883
Epoch: 0, Step: 140, Rank: 0, loss = 0.7172938585281372
Epoch: 0, Step: 140, Rank: 3, loss = 0.7780150175094604
Epoch: 0, Step: 140, Rank: 5, loss = 1.3206332921981812
Per-token loss scaled by world size: 0.0013403245247900486
Epoch: 0, Step: 140, Rank: 4, loss = 0.9124299883842468
Epoch: 0, Step: 140, Rank: 6, loss = 1.1533492803573608
[2024-06-27 16:44:17,577] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=0, lr=[7.272727272727273e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:17,650] [INFO] [timer.py:260:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=95.5454281614417, CurrSamplesPerSec=95.51952198153343, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 95.42537339784873 samples/s, lr: 7.272727272727273e-06, loss: 0.7172938585281372 cuda_mem_allocated: 22.280386924743652 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6884.0 batch_size: 82.0 total loss: 1.0257494449615479
Epoch 0: 66% 140/213 [03:06<01:17, 1.06s/it]
total tokens: 2478 num samples: 14 num padding tokens: 286 - rank: 6 max len: 177 min len: 137 avg len: 156.57142857142858 num_loss_counted_tokens: 759
total tokens: 2332 num samples: 11 num padding tokens: 182 - rank: 5 max len: 212 min len: 180 avg len: 195.45454545454547 num_loss_counted_tokens: 956
total tokens: 2240 num samples: 7 num padding tokens: 188 - rank: 2 max len: 320 min len: 278 avg len: 293.14285714285717 num_loss_counted_tokens: 953
total tokens: 2480 num samples: 10 num padding tokens: 224 - rank: 4 max len: 248 min len: 215 avg len: 225.6 num_loss_counted_tokens: 1166
total tokens: 2448 num samples: 9 num padding tokens: 110 - rank: 3 max len: 272 min len: 248 avg len: 259.77777777777777 num_loss_counted_tokens: 1008
total tokens: 2532 num samples: 6 num padding tokens: 325 - rank: 1 max len: 422 min len: 323 avg len: 367.8333333333333 num_loss_counted_tokens: 1187
total tokens: 2232 num samples: 18 num padding tokens: 359 - rank: 7 max len: 124 min len: 86 avg len: 104.05555555555556 num_loss_counted_tokens: 418
total tokens: 2022 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2022 min len: 2022 avg len: 2022.0 num_loss_counted_tokens: 40
Per-token loss scaled by world size: 0.0014895814238116145
Per-token loss scaled by world size: 0.0010798623552545905
Per-token loss scaled by world size: 0.0009584724903106689
Per-token loss scaled by world size: 0.0010508013656362891
Per-token loss scaled by world size: 0.0012848537880927324
Per-token loss scaled by world size: 0.000895876728463918
Per-token loss scaled by world size: 0.0012915564002469182
Epoch: 0, Step: 141, Rank: 4, loss = 1.262234091758728
Epoch: 0, Step: 141, Rank: 1, loss = 0.8904227614402771
Epoch: 0, Step: 141, Rank: 7, loss = 0.8121856451034546
Epoch: 0, Step: 141, Rank: 5, loss = 0.915048360824585
Epoch: 0, Step: 141, Rank: 2, loss = 1.0887529850006104
Epoch: 0, Step: 141, Rank: 6, loss = 1.0944325923919678
Epoch: 0, Step: 141, Rank: 3, loss = 0.7591435313224792
Per-token loss scaled by world size: 0.0015006223693490028
Epoch: 0, Step: 141, Rank: 0, loss = 1.2715898752212524
[2024-06-27 16:44:18,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=141, skipped=0, lr=[7.324675324675325e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:18,716] [INFO] [timer.py:260:stop] epoch=0/micro_step=141/global_step=141, RunningAvgSamplesPerSec=95.5412929192286, CurrSamplesPerSec=94.9740422062108, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 94.85941740546409 samples/s, lr: 7.324675324675325e-06, loss: 1.2715898752212524 cuda_mem_allocated: 22.309725284576416 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6779.0 batch_size: 91.0 total loss: 1.0117262601852417
Epoch 0: 66% 141/213 [03:07<01:16, 1.06s/it]
total tokens: 2400 num samples: 10 num padding tokens: 196 - rank: 4 max len: 240 min len: 198 avg len: 220.4 num_loss_counted_tokens: 1036
total tokens: 2335 num samples: 5 num padding tokens: 305 - rank: 1 max len: 467 min len: 365 avg len: 406.0 num_loss_counted_tokens: 1256
total tokens: 2429 num samples: 7 num padding tokens: 156 - rank: 2 max len: 347 min len: 306 avg len: 324.7142857142857 num_loss_counted_tokens: 952
total tokens: 2376 num samples: 12 num padding tokens: 219 - rank: 5 max len: 198 min len: 165 avg len: 179.75 num_loss_counted_tokens: 882
total tokens: 2280 num samples: 8 num padding tokens: 205 - rank: 3 max len: 285 min len: 246 avg len: 259.375 num_loss_counted_tokens: 1125
total tokens: 2430 num samples: 15 num padding tokens: 137 - rank: 6 max len: 162 min len: 139 avg len: 152.86666666666667 num_loss_counted_tokens: 825
total tokens: 2430 num samples: 18 num padding tokens: 417 - rank: 7 max len: 135 min len: 74 avg len: 111.83333333333333 num_loss_counted_tokens: 483
total tokens: 2464 num samples: 4 num padding tokens: 203 - rank: 0 max len: 616 min len: 521 avg len: 565.25 num_loss_counted_tokens: 1676
Per-token loss scaled by world size: 0.0010539243230596185
Per-token loss scaled by world size: 0.0019192431354895234
Per-token loss scaled by world size: 0.0014142990112304688
Per-token loss scaled by world size: 0.0012386123416945338
Per-token loss scaled by world size: 0.0008109916816465557
Per-token loss scaled by world size: 0.0008485869620926678
Per-token loss scaled by world size: 0.0014819141943007708
Epoch: 0, Step: 142, Rank: 1, loss = 1.2336223125457764
Epoch: 0, Step: 142, Rank: 5, loss = 0.9192854762077332
Epoch: 0, Step: 142, Rank: 2, loss = 1.6740598678588867
Epoch: 0, Step: 142, Rank: 7, loss = 0.7073875069618225
Epoch: 0, Step: 142, Rank: 0, loss = 1.080379605293274
Epoch: 0, Step: 142, Rank: 3, loss = 1.2925996780395508
Epoch: 0, Step: 142, Rank: 4, loss = 0.7401799559593201
Per-token loss scaled by world size: 0.0007090984145179391
Epoch: 0, Step: 142, Rank: 6, loss = 0.6185110807418823
[2024-06-27 16:44:19,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=142, skipped=0, lr=[7.376623376623377e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:19,776] [INFO] [timer.py:260:stop] epoch=0/micro_step=142/global_step=142, RunningAvgSamplesPerSec=95.54047477303465, CurrSamplesPerSec=95.42688862693015, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 95.32208952701063 samples/s, lr: 7.376623376623377e-06, loss: 1.080379605293274 cuda_mem_allocated: 22.25748872756958 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6978.0 batch_size: 82.0 total loss: 1.0332531929016113
Epoch 0: 67% 142/213 [03:09<01:15, 1.06s/it]
total tokens: 2385 num samples: 15 num padding tokens: 278 - rank: 6 max len: 159 min len: 121 avg len: 140.46666666666667 num_loss_counted_tokens: 722
total tokens: 2261 num samples: 7 num padding tokens: 155 - rank: 2 max len: 323 min len: 277 avg len: 300.85714285714283 num_loss_counted_tokens: 1249
total tokens: 2519 num samples: 11 num padding tokens: 224 - rank: 4 max len: 229 min len: 194 avg len: 208.63636363636363 num_loss_counted_tokens: 742
total tokens: 2403 num samples: 9 num padding tokens: 145 - rank: 3 max len: 267 min len: 230 avg len: 250.88888888888889 num_loss_counted_tokens: 894
total tokens: 2376 num samples: 4 num padding tokens: 396 - rank: 0 max len: 594 min len: 424 avg len: 495.0 num_loss_counted_tokens: 878
total tokens: 2262 num samples: 6 num padding tokens: 153 - rank: 1 max len: 377 min len: 325 avg len: 351.5 num_loss_counted_tokens: 571
total tokens: 2483 num samples: 13 num padding tokens: 126 - rank: 5 max len: 191 min len: 170 avg len: 181.30769230769232 num_loss_counted_tokens: 841
total tokens: 1800 num samples: 15 num padding tokens: 246 - rank: 7 max len: 120 min len: 80 avg len: 103.6 num_loss_counted_tokens: 376
Per-token loss scaled by world size: 0.0010804999619722366
Per-token loss scaled by world size: 0.0008333396399393678
Per-token loss scaled by world size: 0.001860053394921124
Per-token loss scaled by world size: 0.0003264991973992437
Per-token loss scaled by world size: 0.0023405977990478277
Per-token loss scaled by world size: 0.0010631910990923643
Per-token loss scaled by world size: 0.0009541444596834481
Epoch: 0, Step: 143, Rank: 4, loss = 1.026745080947876
Epoch: 0, Step: 143, Rank: 0, loss = 1.767515778541565
Epoch: 0, Step: 143, Rank: 2, loss = 0.7918809652328491
Epoch: 0, Step: 143, Rank: 1, loss = 2.2241530418395996
Epoch: 0, Step: 143, Rank: 7, loss = 0.31025585532188416
Epoch: 0, Step: 143, Rank: 3, loss = 1.0102972984313965
Epoch: 0, Step: 143, Rank: 6, loss = 0.9066757559776306
Per-token loss scaled by world size: 0.0007517453050240874
Epoch: 0, Step: 143, Rank: 5, loss = 0.7143459916114807
[2024-06-27 16:44:20,762] [INFO] [logging.py:96:log_dist] [Rank 0] step=143, skipped=0, lr=[7.428571428571429e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:20,835] [INFO] [timer.py:260:stop] epoch=0/micro_step=143/global_step=143, RunningAvgSamplesPerSec=95.53748320901032, CurrSamplesPerSec=95.12050525798561, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 95.02079491174923 samples/s, lr: 7.428571428571429e-06, loss: 1.767515778541565 cuda_mem_allocated: 22.25999402999878 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7602.0 batch_size: 87.0 total loss: 1.093983769416809
Epoch 0: 67% 143/213 [03:10<01:14, 1.06s/it]
total tokens: 2365 num samples: 11 num padding tokens: 193 - rank: 4 max len: 215 min len: 182 avg len: 197.45454545454547 num_loss_counted_tokens: 838
total tokens: 2370 num samples: 10 num padding tokens: 101 - rank: 3 max len: 237 min len: 220 avg len: 226.9 num_loss_counted_tokens: 806
total tokens: 2516 num samples: 17 num padding tokens: 188 - rank: 6 max len: 148 min len: 123 avg len: 136.94117647058823 num_loss_counted_tokens: 876
total tokens: 2464 num samples: 14 num padding tokens: 147 - rank: 5 max len: 176 min len: 152 avg len: 165.5 num_loss_counted_tokens: 937
total tokens: 2184 num samples: 6 num padding tokens: 125 - rank: 1 max len: 364 min len: 319 avg len: 343.1666666666667 num_loss_counted_tokens: 1042
total tokens: 2178 num samples: 18 num padding tokens: 310 - rank: 7 max len: 121 min len: 81 avg len: 103.77777777777777 num_loss_counted_tokens: 507
total tokens: 2502 num samples: 9 num padding tokens: 140 - rank: 2 max len: 278 min len: 241 avg len: 262.44444444444446 num_loss_counted_tokens: 989
total tokens: 2295 num samples: 5 num padding tokens: 181 - rank: 0 max len: 459 min len: 398 avg len: 422.8 num_loss_counted_tokens: 931
Per-token loss scaled by world size: 0.0010864302748814225
Per-token loss scaled by world size: 0.0008442088728770614
Per-token loss scaled by world size: 0.0006439759745262563
Per-token loss scaled by world size: 0.0015561669133603573
Per-token loss scaled by world size: 0.0006814489024691284
Per-token loss scaled by world size: 0.0006666479166597128
Per-token loss scaled by world size: 0.001569844433106482
Epoch: 0, Step: 144, Rank: 3, loss = 0.998157799243927
Epoch: 0, Step: 144, Rank: 5, loss = 0.5916529297828674
Epoch: 0, Step: 144, Rank: 4, loss = 0.7756168842315674
Epoch: 0, Step: 144, Rank: 1, loss = 1.429728388786316
Epoch: 0, Step: 144, Rank: 6, loss = 0.6260811686515808
Epoch: 0, Step: 144, Rank: 0, loss = 1.4422945976257324
Epoch: 0, Step: 144, Rank: 7, loss = 0.6124827861785889
Per-token loss scaled by world size: 0.0013184297131374478
Epoch: 0, Step: 144, Rank: 2, loss = 1.2113072872161865
[2024-06-27 16:44:21,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=144, skipped=0, lr=[7.480519480519481e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:21,896] [INFO] [timer.py:260:stop] epoch=0/micro_step=144/global_step=144, RunningAvgSamplesPerSec=95.53654719780536, CurrSamplesPerSec=95.40475297437041, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 95.30618310098518 samples/s, lr: 7.480519480519481e-06, loss: 1.4422945976257324 cuda_mem_allocated: 22.263213634490967 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7350.0 batch_size: 78.0 total loss: 0.9609153270721436
Epoch 0: 68% 144/213 [03:11<01:13, 1.06s/it]
total tokens: 2424 num samples: 8 num padding tokens: 139 - rank: 3 max len: 303 min len: 264 avg len: 285.625 num_loss_counted_tokens: 1256
total tokens: 2422 num samples: 14 num padding tokens: 247 - rank: 6 max len: 173 min len: 139 avg len: 155.35714285714286 num_loss_counted_tokens: 812
total tokens: 2480 num samples: 10 num padding tokens: 150 - rank: 4 max len: 248 min len: 226 avg len: 233.0 num_loss_counted_tokens: 868
total tokens: 2420 num samples: 11 num padding tokens: 282 - rank: 5 max len: 220 min len: 175 avg len: 194.36363636363637 num_loss_counted_tokens: 860
total tokens: 2475 num samples: 5 num padding tokens: 316 - rank: 1 max len: 495 min len: 394 avg len: 431.8 num_loss_counted_tokens: 1410
total tokens: 2292 num samples: 6 num padding tokens: 170 - rank: 2 max len: 382 min len: 307 avg len: 353.6666666666667 num_loss_counted_tokens: 1454
total tokens: 2532 num samples: 4 num padding tokens: 191 - rank: 0 max len: 633 min len: 533 avg len: 585.25 num_loss_counted_tokens: 406
total tokens: 2329 num samples: 17 num padding tokens: 304 - rank: 7 max len: 137 min len: 89 avg len: 119.11764705882354 num_loss_counted_tokens: 698
Per-token loss scaled by world size: 0.0007776703569106758
Per-token loss scaled by world size: 0.001510463422164321
Per-token loss scaled by world size: 0.0011910111643373966
Per-token loss scaled by world size: 0.0009628282277844846
Per-token loss scaled by world size: 0.0011534468503668904
Per-token loss scaled by world size: 0.0011745213996618986
Per-token loss scaled by world size: 0.0011086183367297053
Epoch: 0, Step: 145, Rank: 1, loss = 0.9739718437194824
Epoch: 0, Step: 145, Rank: 0, loss = 0.9564957618713379
Epoch: 0, Step: 145, Rank: 5, loss = 1.252551794052124
Epoch: 0, Step: 145, Rank: 6, loss = 0.6448831558227539
Epoch: 0, Step: 145, Rank: 4, loss = 0.9876460433006287
Per-token loss scaled by world size: 0.0008147003827616572
Epoch: 0, Step: 145, Rank: 2, loss = 0.9193217754364014
Epoch: 0, Step: 145, Rank: 3, loss = 0.7984253168106079
Epoch: 0, Step: 145, Rank: 7, loss = 0.6755902767181396
[2024-06-27 16:44:22,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=145, skipped=0, lr=[7.532467532467533e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:22,955] [INFO] [timer.py:260:stop] epoch=0/micro_step=145/global_step=145, RunningAvgSamplesPerSec=95.53649342757056, CurrSamplesPerSec=95.52885866870383, MemAllocated=22.24GB, MaxMemAllocated=28.61GB
throughput: 95.43503098029122 samples/s, lr: 7.532467532467533e-06, loss: 0.9564957618713379 cuda_mem_allocated: 22.24281930923462 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6634.0 batch_size: 73.0 total loss: 0.901110827922821
Epoch 0: 68% 145/213 [03:12<01:12, 1.06s/it]
total tokens: 2415 num samples: 15 num padding tokens: 185 - rank: 6 max len: 161 min len: 133 avg len: 148.66666666666666 num_loss_counted_tokens: 880
total tokens: 2457 num samples: 13 num padding tokens: 108 - rank: 5 max len: 189 min len: 165 avg len: 180.69230769230768 num_loss_counted_tokens: 883
total tokens: 2128 num samples: 2 num padding tokens: 688 - rank: 0 max len: 1064 min len: 376 avg len: 720.0 num_loss_counted_tokens: 790
total tokens: 2220 num samples: 6 num padding tokens: 51 - rank: 1 max len: 370 min len: 342 avg len: 361.5 num_loss_counted_tokens: 1123
total tokens: 2324 num samples: 7 num padding tokens: 215 - rank: 2 max len: 332 min len: 277 avg len: 301.2857142857143 num_loss_counted_tokens: 1002
total tokens: 2475 num samples: 9 num padding tokens: 212 - rank: 3 max len: 275 min len: 228 avg len: 251.44444444444446 num_loss_counted_tokens: 913
total tokens: 2453 num samples: 11 num padding tokens: 178 - rank: 4 max len: 223 min len: 191 avg len: 206.8181818181818 num_loss_counted_tokens: 946
total tokens: 2413 num samples: 19 num padding tokens: 305 - rank: 7 max len: 127 min len: 93 avg len: 110.94736842105263 num_loss_counted_tokens: 622
Per-token loss scaled by world size: 0.0002635603304952383
Per-token loss scaled by world size: 0.002979615004733205
Per-token loss scaled by world size: 0.0013306393520906568
Per-token loss scaled by world size: 0.001626241602934897
Per-token loss scaled by world size: 0.0016706387978047132
Per-token loss scaled by world size: 0.0017800417263060808
Per-token loss scaled by world size: 6.406172906281427e-05
Epoch: 0, Step: 146, Rank: 2, loss = 2.010495185852051
Epoch: 0, Step: 146, Rank: 7, loss = 0.897848904132843
Epoch: 0, Step: 146, Rank: 3, loss = 1.097306489944458
Epoch: 0, Step: 146, Rank: 1, loss = 0.1778373271226883
Epoch: 0, Step: 146, Rank: 5, loss = 1.2010831832885742
Epoch: 0, Step: 146, Rank: 6, loss = 1.1272635459899902
Epoch: 0, Step: 146, Rank: 0, loss = 0.04322565346956253
Per-token loss scaled by world size: 0.0014580595307052135
Epoch: 0, Step: 146, Rank: 4, loss = 0.98382568359375
[2024-06-27 16:44:23,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=146, skipped=0, lr=[7.584415584415585e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:24,019] [INFO] [timer.py:260:stop] epoch=0/micro_step=146/global_step=146, RunningAvgSamplesPerSec=95.52677220267049, CurrSamplesPerSec=94.15671397654344, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 94.04635877754014 samples/s, lr: 7.584415584415585e-06, loss: 0.04322565346956253 cuda_mem_allocated: 22.25391149520874 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5398.0 batch_size: 71.0 total loss: 0.9423607587814331
Epoch 0: 69% 146/213 [03:13<01:11, 1.06s/it]
total tokens: 2484 num samples: 6 num padding tokens: 428 - rank: 1 max len: 414 min len: 304 avg len: 342.6666666666667 num_loss_counted_tokens: 831
total tokens: 2408 num samples: 8 num padding tokens: 186 - rank: 2 max len: 301 min len: 254 avg len: 277.75 num_loss_counted_tokens: 1027
total tokens: 2457 num samples: 13 num padding tokens: 207 - rank: 4 max len: 189 min len: 159 avg len: 173.07692307692307 num_loss_counted_tokens: 761
total tokens: 2502 num samples: 18 num padding tokens: 162 - rank: 6 max len: 139 min len: 119 avg len: 130.0 num_loss_counted_tokens: 829
total tokens: 2450 num samples: 10 num padding tokens: 338 - rank: 3 max len: 245 min len: 190 avg len: 211.2 num_loss_counted_tokens: 997
total tokens: 2385 num samples: 15 num padding tokens: 106 - rank: 5 max len: 159 min len: 141 avg len: 151.93333333333334 num_loss_counted_tokens: 944
total tokens: 2242 num samples: 19 num padding tokens: 264 - rank: 7 max len: 118 min len: 74 avg len: 104.10526315789474 num_loss_counted_tokens: 493
total tokens: 2512 num samples: 4 num padding tokens: 559 - rank: 0 max len: 628 min len: 421 avg len: 488.25 num_loss_counted_tokens: 1403
Per-token loss scaled by world size: 0.0008790154824964702
Per-token loss scaled by world size: 0.0010674609802663326
Per-token loss scaled by world size: 0.0006925399648025632
Per-token loss scaled by world size: 0.0021320469677448273
Per-token loss scaled by world size: 0.0007565536070615053
Per-token loss scaled by world size: 0.0015504419570788741
Per-token loss scaled by world size: 0.00043560672202147543
Epoch: 0, Step: 147, Rank: 3, loss = 0.8989031910896301
Per-token loss scaled by world size: 0.0010127967689186335
Epoch: 0, Step: 147, Rank: 5, loss = 1.0916123390197754
Epoch: 0, Step: 147, Rank: 2, loss = 0.7082086801528931
Epoch: 0, Step: 147, Rank: 4, loss = 1.5855207443237305
Epoch: 0, Step: 147, Rank: 0, loss = 2.1802845001220703
Epoch: 0, Step: 147, Rank: 1, loss = 0.7736706137657166
Epoch: 0, Step: 147, Rank: 7, loss = 0.44546231627464294
Epoch: 0, Step: 147, Rank: 6, loss = 1.0357112884521484
[2024-06-27 16:44:25,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=147, skipped=0, lr=[7.636363636363638e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:25,079] [INFO] [timer.py:260:stop] epoch=0/micro_step=147/global_step=147, RunningAvgSamplesPerSec=95.527977747593, CurrSamplesPerSec=95.70189446183589, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 95.59111210292849 samples/s, lr: 7.636363636363638e-06, loss: 2.1802845001220703 cuda_mem_allocated: 22.271442413330078 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8181.0 batch_size: 76.0 total loss: 1.0899217128753662
Epoch 0: 69% 147/213 [03:14<01:10, 1.06s/it]
total tokens: 2520 num samples: 14 num padding tokens: 240 - rank: 5 max len: 180 min len: 151 avg len: 162.85714285714286 num_loss_counted_tokens: 868
total tokens: 2332 num samples: 11 num padding tokens: 168 - rank: 4 max len: 212 min len: 185 avg len: 196.72727272727272 num_loss_counted_tokens: 943
total tokens: 2533 num samples: 17 num padding tokens: 213 - rank: 6 max len: 149 min len: 121 avg len: 136.47058823529412 num_loss_counted_tokens: 879
total tokens: 2450 num samples: 10 num padding tokens: 125 - rank: 3 max len: 245 min len: 216 avg len: 232.5 num_loss_counted_tokens: 992
total tokens: 2368 num samples: 8 num padding tokens: 239 - rank: 2 max len: 296 min len: 245 avg len: 266.125 num_loss_counted_tokens: 931
total tokens: 2499 num samples: 7 num padding tokens: 197 - rank: 1 max len: 357 min len: 298 avg len: 328.85714285714283 num_loss_counted_tokens: 1071
total tokens: 2360 num samples: 5 num padding tokens: 376 - rank: 0 max len: 472 min len: 364 avg len: 396.8 num_loss_counted_tokens: 960
total tokens: 2178 num samples: 18 num padding tokens: 264 - rank: 7 max len: 121 min len: 89 avg len: 106.33333333333333 num_loss_counted_tokens: 531
Per-token loss scaled by world size: 0.0009763744310475886
Per-token loss scaled by world size: 0.0010440904879942536
Per-token loss scaled by world size: 0.0009442294831387699
Per-token loss scaled by world size: 0.00043020150042138994
Per-token loss scaled by world size: 0.001133461482822895
Per-token loss scaled by world size: 0.0011239995947107673
Per-token loss scaled by world size: 0.0013407077640295029
Epoch: 0, Step: 148, Rank: 6, loss = 0.9927287101745605
Epoch: 0, Step: 148, Rank: 2, loss = 0.960045337677002
Epoch: 0, Step: 148, Rank: 7, loss = 0.43740737438201904
Epoch: 0, Step: 148, Rank: 4, loss = 1.0615789890289307
Epoch: 0, Step: 148, Rank: 3, loss = 1.1428265571594238
Epoch: 0, Step: 148, Rank: 5, loss = 1.152446985244751
Per-token loss scaled by world size: 0.001125379465520382
Epoch: 0, Step: 148, Rank: 0, loss = 1.3631646633148193
Epoch: 0, Step: 148, Rank: 1, loss = 1.144229531288147
[2024-06-27 16:44:26,067] [INFO] [logging.py:96:log_dist] [Rank 0] step=148, skipped=0, lr=[7.68831168831169e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:26,141] [INFO] [timer.py:260:stop] epoch=0/micro_step=148/global_step=148, RunningAvgSamplesPerSec=95.52644096960582, CurrSamplesPerSec=95.3041303179296, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 95.19440143609661 samples/s, lr: 7.68831168831169e-06, loss: 1.3631646633148193 cuda_mem_allocated: 22.266432762145996 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8134.0 batch_size: 92.0 total loss: 1.0318034887313843
Epoch 0: 69% 148/213 [03:15<01:08, 1.06s/it]
total tokens: 2506 num samples: 14 num padding tokens: 286 - rank: 6 max len: 179 min len: 139 avg len: 158.57142857142858 num_loss_counted_tokens: 774
total tokens: 2225 num samples: 5 num padding tokens: 185 - rank: 1 max len: 445 min len: 383 avg len: 408.0 num_loss_counted_tokens: 1197
total tokens: 2280 num samples: 6 num padding tokens: 193 - rank: 2 max len: 380 min len: 326 avg len: 347.8333333333333 num_loss_counted_tokens: 1115
total tokens: 2403 num samples: 9 num padding tokens: 253 - rank: 4 max len: 267 min len: 211 avg len: 238.88888888888889 num_loss_counted_tokens: 1017
total tokens: 2488 num samples: 8 num padding tokens: 195 - rank: 3 max len: 311 min len: 267 avg len: 286.625 num_loss_counted_tokens: 1155
total tokens: 2508 num samples: 12 num padding tokens: 142 - rank: 5 max len: 209 min len: 182 avg len: 197.16666666666666 num_loss_counted_tokens: 1044
total tokens: 2415 num samples: 3 num padding tokens: 600 - rank: 0 max len: 805 min len: 486 avg len: 605.0 num_loss_counted_tokens: 1073
total tokens: 2346 num samples: 17 num padding tokens: 314 - rank: 7 max len: 138 min len: 88 avg len: 119.52941176470588 num_loss_counted_tokens: 545
Per-token loss scaled by world size: 0.0012120491592213511
Per-token loss scaled by world size: 0.0011271694675087929
Per-token loss scaled by world size: 0.001291439519263804
Per-token loss scaled by world size: 0.0018482906743884087
Per-token loss scaled by world size: 0.0018949678633362055
Per-token loss scaled by world size: 0.0014185960171744227
Per-token loss scaled by world size: 0.0011383399832993746
Epoch: 0, Step: 149, Rank: 6, loss = 1.0449378490447998
Epoch: 0, Step: 149, Rank: 3, loss = 1.113382339477539
Epoch: 0, Step: 149, Rank: 2, loss = 0.9717609882354736
Epoch: 0, Step: 149, Rank: 5, loss = 1.6336991786956787
Epoch: 0, Step: 149, Rank: 4, loss = 0.9813913702964783
Epoch: 0, Step: 149, Rank: 1, loss = 1.593457579612732
Epoch: 0, Step: 149, Rank: 0, loss = 1.223007082939148
Per-token loss scaled by world size: 0.0006273670587688684
Epoch: 0, Step: 149, Rank: 7, loss = 0.5408688187599182
[2024-06-27 16:44:27,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=149, skipped=0, lr=[7.74025974025974e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:27,200] [INFO] [timer.py:260:stop] epoch=0/micro_step=149/global_step=149, RunningAvgSamplesPerSec=95.5255815864022, CurrSamplesPerSec=95.40027734926508, MemAllocated=22.29GB, MaxMemAllocated=28.61GB
throughput: 95.29524346422747 samples/s, lr: 7.74025974025974e-06, loss: 1.223007082939148 cuda_mem_allocated: 22.290048122406006 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6897.0 batch_size: 75.0 total loss: 1.1378130912780762
Epoch 0: 70% 149/213 [03:16<01:07, 1.06s/it]
total tokens: 2492 num samples: 14 num padding tokens: 150 - rank: 5 max len: 178 min len: 151 avg len: 167.28571428571428 num_loss_counted_tokens: 816
total tokens: 2421 num samples: 9 num padding tokens: 203 - rank: 3 max len: 269 min len: 230 avg len: 246.44444444444446 num_loss_counted_tokens: 879
total tokens: 2486 num samples: 11 num padding tokens: 245 - rank: 4 max len: 226 min len: 182 avg len: 203.72727272727272 num_loss_counted_tokens: 773
total tokens: 2496 num samples: 6 num padding tokens: 418 - rank: 1 max len: 416 min len: 323 avg len: 346.3333333333333 num_loss_counted_tokens: 1244
total tokens: 2516 num samples: 17 num padding tokens: 157 - rank: 6 max len: 148 min len: 127 avg len: 138.76470588235293 num_loss_counted_tokens: 1075
total tokens: 2394 num samples: 19 num padding tokens: 267 - rank: 7 max len: 126 min len: 91 avg len: 111.94736842105263 num_loss_counted_tokens: 664
total tokens: 2226 num samples: 7 num padding tokens: 137 - rank: 2 max len: 318 min len: 273 avg len: 298.42857142857144 num_loss_counted_tokens: 809
total tokens: 2145 num samples: 3 num padding tokens: 433 - rank: 0 max len: 715 min len: 457 avg len: 570.6666666666666 num_loss_counted_tokens: 1266
Per-token loss scaled by world size: 0.0013901798520237207
Per-token loss scaled by world size: 0.001195683958940208
Per-token loss scaled by world size: 0.0014899137895554304
Per-token loss scaled by world size: 0.0012237357441335917
Per-token loss scaled by world size: 0.0017264188500121236
Per-token loss scaled by world size: 0.0012565903598442674
Per-token loss scaled by world size: 4.7075547627173364e-05
Epoch: 0, Step: 150, Rank: 3, loss = 1.0355101823806763
Epoch: 0, Step: 150, Rank: 4, loss = 1.2859662771224976
Epoch: 0, Step: 150, Rank: 6, loss = 0.890635073184967
Epoch: 0, Step: 150, Rank: 5, loss = 1.1097995042800903
Epoch: 0, Step: 150, Rank: 1, loss = 0.9360027313232422
Epoch: 0, Step: 150, Rank: 7, loss = 0.9115301370620728
Epoch: 0, Step: 150, Rank: 0, loss = 0.03506539762020111
Per-token loss scaled by world size: 0.0017523315036669374
Epoch: 0, Step: 150, Rank: 2, loss = 1.3052679300308228
[2024-06-27 16:44:28,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=0, lr=[7.792207792207793e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:28,270] [INFO] [timer.py:260:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=95.51859677260447, CurrSamplesPerSec=94.50282238843697, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 94.41403127022389 samples/s, lr: 7.792207792207793e-06, loss: 0.03506539762020111 cuda_mem_allocated: 22.307698249816895 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5959.0 batch_size: 77.0 total loss: 0.9387221336364746
Epoch 0: 70% 150/213 [03:17<01:07, 1.06s/it]
total tokens: 2471 num samples: 7 num padding tokens: 354 - rank: 2 max len: 353 min len: 266 avg len: 302.42857142857144 num_loss_counted_tokens: 1058
total tokens: 2367 num samples: 9 num padding tokens: 85 - rank: 3 max len: 263 min len: 235 avg len: 253.55555555555554 num_loss_counted_tokens: 1081
total tokens: 2520 num samples: 15 num padding tokens: 435 - rank: 6 max len: 168 min len: 123 avg len: 139.0 num_loss_counted_tokens: 767
total tokens: 2364 num samples: 12 num padding tokens: 110 - rank: 5 max len: 197 min len: 171 avg len: 187.83333333333334 num_loss_counted_tokens: 804
total tokens: 2490 num samples: 6 num padding tokens: 207 - rank: 1 max len: 415 min len: 356 avg len: 380.5 num_loss_counted_tokens: 1282
total tokens: 2519 num samples: 11 num padding tokens: 175 - rank: 4 max len: 229 min len: 198 avg len: 213.0909090909091 num_loss_counted_tokens: 872
total tokens: 1959 num samples: 3 num padding tokens: 456 - rank: 0 max len: 653 min len: 423 avg len: 501.0 num_loss_counted_tokens: 849
total tokens: 2178 num samples: 18 num padding tokens: 314 - rank: 7 max len: 121 min len: 80 avg len: 103.55555555555556 num_loss_counted_tokens: 465
Per-token loss scaled by world size: 0.0012200291967019439
Per-token loss scaled by world size: 0.0009625935344956815
Per-token loss scaled by world size: 0.001282306620851159
Per-token loss scaled by world size: 0.0006296195206232369
Per-token loss scaled by world size: 0.0007678360561840236
Per-token loss scaled by world size: 0.0009363946155644953
Per-token loss scaled by world size: 0.0014265937497839332
Epoch: 0, Step: 151, Rank: 4, loss = 1.1462174654006958
Epoch: 0, Step: 151, Rank: 5, loss = 0.9043565988540649
Epoch: 0, Step: 151, Rank: 6, loss = 1.204727053642273
Epoch: 0, Step: 151, Rank: 1, loss = 0.7213819622993469
Epoch: 0, Step: 151, Rank: 3, loss = 0.59152752161026
Epoch: 0, Step: 151, Rank: 0, loss = 0.8797427415847778
Epoch: 0, Step: 151, Rank: 2, loss = 1.340284824371338
Per-token loss scaled by world size: 0.0006018998101353645
Epoch: 0, Step: 151, Rank: 7, loss = 0.565484881401062
[2024-06-27 16:44:29,249] [INFO] [logging.py:96:log_dist] [Rank 0] step=151, skipped=0, lr=[7.844155844155844e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:29,323] [INFO] [timer.py:260:stop] epoch=0/micro_step=151/global_step=151, RunningAvgSamplesPerSec=95.52278964485126, CurrSamplesPerSec=96.14741994325043, MemAllocated=22.29GB, MaxMemAllocated=28.61GB
throughput: 96.03633729631645 samples/s, lr: 7.844155844155844e-06, loss: 0.8797427415847778 cuda_mem_allocated: 22.285396099090576 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7516.0 batch_size: 80.0 total loss: 0.9192153215408325
Epoch 0: 71% 151/213 [03:18<01:05, 1.06s/it]
total tokens: 2450 num samples: 7 num padding tokens: 226 - rank: 2 max len: 350 min len: 298 avg len: 317.7142857142857 num_loss_counted_tokens: 1079
total tokens: 2180 num samples: 5 num padding tokens: 129 - rank: 1 max len: 436 min len: 382 avg len: 410.2 num_loss_counted_tokens: 1612
total tokens: 2484 num samples: 9 num padding tokens: 239 - rank: 3 max len: 276 min len: 231 avg len: 249.44444444444446 num_loss_counted_tokens: 1039
total tokens: 2486 num samples: 11 num padding tokens: 165 - rank: 4 max len: 226 min len: 197 avg len: 211.0 num_loss_counted_tokens: 855
total tokens: 2496 num samples: 13 num padding tokens: 208 - rank: 5 max len: 192 min len: 161 avg len: 176.0 num_loss_counted_tokens: 879
total tokens: 2516 num samples: 17 num padding tokens: 204 - rank: 6 max len: 148 min len: 126 avg len: 136.0 num_loss_counted_tokens: 772
total tokens: 2196 num samples: 18 num padding tokens: 293 - rank: 7 max len: 122 min len: 74 avg len: 105.72222222222223 num_loss_counted_tokens: 513
total tokens: 1942 num samples: 2 num padding tokens: 426 - rank: 0 max len: 971 min len: 545 avg len: 758.0 num_loss_counted_tokens: 373
Per-token loss scaled by world size: 0.0010860618203878403
Per-token loss scaled by world size: 0.0008671770337969065
Per-token loss scaled by world size: 0.0008599386201240122
Per-token loss scaled by world size: 0.0015545168425887823
Per-token loss scaled by world size: 0.001294502755627036
Per-token loss scaled by world size: 0.0005012648180127144
Per-token loss scaled by world size: 0.0010573173640295863
Epoch: 0, Step: 152, Rank: 5, loss = 0.6834362149238586
Epoch: 0, Step: 152, Rank: 4, loss = 0.8631476759910583
Epoch: 0, Step: 152, Rank: 3, loss = 0.6891889572143555
Epoch: 0, Step: 152, Rank: 7, loss = 0.8403029441833496
Epoch: 0, Step: 152, Rank: 0, loss = 0.39838021993637085
Epoch: 0, Step: 152, Rank: 2, loss = 1.2354522943496704
Epoch: 0, Step: 152, Rank: 1, loss = 1.0288060903549194
Per-token loss scaled by world size: 0.0012087048962712288
Epoch: 0, Step: 152, Rank: 6, loss = 0.9606181979179382
[2024-06-27 16:44:30,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=152, skipped=0, lr=[7.896103896103897e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:30,384] [INFO] [timer.py:260:stop] epoch=0/micro_step=152/global_step=152, RunningAvgSamplesPerSec=95.52105077686362, CurrSamplesPerSec=95.26266499604426, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 95.17174363466441 samples/s, lr: 7.896103896103897e-06, loss: 0.39838021993637085 cuda_mem_allocated: 22.30507516860962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6358.0 batch_size: 81.0 total loss: 0.8374165296554565
Epoch 0: 71% 152/213 [03:19<01:04, 1.06s/it]
total tokens: 2519 num samples: 11 num padding tokens: 168 - rank: 4 max len: 229 min len: 194 avg len: 213.72727272727272 num_loss_counted_tokens: 899
total tokens: 2522 num samples: 13 num padding tokens: 192 - rank: 5 max len: 194 min len: 163 avg len: 179.23076923076923 num_loss_counted_tokens: 828
total tokens: 2504 num samples: 8 num padding tokens: 148 - rank: 2 max len: 313 min len: 272 avg len: 294.5 num_loss_counted_tokens: 1253
total tokens: 2400 num samples: 15 num padding tokens: 176 - rank: 6 max len: 160 min len: 132 avg len: 148.26666666666668 num_loss_counted_tokens: 815
total tokens: 2448 num samples: 9 num padding tokens: 183 - rank: 3 max len: 272 min len: 240 avg len: 251.66666666666666 num_loss_counted_tokens: 578
total tokens: 2448 num samples: 6 num padding tokens: 272 - rank: 1 max len: 408 min len: 318 avg len: 362.6666666666667 num_loss_counted_tokens: 1177
total tokens: 2061 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2061 min len: 2061 avg len: 2061.0 num_loss_counted_tokens: 61
total tokens: 2096 num samples: 16 num padding tokens: 356 - rank: 7 max len: 131 min len: 78 avg len: 108.75 num_loss_counted_tokens: 428
Per-token loss scaled by world size: 0.0016401316970586777
Per-token loss scaled by world size: 0.0009697260684333742
Per-token loss scaled by world size: 0.0009346483275294304
Per-token loss scaled by world size: 0.000913429306820035
Per-token loss scaled by world size: 0.00241025909781456
Per-token loss scaled by world size: 0.0007596174837090075
Per-token loss scaled by world size: 0.0007443809299729764
Epoch: 0, Step: 153, Rank: 0, loss = 1.6300859451293945
Epoch: 0, Step: 153, Rank: 6, loss = 0.9289236068725586
Epoch: 0, Step: 153, Rank: 2, loss = 0.9637864828109741
Epoch: 0, Step: 153, Rank: 4, loss = 0.907834529876709
Epoch: 0, Step: 153, Rank: 5, loss = 0.7398216128349304
Epoch: 0, Step: 153, Rank: 1, loss = 2.395496368408203
Epoch: 0, Step: 153, Rank: 3, loss = 0.7549648284912109
Per-token loss scaled by world size: 0.0006195841124281287
Epoch: 0, Step: 153, Rank: 7, loss = 0.6157891750335693
[2024-06-27 16:44:31,371] [INFO] [logging.py:96:log_dist] [Rank 0] step=153, skipped=0, lr=[7.948051948051948e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:44:31,445] [INFO] [timer.py:260:stop] epoch=0/micro_step=153/global_step=153, RunningAvgSamplesPerSec=95.52057259037232, CurrSamplesPerSec=95.44889879652133, MemAllocated=22.29GB, MaxMemAllocated=28.61GB
throughput: 95.3623192387369 samples/s, lr: 7.948051948051948e-06, loss: 1.6300859451293945 cuda_mem_allocated: 22.28933095932007 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7951.0 batch_size: 78.0 total loss: 1.117087721824646
Epoch 0: 72% 153/213 [03:20<01:03, 1.06s/it]
total tokens: 2340 num samples: 5 num padding tokens: 277 - rank: 1 max len: 468 min len: 373 avg len: 412.6 num_loss_counted_tokens: 706
total tokens: 2415 num samples: 7 num padding tokens: 173 - rank: 2 max len: 345 min len: 308 avg len: 320.2857142857143 num_loss_counted_tokens: 642
total tokens: 2376 num samples: 8 num padding tokens: 127 - rank: 3 max len: 297 min len: 248 avg len: 281.125 num_loss_counted_tokens: 1122
total tokens: 2448 num samples: 16 num padding tokens: 186 - rank: 6 max len: 153 min len: 132 avg len: 141.375 num_loss_counted_tokens: 939
total tokens: 2470 num samples: 10 num padding tokens: 137 - rank: 4 max len: 247 min len: 212 avg len: 233.3 num_loss_counted_tokens: 890
total tokens: 2520 num samples: 12 num padding tokens: 365 - rank: 5 max len: 210 min len: 154 avg len: 179.58333333333334 num_loss_counted_tokens: 882
total tokens: 2334 num samples: 3 num padding tokens: 567 - rank: 0 max len: 778 min len: 492 avg len: 589.0 num_loss_counted_tokens: 116
total tokens: 2508 num samples: 19 num padding tokens: 386 - rank: 7 max len: 132 min len: 86 avg len: 111.6842105263158 num_loss_counted_tokens: 615
Per-token loss scaled by world size: 0.0009216332109645009 | 
Per-token loss scaled by world size: 0.001395265688188374 | 
Per-token loss scaled by world size: 0.0012246581027284265 | 
Per-token loss scaled by world size: 0.0015793470665812492 | 
Per-token loss scaled by world size: 0.0022104827221482992 | 
Per-token loss scaled by world size: 0.0011294519063085318 | 
Per-token loss scaled by world size: 0.0007388272206299007 | 
Epoch: 0, Step: 154, Rank: 4, loss = 1.333001971244812 | |
Epoch: 0, Step: 154, Rank: 5, loss = 0.8805053234100342 | |
Epoch: 0, Step: 154, Rank: 6, loss = 0.7058570384979248 | |
Epoch: 0, Step: 154, Rank: 1, loss = 1.1700077056884766 | 
Epoch: 0, Step: 154, Rank: 0, loss = 1.50886869430542 | 
Epoch: 0, Step: 154, Rank: 3, loss = 1.079050064086914 | 
Epoch: 0, Step: 154, Rank: 2, loss = 2.111840009689331 | 
Per-token loss scaled by world size: 0.0005048534949310124 | |
Epoch: 0, Step: 154, Rank: 7, loss = 0.48232442140579224 | |
[2024-06-27 16:44:32,426] [INFO] [logging.py:96:log_dist] [Rank 0] step=154, skipped=0, lr=[8.000000000000001e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:32,500] [INFO] [timer.py:260:stop] epoch=0/micro_step=154/global_step=154, RunningAvgSamplesPerSec=95.5232979435799, CurrSamplesPerSec=95.9366187166062, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.85145700680155 samples/s, lr: 8.000000000000001e-06, loss: 1.50886869430542 cuda_mem_allocated: 22.298396110534668 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7643.0 batch_size: 75.0 total loss: 1.1589319705963135 | |
Epoch 0: 72% 154/213 [03:21<01:02, 1.06s/it] total tokens: 2408 num samples: 8 num padding tokens: 314 - rank: 2 max len: 301 min len: 251 avg len: 261.75 num_loss_counted_tokens: 928 | |
total tokens: 2376 num samples: 11 num padding tokens: 224 - rank: 4 max len: 216 min len: 182 avg len: 195.63636363636363 num_loss_counted_tokens: 830 | |
total tokens: 2448 num samples: 17 num padding tokens: 161 - rank: 6 max len: 144 min len: 124 avg len: 134.52941176470588 num_loss_counted_tokens: 795 | |
total tokens: 2485 num samples: 7 num padding tokens: 178 - rank: 1 max len: 355 min len: 303 avg len: 329.57142857142856 num_loss_counted_tokens: 781 | |
total tokens: 2470 num samples: 10 num padding tokens: 123 - rank: 3 max len: 247 min len: 222 avg len: 234.7 num_loss_counted_tokens: 1140 | |
total tokens: 2534 num samples: 14 num padding tokens: 207 - rank: 5 max len: 181 min len: 148 avg len: 166.21428571428572 num_loss_counted_tokens: 882 | |
total tokens: 2356 num samples: 19 num padding tokens: 362 - rank: 7 max len: 124 min len: 86 avg len: 104.94736842105263 num_loss_counted_tokens: 546 | |
total tokens: 1959 num samples: 3 num padding tokens: 325 - rank: 0 max len: 653 min len: 481 avg len: 544.6666666666666 num_loss_counted_tokens: 993 | |
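Editor's note (an observation about this log's arithmetic, not the training source): the per-step "total loss" printed in the throughput line appears to be the plain arithmetic mean of the eight per-rank losses. A sketch checking that against the step-154 values above:

```python
# Sketch: the per-step "total loss" looks like the mean of the eight
# per-rank losses. Values copied from step 154 of this log; the averaging
# rule is inferred from the numbers, not from the training code.
rank_losses = [
    1.50886869430542,     # rank 0
    1.1700077056884766,   # rank 1
    2.111840009689331,    # rank 2
    1.079050064086914,    # rank 3
    1.333001971244812,    # rank 4
    0.8805053234100342,   # rank 5
    0.7058570384979248,   # rank 6
    0.48232442140579224,  # rank 7
]

total_loss = sum(rank_losses) / len(rank_losses)
print(total_loss)  # close to the 1.1589319705963135 "total loss" above
```

The small residual difference versus the logged value is consistent with float32 accumulation inside the trainer versus float64 here.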
Per-token loss scaled by world size: 0.0017538447864353657 | 
Per-token loss scaled by world size: 0.0008229690138250589 | 
Per-token loss scaled by world size: 0.0018292198656126857 | 
Per-token loss scaled by world size: 0.0015242770314216614 | 
Per-token loss scaled by world size: 9.683984535513446e-05 | 
Per-token loss scaled by world size: 0.0013910243287682533 | 
Per-token loss scaled by world size: 0.0009414778323844075 | 
Epoch: 0, Step: 155, Rank: 1, loss = 1.474351167678833 | |
Epoch: 0, Step: 155, Rank: 5, loss = 1.413598895072937 | 
Epoch: 0, Step: 155, Rank: 7, loss = 0.6633130311965942 | 
Epoch: 0, Step: 155, Rank: 2, loss = 1.2285672426223755 | 
Epoch: 0, Step: 155, Rank: 0, loss = 0.07805291563272476 | |
Epoch: 0, Step: 155, Rank: 6, loss = 1.1211656332015991 | 
Epoch: 0, Step: 155, Rank: 3, loss = 0.7588311433792114 | 
Per-token loss scaled by world size: 0.001262161647900939 | |
Epoch: 0, Step: 155, Rank: 4, loss = 1.0173022747039795 | |
[2024-06-27 16:44:33,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=155, skipped=0, lr=[8.051948051948052e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:33,564] [INFO] [timer.py:260:stop] epoch=0/micro_step=155/global_step=155, RunningAvgSamplesPerSec=95.5203929789199, CurrSamplesPerSec=95.0808834033988, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.98856065100797 samples/s, lr: 8.051948051948052e-06, loss: 0.07805291563272476 cuda_mem_allocated: 22.302927494049072 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6448.0 batch_size: 79.0 total loss: 0.9693978428840637 | |
Epoch 0: 73% 155/213 [03:22<01:01, 1.06s/it] total tokens: 2360 num samples: 5 num padding tokens: 220 - rank: 1 max len: 472 min len: 391 avg len: 428.0 num_loss_counted_tokens: 1355 | |
total tokens: 2366 num samples: 14 num padding tokens: 303 - rank: 6 max len: 169 min len: 132 avg len: 147.35714285714286 num_loss_counted_tokens: 810 | |
total tokens: 2450 num samples: 10 num padding tokens: 219 - rank: 4 max len: 245 min len: 205 avg len: 223.1 num_loss_counted_tokens: 863 | |
total tokens: 2432 num samples: 8 num padding tokens: 190 - rank: 3 max len: 304 min len: 249 avg len: 280.25 num_loss_counted_tokens: 1068 | |
total tokens: 2448 num samples: 12 num padding tokens: 130 - rank: 5 max len: 204 min len: 183 avg len: 193.16666666666666 num_loss_counted_tokens: 928 | |
total tokens: 2274 num samples: 6 num padding tokens: 108 - rank: 2 max len: 379 min len: 326 avg len: 361.0 num_loss_counted_tokens: 1005 | |
total tokens: 2489 num samples: 19 num padding tokens: 381 - rank: 7 max len: 131 min len: 84 avg len: 110.94736842105263 num_loss_counted_tokens: 569 | |
total tokens: 2442 num samples: 3 num padding tokens: 527 - rank: 0 max len: 814 min len: 525 avg len: 638.3333333333334 num_loss_counted_tokens: 594 | |
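Editor's note: the per-rank batch lines are internally consistent in a way worth spelling out. Each rank's micro-batch appears to be padded to its own max length, so total tokens = num samples × max len, and num padding tokens = total tokens − num samples × avg len. A sketch checking this against step 155's rank-1 line above (the padding scheme is inferred from the numbers, not from the data-collator source):

```python
# Sketch: consistency checks on one "total tokens ... rank: 1 ..." line
# (step 155, rank 1 above). With per-rank padding to the batch max length:
#   total_tokens == num_samples * max_len
#   num_padding  == total_tokens - num_samples * avg_len
num_samples, max_len, avg_len = 5, 472, 428.0
total_tokens, num_padding = 2360, 220

assert num_samples * max_len == total_tokens
assert total_tokens - round(num_samples * avg_len) == num_padding
```

The same identities hold for every rank line in this excerpt, which is why the token budget per rank stays near 2,500 while the sample count varies from 3 to 20.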
Per-token loss scaled by world size: 0.00195693620480597 | 
Per-token loss scaled by world size: 0.002407415071502328 | 
Per-token loss scaled by world size: 0.0012463328894227743 | 
Per-token loss scaled by world size: 0.0015858891420066357 | 
Per-token loss scaled by world size: 0.001947255339473486 | 
Per-token loss scaled by world size: 0.0009352493216283619 | |
Per-token loss scaled by world size: 0.00039935976383276284 | |
Epoch: 0, Step: 156, Rank: 1, loss = 1.4234436750411987 | |
Epoch: 0, Step: 156, Rank: 3, loss = 1.7598203420639038 | 
Epoch: 0, Step: 156, Rank: 4, loss = 1.4305202960968018 | 
Epoch: 0, Step: 156, Rank: 2, loss = 0.9110693335533142 | 
Epoch: 0, Step: 156, Rank: 5, loss = 1.1592849493026733 | 
Per-token loss scaled by world size: 0.000920161313842982 | 
Epoch: 0, Step: 156, Rank: 0, loss = 0.6836672425270081 | 
Epoch: 0, Step: 156, Rank: 7, loss = 0.29193198680877686 | 
Epoch: 0, Step: 156, Rank: 6, loss = 0.672637939453125 | |
[2024-06-27 16:44:34,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=156, skipped=0, lr=[8.103896103896105e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:34,617] [INFO] [timer.py:260:stop] epoch=0/micro_step=156/global_step=156, RunningAvgSamplesPerSec=95.52398849511951, CurrSamplesPerSec=96.07730982796416, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.96673211372057 samples/s, lr: 8.103896103896105e-06, loss: 0.6836672425270081 cuda_mem_allocated: 22.248186588287354 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5848.0 batch_size: 73.0 total loss: 1.0415469408035278 | |
Saving model in huggingface format at samples_seen: 14976 | |
Model saved in /instructlab/training_output/hf_format/samples_14976 | |
[16:44:52] INFO saving took 18.200772523880005 seconds utils.py:192 | |
Epoch 0: 73% 156/213 [03:42<06:11, 6.52s/it] total tokens: 2320 num samples: 20 num padding tokens: 235 - rank: 7 max len: 116 min len: 89 avg len: 104.25 num_loss_counted_tokens: 594 | |
total tokens: 2280 num samples: 8 num padding tokens: 256 - rank: 3 max len: 285 min len: 235 avg len: 253.0 num_loss_counted_tokens: 764 | |
total tokens: 2443 num samples: 7 num padding tokens: 158 - rank: 2 max len: 349 min len: 296 avg len: 326.42857142857144 num_loss_counted_tokens: 1380 | |
total tokens: 2475 num samples: 11 num padding tokens: 170 - rank: 4 max len: 225 min len: 192 avg len: 209.54545454545453 num_loss_counted_tokens: 855 | |
total tokens: 2340 num samples: 5 num padding tokens: 363 - rank: 1 max len: 468 min len: 352 avg len: 395.4 num_loss_counted_tokens: 1316 | |
total tokens: 1879 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1879 min len: 1879 avg len: 1879.0 num_loss_counted_tokens: 24 | |
total tokens: 2533 num samples: 17 num padding tokens: 266 - rank: 6 max len: 149 min len: 116 avg len: 133.35294117647058 num_loss_counted_tokens: 846 | |
total tokens: 2457 num samples: 13 num padding tokens: 251 - rank: 5 max len: 189 min len: 151 avg len: 169.69230769230768 num_loss_counted_tokens: 831 | |
Per-token loss scaled by world size: 0.001834677066653967 | 
Per-token loss scaled by world size: 0.0024485799949616194 | 
Per-token loss scaled by world size: 0.0018722937675192952 | 
Per-token loss scaled by world size: 0.0005153739475645125 | 
Per-token loss scaled by world size: 0.0018307490972802043 | 
Per-token loss scaled by world size: 0.0014390636933967471 | 
Per-token loss scaled by world size: 1.293275363423163e-05 | |
Epoch: 0, Step: 157, Rank: 1, loss = 1.9854923486709595 | |
Epoch: 0, Step: 157, Rank: 3, loss = 1.4876937866210938 | 
Epoch: 0, Step: 157, Rank: 4, loss = 1.5181962251663208 | 
Epoch: 0, Step: 157, Rank: 5, loss = 1.4845086336135864 | 
Epoch: 0, Step: 157, Rank: 7, loss = 0.4179038405418396 | 
Epoch: 0, Step: 157, Rank: 0, loss = 0.010486846789717674 | 
Epoch: 0, Step: 157, Rank: 2, loss = 1.1669007539749146 | 
Per-token loss scaled by world size: 0.0007140173111110926 | |
Epoch: 0, Step: 157, Rank: 6, loss = 0.5789787769317627 | |
[2024-06-27 16:44:53,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=157, skipped=0, lr=[8.155844155844157e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:53,877] [INFO] [timer.py:260:stop] epoch=0/micro_step=157/global_step=157, RunningAvgSamplesPerSec=95.52437665563359, CurrSamplesPerSec=95.58419104817747, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.4674332136315 samples/s, lr: 8.155844155844157e-06, loss: 0.010486846789717674 cuda_mem_allocated: 22.25570011138916 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6487.0 batch_size: 76.0 total loss: 1.0812700986862183 | |
Epoch 0: 74% 157/213 [03:43<04:33, 4.88s/it] total tokens: 2445 num samples: 15 num padding tokens: 263 - rank: 6 max len: 163 min len: 129 avg len: 145.46666666666667 num_loss_counted_tokens: 875 | |
total tokens: 2496 num samples: 13 num padding tokens: 115 - rank: 5 max len: 192 min len: 174 avg len: 183.15384615384616 num_loss_counted_tokens: 987 | |
total tokens: 2193 num samples: 17 num padding tokens: 336 - rank: 7 max len: 129 min len: 76 avg len: 109.23529411764706 num_loss_counted_tokens: 509 | |
total tokens: 2415 num samples: 5 num padding tokens: 247 - rank: 1 max len: 483 min len: 403 avg len: 433.6 num_loss_counted_tokens: 1454 | |
total tokens: 2202 num samples: 6 num padding tokens: 233 - rank: 2 max len: 367 min len: 290 avg len: 328.1666666666667 num_loss_counted_tokens: 1029 | |
total tokens: 2296 num samples: 8 num padding tokens: 141 - rank: 3 max len: 287 min len: 231 avg len: 269.375 num_loss_counted_tokens: 848 | |
total tokens: 2530 num samples: 11 num padding tokens: 131 - rank: 4 max len: 230 min len: 201 avg len: 218.0909090909091 num_loss_counted_tokens: 866 | |
total tokens: 1728 num samples: 2 num padding tokens: 376 - rank: 0 max len: 864 min len: 488 avg len: 676.0 num_loss_counted_tokens: 490 | |
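Editor's note: the `lr=[...]` values in the `log_dist` lines climb by a fixed increment each step, which is consistent with a linear warmup schedule. The slope inferred from this log is 8.0e-06 / 154 per step (step 154 logs exactly lr=8.000000000000001e-06); the warmup length and peak learning rate are not visible in this excerpt, so this is only an observation about the printed values, not the scheduler's actual configuration:

```python
# Sketch: check that the logged learning rates lie on a line through the
# origin with slope 8.0e-06 / 154. Step/lr pairs are copied from this log;
# the linear-warmup interpretation is an assumption.
slope = 8.0e-06 / 154  # inferred increment, lr units per step

for step, logged_lr in [(153, 7.948051948051948e-06),
                        (160, 8.311688311688313e-06),
                        (165, 8.571428571428571e-06)]:
    assert abs(step * slope - logged_lr) < 1e-12
```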
Per-token loss scaled by world size: 0.0010265298187732697 | 
Per-token loss scaled by world size: 0.0011326593812555075 | 
Per-token loss scaled by world size: 0.0014472390757873654 | 
Per-token loss scaled by world size: 0.00044208040344528854 | 
Per-token loss scaled by world size: 0.0009611049899831414 | 
Per-token loss scaled by world size: 0.0012791331391781569 | 
Per-token loss scaled by world size: 0.0007209066534414887 | |
Epoch: 0, Step: 158, Rank: 5, loss = 1.056684136390686 | 
Epoch: 0, Step: 158, Rank: 4, loss = 1.165931224822998 | 
Epoch: 0, Step: 158, Rank: 3, loss = 1.316707730293274 | |
Epoch: 0, Step: 158, Rank: 1, loss = 1.4897516965866089 | |
Epoch: 0, Step: 158, Rank: 2, loss = 0.9893374443054199 | 
Epoch: 0, Step: 158, Rank: 7, loss = 0.4550665020942688 | 
Per-token loss scaled by world size: 0.0013711793581023812 | |
Epoch: 0, Step: 158, Rank: 6, loss = 0.7420833110809326 | |
Epoch: 0, Step: 158, Rank: 0, loss = 1.4114577770233154 | |
[2024-06-27 16:44:54,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=158, skipped=0, lr=[8.20779220779221e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:54,934] [INFO] [timer.py:260:stop] epoch=0/micro_step=158/global_step=158, RunningAvgSamplesPerSec=95.52671347480282, CurrSamplesPerSec=95.89030797537188, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.781390368938 samples/s, lr: 8.20779220779221e-06, loss: 1.4114577770233154 cuda_mem_allocated: 22.3084135055542 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8235.0 batch_size: 79.0 total loss: 1.078377366065979 | |
Epoch 0: 74% 158/213 [03:44<03:25, 3.73s/it] total tokens: 2464 num samples: 11 num padding tokens: 198 - rank: 4 max len: 224 min len: 195 avg len: 206.0 num_loss_counted_tokens: 1044 | |
total tokens: 2286 num samples: 9 num padding tokens: 140 - rank: 3 max len: 254 min len: 228 avg len: 238.44444444444446 num_loss_counted_tokens: 894 | |
total tokens: 2522 num samples: 13 num padding tokens: 217 - rank: 5 max len: 194 min len: 166 avg len: 177.30769230769232 num_loss_counted_tokens: 867 | |
total tokens: 2196 num samples: 6 num padding tokens: 123 - rank: 1 max len: 366 min len: 307 avg len: 345.5 num_loss_counted_tokens: 1298 | |
total tokens: 2475 num samples: 15 num padding tokens: 292 - rank: 6 max len: 165 min len: 126 avg len: 145.53333333333333 num_loss_counted_tokens: 867 | |
total tokens: 2448 num samples: 8 num padding tokens: 203 - rank: 2 max len: 306 min len: 264 avg len: 280.625 num_loss_counted_tokens: 1040 | |
total tokens: 2445 num samples: 5 num padding tokens: 184 - rank: 0 max len: 489 min len: 429 avg len: 452.2 num_loss_counted_tokens: 1200 | |
total tokens: 2040 num samples: 17 num padding tokens: 298 - rank: 7 max len: 120 min len: 84 avg len: 102.47058823529412 num_loss_counted_tokens: 434 | |
Per-token loss scaled by world size: 0.0009523297776468098 | 
Per-token loss scaled by world size: 0.0029412847943603992 | 
Per-token loss scaled by world size: 0.00033892333158291876 | 
Per-token loss scaled by world size: 0.0012412263313308358 | 
Per-token loss scaled by world size: 0.0014161269646137953 | 
Per-token loss scaled by world size: 0.0013040577759966254 | 
Per-token loss scaled by world size: 0.0011136519024148583 | 
Epoch: 0, Step: 159, Rank: 2, loss = 2.306334972381592 | 
Epoch: 0, Step: 159, Rank: 1, loss = 0.7467455863952637 | 
Epoch: 0, Step: 159, Rank: 5, loss = 0.9732765555381775 | 
Epoch: 0, Step: 159, Rank: 4, loss = 1.022544264793396 | 
Epoch: 0, Step: 159, Rank: 6, loss = 1.11042058467865 | 
Epoch: 0, Step: 159, Rank: 7, loss = 0.2657582461833954 | 
Epoch: 0, Step: 159, Rank: 3, loss = 0.8732422590255737 | 
Per-token loss scaled by world size: 0.0011324587976559997 | |
Epoch: 0, Step: 159, Rank: 0, loss = 0.8879892230033875 | |
[2024-06-27 16:44:55,914] [INFO] [logging.py:96:log_dist] [Rank 0] step=159, skipped=0, lr=[8.25974025974026e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:55,987] [INFO] [timer.py:260:stop] epoch=0/micro_step=159/global_step=159, RunningAvgSamplesPerSec=95.53034629260411, CurrSamplesPerSec=96.10046972268438, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 96.01197201284474 samples/s, lr: 8.25974025974026e-06, loss: 0.8879892230033875 cuda_mem_allocated: 22.297919273376465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6273.0 batch_size: 80.0 total loss: 1.0232889652252197 | |
Epoch 0: 75% 159/213 [03:45<02:38, 2.93s/it] total tokens: 2338 num samples: 7 num padding tokens: 110 - rank: 2 max len: 334 min len: 306 avg len: 318.2857142857143 num_loss_counted_tokens: 689 | |
total tokens: 2140 num samples: 5 num padding tokens: 211 - rank: 1 max len: 428 min len: 336 avg len: 385.8 num_loss_counted_tokens: 371 | |
total tokens: 2448 num samples: 16 num padding tokens: 227 - rank: 6 max len: 153 min len: 123 avg len: 138.8125 num_loss_counted_tokens: 710 | |
total tokens: 2486 num samples: 11 num padding tokens: 225 - rank: 4 max len: 226 min len: 186 avg len: 205.54545454545453 num_loss_counted_tokens: 1039 | |
total tokens: 2424 num samples: 8 num padding tokens: 290 - rank: 3 max len: 303 min len: 230 avg len: 266.75 num_loss_counted_tokens: 1010 | |
total tokens: 2440 num samples: 20 num padding tokens: 338 - rank: 7 max len: 122 min len: 85 avg len: 105.1 num_loss_counted_tokens: 499 | |
total tokens: 2418 num samples: 13 num padding tokens: 214 - rank: 5 max len: 186 min len: 154 avg len: 169.53846153846155 num_loss_counted_tokens: 948 | |
total tokens: 2425 num samples: 5 num padding tokens: 97 - rank: 0 max len: 485 min len: 439 avg len: 465.6 num_loss_counted_tokens: 1028 | |
Per-token loss scaled by world size: 0.001245336141437292 | 
Per-token loss scaled by world size: 0.0009528773953206837 | 
Per-token loss scaled by world size: 0.0013875555014237761 | 
Per-token loss scaled by world size: 0.0007451264536939561 | 
Per-token loss scaled by world size: 0.0015872535295784473 | 
Per-token loss scaled by world size: 0.001065027667209506 | 
Per-token loss scaled by world size: 0.001153118908405304 | |
Epoch: 0, Step: 160, Rank: 6, loss = 1.0781497955322266 | 
Epoch: 0, Step: 160, Rank: 4, loss = 0.8249536156654358 | 
Epoch: 0, Step: 160, Rank: 0, loss = 1.201276183128357 | 
Epoch: 0, Step: 160, Rank: 2, loss = 1.3741647005081177 | |
Epoch: 0, Step: 160, Rank: 7, loss = 0.6450932025909424 | |
Per-token loss scaled by world size: 0.0015741924289613962 | |
Epoch: 0, Step: 160, Rank: 1, loss = 0.9220476746559143 | 
Epoch: 0, Step: 160, Rank: 3, loss = 0.9983126521110535 | 
Epoch: 0, Step: 160, Rank: 5, loss = 1.3628571033477783 | |
[2024-06-27 16:44:56,973] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=0, lr=[8.311688311688313e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:57,046] [INFO] [timer.py:260:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=95.5305368440001, CurrSamplesPerSec=95.56046284456318, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.46227272619508 samples/s, lr: 8.311688311688313e-06, loss: 1.201276183128357 cuda_mem_allocated: 22.28825807571411 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6926.0 batch_size: 90.0 total loss: 1.0508568286895752 | |
Epoch 0: 75% 160/213 [03:46<02:05, 2.37s/it] total tokens: 2320 num samples: 5 num padding tokens: 366 - rank: 1 max len: 464 min len: 327 avg len: 390.8 num_loss_counted_tokens: 1180 | |
total tokens: 2505 num samples: 15 num padding tokens: 361 - rank: 6 max len: 167 min len: 123 avg len: 142.93333333333334 num_loss_counted_tokens: 878 | |
total tokens: 1926 num samples: 3 num padding tokens: 229 - rank: 0 max len: 642 min len: 494 avg len: 565.6666666666666 num_loss_counted_tokens: 882 | |
total tokens: 2358 num samples: 9 num padding tokens: 97 - rank: 3 max len: 262 min len: 244 avg len: 251.22222222222223 num_loss_counted_tokens: 908 | |
total tokens: 2360 num samples: 10 num padding tokens: 159 - rank: 4 max len: 236 min len: 207 avg len: 220.1 num_loss_counted_tokens: 942 | |
total tokens: 2460 num samples: 12 num padding tokens: 228 - rank: 5 max len: 205 min len: 168 avg len: 186.0 num_loss_counted_tokens: 953 | |
total tokens: 2488 num samples: 8 num padding tokens: 128 - rank: 2 max len: 311 min len: 269 avg len: 295.0 num_loss_counted_tokens: 954 | |
total tokens: 1952 num samples: 16 num padding tokens: 245 - rank: 7 max len: 122 min len: 83 avg len: 106.6875 num_loss_counted_tokens: 513 | |
Per-token loss scaled by world size: 0.0006825725431554019 | |
Per-token loss scaled by world size: 0.0019457702292129397 | 
Per-token loss scaled by world size: 0.0009742581169120967 | 
Per-token loss scaled by world size: 0.002291369019076228 | 
Per-token loss scaled by world size: 0.0006346364971250296 | 
Per-token loss scaled by world size: 0.0006645788089372218 | 
Epoch: 0, Step: 161, Rank: 7, loss = 0.6624366641044617 | |
Epoch: 0, Step: 161, Rank: 5, loss = 0.9455174803733826 | 
Epoch: 0, Step: 161, Rank: 1, loss = 1.8883700370788574 | 
Epoch: 0, Step: 161, Rank: 2, loss = 2.223773717880249 | 
Per-token loss scaled by world size: 0.0013651832705363631 | 
Epoch: 0, Step: 161, Rank: 4, loss = 0.6159147024154663 | 
Per-token loss scaled by world size: 0.0003247994463890791 | |
Epoch: 0, Step: 161, Rank: 6, loss = 0.6449737548828125 | |
Epoch: 0, Step: 161, Rank: 0, loss = 0.31521785259246826 | 
Epoch: 0, Step: 161, Rank: 3, loss = 1.3249104022979736 | 
[2024-06-27 16:44:58,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=161, skipped=0, lr=[8.363636363636365e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:58,106] [INFO] [timer.py:260:stop] epoch=0/micro_step=161/global_step=161, RunningAvgSamplesPerSec=95.52485499208065, CurrSamplesPerSec=94.63553304073679, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.53490532614812 samples/s, lr: 8.363636363636365e-06, loss: 0.31521785259246826 cuda_mem_allocated: 22.316523551940918 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7764.0 batch_size: 75.0 total loss: 1.0776392221450806 | |
Epoch 0: 76% 161/213 [03:47<01:42, 1.98s/it] total tokens: 2366 num samples: 13 num padding tokens: 168 - rank: 5 max len: 182 min len: 157 avg len: 169.07692307692307 num_loss_counted_tokens: 796 | |
total tokens: 2331 num samples: 9 num padding tokens: 210 - rank: 3 max len: 259 min len: 221 avg len: 235.66666666666666 num_loss_counted_tokens: 766 | |
total tokens: 2480 num samples: 16 num padding tokens: 210 - rank: 6 max len: 155 min len: 129 avg len: 141.875 num_loss_counted_tokens: 767 | |
total tokens: 2240 num samples: 7 num padding tokens: 234 - rank: 2 max len: 320 min len: 268 avg len: 286.57142857142856 num_loss_counted_tokens: 1149 | |
total tokens: 2431 num samples: 11 num padding tokens: 230 - rank: 4 max len: 221 min len: 182 avg len: 200.0909090909091 num_loss_counted_tokens: 995 | |
total tokens: 2196 num samples: 6 num padding tokens: 132 - rank: 1 max len: 366 min len: 324 avg len: 344.0 num_loss_counted_tokens: 1154 | |
total tokens: 2094 num samples: 3 num padding tokens: 468 - rank: 0 max len: 698 min len: 373 avg len: 542.0 num_loss_counted_tokens: 1175 | |
total tokens: 2500 num samples: 20 num padding tokens: 318 - rank: 7 max len: 125 min len: 73 avg len: 109.1 num_loss_counted_tokens: 658 | |
Per-token loss scaled by world size: 0.0014379018684849143 | 
Per-token loss scaled by world size: 0.0008389458525925875 | 
Per-token loss scaled by world size: 0.0011760307243093848 | 
Per-token loss scaled by world size: 0.0006399924168363214 | 
Per-token loss scaled by world size: 0.001471386174671352 | 
Per-token loss scaled by world size: 0.0011908792657777667 | 
Per-token loss scaled by world size: 0.0010667771566659212 | 
Epoch: 0, Step: 162, Rank: 1, loss = 1.2867424488067627 | 
Epoch: 0, Step: 162, Rank: 3, loss = 1.065688133239746 | 
Epoch: 0, Step: 162, Rank: 0, loss = 0.5727131962776184 | 
Epoch: 0, Step: 162, Rank: 2, loss = 1.316706657409668 | 
Epoch: 0, Step: 162, Rank: 5, loss = 0.7507516741752625 | 
Epoch: 0, Step: 162, Rank: 4, loss = 1.0524004697799683 | 
Epoch: 0, Step: 162, Rank: 6, loss = 0.9546322226524353 | |
Per-token loss scaled by world size: 0.0007984080584719777 | |
Epoch: 0, Step: 162, Rank: 7, loss = 0.7144753932952881 | |
[2024-06-27 16:44:59,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=162, skipped=0, lr=[8.415584415584416e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:44:59,159] [INFO] [timer.py:260:stop] epoch=0/micro_step=162/global_step=162, RunningAvgSamplesPerSec=95.52827025622443, CurrSamplesPerSec=96.07442135691372, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.97318254233846 samples/s, lr: 8.415584415584416e-06, loss: 0.5727131962776184 cuda_mem_allocated: 22.26834201812744 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7159.0 batch_size: 82.0 total loss: 0.9642637968063354 | |
Epoch 0: 76% 162/213 [03:48<01:26, 1.70s/it] total tokens: 2422 num samples: 7 num padding tokens: 220 - rank: 1 max len: 346 min len: 290 avg len: 314.57142857142856 num_loss_counted_tokens: 1140 | |
total tokens: 2520 num samples: 9 num padding tokens: 126 - rank: 2 max len: 280 min len: 252 avg len: 266.0 num_loss_counted_tokens: 1095 | |
total tokens: 2483 num samples: 13 num padding tokens: 179 - rank: 5 max len: 191 min len: 163 avg len: 177.23076923076923 num_loss_counted_tokens: 850 | |
total tokens: 2385 num samples: 15 num padding tokens: 143 - rank: 6 max len: 159 min len: 137 avg len: 149.46666666666667 num_loss_counted_tokens: 795 | |
total tokens: 2365 num samples: 11 num padding tokens: 114 - rank: 4 max len: 215 min len: 194 avg len: 204.63636363636363 num_loss_counted_tokens: 950 | |
total tokens: 2490 num samples: 10 num padding tokens: 217 - rank: 3 max len: 249 min len: 216 avg len: 227.3 num_loss_counted_tokens: 1036 | |
total tokens: 2430 num samples: 18 num padding tokens: 274 - rank: 7 max len: 135 min len: 86 avg len: 119.77777777777777 num_loss_counted_tokens: 651 | |
total tokens: 2124 num samples: 4 num padding tokens: 315 - rank: 0 max len: 531 min len: 355 avg len: 452.25 num_loss_counted_tokens: 754 | |
Per-token loss scaled by world size: 0.0008776300819590688 | 
Per-token loss scaled by world size: 0.0005072795902378857 | 
Per-token loss scaled by world size: 0.0013124013785272837 | 
Per-token loss scaled by world size: 0.0013257238315418363 | 
Per-token loss scaled by world size: 0.0011973134241998196 | 
Per-token loss scaled by world size: 0.0009716937784105539 | 
Per-token loss scaled by world size: 0.0012665154645219445 | 
Epoch: 0, Step: 163, Rank: 4, loss = 0.7991918921470642 | |
Epoch: 0, Step: 163, Rank: 1, loss = 0.8848486542701721 | 
Epoch: 0, Step: 163, Rank: 7, loss = 0.4619414806365967 | 
Epoch: 0, Step: 163, Rank: 5, loss = 1.2072372436523438 | 
Epoch: 0, Step: 163, Rank: 2, loss = 1.1951055526733398 | 
Per-token loss scaled by world size: 0.0011598337441682816 | 
Epoch: 0, Step: 163, Rank: 6, loss = 1.0903035402297974 | 
Epoch: 0, Step: 163, Rank: 3, loss = 1.1533206701278687 | 
Epoch: 0, Step: 163, Rank: 0, loss = 1.05617356300354 | |
[2024-06-27 16:45:00,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=163, skipped=0, lr=[8.467532467532467e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:00,223] [INFO] [timer.py:260:stop] epoch=0/micro_step=163/global_step=163, RunningAvgSamplesPerSec=95.52560036124314, CurrSamplesPerSec=95.1003308232461, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.00270247696382 samples/s, lr: 8.467532467532467e-06, loss: 1.05617356300354 cuda_mem_allocated: 22.314138412475586 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7285.0 batch_size: 93.0 total loss: 0.9810153841972351 | |
Epoch 0: 77% 163/213 [03:49<01:15, 1.51s/it] total tokens: 2448 num samples: 9 num padding tokens: 192 - rank: 3 max len: 272 min len: 233 avg len: 250.66666666666666 num_loss_counted_tokens: 870 | |
total tokens: 2330 num samples: 10 num padding tokens: 179 - rank: 4 max len: 233 min len: 192 avg len: 215.1 num_loss_counted_tokens: 758 | |
total tokens: 2286 num samples: 6 num padding tokens: 118 - rank: 1 max len: 381 min len: 328 avg len: 361.3333333333333 num_loss_counted_tokens: 1594 | |
total tokens: 2470 num samples: 13 num padding tokens: 203 - rank: 5 max len: 190 min len: 156 avg len: 174.3846153846154 num_loss_counted_tokens: 768 | |
total tokens: 2500 num samples: 20 num padding tokens: 428 - rank: 7 max len: 125 min len: 81 avg len: 103.6 num_loss_counted_tokens: 531 | |
total tokens: 2496 num samples: 16 num padding tokens: 222 - rank: 6 max len: 156 min len: 125 avg len: 142.125 num_loss_counted_tokens: 847 | |
total tokens: 2289 num samples: 7 num padding tokens: 195 - rank: 2 max len: 327 min len: 274 avg len: 299.14285714285717 num_loss_counted_tokens: 1004 | |
total tokens: 2315 num samples: 5 num padding tokens: 213 - rank: 0 max len: 463 min len: 387 avg len: 420.4 num_loss_counted_tokens: 951 | |
Per-token loss scaled by world size: 0.0013982943492010236 | 
Per-token loss scaled by world size: 0.0010906007373705506 | 
Per-token loss scaled by world size: 0.0014007121790200472 | 
Per-token loss scaled by world size: 0.0018552440451458097 | 
Per-token loss scaled by world size: 0.0006412114598788321 | 
Per-token loss scaled by world size: 0.0016390442615374923 | 
Per-token loss scaled by world size: 0.0012130242539569736 | 
Epoch: 0, Step: 164, Rank: 4, loss = 1.254095196723938 | |
Epoch: 0, Step: 164, Rank: 5, loss = 0.9781325459480286 | 
Epoch: 0, Step: 164, Rank: 2, loss = 1.2562637329101562 | 
Epoch: 0, Step: 164, Rank: 3, loss = 1.0879311561584473 | 
Epoch: 0, Step: 164, Rank: 0, loss = 1.6639219522476196 | 
Epoch: 0, Step: 164, Rank: 1, loss = 1.4700177907943726 | 
Epoch: 0, Step: 164, Rank: 7, loss = 0.5750865340232849 | |
Per-token loss scaled by world size: 0.0009058396681211889 | |
Epoch: 0, Step: 164, Rank: 6, loss = 0.8124249577522278 | |
[2024-06-27 16:45:01,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=164, skipped=0, lr=[8.51948051948052e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:01,289] [INFO] [timer.py:260:stop] epoch=0/micro_step=164/global_step=164, RunningAvgSamplesPerSec=95.52233437832396, CurrSamplesPerSec=94.9994075707783, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.90109152479735 samples/s, lr: 8.51948051948052e-06, loss: 1.6639219522476196 cuda_mem_allocated: 22.29601001739502 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7175.0 batch_size: 90.0 total loss: 1.1372342109680176 | |
Epoch 0: 77% 164/213 [03:50<01:07, 1.38s/it] total tokens: 2219 num samples: 7 num padding tokens: 143 - rank: 2 max len: 317 min len: 277 avg len: 296.57142857142856 num_loss_counted_tokens: 1110 | |
total tokens: 2460 num samples: 6 num padding tokens: 371 - rank: 1 max len: 410 min len: 321 avg len: 348.1666666666667 num_loss_counted_tokens: 1222 | |
total tokens: 2366 num samples: 14 num padding tokens: 173 - rank: 6 max len: 169 min len: 133 avg len: 156.64285714285714 num_loss_counted_tokens: 866 | |
total tokens: 2370 num samples: 10 num padding tokens: 223 - rank: 4 max len: 237 min len: 199 avg len: 214.7 num_loss_counted_tokens: 807 | |
total tokens: 2466 num samples: 9 num padding tokens: 67 - rank: 3 max len: 274 min len: 258 avg len: 266.55555555555554 num_loss_counted_tokens: 1452 | |
total tokens: 2508 num samples: 19 num padding tokens: 423 - rank: 7 max len: 132 min len: 75 avg len: 109.73684210526316 num_loss_counted_tokens: 611 | |
total tokens: 2512 num samples: 4 num padding tokens: 355 - rank: 0 max len: 628 min len: 415 avg len: 539.25 num_loss_counted_tokens: 973 | |
total tokens: 2470 num samples: 13 num padding tokens: 128 - rank: 5 max len: 190 min len: 171 avg len: 180.15384615384616 num_loss_counted_tokens: 934 | |
Per-token loss scaled by world size: 0.0008241506875492632 | 
Per-token loss scaled by world size: 0.00046000751899555326 | 
Per-token loss scaled by world size: 0.0007669746992178261 | 
Per-token loss scaled by world size: 0.0012351912446320057 | 
Per-token loss scaled by world size: 0.0013300224673002958 | 
Per-token loss scaled by world size: 0.0013751053484156728 | 
Per-token loss scaled by world size: 0.0021312134340405464 | 
Epoch: 0, Step: 165, Rank: 7, loss = 0.45540744066238403 | |
Epoch: 0, Step: 165, Rank: 6, loss = 0.815909206867218 | |
Epoch: 0, Step: 165, Rank: 4, loss = 0.759304940700531 | |
Epoch: 0, Step: 165, Rank: 1, loss = 1.22283935546875 | 
Per-token loss scaled by world size: 0.001149850431829691 | 
Epoch: 0, Step: 165, Rank: 0, loss = 2.109901189804077 | 
Epoch: 0, Step: 165, Rank: 3, loss = 1.3613543510437012 | |
Epoch: 0, Step: 165, Rank: 2, loss = 1.3167222738265991 | |
Epoch: 0, Step: 165, Rank: 5, loss = 1.1383519172668457 | |
[2024-06-27 16:45:02,277] [INFO] [logging.py:96:log_dist] [Rank 0] step=165, skipped=0, lr=[8.571428571428571e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:02,350] [INFO] [timer.py:260:stop] epoch=0/micro_step=165/global_step=165, RunningAvgSamplesPerSec=95.5211540600801, CurrSamplesPerSec=95.33032685507348, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.23335215201077 samples/s, lr: 8.571428571428571e-06, loss: 2.109901189804077 cuda_mem_allocated: 22.30256938934326 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7920.0 batch_size: 74.0 total loss: 1.1474738121032715 | |
Epoch 0: 77% 165/213 [03:51<01:01, 1.28s/it] total tokens: 2400 num samples: 15 num padding tokens: 193 - rank: 6 max len: 160 min len: 134 avg len: 147.13333333333333 num_loss_counted_tokens: 845 | |
total tokens: 2328 num samples: 6 num padding tokens: 238 - rank: 2 max len: 388 min len: 307 avg len: 348.3333333333333 num_loss_counted_tokens: 1098 | |
total tokens: 2331 num samples: 9 num padding tokens: 127 - rank: 4 max len: 259 min len: 235 avg len: 244.88888888888889 num_loss_counted_tokens: 1002 | |
total tokens: 2530 num samples: 11 num padding tokens: 458 - rank: 5 max len: 230 min len: 161 avg len: 188.36363636363637 num_loss_counted_tokens: 914 | |
total tokens: 2416 num samples: 8 num padding tokens: 125 - rank: 3 max len: 302 min len: 268 avg len: 286.375 num_loss_counted_tokens: 1033 | |
total tokens: 2215 num samples: 5 num padding tokens: 138 - rank: 1 max len: 443 min len: 391 avg len: 415.4 num_loss_counted_tokens: 880 | |
total tokens: 2322 num samples: 18 num padding tokens: 447 - rank: 7 max len: 129 min len: 79 avg len: 104.16666666666667 num_loss_counted_tokens: 504 | |
total tokens: 2526 num samples: 3 num padding tokens: 771 - rank: 0 max len: 842 min len: 456 avg len: 585.0 num_loss_counted_tokens: 1353 | |
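The step-165 numbers above are internally consistent: each rank's logged loss equals its "Per-token loss scaled by world size" value times num_loss_counted_tokens / world_size (7920 / 8 here), and the logged "total loss" is the plain mean of the eight per-rank losses. A minimal sketch of that reduction follows; the formula is inferred from the printed values only, not taken from the actual training code:

```python
# Reduction inferred (an assumption) from the step-165 log lines above.

def reduce_step_loss(scaled_per_token_losses, num_loss_counted_tokens, world_size):
    """Recover per-rank losses and the logged 'total loss' from scaled values."""
    # Each rank's loss = scaled per-token loss * (loss-counted tokens / world size).
    per_rank = [s * num_loss_counted_tokens / world_size
                for s in scaled_per_token_losses]
    # The logged "total loss" is the mean over all ranks.
    total = sum(per_rank) / len(per_rank)
    return per_rank, total

# The eight "Per-token loss scaled by world size" values printed for step 165,
# ordered by the rank each one matches:
scaled = [
    0.0021312134340405464,  # rank 0
    0.0012351912446320057,  # rank 1
    0.0013300224673002958,  # rank 2
    0.0013751053484156728,  # rank 3
    0.0007669746992178261,  # rank 4
    0.001149850431829691,   # rank 5
    0.0008241506875492632,  # rank 6
    0.00046000751899555326, # rank 7
]
per_rank, total = reduce_step_loss(scaled, num_loss_counted_tokens=7920, world_size=8)
# total comes out to ~1.14747, matching "total loss: 1.1474738121032715" above
```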
Per-token loss scaled by world size: 0.0007832744158804417
Per-token loss scaled by world size: 0.0010006297379732132
Per-token loss scaled by world size: 0.0009075973648577929
Per-token loss scaled by world size: 0.0013177812797948718
Per-token loss scaled by world size: 0.0012516657589003444
Per-token loss scaled by world size: 0.002046181121841073
Epoch: 0, Step: 166, Rank: 4, loss = 0.7368654012680054
Per-token loss scaled by world size: 0.0013636179501190782
Epoch: 0, Step: 166, Rank: 3, loss = 0.9413424730300903
Epoch: 0, Step: 166, Rank: 5, loss = 0.8538222312927246
Epoch: 0, Step: 166, Rank: 6, loss = 1.2397027015686035
Epoch: 0, Step: 166, Rank: 2, loss = 1.177504539489746
Epoch: 0, Step: 166, Rank: 0, loss = 1.9249448776245117
Epoch: 0, Step: 166, Rank: 1, loss = 1.2828235626220703
Per-token loss scaled by world size: 0.0007487289258278906
Epoch: 0, Step: 166, Rank: 7, loss = 0.7043667435646057
[2024-06-27 16:45:03,336] [INFO] [logging.py:96:log_dist] [Rank 0] step=166, skipped=0, lr=[8.623376623376624e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:03,409] [INFO] [timer.py:260:stop] epoch=0/micro_step=166/global_step=166, RunningAvgSamplesPerSec=95.52142902235161, CurrSamplesPerSec=95.56626904071355, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 95.46879132937315 samples/s, lr: 8.623376623376624e-06, loss: 1.9249448776245117 cuda_mem_allocated: 22.27036952972412 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7526.0 batch_size: 86.0 total loss: 1.1076714992523193
Epoch 0: 78% 166/213 [03:52<00:57, 1.21s/it]
total tokens: 2388 num samples: 6 num padding tokens: 196 - rank: 1 max len: 398 min len: 330 avg len: 365.3333333333333 num_loss_counted_tokens: 1142
total tokens: 2509 num samples: 13 num padding tokens: 160 - rank: 5 max len: 193 min len: 167 avg len: 180.69230769230768 num_loss_counted_tokens: 838
total tokens: 2505 num samples: 15 num padding tokens: 280 - rank: 6 max len: 167 min len: 134 avg len: 148.33333333333334 num_loss_counted_tokens: 885
total tokens: 2296 num samples: 7 num padding tokens: 175 - rank: 2 max len: 328 min len: 271 avg len: 303.0 num_loss_counted_tokens: 1188
total tokens: 2376 num samples: 9 num padding tokens: 132 - rank: 3 max len: 264 min len: 234 avg len: 249.33333333333334 num_loss_counted_tokens: 1121
total tokens: 2508 num samples: 11 num padding tokens: 204 - rank: 4 max len: 228 min len: 193 avg len: 209.45454545454547 num_loss_counted_tokens: 1064
total tokens: 2175 num samples: 5 num padding tokens: 89 - rank: 0 max len: 435 min len: 404 avg len: 417.2 num_loss_counted_tokens: 1093
total tokens: 1920 num samples: 15 num padding tokens: 299 - rank: 7 max len: 128 min len: 82 avg len: 108.06666666666666 num_loss_counted_tokens: 384
Per-token loss scaled by world size: 0.0010036593303084373
Per-token loss scaled by world size: 0.0013827683869749308
Per-token loss scaled by world size: 0.0014345779782161117
Per-token loss scaled by world size: 0.0006510185194201767
Per-token loss scaled by world size: 0.0008119576377794147
Per-token loss scaled by world size: 0.0012880823342129588
Per-token loss scaled by world size: 0.001457458594813943
Epoch: 0, Step: 167, Rank: 2, loss = 1.2871750593185425
Epoch: 0, Step: 167, Rank: 6, loss = 0.9005333185195923
Epoch: 0, Step: 167, Rank: 1, loss = 1.2406889200210571
Epoch: 0, Step: 167, Rank: 3, loss = 1.1557319164276123
Per-token loss scaled by world size: 0.00107005110476166
Epoch: 0, Step: 167, Rank: 0, loss = 1.3077046871185303
Epoch: 0, Step: 167, Rank: 7, loss = 0.584126353263855
Epoch: 0, Step: 167, Rank: 5, loss = 0.7285289764404297
Epoch: 0, Step: 167, Rank: 4, loss = 0.9601033329963684
[2024-06-27 16:45:04,398] [INFO] [logging.py:96:log_dist] [Rank 0] step=167, skipped=0, lr=[8.675324675324675e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:04,472] [INFO] [timer.py:260:stop] epoch=0/micro_step=167/global_step=167, RunningAvgSamplesPerSec=95.51948023286842, CurrSamplesPerSec=95.20095103260238, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 95.11338257032996 samples/s, lr: 8.675324675324675e-06, loss: 1.3077046871185303 cuda_mem_allocated: 22.248186588287354 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7178.0 batch_size: 81.0 total loss: 1.0205739736557007
Epoch 0: 78% 167/213 [03:53<00:53, 1.17s/it]
total tokens: 2343 num samples: 11 num padding tokens: 183 - rank: 5 max len: 213 min len: 181 avg len: 196.36363636363637 num_loss_counted_tokens: 858
total tokens: 2488 num samples: 8 num padding tokens: 198 - rank: 3 max len: 311 min len: 261 avg len: 286.25 num_loss_counted_tokens: 923
total tokens: 2322 num samples: 9 num padding tokens: 203 - rank: 4 max len: 258 min len: 217 avg len: 235.44444444444446 num_loss_counted_tokens: 739
total tokens: 1989 num samples: 3 num padding tokens: 547 - rank: 1 max len: 663 min len: 379 avg len: 480.6666666666667 num_loss_counted_tokens: 681
total tokens: 2464 num samples: 7 num padding tokens: 112 - rank: 2 max len: 352 min len: 317 avg len: 336.0 num_loss_counted_tokens: 1312
total tokens: 2464 num samples: 14 num padding tokens: 391 - rank: 6 max len: 176 min len: 129 avg len: 148.07142857142858 num_loss_counted_tokens: 810
total tokens: 1280 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1280 min len: 1280 avg len: 1280.0 num_loss_counted_tokens: 1177
total tokens: 2064 num samples: 16 num padding tokens: 313 - rank: 7 max len: 129 min len: 88 avg len: 109.4375 num_loss_counted_tokens: 490
Per-token loss scaled by world size: 0.0008998421835713089
Per-token loss scaled by world size: 0.001561703160405159
Per-token loss scaled by world size: 0.0007155205239541829
Per-token loss scaled by world size: 0.0019404091872274876
Per-token loss scaled by world size: 0.001316272304393351
Per-token loss scaled by world size: 0.0006784764700569212
Per-token loss scaled by world size: 0.001315814908593893
Epoch: 0, Step: 168, Rank: 4, loss = 0.8010845184326172
Epoch: 0, Step: 168, Rank: 2, loss = 1.3903062343597412
Epoch: 0, Step: 168, Rank: 7, loss = 0.6040136814117432
Epoch: 0, Step: 168, Rank: 3, loss = 1.171404242515564
Epoch: 0, Step: 168, Rank: 1, loss = 1.7274492979049683
Epoch: 0, Step: 168, Rank: 0, loss = 0.6369921565055847
Epoch: 0, Step: 168, Rank: 5, loss = 1.1718114614486694
Per-token loss scaled by world size: 0.0009378840331919491
Epoch: 0, Step: 168, Rank: 6, loss = 0.8349512815475464
[2024-06-27 16:45:05,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=168, skipped=0, lr=[8.727272727272728e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:05,534] [INFO] [timer.py:260:stop] epoch=0/micro_step=168/global_step=168, RunningAvgSamplesPerSec=95.51657645473954, CurrSamplesPerSec=95.03985883275973, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 94.94257840924043 samples/s, lr: 8.727272727272728e-06, loss: 0.6369921565055847 cuda_mem_allocated: 22.246159553527832 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7122.0 batch_size: 82.0 total loss: 1.0422515869140625
Epoch 0: 79% 168/213 [03:54<00:51, 1.14s/it]
total tokens: 2295 num samples: 9 num padding tokens: 213 - rank: 4 max len: 255 min len: 209 avg len: 231.33333333333334 num_loss_counted_tokens: 811
total tokens: 2376 num samples: 8 num padding tokens: 185 - rank: 3 max len: 297 min len: 260 avg len: 273.875 num_loss_counted_tokens: 746
total tokens: 2478 num samples: 14 num padding tokens: 298 - rank: 6 max len: 177 min len: 136 avg len: 155.71428571428572 num_loss_counted_tokens: 754
total tokens: 2376 num samples: 6 num padding tokens: 296 - rank: 2 max len: 396 min len: 303 avg len: 346.6666666666667 num_loss_counted_tokens: 1138
total tokens: 2244 num samples: 4 num padding tokens: 207 - rank: 1 max len: 561 min len: 431 avg len: 509.25 num_loss_counted_tokens: 991
total tokens: 2448 num samples: 12 num padding tokens: 159 - rank: 5 max len: 204 min len: 179 avg len: 190.75 num_loss_counted_tokens: 1060
total tokens: 2251 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2251 min len: 2251 avg len: 2251.0 num_loss_counted_tokens: 24
total tokens: 2448 num samples: 18 num padding tokens: 281 - rank: 7 max len: 136 min len: 92 avg len: 120.38888888888889 num_loss_counted_tokens: 702
Per-token loss scaled by world size: 0.0009560675825923681
Per-token loss scaled by world size: 0.00148081686347723
Per-token loss scaled by world size: 0.0016440061153843999
Per-token loss scaled by world size: 0.0026628724299371243
Per-token loss scaled by world size: 0.0025588665157556534
Per-token loss scaled by world size: 0.0006659601931460202
Per-token loss scaled by world size: 4.704659659182653e-05
Epoch: 0, Step: 169, Rank: 3, loss = 0.7217115163803101
Epoch: 0, Step: 169, Rank: 1, loss = 1.931624412536621
Epoch: 0, Step: 169, Rank: 6, loss = 1.1178315877914429
Epoch: 0, Step: 169, Rank: 4, loss = 1.2410191297531128
Epoch: 0, Step: 169, Rank: 2, loss = 2.0101358890533447
Epoch: 0, Step: 169, Rank: 7, loss = 0.5027167201042175
Per-token loss scaled by world size: 0.001737786689773202
Epoch: 0, Step: 169, Rank: 0, loss = 0.03551429882645607
Epoch: 0, Step: 169, Rank: 5, loss = 1.3118116855621338
[2024-06-27 16:45:06,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=169, skipped=0, lr=[8.779220779220779e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:06,595] [INFO] [timer.py:260:stop] epoch=0/micro_step=169/global_step=169, RunningAvgSamplesPerSec=95.51537653151031, CurrSamplesPerSec=95.31660628694037, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 95.22803676650572 samples/s, lr: 8.779220779220779e-06, loss: 0.03551429882645607 cuda_mem_allocated: 22.26035165786743 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6039.0 batch_size: 79.0 total loss: 1.1090456247329712
Epoch 0: 79% 169/213 [03:55<00:49, 1.11s/it]
total tokens: 2490 num samples: 10 num padding tokens: 139 - rank: 4 max len: 249 min len: 222 avg len: 235.1 num_loss_counted_tokens: 859
total tokens: 2436 num samples: 12 num padding tokens: 187 - rank: 5 max len: 203 min len: 177 avg len: 187.41666666666666 num_loss_counted_tokens: 980
total tokens: 2520 num samples: 8 num padding tokens: 262 - rank: 3 max len: 315 min len: 249 avg len: 282.25 num_loss_counted_tokens: 1248
total tokens: 2202 num samples: 6 num padding tokens: 125 - rank: 2 max len: 367 min len: 329 avg len: 346.1666666666667 num_loss_counted_tokens: 1153
total tokens: 2225 num samples: 5 num padding tokens: 132 - rank: 1 max len: 445 min len: 383 avg len: 418.6 num_loss_counted_tokens: 1291
total tokens: 2464 num samples: 14 num padding tokens: 324 - rank: 6 max len: 176 min len: 131 avg len: 152.85714285714286 num_loss_counted_tokens: 793
total tokens: 2470 num samples: 19 num padding tokens: 346 - rank: 7 max len: 130 min len: 81 avg len: 111.78947368421052 num_loss_counted_tokens: 653
total tokens: 2094 num samples: 3 num padding tokens: 284 - rank: 0 max len: 698 min len: 486 avg len: 603.3333333333334 num_loss_counted_tokens: 679
Per-token loss scaled by world size: 0.0015283907996490598
Per-token loss scaled by world size: 0.0010944915702566504
Per-token loss scaled by world size: 0.0010478865588083863
Per-token loss scaled by world size: 0.0009824164444580674
Per-token loss scaled by world size: 0.0015285349218174815
Per-token loss scaled by world size: 8.862371032591909e-05
Per-token loss scaled by world size: 0.001544258906506002
Epoch: 0, Step: 170, Rank: 3, loss = 1.1294808387756348
Epoch: 0, Step: 170, Rank: 1, loss = 0.7743881940841675
Epoch: 0, Step: 170, Rank: 4, loss = 0.8088292479515076
Epoch: 0, Step: 170, Rank: 2, loss = 1.1412073373794556
Epoch: 0, Step: 170, Rank: 7, loss = 0.7260057330131531
Epoch: 0, Step: 170, Rank: 5, loss = 1.1295872926712036
Epoch: 0, Step: 170, Rank: 0, loss = 0.06549292057752609
Per-token loss scaled by world size: 0.0016950792632997036
Epoch: 0, Step: 170, Rank: 6, loss = 1.2526636123657227
[2024-06-27 16:45:07,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=0, lr=[8.831168831168832e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:07,655] [INFO] [timer.py:260:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=95.51581875646623, CurrSamplesPerSec=95.58972781179715, MemAllocated=22.29GB, MaxMemAllocated=28.61GB
throughput: 95.48001989971416 samples/s, lr: 8.831168831168832e-06, loss: 0.06549292057752609 cuda_mem_allocated: 22.292909622192383 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5912.0 batch_size: 80.0 total loss: 0.8784568905830383
Epoch 0: 80% 170/213 [03:56<00:47, 1.10s/it]
total tokens: 2340 num samples: 12 num padding tokens: 169 - rank: 5 max len: 195 min len: 167 avg len: 180.91666666666666 num_loss_counted_tokens: 1055
total tokens: 2445 num samples: 15 num padding tokens: 165 - rank: 6 max len: 163 min len: 140 avg len: 152.0 num_loss_counted_tokens: 824
total tokens: 2475 num samples: 11 num padding tokens: 148 - rank: 4 max len: 225 min len: 196 avg len: 211.54545454545453 num_loss_counted_tokens: 984
total tokens: 2480 num samples: 10 num padding tokens: 111 - rank: 3 max len: 248 min len: 227 avg len: 236.9 num_loss_counted_tokens: 1035
total tokens: 2424 num samples: 8 num padding tokens: 231 - rank: 2 max len: 303 min len: 248 avg len: 274.125 num_loss_counted_tokens: 840
total tokens: 2499 num samples: 7 num padding tokens: 116 - rank: 1 max len: 357 min len: 306 avg len: 340.42857142857144 num_loss_counted_tokens: 1249
total tokens: 2527 num samples: 19 num padding tokens: 466 - rank: 7 max len: 133 min len: 80 avg len: 108.47368421052632 num_loss_counted_tokens: 581
total tokens: 2120 num samples: 5 num padding tokens: 159 - rank: 0 max len: 424 min len: 357 avg len: 392.2 num_loss_counted_tokens: 842
Per-token loss scaled by world size: 0.0011326620588079095
Per-token loss scaled by world size: 0.0018464570166543126
Per-token loss scaled by world size: 0.0012575339060276747
Per-token loss scaled by world size: 0.0009472208912484348
Per-token loss scaled by world size: 0.0011802929220721126
Per-token loss scaled by world size: 0.0007484994712285697
Per-token loss scaled by world size: 0.0012649205746129155
Epoch: 0, Step: 171, Rank: 3, loss = 1.5914151668548584
Epoch: 0, Step: 171, Rank: 6, loss = 0.9762130975723267
Epoch: 0, Step: 171, Rank: 2, loss = 1.0838370323181152
Epoch: 0, Step: 171, Rank: 4, loss = 0.8163859844207764
Epoch: 0, Step: 171, Rank: 7, loss = 0.6451129913330078
Epoch: 0, Step: 171, Rank: 5, loss = 1.01726496219635
Epoch: 0, Step: 171, Rank: 0, loss = 1.0902034044265747
Per-token loss scaled by world size: 0.0008528829785063863
Epoch: 0, Step: 171, Rank: 1, loss = 0.7350785136222839
[2024-06-27 16:45:08,639] [INFO] [logging.py:96:log_dist] [Rank 0] step=171, skipped=0, lr=[8.883116883116883e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:08,712] [INFO] [timer.py:260:stop] epoch=0/micro_step=171/global_step=171, RunningAvgSamplesPerSec=95.51692396945764, CurrSamplesPerSec=95.70296354783399, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 95.57999351964935 samples/s, lr: 8.883116883116883e-06, loss: 1.0902034044265747 cuda_mem_allocated: 22.248186588287354 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6895.0 batch_size: 89.0 total loss: 0.994438886642456
Epoch 0: 80% 171/213 [03:57<00:45, 1.09s/it]
total tokens: 2295 num samples: 5 num padding tokens: 64 - rank: 1 max len: 459 min len: 432 avg len: 446.2 num_loss_counted_tokens: 1542
total tokens: 2415 num samples: 15 num padding tokens: 259 - rank: 6 max len: 161 min len: 131 avg len: 143.73333333333332 num_loss_counted_tokens: 861
total tokens: 2444 num samples: 13 num padding tokens: 170 - rank: 5 max len: 188 min len: 162 avg len: 174.92307692307693 num_loss_counted_tokens: 971
total tokens: 2466 num samples: 6 num padding tokens: 257 - rank: 2 max len: 411 min len: 340 avg len: 368.1666666666667 num_loss_counted_tokens: 1287
total tokens: 2519 num samples: 11 num padding tokens: 265 - rank: 4 max len: 229 min len: 189 avg len: 204.9090909090909 num_loss_counted_tokens: 958
total tokens: 2336 num samples: 8 num padding tokens: 314 - rank: 3 max len: 292 min len: 231 avg len: 252.75 num_loss_counted_tokens: 878
total tokens: 2470 num samples: 19 num padding tokens: 387 - rank: 7 max len: 130 min len: 87 avg len: 109.63157894736842 num_loss_counted_tokens: 612
total tokens: 2127 num samples: 3 num padding tokens: 464 - rank: 0 max len: 709 min len: 466 avg len: 554.3333333333334 num_loss_counted_tokens: 359
Per-token loss scaled by world size: 0.0013863114872947335
Per-token loss scaled by world size: 0.00040177933988161385
Per-token loss scaled by world size: 0.0015643986407667398
Per-token loss scaled by world size: 0.0011546143796294928
Per-token loss scaled by world size: 0.0011599431745707989
Per-token loss scaled by world size: 0.0012639054330065846
Per-token loss scaled by world size: 0.0008857083739712834
Epoch: 0, Step: 172, Rank: 2, loss = 1.2462940216064453
Epoch: 0, Step: 172, Rank: 0, loss = 0.36119961738586426
Epoch: 0, Step: 172, Rank: 3, loss = 1.0427888631820679
Epoch: 0, Step: 172, Rank: 6, loss = 0.7962518334388733
Epoch: 0, Step: 172, Rank: 4, loss = 1.0379983186721802
Epoch: 0, Step: 172, Rank: 1, loss = 1.406394362449646
Epoch: 0, Step: 172, Rank: 5, loss = 1.1362509727478027
Per-token loss scaled by world size: 0.000800221343524754
Epoch: 0, Step: 172, Rank: 7, loss = 0.7193989753723145
[2024-06-27 16:45:09,697] [INFO] [logging.py:96:log_dist] [Rank 0] step=172, skipped=0, lr=[8.935064935064936e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:09,770] [INFO] [timer.py:260:stop] epoch=0/micro_step=172/global_step=172, RunningAvgSamplesPerSec=95.51743307433195, CurrSamplesPerSec=95.60354982801469, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 95.50289269606104 samples/s, lr: 8.935064935064936e-06, loss: 0.36119961738586426 cuda_mem_allocated: 22.305790424346924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7192.0 batch_size: 77.0 total loss: 0.9683221578598022
Epoch 0: 81% 172/213 [03:59<00:44, 1.08s/it]
total tokens: 2392 num samples: 8 num padding tokens: 197 - rank: 3 max len: 299 min len: 250 avg len: 274.375 num_loss_counted_tokens: 1120
total tokens: 2490 num samples: 10 num padding tokens: 229 - rank: 4 max len: 249 min len: 211 avg len: 226.1 num_loss_counted_tokens: 668
total tokens: 2346 num samples: 17 num padding tokens: 459 - rank: 7 max len: 138 min len: 83 avg len: 111.0 num_loss_counted_tokens: 593
total tokens: 2178 num samples: 6 num padding tokens: 68 - rank: 1 max len: 363 min len: 342 avg len: 351.6666666666667 num_loss_counted_tokens: 1107
total tokens: 2331 num samples: 7 num padding tokens: 148 - rank: 2 max len: 333 min len: 304 avg len: 311.85714285714283 num_loss_counted_tokens: 1071
total tokens: 2534 num samples: 14 num padding tokens: 261 - rank: 6 max len: 181 min len: 140 avg len: 162.35714285714286 num_loss_counted_tokens: 813
total tokens: 2532 num samples: 12 num padding tokens: 151 - rank: 5 max len: 211 min len: 185 avg len: 198.41666666666666 num_loss_counted_tokens: 956
total tokens: 2520 num samples: 5 num padding tokens: 369 - rank: 0 max len: 504 min len: 368 avg len: 430.2 num_loss_counted_tokens: 1270
Per-token loss scaled by world size: 0.002612935146316886
Per-token loss scaled by world size: 0.002688279841095209
Per-token loss scaled by world size: 0.0011648435611277819
Per-token loss scaled by world size: 0.0010634049540385604
Per-token loss scaled by world size: 0.0013304702006280422
Per-token loss scaled by world size: 0.000800069363322109
Per-token loss scaled by world size: 1.587607243891398e-06
Epoch: 0, Step: 173, Rank: 2, loss = 2.158937692642212
Epoch: 0, Step: 173, Rank: 1, loss = 2.221191167831421
Epoch: 0, Step: 173, Rank: 5, loss = 0.9624519944190979
Epoch: 0, Step: 173, Rank: 4, loss = 0.8786383867263794
Epoch: 0, Step: 173, Rank: 3, loss = 1.0993009805679321
Epoch: 0, Step: 173, Rank: 0, loss = 0.0013117605121806264
Epoch: 0, Step: 173, Rank: 7, loss = 0.6610572934150696
Per-token loss scaled by world size: 0.001148425624705851
Epoch: 0, Step: 173, Rank: 6, loss = 0.9488866925239563
[2024-06-27 16:45:10,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=173, skipped=0, lr=[8.987012987012987e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:10,837] [INFO] [timer.py:260:stop] epoch=0/micro_step=173/global_step=173, RunningAvgSamplesPerSec=95.51372212660489, CurrSamplesPerSec=94.88702465252159, MemAllocated=22.24GB, MaxMemAllocated=28.61GB
throughput: 94.78639719435425 samples/s, lr: 8.987012987012987e-06, loss: 0.0013117605121806264 cuda_mem_allocated: 22.238646030426025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6610.0 batch_size: 82.0 total loss: 1.1164721250534058
Epoch 0: 81% 173/213 [04:00<00:42, 1.07s/it]
total tokens: 2480 num samples: 10 num padding tokens: 206 - rank: 4 max len: 248 min len: 206 avg len: 227.4 num_loss_counted_tokens: 759
total tokens: 2280 num samples: 8 num padding tokens: 170 - rank: 3 max len: 285 min len: 249 avg len: 263.75 num_loss_counted_tokens: 1007
total tokens: 2424 num samples: 12 num padding tokens: 190 - rank: 5 max len: 202 min len: 173 avg len: 186.16666666666666 num_loss_counted_tokens: 969
total tokens: 2275 num samples: 7 num padding tokens: 133 - rank: 2 max len: 325 min len: 288 avg len: 306.0 num_loss_counted_tokens: 1135
total tokens: 2442 num samples: 6 num padding tokens: 286 - rank: 1 max len: 407 min len: 335 avg len: 359.3333333333333 num_loss_counted_tokens: 1367
total tokens: 2408 num samples: 14 num padding tokens: 269 - rank: 6 max len: 172 min len: 124 avg len: 152.78571428571428 num_loss_counted_tokens: 865
total tokens: 2480 num samples: 20 num padding tokens: 393 - rank: 7 max len: 124 min len: 86 avg len: 104.35 num_loss_counted_tokens: 559
total tokens: 2388 num samples: 4 num padding tokens: 386 - rank: 0 max len: 597 min len: 455 avg len: 500.5 num_loss_counted_tokens: 779
Per-token loss scaled by world size: 0.0011211256496608257
Per-token loss scaled by world size: 0.0007397548761218786
Per-token loss scaled by world size: 0.000849206349812448
Per-token loss scaled by world size: 0.001305122161284089
Per-token loss scaled by world size: 0.0023030356969684362
Per-token loss scaled by world size: 0.0013969514984637499
Per-token loss scaled by world size: 0.0011770547134801745
Epoch: 0, Step: 174, Rank: 4, loss = 0.9891130924224854
Epoch: 0, Step: 174, Rank: 1, loss = 2.031853199005127
Epoch: 0, Step: 174, Rank: 7, loss = 0.6526487469673157
Epoch: 0, Step: 174, Rank: 5, loss = 1.1514440774917603
Epoch: 0, Step: 174, Rank: 0, loss = 0.7492123246192932
Epoch: 0, Step: 174, Rank: 3, loss = 1.0384565591812134
Epoch: 0, Step: 174, Rank: 2, loss = 1.2324604988098145
Per-token loss scaled by world size: 0.001220523496158421
Epoch: 0, Step: 174, Rank: 6, loss = 1.076806902885437
[2024-06-27 16:45:11,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=174, skipped=0, lr=[9.03896103896104e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:11,895] [INFO] [timer.py:260:stop] epoch=0/micro_step=174/global_step=174, RunningAvgSamplesPerSec=95.51509604612146, CurrSamplesPerSec=95.75061899140783, MemAllocated=22.22GB, MaxMemAllocated=28.61GB
throughput: 95.6629914375588 samples/s, lr: 9.03896103896104e-06, loss: 0.7492123246192932 cuda_mem_allocated: 22.22063636779785 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7058.0 batch_size: 77.0 total loss: 1.115249514579773
Epoch 0: 82% 174/213 [04:01<00:41, 1.07s/it]
total tokens: 2336 num samples: 8 num padding tokens: 135 - rank: 3 max len: 292 min len: 254 avg len: 275.125 num_loss_counted_tokens: 992
total tokens: 2510 num samples: 10 num padding tokens: 206 - rank: 4 max len: 251 min len: 220 avg len: 230.4 num_loss_counted_tokens: 732
total tokens: 2478 num samples: 14 num padding tokens: 433 - rank: 6 max len: 177 min len: 127 avg len: 146.07142857142858 num_loss_counted_tokens: 873
total tokens: 2471 num samples: 7 num padding tokens: 226 - rank: 2 max len: 353 min len: 299 avg len: 320.7142857142857 num_loss_counted_tokens: 946
total tokens: 2514 num samples: 6 num padding tokens: 152 - rank: 1 max len: 419 min len: 367 avg len: 393.6666666666667 num_loss_counted_tokens: 1337
total tokens: 2088 num samples: 3 num padding tokens: 492 - rank: 0 max len: 696 min len: 423 avg len: 532.0 num_loss_counted_tokens: 172
total tokens: 2365 num samples: 11 num padding tokens: 242 - rank: 5 max len: 215 min len: 178 avg len: 193.0 num_loss_counted_tokens: 770
total tokens: 2480 num samples: 20 num padding tokens: 383 - rank: 7 max len: 124 min len: 82 avg len: 104.85 num_loss_counted_tokens: 511
Per-token loss scaled by world size: 0.0009717451757751405
Per-token loss scaled by world size: 0.0011971183121204376
Per-token loss scaled by world size: 0.001287206425331533
Per-token loss scaled by world size: 0.0012606112286448479
Per-token loss scaled by world size: 0.0008488288149237633
Per-token loss scaled by world size: 0.00039046432357281446
Per-token loss scaled by world size: 0.001592941232956946
Epoch: 0, Step: 175, Rank: 4, loss = 1.1438465118408203
Epoch: 0, Step: 175, Rank: 0, loss = 1.5220553874969482
Epoch: 0, Step: 175, Rank: 3, loss = 0.9285025000572205
Epoch: 0, Step: 175, Rank: 6, loss = 0.8110559582710266
Epoch: 0, Step: 175, Rank: 2, loss = 1.2045140266418457
Epoch: 0, Step: 175, Rank: 1, loss = 1.2299257516860962
Per-token loss scaled by world size: 0.0009589032852090895
Epoch: 0, Step: 175, Rank: 7, loss = 0.37308865785598755
Epoch: 0, Step: 175, Rank: 5, loss = 0.9162321090698242
[2024-06-27 16:45:12,882] [INFO] [logging.py:96:log_dist] [Rank 0] step=175, skipped=0, lr=[9.090909090909091e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:12,954] [INFO] [timer.py:260:stop] epoch=0/micro_step=175/global_step=175, RunningAvgSamplesPerSec=95.51336471757931, CurrSamplesPerSec=95.2165071224411, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 95.12904490154641 samples/s, lr: 9.090909090909091e-06, loss: 1.5220553874969482 cuda_mem_allocated: 22.306148052215576 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7644.0 batch_size: 84.0 total loss: 1.0161526203155518
Epoch 0: 82% 175/213 [04:02<00:40, 1.07s/it]
total tokens: 2520 num samples: 8 num padding tokens: 244 - rank: 3 max len: 315 min len: 259 avg len: 284.5 num_loss_counted_tokens: 1285
total tokens: 2260 num samples: 5 num padding tokens: 167 - rank: 1 max len: 452 min len: 365 avg len: 418.6 num_loss_counted_tokens: 952
total tokens: 2398 num samples: 11 num padding tokens: 130 - rank: 5 max len: 218 min len: 195 avg len: 206.1818181818182 num_loss_counted_tokens: 883
total tokens: 2322 num samples: 9 num padding tokens: 167 - rank: 4 max len: 258 min len: 225 avg len: 239.44444444444446 num_loss_counted_tokens: 749
total tokens: 2509 num samples: 13 num padding tokens: 441 - rank: 6 max len: 193 min len: 140 avg len: 159.07692307692307 num_loss_counted_tokens: 820
total tokens: 2527 num samples: 19 num padding tokens: 426 - rank: 7 max len: 133 min len: 95 avg len: 110.57894736842105 num_loss_counted_tokens: 541
total tokens: 2534 num samples: 7 num padding tokens: 173 - rank: 2 max len: 362 min len: 319 avg len: 337.2857142857143 num_loss_counted_tokens: 780
total tokens: 2523 num samples: 3 num padding tokens: 548 - rank: 0 max len: 841 min len: 556 avg len: 658.3333333333334 num_loss_counted_tokens: 995
Per-token loss scaled by world size: 0.001511968788690865
Per-token loss scaled by world size: 0.0017572097713127732
Per-token loss scaled by world size: 0.0011316514573991299
Per-token loss scaled by world size: 0.00019721472926903516
Per-token loss scaled by world size: 0.0006134875002317131
Per-token loss scaled by world size: 0.001095901825465262
Per-token loss scaled by world size: 0.001595411798916757
Epoch: 0, Step: 176, Rank: 5, loss = 1.1895414590835571
Epoch: 0, Step: 176, Rank: 4, loss = 1.382484793663025
Epoch: 0, Step: 176, Rank: 2, loss = 0.890326738357544
Epoch: 0, Step: 176, Rank: 6, loss = 0.8622007369995117
Epoch: 0, Step: 176, Rank: 7, loss = 0.48266127705574036
Epoch: 0, Step: 176, Rank: 1, loss = 0.15515868365764618
Epoch: 0, Step: 176, Rank: 3, loss = 1.2551902532577515
Per-token loss scaled by world size: 0.0025945627130568027
Epoch: 0, Step: 176, Rank: 0, loss = 2.0412721633911133
[2024-06-27 16:45:13,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=176, skipped=0, lr=[9.142857142857144e-06], mom=[(0.9, 0.95)]
[2024-06-27 16:45:14,014] [INFO] [timer.py:260:stop] epoch=0/micro_step=176/global_step=176, RunningAvgSamplesPerSec=95.51318968110657, CurrSamplesPerSec=95.48291802406345, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 95.38999404424317 samples/s, lr: 9.142857142857144e-06, loss: 2.0412721633911133 cuda_mem_allocated: 22.303761959075928 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6294.0 batch_size: 85.0 total loss: 1.0323545932769775
Epoch 0: 83% 176/213 [04:03<00:39, 1.06s/it]
total tokens: 2298 num samples: 6 num padding tokens: 159 - rank: 1 max len: 383 min len: 326 avg len: 356.5 num_loss_counted_tokens: 1312
total tokens: 2365 num samples: 11 num padding tokens: 178 - rank: 4 max len: 215 min len: 182 avg len: 198.8181818181818 num_loss_counted_tokens: 800
total tokens: 2520 num samples: 14 num padding tokens: 182 - rank: 5 max len: 180 min len: 154 avg len: 167.0 num_loss_counted_tokens: 919
total tokens: 2349 num samples: 9 num padding tokens: 212 - rank: 3 max len: 261 min len: 220 avg len: 237.44444444444446 num_loss_counted_tokens: 887
total tokens: 2516 num samples: 17 num padding tokens: 296 - rank: 6 max len: 148 min len: 113 avg len: 130.58823529411765 num_loss_counted_tokens: 793
total tokens: 2472 num samples: 8 num padding tokens: 152 - rank: 2 max len: 309 min len: 274 avg len: 290.0 num_loss_counted_tokens: 856
total tokens: 2028 num samples: 4 num padding tokens: 230 - rank: 0 max len: 507 min len: 384 avg len: 449.5 num_loss_counted_tokens: 836
total tokens: 2016 num samples: 18 num padding tokens: 221 - rank: 7 max len: 112 min len: 80 avg len: 99.72222222222223 num_loss_counted_tokens: 428
Per-token loss scaled by world size: 0.0005695175495930016
Per-token loss scaled by world size: 0.0015690315049141645
Per-token loss scaled by world size: 0.0012052420061081648
Per-token loss scaled by world size: 0.0013606504071503878
Per-token loss scaled by world size: 0.001335320994257927
Per-token loss scaled by world size: 0.0009560829494148493
Per-token loss scaled by world size: 0.0011460825335234404
Epoch: 0, Step: 177, Rank: 3, loss = 0.8616697788238525
Epoch: 0, Step: 177, Rank: 7, loss = 0.5132777094841003
Epoch: 0, Step: 177, Rank: 0, loss = 1.4140896797180176
Epoch: 0, Step: 177, Rank: 5, loss = 1.0862243175506592
Epoch: 0, Step: 177, Rank: 6, loss = 1.2262861728668213
Epoch: 0, Step: 177, Rank: 4, loss = 1.2034580707550049
Epoch: 0, Step: 177, Rank: 1, loss = 1.0329068899154663
Per-token loss scaled by world size: 0.0013108010170981288
Epoch: 0, Step: 177, Rank: 2, loss = 1.1813594102859497
[2024-06-27 16:45:14,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=177, skipped=0, lr=[9.194805194805195e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:15,071] [INFO] [timer.py:260:stop] epoch=0/micro_step=177/global_step=177, RunningAvgSamplesPerSec=95.51372495975795, CurrSamplesPerSec=95.60695487905038, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 95.51551139238525 samples/s, lr: 9.194805194805195e-06, loss: 1.4140896797180176 cuda_mem_allocated: 22.24425172805786 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7210.0 batch_size: 78.0 total loss: 1.0649089813232422 | |
Epoch 0: 83% 177/213 [04:04<00:38, 1.06s/it]
total tokens: 2457 num samples: 13 num padding tokens: 241 - rank: 5 max len: 189 min len: 154 avg len: 170.46153846153845 num_loss_counted_tokens: 936
total tokens: 2448 num samples: 16 num padding tokens: 359 - rank: 6 max len: 153 min len: 116 avg len: 130.5625 num_loss_counted_tokens: 728 | |
total tokens: 2260 num samples: 5 num padding tokens: 127 - rank: 0 max len: 452 min len: 399 avg len: 426.6 num_loss_counted_tokens: 923 | |
total tokens: 2304 num samples: 6 num padding tokens: 225 - rank: 1 max len: 384 min len: 318 avg len: 346.5 num_loss_counted_tokens: 912 | |
total tokens: 2512 num samples: 8 num padding tokens: 271 - rank: 2 max len: 314 min len: 249 avg len: 280.125 num_loss_counted_tokens: 794 | |
total tokens: 2508 num samples: 22 num padding tokens: 289 - rank: 7 max len: 114 min len: 81 avg len: 100.86363636363636 num_loss_counted_tokens: 545 | |
total tokens: 2420 num samples: 11 num padding tokens: 137 - rank: 4 max len: 220 min len: 192 avg len: 207.54545454545453 num_loss_counted_tokens: 859 | |
total tokens: 2480 num samples: 10 num padding tokens: 138 - rank: 3 max len: 248 min len: 222 avg len: 234.2 num_loss_counted_tokens: 1062 | |
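Each "total tokens / num padding tokens" line above is consistent with every sample in a rank's micro-batch being padded to that batch's longest sequence. A small sketch of that bookkeeping; this is an assumed reconstruction, and the example sequence lengths are invented to match one logged line, not taken from the data:

```python
def batch_padding_stats(seq_lens):
    """Reproduce a per-rank batch stats line, assuming each sample is
    padded to the longest sequence in the micro-batch."""
    n = len(seq_lens)
    max_len = max(seq_lens)
    total_tokens = n * max_len  # padded batch is n x max_len
    return {
        "total tokens": total_tokens,
        "num samples": n,
        "num padding tokens": total_tokens - sum(seq_lens),
        "max len": max_len,
        "min len": min(seq_lens),
        "avg len": sum(seq_lens) / n,
    }

# Illustrative lengths chosen to match the rank-1 line of step 176
# (6 samples, max 383, min 326, avg 356.5); the individual values
# in between are invented.
print(batch_padding_stats([383, 350, 355, 360, 365, 326]))
```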
Per-token loss scaled by world size: 0.001116479979828
Per-token loss scaled by world size: 0.0017378719057887793
Per-token loss scaled by world size: 0.0009340020478703082
Per-token loss scaled by world size: 0.0020257076248526573
Per-token loss scaled by world size: 0.000724754820112139
Per-token loss scaled by world size: 0.0007250454509630799
Per-token loss scaled by world size: 0.0008427196880802512
Epoch: 0, Step: 178, Rank: 1, loss = 1.041117548942566
Epoch: 0, Step: 178, Rank: 2, loss = 1.6205655336380005
Epoch: 0, Step: 178, Rank: 5, loss = 0.8709568977355957
Epoch: 0, Step: 178, Rank: 0, loss = 1.8889724016189575
Epoch: 0, Step: 178, Rank: 7, loss = 0.6758338809013367
Epoch: 0, Step: 178, Rank: 3, loss = 0.6761049032211304
Epoch: 0, Step: 178, Rank: 6, loss = 0.7858361005783081
Per-token loss scaled by world size: 0.0011364814126864076
Epoch: 0, Step: 178, Rank: 4, loss = 1.0597689151763916
[2024-06-27 16:45:16,053] [INFO] [logging.py:96:log_dist] [Rank 0] step=178, skipped=0, lr=[9.246753246753248e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:16,126] [INFO] [timer.py:260:stop] epoch=0/micro_step=178/global_step=178, RunningAvgSamplesPerSec=95.50831565467585, CurrSamplesPerSec=94.57103023242705, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.48330815910239 samples/s, lr: 9.246753246753248e-06, loss: 1.8889724016189575 cuda_mem_allocated: 22.264286518096924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7460.0 batch_size: 85.0 total loss: 1.0773944854736328 | |
Epoch 0: 84% 178/213 [04:05<00:37, 1.06s/it]
total tokens: 2392 num samples: 13 num padding tokens: 197 - rank: 5 max len: 184 min len: 159 avg len: 168.84615384615384 num_loss_counted_tokens: 687
total tokens: 2448 num samples: 16 num padding tokens: 137 - rank: 6 max len: 153 min len: 126 avg len: 144.4375 num_loss_counted_tokens: 809 | |
total tokens: 2421 num samples: 9 num padding tokens: 203 - rank: 3 max len: 269 min len: 226 avg len: 246.44444444444446 num_loss_counted_tokens: 777 | |
total tokens: 2288 num samples: 8 num padding tokens: 75 - rank: 2 max len: 286 min len: 271 avg len: 276.625 num_loss_counted_tokens: 1355 | |
total tokens: 2220 num samples: 6 num padding tokens: 225 - rank: 1 max len: 370 min len: 291 avg len: 332.5 num_loss_counted_tokens: 555 | |
total tokens: 2453 num samples: 11 num padding tokens: 247 - rank: 4 max len: 223 min len: 186 avg len: 200.54545454545453 num_loss_counted_tokens: 881 | |
total tokens: 1386 num samples: 11 num padding tokens: 109 - rank: 7 max len: 126 min len: 99 avg len: 116.0909090909091 num_loss_counted_tokens: 410 | |
total tokens: 2247 num samples: 3 num padding tokens: 393 - rank: 0 max len: 749 min len: 410 avg len: 618.0 num_loss_counted_tokens: 1517 | |
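The learning rate grows by a constant 5.1948e-08 per step here, which is consistent with a linear warmup schedule. A sketch follows; note that both the peak rate (2e-05) and the warmup length (385 steps) are inferred from the spacing of the logged values, not stated anywhere in the log:

```python
def warmup_lr(step, peak_lr=2e-05, warmup_steps=385):
    """Linear warmup: lr grows proportionally to the step index until
    it reaches peak_lr. peak_lr and warmup_steps are guesses that
    happen to reproduce the logged values."""
    return peak_lr * min(step, warmup_steps) / warmup_steps

print(warmup_lr(176))  # ≈ 9.142857e-06, as in the step-176 log line
```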
Per-token loss scaled by world size: 0.0014678981387987733
Per-token loss scaled by world size: 0.0016708056209608912
Per-token loss scaled by world size: 0.001159891253337264
Per-token loss scaled by world size: 0.0014186871703714132
Per-token loss scaled by world size: 0.0013158961664885283
Per-token loss scaled by world size: 0.000993027351796627
Per-token loss scaled by world size: 0.0011547444155439734
Epoch: 0, Step: 179, Rank: 2, loss = 1.334135890007019
Epoch: 0, Step: 179, Rank: 4, loss = 1.1959850788116455
Epoch: 0, Step: 179, Rank: 5, loss = 1.0541961193084717
Epoch: 0, Step: 179, Rank: 1, loss = 1.5185534954071045
Epoch: 0, Step: 179, Rank: 0, loss = 1.2894092798233032
Epoch: 0, Step: 179, Rank: 6, loss = 0.9025377035140991
Epoch: 0, Step: 179, Rank: 3, loss = 1.049518346786499
Per-token loss scaled by world size: 0.0008294759318232536
Epoch: 0, Step: 179, Rank: 7, loss = 0.7538899183273315
[2024-06-27 16:45:17,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=179, skipped=0, lr=[9.298701298701299e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:17,186] [INFO] [timer.py:260:stop] epoch=0/micro_step=179/global_step=179, RunningAvgSamplesPerSec=95.50820630067743, CurrSamplesPerSec=95.4889638966114, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.4017691776416 samples/s, lr: 9.298701298701299e-06, loss: 1.2894092798233032 cuda_mem_allocated: 22.26786518096924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7271.0 batch_size: 87.0 total loss: 1.1372783184051514 | |
Epoch 0: 84% 179/213 [04:06<00:36, 1.06s/it]
total tokens: 2298 num samples: 6 num padding tokens: 274 - rank: 2 max len: 383 min len: 307 avg len: 337.3333333333333 num_loss_counted_tokens: 1149
total tokens: 2380 num samples: 14 num padding tokens: 172 - rank: 6 max len: 170 min len: 142 avg len: 157.71428571428572 num_loss_counted_tokens: 835 | |
total tokens: 2432 num samples: 8 num padding tokens: 318 - rank: 3 max len: 304 min len: 227 avg len: 264.25 num_loss_counted_tokens: 681 | |
total tokens: 2332 num samples: 11 num padding tokens: 166 - rank: 4 max len: 212 min len: 188 avg len: 196.9090909090909 num_loss_counted_tokens: 991 | |
total tokens: 2156 num samples: 4 num padding tokens: 286 - rank: 1 max len: 539 min len: 406 avg len: 467.5 num_loss_counted_tokens: 1104 | |
total tokens: 2414 num samples: 17 num padding tokens: 518 - rank: 7 max len: 142 min len: 85 avg len: 111.52941176470588 num_loss_counted_tokens: 484 | |
total tokens: 2405 num samples: 13 num padding tokens: 93 - rank: 5 max len: 185 min len: 171 avg len: 177.84615384615384 num_loss_counted_tokens: 926 | |
total tokens: 2214 num samples: 2 num padding tokens: 436 - rank: 0 max len: 1107 min len: 671 avg len: 889.0 num_loss_counted_tokens: 1035 | |
Per-token loss scaled by world size: 0.0008883044356480241
Per-token loss scaled by world size: 0.0005795389297418296
Per-token loss scaled by world size: 0.0023983160499483347
Per-token loss scaled by world size: 0.001471399562433362
Per-token loss scaled by world size: 0.000853748875670135
Per-token loss scaled by world size: 0.000522683490999043
Per-token loss scaled by world size: 0.0010825896169990301
Epoch: 0, Step: 180, Rank: 7, loss = 0.5304954648017883
Epoch: 0, Step: 180, Rank: 0, loss = 0.8131316900253296
Epoch: 0, Step: 180, Rank: 1, loss = 2.1953585147857666
Epoch: 0, Step: 180, Rank: 2, loss = 1.3468823432922363
Epoch: 0, Step: 180, Rank: 4, loss = 0.4784514009952545
Epoch: 0, Step: 180, Rank: 6, loss = 0.7815003991127014
Epoch: 0, Step: 180, Rank: 3, loss = 0.9909754395484924
Per-token loss scaled by world size: 0.0009467682102695107
Epoch: 0, Step: 180, Rank: 5, loss = 0.8666479587554932
[2024-06-27 16:45:18,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=0, lr=[9.350649350649352e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:18,246] [INFO] [timer.py:260:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=95.5080060315496, CurrSamplesPerSec=95.47257162164566, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.38122673524238 samples/s, lr: 9.350649350649352e-06, loss: 0.8131316900253296 cuda_mem_allocated: 22.29064416885376 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7323.0 batch_size: 86.0 total loss: 1.0004303455352783 | |
Epoch 0: 85% 180/213 [04:07<00:34, 1.06s/it]
total tokens: 2519 num samples: 11 num padding tokens: 258 - rank: 3 max len: 229 min len: 178 avg len: 205.54545454545453 num_loss_counted_tokens: 632
total tokens: 2264 num samples: 8 num padding tokens: 214 - rank: 2 max len: 283 min len: 233 avg len: 256.25 num_loss_counted_tokens: 859 | |
total tokens: 2385 num samples: 15 num padding tokens: 200 - rank: 5 max len: 159 min len: 136 avg len: 145.66666666666666 num_loss_counted_tokens: 888 | |
total tokens: 2464 num samples: 14 num padding tokens: 127 - rank: 4 max len: 176 min len: 159 avg len: 166.92857142857142 num_loss_counted_tokens: 1062 | |
total tokens: 2310 num samples: 7 num padding tokens: 168 - rank: 1 max len: 330 min len: 284 avg len: 306.0 num_loss_counted_tokens: 898 | |
total tokens: 2430 num samples: 18 num padding tokens: 221 - rank: 6 max len: 135 min len: 113 avg len: 122.72222222222223 num_loss_counted_tokens: 713 | |
total tokens: 1904 num samples: 17 num padding tokens: 260 - rank: 7 max len: 112 min len: 81 avg len: 96.70588235294117 num_loss_counted_tokens: 374 | |
total tokens: 2185 num samples: 5 num padding tokens: 281 - rank: 0 max len: 437 min len: 344 avg len: 380.8 num_loss_counted_tokens: 1006 | |
Per-token loss scaled by world size: 0.0007339948206208646
Per-token loss scaled by world size: 0.0015326475258916616
Per-token loss scaled by world size: 0.0011294849682599306
Per-token loss scaled by world size: 0.0009089059312827885
Per-token loss scaled by world size: 0.0008310244884341955
Per-token loss scaled by world size: 0.0014194620307534933
Per-token loss scaled by world size: 0.0008979425183497369
Epoch: 0, Step: 181, Rank: 3, loss = 1.527858018875122
Epoch: 0, Step: 181, Rank: 4, loss = 0.7317010760307312
Epoch: 0, Step: 181, Rank: 6, loss = 1.12595534324646
Epoch: 0, Step: 181, Rank: 5, loss = 0.8284275531768799
Epoch: 0, Step: 181, Rank: 7, loss = 0.9060655832290649
Epoch: 0, Step: 181, Rank: 2, loss = 0.8951364755630493
Epoch: 0, Step: 181, Rank: 1, loss = 1.4150261878967285
Per-token loss scaled by world size: 0.0008510663756169379
Epoch: 0, Step: 181, Rank: 0, loss = 0.8484067916870117
[2024-06-27 16:45:19,235] [INFO] [logging.py:96:log_dist] [Rank 0] step=181, skipped=0, lr=[9.402597402597403e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:19,308] [INFO] [timer.py:260:stop] epoch=0/micro_step=181/global_step=181, RunningAvgSamplesPerSec=95.50664189902065, CurrSamplesPerSec=95.26444552698341, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.17550043858623 samples/s, lr: 9.402597402597403e-06, loss: 0.8484067916870117 cuda_mem_allocated: 22.314138412475586 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7975.0 batch_size: 82.0 total loss: 1.0348222255706787 | |
Epoch 0: 85% 181/213 [04:08<00:33, 1.06s/it]
total tokens: 2367 num samples: 9 num padding tokens: 190 - rank: 4 max len: 263 min len: 226 avg len: 241.88888888888889 num_loss_counted_tokens: 1090
total tokens: 2354 num samples: 11 num padding tokens: 238 - rank: 5 max len: 214 min len: 170 avg len: 192.36363636363637 num_loss_counted_tokens: 979 | |
total tokens: 2520 num samples: 15 num padding tokens: 386 - rank: 6 max len: 168 min len: 127 avg len: 142.26666666666668 num_loss_counted_tokens: 730 | |
total tokens: 2440 num samples: 8 num padding tokens: 151 - rank: 3 max len: 305 min len: 265 avg len: 286.125 num_loss_counted_tokens: 1294 | |
total tokens: 2499 num samples: 7 num padding tokens: 96 - rank: 2 max len: 357 min len: 313 avg len: 343.2857142857143 num_loss_counted_tokens: 1358 | |
total tokens: 2392 num samples: 4 num padding tokens: 673 - rank: 1 max len: 598 min len: 357 avg len: 429.75 num_loss_counted_tokens: 564 | |
total tokens: 2268 num samples: 18 num padding tokens: 354 - rank: 7 max len: 126 min len: 87 avg len: 106.33333333333333 num_loss_counted_tokens: 468 | |
total tokens: 2520 num samples: 2 num padding tokens: 657 - rank: 0 max len: 1260 min len: 603 avg len: 931.5 num_loss_counted_tokens: 517 | |
Per-token loss scaled by world size: 0.0007794813718646765 | |
Per-token loss scaled by world size: 0.0011284893844276667
Per-token loss scaled by world size: 0.00098012899979949
Per-token loss scaled by world size: 0.0010211406042799354
Per-token loss scaled by world size: 0.0005723547074012458
Per-token loss scaled by world size: 0.0007053760928101838
Per-token loss scaled by world size: 0.0015353828202933073
Epoch: 0, Step: 182, Rank: 6, loss = 0.7433329224586487
Epoch: 0, Step: 182, Rank: 4, loss = 1.076155662536621
Epoch: 0, Step: 182, Rank: 3, loss = 0.9346755146980286
Epoch: 0, Step: 182, Rank: 1, loss = 0.6726642847061157
Epoch: 0, Step: 182, Rank: 5, loss = 0.9737852215766907
Epoch: 0, Step: 182, Rank: 2, loss = 1.4641793966293335 | |
Epoch: 0, Step: 182, Rank: 7, loss = 0.5458117723464966 | |
Per-token loss scaled by world size: 0.0014747710665687919 | |
Epoch: 0, Step: 182, Rank: 0, loss = 1.4063785076141357 | |
[2024-06-27 16:45:20,292] [INFO] [logging.py:96:log_dist] [Rank 0] step=182, skipped=0, lr=[9.454545454545456e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:20,365] [INFO] [timer.py:260:stop] epoch=0/micro_step=182/global_step=182, RunningAvgSamplesPerSec=95.50775338150443, CurrSamplesPerSec=95.70712638683914, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 95.62075917034181 samples/s, lr: 9.454545454545456e-06, loss: 1.4063785076141357 cuda_mem_allocated: 22.315807819366455 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7629.0 batch_size: 75.0 total loss: 0.977122962474823 | |
Epoch 0: 85% 182/213 [04:09<00:32, 1.06s/it]
total tokens: 2288 num samples: 8 num padding tokens: 146 - rank: 3 max len: 286 min len: 251 avg len: 267.75 num_loss_counted_tokens: 1108
total tokens: 2520 num samples: 15 num padding tokens: 339 - rank: 6 max len: 168 min len: 129 avg len: 145.4 num_loss_counted_tokens: 760 | |
total tokens: 2534 num samples: 7 num padding tokens: 200 - rank: 2 max len: 362 min len: 294 avg len: 333.42857142857144 num_loss_counted_tokens: 1246 | |
total tokens: 2460 num samples: 10 num padding tokens: 223 - rank: 4 max len: 246 min len: 198 avg len: 223.7 num_loss_counted_tokens: 927 | |
total tokens: 2490 num samples: 6 num padding tokens: 134 - rank: 1 max len: 415 min len: 367 avg len: 392.6666666666667 num_loss_counted_tokens: 1772 | |
total tokens: 2440 num samples: 4 num padding tokens: 354 - rank: 0 max len: 610 min len: 443 avg len: 521.5 num_loss_counted_tokens: 1156 | |
total tokens: 2352 num samples: 12 num padding tokens: 141 - rank: 5 max len: 196 min len: 168 avg len: 184.25 num_loss_counted_tokens: 829 | |
total tokens: 1736 num samples: 14 num padding tokens: 264 - rank: 7 max len: 124 min len: 83 avg len: 105.14285714285714 num_loss_counted_tokens: 380 | |
Per-token loss scaled by world size: 0.0012613165890797973
Per-token loss scaled by world size: 0.0012109667295590043
Per-token loss scaled by world size: 0.0010508657433092594
Per-token loss scaled by world size: 0.0014420837396755815
Per-token loss scaled by world size: 0.0003740513348020613
Per-token loss scaled by world size: 0.0011507897870615125
Per-token loss scaled by world size: 0.0012479756260290742
Epoch: 0, Step: 183, Rank: 4, loss = 1.2163821458816528
Epoch: 0, Step: 183, Rank: 1, loss = 1.1678260564804077
Epoch: 0, Step: 183, Rank: 6, loss = 1.0134286880493164
Epoch: 0, Step: 183, Rank: 0, loss = 1.3907095193862915
Epoch: 0, Step: 183, Rank: 7, loss = 0.3607257604598999
Per-token loss scaled by world size: 0.000893204880412668
Epoch: 0, Step: 183, Rank: 3, loss = 1.109792947769165
Epoch: 0, Step: 183, Rank: 3, loss = 1.109792947769165 | |
Epoch: 0, Step: 183, Rank: 2, loss = 1.2035164833068848 | |
Epoch: 0, Step: 183, Rank: 5, loss = 0.8613844513893127 | |
[2024-06-27 16:45:21,353] [INFO] [logging.py:96:log_dist] [Rank 0] step=183, skipped=0, lr=[9.506493506493507e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:21,426] [INFO] [timer.py:260:stop] epoch=0/micro_step=183/global_step=183, RunningAvgSamplesPerSec=95.50590159509746, CurrSamplesPerSec=95.17374572678816, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.07298095178953 samples/s, lr: 9.506493506493507e-06, loss: 1.3907095193862915 cuda_mem_allocated: 22.273946285247803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7715.0 batch_size: 81.0 total loss: 1.040470838546753 | |
Epoch 0: 86% 183/213 [04:10<00:31, 1.06s/it]
total tokens: 2430 num samples: 15 num padding tokens: 258 - rank: 6 max len: 162 min len: 125 avg len: 144.8 num_loss_counted_tokens: 877
total tokens: 2472 num samples: 12 num padding tokens: 257 - rank: 5 max len: 206 min len: 163 avg len: 184.58333333333334 num_loss_counted_tokens: 1013 | |
total tokens: 2472 num samples: 6 num padding tokens: 240 - rank: 1 max len: 412 min len: 347 avg len: 372.0 num_loss_counted_tokens: 1153 | |
total tokens: 2376 num samples: 8 num padding tokens: 221 - rank: 3 max len: 297 min len: 255 avg len: 269.375 num_loss_counted_tokens: 812 | |
total tokens: 2422 num samples: 7 num padding tokens: 126 - rank: 2 max len: 346 min len: 305 avg len: 328.0 num_loss_counted_tokens: 953 | |
total tokens: 1888 num samples: 16 num padding tokens: 209 - rank: 7 max len: 118 min len: 80 avg len: 104.9375 num_loss_counted_tokens: 498 | |
total tokens: 2470 num samples: 10 num padding tokens: 184 - rank: 4 max len: 247 min len: 214 avg len: 228.6 num_loss_counted_tokens: 995 | |
total tokens: 1965 num samples: 3 num padding tokens: 436 - rank: 0 max len: 655 min len: 435 avg len: 509.6666666666667 num_loss_counted_tokens: 1243 | |
Per-token loss scaled by world size: 0.0013108783168718219
Per-token loss scaled by world size: 0.0013408289523795247
Per-token loss scaled by world size: 0.0006426859763450921
Per-token loss scaled by world size: 0.0008282589260488749
Per-token loss scaled by world size: 0.0011900000972673297
Per-token loss scaled by world size: 0.0010997983627021313
Per-token loss scaled by world size: 0.0010241015115752816
Epoch: 0, Step: 184, Rank: 4, loss = 0.7236912250518799 | |
Epoch: 0, Step: 184, Rank: 3, loss = 1.145379900932312 | |
Epoch: 0, Step: 184, Rank: 6, loss = 1.0397626161575317 | |
Epoch: 0, Step: 184, Rank: 7, loss = 0.5615468621253967
Epoch: 0, Step: 184, Rank: 5, loss = 0.8948086500167847
Epoch: 0, Step: 184, Rank: 1, loss = 0.9609488248825073
Epoch: 0, Step: 184, Rank: 0, loss = 1.1715493202209473
Per-token loss scaled by world size: 0.0025872536934912205 | |
Epoch: 0, Step: 184, Rank: 2, loss = 2.260612964630127 | |
[2024-06-27 16:45:22,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=184, skipped=0, lr=[9.558441558441558e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:22,482] [INFO] [timer.py:260:stop] epoch=0/micro_step=184/global_step=184, RunningAvgSamplesPerSec=95.50760673469927, CurrSamplesPerSec=95.8172431123498, MemAllocated=22.17GB, MaxMemAllocated=28.61GB | |
throughput: 95.72796877388573 samples/s, lr: 9.558441558441558e-06, loss: 1.1715493202209473 cuda_mem_allocated: 22.16720724105835 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6990.0 batch_size: 69.0 total loss: 1.0947874784469604 | |
Epoch 0: 86% 184/213 [04:11<00:30, 1.06s/it]
total tokens: 2470 num samples: 13 num padding tokens: 313 - rank: 5 max len: 190 min len: 153 avg len: 165.92307692307693 num_loss_counted_tokens: 909
total tokens: 2448 num samples: 16 num padding tokens: 258 - rank: 6 max len: 153 min len: 123 avg len: 136.875 num_loss_counted_tokens: 898 | |
total tokens: 2530 num samples: 11 num padding tokens: 250 - rank: 4 max len: 230 min len: 191 avg len: 207.27272727272728 num_loss_counted_tokens: 940 | |
total tokens: 2200 num samples: 5 num padding tokens: 168 - rank: 1 max len: 440 min len: 368 avg len: 406.4 num_loss_counted_tokens: 1329 | |
total tokens: 2520 num samples: 9 num padding tokens: 231 - rank: 3 max len: 280 min len: 234 avg len: 254.33333333333334 num_loss_counted_tokens: 1066 | |
total tokens: 2443 num samples: 7 num padding tokens: 237 - rank: 2 max len: 349 min len: 289 avg len: 315.14285714285717 num_loss_counted_tokens: 1248 | |
total tokens: 2420 num samples: 20 num padding tokens: 434 - rank: 7 max len: 121 min len: 78 avg len: 99.3 num_loss_counted_tokens: 421 | |
total tokens: 2379 num samples: 3 num padding tokens: 541 - rank: 0 max len: 793 min len: 474 avg len: 612.6666666666666 num_loss_counted_tokens: 108 | |
Per-token loss scaled by world size: 0.0011308403918519616
Per-token loss scaled by world size: 0.001636433182284236
Per-token loss scaled by world size: 0.001152449636720121
Per-token loss scaled by world size: 0.0018354151397943497
Per-token loss scaled by world size: 0.0017969459295272827
Per-token loss scaled by world size: 0.0006493672844953835
Per-token loss scaled by world size: 1.4561818716174457e-05
Epoch: 0, Step: 185, Rank: 4, loss = 1.2735540866851807
Epoch: 0, Step: 185, Rank: 6, loss = 0.88007652759552
Epoch: 0, Step: 185, Rank: 3, loss = 0.8968939185142517
Epoch: 0, Step: 185, Rank: 2, loss = 1.3984731435775757
Epoch: 0, Step: 185, Rank: 5, loss = 1.428411841392517
Epoch: 0, Step: 185, Rank: 1, loss = 0.5053700804710388
Epoch: 0, Step: 185, Rank: 0, loss = 0.011332735419273376 | |
Per-token loss scaled by world size: 0.0012430856004357338 | |
Epoch: 0, Step: 185, Rank: 7, loss = 0.967431366443634 | |
[2024-06-27 16:45:23,463] [INFO] [logging.py:96:log_dist] [Rank 0] step=185, skipped=0, lr=[9.610389610389611e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:23,536] [INFO] [timer.py:260:stop] epoch=0/micro_step=185/global_step=185, RunningAvgSamplesPerSec=95.5105045409722, CurrSamplesPerSec=96.04084989457415, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.94930656829945 samples/s, lr: 9.610389610389611e-06, loss: 0.011332735419273376 cuda_mem_allocated: 22.283011436462402 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6226.0 batch_size: 72.0 total loss: 0.9201930165290833 | |
Epoch 0: 87% 185/213 [04:12<00:29, 1.06s/it]
total tokens: 2532 num samples: 12 num padding tokens: 160 - rank: 4 max len: 211 min len: 184 avg len: 197.66666666666666 num_loss_counted_tokens: 848
total tokens: 2292 num samples: 6 num padding tokens: 163 - rank: 1 max len: 382 min len: 314 avg len: 354.8333333333333 num_loss_counted_tokens: 1043 | |
total tokens: 2448 num samples: 8 num padding tokens: 85 - rank: 2 max len: 306 min len: 265 avg len: 295.375 num_loss_counted_tokens: 753 | |
total tokens: 2385 num samples: 9 num padding tokens: 251 - rank: 3 max len: 265 min len: 215 avg len: 237.11111111111111 num_loss_counted_tokens: 643 | |
total tokens: 2448 num samples: 16 num padding tokens: 251 - rank: 6 max len: 153 min len: 114 avg len: 137.3125 num_loss_counted_tokens: 738 | |
total tokens: 2379 num samples: 13 num padding tokens: 180 - rank: 5 max len: 183 min len: 153 avg len: 169.15384615384616 num_loss_counted_tokens: 818 | |
total tokens: 2128 num samples: 19 num padding tokens: 261 - rank: 7 max len: 112 min len: 85 avg len: 98.26315789473684 num_loss_counted_tokens: 418 | |
total tokens: 1923 num samples: 3 num padding tokens: 325 - rank: 0 max len: 641 min len: 449 avg len: 532.6666666666666 num_loss_counted_tokens: 97 | |
Per-token loss scaled by world size: 0.001147231669165194
Per-token loss scaled by world size: 0.0009057666757144034
Per-token loss scaled by world size: 0.0006410079076886177
Per-token loss scaled by world size: 0.001685183378867805
Per-token loss scaled by world size: 0.00046050758101046085
Per-token loss scaled by world size: 0.0013913216535001993
Per-token loss scaled by world size: 0.0011303906794637442
Epoch: 0, Step: 186, Rank: 5, loss = 1.0979007482528687
Epoch: 0, Step: 186, Rank: 6, loss = 0.8668187260627747
Epoch: 0, Step: 186, Rank: 4, loss = 0.6134445667266846
Epoch: 0, Step: 186, Rank: 3, loss = 1.6127204895019531
Epoch: 0, Step: 186, Rank: 1, loss = 1.3314948081970215
Epoch: 0, Step: 186, Rank: 0, loss = 0.4407057464122772
Epoch: 0, Step: 186, Rank: 2, loss = 1.0817838907241821 | |
Per-token loss scaled by world size: 0.0005094701773487031 | |
Epoch: 0, Step: 186, Rank: 7, loss = 0.48756298422813416 | |
[2024-06-27 16:45:24,522] [INFO] [logging.py:96:log_dist] [Rank 0] step=186, skipped=0, lr=[9.662337662337662e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:24,595] [INFO] [timer.py:260:stop] epoch=0/micro_step=186/global_step=186, RunningAvgSamplesPerSec=95.51138441331445, CurrSamplesPerSec=95.6726744465095, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.58278426781294 samples/s, lr: 9.662337662337662e-06, loss: 0.4407057464122772 cuda_mem_allocated: 22.264286518096924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7656.0 batch_size: 77.0 total loss: 0.9415539503097534 | |
Epoch 0: 87% 186/213 [04:13<00:28, 1.06s/it]
total tokens: 2512 num samples: 16 num padding tokens: 179 - rank: 6 max len: 157 min len: 131 avg len: 145.8125 num_loss_counted_tokens: 837
total tokens: 2506 num samples: 14 num padding tokens: 160 - rank: 5 max len: 179 min len: 158 avg len: 167.57142857142858 num_loss_counted_tokens: 1018 | |
total tokens: 2367 num samples: 9 num padding tokens: 261 - rank: 3 max len: 263 min len: 209 avg len: 234.0 num_loss_counted_tokens: 892 | |
total tokens: 2528 num samples: 8 num padding tokens: 134 - rank: 2 max len: 316 min len: 277 avg len: 299.25 num_loss_counted_tokens: 953 | |
total tokens: 2466 num samples: 6 num padding tokens: 201 - rank: 1 max len: 411 min len: 323 avg len: 377.5 num_loss_counted_tokens: 1398 | |
total tokens: 2508 num samples: 12 num padding tokens: 141 - rank: 4 max len: 209 min len: 184 avg len: 197.25 num_loss_counted_tokens: 959 | |
total tokens: 2280 num samples: 19 num padding tokens: 364 - rank: 7 max len: 120 min len: 78 avg len: 100.84210526315789 num_loss_counted_tokens: 469 | |
total tokens: 2236 num samples: 4 num padding tokens: 266 - rank: 0 max len: 559 min len: 444 avg len: 492.5 num_loss_counted_tokens: 1441 | |
Per-token loss scaled by world size: 0.0012301565147936344
Per-token loss scaled by world size: 0.0013147848658263683
Per-token loss scaled by world size: 0.0007165342685766518
Per-token loss scaled by world size: 0.0020608806516975164
Per-token loss scaled by world size: 0.0014867944410070777
Per-token loss scaled by world size: 0.0007552782772108912
Per-token loss scaled by world size: 0.000965710380114615
Epoch: 0, Step: 187, Rank: 3, loss = 1.139432430267334
Epoch: 0, Step: 187, Rank: 2, loss = 1.2178194522857666
Epoch: 0, Step: 187, Rank: 6, loss = 0.6636898517608643
Per-token loss scaled by world size: 0.0012501401361078024
Epoch: 0, Step: 187, Rank: 7, loss = 0.6995764970779419
Epoch: 0, Step: 187, Rank: 1, loss = 1.908890724182129
Epoch: 0, Step: 187, Rank: 5, loss = 1.377143383026123
Epoch: 0, Step: 187, Rank: 0, loss = 0.8944892287254333
Epoch: 0, Step: 187, Rank: 4, loss = 1.157942295074463 | |
[2024-06-27 16:45:25,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=187, skipped=0, lr=[9.714285714285715e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:25,661] [INFO] [timer.py:260:stop] epoch=0/micro_step=187/global_step=187, RunningAvgSamplesPerSec=95.5077649092269, CurrSamplesPerSec=94.84641291889088, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 94.75285009330487 samples/s, lr: 9.714285714285715e-06, loss: 0.8944892287254333 cuda_mem_allocated: 22.267388343811035 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7410.0 batch_size: 87.0 total loss: 1.1323729753494263 | |
Epoch 0: 88% 187/213 [04:14<00:27, 1.06s/it]
total tokens: 2508 num samples: 11 num padding tokens: 190 - rank: 4 max len: 228 min len: 189 avg len: 210.72727272727272 num_loss_counted_tokens: 873
total tokens: 2466 num samples: 9 num padding tokens: 165 - rank: 3 max len: 274 min len: 240 avg len: 255.66666666666666 num_loss_counted_tokens: 919 | |
total tokens: 2226 num samples: 7 num padding tokens: 146 - rank: 2 max len: 318 min len: 284 avg len: 297.14285714285717 num_loss_counted_tokens: 873 | |
total tokens: 2457 num samples: 13 num padding tokens: 238 - rank: 5 max len: 189 min len: 159 avg len: 170.69230769230768 num_loss_counted_tokens: 950 | |
total tokens: 2412 num samples: 6 num padding tokens: 256 - rank: 1 max len: 402 min len: 322 avg len: 359.3333333333333 num_loss_counted_tokens: 1154 | |
total tokens: 2416 num samples: 16 num padding tokens: 192 - rank: 6 max len: 151 min len: 125 avg len: 139.0 num_loss_counted_tokens: 769 | |
total tokens: 2190 num samples: 3 num padding tokens: 535 - rank: 0 max len: 730 min len: 409 avg len: 551.6666666666666 num_loss_counted_tokens: 594 | |
total tokens: 1860 num samples: 15 num padding tokens: 297 - rank: 7 max len: 124 min len: 87 avg len: 104.2 num_loss_counted_tokens: 400 | |
Per-token loss scaled by world size: 0.000783691939432174
Per-token loss scaled by world size: 0.00196462101303041
Per-token loss scaled by world size: 0.0009586469968780875
Per-token loss scaled by world size: 0.0011195316910743713
Per-token loss scaled by world size: 0.0013131442246958613
Per-token loss scaled by world size: 0.00043911824468523264
Per-token loss scaled by world size: 0.0024799967650324106
Epoch: 0, Step: 188, Rank: 5, loss = 1.0450828075408936
Epoch: 0, Step: 188, Rank: 4, loss = 0.7315764427185059
Epoch: 0, Step: 188, Rank: 6, loss = 0.8948969841003418
Epoch: 0, Step: 188, Rank: 2, loss = 1.8339736461639404
Epoch: 0, Step: 188, Rank: 1, loss = 2.315077066421509
Epoch: 0, Step: 188, Rank: 3, loss = 1.2258201837539673
Epoch: 0, Step: 188, Rank: 0, loss = 0.40991687774658203
Per-token loss scaled by world size: 0.0005501627456396818
Epoch: 0, Step: 188, Rank: 7, loss = 0.5135769248008728 | |
[2024-06-27 16:45:26,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=188, skipped=0, lr=[9.766233766233766e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:26,720] [INFO] [timer.py:260:stop] epoch=0/micro_step=188/global_step=188, RunningAvgSamplesPerSec=95.50709539103818, CurrSamplesPerSec=95.38339581628732, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.2920860980205 samples/s, lr: 9.766233766233766e-06, loss: 0.40991687774658203 cuda_mem_allocated: 22.26822280883789 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7468.0 batch_size: 80.0 total loss: 1.1212400197982788 | |
Epoch 0: 88% 188/213 [04:15<00:26, 1.06s/it] | |
total tokens: 2475 num samples: 11 num padding tokens: 116 - rank: 4 max len: 225 min len: 195 avg len: 214.45454545454547 num_loss_counted_tokens: 911 | |
total tokens: 2303 num samples: 7 num padding tokens: 169 - rank: 2 max len: 329 min len: 285 avg len: 304.85714285714283 num_loss_counted_tokens: 1069 | |
total tokens: 2140 num samples: 4 num padding tokens: 556 - rank: 1 max len: 535 min len: 332 avg len: 396.0 num_loss_counted_tokens: 847 | |
total tokens: 2529 num samples: 9 num padding tokens: 196 - rank: 3 max len: 281 min len: 226 avg len: 259.22222222222223 num_loss_counted_tokens: 1158 | |
total tokens: 2444 num samples: 13 num padding tokens: 173 - rank: 5 max len: 188 min len: 157 avg len: 174.69230769230768 num_loss_counted_tokens: 915 | |
total tokens: 2496 num samples: 16 num padding tokens: 220 - rank: 6 max len: 156 min len: 127 avg len: 142.25 num_loss_counted_tokens: 953 | |
total tokens: 2400 num samples: 3 num padding tokens: 243 - rank: 0 max len: 800 min len: 614 avg len: 719.0 num_loss_counted_tokens: 662 | |
total tokens: 2520 num samples: 20 num padding tokens: 330 - rank: 7 max len: 126 min len: 87 avg len: 109.5 num_loss_counted_tokens: 620 | |
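The per-rank batch lines above obey two simple relations: each rank's batch is right-padded to its longest sample, so `total tokens = num samples * max len`, and the padding count is the total minus the real token count `num samples * avg len`. A minimal consistency check using the rank-7 line above (an observation about these logged numbers, assuming right-padding to the per-rank max length):

```python
# Values copied verbatim from the rank-7 "total tokens" line above.
num_samples, max_len, avg_len = 20, 126, 109.5

# Padded batch size: every sample is padded out to max_len.
total_tokens = num_samples * max_len
assert total_tokens == 2520

# Padding = padded size minus the real (unpadded) token count.
padding = total_tokens - num_samples * avg_len
assert padding == 330
```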
Per-token loss scaled by world size: 0.001411886652931571 | |
Per-token loss scaled by world size: 0.0006349166505970061 | |
Per-token loss scaled by world size: 0.0010255166562274098 | |
Per-token loss scaled by world size: 0.0009100721217691898 | |
Per-token loss scaled by world size: 0.0005326938698999584 | |
Per-token loss scaled by world size: 0.0012043744791299105 | |
Per-token loss scaled by world size: 0.001199553138576448 | |
Epoch: 0, Step: 189, Rank: 1, loss = 1.1392755508422852 | |
Epoch: 0, Step: 189, Rank: 2, loss = 1.3409394025802612 | |
Epoch: 0, Step: 189, Rank: 3, loss = 1.1438546180725098 | |
Epoch: 0, Step: 189, Rank: 4, loss = 0.5059260129928589 | |
Epoch: 0, Step: 189, Rank: 5, loss = 0.8643410205841064 | |
Epoch: 0, Step: 189, Rank: 6, loss = 0.9739844799041748 | |
Epoch: 0, Step: 189, Rank: 7, loss = 0.6030120849609375 | |
Per-token loss scaled by world size: 0.0015033723320811987 | |
Epoch: 0, Step: 189, Rank: 0, loss = 1.4278278350830078 | |
[2024-06-27 16:45:27,714] [INFO] [logging.py:96:log_dist] [Rank 0] step=189, skipped=0, lr=[9.81818181818182e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:27,787] [INFO] [timer.py:260:stop] epoch=0/micro_step=189/global_step=189, RunningAvgSamplesPerSec=95.50364287488297, CurrSamplesPerSec=94.86578688268519, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.7772943032388 samples/s, lr: 9.81818181818182e-06, loss: 1.4278278350830078 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7598.0 batch_size: 79.0 total loss: 0.9998951554298401 | |
Epoch 0: 89% 189/213 [04:17<00:25, 1.06s/it] | |
total tokens: 2412 num samples: 9 num padding tokens: 153 - rank: 3 max len: 268 min len: 236 avg len: 251.0 num_loss_counted_tokens: 895 | |
total tokens: 2530 num samples: 11 num padding tokens: 145 - rank: 4 max len: 230 min len: 205 avg len: 216.8181818181818 num_loss_counted_tokens: 941 | |
total tokens: 2440 num samples: 8 num padding tokens: 126 - rank: 2 max len: 305 min len: 275 avg len: 289.25 num_loss_counted_tokens: 931 | |
total tokens: 2220 num samples: 6 num padding tokens: 219 - rank: 1 max len: 370 min len: 307 avg len: 333.5 num_loss_counted_tokens: 1011 | |
total tokens: 2480 num samples: 16 num padding tokens: 197 - rank: 6 max len: 155 min len: 134 avg len: 142.6875 num_loss_counted_tokens: 845 | |
total tokens: 2044 num samples: 4 num padding tokens: 168 - rank: 0 max len: 511 min len: 392 avg len: 469.0 num_loss_counted_tokens: 783 | |
total tokens: 2509 num samples: 13 num padding tokens: 225 - rank: 5 max len: 193 min len: 156 avg len: 175.69230769230768 num_loss_counted_tokens: 1088 | |
total tokens: 2508 num samples: 19 num padding tokens: 424 - rank: 7 max len: 132 min len: 82 avg len: 109.6842105263158 num_loss_counted_tokens: 590 | |
Per-token loss scaled by world size: 0.0021460973657667637 | |
Per-token loss scaled by world size: 0.0009927600622177124 | |
Per-token loss scaled by world size: 0.0011540588457137346 | |
Per-token loss scaled by world size: 0.0009585152147337794 | |
Per-token loss scaled by world size: 0.0012274347245693207 | |
Per-token loss scaled by world size: 0.0017290335381403565 | |
Per-token loss scaled by world size: 0.001280484371818602 | |
Per-token loss scaled by world size: 0.0006715547642670572 | |
Epoch: 0, Step: 190, Rank: 0, loss = 1.1415143013000488 | |
Epoch: 0, Step: 190, Rank: 1, loss = 1.995870590209961 | |
Epoch: 0, Step: 190, Rank: 2, loss = 1.6080012321472168 | |
Epoch: 0, Step: 190, Rank: 3, loss = 1.1908504962921143 | |
Epoch: 0, Step: 190, Rank: 4, loss = 0.8914191722869873 | |
Epoch: 0, Step: 190, Rank: 5, loss = 1.0732747316360474 | |
Epoch: 0, Step: 190, Rank: 6, loss = 0.9232668876647949 | |
Epoch: 0, Step: 190, Rank: 7, loss = 0.6245459318161011 | |
[2024-06-27 16:45:28,774] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=0, lr=[9.87012987012987e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:28,847] [INFO] [timer.py:260:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=95.50177831485982, CurrSamplesPerSec=95.15438070620463, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.0533203244417 samples/s, lr: 9.87012987012987e-06, loss: 1.1415143013000488 cuda_mem_allocated: 22.299350261688232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7440.0 batch_size: 81.0 total loss: 1.1810928583145142 | |
Epoch 0: 89% 190/213 [04:18<00:24, 1.06s/it] | |
total tokens: 2448 num samples: 16 num padding tokens: 188 - rank: 6 max len: 153 min len: 131 avg len: 141.25 num_loss_counted_tokens: 861 | |
total tokens: 2464 num samples: 11 num padding tokens: 183 - rank: 4 max len: 224 min len: 189 avg len: 207.36363636363637 num_loss_counted_tokens: 849 | |
total tokens: 2431 num samples: 13 num padding tokens: 235 - rank: 5 max len: 187 min len: 155 avg len: 168.92307692307693 num_loss_counted_tokens: 896 | |
total tokens: 2440 num samples: 8 num padding tokens: 170 - rank: 2 max len: 305 min len: 262 avg len: 283.75 num_loss_counted_tokens: 1176 | |
total tokens: 2304 num samples: 9 num padding tokens: 182 - rank: 3 max len: 256 min len: 224 avg len: 235.77777777777777 num_loss_counted_tokens: 882 | |
total tokens: 2506 num samples: 7 num padding tokens: 122 - rank: 1 max len: 358 min len: 312 avg len: 340.57142857142856 num_loss_counted_tokens: 1136 | |
total tokens: 2413 num samples: 19 num padding tokens: 463 - rank: 7 max len: 127 min len: 83 avg len: 102.63157894736842 num_loss_counted_tokens: 436 | |
total tokens: 2108 num samples: 4 num padding tokens: 341 - rank: 0 max len: 527 min len: 366 avg len: 441.75 num_loss_counted_tokens: 1064 | |
Per-token loss scaled by world size: 0.0012760428944602609 | |
Per-token loss scaled by world size: 0.0012569351820275187 | |
Per-token loss scaled by world size: 0.000763555581215769 | |
Per-token loss scaled by world size: 0.001470689894631505 | |
Per-token loss scaled by world size: 0.0017765691736713052 | |
Per-token loss scaled by world size: 0.0009371329215355217 | |
Per-token loss scaled by world size: 9.509848314337432e-05 | |
Per-token loss scaled by world size: 0.00225253589451313 | |
Epoch: 0, Step: 191, Rank: 0, loss = 0.07528233528137207 | |
Epoch: 0, Step: 191, Rank: 1, loss = 1.7831637859344482 | |
Epoch: 0, Step: 191, Rank: 2, loss = 1.0101474523544312 | |
Epoch: 0, Step: 191, Rank: 3, loss = 1.406376600265503 | |
Epoch: 0, Step: 191, Rank: 4, loss = 0.7418578267097473 | |
Epoch: 0, Step: 191, Rank: 5, loss = 0.9950213432312012 | |
Epoch: 0, Step: 191, Rank: 6, loss = 1.1642348766326904 | |
Epoch: 0, Step: 191, Rank: 7, loss = 0.6044496893882751 | |
[2024-06-27 16:45:29,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=191, skipped=0, lr=[9.922077922077923e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:29,916] [INFO] [timer.py:260:stop] epoch=0/micro_step=191/global_step=191, RunningAvgSamplesPerSec=95.49743093847269, CurrSamplesPerSec=94.68709624850175, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.59331398647349 samples/s, lr: 9.922077922077923e-06, loss: 0.07528233528137207 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6333.0 batch_size: 79.0 total loss: 0.9725667834281921 | |
Epoch 0: 90% 191/213 [04:19<00:23, 1.06s/it] | |
total tokens: 2310 num samples: 10 num padding tokens: 89 - rank: 4 max len: 231 min len: 215 avg len: 222.1 num_loss_counted_tokens: 858 | |
total tokens: 2416 num samples: 8 num padding tokens: 90 - rank: 2 max len: 302 min len: 278 avg len: 290.75 num_loss_counted_tokens: 1058 | |
total tokens: 2304 num samples: 16 num padding tokens: 417 - rank: 7 max len: 144 min len: 92 avg len: 117.9375 num_loss_counted_tokens: 568 | |
total tokens: 2343 num samples: 11 num padding tokens: 96 - rank: 5 max len: 213 min len: 191 avg len: 204.27272727272728 num_loss_counted_tokens: 857 | |
total tokens: 2483 num samples: 13 num padding tokens: 270 - rank: 6 max len: 191 min len: 156 avg len: 170.23076923076923 num_loss_counted_tokens: 867 | |
total tokens: 2448 num samples: 9 num padding tokens: 149 - rank: 3 max len: 272 min len: 233 avg len: 255.44444444444446 num_loss_counted_tokens: 907 | |
total tokens: 2444 num samples: 4 num padding tokens: 323 - rank: 0 max len: 611 min len: 445 avg len: 530.25 num_loss_counted_tokens: 1166 | |
total tokens: 2478 num samples: 6 num padding tokens: 252 - rank: 1 max len: 413 min len: 308 avg len: 371.0 num_loss_counted_tokens: 1580 | |
Per-token loss scaled by world size: 0.0009023071615956724 | |
Per-token loss scaled by world size: 0.0011001058155670762 | |
Per-token loss scaled by world size: 0.00240907515399158 | |
Per-token loss scaled by world size: 0.0009518050355836749 | |
Per-token loss scaled by world size: 0.0010200438555330038 | |
Per-token loss scaled by world size: 0.000768601952586323 | |
Per-token loss scaled by world size: 0.0013964374084025621 | |
Per-token loss scaled by world size: 0.0019724303856492043 | |
Epoch: 0, Step: 192, Rank: 0, loss = 1.2227555513381958 | |
Epoch: 0, Step: 192, Rank: 1, loss = 2.1094465255737305 | |
Epoch: 0, Step: 192, Rank: 2, loss = 0.8931758999824524 | |
Epoch: 0, Step: 192, Rank: 3, loss = 1.7271093130111694 | |
Epoch: 0, Step: 192, Rank: 4, loss = 0.7900826930999756 | |
Epoch: 0, Step: 192, Rank: 5, loss = 0.8334242701530457 | |
Epoch: 0, Step: 192, Rank: 6, loss = 0.963280200958252 | |
Epoch: 0, Step: 192, Rank: 7, loss = 0.673007071018219 | |
[2024-06-27 16:45:30,910] [INFO] [logging.py:96:log_dist] [Rank 0] step=192, skipped=0, lr=[9.974025974025974e-06], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:30,983] [INFO] [timer.py:260:stop] epoch=0/micro_step=192/global_step=192, RunningAvgSamplesPerSec=95.49336333706128, CurrSamplesPerSec=94.7307585623436, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.63853583015073 samples/s, lr: 9.974025974025974e-06, loss: 1.2227555513381958 cuda_mem_allocated: 22.315450191497803 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7005.0 batch_size: 75.0 total loss: 1.1515352725982666 | |
Epoch 0: 90% 192/213 [04:20<00:22, 1.06s/it] | |
total tokens: 2464 num samples: 11 num padding tokens: 201 - rank: 4 max len: 224 min len: 185 avg len: 205.72727272727272 num_loss_counted_tokens: 781 | |
total tokens: 2511 num samples: 9 num padding tokens: 269 - rank: 3 max len: 279 min len: 235 avg len: 249.11111111111111 num_loss_counted_tokens: 989 | |
total tokens: 2274 num samples: 3 num padding tokens: 454 - rank: 1 max len: 758 min len: 445 avg len: 606.6666666666666 num_loss_counted_tokens: 1541 | |
total tokens: 2471 num samples: 7 num padding tokens: 281 - rank: 2 max len: 353 min len: 292 avg len: 312.85714285714283 num_loss_counted_tokens: 826 | |
total tokens: 2465 num samples: 17 num padding tokens: 134 - rank: 6 max len: 145 min len: 129 avg len: 137.11764705882354 num_loss_counted_tokens: 802 | |
total tokens: 2379 num samples: 13 num padding tokens: 175 - rank: 5 max len: 183 min len: 148 avg len: 169.53846153846155 num_loss_counted_tokens: 708 | |
total tokens: 2451 num samples: 19 num padding tokens: 435 - rank: 7 max len: 129 min len: 76 avg len: 106.10526315789474 num_loss_counted_tokens: 507 | |
total tokens: 1423 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1423 min len: 1423 avg len: 1423.0 num_loss_counted_tokens: 46 | |
Per-token loss scaled by world size: 0.001439020736142993 | |
Per-token loss scaled by world size: 0.0017227329080924392 | |
Per-token loss scaled by world size: 0.0010293645318597555 | |
Per-token loss scaled by world size: 0.0006321917753666639 | |
Per-token loss scaled by world size: 0.0010930694406852126 | |
Per-token loss scaled by world size: 0.0014109313488006592 | |
Per-token loss scaled by world size: 0.0010262137511745095 | |
Per-token loss scaled by world size: 0.001018903567455709 | |
Epoch: 0, Step: 193, Rank: 0, loss = 0.9333446621894836 | |
Epoch: 0, Step: 193, Rank: 1, loss = 1.4709985256195068 | |
Epoch: 0, Step: 193, Rank: 2, loss = 1.2047590017318726 | |
Epoch: 0, Step: 193, Rank: 3, loss = 0.8789486885070801 | |
Epoch: 0, Step: 193, Rank: 4, loss = 0.8762582540512085 | |
Epoch: 0, Step: 193, Rank: 5, loss = 1.2287437915802002 | |
Epoch: 0, Step: 193, Rank: 6, loss = 0.8700162768363953 | |
Epoch: 0, Step: 193, Rank: 7, loss = 0.5398127436637878 | |
[2024-06-27 16:45:31,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=193, skipped=0, lr=[1.0025974025974026e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:32,043] [INFO] [timer.py:260:stop] epoch=0/micro_step=193/global_step=193, RunningAvgSamplesPerSec=95.4935769723228, CurrSamplesPerSec=95.53418502380327, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.42066970964524 samples/s, lr: 1.0025974025974026e-05, loss: 0.9333446621894836 cuda_mem_allocated: 22.256415367126465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6831.0 batch_size: 87.0 total loss: 1.0003602504730225 | |
Epoch 0: 91% 193/213 [04:21<00:21, 1.06s/it] | |
total tokens: 2520 num samples: 10 num padding tokens: 239 - rank: 4 max len: 252 min len: 189 avg len: 228.1 num_loss_counted_tokens: 947 | |
total tokens: 2352 num samples: 8 num padding tokens: 241 - rank: 3 max len: 294 min len: 255 avg len: 263.875 num_loss_counted_tokens: 1161 | |
total tokens: 2520 num samples: 7 num padding tokens: 250 - rank: 2 max len: 360 min len: 302 avg len: 324.2857142857143 num_loss_counted_tokens: 1107 | |
total tokens: 2415 num samples: 15 num padding tokens: 234 - rank: 6 max len: 161 min len: 133 avg len: 145.4 num_loss_counted_tokens: 752 | |
total tokens: 2265 num samples: 5 num padding tokens: 152 - rank: 1 max len: 453 min len: 385 avg len: 422.6 num_loss_counted_tokens: 1495 | |
total tokens: 2376 num samples: 18 num padding tokens: 449 - rank: 7 max len: 132 min len: 82 avg len: 107.05555555555556 num_loss_counted_tokens: 571 | |
total tokens: 2418 num samples: 13 num padding tokens: 153 - rank: 5 max len: 186 min len: 162 avg len: 174.23076923076923 num_loss_counted_tokens: 1014 | |
total tokens: 2418 num samples: 3 num padding tokens: 628 - rank: 0 max len: 806 min len: 454 avg len: 596.6666666666666 num_loss_counted_tokens: 903 | |
Per-token loss scaled by world size: 0.0010664125438779593 | |
Per-token loss scaled by world size: 0.00121632544323802 | |
Per-token loss scaled by world size: 0.0012926302151754498 | |
Per-token loss scaled by world size: 0.0012308725854381919 | |
Per-token loss scaled by world size: 0.001624868018552661 | |
Per-token loss scaled by world size: 0.0014212167588993907 | |
Per-token loss scaled by world size: 0.001071709906682372 | |
Per-token loss scaled by world size: 0.0005893373745493591 | |
Epoch: 0, Step: 194, Rank: 0, loss = 1.0921109914779663 | |
Epoch: 0, Step: 194, Rank: 1, loss = 1.0276429653167725 | |
Epoch: 0, Step: 194, Rank: 2, loss = 1.039933443069458 | |
Epoch: 0, Step: 194, Rank: 3, loss = 1.3728103637695312 | |
Epoch: 0, Step: 194, Rank: 4, loss = 0.9054608941078186 | |
Epoch: 0, Step: 194, Rank: 5, loss = 1.200750470161438 | |
Epoch: 0, Step: 194, Rank: 6, loss = 0.9009853005409241 | |
Epoch: 0, Step: 194, Rank: 7, loss = 0.49791643023490906 | |
[2024-06-27 16:45:33,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=194, skipped=0, lr=[1.0077922077922078e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:33,105] [INFO] [timer.py:260:stop] epoch=0/micro_step=194/global_step=194, RunningAvgSamplesPerSec=95.49232894312561, CurrSamplesPerSec=95.25455202681347, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.15561749280342 samples/s, lr: 1.0077922077922078e-05, loss: 1.0921109914779663 cuda_mem_allocated: 22.284084796905518 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6759.0 batch_size: 91.0 total loss: 1.0047013759613037 | |
Epoch 0: 91% 194/213 [04:22<00:20, 1.06s/it] | |
total tokens: 2320 num samples: 10 num padding tokens: 170 - rank: 4 max len: 232 min len: 204 avg len: 215.0 num_loss_counted_tokens: 925 | |
total tokens: 2464 num samples: 16 num padding tokens: 280 - rank: 6 max len: 154 min len: 121 avg len: 136.5 num_loss_counted_tokens: 738 | |
total tokens: 2448 num samples: 6 num padding tokens: 337 - rank: 1 max len: 408 min len: 321 avg len: 351.8333333333333 num_loss_counted_tokens: 736 | |
total tokens: 2440 num samples: 8 num padding tokens: 161 - rank: 2 max len: 305 min len: 268 avg len: 284.875 num_loss_counted_tokens: 1214 | |
total tokens: 2352 num samples: 12 num padding tokens: 239 - rank: 5 max len: 196 min len: 156 avg len: 176.08333333333334 num_loss_counted_tokens: 854 | |
total tokens: 2412 num samples: 9 num padding tokens: 138 - rank: 3 max len: 268 min len: 235 avg len: 252.66666666666666 num_loss_counted_tokens: 1039 | |
total tokens: 2328 num samples: 4 num padding tokens: 337 - rank: 0 max len: 582 min len: 433 avg len: 497.75 num_loss_counted_tokens: 1010 | |
total tokens: 2520 num samples: 21 num padding tokens: 433 - rank: 7 max len: 120 min len: 75 avg len: 99.38095238095238 num_loss_counted_tokens: 462 | |
Per-token loss scaled by world size: 0.0008172934758476913 | |
Per-token loss scaled by world size: 0.0023313036654144526 | |
Per-token loss scaled by world size: 0.00047380433534272015 | |
Per-token loss scaled by world size: 0.0005812046001665294 | |
Per-token loss scaled by world size: 0.0011117256944999099 | |
Per-token loss scaled by world size: 0.0010522139491513371 | |
Per-token loss scaled by world size: 0.0009268131107091904 | |
Per-token loss scaled by world size: 0.001252303016372025 | |
Epoch: 0, Step: 195, Rank: 0, loss = 1.0943562984466553 | |
Epoch: 0, Step: 195, Rank: 1, loss = 0.5079001784324646 | |
Epoch: 0, Step: 195, Rank: 2, loss = 2.0372679233551025 | |
Epoch: 0, Step: 195, Rank: 3, loss = 0.8099188208580017 | |
Epoch: 0, Step: 195, Rank: 4, loss = 0.9195034503936768 | |
Epoch: 0, Step: 195, Rank: 5, loss = 0.7142123579978943 | |
Epoch: 0, Step: 195, Rank: 6, loss = 0.9715093374252319 | |
Epoch: 0, Step: 195, Rank: 7, loss = 0.4140457510948181 | |
[2024-06-27 16:45:34,086] [INFO] [logging.py:96:log_dist] [Rank 0] step=195, skipped=0, lr=[1.012987012987013e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:34,159] [INFO] [timer.py:260:stop] epoch=0/micro_step=195/global_step=195, RunningAvgSamplesPerSec=95.4953690320373, CurrSamplesPerSec=96.08267459216192, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.98336317419347 samples/s, lr: 1.012987012987013e-05, loss: 1.0943562984466553 cuda_mem_allocated: 22.2825345993042 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6991.0 batch_size: 77.0 total loss: 0.9335892200469971 | |
Epoch 0: 92% 195/213 [04:23<00:19, 1.06s/it] | |
total tokens: 2508 num samples: 19 num padding tokens: 270 - rank: 7 max len: 132 min len: 89 avg len: 117.78947368421052 num_loss_counted_tokens: 714 | |
total tokens: 2408 num samples: 8 num padding tokens: 227 - rank: 3 max len: 301 min len: 238 avg len: 272.625 num_loss_counted_tokens: 933 | |
total tokens: 2286 num samples: 6 num padding tokens: 247 - rank: 2 max len: 381 min len: 311 avg len: 339.8333333333333 num_loss_counted_tokens: 684 | |
total tokens: 2380 num samples: 10 num padding tokens: 149 - rank: 4 max len: 238 min len: 205 avg len: 223.1 num_loss_counted_tokens: 1017 | |
total tokens: 2235 num samples: 5 num padding tokens: 105 - rank: 1 max len: 447 min len: 410 avg len: 426.0 num_loss_counted_tokens: 962 | |
total tokens: 2448 num samples: 12 num padding tokens: 189 - rank: 5 max len: 204 min len: 166 avg len: 188.25 num_loss_counted_tokens: 1046 | |
total tokens: 2415 num samples: 15 num padding tokens: 223 - rank: 6 max len: 161 min len: 135 avg len: 146.13333333333333 num_loss_counted_tokens: 837 | |
total tokens: 1722 num samples: 2 num padding tokens: 370 - rank: 0 max len: 861 min len: 491 avg len: 676.0 num_loss_counted_tokens: 67 | |
Per-token loss scaled by world size: 0.0013390100793913007 | |
Per-token loss scaled by world size: 0.0005130370846018195 | |
Per-token loss scaled by world size: 0.0008802613010630012 | |
Per-token loss scaled by world size: 0.00047325657214969397 | |
Per-token loss scaled by world size: 0.0015453464584425092 | |
Per-token loss scaled by world size: 0.001698609790764749 | |
Per-token loss scaled by world size: 0.0008306367672048509 | |
Per-token loss scaled by world size: 0.0009147594682872295 | |
Epoch: 0, Step: 196, Rank: 0, loss = 0.7480922341346741 | |
Epoch: 0, Step: 196, Rank: 1, loss = 1.5298104286193848 | |
Epoch: 0, Step: 196, Rank: 2, loss = 1.3917776346206665 | |
Epoch: 0, Step: 196, Rank: 3, loss = 0.462054044008255 | |
Epoch: 0, Step: 196, Rank: 4, loss = 1.2059459686279297 | |
Epoch: 0, Step: 196, Rank: 5, loss = 0.7927853465080261 | |
Epoch: 0, Step: 196, Rank: 6, loss = 0.8238552212715149 | |
Epoch: 0, Step: 196, Rank: 7, loss = 0.4262267053127289 | |
[2024-06-27 16:45:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=196, skipped=0, lr=[1.0181818181818182e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:35,207] [INFO] [timer.py:260:stop] epoch=0/micro_step=196/global_step=196, RunningAvgSamplesPerSec=95.50097399058647, CurrSamplesPerSec=96.5951895904003, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.49756834760994 samples/s, lr: 1.0181818181818182e-05, loss: 0.7480922341346741 cuda_mem_allocated: 22.278598308563232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7205.0 batch_size: 75.0 total loss: 0.9225685596466064 | |
Epoch 0: 92% 196/213 [04:24<00:17, 1.06s/it] | |
total tokens: 2512 num samples: 8 num padding tokens: 253 - rank: 2 max len: 314 min len: 264 avg len: 282.375 num_loss_counted_tokens: 1220 | |
total tokens: 2376 num samples: 9 num padding tokens: 143 - rank: 3 max len: 264 min len: 233 avg len: 248.11111111111111 num_loss_counted_tokens: 901 | |
total tokens: 2310 num samples: 10 num padding tokens: 229 - rank: 4 max len: 231 min len: 191 avg len: 208.1 num_loss_counted_tokens: 881 | |
total tokens: 2448 num samples: 6 num padding tokens: 386 - rank: 1 max len: 408 min len: 317 avg len: 343.6666666666667 num_loss_counted_tokens: 1229 | |
total tokens: 2448 num samples: 16 num padding tokens: 243 - rank: 6 max len: 153 min len: 122 avg len: 137.8125 num_loss_counted_tokens: 872 | |
total tokens: 2470 num samples: 13 num padding tokens: 224 - rank: 5 max len: 190 min len: 155 avg len: 172.76923076923077 num_loss_counted_tokens: 1006 | |
total tokens: 2424 num samples: 4 num padding tokens: 345 - rank: 0 max len: 606 min len: 442 avg len: 519.75 num_loss_counted_tokens: 1662 | |
total tokens: 2400 num samples: 20 num padding tokens: 352 - rank: 7 max len: 120 min len: 78 avg len: 102.4 num_loss_counted_tokens: 527 | |
Per-token loss scaled by world size: 0.0013834828278049827 | |
Per-token loss scaled by world size: 0.0013621066464111209 | |
Per-token loss scaled by world size: 0.0016819677548483014 | |
Per-token loss scaled by world size: 0.0017475062049925327 | |
Per-token loss scaled by world size: 0.0009556447621434927 | |
Per-token loss scaled by world size: 0.0010820509633049369 | |
Per-token loss scaled by world size: 0.0005597122362814844 | |
Per-token loss scaled by world size: 0.0010425393702462316 | |
Epoch: 0, Step: 197, Rank: 0, loss = 1.352302074432373 | |
Epoch: 0, Step: 197, Rank: 1, loss = 1.0951337814331055 | |
Epoch: 0, Step: 197, Rank: 2, loss = 0.8699689507484436 | |
Epoch: 0, Step: 197, Rank: 3, loss = 0.7683383822441101 | |
Epoch: 0, Step: 197, Rank: 4, loss = 1.4049949645996094 | |
Epoch: 0, Step: 197, Rank: 5, loss = 1.1123201847076416 | |
Epoch: 0, Step: 197, Rank: 6, loss = 0.838201642036438 | |
Epoch: 0, Step: 197, Rank: 7, loss = 0.4500086307525635 | |
[2024-06-27 16:45:36,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=197, skipped=0, lr=[1.0233766233766234e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:36,260] [INFO] [timer.py:260:stop] epoch=0/micro_step=197/global_step=197, RunningAvgSamplesPerSec=95.50476434877227, CurrSamplesPerSec=96.24582900192681, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.14948625561544 samples/s, lr: 1.0233766233766234e-05, loss: 1.352302074432373 cuda_mem_allocated: 22.275140285491943 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6432.0 batch_size: 95.0 total loss: 0.9864086508750916 | |
Epoch 0: 92% 197/213 [04:25<00:16, 1.06s/it] | |
Per-token loss scaled by world size: 0.0004932651063427329 | |
Per-token loss scaled by world size: 0.0010348332580178976 | |
Per-token loss scaled by world size: 0.0024723494425415993 | |
Per-token loss scaled by world size: 0.002346582477912307 | |
Per-token loss scaled by world size: 0.0011164387688040733 | |
Per-token loss scaled by world size: 0.001422531553544104 | |
Per-token loss scaled by world size: 0.00032833332079462707 | |
Per-token loss scaled by world size: 0.0007739402935840189 | |
Epoch: 0, Step: 198, Rank: 0, loss = 0.28729164600372314 | |
Epoch: 0, Step: 198, Rank: 1, loss = 0.9054790735244751 | |
Epoch: 0, Step: 198, Rank: 2, loss = 2.1633057594299316 | |
Epoch: 0, Step: 198, Rank: 3, loss = 2.0532596111297607 | |
Epoch: 0, Step: 198, Rank: 4, loss = 1.2447150945663452 | |
Epoch: 0, Step: 198, Rank: 5, loss = 0.9768838882446289 | |
Epoch: 0, Step: 198, Rank: 6, loss = 0.6771977543830872 | |
Epoch: 0, Step: 198, Rank: 7, loss = 0.4316069781780243 | |
[2024-06-27 16:45:37,252] [INFO] [logging.py:96:log_dist] [Rank 0] step=198, skipped=0, lr=[1.0285714285714285e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:37,326] [INFO] [timer.py:260:stop] epoch=0/micro_step=198/global_step=198, RunningAvgSamplesPerSec=95.50090666357386, CurrSamplesPerSec=94.75456702105905, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.66345521112734 samples/s, lr: 1.0285714285714285e-05, loss: 0.28729164600372314 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7000.0 batch_size: 74.0 total loss: 1.0924674272537231 | |
Epoch 0: 93% 198/213 [04:26<00:15, 1.06s/it] | |
Per-token loss scaled by world size: 0.0009132793638855219 | |
Per-token loss scaled by world size: 0.0011212702374905348 | |
Per-token loss scaled by world size: 0.0013516373001039028 | |
Per-token loss scaled by world size: 0.001887813676148653 | |
Per-token loss scaled by world size: 0.0009413102525286376 | |
Per-token loss scaled by world size: 0.001637465669773519 | |
Per-token loss scaled by world size: 0.00034621317172423005 | |
Per-token loss scaled by world size: 0.0006317552179098129 | |
Epoch: 0, Step: 199, Rank: 0, loss = 1.6738992929458618 | |
Epoch: 0, Step: 199, Rank: 1, loss = 1.9298175573349 | |
Epoch: 0, Step: 199, Rank: 2, loss = 1.3817112445831299 | |
Epoch: 0, Step: 199, Rank: 3, loss = 1.1462185382843018 | |
Epoch: 0, Step: 199, Rank: 4, loss = 0.9335998296737671 | |
Epoch: 0, Step: 199, Rank: 5, loss = 0.9622544050216675 | |
Epoch: 0, Step: 199, Rank: 6, loss = 0.6458117961883545 | |
Epoch: 0, Step: 199, Rank: 7, loss = 0.3539164066314697 | |
[2024-06-27 16:45:38,320] [INFO] [logging.py:96:log_dist] [Rank 0] step=199, skipped=0, lr=[1.0337662337662338e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:38,393] [INFO] [timer.py:260:stop] epoch=0/micro_step=199/global_step=199, RunningAvgSamplesPerSec=95.49716030562774, CurrSamplesPerSec=94.7685054255217, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.67923685352804 samples/s, lr: 1.0337662337662338e-05, loss: 1.6738992929458618 cuda_mem_allocated: 22.305552005767822 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8178.0 batch_size: 76.0 total loss: 1.1284037828445435 | |
Epoch 0: 93% 199/213 [04:27<00:14, 1.06s/it] | |
Per-token loss scaled by world size: 0.0012315186904743314 | |
Per-token loss scaled by world size: 0.001281864708289504 | |
Per-token loss scaled by world size: 0.0010520215146243572 | |
Per-token loss scaled by world size: 0.0012065048795193434 | |
Per-token loss scaled by world size: 0.0011280846083536744 | |
Per-token loss scaled by world size: 0.0006103774067014456 | |
Per-token loss scaled by world size: 0.001040586386807263 | |
Per-token loss scaled by world size: 0.0011521843262016773 | |
Epoch: 0, Step: 200, Rank: 0, loss = 0.9812729954719543 | |
Epoch: 0, Step: 200, Rank: 1, loss = 1.0865098237991333 | |
Epoch: 0, Step: 200, Rank: 2, loss = 1.0637837648391724 | |
Epoch: 0, Step: 200, Rank: 3, loss = 1.2087984085083008 | |
Epoch: 0, Step: 200, Rank: 4, loss = 1.137734055519104 | |
Epoch: 0, Step: 200, Rank: 5, loss = 0.9920563101768494 | |
Epoch: 0, Step: 200, Rank: 6, loss = 1.1613221168518066 | |
Epoch: 0, Step: 200, Rank: 7, loss = 0.5755859017372131 | |
[2024-06-27 16:45:39,379] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=0, lr=[1.0389610389610389e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:39,452] [INFO] [timer.py:260:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=95.49707749729224, CurrSamplesPerSec=95.48076705556622, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.39015223214508 samples/s, lr: 1.0389610389610389e-05, loss: 0.9812729954719543 cuda_mem_allocated: 22.248901844024658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7544.0 batch_size: 77.0 total loss: 1.0258828401565552 | |
Epoch 0: 94% 200/213 [04:28<00:13, 1.06s/it] | |
Per-token loss scaled by world size: 0.0011857650242745876 | |
Per-token loss scaled by world size: 0.001865518162958324 | |
Per-token loss scaled by world size: 0.00039492372889071703 | |
Per-token loss scaled by world size: 0.002400167053565383 | |
Per-token loss scaled by world size: 0.0012875181855633855 | |
Per-token loss scaled by world size: 0.0010468184482306242 | |
Per-token loss scaled by world size: 4.230281774653122e-05 | |
Per-token loss scaled by world size: 0.0016990388976410031 | |
Epoch: 0, Step: 201, Rank: 0, loss = 0.03658664971590042 | |
Epoch: 0, Step: 201, Rank: 1, loss = 2.0758445262908936 | |
Epoch: 0, Step: 201, Rank: 2, loss = 1.6134400367736816 | |
Epoch: 0, Step: 201, Rank: 3, loss = 1.4694563150405884 | |
Epoch: 0, Step: 201, Rank: 4, loss = 0.9053671360015869 | |
Epoch: 0, Step: 201, Rank: 5, loss = 1.0255385637283325 | |
Epoch: 0, Step: 201, Rank: 6, loss = 1.1135423183441162 | |
Epoch: 0, Step: 201, Rank: 7, loss = 0.34155964851379395 | |
[2024-06-27 16:45:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=201, skipped=0, lr=[1.0441558441558442e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:40,517] [INFO] [timer.py:260:stop] epoch=0/micro_step=201/global_step=201, RunningAvgSamplesPerSec=95.4938933973782, CurrSamplesPerSec=94.86759731126175, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.78079692447984 samples/s, lr: 1.0441558441558442e-05, loss: 0.03658664971590042 cuda_mem_allocated: 22.298276901245117 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6919.0 batch_size: 84.0 total loss: 1.0726670026779175 | |
Epoch 0: 94% 201/213 [04:29<00:12, 1.06s/it] | |
Per-token loss scaled by world size: 0.0014778947224840522 | |
Per-token loss scaled by world size: 0.0011135080130770802 | |
Per-token loss scaled by world size: 0.0001225405721925199 | |
Per-token loss scaled by world size: 0.000718852155841887 | |
Per-token loss scaled by world size: 0.00242992932908237 | |
Per-token loss scaled by world size: 0.0008624705951660872 | |
Per-token loss scaled by world size: 0.0016202842816710472 | |
Per-token loss scaled by world size: 0.0017897068755701184 | |
Epoch: 0, Step: 202, Rank: 0, loss = 0.08207155019044876 | |
Epoch: 0, Step: 202, Rank: 1, loss = 1.6274452209472656 | |
Epoch: 0, Step: 202, Rank: 2, loss = 0.5776396989822388 | |
Epoch: 0, Step: 202, Rank: 3, loss = 0.7457720041275024 | |
Epoch: 0, Step: 202, Rank: 4, loss = 1.1986562013626099 | |
Epoch: 0, Step: 202, Rank: 5, loss = 1.0851854085922241 | |
Epoch: 0, Step: 202, Rank: 6, loss = 0.9898200035095215 | |
Epoch: 0, Step: 202, Rank: 7, loss = 0.48145124316215515 | |
[2024-06-27 16:45:41,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=202, skipped=0, lr=[1.0493506493506493e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:45:41,578] [INFO] [timer.py:260:stop] epoch=0/micro_step=202/global_step=202, RunningAvgSamplesPerSec=95.49321073068064, CurrSamplesPerSec=95.35755401578473, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 95.2712527272112 samples/s, lr: 1.0493506493506493e-05, loss: 0.08207155019044876 cuda_mem_allocated: 22.24389410018921 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5358.0 batch_size: 86.0 total loss: 0.8485051989555359 | |
Epoch 0: 95% 202/213 [04:30<00:11, 1.06s/it]
Per-token loss scaled by world size: 0.0009257036726921797
Per-token loss scaled by world size: 0.0008928405586630106
Per-token loss scaled by world size: 0.0009540586033836007
Per-token loss scaled by world size: 0.0015737077919766307
Per-token loss scaled by world size: 0.0007556969649158418
Per-token loss scaled by world size: 0.0017886889399960637
Per-token loss scaled by world size: 0.0004159482487011701
Epoch: 0, Step: 203, Rank: 3, loss = 0.9218851327896118
Epoch: 0, Step: 203, Rank: 4, loss = 0.8891575932502747
Epoch: 0, Step: 203, Rank: 2, loss = 0.950123131275177
Epoch: 0, Step: 203, Rank: 1, loss = 1.5672162771224976
Epoch: 0, Step: 203, Rank: 6, loss = 0.7525796890258789
Epoch: 0, Step: 203, Rank: 0, loss = 1.7813105583190918
Epoch: 0, Step: 203, Rank: 7, loss = 0.414232462644577
Per-token loss scaled by world size: 0.0009836278622969985
Epoch: 0, Step: 203, Rank: 5, loss = 0.9795703887939453
[2024-06-27 16:45:42,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=203, skipped=0, lr=[1.0545454545454546e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:42,647] [INFO] [timer.py:260:stop] epoch=0/micro_step=203/global_step=203, RunningAvgSamplesPerSec=95.48823621543296, CurrSamplesPerSec=94.50364304966583, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 94.41257017285113 samples/s, lr: 1.0545454545454546e-05, loss: 1.7813105583190918 cuda_mem_allocated: 22.281221389770508 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7967.0 batch_size: 88.0 total loss: 1.0320093631744385
Epoch 0: 95% 203/213 [04:31<00:10, 1.06s/it]
Per-token loss scaled by world size: 0.0012585805961862206
Per-token loss scaled by world size: 0.0018764209235087037
Per-token loss scaled by world size: 0.001739348634146154
Per-token loss scaled by world size: 0.00040951493429020047
Per-token loss scaled by world size: 0.0005824476247653365
Per-token loss scaled by world size: 0.001583782839588821
Per-token loss scaled by world size: 0.0012463717721402645
Epoch: 0, Step: 204, Rank: 1, loss = 1.5320976972579956
Epoch: 0, Step: 204, Rank: 5, loss = 1.0276310443878174
Epoch: 0, Step: 204, Rank: 2, loss = 1.4201781749725342
Epoch: 0, Step: 204, Rank: 0, loss = 0.3343689441680908
Epoch: 0, Step: 204, Rank: 4, loss = 1.2931586503982544
Epoch: 0, Step: 204, Rank: 7, loss = 0.4755684733390808
Epoch: 0, Step: 204, Rank: 6, loss = 1.017662525177002
Per-token loss scaled by world size: 0.001400351757183671
Epoch: 0, Step: 204, Rank: 3, loss = 1.1433871984481812
[2024-06-27 16:45:43,629] [INFO] [logging.py:96:log_dist] [Rank 0] step=204, skipped=0, lr=[1.0597402597402597e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:43,704] [INFO] [timer.py:260:stop] epoch=0/micro_step=204/global_step=204, RunningAvgSamplesPerSec=95.48944404349362, CurrSamplesPerSec=95.73283937338222, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 95.6434949873371 samples/s, lr: 1.0597402597402597e-05, loss: 0.3343689441680908 cuda_mem_allocated: 22.275736331939697 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6532.0 batch_size: 80.0 total loss: 1.0305064916610718
Epoch 0: 96% 204/213 [04:32<00:09, 1.06s/it]
Per-token loss scaled by world size: 0.0011754371225833893
Per-token loss scaled by world size: 0.0013395511778071523
Per-token loss scaled by world size: 0.0014359309570863843
Per-token loss scaled by world size: 0.0008394854376092553
Per-token loss scaled by world size: 0.0009823060827329755
Per-token loss scaled by world size: 0.0013119019567966461
Per-token loss scaled by world size: 0.0006086709909141064
Epoch: 0, Step: 205, Rank: 6, loss = 1.0483429431915283
Epoch: 0, Step: 205, Rank: 4, loss = 1.1947121620178223
Epoch: 0, Step: 205, Rank: 5, loss = 0.8760942816734314
Epoch: 0, Step: 205, Rank: 7, loss = 0.7487160563468933
Epoch: 0, Step: 205, Rank: 1, loss = 0.5428584218025208
Epoch: 0, Step: 205, Rank: 2, loss = 1.2806708812713623
Epoch: 0, Step: 205, Rank: 3, loss = 1.1700525283813477
Per-token loss scaled by world size: 0.00039375247433781624
Epoch: 0, Step: 205, Rank: 0, loss = 0.35117799043655396
[2024-06-27 16:45:44,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=205, skipped=0, lr=[1.064935064935065e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:44,768] [INFO] [timer.py:260:stop] epoch=0/micro_step=205/global_step=205, RunningAvgSamplesPerSec=95.48736002215364, CurrSamplesPerSec=95.06824460052529, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 94.97899321412127 samples/s, lr: 1.064935064935065e-05, loss: 0.35117799043655396 cuda_mem_allocated: 22.300780773162842 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7135.0 batch_size: 83.0 total loss: 0.9015781879425049
Epoch 0: 96% 205/213 [04:34<00:08, 1.06s/it]
Per-token loss scaled by world size: 0.0012879545101895928
Per-token loss scaled by world size: 0.0018403093563392758
Per-token loss scaled by world size: 0.000890835712198168
Per-token loss scaled by world size: 0.0012690601870417595
Per-token loss scaled by world size: 0.0005999550921842456
Per-token loss scaled by world size: 0.0011914812494069338
Per-token loss scaled by world size: 0.0010568759171292186
Epoch: 0, Step: 206, Rank: 1, loss = 1.6295939683914185
Epoch: 0, Step: 206, Rank: 3, loss = 1.1404837369918823
Epoch: 0, Step: 206, Rank: 2, loss = 0.7888350486755371
Epoch: 0, Step: 206, Rank: 5, loss = 1.1237528324127197
Epoch: 0, Step: 206, Rank: 6, loss = 0.9358636140823364
Epoch: 0, Step: 206, Rank: 4, loss = 1.0550566911697388
Epoch: 0, Step: 206, Rank: 0, loss = 0.5312602519989014
Per-token loss scaled by world size: 0.0006703666877001524
Epoch: 0, Step: 206, Rank: 7, loss = 0.5936096906661987
[2024-06-27 16:45:45,759] [INFO] [logging.py:96:log_dist] [Rank 0] step=206, skipped=0, lr=[1.0701298701298701e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:45,833] [INFO] [timer.py:260:stop] epoch=0/micro_step=206/global_step=206, RunningAvgSamplesPerSec=95.48476584776098, CurrSamplesPerSec=94.96105106536425, MemAllocated=22.26GB, MaxMemAllocated=28.61GB
throughput: 94.87148661371907 samples/s, lr: 1.0701298701298701e-05, loss: 0.5312602519989014 cuda_mem_allocated: 22.258323192596436 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7084.0 batch_size: 86.0 total loss: 0.9748069643974304
Epoch 0: 97% 206/213 [04:35<00:07, 1.06s/it]
Per-token loss scaled by world size: 0.0016551455482840538
Per-token loss scaled by world size: 0.0008989550988189876
Per-token loss scaled by world size: 0.0005099721602164209
Per-token loss scaled by world size: 0.0011714991414919496
Per-token loss scaled by world size: 0.001012127031572163
Per-token loss scaled by world size: 0.0009847930632531643
Per-token loss scaled by world size: 0.0016896437155082822
Epoch: 0, Step: 207, Rank: 1, loss = 1.5103203058242798
Epoch: 0, Step: 207, Rank: 7, loss = 0.46534958481788635
Epoch: 0, Step: 207, Rank: 3, loss = 0.8202965259552002
Epoch: 0, Step: 207, Rank: 5, loss = 1.0689929723739624
Per-token loss scaled by world size: 0.001681935740634799
Epoch: 0, Step: 207, Rank: 6, loss = 0.9235659241676331
Epoch: 0, Step: 207, Rank: 4, loss = 0.8986237049102783
Epoch: 0, Step: 207, Rank: 0, loss = 1.5417999029159546
Epoch: 0, Step: 207, Rank: 2, loss = 1.5347663164138794
[2024-06-27 16:45:46,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=207, skipped=0, lr=[1.0753246753246754e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:46,894] [INFO] [timer.py:260:stop] epoch=0/micro_step=207/global_step=207, RunningAvgSamplesPerSec=95.47979360160143, CurrSamplesPerSec=94.47616973900475, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 94.39559376621257 samples/s, lr: 1.0753246753246754e-05, loss: 1.5417999029159546 cuda_mem_allocated: 22.265955924987793 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7300.0 batch_size: 87.0 total loss: 1.0954643487930298
Epoch 0: 97% 207/213 [04:36<00:06, 1.06s/it]
Per-token loss scaled by world size: 0.0009553829440847039
Per-token loss scaled by world size: 0.0008654401754029095
Per-token loss scaled by world size: 0.0007160137756727636
Per-token loss scaled by world size: 0.002981931436806917
Per-token loss scaled by world size: 0.0008299394976347685
Per-token loss scaled by world size: 0.0015533638652414083
Per-token loss scaled by world size: 0.00061251618899405
Epoch: 0, Step: 208, Rank: 3, loss = 0.8504031300544739
Per-token loss scaled by world size: 0.0010385990608483553
Epoch: 0, Step: 208, Rank: 4, loss = 0.9387831687927246
Epoch: 0, Step: 208, Rank: 5, loss = 0.7035730481147766
Epoch: 0, Step: 208, Rank: 0, loss = 1.5263742208480835
Epoch: 0, Step: 208, Rank: 1, loss = 2.9301204681396484
Epoch: 0, Step: 208, Rank: 6, loss = 0.8155192732810974
Epoch: 0, Step: 208, Rank: 7, loss = 0.6018736958503723
Epoch: 0, Step: 208, Rank: 2, loss = 1.0205533504486084
[2024-06-27 16:45:47,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=208, skipped=0, lr=[1.0805194805194805e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:45:47,955] [INFO] [timer.py:260:stop] epoch=0/micro_step=208/global_step=208, RunningAvgSamplesPerSec=95.47857870669601, CurrSamplesPerSec=95.23017636304725, MemAllocated=22.31GB, MaxMemAllocated=28.61GB
throughput: 95.14421778673592 samples/s, lr: 1.0805194805194805e-05, loss: 1.5263742208480835 cuda_mem_allocated: 22.306028842926025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7861.0 batch_size: 77.0 total loss: 1.1734000444412231
Saving model in huggingface format at samples_seen: 19968
Model saved in /instructlab/training_output/hf_format/samples_19968
[16:46:07] INFO saving took 19.810648202896118 seconds utils.py:192
Epoch 0: 98% 208/213 [04:57<00:35, 7.01s/it]
Per-token loss scaled by world size: 0.002101774327456951
Per-token loss scaled by world size: 0.0013186883879825473
Per-token loss scaled by world size: 0.000686340790707618
Per-token loss scaled by world size: 0.0009648792911320925
Per-token loss scaled by world size: 1.8539762095315382e-05
Per-token loss scaled by world size: 0.0010202389676123857
Per-token loss scaled by world size: 0.0011470072204247117
Epoch: 0, Step: 209, Rank: 3, loss = 1.6288750171661377
Epoch: 0, Step: 209, Rank: 7, loss = 0.5319141149520874
Epoch: 0, Step: 209, Rank: 1, loss = 1.0219835042953491
Epoch: 0, Step: 209, Rank: 5, loss = 0.7477814555168152
Epoch: 0, Step: 209, Rank: 4, loss = 0.79068523645401
Epoch: 0, Step: 209, Rank: 6, loss = 0.8889305591583252
Epoch: 0, Step: 209, Rank: 0, loss = 0.014368315227329731
Per-token loss scaled by world size: 0.0010877520544454455
Epoch: 0, Step: 209, Rank: 2, loss = 0.8430078029632568
[2024-06-27 16:46:08,753] [INFO] [logging.py:96:log_dist] [Rank 0] step=209, skipped=0, lr=[1.0857142857142858e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:08,826] [INFO] [timer.py:260:stop] epoch=0/micro_step=209/global_step=209, RunningAvgSamplesPerSec=95.47899111663284, CurrSamplesPerSec=95.56402359212984, MemAllocated=22.18GB, MaxMemAllocated=28.61GB
throughput: 95.4660977708315 samples/s, lr: 1.0857142857142858e-05, loss: 0.014368315227329731 cuda_mem_allocated: 22.18426275253296 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6200.0 batch_size: 80.0 total loss: 0.8084432482719421
Epoch 0: 98% 209/213 [04:58<00:20, 5.22s/it]
Per-token loss scaled by world size: 0.0018051369115710258
Per-token loss scaled by world size: 0.001097288215532899
Per-token loss scaled by world size: 0.0015073013491928577
Per-token loss scaled by world size: 0.0007449850090779364
Per-token loss scaled by world size: 0.0011214318219572306
Per-token loss scaled by world size: 0.0006019007414579391
Per-token loss scaled by world size: 0.001032371772453189
Epoch: 0, Step: 210, Rank: 4, loss = 1.0904301404953003
Epoch: 0, Step: 210, Rank: 1, loss = 1.793854832649231
Epoch: 0, Step: 210, Rank: 6, loss = 0.740328848361969
Epoch: 0, Step: 210, Rank: 3, loss = 1.4978806972503662
Epoch: 0, Step: 210, Rank: 2, loss = 1.0259194374084473
Epoch: 0, Step: 210, Rank: 5, loss = 1.1144229173660278
Epoch: 0, Step: 210, Rank: 7, loss = 0.5981388688087463
Per-token loss scaled by world size: 0.0009304205304943025
Epoch: 0, Step: 210, Rank: 0, loss = 0.9246054291725159
[2024-06-27 16:46:09,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=0, lr=[1.0909090909090909e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:09,886] [INFO] [timer.py:260:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=95.47980753281173, CurrSamplesPerSec=95.64910678717274, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 95.55998658634613 samples/s, lr: 1.0909090909090909e-05, loss: 0.9246054291725159 cuda_mem_allocated: 22.302927494049072 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7950.0 batch_size: 79.0 total loss: 1.09819757938385
Epoch 0: 99% 210/213 [04:59<00:11, 3.97s/it]
Per-token loss scaled by world size: 0.002172557869926095
Per-token loss scaled by world size: 0.0012432762887328863
Per-token loss scaled by world size: 0.000700995558872819
Per-token loss scaled by world size: 0.0013389564119279385
Per-token loss scaled by world size: 0.0005327127291820943
Per-token loss scaled by world size: 0.001046357792802155
Per-token loss scaled by world size: 0.001193067990243435
Epoch: 0, Step: 211, Rank: 5, loss = 1.0844477415084839
Epoch: 0, Step: 211, Rank: 7, loss = 0.4646586775779724
Epoch: 0, Step: 211, Rank: 1, loss = 0.6114434003829956
Epoch: 0, Step: 211, Rank: 2, loss = 1.8950135707855225
Epoch: 0, Step: 211, Rank: 0, loss = 1.1679047346115112
Epoch: 0, Step: 211, Rank: 3, loss = 1.0406535863876343
Per-token loss scaled by world size: 0.0007425962248817086
Epoch: 0, Step: 211, Rank: 4, loss = 0.9126855731010437
Epoch: 0, Step: 211, Rank: 6, loss = 0.6477295756340027
[2024-06-27 16:46:10,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=211, skipped=0, lr=[1.0961038961038962e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:10,942] [INFO] [timer.py:260:stop] epoch=0/micro_step=211/global_step=211, RunningAvgSamplesPerSec=95.48116084184663, CurrSamplesPerSec=95.76348544461233, MemAllocated=22.29GB, MaxMemAllocated=28.61GB
throughput: 95.6756752132934 samples/s, lr: 1.0961038961038962e-05, loss: 1.1679047346115112 cuda_mem_allocated: 22.292194366455078 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6978.0 batch_size: 86.0 total loss: 0.9780671000480652
Epoch 0: 99% 211/213 [05:00<00:06, 3.10s/it]
Per-token loss scaled by world size: 0.0019772422965615988
Per-token loss scaled by world size: 0.0015252706361934543
Per-token loss scaled by world size: 0.0007615063805133104
Per-token loss scaled by world size: 0.0018605277873575687
Per-token loss scaled by world size: 0.0020414446480572224
Per-token loss scaled by world size: 0.001390826073475182
Per-token loss scaled by world size: 2.7286825570627116e-05
Epoch: 0, Step: 212, Rank: 5, loss = 1.547192096710205
Epoch: 0, Step: 212, Rank: 1, loss = 1.1935242414474487
Epoch: 0, Step: 212, Rank: 2, loss = 0.5958787202835083
Epoch: 0, Step: 212, Rank: 4, loss = 1.4558629989624023
Epoch: 0, Step: 212, Rank: 3, loss = 1.5974303483963013
Epoch: 0, Step: 212, Rank: 6, loss = 1.0883214473724365
Epoch: 0, Step: 212, Rank: 0, loss = 0.02135194092988968
Per-token loss scaled by world size: 0.0015294674085453153
Epoch: 0, Step: 212, Rank: 7, loss = 1.1968082189559937
[2024-06-27 16:46:11,920] [INFO] [logging.py:96:log_dist] [Rank 0] step=212, skipped=0, lr=[1.1012987012987013e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:11,996] [INFO] [timer.py:260:stop] epoch=0/micro_step=212/global_step=212, RunningAvgSamplesPerSec=95.48374005385998, CurrSamplesPerSec=96.02587061734143, MemAllocated=22.22GB, MaxMemAllocated=28.61GB
throughput: 95.93849310685054 samples/s, lr: 1.1012987012987013e-05, loss: 0.02135194092988968 cuda_mem_allocated: 22.219921112060547 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6260.0 batch_size: 77.0 total loss: 1.0870462656021118
Epoch 0: 100% 212/213 [05:01<00:02, 2.48s/it]
Per-token loss scaled by world size: 0.0009601730853319168
Per-token loss scaled by world size: 0.0009028929634951055
Per-token loss scaled by world size: 0.0016907410463318229
Per-token loss scaled by world size: 0.0004221251292619854
Per-token loss scaled by world size: 0.0012152851559221745
Per-token loss scaled by world size: 0.000740953313652426
Per-token loss scaled by world size: 0.0007369752856902778
Epoch: 0, Step: 213, Rank: 6, loss = 0.9365257024765015
Epoch: 0, Step: 213, Rank: 1, loss = 1.260554552078247
Epoch: 0, Step: 213, Rank: 3, loss = 0.9959395527839661
Epoch: 0, Step: 213, Rank: 0, loss = 1.7537211179733276
Epoch: 0, Step: 213, Rank: 7, loss = 0.4378492832183838
Epoch: 0, Step: 213, Rank: 4, loss = 0.7685538530349731
Epoch: 0, Step: 213, Rank: 5, loss = 0.7644276022911072
Per-token loss scaled by world size: 0.0012419105041772127
Epoch: 0, Step: 213, Rank: 2, loss = 1.288171648979187
[2024-06-27 16:46:12,984] [INFO] [logging.py:96:log_dist] [Rank 0] step=213, skipped=0, lr=[1.1064935064935066e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:13,058] [INFO] [timer.py:260:stop] epoch=0/micro_step=213/global_step=213, RunningAvgSamplesPerSec=95.47816445424138, CurrSamplesPerSec=94.32154009119824, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 94.22757610867174 samples/s, lr: 1.1064935064935066e-05, loss: 1.7537211179733276 cuda_mem_allocated: 22.303642749786377 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8298.0 batch_size: 86.0 total loss: 1.025717854499817
Epoch 0: 100% 213/213 [05:02<00:00, 1.42s/it]
total tokens: 2367 num samples: 9 num padding tokens: 128 - rank: 3 max len: 263 min len: 232 avg len: 248.77777777777777 num_loss_counted_tokens: 658
total tokens: 2400 num samples: 15 num padding tokens: 230 - rank: 6 max len: 160 min len: 128 avg len: 144.66666666666666 num_loss_counted_tokens: 768
total tokens: 2390 num samples: 10 num padding tokens: 136 - rank: 3 max len: 239 min len: 212 avg len: 225.4 num_loss_counted_tokens: 967
total tokens: 2280 num samples: 6 num padding tokens: 264 - rank: 3 max len: 380 min len: 307 avg len: 336.0 num_loss_counted_tokens: 1014
total tokens: 2352 num samples: 8 num padding tokens: 104 - rank: 3 max len: 294 min len: 265 avg len: 281.0 num_loss_counted_tokens: 816
total tokens: 2464 num samples: 16 num padding tokens: 213 - rank: 6 max len: 154 min len: 131 avg len: 140.6875 num_loss_counted_tokens: 893
total tokens: 2366 num samples: 14 num padding tokens: 169 - rank: 6 max len: 169 min len: 147 avg len: 156.92857142857142 num_loss_counted_tokens: 905
total tokens: 2388 num samples: 12 num padding tokens: 235 - rank: 5 max len: 199 min len: 161 avg len: 179.41666666666666 num_loss_counted_tokens: 921
total tokens: 2366 num samples: 13 num padding tokens: 146 - rank: 5 max len: 182 min len: 157 avg len: 170.76923076923077 num_loss_counted_tokens: 808
total tokens: 2349 num samples: 9 num padding tokens: 302 - rank: 3 max len: 261 min len: 211 avg len: 227.44444444444446 num_loss_counted_tokens: 793
total tokens: 2457 num samples: 9 num padding tokens: 205 - rank: 3 max len: 273 min len: 233 avg len: 250.22222222222223 num_loss_counted_tokens: 1173
total tokens: 2490 num samples: 10 num padding tokens: 84 - rank: 3 max len: 249 min len: 229 avg len: 240.6 num_loss_counted_tokens: 1008
total tokens: 2440 num samples: 8 num padding tokens: 165 - rank: 3 max len: 305 min len: 265 avg len: 284.375 num_loss_counted_tokens: 1273
total tokens: 2401 num samples: 7 num padding tokens: 227 - rank: 3 max len: 343 min len: 274 avg len: 310.57142857142856 num_loss_counted_tokens: 1012
total tokens: 2375 num samples: 19 num padding tokens: 376 - rank: 7 max len: 125 min len: 77 avg len: 105.21052631578948 num_loss_counted_tokens: 492
total tokens: 2375 num samples: 19 num padding tokens: 388 - rank: 7 max len: 125 min len: 85 avg len: 104.57894736842105 num_loss_counted_tokens: 527
total tokens: 2496 num samples: 8 num padding tokens: 155 - rank: 3 max len: 312 min len: 270 avg len: 292.625 num_loss_counted_tokens: 1116
total tokens: 2394 num samples: 14 num padding tokens: 191 - rank: 6 max len: 171 min len: 135 avg len: 157.35714285714286 num_loss_counted_tokens: 872
total tokens: 2288 num samples: 8 num padding tokens: 140 - rank: 3 max len: 286 min len: 252 avg len: 268.5 num_loss_counted_tokens: 804
total tokens: 2096 num samples: 4 num padding tokens: 201 - rank: 0 max len: 524 min len: 427 avg len: 473.75 num_loss_counted_tokens: 650
total tokens: 2415 num samples: 15 num padding tokens: 321 - rank: 6 max len: 161 min len: 125 avg len: 139.6 num_loss_counted_tokens: 764
total tokens: 2430 num samples: 9 num padding tokens: 221 - rank: 3 max len: 270 min len: 231 avg len: 245.44444444444446 num_loss_counted_tokens: 921
total tokens: 2534 num samples: 14 num padding tokens: 284 - rank: 6 max len: 181 min len: 134 avg len: 160.71428571428572 num_loss_counted_tokens: 978
total tokens: 2504 num samples: 8 num padding tokens: 240 - rank: 3 max len: 313 min len: 250 avg len: 283.0 num_loss_counted_tokens: 1218
total tokens: 2088 num samples: 3 num padding tokens: 126 - rank: 0 max len: 696 min len: 604 avg len: 654.0 num_loss_counted_tokens: 1164
total tokens: 2358 num samples: 9 num padding tokens: 124 - rank: 3 max len: 262 min len: 232 avg len: 248.22222222222223 num_loss_counted_tokens: 1150
total tokens: 2403 num samples: 9 num padding tokens: 160 - rank: 3 max len: 267 min len: 219 avg len: 249.22222222222223 num_loss_counted_tokens: 1080
total tokens: 2445 num samples: 15 num padding tokens: 271 - rank: 6 max len: 163 min len: 134 avg len: 144.93333333333334 num_loss_counted_tokens: 998
total tokens: 1698 num samples: 2 num padding tokens: 168 - rank: 0 max len: 849 min len: 681 avg len: 765.0 num_loss_counted_tokens: 746
total tokens: 2509 num samples: 13 num padding tokens: 169 - rank: 5 max len: 193 min len: 169 avg len: 180.0 num_loss_counted_tokens: 803
total tokens: 2475 num samples: 11 num padding tokens: 162 - rank: 4 max len: 225 min len: 200 avg len: 210.27272727272728 num_loss_counted_tokens: 781
total tokens: 2499 num samples: 17 num padding tokens: 510 - rank: 7 max len: 147 min len: 88 avg len: 117.0 num_loss_counted_tokens: 569
total tokens: 2376 num samples: 12 num padding tokens: 283 - rank: 5 max len: 198 min len: 163 avg len: 174.41666666666666 num_loss_counted_tokens: 832
total tokens: 2430 num samples: 6 num padding tokens: 273 - rank: 1 max len: 405 min len: 340 avg len: 359.5 num_loss_counted_tokens: 1104
total tokens: 2412 num samples: 9 num padding tokens: 246 - rank: 3 max len: 268 min len: 213 avg len: 240.66666666666666 num_loss_counted_tokens: 994
total tokens: 2490 num samples: 15 num padding tokens: 222 - rank: 6 max len: 166 min len: 134 avg len: 151.2 num_loss_counted_tokens: 715
total tokens: 1744 num samples: 2 num padding tokens: 300 - rank: 0 max len: 872 min len: 572 avg len: 722.0 num_loss_counted_tokens: 610
total tokens: 2398 num samples: 11 num padding tokens: 219 - rank: 5 max len: 218 min len: 182 avg len: 198.0909090909091 num_loss_counted_tokens: 1133
total tokens: 2400 num samples: 15 num padding tokens: 258 - rank: 6 max len: 160 min len: 126 avg len: 142.8 num_loss_counted_tokens: 795
total tokens: 2076 num samples: 3 num padding tokens: 262 - rank: 0 max len: 692 min len: 499 avg len: 604.6666666666666 num_loss_counted_tokens: 1518
total tokens: 2470 num samples: 13 num padding tokens: 320 - rank: 6 max len: 190 min len: 143 avg len: 165.3846153846154 num_loss_counted_tokens: 681
total tokens: 2478 num samples: 14 num padding tokens: 281 - rank: 6 max len: 177 min len: 139 avg len: 156.92857142857142 num_loss_counted_tokens: 939
total tokens: 2416 num samples: 8 num padding tokens: 289 - rank: 4 max len: 302 min len: 239 avg len: 265.875 num_loss_counted_tokens: 1151
total tokens: 2376 num samples: 9 num padding tokens: 274 - rank: 4 max len: 264 min len: 195 avg len: 233.55555555555554 num_loss_counted_tokens: 689
total tokens: 2418 num samples: 13 num padding tokens: 137 - rank: 5 max len: 186 min len: 163 avg len: 175.46153846153845 num_loss_counted_tokens: 977
total tokens: 2278 num samples: 17 num padding tokens: 391 - rank: 7 max len: 134 min len: 79 avg len: 111.0 num_loss_counted_tokens: 510
total tokens: 2214 num samples: 18 num padding tokens: 340 - rank: 7 max len: 123 min len: 86 avg len: 104.11111111111111 num_loss_counted_tokens: 503
total tokens: 2376 num samples: 9 num padding tokens: 186 - rank: 4 max len: 264 min len: 226 avg len: 243.33333333333334 num_loss_counted_tokens: 775
total tokens: 2451 num samples: 19 num padding tokens: 434 - rank: 7 max len: 129 min len: 83 avg len: 106.15789473684211 num_loss_counted_tokens: 583
total tokens: 2442 num samples: 11 num padding tokens: 215 - rank: 4 max len: 222 min len: 186 avg len: 202.45454545454547 num_loss_counted_tokens: 990
total tokens: 2412 num samples: 12 num padding tokens: 175 - rank: 5 max len: 201 min len: 166 avg len: 186.41666666666666 num_loss_counted_tokens: 967
total tokens: 2530 num samples: 11 num padding tokens: 156 - rank: 4 max len: 230 min len: 202 avg len: 215.8181818181818 num_loss_counted_tokens: 985
total tokens: 2506 num samples: 14 num padding tokens: 348 - rank: 6 max len: 179 min len: 135 avg len: 154.14285714285714 num_loss_counted_tokens: 786
total tokens: 2432 num samples: 16 num padding tokens: 165 - rank: 6 max len: 152 min len: 124 avg len: 141.6875 num_loss_counted_tokens: 862
total tokens: 2475 num samples: 15 num padding tokens: 245 - rank: 6 max len: 165 min len: 123 avg len: 148.66666666666666 num_loss_counted_tokens: 817
total tokens: 2508 num samples: 12 num padding tokens: 110 - rank: 5 max len: 209 min len: 194 avg len: 199.83333333333334 num_loss_counted_tokens: 1069
total tokens: 2532 num samples: 12 num padding tokens: 144 - rank: 5 max len: 211 min len: 188 avg len: 199.0 num_loss_counted_tokens: 801
total tokens: 2284 num samples: 4 num padding tokens: 522 - rank: 1 max len: 571 min len: 355 avg len: 440.5 num_loss_counted_tokens: 1112
total tokens: 2379 num samples: 13 num padding tokens: 180 - rank: 5 max len: 183 min len: 154 avg len: 169.15384615384616 num_loss_counted_tokens: 928
total tokens: 2340 num samples: 10 num padding tokens: 385 - rank: 5 max len: 234 min len: 171 avg len: 195.5 num_loss_counted_tokens: 751
total tokens: 2288 num samples: 8 num padding tokens: 180 - rank: 3 max len: 286 min len: 246 avg len: 263.5 num_loss_counted_tokens: 830
total tokens: 2409 num samples: 11 num padding tokens: 167 - rank: 5 max len: 219 min len: 192 avg len: 203.8181818181818 num_loss_counted_tokens: 965
total tokens: 2295 num samples: 5 num padding tokens: 350 - rank: 1 max len: 459 min len: 336 avg len: 389.0 num_loss_counted_tokens: 1189
total tokens: 2500 num samples: 20 num padding tokens: 471 - rank: 7 max len: 125 min len: 84 avg len: 101.45 num_loss_counted_tokens: 562
total tokens: 2520 num samples: 12 num padding tokens: 134 - rank: 4 max len: 210 min len: 185 avg len: 198.83333333333334 num_loss_counted_tokens: 1056
total tokens: 2148 num samples: 4 num padding tokens: 193 - rank: 1 max len: 537 min len: 447 avg len: 488.75 num_loss_counted_tokens: 1435
total tokens: 2414 num samples: 17 num padding tokens: 438 - rank: 7 max len: 142 min len: 87 avg len: 116.23529411764706 num_loss_counted_tokens: 586
total tokens: 2460 num samples: 12 num padding tokens: 127 - rank: 4 max len: 205 min len: 184 avg len: 194.41666666666666 num_loss_counted_tokens: 919
total tokens: 2412 num samples: 12 num padding tokens: 222 - rank: 5 max len: 201 min len: 167 avg len: 182.5 num_loss_counted_tokens: 939
total tokens: 2475 num samples: 15 num padding tokens: 192 - rank: 6 max len: 165 min len: 139 avg len: 152.2 num_loss_counted_tokens: 820
total tokens: 2400 num samples: 16 num padding tokens: 255 - rank: 6 max len: 150 min len: 119 avg len: 134.0625 num_loss_counted_tokens: 793
total tokens: 2398 num samples: 11 num padding tokens: 180 - rank: 4 max len: 218 min len: 190 avg len: 201.63636363636363 num_loss_counted_tokens: 935
total tokens: 2324 num samples: 7 num padding tokens: 297 - rank: 2 max len: 332 min len: 268 avg len: 289.57142857142856 num_loss_counted_tokens: 780
total tokens: 2360 num samples: 10 num padding tokens: 236 - rank: 4 max len: 236 min len: 191 avg len: 212.4 num_loss_counted_tokens: 926
total tokens: 2430 num samples: 18 num padding tokens: 406 - rank: 7 max len: 135 min len: 83 avg len: 112.44444444444444 num_loss_counted_tokens: 600
total tokens: 2366 num samples: 14 num padding tokens: 280 - rank: 6 max len: 169 min len: 135 avg len: 149.0 num_loss_counted_tokens: 825
total tokens: 2410 num samples: 10 num padding tokens: 154 - rank: 4 max len: 241 min len: 213 avg len: 225.6 num_loss_counted_tokens: 709
total tokens: 2261 num samples: 17 num padding tokens: 381 - rank: 7 max len: 133 min len: 81 avg len: 110.58823529411765 num_loss_counted_tokens: 536
total tokens: 2489 num samples: 19 num padding tokens: 513 - rank: 7 max len: 131 min len: 80 avg len: 104.0 num_loss_counted_tokens: 572
total tokens: 2028 num samples: 3 num padding tokens: 171 - rank: 0 max len: 676 min len: 526 avg len: 619.0 num_loss_counted_tokens: 732
total tokens: 2464 num samples: 14 num padding tokens: 137 - rank: 5 max len: 176 min len: 151 avg len: 166.21428571428572 num_loss_counted_tokens: 837
total tokens: 2310 num samples: 10 num padding tokens: 141 - rank: 4 max len: 231 min len: 208 avg len: 216.9 num_loss_counted_tokens: 998
total tokens: 2484 num samples: 6 num padding tokens: 337 - rank: 1 max len: 414 min len: 300 avg len: 357.8333333333333 num_loss_counted_tokens: 1072
total tokens: 1950 num samples: 2 num padding tokens: 393 - rank: 0 max len: 975 min len: 582 avg len: 778.5 num_loss_counted_tokens: 304
total tokens: 2125 num samples: 5 num padding tokens: 201 - rank: 1 max len: 425 min len: 348 avg len: 384.8 num_loss_counted_tokens: 1293
total tokens: 1956 num samples: 3 num padding tokens: 204 - rank: 0 max len: 652 min len: 503 avg len: 584.0 num_loss_counted_tokens: 1236
total tokens: 2346 num samples: 6 num padding tokens: 238 - rank: 1 max len: 391 min len: 323 avg len: 351.3333333333333 num_loss_counted_tokens: 1103
total tokens: 2208 num samples: 3 num padding tokens: 365 - rank: 0 max len: 736 min len: 525 avg len: 614.3333333333334 num_loss_counted_tokens: 716
total tokens: 2235 num samples: 3 num padding tokens: 338 - rank: 0 max len: 745 min len: 520 avg len: 632.3333333333334 num_loss_counted_tokens: 926
total tokens: 2457 num samples: 13 num padding tokens: 123 - rank: 5 max len: 189 min len: 165 avg len: 179.53846153846155 num_loss_counted_tokens: 762
total tokens: 2472 num samples: 12 num padding tokens: 233 - rank: 5 max len: 206 min len: 173 avg len: 186.58333333333334 num_loss_counted_tokens: 1002
total tokens: 2205 num samples: 3 num padding tokens: 384 - rank: 0 max len: 735 min len: 542 avg len: 607.0 num_loss_counted_tokens: 1090
total tokens: 2195 num samples: 5 num padding tokens: 77 - rank: 1 max len: 439 min len: 397 avg len: 423.6 num_loss_counted_tokens: 1624
total tokens: 2532 num samples: 6 num padding tokens: 288 - rank: 1 max len: 422 min len: 333 avg len: 374.0 num_loss_counted_tokens: 1594
total tokens: 2484 num samples: 4 num padding tokens: 231 - rank: 0 max len: 621 min len: 465 avg len: 563.25 num_loss_counted_tokens: 1292
total tokens: 2470 num samples: 13 num padding tokens: 184 - rank: 5 max len: 190 min len: 160 avg len: 175.84615384615384 num_loss_counted_tokens: 987
total tokens: 2412 num samples: 18 num padding tokens: 395 - rank: 7 max len: 134 min len: 73 avg len: 112.05555555555556 num_loss_counted_tokens: 593
total tokens: 2500 num samples: 10 num padding tokens: 208 - rank: 4 max len: 250 min len: 210 avg len: 229.2 num_loss_counted_tokens: 767
total tokens: 2317 num samples: 7 num padding tokens: 210 - rank: 2 max len: 331 min len: 275 avg len: 301.0 num_loss_counted_tokens: 1042
total tokens: 2520 num samples: 20 num padding tokens: 362 - rank: 7 max len: 126 min len: 84 avg len: 107.9 num_loss_counted_tokens: 604
total tokens: 2031 num samples: 3 num padding tokens: 304 - rank: 0 max len: 677 min len: 511 avg len: 575.6666666666666 num_loss_counted_tokens: 866
total tokens: 2465 num samples: 5 num padding tokens: 532 - rank: 1 max len: 493 min len: 335 avg len: 386.6 num_loss_counted_tokens: 709
total tokens: 2528 num samples: 8 num padding tokens: 297 - rank: 2 max len: 316 min len: 241 avg len: 278.875 num_loss_counted_tokens: 1159
total tokens: 2225 num samples: 5 num padding tokens: 140 - rank: 2 max len: 445 min len: 390 avg len: 417.0 num_loss_counted_tokens: 956
total tokens: 2450 num samples: 7 num padding tokens: 170 - rank: 2 max len: 350 min len: 295 avg len: 325.7142857142857 num_loss_counted_tokens: 1273
total tokens: 2344 num samples: 8 num padding tokens: 220 - rank: 2 max len: 293 min len: 250 avg len: 265.5 num_loss_counted_tokens: 998 | |
total tokens: 2317 num samples: 7 num padding tokens: 110 - rank: 2 max len: 331 min len: 293 avg len: 315.2857142857143 num_loss_counted_tokens: 1301 | |
total tokens: 2130 num samples: 5 num padding tokens: 191 - rank: 1 max len: 426 min len: 349 avg len: 387.8 num_loss_counted_tokens: 1021 | |
total tokens: 1926 num samples: 3 num padding tokens: 384 - rank: 0 max len: 642 min len: 442 avg len: 514.0 num_loss_counted_tokens: 579 | |
total tokens: 2508 num samples: 19 num padding tokens: 427 - rank: 7 max len: 132 min len: 84 avg len: 109.52631578947368 num_loss_counted_tokens: 553 | |
total tokens: 1586 num samples: 13 num padding tokens: 184 - rank: 7 max len: 122 min len: 88 avg len: 107.84615384615384 num_loss_counted_tokens: 411 | |
total tokens: 2250 num samples: 6 num padding tokens: 83 - rank: 1 max len: 375 min len: 344 avg len: 361.1666666666667 num_loss_counted_tokens: 1255 | |
total tokens: 2440 num samples: 8 num padding tokens: 173 - rank: 2 max len: 305 min len: 268 avg len: 283.375 num_loss_counted_tokens: 1086 | |
total tokens: 2436 num samples: 7 num padding tokens: 190 - rank: 2 max len: 348 min len: 308 avg len: 320.85714285714283 num_loss_counted_tokens: 1271 | |
total tokens: 2190 num samples: 3 num padding tokens: 152 - rank: 0 max len: 730 min len: 643 avg len: 679.3333333333334 num_loss_counted_tokens: 1205 | |
total tokens: 2466 num samples: 6 num padding tokens: 155 - rank: 1 max len: 411 min len: 348 avg len: 385.1666666666667 num_loss_counted_tokens: 1020 | |
total tokens: 2430 num samples: 10 num padding tokens: 241 - rank: 4 max len: 243 min len: 203 avg len: 218.9 num_loss_counted_tokens: 1092 | |
total tokens: 2478 num samples: 21 num padding tokens: 260 - rank: 7 max len: 118 min len: 91 avg len: 105.61904761904762 num_loss_counted_tokens: 583 | |
total tokens: 2360 num samples: 5 num padding tokens: 315 - rank: 1 max len: 472 min len: 383 avg len: 409.0 num_loss_counted_tokens: 1251 | |
total tokens: 2134 num samples: 2 num padding tokens: 586 - rank: 0 max len: 1067 min len: 481 avg len: 774.0 num_loss_counted_tokens: 257 | |
total tokens: 2430 num samples: 18 num padding tokens: 385 - rank: 7 max len: 135 min len: 92 avg len: 113.61111111111111 num_loss_counted_tokens: 604 | |
total tokens: 2430 num samples: 9 num padding tokens: 230 - rank: 4 max len: 270 min len: 219 avg len: 244.44444444444446 num_loss_counted_tokens: 1001 | |
total tokens: 2424 num samples: 3 num padding tokens: 601 - rank: 0 max len: 808 min len: 498 avg len: 607.6666666666666 num_loss_counted_tokens: 790 | |
total tokens: 2508 num samples: 12 num padding tokens: 241 - rank: 4 max len: 209 min len: 176 avg len: 188.91666666666666 num_loss_counted_tokens: 958 | |
total tokens: 2136 num samples: 4 num padding tokens: 281 - rank: 1 max len: 534 min len: 371 avg len: 463.75 num_loss_counted_tokens: 1133 | |
total tokens: 2471 num samples: 7 num padding tokens: 275 - rank: 2 max len: 353 min len: 287 avg len: 313.7142857142857 num_loss_counted_tokens: 861 | |
total tokens: 2196 num samples: 6 num padding tokens: 154 - rank: 2 max len: 366 min len: 324 avg len: 340.3333333333333 num_loss_counted_tokens: 856 | |
total tokens: 2316 num samples: 6 num padding tokens: 142 - rank: 2 max len: 386 min len: 323 avg len: 362.3333333333333 num_loss_counted_tokens: 1026 | |
total tokens: 2352 num samples: 7 num padding tokens: 211 - rank: 2 max len: 336 min len: 279 avg len: 305.85714285714283 num_loss_counted_tokens: 911 | |
total tokens: 2275 num samples: 7 num padding tokens: 214 - rank: 2 max len: 325 min len: 271 avg len: 294.42857142857144 num_loss_counted_tokens: 1031 | |
total tokens: 2530 num samples: 5 num padding tokens: 227 - rank: 1 max len: 506 min len: 408 avg len: 460.6 num_loss_counted_tokens: 1241 | |
total tokens: 2530 num samples: 11 num padding tokens: 132 - rank: 4 max len: 230 min len: 202 avg len: 218.0 num_loss_counted_tokens: 960 | |
total tokens: 2460 num samples: 5 num padding tokens: 206 - rank: 1 max len: 492 min len: 426 avg len: 450.8 num_loss_counted_tokens: 1434 | |
total tokens: 2527 num samples: 7 num padding tokens: 388 - rank: 2 max len: 361 min len: 275 avg len: 305.57142857142856 num_loss_counted_tokens: 954 | |
total tokens: 2448 num samples: 6 num padding tokens: 202 - rank: 2 max len: 408 min len: 344 avg len: 374.3333333333333 num_loss_counted_tokens: 1126 | |
total tokens: 2345 num samples: 7 num padding tokens: 175 - rank: 2 max len: 335 min len: 288 avg len: 310.0 num_loss_counted_tokens: 1133 | |
Per-token loss scaled by world size: 0.002296476624906063
Per-token loss scaled by world size: 0.001420582877472043
Per-token loss scaled by world size: 0.0012042293092235923
Per-token loss scaled by world size: 0.0011147839250043035
Per-token loss scaled by world size: 0.0005252897390164435
Per-token loss scaled by world size: 0.0008365436224266887
Per-token loss scaled by world size: 0.001103811664506793 | |
Epoch: 1, Step: 214, Rank: 1, loss = 1.7766118049621582 | |
Epoch: 1, Step: 214, Rank: 4, loss = 0.8624246716499329
Epoch: 1, Step: 214, Rank: 6, loss = 0.9316219091415405
Epoch: 1, Step: 214, Rank: 7, loss = 0.40637728571891785
Epoch: 1, Step: 214, Rank: 5, loss = 1.0989984273910522
Epoch: 1, Step: 214, Rank: 3, loss = 0.6471710801124573 | |
Epoch: 1, Step: 214, Rank: 0, loss = 0.8539363145828247 | |
Per-token loss scaled by world size: 0.0014279285678640008 | |
Epoch: 1, Step: 214, Rank: 2, loss = 1.1046812534332275 | |
[2024-06-27 16:46:14,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=214, skipped=0, lr=[1.1116883116883117e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:14,605] [INFO] [timer.py:260:stop] epoch=0/micro_step=214/global_step=214, RunningAvgSamplesPerSec=95.46949933726506, CurrSamplesPerSec=93.6756760403641, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 93.4701379605077 samples/s, lr: 1.1116883116883117e-05, loss: 0.8539363145828247 cuda_mem_allocated: 22.264524936676025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6189.0 batch_size: 83.0 total loss: 0.960227906703949 | |
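The numbers in each step block above fit a simple pattern (an inference from the logged values, not taken from the ilab source): each rank's reported loss equals its "per-token loss scaled by world size" times the step's num_loss_counted_tokens, divided by the world size (8 GPUs in this run), and the step's "total loss" is the mean of the eight rank losses. A minimal sketch checking this against step 214:

```python
# Sketch reconstructing step 214's loss accounting from the logged values.
# The formula is inferred from the log, not copied from the ilab code.

WORLD_SIZE = 8                      # 8 ranks appear in every step block
NUM_LOSS_COUNTED_TOKENS = 6189.0    # from the step-214 throughput line

def rank_loss(per_token_loss_scaled: float) -> float:
    """Recover a rank's reported loss from its scaled per-token loss."""
    return per_token_loss_scaled * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE

# Rank 1's scaled per-token loss for step 214 was 0.002296476624906063;
# its reported loss was 1.7766118049621582.
recovered = rank_loss(0.002296476624906063)

# "total loss" (0.960227906703949 in the log) is the average of the
# eight per-rank losses printed for the step.
rank_losses = [1.7766118049621582, 0.8624246716499329, 0.9316219091415405,
               0.40637728571891785, 1.0989984273910522, 0.6471710801124573,
               0.8539363145828247, 1.1046812534332275]
total_loss = sum(rank_losses) / WORLD_SIZE
```

Up to float32 rounding, `recovered` matches the logged rank-1 loss and `total_loss` matches the logged total, which is consistent with the same relationship holding for every step in this file.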
total tokens: 2496 num samples: 16 num padding tokens: 195 - rank: 6 max len: 156 min len: 136 avg len: 143.8125 num_loss_counted_tokens: 889 | |
total tokens: 2466 num samples: 9 num padding tokens: 182 - rank: 3 max len: 274 min len: 240 avg len: 253.77777777777777 num_loss_counted_tokens: 816 | |
total tokens: 1962 num samples: 3 num padding tokens: 496 - rank: 0 max len: 654 min len: 380 avg len: 488.6666666666667 num_loss_counted_tokens: 805 | |
total tokens: 2340 num samples: 18 num padding tokens: 437 - rank: 7 max len: 130 min len: 75 avg len: 105.72222222222223 num_loss_counted_tokens: 503 | |
total tokens: 2522 num samples: 13 num padding tokens: 186 - rank: 5 max len: 194 min len: 166 avg len: 179.69230769230768 num_loss_counted_tokens: 956 | |
total tokens: 2528 num samples: 8 num padding tokens: 143 - rank: 2 max len: 316 min len: 281 avg len: 298.125 num_loss_counted_tokens: 1421 | |
total tokens: 2520 num samples: 7 num padding tokens: 170 - rank: 1 max len: 360 min len: 318 avg len: 335.7142857142857 num_loss_counted_tokens: 1445 | |
total tokens: 2340 num samples: 10 num padding tokens: 145 - rank: 4 max len: 234 min len: 200 avg len: 219.5 num_loss_counted_tokens: 763 | |
Per-token loss scaled by world size: 0.000811442849226296 | |
Per-token loss scaled by world size: 0.0005398521898314357
Per-token loss scaled by world size: 0.0005947211175225675
Per-token loss scaled by world size: 0.0008146419422701001
Per-token loss scaled by world size: 0.0014103346038609743
Per-token loss scaled by world size: 0.0007928390405140817
Per-token loss scaled by world size: 0.0013934166636317968
Epoch: 1, Step: 215, Rank: 5, loss = 0.7699578404426575
Epoch: 1, Step: 215, Rank: 1, loss = 1.3221782445907593
Epoch: 1, Step: 215, Rank: 7, loss = 0.5122522711753845
Epoch: 1, Step: 215, Rank: 0, loss = 0.5643159747123718
Per-token loss scaled by world size: 0.0009408224141225219
Epoch: 1, Step: 215, Rank: 3, loss = 0.7729933857917786
Epoch: 1, Step: 215, Rank: 2, loss = 1.3382312059402466
Epoch: 1, Step: 215, Rank: 6, loss = 0.7523051500320435
Epoch: 1, Step: 215, Rank: 4, loss = 0.8927228450775146 | |
[2024-06-27 16:46:15,608] [INFO] [logging.py:96:log_dist] [Rank 0] step=215, skipped=0, lr=[1.116883116883117e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:15,681] [INFO] [timer.py:260:stop] epoch=0/micro_step=215/global_step=215, RunningAvgSamplesPerSec=95.46292928552982, CurrSamplesPerSec=94.0902015638502, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 93.98035309165259 samples/s, lr: 1.116883116883117e-05, loss: 0.5643159747123718 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7591.0 batch_size: 86.0 total loss: 0.8656196594238281 | |
total tokens: 2486 num samples: 11 num padding tokens: 138 - rank: 4 max len: 226 min len: 197 avg len: 213.45454545454547 num_loss_counted_tokens: 913 | |
total tokens: 2522 num samples: 13 num padding tokens: 153 - rank: 5 max len: 194 min len: 162 avg len: 182.23076923076923 num_loss_counted_tokens: 1098 | |
total tokens: 2352 num samples: 8 num padding tokens: 240 - rank: 3 max len: 294 min len: 242 avg len: 264.0 num_loss_counted_tokens: 994 | |
total tokens: 2512 num samples: 16 num padding tokens: 189 - rank: 6 max len: 157 min len: 131 avg len: 145.1875 num_loss_counted_tokens: 886 | |
total tokens: 2025 num samples: 3 num padding tokens: 260 - rank: 1 max len: 675 min len: 534 avg len: 588.3333333333334 num_loss_counted_tokens: 988 | |
total tokens: 2466 num samples: 6 num padding tokens: 422 - rank: 2 max len: 411 min len: 297 avg len: 340.6666666666667 num_loss_counted_tokens: 1036 | |
total tokens: 1879 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1879 min len: 1879 avg len: 1879.0 num_loss_counted_tokens: 24 | |
total tokens: 1965 num samples: 15 num padding tokens: 355 - rank: 7 max len: 131 min len: 89 avg len: 107.33333333333333 num_loss_counted_tokens: 455 | |
Per-token loss scaled by world size: 0.0005548645276576281
Per-token loss scaled by world size: 0.0017093719216063619
Per-token loss scaled by world size: 0.0008652483229525387
Per-token loss scaled by world size: 0.0018376084044575691
Per-token loss scaled by world size: 0.001039906986989081
Per-token loss scaled by world size: 0.0008855124469846487
Epoch: 1, Step: 216, Rank: 3, loss = 0.7651934623718262
Epoch: 1, Step: 216, Rank: 4, loss = 0.4794723093509674
Epoch: 1, Step: 216, Rank: 6, loss = 0.8986095786094666
Epoch: 1, Step: 216, Rank: 5, loss = 0.7476826906204224 | |
Epoch: 1, Step: 216, Rank: 1, loss = 1.4771109819412231 | |
Per-token loss scaled by world size: 0.0007243577274493873
Epoch: 1, Step: 216, Rank: 2, loss = 1.5879234075546265
Per-token loss scaled by world size: 0.0006256073829717934 | |
Epoch: 1, Step: 216, Rank: 7, loss = 0.54060298204422 | |
Epoch: 1, Step: 216, Rank: 0, loss = 0.6259356141090393 | |
[2024-06-27 16:46:16,675] [INFO] [logging.py:96:log_dist] [Rank 0] step=216, skipped=0, lr=[1.1220779220779221e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:16,748] [INFO] [timer.py:260:stop] epoch=0/micro_step=216/global_step=216, RunningAvgSamplesPerSec=95.46231217443578, CurrSamplesPerSec=95.33104909965964, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 95.13864259389388 samples/s, lr: 1.1220779220779221e-05, loss: 0.6259356141090393 cuda_mem_allocated: 22.21705961227417 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6913.0 batch_size: 74.0 total loss: 0.8903163075447083 | |
total tokens: 2322 num samples: 9 num padding tokens: 228 - rank: 4 max len: 258 min len: 207 avg len: 232.66666666666666 num_loss_counted_tokens: 745 | |
total tokens: 2233 num samples: 7 num padding tokens: 149 - rank: 3 max len: 319 min len: 281 avg len: 297.7142857142857 num_loss_counted_tokens: 974 | |
total tokens: 2358 num samples: 6 num padding tokens: 185 - rank: 2 max len: 393 min len: 323 avg len: 362.1666666666667 num_loss_counted_tokens: 1284 | |
total tokens: 2455 num samples: 5 num padding tokens: 162 - rank: 1 max len: 491 min len: 398 avg len: 458.6 num_loss_counted_tokens: 1158 | |
total tokens: 2190 num samples: 3 num padding tokens: 249 - rank: 0 max len: 730 min len: 493 avg len: 647.0 num_loss_counted_tokens: 1259 | |
total tokens: 2460 num samples: 12 num padding tokens: 192 - rank: 5 max len: 205 min len: 169 avg len: 189.0 num_loss_counted_tokens: 1103 | |
total tokens: 2500 num samples: 20 num padding tokens: 395 - rank: 7 max len: 125 min len: 74 avg len: 105.25 num_loss_counted_tokens: 559 | |
total tokens: 2366 num samples: 14 num padding tokens: 302 - rank: 6 max len: 169 min len: 127 avg len: 147.42857142857142 num_loss_counted_tokens: 839 | |
Per-token loss scaled by world size: 0.0013181151589378715
Per-token loss scaled by world size: 0.0006808307371102273
Per-token loss scaled by world size: 0.001662246068008244
Per-token loss scaled by world size: 0.0006483058095909655
Per-token loss scaled by world size: 0.0005777254118584096
Per-token loss scaled by world size: 0.0016224569408223033
Per-token loss scaled by world size: 0.0008047409355640411
Epoch: 1, Step: 217, Rank: 4, loss = 1.2174440622329712 | |
Epoch: 1, Step: 217, Rank: 1, loss = 1.5352920293807983 | |
Epoch: 1, Step: 217, Rank: 5, loss = 0.6288322806358337 | |
Epoch: 1, Step: 217, Rank: 2, loss = 0.7432788610458374 | |
Epoch: 1, Step: 217, Rank: 3, loss = 1.4985418319702148 | |
Epoch: 1, Step: 217, Rank: 7, loss = 0.5987914800643921 | |
Per-token loss scaled by world size: 0.000987947336398065
Epoch: 1, Step: 217, Rank: 0, loss = 0.5336016416549683
Epoch: 1, Step: 217, Rank: 6, loss = 0.9124928116798401 | |
[2024-06-27 16:46:17,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=217, skipped=0, lr=[1.1272727272727272e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:17,795] [INFO] [timer.py:260:stop] epoch=0/micro_step=217/global_step=217, RunningAvgSamplesPerSec=95.46639412326074, CurrSamplesPerSec=96.3480360733383, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 96.25273116309623 samples/s, lr: 1.1272727272727272e-05, loss: 0.5336016416549683 cuda_mem_allocated: 22.222545623779297 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7389.0 batch_size: 67.0 total loss: 0.9585343599319458 | |
total tokens: 2460 num samples: 10 num padding tokens: 207 - rank: 4 max len: 246 min len: 214 avg len: 225.3 num_loss_counted_tokens: 854 | |
total tokens: 2532 num samples: 12 num padding tokens: 281 - rank: 5 max len: 211 min len: 168 avg len: 187.58333333333334 num_loss_counted_tokens: 1006 | |
total tokens: 2135 num samples: 5 num padding tokens: 186 - rank: 1 max len: 427 min len: 362 avg len: 389.8 num_loss_counted_tokens: 1156 | |
total tokens: 2513 num samples: 7 num padding tokens: 199 - rank: 2 max len: 359 min len: 301 avg len: 330.57142857142856 num_loss_counted_tokens: 1115 | |
total tokens: 2475 num samples: 15 num padding tokens: 207 - rank: 6 max len: 165 min len: 135 avg len: 151.2 num_loss_counted_tokens: 906 | |
total tokens: 2180 num samples: 4 num padding tokens: 218 - rank: 0 max len: 545 min len: 437 avg len: 490.5 num_loss_counted_tokens: 645 | |
total tokens: 2412 num samples: 18 num padding tokens: 395 - rank: 7 max len: 134 min len: 85 avg len: 112.05555555555556 num_loss_counted_tokens: 546 | |
total tokens: 2368 num samples: 8 num padding tokens: 228 - rank: 3 max len: 296 min len: 250 avg len: 267.5 num_loss_counted_tokens: 983 | |
Per-token loss scaled by world size: 0.001084265997633338
Per-token loss scaled by world size: 0.0006487083155661821
Per-token loss scaled by world size: 0.0005310449050739408
Per-token loss scaled by world size: 0.0012095078127458692
Per-token loss scaled by world size: 0.0004982489626854658
Per-token loss scaled by world size: 0.0009069818188436329
Per-token loss scaled by world size: 0.0019450075924396515
Epoch: 1, Step: 218, Rank: 4, loss = 1.028019666671753 | |
Epoch: 1, Step: 218, Rank: 6, loss = 0.615056574344635 | |
Epoch: 1, Step: 218, Rank: 7, loss = 0.5034969449043274
Epoch: 1, Step: 218, Rank: 2, loss = 1.1467646360397339
Epoch: 1, Step: 218, Rank: 1, loss = 0.47240227460861206 | |
Epoch: 1, Step: 218, Rank: 5, loss = 0.8599321246147156
Per-token loss scaled by world size: 0.0016924588708207011
Epoch: 1, Step: 218, Rank: 0, loss = 1.844110369682312
Epoch: 1, Step: 218, Rank: 3, loss = 1.604662537574768 | |
[2024-06-27 16:46:18,782] [INFO] [logging.py:96:log_dist] [Rank 0] step=218, skipped=0, lr=[1.1324675324675325e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:18,856] [INFO] [timer.py:260:stop] epoch=0/micro_step=218/global_step=218, RunningAvgSamplesPerSec=95.46553579363504, CurrSamplesPerSec=95.2813526177259, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.18765021811546 samples/s, lr: 1.1324675324675325e-05, loss: 1.844110369682312 cuda_mem_allocated: 22.26214027404785 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7585.0 batch_size: 82.0 total loss: 1.0093055963516235 | |
total tokens: 2289 num samples: 7 num padding tokens: 218 - rank: 2 max len: 327 min len: 277 avg len: 295.85714285714283 num_loss_counted_tokens: 1198 | |
total tokens: 2520 num samples: 15 num padding tokens: 285 - rank: 6 max len: 168 min len: 135 avg len: 149.0 num_loss_counted_tokens: 848 | |
total tokens: 2160 num samples: 5 num padding tokens: 215 - rank: 1 max len: 432 min len: 336 avg len: 389.0 num_loss_counted_tokens: 787 | |
total tokens: 2420 num samples: 10 num padding tokens: 127 - rank: 4 max len: 242 min len: 209 avg len: 229.3 num_loss_counted_tokens: 1194 | |
total tokens: 2436 num samples: 12 num padding tokens: 123 - rank: 5 max len: 203 min len: 176 avg len: 192.75 num_loss_counted_tokens: 1017 | |
total tokens: 2278 num samples: 17 num padding tokens: 392 - rank: 7 max len: 134 min len: 82 avg len: 110.94117647058823 num_loss_counted_tokens: 570 | |
total tokens: 2136 num samples: 4 num padding tokens: 119 - rank: 0 max len: 534 min len: 464 avg len: 504.25 num_loss_counted_tokens: 1366 | |
total tokens: 2484 num samples: 9 num padding tokens: 134 - rank: 3 max len: 276 min len: 244 avg len: 261.1111111111111 num_loss_counted_tokens: 989 | |
Per-token loss scaled by world size: 0.0008583253948017955
Per-token loss scaled by world size: 0.0017359615303575993
Per-token loss scaled by world size: 0.000685947248712182
Per-token loss scaled by world size: 0.000800142006482929
Per-token loss scaled by world size: 0.000621266954112798
Per-token loss scaled by world size: 0.0005622903699986637
Per-token loss scaled by world size: 0.0010194805217906833 | |
Epoch: 1, Step: 219, Rank: 6, loss = 0.8624024391174316 | |
Epoch: 1, Step: 219, Rank: 4, loss = 0.689205527305603 | |
Epoch: 1, Step: 219, Rank: 7, loss = 0.6242179870605469
Epoch: 1, Step: 219, Rank: 5, loss = 0.8039426803588867
Epoch: 1, Step: 219, Rank: 1, loss = 1.7442073822021484
Epoch: 1, Step: 219, Rank: 2, loss = 1.0243231058120728
Epoch: 1, Step: 219, Rank: 0, loss = 0.5649612545967102 | |
Per-token loss scaled by world size: 0.001489321468397975 | |
Epoch: 1, Step: 219, Rank: 3, loss = 1.4963957071304321 | |
[2024-06-27 16:46:19,852] [INFO] [logging.py:96:log_dist] [Rank 0] step=219, skipped=0, lr=[1.1376623376623376e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:19,925] [INFO] [timer.py:260:stop] epoch=0/micro_step=219/global_step=219, RunningAvgSamplesPerSec=95.46437978482534, CurrSamplesPerSec=95.21533629913814, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.10585659902823 samples/s, lr: 1.1376623376623376e-05, loss: 0.5649612545967102 cuda_mem_allocated: 22.256415367126465 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8038.0 batch_size: 76.0 total loss: 0.9762070775032043 | |
total tokens: 2470 num samples: 10 num padding tokens: 113 - rank: 3 max len: 247 min len: 226 avg len: 235.7 num_loss_counted_tokens: 684 | |
total tokens: 2409 num samples: 11 num padding tokens: 246 - rank: 4 max len: 219 min len: 186 avg len: 196.63636363636363 num_loss_counted_tokens: 897 | |
total tokens: 2520 num samples: 8 num padding tokens: 299 - rank: 2 max len: 315 min len: 249 avg len: 277.625 num_loss_counted_tokens: 1000 | |
total tokens: 2388 num samples: 6 num padding tokens: 229 - rank: 1 max len: 398 min len: 319 avg len: 359.8333333333333 num_loss_counted_tokens: 1279 | |
total tokens: 2400 num samples: 15 num padding tokens: 258 - rank: 6 max len: 160 min len: 128 avg len: 142.8 num_loss_counted_tokens: 773 | |
total tokens: 2418 num samples: 13 num padding tokens: 155 - rank: 5 max len: 186 min len: 162 avg len: 174.07692307692307 num_loss_counted_tokens: 821 | |
total tokens: 2204 num samples: 4 num padding tokens: 162 - rank: 0 max len: 551 min len: 442 avg len: 510.5 num_loss_counted_tokens: 1046 | |
total tokens: 2286 num samples: 18 num padding tokens: 262 - rank: 7 max len: 127 min len: 90 avg len: 112.44444444444444 num_loss_counted_tokens: 670 | |
Per-token loss scaled by world size: 0.0006416768883354962
Per-token loss scaled by world size: 0.0006137069431133568
Per-token loss scaled by world size: 0.0009246247936971486
Per-token loss scaled by world size: 0.0009718507062643766
Per-token loss scaled by world size: 0.0011631427332758904
Per-token loss scaled by world size: 0.0010256441310048103
Per-token loss scaled by world size: 0.000941418285947293
Epoch: 1, Step: 220, Rank: 1, loss = 0.6292443871498108 | |
Epoch: 1, Step: 220, Rank: 5, loss = 0.9067102074623108
Epoch: 1, Step: 220, Rank: 7, loss = 0.6018163561820984
Epoch: 1, Step: 220, Rank: 6, loss = 1.1406068801879883
Epoch: 1, Step: 220, Rank: 2, loss = 0.9530211091041565
Epoch: 1, Step: 220, Rank: 3, loss = 1.0057722330093384
Per-token loss scaled by world size: 0.0006566005758941174
Epoch: 1, Step: 220, Rank: 4, loss = 0.9231783151626587
Epoch: 1, Step: 220, Rank: 0, loss = 0.6438789367675781 | |
[2024-06-27 16:46:20,916] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=0, lr=[1.1428571428571429e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:20,990] [INFO] [timer.py:260:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=95.46212447321193, CurrSamplesPerSec=94.97522950494863, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.88069703830355 samples/s, lr: 1.1428571428571429e-05, loss: 0.6438789367675781 cuda_mem_allocated: 22.31079864501953 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7845.0 batch_size: 84.0 total loss: 0.8505285978317261 | |
total tokens: 2478 num samples: 14 num padding tokens: 168 - rank: 5 max len: 177 min len: 155 avg len: 165.0 num_loss_counted_tokens: 1068 | |
total tokens: 2466 num samples: 9 num padding tokens: 241 - rank: 3 max len: 274 min len: 222 avg len: 247.22222222222223 num_loss_counted_tokens: 1016 | |
total tokens: 2464 num samples: 16 num padding tokens: 163 - rank: 6 max len: 154 min len: 132 avg len: 143.8125 num_loss_counted_tokens: 759 | |
total tokens: 2392 num samples: 8 num padding tokens: 121 - rank: 2 max len: 299 min len: 275 avg len: 283.875 num_loss_counted_tokens: 1038 | |
total tokens: 2220 num samples: 6 num padding tokens: 234 - rank: 1 max len: 370 min len: 303 avg len: 331.0 num_loss_counted_tokens: 1264 | |
total tokens: 2431 num samples: 11 num padding tokens: 274 - rank: 4 max len: 221 min len: 184 avg len: 196.0909090909091 num_loss_counted_tokens: 1034 | |
total tokens: 2489 num samples: 19 num padding tokens: 493 - rank: 7 max len: 131 min len: 81 avg len: 105.05263157894737 num_loss_counted_tokens: 446 | |
total tokens: 2470 num samples: 5 num padding tokens: 330 - rank: 0 max len: 494 min len: 393 avg len: 428.0 num_loss_counted_tokens: 1240 | |
Per-token loss scaled by world size: 0.0015894882380962372
Per-token loss scaled by world size: 0.001343012903816998
Per-token loss scaled by world size: 0.0011542169377207756
Per-token loss scaled by world size: 0.0004924156237393618
Per-token loss scaled by world size: 0.0009595828596502542
Per-token loss scaled by world size: 0.000687390856910497
Per-token loss scaled by world size: 0.0005613004905171692 | |
Epoch: 1, Step: 221, Rank: 2, loss = 1.534452199935913 | |
Epoch: 1, Step: 221, Rank: 1, loss = 1.2965110540390015
Epoch: 1, Step: 221, Rank: 5, loss = 1.1142522096633911
Epoch: 1, Step: 221, Rank: 7, loss = 0.4753657579421997
Epoch: 1, Step: 221, Rank: 6, loss = 0.9263573288917542
Epoch: 1, Step: 221, Rank: 3, loss = 0.6635899543762207 | |
Epoch: 1, Step: 221, Rank: 0, loss = 0.5418654680252075 | |
Per-token loss scaled by world size: 0.0007334320107474923 | |
Epoch: 1, Step: 221, Rank: 4, loss = 0.7080368995666504 | |
[2024-06-27 16:46:21,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=221, skipped=0, lr=[1.148051948051948e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:22,050] [INFO] [timer.py:260:stop] epoch=0/micro_step=221/global_step=221, RunningAvgSamplesPerSec=95.46124146038943, CurrSamplesPerSec=95.26913382551584, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.16672752656623 samples/s, lr: 1.148051948051948e-05, loss: 0.5418654680252075 cuda_mem_allocated: 22.256772994995117 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7723.0 batch_size: 77.0 total loss: 0.9075538516044617 | |
total tokens: 2415 num samples: 7 num padding tokens: 128 - rank: 1 max len: 345 min len: 299 avg len: 326.7142857142857 num_loss_counted_tokens: 1383 | |
total tokens: 2286 num samples: 9 num padding tokens: 67 - rank: 3 max len: 254 min len: 234 avg len: 246.55555555555554 num_loss_counted_tokens: 890 | |
total tokens: 2512 num samples: 16 num padding tokens: 153 - rank: 6 max len: 157 min len: 139 avg len: 147.4375 num_loss_counted_tokens: 779 | |
total tokens: 2344 num samples: 8 num padding tokens: 134 - rank: 2 max len: 293 min len: 257 avg len: 276.25 num_loss_counted_tokens: 1114 | |
total tokens: 2509 num samples: 13 num padding tokens: 206 - rank: 5 max len: 193 min len: 160 avg len: 177.15384615384616 num_loss_counted_tokens: 900 | |
total tokens: 2320 num samples: 10 num padding tokens: 175 - rank: 4 max len: 232 min len: 197 avg len: 214.5 num_loss_counted_tokens: 803 | |
total tokens: 2502 num samples: 18 num padding tokens: 554 - rank: 7 max len: 139 min len: 79 avg len: 108.22222222222223 num_loss_counted_tokens: 541 | |
total tokens: 2319 num samples: 3 num padding tokens: 828 - rank: 0 max len: 773 min len: 354 avg len: 497.0 num_loss_counted_tokens: 907 | |
Per-token loss scaled by world size: 0.00027043011505156755
Per-token loss scaled by world size: 0.0016199905658140779
Per-token loss scaled by world size: 0.0005302864010445774
Per-token loss scaled by world size: 0.0009501695749349892
Per-token loss scaled by world size: 0.0012530400417745113
Per-token loss scaled by world size: 0.0013440429465845227
Per-token loss scaled by world size: 0.0010848339879885316
Epoch: 1, Step: 222, Rank: 0, loss = 0.2382151186466217
Epoch: 1, Step: 222, Rank: 2, loss = 1.4270092248916626
Epoch: 1, Step: 222, Rank: 7, loss = 0.4671160578727722
Epoch: 1, Step: 222, Rank: 3, loss = 0.8369806408882141
Epoch: 1, Step: 222, Rank: 5, loss = 1.103771686553955
Epoch: 1, Step: 222, Rank: 6, loss = 0.9556031823158264 | |
Epoch: 1, Step: 222, Rank: 1, loss = 1.1839338541030884 | |
Per-token loss scaled by world size: 0.0013893907889723778 | |
Epoch: 1, Step: 222, Rank: 4, loss = 1.2238795757293701 | |
[2024-06-27 16:46:23,037] [INFO] [logging.py:96:log_dist] [Rank 0] step=222, skipped=0, lr=[1.1532467532467533e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:23,111] [INFO] [timer.py:260:stop] epoch=0/micro_step=222/global_step=222, RunningAvgSamplesPerSec=95.46022812996155, CurrSamplesPerSec=95.23882581733496, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.10688994544005 samples/s, lr: 1.1532467532467533e-05, loss: 0.2382151186466217 cuda_mem_allocated: 22.27788257598877 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7047.0 batch_size: 85.0 total loss: 0.9295637011528015 | |
total tokens: 2502 num samples: 18 num padding tokens: 222 - rank: 6 max len: 139 min len: 117 avg len: 126.66666666666667 num_loss_counted_tokens: 862 | |
total tokens: 2520 num samples: 10 num padding tokens: 252 - rank: 3 max len: 252 min len: 208 avg len: 226.8 num_loss_counted_tokens: 1277 | |
total tokens: 2400 num samples: 15 num padding tokens: 146 - rank: 5 max len: 160 min len: 142 avg len: 150.26666666666668 num_loss_counted_tokens: 997 | |
total tokens: 2460 num samples: 12 num padding tokens: 231 - rank: 4 max len: 205 min len: 163 avg len: 185.75 num_loss_counted_tokens: 749 | |
total tokens: 2226 num samples: 7 num padding tokens: 258 - rank: 2 max len: 318 min len: 253 avg len: 281.14285714285717 num_loss_counted_tokens: 481 | |
total tokens: 2328 num samples: 6 num padding tokens: 244 - rank: 1 max len: 388 min len: 320 avg len: 347.3333333333333 num_loss_counted_tokens: 995 | |
total tokens: 2128 num samples: 19 num padding tokens: 259 - rank: 7 max len: 112 min len: 82 avg len: 98.36842105263158 num_loss_counted_tokens: 476 | |
total tokens: 2520 num samples: 5 num padding tokens: 296 - rank: 0 max len: 504 min len: 406 avg len: 444.8 num_loss_counted_tokens: 1232 | |
Per-token loss scaled by world size: 0.0007850955589674413
Per-token loss scaled by world size: 0.0008037019870243967
Per-token loss scaled by world size: 0.001868387800641358
Per-token loss scaled by world size: 0.0006797353271394968
Per-token loss scaled by world size: 0.0008470106986351311
Per-token loss scaled by world size: 0.000865851528942585
Per-token loss scaled by world size: 0.0008801720105111599
Epoch: 1, Step: 223, Rank: 7, loss = 0.6183042526245117
Epoch: 1, Step: 223, Rank: 6, loss = 0.7141425609588623
Epoch: 1, Step: 223, Rank: 0, loss = 1.6995322704315186
Epoch: 1, Step: 223, Rank: 2, loss = 0.7876002192497253
Per-token loss scaled by world size: 0.0009885545587167144
Epoch: 1, Step: 223, Rank: 3, loss = 0.7704620957374573
Epoch: 1, Step: 223, Rank: 5, loss = 0.8006264567375183 | |
Epoch: 1, Step: 223, Rank: 4, loss = 0.731067419052124 | |
Epoch: 1, Step: 223, Rank: 1, loss = 0.8992139101028442 | |
[2024-06-27 16:46:24,104] [INFO] [logging.py:96:log_dist] [Rank 0] step=223, skipped=0, lr=[1.1584415584415584e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:24,177] [INFO] [timer.py:260:stop] epoch=0/micro_step=223/global_step=223, RunningAvgSamplesPerSec=95.4571465881128, CurrSamplesPerSec=94.78400974738666, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.68424623211884 samples/s, lr: 1.1584415584415584e-05, loss: 1.6995322704315186 cuda_mem_allocated: 22.281102180480957 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7277.0 batch_size: 80.0 total loss: 0.8776186108589172 | |
total tokens: 2412 num samples: 12 num padding tokens: 280 - rank: 5 max len: 201 min len: 161 avg len: 177.66666666666666 num_loss_counted_tokens: 885 | |
total tokens: 2530 num samples: 10 num padding tokens: 220 - rank: 4 max len: 253 min len: 210 avg len: 231.0 num_loss_counted_tokens: 947 | |
total tokens: 2268 num samples: 7 num padding tokens: 248 - rank: 3 max len: 324 min len: 261 avg len: 288.57142857142856 num_loss_counted_tokens: 1022 | |
total tokens: 2037 num samples: 3 num padding tokens: 234 - rank: 1 max len: 679 min len: 453 avg len: 601.0 num_loss_counted_tokens: 291 | |
total tokens: 2225 num samples: 5 num padding tokens: 389 - rank: 2 max len: 445 min len: 338 avg len: 367.2 num_loss_counted_tokens: 724 | |
total tokens: 2415 num samples: 15 num padding tokens: 246 - rank: 6 max len: 161 min len: 130 avg len: 144.6 num_loss_counted_tokens: 744 | |
total tokens: 1600 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1600 min len: 1600 avg len: 1600.0 num_loss_counted_tokens: 53 | |
total tokens: 2470 num samples: 19 num padding tokens: 432 - rank: 7 max len: 130 min len: 87 avg len: 107.26315789473684 num_loss_counted_tokens: 591 | |
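The per-rank loss lines in this log are consistent with a simple relation. The sketch below is an inference from the logged numbers for Epoch 1, Step 223, not the training code itself: it assumes a world size of 8 and that each rank's printed loss is its "Per-token loss scaled by world size" value times the step's global num_loss_counted_tokens (7277.0), divided by the world size, with the step's "total loss" being the mean of the eight per-rank losses.

```python
# Sketch (inferred from the log, assumed world size of 8): reproduce two of
# the per-rank losses printed at Epoch 1, Step 223 from their
# "Per-token loss scaled by world size" values.
WORLD_SIZE = 8
NUM_LOSS_COUNTED_TOKENS = 7277.0  # from this step's throughput line

# rank -> per-token loss scaled by world size, as printed in the log
scaled = {
    6: 0.0007850955589674413,
    0: 0.001868387800641358,
}

rank_loss = {r: s * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE for r, s in scaled.items()}
# rank 6 comes out ~0.71414 and rank 0 ~1.69953, matching the logged lines.

# The step's reported "total loss" matches the mean of the eight per-rank losses:
all_losses = [0.6183042526245117, 0.7141425609588623, 1.6995322704315186,
              0.7876002192497253, 0.7704620957374573, 0.8006264567375183,
              0.731067419052124, 0.8992139101028442]
total_loss = sum(all_losses) / len(all_losses)  # ~0.87762, as logged
```

Under these assumptions the reconstruction agrees with the logged values to within float-printing precision.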
Per-token loss scaled by world size: 0.0006054527475498617
Per-token loss scaled by world size: 0.0005420049419626594
Per-token loss scaled by world size: 0.0009975614957511425
Per-token loss scaled by world size: 0.0003570506814867258
Per-token loss scaled by world size: 0.001056499662809074
Per-token loss scaled by world size: 0.0008691463735885918
Per-token loss scaled by world size: 0.0006222155061550438
Epoch: 1, Step: 224, Rank: 5, loss = 0.6085556745529175
Epoch: 1, Step: 224, Rank: 6, loss = 0.5447826981544495
Epoch: 1, Step: 224, Rank: 1, loss = 1.0026739835739136
Epoch: 1, Step: 224, Rank: 7, loss = 0.35888057947158813
Epoch: 1, Step: 224, Rank: 3, loss = 1.0619142055511475
Epoch: 1, Step: 224, Rank: 4, loss = 0.6254043579101562
Epoch: 1, Step: 224, Rank: 0, loss = 0.8736007213592529
Per-token loss scaled by world size: 0.0011403149692341685
Epoch: 1, Step: 224, Rank: 2, loss = 1.146159052848816
[2024-06-27 16:46:25,161] [INFO] [logging.py:96:log_dist] [Rank 0] step=224, skipped=0, lr=[1.1636363636363637e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:25,234] [INFO] [timer.py:260:stop] epoch=0/micro_step=224/global_step=224, RunningAvgSamplesPerSec=95.45767559493824, CurrSamplesPerSec=95.57473011300587, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 95.47544665510173 samples/s, lr: 1.1636363636363637e-05, loss: 0.8736007213592529 cuda_mem_allocated: 22.2478289604187 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8041.0 batch_size: 71.0 total loss: 0.7777464389801025
total tokens: 2360 num samples: 10 num padding tokens: 88 - rank: 4 max len: 236 min len: 211 avg len: 227.2 num_loss_counted_tokens: 1014
total tokens: 2480 num samples: 8 num padding tokens: 147 - rank: 2 max len: 310 min len: 268 avg len: 291.625 num_loss_counted_tokens: 1105
total tokens: 2448 num samples: 18 num padding tokens: 504 - rank: 7 max len: 136 min len: 84 avg len: 108.0 num_loss_counted_tokens: 543
total tokens: 2349 num samples: 9 num padding tokens: 60 - rank: 3 max len: 261 min len: 240 avg len: 254.33333333333334 num_loss_counted_tokens: 1197
total tokens: 2505 num samples: 15 num padding tokens: 222 - rank: 6 max len: 167 min len: 140 avg len: 152.2 num_loss_counted_tokens: 998
total tokens: 2382 num samples: 6 num padding tokens: 252 - rank: 1 max len: 397 min len: 315 avg len: 355.0 num_loss_counted_tokens: 1102
total tokens: 2520 num samples: 12 num padding tokens: 277 - rank: 5 max len: 210 min len: 167 avg len: 186.91666666666666 num_loss_counted_tokens: 1078
total tokens: 2188 num samples: 4 num padding tokens: 335 - rank: 0 max len: 547 min len: 403 avg len: 463.25 num_loss_counted_tokens: 1165
Per-token loss scaled by world size: 0.0021303310059010983
Per-token loss scaled by world size: 0.0010834417771548033
Per-token loss scaled by world size: 0.0010008623357862234
Per-token loss scaled by world size: 0.0007383509073406458
Per-token loss scaled by world size: 0.00048202730249613523
Per-token loss scaled by world size: 0.001526564359664917
Per-token loss scaled by world size: 0.0013264539884403348
Epoch: 1, Step: 225, Rank: 1, loss = 2.0696165561676025
Epoch: 1, Step: 225, Rank: 2, loss = 1.0525636672973633
Epoch: 1, Step: 225, Rank: 4, loss = 0.9723377823829651
Epoch: 1, Step: 225, Rank: 6, loss = 0.7173079252243042
Epoch: 1, Step: 225, Rank: 3, loss = 1.4830572605133057
Epoch: 1, Step: 225, Rank: 7, loss = 0.4682895243167877
Per-token loss scaled by world size: 0.000637897988781333
Epoch: 1, Step: 225, Rank: 0, loss = 1.2886500358581543
Epoch: 1, Step: 225, Rank: 5, loss = 0.6197178959846497
[2024-06-27 16:46:26,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=225, skipped=0, lr=[1.1688311688311688e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:26,290] [INFO] [timer.py:260:stop] epoch=0/micro_step=225/global_step=225, RunningAvgSamplesPerSec=95.45877770704384, CurrSamplesPerSec=95.70407815228533, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 95.55463467757419 samples/s, lr: 1.1688311688311688e-05, loss: 1.2886500358581543 cuda_mem_allocated: 22.277524948120117 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7772.0 batch_size: 73.0 total loss: 1.0839425325393677
total tokens: 2240 num samples: 20 num padding tokens: 273 - rank: 7 max len: 112 min len: 80 avg len: 98.35 num_loss_counted_tokens: 479
total tokens: 2502 num samples: 18 num padding tokens: 226 - rank: 6 max len: 139 min len: 113 avg len: 126.44444444444444 num_loss_counted_tokens: 669
total tokens: 2296 num samples: 7 num padding tokens: 203 - rank: 1 max len: 328 min len: 271 avg len: 299.0 num_loss_counted_tokens: 1144
total tokens: 2439 num samples: 9 num padding tokens: 97 - rank: 2 max len: 271 min len: 242 avg len: 260.22222222222223 num_loss_counted_tokens: 1037
total tokens: 2460 num samples: 12 num padding tokens: 174 - rank: 4 max len: 205 min len: 174 avg len: 190.5 num_loss_counted_tokens: 881
total tokens: 2380 num samples: 14 num padding tokens: 155 - rank: 5 max len: 170 min len: 140 avg len: 158.92857142857142 num_loss_counted_tokens: 939
total tokens: 2420 num samples: 10 num padding tokens: 151 - rank: 3 max len: 242 min len: 212 avg len: 226.9 num_loss_counted_tokens: 711
total tokens: 2524 num samples: 4 num padding tokens: 762 - rank: 0 max len: 631 min len: 339 avg len: 440.5 num_loss_counted_tokens: 577
Per-token loss scaled by world size: 0.0012125144712626934
Per-token loss scaled by world size: 0.0007440359913744032
Per-token loss scaled by world size: 0.0009185302769765258
Per-token loss scaled by world size: 0.0011922685662284493
Per-token loss scaled by world size: 0.00022955541498959064
Per-token loss scaled by world size: 0.0029797181487083435
Per-token loss scaled by world size: 0.0009129224927164614
Epoch: 1, Step: 226, Rank: 6, loss = 0.9706556797027588
Epoch: 1, Step: 226, Rank: 5, loss = 0.9871383309364319
Epoch: 1, Step: 226, Rank: 2, loss = 0.6057382822036743
Epoch: 1, Step: 226, Rank: 4, loss = 0.7477984428405762
Epoch: 1, Step: 226, Rank: 7, loss = 0.7432330250740051
Epoch: 1, Step: 226, Rank: 0, loss = 0.18688680231571198
Epoch: 1, Step: 226, Rank: 1, loss = 2.425863027572632
Per-token loss scaled by world size: 0.0017871969612315297
Epoch: 1, Step: 226, Rank: 3, loss = 1.455001711845398
[2024-06-27 16:46:27,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=226, skipped=0, lr=[1.1740259740259741e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:27,348] [INFO] [timer.py:260:stop] epoch=0/micro_step=226/global_step=226, RunningAvgSamplesPerSec=95.45887726570147, CurrSamplesPerSec=95.48108403430889, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 95.3615739354829 samples/s, lr: 1.1740259740259741e-05, loss: 0.18688680231571198 cuda_mem_allocated: 22.24711322784424 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6513.0 batch_size: 75.0 total loss: 1.015289306640625
total tokens: 2052 num samples: 4 num padding tokens: 184 - rank: 1 max len: 513 min len: 393 avg len: 467.0 num_loss_counted_tokens: 1062
total tokens: 2505 num samples: 15 num padding tokens: 274 - rank: 6 max len: 167 min len: 130 avg len: 148.73333333333332 num_loss_counted_tokens: 805
total tokens: 2401 num samples: 7 num padding tokens: 144 - rank: 2 max len: 343 min len: 304 avg len: 322.42857142857144 num_loss_counted_tokens: 903
total tokens: 2382 num samples: 3 num padding tokens: 547 - rank: 0 max len: 794 min len: 517 avg len: 611.6666666666666 num_loss_counted_tokens: 828
total tokens: 2432 num samples: 8 num padding tokens: 354 - rank: 3 max len: 304 min len: 234 avg len: 259.75 num_loss_counted_tokens: 987
total tokens: 2530 num samples: 11 num padding tokens: 293 - rank: 4 max len: 230 min len: 186 avg len: 203.36363636363637 num_loss_counted_tokens: 733
total tokens: 2016 num samples: 16 num padding tokens: 309 - rank: 7 max len: 126 min len: 84 avg len: 106.6875 num_loss_counted_tokens: 453
total tokens: 2418 num samples: 13 num padding tokens: 125 - rank: 5 max len: 186 min len: 168 avg len: 176.3846153846154 num_loss_counted_tokens: 930
Per-token loss scaled by world size: 0.001343224081210792
Per-token loss scaled by world size: 0.0014666931238025427
Per-token loss scaled by world size: 0.0008153169183060527
Per-token loss scaled by world size: 0.0009411191567778587
Per-token loss scaled by world size: 0.0006741951801814139
Per-token loss scaled by world size: 0.001054826658219099
Per-token loss scaled by world size: 0.0013487170217558742
Epoch: 1, Step: 227, Rank: 2, loss = 1.289039969444275
Epoch: 1, Step: 227, Rank: 5, loss = 1.1805260181427002
Epoch: 1, Step: 227, Rank: 1, loss = 0.7165616750717163
Epoch: 1, Step: 227, Rank: 6, loss = 0.8271260857582092
Epoch: 1, Step: 227, Rank: 0, loss = 0.5925332903862
Epoch: 1, Step: 227, Rank: 4, loss = 0.9270607829093933
Epoch: 1, Step: 227, Rank: 3, loss = 1.1853536367416382
Per-token loss scaled by world size: 0.0005826203268952668
Epoch: 1, Step: 227, Rank: 7, loss = 0.512050449848175
[2024-06-27 16:46:28,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=227, skipped=0, lr=[1.1792207792207792e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:28,405] [INFO] [timer.py:260:stop] epoch=0/micro_step=227/global_step=227, RunningAvgSamplesPerSec=95.45821474175928, CurrSamplesPerSec=95.3100407677102, MemAllocated=22.24GB, MaxMemAllocated=28.61GB
throughput: 95.15390848707546 samples/s, lr: 1.1792207792207792e-05, loss: 0.5925332903862 cuda_mem_allocated: 22.24425172805786 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7031.0 batch_size: 84.0 total loss: 0.9037814140319824
total tokens: 2303 num samples: 7 num padding tokens: 92 - rank: 2 max len: 329 min len: 306 avg len: 315.85714285714283 num_loss_counted_tokens: 1372
total tokens: 2388 num samples: 12 num padding tokens: 144 - rank: 5 max len: 199 min len: 172 avg len: 187.0 num_loss_counted_tokens: 930
total tokens: 2352 num samples: 8 num padding tokens: 200 - rank: 3 max len: 294 min len: 237 avg len: 269.0 num_loss_counted_tokens: 1081
total tokens: 2320 num samples: 10 num padding tokens: 113 - rank: 4 max len: 232 min len: 201 avg len: 220.7 num_loss_counted_tokens: 651
total tokens: 2405 num samples: 5 num padding tokens: 283 - rank: 1 max len: 481 min len: 342 avg len: 424.4 num_loss_counted_tokens: 1356
total tokens: 2505 num samples: 15 num padding tokens: 240 - rank: 6 max len: 167 min len: 138 avg len: 151.0 num_loss_counted_tokens: 836
total tokens: 2466 num samples: 18 num padding tokens: 351 - rank: 7 max len: 137 min len: 86 avg len: 117.5 num_loss_counted_tokens: 703
total tokens: 2000 num samples: 2 num padding tokens: 401 - rank: 0 max len: 1000 min len: 599 avg len: 799.5 num_loss_counted_tokens: 1251
Per-token loss scaled by world size: 0.0009606981184333563
Per-token loss scaled by world size: 0.0005065760924480855
Per-token loss scaled by world size: 0.0011382445227354765
Per-token loss scaled by world size: 0.0011512255296111107
Per-token loss scaled by world size: 0.001043441821821034
Per-token loss scaled by world size: 0.0011049897875636816
Per-token loss scaled by world size: 0.001126572722569108
Epoch: 1, Step: 228, Rank: 2, loss = 0.9010147452354431
Epoch: 1, Step: 228, Rank: 7, loss = 0.47510507702827454
Epoch: 1, Step: 228, Rank: 5, loss = 1.0797055959701538
Epoch: 1, Step: 228, Rank: 1, loss = 1.0675311088562012
Per-token loss scaled by world size: 0.0007445869850926101
Epoch: 1, Step: 228, Rank: 4, loss = 1.0363422632217407
Epoch: 1, Step: 228, Rank: 6, loss = 0.9786179661750793
Epoch: 1, Step: 228, Rank: 3, loss = 1.056584358215332
Epoch: 1, Step: 228, Rank: 0, loss = 0.698329508304596
[2024-06-27 16:46:29,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=228, skipped=0, lr=[1.1844155844155845e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:29,457] [INFO] [timer.py:260:stop] epoch=0/micro_step=228/global_step=228, RunningAvgSamplesPerSec=95.4619160153981, CurrSamplesPerSec=96.30206440910487, MemAllocated=22.3GB, MaxMemAllocated=28.61GB
throughput: 96.19809315934593 samples/s, lr: 1.1844155844155845e-05, loss: 0.698329508304596 cuda_mem_allocated: 22.303642749786377 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7503.0 batch_size: 80.0 total loss: 0.9116538166999817
total tokens: 2520 num samples: 14 num padding tokens: 180 - rank: 5 max len: 180 min len: 157 avg len: 167.14285714285714 num_loss_counted_tokens: 1033
total tokens: 2296 num samples: 7 num padding tokens: 180 - rank: 2 max len: 328 min len: 284 avg len: 302.2857142857143 num_loss_counted_tokens: 1128
total tokens: 2400 num samples: 10 num padding tokens: 315 - rank: 4 max len: 240 min len: 183 avg len: 208.5 num_loss_counted_tokens: 1023
total tokens: 2454 num samples: 6 num padding tokens: 259 - rank: 1 max len: 409 min len: 337 avg len: 365.8333333333333 num_loss_counted_tokens: 1215
total tokens: 2496 num samples: 16 num padding tokens: 196 - rank: 6 max len: 156 min len: 130 avg len: 143.75 num_loss_counted_tokens: 881
total tokens: 2264 num samples: 8 num padding tokens: 216 - rank: 3 max len: 283 min len: 240 avg len: 256.0 num_loss_counted_tokens: 1175
total tokens: 2348 num samples: 4 num padding tokens: 404 - rank: 0 max len: 587 min len: 412 avg len: 486.0 num_loss_counted_tokens: 1472
total tokens: 2451 num samples: 19 num padding tokens: 389 - rank: 7 max len: 129 min len: 86 avg len: 108.52631578947368 num_loss_counted_tokens: 471
Per-token loss scaled by world size: 0.0013528160052374005
Per-token loss scaled by world size: 0.0013775452971458435
Per-token loss scaled by world size: 0.0011563264997676015
Per-token loss scaled by world size: 0.0010295365937054157
Per-token loss scaled by world size: 0.0011473774211481214
Per-token loss scaled by world size: 0.000847765535581857
Per-token loss scaled by world size: 8.189996879082173e-05
Epoch: 1, Step: 229, Rank: 3, loss = 0.8408740162849426
Epoch: 1, Step: 229, Rank: 1, loss = 1.104912519454956
Epoch: 1, Step: 229, Rank: 6, loss = 0.6924124956130981
Epoch: 1, Step: 229, Rank: 5, loss = 0.9444296956062317
Epoch: 1, Step: 229, Rank: 2, loss = 1.125110149383545
Epoch: 1, Step: 229, Rank: 4, loss = 0.9371204972267151
Epoch: 1, Step: 229, Rank: 0, loss = 0.06689179688692093
Per-token loss scaled by world size: 0.0007822882616892457
Epoch: 1, Step: 229, Rank: 7, loss = 0.6389339566230774
[2024-06-27 16:46:30,441] [INFO] [logging.py:96:log_dist] [Rank 0] step=229, skipped=0, lr=[1.1896103896103896e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:30,515] [INFO] [timer.py:260:stop] epoch=0/micro_step=229/global_step=229, RunningAvgSamplesPerSec=95.46190881304153, CurrSamplesPerSec=95.46028110833097, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 95.3534892730837 samples/s, lr: 1.1896103896103896e-05, loss: 0.06689179688692093 cuda_mem_allocated: 22.269057750701904 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6534.0 batch_size: 86.0 total loss: 0.7938356995582581
total tokens: 2522 num samples: 13 num padding tokens: 276 - rank: 5 max len: 194 min len: 153 avg len: 172.76923076923077 num_loss_counted_tokens: 750
total tokens: 2432 num samples: 16 num padding tokens: 166 - rank: 6 max len: 152 min len: 128 avg len: 141.625 num_loss_counted_tokens: 826
total tokens: 2424 num samples: 8 num padding tokens: 246 - rank: 3 max len: 303 min len: 247 avg len: 272.25 num_loss_counted_tokens: 975
total tokens: 2220 num samples: 6 num padding tokens: 71 - rank: 1 max len: 370 min len: 352 avg len: 358.1666666666667 num_loss_counted_tokens: 1186
total tokens: 2401 num samples: 7 num padding tokens: 131 - rank: 2 max len: 343 min len: 304 avg len: 324.2857142857143 num_loss_counted_tokens: 608
total tokens: 2375 num samples: 19 num padding tokens: 301 - rank: 7 max len: 125 min len: 81 avg len: 109.15789473684211 num_loss_counted_tokens: 568
total tokens: 2320 num samples: 10 num padding tokens: 209 - rank: 4 max len: 232 min len: 198 avg len: 211.1 num_loss_counted_tokens: 919
total tokens: 2365 num samples: 5 num padding tokens: 322 - rank: 0 max len: 473 min len: 372 avg len: 408.6 num_loss_counted_tokens: 1116
Per-token loss scaled by world size: 0.0011656841961666942
Per-token loss scaled by world size: 0.0005307049723342061
Per-token loss scaled by world size: 0.0018485913751646876
Per-token loss scaled by world size: 0.001182698761112988
Per-token loss scaled by world size: 0.0011426317505538464
Per-token loss scaled by world size: 0.001214418327435851
Per-token loss scaled by world size: 0.0010356530547142029
Epoch: 1, Step: 230, Rank: 5, loss = 1.1300686597824097
Epoch: 1, Step: 230, Rank: 0, loss = 1.766329050064087
Epoch: 1, Step: 230, Rank: 7, loss = 0.5070886015892029
Epoch: 1, Step: 230, Rank: 2, loss = 1.1138112545013428
Epoch: 1, Step: 230, Rank: 1, loss = 1.0917845964431763
Epoch: 1, Step: 230, Rank: 4, loss = 1.1603766679763794
Epoch: 1, Step: 230, Rank: 3, loss = 0.9895665049552917
Per-token loss scaled by world size: 0.0007738174172118306
Epoch: 1, Step: 230, Rank: 6, loss = 0.7393825650215149
[2024-06-27 16:46:31,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=0, lr=[1.1948051948051949e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:31,569] [INFO] [timer.py:260:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=95.46431614658313, CurrSamplesPerSec=96.01394092798833, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 95.90254958478681 samples/s, lr: 1.1948051948051949e-05, loss: 1.766329050064087 cuda_mem_allocated: 22.275736331939697 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7644.0 batch_size: 78.0 total loss: 1.0623009204864502
total tokens: 2436 num samples: 12 num padding tokens: 226 - rank: 5 max len: 203 min len: 164 avg len: 184.16666666666666 num_loss_counted_tokens: 852
total tokens: 2460 num samples: 15 num padding tokens: 211 - rank: 6 max len: 164 min len: 132 avg len: 149.93333333333334 num_loss_counted_tokens: 885
total tokens: 2303 num samples: 7 num padding tokens: 153 - rank: 1 max len: 329 min len: 284 avg len: 307.14285714285717 num_loss_counted_tokens: 981
total tokens: 2475 num samples: 11 num padding tokens: 121 - rank: 4 max len: 225 min len: 203 avg len: 214.0 num_loss_counted_tokens: 908
total tokens: 2331 num samples: 9 num padding tokens: 143 - rank: 3 max len: 259 min len: 227 avg len: 243.11111111111111 num_loss_counted_tokens: 663
total tokens: 2264 num samples: 8 num padding tokens: 83 - rank: 2 max len: 283 min len: 259 avg len: 272.625 num_loss_counted_tokens: 787
total tokens: 2489 num samples: 19 num padding tokens: 400 - rank: 7 max len: 131 min len: 87 avg len: 109.94736842105263 num_loss_counted_tokens: 589
total tokens: 2315 num samples: 5 num padding tokens: 322 - rank: 0 max len: 463 min len: 335 avg len: 398.6 num_loss_counted_tokens: 1044
Per-token loss scaled by world size: 0.0008054596255533397
Per-token loss scaled by world size: 0.0008824141696095467
Per-token loss scaled by world size: 0.0003325326251797378
Per-token loss scaled by world size: 0.001858191448263824
Per-token loss scaled by world size: 0.0007661249837838113
Per-token loss scaled by world size: 0.0007683064322918653
Per-token loss scaled by world size: 0.0007881383062340319
Epoch: 1, Step: 231, Rank: 4, loss = 0.7649852633476257
Epoch: 1, Step: 231, Rank: 5, loss = 0.8380728363990784
Epoch: 1, Step: 231, Rank: 7, loss = 0.31582286953926086
Epoch: 1, Step: 231, Rank: 1, loss = 1.7648173570632935
Epoch: 1, Step: 231, Rank: 6, loss = 0.7485343813896179
Epoch: 1, Step: 231, Rank: 0, loss = 0.7276272177696228
Epoch: 1, Step: 231, Rank: 3, loss = 0.7296990156173706
Per-token loss scaled by world size: 0.0020757734309881926
Epoch: 1, Step: 231, Rank: 2, loss = 1.9714657068252563
[2024-06-27 16:46:32,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=231, skipped=0, lr=[1.2e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:32,635] [INFO] [timer.py:260:stop] epoch=0/micro_step=231/global_step=231, RunningAvgSamplesPerSec=95.46183549609881, CurrSamplesPerSec=94.89959294724719, MemAllocated=22.25GB, MaxMemAllocated=28.61GB
throughput: 94.80556806722254 samples/s, lr: 1.2e-05, loss: 0.7276272177696228 cuda_mem_allocated: 22.248544216156006 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7598.0 batch_size: 84.0 total loss: 0.9826280474662781
total tokens: 2340 num samples: 10 num padding tokens: 108 - rank: 4 max len: 234 min len: 216 avg len: 223.2 num_loss_counted_tokens: 888
total tokens: 2365 num samples: 11 num padding tokens: 210 - rank: 5 max len: 215 min len: 182 avg len: 195.9090909090909 num_loss_counted_tokens: 905
total tokens: 2272 num samples: 8 num padding tokens: 188 - rank: 3 max len: 284 min len: 235 avg len: 260.5 num_loss_counted_tokens: 988
total tokens: 2244 num samples: 6 num padding tokens: 158 - rank: 1 max len: 374 min len: 325 avg len: 347.6666666666667 num_loss_counted_tokens: 1080
total tokens: 2478 num samples: 14 num padding tokens: 302 - rank: 6 max len: 177 min len: 135 avg len: 155.42857142857142 num_loss_counted_tokens: 912
total tokens: 2240 num samples: 7 num padding tokens: 95 - rank: 2 max len: 320 min len: 294 avg len: 306.42857142857144 num_loss_counted_tokens: 1063
total tokens: 2032 num samples: 4 num padding tokens: 234 - rank: 0 max len: 508 min len: 396 avg len: 449.5 num_loss_counted_tokens: 1197
total tokens: 2304 num samples: 18 num padding tokens: 404 - rank: 7 max len: 128 min len: 84 avg len: 105.55555555555556 num_loss_counted_tokens: 491
Per-token loss scaled by world size: 0.0011287112720310688
Per-token loss scaled by world size: 0.0011079174000769854
Per-token loss scaled by world size: 0.0013160385424271226
Per-token loss scaled by world size: 0.0016818487783893943
Per-token loss scaled by world size: 0.0005572142545133829
Per-token loss scaled by world size: 0.0016016666777431965
Per-token loss scaled by world size: 1.6094815009637387e-06
Epoch: 1, Step: 232, Rank: 4, loss = 0.9021224975585938
Epoch: 1, Step: 232, Rank: 6, loss = 0.8855029940605164
Epoch: 1, Step: 232, Rank: 1, loss = 1.2801320552825928
Epoch: 1, Step: 232, Rank: 3, loss = 1.0518437623977661
Epoch: 1, Step: 232, Rank: 7, loss = 0.4453534781932831
Epoch: 1, Step: 232, Rank: 5, loss = 1.3442176580429077
Epoch: 1, Step: 232, Rank: 0, loss = 0.0012863781303167343
Per-token loss scaled by world size: 0.0016978166531771421
Epoch: 1, Step: 232, Rank: 2, loss = 1.3569799661636353
[2024-06-27 16:46:33,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=232, skipped=0, lr=[1.2051948051948053e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:33,701] [INFO] [timer.py:260:stop] epoch=0/micro_step=232/global_step=232, RunningAvgSamplesPerSec=95.45858446109682, CurrSamplesPerSec=94.71988377292908, MemAllocated=22.24GB, MaxMemAllocated=28.61GB
throughput: 94.60776071204435 samples/s, lr: 1.2051948051948053e-05, loss: 0.0012863781303167343 cuda_mem_allocated: 22.238646030426025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6394.0 batch_size: 73.0 total loss: 0.9084298610687256
total tokens: 2313 num samples: 9 num padding tokens: 148 - rank: 3 max len: 257 min len: 228 avg len: 240.55555555555554 num_loss_counted_tokens: 1043
total tokens: 2486 num samples: 11 num padding tokens: 210 - rank: 4 max len: 226 min len: 191 avg len: 206.9090909090909 num_loss_counted_tokens: 869
total tokens: 2385 num samples: 15 num padding tokens: 211 - rank: 6 max len: 159 min len: 130 avg len: 144.93333333333334 num_loss_counted_tokens: 737
total tokens: 2418 num samples: 13 num padding tokens: 171 - rank: 5 max len: 186 min len: 159 avg len: 172.84615384615384 num_loss_counted_tokens: 983
total tokens: 2408 num samples: 8 num padding tokens: 154 - rank: 2 max len: 301 min len: 262 avg len: 281.75 num_loss_counted_tokens: 1154
total tokens: 2406 num samples: 6 num padding tokens: 279 - rank: 1 max len: 401 min len: 316 avg len: 354.5 num_loss_counted_tokens: 971
total tokens: 2480 num samples: 20 num padding tokens: 393 - rank: 7 max len: 124 min len: 90 avg len: 104.35 num_loss_counted_tokens: 538
total tokens: 2515 num samples: 5 num padding tokens: 140 - rank: 0 max len: 503 min len: 430 avg len: 475.0 num_loss_counted_tokens: 1511
Per-token loss scaled by world size: 0.0012491034576669335
Per-token loss scaled by world size: 0.00051008106674999
Per-token loss scaled by world size: 0.0016371281817555428
Per-token loss scaled by world size: 0.0007800052990205586
Per-token loss scaled by world size: 0.0007931729196570814
Per-token loss scaled by world size: 0.0017092993948608637
Per-token loss scaled by world size: 0.0009690917795524001
Epoch: 1, Step: 233, Rank: 5, loss = 1.2367686033248901
Epoch: 1, Step: 233, Rank: 7, loss = 0.5050440430641174
Epoch: 1, Step: 233, Rank: 0, loss = 1.6209615468978882
Epoch: 1, Step: 233, Rank: 3, loss = 0.7853403091430664
Epoch: 1, Step: 233, Rank: 4, loss = 0.7723027467727661
Epoch: 1, Step: 233, Rank: 6, loss = 0.959522008895874
Epoch: 1, Step: 233, Rank: 2, loss = 1.6924200057983398
Per-token loss scaled by world size: 0.000889244896825403
Epoch: 1, Step: 233, Rank: 1, loss = 0.8804636001586914
[2024-06-27 16:46:34,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=233, skipped=0, lr=[1.2103896103896104e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:34,756] [INFO] [timer.py:260:stop] epoch=0/micro_step=233/global_step=233, RunningAvgSamplesPerSec=95.46127318385888, CurrSamplesPerSec=96.08372927439163, MemAllocated=22.28GB, MaxMemAllocated=28.61GB
throughput: 95.98775638349886 samples/s, lr: 1.2103896103896104e-05, loss: 1.6209615468978882 cuda_mem_allocated: 22.275736331939697 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7921.0 batch_size: 76.0 total loss: 1.056602954864502
total tokens: 2528 num samples: 16 num padding tokens: 151 - rank: 6 max len: 158 min len: 136 avg len: 148.5625 num_loss_counted_tokens: 1012
total tokens: 2313 num samples: 9 num padding tokens: 101 - rank: 3 max len: 257 min len: 229 avg len: 245.77777777777777 num_loss_counted_tokens: 1022
total tokens: 2453 num samples: 11 num padding tokens: 197 - rank: 4 max len: 223 min len: 189 avg len: 205.0909090909091 num_loss_counted_tokens: 1054
total tokens: 2379 num samples: 13 num padding tokens: 131 - rank: 5 max len: 183 min len: 162 avg len: 172.92307692307693 num_loss_counted_tokens: 968
total tokens: 2310 num samples: 6 num padding tokens: 177 - rank: 1 max len: 385 min len: 325 avg len: 355.5 num_loss_counted_tokens: 1228
total tokens: 2233 num samples: 7 num padding tokens: 266 - rank: 2 max len: 319 min len: 261 avg len: 281.0 num_loss_counted_tokens: 631
total tokens: 2448 num samples: 18 num padding tokens: 369 - rank: 7 max len: 136 min len: 89 avg len: 115.5 num_loss_counted_tokens: 618
total tokens: 2296 num samples: 4 num padding tokens: 401 - rank: 0 max len: 574 min len: 412 avg len: 473.75 num_loss_counted_tokens: 948
Per-token loss scaled by world size: 0.0005367082194425166
Per-token loss scaled by world size: 0.0010475920280441642
Per-token loss scaled by world size: 0.000929902889765799
Per-token loss scaled by world size: 0.0008232997497543693
Per-token loss scaled by world size: 0.0010162729304283857
Per-token loss scaled by world size: 0.0007735125254839659
Per-token loss scaled by world size: 0.0009562580962665379
Epoch: 1, Step: 234, Rank: 5, loss = 0.9442732334136963
Epoch: 1, Step: 234, Rank: 7, loss = 0.48377537727355957
Epoch: 1, Step: 234, Rank: 1, loss = 0.8381912112236023
Epoch: 1, Step: 234, Rank: 3, loss = 0.8619471192359924
Epoch: 1, Step: 234, Rank: 6, loss = 0.9160430431365967
Epoch: 1, Step: 234, Rank: 0, loss = 0.742101788520813
Epoch: 1, Step: 234, Rank: 4, loss = 0.6972248554229736
Per-token loss scaled by world size: 0.0014601092552766204
Epoch: 1, Step: 234, Rank: 2, loss = 1.3161059617996216
[2024-06-27 16:46:35,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=234, skipped=0, lr=[1.2155844155844157e-05], mom=[(0.9, 0.95)]
[2024-06-27 16:46:35,818] [INFO] [timer.py:260:stop] epoch=0/micro_step=234/global_step=234, RunningAvgSamplesPerSec=95.46081119678031, CurrSamplesPerSec=95.35421186868149, MemAllocated=22.27GB, MaxMemAllocated=28.61GB
throughput: 95.24840061304806 samples/s, lr: 1.2155844155844157e-05, loss: 0.742101788520813 cuda_mem_allocated: 22.27454423904419 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7211.0 batch_size: 79.0 total loss: 0.8499577641487122
total tokens: 2448 num samples: 9 num padding tokens: 138 - rank: 3 max len: 272 min len: 242 avg len: 256.6666666666667 num_loss_counted_tokens: 911
total tokens: 2135 num samples: 5 num padding tokens: 219 - rank: 1 max len: 427 min len: 350 avg len: 383.2 num_loss_counted_tokens: 1003
total tokens: 2303 num samples: 7 num padding tokens: 204 - rank: 2 max len: 329 min len: 276 avg len: 299.85714285714283 num_loss_counted_tokens: 407
total tokens: 2483 num samples: 13 num padding tokens: 174 - rank: 5 max len: 191 min len: 161 avg len: 177.6153846153846 num_loss_counted_tokens: 878
total tokens: 2400 num samples: 15 num padding tokens: 185 - rank: 6 max len: 160 min len: 129 avg len: 147.66666666666666 num_loss_counted_tokens: 807
total tokens: 2330 num samples: 10 num padding tokens: 220 - rank: 4 max len: 233 min len: 193 avg len: 211.0 num_loss_counted_tokens: 1014
total tokens: 2432 num samples: 19 num padding tokens: 395 - rank: 7 max len: 128 min len: 86 avg len: 107.21052631578948 num_loss_counted_tokens: 581
total tokens: 2154 num samples: 3 num padding tokens: 200 - rank: 0 max len: 718 min len: 554 avg len: 651.3333333333334 num_loss_counted_tokens: 1701
Per-token loss scaled by world size: 0.0011220475425943732
Per-token loss scaled by world size: 0.0010597495129331946
Per-token loss scaled by world size: 0.000415762304328382
Per-token loss scaled by world size: 0.000529101409483701
Per-token loss scaled by world size: 0.00135630345903337
Per-token loss scaled by world size: 0.0005803359090350568
Per-token loss scaled by world size: 0.0008116917451843619
Epoch: 1, Step: 235, Rank: 0, loss = 0.5780870914459229
Epoch: 1, Step: 235, Rank: 5, loss = 1.0556429624557495
Epoch: 1, Step: 235, Rank: 4, loss = 1.1176996231079102
Epoch: 1, Step: 235, Rank: 7, loss = 0.41415122151374817
Epoch: 1, Step: 235, Rank: 2, loss = 1.3510477542877197
Epoch: 1, Step: 235, Rank: 3, loss = 0.8085464239120483
Epoch: 1, Step: 235, Rank: 1, loss = 0.5270511507987976
Per-token loss scaled by world size: 0.0008192601380869746 | |
Epoch: 1, Step: 235, Rank: 6, loss = 0.8160855174064636 | |
[2024-06-27 16:46:36,806] [INFO] [logging.py:96:log_dist] [Rank 0] step=235, skipped=0, lr=[1.2207792207792208e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:36,880] [INFO] [timer.py:260:stop] epoch=0/micro_step=235/global_step=235, RunningAvgSamplesPerSec=95.46023429115813, CurrSamplesPerSec=95.32658038690946, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.23947909514104 samples/s, lr: 1.2207792207792208e-05, loss: 0.5780870914459229 cuda_mem_allocated: 22.269296169281006 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7969.0 batch_size: 79.0 total loss: 0.8335390090942383 | |
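The step-235 lines above are enough to reconstruct how these numbers fit together: each rank's reported loss equals its "Per-token loss scaled by world size" value times the global num_loss_counted_tokens (7969) divided by the world size (8), and the "total loss" in the summary line is the mean across ranks. A minimal sketch of that arithmetic; the pairing of scaled values to ranks below is inferred by matching the math, since the interleaved log does not label them:

```python
# Reconstructing step 235's loss bookkeeping from the log lines above.
# Assumptions: world_size = 8 GPUs; num_loss_counted_tokens = 7969
# (taken from the step-235 summary line).
world_size = 8
num_loss_counted_tokens = 7969

# "Per-token loss scaled by world size" values, keyed by the rank whose
# "loss =" line each one reproduces (mapping inferred, not logged).
per_token_scaled = {
    4: 0.0011220475425943732,
    5: 0.0010597495129331946,
    7: 0.000415762304328382,
    1: 0.000529101409483701,
    2: 0.00135630345903337,
    0: 0.0005803359090350568,
    3: 0.0008116917451843619,
    6: 0.0008192601380869746,
}

# Each rank's reported loss = scaled value * global loss tokens / world size.
rank_loss = {r: v * num_loss_counted_tokens / world_size
             for r, v in per_token_scaled.items()}
print(round(rank_loss[0], 4))  # → 0.5781 (matches "Rank: 0, loss = 0.5780...")

# The summary's "total loss" is the plain mean of the per-rank losses.
total_loss = sum(rank_loss.values()) / world_size
print(round(total_loss, 4))    # → 0.8335 (matches "total loss: 0.8335...")
```

The same reconstruction checks out for the other steps in this excerpt, so the per-rank "loss =" lines and the summary "total loss" are redundant views of the same token-weighted average.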
total tokens: 2418 num samples: 13 num padding tokens: 171 - rank: 5 max len: 186 min len: 159 avg len: 172.84615384615384 num_loss_counted_tokens: 847 | |
total tokens: 2338 num samples: 7 num padding tokens: 268 - rank: 2 max len: 334 min len: 280 avg len: 295.7142857142857 num_loss_counted_tokens: 1302 | |
total tokens: 2464 num samples: 16 num padding tokens: 270 - rank: 6 max len: 154 min len: 123 avg len: 137.125 num_loss_counted_tokens: 813 | |
total tokens: 2466 num samples: 9 num padding tokens: 192 - rank: 3 max len: 274 min len: 232 avg len: 252.66666666666666 num_loss_counted_tokens: 1098 | |
total tokens: 2320 num samples: 10 num padding tokens: 220 - rank: 4 max len: 232 min len: 193 avg len: 210.0 num_loss_counted_tokens: 764 | |
total tokens: 2400 num samples: 6 num padding tokens: 280 - rank: 1 max len: 400 min len: 336 avg len: 353.3333333333333 num_loss_counted_tokens: 1153 | |
total tokens: 2147 num samples: 19 num padding tokens: 325 - rank: 7 max len: 113 min len: 77 avg len: 95.89473684210526 num_loss_counted_tokens: 435 | |
total tokens: 2448 num samples: 3 num padding tokens: 600 - rank: 0 max len: 816 min len: 499 avg len: 616.0 num_loss_counted_tokens: 1419 | |
Per-token loss scaled by world size: 0.000767097226344049
Per-token loss scaled by world size: 0.0014411701122298837
Per-token loss scaled by world size: 0.002061654580757022
Per-token loss scaled by world size: 0.0009628473198972642
Per-token loss scaled by world size: 0.001427920302376151
Per-token loss scaled by world size: 0.0006877083797007799
Per-token loss scaled by world size: 0.000403098005335778
Epoch: 1, Step: 236, Rank: 7, loss = 0.6875109076499939
Epoch: 1, Step: 236, Rank: 2, loss = 1.2916487455368042
Epoch: 1, Step: 236, Rank: 1, loss = 1.8477578163146973
Per-token loss scaled by world size: 0.0012513279216364026
Epoch: 1, Step: 236, Rank: 6, loss = 0.862951934337616
Epoch: 1, Step: 236, Rank: 3, loss = 0.3612765967845917
Epoch: 1, Step: 236, Rank: 5, loss = 0.6163586378097534
Epoch: 1, Step: 236, Rank: 0, loss = 1.2797735929489136
Epoch: 1, Step: 236, Rank: 4, loss = 1.1215026378631592 | |
[2024-06-27 16:46:37,867] [INFO] [logging.py:96:log_dist] [Rank 0] step=236, skipped=0, lr=[1.2259740259740261e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:37,940] [INFO] [timer.py:260:stop] epoch=0/micro_step=236/global_step=236, RunningAvgSamplesPerSec=95.46043111511592, CurrSamplesPerSec=95.50631323403665, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.40644839657303 samples/s, lr: 1.2259740259740261e-05, loss: 1.2797735929489136 cuda_mem_allocated: 22.277405738830566 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7170.0 batch_size: 85.0 total loss: 1.008597731590271 | |
total tokens: 2380 num samples: 14 num padding tokens: 286 - rank: 6 max len: 170 min len: 131 avg len: 149.57142857142858 num_loss_counted_tokens: 894 | |
total tokens: 2522 num samples: 13 num padding tokens: 110 - rank: 5 max len: 194 min len: 173 avg len: 185.53846153846155 num_loss_counted_tokens: 816 | |
total tokens: 2508 num samples: 11 num padding tokens: 206 - rank: 4 max len: 228 min len: 196 avg len: 209.27272727272728 num_loss_counted_tokens: 1041 | |
total tokens: 2214 num samples: 6 num padding tokens: 74 - rank: 1 max len: 369 min len: 346 avg len: 356.6666666666667 num_loss_counted_tokens: 1186 | |
total tokens: 2493 num samples: 9 num padding tokens: 246 - rank: 3 max len: 277 min len: 231 avg len: 249.66666666666666 num_loss_counted_tokens: 990 | |
total tokens: 2310 num samples: 7 num padding tokens: 229 - rank: 2 max len: 330 min len: 281 avg len: 297.2857142857143 num_loss_counted_tokens: 1128 | |
total tokens: 2227 num samples: 17 num padding tokens: 409 - rank: 7 max len: 131 min len: 81 avg len: 106.94117647058823 num_loss_counted_tokens: 522 | |
total tokens: 2031 num samples: 3 num padding tokens: 495 - rank: 0 max len: 677 min len: 427 avg len: 512.0 num_loss_counted_tokens: 1101 | |
Per-token loss scaled by world size: 0.0011679630260914564
Per-token loss scaled by world size: 0.0008783003431744874
Per-token loss scaled by world size: 0.0017180921277031302
Per-token loss scaled by world size: 0.000690492510329932
Per-token loss scaled by world size: 0.00040183818782679737
Per-token loss scaled by world size: 0.0010495736496523023
Epoch: 1, Step: 237, Rank: 2, loss = 0.8634790182113647
Epoch: 1, Step: 237, Rank: 1, loss = 1.6890993118286133
Epoch: 1, Step: 237, Rank: 5, loss = 1.1482536792755127
Epoch: 1, Step: 237, Rank: 6, loss = 0.6788404583930969
Epoch: 1, Step: 237, Rank: 7, loss = 0.39505717158317566 | |
Epoch: 1, Step: 237, Rank: 3, loss = 1.0318621397018433 | |
Per-token loss scaled by world size: 0.0011431180173531175
Per-token loss scaled by world size: 0.0007061999058350921
Epoch: 1, Step: 237, Rank: 0, loss = 0.6942827701568604
Epoch: 1, Step: 237, Rank: 4, loss = 1.1238279342651367
[2024-06-27 16:46:38,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=237, skipped=0, lr=[1.2311688311688312e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:39,003] [INFO] [timer.py:260:stop] epoch=0/micro_step=237/global_step=237, RunningAvgSamplesPerSec=95.45923910540589, CurrSamplesPerSec=95.18112495092059, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.06196012363505 samples/s, lr: 1.2311688311688312e-05, loss: 0.6942827701568604 cuda_mem_allocated: 22.309129238128662 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7865.0 batch_size: 88.0 total loss: 0.9530878067016602 | |
total tokens: 2478 num samples: 7 num padding tokens: 321 - rank: 2 max len: 354 min len: 268 avg len: 308.14285714285717 num_loss_counted_tokens: 1055 | |
total tokens: 2370 num samples: 10 num padding tokens: 126 - rank: 4 max len: 237 min len: 214 avg len: 224.4 num_loss_counted_tokens: 970 | |
total tokens: 2394 num samples: 9 num padding tokens: 94 - rank: 3 max len: 266 min len: 239 avg len: 255.55555555555554 num_loss_counted_tokens: 1175 | |
total tokens: 2520 num samples: 15 num padding tokens: 286 - rank: 6 max len: 168 min len: 131 avg len: 148.93333333333334 num_loss_counted_tokens: 869 | |
total tokens: 2470 num samples: 5 num padding tokens: 339 - rank: 1 max len: 494 min len: 378 avg len: 426.2 num_loss_counted_tokens: 1610 | |
total tokens: 2332 num samples: 11 num padding tokens: 200 - rank: 5 max len: 212 min len: 177 avg len: 193.8181818181818 num_loss_counted_tokens: 873 | |
total tokens: 2304 num samples: 18 num padding tokens: 305 - rank: 7 max len: 128 min len: 95 avg len: 111.05555555555556 num_loss_counted_tokens: 513 | |
total tokens: 2082 num samples: 3 num padding tokens: 267 - rank: 0 max len: 694 min len: 496 avg len: 605.0 num_loss_counted_tokens: 1399 | |
Per-token loss scaled by world size: 0.0009078406146727502
Per-token loss scaled by world size: 0.0009973897831514478
Per-token loss scaled by world size: 0.0010273231891915202
Per-token loss scaled by world size: 0.0009046648046933115
Per-token loss scaled by world size: 0.0008965819142758846
Per-token loss scaled by world size: 0.0014194970717653632
Per-token loss scaled by world size: 0.0004754299880005419
Epoch: 1, Step: 238, Rank: 2, loss = 0.9396154880523682
Epoch: 1, Step: 238, Rank: 3, loss = 0.9122376441955566
Epoch: 1, Step: 238, Rank: 1, loss = 1.2983075380325317
Epoch: 1, Step: 238, Rank: 0, loss = 0.8274290561676025
Epoch: 1, Step: 238, Rank: 5, loss = 0.8200362324714661
Epoch: 1, Step: 238, Rank: 4, loss = 0.8303337097167969 | |
Epoch: 1, Step: 238, Rank: 7, loss = 0.4348401427268982 | |
Per-token loss scaled by world size: 0.000674682087264955 | |
Epoch: 1, Step: 238, Rank: 6, loss = 0.6170811057090759 | |
[2024-06-27 16:46:39,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=238, skipped=0, lr=[1.2363636363636364e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:40,064] [INFO] [timer.py:260:stop] epoch=0/micro_step=238/global_step=238, RunningAvgSamplesPerSec=95.45861426631367, CurrSamplesPerSec=95.31200356011722, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.19345620792838 samples/s, lr: 1.2363636363636364e-05, loss: 0.8274290561676025 cuda_mem_allocated: 22.291121006011963 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7317.0 batch_size: 84.0 total loss: 0.8349851369857788 | |
total tokens: 2510 num samples: 10 num padding tokens: 284 - rank: 4 max len: 251 min len: 197 avg len: 222.6 num_loss_counted_tokens: 865 | |
total tokens: 2352 num samples: 6 num padding tokens: 115 - rank: 1 max len: 392 min len: 362 avg len: 372.8333333333333 num_loss_counted_tokens: 894 | |
total tokens: 2470 num samples: 13 num padding tokens: 222 - rank: 5 max len: 190 min len: 151 avg len: 172.92307692307693 num_loss_counted_tokens: 952 | |
total tokens: 2400 num samples: 16 num padding tokens: 134 - rank: 6 max len: 150 min len: 134 avg len: 141.625 num_loss_counted_tokens: 884 | |
total tokens: 2226 num samples: 7 num padding tokens: 207 - rank: 3 max len: 318 min len: 256 avg len: 288.42857142857144 num_loss_counted_tokens: 594 | |
total tokens: 2527 num samples: 7 num padding tokens: 114 - rank: 2 max len: 361 min len: 325 avg len: 344.7142857142857 num_loss_counted_tokens: 1176 | |
total tokens: 2340 num samples: 18 num padding tokens: 308 - rank: 7 max len: 130 min len: 89 avg len: 112.88888888888889 num_loss_counted_tokens: 637 | |
total tokens: 2052 num samples: 4 num padding tokens: 175 - rank: 0 max len: 513 min len: 410 avg len: 469.25 num_loss_counted_tokens: 1158 | |
Per-token loss scaled by world size: 0.001025509089231491
Per-token loss scaled by world size: 0.0009908818174153566
Per-token loss scaled by world size: 0.0004580298555083573
Per-token loss scaled by world size: 0.0008304164512082934
Per-token loss scaled by world size: 0.0007772233220748603
Per-token loss scaled by world size: 0.001054841442964971
Per-token loss scaled by world size: 0.0006115583819337189
Epoch: 1, Step: 239, Rank: 3, loss = 0.9061654806137085 | |
Epoch: 1, Step: 239, Rank: 6, loss = 0.8755679726600647
Epoch: 1, Step: 239, Rank: 2, loss = 0.40472662448883057
Epoch: 1, Step: 239, Rank: 4, loss = 0.7337767481803894
Epoch: 1, Step: 239, Rank: 5, loss = 0.9320842623710632
Epoch: 1, Step: 239, Rank: 1, loss = 0.686773955821991
Epoch: 1, Step: 239, Rank: 7, loss = 0.540388286113739 | |
Per-token loss scaled by world size: 0.0013989635044708848 | |
Epoch: 1, Step: 239, Rank: 0, loss = 1.236159086227417 | |
[2024-06-27 16:46:41,059] [INFO] [logging.py:96:log_dist] [Rank 0] step=239, skipped=0, lr=[1.2415584415584416e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:41,132] [INFO] [timer.py:260:stop] epoch=0/micro_step=239/global_step=239, RunningAvgSamplesPerSec=95.45573324308228, CurrSamplesPerSec=94.78064075113976, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.66690490923898 samples/s, lr: 1.2415584415584416e-05, loss: 1.236159086227417 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7069.0 batch_size: 92.0 total loss: 0.789455235004425 | |
total tokens: 2484 num samples: 12 num padding tokens: 166 - rank: 4 max len: 207 min len: 181 avg len: 193.16666666666666 num_loss_counted_tokens: 1092 | |
total tokens: 2492 num samples: 14 num padding tokens: 150 - rank: 5 max len: 178 min len: 162 avg len: 167.28571428571428 num_loss_counted_tokens: 883 | |
total tokens: 2408 num samples: 8 num padding tokens: 268 - rank: 2 max len: 301 min len: 245 avg len: 267.5 num_loss_counted_tokens: 1224
total tokens: 2496 num samples: 16 num padding tokens: 237 - rank: 6 max len: 156 min len: 125 avg len: 141.1875 num_loss_counted_tokens: 756
total tokens: 2420 num samples: 10 num padding tokens: 140 - rank: 3 max len: 242 min len: 209 avg len: 228.0 num_loss_counted_tokens: 983 | |
total tokens: 2436 num samples: 7 num padding tokens: 148 - rank: 1 max len: 348 min len: 301 avg len: 326.85714285714283 num_loss_counted_tokens: 1090 | |
total tokens: 2330 num samples: 5 num padding tokens: 326 - rank: 0 max len: 466 min len: 372 avg len: 400.8 num_loss_counted_tokens: 1464 | |
total tokens: 2480 num samples: 20 num padding tokens: 346 - rank: 7 max len: 124 min len: 83 avg len: 106.7 num_loss_counted_tokens: 549 | |
Per-token loss scaled by world size: 0.002294777426868677
Per-token loss scaled by world size: 0.0010149937588721514
Per-token loss scaled by world size: 0.00023398602206725627
Per-token loss scaled by world size: 0.0016519647324457765
Per-token loss scaled by world size: 0.0013896905584260821
Per-token loss scaled by world size: 0.003019897500053048
Per-token loss scaled by world size: 5.881906326976605e-05 | |
Epoch: 1, Step: 240, Rank: 4, loss = 1.507955551147461 | |
Epoch: 1, Step: 240, Rank: 1, loss = 0.15375806391239166
Epoch: 1, Step: 240, Rank: 5, loss = 1.0855473279953003
Epoch: 1, Step: 240, Rank: 6, loss = 0.9132004380226135
Epoch: 1, Step: 240, Rank: 3, loss = 1.984450101852417
Epoch: 1, Step: 240, Rank: 2, loss = 0.6669777631759644
Epoch: 1, Step: 240, Rank: 0, loss = 0.0386514775454998 | |
Per-token loss scaled by world size: 0.0011207807110622525 | |
Epoch: 1, Step: 240, Rank: 7, loss = 0.7364930510520935 | |
[2024-06-27 16:46:42,118] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=0, lr=[1.2467532467532468e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:42,191] [INFO] [timer.py:260:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=95.45639615216628, CurrSamplesPerSec=95.61376570882686, MemAllocated=22.21GB, MaxMemAllocated=28.61GB | |
throughput: 95.49181727092164 samples/s, lr: 1.2467532467532468e-05, loss: 0.0386514775454998 cuda_mem_allocated: 22.205370903015137 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5257.0 batch_size: 72.0 total loss: 0.8858791589736938 | |
total tokens: 2295 num samples: 9 num padding tokens: 162 - rank: 4 max len: 255 min len: 216 avg len: 237.0 num_loss_counted_tokens: 835 | |
total tokens: 2451 num samples: 19 num padding tokens: 352 - rank: 7 max len: 129 min len: 90 avg len: 110.47368421052632 num_loss_counted_tokens: 595 | |
total tokens: 2525 num samples: 5 num padding tokens: 410 - rank: 1 max len: 505 min len: 399 avg len: 423.0 num_loss_counted_tokens: 1102 | |
total tokens: 2208 num samples: 6 num padding tokens: 164 - rank: 2 max len: 368 min len: 316 avg len: 340.6666666666667 num_loss_counted_tokens: 983 | |
total tokens: 2508 num samples: 12 num padding tokens: 235 - rank: 5 max len: 209 min len: 160 avg len: 189.41666666666666 num_loss_counted_tokens: 1008 | |
total tokens: 2496 num samples: 16 num padding tokens: 218 - rank: 6 max len: 156 min len: 130 avg len: 142.375 num_loss_counted_tokens: 952 | |
total tokens: 2464 num samples: 8 num padding tokens: 257 - rank: 3 max len: 308 min len: 258 avg len: 275.875 num_loss_counted_tokens: 1114 | |
total tokens: 2388 num samples: 4 num padding tokens: 212 - rank: 0 max len: 597 min len: 512 avg len: 544.0 num_loss_counted_tokens: 853 | |
Per-token loss scaled by world size: 0.0008982328581623733
Per-token loss scaled by world size: 0.001150192110799253
Per-token loss scaled by world size: 0.0008859310764819384
Per-token loss scaled by world size: 0.0007420735200867057
Per-token loss scaled by world size: 0.0011567281326279044
Per-token loss scaled by world size: 0.0011595517862588167
Per-token loss scaled by world size: 0.0004642182029783726
Epoch: 1, Step: 241, Rank: 2, loss = 0.7608108520507812 | |
Epoch: 1, Step: 241, Rank: 1, loss = 1.179234504699707
Epoch: 1, Step: 241, Rank: 5, loss = 0.9083008170127869
Epoch: 1, Step: 241, Rank: 7, loss = 0.47593972086906433
Epoch: 1, Step: 241, Rank: 4, loss = 0.9209132194519043
Epoch: 1, Step: 241, Rank: 3, loss = 1.1859354972839355
Epoch: 1, Step: 241, Rank: 0, loss = 1.1888304948806763
Per-token loss scaled by world size: 0.0009261174709536135
Epoch: 1, Step: 241, Rank: 6, loss = 0.9495019316673279 | |
[2024-06-27 16:46:43,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=241, skipped=0, lr=[1.251948051948052e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:43,260] [INFO] [timer.py:260:stop] epoch=0/micro_step=241/global_step=241, RunningAvgSamplesPerSec=95.45364901018382, CurrSamplesPerSec=94.80429572019615, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.70157163618045 samples/s, lr: 1.251948051948052e-05, loss: 1.1888304948806763 cuda_mem_allocated: 22.275497913360596 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8202.0 batch_size: 82.0 total loss: 0.9461833834648132 | |
total tokens: 2365 num samples: 11 num padding tokens: 111 - rank: 4 max len: 215 min len: 189 avg len: 204.9090909090909 num_loss_counted_tokens: 1013 | |
total tokens: 2506 num samples: 7 num padding tokens: 154 - rank: 1 max len: 358 min len: 308 avg len: 336.0 num_loss_counted_tokens: 1057 | |
total tokens: 2431 num samples: 13 num padding tokens: 180 - rank: 5 max len: 187 min len: 152 avg len: 173.15384615384616 num_loss_counted_tokens: 748 | |
total tokens: 2424 num samples: 8 num padding tokens: 176 - rank: 2 max len: 303 min len: 265 avg len: 281.0 num_loss_counted_tokens: 1135 | |
total tokens: 2533 num samples: 17 num padding tokens: 171 - rank: 6 max len: 149 min len: 126 avg len: 138.94117647058823 num_loss_counted_tokens: 932 | |
total tokens: 2394 num samples: 19 num padding tokens: 271 - rank: 7 max len: 126 min len: 88 avg len: 111.73684210526316 num_loss_counted_tokens: 637 | |
total tokens: 2349 num samples: 9 num padding tokens: 173 - rank: 3 max len: 261 min len: 222 avg len: 241.77777777777777 num_loss_counted_tokens: 856 | |
total tokens: 2268 num samples: 4 num padding tokens: 299 - rank: 0 max len: 567 min len: 430 avg len: 492.25 num_loss_counted_tokens: 1240 | |
Per-token loss scaled by world size: 0.0003301144752185792
Per-token loss scaled by world size: 0.0007739200373180211
Per-token loss scaled by world size: 0.0012889053905382752
Per-token loss scaled by world size: 0.000827514159027487
Per-token loss scaled by world size: 0.0015094919363036752
Per-token loss scaled by world size: 0.0005019098753109574
Per-token loss scaled by world size: 0.000982986530289054
Epoch: 1, Step: 242, Rank: 0, loss = 0.26561835408210754
Epoch: 1, Step: 242, Rank: 3, loss = 0.622715413570404
Epoch: 1, Step: 242, Rank: 5, loss = 1.0370855331420898 | |
Epoch: 1, Step: 242, Rank: 1, loss = 1.214574933052063
Epoch: 1, Step: 242, Rank: 7, loss = 0.40384921431541443
Epoch: 1, Step: 242, Rank: 4, loss = 0.6658385992050171
Epoch: 1, Step: 242, Rank: 2, loss = 0.7909355759620667
Per-token loss scaled by world size: 0.0007346933125518262 | |
Epoch: 1, Step: 242, Rank: 6, loss = 0.5911526083946228 | |
[2024-06-27 16:46:44,253] [INFO] [logging.py:96:log_dist] [Rank 0] step=242, skipped=0, lr=[1.2571428571428572e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:44,326] [INFO] [timer.py:260:stop] epoch=0/micro_step=242/global_step=242, RunningAvgSamplesPerSec=95.45136803354794, CurrSamplesPerSec=94.90932336186366, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.80496537226871 samples/s, lr: 1.2571428571428572e-05, loss: 0.26561835408210754 cuda_mem_allocated: 22.315569400787354 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6437.0 batch_size: 94.0 total loss: 0.6989713311195374 | |
total tokens: 2470 num samples: 10 num padding tokens: 243 - rank: 3 max len: 247 min len: 203 avg len: 222.7 num_loss_counted_tokens: 947 | |
total tokens: 2431 num samples: 17 num padding tokens: 180 - rank: 6 max len: 143 min len: 117 avg len: 132.41176470588235 num_loss_counted_tokens: 874 | |
total tokens: 2296 num samples: 8 num padding tokens: 94 - rank: 2 max len: 287 min len: 254 avg len: 275.25 num_loss_counted_tokens: 639 | |
total tokens: 2366 num samples: 7 num padding tokens: 169 - rank: 1 max len: 338 min len: 288 avg len: 313.85714285714283 num_loss_counted_tokens: 1237 | |
total tokens: 2450 num samples: 14 num padding tokens: 227 - rank: 5 max len: 175 min len: 144 avg len: 158.78571428571428 num_loss_counted_tokens: 891 | |
total tokens: 2424 num samples: 12 num padding tokens: 160 - rank: 4 max len: 202 min len: 178 avg len: 188.66666666666666 num_loss_counted_tokens: 1103 | |
total tokens: 2436 num samples: 21 num padding tokens: 334 - rank: 7 max len: 116 min len: 78 avg len: 100.0952380952381 num_loss_counted_tokens: 491 | |
total tokens: 2424 num samples: 6 num padding tokens: 151 - rank: 0 max len: 404 min len: 340 avg len: 378.8333333333333 num_loss_counted_tokens: 1272 | |
Per-token loss scaled by world size: 0.0009886563057079911
Per-token loss scaled by world size: 0.0024725967086851597
Per-token loss scaled by world size: 0.0012395113008096814
Per-token loss scaled by world size: 0.001230726600624621
Per-token loss scaled by world size: 0.0007296023541130126
Per-token loss scaled by world size: 0.00040266563883051276
Per-token loss scaled by world size: 0.0017131021013483405
Epoch: 1, Step: 243, Rank: 6, loss = 0.8281232714653015
Epoch: 1, Step: 243, Rank: 5, loss = 1.0308873653411865
Epoch: 1, Step: 243, Rank: 0, loss = 1.038245677947998
Epoch: 1, Step: 243, Rank: 1, loss = 2.071108818054199
Epoch: 1, Step: 243, Rank: 2, loss = 0.6111331582069397
Epoch: 1, Step: 243, Rank: 7, loss = 0.33728280663490295 | |
Epoch: 1, Step: 243, Rank: 3, loss = 1.4349371194839478 | |
Per-token loss scaled by world size: 0.0007733175298199058 | |
Epoch: 1, Step: 243, Rank: 4, loss = 0.6477500796318054 | |
[2024-06-27 16:46:45,317] [INFO] [logging.py:96:log_dist] [Rank 0] step=243, skipped=0, lr=[1.2623376623376624e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:45,390] [INFO] [timer.py:260:stop] epoch=0/micro_step=243/global_step=243, RunningAvgSamplesPerSec=95.45001913280761, CurrSamplesPerSec=95.12738179810475, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.0260423341034 samples/s, lr: 1.2623376623376624e-05, loss: 1.038245677947998 cuda_mem_allocated: 22.29863452911377 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6701.0 batch_size: 77.0 total loss: 0.9999335408210754 | |
total tokens: 2534 num samples: 7 num padding tokens: 130 - rank: 2 max len: 362 min len: 314 avg len: 343.42857142857144 num_loss_counted_tokens: 1412 | |
total tokens: 2366 num samples: 13 num padding tokens: 330 - rank: 6 max len: 182 min len: 139 avg len: 156.6153846153846 num_loss_counted_tokens: 771 | |
total tokens: 2480 num samples: 8 num padding tokens: 213 - rank: 3 max len: 310 min len: 257 avg len: 283.375 num_loss_counted_tokens: 892 | |
total tokens: 2230 num samples: 5 num padding tokens: 233 - rank: 1 max len: 446 min len: 367 avg len: 399.4 num_loss_counted_tokens: 1251 | |
total tokens: 2332 num samples: 11 num padding tokens: 201 - rank: 5 max len: 212 min len: 182 avg len: 193.72727272727272 num_loss_counted_tokens: 898 | |
total tokens: 2410 num samples: 10 num padding tokens: 132 - rank: 4 max len: 241 min len: 215 avg len: 227.8 num_loss_counted_tokens: 748 | |
total tokens: 2346 num samples: 17 num padding tokens: 492 - rank: 7 max len: 138 min len: 83 avg len: 109.05882352941177 num_loss_counted_tokens: 542 | |
total tokens: 2444 num samples: 4 num padding tokens: 162 - rank: 0 max len: 611 min len: 476 avg len: 570.5 num_loss_counted_tokens: 1206 | |
Per-token loss scaled by world size: 0.0016033538850024343
Per-token loss scaled by world size: 0.0009365088772028685
Per-token loss scaled by world size: 0.0005134354578331113
Per-token loss scaled by world size: 0.0007007298991084099
Per-token loss scaled by world size: 0.0008321358473040164
Per-token loss scaled by world size: 0.001442187582142651
Per-token loss scaled by world size: 0.0005987093900330365
Epoch: 1, Step: 244, Rank: 1, loss = 1.6394293308258057
Epoch: 1, Step: 244, Rank: 5, loss = 0.9575803279876709
Epoch: 1, Step: 244, Rank: 4, loss = 0.5249877572059631
Epoch: 1, Step: 244, Rank: 3, loss = 0.8508589267730713
Epoch: 1, Step: 244, Rank: 0, loss = 0.7164963483810425
Epoch: 1, Step: 244, Rank: 2, loss = 1.4746367931365967
Per-token loss scaled by world size: 0.0007857671007514 | |
Epoch: 1, Step: 244, Rank: 7, loss = 0.6121803522109985 | |
Epoch: 1, Step: 244, Rank: 6, loss = 0.803446888923645 | |
[2024-06-27 16:46:46,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=244, skipped=0, lr=[1.2675324675324676e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:46,447] [INFO] [timer.py:260:stop] epoch=0/micro_step=244/global_step=244, RunningAvgSamplesPerSec=95.45149704140981, CurrSamplesPerSec=95.80901261461477, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.69195538007219 samples/s, lr: 1.2675324675324676e-05, loss: 0.7164963483810425 cuda_mem_allocated: 22.253076553344727 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8180.0 batch_size: 77.0 total loss: 0.947452187538147 | |
total tokens: 2405 num samples: 13 num padding tokens: 240 - rank: 5 max len: 185 min len: 153 avg len: 166.53846153846155 num_loss_counted_tokens: 730 | |
total tokens: 2442 num samples: 6 num padding tokens: 323 - rank: 1 max len: 407 min len: 311 avg len: 353.1666666666667 num_loss_counted_tokens: 1121 | |
total tokens: 2360 num samples: 8 num padding tokens: 99 - rank: 2 max len: 295 min len: 265 avg len: 282.625 num_loss_counted_tokens: 800 | |
total tokens: 2448 num samples: 16 num padding tokens: 451 - rank: 6 max len: 153 min len: 103 avg len: 124.8125 num_loss_counted_tokens: 740 | |
total tokens: 2420 num samples: 11 num padding tokens: 185 - rank: 4 max len: 220 min len: 185 avg len: 203.1818181818182 num_loss_counted_tokens: 907 | |
total tokens: 2340 num samples: 9 num padding tokens: 139 - rank: 3 max len: 260 min len: 224 avg len: 244.55555555555554 num_loss_counted_tokens: 1008 | |
total tokens: 392 num samples: 4 num padding tokens: 34 - rank: 7 max len: 98 min len: 84 avg len: 89.5 num_loss_counted_tokens: 63 | |
total tokens: 2332 num samples: 4 num padding tokens: 390 - rank: 0 max len: 583 min len: 409 avg len: 485.5 num_loss_counted_tokens: 1064 | |
Per-token loss scaled by world size: 0.0011938331881538033
Per-token loss scaled by world size: 0.0013049523113295436
Per-token loss scaled by world size: 0.0003334296925459057
Per-token loss scaled by world size: 0.0008199867443181574
Per-token loss scaled by world size: 0.0010781510500237346
Per-token loss scaled by world size: 0.0006666453555226326
Per-token loss scaled by world size: 0.0007516139303334057
Epoch: 1, Step: 245, Rank: 1, loss = 1.253226399421692
Epoch: 1, Step: 245, Rank: 2, loss = 1.3698736429214478
Epoch: 1, Step: 245, Rank: 3, loss = 1.1317890882492065
Epoch: 1, Step: 245, Rank: 4, loss = 0.8607810735702515
Epoch: 1, Step: 245, Rank: 7, loss = 0.35001781582832336
Epoch: 1, Step: 245, Rank: 5, loss = 0.6998109817504883
Epoch: 1, Step: 245, Rank: 6, loss = 0.7890067100524902 | |
Per-token loss scaled by world size: 0.0020646490156650543 | |
Epoch: 1, Step: 245, Rank: 0, loss = 2.167365312576294 | |
[2024-06-27 16:46:47,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=245, skipped=0, lr=[1.2727272727272728e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:47,508] [INFO] [timer.py:260:stop] epoch=0/micro_step=245/global_step=245, RunningAvgSamplesPerSec=95.45123083697602, CurrSamplesPerSec=95.38685299311936, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.25950978824237 samples/s, lr: 1.2727272727272728e-05, loss: 2.167365312576294 cuda_mem_allocated: 22.294579029083252 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8398.0 batch_size: 84.0 total loss: 1.0777338743209839 | |
total tokens: 2304 num samples: 9 num padding tokens: 152 - rank: 4 max len: 256 min len: 215 avg len: 239.11111111111111 num_loss_counted_tokens: 731 | |
total tokens: 2456 num samples: 8 num padding tokens: 181 - rank: 3 max len: 307 min len: 269 avg len: 284.375 num_loss_counted_tokens: 1058 | |
total tokens: 2422 num samples: 14 num padding tokens: 306 - rank: 6 max len: 173 min len: 133 avg len: 151.14285714285714 num_loss_counted_tokens: 802 | |
total tokens: 2322 num samples: 6 num padding tokens: 264 - rank: 2 max len: 387 min len: 312 avg len: 343.0 num_loss_counted_tokens: 1306 | |
total tokens: 2204 num samples: 4 num padding tokens: 293 - rank: 1 max len: 551 min len: 450 avg len: 477.75 num_loss_counted_tokens: 1094 | |
total tokens: 2448 num samples: 12 num padding tokens: 169 - rank: 5 max len: 204 min len: 174 avg len: 189.91666666666666 num_loss_counted_tokens: 1064 | |
total tokens: 2451 num samples: 19 num padding tokens: 463 - rank: 7 max len: 129 min len: 78 avg len: 104.63157894736842 num_loss_counted_tokens: 592 | |
total tokens: 1868 num samples: 2 num padding tokens: 86 - rank: 0 max len: 934 min len: 848 avg len: 891.0 num_loss_counted_tokens: 820 | |
Per-token loss scaled by world size: 0.0018964088521897793
Per-token loss scaled by world size: 0.000759051414206624
Per-token loss scaled by world size: 0.00064320262754336
Per-token loss scaled by world size: 0.0007432251586578786
Per-token loss scaled by world size: 0.0012624531518667936
Per-token loss scaled by world size: 0.0010214410722255707
Per-token loss scaled by world size: 0.0011542127467691898 | |
Epoch: 1, Step: 246, Rank: 3, loss = 1.0964405536651611 | |
Epoch: 1, Step: 246, Rank: 5, loss = 0.6592361330986023
Epoch: 1, Step: 246, Rank: 7, loss = 0.5586214661598206
Epoch: 1, Step: 246, Rank: 1, loss = 1.6470310688018799
Epoch: 1, Step: 246, Rank: 6, loss = 0.8871216177940369
Epoch: 1, Step: 246, Rank: 2, loss = 0.6454910635948181 | |
Epoch: 1, Step: 246, Rank: 4, loss = 1.0024337768554688 | |
Per-token loss scaled by world size: 0.0018903552554547787 | |
Epoch: 1, Step: 246, Rank: 0, loss = 1.6417735815048218 | |
[2024-06-27 16:46:48,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=246, skipped=0, lr=[1.277922077922078e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:48,565] [INFO] [timer.py:260:stop] epoch=0/micro_step=246/global_step=246, RunningAvgSamplesPerSec=95.45301470820124, CurrSamplesPerSec=95.88848113751678, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.78913757628398 samples/s, lr: 1.277922077922078e-05, loss: 1.6417735815048218 cuda_mem_allocated: 22.296606063842773 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6948.0 batch_size: 84.0 total loss: 1.0172686576843262 | |
total tokens: 2340 num samples: 12 num padding tokens: 155 - rank: 5 max len: 195 min len: 167 avg len: 182.08333333333334 num_loss_counted_tokens: 952 | |
total tokens: 2390 num samples: 10 num padding tokens: 220 - rank: 4 max len: 239 min len: 199 avg len: 217.0 num_loss_counted_tokens: 768 | |
total tokens: 2352 num samples: 6 num padding tokens: 429 - rank: 2 max len: 392 min len: 287 avg len: 320.5 num_loss_counted_tokens: 726 | |
total tokens: 2296 num samples: 8 num padding tokens: 226 - rank: 3 max len: 287 min len: 247 avg len: 258.75 num_loss_counted_tokens: 1112 | |
total tokens: 2460 num samples: 15 num padding tokens: 217 - rank: 6 max len: 164 min len: 138 avg len: 149.53333333333333 num_loss_counted_tokens: 829 | |
total tokens: 2356 num samples: 4 num padding tokens: 257 - rank: 1 max len: 589 min len: 433 avg len: 524.75 num_loss_counted_tokens: 1445 | |
total tokens: 2430 num samples: 18 num padding tokens: 365 - rank: 7 max len: 135 min len: 90 avg len: 114.72222222222223 num_loss_counted_tokens: 590 | |
total tokens: 2229 num samples: 3 num padding tokens: 202 - rank: 0 max len: 743 min len: 604 avg len: 675.6666666666666 num_loss_counted_tokens: 1362 | |
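The per-rank batch lines above are internally consistent: each rank pads its samples up to that batch's max length, so total tokens = num samples × max len, and the average length follows from the padding count. A small sketch checking this against the rank-5 line from this step (the invariant is an observation from the log, not documented behavior):

```python
# Sketch: consistency check of a logged batch line
# (rank 5: "total tokens: 2340 num samples: 12 num padding tokens: 155
#  max len: 195 ... avg len: 182.08333333333334").
num_samples = 12
max_len = 195
num_padding_tokens = 155

# every sample is padded to max_len, so the padded batch size is:
total_tokens = num_samples * max_len  # 2340, as logged

# avg len is the mean *unpadded* sample length
avg_len = (total_tokens - num_padding_tokens) / num_samples
# 182.08333..., matching the logged "avg len"
```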
Per-token loss scaled by world size: 0.0008994477102532983
Per-token loss scaled by world size: 0.001282328856177628
Per-token loss scaled by world size: 0.0008731319103389978
Per-token loss scaled by world size: 0.0012572278501465917
Per-token loss scaled by world size: 0.0007818038575351238
Per-token loss scaled by world size: 0.0010385923087596893
Per-token loss scaled by world size: 0.0011345522943884134
Epoch: 1, Step: 247, Rank: 4, loss = 0.7542993426322937
Epoch: 1, Step: 247, Rank: 7, loss = 0.7322302460670471
Epoch: 1, Step: 247, Rank: 2, loss = 1.0753930807113647
Epoch: 1, Step: 247, Rank: 3, loss = 0.6556402444839478
Epoch: 1, Step: 247, Rank: 5, loss = 0.8709895014762878
Epoch: 1, Step: 247, Rank: 6, loss = 1.0543427467346191
Epoch: 1, Step: 247, Rank: 0, loss = 0.9514638781547546
Per-token loss scaled by world size: 0.0014118417166173458
Epoch: 1, Step: 247, Rank: 1, loss = 1.1840057373046875 | |
[2024-06-27 16:46:49,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=247, skipped=0, lr=[1.2831168831168832e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:49,623] [INFO] [timer.py:260:stop] epoch=0/micro_step=247/global_step=247, RunningAvgSamplesPerSec=95.45356740037474, CurrSamplesPerSec=95.58861586927027, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.50275678570598 samples/s, lr: 1.2831168831168832e-05, loss: 0.9514638781547546 cuda_mem_allocated: 22.29064416885376 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6709.0 batch_size: 86.0 total loss: 0.9097955226898193 | |
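The learning-rate values DeepSpeed logs here increase by a constant amount each step, so each logged lr equals step × increment (consistent with a linear warmup; the warmup target and length are not visible in this excerpt). A quick sketch from the step-246/247 values:

```python
# Sketch: the logged lr schedule looks linear in the global step.
lr_246 = 1.277922077922078e-05   # logged at step=246
lr_247 = 1.2831168831168832e-05  # logged at step=247

increment = lr_247 - lr_246      # per-step lr increase

# if the schedule is linear from zero, lr(step) = step * increment;
# this reproduces e.g. the lr logged at step 250 (1.2987012987012988e-05)
lr_250 = 250 * increment
```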
Per-token loss scaled by world size: 0.0008694587741047144
Per-token loss scaled by world size: 0.0006813480868004262
Per-token loss scaled by world size: 0.0005269909743219614
Per-token loss scaled by world size: 0.001867679413408041
Per-token loss scaled by world size: 0.0010225275764241815
Per-token loss scaled by world size: 0.0009433904197067022
Per-token loss scaled by world size: 0.0013845161302015185
Epoch: 1, Step: 248, Rank: 6, loss = 0.8177259564399719
Epoch: 1, Step: 248, Rank: 3, loss = 0.6408078670501709
Epoch: 1, Step: 248, Rank: 5, loss = 0.8872587084770203
Epoch: 1, Step: 248, Rank: 7, loss = 0.4956350028514862
Epoch: 1, Step: 248, Rank: 2, loss = 0.9616872072219849
Epoch: 1, Step: 248, Rank: 0, loss = 1.7565524578094482
Per-token loss scaled by world size: 0.0008515167864970863
Epoch: 1, Step: 248, Rank: 1, loss = 1.3021373748779297 | |
Epoch: 1, Step: 248, Rank: 4, loss = 0.8008515238761902 | |
[2024-06-27 16:46:50,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=248, skipped=0, lr=[1.2883116883116884e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:50,671] [INFO] [timer.py:260:stop] epoch=0/micro_step=248/global_step=248, RunningAvgSamplesPerSec=95.45776603671675, CurrSamplesPerSec=96.49768397803614, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 96.40766868874157 samples/s, lr: 1.2883116883116884e-05, loss: 1.7565524578094482 cuda_mem_allocated: 22.256892204284668 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7524.0 batch_size: 78.0 total loss: 0.9578319787979126 | |
Per-token loss scaled by world size: 0.000633179908618331
Per-token loss scaled by world size: 0.0006502951728180051
Per-token loss scaled by world size: 0.0008006296120584011
Per-token loss scaled by world size: 0.0010704627493396401
Per-token loss scaled by world size: 0.0009111033868975937
Per-token loss scaled by world size: 0.0004684995219577104
Per-token loss scaled by world size: 0.0013270556228235364
Epoch: 1, Step: 249, Rank: 7, loss = 0.45713841915130615
Epoch: 1, Step: 249, Rank: 6, loss = 0.6178252696990967
Epoch: 1, Step: 249, Rank: 4, loss = 0.6345255374908447
Epoch: 1, Step: 249, Rank: 1, loss = 0.8890091180801392
Epoch: 1, Step: 249, Rank: 5, loss = 0.7812143564224243
Epoch: 1, Step: 249, Rank: 2, loss = 1.0445040464401245
Epoch: 1, Step: 249, Rank: 3, loss = 1.2948745489120483
Per-token loss scaled by world size: 0.001634923624806106
Epoch: 1, Step: 249, Rank: 0, loss = 1.5952767133712769 | |
[2024-06-27 16:46:51,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=249, skipped=0, lr=[1.2935064935064937e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:51,737] [INFO] [timer.py:260:stop] epoch=0/micro_step=249/global_step=249, RunningAvgSamplesPerSec=95.45546477131123, CurrSamplesPerSec=94.89270457270105, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.79956378183819 samples/s, lr: 1.2935064935064937e-05, loss: 1.5952767133712769 cuda_mem_allocated: 22.314496517181396 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7806.0 batch_size: 87.0 total loss: 0.91429603099823 | |
Per-token loss scaled by world size: 0.0010442727943882346
Per-token loss scaled by world size: 0.0009170144912786782
Per-token loss scaled by world size: 0.001187845948152244
Per-token loss scaled by world size: 0.0013031557900831103
Per-token loss scaled by world size: 0.0013772959355264902
Per-token loss scaled by world size: 0.0006523561314679682
Per-token loss scaled by world size: 0.0005529810441657901
Epoch: 1, Step: 250, Rank: 1, loss = 1.2186135053634644
Epoch: 1, Step: 250, Rank: 6, loss = 0.9765255451202393
Epoch: 1, Step: 250, Rank: 3, loss = 0.8575232028961182
Epoch: 1, Step: 250, Rank: 2, loss = 0.6100345253944397
Epoch: 1, Step: 250, Rank: 7, loss = 0.5171064138412476
Epoch: 1, Step: 250, Rank: 0, loss = 1.2879438400268555
Per-token loss scaled by world size: 0.0008250686223618686
Epoch: 1, Step: 250, Rank: 5, loss = 1.1107844114303589
Epoch: 1, Step: 250, Rank: 4, loss = 0.7715423107147217 | |
[2024-06-27 16:46:52,729] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=0, lr=[1.2987012987012988e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:52,803] [INFO] [timer.py:260:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=95.45307770136554, CurrSamplesPerSec=94.8671055830847, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.7809754089273 samples/s, lr: 1.2987012987012988e-05, loss: 1.2879438400268555 cuda_mem_allocated: 22.288377285003662 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7481.0 batch_size: 84.0 total loss: 0.9187591671943665 | |
Per-token loss scaled by world size: 0.0010691670468077064
Per-token loss scaled by world size: 0.0013301678700372577
Per-token loss scaled by world size: 0.001465952955186367
Per-token loss scaled by world size: 0.00061423284932971
Per-token loss scaled by world size: 0.0007980645168572664
Per-token loss scaled by world size: 0.00017644117178861052
Per-token loss scaled by world size: 0.0024659635964781046
Epoch: 1, Step: 251, Rank: 5, loss = 0.72843337059021
Epoch: 1, Step: 251, Rank: 3, loss = 0.9758822321891785
Epoch: 1, Step: 251, Rank: 0, loss = 2.2508082389831543
Epoch: 1, Step: 251, Rank: 4, loss = 1.2141107320785522
Epoch: 1, Step: 251, Rank: 2, loss = 0.16104668378829956 | |
Epoch: 1, Step: 251, Rank: 1, loss = 1.3380485773086548 | |
Per-token loss scaled by world size: 0.000775110733229667 | |
Epoch: 1, Step: 251, Rank: 7, loss = 0.5606410503387451 | |
Epoch: 1, Step: 251, Rank: 6, loss = 0.7074823379516602 | |
[2024-06-27 16:46:53,784] [INFO] [logging.py:96:log_dist] [Rank 0] step=251, skipped=0, lr=[1.303896103896104e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:53,858] [INFO] [timer.py:260:stop] epoch=0/micro_step=251/global_step=251, RunningAvgSamplesPerSec=95.45491018956811, CurrSamplesPerSec=95.91155007465154, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 95.8188848212093 samples/s, lr: 1.303896103896104e-05, loss: 2.2508082389831543 cuda_mem_allocated: 22.271442413330078 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7302.0 batch_size: 81.0 total loss: 0.9920566082000732 | |
Per-token loss scaled by world size: 0.0011229186784476042
Per-token loss scaled by world size: 0.000886655121576041
Per-token loss scaled by world size: 0.0013310906942933798
Per-token loss scaled by world size: 0.0021814664360135794
Per-token loss scaled by world size: 0.00029832281870767474
Per-token loss scaled by world size: 0.0006796540110372007
Per-token loss scaled by world size: 0.0005574325914494693
Epoch: 1, Step: 252, Rank: 0, loss = 2.135382890701294
Epoch: 1, Step: 252, Rank: 5, loss = 0.8679245114326477
Epoch: 1, Step: 252, Rank: 1, loss = 1.0991970300674438
Epoch: 1, Step: 252, Rank: 2, loss = 1.302971363067627
Epoch: 1, Step: 252, Rank: 4, loss = 0.5456568002700806
Epoch: 1, Step: 252, Rank: 6, loss = 0.6652963161468506 | |
Epoch: 1, Step: 252, Rank: 7, loss = 0.2920207381248474 | |
Per-token loss scaled by world size: 0.0010891538113355637 | |
Epoch: 1, Step: 252, Rank: 3, loss = 1.066145420074463 | |
[2024-06-27 16:46:54,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=252, skipped=0, lr=[1.3090909090909092e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:54,918] [INFO] [timer.py:260:stop] epoch=0/micro_step=252/global_step=252, RunningAvgSamplesPerSec=95.45489084931795, CurrSamplesPerSec=95.4500753709462, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.36387763773239 samples/s, lr: 1.3090909090909092e-05, loss: 2.135382890701294 cuda_mem_allocated: 22.30650568008423 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7831.0 batch_size: 83.0 total loss: 0.996824324131012 | |
Per-token loss scaled by world size: 0.0007933040033094585
Per-token loss scaled by world size: 0.0010198986856266856
Per-token loss scaled by world size: 0.0006461910088546574
Per-token loss scaled by world size: 0.0008227372309193015
Per-token loss scaled by world size: 0.001042232383042574
Per-token loss scaled by world size: 0.0007965200347825885
Per-token loss scaled by world size: 0.0010902613867074251
Epoch: 1, Step: 253, Rank: 1, loss = 0.7613735198974609
Epoch: 1, Step: 253, Rank: 4, loss = 0.7896220684051514
Epoch: 1, Step: 253, Rank: 5, loss = 0.6201817989349365
Epoch: 1, Step: 253, Rank: 2, loss = 0.9788477420806885
Epoch: 1, Step: 253, Rank: 3, loss = 1.0002825260162354
Epoch: 1, Step: 253, Rank: 6, loss = 0.7644600868225098
Epoch: 1, Step: 253, Rank: 0, loss = 1.0463783740997314 | |
Per-token loss scaled by world size: 0.000651489244773984 | |
Epoch: 1, Step: 253, Rank: 7, loss = 0.6252667903900146 | |
[2024-06-27 16:46:55,904] [INFO] [logging.py:96:log_dist] [Rank 0] step=253, skipped=0, lr=[1.3142857142857145e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:55,977] [INFO] [timer.py:260:stop] epoch=0/micro_step=253/global_step=253, RunningAvgSamplesPerSec=95.45495586831997, CurrSamplesPerSec=95.47121339834761, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.38538422724592 samples/s, lr: 1.3142857142857145e-05, loss: 1.0463783740997314 cuda_mem_allocated: 22.256772994995117 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7678.0 batch_size: 80.0 total loss: 0.8233016133308411 | |
Per-token loss scaled by world size: 0.0005201502935960889
Per-token loss scaled by world size: 0.0006109977839514613
Per-token loss scaled by world size: 0.0010340382577851415
Per-token loss scaled by world size: 0.0003194270539097488
Per-token loss scaled by world size: 0.0003940401948057115
Per-token loss scaled by world size: 0.0018396878149360418
Per-token loss scaled by world size: 0.00040526880184188485
Epoch: 1, Step: 254, Rank: 5, loss = 0.6464356780052185
Epoch: 1, Step: 254, Rank: 1, loss = 1.9463896751403809
Epoch: 1, Step: 254, Rank: 2, loss = 1.0940124988555908
Epoch: 1, Step: 254, Rank: 6, loss = 0.5503190159797668
Epoch: 1, Step: 254, Rank: 4, loss = 0.3379538357257843
Epoch: 1, Step: 254, Rank: 0, loss = 0.4287743866443634 | |
Epoch: 1, Step: 254, Rank: 7, loss = 0.4168945252895355 | |
Per-token loss scaled by world size: 0.0010070151183754206 | |
Epoch: 1, Step: 254, Rank: 3, loss = 1.0654219388961792 | |
[2024-06-27 16:46:56,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=254, skipped=0, lr=[1.3194805194805196e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:57,036] [INFO] [timer.py:260:stop] epoch=0/micro_step=254/global_step=254, RunningAvgSamplesPerSec=95.45470895350645, CurrSamplesPerSec=95.39277370804996, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.2910261719887 samples/s, lr: 1.3194805194805196e-05, loss: 0.4287743866443634 cuda_mem_allocated: 22.262856006622314 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8464.0 batch_size: 78.0 total loss: 0.8107752203941345 | |
Per-token loss scaled by world size: 0.0008168249041773379
Per-token loss scaled by world size: 0.00140188483055681
Per-token loss scaled by world size: 0.000639227801002562
Per-token loss scaled by world size: 0.0007751485100015998
Per-token loss scaled by world size: 0.001577736926265061
Per-token loss scaled by world size: 0.0011177122360095382
Per-token loss scaled by world size: 0.0009635292226448655
Epoch: 1, Step: 255, Rank: 4, loss = 0.7310582995414734
Epoch: 1, Step: 255, Rank: 1, loss = 1.254686951637268
Epoch: 1, Step: 255, Rank: 3, loss = 0.5721088647842407
Per-token loss scaled by world size: 0.0012350209290161729
Epoch: 1, Step: 255, Rank: 7, loss = 0.6937578916549683
Epoch: 1, Step: 255, Rank: 0, loss = 1.4120745658874512
Epoch: 1, Step: 255, Rank: 5, loss = 0.8623586297035217
Epoch: 1, Step: 255, Rank: 6, loss = 1.0003525018692017
Epoch: 1, Step: 255, Rank: 2, loss = 1.1053436994552612 | |
[2024-06-27 16:46:58,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=255, skipped=0, lr=[1.3246753246753249e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:58,103] [INFO] [timer.py:260:stop] epoch=0/micro_step=255/global_step=255, RunningAvgSamplesPerSec=95.45237548064779, CurrSamplesPerSec=94.86795493495595, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.77459500745552 samples/s, lr: 1.3246753246753249e-05, loss: 1.4120745658874512 cuda_mem_allocated: 22.259278774261475 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7160.0 batch_size: 81.0 total loss: 0.9539676308631897 | |
Per-token loss scaled by world size: 0.0010382115142419934
Per-token loss scaled by world size: 0.0013286086032167077
Per-token loss scaled by world size: 0.0008076017838902771
Per-token loss scaled by world size: 0.001189123373478651
Per-token loss scaled by world size: 0.0010962182423099875
Per-token loss scaled by world size: 0.0006732336478307843
Per-token loss scaled by world size: 0.0009348522871732712
Epoch: 1, Step: 256, Rank: 4, loss = 1.1018363237380981
Epoch: 1, Step: 256, Rank: 1, loss = 1.043532371520996
Epoch: 1, Step: 256, Rank: 0, loss = 1.3354177474975586
Epoch: 1, Step: 256, Rank: 5, loss = 0.8117407560348511
Epoch: 1, Step: 256, Rank: 2, loss = 1.1952176094055176
Epoch: 1, Step: 256, Rank: 3, loss = 0.939643383026123 | |
Epoch: 1, Step: 256, Rank: 6, loss = 0.6766839623451233 | |
Per-token loss scaled by world size: 0.00039727220428176224 | |
Epoch: 1, Step: 256, Rank: 7, loss = 0.3993082344532013 | |
[2024-06-27 16:46:59,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=256, skipped=0, lr=[1.32987012987013e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:46:59,172] [INFO] [timer.py:260:stop] epoch=0/micro_step=256/global_step=256, RunningAvgSamplesPerSec=95.44875724966012, CurrSamplesPerSec=94.54207482347903, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.4441488014261 samples/s, lr: 1.32987012987013e-05, loss: 1.3354177474975586 cuda_mem_allocated: 22.29243278503418 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8041.0 batch_size: 92.0 total loss: 0.9379225373268127 | |
Per-token loss scaled by world size: 0.0005503868451341987
Per-token loss scaled by world size: 0.0005360267241485417
Per-token loss scaled by world size: 0.0012882084120064974
Per-token loss scaled by world size: 0.0008106383611448109
Per-token loss scaled by world size: 0.001321268966421485
Per-token loss scaled by world size: 0.000772886211052537
Per-token loss scaled by world size: 0.001282807788811624
Epoch: 1, Step: 257, Rank: 5, loss = 0.7540963292121887
Epoch: 1, Step: 257, Rank: 3, loss = 1.1933319568634033
Epoch: 1, Step: 257, Rank: 7, loss = 0.4986388683319092
Epoch: 1, Step: 257, Rank: 2, loss = 1.1983559131622314
Epoch: 1, Step: 257, Rank: 4, loss = 0.5119973421096802
Epoch: 1, Step: 257, Rank: 0, loss = 0.7189773917198181
Epoch: 1, Step: 257, Rank: 1, loss = 1.2291104793548584
Per-token loss scaled by world size: 0.0008184509351849556 | |
Epoch: 1, Step: 257, Rank: 6, loss = 0.7613639831542969 | |
[2024-06-27 16:47:00,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=257, skipped=0, lr=[1.3350649350649351e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:00,237] [INFO] [timer.py:260:stop] epoch=0/micro_step=257/global_step=257, RunningAvgSamplesPerSec=95.44676417770023, CurrSamplesPerSec=94.94320524217223, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.84303948515465 samples/s, lr: 1.3350649350649351e-05, loss: 0.7189773917198181 cuda_mem_allocated: 22.299350261688232 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7442.0 batch_size: 79.0 total loss: 0.8582339882850647 | |
Per-token loss scaled by world size: 0.0006608775001950562
Per-token loss scaled by world size: 0.0006481227464973927
Per-token loss scaled by world size: 0.002020034473389387
Per-token loss scaled by world size: 0.0009490686934441328
Per-token loss scaled by world size: 0.0015174505533650517
Per-token loss scaled by world size: 0.0009281217353418469
Per-token loss scaled by world size: 0.0006677981582470238
Epoch: 1, Step: 258, Rank: 5, loss = 0.6171748638153076
Epoch: 1, Step: 258, Rank: 0, loss = 1.923577904701233
Epoch: 1, Step: 258, Rank: 6, loss = 0.9037506580352783
Epoch: 1, Step: 258, Rank: 2, loss = 1.4449923038482666
Epoch: 1, Step: 258, Rank: 7, loss = 0.6293206214904785
Epoch: 1, Step: 258, Rank: 3, loss = 0.6359108090400696
Epoch: 1, Step: 258, Rank: 4, loss = 0.8838039040565491
Per-token loss scaled by world size: 0.0008991471258923411 | |
Epoch: 1, Step: 258, Rank: 1, loss = 0.856212854385376 | |
[2024-06-27 16:47:01,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=258, skipped=0, lr=[1.3402597402597404e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:01,304] [INFO] [timer.py:260:stop] epoch=0/micro_step=258/global_step=258, RunningAvgSamplesPerSec=95.44390292075974, CurrSamplesPerSec=94.71983920925733, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.63050657531697 samples/s, lr: 1.3402597402597404e-05, loss: 1.923577904701233 cuda_mem_allocated: 22.285038471221924 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7618.0 batch_size: 88.0 total loss: 0.9868429899215698 | |
Per-token loss scaled by world size: 0.0011074679205194116
Per-token loss scaled by world size: 0.00047799741150811315
Per-token loss scaled by world size: 0.0008888741722330451
Per-token loss scaled by world size: 0.000572878576349467
Per-token loss scaled by world size: 0.0008960003033280373
Per-token loss scaled by world size: 0.0007222312851808965
Epoch: 1, Step: 259, Rank: 1, loss = 1.0318832397460938
Epoch: 1, Step: 259, Rank: 5, loss = 0.8282085061073303
Epoch: 1, Step: 259, Rank: 2, loss = 0.5337796211242676
Epoch: 1, Step: 259, Rank: 3, loss = 0.8348482847213745
Epoch: 1, Step: 259, Rank: 7, loss = 0.44537410140037537
Epoch: 1, Step: 259, Rank: 6, loss = 0.6729390025138855
Per-token loss scaled by world size: 0.0018047965131700039
Per-token loss scaled by world size: 0.0010775371920317411
Epoch: 1, Step: 259, Rank: 0, loss = 1.6816191673278809 | |
Epoch: 1, Step: 259, Rank: 4, loss = 1.0039952993392944 | |
[2024-06-27 16:47:02,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=259, skipped=0, lr=[1.3454545454545455e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:02,364] [INFO] [timer.py:260:stop] epoch=0/micro_step=259/global_step=259, RunningAvgSamplesPerSec=95.4435498160777, CurrSamplesPerSec=95.35324088337632, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.2569632372302 samples/s, lr: 1.3454545454545455e-05, loss: 1.6816191673278809 cuda_mem_allocated: 22.303642749786377 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7454.0 batch_size: 95.0 total loss: 0.8790809512138367 | |
Per-token loss scaled by world size: 0.00202369736507535
Per-token loss scaled by world size: 0.0004912478616461158
Per-token loss scaled by world size: 0.001281720818951726
Per-token loss scaled by world size: 0.00076179055031389
Per-token loss scaled by world size: 0.0008513694046996534
Per-token loss scaled by world size: 0.0009491120581515133
Per-token loss scaled by world size: 0.0007165421266108751
Epoch: 1, Step: 260, Rank: 2, loss = 1.9528679847717285
Epoch: 1, Step: 260, Rank: 7, loss = 0.4740541875362396
Epoch: 1, Step: 260, Rank: 6, loss = 0.6914631724357605
Epoch: 1, Step: 260, Rank: 4, loss = 0.735127866268158
Epoch: 1, Step: 260, Rank: 1, loss = 1.2368606328964233
Epoch: 1, Step: 260, Rank: 3, loss = 0.9158931374549866
Epoch: 1, Step: 260, Rank: 5, loss = 0.8215714693069458
Per-token loss scaled by world size: 0.0014426347333937883 | |
Epoch: 1, Step: 260, Rank: 0, loss = 1.3921425342559814 | |
[2024-06-27 16:47:03,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=0, lr=[1.3506493506493508e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:03,432] [INFO] [timer.py:260:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=95.44067752927418, CurrSamplesPerSec=94.70818724811915, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.61774264271504 samples/s, lr: 1.3506493506493508e-05, loss: 1.3921425342559814 cuda_mem_allocated: 22.306028842926025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7720.0 batch_size: 75.0 total loss: 1.0274975299835205 | |
Saving model in huggingface format at samples_seen: 24960 | |
Model saved in /instructlab/training_output/hf_format/samples_24960 | |
[16:47:21] INFO saving took 18.11514163017273 seconds utils.py:192 | |
Per-token loss scaled by world size: 0.001038363203406334
Per-token loss scaled by world size: 0.0011764621594920754
Per-token loss scaled by world size: 0.0013484086375683546
Per-token loss scaled by world size: 0.0008955668308772147
Per-token loss scaled by world size: 0.0008847821154631674
Per-token loss scaled by world size: 0.001986877294257283
Per-token loss scaled by world size: 2.6366771635366604e-05
Epoch: 1, Step: 261, Rank: 3, loss = 1.0842890739440918
Epoch: 1, Step: 261, Rank: 6, loss = 0.946022629737854
Epoch: 1, Step: 261, Rank: 4, loss = 0.8349738121032715
Epoch: 1, Step: 261, Rank: 5, loss = 0.7201476693153381
Epoch: 1, Step: 261, Rank: 2, loss = 0.7114754319190979
Epoch: 1, Step: 261, Rank: 1, loss = 1.5976977348327637
Epoch: 1, Step: 261, Rank: 7, loss = 0.02120218053460121 | |
Per-token loss scaled by world size: 0.00198960630223155 | |
Epoch: 1, Step: 261, Rank: 0, loss = 1.5998921394348145 | |
[2024-06-27 16:47:22,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=261, skipped=0, lr=[1.3558441558441559e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:22,628] [INFO] [timer.py:260:stop] epoch=0/micro_step=261/global_step=261, RunningAvgSamplesPerSec=95.43379565291467, CurrSamplesPerSec=93.69082481711062, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 93.58051439708503 samples/s, lr: 1.3558441558441559e-05, loss: 1.5998921394348145 cuda_mem_allocated: 22.29267120361328 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6433.0 batch_size: 71.0 total loss: 0.9394625425338745 | |
Per-token loss scaled by world size: 0.001011050189845264
Per-token loss scaled by world size: 0.0014862697571516037
Per-token loss scaled by world size: 0.0013477727770805359
Per-token loss scaled by world size: 0.0006327003939077258
Per-token loss scaled by world size: 0.0007510947762057185
Per-token loss scaled by world size: 0.0011350858258083463
Per-token loss scaled by world size: 0.0007110319565981627
Epoch: 1, Step: 262, Rank: 5, loss = 0.9436889290809631
Epoch: 1, Step: 262, Rank: 1, loss = 1.2579773664474487
Epoch: 1, Step: 262, Rank: 2, loss = 1.387247085571289
Epoch: 1, Step: 262, Rank: 3, loss = 1.0594607591629028
Epoch: 1, Step: 262, Rank: 7, loss = 0.590546727180481
Epoch: 1, Step: 262, Rank: 4, loss = 0.7010530829429626
Per-token loss scaled by world size: 0.0007828678353689611
Epoch: 1, Step: 262, Rank: 0, loss = 0.6636594533920288
Epoch: 1, Step: 262, Rank: 6, loss = 0.7307092547416687 | |
[2024-06-27 16:47:23,604] [INFO] [logging.py:96:log_dist] [Rank 0] step=262, skipped=0, lr=[1.3610389610389612e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:23,678] [INFO] [timer.py:260:stop] epoch=0/micro_step=262/global_step=262, RunningAvgSamplesPerSec=95.43412751069643, CurrSamplesPerSec=95.52015645586845, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 95.42379037408722 samples/s, lr: 1.3610389610389612e-05, loss: 0.6636594533920288 cuda_mem_allocated: 22.23733425140381 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7467.0 batch_size: 74.0 total loss: 0.9167928695678711 | |
Per-token loss scaled by world size: 0.002141335280612111
Per-token loss scaled by world size: 0.0010481125209480524
Per-token loss scaled by world size: 0.0005305741797201335
Per-token loss scaled by world size: 0.0007947302656248212
Per-token loss scaled by world size: 0.001276607858017087
Per-token loss scaled by world size: 0.001097489963285625
Per-token loss scaled by world size: 0.0005111643695272505
Epoch: 1, Step: 263, Rank: 5, loss = 1.0198135375976562
Epoch: 1, Step: 263, Rank: 1, loss = 2.083519220352173
Epoch: 1, Step: 263, Rank: 4, loss = 0.4973629415035248
Epoch: 1, Step: 263, Rank: 2, loss = 0.7732725739479065
Epoch: 1, Step: 263, Rank: 0, loss = 1.242139458656311
Epoch: 1, Step: 263, Rank: 3, loss = 1.0678577423095703
Epoch: 1, Step: 263, Rank: 7, loss = 0.5162487030029297 | |
Per-token loss scaled by world size: 0.0009350811596959829 | |
Epoch: 1, Step: 263, Rank: 6, loss = 0.9098339676856995 | |
[2024-06-27 16:47:24,656] [INFO] [logging.py:96:log_dist] [Rank 0] step=263, skipped=0, lr=[1.3662337662337663e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:24,729] [INFO] [timer.py:260:stop] epoch=0/micro_step=263/global_step=263, RunningAvgSamplesPerSec=95.43730616542717, CurrSamplesPerSec=96.27100366121003, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.17180808973703 samples/s, lr: 1.3662337662337663e-05, loss: 1.242139458656311 cuda_mem_allocated: 22.280386924743652 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7784.0 batch_size: 76.0 total loss: 1.0137560367584229 | |
Epoch 1: 100% 50/50 [01:11<00:00, 1.43s/it] | |
total tokens: 2520 num samples: 10 num padding tokens: 216 - rank: 4 max len: 252 min len: 214 avg len: 230.4 num_loss_counted_tokens: 753 | |
total tokens: 2436 num samples: 12 num padding tokens: 105 - rank: 4 max len: 203 min len: 186 avg len: 194.25 num_loss_counted_tokens: 1036 | |
total tokens: 2410 num samples: 10 num padding tokens: 199 - rank: 4 max len: 241 min len: 205 avg len: 221.1 num_loss_counted_tokens: 771 | |
total tokens: 2508 num samples: 11 num padding tokens: 175 - rank: 4 max len: 228 min len: 198 avg len: 212.0909090909091 num_loss_counted_tokens: 1169 | |
total tokens: 2261 num samples: 7 num padding tokens: 267 - rank: 3 max len: 323 min len: 253 avg len: 284.85714285714283 num_loss_counted_tokens: 795 | |
total tokens: 2460 num samples: 12 num padding tokens: 153 - rank: 4 max len: 205 min len: 174 avg len: 192.25 num_loss_counted_tokens: 807 | |
total tokens: 2350 num samples: 10 num padding tokens: 95 - rank: 4 max len: 235 min len: 213 avg len: 225.5 num_loss_counted_tokens: 811 | |
total tokens: 2088 num samples: 4 num padding tokens: 373 - rank: 1 max len: 522 min len: 381 avg len: 428.75 num_loss_counted_tokens: 1260 | |
total tokens: 2410 num samples: 10 num padding tokens: 124 - rank: 3 max len: 241 min len: 215 avg len: 228.6 num_loss_counted_tokens: 939 | |
total tokens: 2256 num samples: 8 num padding tokens: 270 - rank: 3 max len: 282 min len: 233 avg len: 248.25 num_loss_counted_tokens: 1020 | |
total tokens: 2408 num samples: 7 num padding tokens: 295 - rank: 3 max len: 344 min len: 249 avg len: 301.85714285714283 num_loss_counted_tokens: 949 | |
total tokens: 2440 num samples: 10 num padding tokens: 74 - rank: 3 max len: 244 min len: 228 avg len: 236.6 num_loss_counted_tokens: 1033 | |
total tokens: 2358 num samples: 9 num padding tokens: 147 - rank: 3 max len: 262 min len: 235 avg len: 245.66666666666666 num_loss_counted_tokens: 943 | |
total tokens: 2360 num samples: 10 num padding tokens: 96 - rank: 3 max len: 236 min len: 211 avg len: 226.4 num_loss_counted_tokens: 1017 | |
total tokens: 2232 num samples: 6 num padding tokens: 154 - rank: 1 max len: 372 min len: 313 avg len: 346.3333333333333 num_loss_counted_tokens: 1261 | |
total tokens: 2511 num samples: 9 num padding tokens: 256 - rank: 3 max len: 279 min len: 233 avg len: 250.55555555555554 num_loss_counted_tokens: 964 | |
total tokens: 2310 num samples: 10 num padding tokens: 145 - rank: 4 max len: 231 min len: 198 avg len: 216.5 num_loss_counted_tokens: 803
total tokens: 2440 num samples: 10 num padding tokens: 277 - rank: 4 max len: 244 min len: 192 avg len: 216.3 num_loss_counted_tokens: 1052
total tokens: 2304 num samples: 8 num padding tokens: 254 - rank: 3 max len: 288 min len: 235 avg len: 256.25 num_loss_counted_tokens: 492 | |
total tokens: 2488 num samples: 8 num padding tokens: 251 - rank: 3 max len: 311 min len: 259 avg len: 279.625 num_loss_counted_tokens: 1022 | |
total tokens: 2453 num samples: 11 num padding tokens: 368 - rank: 4 max len: 223 min len: 171 avg len: 189.54545454545453 num_loss_counted_tokens: 945 | |
total tokens: 2367 num samples: 9 num padding tokens: 170 - rank: 3 max len: 263 min len: 225 avg len: 244.11111111111111 num_loss_counted_tokens: 725 | |
total tokens: 2508 num samples: 11 num padding tokens: 250 - rank: 4 max len: 228 min len: 192 avg len: 205.27272727272728 num_loss_counted_tokens: 985 | |
total tokens: 2310 num samples: 10 num padding tokens: 90 - rank: 4 max len: 231 min len: 214 avg len: 222.0 num_loss_counted_tokens: 941 | |
total tokens: 2160 num samples: 5 num padding tokens: 113 - rank: 1 max len: 432 min len: 360 avg len: 409.4 num_loss_counted_tokens: 956 | |
total tokens: 2408 num samples: 8 num padding tokens: 149 - rank: 3 max len: 301 min len: 262 avg len: 282.375 num_loss_counted_tokens: 712 | |
total tokens: 2511 num samples: 9 num padding tokens: 171 - rank: 3 max len: 279 min len: 231 avg len: 260.0 num_loss_counted_tokens: 992 | |
total tokens: 2292 num samples: 6 num padding tokens: 189 - rank: 1 max len: 382 min len: 321 avg len: 350.5 num_loss_counted_tokens: 1101 | |
total tokens: 2456 num samples: 4 num padding tokens: 304 - rank: 1 max len: 614 min len: 464 avg len: 538.0 num_loss_counted_tokens: 672 | |
total tokens: 2430 num samples: 15 num padding tokens: 172 - rank: 6 max len: 162 min len: 134 avg len: 150.53333333333333 num_loss_counted_tokens: 908 | |
total tokens: 2450 num samples: 7 num padding tokens: 253 - rank: 1 max len: 350 min len: 283 avg len: 313.85714285714283 num_loss_counted_tokens: 1389 | |
total tokens: 2272 num samples: 8 num padding tokens: 117 - rank: 3 max len: 284 min len: 251 avg len: 269.375 num_loss_counted_tokens: 1031 | |
total tokens: 2397 num samples: 17 num padding tokens: 135 - rank: 6 max len: 141 min len: 125 avg len: 133.05882352941177 num_loss_counted_tokens: 819 | |
total tokens: 2530 num samples: 11 num padding tokens: 254 - rank: 4 max len: 230 min len: 189 avg len: 206.9090909090909 num_loss_counted_tokens: 728 | |
total tokens: 2431 num samples: 11 num padding tokens: 121 - rank: 4 max len: 221 min len: 195 avg len: 210.0 num_loss_counted_tokens: 1072 | |
total tokens: 2340 num samples: 9 num padding tokens: 263 - rank: 4 max len: 260 min len: 217 avg len: 230.77777777777777 num_loss_counted_tokens: 979 | |
total tokens: 2490 num samples: 5 num padding tokens: 227 - rank: 1 max len: 498 min len: 379 avg len: 452.6 num_loss_counted_tokens: 1543 | |
total tokens: 2472 num samples: 6 num padding tokens: 188 - rank: 1 max len: 412 min len: 341 avg len: 380.6666666666667 num_loss_counted_tokens: 1060 | |
total tokens: 2322 num samples: 9 num padding tokens: 98 - rank: 4 max len: 258 min len: 230 avg len: 247.11111111111111 num_loss_counted_tokens: 1288 | |
total tokens: 2460 num samples: 5 num padding tokens: 382 - rank: 1 max len: 492 min len: 382 avg len: 415.6 num_loss_counted_tokens: 853 | |
total tokens: 2408 num samples: 14 num padding tokens: 225 - rank: 6 max len: 172 min len: 143 avg len: 155.92857142857142 num_loss_counted_tokens: 835 | |
total tokens: 2352 num samples: 8 num padding tokens: 114 - rank: 3 max len: 294 min len: 262 avg len: 279.75 num_loss_counted_tokens: 986 | |
total tokens: 2512 num samples: 16 num padding tokens: 303 - rank: 6 max len: 157 min len: 120 avg len: 138.0625 num_loss_counted_tokens: 703 | |
total tokens: 2418 num samples: 13 num padding tokens: 290 - rank: 6 max len: 186 min len: 150 avg len: 163.69230769230768 num_loss_counted_tokens: 681 | |
total tokens: 2125 num samples: 5 num padding tokens: 204 - rank: 1 max len: 425 min len: 361 avg len: 384.2 num_loss_counted_tokens: 967 | |
total tokens: 2499 num samples: 17 num padding tokens: 173 - rank: 6 max len: 147 min len: 126 avg len: 136.8235294117647 num_loss_counted_tokens: 766 | |
total tokens: 2120 num samples: 5 num padding tokens: 235 - rank: 1 max len: 424 min len: 350 avg len: 377.0 num_loss_counted_tokens: 604 | |
total tokens: 2464 num samples: 16 num padding tokens: 333 - rank: 6 max len: 154 min len: 118 avg len: 133.1875 num_loss_counted_tokens: 793 | |
total tokens: 2510 num samples: 10 num padding tokens: 214 - rank: 3 max len: 251 min len: 208 avg len: 229.6 num_loss_counted_tokens: 923 | |
total tokens: 2190 num samples: 6 num padding tokens: 80 - rank: 1 max len: 365 min len: 332 avg len: 351.6666666666667 num_loss_counted_tokens: 1270 | |
total tokens: 2499 num samples: 17 num padding tokens: 164 - rank: 6 max len: 147 min len: 125 avg len: 137.35294117647058 num_loss_counted_tokens: 926 | |
total tokens: 2346 num samples: 6 num padding tokens: 160 - rank: 1 max len: 391 min len: 354 avg len: 364.3333333333333 num_loss_counted_tokens: 865 | |
total tokens: 2448 num samples: 12 num padding tokens: 156 - rank: 4 max len: 204 min len: 177 avg len: 191.0 num_loss_counted_tokens: 1065 | |
total tokens: 2385 num samples: 15 num padding tokens: 194 - rank: 6 max len: 159 min len: 128 avg len: 146.06666666666666 num_loss_counted_tokens: 836 | |
total tokens: 2415 num samples: 3 num padding tokens: 262 - rank: 0 max len: 805 min len: 668 avg len: 717.6666666666666 num_loss_counted_tokens: 750 | |
total tokens: 2520 num samples: 15 num padding tokens: 275 - rank: 6 max len: 168 min len: 133 avg len: 149.66666666666666 num_loss_counted_tokens: 929
total tokens: 1978 num samples: 2 num padding tokens: 232 - rank: 0 max len: 989 min len: 757 avg len: 873.0 num_loss_counted_tokens: 1544
total tokens: 2052 num samples: 4 num padding tokens: 165 - rank: 0 max len: 513 min len: 435 avg len: 471.75 num_loss_counted_tokens: 886
total tokens: 2470 num samples: 13 num padding tokens: 368 - rank: 6 max len: 190 min len: 135 avg len: 161.69230769230768 num_loss_counted_tokens: 826
total tokens: 2373 num samples: 7 num padding tokens: 202 - rank: 1 max len: 339 min len: 287 avg len: 310.14285714285717 num_loss_counted_tokens: 832 | |
total tokens: 2430 num samples: 15 num padding tokens: 271 - rank: 6 max len: 162 min len: 132 avg len: 143.93333333333334 num_loss_counted_tokens: 886 | |
total tokens: 2416 num samples: 16 num padding tokens: 208 - rank: 6 max len: 151 min len: 118 avg len: 138.0 num_loss_counted_tokens: 782 | |
total tokens: 2533 num samples: 17 num padding tokens: 303 - rank: 6 max len: 149 min len: 120 avg len: 131.1764705882353 num_loss_counted_tokens: 877 | |
total tokens: 2270 num samples: 5 num padding tokens: 106 - rank: 1 max len: 454 min len: 412 avg len: 432.8 num_loss_counted_tokens: 1208 | |
total tokens: 2260 num samples: 5 num padding tokens: 308 - rank: 0 max len: 452 min len: 351 avg len: 390.4 num_loss_counted_tokens: 1254 | |
total tokens: 2286 num samples: 9 num padding tokens: 122 - rank: 3 max len: 254 min len: 231 avg len: 240.44444444444446 num_loss_counted_tokens: 1066
total tokens: 1965 num samples: 3 num padding tokens: 348 - rank: 0 max len: 655 min len: 455 avg len: 539.0 num_loss_counted_tokens: 537
total tokens: 2444 num samples: 4 num padding tokens: 332 - rank: 0 max len: 611 min len: 454 avg len: 528.0 num_loss_counted_tokens: 1389 | |
total tokens: 2376 num samples: 6 num padding tokens: 334 - rank: 1 max len: 396 min len: 315 avg len: 340.3333333333333 num_loss_counted_tokens: 1295 | |
total tokens: 2092 num samples: 4 num padding tokens: 233 - rank: 0 max len: 523 min len: 425 avg len: 464.75 num_loss_counted_tokens: 1034 | |
total tokens: 1956 num samples: 3 num padding tokens: 226 - rank: 0 max len: 652 min len: 463 avg len: 576.6666666666666 num_loss_counted_tokens: 889 | |
total tokens: 2088 num samples: 3 num padding tokens: 333 - rank: 0 max len: 696 min len: 522 avg len: 585.0 num_loss_counted_tokens: 1076 | |
total tokens: 2349 num samples: 9 num padding tokens: 254 - rank: 4 max len: 261 min len: 204 avg len: 232.77777777777777 num_loss_counted_tokens: 1144 | |
total tokens: 2532 num samples: 4 num padding tokens: 195 - rank: 0 max len: 633 min len: 531 avg len: 584.25 num_loss_counted_tokens: 490 | |
total tokens: 2475 num samples: 15 num padding tokens: 248 - rank: 6 max len: 165 min len: 134 avg len: 148.46666666666667 num_loss_counted_tokens: 887 | |
total tokens: 2082 num samples: 3 num padding tokens: 160 - rank: 0 max len: 694 min len: 563 avg len: 640.6666666666666 num_loss_counted_tokens: 932 | |
total tokens: 2448 num samples: 6 num padding tokens: 108 - rank: 0 max len: 408 min len: 369 avg len: 390.0 num_loss_counted_tokens: 1194 | |
total tokens: 2400 num samples: 16 num padding tokens: 223 - rank: 6 max len: 150 min len: 117 avg len: 136.0625 num_loss_counted_tokens: 792 | |
total tokens: 2428 num samples: 4 num padding tokens: 289 - rank: 0 max len: 607 min len: 505 avg len: 534.75 num_loss_counted_tokens: 1749 | |
total tokens: 2365 num samples: 5 num padding tokens: 181 - rank: 1 max len: 473 min len: 403 avg len: 436.8 num_loss_counted_tokens: 1452 | |
total tokens: 2236 num samples: 4 num padding tokens: 324 - rank: 0 max len: 559 min len: 397 avg len: 478.0 num_loss_counted_tokens: 1028 | |
total tokens: 2400 num samples: 15 num padding tokens: 282 - rank: 6 max len: 160 min len: 123 avg len: 141.2 num_loss_counted_tokens: 786 | |
total tokens: 2470 num samples: 5 num padding tokens: 252 - rank: 0 max len: 494 min len: 408 avg len: 443.6 num_loss_counted_tokens: 1153 | |
total tokens: 2488 num samples: 8 num padding tokens: 335 - rank: 2 max len: 311 min len: 242 avg len: 269.125 num_loss_counted_tokens: 1063 | |
total tokens: 2220 num samples: 6 num padding tokens: 112 - rank: 2 max len: 370 min len: 326 avg len: 351.3333333333333 num_loss_counted_tokens: 1031 | |
total tokens: 2145 num samples: 5 num padding tokens: 254 - rank: 2 max len: 429 min len: 361 avg len: 378.2 num_loss_counted_tokens: 991 | |
total tokens: 2366 num samples: 14 num padding tokens: 147 - rank: 5 max len: 169 min len: 150 avg len: 158.5 num_loss_counted_tokens: 860
total tokens: 2365 num samples: 11 num padding tokens: 160 - rank: 5 max len: 215 min len: 186 avg len: 200.45454545454547 num_loss_counted_tokens: 795
total tokens: 2392 num samples: 13 num padding tokens: 259 - rank: 5 max len: 184 min len: 145 avg len: 164.07692307692307 num_loss_counted_tokens: 773 | |
total tokens: 2448 num samples: 12 num padding tokens: 181 - rank: 5 max len: 204 min len: 174 avg len: 188.91666666666666 num_loss_counted_tokens: 756 | |
total tokens: 2282 num samples: 7 num padding tokens: 146 - rank: 2 max len: 326 min len: 282 avg len: 305.14285714285717 num_loss_counted_tokens: 688 | |
total tokens: 2436 num samples: 12 num padding tokens: 204 - rank: 5 max len: 203 min len: 155 avg len: 186.0 num_loss_counted_tokens: 1059 | |
total tokens: 2226 num samples: 6 num padding tokens: 242 - rank: 2 max len: 371 min len: 308 avg len: 330.6666666666667 num_loss_counted_tokens: 1055 | |
total tokens: 2394 num samples: 14 num padding tokens: 141 - rank: 5 max len: 171 min len: 151 avg len: 160.92857142857142 num_loss_counted_tokens: 958 | |
total tokens: 2448 num samples: 9 num padding tokens: 181 - rank: 2 max len: 272 min len: 238 avg len: 251.88888888888889 num_loss_counted_tokens: 1107 | |
total tokens: 2328 num samples: 8 num padding tokens: 131 - rank: 2 max len: 291 min len: 265 avg len: 274.625 num_loss_counted_tokens: 940 | |
total tokens: 2484 num samples: 12 num padding tokens: 303 - rank: 5 max len: 207 min len: 163 avg len: 181.75 num_loss_counted_tokens: 783
total tokens: 2376 num samples: 12 num padding tokens: 247 - rank: 5 max len: 198 min len: 160 avg len: 177.41666666666666 num_loss_counted_tokens: 923
total tokens: 2488 num samples: 8 num padding tokens: 122 - rank: 2 max len: 311 min len: 278 avg len: 295.75 num_loss_counted_tokens: 1229 | |
total tokens: 2396 num samples: 4 num padding tokens: 283 - rank: 0 max len: 599 min len: 468 avg len: 528.25 num_loss_counted_tokens: 1363 | |
total tokens: 2401 num samples: 7 num padding tokens: 248 - rank: 2 max len: 343 min len: 285 avg len: 307.57142857142856 num_loss_counted_tokens: 1185 | |
total tokens: 2519 num samples: 11 num padding tokens: 206 - rank: 5 max len: 229 min len: 190 avg len: 210.27272727272728 num_loss_counted_tokens: 672 | |
total tokens: 2324 num samples: 7 num padding tokens: 154 - rank: 2 max len: 332 min len: 292 avg len: 310.0 num_loss_counted_tokens: 1325 | |
total tokens: 2280 num samples: 6 num padding tokens: 230 - rank: 2 max len: 380 min len: 317 avg len: 341.6666666666667 num_loss_counted_tokens: 1005 | |
total tokens: 2478 num samples: 7 num padding tokens: 223 - rank: 2 max len: 354 min len: 288 avg len: 322.14285714285717 num_loss_counted_tokens: 1044 | |
total tokens: 2478 num samples: 14 num padding tokens: 212 - rank: 5 max len: 177 min len: 151 avg len: 161.85714285714286 num_loss_counted_tokens: 789 | |
total tokens: 2496 num samples: 13 num padding tokens: 272 - rank: 5 max len: 192 min len: 150 avg len: 171.07692307692307 num_loss_counted_tokens: 996 | |
total tokens: 2478 num samples: 7 num padding tokens: 301 - rank: 2 max len: 354 min len: 288 avg len: 311.0 num_loss_counted_tokens: 1042 | |
total tokens: 2280 num samples: 8 num padding tokens: 125 - rank: 2 max len: 285 min len: 251 avg len: 269.375 num_loss_counted_tokens: 553 | |
total tokens: 2121 num samples: 3 num padding tokens: 251 - rank: 0 max len: 707 min len: 544 avg len: 623.3333333333334 num_loss_counted_tokens: 268 | |
total tokens: 2364 num samples: 12 num padding tokens: 204 - rank: 5 max len: 197 min len: 169 avg len: 180.0 num_loss_counted_tokens: 888 | |
total tokens: 2532 num samples: 12 num padding tokens: 329 - rank: 5 max len: 211 min len: 162 avg len: 183.58333333333334 num_loss_counted_tokens: 840 | |
total tokens: 2470 num samples: 13 num padding tokens: 171 - rank: 5 max len: 190 min len: 166 avg len: 176.84615384615384 num_loss_counted_tokens: 835 | |
total tokens: 1625 num samples: 13 num padding tokens: 154 - rank: 7 max len: 125 min len: 98 avg len: 113.15384615384616 num_loss_counted_tokens: 380 | |
total tokens: 2527 num samples: 19 num padding tokens: 355 - rank: 7 max len: 133 min len: 91 avg len: 114.3157894736842 num_loss_counted_tokens: 608 | |
total tokens: 1521 num samples: 13 num padding tokens: 158 - rank: 7 max len: 117 min len: 81 avg len: 104.84615384615384 num_loss_counted_tokens: 364 | |
total tokens: 2502 num samples: 18 num padding tokens: 388 - rank: 7 max len: 139 min len: 87 avg len: 117.44444444444444 num_loss_counted_tokens: 759 | |
total tokens: 2414 num samples: 17 num padding tokens: 464 - rank: 7 max len: 142 min len: 83 avg len: 114.70588235294117 num_loss_counted_tokens: 641 | |
total tokens: 2360 num samples: 20 num padding tokens: 310 - rank: 7 max len: 118 min len: 83 avg len: 102.5 num_loss_counted_tokens: 517 | |
total tokens: 2440 num samples: 20 num padding tokens: 288 - rank: 7 max len: 122 min len: 82 avg len: 107.6 num_loss_counted_tokens: 534 | |
total tokens: 2091 num samples: 17 num padding tokens: 291 - rank: 7 max len: 123 min len: 89 avg len: 105.88235294117646 num_loss_counted_tokens: 503 | |
total tokens: 2443 num samples: 7 num padding tokens: 364 - rank: 2 max len: 349 min len: 259 avg len: 297.0 num_loss_counted_tokens: 1004 | |
total tokens: 1980 num samples: 15 num padding tokens: 352 - rank: 7 max len: 132 min len: 77 avg len: 108.53333333333333 num_loss_counted_tokens: 322 | |
total tokens: 2488 num samples: 8 num padding tokens: 150 - rank: 2 max len: 311 min len: 265 avg len: 292.25 num_loss_counted_tokens: 1141 | |
total tokens: 2223 num samples: 19 num padding tokens: 179 - rank: 7 max len: 117 min len: 79 avg len: 107.57894736842105 num_loss_counted_tokens: 572 | |
total tokens: 2340 num samples: 12 num padding tokens: 111 - rank: 5 max len: 195 min len: 166 avg len: 185.75 num_loss_counted_tokens: 960 | |
total tokens: 2499 num samples: 21 num padding tokens: 392 - rank: 7 max len: 119 min len: 82 avg len: 100.33333333333333 num_loss_counted_tokens: 498 | |
total tokens: 2431 num samples: 13 num padding tokens: 163 - rank: 5 max len: 187 min len: 151 avg len: 174.46153846153845 num_loss_counted_tokens: 850 | |
total tokens: 2394 num samples: 19 num padding tokens: 401 - rank: 7 max len: 126 min len: 77 avg len: 104.89473684210526 num_loss_counted_tokens: 577 | |
total tokens: 2244 num samples: 17 num padding tokens: 430 - rank: 7 max len: 132 min len: 88 avg len: 106.70588235294117 num_loss_counted_tokens: 461 | |
total tokens: 2268 num samples: 18 num padding tokens: 281 - rank: 7 max len: 126 min len: 78 avg len: 110.38888888888889 num_loss_counted_tokens: 615 | |
total tokens: 2214 num samples: 6 num padding tokens: 178 - rank: 2 max len: 369 min len: 316 avg len: 339.3333333333333 num_loss_counted_tokens: 1150 | |
total tokens: 2388 num samples: 12 num padding tokens: 268 - rank: 5 max len: 199 min len: 161 avg len: 176.66666666666666 num_loss_counted_tokens: 811 | |
total tokens: 2508 num samples: 19 num padding tokens: 408 - rank: 7 max len: 132 min len: 82 avg len: 110.52631578947368 num_loss_counted_tokens: 661 | |
total tokens: 2320 num samples: 20 num padding tokens: 302 - rank: 7 max len: 116 min len: 79 avg len: 100.9 num_loss_counted_tokens: 436 | |
total tokens: 2460 num samples: 20 num padding tokens: 341 - rank: 7 max len: 123 min len: 82 avg len: 105.95 num_loss_counted_tokens: 612 | |
Per-token loss scaled by world size: 0.0015078254509717226
Per-token loss scaled by world size: 0.0014508756576105952
Per-token loss scaled by world size: 0.0010836200090125203
Per-token loss scaled by world size: 0.0005951099446974695
Per-token loss scaled by world size: 0.0006408853805623949
Per-token loss scaled by world size: 0.0018796175718307495
Per-token loss scaled by world size: 0.001330733997747302 | |
Epoch: 2, Step: 264, Rank: 0, loss = 1.2492039203643799
Epoch: 2, Step: 264, Rank: 2, loss = 1.298237681388855
Epoch: 2, Step: 264, Rank: 5, loss = 0.5123896598815918
Epoch: 2, Step: 264, Rank: 6, loss = 0.9329968094825745
Epoch: 2, Step: 264, Rank: 4, loss = 0.5518023371696472
Epoch: 2, Step: 264, Rank: 1, loss = 1.6183507442474365
Epoch: 2, Step: 264, Rank: 3, loss = 1.1457619667053223 | |
Per-token loss scaled by world size: 0.000683697173371911 | |
Epoch: 2, Step: 264, Rank: 7, loss = 0.5886632800102234 | |
[2024-06-27 16:47:26,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=264, skipped=0, lr=[1.3714285714285716e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:26,293] [INFO] [timer.py:260:stop] epoch=0/micro_step=264/global_step=264, RunningAvgSamplesPerSec=95.42602932368364, CurrSamplesPerSec=92.57116451144124, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 92.37103291172345 samples/s, lr: 1.3714285714285716e-05, loss: 1.2492039203643799 cuda_mem_allocated: 22.30256938934326 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6888.0 batch_size: 76.0 total loss: 0.9871757626533508 | |
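The step-264 numbers above can be cross-checked with a short sketch. This is an assumption inferred from the arithmetic, not something the log states: each rank's printed loss appears to equal its scaled per-token loss times num_loss_counted_tokens / world_size (6888 / 8 = 861 tokens per rank here), and the printed "total loss" appears to be the plain mean of the eight per-rank losses. The pairing of the first scaled value to rank 0 is likewise inferred, since the fused stdout does not preserve ordering.

```python
# Hypothetical reconstruction of the step-264 loss bookkeeping.
# Values are copied verbatim from the log lines above.
world_size = 8
num_loss_counted_tokens = 6888.0  # from the step-264 throughput line

# Per-rank losses as printed for step 264.
rank_losses = [
    1.2492039203643799, 1.6183507442474365, 1.298237681388855,
    1.1457619667053223, 0.5518023371696472, 0.5123896598815918,
    0.9329968094825745, 0.5886632800102234,
]

# A scaled per-token loss from the same step; multiplying by the per-rank
# token count recovers one of the printed rank losses (rank 0's, by inference).
scaled = 0.0014508756576105952
reconstructed = scaled * num_loss_counted_tokens / world_size  # ~1.2492

# Mean across ranks matches the printed "total loss" (~0.98718).
total_loss = sum(rank_losses) / world_size
print(reconstructed, total_loss)
```

The tiny residual differences (around 1e-7) are consistent with float32 accumulation on the GPU versus Python's float64 here.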
Epoch 2: 0% 1/205 [00:01<04:56, 1.45s/it]
total tokens: 2370 num samples: 10 num padding tokens: 129 - rank: 4 max len: 237 min len: 214 avg len: 224.1 num_loss_counted_tokens: 1104
total tokens: 2096 num samples: 4 num padding tokens: 265 - rank: 1 max len: 524 min len: 388 avg len: 457.75 num_loss_counted_tokens: 917 | |
total tokens: 2332 num samples: 11 num padding tokens: 173 - rank: 5 max len: 212 min len: 182 avg len: 196.27272727272728 num_loss_counted_tokens: 757 | |
total tokens: 2366 num samples: 13 num padding tokens: 328 - rank: 6 max len: 182 min len: 135 avg len: 156.76923076923077 num_loss_counted_tokens: 790 | |
total tokens: 2448 num samples: 8 num padding tokens: 341 - rank: 3 max len: 306 min len: 239 avg len: 263.375 num_loss_counted_tokens: 879 | |
total tokens: 2527 num samples: 19 num padding tokens: 377 - rank: 7 max len: 133 min len: 88 avg len: 113.15789473684211 num_loss_counted_tokens: 694 | |
total tokens: 2199 num samples: 3 num padding tokens: 115 - rank: 0 max len: 733 min len: 621 avg len: 694.6666666666666 num_loss_counted_tokens: 1587 | |
total tokens: 2322 num samples: 6 num padding tokens: 314 - rank: 2 max len: 387 min len: 310 avg len: 334.6666666666667 num_loss_counted_tokens: 1022 | |
Per-token loss scaled by world size: 0.00045981217408552766
Per-token loss scaled by world size: 0.0015727243153378367
Per-token loss scaled by world size: 0.0005740747437812388
Per-token loss scaled by world size: 0.0008069529430940747
Per-token loss scaled by world size: 0.0009249402210116386
Per-token loss scaled by world size: 0.0010686274617910385 | |
Per-token loss scaled by world size: 0.0006267482531256974 | |
Epoch: 2, Step: 265, Rank: 1, loss = 1.437273383140564 | |
Epoch: 2, Step: 265, Rank: 6, loss = 0.7374541163444519
Epoch: 2, Step: 265, Rank: 7, loss = 0.4202108383178711
Epoch: 2, Step: 265, Rank: 5, loss = 0.5246325731277466
Epoch: 2, Step: 265, Rank: 4, loss = 0.8452797532081604
Epoch: 2, Step: 265, Rank: 0, loss = 0.5727695822715759
Epoch: 2, Step: 265, Rank: 3, loss = 0.976591944694519
Per-token loss scaled by world size: 0.0010052262805402279 | |
Epoch: 2, Step: 265, Rank: 2, loss = 0.9186511635780334 | |
[2024-06-27 16:47:27,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=265, skipped=0, lr=[1.3766233766233767e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:27,369] [INFO] [timer.py:260:stop] epoch=0/micro_step=265/global_step=265, RunningAvgSamplesPerSec=95.42531114214701, CurrSamplesPerSec=95.23751928860868, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.13500108684352 samples/s, lr: 1.3766233766233767e-05, loss: 0.5727695822715759 cuda_mem_allocated: 22.259278774261475 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7311.0 batch_size: 90.0 total loss: 0.8041079044342041 | |
Epoch 2: 1% 2/205 [00:02<04:09, 1.23s/it]
total tokens: 2352 num samples: 7 num padding tokens: 55 - rank: 2 max len: 336 min len: 317 avg len: 328.14285714285717 num_loss_counted_tokens: 999
total tokens: 2405 num samples: 13 num padding tokens: 215 - rank: 5 max len: 185 min len: 154 avg len: 168.46153846153845 num_loss_counted_tokens: 857 | |
total tokens: 2532 num samples: 6 num padding tokens: 264 - rank: 1 max len: 422 min len: 342 avg len: 378.0 num_loss_counted_tokens: 1126 | |
total tokens: 2304 num samples: 9 num padding tokens: 315 - rank: 4 max len: 256 min len: 195 avg len: 221.0 num_loss_counted_tokens: 832 | |
total tokens: 2432 num samples: 16 num padding tokens: 319 - rank: 6 max len: 152 min len: 107 avg len: 132.0625 num_loss_counted_tokens: 731 | |
total tokens: 749 num samples: 7 num padding tokens: 40 - rank: 7 max len: 107 min len: 94 avg len: 101.28571428571429 num_loss_counted_tokens: 163 | |
total tokens: 2480 num samples: 8 num padding tokens: 239 - rank: 3 max len: 310 min len: 259 avg len: 280.125 num_loss_counted_tokens: 1010 | |
total tokens: 2530 num samples: 5 num padding tokens: 188 - rank: 0 max len: 506 min len: 447 avg len: 468.4 num_loss_counted_tokens: 1135 | |
Per-token loss scaled by world size: 0.0013742614537477493
Per-token loss scaled by world size: 0.001820345758460462
Per-token loss scaled by world size: 0.0006255320622585714
Per-token loss scaled by world size: 0.0008512948406860232
Per-token loss scaled by world size: 0.0009970476385205984
Per-token loss scaled by world size: 0.0009552471456117928
Per-token loss scaled by world size: 0.0016077319160103798
Epoch: 2, Step: 266, Rank: 1, loss = 1.2500625848770142 | |
Epoch: 2, Step: 266, Rank: 6, loss = 0.9069395065307617
Epoch: 2, Step: 266, Rank: 4, loss = 0.7743590474128723
Epoch: 2, Step: 266, Rank: 5, loss = 0.5689995884895325
Epoch: 2, Step: 266, Rank: 3, loss = 0.8689166903495789
Epoch: 2, Step: 266, Rank: 0, loss = 1.655832052230835 | |
Epoch: 2, Step: 266, Rank: 2, loss = 1.462433099746704 | |
Per-token loss scaled by world size: 0.0006449205684475601 | |
Epoch: 2, Step: 266, Rank: 7, loss = 0.5866358876228333 | |
[2024-06-27 16:47:28,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=266, skipped=0, lr=[1.381818181818182e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:28,426] [INFO] [timer.py:260:stop] epoch=0/micro_step=266/global_step=266, RunningAvgSamplesPerSec=95.42680561703548, CurrSamplesPerSec=95.82148430862722, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.67665277645439 samples/s, lr: 1.381818181818182e-05, loss: 1.655832052230835 cuda_mem_allocated: 22.250452041625977 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7277.0 batch_size: 72.0 total loss: 1.0092723369598389 | |
Epoch 2: 1% 3/205 [00:03<03:52, 1.15s/it]
total tokens: 2296 num samples: 8 num padding tokens: 211 - rank: 3 max len: 287 min len: 242 avg len: 260.625 num_loss_counted_tokens: 642
total tokens: 2513 num samples: 7 num padding tokens: 280 - rank: 2 max len: 359 min len: 292 avg len: 319.0 num_loss_counted_tokens: 739 | |
total tokens: 2360 num samples: 10 num padding tokens: 204 - rank: 4 max len: 236 min len: 199 avg len: 215.6 num_loss_counted_tokens: 682 | |
total tokens: 2528 num samples: 16 num padding tokens: 310 - rank: 6 max len: 158 min len: 123 avg len: 138.625 num_loss_counted_tokens: 724 | |
total tokens: 2340 num samples: 12 num padding tokens: 188 - rank: 5 max len: 195 min len: 159 avg len: 179.33333333333334 num_loss_counted_tokens: 832 | |
total tokens: 2128 num samples: 4 num padding tokens: 311 - rank: 1 max len: 532 min len: 377 avg len: 454.25 num_loss_counted_tokens: 1337 | |
total tokens: 1728 num samples: 2 num padding tokens: 189 - rank: 0 max len: 864 min len: 675 avg len: 769.5 num_loss_counted_tokens: 241 | |
total tokens: 2420 num samples: 20 num padding tokens: 293 - rank: 7 max len: 121 min len: 79 avg len: 106.35 num_loss_counted_tokens: 560 | |
Per-token loss scaled by world size: 0.0008880278328433633
Per-token loss scaled by world size: 0.00044619632535614073
Per-token loss scaled by world size: 0.0015869641210883856
Per-token loss scaled by world size: 0.001504681073129177
Per-token loss scaled by world size: 0.0011818443890661001
Per-token loss scaled by world size: 0.0016971371369436383 | |
Epoch: 2, Step: 267, Rank: 0, loss = 0.36325958371162415
Epoch: 2, Step: 267, Rank: 2, loss = 0.722965657711029
Epoch: 2, Step: 267, Rank: 4, loss = 1.2919871807098389
Epoch: 2, Step: 267, Rank: 3, loss = 1.2249984741210938
Per-token loss scaled by world size: 0.0007314866525121033
Epoch: 2, Step: 267, Rank: 1, loss = 0.9621690511703491
Epoch: 2, Step: 267, Rank: 5, loss = 1.3816817998886108 | |
Per-token loss scaled by world size: 0.0005004482809454203 | |
Epoch: 2, Step: 267, Rank: 6, loss = 0.5955215692520142 | |
Epoch: 2, Step: 267, Rank: 7, loss = 0.40742745995521545 | |
[2024-06-27 16:47:29,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=267, skipped=0, lr=[1.3870129870129871e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:29,480] [INFO] [timer.py:260:stop] epoch=0/micro_step=267/global_step=267, RunningAvgSamplesPerSec=95.42808539748947, CurrSamplesPerSec=95.76715244754314, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.65501465637517 samples/s, lr: 1.3870129870129871e-05, loss: 0.36325958371162415 cuda_mem_allocated: 22.248901844024658 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6513.0 batch_size: 82.0 total loss: 0.8687514066696167 | |
Epoch 2: 2% 4/205 [00:04<03:43, 1.11s/it]
total tokens: 2336 num samples: 8 num padding tokens: 107 - rank: 2 max len: 292 min len: 264 avg len: 278.625 num_loss_counted_tokens: 972
total tokens: 2322 num samples: 9 num padding tokens: 234 - rank: 3 max len: 258 min len: 222 avg len: 232.0 num_loss_counted_tokens: 777 | |
total tokens: 2310 num samples: 6 num padding tokens: 273 - rank: 1 max len: 385 min len: 308 avg len: 339.5 num_loss_counted_tokens: 1053 | |
total tokens: 2385 num samples: 15 num padding tokens: 248 - rank: 6 max len: 159 min len: 129 avg len: 142.46666666666667 num_loss_counted_tokens: 669 | |
total tokens: 2520 num samples: 12 num padding tokens: 176 - rank: 4 max len: 210 min len: 186 avg len: 195.33333333333334 num_loss_counted_tokens: 1056 | |
total tokens: 2224 num samples: 4 num padding tokens: 345 - rank: 0 max len: 556 min len: 394 avg len: 469.75 num_loss_counted_tokens: 620 | |
total tokens: 2379 num samples: 13 num padding tokens: 93 - rank: 5 max len: 183 min len: 161 avg len: 175.84615384615384 num_loss_counted_tokens: 1035 | |
total tokens: 2413 num samples: 19 num padding tokens: 233 - rank: 7 max len: 127 min len: 82 avg len: 114.73684210526316 num_loss_counted_tokens: 551 | |
Per-token loss scaled by world size: 0.001061886316165328
Per-token loss scaled by world size: 0.0003785556473303586
Per-token loss scaled by world size: 0.0012574209831655025
Per-token loss scaled by world size: 0.0009761035325936973
Per-token loss scaled by world size: 0.0004970197333022952
Per-token loss scaled by world size: 0.0011173797538504004
Per-token loss scaled by world size: 0.0012954314006492496 | |
Epoch: 2, Step: 268, Rank: 7, loss = 0.3767101764678955 | |
Epoch: 2, Step: 268, Rank: 4, loss = 0.4945967495441437
Epoch: 2, Step: 268, Rank: 5, loss = 1.05670964717865
Epoch: 2, Step: 268, Rank: 0, loss = 1.251291036605835
Epoch: 2, Step: 268, Rank: 2, loss = 0.9713450074195862 | |
Epoch: 2, Step: 268, Rank: 1, loss = 1.2891161441802979 | |
Epoch: 2, Step: 268, Rank: 3, loss = 1.1119325160980225 | |
Per-token loss scaled by world size: 0.000914787407964468 | |
Epoch: 2, Step: 268, Rank: 6, loss = 0.9103277921676636 | |
[2024-06-27 16:47:30,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=268, skipped=0, lr=[1.3922077922077924e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:30,541] [INFO] [timer.py:260:stop] epoch=0/micro_step=268/global_step=268, RunningAvgSamplesPerSec=95.4268403347436, CurrSamplesPerSec=95.09803983623381, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.97926206190527 samples/s, lr: 1.3922077922077924e-05, loss: 1.251291036605835 cuda_mem_allocated: 22.284084796905518 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7961.0 batch_size: 91.0 total loss: 0.9327536225318909 | |
Epoch 2: 2% 5/205 [00:05<03:38, 1.09s/it]
total tokens: 2412 num samples: 12 num padding tokens: 165 - rank: 4 max len: 201 min len: 174 avg len: 187.25 num_loss_counted_tokens: 842
total tokens: 2529 num samples: 9 num padding tokens: 130 - rank: 2 max len: 281 min len: 233 avg len: 266.55555555555554 num_loss_counted_tokens: 1033 | |
total tokens: 2380 num samples: 14 num padding tokens: 125 - rank: 5 max len: 170 min len: 150 avg len: 161.07142857142858 num_loss_counted_tokens: 802 | |
total tokens: 2533 num samples: 17 num padding tokens: 183 - rank: 6 max len: 149 min len: 127 avg len: 138.23529411764707 num_loss_counted_tokens: 891 | |
total tokens: 2527 num samples: 7 num padding tokens: 325 - rank: 1 max len: 361 min len: 291 avg len: 314.57142857142856 num_loss_counted_tokens: 876 | |
total tokens: 2330 num samples: 10 num padding tokens: 99 - rank: 3 max len: 233 min len: 208 avg len: 223.1 num_loss_counted_tokens: 824 | |
total tokens: 2232 num samples: 18 num padding tokens: 479 - rank: 7 max len: 124 min len: 79 avg len: 97.38888888888889 num_loss_counted_tokens: 402 | |
total tokens: 2425 num samples: 5 num padding tokens: 189 - rank: 0 max len: 485 min len: 385 avg len: 447.2 num_loss_counted_tokens: 1638 | |
Per-token loss scaled by world size: 0.0010144159896299243
Per-token loss scaled by world size: 0.0012087530922144651
Per-token loss scaled by world size: 0.0012573550920933485
Per-token loss scaled by world size: 0.000978213269263506
Per-token loss scaled by world size: 0.0011167486663907766
Per-token loss scaled by world size: 0.0010828831000253558
Epoch: 2, Step: 269, Rank: 4, loss = 0.9299658536911011
Per-token loss scaled by world size: 0.0004543900431599468
Epoch: 2, Step: 269, Rank: 2, loss = 0.9927331209182739 | |
Epoch: 2, Step: 269, Rank: 5, loss = 0.8967770338058472
Epoch: 2, Step: 269, Rank: 0, loss = 1.0237793922424316
Epoch: 2, Step: 269, Rank: 6, loss = 1.1081243753433228
Per-token loss scaled by world size: 0.0009199506603181362
Epoch: 2, Step: 269, Rank: 1, loss = 1.1526802778244019
Epoch: 2, Step: 269, Rank: 7, loss = 0.4165620803833008 | |
Epoch: 2, Step: 269, Rank: 3, loss = 0.8433647751808167 | |
[2024-06-27 16:47:31,520] [INFO] [logging.py:96:log_dist] [Rank 0] step=269, skipped=0, lr=[1.3974025974025975e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:31,593] [INFO] [timer.py:260:stop] epoch=0/micro_step=269/global_step=269, RunningAvgSamplesPerSec=95.42939158346523, CurrSamplesPerSec=96.11290270605868, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 96.02431340714416 samples/s, lr: 1.3974025974025975e-05, loss: 1.0237793922424316 cuda_mem_allocated: 22.264048099517822 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7334.0 batch_size: 78.0 total loss: 0.9204983711242676 | |
Epoch 2: 3% 6/205 [00:06<03:34, 1.08s/it]
total tokens: 2400 num samples: 20 num padding tokens: 292 - rank: 7 max len: 120 min len: 85 avg len: 105.4 num_loss_counted_tokens: 596
total tokens: 2532 num samples: 12 num padding tokens: 148 - rank: 5 max len: 211 min len: 185 avg len: 198.66666666666666 num_loss_counted_tokens: 1015 | |
total tokens: 2376 num samples: 8 num padding tokens: 144 - rank: 3 max len: 297 min len: 250 avg len: 279.0 num_loss_counted_tokens: 1058 | |
total tokens: 2380 num samples: 10 num padding tokens: 129 - rank: 4 max len: 238 min len: 215 avg len: 225.1 num_loss_counted_tokens: 873 | |
total tokens: 2280 num samples: 5 num padding tokens: 260 - rank: 1 max len: 456 min len: 379 avg len: 404.0 num_loss_counted_tokens: 1519 | |
total tokens: 2244 num samples: 6 num padding tokens: 160 - rank: 2 max len: 374 min len: 300 avg len: 347.3333333333333 num_loss_counted_tokens: 909 | |
total tokens: 2405 num samples: 13 num padding tokens: 463 - rank: 6 max len: 185 min len: 126 avg len: 149.3846153846154 num_loss_counted_tokens: 886 | |
total tokens: 2265 num samples: 3 num padding tokens: 416 - rank: 0 max len: 755 min len: 457 avg len: 616.3333333333334 num_loss_counted_tokens: 567 | |
Per-token loss scaled by world size: 0.0007693567895330489
Per-token loss scaled by world size: 0.0007006658706814051
Per-token loss scaled by world size: 0.0004206891171634197
Per-token loss scaled by world size: 0.00034014054108411074
Per-token loss scaled by world size: 0.0010207903105765581
Per-token loss scaled by world size: 0.0011634115362539887 | |
Epoch: 2, Step: 270, Rank: 2, loss = 0.7090584635734558 | |
Epoch: 2, Step: 270, Rank: 5, loss = 0.6457511782646179
Epoch: 2, Step: 270, Rank: 6, loss = 0.387717604637146
Epoch: 2, Step: 270, Rank: 1, loss = 0.940785825252533
Epoch: 2, Step: 270, Rank: 3, loss = 1.0722291469573975
Epoch: 2, Step: 270, Rank: 7, loss = 0.31348201632499695
Per-token loss scaled by world size: 0.0012984126806259155 | |
Per-token loss scaled by world size: 0.0012580411275848746 | |
Epoch: 2, Step: 270, Rank: 0, loss = 1.1966495513916016 | |
Epoch: 2, Step: 270, Rank: 4, loss = 1.1594421863555908 | |
[2024-06-27 16:47:32,583] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=0, lr=[1.4025974025974028e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:32,656] [INFO] [timer.py:260:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=95.42783849656306, CurrSamplesPerSec=95.01496512212624, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.90986027040788 samples/s, lr: 1.4025974025974028e-05, loss: 1.1966495513916016 cuda_mem_allocated: 22.306028842926025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7373.0 batch_size: 83.0 total loss: 0.8031395673751831 | |
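The step-270 numbers above appear to fit a simple pattern: each rank's reported loss equals its "per-token loss scaled by world size" times the global `num_loss_counted_tokens`, divided by the world size (8 GPUs here), and the reported "total loss" is the mean of the eight per-rank losses. A minimal sketch reproducing that arithmetic, under the assumption that this inference from the logged values matches the training code (`rank_loss` is a hypothetical helper, not part of ilab):

```python
# Assumption: 8-way data parallelism, as suggested by the rank 0-7 lines above.
WORLD_SIZE = 8

def rank_loss(per_token_loss_scaled, num_loss_counted_tokens, world_size=WORLD_SIZE):
    """Reconstruct a rank's reported loss from the two logged quantities."""
    return per_token_loss_scaled * num_loss_counted_tokens / world_size

# Step 270: rank 0's scaled per-token loss and the global token count.
loss_rank0 = rank_loss(0.0012984126806259155, 7373.0)
# loss_rank0 comes out very close to the logged 1.1966495513916016.

# The eight per-rank losses logged for step 270; their mean is very close
# to the logged "total loss: 0.8031395673751831".
rank_losses = [
    0.7090584635734558, 0.6457511782646179, 0.387717604637146,
    0.940785825252533, 1.0722291469573975, 0.31348201632499695,
    1.1966495513916016, 1.1594421863555908,
]
total_loss = sum(rank_losses) / len(rank_losses)
```

The same check works for the other steps in this log, e.g. step 271's rank 0 loss against `num_loss_counted_tokens: 6792.0`.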
Epoch 2: 3% 7/205 [00:07<03:32, 1.07s/it] total tokens: 2448 num samples: 18 num padding tokens: 303 - rank: 7 max len: 136 min len: 98 avg len: 119.16666666666667 num_loss_counted_tokens: 652 | |
total tokens: 2358 num samples: 9 num padding tokens: 138 - rank: 2 max len: 262 min len: 230 avg len: 246.66666666666666 num_loss_counted_tokens: 1002 | |
total tokens: 2472 num samples: 8 num padding tokens: 165 - rank: 1 max len: 309 min len: 263 avg len: 288.375 num_loss_counted_tokens: 718 | |
total tokens: 2460 num samples: 12 num padding tokens: 91 - rank: 4 max len: 205 min len: 191 avg len: 197.41666666666666 num_loss_counted_tokens: 831 | |
total tokens: 2457 num samples: 13 num padding tokens: 144 - rank: 5 max len: 189 min len: 167 avg len: 177.92307692307693 num_loss_counted_tokens: 926 | |
total tokens: 2415 num samples: 15 num padding tokens: 187 - rank: 6 max len: 161 min len: 138 avg len: 148.53333333333333 num_loss_counted_tokens: 806 | |
total tokens: 2405 num samples: 5 num padding tokens: 475 - rank: 0 max len: 481 min len: 326 avg len: 386.0 num_loss_counted_tokens: 690 | |
total tokens: 2519 num samples: 11 num padding tokens: 127 - rank: 3 max len: 229 min len: 205 avg len: 217.45454545454547 num_loss_counted_tokens: 1008 | |
Per-token loss scaled by world size: 0.0009514625417068601
Per-token loss scaled by world size: 0.0021811318583786488
Per-token loss scaled by world size: 0.0010927587281912565
Per-token loss scaled by world size: 0.0006295367493294179
Per-token loss scaled by world size: 0.001390306162647903
Per-token loss scaled by world size: 0.0009616155875846744
Per-token loss scaled by world size: 0.001227794331498444
Epoch: 2, Step: 271, Rank: 5, loss = 0.9277521371841431
Epoch: 2, Step: 271, Rank: 1, loss = 0.8077917098999023
Epoch: 2, Step: 271, Rank: 2, loss = 1.8517810106277466
Epoch: 2, Step: 271, Rank: 7, loss = 0.5344766974449158
Epoch: 2, Step: 271, Rank: 0, loss = 1.042397379875183
Per-token loss scaled by world size: 0.0004629769828170538 | |
Epoch: 2, Step: 271, Rank: 4, loss = 1.1803699731826782 | |
Epoch: 2, Step: 271, Rank: 3, loss = 0.8164116144180298 | |
Epoch: 2, Step: 271, Rank: 6, loss = 0.39306744933128357 | |
[2024-06-27 16:47:33,630] [INFO] [logging.py:96:log_dist] [Rank 0] step=271, skipped=0, lr=[1.4077922077922079e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:33,703] [INFO] [timer.py:260:stop] epoch=0/micro_step=271/global_step=271, RunningAvgSamplesPerSec=95.43189419357905, CurrSamplesPerSec=96.5313904941366, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 96.42290586861589 samples/s, lr: 1.4077922077922079e-05, loss: 1.042397379875183 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6792.0 batch_size: 72.0 total loss: 0.9442560076713562 | |
Epoch 2: 4% 8/205 [00:08<03:29, 1.07s/it] total tokens: 2520 num samples: 10 num padding tokens: 182 - rank: 3 max len: 252 min len: 220 avg len: 233.8 num_loss_counted_tokens: 860 | |
total tokens: 2480 num samples: 16 num padding tokens: 263 - rank: 6 max len: 155 min len: 127 avg len: 138.5625 num_loss_counted_tokens: 704 | |
total tokens: 2420 num samples: 11 num padding tokens: 144 - rank: 4 max len: 220 min len: 191 avg len: 206.9090909090909 num_loss_counted_tokens: 807 | |
total tokens: 2373 num samples: 7 num padding tokens: 202 - rank: 1 max len: 339 min len: 285 avg len: 310.14285714285717 num_loss_counted_tokens: 1154 | |
total tokens: 2032 num samples: 4 num padding tokens: 388 - rank: 0 max len: 508 min len: 343 avg len: 411.0 num_loss_counted_tokens: 892 | |
total tokens: 2457 num samples: 13 num padding tokens: 243 - rank: 5 max len: 189 min len: 157 avg len: 170.30769230769232 num_loss_counted_tokens: 804 | |
total tokens: 2256 num samples: 8 num padding tokens: 142 - rank: 2 max len: 282 min len: 255 avg len: 264.25 num_loss_counted_tokens: 835 | |
total tokens: 2108 num samples: 17 num padding tokens: 271 - rank: 7 max len: 124 min len: 91 avg len: 108.05882352941177 num_loss_counted_tokens: 487 | |
Per-token loss scaled by world size: 0.0013769824290648103
Per-token loss scaled by world size: 0.0011591723887249827
Per-token loss scaled by world size: 0.0007112948223948479
Per-token loss scaled by world size: 0.0007819252205081284
Per-token loss scaled by world size: 0.0008328496478497982
Per-token loss scaled by world size: 0.0007395858410745859
Per-token loss scaled by world size: 0.0009956901194527745
Epoch: 2, Step: 272, Rank: 1, loss = 1.0742629766464233
Epoch: 2, Step: 272, Rank: 7, loss = 0.6591925024986267
Epoch: 2, Step: 272, Rank: 4, loss = 0.922755777835846
Epoch: 2, Step: 272, Rank: 2, loss = 1.276118516921997
Epoch: 2, Step: 272, Rank: 3, loss = 0.72464919090271
Epoch: 2, Step: 272, Rank: 5, loss = 0.771843433380127 | |
Epoch: 2, Step: 272, Rank: 6, loss = 0.6854111552238464 | |
Per-token loss scaled by world size: 0.00045602588215842843 | |
Epoch: 2, Step: 272, Rank: 0, loss = 0.4226219952106476 | |
[2024-06-27 16:47:34,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=272, skipped=0, lr=[1.4129870129870132e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:34,767] [INFO] [timer.py:260:stop] epoch=0/micro_step=272/global_step=272, RunningAvgSamplesPerSec=95.43008961232866, CurrSamplesPerSec=94.94712313552937, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.82421075041118 samples/s, lr: 1.4129870129870132e-05, loss: 0.4226219952106476 cuda_mem_allocated: 22.316523551940918 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7414.0 batch_size: 80.0 total loss: 0.8171070218086243 | |
Epoch 2: 4% 9/205 [00:09<03:28, 1.07s/it] total tokens: 2512 num samples: 16 num padding tokens: 227 - rank: 6 max len: 157 min len: 124 avg len: 142.8125 num_loss_counted_tokens: 894 | |
total tokens: 2519 num samples: 11 num padding tokens: 116 - rank: 4 max len: 229 min len: 192 avg len: 218.45454545454547 num_loss_counted_tokens: 869 | |
total tokens: 2313 num samples: 9 num padding tokens: 120 - rank: 3 max len: 257 min len: 229 avg len: 243.66666666666666 num_loss_counted_tokens: 900 | |
total tokens: 2392 num samples: 8 num padding tokens: 145 - rank: 2 max len: 299 min len: 268 avg len: 280.875 num_loss_counted_tokens: 1191 | |
total tokens: 2484 num samples: 6 num padding tokens: 415 - rank: 1 max len: 414 min len: 301 avg len: 344.8333333333333 num_loss_counted_tokens: 1445 | |
total tokens: 2483 num samples: 13 num padding tokens: 257 - rank: 5 max len: 191 min len: 159 avg len: 171.23076923076923 num_loss_counted_tokens: 987 | |
total tokens: 2280 num samples: 19 num padding tokens: 235 - rank: 7 max len: 120 min len: 88 avg len: 107.63157894736842 num_loss_counted_tokens: 554 | |
total tokens: 2096 num samples: 4 num padding tokens: 172 - rank: 0 max len: 524 min len: 451 avg len: 481.0 num_loss_counted_tokens: 751 | |
Per-token loss scaled by world size: 0.0012193904258310795
Per-token loss scaled by world size: 0.0005880265962332487
Per-token loss scaled by world size: 0.00039535993710160255
Per-token loss scaled by world size: 0.0010761775774881244
Per-token loss scaled by world size: 0.0013701936695724726
Per-token loss scaled by world size: 0.0007367098587565124
Per-token loss scaled by world size: 0.0008953416836448014
Epoch: 2, Step: 273, Rank: 2, loss = 1.235242486000061
Epoch: 2, Step: 273, Rank: 6, loss = 0.5956709384918213
Epoch: 2, Step: 273, Rank: 1, loss = 1.3880062103271484
Epoch: 2, Step: 273, Rank: 4, loss = 1.0901678800582886
Epoch: 2, Step: 273, Rank: 3, loss = 0.4004996120929718
Epoch: 2, Step: 273, Rank: 5, loss = 0.7462871074676514
Per-token loss scaled by world size: 0.0005737451137974858 | |
Epoch: 2, Step: 273, Rank: 0, loss = 0.9069811105728149 | |
Epoch: 2, Step: 273, Rank: 7, loss = 0.581203818321228 | |
[2024-06-27 16:47:35,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=273, skipped=0, lr=[1.4181818181818183e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:35,827] [INFO] [timer.py:260:stop] epoch=0/micro_step=273/global_step=273, RunningAvgSamplesPerSec=95.42985087729166, CurrSamplesPerSec=95.36543608766303, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.25423655036901 samples/s, lr: 1.4181818181818183e-05, loss: 0.9069811105728149 cuda_mem_allocated: 22.30030393600464 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8104.0 batch_size: 84.0 total loss: 0.8680074214935303 | |
Epoch 2: 5% 10/205 [00:10<03:27, 1.06s/it] total tokens: 2340 num samples: 10 num padding tokens: 176 - rank: 4 max len: 234 min len: 199 avg len: 216.4 num_loss_counted_tokens: 880 | |
total tokens: 2238 num samples: 6 num padding tokens: 173 - rank: 1 max len: 373 min len: 326 avg len: 344.1666666666667 num_loss_counted_tokens: 1018 | |
total tokens: 2166 num samples: 19 num padding tokens: 279 - rank: 7 max len: 114 min len: 68 avg len: 99.3157894736842 num_loss_counted_tokens: 419 | |
total tokens: 2522 num samples: 13 num padding tokens: 272 - rank: 5 max len: 194 min len: 150 avg len: 173.07692307692307 num_loss_counted_tokens: 820 | |
total tokens: 2268 num samples: 7 num padding tokens: 73 - rank: 2 max len: 324 min len: 298 avg len: 313.57142857142856 num_loss_counted_tokens: 1096 | |
total tokens: 2336 num samples: 8 num padding tokens: 233 - rank: 3 max len: 292 min len: 237 avg len: 262.875 num_loss_counted_tokens: 932 | |
total tokens: 2400 num samples: 16 num padding tokens: 215 - rank: 6 max len: 150 min len: 116 avg len: 136.5625 num_loss_counted_tokens: 790 | |
total tokens: 2142 num samples: 3 num padding tokens: 601 - rank: 0 max len: 714 min len: 405 avg len: 513.6666666666666 num_loss_counted_tokens: 1302 | |
Per-token loss scaled by world size: 0.0009517568396404386
Per-token loss scaled by world size: 0.001467656809836626
Per-token loss scaled by world size: 0.000907154637388885
Per-token loss scaled by world size: 0.0015945280902087688
Per-token loss scaled by world size: 0.0008923530695028603
Per-token loss scaled by world size: 0.000319219718221575
Per-token loss scaled by world size: 0.0005228713853284717
Epoch: 2, Step: 274, Rank: 0, loss = 0.7222084999084473
Epoch: 2, Step: 274, Rank: 4, loss = 0.7577174305915833
Epoch: 2, Step: 274, Rank: 7, loss = 0.41627100110054016
Epoch: 2, Step: 274, Rank: 5, loss = 0.7104246020317078
Epoch: 2, Step: 274, Rank: 2, loss = 1.2694436311721802
Epoch: 2, Step: 274, Rank: 3, loss = 1.1684383153915405
Epoch: 2, Step: 274, Rank: 1, loss = 0.2541387975215912
Per-token loss scaled by world size: 0.0010173558257520199 | |
Epoch: 2, Step: 274, Rank: 6, loss = 0.8099424242973328 | |
[2024-06-27 16:47:36,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=274, skipped=0, lr=[1.4233766233766236e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:36,884] [INFO] [timer.py:260:stop] epoch=0/micro_step=274/global_step=274, RunningAvgSamplesPerSec=95.43162793622247, CurrSamplesPerSec=95.91566253580284, MemAllocated=22.25GB, MaxMemAllocated=28.61GB | |
throughput: 95.82016174477224 samples/s, lr: 1.4233766233766236e-05, loss: 0.7222084999084473 cuda_mem_allocated: 22.2478289604187 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6369.0 batch_size: 84.0 total loss: 0.7635731101036072 | |
Epoch 2: 5% 11/205 [00:12<03:25, 1.06s/it] total tokens: 2408 num samples: 14 num padding tokens: 355 - rank: 6 max len: 172 min len: 123 avg len: 146.64285714285714 num_loss_counted_tokens: 895 | |
total tokens: 2460 num samples: 12 num padding tokens: 208 - rank: 5 max len: 205 min len: 176 avg len: 187.66666666666666 num_loss_counted_tokens: 801 | |
total tokens: 2408 num samples: 8 num padding tokens: 269 - rank: 3 max len: 301 min len: 251 avg len: 267.375 num_loss_counted_tokens: 747 | |
total tokens: 2408 num samples: 7 num padding tokens: 156 - rank: 2 max len: 344 min len: 307 avg len: 321.7142857142857 num_loss_counted_tokens: 1231 | |
total tokens: 2490 num samples: 10 num padding tokens: 258 - rank: 4 max len: 249 min len: 206 avg len: 223.2 num_loss_counted_tokens: 1031 | |
total tokens: 2280 num samples: 6 num padding tokens: 77 - rank: 1 max len: 380 min len: 348 avg len: 367.1666666666667 num_loss_counted_tokens: 1338 | |
total tokens: 2340 num samples: 4 num padding tokens: 493 - rank: 0 max len: 585 min len: 402 avg len: 461.75 num_loss_counted_tokens: 1003 | |
total tokens: 2400 num samples: 20 num padding tokens: 310 - rank: 7 max len: 120 min len: 77 avg len: 104.5 num_loss_counted_tokens: 508 | |
Per-token loss scaled by world size: 0.0011129570193588734
Per-token loss scaled by world size: 0.0005533045041374862
Per-token loss scaled by world size: 0.0004926788387820125
Per-token loss scaled by world size: 0.0010467887623235583
Per-token loss scaled by world size: 0.0005666610668413341
Per-token loss scaled by world size: 0.001143822679296136
Per-token loss scaled by world size: 0.0014881336828693748
Epoch: 2, Step: 275, Rank: 6, loss = 1.0241986513137817
Epoch: 2, Step: 275, Rank: 4, loss = 0.5091784596443176
Epoch: 2, Step: 275, Rank: 3, loss = 0.45338770747184753
Epoch: 2, Step: 275, Rank: 5, loss = 0.9633073210716248
Epoch: 2, Step: 275, Rank: 1, loss = 1.052602767944336
Epoch: 2, Step: 275, Rank: 7, loss = 0.5214698314666748
Epoch: 2, Step: 275, Rank: 2, loss = 1.3694549798965454
Per-token loss scaled by world size: 0.0014292632695287466 | |
Epoch: 2, Step: 275, Rank: 0, loss = 1.315279483795166 | |
[2024-06-27 16:47:37,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=275, skipped=0, lr=[1.4285714285714287e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:37,942] [INFO] [timer.py:260:stop] epoch=0/micro_step=275/global_step=275, RunningAvgSamplesPerSec=95.43088365789053, CurrSamplesPerSec=95.22887007162271, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.1321240438429 samples/s, lr: 1.4285714285714287e-05, loss: 1.315279483795166 cuda_mem_allocated: 22.30650568008423 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7362.0 batch_size: 81.0 total loss: 0.9011099934577942 | |
Epoch 2: 6% 12/205 [00:13<03:24, 1.06s/it] total tokens: 2445 num samples: 15 num padding tokens: 217 - rank: 6 max len: 163 min len: 126 avg len: 148.53333333333333 num_loss_counted_tokens: 713 | |
total tokens: 2534 num samples: 7 num padding tokens: 284 - rank: 1 max len: 362 min len: 292 avg len: 321.42857142857144 num_loss_counted_tokens: 1179 | |
total tokens: 2490 num samples: 10 num padding tokens: 182 - rank: 3 max len: 249 min len: 217 avg len: 230.8 num_loss_counted_tokens: 772 | |
total tokens: 2376 num samples: 12 num padding tokens: 193 - rank: 5 max len: 198 min len: 171 avg len: 181.91666666666666 num_loss_counted_tokens: 712 | |
total tokens: 2320 num samples: 8 num padding tokens: 192 - rank: 2 max len: 290 min len: 251 avg len: 266.0 num_loss_counted_tokens: 855 | |
total tokens: 2376 num samples: 11 num padding tokens: 101 - rank: 4 max len: 216 min len: 199 avg len: 206.8181818181818 num_loss_counted_tokens: 1021 | |
total tokens: 2337 num samples: 19 num padding tokens: 352 - rank: 7 max len: 123 min len: 85 avg len: 104.47368421052632 num_loss_counted_tokens: 473 | |
total tokens: 2416 num samples: 4 num padding tokens: 511 - rank: 0 max len: 604 min len: 363 avg len: 476.25 num_loss_counted_tokens: 1245 | |
Per-token loss scaled by world size: 0.00119394704233855
Per-token loss scaled by world size: 0.0010148546425625682
Per-token loss scaled by world size: 0.0007396430592052639
Per-token loss scaled by world size: 0.0007650036131963134
Per-token loss scaled by world size: 0.0006784716388210654
Per-token loss scaled by world size: 0.0003647918638307601
Per-token loss scaled by world size: 0.0010800014715641737
Per-token loss scaled by world size: 0.000330713955918327
Epoch: 2, Step: 276, Rank: 6, loss = 0.7073414921760559
Epoch: 2, Step: 276, Rank: 2, loss = 0.9383599758148193
Epoch: 2, Step: 276, Rank: 4, loss = 1.103953242301941
Epoch: 2, Step: 276, Rank: 1, loss = 0.6838924884796143
Epoch: 2, Step: 276, Rank: 3, loss = 0.6273318529129028
Epoch: 2, Step: 276, Rank: 0, loss = 0.33729568123817444
Epoch: 2, Step: 276, Rank: 5, loss = 0.9985963106155396 | |
Epoch: 2, Step: 276, Rank: 7, loss = 0.3057864010334015 | |
[2024-06-27 16:47:38,932] [INFO] [logging.py:96:log_dist] [Rank 0] step=276, skipped=0, lr=[1.433766233766234e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:39,006] [INFO] [timer.py:260:stop] epoch=0/micro_step=276/global_step=276, RunningAvgSamplesPerSec=95.4302944775991, CurrSamplesPerSec=95.26971989527867, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.15957542598137 samples/s, lr: 1.433766233766234e-05, loss: 0.33729568123817444 cuda_mem_allocated: 22.262856006622314 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7397.0 batch_size: 84.0 total loss: 0.7128196358680725 | |
Epoch 2: 6% 13/205 [00:14<03:23, 1.06s/it] total tokens: 2235 num samples: 5 num padding tokens: 220 - rank: 1 max len: 447 min len: 372 avg len: 403.0 num_loss_counted_tokens: 949 | |
total tokens: 2520 num samples: 15 num padding tokens: 264 - rank: 6 max len: 168 min len: 135 avg len: 150.4 num_loss_counted_tokens: 931 | |
total tokens: 2499 num samples: 7 num padding tokens: 340 - rank: 2 max len: 357 min len: 274 avg len: 308.42857142857144 num_loss_counted_tokens: 830 | |
total tokens: 2330 num samples: 10 num padding tokens: 185 - rank: 4 max len: 233 min len: 201 avg len: 214.5 num_loss_counted_tokens: 823 | |
total tokens: 2439 num samples: 9 num padding tokens: 133 - rank: 3 max len: 271 min len: 236 avg len: 256.22222222222223 num_loss_counted_tokens: 876 | |
total tokens: 2400 num samples: 12 num padding tokens: 196 - rank: 5 max len: 200 min len: 169 avg len: 183.66666666666666 num_loss_counted_tokens: 772 | |
total tokens: 2448 num samples: 4 num padding tokens: 426 - rank: 0 max len: 612 min len: 462 avg len: 505.5 num_loss_counted_tokens: 947 | |
total tokens: 2412 num samples: 18 num padding tokens: 364 - rank: 7 max len: 134 min len: 73 avg len: 113.77777777777777 num_loss_counted_tokens: 535 | |
Per-token loss scaled by world size: 0.0009375474182888865
Per-token loss scaled by world size: 0.0010672371136024594
Per-token loss scaled by world size: 0.0011157452827319503
Per-token loss scaled by world size: 0.0009921601740643382
Per-token loss scaled by world size: 0.00102334120310843
Per-token loss scaled by world size: 0.0009613364236429334
Per-token loss scaled by world size: 0.0005637667491100729
Epoch: 2, Step: 277, Rank: 6, loss = 0.938275933265686
Epoch: 2, Step: 277, Rank: 5, loss = 0.8596137762069702
Epoch: 2, Step: 277, Rank: 4, loss = 0.9096868634223938
Epoch: 2, Step: 277, Rank: 2, loss = 0.9785230755805969
Epoch: 2, Step: 277, Rank: 3, loss = 1.0229989290237427
Epoch: 2, Step: 277, Rank: 1, loss = 0.8814253211021423
Epoch: 2, Step: 277, Rank: 7, loss = 0.5169036388397217
Per-token loss scaled by world size: 0.0010833546984940767 | |
Epoch: 2, Step: 277, Rank: 0, loss = 0.9933008551597595 | |
[2024-06-27 16:47:39,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=277, skipped=0, lr=[1.4389610389610391e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:40,071] [INFO] [timer.py:260:stop] epoch=0/micro_step=277/global_step=277, RunningAvgSamplesPerSec=95.42804385181729, CurrSamplesPerSec=94.81534618924914, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.70627152130963 samples/s, lr: 1.4389610389610391e-05, loss: 0.9933008551597595 cuda_mem_allocated: 22.309129238128662 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7335.0 batch_size: 85.0 total loss: 0.8875910639762878 | |
Epoch 2: 7% 14/205 [00:15<03:22, 1.06s/it] total tokens: 2232 num samples: 6 num padding tokens: 213 - rank: 1 max len: 372 min len: 306 avg len: 336.5 num_loss_counted_tokens: 1037 | |
total tokens: 2464 num samples: 16 num padding tokens: 142 - rank: 6 max len: 154 min len: 137 avg len: 145.125 num_loss_counted_tokens: 864 | |
total tokens: 2295 num samples: 9 num padding tokens: 128 - rank: 3 max len: 255 min len: 228 avg len: 240.77777777777777 num_loss_counted_tokens: 727 | |
total tokens: 2312 num samples: 8 num padding tokens: 155 - rank: 2 max len: 289 min len: 256 avg len: 269.625 num_loss_counted_tokens: 704 | |
total tokens: 2444 num samples: 13 num padding tokens: 263 - rank: 5 max len: 188 min len: 155 avg len: 167.76923076923077 num_loss_counted_tokens: 925 | |
total tokens: 2475 num samples: 11 num padding tokens: 179 - rank: 4 max len: 225 min len: 190 avg len: 208.72727272727272 num_loss_counted_tokens: 1012 | |
total tokens: 2470 num samples: 19 num padding tokens: 404 - rank: 7 max len: 130 min len: 87 avg len: 108.73684210526316 num_loss_counted_tokens: 543 | |
total tokens: 2452 num samples: 4 num padding tokens: 439 - rank: 0 max len: 613 min len: 430 avg len: 503.25 num_loss_counted_tokens: 910 | |
Per-token loss scaled by world size: 0.0014931473415344954
Per-token loss scaled by world size: 0.0006383298896253109
Per-token loss scaled by world size: 0.0005509555921889842
Per-token loss scaled by world size: 0.00021167140221223235
Per-token loss scaled by world size: 0.0009006232721731067
Per-token loss scaled by world size: 0.0007042817887850106
Per-token loss scaled by world size: 0.0012416016543284059
Epoch: 2, Step: 278, Rank: 4, loss = 1.5103185176849365
Epoch: 2, Step: 278, Rank: 6, loss = 0.6456707119941711
Epoch: 2, Step: 278, Rank: 5, loss = 0.557291567325592
Epoch: 2, Step: 278, Rank: 3, loss = 0.7123810052871704
Epoch: 2, Step: 278, Rank: 2, loss = 0.9109804630279541
Epoch: 2, Step: 278, Rank: 7, loss = 0.21410562098026276
Epoch: 2, Step: 278, Rank: 1, loss = 1.2558801174163818
Per-token loss scaled by world size: 0.0019375028787180781 | |
Epoch: 2, Step: 278, Rank: 0, loss = 1.9597841501235962 | |
[2024-06-27 16:47:41,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=278, skipped=0, lr=[1.4441558441558442e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:41,129] [INFO] [timer.py:260:stop] epoch=0/micro_step=278/global_step=278, RunningAvgSamplesPerSec=95.4287353671916, CurrSamplesPerSec=95.61928319232183, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.5192500665415 samples/s, lr: 1.4441558441558442e-05, loss: 1.9597841501235962 cuda_mem_allocated: 22.30411958694458 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8092.0 batch_size: 71.0 total loss: 0.9708014726638794 | |
Epoch 2: 7% 15/205 [00:16<03:21, 1.06s/it] total tokens: 2325 num samples: 5 num padding tokens: 409 - rank: 2 max len: 465 min len: 316 avg len: 383.2 num_loss_counted_tokens: 898 | |
total tokens: 2412 num samples: 12 num padding tokens: 223 - rank: 5 max len: 201 min len: 168 avg len: 182.41666666666666 num_loss_counted_tokens: 794 | |
total tokens: 2304 num samples: 9 num padding tokens: 210 - rank: 4 max len: 256 min len: 202 avg len: 232.66666666666666 num_loss_counted_tokens: 637 | |
total tokens: 2384 num samples: 8 num padding tokens: 85 - rank: 3 max len: 298 min len: 268 avg len: 287.375 num_loss_counted_tokens: 854 | |
total tokens: 2505 num samples: 15 num padding tokens: 300 - rank: 6 max len: 167 min len: 134 avg len: 147.0 num_loss_counted_tokens: 762 | |
total tokens: 2346 num samples: 3 num padding tokens: 490 - rank: 1 max len: 782 min len: 524 avg len: 618.6666666666666 num_loss_counted_tokens: 1350 | |
total tokens: 1320 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1320 min len: 1320 avg len: 1320.0 num_loss_counted_tokens: 74 | |
total tokens: 2358 num samples: 18 num padding tokens: 228 - rank: 7 max len: 131 min len: 97 avg len: 118.33333333333333 num_loss_counted_tokens: 641 | |
Per-token loss scaled by world size: 0.0014660782180726528
Per-token loss scaled by world size: 0.0007004475337453187
Per-token loss scaled by world size: 0.0008085716981440783
Per-token loss scaled by world size: 0.0008990885107778013
Per-token loss scaled by world size: 0.0006659971550107002
Per-token loss scaled by world size: 0.0006124840583652258
Per-token loss scaled by world size: 0.00045225207577459514
Per-token loss scaled by world size: 0.0016895901644602418
Epoch: 2, Step: 279, Rank: 1, loss = 0.5729660987854004
Epoch: 2, Step: 279, Rank: 6, loss = 0.6614116430282593
Epoch: 2, Step: 279, Rank: 7, loss = 0.5447856783866882
Epoch: 2, Step: 279, Rank: 3, loss = 1.1992520093917847
Epoch: 2, Step: 279, Rank: 4, loss = 0.7354543805122375
Epoch: 2, Step: 279, Rank: 2, loss = 0.5010119676589966
Epoch: 2, Step: 279, Rank: 5, loss = 0.36994218826293945
Epoch: 2, Step: 279, Rank: 0, loss = 1.3820847272872925 | |
[2024-06-27 16:47:42,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=279, skipped=0, lr=[1.4493506493506495e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:42,185] [INFO] [timer.py:260:stop] epoch=0/micro_step=279/global_step=279, RunningAvgSamplesPerSec=95.42975633722824, CurrSamplesPerSec=95.71238163515817, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.61562750450291 samples/s, lr: 1.4493506493506495e-05, loss: 1.3820847272872925 cuda_mem_allocated: 22.281221389770508 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6544.0 batch_size: 90.0 total loss: 0.7458636164665222 | |
Epoch 2: 8% 16/205 [00:17<03:20, 1.06s/it] total tokens: 2420 num samples: 11 num padding tokens: 159 - rank: 4 max len: 220 min len: 191 avg len: 205.54545454545453 num_loss_counted_tokens: 825 | |
total tokens: 2367 num samples: 9 num padding tokens: 185 - rank: 3 max len: 263 min len: 221 avg len: 242.44444444444446 num_loss_counted_tokens: 886 | |
total tokens: 2504 num samples: 8 num padding tokens: 159 - rank: 2 max len: 313 min len: 275 avg len: 293.125 num_loss_counted_tokens: 997 | |
total tokens: 2496 num samples: 6 num padding tokens: 380 - rank: 1 max len: 416 min len: 319 avg len: 352.6666666666667 num_loss_counted_tokens: 1088 | |
total tokens: 2405 num samples: 13 num padding tokens: 278 - rank: 5 max len: 185 min len: 147 avg len: 163.6153846153846 num_loss_counted_tokens: 835 | |
total tokens: 2482 num samples: 17 num padding tokens: 244 - rank: 6 max len: 146 min len: 121 avg len: 131.64705882352942 num_loss_counted_tokens: 845 | |
total tokens: 2452 num samples: 4 num padding tokens: 291 - rank: 0 max len: 613 min len: 430 avg len: 540.25 num_loss_counted_tokens: 1169 | |
total tokens: 2360 num samples: 20 num padding tokens: 366 - rank: 7 max len: 118 min len: 79 avg len: 99.7 num_loss_counted_tokens: 493 | |
Per-token loss scaled by world size: 0.0011011005844920874
Per-token loss scaled by world size: 0.0012091645039618015
Per-token loss scaled by world size: 0.000864869449287653
Per-token loss scaled by world size: 0.00199395720846951
Per-token loss scaled by world size: 0.0002926621527876705
Per-token loss scaled by world size: 0.0007243118016049266
Per-token loss scaled by world size: 0.0015451819635927677
Epoch: 2, Step: 280, Rank: 6, loss = 0.7793554663658142
Epoch: 2, Step: 280, Rank: 4, loss = 0.9922292232513428
Epoch: 2, Step: 280, Rank: 0, loss = 0.2637251913547516
Epoch: 2, Step: 280, Rank: 3, loss = 1.0896083116531372
Epoch: 2, Step: 280, Rank: 1, loss = 1.7968047857284546
Epoch: 2, Step: 280, Rank: 5, loss = 0.6526954770088196
Epoch: 2, Step: 280, Rank: 2, loss = 1.3924020528793335 | |
Per-token loss scaled by world size: 0.0005851888563483953 | |
Epoch: 2, Step: 280, Rank: 7, loss = 0.5273283123970032 | |
[2024-06-27 16:47:43,163] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=0, lr=[1.4545454545454546e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:43,237] [INFO] [timer.py:260:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=95.43227016978709, CurrSamplesPerSec=96.13373860472645, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 96.04039174342999 samples/s, lr: 1.4545454545454546e-05, loss: 0.2637251913547516 cuda_mem_allocated: 22.267507553100586 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7209.0 batch_size: 78.0 total loss: 0.936768651008606 | |
Epoch 2: 8% 17/205 [00:18<03:18, 1.06s/it] total tokens: 2401 num samples: 7 num padding tokens: 183 - rank: 1 max len: 343 min len: 299 avg len: 316.85714285714283 num_loss_counted_tokens: 1193 | |
total tokens: 2464 num samples: 11 num padding tokens: 173 - rank: 4 max len: 224 min len: 194 avg len: 208.27272727272728 num_loss_counted_tokens: 886 | |
total tokens: 2368 num samples: 8 num padding tokens: 177 - rank: 2 max len: 296 min len: 261 avg len: 273.875 num_loss_counted_tokens: 1201 | |
total tokens: 2522 num samples: 13 num padding tokens: 265 - rank: 5 max len: 194 min len: 154 avg len: 173.6153846153846 num_loss_counted_tokens: 878 | |
total tokens: 2416 num samples: 16 num padding tokens: 243 - rank: 6 max len: 151 min len: 124 avg len: 135.8125 num_loss_counted_tokens: 709 | |
total tokens: 2260 num samples: 4 num padding tokens: 522 - rank: 0 max len: 565 min len: 368 avg len: 434.5 num_loss_counted_tokens: 755 | |
total tokens: 2304 num samples: 9 num padding tokens: 144 - rank: 3 max len: 256 min len: 230 avg len: 240.0 num_loss_counted_tokens: 810 | |
total tokens: 2440 num samples: 20 num padding tokens: 288 - rank: 7 max len: 122 min len: 89 avg len: 107.6 num_loss_counted_tokens: 619 | |
Per-token loss scaled by world size: 0.00072110426845029
Per-token loss scaled by world size: 0.0011211988748982549
Per-token loss scaled by world size: 0.0005689544486813247
Per-token loss scaled by world size: 0.0013626019936054945
Per-token loss scaled by world size: 0.0004028394469060004
Per-token loss scaled by world size: 0.0009328121086582541
Per-token loss scaled by world size: 0.0003893736284226179
Epoch: 2, Step: 281, Rank: 6, loss = 0.6985697746276855
Epoch: 2, Step: 281, Rank: 7, loss = 0.5511746406555176
Epoch: 2, Step: 281, Rank: 0, loss = 1.3200206756591797
Epoch: 2, Step: 281, Rank: 4, loss = 1.0861613750457764
Epoch: 2, Step: 281, Rank: 5, loss = 0.39025071263313293
Epoch: 2, Step: 281, Rank: 1, loss = 0.3772056996822357 | |
Epoch: 2, Step: 281, Rank: 2, loss = 0.9036617279052734 | |
Per-token loss scaled by world size: 0.0008235736167989671 | |
Epoch: 2, Step: 281, Rank: 3, loss = 0.79783695936203 | |
[2024-06-27 16:47:44,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=281, skipped=0, lr=[1.45974025974026e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:44,294] [INFO] [timer.py:260:stop] epoch=0/micro_step=281/global_step=281, RunningAvgSamplesPerSec=95.43302274452682, CurrSamplesPerSec=95.64269984629355, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.53994268835147 samples/s, lr: 1.45974025974026e-05, loss: 1.3200206756591797 cuda_mem_allocated: 22.276809692382812 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7750.0 batch_size: 74.0 total loss: 0.7656102180480957 | |
Epoch 2: 9% 18/205 [00:19<03:17, 1.06s/it] total tokens: 2422 num samples: 14 num padding tokens: 101 - rank: 5 max len: 173 min len: 157 avg len: 165.78571428571428 num_loss_counted_tokens: 751 | |
total tokens: 2398 num samples: 11 num padding tokens: 206 - rank: 4 max len: 218 min len: 176 avg len: 199.27272727272728 num_loss_counted_tokens: 949 | |
total tokens: 2145 num samples: 5 num padding tokens: 179 - rank: 1 max len: 429 min len: 350 avg len: 393.2 num_loss_counted_tokens: 1046 | |
total tokens: 2533 num samples: 17 num padding tokens: 303 - rank: 6 max len: 149 min len: 111 avg len: 131.1764705882353 num_loss_counted_tokens: 786 | |
total tokens: 2387 num samples: 7 num padding tokens: 160 - rank: 2 max len: 341 min len: 306 avg len: 318.14285714285717 num_loss_counted_tokens: 1259 | |
total tokens: 2448 num samples: 9 num padding tokens: 214 - rank: 3 max len: 272 min len: 219 avg len: 248.22222222222223 num_loss_counted_tokens: 1056 | |
total tokens: 1760 num samples: 16 num padding tokens: 252 - rank: 7 max len: 110 min len: 79 avg len: 94.25 num_loss_counted_tokens: 277 | |
total tokens: 2412 num samples: 4 num padding tokens: 395 - rank: 0 max len: 603 min len: 444 avg len: 504.25 num_loss_counted_tokens: 1260 | |
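The per-rank batch lines above obey a simple identity, which the following sketch checks. The assumption (inferred from the numbers, not from the actual bucketing code) is that every sample in a rank's batch is padded up to that batch's max len; the values used below are copied from the rank-0 line directly above.

```python
# Assumed identities (inferred from the logged batch statistics):
#   total tokens = num samples * max len
#   total tokens - num padding tokens = num samples * avg len
num_samples = 4
max_len = 603
num_padding_tokens = 395
avg_len = 504.25

total_tokens = num_samples * max_len
print(total_tokens)                                       # 2412, as logged
print((total_tokens - num_padding_tokens) / num_samples)  # 504.25, the avg len
```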
Per-token loss scaled by world size: 0.001304773846641183
Per-token loss scaled by world size: 0.0009796690428629518
Per-token loss scaled by world size: 0.0008831643499433994
Per-token loss scaled by world size: 0.001517477328889072
Per-token loss scaled by world size: 0.0012334926286712289
Per-token loss scaled by world size: 0.0014417035272344947
Per-token loss scaled by world size: 0.000180044153239578
Epoch: 2, Step: 282, Rank: 6, loss = 0.8392089605331421
Epoch: 2, Step: 282, Rank: 4, loss = 0.7565406560897827
Epoch: 2, Step: 282, Rank: 1, loss = 1.299908995628357
Epoch: 2, Step: 282, Rank: 2, loss = 1.2349992990493774
Epoch: 2, Step: 282, Rank: 5, loss = 1.056640625
Epoch: 2, Step: 282, Rank: 3, loss = 1.1177018880844116
Epoch: 2, Step: 282, Rank: 7, loss = 0.15423032641410828
Per-token loss scaled by world size: 0.0011783740483224392
Epoch: 2, Step: 282, Rank: 0, loss = 1.0094246864318848
[2024-06-27 16:47:45,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=282, skipped=0, lr=[1.464935064935065e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:45,357] [INFO] [timer.py:260:stop] epoch=0/micro_step=282/global_step=282, RunningAvgSamplesPerSec=95.43191192319895, CurrSamplesPerSec=95.12299957429457, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 95.00920330227812 samples/s, lr: 1.464935064935065e-05, loss: 1.0094246864318848 cuda_mem_allocated: 22.316285133361816 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6853.0 batch_size: 71.0 total loss: 0.9335819482803345 | |
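The lr values in the [logging.py] lines increase by a constant ~5.1948e-08 per step, which is consistent with a linear warmup schedule. The sketch below reproduces the logged values; the peak lr (2e-5) and warmup length (385 steps) are guesses inferred from that spacing, not values read from the training config.

```python
# Hypothetical linear-warmup reconstruction; peak_lr and warmup_steps are
# inferred from the logged per-step lr spacing, not from the actual config.
peak_lr = 2e-5
warmup_steps = 385

def warmup_lr(step: int) -> float:
    # Linear ramp from 0 to peak_lr over warmup_steps optimizer steps.
    return peak_lr * step / warmup_steps

print(f"{warmup_lr(281):.6e}")  # 1.459740e-05, matching step 281's logged lr
print(f"{warmup_lr(282):.6e}")  # 1.464935e-05, matching step 282's logged lr
```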
Epoch 2: 9% 19/205 [00:20<03:16, 1.06s/it]
total tokens: 2125 num samples: 17 num padding tokens: 393 - rank: 7 max len: 125 min len: 83 avg len: 101.88235294117646 num_loss_counted_tokens: 440
total tokens: 2510 num samples: 10 num padding tokens: 220 - rank: 4 max len: 251 min len: 198 avg len: 229.0 num_loss_counted_tokens: 1223 | |
total tokens: 2440 num samples: 5 num padding tokens: 285 - rank: 1 max len: 488 min len: 385 avg len: 431.0 num_loss_counted_tokens: 1261 | |
total tokens: 2431 num samples: 13 num padding tokens: 150 - rank: 5 max len: 187 min len: 153 avg len: 175.46153846153845 num_loss_counted_tokens: 1030 | |
total tokens: 2046 num samples: 3 num padding tokens: 290 - rank: 0 max len: 682 min len: 492 avg len: 585.3333333333334 num_loss_counted_tokens: 1136 | |
total tokens: 2219 num samples: 7 num padding tokens: 187 - rank: 3 max len: 317 min len: 274 avg len: 290.2857142857143 num_loss_counted_tokens: 939 | |
total tokens: 2298 num samples: 6 num padding tokens: 239 - rank: 2 max len: 383 min len: 321 avg len: 343.1666666666667 num_loss_counted_tokens: 1233 | |
total tokens: 2533 num samples: 17 num padding tokens: 178 - rank: 6 max len: 149 min len: 126 avg len: 138.52941176470588 num_loss_counted_tokens: 864 | |
Per-token loss scaled by world size: 0.0005102830473333597
Per-token loss scaled by world size: 0.0013571567833423615
Per-token loss scaled by world size: 0.0009806619491428137
Per-token loss scaled by world size: 0.00019946281099691987
Per-token loss scaled by world size: 0.0017977566458284855
Per-token loss scaled by world size: 0.0007940777577459812
Per-token loss scaled by world size: 0.0010863353963941336
Epoch: 2, Step: 283, Rank: 3, loss = 0.3672124445438385
Epoch: 2, Step: 283, Rank: 0, loss = 0.1435384303331375
Epoch: 2, Step: 283, Rank: 2, loss = 0.9766439199447632
Epoch: 2, Step: 283, Rank: 4, loss = 0.571438193321228
Epoch: 2, Step: 283, Rank: 7, loss = 0.7057088613510132
Epoch: 2, Step: 283, Rank: 1, loss = 1.2937105894088745
Epoch: 2, Step: 283, Rank: 5, loss = 0.7817540764808655
Per-token loss scaled by world size: 0.0012883374001830816
Epoch: 2, Step: 283, Rank: 6, loss = 0.927119791507721
[2024-06-27 16:47:46,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=283, skipped=0, lr=[1.4701298701298703e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:46,424] [INFO] [timer.py:260:stop] epoch=0/micro_step=283/global_step=283, RunningAvgSamplesPerSec=95.42722021797677, CurrSamplesPerSec=94.13144446966649, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 94.04508476136 samples/s, lr: 1.4701298701298703e-05, loss: 0.1435384303331375 cuda_mem_allocated: 22.22063636779785 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 5757.0 batch_size: 79.0 total loss: 0.7208908200263977 | |
Epoch 2: 10% 20/205 [00:21<03:16, 1.06s/it]
total tokens: 2464 num samples: 7 num padding tokens: 117 - rank: 1 max len: 352 min len: 318 avg len: 335.2857142857143 num_loss_counted_tokens: 1058
total tokens: 2480 num samples: 10 num padding tokens: 149 - rank: 3 max len: 248 min len: 223 avg len: 233.1 num_loss_counted_tokens: 1068 | |
total tokens: 2272 num samples: 8 num padding tokens: 156 - rank: 2 max len: 284 min len: 253 avg len: 264.5 num_loss_counted_tokens: 917 | |
total tokens: 2392 num samples: 13 num padding tokens: 225 - rank: 5 max len: 184 min len: 155 avg len: 166.69230769230768 num_loss_counted_tokens: 748 | |
total tokens: 2464 num samples: 16 num padding tokens: 161 - rank: 6 max len: 154 min len: 128 avg len: 143.9375 num_loss_counted_tokens: 920 | |
total tokens: 2120 num samples: 5 num padding tokens: 187 - rank: 0 max len: 424 min len: 361 avg len: 386.6 num_loss_counted_tokens: 1437 | |
total tokens: 2442 num samples: 11 num padding tokens: 211 - rank: 4 max len: 222 min len: 185 avg len: 202.8181818181818 num_loss_counted_tokens: 872 | |
total tokens: 2394 num samples: 19 num padding tokens: 328 - rank: 7 max len: 126 min len: 88 avg len: 108.73684210526316 num_loss_counted_tokens: 552 | |
Per-token loss scaled by world size: 0.0003793139476329088
Per-token loss scaled by world size: 0.0012375088408589363
Per-token loss scaled by world size: 0.0005778170307166874
Per-token loss scaled by world size: 0.0009138953755609691
Per-token loss scaled by world size: 0.001354865962639451
Per-token loss scaled by world size: 0.0005813523312099278
Per-token loss scaled by world size: 0.0006126501830294728
Epoch: 2, Step: 284, Rank: 5, loss = 0.7691571712493896
Epoch: 2, Step: 284, Rank: 7, loss = 0.48630523681640625
Epoch: 2, Step: 284, Rank: 0, loss = 0.31924009323120117
Epoch: 2, Step: 284, Rank: 1, loss = 1.0415183305740356
Epoch: 2, Step: 284, Rank: 2, loss = 1.140289068222046
Epoch: 2, Step: 284, Rank: 6, loss = 0.48928067088127136
Epoch: 2, Step: 284, Rank: 3, loss = 0.5156217217445374
Per-token loss scaled by world size: 0.0013596005737781525
Epoch: 2, Step: 284, Rank: 4, loss = 1.1442738771438599
[2024-06-27 16:47:47,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=284, skipped=0, lr=[1.4753246753246754e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:47,484] [INFO] [timer.py:260:stop] epoch=0/micro_step=284/global_step=284, RunningAvgSamplesPerSec=95.42674844176025, CurrSamplesPerSec=95.2943638911885, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.20396730107103 samples/s, lr: 1.4753246753246754e-05, loss: 0.31924009323120117 cuda_mem_allocated: 22.27979040145874 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6733.0 batch_size: 86.0 total loss: 0.7382108569145203 | |
Epoch 2: 10% 21/205 [00:22<03:15, 1.06s/it]
total tokens: 2460 num samples: 15 num padding tokens: 290 - rank: 6 max len: 164 min len: 122 avg len: 144.66666666666666 num_loss_counted_tokens: 882
total tokens: 2436 num samples: 12 num padding tokens: 249 - rank: 5 max len: 203 min len: 166 avg len: 182.25 num_loss_counted_tokens: 941 | |
total tokens: 2530 num samples: 11 num padding tokens: 152 - rank: 4 max len: 230 min len: 205 avg len: 216.1818181818182 num_loss_counted_tokens: 948 | |
total tokens: 2265 num samples: 5 num padding tokens: 271 - rank: 1 max len: 453 min len: 340 avg len: 398.8 num_loss_counted_tokens: 958 | |
total tokens: 2511 num samples: 9 num padding tokens: 237 - rank: 3 max len: 279 min len: 231 avg len: 252.66666666666666 num_loss_counted_tokens: 1235 | |
total tokens: 2282 num samples: 7 num padding tokens: 107 - rank: 2 max len: 326 min len: 289 avg len: 310.7142857142857 num_loss_counted_tokens: 758 | |
total tokens: 2205 num samples: 3 num padding tokens: 275 - rank: 0 max len: 735 min len: 585 avg len: 643.3333333333334 num_loss_counted_tokens: 1228 | |
total tokens: 2520 num samples: 21 num padding tokens: 302 - rank: 7 max len: 120 min len: 83 avg len: 105.61904761904762 num_loss_counted_tokens: 597 | |
Per-token loss scaled by world size: 0.000792104285210371
Per-token loss scaled by world size: 0.0007112606544978917
Per-token loss scaled by world size: 0.002303321612998843
Per-token loss scaled by world size: 0.0009666296537034214
Per-token loss scaled by world size: 0.0004723069432657212
Per-token loss scaled by world size: 0.0009661148069426417
Per-token loss scaled by world size: 0.0004783780896104872
Epoch: 2, Step: 285, Rank: 7, loss = 0.4314523935317993
Epoch: 2, Step: 285, Rank: 4, loss = 0.7235872745513916
Epoch: 2, Step: 285, Rank: 0, loss = 2.1040842533111572
Epoch: 2, Step: 285, Rank: 1, loss = 0.6497365832328796
Epoch: 2, Step: 285, Rank: 6, loss = 0.8825458884239197
Epoch: 2, Step: 285, Rank: 5, loss = 0.8830161690711975
Epoch: 2, Step: 285, Rank: 3, loss = 0.4369983971118927
Per-token loss scaled by world size: 0.0011454490013420582
Epoch: 2, Step: 285, Rank: 2, loss = 1.0463676452636719
[2024-06-27 16:47:48,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=285, skipped=0, lr=[1.4805194805194807e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:48,553] [INFO] [timer.py:260:stop] epoch=0/micro_step=285/global_step=285, RunningAvgSamplesPerSec=95.42365684631729, CurrSamplesPerSec=94.55974794912765, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.46535331968076 samples/s, lr: 1.4805194805194807e-05, loss: 2.1040842533111572 cuda_mem_allocated: 22.303761959075928 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7308.0 batch_size: 92.0 total loss: 0.8947235941886902 | |
Epoch 2: 11% 22/205 [00:23<03:14, 1.06s/it]
total tokens: 2453 num samples: 11 num padding tokens: 189 - rank: 4 max len: 223 min len: 191 avg len: 205.8181818181818 num_loss_counted_tokens: 901
total tokens: 2470 num samples: 13 num padding tokens: 207 - rank: 5 max len: 190 min len: 162 avg len: 174.07692307692307 num_loss_counted_tokens: 967 | |
total tokens: 2430 num samples: 9 num padding tokens: 172 - rank: 3 max len: 270 min len: 223 avg len: 250.88888888888889 num_loss_counted_tokens: 1242 | |
total tokens: 2286 num samples: 6 num padding tokens: 150 - rank: 1 max len: 381 min len: 325 avg len: 356.0 num_loss_counted_tokens: 1157 | |
total tokens: 2480 num samples: 8 num padding tokens: 204 - rank: 2 max len: 310 min len: 271 avg len: 284.5 num_loss_counted_tokens: 1051 | |
total tokens: 2496 num samples: 16 num padding tokens: 170 - rank: 6 max len: 156 min len: 131 avg len: 145.375 num_loss_counted_tokens: 907 | |
total tokens: 2340 num samples: 18 num padding tokens: 432 - rank: 7 max len: 130 min len: 82 avg len: 106.0 num_loss_counted_tokens: 474 | |
total tokens: 2052 num samples: 3 num padding tokens: 253 - rank: 0 max len: 684 min len: 521 avg len: 599.6666666666666 num_loss_counted_tokens: 1190 | |
Per-token loss scaled by world size: 0.0022972438018769026
Per-token loss scaled by world size: 0.0005985541502013803
Per-token loss scaled by world size: 0.0005281041958369315
Per-token loss scaled by world size: 0.001802445505745709
Per-token loss scaled by world size: 0.0009410215425305068
Per-token loss scaled by world size: 0.0010664846049621701
Per-token loss scaled by world size: 0.0009071247186511755
Epoch: 2, Step: 286, Rank: 0, loss = 0.5553834438323975
Epoch: 2, Step: 286, Rank: 1, loss = 2.1315550804138184
Epoch: 2, Step: 286, Rank: 2, loss = 1.6724441051483154
Epoch: 2, Step: 286, Rank: 7, loss = 0.4900146722793579
Epoch: 2, Step: 286, Rank: 6, loss = 0.9895643591880798 | |
Epoch: 2, Step: 286, Rank: 4, loss = 0.8731503486633301 | |
Epoch: 2, Step: 286, Rank: 3, loss = 0.8416983485221863 | |
Per-token loss scaled by world size: 0.0007096081390045583 | |
Epoch: 2, Step: 286, Rank: 5, loss = 0.6584276556968689 | |
[2024-06-27 16:47:49,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=286, skipped=0, lr=[1.4857142857142858e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:49,608] [INFO] [timer.py:260:stop] epoch=0/micro_step=286/global_step=286, RunningAvgSamplesPerSec=95.42627896201874, CurrSamplesPerSec=96.17417407662536, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 96.08542598569409 samples/s, lr: 1.4857142857142858e-05, loss: 0.5553834438323975 cuda_mem_allocated: 22.28468084335327 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7423.0 batch_size: 77.0 total loss: 1.0265296697616577 | |
Epoch 2: 11% 23/205 [00:24<03:13, 1.06s/it]
total tokens: 2313 num samples: 9 num padding tokens: 61 - rank: 3 max len: 257 min len: 233 avg len: 250.22222222222223 num_loss_counted_tokens: 850
total tokens: 2330 num samples: 10 num padding tokens: 115 - rank: 4 max len: 233 min len: 210 avg len: 221.5 num_loss_counted_tokens: 582 | |
total tokens: 2387 num samples: 7 num padding tokens: 225 - rank: 1 max len: 341 min len: 285 avg len: 308.85714285714283 num_loss_counted_tokens: 937 | |
total tokens: 2528 num samples: 16 num padding tokens: 200 - rank: 6 max len: 158 min len: 129 avg len: 145.5 num_loss_counted_tokens: 905 | |
total tokens: 2496 num samples: 12 num padding tokens: 238 - rank: 5 max len: 208 min len: 171 avg len: 188.16666666666666 num_loss_counted_tokens: 856 | |
total tokens: 2520 num samples: 9 num padding tokens: 85 - rank: 2 max len: 280 min len: 258 avg len: 270.55555555555554 num_loss_counted_tokens: 1197 | |
total tokens: 2432 num samples: 19 num padding tokens: 359 - rank: 7 max len: 128 min len: 80 avg len: 109.10526315789474 num_loss_counted_tokens: 497 | |
total tokens: 2285 num samples: 5 num padding tokens: 298 - rank: 0 max len: 457 min len: 348 avg len: 397.4 num_loss_counted_tokens: 1085 | |
Per-token loss scaled by world size: 0.0006646617548540235
Per-token loss scaled by world size: 0.0008884237031452358
Per-token loss scaled by world size: 0.0013320022262632847
Per-token loss scaled by world size: 0.0007605497958138585
Per-token loss scaled by world size: 0.0006397353135980666
Per-token loss scaled by world size: 0.0009263490210287273
Per-token loss scaled by world size: 0.0012889462523162365
Epoch: 2, Step: 287, Rank: 1, loss = 0.5510876774787903
Epoch: 2, Step: 287, Rank: 2, loss = 1.1043963432312012
Epoch: 2, Step: 287, Rank: 7, loss = 0.6305908560752869
Epoch: 2, Step: 287, Rank: 5, loss = 0.5304205417633057
Epoch: 2, Step: 287, Rank: 4, loss = 0.7366142868995667 | |
Epoch: 2, Step: 287, Rank: 6, loss = 0.7680591344833374 | |
Epoch: 2, Step: 287, Rank: 3, loss = 1.0686975717544556 | |
Per-token loss scaled by world size: 0.0005769465351477265 | |
Epoch: 2, Step: 287, Rank: 0, loss = 0.4783608019351959 | |
[2024-06-27 16:47:50,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=287, skipped=0, lr=[1.4909090909090911e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:50,681] [INFO] [timer.py:260:stop] epoch=0/micro_step=287/global_step=287, RunningAvgSamplesPerSec=95.42272362819814, CurrSamplesPerSec=94.4236180761935, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.33760576728211 samples/s, lr: 1.4909090909090911e-05, loss: 0.4783608019351959 cuda_mem_allocated: 22.301377296447754 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6633.0 batch_size: 91.0 total loss: 0.7335284352302551 | |
Epoch 2: 12% 24/205 [00:25<03:12, 1.06s/it]
total tokens: 2496 num samples: 13 num padding tokens: 232 - rank: 6 max len: 192 min len: 146 avg len: 174.15384615384616 num_loss_counted_tokens: 703
total tokens: 2508 num samples: 11 num padding tokens: 182 - rank: 5 max len: 228 min len: 193 avg len: 211.45454545454547 num_loss_counted_tokens: 850 | |
total tokens: 2430 num samples: 9 num padding tokens: 185 - rank: 4 max len: 270 min len: 231 avg len: 249.44444444444446 num_loss_counted_tokens: 836 | |
total tokens: 2366 num samples: 7 num padding tokens: 177 - rank: 3 max len: 338 min len: 291 avg len: 312.7142857142857 num_loss_counted_tokens: 1195 | |
total tokens: 2388 num samples: 4 num padding tokens: 432 - rank: 1 max len: 597 min len: 404 avg len: 489.0 num_loss_counted_tokens: 1226 | |
total tokens: 2376 num samples: 6 num padding tokens: 158 - rank: 2 max len: 396 min len: 345 avg len: 369.6666666666667 num_loss_counted_tokens: 1205 | |
total tokens: 1856 num samples: 2 num padding tokens: 296 - rank: 0 max len: 928 min len: 632 avg len: 780.0 num_loss_counted_tokens: 985 | |
total tokens: 2465 num samples: 17 num padding tokens: 395 - rank: 7 max len: 145 min len: 94 avg len: 121.76470588235294 num_loss_counted_tokens: 703 | |
Per-token loss scaled by world size: 0.000935838557779789
Per-token loss scaled by world size: 0.0012919281143695116
Per-token loss scaled by world size: 0.0009970334358513355
Per-token loss scaled by world size: 0.0009888159111142159
Per-token loss scaled by world size: 0.00047487858682870865
Per-token loss scaled by world size: 0.0003670241276267916
Per-token loss scaled by world size: 0.0007585382554680109
Epoch: 2, Step: 288, Rank: 6, loss = 0.7653989791870117
Epoch: 2, Step: 288, Rank: 2, loss = 1.0566357374191284
Epoch: 2, Step: 288, Rank: 7, loss = 0.38839131593704224
Epoch: 2, Step: 288, Rank: 5, loss = 0.8154487013816833
Epoch: 2, Step: 288, Rank: 0, loss = 0.3001798689365387
Epoch: 2, Step: 288, Rank: 1, loss = 0.8087278604507446
Per-token loss scaled by world size: 0.000985165243037045
Epoch: 2, Step: 288, Rank: 3, loss = 0.620389461517334 | |
Epoch: 2, Step: 288, Rank: 4, loss = 0.8057420253753662 | |
[2024-06-27 16:47:51,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=288, skipped=0, lr=[1.4961038961038962e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:51,738] [INFO] [timer.py:260:stop] epoch=0/micro_step=288/global_step=288, RunningAvgSamplesPerSec=95.42370545211905, CurrSamplesPerSec=95.70435112059842, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.61742125427978 samples/s, lr: 1.4961038961038962e-05, loss: 0.3001798689365387 cuda_mem_allocated: 22.256892204284668 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6543.0 batch_size: 86.0 total loss: 0.6951143145561218 | |
Epoch 2: 12% 25/205 [00:26<03:11, 1.06s/it]
total tokens: 2457 num samples: 13 num padding tokens: 197 - rank: 5 max len: 189 min len: 163 avg len: 173.84615384615384 num_loss_counted_tokens: 783
total tokens: 2226 num samples: 6 num padding tokens: 130 - rank: 1 max len: 371 min len: 334 avg len: 349.3333333333333 num_loss_counted_tokens: 1039 | |
total tokens: 2415 num samples: 15 num padding tokens: 259 - rank: 6 max len: 161 min len: 130 avg len: 143.73333333333332 num_loss_counted_tokens: 837 | |
total tokens: 2500 num samples: 5 num padding tokens: 297 - rank: 0 max len: 500 min len: 384 avg len: 440.6 num_loss_counted_tokens: 1323 | |
total tokens: 2303 num samples: 7 num padding tokens: 162 - rank: 2 max len: 329 min len: 268 avg len: 305.85714285714283 num_loss_counted_tokens: 805 | |
total tokens: 2432 num samples: 19 num padding tokens: 217 - rank: 7 max len: 128 min len: 94 avg len: 116.57894736842105 num_loss_counted_tokens: 744 | |
total tokens: 2420 num samples: 11 num padding tokens: 210 - rank: 4 max len: 220 min len: 189 avg len: 200.9090909090909 num_loss_counted_tokens: 867 | |
total tokens: 2394 num samples: 9 num padding tokens: 232 - rank: 3 max len: 266 min len: 223 avg len: 240.22222222222223 num_loss_counted_tokens: 1027 | |
Per-token loss scaled by world size: 0.0015897402772679925
Per-token loss scaled by world size: 0.0013519873609766364
Per-token loss scaled by world size: 0.0007269562920555472
Per-token loss scaled by world size: 0.00042284891242161393
Per-token loss scaled by world size: 0.0006393829244188964
Per-token loss scaled by world size: 0.0010436181910336018
Per-token loss scaled by world size: 0.0009967123623937368 | |
Epoch: 2, Step: 289, Rank: 0, loss = 0.6066944599151611 | |
Epoch: 2, Step: 289, Rank: 1, loss = 1.5084648132324219 | |
Epoch: 2, Step: 289, Rank: 7, loss = 0.40123075246810913
Epoch: 2, Step: 289, Rank: 2, loss = 1.2828669548034668
Epoch: 2, Step: 289, Rank: 5, loss = 0.9902631640434265
Epoch: 2, Step: 289, Rank: 3, loss = 0.689790666103363
Epoch: 2, Step: 289, Rank: 6, loss = 0.9457554817199707 | |
Per-token loss scaled by world size: 0.0005243066698312759 | |
Epoch: 2, Step: 289, Rank: 4, loss = 0.4975014925003052 | |
[2024-06-27 16:47:52,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=289, skipped=0, lr=[1.5012987012987015e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:52,804] [INFO] [timer.py:260:stop] epoch=0/micro_step=289/global_step=289, RunningAvgSamplesPerSec=95.41939947187649, CurrSamplesPerSec=94.20363499392296, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.11060953462771 samples/s, lr: 1.5012987012987015e-05, loss: 0.6066944599151611 cuda_mem_allocated: 22.264524936676025 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7591.0 batch_size: 86.0 total loss: 0.8653209805488586 | |
Epoch 2: 13% 26/205 [00:27<03:10, 1.06s/it]
total tokens: 2420 num samples: 10 num padding tokens: 192 - rank: 4 max len: 242 min len: 211 avg len: 222.8 num_loss_counted_tokens: 1163
total tokens: 2219 num samples: 7 num padding tokens: 290 - rank: 3 max len: 317 min len: 248 avg len: 275.57142857142856 num_loss_counted_tokens: 879 | |
total tokens: 2366 num samples: 14 num padding tokens: 310 - rank: 6 max len: 169 min len: 129 avg len: 146.85714285714286 num_loss_counted_tokens: 705 | |
total tokens: 2532 num samples: 12 num padding tokens: 212 - rank: 5 max len: 211 min len: 173 avg len: 193.33333333333334 num_loss_counted_tokens: 932 | |
total tokens: 2527 num samples: 7 num padding tokens: 145 - rank: 2 max len: 361 min len: 331 avg len: 340.2857142857143 num_loss_counted_tokens: 1627 | |
total tokens: 2514 num samples: 6 num padding tokens: 147 - rank: 1 max len: 419 min len: 362 avg len: 394.5 num_loss_counted_tokens: 1196 | |
total tokens: 2520 num samples: 4 num padding tokens: 525 - rank: 0 max len: 630 min len: 420 avg len: 498.75 num_loss_counted_tokens: 1246 | |
total tokens: 2520 num samples: 20 num padding tokens: 381 - rank: 7 max len: 126 min len: 85 avg len: 106.95 num_loss_counted_tokens: 571 | |
Per-token loss scaled by world size: 0.0009496136917732656
Per-token loss scaled by world size: 0.0003522684855852276
Per-token loss scaled by world size: 0.0017158660339191556
Per-token loss scaled by world size: 0.0007982291281223297
Per-token loss scaled by world size: 0.0007976594497449696
Per-token loss scaled by world size: 0.0015981205506250262
Per-token loss scaled by world size: 0.0007958402275107801
Epoch: 2, Step: 290, Rank: 1, loss = 0.8614183068275452
Epoch: 2, Step: 290, Rank: 2, loss = 1.5565049648284912
Epoch: 2, Step: 290, Rank: 5, loss = 0.7240936160087585
Epoch: 2, Step: 290, Rank: 7, loss = 0.319551557302475
Epoch: 2, Step: 290, Rank: 4, loss = 0.7235768437385559
Epoch: 2, Step: 290, Rank: 3, loss = 0.7219265699386597
Epoch: 2, Step: 290, Rank: 0, loss = 1.449695110321045
Per-token loss scaled by world size: 0.0007516335463151336 | |
Epoch: 2, Step: 290, Rank: 6, loss = 0.681825578212738 | |
[2024-06-27 16:47:53,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=0, lr=[1.5064935064935066e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:53,853] [INFO] [timer.py:260:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=95.42317021371989, CurrSamplesPerSec=96.51783102789419, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 96.4148942054544 samples/s, lr: 1.5064935064935066e-05, loss: 1.449695110321045 cuda_mem_allocated: 22.27001142501831 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7257.0 batch_size: 82.0 total loss: 0.8798239827156067 | |
Epoch 2: 13% 27/205 [00:29<03:08, 1.06s/it]
total tokens: 2460 num samples: 12 num padding tokens: 163 - rank: 5 max len: 205 min len: 172 avg len: 191.41666666666666 num_loss_counted_tokens: 842
total tokens: 2456 num samples: 8 num padding tokens: 332 - rank: 3 max len: 307 min len: 237 avg len: 265.5 num_loss_counted_tokens: 667 | |
total tokens: 2340 num samples: 10 num padding tokens: 143 - rank: 4 max len: 234 min len: 207 avg len: 219.7 num_loss_counted_tokens: 967 | |
total tokens: 2475 num samples: 15 num padding tokens: 254 - rank: 6 max len: 165 min len: 135 avg len: 148.06666666666666 num_loss_counted_tokens: 761 | |
total tokens: 2270 num samples: 5 num padding tokens: 215 - rank: 1 max len: 454 min len: 376 avg len: 411.0 num_loss_counted_tokens: 663 | |
total tokens: 2208 num samples: 6 num padding tokens: 167 - rank: 2 max len: 368 min len: 308 avg len: 340.1666666666667 num_loss_counted_tokens: 849 | |
total tokens: 2412 num samples: 18 num padding tokens: 413 - rank: 7 max len: 134 min len: 78 avg len: 111.05555555555556 num_loss_counted_tokens: 588 | |
total tokens: 2088 num samples: 3 num padding tokens: 267 - rank: 0 max len: 696 min len: 500 avg len: 607.0 num_loss_counted_tokens: 1365 | |
Per-token loss scaled by world size: 0.0007570798043161631
Per-token loss scaled by world size: 0.0014867089921608567
Per-token loss scaled by world size: 0.0018571114633232355
Per-token loss scaled by world size: 0.0012333773775026202
Per-token loss scaled by world size: 0.0010129191214218736
Per-token loss scaled by world size: 0.0006084711058065295
Per-token loss scaled by world size: 0.0011342046782374382
Epoch: 2, Step: 291, Rank: 0, loss = 1.164616584777832
Epoch: 2, Step: 291, Rank: 2, loss = 1.4038249254226685
Epoch: 2, Step: 291, Rank: 1, loss = 1.753577470779419
Epoch: 2, Step: 291, Rank: 5, loss = 0.7148725986480713
Epoch: 2, Step: 291, Rank: 3, loss = 0.5745488405227661
Epoch: 2, Step: 291, Rank: 4, loss = 1.0709728002548218
Epoch: 2, Step: 291, Rank: 6, loss = 0.9564489126205444
Per-token loss scaled by world size: 0.0005452057812362909 | |
Epoch: 2, Step: 291, Rank: 7, loss = 0.5148105621337891 | |
[2024-06-27 16:47:54,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=291, skipped=0, lr=[1.511688311688312e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:54,912] [INFO] [timer.py:260:stop] epoch=0/micro_step=291/global_step=291, RunningAvgSamplesPerSec=95.42369506568238, CurrSamplesPerSec=95.57509308795663, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 95.48484265697368 samples/s, lr: 1.511688311688312e-05, loss: 1.164616584777832 cuda_mem_allocated: 22.293625354766846 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7554.0 batch_size: 81.0 total loss: 1.0192091464996338 | |
Epoch 2: 14% 28/205 [00:30<03:07, 1.06s/it]
total tokens: 2496 num samples: 13 num padding tokens: 155 - rank: 5 max len: 192 min len: 174 avg len: 180.07692307692307 num_loss_counted_tokens: 1030
total tokens: 2400 num samples: 6 num padding tokens: 256 - rank: 1 max len: 400 min len: 319 avg len: 357.3333333333333 num_loss_counted_tokens: 644 | |
total tokens: 2408 num samples: 14 num padding tokens: 223 - rank: 6 max len: 172 min len: 141 avg len: 156.07142857142858 num_loss_counted_tokens: 655 | |
total tokens: 2340 num samples: 9 num padding tokens: 178 - rank: 3 max len: 260 min len: 229 avg len: 240.22222222222223 num_loss_counted_tokens: 860 | |
total tokens: 2360 num samples: 8 num padding tokens: 127 - rank: 2 max len: 295 min len: 261 avg len: 279.125 num_loss_counted_tokens: 956 | |
total tokens: 2508 num samples: 11 num padding tokens: 146 - rank: 4 max len: 228 min len: 197 avg len: 214.72727272727272 num_loss_counted_tokens: 809 | |
total tokens: 2228 num samples: 4 num padding tokens: 237 - rank: 0 max len: 557 min len: 422 avg len: 497.75 num_loss_counted_tokens: 1191 | |
total tokens: 2025 num samples: 15 num padding tokens: 359 - rank: 7 max len: 135 min len: 79 avg len: 111.06666666666666 num_loss_counted_tokens: 452 | |
Per-token loss scaled by world size: 0.000837460218463093
Per-token loss scaled by world size: 0.0007171059842221439
Per-token loss scaled by world size: 0.0011792497243732214
Per-token loss scaled by world size: 0.0012801220873370767
Per-token loss scaled by world size: 0.0008809207356534898
Per-token loss scaled by world size: 0.0009536764118820429
Epoch: 2, Step: 292, Rank: 6, loss = 0.6247785687446594
Epoch: 2, Step: 292, Rank: 3, loss = 0.7296372056007385
Epoch: 2, Step: 292, Rank: 4, loss = 1.0274213552474976
Epoch: 2, Step: 292, Rank: 1, loss = 1.1153063774108887
Epoch: 2, Step: 292, Rank: 2, loss = 0.8308905959129333 | |
Epoch: 2, Step: 292, Rank: 5, loss = 0.7675021886825562 | |
Per-token loss scaled by world size: 0.0009378226823173463 | |
Per-token loss scaled by world size: 0.0004570106975734234 | |
Epoch: 2, Step: 292, Rank: 0, loss = 0.8170779943466187 | |
Epoch: 2, Step: 292, Rank: 7, loss = 0.3981705605983734 | |
[2024-06-27 16:47:55,902] [INFO] [logging.py:96:log_dist] [Rank 0] step=292, skipped=0, lr=[1.516883116883117e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:55,975] [INFO] [timer.py:260:stop] epoch=0/micro_step=292/global_step=292, RunningAvgSamplesPerSec=95.422083964489, CurrSamplesPerSec=94.95874438954262, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.86766436549902 samples/s, lr: 1.516883116883117e-05, loss: 0.8170779943466187 cuda_mem_allocated: 22.302688598632812 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6970.0 batch_size: 86.0 total loss: 0.7888481616973877 | |
Epoch 2: 14% 29/205 [00:31<03:06, 1.06s/it]
total tokens: 2505 num samples: 15 num padding tokens: 223 - rank: 6 max len: 167 min len: 133 avg len: 152.13333333333333 num_loss_counted_tokens: 952
total tokens: 2420 num samples: 10 num padding tokens: 191 - rank: 4 max len: 242 min len: 207 avg len: 222.9 num_loss_counted_tokens: 1009 | |
total tokens: 2457 num samples: 7 num padding tokens: 242 - rank: 1 max len: 351 min len: 288 avg len: 316.42857142857144 num_loss_counted_tokens: 1331 | |
total tokens: 2256 num samples: 8 num padding tokens: 52 - rank: 2 max len: 282 min len: 267 avg len: 275.5 num_loss_counted_tokens: 980 | |
total tokens: 2400 num samples: 12 num padding tokens: 229 - rank: 5 max len: 200 min len: 168 avg len: 180.91666666666666 num_loss_counted_tokens: 970 | |
total tokens: 2349 num samples: 9 num padding tokens: 101 - rank: 3 max len: 261 min len: 242 avg len: 249.77777777777777 num_loss_counted_tokens: 693 | |
total tokens: 2470 num samples: 5 num padding tokens: 351 - rank: 0 max len: 494 min len: 356 avg len: 423.8 num_loss_counted_tokens: 1326 | |
total tokens: 2508 num samples: 19 num padding tokens: 459 - rank: 7 max len: 132 min len: 78 avg len: 107.84210526315789 num_loss_counted_tokens: 598 | |
Per-token loss scaled by world size: 0.0009519039886072278
Per-token loss scaled by world size: 0.0006040088483132422
Per-token loss scaled by world size: 0.0005076800589449704
Per-token loss scaled by world size: 0.0009202412329614162
Per-token loss scaled by world size: 0.001118575339205563
Per-token loss scaled by world size: 0.0008934871293604374
Per-token loss scaled by world size: 0.000671601970680058
Epoch: 2, Step: 293, Rank: 3, loss = 0.7928170561790466
Epoch: 2, Step: 293, Rank: 4, loss = 0.5030638575553894
Epoch: 2, Step: 293, Rank: 7, loss = 0.42283400893211365
Epoch: 2, Step: 293, Rank: 5, loss = 0.7664459347724915
Epoch: 2, Step: 293, Rank: 6, loss = 0.931633472442627
Epoch: 2, Step: 293, Rank: 1, loss = 0.7441630959510803 | |
Epoch: 2, Step: 293, Rank: 2, loss = 0.5593605041503906 | |
Per-token loss scaled by world size: 0.001163340755738318 | |
Epoch: 2, Step: 293, Rank: 0, loss = 0.9689174294471741 | |
[2024-06-27 16:47:56,967] [INFO] [logging.py:96:log_dist] [Rank 0] step=293, skipped=0, lr=[1.5220779220779223e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:57,041] [INFO] [timer.py:260:stop] epoch=0/micro_step=293/global_step=293, RunningAvgSamplesPerSec=95.42051382942411, CurrSamplesPerSec=94.9673446052654, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.8706595514431 samples/s, lr: 1.5220779220779223e-05, loss: 0.9689174294471741 cuda_mem_allocated: 22.30650568008423 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6663.0 batch_size: 80.0 total loss: 0.7111544013023376 | |
Epoch 2: 15% 30/205 [00:32<03:05, 1.06s/it] total tokens: 2432 num samples: 16 num padding tokens: 249 - rank: 6 max len: 152 min len: 120 avg len: 136.4375 num_loss_counted_tokens: 686 | |
total tokens: 2431 num samples: 13 num padding tokens: 199 - rank: 5 max len: 187 min len: 153 avg len: 171.69230769230768 num_loss_counted_tokens: 752 | |
total tokens: 2490 num samples: 10 num padding tokens: 340 - rank: 4 max len: 249 min len: 189 avg len: 215.0 num_loss_counted_tokens: 863 | |
total tokens: 2360 num samples: 8 num padding tokens: 151 - rank: 3 max len: 295 min len: 262 avg len: 276.125 num_loss_counted_tokens: 1156 | |
total tokens: 2470 num samples: 5 num padding tokens: 245 - rank: 1 max len: 494 min len: 407 avg len: 445.0 num_loss_counted_tokens: 1171 | |
total tokens: 2346 num samples: 6 num padding tokens: 280 - rank: 2 max len: 391 min len: 311 avg len: 344.3333333333333 num_loss_counted_tokens: 1286 | |
total tokens: 2006 num samples: 17 num padding tokens: 316 - rank: 7 max len: 118 min len: 76 avg len: 99.41176470588235 num_loss_counted_tokens: 402 | |
total tokens: 2436 num samples: 3 num padding tokens: 566 - rank: 0 max len: 812 min len: 523 avg len: 623.3333333333334 num_loss_counted_tokens: 1030 | |
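[editor's note] The arithmetic linking these lines can be checked directly: each rank's reported loss equals its "Per-token loss scaled by world size" multiplied by the step's num_loss_counted_tokens and divided by the world size (8 ranks here), and the reported "total loss" is the mean of the per-rank losses. A minimal sketch, assuming that relationship and using values copied from step 293 above (the helper name is hypothetical):

```python
import math

WORLD_SIZE = 8  # ranks 0-7 appear in this log


def rank_loss(per_token_loss_scaled, num_loss_counted_tokens, world_size=WORLD_SIZE):
    """Reconstruct a rank's reported loss from its scaled per-token loss."""
    return per_token_loss_scaled * num_loss_counted_tokens / world_size


# Rank 0 at step 293 (num_loss_counted_tokens: 6663.0).
reconstructed = rank_loss(0.001163340755738318, 6663.0)
print(math.isclose(reconstructed, 0.9689174294471741, rel_tol=1e-5))  # True

# The "total loss" printed with the throughput line is the mean of the
# eight per-rank losses logged for that step.
step_293_rank_losses = [
    0.9689174294471741, 0.7441630959510803, 0.5593605041503906,
    0.7928170561790466, 0.5030638575553894, 0.7664459347724915,
    0.931633472442627, 0.42283400893211365,
]
total_loss = sum(step_293_rank_losses) / len(step_293_rank_losses)
print(math.isclose(total_loss, 0.7111544013023376, rel_tol=1e-5))  # True
```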
Per-token loss scaled by world size: 0.0007206430891528726
Per-token loss scaled by world size: 0.0008215337293222547
Per-token loss scaled by world size: 0.001095075742341578
Per-token loss scaled by world size: 0.0006648896960541606
Per-token loss scaled by world size: 0.0004951409646309912
Per-token loss scaled by world size: 0.0006364326691254973
Per-token loss scaled by world size: 0.001072457293048501
Epoch: 2, Step: 294, Rank: 6, loss = 0.6055203676223755
Epoch: 2, Step: 294, Rank: 3, loss = 0.6902937293052673
Epoch: 2, Step: 294, Rank: 4, loss = 0.920137345790863
Epoch: 2, Step: 294, Rank: 1, loss = 0.5347625613212585 | |
Epoch: 2, Step: 294, Rank: 2, loss = 0.5586735606193542
Epoch: 2, Step: 294, Rank: 7, loss = 0.4160422086715698
Epoch: 2, Step: 294, Rank: 5, loss = 0.9011322855949402
Per-token loss scaled by world size: 0.0010999409714713693 | |
Epoch: 2, Step: 294, Rank: 0, loss = 0.9242254495620728 | |
[2024-06-27 16:47:58,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=294, skipped=0, lr=[1.5272727272727276e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:58,104] [INFO] [timer.py:260:stop] epoch=0/micro_step=294/global_step=294, RunningAvgSamplesPerSec=95.42020645985448, CurrSamplesPerSec=95.33084596726364, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.22616751348268 samples/s, lr: 1.5272727272727276e-05, loss: 0.9242254495620728 cuda_mem_allocated: 22.30698299407959 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6722.0 batch_size: 86.0 total loss: 0.6938484311103821 | |
Epoch 2: 15% 31/205 [00:33<03:04, 1.06s/it] total tokens: 2334 num samples: 6 num padding tokens: 223 - rank: 1 max len: 389 min len: 326 avg len: 351.8333333333333 num_loss_counted_tokens: 1509 | |
total tokens: 2436 num samples: 14 num padding tokens: 261 - rank: 6 max len: 174 min len: 141 avg len: 155.35714285714286 num_loss_counted_tokens: 920 | |
total tokens: 2268 num samples: 7 num padding tokens: 249 - rank: 2 max len: 324 min len: 269 avg len: 288.42857142857144 num_loss_counted_tokens: 1084 | |
total tokens: 2412 num samples: 9 num padding tokens: 178 - rank: 3 max len: 268 min len: 223 avg len: 248.22222222222223 num_loss_counted_tokens: 593 | |
total tokens: 2085 num samples: 15 num padding tokens: 397 - rank: 7 max len: 139 min len: 87 avg len: 112.53333333333333 num_loss_counted_tokens: 514 | |
total tokens: 2364 num samples: 12 num padding tokens: 149 - rank: 5 max len: 197 min len: 177 avg len: 184.58333333333334 num_loss_counted_tokens: 721 | |
total tokens: 2442 num samples: 11 num padding tokens: 154 - rank: 4 max len: 222 min len: 197 avg len: 208.0 num_loss_counted_tokens: 1093 | |
total tokens: 2455 num samples: 5 num padding tokens: 222 - rank: 0 max len: 491 min len: 394 avg len: 446.6 num_loss_counted_tokens: 701 | |
Per-token loss scaled by world size: 0.0011166369076818228
Per-token loss scaled by world size: 0.0007115188054740429
Per-token loss scaled by world size: 0.00012176240852568299
Per-token loss scaled by world size: 0.0029387231916189194
Per-token loss scaled by world size: 0.0008289811084978282
Per-token loss scaled by world size: 0.0007433113642036915
Per-token loss scaled by world size: 0.001228688401170075
Epoch: 2, Step: 295, Rank: 6, loss = 0.8388734459877014
Epoch: 2, Step: 295, Rank: 4, loss = 0.5345284938812256
Epoch: 2, Step: 295, Rank: 1, loss = 2.2077157497406006 | |
Epoch: 2, Step: 295, Rank: 7, loss = 0.6227720379829407
Epoch: 2, Step: 295, Rank: 0, loss = 0.0914740115404129
Epoch: 2, Step: 295, Rank: 2, loss = 0.9230521321296692
Epoch: 2, Step: 295, Rank: 3, loss = 0.5584126710891724
Per-token loss scaled by world size: 0.001281685777939856
Epoch: 2, Step: 295, Rank: 5, loss = 0.9628664255142212 | |
[2024-06-27 16:47:59,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=295, skipped=0, lr=[1.5324675324675326e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:47:59,160] [INFO] [timer.py:260:stop] epoch=0/micro_step=295/global_step=295, RunningAvgSamplesPerSec=95.421943605139, CurrSamplesPerSec=95.93191020567839, MemAllocated=22.17GB, MaxMemAllocated=28.61GB | |
throughput: 95.83129066222908 samples/s, lr: 1.5324675324675326e-05, loss: 0.0914740115404129 cuda_mem_allocated: 22.17197847366333 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6010.0 batch_size: 71.0 total loss: 0.8424617648124695 | |
Epoch 2: 16% 32/205 [00:34<03:03, 1.06s/it] total tokens: 2296 num samples: 7 num padding tokens: 128 - rank: 3 max len: 328 min len: 293 avg len: 309.7142857142857 num_loss_counted_tokens: 1074 | |
total tokens: 2296 num samples: 8 num padding tokens: 253 - rank: 4 max len: 287 min len: 223 avg len: 255.375 num_loss_counted_tokens: 697 | |
total tokens: 2478 num samples: 6 num padding tokens: 344 - rank: 2 max len: 413 min len: 330 avg len: 355.6666666666667 num_loss_counted_tokens: 1147 | |
total tokens: 2442 num samples: 11 num padding tokens: 147 - rank: 5 max len: 222 min len: 192 avg len: 208.63636363636363 num_loss_counted_tokens: 872 | |
total tokens: 2483 num samples: 13 num padding tokens: 177 - rank: 6 max len: 191 min len: 159 avg len: 177.3846153846154 num_loss_counted_tokens: 753 | |
total tokens: 2515 num samples: 5 num padding tokens: 134 - rank: 1 max len: 503 min len: 453 avg len: 476.2 num_loss_counted_tokens: 1036 | |
total tokens: 1726 num samples: 2 num padding tokens: 146 - rank: 0 max len: 863 min len: 717 avg len: 790.0 num_loss_counted_tokens: 1299 | |
total tokens: 2256 num samples: 16 num padding tokens: 392 - rank: 7 max len: 141 min len: 84 avg len: 116.5 num_loss_counted_tokens: 539 | |
Per-token loss scaled by world size: 0.0008387180860154331
Per-token loss scaled by world size: 0.0010704942978918552
Per-token loss scaled by world size: 0.000516597181558609
Per-token loss scaled by world size: 0.0007347949431277812
Per-token loss scaled by world size: 0.0009301775717176497
Per-token loss scaled by world size: 0.0010381819447502494
Per-token loss scaled by world size: 0.0013808500953018665
Epoch: 2, Step: 296, Rank: 2, loss = 0.9551485180854797 | |
Epoch: 2, Step: 296, Rank: 7, loss = 0.4609338343143463
Epoch: 2, Step: 296, Rank: 5, loss = 0.748346209526062
Epoch: 2, Step: 296, Rank: 3, loss = 0.655620813369751
Epoch: 2, Step: 296, Rank: 4, loss = 0.8299509286880493
Epoch: 2, Step: 296, Rank: 6, loss = 0.9263178706169128
Epoch: 2, Step: 296, Rank: 1, loss = 1.2320635318756104
Per-token loss scaled by world size: 0.001052703708410263 | |
Epoch: 2, Step: 296, Rank: 0, loss = 0.9392748475074768 | |
[2024-06-27 16:48:00,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=296, skipped=0, lr=[1.537662337662338e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:00,227] [INFO] [timer.py:260:stop] epoch=0/micro_step=296/global_step=296, RunningAvgSamplesPerSec=95.42007824321512, CurrSamplesPerSec=94.8766504907557, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.78334039131921 samples/s, lr: 1.537662337662338e-05, loss: 0.9392748475074768 cuda_mem_allocated: 22.30698299407959 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7138.0 batch_size: 88.0 total loss: 0.843457043170929 | |
Epoch 2: 16% 33/205 [00:35<03:02, 1.06s/it] total tokens: 2352 num samples: 6 num padding tokens: 252 - rank: 1 max len: 392 min len: 327 avg len: 350.0 num_loss_counted_tokens: 984 | |
total tokens: 2420 num samples: 10 num padding tokens: 212 - rank: 4 max len: 242 min len: 202 avg len: 220.8 num_loss_counted_tokens: 1076 | |
total tokens: 2436 num samples: 14 num padding tokens: 272 - rank: 6 max len: 174 min len: 133 avg len: 154.57142857142858 num_loss_counted_tokens: 929 | |
total tokens: 2451 num samples: 19 num padding tokens: 426 - rank: 7 max len: 129 min len: 79 avg len: 106.57894736842105 num_loss_counted_tokens: 511 | |
total tokens: 2282 num samples: 7 num padding tokens: 181 - rank: 2 max len: 326 min len: 282 avg len: 300.14285714285717 num_loss_counted_tokens: 988 | |
total tokens: 2400 num samples: 12 num padding tokens: 126 - rank: 5 max len: 200 min len: 176 avg len: 189.5 num_loss_counted_tokens: 895 | |
total tokens: 2511 num samples: 9 num padding tokens: 164 - rank: 3 max len: 279 min len: 242 avg len: 260.77777777777777 num_loss_counted_tokens: 893 | |
total tokens: 2304 num samples: 4 num padding tokens: 456 - rank: 0 max len: 576 min len: 399 avg len: 462.0 num_loss_counted_tokens: 915 | |
Per-token loss scaled by world size: 0.0017885491251945496
Per-token loss scaled by world size: 0.0004809042438864708
Per-token loss scaled by world size: 0.0008734623552300036
Per-token loss scaled by world size: 0.0009589239489287138
Per-token loss scaled by world size: 0.002212283667176962
Per-token loss scaled by world size: 0.0007173180347308517
Per-token loss scaled by world size: 0.0007839151076041162
Epoch: 2, Step: 297, Rank: 0, loss = 0.7698478698730469 | |
Epoch: 2, Step: 297, Rank: 3, loss = 0.8451715707778931
Epoch: 2, Step: 297, Rank: 6, loss = 0.4238569736480713
Epoch: 2, Step: 297, Rank: 2, loss = 1.5763825178146362
Epoch: 2, Step: 297, Rank: 1, loss = 1.9498515129089355 | |
Epoch: 2, Step: 297, Rank: 7, loss = 0.6322261691093445
Epoch: 2, Step: 297, Rank: 4, loss = 0.6909231543540955
Per-token loss scaled by world size: 0.0009234739118255675 | |
Epoch: 2, Step: 297, Rank: 5, loss = 0.8139268159866333 | |
[2024-06-27 16:48:01,213] [INFO] [logging.py:96:log_dist] [Rank 0] step=297, skipped=0, lr=[1.542857142857143e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:01,286] [INFO] [timer.py:260:stop] epoch=0/micro_step=297/global_step=297, RunningAvgSamplesPerSec=95.42083083549176, CurrSamplesPerSec=95.64260897387325, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.54576906320564 samples/s, lr: 1.542857142857143e-05, loss: 0.7698478698730469 cuda_mem_allocated: 22.284084796905518 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7051.0 batch_size: 88.0 total loss: 0.962773323059082 | |
Epoch 2: 17% 34/205 [00:36<03:01, 1.06s/it] total tokens: 2506 num samples: 14 num padding tokens: 196 - rank: 5 max len: 179 min len: 157 avg len: 165.0 num_loss_counted_tokens: 834 | |
total tokens: 2220 num samples: 6 num padding tokens: 178 - rank: 1 max len: 370 min len: 314 avg len: 340.3333333333333 num_loss_counted_tokens: 1179 | |
total tokens: 2510 num samples: 10 num padding tokens: 147 - rank: 3 max len: 251 min len: 214 avg len: 236.3 num_loss_counted_tokens: 1142 | |
total tokens: 2472 num samples: 8 num padding tokens: 244 - rank: 2 max len: 309 min len: 256 avg len: 278.5 num_loss_counted_tokens: 851 | |
total tokens: 2440 num samples: 20 num padding tokens: 364 - rank: 7 max len: 122 min len: 80 avg len: 103.8 num_loss_counted_tokens: 547 | |
total tokens: 2512 num samples: 16 num padding tokens: 307 - rank: 6 max len: 157 min len: 122 avg len: 137.8125 num_loss_counted_tokens: 866 | |
total tokens: 2343 num samples: 11 num padding tokens: 157 - rank: 4 max len: 213 min len: 182 avg len: 198.72727272727272 num_loss_counted_tokens: 814 | |
total tokens: 2216 num samples: 4 num padding tokens: 377 - rank: 0 max len: 554 min len: 376 avg len: 459.75 num_loss_counted_tokens: 643 | |
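[editor's note] The per-step "throughput: ... samples/s, lr: ..., loss: ..." lines above are easy to scrape for plotting learning-rate and loss curves. A minimal sketch of such a scraper (the regex and function names are this note's own, not part of the training tooling):

```python
import re

# Matches the "throughput: ... samples/s, lr: ..., loss: ..." lines in this log.
THROUGHPUT_RE = re.compile(
    r"throughput: (?P<samples_per_sec>[\d.]+) samples/s, "
    r"lr: (?P<lr>[\d.e+-]+), "
    r"loss: (?P<loss>[\d.]+)"
)


def parse_throughput(line):
    """Return (samples/s, lr, loss) as floats, or None if the line doesn't match."""
    m = THROUGHPUT_RE.search(line)
    if m is None:
        return None
    return float(m["samples_per_sec"]), float(m["lr"]), float(m["loss"])


# A throughput line copied from step 297 above.
line = ("throughput: 95.54576906320564 samples/s, lr: 1.542857142857143e-05, "
        "loss: 0.7698478698730469 cuda_mem_allocated: 22.284084796905518 GB")
print(parse_throughput(line))
# (95.54576906320564, 1.542857142857143e-05, 0.7698478698730469)
```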
Per-token loss scaled by world size: 0.0009050446096807718
Per-token loss scaled by world size: 0.0013295498210936785
Per-token loss scaled by world size: 0.0015197137836366892
Per-token loss scaled by world size: 0.0019819317385554314
Per-token loss scaled by world size: 0.0012411342468112707
Per-token loss scaled by world size: 0.0007818713202141225 | |
Per-token loss scaled by world size: 0.0002189004298998043 | |
Epoch: 2, Step: 298, Rank: 0, loss = 1.1455669403076172
Epoch: 2, Step: 298, Rank: 4, loss = 0.8353561758995056
Epoch: 2, Step: 298, Rank: 3, loss = 1.2271745204925537
Epoch: 2, Step: 298, Rank: 2, loss = 1.4026957750320435 | |
Epoch: 2, Step: 298, Rank: 5, loss = 0.7216672301292419
Epoch: 2, Step: 298, Rank: 1, loss = 1.8293230533599854
Per-token loss scaled by world size: 0.0007212876225821674
Epoch: 2, Step: 298, Rank: 7, loss = 0.20204509794712067 | |
Epoch: 2, Step: 298, Rank: 6, loss = 0.6657484769821167 | |
[2024-06-27 16:48:02,279] [INFO] [logging.py:96:log_dist] [Rank 0] step=298, skipped=0, lr=[1.548051948051948e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:02,352] [INFO] [timer.py:260:stop] epoch=0/micro_step=298/global_step=298, RunningAvgSamplesPerSec=95.4191105998448, CurrSamplesPerSec=94.91433474452417, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 94.81166241334215 samples/s, lr: 1.548051948051948e-05, loss: 1.1455669403076172 cuda_mem_allocated: 22.30221176147461 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7384.0 batch_size: 83.0 total loss: 1.003697156906128 | |
Epoch 2: 17% 35/205 [00:37<03:00, 1.06s/it] total tokens: 2340 num samples: 12 num padding tokens: 185 - rank: 5 max len: 195 min len: 158 avg len: 179.58333333333334 num_loss_counted_tokens: 892 | |
total tokens: 2380 num samples: 7 num padding tokens: 111 - rank: 2 max len: 340 min len: 306 avg len: 324.14285714285717 num_loss_counted_tokens: 1317 | |
total tokens: 2448 num samples: 6 num padding tokens: 149 - rank: 1 max len: 408 min len: 340 avg len: 383.1666666666667 num_loss_counted_tokens: 1384 | |
total tokens: 2330 num samples: 10 num padding tokens: 154 - rank: 4 max len: 233 min len: 196 avg len: 217.6 num_loss_counted_tokens: 1005 | |
total tokens: 2286 num samples: 18 num padding tokens: 242 - rank: 7 max len: 127 min len: 97 avg len: 113.55555555555556 num_loss_counted_tokens: 615 | |
total tokens: 2384 num samples: 8 num padding tokens: 245 - rank: 3 max len: 298 min len: 234 avg len: 267.375 num_loss_counted_tokens: 1254 | |
total tokens: 2480 num samples: 16 num padding tokens: 184 - rank: 6 max len: 155 min len: 128 avg len: 143.5 num_loss_counted_tokens: 883 | |
total tokens: 2390 num samples: 5 num padding tokens: 153 - rank: 0 max len: 478 min len: 418 avg len: 447.4 num_loss_counted_tokens: 1345 | |
Per-token loss scaled by world size: 0.0007788612856529653
Per-token loss scaled by world size: 0.0006974542629905045
Per-token loss scaled by world size: 0.0011387639679014683
Per-token loss scaled by world size: 0.0008575004176236689
Per-token loss scaled by world size: 0.0014541540294885635
Per-token loss scaled by world size: 0.00033362515387125313
Per-token loss scaled by world size: 0.0008516140514984727
Epoch: 2, Step: 299, Rank: 2, loss = 1.4770569801330566
Epoch: 2, Step: 299, Rank: 1, loss = 0.7911283373832703
Epoch: 2, Step: 299, Rank: 3, loss = 0.7084391713142395 | |
Per-token loss scaled by world size: 0.0006415222305804491
Epoch: 2, Step: 299, Rank: 4, loss = 1.156699538230896
Epoch: 2, Step: 299, Rank: 0, loss = 0.8710060715675354
Epoch: 2, Step: 299, Rank: 7, loss = 0.3388797640800476
Epoch: 2, Step: 299, Rank: 5, loss = 0.8650269508361816
Epoch: 2, Step: 299, Rank: 6, loss = 0.6516262292861938 | |
[2024-06-27 16:48:03,348] [INFO] [logging.py:96:log_dist] [Rank 0] step=299, skipped=0, lr=[1.5532467532467534e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:03,422] [INFO] [timer.py:260:stop] epoch=0/micro_step=299/global_step=299, RunningAvgSamplesPerSec=95.41610032144962, CurrSamplesPerSec=94.53332951430937, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 94.44246525604929 samples/s, lr: 1.5532467532467534e-05, loss: 0.8710060715675354 cuda_mem_allocated: 22.258561611175537 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8126.0 batch_size: 78.0 total loss: 0.85748291015625 | |
Epoch 2: 18% 36/205 [00:38<02:59, 1.06s/it] total tokens: 2505 num samples: 15 num padding tokens: 266 - rank: 6 max len: 167 min len: 132 avg len: 149.26666666666668 num_loss_counted_tokens: 835 | |
total tokens: 2344 num samples: 8 num padding tokens: 263 - rank: 3 max len: 293 min len: 235 avg len: 260.125 num_loss_counted_tokens: 956 | |
total tokens: 2382 num samples: 6 num padding tokens: 80 - rank: 1 max len: 397 min len: 368 avg len: 383.6666666666667 num_loss_counted_tokens: 938 | |
total tokens: 2508 num samples: 19 num padding tokens: 457 - rank: 7 max len: 132 min len: 77 avg len: 107.94736842105263 num_loss_counted_tokens: 551 | |
total tokens: 2184 num samples: 6 num padding tokens: 195 - rank: 2 max len: 364 min len: 297 avg len: 331.5 num_loss_counted_tokens: 1090 | |
total tokens: 2376 num samples: 12 num padding tokens: 181 - rank: 5 max len: 198 min len: 171 avg len: 182.91666666666666 num_loss_counted_tokens: 865 | |
total tokens: 2453 num samples: 11 num padding tokens: 170 - rank: 4 max len: 223 min len: 198 avg len: 207.54545454545453 num_loss_counted_tokens: 752 | |
total tokens: 2244 num samples: 3 num padding tokens: 477 - rank: 0 max len: 748 min len: 414 avg len: 589.0 num_loss_counted_tokens: 1139 | |
Per-token loss scaled by world size: 0.0011401452356949449
Per-token loss scaled by world size: 0.0009421664290130138
Per-token loss scaled by world size: 0.0005126545438542962
Per-token loss scaled by world size: 0.002689735498279333
Per-token loss scaled by world size: 0.0008318047039210796
Per-token loss scaled by world size: 0.00058185268426314
Per-token loss scaled by world size: 0.0009142343769781291
Epoch: 2, Step: 300, Rank: 7, loss = 0.4852275252342224
Epoch: 2, Step: 300, Rank: 1, loss = 0.891760528087616
Epoch: 2, Step: 300, Rank: 3, loss = 1.079147458076477
Epoch: 2, Step: 300, Rank: 0, loss = 2.545834541320801
Epoch: 2, Step: 300, Rank: 6, loss = 0.7873031497001648
Per-token loss scaled by world size: 0.0006929872906766832 | |
Epoch: 2, Step: 300, Rank: 5, loss = 0.5507235527038574 | |
Epoch: 2, Step: 300, Rank: 2, loss = 0.8653228282928467 | |
Epoch: 2, Step: 300, Rank: 4, loss = 0.655912458896637 | |
[2024-06-27 16:48:04,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=0, lr=[1.5584415584415587e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:04,485] [INFO] [timer.py:260:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=95.41496320978666, CurrSamplesPerSec=95.07843619700597, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 94.98602856564564 samples/s, lr: 1.5584415584415587e-05, loss: 2.545834541320801 cuda_mem_allocated: 22.267388343811035 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7572.0 batch_size: 89.0 total loss: 0.9826540350914001 | |
Epoch 2: 18% 37/205 [00:39<02:58, 1.06s/it] total tokens: 2480 num samples: 16 num padding tokens: 216 - rank: 6 max len: 155 min len: 131 avg len: 141.5 num_loss_counted_tokens: 731 | |
total tokens: 2320 num samples: 10 num padding tokens: 129 - rank: 4 max len: 232 min len: 204 avg len: 219.1 num_loss_counted_tokens: 972 | |
total tokens: 2376 num samples: 12 num padding tokens: 206 - rank: 5 max len: 198 min len: 158 avg len: 180.83333333333334 num_loss_counted_tokens: 936 | |
total tokens: 2148 num samples: 4 num padding tokens: 383 - rank: 0 max len: 537 min len: 381 avg len: 441.25 num_loss_counted_tokens: 1353 | |
total tokens: 2456 num samples: 8 num padding tokens: 160 - rank: 2 max len: 307 min len: 265 avg len: 287.0 num_loss_counted_tokens: 1107 | |
total tokens: 2499 num samples: 7 num padding tokens: 223 - rank: 1 max len: 357 min len: 311 avg len: 325.14285714285717 num_loss_counted_tokens: 1028 | |
total tokens: 2322 num samples: 9 num padding tokens: 117 - rank: 3 max len: 258 min len: 232 avg len: 245.0 num_loss_counted_tokens: 1018 | |
total tokens: 2489 num samples: 19 num padding tokens: 291 - rank: 7 max len: 131 min len: 97 avg len: 115.6842105263158 num_loss_counted_tokens: 661 | |
Per-token loss scaled by world size: 0.0007819042657501996
Per-token loss scaled by world size: 0.0005994371022097766
Per-token loss scaled by world size: 0.001317274640314281
Per-token loss scaled by world size: 0.0005795535398647189
Per-token loss scaled by world size: 0.0008233979460783303
Per-token loss scaled by world size: 0.0008473636116832495
Per-token loss scaled by world size: 0.0009141117334365845
Epoch: 2, Step: 301, Rank: 1, loss = 1.242684006690979
Epoch: 2, Step: 301, Rank: 0, loss = 0.7376289367675781
Epoch: 2, Step: 301, Rank: 7, loss = 0.5654940009117126
Epoch: 2, Step: 301, Rank: 5, loss = 0.8623501658439636
Epoch: 2, Step: 301, Rank: 2, loss = 0.5467362999916077
Epoch: 2, Step: 301, Rank: 6, loss = 0.7767730355262756 | |
Epoch: 2, Step: 301, Rank: 4, loss = 0.799381673336029
Per-token loss scaled by world size: 0.0014153033262118697
Epoch: 2, Step: 301, Rank: 3, loss = 1.335161805152893 | |
[2024-06-27 16:48:05,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=301, skipped=0, lr=[1.563636363636364e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:05,551] [INFO] [timer.py:260:stop] epoch=0/micro_step=301/global_step=301, RunningAvgSamplesPerSec=95.41077685403648, CurrSamplesPerSec=94.17939764489046, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.09323580959503 samples/s, lr: 1.563636363636364e-05, loss: 0.7376289367675781 cuda_mem_allocated: 22.277524948120117 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7547.0 batch_size: 83.0 total loss: 0.8582762479782104 | |
Epoch 2: 19% 38/205 [00:40<02:57, 1.06s/it] total tokens: 2442 num samples: 11 num padding tokens: 142 - rank: 4 max len: 222 min len: 197 avg len: 209.0909090909091 num_loss_counted_tokens: 1086 | |
total tokens: 2460 num samples: 15 num padding tokens: 294 - rank: 6 max len: 164 min len: 127 avg len: 144.4 num_loss_counted_tokens: 689 | |
total tokens: 2358 num samples: 9 num padding tokens: 178 - rank: 3 max len: 262 min len: 226 avg len: 242.22222222222223 num_loss_counted_tokens: 1028 | |
total tokens: 2352 num samples: 12 num padding tokens: 181 - rank: 5 max len: 196 min len: 167 avg len: 180.91666666666666 num_loss_counted_tokens: 963 | |
total tokens: 2440 num samples: 8 num padding tokens: 118 - rank: 2 max len: 305 min len: 272 avg len: 290.25 num_loss_counted_tokens: 1187 | |
total tokens: 2262 num samples: 6 num padding tokens: 203 - rank: 1 max len: 377 min len: 309 avg len: 343.1666666666667 num_loss_counted_tokens: 903 | |
total tokens: 2520 num samples: 20 num padding tokens: 369 - rank: 7 max len: 126 min len: 82 avg len: 107.55 num_loss_counted_tokens: 642 | |
total tokens: 2370 num samples: 5 num padding tokens: 201 - rank: 0 max len: 474 min len: 381 avg len: 433.8 num_loss_counted_tokens: 982 | |
Per-token loss scaled by world size: 0.0007968831341713667
Per-token loss scaled by world size: 0.0012804120779037476
Per-token loss scaled by world size: 0.0022493137512356043
Per-token loss scaled by world size: 0.0013419609749689698
Per-token loss scaled by world size: 0.001211515162140131
Per-token loss scaled by world size: 0.0008408754365518689
Epoch: 2, Step: 302, Rank: 4, loss = 0.7858263850212097
Per-token loss scaled by world size: 0.0007203117711469531
Epoch: 2, Step: 302, Rank: 1, loss = 1.2626463174819946
Epoch: 2, Step: 302, Rank: 3, loss = 1.3233412504196167
Epoch: 2, Step: 302, Rank: 0, loss = 2.218104600906372
Epoch: 2, Step: 302, Rank: 5, loss = 0.8292083144187927
Epoch: 2, Step: 302, Rank: 2, loss = 1.1947053670883179 | |
Epoch: 2, Step: 302, Rank: 6, loss = 0.7103174328804016
Per-token loss scaled by world size: 0.0002573397068772465
Epoch: 2, Step: 302, Rank: 7, loss = 0.2537691295146942 | |
[2024-06-27 16:48:06,540] [INFO] [logging.py:96:log_dist] [Rank 0] step=302, skipped=0, lr=[1.568831168831169e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:06,614] [INFO] [timer.py:260:stop] epoch=0/micro_step=302/global_step=302, RunningAvgSamplesPerSec=95.41083183203483, CurrSamplesPerSec=95.42727309569409, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.33615026364328 samples/s, lr: 1.568831168831169e-05, loss: 2.218104600906372 cuda_mem_allocated: 22.259278774261475 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7889.0 batch_size: 84.0 total loss: 1.072239875793457 | |
Epoch 2: 19% 39/205 [00:41<02:56, 1.06s/it] total tokens: 2500 num samples: 4 num padding tokens: 400 - rank: 1 max len: 625 min len: 447 avg len: 525.0 num_loss_counted_tokens: 1454 | |
total tokens: 2286 num samples: 6 num padding tokens: 132 - rank: 2 max len: 381 min len: 341 avg len: 359.0 num_loss_counted_tokens: 1365 | |
total tokens: 2380 num samples: 7 num padding tokens: 83 - rank: 3 max len: 340 min len: 311 avg len: 328.14285714285717 num_loss_counted_tokens: 905 | |
total tokens: 2400 num samples: 10 num padding tokens: 304 - rank: 5 max len: 240 min len: 185 avg len: 209.6 num_loss_counted_tokens: 867 | |
total tokens: 2520 num samples: 14 num padding tokens: 394 - rank: 6 max len: 180 min len: 130 avg len: 151.85714285714286 num_loss_counted_tokens: 887 | |
total tokens: 2376 num samples: 8 num padding tokens: 230 - rank: 4 max len: 297 min len: 248 avg len: 268.25 num_loss_counted_tokens: 727 | |
total tokens: 1902 num samples: 2 num padding tokens: 242 - rank: 0 max len: 951 min len: 709 avg len: 830.0 num_loss_counted_tokens: 1433 | |
total tokens: 2322 num samples: 18 num padding tokens: 325 - rank: 7 max len: 129 min len: 89 avg len: 110.94444444444444 num_loss_counted_tokens: 533 | |
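[editor's note] Each per-rank line above describes one micro-batch padded to its longest sample: total tokens = num samples × max len, and num padding tokens = total tokens − sum of the raw sample lengths (e.g. rank 1 at the top of this block: 4 × 625 = 2500). A sketch of that arithmetic with a hypothetical list of sample lengths, since the individual lengths themselves are not printed in the log:

```python
def batch_stats(lengths):
    """Padding stats for a batch padded to its longest sample, as in the log lines."""
    n = len(lengths)
    max_len = max(lengths)
    total_tokens = n * max_len             # every sample is padded up to max_len
    padding = total_tokens - sum(lengths)  # pad tokens added across the batch
    return {
        "total tokens": total_tokens,
        "num samples": n,
        "num padding tokens": padding,
        "max len": max_len,
        "min len": min(lengths),
        "avg len": sum(lengths) / n,
    }


# Hypothetical per-sample lengths for illustration only.
print(batch_stats([129, 120, 110, 95, 89]))
```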
Per-token loss scaled by world size: 0.0015582763589918613
Per-token loss scaled by world size: 0.00041789308306761086
Per-token loss scaled by world size: 0.00038243673043325543
Per-token loss scaled by world size: 0.0013358573196455836
Per-token loss scaled by world size: 0.0009659580537118018
Per-token loss scaled by world size: 0.0009595631272532046
Per-token loss scaled by world size: 0.0007859747856855392
Epoch: 2, Step: 303, Rank: 0, loss = 1.3457664251327515
Epoch: 2, Step: 303, Rank: 7, loss = 0.36090290546417236
Epoch: 2, Step: 303, Rank: 4, loss = 0.33028191328048706
Epoch: 2, Step: 303, Rank: 3, loss = 0.6787874698638916
Epoch: 2, Step: 303, Rank: 5, loss = 0.8287026882171631
Epoch: 2, Step: 303, Rank: 2, loss = 1.1536797285079956
Epoch: 2, Step: 303, Rank: 1, loss = 0.8342255353927612 | |
Per-token loss scaled by world size: 0.0008459270466119051 | |
Epoch: 2, Step: 303, Rank: 6, loss = 0.730563759803772 | |
[2024-06-27 16:48:07,611] [INFO] [logging.py:96:log_dist] [Rank 0] step=303, skipped=0, lr=[1.5740259740259742e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:07,684] [INFO] [timer.py:260:stop] epoch=0/micro_step=303/global_step=303, RunningAvgSamplesPerSec=95.40824827615921, CurrSamplesPerSec=94.63944782614215, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.55006687606138 samples/s, lr: 1.5740259740259742e-05, loss: 1.3457664251327515 cuda_mem_allocated: 22.287065505981445 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6909.0 batch_size: 87.0 total loss: 0.7828637361526489 | |
Epoch 2: 20% 40/205 [00:42<02:55, 1.07s/it] total tokens: 2412 num samples: 9 num padding tokens: 142 - rank: 3 max len: 268 min len: 239 avg len: 252.22222222222223 num_loss_counted_tokens: 702 | |
total tokens: 2340 num samples: 10 num padding tokens: 216 - rank: 4 max len: 234 min len: 189 avg len: 212.4 num_loss_counted_tokens: 967 | |
total tokens: 2317 num samples: 7 num padding tokens: 136 - rank: 2 max len: 331 min len: 274 avg len: 311.57142857142856 num_loss_counted_tokens: 1263 | |
total tokens: 2268 num samples: 6 num padding tokens: 174 - rank: 1 max len: 378 min len: 336 avg len: 349.0 num_loss_counted_tokens: 1202 | |
total tokens: 2418 num samples: 13 num padding tokens: 197 - rank: 5 max len: 186 min len: 157 avg len: 170.84615384615384 num_loss_counted_tokens: 893 | |
total tokens: 2512 num samples: 16 num padding tokens: 348 - rank: 6 max len: 157 min len: 121 avg len: 135.25 num_loss_counted_tokens: 710 | |
total tokens: 2400 num samples: 20 num padding tokens: 249 - rank: 7 max len: 120 min len: 85 avg len: 107.55 num_loss_counted_tokens: 593 | |
total tokens: 2345 num samples: 5 num padding tokens: 181 - rank: 0 max len: 469 min len: 395 avg len: 432.8 num_loss_counted_tokens: 1501 | |
Per-token loss scaled by world size: 0.0008358680061064661 | |
Per-token loss scaled by world size: 0.0007072513690218329
Per-token loss scaled by world size: 0.0013240152038633823
Per-token loss scaled by world size: 0.000749254715628922
Per-token loss scaled by world size: 0.0015199531335383654
Per-token loss scaled by world size: 0.0010602911934256554
Per-token loss scaled by world size: 0.0005942516145296395
Epoch: 2, Step: 304, Rank: 5, loss = 0.8048363924026489 | |
Epoch: 2, Step: 304, Rank: 0, loss = 0.6809946894645691
Epoch: 2, Step: 304, Rank: 4, loss = 0.7214386463165283
Epoch: 2, Step: 304, Rank: 1, loss = 1.4635248184204102
Epoch: 2, Step: 304, Rank: 3, loss = 1.2748610973358154
Epoch: 2, Step: 304, Rank: 2, loss = 1.020927906036377
Epoch: 2, Step: 304, Rank: 7, loss = 0.5721900463104248
Per-token loss scaled by world size: 0.0007343433098867536
Epoch: 2, Step: 304, Rank: 6, loss = 0.7070808410644531 | |
[2024-06-27 16:48:08,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=304, skipped=0, lr=[1.5792207792207795e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:08,744] [INFO] [timer.py:260:stop] epoch=0/micro_step=304/global_step=304, RunningAvgSamplesPerSec=95.40838709026909, CurrSamplesPerSec=95.45018850463107, MemAllocated=22.24GB, MaxMemAllocated=28.61GB | |
throughput: 95.35649263287223 samples/s, lr: 1.5792207792207795e-05, loss: 0.6809946894645691 cuda_mem_allocated: 22.235901832580566 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7703.0 batch_size: 69.0 total loss: 0.9057318568229675 | |
Epoch 2: 20% 41/205 [00:43<02:54, 1.06s/it] total tokens: 2408 num samples: 7 num padding tokens: 256 - rank: 3 max len: 344 min len: 283 avg len: 307.42857142857144 num_loss_counted_tokens: 1411 | |
total tokens: 2496 num samples: 13 num padding tokens: 334 - rank: 6 max len: 192 min len: 142 avg len: 166.30769230769232 num_loss_counted_tokens: 856 | |
total tokens: 2140 num samples: 4 num padding tokens: 196 - rank: 1 max len: 535 min len: 448 avg len: 486.0 num_loss_counted_tokens: 1163 | |
total tokens: 2350 num samples: 10 num padding tokens: 252 - rank: 5 max len: 235 min len: 194 avg len: 209.8 num_loss_counted_tokens: 613 | |
total tokens: 2120 num samples: 5 num padding tokens: 223 - rank: 2 max len: 424 min len: 350 avg len: 379.4 num_loss_counted_tokens: 1025 | |
total tokens: 2256 num samples: 8 num padding tokens: 202 - rank: 4 max len: 282 min len: 241 avg len: 256.75 num_loss_counted_tokens: 821 | |
total tokens: 1754 num samples: 2 num padding tokens: 317 - rank: 0 max len: 877 min len: 560 avg len: 718.5 num_loss_counted_tokens: 196 | |
total tokens: 2397 num samples: 17 num padding tokens: 395 - rank: 7 max len: 141 min len: 76 avg len: 117.76470588235294 num_loss_counted_tokens: 613 | |
Per-token loss scaled by world size: 0.0005981854628771544
Per-token loss scaled by world size: 0.0009157305466942489
Per-token loss scaled by world size: 0.0008547082543373108
Per-token loss scaled by world size: 0.0013625366846099496
Per-token loss scaled by world size: 0.0007589785964228213
Per-token loss scaled by world size: 0.0006682123639620841
Per-token loss scaled by world size: 0.0006512838299386203
Epoch: 2, Step: 305, Rank: 7, loss = 0.7932760715484619
Epoch: 2, Step: 305, Rank: 2, loss = 0.5551908612251282
Epoch: 2, Step: 305, Rank: 1, loss = 0.8499124050140381
Epoch: 2, Step: 305, Rank: 3, loss = 1.2646043300628662
Epoch: 2, Step: 305, Rank: 4, loss = 0.620184600353241
Epoch: 2, Step: 305, Rank: 6, loss = 0.7044270038604736 | |
Epoch: 2, Step: 305, Rank: 5, loss = 0.604472815990448 | |
Per-token loss scaled by world size: 0.001651280326768756 | |
Epoch: 2, Step: 305, Rank: 0, loss = 1.5325945615768433 | |
[2024-06-27 16:48:09,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=305, skipped=0, lr=[1.5844155844155847e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:09,805] [INFO] [timer.py:260:stop] epoch=0/micro_step=305/global_step=305, RunningAvgSamplesPerSec=95.40825773405372, CurrSamplesPerSec=95.369208199058, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.26971989527867 samples/s, lr: 1.5844155844155847e-05, loss: 1.5325945615768433 cuda_mem_allocated: 22.312707901000977 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7425.0 batch_size: 85.0 total loss: 0.8655828237533569 | |
Epoch 2: 20% 42/205 [00:44<02:53, 1.06s/it]
total tokens: 2400 num samples: 15 num padding tokens: 241 - rank: 6 max len: 160 min len: 127 avg len: 143.93333333333334 num_loss_counted_tokens: 706
total tokens: 2376 num samples: 12 num padding tokens: 161 - rank: 5 max len: 198 min len: 162 avg len: 184.58333333333334 num_loss_counted_tokens: 999 | |
total tokens: 2480 num samples: 10 num padding tokens: 226 - rank: 4 max len: 248 min len: 206 avg len: 225.4 num_loss_counted_tokens: 776 | |
total tokens: 2368 num samples: 8 num padding tokens: 205 - rank: 3 max len: 296 min len: 258 avg len: 270.375 num_loss_counted_tokens: 1071 | |
total tokens: 2401 num samples: 7 num padding tokens: 178 - rank: 2 max len: 343 min len: 297 avg len: 317.57142857142856 num_loss_counted_tokens: 1049 | |
total tokens: 2272 num samples: 4 num padding tokens: 439 - rank: 1 max len: 568 min len: 381 avg len: 458.25 num_loss_counted_tokens: 1198 | |
total tokens: 2083 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2083 min len: 2083 avg len: 2083.0 num_loss_counted_tokens: 124 | |
total tokens: 2440 num samples: 20 num padding tokens: 313 - rank: 7 max len: 122 min len: 87 avg len: 106.35 num_loss_counted_tokens: 614 | |
Per-token loss scaled by world size: 0.0006858584238216281
Per-token loss scaled by world size: 0.0007981790695339441
Per-token loss scaled by world size: 0.0011973580112680793
Per-token loss scaled by world size: 0.0010389272356405854
Per-token loss scaled by world size: 0.0006842307629995048
Per-token loss scaled by world size: 0.0005397179629653692
Per-token loss scaled by world size: 0.0003174866724293679
Epoch: 2, Step: 306, Rank: 4, loss = 0.7132070064544678
Epoch: 2, Step: 306, Rank: 1, loss = 0.83000648021698
Epoch: 2, Step: 306, Rank: 2, loss = 1.2451026439666748
Epoch: 2, Step: 306, Rank: 0, loss = 1.0803544521331787
Epoch: 2, Step: 306, Rank: 3, loss = 0.7115144729614258
Per-token loss scaled by world size: 0.0003103260532952845
Epoch: 2, Step: 306, Rank: 5, loss = 0.5612392425537109
Epoch: 2, Step: 306, Rank: 6, loss = 0.330146461725235 | |
Epoch: 2, Step: 306, Rank: 7, loss = 0.32270029187202454 | |
[2024-06-27 16:48:10,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=306, skipped=0, lr=[1.5896103896103897e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:10,874] [INFO] [timer.py:260:stop] epoch=0/micro_step=306/global_step=306, RunningAvgSamplesPerSec=95.40566767690781, CurrSamplesPerSec=94.62730415017292, MemAllocated=22.32GB, MaxMemAllocated=28.61GB | |
throughput: 94.52223373533232 samples/s, lr: 1.5896103896103897e-05, loss: 1.0803544521331787 cuda_mem_allocated: 22.31509256362915 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8319.0 batch_size: 80.0 total loss: 0.7242838740348816 | |
Epoch 2: 21% 43/205 [00:46<02:52, 1.07s/it]
total tokens: 2490 num samples: 15 num padding tokens: 189 - rank: 6 max len: 166 min len: 135 avg len: 153.4 num_loss_counted_tokens: 830
total tokens: 2349 num samples: 9 num padding tokens: 306 - rank: 4 max len: 261 min len: 198 avg len: 227.0 num_loss_counted_tokens: 894 | |
total tokens: 2170 num samples: 5 num padding tokens: 157 - rank: 1 max len: 434 min len: 371 avg len: 402.6 num_loss_counted_tokens: 814 | |
total tokens: 2488 num samples: 8 num padding tokens: 147 - rank: 3 max len: 311 min len: 261 avg len: 292.625 num_loss_counted_tokens: 1237 | |
total tokens: 2364 num samples: 12 num padding tokens: 199 - rank: 5 max len: 197 min len: 167 avg len: 180.41666666666666 num_loss_counted_tokens: 899 | |
total tokens: 2513 num samples: 7 num padding tokens: 201 - rank: 2 max len: 359 min len: 311 avg len: 330.2857142857143 num_loss_counted_tokens: 1082 | |
total tokens: 1722 num samples: 2 num padding tokens: 404 - rank: 0 max len: 861 min len: 457 avg len: 659.0 num_loss_counted_tokens: 327 | |
total tokens: 2412 num samples: 18 num padding tokens: 339 - rank: 7 max len: 134 min len: 101 avg len: 115.16666666666667 num_loss_counted_tokens: 600 | |
Per-token loss scaled by world size: 0.0012250123545527458
Per-token loss scaled by world size: 0.0007220285478979349
Per-token loss scaled by world size: 0.0009128263918682933
Per-token loss scaled by world size: 0.0010389359667897224
Per-token loss scaled by world size: 0.0019298852421343327
Per-token loss scaled by world size: 0.0008810173603706062
Per-token loss scaled by world size: 0.0005279642064124346
Epoch: 2, Step: 307, Rank: 5, loss = 0.6048794388771057
Epoch: 2, Step: 307, Rank: 0, loss = 1.616761326789856
Epoch: 2, Step: 307, Rank: 3, loss = 0.7647203207015991
Epoch: 2, Step: 307, Rank: 4, loss = 1.0262540578842163
Epoch: 2, Step: 307, Rank: 6, loss = 0.7380722761154175
Epoch: 2, Step: 307, Rank: 2, loss = 0.8703685998916626
Epoch: 2, Step: 307, Rank: 1, loss = 0.44230201840400696 | |
Per-token loss scaled by world size: 0.0005681946640834212 | |
Epoch: 2, Step: 307, Rank: 7, loss = 0.47600507736206055 | |
[2024-06-27 16:48:11,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=307, skipped=0, lr=[1.594805194805195e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:11,930] [INFO] [timer.py:260:stop] epoch=0/micro_step=307/global_step=307, RunningAvgSamplesPerSec=95.40771427699173, CurrSamplesPerSec=96.03397808634762, MemAllocated=22.26GB, MaxMemAllocated=28.61GB | |
throughput: 95.93728160234603 samples/s, lr: 1.594805194805195e-05, loss: 1.616761326789856 cuda_mem_allocated: 22.26357126235962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6702.0 batch_size: 77.0 total loss: 0.8174204230308533 | |
Epoch 2: 21% 44/205 [00:47<02:51, 1.06s/it]
total tokens: 2483 num samples: 13 num padding tokens: 135 - rank: 5 max len: 191 min len: 163 avg len: 180.6153846153846 num_loss_counted_tokens: 956
total tokens: 2409 num samples: 11 num padding tokens: 142 - rank: 4 max len: 219 min len: 193 avg len: 206.0909090909091 num_loss_counted_tokens: 1155 | |
total tokens: 2440 num samples: 10 num padding tokens: 91 - rank: 3 max len: 244 min len: 221 avg len: 234.9 num_loss_counted_tokens: 997 | |
total tokens: 2324 num samples: 7 num padding tokens: 191 - rank: 1 max len: 332 min len: 291 avg len: 304.7142857142857 num_loss_counted_tokens: 822 | |
total tokens: 2296 num samples: 8 num padding tokens: 149 - rank: 2 max len: 287 min len: 249 avg len: 268.375 num_loss_counted_tokens: 986 | |
total tokens: 2430 num samples: 15 num padding tokens: 134 - rank: 6 max len: 162 min len: 135 avg len: 153.06666666666666 num_loss_counted_tokens: 926 | |
total tokens: 2412 num samples: 18 num padding tokens: 348 - rank: 7 max len: 134 min len: 88 avg len: 114.66666666666667 num_loss_counted_tokens: 663 | |
total tokens: 2450 num samples: 5 num padding tokens: 469 - rank: 0 max len: 490 min len: 339 avg len: 396.2 num_loss_counted_tokens: 863 | |
Per-token loss scaled by world size: 0.0013803952606394887
Per-token loss scaled by world size: 0.0016792705282568932
Per-token loss scaled by world size: 0.0010632964549586177
Per-token loss scaled by world size: 0.00040462068864144385
Per-token loss scaled by world size: 0.001047288067638874
Per-token loss scaled by world size: 0.001035152468830347
Per-token loss scaled by world size: 0.0010909936390817165
Epoch: 2, Step: 308, Rank: 1, loss = 1.1383084058761597
Epoch: 2, Step: 308, Rank: 4, loss = 0.8768208026885986
Epoch: 2, Step: 308, Rank: 0, loss = 1.3847684860229492
Epoch: 2, Step: 308, Rank: 5, loss = 0.8536126017570496
Epoch: 2, Step: 308, Rank: 2, loss = 0.8636199235916138
Epoch: 2, Step: 308, Rank: 3, loss = 0.899660587310791 | |
Epoch: 2, Step: 308, Rank: 7, loss = 0.3336603343486786 | |
Per-token loss scaled by world size: 0.0006241354858502746 | |
Epoch: 2, Step: 308, Rank: 6, loss = 0.5146777033805847 | |
[2024-06-27 16:48:12,915] [INFO] [logging.py:96:log_dist] [Rank 0] step=308, skipped=0, lr=[1.6000000000000003e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:12,989] [INFO] [timer.py:260:stop] epoch=0/micro_step=308/global_step=308, RunningAvgSamplesPerSec=95.40833160478537, CurrSamplesPerSec=95.59699011345242, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.4945122810213 samples/s, lr: 1.6000000000000003e-05, loss: 1.3847684860229492 cuda_mem_allocated: 22.2802677154541 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6597.0 batch_size: 80.0 total loss: 0.8581410050392151 | |
Epoch 2: 22% 45/205 [00:48<02:49, 1.06s/it]
total tokens: 2425 num samples: 5 num padding tokens: 375 - rank: 1 max len: 485 min len: 364 avg len: 410.0 num_loss_counted_tokens: 1562
total tokens: 2450 num samples: 10 num padding tokens: 259 - rank: 4 max len: 245 min len: 198 avg len: 219.1 num_loss_counted_tokens: 572 | |
total tokens: 2505 num samples: 15 num padding tokens: 248 - rank: 6 max len: 167 min len: 137 avg len: 150.46666666666667 num_loss_counted_tokens: 976 | |
total tokens: 2376 num samples: 8 num padding tokens: 171 - rank: 3 max len: 297 min len: 253 avg len: 275.625 num_loss_counted_tokens: 1160 | |
total tokens: 2133 num samples: 3 num padding tokens: 197 - rank: 0 max len: 711 min len: 611 avg len: 645.3333333333334 num_loss_counted_tokens: 773 | |
total tokens: 2312 num samples: 17 num padding tokens: 389 - rank: 7 max len: 136 min len: 84 avg len: 113.11764705882354 num_loss_counted_tokens: 607 | |
total tokens: 2520 num samples: 7 num padding tokens: 214 - rank: 2 max len: 360 min len: 299 avg len: 329.42857142857144 num_loss_counted_tokens: 1146 | |
total tokens: 2483 num samples: 13 num padding tokens: 172 - rank: 5 max len: 191 min len: 168 avg len: 177.76923076923077 num_loss_counted_tokens: 917 | |
Per-token loss scaled by world size: 0.0004817172302864492
Per-token loss scaled by world size: 0.0008429336594417691
Per-token loss scaled by world size: 0.0011539822444319725
Per-token loss scaled by world size: 0.0004813241248484701
Per-token loss scaled by world size: 0.0010001367190852761
Per-token loss scaled by world size: 0.0009342778939753771
Per-token loss scaled by world size: 0.0010572049068287015
Epoch: 2, Step: 309, Rank: 1, loss = 0.9178112745285034
Epoch: 2, Step: 309, Rank: 6, loss = 0.828076958656311
Epoch: 2, Step: 309, Rank: 0, loss = 1.1336432695388794
Epoch: 2, Step: 309, Rank: 3, loss = 0.47322696447372437
Epoch: 2, Step: 309, Rank: 2, loss = 0.9825093150138855
Epoch: 2, Step: 309, Rank: 7, loss = 0.4728407859802246
Epoch: 2, Step: 309, Rank: 4, loss = 1.0385717153549194 | |
Per-token loss scaled by world size: 0.001068569952622056 | |
Epoch: 2, Step: 309, Rank: 5, loss = 1.0497363805770874 | |
[2024-06-27 16:48:13,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=309, skipped=0, lr=[1.6051948051948056e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:14,053] [INFO] [timer.py:260:stop] epoch=0/micro_step=309/global_step=309, RunningAvgSamplesPerSec=95.4056518167604, CurrSamplesPerSec=94.59264732103948, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 94.49494920100594 samples/s, lr: 1.6051948051948056e-05, loss: 1.1336432695388794 cuda_mem_allocated: 22.309129238128662 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7859.0 batch_size: 85.0 total loss: 0.8620520830154419 | |
Epoch 2: 22% 46/205 [00:49<02:48, 1.06s/it]
total tokens: 2397 num samples: 17 num padding tokens: 175 - rank: 6 max len: 141 min len: 109 avg len: 130.7058823529412 num_loss_counted_tokens: 738
total tokens: 2388 num samples: 12 num padding tokens: 121 - rank: 4 max len: 199 min len: 169 avg len: 188.91666666666666 num_loss_counted_tokens: 1029 | |
total tokens: 2387 num samples: 11 num padding tokens: 104 - rank: 3 max len: 217 min len: 200 avg len: 207.54545454545453 num_loss_counted_tokens: 952 | |
total tokens: 2520 num samples: 15 num padding tokens: 190 - rank: 5 max len: 168 min len: 144 avg len: 155.33333333333334 num_loss_counted_tokens: 946 | |
total tokens: 2504 num samples: 8 num padding tokens: 153 - rank: 1 max len: 313 min len: 278 avg len: 293.875 num_loss_counted_tokens: 1135 | |
total tokens: 2349 num samples: 9 num padding tokens: 166 - rank: 2 max len: 261 min len: 220 avg len: 242.55555555555554 num_loss_counted_tokens: 942 | |
total tokens: 1417 num samples: 13 num padding tokens: 170 - rank: 7 max len: 109 min len: 74 avg len: 95.92307692307692 num_loss_counted_tokens: 289 | |
total tokens: 2270 num samples: 5 num padding tokens: 282 - rank: 0 max len: 454 min len: 332 avg len: 397.6 num_loss_counted_tokens: 1372 | |
Per-token loss scaled by world size: 0.0013866230146959424
Per-token loss scaled by world size: 0.0009360495605506003
Per-token loss scaled by world size: 0.0012650219723582268
Per-token loss scaled by world size: 0.0011720292968675494
Per-token loss scaled by world size: 0.00037610018625855446
Per-token loss scaled by world size: 0.0007789427181705832
Per-token loss scaled by world size: 0.0005949810729362071
Epoch: 2, Step: 310, Rank: 4, loss = 0.8595275282859802
Epoch: 2, Step: 310, Rank: 7, loss = 0.34535399079322815
Epoch: 2, Step: 310, Rank: 3, loss = 1.1616064310073853
Epoch: 2, Step: 310, Rank: 2, loss = 1.2732665538787842
Epoch: 2, Step: 310, Rank: 5, loss = 0.7152641415596008
Epoch: 2, Step: 310, Rank: 1, loss = 1.0762158632278442
Per-token loss scaled by world size: 0.0005420686211436987 | |
Epoch: 2, Step: 310, Rank: 6, loss = 0.5463413596153259 | |
Epoch: 2, Step: 310, Rank: 0, loss = 0.4977545142173767 | |
[2024-06-27 16:48:15,042] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=0, lr=[1.6103896103896105e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:15,116] [INFO] [timer.py:260:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=95.40558629254117, CurrSamplesPerSec=95.38547461153202, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.28714749477126 samples/s, lr: 1.6103896103896105e-05, loss: 0.4977545142173767 cuda_mem_allocated: 22.30507516860962 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7346.0 batch_size: 78.0 total loss: 0.8094162344932556 | |
Epoch 2: 23% 47/205 [00:50<02:47, 1.06s/it]
total tokens: 2530 num samples: 10 num padding tokens: 212 - rank: 4 max len: 253 min len: 222 avg len: 231.8 num_loss_counted_tokens: 772
total tokens: 2480 num samples: 8 num padding tokens: 192 - rank: 3 max len: 310 min len: 264 avg len: 286.0 num_loss_counted_tokens: 1088 | |
total tokens: 2512 num samples: 16 num padding tokens: 223 - rank: 6 max len: 157 min len: 130 avg len: 143.0625 num_loss_counted_tokens: 854 | |
total tokens: 2285 num samples: 5 num padding tokens: 227 - rank: 1 max len: 457 min len: 361 avg len: 411.6 num_loss_counted_tokens: 1273 | |
total tokens: 2451 num samples: 19 num padding tokens: 484 - rank: 7 max len: 129 min len: 79 avg len: 103.52631578947368 num_loss_counted_tokens: 473 | |
total tokens: 2520 num samples: 12 num padding tokens: 243 - rank: 5 max len: 210 min len: 168 avg len: 189.75 num_loss_counted_tokens: 960 | |
total tokens: 2253 num samples: 3 num padding tokens: 373 - rank: 0 max len: 751 min len: 459 avg len: 626.6666666666666 num_loss_counted_tokens: 1331 | |
total tokens: 2380 num samples: 7 num padding tokens: 80 - rank: 2 max len: 340 min len: 320 avg len: 328.57142857142856 num_loss_counted_tokens: 1217 | |
Per-token loss scaled by world size: 0.0008947865571826696
Per-token loss scaled by world size: 0.0005130222998559475
Per-token loss scaled by world size: 0.0010802049655467272
Per-token loss scaled by world size: 0.0005584873724728823
Per-token loss scaled by world size: 0.0008587805787101388
Per-token loss scaled by world size: 0.0025193241890519857
Per-token loss scaled by world size: 0.0005399688961915672
Epoch: 2, Step: 311, Rank: 1, loss = 2.246922254562378
Epoch: 2, Step: 311, Rank: 6, loss = 0.7980377674102783
Epoch: 2, Step: 311, Rank: 4, loss = 0.9634078145027161
Epoch: 2, Step: 311, Rank: 5, loss = 0.4981009364128113
Per-token loss scaled by world size: 0.0007233356591314077
Epoch: 2, Step: 311, Rank: 7, loss = 0.4575517773628235
Epoch: 2, Step: 311, Rank: 2, loss = 0.7659249305725098
Epoch: 2, Step: 311, Rank: 3, loss = 0.481584757566452
Epoch: 2, Step: 311, Rank: 0, loss = 0.6451249718666077 | |
[2024-06-27 16:48:16,101] [INFO] [logging.py:96:log_dist] [Rank 0] step=311, skipped=0, lr=[1.6155844155844158e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:16,174] [INFO] [timer.py:260:stop] epoch=0/micro_step=311/global_step=311, RunningAvgSamplesPerSec=95.40617536917306, CurrSamplesPerSec=95.58795779302172, MemAllocated=22.31GB, MaxMemAllocated=28.61GB | |
throughput: 95.49143228189635 samples/s, lr: 1.6155844155844158e-05, loss: 0.6451249718666077 cuda_mem_allocated: 22.307340621948242 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7135.0 batch_size: 79.0 total loss: 0.8570818901062012 | |
Epoch 2: 23% 48/205 [00:51<02:46, 1.06s/it]
total tokens: 2408 num samples: 14 num padding tokens: 238 - rank: 6 max len: 172 min len: 140 avg len: 155.0 num_loss_counted_tokens: 854
total tokens: 2310 num samples: 10 num padding tokens: 154 - rank: 4 max len: 231 min len: 204 avg len: 215.6 num_loss_counted_tokens: 990 | |
total tokens: 2430 num samples: 18 num padding tokens: 342 - rank: 7 max len: 135 min len: 89 avg len: 116.0 num_loss_counted_tokens: 582 | |
total tokens: 2412 num samples: 12 num padding tokens: 202 - rank: 5 max len: 201 min len: 172 avg len: 184.16666666666666 num_loss_counted_tokens: 910 | |
total tokens: 2328 num samples: 8 num padding tokens: 150 - rank: 2 max len: 291 min len: 261 avg len: 272.25 num_loss_counted_tokens: 756 | |
total tokens: 2349 num samples: 9 num padding tokens: 130 - rank: 3 max len: 261 min len: 231 avg len: 246.55555555555554 num_loss_counted_tokens: 917 | |
total tokens: 2331 num samples: 7 num padding tokens: 135 - rank: 1 max len: 333 min len: 292 avg len: 313.7142857142857 num_loss_counted_tokens: 1222 | |
total tokens: 2512 num samples: 4 num padding tokens: 594 - rank: 0 max len: 628 min len: 356 avg len: 479.5 num_loss_counted_tokens: 1067 | |
Per-token loss scaled by world size: 0.0005368334823288023
Per-token loss scaled by world size: 0.0007000562036409974
Per-token loss scaled by world size: 0.0014292638516053557
Per-token loss scaled by world size: 0.0011468883603811264
Per-token loss scaled by world size: 0.0007655913941562176
Per-token loss scaled by world size: 0.0006295014172792435
Per-token loss scaled by world size: 0.0009079938172362745
Epoch: 2, Step: 312, Rank: 4, loss = 0.649039626121521
Epoch: 2, Step: 312, Rank: 6, loss = 0.497711718082428
Epoch: 2, Step: 312, Rank: 3, loss = 1.3251062631607056
Epoch: 2, Step: 312, Rank: 2, loss = 1.063308835029602
Epoch: 2, Step: 312, Rank: 7, loss = 0.5836265087127686
Epoch: 2, Step: 312, Rank: 0, loss = 0.7097989320755005
Epoch: 2, Step: 312, Rank: 5, loss = 0.8418237566947937
Per-token loss scaled by world size: 0.0014762092614546418 | |
Epoch: 2, Step: 312, Rank: 1, loss = 1.3686305284500122 | |
[2024-06-27 16:48:17,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=312, skipped=0, lr=[1.620779220779221e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:17,240] [INFO] [timer.py:260:stop] epoch=0/micro_step=312/global_step=312, RunningAvgSamplesPerSec=95.40437364557457, CurrSamplesPerSec=94.8508814124837, MemAllocated=22.22GB, MaxMemAllocated=28.61GB | |
throughput: 94.75753277010331 samples/s, lr: 1.620779220779221e-05, loss: 0.7097989320755005 cuda_mem_allocated: 22.22039794921875 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7417.0 batch_size: 68.0 total loss: 0.8798807859420776 | |
Saving model in huggingface format at samples_seen: 29952 | |
Model saved in /instructlab/training_output/hf_format/samples_29952 | |
[16:48:37] INFO saving took 19.874022722244263 seconds utils.py:192 | |
Epoch 2: 24% 49/205 [01:12<18:16, 7.03s/it]
total tokens: 2520 num samples: 9 num padding tokens: 189 - rank: 3 max len: 280 min len: 237 avg len: 259.0 num_loss_counted_tokens: 623
total tokens: 2120 num samples: 4 num padding tokens: 364 - rank: 1 max len: 530 min len: 385 avg len: 439.0 num_loss_counted_tokens: 1199 | |
total tokens: 2360 num samples: 10 num padding tokens: 107 - rank: 4 max len: 236 min len: 208 avg len: 225.3 num_loss_counted_tokens: 916 | |
total tokens: 2472 num samples: 12 num padding tokens: 242 - rank: 5 max len: 206 min len: 161 avg len: 185.83333333333334 num_loss_counted_tokens: 777 | |
total tokens: 2262 num samples: 6 num padding tokens: 292 - rank: 2 max len: 377 min len: 302 avg len: 328.3333333333333 num_loss_counted_tokens: 1048 | |
total tokens: 2415 num samples: 15 num padding tokens: 197 - rank: 6 max len: 161 min len: 128 avg len: 147.86666666666667 num_loss_counted_tokens: 786 | |
total tokens: 2028 num samples: 2 num padding tokens: 319 - rank: 0 max len: 1014 min len: 695 avg len: 854.5 num_loss_counted_tokens: 679 | |
total tokens: 2500 num samples: 20 num padding tokens: 324 - rank: 7 max len: 125 min len: 89 avg len: 108.8 num_loss_counted_tokens: 659 | |
Per-token loss scaled by world size: 0.001112502533942461
Per-token loss scaled by world size: 0.0011856609489768744
Per-token loss scaled by world size: 0.0013278903206810355
Per-token loss scaled by world size: 0.0007259754929691553
Per-token loss scaled by world size: 0.001241364050656557
Per-token loss scaled by world size: 0.0011601358419284225
Per-token loss scaled by world size: 0.0005688659148290753
Epoch: 2, Step: 313, Rank: 6, loss = 1.0000007152557373
Epoch: 2, Step: 313, Rank: 4, loss = 1.1936074495315552
Epoch: 2, Step: 313, Rank: 5, loss = 0.6525612473487854
Epoch: 2, Step: 313, Rank: 3, loss = 1.0657609701156616
Epoch: 2, Step: 313, Rank: 1, loss = 1.1158311367034912
Epoch: 2, Step: 313, Rank: 7, loss = 0.5113393664360046
Epoch: 2, Step: 313, Rank: 2, loss = 1.0428171157836914
Per-token loss scaled by world size: 0.0005915443762205541 | |
Epoch: 2, Step: 313, Rank: 0, loss = 0.5317244529724121 | |
[2024-06-27 16:48:38,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=313, skipped=0, lr=[1.6259740259740264e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:38,184] [INFO] [timer.py:260:stop] epoch=0/micro_step=313/global_step=313, RunningAvgSamplesPerSec=95.40211775734151, CurrSamplesPerSec=94.7078976565361, MemAllocated=22.29GB, MaxMemAllocated=28.61GB | |
throughput: 94.60478210293266 samples/s, lr: 1.6259740259740264e-05, loss: 0.5317244529724121 cuda_mem_allocated: 22.28933095932007 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7191.0 batch_size: 81.0 total loss: 0.8892052173614502 | |
Epoch 2: 24% 50/205 [01:13<13:31, 5.24s/it]
total tokens: 2036 num samples: 4 num padding tokens: 236 - rank: 1 max len: 509 min len: 375 avg len: 450.0 num_loss_counted_tokens: 706
total tokens: 2522 num samples: 13 num padding tokens: 173 - rank: 5 max len: 194 min len: 165 avg len: 180.69230769230768 num_loss_counted_tokens: 1082 | |
total tokens: 2310 num samples: 10 num padding tokens: 192 - rank: 4 max len: 231 min len: 196 avg len: 211.8 num_loss_counted_tokens: 1068 | |
total tokens: 2445 num samples: 15 num padding tokens: 228 - rank: 6 max len: 163 min len: 132 avg len: 147.8 num_loss_counted_tokens: 906 | |
total tokens: 2220 num samples: 6 num padding tokens: 240 - rank: 2 max len: 370 min len: 281 avg len: 330.0 num_loss_counted_tokens: 1144 | |
total tokens: 2466 num samples: 9 num padding tokens: 86 - rank: 3 max len: 274 min len: 251 avg len: 264.44444444444446 num_loss_counted_tokens: 873 | |
total tokens: 2024 num samples: 2 num padding tokens: 470 - rank: 0 max len: 1012 min len: 542 avg len: 777.0 num_loss_counted_tokens: 136 | |
total tokens: 2470 num samples: 19 num padding tokens: 389 - rank: 7 max len: 130 min len: 85 avg len: 109.52631578947368 num_loss_counted_tokens: 570 | |
Per-token loss scaled by world size: 0.0008398248464800417
Per-token loss scaled by world size: 0.0005964859155938029
Per-token loss scaled by world size: 0.0015459820860996842
Per-token loss scaled by world size: 0.0007641934789717197
Per-token loss scaled by world size: 0.0009938504081219435
Per-token loss scaled by world size: 0.0012760799145326018
Per-token loss scaled by world size: 0.0008122684666886926
Epoch: 2, Step: 314, Rank: 3, loss = 1.0967906713485718
Epoch: 2, Step: 314, Rank: 2, loss = 0.5126796364784241
Epoch: 2, Step: 314, Rank: 0, loss = 0.7218294739723206
Epoch: 2, Step: 314, Rank: 5, loss = 0.6568242907524109
Epoch: 2, Step: 314, Rank: 6, loss = 0.8542144298553467
Epoch: 2, Step: 314, Rank: 4, loss = 0.6981447339057922
Epoch: 2, Step: 314, Rank: 1, loss = 1.3287715911865234
Per-token loss scaled by world size: 0.0006225727265700698 | |
Epoch: 2, Step: 314, Rank: 7, loss = 0.5351012349128723 | |
[2024-06-27 16:48:39,171] [INFO] [logging.py:96:log_dist] [Rank 0] step=314, skipped=0, lr=[1.6311688311688313e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:39,244] [INFO] [timer.py:260:stop] epoch=0/micro_step=314/global_step=314, RunningAvgSamplesPerSec=95.40210896053867, CurrSamplesPerSec=95.39937323356183, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 95.29596517730968 samples/s, lr: 1.6311688311688313e-05, loss: 0.7218294739723206 cuda_mem_allocated: 22.278836727142334 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 6876.0 batch_size: 89.0 total loss: 0.8005445599555969 | |
Epoch 2: 25% 51/205 [01:14<10:13, 3.98s/it]
total tokens: 2376 num samples: 9 num padding tokens: 220 - rank: 3 max len: 264 min len: 230 avg len: 239.55555555555554 num_loss_counted_tokens: 932
total tokens: 2400 num samples: 15 num padding tokens: 228 - rank: 6 max len: 160 min len: 129 avg len: 144.8 num_loss_counted_tokens: 853 | |
total tokens: 2317 num samples: 7 num padding tokens: 229 - rank: 2 max len: 331 min len: 267 avg len: 298.2857142857143 num_loss_counted_tokens: 1115 | |
total tokens: 2358 num samples: 6 num padding tokens: 215 - rank: 1 max len: 393 min len: 336 avg len: 357.1666666666667 num_loss_counted_tokens: 1396 | |
total tokens: 2453 num samples: 11 num padding tokens: 201 - rank: 4 max len: 223 min len: 195 avg len: 204.72727272727272 num_loss_counted_tokens: 954 | |
total tokens: 2522 num samples: 13 num padding tokens: 194 - rank: 5 max len: 194 min len: 160 avg len: 179.07692307692307 num_loss_counted_tokens: 958 | |
total tokens: 2313 num samples: 3 num padding tokens: 646 - rank: 0 max len: 771 min len: 439 avg len: 555.6666666666666 num_loss_counted_tokens: 489 | |
total tokens: 2432 num samples: 19 num padding tokens: 380 - rank: 7 max len: 128 min len: 87 avg len: 108.0 num_loss_counted_tokens: 530 | |
Per-token loss scaled by world size: 0.0007060010102577507
Per-token loss scaled by world size: 0.0009470549994148314
Per-token loss scaled by world size: 0.001305989339016378
Per-token loss scaled by world size: 0.0006362065323628485
Per-token loss scaled by world size: 0.00045280082849785686
Per-token loss scaled by world size: 0.001140908687375486
Per-token loss scaled by world size: 0.0011430153390392661
Epoch: 2, Step: 315, Rank: 6, loss = 0.7673348784446716
Epoch: 2, Step: 315, Rank: 5, loss = 1.0293303728103638
Epoch: 2, Step: 315, Rank: 2, loss = 1.4194471836090088
Epoch: 2, Step: 315, Rank: 7, loss = 0.4921379089355469
Epoch: 2, Step: 315, Rank: 4, loss = 0.6914770007133484
Epoch: 2, Step: 315, Rank: 3, loss = 1.2400251626968384
Epoch: 2, Step: 315, Rank: 1, loss = 1.2423148155212402
Per-token loss scaled by world size: 0.0011745044030249119 | |
Epoch: 2, Step: 315, Rank: 0, loss = 1.2765394449234009 | |
[2024-06-27 16:48:40,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=315, skipped=0, lr=[1.6363636363636366e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:40,303] [INFO] [timer.py:260:stop] epoch=0/micro_step=315/global_step=315, RunningAvgSamplesPerSec=95.40256948935068, CurrSamplesPerSec=95.54647190353217, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.44840102376578 samples/s, lr: 1.6363636363636366e-05, loss: 1.2765394449234009 cuda_mem_allocated: 22.299588680267334 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 8695.0 batch_size: 82.0 total loss: 1.01982581615448 | |
Epoch 2: 25% 52/205 [01:15<07:55, 3.11s/it]
total tokens: 2022 num samples: 1 num padding tokens: 0 - rank: 2 max len: 2022 min len: 2022 avg len: 2022.0 num_loss_counted_tokens: 40
total tokens: 2061 num samples: 1 num padding tokens: 0 - rank: 1 max len: 2061 min len: 2061 avg len: 2061.0 num_loss_counted_tokens: 61 | |
total tokens: 412 num samples: 4 num padding tokens: 48 - rank: 7 max len: 103 min len: 83 avg len: 91.0 num_loss_counted_tokens: 70 | |
total tokens: 2448 num samples: 9 num padding tokens: 367 - rank: 4 max len: 272 min len: 200 avg len: 231.22222222222223 num_loss_counted_tokens: 929 | |
total tokens: 2243 num samples: 1 num padding tokens: 0 - rank: 0 max len: 2243 min len: 2243 avg len: 2243.0 num_loss_counted_tokens: 24 | |
total tokens: 2455 num samples: 5 num padding tokens: 530 - rank: 3 max len: 491 min len: 319 avg len: 385.0 num_loss_counted_tokens: 986 | |
total tokens: 2448 num samples: 16 num padding tokens: 387 - rank: 6 max len: 153 min len: 106 avg len: 128.8125 num_loss_counted_tokens: 657 | |
total tokens: 2388 num samples: 12 num padding tokens: 249 - rank: 5 max len: 199 min len: 155 avg len: 178.25 num_loss_counted_tokens: 775 | |
Per-token loss scaled by world size: 0.0004990228335373104
Per-token loss scaled by world size: 0.0008345713722519577
Per-token loss scaled by world size: 0.0011897372314706445
Per-token loss scaled by world size: 0.0004333140095695853
Per-token loss scaled by world size: 0.0013512754812836647
Per-token loss scaled by world size: 0.0019989940337836742
Per-token loss scaled by world size: 0.0009285626001656055
Epoch: 2, Step: 316, Rank: 0, loss = 1.7806040048599243
Epoch: 2, Step: 316, Rank: 5, loss = 0.7433944344520569
Epoch: 2, Step: 316, Rank: 1, loss = 1.0597584247589111
Epoch: 2, Step: 316, Rank: 7, loss = 0.3859744668006897
Epoch: 2, Step: 316, Rank: 4, loss = 0.44450458884239197
Epoch: 2, Step: 316, Rank: 3, loss = 0.8271171450614929
Epoch: 2, Step: 316, Rank: 2, loss = 1.2036486864089966
Per-token loss scaled by world size: 0.0009428098564967513 | |
Epoch: 2, Step: 316, Rank: 6, loss = 0.8398078680038452 | |
[2024-06-27 16:48:41,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=316, skipped=0, lr=[1.641558441558442e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:41,364] [INFO] [timer.py:260:stop] epoch=0/micro_step=316/global_step=316, RunningAvgSamplesPerSec=95.40091002449776, CurrSamplesPerSec=94.88431910287957, MemAllocated=22.28GB, MaxMemAllocated=28.61GB | |
throughput: 94.79318087756627 samples/s, lr: 1.641558441558442e-05, loss: 1.7806040048599243 cuda_mem_allocated: 22.282176971435547 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7126.0 batch_size: 80.0 total loss: 0.9106011986732483 | |
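The per-rank losses printed for step 316 can be reproduced from the "Per-token loss scaled by world size" values: each one appears to be multiplied by the step's global num_loss_counted_tokens (7126) and divided by the world size (8), and "total loss" is the mean over the eight ranks. A minimal sketch under that assumption — the rank-to-value pairing is read off the step-316 lines above, and the formula itself is inferred, not stated anywhere in the log:

```python
# Reproduce the step-316 per-rank and total losses logged above.
# Assumed (inferred) formula:
#   rank_loss = per_token_scaled * num_loss_counted_tokens / world_size
WORLD_SIZE = 8
NUM_LOSS_COUNTED_TOKENS = 7126.0  # from the step-316 summary line

# rank -> "Per-token loss scaled by world size", paired via the Rank lines
per_token_scaled = {
    0: 0.0019989940337836742,
    1: 0.0011897372314706445,
    2: 0.0013512754812836647,
    3: 0.0009285626001656055,
    4: 0.0004990228335373104,
    5: 0.0008345713722519577,
    6: 0.0009428098564967513,
    7: 0.0004333140095695853,
}

rank_loss = {
    r: p * NUM_LOSS_COUNTED_TOKENS / WORLD_SIZE
    for r, p in per_token_scaled.items()
}
total_loss = sum(rank_loss.values()) / WORLD_SIZE

print(round(rank_loss[0], 4))  # 1.7806, the logged Rank 0 loss
print(round(total_loss, 4))    # 0.9106, the logged "total loss"
```

The same reduction reproduces the other steps in this log to within float precision.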
Epoch 2: 26% 53/205 [01:16<06:18, 2.49s/it] total tokens: 2488 num samples: 8 num padding tokens: 283 - rank: 3 max len: 311 min len: 253 avg len: 275.625 num_loss_counted_tokens: 903 | |
total tokens: 2490 num samples: 6 num padding tokens: 355 - rank: 2 max len: 415 min len: 315 avg len: 355.8333333333333 num_loss_counted_tokens: 1383 | |
total tokens: 1360 num samples: 1 num padding tokens: 0 - rank: 0 max len: 1360 min len: 1360 avg len: 1360.0 num_loss_counted_tokens: 78 | |
total tokens: 2520 num samples: 12 num padding tokens: 234 - rank: 5 max len: 210 min len: 162 avg len: 190.5 num_loss_counted_tokens: 897 | |
total tokens: 2430 num samples: 15 num padding tokens: 312 - rank: 6 max len: 162 min len: 125 avg len: 141.2 num_loss_counted_tokens: 736 | |
total tokens: 2480 num samples: 10 num padding tokens: 210 - rank: 4 max len: 248 min len: 212 avg len: 227.0 num_loss_counted_tokens: 996 | |
total tokens: 2250 num samples: 2 num padding tokens: 520 - rank: 1 max len: 1125 min len: 605 avg len: 865.0 num_loss_counted_tokens: 117 | |
total tokens: 2299 num samples: 19 num padding tokens: 273 - rank: 7 max len: 121 min len: 91 avg len: 106.63157894736842 num_loss_counted_tokens: 537 | |
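The batch-stat lines above are internally consistent: every sample in a rank's batch is padded to that batch's max len, so total tokens = num samples × max len, and num padding tokens = total tokens − num samples × avg len (avg len being the mean of the real, unpadded lengths). A quick check against the rank-3 line for this step, with values copied from the log:

```python
# Verify the padded-batch accounting for the rank-3 line above.
num_samples = 8
max_len = 311
avg_len = 275.625  # mean of the real (unpadded) sample lengths

total_tokens = num_samples * max_len                       # pad every sample to max_len
num_padding = total_tokens - round(num_samples * avg_len)  # padded minus real tokens

print(total_tokens, num_padding)  # 2488 283, as logged
```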
Per-token loss scaled by world size: 0.0008579285349696875 | |
Per-token loss scaled by world size: 0.0011185455368831754
Per-token loss scaled by world size: 0.0015236953040584922
Per-token loss scaled by world size: 0.0004698233096860349
Per-token loss scaled by world size: 0.0007846846710890532
Per-token loss scaled by world size: 0.000504550349432975
Per-token loss scaled by world size: 0.0008738836040720344
Epoch: 2, Step: 317, Rank: 3, loss = 0.8371237516403198 | |
Epoch: 2, Step: 317, Rank: 7, loss = 0.45843008160591125
Epoch: 2, Step: 317, Rank: 6, loss = 0.49231502413749695
Epoch: 2, Step: 317, Rank: 5, loss = 0.8526919484138489
Epoch: 2, Step: 317, Rank: 4, loss = 0.765656054019928
Epoch: 2, Step: 317, Rank: 2, loss = 1.0914207696914673
Epoch: 2, Step: 317, Rank: 0, loss = 1.4867457151412964
Per-token loss scaled by world size: 0.0008304865914396942 | |
Epoch: 2, Step: 317, Rank: 1, loss = 0.810347318649292 | |
[2024-06-27 16:48:42,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=317, skipped=0, lr=[1.646753246753247e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:42,423] [INFO] [timer.py:260:stop] epoch=0/micro_step=317/global_step=317, RunningAvgSamplesPerSec=95.39750901543039, CurrSamplesPerSec=94.34145173394539, MemAllocated=22.27GB, MaxMemAllocated=28.61GB | |
throughput: 94.25179414063479 samples/s, lr: 1.646753246753247e-05, loss: 1.4867457151412964 cuda_mem_allocated: 22.270727157592773 GB cuda_malloc_retries: 0 num_loss_counted_tokens: 7806.0 batch_size: 85.0 total loss: 0.8493412733078003 | |
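The logged lr climbs by a constant 5.1948e-08 per step (1.63636e-05 → 1.64156e-05 → 1.64675e-05 across steps 315-317), which matches a linear warmup schedule. A sketch assuming a peak lr of 2e-5 reached at step 385 — both numbers are inferred from the per-step delta (5.1948e-08 ≈ 2e-5 / 385) and do not appear anywhere in this log:

```python
# Linear-warmup schedule implied by the logged lr values.
# PEAK_LR and WARMUP_STEPS are inferred assumptions, not stated in the log.
PEAK_LR = 2e-5
WARMUP_STEPS = 385

def warmup_lr(step: int) -> float:
    """Ramp the lr linearly from 0 to PEAK_LR over WARMUP_STEPS steps."""
    return PEAK_LR * min(step, WARMUP_STEPS) / WARMUP_STEPS

print(warmup_lr(317))  # ~1.6467532e-05, as logged for step 317 above
```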
Epoch 2: 26% 54/205 [01:17<05:11, 2.06s/it] total tokens: 2534 num samples: 14 num padding tokens: 118 - rank: 5 max len: 181 min len: 158 avg len: 172.57142857142858 num_loss_counted_tokens: 1072 | |
total tokens: 2448 num samples: 16 num padding tokens: 234 - rank: 6 max len: 153 min len: 114 avg len: 138.375 num_loss_counted_tokens: 898 | |
total tokens: 2387 num samples: 11 num padding tokens: 240 - rank: 4 max len: 217 min len: 183 avg len: 195.1818181818182 num_loss_counted_tokens: 896 | |
total tokens: 2421 num samples: 9 num padding tokens: 179 - rank: 3 max len: 269 min len: 233 avg len: 249.11111111111111 num_loss_counted_tokens: 1114 | |
total tokens: 2148 num samples: 4 num padding tokens: 343 - rank: 1 max len: 537 min len: 357 avg len: 451.25 num_loss_counted_tokens: 892 | |
total tokens: 2492 num samples: 7 num padding tokens: 267 - rank: 2 max len: 356 min len: 280 avg len: 317.85714285714283 num_loss_counted_tokens: 860 | |
total tokens: 1853 num samples: 17 num padding tokens: 170 - rank: 7 max len: 109 min len: 79 avg len: 99.0 num_loss_counted_tokens: 378 | |
total tokens: 1784 num samples: 2 num padding tokens: 225 - rank: 0 max len: 892 min len: 667 avg len: 779.5 num_loss_counted_tokens: 201 | |
Per-token loss scaled by world size: 0.0010360616724938154
Per-token loss scaled by world size: 0.0006307634175755084
Per-token loss scaled by world size: 0.0011333598522469401
Per-token loss scaled by world size: 0.0005975920357741416
Per-token loss scaled by world size: 0.001067737932316959
Per-token loss scaled by world size: 0.0013249184703454375
Per-token loss scaled by world size: 0.0008524349541403353 | |
Epoch: 2, Step: 318, Rank: 7, loss = 0.5587485432624817 | |
Epoch: 2, Step: 318, Rank: 6, loss = 0.5897638201713562
Epoch: 2, Step: 318, Rank: 4, loss = 0.9687176942825317
Epoch: 2, Step: 318, Rank: 2, loss = 1.0596914291381836 | |
Epoch: 2, Step: 318, Rank: 3, loss = 1.23879873752594
Epoch: 2, Step: 318, Rank: 5, loss = 0.9983349442481995
Epoch: 2, Step: 318, Rank: 1, loss = 0.7970266938209534 | |
Per-token loss scaled by world size: 0.0015107771614566445 | |
Epoch: 2, Step: 318, Rank: 0, loss = 1.412576675415039 | |
[2024-06-27 16:48:43,407] [INFO] [logging.py:96:log_dist] [Rank 0] step=318, skipped=0, lr=[1.651948051948052e-05], mom=[(0.9, 0.95)] | |
[2024-06-27 16:48:43,481] [INFO] [timer.py:260:stop] epoch=0/micro_step=318/global_step=318, RunningAvgSamplesPerSec=95.39863042664024, CurrSamplesPerSec=95.7531920054505, MemAllocated=22.3GB, MaxMemAllocated=28.61GB | |
throughput: 95.66680985489837 sampl