[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: -2
NCCL_DESYNC_DEBUG: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
submitit INFO (2022-09-22 18:42:53,293) - Starting with JobEnvironment(job_id=67434, hostname=a100-st-p4d24xlarge-3, local_rank=0(8), node=0(2), global_rank=0(16))
submitit INFO (2022-09-22 18:42:53,294) - Loading pickle: /fsx/users/dberard/scratch-local/bench-fast/benchmark/logs/67434_submitted.pkl
Process group: 16 tasks, rank: 0
MY HOSTNAME: a100-st-p4d24xlarge-3
FI_PROVIDER : efa
LD_LIBRARY_PATH : /fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/usr/local/cuda-11.6/lib:/usr/local/cuda-11.6/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:
NCCL_DEBUG : TRACE
FI_EFA_USE_DEVICE_RDMA : 1
a100-st-p4d24xlarge-3:69371:69371 [0] NCCL INFO
import torch
import torchdynamo
import os
import logging

torchdynamo.config.verbose = True
torchdynamo.config.log_level = logging.DEBUG


def setup():
    os.environ["MASTER_ADDR"] = "localhost"
import torch
import torchdynamo
import argparse
import os
import logging
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
# torchdynamo.config.verbose = True
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
submitit ERROR (2022-10-03 23:44:18,682) - Submitted job triggered an exception
ERROR > Submitted job triggered an exception
Traceback (most recent call last):
  File "/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/home/dberard/
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvisi
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
[W kineto_shim.cpp:330] Profiler is not initialized: skipping step() invocation
STAGE:2022-11-01 01:39:13 3461:3461 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-11-01 01:39:14 3461:3461 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-11-01 01:39:16 3461:3461 output_json.cpp:417] C
WARNING:__main__:Sequence Length not defined for MobileBertForMaskedLM. Choosing 128 arbitrarily
[2022-11-07 20:06:13,575] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
cuda train MobileBertForMaskedLM [2022-11-07 20:06:15,061] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:16,693] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:18,063] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,506] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo s
Metric;111fe61602;7bc72f5a2f
nnc-dynamic:autogen-0;0.14310094044776633;0.14308511896524578
nnc-dynamic:autogen-1;0.11164433404337615;0.11165716196410358
nnc-dynamic:autogen-10;0.017939746397314594;0.017773296852828933
nnc-dynamic:autogen-11;0.02166838520206511;0.021501831093337385
nnc-dynamic:autogen-12;0.12938609847333285;0.12939167249714956
nnc-dynamic:autogen-13;1.8119537853635848;1.8118110403884202
nnc-dynamic:autogen-14;7.227453680243343;7.228049130644649
nnc-dynamic:autogen-15;0.023439827701076863;0.02318546730093658
nnc-dynamic:autogen-16;0.24199218105059117;0.2419714879943058
FullyShardedDataParallel(
  (_fsdp_wrapped_module): T5ForConditionalGeneration(
    (shared): Embedding(32128, 1024)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 1024)
      (block): ModuleList(
        (0): FullyShardedDataParallel(
          (_fsdp_wrapped_module): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(