David Berard davidberard98

  • PyTorch
  • Menlo Park, CA
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: -2
NCCL_DESYNC_DEBUG: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
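The ProcessGroupNCCL options above are typically driven by environment variables set before launch. A hedged sketch of an environment that would produce logs like these (variable names are from the PyTorch distributed docs; the exact values used in this run are assumptions):

```shell
# Assumed launch environment for a run that logs
# "The debug level is set to DETAIL" and enables desync debugging.
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # c10d debug level -> DETAIL
export NCCL_ASYNC_ERROR_HANDLING=1      # abort on async NCCL errors
export NCCL_DEBUG=TRACE                 # verbose NCCL tracing, as seen later in the log
```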
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
@davidberard98
davidberard98 / 67434_0_log.out
Last active September 23, 2022 00:49
hf_T5, 2 nodes, dynamo+inductor, verbose=True, log_level=DEBUG; functorch..debug_graphs is FALSE.
(This file has been truncated.)
submitit INFO (2022-09-22 18:42:53,293) - Starting with JobEnvironment(job_id=67434, hostname=a100-st-p4d24xlarge-3, local_rank=0(8), node=0(2), global_rank=0(16))
submitit INFO (2022-09-22 18:42:53,294) - Loading pickle: /fsx/users/dberard/scratch-local/bench-fast/benchmark/logs/67434_submitted.pkl
Process group: 16 tasks, rank: 0
MY HOSTNAME: a100-st-p4d24xlarge-3
FI_PROVIDER : efa
LD_LIBRARY_PATH : /fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/usr/local/cuda-11.6/lib:/usr/local/cuda-11.6/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:
NCCL_DEBUG : TRACE
FI_EFA_USE_DEVICE_RDMA : 1
a100-st-p4d24xlarge-3:69371:69371 [0] NCCL INFO
import torch
import torchdynamo
import os
import logging
torchdynamo.config.verbose = True
torchdynamo.config.log_level = logging.DEBUG
def setup():
    os.environ["MASTER_ADDR"] = "localhost"
import torch
import torchdynamo
import argparse
import os
import logging
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
# torchdynamo.config.verbose = True
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
submitit ERROR (2022-10-03 23:44:18,682) - Submitted job triggered an exception
ERROR > Submitted job triggered an exception
Traceback (most recent call last):
File "/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/home/dberard/
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
[W kineto_shim.cpp:330] Profiler is not initialized: skipping step() invocation
STAGE:2022-11-01 01:39:13 3461:3461 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-11-01 01:39:14 3461:3461 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-11-01 01:39:16 3461:3461 output_json.cpp:417] C
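The "Warm Up" and "Collection" stages above come from the profiler's step scheduler, which also explains the earlier "Profiler is not initialized: skipping step()" warning. A pure-Python sketch of that wait/warmup/active logic — a reimplementation for illustration, mirroring `torch.profiler.schedule`, not torch's actual code:

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int) -> str:
    """Which phase a profiler step falls in (single cycle, no repeat)."""
    if step < wait:
        return "NONE"      # profiling off; step() is effectively skipped
    if step < wait + warmup:
        return "WARMUP"    # "Completed Stage: Warm Up"
    if step < wait + warmup + active:
        return "RECORD"    # "Completed Stage: Collection"
    return "NONE"

phases = [profiler_phase(s, wait=1, warmup=1, active=2) for s in range(5)]
# phases == ["NONE", "WARMUP", "RECORD", "RECORD", "NONE"]
```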
(This file has been truncated.)
WARNING:__main__:Sequence Length not defined for MobileBertForMaskedLM. Choosing 128 arbitrarily
[2022-11-07 20:06:13,575] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
cuda train MobileBertForMaskedLM [2022-11-07 20:06:15,061] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:16,693] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:18,063] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,506] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo s
Metric;111fe61602;7bc72f5a2f
nnc-dynamic:autogen-0;0.14310094044776633;0.14308511896524578
nnc-dynamic:autogen-1;0.11164433404337615;0.11165716196410358
nnc-dynamic:autogen-10;0.017939746397314594;0.017773296852828933
nnc-dynamic:autogen-11;0.02166838520206511;0.021501831093337385
nnc-dynamic:autogen-12;0.12938609847333285;0.12939167249714956
nnc-dynamic:autogen-13;1.8119537853635848;1.8118110403884202
nnc-dynamic:autogen-14;7.227453680243343;7.228049130644649
nnc-dynamic:autogen-15;0.023439827701076863;0.02318546730093658
nnc-dynamic:autogen-16;0.24199218105059117;0.2419714879943058
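The metrics above are semicolon-delimited, one row per benchmark with one timing column per commit hash. A small sketch of parsing that format and computing the relative change between the two columns (two sample rows copied from the table):

```python
import csv
import io

# Semicolon-delimited data: Metric;<commit A timing>;<commit B timing>
data = """\
Metric;111fe61602;7bc72f5a2f
nnc-dynamic:autogen-0;0.14310094044776633;0.14308511896524578
nnc-dynamic:autogen-14;7.227453680243343;7.228049130644649
"""

rows = list(csv.reader(io.StringIO(data), delimiter=";"))
header, body = rows[0], rows[1:]
for name, before, after in body:
    # Relative change of commit B vs. commit A for each benchmark
    rel = (float(after) - float(before)) / float(before)
    print(f"{name}: {rel:+.4%}")
```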
FullyShardedDataParallel(
  (_fsdp_wrapped_module): T5ForConditionalGeneration(
    (shared): Embedding(32128, 1024)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 1024)
      (block): ModuleList(
        (0): FullyShardedDataParallel(
          (_fsdp_wrapped_module): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(