[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: -2
NCCL_DESYNC_DEBUG: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
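
As an aside, the two torchvision deprecation warnings above name their own fix; a minimal sketch of the suggested replacement for the deprecated `pretrained` argument (the constructor call itself is an assumption here, the weights enums are quoted from the warning text):

```python
# Sketch based on the torchvision warnings above: replace the deprecated
# `pretrained=True` argument with an explicit weights enum.
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # same weights as old pretrained=True
# or, per the warning, the most up-to-date weights:
model = resnet50(weights=ResNet50_Weights.DEFAULT)
```
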
[I ProcessGroupNCCL.cpp:1274] NCCL_DEBUG: TRACE
[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:213] [Rank 0]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 1
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 161
output_device: 0
rank: 0
total_parameter_size_bytes: 102228128
world_size: 16
backend_name: nccl
bucket_sizes: 102228128
cuda_visible_devices: 0,1,2,3,4,5,6,7
device_ids: 0
dtypes: float
master_addr: N/A
master_port: N/A
module_name: ResNet
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: TRACE
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: ens
torch_distributed_debug: DETAIL
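
For context, the initialization record above corresponds to a fairly standard NCCL DDP wrap of a ResNet. The sketch below shows a setup consistent with those values; it is not the benchmark's actual launcher, and the env-var rendezvous is an assumption (the log reports master_addr/master_port as N/A, so the real run, launched via submitit, evidently used a different init method):

```python
# Hedged sketch, not the benchmark's actual code: a DDP wrap consistent with the
# "DDP Initialized with:" record above. Assumes a launcher that sets LOCAL_RANK and
# the env:// rendezvous variables, plus TORCH_DISTRIBUTED_DEBUG=DETAIL, NCCL_DEBUG=TRACE,
# NCCL_SOCKET_IFNAME=ens and NCCL_DESYNC_DEBUG=1 in the environment (which account
# for the DETAIL / TRACE / desync-debug lines in the log).
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # world_size=16, i.e. likely 2 hosts x 8 visible GPUs

model = torchvision.models.resnet50().cuda(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],       # device_ids: 0 on rank 0
    broadcast_buffers=False,       # broadcast_buffers: 0
    find_unused_parameters=False,  # find_unused_parameters: 0
    gradient_as_bucket_view=True,  # gradient_as_bucket_view: 1
    bucket_cap_mb=25,              # bucket_cap_bytes: 26214400 (25 MiB)
)
```
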
[I logger.cpp:377] [Rank 0 / 16] [before iteration 1] Training ResNet unused_parameter_size=0
Avg forward compute time: 14610432
Avg backward compute time: 380300288
Avg backward comm. time: 127566848
Avg backward comm/comp overlap time: 21514240
[I reducer.cpp:1724] 5 buckets rebuilt with size limits: 1048576, 26214400, 26214400, 26214400, 26214400 bytes.
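
As a quick arithmetic check, the rebuilt bucket size limits above line up with the Reducer caps logged at initialization (1 MiB for the first bucket, 25 MiB for the rest, i.e. DDP's default bucket_cap_mb=25):

```python
# Arithmetic check (assumption: the limits above come from the Reducer caps logged earlier).
first_bucket_bytes_cap = 1 * 1024 * 1024  # 1 MiB  == 1048576 (first bucket)
bucket_bytes_cap = 25 * 1024 * 1024       # 25 MiB == 26214400 (default bucket_cap_mb=25)
assert first_bucket_bytes_cap == 1048576
assert bucket_bytes_cap == 26214400
```
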
[I logger.cpp:377] [Rank 0 / 16] [before iteration 2] Training ResNet unused_parameter_size=0
Avg forward compute time: 81418752
Avg backward compute time: 239553024
Avg backward comm. time: 128046592
Avg backward comm/comp overlap time: 58981888
[I logger.cpp:377] [Rank 0 / 16] [before iteration 3] Training ResNet unused_parameter_size=0
Avg forward compute time: 65623040
Avg backward compute time: 192718848
Avg backward comm. time: 128496984
Avg backward comm/comp overlap time: 71587157
[I logger.cpp:377] [Rank 0 / 16] [before iteration 4] Training ResNet unused_parameter_size=0
Avg forward compute time: 57681408
Avg backward compute time: 169098496
Avg backward comm. time: 125390338
Avg backward comm/comp overlap time: 77677311
[I logger.cpp:377] [Rank 0 / 16] [before iteration 5] Training ResNet unused_parameter_size=0
Avg forward compute time: 52917248
Avg backward compute time: 154995916
Avg backward comm. time: 126040270
Avg backward comm/comp overlap time: 81394892
[I logger.cpp:377] [Rank 0 / 16] [before iteration 6] Training ResNet unused_parameter_size=0
Avg forward compute time: 49737216
Avg backward compute time: 473799487
Avg backward comm. time: 452736513
Avg backward comm/comp overlap time: 412090708
[I logger.cpp:377] [Rank 0 / 16] [before iteration 7] Training ResNet unused_parameter_size=0
Avg forward compute time: 47497508
Avg backward compute time: 420205110
Avg backward comm. time: 407578771
Avg backward comm/comp overlap time: 366987994
[I logger.cpp:377] [Rank 0 / 16] [before iteration 8] Training ResNet unused_parameter_size=0
Avg forward compute time: 45786239
Avg backward compute time: 380060911
Avg backward comm. time: 371972352
Avg backward comm/comp overlap time: 333205502
[I logger.cpp:377] [Rank 0 / 16] [before iteration 9] Training ResNet unused_parameter_size=0
Avg forward compute time: 44459917
Avg backward compute time: 349120496
Avg backward comm. time: 345771121
Avg backward comm/comp overlap time: 307226508
[I logger.cpp:377] [Rank 0 / 16] [before iteration 10] Training ResNet unused_parameter_size=0
Avg forward compute time: 43406334
Avg backward compute time: 324065982
Avg backward comm. time: 323487845
Avg backward comm/comp overlap time: 286128945
[I ProcessGroupNCCL.cpp:837] [Rank 0] NCCL watchdog thread terminated normally
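
Reading the per-iteration statistics: the "Avg ..." values are raw counts with no units in the log; assuming they are nanoseconds (a plausible reading given the magnitudes, though not stated by the logger output), the iteration-10 averages convert as follows:

```python
# Assumption: the DDP logger's "Avg ..." values above are raw nanosecond counts.
# Converting the iteration-10 numbers from the log to milliseconds for readability
# (dict name and keys are illustrative, not a PyTorch API):
iteration_10_ns = {
    "forward_compute": 43406334,
    "backward_compute": 324065982,
    "backward_comm": 323487845,
    "backward_comm_comp_overlap": 286128945,
}
for name, ns in iteration_10_ns.items():
    print(f"{name}: {ns / 1e6:.1f} ms")
# -> forward_compute: 43.4 ms, backward_compute: 324.1 ms,
#    backward_comm: 323.5 ms, backward_comm_comp_overlap: 286.1 ms
```
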
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: -2
NCCL_DESYNC_DEBUG: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
[I ProcessGroupNCCL.cpp:1274] NCCL_DEBUG: TRACE
[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:213] [Rank 0]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 1
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 161
output_device: 0
rank: 0
total_parameter_size_bytes: 102228128
world_size: 16
backend_name: nccl
bucket_sizes: 102228128
cuda_visible_devices: 0,1,2,3,4,5,6,7
device_ids: 0
dtypes: float
master_addr: N/A
master_port: N/A
module_name: ResNet
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: TRACE
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: ens
torch_distributed_debug: DETAIL
[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:213] [Rank 0]: DDP Initialized with:
broadcast_buffers: 0
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 1
has_sync_bn: 0
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 161
output_device: 0
rank: 0
total_parameter_size_bytes: 102228128
world_size: 16
backend_name: nccl
bucket_sizes: 102228128
cuda_visible_devices: 0,1,2,3,4,5,6,7
device_ids: 0
dtypes: float
master_addr: N/A
master_port: N/A
module_name: ResNet
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: TRACE
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: ens
torch_distributed_debug: DETAIL
[I logger.cpp:377] [Rank 0 / 16] [before iteration 1] Training ResNet unused_parameter_size=0
Avg forward compute time: 202930176
Avg backward compute time: 0
Avg backward comm. time: 0
Avg backward comm/comp overlap time: 0
INFO:submitit:Job completed successfully