David Berard davidberard98

  • PyTorch
  • Menlo Park, CA
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: -2
NCCL_DESYNC_DEBUG: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
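The ProcessGroupNCCL options above are typically driven by environment variables set before launch. A hedged sketch of an environment that would produce logs like these (variable names are from the PyTorch distributed docs; the exact values used in this run are assumptions):

```shell
# Assumed launch environment for a run that logs
# "The debug level is set to DETAIL" and enables desync debugging.
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # c10d debug level -> DETAIL
export NCCL_ASYNC_ERROR_HANDLING=1      # abort on async NCCL errors
export NCCL_DEBUG=TRACE                 # verbose NCCL tracing, as seen later in the log
```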
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
@davidberard98
davidberard98 / 67434_0_log.out
Last active September 23, 2022 00:49
hf_T5, 2 nodes, dynamo+inductor, verbose=True, log_level=DEBUG; functorch..debug_graphs is FALSE.
(This file has been truncated.)
submitit INFO (2022-09-22 18:42:53,293) - Starting with JobEnvironment(job_id=67434, hostname=a100-st-p4d24xlarge-3, local_rank=0(8), node=0(2), global_rank=0(16))
submitit INFO (2022-09-22 18:42:53,294) - Loading pickle: /fsx/users/dberard/scratch-local/bench-fast/benchmark/logs/67434_submitted.pkl
Process group: 16 tasks, rank: 0
MY HOSTNAME: a100-st-p4d24xlarge-3
FI_PROVIDER : efa
LD_LIBRARY_PATH : /fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/fsx/users/dberard/scratch-local/bench-fast/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/path/to/aws-ofi-nccl:/opt/amazon/efa/lib:/usr/local/cuda-11.6/lib:/usr/local/cuda-11.6/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:
NCCL_DEBUG : TRACE
FI_EFA_USE_DEVICE_RDMA : 1
a100-st-p4d24xlarge-3:69371:69371 [0] NCCL INFO
import torch
import torchdynamo
import os
import logging
torchdynamo.config.verbose = True
torchdynamo.config.log_level = logging.DEBUG
def setup():
    os.environ["MASTER_ADDR"] = "localhost"
import torch
import torchdynamo
import argparse
import os
import logging
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
# torchdynamo.config.verbose = True
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
submitit ERROR (2022-10-03 23:44:18,682) - Submitted job triggered an exception
ERROR > Submitted job triggered an exception
Traceback (most recent call last):
File "/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/home/dberard/
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/data/home/dberard/miniconda/envs/bench-fast/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
[W kineto_shim.cpp:330] Profiler is not initialized: skipping step() invocation
STAGE:2022-11-01 01:39:13 3461:3461 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-11-01 01:39:14 3461:3461 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-11-01 01:39:16 3461:3461 output_json.cpp:417] C
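The "Warm Up" and "Collection" stages above come from the profiler's step scheduler, which also explains the earlier "Profiler is not initialized: skipping step()" warning. A pure-Python sketch of that wait/warmup/active logic — a reimplementation for illustration, mirroring `torch.profiler.schedule`, not torch's actual code:

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int) -> str:
    """Which phase a profiler step falls in (single cycle, no repeat)."""
    if step < wait:
        return "NONE"      # profiling off; step() is effectively skipped
    if step < wait + warmup:
        return "WARMUP"    # "Completed Stage: Warm Up"
    if step < wait + warmup + active:
        return "RECORD"    # "Completed Stage: Collection"
    return "NONE"

phases = [profiler_phase(s, wait=1, warmup=1, active=2) for s in range(5)]
# phases == ["NONE", "WARMUP", "RECORD", "RECORD", "NONE"]
```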
(This file has been truncated.)
WARNING:__main__:Sequence Length not defined for MobileBertForMaskedLM. Choosing 128 arbitrarily
[2022-11-07 20:06:13,575] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
cuda train MobileBertForMaskedLM [2022-11-07 20:06:15,061] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:16,693] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:18,063] torch._dynamo.testing: [WARNING] High loss value alert - 10.43. Can result in unstable gradients.
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,501] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /data/home/dberard/miniconda/envs/dynamo38/lib/python3.8/contextlib.py
[2022-11-07 20:06:19,506] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo s
Metric;111fe61602;7bc72f5a2f
nnc-dynamic:autogen-0;0.14310094044776633;0.14308511896524578
nnc-dynamic:autogen-1;0.11164433404337615;0.11165716196410358
nnc-dynamic:autogen-10;0.017939746397314594;0.017773296852828933
nnc-dynamic:autogen-11;0.02166838520206511;0.021501831093337385
nnc-dynamic:autogen-12;0.12938609847333285;0.12939167249714956
nnc-dynamic:autogen-13;1.8119537853635848;1.8118110403884202
nnc-dynamic:autogen-14;7.227453680243343;7.228049130644649
nnc-dynamic:autogen-15;0.023439827701076863;0.02318546730093658
nnc-dynamic:autogen-16;0.24199218105059117;0.2419714879943058
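The metrics above are semicolon-delimited, one row per benchmark with one timing column per commit hash. A small sketch of parsing that format and computing the relative change between the two columns (two sample rows copied from the table):

```python
import csv
import io

# Semicolon-delimited data: Metric;<commit A timing>;<commit B timing>
data = """\
Metric;111fe61602;7bc72f5a2f
nnc-dynamic:autogen-0;0.14310094044776633;0.14308511896524578
nnc-dynamic:autogen-14;7.227453680243343;7.228049130644649
"""

rows = list(csv.reader(io.StringIO(data), delimiter=";"))
header, body = rows[0], rows[1:]
for name, before, after in body:
    # Relative change of commit B vs. commit A for each benchmark
    rel = (float(after) - float(before)) / float(before)
    print(f"{name}: {rel:+.4%}")
```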
FullyShardedDataParallel(
  (_fsdp_wrapped_module): T5ForConditionalGeneration(
    (shared): Embedding(32128, 1024)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 1024)
      (block): ModuleList(
        (0): FullyShardedDataParallel(
          (_fsdp_wrapped_module): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(