
@crypdick
Created February 7, 2025 19:39
Error logs from running `tune_mnist_asha(num_samples=num_samples)` in tune-pytorch-lightning.ipynb.
(RayTrainWorker pid=43596) Setting up process group for: env:// [rank=0, world_size=3]
(RayTrainWorker pid=43591) [W207 11:34:59.682154000 ProcessGroupGloo.cpp:757] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
(TorchTrainer pid=43579) Started distributed worker processes:
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43592) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43591) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43593) world_rank=2, local_rank=2, node_rank=0
(RayTrainWorker pid=43595) Setting up process group for: env:// [rank=0, world_size=3] [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
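[Note] The deduplication message above names the RAY_DEDUP_LOGS switch. A minimal sketch of turning deduplication off so every worker's line is printed, assuming the variable is set in the driver process before Ray starts:

    import os
    os.environ["RAY_DEDUP_LOGS"] = "0"  # must be set before ray.init() / the first Ray call
    import ray
    ray.init()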
(RayTrainWorker pid=43596) GPU available: True (mps), used: False
(RayTrainWorker pid=43596) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=43596) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
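[Note] The two Lightning warnings above suggest their own fixes. A hedged sketch of a Trainer built with those suggestions applied; the values are illustrative assumptions, not the notebook's settings:

    import pytorch_lightning as pl

    trainer = pl.Trainer(
        accelerator="gpu",   # or "mps" on Apple silicon; addresses "GPU available but not used"
        devices=1,
        max_epochs=10,       # illustrative cap; use max_epochs=-1 to train without an epoch limit
    )

In this run the Trainer is constructed inside the Ray Train per-worker training function, so these arguments would go there; whether MPS can actually be used with the gloo-based process group shown above is a separate question.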
0%| | 0.00/9.91M [00:00<?, ?B/s]
1%| | 65.5k/9.91M [00:00<00:21, 457kB/s]
64%|██████▍ | 6.32M/9.91M [00:02<00:01, 3.09MB/s]
95%|█████████▍| 9.40M/9.91M [00:02<00:00, 3.85MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.88MB/s]
92%|█████████▏| 9.14M/9.91M [00:02<00:00, 3.27MB/s]
99%|█████████▉| 9.80M/9.91M [00:02<00:00, 3.30MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.52MB/s]
95%|█████████▍| 9.37M/9.91M [00:03<00:00, 3.30MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 3.12MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.69MB/s]
93%|█████████▎| 9.24M/9.91M [00:03<00:00, 4.17MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 2.88MB/s]
94%|█████████▍| 9.31M/9.91M [00:02<00:00, 6.67MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.34MB/s]
97%|█████████▋| 9.60M/9.91M [00:03<00:00, 4.85MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 2.69MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 373kB/s]
(RayTrainWorker pid=43594) [W207 11:34:59.296102000 ProcessGroupGloo.cpp:757] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator()) [repeated 8x across cluster]
(TorchTrainer pid=43581) Started distributed worker processes:  [repeated 2x across cluster]
(TorchTrainer pid=43581) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43594) world_rank=2, local_rank=2, node_rank=0 [repeated 6x across cluster]
100%|██████████| 28.9k/28.9k [00:00<00:00, 380kB/s]
96%|█████████▌| 9.54M/9.91M [00:04<00:00, 4.27MB/s]
100%|██████████| 9.91M/9.91M [00:04<00:00, 2.43MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 387kB/s]
(RayTrainWorker pid=43595) GPU available: True (mps), used: False [repeated 2x across cluster]
(RayTrainWorker pid=43595) TPU available: False, using: 0 TPU cores [repeated 2x across cluster]
(RayTrainWorker pid=43595) HPU available: False, using: 0 HPUs [repeated 2x across cluster]
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`. [repeated 2x across cluster]
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`. [repeated 2x across cluster]
100%|██████████| 28.9k/28.9k [00:00<00:00, 373kB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 379kB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 364kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.33MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 354kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.58MB/s]
0%| | 0.00/1.65M [00:00<?, ?B/s] [repeated 22x across cluster]
48%|████▊ | 786k/1.65M [00:00<00:00, 1.90MB/s] [repeated 177x across cluster]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.20MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.08MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.94MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.87MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.70MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.44MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.14MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 3.44MB/s]
(RayTrainWorker pid=43592)
(RayTrainWorker pid=43592) | Name | Type | Params | Mode
(RayTrainWorker pid=43592) --------------------------------------------------------
(RayTrainWorker pid=43592) 0 | accuracy | MulticlassAccuracy | 0 | train
(RayTrainWorker pid=43592) 1 | layer_1 | Linear | 25.1 K | train
(RayTrainWorker pid=43592) 2 | layer_2 | Linear | 2.1 K | train
(RayTrainWorker pid=43592) 3 | layer_3 | Linear | 650 | train
(RayTrainWorker pid=43592) --------------------------------------------------------
(RayTrainWorker pid=43592) 27.9 K Trainable params
(RayTrainWorker pid=43592) 0 Non-trainable params
(RayTrainWorker pid=43592) 27.9 K Total params
(RayTrainWorker pid=43592) 0.112 Total estimated model params size (MB)
(RayTrainWorker pid=43592) 4 Modules in train mode
(RayTrainWorker pid=43592) 0 Modules in eval mode
(RayTrainWorker pid=43592) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.
(RayTrainWorker pid=43595)
(RayTrainWorker pid=43596)
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
100%|██████████| 4.54k/4.54k [00:00<00:00, 2.50MB/s] [repeated 13x across cluster]
66%|██████▌ | 1.08M/1.65M [00:00<00:00, 2.77MB/s] [repeated 19x across cluster]
(RayTrainWorker pid=43596) | Name | Type | Params | Mode  [repeated 2x across cluster]
(RayTrainWorker pid=43596) -------------------------------------------------------- [repeated 4x across cluster]
(RayTrainWorker pid=43596) 0 | accuracy | MulticlassAccuracy | 0 | train [repeated 2x across cluster]
(RayTrainWorker pid=43596) 2 | layer_2 | Linear | 4.2 K | train [repeated 5x across cluster]
(RayTrainWorker pid=43596) 3 | layer_3 | Linear | 650 | train
(RayTrainWorker pid=43596) 55.1 K Trainable params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0 Non-trainable params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 55.1 K Total params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0.220 Total estimated model params size (MB) [repeated 2x across cluster]
(RayTrainWorker pid=43596) 4 Modules in train mode [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0 Modules in eval mode [repeated 2x across cluster]
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization. [repeated 2x across cluster]
(RayTrainWorker pid=43593) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00000_0_batch_size=64,layer_1_size=32,layer_2_size=64,lr=0.0002_2025-02-07_11-34-49/checkpoint_000000)
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization. [repeated 2x across cluster]
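[Note] The repeated data_connector warnings above suggest persistent_workers=True. A self-contained sketch of a DataLoader configured that way; the dataset here is a stand-in, not the notebook's MNIST data module:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(256, 28 * 28), torch.randint(0, 10, (256,)))
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=2,            # persistent_workers requires num_workers > 0
        persistent_workers=True,  # keep worker processes alive across epochs, per the warning
    )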
(RayTrainWorker pid=43597) Traceback (most recent call last):
(RayTrainWorker pid=43597) File "<string>", line 1, in <module>
(RayTrainWorker pid=43597) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43597) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43597) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43597) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43597) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43597) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43597) _pickle.UnpicklingError: pickle data was truncated
(RayTrainWorker pid=43597) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43597) warnings.warn('resource_tracker: There appear to be %d '
(RayTrainWorker pid=43594) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00001_1_batch_size=32,layer_1_size=128,layer_2_size=128,lr=0.0272_2025-02-07_11-34-49/checkpoint_000000) [repeated 8x across cluster]
(RayTrainWorker pid=43594) Traceback (most recent call last): [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "<string>", line 1, in <module> [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main [repeated 2x across cluster]
(RayTrainWorker pid=43594) exitcode = _main(fd, parent_sentinel) [repeated 2x across cluster]
(RayTrainWorker pid=43594) ^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main [repeated 2x across cluster]
(RayTrainWorker pid=43594) self = reduction.pickle.load(from_parent) [repeated 2x across cluster]
(RayTrainWorker pid=43594) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43594) _pickle.UnpicklingError: pickle data was truncated [repeated 2x across cluster]
(RayTrainWorker pid=43594) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown [repeated 2x across cluster]
(RayTrainWorker pid=43594) warnings.warn('resource_tracker: There appear to be %d ' [repeated 2x across cluster]
(RayTrainWorker pid=43596) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000001) [repeated 4x across cluster]
(RayTrainWorker pid=43593) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00000_0_batch_size=64,layer_1_size=32,layer_2_size=64,lr=0.0002_2025-02-07_11-34-49/checkpoint_000002) [repeated 3x across cluster]
(RayTrainWorker pid=43593) Traceback (most recent call last):
(RayTrainWorker pid=43593) File "<string>", line 1, in <module>
(RayTrainWorker pid=43593) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43593) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43593) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43593) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43593) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43593) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43593) _pickle.UnpicklingError: pickle data was truncated
2025-02-07 11:36:00,732 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48' in 0.0032s.
2025-02-07 11:36:00,736 INFO tune.py:1041 -- Total run time: 70.84 seconds (70.82 seconds for the tuning loop).
(RayTrainWorker pid=43596) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000002)
(RayTrainWorker pid=43596) Traceback (most recent call last):
(RayTrainWorker pid=43596) File "<string>", line 1, in <module>
(RayTrainWorker pid=43596) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43596) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43596) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43596) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43596) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43596) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43596) _pickle.UnpicklingError: pickle data was truncated
(RayTrainWorker pid=43593) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43593) warnings.warn('resource_tracker: There appear to be %d '
(RayTrainWorker pid=43599) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000002) [repeated 2x across cluster]
(RayTrainWorker pid=43599) Traceback (most recent call last): [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "<string>", line 1, in <module> [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main [repeated 2x across cluster]
(RayTrainWorker pid=43599) exitcode = _main(fd, parent_sentinel) [repeated 2x across cluster]
(RayTrainWorker pid=43599) ^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main [repeated 2x across cluster]
(RayTrainWorker pid=43599) self = reduction.pickle.load(from_parent) [repeated 2x across cluster]
(RayTrainWorker pid=43599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43599) _pickle.UnpicklingError: pickle data was truncated [repeated 2x across cluster]
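[Note] The truncated-pickle tracebacks and leaked-semaphore warnings come from spawned multiprocessing children (multiprocessing/spawn.py) whose startup payload was cut off while they were still reading it; given the num_workers-based dataloaders implied by the warnings earlier, those children are most plausibly DataLoader worker processes being torn down as trials stop. A hedged way to test that hypothesis, assuming the notebook's dataloaders can be edited, is to disable worker processes and see whether the tracebacks disappear:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    toy = TensorDataset(torch.randn(64, 28 * 28), torch.randint(0, 10, (64,)))
    no_worker_loader = DataLoader(toy, batch_size=32, num_workers=0)  # no spawned workers; diagnostic only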