Created February 7, 2025 19:39
Error logs from running `tune_mnist_asha(num_samples=num_samples)` from the tune-pytorch-lightning.ipynb tutorial notebook.
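For context, a minimal sketch of the kind of setup this call drives: a Ray Tune `Tuner` with an `ASHAScheduler` over a small search space. This is a hedged stand-in, not the notebook's actual code; the trainable below is a dummy function (the notebook uses a Lightning `TorchTrainer`), and the metric name, search space, and `num_samples=10` value are all assumptions for illustration.

```python
# Hedged sketch of a tune_mnist_asha-style run; everything here is a stand-in.
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler


def train_fn(config):
    # Dummy trainable so the sketch runs on its own; the real notebook trains
    # a LightningModule and reports validation metrics instead.
    for epoch in range(10):
        train.report({"val_accuracy": config["lr"] * epoch})


def tune_mnist_asha(num_samples=10):
    scheduler = ASHAScheduler(max_t=10, grace_period=1, reduction_factor=2)
    tuner = tune.Tuner(
        train_fn,
        param_space={
            "lr": tune.loguniform(1e-4, 1e-1),      # hypothetical search space
            "batch_size": tune.choice([32, 64, 128]),
        },
        tune_config=tune.TuneConfig(
            metric="val_accuracy",
            mode="max",
            num_samples=num_samples,
            scheduler=scheduler,
        ),
    )
    return tuner.fit()


tune_mnist_asha(num_samples=10)
```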
(RayTrainWorker pid=43596) Setting up process group for: env:// [rank=0, world_size=3]
(RayTrainWorker pid=43591) [W207 11:34:59.682154000 ProcessGroupGloo.cpp:757] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
(TorchTrainer pid=43579) Started distributed worker processes:
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43592) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43591) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43593) world_rank=2, local_rank=2, node_rank=0
(RayTrainWorker pid=43595) Setting up process group for: env:// [rank=0, world_size=3] [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=43596) GPU available: True (mps), used: False
(RayTrainWorker pid=43596) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=43596) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
0%| | 0.00/9.91M [00:00<?, ?B/s]
1%| | 65.5k/9.91M [00:00<00:21, 457kB/s]
64%|██████▍ | 6.32M/9.91M [00:02<00:01, 3.09MB/s]
95%|█████████▍| 9.40M/9.91M [00:02<00:00, 3.85MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.88MB/s]
92%|█████████▏| 9.14M/9.91M [00:02<00:00, 3.27MB/s]
99%|█████████▉| 9.80M/9.91M [00:02<00:00, 3.30MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.52MB/s]
95%|█████████▍| 9.37M/9.91M [00:03<00:00, 3.30MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 3.12MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.69MB/s]
93%|█████████▎| 9.24M/9.91M [00:03<00:00, 4.17MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 2.88MB/s]
94%|█████████▍| 9.31M/9.91M [00:02<00:00, 6.67MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.34MB/s]
97%|█████████▋| 9.60M/9.91M [00:03<00:00, 4.85MB/s]
100%|██████████| 9.91M/9.91M [00:03<00:00, 2.69MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 373kB/s]
(RayTrainWorker pid=43594) [W207 11:34:59.296102000 ProcessGroupGloo.cpp:757] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator()) [repeated 8x across cluster]
(TorchTrainer pid=43581) Started distributed worker processes: [repeated 2x across cluster]
(TorchTrainer pid=43581) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43594) world_rank=2, local_rank=2, node_rank=0 [repeated 6x across cluster]
100%|██████████| 28.9k/28.9k [00:00<00:00, 380kB/s]
96%|█████████▌| 9.54M/9.91M [00:04<00:00, 4.27MB/s]
100%|██████████| 9.91M/9.91M [00:04<00:00, 2.43MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 387kB/s]
(RayTrainWorker pid=43595) GPU available: True (mps), used: False [repeated 2x across cluster]
(RayTrainWorker pid=43595) TPU available: False, using: 0 TPU cores [repeated 2x across cluster]
(RayTrainWorker pid=43595) HPU available: False, using: 0 HPUs [repeated 2x across cluster]
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`. [repeated 2x across cluster]
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`. [repeated 2x across cluster]
100%|██████████| 28.9k/28.9k [00:00<00:00, 373kB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 379kB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 364kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.33MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 354kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.58MB/s]
0%| | 0.00/1.65M [00:00<?, ?B/s] [repeated 22x across cluster]
48%|████▊ | 786k/1.65M [00:00<00:00, 1.90MB/s] [repeated 177x across cluster]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.20MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.08MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.94MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.87MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.70MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.44MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.14MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 3.44MB/s]
(RayTrainWorker pid=43592)
(RayTrainWorker pid=43592) | Name | Type | Params | Mode
(RayTrainWorker pid=43592) --------------------------------------------------------
(RayTrainWorker pid=43592) 0 | accuracy | MulticlassAccuracy | 0 | train
(RayTrainWorker pid=43592) 1 | layer_1 | Linear | 25.1 K | train
(RayTrainWorker pid=43592) 2 | layer_2 | Linear | 2.1 K | train
(RayTrainWorker pid=43592) 3 | layer_3 | Linear | 650 | train
(RayTrainWorker pid=43592) --------------------------------------------------------
(RayTrainWorker pid=43592) 27.9 K Trainable params
(RayTrainWorker pid=43592) 0 Non-trainable params
(RayTrainWorker pid=43592) 27.9 K Total params
(RayTrainWorker pid=43592) 0.112 Total estimated model params size (MB)
(RayTrainWorker pid=43592) 4 Modules in train mode
(RayTrainWorker pid=43592) 0 Modules in eval mode
(RayTrainWorker pid=43592) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.
(RayTrainWorker pid=43595)
(RayTrainWorker pid=43596)
(RayTrainWorker pid=43595) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
100%|██████████| 4.54k/4.54k [00:00<00:00, 2.50MB/s] [repeated 13x across cluster]
66%|██████▌ | 1.08M/1.65M [00:00<00:00, 2.77MB/s] [repeated 19x across cluster]
(RayTrainWorker pid=43596) | Name | Type | Params | Mode [repeated 2x across cluster]
(RayTrainWorker pid=43596) -------------------------------------------------------- [repeated 4x across cluster]
(RayTrainWorker pid=43596) 0 | accuracy | MulticlassAccuracy | 0 | train [repeated 2x across cluster]
(RayTrainWorker pid=43596) 2 | layer_2 | Linear | 4.2 K | train [repeated 5x across cluster]
(RayTrainWorker pid=43596) 3 | layer_3 | Linear | 650 | train
(RayTrainWorker pid=43596) 55.1 K Trainable params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0 Non-trainable params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 55.1 K Total params [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0.220 Total estimated model params size (MB) [repeated 2x across cluster]
(RayTrainWorker pid=43596) 4 Modules in train mode [repeated 2x across cluster]
(RayTrainWorker pid=43596) 0 Modules in eval mode [repeated 2x across cluster]
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization. [repeated 2x across cluster]
(RayTrainWorker pid=43593) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00000_0_batch_size=64,layer_1_size=32,layer_2_size=64,lr=0.0002_2025-02-07_11-34-49/checkpoint_000000)
(RayTrainWorker pid=43596) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization. [repeated 2x across cluster]
(RayTrainWorker pid=43597) Traceback (most recent call last):
(RayTrainWorker pid=43597) File "<string>", line 1, in <module>
(RayTrainWorker pid=43597) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43597) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43597) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43597) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43597) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43597) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43597) _pickle.UnpicklingError: pickle data was truncated
(RayTrainWorker pid=43597) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43597) warnings.warn('resource_tracker: There appear to be %d '
(RayTrainWorker pid=43594) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00001_1_batch_size=32,layer_1_size=128,layer_2_size=128,lr=0.0272_2025-02-07_11-34-49/checkpoint_000000) [repeated 8x across cluster]
(RayTrainWorker pid=43594) Traceback (most recent call last): [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "<string>", line 1, in <module> [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main [repeated 2x across cluster]
(RayTrainWorker pid=43594) exitcode = _main(fd, parent_sentinel) [repeated 2x across cluster]
(RayTrainWorker pid=43594) ^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43594) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main [repeated 2x across cluster]
(RayTrainWorker pid=43594) self = reduction.pickle.load(from_parent) [repeated 2x across cluster]
(RayTrainWorker pid=43594) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43594) _pickle.UnpicklingError: pickle data was truncated [repeated 2x across cluster]
(RayTrainWorker pid=43594) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown [repeated 2x across cluster]
(RayTrainWorker pid=43594) warnings.warn('resource_tracker: There appear to be %d ' [repeated 2x across cluster]
(RayTrainWorker pid=43596) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000001) [repeated 4x across cluster]
(RayTrainWorker pid=43593) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00000_0_batch_size=64,layer_1_size=32,layer_2_size=64,lr=0.0002_2025-02-07_11-34-49/checkpoint_000002) [repeated 3x across cluster]
(RayTrainWorker pid=43593) Traceback (most recent call last):
(RayTrainWorker pid=43593) File "<string>", line 1, in <module>
(RayTrainWorker pid=43593) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43593) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43593) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43593) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43593) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43593) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43593) _pickle.UnpicklingError: pickle data was truncated
2025-02-07 11:36:00,732 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48' in 0.0032s.
2025-02-07 11:36:00,736 INFO tune.py:1041 -- Total run time: 70.84 seconds (70.82 seconds for the tuning loop).
(RayTrainWorker pid=43596) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000002)
(RayTrainWorker pid=43596) Traceback (most recent call last):
(RayTrainWorker pid=43596) File "<string>", line 1, in <module>
(RayTrainWorker pid=43596) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
(RayTrainWorker pid=43596) exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=43596) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43596) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
(RayTrainWorker pid=43596) self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=43596) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43596) _pickle.UnpicklingError: pickle data was truncated
(RayTrainWorker pid=43593) /opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 52 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43593) warnings.warn('resource_tracker: There appear to be %d '
(RayTrainWorker pid=43599) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/rdecal/ray_results/TorchTrainer_2025-02-07_11-34-48/TorchTrainer_9886b_00002_2_batch_size=32,layer_1_size=64,layer_2_size=64,lr=0.0034_2025-02-07_11-34-49/checkpoint_000002) [repeated 2x across cluster]
(RayTrainWorker pid=43599) Traceback (most recent call last): [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "<string>", line 1, in <module> [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main [repeated 2x across cluster]
(RayTrainWorker pid=43599) exitcode = _main(fd, parent_sentinel) [repeated 2x across cluster]
(RayTrainWorker pid=43599) ^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43599) File "/opt/miniconda3/envs/tune-pytorch-cifar/lib/python3.11/multiprocessing/spawn.py", line 132, in _main [repeated 2x across cluster]
(RayTrainWorker pid=43599) self = reduction.pickle.load(from_parent) [repeated 2x across cluster]
(RayTrainWorker pid=43599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 2x across cluster]
(RayTrainWorker pid=43599) _pickle.UnpicklingError: pickle data was truncated [repeated 2x across cluster]