Created
October 27, 2025 20:12
error trace for CPU-to-CPU RDT over NIXL from ReplayBuffer -> Learner
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$ git pull && python grpo_contextual_bandits_simple.py
Already up to date.
2025-10-27 13:05:23,423 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.24.59:6379...
2025-10-27 13:05:23,433 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at https://session-3frf4lk2clfpxfatd3azds6c8r.i.anyscaleuserdata.com
2025-10-27 13:05:23,439 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip' (1.01MiB) to Ray cluster...
2025-10-27 13:05:23,442 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip'.
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:361 Backend UCX was instantiated
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:251 Initialized NIXL agent: 2892f736c8126b406a0be5b90a000000
2025-10-27 13:05:28 NIXL INFO _api.py:361 Backend UCX was instantiated
2025-10-27 13:05:28 NIXL INFO _api.py:251 Initialized NIXL agent: RAY-DRIVER-46ca0f11-abd7-46fb-9765-957ca597e487
Traceback (most recent call last):
File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 436, in <module>
train(total_steps=args.steps)
File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 409, in train
step_result = ray.get(learner.step.remote())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2961, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 1026, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(nixlBackendError): ray::Learner.step() (pid=8518, ip=10.0.30.16, actor_id=2892f736c8126b406a0be5b90a000000, repr=<grpo_contextual_bandits_simple.Learner object at 0x70616fd939d0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 297, in step
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 515, in get_gpu_object
self._fetch_object(object_id, tensor_transport)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 360, in _fetch_object
__ray_recv__(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_store.py", line 98, in __ray_recv__
tensor_transport_manager.recv_multiple_tensors(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/collective/nixl_tensor_transport.py", line 141, in recv_multiple_tensors
g.recv(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/util/collective/collective_group/nixl_backend.py", line 62, in recv
local_descs = nixl_agent.register_memory(tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/nixl/_api.py", line 384, in register_memory
self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
(Learner pid=8518, ip=10.0.30.16) E1027 13:05:30.361892 8693 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
(Learner pid=8518, ip=10.0.30.16) [1761595530.359394] [ip-10-0-30-16:8518 :0] cuda_copy_md.c:168 UCX ERROR cuMemHostRegister_v2(address, length, 0x01) failed: part or all of the requested memory range is already mapped
(Learner pid=8518, ip=10.0.30.16) [1761595530.359436] [ip-10-0-30-16:8518 :0] ucp_mm.c:76 UCX ERROR failed to register address 0x70430c651340 (host) length 80 on md[4]=cuda_cpy: Input/output error (md supports: host|cuda|cuda-managed)
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:361 Backend UCX was instantiated [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:251 Initialized NIXL agent: 8dfa72b54bb00fccd318b5500a000000 [repeated 3x across cluster]
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$
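
The line in the trace that pins down the failure is the UCX one: an 80-byte host buffer could not be registered on the cuda_cpy memory domain, right after cuMemHostRegister_v2 reported that part of the range was already mapped. When triaging longer logs it can help to pull those fields out mechanically. Below is a minimal sketch in Python; find_registration_failures and the regex are hypothetical helpers written for this log format only, not part of Ray, NIXL, or UCX, and the pattern assumes UCX keeps the "failed to register address ... length ... on md[...]=..." wording seen above.

```python
import re

# Hypothetical triage helper; not an API of Ray, NIXL, or UCX.
# The pattern is an assumption based on the single UCX error line in this trace.
UCX_REG_FAILURE = re.compile(
    r"UCX\s+ERROR\s+failed\s+to\s+register\s+address\s+(?P<address>0x[0-9a-fA-F]+)\s+"
    r"\((?P<mem_type>\w+)\)\s+length\s+(?P<length>\d+)\s+"
    r"on\s+md\[(?P<md_index>\d+)\]=(?P<md_name>\w+)"
)


def find_registration_failures(log_text):
    """Return one dict per UCX memory-registration failure found in log_text."""
    failures = []
    for match in UCX_REG_FAILURE.finditer(log_text):
        failures.append(
            {
                "address": int(match.group("address"), 16),
                "mem_type": match.group("mem_type"),
                "length": int(match.group("length")),
                "md_index": int(match.group("md_index")),
                "md_name": match.group("md_name"),
            }
        )
    return failures


# The UCX error line copied from the trace above.
SAMPLE_LOG = (
    "(Learner pid=8518, ip=10.0.30.16) [1761595530.359436] [ip-10-0-30-16:8518 :0] "
    "ucp_mm.c:76 UCX ERROR failed to register address 0x70430c651340 (host) "
    "length 80 on md[4]=cuda_cpy: Input/output error (md supports: host|cuda|cuda-managed)"
)

failures = find_registration_failures(SAMPLE_LOG)
for failure in failures:
    print(
        f"{failure['mem_type']} buffer of {failure['length']} bytes at "
        f"{failure['address']:#x} failed on md[{failure['md_index']}]={failure['md_name']}"
    )
```

Run against the full log, this only surfaces which buffers failed and on which memory domain; it does not explain why the range was already mapped.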