@crypdick
Created October 27, 2025 20:12
error trace for CPU-to-CPU RDT over NIXL from ReplayBuffer -> Learner
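For context, a runnable stand-in for the data path named in the title might look like the sketch below. Only the actor and method names (`ReplayBuffer`, `Learner.step`) come from the trace itself; the Ray decorators and the NIXL tensor transport are omitted as assumptions so the sketch runs without a cluster, with the cross-node hop replaced by a plain method call.

```python
# Hypothetical skeleton of the ReplayBuffer -> Learner path from the title.
# In the real script both classes are Ray actors and the sampled batch
# crosses the network via Ray's NIXL tensor transport; that recv is where
# register_memory() fails in the trace below.

class ReplayBuffer:
    """Stores CPU-resident rollout batches (stand-in for the Ray actor)."""

    def __init__(self):
        self._batches = []

    def add(self, batch):
        self._batches.append(batch)

    def sample(self):
        return self._batches[-1]


class Learner:
    """Pulls a batch and runs one update step (stand-in for the Ray actor)."""

    def __init__(self, buffer):
        self.buffer = buffer

    def step(self):
        # In the failing run this pull is a cross-node NIXL recv, not a call.
        batch = self.buffer.sample()
        return {"batch_size": len(batch)}


buf = ReplayBuffer()
buf.add([0.1, 0.2, 0.3])
learner = Learner(buf)
print(learner.step())
```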
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$ git pull && python grpo_contextual_bandits_simple.py
Already up to date.
2025-10-27 13:05:23,423 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.24.59:6379...
2025-10-27 13:05:23,433 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at https://session-3frf4lk2clfpxfatd3azds6c8r.i.anyscaleuserdata.com
2025-10-27 13:05:23,439 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip' (1.01MiB) to Ray cluster...
2025-10-27 13:05:23,442 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip'.
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:361 Backend UCX was instantiated
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:251 Initialized NIXL agent: 2892f736c8126b406a0be5b90a000000
2025-10-27 13:05:28 NIXL INFO _api.py:361 Backend UCX was instantiated
2025-10-27 13:05:28 NIXL INFO _api.py:251 Initialized NIXL agent: RAY-DRIVER-46ca0f11-abd7-46fb-9765-957ca597e487
Traceback (most recent call last):
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 436, in <module>
    train(total_steps=args.steps)
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 409, in train
    step_result = ray.get(learner.step.remote())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2961, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 1026, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(nixlBackendError): ray::Learner.step() (pid=8518, ip=10.0.30.16, actor_id=2892f736c8126b406a0be5b90a000000, repr=<grpo_contextual_bandits_simple.Learner object at 0x70616fd939d0>)
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 297, in step
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 515, in get_gpu_object
    self._fetch_object(object_id, tensor_transport)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 360, in _fetch_object
    __ray_recv__(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_store.py", line 98, in __ray_recv__
    tensor_transport_manager.recv_multiple_tensors(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/collective/nixl_tensor_transport.py", line 141, in recv_multiple_tensors
    g.recv(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/util/collective/collective_group/nixl_backend.py", line 62, in recv
    local_descs = nixl_agent.register_memory(tensors)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/nixl/_api.py", line 384, in register_memory
    self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
(Learner pid=8518, ip=10.0.30.16) E1027 13:05:30.361892 8693 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
(Learner pid=8518, ip=10.0.30.16) [1761595530.359394] [ip-10-0-30-16:8518 :0] cuda_copy_md.c:168 UCX ERROR cuMemHostRegister_v2(address, length, 0x01) failed: part or all of the requested memory range is already mapped
(Learner pid=8518, ip=10.0.30.16) [1761595530.359436] [ip-10-0-30-16:8518 :0] ucp_mm.c:76 UCX ERROR failed to register address 0x70430c651340 (host) length 80 on md[4]=cuda_cpy: Input/output error (md supports: host|cuda|cuda-managed)
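The two UCX lines above are the proximate failure: `cuMemHostRegister` refuses to pin a host range when part of it is already mapped. As a purely illustrative model of that rejection (the addresses reuse the log's values, but the registry and overlap check are made up; the real bookkeeping lives inside CUDA/UCX, not Python):

```python
# Toy model of the failure reported by cuda_copy_md.c above: registering a
# host-memory range that overlaps an already-registered one is rejected.
# The HostMemoryRegistry class and its overlap check are illustrative only.

class HostMemoryRegistry:
    def __init__(self):
        self._ranges = []  # half-open (start, end) intervals

    def register(self, addr, length):
        start, end = addr, addr + length
        for s, e in self._ranges:
            if start < e and s < end:  # intervals overlap
                raise RuntimeError(
                    "part or all of the requested memory range is already mapped"
                )
        self._ranges.append((start, end))


reg = HostMemoryRegistry()
reg.register(0x70430C651000, 0x400)   # first registration succeeds
try:
    reg.register(0x70430C651340, 80)  # overlaps the pinned range -> rejected
except RuntimeError as err:
    print(f"UCX-style failure: {err}")
```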
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:361 Backend UCX was instantiated [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:251 Initialized NIXL agent: 8dfa72b54bb00fccd318b5500a000000 [repeated 3x across cluster]
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$
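One untested diagnostic avenue: the failed registration happens in UCX's `cuda_copy` memory domain (`md[4]` above) while pinning plain host memory. UCX selects transports via the `UCX_TLS` environment variable, and a leading `^` means "all transports except those listed", so excluding the CUDA transports for a CPU-to-CPU transfer could keep `cuMemHostRegister` out of the path entirely. Whether this resolves the `NIXL_ERR_BACKEND` here is an assumption; verify against your UCX and NIXL versions.

```shell
# Untested sketch: exclude the CUDA transports so the cuda_copy memory domain
# never tries to cuMemHostRegister the host buffer. '^' negates the list.
export UCX_TLS='^cuda_copy,cuda_ipc'
# Then re-run the failing script with the restricted transport list:
# python grpo_contextual_bandits_simple.py
```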