@crypdick
Created October 27, 2025 20:12
error trace for CPU-to-CPU RDT over NIXL from ReplayBuffer -> Learner
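For context, a runnable stand-in for the data path named in the title might look like the sketch below. Only the actor and method names (`ReplayBuffer`, `Learner.step`) come from the trace itself; the Ray decorators and the NIXL tensor transport are omitted as assumptions so the sketch runs without a cluster, with the cross-node hop replaced by a plain method call.

```python
# Hypothetical skeleton of the ReplayBuffer -> Learner path from the title.
# In the real script both classes are Ray actors and the sampled batch
# crosses the network via Ray's NIXL tensor transport; that recv is where
# register_memory() fails in the trace below.

class ReplayBuffer:
    """Stores CPU-resident rollout batches (stand-in for the Ray actor)."""

    def __init__(self):
        self._batches = []

    def add(self, batch):
        self._batches.append(batch)

    def sample(self):
        return self._batches[-1]


class Learner:
    """Pulls a batch and runs one update step (stand-in for the Ray actor)."""

    def __init__(self, buffer):
        self.buffer = buffer

    def step(self):
        # In the failing run this pull is a cross-node NIXL recv, not a call.
        batch = self.buffer.sample()
        return {"batch_size": len(batch)}


buf = ReplayBuffer()
buf.add([0.1, 0.2, 0.3])
learner = Learner(buf)
print(learner.step())
```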
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$ git pull && python grpo_contextual_bandits_simple.py
Already up to date.
2025-10-27 13:05:23,423 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.24.59:6379...
2025-10-27 13:05:23,433 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at https://session-3frf4lk2clfpxfatd3azds6c8r.i.anyscaleuserdata.com
2025-10-27 13:05:23,439 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip' (1.01MiB) to Ray cluster...
2025-10-27 13:05:23,442 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip'.
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:361 Backend UCX was instantiated
(Learner pid=8518, ip=10.0.30.16) 2025-10-27 13:05:26 NIXL INFO _api.py:251 Initialized NIXL agent: 2892f736c8126b406a0be5b90a000000
2025-10-27 13:05:28 NIXL INFO _api.py:361 Backend UCX was instantiated
2025-10-27 13:05:28 NIXL INFO _api.py:251 Initialized NIXL agent: RAY-DRIVER-46ca0f11-abd7-46fb-9765-957ca597e487
Traceback (most recent call last):
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 436, in <module>
    train(total_steps=args.steps)
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 409, in train
    step_result = ray.get(learner.step.remote())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2961, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 1026, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(nixlBackendError): ray::Learner.step() (pid=8518, ip=10.0.30.16, actor_id=2892f736c8126b406a0be5b90a000000, repr=<grpo_contextual_bandits_simple.Learner object at 0x70616fd939d0>)
  File "/home/ray/default/rl-gpu-objects/grpo_contextual_bandits_simple.py", line 297, in step
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 515, in get_gpu_object
    self._fetch_object(object_id, tensor_transport)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_manager.py", line 360, in _fetch_object
    __ray_recv__(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/gpu_object_manager/gpu_object_store.py", line 98, in __ray_recv__
    tensor_transport_manager.recv_multiple_tensors(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/collective/nixl_tensor_transport.py", line 141, in recv_multiple_tensors
    g.recv(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/util/collective/collective_group/nixl_backend.py", line 62, in recv
    local_descs = nixl_agent.register_memory(tensors)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/nixl/_api.py", line 384, in register_memory
    self.agent.registerMem(reg_descs, handle_list)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
(Learner pid=8518, ip=10.0.30.16) E1027 13:05:30.361892 8693 nixl_agent.cpp:473] registerMem: registration failed for the specified or all potential backends
(Learner pid=8518, ip=10.0.30.16) [1761595530.359394] [ip-10-0-30-16:8518 :0] cuda_copy_md.c:168 UCX ERROR cuMemHostRegister_v2(address, length, 0x01) failed: part or all of the requested memory range is already mapped
(Learner pid=8518, ip=10.0.30.16) [1761595530.359436] [ip-10-0-30-16:8518 :0] ucp_mm.c:76 UCX ERROR failed to register address 0x70430c651340 (host) length 80 on md[4]=cuda_cpy: Input/output error (md supports: host|cuda|cuda-managed)
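The two UCX lines above are the proximate failure: `cuMemHostRegister` refuses to pin a host range when part of it is already mapped. As a purely illustrative model of that rejection (the addresses reuse the log's values, but the registry and overlap check are made up; the real bookkeeping lives inside CUDA/UCX, not Python):

```python
# Toy model of the failure reported by cuda_copy_md.c above: registering a
# host-memory range that overlaps an already-registered one is rejected.
# The HostMemoryRegistry class and its overlap check are illustrative only.

class HostMemoryRegistry:
    def __init__(self):
        self._ranges = []  # half-open (start, end) intervals

    def register(self, addr, length):
        start, end = addr, addr + length
        for s, e in self._ranges:
            if start < e and s < end:  # intervals overlap
                raise RuntimeError(
                    "part or all of the requested memory range is already mapped"
                )
        self._ranges.append((start, end))


reg = HostMemoryRegistry()
reg.register(0x70430C651000, 0x400)   # first registration succeeds
try:
    reg.register(0x70430C651340, 80)  # overlaps the pinned range -> rejected
except RuntimeError as err:
    print(f"UCX-style failure: {err}")
```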
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:361 Backend UCX was instantiated [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReplayBuffer pid=6839, ip=10.0.19.192) 2025-10-27 13:05:29 NIXL INFO _api.py:251 Initialized NIXL agent: 8dfa72b54bb00fccd318b5500a000000 [repeated 3x across cluster]
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$
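One untested diagnostic avenue: the failed registration happens in UCX's `cuda_copy` memory domain (`md[4]` above) while pinning plain host memory. UCX selects transports via the `UCX_TLS` environment variable, and a leading `^` means "all transports except those listed", so excluding the CUDA transports for a CPU-to-CPU transfer could keep `cuMemHostRegister` out of the path entirely. Whether this resolves the `NIXL_ERR_BACKEND` here is an assumption; verify against your UCX and NIXL versions.

```shell
# Untested sketch: exclude the CUDA transports so the cuda_copy memory domain
# never tries to cuMemHostRegister the host buffer. '^' negates the list.
export UCX_TLS='^cuda_copy,cuda_ipc'
# Then re-run the failing script with the restricted transport list:
# python grpo_contextual_bandits_simple.py
```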