crypdick / get_object_ref_rdt_bug.txt
Created October 30, 2025 02:42
can't ray.get(objectRef_list)
(GeneratorCore pid=70828, ip=10.0.49.144) (Worker_TP2 pid=71026) [vLLM-Worker] Loaded 389 weight(s)
[Generator] Updated weights on all vLLM workers
[Generator] Weight sync completed
[train] step 1/25 called
(GeneratorCore pid=70828, ip=10.0.49.144) (Worker_TP0 pid=71024) [vLLM-Worker] Loaded 389 weight(s)
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1245.71it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
(GeneratorCore pid=70828, ip=10.0.49.144) prompt: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? | logprobs length: 8 | response:
(GeneratorCore pid=70828, ip=10.0.49.144) A lot of people write a lot
(GeneratorCore pid=70828, ip=10.0.49.144) prompt: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? | logprobs length: 8 | response:
crypdick / error-when-sending-gpu-object-ref-with-vllm-collective_rpc.txt
Created October 29, 2025 06:14
Error when sending ray.ObjectRef using collective_rpc
Updating 6f48dfe..5b23624
Fast-forward
rdt-vllm-simple/agents/generator/core.py | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
2025-10-28 23:11:39,280 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.10.150:6379...
2025-10-28 23:11:39,292 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at https://session-3frf4lk2clfpxfatd3azds6c8r.i.anyscaleuserdata.com
2025-10-28 23:11:39,300 INFO packaging.py:588 -- Creating a file package for local module '/home/ray/default/rl-gpu-objects/rdt-vllm-simple'.
2025-10-28 23:11:39,305 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_210586ea18ac0c9a.zip' (0.11MiB) to Ray cluster...
2025-10-28 23:11:39,306 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_210586ea18ac0c9a.zip'.
(LearnerWorker pid=21752, ip=10.0.11.247) [Learner-rank0] Initializing process group: master=10.0.11.247:34579, world_size=4
crypdick / cpu-cpu-nixl-rdt-error.txt
Created October 27, 2025 20:12
error trace for CPU-to-CPU RDT over NIXL from ReplayBuffer -> Learner
(base) ray@ip-10-0-24-59:~/default/rl-gpu-objects$ git pull && python grpo_contextual_bandits_simple.py
Already up to date.
2025-10-27 13:05:23,423 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.24.59:6379...
2025-10-27 13:05:23,433 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at https://session-3frf4lk2clfpxfatd3azds6c8r.i.anyscaleuserdata.com
2025-10-27 13:05:23,439 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip' (1.01MiB) to Ray cluster...
2025-10-27 13:05:23,442 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_7934b7eafc1d2b0d7f2bb6fb316727df3f7c78c2.zip'.
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_
crypdick / count_parquet_rows.py
Created August 12, 2025 20:49
Count the number of rows in a sharded parquet dataset without loading the shards. Works by reading just the metadata headers. Works with S3 datasets
import pyarrow.dataset as ds


def count_parquet_rows(dataset_path: str) -> int:
    """
    Count the number of rows in a Parquet dataset without reading the data into memory.

    https://stackoverflow.com/a/79118602/4212158
    """
    dataset = ds.dataset(dataset_path, format="parquet")
    row_count = sum(
        row_group.num_rows
        for fragment in dataset.get_fragments()
        for row_group in fragment.row_groups
    )
    return row_count
Traceback (most recent call last):
File "/tmp/ray/session_2025-08-08_19-13-17_305038_2243/runtime_resources/working_dir_files/s3_ray-release-automation-results_working_dirs_text_embeddings_benchmark_fixed_size_preemptible_gswrofihok__anyscale_pkg_aa4c368f375d6f6f25845bef969f1c00/dataset/text_embeddings_benchmark.py", line 257, in <module>
benchmark.run_fn("text-embeddings-benchmark", main, args)
File "/tmp/ray/session_2025-08-08_19-13-17_305038_2243/runtime_resources/working_dir_files/s3_ray-release-automation-results_working_dirs_text_embeddings_benchmark_fixed_size_preemptible_gswrofihok__anyscale_pkg_aa4c368f375d6f6f25845bef969f1c00/dataset/benchmark.py", line 154, in run_fn
fn_output = fn(*fn_args, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-08-08_19-13-17_305038_2243/runtime_resources/working_dir_files/s3_ray-release-automation-results_working_dirs_text_embeddings_benchmark_fixed_size_preemptible_gswrofihok__anyscale_pkg_aa4c368f375d6f6f25845bef969f1c0
crypdick / convert_arrow_to_parquet_streaming.py
Created August 8, 2025 20:04
Convert large Arrow shards into Parquet without loading the entire dataset into memory.
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "pyarrow",
# ]
# ///
"""Convert .arrow shards to Parquet without loading entire dataset into memory.

- Discovers all .arrow files under a given source directory
crypdick / gpu_util_tracker.py
Created July 23, 2025 00:43
Actor that integrates GPU capacity over time across a Ray cluster.
import threading
import time

import ray


@ray.remote(num_cpus=0)
class GPUHoursTracker:
    """Actor that integrates GPU capacity over time across a Ray cluster.
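The preview is truncated. The core of "integrating GPU capacity over time" is a running rectangle sum: each sample adds `num_gpus * elapsed_seconds` to an accumulator. A minimal plain-Python sketch of that logic (no Ray, hypothetical class name, injectable clock for testing):

```python
import time


class GPUHoursIntegrator:
    """Accumulate GPU-hours by integrating capacity samples over time.

    Each call to sample(num_gpus) adds the rectangle
    num_gpus * (now - last_sample_time) to the running total.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_t = clock()
        self._gpu_seconds = 0.0

    def sample(self, num_gpus: float) -> None:
        """Record that num_gpus were available since the previous sample."""
        now = self._clock()
        self._gpu_seconds += num_gpus * (now - self._last_t)
        self._last_t = now

    @property
    def gpu_hours(self) -> float:
        return self._gpu_seconds / 3600.0
```

In the actual gist this would live inside a Ray actor polling cluster resources on a background thread; the sketch only shows the integration step.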
crypdick / code.py
Last active July 17, 2025 20:23
repro for ray data llm failure with parquet sink
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct")

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]},
crypdick / lerp_vs_slurp_2_vectors.py
Created July 7, 2025 03:06
for my blog post on multi-vector slerp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a 3D plot showing the vectors and interpolations
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot original vectors
ax.quiver(0, 0, 0, vecs[0][0], vecs[0][1], vecs[0][2],
          color='black', arrow_length_ratio=0.1, label='v0 [1,0,0]')
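The preview only shows the plotting setup. For context, a minimal pure-Python slerp between two vectors (an assumed sketch, not the gist's code) looks like this:

```python
import math


def slerp(v0, v1, t):
    """Spherical linear interpolation between two unit vectors.

    Falls back to linear interpolation when the vectors are nearly parallel,
    where the sin(omega) denominator becomes numerically unstable.
    """
    dot = sum(a * b for a, b in zip(v0, v1))
    dot = max(-1.0, min(1.0, dot))  # clamp against float drift before acos
    omega = math.acos(dot)
    if abs(math.sin(omega)) < 1e-8:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

Unlike lerp, slerp keeps unit vectors on the unit sphere at every `t`, which is what the 3D plot above visualizes.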
<deleted 50,000 lines of logs repeating the same thing>
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) raise output
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 317, in generate_async
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) output = await self._generate_async(request)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 399, in generate_as