Russell Jurney (rjurney)
@rjurney
rjurney / SPARK.md
Created December 22, 2025 06:16
A Claude Code PySpark README - a heavily altered Palantir PySpark Style Guide

PySpark Style Guide

PySpark is a wrapper language that allows users to interface with an Apache Spark backend to quickly process data. Spark can operate on massive datasets across a distributed network of servers, providing major performance and reliability benefits when utilized correctly. It presents challenges, even for experienced Python developers, as the PySpark syntax draws on the JVM heritage of Spark and therefore implements code patterns that may be unfamiliar.

This opinionated guide to PySpark code style presents common situations we've encountered and the associated best practices, based on the most frequently recurring topics across PySpark repositories.

Beyond PySpark specifics, the general practices of clean code are important in PySpark repositories; the Google PyGuide is a strong starting point for learning more about these practices.

Prefer implicit column selection to direct access, except for disambiguation
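A minimal sketch of what that rule means in practice, assuming the guide's usual `import pyspark.sql.functions as F` convention; the DataFrames and column names here are made up for illustration, and the last example shows the disambiguation exception.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2)], ["category", "value"])

# Preferred: implicit column selection via F.col, not tied to any one DataFrame variable
result = df.filter(F.col("value") > 1).select("category")

# Direct access couples the expression to a specific DataFrame variable
result = df.filter(df["value"] > 1).select(df["category"])

# Exception: disambiguation when two DataFrames in a join share column semantics
other = spark.createDataFrame([("a", 10), ("b", 20)], ["category", "score"])
joined = df.join(other, on="category").select(df["value"], other["score"])
```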

@rjurney
rjurney / 1_query.py
Created December 16, 2025 02:25
Some commands to examine the data after you run the pipeline
threads = spark.read.parquet("data/threaded_emails.parquet")
(
    threads
    .select("jwz_thread_id", "id", "date", "thread_depth", "subject")
    .orderBy("jwz_thread_id", "date")
    .limit(20)
    .show(20, False)
)
@rjurney
rjurney / an_explanation.txt
Created December 15, 2025 00:37
LMSS RDF --> Property Graph Experiment
I ran the simplest experiment possible using a project I created with a friend to get standard, highly scalable PySpark working as an MCP server: https://github.com/SemyonSinchenko/pyspark-mcp-server
It figured out how to map the RDF to a property graph on its own. The result looks okay to me, though with black boxes you obviously need a healthy amount of validation data. Still, this looks promising!
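For orientation, a minimal sketch of one way an RDF-to-property-graph mapping could look in PySpark is below; the triple column names and example URIs are assumptions for illustration, not what the MCP server actually produced.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical triples DataFrame: one row per (subject, predicate, object) RDF statement
triples = spark.createDataFrame(
    [
        ("lmss:Contract", "rdfs:label", "Contract"),
        ("lmss:Contract", "rdfs:subClassOf", "lmss:LegalDocument"),
    ],
    ["subject", "predicate", "object"],
)

# Nodes: every distinct subject or object becomes a vertex
nodes = (
    triples.select(F.col("subject").alias("id"))
    .union(triples.select(F.col("object").alias("id")))
    .distinct()
)

# Edges: each triple becomes a typed edge from subject to object
edges = triples.select(
    F.col("subject").alias("src"),
    F.col("object").alias("dst"),
    F.col("predicate").alias("relationship"),
)
```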
@rjurney
rjurney / etl.sh
Created December 14, 2025 02:57
You can do general-purpose ETL, like broadcast joins, with LLMs these days...
abzu process kg tickers -b 100 -n 1
2025-12-13 18:44:44,806 - abzu.kg.tickers - INFO - Using cached SEC tickers from data/sec/company_tickers.json
2025-12-13 18:44:44,806 - abzu.kg.tickers - INFO - Loading companies from data/knowledge_graph/companies.parquet
2025-12-13 18:44:45,166 - abzu.kg.tickers - INFO - Loaded 15,983 companies
2025-12-13 18:44:45,166 - abzu.kg.tickers - INFO - Loading tickers from data/sec/company_tickers.json
2025-12-13 18:44:45,174 - abzu.kg.tickers - INFO - Loaded 10,021 tickers
2025-12-13 18:44:45,176 - abzu.kg.tickers - INFO - Limited to 1 companies
2025-12-13 18:44:45,187 - abzu.kg.tickers - INFO - Processing 1 companies with 10,021 tickers
0%| | 0/1 [00:00<?, ?it/s]2025-12-13 18:44:45,196 - abzu.kg.tickers - INFO - Processing batch 1
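For context, a broadcast join over the two datasets named in this log might look like the sketch below. The file paths are taken from the log output, but the schema and the join key are assumptions about how the pipeline stores CIKs.

```python
import json
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# ~16K knowledge-graph companies (path taken from the log above)
companies = spark.read.parquet("data/knowledge_graph/companies.parquet")

# SEC's company_tickers.json is a dict of records keyed by index; ~10K rows
# easily fit in driver memory, so load locally and parallelize
with open("data/sec/company_tickers.json") as f:
    ticker_records = [Row(**r) for r in json.load(f).values()]
tickers = spark.createDataFrame(ticker_records)

# Broadcasting the small tickers table avoids shuffling the larger companies table.
# The "cik" join key is an assumption about the companies schema.
matched = companies.join(
    broadcast(tickers.withColumnRenamed("cik_str", "cik")),
    on="cik",
    how="left",
)
```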
@rjurney
rjurney / error.py
Created December 7, 2025 03:01
Error when visualizing a DataFrame of edges
HTTPError
422 Client Error: Unprocessable Entity for url: https://hub.graphistry.com/api/v2/upload/datasets/299b11a0c59741db86d40613c1269cf8/nodes/arrow
See the console area for a traceback.
Traceback (most recent call last):
File "/Users/rjurney/anaconda3/envs/weave/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 139, in execute_cell
return eval(cell.last_expr, glbls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@rjurney
rjurney / Data-Scientist-Coined.eml
Last active November 22, 2025 01:09
The email where the title 'data scientist' was originally coined by Jeff Hammerbacher and his team at Facebook on March 1, 2008
From: cameron marlow <[email protected]>
Date: Tue, Oct 23, 2012 at 12:25 AM
Subject: data application scientist
To: Jeff Hammerbacher <[email protected]>
Was searching for the date the term was coined, thought you might appreciate this.
Hey Cam,
@rjurney
rjurney / results.txt
Created November 11, 2025 23:34
SERF entity resolution - round two results
============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ENTITY RESOLUTION EVALUATION SUMMARY - ITERATION 2
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Original raw companies (before matching): 13,641 unique
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Companies that went into matching: 11,093
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Skipped (singletons/errors): 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - MATCHING RESULTS:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - BAML-processed companies: 4,930 unique
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Companies merged: 6,163 (55.56%)
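The merge percentage in the last line appears to be taken against the 11,093 companies that entered matching:

```python
# 6,163 merged out of the 11,093 companies that went into matching
print(f"{6_163 / 11_093:.2%}")  # 55.56%
```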
@rjurney
rjurney / eridu.txt
Created October 29, 2025 00:01
Latest Eridu performance metrics
Using device: mps
Loading model from: data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/
Successfully loaded model from data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/
Loading test data from: data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/test_split.parquet
Sampling 1,000 test pairs from 168,280 available
Running inference on 1,000 test pairs
Found optimal threshold: 0.6907
Running inference on 1,000 test pairs
Model Evaluation Report
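The "optimal threshold" step above is presumably a sweep over similarity cutoffs; a minimal sketch of how that could work is below. The model and parquet paths come from the log output, but the pair column names ("left", "right", "label") and the F1 objective are assumptions.

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import f1_score

# Paths taken from the log output above
model_dir = "data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/"
model = SentenceTransformer(model_dir)
pairs = pd.read_parquet(model_dir + "test_split.parquet").sample(n=1_000, random_state=42)

# Encode both sides of each pair and score by cosine similarity
left = model.encode(pairs["left"].tolist(), convert_to_tensor=True)
right = model.encode(pairs["right"].tolist(), convert_to_tensor=True)
scores = util.pairwise_cos_sim(left, right).cpu().numpy()

# Sweep thresholds and keep the one that maximizes F1 on the sample
labels = pairs["label"].to_numpy()
thresholds = np.linspace(0.0, 1.0, 1001)
f1s = [f1_score(labels, scores >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"Found optimal threshold: {best:.4f}")
```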
@rjurney
rjurney / groupage.md
Last active October 2, 2025 22:24
Claude Code command to group data, count group sizes, display high and low superkeys, and sample and display grouped records
| allowed-tools | description |
| --- | --- |
| pyspark-mcp, WebFetch, Web Search, Bash(python:*), Bash(poetry:*), Bash(pyspark:*) | The groupage command is used to group data, count group sizes, plot a histogram, and display both the keys of the largest groups and those of groups in a middle range. Arguments include the column to group by. |

Groupage Command

Description

Recognize that the unique values of fields in real-world datasets often follow long-tail, log-scale distributions. This creates 'superkeys' that can cause problems in downstream code. The groupage command is used to identify and mitigate these superkeys, along the lines of the sketch below.
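A rough sketch of the kind of PySpark the command might drive; the input DataFrame, the grouping column, and the cutoffs for "largest" and "middle range" groups are all assumptions for illustration.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; in practice the command operates on whatever DataFrame
# and grouping column the user passes as arguments
df = spark.read.parquet("data/events.parquet")
group_col = "user_id"

# Group and count to expose the long-tail distribution of key frequencies
counts = df.groupBy(group_col).count()

# Largest groups: candidate "superkeys" that may blow up joins or skew partitions
counts.orderBy(F.col("count").desc()).show(10, truncate=False)

# Middle-range groups for comparison (percentile bounds are arbitrary here)
p25, p75 = counts.approxQuantile("count", [0.25, 0.75], 0.01)
counts.filter((F.col("count") >= p25) & (F.col("count") <= p75)).show(10, truncate=False)

# Sample a few full records from the single largest group
top_key = counts.orderBy(F.col("count").desc()).first()[group_col]
df.filter(F.col(group_col) == top_key).show(5, truncate=False)
```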

@rjurney
rjurney / Pynvml_Rich_GPU_Monitor.sql
Created October 2, 2025 17:51
Pynvml / Rich GPU Monitor with min/max
┏━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ GPU ID ┃ GPU Util % ┃ Mem Used (MB) ┃ Mem Total (MB) ┃ Mem % ┃ Min Mem (MB) ┃ Max Mem (MB) ┃
┡━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│      0 │      100.0 │        9856.0 │        12288.0 │  80.2 │       9816.0 │       9856.0 │
│      1 │       75.0 │        4534.4 │        12288.0 │  36.9 │       4534.4 │       4534.4 │
└────────┴────────────┴───────────────┴────────────────┴───────┴──────────────┴──────────────┘
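A minimal sketch of a monitor that could produce a table like this, assuming pynvml and rich are installed; the per-GPU min/max tracking and the one-second refresh interval are guesses at the gist's behavior, not taken from it.

```python
import time
import pynvml
from rich.console import Console
from rich.table import Table

pynvml.nvmlInit()
console = Console()
count = pynvml.nvmlDeviceGetCount()

# Track min/max memory used per GPU across refreshes
min_mem = {i: float("inf") for i in range(count)}
max_mem = {i: 0.0 for i in range(count)}

while True:
    table = Table(title="GPU Monitor")
    for col in ("GPU ID", "GPU Util %", "Mem Used (MB)", "Mem Total (MB)",
                "Mem %", "Min Mem (MB)", "Max Mem (MB)"):
        table.add_column(col, justify="right")

    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        used_mb = mem.used / 1024 ** 2
        total_mb = mem.total / 1024 ** 2
        min_mem[i] = min(min_mem[i], used_mb)
        max_mem[i] = max(max_mem[i], used_mb)
        table.add_row(
            str(i),
            f"{util.gpu:.1f}",
            f"{used_mb:.1f}",
            f"{total_mb:.1f}",
            f"{100 * used_mb / total_mb:.1f}",
            f"{min_mem[i]:.1f}",
            f"{max_mem[i]:.1f}",
        )

    console.clear()
    console.print(table)
    time.sleep(1)
```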