Russell Jurney (rjurney)

@rjurney
rjurney / git.md
Created January 8, 2026 21:37
Git command to commit changes after running pre-commit and fixing any outstanding issues
allowed-tools: Bash(git checkout --branch:*), Bash(git status:*), Bash(git commit:*), Bash(gh pr create:*), Bash(pre-commit), Edit, Read, Write
description: Prepare and commit the code already added to git

Run pre-commit to find any outstanding problems with the changes already added to git. Once they are fixed, git add only the original files I already added and re-run pre-commit. DO NOT add any other files to git. When pre-commit passes, commit the code. Don't use 'pre-commit run --all-files' to validate the code; just run 'pre-commit' so it checks only the files that are already git added. Add them before running pre-commit if you have made changes. If black says it reformatted a file, add it and then re-run pre-commit.
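For reference, a minimal Python sketch of the same loop, assuming the staged file list comes from git diff --cached --name-only; the function names are illustrative and not part of the gist:

import subprocess


def staged_files() -> list[str]:
    """Return the paths currently staged in git."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def commit_after_precommit(message: str, max_rounds: int = 5) -> None:
    """Run pre-commit on the staged files, re-staging them when hooks rewrite them, then commit."""
    files = staged_files()
    for _ in range(max_rounds):
        if subprocess.run(["pre-commit", "run", "--files", *files]).returncode == 0:
            subprocess.run(["git", "commit", "-m", message], check=True)
            return
        # Hooks like black may have rewritten files in place; re-stage only the
        # originally staged files and try again.
        subprocess.run(["git", "add", *files], check=True)
    raise RuntimeError("pre-commit is still failing after several rounds")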

@rjurney
rjurney / nvidia.js
Last active January 5, 2026 11:23
800 entities merged into one 'NVIDIA Corporation' node in a knowledge graph of the AI ecosystem :)
{
  id: null,
  uuid: "74f5a7de-e61f-413e-ace3-01d6620e7d77",
  name: "NVIDIA Corporation",
  cik: null,
  ticker: {
    id: 1,
    uuid: null,
    symbol: "NVDA",
    exchange: "UQ"
@rjurney
rjurney / uuid.py
Created December 25, 2025 01:31
A Christmas UUID to integer ID mapper for BAML - LLMs hate UUIDs, but love integers :)
from typing import Optional


class UUIDMapper:
    """Maps UUIDs to integer IDs for BAML processing and back."""

    def __init__(self) -> None:
        """Initialize the UUID mapper."""
        self.uuid_to_int: dict[str, int] = {}
        self.int_to_uuid: dict[int, str] = {}
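The preview cuts off after __init__; a minimal sketch of how the round trip likely works, with hypothetical method names (to_int, to_uuid) that do not appear above:

class UUIDMapper:
    """Maps UUIDs to integer IDs for BAML processing and back."""

    def __init__(self) -> None:
        self.uuid_to_int: dict[str, int] = {}
        self.int_to_uuid: dict[int, str] = {}

    def to_int(self, uuid: str) -> int:
        """Return a small, stable integer for a UUID, assigning one on first sight."""
        if uuid not in self.uuid_to_int:
            new_id = len(self.uuid_to_int) + 1
            self.uuid_to_int[uuid] = new_id
            self.int_to_uuid[new_id] = uuid
        return self.uuid_to_int[uuid]

    def to_uuid(self, int_id: int) -> str:
        """Translate an integer ID handed back by the LLM to its original UUID."""
        return self.int_to_uuid[int_id]

Integers are substituted before prompting and mapped back to UUIDs when the LLM's output is parsed.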
@rjurney
rjurney / SPARK.md
Created December 22, 2025 06:16
A Claude Code PySpark README - a heavily altered Palantir PySpark Style Guide

PySpark Style Guide

PySpark is a Python API that allows users to interface with an Apache Spark backend to quickly process data. Spark can operate on massive datasets across a distributed network of servers, providing major performance and reliability benefits when used correctly. It presents challenges even for experienced Python developers, as PySpark syntax draws on Spark's JVM heritage and therefore uses code patterns that may be unfamiliar.

This opinionated guide to PySpark code style presents common situations we've encountered and the associated best practices based on the most frequent recurring topics across PySpark repos.

Beyond PySpark specifics, the general practices of clean code are important in PySpark repositories; the Google Python Style Guide (PyGuide) is a strong starting point for learning more about these practices.

Prefer implicit column selection to direct access, except for disambiguation
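For example (an illustrative sketch, not the guide's own snippet), implicit selection refers to naming columns through F.col or plain strings rather than as attributes of a specific DataFrame; a join is the usual case where direct access earns its keep:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [("Ada", "Lovelace"), ("Grace", "Hopper")], ["first_name", "last_name"]
)
titles = spark.createDataFrame(
    [("Ada", "Countess"), ("Grace", "Rear Admiral")], ["first_name", "title"]
)

# Direct access ties the expression to one concrete DataFrame variable.
direct = people.select(people.first_name, people.last_name)

# Implicit selection names the column, which survives refactors that rename
# or replace the DataFrame variable.
implicit = people.select(F.col("first_name"), F.col("last_name"))

# Exception: after a join, direct access disambiguates which side a column
# comes from.
joined = people.join(titles, on="first_name").select(
    people["first_name"], titles["title"]
)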

@rjurney
rjurney / 1_query.py
Created December 16, 2025 02:25
Some commands to examine the data after you run the pipeline
threads = spark.read.parquet("data/threaded_emails.parquet")
threads.select(
    "jwz_thread_id", "id", "date", "thread_depth", "subject"
).orderBy("jwz_thread_id", "date").limit(20).show(20, False)
@rjurney
rjurney / an_explanation.txt
Created December 15, 2025 00:37
LMSS RDF --> Property Graph Experiment
I ran the simplest experiment possible using a project I created with a friend that lets the highly scalable, standard PySpark run as an MCP server: https://github.com/SemyonSinchenko/pyspark-mcp-server
It figured out how to map the RDF to a property graph on its own. The result looks okay to me, but with black boxes you obviously need a healthy amount of validation data; still, this looks promising!
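A minimal sketch of the kind of mapping involved, assuming the RDF has already been loaded as a (subject, predicate, object) triples DataFrame; the column names and the literal-versus-resource split are illustrative, not taken from the experiment:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

triples = spark.createDataFrame(
    [
        ("lmss:Contract", "rdfs:label", '"Contract"'),
        ("lmss:Contract", "rdfs:subClassOf", "lmss:LegalDocument"),
    ],
    ["subject", "predicate", "object"],
)

# Triples whose object is another resource become edges in the property graph.
is_literal = F.col("object").startswith('"')
edges = triples.filter(~is_literal).select(
    F.col("subject").alias("src"),
    F.col("predicate").alias("relationship"),
    F.col("object").alias("dst"),
)

# Literal-valued triples become node properties, one column per predicate.
properties = (
    triples.filter(is_literal)
    .groupBy("subject")
    .pivot("predicate")
    .agg(F.first("object"))
)

# Nodes are every resource that appears as a subject or as an edge target.
nodes = (
    triples.select(F.col("subject").alias("id"))
    .union(edges.select(F.col("dst").alias("id")))
    .distinct()
    .join(properties.withColumnRenamed("subject", "id"), on="id", how="left")
)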
@rjurney
rjurney / etl.sh
Created December 14, 2025 02:57
You can do general-purpose ETL like broadcast joins with LLMs these days... (a sketch of the join itself follows the log output below)
abzu process kg tickers -b 100 -n 1
2025-12-13 18:44:44,806 - abzu.kg.tickers - INFO - Using cached SEC tickers from data/sec/company_tickers.json
2025-12-13 18:44:44,806 - abzu.kg.tickers - INFO - Loading companies from data/knowledge_graph/companies.parquet
2025-12-13 18:44:45,166 - abzu.kg.tickers - INFO - Loaded 15,983 companies
2025-12-13 18:44:45,166 - abzu.kg.tickers - INFO - Loading tickers from data/sec/company_tickers.json
2025-12-13 18:44:45,174 - abzu.kg.tickers - INFO - Loaded 10,021 tickers
2025-12-13 18:44:45,176 - abzu.kg.tickers - INFO - Limited to 1 companies
2025-12-13 18:44:45,187 - abzu.kg.tickers - INFO - Processing 1 companies with 10,021 tickers
0%| | 0/1 [00:00<?, ?it/s]2025-12-13 18:44:45,196 - abzu.kg.tickers - INFO - Processing batch 1
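The logs show ~16K companies being matched against ~10K SEC tickers; a minimal sketch of the broadcast-join pattern the description alludes to, using stand-in DataFrames and illustrative column names rather than the real inputs:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Stand-ins for companies.parquet and the cached SEC company_tickers.json.
companies = spark.createDataFrame(
    [("NVIDIA Corporation",), ("Graphlet AI",)], ["name"]
)
tickers = spark.createDataFrame(
    [("NVDA", "NVIDIA Corporation"), ("AAPL", "Apple Inc.")], ["ticker", "title"]
)

# The ticker table is tiny next to the companies table, so ship a copy to
# every executor instead of shuffling both sides of the join.
matched = companies.join(
    broadcast(tickers),
    companies["name"] == tickers["title"],
    how="left",
)
matched.show(truncate=False)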
@rjurney
rjurney / error.py
Created December 7, 2025 03:01
Error when visualizing a DataFrame of edges
HTTPError
422 Client Error: Unprocessable Entity for url: https://hub.graphistry.com/api/v2/upload/datasets/299b11a0c59741db86d40613c1269cf8/nodes/arrow
See the console area for a traceback.
Traceback (most recent call last):
  File "/Users/rjurney/anaconda3/envs/weave/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 139, in execute_cell
    return eval(cell.last_expr, glbls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
@rjurney
rjurney / Data-Scientist-Coined.eml
Last active November 22, 2025 01:09
The email where the title 'data scientist' was originally coined by Jeff Hammerbacher and his team at Facebook on March 1, 2008
From: cameron marlow <[email protected]>
Date: Tue, Oct 23, 2012 at 12:25 AM
Subject: data application scientist
To: Jeff Hammerbacher <[email protected]>

Was searching for the date the term was coined, thought you might appreciate this.

Hey Cam,
@rjurney
rjurney / results.txt
Created November 11, 2025 23:34
SERF entity resolution - round two results
============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ENTITY RESOLUTION EVALUATION SUMMARY - ITERATION 2
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Original raw companies (before matching): 13,641 unique
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Companies that went into matching: 11,093
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Skipped (singletons/errors): 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - MATCHING RESULTS:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - BAML-processed companies: 4,930 unique
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Companies merged: 6,163 (55.56%)