Russell Jurney rjurney

@rjurney
rjurney / Jeff-Hammerbacher.eml
Last active November 12, 2025 02:20
The email where the title 'data scientist' was originally coined by Jeff Hammerbacher and his team at Facebook on March 1, 2008
From: cameron marlow <[email protected]>
Date: Tue, Oct 23, 2012 at 12:25 AM
Subject: data application scientist
To: Jeff Hammerbacher <[email protected]>
Was searching for the date the term was coined, thought you might appreciate this.
Hey Cam,
rjurney / results.txt
Created November 11, 2025 23:34
SERF entity resolution - round two results
============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ENTITY RESOLUTION EVALUATION SUMMARY - ITERATION 2
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Original raw companies (before matching): 13,641 unique
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Companies that went into matching: 11,093
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Skipped (singletons/errors): 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - MATCHING RESULTS:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - BAML-processed companies: 4,930 unique
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Companies merged: 6,163 (55.56%)
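The figures in this summary are consistent with simple subtraction: the skipped count is the raw count minus the companies that went into matching, and the merged count is the companies that went into matching minus the unique companies left afterward. A minimal sketch of that arithmetic follows. The variable names are hypothetical and this is not the `abzu.spark.er_eval` code, which is not shown in the preview.

```python
# Counts copied from the log summary above.
raw_companies = 13_641          # original raw companies (before matching)
went_into_matching = 11_093     # companies that went into matching
unique_after_matching = 4_930   # BAML-processed companies, unique

# Assumption: skipped = raw - matched input, merged = matched input - unique output.
skipped = raw_companies - went_into_matching
merged = went_into_matching - unique_after_matching
merged_pct = round(100 * merged / went_into_matching, 2)

print(f"Skipped: {skipped:,}")
print(f"Companies merged: {merged:,} ({merged_pct}%)")
```

Under these assumptions the computed values agree with the logged 2,548 skipped and 6,163 (55.56%) merged.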
rjurney / eridu.txt
Created October 29, 2025 00:01
Latest Eridu performance metrics
Using device: mps
Loading model from: data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/
Successfully loaded model from data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/
Loading test data from: data/fine-tuned-sbert-intfloat-multilingual-e5-base-original-adafactor-companies/test_split.parquet
Sampling 1,000 test pairs from 168,280 available
Running inference on 1,000 test pairs
Found optimal threshold: 0.6907
Running inference on 1,000 test pairs
Model Evaluation Report
rjurney / groupage.md
Last active October 2, 2025 22:24
Claude Code command to group data, count the size of the groups, find and display high / low superkeys, and sample and display grouped records
allowed-tools: pyspark-mcp, WebFetch, Web Search, Bash(python:*), Bash(poetry:*), Bash(pyspark:*)
description: Groupage command is used to group data, count the group size, plot a histogram and display both the keys of the largest groups and those of groups in a middle range. Arguments include the column to group by,

Groupage Command

Description

Realize that the unique values of fields in real-world datasets often have long-tail, log-scale distributions. This creates 'superkeys' that can cause problems in downstream code. The groupage command is used to identify and mitigate these superkeys.
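As a rough illustration of the idea, the sketch below groups records by a key, counts group sizes, buckets the sizes by order of magnitude, and lists the largest groups. It uses plain Python rather than PySpark so that it runs on its own. The function names, the `domain` field, and the sample records are all hypothetical and are not taken from the groupage command itself.

```python
from collections import Counter


def group_sizes(records, key):
    """Count how many records share each value of `key`."""
    return Counter(record[key] for record in records)


def log_histogram(sizes):
    """Bucket group sizes by order of magnitude (1-9 -> 0, 10-99 -> 1, ...)."""
    return Counter(len(str(size)) - 1 for size in sizes.values())


def top_keys(sizes, n=3):
    """Return the n keys with the largest groups: the superkey candidates."""
    return sizes.most_common(n)


# Hypothetical sample data with a long-tail distribution of key values.
records = (
    [{"domain": "gmail.com"}] * 1000
    + [{"domain": "yahoo.com"}] * 120
    + [{"domain": "example.org"}] * 3
    + [{"domain": "graphlet.ai"}]
)

sizes = group_sizes(records, "domain")
histogram = log_histogram(sizes)
largest = top_keys(sizes, n=2)

print(histogram)
print(largest)
```

On a real dataset, the same counts would more likely come from a PySpark `groupBy(...).count()`, with the order-of-magnitude histogram making the long tail visible before any join or aggregation runs on the grouped column.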

rjurney / Pynvml_Rich_GPU_Monitor.sql
Created October 2, 2025 17:51
Pynvml / Rich GPU Monitor with min/max
┏━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ GPU ID ┃ GPU Util % ┃ Mem Used (MB) ┃ Mem Total (MB) ┃ Mem % ┃ Min Mem (MB) ┃ Max Mem (MB) ┃
┡━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ 0 │ 100.0 │ 9856.0 │ 12288.0 │ 80.2 │ 9816.0 │ 9856.0 │
│ 1 │ 75.0 │ 4534.4 │ 12288.0 │ 36.9 │ 4534.4 │ 4534.4 │
└────────┴────────────┴───────────────┴────────────────┴───────┴──────────────┴──────────────┘
{
"hooks": {
"PostToolUse": [
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "if command -v osascript >/dev/null 2>&1; then osascript -e 'beep 1'; elif command -v notify-send >/dev/null 2>&1; then notify-send 'Claude Code' \"Tool: $CLAUDE_TOOL_NAME completed\"; fi"
}
rjurney / README.md
Created August 29, 2025 21:00
Graphlet AI Claude Code PySpark Guide - customized Palantir-PySpark-Guide for effective PySpark in Claude Code

Note: this style guide is an edit of the Palantir PySpark Style Guide, for which I am very grateful! You may use this one, or edit theirs, as a starting point for your own agent-based PySpark code.

Palantir PySpark Style Guide

PySpark Style Guide

PySpark is a wrapper language that allows users to interface with an Apache Spark backend to quickly process data. Spark can operate on massive datasets across a distributed network of servers, providing major performance and reliability benefits when utilized correctly. It presents challenges, even for experienced Python developers, as the PySpark syntax draws on the JVM heritage of Spark and therefore implements code patterns that may be unfamiliar.

This opinionated guide to PySpark code style presents common situations we've encountered and the associated best practices based on the most frequent recurring topics across PySpark repos.

rjurney / merged.baml
Created August 16, 2025 06:28
Merged record includes full corporate name, as determined by the 'name' field @description :)
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it is led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
rjurney / before.json
Created August 16, 2025 06:27
Company records to be merged with field description as metadata for guidance...
{
"name": "Nvidia Corporation",
"ticker": {
"symbol": "NVDA",
"exchange": "NASDAQ"
},
"description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it is led by CEO Jensen Huang and headquartered in Santa Clara, California.",
"website_url": "null",
"headquarters_location": "Santa Clara, California, USA",
"revenue_usd": 10918000000,
rjurney / company.baml
Created August 16, 2025 06:23
BAML field annotations guide extraction, matching and merging!
class Company {
name string
@description("Formal name of the company with corporate suffix")
...
}