Russell Jurney rjurney

@rjurney
rjurney / pregel.py
Created March 25, 2025 17:10
GraphFrames Pregel API - sum the ages of a node's neighbors
from graphframes.lib import AggregateMessages as AM
from graphframes.examples import Graphs
from pyspark.sql.functions import sum as sqlsum
g = Graphs(spark).friends() # Get example graph
# For each user, sum the ages of the adjacent users
msgToSrc = AM.dst["age"]
msgToDst = AM.src["age"]
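The two messages in this preview send each edge's destination age to its source and its source age to its destination, so every vertex ends up with the summed ages of its neighbors in both directions. A minimal plain-Python sketch of that idea, without Spark; the toy `ages` and `edges` data and the `sum_neighbor_ages` helper are hypothetical and are not part of GraphFrames or its example graph:

```python
from collections import defaultdict

# Hypothetical toy data, not the GraphFrames "friends" example graph.
ages = {"a": 34, "b": 36, "c": 30}
edges = [("a", "b"), ("b", "c")]  # directed (src, dst) pairs


def sum_neighbor_ages(ages, edges):
    """Sum the ages of each vertex's neighbors across both edge directions."""
    totals = defaultdict(int)
    for src, dst in edges:
        totals[src] += ages[dst]  # message to the source: the destination's age
        totals[dst] += ages[src]  # message to the destination: the source's age
    return dict(totals)


result = sum_neighbor_ages(ages, edges)
print(result)
```

A vertex with no edges receives no messages and therefore does not appear in the result.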
@rjurney
rjurney / csv_single.py
Last active February 3, 2025 13:20
A proposed monkey patch to save PySpark DataFrames into a single CSV file
import os
import glob
import shutil
import uuid
from pyspark.sql.readwriter import DataFrameWriter
def csv_single(self, path, **options):
    """
    Write the DataFrame as a single CSV file at the specified path.
@rjurney
rjurney / extract.py
Last active January 31, 2025 07:58
Relik for relation extraction on the GraphFrames paper
"""Script that tests and times Relik's relation extraction and entity linking on the GraphFrames Paper: https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf"""
import timeit
import warnings
from pprint import pprint
from relik import Relik # type: ignore
from relik.inference.data.objects import RelikOutput # type: ignore
# Squash Relik's warnings for prettier screenshots
warnings.simplefilter("ignore")
@rjurney
rjurney / paper.py
Created January 31, 2025 06:48
Extracting the text from the GraphFrames paper with PyPDF
from pypdf import PdfReader
# Load the PDF. The GraphFrames paper normally resides at
# https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf
reader = PdfReader("data/grades_graphframes.pdf")
# Extract text from all pages
# Extract each page once and skip pages that yield no text
text = "\n".join(t for page in reader.pages if (t := page.extract_text()))
# Write it to a text file
@rjurney
rjurney / comment.txt
Last active January 31, 2025 17:27
Relik: Hello, World!
You can see it linked many topics that are related (Apache Phoenix, for example) but not actually mentioned in the text...
@rjurney
rjurney / command.txt
Last active January 18, 2025 10:31
Warp.dev shell command to count my papers on graph pattern matching: graphlets and network motifs
Prompt: Find all instances of files containing the term 'motif' or 'graphlet' in this folder
or any below it. List the filenames, then print the total count of unique files.
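The search this prompt describes could also be sketched directly in Python. This is an illustration under stated assumptions, not the command Warp generated: `files_mentioning` is a hypothetical helper, matching is case-insensitive by choice, and files are read as plain text, so compressed PDF content would not be matched.

```python
import os


def files_mentioning(root, terms=("motif", "graphlet")):
    """Return sorted paths of files under root whose text contains any term."""
    matches = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    text = handle.read().lower()
            except OSError:
                continue  # unreadable file: skip it
            if any(term in text for term in terms):
                matches.add(path)
    return sorted(matches)


if __name__ == "__main__":
    found = files_mentioning(".")
    for path in found:
        print(path)
    print(f"Total unique files: {len(found)}")
```

Collecting paths in a set keeps each file counted once even if it contains both terms.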
@rjurney
rjurney / A GraphFrames Bug
Last active January 11, 2025 07:49
GraphFrames Connected Components OutOfMemoryError in Java 11 on TINY Graph...
I can't figure out why this unit test is failing with this error:
> [error] Uncaught exception when running org.graphframes.lib.ConnectedComponentsSuite: java.lang.OutOfMemoryError: Java
heap space sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
The test is an 8 node, 6 edge graph of two components and two dangling vertices. WTF heap space? I cleaned up the `Dockerfile`
below because it was on wonky versions and tried the same commands there... no go. Same exception. The weird thing is that
CI does pass these tests... so I don't get what is going wrong.
HOW YOU CAN HELP: Please run this command and tell me if the tests pass:
@rjurney
rjurney / motif.py
Last active December 27, 2024 05:12
GitHub Copilot can do network motifs...
We can also look for a more complex motif: a directed square. We will find all instances of a directed square in the graph.
<div data-lang="python" markdown="1">
{% highlight python %}
# G8: Directed Square
paths = g.find("(a)-[e]->(b); (b)-[e2]->(c); (c)-[e3]->(d); (d)-[e4]->(a)")
four_edge_count(paths).show()
{% endhighlight %}
</div>
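To make the pattern concrete, here is a small plain-Python sketch of what matching a directed square means on an edge list. It is not how GraphFrames evaluates `find`; the `directed_squares` helper and the toy edges are hypothetical. The sketch does not require the four vertices to be distinct, and it reports one match per starting vertex, so a single square appears once per rotation.

```python
from collections import defaultdict


def directed_squares(edges):
    """Return every (a, b, c, d) with edges a->b, b->c, c->d and d->a."""
    out = defaultdict(set)  # adjacency: vertex -> set of successors
    for src, dst in edges:
        out[src].add(dst)
    squares = []
    for a in list(out):
        for b in out[a]:
            for c in out.get(b, ()):
                for d in out.get(c, ()):
                    if a in out.get(d, ()):  # closing edge d->a
                        squares.append((a, b, c, d))
    return sorted(squares)


# Hypothetical toy graph: one square 1->2->3->4->1 plus a stray edge 4->5.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (4, 5)]
squares = directed_squares(edges)
print(squares)
```

Dividing the number of matches by four gives the number of distinct squares only when every match uses four different vertices.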
@rjurney
rjurney / kuzu.py
Created July 27, 2024 18:13
Loading ICIJ data into KuzuDB
import kuzu
import kuzu.connection
import kuzu.database
def create_tables(conn: kuzu.connection.Connection) -> None:
    try:
        # Create a Person node table
        conn.execute(
@rjurney
rjurney / Dockerfile.cli
Last active July 22, 2024 01:49
Senzing Dockerfile for Python environment setup
# Dockerfile for a Spark environment with Python 3.10. The image is based on the miniconda3 image
# and installs OpenJDK 17, Spark 3.5.1 with Hadoop 3 and Scala 2.13, and Poetry. It then
# installs the Python packages specified in the pyproject.toml file.
FROM continuumio/miniconda3
RUN apt-get update && \
    apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
        autoconf automake libtool pkg-config libpq5 libpq-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*