Russell Jurney rjurney
@rjurney
rjurney / paper.py
Created January 31, 2025 06:48
Extracting the text from the GraphFrames paper with PyPDF
from pypdf import PdfReader
# Load the PDF. The GraphFrames paper normally resides at
# https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf
reader = PdfReader("data/grades_graphframes.pdf")
# Extract text from all pages, skipping pages that yield no text
page_texts = (page.extract_text() for page in reader.pages)
text = "\n".join(t for t in page_texts if t)
# Write it to a text file
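The preview stops at the write step. A minimal sketch of what that step might look like follows. The helper name `write_text_file` and the output path in the usage note are hypothetical, not taken from the gist.

```python
from pathlib import Path


def write_text_file(text: str, output_path: str) -> int:
    """Write text to output_path as UTF-8 and return the number of characters written.

    Hypothetical helper: the gist's actual write step is not shown in the preview.
    """
    path = Path(output_path)
    # Create the parent folder if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_text(text, encoding="utf-8")
```

With the `text` variable from the snippet above, a call such as `write_text_file(text, "data/grades_graphframes.txt")` would save the extracted paper next to the PDF. That filename is an assumption.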
rjurney / comment.txt
Last active January 31, 2025 17:27
Relik: Hello, World!
You can see it linked many related topics, such as Apache Phoenix, that are not actually mentioned in the text...
rjurney / command.txt
Last active January 18, 2025 10:31
Warp.dev shell command to count my papers on graph pattern matching: graphlets and network motifs
Prompt: Find all instances of files containing the term 'motif' or 'graphlet' in this folder
or any below it. List the filenames, then print the total count of unique files.
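The command Warp generated is not shown in the preview. The same search can be sketched in Python, with two assumptions that are not from the gist: matching is case-insensitive, and files are read as plain text. Papers stored as PDF would need text extraction first. The function name `find_files_with_terms` is hypothetical.

```python
from pathlib import Path


def find_files_with_terms(root, terms=("motif", "graphlet")):
    """Return a sorted list of unique files under root whose text contains any term."""
    lowered = [term.lower() for term in terms]
    matches = set()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Assumption: treat every file as text and ignore undecodable bytes
            content = path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            continue  # Unreadable file: skip it rather than fail the whole scan
        if any(term in content for term in lowered):
            matches.add(path)
    return sorted(matches)
```

Calling `find_files_with_terms(".")`, printing each returned path, and then printing `len(...)` of the result gives the filename list and the unique count the prompt asks for.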
rjurney / A GraphFrames Bug
Last active January 11, 2025 07:49
GraphFrames Connected Components OutOfMemoryError in Java 11 on TINY Graph...
I can't figure out why this unit test is failing with this error:
> [error] Uncaught exception when running org.graphframes.lib.ConnectedComponentsSuite: java.lang.OutOfMemoryError: Java
heap space sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
The test is an 8-node, 6-edge graph with two components and two dangling vertices. WTF heap space? I cleaned up the `Dockerfile`
below because it was on wonky versions and tried the same commands there... no go. Same exception. The weird thing is that
CI does pass these tests... so I don't get what is going wrong.
HOW YOU CAN HELP: Please run this command and tell me if the tests pass:
rjurney / motif.py
Last active December 27, 2024 05:12
GitHub Copilot can do network motifs...
We can also look for a more complex motif: a directed square. We will find all instances of a directed square in the graph.
<div data-lang="python" markdown="1">
{% highlight python %}
# G8: Directed Square
paths = g.find("(a)-[e]->(b); (b)-[e2]->(c); (c)-[e3]->(d); (d)-[e4]->(a)")
four_edge_count(paths).show()
{% endhighlight %}
</div>
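To make concrete what the directed-square pattern matches, here is a small pure-Python sketch that enumerates the same motif over a plain edge list. It does not use GraphFrames. The function name `find_directed_squares` is hypothetical, and `four_edge_count` from the snippet above is not reproduced because its definition is not shown. The sketch assumes, as is my understanding of GraphFrames motif finding, that named vertices are not forced to be distinct, so each square is reported once per starting vertex.

```python
from collections import defaultdict


def find_directed_squares(edges):
    """Return every (a, b, c, d) with edges a->b, b->c, c->d and d->a."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    squares = []
    for a in list(successors):
        for b in successors[a]:
            for c in successors.get(b, ()):
                for d in successors.get(c, ()):
                    # Close the cycle: the fourth edge must return to a
                    if a in successors.get(d, ()):
                        squares.append((a, b, c, d))
    return squares
```

For the edge list `[(1, 2), (2, 3), (3, 4), (4, 1)]` this returns four tuples, one rotation of the single square per starting vertex. Deduplicating rotations, if wanted, is left to the caller.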
rjurney / kuzu.py
Created July 27, 2024 18:13
Loading ICIJ data into KuzuDB
import kuzu
import kuzu.connection
import kuzu.database


def create_tables(conn: kuzu.connection.Connection) -> None:
    try:
        # Create a Person node table
        conn.execute(
rjurney / Dockerfile.cli
Last active July 22, 2024 01:49
Senzing Dockerfile for Python environment setup
# Dockerfile for a Spark environment with Python 3.10. The image is based on the miniconda3 image
# and installs OpenJDK 17, Spark 3.5.1 with Hadoop 3 and Scala 2.13, and Poetry. The image then
# installs the Python packages specified in the pyproject.toml file.
FROM continuumio/miniconda3
RUN apt-get update && \
    apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
    autoconf automake libtool pkg-config libpq5 libpq-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
rjurney / complete.json
Created July 4, 2024 22:39
6 JSON Lines records in Senzing format
{
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "1",
    "RECORD_TYPE": "PERSON",
    "NAME_LIST": [
        {
            "NAME_TYPE": "PRIMARY",
            "NAME_FULL": "KIM SOO IN"
        }
    ],
rjurney / company.json
Created July 4, 2024 21:44
Example of a valid Senzing record that is an edge with only source metadata. How could I encode a second out-link without copying the source metadata?
{
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "6",
    "RECORD_TYPE": "ORGANIZATION",
    "NAME_LIST": [
        {
            "NAME_TYPE": "PRIMARY",
            "NAME_ORG": "Random Company, LTD."
        }
    ],
rjurney / download.cmd
Created July 1, 2024 21:12
Download and unzip the International Consortium of Investigative Journalists (ICIJ) knowledge graph dataset
#!/usr/bin/env bash
: '
@echo off
powershell -ExecutionPolicy Bypass -Command "$ErrorActionPreference='Stop'; $ProgressPreference='SilentlyContinue';
$output_file = 'data/full-oldb.LATEST.zip'
$extract_dir = 'data'
Write-Host "`nDownloading the ICIJ Offshore Leaks Database to $output_file`n"
Invoke-WebRequest -Uri 'https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip' -OutFile $output_file