Daniel Szemiako djszemiako

  • Basil Systems, Inc.
  • Brooklyn, NY
djszemiako / snapshot_gcs_directory.sh
Created May 14, 2026 13:54
Snapshots a GCS prefix by copying its data to the same prefix with `snapshot=` and a UNIX timestamp appended.
#!/usr/bin/env bash
set -euo pipefail
if [ $# -ne 1 ]; then
  echo "Usage: $(basename "$0") <source>"
  echo "Example: $(basename "$0") 'bucket/path/to/data'"
  echo "         $(basename "$0") 'gs://bucket/path/to/data'"
  exit 1
fi
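The preview stops at the argument check, so the copy step itself is not shown. As a minimal sketch of the naming scheme the description implies, the helper below builds a snapshot destination from a source prefix. The function name `snapshot_dest`, the optional timestamp argument, and the exact layout `<source>/snapshot=<timestamp>` are assumptions for illustration, not taken from the gist.

```shell
set -euo pipefail

# Hypothetical helper: build the snapshot destination for a source prefix.
# Assumed layout (not confirmed by the gist): gs://<source>/snapshot=<unix-timestamp>
snapshot_dest() {
  local source="$1"
  # Use the caller's timestamp if given, otherwise the current UNIX time.
  local timestamp="${2:-$(date +%s)}"
  # Accept both 'bucket/path' and 'gs://bucket/path', as in the usage text.
  source="${source#gs://}"
  # Drop one trailing slash so the result has no empty path segment.
  source="${source%/}"
  printf 'gs://%s/snapshot=%s\n' "$source" "$timestamp"
}
```

Calling `snapshot_dest 'bucket/path/to/data'` would print a destination under the same prefix, stamped with the current time. Passing the timestamp as a second argument keeps the result reproducible when the same value has to be reused across several commands.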
import pandas as pd
from datetime import datetime
from itertools import combinations
from math import comb
from sys import argv
from time import monotonic
from typing import List
MAX_FAMILY_TOKENS = 6
import pandas as pd
from datetime import datetime
from rapidfuzz.process import extractOne
from string_grouper import match_strings, group_similar_strings
from sys import argv
from time import monotonic
from uuid import uuid4
class Constants:
# Make sure the source index is actually open
POST /source_index/_open
# Put the source index in read-only mode
PUT /source_index/_settings
{
  "settings": {
    "index.blocks.write": "true"
  }
}