@qrkourier
Created November 21, 2024 22:27
Stage all text-encoded knowledge base files in a directory for RAG ingestion
#!/usr/bin/env bash
set -euo pipefail
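# Root of the source tree to scan for knowledge base files.
KNOW_SRC="$HOME/Sites/netfoundry/github"
# Work in TMPDIR if it is already set; otherwise create a fresh temporary directory.
: "${TMPDIR:=$(mktemp -d)}"
cd "$TMPDIR"
# Staging directory that receives the copied text files.
KNOW_DST="$TMPDIR/knowledge"
mkdir -p "$KNOW_DST"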
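# List every non-empty file under KNOW_SRC, skipping build output, caches,
# VCS metadata, dependency directories, and SVG images, then keep only the
# paths that file(1) does not classify as binary. The result is printed and
# saved to text_files.list. Note: cut -d: -f1 truncates any path that
# contains a colon, and paths containing newlines are not supported.
find "$KNOW_SRC" \
    \( \
        -type d \( \
            -name ".npm" -o \
            -name "build" -o \
            -name ".cache" -o \
            -name ".git" -o \
            -name ".github" -o \
            -name ".docusaurus" -o \
            -name "node_modules" -o \
            -name "_deps" -o \
            -name "_remotes" \
        \) -o \
        -name "*.svg" \
    \) -prune -o \
    -type f -size +0c -print0 | \
    xargs -0r file --mime-encoding | \
    grep -v 'binary$' | \
    cut -d: -f1 | \
    tee text_files.list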
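# Copy the listed files into KNOW_DST. The list holds absolute paths, so rsync
# reads them relative to / and recreates the full source path under KNOW_DST.
rsync -aiP --stats --files-from=text_files.list \
    --prune-empty-dirs --delete \
    / "$KNOW_DST/"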
echo "Text knowledge files have been copied to: $(realpath "$KNOW_DST")"
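
The `grep -v 'binary$' | cut -d: -f1` step in the script keeps everything before the first colon, so a path that itself contains a colon would be cut short. Below is a minimal sketch of one possible alternative, assuming input lines shaped like `file --mime-encoding` output (`<path>: <encoding>`). The helper name `filter_text_paths` and the sample paths are hypothetical and not part of the gist.

```bash
#!/usr/bin/env bash
set -euo pipefail

# filter_text_paths (hypothetical helper, not part of the gist):
# reads lines shaped like `file --mime-encoding` output, "<path>: <encoding>",
# drops the ones classified as binary, and strips only the trailing
# ": <encoding>" field so that colons inside the path survive.
filter_text_paths() {
    sed -n \
        -e '/:[[:space:]]*binary$/d' \
        -e 's/:[[:space:]]*[^:[:space:]]*$//p'
}

# Sample input written by hand to mimic file(1) output; no real files are read.
printf '%s\n' \
    '/kb/readme.md: us-ascii' \
    '/kb/logo.png: binary' \
    '/kb/notes:v2.txt:   utf-8' |
    filter_text_paths
```

If used in the script, the function would take the place of the `grep` and `cut` steps between `xargs -0r file --mime-encoding` and `tee text_files.list`. Paths containing newlines would still not be handled, since the list stays line-oriented.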