@Olex1313
Created June 16, 2025 09:18
Transcribe all university lectures from an S3 bucket with whisper.cpp
#!/bin/bash
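# Configuration
S3_BUCKET="hpc-lections"         # bucket holding the lecture recordings
WHISPER_CPP_PATH="whisper.cpp"   # path to a built whisper.cpp checkout
MODEL="large-v3"                 # expects models/ggml-large-v3.bin inside that checkout
# Spoken language passed to whisper-cli. Note that LANG is also the POSIX locale
# variable, so this assignment can change the locale seen by child processes.
LANG="ru"
AUDIO_EXTRACT=true               # convert each MP4 to 16 kHz mono WAV before transcribing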
command -v aws >/dev/null 2>&1 || { echo >&2 "AWS CLI required."; exit 1; }
command -v ffmpeg >/dev/null 2>&1 && HAS_FFMPEG=true || HAS_FFMPEG=false
if [ "$AUDIO_EXTRACT" = true ] && [ "$HAS_FFMPEG" = false ]; then
    echo >&2 "FFmpeg required."; exit 1
fi
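TMP_DIR=$(mktemp -d)
# Remove the temporary directory on any exit, including errors and Ctrl-C.
trap 'rm -rf "$TMP_DIR"' EXIT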
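# Collect the keys of all MP4 objects. The awk program strips the date, time, and size
# columns instead of printing field 4, so keys that contain spaces are kept whole.
FILES=()
while IFS= read -r KEY; do
    FILES+=("$KEY")
done < <(aws s3 ls "s3://${S3_BUCKET}" --recursive | awk '/\.mp4$/ { sub(/^[^ ]+ +[^ ]+ +[0-9]+ /, ""); print }')
[ "${#FILES[@]}" -eq 0 ] && { echo "No MP4 files found."; exit 0; }
for FILE in "${FILES[@]}"; do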
    FILENAME=$(basename "$FILE")
    BASENAME="${FILENAME%.*}"
    LOCAL_MP4="$TMP_DIR/$FILENAME"
    LOCAL_AUDIO="$TMP_DIR/$BASENAME.wav"
    LOCAL_TRANSCRIPT="$TMP_DIR/$BASENAME.txt"
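    # Skip this file if the download fails instead of transcribing a missing input.
    aws s3 cp "s3://${S3_BUCKET}/${FILE}" "$LOCAL_MP4" \
        || { echo >&2 "Download failed: $FILE"; continue; }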
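    if [ "$AUDIO_EXTRACT" = true ]; then
        # Convert to 16 kHz mono 16-bit PCM WAV, the input format whisper.cpp expects.
        ffmpeg -nostdin -i "$LOCAL_MP4" -vn -ar 16000 -ac 1 -c:a pcm_s16le "$LOCAL_AUDIO" -y \
            || { echo >&2 "Audio extraction failed: $FILE"; continue; }
        INPUT_FILE="$LOCAL_AUDIO"
    else
        # Only works if whisper-cli was built with a decoder that can read MP4.
        INPUT_FILE="$LOCAL_MP4"
    fi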
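    # stdout (the timestamped transcript) is saved via tee; stderr (model-loading and
    # timing logs) stays on the terminal so it does not end up in the uploaded file.
    # -ml 1 limits each segment to one character, which gives word-level timestamps.
    "$WHISPER_CPP_PATH/build/bin/whisper-cli" -m "$WHISPER_CPP_PATH/models/ggml-$MODEL.bin" -f "$INPUT_FILE" -l "$LANG" \
        -otxt -ml 1 | tee "$LOCAL_TRANSCRIPT"
    [ "${PIPESTATUS[0]}" -eq 0 ] || { echo >&2 "Transcription failed: $FILE"; continue; }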
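    # Upload the transcript next to the source video, with the extension swapped to .txt.
    S3_TRANSCRIPT_PATH="${FILE%.*}.txt"
    aws s3 cp "$LOCAL_TRANSCRIPT" "s3://${S3_BUCKET}/${S3_TRANSCRIPT_PATH}"
    # "$INPUT_FILE.txt" is the extra plain-text file that -otxt writes beside the input.
    rm -f "$LOCAL_MP4" "$LOCAL_AUDIO" "$LOCAL_TRANSCRIPT" "$INPUT_FILE.txt"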
done
rm -rf "$TMP_DIR"
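
The key handling in the script above can be checked without touching S3. The following is a minimal sketch, not part of the original gist: `extract_mp4_keys` and `transcript_key_for` are hypothetical helper names, and the sample listing lines are made up to mimic the date, time, size, key layout that `aws s3 ls --recursive` prints.

```bash
#!/bin/bash
# Hypothetical helpers (not in the original script) for testing key handling offline.

# Read `aws s3 ls --recursive` style lines on stdin and print only the MP4 keys.
# The date, time, and size columns are stripped, so spaces inside a key survive.
extract_mp4_keys() {
    awk '/\.mp4$/ { sub(/^[^ ]+ +[^ ]+ +[0-9]+ /, ""); print }'
}

# Map a video key to the key its transcript is uploaded under (same prefix, .txt).
transcript_key_for() {
    printf '%s\n' "${1%.*}.txt"
}

# Demo on made-up listing lines; no AWS access is needed.
printf '%s\n' \
    '2025-06-16 09:18:00    1048576 week 1/intro lecture.mp4' \
    '2025-06-16 09:18:01        120 week 1/notes.txt' \
    | extract_mp4_keys \
    | while IFS= read -r key; do
        echo "$key -> $(transcript_key_for "$key")"
    done
```

Printing only the fourth whitespace-separated field would cut `week 1/intro lecture.mp4` down to `week`, which is why the sketch removes the leading columns instead.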