@psychemedia
Last active October 1, 2022 22:16
Example of setting up a Docker image containing a suite of related command-line applications: audiogrep, videogrep and youtube_dl.

Audiogrep / Videogrep Tools

Docker image containing several tools for tinkering with audio and video files.

The Dockerfile is adapted from the one in kevinhughes27/audiogrep-docker; it adds a patch, some additional utilities and a shared folder. The other files from that repository are required to build the image.

Original audiogrep docs here: antiboredom/audiogrep

See also these examples of what audiogrep can do.

audiogrep also makes use of:

#create shared folder on host
mkdir -p files

#The image needs to be built in a directory that also contains the other files from: https://github.com/kevinhughes27/audiogrep-docker
docker build -t psychemedia/avgrep .
#Transcribe an audio file
docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep audiogrep --input /avgrepfiles/MYFILE.mp3 --transcribe
#The transcription seems to chunk the audio file and produce a separate transcript file for each chunk
#The audiogrep search seems to want a single transcript with a different filename
#Create the single transcript file
cat files/MYFILE*.txt >> MYFILE.mp3.transcription.txt

#Generate a supercut
docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep  audiogrep --input /avgrepfiles/MYFILE.mp3 --search 'transparency | honest | health' --output /avgrepfiles/supercut.mp3 --regex --output-mode word
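
The docker run commands above all repeat the same boilerplate, so a small shell function can cut down on typing. This is only a sketch: the function name avgrep and the AVGREP_DOCKER variable are made up for illustration (neither comes from audiogrep or Docker), and it assumes the image has been built with the psychemedia/avgrep tag and that the shared folder is ./files.

```shell
# Hypothetical helper: wraps the repeated "docker run" boilerplate.
# AVGREP_DOCKER is an assumed override (defaults to "docker");
# setting it to "echo" prints the command instead of running it.
avgrep() {
  "${AVGREP_DOCKER:-docker}" run \
    --volume "${PWD}/files":/avgrepfiles \
    --tty --interactive --rm \
    psychemedia/avgrep "$@"
}
```

Usage would then be along the lines of: avgrep audiogrep --input /avgrepfiles/MYFILE.mp3 --transcribe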

videogrep is also included in the container, but untested. Original videogrep docs here: antiboredom/videogrep

See also this example of what videogrep can do.

To help grab files from YouTube, youtube_dl is also included in the container.

Usage is along the lines of:

 docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep youtube-dl --extract-audio --audio-format mp3 -o '/avgrepfiles/%(id)s.mp3' 'https://www.youtube.com/watch?v=YOUTUBE_ID'

Using a couple of test audio files with UK English speakers, I couldn't replicate anything like the original demos. Transcription was poor, the timing seemed really off (and didn't match the searched-for words), and some of the splices were very long segments (minutes long). In the transcript, only single words seemed to be identified, so I'm not sure how phrase identification is supposed to work.

I haven't looked at the code, but it might be worth generating a few reports over the extracted words to help identify sensible phrases. Something like nltk concordancing relative to a single word or multiple words would add another dimension to the reporting, and help the user spot phrases keyed on a keyword in the text, rather than in the audio. (Adding the ability for the concordancer to act on OR'd words is a feature we can perhaps take away from audiogrep - I'll add it to my to do list!;-)
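
As a rough stand-in for that sort of concordance report, here's a keyword-in-context sketch in the shell rather than nltk. The function name kwic is made up for illustration, it relies on GNU grep (for -o and \b), and it assumes the combined transcript is plain text with the recognised words on each line, which I haven't checked against what audiogrep actually writes out.

```shell
# kwic PATTERN FILE [WIDTH]
# Hypothetical helper: prints each match of PATTERN (an extended regex,
# so OR'd words can be written as word1|word2) together with up to
# WIDTH characters of context on either side (default 30).
# Matching is case-insensitive. Matches reported by grep -o do not
# overlap, so a keyword that falls inside the trailing context of an
# earlier match is not listed again on its own line.
kwic() {
  grep -oiE ".{0,${3:-30}}\\b($1)\\b.{0,${3:-30}}" "$2"
}
```

Usage would be something like: kwic 'transparency|honest|health' MYFILE.mp3.transcription.txt 40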

#Based on https://github.com/kevinhughes27/audiogrep-docker
# DOCKER-VERSION 1.4.0
FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y software-properties-common
# FFMPEG
#The repository needs updating from the original
#Note that ffmpeg is not available in the standard Ubuntu 14.04 repositories: http://www.faqforge.com/linux/how-to-install-ffmpeg-on-ubuntu-14-04/
RUN apt-add-repository ppa:mc3man/trusty-media
RUN apt-get update
RUN apt-get install -y ffmpeg
# PocketSphinx
RUN apt-get install -y pocketsphinx-utils
RUN apt-get install -y pocketsphinx-hmm-wsj1
RUN apt-get install -y pocketsphinx-lm-wsj
# python
RUN apt-get install -y git python python-pip python-dev
# audiogrep
RUN git clone https://github.com/antiboredom/audiogrep.git
RUN cd audiogrep && pip install -r requirements.txt && \
chmod +x audiogrep/audiogrep.py && cp audiogrep/audiogrep.py /usr/bin/audiogrep
#RUN pip install audiogrep
RUN pip install moviepy
RUN pip install videogrep
#Tools to support grabbing of a/v files
#youtube_dl via https://electricarchaeology.ca/2016/04/19/audiogrep/
RUN pip install youtube_dl
RUN mkdir -p /avgrepfiles
VOLUME /avgrepfiles