Open Source AI Scribe / Auto-Transcriber / Speech-to-text Transcriptions / Captions & Subtitles Exporter / Interactive Transcripts / Alternative to Otter.ai, Descript, Sonix.ai

Hello world!

As a video editor, researcher, digital media enthusiast, and lover of all things FLOSS, I've been on the hunt for an open source alternative to proprietary services like Otter.ai, Sonix, and Descript. I've pitched my idea on open-source-ideas, but I wanted to create a dedicated post for it so that it can reach as many people as possible.

Project description

The idea

A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track, clickable, and editable, so that users can skip to certain passages and refine the transcript accordingly.
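To make the "synced, clickable, and editable" part a bit more concrete, here is a minimal sketch of how a timed transcript could be modelled. Everything in it (the `Word` class and the function names) is hypothetical and only meant to illustrate the idea: each word keeps its start and end time, so the app can highlight the word being spoken, jump to a word when it is clicked, and let the user fix its text without losing the timing.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def word_index_at(words: List[Word], playback_time: float) -> Optional[int]:
    """Return the index of the word spoken at playback_time, or None during silence."""
    # Rebuilding this list on every call is fine for a sketch; a real app would cache it.
    starts = [w.start for w in words]
    i = bisect_right(starts, playback_time) - 1
    if i >= 0 and playback_time < words[i].end:
        return i
    return None


def seek_time_for(words: List[Word], clicked_index: int) -> float:
    """Return where the player should jump to when a word is clicked."""
    return words[clicked_index].start


def edit_word(words: List[Word], index: int, new_text: str) -> None:
    """Correct a word's text while keeping its timing untouched."""
    words[index].text = new_text


transcript = [Word("hello", 0.0, 0.4), Word("word", 0.5, 0.9)]
edit_word(transcript, 1, "world")  # the user fixes a recognition mistake
```

With this structure, editing and seeking are independent: correcting the text never moves a timestamp, which is what keeps the transcript in sync with the audio.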

The revised transcript can then be exported as plain text, .srt caption file (and other subtitle formats), .pdf, shareable web page, etc. for further processing.
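As an illustration of the .srt export, here is a small sketch that turns a list of timed cues into SubRip text. The function names and the `(start, end, text)` tuple layout are my own assumptions, not taken from any of the tools mentioned in this post; the output follows the usual .srt layout of a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and the caption text.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an .srt timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Build .srt text from an iterable of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


srt_text = to_srt([(0.0, 1.5, "Hello world!"), (1.5, 3.0, "Second cue.")])
```

Other subtitle formats could reuse the same cue list with a different formatter.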

Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.

This application could be either a web app that you access from a browser and that uses local storage, or a downloadable app (built with something like Electron).

Inspiration, and the "Why"

As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai. They're very easy to use and provide pretty accurate transcriptions.


Issues, and what's missing in existing tools

That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).

Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.

Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons (see this Reddit thread).

Relevant Technology

I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.

Speech-to-text

Vosk Browser

VOSK.Broswer.mp4

VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.

According to the dev, "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."

Potential ways to build upon this project:

  • Adding punctuation: I've found a number of punctuation restoration projects on here that could help with that such as punctuator2 and its many forks such as PunkProse. Punctuator2 even has a nifty demo which you can try out here. I also found an implementation of PunkProse + VOSK here.
  • Making the transcript editable
  • Adding timings that are synced to the audio (I assume that the live dictation would have to be recorded)
  • The ability to export the work as a subtitle/caption file
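For the timing and export ideas in the list above, here is a rough sketch of how word-level timings could be grouped into caption-sized chunks. It assumes the recognizer can return one entry per word with `word`, `start`, and `end` fields (as far as I understand, Vosk can do this when word output is enabled, but please treat that as an assumption). The function names and the default thresholds are hypothetical.

```python
def make_cue(words):
    """Collapse a run of word entries into one (start, end, text) cue."""
    text = " ".join(w["word"] for w in words)
    return (words[0]["start"], words[-1]["end"], text)


def group_words_into_cues(words, max_words=7, max_gap=0.8):
    """Start a new cue when the current one is full or the speaker pauses."""
    cues = []
    current = []
    for word in words:
        if current:
            pause = word["start"] - current[-1]["end"]
            if len(current) >= max_words or pause > max_gap:
                cues.append(make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(make_cue(current))
    return cues


sample_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.5, "end": 3.0},  # follows a 1.6 second pause
]
cues = group_words_into_cues(sample_words)
```

Each resulting cue has a start time, an end time, and its text, which is all a subtitle format needs.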

Check out the Demo: https://ccoreilly.github.io/vosk-browser/ View GitHub Repo: https://github.com/ccoreilly/vosk-browser

ideasman42/nerd-dictation

Uses the VOSK API, but is meant for Linux and has to be installed from the command line. It also doesn't have a clickable transcript.

nerd-dictation.mp4

Video demo

Source code can be viewed here

saharmor/realtime-transcription-playground

Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.

Real-time.transcription.demo.mp4

Source code can be viewed here

STTWebApp

Web application that uses VOSK to transcribe audio to text in Portuguese. It would be great if users could supply the language model of their choice.

Source code can be viewed here


Clickable, Interactive Transcript

AblePlayer

Able Player is a fully accessible, open-source, cross-browser HTML5 media player. It's not a speech-to-text API, but the player has a really neat clickable transcript feature that can be seen in the following example:

AblePlayer.mp4

The source code can be viewed here.


Subtitle + Transcript Editors + Previewers

oTranscribe

oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.


There's even an oTranscribe for Electron fork that could be interesting to look into.

Drawbacks:

  • No speech-to-text
  • Cannot export to .srt (although an .otr to .srt conversion is possible with this external tool)
  • Cannot edit timestamps as text
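On the last drawback: letting users edit timestamps as text mostly comes down to parsing what they typed back into seconds. Here is a small sketch of that, assuming timestamps are written as `SS`, `MM:SS`, or `HH:MM:SS` with an optional fraction after a dot or comma. The `parse_timestamp` name is hypothetical and this is not how oTranscribe itself works.

```python
def parse_timestamp(text: str) -> float:
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' (optionally with .mmm or ,mmm) to seconds."""
    parts = text.strip().replace(",", ".").split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unrecognised timestamp: {text!r}")
    seconds = 0.0
    for part in parts:
        # float() raises ValueError on anything that isn't a number.
        seconds = seconds * 60 + float(part)
    return seconds
```

For example, `parse_timestamp("2:30")` gives 150 seconds, so the editor could move a timestamp marker to wherever the user typed.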

View the website here: https://otranscribe.com/ View the repo here: https://github.com/oTranscribe

Hyperaudio

Hyperaudio seems to be working on an exciting suite of open interactive transcript tools which allow people to Navigate, Search and Edit transcripts!

In particular, I want to highlight the following tools, which could be of interest:

Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT generated timed transcripts

Hyperaudio.Editor.mp4

Hyperaudio Lite: a super-lightweight interactive transcript player

Hyperaudio Converter: converts from JSON/SRT to an HTML-based interactive transcript

Hyperaudio Website for now: https://lab.hyperaud.io/ Official Website: https://hyper.audio/
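To show what an HTML-based interactive transcript could look like under the hood, here is a sketch that wraps each timed word in a `<span>` carrying its start time and duration in milliseconds. The `data-m` and `data-d` attribute names are modelled on what I understand Hyperaudio Lite to use, so treat them as an assumption; the function name and the input format are hypothetical too.

```python
from html import escape


def words_to_html(words) -> str:
    """Render word entries ({'word', 'start', 'end'} in seconds) as one paragraph."""
    spans = []
    for w in words:
        start_ms = round(w["start"] * 1000)
        duration_ms = round((w["end"] - w["start"]) * 1000)
        # escape() keeps characters such as & and < from breaking the markup.
        spans.append(
            f'<span data-m="{start_ms}" data-d="{duration_ms}">{escape(w["word"])} </span>'
        )
    return "<p>" + "".join(spans) + "</p>"


html_snippet = words_to_html([{"word": "R&D", "start": 1.0, "end": 1.5}])
```

A small script on the page could then read those attributes to seek the player when a word is clicked and to highlight words during playback.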


All-rounders

Kdenlive

The open-source video editor Kdenlive introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to install Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.


View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts

Video Transcriber

Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems to be close to what I have in mind. It's a prototype made with journalists and media professionals in mind.

Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix Account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out-of-the-box.


View the repo: https://github.com/glitchdigital/video-transcriber


Complexity and required time

I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.

Complexity

  • Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
  • Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
  • Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

  • Little work - A couple of days
  • Medium work - A week or two
  • Much work - The project will take more than a couple of weeks and serious planning is required

Categories

  • Mobile app
  • IoT
  • Web app
  • Frontend/UI
  • AI/ML
  • APIs/Backend
  • Voice Assistant
  • Developer Tooling
  • Extension/Plugin/Add-On
  • Design/UX
  • AR/VR
  • Bots
  • Security
  • Blockchain
  • Futuristic Tech/Something Unique

My own programming (?) skills are limited to HTML, basic CSS, and the tiniest bit of JavaScript. As such, I'm sharing my findings and proposed idea here in the hope that more competent coders can bring this to life.
