Hello world!
As a video editor, researcher, digital media enthusiast, and lover of all things FLOSS, I've been on the hunt for an open source alternative to proprietary services like Otter.ai, Sonix, and Descript. I've pitched my idea on open-source-ideas, but I wanted to create a dedicated post for it so that it can reach as many people as possible.
A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track, clickable, and editable, so that users can skip to certain passages and refine the transcript accordingly.
The revised transcript can then be exported as plain text, an .srt caption file (or another subtitle format), a .pdf, a shareable web page, etc. for further processing.
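To make the export step a bit more concrete, here is a minimal Python sketch of writing .srt output. It assumes the transcript is already available as a list of timed segments (start and end in seconds, plus text). The names `Segment`, `format_srt_timestamp`, and `segments_to_srt` are made up for illustration and don't come from any existing project.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One caption: start/end in seconds plus the (edited) text. Hypothetical type."""
    start: float
    end: float
    text: str


def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form used by .srt files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered .srt blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{format_srt_timestamp(seg.start)} --> {format_srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Segment(0.0, 1.5, "Hello world"), Segment(1.5, 3.0, "Second caption")]
    print(segments_to_srt(demo))
```

The other export formats (plain text, web page) could be produced from the same list of segments, so the editor would only need one internal representation.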
Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.
This application could be a web app that you access from a browser and that uses local storage, or a downloadable app (using something like Electron).
As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai. They're very easy to use and provide pretty accurate transcriptions.
That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).
Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.
Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons (see this Reddit thread).
I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.
VOSK.Broswer.mp4
VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.
According to the dev, "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."
Potential ways to build upon this project:
- Adding punctuation: I've found a number of punctuation restoration projects on here that could help with that, such as punctuator2 and its many forks, including PunkProse. Punctuator2 even has a nifty demo which you can try out here. I also found an implementation of PunkProse + VOSK here.
- Making the transcript editable
- Adding timings that are synced to the audio (I assume that the live dictation would have to be recorded)
- The ability to export the work as a subtitle/caption file
Check out the Demo: https://ccoreilly.github.io/vosk-browser/
View GitHub Repo: https://github.com/ccoreilly/vosk-browser
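For the timing and caption ideas in the list above, here is a rough Python sketch of turning word-level timings into caption-sized segments. It assumes each word arrives as a dictionary with `word`, `start`, and `end` keys (in seconds), which is the shape VOSK's JSON results take when word timings are enabled. The function name `group_words` and the `max_gap` and `max_words` thresholds are my own placeholders, not part of vosk-browser.

```python
import json


def group_words(words, max_gap=0.8, max_words=8):
    """Split timed words into segments at long pauses or after max_words words."""
    groups = []
    current = []
    for word in words:
        pause_too_long = current and word["start"] - current[-1]["end"] > max_gap
        if current and (pause_too_long or len(current) >= max_words):
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"] for w in group),
        }
        for group in groups
    ]


if __name__ == "__main__":
    # Hand-written sample in the assumed result shape, not real recognizer output.
    sample = json.loads(
        '{"result": ['
        '{"word": "hello", "start": 0.0, "end": 0.4},'
        '{"word": "world", "start": 0.5, "end": 0.9},'
        '{"word": "next", "start": 2.5, "end": 2.8}'
        '], "text": "hello world next"}'
    )
    for segment in group_words(sample["result"]):
        print(segment)
```

Each resulting segment has a start time, an end time, and text, which is all a subtitle exporter needs.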
Uses the VOSK API, but it is meant for Linux and has to be installed from the command line. It also doesn't have a clickable transcript.
nerd-dictation.mp4
Source code can be viewed here
Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.
Real-time.transcription.demo.mp4
Source code can be viewed here
Web application that uses VOSK to transcribe audio to text in Portuguese. It would be great if users could supply the language model of their choice.
Source code can be viewed here
Able Player is a fully accessible, open-source, cross-browser HTML5 media player. It's not a speech-to-text API, but the player has a really neat clickable transcript feature that can be seen in the following example:
AblePlayer.mp4
The source code can be viewed here.
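The logic behind a clickable, synced transcript is fairly small. Here is a sketch in Python of the lookup that decides which word to highlight at a given playback time, assuming the word start times are stored in a sorted list. This is not how Able Player does it internally (I haven't checked), and `active_word_index` is a hypothetical name.

```python
from bisect import bisect_right


def active_word_index(starts, playback_time):
    """Return the index of the last word that starts at or before playback_time.

    `starts` must be sorted ascending. Returns None before the first word.
    """
    index = bisect_right(starts, playback_time) - 1
    return index if index >= 0 else None


if __name__ == "__main__":
    word_starts = [0.0, 0.5, 2.5]  # start time of each word, in seconds
    print(active_word_index(word_starts, 0.7))
```

Clicking a word is the reverse operation: the player would seek to that word's start time.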
oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.
There's even an oTranscribe for Electron fork that could be interesting to look into.
Drawbacks:
- No speech-to-text
- Cannot export to .srt (although an .otr to .srt conversion is possible with this external tool)
- Cannot edit timestamps as text
View the website here: https://otranscribe.com/
View the repo here: https://github.com/oTranscribe
Hyperaudio seems to be working on an exciting suite of open interactive transcript tools which allow people to Navigate, Search and Edit transcripts!
In particular, I want to highlight the following tools, which could be of interest:
Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT generated timed transcripts
Hyperaudio.Editor.mp4
Hyperaudio Lite: a Super-lightweight Interactive Transcript Player
Hyperaudio Converter: converts from JSON/SRT to HTML Based Interactive Transcript
- Site: https://hyperaud.io/converter/converter.html
- Repo: https://github.com/hyperaudio/ha-converter
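To illustrate the general idea of an HTML-based interactive transcript, here is a small Python sketch that wraps each timed word in a `<span>` carrying its timings. The `data-start` and `data-end` attribute names are placeholders I picked for this example. They are not Hyperaudio's actual markup, and `words_to_html` is a hypothetical helper.

```python
from html import escape


def words_to_html(words):
    """Wrap each timed word in a span so a player script can find its timings."""
    spans = [
        f'<span data-start="{w["start"]:.2f}" data-end="{w["end"]:.2f}">'
        f'{escape(w["word"])}</span>'
        for w in words
    ]
    return "<p>" + " ".join(spans) + "</p>"


if __name__ == "__main__":
    demo = [
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
    ]
    print(words_to_html(demo))
```

A bit of JavaScript on the page could then read those attributes to highlight the current word and to seek when a word is clicked.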
Hyperaudio Website for now: https://lab.hyperaud.io/
Official Website: https://hyper.audio/
Kdenlive, the open-source video editor, introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.
View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts
Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems to be close to what I have in mind. It's a prototype made with journalists and media professionals in mind.
Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix Account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out-of-the-box.
View the repo: https://github.com/glitchdigital/video-transcriber
I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.
- Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
- Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
- Advanced - The project requires the user to have a good understanding of all components of the project to contribute
- Little work - A couple of days
- Medium work - A week or two
- Much work - The project will take more than a couple of weeks and serious planning is required
- Mobile app
- IoT
- Web app
- Frontend/UI
- AI/ML
- APIs/Backend
- Voice Assistant
- Developer Tooling
- Extension/Plugin/Add-On
- Design/UX
- AR/VR
- Bots
- Security
- Blockchain
- Futuristic Tech/Something Unique
My own programming (?) skills are limited to HTML, basic CSS, and the tiniest bit of JavaScript. As such, I'm sharing my findings and proposed idea here in the hope that more competent coders can bring this to life.