Interesting Open Source (for Commercial Use) Generative AI models

Speech to Text

The original Whisper model is a good speech-to-text transcription model that is used in many places: https://huggingface.co/openai/whisper-large-v3
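A minimal sketch of running it locally, assuming the Hugging Face transformers automatic-speech-recognition pipeline (the audio file name is a placeholder):

import torch
from transformers import pipeline

# Load Whisper large-v3 behind the ASR pipeline; fp16 on GPU keeps memory manageable.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device=0 if torch.cuda.is_available() else -1,
)

result = asr("example.wav")
print(result["text"])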

Text to Speech

WhisperSpeech is a good text-to-speech model with voice cloning that uses an MIT license (unlike Coqui and Suno): https://huggingface.co/WhisperSpeech/WhisperSpeech. It isn't the "best" model, but for its size it is very, very good.
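A minimal sketch, assuming the Pipeline / generate_to_file API shown in the WhisperSpeech README (the model reference and output path are placeholders):

from whisperspeech.pipeline import Pipeline

# Downloads the referenced text-to-semantic and semantic-to-acoustic checkpoints on first run.
pipe = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech, a small but capable open TTS model.")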

Another alternative is a purely ONNX-driven model, sponsored by txtai: https://huggingface.co/NeuML/ljspeech-jets-onnx
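A minimal sketch, assuming the txtai TextToSpeech pipeline from the model card (22050 Hz is the LJSpeech sample rate):

import soundfile as sf
from txtai.pipeline import TextToSpeech

# The pipeline downloads the ONNX model and runs it with onnxruntime, no PyTorch required.
tts = TextToSpeech("NeuML/ljspeech-jets-onnx")
speech = tts("Speech generated from a purely ONNX driven model")
sf.write("speech.wav", speech, 22050)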

Image Generation

For image generation, FLUX.1-schnell from the (ex) Stable Diffusion team is open source and free for commercial use (the other FLUX.1 models have different licenses): https://huggingface.co/black-forest-labs/FLUX.1-schnell. An online on-demand generator is available here: https://replicate.com/black-forest-labs/flux-schnell
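A minimal sketch, assuming the diffusers FluxPipeline (schnell is guidance-distilled for a handful of steps, hence the parameters below):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for VRAM on smaller GPUs

image = pipe(
    "a watercolor fox reading a book",
    guidance_scale=0.0,        # schnell is distilled to run without classifier-free guidance
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
image.save("flux-schnell.png")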

Image and Video Segmentation

Meta's SAM 2 model is perfect for image and video segmentation and uses the Apache 2.0 license: https://github.com/facebookresearch/segment-anything-2

Hopefully, with some tooling, we can combine it with inpainting for image editing.
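A minimal sketch of click-prompted segmentation, assuming the image predictor API from the SAM 2 README (the config/checkpoint paths and click coordinates are placeholders):

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths to a downloaded checkpoint and its matching config.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("photo.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One positive click; several candidate masks come back with confidence scores.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

# The best mask would become the editing region handed to an inpainting model.
best_mask = masks[np.argmax(scores)]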

Create an end-to-end application framework for data-intensive applications (see the sketch after this list), including:

  • Python server (RESTful)
  • nats.io for processing events
  • bytewax for doing transformations (e.g. writing logs out) without relying on cloud services like Kinesis Firehose
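A minimal sketch of the first two pieces, assuming FastAPI for the RESTful server and the nats-py client (the subject name "events.ingest" is a placeholder); a bytewax dataflow would subscribe to that subject and do the transformations/log writing downstream:

import json

import nats
from fastapi import FastAPI

app = FastAPI()
nc = None

@app.on_event("startup")
async def startup():
    global nc
    nc = await nats.connect("nats://localhost:4222")

@app.on_event("shutdown")
async def shutdown():
    await nc.drain()

@app.post("/events")
async def publish_event(payload: dict):
    # Fire-and-forget publish; downstream consumers handle the heavy lifting.
    await nc.publish("events.ingest", json.dumps(payload).encode())
    return {"status": "queued"}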

Look into how to convert audio into an animation of a mouth moving:

https://github.com/DanielSWolf/rhubarb-lip-sync

Basically, output a dat file, then match the mouth shapes to the frame rate and animate(?)
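A minimal sketch of that matching step, assuming Rhubarb's default two-column TSV output (timestamp in seconds, mouth-shape letter) and a 24 fps animation:

def load_mouth_cues(path):
    # Each line of Rhubarb's TSV output looks like "0.25\tB" (seconds, mouth shape).
    cues = []
    with open(path) as handle:
        for line in handle:
            timestamp, shape = line.strip().split("\t")
            cues.append((float(timestamp), shape))
    return cues

def shape_at_frame(cues, frame, fps=24):
    # Return the most recent mouth shape at this frame's timestamp ("X" = rest position).
    t = frame / fps
    shape = "X"
    for cue_time, cue_shape in cues:
        if cue_time > t:
            break
        shape = cue_shape
    return shape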

LLM-generated code to do this without the repo above, using Python with pocketsphinx:

import pocketsphinx as ps
import nltk
from functools import lru_cache
from itertools import product as iterprod

# Load the CMU pronouncing dictionary (ARPAbet), downloading it on first use.
try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    nltk.download("cmudict")
    arpabet = nltk.corpus.cmudict.dict()


@lru_cache()
def wordbreak(s):
    """Return ARPAbet pronunciations for s, splitting it into two dictionary words if needed."""
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    # Try split points closest to the middle of the word first.
    middle = len(s) / 2
    partition = sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x)
    for i in partition:
        pre, suf = s[:i], s[i:]
        if pre in arpabet and wordbreak(suf) is not None:
            return [x + y for x, y in iterprod(arpabet[pre], wordbreak(suf))]
    return None


# pocketsphinx 5.x ships with a default US English model, so no explicit config is needed.
decoder = ps.Decoder()
stream = open("example.wav", "rb")
stream.read(44)  # skip the RIFF/WAV header; process_raw expects raw 16 kHz 16-bit mono PCM

decoder.start_utt()
current_time = 0
last_frame = 0

word_list = []
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        # Collect (word, phones, start, end) for each segment in the current hypothesis.
        temp_word_list = []
        for seg in decoder.seg():
            # Strip pronunciation variants like "the(2)" before the dictionary lookup.
            word = wordbreak(seg.word.split("(")[0])
            if word is not None and isinstance(word[0], list):
                word = word[0]  # keep the first pronunciation only
            start_frame = current_time + seg.start_frame
            end_frame = current_time + seg.end_frame
            temp_word_list.append((seg.word, word, start_frame, end_frame))
            last_frame = seg.end_frame

        # Segment frames restart at zero for every utterance, so keep a running offset.
        current_time += last_frame
        print(temp_word_list)
        word_list.extend(temp_word_list)
        decoder.end_utt()
        decoder.start_utt()

decoder.end_utt()
stream.close()

Then map the phones to one of the 7 mouth positions, and animate with OpenCV/moviepy.
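A minimal sketch of that last step with moviepy, assuming mouth images named mouth_<shape>.png and an illustrative (not canonical) ARPAbet-to-mouth-shape mapping; word_list comes from the script above, where pocketsphinx frames are 10 ms (100 per second):

from moviepy.editor import ImageClip, concatenate_videoclips

# Illustrative mapping from ARPAbet phones to mouth images; anything else falls back to "rest".
PHONE_TO_MOUTH = {
    "AA": "open", "AE": "open", "AH": "open", "AO": "open",
    "B": "closed", "M": "closed", "P": "closed",
    "F": "teeth", "V": "teeth",
    "OW": "round", "UW": "round", "W": "round",
}

def clips_for_words(word_list, frames_per_second=100):
    # word_list entries are (word, phones, start_frame, end_frame) from the pocketsphinx script.
    clips = []
    for _, phones, start, end in word_list:
        duration = max(end - start, 1) / frames_per_second
        phones = phones or ["SIL"]
        per_phone = duration / len(phones)
        for phone in phones:
            shape = PHONE_TO_MOUTH.get(phone.rstrip("012"), "rest")  # drop stress markers
            clips.append(ImageClip(f"mouth_{shape}.png", duration=per_phone))
    return clips

video = concatenate_videoclips(clips_for_words(word_list))
video.write_videofile("mouth.mp4", fps=24)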
