Interesting Open Source (for Commercial Use) Generative AI models

Speech to Text

The original Whisper model is a good speech-to-text transcription model that is used in many places: https://huggingface.co/openai/whisper-large-v3
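A minimal sketch of running it locally, assuming the Hugging Face transformers automatic-speech-recognition pipeline (the audio file name is a placeholder):

import torch
from transformers import pipeline

# Load Whisper large-v3 behind the ASR pipeline; fp16 on GPU keeps memory manageable.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device=0 if torch.cuda.is_available() else -1,
)

result = asr("example.wav")
print(result["text"])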

Text to Speech

WhisperSpeech is a good text-to-speech model with voice cloning that uses an MIT license (unlike Coqui and Suno): https://huggingface.co/WhisperSpeech/WhisperSpeech. It isn't the "best" model, but for its size it is very, very good.
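A minimal sketch, assuming the Pipeline / generate_to_file API shown in the WhisperSpeech README (the model reference and output path are placeholders):

from whisperspeech.pipeline import Pipeline

# Downloads the referenced text-to-semantic and semantic-to-acoustic checkpoints on first run.
pipe = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech, a small but capable open TTS model.")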

Another alternative is a purely ONNX-driven model, sponsored by txtai: https://huggingface.co/NeuML/ljspeech-jets-onnx
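A minimal sketch, assuming the txtai TextToSpeech pipeline from the model card (22050 Hz is the LJSpeech sample rate):

import soundfile as sf
from txtai.pipeline import TextToSpeech

# The pipeline downloads the ONNX model and runs it with onnxruntime, no PyTorch required.
tts = TextToSpeech("NeuML/ljspeech-jets-onnx")
speech = tts("Speech generated from a purely ONNX driven model")
sf.write("speech.wav", speech, 22050)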

Image Generation

For image generation, FLUX.1-schnell from the (ex) Stable Diffusion team is open source and free for commercial use (the other FLUX.1 models have different licenses): https://huggingface.co/black-forest-labs/FLUX.1-schnell. An online on-demand generator is available here: https://replicate.com/black-forest-labs/flux-schnell
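A minimal sketch, assuming the diffusers FluxPipeline (schnell is guidance-distilled for a handful of steps, hence the parameters below):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for VRAM on smaller GPUs

image = pipe(
    "a watercolor fox reading a book",
    guidance_scale=0.0,        # schnell is distilled to run without classifier-free guidance
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
image.save("flux-schnell.png")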

Image and Video Segmentation

Meta's SAM 2 model is perfect for image and video segmentation and uses the Apache 2.0 license: https://github.com/facebookresearch/segment-anything-2

Hopefully, with some tooling, we can combine it with inpainting for image editing.
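A minimal sketch of click-prompted segmentation, assuming the image predictor API from the SAM 2 README (the config/checkpoint paths and click coordinates are placeholders):

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths to a downloaded checkpoint and its matching config.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("photo.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One positive click; several candidate masks come back with confidence scores.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

# The best mask would become the editing region handed to an inpainting model.
best_mask = masks[np.argmax(scores)]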

Create an end-to-end application framework for data-intensive applications (see the sketch after this list), including:

  • Python server (RESTful)
  • nats.io for processing events
  • bytewax for doing transformations (e.g. writing logs out) without relying on cloud services like Kinesis Firehose
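A minimal sketch of the first two pieces, assuming FastAPI for the RESTful server and the nats-py client (the subject name "events.ingest" is a placeholder); a bytewax dataflow would subscribe to that subject and do the transformations/log writing downstream:

import json

import nats
from fastapi import FastAPI

app = FastAPI()
nc = None

@app.on_event("startup")
async def startup():
    global nc
    nc = await nats.connect("nats://localhost:4222")

@app.on_event("shutdown")
async def shutdown():
    await nc.drain()

@app.post("/events")
async def publish_event(payload: dict):
    # Fire-and-forget publish; downstream consumers handle the heavy lifting.
    await nc.publish("events.ingest", json.dumps(payload).encode())
    return {"status": "queued"}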

Look into how to convert audio into an animation of a mouth moving:

https://github.com/DanielSWolf/rhubarb-lip-sync

Basically, output a dat file, then match the mouth shapes to the frame rate and animate(?)
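A minimal sketch of that matching step, assuming Rhubarb's default two-column TSV output (timestamp in seconds, mouth-shape letter) and a 24 fps animation:

def load_mouth_cues(path):
    # Each line of Rhubarb's TSV output looks like "0.25\tB" (seconds, mouth shape).
    cues = []
    with open(path) as handle:
        for line in handle:
            timestamp, shape = line.strip().split("\t")
            cues.append((float(timestamp), shape))
    return cues

def shape_at_frame(cues, frame, fps=24):
    # Return the most recent mouth shape at this frame's timestamp ("X" = rest position).
    t = frame / fps
    shape = "X"
    for cue_time, cue_shape in cues:
        if cue_time > t:
            break
        shape = cue_shape
    return shape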

LLM-generated code to do this without the repo above, using Python with pocketsphinx:

import pocketsphinx as ps
import nltk
from functools import lru_cache
from itertools import product as iterprod

# Load the CMU pronouncing dictionary (ARPAbet), downloading it on first use.
try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    nltk.download("cmudict")
    arpabet = nltk.corpus.cmudict.dict()


@lru_cache()
def wordbreak(s):
    """Return ARPAbet pronunciations for s, splitting it into two dictionary words if needed."""
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    # Try split points closest to the middle of the word first.
    middle = len(s) / 2
    partition = sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x)
    for i in partition:
        pre, suf = s[:i], s[i:]
        if pre in arpabet and wordbreak(suf) is not None:
            return [x + y for x, y in iterprod(arpabet[pre], wordbreak(suf))]
    return None


# pocketsphinx 5.x ships with a default US English model, so no explicit config is needed.
decoder = ps.Decoder()
stream = open("example.wav", "rb")
stream.read(44)  # skip the RIFF/WAV header; process_raw expects raw 16 kHz 16-bit mono PCM

decoder.start_utt()
current_time = 0
last_frame = 0

word_list = []
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        # Collect (word, phones, start, end) for each segment in the current hypothesis.
        temp_word_list = []
        for seg in decoder.seg():
            # Strip pronunciation variants like "the(2)" before the dictionary lookup.
            word = wordbreak(seg.word.split("(")[0])
            if word is not None and isinstance(word[0], list):
                word = word[0]  # keep the first pronunciation only
            start_frame = current_time + seg.start_frame
            end_frame = current_time + seg.end_frame
            temp_word_list.append((seg.word, word, start_frame, end_frame))
            last_frame = seg.end_frame

        # Segment frames restart at zero for every utterance, so keep a running offset.
        current_time += last_frame
        print(temp_word_list)
        word_list.extend(temp_word_list)
        decoder.end_utt()
        decoder.start_utt()

decoder.end_utt()
stream.close()

Then map the phones to one of the 7 mouth positions, and animate with OpenCV/moviepy.
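A minimal sketch of that last step with moviepy, assuming mouth images named mouth_<shape>.png and an illustrative (not canonical) ARPAbet-to-mouth-shape mapping; word_list comes from the script above, where pocketsphinx frames are 10 ms (100 per second):

from moviepy.editor import ImageClip, concatenate_videoclips

# Illustrative mapping from ARPAbet phones to mouth images; anything else falls back to "rest".
PHONE_TO_MOUTH = {
    "AA": "open", "AE": "open", "AH": "open", "AO": "open",
    "B": "closed", "M": "closed", "P": "closed",
    "F": "teeth", "V": "teeth",
    "OW": "round", "UW": "round", "W": "round",
}

def clips_for_words(word_list, frames_per_second=100):
    # word_list entries are (word, phones, start_frame, end_frame) from the pocketsphinx script.
    clips = []
    for _, phones, start, end in word_list:
        duration = max(end - start, 1) / frames_per_second
        phones = phones or ["SIL"]
        per_phone = duration / len(phones)
        for phone in phones:
            shape = PHONE_TO_MOUTH.get(phone.rstrip("012"), "rest")  # drop stress markers
            clips.append(ImageClip(f"mouth_{shape}.png", duration=per_phone))
    return clips

video = concatenate_videoclips(clips_for_words(word_list))
video.write_videofile("mouth.mp4", fps=24)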
