Speech to Text
OpenAI's original Whisper model is a solid, widely used speech-to-text (transcription) model: https://huggingface.co/openai/whisper-large-v3
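As a rough illustration of how the checkpoint linked above might be used, here is a minimal sketch built on the Hugging Face transformers pipeline API. It assumes transformers and a PyTorch backend are installed; the helper names build_asr_kwargs and transcribe, and the 30-second chunk length for long recordings, are choices made for this example rather than anything prescribed by the model.

```python
MODEL_ID = "openai/whisper-large-v3"  # checkpoint from the link above


def build_asr_kwargs(long_form: bool = False) -> dict:
    """Build arguments for transformers.pipeline() (hypothetical helper)."""
    kwargs = {"task": "automatic-speech-recognition", "model": MODEL_ID}
    if long_form:
        # Whisper processes 30-second windows, so longer audio is chunked.
        # The value 30 is an assumption for this sketch.
        kwargs["chunk_length_s"] = 30
    return kwargs


def transcribe(audio_path: str, long_form: bool = False) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline(**build_asr_kwargs(long_form))
    if long_form:
        result = asr(audio_path, return_timestamps=True)
    else:
        result = asr(audio_path)
    return result["text"].strip()
```

Calling transcribe("sample.wav") on a local recording (a hypothetical file name) would download the model on first use and return the transcript as a string. The large-v3 checkpoint is heavy, so a GPU is advisable for anything beyond short clips.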
Text to Speech
WhisperSpeech is a good text-to-speech model with voice cloning: https://huggingface.co/WhisperSpeech/WhisperSpeech It uses an MIT license (unlike coqui and suno). It isn't the "best" model, but it is very good for its size.
Another alternative is a purely ONNX-based model sponsored by txtai: https://huggingface.co/NeuML/ljspeech-jets-onnx
Image Generation
For image generation, FLUX.1 [schnell] comes from Black Forest Labs, a group of former Stable Diffusion team members: https://huggingface.co/black-forest-labs/FLUX.1-schnell An online on-demand generator is here: https://replicate.com/black-forest-labs/flux-schnell This model is open source (Apache 2.0) and free for commercial use. The other FLUX.1 models have different licenses.
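For running FLUX.1-schnell locally instead of through the hosted generator, the following is a minimal sketch using the diffusers FluxPipeline. It assumes diffusers and PyTorch are installed and that enough GPU memory is available. The call-time settings follow the values shown on the model card; the helper names schnell_settings and generate, and the output file name, are made up for this example.

```python
MODEL_ID = "black-forest-labs/FLUX.1-schnell"  # checkpoint from the link above


def schnell_settings(steps: int = 4) -> dict:
    """Call-time arguments as shown on the model card (hypothetical helper)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "guidance_scale": 0.0,        # the model card sets guidance to 0 for schnell
        "num_inference_steps": steps, # schnell is designed for very few steps
        "max_sequence_length": 256,   # prompt length limit used on the model card
    }


def generate(prompt: str, out_path: str = "flux-schnell.png") -> str:
    """Generate one image from a prompt and save it; returns the output path."""
    # Imported lazily so this module loads even without torch/diffusers installed.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory use
    image = pipe(prompt, **schnell_settings()).images[0]
    image.save(out_path)
    return out_path
```

A call such as generate("a lighthouse at dusk, watercolor") would write a PNG to the working directory. The CPU offload line is optional and can be dropped when the GPU has enough memory to hold the whole pipeline.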
Image and Video Segmentation
Meta's SAM 2 model is a strong choice for image and video segmentation and uses the Apache 2.0 license: https://github.com/facebookresearch/segment-anything-2
Hopefully, with some tooling, we can combine it with inpainting for image editing.