Eymo

A "quick 2 week project" to deepen my Rust skills

that I've spent the last ~9 weeks working on.

What I planned to learn

Rust generics
Rust concurrency paradigms
Packaging Rust programs
Lifetimes in Rust
How to create a virtual webcam device

What I learned (abridged)

~~Rust generics~~ Got the job done with enums
Rust multi-threading paradigms
~~Packaging Rust programs~~ (Too busy building to publish)
Lifetimes in Rust (well, sort of)
~~How to create a virtual webcam device~~ Turns out this is really complicated. Luckily, people have already solved it with v4l2loopback
Image encoding formats (well, some at a high level at least)
ML inference runtimes in Rust
Input + output formats for Mediapipe models
Triangulation
GPU programming (shaders)
Targeting WebAssembly in Rust
Bytepacking to work with 24bit pixel images (RGB) in 32bit GPU-land
Designing a configuration language
....
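A minimal sketch of the bytepacking idea, assuming one pixel per 32-bit word with an unused padding byte. The function names `pack_rgb` and `unpack_rgb` are made up for illustration, and the layout eymo actually uses may differ (for example, it could pack four raw bytes per word instead).

```rust
/// Packs tightly-packed RGB bytes into one u32 per pixel.
/// Assumed layout: R in the lowest byte, then G, then B, then a zero pad byte.
fn pack_rgb(rgb: &[u8]) -> Vec<u32> {
    rgb.chunks_exact(3)
        .map(|p| u32::from_le_bytes([p[0], p[1], p[2], 0]))
        .collect()
}

/// Reverses `pack_rgb`, dropping the pad byte of every word.
fn unpack_rgb(packed: &[u32]) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&px| {
            let [r, g, b, _pad] = px.to_le_bytes();
            [r, g, b]
        })
        .collect()
}
```

The padding costs a third more memory, but every pixel starts on a word boundary, which keeps shader-side indexing simple.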

Finding the eyes and mouth and such

Google's Mediapipe edge-optimized computer vision package includes open models for a variety of practical computer vision use cases. I wanted two specific models from this package:

Face detection
Face landmarker (detected face in, mesh out)

(I iterated over several other solutions before landing on Mediapipe: OpenCV Haar cascades, YOLO, and some proprietary Qualcomm model)

Getting Mediapipe face detection + landmarks working in Rust

Convert .tflite representations of the models into ONNX
Find a Rust model execution runtime that supports these models
- tried to use burn but that doesn't support asymmetric padding
  - ...whatever asymmetric padding is ¯\_(ツ)_/¯
- ort - a wrapper around ONNX Runtime did the trick (but needed to switch to tract for WASM support)
Figure out inputs and outputs (seemingly no documentation to be found to work with the models directly anywhere)
- All outputs are basically opaque -- an N-dimensional matrix of floats loaded with implicit meaning
- Outputs of face landmarker in particular are WILD -- with implicit coupling to a specific "anchor" layout that's seemingly not documented ANYWHERE
- Solved these by digging through the Mediapipe Python + JS implementations
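To give a feel for what "implicit coupling to an anchor layout" means, here is a rough sketch of decoding anchor-relative outputs. Everything in it is an assumption for illustration: the `Anchor` and `Detection` types, the single-stride grid, and the layout of the first four values per anchor (x offset, y offset, width, height in input pixels). It is not Mediapipe's documented format; the real layout has to be read out of the reference implementations.

```rust
/// Hypothetical anchor: a reference point in normalized [0, 1] image coordinates.
#[derive(Debug, Clone, Copy)]
struct Anchor {
    cx: f32,
    cy: f32,
}

#[derive(Debug, Clone, Copy)]
struct Detection {
    cx: f32,
    cy: f32,
    w: f32,
    h: f32,
    score: f32,
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Builds one anchor per grid cell center, row by row.
fn grid_anchors(input_size: usize, stride: usize) -> Vec<Anchor> {
    let cells = input_size / stride;
    let mut anchors = Vec::with_capacity(cells * cells);
    for y in 0..cells {
        for x in 0..cells {
            anchors.push(Anchor {
                cx: (x as f32 + 0.5) / cells as f32,
                cy: (y as f32 + 0.5) / cells as f32,
            });
        }
    }
    anchors
}

/// Turns flat model outputs into boxes. Row `i` of `raw_boxes` only makes
/// sense relative to `anchors[i]`, which is the implicit coupling.
fn decode(
    raw_boxes: &[f32],
    raw_scores: &[f32],
    anchors: &[Anchor],
    values_per_anchor: usize,
    input_size: f32,
    threshold: f32,
) -> Vec<Detection> {
    anchors
        .iter()
        .enumerate()
        .filter_map(|(i, a)| {
            let score = sigmoid(raw_scores[i]);
            if score < threshold {
                return None;
            }
            let b = &raw_boxes[i * values_per_anchor..(i + 1) * values_per_anchor];
            Some(Detection {
                cx: a.cx + b[0] / input_size,
                cy: a.cy + b[1] / input_size,
                w: b[2] / input_size,
                h: b[3] / input_size,
                score,
            })
        })
        .collect()
}
```

If the anchor list is generated in a different order or with different strides than the model was trained with, the same floats decode to garbage, which is why the missing documentation hurts.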

GPU Programming

Images are big. Like really big. So doing things to them on the CPU, like resizing or swapping pixels, can take a while.

The performance gains of introducing thread-based parallel processing are limited by the number of cores and by other programs on the computer competing for compute resources.
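For comparison, a minimal sketch of the CPU-thread approach using only the standard library. The function `brighten_parallel` and the choice of brightening as the operation are assumptions for illustration, not eymo's actual code.

```rust
use std::thread;

/// Brightens a byte-per-channel image in place, splitting rows across threads.
/// `row_bytes` is the number of bytes in one image row.
fn brighten_parallel(pixels: &mut [u8], row_bytes: usize, factor: f32) {
    assert!(row_bytes > 0, "row_bytes must be non-zero");
    // The speedup ceiling: we never get more workers than available cores.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let rows = pixels.len() / row_bytes;
    let rows_per_worker = ((rows + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // Each worker gets a disjoint band of rows, so no locking is needed.
        for band in pixels.chunks_mut(rows_per_worker * row_bytes) {
            s.spawn(move || {
                for v in band.iter_mut() {
                    *v = (*v as f32 * factor).min(255.0) as u8;
                }
            });
        }
    });
}
```

With 8 cores the best case is roughly an 8x speedup, and only if nothing else on the machine wants those cores.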

But for eymo to work, image transformations need to be fast. Otherwise the frame rate will drop AND the output will lag substantially.

GPUs were built to work with large images really super concurrently. I think.

Or maybe they're just built to burn massive amounts of energy for proof-of-work algorithms. I'm not sure. I'm in over my head here 🤔

GPU Programming

Moving image manipulations (swapping face features, resizing, etc) to the GPU brought operations that took ~20-40ms down to ~3-10ms. Seriously.

And there's probably dumb stuff I'm doing that can be optimized to make further gains.

WebGPU is great because you can write code once and target any modern GPU with it (no need for hardware-specific implementations like CUDA/Vulkan/Metal etc.)

And despite the name, WebGPU works on desktop, too (at least wgpu, Rust's implementation, does).

But GPU programming is annoying:

lots of boilerplate
need to use another language (WGSL - WebGPU Shading Language)
debugging sucks (can't log within shaders)

Targeting WASM

Using wasm-pack and wasm-bindgen took care of most of the heavy lifting. Once I got those in, it was a case of playing compiler (and then runtime) whack-a-mole until I got it working.

Key challenges:

WASM support in the video input capture library nokhwa was roadmapped for release this year (2024)
- got around this by digging into web MediaStream docs and implementing a solution manually
ort had an external dependency on the ONNX Runtime -- switching to tract did the job (and seemingly GREATLY improved performance)
limited browser wgpu support on Linux (using a Chrome dev feature to test)
had to propagate wgpu async patterns throughout my shared desktop implementation, as utilities to block on awaits are not supported in WASM
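The last point, sketched with the standard library only. The shared code is `async`, and only the desktop build gets a blocking wrapper. The names `init_pipeline` and `init_pipeline_blocking` are hypothetical, and the hand-rolled `block_on` stands in for what a crate such as pollster provides on desktop.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal blocking executor. Parking a thread like this is exactly what
/// the browser's single-threaded event loop does not allow.
#[cfg(not(target_arch = "wasm32"))]
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Shared code stays async so both targets can call it.
/// Returns the byte size of an RGBA frame buffer as a stand-in result.
async fn init_pipeline(width: u32, height: u32) -> usize {
    // Stand-in for awaiting GPU setup (adapter, device, and so on).
    std::future::ready(()).await;
    (width * height * 4) as usize
}

/// Desktop-only convenience wrapper.
#[cfg(not(target_arch = "wasm32"))]
fn init_pipeline_blocking(width: u32, height: u32) -> usize {
    block_on(init_pipeline(width, height))
}
```

In the WASM build there is no blocking wrapper at all: the caller hands the future to the browser's event loop (for example with `wasm_bindgen_futures::spawn_local`), so every function between the entry point and the GPU call has to be `async` as well.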

Commands

Using lalrpop to define a grammar.

An "interpreter" layer converts the generated AST (just nested Rust structs) into one or more stateful transformations on images.

leye: scale(4), saturate(1.2), brighten(1.4), spin(0.3), drift(100)
reye: scale(4), saturate(1.2), brighten(1.4), spin(-0.5), drift(100, 225)
mouth: copy_to(reye_region+1), scale(1.5), brighten(1.2), saturate(1.3)
leye_region: swap_with(mouth+1), scale(1.5), brighten(1.2), saturate(1.3)
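A rough sketch of the interpreter idea, with a hand-built AST in place of lalrpop's output. The types `Op`, `Statement`, and `Transform` are hypothetical, as is the meaning given to each operation (`scale` and `brighten` multiply, `spin` sets a per-frame rotation rate); eymo's real semantics may differ.

```rust
/// One operation from a command line such as `leye: scale(4), spin(0.3)`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Scale(f32),
    Brighten(f32),
    Spin(f32),
}

/// AST node for one line: a target feature and its operations.
#[derive(Debug, Clone, PartialEq)]
struct Statement {
    target: String,
    ops: Vec<Op>,
}

/// Stateful transformation: `angle` changes from frame to frame.
#[derive(Debug, Clone, PartialEq)]
struct Transform {
    scale: f32,
    brightness: f32,
    spin_rate: f32,
    angle: f32,
}

impl Transform {
    /// Advances per-frame state; called once for every video frame.
    fn tick(&mut self) {
        self.angle += self.spin_rate;
    }
}

/// Folds a statement's operations into a single transformation.
fn interpret(stmt: &Statement) -> Transform {
    let mut t = Transform {
        scale: 1.0,
        brightness: 1.0,
        spin_rate: 0.0,
        angle: 0.0,
    };
    for op in &stmt.ops {
        match *op {
            Op::Scale(f) => t.scale *= f,
            Op::Brighten(f) => t.brightness *= f,
            Op::Spin(rate) => t.spin_rate = rate,
        }
    }
    t
}
```

Keeping parsing and interpretation separate means the grammar can grow new syntax without touching the image code, as long as it still produces the same AST.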

The Future

Finalize WASM support
Fix landmarker borkiness on tilted head
Test/get working on Mac (and I guess Windows?) - please help :)

Clone the eymo repo and give it a try!

Thanks

Bruce for his slides program used for this presentation

James, Lars, Bradley, Mihika, Keenan (and probably others I'm forgetting) for either pairing with me or helping me reason about the project

Mohan for tiling some eyes in the first or second week of batch, giving inspiration to this project

And all the open source maintainers for being the giants whose shoulders this project stands on

jackrr/Presentation 2025-07-31 - EYMO.md
