@jackrr
Created July 31, 2025 19:15
Eymo Presentation 2025-07-31

Eymo

A "quick 2 week project" to deepen my Rust skills

that I've spent the last ~9 weeks working on.


What I planned to learn

  • Rust generics
  • Rust concurrency paradigms
  • Packaging Rust programs
  • Lifetimes in Rust
  • How to create a virtual webcam device

What I learned (abridged)

  • Rust generics (got the job done with enums instead)
  • Rust multi-threading paradigms
  • Packaging Rust programs (Too busy building to publish)
  • Lifetimes in Rust (well, sort of)
  • How to create a virtual webcam device (turns out this is really complicated; luckily, people have already built solutions for this, such as v4l2loopback)
  • Image encoding formats (well, some at a high level at least)
  • ML inference runtimes in Rust
  • Input + output formats for Mediapipe models
  • Triangulation
  • GPU programming (shaders)
  • Targeting WebAssembly in Rust
  • Byte packing to work with 24-bit pixel images (RGB) in 32-bit GPU-land
  • Designing a configuration language
  • ....

Finding the eyes and mouth and such

Google's Mediapipe, a suite of edge-optimized computer vision packages, includes open models for a variety of practical computer vision use cases. I wanted two specific models from this suite:

  • Face detection
  • Face landmarker (detected face in, mesh out)

(I iterated over several other solutions before landing on mediapipe: OpenCV haarcascades, yolo, and some proprietary Qualcomm model)


Getting Mediapipe face detection + landmarks working in rust

  • Convert .tflite representations of the models into ONNX
  • Find a rust model execution runtime that supports these models
    • tried to use burn but that doesn't support asymmetric padding
      • ...whatever asymmetric padding is ¯\_(ツ)_/¯
    • ort, a wrapper around ONNX Runtime, did the trick (but I needed to switch to tract for WASM support)
  • Figure out inputs and outputs (seemingly no documentation anywhere on working with the models directly)
    • All outputs are basically opaque -- an N-dimensional matrix of floats loaded with implicit meaning
    • Outputs of face landmarker in particular are WILD -- with implicit coupling to a specific "anchor" layout that's seemingly not documented ANYWHERE
    • Solved these by digging through mediapipe python + JS implementations
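For a sense of what "implicit meaning" coupled to an anchor layout looks like, here is a rough sketch of SSD-style anchor decoding: each raw box is an offset from a fixed anchor point, and each raw score is a logit. The struct names, the fixed anchor size, the `input_size` scaling, and the score threshold are illustrative assumptions, not values taken from eymo or confirmed for any specific Mediapipe model.

```rust
/// Center of one anchor, in normalized [0, 1] image coordinates.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy)]
struct Anchor {
    x_center: f32,
    y_center: f32,
}

/// A decoded bounding box in normalized image coordinates.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x_center: f32,
    y_center: f32,
    width: f32,
    height: f32,
}

/// Turn a raw logit into a probability-like score in (0, 1).
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Decode one raw box. Assumes the model emits [x, y, w, h] in input
/// pixels, with x and y expressed as offsets from the anchor center.
fn decode_box(raw: [f32; 4], anchor: Anchor, input_size: f32) -> BBox {
    BBox {
        x_center: raw[0] / input_size + anchor.x_center,
        y_center: raw[1] / input_size + anchor.y_center,
        width: raw[2] / input_size,
        height: raw[3] / input_size,
    }
}

/// Pair every raw box with its anchor and keep those above `min_score`.
fn decode_detections(
    raw_boxes: &[[f32; 4]],
    raw_scores: &[f32],
    anchors: &[Anchor],
    input_size: f32,
    min_score: f32,
) -> Vec<(BBox, f32)> {
    raw_boxes
        .iter()
        .zip(raw_scores)
        .zip(anchors)
        .filter_map(|((raw, logit), anchor)| {
            let score = sigmoid(*logit);
            (score >= min_score).then(|| (decode_box(*raw, *anchor, input_size), score))
        })
        .collect()
}
```

The point of the sketch is that the raw tensor is meaningless without the anchor list: the i-th output row only becomes a box once it is combined with the i-th anchor, which is why an undocumented anchor layout is so painful.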

GPU Programming

Images are big. Like really big. So doing things to them on the CPU, like resizing or swapping pixels, can take a while.

The performance gains from introducing thread-based parallel processing are limited by the number of cores and by other programs on the computer competing for compute resources.
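As a point of comparison, a minimal sketch of what thread-based parallelism over an image buffer can look like with only the standard library. The function name and the brighten operation are hypothetical stand-ins, not eymo's code. Note that the number of chunks is capped at the number of available cores, which is exactly the ceiling described above.

```rust
use std::thread;

/// Brighten every byte of `pixels` by `factor`, splitting the buffer
/// into one chunk per available core. (Hypothetical example operation.)
fn brighten_parallel(pixels: &mut [u8], factor: f32) {
    // Number of worker threads the OS suggests; fall back to 1.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Ceiling division; chunks_mut panics on 0, so keep at least 1.
    let chunk_len = ((pixels.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow `pixels` because they are all joined
    // before `scope` returns.
    thread::scope(|s| {
        for chunk in pixels.chunks_mut(chunk_len) {
            s.spawn(move || {
                for p in chunk.iter_mut() {
                    // Clamp to the u8 range before converting back.
                    *p = (*p as f32 * factor).min(255.0) as u8;
                }
            });
        }
    });
}
```

Calling `brighten_parallel(&mut frame, 1.5)` on a frame buffer scales every byte in place, clamping at 255.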

But for eymo to work, image transformations need to be fast. Otherwise the frame rate will drop AND the output will have substantial lag.

GPUs were built to work with large images really super concurrently. I think.

Or maybe they're just built to burn massive amounts of energy for proof-of-work algorithms. I'm not sure; I'm in over my head here 🤔


GPU Programming

Moving image manipulations (swapping face features, resizing, etc) to the GPU brought operations that took ~20-40ms down to ~3-10ms. Seriously.

And there's probably dumb stuff I'm doing that can be optimized to make further gains.

WebGPU is great because you can write code once and target any modern GPU with it (no need for hardware-specific implementations like CUDA/Vulkan/Metal etc).

And despite the name WebGPU (or wgpu) works on desktop, too (at least Rust's implementation does).

But GPU programming is annoying:

  • lots of boilerplate
  • need to use another language (WGSL - WebGPU Shading Language)
  • debugging sucks (can't log within shaders)
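To give a flavor of the "another language" part and of packing 24-bit RGB into 32-bit words, here is a small sketch: a WGSL compute shader held in a Rust string, plus the CPU-side packing helpers. The shader, the binding layout, the 1.4 brighten factor, and all function names are assumptions for illustration, not eymo's actual shaders; the wgpu setup needed to run the shader (the boilerplate mentioned above) is left out.

```rust
/// WGSL compute shader source (hypothetical, not taken from eymo).
/// Each u32 in `pixels` holds one RGBA pixel, 8 bits per channel.
const BRIGHTEN_WGSL: &str = r#"
@group(0) @binding(0) var<storage, read_write> pixels: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    if (i >= arrayLength(&pixels)) {
        return;
    }
    // Unpack 4 x 8-bit channels into floats in [0, 1].
    let c = unpack4x8unorm(pixels[i]);
    let rgb = clamp(c.rgb * 1.4, vec3<f32>(0.0), vec3<f32>(1.0));
    // Repack into a single u32, leaving alpha untouched.
    pixels[i] = pack4x8unorm(vec4<f32>(rgb, c.a));
}
"#;

/// Pack tightly packed 24-bit RGB bytes into one u32 per pixel.
/// R goes in the lowest byte and alpha is forced to 255, matching
/// the channel order that WGSL's unpack4x8unorm reads.
/// Trailing bytes that do not form a full pixel are ignored.
fn pack_rgb_to_u32(rgb: &[u8]) -> Vec<u32> {
    rgb.chunks_exact(3)
        .map(|p| u32::from_le_bytes([p[0], p[1], p[2], 255]))
        .collect()
}

/// Reverse of `pack_rgb_to_u32`: drop alpha, return RGB bytes.
fn unpack_u32_to_rgb(words: &[u32]) -> Vec<u8> {
    words
        .iter()
        .flat_map(|w| {
            let b = w.to_le_bytes();
            [b[0], b[1], b[2]]
        })
        .collect()
}
```

The padding byte is the price of admission: storage buffers are addressed in 32-bit words, so a 3-byte pixel either gets a fourth byte or has to be split across words with manual bit shifting.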

Targeting WASM

Using wasm-pack and wasm-bindgen took care of most of the heavy lifting. Once I got those in, it was a case of playing compiler (and then runtime) whack-a-mole until I got it working.

Key challenges:

  • WASM support in the video input capture library nokhwa was roadmapped for release this year (2024)
  • ort had an external dependency on the ONNX Runtime -- switching to tract did the job (and seemingly GREATLY improved performance)
  • limited browser wgpu support on Linux (using a Chrome dev feature to test)
  • had to propagate wgpu async patterns throughout my shared desktop implementation, as utilities to block on awaits are not supported in WASM
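The last point can be sketched as follows: shared code is written as `async fn` all the way down, and only the native entry point is allowed to block. The tiny `block_on` below is a standard-library stand-in for what a crate such as pollster provides on desktop; in the browser the future would instead be handed to the event loop (for example via `wasm_bindgen_futures::spawn_local`). The function names and the fake device handle are hypothetical.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Hypothetical stand-in for async GPU setup (adapter/device requests).
async fn acquire_device() -> u32 {
    42
}

/// Shared code stays async so it works on both desktop and WASM.
async fn brighten_frame(frame: Vec<u8>) -> Vec<u8> {
    let _device = acquire_device().await;
    frame.into_iter().map(|p| p.saturating_add(10)).collect()
}

/// Wakes the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Native-only: block the current thread until `fut` completes.
/// Blocking like this is not available on wasm32, hence the cfg gate.
#[cfg(not(target_arch = "wasm32"))]
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}
```

On desktop the entry point would call something like `block_on(brighten_frame(frame))`, while everything below it stays oblivious to which platform is driving the future.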

Commands

Using lalrpop to define a grammar.

An "interpreter" layer converts the generated AST (just nested Rust structs) into one or more stateful transformations on images.

leye: scale(4), saturate(1.2), brighten(1.4), spin(0.3), drift(100)
reye: scale(4), saturate(1.2), brighten(1.4), spin(-0.5), drift(100, 225)
mouth: copy_to(reye_region+1), scale(1.5), brighten(1.2), saturate(1.3)
leye_region: swap_with(mouth+1), scale(1.5), brighten(1.2), saturate(1.3)
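To make the AST-then-interpreter split concrete, here is a minimal sketch. It uses a small hand-written parser in place of the lalrpop-generated one, and the struct, enum, and function names are all hypothetical. It only understands the `target: op(args), op(args)` shape shown above and interprets three of the operations; arguments such as `reye_region+1` are kept as raw strings.

```rust
/// One operation with its raw argument strings, e.g. drift(100, 225).
#[derive(Debug, PartialEq)]
struct Op {
    name: String,
    args: Vec<String>,
}

/// One line of the command language: a target and its operations.
#[derive(Debug, PartialEq)]
struct Statement {
    target: String,
    ops: Vec<Op>,
}

/// Parse a single `target: op(args), op(args)` line into an AST node.
/// Returns None on anything that does not match that shape.
fn parse_statement(line: &str) -> Option<Statement> {
    let (target, rest) = line.split_once(':')?;
    let mut rest = rest.trim();
    let mut ops = Vec::new();
    while !rest.is_empty() {
        let open = rest.find('(')?;
        let close = rest.find(')')?;
        if close < open {
            return None;
        }
        let name = rest[..open].trim().to_string();
        if name.is_empty() {
            return None;
        }
        let args = rest[open + 1..close]
            .split(',')
            .map(|a| a.trim().to_string())
            .filter(|a| !a.is_empty())
            .collect();
        ops.push(Op { name, args });
        // Skip the closing paren, then any separating commas/spaces.
        rest = rest[close + 1..]
            .trim_start_matches(|c: char| c == ',' || c.is_whitespace());
    }
    Some(Statement {
        target: target.trim().to_string(),
        ops,
    })
}

/// A few image transformations the interpreter knows about.
#[derive(Debug, PartialEq)]
enum Transform {
    Scale(f32),
    Brighten(f32),
    Saturate(f32),
}

/// Interpreter step: turn an AST operation into a transformation.
/// Unknown operations or non-numeric arguments yield None.
fn interpret(op: &Op) -> Option<Transform> {
    let amount = op.args.first()?.parse::<f32>().ok()?;
    match op.name.as_str() {
        "scale" => Some(Transform::Scale(amount)),
        "brighten" => Some(Transform::Brighten(amount)),
        "saturate" => Some(Transform::Saturate(amount)),
        _ => None,
    }
}
```

Parsing the `reye` line above with `parse_statement` yields a `Statement` with five `Op`s, and `interpret` maps its first op to `Transform::Scale(4.0)`.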

The Future

  1. Finalize WASM support
  2. Fix landmarker borkiness on tilted head
  3. Test/get working on Mac (and I guess Windows?) - please help :)

Clone the eymo repo and give it a try!


Thanks

Bruce for his slides program used for this presentation

James, Lars, Bradley, Mihika, Keenan (and probably others I'm forgetting) for either pairing with me or helping me reason about the project

Mohan for tiling some eyes in the first or second week of batch, giving inspiration to this project

And all the open source maintainers for being the giants whose shoulders this project stands on
