A "quick 2-week project" to deepen my Rust skills
that I've spent the last ~9 weeks working on.
- Rust generics
- Rust concurrency paradigms
- Packaging Rust programs
- Lifetimes in Rust
- How to create a virtual webcam device
- Rust generics: got the job done with enums
- Rust multi-threading paradigms
- Packaging Rust programs (too busy building to publish)
- Lifetimes in Rust (well, sort of)
- How to create a virtual webcam device: turns out this is really complicated, and luckily people have already built solutions for this with v4l2loopback
- Image encoding formats (well, some at a high level at least)
- ML inference runtimes in Rust
- Input + output formats for Mediapipe models
- Triangulation
- GPU programming (shaders)
- Targeting WebAssembly in Rust
- Bytepacking to work with 24bit pixel images (RGB) in 32bit GPU-land
- Designing a configuration language
- ....
Google's Mediapipe, an edge-optimized computer vision package, includes open models for a variety of practical computer vision use cases. I wanted two specific models from this package:
- Face detection
- Face landmarker (detected face in, mesh out)
(I iterated over several other solutions before landing on Mediapipe: OpenCV Haar cascades, YOLO, and some proprietary Qualcomm model)
- Convert .tflite representations of the models into ONNX
- Find a Rust model execution runtime that supports these models
- Figure out inputs and outputs (seemingly no documentation to be found to work with the models directly anywhere)
- All outputs are basically opaque -- an N-dimensional matrix of floats loaded with implicit meaning
- Outputs of face landmarker in particular are WILD -- with implicit coupling to a specific "anchor" layout that's seemingly not documented ANYWHERE
- Solved these by digging through the Mediapipe Python + JS implementations
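To make "opaque matrix of floats with implicit anchor coupling" concrete, here is a minimal sketch of the general idea. The layout is an assumption for illustration, not Mediapipe's documented format: each row of a flat `[N, 4]` tensor is taken to be `(dx, dy, w, h)` in input pixels relative to an anchor center, and scores are taken to be raw logits. The names `Anchor`, `Detection` and `decode` are hypothetical.

```rust
/// One decoded detection in normalized [0, 1] image coordinates.
#[derive(Debug, Clone, PartialEq)]
struct Detection {
    x_center: f32,
    y_center: f32,
    width: f32,
    height: f32,
    score: f32,
}

/// Anchor center in normalized coordinates (hypothetical layout).
#[derive(Debug, Clone, Copy)]
struct Anchor {
    x: f32,
    y: f32,
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Decode a flat `[N, 4]` box tensor plus `[N]` raw scores against `N` anchors.
/// ASSUMPTION: each box row is (dx, dy, w, h) in input pixels, with dx/dy
/// measured from the matching anchor's center.
fn decode(
    raw_boxes: &[f32],
    raw_scores: &[f32],
    anchors: &[Anchor],
    input_size: f32,
    min_score: f32,
) -> Vec<Detection> {
    assert_eq!(raw_boxes.len(), anchors.len() * 4);
    assert_eq!(raw_scores.len(), anchors.len());
    raw_boxes
        .chunks_exact(4)
        .zip(raw_scores)
        .zip(anchors)
        .filter_map(|((b, &logit), a)| {
            // Scores are assumed to be logits, so squash them into (0, 1).
            let score = sigmoid(logit);
            if score < min_score {
                return None;
            }
            Some(Detection {
                x_center: a.x + b[0] / input_size,
                y_center: a.y + b[1] / input_size,
                width: b[2] / input_size,
                height: b[3] / input_size,
                score,
            })
        })
        .collect()
}
```

The point is that the floats mean nothing until they are paired, index by index, with an anchor table the model was trained against.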
Images are big. Like really big. So doing things to them on the CPU, like resizing or swapping pixels, can take a while.
The performance gains from introducing thread-based parallel processing are limited by the number of cores and by other programs on the computer competing for compute resources.
But for eymo to work, image transformations need to be fast. Otherwise the frame rate will drop AND there will be substantial lag
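For reference, thread-based parallelism on the CPU might look roughly like the sketch below. It is not eymo's actual code: `brighten_parallel` is a hypothetical helper that splits a tightly packed RGB8 buffer into bands of rows and brightens each band on its own scoped thread, using only the standard library.

```rust
use std::thread;

/// Brighten a tightly packed RGB8 buffer in place, one band of rows per thread.
/// `factor` scales every channel; results saturate at 255.
fn brighten_parallel(pixels: &mut [u8], width: usize, factor: f32) {
    let row_bytes = width * 3;
    if row_bytes == 0 || pixels.is_empty() {
        return;
    }
    assert_eq!(pixels.len() % row_bytes, 0, "buffer must hold whole rows");
    let rows = pixels.len() / row_bytes;

    // Never spawn more workers than the OS reports can run in parallel.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(rows);
    let rows_per_worker = (rows + workers - 1) / workers;

    // Scoped threads may borrow disjoint slices of `pixels` without `Arc`.
    thread::scope(|s| {
        for band in pixels.chunks_mut(rows_per_worker * row_bytes) {
            s.spawn(move || {
                for channel in band.iter_mut() {
                    *channel = (*channel as f32 * factor).min(255.0) as u8;
                }
            });
        }
    });
}
```

The worker count is capped by `available_parallelism`, which is exactly the ceiling described above: a handful of cores, shared with everything else running on the machine.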
GPUs were built to work with large images really super concurrently. I think.
Or maybe they're just built to burn massive amounts of energy for proof-of-work algorithms. I'm not sure; I'm in over my head here 🤔
Moving image manipulations (swapping face features, resizing, etc) to the GPU brought operations that took ~20-40ms down to ~3-10ms. Seriously.
And there's probably dumb stuff I'm doing that can be optimized to make further gains.
WebGPU is great because you can write code once and target any modern GPU with it (no need for hardware-specific implementations like CUDA/Vulkan/Metal etc)
And despite the name, WebGPU (or wgpu) works on desktop, too (at least Rust's implementation does).
But GPU programming is annoying:
- lots of boilerplate
- need to use another language (WGSL - WebGPU Shading Language)
- debugging sucks (can't log within shaders)
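Part of that boilerplate is the bytepacking mentioned earlier: 24-bit RGB pixels have to be squeezed into 32-bit words before a shader can work with them. Below is a minimal CPU-side sketch of one possible scheme, one pixel per `u32` with a padding alpha byte. This layout is an assumption and may not be the one eymo uses; `pack_rgb_to_u32` and `unpack_u32_to_rgb` are hypothetical names.

```rust
/// Pack tightly packed RGB8 pixels into one u32 per pixel (alpha forced to 255).
/// Byte layout inside each word: R in the lowest byte, then G, B, A,
/// so a pixel reads as 0xAABBGGRR when printed as a number.
fn pack_rgb_to_u32(rgb: &[u8]) -> Vec<u32> {
    assert_eq!(rgb.len() % 3, 0, "expected whole RGB triples");
    rgb.chunks_exact(3)
        .map(|p| u32::from_le_bytes([p[0], p[1], p[2], 255]))
        .collect()
}

/// Inverse of `pack_rgb_to_u32`: drop the alpha byte and emit RGB triples.
fn unpack_u32_to_rgb(packed: &[u32]) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&word| {
            let [r, g, b, _a] = word.to_le_bytes();
            [r, g, b]
        })
        .collect()
}
```

With R in the lowest byte, a word like this should line up with WGSL's `unpack4x8unorm` built-in on the shader side, which reads the lowest byte as the first component. The trade-off is one wasted byte per pixel in exchange for simple indexing.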
Using wasm-pack and wasm-bindgen took care of most of the heavy lifting. Once I got those in, it was a case of playing compiler (and then runtime) whack-a-mole until I got it working.
Key challenges:
- WASM support for the video input capture library nokhwa was roadmapped for release this year (2024)
- got around this by digging into web MediaStream docs and implementing a solution manually
- ort had an external dependency on the ONNX Runtime -- switching to tract did the job (and seemingly GREATLY improved performance)
- limited browser wgpu support on Linux (using a Chrome dev feature to test)
- had to propagate wgpu async patterns throughout my shared desktop implementation, as utilities to block on awaits are not supported in WASM
Using lalrpop to define a grammar.
An "interpreter" layer converts the generated AST (just nested Rust structs) into one or more stateful transformations on images.
leye: scale(4), saturate(1.2), brighten(1.4), spin(0.3), drift(100)
reye: scale(4), saturate(1.2), brighten(1.4), spin(-0.5), drift(100, 225)
mouth: copy_to(reye_region+1), scale(1.5), brighten(1.2), saturate(1.3)
leye_region: swap_with(mouth+1), scale(1.5), brighten(1.2), saturate(1.3)
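To show the shape of the pipeline (text in, nested structs out, structs lowered to transformations), here is a small sketch. It swaps lalrpop for a hand-written parser so it stays dependency-free, and every name in it (`Op`, `Rule`, `Transform`, `parse_rule`, `lower`) is hypothetical rather than taken from eymo.

```rust
/// AST: a rule is a target followed by a chain of operations.
#[derive(Debug, Clone, PartialEq)]
struct Op {
    name: String,
    args: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct Rule {
    target: String,
    ops: Vec<Op>,
}

/// Split on `sep`, ignoring separators nested inside parentheses,
/// so `drift(100, 225)` stays in one piece.
fn split_top_level(input: &str, sep: char) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut depth = 0usize;
    let mut start = 0;
    for (i, c) in input.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => depth = depth.saturating_sub(1),
            c if c == sep && depth == 0 => {
                parts.push(&input[start..i]);
                start = i + c.len_utf8();
            }
            _ => {}
        }
    }
    parts.push(&input[start..]);
    parts
}

fn parse_op(text: &str) -> Result<Op, String> {
    let text = text.trim();
    let open = text.find('(').ok_or_else(|| format!("missing '(' in `{text}`"))?;
    let inner = text[open + 1..]
        .strip_suffix(')')
        .ok_or_else(|| format!("missing ')' in `{text}`"))?;
    let name = text[..open].trim();
    if name.is_empty() {
        return Err(format!("missing operation name in `{text}`"));
    }
    // Arguments stay as strings here; `reye_region+1` is not a number.
    let args = inner
        .split(',')
        .map(str::trim)
        .filter(|a| !a.is_empty())
        .map(String::from)
        .collect();
    Ok(Op { name: name.to_string(), args })
}

fn parse_rule(line: &str) -> Result<Rule, String> {
    let (target, ops) = line
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in `{line}`"))?;
    let ops = split_top_level(ops, ',')
        .into_iter()
        .map(parse_op)
        .collect::<Result<Vec<_>, _>>()?;
    Ok(Rule { target: target.trim().to_string(), ops })
}

/// The "interpreter" step: lower an AST node into a typed transformation.
#[derive(Debug, PartialEq)]
enum Transform {
    Scale(f32),
    Brighten(f32),
    Saturate(f32),
    Unsupported(String),
}

fn lower(op: &Op) -> Result<Transform, String> {
    let num = |i: usize| -> Result<f32, String> {
        op.args
            .get(i)
            .ok_or_else(|| format!("`{}` needs an argument", op.name))?
            .parse::<f32>()
            .map_err(|e| format!("bad number in `{}`: {e}", op.name))
    };
    Ok(match op.name.as_str() {
        "scale" => Transform::Scale(num(0)?),
        "brighten" => Transform::Brighten(num(0)?),
        "saturate" => Transform::Saturate(num(0)?),
        other => Transform::Unsupported(other.to_string()),
    })
}
```

Calling `parse_rule` on the `reye` line above yields a `Rule` with five `Op`s, and `lower` turns the first of them into `Transform::Scale(4.0)`. A generated parser earns its keep once the grammar grows beyond what string splitting can handle, for example expressions like `reye_region+1`.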
- Finalize WASM support
- Fix landmarker borkiness on tilted head
- Test/get working on Mac (and I guess Windows?) - please help :)
Clone the eymo repo and give it a try!
Bruce for his slides program used for this presentation
James, Lars, Bradley, Mihika, Keenan (and probably others I'm forgetting) for either pairing with me or helping me reason about the project
Mohan for tiling some eyes in the first or second week of batch, giving inspiration to this project
And all the open source maintainers for being the giants whose shoulders this project stands on
