Machine Learning with Bela & IREE - GSoC 2022


This project took place over the summer of 2022 as part of Google Summer of Code, with support from the Intelligent Instruments Lab, the BeagleBoard Foundation and Bela. The project's objective was to improve the tooling available to those looking to use machine learning models in their Bela projects. For some background, Bela is a maker platform built on top of the BeagleBone Black with a focus on real-time audio and sensor processing, used in interactive art projects including digital instrument design. Making machine learning tools available on Bela would enable new design practices that incorporate machine learning models.

The original goal of this project arose from two constraints: Bela's low-powered (in machine learning terms) processor and the real-time demands of interactive projects. Developing on this platform calls for performance analysis tools that allow quick evaluation of different models on Bela, and this project built some tools for that purpose, including benchmarking and profiling utilities.

Although not originally a focus, this project also worked on supporting the Intermediate Representation Execution Environment (IREE) on Bela. IREE is built on the larger MLIR compiler infrastructure project and comprises both a machine learning compiler and a runtime. In short, MLIR is organized around dialects, which are themselves collections of operations, and passes within MLIR-based compilers translate and optimize between dialects. This infrastructure allows different compilers to reuse optimizations and compiler passes on, and in between, new dialects without starting from scratch. IREE uses MLIR to build a compiler that goes all the way down to scheduling workloads, with a lightweight Hardware Abstraction Layer to run on. I didn't know a thing about machine learning compilers four months ago, so I am not the best person to explain them; I would highly recommend checking out the original MLIR paper as well as the TinyIREE paper (both linked below) for more in-depth background.
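
To give a flavour of what this looks like, here is a small hand-written example (illustrative only, not taken from any model in this project) of a function expressed in the TOSA dialect, the input dialect used for most of the models below:

```mlir
// Each "tosa.*" name is an operation belonging to the TOSA dialect.
// Lowering passes rewrite these ops into other dialects (e.g. linalg)
// on the way down to IREE's code generation backends.
func.func @forward(%input: tensor<1x1024xf32>) -> tensor<1x1024xf32> {
  %0 = "tosa.tanh"(%input) : (tensor<1x1024xf32>) -> tensor<1x1024xf32>
  return %0 : tensor<1x1024xf32>
}
```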

IREE's compiler supports multiple code generation backends, including portable IREE virtual machine bytecode, LLVM IR and C source code. The main advantages I see with IREE are its portability across hardware platforms (including bare-metal targets), its support for parallelization on platforms that offer it, and the multiple frontends available for importing models (although some are in very early stages - see Torch-MLIR). Beyond the Bela/BeagleBone Black, I think that if embedded in a more portable audio application (Pure Data, VST, CLAP, etc.), the IREE runtime could be a lightweight way of running machine learning models in many types of audio (or other multimedia) projects while still being able to take advantage of larger multiprocessor systems.

[Image: IREE architecture diagram, from https://iree-org.github.io/iree/]

My work this summer produced two related projects for running and measuring IREE on Bela. First, I created a Docker image that contains a toolchain for compiling IREE projects for Bela, along with utilities to compile, benchmark and profile programs. You can find this project at https://github.com/ezrapierce000/bela-iree-container. It also includes a model zoo and runtime as submodules, so you get a full end-to-end development environment for using IREE in Bela projects.

The second project I created is an IREE runtime for Bela, which can be found at https://github.com/ezrapierce000/bela-iree-runtime. It contains a Bela project with the IREE runtime set up so that a model can be loaded into a Bela project. The runtime has two branches with different project structures. The main branch requires the IREE compiler to export a VMFB file to the Bela, which is then loaded at runtime. Alternatively, the emit-c branch uses IREE's EmitC path: the compiler outputs C source code in a module.c file, which is compiled into the binary ahead of time. The runtime can also enable Xenomai diagnostics to inspect how the IREE thread is behaving. It is functional but still in the early stages; I plan on improving it further so it is easier to use.
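
To give an idea of the main-branch flow, below is a minimal sketch of loading a VMFB and invoking its forward function with the IREE runtime C API. It is adapted from the IREE runtime samples rather than copied from bela-iree-runtime; exact signatures vary between IREE releases, and the input buffer setup is elided.

```cpp
// Minimal sketch: load a VMFB and invoke forward() once.
// Adapted from the IREE runtime samples; check the calls against your
// IREE checkout, as the runtime API changes between releases.
#include "iree/runtime/api.h"

int main() {
  // One instance per process; it owns the VM and the HAL drivers.
  iree_runtime_instance_options_t opts;
  iree_runtime_instance_options_initialize(&opts);
  iree_runtime_instance_options_use_all_available_drivers(&opts);
  iree_runtime_instance_t* instance = NULL;
  IREE_CHECK_OK(iree_runtime_instance_create(
      &opts, iree_allocator_system(), &instance));

  // Single-threaded CPU execution; "local-task" instead adds a worker
  // pool on platforms where parallelism is available.
  iree_hal_device_t* device = NULL;
  IREE_CHECK_OK(iree_runtime_instance_try_create_default_device(
      instance, iree_make_cstring_view("local-sync"), &device));

  // A session holds the loaded module(s) and their state.
  iree_runtime_session_options_t session_opts;
  iree_runtime_session_options_initialize(&session_opts);
  iree_runtime_session_t* session = NULL;
  IREE_CHECK_OK(iree_runtime_session_create_with_device(
      instance, &session_opts, device,
      iree_runtime_instance_host_allocator(instance), &session));
  IREE_CHECK_OK(iree_runtime_session_append_bytecode_module_from_file(
      session, "single_mm.vmfb"));

  // Look up the entry point ("module" is the compiler's default name).
  iree_runtime_call_t call;
  IREE_CHECK_OK(iree_runtime_call_initialize_by_name(
      session, iree_make_cstring_view("module.forward"), &call));

  // ... create and push a 1x1024xf32 iree_hal_buffer_view_t input here
  // (see the IREE runtime samples for buffer creation) ...
  IREE_CHECK_OK(iree_runtime_call_invoke(&call, /*flags=*/0));

  iree_runtime_call_deinitialize(&call);
  iree_runtime_session_release(session);
  iree_hal_device_release(device);
  iree_runtime_instance_release(instance);
  return 0;
}
```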

Benchmarking

The benchmark utility can be found in the bela-iree-container Docker container; the readme in that repository explains how to set it up. Below are the most recent benchmarks from the BBB and the BBAI64 (CPU only). As the numbers show, the Cortex-A8 on the Bela is quite a bit slower at inference than the AArch64 Cortex-A72 on the BBAI64. More errors were also encountered on the 32-bit ARM platform when using the LLVM-CPU codegen backend. The interpreted VMVX runtime is still quite new and is expected to improve as the IREE developers add new microkernels to VMVX, which will hopefully translate to performance gains on Bela. All models in this table other than the MDRNN were processing blocks of 1024 samples.

Current benchmarks using IREE:

| Model | IREE input type | Bela IREE benchmark | BBAI64 IREE benchmark (CPU only, LLVM-CPU) |
| --- | --- | --- | --- |
| basic_mlp_1024 | TOSA from TFLite | 222ms (VMVX) | 24.0ms |
| resnet_1d_1024 | TOSA from TFLite | segfault | NA |
| simple_conv_1d_1024 | TOSA from TFLite | 2549ms (LLVM-CPU) | 137ms |
| simple_rnn_1024 | NA | NA - unable to export to TOSA | NA |
| single_mm_1024 | TOSA from Torch-MLIR | 19.7ms (LLVM-CPU) | 7.72ms |
| siren_mlp_1024 | TOSA from TFLite | 778ms (LLVM-CPU) | 50.4ms |
| transformer_block_1024 | TOSA from TFLite | segfault | 142ms |
| variational_encoder_1024 | NA | NA - unable to export to TOSA | NA |
| mdrnn (64 hidden units) | MHLO from JAX | 37.6ms (VMVX) | 0.176ms |

Profiling

Another benefit of using IREE is its built-in instrumentation for the Tracy profiler. Tracing can be enabled throughout the IREE runtime with a compiler flag, providing fine-grained profiling data from running IREE programs. The data is sent over TCP to a capture tool, which allows for visualization of traces, memory allocations, etc. Unfortunately, the instrumented binaries were unstable on the Bela, making profile recording unreliable, although some profiles could be captured. More work could be done here debugging the cause of the instability, as the tooling is already somewhat functional and could be a very useful tool.
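
Beyond the zones IREE emits itself, Tracy's C++ API makes it easy to add custom zones. As a hypothetical illustration (the macro names come from the Tracy repository; TRACY_ENABLE must be defined and Tracy's client code compiled in), custom zones around the inference call in a Bela render() callback might look like this:

```cpp
// Hypothetical example of custom Tracy zones in a Bela project.
// Requires building with TRACY_ENABLE and linking Tracy's client code;
// the Tracy capture tool then connects over TCP (port 8086 by default).
#include <Bela.h>
#include "Tracy.hpp"  // header location depends on your Tracy checkout

void render(BelaContext* context, void* userData) {
  ZoneScopedN("render");  // zone covering the whole audio callback
  {
    ZoneScopedN("iree_invoke");  // narrower zone around inference only
    // ... run the IREE call here, as in the runtime sketch above ...
  }
  FrameMark;  // marks each audio block as one "frame" in the timeline
}
```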

[Image: Tracy capture from a Bela session]

As an alternative, the perf profiler was used to record profiles and performance monitor events on the Bela while models are running. The profiling utility in the Docker container records these profiles as the model runs, and the resulting profiles and events can then be viewed in various formats; the viewer I have been using is Trace Compass. Additional work could be done in the IREE runtime on Bela to provide instrumentation similar to the Tracy profiler's, possibly by using LTTng to insert tracepoints.
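
As a sketch of that LTTng idea (hypothetical, not part of the current runtime): lttng-ust's tracef() is the quickest way to emit a userspace tracepoint without defining a full tracepoint provider, and the recorded CTF trace can be opened directly in Trace Compass.

```cpp
// Hypothetical LTTng tracepoints around inference; link with -llttng-ust.
// Record with: lttng create; lttng enable-event -u 'lttng_ust_tracef:*';
// lttng start. Note: calls from the audio thread may trigger Xenomai
// mode switches, so this is best kept to experiments, not production.
#include <Bela.h>
#include <lttng/tracef.h>

void render(BelaContext* context, void* userData) {
  tracef("iree_invoke_begin frames=%u", context->audioFrames);
  // ... run the IREE call here ...
  tracef("iree_invoke_end");
}
```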

Workflow example

As an example, these are the steps needed to benchmark a single matrix multiply using IREE on Bela and print out some profiling information. These steps assume you are working inside the bela-iree-container.

```sh
# 1) Export the models from the model zoo (this produces the .tosa files):
cd /workspaces/bela-iree-container/models/embedded-model-zoo/ && conda activate zoo && python -m zoo

# 2) Compile the TOSA module to VM bytecode for the BeagleBone Black (bbb),
#    using the llvm-cpu code generation backend:
cd tosa/ && compile -i single_mm.tosa -t bbb -d tosa -f vm-bytecode -h llvm-cpu -o single_mm.vmfb

# 3) Benchmark the exported forward function over 10 runs, feeding a
#    1x1024xf32 input filled with 4s (192.168.7.2 is the Bela's default
#    USB network address):
benchmark -f single_mm.vmfb -t bbb -r 10 -e forward -i 1x1024xf32=4 -d 192.168.7.2

# 4) Record perf counter statistics (cache misses/references) while the
#    model runs:
profile -f single_mm.vmfb -m stat -e forward -i 1x1024xf32=4 -l cache-misses,cache-references -d 192.168.7.2
```

Future work

Although I did not complete all of the goals set out at the start of this project, what I've accomplished gives a good starting point for future experimentation with IREE on Bela and other embedded devices (RPi, Teensy, etc.). I plan on continuing this work in some form, focusing especially on improving the IREE runtime on Bela so that it is stable and easier to use. In the short term I plan on getting a full demo of a control-rate MDRNN running in a Bela project using IREE. In addition to these practical improvements, I would also like to automate some of the process of building the IREE runtime components, profiling/instrumentation utilities and MLIR components, so that the upstream projects can be tracked easily as they are moving quite fast. Another interesting direction to investigate would be the GPU on the Bela: PowerVR (the GPU manufacturer) finally released drivers for it in 2020, so it may be worth a look. I don't think it would be very useful for audio synthesis, as it would still require large block sizes, but it might work for control-rate processing or image processing.

Links

https://arxiv.org/abs/2205.14479 - TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment

https://arxiv.org/abs/2002.11054 - MLIR: A Compiler Infrastructure for the End of Moore's Law

https://archive.eclipse.org/tracecompass.incubator/doc/org.eclipse.tracecompass.incubator.perf.profiling.doc.user/User-Guide.html - Perf User Guide for TraceCompass

https://github.com/rodrigodzf/DeepLearningForBela - Great runtime and examples for deep learning with Bela

https://www.brendangregg.com/linuxperf.html - Linux performance info

https://groups.google.com/g/iree-discuss/c/qyTy88KLq2c - Post from iree-discuss talking about the VMVX backend, plans for implementing more microkernels

https://www.lei.chat/ - Blog of one of the IREE developers who has some very helpful posts regarding MLIR + GPUs

https://github.com/wolfpld/tracy - Tracy profiler

https://github.com/iree-org/iree-jax - Start of an IREE-JAX integration
