Notes on ml5.js research during the summer of 2025.

Summer of ml5.js

This is a journal of my research on ml5.js during the summer of 2025, specifically on the topic of depth estimation and how it can best be a part of ml5. I'll try to write it as a single "article" with multiple chapters instead of a week-by-week blog.

On TensorFlow: Fast-but-Rough Portrait Depth Estimation

An important part of this work was already started by Alan Ren, who developed an implementation of TensorFlow's depth estimation for ml5.js. This is the starting point for my research.

TensorFlow's depth estimation uses the ARPortraitDepth model (see this 2022 TensorFlow blog post), which, as the name implies, is designed specifically for depth estimation of portrait images, for AR purposes on mobile phones. For this reason, it is a very lean model that can run in realtime (~51 fps on an M1 MacBook and ~22 fps on an iPhone 13, according to the blog post linked above).

How does it work?

TensorFlow's implementation is actually not just a model but an API, the Portrait Depth API, which takes in an image, passes it through MediaPipe Selfie Segmentation (see the source code) to separate people from the background, and then runs it through the depth estimation model before returning the result. This is a very interesting approach, since separating objects from a depth map alone is not very effective:

Rough 3D model of two carts on the street, each holding a few boxes of soda cans.

3D model made with depth estimation (Depth Anything model) of a flat image. Here the depth planes were separated wherever adjacent z-values differed by more than a threshold. This method causes streaks on the edges of objects where the pixel data is ambiguous for the model.
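
Going back to the Portrait Depth API itself, calling it directly in TensorFlow.js looks roughly like this. This is a minimal sketch based on the 2022 blog post; the exact config options may vary between package versions, and `image` stands in for any image, canvas, or video frame:

```js
import '@tensorflow/tfjs-backend-webgl';
import * as depthEstimation from '@tensorflow-models/depth-estimation';

// Load the ARPortraitDepth model through the Portrait Depth API.
const model = depthEstimation.SupportedModels.ARPortraitDepth;
const estimator = await depthEstimation.createEstimator(model);

// The API segments the person first, then runs the depth model on the result.
const depthMap = await estimator.estimateDepth(image, {
  minDepth: 0, // output depth values are normalized to this range
  maxDepth: 1,
});

// The result can be read back as an image, an array, or a tensor.
const depthImage = await depthMap.toCanvasImageSource();
```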

Key Takeaways

  • Specific to people (selfies); it will not work on other sorts of images, which is somewhat limiting for ml5.js users, but still useful for realtime interactivity with people.
  • Works on multiple people at once; I still have to test its limits.
  • The combination of segmentation + depth estimation is a good way to overcome the ambiguity of edges in depth maps. It could be applied to other depth models by passing the images through segmentation first.

Experiences using the model

To test the model and compare it to the options in Transformers.js, I used the branch on PR #248 to replicate a sketch I had previously made with Transformers.js, which uses depth estimation to turn the webcam image into a 3D mesh that can be rotated and panned with orbitControl().
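
The core of that sketch looks roughly like this. It is a simplified sketch (not the exact code from the linked sketch), assuming a WEBGL canvas and that `depthImage` and `camImage` are p5.Images of the same size, updated elsewhere from the estimation callback and the webcam:

```js
function draw() {
  background(0);
  orbitControl(); // rotate/pan the reconstructed surface with the mouse
  if (!depthImage) return;

  depthImage.loadPixels();
  camImage.loadPixels();
  noStroke();

  const step = 4; // sample every 4th pixel to keep the vertex count manageable
  for (let y = 0; y < depthImage.height - step; y += step) {
    beginShape(TRIANGLE_STRIP);
    for (let x = 0; x < depthImage.width; x += step) {
      for (const yy of [y, y + step]) {
        const i = 4 * (yy * depthImage.width + x);
        // Brightness in the depth map becomes the z displacement of the vertex.
        const z = map(depthImage.pixels[i], 0, 255, -200, 200);
        fill(camImage.pixels[i], camImage.pixels[i + 1], camImage.pixels[i + 2]);
        vertex(x - depthImage.width / 2, yy - depthImage.height / 2, z);
      }
    }
    endShape();
  }
}
```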

Dealing with edge noise

Working on this, it became apparent that the TensorFlow AR Portrait Depth API is very noisy at the edges of the detected portrait, despite using segmentation. This can be seen in the generated depth map, but becomes even more apparent in 3D:

Left to right: snapshot from the webcam. Depth map generated with the TensorFlow model, notice the noise at the edges. Front view of the 3D model generated from the depth map. 45º view of the 3D model.

Since we already generated a segmentation mask before feeding the image into the model, I thought it would make sense to use that mask again in the output. By using a dilation/erosion filter algorithm we can remove the border of the portrait, taking away the noise.

I first modified the source implementation of the model in ml5.js to expose the segmentation mask (binaryMask inside the processDepthMap function) in the results object returned with the estimation, so I could use it in my sketch. Then I added the dilation filter to dilate the non-zero-alpha regions in the mask, and finally masked the depthMap with the new, shrunken segmentation mask (the erosion step is sketched after the figures below):

Small demonstration with a slider changing the dilation factor. By dilating by just 4-6 pixels we get much cleaner edges in the resulting depth map.

Generating the 3D model from the depth map after a 6px dilation/erosion.
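
In code, the erosion step can be sketched roughly like this. It is a minimal sketch of the idea, assuming the mask and the depth map are both available as p5.Images; the names `maskImage` and `depthImage` are placeholders, not the properties the wrapper actually exposes:

```js
// Shrink the person mask by `radius` pixels so the noisy border gets cut off.
// A brute-force erosion: keep a pixel opaque only if all of its neighbors
// within `radius` are also opaque.
function erodeMask(maskImage, radius) {
  const eroded = createImage(maskImage.width, maskImage.height);
  maskImage.loadPixels();
  eroded.loadPixels();
  for (let y = 0; y < maskImage.height; y++) {
    for (let x = 0; x < maskImage.width; x++) {
      let keep = true;
      for (let dy = -radius; dy <= radius && keep; dy++) {
        for (let dx = -radius; dx <= radius && keep; dx++) {
          const nx = constrain(x + dx, 0, maskImage.width - 1);
          const ny = constrain(y + dy, 0, maskImage.height - 1);
          if (maskImage.pixels[4 * (ny * maskImage.width + nx) + 3] === 0) keep = false;
        }
      }
      const i = 4 * (y * maskImage.width + x);
      eroded.pixels[i] = eroded.pixels[i + 1] = eroded.pixels[i + 2] = 255;
      eroded.pixels[i + 3] = keep ? 255 : 0;
    }
  }
  eroded.updatePixels();
  return eroded;
}

// Usage: cut ~6px off the mask and apply it to the depth map image.
// depthImage.mask(erodeMask(maskImage, 6)); // hides the noisy border pixels
```

A square window like this is O(radius²) per pixel, which is fine for a one-off 6px pass on a small mask but would be worth optimizing for realtime use.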

I think this could be an indispensable part of using this model, and it could be implemented directly in the ml5.js wrapper for it. The dilation factor could be controlled with an option passed into ml5.depthEstimation().
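
For example (purely hypothetical; the option name maskDilation is made up here to illustrate the idea):

```js
// Hypothetical option: how many pixels to erode from the segmentation mask
// before applying it to the depth map. Not part of the current PR.
let depthEstimator = ml5.depthEstimation({ maskDilation: 6 });
```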

Comparisons

Speed

To compare the speed of the TensorFlow AR Portrait Depth model and the Depth Anything v2 Small model on Transformers.js with the fp16 option (16-bit floating point numbers), I ran live estimation on both and measured the time each estimation took. The key aspects of the code were:

  • Depth was estimated from a p5.Graphics with pixelDensity(1).
  • Estimation time was measured between calls of the estimation callback functions (see the timing sketch below).
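
The timing itself boils down to measuring the interval between callbacks, roughly like this (a minimal sketch; `estimateDepth` and `buffer` stand in for whichever model call and p5.Graphics each version of the sketch used):

```js
let lastTime = 0;

function gotResults(result) {
  // Time between two callback calls = time of one full estimation cycle.
  const now = millis();
  console.log(`~${(1000 / (now - lastTime)).toFixed(1)} fps`);
  lastTime = now;

  // Copy the current webcam frame into a pixelDensity(1) graphics buffer
  // so both models always receive an image of the exact same pixel size.
  buffer.image(video, 0, 0, buffer.width, buffer.height);
  estimateDepth(buffer, gotResults); // stand-in for the actual model call
}
```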

Estimation on a 640x480 image:

TensorFlow on the left, Transformers.js on the right

TensorFlow clocks in at ~3 fps faster than Transformers.js, but the difference in speed is not very perceptible.

Estimation on a 320x240 image:

TensorFlow on the left, Transformers.js on the right

TensorFlow clocks in at ~7-8 fps faster than Transformers.js. Here the difference is very noticeable.

Suffice it to say that the TensorFlow model runs faster, but not by a hugely perceptible margin. Transformers.js still manages to run close enough to realtime to allow interactivity in p5 sketches.

Refactoring the TensorFlow implementation

Since a pull request with the TensorFlow implementation was already well underway thanks to Alan's work, we decided it would be a good idea to keep working on that branch first in order to merge it. This would be both a good way to learn the inner workings of ml5.js and a quicker path to getting depth estimation into the library, paving the way for Transformers.js later on!

Restructuring the result object

After some discussion in the Pull Request, and looking to prioritize the p5.js user, the result object was restructured slightly in 0dc3304.

The idea is that users will name the result received in the callback function as depthMap:

let depthMap;

function gotResults(result) {
  depthMap = result; //Update the depthMap with the new result
}

This follows the conventions set by other modules, like the result for handPose being called hands, or for bodyPose, poses.

With this in mind, users can access the most important property intuitively:

depthMap.image // returns a p5.Image of the depth map in a grayscale colormap
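
Put together, a minimal sketch using the module could look something like this. The loading and start calls here are illustrative, mirroring the pattern of other ml5 modules like bodyPose, since the final API is still being settled in the PR:

```js
let depthEstimation;
let depthMap;
let video;

function preload() {
  depthEstimation = ml5.depthEstimation(); // illustrative: load the model
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide();
  // Illustrative: start continuous estimation on the webcam,
  // like detectStart() in handPose/bodyPose.
  depthEstimation.estimateStart(video, gotResults);
}

function gotResults(result) {
  depthMap = result; // update the depthMap with the new result
}

function draw() {
  background(0);
  if (depthMap) {
    image(depthMap.image, 0, 0); // grayscale depth map as a p5.Image
  }
}
```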

Reusing segmentPeople() calls

There was an opportunity to reduce the number of times segmentation was run on the source media by reusing segmentation operations already done on the same frame. This was implemented in 43944ef, meaning the module now does a single segmentation operation per frame, which had a positive effect on the FPS of the estimation calls (the caching idea is sketched below the measurements):

↑ A 640x480 image, reaching up to 14fps

↑ A 320x240 image, reaching up to 30fps
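
The reuse boils down to caching the last segmentation alongside the frame it was computed for, roughly like this (a simplified sketch of the idea, not the actual code from 43944ef):

```js
let lastSegmentedFrame = null;
let lastSegmentation = null;

async function getSegmentation(frame) {
  // Only run segmentPeople() again if this is a new frame;
  // otherwise reuse the segmentation already computed for it.
  if (frame !== lastSegmentedFrame) {
    lastSegmentation = await segmenter.segmentPeople(frame);
    lastSegmentedFrame = frame;
  }
  return lastSegmentation;
}
```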

Alan also mentioned that one of his goals is to remove the segmentation model being used here and instead use one of the models already bundled with ml5. This may also bring performance gains.

@alanvww commented Jul 6, 2025

Hi Nasif! I am going over my PR and your updates. Thanks for all the fixes! They are amazing. One question I have: for Transformers.js with Depth Anything v2 Small, is it running on WebGPU?

@nasif-co (Author) commented Jul 6, 2025

Yes! You can find the sketch I used here; it's using Depth Anything v2 Small with 16-bit floating point numbers running on WebGPU.
