
Conversation

Member

@alanvww alanvww commented May 1, 2025

Hello! This PR adds depth estimation functionality from TensorFlow.js to ml5.

I primarily referred to this example for its performance and results.

Testing sketches:
depthEstimation-video
depthEstimation-single-image


  • Set grayscale colormap as default
  • Remove bodySegmentation?
  • Backend option to use transformers.js(!)

Also changed the mention of this in the examples.
Removed console logs; the comments are clear enough without them. Also renamed the examples' <title> tags to match the ml5.js format.
@nasif-co
Contributor

nasif-co commented Jul 2, 2025

Wanted to add a to-do list of tasks that I'll try to work on for this PR; please let me know if there are suggestions!

  • Reorganize depthmap images in result object
  • Reuse the initial segmentation result in the masking section of processDepthMap()
  • Add dilation filter to the masking section and dilation parameters to the options object
  • Write simple "hello world" examples
  • Diagnose size mismatch issue between source video and depthmap when video is resized.
  • Clean up console.logs in the library file
  • Align code with our p5 2.0 compatibility decisions from p5.js 2.0 Compatibility #244

@alanvww alanvww marked this pull request as ready for review July 11, 2025 05:31
nasif-co added 3 commits July 13, 2025 21:49
Removed the depth estimation tensor from the result object so we could handle disposing of it internally. Also tested ml5.tf.memory() on the current code and found a memory leak, which ended up being due to some segmentation tensors not being disposed. I replaced the disposal code being used here with the one used in the official tensorflow examples, which fixed the leak.
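A minimal sketch of the kind of leak check described above, assuming ml5 exposes TensorFlow.js as `ml5.tf` (as the `ml5.tf.memory()` call suggests); the helper name is hypothetical:

```js
// Log the live tensor count each frame and warn if it keeps climbing,
// which indicates tensors are being created but never disposed.
let lastCount;

function checkTensors() {
  const count = ml5.tf.memory().numTensors;
  if (lastCount !== undefined && count > lastCount) {
    console.warn(`Tensor count grew by ${count - lastCount}; possible leak`);
  }
  lastCount = count;
}
```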
Added the dilation algorithm to the library. The level of dilation is controlled by the config option 'dilationFactor', which takes values between 0 and 10, corresponding to the number of pixels to grow the background into the silhouette. Larger dilation factors affect fps because they need longer loops to look for bounds.

Also made the mask available as a p5.Image in the result, under the name 'mask'. This mask is compatible with the p5 mask() function, so it is easy to use it to cut out the profile from the background.

Lastly, also optimized the helper function that turns imageData into a p5.Image by replacing set() with a direct copy of the imageData.data array into the pixels array.
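A sketch of what that optimized helper might look like (the function name and p5-instance handling are assumptions, not the PR's actual code):

```js
// Instead of a per-pixel img.set(x, y, color), copy the whole RGBA
// byte array in one call; img.pixels is a typed-array view after
// loadPixels(), so TypedArray.set() does a bulk copy.
function imageDataToP5Image(imageData, pInst) {
  const img = pInst.createImage(imageData.width, imageData.height);
  img.loadPixels();
  img.pixels.set(imageData.data);
  img.updatePixels();
  return img;
}
```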
The first two examples are aimed at being starting points for using the model. One is simply webcam depth estimation without any interface; the other is the same but uses the mask to clear out the background.

Also made applying the segmentation mask the default for the model, since it performs much better with it.
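A hedged sketch of using the resulting mask. Only 'dilationFactor' and the 'mask' p5.Image are confirmed by this thread; the ml5.depthEstimation() call shape and callback are assumptions modeled on other ml5 models:

```js
let video, latest;

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide();
  // Call shape assumed; 'dilationFactor' is the option named above.
  ml5.depthEstimation(video, { dilationFactor: 4 }, gotDepth);
}

function gotDepth(result) {
  latest = result;
}

function draw() {
  background(30);
  if (latest) {
    const frame = video.get(); // current webcam frame as a p5.Image
    frame.mask(latest.mask);   // cut out the silhouette with the model's mask
    image(frame, 0, 0, width, height);
  }
}
```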
@shiffman
Member

Hi @nasif-co, I took a look at your latest updates and reviewed the examples, amazing work! A few quick questions / thoughts.

Diagnose size mismatch issue between source video and depthmap when video is resized.

Is this an issue only if you resize the video during a sketch, or if you just call video.size(w, h) once in setup() does it break the depth map?

The new examples are fantastic!

  • Am I right that darker pixels are closer to the camera and brighter are further? This is the opposite of my expectation since I think the transformers.js models work the other way. Nothing to change here, just noting it as something to mention in documentation!
  • The "hello world" examples are perfect, exactly as I imagined! Now after seeing them I'm wondering if we might consider including a single example that incorporates something with 3D or uses the pixel data in some way? Perhaps a nested loop through every N pixels and draw a box() for each value with a z-position mapped to the pixel value?
  • I think the example that uses the mask() might be more effective if there is an image or maybe something simple drawn behind the silhouette, it's not so clear what is happening with only clear()!

@nasif-co
Contributor

Is this an issue only if you resize the video during a sketch, or if you just call video.size(w, h) once in setup() does it break the depth map?

It happens by calling video.size(w, h) once in setup(), you can see a reproduction of the issue in this sketch.
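A minimal sketch of the setup that triggers it (an illustration of the described behavior, not the linked reproduction itself):

```js
let video;

function setup() {
  createCanvas(320, 240);
  video = createCapture(VIDEO);
  video.size(320, 240); // sets the display size; the intrinsic stream may stay e.g. 640x480
  // Starting depth estimation on `video` now yields a depth map sized to the
  // intrinsic dimensions, so it no longer lines up with the displayed video.
}
```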

  • Am I right that darker pixels are closer to the camera and brighter are further? This is the opposite of my expectation since I think the transformers.js models work the other way. Nothing to change here, just noting it as something to mention in documentation!

Yes, you are right; it confused me a bit at first. Now that I'm looking into it, changing it to be the other way around seems to be a small change that could be useful in keeping consistency with transformers.js, looking ahead at integrating it. I'll commit that small change.

  • The "hello world" examples are perfect, exactly as I imagined! Now after seeing them I'm wondering if we might consider including a single example that incorporates something with 3D or uses the pixel data in some way? Perhaps a nested loop through every N pixels and draw a box() for each value with a z-position mapped to the pixel value?

Yes! I was looking at including some of those next. I had this sketch I made a few months ago for class using transformers.js, which sounds like what you're describing. I'll port that one to ml5. Do you think we should just do the one? I was thinking of also adding one that builds a 3D mesh using the depth map, but I don't know if that strays more into tutorial territory than example territory.
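A rough sketch of the box-grid idea, assuming a WEBGL canvas, a stored result in `latest`, and the getDepthAt() accessor mentioned later in this thread (its signature and 0..1 range are assumptions):

```js
const N = 12; // sample every N pixels

function draw() {
  background(0);
  if (!latest) return;
  for (let y = 0; y < height; y += N) {
    for (let x = 0; x < width; x += N) {
      const d = latest.getDepthAt(x, y); // assumed 0 (far) .. 1 (near)
      push();
      translate(x - width / 2, y - height / 2, map(d, 0, 1, -200, 200));
      box(N * 0.8);
      pop();
    }
  }
}
```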

I was also planning on adding an example that showcases how to "detect" distance, so that different interactions can occur depending on how close or far a subject is: something like a chain of if/else statements, each with a different interaction. I suspect this would be a common use case of the depth estimation model.
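Something like this sketch of the if/else chain, again assuming a hypothetical getDepthAt() accessor returning a normalized 0..1 value (higher = closer):

```js
function react(result) {
  const d = result.getDepthAt(width / 2, height / 2); // depth at frame center
  if (d > 0.8) {
    // very close: trigger the close-range interaction
  } else if (d > 0.5) {
    // mid-range: show instructions
  } else {
    // far away: attract mode
  }
}
```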

  • I think the example that uses the mask() might be more effective if there is an image or maybe something simple drawn behind the silhouette, it's not so clear what is happening with only clear()!

I agree, I'll add something simple in the back, maybe a background color shifting in hue or just an image.

Thanks for the detailed review! :)

nasif-co added 2 commits July 19, 2025 00:45
To match transformers.js: lighter pixels are closer to the camera, darker are farther from it.
To help visualize what using the mask together with the depthMap does.
@nasif-co
Contributor

Interestingly, the bug also affects the body segmentation module when using SelfieSegmentation (see a sketch of it), but not when using BodyPix, which is strange since both use the same function to do the detection.

On the other hand, it makes some sense, since the ARPortraitDepth model we are using here also uses SelfieSegmentation, so that may be the root of the issue. I feel it has something to do with the source video's intrinsic dimensions as opposed to its display dimensions, and with how the video.size() method only sets the display size. The only way I have found to change the intrinsic dimensions of the webcam video element is to request them with getUserMedia when creating the capture, which is out of the question.

Going forward, I think the best solution is rendering the video pixels to a separate canvas/p5.Graphics in ml5 and passing that as the detectMedia. Would love to hear some thoughts on this!
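A sketch of the proposed workaround (an illustration of the idea, not implemented code): draw the video into an offscreen p5.Graphics at the display size, so the media handed to the model has matching intrinsic and display dimensions.

```js
let video, buffer;

function setup() {
  createCanvas(320, 240);
  video = createCapture(VIDEO);
  video.size(320, 240);
  buffer = createGraphics(320, 240);
}

function draw() {
  // Rendering into the buffer normalizes the dimensions the model sees.
  buffer.image(video, 0, 0, buffer.width, buffer.height);
  // Pass `buffer` (or its underlying canvas element) to the model as detectMedia.
}
```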

nasif-co added 7 commits July 20, 2025 15:38
The bug was due to estimation being done on the source element's intrinsic dimensions rather than the display dimensions set by the user, leading to unexpected output. We needed to resize the media given by the user before passing it to the models.

After some discussions on the Discord, I opted to resize the input media through TensorFlow.js's own methods. I think this might be more performant than resizing the image in a canvas, but I didn't test them side by side to corroborate.
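What resizing through TensorFlow.js might look like (an assumption of the approach, not the PR's exact code; `videoElement`, `displayHeight`, and `displayWidth` are placeholder names):

```js
// Read the current frame into a tensor and resize it on the tf backend
// rather than through a canvas; tidy() disposes the intermediate tensor.
const resized = ml5.tf.tidy(() =>
  ml5.tf.image.resizeBilinear(
    ml5.tf.browser.fromPixels(videoElement), // HTMLVideoElement or canvas
    [displayHeight, displayWidth]            // the user's display dimensions
  )
);
// ...run the estimation on `resized`, then call resized.dispose().
```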
Simplified existing examples and aligned them with the changes in config defaults.
Converted console logs to comments.
Realized the mask and dilation were not being applied to the data array, and therefore not to the getDepthAt() method. Fixed it for consistency.
Make the code a little simpler.
Since we already have a webcam video example, it felt redundant for the depthEstimation-video example to also use the webcam, so I modified it to instead showcase how to run depthEstimation on a video file.
nasif-co added 3 commits July 31, 2025 22:20
The depth estimation result now includes the exact frame of the input that was used to generate the returned estimation. This is useful for aligning the image with the estimation, especially if the model is running at a lower fps than the source video (which is most often the case).
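A hypothetical use of that returned frame (the property names below are assumptions): draw the frame the estimation was computed from, rather than the live video, so image and depth map stay aligned at low model fps.

```js
function gotDepth(result) {
  image(result.frame, 0, 0);                     // exact input frame (name assumed)
  image(result.depthMap, result.frame.width, 0); // its matching depth map, side by side
}
```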
Shows how to use the depth estimation result together with p5.js 3D geometry tools to build a live mesh of the webcam video.
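One way such a mesh could be built with plain p5 geometry calls (a sketch under assumptions, not the example's actual code; assumes a WEBGL canvas and the hypothetical getDepthAt() accessor from earlier in this thread):

```js
// Build one triangle strip per row, displacing z by the depth value.
function drawMesh(result, step = 8) {
  push();
  translate(-width / 2, -height / 2); // center the mesh in WEBGL coordinates
  for (let y = 0; y < height - step; y += step) {
    beginShape(TRIANGLE_STRIP);
    for (let x = 0; x < width; x += step) {
      vertex(x, y, result.getDepthAt(x, y) * 100);
      vertex(x, y + step, result.getDepthAt(x, y + step) * 100);
    }
    endShape();
  }
  pop();
}
```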
@nasif-co nasif-co force-pushed the alan.tensorflow-depth-estimation branch from 48186f8 to 04981a1 Compare August 1, 2025 03:36
Replace the code fixing the size mismatch bug by using the new function designed for that: resizeImageAsTensor.
@nasif-co nasif-co force-pushed the alan.tensorflow-depth-estimation branch from 04981a1 to 085439b Compare August 1, 2025 03:40
@nasif-co
Contributor

nasif-co commented Aug 1, 2025

Updated the examples to add the p5 2.0 version. Interestingly, the point cloud example had a performance drop, and the mesh example had a great performance boost. Looking into processing/p5.js#6438, it seems related to the mesh example making use of p5.Geometry while the point cloud just uses 3D primitives. It may be a good idea to modify the point cloud example in the future to use p5.Geometry instead.

Member

@shiffman shiffman left a comment

Incredible work, thank you to @alanvww for getting this started and @nasif-co for completing it! This feature likely won't be released until early September, so we have time to do additional testing for bugs as well as tweak or alter any of the examples if other contributors have comments. But I'd like to merge this today to mark the end of the summer research period! Happy August! 💜

@shiffman shiffman merged commit 6824b4c into ml5js:main Aug 1, 2025
@alanvww alanvww deleted the alan.tensorflow-depth-estimation branch October 8, 2025 14:37