Benchmarks for TriMesh transforms #18

Merged
nirmal-suthar merged 20 commits into master from ns/benchmarks on Aug 2, 2020
Conversation

@nirmal-suthar (Contributor) commented Jul 26, 2020

Fixes #17

Update TriMesh to compute verts only when demanded in case of invalidation; update _compute_* to return nothing.
@nirmal-suthar (Contributor, Author)

[image: bm_trimesh]

@avik-pal (Member)

Do you have any intuition as to why Kaolin's CPU benchmark is so noisy? You could try increasing niters for the benchmarks to reduce the variance (though it should not affect the minimum time).
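As a minimal sketch of what increasing niters looks like on the Julia side (assuming a generic transform `f` and input `x`, both hypothetical names; the actual benchmark scripts may differ), reporting both the minimum and the mean makes the variance visible while the minimum stays roughly stable:

```julia
using Statistics  # for mean

# Hypothetical helper: run `f(x)` niters times and report both the minimum
# and the mean wall-clock time in seconds. A higher niters smooths the mean,
# while the minimum should stay roughly unchanged.
function bench(f, x; niters = 1000)
    f(x)  # warm-up so compilation is not timed
    times = [@elapsed(f(x)) for _ in 1:niters]
    (min = minimum(times), mean = mean(times))
end
```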

@nirmal-suthar (Contributor, Author) commented Jul 27, 2020

With higher niters (updated):
[image: bm_trimesh]

@avik-pal (Member)

Hmm... yeah, it doesn't help. Let's try using a standard mesh: just use an icosphere and plot the trend.

@nirmal-suthar (Contributor, Author) commented Jul 27, 2020

> Hmm... yeah, it doesn't help. Let's try using a standard mesh: just use an icosphere and plot the trend.

Could the time module itself be the source of the noise? The GPU benchmarks, which do not use the time module, look pretty smooth.

@avik-pal (Member)

It is possible, but then again the PyTorch people are also using a similar script.

@nirmal-suthar (Contributor, Author)

I think I have figured out the problem. The Cyclops CPUs are already busy: while the CPU benchmark is running, htop shows all CPUs maxed out, which is why the benchmarks are noisy [I have updated the plots above]. I will run the benchmark script again once Cyclops is free.

@nirmal-suthar (Contributor, Author) commented Jul 30, 2020

@avik-pal I have updated the TriMesh plot above; Cyclops was mostly free while benchmarking.

@avik-pal (Member) commented Jul 30, 2020

Hmm... interesting. So making it in-place has zero effect on the performance... that doesn't seem right!
As per the discussion on Slack, this turned out to be a type-inference issue in the std function in CUDA.jl (JuliaGPU/CUDA.jl#336, now fixed). The following function can be used in the meantime:

stdev(a, m) = sqrt.(sum((a .- m) .^ 2, dims = 2) / size(a, 2))
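For context, here is a minimal sketch of how such a workaround could slot into a normalize-style transform (illustrative only; normalize_verts is a hypothetical name, not the actual Flux3D implementation):

```julia
using Statistics: mean

# Illustrative only: center and scale a 3×N vertex matrix using the stdev
# workaround above in place of Statistics.std.
function normalize_verts(verts)
    m = mean(verts, dims = 2)                            # per-coordinate mean, 3×1
    return (verts .- m) ./ max.(stdev(verts, m), 1f-6)   # guard against zero std
end
```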

Additionally, the performance has improved significantly on CUDA.jl master. We will continue to use the latest stable release, but we should still check whether any of the performance issues persist on master. If they don't, we should go ahead and merge this.

And how are the metrics faring against Kaolin?

@nirmal-suthar (Contributor, Author)

For some reason, chamfer distance in Kaolin is not working on my setup, so I will try to fix it today. This loss is written in native CUDA, so I am not able to figure out exactly what is going wrong.

@avik-pal (Member)

Interesting... if you can't fix it, let me know. I have Kaolin set up on a cluster; I will install Julia and run the benchmarks. (And maybe post exactly which scripts I need to run.)

@nirmal-suthar (Contributor, Author) commented Jul 31, 2020

The Kaolin benchmarks were too noisy on Cyclops, so I tried benchmarking on Colab (Kaolin's chamfer_distance on CPU was very slow to benchmark).

[image: bm_metrics]
[image: bm_trimesh]

@nirmal-suthar (Contributor, Author)

@avik-pal Is the build failing because Julia 1.3 is not compatible with GPUCompiler?
ERROR: Unsatisfiable requirements detected for package GPUCompiler [61eb1bfa]:
Tests are passing locally on Julia 1.4.

@avik-pal (Member)

@nirmal-suthar Could you try getting rid of the Manifest once?
It says "restricted to versions 0.5.5 by an explicit requirement, leaving only versions 0.5.5", but there is no such restriction in the Project.toml.

@avik-pal (Member)

> Kaolin's chamfer_distance on CPU was very slow to benchmark

True. I am happy that Flux3D can at least compute it on the CPU. It is a very expensive operation and a general bottleneck for any 3D mesh-based work I have done. I would suggest adding a note that the Kaolin version is too slow to benchmark in a reasonable time.

We also need the timings for the forward and backward passes, so every plot in the grid should have 6 lines (forward, backward, and total for both frameworks).
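A rough sketch of how the forward/backward split could be timed on the Julia side (placeholder names; Zygote's pullback is assumed here, and the actual benchmark scripts may differ):

```julia
using Zygote

# Rough sketch: time the forward pass and the pullback separately for a
# scalar loss function `loss` applied to an input `x` (placeholder names).
function time_fwd_bwd(loss, x)
    y, back = Zygote.pullback(loss, x)        # warm-up / compile
    t_fwd = @elapsed Zygote.pullback(loss, x) # forward pass only
    t_bwd = @elapsed back(one(y))             # backward pass via the pullback
    (forward = t_fwd, backward = t_bwd, total = t_fwd + t_bwd)
end
```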

@avik-pal (Member) commented Jul 31, 2020

I also feel it makes sense for us to have a CUDA kernel for chamfer distance, given that the difference right now is somewhat big. (Maybe check once on CUDA#master before doing that.)

@nirmal-suthar (Contributor, Author)

The above benchmarks are on CUDA#master

@avik-pal (Member)

Normalize is slow on master? When I checked yesterday, it was faster on my setup on Cyclops.

@nirmal-suthar (Contributor, Author)

The current issue with chamfer distance in Flux3D is that, for calculating nearest-neighbour distances, we compute a matrix multiplication (producing an array of size N1 x N2 x B), which is a really space-expensive operation when the number of verts is high. But yes, it is able to match Kaolin for a typical number of points.
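For illustration only (not the actual Flux3D code), the memory-heavy approach described above looks roughly like this for a single batch element:

```julia
# Illustrative sketch: all pairwise squared distances between point sets
# x (3×N1) and y (3×N2), then a reduction to nearest-neighbour distances.
# The N1×N2 intermediate is what becomes expensive for large meshes.
function pairwise_nn_sqdist(x, y)
    xx = sum(abs2, x, dims = 1)        # 1×N1 squared norms of x
    yy = sum(abs2, y, dims = 1)        # 1×N2 squared norms of y
    d  = xx' .+ yy .- 2 .* (x' * y)    # N1×N2 pairwise squared distances
    return vec(minimum(d, dims = 2))   # nearest neighbour in y for each point of x
end
```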

@avik-pal (Member)

I have never come across 2^14 vertices in practice; most of my problems involved around 1000 vertices. So it is good that we can match the performance of a CUDA kernel without writing one ourselves. I would suggest finishing the remaining benchmarks, then opening an issue for this and tackling it separately.

@nirmal-suthar (Contributor, Author)

> Normalize is slow on master? When I checked yesterday, it was faster on my setup on Cyclops.

Benchmarks on Cyclops (CUDA#master) were better than the previous numbers, but I will check this again.

@codecov (bot) commented Jul 31, 2020

Codecov Report

Merging #18 into master will increase coverage by 0.45%.
The diff coverage is 84.21%.

[image: impacted file tree graph]

@@            Coverage Diff             @@
##           master      #18      +/-   ##
==========================================
+ Coverage   73.85%   74.31%   +0.45%     
==========================================
  Files          21       21              
  Lines         742      763      +21     
==========================================
+ Hits          548      567      +19     
- Misses        194      196       +2     
Impacted Files                             Coverage Δ
src/datasets/modelnet10/mn10_pcloud.jl     0.00%    <0.00%>    (ø)
src/datasets/modelnet40/mn40_pcloud.jl     0.00%    <0.00%>    (ø)
src/metrics/pcloud.jl                      95.23%   <0.00%>    (-4.77%) ⬇️
src/rep/utils.jl                           91.35%   <88.23%>   (-3.01%) ⬇️
src/rep/mesh.jl                            91.16%   <91.66%>   (+2.16%) ⬆️
src/metrics/mesh.jl                        100.00%  <100.00%>  (ø)
src/transforms/mesh_func.jl                88.77%   <100.00%>  (+0.23%) ⬆️
src/transforms/pcloud_func.jl              84.61%   <100.00%>  (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52e9a48...8f63ba8.

@nirmal-suthar (Contributor, Author)

[image: bm_metrics]

@nirmal-suthar (Contributor, Author)

@avik-pal I am not able to match normalize with Kaolin on the GPU. Could this be due to my Julia environment? In my environment, the CUDA.jl version is 1.2.1.

@avik-pal (Member) commented Aug 2, 2020

I think it might be a mistake on my end. Anyway, the difference is not much.

From the updated plots I feel we need a kernel for chamfer distance and an adjoint for it; that should allow us to beat the Kaolin numbers. Even for the Laplacian loss we need that adjoint. Since these cases involve an unreasonably high number of vertices, let's not block this PR on that and handle it elsewhere.

@nirmal-suthar (Contributor, Author)

Posting some points before merging this PR:

  • CUSPARSE is currently broken for Float32, and the CUSPARSE matrix takes much longer, so laplacian_loss uses a plain sparse array even on the GPU.
  • laplacian_loss in Kaolin is slightly different from the one in Flux3D, so for benchmarking in Kaolin I am using a simple custom laplacian_loss that matches Flux3D's (see the sketch below).
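For reference, a rough sketch of what such a simple Laplacian loss can look like (not the exact Flux3D or benchmark code; verts is assumed to be 3×N and each undirected edge to be listed once):

```julia
using SparseArrays, Statistics

# Rough sketch only: uniform Laplacian loss built from a sparse adjacency
# matrix, penalizing each vertex's offset from the mean of its neighbours.
function simple_laplacian_loss(verts, edges)
    N = size(verts, 2)
    src = [e[1] for e in edges]
    dst = [e[2] for e in edges]
    A = sparse(vcat(src, dst), vcat(dst, src), ones(2 * length(edges)), N, N)  # symmetric adjacency
    deg = vec(sum(A, dims = 2))                                                # vertex degrees
    nbr_mean = (verts * A) ./ reshape(max.(deg, 1.0), 1, N)                    # mean of neighbours
    return mean(sum(abs2, nbr_mean .- verts, dims = 1))
end
```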

@avik-pal (Member) commented Aug 2, 2020

In the benchmarks directory, create a README and mention these points.

Additionally, if the CUSPARSE issue is not already known, open an issue in CUDA.jl.

@nirmal-suthar (Contributor, Author)

There was an existing issue
CUDA.jl/issues/322

@nirmal-suthar merged commit 2560f1c into master on Aug 2, 2020
@avik-pal deleted the ns/benchmarks branch on August 2, 2020 20:19
This pull request closes: Benchmarks for the Mesh Representation