CUDA-accelerated robotics algorithms (C++/CUDA), based on PythonRobotics and CppRobotics plus differentiable extensions.
Each demo runs the same algorithm on CPU and GPU; the GPU version scales to orders of magnitude more particles, samples, or rays:

| Capability | Demo | GPU scale | Headline |
|---|---|---|---|
| Occupancy grid | comparison_occupancy_grid | 256x256 | log-odds raycast |
| Collision check | comparison_collision_check | 1M segments/scan | 1,277x per candidate |
| Scan matching | comparison_icp, comparison_ndt, gpu_ndt_3d_multires, gicp, gpu_correlative_scan_matching, gpu_branch_and_bound_csm | 10K+ points / 2.1M candidate poses | parallel correspondences; exhaustive global CSM recovers large offsets (44/44) where a local field matcher fails (5/44), 487x vs CPU; branch-and-bound over a GPU multi-resolution bound returns the IDENTICAL optimum scoring up to 1004x fewer candidates (40/40 frames exact) |
| Pose-graph SLAM | gpu_pose_graph_slam, gpu_pose_graph_slam_3d, gpu_pose_graph_slam_3d_robust, gpu_pose_graph_slam_3d_switchable, gpu_online_slam, gpu_online_slam_3d_switchable, gpu_csm_loop_closure_slam, gpu_csm_submap_slam, gpu_bnb_loop_closure_slam | 2D 200 poses / 3D 384-420 poses | robust 3D rejects 36/36 false loops, 6.95→0.28 m; switchable constraints learn per-loop switches jointly with poses, 6.95→0.29 m; online 3D switchable rejects false loops live in a sliding window, plain 9.10 m → switchable 0.29 m (21/21 rejected); CSM loop-closure SLAM detects loops from scan data (no GT), ATE 2.03→0.17 m, 49 accepted / 3 rejected; submap front-end fuses sparse scans → 48/52 loops (0.18 m) vs single-scan 17/52 (0.38 m); branch-and-bound loop search scores 957x fewer candidates than brute force over a 4.5M-cell window, identical relpose 51/51, ATE 2.18→0.20 m |
| Particle filter | comparison_pf, gpu_global_localization_mcl, gpu_megaparticles_stein_mcl, gpu_megaparticles_lsh, gpu_megaparticles_6dof, gpu_megaparticles_gicp_mcl, gpu_megaparticles_smoother, gpu_kld_amcl, diff_pf, diff_pf_mlp | 10K-1M particles | MegaParticles-style range SPF: 14.61 m bootstrap vs 0.097 m recovery; explicit p-stable LSH neighbor index lifts neighbor recall 58%→88%; 6-DoF SE(3) relocalization recovers a hidden kidnap to 0.22 m / 1.9 deg (LSH neighbor consensus); surface-aware GICP D2D likelihood halves post-kidnap error vs the field proxy (0.099→0.064 m); a robust fixed-lag smoother over the representative state cuts in-track jitter ~70x (RMSE 5.4→0.25 m) and rejects spurious-mode flips; KLD-AMCL adapts 400→65,536 particles, 15.2x vs CPU |
| RRT family | comparison_rrt*, comparison_rrtstar_rewire | 1M paths / 200K nodes | 5,000x per-path; 62x rewire |
| Crowd / swarm | gpu_crowd_swarm | 10,000 boids with uniform-grid neighbours | 105x vs CPU |
| Graph policy control | gpu_gnn_swarm_controller, gpu_gat_traversability_policy | 2048 agents / 3072 terrain nodes x 3 heads | 2.88 ms/control; 99.4x GAT policy |
| Assignment / tracking | gpu_hungarian_assignment, gpu_assignment_tracking | 512 x 64x64 assignment / 128 tracking scenes | 158x Hungarian; 14.0x tracking |
| Interaction graph risk | gpu_interaction_graph_risk, gpu_interaction_graph_neural_mppi, gpu_multiagent_graph_neural_mppi, gpu_priority_graph_neural_mppi, gpu_intent_graph_neural_mppi, gpu_belief_risk_graph_mppi, gpu_best_response_graph_mppi, gpu_iterative_game_graph_mppi, gpu_noregret_game_graph_mppi, gpu_safe_noregret_game_graph_mppi, gpu_learned_safety_dual_graph_mppi, gpu_trainable_safety_dual_graph_mppi, gpu_planner_showdown_benchmark, gpu_planner_falsifier_benchmark | 2048 agents x 10 message passes / 48-agent graph x 4 passes / 48 robots x 768 MPPI rollouts / 719K scenario falsifier scan | 76.3x risk propagation; interaction-aware MPPI reduces social risk 19.7%; multi-agent graph MPPI cuts cross-route collisions 518 -> 261; priority arbitration cuts 261 -> 245; intent beliefs cut 518 -> 216; belief CVaR cuts collision tail risk 31.6%; best-response game cuts collisions 518 -> 171; damped fictitious play cuts collisions 518 -> 154; regret matching cuts collisions 518 -> 150; safety-constrained no-regret cuts collisions 518 -> 136 and CVaR 43.34 -> 37.93; learned safety-dual prior keeps reach 48/48 with collisions 518 -> 140 and residual 16.53%; trainable safety-dual prior trains 1152 labels and reaches collisions 132 / CVaR 37.10; scenario-conditioned pressure showdown reaches 0 collisions, CVaR 21.61, residual 4.90%, 13.05 ms; adversarial falsifier scans 719,712 scenarios and finds 12/12 no-pressure failures while learned repair passes |
| Risk-aware planning | gpu_reciprocal_risk_planner, gpu_interaction_graph_neural_mppi, gpu_multiagent_graph_neural_mppi, gpu_priority_graph_neural_mppi, gpu_intent_graph_neural_mppi, gpu_belief_risk_graph_mppi, gpu_best_response_graph_mppi, gpu_iterative_game_graph_mppi, gpu_noregret_game_graph_mppi, gpu_safe_noregret_game_graph_mppi, gpu_learned_safety_dual_graph_mppi, gpu_trainable_safety_dual_graph_mppi, gpu_planner_showdown_benchmark, gpu_planner_falsifier_benchmark | 1024 agents x 9 actions x H=16 / 32768 social-risk MPPI rollouts / 48 robot coordinated, priority, intent-aware, CVaR belief, best-response, iterative game, no-regret, safe no-regret, learned/trainable safety-dual, showdown MPPI, and adversarial falsifier | 4.05 ms/plan; 311.5x reciprocal risk; 4140.9x interaction-graph MPPI; 3139.6x multi-agent graph MPPI; 292.9x intent graph MPPI; 3132.6x best-response graph MPPI; 3181.3x iterative game graph MPPI; 2960.6x no-regret graph MPPI; 3043.4x safe no-regret graph MPPI; 3075.5x learned safety-dual prior graph MPPI; 3013.1x trainable safety-dual prior graph MPPI; 2977.5x planner showdown benchmark; 70.7x adversarial falsifier scan |
| SfM / multi-view | gpu_sfm_mini | 2048 features x 4 views | 217.0x match + BA vs CPU |
| Sparse linear solvers | gpu_pcg_solver | 262K unknowns / 1.31M CSR nnz | 13.4x Jacobi-PCG vs CPU |
| Clustering / graph ML | gpu_em_gmm, gpu_spectral_clustering, gpu_label_propagation, gpu_label_propagation_traversability, gpu_graph_crf_traversability | 262K GMM points / 3K graph nodes | 90.2x EM; 193x spectral; 123x propagation; 106x CRF |
| Black-box optimization | gpu_cma_es | 3 x 32,768 candidates x 10D | 1,254x objective eval |
| Monte Carlo planning | gpu_mcts_planner | 64 scenes x 4096 rollouts x 48 horizon | 712x vs CPU |
| Learning-based planning | gpu_diffusion_planner, gpu_diffusion_policy, gpu_diff_value_iteration_traversability, gpu_neural_astar_traversability, gpu_anytime_neural_astar_traversability, gpu_multigoal_neural_astar_traversability, gpu_spatiotemporal_neural_astar_traversability, gpu_experience_graph_neural_planner, gpu_graph_guided_neural_mppi, gpu_kinodynamic_graph_neural_mppi, gpu_multiagent_graph_neural_mppi, gpu_priority_graph_neural_mppi, gpu_intent_graph_neural_mppi, gpu_belief_risk_graph_mppi, gpu_best_response_graph_mppi, gpu_iterative_game_graph_mppi, gpu_noregret_game_graph_mppi, gpu_safe_noregret_game_graph_mppi, gpu_learned_safety_dual_graph_mppi, gpu_trainable_safety_dual_graph_mppi, gpu_planner_showdown_benchmark, gpu_planner_falsifier_benchmark | 512 x 64 trajectories / 192x128 soft VI / 64x neural A* / 1536-node graph / 32768 MPPI rollouts / 48 robot graph MPPI / 719K adversarial scenario scan | analytic score -> BC denoising policy; 747.4x learned-cost VI; 153.1x batched neural A*; 278.5x experience-graph A*; 1320.1x graph-guided MPPI; 100% top-1 intent MPPI; belief-space CVaR tail-risk MPPI; graph-neural best-response, damped fictitious-play, no-regret, safety-constrained no-regret, learned safety-dual prior, trainable safety-dual prior, target-gated planner showdown MPPI, and adversarial falsifier repair gate |
| Voxel map (3D) | comparison_voxel_map | 256x256x32 | 58x per ray |
| ESDF (2D/3D) | comparison_esdf, comparison_esdf_3d | 640K cells / 1.05M voxels | 53,404x / 86,613x |
| LiDAR sim | comparison_lidar_sim, comparison_lidar3d_sim, comparison_lidar3d_realistic | 1M 2D / 131K 3D rays | + 5 physical effects (realistic) |

Figure: GPU Jacobi-PCG sparse SPD solver (262K unknowns, 1.31M CSR nnz, 33 iterations, 13.4x vs CPU).

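
For readers unfamiliar with the Jacobi-PCG solver named above, the following is a minimal CPU sketch of conjugate gradients with a diagonal (Jacobi) preconditioner on a CSR matrix. It is not the repository's CUDA kernel: the function names `csr_matvec`, `jacobi_pcg`, and `poisson_1d_csr` are hypothetical, the test matrix is a small 1D Poisson system chosen only for illustration, and the code assumes a symmetric positive-definite matrix with a nonzero diagonal.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def jacobi_pcg(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b (A SPD) with Jacobi-preconditioned CG; returns (x, iterations)."""
    n = len(b)
    # Jacobi preconditioner: apply M^-1 = 1 / diag(A) element-wise.
    inv_diag = [0.0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                inv_diag[row] = 1.0 / data[k]
    x = [0.0] * n
    r = list(b)  # residual for the zero initial guess
    b_norm = dot(b, b) ** 0.5 or 1.0
    if dot(r, r) ** 0.5 <= tol * b_norm:
        return x, 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = csr_matvec(indptr, indices, data, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


def poisson_1d_csr(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1) in CSR form."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        if i > 0:
            indices.append(i - 1)
            data.append(-1.0)
        indices.append(i)
        data.append(2.0)
        if i < n - 1:
            indices.append(i + 1)
            data.append(-1.0)
        indptr.append(len(indices))
    return indptr, indices, data
```

Calling `jacobi_pcg(*poisson_1d_csr(50), [1.0] * 50)` returns the solution vector and the iteration count. On a GPU the matrix-vector product and the dot products are the parts that parallelise, which is presumably where the reported speedup comes from.
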

    mkdir build && cd build && cmake .. && make -j$(nproc)

Requires CMake ≥ 3.18, CUDA Toolkit ≥ 12.0, OpenCV ≥ 4.5, and Eigen 3. Executables go to bin/.

Planner showdown target gate:

    ./bin/gpu_planner_showdown_benchmark --check --no-video --json gif/gpu_planner_showdown_benchmark.json
    python3 scripts/summarize_planner_showdown.py --json gif/gpu_planner_showdown_benchmark.json --markdown-out build/gpu_planner_showdown_benchmark.md --strict

--check returns non-zero if the trainable safety-dual row misses the hard gates.
The gated default scenario is baseline; --scenario tight,
--scenario priority_flip, and --scenario adversarial_density are manual
stress probes for narrower crossings, flipped priority ordering, and dense
centerline conflicts.
Use --pressure-mode learned|teacher|none for the safety-pressure ablation:
the learned controller uses runtime safety metrics plus scenario context
(lane tightness, conflict density, cross-shift load, and priority flips) and
matches the teacher-style target gate on baseline,
while disabling pressure drops the adversarial-density stress probe to a target
miss (9 collisions, CVaR 32.62).
Use --adaptive-budget learned|off for the pass-budget ablation. The default
learned budget scores pass-2 CVaR, residual pressure, and scenario difficulty;
in the tracked matrix it evaluates only the adversarial-density run and records
whether a refinement candidate is accepted. Current matrix runtime remains
within 13.050 ms.
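
The gate logic described above can be sketched in a few lines. This is an illustration only, not the benchmark's implementation: the thresholds are the hard gates listed in the planner showdown row of the results table, but the dictionary keys and the functions `failed_gates` and `exit_code` are hypothetical names and do not reflect the benchmark's JSON schema. For the no-pressure example, only the collision count and CVaR come from the text; the remaining fields are copied from the baseline purely so the example runs.

```python
# Hard target gates from the planner showdown row of the results table.
# Keys are hypothetical names, not the benchmark's JSON schema.
HARD_GATES = {
    "reach": ("min", 48),
    "deadlocks": ("max", 0),
    "collisions": ("max", 8),
    "cvar": ("max", 26.5),
    "residual_pct": ("max", 12.0),
    "runtime_ms": ("max", 15.0),
}


def failed_gates(metrics, gates=HARD_GATES):
    """Return the sorted names of every gate that the metrics violate."""
    failed = []
    for name, (kind, limit) in gates.items():
        value = metrics[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failed.append(name)
    return sorted(failed)


def exit_code(metrics):
    """Mimic a --check style result: 0 when all gates pass, 1 otherwise."""
    return 1 if failed_gates(metrics) else 0


# Reported baseline result; reach and deadlocks are implied by the passing gate.
baseline = {
    "reach": 48, "deadlocks": 0, "collisions": 0,
    "cvar": 21.61, "residual_pct": 4.90, "runtime_ms": 13.05,
}

# Pressure disabled on adversarial_density: collisions and CVaR are from the
# text; the other fields are ASSUMED equal to the baseline for illustration.
no_pressure = dict(baseline, collisions=9, cvar=32.62)
```

With these inputs, `failed_gates(baseline)` is empty, while `failed_gates(no_pressure)` names the collision and CVaR gates, matching the target miss described for the adversarial-density probe.
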
After running those probes, repeat --json to render a scenario matrix:

    python3 scripts/summarize_planner_showdown.py \
      --json gif/gpu_planner_showdown_benchmark.json \
      --json build/gpu_planner_showdown_tight.json \
      --json build/gpu_planner_showdown_priority_flip.json \
      --json build/gpu_planner_showdown_adversarial_density.json \
      --markdown-out build/gpu_planner_showdown_matrix.md \
      --strict

Pressure ablation matrix:

    ./bin/gpu_planner_showdown_benchmark --scenario baseline --pressure-mode teacher --no-video --json build/gpu_planner_showdown_pressure_teacher.json
    ./bin/gpu_planner_showdown_benchmark --scenario baseline --pressure-mode none --no-video --json build/gpu_planner_showdown_pressure_none.json
    ./bin/gpu_planner_showdown_benchmark --scenario adversarial_density --pressure-mode none --no-video --json build/gpu_planner_showdown_pressure_none_adversarial_density.json
    python3 scripts/summarize_planner_showdown.py \
      --json gif/gpu_planner_showdown_benchmark.json \
      --json build/gpu_planner_showdown_pressure_teacher.json \
      --json build/gpu_planner_showdown_pressure_none.json \
      --json build/gpu_planner_showdown_adversarial_density.json \
      --json build/gpu_planner_showdown_pressure_none_adversarial_density.json \
      --markdown-out build/gpu_planner_showdown_pressure_ablation.md

Planner falsifier target gate:

    ./bin/gpu_planner_falsifier_benchmark --check --json gif/gpu_planner_falsifier_benchmark.json
    python3 scripts/summarize_planner_falsifier.py --json gif/gpu_planner_falsifier_benchmark.json --markdown-out build/gpu_planner_falsifier_benchmark.md --strict

The falsifier scans 719,712 scenario variants over lane tightness, jitter, cross-shift, spawn phase, goal offset, and priority flips. The worst 12 cases must break no-pressure and no-regret baselines, keep the learned target planner inside the hard gates, and accept at least one adaptive repair. In the tracked run, all 12 worst cases accept repair; worst learned CVaR is 24.68 with 5.54% residual and 13.178 ms target runtime.
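
The scan-then-keep-worst pattern can be illustrated with a small sketch. Everything here is assumed for illustration: the axis values in `AXES`, the scoring function `toy_difficulty`, and the helper `worst_cases` are hypothetical. The real benchmark's grid resolution and GPU surrogate score are not described in this README. The six axis names and the choice of 12 worst cases follow the description above.

```python
import heapq
import itertools

# Hypothetical axis values; the real grid yields 719,712 variants.
AXES = {
    "lane_tightness": [0.0, 0.5, 1.0],
    "jitter": [0.0, 0.1, 0.2],
    "cross_shift": [-1.0, 0.0, 1.0],
    "spawn_phase": [0.0, 0.5],
    "goal_offset": [-0.5, 0.0, 0.5],
    "priority_flip": [0, 1],
}


def toy_difficulty(s):
    """Hypothetical surrogate score: higher means harder for the planner."""
    return (2.0 * s["lane_tightness"] + 3.0 * s["jitter"]
            + abs(s["cross_shift"]) + 0.5 * s["spawn_phase"]
            + abs(s["goal_offset"]) + 0.75 * s["priority_flip"])


def worst_cases(axes, score, k=12):
    """Enumerate the full Cartesian grid lazily and keep the k highest scores."""
    names = list(axes)
    scenarios = (dict(zip(names, combo))
                 for combo in itertools.product(*axes.values()))
    # nlargest keeps only k candidates in memory while streaming the grid.
    return heapq.nlargest(k, scenarios, key=score)
```

`worst_cases(AXES, toy_difficulty)` returns the 12 hardest of the 324 toy scenarios, sorted from hardest to easiest. In a real falsifier, each of those would then be replayed against the baseline and learned planners.
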
ROS2 (optional):

    cd ros2_ws && colcon build --packages-select cuda_robotics

| Domain | Best result |
|---|---|
| Particle Filter (10K) | CPU 75 s → CUDA 27 ms — 2,776x |
| Dynamic Window (8K samples) | CPU 1.2 s → CUDA 1.7 ms — 705x |
| Global Localization MCL | 32,768 particles, hidden kidnap; local-only post RMSE 20.24 m → sensor-reset recovery 0.022 m |
| MegaParticles-style Stein MCL | 1,048,576 range particles; local bootstrap post RMSE 14.61 m → Stein/bucket posterior recovery 0.097 m |
| MegaParticles LSH neighbor index | 2 × 1,048,576 particles; explicit p-stable LSH (8 tables × 3 projections) vs fixed grid; neighbor recall vs brute-force kNN 58.2% → 87.8%, post-kidnap RMSE 0.099 → 0.088 m |
| MegaParticles 6-DoF SE(3) | 1,048,576 SE(3) particles in a 3D voxel world; 3D-ESDF range likelihood, quaternion GN steps, 6-D p-stable LSH neighbor consensus; hidden kidnap: local bootstrap post RMSE 5.97 m → 6-DoF MegaParticles 0.22 m / 1.9 deg, reacquires in 0 frames |
| MegaParticles GICP D2D likelihood | 2 × 1,048,576 particles, identical Stein machinery; surface-aware GICP distribution-to-distribution scoring (per-point disk covariances, grid-indexed map cloud) vs the distance-field proxy; both recover the hidden kidnap in 0 frames, post-kidnap RMSE 0.099 → 0.064 m and final error 0.040 → 0.021 m, at ~2.4x per-step cost (4.9 → 12.1 ms) |
| MegaParticles trajectory smoother | 1,048,576 particles; robust fixed-lag smoother over the max-posterior representative state (switchable CV-motion + Huber measurement factors, data-driven reset on post-dropout relocalization); raw vs smoothed, in-track jitter (mean \|Δ²pos\|) 4.31 → 0.06 (~70x), in-track RMSE 5.4 → 0.25 m, post-kidnap RMSE ~1.6 → 0.09 m, recovers the hidden kidnap in 0 frames |
| Augmented KLD-AMCL | KLD-sampling adapts 400→65,536 particles, augmented injection reacquires hidden kidnap in 13 steps, settled RMSE 0.014 m, 15.2x vs CPU |
| Correlative scan matching | exhaustive global pose search, 2.1M (x,y,θ) candidates/frame (coarse-to-fine); recovers offsets up to ±3.8 m / 40° (44/44 < 0.20 m, RMSE 0.006 m) where a local field matcher stalls (5/44, RMSE 1.95 m); GPU 6 ms vs CPU 2.9 s — ~490x |
| 2D ESDF (640K cells) | 53,404x per cell (JFA) |
| 3D ESDF (1M voxels) | 86,613x per voxel (JFA-3D) |
| Massive collision check | 1,277x per candidate (2D DDA) |
| Normal estimation (10K pts) | 3,171x (PCA, one thread per point) |
| Pose-graph SLAM (200 nodes) | ~200 ms total, RMSE 4.88 → 0.56 m |
| 3D Pose-graph SLAM | 384 poses / 575 edges, finite-difference SE(3) Jacobians, RMSE 1.64 → 0.28 m |
| Robust 3D Pose-graph SLAM | 384 poses / 611 edges, 36 false loop closures, switch gate rejects 36/36; plain 6.95 m → robust 0.28 m |
| Switchable-constraint 3D Pose-graph SLAM | 384 poses / 611 edges, per-loop switch variables jointly optimised with poses; learns 36/36 false-loop rejection (no hand-set trim); plain 6.95 m → switchable 0.29 m / 2.2 deg |
| Online 3D SLAM, switchable loop constraints | 420 streamed SE(3) poses, sliding window W=80 + global pass on loop, 21 false loops injected live; plain online 9.10 m → switchable online 0.29 m, 21/21 false loops rejected as they arrive |
| CSM loop-closure SLAM | 140-keyframe 2D lap; loops DETECTED by exhaustive correlative scan matching (no GT), 1.42M candidate relposes/attempt (coarse-to-fine), score-gated (49 accepted / 3 rejected); dead-reckoning ATE 2.03 m → SLAM 0.17 m; GPU 2.4 ms vs CPU 1.5 s per attempt — 630x |
| CSM submap SLAM | same 140-keyframe lap with deliberately SPARSE, noisy scans (64 rays, 6 cm noise); loop front-end matches against a SUBMAP (8 fused scans) vs a single scan; submap recovers 48/52 loops (ATE 0.18 m) where single-scan gets only 17/52 (0.38 m), both from a 2.03 m dead-reckoning baseline; 1.42M candidate relposes/attempt, GPU 4.6 ms vs CPU 0.67 s — 148x |
| Branch-and-bound CSM | growing (x, y, θ) window up to ±6.4 m / ±35°; branch-and-bound over a GPU-built multi-resolution max-pool bound returns the IDENTICAL grid optimum as exhaustive search (40/40 frames exact) while scoring up to 1004x fewer candidates (4.5M exhaustive vs 4.5k nodes); GPU exhaustive 12 ms vs CPU 0.7 s — ~57x |
| Branch-and-bound loop-closure SLAM | branch-and-bound runs the SLAM loop-closure search over a full-resolution 4.5M-cell relpose window (±8 m / ±0.6 rad) scoring 957x fewer candidates than brute force (4.7k nodes vs 4.52M), returning the IDENTICAL relpose on 51/51 attempts; drives the live pose-graph, closing the lap (dead-reckoning ATE 2.18 m → 0.20 m); GPU B&B 0.27 ms vs brute force 7.1 ms / attempt |
| 3D Gaussian Splatting (~1k Gaussians, 720x480) | 0.94 ms / frame |
| GPU diffusion policy | 768-sample behavior cloning MLP + 512 x 64 learned denoising trajectories |
| GPU CMA-ES objective evaluation | 3 x 32,768 candidates x 10D, 1,254x vs CPU eval |
| GPU MCTS kinodynamic planning | 64 scenes x 4096 rollouts x 48 horizon, 712x vs CPU |
| GPU differentiable value iteration traversability | 192x128 learned traversability cost x 220 soft Bellman iterations, 1.53 ms, path reaches goal, 747.4x vs CPU |
| GPU neural A* traversability | 64 batched 192x128 learned-heuristic A* queries, 145.12 ms/batch, 79.0% fewer expansions than Dijkstra, 153.1x vs CPU sequential neural A* |
| GPU anytime neural A* traversability | 4-pass heuristic annealing over 64 batched 192x128 learned-heuristic A* queries, path cost 579.57 -> 522.96, 158.0x vs CPU sequential anytime |
| GPU multi-goal neural A* traversability | 8 candidate goals x 8 replans on a 192x128 learned cost field, selected G0 with score -23.71, all 8 goals reachable, 87.5x vs CPU sequential multi-goal |
| GPU spatiotemporal neural A* traversability | 64 batched 192x128 dynamic-risk neural A* queries, moving-obstacle max risk 1.94 -> 0.26, 80.9% fewer expansions than dynamic Dijkstra, 106.5x vs CPU sequential spatiotemporal A* |
| GPU learned experience graph planner | 128 batched 1536-node learned experience-graph A* queries, all queries reachable, 51.8% fewer expansions than graph Dijkstra, 278.5x vs CPU sequential graph A* |
| GPU graph-guided neural MPPI | 32768 rollouts x H=72 x guided/unguided batches, cost 1430.31 -> 842.35, terminal error 1.25 -> 0.15, route error 0.491 -> 0.045, 1320.1x vs CPU equivalent rollout evaluation |
| GPU kinodynamic graph-neural MPPI | 32768 nonholonomic speed/steering rollouts x H=72 x guided/unguided batches, cost 1516.74 -> 851.11, terminal error 5.11 -> 0.88, route error 1.530 -> 0.252, 49.9x vs CPU equivalent kinodynamic rollout evaluation |
| GPU interaction-graph neural MPPI | 48 moving agents x 4 message-passing risk updates + 32768 MPPI rollouts x H=72, social risk 1.628 -> 1.308, clearance -0.15 -> -0.10, full objective 2913.50 -> 2395.14, 4140.9x vs CPU equivalent rollout evaluation |
| GPU multi-agent graph-neural MPPI | 48 robots x 768 rollouts x H=72 x independent/coordinated modes, cross-route collisions 518 -> 261, social risk 3.544 -> 2.588, reach basin 48/48 -> 36/48, 3139.6x vs CPU equivalent rollout evaluation |
| GPU priority graph-neural MPPI | 48 robots x 768 rollouts x H=72 x coordinated/priority modes, right-of-way arbitration cuts cross-route collisions 261 -> 245, reach basin 36/48 -> 40/48, deadlocks 1 -> 0, terminal error 1.97 -> 1.65, 2870.5x vs CPU equivalent rollout evaluation |
| GPU intent graph-neural MPPI | 48 robots x 768 rollouts x H=72 x naive/intent-aware modes, top-1 intent 100.0%, cross-route collisions 518 -> 216, social risk 3.519 -> 2.897, reach basin 48/48 -> 42/48, 292.9x vs CPU equivalent rollout evaluation |
| GPU belief-risk graph MPPI | 48 robots x 768 rollouts x H=72 x expected-risk/CVaR belief modes, collision CVaR 38.23 -> 26.17, tail social risk 4.363 -> 3.972, min separation -0.391 -> -0.368, reach basin 48/48 -> 48/48, 652.8x vs CPU equivalent rollout evaluation |
| GPU best-response graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/best-response game passes, cross-route collisions 518 -> 171, collision CVaR 105.43 -> 35.00, unilateral best-response gain 25.69%, reach basin 48/48 -> 39/48, 3132.6x vs CPU equivalent rollout evaluation |
| GPU iterative game graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/raw best-response/2 damped fictitious-play updates, cross-route collisions 518 -> 154, collision CVaR 105.43 -> 38.77, reach basin 48/48 -> 39/48 -> 48/48, path delta 0.385 -> 0.191, 3181.3x vs CPU equivalent rollout evaluation |
| GPU no-regret game graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/raw best-response/3 regret-matched updates, cross-route collisions 518 -> 150, collision CVaR 105.43 -> 43.34, reach basin 48/48 -> 39/48 -> 48/48, unilateral residual 25.69% -> 13.58%, alpha avg 0.546 -> 0.281, 2960.6x vs CPU equivalent rollout evaluation |
| GPU safe no-regret game graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/raw/no-regret/safe no-regret updates, cross-route collisions 518 -> 136, collision CVaR 43.34 -> 37.93 vs no-regret, reach basin 48/48 -> 39/48 -> 48/48, safety alpha avg 0.637 -> 0.469, unilateral residual 25.69% -> 19.37%, 3043.4x vs CPU equivalent rollout evaluation |
| GPU learned safety-dual prior graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/raw/no-regret/learned-prior safety-dual updates, fixed-weight MLP prior predicts dual/alpha/scale from graph-risk features, cross-route collisions 518 -> 140, collision CVaR 43.34 -> 39.54 vs no-regret, reach basin 48/48 -> 39/48 -> 48/48, learned prior scale avg 1.168 -> 1.052, unilateral residual 25.69% -> 16.53%, 3075.5x vs CPU equivalent rollout evaluation |
| GPU trainable safety-dual prior graph MPPI | 48 robots x 768 rollouts x H=72 x one-shot/raw/no-regret/trainable-prior safety-dual updates, tiny MLP trains on 1152 synthetic graph-risk labels (loss 0.21104 -> 0.00178), cross-route collisions 518 -> 132, collision CVaR 43.34 -> 37.10 vs no-regret, reach basin 48/48 -> 39/48 -> 48/48, trainable prior scale avg 1.150 -> 1.083, unilateral residual 25.69% -> 17.48%, 3013.1x vs CPU equivalent rollout evaluation |
| GPU planner showdown benchmark | 48 robots x 768 rollouts x H=72 comparing ORCA-like reciprocal, priority graph, no-regret MPPI, and trainable safety-dual MPPI; hard target gates: reach 48/48, deadlocks 0, collisions <= 8, CVaR <= 26.5, residual <= 12.0%, runtime <= 15.0 ms; trainable safety-dual plus scenario-conditioned learned safety-pressure controller (4320 metric/context labels, loss 0.78480 -> 0.01129) and adaptive tail-risk refinement budget with collisions 0, CVaR 21.61, residual 4.90%, 13.05 ms, 2977.5x vs CPU equivalent rollout evaluation; --check emits a target-gate exit code and JSON summary |
| GPU planner falsifier benchmark | 719,712 GPU-scored scenario variants over lane scale, jitter, cross-shift, spawn phase, goal offset, and priority flips; worst 12 cases break no-pressure and no-regret (12/12 failures), learned safety-pressure target passes all 12 with worst CVaR 24.68, residual 5.54%, runtime 13.178 ms, and adaptive repair accepted 12/12; 70.7x vs CPU surrogate scan |
| GPU GNN swarm controller | 2048 agents x 3 message passes, 2.88 ms/control, 44.3x vs CPU |
| GPU reciprocal risk planner | 1024 agents x 9 actions x H=16, 4.05 ms/plan, 311.5x vs CPU |
| GPU assignment tracking | 128 scenes x 48 tracks x 72 detections, 14.0x vs CPU |
| GPU crowd swarm | 10,000 agents, uniform-grid neighbours, 105x vs CPU |
| GPU interaction graph risk | 2048 agents x 10 message-passing steps, 76.3x vs CPU |
| GPU SfM mini | 2048 features x 4 views, match + point BA, 217.0x vs CPU |
| GPU Jacobi-PCG sparse solver | 262K unknowns / 1.31M CSR nnz, 13.4x vs CPU |
| GPU EM GMM clustering | 262K points x 5 full-cov Gaussians, 90.2x vs CPU |
| GPU spectral clustering | 3072-point dense RBF graph, 40 subspace iterations, 193x vs CPU |
| GPU label propagation | 3072-node RBF graph, 12 seeds, 50 clamped iterations, 123x vs CPU |
| GPU traversability label propagation | 3072 graph nodes x 40 propagation iters, 33.47 ms, 79.9x vs CPU |
| GPU graph CRF traversability | 3072-node bilateral terrain graph x 32 mean-field iters, noisy unary 82.0% -> CRF 83.6%, 106x vs CPU |
| GPU GAT traversability policy | 3072 terrain nodes x 3 heads x 4 graph-attention layers, noisy unary 78.7% -> GAT 81.3%, 99.4x vs CPU |
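
The particle-count adaptation in the Augmented KLD-AMCL row follows KLD-sampling (Fox, cited in the references below). As a rough sketch, the bound on the number of particles for `k` occupied histogram bins can be computed as follows. The function name `kld_particle_count` and the default `epsilon` and `delta` values are assumptions for illustration and are not taken from the repository; the clamp range 400 to 65,536 matches the range reported in the table.

```python
import math
from statistics import NormalDist


def kld_particle_count(k_bins, epsilon=0.05, delta=0.01,
                       n_min=400, n_max=65536):
    """KLD-sampling bound: particles needed so that, with probability
    1 - delta, the KL divergence to the true posterior stays below epsilon.

    k_bins is the number of histogram bins currently occupied by particles.
    epsilon and delta defaults are ASSUMED values for illustration.
    """
    if k_bins <= 1:
        return n_min
    # Upper (1 - delta) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - delta)
    a = 2.0 / (9.0 * (k_bins - 1))
    # Wilson-Hilferty approximation of the chi-square quantile.
    n = (k_bins - 1) / (2.0 * epsilon) * (1.0 - a + math.sqrt(a) * z) ** 3
    return max(n_min, min(n_max, math.ceil(n)))
```

A tightly converged filter occupies few bins and falls back to the lower clamp, while a global-localization phase that spreads particles over many bins pushes the count toward the upper clamp.
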
- PythonRobotics
- Probabilistic Robotics
- Koide et al., MegaParticles: Range-based 6-DoF Monte Carlo Localization
- Datar, Immorlica, Indyk, Mirrokni, Locality-Sensitive Hashing Scheme Based on p-Stable Distributions (SoCG 2004)
- Fox, Adapting the Sample Size in Particle Filters Through KLD-Sampling (IJRR 2003); Augmented MCL: Thrun/Burgard/Fox, Probabilistic Robotics, Table 8.3
- Sünderhauf & Protzel, Switchable Constraints for Robust Pose Graph SLAM (IROS 2012)
- Diff-MPPI write-up: paper/, ablations: paper/diff_mppi_*_followup.md
- GitHub Pages gallery: https://rsasaki0109.github.io/CudaRobotics/