Official code implementation of ECCV 2022 paper: "MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes".
This paper aims at progressively encode first- and multi-order spatial relations within graphs in a recently proposed novel task -- 3D dense captioning
We suggest readers refer to data and setup preparation steps of Scan2Cap.
We provide training scripts on multi-GPUs, which can facilitate users to train our model faster. To train MORE with different painted point features (e.g., xyz, xyz+normal, xyz+normal+rgb, xyz+normal+multiview), please run
CUDA_VISIBLE_DEVICES=$gpu_ids python -m torch.distributed.launch --nproc_per_node $num_of_gpus --master_port $port_id scripts/train_ddp.py --use_color --use_normal --num_graph_steps 1 --graph_module SLGC --graph_mode spatial_layout_conv --stamp MORE_normal_rgb --decoder_module OTAG --lr 1e-4
The above command gives an example of using xyz+normal+rgb
as input point features. For using other settings of point features, please flexibly add or remove args including --use_normal, --use_color, --use_rgb
. Please change the --num_graph_steps
to 2 when using multiview features as:
CUDA_VISIBLE_DEVICES=$gpu_ids python -m torch.distributed.launch --nproc_per_node $num_of_gpus --master_port $port_id scripts/train_ddp.py --use_multiview --use_normal --num_graph_steps 2 --graph_module SLGC --graph_mode spatial_layout_conv --stamp MORE_normal_multiview --decoder_module OTAG --lr 1e-4
To evaluate the caption performances, please run
python scripts/eval.py --folder $ckpt_folder_name --use_color --use_normal --num_graph_steps 1 --num_locals 10 --eval_caption --min_iou 0.5 --graph_module SLGC --graph_mode spatial_layout_conv --decoder_module OTAG
Note that arguments must match ones for training.
If you find any problems, feel free to make an issue.
model | C@0.5IoU | B-4@0.5IoU | M@0.5IoU | R@0.5IoU |
---|---|---|---|---|
MORErgb | 38.98 | 23.01 | 21.65 | 44.33 |
MOREmul | 40.94 | 22.93 | 21.66 | 44.42 |
Since the caption metrics are not stable, we recommend you to save the checkpoints at every epoch and load the checkpoint with higher validation performance during testing for better performance.
If you found our project helpful, please kindly cite our paper via:
@inproceedings{jiao2022more,
title={More: Multi-order relation mining for dense captioning in 3d scenes},
author={Jiao, Yang and Chen, Shaoxiang and Jie, Zequn and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang},
booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXV},
pages={528--545},
year={2022},
organization={Springer}
}
We sincerely thank the authors of Scan2Cap for open sourcing their data and code. Part of the code in our project are from Scan2Cap.