Official implementation of 'Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding'.
[2023.5] We release ICCV2023 'ViewRefer3D', a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities with an LLM.
[2023.9] We release AAAI2024 'Point-PEFT', which adapts 3D pre-trained models to downstream tasks with 1% of the parameters.
[2024.5] The results of Any2Point on ShapeNetPart will be released soon!
[2024.7] Any2Point has been accepted by ECCV 2024!
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token with a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
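The 3D-to-2D virtual projection described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation. It assumes points normalized to [-1, 1], views obtained by rotating about the z-axis, bilinear interpolation of a pre-trained 2D positional-embedding grid, and averaging over views. All function names are hypothetical. The projection is "virtual" in the sense that no image is rendered: the projected coordinates are used only to look up positional encodings.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Grid = Sequence[Sequence[Sequence[float]]]  # pos_embed[row][col] -> vector


def project_to_plane(point: Point3D, yaw: float) -> Tuple[float, float]:
    """Rotate about the z-axis by `yaw`, then keep (x, z) and discard depth."""
    x, y, z = point
    x_rot = math.cos(yaw) * x - math.sin(yaw) * y
    return x_rot, z


def to_grid(coord: float, size: int) -> float:
    """Map a coordinate in [-1, 1] to a continuous index in [0, size - 1]."""
    clamped = min(max(coord, -1.0), 1.0)
    return (clamped + 1.0) / 2.0 * (size - 1)


def bilinear(pos_embed: Grid, col: float, row: float) -> List[float]:
    """Bilinearly interpolate the positional-embedding grid at (col, row)."""
    c0, r0 = int(math.floor(col)), int(math.floor(row))
    c1 = min(c0 + 1, len(pos_embed[0]) - 1)
    r1 = min(r0 + 1, len(pos_embed) - 1)
    dc, dr = col - c0, row - r0
    return [
        (1 - dc) * (1 - dr) * pos_embed[r0][c0][k]
        + dc * (1 - dr) * pos_embed[r0][c1][k]
        + (1 - dc) * dr * pos_embed[r1][c0][k]
        + dc * dr * pos_embed[r1][c1][k]
        for k in range(len(pos_embed[0][0]))
    ]


def virtual_positional_encoding(
    point: Point3D, pos_embed: Grid, yaws: Sequence[float]
) -> List[float]:
    """Average the interpolated 2D positional encodings over all virtual views.

    Assumption: views are combined by a plain mean; `yaws` must be non-empty.
    """
    rows, cols = len(pos_embed), len(pos_embed[0])
    total = [0.0] * len(pos_embed[0][0])
    for yaw in yaws:
        u, v = project_to_plane(point, yaw)
        enc = bilinear(pos_embed, to_grid(u, cols), to_grid(v, rows))
        total = [t + e for t, e in zip(total, enc)]
    return [t / len(yaws) for t in total]
```

Because only coordinates are projected, the 3D token keeps its own features and merely receives a positional encoding that the frozen transformer already understands. A 1D (language or audio) variant would project onto lines instead of planes and interpolate a 1D embedding table in the same way.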
We report the pre-training modality (Pre-train) and the number of learnable parameters (#Param), together with the accuracy on the "PB-T50-RS" split of ScanObjectNN (SCAN.) and on ModelNet40 (MN.). * indicates use of the voting strategy.
Method | Pre-train | #Param(M) | SCAN.(%) | MN.(%) |
---|---|---|---|---|
PointNet | N/A | 3.5 | 68.0 | 89.2 |
PointNet++ | N/A | 1.5 | 77.9 | 90.7 |
DGCNN | N/A | 1.8 | 78.1 | 92.9 |
PointMLP | N/A | 12.6 | 85.4 | 94.1 |
Point-PN | N/A | 0.8 | 87.1 | 93.8 |
PointNeXt | N/A | 1.4 | 87.7 | 94.0 |
Point-BERT | 3D | 22.1 | 83.1 | 92.7 |
Point-MAE | 3D | 22.1 | 85.2 | 93.2 |
Point-M2AE | 3D | 15.3 | 86.4 | 93.4 |
P2P-HorNet | 2D | 1.2 | 89.3 | 94.0* |
ACT | 3D+2D | 22.1 | 88.2 | 93.7 |
I2P-MAE | 3D+2D | 12.9 | 90.1 | 93.7 |
ReCon | 3D+2D+Language | 43.6 | 90.6 | 94.1 |
Any2Point (Audio) | Audio | 0.8 | 87.0 | 92.7 |
Any2Point (2D) | 2D | 0.8 | 87.7 | 93.2 |
Any2Point (Language) | Language | 0.9 | 91.9 | 94.3 |
Real-world shape classification on the PB-T50-RS split of ScanObjectNN:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_Scan.log | 91.9% | Language_CLIP_Scan.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_Scan.log | 87.7% | Vision_DINOV2_Scan.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_scan.log | 87.0% | Audio_imagebind_scan.pth |
Synthetic shape classification on ModelNet40:
Method | Logs | Acc. | Ckpts |
---|---|---|---|
Any2Point-Lang-CLIP | Language_CLIP_ModelNet.log | 94.3% | Language_CLIP_ModelNet.pth |
Any2Point-Vision-DINOV2 | Vision_DINOV2_ModelNet.log | 93.2% | Vision_DINOV2_ModelNet.pth |
Any2Point-Audio-ImageBind | Audio_imagebind_ModelNet.log | 92.7% | Audio_imagebind_ModelNet.pth |
Create a conda environment and install basic dependencies:
git clone https://github.com/Ivan-Tang-3D/Any2Point.git
cd Any2Point
conda create -n Any2Point python=3.7
conda activate Any2Point
# Install the corresponding versions of torch and torchvision
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch
conda install -c pyg pytorch-cluster pytorch-scatter pytorch-sparse -y
pip install torch-geometric==2.0
source install.sh
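After installation, a quick sanity check can confirm that the packages installed above are importable. The snippet below is a minimal sketch, not part of the repository. The helper name `missing_modules` is hypothetical, and the list contains only the import names of the packages from the commands above, not whatever `install.sh` may add.

```python
import importlib.util
from typing import Iterable, List

# Import names of the packages installed by the conda/pip commands above.
REQUIRED = [
    "torch",
    "torchvision",
    "torchaudio",
    "torch_cluster",
    "torch_scatter",
    "torch_sparse",
    "torch_geometric",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

`find_spec` only locates a module without importing it, so the check is fast and does not initialize CUDA.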
For pre-training and fine-tuning, please follow DATASET.md to install the ModelNet40, ScanObjectNN, and ShapeNetPart datasets, as in Point-BERT. Specifically, put the unzipped folders under data/.
Training the language variant occupies only 26 GB of memory.
The final directory structure should be:
│Any2Point/
├──Any2Point_CLIP_Lang/
├──ckpts/
├──data/
│ ├──ModelNet/
│ ├──ScanObjectNN/
├──...
Please download CLIP_pre-train.pth, DINOV2_pre-train.pth and ImageBind_audio_pre-train.pth into the ckpts/ folder.
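Before launching a run, it can help to verify that the datasets and checkpoints sit where the layout above expects them. The following is a minimal sketch, not part of the repository. The helper `missing_paths` is hypothetical, and the path lists cover only the entries named in this README (the `...` in the tree is left out).

```python
from pathlib import Path
from typing import List

# Directories and checkpoint files named in this README, relative to Any2Point/.
EXPECTED_DIRS = [
    "Any2Point_CLIP_Lang",
    "ckpts",
    "data/ModelNet",
    "data/ScanObjectNN",
]
EXPECTED_FILES = [
    "ckpts/CLIP_pre-train.pth",
    "ckpts/DINOV2_pre-train.pth",
    "ckpts/ImageBind_audio_pre-train.pth",
]


def missing_paths(root: Path) -> List[str]:
    """Return the expected relative paths that do not exist under `root`."""
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for rel in missing_paths(Path(".")):
        print("missing:", rel)
```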
For the PB-T50-RS split of ScanObjectNN, run:
Any2Point_CLIP_Lang
cd Any2Point_CLIP_Lang
sh fine_tune.sh
Any2Point_DINOV2_Vision
cd Any2Point_DINOV2_Vision
sh fine_tune.sh
Any2Point_ImageBind_audio
cd Any2Point_ImageBind_audio
sh fine_tune.sh
For ModelNet40, run:
Any2Point_CLIP_Lang
cd Any2Point_clip_lang_modelnet
sh fine_tune.sh
Any2Point_DINOV2
cd Any2Point_DINOV2_modelnet
sh fine_tune.sh
Any2Point_ImageBind
cd Any2Point_ImageBind_Modelnet
sh fine_tune.sh
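The six runs above differ only in the directory they start from, so they can be driven from one small launcher. This is a minimal sketch, not part of the repository. The helper names and the `dataset`/`modality` keys are hypothetical, and it assumes every `cd` above is relative to the repository root.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple

# (dataset, modality) -> run directory, copied from the commands above.
RUN_DIRS: Dict[Tuple[str, str], str] = {
    ("ScanObjectNN", "language"): "Any2Point_CLIP_Lang",
    ("ScanObjectNN", "vision"): "Any2Point_DINOV2_Vision",
    ("ScanObjectNN", "audio"): "Any2Point_ImageBind_audio",
    ("ModelNet40", "language"): "Any2Point_clip_lang_modelnet",
    ("ModelNet40", "vision"): "Any2Point_DINOV2_modelnet",
    ("ModelNet40", "audio"): "Any2Point_ImageBind_Modelnet",
}


def fine_tune_command(dataset: str, modality: str) -> Tuple[str, List[str]]:
    """Return the run directory and the command to execute inside it."""
    try:
        run_dir = RUN_DIRS[(dataset, modality)]
    except KeyError:
        raise ValueError(f"unknown combination: {dataset}, {modality}") from None
    return run_dir, ["sh", "fine_tune.sh"]


def run_fine_tune(repo_root: Path, dataset: str, modality: str) -> int:
    """Run fine_tune.sh in the matching directory and return its exit code."""
    run_dir, command = fine_tune_command(dataset, modality)
    return subprocess.run(command, cwd=repo_root / run_dir, check=True).returncode
```

For example, `run_fine_tune(Path("."), "ModelNet40", "language")` would be equivalent to `cd Any2Point_clip_lang_modelnet` followed by `sh fine_tune.sh`.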
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
@article{tang2024any2point,
title={Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding},
author={Tang, Yiwen and Liu, Jiaming and Wang, Dong and Wang, Zhigang and Zhang, Shanghang and Zhao, Bin and Li, Xuelong},
journal={arXiv preprint arXiv:2404.07989},
year={2024}
}
This repo benefits from Pix4Point, Point-NN, PointTransformerV2, and Openpoints. We thank the authors for their wonderful work.