Daily Logs


  • 2022/07/01, Vendredi.
  1. Multi-View Transformer for 3D Visual Grounding(CVPR2022) [PDF] [Code]

    • Main Idea: Two models: point cloud and text. The method learns a multi-modal representation that is independent of any specific single view. Different rotation matrices are used to obtain a robust multi-view representation, and the features of each object are fused with the query features.
    • Experiments: Nr3D: 55.1%, Sr3D: 58.5%, Sr3D+: 59.5%(SOTA) ScanRefer: 40.80%(GOOD)
    • Reproduce Notes:
      • Training on Nr3D takes almost 15h on 1 RTX 3090 and reaches 55.1%!
      • Under CUDA 11, replace all mentions of AT_CHECK with TORCH_CHECK in ./referit3d/external_tools/pointnet2/_ext_src/src.
      • Point Cloud Visualization tool: open3d [Package]
      • Point Cloud 3D Box Visualization: [Code]
      • Point Cloud aligned: [Code]
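
The AT_CHECK to TORCH_CHECK replacement noted above can be scripted instead of edited by hand. The following is a minimal sketch, not part of the original repo: the helper name `replace_at_check` and the list of file suffixes are assumptions, and it only touches the whole identifier `AT_CHECK`.

```python
import re
from pathlib import Path

# Match AT_CHECK only as a whole identifier (assumption: no other macro should change).
AT_CHECK = re.compile(r"\bAT_CHECK\b")


def replace_at_check(src_dir, suffixes=(".cpp", ".cu", ".h", ".cuh")):
    """Rewrite AT_CHECK to TORCH_CHECK in source files under src_dir.

    Returns the sorted list of files that were modified.
    """
    changed = []
    for path in sorted(Path(src_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8")
        if AT_CHECK.search(text):
            path.write_text(AT_CHECK.sub("TORCH_CHECK", text), encoding="utf-8")
            changed.append(path)
    return changed
```

Usage would be `replace_at_check("./referit3d/external_tools/pointnet2/_ext_src/src")`, followed by rebuilding the extension. Running it a second time changes nothing.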

  2. Distilling Audio-Visual Knowledge by Compositional Contrastive Learning(CVPR2021) [PDF] [Code]

    • Main Idea: Compositional contrastive learning for video feature extraction, to bridge the semantic gap between two different modalities.
    • Experiments: UCF51: 70.0%, ActivityNet: 47.3%
    • Reproduce Notes:
      • 1 RTX 3090 takes almost 10h for UCF101, 3 days for ActivityNet, 6 days for VGGSound.

  • 2022/07/02, Samedi.
  1. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection(CVPR2022) [PDF] [Code]
    • Main Idea: First single-stage 3D visual grounding method. It treats the 3DVG task as a keypoint selection problem: Pcloud is the input, Pseed is the feature, P0 is the language-relevant keypoints, Pt is the target keypoints, and finally Pt regresses to the bounding boxes.
    • Experiments: ScanRefer:47-48%(SOTA), Nr3D:51.5%, Sr3D:62.6%(GOOD)
    • Reproduce Notes:
      • 1 Tesla V100 or 2 RTX3090s are enough. Training takes almost 39h on 2 RTX3090s, with or without multi-view features.
      • Distributed training yaml [Code]
      • Distributed training script [Code]
      • In pytorch 1.7.0 environment, you should replace "tile" in lib/ with "repeat".
      • If you use distributed training, you should add "if args.local_rank == 0" before you save the model.
      • If you use distributed training, you should change the torch.load code in scripts/ to
      checkpoint = torch.load(path)
      model.load_state_dict({k[7:]: v for k, v in checkpoint.items()}, strict=True)
      • If you want to visualize the results, follow these steps:
      1. Add the following line to config/default.yaml:
          scene_id: "scene0011_00"
      2. Change some code in scripts/ [Code]
      3. Run the following command in the terminal:
      python scripts/ --folder 2022-07-23_20-36_REPRODUCE-MULTIVIEW_DOUBLE_WORKERS-1 --config ./config/default.yaml
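
The two distributed-training notes above (save only on rank 0, and drop the first 7 characters of each key when loading) both follow from how DistributedDataParallel wraps a model: every parameter name gains a `module.` prefix. A minimal sketch of both steps is given below. The helper names are hypothetical, and treating `-1` as "not distributed" is an assumption about how `local_rank` is set.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DDP prefix removed from each key.

    Unlike a blind k[7:], keys without the prefix are left untouched.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def is_main_process(local_rank):
    """Only one process should write checkpoints.

    Assumption: -1 means the script was launched without distributed training.
    """
    return local_rank in (-1, 0)
```

In the loading code this would read `model.load_state_dict(strip_module_prefix(checkpoint), strict=True)`, and the save call would be wrapped in `if is_main_process(args.local_rank):`.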

  • 2022/07/03, Dimanche.
  1. ScanQA: 3D Question Answering for Spatial Scene Understanding(CVPR2022) [PDF] [Code]

    • Main Idea: This paper introduces a new task, 3D VQA, and a baseline that consists of 3 parts: question and point cloud feature extraction, feature fusion, and 3 MLP heads for object classification, answer classification and object localization.
    • Experiments: 23.45% (Baseline)
    • Reproduce Notes:
      • Not implemented yet. (TODO)
      • 1 Tesla V100 takes < 1 day.

  • 2022/07/04, Lundi.
  1. Text-guided graph neural networks for referring 3D instance segmentation (AAAI2021) [PDF] [Code]
    • Main Idea: This paper divides the task into two parts: 3D instance segmentation and instance referring. The 3D mask prediction is interesting: they propose a clustering algorithm to group points belonging to the same instance. A text-guided graph neural network is proposed for the second phase.
    • Experiments: (Baseline)

  • 2022/07/05, Mardi.
  1. X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning(CVPR2022) [PDF] [Code]
    • Main Idea: In the training stage, a teacher network that uses both 2D and 3D modalities teaches a student network that uses only the 3D modality. In the inference stage, only the 3D modality is used.
      • They propose a different fusion module: randomly mask the teacher features and add it to the student feature.
      • They propose a different object representation method.
    • Experiments:
  • 2022/07/06, Mercredi.
  1. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds(CVPR2022) [PDF] [Code]
    • Main Idea: This paper provides a unified framework for joint dense captioning and visual grounding on 3D point clouds. The feature representation and fusion modules are task-agnostic and designed for collaborative learning.
    • Experiments: SOTA on ScanRefer and Nr3D, even better than the 3D visual grounding paper in IJCV2022.
    • Reproduce Notes:
      • 1 RTX3090 takes almost 4 days to train and 1h*"repeats" to validate on ScanRefer dataset.
      • If you use multi-view features, this project will occupy 212GB of disk space, so you'd better rent GPUs in the Beijing region on AutoDL.
      • Scan2CAD dataset and its preprocessing are also needed to train this project. [Code]
      • If your system is CUDA11.0+, you should replace the pointnet++ in the original repo with the one from 3D-SPS.
      • To fix the error "No module named 'quaternion'", type "pip install numpy-quaternion" in the terminal.
      • ScanRefer dataset can directly unzip in the dataset folder.
      • "ScanRefer_filtered_organized.json" can be obtained by [Code]
      • Training arguments must match validation arguments, or you will get a RuntimeError: size mismatch.
      • Java is a must.
      sudo apt-get update
      sudo apt-get install openjdk-8-jdk
      • "--num_ground=150" means avoiding the training of the caption head for the first 150 epochs.

Visual Grounding (validation set):

| Methods | Publication | Modality | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DJCG (Paper) | CVPR2022 | 3D | 78.75 | 61.30 | 40.13 | 30.08 | 47.62 | 36.14 |
| 3DJCG (Paper) | CVPR2022 | 2D + 3D | 83.47 | 64.34 | 41.39 | 30.82 | 49.56 | 37.33 |
| 3DJCG (Our Reproduce) | CVPR2022 | 2D + 3D | 81.98 | 63.18 | 41.35 | 30.04 | 49.23 | 36.47 |

- Future work: improve the visual grounding performance.
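
For reference, the Acc@0.25 and Acc@0.5 columns count a prediction as correct when the IoU between the predicted and ground-truth 3D boxes reaches the threshold. The sketch below is an illustration under stated assumptions rather than the benchmark's evaluation code: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax), an IoU exactly at the threshold counts as a hit, and the function names are hypothetical.

```python
def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(box_a) + volume(box_b) - inter)


def accuracy_at_iou(pred_boxes, gt_boxes, threshold):
    """Percentage of predictions whose IoU with the ground truth is >= threshold."""
    hits = sum(box3d_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```

Calling `accuracy_at_iou(preds, gts, 0.25)` and `accuracy_at_iou(preds, gts, 0.5)` on the same predictions gives the two numbers reported per subset.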
  • 2022/07/07, Jeudi.
  • I am running an experiment on the COCO2017 dataset with 4 RTX3090s!
  1. Escaping the Big Data Paradigm with Compact Transformers(Arxiv202206) [PDF] [Code]
    • Main Idea: This paper designs a new transformer architecture for training on small datasets. First, they reduce the number of layers, heads and hidden dimensions. Then, they design the SeqPool module: x' = softmax(g(f(x))^T), z = x' · f(x), where f is the transformer encoder and g is a linear layer. Finally, a convolutional tokenizer, which replaces the patch and embedding step, is designed to introduce an inductive bias into the model.
    • Experiments: SOTA on small datasets such as CIFAR-10 and Flowers-102.
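
The SeqPool step can be sketched in a few lines of NumPy. This is a single-sequence illustration, not the authors' implementation: `tokens` stands for the encoder output, and `weight` and `bias` are hypothetical parameters of the linear layer g.

```python
import numpy as np


def seq_pool(tokens, weight, bias=0.0):
    """Attention-style pooling over a token sequence.

    tokens: (n, d) encoder outputs; weight: (d,) parameters of the linear layer g.
    Returns the pooled (d,) vector z and the (n,) importance weights x'.
    """
    scores = tokens @ weight + bias      # one score per token, shape (n,)
    scores = scores - scores.max()       # stabilise the softmax numerically
    exp_scores = np.exp(scores)
    attn = exp_scores / exp_scores.sum() # softmax over the sequence axis
    return attn @ tokens, attn           # weighted sum of the tokens
```

With an all-zero `weight` every token gets the same importance and the result reduces to mean pooling, which is a quick sanity check for the shapes.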

  • 2022/07/08, Vendredi.
  • After adding Transformer to IEEC, although the training process is unstable, it can surpass the baseline! Here is the validation accuracy reported by tensorboard:

  1. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes (ECCV2020) [PDF] [Code]
    • Main Idea: They introduce a two-part dataset: a high quality synthetic dataset of 83572 referential utterances (Sr3D) and a dataset with 41503 natural (human) referential utterances (Nr3D).
  • 2022/07/09, Samedi. Run the 3DJCG code.

  • 2022/07/10, Dimanche.

  1. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (CVPR2017) [PDF] [Code]
    • Main Idea: This paper fully exploits the permutation-invariance property of point clouds and proposes PointNet. It also provides some theoretical analysis in [Supplemental].
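
Permutation invariance comes from applying the same function to every point and then aggregating with a symmetric operation such as max pooling. The sketch below shows that idea with a single shared linear layer. It is a toy illustration, not the PointNet architecture, and `weight` and `bias` are hypothetical parameters.

```python
import numpy as np


def global_feature(points, weight, bias):
    """Shared per-point linear layer + ReLU, then max pooling over the points.

    points: (n, 3); weight: (3, k); bias: (k,). Returns a (k,) global feature.
    """
    per_point = np.maximum(points @ weight + bias, 0.0)  # same weights for every point
    return per_point.max(axis=0)                          # max is symmetric in point order
```

Shuffling the rows of `points` leaves the returned feature unchanged, because the maximum over a set does not depend on the order of its elements.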
  • 2022/07/11, Lundi.

  • Experiment on COCO2017 dataset with 4 RTX3090s is over. The experiment lasted for 5 days! My results surpass the baseline model.

  • 2022/07/20, Mercredi.

  1. Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds (IJCV2022) [PDF] [Code]
    • Main Idea: This paper proposes SpaCap for 3D dense captioning. Main-axis spatial relation label maps are preprocessed before training and can be used as prior knowledge for the model. The paper also proposes a new Transformer decoder: the vision token mask and the word token mask are both fed to the self-attention layer.
    • Reproduce Notes:
      • Successful

  • 2022/07/30, Samedi.
  1. D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding (ECCV2022) [PDF] [Code]
    • Main Idea: This paper proposes D3Net to do 3D visual grounding and dense captioning jointly. The self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. It outperforms SOTA methods in both tasks on the ScanRefer dataset.
    • Reproduce Notes:
      • No yaml file is provided.

When a downloaded zip file is corrupted, it can be repaired with WinRAR!


  • 2022/09/05, Lundi.
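  1. Research on Deep-Learning-Based Image Restoration Techniques (基于深度学习的图像复原技术研究), Wu Shixiang (PhD thesis, University of Science and Technology of China, 2022)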

Non-blind image restoration: super-resolution and denoising (super-resolution ×2 + denoising ×1)

Blind image restoration: (based on an unsupervised CycleGAN framework ×1)

Blind single-image restoration: (based on a generative model ×1)

