SenseTime and Shanghai AI Laboratory jointly released the multimodal multitask general model "INTERN-2.5" on March 14, 2023. "INTERN-2.5" achieved multiple breakthroughs in multimodal multitask processing, and its strong cross-modal capabilities across text and images provide efficient and accurate perception and understanding for general scenarios such as autonomous driving.
- :thumbsup: **The strongest visual universal backbone model with up to 3 billion parameters**
- 🏆 **Achieved `90.1% Top1` accuracy in ImageNet, the most accurate among open-source models**
- 🏆 **Achieved `65.5 mAP` on the COCO benchmark dataset for object detection, the only model that exceeded `65.0 mAP`**
## News
- `Mar 14, 2023`: 🚀 "INTERN-2.5" is released!
- `Feb 28, 2023`: 🚀 InternImage is accepted to CVPR 2023!
- `Nov 18, 2022`: 🚀 InternImage-XL merged into [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves state-of-the-art performance of `63.4 NDS` on nuScenes Camera Only.
- `Nov 10, 2022`: 🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on ADE20K, outperforming previous models by a large margin.
"INTERN-2.5" achieved a Top-1 accuracy of 90.1% using only publicly available data for image classification. This is the only model, besides two undisclosed models from Google and Microsoft and additional datasets, to achieve a Top-1 accuracy of over 90.0%. It is also the highest-accuracy open-source model on ImageNet and the largest model in scale in the world.
- On the COCO object detection benchmark dataset, "INTERN-2.5" achieved a mAP of 65.5, making it the only model in the world to surpass 65 mAP.
- "INTERN-2.5" achieved the world's best performance on 16 other important visual benchmark datasets, covering classification, detection, and segmentation tasks.
"INTERN-2.5" can quickly locate and retrieve the most semantically relevant images based on textual content requirements. This capability can be applied to both videos and image collections and can be further combined with object detection boxes to enable a variety of applications, helping users quickly and easily find the required image resources. For example, it can return the relevant images specified by the text in the album.
"INTERN-2.5" has a strong understanding capability in various aspects of visual-to-text tasks such as image captioning, visual question answering, visual reasoning, and optical character recognition. For example, in the context of autonomous driving, it can enhance the scene perception and understanding capabilities, assist the vehicle in judging traffic signal status, road signs, and other information, and provide effective perception information support for vehicle decision-making and planning.
The outstanding performance of "INTERN-2.5" in cross-modal learning comes from several core innovations in its multimodal multitask general-model technology: InternImage serves as the backbone network for visual perception, a large language model (LLM) serves as the large-scale pre-trained network for text processing, and Uni-Perceiver serves as the task-compatible decoder for multi-task modeling.
InternImage, the visual backbone network of "INTERN-2.5", has a parameter size of up to 3 billion and can adaptively adjust the position and combination of convolutions based on dynamic sparse convolution operators, providing powerful representations for multi-functional visual perception. Uni-Perceiver, a versatile task decoding model, encodes data from different modalities into a unified representation space and unifies different tasks into the same task paradigm, enabling simultaneous processing of various modalities and tasks with the same task architecture and shared model parameters.
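As a rough illustration of the dynamic sparse convolution idea behind InternImage, the sketch below predicts per-location sampling offsets and modulation weights from the input and then samples features with torchvision's deformable convolution. It is only a conceptual stand-in: the repository's DCNv3 operator uses group-wise sampling, softmax-normalized modulation, and a dedicated CUDA kernel, and the class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DynamicSparseConvBlock(nn.Module):
    """Toy block: sampling offsets and modulation weights depend on the input."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.padding = kernel_size // 2
        # A regular conv predicts 2 offsets (dx, dy) and 1 modulation weight per kernel tap.
        self.offset_mask = nn.Conv2d(
            channels, 3 * kernel_size * kernel_size, kernel_size, padding=self.padding
        )
        self.weight = nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k2 = self.kernel_size * self.kernel_size
        om = self.offset_mask(x)
        offset, mask = om[:, : 2 * k2], om[:, 2 * k2 :]
        mask = mask.sigmoid()  # DCNv3 instead normalizes modulation with a softmax over the taps
        return deform_conv2d(
            x, offset, self.weight, padding=(self.padding, self.padding), mask=mask
        )


block = DynamicSparseConvBlock(channels=64)
feats = torch.randn(2, 64, 32, 32)
print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Compared with a fixed 3×3 grid, letting the sampling locations and weights depend on the input gives the backbone an adaptive receptive field, which is what the paragraph above refers to as adjusting the position and combination of convolutions.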
<div align=left>
<img src='./docs/figs/network.png' width=900>
</div>
## Project Release

- [ ] Model for other downstream tasks
- [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference
- [x] Classification code of the InternImage series
- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
- [x] InternImage-L/XL ImageNet-22K pretrained model
- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
- [x] InternImage-T/S/B/L/XL semantic segmentation model
Before using `mmdeploy` to convert our PyTorch models to TensorRT, please make sure you have built the DCNv3 custom operator correctly. You can build it with the following commands:
```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
# ... (intermediate build steps omitted here)
make -j$(nproc) && make install
```
For more details on building custom ops, please refer to [this document](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md).
## Citation
If this work is helpful for your research, please consider citing the following BibTeX entry.