[RFC] Support YOLOX detection model #6341
@zhiqwang Thank you very much for the comprehensive proposal. :) Your implementation at yolov5-rt-stack is indeed of very high quality. Having a modern implementation of YOLO was on our bucket list, but I just want to be mindful and not cannibalise your project. After all, PyTorch's unique value proposition is its rich ecosystem. Having said that, if you are happy to upstream parts of your repo to TorchVision then we would absolutely love to have it. Rest assured that if we do add it, we will make sure to provide all the necessary credit to the OSS contributors who made that possible. I know that your coding style and practices are very aligned with the ones used in TorchVision, so I agree v5 would probably be the easiest step forward. I have a couple of questions for you:
Concerning the training engine, I completely agree we should refactor a large part of our reference scripts to inherit and reuse components. My recommendation, though, is not to link this work to the addition of YOLO, as this is already a very big project. There are also various potential solutions that we might want to leverage (for example TorchRecipes), but this will require additional chats. I would suggest the next step is to clarify the above and decide how to progress. Possibly we will need to split the project into subprojects and potentially invite more contributors to help out. We could tackle this as part of #6323 and leverage the community.
Hi @datumbox
I have only verified on a few images with the ported weights, because there are some differences in our preprocessing compared to the original YOLOv5 version; I have not been able to do a complete verification on the COCO dataset, and I can add some more detailed comparison data in the next few days. I don't have a server at hand. We can divide the task into smaller parts and eventually put all the modules together; being able to train with your help would be the best option.
As you said, the most important part of the data augmentation is the mosaic technique. It was first introduced in ultralytics/yolov3#310 (comment), and there is a similar discussion in AlexeyAB/darknet#3114 (comment). The mosaic technique is helpful for detecting smaller objects. (And I think this is the key technique that allows the YOLO series to be trained from scratch.) I quote Jocher's conclusions below.
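For illustration, here is a minimal sketch of the mosaic idea: four images are pasted around a random center and their boxes shifted and clipped into each tile. This is a plain-PyTorch sketch, not YOLOv5's actual implementation, and it assumes the source images were already resized to the output size:

```python
import random
import torch

def mosaic(images, boxes_list, out_size=640):
    """Combine four images into one mosaic canvas (simplified sketch).

    images: four CHW float tensors, already resized to out_size x out_size.
    boxes_list: four N x 4 tensors of (x1, y1, x2, y2) pixel boxes.
    """
    s = out_size
    # Random mosaic center: the point where the four tiles meet.
    xc = random.randint(s // 4, 3 * s // 4)
    yc = random.randint(s // 4, 3 * s // 4)
    canvas = torch.zeros(3, s, s)
    # Tile regions: top-left, top-right, bottom-left, bottom-right.
    regions = [(0, 0, xc, yc), (xc, 0, s, yc), (0, yc, xc, s), (xc, yc, s, s)]
    merged = []
    for (x1, y1, x2, y2), img, boxes in zip(regions, images, boxes_list):
        h, w = y2 - y1, x2 - x1
        # Paste the top-left crop of the source image into its tile.
        canvas[:, y1:y2, x1:x2] = img[:, :h, :w]
        # Shift the boxes into the tile and clip them to its borders.
        shifted = boxes.clone()
        shifted[:, [0, 2]] = (shifted[:, [0, 2]] + x1).clamp(x1, x2)
        shifted[:, [1, 3]] = (shifted[:, [1, 3]] + y1).clamp(y1, y2)
        # Drop boxes that were clipped away entirely.
        keep = (shifted[:, 2] > shifted[:, 0]) & (shifted[:, 3] > shifted[:, 1])
        merged.append(shifted[keep])
    return canvas, torch.cat(merged)
```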
In addition, v5 also uses several further enhancements; it seems that `vision/references/detection/transforms.py` (line 30 at b30fa5c) already provides a related building block.
YOLO{v5/v7/X} train their detection models from scratch, and they now use backbones that differ from the DarkNet presented in the original paper. It would be nice to have a model pre-trained on ImageNet to help accelerate our training, but it is up for debate whether it is necessary to implement the original version of CSPDarknet.
The code currently in https://github.com/zhiqwang/yolov5-rt-stack/tree/main/yolort/models was written from scratch; I only called some common functions of YOLOv5. Those parts have been rewritten by YOLOX, so we can call YOLOX's common functions instead (or rewrite them ourselves) to get rid of this dependency. The main reason I used YOLOv5's common functions was to be able to load checkpoints trained with YOLOv5. Concretely, I restructured YOLOv5's YAML-parsing mechanism into three sub-modules following the layout of TorchVision.
@zhiqwang I had the chance to investigate a bit further the references. The biggest concern about YOLOv5 is that there is still no paper to accompany the architecture (see ultralytics/yolov5#1333); I remember that it first came out as a repo and the owners said that the paper will be coming out shortly but I don't think there is currently one. Though it's a very popular architecture which achieves good results, the lack of paper is a problem as we usually focus on canonical implementations and expansions that have been studied in research. YOLOX seems like a viable alternative. Perhaps that's the way forward to avoid licensing issues, wdyt?
Sounds good, I think it's worth confirming that the implementation yields the expected accuracy prior to deciding to adopt it.
Sounds good. We can follow a similar approach as with FCOS but omitting the original training. We have the capacity to train such a network internally, so you don't have to have your own infra.
We should implement the mosaic augmentation and add it to the references first. Then, once @pmeier and @vfdev-5 are back, we can examine implementing them as transforms on the new API.
I'm a bit surprised to see it in the list (I haven't checked the references). Do they use mixup for detection? Do they adjust the probabilities of the labels? How about the boxes, are they also multiplied by the weights?
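For context: in the detection flavour of mixup, as far as I can tell from the YOLOX code, the pixels are blended but both images' boxes and labels are simply concatenated at full strength rather than re-weighted (some papers instead weight each image's loss contribution). A minimal sketch of that variant, assuming same-sized image tensors:

```python
import torch

def detection_mixup(img1, boxes1, labels1, img2, boxes2, labels2, alpha=1.5):
    # Sample the blending ratio from a Beta distribution
    # (YOLOX itself uses a fixed 0.5/0.5 blend).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Blend the pixels; assumes both images have the same shape.
    mixed = lam * img1 + (1.0 - lam) * img2
    # Keep all boxes and labels from both images at full strength,
    # without adjusting label probabilities.
    boxes = torch.cat([boxes1, boxes2])
    labels = torch.cat([labels1, labels2])
    return mixed, boxes, labels
```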
I can confirm that training from scratch usually yields better results. I'm OK not adding a Darknet arch in TorchVision; it's kind of old.
Agreed, but we will need DarkNet to implement YOLOX anyway, so it might just be better to provide it too. Although it's old, it's still being used and is relevant. It's quite a fundamental model, like AlexNet, so somehow I feel it might be good to add.
@zhiqwang Sounds good. I'll need to dig a bit into the YOLOX paper and familiarize myself. I'll try to do this by EOW. In the meantime, if you have in your mind a clear plan with intermediary milestones for adding YOLOX, please add it here (aka addition of X arch, Y transforms, Z operators etc). This will hopefully let us coordinate among contributors.
@zhiqwang I'm late by 1 week. Sorry I got caught up on other pieces of work. I've gone through the bibliography around YOLOX and here are some thoughts:
Mosaic and MixUp are worth implementing. I've added tasks for them on the #6323 issue. Whether we will go ahead with the rest depends on your bandwidth. Is this something you would like to pick up and lead? If yes, we can find a POC on our side that would assist with the model validation and training resources. Let me know, thanks!
Hi @datumbox ,
I agree with you here. YOLOX has a good balance in terms of copyright and code quality, and it's enough to have a YOLOX implementation from the community's perspective.
Sorry for not having enough bandwidth to work on this recently :( But I can help to review the code and support deployment if there is such a need.
@zhiqwang Thanks for getting back to me. I completely understand. Unfortunately we are very constrained in terms of headcount and bandwidth at the moment; I don't think any of the maintainers can pick this up. Originally the idea of you picking up and leading this initiative was very promising, as you have extensive experience with the YOLO architecture due to your earlier work. But I understand that since we are interested in porting YOLOX and not v5, that would increase your work significantly. I'm happy to leave this open in case your situation changes in the future. Since we are here, let me do the cheeky move and check if any of the original authors of YOLOX would be interested in contributing an implementation to TorchVision? @FateScript @Joker316701882 @GOATmessi7
@FateScript thanks for responding. We would love to have a modern YOLO iteration in TorchVision. Currently we don't offer any variant of this architecture, which means that researchers can't do off-the-shelf comparisons. I don't know how familiar you are with the TorchVision code-base. As with every library, it has its own idioms and quirks, so this is an exercise of porting your original code to follow those idioms. I've listed a few thoughts on what needs to be done in the following comment, let me know your thoughts: #6341 (comment) To summarize, we would need to implement specific backbones that are not supported, plus the architecture of YOLOX, along with any utilities that are not already available in TorchVision. We should already support many such utilities: bbox ops and IoU estimation, bbox encoding & matching, and anchor utils (I'm aware YOLOX is anchor-free). We can provide assistance in the form of PR reviews and model training (using our own compute). I'll leave you to check some of the references and let me know your thoughts. It would be really awesome to work with you. Being one of the original authors of YOLOX means it should be easier for you to adapt the implementation and faster for us to review it.
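As a quick illustration of what is already available, several of the building blocks mentioned above live in `torchvision.ops` today (the box values below are made up):

```python
import torch
from torchvision.ops import box_convert, box_iou, generalized_box_iou, nms

# Two sets of boxes in (x1, y1, x2, y2) format.
preds = torch.tensor([[10.0, 10.0, 50.0, 50.0], [30.0, 30.0, 80.0, 80.0]])
targets = torch.tensor([[12.0, 12.0, 48.0, 48.0]])

iou = box_iou(preds, targets)               # pairwise IoU matrix, shape (2, 1)
giou = generalized_box_iou(preds, targets)  # GIoU, usable in a regression loss

# YOLO heads usually predict (cx, cy, w, h); convert before encoding/matching.
cxcywh = box_convert(preds, in_fmt="xyxy", out_fmt="cxcywh")

# Standard NMS for post-processing.
scores = torch.tensor([0.9, 0.8])
keep = nms(preds, scores, iou_threshold=0.65)
```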
@datumbox Sorry for my late reply — I'm on vacation these days. I checked the references you mentioned above, and I think implementing a YOLOX model in torchvision is not too hard. The main effort here is the data transforms and the model architecture. I've decided to set aside one day per week to complete this. BTW, is there any DDL for me?
@FateScript That's awesome, thanks a lot for doing this!
Sorry what do you mean by DDL?
@zhiqwang Just wanted to check if you still want to be involved in supporting Feng during the PRs, or if I should find a POC on our side for this. Totally depends on your bandwidth.
I mean, deadline
@FateScript No deadlines from our side. We appreciate that you are dedicating your time to an open-source project and we are thankful. :) Just a date to keep in mind in case we aim to make the model available for v0.14: all PRs for that release need to be merged by the beginning of October. Anything merged after that will be released with v0.15.
Hi @datumbox and @FateScript , I believe Feng will implement a superior version of YOLOX here, and I will contact him offline to see if there is anything I can do to help :)
@FateScript I just wanted to follow up and see if you faced any blockers with the implementation. Let me know if we can help or if there is a change of plans. Thanks! :)
Hi @datumbox , I haven't faced any blockers with the implementation. The only bad news for us is that I transferred to a new work group, and my new leader only allows me to spend half a day per week on this job. So it might take me more time than expected.
@FateScript Thanks for the heads up. No worries at all. You are donating your time and we are grateful for this. Just checking that you are not blocked by something or have abandoned it due to circumstances. Slow and steady wins the race; let me know if you need anything.
I don't know if this is something that you'd like to consider, but I submitted an implementation of YOLOv3 and YOLOv4 to Lightning Bolts, and later submitted a pull request for features from YOLOv5, Scaled-YOLOv4, and YOLOX. It's very flexible: you can use networks defined in PyTorch, such as YOLOv5, or networks defined in Darknet configuration files, and you can use different IoU functions from Torchvision and different algorithms (e.g. SimOTA) for matching targets to anchors, to construct the different YOLO variants. I haven't checked that I can reproduce the numbers from the papers, though. There are so many differences in the details between the implementations that I don't think it makes sense to try to implement all of the variants exactly. Anyway, I submitted the pull request a year ago and it has been accepted by the reviewers, but it still hasn't been merged. It seems like the Bolts project has gone pretty inactive. So if you're interested, I'd be happy to work on porting it to Torchvision and perhaps merging it with the code from @FateScript ? It's clean code and well documented. You can have a look: https://github.com/groke-technologies/pytorch-lightning-bolts/tree/yolo-update/pl_bolts/models/detection/yolo
@senarvi Thank you for sharing your clean code with me :)
@FateScript what exactly do you mean by data providing logic? I would think that all the models in Torchvision would share it. I'd just like to clarify that, in my opinion, it doesn't make sense to implement a module that's strictly YOLOX, because every year there's a new and improved YOLO version. I recently started looking into adding features from YOLOv7. I think it's better to have a generic YOLO module and reusable components that can be used to train a YOLOX model, but also used in the future with new YOLO versions, and ideally also with other model families. The most important components are the loss calculation, matching targets to anchors, and the network architectures. Torchvision already supports all the different IoU functions, so we should reuse those in the loss calculation. The network backbones could also be reused between other Torchvision models, although I think that the existing backbones such as ResNet currently only return features from the last layer. YOLO uses the FPN/PAN network with multiple detection heads, which needs features from three or four backbone layers. If you agree that that's the direction we want to head in, then I can create a pull request for you to have a look at and comment on.
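On the multi-level features point: torchvision's feature-extraction utility can already expose intermediate layers of an existing backbone, so an FPN/PAN neck could consume them without a bespoke backbone. A minimal sketch with ResNet-50 (the node names follow torchvision's layer naming and can be listed with `get_graph_node_names()`):

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Expose the C3, C4 and C5 feature maps of a ResNet-50 so a multi-head
# neck can consume features from several levels, not just the last layer.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"},
)

feats = extractor(torch.rand(1, 3, 640, 640))
for name, f in feats.items():
    # c3 (1, 512, 80, 80), c4 (1, 1024, 40, 40), c5 (1, 2048, 20, 20)
    print(name, tuple(f.shape))
```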
@senarvi Data providing logic here means data-related code such as data augmentation, dataset caching, and so on.
So it seems that adding a YOLOX model to torchvision is meaningless, but other code like data augmentation is useful for torchvision?
@FateScript Definitely not meaningless. Maybe I explained myself poorly. I just mean that I can see two approaches. From a benchmarking perspective, it can be useful to have a model that's identical to a standard YOLO implementation such as YOLOX. Then you also need identical data augmentations etc. The downside is that if you want to add other YOLO versions in the future, it will be more difficult to reuse the components (if you want every version to be 1-to-1 identical to the original code). Personally, I would find it more useful to have a generic YOLO class, where it's easy to reuse features from different YOLO versions, because as much as possible is abstracted into separate classes. It would also be nice to have augmentations such as mosaic, but in my opinion those can be implemented separately. In my opinion, it's most important that the augmentations can be reused in different models and are not YOLO-specific. By the way, I'm not any kind of authority here. :) I guess it's a matter of the "philosophy" of the Torchvision project, which way to go.
@senarvi Yes, you are right. I misunderstood your meaning here.
I added a YOLOv7 architecture in the Bolts pull request, so now it supports the YOLO variants listed by @zhiqwang in the initial post. The biggest new architectural change was deep supervision, i.e. auxiliary detection heads, which required some thinking. I've tried to make the components reusable so that they can easily be used for building new models. For example, instead of a huge monolithic function that expects some complex data structures and supports only one algorithm for matching the predictions to the targets, there are generic functions that take the predictions and targets. The details of the SimOTA algorithm I got from the YOLOX code, and they have changed considerably in YOLOv7.
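To make the matching discussion concrete, here is a simplified sketch of SimOTA-style dynamic-k matching written as a generic function of pairwise costs and IoUs. It follows the YOLOX recipe in spirit (dynamic k from the top-10 IoU sum, conflicts resolved by lowest cost), but the function name and simplifications are mine, not code from either repository:

```python
import torch

def simota_match(cost, ious, top_candidates=10):
    """Simplified SimOTA dynamic-k matching (a sketch, not YOLOX's exact code).

    cost: (num_targets, num_preds) pairwise assignment cost,
          e.g. classification cost plus a weighted IoU cost.
    ious: (num_targets, num_preds) pairwise IoU between targets and predictions.
    Returns a (num_targets, num_preds) boolean matching matrix.
    """
    num_targets, num_preds = cost.shape
    matching = torch.zeros_like(cost, dtype=torch.bool)

    # Dynamic k: each target gets roughly as many positives as the sum
    # of its top-10 candidate IoUs suggests, but at least one.
    n = min(top_candidates, num_preds)
    topk_ious, _ = ious.topk(n, dim=1)
    ks = topk_ious.sum(dim=1).int().clamp(min=1)

    # Assign each target its k lowest-cost predictions.
    for t in range(num_targets):
        _, idx = cost[t].topk(int(ks[t]), largest=False)
        matching[t, idx] = True

    # Resolve conflicts: a prediction claimed by several targets is kept
    # only for the target with the lowest cost.
    multi = matching.sum(dim=0) > 1
    if multi.any():
        best = cost[:, multi].argmin(dim=0)
        matching[:, multi] = False
        matching[best, multi] = True
    return matching
```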
@FateScript what do you think about using this codebase as the basis for the YOLO model? It shouldn't be too difficult to fit it into the Torchvision framework. If you think that it might be a good idea, I can try to rewrite the model class for Torchvision. If there are features that we're missing in the data processing or the SimOTA algorithm, or some other details that are not correct, and you have time, you could help there. Do you think that this code architecture would be suitable?
@senarvi Thanks for your code :) IMO, your code architecture is suitable here. If any help is needed, please feel free to contact me.
I prepared the contribution. I just need official approval from my employer.
Just a quick update that some of my managers are not at the office at the moment. Hopefully I will get the approval next week.
I got permission from my employer, Groke Technologies, to contribute the YOLO model, and created this pull request: #7496 The model is quite well tried and tested, but I would appreciate any help with some things related to Torchvision integration. For example, I'm not sure if the unit tests work, and I don't know how the pretrained weights are created.
Amazing! This is really great.
It would be great to get some feedback, especially on the model factory functions. According to this issue, I should add a factory function for each model variant. There are dozens of YOLO variants: there have been something like ten notable YOLO versions, and each had several variants (s, m, l, x, nano, tiny, etc.). As discussed above, faithfully implementing all of them is not feasible. We should decide whether we want to add as many variants as possible, or just the most important ones (and which ones).

Before adding more variants, I'd also like to know if I've understood correctly what is wanted. I created an example for yolov4. I understood that I have to train weights for the variants; I train the model on the COCO dataset.

Construction of the network is a bit different from the other detection models, because YOLO adds several detection layers at different levels of the network. The backbone is the network only up to the FPN (or the extension of FPN called PAN). The detection layers are placed within the PAN. Take a look at the YOLOV7Network class for an example. It's not really possible to separate the FPN/PAN from the detection layers so that we could have the FPN/PAN as part of the backbone. We can switch the backbone (up to the FPN) though, if the backbone provides outputs from different levels, like here.

Also, I use the method validate_batch() to validate that the input is in the correct format. I wonder if we should use the same function to validate the input in all detection models; then we would know that all detection models use the same data format. That said, this pull request might blow up if we start to include too much of this kind of refactoring in the same pull request.
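For reference, this is roughly what a per-variant factory looks like under torchvision's current weights API. Everything specific here is a hypothetical placeholder: the `YOLO` stub stands in for the model class from the pull request, and the URL and metadata are invented; only the `Weights`/`WeightsEnum`/`register_model` pattern itself is real torchvision machinery (imported the way in-tree model files do):

```python
from typing import Any, Optional

from torch import nn
# These imports mirror what in-tree torchvision model files use today.
from torchvision.models._api import Weights, WeightsEnum, register_model
from torchvision.transforms._presets import ObjectDetection


class YOLO(nn.Module):
    """Stub standing in for the real YOLO model class in the PR."""
    def __init__(self, network: str = "yolov4") -> None:
        super().__init__()
        self.network = network


class YOLOv4_Weights(WeightsEnum):
    # Placeholder entry: the URL and metadata would be filled in once the
    # COCO-trained weights for this variant are published.
    COCO_V1 = Weights(
        url="https://download.pytorch.org/models/yolov4_coco.pth",  # hypothetical
        transforms=ObjectDetection,
        meta={"categories": [], "_docs": "Trained on COCO with the reference scripts."},
    )
    DEFAULT = COCO_V1


@register_model()
def yolov4(*, weights: Optional[YOLOv4_Weights] = None, progress: bool = True,
           **kwargs: Any) -> nn.Module:
    """Builder for one YOLO variant, following the torchvision factory pattern."""
    weights = YOLOv4_Weights.verify(weights)
    model = YOLO(network="yolov4", **kwargs)
    if weights is not None:
        model.load_state_dict(weights.get_state_dict(progress=progress))
    return model
```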
🚀 The feature
YOLO, a.k.a. You Only Look Once, is a vibrant series of object detection models that began with the release of Joseph Redmon's You Only Look Once: Unified, Real-Time Object Detection.
So far, the more notable implementations are as follows (all in PyTorch): ultralytics/yolov3 [1], MMDetection's YOLO configs [2], ultralytics/yolov5 [3], YOLOX [4], and YOLOv7 [5].
Motivation, pitch
Until now, one of the most successful of these is probably YOLOv5. YOLOv5 is great, and they have also built up a very friendly community and ecosystem. We don't intend to copy YOLOv5 into TorchVision; our main goal here is to make training SoTA models easier and to share reusable subcomponents for building the next SoTA models in the same/proxy family [6].
YOLOX is a high-performance anchor-free YOLO, and it has a good balance in terms of copyright and code quality; from the community's perspective, it's enough to have a YOLOX implementation.
The License
YOLO{v5/v7} are built under the GPL-3.0 license, and YOLOX is built under the Apache-2.0 license.
More context
I have previously rewritten the code used in the inference part of YOLOv5 according to the style and specification of torchvision [7], and I can relicense that part under the BSD-3-Clause license. The model-inference part will not involve much work, with the help of the YOLOX base code.
Data augmentation and a new trainer engine will be the core of what we will do here.
The data augmentation section is on the planning list (#6224), and we have already merged some augmentation methods (e.g. #5825). I think it would help us to build the next SoTA models with new primitives, like the classification models [8].
As TorchVision adds more and more models, it may be time to abstract out a simple trainer engine for sharing reusable subcomponents. It might be more appropriate to open a new thread about the necessity of and specific steps for this part.
cc @datumbox @YosuaMichael @oke-aditya
Footnotes
1. https://github.com/ultralytics/yolov3/tree/v9.1
2. https://github.com/open-mmlab/mmdetection/tree/master/configs/yolo
3. https://github.com/ultralytics/yolov5
4. https://github.com/Megvii-BaseDetection/YOLOX
5. https://github.com/WongKinYiu/yolov7
6. https://github.com/keras-team/keras-cv/issues/622#issuecomment-1198063712
7. https://github.com/zhiqwang/yolov5-rt-stack/tree/main/yolort/models
8. https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/