-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix dataset problems release 2.7 #11646
base: release/2.7
Are you sure you want to change the base?
fix dataset problems release 2.7 #11646
Conversation
- remove unlimited loop for simpledatset when poly has more than 4 points - fasten data loading - improve model loading
Thanks for your contribution! |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution!
Your acceleration solution is effective in some scenes. Have you considered adding a parameter to the configuration file to select this way ?
Also, can you please submit a PR to the dygraph branch. This modification will be merged into the next stable branch.
if data is None: | ||
continue | ||
if 'polys' in data.keys(): | ||
if data['polys'].shape[1] != 4: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to handle this situation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this is where we need to discuss.
This condition leaded to endless loop and discarded every annotation line with points more than 4 in detection task.
I have checked annotation files we generated and official dataset from ctw1500 (800M) and ICDAR2015 (80M) datasets which are used for finetuning and evaluation: they both have annotation lines with more than 4 points.
The root cause is utility file can only process polygon with fixed points:
assert len(points) == 4, "shape of points must be 4*2"
Hence we deleted the unnecessary constrain (4 points bbox annotation is just a simple case of general closed polyline) and training goes smoothly again.
@@ -75,6 +75,9 @@ def maybe_download_params(model_path): | |||
else: | |||
url = model_path | |||
tmp_path = os.path.join(MODELS_DIR, url.split('/')[-1]) | |||
if os.path.exists(tmp_path) and os.path.isfile(tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi models from Model_Path was downloaded to tmp_path but it is repeated to do so each time when we launch a task in the same machine, which takes about 3-5 minutes. This is unnecessary.
A better way I think is, by providing MD5 sum in model configuration file, we can check if the files in tmp_path is correct.
But any change to the config in this project requires work checking all the other configuration files. That will be too cumbersome.
@@ -609,8 +610,31 @@ def get_rotate_crop_image(img, points): | |||
img_crop = img[top:bottom, left:right, :].copy() | |||
points[:, 0] = points[:, 0] - left | |||
points[:, 1] = points[:, 1] - top | |||
''' | |||
assert len(points) == 4, "shape of points must be 4*2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Situations where more data formats need to be processed in addition to CTW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But both in CTW, ICDAR2015 and our own annotation files we don't assume bbox consists of fixed number number of points to form a valid bbox polyline.
So we just treat it as general polyline: 2 points (upper left, bottom right), 3 points (trianguulars), and closed polylines
B.t.w I just found this line when I scan the dataset file:
The image can be effectively worked with uint8, this allows 8-bit IO workload and better compute-IO overlap. It seems that we havn't use DALI to accelerate IO by allowing device prefetching and on-chip preprocessing. |
Can we land this one ? @tink2123 |
@yiakwy-xpu-ml-framework-team you'll need to create a PR, target for dygraph branch. we decided not merging this into release/2.7. |
I guess I can cherry-pick this commit onto the dygraph branch. I will do it soon. |
PR 类型 PR types
Bug fixes
PR 变化内容类型 PR changes
Dataset, preprocessing :
描述 Description
improve codes stability and fix bugs when training with cn_PP_OCR_v4 detection model.
I identified failures of dataset infinite loading problem when reproducing a OCR dec in local env. My team is using this model to generate high quality data for Multi-Modal LLM. After this fix , dec can be trained very quickly for a few epochs :
SimpleDataset forever loop :
The utility function should also be adapted to process the annotation data instead of throwing them away or raising an exception.
Note by default, "CopyPaste" for imgaug method is chosen. That means arbitrary of two images are selected and paired. However get_ext_data always read a new image from the dataset, which is very inefficient.
提PR之前的检查 Check-list
This PR is pushed to the dygraph branch or cherry-picked from the dygraph branch. Otherwise, please push your changes to the dygraph branch.