Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeCamp #116 Add SROIE to dataset preparer #1639

Merged

Conversation

FerryHuang
Copy link
Contributor

@FerryHuang FerryHuang commented Dec 22, 2022

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

CodeCamp-MMOCR-116

Modification

Support SROIE dataset (text detection, text recognition, text spotting) on the DatasetPreparer

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

@CLAassistant
Copy link

CLAassistant commented Dec 22, 2022

CLA assistant check
All committers have signed the CLA.

@FerryHuang FerryHuang changed the title Ferry/add sroie to dataset preparer [Feature]/add sroie to dataset preparer Dec 22, 2022
@FerryHuang FerryHuang changed the title [Feature]/add sroie to dataset preparer [Feature] Add sroie to dataset preparer Dec 22, 2022
configs/textdet/_base_/datasets/sroie.py Outdated Show resolved Hide resolved
configs/textrecog/_base_/datasets/sroie.py Outdated Show resolved Hide resolved
dataset_zoo/sroie/textrecog.py Outdated Show resolved Hide resolved
dataset_zoo/sroie/textspotting.py Show resolved Hide resolved
mmocr/datasets/preparers/data_converter.py Outdated Show resolved Hide resolved
dataset_zoo/sroie/sample_anno.md Outdated Show resolved Hide resolved
Comment on lines 153 to 160
if '*' in src.split('/')[-1] and not osp.exists(dst):
os.makedirs(dst)
for f in glob.glob(src):
if osp.exists(f):
shutil.move(f, dst)
continue

if osp.exists(src) and not osp.exists(dst):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines can be optimized to be more general. The if condition is too strict and may only be applicable to this dataset.

  1. Wildcard (*) does not usually appear in the file path, and therefore we can use glob to match all the files whenever there is * in the file path.
  2. mmengine has a util mkdir_or_exist to create the directory whenever needed.
  3. Is line 156 really necessary? I think glob.glob should return a list of existing files.
  4. Using continue to skip the next if block doesn't make the logic flow very clear. I'm in favor of if...elif... here

To conclude, the code can be rewritten as:

import mmengine

#...

if '*' in src:
    mmengine.mkdir_or_exist(dst)
    for f in glob.glob(src):
        shutil.move(f, dst)
elif osp.exists(src) and not osp.exists(dst):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestions and I have modified this code block. But with #1639 (comment) fixed, this reports error like 'path already exists' when running task-recog after task-det. Keeping the not osp.exists(dst) in if statement will be fine. So it's like this

if '*' in src and not osp.exists(dst):
    mkdir_or_exist(dst)
    for f in glob.glob(src):
        shutil.move(f, dst)

elif osp.exists(src) and not osp.exists(dst):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to address the key issue. Such an error occurs because shutil will check the existence of the destination file. If dst is just a directory path, shutil will do the check on the resulting filename, which is equivalent to osp.exists(osp.join(dst, osp.basename(src)).

Back to your implementation: ... and not osp.exists(dst) only verifies the existence of the directory instead of a specific file. Suppose that we have placed an empty directory to dst, this implementation will still skip moving all the files inside.

Therefore, it's better to just check the existence of each file with osp.exists(osp.join(dst, osp.basename(f)) and only skip those existing ones.

mmocr/datasets/preparers/parsers/sroie_parser.py Outdated Show resolved Hide resolved


@DATA_PARSERS.register_module()
class SROIETextDetAnnParser(BaseParser):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference with ICDARTextDetAnnParaser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference is the try-except structure added to handle unicode errors that I want to skip. Should I merge this into the ICDARTextDetAnnParser?

@FerryHuang FerryHuang changed the title [Feature] Add sroie to dataset preparer CodeCamp #116 Add SROIE to dataset preparer Dec 26, 2022
Copy link
Collaborator

@gaotongxiao gaotongxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good overall, but does not pass our lint test.

image

To fix it, please install pre-commit hooks following

https://mmocr.readthedocs.io/en/dev-1.x/notes/contribution_guide.html#commit-your-changes

and then run pre-commit run --all-files, and it will help you fix the format issues.

Also, there is a conflict needed to be resolved before I merge it into dev-1.x:

image

You can fix it locally by running git merge dev-1.x, then push the latest changes to your branch.

FerryHuang and others added 3 commits December 27, 2022 18:25
Conflicts:
	mmocr/datasets/preparers/parsers/__init__.py

resolve conflicts
Copy link
Collaborator

@gaotongxiao gaotongxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've updated the downloading link for SROIE, since the google drive links doesn't work.

@gaotongxiao gaotongxiao merged commit 1413b50 into open-mmlab:dev-1.x Dec 29, 2022
@FerryHuang FerryHuang deleted the ferry/add_SROIE_to_dataset_preparer branch December 29, 2022 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants