Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance in get_imports regexp #34298

Conversation

AlekseyLobanov
Copy link
Contributor

@AlekseyLobanov AlekseyLobanov commented Oct 21, 2024

What does this PR do?

Just found that one of the Regular Expressions may be very slow (O(N^3)) on some special inputs.

Something like 'try:' + ' ' * 1500 may take a lot of time. I replaced it with the same logic but faster

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante

@AlekseyLobanov
Copy link
Contributor Author

It is not a typo but still a small change with existing tests

@Rocketknight1
Copy link
Member

Rocketknight1 commented Oct 23, 2024

This looks good now! It should be simpler and more performant, and shouldn't change which strings it matches.

I have one more suggestion, though. This regex looks like it would fail in cases when the strings "try" and "except" occur internally in some lines. This is probably very rare, but maybe we should change the regex to \n\s*try and \n\s*except to be secure against that?

To be clear, this isn't a bug with your PR - it's an issue with the regex that already existed!

@AlekseyLobanov
Copy link
Contributor Author

regex looks like it would fail in cases ...

Any regex should fail because even valid Python source is not defined by a regular language.
For example, code below also should be ignored, I believe.

"""
try:
    pass
except:
    pass
"""
# code in multi line string literal

Should I add it in another PR with more tests? This PR is more about performance improvements.

Maybe we can wait for @gante to reply, because he wrote this lines.

Copy link
Member

@Rocketknight1 Rocketknight1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlekseyLobanov yes, good point - we can keep this PR simple. Want me to merge this one now, and if you think you can improve it further, you can open a new PR with added test(s)?

@AlekseyLobanov
Copy link
Contributor Author

Want me to merge this one now, and if you think you can improve it further, you can open a new PR with added test(s)?

Can we just merge this for now? It will be great.

@gante
Copy link
Member

gante commented Oct 29, 2024

Maybe we can wait for @gante to reply, because he wrote this lines.

Hehe not true (#23725 is the source of these lines), but the diff looks good to me :)

@Rocketknight1
Copy link
Member

Yes, the diff looks good to me. Thanks for the PR!

@Rocketknight1 Rocketknight1 merged commit f55595b into huggingface:main Oct 29, 2024
21 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants