
testing multiprocessing for faster finds! #63


Merged: 2 commits merged into master on Mar 27, 2022

Conversation

vsoch
Copy link
Collaborator

@vsoch vsoch commented Mar 26, 2022

This might be a terrible idea, but in repos where we have a LOT of files to check, the run is getting much slower. So here I'm going to test using multiprocessing, meaning we can check ~9 files in parallel. I'll try to open a custom action branch so I can test this on a repo I know is rather large (given it passes here, of course!).

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

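The per-file parallelism described above can be sketched roughly as follows. This is a minimal illustration, not urlchecker's actual code: `check_file` and its fake "check" are hypothetical stand-ins for the real per-file URL checking.

```python
from multiprocessing import Pool

def check_file(path):
    # Stand-in check: count the http(s) links in a file. The real checker
    # would extract each URL and issue a request (with retries).
    with open(path) as fh:
        urls = [w for w in fh.read().split() if w.startswith("http")]
    return path, len(urls)

def check_files_parallel(file_paths, workers=9):
    # ~9 workers mirrors the parallelism mentioned in the PR description;
    # each file is handed to a worker process and checked independently.
    with Pool(processes=workers) as pool:
        return dict(pool.map(check_file, file_paths))
```

Because files are checked independently, the work parallelizes cleanly, which is why the wall-clock time drops so sharply for large repos.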
@vsoch
Copy link
Collaborator Author

vsoch commented Mar 26, 2022

holy crap @SuperKogito this went from 4-5 minutes to 44 seconds!! Major improvement!! https://github.com/USRSE/usrse.github.io/runs/5701139638?check_suite_focus=true

@vsoch vsoch requested a review from SuperKogito March 26, 2022 04:44
@vsoch
Copy link
Collaborator Author

vsoch commented Mar 27, 2022

@SuperKogito I know you've been busy or not able to respond in the past - I've tested this (even via the action) on external repos, so I'm going to merge for the super speed up, and we can open further issues/PRs if anything new arises! I hope you are doing well!

@vsoch vsoch merged commit 1f9bd5c into master Mar 27, 2022
@vsoch vsoch deleted the test/multiprocessing branch March 27, 2022 21:37
@SuperKogito
Copy link
Member

I am really sorry for taking some time; I intended to look at it today but couldn't. Overall the structure is nice (kudos for the new class), and the improvements are great ;) I only wonder if this might hit some request limit when requests are simultaneous. However, since the automatic tests didn't fail, I think it is safe to merge.

@vsoch
Copy link
Collaborator Author

vsoch commented Mar 27, 2022

Oh yay, you are around! I think we should actually be OK because parsing happens at the level of the file (so checks are unique within a process run), and if we hit a case of shared URLs across files, it will either still work off the bat or retry. I tested here and on our USRSE repo and saw no issues, so I think it's a huge improvement worth it!

But I thought about this, and if we do need to handle duplication across jobs, we could always parse the URLs first, taking this into account, and then run the multiprocessing with no duplicates.

That might further optimize things, actually - ok if I try this out and open another PR? And since I know you are around, I will indeed wait for your review this time!
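The dedupe-first idea above could look something like this hypothetical sketch; the names here are illustrative, not urlchecker's API. Collect URLs per file, deduplicate across files, check each unique URL once, then map results back:

```python
def unique_urls(urls_by_file):
    """{filename: [url, ...]} -> sorted list of URLs with duplicates removed."""
    seen = set()
    for urls in urls_by_file.values():
        seen.update(urls)
    return sorted(seen)

def map_results_back(urls_by_file, results):
    """Given {url: passed} for the unique URLs, rebuild per-file verdicts."""
    return {f: {u: results[u] for u in urls} for f, urls in urls_by_file.items()}
```

The unique list from `unique_urls` is what the multiprocessing pool would consume, so no URL is ever requested twice in one run.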

@SuperKogito
Copy link
Member

SuperKogito commented Mar 27, 2022

I agree, I think it is a great improvement 👏 I feel silly that we didn't think of it before 😝
Yes, filtering out the duplicates before multiprocessing should make it even faster. Of course, give it a shot, and if it is stable enough let's merge ;)

(just an idea)
I think if you have lengthy checks and are looking for even faster processing, it might be worthwhile to revive #52, since that would reach the limits of acceleration. But asynchronous processing has its drawbacks, and I think if we ever pursue that, we will need to create different Python and action folders (say: urltechie_super_speed :p), because using it would force us to drop a lot of our useful generic features.

@vsoch
Copy link
Collaborator Author

vsoch commented Mar 27, 2022

okay - I'm on it!

Agreed - let me get in a PR for optimizing the urls we check, and then let's rebase #52 and we can time the master branch against the same with async/aiohttp. If it's an improvement that is noticeable, it's definitely worth considering! If it's a trivial change then probably not worth the pain of async 😆
