-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor for modern 2022 python style and usage #80
base: master
Are you sure you want to change the base?
Conversation
Join 2022 with the rest of us. Fixed via: `fd -e py -X black` ``` $ black --version black, 22.6.0 (compiled: yes) Python (CPython) 3.10.2 ```
I've seen this pattern copied out to a dozen different ML repos, but it's just completely wrong? Instead of using 10 lines of "hack python to be smarter than python," it can all be replaced with a one-line python-provided API for repeating a single iterator forever: for iter_num, batch in enumerate(itertools.cycle(train_loader)):
This is the "proper" python way of doing path operations. It makes file operations one-liners with half the typing vs using 10 year outdated legacy python patterns. Also reducing some unnecessary line padding and unnecessary text decoration in places too.
Using globals to store data between calls? What?! The proper way is for callbacks to have access to an internal aribtrary state object for their own accounting purposes. this does thusly.
Small minor non-important fix, but it makes the intent more clear and shows people there's more python built-ins to explore.
The new syntax makes formatting intent much clearer and easier to understand than legacy python string formatting syntax. The new syntax only requires knowing like 2 concepts while the old formatting syntax requires knowing 8 different things about how strings and formatting and replacements and substitutions happen in legacy python strings.
Hints for improving readability: - don't smash 10+ lines together without line breaks - if you have comments for a block, make it a visual block with newlines instead of: #comment\ncode\n#comment\ncode\ncode\n#comment\ncode\n.... - don't create any globals or code under __main__, just call a helper function. this also helps when wanting to turn python entrypoints into single callable scripts with paraemters since a function entry point can be called easier (a la Fire, etc) - avoid end-of-line comments where possible because it makes lines too long then the code formatter uses worse formatting. prefer "# big comment\ncode" instead of "code # big comment"
Also added 'adder' helper runner (see below) Enables: - automated package buliding: poetry build - automated command running - example: poetry run adder - all the other modern package management things like dependency verification, isolated install environments, etc
poetry run chargpt (if you have 'input.txt' in your local directory) Also includes the basic Path/style/formatting/cmd refactors made to adder.
Usable via: poetry run bpe Fixed same issues as cleaning up everything else: - pathlib.Path everywhere instead of direct os module file manipulation - improves code readability (the entire purpose of code, after all) - improves output readability too with more logical breaks for reading/understanding what's actually happening
I appreciate the effort, I reviewed the full diff, but I don't like it. But I do like the |
Actually on closer look I'm a bit worried about itertools.cycle, as it caches the individual elements in memory to yield through them on second pass. I am worried this would consume a large amount of memory, which the current implementation discards and recreates each element of the data loader on demand. Punting for now. |
oh, good catch. could also just do it manually with a hand rolled infinite generator: In [1]: import random
In [2]: randos = [random.randbytes(3000) for i in range(333)]
In [3]: def forever():
...: while True:
...: yield from randos
In [4]: for i, got in enumerate(forever()):
...: print(i)
...: if i > 3000000:
...: break |
No actual functionality changes, just a lot of cleanup to make the project look like it belongs in the present instead of something half abandoned from 10 years ago.
Some of these changes are important because other projects copy this project over and over again, so every legacy code architecture pattern here gets proliferated unnecessarily into the future for all time.
There's simple solutions to hacks I've seen in this codebase copied and pasted into dozens of other projects (like the 10 line "continuous iterator" hack which actually has a one line python builtin for doing it properly:
itertools.cycle(iterable_thing)
)Also refactored lots of modern python usage like using
pathlib.Path
for easy and more semantic file manipulation instead of a dozen differentos.path
functions.and of course complying with modern syntax means we need to use the standard python code formatter instead of bespoke "data-science-ese" of write-once disposable code patterns (but it still gets copied in thousands of places and just ends up decaying more and more over time)
also added a pyproject/poetry config to specify requirements somewhat (and as a bonus, i configured current scripts as poetry runnables so you can just do things like
poetry run adder
and it works automatically)good luck.