Speed up upload-to-s3
#41
This is currently blocked on |
Alternatively, the naive |
They're provided in our other runtimes (almost by happenstance, as part of the underlying OS image) and having them available in all runtimes makes it much easier to write portable programs without having to deal with GNU vs. BSD differences. Note that typically GNU coreutils would already be available in the Conda runtime on Linux (via the host system), but not the Conda runtime on macOS (unless installed separately, e.g. via Homebrew). So explicitly including GNU coreutils here increases consistency, isolation, and portability of the runtime. Related-to: <nextstrain/ingest#41>
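One concrete GNU vs. BSD difference in play here: GNU coreutils ships `sha256sum`, while macOS's BSD userland typically only provides Perl's `shasum` (invoked with `-a 256` for the same algorithm). A hypothetical Python shim (names mine, not from the repo) sketches the kind of portability workaround that bundling coreutils in every runtime avoids:

```python
import shutil

def checksum_command():
    """Return the argv prefix for a SHA-256 checksum tool on this system.

    GNU coreutils provides `sha256sum`; BSD/macOS systems usually only have
    `shasum`, which needs `-a 256` to select the same algorithm.
    """
    if shutil.which("sha256sum"):
        return ["sha256sum"]
    return ["shasum", "-a", "256"]
```

With coreutils guaranteed in the runtime, scripts can just call `sha256sum` and skip this branching entirely.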
Note that we're (I'm) assuming the GNU coreutils |
Simple test with a 1.3G fasta file.
Still running after 10 mins... I'll post an update with the final time after it finishes... |
🤦♀️ Nope, I was just running the script wrong
|
Wait, is the Python one actually faster? What? I mean, I know the hashlib implementations are in C, as is much of the file I/O, but I'd still expect Python overhead to be significant here. |
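For context, a minimal sketch of the kind of chunked `hashlib` hashing being compared here, assuming the 5 MiB read size mentioned below (the actual `upload-to-s3` implementation may differ):

```python
import hashlib

# Assumed read size; the thread below contrasts this with coreutils' 32 KiB reads.
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MiB

def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so memory stays bounded for huge files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a b"" sentinel stops cleanly at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The hot loop spends nearly all of its time inside C code (`f.read` and `digest.update`), which is why the per-chunk Python overhead matters less than one might expect.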
Is your coreutils |
Ah, should have said I was running these in the Nextstrain shell using the Docker runtime. |
One thing I noted looking at coreutils' sha256sum is that it reads in 32 KiB chunks vs. our 5 MiB chunks. |
Similar results when running in macOS terminal:
|
Just making sure this is true for a larger file by testing with a 70G fasta. Python is much faster than GNU coreutils!
GNU coreutils:
Python:
|
Wow. I'd be really curious what the times are if you drop our read size in Python to 32 KiB. I'd also wonder if aarch64 is coming into play here: is Python taking advantage of it (and coreutils not) in a way it couldn't on the x86_64 hardware we're using on AWS Batch? On my machine, Python is only slightly faster than coreutils. In fact, alternative non-cryptographic hashing algorithms I've tried (a few impls of MurmurHash3, simple crc32, simple md5) all come out very roughly in the same ballpark (within ~20s of each other on a 3GB file), which leads me to think I'm bottlenecking on I/O on my machine. And so I'd wonder if we hit an I/O bottleneck in Batch too. We're not using fast disks on AWS... |
It's actually slightly faster when I drop the chunk size
|
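A hypothetical benchmark sketch for reproducing this chunk-size comparison (32 KiB vs. 5 MiB reads); the file name and sizes here are stand-ins, not the actual measurement procedure used above:

```python
import hashlib
import os
import tempfile
import time

def time_sha256(path, chunk_size):
    """Hash `path` with SHA-256 using reads of `chunk_size` bytes.

    Returns (hex digest, elapsed seconds) so runs with different chunk
    sizes can be compared on the same file.
    """
    start = time.perf_counter()
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), time.perf_counter() - start

# Tiny stand-in for the multi-gigabyte fasta files used in the thread.
with tempfile.NamedTemporaryFile(delete=False, suffix=".fasta") as f:
    f.write(b">seq\n" + b"ACGT" * 250_000)
    sample = f.name

for size in (32 * 1024, 5 * 1024 * 1024):  # 32 KiB vs. 5 MiB reads
    checksum, seconds = time_sha256(sample, size)
    print(f"{size:>8} B reads: {seconds:.3f}s  {checksum[:12]}")

os.unlink(sample)
```

With a file this small the difference is noise; the timings in this thread suggest running it against a multi-gigabyte file, ideally on the same disk the Batch jobs use, to see whether chunk size or I/O dominates.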
Prompted by nextstrain/ncov-ingest#446
Some ideas for speeding up upload-to-s3 were proposed in a related Slack thread.