Add filesize sniffing to specify the import job's disk space for WDL and CWL #5114
We think the right approach is to do the filesize data collection on the leader and then delegate two types of jobs: streaming imports, and non-streaming imports sized to their files. The construction of the import jobs DAG will happen on the leader after it has collected all file sizes. This should enable maximum parallelization when possible. The threshold will be configurable by the user and will usurp the `--importWorkerDisk` option. There will also likely be changes to the internal import interface.

We may not support files that don't let us look at the size of the file before downloading. This may include FTP and old HTTP servers. Alternatively, we can implement a fallback for files where we can't sniff out the sizes. Since CWL/WDL don't seem to currently support FTP, this won't be a priority. (Though since MiniWDL does support FTP, maybe we should make this higher priority for behavioral parity?) There also seems to be some inefficiency in the […]
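A minimal sketch of what the leader-side planning could look like, assuming a hypothetical `sniff_size` helper and a simple two-way split at the user-configurable threshold (none of these names come from Toil's actual code):

```python
import os
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def sniff_size(url: str) -> Optional[int]:
    """Best-effort size in bytes, or None if the source won't tell us."""
    parsed = urlparse(url)
    if parsed.scheme in ("", "file"):
        # Local files: just stat them.
        return os.stat(parsed.path or url).st_size
    if parsed.scheme in ("http", "https"):
        # HEAD request; old servers may not send Content-Length.
        with urlopen(Request(url, method="HEAD")) as response:
            length = response.headers.get("Content-Length")
            return int(length) if length is not None else None
    return None  # e.g. FTP, which we may not support at first


def plan_imports(urls: list[str], threshold: int) -> tuple[list[str], list[str]]:
    """Split URLs into a batch of small files and a list of large files
    that will each get their own import job sized to the file."""
    batch, individual = [], []
    for url in urls:
        size = sniff_size(url)
        if size is not None and size > threshold:
            individual.append(url)
        else:
            # Unknown sizes land in the batch for now; see the fallback
            # discussion below.
            batch.append(url)
    return batch, individual
```

The leader would then build the import DAG from `plan_imports`'s output, so the large files can be fetched in parallel while the small files share one batch job.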
An import job will first attempt to import via streaming. When that fails, a child job will be made to handle the import, with the child job's disk requirement being the associated filesize. (How will this work exactly for the batch import? I think the child job will consist of either the remaining batch or just that one failed file. If it's just one failed file, won't we run into the same issue we tried to avoid with the batch import, where all the small files get their own inefficient jobs? Especially when the file is local and we would just try a symlink.) If the import ultimately fails, we might just bail until we figure out a fallback. When we can't get the filesize of some file, maybe we should just import it on the leader and warn the user that their file is bad.
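A rough sketch of that retry shape using Toil's `Job` class; the streaming helper and the exception are hypothetical stand-ins for whatever the retooled import interface ends up exposing:

```python
from toil.job import Job


class ImportNotStreamableError(Exception):
    """Hypothetical: raised when a URL can't be imported by streaming."""


def try_streaming_import(url):
    """Hypothetical stand-in for a streaming import through the
    retooled import interface."""
    raise NotImplementedError


class StreamingImportJob(Job):
    def __init__(self, url: str, size: int):
        # Streaming should only need a small, constant disk allocation.
        super().__init__(disk="100M")
        self.url = url
        self.size = size

    def run(self, fileStore):
        try:
            return try_streaming_import(self.url)
        except ImportNotStreamableError:
            # Fall back to a child sized to the actual file, so only the
            # files that really need the disk pay for it. Returning the
            # child's promise lets downstream jobs see the imported file.
            child = NonStreamingImportJob(self.url, self.size)
            self.addChild(child)
            return child.rv()


class NonStreamingImportJob(Job):
    def __init__(self, url: str, size: int):
        # Disk requirement is the sniffed filesize (plus some headroom
        # in a real implementation).
        super().__init__(disk=size)
        self.url = url

    def run(self, fileStore):
        # Download into local scratch space, then import without streaming.
        raise NotImplementedError
```

This sketches the per-file case; the batch case would presumably hang one such fallback child off the batch job per failed file, which is exactly the inefficiency worry raised above.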
When `--importWorkerDisk` is not specified for WDL and CWL, there should be some sort of file sniffing functionality to detect the minimum amount of disk space needed for the importing job to run. This probably can't take into account whether a file is streamable or not. We can probably try to stream the files in first with a small disk requirement, and then, when that doesn't work, start a child job to run the imports without streaming but with more disk space. This will likely require retooling the import interface so file streaming can be controlled there.
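One possible shape for that retooling, sketched as a hypothetical signature rather than Toil's real API: expose streaming as an explicit parameter, so the job planner rather than the importer decides.

```python
from typing import Optional


def import_url(url: str, stream: bool = True,
               disk_hint: Optional[int] = None):
    """Hypothetical retooled entry point. The caller controls streaming:

    - stream=True: pipe the remote source straight into the job store,
      using only a small, constant amount of scratch space.
    - stream=False: download to scratch first (sized by disk_hint, the
      sniffed filesize), then import the local copy.
    """
    if stream:
        ...  # stream directly; raise if the source can't be streamed
    else:
        ...  # download to local disk, then import the downloaded file
```

A failed streaming attempt would then raise a dedicated exception that the parent import job can catch, as in the retry sketch above.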
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1655