Error in check_mclapply_OK(complete_set_of_results) #15
Comments
It definitely could be a QUILT-related problem. To double check, does your scheduler output error information, so you can see whether the job ran out of RAM? If you're parallelizing anyway, can you decrease to 1 core per job and re-run? Also, what's the nature of the data: is it uniform low-coverage, or something else? I'm just wondering whether the intermittent error relates to either heuristics or overflow issues. For instance, I've sometimes seen problems with GBS-type data with a lot of nearby SNPs (say 20 within 100 bp) with lots of coverage (40X) that aren't downsampled aggressively enough (see the downsampling option).
Hi Robert, To answer your questions:
However, after talking to the IT service, I think I understood that the problem could be related to the way their system works: I assumed I had a full 4GB/core, but they said it's actually only approximately 4GB/core, so slightly less in practice. This means that when I was running a job on 4 cores with 4GB each (16GB in total per job), I was actually getting less memory than I expected. So maybe what happened is that QUILT expected 4GB per core but some cores received less. I don't know if this makes any sense, I'm just trying to figure it out! However, if you think this is not a possible explanation and you have some suggestion, please let me know what I can do. Thank you very much.
OK interesting, thanks. So unclear on whether it's RAM related then? The scheduler doesn't store whether a particular job runs out of RAM, but when you checked memory usage in general it seemed OK? About the 5000 samples, QUILT imputes samples independently, so you can split into smaller batches and run those independently, then merge the VCFs afterwards.
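For example, a rough sketch of the batching step in R (the file names and batch size here are placeholders, not anything from your setup):

```r
## Rough sketch: split one big bamlist into batches of 500 samples each,
## writing bamlist.batch1.txt, bamlist.batch2.txt, ... which can then be
## given to independent QUILT runs over the same region.
bams <- readLines("bamlist.txt")          # placeholder file name
batch_size <- 500
batch_id <- ceiling(seq_along(bams) / batch_size)
for (b in unique(batch_id)) {
    writeLines(bams[batch_id == b], sprintf("bamlist.batch%d.txt", b))
}
## The per-batch VCFs (same region, different samples) can afterwards be
## combined, e.g. with "bcftools merge batch*.vcf.gz -Oz -o all_samples.vcf.gz".
```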
When I check the memory usage after a run it seems ok. However, I found that another error is reported together with the one I mentioned at the beginning:
I've found that working in a shared computing environment might create this issue, since mclapply sometimes fails on shared Linux systems when it automatically determines the number of cores to use from nCores. If this is true, it would explain why the issue is so random; in fact, when I repeat the exact same job twice, most of the time it doesn't fail. So the only explanation I can think of is some sort of internal conflict on a shared environment. Maybe I should try occupying the entire node; it's the only way I can think of to prevent other people from "stealing" resources from my job. I also read that it could be solved by strictly defining the number of cores via mclapply's mc.cores argument, but I have no access to that internal option of the software and I don't know how QUILT communicates the chosen number of cores to mclapply.

Regarding your suggestion of splitting the 5000 samples into small batches, I already did that to fix these problems in some windows. Actually, I'm working with >75K samples that I've already split into smaller batches of 5K. The system I work on doesn't have many resources and I also have a limited amount of time on it, so I designed a pipeline that parallelizes the run of the 22 autosomes for the 5000 samples. So far, it has allowed me to finish a batch in 3 days (which is also the walltime limit I have on the system I use). But thanks for the suggestion.
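Just to illustrate what I suspect is happening (a toy sketch, not QUILT code): if one of the forked workers dies, for example because something else on a shared node pushes it over the memory limit, mclapply quietly returns NULL for that worker's jobs and emits exactly the kind of warning I'm seeing:

```r
library(parallel)
## Toy reproduction (not QUILT code): if a forked worker exits early,
## mclapply returns NULL for all jobs scheduled on that worker and warns
## "scheduled core(s) ... did not deliver results".
res <- mclapply(1:4, mc.cores = 2, function(i) {
    if (i == 2) quit(save = "no")  # simulate a worker being killed mid-run
    i^2
})
str(res)  # jobs handled by the killed worker come back as NULL
```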
Hi Robert, sorry to bother you, but I re-ran QUILT on just a couple of windows that previously crashed. This time I used a reduced batch of 500 samples, but I still had problems; this error keeps popping up:
I've read that mclapply() is known to be unstable with code that multi-threads and, since I'm using multi-threading within QUILT, I wonder whether this might be the problem. Normally I use nCores=4, which should not be too intensive, and avoiding it would not be time efficient. Do you have any suggestion? Thanks
If you use nCores = 1, mclapply will just fall back to lapply, which is more stable.

I have definitely seen unstable bugs in QUILT in the past though, e.g. jobs that run fine once then fail if re-run in seemingly exactly the same way, related to randomness in the read assignment. I haven't seen any recently, and in the past they have mostly been due to underflow / overflow (hence me asking whether the coverage was truly uniform, or would have weird high spikes like with GBS). It's possible there are more weird edge cases out there. The best way to sort these out is to set a seed (e.g. using the seed option in QUILT.R), use 1 core, and see if you can consistently see a failure with one seed and consistently not see a failure with another seed.

One other thing: this line suggests you were using >=23 cores though, far more than 4?
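To make that concrete, here is a sketch of the kind of debugging run I mean, calling the QUILT() R function directly (all file paths and most arguments are placeholders for whatever you currently use, and I'm assuming seed and nCores are passed through the same way as the QUILT.R command line options):

```r
library(QUILT)
## Hedged sketch of the debugging run: same window, single core, fixed seed.
## Keep your existing arguments; the file paths below are placeholders.
QUILT(
    outputdir                = "quilt_debug/",
    chr                      = "chr21",
    regionStart              = 1000001,
    regionEnd                = 5000000,
    buffer                   = 500000,
    bamlist                  = "bamlist.batch1.txt",
    reference_haplotype_file = "ref.hap.gz",
    reference_legend_file    = "ref.legend.gz",
    reference_sample_file    = "ref.samples",
    nGen                     = 100,
    seed                     = 1,   # re-run with e.g. seed = 2 and compare
    nCores                   = 1    # mclapply falls back to lapply
)
```

If one seed fails consistently and another does not, that points at the randomness in the read assignment rather than at the cluster.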
Indeed, I was using more cores just because I needed to speed up the process, but in hindsight I see it was not a good choice.
There's nothing inherently wrong with using >1 core, I do it all the time, but normally for testing (when I want the wall clock time to be low). IMO, 1 core is usually better for debugging (to isolate the problem), as well as for production jobs (which 75,000 samples definitely qualifies as!).
Hi,
I'm running QUILT using a scheduler that allows me to send one job per chromosome, parallelizing the run of each one of the windows provided. In other words, each chromosome is a job with multiple tasks (windows) that run QUILT in parallel.
In many cases things work fine, but sometimes some of the tasks fail and the error I see from QUILT is:
```
Error in check_mclapply_OK(complete_set_of_results) :
  An error occured during QUILT. The first such error is above
Calls: QUILT -> check_mclapply_OK
In addition: Warning message:
In mclapply(1:length(sampleRanges), mc.cores = nCores, function(iCore) { :
  scheduled cores 2, 3 did not deliver results, all values of the jobs will be affected
Execution halted
```
Each task is using 4 cores (4GB each, i.e., 16GB per task). I don't think it's a memory-related problem, since bigger chromosomes ran smoothly with even less memory per core, and the error I'm reporting comes from smaller chromosomes with even fewer tasks per job.
I also noticed that this behavior is random: sometimes re-running the same job does not generate the problem.
I wonder whether this is a QUILT related issue or something else.
Could you please help me with this?
Thanks