Open
Labels
bug: Something isn't working, either in /adalflow, /tutorials, or /use cases...
Description
Bug description
When running benchmarks.hotpot_qa.adal_exp.train_agent_rag, the script may fail to save the checkpoint. I suspect this is caused by the use of multiple parallel threads, which may be exhausting the process's open-file limit.
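To help confirm whether the process is actually running out of file descriptors (the `[Errno 24] Too many open files` in the log below), a small diagnostic along these lines could be called just before the checkpoint is saved. This is only a sketch and not part of AdalFlow: `fd_usage` and `raise_soft_limit` are hypothetical helper names, the `/proc/self/fd` count assumes Linux, and raising the soft limit is a workaround that would hide a descriptor leak rather than fix it.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is an approximate count read from /proc/self/fd (Linux only);
    it is -1 when that directory is not available.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else -1
    return open_fds, soft, hard


def raise_soft_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already unlimited or high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open fds: {used}, soft limit: {soft}, hard limit: {hard}")
```

If the open-descriptor count keeps climbing toward the soft limit as training steps progress, that would point to descriptors not being released (for example by per-step thread pools or HTTP clients) rather than to the checkpoint write itself.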
What version are you seeing the problem on?
Github commit version: b43a866
How to reproduce the bug
python -m benchmarks.hotpot_qa.adal_exp.train_agent_rag
Error messages and logs
Loading Data: 100%|██████████| 3/3 [00:00<00:00, 1319.52it/s]
Predicting: step(9): 0.0 across 3 samples, Max potential: 0.0: 100%|██████████| 3/3 [00:00<00:00, 316.29it/s]
2025-06-05 09:33:07 - [trainer.py:2230:_text_grad_constraint_propose_step] - Fail minibatch check, try next proposal: True, 0.0 <= 0.6666666666666666 | 0/3 [00:00<?, ?it/s]
Proposing: 100%|██████████| 5/5 [02:41<00:00, 32.24s/it]
No proposal can improve the subset and full set, and val set██████████| 5/5 [02:41<00:00, 18.09s/it]
Saving checkpoint to chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json
Training Step: 9: 32%|███       | 8/25 [37:57<1:20:40, 284.74s/it]
Epoch: 0%| | 0/1 [37:57<?, ?it/s]
Traceback (most recent call last):
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/utils/file_io.py", line 26, in save_json
OSError: [Errno 24] Too many open files: 'chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/ssd2/junyuan/AdalFlow/benchmarks/hotpot_qa/adal_exp/train_agent_rag.py", line 225, in <module>
File "/ssd2/junyuan/AdalFlow/benchmarks/hotpot_qa/adal_exp/train_agent_rag.py", line 171, in train
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 668, in fit
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 644, in run_text_optimizers
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 2385, in _fit_text_grad_constraint
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 2324, in _text_grad_constraint_propose_step
File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/utils/file_io.py", line 30, in save_json
OSError: Error saving object to JSON file chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json: [Errno 24] Too many open files: 'chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json'
Environment
- OS: Linux (Ubuntu)
More info
No response