File saving error due to parallel file opening. #388

@jyhong836


Bug description

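When running benchmarks.hotpot_qa.adal_exp.train_agent_rag, the script may fail while saving a checkpoint, raising OSError: [Errno 24] Too many open files. I suspect the cause is the use of multiple parallel threads, which may leave too many files open at the same time.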

What version are you seeing the problem on?

GitHub commit: b43a866

How to reproduce the bug

python -m benchmarks.hotpot_qa.adal_exp.train_agent_rag

Error messages and logs

Loading Data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1319.52it/s]
Predicting: step(9): 0.0 across 3 samples, Max potential: 0.0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 316.29it/s]
2025-06-05 09:33:07 - [trainer.py:2230:_text_grad_constraint_propose_step] - Fail minibatch check, try next proposal: True, 0.0 <= 0.6666666666666666                                 | 0/3 [00:00<?, ?it/s]
Proposing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:41<00:00, 32.24s/it]
No proposal can improve the subset and full set, and val set██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:41<00:00, 18.09s/it]
Saving checkpoint to chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json
Training Step: 9:  32%|███████████████████████████████████████████████                                                                                                    | 8/25 [37:57<1:20:40, 284.74s/it]
Epoch:   0%|                                                                                                                                                                          | 0/1 [37:57<?, ?it/s]
Traceback (most recent call last):
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/utils/file_io.py", line 26, in save_json
OSError: [Errno 24] Too many open files: 'chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/ssd2/junyuan/AdalFlow/benchmarks/hotpot_qa/adal_exp/train_agent_rag.py", line 225, in <module>
  File "/ssd2/junyuan/AdalFlow/benchmarks/hotpot_qa/adal_exp/train_agent_rag.py", line 171, in train
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 668, in fit
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 644, in run_text_optimizers
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 2385, in _fit_text_grad_constraint
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/optim/trainer/trainer.py", line 2324, in _text_grad_constraint_propose_step
  File "/ssd2/junyuan/AdalFlow/adalflow/adalflow/utils/file_io.py", line 30, in save_json
OSError: Error saving object to JSON file chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json: [Errno 24] Too many open files: 'chkpts/hotpot_qa/adal_exp/constrained_max_steps_12_b36c8_run_1.json'
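
Errno 24 means the process hit its limit on open file descriptors, so it may help to log how close the run gets to that limit before the checkpoint is written. The following is a minimal diagnostic sketch, not part of AdalFlow: fd_usage is a hypothetical helper name, and it assumes a Unix-like system that exposes /proc/self/fd or /dev/fd.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    Hypothetical helper for debugging; not an AdalFlow API.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in this directory corresponds to one open descriptor.
    # Listing the directory briefly opens a descriptor itself, so the
    # count can be one higher than the number held by the program.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

Calling this once per training step would show whether the descriptor count grows steadily (suggesting handles or sockets are not being closed) or only spikes while the parallel workers are active.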

Environment

  • OS: Linux (Ubuntu)
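
On Linux, the per-process descriptor limit has a soft value that a process may raise up to the hard value without extra privileges. As a possible stopgap, untested against this benchmark and assuming the default soft limit is the bottleneck rather than a descriptor leak, the soft limit could be raised at the start of the script. The function name and the target of 4096 are illustrative assumptions.

```python
import resource


def raise_nofile_soft_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Hypothetical helper; never lowers the limit. Returns the soft limit
    in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited; nothing to raise.
        return soft
    # The soft limit cannot exceed the hard limit for an unprivileged process.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit now:", raise_nofile_soft_limit())
```

This only hides the symptom if descriptors are leaking; in that case the error would reappear later in a longer run.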

More info

No response

Labels: bug
