feat: Enable pipeline override and reuse with compatible options #2952
base: main
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
✅ DCO Check Passed
Thanks @cau-git, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
…ne-options-override-without-reinit
- remove `force_all_model_init`
- reject incompatible override options (no auto pipeline reinit)
- allow runtime `do_*` overrides only for `True -> False` toggles
- apply compatible `do_*` overrides per execution in base/threaded PDF pipelines
- add compatibility tests and update converter docstrings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
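The compatibility rule in this commit message can be illustrated with a small, self-contained sketch. This is not the code from the PR: `ToyPipelineOptions` and `check_override` are hypothetical names, and the only behavior assumed is what the commit message states (a `do_*` flag may go from `True` to `False`, every other difference is rejected).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToyPipelineOptions:
    """Hypothetical stand-in for a pipeline options model (not docling's class)."""

    do_ocr: bool = True
    do_table_structure: bool = True
    images_scale: float = 1.0


def check_override(initialized: ToyPipelineOptions, override: ToyPipelineOptions) -> None:
    """Raise ValueError unless `override` can reuse a pipeline built with `initialized`."""
    for f in fields(initialized):
        old = getattr(initialized, f.name)
        new = getattr(override, f.name)
        if old == new:
            continue  # unchanged fields are always compatible
        if f.name.startswith("do_"):
            if old is True and new is False:
                continue  # relaxing a flag needs no extra model initialization
            raise ValueError(f"{f.name} cannot be enabled at call time")
        raise ValueError(f"{f.name} must match the initialized pipeline options")
```

In this sketch, `check_override(ToyPipelineOptions(), ToyPipelineOptions(do_ocr=False))` returns without error, while enabling `do_ocr` on a pipeline initialized with `do_ocr=False`, or changing `images_scale`, raises `ValueError`.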
Related Documentation
1 document(s) may need updating based on files changed in this PR:
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
Suggested Changes
@@ -8,9 +8,8 @@
- `generate_page_images`, `generate_picture_images`: Extract page/picture images
- `force_backend_text`: Force backend text extraction
- Additional options for OCR engine, layout model, table extraction, etc.
-- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8).
-
----
+- **Pipeline Option Overrides**: The Python API allows you to override pipeline options at conversion time for a given format using the `format_options` argument. Only `do_*` flags (such as `do_ocr`, `do_table_structure`, `do_code_enrichment`, `do_formula_enrichment`, etc.) can be changed, and only from `True` to `False`. All other options must remain identical to those used at pipeline initialization. Attempting to enable a `do_*` flag or change other fields will result in an error. This enables per-call disabling of enrichment features without reinitializing the pipeline.
+- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8). Refer to the Python SDK documentation for usage of `format_options`.
### DOCX
- **Pipeline/Backend**: `SimplePipeline` + `MsWordDocumentBackend`
@@ -52,5 +51,4 @@
- Only PDF supports image resolution adjustment (`images_scale`).
- DOCX header/footer export is only available via Python API.
- PPTX/XLSX support enrichment options and pagination (slide/sheet level).
-
-For further details, refer to the provided code links and examples.
+- **Pipeline Option Overrides**: For all formats, the Python API supports disabling enrichment-related `do_*` flags at conversion time using the `format_options` argument. Only disabling (True → False) is allowed; all other options must remain unchanged. See the PDF section above for details.
    def _get_enrichment_pipe_for_execution(
        self,
    ) -> Iterable[GenericEnrichmentModel[Any]]:
        effective_options = self.get_effective_options()
        assert isinstance(effective_options, ConvertPipelineOptions)

        do_picture_classification = (
            effective_options.do_picture_classification
            or effective_options.do_chart_extraction
        )
        do_picture_description = effective_options.do_picture_description
        do_chart_extraction = effective_options.do_chart_extraction

        for model in self.enrichment_pipe:
            if isinstance(model, DocumentPictureClassifier):
                if do_picture_classification:
                    yield model
            elif isinstance(model, PictureDescriptionBaseModel):
                if do_picture_description:
                    yield model
            elif isinstance(model, ChartExtractionModelGraniteVision):
                if do_chart_extraction:
                    yield model
            else:
                yield model
Not the coolest thing to put here. Ideas for improvements are welcome.
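One possible direction, offered only as a sketch: replace the `isinstance` chain with a lookup table that maps each model type to a predicate over the effective options. The classes below (`Classifier`, `Describer`, `ChartExtractor`, `OtherModel`, `ToyOptions`) and the names `GATES` and `enabled_models` are hypothetical stand-ins so the example runs on its own; they are not docling classes, and the gating conditions simply mirror the snippet under review.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


# Hypothetical stand-ins for the enrichment model classes in the snippet above.
class Classifier: ...
class Describer: ...
class ChartExtractor: ...
class OtherModel: ...


@dataclass
class ToyOptions:
    do_picture_classification: bool = True
    do_picture_description: bool = True
    do_chart_extraction: bool = True


# One predicate per gated model type; types not listed here always run.
GATES: dict[type, Callable[[ToyOptions], bool]] = {
    # Chart extraction relies on classification, so either flag keeps the classifier.
    Classifier: lambda o: o.do_picture_classification or o.do_chart_extraction,
    Describer: lambda o: o.do_picture_description,
    ChartExtractor: lambda o: o.do_chart_extraction,
}


def enabled_models(models: Iterable[object], options: ToyOptions) -> Iterator[object]:
    """Yield the models that should run for this execution, preserving order."""
    for model in models:
        # Find the first gate whose type matches; None means the model is ungated.
        gate = next((g for t, g in GATES.items() if isinstance(model, t)), None)
        if gate is None or gate(options):
            yield model
```

The table keeps the per-flag conditions in one place, so adding another gated model would mean adding one entry rather than another `elif` branch. Whether that reads better than the explicit chain inside the pipeline class is a judgment call.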
Summary
This PR adds per-call pipeline option overrides in `DocumentConverter` and enforces compatibility-based reuse of initialized pipelines.

What’s Included
- `format_options` override support to:
  - `DocumentConverter.convert(...)`
  - `DocumentConverter.convert_all(...)`
- `do_*` fields
- `do_*` flags can only be relaxed (`True -> False`)
- `do_*` overrides are respected safely.
- `tests/test_options.py`.

Behavior
- `raises_on_error`).

Checklist: