feat(Docling): prefetch model artifacts #964

Merged: 7 commits, Feb 4, 2025

jvallesm (Collaborator) commented Feb 3, 2025

Because

  • Docling needs some EasyOCR models to transform PDF to Markdown. Without them, the first execution of the document component fails because the output starts with a "Downloading detection model, please wait..." message.
  • This also prevented coverage for the Docling converter.
  • The use-docling parameter in the document operator is less open to change than an enum converter selector.
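
A minimal sketch of how a progress line on standard output can break the first run. It assumes the component parses the converter's stdout as a single JSON document, which this PR does not state; the payload shape and the function names are hypothetical.

```python
import json

# Hypothetical stdout of a first run: the download notice precedes the payload.
# The JSON shape is invented for illustration only.
FIRST_RUN_STDOUT = 'Downloading detection model, please wait...\n{"body": "# Title"}'
# Hypothetical stdout once the models are already on disk.
WARM_RUN_STDOUT = '{"body": "# Title"}'


def parse_converter_output(stdout: str) -> dict:
    """Parse stdout, assuming it must be exactly one JSON document."""
    return json.loads(stdout)


def first_run_fails() -> bool:
    """Return True if the download notice makes the output unparseable."""
    try:
        parse_converter_output(FIRST_RUN_STDOUT)
    except json.JSONDecodeError:
        return True
    return False
```

Under that assumption, shipping the models in the image removes the notice, so the first run behaves like every later one.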

This commit

  • Adds the EasyOCR models to the Docker images.
  • Corrects the integration test in the CI after the latest changes in instill-core.
  • Replaces the use-docling parameter with converter.
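
The parameter change can be illustrated with a short sketch. The PR names the converter parameter but not its accepted values, so the `Converter` members and the `resolve_converter` helper below are hypothetical. The boolean branch is included only for contrast; the PR replaces the flag rather than keeping it.

```python
from enum import Enum


class Converter(str, Enum):
    # Hypothetical values: the PR does not list the accepted options.
    DEFAULT = "default"
    DOCLING = "docling"


def resolve_converter(params: dict) -> Converter:
    """Pick a converter from operator parameters (parameter handling is assumed)."""
    if "converter" in params:
        # Raises ValueError for names that are not Converter members.
        return Converter(params["converter"])
    # A boolean flag can only ever express two choices.
    return Converter.DOCLING if params.get("use-docling") else Converter.DEFAULT
```

Adding a third engine to an enum means adding one member, whereas a boolean flag would need a second parameter or a breaking change.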

The following changes are made to the Dockerfile:

  • nobody:nogroup needs to have a $HOME where the EasyOCR models will be placed (internally, this engine looks for the models in ~/.EasyOCR/model).
  • The workdir (/pipeline-backend) is owned by nobody:nogroup in the dev image so we can run the coverage action without the root user.
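
As a rough illustration of why the user needs a $HOME, the sketch below resolves the model directory named above (~/.EasyOCR/model) and checks whether anything has been placed there. The helper names are hypothetical, and the check does not verify which model files are present.

```python
from pathlib import Path


def easyocr_model_dir(home: str) -> Path:
    """Directory where, per the PR description, EasyOCR looks for models."""
    return Path(home) / ".EasyOCR" / "model"


def models_prefetched(home: str) -> bool:
    """True if the model directory exists and holds at least one file."""
    model_dir = easyocr_model_dir(home)
    return model_dir.is_dir() and any(p.is_file() for p in model_dir.iterdir())


if __name__ == "__main__":
    # Uses the current user's home directory, as the engine would.
    print(models_prefetched(str(Path.home())))
```

A check like this could run as a smoke test inside the image, under the same user that runs the service, before the first conversion is attempted.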

@jvallesm jvallesm self-assigned this Feb 3, 2025
linear bot commented Feb 3, 2025

@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch 4 times, most recently from 81b0ec7 to d1df7ea Compare February 3, 2025 17:57
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from d1df7ea to 27bdee2 Compare February 4, 2025 07:35
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch 4 times, most recently from ab5a3f1 to e0f8139 Compare February 4, 2025 09:27
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from e0f8139 to 44e8e57 Compare February 4, 2025 09:57
@jvallesm jvallesm marked this pull request as ready for review February 4, 2025 11:10
@jvallesm jvallesm requested a review from donch1989 as a code owner February 4, 2025 11:10
@jvallesm jvallesm merged commit c9ff323 into main Feb 4, 2025
12 checks passed
@jvallesm jvallesm deleted the jvalles/ins-7276-add-docling-to-document-operator branch February 4, 2025 11:21
jvallesm pushed a commit that referenced this pull request Feb 25, 2025
🤖 I have created a release *beep* *boop*
---


## [0.51.0-beta](v0.50.0-beta...v0.51.0-beta) (2025-02-25)


### Features

* **all:** rename VDP to pipeline ([#963](#963)) ([8ba570a](8ba570a))
* **component:** support metadata filter in artifact component ([#979](#979)) ([624029a](624029a))
* **Docling:** prefetch model artifacts ([#964](#964)) ([c9ff323](c9ff323))
* **document:** convert PDF to Markdown with Docling ([#959](#959)) ([a9dbf55](a9dbf55))
* **document:** log execution times for benchmarking ([#969](#969)) ([ac3e2c3](ac3e2c3))
* **init:** remove preset pipeline downloader ([#970](#970)) ([11f8f5c](11f8f5c))
* **minio:** add client info and user header to artifact binary fetcher ([#978](#978)) ([78c9c1f](78c9c1f))
* **minio:** add service name and version to MinIO requests ([#976](#976)) ([39c66cd](39c66cd))
* **minio:** log MinIO actions with requester ([#972](#972)) ([8ba353e](8ba353e))
* **perplexity:** add new Sonar models ([#957](#957)) ([2699679](2699679))
* **recipe:** rename `format` to `type` in variable section ([#971](#971)) ([88ead91](88ead91))
* **x:** update MinIO package to delegate audit logs ([#973](#973)) ([f81287b](f81287b))


### Bug Fixes

* **ci:** registry image build ([#960](#960)) ([3a56698](3a56698))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).