Skip to content

feat(document): log execution times for benchmarking #969

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 7, 2025

Conversation

jvallesm
Copy link
Collaborator

@jvallesm jvallesm commented Feb 5, 2025

Because

  • We want to benchmark the performance of Docling vs pdfplumber when converting PDF to Markdown.

This commit

  • Cleans up the transformer structure in the component. Several levels of indirection were introduced with no clear purpose, which made it difficult to share a logger instance across functions.
  • Logs the timestamp at several points in the document conversion execution, indexed by the pipeline run ID.
  • Additionally, make coverage runs the Docker container (again) as root in order to share the local code with the instance.

@jvallesm jvallesm self-assigned this Feb 5, 2025
Copy link

linear bot commented Feb 5, 2025

@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch 2 times, most recently from 3988526 to 0bfed5b Compare February 5, 2025 09:51
On GitHub actions the [USER and WORKDIR directives cause
issues](https://docs.github.com/en/actions/sharing-automations/creating-actions/dockerfile-support-for-github-actions#user),
preventing us from writing the coverage report on the CI.
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from 0bfed5b to 6469d5b Compare February 5, 2025 09:51
The component structure adds several levels of indirection for no
reason. We can keep the logic that selects the conversion algorithm
hidden under the transformer package.
@jvallesm
Copy link
Collaborator Author

jvallesm commented Feb 6, 2025

QA ✅

Benchmark logs

{
  "L": "INFO",
  "T": "2025-02-06T14:29:36.902Z",
  "M": "PDF to Markdown conversion",
  "pipelineTriggerID": "09313b2b-e6f5-4099-8ee9-6f3d7e629cb6",
  "executionID": "5d1110bf-3ece-4c7b-954e-c1b182f171b2",
  "start": "2025-02-06T14:29:26.013Z",
  "repair": "2025-02-06T14:29:26.209Z",
  "convert": "2025-02-06T14:29:36.900Z",
  "handleOutput": "2025-02-06T14:29:36.902Z",
  "durationInSecs": 10.888739176,
  "ok": true
}

Capture unwanted stderr / stdout logs

{
  "L": "INFO",
  "T": "2025-02-07T08:23:31.124Z",
  "M": "PDF to Markdown Python script produced conversion logs",
  "pipelineTriggerID": "6f8f39e6-553a-42a8-a240-f09535ffc21b",
  "executionID": "731c6034-f224-4503-98b6-a68e904392d1",
  "conversionLogs": [
    "Downloading detection model, please wait. This may take several minutes depending upon your network connection.",
    "Downloading recognition model, please wait. This may take several minutes depending upon your network connection."
  ]
}

Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 64.87805% with 72 lines in your changes missing coverage. Please review.

Project coverage is 19.53%. Comparing base (1aa40c2) to head (2c5577b).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
...tor/document/v0/transformer/markdowntransformer.go 61.00% 27 Missing and 12 partials ⚠️
...mponent/operator/document/v0/transformer/images.go 35.71% 6 Missing and 3 partials ⚠️
...onent/operator/document/v0/transformer/markdown.go 84.44% 5 Missing and 2 partials ⚠️
...erator/document/v0/convert_document_to_markdown.go 72.72% 4 Missing and 2 partials ⚠️
...mponent/operator/document/v0/transformer/helper.go 25.00% 6 Missing ⚠️
pkg/data/document.go 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   19.22%   19.53%   +0.31%     
==========================================
  Files         426      436      +10     
  Lines       84603    85104     +501     
==========================================
+ Hits        16264    16629     +365     
- Misses      66002    66115     +113     
- Partials     2337     2360      +23     
Flag Coverage Δ
unittests 19.53% <64.87%> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from 6d861dd to f4df57d Compare February 7, 2025 08:24
@jvallesm jvallesm marked this pull request as ready for review February 7, 2025 08:25
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from d635d49 to 88ca817 Compare February 7, 2025 09:52
@jvallesm jvallesm force-pushed the jvalles/ins-7276-add-docling-to-document-operator branch from 2778d25 to 2c5577b Compare February 7, 2025 10:53
@jvallesm jvallesm merged commit ac3e2c3 into main Feb 7, 2025
13 checks passed
@jvallesm jvallesm deleted the jvalles/ins-7276-add-docling-to-document-operator branch February 7, 2025 11:27
jvallesm pushed a commit that referenced this pull request Feb 25, 2025
🤖 I have created a release *beep* *boop*
---


##
[0.51.0-beta](v0.50.0-beta...v0.51.0-beta)
(2025-02-25)


### Features

* **all:** rename VDP to pipeline
([#963](#963))
([8ba570a](8ba570a))
* **component:** support metadata filter in artifact component
([#979](#979))
([624029a](624029a))
* **Docling:** prefetch model artifacts
([#964](#964))
([c9ff323](c9ff323))
* **document:** convert PDF to Markdown with Docling
([#959](#959))
([a9dbf55](a9dbf55))
* **document:** log execution times for benchmarking
([#969](#969))
([ac3e2c3](ac3e2c3))
* **init:** remove preset pipeline downloader
([#970](#970))
([11f8f5c](11f8f5c))
* **minio:** add client info and user header to artifact binary fetcher
([#978](#978))
([78c9c1f](78c9c1f))
* **minio:** add service name and version to MinIO requests
([#976](#976))
([39c66cd](39c66cd))
* **minio:** log MinIO actions with requester
([#972](#972))
([8ba353e](8ba353e))
* **perplexity:** add new Sonar models
([#957](#957))
([2699679](2699679))
* **recipe:** rename `format` to `type` in variable section
([#971](#971))
([88ead91](88ead91))
* **x:** update MinIO package to delegate audit logs
([#973](#973))
([f81287b](f81287b))


### Bug Fixes

* **ci:** registry image build
([#960](#960))
([3a56698](3a56698))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants