Skip to content

feat: Add upload parquet and manifest files #25

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

SantanaTiago
Copy link
Collaborator

This PR adds upload of parquet and manifest file.
Improve use of temporary files by creating temporary folder at top.

Resolves #20 #22

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…pload it later to s3

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…hash to filename

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Copy link

mergify bot commented Apr 23, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Copy link
Contributor

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantanaTiago some observations, since maybe the feature was not explained in detail:

  • the parquet file should be a means to pack the conversion of several files into a single compressed file, instead of creating 1 parquet file for each converted document.
  • to make the parquet files manageable and still make best use of the storage, we could set a maximum size of 500Mb. We keep piling conversion results in a parquet file and as soon as we reach 500Mb, we stop and we start piling in a new parquet file.
  • the manifest file should then help users locate a converted PDF file in the set of parquet files. In addition, the row field could contain the row in the specific parquet file where the conversion results are located.

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
@SantanaTiago SantanaTiago marked this pull request as draft May 12, 2025 08:34
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…parquet file

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
@SantanaTiago SantanaTiago marked this pull request as ready for review May 15, 2025 08:05
Copy link
Contributor

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantanaTiago Thanks for this PR.
I posted some questions we may want to clarify.
Also please resolve the conflicts with uv.lock.


from docling_jobkit.model.s3_inputs import S3Coordinates

logging.basicConfig(level=logging.INFO)

# Set the maximum file size of parquet to 100MB
MAX_PARQUET_FILE_SIZE = 100 * 1024 * 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set it to 500MB since it is still manageable and we thus make it a bit more efficient.

Comment on lines +44 to +61
classifier_labels = [
"bar_chart",
"bar_code",
"chemistry_markush_structure",
"chemistry_molecular_structure",
"flow_chart",
"icon",
"line_chart",
"logo",
"map",
"other",
"pie_chart",
"qr_code",
"remote_sensing",
"screenshot",
"signature",
"stamp",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment: we need to coordinate with @nikos-livathinos and @Matteo-Omenetti to define these labels in a data schema for better integration

@@ -123,25 +156,35 @@ def generate_presign_url(
return None


def get_source_files(s3_source_client, s3_source_resource, s3_coords):
def get_source_files(s3_source_client, s3_source_resource, s3_coords: S3Coordinates):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also provide type hints to the other method arguments?

Comment on lines +713 to +717
manifest[f"{parquet_file_name}"] = {
"filenames": current_df["filename"].tolist(),
"doc_hashes": current_df["doc_hash"].tolist(),
"row_number": 3,
"timestamp": timestamp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share a mock example of how that manifest file would look like?
I think that if we do not need nested structures, a CSV file could be more efficient.

Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Method check_target_has_source_converted not working as designed
2 participants