-
Notifications
You must be signed in to change notification settings - Fork 2
feat: Add upload parquet and manifest files #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…pload it later to s3 Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…hash to filename Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SantanaTiago some observations, since maybe the feature was not explained in detail:
- the parquet file should be a means to pack the conversion of several files into a single compressed file, instead of creating 1 parquet file for each converted document.
- to make the parquet files manageable and still make best use of the storage, we could set a maximum size of 500Mb. We keep piling conversion results in a parquet file and as soon as we reach 500Mb, we stop and we start piling in a new parquet file.
- the manifest file should then help users locate a converted PDF file in the set of parquet files. In addition, the
row
field could contain the row in the specific parquet file where the conversion results are located.
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
…parquet file Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SantanaTiago Thanks for this PR.
I posted some questions we may want to clarify.
Also please resolve the conflicts with uv.lock
.
|
||
from docling_jobkit.model.s3_inputs import S3Coordinates | ||
|
||
logging.basicConfig(level=logging.INFO) | ||
|
||
# Set the maximum file size of parquet to 100MB | ||
MAX_PARQUET_FILE_SIZE = 100 * 1024 * 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's set it to 500MB since it is still manageable and we thus make it a bit more efficient.
classifier_labels = [ | ||
"bar_chart", | ||
"bar_code", | ||
"chemistry_markush_structure", | ||
"chemistry_molecular_structure", | ||
"flow_chart", | ||
"icon", | ||
"line_chart", | ||
"logo", | ||
"map", | ||
"other", | ||
"pie_chart", | ||
"qr_code", | ||
"remote_sensing", | ||
"screenshot", | ||
"signature", | ||
"stamp", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment: we need to coordinate with @nikos-livathinos and @Matteo-Omenetti to define these labels in a data schema for better integration
@@ -123,25 +156,35 @@ def generate_presign_url( | |||
return None | |||
|
|||
|
|||
def get_source_files(s3_source_client, s3_source_resource, s3_coords): | |||
def get_source_files(s3_source_client, s3_source_resource, s3_coords: S3Coordinates): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also provide type hints to the other method arguments?
manifest[f"{parquet_file_name}"] = { | ||
"filenames": current_df["filename"].tolist(), | ||
"doc_hashes": current_df["doc_hash"].tolist(), | ||
"row_number": 3, | ||
"timestamp": timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please share a mock example of how that manifest file would look like?
I think that if we do not need nested structures, a CSV file could be more efficient.
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
Signed-off-by: Tiago Santana <54704492+SantanaTiago@users.noreply.github.com>
This PR adds upload of parquet and manifest file.
Improve use of temporary files by creating temporary folder at top.
Resolves #20 #22