
Conversation

@holtskinner
Member

Fixes #271 🦕

@holtskinner holtskinner requested review from a team as code owners March 8, 2024 17:22
@holtskinner holtskinner requested a review from m-strzelczyk March 8, 2024 17:22
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Mar 8, 2024
@holtskinner holtskinner requested review from dizcology and removed request for m-strzelczyk March 8, 2024 17:22
@holtskinner
Member Author

This error message appeared in the tests before the `document.py` commit was applied:

=================================== FAILURES ===================================
_______ test_quickstart_sample_batch_process_metadata_matching_prefixes ________

capsys = <_pytest.capture.CaptureFixture object at 0x7f3543707c50>

    def test_quickstart_sample_batch_process_metadata_matching_prefixes(
        capsys: pytest.CaptureFixture,
    ) -> None:
        batch_process_metadata = documentai.BatchProcessMetadata(
            state=documentai.BatchProcessMetadata.State.SUCCEEDED,
            individual_process_statuses=[
                documentai.BatchProcessMetadata.IndividualProcessStatus(
                    input_gcs_source="gs://test-directory/documentai/input.pdf",
                    output_gcs_destination="gs://documentai_toolbox_samples/output/matching-prefixes/1",
                ),
                documentai.BatchProcessMetadata.IndividualProcessStatus(
                    input_gcs_source="gs://test-directory/documentai/input.pdf",
                    output_gcs_destination="gs://documentai_toolbox_samples/output/matching-prefixes/11",
                ),
            ],
        )
        wrapped_document = quickstart_sample.quickstart_sample(
>           batch_process_metadata=batch_process_metadata
        )

test_quickstart_sample.py:116: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
quickstart_sample.py:80: in quickstart_sample
    metadata=batch_process_metadata
../../google/cloud/documentai_toolbox/wrappers/document.py:581: in from_batch_process_metadata
    for process in list(metadata.individual_process_statuses)
../../google/cloud/documentai_toolbox/wrappers/document.py:581: in <listcomp>
    for process in list(metadata.individual_process_statuses)
../../google/cloud/documentai_toolbox/wrappers/document.py:507: in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

gcs_bucket_name = 'documentai_toolbox_samples'
gcs_prefix = 'output/matching-prefixes/1'

    def _get_shards(gcs_bucket_name: str, gcs_prefix: str) -> List[documentai.Document]:
        r"""Returns a list of `documentai.Document` shards from a Cloud Storage folder.
    
        Args:
            gcs_bucket_name (str):
                Required. The name of the gcs bucket.
    
                Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_bucket_name=`bucket`.
            gcs_prefix (str):
                Required. The prefix of the json files in the target_folder.
    
                Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_prefix=`{optional_folder}/{target_folder}`.
        Returns:
            List[google.cloud.documentai.Document]:
                A list of documentai.Documents.
    
        """
        file_check = re.match(constants.FILE_CHECK_REGEX, gcs_prefix)
        if file_check is not None:
            raise ValueError("gcs_prefix cannot contain file types")
    
        byte_array = gcs_utilities.get_bytes(gcs_bucket_name, gcs_prefix)
        shards = [
            documentai.Document.from_json(byte, ignore_unknown_fields=True)
            for byte in byte_array
        ]
    
        if not shards:
            raise ValueError("Incomplete Document - No JSON files found.")
    
        total_shards = len(shards)
    
        if total_shards > 1:
            shards.sort(key=lambda x: int(x.shard_info.shard_index))
    
            for shard in shards:
                if int(shard.shard_info.shard_count) != total_shards:
                    raise ValueError(
>                       f"Invalid Document - shardInfo.shardCount ({shard.shard_info.shard_count}) does not match number of shards ({total_shards})."
                    )
E                   ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (6).

../../google/cloud/documentai_toolbox/wrappers/document.py:134: ValueError
-------- generated xml file: /workspace/samples/snippets/sponge_log.xml --------
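The root cause is visible in the mismatched shard count above: Cloud Storage listing matches objects by raw string prefix, so the prefix `output/matching-prefixes/1` also picks up every shard under `output/matching-prefixes/11`. A minimal sketch of the behavior in plain Python (illustrative only, not the library code):

```python
# Illustrative object names, mimicking the two batch-process output folders
# from the failing test above.
object_names = [
    "output/matching-prefixes/1/shard-0.json",
    "output/matching-prefixes/11/shard-0.json",
    "output/matching-prefixes/11/shard-1.json",
]


def list_by_prefix(names, prefix):
    # GCS list operations match by raw string prefix, like startswith().
    return [n for n in names if n.startswith(prefix)]


# Without a trailing slash, shards from both folders are mixed together,
# so shardInfo.shardCount no longer matches the number of shards found.
assert len(list_by_prefix(object_names, "output/matching-prefixes/1")) == 3

# With a trailing slash, only the intended folder matches.
assert len(list_by_prefix(object_names, "output/matching-prefixes/1/")) == 1
```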

@holtskinner changed the title from "fix: Add trailing slash to `gcs_prefix` in `from_batch_process_metadata()` to cover matching prefixes edge case." to "fix: Add trailing slash if not present for `gcs_prefix` in `Document.from_gcs()` to cover matching prefixes edge case." Mar 8, 2024
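The normalization the new title describes can be sketched as a one-line check before listing objects (the helper name here is hypothetical; the actual change lives inside `Document.from_gcs()`):

```python
def ensure_trailing_slash(gcs_prefix: str) -> str:
    # Hypothetical helper illustrating the fix: append "/" when missing so
    # the prefix matches only objects inside the target folder, not sibling
    # folders that merely share leading characters (e.g. ".../11" when the
    # caller asked for ".../1").
    return gcs_prefix if gcs_prefix.endswith("/") else gcs_prefix + "/"


assert ensure_trailing_slash("output/matching-prefixes/1") == "output/matching-prefixes/1/"
assert ensure_trailing_slash("output/matching-prefixes/1/") == "output/matching-prefixes/1/"
```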
@parthea parthea assigned dizcology and unassigned parthea Mar 8, 2024
@holtskinner holtskinner merged commit b4762e8 into main Mar 8, 2024
@holtskinner holtskinner deleted the batch-process branch March 8, 2024 18:32


Successfully merging this pull request may close these issues:

`Document.from_batch_process_operation()` method failing due to sharding made by batch process documents
