migrate code from googleapis/python-documentai #8862

Merged 112 commits on Jan 5, 2023
Commits
5d1ef63
Chore: Add requirements.txt and noxfile.py for new samples (#45)
aribray Oct 19, 2020
d8068bc
docs(samples): new Doc AI samples for v1beta3 (#44)
aribray Oct 21, 2020
e1f2ffd
chores: fixed small issue with start index problem (#56)
munkhuushmgl Nov 11, 2020
dd6edae
chore: update samples noxfile
yoshi-automation Nov 18, 2020
162c543
fix: removes C-style semicolons and slash comments (#59)
telpirion Nov 20, 2020
cb39215
chore(deps): update dependency google-cloud-storage to v1.33.0 (#61)
renovate-bot Nov 25, 2020
cfbf114
fix: added if statement to filter out dir blob files (#63)
munkhuushmgl Dec 2, 2020
2c75b23
samples(fix): change comments to match function signature (#68)
telpirion Dec 4, 2020
d522922
fix: moves import statment inside region tags (#71)
telpirion Dec 9, 2020
e1cc457
samples: added test that covers the wrong file type case (#69)
munkhuushmgl Dec 11, 2020
9e9c288
chore: update templates (#74)
yoshi-automation Dec 29, 2020
ecb67d1
samples: migrate v1beta2 doc AI samples (#79)
munkhuushmgl Jan 12, 2021
9dd0c67
chore(deps): update dependency google-cloud-storage to v1.35.0 (#78)
renovate-bot Jan 13, 2021
3146c9c
chore: added increased timeout on flaky batch request (#84)
munkhuushmgl Jan 22, 2021
567b71e
chore: exclude `.nox` directories from linting (#87)
yoshi-automation Jan 28, 2021
5fdb43a
chore(deps): update dependency google-cloud-storage to v1.36.0 (#91)
renovate-bot Feb 12, 2021
f3198cf
chore(deps): update dependency google-cloud-storage to v1.36.1 (#92)
renovate-bot Feb 24, 2021
95710e3
fix(samples): swaps 'continue' for 'return' (#93)
telpirion Mar 5, 2021
8908d29
fix: adds comment with explicit hostname change (#94)
telpirion Mar 11, 2021
5cdea39
chore(deps): update dependency google-cloud-storage to v1.36.2 (#95)
renovate-bot Mar 12, 2021
3fea6b4
chore: update templates (#97)
yoshi-automation Mar 23, 2021
a9f7ecd
chore(deps): update dependency google-cloud-storage to v1.37.0 (#104)
renovate-bot Mar 26, 2021
f24cd97
chore(deps): update dependency google-cloud-documentai to v0.4.0 (#103)
renovate-bot Mar 30, 2021
3789ad8
samples: updates Document AI samples to v1 version of service (#108)
telpirion Apr 12, 2021
73e39b1
samples: more updates for v1 (#121)
telpirion Apr 14, 2021
d558291
chore: template updates (#120)
yoshi-automation Apr 19, 2021
feea57d
chore(deps): update dependency google-cloud-storage to v1.37.1 (#114)
renovate-bot Apr 19, 2021
5a5416b
chore: migrate to owl bot (#130)
parthea May 10, 2021
4b1653f
chore(deps): update dependency pytest to v6.2.4 (#124)
renovate-bot May 16, 2021
cb957c5
chore: new owl bot post processor docker image (#152)
gcf-owl-bot[bot] May 22, 2021
a1505b7
fix: Parsing pages, but should be paragraphs (#147)
dgallegos May 25, 2021
c980ad6
chore(deps): update dependency google-cloud-documentai to v0.5.0 (#155)
renovate-bot Jun 2, 2021
d39b68a
chore(deps): update dependency google-cloud-storage to v1.38.0 (#133)
renovate-bot Jun 2, 2021
5cad87f
chore(deps): update dependency google-cloud-storage to v1.39.0 (#169)
renovate-bot Jun 22, 2021
8794e9a
chore(deps): update dependency google-cloud-storage to v1.40.0 (#173)
renovate-bot Jul 1, 2021
e0e241d
chore(deps): update dependency google-cloud-storage to v1.41.0 (#177)
renovate-bot Jul 15, 2021
f9c925b
feat: add Samples section to CONTRIBUTING.rst (#181)
gcf-owl-bot[bot] Jul 22, 2021
c139b20
chore(deps): update dependency google-cloud-storage to v1.41.1 (#182)
renovate-bot Jul 26, 2021
9cb90cd
chore(deps): update dependency google-cloud-documentai to v1 (#185)
renovate-bot Jul 27, 2021
50c14f3
samples: moves region tag to include import statement (#186)
telpirion Jul 28, 2021
62dd921
chore: fix INSTALL_LIBRARY_FROM_SOURCE in noxfile.py (#192)
gcf-owl-bot[bot] Aug 11, 2021
f9a40b2
chore(deps): update dependency google-cloud-storage to v1.42.0 (#194)
renovate-bot Aug 12, 2021
3ab1f10
chore: drop mention of Python 2.7 from templates (#197)
gcf-owl-bot[bot] Aug 13, 2021
39df47e
samples: moves import statement within region tags (#190)
telpirion Aug 13, 2021
0f97562
chore(deps): update dependency pytest to v6.2.5 (#204)
renovate-bot Aug 31, 2021
0dd510f
chore(deps): update dependency google-cloud-storage to v1.42.1 (#209)
renovate-bot Sep 9, 2021
bdc507c
chore: blacken samples noxfile template (#212)
gcf-owl-bot[bot] Sep 17, 2021
f3efedb
chore(deps): update dependency google-cloud-storage to v1.42.2 (#213)
renovate-bot Sep 20, 2021
d4de378
chore: fail samples nox session if python version is missing (#218)
gcf-owl-bot[bot] Sep 30, 2021
27d5d65
chore(deps): update dependency google-cloud-storage to v1.42.3 (#219)
renovate-bot Sep 30, 2021
0f69b95
chore(python): Add kokoro configs for python 3.10 samples testing (#225)
gcf-owl-bot[bot] Oct 8, 2021
8209cac
chore(deps): update dependency google-cloud-documentai to v1.1.0 (#227)
renovate-bot Oct 11, 2021
d5e0d84
chore(deps): update dependency google-cloud-documentai to v1.2.0 (#232)
renovate-bot Oct 25, 2021
102553b
docs(samples): add OCR, form, quality, splitter and specialized proce…
Nov 10, 2021
7921a42
chore(python): run blacken session for all directories with a noxfile…
gcf-owl-bot[bot] Dec 12, 2021
8cafbbd
chore(samples): Add check for tests in directory (#257)
gcf-owl-bot[bot] Jan 11, 2022
bcd2924
chore(deps): update dependency google-cloud-storage to v2 (#247)
renovate-bot Jan 16, 2022
74849f9
chore(python): Noxfile recognizes that tests can live in a folder (#262)
gcf-owl-bot[bot] Jan 19, 2022
45db6f7
chore(deps): update dependency google-cloud-documentai to v1.2.1 (#263)
renovate-bot Jan 19, 2022
a93a75f
test: strip quotes and newlines from output (#279)
busunkim96 Feb 26, 2022
186ef82
chore(deps): update dependency google-cloud-storage to v2.1.0 (#264)
renovate-bot Feb 28, 2022
183abce
chore: Adding support for pytest-xdist and pytest-parallel (#286)
gcf-owl-bot[bot] Mar 4, 2022
3a7cddc
chore(deps): update all dependencies (#281)
renovate-bot Mar 5, 2022
adb2e08
chore(deps): update dependency google-cloud-documentai to v1.3.0 (#290)
renovate-bot Mar 8, 2022
d813e39
chore(deps): update dependency pytest to v7.1.0 (#291)
renovate-bot Mar 13, 2022
0a82fd3
chore(deps): update dependency google-cloud-storage to v2.2.0 (#292)
renovate-bot Mar 14, 2022
13f21b9
chore(deps): update dependency google-cloud-storage to v2.2.1 (#293)
renovate-bot Mar 16, 2022
b860651
chore(deps): update dependency pytest to v7.1.1 (#296)
renovate-bot Mar 19, 2022
bcf9b68
chore(deps): update dependency google-cloud-documentai to v1.4.0 (#297)
renovate-bot Mar 23, 2022
df44706
chore(python): use black==22.3.0 (#301)
gcf-owl-bot[bot] Mar 28, 2022
c81aabb
chore(deps): update dependency google-cloud-storage to v2.3.0 (#310)
renovate-bot Apr 14, 2022
9429f13
chore(python): add nox session to sort python imports (#312)
gcf-owl-bot[bot] Apr 21, 2022
d6537fa
chore(deps): update dependency pytest to v7.1.2 (#316)
renovate-bot Apr 25, 2022
05600e3
chore: removed v1beta2 samples (#315)
galz10 Apr 26, 2022
972498a
chore(deps): update dependency google-cloud-documentai to v1.4.1 (#319)
renovate-bot Apr 28, 2022
f19d03a
fix: require python 3.7+ (#348)
gcf-owl-bot[bot] Jul 10, 2022
c948e1c
chore(deps): update all dependencies (#338)
renovate-bot Jul 14, 2022
ca3dfa4
refactor: Updates to Document AI Python Samples (#323)
holtskinner Jul 28, 2022
ec19a83
chore(deps): update all dependencies (#355)
renovate-bot Aug 5, 2022
410a444
chore(deps): update dependency google-cloud-documentai to v1.5.1 (#362)
renovate-bot Aug 17, 2022
50ad567
docs(samples): Added Human Review Request Sample (#357)
holtskinner Aug 17, 2022
fe1c925
chore(deps): update dependency google-cloud-documentai to v2 (#364)
renovate-bot Aug 19, 2022
91bfeab
chore(deps): update dependency pytest to v7.1.3 (#374)
renovate-bot Sep 6, 2022
9111110
docs(samples): Updated Samples for v2.0.0 Client Library (#365)
holtskinner Sep 13, 2022
aa44c16
chore(main): release 2.0.1 (#378)
release-please[bot] Sep 13, 2022
4bf23aa
chore: detect samples tests in nested directories (#379)
gcf-owl-bot[bot] Sep 13, 2022
b9adf36
chore(deps): update dependency google-cloud-documentai to v2.0.1 (#380)
renovate-bot Sep 14, 2022
478dcc1
docs(samples): Added Processor Version Samples (#382)
holtskinner Sep 26, 2022
c56dcfe
chore(deps): update dependency google-cloud-documentai to v2.0.2 (#386)
renovate-bot Oct 4, 2022
a67c330
chore(deps): update dependency google-cloud-documentai to v2.0.3 (#390)
renovate-bot Oct 18, 2022
1b9fc12
chore(deps): update dependency pytest to v7.2.0 (#392)
renovate-bot Oct 26, 2022
2139ffe
docs(samples): Added extra exception handling to operation samples (#…
holtskinner Nov 2, 2022
5f5d5f1
chore:Remove Sample Inputs/Outputs from Repo (#391)
holtskinner Nov 2, 2022
b82b31b
chore(deps): update dependency google-cloud-storage to v2.6.0 (#399)
renovate-bot Nov 8, 2022
39cdf84
chore(deps): update dependency google-cloud-documentai to v2.1.0 (#407)
renovate-bot Nov 9, 2022
ed2179f
docs(samples): Updated code samples for 2.1.0 release (#406)
holtskinner Nov 11, 2022
2efbcb0
chore(deps): update dependency google-cloud-documentai to v2.2.0 (#411)
renovate-bot Nov 14, 2022
95773fb
chore(deps): update dependency google-cloud-documentai to v2.3.0 (#414)
renovate-bot Nov 15, 2022
f9c6fa3
chore(python): drop flake8-import-order in samples noxfile (#421)
gcf-owl-bot[bot] Nov 27, 2022
1d89983
fix(samples): Fix Typos in Batch process & get processor Samples (#420)
holtskinner Nov 27, 2022
a789ef7
chore(deps): update dependency google-cloud-documentai to v2.4.0 (#423)
renovate-bot Dec 2, 2022
3e9f022
chore(deps): update dependency google-cloud-storage to v2.7.0 (#426)
holtskinner Dec 7, 2022
38a7849
chore(deps): update dependency google-cloud-documentai to v2.4.1 (#428)
renovate-bot Dec 12, 2022
47ed844
chore(deps): update dependency google-cloud-documentai to v2.5.0 (#432)
renovate-bot Dec 14, 2022
207c183
chore(deps): update dependency google-cloud-documentai to v2.6.0 (#435)
renovate-bot Dec 15, 2022
b568ab0
chore(deps): update dependency mock to v5 (#436)
renovate-bot Jan 4, 2023
6c09164
Merge remote-tracking branch 'migration/main' into documentai-migration
holtskinner Jan 4, 2023
d1d27e4
Changed import from samples.snippets to documentai.snippets
holtskinner Jan 5, 2023
6bd843f
Repository Migration logistics
holtskinner Jan 5, 2023
ffb1d0e
Removed old noxfile, replaced with noxfile_config
holtskinner Jan 5, 2023
6b5e99c
Adjusted enable/disable processor tests for new output from API
holtskinner Jan 5, 2023
5a81603
Fixed Typo in Comments
holtskinner Jan 5, 2023
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -43,6 +43,7 @@
 /dataproc/**/* @GoogleCloudPlatform/python-samples-reviewers
 /datastore/**/* @GoogleCloudPlatform/cloud-native-db-dpes @GoogleCloudPlatform/python-samples-reviewers
 /dns/**/* @GoogleCloudPlatform/python-samples-reviewers
+/documentai/**/* @GoogleCloudPlatform/dee-data-ai @GoogleCloudPlatform/python-samples-reviewers
 /endpoints/**/* @GoogleCloudPlatform/python-samples-reviewers
 /eventarc/**/* @GoogleCloudPlatform/aap-dpes @GoogleCloudPlatform/python-samples-reviewers
 /error_reporting/**/* @GoogleCloudPlatform/python-samples-reviewers
2 changes: 2 additions & 0 deletions .github/blunderbuss.yml
@@ -89,6 +89,7 @@ assign_issues_by:
     to:
       - GoogleCloudPlatform/api-iot
   - labels:
+      - 'api: documentai'
       - 'api: language'
       - 'api: texttospeech'
       - 'api: retail'
@@ -188,6 +189,7 @@ assign_prs_by:
     to:
       - GoogleCloudPlatform/infra-db-dpes
   - labels:
+      - 'api: documentai'
       - 'api: retail'
     to:
       - GoogleCloudPlatform/dee-data-ai
3 changes: 0 additions & 3 deletions document/README.rst

This file was deleted.

1 change: 1 addition & 0 deletions documentai/AUTHORING_GUIDE.md
@@ -0,0 +1 @@
See https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/AUTHORING_GUIDE.md
1 change: 1 addition & 0 deletions documentai/CONTRIBUTING.md
@@ -0,0 +1 @@
See https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/CONTRIBUTING.md
Empty file added documentai/__init__.py
Empty file.
Empty file.
@@ -0,0 +1,163 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# [START documentai_batch_process_documents_processor_version]
import re
from typing import Optional

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError
from google.cloud import documentai
from google.cloud import storage

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION'  # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID'  # Example: aeb8cea219b7c272
# processor_version_id = "YOUR_PROCESSOR_VERSION_ID"  # Example: pretrained-ocr-v1.0-2020-09-23
# gcs_input_uri = "YOUR_INPUT_URI"  # Format: gs://bucket/directory/file.pdf
# input_mime_type = "application/pdf"
# gcs_output_bucket = "YOUR_OUTPUT_BUCKET_NAME"  # Format: gs://bucket
# gcs_output_uri_prefix = "YOUR_OUTPUT_URI_PREFIX"  # Format: directory/subdirectory/
# field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_documents_processor_version(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
    gcs_input_uri: str,
    input_mime_type: str,
    gcs_output_bucket: str,
    gcs_output_uri_prefix: str,
    field_mask: Optional[str] = None,
    timeout: int = 400,
) -> None:

    # You must set the api_endpoint if you use a location other than 'us'.
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    gcs_document = documentai.GcsDocument(
        gcs_uri=gcs_input_uri, mime_type=input_mime_type
    )

    # Load GCS Input URI into a List of document files
    gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
    input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)

    # NOTE: Alternatively, specify a GCS URI Prefix to process an entire directory
    #
    # gcs_input_uri = "gs://bucket/directory/"
    # gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
    # input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)
    #

    # Cloud Storage URI for the Output Directory
    # This must end with a trailing forward slash `/`
    destination_uri = f"{gcs_output_bucket}/{gcs_output_uri_prefix}"

    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=destination_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    # The full resource name of the processor version
    # e.g. projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version_id
    )

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Poll the operation until it completes or the timeout expires.
    # This could take some time for larger files.
    # Operation name format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
    try:
        print(f"Waiting for operation {operation.operation.name} to complete...")
        operation.result(timeout=timeout)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        print(e.message)

    # NOTE: Can also use callbacks for asynchronous processing
    #
    # def my_callback(future):
    #     result = future.result()
    #
    # operation.add_done_callback(my_callback)

    # Once the operation is complete,
    # get output document information from operation metadata
    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        raise ValueError(f"Batch Process Failed: {metadata.state_message}")

    storage_client = storage.Client()

    print("Output files:")
    # One process per Input Document
    for process in metadata.individual_process_statuses:
        # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
        # The Cloud Storage API requires the bucket name and URI prefix separately
        matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not matches:
            print(
                "Could not parse output GCS destination:",
                process.output_gcs_destination,
            )
            continue

        output_bucket, output_prefix = matches.groups()

        # Get List of Document Objects from the Output Bucket
        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

        # Document AI may output multiple JSON files per source file
        for blob in output_blobs:
            # Document AI should only output JSON files to GCS,
            # so skip any object whose name does not end in .json
            if not blob.name.endswith(".json"):
                print(
                    f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}"
                )
                continue

            # Download JSON File as bytes object and convert to Document Object
            print(f"Fetching {blob.name}")
            document = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )

            # For a full list of Document object attributes, please reference this page:
            # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document

            # Read the text recognition output from the processor
            print("The document contains the following text:")
            print(document.text)


# [END documentai_batch_process_documents_processor_version]
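
The sample above splits each `output_gcs_destination` into a bucket name and an object prefix before listing blobs, because the Cloud Storage client takes the two separately. A minimal standalone sketch of that step follows. `split_gcs_uri` is a hypothetical helper name used only for illustration (it is not part of the sample or of any Google client library), and the example URIs are made up.

```python
import re
from typing import Optional, Tuple


def split_gcs_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Split 'gs://bucket/prefix' into (bucket, prefix).

    Returns None when the URI does not look like gs://BUCKET/PREFIX.
    Hypothetical helper; it reuses the regular expression from the sample.
    """
    # Non-greedy bucket group, then everything after the first slash as the prefix.
    match = re.match(r"gs://(.*?)/(.*)", uri)
    if not match:
        return None
    bucket, prefix = match.groups()
    return bucket, prefix


if __name__ == "__main__":
    # Made-up URIs, for illustration only.
    print(split_gcs_uri("gs://my-bucket/out/123/0/"))
    print(split_gcs_uri("not-a-gcs-uri"))
```

Returning `None` instead of raising mirrors the sample's choice to report an unparseable destination and continue with the next one.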
@@ -0,0 +1,49 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
from uuid import uuid4

from documentai.snippets import \
    batch_process_documents_processor_version_sample

location = "us"
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
processor_id = "90484cfdedb024f6"
processor_version_id = "pretrained-form-parser-v1.0-2020-09-23"
gcs_input_uri = "gs://cloud-samples-data/documentai/invoice.pdf"
input_mime_type = "application/pdf"
gcs_output_bucket = "gs://document-ai-python"
gcs_output_uri_prefix = f"{uuid4()}/"
field_mask = "text,pages.pageNumber"


def test_batch_process_documents(capsys):
    batch_process_documents_processor_version_sample.batch_process_documents_processor_version(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        processor_version_id=processor_version_id,
        gcs_input_uri=gcs_input_uri,
        input_mime_type=input_mime_type,
        gcs_output_bucket=gcs_output_bucket,
        gcs_output_uri_prefix=gcs_output_uri_prefix,
        field_mask=field_mask,
    )
    out, _ = capsys.readouterr()

    assert "operation" in out
    assert "Fetching" in out
    assert "text:" in out
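
The test writes each run's output under a fresh `uuid4()` prefix, presumably so that runs sharing a bucket do not read each other's results, and the sample joins the bucket URI and the prefix into a destination that must end with `/`. A small sketch of those two steps, under stated assumptions: `make_output_prefix` and `build_destination_uri` are hypothetical names introduced here for illustration and do not appear in the PR.

```python
from uuid import uuid4


def make_output_prefix() -> str:
    """Return a unique, slash-terminated prefix of the form '<uuid4>/'."""
    return f"{uuid4()}/"


def build_destination_uri(bucket_uri: str, prefix: str) -> str:
    """Join a 'gs://bucket' URI and a prefix the same way the sample's f-string does."""
    return f"{bucket_uri}/{prefix}"


if __name__ == "__main__":
    prefix = make_output_prefix()
    # "gs://example-bucket" is a placeholder, not a real bucket from the PR.
    print(build_destination_uri("gs://example-bucket", prefix))
```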