moving single object with GCSToGCSOperator differs from gsutil mv command #37576

Closed
1 of 2 tasks
nyoungstudios opened this issue Feb 20, 2024 · 1 comment · Fixed by #40162
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@nyoungstudios
Contributor

nyoungstudios commented Feb 20, 2024

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.10.0

Apache Airflow version

2.6.3

Operating System

Debian 11

Deployment

Google Cloud Composer

Deployment details

Reproducible locally in a Docker image built from our Dockerfile, with the Python environment installed via conda, running in our VS Code dev container (Local executor, Postgres database). The same issue occurs in our Google Cloud Composer deployment (Kubernetes, Postgres, Celery executor). I can provide the full pip install list and the Dockerfile if needed.

What happened

The result of GCSToGCSOperator differs depending on which source files exist in the source bucket. It also differs from the result of the equivalent gsutil mv command. I believe this is because GCSToGCSOperator treats moving a single object differently from moving multiple objects.

What you think should happen instead

The GCSToGCSOperator should match what the gsutil mv command does.

How to reproduce


Overview

Airflow operator usage

Here is our example usage of this operator.

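from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# Move everything under folder/nested_folder/ to the same prefix in another bucket.
GCSToGCSOperator(
    task_id="move-files",
    source_bucket="bucket-name",
    source_object="folder/nested_folder/",
    destination_bucket="bucket-name-2",
    destination_object="folder/nested_folder/",
    move_object=True,  # delete the source objects after copying
)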

gsutil mv usage

Here is our example usage of the gsutil mv command.

gsutil -m mv gs://bucket-name/folder/nested_folder gs://bucket-name-2/folder/nested_folder

Test 1: Expected result

Given that these files exist before running the task.

> gsutil -m ls "gs://bucket-name/folder/nested_folder/**"
gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/12345.txt
gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/67890.txt

The Airflow GCSToGCSOperator task will move

  • gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/12345.txt to gs://bucket-name-2/folder/nested_folder/aaaa/bbbb/cccc/12345.txt
  • gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/67890.txt to gs://bucket-name-2/folder/nested_folder/aaaa/bbbb/cccc/67890.txt

This matches what the equivalent gsutil command would do.

Test 2: Unexpected result

Given that these files exist before running the task.

> gsutil -m ls "gs://bucket-name/folder/nested_folder/**"
gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/12345.txt

The Airflow GCSToGCSOperator task will move

  • gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/12345.txt to gs://bucket-name-2/folder/nested_folder/12345.txt, which doesn't retain the nested folder structure like the first test.

This does not match what the equivalent gsutil command would do. The gsutil mv command would correctly move

  • gs://bucket-name/folder/nested_folder/aaaa/bbbb/cccc/12345.txt to gs://bucket-name-2/folder/nested_folder/aaaa/bbbb/cccc/12345.txt.
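
To make the difference between the two tests concrete, here is a minimal pure-Python sketch of the two destination mappings described above. It is only a model of the observed results, not the operator's or gsutil's actual code; the function names `prefix_preserving_destination` and `flattened_destination` are hypothetical and introduced just for illustration.

```python
from posixpath import basename


def prefix_preserving_destination(
    source_prefix: str, destination_prefix: str, object_name: str
) -> str:
    """Swap the source prefix for the destination prefix, keeping nested folders.

    Models the result seen in Test 1 and the result reported for gsutil mv.
    """
    if not object_name.startswith(source_prefix):
        raise ValueError(f"{object_name!r} is not under {source_prefix!r}")
    return destination_prefix + object_name[len(source_prefix):]


def flattened_destination(destination_prefix: str, object_name: str) -> str:
    """Keep only the file name under the destination prefix.

    Models the unexpected result reported in Test 2 (assumption: this mirrors
    the observed output, not necessarily how the operator computes it).
    """
    return destination_prefix + basename(object_name)


source_prefix = "folder/nested_folder/"
destination_prefix = "folder/nested_folder/"
only_object = "folder/nested_folder/aaaa/bbbb/cccc/12345.txt"

# Expected (gsutil-like) destination: nested folders are retained.
print(prefix_preserving_destination(source_prefix, destination_prefix, only_object))

# Destination observed in Test 2: nested folders are dropped.
print(flattened_destination(destination_prefix, only_object))
```

With a single matching object, the two mappings diverge as soon as the object sits in a subfolder below the source prefix, which is the situation in Test 2.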

Anything else

Here is the gcloud version output from my tests above.

> gcloud version
Google Cloud SDK 453.0.0
alpha 2023.10.27
beta 2023.10.27
bq 2.0.98
bundled-python3-unix 3.9.17
core 2023.10.27
gcloud-crc32c 1.0.0
gke-gcloud-auth-plugin 0.5.6
gsutil 5.27

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nyoungstudios nyoungstudios added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Feb 20, 2024
@Lee-W Lee-W added the provider:google Google (including GCP) related issues label Apr 30, 2024
@eladkal eladkal added good first issue and removed needs-triage label for new issues that we didn't triage yet labels May 26, 2024
@boraberke
Contributor

Would like to work on this issue, @eladkal could you please assign?
