When you create an Apache Iceberg table on S3, the table consists of both data files and metadata files. If you physically copy the files that make up an Iceberg table to another S3 bucket, the metadata files need to be updated.
The metadata files (metadata.json files and AVRO files) have fields that reference the S3 paths of the AVRO and data files. When you copy the files that make up an Iceberg table to another S3 bucket, those S3 path references still point to the old S3 bucket the files were copied from.
For example, I have an Iceberg table in S3 bucket A. I copy the data files and metadata files from bucket A to bucket B. The metadata.json files and AVRO files still contain references to S3 bucket A. We need to update these to bucket B, since the copy of the Iceberg table is now stored in S3 bucket B.
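To make the bucket A to bucket B example concrete, here is a minimal sketch of the kind of rewrite a metadata.json file needs. It is not the Glue script from this repository: the helper name `replace_s3_references`, the bucket names, and the trimmed sample metadata are assumptions for illustration. It only covers the JSON metadata. The AVRO files are binary and need an Avro library to rewrite.

```python
import json


def replace_s3_references(node, old, new):
    """Recursively replace `old` with `new` in every string value of parsed JSON."""
    if isinstance(node, dict):
        return {key: replace_s3_references(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [replace_s3_references(item, old, new) for item in node]
    if isinstance(node, str):
        return node.replace(old, new)
    return node  # numbers, booleans, and None are left unchanged


# Hypothetical, heavily trimmed metadata.json content for a table in bucket-a
table_root = "s3://bucket-a/iceberg/iceberg.db/sampledataicebergtable"
metadata_text = json.dumps({
    "format-version": 2,
    "location": table_root,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": table_root + "/metadata/snap-1.avro"},
    ],
})

# Point every S3 reference at bucket-b instead of bucket-a
updated = replace_s3_references(json.loads(metadata_text), "bucket-a", "bucket-b")
print(json.dumps(updated, indent=2))
```

Replacing on the parsed structure instead of the raw text keeps the output valid JSON and leaves non-string fields untouched. If the old bucket name could appear as a substring of unrelated values, passing a longer string such as `s3://bucket-a/` is the safer choice.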
After we update the S3 references, we can optionally register the updated metadata.json as a new Glue Data Catalog entry. An example of using the register_table command with AWS Glue is available in the Iceberg_Glue_register_table repository.
Launch the CloudFormation stack below to deploy a Glue Python shell script that can be used to update the metadata.json and AVRO files.
After you deploy the CloudFormation stack, you need to update a section of the Python script. Navigate to the Glue console, click ETL jobs, select the Update Iceberg Metadata job, open the Actions drop-down, and choose Edit job.
In the Glue Python script, you need to configure four variables.
# Adjust the values of these variables before running the script
s3_bucket_name_w_metadata_to_update = '<s3 bucket name that has the Iceberg metadata that you want to update>' # ex. register-iceberg-2ut1suuihxyq
folder_path_to_metadata = '<path to the Iceberg metadata folder in the ^ bucket>' # ex. iceberg/iceberg.db/sampledataicebergtable/metadata/
old_s3_bucket_name_or_path = '<name of S3 bucket or the S3 file path that you want to replace in the Iceberg metadata>' # ex. glue-iceberg-from-jars-s3bucket-2ut1suuihxyq
new_s3_bucket_name_or_path = '<when you find an instance of ^ what you want to replace it with IE. the name of the S3 bucket or file path the metadata was moved to>' # ex. register-iceberg-2ut1suuihxyq
After updating these variables, click Save and then Run.
If you are running this script to update the S3 references in the metadata.json and AVRO files with the intent of using the register_table command, note that the Python script outputs the path of the latest metadata.json file for the Iceberg table. This path can be passed directly to the register_table command.
To find this output, open the CloudWatch output logs for the Glue job run.
At the end of the log stream you will see a log message that provides the file path you can use with register_table.
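As a rough illustration of how that path feeds into register_table, the sketch below picks the highest-numbered metadata.json from a list of object keys and builds the Spark SQL `CALL` statement for Iceberg's `register_table` procedure. The helper names, the catalog name `glue_catalog`, the table name, and the sample keys are all hypothetical. It also assumes metadata files are named with a leading version number (for example `00002-<uuid>.metadata.json`). In practice, use the path printed in the job log.

```python
import re

# Matches names such as "00002-<uuid>.metadata.json" or "v3.metadata.json"
# (assumed naming convention; check the files in your own metadata folder)
METADATA_FILE = re.compile(r"(?:^|/)v?(\d+)(?:-[^/]*)?\.metadata\.json$")


def latest_metadata_key(keys):
    """Return the key of the highest-numbered metadata.json file, or None."""
    best = None
    for key in keys:
        match = METADATA_FILE.search(key)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), key)
    return best[1] if best else None


def register_table_sql(catalog, table, metadata_file):
    """Build the Spark SQL statement that calls Iceberg's register_table procedure."""
    return (
        f"CALL {catalog}.system.register_table("
        f"table => '{table}', metadata_file => '{metadata_file}')"
    )


# Hypothetical object keys under the table's metadata folder
prefix = "iceberg/iceberg.db/sampledataicebergtable/metadata/"
keys = [
    prefix + "00000-1111.metadata.json",
    prefix + "00002-3333.metadata.json",
    prefix + "00001-2222.metadata.json",
    prefix + "snap-1.avro",
]

latest = latest_metadata_key(keys)
statement = register_table_sql(
    "glue_catalog",                    # hypothetical Spark catalog name
    "iceberg.sampledataicebergtable",  # hypothetical database.table name
    f"s3://register-iceberg-2ut1suuihxyq/{latest}",
)
print(statement)
```

The resulting statement could then be run from a Spark session that has an Iceberg catalog configured, for example with `spark.sql(statement)`.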