Skip to content

Fix instant file name path retrieval in Hudi Active Timeline #18213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 20, 2023

Conversation

atezs82
Copy link
Contributor

@atezs82 atezs82 commented Jul 10, 2023

Description

When submitting even a very simple query for certain types of Hudi tables (stored on S3) we were constantly receiving the following error:

SQL Error [65536]: Query failed (#20230710_093123_00000_bqjz2): No scheme for file system location: path/to/hudi/table/within/bucket/.hoodie

Stacktrace taken from the Trino coordinator:

...
Caused by: java.lang.IllegalArgumentException: No scheme for file system location: path/to/hudi/table/within/bucket/.hoodie
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:218)
	at io.trino.filesystem.Location.of(Location.java:74)
	at io.trino.plugin.hudi.timeline.HudiActiveTimeline.getInstantFileNamePath(HudiActiveTimeline.java:88)
	at io.trino.plugin.hudi.timeline.HudiActiveTimeline.getInstantDetails(HudiActiveTimeline.java:71)
...

The issue could be reproduced for this table always, on the other hand, other tables were working just fine. We were using a simple SQL query select * from test_catalog.test_schema.table limit 1;, but more complex queries were also failing due to this error.

This PR proposes a solution for the problem by slightly modifying the way Trino handles the Hudi timeline internally. Albeit the solution is simple, I have also added some unit test coverage about the fix. Please let me know your thoughts about it (I needed to include a Java mock library into Maven, this might not be desirable from overall project perspective).

Additional context and related issues

At some point during its operation, the Hudi connector tries to update ("reset") its internal state of the Hudi filesystem if "replaced instants" (files) are present on the Hudi timeline. This operation involves accessing the Hudi table's metadata folder on the object store. When this path segment is created, the corresponding io.trino.filesystem.Location class is created in a manner that it did not contain the scheme, just the path (see the code change below). This is forbidden in the code of that class (since it has to adhere to the format scheme://[userInfo@]host[:port][/path]), therefore the mentioned exception was thrown (refer to the Javadoc of the Location class for more information).

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

  • Fixed reading Hudi tables with replaced instants.

@cla-bot
Copy link

cla-bot bot commented Jul 10, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Attila Tóth.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 32a55c9 to 88124ab Compare July 10, 2023 13:21
@cla-bot
Copy link

cla-bot bot commented Jul 10, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Attila Tóth.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 88124ab to 60d2776 Compare July 10, 2023 13:24
@cla-bot
Copy link

cla-bot bot commented Jul 10, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the hudi Hudi connector label Jul 10, 2023
@@ -374,6 +374,13 @@
<scope>test</scope>
</dependency>

<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We basically don't use mockito in tests. Please fix BaseHudiConnectorTest or TestHudiSparkCompatibility (or both).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll fix that and append the existing suites.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in trino-hudi is ready. I have created a simple copy of the already existing test table stock_ticks_cow in which the Hudi timeline has replacecommit actions, that is present in the patchset. I have confirmed that without the code change the test is also failing exactly with the above mentioned exception. Please let me know what do you think about this solution/direction.

Still need some time to familiarize myself with the product tests and add a test case there as well for this fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended TestHudiSparkCompatibility in the latest patchset, verifying that it also fails with the No scheme for file system location: exception above. Modified the setup method as well to have a default value for S3_BUCKET if that is not specified externally. Please let me know what do you think about this solution in overall.

Signing of the CLA is underway, I will send it in the upcoming days.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr Are you OK with the proposed changes? Thanks for taking a look again at this!

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 60d2776 to 2046013 Compare July 11, 2023 08:28
@cla-bot
Copy link

cla-bot bot commented Jul 11, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 2046013 to 410a077 Compare July 12, 2023 13:25
@cla-bot
Copy link

cla-bot bot commented Jul 12, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 410a077 to 4f5a8b4 Compare July 12, 2023 20:48
@cla-bot
Copy link

cla-bot bot commented Jul 12, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@atezs82 atezs82 removed their assignment Jul 17, 2023
@atezs82
Copy link
Contributor Author

atezs82 commented Jul 24, 2023

Hey @ebyhr! Thanks for looking at this change! I'm ready to work on this more if needed - what do you think about the latest state of the PR? I have already sent the signed CLA on the 13th of July already.

createNonPartitionedTableWithReplaceCommit(tableName, COW_TABLE_TYPE);

try {
onTrino().executeQuery("SELECT id, name FROM hudi.default." + tableName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the SELECT result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - what I want to see here though is that the query passes. If it does, then replace commits were parsed in the timeline successfully. Please let me know if you thought about some additional verification here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the rows using containsOnly instead of hasAnyRows.

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from 4f5a8b4 to f5c27f8 Compare August 4, 2023 11:11
@cla-bot cla-bot bot added the cla-signed label Aug 4, 2023
@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch 3 times, most recently from 8f53bae to b99ac1a Compare August 4, 2023 13:05
@atezs82
Copy link
Contributor Author

atezs82 commented Aug 4, 2023

@ebyhr Thanks for all the comments and suggestions! I tried to answer them accordingly, please let me know what do you think. I have also verified that tests were failing if the code change in HudiActiveTimeline was not in place.

@ebyhr ebyhr removed their assignment Aug 4, 2023
return Location.of(fileName.contains(SCHEMA_COMMIT_ACTION) ? metaClient.getSchemaFolderName() : metaClient.getMetaPath().path()).appendPath(fileName);
Location metaPath = metaClient.getMetaPath();
if (fileName.contains(SCHEMA_COMMIT_ACTION)) {
return metaPath.appendPath(HudiTableMetaClient.SCHEMA_FOLDER_NAME);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is .appendPath(fileName) missing? Do existing tests cover this condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the appendPath(fileName) to the schema commit path of the condition, it was clearly missing from there.

In order to test this, we need a second table with at least one schemacommit action in the Hudi timeline. Will work on preparing that as well in this patchset.

Copy link
Contributor Author

@atezs82 atezs82 Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note my comment below regarding this.

@@ -0,0 +1,74 @@
# Context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this README file! Is it possible to create the table with SQL? We usually use SQL when creating the resource likes plugin/trino-delta-lake/src/test/resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean Spark SQL instead of this minimal Spark pipeline?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor Author

@atezs82 atezs82 Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the README.md file accordingly. When the schemacommit test is ready, I will also add a README.md there.

Copy link
Contributor Author

@atezs82 atezs82 Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note my comment below regarding this.

createNonPartitionedTableWithReplaceCommit(tableName, COW_TABLE_TYPE);

try {
onTrino().executeQuery("SELECT id, name FROM hudi.default." + tableName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the rows using containsOnly instead of hasAnyRows.

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch 2 times, most recently from 5c41bdc to eaac554 Compare August 11, 2023 13:18
@atezs82
Copy link
Contributor Author

atezs82 commented Aug 11, 2023

Regarding the comment

Please verify the rows using containsOnly instead of hasAnyRows.

The code has been changed accordingly, but the test table contains 4 rows instead of 2. Will analyze the root cause of that as well. For now the test has this but I would like to understand the root cause of this.

@atezs82
Copy link
Contributor Author

atezs82 commented Aug 16, 2023

I merged the two test tables to one (now that one has replacecommits and schemacommits as well). Please let me know what do you think about this current approach (eg. is the number of files and their sizes acceptable?). Moreover, shall we keep the schemacommit content even though it does not seem to force the logic to enter the branch in question (see @ebyhr's question about this above)?

During the creation of the testcases, I also noticed a different fault when handling replacecommits - test records were duplicated in a seemingly incorrect fashion. (This is the same issue.) I investigated this a bit further and found out that there is a problem with JSON deserialization in HudiReplaceCommitMetadata. Since the tests looked very strange due to this (for two test records I needed to assert 4), I also took the liberty to propose a fix for this in this PR. What do you think about this?

If you agree with having the schemacommit tests and this fix I will write a detailed README.md about how did I create the merged test table. (It is relatively simple, but might not be completely possible in Spark SQL though, since schema evolution in Hudi is still experimental and some parameters need to be set on the Spark Context beforehand.)

Otherwise, I can also work on restoring the test suite to contain just the replacecommit test - that was proven to reproduce the exception we have experienced. (Probably that might be a simpler approach.) I can also further dig into why does that branch does not get called, but I will need more time to understand the code regarding this (it operates with direct method calls and method references as well internally in HudiActiveTimeline/HudiDefaultTimeline). Please advise how shall I proceed with this.

Thanks in advance for looking at this once again!

@atezs82
Copy link
Contributor Author

atezs82 commented Aug 17, 2023

One more addition: it bugged me so I quickly looked at the option of altering the needed configuration for the schemacommit tests from product tests. Turns out that eg. SET spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension results in the exception org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.extensions, so this might not be that easy to do from there (also it would be not desirable to modify a commonly used test SparkContext from a single test like this).

@atezs82
Copy link
Contributor Author

atezs82 commented Aug 17, 2023

After spending some time on analyzing the code further about the schema commit support in HudiActiveTimeline I suspect that this feature is not yet in use. The only real place from where HudiActiveTimeline::getInstantDetails is called from is HudiTableFileSystemView::resetFileGroupsReplaced, but even there, right before the method is called, replace commits are filtered out right before when HudiDefaultTimeline::getCompletedReplaceTimeline is called - so getInstantDetails is called only with this commit action type.

In theory future users of the Hudi timeline interface might benefit from this feature (get proper filenames for schema commits as well, which reside under the .schema subfolder within the Hudi metadata folder), but for now it seems it is not used for any feature the connector exposes. I have also checked and seen that this was never called from any unit tests of trino-hudi.

Given this, I would suggest to remove the schema tests altogether, and fall back to just the replace commit tests, which clearly reproduced the original issue. That would make the test system simpler, we can just use the product tests if needed to test this (so that a large number of files do not need to be added to trino-hudi), and if we want to test in trino-hudi we do not need complicated Spark SQL, we can use the same method which is currently in use in the product tests.

@electrum @ebyhr What is your opinion about this?

// for ser/deser
public HudiReplaceCommitMetadata()
@JsonCreator(mode = JsonCreator.Mode.PROPERTIES)
public HudiReplaceCommitMetadata(@JsonProperty("partitionToReplaceFileIds") final Map<String, List<String>> partitionToReplaceFileIds,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the 1st argument to next line and remove final.

    public HudiReplaceCommitMetadata(
            @JsonProperty("partitionToReplaceFileIds") Map<String, List<String>> partitionToReplaceFileIds,
            @JsonProperty("compacted") Boolean compacted)
    {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a commit body to explain why this change is needed to "Fix replace commit handling in Hudi Active Timeline"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message body now has a brief reasoning about this change.

Comment on lines 38 to 43
this.partitionToReplaceFileIds = partitionToReplaceFileIds;
this.compacted = compacted;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these fields nulalble?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not. If eg. partitionToReplaceFileIds become null this code will throw an NPE (from HudiActiveTimeline::reetFileGroupsReplaced):

                        // get replace instant mapping for each partition, fileId
                        return replaceMetadata.getPartitionToReplaceFileIds().entrySet().stream()
                                .flatMap(entry -> entry.getValue().stream().map(fileId ->
                                        Map.entry(new HudiFileGroupId(entry.getKey(), fileId), instant)));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, it would be better to add ImmutableMap.copyOf & requireNonNull and change Boolean to boolean

        this.partitionToReplaceFileIds = ImmutableMap.copyOf(requireNonNull(partitionToReplaceFileIds, "partitionToReplaceFileIds is null"));

@@ -155,6 +155,7 @@ private static void copyDir(Path srcDir, Path dstDir)
public enum TestingTable
{
HUDI_NON_PART_COW(nonPartitionRegularColumns()),
HUDI_NON_PART_COW_CL_SC(schemaChangedNonPartitionedRegularColumns()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the meaning of CL and SC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CL means "clustered", while SC means "schema change" in this context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think code readers can understand the meaning. Please avoid abbreviations.
https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#avoid-abbreviations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will remove the schema change tests altogether as requested anyways, and will change this as well in the next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the abbreviation, the table is called HUDI_NON_PART_COW_CLUSTERING (where COW is a well-known Hudi concept).

@@ -0,0 +1 @@
{"schemas":[{"max_column_id":13,"version_id":20230815074720567,"type":"record","fields":[{"id":0,"name":"_hoodie_commit_time","optional":true,"type":"string","doc":""},{"id":1,"name":"_hoodie_commit_seqno","optional":true,"type":"string","doc":""},{"id":2,"name":"_hoodie_record_key","optional":true,"type":"string","doc":""},{"id":3,"name":"_hoodie_partition_path","optional":true,"type":"string","doc":""},{"id":4,"name":"_hoodie_file_name","optional":true,"type":"string","doc":""},{"id":5,"name":"rowId","optional":true,"type":"string"},{"id":6,"name":"partitionId","optional":true,"type":"string"},{"id":7,"name":"preComb","optional":true,"type":"long"},{"id":8,"name":"name","optional":true,"type":"string"},{"id":9,"name":"versionId","optional":true,"type":"string"},{"id":10,"name":"toBeDeletedStr","optional":true,"type":"string"},{"id":11,"name":"intToLong","optional":true,"type":"int"},{"id":12,"name":"longToInt","optional":true,"type":"long"},{"id":13,"name":"description","optional":true,"type":"string"}]},{"max_column_id":12,"version_id":0,"type":"record","fields":[{"id":0,"name":"_hoodie_commit_time","optional":true,"type":"string","doc":""},{"id":1,"name":"_hoodie_commit_seqno","optional":true,"type":"string","doc":""},{"id":2,"name":"_hoodie_record_key","optional":true,"type":"string","doc":""},{"id":3,"name":"_hoodie_partition_path","optional":true,"type":"string","doc":""},{"id":4,"name":"_hoodie_file_name","optional":true,"type":"string","doc":""},{"id":5,"name":"rowId","optional":true,"type":"string"},{"id":6,"name":"partitionId","optional":true,"type":"string"},{"id":7,"name":"preComb","optional":true,"type":"long"},{"id":8,"name":"name","optional":true,"type":"string"},{"id":9,"name":"versionId","optional":true,"type":"string"},{"id":10,"name":"toBeDeletedStr","optional":true,"type":"string"},{"id":11,"name":"intToLong","optional":true,"type":"int"},{"id":12,"name":"longToInt","optional":true,"type":"long"}]}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#18213 (comment)

Let's remove these file and use product test instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the tests for the schema commits.


try {
assertThat(onTrino().executeQuery("SELECT id, name FROM hudi.default." + tableName + ";"))
.containsOnly(ImmutableList.of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ImmutableList.of is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch 3 times, most recently from aa1edd5 to e01c2c7 Compare September 4, 2023 07:12
@codope
Copy link
Contributor

codope commented Sep 13, 2023

@ebyhr @mosabua This PR is critical and affects Hudi table with clustering. I see most comments are addressed. Any chance of getting this merged sooner and released in the next version?

@atezs82
Copy link
Contributor Author

atezs82 commented Sep 13, 2023

Please let me know if any other things need to be done here. We are using a patched Trino image, but we would also would like to upgrade to an official Trino release having this fix soon. Thanks in advance!

Copy link
Contributor

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change look good. Thanks @atezs82 for fixing the issue.
Please also include me as a reviewer for any change in Hudi connector.

@martint martint requested a review from electrum September 14, 2023 20:19
Comment on lines 38 to 43
this.partitionToReplaceFileIds = partitionToReplaceFileIds;
this.compacted = compacted;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, it would be better to add ImmutableMap.copyOf & requireNonNull and change Boolean to boolean

        this.partitionToReplaceFileIds = ImmutableMap.copyOf(requireNonNull(partitionToReplaceFileIds, "partitionToReplaceFileIds is null"));

Comment on lines 34 to 54
CREATE TABLE hudi_non_part_cow
USING hudi
OPTIONS (
hoodie.table.name = 'hudi_non_part_cow',
hoodie.datasource.write.recordkey.field = 'rowId',
hoodie.datasource.write.precombine.field = 'preComb',
hoodie.table.type = 'COPY_ON_WRITE'
)
LOCATION 'file:///<<< EXISTING TABLE LOCATION >>>'
CREATE TABLE hudi_non_part_cow_clustered
USING hudi
OPTIONS (
hoodie.table.name = 'hudi_non_part_cow_clustered',
hoodie.datasource.write.recordkey.field = 'rowId',
hoodie.datasource.write.precombine.field = 'preComb',
hoodie.table.type = 'COPY_ON_WRITE',
hoodie.clustering.inline = 'true',
hoodie.clustering.inline.max.commits = '1'
)
LOCATION 'file:///<<< NEW TABLE LOCATION >>>'
AS SELECT * FROM hudi_non_part_cow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, why do we need this resource files instead of product tests? I think we can simply execute this CTAS statement in product tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed so that this special behavior (replace commit handling) is also tested in the Hudi connector, but I also agree on that this is still a bit more complex than it should be. I removed this test altogether from the Hudi connector and let the product tests test this behavior as well in the last changeset.

@ebyhr
Copy link
Member

ebyhr commented Sep 15, 2023

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch from e01c2c7 to 64cc81f Compare September 16, 2023 18:23
@atezs82
Copy link
Contributor Author

atezs82 commented Sep 16, 2023

Thanks for the review and the suggestions!

I have shortened/slightly reformatted the commit message to contain the changes requested in #18213 (comment), but still have a shorter subject line (with a longer body explaining the exact changes done). Please let me know if that is acceptable.

Regarding #18213 (comment), I have implemented the requested changes, on the other hand, changing the compacted field to boolean from Boolean also had some implications on the overriden equals and hashCode methods within the HudiReplaceCommitMetadata class. I have modified those methods to contain the partitionToReplaceFileIds field, since it was missing from both methods. Please let me know if this change is acceptable here or shall we place it in an other PR.

@atezs82 atezs82 force-pushed the fix-hudi-instant-file-name-fetch branch 2 times, most recently from 7082301 to ba70065 Compare September 19, 2023 09:17
@ebyhr ebyhr force-pushed the fix-hudi-instant-file-name-fetch branch from ba70065 to 7eee37b Compare September 19, 2023 11:59
When reading Hudi tables with replace commits, the location of the
Hudi commit file was assembled incorrectly in the Hudi Active
Timeline logic.

The Hudi replace commit metadata model object was also patched to
deserialize Hudi metadata properly. This was needed so that Trino
users (and the tests for the active timeline) do not see duplicated
records for replace commits.
@ebyhr ebyhr force-pushed the fix-hudi-instant-file-name-fetch branch from 7eee37b to 802b7c5 Compare September 20, 2023 08:56
@ebyhr ebyhr merged commit ea4277d into trinodb:master Sep 20, 2023
@github-actions github-actions bot added this to the 427 milestone Sep 20, 2023
@mosabua
Copy link
Member

mosabua commented Sep 20, 2023

Thank you @ebyhr for all your work on this.

Also fyi @codope

@codope
Copy link
Contributor

codope commented Sep 21, 2023

Thanks a lot @mosabua @ebyhr @atezs82 for helping fix this critical issue!

@atezs82
Copy link
Contributor Author

atezs82 commented Sep 21, 2023

I also would like to thank @ebyhr for all the work he put into reviewing this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hudi Hudi connector
Development

Successfully merging this pull request may close these issues.

5 participants