Skip to content

Conversation

@KiteSoar
Copy link

@KiteSoar KiteSoar commented Dec 21, 2025

Describe the issue this Pull Request addresses

This PR refactors the codebase to use HoodieStorageFactory instead of direct HoodieHadoopStorage instantiation, improving code maintainability and extensibility.
closes #17549

Summary and Changelog

Summary:
Users and developers will benefit from a more maintainable and extensible storage layer architecture. The factory pattern allows for easier switching between different storage implementations in the future.

Changelog:

  • Replaced ~20 direct new HoodieHadoopStorage() instantiations with HoodieStorageFactory.getStorage() calls

Impact

Public API Changes: None - This is an internal refactoring only.

User-Facing Changes: None - No behavioral changes for end users.

Performance Impact: Negligible - The factory pattern adds minimal overhead (one additional method call).

Risk Level

Low

Mitigation:

  • The refactoring preserves existing functionality - only the instantiation pattern changed
  • Internal implementation classes (HoodieHadoopIOFactory, HoodieHadoopStorage itself) continue to use direct instantiation as needed
  • No changes to storage behavior, configurations, or file formats

Documentation Update

none - This is an internal code refactoring with no user-facing changes, configuration updates, or new features requiring documentation.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 21, 2025
@KiteSoar KiteSoar marked this pull request as draft December 21, 2025 13:25
@KiteSoar KiteSoar force-pushed the refactor/hoodie-storage-factory-17549 branch from 4c50af2 to 5c00a52 Compare December 21, 2025 15:15
@KiteSoar KiteSoar marked this pull request as ready for review December 21, 2025 15:16
* @return {@link HoodieStorage} instance
* @throws HoodieException if unable to create the storage instance
*/
public static HoodieStorage getStorage(StorageConfiguration<?> conf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove HoodieStorageUtils and directly use HoodieStorageFactory for construction? Move all instantiation logic into HoodieStorageFactory.

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

@CTTY could you review this PR?

@CTTY
Copy link
Contributor

CTTY commented Dec 22, 2025

Hi @KiteSoar @yihua , I'm having second thoughts on this change now.

Initially I was thinking of something like:

factory.withPath().withConf().build()

And then we can slowly migrate the entry point of storage from just HoodieStorage to HoodieStorageFactory. But I think this may be too intrusive no matter how we do it.

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

I still want to stop people from using the HoodieHadoopStorage constructor directly but I don't have a good solution right now. The only idea in my mind is to set the constructor to be default visibility and use reflection when you can only use HoodieHadoopStorage, but I think this is anti-pattern and deserves a separate thread to discuss.

Please let me know what you think!

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

Using existing HoodieStorageUtils to replace direct usages of HoodieHadoopStorage constructor makes sense to me. Right now HoodieStorageUtils serves as the factory for the HoodieStorage; the proposed HoodieStorageFactory can be a refactoring later, though not necessary at this point.

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

The only idea in my mind is to set the constructor to be default visibility and use reflection when you can only use HoodieHadoopStorage, but I think this is anti-pattern and deserves a separate thread to discuss.

We should revisit the places to instantiate the HoodieStorage instance. Ideally, there should only be a handful of places doing that, and all other places should pass the HoodieStorage instance around.

@KiteSoar KiteSoar closed this Dec 23, 2025
@KiteSoar KiteSoar reopened this Dec 23, 2025
@KiteSoar
Copy link
Author

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

Using existing HoodieStorageUtils to replace direct usages of HoodieHadoopStorage constructor makes sense to me. Right now HoodieStorageUtils serves as the factory for the HoodieStorage; the proposed HoodieStorageFactory can be a refactoring later, though not necessary at this point.

Thanks for the review @CTTY @yihua.
That makes sense. I agree with the suggestion to drop HoodieStorageFactory and stick with the existing HoodieStorageUtils.
So, just to confirm the consensus: the goal of this PR is to switch the usages of the HoodieHadoopStorage constructor to HoodieStorageUtils wherever possible. I'll proceed with this change.

@KiteSoar KiteSoar force-pushed the refactor/hoodie-storage-factory-17549 branch from 5c00a52 to 6fe4b19 Compare December 25, 2025 01:16
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Dec 25, 2025
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KiteSoar is this ready for review? If so, @CTTY could you review again?

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @KiteSoar , I just did another round of review. The approach looks good in general while I have some concerns about introducing a new util method that accepts Object[] params to instantiate storage classes. We should assume that a storage implementation only has constructor that takes {StoragePath.class, StorageConfiguration.class}

* @return {@link HoodieStorage} instance
* @throws HoodieException if unable to create the storage instance
*/
public static HoodieStorage getStorage(StorageConfiguration<?> conf,
Copy link
Contributor

@CTTY CTTY Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add this method. This will fail all storage implementations that don't have a constructor that takes params.

Looking at other changes, it seems like the only purpose of this method is to instantiate HoodieHadoopStorage when you only have FileSystem available. We should get path somehow and use the util method that takes {StoragePath.class, StorageConfiguration.class}. In the worst case, we should stay using HoodieHadoopStorage constructor when path is not even available

Path inputPath = ((FileSplit) split).getPath();
FileSystem fs = inputPath.getFileSystem(job);
HoodieStorage storage = new HoodieHadoopStorage(fs);
HoodieStorage storage = HoodieStorageUtils.getStorage(
Copy link
Contributor

@CTTY CTTY Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should instead use HoodieStorageUtils.getStorage(HadoopFSUtils.convertToStoragePath(inputPath), HadoopFSUtils.getStorageConf(fs.getConf())

Path path = ((FileSplit) split).getPath();
FileSystem fs = path.getFileSystem(job);
HoodieStorage storage = new HoodieHadoopStorage(fs);
HoodieStorage storage = HoodieStorageUtils.getStorage(
Copy link
Contributor

@CTTY CTTY Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we should instead use HoodieStorageUtils.getStorage(HadoopFSUtils.convertToStoragePath(path), HadoopFSUtils.getStorageConf(fs.getConf())

Path inputPath = ((FileSplit) split).getPath();
FileSystem fs = inputPath.getFileSystem(job);
HoodieStorage storage = new HoodieHadoopStorage(fs);
HoodieStorage storage = HoodieStorageUtils.getStorage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

storage = new HoodieHadoopStorage(path.getFileSystem(job));
FileSystem fs = path.getFileSystem(job);
storage = HoodieStorageUtils.getStorage(
HadoopFSUtils.getStorageConf(fs.getConf()), new Class<?>[] {FileSystem.class}, fs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we should use path and conf to construct storage instance, not FileSystem

this.cfg = cfg;
this.hoodieSparkContext = hoodieSparkContext;
this.storage = new HoodieHadoopStorage(fs);
this.storage = HoodieStorageUtils.getStorage(HadoopFSUtils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also use path and conf to construct storage instance. The path here should be cfg.targetBasePath

this.hoodieSparkContext = hoodieSparkContext;
this.sparkSession = sparkSession;
this.storage = new HoodieHadoopStorage(fs);
this.storage = HoodieStorageUtils.getStorage(HadoopFSUtils.getStorageConf(fs.getConf()), new Class<?>[] {FileSystem.class}, fs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also use path and conf to construct storage instance. The path here should be cfg.targetBasePath

@@ -91,7 +93,10 @@ void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean hasTransformer,
Boolean isNullTargetSchema, Boolean hasErrorTable, Boolean shouldTryWriteToErrorTable) {
//basic deltastreamer inputs
HoodieSparkEngineContext hoodieSparkEngineContext = mock(HoodieSparkEngineContext.class);
HoodieStorage storage = new HoodieHadoopStorage(mock(FileSystem.class));
HoodieStorage storage = HoodieStorageUtils.getStorage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we can also use HoodieStorageUtils to get storage instance in tests like this(not with the FileSystem). But I'm not sure how feasible it is. If it's quite tricky then we can just use HoodieHadoopStorage constructor for tests

return tempViewProvider;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the empty line back

}).collect(Collectors.toList());
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, let's add the empty line back

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a HoodieStorageFactory to provide HoodieStorage instance

4 participants