-
Notifications
You must be signed in to change notification settings - Fork 2.5k
refactor: Replace HoodieHadoopStorage instantiation with HoodieStorageFactory #17661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
refactor: Replace HoodieHadoopStorage instantiation with HoodieStorageFactory #17661
Conversation
4c50af2 to
5c00a52
Compare
| * @return {@link HoodieStorage} instance | ||
| * @throws HoodieException if unable to create the storage instance | ||
| */ | ||
| public static HoodieStorage getStorage(StorageConfiguration<?> conf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove HoodieStorageUtils and directly use HoodieStorageFactory for construction? Move all instantiation logic into HoodieStorageFactory.
|
@CTTY could you review this PR? |
|
Hi @KiteSoar @yihua , I'm having second thoughts on this change now. Initially I was thinking of something like: And then we can slowly migrate the entry point of storage from just It makes more sense to me right now that we don't add a new I still want to stop people from using the Please let me know what you think! |
Using existing |
We should revisit the places to instantiate the |
Thanks for the review @CTTY @yihua. |
5c00a52 to
6fe4b19
Compare
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CTTY
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @KiteSoar , I just did another round of review. The approach looks good in general while I have some concerns about introducing a new util method that accepts Object[] params to instantiate storage classes. We should assume that a storage implementation only has constructor that takes {StoragePath.class, StorageConfiguration.class}
| * @return {@link HoodieStorage} instance | ||
| * @throws HoodieException if unable to create the storage instance | ||
| */ | ||
| public static HoodieStorage getStorage(StorageConfiguration<?> conf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should add this method. This will fail all storage implementations that don't have a constructor that takes params.
Looking at other changes, it seems like the only purpose of this method is to instantiate HoodieHadoopStorage when you only have FileSystem available. We should get path somehow and use the util method that takes {StoragePath.class, StorageConfiguration.class}. In the worst case, we should stay using HoodieHadoopStorage constructor when path is not even available
| Path inputPath = ((FileSplit) split).getPath(); | ||
| FileSystem fs = inputPath.getFileSystem(job); | ||
| HoodieStorage storage = new HoodieHadoopStorage(fs); | ||
| HoodieStorage storage = HoodieStorageUtils.getStorage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should instead use HoodieStorageUtils.getStorage(HadoopFSUtils.convertToStoragePath(inputPath), HadoopFSUtils.getStorageConf(fs.getConf())
| Path path = ((FileSplit) split).getPath(); | ||
| FileSystem fs = path.getFileSystem(job); | ||
| HoodieStorage storage = new HoodieHadoopStorage(fs); | ||
| HoodieStorage storage = HoodieStorageUtils.getStorage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we should instead use HoodieStorageUtils.getStorage(HadoopFSUtils.convertToStoragePath(path), HadoopFSUtils.getStorageConf(fs.getConf())
| Path inputPath = ((FileSplit) split).getPath(); | ||
| FileSystem fs = inputPath.getFileSystem(job); | ||
| HoodieStorage storage = new HoodieHadoopStorage(fs); | ||
| HoodieStorage storage = HoodieStorageUtils.getStorage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
| storage = new HoodieHadoopStorage(path.getFileSystem(job)); | ||
| FileSystem fs = path.getFileSystem(job); | ||
| storage = HoodieStorageUtils.getStorage( | ||
| HadoopFSUtils.getStorageConf(fs.getConf()), new Class<?>[] {FileSystem.class}, fs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we should use path and conf to construct storage instance, not FileSystem
| this.cfg = cfg; | ||
| this.hoodieSparkContext = hoodieSparkContext; | ||
| this.storage = new HoodieHadoopStorage(fs); | ||
| this.storage = HoodieStorageUtils.getStorage(HadoopFSUtils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also use path and conf to construct storage instance. The path here should be cfg.targetBasePath
| this.hoodieSparkContext = hoodieSparkContext; | ||
| this.sparkSession = sparkSession; | ||
| this.storage = new HoodieHadoopStorage(fs); | ||
| this.storage = HoodieStorageUtils.getStorage(HadoopFSUtils.getStorageConf(fs.getConf()), new Class<?>[] {FileSystem.class}, fs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also use path and conf to construct storage instance. The path here should be cfg.targetBasePath
| @@ -91,7 +93,10 @@ void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean hasTransformer, | |||
| Boolean isNullTargetSchema, Boolean hasErrorTable, Boolean shouldTryWriteToErrorTable) { | |||
| //basic deltastreamer inputs | |||
| HoodieSparkEngineContext hoodieSparkEngineContext = mock(HoodieSparkEngineContext.class); | |||
| HoodieStorage storage = new HoodieHadoopStorage(mock(FileSystem.class)); | |||
| HoodieStorage storage = HoodieStorageUtils.getStorage( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we can also use HoodieStorageUtils to get storage instance in tests like this(not with the FileSystem). But I'm not sure how feasible it is. If it's quite tricky then we can just use HoodieHadoopStorage constructor for tests
| return tempViewProvider; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add the empty line back
| }).collect(Collectors.toList()); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, let's add the empty line back
Describe the issue this Pull Request addresses
This PR refactors the codebase to use HoodieStorageFactory instead of direct HoodieHadoopStorage instantiation, improving code maintainability and extensibility.
closes #17549
Summary and Changelog
Summary:
Users and developers will benefit from a more maintainable and extensible storage layer architecture. The factory pattern allows for easier switching between different storage implementations in the future.
Changelog:
new HoodieHadoopStorage()instantiations with HoodieStorageFactory.getStorage() callsImpact
Public API Changes: None - This is an internal refactoring only.
User-Facing Changes: None - No behavioral changes for end users.
Performance Impact: Negligible - The factory pattern adds minimal overhead (one additional method call).
Risk Level
Low
Mitigation:
Documentation Update
none - This is an internal code refactoring with no user-facing changes, configuration updates, or new features requiring documentation.
Contributor's checklist