Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename files written by PageSink #14974

Merged
merged 6 commits into from
Oct 1, 2020
Merged

Rename files written by PageSink #14974

merged 6 commits into from
Oct 1, 2020

Conversation

NikhilCollooru
Copy link
Contributor

@NikhilCollooru NikhilCollooru commented Aug 6, 2020

This PR enables the workers to send / receive metadata requests /responses from coordinator.
If this feature is enabled for Hive, then all the files written by the HivePageSink for a given partition will have increasing whole numbers as file names. (For ex: Lets say this we enabled this feature and ds=2020-09-30 is written by HivePageSink. Then we can expect the file names of that partition to be 0,1,2,3..N etc ).

depended by https://github.com/facebookexternal/presto-facebook/pull/1125

== RELEASE NOTES ==
Hive Changes
* Add support for file renaming for Hive connector. This can be enabled with ``hive.file_renaming_enabled`` configuration property

@NikhilCollooru NikhilCollooru requested a review from highker August 6, 2020 02:19
@highker
Copy link
Contributor

highker commented Aug 6, 2020

rebase a bit maybe?

@NikhilCollooru NikhilCollooru force-pushed the m2 branch 2 times, most recently from 5b2a34e to 3e915b5 Compare August 7, 2020 01:25
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level design direction:

  1. Create a new interface in ConnectorMetadata called updateMetadata in SPI
  2. Create a new service HiveFileRenamer for Hive connector in presto-hive and inject it into HiveMetadata. This is to move your current logic in ContinuousTaskStatusFetcher into this HiveFileRenamer. So that you can call the interface updateMetadata to tunnel the logic into it.
  3. Change the field name in TaskStatus to be called MetadataUpdates.
  4. Create a new connector interface ConnectorDistributedMetadataUpdater in SPI. This is to replace FileNameProvider. Make sure to name the interface in a generic way (to avoid word like 'file', 'name', etc). This class is not injected. But to created from class loader. (Check ConnectorPlanOptimizerProvider as an example).
  5. Create a new class HiveDistributedMetadataUpdater to inherit ConnectorPlanOptimizerProvider. This is to replace FileNameManager. Move all your logic there.
  6. SqlTask should call HiveDistributedMetadataUpdater to update the renamed files.
  7. HiveDistributedMetadataUpdater will be tunneled into HivePageSink just like FileNameProvider.

@NikhilCollooru NikhilCollooru force-pushed the m2 branch 6 times, most recently from 0dbf2bf to c89651f Compare August 16, 2020 17:10
@NikhilCollooru NikhilCollooru requested a review from highker August 16, 2020 19:24
@highker highker marked this pull request as ready for review August 17, 2020 16:36
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on SPI change. Once we fix SPI, other parts should be easy to proceed

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still reviewing

Comment on lines 42 to 43
@Inject
public HiveDistributedMetadataUpdater() {}
Copy link
Contributor

@highker highker Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We might need to inject something light-weighted to track the lifecycle of a query. SqlTaskManager is the closest I can think of. But that is too heavy and not accessible through connectors. We may in the end hook callbacks from HivePageSink to remove dead queries.
  • clean up hashmap by hooking it to HiveFileWriter::commit/rollback
  • add an assertion to any integration or smoke tests to assert we don't leak requests. Check 9f89256 as an example

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • take a look ConnectorPlanOptimizer and ConnectorPlanOptimizerProvider
  • ConnectorDistributedMetadataUpdater <-> ConnectorPlanOptimizer. It will be per query/task
  • Create a new ConnectorDistributedMetadataUpdaterProvider that is a singleton. Inject it into LocalExecutionPlanner and somehow return it back SqlTask?.
  • In LocalExecutionPlanner, when you create a new TableWriterOperatorFactory, create a new ConnectorDistributedMetadataUpdater instance and pass it in.

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we pass the updater into table writer instead of injecting it, we will need to modify the SPI createPageSink. To do it, we could rename PageSinkProperties into PageSinkContext as a separate commit. Then put the updater inside PageSinkContext so we don't have to change the interface again.

@NikhilCollooru NikhilCollooru force-pushed the m2 branch 4 times, most recently from 00f121b to 5100427 Compare August 25, 2020 16:44
@NikhilCollooru NikhilCollooru requested a review from highker August 25, 2020 18:20
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Rename PageSinkProperties to PageSinkContext" LGTM

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add ConnectorMetadataUpdater to handle worker's metadata updates"

  • there is a typo":wq" in the commit message
  • haven't reviewed in details yet. One quick comment: we probably need to use a session property to guard if we actually wanna do send/receive any requests for HiveMetadataUpdater.

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add ConnectorMetadataUpdateHandle resolver" LGTM

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level comments of this PR: we need tests; a lot of unit tests and integration tests. Probably enable the feature for all write tests for hive connector. We also need to make sure the file names are in sequential order.

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Rename files written by HivePageSink": make rename async

@NikhilCollooru NikhilCollooru force-pushed the m2 branch 3 times, most recently from c77545f to 33d310c Compare September 26, 2020 04:24
@highker highker self-requested a review September 27, 2020 03:02
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"POST the metadata results to worker": we need good ways to purge queries on coordinators

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add config parameters to enable hive file renaming": LGTM

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Disable listFiles call to underlying storage": let's remove this commit in this PR. That is more like the standalone piece to review in the next PR

@highker
Copy link
Contributor

highker commented Sep 27, 2020

First pass done. Feel free to re-request my reviews once the comments are addressed.

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall look good to me. Minor comments only

@NikhilCollooru NikhilCollooru force-pushed the m2 branch 2 times, most recently from 0ee1a54 to 6c40f9b Compare October 1, 2020 02:13
@NikhilCollooru NikhilCollooru requested a review from highker October 1, 2020 03:31
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good work! Nits only

@highker highker self-assigned this Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants