[Issue 6068] Fixing the calls to Helix to throw exception if zk conne… #6069

Merged
merged 4 commits into from
Oct 6, 2020

mcvsubbu
Contributor

…ction is broken

See Issue #6068

Description

These APIs ensure that if there is a ZK disconnect,
we get an exception after a minimal number of retries.
If needed later on, we can change this to retry once and
implement a backoff retry.

Note that the underlying Helix library still ends up calling the previous
API for now, but we will soon upgrade to a Helix version that actually
implements these APIs.
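
To make the intended behavior concrete, here is a minimal sketch of "retry a few times, then throw". It is not Pinot or Helix code: `ZkRetrySketch`, `readWithRetry`, and `ZkReadException` are hypothetical names, the constant values are made up (the diff only shows the names `ZK_OP_RETRY_COUNT` and `ZK_OP_RETRY_INTERVAL_MS`), and it assumes the retry count means extra attempts after the first one.

```java
import java.util.concurrent.Callable;

/** Toy illustration only; not Pinot or Helix code. */
final class ZkRetrySketch {
  // Names mirror the constants in the diff; the values here are assumptions.
  static final int ZK_OP_RETRY_COUNT = 2;
  static final int ZK_OP_RETRY_INTERVAL_MS = 10;

  /** Thrown when every attempt has failed, so the caller sees the failure. */
  static final class ZkReadException extends RuntimeException {
    ZkReadException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Runs the read up to retryCount + 1 times, sleeping retryIntervalMs between
   * attempts, and throws when all attempts fail.
   */
  static <T> T readWithRetry(Callable<T> read, int retryCount, long retryIntervalMs) {
    Exception last = null;
    for (int attempt = 0; attempt <= retryCount; attempt++) {
      try {
        return read.call();
      } catch (Exception e) {
        last = e;
        if (attempt < retryCount) {
          try {
            Thread.sleep(retryIntervalMs);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new ZkReadException("Interrupted while retrying", ie);
          }
        }
      }
    }
    throw new ZkReadException("Read failed after " + (retryCount + 1) + " attempts", last);
  }
}
```

A backoff variant would only need to grow the sleep interval on each attempt instead of keeping it fixed.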

mcvsubbu commented Oct 4, 2020

@Jackie-Jiang I made one more change. Can you take another look please? thanks

@@ -305,7 +308,7 @@ public static Schema getTableSchema(@Nonnull ZkHelixPropertyStore<ZNRecord> prop
       ZkHelixPropertyStore<ZNRecord> propertyStore, String tableName) {
     String offlineTableName = TableNameBuilder.OFFLINE.tableNameWithType(tableName);
     String parentPath = constructPropertyStorePathForResource(offlineTableName);
-    List<ZNRecord> znRecords = propertyStore.getChildren(parentPath, null, AccessOption.PERSISTENT);
+    List<ZNRecord> znRecords = propertyStore.getChildren(parentPath, null, AccessOption.PERSISTENT, ZK_OP_RETRY_COUNT, ZK_OP_RETRY_INTERVAL_MS);
Contributor

This won't take effect... Here is the Helix code for it (I would count it as a bug):

  @Override
  public List<T> getChildren(String parentPath, List<Stat> stats, int options, int retryCount,
      int retryInterval) throws HelixException {
    return getChildren(parentPath, stats, options);
  }

I don't see an easy way to throw an exception for this API unless we create the children paths manually.
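
For illustration, "creating the children paths manually" could look roughly like the sketch below: list the child names with a call that throws on failure, then read each child by its full path. `ToyStore` is an in-memory stand-in with hypothetical method names, not the Helix property store API, and the behavior on disconnect is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a property store; all names are hypothetical, not the Helix API. */
final class ManualChildrenSketch {
  static final class ToyStore {
    private final Map<String, String> records = new TreeMap<>();
    boolean connected = true;

    void put(String path, String value) {
      records.put(path, value);
    }

    /** Lists direct child names; throws when the store is unreachable. */
    List<String> getChildNames(String parentPath) {
      ensureConnected();
      String prefix = parentPath + "/";
      List<String> names = new ArrayList<>();
      for (String path : records.keySet()) {
        // Keep only direct children: no further '/' after the parent prefix.
        if (path.startsWith(prefix) && path.indexOf('/', prefix.length()) < 0) {
          names.add(path.substring(prefix.length()));
        }
      }
      return names;
    }

    /** Reads one record; throws when the store is unreachable. */
    String get(String path) {
      ensureConnected();
      return records.get(path);
    }

    private void ensureConnected() {
      if (!connected) {
        throw new IllegalStateException("store disconnected");
      }
    }
  }

  /** Builds each child path by hand so a failed read surfaces as an exception. */
  static List<String> getChildrenOrThrow(ToyStore store, String parentPath) {
    List<String> children = new ArrayList<>();
    for (String name : store.getChildNames(parentPath)) {
      String value = store.get(parentPath + "/" + name);
      if (value != null) { // the child may have been deleted between the two reads
        children.add(value);
      }
    }
    return children;
  }
}
```

The trade-off in this sketch is one read per child instead of a single bulk call, in exchange for failures being visible to the caller.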

Contributor Author

We are working with the Helix team to get a new version with the API implemented. Since I took the trouble to trace all the APIs, I took the liberty of modifying them so that when we upgrade to a newer Helix version, we have the right calls.

@Jackie-Jiang
Contributor

@mcvsubbu We want both getChildNames() and getChildren() to have the same behavior, and should throw exception on failures. Since Helix does not have the correct APIs for getChildren() yet, I would suggest not changing getChildren() for now and wait until we upgrade to the new Helix version, or it will be very confusing for the developers because of the unexpected Helix behavior. Thoughts?


mcvsubbu commented Oct 5, 2020

> @mcvsubbu We want both getChildNames() and getChildren() to have the same behavior, and should throw exception on failures. Since Helix does not have the correct APIs for getChildren() yet, I would suggest not changing getChildren() for now and wait until we upgrade to the new Helix version, or it will be very confusing for the developers because of the unexpected Helix behavior. Thoughts?

Since the new API has not been implemented yet, there will not be any unexpected behavior at this time, right?

@Jackie-Jiang
Contributor

@mcvsubbu The unexpected behavior I was referring to is that we explicitly set the retry in the method, but Helix won't do the retry and won't throw an exception when the read fails. So I suggest not doing the explicit retry before upgrading the Helix version. But this is not critical. Feel free to merge.
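
A toy model of the concern, for readers following along: an overload that accepts retry settings but delegates to the base call without using them. `IgnoredRetrySketch` and its members are hypothetical, and returning an empty list on failure is an assumption made for the sketch; the thread only says that no exception is thrown.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the behavior discussed in this thread; names are hypothetical. */
final class IgnoredRetrySketch {
  final AtomicInteger attempts = new AtomicInteger();
  boolean connected = true;

  /** Base read: makes one attempt and hides the failure from the caller. */
  List<String> getChildren(String parentPath) {
    attempts.incrementAndGet();
    if (!connected) {
      return Collections.emptyList(); // assumed failure result; no exception is raised
    }
    return Collections.singletonList(parentPath + "/child");
  }

  /** Overload that accepts retry settings but delegates without using them. */
  List<String> getChildren(String parentPath, int retryCount, int retryIntervalMs) {
    return getChildren(parentPath);
  }
}
```

In this model a caller that passes a retry count of 2 still gets exactly one attempt and no exception, which is the mismatch between the call site and the actual behavior described above.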

@mcvsubbu mcvsubbu merged commit 24147dd into apache:master Oct 6, 2020
@mcvsubbu mcvsubbu deleted the issue-6068 branch October 6, 2020 00:08