-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-16690. Automatically format unformatted JNs with JournalNodeSyncer #6925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-16690. Automatically format unformatted JNs with JournalNodeSyncer #6925
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
30257ae
to
809e51a
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aswinmprabhu for your proposal. Good idea. Leave some comments inline. PFYI.
journal.getStorage().getEditsSyncDir()); | ||
return; | ||
} | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it will drop into endless loop when tryFormatting
set to false by default here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endless loop already exists. If the JN is unformatted, it will throw an exception in a loop every time the syncer tries to sync from other JNs. tryFormatting
doesn't add anything extra to the loop. But if it is set, the formatting could happen.
return GetEditLogManifestResponseProto.newBuilder() | ||
.setManifest(PBHelper.convert(manifest)) | ||
.setHttpPort(jn.getBoundHttpAddress().getPort()) | ||
.setFromURL(jn.getHttpServerURI()) | ||
.build(); | ||
} | ||
|
||
@Override | ||
public StorageInfoProto getStorageInfo(String jid, | ||
String nameServiceId) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameServiceId
here is not used anymore, so why we should define it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This was a miss. getOrCreateJournal
should be called with jid and nameServiceId. I've pushed an update. Thanks!
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@Hexiaoqiao can you please take a look again? I've replied and updated the PR. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1. The failed unit test is not related to this PR (which is traced by #6951). Will commit if no more other comments for two work days.
Thanks, @Hexiaoqiao for the approval. Can we merge at the end of today? |
Committed to trunk. Thanks @aswinmprabhu for your works. |
…er (apache#6925). Contributed by Aswin M Prabhu. Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
…er (apache#6925). Contributed by Aswin M Prabhu. Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Description of PR
If an unformatted JournalNode is added to an existing JournalNode set --
This scenario can arise in different situations like
Manually fixing this involves rsyncing the VERSION file to the edit log root directory from the other healthy JNs. A similar issue concerning the paxos directory was solved in HDFS-10659.
This PR tries to leverage the already existing JournalNodeSyncer daemon to format the JournalNode on which it is running when it discovers that syncs can't happen due to the
JournalNotFormattedException
. JournalNodeSyncer calls theformatWithSyncer
method if it sees that the JN is unformatted.formatWithSyncer
will loop over the other JN proxies, trying to fetch theStorageInfo
object from them. The StorageInfo object is then used to format the JN by callingJNStorage.format()
.How was this patch tested?
Unit tests have been added for the change:
Build is also successful:
I've tested the changes manually in a K8S cluster with 3 JNs:
JN root dir before testing:
Logs show that JN can receive edit logs:
2024-07-04 08:50:55,242 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /grid/edits/journal/d ata/asprabhu-hadoop/current/edits_inprogress_0000000000000000001 -> /grid/edits/journal/data/asprabhu-hadoop/current/edits_00000000 00000000001-0000000000000000042
Killed the JN process and deleted the edit logs root dir:
Started the JN process. The syncer formatted the JN. Some relevant log lines:
Syncer was also able to fill the holes from other JNs:
JN root dir after the testing:
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?