Skip to content

[SPARK-12267][Core]Store the remote RpcEnv address to send the correct disconnetion message #10261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

zsxwing
Copy link
Member

@zsxwing zsxwing commented Dec 11, 2015

No description provided.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 11, 2015

CC @vanzin could you take a look? Thanks!

// the listening address
val remoteEnvAddress = requestMessage.senderAddress
if (remoteAddresses.putIfAbsent(clientAddr, remoteEnvAddress) == null) {
dispatcher.postToAll(RemoteProcessConnected(remoteEnvAddress))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to concern about multiple network messages with different addresses since the codes already handle the bad addresses.

@vanzin
Copy link
Contributor

vanzin commented Dec 11, 2015

So, I don't think this is gonna work, and that's why my change is taking longer than I hoped to finish... while this might fix the immediate problem listed in the bug, there are other problems related to this that need to be fixed.

Basically, standalone master HA is currently broken, and because there are no unit tests at all for that part of the code, nobody caught it...

I'm trying to fix it, I think I almost got it, have one last exception to get rid of.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 11, 2015

Cool. I will close this PR.

On Thu, Dec 10, 2015 at 6:28 PM Marcelo Vanzin notifications@github.com
wrote:

So, I don't think this is gonna work, and that's why my change is taking
longer than I hoped to finish... while this might fix the immediate problem
listed in the bug, there are other problems related to this that need to be
fixed.

Basically, standalone master HA is currently broken, and because there are
no unit tests at all for that part of the code, nobody caught it...

I'm trying to fix it, I think I almost got it, have one last exception to
get rid of.


Reply to this email directly or view it on GitHub
#10261 (comment).

@marmbrus
Copy link
Contributor

hey guys, I know this is still in flight, but we are pretty behind on the release. I'm going to cut RC2 and we can always make a judgement call about doing an RC3 once this is fixed.

@SparkQA
Copy link

SparkQA commented Dec 11, 2015

Test build #47561 has finished for PR 10261 at commit f84e325.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 11, 2015

@vanzin

I saw your following comment in vanzin@3848bf5#diff-acd05d6d379b6ef6ccf36bd3db5614f6R69

Because the messages used to register workers and applications were one-way
messages, though, "receive" was used, and that information was not available.
So now those messages are send using "ask", which looks a bit awkward but is
simpler than changing the RpcEnv API so that the client address is available
in the "receive" method.

But I don't get it.

The communication in master-worker, worker-driver and master-driver are all in non-client mode. So for one-way messages, RequestMessage.senderAddress is the RpcEnv listening address. If we maintain the address relation like this PR, RpcEndpoint doesn't need to worry about the address issue even for one-way messages. Right?

@SparkQA
Copy link

SparkQA commented Dec 11, 2015

Test build #47588 has finished for PR 10261 at commit 3b37070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Dec 12, 2015

I saw your following comment...But I don't get it.

That's because we took opposite approaches. Your PR's approach is "if this connection says the sender address is a listening socket, then also consider that address when sending events about the remote process".

My PR takes the opposite approach: the address of the remote process is always the address of the socket used to connect, regardless of whether its also listening in another socket.

I think my approach is in the end more correct, but requires more code to fix existing code. In my view, RpcCallContext.senderAddress should be the address of the socket that sent the message, not the address the remote process is listening on.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 12, 2015

I see. However, for 1.6, we still need to support Akka RPC. Is it possible to get the client address from ActorRef?

@vanzin
Copy link
Contributor

vanzin commented Dec 12, 2015

Akka doesn't have this problem as far as I understand the code; also because the akka rpc backend never runs in "client mode" (that's a netty rpc-only thing).

@vanzin
Copy link
Contributor

vanzin commented Dec 12, 2015

BTW I'm fine with this change for 1.6; we can revisit this later to avoid having to keep extra state in the RPC layer.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 13, 2015

Akka doesn't have this problem as far as I understand the code; also because the akka rpc backend never runs in "client mode" (that's a netty rpc-only thing).

I didn't mean that. I just want to mention that if we want to change the semantics of senderAddress, we need to update the Akka RPC as well.

BTW I'm fine with this change for 1.6; we can revisit this later to avoid having to keep extra state in the RPC layer.

Great. I'm going to merge this one to master and 1.6. Let's revisit this later.

asfgit pushed a commit that referenced this pull request Dec 13, 2015
…ct disconnetion message

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10261 from zsxwing/SPARK-12267.

(cherry picked from commit 8af2f8c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 8af2f8c Dec 13, 2015
@zsxwing zsxwing deleted the SPARK-12267 branch December 13, 2015 06:03
@zsxwing
Copy link
Member Author

zsxwing commented Dec 13, 2015

@vanzin I created SPARK-12308 to remind us of your approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants