[TASK][CHALLENGE] Support Spark Connect Frontend/Backend #5383

Open
3 tasks done
ulysses-you opened this issue Oct 9, 2023 · 24 comments
@ulysses-you
Contributor

ulysses-you commented Oct 9, 2023

Code of Conduct

Search before creating

  • I have searched in the task list and found no similar tasks.

Mentor

  • I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.

Skill requirements

  • Knowledge about Spark Connect
  • Knowledge about Kyuubi architecture
  • Knowledge about protobuf
  • Knowledge about grpc
  • Knowledge about thrift

Background and Goals

Make the Kyuubi server compatible with the Spark Connect protocol, so that people can use a Spark Connect client to connect to the Kyuubi server.

image

Implementation steps

  1. Add a new Spark Connect frontend
    1.1 Add a basic gRPC server as the frontend
    1.2 Be compatible with the Spark Connect protocol, see https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/base.proto
    1.3 Support ExecutePlan
    1.4 Support AnalyzePlan
    1.5 Support Config
    1.6 Support AddArtifacts
    1.7 Support ArtifactStatus
    1.8 Support Interrupt
    1.9 Support ReattachExecute
    1.10 Support ReleaseExecute
    1.11 Serialize the protobuf-based request

  2. Add a new Spark Connect backend
    2.1 Import Spark-Connect-Server and rewrite SparkConnectService https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala
    2.2 Deserialize the response to a protobuf-based message

  3. Add integration tests (IT)

  4. Add docs
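
As a rough illustration of step 1, the RPCs the frontend has to cover can be tracked in a small dispatch table. This is only a Python sketch, not Kyuubi code: `ConnectFrontend` and its methods are hypothetical names, and the handlers work on raw bytes instead of real protobuf messages. The RPC names follow the list above.

```python
from typing import Callable, Dict, List

# RPC names from the task list above (they mirror the service in base.proto).
SPARK_CONNECT_RPCS = (
    "ExecutePlan", "AnalyzePlan", "Config", "AddArtifacts",
    "ArtifactStatus", "Interrupt", "ReattachExecute", "ReleaseExecute",
)


class ConnectFrontend:
    """Hypothetical frontend that routes an RPC name to a registered handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[bytes], bytes]] = {}

    def register(self, rpc: str, handler: Callable[[bytes], bytes]) -> None:
        if rpc not in SPARK_CONNECT_RPCS:
            raise ValueError(f"unknown Spark Connect RPC: {rpc}")
        self._handlers[rpc] = handler

    def unimplemented(self) -> List[str]:
        # RPCs from the task list that have no handler yet.
        return [rpc for rpc in SPARK_CONNECT_RPCS if rpc not in self._handlers]

    def dispatch(self, rpc: str, payload: bytes) -> bytes:
        handler = self._handlers.get(rpc)
        if handler is None:
            raise NotImplementedError(f"{rpc} is not supported yet")
        return handler(payload)


frontend = ConnectFrontend()
# Echo the serialized bytes back, as a stand-in for forwarding to an engine.
frontend.register("ExecutePlan", lambda payload: payload)
```

`frontend.unimplemented()` then lists what is still open, which could map onto the sub-issues.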

Additional context

Introduction of #6232

@yehere
Contributor

yehere commented Oct 10, 2023

I think this is very challenging, but I want to give it a try. Can you assign it to me and help me?

@ulysses-you
Contributor Author

Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.

@pan3793
Member

pan3793 commented Oct 10, 2023

This huge task could be divided into several different level tasks, feel free to go ahead ~ all your contributions will be counted eventually :)

@cfmcgrady
Contributor

cc @cfmcgrady

@zhaomin1423
Member

Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.

I'm also interested in it and hope to work together.

@ulysses-you
Contributor Author

Thank you @zhaomin1423, glad to see you are interested.

@minyk
Contributor

minyk commented Oct 19, 2023

How about a co-located mode with Kyuubi's Spark SQL engine? A separate service is good and basic, but it also needs more resources for more Spark instances.

@ulysses-you
Contributor Author

@minyk they are in different processes, just like the Spark Thrift server and the Connect server. We are going to add a new module and a new server for Kyuubi Connect. We can do it together if you are interested.

@davidyuan1223
Contributor

@ulysses-you Hello, I'm interested in this component and hope to work with you.

@davidyuan1223
Contributor

@yaooqinn @pan3793 @ulysses-you
I found that Spark has published the connect module to the Maven repository:
https://mvnrepository.com/artifact/org.apache.spark/spark-connect-client-jvm_2.13/3.4.0
Can we use those packages to simplify the code?
Maybe we could provide a connection string to SparkSession, as described in the code:
https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
Based on the spark-connect package, we can cut down on the gRPC server and proto code.
What do you think?
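
For reference, a Spark Connect connection string has the shape `sc://host:port/;key=value;key=value`, with 15002 as the default port. Below is a minimal Python sketch of parsing one. `parse_connect_url` is a hypothetical helper written for illustration, not part of Spark or Kyuubi, and it ignores details such as IPv6 hosts.

```python
DEFAULT_PORT = 15002  # Spark Connect's default port


def parse_connect_url(url: str) -> dict:
    """Split a Spark Connect connection string into host, port and params.

    Hypothetical helper for illustration only; IPv6 hosts are not handled.
    """
    prefix = "sc://"
    if not url.startswith(prefix):
        raise ValueError(f"not a Spark Connect URL: {url}")
    # Everything before the first "/" is host[:port]; the rest holds the params.
    authority, _, param_str = url[len(prefix):].partition("/")
    host, _, port = authority.partition(":")
    if not host:
        raise ValueError("missing host")
    params = {}
    for item in param_str.lstrip(";").split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item}")
        params[key] = value
    return {
        "host": host,
        "port": int(port) if port else DEFAULT_PORT,
        "params": params,
    }


# The host name here is made up; user_id and use_ssl are standard parameters.
parsed = parse_connect_url("sc://kyuubi.example.com:10999/;user_id=alice;use_ssl=true")
```

A server-side frontend would mostly care about the parameters (for example `user_id` or `token`), since host and port are resolved by the client.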

@pan3793
Member

pan3793 commented Apr 16, 2024

I haven't had a deep look at it; my current thought is:

  1. for the server part, we only need a thin gRPC layer; copying proto files and regenerating gRPC files is fine.
  2. for the engine part, we can reuse the connect-server module to simplify the code.

@ulysses-you
Contributor Author

@davidyuan1223 sure, please go ahead. +1 for @pan3793's thought.

@davidyuan1223
Contributor

@ulysses-you @pan3793
Understood. I'd like to try this challenging issue, which could go on for a long time, as I need to go through the whole architecture of kyuubi-server and figure out the differences between it and spark-connect, and in the process I might have some discussions with you.
Could you assign this issue to me?

@tgravescs

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

@davidyuan1223
Copy link
Contributor

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

Yes, my initial assumption is to create a 3.4-based SparkSession by providing a remote connection string as a configuration item, and then merge it with the Thrift service to provide the corresponding engine. This configuration must therefore force a check that the Spark version is > 3.4. Since spark-connect-client already provides a SparkSession, this reduces our development effort. What do you think?

@pan3793
Member

pan3793 commented Apr 18, 2024

@tgravescs that's a good question, and we did have an offline discussion about it.

TL;DR, your assumption will be the ultimate version, but not at the beginning.

As you know the current main flow of Kyuubi is:

       ===[http]
client ===[thrift]====> Server ===[thrift]===> Engine
       ===[etc.]               ---[thrift]---> STS/HS2/Impala (we know someone implemented such a feature internally)

The engine itself is kind of a regular Spark app that basically only consumes Spark's public API, making it easily compatible with multiple Spark versions. As connect is a new feature and connect-server is not supposed to be exposed to the user directly (I suppose only the gRPC API is public API in this case), pulling connect-server into the current Spark engine module directly would break the current assumption. So in the experimental phase we are going to create a dedicated engine module for the connect engine; we may call it SPARK_CONNECT (the current one is SPARK_SQL).

Another important case is Server ===[thrift]===> Engine. Currently, we use Thrift (more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used. Keeping two internal RPC protocols is quite complex and redundant, so we tend to create a dedicated experimental server that keeps a similar architecture but rewrites the RPC implementation.

Once the PoC is completed, we can consider merging servers and engines to achieve the final vision as you said.

       ===[http]
       ===[grpc]
client ===[thrift]====> Server ===[grpc]===> Engine
       ===[etc.]               --[thrift]--> STS/HS2/Impala (we know someone implemented such a feature internally)

Maybe @yaooqinn can share more information

@davidyuan1223
Contributor

davidyuan1223 commented Apr 18, 2024

@pan3793 @yaooqinn @ulysses-you @tgravescs
Hello, I have analyzed the processing flow of spark-connect, as shown in the following figure.
image

  1. SparkSession.builder.remote(host:port).getOrCreate() creates a SparkConnectClient (RPCClient)
  2. spark.sql(xxx) actually builds an rpcRequest and then uses the RPCClient to send it to the Spark-Connect-Server
  3. The Spark-Connect-Server receives the request, processes it with its local sparkSession, and finally returns the rpcResponse
  4. The client sparkSession receives the rpcResponse, resolves it, and returns the result

As mentioned above, I believe that in the RPC request process of kyuubi based on SparkConnect, we no longer need the involvement of SparkSession, so I have designed the following process:
image

  1. We will implement a KyuubiSparkConnectClient (RPCClient, based on SparkConnectClient). It will be created when we use EngineRef.getOrCreate to create a KyuubiSparkConnectEngine
  2. For example, when we use beeline to execute SQL, it will create a Thrift request to the KyuubiSparkConnectFrontendService
  3. The frontendService will not do anything special, just like other engines; it will then pass the request to KyuubiSparkConnectService (client: KyuubiSparkConnectClient)
  4. The backendService, also like other engines, will use the corresponding operation to handle the request
  5. The operation will proceed as follows
    5.1 Process the Thrift request and transform it into an RPC request
    5.2 Call the client method to process the request
    5.3 Receive the RPC response from the Spark-Connect-Server
    5.4 Transform the RPC response into a Thrift response

Based on the RPC client, we don't need to create a sparkSession
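
Steps 5.1 to 5.4 could look roughly like the Python sketch below. All class names are placeholders made up for illustration (plain dataclasses stand in for the Thrift and Spark Connect messages), and the fake client only echoes the statement so the flow can be run without a Spark-Connect-Server.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class ThriftExecuteRequest:  # stand-in for a HiveServer2 Thrift request
    session_id: str
    statement: str


@dataclass
class RpcRequest:  # stand-in for a Spark Connect request message
    session_id: str
    sql: str


@dataclass
class RpcResponse:  # stand-in for a Spark Connect response message
    rows: List[tuple]


@dataclass
class ThriftResultSet:  # stand-in for a Thrift result set
    rows: List[tuple]


class ConnectClient(Protocol):
    def execute(self, request: RpcRequest) -> RpcResponse: ...


class ExecuteStatementOperation:
    """Hypothetical operation that bridges Thrift and the RPC client."""

    def __init__(self, client: ConnectClient) -> None:
        self._client = client

    def run(self, request: ThriftExecuteRequest) -> ThriftResultSet:
        # 5.1 transform the Thrift request into an RPC request
        rpc_request = RpcRequest(request.session_id, request.statement)
        # 5.2 / 5.3 call the client and receive the RPC response
        rpc_response = self._client.execute(rpc_request)
        # 5.4 transform the RPC response into a Thrift response
        return ThriftResultSet(rows=rpc_response.rows)


class FakeConnectClient:
    """Echoes the SQL text back as a single row; no server involved."""

    def execute(self, request: RpcRequest) -> RpcResponse:
        return RpcResponse(rows=[(request.sql,)])


result = ExecuteStatementOperation(FakeConnectClient()).run(
    ThriftExecuteRequest(session_id="s1", statement="select 1")
)
```

No sparkSession appears anywhere in this path; the operation only depends on the client interface.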

What do you think?

@tigrulya-exe
Contributor

@pan3793 @yaooqinn Hi! Just to clarify: do I understand correctly that for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc., or decouple it from Thrift.

Or is it expected to start directly by rewriting the current internal RPC mechanism from Thrift (HS2) to gRPC and changing the internal API (kyuubi frontend server <--> engine), so that it will include logical methods from both the old API and the Spark Connect API?

currently, we use Thrift(more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used

@pan3793
Member

pan3793 commented Aug 15, 2024

for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc., or decouple it from Thrift.

@tigrulya-exe Exactly! I'm doing some experiments in this way, and it does involve lots of refactoring work to support both Thrift and gRPC and reuse code as much as possible. I can not promise an ETA since I'm not sure how much time I can spend on this task in the next few months. But I will open a draft PR once I make the pipeline work (for example, successfully executing select 1 using a spark-connect client), meanwhile, I will separate the refactoring changes and push them to the master branch gradually.
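
A `select 1` check like the one mentioned above could be scripted with PySpark's Spark Connect client (PySpark 3.4+). This is only a sketch under assumptions: `remote_url` and `smoke_test` are hypothetical helper names, and the host and port of a Kyuubi connect endpoint are not defined anywhere in this thread, so the default Spark Connect port 15002 is used as a placeholder.

```python
def remote_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect remote URL (15002 is Spark Connect's default port)."""
    return f"sc://{host}:{port}"


def smoke_test(url: str) -> bool:
    """Run `select 1` through a Spark Connect client and check the result."""
    # Imported lazily so this module loads even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        rows = spark.sql("select 1").collect()
        return len(rows) == 1 and rows[0][0] == 1
    finally:
        spark.stop()
```

Calling `smoke_test(remote_url("localhost"))` needs a reachable endpoint that speaks the Spark Connect protocol.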

@tigrulya-exe
Contributor

tigrulya-exe commented Aug 15, 2024

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

@pan3793
Member

pan3793 commented Aug 15, 2024

@tigrulya-exe I will share with you more details in the next one or two weeks.

@davidyuan1223
Contributor

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

Yeah, it's active; you can see PR #6412. We first need to verify the feasibility of this solution, but the latest spark-connect version, 3.5.1, has some issues, so I was waiting for the new 3.5.2 release (it has now been released). I will verify spark-connect 3.5.2 this week.

@pan3793
Member

pan3793 commented Aug 23, 2024

A quick and dirty version of Kyuubi Connect is available at #6642

@tigrulya-exe
Contributor

@pan3793 Hi! I checked your PoC and built it locally. I tried to run some queries using pyspark and they finished successfully, nice work! Now, I suggest creating a list of tasks that are required to complete this solution. These tasks include supporting all gRPC Spark Connect API methods and refactoring the current code to seamlessly integrate the PoC. This will allow us to work simultaneously and add functionality to the master branch more quickly.

Could you please share any changes that break the current thrift-based logic and any things that need to be refactored that you noticed during the implementation of this solution, so we can use this information as a starting point?
