[SPARK-32001][SQL] Create JDBC authentication provider developer API #29024
I am permanently getting the following exception on my console while running JDBC tests. Should it really be logged as an error?
Maybe log it as a warning?
We can decrease it to a warning. The main point is to notify the user.
Is it okay to ignore the error case where it fails to load builtin providers? DataSource throws an exception if it fails to load builtin datasources: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, lines 694 to 702 at 94d648d.
What do you mean by ignore? Providers must be loaded independently, so we need to catch and ignore the exception.
If the suggestion is to let the exception propagate, then I would say that's a bad idea. If one provider cannot be loaded, the rest must still go on. We've had a similar issue and the same expectation in the Hadoop delegation token area.
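A minimal sketch of the load-them-independently policy described above (the names here are illustrative, not Spark's actual internals): each provider is constructed in isolation, and a construction failure is logged as a warning and skipped instead of aborting the whole list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class IndependentProviderLoading {
    interface ConnectionProvider { String name(); }

    // Loads each provider independently; a broken provider is skipped
    // with a warning rather than failing the rest.
    static List<ConnectionProvider> loadProviders(List<Supplier<ConnectionProvider>> factories) {
        List<ConnectionProvider> loaded = new ArrayList<>();
        for (Supplier<ConnectionProvider> factory : factories) {
            try {
                loaded.add(factory.get());
            } catch (RuntimeException e) {
                // Warning, not error: notify the user but keep going.
                System.err.println("Failed to load a provider, skipping: " + e.getMessage());
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Supplier<ConnectionProvider>> factories = List.of(
            () -> () -> "good",                                        // loads fine
            () -> { throw new IllegalStateException("cannot init"); }  // broken provider
        );
        List<ConnectionProvider> loaded = loadProviders(factories);
        // Only the good provider survives; the broken one is skipped.
        System.out.println(loaded.size() + " provider(s) loaded: " + loaded.get(0).name());
    }
}
```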
Yeah, the policy looks okay for secure connections, but what about the basic one, BasicConnectionProvider? At least until Spark v3.0, creating basic JDBC connections did not fail because of a loading failure.
Here it's the same, since BasicConnectionProvider is built-in and does no pre-loading. The main issue would come when someone adds a new provider that cannot be loaded. That failure would make all the remaining workloads fail if we didn't load the providers independently.
#29968
😭
I believe this synchronization introduces a significant performance bottleneck for applications that rely on establishing many connections simultaneously for performance reasons. This change forces concurrency to 1 when establishing database connections for a given JDBC driver, and that strikes me as a significant, user-impacting change.
Can anyone propose a workaround for this? I have an app that makes connections to thousands of databases, and I can't upgrade to any version >= 3.1.x because of this significant bottleneck.
https://issues.apache.org/jira/browse/SPARK-37391
That screenshot only shows a handful of blocked threads, but in total I have 98 threads blocked waiting for the SecurityConfigurationLock. I do not have this issue when running Spark 3.0.1 with no code changes.
@gaborgsomogyi is getConnection() not thread-safe?
Or is there any possibility of finer-grained locking on the actual object that getConnection() is called on?
We could move the filteredProviders.head out of the synchronized block, but I doubt that does anything.
Simply, no.
This was well thought through initially, even though it introduces a bottleneck. The main problem is that the JVM has a single security context. In order to make a connection, one MUST modify the security context; otherwise the JDBC connector is not able to provide the security credentials (TGT, keytab, or whatever). Simply put, JDBC connectors can only pick up credentials from that single security context.
Since multiple connections are made concurrently (sometimes to the same DB type with different credentials), this must be synchronized to avoid ending up with a malformed security context (we ran a load test with 10+ DBs, corrupted the security context pretty fast without synchronization, and it was a horror to find out what was going on).
Workloads come and go; nobody can tell in advance what kind of security context needs to be created at JVM start-up. If we were to pre-bake the security context (not sure how?!), the following issues would come up:
I personally think the first one can't be solved; the second one could be cured with full JVM restarts, but I think that's just not user friendly.
If somebody has an excellent idea, I would like to hear it.
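A toy illustration of the race being described, with stand-ins for the JVM-wide security context and for the lock (this is not Spark's actual SecurityConfigurationLock, just a sketch of the same shape): every connection attempt must rewrite the single shared slot before connecting, so the rewrite-then-connect sequence has to be serialized.

```java
import java.util.ArrayList;
import java.util.List;

public class SecurityContextSync {
    // Stand-in for the JVM's single, mutable security (JAAS) context.
    static String currentSecurityContext = "";
    static final Object LOCK = new Object();  // plays the role of SecurityConfigurationLock

    // The shared slot must be rewritten before each connection, so the
    // whole rewrite-then-connect sequence is done under one lock.
    static String connectWith(String config) {
        synchronized (LOCK) {
            currentSecurityContext = config;   // mutate the shared slot...
            try { Thread.sleep(1); }           // ...window where a race would corrupt it
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            if (!currentSecurityContext.equals(config)) {
                throw new IllegalStateException("security context corrupted!");
            }
            return "connection(" + config + ")";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> results = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final String cfg = "db-" + i;
            Thread t = new Thread(() -> {
                String c = connectWith(cfg);
                synchronized (results) { results.add(c); }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        // All 8 connections succeed; without the lock, the corruption
        // check above would fire intermittently.
        System.out.println(results.size() + " connections established");
    }
}
```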
What I can imagine is to pre-set the JVM JAAS config for databases from a file, like this: java -Djava.security.auth.login.config=jaas.conf. And then a new flag could be introduced, like spark.jdbc.doNotSyncronize, which is default true. In that case security credentials MUSTN'T be provided via Spark's JDBC properties but only from the JAAS file.
However, this would be super risky and for advanced users only. That said, I've spent at least a month debugging what's going on in the JVM security context...
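For illustration, a JAAS file of the kind that -Djava.security.auth.login.config would point at might look like this (the entry name, keytab path, and principal below are hypothetical):

```
SparkJDBCEntry {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/app.keytab"
  principal="app@EXAMPLE.COM";
};
```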
I don't know anything about the underlying problem, but it had crossed my mind that making the synchronization optional could be one path forward.
@gaborgsomogyi, may I trouble you for some references related to this, so I can better familiarize myself with the problem you are describing?
Beyond this particular issue, what you've shared suggests that the concurrency utilized by my app could be causing us to cross-wire data, which would be a major problem.
I guess I'd also ask: is there more to it than you described? It sounds like I should either have some cross-wired data, or, if that's not the case, there is some missing piece of the puzzle that means the synchronization is not always required.
Thanks in advance!
Unless somebody has a groundbreaking idea, providers must be synced where security is enabled; we can free up some time when no authentication is in place.
@tdg5 or @maropu would you be able to find some time for fixing this?
The proposed solution seems straightforward to me and I'm happy to take a stab at it. Dumb question though: how would I eventually tweak the Spark class loader so it would detect a custom JdbcConnectionProvider in my library code? I tried baking such a class into the uberjar containing the driver app, but I didn't seem to be able to get ConnectionProvider to notice my clone of BasicConnectionProvider. I was guessing this was due to a class loader disagreement. My spark-submit looks roughly like so:
@tdg5 Thanks for lending a helping hand! For such scenarios I've created a readme here.
Please don't forget to update this doc accordingly.
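As a rough sketch of the kind of provider class being discussed, using a simplified stand-in for the developer API (the real abstract class is org.apache.spark.sql.jdbc.JdbcConnectionProvider in spark-sql and has more members; check the readme and your Spark version for the exact signatures, and note the MyConnectionProvider name here is hypothetical):

```java
import java.sql.Connection;
import java.sql.Driver;
import java.util.Map;
import java.util.Properties;

public class CustomProviderSketch {
    // Simplified stand-in for Spark's JdbcConnectionProvider developer API.
    interface JdbcConnectionProvider {
        boolean canHandle(Driver driver, Map<String, String> options);
        Connection getConnection(Driver driver, Map<String, String> options) throws Exception;
    }

    // Hypothetical custom provider, mirroring what a BasicConnectionProvider
    // clone might do: handle connections that carry no Kerberos options.
    static class MyConnectionProvider implements JdbcConnectionProvider {
        @Override
        public boolean canHandle(Driver driver, Map<String, String> options) {
            return !options.containsKey("keytab") && !options.containsKey("principal");
        }

        @Override
        public Connection getConnection(Driver driver, Map<String, String> options) throws Exception {
            Properties props = new Properties();
            options.forEach(props::setProperty);
            return driver.connect(options.get("url"), props);
        }
    }

    public static void main(String[] args) {
        MyConnectionProvider provider = new MyConnectionProvider();
        // Handles a plain URL, declines Kerberos-flavored options.
        System.out.println(provider.canHandle(null, Map.of("url", "jdbc:h2:mem:test")));
        System.out.println(provider.canHandle(null, Map.of("keytab", "/k", "principal", "p")));
    }
}
```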
@gaborgsomogyi thanks! The META-INF manifest is what I was missing 👍🏻
Here's my stab at what you suggested: #34745
I marked it a WIP, but it is complete as far as I am concerned and aware. I just want to make sure it matches up with what you had in mind before I open it up for broader review.
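For anyone else who hits the same discovery problem: provider loading goes through Java's ServiceLoader, so the uberjar needs a service registration file under META-INF/services (the implementation class name below is hypothetical):

```
# File: META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider
# Content: fully-qualified name of your provider implementation, one per line.
com.example.MyConnectionProvider
```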