
Updated RequestHandler to handle read failures #1081


Closed · wants to merge 1 commit

Conversation


mwlon commented Aug 15, 2018

Using the Spark Cassandra connector, I kept getting errors like the one pasted below, despite a generous retry policy. I traced them back to this repo, which immediately rethrows any READ_FAILURE exception instead of consulting the retry policy. I believe this change fixes it, but please check.

java.io.IOException: Exception during execution of SELECT (omitted): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:350)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
	at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:85)
	at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:27)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:37)
	at com.sun.proxy.$Proxy19.execute(Unknown Source)
	at com.datastax.spark.connector.cql.DefaultScanner.scan(Scanner.scala:34)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:342)
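
(For context, a "generous retry policy" at the connector level means something like the sketch below, i.e. raising the connector's retry count well above its default; the class name and host are placeholders, and the property keys are the connector's standard retry/consistency settings as best I can tell. As discussed further down, this does not help here, because the read failure never reaches the retry policy at all.)

```java
import org.apache.spark.SparkConf;

public final class RetryConfSketch {
  // Sketch only: raise the connector's retry count well above the default
  // (host and app name are placeholders). This does not help with
  // READ_FAILURE, because the driver rethrows it before any retry policy
  // is consulted.
  public static SparkConf sparkConf() {
    return new SparkConf()
        .setAppName("cassandra-scan")
        .set("spark.cassandra.connection.host", "cassandra-host")
        .set("spark.cassandra.input.consistency.level", "QUORUM")
        .set("spark.cassandra.query.retry.count", "60");
  }
}
```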

@datastax-bot

Hi @mwlon, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

Sincerely,
DataStax Bot.


mwlon commented Aug 15, 2018

It looks like the contribution license agreement page is broken. I get:

Your connection is not secure

The owner of cla.datastax.com has configured their website improperly.

@tolbertam (Contributor)

Thanks @mwlon, we'll look into getting the certificate fixed; it looks like it's reporting the wrong common names for some reason.

@tolbertam (Contributor)

With regards to handling read and write failures, it does appear that we don't allow addressing either of them with the retry policy. I'm not completely sure whether this is intentional; I'll see what others think. I know that a common cause of ReadFailure is a TombstoneOverwhelmingException. In that case retrying may not improve things, but on the other hand the coordinator may choose a better replica that does not surface a failure.

At the very least, I think we could surface those to RetryPolicy.onRequestError so the user has some means of dictating retries. I think routing them through onReadTimeout / onWriteTimeout may be overloading their use, and implementors of RetryPolicy may not be accounting for that, so we should either use onRequestError or consider adding new API methods (e.g. onReadFailure) for them.
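
For illustration, a user-supplied policy along those lines could look roughly like the sketch below, written against the driver 3.x RetryPolicy interface. The class name and retry bound are made up for the example, and the onRequestError branch only matters once ReadFailureException actually starts flowing into it:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.exceptions.ReadFailureException;
import com.datastax.driver.core.policies.RetryPolicy;

// Hypothetical policy: retries a bounded number of times when the coordinator
// reports a read failure, in the hope that the retry lands on a healthier
// replica. Everything else in onRequestError is rethrown.
public class RetryReadFailurePolicy implements RetryPolicy {

  private static final int MAX_RETRIES = 3; // arbitrary bound for the example

  @Override
  public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
                                      DriverException e, int nbRetry) {
    // Only meaningful once the driver starts routing ReadFailureException here.
    if (e instanceof ReadFailureException && nbRetry < MAX_RETRIES) {
      return RetryDecision.tryNextHost(cl);
    }
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
                                     int required, int received,
                                     boolean dataRetrieved, int nbRetry) {
    // Mirrors the usual "retry once if enough replicas responded" behaviour.
    return nbRetry == 0 && received >= required && !dataRetrieved
        ? RetryDecision.retry(cl)
        : RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                      int requiredAcks, int receivedAcks, int nbRetry) {
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
                                     int requiredReplica, int aliveReplica, int nbRetry) {
    return nbRetry == 0 ? RetryDecision.tryNextHost(cl) : RetryDecision.rethrow();
  }

  @Override
  public void init(Cluster cluster) {}

  @Override
  public void close() {}
}
```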


mwlon commented Aug 15, 2018

Thanks for taking a look @tolbertam. Please let me know if I can help in any way; getting this resolved is a high priority for me.

@tolbertam (Contributor)

Hi @mwlon. I went ahead and logged JAVA-1944 to track this issue; we are still considering how we should resolve it.

Also, with regards to https://cla.datastax.com not having a valid cert: we just fixed this (it may take a little while for our DNS change to propagate). Thanks for reporting that issue!


tolbertam commented Aug 17, 2018

@mwlon So I talked to a few people and we decided the right thing to do was:

  1. Pass ReadFailureException and WriteFailureException to onRequestError to allow it to be considered for retry.
  2. Update DefaultRetryPolicy.onRequestError to rethrow these exceptions by default, as in general I think this is the right thing to do.

This will give users the ability to retry these exceptions, but they will not be retried by default.

However, this won't completely fix things for you, as the Spark connector's retry policy implementation only retries read timeouts, write timeouts, and unavailables. I see in SPARKC-507 that they do not intend to retry on ReadFailureException. That said, I think you can work around this by implementing your own CassandraConnectionFactory (specified via connection.factory) and providing your own RetryPolicy implementation. Let me know if you have any questions about that.
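
As a rough sketch of that workaround: the core of a custom connection factory is just building the Cluster with your own retry policy; the connector is then pointed at the factory class via the connection.factory setting mentioned above (spark.cassandra.connection.factory in its full form). The helper class, contact point, and RetryReadFailurePolicy below are placeholders from the earlier sketch, not connector API:

```java
import com.datastax.driver.core.Cluster;

// Hypothetical helper showing the piece a custom CassandraConnectionFactory
// would be responsible for: building the Cluster with the retry policy from
// the earlier sketch so read failures can be retried.
public final class ClusterFactorySketch {
  public static Cluster buildCluster(String contactPoint) {
    return Cluster.builder()
        .addContactPoint(contactPoint)                 // placeholder contact point
        .withRetryPolicy(new RetryReadFailurePolicy()) // hypothetical policy from the sketch above
        .build();
  }
}
```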


mwlon commented Aug 17, 2018

@tolbertam that's reasonable. Do you expect ReadFailureException and WriteFailureException to be made available to onRequestError in the next release (3.5.2)?

@tolbertam (Contributor)

@mwlon We were planning on targeting this for 3.6.0, which we are wrapping up work on. Since this is a behavior change, we'd like to avoid putting it in a hotfix release.

mwlon closed this Aug 17, 2018