
Updated RequestHandler to handle read failures #1081


Closed · wants to merge 1 commit

Conversation


mwlon commented Aug 15, 2018

Using the Spark Cassandra connector, I kept getting errors like the one pasted below, despite a generous retry policy. I traced them back to this repo, which immediately rethrows any READ_FAILURE exception instead of consulting the retry policy. I believe this change fixes it, but please check.

java.io.IOException: Exception during execution of SELECT (omitted): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:350)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
	at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:85)
	at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:27)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:37)
	at com.sun.proxy.$Proxy19.execute(Unknown Source)
	at com.datastax.spark.connector.cql.DefaultScanner.scan(Scanner.scala:34)
	at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:342)
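
(For context, a "generous retry policy" at the connector level means something like the sketch below, i.e. raising the connector's retry count well above its default; the class name and host are placeholders, and the property keys are the connector's standard retry/consistency settings as best I can tell. As discussed further down, this does not help here, because the read failure never reaches the retry policy at all.)

```java
import org.apache.spark.SparkConf;

public final class RetryConfSketch {
  // Sketch only: raise the connector's retry count well above the default
  // (host and app name are placeholders). This does not help with
  // READ_FAILURE, because the driver rethrows it before any retry policy
  // is consulted.
  public static SparkConf sparkConf() {
    return new SparkConf()
        .setAppName("cassandra-scan")
        .set("spark.cassandra.connection.host", "cassandra-host")
        .set("spark.cassandra.input.consistency.level", "QUORUM")
        .set("spark.cassandra.query.retry.count", "60");
  }
}
```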

@datastax-bot

Hi @mwlon, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

Sincerely,
DataStax Bot.


mwlon commented Aug 15, 2018

It looks like the contribution license agreement page is broken. I get:

Your connection is not secure

The owner of cla.datastax.com has configured their website improperly.

@tolbertam (Contributor)

Thanks @mwlon, we'll look into getting the certificate fixed; it looks like it's reporting the wrong common names for some reason.

@tolbertam (Contributor)

With regards to handling read and write failures, it does appear that we don't allow addressing either of them with the retry policy. I'm not completely sure whether this is intentional; I'll see what others think. I know that a common cause of ReadFailure is a TombstoneOverwhelmingException. In that case retrying may not improve things, but on the other hand the coordinator may choose a better replica that does not surface a failure.

At the very least, I think we could surface those to RetryPolicy.onRequestError so the user has some means of dictating retries. I think routing them through onReadTimeout / onWriteTimeout may be overloading their use, and implementors of RetryPolicy may not be accounting for that, so we should either use onRequestError or consider adding new API methods (e.g. onReadFailure) for them.
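
For illustration, a user-supplied policy along those lines could look roughly like the sketch below, written against the driver 3.x RetryPolicy interface. The class name and retry bound are made up for the example, and the onRequestError branch only matters once ReadFailureException actually starts flowing into it:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.exceptions.ReadFailureException;
import com.datastax.driver.core.policies.RetryPolicy;

// Hypothetical policy: retries a bounded number of times when the coordinator
// reports a read failure, in the hope that the retry lands on a healthier
// replica. Everything else in onRequestError is rethrown.
public class RetryReadFailurePolicy implements RetryPolicy {

  private static final int MAX_RETRIES = 3; // arbitrary bound for the example

  @Override
  public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
                                      DriverException e, int nbRetry) {
    // Only meaningful once the driver starts routing ReadFailureException here.
    if (e instanceof ReadFailureException && nbRetry < MAX_RETRIES) {
      return RetryDecision.tryNextHost(cl);
    }
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
                                     int required, int received,
                                     boolean dataRetrieved, int nbRetry) {
    // Mirrors the usual "retry once if enough replicas responded" behaviour.
    return nbRetry == 0 && received >= required && !dataRetrieved
        ? RetryDecision.retry(cl)
        : RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                      int requiredAcks, int receivedAcks, int nbRetry) {
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
                                     int requiredReplica, int aliveReplica, int nbRetry) {
    return nbRetry == 0 ? RetryDecision.tryNextHost(cl) : RetryDecision.rethrow();
  }

  @Override
  public void init(Cluster cluster) {}

  @Override
  public void close() {}
}
```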


mwlon commented Aug 15, 2018

Thanks for taking a look @tolbertam. Please let me know if I can help in any way; getting this resolved is a high priority for me.

@tolbertam (Contributor)

Hi @mwlon. I went ahead and logged JAVA-1944 to track this issue; we are still considering how we should resolve it.

Also, with regards to https://cla.datastax.com not having a valid cert: we just fixed this (it may take a little while for our DNS change to propagate). Thanks for reporting that issue!


tolbertam commented Aug 17, 2018

@mwlon So I talked to a few people and we decided the right thing to do was:

  1. Pass ReadFailureException and WriteFailureException to onRequestError to allow it to be considered for retry.
  2. Update DefaultRetryPolicy.onRequestError to rethrow these exceptions by default, as in general I think this is the right thing to do.

This will give users the ability to retry these exceptions, but they will not be retried by default.

However, this won't completely fix things for you, as the Spark connector's retry policy implementation only retries read timeouts, write timeouts, and unavailables. I see in SPARKC-507 that they do not intend to retry on ReadFailureException. That said, I think you can work around this by implementing your own CassandraConnectionFactory (specified via connection.factory) and providing your own RetryPolicy implementation. Let me know if you have any questions about that.
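
As a rough sketch of that workaround: the core of a custom connection factory is just building the Cluster with your own retry policy; the connector is then pointed at the factory class via the connection.factory setting mentioned above (spark.cassandra.connection.factory in its full form). The helper class, contact point, and RetryReadFailurePolicy below are placeholders from the earlier sketch, not connector API:

```java
import com.datastax.driver.core.Cluster;

// Hypothetical helper showing the piece a custom CassandraConnectionFactory
// would be responsible for: building the Cluster with the retry policy from
// the earlier sketch so read failures can be retried.
public final class ClusterFactorySketch {
  public static Cluster buildCluster(String contactPoint) {
    return Cluster.builder()
        .addContactPoint(contactPoint)                 // placeholder contact point
        .withRetryPolicy(new RetryReadFailurePolicy()) // hypothetical policy from the sketch above
        .build();
  }
}
```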


mwlon commented Aug 17, 2018

@tolbertam that's reasonable. Do you expect ReadFailureException and WriteFailureException to be made available to onRequestError in the next release (3.5.2)?

@tolbertam (Contributor)

@mwlon We were planning on targeting this for 3.6.0, which we are wrapping up work on. Since this is a behavior change, we'd like to avoid putting it in a hotfix release.

mwlon closed this Aug 17, 2018