
Rows instance failed to retrieve results not bubbled up to Rows.Error() #254

@mdibaiee

Hello,

We recently noticed an issue in our integration with Databricks, which is built on this driver: we would run a large SELECT query and sometimes not receive all of the rows it produced, with no error reported. This later led to logical errors in our code, because we had silently read only part of the results.

Description

  1. A large SELECT query is run whose result set is large enough (gigabytes of data) that transmitting it over the network takes a long time.
  2. Sometimes one of the calls to rows.Next() hangs for about 4 minutes and 9 seconds.
  3. An error is logged to stderr at the same time:
{"level":"error","connId":"01eff9bf-1045-11e2-82d6-bdb9473d9de6","corrId":"","queryId":"01eff9bf-105c-1b34-8fce-253380a1ff36","error":"databricks: request error: fetch results request error: *cli_service.TFetchResultsResp error reading struct: *cli_service.TRowSet error reading struct: *cli_service.TSparkArrowBatch error reading struct: error reading field 1: read tcp 10.128.0.48:56114->44.234.192.42:443: read: connection reset by peer","time":"2025-03-05T12:47:45Z","message":"databricks: Rows instance failed to retrieve results"}
  4. However, the call to rows.Err() right after the rows.Next() loop returns nil, so the SELECT query is assumed to have succeeded even though not all of the results were retrieved.
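To illustrate why the consumer cannot detect the failure: database/sql treats an io.EOF returned from a driver's Rows.Next as a clean end of the result set, so rows.Err() stays nil. The toy driver below is purely hypothetical (it is not the real Databricks driver), but it swallows a mid-stream fetch error into io.EOF in the same way, and the standard consumption pattern sees a short, "successful" result set:

```go
package main

// Toy illustration (not the real driver): database/sql treats io.EOF from a
// driver's Rows.Next as a clean end of the result set, so if the driver maps
// a mid-stream fetch failure to io.EOF, rows.Err() reports nil.

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"io"
)

type faultyRows struct{ n int }

func (r *faultyRows) Columns() []string { return []string{"id"} }
func (r *faultyRows) Close() error      { return nil }
func (r *faultyRows) Next(dest []driver.Value) error {
	r.n++
	if r.n > 3 {
		// A network error happened here, but it is swallowed into io.EOF,
		// mimicking the behavior described above.
		return io.EOF
	}
	dest[0] = int64(r.n)
	return nil
}

type stmt struct{}

func (s *stmt) Close() error  { return nil }
func (s *stmt) NumInput() int { return 0 }
func (s *stmt) Exec(args []driver.Value) (driver.Result, error) {
	return nil, errors.New("not supported")
}
func (s *stmt) Query(args []driver.Value) (driver.Rows, error) { return &faultyRows{}, nil }

type conn struct{}

func (c *conn) Prepare(q string) (driver.Stmt, error) { return &stmt{}, nil }
func (c *conn) Close() error                          { return nil }
func (c *conn) Begin() (driver.Tx, error)             { return nil, errors.New("not supported") }

type drv struct{}

func (d *drv) Open(name string) (driver.Conn, error) { return &conn{}, nil }

func init() { sql.Register("faulty", &drv{}) }

// consume reads all rows and returns the row count and rows.Err().
func consume() (int, error) {
	db, err := sql.Open("faulty", "")
	if err != nil {
		return 0, err
	}
	defer db.Close()
	rows, err := db.Query("SELECT id")
	if err != nil {
		return 0, err
	}
	defer rows.Close()
	n := 0
	for rows.Next() {
		n++
	}
	return n, rows.Err()
}

func main() {
	n, err := consume()
	// Only 3 rows arrive and err is nil: the caller cannot tell the
	// stream was cut short.
	fmt.Println(n, err)
}
```

From the consumer's point of view this is indistinguishable from a query that legitimately produced three rows, which is exactly the failure mode reported above.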

I believe the bug lies in the interaction between these two pieces of code:

func (rpf *resultPageIterator) HasNext() bool {
	if rpf.isFinished && rpf.nextResultPage == nil {
		// There are no more pages to load and there isn't an already fetched
		// page waiting to retrieved by Next()
		rpf.err = io.EOF
		return false
	}
	// If there isn't an already fetched result page try to fetch one now
	if rpf.nextResultPage == nil {
		nrp, err := rpf.getNextPage()
		if err != nil {
			rpf.Close()
			rpf.isFinished = true
			rpf.err = err
			return false
		}
		// ...
and this call site:

https://github.com/databricks/databricks-sql-go/blob/main/internal/rows/rows.go#L459-L461

Basically, HasNext() returns false when there is an error fetching results, and fetchPageResult first checks HasNext(); if that returns false, it returns io.EOF regardless of the underlying error.

Reproduction

The script to reproduce this is below. Because the issue only happens intermittently, the script runs in an infinite loop and logs whenever it detects the bug.

To prepare Databricks setup for running this script:

  1. Write a file with 16KB of random data and upload it to a volume on Databricks (the file's purpose is to enlarge the query response so that it takes a long time to transfer over the network). To generate the file I used:
dd if=/dev/urandom bs=786438 count=1 | base64 | tr -d '\n' > 16rand
  2. Create a synthetic table in Databricks with 1000 rows (ids 0 through 999); you can use the query below:
CREATE TABLE sample_data(id BIGINT);
-- Equivalent to inserting the literal values (0) through (999):
INSERT INTO sample_data SELECT explode(sequence(0, 999)) AS id;

Then run the script below after updating the Databricks connection URI and the volume path to point at your 16KB random file:

package main

import (
	"context"
	stdsql "database/sql"
	"fmt"
	"strings"
	"time"
	"unsafe"

	_ "github.com/databricks/databricks-sql-go"
)

func main() {
	for {
		// Each iteration runs in a closure so the deferred Close calls
		// release the connection and rows before the next iteration,
		// instead of accumulating until main exits.
		func() {
			fmt.Printf("%s sending query\n", time.Now().UTC().Format(time.RFC3339))
			ctx := context.Background()
			db, err := stdsql.Open("databricks", "token:<YOUR_TOKEN>@<YOUR_WAREHOUSE><YOUR_HTTP_PATH>?catalog=main&schema=test&userAgentEntry=EstuaryTechnologiesFlow")
			if err != nil {
				panic(fmt.Errorf("sql.Open: %w", err))
			}
			defer db.Close()

			var queries []string
			iterations := 500
			for i := 0; i < iterations; i++ {
				queries = append(queries, fmt.Sprintf("SELECT %d, (SELECT value from text.`/Volumes/main/test/sample_vol/16rand`) || id FROM main.test.sample_data", i))
			}

			// Issue a UNION ALL of the per-iteration queries and read back
			// all of the results, counting the rows as we go.
			unionQuery := strings.Join(queries, "\nUNION ALL\n")
			rows, err := db.QueryContext(ctx, unionQuery)
			if err != nil {
				panic(fmt.Errorf("querying Load documents: %w", err))
			}
			defer rows.Close()

			count := 0
			var readBytes uintptr
			for rows.Next() {
				fmt.Printf("%s %d (%d) ", time.Now().UTC().Format(time.RFC3339), count, readBytes)
				var binding int
				var text string
				count++

				if err = rows.Scan(&binding, &text); err != nil {
					panic(fmt.Errorf("scanning Load document: %w", err))
				} else if binding > -1 {
					fmt.Printf("[%d] \n", binding)
				}
				readBytes += unsafe.Sizeof(binding) + unsafe.Sizeof(text)
			}

			if err = rows.Err(); err != nil {
				panic(fmt.Errorf("querying Loads: %w", err))
			}

			if count < iterations*1000 {
				fmt.Printf("%s BUG! read %d documents\n", time.Now().UTC().Format(time.RFC3339), count)
			} else {
				fmt.Printf("%s read %d documents\n", time.Now().UTC().Format(time.RFC3339), count)
			}
		}()
	}
}

After a while, you will see log lines like the following in the output (note the roughly 4-minute gap before each BUG! line):

.
.
.
2025-03-05T12:43:36Z 110997 (2663928) [110] 
2025-03-05T12:47:45Z BUG! read 110998 documents
.
.
.
2025-03-05T12:50:30Z 155997 (3743928) [155] 
2025-03-05T12:54:39Z BUG! read 155998 documents
