feat(csharp): Implement CloudFetch for Databricks Spark driver#2634

Merged
CurtHagenlocher merged 10 commits into
apache:main from
jadewang-db:add-cloudfetch-to-spark-adbc
Mar 31, 2025

@jadewang-db

Contributor

Initial implementation of the CloudFetch feature in the Databricks Spark driver.

  • Create a new CloudFetchReader to handle downloading and decompressing CloudFetch files.
  • Add test cases for small and large results.

Follow-up changes planned after this PR:

  • Add prefetching to the downloader
  • Renew expired presigned URLs
  • Add retries

@github-actions github-actions Bot added this to the ADBC Libraries 18 milestone Mar 20, 2025
@davidhcoe

Contributor

I don’t think we want to do this in the Spark driver, do we?

@jadewang-db

Contributor Author

> I don’t think we want to do this in the Spark driver, do we?

The change should be backward compatible. How can I run the tests?

@CurtHagenlocher CurtHagenlocher left a comment

Contributor

Thanks for the change! I've left some feedback.

Comment thread csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs Outdated
Comment thread csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs
Comment thread csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs Outdated
Comment thread csharp/src/Drivers/Apache/Spark/SparkDatabricksConnection.cs Outdated
Comment thread csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs Outdated
MemoryStream dataStream;

// If the data is LZ4 compressed, decompress it
if (this.isLz4Compressed)

Contributor

Is it possible to leverage the Apache.Arrow.Compression assembly to do decompression? It works by passing a CompressionCodecFactory to the ArrowStreamReader constructor.

Contributor Author

Do you have any code pointers? I tried it, but it does not seem to work.

Contributor

I don't, no. I can try to figure it out later; this doesn't need to be blocking.
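
For readers following this thread, here is a minimal sketch of the suggestion above: passing a CompressionCodecFactory from Apache.Arrow.Compression to the ArrowStreamReader constructor. The class name CompressedArrowExample and the method CountRowsAsync are hypothetical and not part of this PR. One caveat, stated as an assumption rather than something confirmed in this thread: the codec factory only handles compression declared inside the Arrow IPC messages themselves. If a CloudFetch file is instead LZ4-compressed as a whole, the reader would not see a valid Arrow stream, which might explain the result reported above.

```csharp
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Compression;
using Apache.Arrow.Ipc;

// Hypothetical helper for illustration only; not code from this PR.
public static class CompressedArrowExample
{
    // Reads every record batch from an Arrow IPC stream and returns the total row count.
    public static async Task<long> CountRowsAsync(Stream arrowStream)
    {
        // The factory supplies decompressors for buffers that the IPC
        // message metadata marks as compressed.
        var codecFactory = new CompressionCodecFactory();
        using var reader = new ArrowStreamReader(arrowStream, codecFactory);

        long rows = 0;
        RecordBatch batch;
        // ReadNextRecordBatchAsync returns null once the stream is exhausted.
        while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
        {
            using (batch)
            {
                rows += batch.Length;
            }
        }
        return rows;
    }
}
```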

Comment thread csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs Outdated
Comment thread csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs Outdated
Comment thread csharp/test/Drivers/Apache/Apache.Arrow.Adbc.Tests.Drivers.Apache.csproj Outdated

@CurtHagenlocher CurtHagenlocher left a comment

Contributor

The file artifacts/Apache.Arrow.Adbc.TetsDrivers/Apache/Debug/net8.0/.msCoverageSourceRootsMapping_Apache.Arrow.Adbc.Tests.Drivers.Apache is still present in the latest iteration. Could you please remove it from the PR?

Looks fine to me otherwise. One thing we might consider in a future change is to (configurably) fetch more than one link in parallel in order to maximize throughput.
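
The parallel-fetch idea mentioned above could look roughly like the sketch below, which downloads several links at once with a configurable cap. It is an illustration under stated assumptions, not the driver's implementation: the names ParallelLinkFetcher, FetchAllAsync, and maxParallelDownloads are hypothetical, and the links are assumed to be plain HTTP GET URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not code from this PR.
public static class ParallelLinkFetcher
{
    // Downloads all URLs with at most maxParallelDownloads requests in flight.
    // Results are returned in the same order as the input URLs.
    public static async Task<byte[][]> FetchAllAsync(
        HttpClient client,
        IReadOnlyList<string> urls,
        int maxParallelDownloads,
        CancellationToken cancellationToken = default)
    {
        if (maxParallelDownloads < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxParallelDownloads));
        }

        // The semaphore caps the number of concurrent downloads.
        using var gate = new SemaphoreSlim(maxParallelDownloads);

        async Task<byte[]> FetchOneAsync(string url)
        {
            await gate.WaitAsync(cancellationToken).ConfigureAwait(false);
            try
            {
                using var response = await client.GetAsync(url, cancellationToken).ConfigureAwait(false);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
            }
            finally
            {
                gate.Release();
            }
        }

        // Task.WhenAll preserves input order in the returned array.
        return await Task.WhenAll(urls.Select(FetchOneAsync)).ConfigureAwait(false);
    }
}
```

This sketch holds every downloaded file in memory at once. A production reader would more likely bound the amount of buffered data and hand files to the consumer in order as they complete.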

@CurtHagenlocher CurtHagenlocher left a comment

Contributor

Thanks!

@CurtHagenlocher CurtHagenlocher merged commit 9eee98d into apache:main Mar 31, 2025
colin-rogers-dbt pushed a commit to dbt-labs/arrow-adbc that referenced this pull request Jun 10, 2025
…e#2634)

3 participants