Conversation

@allisonwang-db (Owner) commented on Jul 22, 2025

Summary

  • Added a new Arrow data source for high-performance reading of Arrow and Parquet files
  • Added comprehensive project documentation (CLAUDE.md) for development guidance
  • Enhanced project structure with updated exports and release notes

Usage and Examples

>>> from pyspark_datasources import ArrowDataSource
>>> spark.dataSource.register(ArrowDataSource)

Read a single Arrow file:

>>> df = spark.read.format("arrow").load("/path/to/employees.arrow")
>>> df.show()
+---+-----------+---+-------+----------+------+
| id|       name|age| salary|department|active|
+---+-----------+---+-------+----------+------+
|  1|Alice Smith| 28|65000.0|      Tech|  true|
+---+-----------+---+-------+----------+------+

Read multiple files with glob pattern (creates multiple partitions):

>>> df = spark.read.format("arrow").load("/data/sales/sales_*.arrow")
>>> df.show()
>>> print(f"Number of partitions: {df.rdd.getNumPartitions()}")
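
A directory path works the same way; the example below is a hedged illustration (the /data/sales directory is hypothetical):

>>> df = spark.read.format("arrow").load("/data/sales")  # directory containing .arrow files
>>> print(f"Number of partitions: {df.rdd.getNumPartitions()}")  # one partition per file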

Changes

  • New Arrow Data Source: Efficient reading of Arrow and Parquet files with automatic partitioning
  • Project Documentation: Added CLAUDE.md with development guidelines, API specs, and usage patterns
  • Enhanced Exports: Updated __init__.py to include ArrowDataSource
  • Test Coverage: Comprehensive tests for Arrow data source functionality
  • Release Updates: Updated RELEASE.md for version 0.17.0

Test Plan

  • Arrow data source reads single files correctly
  • Arrow data source handles partitioned datasets
  • Schema inference works for both Arrow and Parquet formats
  • All existing tests continue to pass
  • Manual testing with various Arrow/Parquet files

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Introduced support for reading Apache Arrow IPC files with a new data source, enabling efficient loading of single files, directories, or glob patterns into Spark.
  • Documentation
    • Added a comprehensive project documentation file with detailed usage patterns, API specifications, and troubleshooting guidance.
    • Updated release notes with instructions for resolving PyArrow compatibility issues on macOS.
  • Tests
    • Added integration tests to verify reading of single and multiple Arrow files using the new data source.

@coderabbitai (bot) commented on Jul 22, 2025

Walkthrough

A new PySpark data source for reading Apache Arrow IPC files was implemented and integrated into the package. Project documentation and API specifications were added. Release notes were updated with a troubleshooting section for PyArrow/macOS issues. Two new tests were introduced to validate the Arrow data source with both single and multiple files.

Changes

Cohort / File(s) — Change Summary

  • Project Documentation (CLAUDE.md): Added comprehensive documentation detailing project overview, tech stack, directory structure, data source connectors, usage patterns, API specification, implementation requirements, and docstring guidelines.
  • Release Notes (RELEASE.md): Added troubleshooting section for PyArrow compatibility issues on macOS, including environment variable workaround and usage instructions.
  • Arrow Data Source Implementation (pyspark_datasources/arrow.py, pyspark_datasources/__init__.py): Introduced ArrowDataSource and ArrowDataSourceReader classes for reading Arrow IPC files in Spark, with schema inference, partitioning, and error handling. Updated the package __init__.py to expose the new data source.
  • Arrow Data Source Tests (tests/test_data_sources.py): Added two tests to verify reading single and multiple Arrow files using the new data source, including data validation and resource cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SparkSession
    participant ArrowDataSource
    participant ArrowDataSourceReader
    participant FileSystem

    User->>SparkSession: Register ArrowDataSource
    User->>SparkSession: spark.read.format("arrow").load("path")
    SparkSession->>ArrowDataSource: Instantiate with options
    ArrowDataSource->>FileSystem: Resolve Arrow file paths
    ArrowDataSource->>ArrowDataSource: Infer schema from first file
    ArrowDataSource->>ArrowDataSourceReader: Create reader with schema
    ArrowDataSourceReader->>FileSystem: List Arrow files
    loop For each file
        ArrowDataSourceReader->>FileSystem: Open Arrow file
        ArrowDataSourceReader->>SparkSession: Yield RecordBatches
    end
    SparkSession->>User: Return DataFrame
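
To make the flow above concrete, here is a minimal, hedged sketch of a source/reader pair following this sequence. It is simplified (file resolution is glob-only and error handling is omitted), so it shows the shape of the implementation rather than the committed code; from_arrow_schema is the internal PySpark helper mentioned in the review notes.

# Simplified sketch of the diagrammed flow; not the PR's exact implementation.
import glob
from typing import Iterator, List

import pyarrow as pa
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
from pyspark.sql.pandas.types import from_arrow_schema  # internal PySpark helper
from pyspark.sql.types import StructType


class ArrowDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "arrow"

    def schema(self) -> StructType:
        # Infer the Spark schema from the first resolved file.
        first_file = sorted(glob.glob(self.options["path"]))[0]
        return from_arrow_schema(pa.ipc.open_file(first_file).schema)

    def reader(self, schema: StructType) -> "ArrowDataSourceReader":
        return ArrowDataSourceReader(self.options)


class ArrowDataSourceReader(DataSourceReader):
    def __init__(self, options: dict) -> None:
        self.options = options

    def partitions(self) -> List[InputPartition]:
        # One InputPartition per file enables parallel reads.
        return [InputPartition(f) for f in sorted(glob.glob(self.options["path"]))]

    def read(self, partition: InputPartition) -> Iterator[pa.RecordBatch]:
        # Yielding Arrow RecordBatches lets Spark consume the data directly.
        reader = pa.ipc.open_file(partition.value)
        for i in range(reader.num_record_batches):
            yield reader.get_batch(i)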

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Poem

In fields of data, Arrow flies,
Sparkling streams before my eyes.
One file, or many—hop and read,
Partitioned rows for every need.
Docs and tests now join the show,
With PyArrow’s help, we make it so.
🐇✨ Data hops, onward we go!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19b1480 and 05ebc9d.

📒 Files selected for processing (5)
  • CLAUDE.md (1 hunks)
  • RELEASE.md (1 hunks)
  • pyspark_datasources/__init__.py (1 hunks)
  • pyspark_datasources/arrow.py (1 hunks)
  • tests/test_data_sources.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • RELEASE.md
🧰 Additional context used
🪛 Ruff (0.12.2)
pyspark_datasources/arrow.py

1-1: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


161-161: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

tests/test_data_sources.py

101-101: ArrowDataSource may be undefined, or defined from star imports

(F405)


139-139: ArrowDataSource may be undefined, or defined from star imports

(F405)

pyspark_datasources/__init__.py

1-1: .arrow.ArrowDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

🪛 markdownlint-cli2 (0.17.2)
CLAUDE.md

178-178: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (12)
pyspark_datasources/__init__.py (1)

1-1: LGTM! Import correctly exposes the new ArrowDataSource.

The import follows the established pattern and enables users to import ArrowDataSource directly from the package. The static analysis hint about unused import is a false positive - this is the standard way to expose classes in __init__.py files.

tests/test_data_sources.py (3)

2-4: LGTM! Proper imports added for Arrow testing.

The imports are appropriate for the new Arrow data source tests, adding necessary modules for file handling and PyArrow operations.


99-135: Excellent test coverage for single Arrow file reading.

The test thoroughly validates:

  • Data source registration
  • Schema inference and column validation
  • Data integrity with row count and content verification
  • Proper resource cleanup with try/finally

The test follows good practices with temporary file management and comprehensive assertions.


137-181: Comprehensive test for multiple Arrow files with partitioning validation.

This test excellently covers:

  • Directory-based file discovery
  • Combined data from multiple files (4 total rows from 2 files)
  • Partition verification ensuring one partition per file
  • Schema consistency across files
  • Proper cleanup with TemporaryDirectory context manager

The partitioning assertion on Line 175 is particularly valuable for validating the data source's parallel processing capability.
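
For reference, a condensed, hedged sketch of the multi-file test pattern described above (file names and row values are illustrative; the real test lives in tests/test_data_sources.py):

# Illustrative sketch of the multi-file test pattern; not the PR's test code.
import os
import tempfile

import pyarrow as pa
import pyarrow.ipc as ipc

from pyspark_datasources import ArrowDataSource


def test_arrow_multiple_files(spark):  # `spark` is the suite's session fixture
    spark.dataSource.register(ArrowDataSource)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write two small Arrow IPC files, two rows each.
        for i in range(2):
            table = pa.table({"id": [2 * i, 2 * i + 1], "name": [f"row{2 * i}", f"row{2 * i + 1}"]})
            with ipc.new_file(os.path.join(tmp_dir, f"part_{i}.arrow"), table.schema) as writer:
                writer.write_table(table)

        df = spark.read.format("arrow").load(tmp_dir)
        assert df.count() == 4                    # combined rows from both files
        assert df.rdd.getNumPartitions() == 2     # one partition per file
        assert set(df.columns) == {"id", "name"}  # consistent schema across files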

pyspark_datasources/arrow.py (4)

10-93: Excellent comprehensive documentation and class design.

The ArrowDataSource class features:

  • Detailed docstring with usage patterns and examples
  • Clear explanation of partitioning strategy (one partition per file)
  • Performance notes highlighting PySpark 4.0+ Arrow integration
  • Comprehensive examples covering single files, glob patterns, and directories

The class design properly inherits from DataSource and implements the required interface methods.


95-116: Solid schema inference implementation.

The schema method correctly:

  • Validates required path option
  • Uses file discovery to find the first file
  • Reads Arrow IPC schema and converts to Spark StructType
  • Leverages PySpark's from_arrow_schema utility for proper conversion

120-131: Well-implemented file discovery with multiple path patterns.

The _get_files method properly handles:

  • Single files with os.path.isfile
  • Directory scanning for .arrow files
  • Glob patterns for flexible file matching
  • Sorted results for consistent ordering
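
A hedged sketch of that resolution logic (simplified; the actual implementation is in pyspark_datasources/arrow.py):

# Sketch of the file-resolution behavior described above; not the exact code.
import glob
import os
from typing import List


def _get_files(path: str) -> List[str]:
    if os.path.isfile(path):          # single file
        return [path]
    if os.path.isdir(path):           # directory: collect .arrow files
        return sorted(glob.glob(os.path.join(path, "*.arrow")))
    return sorted(glob.glob(path))    # glob pattern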

134-149: Proper reader implementation with partition-per-file strategy.

The ArrowDataSourceReader correctly:

  • Validates required options in constructor
  • Creates one InputPartition per file for parallel processing
  • Implements the expected DataSourceReader interface
CLAUDE.md (4)

1-77: Excellent comprehensive project documentation.

This documentation provides outstanding coverage of:

  • Clear project overview and educational purpose disclaimer
  • Complete tech stack and dependency information
  • Detailed project structure with file descriptions
  • Comprehensive development workflow commands
  • Proper environment setup instructions including macOS-specific considerations

The structure and content are well-organized and highly informative for developers.


78-120: Well-documented usage patterns and implementation details.

This section excellently covers:

  • Standard Spark Data Source API usage patterns
  • Testing strategy with pytest and PySpark fixtures
  • Key implementation requirements and inheritance patterns
  • External API dependencies and common issues

The code examples are clear and follow established patterns.


121-208: Comprehensive Python Data Source API specification.

This section provides excellent technical documentation including:

  • Complete coverage of all abstract base classes
  • Detailed method specifications with parameters and return types
  • Implementation requirements and best practices
  • Usage patterns with concrete examples
  • Performance optimization strategies

The API specification is thorough and will be valuable for developers implementing custom data sources.


209-223: Valuable documentation guidelines for data source development.

The docstring guidelines provide excellent guidance for:

  • Required documentation sections
  • Schema output examples
  • Error case documentation
  • Partitioning and Arrow optimization explanations

These guidelines will ensure consistent and comprehensive documentation across all data sources.
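
As a hedged illustration of what such a docstring might look like (the "myformat" short name and the sections below are illustrative, not taken from the repository):

# Hypothetical docstring template following the guidelines above.
from pyspark.sql.datasource import DataSource


class MyFileDataSource(DataSource):
    """
    Read <format> files into a Spark DataFrame.

    Options
    -------
    path : str
        A single file, a directory, or a glob pattern (required).

    Examples
    --------
    >>> spark.dataSource.register(MyFileDataSource)
    >>> spark.read.format("myformat").load("/path/to/file").printSchema()
    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)

    Notes
    -----
    - One partition is created per input file (parallel reads).
    - Data is yielded as Arrow record batches for efficient conversion.
    - Raises ``ValueError`` if the ``path`` option is missing.
    """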


@coderabbitai (bot) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
tests/create_test_data.py (1)

118-118: Fix unused loop variable.

The dirs variable is not used within the loop body. Follow the static analysis suggestion to rename it.

-    for root, dirs, files in os.walk(data_dir):
+    for root, _dirs, files in os.walk(data_dir):
pyspark_datasources/arrow.py (2)

186-188: Avoid recreating ArrowDataSource instance

The partitions() method creates a new ArrowDataSource instance just to access its _get_files() method. Since ArrowDataSourceReader already has access to self.path and self.options, it would be more efficient to move the file discovery logic here or pass the files from the parent data source.

Consider refactoring to avoid unnecessary object creation:

def partitions(self) -> List[InputPartition]:
    """Create partitions, one per file for parallel reading."""
-   data_source = ArrowDataSource(self.options)
-   files = data_source._get_files(self.path)
+   files = self._get_files(self.path)
    return [InputPartition(file_path) for file_path in files]

Then add the _get_files method to this class or pass files through the constructor.
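
For the constructor option, a hedged sketch (signatures simplified; not the committed fix):

# Pass the resolved files into the reader so it never re-creates the data source.
from typing import List

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition


class ArrowDataSource(DataSource):
    def reader(self, schema) -> "ArrowDataSourceReader":
        # _get_files is the existing helper on ArrowDataSource (see review above).
        files = self._get_files(self.options["path"])  # resolve once on the driver
        return ArrowDataSourceReader(self.options, files)


class ArrowDataSourceReader(DataSourceReader):
    def __init__(self, options: dict, files: List[str]) -> None:
        self.options = options
        self.files = files

    def partitions(self) -> List[InputPartition]:
        return [InputPartition(file_path) for file_path in self.files]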


193-194: Avoid recreating ArrowDataSource instance in read()

Similar to the partitions() method, this creates a new ArrowDataSource instance just to access _detect_format(). This is inefficient and could be refactored.

Consider passing the format detection result through the partition or storing it in the reader instance.

CLAUDE.md (4)

17-17: Add language identifier to fenced code block

The fenced code block showing the project structure should have a language identifier for better syntax highlighting.

-```
+```text

32-34: Add blank lines around table

Markdown tables should be surrounded by blank lines for better readability and compatibility with various markdown parsers.

## Available Data Sources
+
| Short Name    | File | Description | Dependencies |

203-206: Format bare URLs as proper markdown links

Bare URLs should be formatted as proper markdown links for better presentation.


221-221: Add newline at end of file

The file should end with a newline character for POSIX compliance.

Add a newline after line 221.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecaa405 and 0a05cc9.

⛔ Files ignored due to path filters (1)
  • tests/data/products.parquet is excluded by !**/*.parquet
📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • CLAUDE.md (1 hunks)
  • RELEASE.md (1 hunks)
  • pyspark_datasources/__init__.py (1 hunks)
  • pyspark_datasources/arrow.py (1 hunks)
  • tests/create_test_data.py (1 hunks)
  • tests/test_arrow_datasource.py (1 hunks)
  • tests/test_arrow_datasource_updated.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
pyspark_datasources/__init__.py (1)
pyspark_datasources/arrow.py (1)
  • ArrowDataSource (11-171)
tests/create_test_data.py (1)
pyspark_datasources/arrow.py (1)
  • schema (120-140)
tests/test_arrow_datasource.py (2)
pyspark_datasources/arrow.py (4)
  • ArrowDataSource (11-171)
  • schema (120-140)
  • name (117-118)
  • read (190-210)
tests/test_arrow_datasource_updated.py (3)
  • spark (9-11)
  • test_arrow_datasource_missing_path (156-163)
  • test_arrow_datasource_nonexistent_path (166-173)
🪛 Ruff (0.12.2)
pyspark_datasources/__init__.py

1-1: .arrow.ArrowDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

tests/create_test_data.py

118-118: Loop control variable dirs not used within loop body

Rename unused dirs to _dirs

(B007)

tests/test_arrow_datasource_updated.py

219-219: f-string without any placeholders

Remove extraneous f prefix

(F541)

pyspark_datasources/arrow.py

1-1: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


210-210: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🪛 markdownlint-cli2 (0.17.2)
CLAUDE.md

17-17: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


33-33: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


203-203: Bare URL used

(MD034, no-bare-urls)


205-205: Bare URL used

(MD034, no-bare-urls)


206-206: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (5)
.gitignore (1)

162-163: LGTM! Appropriate addition to gitignore.

The .claude/ directory exclusion aligns with the new CLAUDE.md documentation and follows the existing pattern for ignoring development-related directories.

pyspark_datasources/__init__.py (1)

1-1: LGTM! Package-level import follows established pattern.

The import makes ArrowDataSource available at the package level, consistent with other data sources. The static analysis warning about unused import is a false positive - this is the intended behavior for package __init__.py files to expose public APIs.

RELEASE.md (1)

122-135: Excellent troubleshooting documentation for PyArrow compatibility.

This addition provides valuable guidance for a known macOS issue with PyArrow. The solutions are well-structured with practical examples for different usage scenarios, which will help users avoid runtime crashes when using the new ArrowDataSource.
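
For illustration only, the snippet below shows the kind of environment-variable workaround commonly used for PyArrow fork crashes on macOS; the exact variable and steps are an assumption here, and the instructions in RELEASE.md are authoritative:

# Hedged illustration; consult RELEASE.md for the documented workaround.
import os

# Commonly recommended for fork-safety crashes on macOS; set before creating the SparkSession.
os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()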

tests/create_test_data.py (1)

9-125: Excellent test data generation script with comprehensive coverage.

The script creates a well-structured test dataset covering various scenarios:

  • Single Arrow and Parquet files
  • Partitioned datasets for testing multiple partitions
  • Mixed format directories
  • Proper use of PyArrow APIs for both IPC and Parquet formats

This provides robust test infrastructure for the ArrowDataSource functionality.
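
For context, a minimal hedged example of writing both formats with PyArrow, similar in spirit to what the script does (paths and columns are illustrative):

# Illustrative example of producing Arrow IPC and Parquet test files.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Arrow IPC file format, readable by the new data source.
with ipc.new_file("tests/data/example.arrow", table.schema) as writer:
    writer.write_table(table)

# Parquet file, for the mixed-format scenarios.
pq.write_table(table, "tests/data/example.parquet")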

tests/test_arrow_datasource.py (1)

1-141: Comprehensive test suite with excellent structure and coverage.

The test module provides solid coverage of ArrowDataSource functionality:

  • Well-designed fixtures with proper temporary file creation and cleanup
  • Tests for both Arrow IPC and Parquet file formats
  • Schema inference validation
  • Appropriate error handling tests for edge cases
  • Clear test structure following pytest conventions

The tests effectively validate the core functionality while maintaining good resource management practices.

@@ -0,0 +1,210 @@
from typing import List, Iterator, Union

⚠️ Potential issue

Remove unused import Union

The Union type from the typing module is imported but never used in the code.

-from typing import List, Iterator, Union
+from typing import List, Iterator

Comment on lines 209 to 210
except Exception as e:
raise RuntimeError(f"Failed to read file {file_path}: {str(e)}")
\ No newline at end of file

⚠️ Potential issue

Preserve exception chain when re-raising

When catching and re-raising exceptions, use from e to preserve the original exception chain for better debugging.

except Exception as e:
-   raise RuntimeError(f"Failed to read file {file_path}: {str(e)}")
+   raise RuntimeError(f"Failed to read file {file_path}: {str(e)}") from e

assert batch.num_rows == 5
assert batch.num_columns == 6

print(f"\nDirect access test results:")

⚠️ Potential issue

Remove unnecessary f-string prefix

The string on this line doesn't contain any placeholders, so the f prefix is not needed.

-print(f"\nDirect access test results:")
+print("\nDirect access test results:")

@allisonwang-db force-pushed the add-arrow-datasource-and-project-docs branch from 19b1480 to 05ebc9d on August 4, 2025 at 18:55
@allisonwang-db changed the title from "Add Arrow data source and project documentation" to "Add Arrow data source" on Aug 4, 2025
@allisonwang-db changed the title from "Add Arrow data source" to "Add PyArrow data source" on Aug 4, 2025
@allisonwang-db merged commit 8985da2 into master on Aug 4, 2025
5 checks passed