# Add PyArrow data source #14

## Conversation
**Walkthrough**

A new PySpark data source for reading Apache Arrow IPC files was implemented and integrated into the package. Project documentation and API specifications were added. Release notes were updated with a troubleshooting section for PyArrow/macOS issues. Two new tests were introduced to validate the Arrow data source with both single and multiple files.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
participant User
participant SparkSession
participant ArrowDataSource
participant ArrowDataSourceReader
participant FileSystem
User->>SparkSession: Register ArrowDataSource
User->>SparkSession: spark.read.format("arrow").load("path")
SparkSession->>ArrowDataSource: Instantiate with options
ArrowDataSource->>FileSystem: Resolve Arrow file paths
ArrowDataSource->>ArrowDataSource: Infer schema from first file
ArrowDataSource->>ArrowDataSourceReader: Create reader with schema
ArrowDataSourceReader->>FileSystem: List Arrow files
loop For each file
ArrowDataSourceReader->>FileSystem: Open Arrow file
ArrowDataSourceReader->>SparkSession: Yield RecordBatches
end
SparkSession->>User: Return DataFrame
```
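The flow above can be exercised end to end. A minimal sketch, assuming registration uses PySpark's standard `spark.dataSource.register` API and that `ArrowDataSource` is exposed at the package top level (per the `__init__.py` change); the path is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark_datasources import ArrowDataSource

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Register the Python data source so "arrow" resolves as a format name.
spark.dataSource.register(ArrowDataSource)

# Read one or more Arrow IPC files; the schema is inferred from the first file.
df = spark.read.format("arrow").load("/path/to/data")
df.show()
```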
**Estimated code review effort:** 🎯 3 (Moderate) | ⏱️ ~18 minutes
Actionable comments posted: 3
🧹 Nitpick comments (7)
**tests/create_test_data.py (1)**

118-118: **Fix unused loop variable.** The `dirs` variable is not used within the loop body. Follow the static analysis suggestion to rename it.

```diff
- for root, dirs, files in os.walk(data_dir):
+ for root, _dirs, files in os.walk(data_dir):
```

**pyspark_datasources/arrow.py (2)**
186-188: **Avoid recreating the ArrowDataSource instance.** The `partitions()` method creates a new `ArrowDataSource` instance just to access its `_get_files()` method. Since `ArrowDataSourceReader` already has access to `self.path` and `self.options`, it would be more efficient to move the file discovery logic here or pass the files from the parent data source.

Consider refactoring to avoid unnecessary object creation:

```diff
 def partitions(self) -> List[InputPartition]:
     """Create partitions, one per file for parallel reading."""
-    data_source = ArrowDataSource(self.options)
-    files = data_source._get_files(self.path)
+    files = self._get_files(self.path)
     return [InputPartition(file_path) for file_path in files]
```

Then add the `_get_files` method to this class or pass the files through the constructor, as sketched below.
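A minimal sketch of the constructor-passing variant; the parameter name `files` and the surrounding signatures are illustrative assumptions, not the module's actual API:

```python
from typing import List

from pyspark.sql.datasource import DataSourceReader, InputPartition


class ArrowDataSourceReader(DataSourceReader):
    def __init__(self, schema, options, files: List[str]):
        # Hypothetical constructor: the parent ArrowDataSource resolves the
        # file list once and hands it over, so the reader never needs to
        # build a second ArrowDataSource just to call _get_files().
        self.schema = schema
        self.options = options
        self.files = files

    def partitions(self) -> List[InputPartition]:
        # One partition per file enables parallel reads.
        return [InputPartition(file_path) for file_path in self.files]

    def read(self, partition: InputPartition):
        # Reading logic unchanged from the original reader (elided here).
        ...
```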
193-194: **Avoid recreating the ArrowDataSource instance in read().** As with `partitions()`, this creates a new `ArrowDataSource` instance just to access `_detect_format()`. This is inefficient and could be refactored. Consider passing the format detection result through the partition or storing it in the reader instance; a sketch follows.
**CLAUDE.md (4)**
17-17: **Add a language identifier to the fenced code block.** The fenced code block showing the project structure should have a language identifier for better syntax highlighting.

````diff
-```
+```text
````
32-34: **Add blank lines around the table.** Markdown tables should be surrounded by blank lines for better readability and compatibility with various markdown parsers.

```diff
 ## Available Data Sources
+
 | Short Name | File | Description | Dependencies |
```
203-206: **Format bare URLs as proper markdown links.** Bare URLs should be formatted as proper markdown links for better presentation.
221-221: **Add a newline at the end of the file.** The file should end with a newline character for POSIX compliance. Add a newline after line 221.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`tests/data/products.parquet` is excluded by `!**/*.parquet`
📒 Files selected for processing (8)
- `.gitignore` (1 hunks)
- `CLAUDE.md` (1 hunks)
- `RELEASE.md` (1 hunks)
- `pyspark_datasources/__init__.py` (1 hunks)
- `pyspark_datasources/arrow.py` (1 hunks)
- `tests/create_test_data.py` (1 hunks)
- `tests/test_arrow_datasource.py` (1 hunks)
- `tests/test_arrow_datasource_updated.py` (1 hunks)
🧰 Additional context used

🧬 Code Graph Analysis (3)

**pyspark_datasources/__init__.py (1)**
- pyspark_datasources/arrow.py (1): `ArrowDataSource` (11-171)

**tests/create_test_data.py (1)**
- pyspark_datasources/arrow.py (1): `schema` (120-140)

**tests/test_arrow_datasource.py (2)**
- pyspark_datasources/arrow.py (4): `ArrowDataSource` (11-171), `schema` (120-140), `name` (117-118), `read` (190-210)
- tests/test_arrow_datasource_updated.py (3): `spark` (9-11), `test_arrow_datasource_missing_path` (156-163), `test_arrow_datasource_nonexistent_path` (166-173)
🪛 Ruff (0.12.2)
pyspark_datasources/__init__.py
1-1: .arrow.ArrowDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
tests/create_test_data.py
118-118: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
tests/test_arrow_datasource_updated.py
219-219: f-string without any placeholders
Remove extraneous f prefix
(F541)
pyspark_datasources/arrow.py
1-1: typing.Union imported but unused
Remove unused import: typing.Union
(F401)
210-210: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🪛 markdownlint-cli2 (0.17.2)
CLAUDE.md
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
33-33: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
203-203: Bare URL used
(MD034, no-bare-urls)
205-205: Bare URL used
(MD034, no-bare-urls)
206-206: Bare URL used
(MD034, no-bare-urls)
🔇 Additional comments (5)
**.gitignore (1)**

162-163: **LGTM! Appropriate addition to .gitignore.** The `.claude/` directory exclusion aligns with the new CLAUDE.md documentation and follows the existing pattern for ignoring development-related directories.

**pyspark_datasources/__init__.py (1)**
1-1: **LGTM! Package-level import follows the established pattern.** The import makes `ArrowDataSource` available at the package level, consistent with other data sources. The static analysis warning about an unused import is a false positive; this is the intended behavior for package `__init__.py` files that expose public APIs.
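If the project prefers to silence F401 explicitly rather than rely on lint configuration, one conventional pattern (a sketch, not necessarily the repo's choice) is an explicit re-export:

```python
# pyspark_datasources/__init__.py
from .arrow import ArrowDataSource

# Declaring __all__ marks the import as an intentional public re-export,
# which also satisfies Ruff's F401 check.
__all__ = ["ArrowDataSource"]
```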
**RELEASE.md (1)**

122-135: **Excellent troubleshooting documentation for PyArrow compatibility.** This addition provides valuable guidance for a known macOS issue with PyArrow. The solutions are well structured, with practical examples for different usage scenarios, which will help users avoid runtime crashes when using the new ArrowDataSource.
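The review does not quote RELEASE.md's actual guidance, but as one illustration of the kind of workaround such a section typically covers: a commonly cited mitigation for fork-related PyArrow crashes on macOS is setting a fork-safety environment variable before Spark starts. This is an assumption about the class of fix, not a quotation of the documented one:

```python
import os

# Commonly cited macOS workaround for fork-safety crashes in forked Python
# workers that load native libraries such as PyArrow. Illustrative only;
# not necessarily the exact fix RELEASE.md documents.
os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```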
**tests/create_test_data.py (1)**

9-125: **Excellent test data generation script with comprehensive coverage.** The script creates a well-structured test dataset covering various scenarios:
- Single Arrow and Parquet files
- Partitioned datasets for testing multiple partitions
- Mixed format directories
- Proper use of PyArrow APIs for both IPC and Parquet formats
This provides robust test infrastructure for the `ArrowDataSource` functionality; a minimal sketch of such a generator follows.
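A minimal sketch of this kind of generator, assuming illustrative column names and output paths rather than the script's actual ones:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Illustrative table; the real script's columns and paths may differ.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Arrow IPC file format (what the "arrow" data source reads).
with ipc.new_file("tests/data/sample.arrow", table.schema) as writer:
    writer.write_table(table)

# Parquet file for the mixed-format scenarios.
pq.write_table(table, "tests/data/sample.parquet")
```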
**tests/test_arrow_datasource.py (1)**

1-141: **Comprehensive test suite with excellent structure and coverage.** The test module provides solid coverage of `ArrowDataSource` functionality:
- Well-designed fixtures with proper temporary file creation and cleanup
- Tests for both Arrow IPC and Parquet file formats
- Schema inference validation
- Appropriate error handling tests for edge cases
- Clear test structure following pytest conventions
The tests effectively validate the core functionality while maintaining good resource management practices; a representative sketch of such a test appears below.
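A representative sketch, assuming the session fixture registers the source under the short name "arrow"; the fixture shape and assertions are illustrative, not copied from the test file:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pytest
from pyspark.sql import SparkSession

from pyspark_datasources import ArrowDataSource


@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.appName("arrow-tests").getOrCreate()
    session.dataSource.register(ArrowDataSource)  # expose format "arrow"
    yield session
    session.stop()


def test_reads_single_arrow_file(spark, tmp_path):
    # Write a small Arrow IPC file into pytest's temporary directory.
    table = pa.table({"id": [1, 2, 3]})
    path = tmp_path / "sample.arrow"
    with ipc.new_file(str(path), table.schema) as writer:
        writer.write_table(table)

    df = spark.read.format("arrow").load(str(path))
    assert df.count() == 3
    assert df.schema.fieldNames() == ["id"]
```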
**pyspark_datasources/arrow.py**, hunk `@@ -0,0 +1,210 @@`:

```python
from typing import List, Iterator, Union
```
**Remove unused import `Union`**

The `Union` type from the `typing` module is imported but never used in the code.

```diff
-from typing import List, Iterator, Union
+from typing import List, Iterator
```
🧰 Tools
🪛 Ruff (0.12.2)
1-1: typing.Union imported but unused
Remove unused import: typing.Union
(F401)
🤖 Prompt for AI Agents
In pyspark_datasources/arrow.py at line 1, the import statement includes `Union`
from the `typing` module, but `Union` is not used anywhere in the code. Remove
`Union` from the import list to clean up unused imports.
**pyspark_datasources/arrow.py** (outdated)
```python
except Exception as e:
    raise RuntimeError(f"Failed to read file {file_path}: {str(e)}")
```

(No newline at end of file.)
**Preserve the exception chain when re-raising**

When catching and re-raising exceptions, use `from e` to preserve the original exception chain for better debugging.

```diff
 except Exception as e:
-    raise RuntimeError(f"Failed to read file {file_path}: {str(e)}")
+    raise RuntimeError(f"Failed to read file {file_path}: {str(e)}") from e
```
🧰 Tools
🪛 Ruff (0.12.2)
210-210: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
In pyspark_datasources/arrow.py around lines 209 to 210, the RuntimeError is
raised without preserving the original exception chain. Modify the raise
statement to use 'raise RuntimeError(...) from e' to maintain the exception
context and improve debugging.
**tests/test_arrow_datasource_updated.py**

```python
assert batch.num_rows == 5
assert batch.num_columns == 6

print(f"\nDirect access test results:")
```
**Remove unnecessary f-string prefix**

The string on this line doesn't contain any placeholders, so the `f` prefix is not needed.

```diff
-print(f"\nDirect access test results:")
+print("\nDirect access test results:")
```
🧰 Tools
🪛 Ruff (0.12.2)
219-219: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tests/test_arrow_datasource_updated.py at line 219, remove the unnecessary
f-string prefix from the print statement since the string does not contain any
placeholders. Change the line to a regular string without the 'f' prefix.
Force-pushed from `19b1480` to `05ebc9d`.
**PR description (sections):** Summary, Usage and Examples, Changes, Test Plan

🤖 Generated with Claude Code

**Summary by CodeRabbit**