Refactor documentation #27
Conversation
Walkthrough

This PR restructures the project documentation from a MkDocs-based system with individual datasource pages to a comprehensive guide-based model. Changes include removing CLAUDE.md and the MkDocs config, consolidating datasource documentation into unified guides, and rewriting the README with quick-start examples.

Changes
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Large volume of documentation changes across 22+ files, primarily consolidating dispersed datasource documentation into unified guides. While mostly homogeneous (a repetitive pattern of removing individual datasource pages), the README rewrite and the addition of comprehensive guides require substantive content review for quality, completeness, and consistency. No code logic changes are present.

Possibly related PRs
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- CLAUDE.md (0 hunks)
- README.md (1 hunks)
- contributing/DEVELOPMENT.md (1 hunks)
- docs/api-reference.md (1 hunks)
- docs/building-data-sources.md (1 hunks)
- docs/data-sources-guide.md (1 hunks)
- docs/datasources/arrow.md (0 hunks)
- docs/datasources/fake.md (0 hunks)
- docs/datasources/github.md (0 hunks)
- docs/datasources/googlesheets.md (0 hunks)
- docs/datasources/huggingface.md (0 hunks)
- docs/datasources/jsonplaceholder.md (0 hunks)
- docs/datasources/kaggle.md (0 hunks)
- docs/datasources/lance.md (0 hunks)
- docs/datasources/opensky.md (0 hunks)
- docs/datasources/robinhood.md (0 hunks)
- docs/datasources/salesforce.md (0 hunks)
- docs/datasources/simplejson.md (0 hunks)
- docs/datasources/stock.md (0 hunks)
- docs/datasources/weather.md (0 hunks)
- docs/index.md (0 hunks)
- mkdocs.yml (0 hunks)
💤 Files with no reviewable changes (17)
- docs/datasources/stock.md
- docs/datasources/github.md
- docs/datasources/arrow.md
- docs/datasources/huggingface.md
- docs/datasources/weather.md
- docs/datasources/salesforce.md
- docs/datasources/kaggle.md
- docs/datasources/fake.md
- docs/datasources/jsonplaceholder.md
- docs/datasources/robinhood.md
- docs/index.md
- docs/datasources/simplejson.md
- docs/datasources/opensky.md
- CLAUDE.md
- docs/datasources/googlesheets.md
- mkdocs.yml
- docs/datasources/lance.md
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Include comprehensive class docstrings for each data source with: brief description and Name: "format_name", an Options section (parameters/types/defaults), and Examples (registration and basic usage)
Applied to files:
- docs/data-sources-guide.md
- contributing/DEVELOPMENT.md
- docs/building-data-sources.md
- README.md
- docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : All data source classes must inherit from Spark's DataSource base class
Applied to files:
docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Implement robust exception handling in data source read/write paths
Applied to files:
docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : All classes used by the data sources (including readers/writers) must be pickle-serializable
Applied to files:
docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Defer expensive operations until read time (lazy evaluation) in data source implementations
Applied to files:
docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : For streaming sources, ensure resources are cleaned up (e.g., implement stop()/commit() as appropriate)
Applied to files:
docs/api-reference.md
🪛 LanguageTool
contributing/DEVELOPMENT.md
[uncategorized] ~324-~324: The official name of this software platform is spelled with a capital “H”.
Context: ...ons for CI/CD. Workflows are defined in .github/workflows/. ### Running CI Locally `...
(GITHUB)
README.md
[uncategorized] ~60-~60: The official name of this software platform is spelled with a capital “H”.
Context: ...nstall pyspark-data-sources[faker]| |github` | Batch | Read GitHub pull requests | ...
(GITHUB)
🪛 markdownlint-cli2 (0.18.1)
contributing/DEVELOPMENT.md
284-284: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
284-284: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
285-285: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
285-285: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
286-286: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
286-286: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
287-287: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
287-287: Strong style
Expected: underscore; Actual: asterisk
(MD050, strong-style)
🔇 Additional comments (12)
contributing/DEVELOPMENT.md (1)
121-159: Excellent docstring example format.

The data source docstring template (lines 121–159) provides a comprehensive pattern that aligns well with best practices: it includes a brief description, an explicit Name field, a structured Options section with parameter details, and practical Examples with registration and output. This sets a strong standard for contributors.
docs/api-reference.md (2)
185-207: Excellent serialization guidance with clear examples.

The serialization section effectively contrasts BAD and GOOD patterns, demonstrating why non-serializable objects (connections, HTTP clients) must be created in read() rather than stored as instance variables. This guidance directly prevents a common category of runtime errors.
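As a rough illustration of the contrast this comment praises, plain Python stands in for the actual pyspark reader classes below; the class names and the file path are made up for the sketch:

```python
import pickle


class BadReader:
    """Anti-pattern: holds a live handle, so the reader cannot be pickled."""

    def __init__(self, path):
        self.handle = open(path)  # file/connection objects are not picklable


class GoodReader:
    """Pattern: store only plain config; open the handle inside read()."""

    def __init__(self, path):
        self.path = path

    def read(self):
        # The handle is created after deserialization, on the worker side.
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")


# GoodReader round-trips through pickle, as Spark requires of readers.
clone = pickle.loads(pickle.dumps(GoodReader("data.txt")))
print(clone.path)  # -> data.txt
```

Attempting `pickle.dumps(BadReader(...))` raises a TypeError, which is exactly the failure mode the guide warns about.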
261-274: Strong resource management pattern.

The resource management example (lines 266–274) correctly demonstrates cleanup via hasattr() checks and explicit close() calls, emphasizing the importance of implementing stop() for streaming sources. Aligns with best practices.

docs/data-sources-guide.md (2)
5-14: Clear and comprehensive table of contents.

The table of contents provides excellent navigation across nine data sources with links. Users can quickly find the data source they need and understand the scope upfront.
425-457: Practical common patterns section.

Error handling (lines 428–436) and schema inference vs. specification (lines 438–457) patterns are clear and practical. The examples demonstrate both automatic inference and explicit schema specification, helping users choose the right approach.
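One common error-handling pattern in this space, a helpful message for a missing optional dependency, can be sketched as follows; the helper name and the extras label are invented for illustration, not taken from the guide:

```python
import importlib


def require(module_name, extra):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"This data source needs '{module_name}'. "
            f"Install it with: pip install pyspark-data-sources[{extra}]"
        ) from err


# A stdlib module imports fine; a missing one raises the helpful error.
json_mod = require("json", "core")
print(json_mod.__name__)  # -> json
```

Calling `require("some_missing_pkg", "demo")` surfaces the install hint instead of a bare ImportError, which is the user experience the guide's pattern aims for.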
docs/building-data-sources.md (4)
18-65: Excellent minimal example with clear progression.

The minimal example cleanly demonstrates the two core classes (DataSource and DataSourceReader), the name/schema methods, and a basic read implementation. Starting with this pattern before advancing to partitioning and streaming is pedagogically sound.
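The two-class shape this comment refers to can be sketched without pulling in pyspark itself; the class names, format name, and schema string below are illustrative stand-ins for the guide's actual code, which subclasses pyspark's DataSource and DataSourceReader:

```python
class ExampleDataSource:
    """Stand-in for a class that would subclass pyspark's DataSource."""

    def __init__(self, options=None):
        self.options = options or {}

    @classmethod
    def name(cls):
        return "example"  # the short name used with .format("example")

    def schema(self):
        return "id INT, value STRING"  # DDL-style schema string

    def reader(self, schema):
        return ExampleReader()


class ExampleReader:
    """Stand-in for a class that would subclass DataSourceReader."""

    def read(self, partition=None):
        # Yield tuples matching the declared schema, in column order.
        yield (1, "hello")
        yield (2, "world")


rows = list(ExampleDataSource().reader(None).read())
print(rows)  # -> [(1, 'hello'), (2, 'world')]
```

In the real API, Spark calls schema() and reader() for you once the source is registered; here the chain is invoked by hand just to show the flow of control.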
336-375: Strong error handling pattern with exponential backoff.

The retry logic with exponential backoff (lines 336–375) demonstrates a professional-grade pattern: it catches specific exceptions, implements progressive delays, respects the failOnError option, and provides logging. A good reference implementation.
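The backoff pattern being praised can be sketched roughly as below; the function name, the retried exception type, and the option handling are assumptions for the sketch, not the guide's exact code:

```python
import time


def fetch_with_retry(fetch, max_retries=3, base_delay=1.0, fail_on_error=True):
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                if fail_on_error:
                    raise  # honor a failOnError-style option
                return None
            # base, 2*base, 4*base, ... progressive delays between attempts
            time.sleep(base_delay * (2 ** attempt))


calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(fetch_with_retry(flaky, base_delay=0.01))  # -> ok
```

With failOnError-style behavior disabled, the caller gets None after exhausting retries rather than a crash, which mirrors the trade-off the guide documents.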
377-404: Serialization guidance matches best practices.

The SerializableReader example correctly shows the anti-pattern (storing connection objects) vs. pattern (storing connection strings and creating objects in read()). Clear and directly applicable.
617-684: Comprehensive testing section covering multiple scenarios.

Unit, partitioned, error handling, and streaming tests (lines 617–684) provide good coverage of common test scenarios. The pytest fixtures and assertion patterns are idiomatic and practical.
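The style of unit test the review describes might look roughly like this; the reader class and the assertions are invented for the sketch, while the guide's real tests target the actual data sources:

```python
class FakeReader:
    """Tiny stand-in reader that yields (name, score) tuples."""

    def read(self, partition=None):
        yield ("alice", 1)
        yield ("bob", 2)


def test_reader_yields_rows_matching_schema():
    rows = list(FakeReader().read())
    assert len(rows) == 2
    # Every row must carry one value per declared column.
    assert all(len(row) == 2 for row in rows)


test_reader_yields_rows_matching_schema()
print("ok")  # -> ok
```

Under pytest the bare `test_*` function would be collected automatically; it is called directly here only so the sketch runs standalone.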
README.md (3)
6-69: Well-structured README with clear quick-start focus.

The restructured README effectively prioritizes the quick start (installation, requirements, basic usage) and provides a compact table of available data sources with install notes. The progression from quick start → example → building guide → documentation links is intuitive and uncluttered.
93-129: Minimal custom data source example is practical.

The building guide example (lines 98–127) shows a complete, runnable data source with DataSource and DataSourceReader classes, registration, and usage. Conciseness balances completeness: users can run this immediately and then refer to docs/building-data-sources.md for advanced patterns.
131-150: Documentation navigation is clear and well-organized.

The documentation links (lines 131–136) provide a clear hierarchy: Data Sources Guide (examples) → Building Data Sources (tutorial) → API Reference (spec) → Development Guide (contributing). Resources at the end (lines 147–150) link to official Spark documentation. Good information architecture.
1. **Serialization errors**: Ensure all class attributes are pickle-able
2. **Schema mismatch**: Verify returned data matches declared schema
3. **Missing dependencies**: Use try/except to provide helpful error messages
4. **API rate limits**: Implement backoff and retry logic
Fix markdown strong-style violations (MD050).
Use underscores instead of asterisks for strong emphasis in list items to comply with markdown linting rules.
-1. **Serialization errors**: Ensure all class attributes are pickle-able
-2. **Schema mismatch**: Verify returned data matches declared schema
-3. **Missing dependencies**: Use try/except to provide helpful error messages
-4. **API rate limits**: Implement backoff and retry logic
+1. __Serialization errors__: Ensure all class attributes are pickle-able
+2. __Schema mismatch__: Verify returned data matches declared schema
+3. __Missing dependencies__: Use try/except to provide helpful error messages
+4. __API rate limits__: Implement backoff and retry logic

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
1. __Serialization errors__: Ensure all class attributes are pickle-able
2. __Schema mismatch__: Verify returned data matches declared schema
3. __Missing dependencies__: Use try/except to provide helpful error messages
4. __API rate limits__: Implement backoff and retry logic
🤖 Prompt for AI Agents
In contributing/DEVELOPMENT.md around lines 284 to 287, the list items use
asterisks for strong emphasis which violates MD050; replace the asterisk-based
bold markers with underscore-based strong emphasis (e.g., change **Serialization
errors** to __Serialization errors__) for each list item so the markdown linter
passes while preserving the same text and list structure.
Summary by CodeRabbit