
Conversation

@allisonwang-db allisonwang-db commented Oct 22, 2025

Summary by CodeRabbit

  • Documentation
    • Significantly restructured documentation with new comprehensive guides for building custom data sources and detailed API reference.
    • Updated README with practical quick-start examples, clearer installation instructions, and improved layout.
    • Consolidated data source information into dedicated guides for easier navigation and discovery.
    • Added development guide for contributors.

coderabbitai bot commented Oct 22, 2025

Walkthrough

This PR restructures the project documentation from an MkDocs-based system with individual datasource pages to a comprehensive guide-based model. Changes include removing CLAUDE.md and the MkDocs config, consolidating datasource documentation into unified guides, and rewriting the README with quick-start examples.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation Removal**<br>`CLAUDE.md`, `mkdocs.yml`, `docs/index.md` | Deleted project context file, MkDocs configuration, and index page to transition from MkDocs to the new documentation structure. |
| **Datasource Documentation Consolidation**<br>`docs/datasources/*.md` | Removed 14 individual datasource documentation pages (arrow.md, fake.md, github.md, googlesheets.md, huggingface.md, jsonplaceholder.md, kaggle.md, lance.md, opensky.md, robinhood.md, salesforce.md, simplejson.md, stock.md, weather.md) in favor of unified guides. |
| **New Comprehensive Guides**<br>`docs/api-reference.md`, `docs/building-data-sources.md`, `docs/data-sources-guide.md` | Added three new guide documents covering the Python Data Source API specification, custom data source implementation patterns, and consolidated datasource usage with examples. |
| **Development Documentation**<br>`contributing/DEVELOPMENT.md` | Added a new development guide covering environment setup, testing, code quality, pre-commit hooks, and debugging practices. |
| **README Restructuring**<br>`README.md` | Substantially rewrote the README with a new quick-start section, installation instructions with pip extras, basic usage examples, an available data sources table, and a custom data source implementation example. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Large volume of documentation changes across 22+ files, primarily consolidating dispersed datasource documentation into unified guides. While mostly homogeneous (repetitive pattern of removing individual datasource pages), the README rewrite and addition of comprehensive guides require substantive content review for quality, completeness, and consistency. No code logic changes present.

Possibly related PRs

Poem

🐰 From scattered docs to guides so bright,
MkDocs fades into the night,
One grand README takes the stage,
With APIs spelled on every page,
Building sources, crystal clear—
Hop along, the path is here! 🌟

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "Refactor documentation" accurately reflects the primary change in the changeset. The PR involves a comprehensive restructuring of the documentation: removing old documentation files (individual data source pages at docs/datasources/*.md, docs/index.md, mkdocs.yml, and CLAUDE.md), adding new consolidated guides (api-reference, building-data-sources, data-sources-guide, and DEVELOPMENT.md), and substantially rewriting README.md. The title is concise, clear, and conveys sufficient information for someone scanning the history to understand that the main change is a documentation refactoring, without being vague or generic like "misc updates" or "stuff." |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5788049 and d62843e.

📒 Files selected for processing (22)
  • CLAUDE.md (0 hunks)
  • README.md (1 hunks)
  • contributing/DEVELOPMENT.md (1 hunks)
  • docs/api-reference.md (1 hunks)
  • docs/building-data-sources.md (1 hunks)
  • docs/data-sources-guide.md (1 hunks)
  • docs/datasources/arrow.md (0 hunks)
  • docs/datasources/fake.md (0 hunks)
  • docs/datasources/github.md (0 hunks)
  • docs/datasources/googlesheets.md (0 hunks)
  • docs/datasources/huggingface.md (0 hunks)
  • docs/datasources/jsonplaceholder.md (0 hunks)
  • docs/datasources/kaggle.md (0 hunks)
  • docs/datasources/lance.md (0 hunks)
  • docs/datasources/opensky.md (0 hunks)
  • docs/datasources/robinhood.md (0 hunks)
  • docs/datasources/salesforce.md (0 hunks)
  • docs/datasources/simplejson.md (0 hunks)
  • docs/datasources/stock.md (0 hunks)
  • docs/datasources/weather.md (0 hunks)
  • docs/index.md (0 hunks)
  • mkdocs.yml (0 hunks)
💤 Files with no reviewable changes (17)
  • docs/datasources/stock.md
  • docs/datasources/github.md
  • docs/datasources/arrow.md
  • docs/datasources/huggingface.md
  • docs/datasources/weather.md
  • docs/datasources/salesforce.md
  • docs/datasources/kaggle.md
  • docs/datasources/fake.md
  • docs/datasources/jsonplaceholder.md
  • docs/datasources/robinhood.md
  • docs/index.md
  • docs/datasources/simplejson.md
  • docs/datasources/opensky.md
  • CLAUDE.md
  • docs/datasources/googlesheets.md
  • mkdocs.yml
  • docs/datasources/lance.md
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Include comprehensive class docstrings for each data source with: brief description and Name: "format_name", an Options section (parameters/types/defaults), and Examples (registration and basic usage)

Applied to files:

  • docs/data-sources-guide.md
  • contributing/DEVELOPMENT.md
  • docs/building-data-sources.md
  • README.md
  • docs/api-reference.md
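The docstring convention this learning describes can be sketched in plain Python. The `MyFormatDataSource` class, the `"myformat"` name, and the options shown are hypothetical stand-ins for illustration; in the real repo the class would subclass `pyspark.sql.datasource.DataSource`.

```python
class MyFormatDataSource:  # stand-in; real classes subclass pyspark.sql.datasource.DataSource
    """
    Read records from the hypothetical MyFormat service.

    Name: "myformat"

    Options
    -------
    endpoint : str
        Base URL of the service (required).
    timeout : int, default 30
        Request timeout in seconds.

    Examples
    --------
    Register and read:

    >>> spark.dataSource.register(MyFormatDataSource)
    >>> spark.read.format("myformat").option("endpoint", "https://example.com").load()
    """


# The Name and Options sections are plain text inside the docstring, so the
# convention can be checked without importing Spark:
doc = MyFormatDataSource.__doc__
assert 'Name: "myformat"' in doc
assert "Options" in doc
```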
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : All data source classes must inherit from Spark's DataSource base class

Applied to files:

  • docs/api-reference.md
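A minimal sketch of the inheritance requirement. If pyspark is unavailable, a stand-in base class is substituted so the sketch stays runnable; with PySpark 4.0+ the real base class is `pyspark.sql.datasource.DataSource`. The `FakeJsonDataSource` name is hypothetical.

```python
try:
    from pyspark.sql.datasource import DataSource
except ImportError:
    class DataSource:  # stand-in mirroring the base class's constructor shape
        def __init__(self, options=None):
            self.options = options or {}


class FakeJsonDataSource(DataSource):  # hypothetical data source
    @classmethod
    def name(cls):
        # Short format name users pass to spark.read.format(...)
        return "fakejson"


assert issubclass(FakeJsonDataSource, DataSource)
assert FakeJsonDataSource.name() == "fakejson"
```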
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Implement robust exception handling in data source read/write paths

Applied to files:

  • docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : All classes used by the data sources (including readers/writers) must be pickle-serializable

Applied to files:

  • docs/api-reference.md
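The pickle-serializability requirement can be demonstrated with a round-trip check: readers shipped to executors must survive `pickle.dumps`. Storing plain configuration (strings, numbers) keeps a reader serializable, while live handles (sockets, HTTP clients) generally do not. The `ApiReader` class here is a hypothetical sketch.

```python
import pickle


class ApiReader:
    def __init__(self, endpoint: str, timeout: int = 30):
        # Only picklable configuration is stored; any HTTP client would be
        # created later, inside read(), on the executor.
        self.endpoint = endpoint
        self.timeout = timeout


reader = ApiReader("https://example.com/data")
restored = pickle.loads(pickle.dumps(reader))  # round-trip as Spark would
assert restored.endpoint == "https://example.com/data"
assert restored.timeout == 30
```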
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : Defer expensive operations until read time (lazy evaluation) in data source implementations

Applied to files:

  • docs/api-reference.md
📚 Learning: 2025-08-19T20:07:33.281Z
Learnt from: CR
PR: allisonwang-db/pyspark-data-sources#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-19T20:07:33.281Z
Learning: Applies to pyspark_datasources/!(__init__).py : For streaming sources, ensure resources are cleaned up (e.g., implement stop()/commit() as appropriate)

Applied to files:

  • docs/api-reference.md
🪛 LanguageTool
contributing/DEVELOPMENT.md

[uncategorized] ~324-~324: The official name of this software platform is spelled with a capital “H”.
Context: ...ons for CI/CD. Workflows are defined in .github/workflows/. ### Running CI Locally `...

(GITHUB)

README.md

[uncategorized] ~60-~60: The official name of this software platform is spelled with a capital “H”.
Context: ...nstall pyspark-data-sources[faker]| |github` | Batch | Read GitHub pull requests | ...

(GITHUB)

🪛 markdownlint-cli2 (0.18.1)
contributing/DEVELOPMENT.md

284-284: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


284-284: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


285-285: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


285-285: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


286-286: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


286-286: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


287-287: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)


287-287: Strong style
Expected: underscore; Actual: asterisk

(MD050, strong-style)

🔇 Additional comments (12)
contributing/DEVELOPMENT.md (1)

121-159: Excellent docstring example format.

The data source docstring template (lines 121–159) provides a comprehensive pattern that aligns well with best practices: it includes a brief description, explicit Name field, structured Options section with parameter details, and practical Examples with registration and output. This sets a strong standard for contributors.

docs/api-reference.md (2)

185-207: Excellent serialization guidance with clear examples.

The serialization section effectively contrasts BAD and GOOD patterns, demonstrating why non-serializable objects (connections, HTTP clients) must be created in read() rather than stored as instance variables. This guidance directly prevents a common category of runtime errors.


261-274: Strong resource management pattern.

The resource management example (lines 266–274) correctly demonstrates cleanup via hasattr() checks and explicit close() calls, emphasizing the importance of implementing stop() for streaming sources. Aligns with best practices.
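The two patterns these comments praise can be sketched together: non-picklable resources are created inside `read()` rather than stored in `__init__`, and released in `stop()` via the `hasattr()`/`close()` pattern. `FakeConnection` and `StreamReader` are stand-ins, not the guide's actual classes.

```python
class FakeConnection:
    """Stand-in for a non-picklable resource such as a socket or HTTP client."""
    def __init__(self, url):
        self.url = url
        self.closed = False

    def fetch(self):
        return [("a", 1), ("b", 2)]

    def close(self):
        self.closed = True


class StreamReader:
    def __init__(self, url):
        self.url = url  # GOOD: only picklable config in __init__

    def read(self):
        # GOOD: the connection is created at read time, on the executor
        self._conn = FakeConnection(self.url)
        yield from self._conn.fetch()

    def stop(self):
        # Cleanup mirrors the hasattr()/close() pattern from the guide
        if hasattr(self, "_conn"):
            self._conn.close()


r = StreamReader("https://example.com/stream")
rows = list(r.read())
r.stop()
assert rows == [("a", 1), ("b", 2)]
assert r._conn.closed
```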
docs/data-sources-guide.md (2)

5-14: Clear and comprehensive table of contents.

The table of contents provides excellent navigation across nine data sources with links. Users can quickly find the data source they need and understand the scope upfront.


425-457: Practical common patterns section.

Error handling (lines 428–436) and schema inference vs. specification (lines 438–457) patterns are clear and practical. The examples demonstrate both automatic inference and explicit schema specification, helping users choose the right approach.
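The inference-vs-specification contrast can be sketched without Spark. The DDL string form matches what Spark accepts as an explicit schema; the inference helper below is a toy stand-in for illustration, not Spark's actual inference logic.

```python
# Explicit schema: a Spark-style DDL string the user writes by hand
EXPLICIT_SCHEMA = "name STRING, age INT"


def infer_toy_schema(rows):
    """Toy inference: map Python types of the first row to Spark DDL types."""
    type_map = {str: "STRING", int: "INT", float: "DOUBLE"}
    first = rows[0]
    return ", ".join(
        f"{key} {type_map[type(value)]}" for key, value in first.items()
    )


rows = [{"name": "alice", "age": 30}]
# For this data, toy inference and the explicit schema agree
assert infer_toy_schema(rows) == EXPLICIT_SCHEMA
```

Explicit schemas trade convenience for predictability: inference can drift when the first row is unrepresentative, which is why the guide offers both.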

docs/building-data-sources.md (4)

18-65: Excellent minimal example with clear progression.

The minimal example cleanly demonstrates the two core classes (DataSource and DataSourceReader), name/schema methods, and basic read implementation. Starting with this pattern before advancing to partitioning and streaming is pedagogically sound.


336-375: Strong error handling pattern with exponential backoff.

The retry logic with exponential backoff (lines 336–375) demonstrates a professional-grade pattern: catches specific exceptions, implements progressive delays, respects failOnError option, and provides logging. Good reference implementation.
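The retry pattern described here can be sketched in a few lines: catch a specific exception, sleep progressively longer between attempts, and honor a `failOnError`-style flag. The names (`fetch_with_retry`, `fail_on_error`) are illustrative, not the guide's actual API.

```python
import time


def fetch_with_retry(fetch, max_retries=3, base_delay=0.01, fail_on_error=True):
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                if fail_on_error:
                    raise  # surface the error after exhausting retries
                return None  # swallow it when failOnError is disabled
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...


# A flaky fetch that succeeds on the third call
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


assert fetch_with_retry(flaky) == "ok"
assert calls["n"] == 3
```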


377-404: Serialization guidance matches best practices.

The SerializableReader example correctly shows the anti-pattern (storing connection objects) vs. pattern (storing connection strings and creating objects in read()). Clear and directly applicable.


617-684: Comprehensive testing section covering multiple scenarios.

Unit, partitioned, error handling, and streaming tests (lines 617–684) provide good coverage of common test scenarios. The pytest fixtures and assertion patterns are idiomatic and practical.
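A hedged sketch of a unit test in the style the testing section describes. In the real guide the fixture would build a SparkSession; here a plain reader class stands in so the sketch runs without pyspark, and the pytest-style functions are invoked directly.

```python
class RangeReader:
    """Toy reader yielding (id,) rows for a half-open range."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def read(self):
        for i in range(self.start, self.end):
            yield (i,)


def test_reader_row_count():
    rows = list(RangeReader(0, 5).read())
    assert len(rows) == 5
    assert rows[0] == (0,)


def test_reader_empty_range():
    # Edge case: an empty range should yield no rows, not fail
    assert list(RangeReader(3, 3).read()) == []


test_reader_row_count()
test_reader_empty_range()
```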

README.md (3)

6-69: Well-structured README with clear quick-start focus.

The restructured README effectively prioritizes quick start (installation, requirements, basic usage) and provides a compact table of available data sources with install notes. The progression from quick start → example → building guide → documentation links is intuitive and uncluttered.


93-129: Minimal custom data source example is practical.

The building guide example (lines 98–127) shows a complete, runnable data source with DataSource and DataSourceReader classes, registration, and usage. Conciseness balances completeness—users can run this immediately and then refer to docs/building-data-sources.md for advanced patterns.


131-150: Documentation navigation is clear and well-organized.

The documentation links (lines 131–136) provide a clear hierarchy: Data Sources Guide (examples) → Building Data Sources (tutorial) → API Reference (spec) → Development Guide (contributing). Resources at the end (lines 147–150) link to official Spark documentation. Good information architecture.

Comment on lines +284 to +287
1. **Serialization errors**: Ensure all class attributes are pickle-able
2. **Schema mismatch**: Verify returned data matches declared schema
3. **Missing dependencies**: Use try/except to provide helpful error messages
4. **API rate limits**: Implement backoff and retry logic
⚠️ Potential issue | 🟡 Minor

Fix markdown strong-style violations (MD050).

Use underscores instead of asterisks for strong emphasis in list items to comply with markdown linting rules.

-1. **Serialization errors**: Ensure all class attributes are pickle-able
-2. **Schema mismatch**: Verify returned data matches declared schema
-3. **Missing dependencies**: Use try/except to provide helpful error messages
-4. **API rate limits**: Implement backoff and retry logic
+1. __Serialization errors__: Ensure all class attributes are pickle-able
+2. __Schema mismatch__: Verify returned data matches declared schema
+3. __Missing dependencies__: Use try/except to provide helpful error messages
+4. __API rate limits__: Implement backoff and retry logic

🤖 Prompt for AI Agents
In contributing/DEVELOPMENT.md around lines 284 to 287, the list items use
asterisks for strong emphasis which violates MD050; replace the asterisk-based
bold markers with underscore-based strong emphasis (e.g., change **Serialization
errors** to __Serialization errors__) for each list item so the markdown linter
passes while preserving the same text and list structure.

@allisonwang-db allisonwang-db merged commit 4e877ce into master Oct 23, 2025
5 checks passed