
Conversation

@allisonwang-db (Owner) commented Aug 14, 2025

Summary by CodeRabbit

  • Documentation
    • Reorganized README data sources into categorized sections (Batch/Streaming Read/Write) with new Type and Example columns.
    • Added entries for Arrow (Batch Read), Hugging Face (Batch Read), Weather (Streaming Read), and Lance sink (Batch Write); reclassified Fake to support Batch and Streaming Read; removed Simple JSON.
    • Included clearer usage examples for batch and streaming, plus installation guidance for optional extras.
    • Added new docs pages for Arrow, Lance, OpenSky, and Weather with installation notes and API references.

@coderabbitai bot commented Aug 14, 2025

Walkthrough

Documentation-only update: README reorganized with categorized data source tables and examples; added four new docs pages for ArrowDataSource, LanceSink, OpenSkyDataSource, and WeatherDataSource with installation notes and API references. No code or public API changes.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| README restructure<br/>`README.md` | Replaced the single table with categorized sections (Batch/Streaming Read/Write), added Type and Example columns, added/reclassified data sources (Arrow, Fake, HuggingFace, Weather, Lance, etc.), removed SimpleJson, and included install extras plus sample read/readStream usage. |
| New datasource docs<br/>`docs/datasources/*` | Added pages: arrow.md (PyArrow dep, API ref), lance.md (Lance sink, install notes, API ref), opensky.md (no extra deps, API ref), weather.md (Tomorrow.io note, API ref). Content only; no runtime changes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I thump my paws on pages new,
Categories crisp, a tidy view.
Arrow flies, and Lance points true,
OpenSky wide, with weather too.
I nibble docs—so fresh, so bright—
Hop-hop! Examples take their flight.
README burrow, set just right. 🥕✨


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (8)
docs/datasources/arrow.md (1)

1-7: Expand docs with usage, options, and partitioning details matching the implementation.

Add examples for single file, directory, and glob; document required options and one-partition-per-file behavior.

Proposed edit:

 # ArrowDataSource

 > Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
 > or use `pip install pyspark-data-sources[arrow]`.

 ::: pyspark_datasources.arrow.ArrowDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.arrow import ArrowDataSource
+spark.dataSource.register(ArrowDataSource)
+
+# Single file
+df = spark.read.format("arrow").load("/path/to/file.arrow")
+
+# Directory of .arrow files (all .arrow files discovered and sorted)
+df = spark.read.format("arrow").load("/path/to/dir")
+
+# Glob pattern
+df = spark.read.format("arrow").load("/data/**/*.arrow")
+```
+
+## Options
+- path (required): File path, directory, or glob pattern.
+
+## Notes
+- One partition per file (a single file => 1 partition; N files => N partitions).
+- Schema is inferred from the first file using PyArrow and converted to Spark StructType.
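
As an aside on the schema-inference note above, that step boils down to something like the following sketch. It is an assumption-level illustration (it uses PySpark's internal `from_arrow_schema` helper and a placeholder path), not the actual `arrow.py` code.

```python
import pyarrow as pa
import pyarrow.ipc as ipc
from pyspark.sql.pandas.types import from_arrow_schema  # internal PySpark helper

# Open the first discovered .arrow file and read only its schema.
with pa.memory_map("/path/to/first.arrow", "r") as source:
    arrow_schema = ipc.open_file(source).schema

# Convert the Arrow schema into a Spark StructType for the data source.
spark_schema = from_arrow_schema(arrow_schema)
print(spark_schema)
```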
docs/datasources/weather.md (1)

3-3: Clarify options and add runnable examples (locations parsing, frequency).

The implementation expects locations as a stringified list of (lat, lon) tuples and validates frequency. Add examples and valid values.

Proposed edit:

 # WeatherDataSource

 > No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
 
 ::: pyspark_datasources.weather.WeatherDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.weather import WeatherDataSource
+spark.dataSource.register(WeatherDataSource)
+
+df = (
+    spark.readStream.format("weather")
+    # String representation of a list of tuples: "[(lat, lon), (lat, lon)]"
+    .option("locations", "[(37.7749, -122.4194), (40.7128, -74.0060)]")
+    .option("apikey", "<tomorrow.io_api_key>")
+    .option("frequency", "minutely")  # or "hourly" or "daily"
+    .load()
+)
+```
+
+## Options
+- locations (required): String-form list of (latitude, longitude) tuples.
+- apikey (required): Tomorrow.io API key.
+- frequency (optional): "minutely" (default), "hourly", or "daily".
docs/datasources/opensky.md (1)

3-3: Document rate limits, regions, and optional OAuth credentials.

Add guidance on available regions, rate limits, and OAuth2 client credentials supported by the implementation.

Proposed edit:

 # OpenSkyDataSource

 > No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
 
 ::: pyspark_datasources.opensky.OpenSkyDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.opensky import OpenSkyDataSource
+spark.dataSource.register(OpenSkyDataSource)
+
+df = (
+    spark.readStream.format("opensky")
+    .option("region", "EUROPE")  # See valid regions below
+    # Optional OAuth2 for higher limits:
+    # .option("client_id", "<client_id>").option("client_secret", "<client_secret>")
+    .load()
+)
+```
+
+## Options
+- region: One of EUROPE, NORTH_AMERICA, SOUTH_AMERICA, ASIA, AUSTRALIA, AFRICA, GLOBAL.
+- client_id / client_secret (optional): OAuth2 client credentials for higher rate limits.
+
+## Notes
+- The reader enforces a minimum interval between requests and handles 429 and transient errors with retry/backoff.
+- Anonymous vs authenticated rate limits differ; consider OAuth2 for production use.
docs/datasources/lance.md (1)

3-4: Call out Spark version/prereq and overwrite behavior.

Per implementation notes, this sink needs Spark master nightly for DataSourceArrowWriter and does not support overwrite. Add explicit notes and a minimal example with registration.

Proposed edit:

 # LanceSink

 > Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
 > or use `pip install pyspark-data-sources[lance]`.
 
 ::: pyspark_datasources.lance.LanceSink
+
+## Prerequisites
+- Requires a Spark build that supports DataSourceArrowWriter (Spark master/nightly at the time of writing).
+
+## Usage
+
+```python
+from pyspark_datasources.lance import LanceSink
+spark.dataSource.register(LanceSink)
+
+# Overwrite is not supported by the sink implementation
+df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")
+```
README.md (4)

41-58: Add a brief note about mandatory registration before using formats.

Examples in this table omit registration; most custom Python Data Sources here require it. Add a one-liner to avoid user confusion.

Proposed edit (inserted before the table):

 ## Example Data Sources
 
+> Note: Before using a data source or sink, register it with Spark. Example:
+> `from pyspark_datasources.arrow import ArrowDataSource; spark.dataSource.register(ArrowDataSource)`
+
 | Data Source                                                             | Short Name     | Type           | Description                                   | Dependencies          | Example                                                                                                                                                                      |

54-55: Tighten dependency wording for streaming sources.

Weather/OpenSky use external web APIs; clarify “No additional Python deps” vs operational requirements (API keys, rate limits).

Proposed edit:

-| [OpenSkyDataSource](pyspark_datasources/opensky.py)                 | `opensky`      | Streaming Read | Read from OpenSky Network.                   | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                                                            |
-| [WeatherDataSource](pyspark_datasources/weather.py)                    | `weather`      | Streaming Read | Fetch weather data from tomorrow.io           | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`                                          |
+| [OpenSkyDataSource](pyspark_datasources/opensky.py)                 | `opensky`      | Streaming Read | Read from OpenSky Network.                   | No extra Python deps   | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                                                            |
+| [WeatherDataSource](pyspark_datasources/weather.py)                    | `weather`      | Streaming Read | Fetch weather data from Tomorrow.io           | No extra Python deps   | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`                                          |

Optionally, add a footnote below the table:

  • OpenSky: Subject to API rate limits; optional OAuth2 credentials supported.
  • Weather: Requires a Tomorrow.io API key.

44-44: Clarify Arrow format phrasing and example registration.

Proposed edit:

-| [ArrowDataSource](pyspark_datasources/arrow.py)                        | `arrow`        | Batch Read     | Read Apache Arrow files (.arrow)             | `pyarrow`             | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")`                                                                                                     |
+| [ArrowDataSource](pyspark_datasources/arrow.py)                        | `arrow`        | Batch Read     | Read Apache Arrow IPC files (.arrow)         | `pyarrow`             | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")`                                                                                                     |

52-52: Mention dataset URI in Lance example.

Proposed edit:

-| [LanceSink](pyspark_datasources/lance.py)                              | `lance`        | Batch Write    | Write data in Lance format                    | `lance`               | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")`                                                                                          |
+| [LanceSink](pyspark_datasources/lance.py)                              | `lance`        | Batch Write    | Write data to a Lance dataset                 | `lance`               | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")`                                                                          |
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b676bfe and 0b5e9b1.

📒 Files selected for processing (5)
  • README.md (1 hunks)
  • docs/datasources/arrow.md (1 hunks)
  • docs/datasources/lance.md (1 hunks)
  • docs/datasources/opensky.md (1 hunks)
  • docs/datasources/weather.md (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
docs/datasources/arrow.md (2)
pyspark_datasources/arrow.py (3)
  • ArrowDataSource (10-131)
  • ArrowDataSourceReader (134-161)
  • reader (118-119)
tests/test_data_sources.py (2)
  • test_arrow_datasource_single_file (103-136)
  • test_arrow_datasource_multiple_files (139-178)
docs/datasources/lance.md (1)
pyspark_datasources/lance.py (7)
  • LanceSink (10-57)
  • LanceWriter (65-95)
  • commit (86-95)
  • __init__ (66-72)
  • write (81-84)
  • LanceCommitMessage (61-62)
  • name (49-50)
docs/datasources/opensky.md (2)
pyspark_datasources/opensky.py (5)
  • OpenSkyDataSource (314-477)
  • OpenSkyStreamReader (112-311)
  • schema (452-474)
  • OpenSkyAPIError (100-103)
  • name (449-450)
tests/test_data_sources.py (1)
  • test_opensky_datasource_stream (58-73)
docs/datasources/weather.md (1)
pyspark_datasources/weather.py (5)
  • WeatherDataSource (9-72)
  • WeatherSimpleStreamReader (75-130)
  • name (48-50)
  • schema (59-68)
  • __init__ (100-108)
README.md (6)
pyspark_datasources/fake.py (3)
  • FakeDataSource (35-125)
  • FakeDataSourceStreamReader (147-173)
  • FakeDataSourceReader (128-144)
pyspark_datasources/stock.py (1)
  • StockDataSource (8-41)
pyspark_datasources/opensky.py (1)
  • OpenSkyDataSource (314-477)
pyspark_datasources/weather.py (1)
  • WeatherDataSource (9-72)
pyspark_datasources/arrow.py (1)
  • ArrowDataSource (10-131)
pyspark_datasources/salesforce.py (1)
  • SalesforceDataSource (19-205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.9)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.10)
  • GitHub Check: test (3.11)
🔇 Additional comments (4)
docs/datasources/arrow.md (1)

6-6: mkdocstrings is configured and ArrowDataSource is importable

Verified: mkdocs.yml contains the mkdocstrings plugin (python handler); the package exports ArrowDataSource and docs/datasources/arrow.md contains the mkdocstrings directive.

  • mkdocs.yml — mkdocstrings plugin present with `default_handler: python` (`handlers.python.options.docstring_style: numpy`).
  • pyproject.toml — mkdocstrings listed in dependencies.
  • pyspark_datasources/__init__.py — exports ArrowDataSource.
  • pyspark_datasources/arrow.py — defines class ArrowDataSource.
  • docs/datasources/arrow.md — contains the directive: ::: pyspark_datasources.arrow.ArrowDataSource

Note: docs/datasources/arrow.md is not listed in mkdocs.yml nav — add it if you want it linked in the site navigation.
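
For reference, the importability claim can be sanity-checked with a short snippet along these lines (illustrative only):

```python
import importlib

# The directive `::: pyspark_datasources.arrow.ArrowDataSource` renders only
# if this dotted path resolves to an importable object.
module = importlib.import_module("pyspark_datasources.arrow")
cls = getattr(module, "ArrowDataSource")
print(f"{cls.__module__}.{cls.__qualname__} is importable")
```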

docs/datasources/weather.md (1)

3-3: Confirmed — requests is declared in packaging; no change required.

pyproject.toml (line ~15): requests = "^2.31.0"
pyspark_datasources/weather.py: import requests (line 3) and uses requests.Session() (line 108)

Docs statement ("No additional dependencies required") is accurate — users don't need to install requests separately.

docs/datasources/opensky.md (1)

3-3: Resolved — requests is a declared dependency and urllib3 is available transitively; docs OK

pyproject.toml declares requests (requests = "^2.31.0"); poetry.lock contains urllib3 and locked requests versions; pyspark_datasources/opensky.py uses requests + requests' urllib3 Retry/HTTPAdapter. The docs line "No additional dependencies required." is accurate — no packaging change required.

Verified locations:

  • pyproject.toml: requests = "^2.31.0"
  • pyspark_datasources/opensky.py: imports requests, HTTPAdapter, Retry
  • poetry.lock: urllib3 present (transitive dependency)
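
For context, the Retry/HTTPAdapter idiom referenced here generally looks like the sketch below. It is a generic requests/urllib3 example; the retry counts, backoff factor, and endpoint are illustrative assumptions rather than the values used in opensky.py.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=5,                                     # retry up to 5 times
    backoff_factor=1.0,                          # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # rate limits and transient server errors
    allowed_methods=["GET"],
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Example call against the OpenSky REST API (anonymous access is rate limited).
resp = session.get("https://opensky-network.org/api/states/all", timeout=30)
resp.raise_for_status()
print(len(resp.json().get("states") or []), "state vectors")
```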
README.md (1)

41-58: Restructuring is a clear usability win.

The categorized table with Examples substantially improves discoverability and onboarding. Nice improvement.

Comment on lines +3 to +4
> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.

💡 Verification agent

🧩 Analysis chain

Verify extras and dependency declarations for Arrow support.

The docs reference an extras install pyspark-data-sources[arrow] and the pyarrow dependency. Please verify:

  • Extras group "arrow" exists.
  • pyarrow is included in that extras group (or base deps if intended).

Run this script to confirm extras and dependencies:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for pyproject and setup configuration..."
fd -I -a 'pyproject.toml' || true
fd -I -a 'setup.cfg' || true
fd -I -a 'setup.py' || true

echo -e "\nInspect pyproject.toml for extras and dependencies:"
if fd -I 'pyproject.toml' >/dev/null; then
  cat $(fd -I 'pyproject.toml') | sed -n '1,200p'
  echo -e "\nExtras defined:"
  rg -n "^\[project\.optional-dependencies\]" -n -A 50 || true
  rg -n "pyarrow" || true
fi

echo -e "\nIf using setup.cfg or setup.py, search for extras and pyarrow:"
rg -n "extras_require|pyarrow" -S -A2 || true

Length of output: 26547


Fix docs: remove non-existent arrow extra; pyarrow is a core dependency

pyproject.toml declares pyarrow under [tool.poetry.dependencies] (pyarrow = ">=11.0.0") and there is no extras group named "arrow". Update docs to stop recommending pyspark-data-sources[arrow].

Files to update:

  • docs/datasources/arrow.md (lines 3–4)
  • README.md (the table entry that suggests pip install pyspark-data-sources[arrow])

Suggested replacements (examples):

docs/datasources/arrow.md

-> Requires the PyArrow library. You can install it manually: `pip install pyarrow`
-> or use `pip install pyspark-data-sources[arrow]`.
+> Requires the PyArrow library. You can install it manually: `pip install pyarrow`, or install the package (pyarrow is a core dependency): `pip install pyspark-data-sources`.

README.md (table cell)

-`pip install pyspark-data-sources[arrow]`
+`pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.

Evidence: pyproject.toml contains pyarrow = ">=11.0.0" and [tool.poetry.extras] does not include "arrow" (from the verification output).

🤖 Prompt for AI Agents
docs/datasources/arrow.md lines 3–4 and README.md table entry: the docs
reference a non-existent extras group `arrow` and suggest installing
`pyspark-data-sources[arrow]`; update both files to remove the `[...]` extra and
state that pyarrow is a core dependency. In docs/datasources/arrow.md replace
the two-line suggestion with a single sentence recommending `pip install
pyarrow` or `pip install pyspark-data-sources` (not the `[arrow]` extra) and
mention pyarrow is included as a dependency; in README.md update the table cell
to `pip install pyspark-data-sources` (pyarrow is installed as a dependency) —
or `pip install pyarrow` to install only PyArrow.

@allisonwang-db merged commit 3ee0484 into master Aug 19, 2025
5 checks passed

@coderabbitai bot mentioned this pull request Oct 22, 2025
