
Conversation

@allisonwang-db (Owner) commented Aug 14, 2025

Summary by CodeRabbit

  • Documentation
    • Reorganized README data sources into categorized sections (Batch/Streaming Read/Write) with new Type and Example columns.
    • Added entries for Arrow (Batch Read), Hugging Face (Batch Read), Weather (Streaming Read), and Lance sink (Batch Write); reclassified Fake to support Batch and Streaming Read; removed Simple JSON.
    • Included clearer usage examples for batch and streaming, plus installation guidance for optional extras.
    • Added new docs pages for Arrow, Lance, OpenSky, and Weather with installation notes and API references.

@coderabbitai bot commented Aug 14, 2025

Walkthrough

Documentation-only update: README reorganized with categorized data source tables and examples; added four new docs pages for ArrowDataSource, LanceSink, OpenSkyDataSource, and WeatherDataSource with installation notes and API references. No code or public API changes.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| README restructure<br/>`README.md` | Replaced the single table with categorized sections (Batch/Streaming Read/Write), added Type and Example columns, added/reclassified data sources (Arrow, Fake, HuggingFace, Weather, Lance, etc.), removed SimpleJson, and included install extras plus sample read/readStream usage. |
| New datasource docs<br/>`docs/datasources/*` | Added pages: arrow.md (PyArrow dep, API ref), lance.md (Lance sink, install notes, API ref), opensky.md (no extra deps, API ref), weather.md (Tomorrow.io note, API ref). Content only; no runtime changes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I thump my paws on pages new,
Categories crisp, a tidy view.
Arrow flies, and Lance points true,
OpenSky wide, with weather too.
I nibble docs—so fresh, so bright—
Hop-hop! Examples take their flight.
README burrow, set just right. 🥕✨


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (8)
docs/datasources/arrow.md (1)

1-7: Expand docs with usage, options, and partitioning details matching the implementation.

Add examples for single file, directory, and glob; document required options and one-partition-per-file behavior.

Proposed edit:

 # ArrowDataSource

 > Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
 > or use `pip install pyspark-data-sources[arrow]`.

 ::: pyspark_datasources.arrow.ArrowDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.arrow import ArrowDataSource
+spark.dataSource.register(ArrowDataSource)
+
+# Single file
+df = spark.read.format("arrow").load("/path/to/file.arrow")
+
+# Directory of .arrow files (all .arrow files discovered and sorted)
+df = spark.read.format("arrow").load("/path/to/dir")
+
+# Glob pattern
+df = spark.read.format("arrow").load("/data/**/*.arrow")
+```
+
+## Options
+- path (required): File path, directory, or glob pattern.
+
+## Notes
+- One partition per file (a single file => 1 partition; N files => N partitions).
+- Schema is inferred from the first file using PyArrow and converted to Spark StructType.
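
As an aside on the schema-inference note above, that step boils down to something like the following sketch. It is an assumption-level illustration (it uses PySpark's internal `from_arrow_schema` helper and a placeholder path), not the actual `arrow.py` code.

```python
import pyarrow as pa
import pyarrow.ipc as ipc
from pyspark.sql.pandas.types import from_arrow_schema  # internal PySpark helper

# Open the first discovered .arrow file and read only its schema.
with pa.memory_map("/path/to/first.arrow", "r") as source:
    arrow_schema = ipc.open_file(source).schema

# Convert the Arrow schema into a Spark StructType for the data source.
spark_schema = from_arrow_schema(arrow_schema)
print(spark_schema)
```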
docs/datasources/weather.md (1)

3-3: Clarify options and add runnable examples (locations parsing, frequency).

The implementation expects locations as a stringified list of (lat, lon) tuples and validates frequency. Add examples and valid values.

Proposed edit:

 # WeatherDataSource

 > No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
 
 ::: pyspark_datasources.weather.WeatherDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.weather import WeatherDataSource
+spark.dataSource.register(WeatherDataSource)
+
+df = (
+    spark.readStream.format("weather")
+    # String representation of a list of tuples: "[(lat, lon), (lat, lon)]"
+    .option("locations", "[(37.7749, -122.4194), (40.7128, -74.0060)]")
+    .option("apikey", "<tomorrow.io_api_key>")
+    .option("frequency", "minutely")  # or "hourly" or "daily"
+    .load()
+)
+```
+
+## Options
+- locations (required): String-form list of (latitude, longitude) tuples.
+- apikey (required): Tomorrow.io API key.
+- frequency (optional): "minutely" (default), "hourly", or "daily".
docs/datasources/opensky.md (1)

3-3: Document rate limits, regions, and optional OAuth credentials.

Add guidance on available regions, rate limits, and OAuth2 client credentials supported by the implementation.

Proposed edit:

 # OpenSkyDataSource

 > No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
 
 ::: pyspark_datasources.opensky.OpenSkyDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.opensky import OpenSkyDataSource
+spark.dataSource.register(OpenSkyDataSource)
+
+df = (
+    spark.readStream.format("opensky")
+    .option("region", "EUROPE")  # See valid regions below
+    # Optional OAuth2 for higher limits:
+    # .option("client_id", "<client_id>").option("client_secret", "<client_secret>")
+    .load()
+)
+```
+
+## Options
+- region: One of EUROPE, NORTH_AMERICA, SOUTH_AMERICA, ASIA, AUSTRALIA, AFRICA, GLOBAL.
+- client_id / client_secret (optional): OAuth2 client credentials for higher rate limits.
+
+## Notes
+- The reader enforces a minimum interval between requests and handles 429 and transient errors with retry/backoff.
+- Anonymous vs authenticated rate limits differ; consider OAuth2 for production use.
docs/datasources/lance.md (1)

3-4: Call out Spark version/prereq and overwrite behavior.

Per implementation notes, this sink needs Spark master nightly for DataSourceArrowWriter and does not support overwrite. Add explicit notes and a minimal example with registration.

Proposed edit:

 # LanceSink

 > Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
 > or use `pip install pyspark-data-sources[lance]`.
 
 ::: pyspark_datasources.lance.LanceSink
+
+## Prerequisites
+- Requires a Spark build that supports DataSourceArrowWriter (Spark master/nightly at the time of writing).
+
+## Usage
+
+```python
+from pyspark_datasources.lance import LanceSink
+spark.dataSource.register(LanceSink)
+
+# Overwrite is not supported by the sink implementation
+df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")
+```
README.md (4)

41-58: Add a brief note about mandatory registration before using formats.

Examples in this table omit registration; most custom Python Data Sources here require it. Add a one-liner to avoid user confusion.

Proposed edit (inserted before the table):

 ## Example Data Sources
 
+> Note: Before using a data source or sink, register it with Spark. Example:
+> `from pyspark_datasources.arrow import ArrowDataSource; spark.dataSource.register(ArrowDataSource)`
+
 | Data Source                                                             | Short Name     | Type           | Description                                   | Dependencies          | Example                                                                                                                                                                      |

54-55: Tighten dependency wording for streaming sources.

Weather/OpenSky use external web APIs; clarify “No additional Python deps” vs operational requirements (API keys, rate limits).

Proposed edit:

-| [OpenSkyDataSource](pyspark_datasources/opensky.py)                 | `opensky`      | Streaming Read | Read from OpenSky Network.                   | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                                                            |
-| [WeatherDataSource](pyspark_datasources/weather.py)                    | `weather`      | Streaming Read | Fetch weather data from tomorrow.io           | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`                                          |
+| [OpenSkyDataSource](pyspark_datasources/opensky.py)                 | `opensky`      | Streaming Read | Read from OpenSky Network.                   | No extra Python deps   | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                                                            |
+| [WeatherDataSource](pyspark_datasources/weather.py)                    | `weather`      | Streaming Read | Fetch weather data from Tomorrow.io           | No extra Python deps   | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`                                          |

Optionally, add a footnote below the table:

  • OpenSky: Subject to API rate limits; optional OAuth2 credentials supported.
  • Weather: Requires a Tomorrow.io API key.

44-44: Clarify Arrow format phrasing and example registration.

Proposed edit:

-| [ArrowDataSource](pyspark_datasources/arrow.py)                        | `arrow`        | Batch Read     | Read Apache Arrow files (.arrow)             | `pyarrow`             | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")`                                                                                                     |
+| [ArrowDataSource](pyspark_datasources/arrow.py)                        | `arrow`        | Batch Read     | Read Apache Arrow IPC files (.arrow)         | `pyarrow`             | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")`                                                                                                     |

52-52: Mention dataset URI in Lance example.

Proposed edit:

-| [LanceSink](pyspark_datasources/lance.py)                              | `lance`        | Batch Write    | Write data in Lance format                    | `lance`               | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")`                                                                                          |
+| [LanceSink](pyspark_datasources/lance.py)                              | `lance`        | Batch Write    | Write data to a Lance dataset                 | `lance`               | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")`                                                                          |
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b676bfe and 0b5e9b1.

📒 Files selected for processing (5)
  • README.md (1 hunks)
  • docs/datasources/arrow.md (1 hunks)
  • docs/datasources/lance.md (1 hunks)
  • docs/datasources/opensky.md (1 hunks)
  • docs/datasources/weather.md (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
docs/datasources/arrow.md (2)
pyspark_datasources/arrow.py (3)
  • ArrowDataSource (10-131)
  • ArrowDataSourceReader (134-161)
  • reader (118-119)
tests/test_data_sources.py (2)
  • test_arrow_datasource_single_file (103-136)
  • test_arrow_datasource_multiple_files (139-178)
docs/datasources/lance.md (1)
pyspark_datasources/lance.py (7)
  • LanceSink (10-57)
  • LanceWriter (65-95)
  • commit (86-95)
  • __init__ (66-72)
  • write (81-84)
  • LanceCommitMessage (61-62)
  • name (49-50)
docs/datasources/opensky.md (2)
pyspark_datasources/opensky.py (5)
  • OpenSkyDataSource (314-477)
  • OpenSkyStreamReader (112-311)
  • schema (452-474)
  • OpenSkyAPIError (100-103)
  • name (449-450)
tests/test_data_sources.py (1)
  • test_opensky_datasource_stream (58-73)
docs/datasources/weather.md (1)
pyspark_datasources/weather.py (5)
  • WeatherDataSource (9-72)
  • WeatherSimpleStreamReader (75-130)
  • name (48-50)
  • schema (59-68)
  • __init__ (100-108)
README.md (6)
pyspark_datasources/fake.py (3)
  • FakeDataSource (35-125)
  • FakeDataSourceStreamReader (147-173)
  • FakeDataSourceReader (128-144)
pyspark_datasources/stock.py (1)
  • StockDataSource (8-41)
pyspark_datasources/opensky.py (1)
  • OpenSkyDataSource (314-477)
pyspark_datasources/weather.py (1)
  • WeatherDataSource (9-72)
pyspark_datasources/arrow.py (1)
  • ArrowDataSource (10-131)
pyspark_datasources/salesforce.py (1)
  • SalesforceDataSource (19-205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.9)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.10)
  • GitHub Check: test (3.11)
🔇 Additional comments (4)
docs/datasources/arrow.md (1)

6-6: mkdocstrings is configured and ArrowDataSource is importable

Verified: mkdocs.yml contains the mkdocstrings plugin (python handler); the package exports ArrowDataSource and docs/datasources/arrow.md contains the mkdocstrings directive.

  • mkdocs.yml — mkdocstrings plugin present with `default_handler: python` (`handlers.python.options.docstring_style: numpy`).
  • pyproject.toml — mkdocstrings listed in dependencies.
  • pyspark_datasources/__init__.py — exports ArrowDataSource.
  • pyspark_datasources/arrow.py — defines class ArrowDataSource.
  • docs/datasources/arrow.md — contains the directive: ::: pyspark_datasources.arrow.ArrowDataSource

Note: docs/datasources/arrow.md is not listed in mkdocs.yml nav — add it if you want it linked in the site navigation.
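
For reference, the importability claim can be sanity-checked with a short snippet along these lines (illustrative only):

```python
import importlib

# The directive `::: pyspark_datasources.arrow.ArrowDataSource` renders only
# if this dotted path resolves to an importable object.
module = importlib.import_module("pyspark_datasources.arrow")
cls = getattr(module, "ArrowDataSource")
print(f"{cls.__module__}.{cls.__qualname__} is importable")
```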

docs/datasources/weather.md (1)

3-3: Confirmed — requests is declared in packaging; no change required.

pyproject.toml (line ~15): requests = "^2.31.0"
pyspark_datasources/weather.py: import requests (line 3) and uses requests.Session() (line 108)

Docs statement ("No additional dependencies required") is accurate — users don't need to install requests separately.

docs/datasources/opensky.md (1)

3-3: Resolved — requests is a declared dependency and urllib3 is available transitively; docs OK

pyproject.toml declares requests (requests = "^2.31.0"); poetry.lock contains urllib3 and locked requests versions; pyspark_datasources/opensky.py uses requests + requests' urllib3 Retry/HTTPAdapter. The docs line "No additional dependencies required." is accurate — no packaging change required.

Verified locations:

  • pyproject.toml: requests = "^2.31.0"
  • pyspark_datasources/opensky.py: imports requests, HTTPAdapter, Retry
  • poetry.lock: urllib3 present (transitive dependency)
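
For context, the Retry/HTTPAdapter idiom referenced here generally looks like the sketch below. It is a generic requests/urllib3 example; the retry counts, backoff factor, and endpoint are illustrative assumptions rather than the values used in opensky.py.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=5,                                     # retry up to 5 times
    backoff_factor=1.0,                          # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # rate limits and transient server errors
    allowed_methods=["GET"],
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Example call against the OpenSky REST API (anonymous access is rate limited).
resp = session.get("https://opensky-network.org/api/states/all", timeout=30)
resp.raise_for_status()
print(len(resp.json().get("states") or []), "state vectors")
```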
README.md (1)

41-58: Restructuring is a clear usability win.

The categorized table with Examples substantially improves discoverability and onboarding. Nice improvement.

Comment on lines +3 to +4
> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.

💡 Verification agent

🧩 Analysis chain

Verify extras and dependency declarations for Arrow support.

The docs reference an extras install pyspark-data-sources[arrow] and the pyarrow dependency. Please verify:

  • Extras group "arrow" exists.
  • pyarrow is included in that extras group (or base deps if intended).

Run this script to confirm extras and dependencies:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for pyproject and setup configuration..."
fd -I -a 'pyproject.toml' || true
fd -I -a 'setup.cfg' || true
fd -I -a 'setup.py' || true

echo -e "\nInspect pyproject.toml for extras and dependencies:"
if fd -I 'pyproject.toml' >/dev/null; then
  cat $(fd -I 'pyproject.toml') | sed -n '1,200p'
  echo -e "\nExtras defined:"
  rg -n "^\[project\.optional-dependencies\]" -n -A 50 || true
  rg -n "pyarrow" || true
fi

echo -e "\nIf using setup.cfg or setup.py, search for extras and pyarrow:"
rg -n "extras_require|pyarrow" -S -A2 || true

Length of output: 26547


Fix docs: remove non-existent arrow extra; pyarrow is a core dependency

pyproject.toml declares pyarrow under [tool.poetry.dependencies] (pyarrow = ">=11.0.0") and there is no extras group named "arrow". Update docs to stop recommending pyspark-data-sources[arrow].

Files to update:

  • docs/datasources/arrow.md (lines 3–4)
  • README.md (the table entry that suggests pip install pyspark-data-sources[arrow])

Suggested replacements (examples):

docs/datasources/arrow.md

-> Requires the PyArrow library. You can install it manually: `pip install pyarrow`
-> or use `pip install pyspark-data-sources[arrow]`.
+> Requires the PyArrow library. You can install it manually: `pip install pyarrow`, or install the package (pyarrow is a core dependency): `pip install pyspark-data-sources`.

README.md (table cell)

-`pip install pyspark-data-sources[arrow]`
+`pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.

Evidence: pyproject.toml contains pyarrow = ">=11.0.0" and [tool.poetry.extras] does not include "arrow" (from the verification output).

🤖 Prompt for AI Agents
docs/datasources/arrow.md lines 3–4 and README.md table entry: the docs
reference a non-existent extras group `arrow` and suggest installing
`pyspark-data-sources[arrow]`; update both files to remove the `[...]` extra and
state that pyarrow is a core dependency. In docs/datasources/arrow.md replace
the two-line suggestion with a single sentence recommending `pip install
pyarrow` or `pip install pyspark-data-sources` (not the `[arrow]` extra) and
mention pyarrow is included as a dependency; in README.md update the table cell
to `pip install pyspark-data-sources` (pyarrow is installed as a dependency) —
or `pip install pyarrow` to install only PyArrow.

@allisonwang-db merged commit 3ee0484 into master Aug 19, 2025
5 checks passed

@coderabbitai bot mentioned this pull request Oct 22, 2025
