Update README to add new data sources #22
Conversation
Walkthrough

Documentation-only update: README reorganized with categorized data source tables and examples; added four new docs pages for ArrowDataSource, LanceSink, OpenSkyDataSource, and WeatherDataSource with installation notes and API references. No code or public API changes.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 1
🧹 Nitpick comments (8)
docs/datasources/arrow.md (1)
1-7: Expand docs with usage, options, and partitioning details to match the implementation.

Add examples for single file, directory, and glob patterns; document the required options and the one-partition-per-file behavior.
Proposed edit:
````diff
 # ArrowDataSource
 > Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
 > or use `pip install pyspark-data-sources[arrow]`.
 ::: pyspark_datasources.arrow.ArrowDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.arrow import ArrowDataSource
+spark.dataSource.register(ArrowDataSource)
+
+# Single file
+df = spark.read.format("arrow").load("/path/to/file.arrow")
+
+# Directory of .arrow files (all .arrow files discovered and sorted)
+df = spark.read.format("arrow").load("/path/to/dir")
+
+# Glob pattern
+df = spark.read.format("arrow").load("/data/**/*.arrow")
+```
+
+## Options
+- path (required): File path, directory, or glob pattern.
+
+## Notes
+- One partition per file (a single file => 1 partition; N files => N partitions).
+- Schema is inferred from the first file using PyArrow and converted to Spark StructType.
````
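To ground the partitioning and schema notes above, here is a minimal sketch of how a file-based Arrow reader typically discovers files and infers a schema. The helper names are hypothetical and this is not the repository's implementation; the actual logic lives in pyspark_datasources/arrow.py:

```python
# Illustrative sketch only, not the repository's implementation.
# Shows the "one partition per file" and "schema from the first file" ideas.
import glob
import os

import pyarrow as pa
import pyarrow.ipc as ipc


def discover_arrow_files(path: str) -> list[str]:
    """Return a sorted list of .arrow files for a file, directory, or glob path."""
    if os.path.isdir(path):
        return sorted(glob.glob(os.path.join(path, "*.arrow")))
    if any(ch in path for ch in "*?["):
        return sorted(glob.glob(path, recursive=True))
    return [path]


def infer_schema(files: list[str]) -> pa.Schema:
    """Take the schema from the first file; each file later becomes one partition."""
    return ipc.open_file(files[0]).schema
```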
docs/datasources/weather.md (1)

3-3: Clarify options and add runnable examples (locations parsing, frequency).

The implementation expects locations as a stringified list of (lat, lon) tuples and validates frequency. Add examples and valid values.
Proposed edit:
````diff
 # WeatherDataSource
 > No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
 ::: pyspark_datasources.weather.WeatherDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.weather import WeatherDataSource
+spark.dataSource.register(WeatherDataSource)
+
+df = (
+    spark.readStream.format("weather")
+    # String representation of a list of tuples: "[(lat, lon), (lat, lon)]"
+    .option("locations", "[(37.7749, -122.4194), (40.7128, -74.0060)]")
+    .option("apikey", "<tomorrow.io_api_key>")
+    .option("frequency", "minutely")  # or "hourly" or "daily"
+    .load()
+)
+```
+
+## Options
+- locations (required): String-form list of (latitude, longitude) tuples.
+- apikey (required): Tomorrow.io API key.
+- frequency (optional): "minutely" (default), "hourly", or "daily".
````
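As a reference for the locations/frequency option formats documented above, parsing a stringified tuple list is commonly done with ast.literal_eval. This is an illustrative sketch with hypothetical helper names, not necessarily how pyspark_datasources/weather.py does it:

```python
# Sketch: parse and validate the option formats shown in the proposed docs.
import ast


def parse_locations(raw: str) -> list[tuple[float, float]]:
    # "[(37.7749, -122.4194), (40.7128, -74.0060)]" -> list of (lat, lon) tuples
    parsed = ast.literal_eval(raw)
    return [(float(lat), float(lon)) for lat, lon in parsed]


def validate_frequency(freq: str) -> str:
    allowed = {"minutely", "hourly", "daily"}
    if freq not in allowed:
        raise ValueError(f"frequency must be one of {sorted(allowed)}, got {freq!r}")
    return freq


print(parse_locations("[(37.7749, -122.4194)]"))  # [(37.7749, -122.4194)]
print(validate_frequency("minutely"))             # minutely
```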
docs/datasources/opensky.md (1)

3-3: Document rate limits, regions, and optional OAuth credentials.

Add guidance on available regions, rate limits, and the OAuth2 client credentials supported by the implementation.
Proposed edit:
````diff
 # OpenSkyDataSource
 > No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
 ::: pyspark_datasources.opensky.OpenSkyDataSource
+
+## Usage
+
+```python
+from pyspark_datasources.opensky import OpenSkyDataSource
+spark.dataSource.register(OpenSkyDataSource)
+
+df = (
+    spark.readStream.format("opensky")
+    .option("region", "EUROPE")  # See valid regions below
+    # Optional OAuth2 for higher limits:
+    # .option("client_id", "<client_id>").option("client_secret", "<client_secret>")
+    .load()
+)
+```
+
+## Options
+- region: One of EUROPE, NORTH_AMERICA, SOUTH_AMERICA, ASIA, AUSTRALIA, AFRICA, GLOBAL.
+- client_id / client_secret (optional): OAuth2 client credentials for higher rate limits.
+
+## Notes
+- The reader enforces a minimum interval between requests and handles 429 and transient errors with retry/backoff.
+- Anonymous vs authenticated rate limits differ; consider OAuth2 for production use.
````
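The retry/backoff note above maps to a fairly standard requests setup. A sketch of that pattern (illustrative only; the actual session configuration lives in pyspark_datasources/opensky.py, which the review below confirms uses requests' urllib3 Retry/HTTPAdapter):

```python
# Sketch: requests session with urllib3 Retry for 429/5xx and exponential backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    retry = Retry(
        total=5,
        backoff_factor=1.0,  # sleep grows exponentially between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


session = build_session()
# resp = session.get("https://opensky-network.org/api/states/all", timeout=30)
```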
docs/datasources/lance.md (1)

3-4: Call out Spark version prerequisite and overwrite behavior.

Per the implementation notes, this sink needs a Spark master (nightly) build for DataSourceArrowWriter and does not support overwrite. Add explicit notes and a minimal example with registration.
Proposed edit:
````diff
 # LanceSink
 > Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
 > or use `pip install pyspark-data-sources[lance]`.
 ::: pyspark_datasources.lance.LanceSink
+
+## Prerequisites
+- Requires a Spark build that supports DataSourceArrowWriter (Spark master/nightly at the time of writing).
+
+## Usage
+
+```python
+from pyspark_datasources.lance import LanceSink
+spark.dataSource.register(LanceSink)
+
+# Overwrite is not supported by the sink implementation
+df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")
+```
````
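If it helps reviewers validate the example, the written dataset can be read back with the lance Python package. A small sketch, assuming a local dataset path was used in the write call above:

```python
# Sketch: verify the sink output by reading the Lance dataset back.
import lance

ds = lance.dataset("/tmp/lance_data")  # path to the dataset written by the sink
print(ds.count_rows())                 # total rows written
print(ds.to_table().slice(0, 5))       # preview the first few rows as a pyarrow Table
```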
README.md (4)

41-58: Add a brief note about mandatory registration before using formats.

Examples in this table omit registration; most custom Python Data Sources here require it. Add a one-liner to avoid user confusion.
Proposed edit (inserted before the table):
```diff
 ## Example Data Sources
+> Note: Before using a data source or sink, register it with Spark. Example:
+> `from pyspark_datasources.arrow import ArrowDataSource; spark.dataSource.register(ArrowDataSource)`
+
 | Data Source | Short Name | Type | Description | Dependencies | Example |
```
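For readers new to the Python Data Source API, the register-then-read flow that the note refers to looks roughly like this (a sketch; the import path follows the package layout in this repo):

```python
# Sketch: registration is required once per session before the short name can be used.
from pyspark.sql import SparkSession

from pyspark_datasources.arrow import ArrowDataSource

spark = SparkSession.builder.getOrCreate()

# After registration, "arrow" is available to spark.read / spark.readStream.
spark.dataSource.register(ArrowDataSource)

df = spark.read.format("arrow").load("/path/to/file.arrow")
df.show()
```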
54-55: Tighten dependency wording for streaming sources.

Weather/OpenSky use external web APIs; clarify "No additional Python deps" vs operational requirements (API keys, rate limits).
Proposed edit:
```diff
-| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Streaming Read | Read from OpenSky Network. | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()` |
-| [WeatherDataSource](pyspark_datasources/weather.py) | `weather` | Streaming Read | Fetch weather data from tomorrow.io | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()` |
+| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Streaming Read | Read from OpenSky Network. | No extra Python deps | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()` |
+| [WeatherDataSource](pyspark_datasources/weather.py) | `weather` | Streaming Read | Fetch weather data from Tomorrow.io | No extra Python deps | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()` |
```

Optionally, add a footnote below the table:
- OpenSky: Subject to API rate limits; optional OAuth2 credentials supported.
- Weather: Requires a Tomorrow.io API key.
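Since the table rows only show the readStream half, here is a minimal way to actually run one of these streams, as a sketch only: it assumes registration as shown earlier and a valid Tomorrow.io API key in place of "key".

```python
# Sketch: consume the weather stream from the table row above and print batches to the console.
from pyspark_datasources.weather import WeatherDataSource

spark.dataSource.register(WeatherDataSource)

query = (
    spark.readStream.format("weather")
    .option("locations", "[(37.7749, -122.4194)]")
    .option("apikey", "key")  # replace with a real Tomorrow.io API key
    .load()
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```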
44-44: Clarify Arrow format phrasing and example registration.

Proposed edit:
```diff
-| [ArrowDataSource](pyspark_datasources/arrow.py) | `arrow` | Batch Read | Read Apache Arrow files (.arrow) | `pyarrow` | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")` |
+| [ArrowDataSource](pyspark_datasources/arrow.py) | `arrow` | Batch Read | Read Apache Arrow IPC files (.arrow) | `pyarrow` | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")` |
```
52-52: Mention dataset URI in Lance example.

Proposed edit:
```diff
-| [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
+| [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data to a Lance dataset | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("lance:///<dataset_uri_or_path>")` |
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (5)
- README.md (1 hunks)
- docs/datasources/arrow.md (1 hunks)
- docs/datasources/lance.md (1 hunks)
- docs/datasources/opensky.md (1 hunks)
- docs/datasources/weather.md (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
docs/datasources/arrow.md (2)
pyspark_datasources/arrow.py (3)
ArrowDataSource(10-131)
ArrowDataSourceReader(134-161)
reader(118-119)
tests/test_data_sources.py (2)
test_arrow_datasource_single_file(103-136)
test_arrow_datasource_multiple_files(139-178)
docs/datasources/lance.md (1)
pyspark_datasources/lance.py (7)
LanceSink(10-57)
LanceWriter(65-95)
commit(86-95)
__init__(66-72)
write(81-84)
LanceCommitMessage(61-62)
name(49-50)
docs/datasources/opensky.md (2)
pyspark_datasources/opensky.py (5)
OpenSkyDataSource(314-477)
OpenSkyStreamReader(112-311)
schema(452-474)
OpenSkyAPIError(100-103)
name(449-450)
tests/test_data_sources.py (1)
test_opensky_datasource_stream(58-73)
docs/datasources/weather.md (1)
pyspark_datasources/weather.py (5)
WeatherDataSource(9-72)
WeatherSimpleStreamReader(75-130)
name(48-50)
schema(59-68)
__init__(100-108)
README.md (6)
pyspark_datasources/fake.py (3)
FakeDataSource(35-125)
FakeDataSourceStreamReader(147-173)
FakeDataSourceReader(128-144)
pyspark_datasources/stock.py (1)
StockDataSource(8-41)
pyspark_datasources/opensky.py (1)
OpenSkyDataSource(314-477)
pyspark_datasources/weather.py (1)
WeatherDataSource(9-72)
pyspark_datasources/arrow.py (1)
ArrowDataSource(10-131)
pyspark_datasources/salesforce.py (1)
SalesforceDataSource(19-205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (3.9)
- GitHub Check: test (3.12)
- GitHub Check: test (3.10)
- GitHub Check: test (3.11)
🔇 Additional comments (4)
docs/datasources/arrow.md (1)
6-6: mkdocstrings is configured and ArrowDataSource is importable

Verified: mkdocs.yml contains the mkdocstrings plugin (python handler); the package exports ArrowDataSource and docs/datasources/arrow.md contains the mkdocstrings directive.

- mkdocs.yml — plugin present: `mkdocstrings:` with `default_handler: python` (`handlers.python.options.docstring_style: numpy`).
- pyproject.toml — `mkdocstrings` listed in dependencies.
- pyspark_datasources/__init__.py — exports ArrowDataSource.
- pyspark_datasources/arrow.py — defines class ArrowDataSource.
- docs/datasources/arrow.md — contains the directive: `::: pyspark_datasources.arrow.ArrowDataSource`

Note: docs/datasources/arrow.md is not listed in mkdocs.yml nav — add it if you want it linked in the site navigation.
docs/datasources/weather.md (1)
3-3: Confirmed — `requests` is declared in packaging; no change required.

- pyproject.toml (line ~15): `requests = "^2.31.0"`
- pyspark_datasources/weather.py: `import requests` (line 3) and uses `requests.Session()` (line 108)

The docs statement ("No additional dependencies required") is accurate — users don't need to install `requests` separately.

docs/datasources/opensky.md (1)
3-3: Resolved — requests is a declared dependency and urllib3 is available transitively; docs OK

pyproject.toml declares requests (`requests = "^2.31.0"`); poetry.lock contains urllib3 and locked requests versions; pyspark_datasources/opensky.py uses requests plus requests' urllib3 Retry/HTTPAdapter. The docs line "No additional dependencies required." is accurate — no packaging change required.
Verified locations:
- pyproject.toml: requests = "^2.31.0"
- pyspark_datasources/opensky.py: imports requests, HTTPAdapter, Retry
- poetry.lock: urllib3 present (transitive dependency)
README.md (1)
41-58: Restructuring is a clear usability win.

The categorized table with Examples substantially improves discoverability and onboarding. Nice improvement.
> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.
💡 Verification agent
🧩 Analysis chain
Verify extras and dependency declarations for Arrow support.
The docs reference an extras install `pyspark-data-sources[arrow]` and the `pyarrow` dependency. Please verify:
- Extras group "arrow" exists.
- `pyarrow` is included in that extras group (or base deps if intended).
Run this script to confirm extras and dependencies:
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Looking for pyproject and setup configuration..."
fd -I -a 'pyproject.toml' || true
fd -I -a 'setup.cfg' || true
fd -I -a 'setup.py' || true
echo -e "\nInspect pyproject.toml for extras and dependencies:"
if fd -I 'pyproject.toml' >/dev/null; then
  cat $(fd -I 'pyproject.toml') | sed -n '1,200p'
  echo -e "\nExtras defined:"
  rg -n "^\[project\.optional-dependencies\]" -n -A 50 || true
  rg -n "pyarrow" || true
fi
echo -e "\nIf using setup.cfg or setup.py, search for extras and pyarrow:"
rg -n "extras_require|pyarrow" -S -A2 || true
```

Length of output: 26547
Fix docs: remove non-existent arrow extra; pyarrow is a core dependency
pyproject.toml declares pyarrow under [tool.poetry.dependencies] (pyarrow = ">=11.0.0") and there is no extras group named "arrow". Update docs to stop recommending pyspark-data-sources[arrow].
Files to update:
- docs/datasources/arrow.md (lines 3–4)
- README.md (the table entry that suggests pip install pyspark-data-sources[arrow])
Suggested replacements (examples):
docs/datasources/arrow.md

Before:

```markdown
> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.
```

After:

```markdown
> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow` or install the package (pyarrow is a core dependency): `pip install pyspark-data-sources`.
```

README.md (table cell)

Before: `pip install pyspark-data-sources[arrow]`

After: `pip install pyspark-data-sources` (pyarrow is installed as a dependency) — or `pip install pyarrow` to install only PyArrow.
Evidence: pyproject.toml contains pyarrow = ">=11.0.0" and [tool.poetry.extras] does not include "arrow" (from the verification output).
🤖 Prompt for AI Agents
docs/datasources/arrow.md lines 3–4 and README.md table entry: the docs
reference a non-existent extras group `arrow` and suggest installing
`pyspark-data-sources[arrow]`; update both files to remove the `[...]` extra and
state that pyarrow is a core dependency. In docs/datasources/arrow.md replace
the two-line suggestion with a single sentence recommending `pip install
pyarrow` or `pip install pyspark-data-sources` (not the `[arrow]` extra) and
mention pyarrow is included as a dependency; in README.md update the table cell
to `pip install pyspark-data-sources` (pyarrow is installed as a dependency) —
or `pip install pyarrow` to install only PyArrow.