27 changes: 17 additions & 10 deletions README.md
@@ -38,16 +38,23 @@ spark.readStream.format("fake").load().writeStream.format("console").start()

## Example Data Sources

| Data Source | Short Name | Description | Dependencies |
|-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py) | `googlesheets` | Read table from public Google Sheets | None |
| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Read from OpenSky Network. | None |
| [SalesforceDataSource](pyspark_datasources/salesforce.py) | `pyspark.datasource.salesforce` | Streaming datasource for writing data to Salesforce | `simple-salesforce` |
| Data Source | Short Name | Type | Description | Dependencies | Example |
|-------------------------------------------------------------------------|----------------|----------------|-----------------------------------------------|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Batch Read** | | | | | |
| [ArrowDataSource](pyspark_datasources/arrow.py) | `arrow` | Batch Read | Read Apache Arrow files (.arrow) | `pyarrow` | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")` |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Batch/Streaming Read | Generate fake data using the `Faker` library | `faker` | `pip install pyspark-data-sources[fake]`<br/>`spark.read.format("fake").load()` or `spark.readStream.format("fake").load()` |
| [GithubDataSource](pyspark_datasources/github.py)                        | `github`       | Batch Read     | Read pull requests from a GitHub repository   | None                  | `pip install pyspark-data-sources`<br/>`spark.read.format("github").load("apache/spark")`                                                                     |
| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py)            | `googlesheets` | Batch Read     | Read tables from public Google Sheets         | None                  | `pip install pyspark-data-sources`<br/>`spark.read.format("googlesheets").load("https://docs.google.com/spreadsheets/d/...")`                                 |
| [HuggingFaceDatasets](pyspark_datasources/huggingface.py) | `huggingface` | Batch Read | Read datasets from HuggingFace Hub | `datasets` | `pip install pyspark-data-sources[huggingface]`<br/>`spark.read.format("huggingface").load("imdb")` |
| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Batch Read | Read datasets from Kaggle | `kagglehub`, `pandas` | `pip install pyspark-data-sources[kaggle]`<br/>`spark.read.format("kaggle").load("titanic")` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Batch Read | Read stock data from Alpha Vantage | None | `pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()` |
| **Batch Write** | | | | | |
| [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
| **Streaming Read** | | | | | |
| [OpenSkyDataSource](pyspark_datasources/opensky.py)                      | `opensky`      | Streaming Read | Read real-time aircraft data from the OpenSky Network | None         | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                 |
| [WeatherDataSource](pyspark_datasources/weather.py)                      | `weather`      | Streaming Read | Fetch weather data from Tomorrow.io           | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`      |
| **Streaming Write** | | | | | |
| [SalesforceDataSource](pyspark_datasources/salesforce.py)                | `pyspark.datasource.salesforce` | Streaming Write | Write streaming data to Salesforce           | `simple-salesforce`   | `pip install pyspark-data-sources[salesforce]`<br/>`df.writeStream.format("pyspark.datasource.salesforce").option("username", "user").start()`                |

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.
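For a sense of how the short names in the table resolve at runtime, here is a minimal sketch (assuming PySpark 4.0+, where custom Python data sources are registered with `spark.dataSource.register`, and using the `FakeDataSource` from this repo):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.fake import FakeDataSource

spark = SparkSession.builder.getOrCreate()

# Register the data source class; its short name ("fake") then becomes
# available to spark.read and spark.readStream.
spark.dataSource.register(FakeDataSource)

# Batch read: generates a DataFrame of fake rows.
spark.read.format("fake").load().show()

# Streaming read, mirroring the snippet at the top of this README.
spark.readStream.format("fake").load().writeStream.format("console").start()
```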

6 changes: 6 additions & 0 deletions docs/datasources/arrow.md
@@ -0,0 +1,6 @@
# ArrowDataSource

> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.
Comment on lines +3 to +4
💡 Verification agent

🧩 Analysis chain

Verify extras and dependency declarations for Arrow support.

The docs reference an extras install, `pyspark-data-sources[arrow]`, and the `pyarrow` dependency. Please verify:

  • Extras group "arrow" exists.
  • pyarrow is included in that extras group (or base deps if intended).

Run this script to confirm extras and dependencies:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for pyproject and setup configuration..."
fd -I -a 'pyproject.toml' || true
fd -I -a 'setup.cfg' || true
fd -I -a 'setup.py' || true

echo -e "\nInspect pyproject.toml for extras and dependencies:"
if fd -I 'pyproject.toml' >/dev/null; then
  cat $(fd -I 'pyproject.toml') | sed -n '1,200p'
  echo -e "\nExtras defined:"
  rg -n "^\[project\.optional-dependencies\]" -n -A 50 || true
  rg -n "pyarrow" || true
fi

echo -e "\nIf using setup.cfg or setup.py, search for extras and pyarrow:"
rg -n "extras_require|pyarrow" -S -A2 || true

Length of output: 26547


Fix docs: remove non-existent arrow extra; pyarrow is a core dependency

pyproject.toml declares pyarrow under [tool.poetry.dependencies] (pyarrow = ">=11.0.0") and there is no extras group named "arrow". Update docs to stop recommending pyspark-data-sources[arrow].

Files to update:

  • docs/datasources/arrow.md (lines 3–4)
  • README.md (the table entry that suggests pip install pyspark-data-sources[arrow])

Suggested replacements (examples):

docs/datasources/arrow.md

Before:

> Requires the PyArrow library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.

After:

> Requires the PyArrow library. You can install it manually: `pip install pyarrow`, or install the package (pyarrow is a core dependency): `pip install pyspark-data-sources`.

README.md (table cell)

Before: `pip install pyspark-data-sources[arrow]`

After: `pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.

Evidence: pyproject.toml contains pyarrow = ">=11.0.0" and [tool.poetry.extras] does not include "arrow" (from the verification output).

🤖 Prompt for AI Agents
docs/datasources/arrow.md lines 3–4 and README.md table entry: the docs reference a non-existent extras group `arrow` and suggest installing `pyspark-data-sources[arrow]`; update both files to remove the `[...]` extra and state that pyarrow is a core dependency. In docs/datasources/arrow.md, replace the two-line suggestion with a single sentence recommending `pip install pyarrow` or `pip install pyspark-data-sources` (not the `[arrow]` extra) and mention that pyarrow is included as a dependency. In README.md, update the table cell to `pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.


::: pyspark_datasources.arrow.ArrowDataSource
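A minimal usage sketch, following the install/usage cells in the README table (the file path is a placeholder):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.arrow import ArrowDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ArrowDataSource)

# Read an Apache Arrow file into a Spark DataFrame.
df = spark.read.format("arrow").load("/path/to/file.arrow")
df.show()
```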
6 changes: 6 additions & 0 deletions docs/datasources/lance.md
@@ -0,0 +1,6 @@
# LanceSink

> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
> or use `pip install pyspark-data-sources[lance]`.

::: pyspark_datasources.lance.LanceSink
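A minimal write sketch, following the README table's example (assuming `LanceSink` is the registrable data source class behind the `lance` short name):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.lance import LanceSink

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(LanceSink)

# Write a small DataFrame out in Lance format (append mode, as in the README).
df = spark.range(10)
df.write.format("lance").mode("append").save("/tmp/lance_data")
```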
5 changes: 5 additions & 0 deletions docs/datasources/opensky.md
@@ -0,0 +1,5 @@
# OpenSkyDataSource

> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.

::: pyspark_datasources.opensky.OpenSkyDataSource
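A minimal streaming sketch, following the README table's example (the `region` option value is taken from that example):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.opensky import OpenSkyDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(OpenSkyDataSource)

# Stream live aircraft state data for a region to the console.
(
    spark.readStream.format("opensky")
    .option("region", "EUROPE")
    .load()
    .writeStream.format("console")
    .start()
)
```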
5 changes: 5 additions & 0 deletions docs/datasources/weather.md
@@ -0,0 +1,5 @@
# WeatherDataSource

> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.

::: pyspark_datasources.weather.WeatherDataSource
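A minimal streaming sketch, following the README table's example (the `locations` and `apikey` option names are taken from that example; substitute a real Tomorrow.io API key for the placeholder):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.weather import WeatherDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(WeatherDataSource)

# Stream weather readings for the given coordinates to the console.
(
    spark.readStream.format("weather")
    .option("locations", "[(37.7749, -122.4194)]")
    .option("apikey", "YOUR_TOMORROW_IO_KEY")
    .load()
    .writeStream.format("console")
    .start()
)
```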