27 changes: 17 additions & 10 deletions README.md
@@ -38,16 +38,23 @@ spark.readStream.format("fake").load().writeStream.format("console").start()

## Example Data Sources

| Data Source | Short Name | Description | Dependencies |
|-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a Github repository | None |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py) | `googlesheets` | Read table from public Google Sheets | None |
| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
| [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Read from OpenSky Network. | None |
| [SalesforceDataSource](pyspark_datasources/salesforce.py) | `pyspark.datasource.salesforce` | Streaming datasource for writing data to Salesforce | `simple-salesforce` |
| Data Source | Short Name | Type | Description | Dependencies | Example |
|-------------------------------------------------------------------------|----------------|----------------|-----------------------------------------------|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Batch Read** | | | | | |
| [ArrowDataSource](pyspark_datasources/arrow.py) | `arrow` | Batch Read | Read Apache Arrow files (.arrow) | `pyarrow` | `pip install pyspark-data-sources[arrow]`<br/>`spark.read.format("arrow").load("/path/to/file.arrow")` |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Batch/Streaming Read | Generate fake data using the `Faker` library | `faker` | `pip install pyspark-data-sources[fake]`<br/>`spark.read.format("fake").load()` or `spark.readStream.format("fake").load()` |
| [GithubDataSource](pyspark_datasources/github.py)                        | `github`       | Batch Read     | Read pull requests from a GitHub repository   | None                  | `pip install pyspark-data-sources`<br/>`spark.read.format("github").load("apache/spark")`                                                                     |
| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py)            | `googlesheets` | Batch Read     | Read tables from public Google Sheets         | None                  | `pip install pyspark-data-sources`<br/>`spark.read.format("googlesheets").load("https://docs.google.com/spreadsheets/d/...")`                                 |
| [HuggingFaceDatasets](pyspark_datasources/huggingface.py) | `huggingface` | Batch Read | Read datasets from HuggingFace Hub | `datasets` | `pip install pyspark-data-sources[huggingface]`<br/>`spark.read.format("huggingface").load("imdb")` |
| [KaggleDataSource](pyspark_datasources/kaggle.py) | `kaggle` | Batch Read | Read datasets from Kaggle | `kagglehub`, `pandas` | `pip install pyspark-data-sources[kaggle]`<br/>`spark.read.format("kaggle").load("titanic")` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Batch Read | Read stock data from Alpha Vantage | None | `pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()` |
| **Batch Write** | | | | | |
| [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
| **Streaming Read** | | | | | |
| [OpenSkyDataSource](pyspark_datasources/opensky.py)                      | `opensky`      | Streaming Read | Read real-time aircraft data from the OpenSky Network | None         | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                 |
| [WeatherDataSource](pyspark_datasources/weather.py)                      | `weather`      | Streaming Read | Fetch weather data from Tomorrow.io           | None                  | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`      |
| **Streaming Write** | | | | | |
| [SalesforceDataSource](pyspark_datasources/salesforce.py)                | `pyspark.datasource.salesforce` | Streaming Write | Write streaming data to Salesforce           | `simple-salesforce`   | `pip install pyspark-data-sources[salesforce]`<br/>`df.writeStream.format("pyspark.datasource.salesforce").option("username", "user").start()`                |

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.
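For a sense of how the short names in the table resolve at runtime, here is a minimal sketch (assuming PySpark 4.0+, where custom Python data sources are registered with `spark.dataSource.register`, and using the `FakeDataSource` from this repo):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.fake import FakeDataSource

spark = SparkSession.builder.getOrCreate()

# Register the data source class; its short name ("fake") then becomes
# available to spark.read and spark.readStream.
spark.dataSource.register(FakeDataSource)

# Batch read: generates a DataFrame of fake rows.
spark.read.format("fake").load().show()

# Streaming read, mirroring the snippet at the top of this README.
spark.readStream.format("fake").load().writeStream.format("console").start()
```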

6 changes: 6 additions & 0 deletions docs/datasources/arrow.md
@@ -0,0 +1,6 @@
# ArrowDataSource

> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.
Comment on lines +3 to +4
💡 Verification agent

🧩 Analysis chain

Verify extras and dependency declarations for Arrow support.

The docs reference an extras install, `pyspark-data-sources[arrow]`, and the `pyarrow` dependency. Please verify:

  • Extras group "arrow" exists.
  • pyarrow is included in that extras group (or base deps if intended).

Run this script to confirm extras and dependencies:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for pyproject and setup configuration..."
fd -I -a 'pyproject.toml' || true
fd -I -a 'setup.cfg' || true
fd -I -a 'setup.py' || true

echo -e "\nInspect pyproject.toml for extras and dependencies:"
if fd -I 'pyproject.toml' >/dev/null; then
  cat $(fd -I 'pyproject.toml') | sed -n '1,200p'
  echo -e "\nExtras defined:"
  rg -n "^\[project\.optional-dependencies\]" -n -A 50 || true
  rg -n "pyarrow" || true
fi

echo -e "\nIf using setup.cfg or setup.py, search for extras and pyarrow:"
rg -n "extras_require|pyarrow" -S -A2 || true

Length of output: 26547


Fix docs: remove non-existent arrow extra; pyarrow is a core dependency

pyproject.toml declares pyarrow under [tool.poetry.dependencies] (pyarrow = ">=11.0.0") and there is no extras group named "arrow". Update docs to stop recommending pyspark-data-sources[arrow].

Files to update:

  • docs/datasources/arrow.md (lines 3–4)
  • README.md (the table entry that suggests pip install pyspark-data-sources[arrow])

Suggested replacements (examples):

docs/datasources/arrow.md

Before:

> Requires the PyArrow library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.

After:

> Requires the PyArrow library. You can install it manually: `pip install pyarrow`, or install the package (pyarrow is a core dependency): `pip install pyspark-data-sources`.

README.md (table cell)

Before: `pip install pyspark-data-sources[arrow]`

After: `pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.

Evidence: pyproject.toml contains pyarrow = ">=11.0.0" and [tool.poetry.extras] does not include "arrow" (from the verification output).

🤖 Prompt for AI Agents
docs/datasources/arrow.md lines 3–4 and README.md table entry: the docs reference a non-existent extras group `arrow` and suggest installing `pyspark-data-sources[arrow]`; update both files to remove the `[...]` extra and state that pyarrow is a core dependency. In docs/datasources/arrow.md, replace the two-line suggestion with a single sentence recommending `pip install pyarrow` or `pip install pyspark-data-sources` (not the `[arrow]` extra) and mention that pyarrow is included as a dependency. In README.md, update the table cell to `pip install pyspark-data-sources` (pyarrow is installed as a dependency), or `pip install pyarrow` to install only PyArrow.


::: pyspark_datasources.arrow.ArrowDataSource
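A minimal usage sketch, following the install/usage cells in the README table (the file path is a placeholder):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.arrow import ArrowDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ArrowDataSource)

# Read an Apache Arrow file into a Spark DataFrame.
df = spark.read.format("arrow").load("/path/to/file.arrow")
df.show()
```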
6 changes: 6 additions & 0 deletions docs/datasources/lance.md
@@ -0,0 +1,6 @@
# LanceSink

> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
> or use `pip install pyspark-data-sources[lance]`.

::: pyspark_datasources.lance.LanceSink
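A minimal write sketch, following the README table's example (assuming `LanceSink` is the registrable data source class behind the `lance` short name):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.lance import LanceSink

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(LanceSink)

# Write a small DataFrame out in Lance format (append mode, as in the README).
df = spark.range(10)
df.write.format("lance").mode("append").save("/tmp/lance_data")
```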
5 changes: 5 additions & 0 deletions docs/datasources/opensky.md
@@ -0,0 +1,5 @@
# OpenSkyDataSource

> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.

::: pyspark_datasources.opensky.OpenSkyDataSource
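A minimal streaming sketch, following the README table's example (the `region` option value is taken from that example):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.opensky import OpenSkyDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(OpenSkyDataSource)

# Stream live aircraft state data for a region to the console.
(
    spark.readStream.format("opensky")
    .option("region", "EUROPE")
    .load()
    .writeStream.format("console")
    .start()
)
```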
5 changes: 5 additions & 0 deletions docs/datasources/weather.md
@@ -0,0 +1,5 @@
# WeatherDataSource

> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.

::: pyspark_datasources.weather.WeatherDataSource
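A minimal streaming sketch, following the README table's example (the `locations` and `apikey` option names are taken from that example; substitute a real Tomorrow.io API key for the placeholder):

```python
from pyspark.sql import SparkSession

from pyspark_datasources.weather import WeatherDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(WeatherDataSource)

# Stream weather readings for the given coordinates to the console.
(
    spark.readStream.format("weather")
    .option("locations", "[(37.7749, -122.4194)]")
    .option("apikey", "YOUR_TOMORROW_IO_KEY")
    .load()
    .writeStream.format("console")
    .start()
)
```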