This repo provides notebooks with Delta Lake examples using PySpark, Scala Spark, and Python.
Running these notebooks on your local machine is a great way to learn how Delta Lake works.
You can install PySpark and Delta Lake by creating the `pyspark-330-delta-220` conda environment.

Create the environment with this command: `conda env create -f envs/pyspark-330-delta-220.yml`.

Activate the environment with this command: `conda activate pyspark-330-delta-220`.
Then you can run `jupyter lab` and execute all the PySpark notebooks.
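To confirm the environment is working before diving into the notebooks, here is a minimal sketch of a Delta Lake round trip, assuming the conda environment includes the `delta-spark` pip package (which provides `configure_spark_with_delta_pip`); the table path is illustrative:

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Builder with the two settings Delta Lake needs on vanilla Spark.
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-examples")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip wires in the Delta Lake JARs that match
# the pip-installed delta-spark version.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Round-trip a tiny table to confirm the setup works (path is illustrative).
spark.range(5).write.format("delta").save("/tmp/delta-quickstart")
spark.read.format("delta").load("/tmp/delta-quickstart").show()
```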
The Delta Lake Spark connector release process typically includes a release candidate a few weeks before the official release (example).
The release candidate is typically hosted in non-standard repositories that require additional setup.
To configure a notebook to use a release candidate:

- Make sure to use a conda environment that includes the correct `--extra-index-url`; see here for an example.
- Provide an Ivy settings file that points at the correct non-standard repository; see this one for an example.
- Initialize Spark with the correct Ivy configuration: `.config("spark.jars.ivySettings", "../../ivy/2.4.0rc1.xml")`. A fuller sketch follows after this list.
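Putting the pieces together, a minimal sketch of a SparkSession wired to a release candidate might look like the following; the `2.4.0rc1` Maven coordinates are an assumption for illustration, and you should substitute the candidate you are actually testing:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-rc")
    # Resolve the RC artifact by its Maven coordinates (version is illustrative).
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0rc1")
    # Point Ivy at the settings file that adds the non-standard repository.
    .config("spark.jars.ivySettings", "../../ivy/2.4.0rc1.xml")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```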
You can run the delta-rs notebooks that use the Python bindings by creating the `mr-delta-rs` conda environment.

Create the environment with this command: `conda env create -f envs/mr-delta-rs.yml`.

Activate the environment with this command: `conda activate mr-delta-rs`.
Rust notebooks in `notebooks/delta-rs` were developed using the Evcxr Jupyter Kernel. You can either follow the instructions in the Evcxr Jupyter Kernel docs to set it up, or use the included Dockerfile with the following commands (in which case all you need is Docker installed):

```bash
cd notebooks/delta-rs
docker build -t delta-rs .
# Mount the current directory (notebooks/delta-rs) into the container.
docker run -it --rm -p 8888:8888 --name delta-rs -v "$PWD":/usr/src/delta-rs delta-rs
```
Note: One of the main reasons for creating the Dockerfile was an issue with running the Evcxr Jupyter Kernel on macOS with Apple silicon: despite being able to build Rust applications directly on the host after setting `[target.aarch64-apple-darwin]` in the `~/.cargo/config` file, `:dep` builds are still not working in the notebook.
You can install almond to run the Scala Spark notebooks in this repo.
We welcome contributions, especially notebooks that illustrate important functions that will benefit the Delta Lake community.
Check out the open issues for ideas on good notebooks to create!