The Delta Kernel project is a set of Java libraries for building Delta connectors that can read (and soon, write to) Delta tables without the need to understand the Delta protocol details.
You can use this library to do the following:
- Read data from small Delta tables in a single thread in a single process.
- Read data from large Delta tables using multiple threads in a single process.
- Build a complex connector for a distributed processing engine and read very large Delta tables.
- [soon!] Write to Delta tables from multiple threads / processes / distributed engines.
Here is an example of a simple table scan with a filter:
TableClient myTableClient = DefaultTableClient.create(); // define a client (more details below)
Table myTable = Table.forPath("/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient); // define which version of table to scan
Scan myScan = mySnapshot.getScanBuilder(myTableClient) // specify the scan details
    .withFilters(myTableClient, scanFilter)
    .build();
CloseableIterator<ColumnarBatch> physicalData = // read the Parquet data files
    ... read from Parquet data files ...
Scan.transformPhysicalData(...) // returns the table data
A complete version of the above example program is available here.
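For the multi-threaded, single-process case listed earlier, the general shape is to hand each data file to a worker thread and combine the results. The following is only a rough sketch of that pattern using the standard Java concurrency utilities. The names `ParallelScanSketch`, `FileTask`, `readFile`, and `readAll` are hypothetical stand-ins invented for illustration and are not part of the Delta Kernel APIs; a real connector would read Parquet data where `readFile` simply returns a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: none of these types are Delta Kernel APIs.
class ParallelScanSketch {

    // Stand-in for "one data file to read" (assumed shape, for illustration only).
    static final class FileTask {
        final String path;
        final int rowCount;

        FileTask(String path, int rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    // Stand-in for reading one file; here it just reports the file's row count.
    static int readFile(FileTask task) {
        return task.rowCount;
    }

    // Reads every file on a fixed-size thread pool and sums the row counts.
    static long readAll(List<FileTask> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (FileTask task : tasks) {
                pending.add(CompletableFuture.supplyAsync(() -> readFile(task), pool));
            }
            long total = 0;
            for (CompletableFuture<Integer> result : pending) {
                total += result.join(); // waits until that file has been read
            }
            return total;
        } finally {
            pool.shutdown(); // release the worker threads
        }
    }

    public static void main(String[] args) {
        List<FileTask> tasks = List.of(
            new FileTask("part-0.parquet", 100),
            new FileTask("part-1.parquet", 250),
            new FileTask("part-2.parquet", 50));
        System.out.println(readAll(tasks, 2)); // prints 400
    }
}
```

The pool size is the only tuning knob in this sketch; how files are actually discovered and decoded is left to the connector.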
Notice that there are two sets of public APIs to build connectors.
- Table APIs - Interfaces like Table and Snapshot that allow you to read (and soon write to) Delta tables.
- TableClient APIs - The TableClient interface allows you to plug in connector-specific optimizations for compute-intensive components in the Kernel. For example, Delta Kernel provides a default Parquet file reader via the DefaultTableClient, but you may choose to replace that default with a custom TableClient implementation that has a faster Parquet reader for your connector/processing engine.
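The plug-in idea behind the TableClient APIs can be pictured with a small, self-contained sketch. Everything in it is hypothetical: `PluggableClientSketch`, `Client`, `DataFileReader`, `DefaultClient`, `EngineClient`, and `scan` are invented names that only mimic the pattern, and they are not the Kernel's actual interfaces or signatures. The point is that library code depends on an interface, a default implementation is supplied, and a connector can override just the component it wants to speed up.

```java
import java.util.List;

// Hypothetical sketch of the plug-in pattern; these are NOT the Kernel interfaces.
class PluggableClientSketch {

    // Stand-in for one compute-intensive component (here, a data file reader).
    interface DataFileReader {
        List<String> readRows(String path);
    }

    // Stand-in for the bundle of components a connector hands to the library.
    interface Client {
        DataFileReader fileReader();
    }

    // Default implementation that works out of the box.
    static class DefaultClient implements Client {
        @Override
        public DataFileReader fileReader() {
            return path -> List.of("default-reader:" + path);
        }
    }

    // Connector-specific client: keeps the defaults but swaps in its own reader.
    static class EngineClient extends DefaultClient {
        @Override
        public DataFileReader fileReader() {
            return path -> List.of("engine-reader:" + path);
        }
    }

    // "Library" code: written once against the interface, unaware of which client it gets.
    static List<String> scan(Client client, String path) {
        return client.fileReader().readRows(path);
    }

    public static void main(String[] args) {
        System.out.println(scan(new DefaultClient(), "/delta/table/path"));
        // prints [default-reader:/delta/table/path]
        System.out.println(scan(new EngineClient(), "/delta/table/path"));
        // prints [engine-reader:/delta/table/path]
    }
}
```

Extending the default and overriding a single method keeps the custom client small, which is the trade-off the paragraph above describes: start with the defaults, then replace only the parts where the engine has something faster.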
The Delta Kernel project provides the following two Maven artifacts:
- delta-kernel-api: This is a must-have dependency and contains all the public Table and TableClient APIs discussed earlier.
- delta-kernel-defaults: This is an optional dependency that contains default implementations of the TableClient interfaces using Hadoop libraries. Developers can optionally use these default implementations to speed up the development of their Delta connector.
<!-- Must have dependency -->
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-kernel-api</artifactId>
  <version>VERSION</version>
</dependency>
<!-- Optional dependency -->
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-kernel-defaults</artifactId>
  <version>VERSION</version>
</dependency>
Note: This project is currently in preview and all APIs are still evolving. We welcome you to try out the APIs to build Delta Lake connectors and to provide feedback (see below) to the project authors.
The Java API docs are available here. Only the classes and interfaces documented here are considered public APIs with backward compatibility guarantees (when marked as Stable APIs). All other classes and interfaces available in the JAR are considered private APIs with no stability guarantees.
A detailed user guide explaining the APIs and how to use them is available here.
Example Java programs that read Delta Lake tables using the Kernel APIs are available here.
We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.
We welcome contributions to Delta Lake and we accept contributions via Pull Requests. See our CONTRIBUTING.md for more details. We also adhere to the Delta Lake Code of Conduct.