Skip to content

Latest commit

 

History

History
752 lines (643 loc) · 28.4 KB

tutorial.md

File metadata and controls

752 lines (643 loc) · 28.4 KB
layout title permalink
docs
Tutorial
/docs/tutorial.html

This is a step-by-step tutorial that shows how to build and connect to Calcite. It uses a simple adapter that makes a directory of CSV files appear to be a schema containing tables. Calcite does the rest, and provides a full SQL interface.

Calcite-example-CSV is a fully functional adapter for Calcite that reads text files in CSV (comma-separated values) format. It is remarkable that a couple of hundred lines of Java code are sufficient to provide full SQL query capability.

CSV also serves as a template for building adapters to other data formats. Even though there are not many lines of code, it covers several important concepts:

  • user-defined schema using SchemaFactory and Schema interfaces;
  • declaring schemas in a model JSON file;
  • declaring views in a model JSON file;
  • user-defined table using the Table interface;
  • determining the record type of a table;
  • a simple implementation of Table, using the ScannableTable interface, that enumerates all rows directly;
  • a more advanced implementation that implements FilterableTable, and can filter out rows according to simple predicates;
  • advanced implementation of Table, using TranslatableTable, that translates to relational operators using planner rules.

Download and build

You need Java (version 8, 9 or 10) and Git.

{% highlight bash %} $ git clone https://github.com/apache/calcite.git $ cd calcite/example/csv $ ./sqlline {% endhighlight %}

First queries

Now let's connect to Calcite using sqlline, a SQL shell that is included in this project.

{% highlight bash %} $ ./sqlline sqlline> !connect jdbc:calcite:model=src/test/resources/model.json admin admin {% endhighlight %}

(If you are running Windows, the command is sqlline.bat.)

Execute a metadata query:

{% highlight bash %} sqlline> !tables +-----------+-------------+------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+ | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM | TYPE_NAME | SELF_REFERENCING_COL_NAME | REF_GENERATION | +-----------+-------------+------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+ | | SALES | DEPTS | TABLE | | | | | | | | | SALES | EMPS | TABLE | | | | | | | | | SALES | SDEPTS | TABLE | | | | | | | | | metadata | COLUMNS | SYSTEM TABLE | | | | | | | | | metadata | TABLES | SYSTEM TABLE | | | | | | | +-----------+-------------+------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+ {% endhighlight %}

(JDBC experts, note: sqlline's !tables command is just executing DatabaseMetaData.getTables() behind the scenes. It has other commands to query JDBC metadata, such as !columns and !describe.)

As you can see there are 5 tables in the system: tables EMPS, DEPTS and SDEPTS in the current SALES schema, and COLUMNS and TABLES in the system metadata schema. The system tables are always present in Calcite, but the other tables are provided by the specific implementation of the schema; in this case, the EMPS, DEPTS and SDEPTS tables are based on the EMPS.csv.gz, DEPTS.csv and SDEPTS.csv files in the resources/sales directory.

Let's execute some queries on those tables, to show that Calcite is providing a full implementation of SQL. First, a table scan:

{% highlight bash %} sqlline> SELECT * FROM emps; +-------+-------+--------+--------+---------------+-------+------+---------+---------+------------+ | EMPNO | NAME | DEPTNO | GENDER | CITY | EMPID | AGE | SLACKER | MANAGER | JOINEDAT | +-------+-------+--------+--------+---------------+-------+------+---------+---------+------------+ | 100 | Fred | 10 | | | 30 | 25 | true | false | 1996-08-03 | | 110 | Eric | 20 | M | San Francisco | 3 | 80 | | false | 2001-01-01 | | 110 | John | 40 | M | Vancouver | 2 | null | false | true | 2002-05-03 | | 120 | Wilma | 20 | F | | 1 | 5 | | true | 2005-09-07 | | 130 | Alice | 40 | F | Vancouver | 2 | null | false | true | 2007-01-01 | +-------+-------+--------+--------+---------------+-------+------+---------+---------+------------+ {% endhighlight %}

Now JOIN and GROUP BY:

{% highlight bash %} sqlline> SELECT d.name, COUNT(*) . . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno . . . .> GROUP BY d.name; +------------+---------+ | NAME | EXPR$1 | +------------+---------+ | Sales | 1 | | Marketing | 2 | +------------+---------+ {% endhighlight %}

Last, the VALUES operator generates a single row, and is a convenient way to test expressions and SQL built-in functions:

{% highlight bash %} sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!'); +---------+ | EXPR$0 | +---------+ | 13 | +---------+ {% endhighlight %}

Calcite has many other SQL features. We don't have time to cover them here. Write some more queries to experiment.

Schema discovery

Now, how did Calcite find these tables? Remember, core Calcite does not know anything about CSV files. (As a "database without a storage layer", Calcite doesn't know about any file formats.) Calcite knows about those tables because we told it to run code in the calcite-example-csv project.

There are a couple of steps in that chain. First, we define a schema based on a schema factory class in a model file. Then the schema factory creates a schema, and the schema creates several tables, each of which knows how to get data by scanning a CSV file. Last, after Calcite has parsed the query and planned it to use those tables, Calcite invokes the tables to read the data as the query is being executed. Now let's look at those steps in more detail.

On the JDBC connect string we gave the path of a model in JSON format. Here is the model:

{% highlight json %} { version: '1.0', defaultSchema: 'SALES', schemas: [ { name: 'SALES', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory', operand: { directory: 'sales' } } ] } {% endhighlight %}

The model defines a single schema called 'SALES'. The schema is powered by a plugin class, org.apache.calcite.adapter.csv.CsvSchemaFactory, which is part of the calcite-example-csv project and implements the Calcite interface SchemaFactory. Its create method instantiates a schema, passing in the directory argument from the model file:

{% highlight java %} public Schema create(SchemaPlus parentSchema, String name, Map<String, Object> operand) { final String directory = (String) operand.get("directory"); final File base = (File) operand.get(ModelHandler.ExtraOperand.BASE_DIRECTORY.camelName); File directoryFile = new File(directory); if (base != null && !directoryFile.isAbsolute()) { directoryFile = new File(base, directory); } String flavorName = (String) operand.get("flavor"); CsvTable.Flavor flavor; if (flavorName == null) { flavor = CsvTable.Flavor.SCANNABLE; } else { flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase(Locale.ROOT)); } return new CsvSchema(directoryFile, flavor); } {% endhighlight %}

Driven by the model, the schema factory instantiates a single schema called 'SALES'. The schema is an instance of org.apache.calcite.adapter.csv.CsvSchema and implements the Calcite interface Schema.

A schema's job is to produce a list of tables. (It can also list sub-schemas and table-functions, but these are advanced features and calcite-example-csv does not support them.) The tables implement Calcite's Table interface. CsvSchema produces tables that are instances of CsvTable and its sub-classes.

Here is the relevant code from CsvSchema, overriding the getTableMap() method in the AbstractSchema base class.

{% highlight java %} private Map<String, Table> createTableMap() { // Look for files in the directory ending in ".csv", ".csv.gz", ".json", // ".json.gz". final Source baseSource = Sources.of(directoryFile); File[] files = directoryFile.listFiles((dir, name) -> { final String nameSansGz = trim(name, ".gz"); return nameSansGz.endsWith(".csv") || nameSansGz.endsWith(".json"); }); if (files == null) { System.out.println("directory " + directoryFile + " not found"); files = new File[0]; } // Build a map from table name to table; each file becomes a table. final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder(); for (File file : files) { Source source = Sources.of(file); Source sourceSansGz = source.trim(".gz"); final Source sourceSansJson = sourceSansGz.trimOrNull(".json"); if (sourceSansJson != null) { final Table table = new JsonScannableTable(source); builder.put(sourceSansJson.relative(baseSource).path(), table); } final Source sourceSansCsv = sourceSansGz.trimOrNull(".csv"); if (sourceSansCsv != null) { final Table table = createTable(source); builder.put(sourceSansCsv.relative(baseSource).path(), table); } } return builder.build(); }

/** Creates different sub-type of table based on the "flavor" attribute. */ private Table createTable(Source source) { switch (flavor) { case TRANSLATABLE: return new CsvTranslatableTable(source, null); case SCANNABLE: return new CsvScannableTable(source, null); case FILTERABLE: return new CsvFilterableTable(source, null); default: throw new AssertionError("Unknown flavor " + this.flavor); } } {% endhighlight %}

The schema scans the directory, finds all files with the appropriate extension, and creates tables for them. In this case, the directory is sales and contains files EMPS.csv.gz, DEPTS.csv and SDEPTS.csv, which these become the tables EMPS, DEPTS and SDEPTS.

Tables and views in schemas

Note how we did not need to define any tables in the model; the schema generated the tables automatically.

You can define extra tables, beyond those that are created automatically, using the tables property of a schema.

Let's see how to create an important and useful type of table, namely a view.

A view looks like a table when you are writing a query, but it doesn't store data. It derives its result by executing a query. The view is expanded while the query is being planned, so the query planner can often perform optimizations like removing expressions from the SELECT clause that are not used in the final result.

Here is a schema that defines a view:

{% highlight json %} { version: '1.0', defaultSchema: 'SALES', schemas: [ { name: 'SALES', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory', operand: { directory: 'sales' }, tables: [ { name: 'FEMALE_EMPS', type: 'view', sql: 'SELECT * FROM emps WHERE gender = 'F'' } ] } ] } {% endhighlight %}

The line type: 'view' tags FEMALE_EMPS as a view, as opposed to a regular table or a custom table. Note that single-quotes within the view definition are escaped using a back-slash, in the normal way for JSON.

JSON doesn't make it easy to author long strings, so Calcite supports an alternative syntax. If your view has a long SQL statement, you can instead supply a list of lines rather than a single string:

{% highlight json %} { name: 'FEMALE_EMPS', type: 'view', sql: [ 'SELECT * FROM emps', 'WHERE gender = 'F'' ] } {% endhighlight %}

Now we have defined a view, we can use it in queries just as if it were a table:

{% highlight sql %} sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno; +--------+------------+ | NAME | NAME | +--------+------------+ | Wilma | Marketing | +--------+------------+ {% endhighlight %}

Custom tables

Custom tables are tables whose implementation is driven by user-defined code. They don't need to live in a custom schema.

There is an example in model-with-custom-table.json:

{% highlight json %} { version: '1.0', defaultSchema: 'CUSTOM_TABLE', schemas: [ { name: 'CUSTOM_TABLE', tables: [ { name: 'EMPS', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvTableFactory', operand: { file: 'sales/EMPS.csv.gz', flavor: "scannable" } } ] } ] } {% endhighlight %}

We can query the table in the usual way:

{% highlight sql %} sqlline> !connect jdbc:calcite:model=src/test/resources/model-with-custom-table.json admin admin sqlline> SELECT empno, name FROM custom_table.emps; +--------+--------+ | EMPNO | NAME | +--------+--------+ | 100 | Fred | | 110 | Eric | | 110 | John | | 120 | Wilma | | 130 | Alice | +--------+--------+ {% endhighlight %}

The schema is a regular one, and contains a custom table powered by org.apache.calcite.adapter.csv.CsvTableFactory, which implements the Calcite interface TableFactory. Its create method instantiates a CsvScannableTable, passing in the file argument from the model file:

{% highlight java %} public CsvTable create(SchemaPlus schema, String name, Map<String, Object> operand, @Nullable RelDataType rowType) { String fileName = (String) operand.get("file"); final File base = (File) operand.get(ModelHandler.ExtraOperand.BASE_DIRECTORY.camelName); final Source source = Sources.file(base, fileName); final RelProtoDataType protoRowType = rowType != null ? RelDataTypeImpl.proto(rowType) : null; return new CsvScannableTable(source, protoRowType); } {% endhighlight %}

Implementing a custom table is often a simpler alternative to implementing a custom schema. Both approaches might end up creating a similar implementation of the Table interface, but for the custom table you don't need to implement metadata discovery. (CsvTableFactory creates a CsvScannableTable, just as CsvSchema does, but the table implementation does not scan the filesystem for .csv files.)

Custom tables require more work for the author of the model (the author needs to specify each table and its file explicitly) but also give the author more control (say, providing different parameters for each table).

Comments in models

Models can include comments using /* ... */ and // syntax:

{% highlight json %} { version: '1.0', /* Multi-line comment. */ defaultSchema: 'CUSTOM_TABLE', // Single-line comment. schemas: [ .. ] } {% endhighlight %}

(Comments are not standard JSON, but are a harmless extension.)

Optimizing queries using planner rules

The table implementations we have seen so far are fine as long as the tables don't contain a great deal of data. But if your customer table has, say, a hundred columns and a million rows, you would rather that the system did not retrieve all of the data for every query. You would like Calcite to negotiate with the adapter and find a more efficient way of accessing the data.

This negotiation is a simple form of query optimization. Calcite supports query optimization by adding planner rules. Planner rules operate by looking for patterns in the query parse tree (for instance a project on top of a certain kind of table), and replacing the matched nodes in the tree by a new set of nodes which implement the optimization.

Planner rules are also extensible, like schemas and tables. So, if you have a data store that you want to access via SQL, you first define a custom table or schema, and then you define some rules to make the access efficient.

To see this in action, let's use a planner rule to access a subset of columns from a CSV file. Let's run the same query against two very similar schemas:

{% highlight sql %} sqlline> !connect jdbc:calcite:model=src/test/resources/model.json admin admin sqlline> explain plan for select name from emps; +-----------------------------------------------------+ | PLAN | +-----------------------------------------------------+ | EnumerableCalc(expr#0..9=[{inputs}], NAME=[$t1]) | | EnumerableTableScan(table=[[SALES, EMPS]]) | +-----------------------------------------------------+ sqlline> !connect jdbc:calcite:model=src/test/resources/smart.json admin admin sqlline> explain plan for select name from emps; +-----------------------------------------------------+ | PLAN | +-----------------------------------------------------+ | CsvTableScan(table=[[SALES, EMPS]], fields=[[1]]) | +-----------------------------------------------------+ {% endhighlight %}

What causes the difference in plan? Let's follow the trail of evidence. In the smart.json model file, there is just one extra line:

{% highlight json %} flavor: "translatable" {% endhighlight %}

This causes a CsvSchema to be created with flavor = TRANSLATABLE, and its createTable method creates instances of CsvTranslatableTable rather than a CsvScannableTable.

CsvTranslatableTable implements the TranslatableTable.toRel() method to create CsvTableScan. Table scans are the leaves of a query operator tree. The usual implementation is EnumerableTableScan, but we have created a distinctive sub-type that will cause rules to fire.

Here is the rule in its entirety:

{% highlight java %} public class CsvProjectTableScanRule extends RelRule<CsvProjectTableScanRule.Config> {

/** Creates a CsvProjectTableScanRule. */ protected CsvProjectTableScanRule(Config config) { super(config); }

@Override public void onMatch(RelOptRuleCall call) { final LogicalProject project = call.rel(0); final CsvTableScan scan = call.rel(1); int[] fields = getProjectFields(project.getProjects()); if (fields == null) { // Project contains expressions more complex than just field references. return; } call.transformTo( new CsvTableScan( scan.getCluster(), scan.getTable(), scan.csvTable, fields)); }

private static int[] getProjectFields(List exps) { final int[] fields = new int[exps.size()]; for (int i = 0; i < exps.size(); i++) { final RexNode exp = exps.get(i); if (exp instanceof RexInputRef) { fields[i] = ((RexInputRef) exp).getIndex(); } else { return null; // not a simple projection } } return fields; }

/** Rule configuration. */ @Value.Immutable(singleton = false) public interface Config extends RelRule.Config { Config DEFAULT = ImmutableCsvProjectTableScanRule.Config.builder() .withOperandSupplier(b0 -> b0.operand(LogicalProject.class).oneInput(b1 -> b1.operand(CsvTableScan.class).noInputs())) .build();

@Override default CsvProjectTableScanRule toRule() {
  return new CsvProjectTableScanRule(this);
}

} } {% endhighlight %}

The default instance of the rule resides in the CsvRules holder class:

{% highlight java %} public abstract class CsvRules { public static final CsvProjectTableScanRule PROJECT_SCAN = CsvProjectTableScanRule.Config.DEFAULT.toRule(); } {% endhighlight %}

The call to the withOperandSupplier method in the default configuration (the DEFAULT field in interface Config) declares the pattern of relational expressions that will cause the rule to fire. The planner will invoke the rule if it sees a LogicalProject whose sole input is a CsvTableScan with no inputs.

Variants of the rule are possible. For example, a different rule instance might instead match a EnumerableProject on a CsvTableScan.

The onMatch method generates a new relational expression and calls RelOptRuleCall.transformTo() to indicate that the rule has fired successfully.

The query optimization process

There's a lot to say about how clever Calcite's query planner is, but we won't say it here. The cleverness is designed to take the burden off you, the writer of planner rules.

First, Calcite doesn't fire rules in a prescribed order. The query optimization process follows many branches of a branching tree, just like a chess playing program examines many possible sequences of moves. If rules A and B both match a given section of the query operator tree, then Calcite can fire both.

Second, Calcite uses cost in choosing between plans, but the cost model doesn't prevent rules from firing which may seem to be more expensive in the short term.

Many optimizers have a linear optimization scheme. Faced with a choice between rule A and rule B, as above, such an optimizer needs to choose immediately. It might have a policy such as "apply rule A to the whole tree, then apply rule B to the whole tree", or apply a cost-based policy, applying the rule that produces the cheaper result.

Calcite doesn't require such compromises. This makes it simple to combine various sets of rules. If, say you want to combine rules to recognize materialized views with rules to read from CSV and JDBC source systems, you just give Calcite the set of all rules and tell it to go at it.

Calcite does use a cost model. The cost model decides which plan to ultimately use, and sometimes to prune the search tree to prevent the search space from exploding, but it never forces you to choose between rule A and rule B. This is important, because it avoids falling into local minima in the search space that are not actually optimal.

Also (you guessed it) the cost model is pluggable, as are the table and query operator statistics it is based upon. But that can be a subject for later.

JDBC adapter

The JDBC adapter maps a schema in a JDBC data source as a Calcite schema.

For example, this schema reads from a MySQL "foodmart" database:

{% highlight json %} { version: '1.0', defaultSchema: 'FOODMART', schemas: [ { name: 'FOODMART', type: 'custom', factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } } ] } {% endhighlight %}

(The FoodMart database will be familiar to those of you who have used the Mondrian OLAP engine, because it is Mondrian's main test data set. To load the data set, follow Mondrian's installation instructions.)

Current limitations: The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite. Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.

The cloning JDBC adapter

The cloning JDBC adapter creates a hybrid database. The data is sourced from a JDBC database but is read into in-memory tables the first time each table is accessed. Calcite evaluates queries based on those in-memory tables, effectively a cache of the database.

For example, the following model reads tables from a MySQL "foodmart" database:

{% highlight json %} { version: '1.0', defaultSchema: 'FOODMART_CLONE', schemas: [ { name: 'FOODMART_CLONE', type: 'custom', factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } } ] } {% endhighlight %}

Another technique is to build a clone schema on top of an existing schema. You use the source property to reference a schema defined earlier in the model, like this:

{% highlight json %} { version: '1.0', defaultSchema: 'FOODMART_CLONE', schemas: [ { name: 'FOODMART', type: 'custom', factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } }, { name: 'FOODMART_CLONE', type: 'custom', factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory', operand: { source: 'FOODMART' } } ] } {% endhighlight %}

You can use this approach to create a clone schema on any type of schema, not just JDBC.

The cloning adapter isn't the be-all and end-all. We plan to develop more sophisticated caching strategies, and a more complete and efficient implementation of in-memory tables, but for now the cloning JDBC adapter shows what is possible and allows us to try out our initial implementations.

Further topics

There are many other ways to extend Calcite not yet described in this tutorial. The adapter specification describes the APIs involved.