-
Notifications
You must be signed in to change notification settings - Fork 96
Programming Guide
Aleksandar Vitorovic edited this page May 30, 2016
·
20 revisions
Before reading this page, we recommend that you first read A high-level overview of Squall and Rationale. This page contains information about main Squall packages and classes:
- squall-core contains operators implementation and imperative and SQL interface.
- squall-examples has query plans written using the imperative interface.
- example project shows how to connect to external data sources such as Twitter API
- test directory contains small datasets (used for regressive testing and for example query plans), sql example queries, the expected results (used for checking correctness) and configuration files (which specify which query to run, on how many machines, etc.).
Next, we present Squall packages and classes in detail:
- operators: Each operator runs on multiple nodes using data parallelism. An operator takes one tuple as an input and produces a stream of output tuples (an exception is OneToOneOperator, that always produces a single output tuple). We provide selections, projections, aggregations (sum, count, average, distinct). SampleOperator randomly picks a desired percentage of its input. Sometimes it is handy to have an operator that is a virtual proxy for a chain of operators, and for that we use ChainOperator.
- components: We collocate the connected operators that use the same partitioning scheme. We denote a pipeline of collocated operators as a component. Three main types of components are DataSourceComponent, OperatorComponent (which takes its input from exactly one component) and JoinerComponent. We support equi-joins (EquiJoinComponent), theta-joins (ThetaJoinComponent) and multi-way joins (HyperCubeJoinComponent). The details on theta-joins are available here. For equi-joins on skewed datasets, ThetaJoinComponent outperforms EquiJoinComponent. ComponentProperties contains information about the component's operators and the connections to other components.
- theta_joins: For theta-joins, the partitioning schemes are in matrix_assignment package, and PredicateAnalyser parses (complex) join conditions.
- expressions are used to represent join conditions and where clauses. We provide constants (ValueSpecification) and column values from a tuple (ColumnReference). We build complex expressions using Composite Design Pattern with additions, subtractions, multiplications and divisions. This is also the right package for writing user-defined functions (UDFs) over tuples. An example is extraction of year from date.
- predicates: We have selection and join predicates. We build them using a Composite Design pattern over atomic ComparisonPredicate, LikePredicate and composite AndPredicate, OrPredicate, BetweenPredicate.
- window_semantics: Squall implements typical stream primitives, such as tumbling and sliding windows, by adding the window expiration logic on top of the full-history engine. Final results and aggregations are stored in key-value stores that expose window-identifiers and the corresponding timestamp ranges. Squall supports windows over stateful operators such as aggregations and joins. Here is an example of a fully running query with window semantics. Signals folder contains classes that trigger punctuations, which indicate window boundaries.
- main: main entry for the execution
- query_plans: skeleton and utilities for query plans, but the example query plans are elsewhere.
- storm_components: Mapping from Squall components (package components) to Storm Spouts and Bolts. There is 1:1 mapping between each Component and StormComponent. StormDstTupleStorageJoin is the main 2-way join implementation in Storm (StormDstJoin is deprecated as StormDstTupleStorageJoin performs better). Subpackage stream grouping maps Squall partitioning schemes to Storm counterpart - grouping. TopologyKiller releases all the resources once it receives EOF signal from all the data sources (this is used for testing purposes, when we read from files).
- storage contains indexes and various in-memory stores. AggregationStore and ValueStore are used for aggregations and distinct operators, respectively. For joins, TupleStorage performs better than KeyValueStore because TupleStorage uses Trove rather than built-in Java collections.
- api/sql contains classes related to SQL interface, including the parser, optimizer and modules for cardinality estimation (package estimators). For more information, please consult Query optimization.
- Functional: The functional interface for Squall operators. It does some basic optimizations such as merging map operations into one function.
- connectors/hdfs: In addition to the read/write interfaces to Twitter and local files, we also provide interface to HDFS.
-
types contains the types of the expressions and various conversions (e.g.
toString
,fromString
etc). - visitors: The Visitor design patter is used for indexes, expressions and predicates.
- utilities: SquallContext is the main entry for both the Squall code and user interaction with Squall (parameters, configurations etc.). StormWrapper encapsulates Storm's methods for submitting and killing query plans (in Storm terminology: topologies). CustomReader and ReaderProvider allow us to create new types of data sources in a very lightweight manner (see DataSource documentation and example project for an example of connecting to the Twitter API). LocalMergeResults aggregates the result from multiple threads in local mode, and checks whether the result is correct. StatisticsUtilities prints statistics about memory and number of tuples processed.