-
Notifications
You must be signed in to change notification settings - Fork 92
Druid Query Optimizations
Based on the Query pattern we rewrite Druid Queries to use the most optimal Druid Query Type: SearchQuery, TimeSeriesQuery or TopNQuery. To quite the Druid website:
Where possible, we recommend using Timeseries and TopN queries instead of GroupBy. GroupBy is the most flexible Druid query, but also has the poorest performance. Timeseries are significantly faster than groupBy queries for aggregations that don’t require grouping over dimensions. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.
A ‘metadata’ query on a dimension is executed as a Search Query in Druid. In SQL this is expressed as select distinct colName
query.
- query must be on a single Dimension
- there must be no aggregations
- the query must be for the entire duration of the index
- if a limit is specified it must be on the dimension and must have ‘ascending’ order
Following is an example query that gets executed as a Druid Search Query
select c_name from orderLineItemPartSupplier group by c_name
Queries that have no Groping Expressions(sql aggregations without a Group By clause) using the TimeSeries Engine results in much faster execution. The query must have:
- no dimensions specified
- no limit or having clause
Following is an example query that gets executed as a Druid TimeSeries Query
SELECT min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__0`, min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__1`, min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__2`, max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__3`, max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__4`, max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__5`, count(1) AS `x__alias__0` FROM orderLineItemPartSupplier `tmp_sed_druid_20160201_20160331_v6_sparkline` WHERE ( NOT ( cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) IS NULL)) HAVING count(1) > 0
The user must choose to allow the approximate ranking using the allowTopN option. The query must have:
- a single dimension
- a single order by expression, the order by must be on a metric and not the single dimension.
- must have a limit
Following is an example query that gets executed as a Druid TopN Query
select l_returnflag, count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a from orderLineItemPartSupplier group by l_returnflag order by max(ps_supplycost) limit 200
- Overview
- Quick Start
-
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
- Dev. Guide
- Reference Architectures
- Releases
- Cluster Spinup Tool
- TPCH Benchmark