This repository has been archived by the owner on Mar 30, 2021. It is now read-only.
Druid Datasource Options
hbutani edited this page Sep 2, 2016 · 2 revisions
Name | Description | Default (if any) | Override in SQLContext? |
---|---|---|---|
sourceDataframe | The DataFrame with the raw data | (required) | no |
timeDimensionColumn | The column that represents time in the Druid Index | (required) | no |
druidHost | Zookeeper ensemble used by Druid servers | (required) | no |
druidDatasource | the name of the corresponding Datasource in Druid for this Spark Datasource | (required) | no |
starSchema | The details of the StarSchema, see [Defining a StarSchema](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a-Star-Schema) for details | (required) | no |
columnMapping | mapping of names from the raw schema to Druid | none | no |
functionalDependencies | **future use** | none | no |
pushHLLTODruid | Push HyperLogLog aggregator to Druid. **future use** | true | no |
streamDruidQueryResults | Controls whether Query results from Druid are streamed into the Spark Operator pipeline. **currently cannot be changed** | true | no |
loadMetadataFromAllSegments | When loading Druid DataSource metadata, controls whether the query interval spans the entire DataSource interval or only the latest segment. The default is to load from the latest segment; loading from all segments can be very slow. **currently cannot be changed** | false | no |
zkSessionTimeoutMilliSecs | Zookeeper connection timeout | 30000 | no |
zkEnableCompression | Enable compression on the Zookeeper connection | true | no |
zkDruidPath | Root Path in Zookeeper for Druid | /druid | no |
queryHistoricalServers | A Query execution optimization that talks directly to Historical Servers and performs a post-aggregation across the Historical outputs in Spark. **only takes effect if the cost model is off** | false | yes |
maxResultCardinality | If the result cardinality of a Query exceeds this value, the Query is not converted to a Druid Query. **future use** | no | |
numSegmentsPerHistoricalQuery | The number of segments queried in one Druid Query to a Historical Server. **only takes effect if the cost model is off** | Int.MaxValue | yes |
zkQualifyDiscoveryNames | When connecting to a Druid 0.9 cluster, set this to true | false | no |
numProcessingThreadsPerHistorical | Number of processing threads per Druid Historical daemon | equal to spark.num.cores | no |
useSmile | Communicate with Druid using the Smile binary JSON format | true | yes |
nonAggQueryHandling | Controls whether Druid Select Queries are allowed on the DataSource. Set to push_filters to push when there is at least one filter expression; set to push_project_and_filters to push even for simple scans. | push_none | no |
queryGranularity | Used to estimate index cardinality for any time period. Valid values are none, all, second, minute, hour, day, etc., or a custom PeriodGranularity; these match the granularities available in Druid. Set it to the Query Granularity of your index. | none | no |
allowTopN | Druid TopN queries are approximate in their aggregation and ranking; this flag controls whether TopN query rewrites happen. | false | yes |
topNMaxThreshold | If Druid TopN queries are enabled, this property sets the maximum limit for which such rewrites are done. For limits beyond this value, the GroupBy query is executed. | 100000 | yes |
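
The rewrite-related options in the table (allowTopN, topNMaxThreshold, nonAggQueryHandling) can be read as simple decision rules. The sketch below is a minimal Python model of those rules as the table describes them; it is not the library's planner code. The function names are hypothetical, and any other conditions the planner may apply before rewriting a query are deliberately ignored.

```python
# Illustrative model of the table's descriptions, not Sparkline code.
VALID_NON_AGG_MODES = ("push_none", "push_filters", "push_project_and_filters")


def choose_limit_query_type(limit, allow_topn=False, topn_max_threshold=100000):
    """Pick the Druid query type for an aggregate query with a LIMIT.

    Defaults mirror the table: allowTopN=false, topNMaxThreshold=100000.
    Limits beyond the threshold fall back to GroupBy.
    """
    if allow_topn and limit <= topn_max_threshold:
        return "topN"
    return "groupBy"


def pushes_non_agg_query(mode, num_filter_exprs):
    """Decide whether a non-aggregate query is pushed to Druid as a Select."""
    if mode not in VALID_NON_AGG_MODES:
        raise ValueError(f"unknown nonAggQueryHandling value: {mode!r}")
    if mode == "push_none":
        return False  # default: never push non-aggregate queries
    if mode == "push_filters":
        return num_filter_exprs >= 1  # needs at least one filter expression
    return True  # push_project_and_filters: push even simple scans
```

With the defaults, `choose_limit_query_type(10)` stays on GroupBy; TopN is only chosen once allowTopN is enabled and the limit does not exceed topNMaxThreshold.
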
- Override in SQLContext means that the setting in the SQLContext (these have the prefix `spark.sparklinedata.druid.option`) takes precedence over the value in the Druid DataSource. This enables runtime behavior changes, for example whether to use the Smile protocol or not.
- `queryHistoricalServers` and `numSegmentsPerHistoricalQuery` are ignored if the cost model is on. If the cost model is off, all queries on this DataSource are executed using these settings. Of course, if a Query cannot be pushed to the Historical servers (for example, queries with a Limit), these settings are ignored for that Query. `queryHistoricalServers` and `numSegmentsPerHistoricalQuery` can be overridden by setting the values in the SQLContext; when a Query is executed, the current settings in the SQLContext are used. This is how you can try different Query execution options on a per-Query basis.
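
The precedence rules in the notes above can be modelled in a few lines. This is a hedged sketch in plain Python, not Sparkline's implementation: the helper names are hypothetical, option values are treated as plain strings, and the full key is assumed to be the documented prefix joined to the option name with a dot.

```python
# Illustrative model of the override rules, not Sparkline code.
OPTION_PREFIX = "spark.sparklinedata.druid.option"

# Options the table marks "yes" under "Override in SQLContext?".
OVERRIDABLE = {
    "queryHistoricalServers",
    "numSegmentsPerHistoricalQuery",
    "useSmile",
    "allowTopN",
    "topNMaxThreshold",
}

HISTORICAL_OPTIONS = ("queryHistoricalServers", "numSegmentsPerHistoricalQuery")


def effective_option(name, datasource_options, sqlcontext_conf):
    """Return the value in force for `name` when a query is executed."""
    key = f"{OPTION_PREFIX}.{name}"  # assumed key layout: prefix + "." + name
    if name in OVERRIDABLE and key in sqlcontext_conf:
        return sqlcontext_conf[key]  # the SQLContext setting takes precedence
    return datasource_options.get(name)  # otherwise use the DataSource value


def historical_settings(datasource_options, sqlcontext_conf, cost_model_on):
    """Return the Historical-query settings, or None when they are ignored."""
    if cost_model_on:
        return None  # both settings are ignored while the cost model is on
    return {
        name: effective_option(name, datasource_options, sqlcontext_conf)
        for name in HISTORICAL_OPTIONS
    }
```

Because the lookup happens at execution time, changing the SQLContext entry between two queries changes the behavior of the second query only, which is the per-Query experimentation the notes describe.
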