
Merge branch 'master' into SNAP-3138
sonalsagarwal committed Feb 11, 2020
2 parents 6934313 + c60f1e9 commit 402beff
Showing 41 changed files with 389 additions and 196 deletions.
Binary file modified docs/Images/zeppelin_2.png
Binary file modified docs/Images/zeppelin_3.png
2 changes: 1 addition & 1 deletion docs/architecture/core_components.md
@@ -11,6 +11,6 @@ The OLAP scheduler and job server coordinate all OLAP and Spark jobs and are cap

To support replica consistency, fast point updates, and instantaneous detection of failure conditions in the cluster, SnappyData uses a P2P (peer-to-peer) cluster membership service that ensures view consistency and virtual synchrony in the cluster. Any of the in-memory tables can be synchronously replicated using this P2P cluster.

In addition to the “exact” Dataset, data can also be summarized using probabilistic data structures, such as stratified samples and other forms of synopses. Using our API, applications can choose to trade accuracy for performance. SnappyData’s query engine has built-in support for Synopsis Data Engine (SDE) and exploits appropriate probabilistic data structures to meet the user’s requested level of accuracy or performance.
In addition to the “exact” Dataset, data can also be summarized using probabilistic data structures, such as stratified samples and other forms of synopses. Using our API, applications can choose to trade accuracy for performance. SnappyData’s query engine has built-in support for Approximate Query Processing (AQP) and exploits appropriate probabilistic data structures to meet the user’s requested level of accuracy or performance.

To understand the data flow architecture, you are first walked through a real-time use case that involves stream processing, ingestion into an in-memory store, and interactive analytics.
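The sketch below illustrates the accuracy/performance trade-off that the Approximate Query Processing support exposes. The table and column names are hypothetical, and the exact sample DDL and `WITH ERROR` syntax should be verified against the AQP documentation:

```
-- Create a stratified sample over a hypothetical base table.
CREATE SAMPLE TABLE orders_sample ON orders
  OPTIONS (qcs 'region', fraction '0.01');

-- Trade accuracy for speed by allowing up to 10% error in the aggregate.
SELECT region, sum(quantity) FROM orders
  GROUP BY region
  WITH ERROR 0.1;
```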
8 changes: 7 additions & 1 deletion docs/best_practices/structured_streaming_best_practices.md
@@ -6,6 +6,7 @@ The following best practices for Structured Streaming are explained in this sect
* [Limiting Batch Size](#limitbatchsize)
* [Limiting Default Incoming Data Frame Size](#limitdefaultincoming)
* [Running a Structured Streaming Query with Dedicated SnappySession Instance](#dedicatedsnappysession)
* [SnappySession Used for a Streaming Query Should Not Be Used for Other Operations](#otherops)

<a id= sharefilesys> </a>
## Using Shared File System as Checkpoint Directory Location
@@ -93,5 +94,10 @@ val newSession = snappySession.newSession()
The newSession instance has a session-level configuration similar to that of snappySession.

!!!Note
For embedded snappy jobs, it is recommended to use a new snappy-job for each streaming query.
For embedded snappy jobs, it is recommended to use a new [snappy-job](/programming_guide/snappydata_jobs.md) for each streaming query.


<a id= otherops> </a>
## SnappySession Used for a Streaming Query Should Not Be Used for Other Operations

When a structured streaming query runs, the Snappy hash aggregate is disabled for that entire session, because SnappyData's hash aggregate does not work with the stateful aggregation that streaming query aggregation requires. As a best practice, avoid using the same SnappySession instance for other operations, since those operations also run without the optimized Snappy hash aggregation.
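The following sketch illustrates this practice. It assumes a SnappySession named `snappySession` is already available; the Kafka source options, the `snappysink` options, and the table names are illustrative and should be checked against the structured streaming documentation.

```
// Dedicated session for the streaming query: the Snappy hash aggregate is
// disabled only within this session.
val streamingSession = snappySession.newSession()

val streamingQuery = streamingSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("snappysink")
  .queryName("events_ingest")
  .option("tableName", "events_table")
  .option("checkpointLocation", "/shared/checkpoints/events_ingest")
  .start()

// Run ad-hoc queries on a different SnappySession instance so that they
// continue to use the optimized Snappy hash aggregation.
snappySession.sql("SELECT count(*) FROM events_table").show()
```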
5 changes: 2 additions & 3 deletions docs/configuring_cluster/configure_launch_cluster.md
@@ -105,7 +105,6 @@ The following core properties must be set in the **conf/leads** file:
| heap-size | Sets the maximum heap size for the Java VM, using SnappyData default resource manager settings. </br>For example, `-heap-size=8g` </br> It is recommended to allocate minimum **6-8 GB** of heap size per lead node. If you use the `-heap-size` option, by default SnappyData sets the critical-heap-percentage to 95% of the heap size, and the `eviction-heap-percentage` to 85.5% of the `critical-heap-percentage`. </br>SnappyData also sets resource management properties for eviction and garbage collection if the JVM supports them. | |
| dir | Working directory of the member that contains the SnappyData Server status file and the default location for the log file, persistent files, data dictionary, and so forth. | <product_home>/work |
| classpath | Location of user classes required by the SnappyData Server. This path is appended to the current classpath | Appended to the current classpath |
| zeppelin.interpreter.enable=true |Enable the SnappyData Zeppelin interpreter. Refer [How to use Apache Zeppelin with SnappyData](/howto/use_apache_zeppelin_with_snappydata.md) | |
| spark.executor.cores | The number of cores to use on each server. | |
| spark.jars | | |

@@ -118,10 +117,10 @@ localhost -dir=/opt/snappydata/data/lead -heap-size=6g
```
You can add a line for each of the Lead members that you want to launch. Typically, this is only one; in production, you may launch two.
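For example, a **conf/leads** file for two Lead members might look like the following sketch; the hostnames and directories are placeholders:

```
lead-host-1 -dir=/opt/snappydata/data/lead1 -heap-size=8g
lead-host-2 -dir=/opt/snappydata/data/lead2 -heap-size=8g
```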

In the following configuration, you are specifying the [number of cores to use on each server](/best_practices/setup_cluster.md#computenoscores) as well as enabling the SnappyData Zeppelin interpreter:
In the following configuration, you are specifying the [number of cores to use on each server](/best_practices/setup_cluster.md#computenoscores):

```
localhost -spark.executor.cores=16 -zeppelin.interpreter.enable=true
localhost -spark.executor.cores=16
```

!!!Tip
2 changes: 2 additions & 0 deletions docs/configuring_cluster/property_description.md
@@ -72,6 +72,7 @@ The following list of commonly used configuration properties can be set to confi

|Property|Description|Components</br>|
|-|-|-|
|-DCHECK_EXTERNAL_TABLE_AUTHZ | Enables authorization of external tables when this system property is set to true and the cluster's security is enabled. The system admin or the schema owner can then grant or revoke permissions on external tables for other users.|Lead|
|-J-Dsnappydata.enable-rls|Enables the system for row level security when set to true. By default this is off. If this property is set to true, then the Smart Connector access to SnappyData fails.|Server</br>Lead</br>Locator
|-J-Dsnappydata.RESTRICT_TABLE_CREATION|Applicable when security is enabled in the cluster. If true, users cannot execute queries (including DDLs and DMLs) even in their default or own schema unless cluster admin explicitly grants them the required permissions using GRANT command. The default is false. |Server</br>Lead</br>Locator|
|-spark.ssl.enabled<a id="ssl_spark_enabled"></a>|Enables or disables Spark layer encryption. The default is false. |Lead|
@@ -179,6 +180,7 @@ node-l -heap-size=4096m -spark.ui.port=9090 -locators=node-b:8888,node-a:9999 -s
|-snappydata.column.batchSize |The default size of blocks to use for storage in the SnappyData column store. When inserting data into the column storage, this is the unit (in bytes or with k/m/g suffixes) that is used to split the data into chunks for efficient storage and retrieval. </br> This property can also be set for each table in the `create table` DDL. The maximum allowed size is 2GB. The default is 24m.|
|-snappydata.column.maxDeltaRows|The maximum number of rows that can be in the delta buffer of a column table. The size of the delta buffer is already limited by the `ColumnBatchSize` property, but this allows a lower limit on the number of rows for better scan performance. The delta buffer is rolled into the column store when either `ColumnBatchSize` or this limit is hit, whichever comes first. It can also be set for each table in the `create table` DDL; otherwise, this setting is used for the `create table`.|
|-snappydata.hiveServer.enabled|Enables the Hive Thrift server for SnappyData. This is enabled by default when you start the cluster, which adds approximately 10 seconds to the cluster startup time. To avoid this additional time, you can set the property to false.|
|snappydata.maxRetryAttemptsForWrite|The default retry of Spark tasks on failure can cause duplicates in the case of insert operations. This property can be set to **0** to avoid this scenario. Other operations retry as usual without causing any consistency issues.|
|-snappydata.sql.hashJoinSize|The join would be converted into a hash join if the table is of size less than the `hashJoinSize`. The limit specifies an estimate on the input data size (in bytes or k/m/g/t suffixes for unit). The default value is 100MB.|
|-snappydata.sql.hashAggregateSize|Aggregation uses an optimized hash aggregation plan, but one that does not overflow to disk and can cause an OOME if the result of the aggregation is large. The limit specifies the input data size (in bytes or with k/m/g/t suffixes) and not the output size. Set this only if there are queries that can return a large number of rows in aggregation results. The default value is 0, which means no limit is set on the size, so the optimized hash aggregation is always used.|
|-snappydata.sql.planCacheSize|Number of query plans that will be cached.|
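As an illustrative sketch (not a recommendation), several of the SQL properties listed above can be combined on a lead line in **conf/leads**; the host name and values below are placeholders:

```
localhost -heap-size=8g -snappydata.column.batchSize=32m -snappydata.sql.hashJoinSize=200m -snappydata.hiveServer.enabled=false
```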
15 changes: 7 additions & 8 deletions docs/connectors/connector.md
@@ -2,19 +2,18 @@

SnappyData relies on the Spark SQL Data Sources API to load data in parallel from a wide variety of sources. Any data source or database that Spark can load from or save to can be accessed from within SnappyData.

There is built-in support for many data sources as well as data formats. You can access data from sources such as S3, file system, HDFS, Hive, and RDB. The loaders have built-in support to handle data formats such as CSV, Parquet, ORC, Avro, JSON, and Java/Scala Objects.
There is built-in support for many data sources as well as data formats.
Built-in data sources include Amazon S3, GCS (Google Cloud Storage), Azure Blob store, file systems, HDFS, Hive metastore, RDB access using JDBC, TIBCO Data Virtualization, and Pivotal GemFire.

!!!Attention
This section currently only details the advanced connectors that SnappyData introduced. Refer to the [howto](../howto.md) section for a brief description about working with [external data sources](../howto/load_data_into_snappydata_tables.md) and [some examples](../howto/load_data_from_external_data_stores.md).
SnappyData supports the following data formats: CSV, Parquet, ORC, Avro, JSON, XML and Text.

SnappyData provides a utility to deploy third-party connectors using the SQL `Deploy` command. Refer [Deployment of Third Party Connectors](/connectors/deployment_dependency_jar.md)
You can also deploy other third-party connectors using the SQL `Deploy` command; refer to [Deployment of Third Party Connectors](deployment_dependency_jar.md) and see the sketch after the list below. You can likely find a Spark connector for your data source on the [Spark packages portal](https://spark-packages.org/) or through a web search.

For more information see:

* [START HERE - How to load data into SnappyData Tables](../howto/load_data_into_snappydata_tables.md)
* [START HERE - for a quick overview of the concepts and some examples for loading data](../howto/load_data_into_snappydata_tables.md)
* [Data Loading examples using Spark SQL/Data Sources API](../howto/load_data_from_external_data_stores.md)
* [Supported Data Formats](../Data/data_formats.md)
* [Accessing Cloud Storages](access_cloud_data.md)
* [Accessing Cloud Stores](access_cloud_data.md)
* [Connecting to External Hive Metastores](../Data/external_hive_support.md)
* [Using the SnappyData Change Data Capture (CDC) Connector](cdc_connector.md)
* [Using the SnappyData GemFire Connector](gemfire_connector.md)
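For instance, a third-party connector package might be pulled into the cluster with the `deploy` command before it is used; the alias and Maven coordinates below are only an example and should be replaced with the connector you actually need:

```
deploy package cassandra_connector 'com.datastax.spark:spark-cassandra-connector_2.11:2.4.3';
```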

16 changes: 16 additions & 0 deletions docs/experimental.md
@@ -0,0 +1,16 @@
# Experimental Features

TIBCO ComputeDB 1.2.0 provides the following features on an experimental basis. These features are included only for testing purposes and are not yet supported officially:

## Authorization for External Tables
You can enable authorization of external tables by setting the system property **CHECK_EXTERNAL_TABLE_AUTHZ** to true when the cluster's security is enabled.
The system admin or the schema owner can grant or revoke permissions on external tables for other users.
For example: `GRANT ALL ON <external-table> to <user>;`


## Support for Ad-hoc, Interactive Execution of Scala Code
You can execute Scala code using a new CLI script, **snappy-scala**, that is built with IJ APIs. You can also run such code as a SQL command using the **exec scala** prefix.
The Scala code can use any valid/supported Spark API, for example, to carry out custom data loading/transformations or to launch a structured streaming job. Since the code is submitted as a SQL command, you can now also use any SQL tool (based on JDBC/ODBC), including notebook environments, to execute ad-hoc code blocks directly. Prior to this feature, applications were required to use the Smart Connector or the TIBCO ComputeDB-specific native Zeppelin interpreter.
The **exec scala** command can be secured using SQL GRANT/REVOKE permissions. The system admin (DB owner) can grant or revoke the Scala interpreter privilege for users.

For more information, refer to [Executing Spark Scala Code using SQL](/programming_guide/scala_interpreter.md).
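As a rough sketch of what this enables from any SQL client: the `options(returnDF ...)` clause and the implicit `snappysession` variable shown below are recalled from the linked reference and may differ in your version, and `employees` is a hypothetical table.

```
exec scala options(returnDF 'resultDF')
  val resultDF = snappysession.sql("SELECT count(*) AS total FROM employees")
;
```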
2 changes: 0 additions & 2 deletions docs/howto.md
@@ -42,8 +42,6 @@ The following topics are covered in this section:

* [How to Connect using JDBC Driver](howto/connect_using_jdbc_driver.md)<a id="howto-jdbc"></a>

* [How to Store and Query JSON Objects](howto/store_and_query_json_objects.md)<a id="howto-JSON"></a>

* [How to Store and Query Objects](howto/store_and_query_objects.md)<a id="howto-objects"></a>

* [How to use Stream Processing with SnappyData](howto/use_stream_processing_with_snappydata.md)<a id="howto-streams"></a>
42 changes: 20 additions & 22 deletions docs/howto/connect_oss_vis_client_tools.md
@@ -40,6 +40,8 @@ To connect SnappyData from DbVisualizer, do the following:
!!! Note
The steps provided here are specific to DbVisualizer version 10.0.20 and can vary slightly in other versions.

For secure connections, refer to [Creating a Secure Connection from JDBC Client Tools](#secureconnectJDBC).

<a id= sqlworkbenchj> </a>
## SQL Workbench/J

@@ -70,6 +72,8 @@ To connect SnappyData from SQL Workbench/J, do the following:
* Enter username and password.
8. Click the **Test** button and then click **OK**. <br> After the connection succeeds, you can run queries on SnappyData from SQL Workbench/J.

For secure connections, refer to [Creating a Secure Connection from JDBC Client Tools](#secureconnectJDBC).

<a id= dbeaver> </a>
## DBeaver
DBeaver is a graphical database management tool that you can use to access SnappyData. Download and install DBeaver and then connect to SnappyData from DBeaver.
@@ -82,16 +86,6 @@ To download and install DBeaver, do the following:
2. Choose an appropriate installer for the corresponding operating system. For example, for Linux Debian package, download from [this link](https://dbeaver.io/files/dbeaver-ce_latest_amd64.deb).
3. Run the corresponding commands that are specified in the **Install** section on the Download page.

### Starting the LDAP Server

To start the LDAP server, do the following:

1. From the terminal, go to the location of ldap-test-server: <br> `cd $SNAPPY_HOME/store/ldap-test-server`
2. Run the following command to build: <br>`./gradlew build`
3. Run the script: <br>`./start-ldap-server.sh auth.ldif`<br>
This starts the LDAP server and prints the LDAP conf. The printed LDAP conf contains the username and password of LDAP that should be used to connect from DBeaver. Copy this into all the conf files of SnappyData.
4. Start the SnappyData cluster.

### Connecting to SnappyData from DBeaver
4. Launch DBeaver and click **New database connection**.
5. Select **Hadoop / Big Data** section from the left. </br> ![Images](../Images/sql_clienttools_images/dbeaver_install1.png) </br> ![Images](../Images/sql_clienttools_images/dbeaver_install2.png)
@@ -101,6 +95,8 @@ To start the LDAP server, do the following:
* Username / Password
7. Test the connection and finish the setup of the database source.

For secure connections, refer to [Creating a Secure Connection from JDBC Client Tools](#secureconnectJDBC).

<a id= squirrel> </a>
## SQuirreL SQL Client

@@ -116,18 +112,7 @@ To download and install SQuirrel, do the following:
3. Go to the SQuirreL SQL Client installation folder and run the following command:<br>
`./squirrel-sql.sh`

### Starting the LDAP Server

To start the LDAP server, do the following:

1. From the terminal, go to the location of ldap-test-server: <br> `cd $SNAPPY_HOME/store/ldap-test-server`
2. Run the following command: <br>`./gradlew build`
3. Run the following script: <br>`./start-ldap-server.sh auth.ldif`
This starts the LDAP server and prints the LDAP conf. The printed LDAP conf contains username and password of LDAP that should be used to connect from SQuirreL SQL Client. Copy this into all the conf files of SnappyData.
4. Start SnappyData cluster.


### Connecting to SnappyData from SQuirreL SQL Client
### Connecting to SnappyData from SQuirreL SQL Client

To connect SnappyData from SQuirreL SQL Client, do the following:

@@ -166,8 +151,21 @@ jdbc jar: https://mvnrepository.com/artifact/io.snappydata/snappydata-jdbc_2.11/
drop table if exists colTable;
show tables;
For secure connections, refer to [Creating a Secure Connection from JDBC Client Tools](#secureconnectJDBC).
!!!Note
When connecting to SnappyData, if a SQL client tool sets JDBC autocommit to false and a transaction isolation level such as read committed or repeatable read is used, unsupported operations such as those on column tables produce the error **Operations on column tables are not supported when query routing is disabled or autocommit is false.** In such cases, the connection property **allow-explicit-commit=true** can be used in the connection URL to avoid this error. Refer to the configuration parameters section <add a link to the section> for details on this property. For example, JDBC URL: **jdbc:snappydata://locatorHostName:1527/allow-explicit-commit=true**
<a id= secureconnectJDBC> </a>
## Creating a Secure Connection from JDBC Client Tools
If you already have an LDAP server, you can use it to connect to the SnappyData cluster, or you can use the LDAP server that comes pre-configured with SnappyData.
To start the pre-configured LDAP server of SnappyData, do the following:
1. From the terminal, go to the location of ldap-test-server: <br> `cd $SNAPPY_HOME/store/ldap-test-server`
2. Run the following command to build: <br>`./gradlew build`
3. Run the script: <br>`./start-ldap-server.sh auth.ldif`<br>
This starts the LDAP server and prints the LDAP configuration. The printed configuration contains the LDAP username and password that should be used to connect from JDBC clients. Copy this configuration into the leads/servers/locators conf files of SnappyData.
4. Start the SnappyData cluster.
5. When connecting from a JDBC client, ensure that you provide the username and password printed in step 3.
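Most client tools also accept the credentials directly in the JDBC URL. The attribute-style form below is a sketch that assumes the standard `;name=value` connection-attribute syntax; the host, port, user, and password are placeholders:

```
jdbc:snappydata://localhost:1527/;user=<ldap-user>;password=<ldap-password>
```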
6 changes: 3 additions & 3 deletions docs/howto/load_data_from_external_data_stores.md
@@ -1,7 +1,7 @@
<a id="howto-external-source"></a>
# How to Load Data from External Data Stores (e.g. HDFS, Cassandra, Hive, etc)

SnappyData comes bundled with the libraries to access HDFS (Apache compatible). You can load your data using SQL or DataFrame API.
SnappyData comes bundled with the libraries to access HDFS (Apache compatible). You can load your data using SQL or DataFrame API.

## Example - Loading data from CSV file using SQL

@@ -76,7 +76,7 @@ df.write.format("column").saveAsTable("columnTable")
## Importing Data using JDBC from a relational DB

!!! Note
Before you begin, you must install the corresponding JDBC driver. To do so, copy the JDBC driver jar file in **/jars** directory located in the home directory and then restart the cluster.
Before you begin, you must install the corresponding JDBC driver. Refer to [Deploying Third Party Connectors](/connectors/deployment_dependency_jar.md).

<!--**TODO: This is a problem- restart the cluster ? Must confirm package installation or at least get install_jar tested for this case. -- Jags**
-->
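A minimal sketch of such an import is shown below, assuming a SnappySession named `snappySession`; the JDBC URL, table, and credentials are placeholders, and the driver for the source database is assumed to have been deployed as described in the note above.

```
// Read from an external RDBMS over JDBC and save into a SnappyData column table.
val ordersDF = snappySession.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/sales")
  .option("dbtable", "orders")
  .option("user", "<db-user>")
  .option("password", "<db-password>")
  .load()

ordersDF.write.format("column").saveAsTable("orders_column")
```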
@@ -135,7 +135,7 @@ Refer to the [Spark SQL JDBC source access for how to parallelize access when de
The example below demonstrates how you can load data from a NoSQL store:

!!! Note
Before you begin, you must install the corresponding Spark-Cassandra connector jar. To do so, copy the Spark-Cassandra connector jar file to the **/jars** directory located in the home directory and then restart the cluster.
Before you begin, you must install the corresponding Spark-Cassandra connector jar. Refer to [Deploying Third Party Connectors](/connectors/deployment_dependency_jar.md).

<!--**TODO** This isn't a single JAR from what I know. The above step needs testing and clarity. -- Jags
-->