Merge pull request #2110 from aiven/dorota-clickhouse-federated-queries

clickhouse: add docs on the federated queries feature #2110
aiven · Aug 31, 2023 · 21f7895 · 21f7895
2 parents 332049a + 9fed6b2
commit 21f7895
Show file tree

Hide file tree

Showing 7 changed files with 274 additions and 1 deletion.
diff --git a/.github/vale/dicts/aiven.dic b/.github/vale/dicts/aiven.dic
@@ -202,6 +202,7 @@ Redli
 refcard
 reindex
 reindexing
+remoteSecure
 reStructuredText
 rivery
 rsyslog
@@ -237,6 +238,7 @@ Sql
 SSD/S
 Stackdriver
 Statusmap
+storages
 Stunnel
 subnet
 subnets

diff --git a/_toc.yml b/_toc.yml
@@ -726,6 +726,8 @@ entries:
             title: Disaster recovery
           - file: docs/products/clickhouse/concepts/strings
             title: Strings
+          - file: docs/products/clickhouse/concepts/federated-queries
+            title: Federated queries
       - file: docs/products/clickhouse/howto
         title: HowTo
         entries:
@@ -768,8 +770,10 @@ entries:
                 title: Read and write data across shards
               - file: docs/products/clickhouse/howto/copy-data-across-instances
                 title: Copy data across ClickHouse servers
-              - file: docs/products/clickhouse/howto/fetch-query-statistics.rst
+              - file: docs/products/clickhouse/howto/fetch-query-statistics
                 title: Fetch query statistics
+              - file: docs/products/clickhouse/howto/run-federated-queries
+                title: Run federated queries
           - file: docs/products/clickhouse/howto/list-manage-cluster
             title: Manage cluster
           - file: docs/products/clickhouse/howto/list-integrations

diff --git a/docs/products/clickhouse/concepts.rst b/docs/products/clickhouse/concepts.rst
@@ -18,6 +18,10 @@ Aiven service
         :shadow: md
         :margin: 2 2 0 0
 
+    .. grid-item-card:: :doc:`Read and pull data from external S3 with federated queries </docs/products/clickhouse/concepts/federated-queries>`
+        :shadow: md
+        :margin: 2 2 0 0
+
 General
 -------
 

diff --git a/docs/products/clickhouse/concepts/federated-queries.rst b/docs/products/clickhouse/concepts/federated-queries.rst
@@ -0,0 +1,55 @@
+About querying external data in Aiven for ClickHouse®
+=====================================================
+
+Discover federated queries and their capabilities in Aiven for ClickHouse®. Check why you might need to use them, for example, how they simplify and speed up migrating into Aiven from external data sources. Learn how federated queries work and how you can run them properly.
+
+About federated queries
+-----------------------
+
+Federated queries allow communication between Aiven for ClickHouse and S3-compatible object storages and web resources. The federated queries feature in Aiven for ClickHouse enables you to read and pull data from an external object storage that uses the S3 integration engine, or any web resource accessible over HTTP.
+
+.. note::
+
+   The federated queries feature in Aiven for ClickHouse is enabled by default.
+
+Why use federated queries
+-------------------------
+
+There are a few reasons why you might want to use federated queries:
+
+* Query remote data from your ClickHouse service. Ingest it into Aiven for ClickHouse or only reference external data sources as part of an analytics query. In the context of an increasing footprint of connected data sources, federated queries can help you better understand how your customers use your products.
+* Simplify and speed up the import of your data into the Aiven for ClickHouse instance from a legacy data source, avoiding a long and sometimes complex migration path.
+* Improve the migration of data in Aiven for ClickHouse, and extend analysis over external data sources with a relatively low effort in comparison to enabling distributed tables and `the remote and remoteSecure functionalities <https://clickhouse.com/docs/en/sql-reference/table-functions/remote>`_.
+
+.. note::
+
+   The ``remote()`` and ``remoteSecure()`` features are designed to read from remote data sources or provide the ability to create a distributed table across remote data sources but they are not designed to read from an external S3 storage.
+
+How it works
+------------
+
+To run a federated query, the ClickHouse service user connecting to the cluster requires grants to the S3 and/or URL sources. The main service user is granted access to the sources by default, and new users can be allowed to use the sources via the CREATE TEMPORARY TABLE grant, which is required for both sources.
+
+.. seealso::
+
+   For more information on how to enable new users to use the sources, check :ref:`Access and permissions <access-permissions>`.
+
+Federated queries read from external S3-compatible object storage utilizing the ClickHouse S3 engine. Once you read from a remote S3-compatible storage, you can select from that storage and insert into a table in the Aiven local instance, enabling migration of data into Aiven.
+
+.. seealso::
+
+    For more details on how to run federated querie in Aiven for ClickHouse, check :doc:`Read and pull data from S3 object storages and web resources over HTTP </docs/products/clickhouse/howto/run-federated-queries>`.
+
+Limitations
+-----------
+
+* Federated queries in Aiven for ClickHouse only support S3-compatible object storage providers for the time being. More external data sources coming soon!
+* Virtual tables are only supported for URL sources, using the URL table engine. Stay tuned for us supporting the S3 table engine in the future!
+
+Related reading
+---------------
+
+* :doc:`Read and pull data from S3 object storages and web resources over HTTP </docs/products/clickhouse/howto/run-federated-queries>`
+* `Cloud Compatibility | ClickHouse Docs <https://clickhouse.com/docs/en/whats-new/cloud-compatibility#federated-queries>`_
+* `Integrating S3 with ClickHouse <https://clickhouse.com/docs/en/integrations/s3>`_
+* `remote, remoteSecure | ClickHouse Docs <https://clickhouse.com/docs/en/sql-reference/table-functions/remote>`_
diff --git a/docs/products/clickhouse/howto.rst b/docs/products/clickhouse/howto.rst
@@ -15,6 +15,7 @@ Aiven for ClickHouse® how-tos
     - :doc:`Create materialised views in Aiven for ClickHouse® </docs/products/clickhouse/howto/materialized-views>`
     - :doc:`Monitor Aiven for ClickHouse® performance </docs/products/clickhouse/howto/monitor-performance>`
     - :doc:`Fetch query statistics for Aiven for ClickHouse® </docs/products/clickhouse/howto/fetch-query-statistics>`
+    - :doc:`Run federated queries in Aiven for ClickHouse® </docs/products/clickhouse/howto/run-federated-queries>`
 
 .. dropdown:: Cluster management
 

diff --git a/docs/products/clickhouse/howto/list-manage-service.rst b/docs/products/clickhouse/howto/list-manage-service.rst
@@ -34,3 +34,7 @@ Manage your Aiven for ClickHouse® service
     .. grid-item-card:: :doc:`Fetch query statistics for Aiven for ClickHouse® </docs/products/clickhouse/howto/fetch-query-statistics>`
         :shadow: md
         :margin: 2 2 0 0
+
+    .. grid-item-card:: :doc:`Run federated in Aiven for ClickHouse® </docs/products/clickhouse/howto/run-federated-queries>`
+        :shadow: md
+        :margin: 2 2 0 0
diff --git a/docs/products/clickhouse/howto/run-federated-queries.rst b/docs/products/clickhouse/howto/run-federated-queries.rst
@@ -0,0 +1,203 @@
+Read and pull data from S3 object storages and web resources over HTTP
+======================================================================
+
+With federated queries in Aiven for ClickHouse®, you can read and pull data from an external S3-compatible object storage or any web resource accessible over HTTP. Learn more about capabilities and applications of federated queries in :doc:`About querying external data in Aiven for ClickHouse® </docs/products/clickhouse/concepts/federated-queries>`.
+
+This article shows how to run federated queries in Aiven for ClickHouse. It provides multiple examples of querying external resources using the SELECT and INSERT statements and the S3 and the URL functions.
+
+About running federated queries
+-------------------------------
+
+Federated queries are written using specific SQL statements and can be run from CLI, for instance. To run a federated query, just send a query over an external S3-compatible object storage including relevant S3 bucket details. A properly-constructed federated query returns a specific output.
+
+Before you start
+----------------
+
+.. _access-permissions:
+
+Access and permissions
+''''''''''''''''''''''
+
+To run a federated query, the ClickHouse service user connecting to the cluster requires grants to the S3 and/or URL sources. The main service user is granted access to the sources by default, and new users can be allowed to use the sources with the following query:
+
+.. code-block:: sql
+
+   GRANT CREATE TEMPORARY TABLE, S3, URL ON *.* TO <username> [WITH GRANT OPTION]
+
+
+The CREATE TEMPORARY TABLE grant is required for both sources. Adding WITH GRANT OPTION allows the user to further transfer the privileges.
+
+Limitations
+'''''''''''
+
+* Federated queries in Aiven for ClickHouse only support S3-compatible object storage providers for the time being. More external data sources coming soon!
+* Virtual tables are only supported for URL sources, using the URL table engine. Stay tuned for us supporting the S3 table engine in the future!
+
+Run a federated query
+---------------------
+
+Check out some examples of running federated queries to read and pull data from external S3-compatible object storages.
+
+Query using SELECT and the S3 function
+''''''''''''''''''''''''''''''''''''''
+
+SQL SELECT statements using the S3 and URL functions are able to query public resources using the URL of the resource.
+For instance, let's explore the network connectivity measurement data provided by the `Open Observatory of Network Interference (OONI) <https://ooni.org/data/>`_.
+
+.. code-block:: sql
+
+   WITH ooni_data_sample AS
+      (
+	 SELECT *
+	 FROM s3('https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/clickhouse_export/csv/fastpath_202308.csv.zstd')
+	 LIMIT 100000
+      )
+   SELECT
+      probe_cc AS probe_country_code,
+      test_name,
+      countIf(anomaly = 't') AS total_anomalies
+   FROM ooni_data_sample
+   GROUP BY
+      probe_country_code,
+      test_name
+   HAVING total_anomalies > 10
+   ORDER BY total_anomalies DESC
+   LIMIT 50
+
+Query using SELECT and the s3Cluster function
+'''''''''''''''''''''''''''''''''''''''''''''
+
+The ``s3Cluster`` function allows all cluster nodes to participate in the query execution.
+Using `default` for the cluster name parameter, we can compute the same aggregations as above as follows:
+
+.. code-block:: sql
+
+   WITH ooni_clustered_data_sample AS
+       (
+	   SELECT *
+	   FROM s3Cluster('default', 'https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/clickhouse_export/csv/fastpath_202308.csv.zstd')
+	   LIMIT 100000
+       )
+   SELECT
+       probe_cc AS probe_country_code,
+       test_name,
+       countIf(anomaly = 't') AS total_anomalies
+   FROM ooni_clustered_data_sample
+   GROUP BY
+       probe_country_code,
+       test_name
+   HAVING total_anomalies > 10
+   ORDER BY total_anomalies DESC
+   LIMIT 50
+
+Query a private S3 bucket
+'''''''''''''''''''''''''
+
+Private buckets can be accessed by providing the access token and secret as function parameters.
+
+.. code-block:: sql
+
+   SELECT *
+   FROM s3(
+     'https://private-bucket.s3.eu-west-3.amazonaws.com/dataset-prefix/partition-name.csv',
+     'some_aws_access_key_id',
+     'some_aws_secret_access_key'
+   )
+
+Depending on the format, the schema can be automatically detected. If it isn't, you may also provide the column types as function parameters.
+
+.. code-block:: sql
+
+   SELECT *
+   FROM s3(
+     'https://private-bucket.s3.eu-west-3.amazonaws.com/orders-dataset/partition-name.csv',
+     'access_token',
+     'secret_token',
+     'CSVWithNames',
+     "`order_id` UInt64, `quantity` Decimal(9, 18), `order_datetime` DateTime"
+   )
+
+Query using SELECT and the URL function
+'''''''''''''''''''''''''''''''''''''''
+
+Let's query the `Growth Projections and Complexity Rankings <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XTAQMC&version=4.0>`_ dataset, courtesy of the
+`Atlas of Economic Complexity <https://atlas.cid.harvard.edu/>`_ project.
+
+.. code-block:: sql
+
+  WITH economic_complexity_ranking AS
+      (
+	  SELECT *
+	  FROM url('https://dataverse.harvard.edu/api/access/datafile/7259657?format=tab', 'TSV')
+      )
+  SELECT
+      replace(code, '"', '') AS `ISO country code`,
+      growth_proj AS `Forecasted annualized rate of growth`,
+      toInt32(replace(sitc_eci_rank, '"', '')) AS `Economic Complexity Index ranking`
+  FROM economic_complexity_ranking
+  WHERE year = 2021
+  ORDER BY `Economic Complexity Index ranking` ASC
+  LIMIT 20
+
+Query using INSERT and the URL function
+'''''''''''''''''''''''''''''''''''''''
+
+With the URL function, INSERT statements generate a POST request, which can be used to interact with APIs having public endpoints.
+For instance, if your application has a ``ingest-csv`` endpoint accepting CSV data, you can insert a row using the following statement:
+
+.. code-block:: sql
+
+   INSERT INTO FUNCTION
+     url('https://app-name.company-name.cloud/api/ingest-csv', 'CSVWithNames')
+   VALUES ('column1-value', 'column2-value');
+
+Query using INSERT and the S3 function
+'''''''''''''''''''''''''''''''''''''''
+
+When executing an INSERT statement into the S3 function, the rows are appended to the corresponding object if the table structure matches:
+
+.. code-block:: sql
+
+   INSERT INTO FUNCTION
+     s3('https://bucket-name.s3.region-name.amazonaws.com/dataset-name/landing/raw-data.csv', 'CSVWithNames')
+   VALUES ('column1-value', 'column2-value');
+
+Query a virtual table
+'''''''''''''''''''''
+
+Instead of specifying the URL of the resource in every query, it's possible to create a virtual table using the URL table engine. This can be achieved by running a DDL CREATE statement similar to the following:
+
+.. code-block:: sql
+
+   CREATE TABLE trips_export_endpoint_table
+   (
+       `trip_id` UInt32,
+       `vendor_id` UInt32,
+       `pickup_datetime` DateTime,
+       `dropoff_datetime` DateTime,
+       `trip_distance` Float64,
+       `fare_amount` Float32
+   )
+   ENGINE = URL('https://app-name.company-name.cloud/api/trip-csv-export', CSV)
+
+Once the table is defined, SELECT and INSERT statements execute GET and POST requests to the URL respectively:
+
+.. code-block:: sql
+
+    SELECT
+	toDate(pickup_datetime) AS pickup_date,
+	median(fare_amount) AS median_fare_amount,
+	max(fare_amount) AS max_fare_amount
+    FROM trips_export_endpoint_table
+    GROUP BY pickup_date
+
+   INSERT INTO trips_export_endpoint_table
+   VALUES (8765, 10, now() - INTERVAL 15 MINUTE, now(), 50, 20)
+
+Related reading
+---------------
+
+* :doc:`About querying external data in Aiven for ClickHouse® </docs/products/clickhouse/concepts/federated-queries>`
+* `Cloud Compatibility | ClickHouse Docs <https://clickhouse.com/docs/en/whats-new/cloud-compatibility#federated-queries>`_
+* `Integrating S3 with ClickHouse <https://clickhouse.com/docs/en/integrations/s3>`_
+* `remote, remoteSecure | ClickHouse Docs <https://clickhouse.com/docs/en/sql-reference/table-functions/remote>`_