Transforms
Transforms allow us to reshape and aggregate data according to our needs. While sources provide a point of origin for input data, and sinks provide a final destination for processed data, transforms define the business rules we wish to apply to our data. For example, given a stream of user sessions:
CREATE STREAM sessions(user_id int, session_timestamp timestamp)
FROM JSON
OPTIONS(...);
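The stream's schema tells PSTL how to parse each incoming record. For reference, a matching JSON payload might look like the following (a hypothetical sample, not taken from the PSTL docs; the timestamp format will depend on your source's parsing options):
{"user_id": 42, "session_timestamp": "2019-03-01 17:30:00"}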
CREATE TEMPORARY VIEW sessions_by_hour AS
SELECT
  cast(date_format(session_timestamp, 'yyyyMMddHH') as int) session_hour,
user_id
FROM sessions;
CREATE TEMPORARY VIEW session_count_by_hour AS
SELECT
  date_format(session_timestamp, 'yyyyMMddHH') session_hour,
  count(distinct user_id) session_count
FROM sessions
GROUP BY date_format(session_timestamp, 'yyyyMMddHH');
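Once defined, these views behave like any other table in downstream statements. As a quick sanity check during development, you might run an ad hoc query such as this minimal sketch (the ORDER BY is purely illustrative):
-- inspect the hourly rollup
SELECT session_hour, session_count
FROM session_count_by_hour
ORDER BY session_hour;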
Transforms do not support an OPTIONS(...) clause like the one you may be familiar with from sources and sinks. However, other PSTL and [Apache Spark][apache-spark] settings influence runtime behavior related to transform definitions, so we list the configurations of interest below.
spark.sql.view.maxNestedViewDepth
The maximum depth of a view reference in a nested view. A nested view may reference other nested views, and the dependencies are organized in a directed acyclic graph (DAG). However, the DAG depth may become too large and cause unexpected behavior. This configuration puts a limit on this: when the depth of a view exceeds this value during analysis, we terminate the resolution to avoid potential errors. Defaults to 100.
# spark.properties
spark.sql.view.maxNestedViewDepth=20
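Each view that selects from another view adds one level to this DAG. As a contrived sketch (the view names here are hypothetical), the following chain resolves three levels deep before reaching the underlying sessions stream:
-- each view references the previous one, deepening the DAG by one level
CREATE TEMPORARY VIEW v1 AS SELECT * FROM sessions;
CREATE TEMPORARY VIEW v2 AS SELECT * FROM v1;
CREATE TEMPORARY VIEW v3 AS SELECT * FROM v2;
If a chain like this exceeds spark.sql.view.maxNestedViewDepth, analysis terminates rather than attempting to resolve the reference.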
- [Spark SQL Functions][spark-sql-functions]
[apache-spark]: https://spark.apache.org/
[spark-sql-functions]: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html