
Transforms


Transforms allow us to reshape and aggregate data. While sources provide a point of origin for input data and sinks provide a final destination for processed data, transforms define the business rules we apply to the data in between.

Examples

CREATE STREAM sessions(user_id int, session_timestamp timestamp)
FROM JSON
OPTIONS(...);

CREATE TEMPORARY VIEW sessions_by_hour AS
SELECT
  cast(date_format(session_timestamp, 'yyyyMMddHH') as int) session_hour,
  user_id
FROM sessions;

CREATE TEMPORARY VIEW session_count_by_hour AS
SELECT
  date_format(session_timestamp, 'yyyyMMddHH') session_hour,
  count(distinct user_id) session_count
FROM sessions
GROUP BY date_format(session_timestamp, 'yyyyMMddHH');
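
Views may also reference other views. Since sessions_by_hour already computes session_hour, the same aggregate could be defined against it instead; a minimal sketch (the view name here is illustrative):

-- equivalent aggregate built on top of the sessions_by_hour view
CREATE TEMPORARY VIEW session_count_by_hour_v2 AS
SELECT
  session_hour,
  count(distinct user_id) session_count
FROM sessions_by_hour
GROUP BY session_hour;

Each such reference adds a level of view nesting, which the spark.sql.view.maxNestedViewDepth setting described under Options bounds.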

Options

Transforms do not support an OPTIONS(...) clause like sources and sinks do. However, there are other PSTL and [Apache Spark][apache-spark] settings which influence runtime behavior related to transform definitions. The configurations of interest are listed below.

spark.sql.view.maxNestedViewDepth

The maximum depth of a view reference in a nested view. A nested view may reference other nested views; the dependencies are organized in a directed acyclic graph (DAG). However, the DAG depth may become too large and cause unexpected behavior. This configuration puts a limit on this: when the depth of a view exceeds this value during analysis, resolution is terminated to avoid potential errors.

Defaults to 100.

# spark.properties
spark.sql.view.maxNestedViewDepth=20
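
For example, each view below adds one level of nesting on top of sessions; resolving the outermost view during analysis walks the entire chain, and analysis fails once the chain exceeds the configured limit. The view names here are hypothetical:

-- v2 references v1, which references sessions: a view chain two levels deep
CREATE TEMPORARY VIEW v1 AS SELECT user_id, session_timestamp FROM sessions;
CREATE TEMPORARY VIEW v2 AS SELECT user_id FROM v1;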

Resources

  • [Spark SQL Functions][spark-sql-functions]

Known Limitations

[apache-spark]: https://spark.apache.org/
[spark-sql-functions]: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
