
Commit 4e59070

prateekmjagadish-v0 authored and committed

Updated API documentation for high and low level APIs.

vjagadish1989, nickpan47: please take a look.

Author: Prateek Maheshwari <pmaheshwari@apache.org>
Reviewers: Jagadish <jagadish@apache.org>

Closes apache#802 from prateekm/api-docs
1 parent 1d6f693 commit 4e59070

File tree

18 files changed

+483
-481
lines changed


docs/learn/documentation/versioned/api/high-level-api.md

Lines changed: 214 additions & 197 deletions
Large diffs are not rendered by default.

docs/learn/documentation/versioned/api/low-level-api.md

Lines changed: 164 additions & 187 deletions
Large diffs are not rendered by default.

docs/learn/documentation/versioned/api/programming-model.md

Lines changed: 62 additions & 56 deletions
@@ -18,96 +18,103 @@ title: Programming Model
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
-# Introduction
-Samza provides different sets of programming APIs to meet requirements from different sets of users. The APIs are listed below:
+### Introduction
+Samza provides multiple programming APIs to fit your use case:
 
-1. Java programming APIs: Samza provides Java programming APIs for users who are familiar with imperative programming languages. The overall programming model to create a Samza application in Java will be described here. Samza also provides two sets of APIs to describe user processing logic:
-    1. [High-level API](high-level-api.md): this API allows users to describe the end-to-end stream processing pipeline in a connected DAG (Directed Acyclic Graph). It also provides a rich set of built-in operators to help users implement common transformation logic, such as filter, map, join, and window.
-    2. [Task API](low-level-api.md): this is a low-level Java API which provides "bare-metal" programming interfaces to the users. Task API allows users to explicitly access physical implementation details in the system, such as accessing the physical system stream partition of an incoming message and explicitly controlling the thread pool used to execute asynchronous processing methods.
-2. [Samza SQL](samza-sql.md): Samza provides SQL for users who are familiar with declarative query languages, which allows the users to focus on data manipulation via SQL predicates and UDFs, not the physical implementation details.
-3. Beam API: Samza also provides a [Beam runner](https://beam.apache.org/documentation/runners/capability-matrix/) to run applications written in the Beam API. This is considered an extension to the existing operators supported by the high-level API in Samza.
+1. Java APIs: Samza provides two Java programming APIs that are ideal for building advanced stream processing applications.
+    1. [High Level Streams API](high-level-api.md): Samza's flexible High Level Streams API lets you describe your complex stream processing pipeline in the form of a Directed Acyclic Graph (DAG) of operations on message streams. It provides a rich set of built-in operators that simplify common stream processing operations such as filtering, projection, repartitioning, joins, and windows.
+    2. [Low Level Task API](low-level-api.md): Samza's powerful Low Level Task API lets you write your application in terms of processing logic for each incoming message.
+2. [Samza SQL](samza-sql.md): Samza SQL provides a declarative query language for describing your stream processing logic. It lets you manipulate streams using SQL predicates and UDFs instead of working with the physical implementation details.
+3. Apache Beam API: Samza also provides an [Apache Beam runner](https://beam.apache.org/documentation/runners/capability-matrix/) to run applications written using the Apache Beam API. This is considered an extension to the operators supported by the High Level Streams API in Samza.
 
-The following sections will be focused on Java programming APIs.
-
-# Key Concepts for a Samza Java Application
-To write a Samza Java application, you will typically follow the steps below:
-1. Define your input and output streams and tables
-2. Define your main processing logic
 
+### Key Concepts
 The following sections will talk about key concepts in writing your Samza applications in Java.
 
-## Samza Applications
-When writing your stream processing application using the Java API in Samza, you implement either a [StreamApplication](javadocs/org/apache/samza/application/StreamApplication.html) or [TaskApplication](javadocs/org/apache/samza/application/TaskApplication.html) and define your processing logic in the describe method.
-- For StreamApplication:
+#### Samza Applications
+A [SamzaApplication](javadocs/org/apache/samza/application/SamzaApplication.html) describes the inputs, outputs, state, configuration and the logic for processing data from one or more streaming sources.
+
+You can implement a [StreamApplication](javadocs/org/apache/samza/application/StreamApplication.html) and use the provided [StreamApplicationDescriptor](javadocs/org/apache/samza/application/descriptors/StreamApplicationDescriptor.html) to describe the processing logic using Samza's High Level Streams API in terms of [MessageStream](javadocs/org/apache/samza/operators/MessageStream.html) operators.
 
 {% highlight java %}
-
-public void describe(StreamApplicationDescriptor appDesc) { … }
+
+public class MyStreamApplication implements StreamApplication {
+  @Override
+  public void describe(StreamApplicationDescriptor appDesc) {
+    // Describe your application here
+  }
+}
 
 {% endhighlight %}
+
+Alternatively, you can implement a [TaskApplication](javadocs/org/apache/samza/application/TaskApplication.html) and use the provided [TaskApplicationDescriptor](javadocs/org/apache/samza/application/descriptors/TaskApplicationDescriptor.html) to describe it using Samza's Low Level Task API in terms of per-message processing logic.
+
 - For TaskApplication:
 
 {% highlight java %}
 
-public void describe(TaskApplicationDescriptor appDesc) { … }
+public class MyTaskApplication implements TaskApplication {
+  @Override
+  public void describe(TaskApplicationDescriptor appDesc) {
+    // Describe your application here
+  }
+}
 
 {% endhighlight %}
 
-## Descriptors for Data Streams and Tables
-There are three different types of descriptors in Samza: [InputDescriptor](javadocs/org/apache/samza/system/descriptors/InputDescriptor.html), [OutputDescriptor](javadocs/org/apache/samza/system/descriptors/OutputDescriptor.html), and [TableDescriptor](javadocs/org/apache/samza/table/descriptors/TableDescriptor.html). The InputDescriptor and OutputDescriptor are used to describe the physical sources and destinations of a stream, while a TableDescriptor is used to describe the physical dataset and I/O functions for a table.
-Usually, you will obtain an InputDescriptor and OutputDescriptor from a [SystemDescriptor](javadocs/org/apache/samza/system/descriptors/SystemDescriptor.html), which includes all information about the producers and consumers for a physical system. The following code snippet illustrates how to obtain an InputDescriptor and OutputDescriptor from a SystemDescriptor.
 
-{% highlight java %}
-
-public class BadPageViewFilter implements StreamApplication {
-  @Override
-  public void describe(StreamApplicationDescriptor appDesc) {
-    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor();
-    InputDescriptor<PageView> pageViewInput = kafka.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
-    OutputDescriptor<DecoratedPageView> pageViewOutput = kafka.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
+#### Streams and Table Descriptors
+Descriptors let you specify the properties of various aspects of your application from within it.
 
-    // Now, implement your main processing logic
-  }
-}
-
-{% endhighlight %}
+[InputDescriptor](javadocs/org/apache/samza/system/descriptors/InputDescriptor.html)s and [OutputDescriptor](javadocs/org/apache/samza/system/descriptors/OutputDescriptor.html)s can be used for specifying Samza and implementation-specific properties of the streaming inputs and outputs for your application. You can obtain InputDescriptors and OutputDescriptors using a [SystemDescriptor](javadocs/org/apache/samza/system/descriptors/SystemDescriptor.html) for your system. This SystemDescriptor can be used for specifying Samza and implementation-specific properties of the producers and consumers for your I/O system. Most Samza system implementations come with their own SystemDescriptors, but if one isn't available, you can use the [GenericSystemDescriptor](javadocs/org/apache/samza/system/descriptors/GenericSystemDescriptor.html).
 
-You can also add a TableDescriptor to your application.
+A [TableDescriptor](javadocs/org/apache/samza/table/descriptors/TableDescriptor.html) can be used for specifying Samza and implementation-specific properties of a [Table](javadocs/org/apache/samza/table/Table.html). You can use a local TableDescriptor (e.g. [RocksDbTableDescriptor](javadocs/org/apache/samza/storage/kv/descriptors/RocksDbTableDescriptor.html)) or a [RemoteTableDescriptor](javadocs/org/apache/samza/table/descriptors/RemoteTableDescriptor.html).
+
+The following example illustrates how you can use input and output descriptors for a Kafka system, and a table descriptor for a local RocksDB table within your application:
 
 {% highlight java %}
-
-public class BadPageViewFilter implements StreamApplication {
+
+public class MyStreamApplication implements StreamApplication {
   @Override
-  public void describe(StreamApplicationDescriptor appDesc) {
-    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor();
-    InputDescriptor<PageView> pageViewInput = kafka.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
-    OutputDescriptor<DecoratedPageView> pageViewOutput = kafka.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
-    TableDescriptor<String, Integer> viewCountTable = new RocksDBTableDescriptor(
-      "pageViewCountTable", KVSerde.of(new StringSerde(), new IntegerSerde()));
-
-    // Now, implement your main processing logic
+  public void describe(StreamApplicationDescriptor appDescriptor) {
+    KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("kafka")
+        .withConsumerZkConnect(ImmutableList.of("..."))
+        .withProducerBootstrapServers(ImmutableList.of("...", "..."));
+    KafkaInputDescriptor<PageView> kid =
+        ksd.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
+    KafkaOutputDescriptor<DecoratedPageView> kod =
+        ksd.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
+
+    RocksDbTableDescriptor<String, Integer> td =
+        new RocksDbTableDescriptor("viewCounts", KVSerde.of(new StringSerde(), new IntegerSerde()));
+
+    // Implement your processing logic here
   }
 }
 
 {% endhighlight %}
 
 The same code in the above describe method applies to TaskApplication as well.
 
-## Stream Processing Logic
+#### Stream Processing Logic
 
-Samza provides two sets of APIs to define the main stream processing logic, high-level API and Task API, via StreamApplication and TaskApplication, respectively.
+Samza provides two sets of APIs to define the main stream processing logic, High Level Streams API and Low Level Task API, via StreamApplication and TaskApplication, respectively.
 
-High-level API allows you to describe the processing logic in a connected DAG of transformation operators, like the example below:
+High Level Streams API allows you to describe the processing logic in a connected DAG of transformation operators, like the example below:
 
 {% highlight java %}
 
 public class BadPageViewFilter implements StreamApplication {
   @Override
   public void describe(StreamApplicationDescriptor appDesc) {
-    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor();
-    InputDescriptor<PageView> pageViewInput = kafka.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
-    OutputDescriptor<DecoratedPageView> pageViewOutput = kafka.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
-    TableDescriptor<String, Integer> viewCountTable = new RocksDBTableDescriptor(
+    KafkaSystemDescriptor ksd = new KafkaSystemDescriptor();
+    InputDescriptor<PageView> pageViewInput = ksd.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
+    OutputDescriptor<DecoratedPageView> pageViewOutput = ksd.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
+    RocksDbTableDescriptor<String, Integer> viewCountTable = new RocksDbTableDescriptor(
      "pageViewCountTable", KVSerde.of(new StringSerde(), new IntegerSerde()));
 
     // Now, implement your main processing logic
@@ -120,7 +127,7 @@ High-level API allows you to describe the processing logic in a connected DAG of
 
 {% endhighlight %}
 
-Task API allows you to describe the processing logic in a customized StreamTaskFactory or AsyncStreamTaskFactory, like the example below:
+Low Level Task API allows you to describe the processing logic in a customized StreamTaskFactory or AsyncStreamTaskFactory, like the example below:
 
 {% highlight java %}
 
@@ -130,7 +137,7 @@ Task API allows you to describe the processing logic in a customized StreamTaskFactory
     KafkaSystemDescriptor kafka = new KafkaSystemDescriptor();
     InputDescriptor<PageView> pageViewInput = kafka.getInputDescriptor("page-views", new JsonSerdeV2<>(PageView.class));
     OutputDescriptor<DecoratedPageView> pageViewOutput = kafka.getOutputDescriptor("decorated-page-views", new JsonSerdeV2<>(DecoratedPageView.class));
-    TableDescriptor<String, Integer> viewCountTable = new RocksDBTableDescriptor(
+    RocksDbTableDescriptor<String, Integer> viewCountTable = new RocksDbTableDescriptor(
      "pageViewCountTable", KVSerde.of(new StringSerde(), new IntegerSerde()));
 
     // Now, implement your main processing logic
@@ -142,11 +149,10 @@ Task API allows you to describe the processing logic in a customized StreamTaskFactory
 
 {% endhighlight %}
 
-Details for [high-level API](high-level-api.md) and [Task API](low-level-api.md) are explained later.
+#### Configuration for a Samza Application
 
-## Configuration for a Samza Application
+To deploy a Samza application, you need to specify the implementation class for your application and the ApplicationRunner to launch it. The following is an incomplete example of the minimum required configuration to set up the Samza application and the runner. For additional configuration, see the Configuration Reference.
 
-To deploy a Samza application, you will need to specify the implementation class for your application and the ApplicationRunner to launch your application. The following is an incomplete example of the minimum required configuration to set up the Samza application and the runner:
 {% highlight jproperties %}
 
 # This is the class implementing StreamApplication
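The configuration hunk above is cut off in this rendering of the diff. As context, here is a minimal sketch of the kind of properties the new text refers to; `app.class`, `app.runner.class`, and `job.name` are standard Samza configuration keys, but the class and job names below are illustrative assumptions, not taken from the commit:

```jproperties
# This is the class implementing StreamApplication (hypothetical name)
app.class=com.example.MyStreamApplication

# The ApplicationRunner used to launch the application
app.runner.class=org.apache.samza.runtime.RemoteApplicationRunner

# A unique name for the job (assumed to be part of the minimum configuration)
job.name=my-stream-application
```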

docs/learn/documentation/versioned/api/samza-sql.md

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ Note: Samza sql console right now doesn’t support queries that need state, for
 
 
 # Running Samza SQL on YARN
-The [hello-samza](https://github.com/apache/samza-hello-samza) project is an example project designed to help you run your first Samza application. It has examples of applications using the low level task API, high level API as well as Samza SQL.
+The [hello-samza](https://github.com/apache/samza-hello-samza) project is an example project designed to help you run your first Samza application. It has examples of applications using the Low Level Task API, High Level Streams API as well as Samza SQL.
 
 This tutorial demonstrates a simple Samza application that uses SQL to perform stream processing.
 
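For readers unfamiliar with Samza SQL, a sketch of the general `INSERT INTO ... SELECT` shape such queries take; the system name (`kafka`), stream names, and columns here are hypothetical and not from the hello-samza examples:

```sql
-- Filter and project one stream into another.
-- System ("kafka"), stream, and column names are hypothetical.
INSERT INTO kafka.decoratedPageViews
SELECT pageId, userId
FROM kafka.pageViews
WHERE pageId <> 'BadPage'
```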

docs/learn/documentation/versioned/api/table-api.md

Lines changed: 5 additions & 6 deletions
@@ -181,7 +181,7 @@ join with a table and finally write the output to another table.
 
 # Using Table with Samza Low Level API
 
-The code snippet below illustrates the usage of table in Samza low level API.
+The code snippet below illustrates the usage of table in Samza Low Level Task API.
 
 {% highlight java %}
 class SamzaTaskApplication implements TaskApplication {
@@ -273,8 +273,7 @@ The table below summarizes table metrics:
 
 [`RemoteTable`](https://github.com/apache/samza/blob/master/samza-core/src/main/java/org/apache/samza/table/remote/RemoteTableDescriptor.java)
 provides a unified abstraction for Samza applications to access any remote data
-store through stream-table join in high-level API or direct access in low-level
-API. Remote Table is a store-agnostic abstraction that can be customized to
+store through stream-table join in High Level Streams API or direct access in Low Level Task API. Remote Table is a store-agnostic abstraction that can be customized to
 access new types of stores by writing pluggable I/O "Read/Write" functions,
 implementations of
 [`TableReadFunction`](https://github.com/apache/samza/blob/master/samza-core/src/main/java/org/apache/samza/table/remote/TableReadFunction.java) and
@@ -283,7 +282,7 @@ interfaces. Remote Table also provides common functionality, eg. rate limiting
 (built-in) and caching (hybrid).
 
 The async APIs in Remote Table are recommended over the sync versions for higher
-throughput. They can be used with Samza with low-level API to achieve the maximum
+throughput. They can be used with Samza's Low Level Task API to achieve the maximum
 throughput.
 
 Remote Tables are represented by class
@@ -420,7 +419,7 @@ created during instantiation of Samza container.
 The life of a table goes through a few phases
 
 1. **Declaration** - at first one declares the table by creating a `TableDescriptor`. In both
-   Samza high level and low level API, the `TableDescriptor` is registered with stream
+   Samza High Level Streams API and Low Level Task API, the `TableDescriptor` is registered with stream
    graph, internally converted to `TableSpec` and in return a reference to a `Table`
    object is obtained that can participate in the building of the DAG.
 2. **Instantiation** - during planning stage, configuration is
@@ -436,7 +435,7 @@ The life of a table goes through a few phases
   * In Samza high level API, all table instances can be retrieved from `TaskContext` using
     table-id during initialization of a
    [`InitableFunction`](https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/operators/functions/InitableFunction.java).
-  * In Samza low level API, all table instances can be retrieved from `TaskContext` using
+  * In Samza Low Level Task API, all table instances can be retrieved from `TaskContext` using
    table-id during initialization of a
    [`InitableTask`](https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/task/InitableTask.java).
 4. **Cleanup** -

docs/learn/documentation/versioned/connectors/eventhubs.md

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ In this section, we will walk through a simple pipeline that reads from one Even
 MessageStream<KV<String, String>> eventhubInput = appDescriptor.getInputStream(inputDescriptor);
 OutputStream<KV<String, String>> eventhubOutput = appDescriptor.getOutputStream(outputDescriptor);
 
-// Define the execution flow with the high-level API
+// Define the execution flow with the High Level Streams API
 eventhubInput
     .map((message) -> {
       System.out.println("Received Key: " + message.getKey());
