Commit a06e8ec

Merge remote-tracking branch 'upstream/master'
2 parents 2c679c3 + 989f8f6 commit a06e8ec

File tree

248 files changed

+2707
-1653
lines changed

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
---
2+
layout: blog
3+
title: Announcing the release of Apache Samza 0.13.1
4+
icon: git-pull-request
5+
authors:
6+
- name: Navina
7+
website:
8+
image:
9+
excerpt_separator: <!--more-->
10+
---
11+
<!--
12+
Licensed to the Apache Software Foundation (ASF) under one or more
13+
contributor license agreements. See the NOTICE file distributed with
14+
this work for additional information regarding copyright ownership.
15+
The ASF licenses this file to You under the Apache License, Version 2.0
16+
(the "License"); you may not use this file except in compliance with
17+
the License. You may obtain a copy of the License at
18+
19+
http://www.apache.org/licenses/LICENSE-2.0
20+
21+
Unless required by applicable law or agreed to in writing, software
22+
distributed under the License is distributed on an "AS IS" BASIS,
23+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
24+
See the License for the specific language governing permissions and
25+
limitations under the License.
26+
-->
27+
28+
29+
Testing the excerpt
30+
31+
<!--more-->
32+
33+
34+
Announcing the release of Apache Samza 0.13.1
35+
36+
We are very excited to announce the release of **Apache Samza 0.13.1**.
37+
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:
38+
39+
- First-class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 million events/sec on a single machine with SSD.
40+
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
41+
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
42+
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
43+
- A high-level API for expressing complex stream processing pipelines in a few lines of code.
44+
- A flexible deployment model for running applications in any hosting environment and with cluster managers other than YARN.
45+
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
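The local-state and incremental-checkpointing bullets above can be illustrated with a minimal sketch. It assumes, as a simplification, that every write to the local store is also appended to a changelog, so only deltas need to leave the machine and state can be rebuilt by replaying that log. `ChangelogStore` and its methods are hypothetical names for illustration and are not Samza's API; a plain dict and list stand in for RocksDB and a durable changelog stream.

```python
class ChangelogStore:
    """Toy key-value store that records every write in an append-only changelog."""

    def __init__(self, changelog=None):
        self.state = {}  # stand-in for the local RocksDB store (assumption)
        self.changelog = changelog if changelog is not None else []

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))  # only the delta is recorded

    def delete(self, key):
        self.state.pop(key, None)
        self.changelog.append((key, None))  # None marks a deletion

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog in order."""
        store = cls(changelog=list(changelog))
        for key, value in changelog:
            if value is None:
                store.state.pop(key, None)
            else:
                store.state[key] = value
        return store


store = ChangelogStore()
store.put("user-1", 3)
store.put("user-2", 5)
store.put("user-1", 4)
store.delete("user-2")

# Simulate recovery on another machine from the changelog alone.
recovered = ChangelogStore.restore(store.changelog)
```

The point of the sketch is that each write costs one small log append rather than a full snapshot of the store, which is why this style of checkpointing can keep up with very large state.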
46+
47+
48+
### Enhancements, Upgrades and Bug Fixes
49+
50+
This is a stability release that makes Samza as an embedded library production-ready. Samza as a library is part of Samza’s flexible deployment model. This release fixes a number of outstanding bugs and includes the following enhancements to existing features:
51+
52+
- **Standalone**
53+
- [SAMZA-1165](https://issues.apache.org/jira/browse/SAMZA-1165) Cleanup data created by ZkStandalone in ZK
54+
- [SAMZA-1324](https://issues.apache.org/jira/browse/SAMZA-1324) Add a metrics reporter lifecycle for JobCoordinator component of StreamProcessor
55+
- [SAMZA-1336](https://issues.apache.org/jira/browse/SAMZA-1336) Standalone session expiration propagation
56+
- [SAMZA-1337](https://issues.apache.org/jira/browse/SAMZA-1337) LocalApplicationRunner supports StreamTask
57+
- [SAMZA-1339](https://issues.apache.org/jira/browse/SAMZA-1339) Add standalone integration tests
58+
- **General**
59+
- [SAMZA-1282](https://issues.apache.org/jira/browse/SAMZA-1282) Fix issue where spinning up more containers than the number of tasks kills the leader process
60+
- [SAMZA-1340](https://issues.apache.org/jira/browse/SAMZA-1340) StreamProcessor does not propagate container failures from StreamTask
61+
- [SAMZA-1346](https://issues.apache.org/jira/browse/SAMZA-1346) GroupByContainerCount.balance() should guard against null LocalityManager
62+
- [SAMZA-1347](https://issues.apache.org/jira/browse/SAMZA-1347) GroupByContainerIds NPE if containerIds list is null
63+
- [SAMZA-1358](https://issues.apache.org/jira/browse/SAMZA-1358) task.class empty string should be ignored when app.class is configured
64+
- [SAMZA-1361](https://issues.apache.org/jira/browse/SAMZA-1361) OperatorImplGraph used wrong keys to store/retrieve OperatorImpl in the map
65+
- [SAMZA-1366](https://issues.apache.org/jira/browse/SAMZA-1366) ScriptRunner should allow callers to control the child process environment
66+
- [SAMZA-1384](https://issues.apache.org/jira/browse/SAMZA-1384) Race condition with async commit affects checkpoint correctness
67+
- [SAMZA-1385](https://issues.apache.org/jira/browse/SAMZA-1385) Fix coordination issues during stream creation in LocalApplicationRunner
68+
69+
Overall, [29 JIRAs](https://issues.apache.org/jira/issues/?jql=project%20%3D%2012314526%20AND%20fixVersion%20%3D%2012340845%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC) were resolved in this release.
70+
A source download of the 0.13.1 release is available [here](http://www.apache.org/dyn/closer.cgi/samza/0.13.1). The release JARs are also available in Apache’s Maven repository. See Samza’s [download](http://samza.apache.org/startup/download/) page for details and Samza’s [feature preview](https://samza.apache.org/startup/preview/) for new features. Users running the 0.13.1 release with Scala 2.12 need a JDK version newer than 1.8.0_111.
71+
72+
### Community Developments
73+
74+
We’ve made great community progress since the last release (0.13.0). We presented the Samza high-level API at the Cloud+Data NEXT Conference 2017 in Silicon Valley, USA, gave a talk on the key features (Secret Kung Fu) of Samza at ArchSummit 2017 in Shenzhen, China, and presented a detailed study of stateful stream processing at VLDB 2017. Here are the details of these talks:
75+
76+
- July 15, 2017 - [Unified Processing with the Samza High-level API (Cloud+Data NEXT Conference, Silicon Valley)](http://www.cdnextcon.com/recap.html) [slides](https://www.slideshare.net/YiPan7/nextcon-samza-preso-july-final)
77+
- July 7, 2017 - [Secret Kung Fu of Massive Scale Stream Processing with Apache Samza - Xinyu Liu (ArchSummit, Shenzhen, 2017)](http://sz2017.archsummit.com/presentation/900)
78+
- Aug 28, 2017 - [Samza: Stateful Scalable Stream Processing at LinkedIn - Kartik Paramasivam (ACM VLDB, Munich, 2017)](http://www.vldb.org/pvldb/vol10/p1634-noghabi.pdf)
79+
80+
In industry, Samza gained new adopters, including Redfin and VMware.
81+
Going forward, we are continuing to improve the new high-level API and flexible deployment features. Here is the list of the tasks for upcoming features and improvements.
82+
### Contribute
83+
84+
It’s a great time to get involved. You can start by reviewing the [tutorials](http://samza.apache.org/startup/preview/#try-it-out), signing up for the [mailing list](http://samza.apache.org/community/mailing-lists.html), and grabbing some [newbie JIRAs](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SAMZA%20AND%20labels%20%3D%20newbie%20AND%20status%20%3D%20Open).
85+
I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
---
2+
layout: blog
3+
title: Recap of Stream Processing with Apache Kafka & Apache Samza (July '18)
4+
icon: analytics
5+
authors:
6+
- name:
7+
website:
8+
image:
9+
excerpt_separator: <!--more-->
10+
---
11+
<!--
12+
Licensed to the Apache Software Foundation (ASF) under one or more
13+
contributor license agreements. See the NOTICE file distributed with
14+
this work for additional information regarding copyright ownership.
15+
The ASF licenses this file to You under the Apache License, Version 2.0
16+
(the "License"); you may not use this file except in compliance with
17+
the License. You may obtain a copy of the License at
18+
19+
http://www.apache.org/licenses/LICENSE-2.0
20+
21+
Unless required by applicable law or agreed to in writing, software
22+
distributed under the License is distributed on an "AS IS" BASIS,
23+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
24+
See the License for the specific language governing permissions and
25+
limitations under the License.
26+
-->
27+
28+
A look back at the July edition of the quarterly stream processing meetup
29+
30+
<!--more-->
31+
32+
33+
On July 19th another successful stream processing meetup was hosted by LinkedIn!
34+
This event focused on Apache Kafka, Apache Samza, and related streaming technologies.
35+
This meetup was a full house and had technical deep dives by engineers from LinkedIn and Uber on the latest
36+
and greatest in streaming tech.
37+
38+
<br>
39+
40+
41+
### [Beam me up Samza: How we built a Samza Runner for Apache Beam](https://youtu.be/o5GaifLoZho)
42+
43+
LinkedIn's Xinyu Liu presented [Beam me up Samza](https://bit.ly/2Nyc4pl), describing how LinkedIn is harnessing cutting-edge features of Beam.
44+
Apache Beam provides an easy-to-use and powerful model for state-of-the-art stream and batch processing, portability
45+
across a variety of languages, and the ability to converge offline and nearline data processing. In this talk,
46+
he discussed the Beam API and its implementation in Samza, and the benefits of the Beam Runner to the Samza and Beam communities.
47+
He also explored various use cases of Beam at LinkedIn and future work on it.
48+
49+
50+
### [uReplicator: Uber Engineering’s Scalable Robust Kafka Replicator](https://bit.ly/2NxvFpz)
51+
52+
53+
Uber operates more than 20 Kafka clusters to collect system and application logs and event data from rider and driver apps.
54+
Uber's Hongliang Xu shared his insights on Uber's approach to replicating data between Kafka clusters across multiple data centers.
55+
He covered the history behind [uReplicator](https://bit.ly/2NxvFpz) and outlined its high-level architecture. He also discussed the
56+
scalability challenges and operational overhead that grew as Uber expanded, and how the team built Federated uReplicator
57+
to address those challenges at scale.
58+
59+
60+
### [Concourse - Near real time notifications platform at LinkedIn](https://youtu.be/Fszo6jThq0I)
61+
62+
63+
[Concourse](https://bit.ly/2zXNwUJ) is LinkedIn’s first near-real-time targeting and scoring platform for notifications. In this talk LinkedIn's Ajith Muralidharan & Vivek Nelamangala provided an in-depth overview of the design and various scaling optimizations.
64+
Concourse can score millions of notifications per second, while supporting the use of feature-rich machine learning
65+
models based on terabytes of feature data.
66+
67+
68+
<!--more-->
69+
70+
Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
---
2+
layout: blog
3+
title: Recap of Chasing Stream Processing Utopia
4+
icon: analytics
5+
authors:
6+
- name:
7+
website:
8+
image:
9+
excerpt_separator: <!--more-->
10+
---
11+
<!--
12+
Licensed to the Apache Software Foundation (ASF) under one or more
13+
contributor license agreements. See the NOTICE file distributed with
14+
this work for additional information regarding copyright ownership.
15+
The ASF licenses this file to You under the Apache License, Version 2.0
16+
(the "License"); you may not use this file except in compliance with
17+
the License. You may obtain a copy of the License at
18+
19+
http://www.apache.org/licenses/LICENSE-2.0
20+
21+
Unless required by applicable law or agreed to in writing, software
22+
distributed under the License is distributed on an "AS IS" BASIS,
23+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
24+
See the License for the specific language governing permissions and
25+
limitations under the License.
26+
-->
27+
28+
Strange Loop, St. Louis, MO
29+
30+
<!--more-->
31+
32+
Over the last 15 years, batch processing frameworks have thrived and ruled over big data processing. But now, in the age of social computing, it is no longer acceptable to wait for data to land in a data lake before it gets processed.
33+
We want our applications to react to new data as soon as it gets generated upstream. For a web site, members expect their feed to be updated as soon as some relevant activity, news, jobs etc. happens.
34+
We are talking seconds (or minutes). We also want to detect degraded site experience, fraud, security breaches, spam etc. instantaneously. Even business metrics (written in traditionally batch-oriented languages like Hive/Pig) are now expected to run in real time. The current status quo of real-time data processing (stream processing) is still very far from Utopia.
35+
36+
Kartik Paramasivam, Director of Engineering, presented Chasing Stream Processing Utopia at Strange Loop '18. The talk was inspired
37+
by the extensive growth in streaming data at LinkedIn, which reached as many as 5 trillion messages per day in 2018.
38+
LinkedIn supports close to 3000 applications in production using Kafka and Samza. He shed further light on Samza's claim
39+
as a state-of-the-art stream processing framework, supporting use cases at LinkedIn, Slack, Uber, Intuit, etc.
40+
41+
His talk described LinkedIn's path toward a streaming utopia: running apps at any complexity, any scale,
42+
any source, any language, and any environment. He illustrated each of these with actual use cases from LinkedIn running Samza and Kafka in production. He touched on Samza's battle-tested stateful and stateless processing, and also on
43+
newer features such as event-time-based processing using the Beam Runner for Samza, and Samza SQL. He also briefly explained running
44+
and managing Kafka at scale, covering an array of topics from Kafka cluster management woes to dynamic load balancing
45+
using Kafka Cruise Control.
46+
47+
He also described the tooling ecosystem that supports these apps and the streaming challenges faced at LinkedIn. He
48+
concluded with the upcoming releases and features of Samza (Apache Samza 1.0) and Kafka (Apache Kafka 2.0). The full talk is available [here](https://youtu.be/2y8QImf-RpI).
49+
50+
<br>

docs/_case-studies/digitalsmiths.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ title: Totally awesome use-case of samza by DigitalSmiths # title of case study
55
study_domain: digitalsmiths.com # just the domain, not the protocol
66
priority: 7
77
menu_title: DigitalSmiths # what shows up in the menu
8+
exclude_from_loop: true
89
excerpt_separator: <!--more-->
910
---
1011
<!--

docs/_case-studies/ebay.md

Lines changed: 42 additions & 29 deletions
@@ -1,7 +1,7 @@
11
---
22
layout: case-study
33
hide_title: true # so we have control in case-study layout, but can still use page
4-
title: Low Latency Web Scale Fraud Prevention
4+
title: Low Latency Web-Scale Fraud Prevention
55
study_domain: ebay.com
66
menu_title: eBay
77
excerpt_separator: <!--more-->
@@ -23,34 +23,47 @@ excerpt_separator: <!--more-->
2323
limitations under the License.
2424
-->
2525

26-
Low Latency Web Scale Fraud Prevention
26+
How Samza powers low-latency, web-scale fraud prevention at eBay
2727

2828
<!--more-->
2929

30-
eBay Enterprise is the world’s largest omni-channel commerce provider with
31-
hundreds millions of units shipped annually, as commerce gets more
32-
convenient and complex, so does fraud. The engineering team at eBay
33-
Enterprise selected Samza as the platform to build the horizontally
34-
scalable, realtime (sub-seconds) and fault tolerant abnormality detection
35-
system. For example, the system computes and evaluates key metrics to
36-
detect abnormal behaviors
37-
38-
- Transaction velocity (#tnx/day) and change (#tnx/day vs #tnx/day over n days)
39-
- Amount velocity ($tnx/day) and change ($tnx/day vs $tnx/day over n days)
40-
41-
A wide range of realtime and historical adjunct data from various sources
42-
including people, places, interests, social and connections are ingested
43-
through Kafka, and stored in local RocksDB state store with changelog
44-
enabled for recovery. Incoming transaction data is aggregated using
45-
windowing and then joined with adjunct data stores in multiple stages.
46-
The system generates potential fraud cases for review real time. Finally,
47-
the engineering team at eBay Enterprise has built an OpenTSDB and Grafana
48-
based monitoring system using metrics collected through JMX.
49-
50-
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*,
51-
*JMX-metrics*
52-
53-
More information
54-
55-
- [https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends](https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends)
56-
- [http://ebayenterprise.com/](http://ebayenterprise.com/)
30+
eBay Enterprise is the world’s largest omni-channel commerce provider. The engineering team at eBay chose Apache Samza to build _PreCog_, their
31+
horizontally scalable anomaly detection system.
32+
33+
_PreCog_ extensively leverages Samza's high-performance, fault-tolerant local storage. Its architecture had the following requirements, for which Samza perfectly fit the bill: <br/>
34+
35+
_Web-scale:_ Scale to a large number of users and a large volume of data per user. Additionally, it should be possible to add more commodity hardware and scale horizontally. <br/>
36+
_Low-latency:_ Process customer interactions in real time by reacting in milliseconds instead of hours. <br/>
37+
_Fault-tolerance:_ Gracefully tolerate and handle hardware failures. <br/>
38+
39+
![diagram-large](/img/{{site.version}}/learn/documentation/case-study/ebay.png)
40+
41+
The PreCog anomaly-detection system comprises multiple tiers, with each tier consisting of multiple Samza jobs, which process the output of the previous tier.
42+
43+
_Ingestion tier:_ In this tier, a variety of historical and realtime data from various
44+
sources including people, places etc., is ingested into Kafka.
45+
46+
_Fanout tier:_ This tier consists of Samza jobs which process the Kafka events, fan them out and re-partition them based on various
47+
facets like email-address, ip-address, credit-card number, shipping address etc.
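The fan-out step can be sketched as follows. This is a minimal illustration, not PreCog's implementation: the facet field names, the partition count, and the use of CRC32 as the partitioning hash are all assumptions made for the example. Each incoming event is re-emitted once per facet it carries, keyed by that facet's value, so that all events sharing, say, an IP address land on the same partition.

```python
import zlib

# Assumed facet field names and partition count, for illustration only.
FACETS = ("email", "ip_address", "card_number", "shipping_address")
NUM_PARTITIONS = 8


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash so equal keys co-locate."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def fan_out(event: dict) -> list:
    """Emit one (partition, key, event) message per facet present in the event."""
    messages = []
    for facet in FACETS:
        value = event.get(facet)
        if value is None:
            continue  # the event does not carry this facet
        key = f"{facet}:{value}"
        messages.append((partition_for(key), key, event))
    return messages


event = {
    "txn_id": "t-1",
    "email": "a@example.com",
    "ip_address": "203.0.113.7",
    "amount": 42.0,
}
messages = fan_out(event)
```

Because the key includes the facet name, a downstream job that owns one partition sees every event for the facet values hashed to it and can keep their aggregates in local state.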
48+
49+
_Compute tier:_ The Samza jobs in this tier consume messages from the fan-out tier and compute various key metrics and derived features. Features used to evaluate fraud include:
50+
51+
1. Number of transactions per-customer per-day <br/>
52+
2. Change in the number of daily transactions over the past few days <br/>
53+
3. Amount value ($$) of each transaction per-day <br/>
54+
4. Change in the amount value of transactions over a sliding time-window <br/>
55+
5. Number of transactions per shipping-address
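Features 1 and 2 in the list above can be sketched in a few lines. This is an illustrative batch computation under simplifying assumptions, not the streaming implementation used by PreCog: transactions are `(customer, day, amount)` tuples, days are plain integers, and the function names and the three-day window are hypothetical.

```python
from collections import defaultdict


def daily_counts(transactions):
    """Count transactions per (customer, day)."""
    counts = defaultdict(int)
    for customer, day, _amount in transactions:
        counts[(customer, day)] += 1
    return dict(counts)


def change_vs_trailing_mean(counts, customer, day, window=3):
    """Today's count minus the mean count over the previous `window` days."""
    today = counts.get((customer, day), 0)
    previous = [counts.get((customer, day - i), 0) for i in range(1, window + 1)]
    return today - sum(previous) / window


transactions = [
    ("alice", 1, 20.0), ("alice", 2, 15.0), ("alice", 3, 30.0),
    ("alice", 4, 10.0), ("alice", 4, 12.0), ("alice", 4, 500.0),
    ("alice", 4, 700.0), ("bob", 4, 25.0),
]
counts = daily_counts(transactions)
spike = change_vs_trailing_mean(counts, "alice", 4)
```

In a Samza job the same counts would typically be maintained incrementally in a local state store keyed by customer and day, rather than recomputed from a list.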
56+
57+
_Assembly tier:_ This tier comprises Samza jobs which join the output of the compute tier with additional data sources
58+
and make a final determination on transaction-fraud.
59+
60+
For monitoring the _PreCog_ pipeline, eBay leverages Samza's [JMXMetricsReporter](/learn/documentation/{{site.version}}/operations/monitoring.html) and ingests the reported metrics into OpenTSDB/HBase. The metrics are then
61+
visualized using [Grafana](https://grafana.com/).
62+
63+
64+
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*, *JMX-metrics*
65+
66+
More information:
67+
68+
- [Slides: Low latency Fraud prevention with Apache Samza](https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends)
69+
- [http://ebayenterprise.com/](http://ebayenterprise.com/)

docs/_case-studies/fortscale.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ title: Totally awesome use-case of samza by FortScale # title of case study page
55
study_domain: fortscale.com # just the domain, not the protocol
66
priority: 6
77
menu_title: FortScale # what shows up in the menu
8+
exclude_from_loop: true
89
excerpt_separator: <!--more-->
910
---
1011
<!--

docs/_case-studies/index.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ exclude_from_loop: true
2020
limitations under the License.
2121
-->
2222

23-
Explore the many use-cases of the Samza Framework via our case-studies.
23+
Explore the many use cases of Samza via our case studies. For a complete list of companies using Samza, visit our <a href="/powered-by/">Powered By</a> page.
2424

2525
<ul class="case-studies">
2626

docs/_case-studies/intuit.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ title: Totally awesome use-case of samza by Intuit # title of case study page
55
study_domain: intuit.com # just the domain, not the protocol
66
priority: 5
77
menu_title: Intuit # what shows up in the menu
8+
exclude_from_loop: true
89
excerpt_separator: <!--more-->
910
---
1011
<!--
