support for Kinesis #223

Closed
wants to merge 750 commits into from

Conversation

@pdeyhim pdeyhim commented Mar 25, 2014

+KinesisInputDStream
+KinesisWordCount

@AmplabJenkins
Can one of the admins verify this patch?

@@ -124,6 +124,9 @@ object SparkBuild extends Build {

lazy val externalTwitter = Project("external-twitter", file("external/twitter"), settings = twitterSettings)
.dependsOn(streaming % "compile->compile;test->test")

lazy val externalAmazonKinesis= Project("external-amazonkinesis", file("external/AmazonKinesis"), settings = kinesisSettings)

It would be better to keep the directory name short and lowercase (e.g., external/kinesis), which would be consistent with others like "flume" and "kafka".
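
For illustration, a renamed project definition following the existing externalTwitter pattern above might look like this (a sketch; `kinesisSettings` comes from this PR, the `externalKinesis` name is hypothetical):

```scala
// Short, lowercase directory name, consistent with external/flume and external/kafka:
lazy val externalKinesis = Project("external-kinesis", file("external/kinesis"), settings = kinesisSettings)
  .dependsOn(streaming % "compile->compile;test->test")
```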

@tdas
tdas commented Mar 26, 2014

I took a very high-level pass at this, and here are some high-level things that must be fixed.

  1. Follow the Spark formatting style; refer to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide . You can run Spark's built-in style checker with 'sbt/sbt scalastyle'.
  2. Real unit tests; see in-code comments.
  3. More documentation, both Javadocs and in-code documentation.
  4. Please add a JIRA ticket for this and update the PR subject accordingly (see other PRs).
  5. The directory name should be short and lowercase, consistent with the rest of the external projects.

import org.apache.spark.streaming.StreamingContext._


object KinesisWordCount {

Please provide full instructions on how to run this example. See other examples for more details.
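
For comparison, other streaming examples carry a usage header in a doc comment; a hypothetical sketch for this example (argument names are illustrative, not taken from this PR):

```scala
/**
 * Consumes records from an Amazon Kinesis stream and counts words.
 *
 * Usage: KinesisWordCount <master> <streamName> <endpointUrl>
 *   <master>      Spark master URL; use at least local[2] when running locally
 *   <streamName>  name of the Kinesis stream to consume
 *   <endpointUrl> Kinesis regional endpoint, e.g. https://kinesis.us-east-1.amazonaws.com
 */
```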

Also, please add a Java Kinesis example. Many users ask for Java examples, and the lack of one can be a significant barrier to trying out the Kinesis integration.

@pdeyhim - per our offline conversation this weekend, please add a note that this demo must be run with at least 2 threads, i.e. master=local[2] at a minimum.

Otherwise the KinesisNetworkReceiver thread does not start up, which breaks the demo.
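
For reference, a minimal sketch of what that looks like in the example's entry point (assuming the StreamingContext constructor of that era; the batch duration is chosen arbitrarily):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A receiver permanently occupies one thread, so local mode needs at least
// two: one to run the receiver and one to process the received batches.
val ssc = new StreamingContext("local[2]", "KinesisWordCount", Seconds(2))
```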

@mateiz
mateiz commented Mar 26, 2014

Hey Parviz, this is great to see, but one thing I have to check is whether the Amazon Software License that amazon-kinesis-client is distributed under is okay with the Apache Software Foundation. Give me some time to do that. Beyond that, shouldn't you also add those dependencies to the Maven build? Have you tried building this with Maven?

@AmplabJenkins
Can one of the admins verify this patch?

@mateiz
mateiz commented Apr 3, 2014

Hey Parviz, FYI, because of the Amazon Software License's restrictions on kinesis-client, it may be difficult to include this in the default build (see https://issues.apache.org/jira/browse/LEGAL-198). In that case we should add a Maven profile for enabling it, and give the component a name that makes it clear it depends on something with a different license (e.g. spark-amazonkinesis-asl). We can also describe that in the docs.

We already have one component like this for reporting metrics to Ganglia (spark-ganglia-lgpl), which depends on an LGPL-licensed library. Users build it through a special Maven and SBT flag.

The issue, BTW, is not that the Amazon Software License prevents us from licensing Spark under the Apache License, but that the Apache foundation wants to make sure that code used by its projects offers the same freedoms as the Apache License. Code that doesn't can go into optional packages.
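
A rough sketch of how SparkBuild could gate such a component behind an opt-in flag, modeled on the Ganglia handling (the `SPARK_KINESIS_ASL` and `externalKinesisAsl` names here are hypothetical):

```scala
// Include the ASL-licensed Kinesis project only when the user opts in via an
// environment variable, mirroring how the LGPL-licensed Ganglia module works.
lazy val kinesisEnabled = sys.env.contains("SPARK_KINESIS_ASL")

lazy val optionalKinesisProjects: Seq[ProjectReference] =
  if (kinesisEnabled) Seq(externalKinesisAsl) else Seq.empty
```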

@mateiz
mateiz commented Apr 3, 2014

Jenkins, test this please

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13732/

@kscaldef
kscaldef commented May 8, 2014

What's the current status of this?

jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Mark partitioner, name, and generator field in RDD as @transient.

As part of the effort to reduce serialized task size.
(cherry picked from commit d6e5473)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@cfregly
cfregly commented May 31, 2014

Update: I discussed this with Parviz recently, and we agreed that I would take this over. A new PR is coming shortly. Here's the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-1981

aarondav and others added 7 commits June 25, 2014 08:47
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for

```scala
rdd.groupBy()(Sum('val))
```

but it seems more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.

Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#874 from aarondav/schemardd and squashes the following commits:

e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
Self-explanatory.

Author: Patrick Wendell <pwendell@gmail.com>

Closes apache#878 from pwendell/java-constructor and squashes the following commits:

2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
… all child expressions.

`CountFunction` should count up only if the child's evaluated value is not null.

Because it traverses and evaluates all the child expressions, it counts up if any of the children is not null, even when the child it should count is null.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#861 from ueshin/issues/SPARK-1914 and squashes the following commits:

3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
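
A pure-Scala model of the corrected CountFunction semantics described above (illustrative only, not the actual Catalyst code):

```scala
// COUNT(child) increments only for rows where the child's evaluated value is
// non-null; the buggy version counted a row if any child value was non-null.
def countNonNull(evaluatedChildValues: Seq[Any]): Long =
  evaluatedChildValues.count(_ != null).toLong

countNonNull(Seq("a", null, "b"))  // == 2; the null value is not counted
```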
Author: witgo <witgo@qq.com>

Closes apache#884 from witgo/scalastyle and squashes the following commits:

4b08ae4 [witgo] Fix scalastyle warnings in yarn alpha
JIRA: https://issues.apache.org/jira/browse/SPARK-1925

Author: zsxwing <zsxwing@gmail.com>

Closes apache#879 from zsxwing/SPARK-1925 and squashes the following commits:

5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
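
For context, a tiny illustration of the difference between the two operators (not the patched code itself):

```scala
// '&&' short-circuits, skipping the right side when the left is false;
// '&' on Booleans evaluates both sides, which would dereference null here.
val s: String = null
val safe = s != null && s.nonEmpty    // false; s.nonEmpty is never evaluated
// val crash = s != null & s.nonEmpty // would throw NullPointerException
```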
905173d introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties.

This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.

Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#885 from ankurdave/SPARK-1931 and squashes the following commits:

3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes apache#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
pwendell and others added 13 commits June 25, 2014 08:47
The assertJsonStringEquals method was missing an "assert" so
did not actually check that the strings were equal. This commit
adds the missing assert and fixes subsequently revealed problems
with the JsonProtocolSuite.

@andrewor14 I changed some of the test functionality to match what it
looks like you intended based on the expected strings -- let me know if
anything here looks wrong.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#1198 from kayousterhout/json_test_fix and squashes the following commits:

77f858f [Kay Ousterhout] Fix broken Json tests.
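
The bug class is easy to picture; a schematic sketch (not the actual suite code):

```scala
// Before the fix: the comparison's Boolean result is silently discarded,
// so this "assertion" can never fail a test.
def assertJsonStringEqualsBroken(expected: String, actual: String): Unit = {
  expected == actual  // result ignored
}

// After the fix: actually assert the comparison.
def assertJsonStringEquals(expected: String, actual: String): Unit = {
  assert(expected == actual)
}
```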
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1201 from marmbrus/fixCacheTests and squashes the following commits:

9d87ed1 [Michael Armbrust] Use analyzer (which runs to fixed point) instead of manually removing analysis operators.
This is an alternative solution to apache#1124. Before launching the executor backend, we first fetch the driver's spark properties and use them to overwrite the executor's spark properties. This should be better than apache#1124.

@pwendell Are there spark properties that might be different on the driver and on the executors?

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1132 from mengxr/akka-bootstrap and squashes the following commits:

77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
This will be helpful in join operators.

Author: Cheng Hao <hao.cheng@intel.com>

Closes apache#1187 from chenghao-intel/joinedRow and squashes the following commits:

87c19e3 [Cheng Hao] Add base row set methods for JoinedRow
Author: Matthew Farrellee <matt@redhat.com>

Closes apache#1185 from mattf/master-1 and squashes the following commits:

42150fc [Matthew Farrellee] Autodetect JAVA_HOME on RPM-based systems
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1204 from marmbrus/nullPointerToString and squashes the following commits:

35b5fce [Michael Armbrust] Fix possible null pointer in accumulator toString
Author: witgo <witgo@qq.com>

Closes apache#1194 from witgo/SPARK-2248 and squashes the following commits:

6ac950b [witgo] spark.default.parallelism does not apply in local mode
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263)

Map objects were not converted to Hive types before inserting into Hive tables.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#1205 from liancheng/spark-2263 and squashes the following commits:

c7a4373 [Cheng Lian] Addressed @concretevitamin's comment
784940b [Cheng Lian] SPARK-2263: support inserting MAP<K, V> to Hive tables
…utput

The `BigDecimal` branch in `unwrap` matches to `scala.math.BigDecimal` rather than `java.math.BigDecimal`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#1199 from liancheng/javaBigDecimal and squashes the following commits:

e9bb481 [Cheng Lian] Should match java.math.BigDecimal when unwrapping Hive output
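
Schematically, the fix is to match the Java type when unwrapping (a simplified sketch, not the exact HiveInspectors code):

```scala
// Hive hands back java.math.BigDecimal, so the unwrap branch must match the
// Java type and convert it; matching scala.math.BigDecimal never fires.
def unwrapDecimal(value: Any): Any = value match {
  case bd: java.math.BigDecimal => BigDecimal(bd)  // to scala.math.BigDecimal
  case other => other
}
```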
…th source-compatibility

https://issues.apache.org/jira/browse/SPARK-2038

to differentiate it from the SparkConf object and at the same time keep source-level compatibility

Author: CodingCat <zhunansjtu@gmail.com>

Closes apache#1137 from CodingCat/SPARK-2038 and squashes the following commits:

11abeba [CodingCat] revise the comments
7ee5712 [CodingCat] to keep the source-compatibility
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
@venuktan
Hi Parviz,
Is there a package in the Maven repo called "spark-amazonkinesis-asl" now?

If not, how do I use this package?

@tdas
tdas commented Jul 10, 2014

@pdeyhim Can you close this? :)

@asfgit asfgit closed this in ee91eb8 Aug 27, 2014
ash211 referenced this pull request in palantir/spark Apr 17, 2017
lins05 pushed a commit to lins05/spark that referenced this pull request Apr 23, 2017
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 12, 2017
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* Made the HDFS keytab secret configurable.

* Updated dcoscli version
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
dongjoon-hyun pushed a commit that referenced this pull request Feb 19, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `commons-codec` from `1.16.0` to `1.16.1`.

### Why are the changes needed?
1. The new version brings some bug fixes, e.g.:
- Fix possible IndexOutOfBoundException in PhoneticEngine.encode method #223. Fixes [CODEC-315](https://issues.apache.org/jira/browse/CODEC-315).
- Fix possible IndexOutOfBoundsException in PercentCodec.insertAlwaysEncodeChars() method #222. Fixes [CODEC-314](https://issues.apache.org/jira/browse/CODEC-314).

2. The full release notes:
    https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45152 from panbingkun/SPARK-47083.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Apr 10, 2024