support for Kinesis #223
Conversation
Can one of the admins verify this patch?
```
@@ -124,6 +124,9 @@ object SparkBuild extends Build {

   lazy val externalTwitter = Project("external-twitter", file("external/twitter"), settings = twitterSettings)
     .dependsOn(streaming % "compile->compile;test->test")

+  lazy val externalAmazonKinesis= Project("external-amazonkinesis", file("external/AmazonKinesis"), settings = kinesisSettings)
```
It would be better to keep the directory name short and lowercase (e.g., external/kinesis), which would be consistent with others like "flume" and "kafka".
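For concreteness, here is a sketch of what the renamed module definition could look like, mirroring the existing Twitter/Flume/Kafka modules in SparkBuild. The `kinesisSettings` name is taken from the diff above; the module id and path follow the reviewer's suggestion, so treat the whole snippet as an assumption rather than the final patch:

```scala
// Sketch only: mirrors the existing external modules (e.g. flume, kafka).
// Short, lowercase directory name per review feedback.
lazy val externalKinesis = Project("external-kinesis", file("external/kinesis"), settings = kinesisSettings)
  .dependsOn(streaming % "compile->compile;test->test")
```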
I took a high-level pass at this, and here are some things that must be fixed.
```scala
import org.apache.spark.streaming.StreamingContext._

object KinesisWordCount {
```
Please provide full instructions on how to run this example. See other examples for more details.
Also, please add a Java Kinesis example. Many users ask for Java examples, and the lack of one can be a significant barrier to trying out the Kinesis integration.
@pdeyhim - per our offline convo this weekend, please add a note about running this demo with at least 2 threads (i.e., a minimum of master=local[2]).
Otherwise, it appears that the KinesisNetworkReceiver thread does not start up, breaking the demo.
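The failure mode can be illustrated with a toy thread-pool model in plain Scala. This is an illustration of the scheduling constraint, not Spark's actual scheduler: a long-lived "receiver" task pins one thread for its whole lifetime, so a single-threaded pool never gets to the "processing" task, which is why the demo needs at least two local threads.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Toy model: a "receiver" task occupies one pool thread indefinitely, the
// way a network receiver does. With a 1-thread pool the "processing" task
// is starved; with 2 threads it runs.
def processingRuns(poolSize: Int): Boolean = {
  val pool = Executors.newFixedThreadPool(poolSize)
  val done = new CountDownLatch(1)
  // Long-lived "receiver": holds its thread until the pool is shut down.
  pool.submit(new Runnable { def run(): Unit = Thread.sleep(5000) })
  // "Batch processing": signals when it actually gets to run.
  pool.submit(new Runnable { def run(): Unit = done.countDown() })
  val ran = done.await(500, TimeUnit.MILLISECONDS)
  pool.shutdownNow()
  ran
}
```

With `poolSize = 1` the processing task never runs within the timeout; with `poolSize = 2` it does, which is the analogue of `master=local[1]` vs `master=local[2]`.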
Hey Parviz, this is great to see, but one thing I have to check is whether the Amazon Software License that amazon-kinesis-client is distributed under is okay with the Apache Software Foundation. Give me some time to do that. Beyond that, shouldn't you also add those dependencies to the Maven build? Have you tried building this with Maven?
Can one of the admins verify this patch?
Hey Parviz, FYI, because of the Amazon Software License's restrictions on kinesis-client, it may be difficult to include this in the default build (see https://issues.apache.org/jira/browse/LEGAL-198). In that case we should add a Maven profile for enabling it, and give the component a name that makes it clear it depends on something with a different license. We already have one component like this for reporting metrics to Ganglia.

The issue, BTW, is not that the Amazon Software License prevents us from licensing Spark under the Apache License, but that the Apache foundation wants to make sure that code used by its projects offers the same freedoms as the Apache license. Stuff that doesn't do that can go into optional packages.
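To make the "optional package" idea concrete, here is a hedged sbt sketch of gating the Kinesis module behind an opt-in system property, the way an optional Maven profile would. All names here (the property, `externalKinesisAsl`, `maybeKinesisAsl`) are assumptions for illustration, not actual Spark build settings:

```scala
// Sketch only: the ASL-licensed module is excluded from the default build
// and pulled in only when the builder explicitly opts in,
// e.g. via -DSPARK_KINESIS_ASL=true on the sbt command line.
lazy val maybeKinesisAsl: Seq[ClasspathDep[ProjectReference]] =
  if (sys.props.contains("SPARK_KINESIS_ASL")) Seq(externalKinesisAsl) else Seq.empty

// Downstream projects (e.g. examples) then depend on it conditionally:
lazy val examples = Project("examples", file("examples"), settings = examplesSettings)
  .dependsOn(maybeKinesisAsl: _*)
```

The same effect in Maven would be a `<profile>` that adds the module to the reactor only when activated, keeping the default artifacts free of differently-licensed dependencies.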
Jenkins, test this please

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13732/

What's the current status of this?
Mark partitioner, name, and generator field in RDD as @transient. As part of the effort to reduce serialized task size. (cherry picked from commit d6e5473) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
update: i discussed this with parviz recently - and we agreed that i would take this over. new PR to come shortly. here's the jira ticket: https://issues.apache.org/jira/browse/SPARK-1981
`rdd.aggregate(Sum('val))` is just shorthand for `rdd.groupBy()(Sum('val))`, but seems to be more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows. Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches. Author: Aaron Davidson <aaron@databricks.com> Closes apache#874 from aarondav/schemardd and squashes the following commits: e9e68ee [Aaron Davidson] Add comment db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
Self explanatory. Author: Patrick Wendell <pwendell@gmail.com> Closes apache#878 from pwendell/java-constructor and squashes the following commits: 2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
… all child expressions. `CountFunction` should count up only if the child's evaluated value is not null. Because it traverses to evaluate all child expressions, even if the child is null, it counts up if one of the all children is not null. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes apache#861 from ueshin/issues/SPARK-1914 and squashes the following commits: 3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914 2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
Author: witgo <witgo@qq.com> Closes apache#884 from witgo/scalastyle and squashes the following commits: 4b08ae4 [witgo] Fix scalastyle warnings in yarn alpha
JIRA: https://issues.apache.org/jira/browse/SPARK-1925 Author: zsxwing <zsxwing@gmail.com> Closes apache#879 from zsxwing/SPARK-1925 and squashes the following commits: 5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
905173d introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties. This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`. Author: Ankur Dave <ankurdave@gmail.com> Closes apache#885 from ankurdave/SPARK-1931 and squashes the following commits: 3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins 9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever. Author: Zhen Peng <zhenpeng01@baidu.com> Closes apache#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits: 76f7eda [Zhen Peng] remove redundant memory allocations aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
The assertJsonStringEquals method was missing an "assert" so did not actually check that the strings were equal. This commit adds the missing assert and fixes subsequently revealed problems with the JsonProtocolSuite. @andrewor14 I changed some of the test functionality to match what it looks like you intended based on the expected strings -- let me know if anything here looks wrong. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes apache#1198 from kayousterhout/json_test_fix and squashes the following commits: 77f858f [Kay Ousterhout] Fix broken Json tests.
Author: Michael Armbrust <michael@databricks.com> Closes apache#1201 from marmbrus/fixCacheTests and squashes the following commits: 9d87ed1 [Michael Armbrust] Use analyzer (which runs to fixed point) instead of manually removing analysis operators.
This is an alternative solution to apache#1124 . Before launching the executor backend, we first fetch driver's spark properties and use it to overwrite executor's spark properties. This should be better than apache#1124. @pwendell Are there spark properties that might be different on the driver and on the executors? Author: Xiangrui Meng <meng@databricks.com> Closes apache#1132 from mengxr/akka-bootstrap and squashes the following commits: 77ff32d [Xiangrui Meng] organize imports 68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor 46d332d [Xiangrui Meng] fix a test 7947c18 [Xiangrui Meng] increase slack size for akka 4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
This will be helpful in join operators. Author: Cheng Hao <hao.cheng@intel.com> Closes apache#1187 from chenghao-intel/joinedRow and squashes the following commits: 87c19e3 [Cheng Hao] Add base row set methods for JoinedRow
Author: Matthew Farrellee <matt@redhat.com> Closes apache#1185 from mattf/master-1 and squashes the following commits: 42150fc [Matthew Farrellee] Autodetect JAVA_HOME on RPM-based systems
Author: Michael Armbrust <michael@databricks.com> Closes apache#1204 from marmbrus/nullPointerToString and squashes the following commits: 35b5fce [Michael Armbrust] Fix possible null pointer in accumulator toString
Author: witgo <witgo@qq.com> Closes apache#1194 from witgo/SPARK-2248 and squashes the following commits: 6ac950b [witgo] spark.default.parallelism does not apply in local mode
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263) Map objects were not converted to Hive types before inserting into Hive tables. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes apache#1205 from liancheng/spark-2263 and squashes the following commits: c7a4373 [Cheng Lian] Addressed @concretevitamin's comment 784940b [Cheng Lian] SPARK-2263: support inserting MAP<K, V> to Hive tables
…utput The `BigDecimal` branch in `unwrap` matches to `scala.math.BigDecimal` rather than `java.math.BigDecimal`. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes apache#1199 from liancheng/javaBigDecimal and squashes the following commits: e9bb481 [Cheng Lian] Should match java.math.BigDecimal when unwrapping Hive output
…th source-compatibility https://issues.apache.org/jira/browse/SPARK-2038 to differentiate with SparkConf object and at the same time keep the source level compatibility Author: CodingCat <zhunansjtu@gmail.com> Closes apache#1137 from CodingCat/SPARK-2038 and squashes the following commits: 11abeba [CodingCat] revise the comments 7ee5712 [CodingCat] to keep the source-compatibility 763975f [CodingCat] style fix d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
Hi Parviz, if not, how do I use this package?
@pdeyhim Can you close this? :)
### What changes were proposed in this pull request?
The pr aims to upgrade `commons-codec` from `1.16.0` to `1.16.1`.

### Why are the changes needed?
1. The new version brings some bug fixes, e.g.:
   - Fix possible IndexOutOfBoundException in PhoneticEngine.encode method #223. Fixes [CODEC-315](https://issues.apache.org/jira/browse/CODEC-315)
   - Fix possible IndexOutOfBoundsException in PercentCodec.insertAlwaysEncodeChars() method #222. Fixes [CODEC-314](https://issues.apache.org/jira/browse/CODEC-314).
2. The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45152 from panbingkun/SPARK-47083.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Added in this patch: KinesisInputDStream, KinesisWordCount