support for Kinesis #223

Closed
wants to merge 750 commits into from

Conversation

@pdeyhim pdeyhim commented Mar 25, 2014

+KinesisInputDStream
+KinesisWordCount

@AmplabJenkins
Can one of the admins verify this patch?

@@ -124,6 +124,9 @@ object SparkBuild extends Build {

lazy val externalTwitter = Project("external-twitter", file("external/twitter"), settings = twitterSettings)
.dependsOn(streaming % "compile->compile;test->test")

lazy val externalAmazonKinesis= Project("external-amazonkinesis", file("external/AmazonKinesis"), settings = kinesisSettings)

It would be better to keep the directory name short and lowercase (e.g., external/kinesis), which would be consistent with others like "flume" and "kafka".
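
For illustration, a renamed project definition following the existing externalTwitter pattern above might look like this (a sketch; `kinesisSettings` comes from this PR, the `externalKinesis` name is hypothetical):

```scala
// Short, lowercase directory name, consistent with external/flume and external/kafka:
lazy val externalKinesis = Project("external-kinesis", file("external/kinesis"), settings = kinesisSettings)
  .dependsOn(streaming % "compile->compile;test->test")
```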

@tdas
tdas commented Mar 26, 2014

I took a very high-level pass at this, and here are some high-level things that must be fixed.

  1. Follow the Spark formatting style; refer to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide . You can run Spark's built-in style checker with 'sbt/sbt scalastyle'.
  2. Real unit tests; see in-code comments.
  3. More documentation, both Javadocs and in-code documentation.
  4. Please add a JIRA ticket for this and update the PR subject accordingly (see other PRs).
  5. The directory name should be short and lowercase, consistent with the rest of the external projects.

import org.apache.spark.streaming.StreamingContext._


object KinesisWordCount {

Please provide full instructions on how to run this example. See other examples for more details.
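
For comparison, other streaming examples carry a usage header in a doc comment; a hypothetical sketch for this example (argument names are illustrative, not taken from this PR):

```scala
/**
 * Consumes records from an Amazon Kinesis stream and counts words.
 *
 * Usage: KinesisWordCount <master> <streamName> <endpointUrl>
 *   <master>      Spark master URL; use at least local[2] when running locally
 *   <streamName>  name of the Kinesis stream to consume
 *   <endpointUrl> Kinesis regional endpoint, e.g. https://kinesis.us-east-1.amazonaws.com
 */
```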

Also, please add a Java Kinesis example. Many users ask for Java examples, and the lack of one can be a significant barrier to trying out the Kinesis integration.

@pdeyhim - per our offline conversation this weekend, please add a note that this demo must be run with at least 2 threads, i.e. master=local[2] at a minimum.

Otherwise the KinesisNetworkReceiver thread does not start up, which breaks the demo.
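
For reference, a minimal sketch of what that looks like in the example's entry point (assuming the StreamingContext constructor of that era; the batch duration is chosen arbitrarily):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A receiver permanently occupies one thread, so local mode needs at least
// two: one to run the receiver and one to process the received batches.
val ssc = new StreamingContext("local[2]", "KinesisWordCount", Seconds(2))
```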

@mateiz
mateiz commented Mar 26, 2014

Hey Parviz, this is great to see, but one thing I have to check is whether the Amazon Software License that amazon-kinesis-client is distributed under is okay with the Apache Software Foundation. Give me some time to do that. Beyond that, shouldn't you also add those dependencies to the Maven build? Have you tried building this with Maven?

@AmplabJenkins
Can one of the admins verify this patch?

@mateiz
mateiz commented Apr 3, 2014

Hey Parviz, FYI, because of the Amazon Software License's restrictions on kinesis-client, it may be difficult to include this in the default build (see https://issues.apache.org/jira/browse/LEGAL-198). In that case we should add a Maven profile for enabling it, and give the component a name that makes it clear it depends on something with a different license (e.g. spark-amazonkinesis-asl). We can also describe that in the docs.

We already have one component like this for reporting metrics to Ganglia (spark-ganglia-lgpl), which depends on an LGPL-licensed library. Users build it through a special Maven and SBT flag.

The issue, BTW, is not that the Amazon Software License prevents us from licensing Spark under the Apache License, but that the Apache foundation wants to make sure that code used by its projects offers the same freedoms as the Apache License. Code that doesn't can go into optional packages.
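
A rough sketch of how SparkBuild could gate such a component behind an opt-in flag, modeled on the Ganglia handling (the `SPARK_KINESIS_ASL` and `externalKinesisAsl` names here are hypothetical):

```scala
// Include the ASL-licensed Kinesis project only when the user opts in via an
// environment variable, mirroring how the LGPL-licensed Ganglia module works.
lazy val kinesisEnabled = sys.env.contains("SPARK_KINESIS_ASL")

lazy val optionalKinesisProjects: Seq[ProjectReference] =
  if (kinesisEnabled) Seq(externalKinesisAsl) else Seq.empty
```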

@mateiz
mateiz commented Apr 3, 2014

Jenkins, test this please

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13732/

@kscaldef
kscaldef commented May 8, 2014

What's the current status of this?

jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Mark partitioner, name, and generator field in RDD as @transient.

As part of the effort to reduce serialized task size.
(cherry picked from commit d6e5473)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@cfregly
cfregly commented May 31, 2014

Update: I discussed this with Parviz recently, and we agreed that I would take this over. A new PR is coming shortly. Here's the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-1981

aarondav and others added 7 commits June 25, 2014 08:47
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for

```scala
rdd.groupBy()(Sum('val))
```

but it seems more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.

Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#874 from aarondav/schemardd and squashes the following commits:

e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
Self-explanatory.

Author: Patrick Wendell <pwendell@gmail.com>

Closes apache#878 from pwendell/java-constructor and squashes the following commits:

2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
… all child expressions.

`CountFunction` should count up only if the child's evaluated value is not null.

Because it traverses and evaluates all the child expressions, it counts up if any of the children is not null, even when the child it should count is null.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes apache#861 from ueshin/issues/SPARK-1914 and squashes the following commits:

3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
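
A pure-Scala model of the corrected CountFunction semantics described above (illustrative only, not the actual Catalyst code):

```scala
// COUNT(child) increments only for rows where the child's evaluated value is
// non-null; the buggy version counted a row if any child value was non-null.
def countNonNull(evaluatedChildValues: Seq[Any]): Long =
  evaluatedChildValues.count(_ != null).toLong

countNonNull(Seq("a", null, "b"))  // == 2; the null value is not counted
```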
Author: witgo <witgo@qq.com>

Closes apache#884 from witgo/scalastyle and squashes the following commits:

4b08ae4 [witgo] Fix scalastyle warnings in yarn alpha
JIRA: https://issues.apache.org/jira/browse/SPARK-1925

Author: zsxwing <zsxwing@gmail.com>

Closes apache#879 from zsxwing/SPARK-1925 and squashes the following commits:

5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
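
For context, a tiny illustration of the difference between the two operators (not the patched code itself):

```scala
// '&&' short-circuits, skipping the right side when the left is false;
// '&' on Booleans evaluates both sides, which would dereference null here.
val s: String = null
val safe = s != null && s.nonEmpty    // false; s.nonEmpty is never evaluated
// val crash = s != null & s.nonEmpty // would throw NullPointerException
```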
905173d introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties.

This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.

Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#885 from ankurdave/SPARK-1931 and squashes the following commits:

3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes apache#883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
pwendell and others added 13 commits June 25, 2014 08:47
The assertJsonStringEquals method was missing an "assert" so
did not actually check that the strings were equal. This commit
adds the missing assert and fixes subsequently revealed problems
with the JsonProtocolSuite.

@andrewor14 I changed some of the test functionality to match what it
looks like you intended based on the expected strings -- let me know if
anything here looks wrong.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#1198 from kayousterhout/json_test_fix and squashes the following commits:

77f858f [Kay Ousterhout] Fix broken Json tests.
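
The bug class is easy to picture; a schematic sketch (not the actual suite code):

```scala
// Before the fix: the comparison's Boolean result is silently discarded,
// so this "assertion" can never fail a test.
def assertJsonStringEqualsBroken(expected: String, actual: String): Unit = {
  expected == actual  // result ignored
}

// After the fix: actually assert the comparison.
def assertJsonStringEquals(expected: String, actual: String): Unit = {
  assert(expected == actual)
}
```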
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1201 from marmbrus/fixCacheTests and squashes the following commits:

9d87ed1 [Michael Armbrust] Use analyzer (which runs to fixed point) instead of manually removing analysis operators.
This is an alternative solution to apache#1124. Before launching the executor backend, we first fetch the driver's spark properties and use them to overwrite the executor's spark properties. This should be better than apache#1124.

@pwendell Are there spark properties that might be different on the driver and on the executors?

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1132 from mengxr/akka-bootstrap and squashes the following commits:

77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
This will be helpful in join operators.

Author: Cheng Hao <hao.cheng@intel.com>

Closes apache#1187 from chenghao-intel/joinedRow and squashes the following commits:

87c19e3 [Cheng Hao] Add base row set methods for JoinedRow
Author: Matthew Farrellee <matt@redhat.com>

Closes apache#1185 from mattf/master-1 and squashes the following commits:

42150fc [Matthew Farrellee] Autodetect JAVA_HOME on RPM-based systems
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1204 from marmbrus/nullPointerToString and squashes the following commits:

35b5fce [Michael Armbrust] Fix possible null pointer in accumulator toString
Author: witgo <witgo@qq.com>

Closes apache#1194 from witgo/SPARK-2248 and squashes the following commits:

6ac950b [witgo] spark.default.parallelism does not apply in local mode
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263)

Map objects were not converted to Hive types before inserting into Hive tables.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#1205 from liancheng/spark-2263 and squashes the following commits:

c7a4373 [Cheng Lian] Addressed @concretevitamin's comment
784940b [Cheng Lian] SPARK-2263: support inserting MAP<K, V> to Hive tables
…utput

The `BigDecimal` branch in `unwrap` matches to `scala.math.BigDecimal` rather than `java.math.BigDecimal`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#1199 from liancheng/javaBigDecimal and squashes the following commits:

e9bb481 [Cheng Lian] Should match java.math.BigDecimal when unwrapping Hive output
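
Schematically, the fix is to match the Java type when unwrapping (a simplified sketch, not the exact HiveInspectors code):

```scala
// Hive hands back java.math.BigDecimal, so the unwrap branch must match the
// Java type and convert it; matching scala.math.BigDecimal never fires.
def unwrapDecimal(value: Any): Any = value match {
  case bd: java.math.BigDecimal => BigDecimal(bd)  // to scala.math.BigDecimal
  case other => other
}
```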
…th source-compatibility

https://issues.apache.org/jira/browse/SPARK-2038

to differentiate it from the SparkConf object and at the same time keep source-level compatibility

Author: CodingCat <zhunansjtu@gmail.com>

Closes apache#1137 from CodingCat/SPARK-2038 and squashes the following commits:

11abeba [CodingCat] revise the comments
7ee5712 [CodingCat] to keep the source-compatibility
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
@venuktan
Hi Parviz,
Is there a package in the Maven repo called "spark-amazonkinesis-asl" now?

If not, how do I use this package?

@tdas
tdas commented Jul 10, 2014

@pdeyhim Can you close this? :)

@asfgit asfgit closed this in ee91eb8 Aug 27, 2014
ash211 referenced this pull request in palantir/spark Apr 17, 2017
lins05 pushed a commit to lins05/spark that referenced this pull request Apr 23, 2017
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 12, 2017
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* Made the HDFS keytab secret configurable.

* Updated dcoscli version
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
dongjoon-hyun pushed a commit that referenced this pull request Feb 19, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `commons-codec` from `1.16.0` to `1.16.1`.

### Why are the changes needed?
1. The new version brings some bug fixes, e.g.:
- Fix possible IndexOutOfBoundException in PhoneticEngine.encode method #223. Fixes [CODEC-315](https://issues.apache.org/jira/browse/CODEC-315).
- Fix possible IndexOutOfBoundsException in PercentCodec.insertAlwaysEncodeChars() method #222. Fixes [CODEC-314](https://issues.apache.org/jira/browse/CODEC-314).

2. The full release notes:
    https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45152 from panbingkun/SPARK-47083.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Apr 10, 2024