
merged Apache bug fixes #54


Merged
merged 26 commits into from
May 6, 2015

Conversation

markhamstra

SKIPME

srowen and others added 26 commits March 26, 2015 15:00
… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes apache#5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
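The kind of edge-condition handling this fix describes can be sketched in Python; this is an illustrative analog, not Spark's actual fastBucketFunction (the name, signature, and clamping choice are assumptions):

```python
def bucket_index(x, mn, mx, count):
    """Map x into one of `count` equal-width buckets over [mn, mx].

    Illustrative sketch of a histogram bucket function that handles edge
    conditions: out-of-range values and NaN are rejected, and x == mx
    falls into the last bucket instead of one past the end.
    """
    if x < mn or x > mx or x != x:  # reject out-of-range and NaN (NaN != NaN)
        return None
    b = int((x - mn) / ((mx - mn) / count))
    return min(b, count - 1)  # clamp so x == mx lands in the last bucket
```

The clamp is the edge case that is easy to miss: without it, the maximum value computes an index equal to `count` and falls outside every bucket.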
The reused address on the server side (setReuseAddress) caused the server to fail to acknowledge connected connections, so remove it.

This PR retries once after a timeout, and it also adds a timeout on the client side.

Author: Davies Liu <davies@databricks.com>

Closes apache#5324 from davies/collect_hang and squashes the following commits:

e5a51a2 [Davies Liu] remove setReuseAddress
7977c2f [Davies Liu] do retry on client side
b838f35 [Davies Liu] retry after timeout

(cherry picked from commit 0cce545)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
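The client-side behavior described above — a timeout plus one retry — can be sketched as follows; the function name, defaults, and use of the standard socket API are illustrative assumptions, not the PR's actual code:

```python
import socket

def connect_with_retry(host, port, timeout=3.0, retries=1):
    """Connect with a timeout, retrying once on failure.

    Illustrative sketch: each attempt is bounded by `timeout`, and the
    last error is re-raised only after all attempts are exhausted.
    """
    last_err = None
    for _ in range(retries + 1):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_err = err
    raise last_err
```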
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes apache#5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Conflicts:
	core/src/test/scala/org/apache/spark/SparkContextSuite.scala
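The Option-based fix can be sketched with guarded access in Python; `ActiveJob` here is a stand-in and the property key is hypothetical, but the shape of the fix — never dereferencing a possibly-null properties object — is the one described above:

```python
from typing import Optional

class ActiveJob:
    """Illustrative stand-in: properties may legitimately be absent."""
    def __init__(self, properties: Optional[dict] = None):
        self.properties = properties

    def group_id(self) -> Optional[str]:
        # Guarded access instead of an unconditional lookup, mirroring
        # the Option-based fix: absent properties yield None rather than
        # raising a NullPointerException-style error.
        return (self.properties or {}).get("spark.jobGroup.id")
```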
…ork library.

While the inbound path of a netty pipeline is thread-safe, the outbound
path is not. That means that multiple threads can compete to write messages
to the next stage of the pipeline.

The network library sometimes breaks a single RPC message into multiple
buffers internally to avoid copying data (see MessageEncoder). This can
result in the following scenario (where "FxBy" means "frame x, buffer y"):

               T1         F1B1            F1B2
                            \               \
                             \               \
               socket        F1B1   F2B1    F1B2  F2B2
                                     /             /
                                    /             /
               T2                  F2B1         F2B2

And the frames now cannot be rebuilt on the receiving side because the
different messages have been mixed up on the wire.

The fix wraps these multi-buffer messages into a `FileRegion` object
so that these messages are written "atomically" to the next pipeline handler.

Author: Reynold Xin <rxin@databricks.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#5336 from vanzin/SPARK-6578-1.2 and squashes the following commits:

4d3395e [Reynold Xin] [SPARK-6578] Small rewrite to make the logic more clear in MessageWithHeader.transferTo.
526f230 [Marcelo Vanzin] [SPARK-6578] [core] Fix thread-safety issue in outbound path of network library.
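The interleaving in the FxBy diagram above can be reproduced with a toy writer; this is an illustrative Python analog of the fix (combine a frame's buffers and write them as one unit), not the actual Netty/FileRegion code:

```python
import threading

class ByteSink:
    """Minimal in-memory stand-in for a socket or channel."""
    def __init__(self):
        self.data = b""
    def write(self, chunk):
        self.data += chunk

class AtomicFrameWriter:
    """Illustrative analog of the fix: all buffers belonging to one frame
    are written as a single unit, so two threads can no longer interleave
    their frames' buffers on the wire."""
    def __init__(self, sink):
        self.sink = sink
        self._lock = threading.Lock()

    def write_frame(self, *buffers):
        payload = b"".join(buffers)  # combine F1B1, F1B2, ... up front
        with self._lock:             # one atomic write per frame
            self.sink.write(payload)
```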
…toreRelation's sameresult method only compare databasename and table name)

Override MetastoreRelation's sameResult method to compare only the database name and table name.

Previously:
cache table t1;
select count(*) from t1;
reads data from memory, but the query below does not; it reads from HDFS instead:
select count(*) from t1 t;

Cached data is keyed by logical plan and compared with sameResult, so when the table has an alias its logical plan is not the same as the plan without the alias. Hence sameResult is modified to compare only the database name and table name.

Author: seayi <405078363@qq.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3898 from seayi/branch-1.2 and squashes the following commits:

8f0c7d2 [seayi] Update CachedTableSuite.scala
a277120 [seayi] Update HiveMetastoreCatalog.scala
8d910aa [seayi] Update HiveMetastoreCatalog.scala
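The cache-lookup behavior described above can be sketched as follows; the class and method names mirror the Scala ones but the Python code is purely illustrative:

```python
class MetastoreRelation:
    """Illustrative sketch: same_result compares only the database and
    table name, so `t1` and the aliased `t1 t` hit the same cache entry
    even though their full logical plans differ."""
    def __init__(self, database, table, alias=None):
        self.database, self.table, self.alias = database, table, alias

    def same_result(self, other):
        # Alias deliberately excluded from the comparison.
        return (self.database, self.table) == (other.database, other.table)
```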
….logDirectory

The config option is spark.history.fs.logDirectory, not spark.fs.history.logDirectory, so the description should be changed. Thanks.

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes apache#5332 from KaiXinXiaoLei/historyConfig and squashes the following commits:

5ffbfb5 [KaiXinXiaoLei] the describe of jobHistory config is error

(cherry picked from commit 8a0aa81)
Signed-off-by: Andrew Or <andrew@databricks.com>
…g to load classes (branch-1.2)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

(See apache#4944 for the corresponding PR for 1.3/1.4).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squashes the following commits:

16e38fe [Josh Rosen] [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR)
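The resource-cleanup pattern this patch applies can be sketched in Python; the helper name and callback are hypothetical, but the point is the `finally` block — the stream is closed even when reading fails, so the class server's threads are not tied up:

```python
def fetch_class_bytes(open_stream, name):
    """Illustrative sketch of the leak fix: always close the stream,
    whether the read succeeds or raises."""
    stream = open_stream(name)
    try:
        return stream.read()
    finally:
        stream.close()  # runs on success and on error alike
```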
Prior to this change, the unit test for SPARK-3426 did not clone the
original SparkConf, which meant the test did not use the options
set by suites that subclass ShuffleSuite.scala. This commit fixes that
problem.

JoshRosen would be great if you could take a look at this, since you wrote this
test originally.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#5401 from kayousterhout/SPARK-6753 and squashes the following commits:

368c540 [Kay Ousterhout] [SPARK-6753] Clone SparkConf in ShuffleSuite tests

(cherry picked from commit 9d44ddc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

The current implementation also works, but it always uses only two partitions -- the first one and the last one (because bisect_left returns either "beginning" or "end" for a descending sequence).

Author: Milan Straka <fox@ucw.cz>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes apache#4761 from foxik/fix-descending-sort and squashes the following commits:

95896b5 [Milan Straka] Add regression test for SPARK-5969.
5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
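The fixed partitioning logic can be sketched as follows; the function name and signature are illustrative, but the mechanics match the description above — boundaries stay ascending for bisect_left, and the index is reversed for a descending sort:

```python
from bisect import bisect_left

def partition_index(samples, key, ascending, num_partitions):
    """Illustrative sketch: `samples` holds ascending partition boundaries;
    for descending order the found index is reversed rather than the
    samples themselves."""
    idx = min(bisect_left(samples, key), num_partitions - 1)
    return idx if ascending else num_partitions - 1 - idx
```

With descending samples, bisect_left would return only 0 or len(samples), which is exactly the two-partitions-only bug described above.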
Author: Erik van Oosten <evanoosten@ebay.com>

Closes apache#5489 from erikvanoosten/master and squashes the following commits:

1c91954 [Erik van Oosten] Rewrote double range matcher to an exact equality assert (SPARK-6878)
f1708c9 [Erik van Oosten] Fix for sum on empty RDD fails with exception (SPARK-6878)

(cherry picked from commit 51b306b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
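The empty-RDD failure mode can be sketched in plain Python; this is an illustrative analog of the SPARK-6878 fix, with partitions modeled as lists:

```python
from functools import reduce

def rdd_sum(partitions):
    """Illustrative sketch: sum with an explicit zero value, so an empty
    RDD sums to 0.0 instead of raising -- a plain reduce over an empty
    sequence with no initial value would throw."""
    return reduce(lambda a, b: a + b,
                  (x for part in partitions for x in part), 0.0)
```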
We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see xerial/snappy-java#100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

(cherry picked from commit 6adb8bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	pom.xml
…s found.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#5515 from vanzin/SPARK-5634 and squashes the following commits:

f74ecf1 [Marcelo Vanzin] [SPARK-5634] [core] Show correct message in HS when no incomplete apps found.

(cherry picked from commit 30a6e0d)
Signed-off-by: Andrew Or <andrew@databricks.com>
Set the current dir path in $FWDIR, and likewise in $ASSEMBLY_DIR1 and $ASSEMBLY_DIR2;
otherwise $SPARK_HOME is not visible from spark-env.sh, since no SPARK_HOME variable is assigned there.
I am using the Spark-1.3.0 source code package and came across this when trying to start the master: sbin/start-master.sh

Author: raschild <raschild@users.noreply.github.com>

Closes apache#5261 from raschild/patch-1 and squashes the following commits:

b9babcd [raschild] Update load-spark-env.sh
…lete apps f..."

This reverts commit 5845a62.

This was reverted because it broke compilation for branch-1.2.  The problem is
that the `requestedIncomplete` variable is not defined in this branch.
Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD will be created in the JVM while the RDD in Python may be GCed, so the broadcast can be destroyed in the JVM before the PythonRDD.

This PR changes to using the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of a PythonRDD, which can be heavy.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes apache#5496 from davies/big_closure and squashes the following commits:

9a0ea4c [Davies Liu] fix big closure with shuffle

Conflicts:
	python/pyspark/rdd.py
sbin/spark-daemon.sh used

    ps -p "$TARGET_PID" -o args=

to figure out whether the process running with the expected PID is actually a Spark
daemon. When running with a large classpath, the output of ps gets
truncated and the check fails spuriously.

This weakens the check to see if it's a java command (which is something
we do in other parts of the script) rather than looking for the specific
main class name. This means that SPARK-4832 might happen under a
slightly broader range of circumstances (a java program happened to
reuse the same PID), but it seems worthwhile compared to failing
consistently with a large classpath.

Author: Punya Biswal <pbiswal@palantir.com>

Closes apache#5535 from punya/feature/SPARK-6952 and squashes the following commits:

7ea12d1 [Punya Biswal] Handle long args when detecting PID reuse
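The weakened check can be sketched as a predicate over the `ps -p PID -o args=` output; the function name is hypothetical and the real change lives in shell, but the matching logic is the one described above:

```python
def is_java_process(args_output):
    """Illustrative sketch: since ps output is truncated for long
    classpaths, match only that the command is java rather than
    searching for the Spark main class name."""
    fields = args_output.split()
    return bool(fields) and fields[0].rsplit("/", 1)[-1] == "java"
```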
If `StreamingKMeans` is not `Serializable`, we cannot checkpoint applications that use `StreamingKMeans`, so we should make it `Serializable`.

Author: zsxwing <zsxwing@gmail.com>

Closes apache#5582 from zsxwing/SPARK-6998 and squashes the following commits:

67c2a14 [zsxwing] Make StreamingKMeans 'Serializable'

(cherry picked from commit fa73da0)
Signed-off-by: Reynold Xin <rxin@databricks.com>
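Why serializability matters for checkpointing can be sketched with pickle, the Python analog of Java's `Serializable`; the model class and functions here are illustrative, not Spark's API:

```python
import pickle

class StreamingModel:
    """Illustrative stand-in for a streaming model: checkpointing
    snapshots the whole object, which requires it to serialize."""
    def __init__(self, centers):
        self.centers = centers

def checkpoint(model):
    return pickle.dumps(model)   # raises if the model cannot serialize

def restore(blob):
    return pickle.loads(blob)
```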
…on on.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#5751 from vanzin/cached-rdd-warning and squashes the following commits:

554cc07 [Marcelo Vanzin] Change message.
9efb9da [Marcelo Vanzin] [minor] [core] Warn users who try to cache RDDs with dynamic allocation on.
…regation

see [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181).

Author: Qiping Li <liqiping1991@gmail.com>

Closes apache#5737 from chouqin/externalsorter and squashes the following commits:

2924b93 [Qiping Li] fix inifite loop in Externalsorter's mergeWithAggregation

(cherry picked from commit 7f4b583)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Conflicts:
	assembly/pom.xml
	bagel/pom.xml
	core/pom.xml
	examples/pom.xml
	external/flume-sink/pom.xml
	external/flume/pom.xml
	external/kafka/pom.xml
	external/mqtt/pom.xml
	external/twitter/pom.xml
	external/zeromq/pom.xml
	extras/java8-tests/pom.xml
	extras/kinesis-asl/pom.xml
	extras/spark-ganglia-lgpl/pom.xml
	graphx/pom.xml
	mllib/pom.xml
	network/common/pom.xml
	network/shuffle/pom.xml
	network/yarn/pom.xml
	pom.xml
	repl/pom.xml
	sql/catalyst/pom.xml
	sql/core/pom.xml
	sql/hive-thriftserver/pom.xml
	sql/hive/pom.xml
	streaming/pom.xml
	tools/pom.xml
	yarn/alpha/pom.xml
	yarn/pom.xml
	yarn/stable/pom.xml
@markhamstra markhamstra self-assigned this May 6, 2015
markhamstra added a commit that referenced this pull request May 6, 2015
@markhamstra markhamstra merged commit 861d022 into alteryx:csd-1.2 May 6, 2015
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017