
SPY-875: backported SPARK-11863 and merged Apache branch-1.5 #127


Merged
merged 37 commits into from
Dec 9, 2015

Conversation

markhamstra

No description provided.

liancheng and others added 30 commits November 15, 2015 13:16
…to branch-1.5

The main purpose of this PR is to backport apache#9664, which depends on apache#9277.

Author: Cheng Lian <lian@databricks.com>

Closes apache#9671 from liancheng/spark-11191.fix-temp-function.branch-1_5.
Code snippet to reproduce it:
```
TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) === t.getSeconds)
```

It would be good to add a regression test for it, but the reproducing code needs to change the default time zone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` stays fixed at the old value.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#9728 from cloud-fan/seconds.

(cherry picked from commit 06f1fdb)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
…nt initialization

On driver process start-up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal with Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared with Hive metastore clients; instead it is loaded separately and of course cannot see the Kerberos login credentials prepared in the main thread.

The first proposed fix would cause other classloader conflict errors and is not an appropriate solution. This new change does the Kerberos login during Hive client initialization, which makes the credentials ready for the particular Hive client instance.
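A minimal sketch of that idea in Scala, assuming the principal and keytab are available where the Hive client is constructed (the helper name and config handling are illustrative, not the exact ClientWrapper code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Log in from the keytab inside the isolated classloader's world, so the
// UserGroupInformation class the Hive metastore client actually uses holds the
// Kerberos credentials (the UGI prepared in the main thread is invisible to it).
def loginForHiveClient(hadoopConf: Configuration, principal: String, keytab: String): Unit = {
  if (UserGroupInformation.isSecurityEnabled) {
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(principal, keytab)
  }
}
```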

 yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?

Author: Yu Gao <ygao@us.ibm.com>
Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>
Author: Yu Gao <crystalgaoyu@gmail.com>

Closes apache#9272 from yolandagao/master.

(cherry picked from commit 72c1d68)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…ctionRegistry

According to discussion in PR apache#9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now.

Author: Cheng Lian <lian@databricks.com>

Closes apache#9737 from liancheng/spark-11191.follow-up.

(cherry picked from commit fa13301)
Signed-off-by: Cheng Lian <lian@databricks.com>
These events happen normally during the app's lifecycle, so printing ERROR logs all the time is misleading, and can actually affect the usability of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#9772 from vanzin/SPARK-11786.

(cherry picked from commit 936bc0b)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
… a batch

We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
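A rough sketch of the naming idea (assuming a Hadoop `Path`-based layout; the helper name is illustrative, not the actual CheckpointWriter code):

```scala
import org.apache.hadoop.fs.Path

// Key the checkpoint file by the checkpoint's own time rather than the batch that
// triggered the write, so recovery always finds the file with the newest data.
def checkpointFile(checkpointDir: String, checkpointTimeMs: Long): Path =
  new Path(checkpointDir, s"checkpoint-$checkpointTimeMs")
```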

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9707 from zsxwing/fix-checkpoint.

(cherry picked from commit 928d631)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Added Mockito to the test scope to fix the compilation error in branch 1.5.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9782 from zsxwing/1.5-hotfix.
The default Kryo serialization of UTF8String may not be correct (BYTE_ARRAY_OFFSET could be different across JVMs).
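A minimal sketch of an explicit Kryo serializer that writes only the string's bytes, rather than relying on Kryo's default field-by-field serialization (this illustrates the idea, not necessarily the exact patch):

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.unsafe.types.UTF8String

// Serialize only the UTF-8 bytes; JVM-specific values such as BYTE_ARRAY_OFFSET
// must never end up in the serialized form.
class UTF8StringSerializer extends Serializer[UTF8String] {
  override def write(kryo: Kryo, output: Output, str: UTF8String): Unit = {
    val bytes = str.getBytes
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[UTF8String]): UTF8String =
    UTF8String.fromBytes(input.readBytes(input.readInt()))
}
```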

Author: Davies Liu <davies@databricks.com>

Closes apache#9704 from davies/kyro_string.

(cherry picked from commit 98be816)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability

Author: Sean Owen <sowen@cloudera.com>

Closes apache#9731 from srowen/SPARK-11652.

(cherry picked from commit 9631ca3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
It was multiplying by U instead of dividing by U.

Author: Viveka Kulharia <vivkul@iitk.ac.in>

Closes apache#9771 from vivkul/patch-1.

(cherry picked from commit 1429e0a)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.
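A rough sketch of the classloader change (the method shape is illustrative; in Spark the deserialization happens inside TaskResultGetter's `enqueueFailedTask`):

```scala
import java.nio.ByteBuffer
import org.apache.spark.TaskEndReason
import org.apache.spark.serializer.SerializerInstance

// Deserialize a failed task's TaskEndReason with the thread's context classloader,
// which can see user-supplied classes (e.g. custom exceptions), instead of the
// classloader that loaded Spark itself.
def deserializeFailureReason(ser: SerializerInstance, serializedData: ByteBuffer): TaskEndReason = {
  val loader = Thread.currentThread.getContextClassLoader
  ser.deserialize[TaskEndReason](serializedData, loader)
}
```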

See apache#9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes apache#9779 from choochootrain/spark-11195-master.

(cherry picked from commit 3cca5ff)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/TestUtils.scala
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. Greatly increased capacity of Word2Vec.
Currently in the fit of Word2Vec, the closure mainly includes the serialization of Word2Vec itself and the 2 global tables.
The main part of Word2Vec is the vocab, of size roughly vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes; if vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of the vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary.

Actually there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure, as sketched below. Let me know if that's preferred.
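A small sketch of that "local copy" pattern (the class and field names are illustrative, not the actual Word2Vec members):

```scala
import org.apache.spark.rdd.RDD

// Copying the needed fields into local vals means the RDD closure captures only
// those small values, not `this` together with its large vocab structures.
class Word2VecLike(private val vectorSize: Int, private val seed: Long) extends Serializable {
  private val vocab: Map[String, Int] = Map.empty   // potentially huge in practice

  def fit(words: RDD[String]): Long = {
    val localVectorSize = vectorSize   // local copies: the closure below no longer
    val localSeed = seed               // references `this`, so vocab is not serialized
    words.map(w => w.length.toLong * localVectorSize + localSeed).count()
  }
}
```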

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
SparkListenerSuite's _"onTaskGettingResult() called when result fetched remotely"_ test was extremely slow (1 to 4 minutes to run) and recently became extremely flaky, frequently failing with OutOfMemoryError.

The root cause was the fact that this was using `System.setProperty` to set the Akka frame size, which was not actually modifying the frame size. As a result, this test would allocate much more data than necessary. The fix here is to simply use SparkConf in order to configure the frame size.
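A rough sketch of the SparkConf-based setup (values here are illustrative test settings, not the suite's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the Akka frame size up front through SparkConf; calling
// System.setProperty after the SparkContext exists does not change it.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("SparkListenerSuite")
  .set("spark.akka.frameSize", "1")   // 1 MB, so the "remote fetch" payload stays small
val sc = new SparkContext(conf)
```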

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#9822 from JoshRosen/SPARK-11649.
…ceByKeyAndWindow

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None in this context). A local function is never None, so the case of invFunc=None (a common one, when no inverse reduction is defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names; these are also fixed.

Author: David Tolpin <david.tolpin@gmail.com>

Closes apache#9775 from dtolpin/master.

(cherry picked from commit 599a8c6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
…s (backport to branch 1.5)

backport apache#9841 to branch 1.5

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9850 from zsxwing/SPARK-11831-branch-1.5.
…ting a NULL

JIRA: https://issues.apache.org/jira/browse/SPARK-11817

Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes apache#9834 from viirya/truncate-fractional-sec.

(cherry picked from commit 60bfb11)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…Function and TransformFunctionSerializer

TransformFunction and TransformFunctionSerializer don't rethrow the exception, so when any exception happens they just return None. This causes some weird NPEs and confuses people.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9847 from zsxwing/pyspark-streaming-exception.

(cherry picked from commit be7a2cf)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
udf/cast should use the existing SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes apache#9915 from davies/create_1.5.
…ted with a Stage

This issue was addressed in apache#5494, but the fix in that PR, while safe in the sense that it will prevent the SparkContext from shutting down, misses the actual bug.  The intent of `submitMissingTasks` should be understood as "submit the Tasks that are missing for the Stage, and run them as part of the ActiveJob identified by jobId".  Because of a long-standing bug, the `jobId` parameter was never being used.  Instead, we were trying to use the jobId with which the Stage was created -- which may no longer exist as an ActiveJob, hence the crash reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since it has existed since 1.0.

kayousterhout pankajarora12

Author: Mark Hamstra <markhamstra@gmail.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes apache#6291 from markhamstra/SPARK-6880.

(cherry picked from commit 0a5aef7)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
…VM exits

Deleting the temp dir like this:

```

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val a = mutable.Set(1,2,3,4,7,0,8,98,9)
a: scala.collection.mutable.Set[Int] = Set(0, 9, 1, 2, 3, 7, 4, 8, 98)

scala> a.foreach(x => {a.remove(x) })

scala> a.foreach(println(_))
98
```

You may not modify a collection while traversing or iterating over it. This cannot delete all elements of the collection.
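For reference, a safe alternative in the same spirit (a sketch, not the actual shutdown-hook code): snapshot the elements before mutating, or simply clear the collection.

```scala
import scala.collection.mutable

val a = mutable.Set(1, 2, 3, 4, 7, 0, 8, 98, 9)
a.toList.foreach(x => a.remove(x))   // iterate over an immutable snapshot, mutate the set
assert(a.isEmpty)

val b = mutable.Set(1, 2, 3)
b.clear()                            // or just clear it in one call
assert(b.isEmpty)
```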

Author: Zhongshuai Pei <peizhongshuai@huawei.com>

Closes apache#9951 from DoingDone9/Bug_RemainDir.

(cherry picked from commit 6b78157)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…eadPool doesn't cache any task

In the previous code, `newDaemonCachedThreadPool` uses `SynchronousQueue`, which is wrong. `SynchronousQueue` is an empty queue that cannot cache any task. This patch uses `LinkedBlockingQueue` to fix it, along with other fixes to make sure `newDaemonCachedThreadPool` uses at most `maxThreadNumber` threads and, after that, caches tasks in a `LinkedBlockingQueue`.
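A sketch of the corrected pool construction (parameter names are illustrative; Guava's ThreadFactoryBuilder is assumed here for naming the daemon threads):

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import com.google.common.util.concurrent.ThreadFactoryBuilder

// Bounded at maxThreadNumber threads; extra tasks wait in a LinkedBlockingQueue
// instead of being rejected (a SynchronousQueue holds no pending tasks, so submissions
// fail once all maxThreadNumber threads are busy).
def newDaemonCachedThreadPool(prefix: String, maxThreadNumber: Int,
                              keepAliveSeconds: Int = 60): ThreadPoolExecutor = {
  val threadFactory = new ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat(prefix + "-%d")
    .build()
  val pool = new ThreadPoolExecutor(
    maxThreadNumber,                    // core pool size
    maxThreadNumber,                    // maximum pool size: never more than maxThreadNumber
    keepAliveSeconds, TimeUnit.SECONDS, // idle core threads are allowed to time out
    new LinkedBlockingQueue[Runnable],  // cache the overflow tasks here
    threadFactory)
  pool.allowCoreThreadTimeOut(true)
  pool
}
```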

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9978 from zsxwing/cached-threadpool.

(cherry picked from commit d3ef693)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
```EventLoggingListener.getLogPath``` needs 4 input arguments:
https://github.com/apache/spark/blob/v1.6.0-preview2/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L276-L280

the 3rd parameter should be appAttemptId, and the 4th parameter is the codec.
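For reference, a sketch of a call with the corrected argument order (the literal values are illustrative, and `EventLoggingListener.getLogPath` itself is internal to Spark):

```scala
import java.net.URI
import org.apache.spark.scheduler.EventLoggingListener

// Argument order per the signature linked above:
// (logBaseDir, appId, appAttemptId, compressionCodecName)
val logPath = EventLoggingListener.getLogPath(
  new URI("hdfs:///spark-events"),   // base directory for event logs
  "app-20151207120000-0001",         // application ID
  Some("1"),                         // 3rd argument: the application attempt ID
  Some("lz4"))                       // 4th argument: the compression codec name
```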

Author: Teng Qiu <teng.qiu@gmail.com>

Closes apache#10044 from chutium/SPARK-12053.

(cherry picked from commit a8ceec5)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
…down

Avoid potential deadlock with a user app's shutdown hook thread by more narrowly synchronizing access to 'hooks'
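A sketch of the narrower locking idea (field names are illustrative, not the actual shutdown hook manager internals):

```scala
import scala.collection.mutable

object HookRunnerSketch {
  private val hooks = mutable.Buffer.empty[() => Unit]

  def add(hook: () => Unit): Unit = hooks.synchronized { hooks += hook }

  // Hold the lock only long enough to copy the registered hooks, then run them
  // outside it, so a hook triggered from the app's own shutdown thread cannot
  // deadlock against a thread that is still registering or removing hooks.
  def runAll(): Unit = {
    val snapshot = hooks.synchronized { hooks.toList }
    snapshot.foreach(_.apply())
  }
}
```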

Author: Sean Owen <sowen@cloudera.com>

Closes apache#10042 from srowen/SPARK-12049.

(cherry picked from commit 96bf468)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Author: Alexander Pivovarov <apivovarov@gmail.com>

Closes apache#10064 from apivovarov/patch-1.
The issue is that the output committer is not idempotent and retry attempts will
fail because the output file already exists. It is not safe to clean up the file
as this output committer is by design not retryable. Currently, the job fails
with a confusing file-exists error. This patch is a stopgap to tell the user
to look at the top of the error log for the proper message.

This is difficult to test locally as Spark is hardcoded not to retry. Manually
verified by upping the retry attempts.

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

Closes apache#10080 from nongli/spark-11328.

(cherry picked from commit 47a0abc)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…data source

When querying a Timestamp or Date column like the following:
```
val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)
```
the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0".
It should have quotes around the Timestamp/Date value, such as "TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'".
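A sketch of the quoting step when filters are compiled to a WHERE clause (the helper name mirrors the idea; it is not a verbatim excerpt of the JDBC data source):

```scala
import java.sql.{Date, Timestamp}

// Wrap String, Timestamp and Date literals in single quotes; leave numbers as-is.
def compileValue(value: Any): Any = value match {
  case stringValue: String       => s"'${stringValue.replace("'", "''")}'"
  case timestampValue: Timestamp => s"'$timestampValue'"   // e.g. '2015-01-01 00:00:00.0'
  case dateValue: Date           => s"'$dateValue'"
  case other                     => other
}
```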

Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>

Closes apache#9872 from huaxingao/spark-11788.

(cherry picked from commit 5a8b5fd)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
This bug was exposed as memory corruption in Timsort, which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case and
always copied forward, which corrupts the data in the half of the overlapping cases
where the destination starts past the source.
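A sketch of the direction check for overlap-safe copying, shown on a plain byte array rather than the raw Unsafe memory the real code operates on:

```scala
// If the destination range starts inside the source range, copy backwards so source
// bytes are not overwritten before they have been read; otherwise copy forwards.
def copyWithinArray(bytes: Array[Byte], srcPos: Int, dstPos: Int, length: Int): Unit = {
  if (dstPos > srcPos && dstPos < srcPos + length) {
    var i = length - 1
    while (i >= 0) {
      bytes(dstPos + i) = bytes(srcPos + i)
      i -= 1
    }
  } else {
    var i = 0
    while (i < length) {
      bytes(dstPos + i) = bytes(srcPos + i)
      i += 1
    }
  }
}
```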

Author: Nong Li <nong@databricks.com>

Closes apache#10068 from nongli/spark-12030.

(cherry picked from commit 2cef1cd)
Signed-off-by: Yin Huai <yhuai@databricks.com>
https://issues.apache.org/jira/browse/SPARK-11352

This one backports apache#10072 to branch 1.5.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#10084 from yhuai/SPARK-11352-branch-1.5.
…HadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched
* The JobConf is serialized as part of the DStream checkpoints.
These concurrent accesses (updating in one thread while another thread is serializing it) can lead to ConcurrentModificationException in the underlying Java HashMap used in the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch, that is updated by `RDD.saveAsHadoopFile()`, while the checkpointing serializes the original JobConf.
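A sketch of that shape (the types and names are illustrative, not the exact DStream code):

```scala
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// The per-batch closure builds a fresh JobConf copy, so the conf object captured in
// the DStream checkpoint is never the one RDD.saveAsHadoopFile mutates.
def saveAsHadoopFilesSafely(stream: DStream[(String, String)],
                            prefix: String,
                            checkpointedConf: JobConf): Unit = {
  stream.foreachRDD { (rdd, time: Time) =>
    val perBatchConf = new JobConf(checkpointedConf)   // new JobConf every batch
    rdd.saveAsHadoopFile(
      s"$prefix-${time.milliseconds}",
      classOf[String], classOf[String],
      classOf[TextOutputFormat[String, String]],
      perBatchConf)
  }
}
```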

Tests to be added in apache#9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes apache#10090 from davies/fix_coalesce.

(cherry picked from commit 4375eb3)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
gcc and others added 7 commits December 6, 2015 16:28
Author: gcc <spark-src@condor.rhaag.ip>

Closes apache#10101 from rh99/master.

(cherry picked from commit 04b6799)
Signed-off-by: Sean Owen <sowen@cloudera.com>
When \u appears in a comment block (i.e. in /**/), codegen will break. So, in Expression and CodegenFallback, we escape \u to \\u, as sketched below.
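A sketch of the escaping idea (the helper name is illustrative):

```scala
// Make an expression's toString safe to embed inside a generated /* ... */ comment:
// escape "*/" so the comment cannot be closed early, and escape "\u" so the Java
// compiler does not try to interpret it as a unicode escape.
def toCommentSafeString(s: String): String =
  s.replace("*/", "\\*\\/")
   .replace("\\u", "\\\\u")
```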

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#10155 from gatorsmile/escapeU.

(cherry picked from commit 49efd03)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…r and AppClient (backport 1.5)

backport apache#10108 to branch 1.5

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#10135 from zsxwing/fix-threadpool-1.5.
This backports [apache#10161] to Spark 1.5, with the difference that ChiSqSelector does not require modification.

Switched from using the SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml.
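A sketch of the pattern applied in those save/load helpers (the wrapper function is just for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Reuse the active SQLContext for this SparkContext instead of constructing a new one.
def sqlContextFor(sc: SparkContext): SQLContext =
  SQLContext.getOrCreate(sc)   // previously: new SQLContext(sc)
```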

CC: yhuai mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#10183 from jkbradley/sqlcontext-backport1.5.
Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at apache#9731

Author: Sean Owen <sowen@cloudera.com>

Closes apache#10198 from srowen/SPARK-11652.2.

(cherry picked from commit e3735ce)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…of aliases and real columns

This is based on apache#9844, with some bug fixes and cleanup.

The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.

Whoever merges this PR, please give the credit to dilipbiswal.

Author: Dilip Biswal <dbiswal@us.ibm.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#9961 from cloud-fan/sort.
davidnavas changed the title from "backported SPARK-11863 and merged Apache branch-1.5" to "SPY-875: backported SPARK-11863 and merged Apache branch-1.5" on Dec 9, 2015
davidnavas pushed a commit that referenced this pull request Dec 9, 2015
SPY-875: backported SPARK-11863 and merged Apache branch-1.5
davidnavas merged commit 837cc96 into alteryx:csd-1.5 on Dec 9, 2015