
SPY-875: backported SPARK-11863 and merged Apache branch-1.5 #127


Merged
merged 37 commits into from
Dec 9, 2015

Conversation

markhamstra

No description provided.

liancheng and others added 30 commits November 15, 2015 13:16
…to branch-1.5

The main purpose of this PR is to backport apache#9664, which depends on apache#9277.

Author: Cheng Lian <lian@databricks.com>

Closes apache#9671 from liancheng/spark-11191.fix-temp-function.branch-1_5.
Code snippet to reproduce it:
```
TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) === t.getSeconds)
```

It would be good to add a regression test for it, but the reproducing code needs to change the default time zone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` stays fixed at the old value.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#9728 from cloud-fan/seconds.

(cherry picked from commit 06f1fdb)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
…nt initialization

On driver process start-up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal with Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared with Hive metastore clients; instead it is loaded separately and of course cannot see the Kerberos login credentials prepared in the main thread.

The first proposed fix would cause other classloader conflict errors and is not an appropriate solution. This new change does the Kerberos login during Hive client initialization, which makes the credentials ready for the particular Hive client instance.
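A minimal sketch of that idea in Scala, assuming the principal and keytab are available where the Hive client is constructed (the helper name and config handling are illustrative, not the exact ClientWrapper code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Log in from the keytab inside the isolated classloader's world, so the
// UserGroupInformation class the Hive metastore client actually uses holds the
// Kerberos credentials (the UGI prepared in the main thread is invisible to it).
def loginForHiveClient(hadoopConf: Configuration, principal: String, keytab: String): Unit = {
  if (UserGroupInformation.isSecurityEnabled) {
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(principal, keytab)
  }
}
```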

 yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?

Author: Yu Gao <ygao@us.ibm.com>
Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>
Author: Yu Gao <crystalgaoyu@gmail.com>

Closes apache#9272 from yolandagao/master.

(cherry picked from commit 72c1d68)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…ctionRegistry

According to discussion in PR apache#9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now.

Author: Cheng Lian <lian@databricks.com>

Closes apache#9737 from liancheng/spark-11191.follow-up.

(cherry picked from commit fa13301)
Signed-off-by: Cheng Lian <lian@databricks.com>
These events happen normally during the app's lifecycle, so printing ERROR logs all the time is misleading, and can actually affect the usability of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#9772 from vanzin/SPARK-11786.

(cherry picked from commit 936bc0b)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
… a batch

We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
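A rough sketch of the naming idea (assuming a Hadoop `Path`-based layout; the helper name is illustrative, not the actual CheckpointWriter code):

```scala
import org.apache.hadoop.fs.Path

// Key the checkpoint file by the checkpoint's own time rather than the batch that
// triggered the write, so recovery always finds the file with the newest data.
def checkpointFile(checkpointDir: String, checkpointTimeMs: Long): Path =
  new Path(checkpointDir, s"checkpoint-$checkpointTimeMs")
```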

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9707 from zsxwing/fix-checkpoint.

(cherry picked from commit 928d631)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Added Mockito to the test scope to fix the compilation error in branch 1.5.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9782 from zsxwing/1.5-hotfix.
The default Kryo serialization of UTF8String may not be correct (BYTE_ARRAY_OFFSET could be different across JVMs).
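A minimal sketch of an explicit Kryo serializer that writes only the string's bytes, rather than relying on Kryo's default field-by-field serialization (this illustrates the idea, not necessarily the exact patch):

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.unsafe.types.UTF8String

// Serialize only the UTF-8 bytes; JVM-specific values such as BYTE_ARRAY_OFFSET
// must never end up in the serialized form.
class UTF8StringSerializer extends Serializer[UTF8String] {
  override def write(kryo: Kryo, output: Output, str: UTF8String): Unit = {
    val bytes = str.getBytes
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[UTF8String]): UTF8String =
    UTF8String.fromBytes(input.readBytes(input.readInt()))
}
```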

Author: Davies Liu <davies@databricks.com>

Closes apache#9704 from davies/kyro_string.

(cherry picked from commit 98be816)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability

Author: Sean Owen <sowen@cloudera.com>

Closes apache#9731 from srowen/SPARK-11652.

(cherry picked from commit 9631ca3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
It was multiplying by U instead of dividing by U.

Author: Viveka Kulharia <vivkul@iitk.ac.in>

Closes apache#9771 from vivkul/patch-1.

(cherry picked from commit 1429e0a)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.
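A rough sketch of the classloader change (the method shape is illustrative; in Spark the deserialization happens inside TaskResultGetter's `enqueueFailedTask`):

```scala
import java.nio.ByteBuffer
import org.apache.spark.TaskEndReason
import org.apache.spark.serializer.SerializerInstance

// Deserialize a failed task's TaskEndReason with the thread's context classloader,
// which can see user-supplied classes (e.g. custom exceptions), instead of the
// classloader that loaded Spark itself.
def deserializeFailureReason(ser: SerializerInstance, serializedData: ByteBuffer): TaskEndReason = {
  val loader = Thread.currentThread.getContextClassLoader
  ser.deserialize[TaskEndReason](serializedData, loader)
}
```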

See apache#9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes apache#9779 from choochootrain/spark-11195-master.

(cherry picked from commit 3cca5ff)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/TestUtils.scala
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. Greatly increased capacity of Word2Vec.
Currently in the fit of Word2Vec, the closure mainly includes the serialization of Word2Vec itself and the 2 global tables.
The main part of Word2Vec is the vocab, of size roughly vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes; if vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of the vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary.

Actually there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure, as sketched below. Let me know if that's preferred.
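A small sketch of that "local copy" pattern (the class and field names are illustrative, not the actual Word2Vec members):

```scala
import org.apache.spark.rdd.RDD

// Copying the needed fields into local vals means the RDD closure captures only
// those small values, not `this` together with its large vocab structures.
class Word2VecLike(private val vectorSize: Int, private val seed: Long) extends Serializable {
  private val vocab: Map[String, Int] = Map.empty   // potentially huge in practice

  def fit(words: RDD[String]): Long = {
    val localVectorSize = vectorSize   // local copies: the closure below no longer
    val localSeed = seed               // references `this`, so vocab is not serialized
    words.map(w => w.length.toLong * localVectorSize + localSeed).count()
  }
}
```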

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
SparkListenerSuite's _"onTaskGettingResult() called when result fetched remotely"_ test was extremely slow (1 to 4 minutes to run) and recently became extremely flaky, frequently failing with OutOfMemoryError.

The root cause was the fact that this was using `System.setProperty` to set the Akka frame size, which was not actually modifying the frame size. As a result, this test would allocate much more data than necessary. The fix here is to simply use SparkConf in order to configure the frame size.
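A rough sketch of the SparkConf-based setup (values here are illustrative test settings, not the suite's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the Akka frame size up front through SparkConf; calling
// System.setProperty after the SparkContext exists does not change it.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("SparkListenerSuite")
  .set("spark.akka.frameSize", "1")   // 1 MB, so the "remote fetch" payload stays small
val sc = new SparkContext(conf)
```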

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#9822 from JoshRosen/SPARK-11649.
…ceByKeyAndWindow

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None in this context). A local function is never None, so the case of invFunc=None (a common one, when no inverse reduction is defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names; these are also fixed.

Author: David Tolpin <david.tolpin@gmail.com>

Closes apache#9775 from dtolpin/master.

(cherry picked from commit 599a8c6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
…s (backport to branch 1.5)

backport apache#9841 to branch 1.5

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9850 from zsxwing/SPARK-11831-branch-1.5.
…ting a NULL

JIRA: https://issues.apache.org/jira/browse/SPARK-11817

Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes apache#9834 from viirya/truncate-fractional-sec.

(cherry picked from commit 60bfb11)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…Function and TransformFunctionSerializer

TransformFunction and TransformFunctionSerializer don't rethrow the exception, so when any exception happens they just return None. This causes some weird NPEs and confuses people.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9847 from zsxwing/pyspark-streaming-exception.

(cherry picked from commit be7a2cf)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
udf/cast should use the existing SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes apache#9915 from davies/create_1.5.
…ted with a Stage

This issue was addressed in apache#5494, but the fix in that PR, while safe in the sense that it will prevent the SparkContext from shutting down, misses the actual bug.  The intent of `submitMissingTasks` should be understood as "submit the Tasks that are missing for the Stage, and run them as part of the ActiveJob identified by jobId".  Because of a long-standing bug, the `jobId` parameter was never being used.  Instead, we were trying to use the jobId with which the Stage was created -- which may no longer exist as an ActiveJob, hence the crash reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since it has existed since 1.0.

kayousterhout pankajarora12

Author: Mark Hamstra <markhamstra@gmail.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes apache#6291 from markhamstra/SPARK-6880.

(cherry picked from commit 0a5aef7)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
…VM exits

Deleting the temp dir like this:

```

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val a = mutable.Set(1,2,3,4,7,0,8,98,9)
a: scala.collection.mutable.Set[Int] = Set(0, 9, 1, 2, 3, 7, 4, 8, 98)

scala> a.foreach(x => {a.remove(x) })

scala> a.foreach(println(_))
98
```

You may not modify a collection while traversing or iterating over it. This cannot delete all elements of the collection.
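For reference, a safe alternative in the same spirit (a sketch, not the actual shutdown-hook code): snapshot the elements before mutating, or simply clear the collection.

```scala
import scala.collection.mutable

val a = mutable.Set(1, 2, 3, 4, 7, 0, 8, 98, 9)
a.toList.foreach(x => a.remove(x))   // iterate over an immutable snapshot, mutate the set
assert(a.isEmpty)

val b = mutable.Set(1, 2, 3)
b.clear()                            // or just clear it in one call
assert(b.isEmpty)
```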

Author: Zhongshuai Pei <peizhongshuai@huawei.com>

Closes apache#9951 from DoingDone9/Bug_RemainDir.

(cherry picked from commit 6b78157)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…eadPool doesn't cache any task

In the previous code, `newDaemonCachedThreadPool` uses `SynchronousQueue`, which is wrong. `SynchronousQueue` is an empty queue that cannot cache any task. This patch uses `LinkedBlockingQueue` to fix it, along with other fixes to make sure `newDaemonCachedThreadPool` uses at most `maxThreadNumber` threads and, after that, caches tasks in a `LinkedBlockingQueue`.
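A sketch of the corrected pool construction (parameter names are illustrative; Guava's ThreadFactoryBuilder is assumed here for naming the daemon threads):

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import com.google.common.util.concurrent.ThreadFactoryBuilder

// Bounded at maxThreadNumber threads; extra tasks wait in a LinkedBlockingQueue
// instead of being rejected (a SynchronousQueue holds no pending tasks, so submissions
// fail once all maxThreadNumber threads are busy).
def newDaemonCachedThreadPool(prefix: String, maxThreadNumber: Int,
                              keepAliveSeconds: Int = 60): ThreadPoolExecutor = {
  val threadFactory = new ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat(prefix + "-%d")
    .build()
  val pool = new ThreadPoolExecutor(
    maxThreadNumber,                    // core pool size
    maxThreadNumber,                    // maximum pool size: never more than maxThreadNumber
    keepAliveSeconds, TimeUnit.SECONDS, // idle core threads are allowed to time out
    new LinkedBlockingQueue[Runnable],  // cache the overflow tasks here
    threadFactory)
  pool.allowCoreThreadTimeOut(true)
  pool
}
```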

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#9978 from zsxwing/cached-threadpool.

(cherry picked from commit d3ef693)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
```EventLoggingListener.getLogPath``` needs 4 input arguments:
https://github.com/apache/spark/blob/v1.6.0-preview2/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L276-L280

the 3rd parameter should be appAttemptId, and the 4th parameter is the codec.
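For reference, a sketch of a call with the corrected argument order (the literal values are illustrative, and `EventLoggingListener.getLogPath` itself is internal to Spark):

```scala
import java.net.URI
import org.apache.spark.scheduler.EventLoggingListener

// Argument order per the signature linked above:
// (logBaseDir, appId, appAttemptId, compressionCodecName)
val logPath = EventLoggingListener.getLogPath(
  new URI("hdfs:///spark-events"),   // base directory for event logs
  "app-20151207120000-0001",         // application ID
  Some("1"),                         // 3rd argument: the application attempt ID
  Some("lz4"))                       // 4th argument: the compression codec name
```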

Author: Teng Qiu <teng.qiu@gmail.com>

Closes apache#10044 from chutium/SPARK-12053.

(cherry picked from commit a8ceec5)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
…down

Avoid potential deadlock with a user app's shutdown hook thread by more narrowly synchronizing access to 'hooks'
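A sketch of the narrower locking idea (field names are illustrative, not the actual shutdown hook manager internals):

```scala
import scala.collection.mutable

object HookRunnerSketch {
  private val hooks = mutable.Buffer.empty[() => Unit]

  def add(hook: () => Unit): Unit = hooks.synchronized { hooks += hook }

  // Hold the lock only long enough to copy the registered hooks, then run them
  // outside it, so a hook triggered from the app's own shutdown thread cannot
  // deadlock against a thread that is still registering or removing hooks.
  def runAll(): Unit = {
    val snapshot = hooks.synchronized { hooks.toList }
    snapshot.foreach(_.apply())
  }
}
```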

Author: Sean Owen <sowen@cloudera.com>

Closes apache#10042 from srowen/SPARK-12049.

(cherry picked from commit 96bf468)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Author: Alexander Pivovarov <apivovarov@gmail.com>

Closes apache#10064 from apivovarov/patch-1.
The issue is that the output committer is not idempotent and retry attempts will
fail because the output file already exists. It is not safe to clean up the file
as this output committer is by design not retryable. Currently, the job fails
with a confusing file-exists error. This patch is a stopgap to tell the user
to look at the top of the error log for the proper message.

This is difficult to test locally as Spark is hardcoded not to retry. Manually
verified by upping the retry attempts.

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

Closes apache#10080 from nongli/spark-11328.

(cherry picked from commit 47a0abc)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…data source

When querying a Timestamp or Date column like the following:
```
val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)
```
the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0".
It should have quotes around the Timestamp/Date value, such as "TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'".
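A sketch of the quoting step when filters are compiled to a WHERE clause (the helper name mirrors the idea; it is not a verbatim excerpt of the JDBC data source):

```scala
import java.sql.{Date, Timestamp}

// Wrap String, Timestamp and Date literals in single quotes; leave numbers as-is.
def compileValue(value: Any): Any = value match {
  case stringValue: String       => s"'${stringValue.replace("'", "''")}'"
  case timestampValue: Timestamp => s"'$timestampValue'"   // e.g. '2015-01-01 00:00:00.0'
  case dateValue: Date           => s"'$dateValue'"
  case other                     => other
}
```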

Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>

Closes apache#9872 from huaxingao/spark-11788.

(cherry picked from commit 5a8b5fd)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
This bug was exposed as memory corruption in Timsort, which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case and
always copied forward, which corrupts the data in the half of the overlapping cases
where the destination starts past the source.
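A sketch of the direction check for overlap-safe copying, shown on a plain byte array rather than the raw Unsafe memory the real code operates on:

```scala
// If the destination range starts inside the source range, copy backwards so source
// bytes are not overwritten before they have been read; otherwise copy forwards.
def copyWithinArray(bytes: Array[Byte], srcPos: Int, dstPos: Int, length: Int): Unit = {
  if (dstPos > srcPos && dstPos < srcPos + length) {
    var i = length - 1
    while (i >= 0) {
      bytes(dstPos + i) = bytes(srcPos + i)
      i -= 1
    }
  } else {
    var i = 0
    while (i < length) {
      bytes(dstPos + i) = bytes(srcPos + i)
      i += 1
    }
  }
}
```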

Author: Nong Li <nong@databricks.com>

Closes apache#10068 from nongli/spark-12030.

(cherry picked from commit 2cef1cd)
Signed-off-by: Yin Huai <yhuai@databricks.com>
https://issues.apache.org/jira/browse/SPARK-11352

This one backports apache#10072 to branch 1.5.

Author: Yin Huai <yhuai@databricks.com>

Closes apache#10084 from yhuai/SPARK-11352-branch-1.5.
…HadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched
* The JobConf is serialized as part of the DStream checkpoints.
These concurrent accesses (updating in one thread while another thread is serializing it) can lead to ConcurrentModificationException in the underlying Java HashMap used in the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch, that is updated by `RDD.saveAsHadoopFile()`, while the checkpointing serializes the original JobConf.
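A sketch of that shape (the types and names are illustrative, not the exact DStream code):

```scala
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// The per-batch closure builds a fresh JobConf copy, so the conf object captured in
// the DStream checkpoint is never the one RDD.saveAsHadoopFile mutates.
def saveAsHadoopFilesSafely(stream: DStream[(String, String)],
                            prefix: String,
                            checkpointedConf: JobConf): Unit = {
  stream.foreachRDD { (rdd, time: Time) =>
    val perBatchConf = new JobConf(checkpointedConf)   // new JobConf every batch
    rdd.saveAsHadoopFile(
      s"$prefix-${time.milliseconds}",
      classOf[String], classOf[String],
      classOf[TextOutputFormat[String, String]],
      perBatchConf)
  }
}
```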

Tests to be added in apache#9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes apache#10090 from davies/fix_coalesce.

(cherry picked from commit 4375eb3)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
gcc and others added 7 commits December 6, 2015 16:28
Author: gcc <spark-src@condor.rhaag.ip>

Closes apache#10101 from rh99/master.

(cherry picked from commit 04b6799)
Signed-off-by: Sean Owen <sowen@cloudera.com>
When \u appears in a comment block (i.e. in /**/), codegen will break. So, in Expression and CodegenFallback, we escape \u to \\u, as sketched below.
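A sketch of the escaping idea (the helper name is illustrative):

```scala
// Make an expression's toString safe to embed inside a generated /* ... */ comment:
// escape "*/" so the comment cannot be closed early, and escape "\u" so the Java
// compiler does not try to interpret it as a unicode escape.
def toCommentSafeString(s: String): String =
  s.replace("*/", "\\*\\/")
   .replace("\\u", "\\\\u")
```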

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#10155 from gatorsmile/escapeU.

(cherry picked from commit 49efd03)
Signed-off-by: Yin Huai <yhuai@databricks.com>
…r and AppClient (backport 1.5)

backport apache#10108 to branch 1.5

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#10135 from zsxwing/fix-threadpool-1.5.
This backports [apache#10161] to Spark 1.5, with the difference that ChiSqSelector does not require modification.

Switched from using the SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml.
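A sketch of the pattern applied in those save/load helpers (the wrapper function is just for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Reuse the active SQLContext for this SparkContext instead of constructing a new one.
def sqlContextFor(sc: SparkContext): SQLContext =
  SQLContext.getOrCreate(sc)   // previously: new SQLContext(sc)
```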

CC: yhuai mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#10183 from jkbradley/sqlcontext-backport1.5.
Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at apache#9731

Author: Sean Owen <sowen@cloudera.com>

Closes apache#10198 from srowen/SPARK-11652.2.

(cherry picked from commit e3735ce)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…of aliases and real columns

This is based on apache#9844, with some bug fixes and cleanup.

The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.

Whoever merges this PR, please give the credit to dilipbiswal.

Author: Dilip Biswal <dbiswal@us.ibm.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#9961 from cloud-fan/sort.
davidnavas changed the title from "backported SPARK-11863 and merged Apache branch-1.5" to "SPY-875: backported SPARK-11863 and merged Apache branch-1.5" on Dec 9, 2015
davidnavas pushed a commit that referenced this pull request Dec 9, 2015
SPY-875: backported SPARK-11863 and merged Apache branch-1.5
davidnavas merged commit 837cc96 into alteryx:csd-1.5 on Dec 9, 2015