[SPARK-13127][SQL] Update Parquet to 1.9.0 #16281
Test build #70131 has finished for PR 16281 at commit
@@ -1424,7 +1424,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
      catalystSchema = new StructType(),
-     expectedSchema = ParquetSchemaConverter.EMPTY_MESSAGE)
+     expectedSchema = "message root {}")
Why this change? Don't we expect `EMPTY_MESSAGE`?
+1
Oops. Thank you for the review, @viirya and @HyukjinKwon.
I'll revert that.
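For context on the diff above: Spark's `ParquetSchemaConverter.EMPTY_MESSAGE` and the textual schema `"message root {}"` are both meant to denote a Parquet message type with no fields. A minimal sketch of that equivalence using Parquet's schema API (`Types.buildMessage()` and `MessageTypeParser` are real Parquet classes; whether the parser accepts an empty group may vary between Parquet releases, so treat this as an assumption):

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.apache.parquet.schema.Types;

public class EmptyMessageCheck {
  public static void main(String[] args) {
    // Build an empty message type programmatically ...
    MessageType built = Types.buildMessage().named("root");
    // ... and parse the same schema from its textual form.
    MessageType parsed = MessageTypeParser.parseMessageType("message root {}");
    // If the two representations agree, the two test assertions are interchangeable.
    System.out.println(built.equals(parsed)); // expected: true
  }
}
```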
The tricky part about upgrading Parquet is whether it will continue to work with the other dependencies that also use Parquet. It's also worth considering that a different Parquet version may be on the classpath at runtime, certainly for older versions of Hadoop. We should probably proceed with this if it's fixing important bugs. It's worth scanning through what went in from 1.8 to 1.9 to see if anything looks like a potential behavior change or compatibility issue.
Test build #70140 has finished for PR 16281 at commit
Thank you for the review, @srowen. I see. First, I'll check the case of mixed Parquet jar files on the classpath. Second, among the 83 issues fixed in Parquet 1.9.0, the following seem notable.
@rdblue, could you give us some advice on compatibility issues between Parquet 1.8 and 1.9?
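One quick way to check for mixed Parquet jars on the classpath is to ask the JVM where a Parquet class was actually loaded from. A minimal JDK-only sketch (`ParquetFileReader` is a real parquet-hadoop class; the diagnostic approach itself is just an illustration):

```java
// Report which jar a Parquet class is loaded from, which is useful when an
// older Hadoop distribution ships its own parquet-mr on the classpath.
public class ParquetClasspathCheck {
  public static void main(String[] args) throws Exception {
    Class<?> clazz = Class.forName("org.apache.parquet.hadoop.ParquetFileReader");
    System.out.println("Loaded from: "
        + clazz.getProtectionDomain().getCodeSource().getLocation());
    // Package metadata is present only if the jar manifest declares it.
    Package pkg = clazz.getPackage();
    System.out.println("Implementation version: "
        + (pkg != null ? pkg.getImplementationVersion() : "unknown"));
  }
}
```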
I'm actually wondering if we should just fork Parquet, maintain our own version of it, and add fixes to the fork ourselves. In the past, Parquet updates have often brought more regressions ...
Thank you for the review, @rxin. Every Apache project (including Apache Spark) has some bugs in every release. BTW, @rxin and @srowen, to reduce the risk,
Parquet is the default format of Spark, so it is pretty significant to Spark. Parquet is becoming stable now, and this might be the right time to fork it. We would just be fixing bugs. @liancheng and @rdblue are Parquet committers; they might be the right people to review the changes we make in the forked version.
Forking is a very bad thing, and only a last resort. We haven't properly managed the fork of Hive yet, even. I don't hear of specific bugs to fork around here either. As such, I can't see why this would even be considered.
We keep hitting issues in Parquet. Here is another example: https://issues.apache.org/jira/browse/SPARK-18539
One more example: #16106. This issue degrades performance.
Those Parquet issues are specific to certain Parquet versions. If upgrading Parquet can solve them, that doesn't justify the decision to fork Parquet. To fork such a project, we need to weigh the pros and cons more carefully.
Yes, but we face bugs in third-party components all the time and either work around them or get them fixed. There is an unmentioned downside here too: a fork means not getting bug fixes and improvements for things that don't (yet) affect Spark. Running a fork risks incompatibility at the data format level, where it matters most to interop. I think this really can't be taken lightly -- better to find ways to increase integration testing, or our influence on Parquet, or whatever it takes to get things fixed faster where they bite Spark.
My two cents:
@srowen Even if we fork our own version, it does not mean we would give up upgrading to newer versions. We would just add a few critical fixes. Normally, these problems will be resolved when upgrading to newer versions, which means we would not need to port those fixes forward, and the maintenance cost would be small. This is very common in mission-critical systems: when mainframe customers hit a bug, they do not upgrade to the next major version; they get a special build with the needed fixes. The Parquet community will not do this for us, but we can do it ourselves, especially when we have Parquet experts like @liancheng and @rdblue.
@gatorsmile @rdblue also works directly on Parquet. I am not seeing "unfixable" Parquet problems here. You're just pointing at problems that can and should be fixed, preferably in one place. Forking is not at all normal as a response to this. This, at least, must block on us figuring out how to manage forks. The Hive fork is still not really technically OK.
The problem is the Parquet community will not create a branch 1.8.2+ for us. Upgrading to newer versions like 1.9 or 2.0 is always risky. Historically, we have hit bugs and performance degradation when trying to upgrade Parquet versions. We need more time and effort to decide whether we need to upgrade to a newer Parquet version.
I'd much rather lobby to release 1.8.2 and help with the legwork than do all that legwork and more to maintain a fork. It's still not clear to me that upgrading to 1.9.0 is not a solution?
Basically, the idea is to make a special build of Parquet 1.8.1 with the needed fixes ourselves. Upgrading to a newer version like Parquet 1.9.0 is risky. Parquet 1.9.0 was just released this October; its bugs might not be exposed until more users/clients start using it. The same is true of our Spark community: I personally would not suggest any enterprise customer use Spark 2.0.0, even though it resolved many bugs from Spark 1.6+. To evaluate whether upgrading to Parquet 1.9.0 is OK now, the biggest effort is the performance evaluation. We need our own standard performance workload benchmarks (TPC-DS is not enough) to test whether the upgrade introduces any major performance degradation.
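As a rough illustration of the kind of scan-heavy measurement being discussed, here is a minimal sketch using Spark's Java API; the dataset path is a placeholder, and a real evaluation would need representative workloads, warm-up, and repeated runs rather than a single timed count:

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch of a Parquet scan micro-benchmark; not the standard workload
// proposed in the discussion above.
public class ParquetScanBench {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parquet-scan-bench")
        .master("local[*]")
        .getOrCreate();

    long start = System.nanoTime();
    // Time a full scan of a Parquet dataset (placeholder path).
    long rows = spark.read().parquet("/tmp/benchmark-data").count();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("rows=" + rows + " elapsedMs=" + elapsedMs);
    spark.stop();
  }
}
```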
Actually, this PR targets Apache Spark 2.2, whose RC1 is expected in late March.
@dongjoon-hyun What kind of questions/requests should we ask on the dev mailing list? IMO, the risk and cost are small if we make a special build ourselves; we can get the bug fixes very quickly. Maybe @rxin, @rdblue, and @liancheng can give their input here?
I agree, but from a long-term perspective, the risk and cost of forking could be the worst option.
I think we would not be adding new features to Parquet; the fixes should be small. To limit the cost and risk, we would need to reject all major fixes in our special build. At the same time, we would also need to ask the Parquet community to resolve the bugs in newer releases.
Yep, at the beginning it starts like that. But please look at the Spark Hive fork or the Spark Thrift Server. I don't think we are maintaining those in the best way.
Are we adding major code changes to the Spark Thrift Server? What is "Spark Hive"?
Has anyone even asked for a new 1.8.x build from Parquet and been told it won't happen? You don't stop consuming non-fix changes by forking. You do that by staying on a maintenance branch, if that branch is maintained of course. I'd be shocked if there were important bugs affecting a major ASF project and no way to get a maintenance release of a recent branch, especially when we have experts here with influence. What does this block, and why the urgency?
Yep. The Spark Thrift Server is different, but it's not actively maintained. For example, the default database feature was only recently added. I mean this one by
Is the open issue the main purpose of trying to fork?
That is just an example. We definitely need the fix ASAP, right? See the past release dates of Apache Parquet (this is just what I found on the web, so it might not be 100% accurate):
I see the point. Then, is the forked repository going for 2.1.1 or 2.1-rc4 ASAP?
I don't think we are making a decision at this point to fork ... if we can really push Parquet to make another maintenance bug fix release, that'd be great.
+1
I don't think a fork is a good idea, nor do I think there is a reasonable need for one.

@gatorsmile brought up that the Parquet community refused to build a patch release: "The problem is the Parquet community will not create a branch 1.8.2+ for us." I don't remember this happening, and as far as I can tell from both Google and my inbox, the Parquet community never rejected the idea of patch releases. If there was a conversation that I don't know about, then I apologize that you were given the impression that patch releases aren't possible. That isn't the case. I'm happy to work with the community to put out patch releases, especially if that's needed for Spark. To demonstrate, look at PARQUET-389 and PARQUET-654. @rxin asked the Parquet dev list about predicate push-down features, and within a week and a half both of those issues were resolved. (PARQUET-389 is the fix for SPARK-18539, cited as motivation to fork.)

As for the other motivating issue, PARQUET-686, a fork can't help solve this problem. This is an issue that requires updating the Parquet format spec, so you couldn't simply fix your own fork without abandoning compatibility. The Parquet community put out a release that gives the user a choice between correctness and performance, which is a good compromise until this can be fixed.

It is fair to point out that Parquet has not had a regular release cadence for minor releases (1.8.1 to 1.9.0), which is something the Parquet community knows about and has discussed. We have recently committed to quarterly releases to fix this, with patch releases whenever they are needed. I'd encourage anyone interested to get involved.
Like @rdblue said, I don't recall people asking for a 1.8.2 release on the Parquet dev list.
I'd love to see frequent, conservative patch releases. In my experience, Parquet bugs cause significant trouble for downstream consumers. For example, we encountered a data corruption bug writing Parquet format V2 that made a good deal of that data unreadable. (I can't recall the issue number, but it was reported and fixed.) As another example, we recently upgraded some of our Spark clusters to use parquet-mr 1.9.0, because PARQUET-363 introduced a bug into one of our Spark 2.x patches. When we switched to 1.9, we found https://issues.apache.org/jira/browse/PARQUET-783, which breaks things in a different way. We needed a fix, so we forked 1.9 internally. FWIW, we haven't found any other issues using parquet-mr 1.9.0 with Spark 2.1.
@julienledem @rdblue wow, that is so great!!! It will be much easier for us! I never expected the Parquet community would be willing to do regular patch releases. Sorry, that statement was just my personal assumption based on the release history, since I did not see any backport PRs. If the Parquet community can do quarterly patch releases, it makes our maintenance work much easier. Thank you very much!!!
Great! I'm glad it was just confusion. I completely agree with @srowen that forking should be a last resort. In the future, please reach out to the community, whether it's Parquet or another project, to address concerns before it gets this far. It's better for everyone if we can use the feedback to improve and continue to work together.
@rdblue Thanks again! :)
Now that we understand the Parquet community is willing to put out patch releases (many thanks to them), I don't see any major reason to motivate forking. That is great news for us. So do we postpone the upgrade to Parquet 1.9.0 and wait for a patch release in the near future?
Sure. We are going to wait for Parquet 1.8.2.
I think we should move to a 1.8.2 patch release. The reason is that 1.9.0 moved to ByteBuffer-based reads and we've found at least one problem with them. ByteBuffer-based reads also change an internal API that Spark uses for its vectorized reader. The changes here (now that I've looked at the actual PR) wouldn't work, because Parquet is going to call the ByteBuffer method rather than the byte array method. I'm cleaning up the ByteBuffer code quite a bit right now for better performance with G1GC (see apache/parquet-java#390), so I think Spark should move to ByteBuffer reads after that makes it in.
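To illustrate the compatibility point about byte array versus ByteBuffer read methods: the classes below are hypothetical stand-ins, not parquet-mr's actual reader API; they only show why an override of a byte[] overload stops taking effect once the calling library switches to a ByteBuffer overload.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the internal page-reading hook being discussed.
class LibraryReader {
  // Pre-1.9-style entry point.
  public void initFromPage(byte[] page, int offset) {
    System.out.println("library byte[] path");
  }
  // 1.9-style entry point the library now calls instead.
  public void initFromPage(ByteBuffer page) {
    System.out.println("library ByteBuffer path");
  }
}

class VectorizedReaderSketch extends LibraryReader {
  // A Spark-style override of only the byte[] overload.
  @Override
  public void initFromPage(byte[] page, int offset) {
    System.out.println("custom vectorized byte[] path");
  }
}

public class ByteBufferReadDemo {
  public static void main(String[] args) {
    LibraryReader reader = new VectorizedReaderSketch();
    // An old caller hits the override ...
    reader.initFromPage(new byte[16], 0);         // custom vectorized byte[] path
    // ... but a caller using the ByteBuffer overload bypasses it entirely.
    reader.initFromPage(ByteBuffer.allocate(16)); // library ByteBuffer path
  }
}
```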
Thanks @dongjoon-hyun! Let's get Parquet 1.8.2 out in January.
Thank you for confirming, @rdblue.
@rdblue Interesting. Do you have any estimated or actual data for the performance improvement?
@rxin ran an experiment to compare the performance of various ways of accessing an array.
The improvement is in how row groups are garbage collected. G1GC puts humongous allocations directly into the old generation, so you end up needing a full GC to reclaim the space. That just increases memory pressure, so you run out of memory and run full GCs and/or spill to disk. We don't have data yet because I haven't pushed the feature or the metrics collection for it.
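For background on the humongous-allocation point: in G1, an object of at least half a heap region is allocated as "humongous" and goes straight into dedicated old-generation regions. The sketch below just computes that threshold for an illustrative buffer size; the sizes are assumptions, not Parquet measurements.

```java
// Back-of-the-envelope check for whether a single large buffer (e.g. a whole
// page or column chunk read into one byte[]) would be a G1 humongous allocation.
public class HumongousCheck {
  public static void main(String[] args) {
    long regionSizeBytes = 4L * 1024 * 1024; // e.g. -XX:G1HeapRegionSize=4m
    long bufferSizeBytes = 8L * 1024 * 1024; // a hypothetical 8 MB read buffer

    boolean humongous = bufferSizeBytes >= regionSizeBytes / 2;
    System.out.println("humongous allocation: " + humongous);
    // Humongous objects occupy dedicated old-generation regions, so a steady
    // stream of them raises memory pressure until a full GC reclaims them --
    // the motivation for splitting reads into smaller ByteBuffers.
  }
}
```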
FYI: Parquet 1.8.2 vote thread passed: https://mail-archives.apache.org/mod_mbox/parquet-dev/201701.mbox/%3CCAO4re1mHLT%2BLYn8s1RTEDZK8-9WSVugY8-HQqAN%2BtU%3DBOi1L9w%40mail.gmail.com%3E
Hi, all. |
## What changes were proposed in this pull request?

According to the discussion on apache#16281, which tried to upgrade toward Apache Parquet 1.9.0, the Apache Spark community prefers to upgrade to 1.8.2 instead of 1.9.0. Apache Parquet 1.8.2 was officially released last week, on 26 Jan, so we can use 1.8.2 now. https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E

This PR only aims to bump the Parquet version to 1.8.2. It does not touch any other code.

## How was this patch tested?

Pass the existing tests and also manually by doing `./dev/test-dependencies.sh`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#16751 from dongjoon-hyun/SPARK-19409.
FYI, we've been using 1.9.0 patched with a fix for https://issues.apache.org/jira/browse/PARQUET-783 without problems.
Thank you for sharing that information, @mallman.
* SNAP-2307 fixes (#116) SNAP-2307 fixes related to SnappyTableScanSuite * reverting changes done in pull request #116 (#119) Merging after discussing with Rishi * Code changes for ENT-21: (#118) - Adding skipHandlerStart flag based on which handler can be started, wherever applicable. - Updating access specifiers. * * Bump up version to 2.1.1.3 * [SNAPPYDATA] fixed scalastyle * * Version 2.1.1.3-RC1 * Code changes for SNAP-2471: (#120) - Adding close button in the SnappyData Version Details Pop Up to close it. * * [ENT-46] Mask sensitive information. (#121) * Code changes for SNAP-2478: (#122) - Updating font size of members basic statistics on Member Details Page. - Display External Tables only if available. * Fixes for SNAP-2377: (#123) - To fix Trend charts layout issue, changing fixed width to width in percent for all trends charts on UI. * [SNAPPY-2511] initialize SortMergeJoin build-side scanner lazily (#124) Avoid sorting the build side of SortMergeJoin if the streaming side is empty. This already works that way for inner joins with code generation where the build side is initialized on first call from processNext (using the generated variable "needToSort" in SortExec). This change also enables the behaviour for non-inner join queries that use "SortMergeJoinScanner" that instantiates build-side upfront. * [SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13 - Update DateTimeUtilsSuite so that when testing roundtripping in daysToMillis and millisToDays multiple skipdates can be specified. - Updated test so that both new years eve 2014 and new years day 2015 are skipped for kiribati time zones. This is necessary as java versions pre 181-b13 considered new years day 2015 to be skipped while susequent versions corrected this to new years eve. Unit tests Author: Chris Martin <chris@cmartinit.co.uk> Closes #21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures. (cherry picked from commit c5b8d54c61780af6e9e157e6c855718df972efad) Signed-off-by: Sean Owen <srowen@gmail.com> * [SNAP-2569] remove explicit HiveSessionState dependencies To enable using any SparkSession with Spark's HiveServer2, explicit dependencies on HiveSessionState in processing have been removed. * [SNAPPYDATA] make Benchmark class compatible with upstream * [SNAPPYDATA] fix default bind-address of ThriftCLIService - ThriftCLIService uses InetAddress.getLocalHost() as default address to be shown but hive thrift server actually uses InetAddress.anyLocalAddress() - honour bind host property in ThriftHttpCLIService too * [SNAPPYDATA] generate spark-version-info.properties in source path spark-version-info.properties is now generated in src/main/extra-resources rather than in build output so that IDEA can pick it up cleanly remove Kafka-0.8 support from build: updated examples for Kafka-0.10 * [SNAPPYDATA] Increase hive-thrift shell history file size to 50000 lines - skip init to set history max-size else it invokes load() in constructor that truncates the file to default 500 lines - update jline to 2.14.6 for this new constructor (https://github.com/jline/jline2/issues/277) - add explicit dependency on jline2 in hive-thriftserver to get the latest version * [SNAPPYDATA] fix RDD info URLs to "Spark Cache" - corrected the URL paths for RDDs to use /Spark Cache/ instead of /storage/ - updated effected tests * [SNAPPYDATA] improved a gradle dependency to avoid unnecessary re-evaluation * Changed the year frim 2017 to 2018 in license headers. 
* SNAP-2602 : On snappy UI, add column named "Overflown Size"/ "Disk Size" in Tables. (#127) * Changes for SNAP-2602: - JavaScript changes for displaying tables overflown size to disk as Spill-To-Disk size. * Changes for SNAP-2612: (#126) - Displaying external tables fully qualified name (schema.tablename). * SNAP-2661 : Provide Snappy UI User a control over Auto Update (#128) * Changes for SNAP-2661 : Provide Snappy UI User a control over Auto Update - Adding JavaScript and CSS code changes for Auto Update ON/OFF Switch on Snappy UI (Dashboard and Member Details page). * [SNAPPYDATA] Property to set if hive meta-store client should use isolated ClassLoader (#132) - added a property to allow setting whether hive client should be isolated or not - improved message for max iterations warning in RuleExecutor * [SNAP-2751] Enable connecting to secure SnappyData via Thrift server (#130) * * Changes from @sumwale to set the credentials from thrift layer into session conf. * * This fixes an issue with RANGE operator in non-code generated plans (e.g. if too many target table columns) * Patch provided by @sumwale * avoid dumping generated code in quick succession for exceptions * correcting scalastyle errors * * Trigger authentication check irrespective of presence of credentials. * [SNAPPYDATA] update gradle to version 5.0 - updated builds for gradle 5.0 - moved all embedded versions to top-level build.gradle * change javax.servlet-api version to 3.0.1 * Updated the janino compiler version similar to upstream spark (#134) Updated the Janino compiler dependency version similar/compatible with the spark dependencies. * Changes for SNAP-2787: (#137) - Adding an option "ALL" in Show Entries drop down list of tabular lists, in order to display all the table entries to avoid paging. * Fixes for SNAP-2750: (#131) - Adding JavaScript plugin code for JQuery Data Table to sort columns containing file/data sizes in human readable form. - Updating HTML, CSS and JavaScript, for sorting, of tables columns. * Changes for SNAP-2611: (#138) - Setting configuration parameter for setting ordering column. * SNAP-2457 - enabling plan caching for hive thrift server sessions. (#139) * Changes for SNAP-2926: (#142) - Changing default page size for all tabular lists from 10 to 50. - Sorting Members List tabular view on Member Type for ordering all nodes such that all locators first, then all leads and then all servers. * Snap 2900 (#140) Changes: * For SNAP-2900 - Adding HTML, CSS, and JavaScript code changes for adding Expand and Collapse control button against each members list entry. Clicking on this control button, all additional cell details will be displayed or hidden. - Similarly adding parent expand and collapse control to expand and collapse all rows in the table in single click. - Removing existing Expand and Collapse control buttons per cell, as those will be redundant. * For SNAP-2908 - Adding third party library Jquery Sparklines to add sparklines (inline charts) in members list for CPU and Memory Usages. - Adding HTML, CSS, and JavaScript code changes for rendering CPU and Memory usages Sparklines. * Code clean up. - Removing unused icons and images. - removing unused JavaScript Library liquidFillGauge.js * Changes for SNAP-2908: [sparkline enhancements] (#143) [sparkline enhancements] * Adding text above sparklines to display units and time duration of charts. * Formatting sparkline tooltips to display numbers with 3 precision places. 
* [SNAP-2934] Avoid double free of page that caused server crash due to SIGABORT/SIGSEGV (#144) * [SNAP-2956] Wrap non fatal OOME from Spark layer in a LowMemoryException (#146) * Fixes for SNAP-2965: (#147) - Using disk store UUID as an unique identifier for each member node. * [SNAPPYDATA] correcting typo in some exception messages * SNAP-2917 - generating SparkR library along with snappy product (#141) removing some unused build code * [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in … (#149) * [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search scalanlp/breeze#651 Most of the content of this PR is cherry-picked from https://github.com/apache/spark/commit/b35660dd0e930f4b484a079d9e2516b0a7dacf1d with minimal code changes done to resolve merge conflicts. --- Faced one test failure (ParquetHiveCompatibilitySuite#"SPARK-10177 timestamp") while running precheckin. This was due to recent upgrade in `jodd` library version to `5.0.6`. Downgraded `jodd` library version to `3.9.1` to fix this failure. Note that this changes is independent from breeze version upgrade. * Changes for SNAP-2974 : Snappy UI re-branding to TIBCO ComputeDB (#150) * Changes for SNAP-2974: Snappy UI re-branding to TIBCO ComputeDB 1. Adding TIBCO ComputDB product logo 2. Adding Help Icon, clicking on which About box is displayed 3. Updating About Box content - Adding TIBCO ComputeDB product name and its Edition type - Adding Copyright information - Adding Assistance details web links - Adding Product Documentation link 4. Removing or Changing user visible SnappyData references on UI to TIBCO ComputeDB. 5. Renaming pages to just Dashboard, Member Details and Jobs 6. Removing Docs link from tabs bar * * Version changes * Code changes for SNAP-2989: Snappy UI rebranding to Tibco ComputeDB iff it's Enterprise Edition (#151) Product UI updated for following: 1. SnappyData is Community Edition - Displays Pulse logo on top left side. - Displays SnappyData logo on top right side. - About Box : Displays product name "Project SnappyData - Community Edition" Displays product version, copyright information Displays comunity product documentation link. 2. TIBCO ComputeDB is Enterprise : - Displays TIBCO ComputeDB logo on top left side. - About Box: Displays product name "TIBCO ComputeDB - Enterprise Edition" Displays product version, copyright information Displays enterprise product documentation link. * * Updated some metainfo in prep for 1.1.0 release * Changes for SNAP-2989: (#152) - Removing SnappyData Community page link from Enterprise About Box. - Fixes for issue SnappyData logo is displayed on first page load in Enterprise edition. * [SNAPPYDATA] fix scalastyle error * Spark compatibility fixes (#153) - Spark compatibility suite fixes to make them work both in Spark and SD - expand PathOptionSuite to check for data after table rename - use Resolver to check intersecting columns in NATURAL JOIN * Considering jobserver class loader as a key for generated code cache - (#154) ## Considering jobserver class loader as a key for generated code cache For each submission of a snappy-job, a new URI class loader is used. The first run of a snappy-job may generate some code and it will be cached. The subsequent run of the snappy job will end up using the generated code which was cached by the first run of the job. 
This can lead to issues as the class loader used for the cached code is the one from the first job submission and subsequent submissions will be using a different class loader. This change is done to avoid such failures. * SNAP-3054: Rename UI tab "JDBC/ODBC Server" to "Hive Thrift Server" (#156) - Renaming tab name "JDBC/ODBC Server" to "Hive Thrift Server". * SNAP-3015: Put thousands separators for Tables > Rows Count column in Dashboard. (#157) - Adding thousands separators for table row count as per locale. * Tracking spark block manager directories for each executors and cleaning them in next run if left orphan. * [SNAPPYDATA] fix scalastyle errors introduced by previous commit * Revert: Tracking spark block manager directories for each executors and cleaning them in next run if left orphan. * allow for override of TestHive session * [SNAP-3010] Cleaning block manager directories if left orphan (#158) ## What changes were proposed in this pull request? Tracking spark block manager directories for each executor and cleaning them in next run if left orphan. The changes are for tracking the spark local directories (which are used by block manager to store shuffle data) and changes to clean the local directories (which are left orphan due to abrupt failure of JVM). The changes to clean the orphan directory are also kept as part of Spark module itself instead of cleaning it on Snappy Cluster start. This is done because the changes to track the local directory has to go in Spark and if the clean up is not done at the same place then the metadata file used to track the local directories will keep growing while running spark cluster from snappy's spark distribution. This cleanup is skipped when master is local because in local mode driver and executors will end up writing `.tempfiles.list` file in the same directory which may…
What changes were proposed in this pull request?
This PR updates Parquet to 1.9.0 and removes the workarounds that were only needed because of limitations in Parquet 1.8.1.
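For context, the upgrade itself is a dependency version bump in the build, so a quick sanity check after building is to print the Parquet version string that actually ends up on the classpath. The sketch below is illustrative only and is not part of this PR's diff; it assumes the `org.apache.parquet.Version` class from parquet-common, which exposes the `FULL_VERSION` string that Parquet also records in the footers of the files it writes.

```scala
// Illustrative sketch only, not part of this PR's diff.
// Prints the Parquet release that is actually on the classpath; after the
// dependency bump this should report 1.9.0. Assumes parquet-common's
// org.apache.parquet.Version class, available in both the 1.8.x and 1.9.x lines.
object ParquetVersionCheck {
  def main(args: Array[String]): Unit = {
    // FULL_VERSION is the human-readable build string,
    // e.g. "parquet-mr version 1.9.0 (build ...)".
    println(org.apache.parquet.Version.FULL_VERSION)
  }
}
```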
How was this patch tested?
Passes the existing tests.
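As a rough illustration of the code path those existing suites exercise, the sketch below writes a small DataFrame as Parquet and reads it back. It is a hedged, self-contained example (the object name `ParquetRoundTrip` and the temporary path are made up here), not one of the actual Spark SQL Parquet tests, which go much further (schema conversion, filter pushdown, compatibility files, and so on).

```scala
import org.apache.spark.sql.SparkSession

// Minimal round-trip sketch (not one of this PR's tests): write a small
// DataFrame as Parquet and read it back to confirm the basic path still works.
object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-round-trip")
      .master("local[2]")
      .getOrCreate()

    // Write 100 rows to a temporary Parquet location.
    val path = java.nio.file.Files.createTempDirectory("parquet-check").toString + "/data"
    spark.range(0, 100).toDF("id").write.parquet(path)

    // Read the data back and verify the row count survived the round trip.
    val readBack = spark.read.parquet(path)
    assert(readBack.count() == 100, "round trip should preserve all rows")

    spark.stop()
  }
}
```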