OAP is built using Apache Maven.
To clone the OAP project, use
git clone -b branch-0.6-spark-2.4.4 https://github.com/Intel-bigdata/OAP.git
cd OAP
To build the OAP package, use
mvn clean -DskipTests package
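The packaged OAP jar is produced under the target directory; the exact file name depends on the OAP and Spark versions, so the wildcard below is only illustrative.
# list the jar produced by the build (the name varies with the OAP/Spark version)
ls target/oap-*.jar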
To run all the tests, use
mvn clean test
To run a specific test suite, for example OapDDLSuite, use
mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite test
NOTE: The log level of OAP unit tests currently defaults to ERROR; override src/test/resources/log4j.properties if needed.
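As a minimal sketch, an overriding src/test/resources/log4j.properties that raises the console log level to INFO might look like this; the appender layout is only an example:
# log everything at INFO and above to the console during unit tests
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n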
If you want to use OAP with DCPMM, follow the build steps below.
You need to install the required packages, listed below, on the build system.
Add -Ppersistent-memory to the build command line to build with DCPMM support.
mvn clean -q -Ppersistent-memory -DskipTests package
Although OAP acts as a plugin jar to Spark, there are still a few things to note when integrating it with Spark. OAP uses the Spark extension and data source APIs for its core functionality, but some functionality cannot be achieved through those APIs alone, so we made a few changes to Spark internals. When integrating OAP with Spark, you therefore need to check whether you are running an unmodified community Spark or a modified, customized Spark.
If you are running an unmodified community Spark, integration is straightforward. Refer to the OAP User Guide to configure and set up Spark to work with OAP.
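For reference, a minimal sketch of the relevant spark-defaults.conf entries for a community Spark is shown below; the extension class and jar paths follow the OAP User Guide but should be verified against it, and the paths are placeholders:
# enable the OAP extension and put the OAP jar on the driver/executor classpath (paths are placeholders)
spark.sql.extensions org.apache.spark.sql.OapExtensions
spark.files /path/to/oap.jar
spark.executor.extraClassPath ./oap.jar
spark.driver.extraClassPath /path/to/oap.jar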
Integrating OAP with a customized Spark is more involved: you need to check whether OAP's changes to Spark internals conflict with or override your private changes.
- If there are no conflicts or overrides, the steps are the same as for an unmodified Spark, as described above.
- If there are conflicts or overrides, you need a source-code merge plan to make sure the changes you made to a Spark source file also appear in the corresponding file included in the OAP project. Once merged, rebuild OAP.
The following files need to be checked/compared for changes (see the comparison sketch after this list):
• antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Adds index-related DDL to the parser grammar, such as "create/show/drop oindex".
• org/apache/spark/scheduler/DAGScheduler.scala
Makes task scheduling aware of OAP cache locations.
• org/apache/spark/sql/execution/DataSourceScanExec.scala
Adds metrics info to OapMetricsManager and schedules tasks to read from the hosts holding cached data.
• org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
Returns the result of a write task to the driver.
• org/apache/spark/sql/execution/datasources/OutputWriter.scala
Adds a new API to support returning the result of a write task to the driver.
• org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala
Adds OapEnv.init() and OapEnv.stop().
• org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
Adds OapEnv.init() and OapEnv.stop() in SparkSQLCLIDriver.
• org/apache/spark/status/api/v1/OneApplicationResource.scala
Updates the metric data shown in the Spark web UI.
• org/apache/spark/SparkEnv.scala
Adds OapRuntime.stop() to stop the OapRuntime instance.
• org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
Changes the access modifier of some variables from private to protected.
• org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
Changes the access modifier of some variables from private to protected.
• org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
Changes the access modifier of some variables from private to protected.
• org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
Adds get and set methods for the variables changed to protected.
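A straightforward way to find conflicts is to diff each of these files between your customized Spark source tree and the copy carried in the OAP project. A minimal sketch, with both paths as placeholders:
# compare one modified file between your Spark tree and OAP's copy (paths are placeholders)
diff -u \
  <your-spark-src>/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala \
  <oap-src>/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala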
When using DCPMM as the cache medium, you need to apply the NUMA binding patch Spark.2.4.4.numa.patch to the Spark source code to obtain optimal performance.
- Download the Spark 2.4.4 source, or clone it from GitHub.
- Apply the patch and rebuild the Spark package (see the build sketch after this list).
git apply Spark.2.4.4.numa.patch
- When deploying OAP to Spark, add the configuration item below to $SPARK_HOME/conf/spark-defaults.conf to enable NUMA binding.
spark.yarn.numa.enabled true
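Putting the steps above together, a build of the patched Spark could look roughly like the sketch below; the clone source and build profiles are common choices rather than requirements, and the patch path is a placeholder:
# clone Spark 2.4.4, apply the NUMA binding patch, and rebuild the distribution
git clone -b v2.4.4 https://github.com/apache/spark.git spark-2.4.4
cd spark-2.4.4
git apply /path/to/Spark.2.4.4.numa.patch
./dev/make-distribution.sh --name numa-patched --tgz -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver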
Note: If you are using a customized Spark, there may be conflicts when applying the patch; you may need to resolve them manually.
If applying the patch is cumbersome, we provide a pre-built Spark package, spark-2.4.4-bin-hadoop2.7-patched.tgz, with the patch already applied.
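If you use the pre-built package, deployment reduces to extracting the tarball and pointing SPARK_HOME at it; the extracted directory name below is an assumption:
# extract the patched Spark distribution and use it as SPARK_HOME
tar -xzf spark-2.4.4-bin-hadoop2.7-patched.tgz
export SPARK_HOME=$(pwd)/spark-2.4.4-bin-hadoop2.7-patched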