Commit

Merge branch 'release/2.0.0'

seddonm1 committed Jul 16, 2019
2 parents 51d3b88 + cbe3127 commit 1cc27db
Showing 286 changed files with 20,456 additions and 20,942 deletions.
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
## Change Log

# 2.0.0

Arc 2.0.0 is a major (breaking) release made for several reasons:

- to support both `Scala 2.11` and `Scala 2.12` as they are not binary compatible and the Spark project is moving to `Scala 2.12`. Arc will be published for both `Scala 2.11` and `Scala 2.12` (a short build sketch follows this list).
- to decouple stages/extensions reliant on third-party packages from the main repository so that Arc is not dependent on a library which does not yet support `Scala 2.12` (for example).
- to support first-class plugins by providing an API that gives plugin authors the same type safety when reading the job configuration as the core Arc pipeline stages (in fact, all core stages have been rebuilt as included plugins). This extends to allowing a version number to be specified during stage resolution.
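
As a rough illustration of the cross-build support, the sbt `+` prefix runs a task once for every entry in `crossScalaVersions` (`2.11.12` and `2.12.8` in the `build.sbt` changes below), while `++` targets a single version. A minimal sketch:

```bash
# package the library for every Scala version listed in crossScalaVersions
sbt +package

# publish signed artifacts for every Scala version (as the CI pipeline below does)
sbt +publishSigned

# or target a single Scala version explicitly
sbt ++2.12.8 package
```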

**BREAKING**

**REMOVED**
- remove `AzureCosmosDBExtract` stage. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `AzureEventHubsLoad` stage. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `DatabricksDeltaExtract` and `DatabricksDeltaLoad` stages and replace them with the open-source [DeltaLake](https://delta.io/) versions (`DeltaLakeExtract` and `DeltaLakeLoad`) implemented in https://github.com/tripl-ai/arc-deltalake-pipeline-plugin.
- remove `DatabricksSQLDWLoad`. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `bulkload` mode from `JDBCLoad`. Any target-specific JDBC behaviours could be implemented by a custom plugin if required.
- remove `user` and `password` from `JDBCExecute` for consistency. Move details to either `jdbcURL` or `params`.
- remove the `Dockerfile` and move it to a separate repository: https://github.com/tripl-ai/docker.

**CHANGES**
- change the `inputURI` field of `TypingTransform` to `schemaURI` to allow the addition of `schemaView`.
- add `CypherTransform` and `GraphTransform` stages to support the https://github.com/opencypher/morpheus project (https://github.com/tripl-ai/arc-graph-pipeline-plugin).
- add `MongoDBExtract` and `MongoDBLoad` stages (https://github.com/tripl-ai/arc-mongodb-pipeline-plugin).
- move `ElasticsearchExtract` and `ElasticsearchLoad` to their own repository https://github.com/tripl-ai/arc-elasticsearch-pipeline-plugin.
- move `KafkaExtract`, `KafkaLoad` and `KafkaCommitExecute` to their own repository https://github.com/tripl-ai/arc-kafka-pipeline-plugin (a usage sketch follows this list).
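
The decoupled plugins are published as separate artifacts that must be added to the Spark classpath alongside Arc. The coordinates below are hypothetical (only the `ai.tripl` organization is confirmed by the `build.sbt` changes in this commit), so treat this as a sketch rather than exact usage:

```bash
# hypothetical coordinates and placeholders; only the ai.tripl group id is confirmed by build.sbt below
spark-submit \
  --packages ai.tripl:arc-kafka-pipeline-plugin_2.12:<plugin-version> \
  --class <arc-main-class> \
  <arc-assembly-jar> \
  <job-arguments>
```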

# 1.15.0

- added `uriField` and `bodyField` to `HTTPExtract`, allowing dynamic data to be generated and `POST`ed to endpoints when using an `inputView`.
95 changes: 0 additions & 95 deletions Dockerfile

This file was deleted.

29 changes: 11 additions & 18 deletions README.md
@@ -7,7 +7,7 @@ Full documentation is available here: https://arc.tripl.ai
## Index

- [What is Arc?](#what-is-spark-etl-pipeline)
- [Notebook](#what-is-spark-etl-pipeline)
- [Getting Started](#getting-started)
- [Principles](#principles)
- [Not just for data engineers](#not-just-for-data-engineers)
- [Why abstract from code?](#why-abstract-from-code)
@@ -25,20 +25,22 @@ Arc is an **opinionated** framework for defining **predictable**, **repeatable**
- **repeatable** in that if a job is executed multiple times it will produce the same result.
- **manageable** in that execution considerations and logging have been baked in from the start.

## Notebook
## Getting Started

![Notebook](/docs-src/static/img/arc-starter.png)

Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).
Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. Start by cloning [https://github.com/tripl-ai/arc-starter](https://github.com/tripl-ai/arc-starter) and running through the [tutorial](https://arc.tripl.ai/tutorial/).

This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).
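
As a minimal sketch of getting started locally (assuming only `git` is installed; the exact startup command is documented in the arc-starter repository itself):

```bash
# clone the starter project, then follow its README and the tutorial at https://arc.tripl.ai/tutorial/
git clone https://github.com/tripl-ai/arc-starter.git
cd arc-starter
```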

## Principles

Many of these principles have come from [12factor](https://12factor.net/):
Many of these principles have come from [12factor](https://12factor.net/) with the aim of deploying **predictable**, **repeatable** and **manageable** data transformation pipelines:

- **[single responsibility](https://en.wikipedia.org/wiki/Single_responsibility_principle)** components/stages.
- **[single responsibility](https://en.wikipedia.org/wiki/Single_responsibility_principle)** components/stages with first-class support for [extending](https://arc.tripl.ai/extend/).
- **stateless** jobs where possible and use of [immutable](https://en.wikipedia.org/wiki/Immutable_object) datasets.
- **precise logging** to allow management of jobs at scale.
- **library dependencies** are to be limited or avoided where possible.
- **library dependencies** limited or avoided where possible.

## Not just for data engineers

@@ -139,7 +141,7 @@ A full worked example job is available [here](https://github.com/tripl-ai/arc/tr
To compile the main library run:

```bash
sbt package
sbt +package
```

To build a library for use with a [Databricks Runtime](https://databricks.com/product/databricks-runtime) environment it is easiest to run `assembly` to bundle Arc with all of its dependencies into a single JAR, which simplifies deployment.
@@ -153,24 +155,15 @@ sbt assembly
If you are having problems compiling, it is likely due to environment setup. This command is executed in CICD and uses a predictable build environment pulled from Dockerhub:

```bash
docker run --rm -v $(pwd):/app -w /app mozilla/sbt:8u212_1.2.8 sbt assembly
```

### Dockerfile

To build the docker image:

```bash
export ARC_VERSION=$(awk -F'"' '$0=$2' version.sbt)
docker build . --build-arg ARC_VERSION=${ARC_VERSION} -t triplai/arc:${ARC_VERSION}
docker run --rm -v $(pwd):/app -w /app mozilla/sbt:8u212_1.2.8 sbt +package
```

### Tests

To run unit tests:

```bash
sbt test
sbt +test
```

To run integration tests (which have external service dependencies):
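
The exact command is collapsed in this diff; based on the CI pipeline below it is most likely the cross-built variant:

```bash
sbt +it:test
```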
29 changes: 22 additions & 7 deletions azure-pipelines.yml
@@ -5,6 +5,7 @@

trigger:
- master
- release/*
- develop
- feature/*

@@ -30,19 +31,23 @@ steps:
displayName: 'Move pgp keys'

- script: |
docker-compose -f src/it/resources/docker-compose.yml up --build -d
displayName: 'Create a docker-compose based testing environment (including mozilla/sbt:8u212_1.2.8)'
docker-compose \
-f src/it/resources/docker-compose.yml \
up \
--build \
-d
displayName: 'Create a docker-compose based testing environment (including idle mozilla/sbt:8u212_1.2.8)'

- script: |
docker exec \
sbt \
sbt test
sbt "+test"
displayName: 'sbt test'

- script: |
docker exec \
sbt \
sbt it:test
sbt "+it:test"
displayName: 'sbt it:test'

- script: |
@@ -51,6 +56,16 @@
-e SONATYPE_PASSWORD=$(SONATYPE_PASSWORD) \
-e PGP_PASSPHRASE=$(PGP_PASSPHRASE) \
sbt \
sbt publishSigned
displayName: 'sbt publishSigned (push to https://oss.sonatype.org/content/groups/staging/)'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
sbt "+publishSigned"
displayName: 'sbt publishSigned (push to https://oss.sonatype.org/content/groups/staging/ai/tripl/)'
condition: and(succeeded(), contains(variables['Build.SourceBranch'], 'refs/heads/release'))

- script: |
docker exec \
-e SONATYPE_USERNAME=$(SONATYPE_USERNAME) \
-e SONATYPE_PASSWORD=$(SONATYPE_PASSWORD) \
-e PGP_PASSPHRASE=$(PGP_PASSPHRASE) \
sbt \
sbt "+sonatypeRelease"
displayName: 'sbt sonatypeRelease (push to https://repo1.maven.org/maven2/ai/tripl/)'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
21 changes: 19 additions & 2 deletions build.sbt
@@ -1,13 +1,17 @@
import Dependencies._

lazy val scala211 = "2.11.12"
lazy val scala212 = "2.12.8"
lazy val supportedScalaVersions = List(scala211, scala212)

lazy val root = (project in file(".")).
enablePlugins(BuildInfoPlugin).
configs(IntegrationTest).
settings(
name := "arc",
organization := "ai.tripl",
organizationHomepage := Some(url("https://arc.tripl.ai")),
scalaVersion := "2.11.12",
crossScalaVersions := supportedScalaVersions,
licenses := List("MIT" -> new URL("https://opensource.org/licenses/MIT")),
scalastyleFailOnError := false,
libraryDependencies ++= etlDeps,
@@ -25,7 +29,20 @@ lazy val root = (project in file(".")).

fork in run := true

scalacOptions := Seq("-target:jvm-1.8", "-unchecked", "-deprecation")
scalacOptions := Seq(
"-deprecation",
"-encoding", "utf-8",
"-explaintypes",
"-target:jvm-1.8",
"-unchecked"

//"-Ywarn-dead-code",
//"-Ywarn-extra-implicit",
//"-Ywarn-inaccessible",
//"-Ywarn-infer-any",
//"-Ywarn-unused:privates",
//"-Ywarn-unused:imports"
)

test in assembly := {}

44 changes: 23 additions & 21 deletions docs-src/config.toml
@@ -20,10 +20,12 @@ pygmentsUseClasses=true
repo_url = "https://github.com/tripl-ai/arc"

image = "triplai/arc"
version = "1.15.0"
version = "2.0.0"
arc_jupyter_image= "triplai/arc-jupyter"
arc_jupyter_version = "0.0.14"
spark_version = "2.4.3"
scala_version = "2.11"
hadoop_version = "2.9.2"
logo = "images/logo.png"
favicon = ""

@@ -47,10 +49,10 @@ pygmentsUseClasses=true
[social]
github = "tripl-ai"

# [[menu.main]]
# name = "Getting started"
# url = "getting-started/"
# weight = 10
[[menu.main]]
name = "Getting started"
url = "getting-started/"
weight = 10

[[menu.main]]
name = "Tutorial"
@@ -88,29 +90,29 @@ pygmentsUseClasses=true
weight = 70

[[menu.main]]
name = "Partials"
url = "partials/"
weight = 80
name = "Deploy"
url = "deploy/"
weight = 80

[[menu.main]]
name = "Patterns"
url = "patterns/"
weight = 90
name = "Plugins"
url = "plugins/"
weight = 85

[[menu.main]]
name = "Deploy"
url = "deploy/"
weight = 95
name = "Partials"
url = "partials/"
weight = 90

[[menu.main]]
name = "Extend"
url = "extend/"
weight = 98
name = "Patterns"
url = "patterns/"
weight = 95

[[menu.main]]
name = "Contributing"
url = "contributing/"
weight = 100
#[[menu.main]]
# name = "Contributing"
# url = "contributing/"
# weight = 100

[[menu.main]]
name = "License"
8 changes: 8 additions & 0 deletions docs-src/content/_index.md
@@ -12,6 +12,14 @@ Arc is an **opinionated** framework for defining **predictable**, **repeatable**
- **repeatable** in that if a job is executed multiple times it will produce the same result.
- **manageable** in that execution considerations and logging have been baked in from the start.

## Getting Started

![Notebook](/img/arc-starter.png)

Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. Start by cloning [https://github.com/tripl-ai/arc-starter](https://github.com/tripl-ai/arc-starter) and running through the [tutorial](https://arc.tripl.ai/tutorial/).

This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).

## Principles

Many of these principles have come from [12factor](https://12factor.net/):
1 change: 1 addition & 0 deletions docs-src/content/contributing/index.md
@@ -2,4 +2,5 @@
title: Contributing
weight: 100
type: blog
draft: true
---
