Commit

Merge branch 'release/2.0.0'

seddonm1 committed Jul 16, 2019
2 parents 51d3b88 + cbe3127 commit 1cc27db
Showing 286 changed files with 20,456 additions and 20,942 deletions.
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
## Change Log

# 2.0.0

Arc 2.0.0 is a major (breaking) release made for several reasons:

- to support both `Scala 2.11` and `Scala 2.12` as they are not binary compatible and the Spark project is moving to `Scala 2.12`. Arc will be published for both `Scala 2.11` and `Scala 2.12` (a short build sketch follows this list).
- to decouple stages/extensions reliant on third-party packages from the main repository so that Arc is not dependent on a library which does not yet support `Scala 2.12` (for example).
- to support first-class plugins by providing an API that gives plugin authors the same type safety when reading the job configuration as the core Arc pipeline stages (in fact, all core stages have been rebuilt as included plugins). This extends to allowing a version number to be specified during stage resolution.
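
As a rough illustration of the cross-build support, the sbt `+` prefix runs a task once for every entry in `crossScalaVersions` (`2.11.12` and `2.12.8` in the `build.sbt` changes below), while `++` targets a single version. A minimal sketch:

```bash
# package the library for every Scala version listed in crossScalaVersions
sbt +package

# publish signed artifacts for every Scala version (as the CI pipeline below does)
sbt +publishSigned

# or target a single Scala version explicitly
sbt ++2.12.8 package
```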

**BREAKING**

**REMOVED**
- remove `AzureCosmosDBExtract` stage. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `AzureEventHubsLoad` stage. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `DatabricksDeltaExtract` and `DatabricksDeltaLoad` stages and replace them with the open-source [DeltaLake](https://delta.io/) versions (`DeltaLakeExtract` and `DeltaLakeLoad`) implemented in https://github.com/tripl-ai/arc-deltalake-pipeline-plugin.
- remove `DatabricksSQLDWLoad`. This could be reimplemented as a [Lifecycle Plugin](https://tripl-ai.github.io/arc/extend/#lifecycle-plugins).
- remove `bulkload` mode from `JDBCLoad`. Any target-specific JDBC behaviours could be implemented by a custom plugin if required.
- remove `user` and `password` from `JDBCExecute` for consistency. Move details to either `jdbcURL` or `params`.
- remove the `Dockerfile` and move it to a separate repository: https://github.com/tripl-ai/docker.

**CHANGES**
- change the `inputURI` field of `TypingTransform` to `schemaURI` to allow the addition of `schemaView`.
- add `CypherTransform` and `GraphTransform` stages to support the https://github.com/opencypher/morpheus project (https://github.com/tripl-ai/arc-graph-pipeline-plugin).
- add `MongoDBExtract` and `MongoDBLoad` stages (https://github.com/tripl-ai/arc-mongodb-pipeline-plugin).
- move `ElasticsearchExtract` and `ElasticsearchLoad` to their own repository https://github.com/tripl-ai/arc-elasticsearch-pipeline-plugin.
- move `KafkaExtract`, `KafkaLoad` and `KafkaCommitExecute` to their own repository https://github.com/tripl-ai/arc-kafka-pipeline-plugin (a usage sketch follows this list).
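
The decoupled plugins are published as separate artifacts that must be added to the Spark classpath alongside Arc. The coordinates below are hypothetical (only the `ai.tripl` organization is confirmed by the `build.sbt` changes in this commit), so treat this as a sketch rather than exact usage:

```bash
# hypothetical coordinates and placeholders; only the ai.tripl group id is confirmed by build.sbt below
spark-submit \
  --packages ai.tripl:arc-kafka-pipeline-plugin_2.12:<plugin-version> \
  --class <arc-main-class> \
  <arc-assembly-jar> \
  <job-arguments>
```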

# 1.15.0

- added `uriField` and `bodyField` to `HTTPExtract`, allowing dynamic data to be generated and `POST`ed to endpoints when using an `inputView`.
95 changes: 0 additions & 95 deletions Dockerfile

This file was deleted.

29 changes: 11 additions & 18 deletions README.md
@@ -7,7 +7,7 @@ Full documentation is available here: https://arc.tripl.ai
## Index

- [What is Arc?](#what-is-spark-etl-pipeline)
- [Notebook](#what-is-spark-etl-pipeline)
- [Getting Started](#getting-started)
- [Principles](#principles)
- [Not just for data engineers](#not-just-for-data-engineers)
- [Why abstract from code?](#why-abstract-from-code)
@@ -25,20 +25,22 @@ Arc is an **opinionated** framework for defining **predictable**, **repeatable**
- **repeatable** in that if a job is executed multiple times it will produce the same result.
- **manageable** in that execution considerations and logging have been baked in from the start.

## Notebook
## Getting Started

![Notebook](/docs-src/static/img/arc-starter.png)

Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).
Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. Start by cloning [https://github.com/tripl-ai/arc-starter](https://github.com/tripl-ai/arc-starter) and running through the [tutorial](https://arc.tripl.ai/tutorial/).

This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).
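
As a minimal sketch of getting started locally (assuming only `git` is installed; the exact startup command is documented in the arc-starter repository itself):

```bash
# clone the starter project, then follow its README and the tutorial at https://arc.tripl.ai/tutorial/
git clone https://github.com/tripl-ai/arc-starter.git
cd arc-starter
```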

## Principles

Many of these principles have come from [12factor](https://12factor.net/):
Many of these principles have come from [12factor](https://12factor.net/) with the aim of deploying **predictable**, **repeatable** and **manageable** data transformation pipelines:

- **[single responsibility](https://en.wikipedia.org/wiki/Single_responsibility_principle)** components/stages.
- **[single responsibility](https://en.wikipedia.org/wiki/Single_responsibility_principle)** components/stages with first-class support for [extending](https://arc.tripl.ai/extend/).
- **stateless** jobs where possible and use of [immutable](https://en.wikipedia.org/wiki/Immutable_object) datasets.
- **precise logging** to allow management of jobs at scale.
- **library dependencies** are to be limited or avoided where possible.
- **library dependencies** limited or avoided where possible.

## Not just for data engineers

@@ -139,7 +141,7 @@ A full worked example job is available [here](https://github.com/tripl-ai/arc/tr
To compile the main library run:

```bash
sbt package
sbt +package
```

To build a library for use with a [Databricks Runtime](https://databricks.com/product/databricks-runtime) environment it is easiest to run `assembly` to bundle Arc with all of its dependencies into a single JAR, which simplifies deployment.
@@ -153,24 +155,15 @@ sbt assembly
If you are having problems compiling, it is likely due to environment setup. This command is executed in CICD and uses a predictable build environment pulled from Dockerhub:

```bash
docker run --rm -v $(pwd):/app -w /app mozilla/sbt:8u212_1.2.8 sbt assembly
```

### Dockerfile

To build the docker image:

```bash
export ARC_VERSION=$(awk -F'"' '$0=$2' version.sbt)
docker build . --build-arg ARC_VERSION=${ARC_VERSION} -t triplai/arc:${ARC_VERSION}
docker run --rm -v $(pwd):/app -w /app mozilla/sbt:8u212_1.2.8 sbt +package
```

### Tests

To run unit tests:

```bash
sbt test
sbt +test
```

To run integration tests (which have external service dependencies):
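
The exact command is collapsed in this diff; based on the CI pipeline below it is most likely the cross-built variant:

```bash
sbt +it:test
```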
29 changes: 22 additions & 7 deletions azure-pipelines.yml
@@ -5,6 +5,7 @@

trigger:
- master
- release/*
- develop
- feature/*

@@ -30,19 +31,23 @@ steps:
displayName: 'Move pgp keys'

- script: |
docker-compose -f src/it/resources/docker-compose.yml up --build -d
displayName: 'Create a docker-compose based testing environment (including mozilla/sbt:8u212_1.2.8)'
docker-compose \
-f src/it/resources/docker-compose.yml \
up \
--build \
-d
displayName: 'Create a docker-compose based testing environment (including idle mozilla/sbt:8u212_1.2.8)'

- script: |
docker exec \
sbt \
sbt test
sbt "+test"
displayName: 'sbt test'

- script: |
docker exec \
sbt \
sbt it:test
sbt "+it:test"
displayName: 'sbt it:test'

- script: |
@@ -51,6 +56,16 @@
-e SONATYPE_PASSWORD=$(SONATYPE_PASSWORD) \
-e PGP_PASSPHRASE=$(PGP_PASSPHRASE) \
sbt \
sbt publishSigned
displayName: 'sbt publishSigned (push to https://oss.sonatype.org/content/groups/staging/)'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
sbt "+publishSigned"
displayName: 'sbt publishSigned (push to https://oss.sonatype.org/content/groups/staging/ai/tripl/)'
condition: and(succeeded(), contains(variables['Build.SourceBranch'], 'refs/heads/release'))

- script: |
docker exec \
-e SONATYPE_USERNAME=$(SONATYPE_USERNAME) \
-e SONATYPE_PASSWORD=$(SONATYPE_PASSWORD) \
-e PGP_PASSPHRASE=$(PGP_PASSPHRASE) \
sbt \
sbt "+sonatypeRelease"
displayName: 'sbt sonatypeRelease (push to https://repo1.maven.org/maven2/ai/tripl/)'
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
21 changes: 19 additions & 2 deletions build.sbt
@@ -1,13 +1,17 @@
import Dependencies._

lazy val scala211 = "2.11.12"
lazy val scala212 = "2.12.8"
lazy val supportedScalaVersions = List(scala211, scala212)

lazy val root = (project in file(".")).
enablePlugins(BuildInfoPlugin).
configs(IntegrationTest).
settings(
name := "arc",
organization := "ai.tripl",
organizationHomepage := Some(url("https://arc.tripl.ai")),
scalaVersion := "2.11.12",
crossScalaVersions := supportedScalaVersions,
licenses := List("MIT" -> new URL("https://opensource.org/licenses/MIT")),
scalastyleFailOnError := false,
libraryDependencies ++= etlDeps,
@@ -25,7 +29,20 @@ lazy val root = (project in file(".")).

fork in run := true

scalacOptions := Seq("-target:jvm-1.8", "-unchecked", "-deprecation")
scalacOptions := Seq(
"-deprecation",
"-encoding", "utf-8",
"-explaintypes",
"-target:jvm-1.8",
"-unchecked"

//"-Ywarn-dead-code",
//"-Ywarn-extra-implicit",
//"-Ywarn-inaccessible",
//"-Ywarn-infer-any",
//"-Ywarn-unused:privates",
//"-Ywarn-unused:imports"
)

test in assembly := {}

44 changes: 23 additions & 21 deletions docs-src/config.toml
@@ -20,10 +20,12 @@ pygmentsUseClasses=true
repo_url = "https://github.com/tripl-ai/arc"

image = "triplai/arc"
version = "1.15.0"
version = "2.0.0"
arc_jupyter_image= "triplai/arc-jupyter"
arc_jupyter_version = "0.0.14"
spark_version = "2.4.3"
scala_version = "2.11"
hadoop_version = "2.9.2"
logo = "images/logo.png"
favicon = ""

@@ -47,10 +49,10 @@ pygmentsUseClasses=true
[social]
github = "tripl-ai"

# [[menu.main]]
# name = "Getting started"
# url = "getting-started/"
# weight = 10
[[menu.main]]
name = "Getting started"
url = "getting-started/"
weight = 10

[[menu.main]]
name = "Tutorial"
@@ -88,29 +90,29 @@ pygmentsUseClasses=true
weight = 70

[[menu.main]]
name = "Partials"
url = "partials/"
weight = 80
name = "Deploy"
url = "deploy/"
weight = 80

[[menu.main]]
name = "Patterns"
url = "patterns/"
weight = 90
name = "Plugins"
url = "plugins/"
weight = 85

[[menu.main]]
name = "Deploy"
url = "deploy/"
weight = 95
name = "Partials"
url = "partials/"
weight = 90

[[menu.main]]
name = "Extend"
url = "extend/"
weight = 98
name = "Patterns"
url = "patterns/"
weight = 95

[[menu.main]]
name = "Contributing"
url = "contributing/"
weight = 100
#[[menu.main]]
# name = "Contributing"
# url = "contributing/"
# weight = 100

[[menu.main]]
name = "License"
8 changes: 8 additions & 0 deletions docs-src/content/_index.md
@@ -12,6 +12,14 @@ Arc is an **opinionated** framework for defining **predictable**, **repeatable**
- **repeatable** in that if a job is executed multiple times it will produce the same result.
- **manageable** in that execution considerations and logging have been baked in from the start.

## Getting Started

![Notebook](/img/arc-starter.png)

Arc has an interactive [Jupyter Notebook](https://jupyter.org/) extension to help with rapid development of jobs. Start by cloning [https://github.com/tripl-ai/arc-starter](https://github.com/tripl-ai/arc-starter) and running through the [tutorial](https://arc.tripl.ai/tutorial/).

This extension is available at [https://github.com/tripl-ai/arc-jupyter](https://github.com/tripl-ai/arc-jupyter).

## Principles

Many of these principles have come from [12factor](https://12factor.net/):
1 change: 1 addition & 0 deletions docs-src/content/contributing/index.md
@@ -2,4 +2,5 @@
title: Contributing
weight: 100
type: blog
draft: true
---
