spark-etl

Build Status Coverage Status Join the chat at https://gitter.im/vngrs/spark-etl License

What is spark-etl?

The ETL (Extract-Transform-Load) process is a key component of many data management operations, including moving data and transforming it from one format to another. To support these operations effectively, spark-etl provides a distributed solution.

spark-etl is a Scala-based project developed with Spark, so it is scalable and distributed. spark-etl will process data from N sources to N databases. The project structure:

Extract

  • FILES (json, csv)
  • SQLs
  • NoSQLs
  • Key Value Stores
  • APIs
  • Streams

Transform

  • Row to json
  • Row to csv
  • Json to Row
  • Change records
  • Merge records
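The transform steps listed above can be pictured with a small sketch in plain Scala. It is not spark-etl's API: `TransformSketch`, `Record`, `toJson`, `toCsv` and `merge` are hypothetical names, and `Record` is only a stand-in for a Spark `Row`. It assumes flat records with primitive or string values.

```scala
// Illustration only: hypothetical names, not part of spark-etl.
object TransformSketch {
  // Stand-in for a Spark Row: an ordered list of (column, value) pairs.
  type Record = Seq[(String, Any)]

  // Escape backslashes and double quotes for a JSON string literal.
  private def jsonEscape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Row to json: numbers and booleans stay bare, other values are quoted.
  def toJson(r: Record): String =
    r.map { case (k, v) =>
      val value = v match {
        case null       => "null"
        case n: Int     => n.toString
        case n: Long    => n.toString
        case n: Double  => n.toString
        case b: Boolean => b.toString
        case other      => "\"" + jsonEscape(other.toString) + "\""
      }
      "\"" + jsonEscape(k) + "\":" + value
    }.mkString("{", ",", "}")

  // Row to csv: quote a field only when it contains a comma, quote or newline.
  def toCsv(r: Record): String =
    r.map { case (_, v) =>
      val s = if (v == null) "" else v.toString
      if (s.exists(c => c == ',' || c == '"' || c == '\n'))
        "\"" + s.replace("\"", "\"\"") + "\""
      else s
    }.mkString(",")

  // Merge records: fields from `b` override same-named fields in `a`,
  // and fields that exist only in `b` are appended.
  def merge(a: Record, b: Record): Record = {
    val overrides = b.toMap
    val updated = a.map { case (k, v) => k -> overrides.getOrElse(k, v) }
    val extra = b.filterNot { case (k, _) => a.exists(_._1 == k) }
    updated ++ extra
  }
}
```

In a Spark job the same conversions would normally be done on DataFrames rather than on in-memory sequences; the sketch only shows the shape of each transformation.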

Load

  • FILES (json, csv)
  • SQLs
  • NoSQLs
  • Key Value Stores
  • APIs
  • Streams

Pros

  • parallel ETL on cluster level
  • synchronisation of data
  • open source

Example Scenario

We want to get data from multiple sources such as MySQL and CSV. While extracting the data, we also want to filter and merge some fields/tables. In the transform layer, we want to run an SQL query. Then we want to write the transformed data to multiple targets such as S3 and Redshift.

spark-etl is the easiest way to implement this scenario!
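The data flow of this scenario can be sketched in plain Scala. This is an illustration under stated assumptions, not spark-etl's interface: every name below (`ScenarioSketch`, `User`, `Order`, `Total`, `Target`, `InMemoryTarget`) is hypothetical, the MySQL and CSV sources are replaced by in-memory data, the SQL step is stood in for by an equivalent filter, join and sum, and the S3 and Redshift targets are replaced by in-memory sinks.

```scala
// Illustration only: hypothetical names, in-memory stand-ins for real sources and targets.
object ScenarioSketch {
  final case class User(id: Int, name: String, country: String)
  final case class Order(userId: Int, amount: Double)
  final case class Total(name: String, total: Double)

  // Extract: stand-in for a MySQL table.
  def extractUsers(): Seq[User] =
    Seq(User(1, "ada", "TR"), User(2, "bob", "US"))

  // Extract: stand-in for a CSV file with "userId,amount" lines.
  // Lines that do not have exactly two fields are skipped.
  def extractOrders(csvLines: Seq[String]): Seq[Order] =
    csvLines.map(_.split(",")).collect {
      case Array(id, amount) => Order(id.trim.toInt, amount.trim.toDouble)
    }

  // Transform: filter users by country, merge with orders, sum per user.
  // Roughly what a SELECT ... JOIN ... WHERE ... GROUP BY query would express.
  def transform(users: Seq[User], orders: Seq[Order], country: String): Seq[Total] = {
    val sums = orders.groupBy(_.userId).map { case (id, os) => id -> os.map(_.amount).sum }
    users
      .filter(_.country == country)
      .flatMap(u => sums.get(u.id).map(t => Total(u.name, t)))
  }

  // Load: every target receives the same transformed rows.
  trait Target { def write(rows: Seq[Total]): Unit }

  final class InMemoryTarget extends Target {
    var rows: Seq[Total] = Seq.empty
    def write(r: Seq[Total]): Unit = { rows = r }
  }

  def run(csvLines: Seq[String], targets: Seq[Target]): Unit = {
    val out = transform(extractUsers(), extractOrders(csvLines), "TR")
    targets.foreach(_.write(out))
  }
}
```

A possible usage: create two `InMemoryTarget` instances, call `ScenarioSketch.run(Seq("1,30.0", "1,12.5", "2,7.0"), Seq(first, second))`, and both targets end up holding the same aggregated rows. Keeping extract, transform and load as separate functions mirrors the N sources to N targets structure described above.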

Tech

  • Scala - Functional Programming Language
  • ScalaTest - ScalaTest is a testing tool in the Scala ecosystem.
  • wartremover - WartRemover is a flexible Scala code linting tool.
  • scalastyle - Scalastyle examines Scala code and indicates potential problems with it.
  • scoverage - Scoverage is a code coverage tool for Scala that offers statement and branch coverage.
  • Apache Spark - Apache Spark is a fast and general engine for large-scale data processing.
  • travis-ci - Travis CI is a hosted, distributed continuous integration service used to build and test software projects.
  • coveralls - Coveralls is a web service to help you track your code coverage over time, and ensure that all your new code is fully covered.

Installation

Build spark-etl with sbt:

  • sbt clean assembly

How to become a committer

Want to contribute? Great! Let's say "Hello" on gitter.

Todos

  • Scalafmt integration
  • ETL Design

License

MIT License