<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Arc</title>
<link>https://arc.tripl.ai/</link>
<description>Recent content on Arc</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Wed, 09 Mar 2016 00:11:02 +0100</lastBuildDate>
<atom:link href="https://arc.tripl.ai/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Getting started</title>
<link>https://arc.tripl.ai/getting-started/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/getting-started/</guid>
<description>Notebook
Arc provides an interactive development experience via a custom Jupyter Notebooks extension. This has been bundled and is available as a Docker image: https://hub.docker.com/r/triplai/arc-jupyter
Get Started
To quickly get started with a real-world example using real data you can clone the Arc Starter project, which includes job definitions and a limited set of data so you can quickly try Arc:
git clone https://github.com/tripl-ai/arc-starter.git
cd arc-starter
./.develop.sh
To work through a complete example, try the tutorial.</description>
</item>
<item>
<title>Tutorial</title>
<link>https://arc.tripl.ai/tutorial/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/tutorial/</guid>
<description>This tutorial works through a real-world example using the New York City Taxi dataset, which has been used heavily around the web (see: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance and A Billion Taxi Rides in Redshift) due to its 1 billion+ record count and the scripted process available on GitHub.
It is a great dataset as it has a lot of the attributes of real-world data that need to be considered:</description>
</item>
<item>
<title>Extract</title>
<link>https://arc.tripl.ai/extract/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/extract/</guid>
<description>*Extract stages read in data from a database or file system.
*Extract stages should meet these criteria:
Read data from local or remote filesystems and return a DataFrame.
Do not transform/mutate the data.
Allow for Predicate Pushdown depending on the data source.
File-based *Extract stages can accept glob patterns as input filenames, which can be very useful for loading just a subset of data, for example delta processing (see the sketch after this item):</description>
</item>
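<!--
  The description above is truncated before its delta processing example. A minimal sketch follows,
  assuming a file-based DelimitedExtract stage; the inputURI glob, view name and other attribute values
  are illustrative assumptions rather than content taken from this feed.

  {
    "type": "DelimitedExtract",
    "name": "load only the March 2019 customer delta files",
    "environments": ["production", "test"],
    "inputURI": "hdfs://datalake/raw/customers/customers_2019-03-*.csv",
    "outputView": "customers_delta",
    "delimiter": "Comma",
    "header": true
  }
-->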
<item>
<title>Transform</title>
<link>https://arc.tripl.ai/transform/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/transform/</guid>
<description>*Transform stages apply a single transformation to one or more incoming datasets.
Transformers should meet these criteria:
Be logically pure.
Perform only a single function.
Utilise Spark internal functionality where possible.
CypherTransform
Since: 2.0.0 - Supports Streaming: True
Plugin
The CypherTransform is provided by the https://github.com/tripl-ai/arc-graph-pipeline-plugin package.
The CypherTransform executes a Cypher graph query against a graph already created by a GraphTransform stage.
Parameters (Attribute | Type | Required | Description):
name | String | true | Name of the stage for logging.</description>
</item>
<item>
<title>Load</title>
<link>https://arc.tripl.ai/load/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/load/</guid>
<description>*Load stages write out Spark datasets to a database or file system.
*Load stages should meet these criteria:
Take in a single dataset.
Perform target-specific validation that the dataset has been written correctly.
AvroLoad
Since: 1.0.0 - Supports Streaming: False
The AvroLoad writes an input DataFrame to a target Apache Avro file (see the sketch after this item).
Parameters (Attribute | Type | Required | Description):
name | String | true | Name of the stage for logging.</description>
</item>
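<!--
  A minimal sketch of an AvroLoad stage, assuming the standard Arc load attributes inputView, outputURI
  and saveMode; the view name, output path and saveMode value are illustrative assumptions.

  {
    "type": "AvroLoad",
    "name": "write customers to avro",
    "environments": ["production", "test"],
    "inputView": "customers",
    "outputURI": "hdfs://datalake/output/customers.avro",
    "saveMode": "Overwrite"
  }
-->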
<item>
<title>Execute</title>
<link>https://arc.tripl.ai/execute/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/execute/</guid>
<description>*Execute stages are used to execute arbitrary commands against external systems such as databases and APIs.
CassandraExecute
Since: 2.0.0 - Supports Streaming: False
Plugin
The CassandraExecute is provided by the https://github.com/tripl-ai/arc-cassandra-pipeline-plugin package.
The CassandraExecute executes a CQL statement against an external Cassandra cluster.
Parameters (Attribute | Type | Required | Description):
name | String | true | Name of the stage for logging.
environments | Array[String] | true | A list of environments under which this stage will be executed.</description>
</item>
<item>
<title>Validate</title>
<link>https://arc.tripl.ai/validate/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/validate/</guid>
<description>*Validate stages are used to perform validation and basic workflow controls.
EqualityValidate
Since: 1.0.0 - Supports Streaming: False
The EqualityValidate takes two input DataFrames and will succeed if they are identical or fail if not. This stage is useful in automated testing as it can be used to validate that a derived dataset equals a known &lsquo;good&rsquo; dataset (see the sketch after this item).
This stage will validate:
Same number of columns.
Same data type of columns.</description>
</item>
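<!--
  A minimal sketch of an EqualityValidate stage; the leftView/rightView attribute names and the view
  names are assumptions for illustration and are not documented in this feed excerpt.

  {
    "type": "EqualityValidate",
    "name": "check derived customers equals known good customers",
    "environments": ["test"],
    "leftView": "derived_customers",
    "rightView": "expected_customers"
  }
-->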
<item>
<title>Metadata</title>
<link>https://arc.tripl.ai/metadata/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/metadata/</guid>
<description>The metadata format, consumed in the TypingTransform stage, is an opinionated format for specifying common data typing actions (see the sketch after this item).
It is designed to:
Support common data typing conversions found in business datasets.
Support limited &lsquo;schema evolution&rsquo; of source data in the form of allowed lists of accepted input formats.
Collect errors into array columns so that a user can decide how to handle errors once all have been collected.
Common Attributes (Attribute | Type | Required | Description):
id | String | true | A unique identifier for this field.</description>
</item>
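<!--
  A minimal sketch of the metadata format described above, assuming the common field attributes name,
  type, trim, nullable, nullableValues and formatters; the id values, field names and formatter pattern
  are illustrative assumptions.

  [
    {
      "id": "982cbf60-7ba7-4e50-a09b-d8624a5c49e6",
      "name": "first_name",
      "description": "customer first name",
      "type": "string",
      "trim": true,
      "nullable": true,
      "nullableValues": ["", "null"]
    },
    {
      "id": "0e8109ba-692d-4e02-8ab4-53eb9c195a22",
      "name": "birth_date",
      "description": "customer date of birth",
      "type": "date",
      "trim": true,
      "nullable": true,
      "nullableValues": ["", "null"],
      "formatters": ["uuuu-MM-dd"]
    }
  ]
-->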
<item>
<title>Deploy</title>
<link>https://arc.tripl.ai/deploy/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/deploy/</guid>
<description>Arc has been packaged as a Docker image to simplify deployment as a stateless process on cloud infrastructure. As there are multiple versions of Arc, Spark, Scala and Hadoop, see https://hub.docker.com/u/triplai for the relevant version.
Running a Job
An example command to start a job is:
docker run \
-e &quot;ETL_CONF_ENV=production&quot; \
-e &quot;ETL_CONF_JOB_PATH=/opt/tutorial/basic/job/0&quot; \
-it -p 4040:4040 triplai/arc:arc_2.1.0_spark_2.4.4_scala_2.11_hadoop_2.9.2_1.0.0 \
bin/spark-submit \
--master local[*] \
--class ai.tripl.arc.ARC \
/opt/spark/jars/arc.</description>
</item>
<item>
<title>Plugins</title>
<link>https://arc.tripl.ai/plugins/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/plugins/</guid>
<description>Arc can be extended in four ways by registering:
Dynamic Configuration Plugins, which allow users to inject custom configuration parameters which will be processed before resolving the job configuration file.
Lifecycle Plugins, which allow users to extend the base Arc framework with pipeline lifecycle hooks.
Pipeline Stage Plugins, which allow users to extend the base Arc framework with custom stages which allow the full use of the Spark Scala API.</description>
</item>
<item>
<title>Partials</title>
<link>https://arc.tripl.ai/partials/</link>
<pubDate>Wed, 09 Mar 2016 00:11:02 +0100</pubDate>
<guid>https://arc.tripl.ai/partials/</guid>
<description>Authentication
The Authentication map defines the authentication parameters for connecting to a remote service (e.g. HDFS, Blob Storage, etc.); see the sketch after this item.
Parameters (Attribute | Type | Required | Description):
method | String | true | A value of AzureSharedKey, AzureSharedAccessSignature, AzureDataLakeStorageToken, AzureDataLakeStorageGen2AccountKey, AzureDataLakeStorageGen2OAuth, AmazonAccessKey, GoogleCloudStorageKeyFile which defines which method should be used to authenticate with the remote service.
accountName | String | false* | Required for AzureSharedKey and AzureSharedAccessSignature.
signature | String | false* | Required for AzureSharedKey.</description>
</item>
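<!--
  A minimal sketch of an authentication map attached to a stage, using only the attributes named above
  (method, accountName, signature); the enclosing "authentication" key, account name and placeholder
  signature value are illustrative assumptions.

  "authentication": {
    "method": "AzureSharedKey",
    "accountName": "mystorageaccount",
    "signature": "base64encodedsharedkey"
  }
-->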
<item>
<title>Patterns</title>
<link>https://arc.tripl.ai/patterns/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/patterns/</guid>
<description>This section describes some job design patterns to deal with common ETL requirements.
Database Inconsistency
Writing data to targets like databases using the JDBCLoad stage raises a risk of stale reads, where a client reads a dataset which is either old or in the process of being updated and so is internally inconsistent.
Example:
create a new table each run using a JDBCLoad stage with a dynamic destination table specified via the ${JOB_RUN_DATE} environment variable (easily created with GNU date like: $(date +%Y-%m-%d))
the JDBCLoad will only complete successfully once the record counts of the source and target data have been confirmed to match
execute a JDBCExecute stage to perform a change to a view on the database to point to the new version of the table in a transaction-safe manner
if the job fails during any of these stages then the users will be unaware and will continue to consume the customers view which has the latest successful data
{
  &#34;type&#34;: &#34;JDBCLoad&#34;,
  &#34;name&#34;: &#34;load active customers to web server database&#34;,
  &#34;environments&#34;: [&#34;production&#34;, &#34;test&#34;],
  &#34;inputView&#34;: &#34;active_customers&#34;,
  &#34;jdbcURL&#34;: &#34;jdbc:postgresql://localhost:5432/customer&#34;,
  &#34;tableName&#34;: &#34;customers_&#34;${JOB_RUN_DATE},
  &#34;params&#34;: {
    &#34;user&#34;: &#34;mydbuser&#34;,
    &#34;password&#34;: &#34;mydbpassword&#34;
  }
},
{
  &#34;type&#34;: &#34;JDBCExecute&#34;,
  &#34;name&#34;: &#34;update the current view to point to the latest version of the table&#34;,
  &#34;environments&#34;: [&#34;production&#34;, &#34;test&#34;],
  &#34;inputURI&#34;: &#34;hdfs://datalake/sql/update_customer_view.</description>
</item>
<item>
<title>License</title>
<link>https://arc.tripl.ai/license/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://arc.tripl.ai/license/</guid>
<description>Arc Arc is released under the MIT License.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the &ldquo;Software&rdquo;), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:</description>
</item>
</channel>
</rss>