Commit eddfedd

rxin authored and pwendell committed
[SPARK-938][doc] Add OpenStack Swift support
See compiled doc at http://people.apache.org/~rxin/tmp/openstack-swift/_site/storage-openstack-swift.html

This is based on apache#1010. Closes apache#1010.

Author: Reynold Xin <rxin@apache.org>
Author: Gil Vernik <gilv@il.ibm.com>

Closes apache#2298 from rxin/openstack-swift and squashes the following commits:

ff4e394 [Reynold Xin] Two minor comments from Patrick.
279f6de [Reynold Xin] core-sites -> core-site
dfb8fea [Reynold Xin] Updated based on Gil's suggestion.
846f5cb [Reynold Xin] Added a link from overview page.
0447c9f [Reynold Xin] Removed sample code.
e9c3761 [Reynold Xin] Merge pull request apache#1010 from gilv/master
9233fef [Gil Vernik] Fixed typos
6994827 [Gil Vernik] Merge pull request #1 from rxin/openstack
ac0679e [Reynold Xin] Fixed an unclosed tr.
47ce99d [Reynold Xin] Merge branch 'master' into openstack
cca7192 [Gil Vernik] Removed white spaces from pom.xml
99f095d [Reynold Xin] Pending openstack changes.
eb22295 [Reynold Xin] Merge pull request apache#1010 from gilv/master
39a9737 [Gil Vernik] Spark integration with Openstack Swift
c977658 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
2aba763 [Gil Vernik] Fix to docs/openstack-integration.md
9b625b5 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
eff538d [Gil Vernik] SPARK-938 - Openstack Swift object storage support
ce483d7 [Gil Vernik] SPARK-938 - Openstack Swift object storage support
b6c37ef [Gil Vernik] Openstack Swift support
1 parent f25bbbd commit eddfedd

File tree

2 files changed

+154
-0
lines changed


docs/index.md

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,8 @@ options for deployment:
   * [Security](security.html): Spark security support
   * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
   * [3<sup>rd</sup> Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
+  * Integration with other storage systems:
+    * [OpenStack Swift](storage-openstack-swift.html)
   * [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system
   * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)

docs/storage-openstack-swift.md

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
---
layout: global
title: Accessing OpenStack Swift from Spark
---

Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
same URI formats as in Hadoop. You can specify a path in Swift as input through a
URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
Swift security credentials, through <code>core-site.xml</code> or via
<code>SparkContext.hadoopConfiguration</code>.
The current Swift driver requires Swift to use the Keystone authentication method.
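The <code>swift://container.PROVIDER/path</code> URI shape can be illustrated with a short, self-contained sketch. This is not part of Spark or the Swift driver; it only shows how such a URI decomposes into the container, the provider name (which must match the <code>fs.swift.service.PROVIDER.*</code> keys configured later), and the object path. The example URI is made up.

```python
from urllib.parse import urlparse

def swift_uri_parts(uri):
    """Split a swift:// URI into (container, provider, object path).

    The authority is "container.PROVIDER", where PROVIDER is the name used
    in the fs.swift.service.PROVIDER.* configuration keys.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "swift":
        raise ValueError("not a swift:// URI: " + uri)
    container, _, provider = parsed.netloc.partition(".")
    return container, provider, parsed.path

print(swift_uri_parts("swift://logs.SparkTest/2014/09/events.log"))
# -> ('logs', 'SparkTest', '/2014/09/events.log')
```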
# Configuring Swift for Better Data Locality

Although not mandatory, it is recommended to configure the Swift proxy server with
<code>list_endpoints</code> for better data locality. More information is
[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
# Dependencies

The Spark application should include the <code>hadoop-openstack</code> dependency.
For example, with Maven, add the following to the <code>pom.xml</code> file:

{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>2.3.0</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}
# Configuration Parameters

Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
There are two main categories of parameters that should be configured: the declaration of the
Swift driver, and the parameters required by Keystone.

Configuration of Hadoop to use the Swift file system is achieved via:

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>
Additional parameters are required by Keystone (v2.0) and should be provided to the Swift driver. These
parameters are used to perform authentication in Keystone to access Swift. The following table
contains a list of the mandatory Keystone parameters. <code>PROVIDER</code> can be any name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>
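As a side illustration (not part of the Swift driver or any Hadoop API), the property names in the table above expand mechanically from the provider name. A minimal Python sketch, using only the key suffixes listed in the table:

```python
# (suffix, is_mandatory) pairs, taken from the Keystone parameter table above.
KEYSTONE_KEYS = [
    ("auth.url", True), ("auth.endpoint.prefix", False), ("tenant", True),
    ("username", True), ("password", True), ("http.port", True),
    ("region", True), ("public", True),
]

def swift_property_names(provider, required_only=False):
    """Return the fs.swift.service.<provider>.* key names for a provider."""
    return ["fs.swift.service.%s.%s" % (provider, suffix)
            for suffix, required in KEYSTONE_KEYS
            if required or not required_only]

print(swift_property_names("SparkTest", required_only=True)[0])
# -> fs.swift.service.SparkTest.auth.url
```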
For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains user <code>tester</code> with password <code>testing</code>
defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}
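A hand-written <code>core-site.xml</code> is easy to get subtly wrong (for instance, a dropped <code>&lt;property&gt;</code> tag or a missing key). One way to catch that is to parse the file and check for the mandatory keys. The sketch below is not part of Spark or Hadoop; it is a stdlib-only checker, with the mandatory key suffixes taken from the Keystone table above and a shortened example document inline:

```python
import xml.etree.ElementTree as ET

# Mandatory key suffixes from the Keystone (v2.0) parameter table.
MANDATORY = ["auth.url", "tenant", "username", "password",
             "http.port", "region", "public"]

def missing_swift_keys(core_site_xml, provider):
    """Return the mandatory fs.swift.service.<provider>.* keys that are
    absent from a core-site.xml document (passed here as a string)."""
    root = ET.fromstring(core_site_xml)
    names = {p.findtext("name") for p in root.iter("property")}
    return ["fs.swift.service.%s.%s" % (provider, k)
            for k in MANDATORY
            if "fs.swift.service.%s.%s" % (provider, k) not in names]

# An (incomplete) example document: only auth.url is configured.
example = """<configuration>
  <property><name>fs.swift.service.SparkTest.auth.url</name>
            <value>http://127.0.0.1:5000/v2.0/tokens</value></property>
</configuration>"""
print(missing_swift_keys(example, "SparkTest"))
```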
Notice that
<code>fs.swift.service.PROVIDER.tenant</code>,
<code>fs.swift.service.PROVIDER.username</code>, and
<code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in
<code>core-site.xml</code> is not always a good approach.
We suggest keeping those parameters in <code>core-site.xml</code> only for testing purposes, when running Spark
via <code>spark-shell</code>.
For job submissions they should be provided via <code>sparkContext.hadoopConfiguration</code>.
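One way to follow that advice is to keep the sensitive values outside the file entirely and build the settings at job-submission time. The sketch below is an illustration, not a Spark API: the environment-variable names (<code>SWIFT_TENANT</code> etc.) are made up, and the returned name/value pairs would then be applied one by one with <code>sc.hadoopConfiguration.set(name, value)</code> inside the job.

```python
import os

# The sensitive key suffixes called out in the note above.
SENSITIVE = ("tenant", "username", "password")

def swift_credentials_from_env(provider, environ=os.environ):
    """Build the sensitive fs.swift.service.<provider>.* settings from
    environment variables instead of core-site.xml. Raises KeyError if a
    variable is unset, so a misconfigured job fails early."""
    pairs = {}
    for key in SENSITIVE:
        var = "SWIFT_" + key.upper()        # hypothetical variable name
        value = environ.get(var)
        if value is None:
            raise KeyError(var + " is not set")
        pairs["fs.swift.service.%s.%s" % (provider, key)] = value
    return pairs

# Demo with an in-memory environment standing in for os.environ.
demo_env = {"SWIFT_TENANT": "test", "SWIFT_USERNAME": "tester",
            "SWIFT_PASSWORD": "testing"}
print(swift_credentials_from_env("SparkTest", demo_env))
```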
