This repository was archived by the owner on Feb 16, 2024. It is now read-only.
26 commits
eee70dc
[BAHIR-75] Initital code delivery for WebHDFS data source
sourav-mazumder Nov 15, 2016
c2d53fd
[BAHIR-75] - fix RAT excludes for DataSourceRegister
ckadner Nov 16, 2016
af805e3
[BAHIR-75] - minor README fixes
ckadner Nov 16, 2016
78ff29c
[BAHIR-75] - minor README fixes (2)
ckadner Nov 16, 2016
24e79c9
[BAHIR-75] - include DataSourceRegister in Maven build
ckadner Nov 16, 2016
365ee1f
[BAHIR-75] - fix package declaration in webhdfs package object
ckadner Nov 16, 2016
d4c6e56
[BAHIR-75] - fix 798 Scalastyle violations
ckadner Nov 16, 2016
d7b3bf7
[BAHIR-75] - use "${scala.binary.version}"" instead of "2.11"
ckadner Nov 16, 2016
a77e372
[BAHIR-75] - add "spark-" prefix to artifactId consistent with other …
ckadner Nov 17, 2016
6936bd8
[BAHIR-75][WIP] - rudimentary extension of WebHdfsFileSystem
ckadner Dec 1, 2016
a9ef907
[BAHIR-75][WIP] - rudimentary extension of WebHdfsFileSystem (use ori…
ckadner Dec 1, 2016
f791f1c
WebHdfsConnector prototype
sourav-mazumder Dec 7, 2016
2932f99
[BAHIR-75][WIP] - override WebHdfsFileSystem - code style fixes, remo…
ckadner Dec 7, 2016
2497971
[BAHIR-75][WIP] - override WebHdfsFileSystem - more code style fixes,…
ckadner Dec 7, 2016
2971880
[BAHIR-75][WIP] - override WebHdfsFileSystem - more and more code sty…
ckadner Dec 7, 2016
a9bbe31
[BAHIR-75][WIP] - override WebHdfsFileSystem - add printouts for debu…
Dec 8, 2016
f6429c9
[BAHIR-75][WIP] - override WebHdfsFileSystem - add printouts for debu…
ckadner Dec 8, 2016
39f5985
[BAHIR-75][WIP] - write to remote via webhdfs
sourav-mazumder Dec 9, 2016
183b1ec
[BAHIR-75][WIP] - override WebHdfsFileSystem - fix code style errors,…
ckadner Dec 9, 2016
59cad8e
[BAHIR-75][WIP] - write files via webhdfs
sourav-mazumder Dec 12, 2016
d047318
[BAHIR-75][WIP] - write files via webhdfs continued
sourav-mazumder Dec 22, 2016
8beedf9
[BAHIR-75][WIP] - custom WebHdfsFileSystem - minor scalastyle fixes
ckadner Dec 22, 2016
b63a202
Merge branch 'master' into BAHIR-75-WebHdfsFileSystem
ckadner Dec 22, 2016
b467c52
[BAHIR-75][WIP] - remove unnecessary dependencies from pom.xml
ckadner Dec 22, 2016
d3de3a7
[BAHIR-75][WIP] - minor fixes
sourav-mazumder Dec 23, 2016
b103aa9
Merge branch 'BAHIR-75-WebHdfsFileSystem' of https://github.com/soura…
sourav-mazumder Dec 23, 2016
55 changes: 55 additions & 0 deletions datasource-webhdfs/README.md
@@ -0,0 +1,55 @@
A custom data source to read and write data from and to remote HDFS clusters using the [WebHDFS](https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) protocol.

## Linking

Using SBT:

```scala
libraryDependencies += "org.apache.bahir" %% "spark-datasource-webhdfs" % "2.1.0-SNAPSHOT"
```

Using Maven (Scala version 2.11):

```xml
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-datasource-webhdfs_2.11</artifactId>
<version>2.1.0-SNAPSHOT</version>
</dependency>
```

This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
For example, to include it when starting the spark shell:

```Shell
$ bin/spark-shell --packages org.apache.bahir:spark-datasource-webhdfs_2.11:2.1.0-SNAPSHOT
```

Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
The `--packages` argument can also be used with `bin/spark-submit`.

This library is compiled for Scala 2.10 and 2.11, and intended to support Spark 2.0 onwards.

## Examples

A data frame can be created using this custom data source as shown below:

```scala
val filePath = "webhdfs://<server name or ip>/gateway/default/webhdfs/v1/<file or folder name>"

val df = spark.read
.format("webhdfs")
.option("certValidation", "Y")
.option("userCred", "user1:pass1")
.option("header", "true")
.option("partitions", "8")
.load(filePath)
```

## Configuration options

* `certValidation` Set this to `'Y'` or `'N'`. If set to `'N'`, this component skips validation of the SSL certificate. Otherwise it downloads the certificate and validates it.
* `userCred` Set this to `'userid:password'`, the credentials required by the remote HDFS cluster to access the file.
* `partitions` The number of parallel connections the data source opens per file to read data from the remote HDFS cluster. If this option is not specified, a default value of 4 is used. The recommended value is the file size divided by the HDFS block size, rounded up to the next integer, or a multiple of that. For example, if the file size in HDFS is 0.95 GB and the block size is 128 MB, use 8 (or a multiple of 8) partitions. However, the number of partitions should not exceed (or only slightly exceed) the maximum number of parallel tasks your Spark cluster can spawn.
* `format` The format of the file. Currently only `'csv'` is supported; `'csv'` is also the default if this option is not specified.
* `output` Specify either `'LIST'` or `'Data'`. The default is `'Data'`, which returns the actual data in the file; if a folder is specified, data from all files in that folder is fetched at once. If `'LIST'` is specified, the files within the folder are listed instead.
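The recommended partition count described above is simply the file size divided by the block size, rounded up. As a sanity check on that arithmetic, it can be sketched as a small helper (`recommendedPartitions` is a hypothetical name used here for illustration; it is not part of this data source):

```scala
// Recommended partition count: the next integer at or above fileSize / blockSize.
// `recommendedPartitions` is a hypothetical helper, not part of this library.
def recommendedPartitions(fileSizeBytes: Long, blockSizeBytes: Long): Int =
  math.ceil(fileSizeBytes.toDouble / blockSizeBytes.toDouble).toInt

// A 0.95 GB file with a 128 MB block size: ceil(7.6) = 8 partitions.
val partitions = recommendedPartitions((0.95 * 1024 * 1024 * 1024).toLong, 128L * 1024 * 1024)
```

The resulting value (or a multiple of it) would then be passed as `.option("partitions", partitions.toString)`.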
63 changes: 63 additions & 0 deletions datasource-webhdfs/pom.xml
@@ -0,0 +1,63 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.bahir</groupId>
<artifactId>bahir-parent_2.11</artifactId>
<version>2.1.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>

<groupId>org.apache.bahir</groupId>
<artifactId>spark-datasource-webhdfs_2.11</artifactId>
<properties>
<sbt.project.name>datasource-webhdfs</sbt.project.name>
</properties>
<packaging>jar</packaging>
<name>Apache Bahir - Spark DataSource WebHDFS</name>
<url>http://bahir.apache.org/</url>

<dependencies>
<dependency>
<groupId>org.scalaj</groupId>
<artifactId>scalaj-http_${scala.binary.version}</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-tags_${scala.binary.version}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>