Tool to replicate files from an HDFS source path to a local destination, excluding previously copied files.
Many machine learning projects need to replicate files from an HDFS-based data store or data lake to local storage while excluding files that have already been copied. Using Apache Flume, Apache NiFi, or another heavyweight ETL tool would be overengineered for such a task. HDFS Replicator, a lightweight tool that copies files from HDFS to local storage and tracks previously copied files, is designed to close exactly this gap.
The application is written in Go; Go 1.19 is required. To build the application, run:

```
go build -o replicator main.go
```

To copy files from HDFS to the local file system, run:

```
replicator -s <source-url> -d <destination-dir> -c <control-file>
```

- `-c <control-file>` - control file with the list of downloaded files (default `./files.txt`). Files listed in this file are excluded from replication; after execution the file is updated with the newly copied files so that they are skipped in subsequent runs (see the sketch after this list).
- `-d <destination-dir>` - local destination path (default `./`)
- `-s <source-url>` - source HDFS URL, for example `-s "hdfs://namenode:9000/user/data/parquet/passport"`
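The exclusion mechanism is simple: the control file holds one previously copied file name per line. A minimal sketch of loading it into a lookup set, assuming the one-name-per-line format shown in the example below (the function name `loadControlFile` is illustrative, not necessarily the actual implementation):

```go
package main

import (
	"bufio"
	"os"
)

// loadControlFile reads the control file (one file name per line) into a set
// used to skip files copied by previous runs. A missing control file is
// treated as an empty set, so the first run copies everything.
func loadControlFile(path string) (map[string]bool, error) {
	copied := make(map[string]bool)
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return copied, nil // first run: nothing copied yet
		}
		return nil, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if name := scanner.Text(); name != "" {
			copied[name] = true
		}
	}
	return copied, scanner.Err()
}
```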
If the file `dst/files.txt` contains `part-00001-90fa676f-4d1b-4ad8-9afd-781f2a39339d-c000.snappy.parquet` and you run the command shown below, you get a result like this:
```
~/hdfs-replicator > ./replicator -s "hdfs://namenode:9000/user/data/parquet/passport" -d dst -c dst/files.txt
copying file part-00000-90fa676f-4d1b-4ad8-9afd-781f2a39339d-c000.snappy.parquet
skipping file part-00001-90fa676f-4d1b-4ad8-9afd-781f2a39339d-c000.snappy.parquet
```
After execution, `dst/files.txt` contains both files:

```
part-00001-90fa676f-4d1b-4ad8-9afd-781f2a39339d-c000.snappy.parquet
part-00000-90fa676f-4d1b-4ad8-9afd-781f2a39339d-c000.snappy.parquet
```
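Under the hood, the whole job reduces to: list the source directory, copy every file not present in the control set, and append each newly copied name to the control file. A minimal sketch of that loop, assuming the `github.com/colinmarc/hdfs/v2` client library (an assumption; the project may use a different HDFS client) and the `loadControlFile` helper sketched above:

```go
package main

import (
	"fmt"
	"os"
	"path"
	"path/filepath"

	"github.com/colinmarc/hdfs/v2"
)

// replicate copies every top-level file under srcDir that is not yet in the
// copied set, appending each new name to the control file as it goes.
func replicate(client *hdfs.Client, srcDir, dstDir, controlPath string, copied map[string]bool) error {
	entries, err := client.ReadDir(srcDir)
	if err != nil {
		return err
	}
	// Open the control file for appending so newly copied files are
	// excluded from the next run.
	ctl, err := os.OpenFile(controlPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer ctl.Close()
	for _, e := range entries {
		if e.IsDir() {
			continue // this sketch only replicates top-level files
		}
		name := e.Name()
		if copied[name] {
			fmt.Println("skipping file", name)
			continue
		}
		fmt.Println("copying file", name)
		if err := client.CopyToLocal(path.Join(srcDir, name), filepath.Join(dstDir, name)); err != nil {
			return err
		}
		if _, err := fmt.Fprintln(ctl, name); err != nil {
			return err
		}
	}
	return nil
}
```

Wiring this up would then be a matter of connecting with `hdfs.New("namenode:9000")`, loading the control file, and calling `replicate` with the parsed flag values.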
Contributions are always welcome! Just fork the project and create a PR.