A Go package for synchronizing data between different cloud storage providers. Supports Google Cloud Storage (GCS), Amazon S3, Azure Blob Storage, and MinIO (or any S3-compatible service).
Cloud Data Sync is a tool that allows you to synchronize objects/files between different cloud storage providers. It is designed to be extensible, decoupled, and easy to use as a library or standalone application.
- Support for multiple storage providers:
- Google Cloud Storage (GCS)
- Amazon S3
- Azure Blob Storage
- MinIO (or any S3-compatible service)
- Unidirectional object synchronization (from a source to a destination)
- Metadata tracking for efficient synchronization
- Continuous synchronization with customizable interval
- On-demand single synchronization
- Change detection based on ETag and modification date
- Automatic removal of objects deleted at the source
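For illustration, the ETag/modification-date comparison behind change detection could look roughly like this (the `ObjectInfo` shape here is a simplified stand-in for the metadata the tool tracks, not the package's actual type):

```go
package main

import (
	"fmt"
	"time"
)

// ObjectInfo is a simplified, hypothetical version of the metadata the
// synchronizer records for each object.
type ObjectInfo struct {
	ETag    string
	Updated time.Time
}

// needsSync reports whether the source object differs from the last
// synchronized state: no previous record, a changed ETag, or a newer
// modification date.
func needsSync(src ObjectInfo, last *ObjectInfo) bool {
	if last == nil {
		return true // never synchronized before
	}
	return src.ETag != last.ETag || src.Updated.After(last.Updated)
}

func main() {
	last := &ObjectInfo{ETag: "abc", Updated: time.Unix(1000, 0)}
	src := ObjectInfo{ETag: "def", Updated: time.Unix(2000, 0)}
	fmt.Println(needsSync(src, last)) // true: ETag changed
	fmt.Println(needsSync(src, nil))  // true: no previous record
	fmt.Println(needsSync(*last, last)) // false: unchanged
}
```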
To install the package:

```bash
go get github.com/DjonatanS/cloud-data-sync
```

Example usage as a library:

```go
package main

import (
	"context"
	"log"

	"github.com/DjonatanS/cloud-data-sync/internal/config"
	"github.com/DjonatanS/cloud-data-sync/internal/database"
	"github.com/DjonatanS/cloud-data-sync/internal/storage"
	"github.com/DjonatanS/cloud-data-sync/internal/sync"
)

func main() {
	// Load configuration
	cfg, err := config.LoadConfig("config.json")
	if err != nil {
		log.Fatalf("Error loading configuration: %v", err)
	}

	// Initialize context
	ctx := context.Background()

	// Initialize database
	db, err := database.NewDB(cfg.DatabasePath)
	if err != nil {
		log.Fatalf("Error initializing database: %v", err)
	}
	defer db.Close()

	// Initialize provider factory
	factory, err := storage.NewFactory(ctx, cfg)
	if err != nil {
		log.Fatalf("Error initializing provider factory: %v", err)
	}
	defer factory.Close()

	// Create synchronizer
	synchronizer := sync.NewSynchronizer(db, cfg, factory)

	// Execute synchronization
	if err := synchronizer.SyncAll(ctx); err != nil {
		log.Fatalf("Error during synchronization: %v", err)
	}
}
```

To add support for a new storage provider, implement the `storage.Provider` interface:
```go
// Example implementation for a new provider
package customstorage

import (
	"context"
	"io"

	"github.com/DjonatanS/cloud-data-sync/internal/storage"
)

// Config holds provider-specific settings for the custom backend.
type Config struct{}

type Client struct {
	// Provider-specific fields
}

func NewClient(config Config) (*Client, error) {
	// Client initialization
	return &Client{}, nil
}

func (c *Client) ListObjects(ctx context.Context, bucketName string) (map[string]*storage.ObjectInfo, error) {
	// Implementation for listing objects
	return nil, nil
}

func (c *Client) GetObject(ctx context.Context, bucketName, objectName string) (*storage.ObjectInfo, io.ReadCloser, error) {
	// Implementation for getting an object
	return nil, nil, nil
}

func (c *Client) UploadObject(ctx context.Context, bucketName, objectName string, reader io.Reader, size int64, contentType string) (*storage.UploadInfo, error) {
	// Implementation for uploading an object
	return nil, nil
}

// ... implementation of the other interface methods
```

To build the application:

```bash
go build -o cloud-data-sync ./cmd/gcs-minio-sync
```

Create a configuration file as shown in the example below, or generate one with:

```bash
./cloud-data-sync --generate-config
```

Example configuration:
```json
{
  "databasePath": "data.db",
  "providers": [
    {
      "id": "gcs-bucket",
      "type": "gcs",
      "gcs": {
        "projectId": "your-gcp-project"
      }
    },
    {
      "id": "s3-storage",
      "type": "aws",
      "aws": {
        "region": "us-east-1",
        "accessKeyId": "your-access-key",
        "secretAccessKey": "your-secret-key"
      }
    },
    {
      "id": "azure-blob",
      "type": "azure",
      "azure": {
        "accountName": "your-azure-account",
        "accountKey": "your-azure-key"
      }
    },
    {
      "id": "local-minio",
      "type": "minio",
      "minio": {
        "endpoint": "localhost:9000",
        "accessKey": "minioadmin",
        "secretKey": "minioadmin",
        "useSSL": false
      }
    }
  ],
  "mappings": [
    {
      "sourceProviderId": "gcs-bucket",
      "sourceBucket": "source-bucket",
      "targetProviderId": "local-minio",
      "targetBucket": "destination-bucket"
    },
    {
      "sourceProviderId": "s3-storage",
      "sourceBucket": "source-bucket-s3",
      "targetProviderId": "azure-blob",
      "targetBucket": "destination-container-azure"
    }
  ]
}
```

To run a single synchronization:

```bash
./cloud-data-sync --config config.json --once
```

To run the continuous service (periodic synchronization):

```bash
./cloud-data-sync --config config.json --interval 60
```

You can also build and run the application using Docker. This isolates the application and its dependencies.
- Docker installed on your system.
- Google Cloud SDK (`gcloud`) installed and configured with Application Default Credentials (ADC) if using GCS. Run `gcloud auth application-default login` if you haven't already.
Navigate to the project's root directory (where the Dockerfile is located) and run:
```bash
docker build -t cloud-data-sync:latest .
```

- Configuration File (`config.json`): Ensure you have a valid `config.json` in your working directory.
- Data Directory: Create a directory (e.g., `data_dir`) in your working directory. It will store the SQLite database (`data.db`) and persist it outside the container.
- Update `databasePath`: Modify the `databasePath` in your `config.json` to point to the location inside the container where the data directory will be mounted, e.g., `"databasePath": "/app/data/data.db"`.
- GCP Credentials: The run command below assumes your GCP ADC file is at `~/.config/gcloud/application_default_credentials.json`. Adjust the path if necessary.
Execute the container using docker run. You need to mount volumes for the configuration file, the data directory, and your GCP credentials.
Example 1: Run a single synchronization (`--once`)

```bash
# Define the path to your ADC file
ADC_FILE_PATH="$HOME/.config/gcloud/application_default_credentials.json"

# Check if the ADC file exists
if [ ! -f "$ADC_FILE_PATH" ]; then
  echo "Error: GCP ADC file not found at $ADC_FILE_PATH"
  echo "Run 'gcloud auth application-default login' first."
else
  # Ensure config.json is present and data_dir exists
  # Ensure databasePath in config.json is "/app/data/data.db"
  docker run --rm \
    -v "$(pwd)/config.json":/app/config.json \
    -v "$(pwd)/data_dir":/app/data \
    -v "$ADC_FILE_PATH":/app/gcp_credentials.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/app/gcp_credentials.json \
    cloud-data-sync:latest --config /app/config.json --once
fi
```

Example 2: Run in continuous mode (`--interval`)
```bash
# Define the path to your ADC file
ADC_FILE_PATH="$HOME/.config/gcloud/application_default_credentials.json"

# Check if the ADC file exists
if [ ! -f "$ADC_FILE_PATH" ]; then
  echo "Error: GCP ADC file not found at $ADC_FILE_PATH"
  echo "Run 'gcloud auth application-default login' first."
else
  # Ensure config.json is present and data_dir exists
  # Ensure databasePath in config.json is "/app/data/data.db"
  docker run --rm \
    -v "$(pwd)/config.json":/app/config.json \
    -v "$(pwd)/data_dir":/app/data \
    -v "$ADC_FILE_PATH":/app/gcp_credentials.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/app/gcp_credentials.json \
    cloud-data-sync:latest --config /app/config.json --interval 60
fi
```

Example 3: Generate a default configuration

```bash
docker run --rm cloud-data-sync:latest --generate-config > config.json.default
```
- `storage`: Defines the common interface for all storage providers.
  - `gcs`: Implementation of the interface for Google Cloud Storage.
  - `s3`: Implementation of the interface for Amazon S3.
  - `azure`: Implementation of the interface for Azure Blob Storage.
  - `minio`: Implementation of the interface for MinIO.
- `config`: Manages the application configuration.
- `database`: Provides metadata persistence for synchronization tracking.
- `sync`: Implements the synchronization logic between providers.

Dependencies:

- Google Cloud Storage: `cloud.google.com/go/storage`
- AWS S3: `github.com/aws/aws-sdk-go/service/s3`
- Azure Blob: `github.com/Azure/azure-storage-blob-go/azblob`
- MinIO: `github.com/minio/minio-go/v7`
- SQLite: `github.com/mattn/go-sqlite3`
- Go 1.18 or higher
- Valid credentials for the storage providers you want to use
MIT
Contributions are welcome! Feel free to open issues or submit pull requests.
- Djonatan - Original author
- Memory and I/O optimization
  - Avoid reading the entire object into memory and then recreating it with `strings.NewReader(string(data))`. Instead, use `io.Pipe` or pass the `io.ReadCloser` through directly for a streaming upload.
  - Where buffering is still necessary, use `bytes.NewReader(data)` instead of converting to a string:

    ```go
    // filepath: internal/sync/sync.go
    readerFromData := bytes.NewReader(data)
    _, err = targetProvider.UploadObject(
        ctx,
        mapping.TargetBucket,
        objName,
        readerFromData,
        int64(len(data)),
        srcObjInfo.ContentType,
    )
    ```
- Parallelism and concurrency control
  - Process multiple objects in parallel (e.g., `errgroup.Group` + `semaphore.Weighted`) to increase throughput without exceeding API or memory limits.
  - Allow configuring the degree of concurrency per mapping in `config.json`.
- Retry and fault tolerance
  - Implement a retry policy with backoff for network operations (List, Get, Upload, Delete), both generic and per provider.
  - Honor deadlines and pass `ctx` into the SDKs so that cancellation immediately stops operations.
- Additional tests
  - Cover error scenarios in `SyncBuckets` (failure in `GetObject`, `UploadObject`, etc.) and ensure error counters and database status are updated correctly.
  - Create mocks via interfaces and use `gomock` or `testify/mock` to simulate failures and validate retry logic.
- Observability
  - Expose metrics (Prometheus) for synchronized objects, latency, and errors.
  - Add traces (OpenTelemetry) to track operations across providers and the database.
- Logging and levels
  - Consolidate logger calls: use `.Debug` for large payloads and flow details, `.Info` for milestones, and `.Error` always with the error attached.
  - Allow configuring the log level via a flag.
- Code quality and CI/CD
  - Add a GitHub Actions pipeline to run `go fmt`, `go vet`, `golangci-lint`, the tests, and coverage reporting.
  - Use semantic versioning for module releases.
- Configuration and extensibility
  - Support filters (prefix, regex) in each mapping.
  - Allow hooks before/after each sync (e.g., KMS keys, custom validations).
- Full metadata handling
  - Preserve and propagate all object `Metadata` (not just `ContentType`), including headers and tags.
  - Add support for ACLs and encryption (where the provider offers them).
- Graceful shutdown
  - Ensure that upon receiving a termination signal, the service waits for in-flight workers to finish or rolls their work back.
With these improvements, the project will gain in performance, resilience, test coverage, and flexibility for growth.