# Cloud Data Sync

A Go package for synchronizing data between different cloud storage providers. Supports Google Cloud Storage (GCS), Amazon S3, Azure Blob Storage, and MinIO (or any S3-compatible service).
## Overview
Cloud Data Sync is a tool that allows you to synchronize objects/files between different cloud storage providers. It is designed to be extensible, decoupled, and easy to use as a library or standalone application.
## Key Features
- Support for multiple storage providers:
  - Google Cloud Storage (GCS)
  - Amazon S3
  - Azure Blob Storage
  - MinIO (or any S3-compatible service)
- Unidirectional object synchronization (from a source to a destination)
- Metadata tracking for efficient synchronization
- Continuous synchronization with customizable interval
- On-demand single synchronization
- Change detection based on ETag and modification date (see the sketch after this list)
- Automatic removal of objects deleted at the source
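
The change-detection idea boils down to comparing the source object's current ETag and modification time against the state recorded in the metadata database. The sketch below is illustrative only; the type and field names are assumptions, not the package's actual API.

```go
package example

import "time"

// objectState mirrors the kind of per-object metadata the sync database
// could record after a successful copy (hypothetical field names).
type objectState struct {
	ETag         string
	LastModified time.Time
}

// needsSync reports whether the source object should be copied again.
// prev == nil means the object has never been synchronized.
func needsSync(prev *objectState, src objectState) bool {
	if prev == nil {
		return true // never copied before
	}
	if prev.ETag != src.ETag {
		return true // content changed
	}
	return src.LastModified.After(prev.LastModified)
}
```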
## Installation
To install the package:

```bash
go get github.com/DjonatanS/cloud-data-sync
```
## Usage as a Library

### Basic Example
```go
package main

import (
	"context"
	"log"

	"github.com/DjonatanS/cloud-data-sync/internal/config"
	"github.com/DjonatanS/cloud-data-sync/internal/database"
	"github.com/DjonatanS/cloud-data-sync/internal/storage"
	"github.com/DjonatanS/cloud-data-sync/internal/sync"
)

func main() {
	// Load configuration
	cfg, err := config.LoadConfig("config.json")
	if err != nil {
		log.Fatalf("Error loading configuration: %v", err)
	}

	// Initialize context
	ctx := context.Background()

	// Initialize database
	db, err := database.NewDB(cfg.DatabasePath)
	if err != nil {
		log.Fatalf("Error initializing database: %v", err)
	}
	defer db.Close()

	// Initialize provider factory
	factory, err := storage.NewFactory(ctx, cfg)
	if err != nil {
		log.Fatalf("Error initializing provider factory: %v", err)
	}
	defer factory.Close()

	// Create synchronizer
	synchronizer := sync.NewSynchronizer(db, cfg, factory)

	// Execute synchronization
	if err := synchronizer.SyncAll(ctx); err != nil {
		log.Fatalf("Error during synchronization: %v", err)
	}
}
```
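
For continuous synchronization, the same `SyncAll` call can simply be repeated on a timer. The helper below is a minimal sketch meant to sit alongside the example above (the CLI's own `--interval` handling may differ); the `syncer` interface is introduced here only to avoid assuming the concrete type returned by `sync.NewSynchronizer`.

```go
package main

import (
	"context"
	"log"
	"time"
)

// syncer captures the single method used below; the synchronizer created by
// sync.NewSynchronizer is expected to satisfy it.
type syncer interface {
	SyncAll(ctx context.Context) error
}

// runPeriodically calls SyncAll once per interval until ctx is cancelled.
func runPeriodically(ctx context.Context, s syncer, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := s.SyncAll(ctx); err != nil {
			log.Printf("synchronization failed: %v", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```

Replacing the single `SyncAll` call in `main` with `runPeriodically(ctx, synchronizer, 60*time.Second)` approximates the behaviour of the `--interval` flag described below.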
### Implementing a New Provider

To add support for a new storage provider, implement the `storage.Provider` interface:
```go
// Example implementation skeleton for a new provider
package customstorage

import (
	"context"
	"io"

	"github.com/DjonatanS/cloud-data-sync/internal/storage"
)

type Config struct {
	// Provider-specific settings
}

type Client struct {
	// Provider-specific fields
}

func NewClient(config Config) (*Client, error) {
	// Client initialization
	return &Client{}, nil
}

func (c *Client) ListObjects(ctx context.Context, bucketName string) (map[string]*storage.ObjectInfo, error) {
	// Implementation for listing objects
}

func (c *Client) GetObject(ctx context.Context, bucketName, objectName string) (*storage.ObjectInfo, io.ReadCloser, error) {
	// Implementation for getting an object
}

func (c *Client) UploadObject(ctx context.Context, bucketName, objectName string, reader io.Reader, size int64, contentType string) (*storage.UploadInfo, error) {
	// Implementation for uploading an object
}

// ... implementation of other interface methods
```
## Usage as an Application

### Compilation

```bash
go build -o cloud-data-sync ./cmd/gcs-minio-sync
```
### Configuration

Create a configuration file as shown in the example below, or generate one with:

```bash
./cloud-data-sync --generate-config
```

Example configuration:
```json
{
  "databasePath": "data.db",
  "providers": [
    {
      "id": "gcs-bucket",
      "type": "gcs",
      "gcs": {
        "projectId": "your-gcp-project"
      }
    },
    {
      "id": "s3-storage",
      "type": "aws",
      "aws": {
        "region": "us-east-1",
        "accessKeyId": "your-access-key",
        "secretAccessKey": "your-secret-key"
      }
    },
    {
      "id": "azure-blob",
      "type": "azure",
      "azure": {
        "accountName": "your-azure-account",
        "accountKey": "your-azure-key"
      }
    },
    {
      "id": "local-minio",
      "type": "minio",
      "minio": {
        "endpoint": "localhost:9000",
        "accessKey": "minioadmin",
        "secretKey": "minioadmin",
        "useSSL": false
      }
    }
  ],
  "mappings": [
    {
      "sourceProviderId": "gcs-bucket",
      "sourceBucket": "source-bucket",
      "targetProviderId": "local-minio",
      "targetBucket": "destination-bucket"
    },
    {
      "sourceProviderId": "s3-storage",
      "sourceBucket": "source-bucket-s3",
      "targetProviderId": "azure-blob",
      "targetBucket": "destination-container-azure"
    }
  ]
}
```
### Execution

To run a single synchronization:

```bash
./cloud-data-sync --config config.json --once
```

To run the continuous service (periodic synchronization):

```bash
./cloud-data-sync --config config.json --interval 60
```
## Usage with Docker
You can also build and run the application using Docker. This isolates the application and its dependencies.
### Prerequisites

- Docker installed on your system.
- Google Cloud SDK (`gcloud`) installed and configured with Application Default Credentials (ADC) if using GCS. Run `gcloud auth application-default login` if you haven't already.
### Build the Docker Image

Navigate to the project's root directory (where the Dockerfile is located) and run:

```bash
docker build -t cloud-data-sync:latest .
```
### Prepare for Execution

- Configuration file (`config.json`): ensure you have a valid `config.json` in your working directory.
- Data directory: create a directory (e.g., `data_dir`) in your working directory. It will store the SQLite database (`data.db`) and persist it outside the container.
- Update `databasePath`: modify `databasePath` in your `config.json` to point to the location inside the container where the data directory will be mounted, e.g., `"databasePath": "/app/data/data.db"`.
- GCP credentials: the run commands below assume your GCP ADC file is at `~/.config/gcloud/application_default_credentials.json`. Adjust the path if necessary.
### Run the Container

Execute the container using `docker run`. You need to mount volumes for the configuration file, the data directory, and your GCP credentials.
Example 1: Run a single synchronization (--once)
# Define the path to your ADC file
ADC_FILE_PATH="$HOME/.config/gcloud/application_default_credentials.json"
# Check if the ADC file exists
if [ ! -f "$ADC_FILE_PATH" ]; then
echo "Error: GCP ADC file not found at $ADC_FILE_PATH"
echo "Run 'gcloud auth application-default login' first."
else
# Ensure config.json is present and data_dir exists
# Ensure databasePath in config.json is "/app/data/data.db"
docker run --rm \\
-v "$(pwd)/config.json":/app/config.json \\
-v "$(pwd)/data_dir":/app/data \\
-v "$ADC_FILE_PATH":/app/gcp_credentials.json \\
-e GOOGLE_APPLICATION_CREDENTIALS=/app/gcp_credentials.json \\
cloud-data-sync:latest --config /app/config.json --once
fi
Example 2: Run in continuous mode (`--interval`)

```bash
# Define the path to your ADC file
ADC_FILE_PATH="$HOME/.config/gcloud/application_default_credentials.json"

# Check if the ADC file exists
if [ ! -f "$ADC_FILE_PATH" ]; then
  echo "Error: GCP ADC file not found at $ADC_FILE_PATH"
  echo "Run 'gcloud auth application-default login' first."
else
  # Ensure config.json is present and data_dir exists
  # Ensure databasePath in config.json is "/app/data/data.db"
  docker run --rm \
    -v "$(pwd)/config.json":/app/config.json \
    -v "$(pwd)/data_dir":/app/data \
    -v "$ADC_FILE_PATH":/app/gcp_credentials.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/app/gcp_credentials.json \
    cloud-data-sync:latest --config /app/config.json --interval 60
fi
```
Example 3: Generate a default configuration

```bash
docker run --rm cloud-data-sync:latest --generate-config > config.json.default
```
## Internal Packages

- `storage`: Defines the common interface for all storage providers.
  - `gcs`: Implementation of the interface for Google Cloud Storage.
  - `s3`: Implementation of the interface for Amazon S3.
  - `azure`: Implementation of the interface for Azure Blob Storage.
  - `minio`: Implementation of the interface for MinIO.
- `config`: Manages the application configuration.
- `database`: Provides metadata persistence for synchronization tracking.
- `sync`: Implements the synchronization logic between providers.
## Dependencies

- Google Cloud Storage: `cloud.google.com/go/storage`
- AWS S3: `github.com/aws/aws-sdk-go/service/s3`
- Azure Blob: `github.com/Azure/azure-storage-blob-go/azblob`
- MinIO: `github.com/minio/minio-go/v7`
- SQLite: `github.com/mattn/go-sqlite3`
## Requirements
- Go 1.18 or higher
- Valid credentials for the storage providers you want to use
## License
MIT
## Contributions
Contributions are welcome! Feel free to open issues or submit pull requests.
## Authors
## Next Updates
- Memory and I/O optimization
  - Avoid reading the entire object into memory and then recreating it with `strings.NewReader(string(data))`. Instead, use `io.Pipe` or pass the `io.ReadCloser` directly for a streaming upload (a streaming sketch follows the buffered example below).
  - Where buffering is still necessary, use `bytes.NewReader(data)` instead of converting to a string:
```go
// filepath: internal/sync/sync.go
readerFromData := bytes.NewReader(data)
_, err = targetProvider.UploadObject(
	ctx,
	mapping.TargetBucket,
	objName,
	readerFromData,
	int64(len(data)),
	srcObjInfo.ContentType,
)
```
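
For the fully streamed variant mentioned in the first sub-item, the `io.ReadCloser` returned by `GetObject` can be handed straight to `UploadObject`. The fragment below is a sketch against the interface signatures shown earlier; the `sourceProvider` variable and the `Size` field on `ObjectInfo` are assumptions about the surrounding code, not confirmed names.

```go
// Sketch: stream the object from source to target without buffering it in memory.
srcInfo, body, err := sourceProvider.GetObject(ctx, mapping.SourceBucket, objName)
if err != nil {
	return err
}
defer body.Close()

// Pass the ReadCloser straight through; no []byte copy, no string conversion.
if _, err := targetProvider.UploadObject(
	ctx,
	mapping.TargetBucket,
	objName,
	body,
	srcInfo.Size,        // assumed field name on ObjectInfo
	srcInfo.ContentType, // corresponds to srcObjInfo.ContentType in the fragment above
); err != nil {
	return err
}
```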
- Parallelism and concurrency control
  - Process multiple objects in parallel (e.g., `errgroup.Group` + `semaphore.Weighted`) to increase throughput without exceeding API or memory limits (see the sketch below).
  - Allow configuring the degree of concurrency per mapping in `config.json`.
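
A possible shape for the bounded-parallelism idea, using `errgroup` and `semaphore` from `golang.org/x/sync`; `copyOne` stands in for the per-object Get + Upload work and is hypothetical:

```go
package example

import (
	"context"

	"golang.org/x/sync/errgroup"
	"golang.org/x/sync/semaphore"
)

// syncParallel copies the listed objects with at most maxInFlight transfers
// running at once; the first error cancels the remaining work.
func syncParallel(ctx context.Context, names []string, maxInFlight int64,
	copyOne func(ctx context.Context, name string) error) error {

	sem := semaphore.NewWeighted(maxInFlight)
	g, ctx := errgroup.WithContext(ctx)

	for _, name := range names {
		name := name // capture for the closure (pre-Go 1.22 loop semantics)
		if err := sem.Acquire(ctx, 1); err != nil {
			break // ctx cancelled: stop scheduling new transfers
		}
		g.Go(func() error {
			defer sem.Release(1)
			return copyOne(ctx, name)
		})
	}
	return g.Wait()
}
```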
- Retry and fault tolerance
  - Implement a retry policy with backoff for network operations (List, Get, Upload, Delete), both generic and per provider (see the sketch below).
  - Handle deadlines and pass `ctx` through to the SDKs so that cancellation immediately stops operations.
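
One way to express such a policy is a small generic wrapper with exponential backoff; the sketch below is not tied to any provider and deliberately ignores the retryable-vs-permanent error distinction a real implementation would need:

```go
package example

import (
	"context"
	"time"
)

// withRetry runs op up to attempts times, doubling the delay between tries
// and aborting early if the context is cancelled.
func withRetry(ctx context.Context, attempts int, baseDelay time.Duration,
	op func(context.Context) error) error {

	delay := baseDelay
	for attempt := 1; ; attempt++ {
		err := op(ctx)
		if err == nil {
			return nil
		}
		if attempt >= attempts {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
}
```

Callers would wrap each `GetObject`/`UploadObject`/`DeleteObject`-style call in `withRetry`, with attempt counts tuned per provider.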
- Additional tests
  - Cover error scenarios in `SyncBuckets` (failures in `GetObject`, `UploadObject`, etc.) and ensure error counters and database status are updated correctly.
  - Create mocks behind interfaces and use `gomock` or `testify/mock` to simulate failures and validate retry logic (see the sketch below).
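
A sketch of the `testify/mock` pattern, reduced to a single hypothetical method signature so it stays self-contained (placed in a `_test.go` file); the real tests would mock the full `storage.Provider` interface and assert on the synchronizer's error counters:

```go
package example

import (
	"context"
	"errors"
	"testing"

	"github.com/stretchr/testify/mock"
	"github.com/stretchr/testify/require"
)

// mockProvider simulates a provider whose uploads fail.
type mockProvider struct{ mock.Mock }

func (m *mockProvider) UploadObject(ctx context.Context, bucket, object string) error {
	args := m.Called(ctx, bucket, object)
	return args.Error(0)
}

func TestUploadFailureIsSurfaced(t *testing.T) {
	p := new(mockProvider)
	p.On("UploadObject", mock.Anything, "dst-bucket", "obj").
		Return(errors.New("upload failed"))

	// In the real test this call would happen inside SyncBuckets.
	err := p.UploadObject(context.Background(), "dst-bucket", "obj")

	require.Error(t, err)
	p.AssertExpectations(t)
}
```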
- Observability
  - Expose metrics (Prometheus) for synchronized objects, latency, and errors (see the sketch below).
  - Add traces (OpenTelemetry) to track operations across providers and the database.
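
A sketch of the metric set this could start with, using `prometheus/client_golang`; the metric names and labels are suggestions only:

```go
package example

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Objects copied successfully, per source/target mapping.
	objectsSynced = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "cloud_data_sync_objects_synced_total",
		Help: "Objects copied to the target provider.",
	}, []string{"source", "target"})

	// Failed operations, per provider and operation type.
	syncErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "cloud_data_sync_errors_total",
		Help: "Failed storage operations.",
	}, []string{"provider", "op"})

	// Duration of a full SyncAll pass.
	syncDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "cloud_data_sync_duration_seconds",
		Help:    "Wall-clock duration of a complete synchronization pass.",
		Buckets: prometheus.DefBuckets,
	})
)
```

The default registry could then be served with `promhttp.Handler()` on a small HTTP endpoint.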
- Logging and levels
  - Consolidate logger calls: use `.Debug` for large payloads and flow details, `.Info` for milestones, and `.Error` always with the error value.
  - Allow configuring the log level via a flag (see the sketch below).
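
A sketch of wiring a `--log-level` flag to `log/slog` (Go 1.21+); the existing CLI may use a different logger, so this only shows the flag-to-level plumbing:

```go
package main

import (
	"flag"
	"log/slog"
	"os"
)

func main() {
	levelFlag := flag.String("log-level", "info", "log level: debug, info, warn or error")
	flag.Parse()

	// slog.Level knows how to parse its own textual names.
	var level slog.Level
	if err := level.UnmarshalText([]byte(*levelFlag)); err != nil {
		level = slog.LevelInfo
	}

	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))
	slog.SetDefault(logger)

	slog.Debug("visible only with --log-level=debug")
	slog.Info("logger configured")
}
```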
- Code quality and CI/CD
  - Add a GitHub Actions pipeline to run `go fmt`, `go vet`, `golangci-lint`, and the tests, and to generate coverage reports.
  - Use semantic versioning for module releases.
- Configuration and extensibility
  - Support filters (prefix, regex) in each mapping (see the sketch below).
  - Allow hooks before/after each sync (e.g., KMS keys, custom validations).
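
The per-mapping filters could boil down to a small predicate applied to object keys during listing; a sketch with hypothetical field names (these are not yet part of the config schema):

```go
package example

import (
	"regexp"
	"strings"
)

// mappingFilter sketches the filter options a mapping could gain.
type mappingFilter struct {
	Prefix  string         // only keys starting with this prefix
	Pattern *regexp.Regexp // optional regular expression; nil matches everything
}

// match reports whether an object key passes the filter.
func (f mappingFilter) match(key string) bool {
	if f.Prefix != "" && !strings.HasPrefix(key, f.Prefix) {
		return false
	}
	if f.Pattern != nil && !f.Pattern.MatchString(key) {
		return false
	}
	return true
}
```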
- Full metadata handling
  - Preserve and propagate all object `Metadata` (not just `ContentType`), including headers and tags.
  - Add support for ACLs and encryption (when the provider offers it).
- Graceful shutdown
  - On receiving a termination signal, wait for ongoing workers to finish or roll back before exiting (see the sketch below).
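
A minimal shape for this, using `signal.NotifyContext` to cancel the shared context and a `sync.WaitGroup` to wait for in-flight workers (the worker body is a placeholder):

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Cancel ctx on SIGINT/SIGTERM; every provider call should receive this ctx.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Placeholder worker: a real worker would check ctx between objects
		// and finish (or roll back) the object it is currently copying.
		<-ctx.Done()
	}()

	<-ctx.Done() // block until a termination signal arrives
	log.Println("shutting down: waiting for in-flight work to finish")
	wg.Wait()
}
```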
With these improvements, the project will gain in performance, resilience, test coverage, and flexibility for growth.