Documenting more S3 transfer utilities #548

@wwarriner

Description

What would you like to see added?

Caveat!

Our understanding is that s5cmd uses MD5 hashes to verify binary content integrity during uploads only, not downloads. For more thorough verification (e.g., of metadata, or using a different hash), another tool will be required. A later post in this issue documents how to use rclone check.
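
Until that post, a minimal sketch of a one-way comparison with rclone check; the remote name "lts" and the BUCKET/PREFIX path are assumptions and would need to match an rclone remote already configured for the https://s3.lts.rc.uab.edu/ endpoint:

#!/bin/bash

# Compare local files against the uploaded copies (sizes plus hashes where
# the backend supports them). "lts" is an assumed rclone remote name;
# SOURCE_PATH and BUCKET/PREFIX are placeholders.
# --one-way only checks that everything in the source exists on the destination.
rclone check --one-way SOURCE_PATH lts:BUCKET/PREFIX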

Notes

  • --stat shows totals of files transferred, failed, and successful at the end of the job
  • --numworkers=$SLURM_CPUS_ON_NODE is perfect for a single-node job
  • --endpoint-url=https://s3.lts.rc.uab.edu/ is required for our S3 endpoint
  • mv will remove the file from the source!
  • cp is what we want until we've verified the files on the destination

Tests

Tests with 8 CPUs and 8 GB memory on c0168:

  • 39 files @ 1 GiB each: ~5.1 Gbps
  • 1000 files @ 10 MiB each: ~0.95 Gbps

Tests with 100 CPUs and 200 GB memory on c0202 (amd-hdr100):

  • 1000 files @ 10 MiB each: ~8.0 Gbps
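
For reference, a rough sketch of how a batch of test files like these could be generated; the scratch path is an assumption and not necessarily what was used for the tests above:

#!/bin/bash

# Generate 1000 files of 10 MiB each from /dev/urandom (assumed scratch path).
mkdir -p "/scratch/$USER/s5cmd_test"
for i in $(seq -w 1 1000); do
    dd if=/dev/urandom of="/scratch/$USER/s5cmd_test/file_${i}.bin" \
        bs=1M count=10 status=none
done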

Example

Sample script that runs s5cmd cp and reports timing:

#!/bin/bash

# Record the start time (seconds since epoch, with nanoseconds).
start_time="$(date -u +%s.%N)"

# Copy SOURCE_PATH to s3://DESTINATION_PATH/ (both are placeholders).
# --stat prints transfer statistics at the end of the job.
s5cmd --stat \
    --numworkers="$SLURM_CPUS_ON_NODE" \
    --endpoint-url=https://s3.lts.rc.uab.edu/ \
    cp \
    SOURCE_PATH \
    s3://DESTINATION_PATH/

# Compute and report elapsed wall-clock time.
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
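
As a usage note, the script relies on $SLURM_CPUS_ON_NODE, so it is meant to run inside a Slurm job. A sketch of a submission matching the first test above; the script name is an assumption:

sbatch --job-name=s5cmd-cp --cpus-per-task=8 --mem=8G s5cmd_cp.sh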

Other thoughts

We don't fully understand the cp flag --concurrency.
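
From the s5cmd help text, our reading is that --concurrency is a flag of the cp subcommand (not a global flag) and controls how many parts of a single file are transferred in parallel during multipart transfers, independent of --numworkers. A sketch of where it would be placed; the value 10 is arbitrary:

# --concurrency goes after "cp"; our reading is that it sets the number of
# concurrent parts per file, while --numworkers sets parallel file operations.
s5cmd --endpoint-url=https://s3.lts.rc.uab.edu/ \
    cp \
    --concurrency 10 \
    SOURCE_PATH \
    s3://DESTINATION_PATH/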

There are also open questions about the RADOS Gateway frontend configuration.
