An efficient, zero-dependency command-line utility for concatenating, merging, and joining files, written in the D programming language.
- Uses low-level OS-specific functions for fast file preallocation and efficient data copying.
- Optional stacked bar chart visualizing byte distribution across input files.
- Optional support for pattern-based trimming from file contents.
- Single binary with zero dependencies — written in a fast, compiled language.
| OS | Preallocation | Copying Method |
|---|---|---|
| Linux | fallocate(2) | copy_file_range(2) / sendfile(2) |
| Windows | SetEndOfFile | Parallel double mmap |
| POSIX | posix_fallocate() | Single mmap |
| macOS | ftruncate(2) | Buffer-based copying |
All platforms also include a fallback sequential buffer-based implementation
accessible via -F | --fallback.
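
To make the preallocation column more concrete, here is a minimal sketch of reserving space for an output file before copying. It is written in Python rather than D and is not taken from the dcat sources; the helper name `preallocate` is hypothetical. It assumes `os.posix_fallocate` where the platform offers it and otherwise falls back to `os.ftruncate`, which only sets the file length.

```python
import os
import tempfile


def preallocate(fd: int, size: int) -> str:
    """Reserve `size` bytes for the open file `fd`; return the method used."""
    if size <= 0:
        return "none"  # posix_fallocate rejects a zero length
    if hasattr(os, "posix_fallocate"):
        try:
            os.posix_fallocate(fd, 0, size)  # allocates blocks, not just length
            return "posix_fallocate"
        except OSError:
            pass  # e.g. a filesystem without allocation support
    os.ftruncate(fd, size)  # sets the length only; blocks may stay sparse
    return "ftruncate"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as out:
        method = preallocate(out.fileno(), 1 << 20)
        print(method, os.fstat(out.fileno()).st_size)
```

Knowing the final size up front is what makes this possible: the output length is simply the sum of the (possibly trimmed) input lengths.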
For a deeper dive into performance, see the Benchmarking section.
- Download the latest release and refer to the Command-Line Interface section for usage instructions.
- Alternatively, follow the Building and Benchmarking sections to compile and test `dcat` from the source code.
This utility follows POSIX Utility Conventions.
In short, the following is recognized:
- Flags: Short or long strings with no value (e.g., `-f` or `--flag`). May be specified multiple times for options that act as counters.
- Options: Short or long strings with an associated value (e.g., `-f <value>` or `--flag[=]value`).
- Parameters: Required strings without prefixes (e.g., `<in1> <in2>`).
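
The three categories above can be illustrated with a small parser. This is only a sketch of the conventions using Python's `argparse`, not dcat's actual argument handling; the program name `example` and the helper `build_parser` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Flag: no value; repeated occurrences are counted (-v, -vv).
    parser.add_argument("-v", "--verbose", action="count", default=0)
    # Option: takes a value, as "-O PATH", "--output PATH" or "--output=PATH".
    parser.add_argument("-O", "--output", metavar="PATH", required=True)
    # Parameters: one or more unprefixed positional strings.
    parser.add_argument("inputs", nargs="+", metavar="input")
    return parser


args = build_parser().parse_args(["-vv", "--output=out.bin", "a.bin", "b.bin"])
print(args.verbose, args.output, args.inputs)  # 2 out.bin ['a.bin', 'b.bin']
```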
- `-h | --help`: Displays a help message with all command-line options.
- `-V | --version`: Shows the `dcat` version and compiler information.
- `<input1>, <input2>, ...`: A required list of input files to be concatenated.
- `-O <PATH> | --output=<PATH>`: A required output file path (must not already exist).
- `-N | --dry-run`: Simulates execution; analyzes files and the output schema without writing data.
- `-T <HEX> | --trim[=]<HEX>`: A hexadecimal pattern to greedily trim from both the beginning and end of each input file.
- `-F | --fallback`: Forces `dcat` to use the sequential buffer-based implementation. This is generally slower but more stable across various environments.
- `-P | --posix`: (Linux only) Uses the POSIX single `mmap` copy approach instead of Linux-specific methods.
- `--no-cow`: (Linux only) Uses `sendfile(2)` instead of `copy_file_range(2)`.
- `-v | --verbose`: Increases the application's verbosity. Can be specified up to two times (`-vv`) to print detailed debug messages.
- `--bar-padding <int>`: Left padding of the bar chart (default: 4).
- `--bar-height <int>`: Height of the bar chart (default: 10).
- `--bar-width <int>`: Width of the bar chart (default: 10).
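
As an illustration of the trimming option, the sketch below computes which byte range of an input survives trimming. It assumes that "greedy" means every consecutive whole occurrence of the pattern is removed from each end; that reading, and the helper name `trim_bounds`, are assumptions rather than documented dcat behavior.

```python
def trim_bounds(data: bytes, pattern_hex: str) -> tuple:
    """Return (start, end) so that data[start:end] has the pattern stripped."""
    pattern = bytes.fromhex(pattern_hex)
    start, end = 0, len(data)
    if not pattern:
        return start, end
    n = len(pattern)
    # Greedy: keep consuming whole occurrences from the front...
    while end - start >= n and data[start:start + n] == pattern:
        start += n
    # ...and then from the back, never crossing `start`.
    while end - start >= n and data[end - n:end] == pattern:
        end -= n
    return start, end


data = bytes.fromhex("0000deadbeef000000")
start, end = trim_bounds(data, "00")
print(start, end, data[start:end].hex())  # 2 6 deadbeef
```

Working with offsets instead of slicing the data means the surviving range can be handed directly to an offset-based copy routine.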
This project uses DUB, the official D package manager.
Optionally, it includes a Taskfile.yml for use with
Task, a cross-platform task runner. Task pairs well with D,
as both support multiple architectures.
- D compiler (DMD, GDC, or LDC)
- DUB package manager (usually bundled with the D compiler)
- Task (optional task runner)
# Clone the repository
git clone https://github.com/nadvotsky/dcat.git && cd dcat
# List available tasks
task list
# Build for the current platform (binary will be in `dcat/bin/dcat[.exe]`)
task [default]
# Or build manually with DUB
dub build [--build=release] --root ./dcat
# Run benchmarks
task benchmark

Feel free to customize the Taskfile.yml. For example,
BUILD_DIR specifies a path to the project, and BENCH_*
variables are specific to the benchmarking.
dcat includes a flexible benchmarking suite for evaluating performance
across different environments.
- Windows (1703+): Automatically downloads:
- POSIX: Requires `dd`, `sync`, and `hyperfine`.
The Taskfile.yml defines the following variables for
customizing the benchmark:
- `BENCH_SRC`: Source code directory for the benchmarking suite.
- `BENCH_TMP`: Temporary directory for test files and binaries.
- `BENCH_SZ`: A list of file sizes (in MB) to benchmark.
- `BENCH_IN`: A list of filenames to be created and used for benchmarking, each of size `BENCH_SZ`.
- `BENCH_OUT`: The name of the output file.
- `BENCH_LINUX`, `BENCH_NT`, `BENCH_POSIX`: A list of specific `dcat` variants to build and benchmark. See the Variants section for details.
Low-level file operations can be highly sensitive to a variety of system factors, which can significantly influence benchmark results. It is highly recommended to run the benchmarks on your specific system to get relevant performance data!
Key factors include:
- Number and size of files.
- Storage (RAM, HDD, SSD, NVMe, MMC) and its variations (e.g., the presence of a DRAM cache).
- Filesystems (ext4, Btrfs, XFS, NTFS, ReFS, APFS), including support for features like Copy-on-Write (CoW).
- D compiler (DMD, LDC, GDC).
- Kernel version and I/O scheduler (kyber, bfq, cfq).
- System load, cache exaggeration, hugepages, access times (`atime`), etc.
NOTE: The explanations here are only the tip of the iceberg of "zero-copy" theory. Filesystem manipulation cannot be standardized across all operating systems and filesystems; trying to do so would be a mistake. There is much more to consider: system cache behavior, buffer strategies, filesystem optimizations, portability issues between UNIX variants, and so on.
Traditionally, one of the most portable and straightforward ways to copy files is to use a buffered read/write loop. This approach remains common in thousands of applications and there is nothing wrong with it.
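
For reference, a buffered read/write loop for concatenation can be sketched as follows. This is a generic Python illustration, not dcat's fallback implementation; the helper names `buffered_copy` and `concatenate` and the 1 MiB buffer size are arbitrary choices for the example.

```python
import os
import tempfile


def buffered_copy(src_path: str, dst, bufsize: int = 1 << 20) -> int:
    """Append the contents of src_path to the open binary file dst."""
    copied = 0
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(bufsize)  # kernel -> user-space buffer
            if not chunk:              # empty read means end of file
                break
            dst.write(chunk)           # user-space buffer -> kernel
            copied += len(chunk)
    return copied


def concatenate(inputs, out_path: str) -> int:
    """Write all inputs back to back into out_path; return bytes written."""
    with open(out_path, "wb") as dst:
        return sum(buffered_copy(path, dst) for path in inputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        parts = []
        for name, payload in (("a.bin", b"hello "), ("b.bin", b"world")):
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(payload)
            parts.append(path)
        print(concatenate(parts, os.path.join(tmp, "out.bin")))  # 11
```

Every byte crosses the user/kernel boundary twice here, which is the overhead the alternatives discussed next try to avoid.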
However, with the rise of Copy-On-Write (CoW) capable filesystems and the increasing complexity of modern operating systems (e.g., optimized transfers from the system cache to a NIC), more efficient alternatives were developed.
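Starting with Linux kernel 2.2, the sendfile(2) system call became available.
It copies data between file descriptors entirely within kernel space, avoiding
the round trip through a user-space buffer and the extra context switches of a
read/write loop. Since Linux 2.6.33 the destination may be a regular file;
earlier kernels required a socket.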
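
A minimal sketch of an in-kernel copy with sendfile, using Python's `os.sendfile` wrapper rather than D. It assumes a Linux system where the destination may be a regular file; the helper name `sendfile_copy` is hypothetical and this is not dcat's code.

```python
import os
import tempfile


def sendfile_copy(src_fd: int, dst_fd: int, count: int) -> int:
    """Copy `count` bytes from src_fd to dst_fd's current position, in-kernel."""
    copied = 0
    while copied < count:
        # An explicit source offset leaves src_fd's own position untouched.
        sent = os.sendfile(dst_fd, src_fd, copied, count - copied)
        if sent == 0:  # source ended earlier than expected
            break
        copied += sent
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "in.bin")
        with open(src_path, "wb") as f:
            f.write(b"x" * 100_000)
        out_path = os.path.join(tmp, "out.bin")
        with open(src_path, "rb") as src, open(out_path, "wb") as dst:
            size = os.fstat(src.fileno()).st_size
            print(sendfile_copy(src.fileno(), dst.fileno(), size))
```

The loop matters because a single call may transfer fewer bytes than requested. Calling the function once per input against the same destination descriptor appends the inputs in order, since the destination offset advances with each call.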
Later, Linux 4.5 introduced the copy_file_range(2) syscall, designed with CoW
in mind. It lets the filesystem accelerate the copy (for example, by sharing
extents through reflinks) instead of moving bytes, and since Linux 5.3 it can
also copy between different filesystems in some configurations.
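
The following sketch shows the same kind of loop with copy_file_range, via Python's `os.copy_file_range` (Python 3.8+, Linux). Because the call can be unavailable or refused for a given pair of files, the example falls back to plain read/write; the set of error codes treated as "unsupported" and the helper name `range_copy` are assumptions made for this illustration, not dcat's logic.

```python
import errno
import os
import tempfile

# Errors taken to mean "not supported here", so a fallback is attempted.
_UNSUPPORTED = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def range_copy(src_fd: int, dst_fd: int, count: int) -> int:
    """Copy `count` bytes between the descriptors' current positions."""
    copied = 0
    while copied < count:
        try:
            n = os.copy_file_range(src_fd, dst_fd, count - copied)
        except (AttributeError, OSError) as exc:
            if isinstance(exc, OSError) and exc.errno not in _UNSUPPORTED:
                raise
            # Fallback: one buffered read/write step from the same positions.
            chunk = os.read(src_fd, min(count - copied, 1 << 20))
            view = memoryview(chunk)
            while view:
                view = view[os.write(dst_fd, view):]
            n = len(chunk)
        if n == 0:  # source ended earlier than expected
            break
        copied += n
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "in.bin")
        with open(src_path, "wb") as f:
            f.write(b"y" * 100_000)
        out_path = os.path.join(tmp, "out.bin")
        with open(src_path, "rb") as src, open(out_path, "wb") as dst:
            print(range_copy(src.fileno(), dst.fileno(), 100_000))
```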
Another alternative is to use memory-mapped files. This technique maps a file directly into a process's address space, allowing file I/O to be handled through standard memory operations. The kernel's virtual memory subsystem transparently handles page swapping.
The downside of memory-mapped files is that large files may exceed the addressable space of 32-bit applications. Additionally, behavior can vary significantly by OS: some may overcommit memory, while others may perform inefficient copying under the hood.
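
One common way to sidestep the address-space limit is to map the input in fixed-size windows instead of all at once. The sketch below does this with Python's `mmap` module; it is an illustration of the idea, not dcat's approach, and the helper name `mmap_append` and the default window size are arbitrary.

```python
import mmap
import os
import tempfile


def mmap_append(src_path: str, dst, window: int = 64 * mmap.ALLOCATIONGRANULARITY) -> int:
    """Append src_path to the open binary file dst through read-only mappings."""
    with open(src_path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:  # an empty file is never mapped
            length = min(window, size - offset)
            # Offsets must be multiples of ALLOCATIONGRANULARITY; `window` is.
            with mmap.mmap(src.fileno(), length, access=mmap.ACCESS_READ,
                           offset=offset) as view:
                dst.write(view)  # the mapping is passed as a bytes-like object
            offset += length
    return size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "in.bin")
        with open(src_path, "wb") as f:
            f.write(os.urandom(200_000))
        with open(os.path.join(tmp, "out.bin"), "wb") as dst:
            print(mmap_append(src_path, dst))
```

The window must stay a multiple of the allocation granularity, otherwise later offsets would be rejected by the mapping call.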
The benchmarking suite includes the following implementation variants:
- `sendseq` / `sendpar`: Sequential/parallel `sendfile(2)`
  - `sendpar`: Opens an additional output file handle for each thread and performs `fseek`
- `mmapseq` / `mmappar`: Sequential/parallel memory-mapped input
  - `mmappar`: Opens an additional output file handle for each thread and performs `fseek`
- `dmmapseq` / `dmmappar`: Sequential/parallel dual memory-mapped input/output
- `copyseq` / `copypar`: Sequential/parallel `copy_file_range(2)`
  - `copypar`: Opens an additional output file handle for each thread
- `chunkseq`: D language high-level chunked copy
- `blockseq`: Basic C-style buffer copy
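
The parallel variants rely on the fact that each input's position in the output is known in advance. A rough sketch of that pattern, with one output handle per worker and a seek to the input's offset, could look like this in Python. The helper names `copy_at` and `parallel_concat` are hypothetical, and the per-worker copy uses a plain buffered copy instead of the syscalls used by the actual variants.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def copy_at(src_path: str, out_path: str, offset: int) -> None:
    # Each worker opens its own output handle so seek positions do not interfere.
    with open(src_path, "rb") as src, open(out_path, "r+b") as dst:
        dst.seek(offset)
        shutil.copyfileobj(src, dst)


def parallel_concat(inputs, out_path: str) -> int:
    sizes = [os.path.getsize(path) for path in inputs]
    offsets = [0, *accumulate(sizes)]  # offsets[i] is where input i starts
    with open(out_path, "wb") as out:
        out.truncate(offsets[-1])      # set the final length up front
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(copy_at, path, out_path, offset)
                   for path, offset in zip(inputs, offsets)]
        for future in futures:
            future.result()            # re-raise any worker error
    return offsets[-1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        parts = []
        for name, payload in (("a", b"first-"), ("b", b"second-"), ("c", b"third")):
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(payload)
            parts.append(path)
        out_path = os.path.join(tmp, "out")
        print(parallel_concat(parts, out_path))  # 18
```

Whether this beats a sequential copy depends heavily on the storage device and filesystem, which is exactly what the benchmark variants are meant to measure.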
NOTE: Results are machine-specific and may not be representative of your particular environment! Refer to Challenges for more information.
dbench_dmmappar.exe ran
1.37 ± 0.04 times faster than COPY /B
2.24 ± 0.01 times faster than dbench_dmmapseq.exe
4.73 ± 0.43 times faster than dbench_chunkseq.exe
4.89 ± 0.58 times faster than dbench_blockseq.exe
5.00 ± 2.31 times faster than dbench_mmappar.exe
6.23 ± 4.60 times faster than dbench_mmapseq.exe
Dual memory mapping on Windows yields promising performance.
cat ran
1.45 ± 0.14 times faster than dbench_sendseq
1.46 ± 0.19 times faster than dbench_copyseq
1.59 ± 0.16 times faster than dbench_sendpar
1.61 ± 0.23 times faster than dbench_copypar
275.09 ± 26.85 times faster than dbench_dmmappar
352.58 ± 34.36 times faster than dbench_mmapseq
361.62 ± 35.24 times faster than dbench_chunkseq
365.65 ± 37.74 times faster than dbench_blockseq
389.06 ± 37.91 times faster than dbench_dmmapseq
560.34 ± 91.21 times faster than dbench_mmappar
It is unsurprising that GNU `cat` is highly optimized for untrimmed copies, leveraging `copy_file_range(2)` since 2022. For reference, FreeBSD also uses this approach.

However, `dcat` is competitive as a fast alternative on systems where a similarly optimized `cat` is not available, such as BusyBox-based Linux distributions and other POSIX systems. It also supports trimming, which may be challenging to implement efficiently in shell scripts.
dbench_blockseq ran
1.23 ± 0.46 times faster than dbench_chunkseq
1.30 ± 0.48 times faster than dbench_dmmapseq
1.34 ± 0.51 times faster than cat
1.46 ± 0.56 times faster than dbench_dmmappar
1.69 ± 0.63 times faster than dbench_mmappar
1.70 ± 0.63 times faster than dbench_mmapseq
While macOS's `libc` provides file copying functions, it lacks partial file copying capabilities. See Apple developer notes: NSFileManager, FSCopyObjectAsync.
Thanks for checking out dcat! Special thanks to:
- D Programming Language
- DUB Package Manager
- Task cross-platform task runner
- Hyperfine benchmarking tool
- Countless man pages, Stack Exchange threads, and community wisdom.
Here are some related links:
- ReFS `FSCTL_DUPLICATE_EXTENTS_TO_FILE_EX`
- GNU `copy_file_range` notes
- macOS `copyfile(3)`
- Linux I/O Schedulers
- Linux 5.6 I/O Scheduler Benchmarks
- Performance Tuning on Linux — Disk I/O
- Busybox `copyfd.c`
- The fastest way to copy a file
This project is licensed under the MIT License. You are free to use, modify, and distribute this software, but please provide attribution.
