Copying disk bytes in C

Table of Content

5.1 Measuring system calls overhead

6 Sendfile, splice and copy_file_range
7 Summary and conclusion
8 Going further
9 Methodology

9.1 Data
9.2 Automation

10 References

#Introduction

If you've ever done system programming in C, you may have noticed that the standard doesn't include a function to copy files on disk directly. There is no built-in function like copyfile(const char *from, const char *to, int flags) that copies a file in a single call. At first glance, this might feel like a missing piece of the standard library. Sure we can open, read and write to files using standard C functions. But there is no single function to copy an entire file at once. You could write your own copyfile function using just the standard library, but would that be a good idea?

If you want the copied file to preserve permissions, symlink status, ownership, and other metadata, the C standard library does not provide the necessary tools. However, if all you want is to copy the content of a regular file, then there are plenty of tools available. Especially on POSIX-compliant systems like Linux, there are many ways to perform file I/O efficiently and inefficiently.

In this article, I’ll walk you through six ways to copy files on Linux and highlight the pros and cons of each method.

#Copying Without I/O

In this section, I'm looking at how fast my system can copy bytes without performing any I/O operation. This benchmark ignores I/O and is meant to establish a limit for how fast the system can copy bytes. Think of memcpy as the theoretical limit: every copy method we discuss later can be compared against it.

I won't dive into all the details of memcpy, but it's highly optimized on modern systems. Internally, it uses unrolled loops and SIMD instructions to copy multiple bytes per instruction, maximizing throughput. For very large buffers, memory bandwidth and TLB misses start to dominate, which explains the higher per-byte cost we observe. Essentially, when the buffer exceeds 128 KiB, the CPU has to fetch data from main memory instead of the cache, which slows things down.

#Naive POSIX Copy

Let's begin with the most basic file-copying method available on POSIX systems: using the raw system calls open, read, and write.

open opens a file and returns a file descriptor;
read reads raw bytes from a file descriptor;
write writes raw bytes to a file descriptor.

First, we open the input file with O_RDONLY (for reading). Next, we open (or create) the output file with O_WRONLY | O_CREAT, this opens the file for writing, replacing any existing file. We then repeatedly call read to pull one byte at a time from the input file, and write to push that byte to the output file. This naive method makes direct calls to read and write until there are no bytes left to read in the file.

POSIX copy

void
copy_posix(const char* in, const char* out)
{
	int fd_in = open(in, O_RDONLY);
	int fd_out = open(out, O_WRONLY | O_CREAT, 0666);

	unsigned char c;
	while (1) {
		/* read one byte */
		if (read(fd_in, &c, 1) != 1)
			break;
		/* write one byte */
		if (write(fd_out, &c, 1) != 1)
			break;
	}
	close(fd_out);
	close(fd_in);
}

This function reads and writes 1 byte at a time. While it's fairly straightforward, it's not exactly fast. In fact, if you try to run this function, you'll see that it's excruciatingly slow. On my system, it took 29 seconds to copy a single 10MB file. It's not that my disk is slow, in fact it can reach upwards of 500MB/s of read+write speed. The issue comes from how we're copying the data. Actually, performing the same copy using the cp utility from GNU coreutils takes ~9 milliseconds only (and that's considering the entire overhead of executing cp from a shell).

Why the enormous difference? Each call to read or write triggers a system call, which means crossing the boundary between user mode and kernel mode. This triggers a mode switch. A single round-trip syscall typically costs 200–1000 ns, and this implementation performs two per bytes. Even worse than that, is the fact that when you're calling slow i/o syscalls like read, the kernel may decide to run another program while the read is fetching data from disk. That program may keep running after the disk has returned data, thus delaying our program from continuing.

Here's a simplified diagram of what happens when you call to read():

┌───────────┐      ┌─────────────┐                ┌──────┐
│ User mode │      │ Kernel mode │                │ Disk │
└───────────┘      └─────────────┘                └──────┘
      │                   │                           │
      │    read() call    │                           │
      ├──────────────────>┤                           │
      │     100-500ns     │                           │
      │                   │                           │
      │                   │                           │
      │             In Page Cache?                    │
      │               yes │ no     Submit disk I/O    │
      │              ┌────┼──────────────────────────>┤
      │              │    │        50-200us (SSD)     │
      │              │    │         5-10ms (HDD)      │
      │              │    │                           │
      │              │    │  Chunk mapped to memory   │
      │              └───>┼<──────────────────────────┤
      │                   │                           │
      │                   │ Copy to user buffer       │
      │                   ├────────────────────┐      │
      │                   │      50-300ns      │      │
      │                   ├<───────────────────┘      │
      │   read() return   │                           │
      ├<──────────────────┤                           │
      │     100-500ns     │                           │
      │                   │                           │

write() has a similar diagram, expect we don't suffer from the disk's latency since the kernel usually flushes the writes later on, without making the program wait. In the meantime, the written data sits in the kernel page cache.

The fundamental problem is that we're involving the disk, one of the slowest component of your system. In the time it takes for your disk to return the data we asked for, the CPU could have executed almost half a million basic instructions. That's the reason why modern kernels perform aggressive caching when doing i/o with the disk. Even though we request a single byte to be read every time, the kernel will typically read several kilobytes from disk into the page cache, then serve only the requested byte.

This is further justified by the fact that disks can't read one byte at a time. Disks operate on a block-basis, and you may not read less than one full block at a time. You can check your disk's block size by doing this: cat /sys/block/sdX/queue/physical_block_size (usually 512 bytes). On top of that, filesystems usually impose their own logical block size on top of the disk's physical block size. This value can be checked in /sys/block/sdX/queue/logical_block_size, and it usually 4kiB for Ext4.

Writes behave similarly. Thanks to the page cache, a write call usually does not write to disk immediately. Instead, the kernel buffers modifications and flushes them later when convenient. This improves performance dramatically, but can cause data loss on unexpected power-offs, one of the reasons why databases often bypass the page cache entirely. However, in most use-cases, caching improves performances drastically, especially if multiple programs try to read the same file simultaneously.

naive POSIX copy — As expected, this naive POSIX implementation's performances are *terrible*. If cp were added to the graph, it wouldn't be visible because it is order of magnitude faster. The per-byte cost of this naive approach is around ~1.3μs/byte (the regression slope), which isn't particularly fast, not to say *very* slow for copying.

#Copy in 100% standard C

In this section, I'll rewrite the previous POSIX copy function using only features in the C standard. I've replaced open with fopen, write with fwrite, etc.

The whole function looks something like this:

The simplest, 100% standard-compliant copy function

void
copy_stdio(const char* in, const char* out)
{
	FILE* fin = fopen(in, "r");
	FILE* fout = fopen(out, "w");

	while (1) {
		unsigned char c;
		/* read 1 byte */
		if (fread(&c, 1, 1, fin) != 1)
			break;
		/* write 1 byte */
		if (fwrite(&c, 1, 1, fout) != 1)
			break;
	}
	fclose(fout);
	fclose(fin);
}

Under the hood, a lot of things happen when we call fopen, fread and fwrite. The standard is pretty vague on how they work, but a lot of implementation add their own salt on top of what the standard requires. For instance, the standard never defines what should happen in a multi-threaded context, nor what happens when multiple programs read/write to the same file simultaneously.

But the standard does mandate one thing: I/O operations (fread, fwrite, etc.) have to be buffered. This means is that, when reading a single byte, more bytes may be requested to be available for later calls. I say may, because the standard says nothing about how buffering works in detail for these operations.

Below is the diagram on how fread() should work:

┌───────────┐      ┌─────────────┐
│  Program  │      │    libc     │
└───────────┘      └─────────────┘
      │                   │
      │                   │
      │                   │
      │                   │
      │                   │
      │                   │
      │   fread() call    │
      ├──────────────────>┤
      │                   │
      │            In libc buffer?
      │               yes │ no       read() call ┊
      │              ┌────┼─────────────────────>┤
      │              │    │                      │
      │              │    │    read() return     │
      │              └───>┼<─────────────────────┤
      │                   │                      ┊
      │                   │
      │                   │ Copy to user buffer
      │                   ├────────────────────┐
      │                   │                    │
      │                   ├<───────────────────┘
      │                   │
      │                   │
      │  fread() return   │
      ├<──────────────────┤
      │                   │
      │                   │

Note that many implementations add more stuff on top of fread. Notably glibc adds mutex locking to prevent multiple threads from using the underlying data simultaneously. Nevertheless, most implementations will attempt to efficiently manage their internal buffers, in order to minimize the number of read() system calls required.

In practice, libc's buffering is not meant to squeeze performance, but to make the experience less painful. For instance, when writing to stdout or stderr, libc will flush every time a \n is encountered.

libc's stdio copy — This function yields a similar graph to the previous naive POSIX copy, while being about faster. But it's still far from matching cp's performance.

If you pay attention to details, you may have noticed that the time is roughly the same for all sizes under 1kB. When I run the program under strace for a 512-bytes file:

[...]
openat(AT_FDCWD, "in.bin", O_RDONLY)    = 3
openat(AT_FDCWD, "out.bin", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
# First fread call
# fstat in.bin
fstat(3, {st_mode=S_IFREG|0664, st_size=512, ...}) = 0
read(3, "h#\211\277p\\\257\372\202\232\2\255\r=\304c\314\207\306\331\243\203}x\7t\327\241\277\211\305\360"..., 4096) = 512
# fstat out.bin
fstat(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
read(3, "", 4096)                       = 0
# After last fwrite call, before calling close, stdio will sync it's buffer with the file, calling write to do so
write(4, "h#\211\277p\\\257\372\202\232\2\255\r=\304c\314\207\306\331\243\203}x\7t\327\241\277\211\305\360"..., 512) = 512
close(4)                                = 0
close(3)
[...]

The first thing to notice is that even though we call fread and fwrite hundreds of times, read and write are only called once. Another odd fact is that fread will call fstat. There are many reasons for wanting to stat the file, but one reason is that stdio needs to select an appropriate flushing strategy. For instance when writing to a PIPE, data is usually flushed every time a newline (\n) is encountered, similarly this is used when writing to stdout/stderr, so your messages don't appear at the end of the program when file descriptors are closed. However, another reason for stat-ing the file is to determine the size of the file in advance. Even though I requested for a single byte to be read, fread asked for 4096 bytes to be read to the kernel. 4096 bytes might seem like an odd value, but it's actually a really specific value: the size of a memory page. In fact, if we look into how direct memory access for file descriptor works on Linux, open(2) man page says:

O_direct
[...]
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of
the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.

Since the logical block size for Ext4 is 4096, the kernel can't read less than 4096 bytes at once. However the story doesn't end here. Not only does stdio use buffering, the kernel page cache still applies when stdio calls to read or write. These buffers usually allocate at least one memory page, which corresponds to that 4096 bytes that read is invoked with. So it would look like glibc is doing a great job at buffering, it seems to be correctly exploiting the various hard constraints from the filesystem, kernel and memory.

To test this, I computed a speedup comparison of stdio vs POSIX copying:

stdio copy vs naive POSIX copy — Even for smaller file sizes, it outperforms POSIX copy. On larger files, libc's stdio is the clear winner.

As you can see, using stdio really did improve copying speed. Once the file size exceeds 4KiB, the copy is about 60x faster. That's a huge improvement. But we're still lagging far behind the simple cp utility. However, stdio is not always your friend. I ran the same benchmarks as before, this time using sizes close to the page boundary. The results speak for themselves:

stdio copy around the page boundary size

There seems to be a significant overhead around 4096 bytes, which is odd since this is the page boundary. Meaning that the same number of syscalls should be made than for copying 4095 bytes. What is even weirder, is the fact that it's slower than copying 4097 bytes. Looking at strace, 4097 bytes does cause additional system calls, since stdio will now have to read two pages. This is a hint, that what is slowing us down is not system calls, but stdio itself. To be sure of that, I set stdio's buffer size to 8kiB:

stdio copy around the page boundary size with an 8kiB buffer — Using *setvbuf* we're able to set the buffer's size for stdio. Note that the time includes the overhead caused by calling *setvbuf*.

This tells us that stdio is the culprit, but this change only moved the problem further down the line. Performing copies of about 8192 bytes will highlight the same issues as before.

Which is why it might be time to learn buffering for ourselves. In the next section I'll show you how you can squeeze performances by using your own buffering.

#Buffered POSIX copy

As I've stated in the previous section, fread and fwrite have their own buffering. And while not always optimal, it still performs way better than not using any buffers at all. So we'll implement our own custom buffering to find whether we can get more performances than stdio.

The basic idea behind buffering is that performing system calls (which in terms perform I/O operations) is inherently slow. So we do as much as possible to minimize the number of system calls our program will have to perform. In practice, that means reading more bytes at once, and letting read or write tell us how many bytes were actually read or written.

Buffered POSIX copy

#define BUFFER_SIZE (8192)

void
copy_posix_buffered(const char* in, const char* out)
{
	int fd_in = open(in, O_RDONLY);
	int fd_out = open(out, O_WRONLY | O_CREAT, 0666);
	unsigned char buf[BUFFER_SIZE];

	while (1) {
		/* read up to BUFFER_SIZE bytes */
		ssize_t nread = read(fd_in, buf, BUFFER_SIZE);
		if (nread <= 0)
			break;
		size_t pos = 0;
		/* write nread bytes */
		while (pos != (size_t)nread) {
			ssize_t nwrite = write(fd_out, buf + pos, (size_t)nread - pos);
			if (nwrite < 0)
				goto end;
			pos += (size_t)nwrite;
		}
	}
end:
	close(fd_out);
	close(fd_in);
}

This function is similar to POSIX copy, however instead of reading and writing single a byte, we process 8192 bytes at once. Why that number you may ask? We'll see later why choosing the right buffer size matters.

In theory, since our previous POSIX copy was being slowed down by the high number of systems calls, using a buffer should mitigate that. If the bottleneck really was the number of system calls, this one should be much faster since we've slashed the number of system calls made by about 8192.

And here are the results:

buffered POSIX copy — For sizes up to 4096, the time stays roughly the same, which is expected since they all finish within one call to *read* + *write*. Then the relationship becomes linear.

What comes next, is naturally to test different buffer sizes and see how they affect the copying speed. Below is a benchmark of 20 different buffer sizes, to see how they affect the average copying speed:

buffered POSIX copy with 20 buffer sizes — As the files become larger, we can clearly see the benefit of using larger buffers.

As expected, the larger our buffer is, the faster the copy becomes. It would look like it's always optimal to use a very large buffer. Buffer size 64k is outperforming every other sizes, but there's a caveat to that.

First of, larger buffers require more memory and may greatly exceed the size of the read file. As stated previously, due to ext4's restrictions, a file may not occupy less that 4kiB of disk space, so that is a hint that our buffer size should be at least 4kiB. However using a 64kiB buffer to read a small file may waste a larger amount of memory, depending of where you put your buffers. But while memory is cheap, cache is not. When your buffer fits in cache entirely, you will get the best speedup possible. If your buffer has to be in memory, then you suffer from the CPU <-> memory communication slowdown. Looking at cachegrind, we get 0.08% cache misses for a buffer of 64kiB, while a buffer of 8192 gets almost no cache misses.

The second reason is that calling read will always force the kernel to read the requested number of bytes. In theory that is a good thing, if you know the exact size of the file, you can allocate a large-enough buffer and read it in one system call. However, the kernel won't always do exactly as you planned. When you request for data to be read, that request gets split into 4kiB pages and stored in the kernel's I/O Scheduler. Then it's up to the kernel to decide when to honor these requests. It may decide to read a few pages for your process, then read another page for another process. This what's called fairness, and it's a major factor into making your operating system responsive. E.g no single process may block I/O for all others. In practice, that means sacrificing time where your program could have been performing other tasks.

Here's an example of how that might work:

┌─────────────┐      ┌────────────┐               ┌──────────┐
│  Process A  │      │   Kernel   │               │   Disk   │
└─────────────┘      └────────────┘               └──────────┘
      │   read(16kiB)      │                           │
      ├───────────────────>┤                           │
      ┊                    │                           │
┌─────────────┐            │                           │
│  Process B  │            │                           │
└─────────────┘            │                           │
      │   read(4kiB)       │                           │
      ├───────────────────>┤                           │
      ┊                    │                           │
┌─────────────┐            │                           │
│  Process C  │            │                           │
└─────────────┘            │                           │
      │   metadata I/O     │                           │
      ├───────────────────>┤                           │
      ┊                    │ Requests are split        │
                           │  into 4kiB chunks         │
                  ┌────────┴─────────┐                 │
                  │  I/O Scheduler   │                 │
                  ├────┬─────────────┤                 │
                  │ ID │   Request   │                 │
                  ├────┼─────────────┤                 │
                  │ 0  │ A read 4kiB │                 │
                  │ 1  │ A read 4kiB │                 │
                  │ 2  │ B read 4kiB │                 │
                  │ 3  │ A read 4kiB │                 │
                  │ 4  │ C metadata  │                 │
                  │ 5  │ A read 4kiB │                 │
                  └────┴───┬─────────┘                 │
                           │      Issue reads 0-2      │
                           ├───────────────────────────┤
                           │                           │
                           ├<──────────────────────────┤
      ┊     Wake up B      │                           │
      │<───────────────────┤                           │
      ┊                    │      Issue reads 3-5      │
                           ├───────────────────────────┤
                           │                           │
                           ├<──────────────────────────┤
      ┊     Wake up A      │                           │
      │<───────────────────┤                           │
      ┊                    │                           │
                           │                           │
      ┊     Wake up C      │                           │
      │<───────────────────┤                           │
      ┊                    ┊                           ┊

This is just a simple model, in reality a lot more goes on.

The kernel doesn't just split requests for fairness. The main reason is that I/O is faster when batched (and sequential on SSDs). As stated earlier, since we want to do as little I/O operations as possible (those are slow), the kernel makes a batch of i/o operations to execute later.

For really large buffers, we start to see these diminishing returns:

Checking with cachegrid, we can see that cache misses start appearing for buffers over 64kiB. Even though misses are small, they may evict other relevant data from caches, which can lead to a slowdown further down the line.

From this information, we will derive a number: the overhead of system calls; the total time spent by the program while performing syscalls. That number should help us decide which buffer size is optimal for copying a file of given size.

##Measuring system calls overhead

The system calls overhead can be defined as follows:

We perform one read plus one write system call for every piece of data we need to transfer.

In order to determine the values of and accurately, I'll consider these first order approximations:

, where and are constants;
, where and are constants.

Constants should encapsulate the constant overhead of performing system calls. Constants should represent how much overhead is required for each read/written bytes (I/O time and time spent by the kernel copying bytes across buffers).

To test out this hypothesis, I will run a micro benchmark measuring every read and write, then we'll see if a linear regression fits the data. In that case, the slope and intercept will give the and coefficients respectively.

pre-read overhead for buffered POSIX copy — This is the overhead of each read call for reading a 10MB file. The timings are averaged on each run, 500 runs were performed for each buffer size. Note how the distribution is multi-modal; this is because every time about 128kiB is read, the kernel pre-fetches the next 128kiB piece of data to be available for later reads. The number of bytes the kernel pre-fetches can be tuned in */sys/block/sdX/queue/read_ahead_kb* This phenomenon is clearly visible in the 8192 buffer size, since every 8 reads, one read will takes significantly longer than the others. Then the kernel decides to start fetching 256kiB at once, to speed things up even further. This phenomenon is highlighted by the fact that for buffers above 128kiB, the lower distribution shifts up.

The slope is roughly , meaning that every additional byte adds about of read time. The intercept, , is very large for a system call. But since the distribution is multi-modal, that intercept should be interpreted with a grain of salt.

pre-write overhead for buffered POSIX copy — This is the overhead of each write call for reading a 10MB file. The timings are averaged on each run, 500 runs were performed for each buffer size. It looks like it's faster to write than to read a file. The main reason is that kernel is buffering all writes. If the kernel were actually syncing to disk when we call write, we would see a much slower distribution, like what happened with *read*.

Due to the fact that the kernel is buffering all writes, what we see is a pure system-call scenario, without I/O. This confirms that for every system call, there will be a overhead (round trip), then it takes per bytes copied. That matches the overhead for copying a buffer (e.g through memcpy).

Therefore, the simple first order approximation seems to fit the data correctly. Thus, We can express the syscall overhead like this:

Looking at the terms, there's a constant factor that doesn't vary when we increase or decrease the buffer size. So the only term we can act on is . It would look like we'd get better results by indefinitely increasing the buffer size. However, as stated previously we don't want extremely large buffers.

So what we want is a buffer size large enough, so that the constant syscall overhead becomes negligible compared to the per-byte copy overhead.

Let's define as how much overhead from the per-syscall term we can tolerate over the per-byte term:

If we substitute , meaning we tolerate 1% of overhead over the per-byte term: . Rounding up to the nearest page boundaries: . Therefore, to minimize the per-syscall overhead, we need a buffer that can hold at least 13 memory pages. But note that there are other ways to tackle this. For instance, using 16 pages would make the kernel readahead bytes become a multiple of our buffer size, thus triggering at regular number of read calls. However, this analysis doesn't describe what happens for small buffers ( ), nor what happens for larger buffers ( ) where cache misses and kernel readahead spikes become dominant.

Overall, buffering will drastically improve copying performances. Even for smaller files, a small buffer will already yield significant speedups. And for most applications, buffering alone is often sufficient to improve read/write speeds. That said, there's an unexplored blind spot in this method. I've kept writing about how the kernel does buffering on its end too, so what would happen if we could take advantage of the kernel's buffering? Since reading and writing are both system calls, could we delegate the copy entirely to the kernel? That's what we'll see in the next section.

#Sendfile, splice and copy_file_range

In this section, I'll write about these 3 specific linux system calls:

sendfile is a system call to copy data from one file descriptor to another.
splice is like sendfile except it operates on pipes.
copy_file_range copies files directly, letting you specify offsets and sizes.

The main particularity of these system calls is that they perform copying entirely in kernel mode, also referred to as zero-copy. There are no buffers you have to manage, no read/write separation, everything is handled by these calls.

The main benefit of using these is that the kernel doesn't have to give you a buffer from reading, that you then send back to the kernel for writing. That's an added benefit since you also don't have to allocate any additional memory to store your buffers. That also means less cache pollution, CPU usage and memory bandwidth. Since now there's only one buffer containing the data, which should also improve the per-byte overhead. Furthermore the kernel is doing the heavy-lifting, therefore it can better adapt to the system's current conditions such as a busy filesystem, or I/O congestion.

Here's a rough sketch of how kernel-copy might work:

┌───────────┐      ┌────────────┐
│  Process  │      │   Kernel   │
└───────────┘      └────────────┘
      │   copy(A, B)     │
      ├─────────────────>┤
      │                  │  stat file A
      │                  ├───────────────┐
      │                  │               │
      │                  ├<──────────────┘
      │                  │ Allocate buffer to hold
      │                  │  A, optimized for copy
      │                  ├─────────────────────────┐
      │                  │                         │
      │                  ├<────────────────────────┘
      │                  │
      │                  ├<────────────────────────────┐
      │                  │  Read A into buffer         │
      │                  ├──────────────────────┐      │
      │                  │                      │      │
      │                  ├<─────────────────────┘      │
      │                  │  Write buffer into B        │
      │                  ├──────────────────────┐      │
      │                  │                      │      │
      │                  ├<─────────────────────┘      │
      │                  │      Repeat until EOF       │
      │                  ├─────────────────────────────┘
      │      return      │
      ├<─────────────────┤
      ┊                  ┊

That logic flow isn't different from the previously mentioned copy methods, but we no longer suffer the cost of performing repeated system calls.

As a side note, copy_file_range works differently on some filesystems. On Btrfs copy-on-write is supported, meaning that no data is ever copied. Instead the files are marked as holding the same content, which means the copy will be performed only once one of them is modified. However that's not the case on Ext4, meaning the kernel is actually copying bytes.

So here are the results for these 3 system calls:

sendfile copy — Benchmark for *sendfile*

splice copy — Benchmark for *splice*. A size of 1MiB was used for splice.

copy_file_range copy — Benchmark for *copy_file_range*

The first thing that we notice is that they all seem to have similar performances. In fact the regressions (for larger files) all have the same slope of . In order to better compare them, below is a comparative plot of the three methods:

comparative plot of sendfile and splice against copy_file_range — *splice* was used as the baseline because it's the slowest of the three. *sendfile* and *copy_file_range* seem to have the exact same performances, while being slightly faster than *splice*. However, they eventually match *splice* performances for larger buffers.

In practice, copy_file_range is the most appropriate function for copying files, as the two others are generally used for other purposes. For instance, sendfile is generally used for serving files over a socket without having to copy the file in userspace (requiring less memory and less CPU usage). While splice is useful for sending data between pipes.

With these kernel-level copy methods, we're getting performances close to the memcpy-only method presented in the first section.

Comparing copy_file_range against buffered POSIX copy gives this graph:

copy_file_range vs buffered POSIX copy — Comparing all buffer sizes against *copy_file_range* shows significant slowdowns for smaller buffers. However, even larger buffers experience significant slowdowns when the copied file becomes larger.

Overall copy_file_range is faster than all other methods previously discussed in this article. However, these performances come at a cost: copy_file_range is unable to copy files across filesystems. According to the manual, trying to do so will result in EXDEV:

EXDEV (before Linux 5.3)
	The files referred to by fd_in and fd_out are not on the same filesystem.

EXDEV (since Linux 5.19)
	The  files  referred  to  by  fd_in and fd_out are not on the same filesystem, and the source and target
	filesystems are not of the same type, or do not support cross-filesystem copy.

This is a fundamental limitation compared to simply using read+write. In practice, you can fall back to buffered POSIX copy when kernel-level copy isn't possible.

Nevertheless, when conditions are appropriate, copy_file_range should be the preferred method for copying files.

Also, since it requires a single call, error handling becomes simpler. Compared to repetitively calling read and write, where every call is a potential source of errors.

#Summary and conclusion

Here's a comparative graph showing all every copy method compares against memcpy:

Comparison of all copy methods mentioned in this article — Comparative graph of all copy methods, compared to *memcpy*. Note the sudden jump when the copied data reaches 256kiB. This is due to the fact that memcpy becomes much slower per bytes at this point. Overall, all kernel-copy methods outperforms the other methods, and get really close to *memcpy*'s performances for larger files.

In this article, we explored multiple ways to copy files on Linux, each with its own trade-offs in terms of simplicity, performance, and system requirements. Here's a concise summary:

POSIX copy (1-byte read/write):

Extremely simple and portable across POSIX systems.
Performs very poorly due to the high number of system calls and context switches.
Useful only for educational purposes or very small files.

Standard C library copy (fread/fwrite):

Adds buffering on top of POSIX calls, improving performance significantly.
The buffering strategy is implementation-defined, and small buffers may still incur overhead.
Generally faster than raw POSIX copy for most files, but still far from optimal for large files.
Very portable due to the high availability of the C standard library

Buffered POSIX copy (custom buffer):

Reading and writing in larger chunks dramatically reduces syscall overhead.
Allows fine-grained control over buffer size to balance memory usage, CPU cache, and system call overhead.
Properly tuned buffers can yield performance close to cp for files of typical sizes.
Still uses userspace memory and involves copying data between kernel and userspace.

Kernel-assisted copy (sendfile, splice, copy_file_range):

Performs the copy entirely in kernel space, avoiding userspace memory overhead.
Minimizes CPU usage, cache pollution, and memory bandwidth.
copy_file_range is best suited for file-to-file copying, while sendfile and splice are better for network or pipe operations.
Extremely fast, often outperforming buffered userspace copies, but limited by filesystem boundaries (cannot copy across filesystems with copy_file_range).

For most applications, buffered POSIX or standard C copy is sufficient, especially if you need portability or fine control over buffering. However, for maximum performance on Linux, leveraging kernel-assisted zero-copy methods like copy_file_range is the clear winner. They reduce CPU load, memory overhead, and achieve throughput close to the theoretical limits of the underlying storage.

Ultimately, the best method depends on your constraints:

Need portability: standard C / POSIX buffered copy.
Need performance on Linux: copy_file_range (with fallback to buffered copy for cross-filesystem support).
Need network/pipe efficiency: sendfile or splice.

By understanding how each method interacts with the kernel, system calls, and caching, you can make informed decisions when designing high-performance file-copy routines on Linux.

#Going further

In this article filesystem-level copy has been left unexplored. These are special types of copy operations that rely on filesystem-specific features. As mentioned previously, btrfs has a copy-on-write feature. That means making a copy of a file is a trivial operation that doesn't even require reading the original file at all. Instead, the filesystem marks both files as holding the same content, and the copy will only be performed if you modify either of those file.

It is also possible to create a kernel-level copy by using Linux's io_uring feature. However, since we already have copy_file_range, such a method may not yield any advantages over the kernel-level copy methods already available. As a side note, copy_file_range was added in Linux 4.5, while io_uring was added in Linux 5.1. And as of today io_uring is still tricky to work with and prone to errors. On top of that, since glibc 2.27, a retro-implementation (in user-space) of copy_file_range was created to support the call on older kernels or non-linux systems.

In this article I completely omitted metadata operations that tools like cp perform. When you copy a file, you're usually copying its owner/group/permissions combination as well as its content. You also want to perform different handling if the file is a symbolic link, a pipe or a socket, which this article doesn't take into consideration.

I briefly mentioned O_DIRECT at the beginning of the article. When opening a file with this flag, it entirely bypasses the kernel's cache. This flag is not meant for general-purpose copy, but for situations where reliability and latency need to be tightly controlled, for instance in a database software. In fact O_DIRECT greatly reduces the speed of I/O operations.

There's also the entire unexplored area of network filesystems or FUSE filesystems in general. They're very useful as they allow to mount another filesystem that isn't naturally part of your computer, for instance a FTP server, or a smartphone. However they introduce their own caveats as they are not managed by your kernel, therefore it's very hard to reliably transfer data across those links. For instance, the kernel can't guarantee atomicity, so multiple programs may write to the same file without evident coordination. Note that on a network filesystem, copying data using the naive stdio or POSIX buffered copy means the data is sent over the network twice: once to your computer and once from your computer back to the server. This is explained in glibc's copy_file_range documentation.

#Methodology

All tests were done on the same filesystem, copying from one drive to the same drive. For the sake of simplicity, the copy program does not check for errors. A later function checks that the copied file's content match that of the original file.

Linux version: 6.12.38+deb13-amd64
CPU: AMD Ryzen 5 3600X
Drive: WDC WDS500G1R0B
Filesystem: Ext4 behind LVM+Luks2
GCC version: 15.2.0
glibc version: 2.41-12

Every time a function from the C standard is mentioned, expect it to come from glibc. As it is not possible to control how SDDs cache data, all tests were run once before recording their data.

To run individual benchmark, compile the copy.c file in src/ with the following flags:

gcc copy.c -Wall -Wextra -Wconversion -O2 -DCOPY=<NAME> -DBUFFER_SIZE=<SIZE> -DSPLICE_CHUNK=<SIZE> -o copy

Where <NAME> is the copy method you want to benchmark:

posix for 1-byte POSIX copy;
stdio for 1-byte stdio copy;
stdio_lb for 8192-bytes stdio copy (using setvbuf);
posix_buffered for buffered POSIX copy (using specified BUFFER_SIZE);
posix_buffered_aligned for buffered POSIX copy (using specified BUFFER_SIZE), buffer is allocated using posix_memalign;
buffered_stdio for buffered stdio copy (using specified BUFFER_SIZE);
copy_file_range for copy_file_range;
sendfile for sendfile;
splice for splice (using specified SPLICE_CHUNK);
splice_lb for splice (using specified SPLICE_CHUNK), with a pre-allocated buffer using fcntl(pipe[0], F_SETPIPE_SZ, SPLICE_CHUNK);
cp for the cp utility from coreutils;
memcpy for pure memcpy copy without I/O.
memcpy_aligned for pure memcpy copy without I/O using a buffer allocated by posix_memalign.

You can then run the program:

./copy <NUM> [avg|box]

Where:

<NUM> is the number of tests you want to run;
[avg] is an optional flag to output the average time in microseconds.
[box] is an optional flag to output a box-diagram data for the run, in this format: MIN, Q1, MED, Q3, MAX, AVG (in microseconds).
When specifying NUM only, the program outputs all measurements (in microseconds).

##Data

Every benchmark outputs data in the data/ directory, in CSV format. The first row corresponds to the column's file size in human readable format. And each columns contains the measurements in microseconds. Except for data/overhead which only contains the measurements.

##Automation

All scripts are located in scripts/.

Bash scripts:

The scripts will output each run to it's own file inside data/.

bench.sh: Benchmark copy methods. You need to specify the tested methods and file sizes inside the script:

methods: The list of methods to test
sizes: An array containing the size of the file to copy (passed to dd) and the number of tests to perform

Outputs in data/.

bench_buf.sh: Benchmark copy methods that take a BUFFER_SIZE argument. You need to specify the tested methods, file sizes and buffer sizes inside the script:

methods: The list of methods to test
sizes: An array containing the size of the file to copy (passed to dd) and the number of tests to perform
bufsizs: The buffer sizes to test

Outputs in data/bufsiz/.

bench_buf_10M.sh: Benchmark copy methods that take a BUFFER_SIZE argument against a single 10MiB file. You need to specify the tested methods and buffer sizes inside the script:

methods: The list of methods to test
bufsizs: The buffer sizes to test

Outputs in data/bufsiz_10M/.

bench_buf_read.sh: Benchmark read overhead for buffered POSIX copy. You need to specify the tested buffer sizes inside the script:

bufsizs: The buffer sizes to test

Outputs in data/overhead/.

bench_buf_write.sh: Benchmark write overhead for buffered POSIX copy. You need to specify the tested buffer sizes inside the script:

bufsizs: The buffer sizes to test

Outputs in data/overhead/.

R scripts:

plot_simple.R: Violin plot. Takes a CSV file and a name as input.
plot.R: Violin plot performing two regressions with a fixed cutoff. Takes a CSV file and a name as input.
plot_single.R: Violin plot performing a single regression. Takes a CSV file and a name as input.
compare.R: Compares two benchmarks and outputs a violin plot of . Takes two CSV files and two names as input.
plot_bufsiz.R: Plots buffer sizes for data/bufsiz/ (used by buffered POSIX copy)
plot_bufsiz_10M.R: Plots buffer sizes for data/bufsiz_10M/ (used by buffered POSIX copy)
plot_read_oh.R: Plots read overhead for buffered POSIX copy in data/overhead/.
plot_write_oh.R: Plots write overhead for buffered POSIX copy in data/overhead/.
compareNhigh.R: Compares N benchmarks against the first one (used when the first benchmark is slower). Takes one CSV/name combination, then N CSV/name as input
compareNlow.R: Compares N benchmarks against the first one (used when the first benchmark is faster). Takes one CSV/name combination, then N CSV/name as input
compareSummary.R: Compares N benchmarks against the first one (used for the summary graph only). Takes one CSV/name combination, then N CSV/name as input

#References

This article heavily references the Linux programmer's manual, commonly referred to as man:

Mode Switch
open The open system call
read The read system call
write The write system call
fopen The fopen standard C function
fread The fread standard C function
fwrite The fwrite standard C function
sendfile The sendfile system call
splice The splice system call
copy_file_range The copy_file_range system call
How useful should copy_file_range() be? A discussion about the history and usage of copy_file_range
io_uring The io_uring Linux facility
FUSE The FUSE Linux facility
Latency Numbers Every Programmer Should Know Latency of different operations
Linux Page Cache A LinuxVox article about the Linux Page Cache