LATHAMA
Andrew Latham aka lathama, gringo malvado

Moving large Sparse files

Sparse files are handy but at times difficult to work with.

Moving large sparse files across the network raises several issues, because not every tool supports them. A sparse file reports a size of X but only allocates the blocks on the file system that are actually used. This is a great use of space and very handy for virtualization. In the past, methods like copy-on-write (COW) were used to consume space only as it was needed. Those solutions worked, but sparse file support is integrated into the Linux kernel and is now the preferred way to handle images.
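
To make the difference between reported size and allocated blocks concrete, here is a minimal sketch. It assumes GNU coreutils (truncate and stat -c) and uses a throwaway file name, demo.img, that is purely illustrative.

```shell
#!/bin/sh
# Sketch: create a sparse file and compare the size it reports
# with the space it actually occupies on disk.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
img="$workdir/demo.img"   # hypothetical file name

# truncate extends the file to 100 MiB without writing any data blocks
truncate -s 100M "$img"

# %s = size the file reports, in bytes
apparent=$(stat -c %s "$img")
# %b = number of allocated blocks, %B = size in bytes of each of those blocks
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"
```

On a file system with sparse support the allocated figure should be far smaller than the apparent one. That gap is exactly what you do not want to send over the wire.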

Problem

You need to move a large (100GB+) file from one server to another over the network. The file is sparse, which means only a small portion of it may actually be in use. You do not want to transfer every byte of data or fully allocate the file on the target system.

Solution

Use tar, which supports sparse files (the -S option) and can stream through stdin and stdout. Tar checks the source file twice before streaming (once normally and a second time for sparse regions). On large files this can take time and processing power. The target file will be checked as it is written.
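
Before involving the network, the tar half of the pipeline can be tried locally. The following is a rough sketch, assuming GNU tar and GNU coreutils; the temporary directories and the 64 MiB test image are made up for the demonstration. It spells out "-f -" so the archive goes to stdout and comes from stdin regardless of how tar was built.

```shell
#!/bin/sh
# Sketch: stream a sparse file through tar -S on one machine and
# confirm the copy has the same content and is still sparse.
set -eu
src=$(mktemp -d)
dst=$(mktemp -d)
trap 'rm -rf "$src" "$dst"' EXIT

# Build a 64 MiB sparse file with 7 bytes of real data at the very end
truncate -s 64M "$src/IMG.img"
printf 'payload' | dd of="$src/IMG.img" bs=1 seek=$((64 * 1024 * 1024 - 7)) conv=notrunc 2>/dev/null

# Sender side: -c create, -S handle holes, -f - write the archive to stdout.
# Receiver side: -x extract, -f - read the archive from stdin.
tar -C "$src" -cSf - IMG.img | tar -C "$dst" -xSf -

src_sum=$(sha256sum < "$src/IMG.img")
dst_sum=$(sha256sum < "$dst/IMG.img")
dst_size=$(stat -c %s "$dst/IMG.img")
dst_alloc=$(( $(stat -c %b "$dst/IMG.img") * $(stat -c %B "$dst/IMG.img") ))

echo "checksums match: $( [ "$src_sum" = "$dst_sum" ] && echo yes || echo no )"
echo "copy size:       $dst_size bytes"
echo "copy allocated:  $dst_alloc bytes"
```

In the real transfer, nc (or ssh) simply sits in the middle of that pipe.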

Requirements

Pipe Viewer will show us what is happening in the pipe. Without this you may go insane.

serverA:/# aptitude install pv
serverB:/# aptitude install pv

First you need to understand that tar is going to look at the file TWICE. This will take lots of time and make you think nothing is happening. Wait, wait, wait, and then smile. Select a port above 1024 and under 45,000 that is not in use by another service. Start the listener on serverB before running the sender on serverA, otherwise nc on serverA has nothing to connect to.

Example*

serverA:/# tar -cS IMG.img | pv -b | nc -n -q 15 172.20.2.3 5555
serverB:/# nc -n -l 5555 | pv -b | tar -xS

Here is another method, this time over SSH. As with all SSH connections, it will cause 99%+ CPU load for the duration of the connection, even with compression off.

tar -cS IMG.img | pv -b | ssh -o 'Compression no' root@172.20.2.3 "cat > IMG.img.tar"

Then you need to extract the tar archive on the target server.

tar -xSf IMG.img.tar
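
Once the image is extracted, it is worth checking that it arrived intact and is still sparse. Below is a small sketch of a helper for that, assuming GNU coreutils (sha256sum and du --apparent-size). The function name report is my own invention, not part of any tool.

```shell
#!/bin/sh
# report FILE: print the checksum, apparent size, and allocated size of FILE.
# Hypothetical helper; run it on both servers and compare the lines by eye.
report() {
    sum=$(sha256sum < "$1" | cut -d' ' -f1)
    apparent=$(du --apparent-size -k "$1" | cut -f1)   # size the file reports, in KiB
    allocated=$(du -k "$1" | cut -f1)                  # space actually used, in KiB
    printf '%s  apparent=%sK  allocated=%sK\n' "$sum" "$apparent" "$allocated"
}

# Example, on each server:
#   report IMG.img
```

The checksum and apparent size should match on both ends. The allocated size may differ a little between file systems, but it should stay well below the apparent size.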

Summary

There are other methods of completing this action. This method is the fastest that I have found. Using rsync with its sparse options does work, but it transfers every null byte over the network, so it takes more time. It also runs two checksums on both the source and target files. Further testing shows that compression can cause issues if one or both of the servers are under load. This method can also be used over SSH or other authenticated protocols.

* This method has only hung once for me. If it causes you issues, wait for the connection to time out or test with another image.

Written by Andrew Latham on Tuesday March 17, 2015