At the time of writing, data deduplication was not a built-in feature of Synology NAS servers. If you enable this undocumented option, you do so at your own risk. Neither Synology nor the author is responsible for the consequences of following the instructions in this article.
What you need:
- Synology NAS with DSM 6.1 or higher
- Btrfs file system partition
- Ability to install Docker from Synology package center
- A backup of your data, made before you begin
First, a little theory, briefly and in plain words.
Btrfs – key to block deduplication
If you write ten copies of a 3 GB windows.iso file to ten folders on the NAS, they take up 30 GB of space. Conventional file-level deduplication can detect duplicate files and show you that they can be deleted. Synology offers this in the Storage Analyzer package, but that is the last century. Block deduplication scans each file for duplicate blocks (extents), and if it finds identical extents in different files, it removes the duplicate extent from one file and replaces it with a reference to the same extent in another file.
Very roughly, it looks like this: say there are 10 archives, each of which contains the file dx.dll. The contents of dx.dll are removed from 9 of the archives and replaced with a link pointing to the 10th archive, where the file is actually stored. In reality the process is much more complex, but the idea is the same: a link to dx.dll takes, for example, 128 bytes, while dx.dll itself takes 300 megabytes. By deduplicating, we save 2,700 megabytes out of 3,000, and all files remain in place and can be copied, opened, and deleted independently. When the data is copied to another medium, such as a computer, it again takes 3,000 megabytes, so the space is saved only inside the NAS.
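To make the idea of shared blocks more concrete, here is a rough shell sketch that counts how many fixed-size blocks a set of files contains and how many of them are unique. It is only an illustration: the function name count_blocks, the fixed block size and the use of sha256sum are assumptions made for this example, and real Btrfs extents and duperemove work differently in detail.

```shell
# Illustration only: count total and unique fixed-size blocks in some files.
# Usage: count_blocks BLOCK_SIZE_BYTES FILE...
# Prints: "<total blocks> <unique blocks>"
count_blocks() {
    bs="$1"; shift
    tmp=$(mktemp -d) || return 1
    i=0
    for f in "$@"; do
        i=$((i + 1))
        # Cut each file into chunks of $bs bytes, stored as separate files.
        split -b "$bs" -- "$f" "$tmp/$i."
    done
    # Every chunk file is one block.
    total=$(find "$tmp" -type f | wc -l)
    # Blocks with the same checksum are treated as identical.
    unique=$(find "$tmp" -type f -exec sha256sum {} + | awk '{ print $1 }' | sort -u | wc -l)
    rm -rf "$tmp"
    echo "$((total)) $((unique))"
}
```

For two two-block files that have one block in common, the function reports 4 blocks in total and 3 unique ones; the difference is roughly what block deduplication could reclaim.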
Docker is the key to duperemove
Starting with DSM 6.x, Synology has made it much harder to build and run third-party software on its OS, explaining: “if you want to install software – use Docker or virtualization”. In principle, this is the right step to protect the integrity of the operating system. The Synology Virtual Machine Manager hypervisor will not give us direct access to the file system, but container virtualization will. Therefore, we will use the Docker package: we will create a container in it and install the duperemove program there.
Installing Docker and Debian image
After making sure that the NAS uses the Btrfs file system, open Package Center from the main menu, scroll down and install Docker.
Run Docker and select the debian:latest image in the “Registry” tab. We will use this image, although everything should work the same way in Ubuntu and CentOS.
Go to the “Image” tab and click “Run”; our container is now running. Stop it by clicking the switch on the right.
Select the container Debian 1 and click “Settings”. We need to give the container full access to DSM, so select the corresponding check box, and then configure the resource limits. By default, duperemove uses all processor cores and can consume all the memory available on the system.
Deduplication is a long and very resource-intensive process, so it is better to limit the container’s resources, allocating no more than half of the CPU cores and about half of the RAM. When deduplication is over, all unneeded data is released from memory. This is not ZFS, so there is no need to keep the table of extents in memory.
By default, our container does not see the files and directories stored on the NAS, so the physical directories have to be “forwarded” into its virtual environment. Click the “Advanced Settings” button, go to the “Volume” tab and click the “Add Folder” button. Select the desired folder (it can be either the root or any subfolder), click “OK” and enter the path inside the container where the NAS folder will be mounted; let it be /mnt/sviko. If you need to add more than one NAS folder, repeat this step for each of them.
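For readers who prefer the command line, the settings above can probably be expressed as a single docker run command over SSH. The sketch below only prints such a command rather than running it, so that it can be reviewed first. Several things here are assumptions and not taken from the DSM interface: that “full access” corresponds to the --privileged flag, the container name debian1, and the host folder /volume1/vm used in the usage line. The --cpus, --memory and -v options themselves are standard docker run options.

```shell
# Print (do not run) a docker command that gives the container half of the
# given CPU cores and half of the given RAM.
# Usage: docker_cmd_half CORES RAM_MB HOST_DIR
docker_cmd_half() {
    cores="$1"; ram_mb="$2"; host_dir="$3"
    cpus=$((cores / 2))
    # Never go below one core.
    [ "$cpus" -ge 1 ] || cpus=1
    mem=$((ram_mb / 2))
    # "sleep infinity" just keeps the container alive so a terminal can be opened in it.
    echo "docker run -d --name debian1 --privileged --cpus=$cpus --memory=${mem}m -v $host_dir:/mnt/sviko debian:latest sleep infinity"
}

# Example (hypothetical host folder, 4 cores, 16 GB of RAM):
#   docker_cmd_half 4 16384 /volume1/vm
```

Printing the command instead of executing it is a deliberate choice: a privileged container with a mounted data folder deserves a second look before it is started.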
In general, as I said, deduplication in Btrfs is a very heavy process for the system, so it makes sense to perform it selectively, for example, only for a folder with virtual machine images or other highly duplicated data. Moreover, it is better to make several copies of the container for different folders and run each of them separately, for example on a schedule (how to launch containers on a schedule is easy to find on Google). Now let’s move on.
Start our container by clicking the switch on the right. Click the “Details” button and a new window opens. Click “Terminal” and, after a short wait, you get to the command line interface. We have a completely “bare” Debian, in which there is not even SSH access, so it is simpler and faster to type a long URL on the keyboard once than to forward ports into the container and connect from a terminal program.
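Go to the folder /tmp, refresh the package lists and install the tools we need:
cd /tmp
apt-get update
apt-get install wget nano
Next, we have to add the buster repository. Open the list of package sources:
nano /etc/apt/sources.list
Add this line to the end of the file:
deb http://deb.debian.org/debian buster main
Save the file (CTRL+X, then Y, then ENTER), refresh the package lists again and install duperemove:
apt-get update
apt-get install duperemove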
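If you would rather not edit the file by hand, the repository line can be appended from the shell. The helper below is a minimal sketch; the function name add_apt_line is made up for this example, and it assumes the package sources live in /etc/apt/sources.list. It adds the line only if it is not already there, so running it twice does no harm.

```shell
# Append LINE to FILE unless the file already contains exactly that line.
# Usage: add_apt_line FILE LINE
add_apt_line() {
    file="$1"; line="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Inside the container, as root, one would then run:
#   add_apt_line /etc/apt/sources.list "deb http://deb.debian.org/debian buster main"
#   apt-get update && apt-get install -y duperemove
```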
That’s it: we now have everything we need, and it’s time to run deduplication!
duperemove -rdh --hashfile=hash /mnt/sviko
- -r: walk through the directories recursively
- -d: deduplicate the duplicates found (without this option duperemove only reports them)
- -h: print the numbers in the report in human-readable form
- --hashfile=hash: the file in which to store the hashes during deduplication. If you do not specify it, all hashes are kept in RAM, which can lead to memory exhaustion and a program error. duperemove does not delete this file when it finishes, and it can be reused on later runs so that unchanged files are not hashed again. In our case it is created in the /tmp folder
- /mnt/sviko: the path to the mounted directory inside the container, which leads to the directory on the NAS
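If the command is going to be run repeatedly, it can be wrapped in a small function that first checks that the mounted directory really exists. This is only a sketch: the function name run_dedupe and the DUPEREMOVE override variable are hypothetical conveniences for this example, while the duperemove options are the ones used in this article.

```shell
# Run duperemove on DIR, keeping the hashes in HASHFILE (default: /tmp/hash).
# Set DUPEREMOVE=echo to print the arguments instead of running the program.
# Usage: run_dedupe DIR [HASHFILE]
run_dedupe() {
    dir="$1"; hashfile="${2:-/tmp/hash}"
    # Refuse to start if the NAS folder is not mounted where we expect it.
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    "${DUPEREMOVE:-duperemove}" -rdh --hashfile="$hashfile" "$dir"
}

# Example:
#   run_dedupe /mnt/sviko
```

The directory check matters mostly for scheduled runs, where a missing mount would otherwise go unnoticed.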
Deduplication takes a very long time: on Celeron-series processors it can take several days or even weeks for a 1 TB folder. First, the program builds a set of hashes for the extents of each file inside the mounted directory, and then it starts removing the duplicate extents from the files themselves. If the process is interrupted, it has to be started again. If the server shuts down during deduplication (the power goes out or everything hangs), nothing should break after the reboot. We checked this 3 times during testing, and the data remained intact.
A faster option is file deduplication. Unlike block deduplication, it works with whole files: if it finds two identical ones, it effectively deletes one of them and tells the file system that its contents can be found at a different address. The user does not notice this process: you still have access to all the same files, but together they occupy as much space as one original. This is much less resource-intensive than block deduplication, though not as efficient; if you have already run block deduplication, file deduplication is not required. Let’s go back to our container and run file deduplication. Open the container and go to the terminal, as shown earlier.
Install the fdupes program, which lists duplicate files:
apt-get install fdupes
Now run the search and deduplication in our folder, as in the example above:
fdupes -r /mnt/sviko | duperemove --fdupes
Here fdupes goes through all the files in the specified directory, creates hashes for them to find identical files, and passes the list of duplicates to duperemove, which tells the file system to store their contents only once.
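The idea of finding whole-file duplicates can be illustrated with standard tools. The sketch below counts the groups of files under a directory that have identical checksums. It is not how fdupes is implemented and it changes nothing on disk; the function name count_dupe_groups and the choice of sha256sum are assumptions made for this example.

```shell
# Print the number of groups of files with identical content under DIR.
# Usage: count_dupe_groups DIR
count_dupe_groups() {
    find "$1" -type f -exec sha256sum {} + |
        awk '{ seen[$1]++ }
             END { n = 0; for (h in seen) if (seen[h] > 1) n++; print n }'
}

# Example:
#   count_dupe_groups /mnt/sviko
```

A result of 0 means that file deduplication has nothing to do in that folder, although block deduplication may still find shared extents inside different files.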
Well, let’s test how it works.
We wrote 414 GB of virtual machine images to the test folder. Before deduplication, the volume had 10.27 GB of free space.
Test system:
- Synology RS18017XS+
- 16GB RAM
- 2xSSD Samsung MZ7KM80 RAID 1
Block deduplication was pleasantly fast: after 3 hours, 150 GB was in use instead of 414 GB.
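The saving can be expressed as a percentage with a line of arithmetic. The helper below is a small sketch (the name savings_percent is made up for this example) that takes the space used before and after deduplication in the same units.

```shell
# Print the share of space saved, in percent, with one decimal place.
# Usage: savings_percent BEFORE AFTER
savings_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

# The test above:                  savings_percent 414 150    prints 63.8
# The dx.dll example from earlier: savings_percent 3000 300   prints 90.0
```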
Naturally, in real life the effectiveness may differ, in either direction. What is unpleasant is that duperemove cannot be interrupted: some extents take it a few seconds, while on others it spends hours. And while the program is processing an extent, you cannot stop or restart the container.
If you are familiar with containers, you can configure deduplication to run automatically on a schedule without waiting for Synology to add support for this feature to DSM. As mentioned above, it makes sense to make separate containers for individual folders and run them one at a time. I tested block deduplication on plain Linux without containers, and there the process took only a few hours on the same processor. Apparently, Docker is too thick a layer and eats up CPU resources.
Deduplication does not work on encrypted folders.
Yes, of course, Btrfs is not ZFS, where deduplication happens on the fly as data is written to disk. But if you can achieve the same performance in DSM as in Debian 9, you can run this process nightly or weekly and get the same savings that ZFS gives.