Test bench configuration:
- 4 x Seagate Exos 16 TB HDDs
- RAID 10
- VMware ESXi 6.7 U3
- Windows Server 2016
- Connection: iSCSI
- LUN file system: NTFS, 4 KB
Fault tolerance testing
When configuring the storage, you choose which network ports will operate in fault-tolerant mode, so that when one controller goes down, its IP addresses are taken over by the second controller. The scheme is simple mirroring: port 1 on controller A is backed by port 2 on controller B, and so on. Note that for failover to work, the paired ports must have static IP addresses and share the same subnet, gateway, and even MTU. These are perfectly normal and understandable requirements. To see how fault tolerance works in this NAS, let’s start with synthetic tests.
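The pairing requirements above are easy to check before enabling failover. Below is a minimal Python sketch of such a check. The `PortConfig` structure, the `failover_pair_problems` function, and the addresses are hypothetical and used for illustration only; they are not part of any Synology tool or API.

```python
from dataclasses import dataclass
from ipaddress import ip_interface
from typing import List


@dataclass
class PortConfig:
    """Hypothetical description of one controller network port."""
    name: str
    address: str   # CIDR notation, e.g. "192.168.10.11/24"
    gateway: str
    mtu: int
    static: bool   # True if the address is assigned statically


def failover_pair_problems(a: PortConfig, b: PortConfig) -> List[str]:
    """Return the requirements that the two paired ports violate."""
    problems = []
    if not (a.static and b.static):
        problems.append("both ports need a static IP address")
    # Compare the networks derived from address + prefix length.
    if ip_interface(a.address).network != ip_interface(b.address).network:
        problems.append("ports are in different subnets")
    if a.gateway != b.gateway:
        problems.append("gateways differ")
    if a.mtu != b.mtu:
        problems.append("MTU values differ")
    return problems


# Example values are assumptions made for this sketch.
port_a = PortConfig("A-port1", "192.168.10.11/24", "192.168.10.1", 9000, True)
port_b = PortConfig("B-port2", "192.168.10.12/24", "192.168.10.1", 9000, True)
print(failover_pair_problems(port_a, port_b))
```

An empty list means the pair meets all four conditions named above; any other result lists what has to be fixed first.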
To do this, we connect a regular thin-provisioned LUN to Windows Server 2016 and watch the volume access latency under different conditions. The first test is a 5-minute random read with 4K blocks, during which latency stays consistently stable over the entire interval.
When the active controller is disabled during the Random 4K read test, the downtime is remarkably short, only 13 seconds. I don’t think I’m mistaken in saying that 99% of applications will not even notice such a brief pause, and it will not lead to service interruptions.
Reading through the backup controller is no different from reading through the main one, apart from a slightly higher latency, which would be visible with some SSD models but from a practical point of view does not affect the operation of the service.
It takes about 120 seconds for the main controller to come back, but this time the disk access interruption is about 20 seconds, and as the graph shows, access to the array drops twice.
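Interruptions like these can be pulled out of a log of I/O completion times instead of being read off a graph. The sketch below is a hypothetical helper, not the tool used in this review: it assumes completion timestamps in seconds and treats any gap longer than a chosen threshold as a stall.

```python
from typing import List, Tuple


def find_stalls(completions: List[float], threshold: float) -> List[Tuple[float, float]]:
    """Return (start_time, duration) for every gap between consecutive
    I/O completions that is longer than `threshold` seconds."""
    ordered = sorted(completions)
    stalls = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((prev, gap))
    return stalls


# Synthetic data for illustration: one completion per second,
# then a 13-second pause, then I/O resumes.
timestamps = [float(t) for t in range(0, 11)] + [23.0, 24.0]
for start, duration in find_stalls(timestamps, threshold=5.0):
    print(f"stall at t={start:.0f}s lasting {duration:.0f}s")
```

With a log covering the controller's return, a double interruption of the kind seen on the graph would appear as two separate entries in the result.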
The results the Synology UC3200 demonstrates are a real breakthrough, if not a miracle, because such a short switchover from the main controller to the backup one is typical of much more expensive machines. We could end on that note, but first we need to make sure that in real life everything goes as smoothly as in the synthetic tests. Let’s repeat all of the above with a 2-stream 4K random read/write load in a 50/50 ratio.
Our disk subsystem is built on hard drives with a spindle speed of 7200 RPM, so of course the access time jumps around a lot. Over time, the predictive algorithms evidently give up, and the maximum latency increases.
The switchover from the active controller to the standby one now takes noticeably longer, up to 20 seconds, but that is still relatively low for a device in this price range. Returning the controller to active mode interrupts storage operation for 15 seconds, after which the overall array latency drops noticeably.