In the third and final part of our blog series about our experiences with Ceph, we report on the finishing touches to our NVMe Ceph cluster.
Benchmarks in the VMs
After collecting various benchmark data for the entire cluster, we began testing read and write speeds directly in the VMs that use the storage. In these benchmarks, we also experimented with different settings and, above all, with different processors.
For the first test, each node was equipped with two Intel E5-2630 v4 (10 x 2.2 GHz) processors. We also chose these processors for the production setup because the price of the other CPUs was too high considering the minimal increase in performance.
With this first CPU, we arrived at the following test results:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 14908 IOPS | 12650 IOPS | 184 IOPS / 739.24 MB/s |
| Serial Read | 1212 IOPS | 1095 IOPS | 77 IOPS / 310.22 MB/s |
| Random Write | 9886 IOPS | 8382 IOPS | 170 IOPS / 681.58 MB/s |
| Random Read | 1074 IOPS | 949 IOPS | 66 IOPS / 264.70 MB/s |
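The original column headers of these tables did not survive. As a plausibility check – our own inference, not part of the original measurements – dividing the throughput by the IOPS in the last column points to an average request size of roughly 4 MB, which matches the upper end of the `bsrange=4k-4m` setting used in our FIO jobs:

```python
# Average I/O size = throughput / IOPS, for the last column of the
# first benchmark table. The ~4 MB result is an inference on our part,
# since the original column headers were lost.
rows = {
    "serial write": (184, 739.24),
    "serial read": (77, 310.22),
    "random write": (170, 681.58),
    "random read": (66, 264.70),
}
for name, (iops, mb_per_s) in rows.items():
    print(f"{name}: {mb_per_s / iops:.2f} MB per I/O")
```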
The second test series used two Intel E5-2680 v4 (14 x 2.4 GHz) processors:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15209 IOPS | 12702 IOPS | 131 IOPS / 527.70 MB/s |
| Serial Read | 1387 IOPS | 1287 IOPS | 81 IOPS / 326.53 MB/s |
| Random Write | 9026 IOPS | 9097 IOPS | 137 IOPS / 550.10 MB/s |
| Random Read | 1103 IOPS | 1097 IOPS | 80 IOPS / 320.30 MB/s |
For the final CPU test, we again used Intel E5-2680 v4 processors – this time, however, with only 10 of their 14 cores enabled. This test was mainly intended to determine whether the clock speed or the number of cores has the decisive effect on performance.
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15713 IOPS | 12981 IOPS | 153 IOPS / 615.38 MB/s |
| Serial Read | 1371 IOPS | 1254 IOPS | 99 IOPS / 397.05 MB/s |
| Random Write | 9645 IOPS | 8458 IOPS | 142 IOPS / 568.25 MB/s |
| Random Read | 1271 IOPS | 1004 IOPS | 66 IOPS / 264.50 MB/s |
From all the collected results, it became clear to us that the faster CPUs do not pay off: with them, we achieved roughly the same results as with the 10 x 2.2 GHz processors.
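To put "roughly the same results" into numbers, the first IOPS column of the two full-core CPU runs can be compared directly (values taken from the tables above):

```python
# Relative IOPS difference between the E5-2630 v4 (10x 2.2 GHz) and
# E5-2680 v4 (14x 2.4 GHz) runs, first result column of each table.
e5_2630 = {"serial write": 14908, "serial read": 1212,
           "random write": 9886, "random read": 1074}
e5_2680 = {"serial write": 15209, "serial read": 1387,
           "random write": 9026, "random read": 1103}
for test in e5_2630:
    delta = (e5_2680[test] - e5_2630[test]) / e5_2630[test] * 100
    print(f"{test}: {delta:+.1f} %")
```

The differences point in both directions and stay well below what the price difference between the CPUs would justify.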
These results also gave us a good overview of the performance of individual virtual machines. It was important for us to find out the load above which a machine in the cluster shows a noticeable drop in speed. For this purpose, we generated a base load on the cluster using multiple parallel FIO jobs and then measured the performance in a specific virtual machine at each base load level. To generate this base load, we used the following FIO parameters:
```
[global]
invalidate=1
ioengine=rbd
iodepth=1
time_based
runtime=7200
size=10G
bsrange=4k-4m
rw=randrw
direct=1
buffered=0
percentage_random=50
rate=5M
pool=volumes
clientname=libvirt
```
We tested many different load profiles with these parameters, but saw no real change until the sustained read load reached about 1 GB/s and the write load about 1.5 GB/s. Once these values were reached, we measured the following metrics in a VM:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 13770 IOPS | 7895 IOPS | 34 IOPS / 136.75 MB/s |
| Serial Read | 498 IOPS | 447 IOPS | 14 IOPS / 58.04 MB/s |
| Random Write | 10278 IOPS | 8410 IOPS | 33 IOPS / 135.63 MB/s |
| Random Read | 344 IOPS | 268 IOPS | 13 IOPS / 53.73 MB/s |
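Comparing these figures with the unloaded baseline from the first table makes the effect of the base load explicit: write IOPS barely change, while read IOPS drop by well over half. A small check using the first IOPS column of both tables:

```python
# IOPS without base load (first table) vs. under ~1 GB/s read and
# ~1.5 GB/s write base load (table above); positive = degradation.
baseline = {"serial write": 14908, "serial read": 1212,
            "random write": 9886, "random read": 1074}
loaded = {"serial write": 13770, "serial read": 498,
          "random write": 10278, "random read": 344}
for test in baseline:
    drop = (1 - loaded[test] / baseline[test]) * 100
    print(f"{test}: {drop:+.1f} % drop")
```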
The measurements at around 50 MB/s of reading base load and 110 MB/s of writing base load were as follows:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15187 IOPS | 12799 IOPS | 170 IOPS / 681.30 MB/s |
| Serial Read | 1155 IOPS | 1126 IOPS | 49 IOPS / 196.45 MB/s |
| Random Write | 10064 IOPS | 8646 IOPS | 170 IOPS / 680.40 MB/s |
| Random Read | 933 IOPS | 859 IOPS | 53 IOPS / 214.31 MB/s |
More OSDs per NVMe
After extensive testing, we were interested in the possible influence of additional OSDs per NVMe on the speed of the cluster. To find out, we rebuilt the cluster with 2 OSDs – and correspondingly two journal partitions – on each NVMe. This conversion primarily improved the read speed, while write performance suffered only slightly:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 14686 IOPS | 12407 IOPS | 128 IOPS / 515.33 MB/s |
| Serial Read | 1719 IOPS | 1546 IOPS | 104 IOPS / 416.01 MB/s |
| Random Write | 9377 IOPS | 8900 IOPS | 122 IOPS / 488.29 MB/s |
| Random Read | 1401 IOPS | 1323 IOPS | 106 IOPS / 427.72 MB/s |
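Relative to the single-OSD baseline from the first benchmark table, the gain from 2 OSDs per NVMe can be quantified as follows (first IOPS column of each table):

```python
# IOPS with 1 OSD per NVMe (first table) vs. 2 OSDs per NVMe
# (table above); positive = improvement.
one_osd = {"serial write": 14908, "serial read": 1212,
           "random write": 9886, "random read": 1074}
two_osds = {"serial write": 14686, "serial read": 1719,
            "random write": 9377, "random read": 1401}
for test in one_osd:
    gain = (two_osds[test] - one_osd[test]) / one_osd[test] * 100
    print(f"{test}: {gain:+.1f} %")
```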
In order to see the behaviour of the Ceph storage under higher workload, we again generated around 1 GB/s of read load and 1.5 GB/s of write load:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 13909 IOPS | 9724 IOPS | 38 IOPS / 154.25 MB/s |
| Serial Read | 519 IOPS | 495 IOPS | 18 IOPS / 72.13 MB/s |
| Random Write | 9689 IOPS | 9609 IOPS | 38 IOPS / 155.38 MB/s |
| Random Read | 293 IOPS | 311 IOPS | 18 IOPS / 73.02 MB/s |
Here, too, a significant improvement was noticeable in comparison with the configuration of only one OSD per NVMe device.
Finally, we attempted to improve performance further by adopting an approach from Samsung which – in cooperation with RedHat – had been proposed as the best alternative. With the recommended upgrade to 4 OSDs per NVMe, however, we could not detect any increase in performance, and even saw a deterioration when reading and writing larger blocks. With that, our benchmarks were complete: it was with 2 OSDs per NVMe that we were able to extract the optimal performance from the hardware.
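For readers who want to reproduce a multiple-OSDs-per-NVMe layout: on the FileStore-era releases we used, this meant manually creating two data and two journal partitions per device. Newer Ceph releases can provision such a layout in one step with `ceph-volume`. The sketch below is illustrative only – the device name is a placeholder and this was not our exact procedure:

```
# Sketch only – /dev/nvme0n1 is a placeholder device name.
# On recent Ceph releases, ceph-volume can split one device
# into multiple OSDs directly:
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
```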
To complete the various tests, we examined the reliability of our setup by simulating different failure scenarios under load.
NVMe Failure on the Fly
We simulated the failure of an NVMe disk on the fly, under a significant load on the cluster, by removing an NVMe from the server.
The result: for a period of a few seconds, no IO operations were possible. Thereafter, the storage worked normally again and no losses were detected on the VMs.
This outage is short enough that no impairment could be detected on the virtual machines.
NVMe Failure with Redistribution
With this scenario, we tested for potential performance problems by removing an NVMe disk and waiting until all data had been redistributed. Another objective of this test was to determine how long it takes until all data are fully available and redundant again.
The result was 5 minutes. This was the time the cluster waited after the removal of the disk before beginning to repair itself. To do so, it had to copy a substantial amount of data, which was, however, not noticeable on the VMs. For the test, approximately 6182 GB of data were stored on the cluster, and recovery took around 20 minutes.
Failure of a Cluster Node
We reconstructed this situation in order to find out what happens when an entire cluster node fails, for instance during a power outage or a reboot.
In this situation, we again generated a large amount of load on the cluster before cutting the power supply to the node. The behaviour was very similar to the failure of a single OSD: for several seconds, the cluster processed no IO operations, but subsequently worked properly again. In this case, too, recovery of the “lost” data began after 5 minutes.
Even a reboot of a Ceph server demonstrated the same effect. For a period of a few seconds, no IO operations were carried out, but the cluster subsequently ran normally. The data that were not stored redundantly during the reboot were then synchronised within a very short time.
After completion of all the tests, we were able to move the hardware from the lab racks into the production environment. Following the move, we conducted another performance test to verify our results from the lab in a production environment. On the basis of another recommendation from RedHat, we additionally migrated the server from Ubuntu Trusty to Ubuntu Xenial. Through the final performance test, we were also able to rule out the distribution upgrade having an impact on the performance of the cluster.