In the third and final part of our blog series about our experiences with Ceph, we report on the finishing touches to our NVMe Ceph cluster.
Benchmarks in the VMs
After collecting various benchmark data for the entire cluster, we began testing read and write speeds directly in the VMs that use the storage. In these benchmarks, we also experimented with different settings and, above all, with different processors.
For the first test, each node was equipped with two Intel E5-2630 v4 (10 x 2.2 GHz) processors. We also chose these processors for the production setup because the price of the other CPUs was too high considering the minimal increase in performance.
With this first CPU, we arrived at the following test results:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 14908 IOPS | 12650 IOPS | 184 IOPS / 739.24 MB/s |
| Serial Read | 1212 IOPS | 1095 IOPS | 77 IOPS / 310.22 MB/s |
| Random Write | 9886 IOPS | 8382 IOPS | 170 IOPS / 681.58 MB/s |
| Random Read | 1074 IOPS | 949 IOPS | 66 IOPS / 264.70 MB/s |
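The original column headers of these tables did not survive. As a plausibility check – our own inference, not part of the original measurements – dividing the throughput by the IOPS in the last column points to an average request size of roughly 4 MB, which matches the upper end of the `bsrange=4k-4m` setting used in our FIO jobs:

```python
# Average I/O size = throughput / IOPS, for the last column of the
# first benchmark table. The ~4 MB result is an inference on our part,
# since the original column headers were lost.
rows = {
    "serial write": (184, 739.24),
    "serial read": (77, 310.22),
    "random write": (170, 681.58),
    "random read": (66, 264.70),
}
for name, (iops, mb_per_s) in rows.items():
    print(f"{name}: {mb_per_s / iops:.2f} MB per I/O")
```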
The second test series used two Intel E5-2680 v4 (14 x 2.4 GHz) processors:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15209 IOPS | 12702 IOPS | 131 IOPS / 527.70 MB/s |
| Serial Read | 1387 IOPS | 1287 IOPS | 81 IOPS / 326.53 MB/s |
| Random Write | 9026 IOPS | 9097 IOPS | 137 IOPS / 550.10 MB/s |
| Random Read | 1103 IOPS | 1097 IOPS | 80 IOPS / 320.30 MB/s |
For the final CPU test, we again used Intel E5-2680 v4 processors – this time, however, with only 10 of their 14 cores enabled. This test was mainly intended to determine whether the clock speed or the number of cores has the decisive effect on performance.
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15713 IOPS | 12981 IOPS | 153 IOPS / 615.38 MB/s |
| Serial Read | 1371 IOPS | 1254 IOPS | 99 IOPS / 397.05 MB/s |
| Random Write | 9645 IOPS | 8458 IOPS | 142 IOPS / 568.25 MB/s |
| Random Read | 1271 IOPS | 1004 IOPS | 66 IOPS / 264.50 MB/s |
From all the collected results, it became clear to us that the faster CPUs do not pay off: with them, we achieved roughly the same results as with the 10 x 2.2 GHz processors.
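To put "roughly the same results" into numbers, the first IOPS column of the two full-core CPU runs can be compared directly (values taken from the tables above):

```python
# Relative IOPS difference between the E5-2630 v4 (10x 2.2 GHz) and
# E5-2680 v4 (14x 2.4 GHz) runs, first result column of each table.
e5_2630 = {"serial write": 14908, "serial read": 1212,
           "random write": 9886, "random read": 1074}
e5_2680 = {"serial write": 15209, "serial read": 1387,
           "random write": 9026, "random read": 1103}
for test in e5_2630:
    delta = (e5_2680[test] - e5_2630[test]) / e5_2630[test] * 100
    print(f"{test}: {delta:+.1f} %")
```

The differences point in both directions and stay well below what the price difference between the CPUs would justify.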
These results also gave us a good overview of the performance of individual virtual machines. It was important for us to find out the load above which a machine in the cluster shows a noticeable drop in speed. For this purpose, we generated a base load on the cluster using multiple parallel FIO jobs and then measured the performance in a specific virtual machine at each base load level. To generate this base load, we used the following FIO parameters:
```
[global]
invalidate=1
ioengine=rbd
iodepth=1
time_based
runtime=7200
size=10G
bsrange=4k-4m
rw=randrw
direct=1
buffered=0
percentage_random=50
rate=5M
pool=volumes
clientname=libvirt
```
We tested many different load profiles with these parameters, but saw no real change until the sustained read load reached about 1 GB/s and the write load about 1.5 GB/s. Once these values were reached, we measured the following metrics in a VM:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 13770 IOPS | 7895 IOPS | 34 IOPS / 136.75 MB/s |
| Serial Read | 498 IOPS | 447 IOPS | 14 IOPS / 58.04 MB/s |
| Random Write | 10278 IOPS | 8410 IOPS | 33 IOPS / 135.63 MB/s |
| Random Read | 344 IOPS | 268 IOPS | 13 IOPS / 53.73 MB/s |
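Comparing these figures with the unloaded baseline from the first table makes the effect of the base load explicit: write IOPS barely change, while read IOPS drop by well over half. A small check using the first IOPS column of both tables:

```python
# IOPS without base load (first table) vs. under ~1 GB/s read and
# ~1.5 GB/s write base load (table above); positive = degradation.
baseline = {"serial write": 14908, "serial read": 1212,
            "random write": 9886, "random read": 1074}
loaded = {"serial write": 13770, "serial read": 498,
          "random write": 10278, "random read": 344}
for test in baseline:
    drop = (1 - loaded[test] / baseline[test]) * 100
    print(f"{test}: {drop:+.1f} % drop")
```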
The measurements at around 50 MB/s of reading base load and 110 MB/s of writing base load were as follows:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 15187 IOPS | 12799 IOPS | 170 IOPS / 681.30 MB/s |
| Serial Read | 1155 IOPS | 1126 IOPS | 49 IOPS / 196.45 MB/s |
| Random Write | 10064 IOPS | 8646 IOPS | 170 IOPS / 680.40 MB/s |
| Random Read | 933 IOPS | 859 IOPS | 53 IOPS / 214.31 MB/s |
More OSDs per NVMe
After extensive testing, we were interested in the possible influence of additional OSDs per NVMe on the speed of the cluster. To find out, we rebuilt the cluster with 2 OSDs – and correspondingly two journal partitions – on each NVMe. This conversion primarily improved the read speed, while write performance suffered only slightly:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 14686 IOPS | 12407 IOPS | 128 IOPS / 515.33 MB/s |
| Serial Read | 1719 IOPS | 1546 IOPS | 104 IOPS / 416.01 MB/s |
| Random Write | 9377 IOPS | 8900 IOPS | 122 IOPS / 488.29 MB/s |
| Random Read | 1401 IOPS | 1323 IOPS | 106 IOPS / 427.72 MB/s |
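Relative to the single-OSD baseline from the first benchmark table, the gain from 2 OSDs per NVMe can be quantified as follows (first IOPS column of each table):

```python
# IOPS with 1 OSD per NVMe (first table) vs. 2 OSDs per NVMe
# (table above); positive = improvement.
one_osd = {"serial write": 14908, "serial read": 1212,
           "random write": 9886, "random read": 1074}
two_osds = {"serial write": 14686, "serial read": 1719,
            "random write": 9377, "random read": 1401}
for test in one_osd:
    gain = (two_osds[test] - one_osd[test]) / one_osd[test] * 100
    print(f"{test}: {gain:+.1f} %")
```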
In order to see the behaviour of the Ceph storage under higher workload, we again generated around 1 GB/s of read load and 1.5 GB/s of write load:
| Test | IOPS | IOPS | IOPS / MB/s |
| --- | --- | --- | --- |
| Serial Write | 13909 IOPS | 9724 IOPS | 38 IOPS / 154.25 MB/s |
| Serial Read | 519 IOPS | 495 IOPS | 18 IOPS / 72.13 MB/s |
| Random Write | 9689 IOPS | 9609 IOPS | 38 IOPS / 155.38 MB/s |
| Random Read | 293 IOPS | 311 IOPS | 18 IOPS / 73.02 MB/s |
Here, too, a significant improvement was noticeable in comparison with the configuration of only one OSD per NVMe device.
Finally, we attempted to improve performance further by adopting an approach from Samsung which – in cooperation with RedHat – had been proposed as the best alternative. With the recommended upgrade to 4 OSDs per NVMe, however, we could not detect any increase in performance, and even saw a deterioration when reading and writing larger blocks. With that, our benchmarks were complete: it was with 2 OSDs per NVMe that we were able to extract the optimal performance from the hardware.
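For readers who want to reproduce a multiple-OSDs-per-NVMe layout: on the FileStore-era releases we used, this meant manually creating two data and two journal partitions per device. Newer Ceph releases can provision such a layout in one step with `ceph-volume`. The sketch below is illustrative only – the device name is a placeholder and this was not our exact procedure:

```
# Sketch only – /dev/nvme0n1 is a placeholder device name.
# On recent Ceph releases, ceph-volume can split one device
# into multiple OSDs directly:
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
```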
To complete the various tests, we examined the reliability of our setup by simulating different failure scenarios under load.
NVMe Failure on the Fly
We simulated the failure of an NVMe disk on the fly, under a significant load on the cluster, by removing an NVMe from the server.
The result: for a period of a few seconds, no IO operations were possible. Thereafter, the storage worked normally again and no losses were detected on the VMs.
This outage is short enough that no impairment could be detected on the virtual machines.
NVMe Failure with Redistribution
With this scenario, we tested for potential performance problems by removing an NVMe disk and waiting until all data had been redistributed. Another objective of this test was to determine how long it takes until all data are fully available and redundant again.
The result was 5 minutes. This was the time the cluster waited after the removal of the disk before beginning to repair itself. To do so, it had to copy a substantial amount of data, which was, however, not noticeable on the VMs. For the test, approximately 6182 GB of data were stored on the cluster, and recovery took around 20 minutes.
Failure of a Cluster Node
We reconstructed this situation in order to find out what happens when an entire cluster node fails, for instance during a power outage or a reboot.
In this situation, we again generated a large amount of load on the cluster before cutting the power supply to the node. The behaviour was very similar to the failure of a single OSD: for several seconds, the cluster processed no IO operations, but subsequently worked properly again. In this case, too, recovery of the “lost” data began after 5 minutes.
Even a reboot of a Ceph server demonstrated the same effect. For a period of a few seconds, no IO operations were carried out, but the cluster subsequently ran normally. The data that were not stored redundantly during the reboot were then synchronised within a very short time.
After completion of all the tests, we were able to move the hardware from the lab racks into the production environment. Following the move, we conducted another performance test to verify our results from the lab in a production environment. On the basis of another recommendation from RedHat, we additionally migrated the server from Ubuntu Trusty to Ubuntu Xenial. Through the final performance test, we were also able to rule out the distribution upgrade having an impact on the performance of the cluster.