Our Experiences with Ceph - Part 2

nine Team Dec 14, 2016

In the first part of our blog series “Our experiences with Ceph” we presented the problems we needed to solve and our approach to them. In this second part we would like to discuss the structure of our new cluster as well as the expected and actually measured speeds of its individual components.

Structure and tests

After having decided on the extremely fast NVMe version, we ordered the relevant hardware.

Once the servers had arrived and were installed in the lab racks, we set up the systems and started running the first benchmarks straight away. To be able to determine the performance of each individual part of the cluster later on, and thus detect possible bottlenecks early, we tested every component individually, starting with the NVMe disks and working up to the IO speed inside a virtual machine using Ceph RBD as block storage.

Most of the following benchmarks were performed using FIO.

Disk benchmarks (OSDs & journals)

We started with a speed test of the NVMe disks. To do this, we created an FIO job that tests different block sizes and reports the corresponding IOPS and MB/s figures (shown below as IOPS / MB/s):

Test \ Block size    4k                 8k                 16k                2048k             4096k
Random Write         314276 / 1227.7    192417 / 1503.3    104569 / 1633.1    892 / 1787.3      445 / 1788.1
Random Read          351525 / 1373.2    246589 / 1926.6    138682 / 2166.1    1248 / 2501.5     648 / 2605.5

In these first benchmarks we found that, with 4k blocks, the disks achieved around 5.6 times the random write IOPS Intel specifies for these NVMes (56,000 IOPS), and around 100,000 fewer random read IOPS. A possible reason is that the official Intel benchmarks are run across the entire NVMe disk, which we had not done due to time constraints.
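
As a rough sketch, a single cell of this table, for example the 4k random write, can be measured with an invocation like the following (device path, queue depth and runtime are assumptions on our part, not the original job file):

fio --name=nvme-randwrite-4k \
    --filename=/dev/nvme0n1 \
    --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=128 \
    --direct=1 --time_based --runtime=60 \
    --group_reporting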

As each OSD also requires a journal, which we wanted to place on the same disk as the OSD itself, we had to test this performance as well. This test would also tell us what the write performance of the cluster per node would look like later. We therefore created the corresponding journal partitions and wrote to all four NVMes in parallel. The result was a write speed of 6830 MB/s per node.
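
A minimal sketch of such a parallel journal write test, assuming the journals live on the first partition of each NVMe (the partition layout and job parameters shown here are assumptions):

[global]
ioengine=libaio
iodepth=32
rw=write
bs=4m
direct=1
runtime=60
time_based
group_reporting

[journal-nvme0]
filename=/dev/nvme0n1p1

[journal-nvme1]
filename=/dev/nvme1n1p1

[journal-nvme2]
filename=/dev/nvme2n1p1

[journal-nvme3]
filename=/dev/nvme3n1p1

Without stonewall directives, FIO runs the four jobs in parallel, which matches writing to all four NVMes at once.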

In the next step we tried to reconstruct the behaviour of the OSD processes (i.e. the reading and writing of OSD partitions). For this, we created the following FIO job:

[global]
invalidate=1
ramp_time=5
ioengine=libaio
iodepth=128
exec_prerun="~/clear_caches.sh"
# we need to write more than we have memory as we use buffered IO
size=512G
direct=0
buffered=1
bs=4m

[random-read-write-1]
stonewall
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme0n1p2

[random-read-write-2]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme1n1p2

[random-read-write-3]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme2n1p2

[random-read-write-4]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme3n1p2

Reading from and writing to the OSD data partitions (unlike the journal) is done “buffered”, i.e. through the page cache in RAM. It is therefore important to write more data than the node has memory, otherwise we would largely be measuring the cache. We generated a workload of 80% write accesses and 20% read accesses. The results gave us a write speed of approx. 3773 MB/s and a read speed of approx. 947 MB/s per node. In real operation the speeds will be lower, because every write to an OSD also goes to the journal at the same time (the “double write penalty”), a case our test did not cover.

To find out how the write speed behaves in reality from the perspective of the OSD processes, we installed and configured Ceph on the servers. Ceph provides its own tools for testing the performance of individual components, among them ceph tell osd.<ID> bench <size> <blocksize>. The sequential write speeds for different block sizes look like this:

OSD no.    4k           8k           16k            2048k           4096k
osd.9      24.16 MB/s   61.24 MB/s   108.47 MB/s    1140.04 MB/s    1119.95 MB/s
osd.15     22.53 MB/s   68.60 MB/s   102.60 MB/s    1103.43 MB/s    1087.54 MB/s

The writing speed drops off significantly, especially for small block sizes.
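
For reference, a single run such as the 4k value for osd.9 can be invoked roughly like this (both arguments are in bytes; the 1 GB total written here is our choice):

ceph tell osd.9 bench 1073741824 4096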

Network

We first set up the network to use two of the four 10 Gbps connections, bonded together, for internal communication within the storage cluster. The two other ports, also bonded, were intended for server management and for external communication with the compute nodes; the traffic to the compute nodes was to run over a VLAN on this bond.
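
A minimal sketch of such a bond with a tagged VLAN on top, using iproute2 (interface names, bond mode, VLAN ID and the address are assumptions, not our actual configuration):

# bond two 10 Gbps ports for management and external traffic
ip link add bond1 type bond mode 802.3ad
ip link set ens1f0 down && ip link set ens1f0 master bond1
ip link set ens1f1 down && ip link set ens1f1 master bond1
ip link set bond1 up

# tagged VLAN towards the compute nodes on top of the bond
ip link add link bond1 name bond1.200 type vlan id 200
ip addr add 192.0.2.10/24 dev bond1.200
ip link set bond1.200 up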

In order to obtain key figures for the achievable network speed and any early indications of problems, we ran tests on each of the networks involved, using the iperf tool.
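
A typical run looks like the following, with one node acting as server and another as client (the address, stream count and duration are placeholders):

# on the receiving node
iperf -s

# on the sending node: four parallel streams for 30 seconds
iperf -c 192.0.2.10 -P 4 -t 30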

Initially, the tests went according to plan and we got roughly the values we had expected. The throughput between the Ceph and compute nodes, however, was not as expected: only 2.86 Gbit/s went over the 10 Gbps interfaces. To pin down the problem, we loaded the link with many different traffic combinations and discovered that the poor performance only occurred when the traffic carried a VLAN tag and was sent over a bonded link. As soon as either the bonding or the VLAN tag was removed, we achieved the expected speeds. We then ran further trials with different variations, including a direct connection between two servers without a switch, again with tagging and bonding, which worked without any losses.

It turned out that the problems must be related to our switches and we notified the manufacturer accordingly.

As we could not “simply” solve the problem and would have been dependent on a possible update for the switches, we decided to use an equivalent connection without VLAN tagging. This way we were able to use the full network bandwidth.

The storage cluster

After completing the tests on all the individual cluster components we were able to quickly and easily set everything up using Puppet, our configuration management tool.

We then tested how the individual servers work together, and started by benchmarking the write speed on a so-called pool. For this, ‘rados bench’ proved a very useful tool:

# rados bench -p tmppool 30 write

Total time run:         30.183657
Total writes made:      9648
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1278.57
Stddev Bandwidth:       87.7862
Max bandwidth (MB/sec): 1436
Min bandwidth (MB/sec): 1056
Average IOPS:           319
Stddev IOPS:            21
Max IOPS:               359
Min IOPS:               264
Average Latency(s):     0.0500232
Stddev Latency(s):      0.0477811
Max latency(s):         0.652174
Min latency(s):         0.0133229

These values clearly show that, when writing to a pool, the speed achieved is roughly the same as in the earlier test against a single OSD.

This gave us the performance of a single pool. The RBD volumes, which act as disks for the virtual servers, are created inside these pools. To measure the speed of such RBD volumes, Ceph already ships with several tools. We used rbd bench-write and got the following IOPS values:

Test \ Block size   4k         8k         16k        2048k    4096k
Serial Write        45385.67   34774.90   21122.50   309.98   160.60
Random Write        12690.85   12189.42   1329.33    284.44   152.23
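
A run like the 4k random write row can be started roughly as follows (pool and image name match the FIO test further below; the remaining options are left at their defaults):

rbd bench-write volumes/testing --io-size 4096 --io-pattern rand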

To cross-check these results, we performed further tests with FIO, writing to and reading from the RBD image directly. We took these measurements using fio --rw=write --size=10G --ioengine=rbd --direct=1 --iodepth=1 --pool=volumes --rbdname=testing --name=testing --clientname=libvirt --blocksize=4k, adjusting --blocksize and --rw for the individual measurements. The initial results looked like this:

Test \ Block size   4k (IOPS)   16k (IOPS)   4096k
Serial Write        25941       22599        117 IOPS / 471.78 MB/s
Serial Read         1382        1306         135 IOPS / 543.33 MB/s
Random Write        10907       11748        153 IOPS / 613.95 MB/s
Random Read         1196        1085         91 IOPS / 365.50 MB/s

After we had established this initial performance data for an RBD image, we applied two tuning measures which we found on the Ceph mailing list:

  • The CPUs should always run at the highest possible frequency; “waking up” from lower frequencies costs too much time.
  • rq_affinity (see https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt) should be set to 2 for all NVMe disks. This way, completed IO requests are handled on the CPU that issued them. As we pin our OSD processes to individual CPUs, this should result in a performance increase. Both settings are sketched below.
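
Applied via sysfs, the two settings amount to roughly the following (a minimal sketch; the device names and the availability of the cpufreq “performance” governor depend on hardware, kernel and driver):

# keep all CPUs at their highest frequency
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done

# deliver IO completions to the CPU that submitted the request
for rq in /sys/block/nvme*n1/queue/rq_affinity; do
    echo 2 > "$rq"
done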

We then ran the same test again and received the following values:

Test \ Block size   4k (IOPS)   16k (IOPS)   4096k
Serial Write        26791       28086        142 IOPS / 571.36 MB/s
Serial Read         2147        1960         176 IOPS / 705.56 MB/s
Random Write        12408       13370        255 IOPS / 1020.1 MB/s
Random Read         1512        1451         175 IOPS / 703.50 MB/s

In our next blog post, Part 3 of our Ceph experiences, we will measure the IO speeds inside a virtual machine, find out to what extent different CPUs impact performance, and try to identify the optimal ratio between the number of NVMe disks and the number of OSD processes running on them.

Click here to read the first part of the series.