In the first part of our blog series “Our experiences with Ceph” we presented the problems we needed to solve and our approaches to them. In this second part we would like to discuss the structure of our new cluster as well as the expected and measured speeds of its individual components.
Structure and tests
After having decided on the extremely fast NVMe version, we ordered the relevant hardware.
After the servers had arrived and been installed in the lab racks, we set up the systems and got started straight away with the first benchmarks. To determine the performance of each individual part of the cluster, and thereby detect possible bottlenecks early on, we tested every component separately, starting with the NVMe disks and working up to the IO speed inside a virtual machine using Ceph RBD as block storage.
Most of the following benchmarks were performed using FIO.
Disk benchmarks (OSDs & journals)
We started with a speed test of the NVMe disks. To do this, we created an FIO job that tests different block sizes and reports the corresponding IOPS and MB/s figures:

| | 4k | 8k | 16k | 2m | 4m |
|---|---|---|---|---|---|
| Random Write (IOPS / MB/s) | 314276 / 1227.7 | 192417 / 1503.3 | 104569 / 1633.1 | 892 / 1787.3 | 445 / 1788.1 |
| Random Read (IOPS / MB/s) | 351525 / 1373.2 | 246589 / 1926.6 | 138682 / 2166.1 | 1248 / 2501.5 | 648 / 2605.5 |
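The original job file is not reproduced here; a minimal sketch of such a block-size sweep could look like the following, where the device path, runtime, and job names are our assumptions:

```ini
; Hypothetical FIO job: random-write sweep over block sizes.
; /dev/nvme0n1 and the 60 s runtime are assumptions, not the original job.
[global]
ioengine=libaio
direct=1
iodepth=128
time_based=1
runtime=60
filename=/dev/nvme0n1

[randwrite-4k]
stonewall
rw=randwrite
bs=4k

[randwrite-8k]
stonewall
rw=randwrite
bs=8k

[randwrite-4m]
stonewall
rw=randwrite
bs=4m
```

`stonewall` makes each job wait for the previous one, so the block sizes are measured one after another rather than in parallel.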
These first benchmarks showed around 5.6 times more random write IOPS at 4k blocks than Intel specifies for these NVMes (56,000 IOPS), and around 100,000 fewer random read IOPS. A possible reason is that the official Intel benchmarks are performed across the entire NVMe disk, something we had not done due to time constraints.
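As a sanity check, the IOPS and MB/s columns are consistent with one another: IOPS multiplied by the block size gives the throughput. For the 4k random-write figure:

```shell
# 314276 IOPS at a 4k block size should match the quoted ~1227.7 MB/s:
# IOPS * 4096 bytes / 2^20 bytes-per-MB = MB/s
awk 'BEGIN { printf "%.1f MB/s\n", 314276 * 4096 / 1048576 }'
# prints "1227.6 MB/s"
```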
As each OSD also requires a journal, and we wanted to place it on the same disk as the OSD, we had to test this performance as well. This test would also tell us what the cluster's write performance per node would later look like. We therefore created the relevant journal partitions and wrote to all four NVMes in parallel. The result was a write speed of 6830 MB/s per node.
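This parallel journal write test can be sketched as an FIO job along the following lines; the partition names, block size, and data size are assumptions:

```ini
; Hypothetical job: sequential writes to all four journal partitions
; at the same time (no stonewall, so the jobs run in parallel).
[global]
ioengine=libaio
direct=1
rw=write
bs=4m
iodepth=32
size=100G

[journal-nvme0]
filename=/dev/nvme0n1p1
[journal-nvme1]
filename=/dev/nvme1n1p1
[journal-nvme2]
filename=/dev/nvme2n1p1
[journal-nvme3]
filename=/dev/nvme3n1p1
```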
In the next step we tried to reconstruct the behaviour of the OSD processes (i.e. the reading and writing of OSD partitions). For this, we created the following FIO job:
```ini
[global]
invalidate=1
ramp_time=5
ioengine=libaio
iodepth=128
exec_prerun="~/clear_caches.sh"
# we need to write more than we have memory as we use buffered IO
size=512G
direct=0
buffered=1
bs=4m

[random-read-write-1]
stonewall
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme0n1p2

[random-read-write-2]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme1n1p2

[random-read-write-3]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme2n1p2

[random-read-write-4]
rw=randrw
rwmixread=20
rwmixwrite=80
filename=/dev/nvme3n1p2
```
Reading and writing to/from the OSD data partitions (not the journal) is done in a “buffered” way, i.e. it passes through the page cache in RAM. It is therefore important to write more data than the amount of memory available in the node. We generated a workload of 80% write accesses and 20% read accesses. The results gave us a write speed of approx. 3773 MB/s and a read speed of approx. 947 MB/s per node. In real operation, however, the speeds should be lower, because when writing to an OSD, the journal is written to at the same time (“double write penalty”). This case was not covered in our test.
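Because every client write to a filestore OSD hits the disk twice, once for the journal and once for the data partition, a rough ceiling for the client-visible write bandwidth is half the raw figure. A back-of-envelope estimate (our own, not part of the original test):

```shell
# Raw buffered write speed per node was ~3773 MB/s; with journal and
# data colocated on the same NVMe, each client byte is written twice,
# so the client-visible ceiling is roughly half of that.
awk 'BEGIN { printf "%.1f MB/s\n", 3773 / 2 }'
# prints "1886.5 MB/s"
```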
To find out how the write speed behaves in reality from the perspective of the OSD processes, we installed and configured Ceph on the servers. Ceph ships its own tools for testing the performance of individual components, among them `ceph tell osd.<ID> bench <size> <blocksize>`. The sequential write speeds at different block sizes look like this:
| | 4k | 8k | 16k | 2m | 4m |
|---|---|---|---|---|---|
| osd.9 | 24.16 MB/s | 61.24 MB/s | 108.47 MB/s | 1140.04 MB/s | 1119.95 MB/s |
| osd.15 | 22.53 MB/s | 68.60 MB/s | 102.60 MB/s | 1103.43 MB/s | 1087.54 MB/s |
The write speed drops off significantly, especially at small block sizes.
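Numbers like those above can be collected with a small loop. Here we only print the commands (the OSD id and the 1 GiB total size are assumptions); remove the `echo` to run the sweep against a live cluster:

```shell
# Dry-run sweep of 'ceph tell osd.<id> bench' over block sizes
# 4k, 8k, 16k, 2m and 4m; 'echo' keeps this from touching a cluster.
for bs in 4096 8192 16384 2097152 4194304; do
    echo ceph tell osd.9 bench 1073741824 "$bs"
done
```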
The network
We first set up the network to use two of the four 10 Gbps connections, bonded, for internal communication within the storage cluster. The other two ports, also bonded, were intended for server management and for external communication (to the compute nodes). For the traffic to the compute nodes, we planned to use a VLAN.
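On a Debian-style system, this intended layout, one bond for the cluster network and a second bond carrying a tagged VLAN towards the compute nodes, could be described roughly as follows in `/etc/network/interfaces`. The interface names, VLAN id, and addresses are all assumptions, not our actual configuration:

```ini
# Hypothetical /etc/network/interfaces fragment.
auto bond0
iface bond0 inet static
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    address 10.0.0.11/24        # internal storage cluster network

auto bond1
iface bond1 inet static
    bond-slaves eth2 eth3
    bond-mode 802.3ad
    address 192.0.2.11/24       # server management

auto bond1.100
iface bond1.100 inet static
    vlan-raw-device bond1
    address 10.0.1.11/24        # tagged traffic to the compute nodes
```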
To obtain key figures for the achievable network speed, as well as early indications of problems, different tests on each of the networks involved were necessary. We performed these using the iperf tool.
Initially, the tests went according to plan and delivered roughly the values we had expected. The throughput between the Ceph and compute nodes, however, did not: only 2.86 Gbit/s went over the 10 Gbps interfaces. To pin down the problem, we loaded the line with many different combinations. In doing so we discovered that the poor performance only occurred when the traffic was tagged with a VLAN tag and sent over a bonded link. As soon as either the bonding or the VLAN tag was removed, we achieved the expected speeds. Based on this we ran many further trials with different variations, including a direct connection between two servers without a switch, with tagging and bonding, which worked without any losses.
It turned out that the problem had to be related to our switches, and we notified the manufacturer accordingly.
As we could not “simply” solve the problem ourselves and would be dependent on a possible update for the switches, we decided on an equivalent connection without tagging. This way we were able to utilise the entire network bandwidth.
The storage cluster
After completing the tests on all the individual cluster components we were able to quickly and easily set everything up using Puppet, our configuration management tool.
We then tested the cooperation of the individual servers, starting with a benchmark of the write speed of a so-called pool. For this, `rados bench` proved a very useful tool:
```
# rados bench -p tmppool 30 write
Total time run:         30.183657
Total writes made:      9648
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1278.57
Stddev Bandwidth:       87.7862
Max bandwidth (MB/sec): 1436
Min bandwidth (MB/sec): 1056
Average IOPS:           319
Stddev IOPS:            21
Max IOPS:               359
Min IOPS:               264
Average Latency(s):     0.0500232
Stddev Latency(s):      0.0477811
Max latency(s):         0.652174
Min latency(s):         0.0133229
```
Based on these values we could clearly see that when writing to a pool the speed achieved is approximately the same as in a previous test writing to a single OSD.
This gave us the performance of a single pool. Within the pools, the individual RBD volumes are created, which act as disks for the virtual servers. To measure the speed of such RBD volumes, Ceph again brings several of its own tools. We used `rbd bench-write` and received the following IOPS values:
To substantiate these results in another way, we performed further tests with FIO, writing and reading directly to and from the RBD image. We took these measurements using

```
fio --rw=write --size=10G --ioengine=rbd --direct=1 --iodepth=1 --pool=volumes --rbdname=testing --name=testing --clientname=libvirt --blocksize=4k
```

whereby `--rw` was adjusted for the individual measurements. Initial results looked like this:
| | 4k | 8k | 4m |
|---|---|---|---|
| Serial Write | 25941 IOPS | 22599 IOPS | 117 IOPS / 471.78 MB/s |
| Serial Read | 1382 IOPS | 1306 IOPS | 135 IOPS / 543.33 MB/s |
| Random Write | 10907 IOPS | 11748 IOPS | 153 IOPS / 613.95 MB/s |
| Random Read | 1196 IOPS | 1085 IOPS | 91 IOPS / 365.50 MB/s |
After we had established the initial performance data for an RBD image, we tried two tuning measures which we found on the Ceph mailing list:
- The CPUs should always run at the highest possible frequency; “waking up” from the lower frequencies costs too much time.
- The rq_affinity (see https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt) needed to be set to 2 for all NVMe disks. This way, completed IO requests are handed back to the CPU that originally issued them. As we pin our OSD processes to individual CPUs, this should result in a performance increase.
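Both settings can be applied via sysfs. A sketch, assuming root privileges and NVMe device names of the form `nvme*` (paths that do not exist or are not writable are simply skipped):

```shell
#!/bin/sh
# Pin the CPU frequency governor to 'performance' and steer block-layer
# IO completions back to the issuing CPU (rq_affinity=2). Unwritable or
# missing sysfs paths are skipped, so the script is safe to dry-run.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -w "$g" ] && echo performance > "$g" || true
done
for q in /sys/block/nvme*/queue/rq_affinity; do
    [ -w "$q" ] && echo 2 > "$q" || true
done
```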
We then repeated the same test and received the following values:
| | 4k | 8k | 4m |
|---|---|---|---|
| Serial Write | 26791 IOPS | 28086 IOPS | 142 IOPS / 571.36 MB/s |
| Serial Read | 2147 IOPS | 1960 IOPS | 176 IOPS / 705.56 MB/s |
| Random Write | 12408 IOPS | 13370 IOPS | 255 IOPS / 1020.1 MB/s |
| Random Read | 1512 IOPS | 1451 IOPS | 175 IOPS / 703.50 MB/s |
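At the largest block size the two tuning measures translate into a sizeable gain. The relative improvement of the random-write throughput, derived from the before and after figures (613.95 MB/s vs. 1020.1 MB/s):

```shell
# Relative random-write improvement after tuning:
awk 'BEGIN { printf "+%.0f%%\n", (1020.1 / 613.95 - 1) * 100 }'
# prints "+66%"
```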
In our next blog post, Part 3 of our Ceph experiences, we will measure the IO speeds inside a virtual machine, find out to what extent different CPUs impact performance, and try to identify the optimal ratio of NVMe disks to the number of OSD processes running on them.
Click here to read the first part of the series.