Server migration done right - Moving hundreds of client VMs to Nutanix with almost no downtime

After deciding to replace our existing virtualization infrastructure ninevirt with Nutanix Hyperconverged Infrastructure and implementing the necessary architectural changes, we were faced with the next big challenges:

  1. How do we technically migrate a client VM from ninevirt to Nutanix?
  2. How do we organize this for all our clients and VMs with minimal impact?

This blog post covers some of our thoughts and provides insights from both the technical and the operational side of our Nutanix migration journey.

Migrations can be painful

We all know that migrations can create an immense amount of work, and there is always a chance that something may go wrong. Operating and migrating thousands of servers (both hardware and virtual) has taught us this the hard way. The golden rule for our Nutanix migrations was that neither the customer nor we should have to extensively test anything on the application side before or after migrating.

Our absolute focus was to create a process that does not require our customers to take actions when we migrate their VMs. As a little spoiler: we made it work :)

Asking the technical migration questions

When we are faced with a complex technical task and a lot of open questions, we usually start with an MVP story. In our case it looked like this:

As a managed service engineer, I need a solid and easy path to migrate clients from the ninevirt infrastructure to the Nutanix infrastructure with minimal client impact. With the help of this story, I will evaluate different ways to achieve this and choose the best option.

Acceptance criteria:

- MVP implemented

- Follow-up stories are created for the live migration

- Clients experience no impact, since migrations happen during a maintenance window

- No customer interaction is necessary (DNS changes, etc.)

- A plan for the migration path is defined; no script has to be written

- Simplicity rules complexity

Once we feel confident that the idea works, we will continue to work on the story until we have something that fits our requirements.

The best engineering option is often a solution that already exists and doesn’t require any new development. Nutanix offers an existing product called Nutanix Move, which allows users to migrate Microsoft Hyper-V based VMs or AWS EC2 instances to Nutanix AHV. As this didn't support our scenario, we had to come up with our own solution.

As with every migration, there is always the possibility to manually copy data from the old to the new machine, usually by performing a sync between them. However, as this process is time-consuming, error-prone and requires longer downtimes, it was ruled out from the beginning.

Nine’s previous in-house solution is based on QEMU and KVM, and given that Nutanix AHV has its roots in KVM, we were not miles apart from the beginning. Realistically, AHV was the most pain-free option we could have chosen as our migration destination.

Migrating with existing tools

The “heavy lifting” of migrating the VMs from our ninevirt environment to the new Nutanix environment turned out to hinge on making use of built-in tools on both sides.

One crucial factor for this was the ability to share the Nutanix storage from the so-called Controller VMs (CVMs) with our ninevirt infrastructure.

As stated above, the virtualization technology stacks are very similar, so the VM disk containers created on the Nutanix environment are compatible with the ones we have on our ninevirt infrastructure.
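
To make this concrete, here is a hedged sketch of what such a storage share could look like from the ninevirt side, assuming the host's IP has been added to the Nutanix filesystem whitelist; hostnames and the container name are placeholders, not our actual setup:

    # Placeholder example: mount a Nutanix storage container via NFS on a
    # ninevirt host, after whitelisting that host's IP on the Nutanix side
    mkdir -p /mnt/nutanix
    mount -t nfs cvm.example.net:/migration-container /mnt/nutanix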

Step by step migration explanation

Creating the empty VMs

Since we use Terraform to manage all VMs on the Nutanix platform, we had to find a way to create the Terraform manifests and to populate them with the specs of the original machines.

Remember the “Simplicity rules complexity” acceptance criterion? From here on, the only information the engineer performing the migration needs is the hostname of the server. All other configuration items can be derived from information already persisted in tools that exist in our environment.

Technically, we solved this by:

  1. Fetching the information of the VM from the old virtualization environment and our Puppet database
  2. Creating a Terraform template with that information
  3. Adding this new configuration to the existing GitHub repository and triggering a pipeline, which executes Terraform and creates the VM on Nutanix

To create a new VM on Nutanix, we need four key pieces of information about the old VM: the resources for CPU, RAM and disk, and most importantly, the VLAN ID(s) of the network interface(s).

In order to prevent migration issues, we add 1 GB to the disk size parameter, because we can’t copy a larger volume to a smaller one. If the new Nutanix disk were smaller by even a single byte, we would risk corrupt data and file systems, which would create havoc when migrating.
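
As an illustration, a simplified sketch of how such specs could be collected on a libvirt-based host follows; the device name vda, the awk parsing and the template helper at the end are hypothetical stand-ins, not our actual script:

    #!/bin/bash
    # Hypothetical sketch: collect the specs of a ninevirt VM that are
    # needed to render the Terraform manifest
    HOSTNAME="$1"   # the only input an engineer has to supply

    # CPU count and RAM are read from libvirt
    VCPUS=$(virsh dominfo "$HOSTNAME" | awk '/^CPU\(s\)/ {print $2}')
    RAM_KIB=$(virsh dominfo "$HOSTNAME" | awk '/^Max memory/ {print $3}')

    # Disk capacity in bytes (assuming a single disk named vda)
    DISK_BYTES=$(virsh domblkinfo "$HOSTNAME" vda | awk '/^Capacity/ {print $2}')

    # Add roughly 1 GB (1024 MiB) of headroom so the destination disk can
    # never end up smaller than the source disk
    DISK_MIB=$(( DISK_BYTES / 1024 / 1024 + 1024 ))

    # VLAN ID(s) would be looked up in the Puppet database (not shown), then
    # everything is handed to a (hypothetical) template renderer
    ./render-terraform-template "$HOSTNAME" "$VCPUS" "$RAM_KIB" "$DISK_MIB"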

An example Terraform definition would then look like this:

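The sketch below is a minimal stand-in built with the public Nutanix Terraform provider; all names, sizes and UUID references are placeholders, and the exact attributes of our generated manifests may differ:

    resource "nutanix_virtual_machine" "web01" {
      name                 = "web01"
      cluster_uuid         = data.nutanix_cluster.cluster1.id

      num_sockets          = 1
      num_vcpus_per_socket = 2
      memory_size_mib      = 4096

      # disk size of the source VM plus the 1 GB headroom described above
      disk_list {
        disk_size_mib = 51200
      }

      # network interface placed in the client's VLAN
      nic_list {
        subnet_uuid = data.nutanix_subnet.vlan123.id
      }
    }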

Our engineers decided to implement this process in a Bash script. Go was also a possibility, but we don’t force any specific implementation as long as the chosen solution is part of our common toolset - you choose the tool you can handle best.

Initially, we pushed this file to a new Git branch, enforcing a full GitOps workflow with merge requests, approval, and a manual merge. However, this added complexity without providing additional value for this specific case, so it was later removed. The reason for this initial workflow was to prevent accidental removal of VMs, but we found a smarter way to ensure this when we focussed on our credo: “Simplicity rules complexity”.

We kept the GitLab pipelines that execute all the Terraform commands, including the “terraform apply” step that creates or removes a VM. This is a manual step, triggered after an engineer confirms that the preceding “terraform plan” shows the expected results.

Migration of the data

The new VM on Nutanix is now ready, but it is still missing the user data on its empty disks and is in the wrong power state. In order to safely copy data from the ninevirt to the Nutanix environment, we have to ensure that the destination VM is shut down and its filesystems are not mounted.

This is where we use the Nutanix API to shut down the newly created VM.
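
For illustration, such a power-state request against the Prism v2 REST API could look like the following sketch; host, credentials and the UUID are placeholders, and the exact call our script uses may differ:

    # Hypothetical sketch: power off the newly created VM via the Prism v2
    # REST API; host, credentials and UUID are placeholders
    VM_UUID="8a2f0f2e-..."   # UUID of the new Nutanix VM
    curl -s -k -u admin:secret \
      -X POST -H "Content-Type: application/json" \
      -d '{"transition": "OFF"}' \
      "https://prism.example.net:9440/api/nutanix/v2.0/vms/${VM_UUID}/set_power_state"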

To copy the data to the new VM, we wrote a second bash script that:

  1. Uses the Nutanix API to determine the UUID of the destination disk, the one created together with the new VM
  2. Mounts the Nutanix storage on the ninevirt machine
  3. Performs a block copy of the ninevirt VM's disk to the new Nutanix storage container
  4. Shuts down the origin ninevirt machine and powers on the new Nutanix VM using the Nutanix API

This step again only requires the hostname of the VM to be migrated as a parameter. The Nutanix API allows us to get all relevant information about the VM: the disk UUID and size for the storage, and the power state, so we can ensure a correct state at all times.
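
Put together, the skeleton of such a script might look like this sketch; every function here is a hypothetical wrapper around the Nutanix API calls and the mount/copy steps described above, not a real command:

    #!/bin/bash
    # Hypothetical skeleton of the copy script; all functions are stand-ins
    set -euo pipefail
    HOSTNAME="$1"   # the only parameter an engineer has to supply

    VM_UUID=$(lookup_vm_uuid "$HOSTNAME")      # Nutanix API: find the new VM
    DISK_UUID=$(lookup_disk_uuid "$VM_UUID")   # Nutanix API: find its empty disk
    ensure_powered_off "$VM_UUID"              # never write to a disk in use

    mount_nutanix_storage /mnt/nutanix         # NFS mount from the CVMs

    # block copy of the running source VM; the vdisk path layout is illustrative
    blockcopy_disk "$HOSTNAME" "/mnt/nutanix/.acropolis/vmdisk/${DISK_UUID}"

    confirm "Ready to cut over ${HOSTNAME}?"   # engineer picks the exact moment
    shutdown_ninevirt_vm "$HOSTNAME"           # graceful shutdown of the source
    power_on_nutanix_vm "$VM_UUID"             # start the new VM via the API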

The real “magic” happening here is “virsh blockcopy”. As the name suggests, this performs an exact block copy of the source disk. Furthermore, it also synchronizes all changes that happen on the running source VM until the source VM has been shut down. This is a key factor in keeping the migration downtime as short as it is; we neither need to do a file-based synchronization nor re-sync any data.
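
Assuming the destination disk was mounted as shown above, the core of the copy boils down to something like this sketch; the device name vda and the destination path are examples, not our exact invocation:

    # --reuse-external writes into the pre-created Nutanix vdisk instead of
    # creating a new file; --wait --verbose blocks until the job reaches its
    # "ready" phase, in which libvirt keeps mirroring every new write of the
    # still-running VM to the destination
    virsh blockcopy "$HOSTNAME" vda \
      --dest "/mnt/nutanix/.acropolis/vmdisk/${DISK_UUID}" \
      --reuse-external --wait --verbose

    # once the source VM has been gracefully shut down, ending the mirror job
    # leaves a consistent copy on the Nutanix side
    virsh blockjob "$HOSTNAME" vda --abort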

After the initial blockcopy has finished, we require confirmation by the migrating engineer, which allows full control over the exact migration time. We then gracefully shut down the old instance on ninevirt and, to ensure consistency, validate this in two different ways. If one were to start the new VM on Nutanix while the old VM continues running on ninevirt with the blockcopy synchronization still active, this would result in filesystem errors on the new environment and we would need to start the full process over again.

Yes, you read that right: up to this point, these steps are completely reversible! If we see any unexpected behaviour of the migrated VMs, we can stop the Nutanix VM and restart the old ninevirt VM. We have actually never had to revert any of our customers' VMs; however, while testing all edge cases during the implementation of the migration process, we did this several times for our own VMs.

Apart from a lot of sanity checks, some improvements for the migration of VMs hosting databases, and the installation of the Nutanix Guest Tools after the new VM is started, that pretty much describes the magic behind the almost seamless migration process.

Organizing the migrations for all systems

After we had the technical solution in place, it was time to start with the migrations. We wanted the migrations for a client to happen with minimal impact and therefore had to be absolutely sure our plan worked as expected. It would have been of questionable sanity to start with production client migrations straight away, so we decided to proceed in three stages.

1. Test Phase

To be confident in handling such a critical task, every engineer had to migrate their personal test VMs to Nutanix. The test phase not only ensured that every engineer learned the required steps for a migration, it was also a good way to familiarise everyone with the new technology stack. Not everybody was involved in the development of the platform to the same extent, and some toolsets like Terraform were known in theory but not yet used in practice by everyone. During the testing phase, we received a lot of feedback on the migration script, and improvements were implemented along the way in our sprints.

2. Internal Phase

Before we started the internal migration, we created a list of all the systems we wanted to migrate to the new Nutanix platform. Systems planned for phase-out and some special systems which we could not yet migrate were excluded. The internal phase started with the migration of non-critical nine-internal systems, which were easily migrated during office hours. The initial tests went well, so we continued to migrate the rest of the internal infrastructure.

We now felt more than confident to enter phase three, migrating the VMs of our clients.

3. Client phase

The client phase started with a simple shared Google sheet and the training of our Canadian team. Each client system was given a migration date, and the customers were informed by our brilliant Customer Service Desk about the exact migration date. The team in Canada started with a few migrations in the first week to get a feel for the process and then increased the number of VM migrations per night.

We also migrated some client cluster systems during normal office hours, as their redundant design ensures that there is no downtime.

Thanks to the excellent preparation, no bad surprises have occurred so far. At one point we faced physical network bandwidth constraints on our old platform when we started multiple migrations in parallel, but we accommodated this by adjusting our work mode accordingly, and it did not affect the planned number of migrations per night.

There is no routine without surprises every now and then, and so it happened that a very few clients noticed that their websites were not available after the final reboot of their machines. Those cases, however, turned out to be isolated and not linked to a nine managed service. We allow our clients to run services in their user space, and it is their responsibility to take care of these services and ensure they come back online after a reboot. In these cases the client had missed the notice in the migration announcement, and we were happy to help them, so their services will now automatically start after a future reboot.

The concept of the three phases worked out really well: a good migration project needs technical confidence, clear communication and easily trackable progress.

Migrations can also be painless

To date, we have migrated several hundred VMs without a single customer having to change anything on their side. No testing was needed, no configurations had to be touched, and on our end the migration boils down to a few steps:

  • Run a script that creates the Terraform config (only supplying the VM's hostname)
  • Trigger the GitLab pipeline in the customer context
  • Tell your clients to be aware of a server reboot
  • Run a script that starts the blockcopy (only supplying the VM's hostname)
  • Let the script stop the old and start the new environment

This leaves virtually no uncertainty, as all steps are automated and only require the VM's hostname or the customer identifier.

Watch our Canadian colleague Nicholas showcasing a migration over on our YouTube channel.

Special thanks go out to the team and everybody involved. We hope our clients appreciate the improved performance of the Nutanix platform and the new features it will offer in the future.
