Hello! This is cw-ozaki from the SRE department.
This post will be a continuation of Part 1, and I will explain the strategy with which we migrated our system to Kubernetes.
- Our strategy for migrating to Kubernetes
- 1. Add a spec to EC2.
- 2. Add a simple E2E to the service launched in EC2.
- 3. Convert the runbook to IaC with Ansible.
- 4. Run the spec and E2E with CI against the IaC system built with Ansible.
- 5. Replace the production environment with the IaC system.
- 6. Streamline the specifications, refactor, and remove specifications that are incompatible with Kubernetes.
- 7. Deploy on Kubernetes!!!
Our strategy for migrating to Kubernetes
The migration from EC2 to Kubernetes was performed in the following steps.
- Add a spec to EC2.
- Add a simple E2E test to the service launched in EC2.
- Convert the runbook to IaC with Ansible.
- Run the spec and E2E with CI against the IaC system built with Ansible.
- Replace the production environment with the IaC system.
- Streamline the specifications, refactor, and remove specifications that are incompatible with Kubernetes.
- Deploy on Kubernetes!!!
If your workflow is simple and you can fully grasp it on your own, deploying applications on Kubernetes is easy.
However, Chatwork ran on an EC2 environment that had been built up over several years. The engineer who originally built this environment had already left the company, and although I had some understanding of it, I was not confident that I could grasp it fully and migrate the system perfectly.
Therefore, I needed to identify the specifications of this EC2 environment in order to avoid issues caused by lapses in the migration process.
1. Add a spec to EC2.
The current EC2 was built with the following steps.
- Perform the initial setup of EC2 based on the runbook and create a base AMI.
- Use fabric to sync the latest configuration files.
- Use capistrano to sync the latest applications.
- Make an AMI from the synced instance and register it in the Auto Scaling Group for each environment.
In this regard, we could identify the current specifications to some extent from the runbook and fabric/capistrano, but there were several things that were not immediately obvious.
- Has this runbook been updated correctly in the first place?
- fabric can tell us the values we set but not the default values.
- For instance, the default values of PHP extension settings were not included in the fabric.
- Are there any changes that only exist on the AMI?
- Ad-hoc changes that were made in response to issues may not have been included in the runbook or fabric.
As we believed that these ambiguities would certainly have led to migration lapses, we decided to introduce ServerSpec and write a spec to codify the specifications in order to have a better understanding of the existing system.
ServerSpec is a tool that all infrastructure engineers love.
I will not go into detail on how to write a spec using ServerSpec or how to use it, since that is not really my main focus here.
We created a spec to check the following from the instances launched on the current AMI, the runbook, the PRs of fabric and capistrano, and past logs.
- The installed applications and their versions
- The installed plugins and their versions
- Whether the installed applications are being run as services
- Whether the commands used in scripts are present
- Whether the kernel parameters, locale, etc., have been configured appropriately
- Whether the required environment variables have been configured appropriately
- Whether the directories and files created by the applications and scripts have the appropriate permissions
- Whether there is an SSH user for capistrano
- Whether there is a user configured for Git to be used on capistrano
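The checks in the list above might look something like the following ServerSpec sketch. All of the package names, parameter values, paths, and user names here are illustrative assumptions, not Chatwork's actual configuration.

```ruby
# A minimal ServerSpec sketch of the kinds of checks listed above.
# Every concrete value (nginx, somaxconn, /var/www/app, deploy) is a
# hypothetical example, not Chatwork's real configuration.
require 'serverspec'
set :backend, :exec

describe package('nginx') do
  # Installed applications and their versions
  it { should be_installed }
end

describe service('nginx') do
  # Installed applications being run as services
  it { should be_enabled }
  it { should be_running }
end

describe linux_kernel_parameter('net.core.somaxconn') do
  # Kernel parameters configured appropriately
  its(:value) { should eq 1024 }
end

describe file('/var/www/app/shared/log') do
  # Directories created by the application with appropriate permissions
  it { should be_directory }
  it { should be_mode 755 }
end

describe user('deploy') do
  # SSH user for capistrano
  it { should exist }
  it { should belong_to_group 'deploy' }
end
```

Running such a spec against an instance launched from the current AMI immediately surfaces any gap between the runbook and reality.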
We then proceeded to carefully check all the items that seem to be relevant as we began to write the spec based on the runbook, various documents, and the code.
The important thing to note here is not to seek absolute perfection.
It is quite impossible to create a perfect spec in such a situation where no spec had previously existed, the documentation is incomplete, and there is no one from whom we could seek verbal confirmation. Rather, it makes more sense psychologically to adhere to the idea of "fail fast," where we set up a system quickly and identify any problems so that we can feel that we are a step closer to perfection.
2. Add a simple E2E to the service launched in EC2.
The order of the steps here may differ in practice, but when we experimented with the environment, we found cases where the Chatwork service failed to launch even though the ServerSpec run passed.
Of course, in such cases, we would add new specs for the problematic settings as required. However, it would not make sense to test this manually every time, so we needed an HTTP-based way to test the behavior in an E2E manner.
We decided to use Infrataster to perform this test.
One advantage of Infrataster is that it is written in Ruby, which makes it easy to use together with ServerSpec.
In fact, we had also considered Bats and Selenium at this point. However, testing requests with them was more difficult, and they felt heavyweight for our needs. Infrataster struck a good balance: it was user-friendly and satisfied our requirement of being able to run E2E tests in a simple way.
Infrataster is mainly used to test the following behaviors.
- Coverage of NGINX location settings
- Application features that must work at the most basic level, such as login
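A test covering those two behaviors might look like the following Infrataster sketch. The server address, hostname, and paths are illustrative assumptions.

```ruby
# A minimal Infrataster sketch for the two behaviors above.
# The IP address, hostname, and paths are hypothetical examples.
require 'infrataster/rspec'

Infrataster::Server.define(:web, '10.0.0.10')

describe server(:web) do
  describe http('http://chat.example.com/') do
    # NGINX location coverage: the root location serves the application
    it 'responds with 200' do
      expect(response.status).to eq(200)
    end
  end

  describe http('http://chat.example.com/login') do
    # Basic application feature: the login page is reachable end to end
    it 'responds with 200' do
      expect(response.status).to eq(200)
    end
  end
end
```

Because both ServerSpec and Infrataster are RSpec-based, the two suites can be run from the same test harness against the same instance.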
We used Infrataster because Chatwork did not have E2E tests. For services that already come with E2E tests, it is best to use those directly.
To be honest, it would have been ideal to be able to develop E2E testing at this point, but no matter how we thought about it, doing that would be a massive project that may hinder the actual migration, so we made a compromise this time by using a simple E2E.
3. Convert the runbook to IaC with Ansible.
During this migration, we also discontinued the runbook and fabric, and migrated their contents to Ansible.
This is because we did not know how long the migration process from EC2 to Kubernetes would take when we initially started working on this. Also, I am in charge of running the current PHP application, and if any issues occur on that front, I would need to perform a restore and find a permanent fix. This would make it impossible to estimate the amount of follow-up work required.
As a result, if the specifications were changed during the migration process, these changes may be missed and become impossible to track. In view of this, we introduced the use of Ansible as a substitute for the entire flow of how changes are made in the EC2 environment, including CI as described below.
On Ansible itself, we have been steadily writing up the runbook, the content of the fabric, and the parts that we might possibly need from the spec.
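As a sketch of what this conversion looks like, a single runbook step ("install PHP and its extensions, place the configuration, and enable php-fpm") might become an Ansible task file like the one below. The package names, file paths, and handler name are hypothetical examples, not Chatwork's actual values.

```yaml
# A hypothetical Ansible task file translating one runbook step.
# Package names, paths, and the handler name are illustrative assumptions.
- name: Install PHP and required extensions
  yum:
    name:
      - php
      - php-fpm
      - php-mbstring
    state: present

- name: Place php-fpm configuration (previously synced by fabric)
  template:
    src: www.conf.j2
    dest: /etc/php-fpm.d/www.conf
    owner: root
    mode: "0644"
  notify: restart php-fpm

- name: Enable and start php-fpm
  service:
    name: php-fpm
    state: started
    enabled: true
```

Unlike a runbook, a task file like this can be re-applied at any time, which is what gives Ansible its reproducibility.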
However, in retrospect, it remains unclear if using Ansible for this was indeed the best solution.
This is because when the number of EC2 instances exceeds a certain number, Ansible starts to slow down and takes a long time to run, which gives us the feeling that this was a step backward in terms of DX as compared to our use of the runbook and fabric in the past.
Nevertheless, one major advantage of introducing Ansible is its reproducibility and the strong incentive it has given us to perform regular maintenance.
In fact, even though we have updated our PHP version from v5.3 to v7.1 and v7.3, added new extensions, and finally migrated from Amazon Linux to Amazon Linux 2, we were able to maintain the spec without any issues, and the work itself was rather easy to perform and test. On balance, I think we have managed to accomplish all of this while managing to keep our DX capabilities at the same level.
4. Run the spec and E2E with CI against the IaC system built with Ansible.
Up to this point, we had Ansible to build the EC2 environment, ServerSpec to test that it was built as expected, and Infrataster to test that the service ran as expected. Instead of deploying to a test environment and checking it by hand as we had done in the past, we can now verify that everything works simply by opening a PR on GitHub and letting CI deploy and test it.
This offers several advantages, but the biggest advantage is that it gives developers greater motivation to make changes in Ansible and create PRs, which allows improvements to be made. The spec is run for all the changes made in Ansible, which prevents the spec from becoming obsolete.
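The whole chain can be wired into CI with a pipeline roughly like the following. This is shown in GitHub Actions style purely for illustration; the actual CI product, inventory layout, and rake task names are assumptions.

```yaml
# A hypothetical CI pipeline for this flow (GitHub Actions syntax used
# only as an example; job and task names are assumptions).
name: provision-and-test
on: pull_request

jobs:
  spec:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Provision a test instance with Ansible
        run: ansible-playbook -i inventories/test site.yml
      - name: Run the ServerSpec suite against the instance
        run: bundle exec rake spec
      - name: Run the Infrataster E2E suite against the instance
        run: bundle exec rake e2e
```

Because the spec runs on every PR, any change to the Ansible code that drifts from the spec fails fast, before it reaches production.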
5. Replace the production environment with the IaC system.
In this step, we applied the flow that we had developed so far to the production environment and converted it into a state in which the spec can be continuously updated.
However, as mentioned in "1. Add a spec to EC2," this is not a perfect spec, and Ansible scripts built on any imperfect spec will cause issues. (In fact, we did encounter some issues.)
This is something that cannot be avoided, and it is a painful price we ultimately have to pay at some point for not fully understanding what we are running. Anyone doing this should be prepared to recognize that this is an opportunity to adhere to the idea of "fail fast" and make the spec closer to perfection.
Of course, it would have been possible to get around our lack of understanding by immediately replacing the system with Kubernetes. However, we decided to attempt to identify and keep the specifications before deploying the system on Kubernetes as we felt that replacing the system with a similar EC2 system with the same configuration would lower the likelihood of issues and allow us to promptly implement fixes for any issues since we can easily refer to the original environment.
6. Streamline the specifications, refactor, and remove specifications that are incompatible with Kubernetes.
At this point, we were finally able to create an EC2 environment with a clear set of specifications.
However, once we were able to have a clearer sense of the specifications, it became possible for us to identify things that were no longer necessary, as well as things that were part of specifications developed in the past and which should be modified.
For instance, Chatwork used to run Nagios and Zabbix in the past, but we migrated to Datadog in 2016. We had no inclination to continue using Nagios and Zabbix, so we took this chance to switch entirely over to Datadog.
In addition, under the specifications of the existing EC2 environment, some requests were processed by an EC2 instance with a static IP.
This outdated part of the specifications was implemented when we switched over to a card-payment service in the past. Since it is no longer necessary, we discontinued it so that we would not have to devise ways to keep the IP static when deploying the system on Kubernetes.
Previously, the flow for making any changes to the configuration was to first create a runbook and test it in a test environment before deploying it in the production environment. Now, it is psychologically less stressful as we can simply rewrite the code on Ansible and ServerSpec/Infrataster and create PRs.
In particular, human error becomes a problem if many steps are required for deployment. It is now a lot easier to implement major changes to the specifications as we are able to narrow down the scope under consideration by ensuring that changes are made in a manner that does not affect other items.
7. Deploy on Kubernetes!!!
Finally, we made the following shortlist for the middleware needed to migrate Chatwork from EC2 to Kubernetes.
- PHP (+ various extensions)
- Datadog (integrated with the metric collection infrastructure on Kubernetes)
- Fluentd (integrated with the log collection infrastructure on Kubernetes)
Once we launch the Chatwork application on the middleware, its deployment on Kubernetes will be complete!
In a future post, I will explain how to deploy each middleware on Kubernetes.
To summarize, the key points of this post were:
- Write tests to make the specifications explicit.
- Prepare for the migration period with CI and automated provisioning.
- Clear up any unnecessary specifications.
This is the end of my post on how to develop a set of specifications for existing infrastructure.
In my next post, I will be discussing single-tenancy and multi-tenancy, as well as lifecycle management, all of which are important matters that must be considered when using Kubernetes.