Migrating our PHP Legend System from EC2 to Kubernetes, Part 1

Why did Chatwork migrate from EC2 to Kubernetes?
- Issue 1: Deployment becomes unstable when the EC2 instance exceeds about 40 machines
- Issue 2: A rollback cannot be performed depending on the deployment status
Kubernetes as a solution
Kubernetes as an option
Conclusion

Hello! This is cw-ozaki of the SRE Department.

The work of migrating our PHP legend system from EC2 to Kubernetes, which I have been involved with for some time, has settled down a bit, so I want to share how we made this migration possible.

This article and the follow-up articles are the full versions of the talks given at the Japan Container Days V18.12 Meetup and AWS Dev Day Online. I will also cover some material related to PHP Conference 2020 Re:born, held in December, so I hope you look forward to it.

speakerdeck.com

Why did Chatwork migrate from EC2 to Kubernetes?

As stated in the announcement that “Chatwork is migrating to Scala!,” the service is gradually being changed to Scala. However, the PHP legend system that has supported Chatwork for many years still accounts for a large part of the overall operation. And it will still take time to completely switch to Scala.

So we will continue to operate the PHP legend system, but as the Chatwork service has grown, deployment and rollback have become unstable.

Data from 2017, when we began this project, shows around 100 PR releases per month with a failure rate of 5 to 10%. This deploy and rollback instability slowed the release of the new features and fixes that the development teams ship as part of Chatwork's daily improvement. Furthermore, when a rollback could not be performed, the release caused a service outage. That lowered our availability against the SLA, lengthened MTTR, and hurt the user experience, resulting in a loss of user trust.

For this reason, stable deployment and rollback are very important for supporting the service.

Issue 1: Deployment becomes unstable when the EC2 instance exceeds about 40 machines

The PHP legend system is deployed with Capistrano.

The main workflow is as follows.

Suspend Auto Scaling Group.
Use composer on the deploy server to resolve the dependent libraries and distribute them to the servers.
Use Capistrano on the deploy server to update the code, then update NGINX and PHP-FPM after the symlink has been updated.
Create an AMI from the AMI instance.
Create a new launch configuration and restart the Auto Scaling Group.

This deployment model is stable up to about 40 EC2 instances, but beyond that the entire process takes several tens of minutes, which increases the chance that a deployment fails for one reason or another.
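A rough back-of-the-envelope sketch shows why adding machines makes a deployment like this more fragile. It assumes, purely for illustration, that each server's update fails independently with the same small probability. The function name and the 0.2% per-server figure are made up for this example and are not Chatwork measurements.

```python
def deploy_failure_probability(hosts: int, per_host_failure: float) -> float:
    """Probability that at least one of `hosts` independent updates fails."""
    if hosts < 0:
        raise ValueError("hosts must not be negative")
    if not 0.0 <= per_host_failure <= 1.0:
        raise ValueError("per_host_failure must be between 0 and 1")
    # The deployment succeeds only if every single host succeeds.
    return 1.0 - (1.0 - per_host_failure) ** hosts


if __name__ == "__main__":
    # 0.002 is a hypothetical per-server failure probability.
    for hosts in (10, 20, 40, 80):
        p = deploy_failure_probability(hosts, per_host_failure=0.002)
        print(f"{hosts:3d} hosts: {p:.1%} chance that at least one update fails")
```

Under this simplified model the overall failure probability grows with every server added, even though no individual server has become less reliable. A longer deployment window adds further risk that the model does not capture.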

An additional problem is that, as service throughput has grown, PHP errors rise temporarily during a deployment that includes a non-backward-compatible change, such as a class being deleted in a library upgrade. Depending on timing, a request tries to read the deleted class and fails with an error.

Issue 2: A rollback cannot be performed depending on the deployment status

As with deployment, Capistrano is also used for rollback, which switches a symlink back to the previous deployment.

Because a rollback only changes the symlinks on all servers, it takes less than one minute, but there are times when a rollback cannot be done depending on the server status.
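A symlink-based rollback of this kind can be sketched as follows. This is a minimal illustration, not Capistrano's actual implementation: it assumes a `releases/` directory with one subdirectory per release and a `current` symlink next to it, on a POSIX filesystem. The function names and release names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def switch_current(deploy_root: Path, release_name: str) -> None:
    """Point the `current` symlink at releases/<release_name>."""
    if not (deploy_root / "releases" / release_name).is_dir():
        raise FileNotFoundError(f"release {release_name!r} does not exist")
    tmp_link = deploy_root / "current.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()  # clean up a leftover from an interrupted switch
    # Use a relative target so the link stays valid if deploy_root is moved.
    os.symlink(Path("releases") / release_name, tmp_link)
    # Renaming over the old link swaps it in a single step on POSIX systems.
    os.replace(tmp_link, deploy_root / "current")


def rollback(deploy_root: Path) -> str:
    """Switch `current` to the release just before the active one."""
    releases = sorted(
        p.name for p in (deploy_root / "releases").iterdir() if p.is_dir()
    )
    active = (deploy_root / "current").resolve().name
    index = releases.index(active)
    if index == 0:
        raise RuntimeError("no previous release to roll back to")
    switch_current(deploy_root, releases[index - 1])
    return releases[index - 1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("20210401", "20210408"):  # hypothetical release names
            (root / "releases" / name).mkdir(parents=True)
        switch_current(root, "20210408")
        print("rolled back to", rollback(root))
```

The switch itself is cheap, which is why this style of rollback is fast. It only works, however, if the previous release directory still exists on every server, which is where the problems described next come from.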

One cause is that generation management gets out of sync.

This happens when a code update fails partway through and the subsequent redeployment succeeds, or when the instances that the ASG launched from the AMI are out of sync with the others. In these cases, a rollback cannot be performed because the generation to roll back to does not exist.
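One possible reading of this failure mode can be modeled in a few lines. The sketch below is an assumption-heavy illustration, not a description of Chatwork's tooling: the server names, generation numbers, and function names are all hypothetical. Each server keeps its own list of release generations, and "roll back" means "go to the generation before the current one on that server".

```python
from typing import Dict, List, Optional


def previous_generation(releases: List[str], current: str) -> Optional[str]:
    """Return the generation just before `current` on one server, if any."""
    ordered = sorted(releases)
    if current not in ordered:
        return None
    index = ordered.index(current)
    return ordered[index - 1] if index > 0 else None


def rollback_plan(
    releases_by_server: Dict[str, List[str]], current: str
) -> Dict[str, Optional[str]]:
    """Work out which generation each server would roll back to."""
    return {
        server: previous_generation(releases, current)
        for server, releases in releases_by_server.items()
    }


def is_consistent(plan: Dict[str, Optional[str]]) -> bool:
    """A rollback is only safe if every server lands on the same generation."""
    targets = set(plan.values())
    return len(targets) == 1 and None not in targets


if __name__ == "__main__":
    fleet = {
        "web-1": ["001", "002", "003", "004"],  # every deployment succeeded
        "web-2": ["001", "002", "004"],         # 003 failed here, 004 succeeded
        "web-3": ["004"],                       # launched from an AMI with only 004
    }
    plan = rollback_plan(fleet, current="004")
    print(plan, "consistent:", is_consistent(plan))
```

In this hypothetical fleet, the three servers disagree about what "the previous generation" is, and one of them has nothing to roll back to at all.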

Another cause is a rollback that itself fails, which leaves the specified generation out of sync.

This occurs when the released change drives CPU usage to its maximum and prevents the rollback from being executed. When this happens, you either have to wait until CPU usage drops, or the instance stops after the CPU maxes out, and the recovery operation then takes additional time.

In the past, there were also cases where a rollback could not be performed after Composer had updated a library. Unreliable rollback increased the burden on developers and reduced their trust in it, so they ended up looking for other ways to solve a problem even when a rollback alone would have solved it.

Kubernetes as a solution

I think the issues discussed above could have been solved while staying on EC2.

For example, switching from a push method to a pull method would increase concurrency and shorten deployment time, on the assumption that a retry is done when an error occurs. Creating an AMI in advance and switching to it would make the state uniform after deployment and rollback. There are a variety of ways to do this, such as a Blue/Green deployment that uses multiple environments and switches between the old and the new.

However, considering the overall slowness of EC2/AMI configuration, the need to build all of the EC2 generation management ourselves, and the cost-effectiveness of changing the deployment/rollback method, we determined that it would be difficult to solve these problems while continuing to use EC2 and Capistrano.

That is why Chatwork decided to switch to deployment/rollback using Kubernetes.

The concept is the same as the AMI-based method: a new image tag is created in advance and rolled out with a rolling update. To roll back, you point back to the previous tag and perform another rolling update, which returns you to the same state as the previous deployment.
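The idea that deployment and rollback are the same operation with a different tag can be illustrated with a toy simulation. This is a deliberately simplified sketch and not how the Kubernetes controller is implemented: it ignores health checks, surge capacity, and scheduling. The function name and the `app:v1` / `app:v2` tags are hypothetical.

```python
from typing import Iterator, List


def rolling_update(
    pods: List[str], new_tag: str, batch_size: int = 1
) -> Iterator[List[str]]:
    """Replace pod image tags batch by batch, yielding the fleet after each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    state = list(pods)  # work on a copy so the caller's list is untouched
    for start in range(0, len(state), batch_size):
        for i in range(start, min(start + batch_size, len(state))):
            state[i] = new_tag
        yield list(state)


if __name__ == "__main__":
    fleet = ["app:v1"] * 4

    # Deploy: roll the fleet forward to the new tag.
    for step in rolling_update(fleet, "app:v2", batch_size=2):
        print("deploy  ", step)
        fleet = step

    # Rollback: the same procedure, pointed at the previous tag.
    for step in rolling_update(fleet, "app:v1", batch_size=2):
        print("rollback", step)
        fleet = step
```

Because each tag refers to an image that was built in advance, rolling back does not depend on what happens to be left on any individual server.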

The advantages of using Kubernetes over AMI are as follows.

Deployment and rollback are handled by Kubernetes itself, where the state is declarative, so we do not need to select a separate deployment framework.
A container image can be created more quickly than an AMI.
A container can be started more quickly than an EC2 instance, so switching over is faster.

Kubernetes as an option

By the way, if we were going to use a container orchestration system, what about ECS, Docker Swarm, or OpenShift? Or how about going serverless with Lambda to make deployment and rollback even faster?

Chatwork's case is rather special in this regard. As described in past articles, we were already operating a Kubernetes cluster, and adding another product to achieve the same result was not an attractive option, so we opted for Kubernetes.

blog-ja.chatwork.com

creators-note.chatwork.com

We could have gone with another option, but we did not see any merit in the alternatives that would outweigh the overall optimization we gain by using one uniform tool for the entire operation, so we used Kubernetes.

On a side note, running both means that SREs need to know how to use both EC2 and Kubernetes, and each has a high learning cost, which makes it hard to scale the team. Applying a single change also means modifying multiple places, which raises costs. So, Kubernetes was our best option for those reasons as well.

Conclusion

Chatwork is migrating the PHP legend system running on EC2 to Kubernetes.
This is because we need a fast and stable release infrastructure that can keep up with the pace at which the service is scaling.
We are already using Kubernetes, so this was nearly the only real option for us.

So, this is the background behind the decision to migrate from EC2 to Kubernetes.

In the next article, I will discuss the strategy used for the migration and the method for identifying the specifications of the existing infrastructure in preparation for migration.

kubell Creator's Note

The blog of the engineers behind the business chat service "Chatwork."