Hello, this is Sakamoto from the SRE Department. I run at least one full marathon every year, but unfortunately I couldn't manage it in 2020. I did break 90 minutes (89 minutes) in a half marathon held back in February, and I was aiming to go under 3 hours 10 minutes in the full this year, but unfortunately there was nothing I could do.
Since I never ran as part of a club, I'm not particularly interested in track events; my only goal is total distance for the year. As of mid-December, as I write this, I'm at 2,591 km. Last year's total was 2,724 km, so I'm hoping to improve on that.
Now, without further ado: we've been using Kubernetes at Chatwork since 2016. EKS did not exist back then, so as early adopters we used a tool named kube-aws to self-host Kubernetes on EC2, but we migrated to EKS this year.
I'd like to introduce the tools that support Chatwork's Kubernetes (2020 edition). These are not command-line tools such as kubectl and helm, but rather tools that run as applications inside Kubernetes and support our service workloads.
None of these tools are included by default in EKS, though some are included by default in GKE and AKS.
List of tools
First, I would like to itemize the tools we use.
- cluster-proportional-autoscaler
- node-local-dns
- cluster-autoscaler
- aws-alb-ingress-controller (migrating to aws-load-balancer-controller)
- aws-node-termination-handler
- aws-ebs-csi-driver (trial usage)
- node-problem-detector, draino
- cert-manager
- external-dns
- reloader
- datadog-agent
- aws-secret-operator
- kube-schedule-scaler
- fluent, postfix, newrelic
- Flux, ArgoCD
That's quite a lot. The list is a bit bloated, but we really do use all of them. Writing about every one in detail would make this far too long, so Ozaki from the SRE department said,
"I'll cover fluent, postfix and newrelic"
so I'll leave it to him, and write about the other tools, either roughly or in some detail.
EKS deploys two CoreDNS pods by default. Because CoreDNS performs very well, two pods are enough for a cluster up to a certain size, but of course there is a limit, and past it requests will overflow. I'd like to write about CoreDNS's limits elsewhere, so I'll skip that here.
So you can use HPA, but since HPA is metrics-based, tuning the threshold can be tricky when you want to scale out before requests start to back up.
That's where cluster-proportional-autoscaler comes in, and I introduced it based on the following article.
cluster-proportional-autoscaler scales a specific Deployment based on the number of nodes and the total number of CPU cores in the cluster, and we use it to scale CoreDNS with the size of the node fleet.
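As a sketch, the controller's documented "linear" mode boils down to the following formula (the parameter values here are illustrative defaults, not our production settings):

```python
import math

def linear_replicas(cores, nodes, cores_per_replica=256, nodes_per_replica=16,
                    min_replicas=2, prevent_single_point_failure=True):
    """Replica count per cluster-proportional-autoscaler's documented linear mode:
    whichever of the per-core and per-node ratios demands more replicas wins."""
    replicas = max(math.ceil(cores / cores_per_replica),
                   math.ceil(nodes / nodes_per_replica))
    # with preventSinglePointFailure, never run a single replica on a multi-node cluster
    if prevent_single_point_failure and nodes > 1:
        replicas = max(replicas, 2)
    return max(replicas, min_replicas)

# e.g. a 40-node cluster with 512 cores -> max(ceil(512/256), ceil(40/16)) = 3
print(linear_replicas(512, 40))
```

So as nodes are added, CoreDNS grows proportionally without waiting for a metrics threshold to trip.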
Just keeping CoreDNS scaled out stably makes the DNS layer of EKS quite stable.
Although CoreDNS alone gives you enough performance, you can improve things further by reducing cross-node DNS traffic within the cluster as much as possible. This is where node-local-dns comes in.
However! There are two points to note.
- When using node-local-dns with EKS, never use force_tcp.
Timeouts occur frequently, making it completely unusable. The AWS documentation merely says it is "not recommended", but it degrades performance considerably and affects the entire cluster. I not only removed force_tcp but changed it to prefer_udp, and things became stable.
When running CoreDNS on Amazon EC2, we recommend not using force_tcp in the configuration and ensuring that options use-vc is not set in /etc/resolv.conf
It seems that Nitro-based EC2 instances limit DNS over TCP to two simultaneous connections.
We have identified the root cause as limitations in the current TCP DNS handling on EC2 Nitro Instances. The software which forwards DNS requests to our fleet for resolution is limited to 2 simultaneous TCP connections and blocks on TCP queries for each connection
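For illustration, the change amounts to something like the following fragment of the node-local-dns Corefile (the `__PILLAR__` placeholder is substituted by the node-local-dns manifests; treat this as a sketch and check it against your deployed template):

```
# node-local-dns Corefile fragment (sketch).
# The stock template forwards cluster traffic with force_tcp;
# on EKS, replace it with prefer_udp to avoid the Nitro TCP limit.
cluster.local:53 {
    cache 30
    forward . __PILLAR__CLUSTER__DNS__ {
        prefer_udp
    }
}
```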
- There is no need to change the node's DNS destination.
eksctl has a setting to change the DNS address that kubelet configures for each container, but if you use node-local-dns this setting is unnecessary (i.e. you can leave it pointing at kube-dns).
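For reference, the eksctl setting in question looks roughly like this (field names are from my reading of eksctl's config schema; with node-local-dns you can simply omit it):

```yaml
# eksctl ClusterConfig fragment (sketch): clusterDNS overrides the DNS
# address kubelet hands to pods. Unnecessary when node-local-dns is in use.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example            # placeholder
  region: ap-northeast-1   # placeholder
nodeGroups:
  - name: ng-1
    clusterDNS: 169.254.20.10   # node-local-dns's usual link-local address
```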
This is because node-local-dns creates a dummy network device and iptables rules at startup, redirecting requests bound for kube-dns to itself.
Once node-local-dns is running, requests to kube-dns are forcibly routed to node-local-dns, so be careful when rolling it out to production (as always).
For your reference, here are the node-local-dns proposal (it used to be on the main site, but it was removed for some reason, so this is the reference URL) and the code around creating the dummy device.
I can write an entire article just on this topic, so I will write it separately.
This is required if you are using HPA.
I think it works well paired with HPA, expanding the nodes nicely as demand grows.
The main point with cluster-autoscaler (tedious though it is) is to create a node group per AZ, as shown in the following document.
If a single node group spans multiple AZs, cluster-autoscaler cannot tell which AZ a new node would launch in, so it will not scale up for a pod whose PV is pinned to a particular AZ.
Creating a node group per AZ lets cluster-autoscaler find the node group (in practice, the ASG) corresponding to the PV, and the nodes scale out.
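A per-AZ layout in eksctl terms would look something like this sketch (names and zones are placeholders; EBS-backed PVs are bound to one AZ, which is why the split matters):

```yaml
# Sketch: one node group per AZ so cluster-autoscaler can match an
# AZ-pinned PV to the right ASG when scaling up.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example            # placeholder
  region: ap-northeast-1   # placeholder
nodeGroups:
  - name: workers-1a
    availabilityZones: ["ap-northeast-1a"]
    minSize: 1
    maxSize: 10
  - name: workers-1c
    availabilityZones: ["ap-northeast-1c"]
    minSize: 1
    maxSize: 10
```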
At Chatwork, most services that accept requests from outside create their ALBs with aws-alb-ingress-controller.
aws-alb-ingress-controller makes a lot of AWS API calls overall, and I suspect many of you are struggling with throttling.
To reduce the number of API calls, if you are not using WAF with aws-alb-ingress-controller, I recommend turning off the waf and wafv2 checks.
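As I recall, the v1 controller exposes these as feature gates; the fragment below is a sketch, so verify the flag syntax against your controller version's docs:

```yaml
# Deployment args fragment for aws-alb-ingress-controller v1 (sketch).
# Disabling the WAF feature gates stops the per-reconcile WAF API calls.
containers:
  - name: alb-ingress-controller
    args:
      - --cluster-name=example   # placeholder
      - --feature-gates=waf=false,wafv2=false
```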
There is also a feature that caches API calls, so if throttling is still a concern, try those options and adjust the cache duration (300s by default).
With these two options, the number of API calls improves considerably.
However, if you use the cache, entries are never released and keep accumulating in memory. The feature is also not listed in the documentation, so it is probably experimental.
There is nothing to be done about the memory, so let the OOM killer take care of it (in other words, run two pods in production)👻
Or better yet, migrate to v2 (aws-load-balancer-controller). Chatwork is still preparing for the migration, but on a metrics level, the number of API calls was reduced by about 1/3.
At Chatwork, we run most of our nodes as spot instances. The spot instances have been pretty stable lately, but we still have a few instances go down per week. This tool will drain the pods from those nodes when that happens.
However, the spot instance's two-minute interruption notice still applies, of course, and you need to tune the termination grace period, preStop sleeps, and so on to fit within it.
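For example, the budget can be expressed in the pod spec along these lines (the numbers are illustrative, not Chatwork's actual values):

```yaml
# Sketch: fit pod shutdown inside the 2-minute spot interruption notice.
spec:
  terminationGracePeriodSeconds: 90   # preStop sleep + app shutdown < 120s
  containers:
    - name: app                       # placeholder
      lifecycle:
        preStop:
          exec:
            # keep serving in-flight requests while endpoints propagate
            command: ["sh", "-c", "sleep 20"]
```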
Incidentally, ASG has recently added a feature called Capacity Rebalance, which anticipates spot interruptions and launches replacement instances in advance, but when I tried it, it launched nodes quite aggressively and made things rather unstable, so I do not recommend it.
aws-node-termination-handler added Capacity Rebalance support in 1.11, but as of this writing (2020/12/17) it only cordons, it does not drain. If the node then goes down, the pods on it go down with it.
By the way, I believe this handling is included with Managed Node Groups, but Chatwork runs entirely on self-managed node groups, so we use this tool.
We haven't migrated everything to aws-ebs-csi-driver yet; we are installing it for applications whose pods we can afford to drop. The driver has a peculiar structure: if you use all of its features, a single controller pod contains six containers.
In my testing, when a node with a PV attached disappears, detaching is slower than with the classic in-tree plugin, and since I don't fully understand the behavior yet I haven't rolled it out everywhere. It may be adjustable with parameters.
These two are listed as a set.
node-problem-detector watches the various kubelet statuses and the kernel logs, generates Events, and changes node conditions.
draino then watches those Events and drains the node in question.
I think they work best when used as a set.
cert-manager has recently reached v1. It has been a bit troublesome to use because it actively ships breaking changes, but I hope v1 stabilizes things.
We use ACM for most certificates, but there are some for which we don't use ACM, and cert-manager manages those.
Also, aws-load-balancer-controller now injects Pod readiness gates via a mutating webhook, and I plan to use cert-manager to manage that webhook's certificate.
Chatwork upgrades clusters blue/green, balancing requests between the old and new clusters with Route53 weights. Instead of changing the weights by hand, we use external-dns so that it can all be done in the manifest. Because only a manifest change is needed, the application teams can handle cluster-migration releases without involving the SRE members.
Also, with aws-alb-ingress-controller the ALB's lifecycle is contained within the cluster, so we use external-dns to automate linking the ALB to its DNS records.
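The weighted-record part can be driven entirely by annotations, roughly like this sketch (hostnames and identifiers are placeholders; the annotation names are external-dns's AWS-specific ones):

```yaml
# Sketch: weighted Route53 records from the manifest alone.
# Each cluster sets its own set-identifier; shifting traffic between
# blue and green is just a change to aws-weight in Git.
apiVersion: v1
kind: Service
metadata:
  name: app                          # placeholder
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/set-identifier: cluster-blue
    external-dns.alpha.kubernetes.io/aws-weight: "100"
```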
This is a rather niche tool: it is useful when an application depends on an external ConfigMap/Secret that is not managed together with the application, and you want the Deployment's pods replaced whenever that ConfigMap/Secret is updated.
By specifying the configmap/secret to be monitored on the Deployment side, the reloader will monitor each of them and replace them accordingly.
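Concretely, this is a matter of annotating the Deployment, as in this sketch (resource names are placeholders; the annotation keys are stakater/reloader's):

```yaml
# Sketch: reloader rolls the Deployment's pods whenever the named
# ConfigMap or Secret changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # placeholder
  annotations:
    configmap.reloader.stakater.com/reload: "app-config"
    secret.reloader.stakater.com/reload: "app-secret"
```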
Chatwork basically aggregates and monitors metrics in Datadog. Recently, many of the metrics are published in Prometheus exporter format, which is also collected by datadog-agent.
Also, when we want to set up alerts at the log level, we send the logs to datadog logs and set up a monitor.
I'm also interested in Prometheus, but since Chatwork treats clusters as immutable and eventually destroys each cluster, we have no long-lived cluster to run Prometheus in, so I use Datadog.
However, with the advent of AMP, there is a possibility that this will change (or run in parallel).
Datadog also automatically collects Kubernetes metrics and is useful in many ways, so if Prometheus is too much for you, please consider it, although it depends on the price.
I'm sure many of you are having a hard time figuring out how to manage Kubernetes Secrets; we use aws-secret-operator.
It creates a Secret based on credentials stored in AWS Secrets Manager, and it works well with GitOps.
It also supports RDS password rotation.
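From my recollection of the project's README, the custom resource looks roughly like this (field names are an assumption; check the upstream examples before use):

```yaml
# Sketch: aws-secret-operator turns a Secrets Manager entry into a
# Kubernetes Secret, so only this reference needs to live in Git.
apiVersion: mumoshu.github.io/v1alpha1
kind: AWSSecret
metadata:
  name: app-credentials              # placeholder
spec:
  stringDataFrom:
    secretsManagerSecretRef:
      secretId: prod/app             # placeholder
      versionId: 00000000-0000-0000-0000-000000000000   # placeholder
```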
Kubernetes has HPA, but it doesn't suit every application, and there are times when you want to manage replica counts on a schedule, with some headroom. For that we use this tool, which I forked from the original and modified a bit.
Why did I fork it?
- In the original, the annotations are written on the Deployment, which was not very clear; since it drives HPA, I wanted to write them on the HPA itself.
- The logging format was awkward and hard to parse with Datadog Logs. I'm embarrassed to say it's still in a print-debug state from a stopgap fix.
- The initial startup script was buggy: kube-schedule-scaler's catch-up execution at startup didn't work properly, and I fixed it. This one may be a good candidate for a PR to the upstream repository.
However, it doesn't work well with GitOps (I'm using ArgoCD's ignore diff), so it's an annoying problem.
And I'm also considering moving to scheduled-pod-autoscaler (personally).
Chatwork uses GitOps to deploy its applications, and the tools I mentioned above are also run using GitOps.
Flux and ArgoCD are the big two, and we are using both at the moment.
- It can be installed with the eksctl command, so installation itself is easy.
- It's wanted at cluster build time, and the things needed then (the namespaces, the RBAC used with aws-auth, etc.) are easy to manage as manifests.
- It also handles other applications whose manifests can become complex.
We are planning to integrate it into ArgoCD, but it is not a big problem at this point, so we are using it in parallel.
These are the tools that support the 2020 edition of Chatwork's Kubernetes. I hope you enjoyed this overview of the Kubernetes ecosystem.
This is the 2020 version, and I plan to write the 2021 version next year (if there are any differences).