Chatwork Creator's Note




As an SRE at Chatwork


Hello, this is Vikas from the SRE department. This is my first blog post on Chatwork Creator's Note, although six months after joining Chatwork I already have plenty of experience to share. Since I am new to the platform and its readers, the best place to start is an introduction: my name is Vikas Kandwal (id:cw-vikas), and I am an SRE at Chatwork.

This blog post should be helpful for techies who are interested in joining Chatwork or who follow SRE topics.

What do the three letters SRE mean?

For readers who are new to the term "SRE", the best explanation would be:

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production.

The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn’t want anything to blow up in production.

The concept of site reliability engineering started in 2003 within Google. As Google grew and scaled into the massive company it is today, it ran into many growing pains of its own. The challenge was to support large-scale, ever-growing systems while also introducing new features continuously.

To accomplish the goal, they created a new role that had the dual purpose of developing new features while also ensuring that production systems ran smoothly. Site reliability engineering has grown significantly within Google and most projects have site reliability engineers as part of the team.

Ben Treynor, the man behind Google’s SRE, still hasn’t published a single-sentence definition, but he describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

If you are interested in learning about SRE in more detail, I recommend the SRE book by Betsy Beyer and her co-authors, which gives you a deep dive into the domain.

The first 6 months as an SRE

When I was introduced to the SRE department, it was a completely new world for me. Imagine Santa's bag full of different technologies, and the happiest part is that the entire bag is yours. I was amazed by the capabilities and the technology use cases here at Chatwork.


On joining, I was assigned an onboarding task that took two to three months. It was essentially an orientation and introduction to the underlying architecture and infrastructure. Closing these tasks took longer than expected, but I have no regrets: through the hands-on work I was introduced to many new technologies and concepts and got the chance to use them in practice.

In my personal view, SRE is not just a role; it is a way of thinking, an engineer's mindset.

“At scale, there will be anomalies that are hard to detect, so they’ll need the ability to think statistically, rather than procedurally, to uncloak problems.” ― Betsy Beyer, SRE: How Google Runs Production Systems

Being a Falconist

After completing the onboarding, I was picked for the Falconist task. Falcon is the system at the core of our messaging platform: it processes the message data, which is the basis of the chat service. A Falconist is a team member who deploys, scales, troubleshoots, and maintains the Falcon cluster. The cluster consists of Kafka, ZooKeeper, HBase, and Scala apps, which together make the entire stack reactive to data.

Falcon Architecture

The entire deployment is done with Terraform and Ansible, and since we love AWS for infrastructure, things are straightforward. Ansible handles all the server-side configuration: deploying scripts, writing config files, changing permissions, and so on. The task involved tons of new things, and even though I already knew most of the tools, doing the work the SRE way was totally different from what I was used to, and that made a huge difference.

Scaling Kafka with new brokers

As a messaging platform, we are a data-oriented company, which means we always need to scale our platform as our user base and service usage grow. Kafka is a message queuing service, and the Kafka stack is part of the Falcon cluster. I was assigned the task of scaling the Kafka cluster by adding a new broker without affecting the brokers already running. This kind of task reflects the SRE principle of service reliability.
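To give a feel for what adding a broker involves: a new Kafka broker does not receive existing partitions automatically, so replicas have to be reassigned to it explicitly. The sketch below is not the procedure we used at Chatwork. It is a minimal illustration, under my own assumptions, of building a reassignment plan in the JSON shape that Kafka's partition reassignment tool reads. The topic name `messages`, the broker IDs, and the helper `plan_reassignment` are all hypothetical.

```python
import json
from typing import Dict, List


def plan_reassignment(
    current: Dict[str, Dict[int, List[int]]],
    brokers: List[int],
) -> dict:
    """Spread every partition's replicas round-robin over `brokers`.

    `current` maps topic -> partition -> list of replica broker IDs.
    Only the replication factor of each partition is reused; the
    placement itself is recomputed from scratch (a naive strategy).
    """
    partitions = []
    slot = 0  # rotating start position, so leaders are spread too
    for topic in sorted(current):
        for partition in sorted(current[topic]):
            rf = len(current[topic][partition])
            if rf > len(brokers):
                raise ValueError("replication factor exceeds broker count")
            replicas = [brokers[(slot + i) % len(brokers)] for i in range(rf)]
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": replicas}
            )
            slot += 1
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical cluster: brokers 1-3 exist, broker 4 is newly added.
    current = {"messages": {0: [1, 2], 1: [2, 3], 2: [3, 1]}}
    plan = plan_reassignment(current, brokers=[1, 2, 3, 4])
    print(json.dumps(plan, indent=2))
```

A plan like this moves more data than strictly necessary, because it ignores where the replicas already live. On a production cluster you would more likely let `kafka-reassign-partitions.sh --generate` propose the plan, then apply it in small batches so the running brokers are not saturated by replication traffic.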

Implementing the new search cluster by migrating terabytes of data

There was also a new requirement to implement a search cluster, which we are currently migrating from Amazon CloudSearch to an Elasticsearch cluster. The project involves Amazon CloudSearch, Elasticsearch, HBase, Kafka, EMR, Spark, and Scala. It is still in development, so I will share more about it in a separate post in the coming weeks.

CloudSearch Architecture
Architecture diagram courtesy id:cw-adachi

SRE culture at Chatwork

The SREs here help Product and Engineering deliver the best possible experience for users. DevOps, on the other hand, is about breaking down the silos between groups and working together.

Chatwork SRE Team

Now, as an SRE at Chatwork, I collaborate with a group of engineers who build messaging software and the underlying technologies. The platform processes more than a billion messages a day, and this is where reliability comes into the picture. Everyone in the Chatwork Engineering department understands the importance of reliability and speed. When I argue for reliability, our developers want to be on that side too, and this vision of shared responsibility creates a more collaborative culture.

Although the SRE job is full of responsibilities, communication, and tasks, it is also a joy for people who are eager to learn new things and experiment with cutting-edge technologies.

Since joining Chatwork, I have attended SRE Camp twice (Tokyo and Kagoshima). We host SRE Camps quarterly. The camp gives us space to think about and plan our upcoming goals, and to discuss issues with other SRE members and understand their approach. The core theme of the camps is to get all SREs in sync, understand the demands from both the infrastructure side and the developer side, and plan the next steps accordingly.

Tokyo SRE Camp
SRE Camp at Kagoshima

Life at Chatwork

Learn from the best: You will be able to learn and work with professionals who have created some of the most revolutionary solutions in their domains.

Make an Impact: We are writing the future of messaging and communication, and we are changing the way people work and communicate.

Knowledge meets industry: As Chatwork is a solution which is used by end-user and the You can develop your career with

Diversity: Diversity and freedom of expression are key to the growth and nurturing of ideas and innovation in this era, and we currently have people working from various countries.

This was my experience as an SRE at Chatwork. You can explore more great talks and articles published by other SREs and developers on Chatwork Creator's Note.

I think these posts can give you a brief idea of the SRE team's tasks and accomplishments.

Join Chatwork

And we're hiring engineers who love distributed systems and middleware upgrades! There are also many other positions in different domains, so please feel free to explore them according to your career path.