Guest post by Joram Wilander, Director of Engineering at Mattermost, Inc.
Most products that run as Software-as-a-Service (SaaS) are built to be multi-tenant, meaning that a single instance or deployment is meant to be used by multiple organizations. There’s a good reason for this: it’s generally easier to scale and operate multi-tenant applications.
But in this new age of containers, orchestration, infrastructure-as-code, and Kubernetes, where it’s cheaper, faster, and simpler to deploy a new instance of an application, that may no longer be the case. Here at Mattermost we thought there was enough merit in that idea that we built our Cloud architecture on top of our single-tenant application.
Why is Mattermost Single-Tenant in the First Place?
Before I get into the details of how that works, a bit of background on Mattermost’s architecture is warranted. When we first started developing Mattermost back in 2014, we wanted to create a collaboration chat platform that was 1) open source and 2) self-hosted. We didn’t have any plans for running Mattermost as SaaS and we really wanted to build a product that was focused on the organization using it. So, we architected Mattermost to be single-tenant. There can be teams within a single Mattermost instance, but there are no barriers preventing the users of those teams from interacting with each other. Users are the top-level entity, and a user’s relationship with teams is 1:n.
For the first five or so years of Mattermost’s existence this served us very well. It let us focus on making a great experience for our self-hosted users and customers. But then, a couple of years ago, we wanted to expand into offering Mattermost as a hosted service.
Taking Mattermost to the Cloud
Now the engineering team was faced with the question: how? A lot of the engineers at Mattermost had worked on large SaaS systems in the past and knew well the benefits that being multi-tenant brings. We also knew that re-architecting Mattermost to be multi-tenant would be no small feat, because being single-tenant was at the core of how it was built. We were also still a fairly small start-up at the time and didn’t have the resources to fork our product and build something separate for Cloud. The question we asked was: did we have to re-architect at the application layer, or was there another way to solve this? Perhaps it could be solved at the infrastructure layer, say using something like Kubernetes? To cut the story short, after numerous discussions, proofs of concept, and a bunch of research, we were confident we could do it.
So, we did it. And it looks like this:
The short version is that we leveraged the orchestration power of Kubernetes and added another, Mattermost-aware layer of orchestration on top of it. That layer deploys individual instances of Mattermost quickly and on demand when a customer signs up for a workspace. This means that each customer who has their own workspace gets their own deployment of single-tenant Mattermost and a set of pods that is only for them.
None of this would be possible without a ton of great, open source CNCF projects that we built our Cloud with:
- Prometheus and Thanos for monitoring
- Fluent Bit for log collection
- Flux and Argo CD for automated deployments
- Helm and the Operator Framework for building and managing deployments
- Chaos Mesh for fault simulation and reliability testing
- And others!
In addition to not having to re-architect our entire application to be multi-tenant, we also get a number of other benefits from building our Cloud architecture around a single-tenant application:
- Improved data isolation between customers. Database tables are not shared and each customer gets their own logical database, reducing the likelihood of an exploit or bug that would cause data to leak across multiple customers.
- Better application reliability. Since each customer has their own deployment and pods, if one customer’s deployment goes down it does not necessarily mean any other customers are affected.
- Customers can be scaled independently of each other and it’s also much easier to attribute hosting costs directly to a single customer.
- The ability to roll out changes gradually across all the different deployments, which lets us test any kind of change this way, not just features.
None of these are small potatoes either, especially since Mattermost was founded with data privacy and security as core principles.
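To make the gradual rollout benefit concrete, here is a minimal sketch of upgrading deployments in progressively larger waves and halting as soon as a wave looks unhealthy. It is an illustration under assumptions, not how Mattermost Cloud actually schedules rollouts: the function names, the wave schedule, and the `upgrade` and `healthy` callbacks are all hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def rollout_waves(installations: Sequence[str],
                  wave_sizes: Sequence[int]) -> Iterator[List[str]]:
    """Yield installations in waves following a hypothetical size schedule.

    Once the schedule is exhausted, the last size is reused until every
    installation has been covered.
    """
    if not wave_sizes or any(size <= 0 for size in wave_sizes):
        raise ValueError("wave_sizes must contain positive integers")
    start = 0
    wave = 0
    while start < len(installations):
        size = wave_sizes[min(wave, len(wave_sizes) - 1)]
        yield list(installations[start:start + size])
        start += size
        wave += 1


def staged_rollout(installations: Sequence[str],
                   wave_sizes: Sequence[int],
                   upgrade: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade wave by wave; stop after the first wave that is unhealthy."""
    upgraded: List[str] = []
    for wave in rollout_waves(installations, wave_sizes):
        for name in wave:
            upgrade(name)
            upgraded.append(name)
        if not all(healthy(name) for name in wave):
            break  # halt here; the remaining installations stay untouched
    return upgraded
```

Because every customer has a separate deployment, a bad change caught in an early, small wave only ever reaches the handful of workspaces in that wave.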
Running any SaaS product is hard. It’s even harder when you’re doing any sort of trailblazing off the beaten path. Here’s a non-exhaustive list of some challenges we faced. All of these are deep enough to be their own blog posts, and some of them already are.
- Maximum IP addresses and pods per Kubernetes cluster. We have at minimum two pods per customer workspace, so 5,000 workspaces means a minimum of 10,000 pods, which all need their own IP addresses. Addressed in a number of ways, including choosing the right subnet sizes and Kubernetes CNI, getting clever with “hibernating” less-used workspaces, and more.
- Sharing databases across a large number of applications that all expect their own logical database and are liberal with opening database connections. Addressed by optimizing database connections with PgBouncer and automation for spinning up and down RDS databases on demand.
- Handling a high volume of metrics coming from an application that isn’t shy about generating them. Addressed by setting up the right Prometheus and Thanos infrastructure.
- Driving costs per workspace low enough to be economical. Addressed by improving the sharing efficiency of systems like the Kubernetes clusters and databases, and by finding ways to fit more workspaces per cluster and database.
- Rolling out upgrades to thousands of workspaces and deployments. Addressed by clever installation group management and automation.
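The IP address arithmetic from the first challenge in this list can be sketched in a few lines. This is a rough back-of-the-envelope model, not our actual capacity planning: it assumes every pod takes an address straight from a VPC subnet (as with a VPC-native CNI), it ignores nodes and system pods, and the helper names and CIDR ranges are hypothetical. The five reserved addresses per subnet reflect how AWS subnets behave.

```python
import ipaddress
import math

PODS_PER_WORKSPACE = 2  # the minimum stated above


def min_pod_ips(workspaces: int,
                pods_per_workspace: int = PODS_PER_WORKSPACE) -> int:
    """Lower bound on pod IP addresses needed for a number of workspaces."""
    return workspaces * pods_per_workspace


def usable_ips(cidr: str, reserved_per_subnet: int = 5) -> int:
    """Addresses in a subnet after subtracting the ones AWS reserves."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved_per_subnet, 0)


def subnets_needed(workspaces: int, cidr: str) -> int:
    """How many subnets of the given size the pods alone would consume."""
    per_subnet = usable_ips(cidr)
    if per_subnet == 0:
        raise ValueError(f"{cidr} has no usable addresses")
    return math.ceil(min_pod_ips(workspaces) / per_subnet)


if __name__ == "__main__":
    for cidr in ("10.0.0.0/20", "10.0.0.0/18"):
        print(cidr, usable_ips(cidr), subnets_needed(5000, cidr))
```

Under these assumptions, 5,000 workspaces would need three /20 subnets for pods alone but fit in a single /18, which is why subnet sizing and CNI choice matter so much at this scale.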
As often happens when you’re building anything complex, there were lessons to be learned. Some of the missteps we had were:
- Not picking the right Kubernetes CNI from the beginning. An obvious one in hindsight, this ended up causing some significant pain once we started to hit IP address limits for clusters and EC2 machines with a large number of workspaces.
- Trying too hard to go full Kubernetes. When we first started building Mattermost Cloud we had the vision of running everything ourselves inside the Kubernetes cluster, including databases and file storage. That proved to be a little too ambitious, and we settled on using managed services from AWS where we could to streamline operations.
Getting into some more detail, at a high level there are three primary components of the system:
- Customer Web Server (CWS), or Customer Portal as some of you may be familiar with it
- Provisioning Server – GitHub repository
- Mattermost Kubernetes Operator – GitHub repository
The Customer Web Server is a fairly standard web server. It provides the front-end portal that our customers use to sign up, handles billing and customer account information, and is what tells the Provisioning Server to create or delete deployments of Mattermost, which we call installations on the backend. This is what serves pages like this.
The Provisioning Server is the brain, or command-and-control, of our entire Cloud architecture. It is responsible for creating and managing Kubernetes clusters; deploying, configuring, and managing Mattermost installations on an appropriate cluster; and scheduling and rolling through updates to both clusters and installations. This is the primary part of that additional Mattermost-aware orchestration layer I mentioned. It has a REST API and is built using a micro-monolith architecture that lets us ship it as a single binary that can run one or more supervisors to handle cluster, installation, and other responsibilities separately. If you’d like to learn more about the specifics, check out the original technical specification.
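To give a feel for the supervisor idea, here is a minimal sketch of one process running a separate supervisor per kind of resource, each nudging its resources one lifecycle step forward per pass. The state names, classes, and handlers are hypothetical and chosen purely for illustration; the real Provisioning Server may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical lifecycle states, chosen for illustration only.
CREATION_REQUESTED = "creation-requested"
CREATION_IN_PROGRESS = "creation-in-progress"
STABLE = "stable"


@dataclass
class Resource:
    """A cluster or an installation tracked by the provisioner."""
    id: str
    state: str


class Supervisor:
    """Owns one kind of resource and moves each one forward a step per pass."""

    def __init__(self, name: str, resources: List[Resource],
                 handlers: Dict[str, Callable[[Resource], str]]):
        self.name = name
        self.resources = resources
        self.handlers = handlers  # current state -> function returning next state

    def supervise(self) -> int:
        """Run one pass and return how many resources changed state."""
        changed = 0
        for resource in self.resources:
            handler = self.handlers.get(resource.state)
            if handler is None:
                continue  # no work for this state, e.g. already stable
            next_state = handler(resource)
            if next_state != resource.state:
                resource.state = next_state
                changed += 1
        return changed


def run_until_settled(supervisors: List[Supervisor], max_passes: int = 10) -> int:
    """Run every supervisor in turn until a full pass changes nothing."""
    for passes in range(1, max_passes + 1):
        if sum(s.supervise() for s in supervisors) == 0:
            return passes
    return max_passes


# Placeholder handlers: real ones would call cloud and Kubernetes APIs.
handlers = {
    CREATION_REQUESTED: lambda r: CREATION_IN_PROGRESS,
    CREATION_IN_PROGRESS: lambda r: STABLE,
}
clusters = [Resource("cluster-a", CREATION_REQUESTED)]
installations = [Resource("workspace-1", CREATION_REQUESTED),
                 Resource("workspace-2", STABLE)]
passes = run_until_settled([
    Supervisor("cluster", clusters, handlers),
    Supervisor("installation", installations, handlers),
])
```

Keeping each responsibility in its own supervisor means the same binary can run all of them together or be started with only a subset enabled.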
Finally, there is our Kubernetes Operator, which handles all the low-level Kubernetes management of the Mattermost installations. It creates all the Kubernetes resources in the cluster needed for Mattermost to run, and knows how to perform rolling updates and other maintenance tasks. If you’re not familiar with Kubernetes operators, you can think of one as the knowledge of a system administrator who is an expert on an application, codified into software that manages the deployment on Kubernetes for you. While we rely heavily on our operator for our Cloud, it is also usable by customers who want to self-host Mattermost on their own Kubernetes clusters.
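At the heart of the operator pattern is a reconcile step: compare the desired state with what actually exists and act on the difference. The sketch below shows that idea with plain dictionaries standing in for Kubernetes resources. It is not the Mattermost Operator’s code; the function names, resource names, and image tags are placeholders.

```python
import copy
from typing import Dict, List, Tuple

Spec = Dict[str, object]


def plan(desired: Dict[str, Spec], actual: Dict[str, Spec]) -> List[Tuple[str, str]]:
    """List the actions needed to make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: Dict[str, Spec], actual: Dict[str, Spec]) -> List[Tuple[str, str]]:
    """Apply the plan to `actual` in place and return the actions taken."""
    actions = plan(desired, actual)
    for action, name in actions:
        if action == "delete":
            del actual[name]
        else:
            actual[name] = copy.deepcopy(desired[name])
    return actions


# Placeholder resources for one installation.
desired = {
    "deployment": {"image": "mattermost:v2", "replicas": 2},
    "service": {"port": 8065},
}
actual = {
    "deployment": {"image": "mattermost:v1", "replicas": 2},
    "stale-job": {},
}
first = reconcile(desired, actual)   # updates, creates, and deletes as needed
second = reconcile(desired, actual)  # nothing left to do
```

Running the step a second time produces no actions, which is the property that makes it safe for an operator to reconcile repeatedly.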
There’s a lot more to it than just that but that’s a good high-level summary.
There are two main pieces of infrastructure that make up Mattermost Cloud: the command-and-control (CnC) Kubernetes cluster and the application Kubernetes clusters.
There is only one CnC cluster per deployment of the entire Cloud infrastructure. Both the Customer Web Server and Provisioning Server run within it, as well as all the other utilities and services we need to run a Cloud service such as logging and metrics collection.
Application Kubernetes clusters are where the Mattermost installations that our customers use are deployed. These clusters are created as needed by the Provisioning Server and there can be as many of them as needed.
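One simple way to picture “created as needed” is a placement routine that fills existing clusters first and only adds a cluster when none has room. This is a sketch under assumptions rather than our real scheduling logic: the per-cluster capacity limit, the fullest-first policy, and the naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    id: str
    capacity: int          # hypothetical cap on installations per cluster
    installations: int = 0

    def has_room(self) -> bool:
        return self.installations < self.capacity


def place_installation(clusters: List[Cluster], new_cluster_capacity: int) -> Cluster:
    """Put one installation on the fullest cluster with room, else add a cluster."""
    if new_cluster_capacity <= 0:
        raise ValueError("new_cluster_capacity must be positive")
    candidates = [c for c in clusters if c.has_room()]
    if candidates:
        # Packing the fullest cluster first keeps the number of clusters low.
        target = max(candidates, key=lambda c: c.installations)
    else:
        target = Cluster(id=f"cluster-{len(clusters) + 1}",
                         capacity=new_cluster_capacity)
        clusters.append(target)
    target.installations += 1
    return target


fleet: List[Cluster] = []
for _ in range(5):
    place_installation(fleet, new_cluster_capacity=2)
```

With a capacity of two, five sign-ups end up spread over three clusters, the last of which is only half full.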
We currently run everything on AWS and leverage S3 and RDS for file storage and databases, respectively. In the future we’d love to go multi-cloud, potentially even giving customers the choice of which cloud provider they’re deployed on.
Want to Learn More?
Check out these other blog posts from our Cloud team:
- Migrating Thousands of Cloud Instances to New Kubernetes Custom Resources
- Automate EKS Node Rotation for AMI Releases
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
- Monitoring Cloud Environments at Scale with Prometheus and Thanos
- Optimizing Database Connection Loads With PgBouncer and Testwick
- Chimera: Painless OAuth for Plugin Frameworks
Join the ~Developers: Cloud channel on our community server if you want to discuss anything you just read with us.
Mattermost is actively hiring SREs and Cloud Engineers. Apply here!