ShuttleCloud

ShuttleCloud is a small startup specializing in email and contact migrations. The company developed a reliable, highly available migration platform used by clients such as Gmail, Google Contacts, and Comcast. For example, Gmail alone has imported data for 3 million users with ShuttleCloud's API, and the company processes hundreds of terabytes every month.

Before transitioning to Prometheus, the company had near-zero monitoring. Now all of its infrastructure is monitored with the necessary metrics and alerts. ShuttleCloud currently monitors around 200 instances with a comfortable, cost-effective in-house monitoring stack based on Prometheus.

In the blog post below, originally published by Prometheus, ShuttleCloud explains why a company does not need a big fleet to embrace Prometheus and why it is an inexpensive monitoring solution.

Ignacio P. Carretero, Software Engineer at ShuttleCloud, also spoke on this topic at CloudNativeCon + KubeCon North America 2016. The video of his presentation can be found here and slides can be found here.

To hear more stories about Prometheus’ production use, participate in technical sessions on the monitoring tool, and learn how it integrates with Kubernetes and other open source technologies, attend PromCon 2017, August 17-18 at Google Munich. Speaking submissions close May 31st. Submit here.

Interview with ShuttleCloud

Posted at: September 7, 2016 by Brian Brazil

Continuing our series of interviews with users of Prometheus, ShuttleCloud talks about how they began using Prometheus. Ignacio from ShuttleCloud also explained how Prometheus Is Good for Your Small Startup at PromCon 2016.

What does ShuttleCloud do?

ShuttleCloud is the world’s most scalable email and contacts data importing system. We help some of the leading email and address book providers, including Google and Comcast, increase user growth and engagement by automating the switching experience through data import.

By integrating our API into their offerings, our customers allow their users to easily migrate their email and contacts from one participating provider to another, reducing the friction users face when switching to a new provider. The 24/7 email providers supported include all major US internet service providers: Comcast, Time Warner Cable, AT&T, Verizon, and more.

By offering end users a simple path for migrating their emails (while keeping complete control over the import tool’s UI), our customers dramatically improve user activation and onboarding.

Screenshot showing Gmail message about Three tips to get the most out of Gmail, with bring your contacts and mail into Gmail being highlighted

ShuttleCloud’s integration with Google’s Gmail Platform. Gmail has imported data for 3 million users with our API.

ShuttleCloud’s technology encrypts all the data required to process an import and follows the most secure standards (SSL, OAuth) to ensure the confidentiality and integrity of API requests. Our technology allows us to guarantee our platform’s high availability, with up to 99.5% uptime assurances.

ShuttleCloud by numbers

What was your pre-Prometheus monitoring experience?

In the beginning, a proper monitoring system for our infrastructure was not one of our main priorities. We didn’t have as many projects and instances as we currently have, so we worked with other simple systems to alert us if anything was not working properly and get it under control.

Fortunately, big customers arrived, and the SLAs became more demanding. We therefore needed something more to measure how we were performing and to ensure that we were complying with all SLAs. One of the features we required was accurate stats about our performance and business metrics (e.g., how many migrations finished correctly), so reporting was more on our minds than monitoring.

We developed the following system:

Diagram shows monitoring system flow

With all that in place, it didn’t take us long to realize that we would need a proper metrics, monitoring, and alerting system as the number of projects started to increase.

Some drawbacks of the systems we had at that time were:

Why did you decide to look at Prometheus?

We analyzed several monitoring and alerting systems. We were eager to get our hands dirty and check whether a solution would succeed or fail. The system we decided to put to the test was Prometheus, for the following reasons:

How do you use Prometheus?

Initially we were only using some metrics provided out of the box by the node_exporter, including:

Our internal DNS service is integrated to be used for service discovery, so every new instance is automatically monitored.

Some of the metrics we used, which were not provided by the node_exporter by default, were exported using the node_exporter textfile collector feature. The first alerts we declared on the Prometheus Alertmanager were mainly related to the operational metrics mentioned above.
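As a rough illustration of the textfile collector approach, the sketch below writes a metric in the Prometheus text exposition format to a `.prom` file, using a temporary file and an atomic rename so a scrape never reads a half-written file. The metric name, file name, and directory are hypothetical and do not come from ShuttleCloud's setup.

```python
import os
import tempfile


def render_metric(name, value, help_text, metric_type="gauge"):
    """Render a single sample in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


def write_textfile(directory, filename, body):
    """Write body to directory/filename atomically.

    The content goes to a temporary file in the same directory first and is
    then renamed over the final path, so readers see either the old file or
    the complete new one.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; relax it so another user can read it.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A cron job could call `write_textfile` with the directory the node_exporter's textfile collector is configured to read, for example after a backup finishes.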

We later developed an operation exporter that allowed us to know the status of the system almost in real time. It exposed business metrics, namely the statuses of all operations, the number of incoming migrations, the number of finished migrations, and the number of errors. We could aggregate these on the Prometheus side and let it calculate different rates.
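To make the idea of an operation exporter concrete, here is a minimal sketch using only the Python standard library. It is not ShuttleCloud's exporter: the metric name `migration_operations_total`, the status values, the port, and `collect_operation_counts` are all assumptions chosen for illustration. It exposes cumulative counts per status on `/metrics`, which Prometheus could then turn into rates.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_operation_counts():
    """Hypothetical data source; a real exporter would query the migration system."""
    return {"incoming": 12, "finished": 340, "error": 3}


def render_metrics(counts):
    """Render cumulative operation counts as one labelled counter."""
    lines = [
        "# HELP migration_operations_total Operations seen so far, by status.",
        "# TYPE migration_operations_total counter",
    ]
    for status in sorted(counts):
        lines.append(f'migration_operations_total{{status="{status}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_operation_counts()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9400):
    """Start the exporter; the port number is an arbitrary placeholder."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. In practice an official Prometheus client library would usually be preferable to hand-rendering the format.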

We decided to export and monitor the following metrics:

Diagram shows monitoring system with Prometheus and Grafana

We have most of our services duplicated in two Google Cloud Platform availability zones. That includes the monitoring system. It’s straightforward to have more than one operation exporter in two or more different zones, as Prometheus can aggregate the data from all of them and make one metric (e.g., the maximum of all). We currently don’t have Prometheus or the Alertmanager in HA (only a metamonitoring instance), but we are working on it.
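The "maximum of all" aggregation can be pictured with a few lines of Python. This only mimics what Prometheus would do on the query side (its `max` aggregation operator); the metric and zone names are placeholders.

```python
def max_across_zones(samples):
    """Collapse per-zone samples into one value per metric by taking the maximum.

    samples maps (metric_name, zone) -> value.
    """
    merged = {}
    for (metric, _zone), value in samples.items():
        if metric not in merged or value > merged[metric]:
            merged[metric] = value
    return merged


samples = {
    ("migrations_finished", "zone-a"): 340,
    ("migrations_finished", "zone-b"): 338,
    ("migrations_errors", "zone-b"): 3,
}
print(max_across_zones(samples))
```

If one exporter lags behind or is down, the other zone's value still yields a usable single series.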

For external blackbox monitoring, we use the Prometheus Blackbox Exporter. Apart from checking if our external frontends are up, it is especially useful for having metrics for SSL certificates’ expiration dates. It even checks the whole chain of certificates. Kudos to Robust Perception for explaining it perfectly in their blog post.
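The Blackbox Exporter reports the earliest certificate expiry in the chain as a Unix timestamp (`probe_ssl_earliest_cert_expiry`). The arithmetic behind an expiry alert is sketched below; the 30-day threshold is an assumption for illustration, not a value taken from ShuttleCloud's rules.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_timestamp, now):
    """Days remaining before the certificate expires (negative if already expired)."""
    return (expiry_timestamp - now) / SECONDS_PER_DAY


def expires_soon(expiry_timestamp, now, threshold_days=30):
    """True when fewer than threshold_days remain before expiry."""
    return days_until_expiry(expiry_timestamp, now) < threshold_days
```

In an alerting rule the same comparison would be written against the scraped timestamp and the current evaluation time.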

We set up some charts in Grafana for visual monitoring in some dashboards, and the integration with Prometheus was trivial. The query language used to define the charts is the same as in Prometheus, which simplified their creation a lot.

We also integrated Prometheus with PagerDuty and created a schedule of people on-call for the critical alerts. For those alerts that were not considered critical, we only sent an email.
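The routing decision described above amounts to a simple rule. The sketch below expresses it in Python purely for illustration; in a real deployment this logic lives in the Alertmanager routing configuration, and the `severity` label and receiver names here are assumptions.

```python
def choose_receiver(labels):
    """Return a receiver name for an alert based on its severity label.

    Critical alerts page the on-call person; everything else goes to email.
    """
    if labels.get("severity") == "critical":
        return "pagerduty-oncall"
    return "team-email"
```

An alert without a severity label falls through to email, which keeps unlabelled alerts from paging anyone.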

How does Prometheus make things better for you?

We can’t compare Prometheus with our previous solution because we didn’t have one, but we can talk about what features of Prometheus are highlights for us:

What do you think the future holds for ShuttleCloud and Prometheus?

We’re very happy with Prometheus, but new exporters are always welcome (Celery or Spark, for example).

One question that we face every time we add a new alarm is: how do we test that the alarm works as expected? It would be nice to have a way to inject fake metrics in order to raise an alarm, to test it.