Prometheus User Profile: ShuttleCloud Explains Why Prometheus Is Good for Your Small Startup

By | Blog

ShuttleCloud is a small startup specializing in email and contacts migration. The company developed a reliable, highly available migration platform used by clients like Gmail, Gcontacts and Comcast. For example, Gmail alone has imported data for 3 million users with the ShuttleCloud API, and the company processes hundreds of terabytes every month.

Before transitioning to Prometheus, the company had near-zero monitoring. Now they have all of their infrastructure monitored with the necessary metrics and alerts. ShuttleCloud currently has around 200 instances monitored with a comfortable cost-effective in-house monitoring stack based on Prometheus.

In the blog post below, originally published by Prometheus, ShuttleCloud explains why a company does not need a big fleet to embrace Prometheus and why it is an inexpensive solution for monitoring.

Ignacio P. Carretero, Software Engineer at ShuttleCloud, also spoke on this topic at CloudNativeCon + KubeCon North America 2016. The video of his presentation can be found here and slides can be found here.

To hear more stories about Prometheus’ production use, participate in technical sessions on the monitoring tool, and learn how it integrates with Kubernetes and other open source technologies, attend PromCon 2017, August 17-18 at Google Munich. Speaking submissions close May 31st. Submit here.

Interview with ShuttleCloud

Posted at: September 7, 2016 by Brian Brazil

Continuing our series of interviews with users of Prometheus, ShuttleCloud talks about how they began using Prometheus. Ignacio from ShuttleCloud also explained how Prometheus Is Good for Your Small Startup at PromCon 2016.

What does ShuttleCloud do?

ShuttleCloud is the world’s most scalable email and contacts data importing system. We help some of the leading email and address book providers, including Google and Comcast, increase user growth and engagement by automating the switching experience through data import.

By integrating our API into their offerings, our customers allow their users to easily migrate their email and contacts from one participating provider to another, reducing the friction users face when switching to a new provider. The 24/7 email providers supported include all major US internet service providers: Comcast, Time Warner Cable, AT&T, Verizon, and more.

By offering end users a simple path for migrating their emails (while keeping complete control over the import tool’s UI), our customers dramatically improve user activation and onboarding.

ShuttleCloud’s integration with Google’s Gmail Platform. Gmail has imported data for 3 million users with our API.

ShuttleCloud’s technology encrypts all the data required to process an import, in addition to following the most secure standards (SSL, OAuth) to ensure the confidentiality and integrity of API requests. Our technology allows us to guarantee our platform’s high availability, with up to 99.5% uptime assurances.

What was your pre-Prometheus monitoring experience?

In the beginning, a proper monitoring system for our infrastructure was not one of our main priorities. We didn’t have as many projects and instances as we currently have, so we worked with other simple systems to alert us if anything was not working properly and get it under control.

  • We had a set of automatic scripts to monitor most of the operational metrics for the machines. These were cron-based and executed using Ansible from a centralized machine. The alerts were emails sent directly to the entire development team.
  • We trusted Pingdom for external blackbox monitoring and checking that all our frontends were up. They provided an easy interface and alerting system in case any of our external services were not reachable.

Fortunately, big customers arrived, and the SLAs started to be more demanding. Therefore, we needed something else to measure how we were performing and to ensure that we were complying with all SLAs. One of the features we required was to have accurate stats about our performance and business metrics (i.e., how many migrations finished correctly), so reporting was more on our minds than monitoring.

We developed the following system:

  • The source of all necessary data is a status database in CouchDB. There, each document represents one status of an operation. This information is processed by the Status Importer and stored in relational form in a MySQL database.
  • A component gathers data from that database, with the information aggregated and post-processed into several views.
    • One of the views is the email report, which we needed for reporting purposes. This is sent via email.
    • The other view pushes data to a dashboard, where it can be easily controlled. The dashboard service we used was external. We trusted Ducksboard, not only because the dashboards were easy to set up and looked beautiful, but also because they provided automatic alerts if a threshold was reached.

With all that in place, it didn’t take us long to realize that we would need a proper metrics, monitoring, and alerting system as the number of projects started to increase.

Some drawbacks of the systems we had at that time were:

  • No centralized monitoring system. Each metric type had a different one:
    • System metrics → Scripts run by Ansible.
    • Business metrics → Ducksboard and email reports.
    • Blackbox metrics → Pingdom.
  • No standard alerting system. Each metric type had different alerts (email, push notification, and so on).
  • Some business metrics had no alerts. These were reviewed manually.

Why did you decide to look at Prometheus?

We analyzed several monitoring and alerting systems. We were eager to get our hands dirty and check whether a solution would succeed or fail. The system we decided to put to the test was Prometheus, for the following reasons:

  • First of all, you don’t have to define a fixed metric system to start working with it; metrics can be added or changed in the future. This provides valuable flexibility when you don’t know all of the metrics you want to monitor yet.
  • If you know anything about Prometheus, you know that metrics can have labels, which abstract away the fact that several different time series are involved. This, together with its query language, provided even more flexibility and a powerful tool. For example, we can have the same metric defined for different environments or projects and get a specific time series or aggregate certain metrics with the appropriate labels:
    • http_requests_total{job="my_super_app_1",environment="staging"} – the time series corresponding to the staging environment for the app “my_super_app_1”.
    • http_requests_total{job="my_super_app_1"} – the time series for all environments for the app “my_super_app_1”.
    • http_requests_total{environment="staging"} – the time series for all staging environments for all jobs.
  • Prometheus supports DNS-based service discovery. We happened to already have an internal DNS service.
  • There is no need to install any external services (unlike Sensu, for example, which needs a data-storage service like Redis and a message bus like RabbitMQ). This might not be a deal breaker, but it definitely makes the test easier to perform, deploy, and maintain.
  • Prometheus is quite easy to install, as you only need to download a single executable built with Go. The Docker container also works well and is easy to start.
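
To make the label selectors above concrete, here is a minimal sketch in Python. It is a toy model, not how Prometheus is implemented: each time series is reduced to its label set, the series list is made up for illustration, and `select` is a hypothetical helper that supports only equality matching.

```python
# Toy model of label-based selection; this is not how Prometheus is implemented.
from typing import Dict, List

Series = Dict[str, str]  # one time series, identified by its label set

# Hypothetical series for the metric http_requests_total.
SERIES: List[Series] = [
    {"job": "my_super_app_1", "environment": "staging"},
    {"job": "my_super_app_1", "environment": "production"},
    {"job": "my_super_app_2", "environment": "staging"},
]


def select(series: List[Series], **matchers: str) -> List[Series]:
    """Return the series whose labels equal every given matcher."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


staging_app_1 = select(SERIES, job="my_super_app_1", environment="staging")
all_app_1 = select(SERIES, job="my_super_app_1")
all_staging = select(SERIES, environment="staging")
```

The three calls mirror the three selectors in the list: the fewer labels you fix, the more series the selector matches.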

How do you use Prometheus?

Initially we were only using some metrics provided out of the box by the node_exporter, including:

  • hard drive usage.
  • memory usage.
  • if an instance is up or down.

Our internal DNS service is integrated to be used for service discovery, so every new instance is automatically monitored.

Some of the metrics we used, which were not provided by the node_exporter by default, were exported using the node_exporter textfile collector feature. The first alerts we declared on the Prometheus Alertmanager were mainly related to the operational metrics mentioned above.
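
As an illustration of the textfile collector approach, the sketch below writes gauge samples in the Prometheus text exposition format to a `.prom` file, which the node_exporter textfile collector reads from its configured directory. The function name, the metric name in the usage note, and the choice to treat every value as a gauge are assumptions for the example, not ShuttleCloud's actual scripts.

```python
import os
import tempfile


def write_textfile_metrics(directory: str, name: str, metrics: dict) -> str:
    """Write gauge samples to <directory>/<name>.prom via an atomic rename."""
    lines = []
    for metric, value in sorted(metrics.items()):
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    body = "\n".join(lines) + "\n"

    # Write to a temporary file first so the collector never reads a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; relax it so another user can read it.
    os.chmod(tmp_path, 0o644)

    final_path = os.path.join(directory, f"{name}.prom")
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return final_path
```

A cron job could call, for instance, `write_textfile_metrics("/path/to/textfile/dir", "backup", {"backup_last_success_timestamp_seconds": 1473206400})`; both the path and the metric name here are placeholders.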

We later developed an operation exporter that allowed us to know the status of the system almost in real time. It exposed business metrics, namely the statuses of all operations, the number of incoming migrations, the number of finished migrations, and the number of errors. We could aggregate these on the Prometheus side and let it calculate different rates.

We decided to export and monitor the following metrics:

  • operation_requests_total
  • operation_statuses_total
  • operation_errors_total
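
Because these are counters, the interesting value is usually how fast they grow. The sketch below shows a simplified per-second rate over a window of scraped samples, including the handling of a counter reset after an exporter restart. It is only an approximation of the idea: Prometheus's own `rate()` function also extrapolates to the window boundaries, which this sketch does not, and the sample values are invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the window, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the exporter restarted and the counter began again at zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes of operation_errors_total, 60 seconds apart.
scrapes = [(0.0, 10.0), (60.0, 16.0), (120.0, 4.0), (180.0, 10.0)]
errors_per_second = simple_rate(scrapes)
```

In the example the counter drops from 16 to 4 between the second and third scrape; the sketch counts that as an increase of 4 rather than a decrease of 12.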

We have most of our services duplicated in two Google Cloud Platform availability zones. That includes the monitoring system. It’s straightforward to have more than one operation exporter in two or more different zones, as Prometheus can aggregate the data from all of them and make one metric (for example, the maximum of all). We currently don’t have Prometheus or the Alertmanager in HA (only a metamonitoring instance), but we are working on it.
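
The "maximum of all" aggregation can be sketched as follows. The `zone` label, the zone names, and the values are hypothetical; the helper groups samples by every label except the one being dropped and keeps the largest value per group, which is comparable in spirit to a PromQL `max without (zone) (...)` expression.

```python
from typing import Dict, List, Tuple

LabelKey = Tuple[Tuple[str, str], ...]

# (labels, value) pairs; the "zone" label and the values are hypothetical.
samples: List[Tuple[Dict[str, str], float]] = [
    ({"status": "finished", "zone": "zone-a"}, 120.0),
    ({"status": "finished", "zone": "zone-b"}, 118.0),
    ({"status": "error", "zone": "zone-a"}, 3.0),
    ({"status": "error", "zone": "zone-b"}, 4.0),
]


def max_without(
    data: List[Tuple[Dict[str, str], float]], drop_label: str
) -> Dict[LabelKey, float]:
    """Group samples by all labels except drop_label and keep the maximum."""
    result: Dict[LabelKey, float] = {}
    for labels, value in data:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        result[key] = max(value, result.get(key, float("-inf")))
    return result


merged = max_without(samples, "zone")
```

Each remaining series then reflects whichever exporter has seen the most, so a lagging exporter in one zone does not pull the reported value down.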

For external blackbox monitoring, we use the Prometheus Blackbox Exporter. Apart from checking if our external frontends are up, it is especially useful for having metrics for SSL certificates’ expiration dates. It even checks the whole chain of certificates. Kudos to Robust Perception for explaining it perfectly in their blogpost.
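
The idea behind a certificate-expiry metric can be sketched with the Python standard library. This is not the Blackbox Exporter's code, and it looks at a single certificate rather than the whole chain; `days_until_expiry` is a hypothetical helper that converts a certificate's `notAfter` string into the number of days left.

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (Unix seconds) and a certificate's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400.0


# notAfter uses the format found in the dict returned by SSLSocket.getpeercert().
# The timestamp below is chosen to be exactly 30 days before that expiry date.
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", now=1515144883 - 30 * 86400)
```

An alert would then fire when the remaining time falls below a threshold you choose, such as a few weeks.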

We set up some charts in Grafana for visual monitoring in some dashboards, and the integration with Prometheus was trivial. The query language used to define the charts is the same as in Prometheus, which simplified their creation a lot.

We also integrated Prometheus with PagerDuty and created a schedule of people on-call for the critical alerts. For those alerts that were not considered critical, we only sent an email.

How does Prometheus make things better for you?

We can’t compare Prometheus with our previous solution because we didn’t have one, but we can talk about what features of Prometheus are highlights for us:

  • It has very few maintenance requirements.
  • It’s efficient: one machine can handle monitoring the whole cluster.
  • The community is friendly—both dev and users. Moreover, Brian’s blog is a very good resource.
  • It has no third-party requirements; it’s just the server and the exporters. (No RabbitMQ or Redis needs to be maintained.)
  • Deployment of Go applications is a breeze.

What do you think the future holds for ShuttleCloud and Prometheus?

We’re very happy with Prometheus, but new exporters are always welcome (Celery or Spark, for example).

One question that we face every time we add a new alarm is: how do we test that the alarm works as expected? It would be nice to have a way to inject fake metrics in order to raise an alarm, to test it.

Developing Cloud Native Applications

By Ken Owens, Technologist and Innovation Engineer, currently CTO of Cloud Native Platforms at Cisco Systems and Cloud Native Computing Foundation (CNCF) Technical Oversight Committee (TOC) representative

*Blog originally posted on DevNetCreate.io

Figure 1: Cloud Native Reference architecture

Software engineering and developer communities are driving the market for cloud consumption and leading each industry into a new era of software-defined disruption. There are no longer questions about elastic and flexible agile development as the way to innovate and reduce time to market for businesses. Open source software plays a key role in the digital transformation to cloud native, and understanding how your business strategy needs to address this next disruption in software development is crucial to the success of your business.

Cloud Native applications are a combination of existing and new software development patterns. The existing patterns are software automation (infrastructure and systems), API integrations, and service-oriented architectures. The new cloud native pattern consists of microservices architecture, containerized services, and distributed management and orchestration. The journey towards cloud native has started and many organizations are already experimenting with this new pattern. To be successful in developing cloud native applications, it is important that you prepare for the cloud native journey and understand the impact on your infrastructure.

Preparing for the Journey

The first step in a successful business transformation is setting the vision and the goal for the organization. Many organizations struggle with transformation because they start with the technology first.

Technology is exciting and challenging, but the lesson learned from the industry is not to start there. Start instead with your business mission, your vision, and your people.

At the outset of the transformation, it is critical to get your leadership, partners, and customers on board with your plans, set clear expectations, and gather feedback often. It is important to over-communicate at this initial stage and ensure that buy-in is strong. As you progress on the cloud native journey, you will need to get the support and cover from your leadership.

The next step is to assemble the right team and break down the vision for the journey into phases or steps that are further decomposed into development actions or sprints. It is critical to evaluate the team members’ strengths and weaknesses against the organizational goals to accomplish your vision and invest upfront in training. Most good operations staff and engineers are interested in furthering their careers with training.

Lastly, evaluate technology choices and plan for technology integration with your existing back office, support, and IT systems including existing processes you have in place as an organization. You will need to work with other organizations across the business to identify skill sets needed and support the training or staff augmentation requirements of the other organizations.

Infrastructure Impact

If everyone is using the Cloud and everyone is using the same services, your service will perform, fail, and be only as secure as your competitors’. Perhaps you’re going for “good enough” service offerings, but businesses need differentiation, and you only get differentiation by taking advantage of both the cloud native software patterns and the latest technology advances in the underlying hardware infrastructure.

When evaluating the impact of infrastructure, it is important to start with the end goal of software defined, automated, and integrated infrastructure in mind. “Software defined” is often an industry buzzword, but it is very important to look at your existing network, compute, virtualization, storage, and security solutions as a set of software abstractions that can be programmatically configured (set) and consumed (get). Software defined infrastructure is considered part of the infrastructure services (network, compute, storage, and security) that can be configured for the business services to be deployed into; alternatively, the business can deploy a set of services leveraging these abstraction layers, often called blueprints.

Automated infrastructure is the ability to leverage APIs for provisioning a set of services and configuration primitives that enable abstractions and sets of services that deploy pre-defined sets of configuration and then validate the installation and readiness for the application to be deployed. Automation is key for cloud native, as applications need to be able to configure and re-configure in real time based on a number of inputs. These inputs range from failure states, to user demand, to application performance.

Integrated infrastructure considerations are often a secondary thought; however, they should be part of your initial planning. Many applications have dependencies that are internal to the IT infrastructure (i.e. behind the firewall). These dependencies can be database related, IT compliance related, OSS related, or BSS related. Often, multiple dependencies are discovered as part of the application composition exercise. When evaluating your existing infrastructure, it is important to look at the integration points between the services your application depends on and the cloud native architecture. Many of these integration points can be abstracted as a set of services and API end points.

Defining Business Services

Once you have prepared your organization for the journey and addressed the infrastructure impact (existing IT, back office, and cloud native technologies), it is time to define business services for the application(s) you are developing. In general, the best way to define business services is to understand the four subsystems of a cloud native architecture:

  • Application Composition
  • Policy and Event Framework
  • Application Delivery
  • Common Control and Ops

Figure 2: Cloud Native Architecture Sub-Systems

There are several ways to decompose your application, and in general, there is no specific right or wrong way. The best guideline from experience is to start by looking at the composition of your application as a set of services. Application composition is as much an art as a science. My recommendation is to look at the application composition as a black box with three components:

Figure 3: Cloud Native Application Composition

  1. Internals of the application black box, which consist of the application design leveraging the functions that comprise the application logic.
  2. North bound interfaces from the black box, which consist of external API interfaces to the customer and external services (external to your firewall).
  3. South bound interfaces from the black box, which consist of internal API interfaces to internal services, including OSS and BSS.

The application design (internals of the black box) for cloud native design methodology should be thought of as a set of function calls and dependencies. Each function should be independent of the other functions and operate independently as much as possible. The state and scalability of each function should be completely independent.

Each function needs a defined set of policies and events to enable the scalability and resiliency of the service functionality as well as an independent set of common control and operational primitives that enable the individual function service health and control to be performed independent of the other application functional components. The cloud native application composition can be codified into a Cloud Native Application Blueprint as shown below:

Figure 4: Cloud Native Application Composition Blueprint

As mentioned above, the most complicated aspect of the application composition consists of the external and internal services the application depends on. The business models that leverage the OSS and BSS components need to be evaluated in light of business process and function. Some of these services can be containerized while others cannot. How to enable a cloud native application to integrate with the OSS and BSS services should not be overlooked.

In addition, external services, especially from cloud providers, are very common. The top concern experienced in cloud native deployments has to do with the latency between the business application services and the external services consumed by the application services. These connections often experience timeouts, packet loss, and delays that cause user experience issues. To address this aspect of your cloud native design, you need to understand the networking impact of your external interfaces and leverage DNS and SDN controllers to optimize the routing within the application services.

Deploy Business Services

Application deployment must be separate from composition. Application portability is a key business requirement, and one of the more reliable methods to achieve it is to decouple the application code from the underlying deployment target.

The following guidelines are based on best practices for application portability:

  • Deploying the application into different environments (dev, test, production), each of which can consist of different environments (laptop, server, bare metal, private cloud, or public cloud)
  • Deploying to different locations (data center(s), availability zones, geo-location constraints)
  • Continuous Integration and Continuous Delivery of the application services across environments, location, and hybrid models to ensure that application and application services are continuously refreshed
  • Continually improve and optimize performance and scale. Once the basics and some of these underlying issues of the technology are understood, the team can then focus on improvements from a process and technology aspect. Sprints should incorporate user stories to address performance and stability issues. Then deployments can reflect these improvements along with enhancements to the underlying infrastructure, BSS, and OSS aspects.

Join the Community

Join CNCF in San Francisco May 23 and 24 for DevNet Create 2017, where we will hear from Dan Kohn, Executive Director of CNCF, on Migrating Legacy Monoliths, a panel on “Becoming Cloud Native: Taking it One Container at a Time”, and many other sessions on Cloud and DevOps, IoT and Apps, and more!

Diversity Scholarship Series: My experience at CloudNativeCon + KubeCon Europe 2017

CNCF offered six diversity scholarships to developers to attend CloudNativeCon + KubeCon Europe 2017. In this post, our scholarship recipient Konrad Djimeli, University of Buea student, shares his experience meeting the community, participating in technical sessions and bringing his experience back to his community in Africa. Anyone interested in applying for the CNCF diversity scholarship to attend CloudNativeCon + KubeCon North America 2017 in Austin December 6-8 can submit an application here. Applications are due October 13th.

By Konrad Djimeli, Software Developer, Community organizer (GDG Buea), GitHub Campus Expert and Google Summer of Code 2015 and 2017 Software engineering intern

Being a CloudNativeCon + KubeCon Europe 2017 Diversity Scholarship recipient was actually a life changing experience for me. I came across this scholarship while browsing the Linux Foundation website for upcoming events. When I stumbled on this event and realized it had a scholarship opportunity, I was very happy. I am passionate about cloud computing and container related technologies, which is what the conference was all about. I applied for the scholarship knowing that it would give me the opportunity to learn from experts and also gain experience, which I could share with members of my community to inspire and motivate them.

One day, I opened my email inbox and found an email saying I had been selected as a CloudNativeCon + KubeCon Europe 2017 scholarship recipient. This was like a dream to me, as I had never had the opportunity to attend any conference outside Cameroon. This was also going to be my first time traveling out of my country. My contact at the Linux Foundation, Katie Schultz, was very helpful in enabling me to obtain a visa and meet every other requirement for my travel and accommodation. The trip to Berlin alone was an interesting experience, and my hotel was just a few minutes’ walk from the Berlin Congress Center, where the conference was taking place. I went to the conference hall after arriving and obtained my conference badge.

While at the conference center, the meals were great and the talks were very interesting, although I got lost during some of the talks because they required technical knowledge I do not possess yet. Even so, I was inspired and motivated to work hard and improve my technical knowledge. I was excited to visit the sponsor booths, and it was great to talk with employees from some of the tech giants of the world, like Google, Microsoft and others. I had a very interesting chat with some employees at Bitnami, including the CEO Daniel Lopez Ridruejo, who was very humble and down to earth.

Photo Caption: Konrad and Daniel Lopez Ridruejo

Talking with Sebastien Goasguen, Senior Director of Cloud at Bitnami, led me to start contributing code to a project that involves developing Jupyter Notebooks for the Kubernetes Python Client. These notebooks can be used by others to learn about and explore Kubernetes and its functionality interactively, and the project is currently hosted on GitHub. This is actually going to be the topic of my Google Summer of Code (GSOC) project this summer.

I was also very glad when I got the chance to meet with Katie Schultz who had been very helpful in making it possible for me to be at the conference.

Photo Caption: Konrad and Katie Schultz

Everyone I met at the conference was very smart and hardworking and it made me realize how much we as developers in Africa have to work in order to become world class developers.

After the conference, I returned to Cameroon and sometimes feel like it was all a dream, as the experience seemed too good to be real. As a member of the Silicon Mountain community in Cameroon, I have shared my experience with other members of my community and they have all been motivated. I have also shared with them the importance of cloud and container technologies in our community and how integrating these technologies into our applications could improve their performance and maintenance.

I think it is extremely important for developers from African communities like mine to get an opportunity to attend such an international tech conference. This experience makes us much more aware of the hard work required to achieve our goals.

Where In the World is CNCF? Find Us at These Community Events 🗓

Throughout the next few weeks, CNCF is sponsoring, speaking and exhibiting at a number of exciting community events, including: Amazonia, OpenStack Summit Boston, OSCON, DevNet Create, Open Source Summit Japan, CoreOS Fest and LinuxCon + ContainerCon + CloudOpen China.

These open source conferences help foster collaborative conversation and feature expert insight around containerized applications, microservices, the modern infrastructure stack, uniting women in tech, Kubernetes and much more.

We hope to see you at our booths, meetups and talks during the following events!


Amazonia

May 6, 2017


As London is home to many Women Who Code meetups and workshop events, Amazonia is a hyperlocal gathering open to all female-identified (cis and trans) and non-binary people – showcasing technical might, promoting diversity in tech and fostering cross-community relationships.

CNCF is a proud sponsor of Amazonia!

OpenStack Summit Boston

May 8-11, 2017


On May 9, from 8 AM – 5 PM, CNCF will host a Kubernetes Day as part of OpenStack’s Open Source Days. Kubernetes is one of the highest-velocity projects in the history of open source; please join us for a deep dive into the system and to learn how Kubernetes is changing computing.

CNCF will also be exhibiting on the show floor all week. Our booth, staffed with the Foundation team and member company technologists, will be located at C20 – don’t forget to stop by!

Don’t miss “Migrating Legacy Monoliths to Cloud Native Microservices Architectures on Kubernetes” – a session from Dan Kohn, executive director of the Cloud Native Computing Foundation, at 4:10 PM, Thursday, May 11 in MR 210.


OSCON

May 8-11, 2017


CNCF will be exhibiting at Austin Convention Center (Hall 4) Booth 609 all week. Don’t miss the following presentations and speaking engagements from CNCF community members and ambassadors:

Monday, May 8
Kubernetes Hands-On, Kelsey Hightower, 1:30pm–5:00pm

Location: Ballroom F

Tuesday, May 9
From zero to distributed traces: An OpenTracing tutorial, Ben Sigelman, Yuri Shkuro, and Priyanka Sharma

Location: Ballroom E

Wednesday, May 10
Shifting to Kubernetes on OpenShift, Seth Jennings, 2:35pm–3:15pm

Location: Meeting Room 14

Thursday, May 11
Hands-on with containerized infrastructure services, Shannon Williams and Darren Shepherd

Location: Ballroom E

DevNet Create

May 23-24, 2017

San Francisco

On May 23, CNCF ambassador Val Bercovici; Mackenzie Burnett, product at CoreOS; Mark Thiele, chief strategy officer at Apcera; and Stephen Day, senior software engineer at Docker, will participate in a panel titled “Becoming Cloud Native: Taking it One Container at a Time.”

On the same day, Dan Kohn will also present “Migrating Legacy Monoliths to Cloud Native Microservices Architectures on Kubernetes.”

Open Source Summit Japan

May 31-June 2


CNCF is a proud Gold Sponsor of the Summit and will host a Fluentd Mini Summit on May 31 during the event, which will introduce attendees to the CNCF technology project, cloud native logging and more. To view the schedule, learn about the sessions and register to attend the Fluentd Mini Summit, please visit http://bit.ly/2q5Lqwi.

On day one, attendees will hear Dan Kohn present Migrating Legacy Monoliths to Cloud Native Microservices Architectures on Kubernetes at 11 AM in Room 1.

Also on May 31, do not miss a panel discussion on “The Future is Cloud Native: How Projects Like Kubernetes, Fluentd, OpenTracing, and Linkerd Will Help Shape Modern Infrastructure” moderated by Chris Aniszczyk, COO of CNCF and featuring speakers Keiko Harada, Program Manager at Microsoft; Ian Lewis, Developer Advocate at Google; and Eduardo Silva, Open Source Engineer at Treasure Data.

CoreOS Fest

May 31-June 1

San Francisco

CNCF is proud to be a Star Sponsor and can’t wait to gather with systems architects, DevOps engineers, sysadmins, application developers, security engineers and more at Pier 27 later this month!

LinuxCon + ContainerCon + CloudOpen China

June 19-20, 2017


On June 20, catch Dan Kohn presenting “Migrating Legacy Monoliths to Cloud Native Microservices Architectures on Kubernetes” and Chris Aniszczyk presenting “How the Open Container Initiative (OCI) is Setting Standards for Container Format and Runtime.”

Kubernetes Making a Splash at OpenStack Summit Boston (May 7-11)

By: OpenStack Special Interest Group leaders, Ihor Dvoretskyi from Mirantis and Steve Gordon from Red Hat, highlighting the status of collaboration between Kubernetes and OpenStack

OpenStack Summit is happening next week in Boston (May 7-11). It is one of the most important and notable bi-annual events in the cloud computing world, especially in the world of open source cloud computing.

This summit will feature an increased number of high-quality tracks, talks, panels and other community-gathering events that are dedicated not just to OpenStack, but to other “friendly” technologies from the world of open source. These technologies are used by many in conjunction with OpenStack, and one of the most noteworthy technologies on the list is Kubernetes.

OpenStack is an open source solution that allows you to build your own cloud, leveraging both the OpenStack software itself and the ecosystem around it, on controlled hardware, whether in a single location or on distributed worldwide multi-datacenter infrastructure. At the same time, end users need to run a mix of workloads: not just pure virtual machines, but also bare-metal and containerized workloads. Here Kubernetes is increasingly being used to assist with orchestration of these workloads.

Kubernetes brings a different layer of abstraction between infrastructure (OpenStack) and end-user applications. Kubernetes provides the ability to manage containerized applications providing enough contextual awareness to take advantage of the capabilities of the underlying clouds while ensuring the technical independence of the applications themselves from that infrastructure.

In this context, OpenStack exposes a rich set of infrastructure-level services for distributed applications to consume via Kubernetes, much as the Linux kernel traditionally exposed hardware resources on a single physical host for consumption by userspace processes. OpenStack and Kubernetes provide the level of workflow automation and repeatability required to scale these concepts across the context of a distributed cluster. Here OpenStack works as a cloud provider for Kubernetes, and several Kubernetes-related projects (for example, kargo, tectonic, and OpenShift) are solving the question of deployment and lifecycle management of Kubernetes on OpenStack.

Kubernetes increasingly also plays a different role for OpenStack – it can act as an underlay for the containerization and management of the OpenStack services. OpenStack projects including Kolla-Kubernetes and OpenStack-Helm are actively pursuing this approach to evolving the deployment and management of OpenStack itself.

At OpenStack Summit Boston, besides the general track, where you will find many Kubernetes-related talks, panels and other community-gathering events (schedule), the Cloud Native Computing Foundation and OpenStack Foundation are organizing a community day specifically dedicated to Kubernetes – Kubernetes Day. Refer to the recent article in Superuser, the OpenStack Foundation’s publication highlighting the contributions of superusers, for more information and background on this great event focused on interoperability of not just technologies but communities. Also, this year at the first OpenStack Forum – an enhanced version of what was formerly part of the Developers and Operators Summits, co-located with the main OpenStack Summit – there will be a Kubernetes Ops on OpenStack session. This session will focus on gathering feedback from users and determining how to apply it to the future development roadmaps of both projects.

In the Kubernetes community, the OpenStack Special Interest Group is the focal point for cross-collaboration between the OpenStack and Kubernetes communities to build integrated solutions. We are always looking for new contributors to join us on this journey. If you are interested in joining us, jump on the mailing list and say hi!

CNCF Brings Kubernetes, CoreDNS, OpenTracing and Prometheus to Google Summer of Code 2017


The Google Summer of Code (GSOC) program allows university students (over the age of 18) from around the world to spend their summer breaks writing code and learning about open source development. Accepted students work with a mentor and become a part of the open source community. Now in its 13th year, the program has previously accepted 12,000+ students from 104 countries to work on 568 open source projects, writing over 30 million lines of code.

201 organizations were accepted to participate in GSOC 2017 to bring new, excited developers into their community and the world of open source. The Cloud Native Computing Foundation is proud to be one of these organizations, bringing seven interns this summer. Mentors were paired with interns to help advance the following CNCF projects: 4 Kubernetes, 1 CoreDNS, 1 OpenTracing and 1 Prometheus.

“As a former GSOC mentor, I have seen the amazing impact this program has on the students, projects and larger open source community. CNCF is very proud to have 7 projects in the 2017 program that cover a range of our cloud native technologies. We look forward to watching the progress and results of the students over the summer.” – Chris Aniszczyk (@cra)

Additional details on the projects, mentors, and students can be found below. Coding begins May 30 and we’ll report back on their progress in a few months.


Create and Implement a Data Model to Standardize Kubernetes Logs

Student: Amit Kumar Jaiswal, UIET CSJM University

Mentor: Miguel Perez Colino, Red Hat

This project aims to build and implement a data model for logs in a large Kubernetes cluster to process, correlate, and query to make troubleshooting easier and reduce the time in finding root causes.

Develop a Set of Jupyter Notebooks for the Kubernetes Python Client + Kubernetes Python Client Update

Student: Konrad Djimeli, University of Buea       

Mentor: Sebastien Goasguen, Skippbox (acquired by Bitnami)

The Kubernetes python client is a Kubernetes incubator project. The python client makes it possible to access Kubernetes with python. Jupyter notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process. The aim of this project is to develop a set of notebooks that highlight the Kubernetes primitives. This project would also include updating the python client to make it easier for users to carry out certain operations.

Improve ThirdPartyResources  

Student: Nikhita Raghunath, Veermata Jijabai Technological Institute (Mumbai)

Mentor: Stefan Schimanski, Red Hat

ThirdPartyResources are already available, but the implementation has languished with multiple outstanding capabilities missing. They did not complete the list of requirements for graduating to beta. Hence, there are multiple problems present in the current implementation of ThirdPartyResources. This project aims to work towards a number of known shortcomings to drive the ongoing effort toward a stable TPR release forward.

Integrate Unikernel Runtime

Student: Hao Zhang, Zhejiang University, Computer Science (master)

Mentor: Harry Zhang, Lob and Pengfei Ni, HyperHQ

This work will focus on why and how to integrate unikernel technology as a runtime into the Kubernetes/frakti project. This will allow Kubernetes to use a unikernel instance just like it uses Docker, which eventually will open Kubernetes up to more application scenarios.


CoreDNS: Middleware

Student: Antoine Debuisson (University of Paris-Sud)

Mentor: Miek Gieben, CoreDNS and John Belamaric, Infoblox

The goal of the project is to capture the DNS data within a CoreDNS middleware and write it to a “dnstap log file” (perhaps over the network).

Codebase to build upon:


Instrument OpenTracing with Go-restful Web Framework

Student: Liang Mingqiang, Hyogo University and Carnegie Mellon University

Mentor: Ted Young, LightStep and Wu Sheng, OpenTracing

Go-restful (https://github.com/emicklei/go-restful) is a widely-used library for building REST-style web services using the Go programming language. With powerful built-in modules, including intelligent request routing, RESTful support and filters for intercepting HTTP requests, go-restful makes it very convenient to build a web application from scratch. This proposal aims to instrument go-restful with OpenTracing.


Storage and Query Engine Improvements to Prometheus

Student: Goutham Veeramachaneni, Indian Institute of Technology, Hyderabad

Mentor: Ben Kochie, SoundCloud, Fabian Reinartz, CoreOS and Julius Volz, Prometheus

While the Prometheus monitoring system solves most use-cases, improvements will reduce the already minimal load on the ops team, including checking alerts over time, unit-testing alerts, backups and figuring out which queries OOM.

Meeting Challenges in Using and Deploying Containers


The Cloud Native Computing Foundation (CNCF) surveyed attendees at CloudNativeCon + KubeCon in late 2016 on a range of topics related to container management and orchestration. In a previous blog, we examined the implications of survey results, in particular how Kubernetes had advanced from the test bench to real-world production deployments in the course of the preceding year.

An equally interesting data set coming out of that survey, and from earlier surveys conducted by Google, highlighted challenges respondents faced as they increasingly used and deployed applications with containers. Respondents could include multiple topics in their responses and, in the earlier Google surveys, also insert freeform commentary.

Figure 1 summarizes the three response sets from the CloudNativeCon + KubeCon (Nov. ‘16) and previous Google surveys (June and March ‘16):

Figure 1.  Challenges to Adoption
(Non-exclusive Survey Responses)

Characterizing the Challenges

Let’s take a moment to examine the leading concerns in the survey, and also in the earlier ones:

Networking – 50 percent of CloudNativeCon + KubeCon respondents pointed to “networking” as their greatest challenge. Those who articulated their concerns further focused on:

  • Debugging network connectivity, especially across managed containers and containers deployed on geographically disparate clouds
  • More configurable and secure networking with multi-tenancy

Cloud native networking certainly appeals to networking engineers as well as a growing number of developers who increasingly find it a part of their daily work. In this recent CNCF webinar, Christopher Liljenstolpe, CTO of Tigera and Founder of Project Calico, and Brian Boreham, Director of Engineering, Weaveworks, dive into networking for containers and microservices.

Security – as production deployment increases, so do security risks, in particular for containers hosting execution of Internet- and customer-facing applications.  With 42 percent of CloudNativeCon + KubeCon respondents highlighting security, specific concerns included:

  • Applying security patches and updates to container contents
  • Network isolation and secure isolation/communication among managed containers
  • Understanding the scope of potential attack surfaces

Storage & Resource Management – “storage” led the responses for the earlier Google surveys, and almost half (42 percent) of respondents at CloudNativeCon + KubeCon still voiced concerns in this area:

  • Lack of appropriate and accessible network storage
  • Secure and standards-compliant network storage (e.g., for HIPAA)
  • Persistent and performant storage
  • Meeting legacy storage requirements and storage portability
  • Better load management
  • Standardization of / patterns for file systems and container layouts

Complexity – 39 percent of  CloudNativeCon + KubeCon respondents with the above concerns also cited “complexity” as a challenge, and certainly issues with networking, security and storage contribute to these concerns.  

Logging and Monitoring – also high on respondents’ list of concerns at 42 percent was logging and monitoring, in particular:

  • The need for more detailed k8s manifests
  • More insight into operational metrics
  • More robust application logging capabilities

Meeting the Challenges

The challenges posed by respondents are being addressed incrementally by CNCF project developers and the container management ecosystem.  In particular, CNCF projects that address the above technical hurdles in networking, security and storage, and also logging, tooling and automation, and beyond, include:

Linkerd – Resilient service mesh for cloud native apps, including a transparent proxy that adds service discovery, routing, failure handling, and visibility to modern software applications.

Learn more at https://linkerd.io/

Fluentd – A data collector for unified logging layer. Fluentd lets you unify data collection and consumption for a better use and understanding of data.

Learn more at http://www.fluentd.org/

Kubernetes – Kubernetes itself focuses on automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.

Learn more at https://kubernetes.io/

Prometheus – A systems monitoring and alerting toolkit with a very active developer and user community. Prometheus works well for recording numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures.

Learn more at https://prometheus.io/

The CNCF Cloud Native Landscape Project categorizes many of the most popular projects and startups in the cloud native space. This is another resource where people can find technologies that might help solve their technical challenges. It is under development by CNCF, Redpoint and Amplify.

Learn more about all CNCF projects at https://www.cncf.io/projects.

Service Mesh: A Critical Component of the Cloud Native Stack


What’s a service mesh? And why do I need one?

Originally published by William Morgan on Buoyant.io.

tl;dr: A service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable. If you’re building a cloud native application, you need a service mesh!

Over the past year, the service mesh has emerged as a critical component of the cloud native stack. High-traffic companies like Paypal, Lyft, Ticketmaster, and Credit Karma have all added a service mesh to their production applications, and this January, Linkerd, the open source service mesh for cloud native applications, became an official project of the Cloud Native Computing Foundation. But what is a service mesh, exactly? And why is it suddenly relevant?

In this article, I’ll define the service mesh and trace its lineage through shifts in application architecture over the past decade. I’ll distinguish the service mesh from the related, but distinct, concepts of API gateways, edge proxies, and the enterprise service bus. Finally, I’ll describe where the service mesh is heading, and what to expect as this concept evolves alongside cloud native adoption.


A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware. (But there are variations to this idea, as we’ll see.)

The concept of the service mesh as a separate layer is tied to the rise of the cloud native application. In the cloud native model, a single application might consist of hundreds of services; each service might have thousands of instances; and each of those instances might be in a constantly-changing state as they are dynamically scheduled by an orchestrator like Kubernetes. Not only is service communication in this world incredibly complex, it’s a pervasive and fundamental part of runtime behavior. Managing it is vital to ensuring end-to-end performance and reliability.


The service mesh is a networking model that sits at a layer of abstraction above TCP/IP. It assumes that the underlying L3/L4 network is present and capable of delivering bytes from point to point. (It also assumes that this network, as with every other aspect of the environment, is unreliable; the service mesh must therefore also be capable of handling network failures.)

In some ways, the service mesh is analogous to TCP/IP. Just as the TCP stack abstracts the mechanics of reliably delivering bytes between network endpoints, the service mesh abstracts the mechanics of reliably delivering requests between services. Like TCP, the service mesh doesn’t care about the actual payload or how it’s encoded. The application has a high-level goal (“send something from A to B”), and the job of the service mesh, like that of TCP, is to accomplish this goal while handling any failures along the way.

Unlike TCP, the service mesh has a significant goal beyond “just make it work”: it provides a uniform, application-wide point for introducing visibility and control into the application runtime. The explicit goal of the service mesh is to move service communication out of the realm of the invisible, implied infrastructure, and into the role of a first-class member of the ecosystem—where it can be monitored, managed and controlled.


Reliably delivering requests in a cloud native application can be incredibly complex. A service mesh like Linkerd manages this complexity with a wide array of powerful techniques: circuit-breaking, latency-aware load balancing, eventually consistent (“advisory”) service discovery, retries, and deadlines. These features must all work in conjunction, and the interactions between these features and the complex environment in which they operate can be quite subtle.

For example, when a request is made to a service through Linkerd, a very simplified timeline of events is as follows:

  1. Linkerd applies dynamic routing rules to determine which service the requester intended. Should the request be routed to a service in production or in staging? To a service in a local datacenter or one in the cloud? To the most recent version of a service that’s being tested or to an older one that’s been vetted in production? All of these routing rules are dynamically configurable, and can be applied both globally and for arbitrary slices of traffic.
  2. Having found the correct destination, Linkerd retrieves the corresponding pool of instances from the relevant service discovery endpoint, of which there may be several. If this information diverges from what Linkerd has observed in practice, Linkerd makes a decision about which source of information to trust.
  3. Linkerd chooses the instance most likely to return a fast response based on a variety of factors, including its observed latency for recent requests.
  4. Linkerd attempts to send the request to the instance, recording the latency and response type of the result.
  5. If the instance is down, unresponsive, or fails to process the request, Linkerd retries the request on another instance (but only if it knows the request is idempotent).
  6. If an instance is consistently returning errors, Linkerd evicts it from the load balancing pool, to be periodically retried later (for example, an instance may be undergoing a transient failure).
  7. If the deadline for the request has elapsed, Linkerd proactively fails the request rather than adding load with further retries.
  8. Linkerd captures every aspect of the above behavior in the form of metrics and distributed tracing, which are emitted to a centralized metrics system.
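
Steps 3 through 8 of the timeline above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Linkerd's implementation: the class names `Instance` and `MeshSketch`, the failure threshold, and the smoothing factor are all hypothetical, routing and service discovery (steps 1 and 2) are omitted, and eviction is permanent here rather than periodically retried.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One service instance as seen by the (hypothetical) proxy."""
    name: str
    healthy: bool = True          # whether calls succeed in this simulation
    latency_ms: float = 10.0      # simulated response time
    ewma_ms: float = 0.0          # moving average of observed latency
    consecutive_failures: int = 0
    evicted: bool = False


class MeshSketch:
    FAILURE_LIMIT = 3   # assumed threshold before eviction
    ALPHA = 0.5         # assumed smoothing factor for the latency average

    def __init__(self, instances):
        self.instances = list(instances)
        self.metrics = []   # step 8: one record per attempt

    def pick(self, exclude):
        # Step 3: prefer the instance with the lowest observed latency.
        pool = [i for i in self.instances
                if not i.evicted and i.name not in exclude]
        return min(pool, key=lambda i: i.ewma_ms) if pool else None

    def attempt(self, inst):
        # Step 4: send the request and record latency and outcome.
        ok = inst.healthy
        inst.ewma_ms = (self.ALPHA * inst.latency_ms
                        + (1 - self.ALPHA) * inst.ewma_ms)
        if ok:
            inst.consecutive_failures = 0
        else:
            inst.consecutive_failures += 1
            # Step 6: evict an instance that keeps returning errors.
            if inst.consecutive_failures >= self.FAILURE_LIMIT:
                inst.evicted = True
        self.metrics.append((inst.name, ok, inst.latency_ms))
        return ok

    def send(self, idempotent, deadline_ms):
        elapsed = 0.0
        tried = set()
        while True:
            # Step 7: fail proactively once the deadline has elapsed.
            if elapsed >= deadline_ms:
                return "deadline exceeded"
            inst = self.pick(tried)
            if inst is None:
                return "no instances available"
            ok = self.attempt(inst)
            elapsed += inst.latency_ms
            if ok:
                return "ok via " + inst.name
            # Step 5: retry on another instance only if idempotent.
            if not idempotent:
                return "failed on " + inst.name
            tried.add(inst.name)
```

With one failing instance `a` and one healthy instance `b`, an idempotent request is retried and succeeds on `b`, a non-idempotent request fails without a retry, and after three consecutive failures `a` drops out of the pool.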

And that’s just the simplified version–Linkerd can also initiate and terminate TLS, perform protocol upgrades, dynamically shift traffic, and fail over between datacenters!

The Linkerd service mesh manages service-to-service communication and decouples it from application code.

It’s important to note that these features are intended to provide both pointwise resilience and application-wide resilience. Large-scale distributed systems, no matter how they’re architected, have one defining characteristic: they provide many opportunities for small, localized failures to escalate into system-wide catastrophic failures. The service mesh must be designed to safeguard against these escalations by shedding load and failing fast when the underlying systems approach their limits.
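
One simple way to picture "shedding load and failing fast" is a cap on in-flight requests. The sketch below is only an illustration of that idea: the `LoadShedder` class and its fixed `max_in_flight` limit are assumptions made for the example and do not describe how Linkerd decides when to shed load.

```python
class LoadShedder:
    """Rejects new work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight   # assumed capacity limit
        self.in_flight = 0
        self.rejected = 0

    def try_acquire(self):
        # Fail fast: refuse immediately instead of queueing
        # behind a backend that is already saturated.
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def release(self):
        # Called when an admitted request finishes.
        self.in_flight -= 1
```

A caller would wrap each request in `try_acquire()` and `release()`, returning an error to the client right away when `try_acquire()` is false, so that a small localized overload does not turn into a growing queue.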


The service mesh is ultimately not an introduction of new functionality, but rather a shift in where functionality is located. Web applications have always had to manage the complexity of service communication. The origins of the service mesh model can be traced in the evolution of these applications over the past decade and a half.

Consider the typical architecture of a medium-sized web application in the 2000’s: the three-tiered app. In this model, application logic, web serving logic, and storage logic are each a separate layer. The communication between layers, while complex, was limited in scope—there are only two hops, after all. There is no “mesh”, but there is communication logic between hops, handled within the code of each layer.

When this architectural approach was pushed to very high scale, it started to break. Companies like Google, Netflix, and Twitter, faced with massive traffic requirements, implemented what was effectively a predecessor of the cloud native approach: the application layer was split into many services (sometimes called “microservices”), and the tiers became a topology. In these systems, a generalized communication layer became suddenly relevant, but typically took the form of a “fat client” library, with Twitter’s Finagle, Netflix’s Hystrix, and Google’s Stubby being cases in point.

In many ways, libraries like Finagle, Stubby, and Hystrix were the first service meshes. While they were specific to the details of their surrounding environment, and required the use of specific languages and frameworks, they were forms of dedicated infrastructure for managing service-to-service communication, and (in the case of the open source Finagle and Hystrix libraries) found use outside of their origin companies.

Fast forward to the modern cloud native application. The cloud native model combines the microservices approach of many small services with two additional factors: containers (e.g. Docker), which provide resource isolation and dependency management, and an orchestration layer (e.g. Kubernetes), which abstracts away the underlying hardware into a homogeneous pool.

These three components provide applications with natural mechanisms for scaling under load and for handling the ever-present partial failures of the cloud environment. But with hundreds or thousands of services, and an orchestration layer that’s rescheduling instances from moment to moment, the path that a single request follows through the service topology can be incredibly complex. And since containers make it easy for each service to be written in a different language, the library approach is no longer feasible.

This combination of complexity and criticality motivates the need for a dedicated layer for service-to-service communication decoupled from application code and able to capture the highly dynamic nature of the underlying environment. This layer is the service mesh.


While service mesh adoption in the cloud native ecosystem is growing rapidly, there is an extensive and exciting roadmap ahead still to be explored. The requirements for serverless computing (e.g. Amazon’s Lambda) fit directly into the service mesh’s model of naming and linking, and form a natural extension of its role in the cloud native ecosystem. The roles of service identity and access policy are still very nascent in cloud native environments, and the service mesh is well poised to play a fundamental part in the story here. Finally, the service mesh, like TCP/IP before it, will continue to be pushed further into the underlying infrastructure. Just as Linkerd evolved from systems like Finagle, the current incarnation of the service mesh as a separate, user-space proxy that must be explicitly added to a cloud native stack will also continue to evolve.


The service mesh is a critical component of the cloud native stack. A little more than one year from its launch, Linkerd is part of the Cloud Native Computing Foundation and has a thriving community of contributors and users. Adopters range from startups like Monzo, which is disrupting the UK banking industry, to high scale Internet companies like Paypal, Ticketmaster, and Credit Karma, to companies that have been in business for hundreds of years like Houghton Mifflin Harcourt.

The Linkerd open source community of adopters and contributors is demonstrating the value of the service mesh model every day. We’re committed to building an amazing product and continuing to grow our incredible community. Join us!


Release Update: Linkerd 1.0 and Service Mesh Explained


Announcing Linkerd 1.0

Originally published by Oliver Gould on Buoyant.io.

Today, we’re thrilled to announce Linkerd version 1.0. A little more than one year from our initial launch, Linkerd is part of the Cloud Native Computing Foundation and has a thriving community of contributors and users. Adopters range from startups like Monzo, which is disrupting the UK banking industry, to high scale Internet companies like Paypal, Ticketmaster, and Credit Karma, to companies that have been in business for hundreds of years like Houghton Mifflin Harcourt.

A 1.0 release is a meaningful milestone for any open source project. In our case, it’s a recognition that we’ve hit a stable set of features that our users depend on to handle their most critical production traffic. It also signals a commitment to limiting breaking configuration changes moving forward.

It’s humbling that our little project has amassed such an amazing group of operators and developers. I’m continually stunned by the features and integrations coming out of the Linkerd community; and there’s simply nothing more satisfying than hearing how Linkerd is helping teams do their jobs with a little less fear and uncertainty.


Linkerd is a service mesh for cloud native applications. As part of this release, we wanted to define what this actually meant. My cofounder William Morgan has a writeup in another post we released today, What’s a service mesh? And why do I need one?


Beyond stability and performance improvements, Linkerd 1.0 has a couple new features worth talking about.

This release includes a substantial change to the way that routers are configured in Linkerd. New plugin interfaces have been introduced to allow for much finer-grained policy control.

To read more on per-service configurations, per-client configurations, identifier kinds, response classifier kinds, client and service parameters, timeouts, TLS, metrics, trace annotations and HTTP headers, check out Oliver’s original post at: https://blog.buoyant.io/2017/04/25/announcing-linkerd-1.0/.

Linkerd Videos

For educational, technical and case study presentations about Linkerd, check out:

CloudNativeCon + KubeCon North America

Participate in technical sessions, hear case studies and learn more about service mesh for cloud native applications by attending CloudNativeCon + KubeCon North America 2017, December 6-8 at the Austin Convention Center. Speaking submissions close August 21st. Submit here.

Release Update: Prometheus 1.6.1 and Sneak Peek at 2.0


Prometheus 1.6.1

After 1.5.0 earlier in the year, Prometheus 1.6.1 is now out. There’s a plethora of changes, so let’s dive in.

The biggest change is to how memory is managed. The -storage.local.memory-chunks and -storage.local.max-chunks-to-persist flags have been replaced by -storage.local.target-heap-size. Prometheus will attempt to keep the heap at the given size in bytes. For various technical reasons, actual memory usage will be higher so leave a buffer on top of this. Setting this flag to 2/3 of how much RAM you’d like to use should be safe.
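
As a worked example of the 2/3 rule of thumb, the following sketch turns a RAM budget into a flag value. The helper names `target_heap_size_bytes` and `heap_flag` are hypothetical and exist only for this illustration; the flag name itself is the one given above.

```python
def target_heap_size_bytes(ram_budget_bytes):
    """Return two thirds of the RAM budget, rounded down to whole bytes."""
    return ram_budget_bytes * 2 // 3


def heap_flag(ram_budget_bytes):
    # Render the flag named in the text with the computed value.
    return "-storage.local.target-heap-size=%d" % target_heap_size_bytes(
        ram_budget_bytes)


# A 6 GiB budget leaves 2 GiB of headroom on top of a 4 GiB heap target.
print(heap_flag(6 * 1024 ** 3))
# prints -storage.local.target-heap-size=4294967296
```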

The GOGC environment variable has been defaulted to 40, rather than its default of 100. This will reduce memory usage, at the cost of some additional CPU.
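
The memory and CPU trade-off follows from how the Go runtime interprets GOGC: roughly, a new collection starts once the heap has grown by that percentage over the live data left after the previous collection. The sketch below is simplified arithmetic under that assumption, not a model of the real collector, and `next_gc_heap_mb` is a hypothetical helper.

```python
def next_gc_heap_mb(live_heap_mb, gogc):
    """Approximate heap size at which the next collection is triggered."""
    return live_heap_mb * (100 + gogc) / 100


# With 1000 MB of live data:
print(next_gc_heap_mb(1000, 100))  # prints 2000.0 (Go's default)
print(next_gc_heap_mb(1000, 40))   # prints 1400.0 (Prometheus 1.6 default)
```

A lower value means the heap is allowed to grow less before each collection, so collections run more often: less memory, more CPU.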

A feature of major note is that experimental remote read support has been added, allowing the read back of data from long term storage and other systems. The previous built-in experimental support for writing to Graphite/OpenTSDB/InfluxDB has been removed in favour of the experimental remote write interface. These are now available via the example remote storage adapter, which can also read from InfluxDB via remote read.

In terms of general features and improvements, there are a few highlights:

  • Joyent Triton discovery has been added.
  • Promtool has a linter for /metrics pages.
  • There are new storage, alerting and evaluation-related metrics.
  • Checkpoint and timeseries maintenance impact has been reduced.
  • There have been numerous UI improvements.

In terms of bug fixes, federation now exposes an empty instance label if one is not set, so if you are using honor_labels you’ll no longer pick up the instance label of the Prometheus itself.

For a full list of changes see the release notes.

*Above blog originally published by Brian Brazil on RobustPerception.io.

Prometheus 2.0

In July 2016 Prometheus reached a big milestone with its 1.0 release. Since then, plenty of new features like new service discovery integrations and our experimental remote APIs have been added. We also realized that new developments in the infrastructure space, in particular Kubernetes, allowed monitored environments to become significantly more dynamic. Unsurprisingly, this also brings new challenges to Prometheus and we identified performance bottlenecks in its storage layer.

Over the past few months we have been designing and implementing a new storage concept that addresses those bottlenecks and shows considerable performance improvements overall. It also paves the way to add features such as hot backups.

The changes are so fundamental that merging them will trigger a new major release: Prometheus 2.0.

Important features and changes beyond storage are planned before its stable release. However, today we are releasing an early alpha of Prometheus 2.0 to kick off the stabilization process of the new storage.

Release tarballs and Docker containers are now available. If you are interested in the new mechanics of the storage, make sure to read the deep-dive blog post looking under the hood.

This version does not work with old storage data and should not replace existing production deployments. To run it, the data directory must be empty and all existing storage flags except for -storage.local.retention have to be removed.

For example, before:

./prometheus -storage.local.retention=200h -storage.local.memory-chunks=1000000 -storage.local.max-chunks-to-persist=500000 -storage.local.chunk-encoding=2 -config.file=/etc/prometheus.yaml

after:

./prometheus -storage.local.retention=200h -config.file=/etc/prometheus.yaml
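
If the command line lives in a script or a service definition, the same cleanup can be automated. The following is a minimal sketch, assuming every flag is written in the `-name=value` form used in the example; `strip_old_storage_flags` is a hypothetical helper, not part of Prometheus or promtool.

```python
KEPT_STORAGE_FLAG = "-storage.local.retention"


def strip_old_storage_flags(args):
    """Drop every -storage.local.* argument except the retention flag."""
    kept = []
    for arg in args:
        name = arg.split("=", 1)[0]
        if name.startswith("-storage.local.") and name != KEPT_STORAGE_FLAG:
            continue
        kept.append(arg)
    return kept


before = ("./prometheus -storage.local.retention=200h "
          "-storage.local.memory-chunks=1000000 "
          "-storage.local.max-chunks-to-persist=500000 "
          "-storage.local.chunk-encoding=2 "
          "-config.file=/etc/prometheus.yaml").split()
print(" ".join(strip_old_storage_flags(before)))
# prints ./prometheus -storage.local.retention=200h -config.file=/etc/prometheus.yaml
```

The data directory still has to be emptied separately, since this alpha cannot read the old storage format.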

This is a very early version and crashes, data corruption, and bugs in general should be expected. Help us move towards a stable release by submitting them to our issue tracker.

The experimental remote storage APIs are disabled in this alpha release. Scraping targets exposing timestamps, such as federated Prometheus servers, does not yet work. The storage format is breaking and will break again between subsequent alpha releases. We plan to document an upgrade path from 1.0 to 2.0 once we are approaching a stable release.

*Above blog originally published by Fabian Reinartz on Prometheus.io.

Prometheus Roadmap

Prometheus has had significant community and project growth in the last 2 years and the core team is always looking toward the future. Here are three high-level roadmap goals for Prometheus:

  1. Implement and evaluate the read path of the generic remote storage interface. In combination with the already existing generic write path, this will allow anyone to build their own remote storage to use behind Prometheus, including querying data back via PromQL through Prometheus.
  2. Improve time series indexing such that Prometheus will be able to handle larger numbers of time series over a longer amount of time more efficiently.
  3. Aim to make Prometheus’s metrics exchange format an IETF standard. There is early work going on around this, but no clear outcome yet.

Prometheus Videos

For technical and case study presentations about Prometheus, check out the Prometheus playlist on the CNCF YouTube channel.

PromCon 2017

Participate in technical sessions on the monitoring tool, hear case studies and learn how Prometheus integrates with Kubernetes and other open source technologies by attending PromCon 2017, August 17-18 at Google Munich. Speaking submissions close May 31st. Submit here.
