Measuring the popularity of Kubernetes using BigQuery

Posted on February 27, 2017 by Dan Kohn

By Dan Kohn, CNCF Executive Director, @dankohn1

As the executive director of CNCF, I’m proud to host Kubernetes, which is one of the highest-velocity projects in the history of open source. I know this because I can do a web search and see… quite a few people quoted as saying so. But does the data support this claim?

This blog post works through the process of investigating that question. CNCF licenses a dashboard from Bitergia, but it’s more useful for tracking project trends over time than for comparing against other open source projects. Project velocity matters because developers, enterprises and startups are more interested in working with a technology that others are adopting, so that they can leverage the investments of their peers. So, how does Kubernetes compare to the other 53 million GitHub repos?

By way of excellent blog posts from Felipe Hoffa and Jess Frazelle (the latter a Kubernetes contributor and speaker at our upcoming CloudNativeCon/KubeCon Berlin), I got started on using BigQuery to analyze the public GitHub data set. You can re-run any of the gists below by creating a free BigQuery account. All of the data below is for 2016, though you can easily run against different time periods.

My first attempt found that the project with the highest commit rate on GitHub is… KenanSulayman/heartbeat, a repo with 9 stars which appears to be an hourly update from a Tor exit node. Well, that’s kind of a cool use of GitHub, but not really what I’m looking for. I learned from Krihelinator (a thoughtful though arbitrary new metric that currently ranks Kubernetes #4, right in front of Linux) that some people use GitHub as a backup service. So, rerunning with a filter of more than 10 contributors puts Kubernetes at #29 based on its 8,703 commits. For reference, that’s almost exactly one commit an hour, around the clock, for the entire year.
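To make the contributor filter concrete, here is a minimal Python sketch of the idea. It is not the BigQuery query behind the numbers above: the repo names, the record shape and the `rank_by_commits` helper are all hypothetical, chosen only to show why a single-user backup repo drops out once a minimum number of contributors is required. The last lines check the "one commit an hour" figure against the number of hours in 2016, a leap year.

```python
from collections import defaultdict

# Hypothetical push records: (repo, actor, commits_in_push).
# This is an illustrative shape, not the real GitHub event schema.
events = (
    [("example/backup-mirror", "solo", 24)] * 100
    + [("example/team-project", f"dev{i}", 3) for i in range(12)]
    + [("example/small-team", f"dev{i}", 50) for i in range(4)]
)


def rank_by_commits(events, min_contributors=10):
    """Rank repos by total commits, keeping only repos with MORE than
    min_contributors distinct actors."""
    commits = defaultdict(int)
    contributors = defaultdict(set)
    for repo, actor, count in events:
        commits[repo] += count
        contributors[repo].add(actor)
    eligible = [
        (repo, total)
        for repo, total in commits.items()
        if len(contributors[repo]) > min_contributors
    ]
    return sorted(eligible, key=lambda item: item[1], reverse=True)


# Without the filter, the one-person backup repo tops the list.
print(rank_by_commits(events, min_contributors=0))
# With the filter, only the repo with more than 10 contributors remains.
print(rank_by_commits(events))

# Sanity check on the hourly rate: 2016 had 366 days.
commits_per_hour = 8703 / (366 * 24)
print(round(commits_per_hour, 2))
```

The filter is a blunt instrument: it also removes small but genuine projects, which is the trade-off accepted here to keep backup-style repos out of the ranking.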

That metric also leaves out torvalds/linux, because the kernel’s git tree is mirrored to GitHub, but that mirroring does not generate the GitHub events that are stored in that data set. Instead, there is a separate BigQuery data set that just measures commits. When I run a query to show the projects with the most commits, I unhelpfully get dozens of forks of Linux and also many forks of a git learning tool. Here is a better query that manually checks for committers, authors, and commits of 8 popular projects, and shows Kubernetes as #2, with about one-fifth the authors and commits of Linux.¹

To see how many unique committers Kubernetes had in 2016, I used this query, which showed that there were… 59, because Kubernetes uses a GitHub robot to do the vast majority of the actual commits. The correct query requires looking inside the commits at the actual authors, and when ranked by unique authors, Kubernetes comes in at #10 with 868.
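The gap between committers and authors is easy to reproduce on a toy example. The sketch below uses made-up commit records (the names and the `unique_people` helper are hypothetical, and this is plain Python rather than the BigQuery query used for the real numbers). Git records both who wrote a change (the author) and who applied it to the tree (the committer), so when a bot merges nearly everything, counting committers badly undercounts the people involved.

```python
# Hypothetical commit records; every name here is invented for illustration.
commits = [
    {"author": "alice", "committer": "merge-bot"},
    {"author": "bob", "committer": "merge-bot"},
    {"author": "carol", "committer": "merge-bot"},
    {"author": "alice", "committer": "merge-bot"},
    {"author": "dave", "committer": "dave"},
]


def unique_people(commits, field):
    """Return the set of distinct values of `field` across all commits."""
    return {commit[field] for commit in commits}


unique_committers = unique_people(commits, "committer")
unique_authors = unique_people(commits, "author")

# The bot collapses four of the five commits into a single committer.
print(len(unique_committers), "committers:", sorted(unique_committers))
print(len(unique_authors), "authors:", sorted(unique_authors))
```

Here five commits yield two committers but four authors, the same effect (at a much smaller scale) as the 59 versus 868 result.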

Updating Hoffa’s query about issues opened to include data for all of 2016 (while still ignoring robot comments), Kubernetes remains #1 with 42,703, with comments from 3,077 different developers. Frazelle’s analysis of pull requests (updated for all of 2016 and to require more than 10 contributors to avoid backup projects) now shows Kubernetes at #2 with 10,909, just behind a Java intranet portal. (Rather than GitHub issues and pull requests, Linux uses its own email-based workflow described in a talk last year by stable kernel maintainer Greg Kroah-Hartman, so it doesn’t show up in these comparisons.)

Kubernetes 2016 Rankings

Measure	Ranking
Krihelinator	4
Commits	29
Authors	10
Issue Comments	1
Pull Requests	2

In conclusion, I’m not sure that any of these metrics represents the definitive one. You can pick your preferred statistic, such as that Kubernetes is in the top 0.00006% of the projects on GitHub. I prefer to just think of it as one of the fastest moving projects in the history of open source.
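For a rough sense of where a figure like that comes from, a rank can be turned into a "top X%" number by dividing it by the total number of repos. The sketch below assumes the commit ranking (#29) and the 53 million repo count mentioned earlier; the post does not say which ranking the 0.00006% is based on, so treat the inputs as an assumption.

```python
# Assumption: use the commit ranking (#29) out of roughly 53 million repos.
rank = 29
total_repos = 53_000_000

# Share of all repos ranked at or above this position, as a percentage.
top_percent = rank / total_repos * 100

print(f"top {top_percent:.6f}% of GitHub repos")
```

Under those assumptions the result lands between 0.00005% and 0.00006%, in the same neighborhood as the quoted figure; a different ranking from the table would move it accordingly.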

What are your preferred metrics? Please let me know at @dankohn1 or in the Hacker News comments, and I’m happy to provide t-shirts in exchange for cool visualizations.

¹OpenHub incorrectly showed more than 3x as many authors and 5x the commits for Linux in 2016 as the BigQuery data set. I confirmed this is an error with Linux stable kernel maintainer Greg Kroah-Hartman (who checked the actual git results) and reported it to OpenHub. They’ve since fixed the bug.
