Top 11 things you didn’t know about cloud native statefulness

Posted on September 12, 2022 by W. Watson and Denver Williams

Community post by W. Watson and Denver Williams from the Cloud Native Network Function (CNF) Test Suite

Scrabble showing database word — “Database” by christophe.benoit74 is licensed under CC BY 2.0.

1. You need more than ACID

An RDBMS’s atomicity, consistency, isolation, and durability are not what they seem to be. Specifically, consistency and isolation have undergone a facelift in recent years. Consistency (in the colloquial usage, see #2) is now tailored to the datasets and workloads of your specific use case, especially in the case of data-intensive systems.

2. Consistency is not consistent

In the CAP theorem, Consistency means linearizability (the whole system acts as if there is one, and only one, copy of the data). In ACID, consistency means the system’s data must always be in a ‘good state’. That is to say the system has the ability to enforce correctness about its data through code. This is, counterintuitively, a property of the application and not really the database (although the database can offer some enforcement guarantees such as foreign key or uniqueness constraints). Applications have the power to put correct, or incorrect, data into the database. The application relies on the database’s ability to provide atomicity and isolation to do this. This is completely different from linearizability or even serializability, which is the gold standard for isolation.
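To make the split between database and application concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer function, and the "no negative balance" rule are all hypothetical and invented for illustration. The database enforces the primary key; the application enforces the balance rule and leans on the transaction's atomicity to undo a bad transfer.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction.

    The 'no negative balance' rule lives in application code; the database
    only supplies atomicity so a failed check rolls both updates back.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is a guarantee the database itself can enforce.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

transfer(conn, "alice", "bob", 30)       # valid: succeeds and commits
try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
total = sum(balances.values())
print(balances, total)
```

If the balance check were removed from transfer, the database would happily store the overdrawn balance. That is the sense in which ACID consistency belongs to the application.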

3. The speed of your isolation model depends on your data model

Isolation is the property of a database that keeps concurrent transactions from running over each other. Serializability is when the isolation is so strong that each transaction acts as if it is the only transaction running on the database. There are many types of isolation, all of which fall into the weak and strong isolation categories. For the most part there isn’t any free lunch. The stronger the isolation (which is the property most people are actually thinking about when they think of ACID) the slower your transactions will be. If your data model allows for a weaker isolation setting then you can take advantage of the latency reductions that come with weak isolation.
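As a rough illustration of what weak isolation can cost you in correctness, the sketch below replays two schedules of the same pair of read-then-increment transactions against a shared counter. It is a toy model written for this post, not the behavior of any particular database; the run function and the schedule format are assumptions.

```python
def run(schedule):
    """Replay a list of (transaction, operation) steps on one counter.

    'read' copies the counter into the transaction's private workspace;
    'write' stores that private value plus one back into the counter.
    """
    store = {"counter": 0}
    private = {}
    for txn, op in schedule:
        if op == "read":
            private[txn] = store["counter"]
        elif op == "write":
            store["counter"] = private[txn] + 1
    return store["counter"]

# Both transactions read before either writes: nothing stops the overlap.
interleaved = [("T1", "read"), ("T2", "read"), ("T1", "write"), ("T2", "write")]

# One transaction runs to completion before the other starts.
serial = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]

lost_update_result = run(interleaved)
serial_result = run(serial)
print(lost_update_result, serial_result)
```

The interleaved schedule loses one increment, while the serial one keeps both. If your data model never does read-modify-write on shared values (for example, append-only events), the weaker and faster setting may be all you need.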

4. You need some level of availability

For cloud native systems, users are at the point where they do not accept planned maintenance windows. Regardless of how many (or few) ‘9s’ you have, planned maintenance windows are a thing of the past. This means you will need to deploy applications using something like a phoenix deployment strategy with rolling updates on multiple nodes in order to avoid downtime.

5. Partition tolerance is not negotiable

A common misconception about the CAP theorem is that you can use a combination that doesn’t include the ‘P’ (e.g. CA). This isn’t the case. Essentially, for any non-trivial application, you’ll need some form of partition tolerance.

6. Some distributed data algorithms are coarse grained

You may have heard that the Paxos and Raft algorithms are the solution for highly available and consistent data. This is true in a sense but this data is coarse grained and tailored to the problem of configuration data and leader election. You’ll need another solution for the correctness of your other, more complex, workload data.

7. You can’t ignore the split brain problem

Many believe that somehow their data will just correct itself even when two or more nodes think they are the leader. If you aren’t specifically choosing some kind of conflict resolution such as last write wins, you are essentially begging for corrupted data and unpredictable results.
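For reference, last write wins can be sketched in a few lines. The Write record, its fields, and the lww_merge function are hypothetical names used only for this example, and the sketch assumes each writer stamps its own wall-clock time, which real clocks do not keep perfectly in sync.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float  # wall-clock time assigned by the writer (assumed)
    node_id: str      # tie-breaker when two timestamps are equal

def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the highest (timestamp, node_id) pair."""
    return max(a, b, key=lambda w: (w.timestamp, w.node_id))

# Two nodes that both believed they were the leader accepted different writes.
from_node_a = Write("ship to Austin", 1000.0, "node-a")
from_node_b = Write("ship to Detroit", 1000.5, "node-b")

winner = lww_merge(from_node_a, from_node_b)
same_winner = lww_merge(from_node_b, from_node_a)  # merge order is irrelevant
print(winner.value)
```

Note what the rule costs: the losing write is silently discarded. That is a deliberate, predictable choice, which is still better than leaving the outcome to chance.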

8. You can’t ignore data structures when reviewing functional requirements

When determining functional requirements for a system, it’s imperative to review data requirements in tandem and ask specific questions about the handling of any non-trivial persistent data. Tradeoffs made between data correctness and latency will influence, and sometimes limit, the technologies that can be used to achieve system requirements.

9. Data structures are difficult to track and match to a model

Some data models, such as ones that handle scheduling, aren’t categorized easily under their required isolation model. This can be problematic when dealing with an ever-changing code base.

10. Consensus won’t get you serializability

Just because you have a way to elect leaders (e.g. using Paxos) doesn’t mean that your data is correct. You still need to solve the problem of cloud native availability (which requires replication) and some kind of isolation for the correctness. Leaderless and multi-leader replication models do not lend themselves to strong isolation models. Beware.

11. Multiput doesn’t mean serializability

Many databases have a multi-put option that will gladly insert or update your data in a non-serializable way. Just because your data is updated atomically doesn’t mean it is serializable. Check your database vendor.
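The gap between atomic and serializable can be shown with a small in-memory sketch. The Store class, its multi_put method, and the on-call rule ("at least one person stays on call") are hypothetical and do not mirror any vendor's API. Each multi_put applies all of its keys or none, yet two clients deciding from stale snapshots still break the rule.

```python
import threading

class Store:
    """Toy key-value store whose multi_put is atomic but not serializable."""

    def __init__(self, data):
        self._data = dict(data)
        self._lock = threading.Lock()

    def snapshot(self):
        with self._lock:
            return dict(self._data)

    def multi_put(self, updates):
        # All keys in `updates` become visible together, or not at all.
        with self._lock:
            self._data.update(updates)

def go_off_call(snapshot, person):
    """Return the batch of writes to apply, based on what this client saw."""
    on_call = [p for p, state in snapshot.items() if state == "on_call"]
    if len(on_call) >= 2:  # rule: someone else must remain on call
        return {person: "off_call", f"log:{person}": "left shift"}
    return {}

store = Store({"alice": "on_call", "bob": "on_call"})

# Both clients read before either writes, so both see two people on call.
seen_by_alice = store.snapshot()
seen_by_bob = store.snapshot()
store.multi_put(go_off_call(seen_by_alice, "alice"))
store.multi_put(go_off_call(seen_by_bob, "bob"))

final = store.snapshot()
still_on_call = [p for p in ("alice", "bob") if final[p] == "on_call"]
print(still_on_call)
```

Nobody is left on call, even though every individual write batch was atomic. A serializable system would have forced the second client to decide against the state left by the first, and its request would have been refused.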

Conclusion

New players in the cloud native statefulness arena have unique challenges, especially if they have unique non-functional requirements. For example, the telco space has low latency requirements which would affect what isolation models (see #3) and data structures (see #9) are used to obtain their objectives. Unfortunately, these decisions are often overlooked in the design phase and even harder to track during code updates. In addition to being conscious of these challenges during the design phase, one solution is to implement a check for best practices in your CD pipeline (see https://github.com/cncf/cnf-testsuite/), as a warning for things that have gone awry. Another is to check for stateful best practices through certifications. For more information on cloud native certifications in the telco space, see https://www.cncf.io/certification/cnf/. For deeper discussion of these ideas and their application to cloud native network functions, see: https://vmblog.com/archive/2022/05/16/stateful-cnfs.aspx and/or register to attend Cloud Native Telco Day North America, a co-located event at KubeCon+CloudNativeCon in Detroit, MI on Monday, October 24, 2022.

About the authors

Denver Williams, Dip.CompSc, extensive international consulting experience in high level cloud computing roles, developing certification frameworks, Principal and Senior Engineering roles in multi-year projects. Technical skills in Kubernetes, Cloud Native, Terraform, Docker, Bash, Ruby, Erlang/Elixir, Azure, Google Cloud, AWS, Linux.

W. Watson has been professionally developing software for 30 years. He has spent numerous years studying game theory and other business expertise in pursuit of the perfect organizational structure for software co-operatives. He also founded the Austin Software Cooperatives meetup group (https://www.meetup.com/Austin-Software-Co-operatives/) and Vulk Coop (https://www.vulk.coop) as an alternative way to work on software as a group. He has a diverse background that includes service in the Marine Corps as a computer programmer, and software development in numerous industries including defense, medical, education, and insurance. He has spent the last couple of years developing complementary cloud native systems such as the cncf.ci dashboard. He currently works on the Cloud Native Network Function (CNF) Certification Program (https://www.cncf.io/certification/cnf/) and the Cloud Native Network Function (CNF) Test Suite (https://github.com/cncf/cnf-testsuite).

Special thanks to Drew Bentley!
