Case Study

China Mobile

China Mobile Communications Group LTD uses ChaosBlade to dramatically improve troubleshooting and recovery

To support its digital transformation, China Mobile has adopted cloud-native architectures to meet the complex demands of its continuously growing businesses and to respond to changing market conditions. However, cloud-native adoption also brings significant challenges. As system complexity and operations and maintenance (O&M) pressure keep increasing, the stability and security of business systems in cross-region, multilayer environments have become critical to business success.

To meet these challenges, China Mobile began exploring the emerging field of chaos engineering and introduced the open source chaos engineering tool ChaosBlade into its business systems. Building on ChaosBlade, the architecture of China Mobile’s Panji PaaS, and years of operational experience, China Mobile designed the CMChaos platform to support chaos engineering across a variety of business scenarios.

This platform allows enterprises to perform drills or run chaos engineering experiments for hosts, applications, networks, storage, security, information technology application, and middleware. The platform uses a microservices architecture and can inject faults into hosts, applications, networks, and storage systems to verify the resilience and reliability of business systems. With the CMChaos platform, enterprises can identify potential problems in a controlled environment to ensure business continuity and enhance system resilience.

CMChaos platform

Performance statistics: 

● Fault discovery: The percentage of faults discovered within 1 minute increased from 40% to 70%.

● Troubleshooting: The percentage of faults troubleshot within 5 minutes increased from 51% to 65%.

Boosting resilience

How does the CMChaos platform improve the resilience of China Mobile’s system? As one of the world’s largest mobile network operators by total number of subscribers, China Mobile places stability above all else. To guarantee the high availability and resilience of its large-scale, complex network systems, China Mobile has used the CMChaos platform to empower its internal businesses and has achieved the following results.

● Improved system stability: By periodically performing chaos engineering drills, China Mobile succeeded in identifying large numbers of potential problems in their business systems and reducing the number of critical incidents to zero. The annual number of incidents has been reduced by 32.69%. This greatly improves the stability and reliability of the internal system. 

● Reduced scope of impact: The platform helps China Mobile quickly locate the cause of faults, which shortens fault recovery time. The annual duration of faults has been reduced by 43.22%. Shorter anomalies in turn limit the scope of impact.

● Enhanced emergency response capabilities: By simulating faults and conducting attack-and-defense drills among departments, China Mobile has improved the efficiency of the O&M team in responding to emergencies and handling faults.

Technical highlights

As a chaos engineering tool, ChaosBlade serves as the technical foundation of the CMChaos platform. What technical benefits and core capabilities does ChaosBlade provide for the CMChaos platform?

1. Technical benefits 

● Various scenarios and wide scope: ChaosBlade allows enterprises to run drills on infrastructure, applications, containers, and cloud-native resources to simulate issues such as network latency, service crashes, and high CPU load.

● Flexible management and precise injection: ChaosBlade supports fine-grained fault injection control. Enterprises can precisely control the trigger time, duration, and scope of impact of faults to avoid causing severe consequences in the production environment. 

● Out-of-the-box and easy to extend: ChaosBlade provides an easy-to-use CLI and API to help enterprises quickly get started. In addition, ChaosBlade is easy to extend. Enterprises can customize fault scenarios and use custom plug-ins on demand. 
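
As an illustration of fine-grained injection control through the CLI, the following minimal Python sketch assembles a ChaosBlade command for a network delay experiment with an explicit scope (network interface), magnitude (delay in milliseconds), and duration (timeout in seconds). It only builds and prints the command and does not run it. The helper name `build_network_delay_command` is hypothetical, and the sketch assumes that the `--interface`, `--time`, and `--timeout` flags behave as described in the ChaosBlade CLI documentation for the installed version; check `blade create network delay --help` before use.

```python
import shlex


def build_network_delay_command(interface, delay_ms, timeout_s):
    """Build (but do not run) a ChaosBlade CLI call for a network delay fault.

    Hypothetical helper for illustration; flag names are assumed to match
    the installed ChaosBlade version.
    """
    if delay_ms <= 0 or timeout_s <= 0:
        raise ValueError("delay_ms and timeout_s must be positive")
    return [
        "blade", "create", "network", "delay",
        "--interface", interface,     # scope: only this network interface
        "--time", str(delay_ms),      # magnitude: added latency in milliseconds
        "--timeout", str(timeout_s),  # duration: experiment ends after this many seconds
    ]


cmd = build_network_delay_command("eth0", 3000, 60)
print(shlex.join(cmd))
```

Keeping the command as an argument list rather than a single string makes it straightforward to pass to a process launcher later without shell quoting problems.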

2. Core capabilities 

● Fault drills: The CMChaos platform provides visualized fault drills based on ChaosBlade, allowing enterprises to easily create, manage, and run fault drills and monitor drills in real time.

● Scenario orchestration: The CMChaos platform allows enterprises to orchestrate fault scenarios to simulate complex faults, such as network disconnection and network congestion, to verify the stability and resilience of their systems.

● Monitoring and alerting: The CMChaos platform integrates with a monitoring system to collect system metrics and business metrics in real time, and generates alerts during drills to help enterprises identify and troubleshoot issues.

● Drill reports: The CMChaos platform can generate detailed drill reports to help enterprises assess the resilience of their systems based on drill results, changes in system metrics, and impact analysis, and further optimize the business systems.
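
To make the idea of scenario orchestration concrete, the sketch below shows one way a drill made of several timed faults could be turned into an ordered timeline of inject and recover events. It is not how CMChaos is implemented; the `FaultStep` type, the `schedule` function, and the step names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultStep:
    """One fault in a drill (hypothetical structure, not a CMChaos type)."""
    name: str        # label for the fault, e.g. "network-congestion"
    start_s: int     # offset from drill start, in seconds
    duration_s: int  # how long the fault stays injected, in seconds


def schedule(steps):
    """Return (time_s, action, name) events in execution order.

    When a recovery and an injection share a timestamp, the recovery
    is ordered first so that faults do not overlap unintentionally.
    """
    priority = {"recover": 0, "inject": 1}
    events = []
    for step in steps:
        events.append((step.start_s, "inject", step.name))
        events.append((step.start_s + step.duration_s, "recover", step.name))
    return sorted(events, key=lambda e: (e[0], priority[e[1]], e[2]))


drill = [
    FaultStep("network-congestion", start_s=0, duration_s=60),
    FaultStep("network-disconnection", start_s=60, duration_s=30),
]
for time_s, action, name in schedule(drill):
    print(f"t={time_s:>3}s  {action:<7}  {name}")
```

Separating the plan (a list of steps) from its execution order makes it possible to review or visualize a drill before any fault is actually injected.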

Team photo

Future plans

What are the future plans and goals of China Mobile for the open source ChaosBlade project? 

China Mobile will continue to invest in the open source ChaosBlade project and work closely with partners to co-develop the mobile cloud-native ecosystem. Going forward, China Mobile will focus on the following areas to strengthen the ecosystem and provide chaos engineering solutions for enterprises.

1. Multidimensional performance insights and visualization 

● Enhanced visualized analysis: Develop visualized analysis tools, such as topology maps and flame graphs, to help enterprises identify performance bottlenecks and locate root causes.

● Intelligent analysis engine: Use advanced technologies, such as machine learning, to analyze performance statistics and provide intelligent fault prediction and root cause analysis (RCA) services to prevent and resolve issues.

2. Support for multiple platforms and high compatibility 

● Support for multiple operating systems and platforms: Continuously optimize ChaosBlade to support more operating systems (such as Kylin and UOS) and platforms (such as ARM and MIPS). 

● Enhanced integration with the cloud-native ecosystem: Facilitate the integration of ChaosBlade with mainstream cloud-native tools, such as Kubernetes and Prometheus, to provide an optimized experience. 

3. Open marketplace for fault handling capabilities 

● Open platform for fault handling capabilities: Build an open marketplace to provide fault handling capabilities and encourage developers to share their custom fault scenarios and plug-ins in order to enrich the ecosystem of ChaosBlade. 

● Fault handling capability monetization: Explore a commercial model to monetize fault handling capabilities, such as fault scenarios or professional consulting services, to promote the development of the ChaosBlade ecosystem.

With sustained effort from the community, China Mobile expects ChaosBlade to be widely adopted in the chaos engineering field and to enhance the stability and resilience of cloud-native applications.

Published: June 9, 2025

By the numbers

● 32.69%: Reduction in annual incidents

● 43.22%: Decrease in fault duration

● 70%: 1-minute fault detection rate