USENIX ATC '24 and OSDI '24
Friday July 12, 2024 9:00am - 9:25am PDT
Zhe Wang, Shanghai Jiao Tong University; Huanwu Hu, Alibaba Cloud; Linghe Kong, Shanghai Jiao Tong University; Xinlei Kang and Teng Ma, Alibaba Cloud; Qiao Xiang, Xiamen University; Jingxuan Li and Yang Lu, Alibaba Cloud; Zhuo Song, Shanghai Jiao Tong University and Alibaba Cloud; Peihao Yang, Alibaba Cloud; Jiejian Wu, Shanghai Jiao Tong University; Yong Yang and Tao Ma, Alibaba Cloud; Zheng Liu, Alibaba Cloud and Zhejiang University; Xianlong Zeng and Dennis Cai, Alibaba Cloud; Guihai Chen, Shanghai Jiao Tong University

Timely detection and diagnosis of application-network anomalies is a key challenge of operating large-scale production clouds. We reveal three practical issues in the cloud-native era. First, impact assessment of anomalies at the (micro)service level is absent in currently deployed monitoring systems. Ping systems are oblivious to the "actual weights" of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can easily be overlooked by probing systems under prevalent network jitter. Second, the efficiency of anomaly routing (to the responsible application/network team) is still low when multiple attribution teams are involved. Third, collecting fine-grained metrics at the (micro)service level incurs considerable computational/storage overhead, yet it is indispensable for accurate impact assessment and anomaly routing.

We introduce the application-network diagnosing (AND) system in Alibaba Cloud. AND exploits a single metric, TCP retransmissions (retxs), to capture anomalies at the (micro)service level and correlates applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collecting tool to perform filtering/statistics on massive retxs at the (micro)service level, (2) a real-time detection procedure to extract anomalies from ‘noisy’ retxs across millions of time series, and (3) an anomaly routing model to delimit anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba Cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.

https://www.usenix.org/conference/atc24/presentation/wang-zhe
Grand Ballroom CD
