Users demand 24/7 dependability from cloud services. Failing to deliver it is costly, yet reaching ideal dependability poses complex challenges. Behind cloud computing is a collection of hundreds of complex systems, written in millions of lines of code, that are brittle and prone to failure. In this talk, I discuss one of the unsolved problems in distributed systems, “distributed concurrency bugs”. Distributed concurrency bugs are caused by nondeterministic orderings of distributed events such as message arrivals, crashes, and reboots. I present the insights I gained from our bug study, which can inform much of the research on combating bugs. I also present my effort to advance distributed system model checkers to unearth hidden bugs in these systems. I propose a principle of semantic awareness to tackle the major problem of model checkers, “state space explosion”. In this work, I show that leveraging semantic knowledge of the systems under test can help a model checker find bugs 2x to 340x faster than the state of the art.
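To make the idea of order-dependent bugs concrete, here is a minimal Python sketch. It is not the model checker from the talk. It assumes a hypothetical toy node with four events, all names invented for illustration, replays every possible ordering the way a naive checker would, and reports the orderings that break a simple invariant. It also prints how fast the number of orderings grows, which is the state space explosion mentioned above.

```python
from itertools import permutations
from math import factorial

# Hypothetical toy events; the names are illustrative, not taken from the talk.
EVENTS = ("msg_a_arrives", "msg_b_arrives", "node_crashes", "node_reboots")


def invariant_holds(order):
    """Replay one ordering against a toy node.

    Assumed toy semantics: msg_a sets an in-memory value, msg_b commits it,
    a crash takes the node down and wipes the value, a reboot brings the
    node back up, and messages that arrive while the node is down are lost.
    Invariant: a commit must never happen while the value is unset.
    """
    up, value = True, None
    for event in order:
        if event == "node_crashes":
            up, value = False, None
        elif event == "node_reboots":
            up = True
        elif event == "msg_a_arrives" and up:
            value = "a"
        elif event == "msg_b_arrives" and up:
            if value is None:
                return False  # committed an unset value: the toy bug
    return True


# Naive exploration: try every ordering of the four events.
all_orders = list(permutations(EVENTS))
failing_orders = [order for order in all_orders if not invariant_holds(order)]

print(f"{len(failing_orders)} of {len(all_orders)} orderings violate the invariant")

# The number of orderings grows factorially with the number of events.
for n in (4, 8, 12):
    print(f"{n} events -> {factorial(n)} orderings")
```

Only some orderings trigger the violation, so testing a single schedule can easily miss it, while exhaustive enumeration quickly becomes infeasible as events are added.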
See more about this video at www.microsoft.com/en-us/research/video/unearthing-concurrency-bugs-cloud-scale-distributed-systems/