Abstract
The foundational software systems that power machine learning advances (operating systems, compilers, and storage stacks) remain largely governed by hand-tuned heuristics that are sub-optimal and often outdated. As modern systems grow more complex and the pressure on compute resources intensifies, these static rules are becoming a bottleneck. In this talk, I will cover selected examples of how we applied ML to address systems problems within Google's production infrastructure. Beyond the success stories, I will also discuss the challenges and the lessons learned the hard way from applying ML in a system serving billions of users each day.