As we train increasingly capable reinforcement learning agents, we've started to see many examples of "reward gaming", where an agent is accidentally incentivised to behave in unexpected ways. This is one of several ways in which an AI's goals might diverge from ours. A crucial objective for machine learning research is AI safety: ensuring that, as our most advanced systems become even more powerful, they remain aligned with human goals rather than behaving in harmful or adversarial ways. Richard will give a broad overview of the arguments for prioritising safety research, and then discuss some of the most promising approaches. In particular, he'll focus on several questions currently being explored by the safety team at DeepMind: How can we use human feedback to guide reinforcement learning? Which types of undesirable behaviour can we precisely specify and rule out? And how might these techniques scale as we build increasingly capable AI?
A talk by Richard Ngo (DeepMind).
Richard is a research engineer on the AI safety team at DeepMind, primarily working on reward learning from human preferences. He is particularly interested in building reward models which remain predictable when scaled up to harder problems. He has also worked on analysing approaches to AI safety at Oxford's Future of Humanity Institute. He graduated with distinctions from Oxford and Cambridge.