This is Scientific American’s 60-Second Science. I’m Pakinam Amer. [CLIP: Retro game theme music]
Whether you’re a pro gamer or you dip your toes in that world every once in a while, chances are you got stuck while playing a video game once or were even gloriously defeated by one. I know I have. Maybe, in your frustration, you kicked the console a little. Maybe you took it out on the controllers or—if you’re an 80s kid like me—made the joystick pay.
Now a group of computer scientists from Uber AI are taking revenge for all of us who’ve been in this situation before. Using a family of simple algorithms, tagged “Go-Explore,” they went back and beat some of the most notoriously difficult Atari games whose chunky blocks of pixels and eight-bit tunes had once challenged, taunted and even enraged us. [Adrien Ecoffet et al., First return, then explore] [CLIP: Swish]
But what does revisiting those games from the 80s and 90s accomplish, besides fulfilling a childhood fantasy?
According to the scientists, who published their work in Nature, experimenting with solving video games that require complex, hard exploration gives rise to better learning algorithms. They become more intelligent and perform better in real-world scenarios.
“One of the nice things of Go-Explore is that it’s not just limited to video games, but that you can also apply it to practical applications like robotics.”
So how does it actually work? Let’s start with the basics. When AI processes images of the world in the form of pixels, it does not know which changes should count and which should be ignored. For instance, a slight change in the pattern of the clouds in the sky in a game environment is probably unimportant when exploring said game. But finding a missing key certainly is. To the AI, however, both are just a few pixels changing in that world.
This is where deep reinforcement learning comes in. It’s an area of machine learning that helps an agent analyze an environment to decide what matters and which actions count through feedback signals in the form of extrinsic and intrinsic rewards.
“This is something that animals, basically, constantly do. You can imagine, if you touch a hot stove, you immediately get strong negative feedback like ‘Hey, this is something you shouldn’t do in the future.’ If you eat a bar of chocolates, assuming you like chocolates, you immediately get a positive feedback signal like ‘Hey, maybe I should seek out chocolate more in the future.’ The same is true for machine learning. These are problems where the agent has to take some actions, and then maybe it wins a game.”
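The hot-stove and chocolate examples can be sketched as a tiny value-learning loop. This is an illustrative assumption on my part, not the paper’s method: a tabular update that nudges each action’s value estimate toward the reward it produced, the simplest form of the feedback signal described above.

```python
# Minimal reward-feedback sketch (assumed illustration, not the
# authors' algorithm): value estimates updated from scalar rewards.
values = {"touch_stove": 0.0, "eat_chocolate": 0.0}
ALPHA = 0.5  # learning rate

def update(action, reward):
    # Move the value estimate a step toward the observed reward.
    values[action] += ALPHA * (reward - values[action])

# A few experiences: the stove hurts, the chocolate is pleasant.
for _ in range(10):
    update("touch_stove", -1.0)   # strong negative feedback
    update("eat_chocolate", +1.0) # positive feedback

# The agent now prefers the action with the higher learned value.
best = max(values, key=values.get)
```

After only a handful of experiences, `best` is `"eat_chocolate"`: the agent seeks out what was rewarded and avoids what was punished, exactly the kind of feedback loop the quote describes.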
Creating an algorithm that can navigate rooms with traps, obstacles to jump over, rewards to collect and pitfalls to avoid means creating an artificial intelligence that is curious and that can explore an environment in a smart way. This helps it decide what brings it closer to a goal or how to collect hard-to-get treasures.
Reinforcement learning is great for that, but it isn’t perfect in every situation.
“In practice, reinforcement learning works very well, if you have very rich feedback—if you can tell, ‘Hey, this move is good, that move is bad, this move is good, that move is bad.’”
In Atari games like Montezuma’s Revenge, the game environment offers little feedback, and its rewards can intentionally lead to dead ends. Randomly exploring the space just doesn’t cut it.
“You could imagine, and this is especially true in video games like Montezuma’s Revenge, that sometimes you have to take a lot of very specific actions—you have to dodge hazards, jump over enemies—you can imagine that random actions like, ‘Hey, maybe I should jump here,’ in this new place, is just going to lead to a ‘Game Over’ because that was a bad place to jump—especially if you’re already fairly deep into the game. So let’s say you want to explore level two: if you start taking random actions in level one and just randomly dying, you’re not going to make progress on exploring level two.”
You can’t rely on “intrinsic motivation” alone, which, in the context of artificial intelligence, typically comes from exploring new or unusual situations.
“Let’s say you have a robot, and it can go left into the house and right into the house. Let’s say at first it goes left, it explores left, meaning that it gets this intrinsic reward for a while. It doesn’t quite finish exploring left, and at some point, the episode ends, and it starts anew in the starting room. This time it goes right. It goes fairly far into the room on the right; it doesn’t quite explore it. And then it goes back to the starting room. Now the problem is because it has gone both left and right, and basically it’s already seen the start, it no longer gets as much intrinsic motivation from going there.”
In short, it stops exploring and counts that as a win.
Abandoning a previously visited place once you’ve collected its reward doesn’t work in difficult games, because you might miss important clues.
Go-Explore gets around this by not handing out rewards for individual actions, such as going somewhere new. Instead it encourages “sufficient exploration” of a space, even with little or no guidance, by enabling its agent to explicitly “remember” promising places or states in a game.
Once the agent keeps a record of that state, it can then reload it and intentionally explore from there: what Adrien and Joost call the “first return, then explore” principle.
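That loop can be sketched in a few lines. This is a toy version under simplifying assumptions of my own: a small deterministic grid world stands in for the game, the agent’s position plays the role of a remembered “cell,” and returning is done by replaying the stored action sequence rather than restoring an emulator state.

```python
import random

# Toy Go-Explore sketch (assumed simplifications: deterministic 5x5
# grid world, a "cell" is just the agent's position, and we "return"
# by replaying the stored trajectory from the start).
GOAL = (4, 4)
SIZE = 5
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(state, move):
    # Move within the grid, clamping at the walls.
    x, y = state
    dx, dy = move
    return (max(0, min(SIZE - 1, x + dx)),
            max(0, min(SIZE - 1, y + dy)))

def go_explore(max_iters=2000, seed=0):
    rng = random.Random(seed)
    # Archive maps each discovered cell to a trajectory that reaches it.
    archive = {(0, 0): []}
    for _ in range(max_iters):
        # 1. Select a promising cell from the archive (here: uniformly).
        cell = rng.choice(list(archive))
        trajectory = list(archive[cell])
        # 2. "First return": replay the stored moves deterministically.
        state = (0, 0)
        for move in trajectory:
            state = step(state, move)
        # 3. "Then explore": take a few random actions from there.
        for _ in range(5):
            move = rng.choice(MOVES)
            state = step(state, move)
            trajectory.append(move)
            # Remember every newly reached cell and how we got there.
            if state not in archive:
                archive[state] = list(trajectory)
            if state == GOAL:
                return archive[state]  # full action sequence to the goal
    return None

solution = go_explore()
```

Because exploration always resumes from a remembered frontier instead of wandering from scratch, the archive steadily extends toward the goal, which is the core idea the episode describes.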
According to Adrien, by leaning on another form of learning called imitation learning, in which agents can mimic how humans perform tasks, their AI can go a long way, especially in the field of robotics.
“You have a difference between the world that you can train in and the real world. So one example would be if you’re doing robotics: You know, in robotics, it’s possible to have simulations of your robotics environments. But then, of course, you want your robot to run in the real world, right? And so what can you do, then? If you’re in a situation like that, of course, the simulation is not exactly the same as the environment, so just having something that works in simulation is not necessarily sufficient. We show that in our work. What we’re doing is that we’re using existing algorithms that are called ‘imitation learning.’ And what it is is it just takes an existing solution to a problem and just makes sure that you can reliably use that solution even when, you know, there are slight variations in your environment, including, you know, it being the real world rather than a simulation.”
Adrien and Joost say their model’s strength lies in its simplicity. It can be adapted and expanded easily into real-life applications such as language learning or drug design.
That was 60-Second Science, and this is Pakinam Amer. Thank you for listening.
—Pakinam Amer [The above text is a transcript of this podcast.]