Reinforcement learning is a mode of machine learning driven by feedback from the environment on how good a sequence of actions taken by the learning agent turns out to be.
We consider here a reinforcement learning mechanism for neural networks that is similar to policy gradients (see A. Karpathy's introduction) but with the following distinction: several agents collected in a "culture" interact with the environment independently, and rather than each updating its policy according to its own score, each agent learns from the experience of a peer with a better score.
Briefly, agents imitate their more successful peers.
This learning mechanism may also be seen as an evolutionary algorithm applied to behavioral memes. In particular:
- No structural compatibility between neural networks of individual agents is necessary beyond the input and output layers, which allows for coevolution of different designs.
- It is easy to parallelize the computation across the independent agents.
This "cultural diffusion" has been observed in bumblebees, for example.
The environment we look at is a very simplified version of pac-man. At each time step the agent (dark dot) performs one of five actions: stay/up/down/left/right. The aim is to collect as many tokens (brighter dots) as possible within a round of 100 time steps (frames). The score for the round is the number of collected tokens. Five tokens appear in random locations a) initially and b) as soon as all current tokens have been collected. The boundary of the playfield is periodic, and the playfield is presented to the agent shifted by the agent's own coordinates, so that the agent always sees itself at the origin.
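The environment itself fits in a few lines. The sketch below only illustrates the rules described above; the grid size, the starting position, and the direction conventions are assumptions and need not match the attached archive.

```python
import numpy as np

class TokenField:
    """Toy pac-man-like playfield on a periodic (toroidal) grid."""

    MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # stay/up/down/left/right

    def __init__(self, size=11, n_tokens=5, rng=None):
        self.size, self.n_tokens = size, n_tokens
        self.rng = rng or np.random.RandomState()
        self.agent = np.array([0, 0])
        self.tokens = np.zeros((size, size), dtype=bool)
        self._spawn_tokens()

    def _spawn_tokens(self):
        # Place n_tokens tokens at distinct random cells.
        flat = self.rng.choice(self.size * self.size, self.n_tokens, replace=False)
        self.tokens[np.unravel_index(flat, self.tokens.shape)] = True

    def observe(self):
        # The playfield as the agent sees it: shifted so the agent sits at (0, 0).
        return np.roll(self.tokens, tuple(-self.agent), axis=(0, 1)).astype(float)

    def step(self, action):
        # Move on the torus, collect a token if present, respawn when all are gone.
        self.agent = (self.agent + self.MOVES[action]) % self.size
        collected = int(self.tokens[tuple(self.agent)])
        self.tokens[tuple(self.agent)] = False
        if not self.tokens.any():
            self._spawn_tokens()
        return self.observe(), collected
```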
Here is what the response of an agent typically looks like a) before training, b) with some training, c) with substantial training:
(See the attached archive peer_learning_plot1.zip for the code that produces these images.)
If in doubt, use python3.
Required libraries: numpy, keras (with tensorflow or theano), matplotlib.pyplot.
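As an illustration of how the listed libraries fit together, a plausible policy network flattens the shifted playfield and outputs a softmax over the five actions. The architecture below is an assumption for the sketch, not necessarily what the attached archive uses.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten

def build_agent(field_size=11, n_actions=5):
    """A small policy network: shifted playfield in, action probabilities out."""
    model = Sequential([
        Flatten(input_shape=(field_size, field_size)),
        Dense(64, activation='relu'),
        Dense(n_actions, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# Sampling an action from the policy for an observation obs of shape (11, 11):
#   probs = agent.predict(obs[np.newaxis])[0]
#   action = np.random.choice(len(probs), p=probs)
```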
License: CC BY 4.0.