Now that we have established the motivation behind favoring policy-based methods over value-based ones with the navigation example in the previous section, let's begin our formal introduction to policy gradients. Unlike deep Q-learning, which stores past experiences in a replay buffer, policy gradient methods learn on-policy, in real time: each update is driven by the most recent experience the agent gathers from the environment. After each gradient update, that experience is discarded and the policy moves on. Let's look at a pictorial representation of what we have just learned:
Figure 11.4: The policy gradient method explained pictorially
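Before moving on, the loop just described can be made concrete with a minimal sketch. The following is not the book's code: it uses only NumPy, a hypothetical two-armed bandit as the environment, and illustrative names (theta, alpha), but it shows the essential on-policy pattern, where each experience drives exactly one gradient update and is then thrown away.

# Minimal on-policy policy gradient (REINFORCE) sketch on a toy
# two-armed bandit. The bandit and all names here are illustrative
# assumptions, not the book's own example.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # policy logits, one per action
alpha = 0.1           # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)   # sample from the CURRENT policy
    # Bandit payoff: action 1 pays ~1.0 on average, action 0 pays ~0.0
    reward = rng.normal(1.0 if action == 1 else 0.0, 0.1)

    # REINFORCE gradient for a softmax policy:
    # grad_theta log pi(a) = onehot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += alpha * reward * grad_log_pi   # one gradient update...

    # ...and the (action, reward) experience is now discarded; nothing
    # is kept in a replay buffer, unlike deep Q-learning.

print("final action probabilities:", softmax(theta))

Running this drives the policy toward the higher-paying action, and at no point is any past experience revisited, which is precisely the behavior the figure illustrates.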
One thing that should immediately catch our attention is that policy gradient methods are, in general, less sample-efficient than Q-learning: each experience contributes to a single gradient update and is then discarded, rather than being replayed many times from a buffer. The mathematical representation...
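For reference, the standard policy gradient (REINFORCE) estimator that such updates follow can be written as below; the symbols here ($J$ for the expected-return objective, $\pi_\theta$ for the parameterized policy, $G_t$ for the return from time $t$) are conventional choices and not necessarily the book's own notation:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

Intuitively, actions that led to higher returns have their log-probabilities pushed up, and actions that led to lower returns are made less likely.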