“Reinforcement learning, a powerful branch of artificial intelligence (AI), has revolutionized the way machines learn and make decisions. From training autonomous robots to mastering complex games, reinforcement learning holds the key to creating intelligent systems that can adapt and improve through continuous interaction with their environment.” This short intro was written by ChatGPT. With machine learning all over the news, it’s an exciting time to apply the ergodicity lens to the topic.

Let’s start with what reinforcement learning is. In reinforcement learning, we have an algorithm that is supposed to learn through trial and error. It tries out things and receives feedback on how well it performed. From this feedback, it then learns how to optimally solve its assigned task (ideally). To understand exactly what this feedback is, we revisit one of the best-known textbooks on reinforcement learning, which is also the go-to resource for university courses: “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto. At the beginning of Section 3.2 of this book, we find the reward hypothesis, stating:

“That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”

Thus, the expected value is at the very root of reinforcement learning. And it enters in a way which sounds oddly familiar to maximizing the expected value of wealth – which, from the ergodicity economics perspective, is often not at all the objectively “right” quantity to maximize. However, choosing the right objective to optimize is crucial for obtaining the desired result – otherwise, we might end up like Greek King Midas, wishing that everything he touches would turn to gold, and in the end starving to death since gold is difficult to eat (researchers from UC Berkeley have related this story to reinforcement learning in a separate blog post).

Nomenclature

Before diving deeper into reinforcement learning and into how we can enable algorithms to handle non-ergodic dynamics, we need to get the nomenclature straight. We do this using a variant of the coin-toss example, where players are allowed to scale their bets. We consider a human with a starting wealth of $100, who gets to bet a fraction of their wealth. Then, a fair coin is tossed. If it comes up heads, the player receives the stake back with a 50% gain; otherwise, they lose 40% of the stake.

From a reinforcement learning perspective, the current wealth of a person corresponds to the “state” of the algorithm. The wealth increments in every round are called the “reward,” and the terminal wealth at the end of a trajectory would correspond to the “cumulative reward” at the end of an “episode.” In reinforcement learning language, the fraction of the wealth the human chooses to bet in each round is the “action.” The goal in reinforcement learning is to learn a “policy,” which is a mapping from state to action: which action to choose in each time step given the current state. In the coin-toss example, the policy may be thought of as the leverage – that’s the proportion of wealth bet in each round. Following the reward hypothesis, the policy should be chosen such that it maximizes the expected value of the cumulative reward.

Ergodicity economics | Reinforcement learning |

Wealth | State |

Terminal wealth | Cumulative reward |

Wealth increment | Reward |

Bet | Action |

Leverage | Policy |

Trajectory | Episode |

One toss of a coin | One time step |

**Table 1:**A “dictionary” between typical nomenclatures used in ergodicity economics and reinforcement learning.

An initial experiment

Considering that reinforcement learning has beaten the world champion in the game of Go, the coin-toss game seems like a fairly simple example that any reinforcement learning algorithm should solve with ease. To check whether this holds true, we select a state-of-the-art algorithm, which is also the underlying algorithm for ChatGPT: proximal policy optimization (PPO) and let it learn a (hopefully) optimal policy.

We start with training. The reinforcement learning algorithm learns by playing the game for a predefined number of episodes each with a fixed number of time steps. In ergodicity-economics terms that means that we toss the coin a fixed number of times, and then “reset” the wealth to the original value and start playing again. From playing the game multiple times, the algorithm learns a policy. To get started, we decide that each game should last 100 time steps and train the algorithm for 10’000 episodes.

After training, we evaluate the final policy of the algorithm ten times by generating 10 trajectories played according to the policy. As we can see in Figure 1, especially considering that the y-axis is logarithmic, the final cumulative reward in all but one of the ten runs is significantly lower than the initial reward of 100, marked with the thick red line. That means that if the algorithm had decided to invest 0% at each time step, it would still have ended up with a better reward than the supposedly optimal policy it found.

Things get even worse if we try to extrapolate. If we take the policy and simulate the game for 1000 time steps, we see in Figure 2 that in all ten cases the final reward is way lower than the initial.

The doomed remedy: training on more episodes doesn’t solve the problem

The coin-toss game is non-ergodic. Therefore, the expected value of wealth, which we set out to maximize, differs from the time average of wealth. Suppose we were to run even more episodes in an attempt to better train the algorithm. The more episodes we train on, the more very rare extremely lucky outcomes will be included in the training data. In vanishingly unlikely episodes we end up exponentially rich, and the more of our wealth we bet, the greater the effect of these extremes. Already with 10’000 episodes, the reinforcement learning algorithm experiences some of these extremely lucky cases during training. And even though there are rather few of these cases, through the multiplicative dynamics, the cumulative reward becomes so high that when averaging, the algorithm still learns that it is best to bet high amounts in each round. If we were to train on a million or a billion episodes, the effect would become even more pronounced. The Sutton and Barto quote shows that from the outset, the algorithm is designed to find a policy that is beneficial when we get to play the game thousands or millions of times, each time resetting our wealth to $100, and then average the outcomes. But in practice, we are usually more interested in what happens during individual runs. And for the vast majority of those individual runs, the cumulative reward will be close to zero.

The algorithm itself did exactly what we asked it to do: it optimized the expected value or ensemble average. But the result is different from what we wanted. We misspecified the problem: what we actually would have liked to maximize is the performance of individual runs – or, in other words, the time average wealth or time-average growth rate of wealth.

The effect of episode length and number

There is now a simple “solution” to the problem: we can make the episodes longer and reduce the number of episodes. In longer episodes, it is less likely that we get extremely lucky and end up winning an atypical wealth. Likewise, having fewer episodes reduces the probability of seeing such outcomes. Especially the second point may seem counter-intuitive. Increasing the number of episodes provides the agent with more data and should help it converge to the optimal policy. However, what it converges to is the policy that achieves the optimal *expected value *of the accumulated reward – which is not what we are looking for. With fewer and longer episodes, the algorithm should learn a strategy that works for the individual, not only on average for a large ensemble.

We can verify this by simulating the coin-toss game again. We train the algorithm in episodes that last 10’000 time steps. At the same time, we decrease the number of episodes to 1000. This makes it much less likely to experience positive rewards with very risky policies. Letting the algorithm train and then again evaluating it ten times, we indeed see that the algorithm has learned a more sensible strategy (see Figure 3). However, how long do we need to choose the episode length so we can be sure the algorithm learns what we want it to learn? And how many episodes should we choose? These questions cannot be easily answered as reinforcement learning is most useful for problems that we cannot solve analytically – hence, we cannot say in advance how we need to choose parameters like episode length so that everything works out.

Ergodicity economics offers a different perspective. The algorithm learned the wrong thing because the expected cumulative reward was the wrong objective to maximize in the first place. The ergodicity economics solution is to optimize the time average growth rate instead, which constitutes an ergodic observable. For the coin-toss game with its exponential dynamics, we know how to extract this ergodic observable: by taking the logarithm of the wealth and looking at the increments of logarithmic wealth.

Transforming cumulative rewards

Let’s apply this perspective to the RL setup. Instead of maximizing the expected cumulative rewards we tell the agent to maximize the expected *logarithmic* cumulative reward. Since for an ergodic observable, time average and epecxted value coincide, this is equivalent to maximizing the time average.

In every round of the coin-toss game, the agent decides how much to bet and in return, depending on the outcome of the coin toss, receives or loses “money” – i.e., receives a positive or negative reward. We typically denote the reward an agent receives (or loses) in a round t, with r(t) and the accumulated reward up to time t with R(t). We then extract an ergodic observable by defining the transformed reward as \tilde{r}(t) = ln(R(t+1)) – ln(R(t)) and feed this transformed reward to the learning algorithm. With this change, we run our algorithm again, this time again for the original 10’000 episodes with length of 100 time steps each. When evaluating it ten times, we see, in Figure 4, that in all of the runs, we end up with a higher reward than what we started with.

This time, this also holds when we extend the time horizon, even without retraining the algorithm. Using the same policy, we can now simulate the game for 1000 time steps and again see (Figure 5) that all runs ended up with a higher reward than they started with.

Outlook

Reinforcement learning does not stop at the coin-toss game. Different from economics, where we can consider many interesting questions by solely looking at scalar wealth of a person and how it develops under different wealth dynamics and decision-making strategies, reinforcement learning typically considers problems with high-dimensional states and actions: robotic systems, complex board games like Go, or computer games like StarCraft. Then, extracting an ergodic observable is intractable analytically and we need to learn how to extract one while learning an optimal policy. And then there are more parameters besides number of episodes and episode length that play into the equation: discount rates, on which ergodicity economics has its own say, or learning rates, for example. So there are many more challenges to solve all of which will contribute to building a sound foundation for reinforcement learning.

Further readings

Following the ideas in this blog post, we have recently published a paper on arXiv that discusses in even more detail the connection between ergodicity and reinforcement learning, including an algorithm for learning a transformation that extracts an ergodic observable when the dynamics are unknown. Another possibility is adjusting the optimization objective to let RL agents directly optimize the time average instead of the expected value. An abstract and presentation of such an approach is available at GitHub.

Acknowledgements

This blog post originated from a joint discussion with Bert Verbruggen, Ph.D. candidate at the Vrije Universiteit Brussels, and James Price, Ph.D. candidate at the University of Warwick. Ole Peters, Founder of the London Mathematical Laboratory and External Professor at the Santa Fe Institute and Thomas Schön, Professor at Uppsala University, provided valuable feedback that helped shaping the post.

## Leave a Reply