RL in a Nutshell


Reinforcement Learning (RL) underlies a number of recent technological advances. The framework facilitates complex sequential decision making to maximize cumulative reward, as prescribed by the model builder. The ability to specify how rewards are assigned, and to systematically determine the best course of action, offers an entirely different problem-solving paradigm, one inspired by our understanding of human decision making. I am excited to see how these ideas will manifest in the automated systems of the future.

Sutton and Barto's Reinforcement Learning: An Introduction comes highly recommended as a reference on the topic. Here I'll share some of my takeaways from their material.

RL is different. When we talk about learning, we often make the distinction of whether our problem is supervised, unsupervised, or semi-supervised, as though this were the primary axis for characterizing learning methods. But RL has evolved into a discipline recognized as distinct from (un)supervised learning. Positioned as a hub among the sciences and engineering, RL draws influence from psychology and animal behavior as well as operations research and control theory. Still, RL and (un)supervised methods can be combined to exploit the hidden structures of a problem, offering additional modeling flexibility.

RL is holistic. Reinforcement learning is in keeping with the current trend toward end-to-end solutions and general learning principles. The abstractions of an environment, possible actions, goals, rewards, and credit assignment for the outcomes of sequential choices underlie many real-world decision-making challenges. (Un)supervised learning methods, by contrast, are modular solutions: the practitioner reduces a larger problem to a subproblem, such as object recognition, where accuracy on that narrow task is helpful.
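These abstractions fit together in a simple interaction loop: the agent observes a state, chooses an action, and the environment returns a new state and a reward. A minimal sketch, with `GridEnv` and `RandomAgent` as hypothetical toy stand-ins (not from any library):

```python
import random

class GridEnv:
    """Toy environment: states 0..3 on a line; reward 1 for reaching state 3."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        # action is +1 (right) or -1 (left); movement is clipped to [0, 3]
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

class RandomAgent:
    """Placeholder policy: picks a direction uniformly at random."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = GridEnv(), RandomAgent()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = agent.act(state)               # agent chooses an action
    state, reward, done = env.step(action)  # environment transitions
    total_reward += reward                  # cumulative reward to maximize
print(total_reward)
```

Every piece of the paragraph above has a home here: the environment and actions live in `GridEnv`, the goal and reward in its `step` method, and the loop is where credit for sequential choices would be assigned by a less naive agent.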

The importance of exploration vs. exploitation. In order to find the best option, you will likely need to look around a lot just to confirm it is the best. In this sense, search is the fundamental problem. Sutton and Barto suggest that RL facilitates structured search using interaction-level information. This puts RL somewhere between evolutionary methods, which essentially treat an agent's interactions as a black box, and brute-force search, which becomes infeasible for many problems of interest. Every RL algorithm must strike a balance between the competing strategies of exploration and exploitation. More explicitly, we should encourage exploration so long as it improves our worldview; otherwise, as goal-seeking agents, we should exploit that worldview to maximize rewards. Sutton and Barto argue that the exploration-exploitation trade-off fundamental to RL does not manifest itself in (un)supervised learning. There, the question of how much time to invest in looking for the next big thing is left to the ML practitioner through phases of feature engineering or model tuning.
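The simplest setting where this trade-off appears is the multi-armed bandit, and epsilon-greedy is the simplest balancing strategy: explore a random arm a small fraction of the time, otherwise exploit the current best estimate. A sketch with illustrative arm means and epsilon (these numbers are my own, not from the book):

```python
import random

true_means = [0.1, 0.5, 0.8]   # unknown to the agent
estimates = [0.0] * 3          # running estimate of each arm's value
counts = [0] * 3
epsilon = 0.1                  # fraction of steps spent exploring
random.seed(0)                 # fixed seed so the run is reproducible

for t in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore: random arm
    else:
        arm = max(range(3), key=lambda a: estimates[a])  # exploit: best estimate
    reward = random.gauss(true_means[arm], 1.0)          # noisy reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

best = max(range(3), key=lambda a: estimates[a])
print(best, [round(e, 2) for e in estimates])
```

With epsilon at 0, the agent can lock onto whichever arm looked good first; with epsilon at 1, it never cashes in on what it has learned. Everything in between is the balance the paragraph above describes.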

The core mathematical assumption is that system dynamics are well modeled as Markov Decision Processes (MDPs). This provides the framework for describing a sequence of actions and characterizing decision making under uncertainty: agents occupy states in time and choose among available actions, which result in transitions to new states. Naturally, a model of our environment may reveal regularities that improve the efficiency of our search for behaviors that maximize cumulative reward. The ability to model the environment allows for planning, which is important because greedy strategies often fail to find the best solutions. The MDP framework offers the leeway to proceed when we believe the simple strategy of choosing the action with the highest immediate reward is too heavy-handed for a problem's temporal structure.
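To make the greedy failure concrete, here is a hand-built three-state MDP (states, rewards, action names, and the discount factor are all illustrative) where the action with the highest immediate reward is the wrong long-term choice, and value iteration over the model reveals that:

```python
# transitions[state][action] = (next_state, reward); state 3 is terminal
transitions = {
    0: {"take_1": (3, 1.0),     # small immediate reward, then done
        "wait":   (1, 0.0)},    # no reward now...
    1: {"wait":   (2, 0.0)},
    2: {"cash_in": (3, 10.0)},  # ...but a large delayed payoff
}
gamma = 0.9                     # discount factor for future rewards
V = {s: 0.0 for s in [0, 1, 2, 3]}

# Value iteration: repeatedly apply V(s) = max_a [r + gamma * V(s')]
for _ in range(100):
    for s, acts in transitions.items():
        V[s] = max(r + gamma * V[s2] for (s2, r) in acts.values())

# Greedy choice at state 0: rank actions by immediate reward only
greedy = max(transitions[0], key=lambda a: transitions[0][a][1])
# Planned choice: rank by immediate reward plus discounted future value
planned = max(transitions[0],
              key=lambda a: transitions[0][a][1] + gamma * V[transitions[0][a][0]])
print(greedy, planned)
```

The greedy agent grabs the reward of 1 and stops; planning with the model shows that waiting is worth roughly 8.1 in discounted terms, so the two rankings disagree at state 0.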