Friday, June 6, 2025

A bit about Dynamic Programming and Reinforcement Learning

These days I've been working towards my PhD. As part of my research work, I have had to understand Reinforcement Learning (RL) at a fundamental level. Searching for materials on the topic led me to Dynamic Programming (DP) - a subject that I had previously studied as part of my Master's degree curriculum. Turns out DP is pretty closely related to RL. The problems they are designed to solve are pretty similar, if not the same - maximizing some kind of sum of rewards in the context of an environment embodied by a Markov Decision Process (MDP). MDPs are a big rabbit-hole themselves, so I'm going to refrain from going into the details. The important bit is: both of these fields - DP and RL - are trying to do the same thing, but one (RL) is more general than the other (DP). Historically, DP came first, followed by RL. To solve a problem with DP, you need to know completely how its environment - the governing MDP - works. There is no such hard requirement on RL. However, for that convenience, RL compromises on DP's promise of a perfect solution. For a lot of practical problems, that tradeoff is very worthwhile.
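For concreteness, the quantity both fields are trying to maximize is usually written (this is standard Sutton and Barto notation, nothing specific to my work) as the expected discounted return from a time step t, with a discount factor gamma between 0 and 1:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```

DP and RL differ in how they go about maximizing the expectation of this quantity, not in the objective itself.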

Anyway, to brush up on my knowledge, I revisited a textbook DP problem I was taught during my Master's degree: the water allocation problem. The problem is as follows:

A total of 6 units of water is to be allocated optimally to 3 users. The allocation is made in discrete steps of one unit ranging from 0 to 6. With the 3 users denoted as User1, User2 and User3 respectively, the returns obtained from the users for a given allocation are as follows. The problem is to find allocations to the 3 users such that the total return is maximized.
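In symbols, writing x, y and z for the water allocated to User1, User2 and User3, and R1, R2 and R3 for their (tabulated) return functions, this is the small integer program:

```latex
\max_{x,\,y,\,z} \; R_1(x) + R_2(y) + R_3(z)
\quad \text{subject to} \quad
x + y + z = 6, \qquad x, y, z \in \{0, 1, \dots, 6\}
```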

At first glance, given the discrete and finite nature of the problem, an exhaustive search over the limited number of combinations of water allocations and an evaluation of their total returns is wholly possible, but the idea is not to go with this brute-force approach. It is easy to see how quickly the search space can explode with a trivial relaxation or scaling of the problem (say there are a million users, or the total amount of water available is 6 million units instead of just 6). Dynamic Programming, with its bedrock - the Bellman equation - provides a systematic way to efficiently enumerate and evaluate the optimality of feasible solutions to this problem. Describing the solution methodology is out of the scope of this article, and there are numerous high-quality resources on the topic that cater to readers of varied mathematical backgrounds.

The solution that I do want to cover here is one using a transformation of this problem to the world of Linear Programming (LP). This was motivated by an interesting PyCon UK YouTube video where Geraint Palmer optimized his Pokemon team to have the best chance of winning. To achieve this, he also makes use of concepts in multi-objective optimization and Pareto fronts within the framework of linear programming, and the whole video is absolutely worth the 26-minute watch for his explanations. The star of the show, however, is a Python library called PuLP. I won't belabor the library and its capabilities, but suffice it to say it's a really great tool to formulate, translate (to a form that's readable and solvable by other more commonly used solvers) and solve LP problems.

Coming back to our problem, the way to formulate the water allocation problem as an LP problem is to maximize a function that calculates the total return from the allocations made to each of the three users. After a while of thinking, I came up with this scheme: a reward function for each user that takes in the water allocated to them and outputs their return (R1(x) or f(x), R2(y) or g(y) and R3(z) or h(z) as shown in the figure above, with x, y and z being the water allocated to each of the users respectively); maximize the total of these three functions; constrain x, y and z to be integers between 0 and 6; and constrain their sum to be equal to 6. Now, to achieve this in PuLP, I had to do a few other things. To define each user's reward function exactly, with returns known only at the provided discrete allocations, I introduced a binary variable per user for each allowed allocation from 0 to 6, indicating whether that allocation is the one chosen - meaning only one of them should ever be equal to 1, and remember, this is for each user. To ensure that, I added a constraint dictating that their sum be equal to 1 - again, one such constraint per user. The complete script is here. It's far from groundbreaking, but the fact that it reproduced the result [x,y,z]=[2,1,3] with a maximum total return of 28 - the same answer I had obtained using DP back in the day - was really validating.
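To give a flavour of what that model looks like in PuLP, here is a stripped-down sketch of the same kind of formulation (this is not the linked script, and the return numbers in the table below are placeholders - the real figures come from the figure above, and substituting them should reproduce [2,1,3] and 28):

```python
# A stripped-down sketch of the binary-variable PuLP formulation described above.
# The return numbers in R are PLACEHOLDERS, not the values from the actual problem.
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum, value

TOTAL_WATER = 6
levels = range(TOTAL_WATER + 1)          # allowed allocations: 0..6 units

# R[user][allocation] = return for giving `allocation` units to `user` (placeholders).
R = {
    "User1": [0, 5, 8, 10, 11, 11, 11],
    "User2": [0, 4, 7, 9, 10, 10, 10],
    "User3": [0, 6, 10, 13, 14, 14, 14],
}

prob = LpProblem("water_allocation", LpMaximize)

# One binary indicator per (user, allocation level): b[u][a] == 1 means
# user u is allocated exactly a units of water.
b = {u: [LpVariable(f"b_{u}_{a}", cat=LpBinary) for a in levels] for u in R}

# Objective: total return across the three users.
prob += lpSum(R[u][a] * b[u][a] for u in R for a in levels)

# Each user picks exactly one allocation level...
for u in R:
    prob += lpSum(b[u]) == 1

# ...and the chosen allocations must use up exactly the available water.
prob += lpSum(a * b[u][a] for u in R for a in levels) == TOTAL_WATER

prob.solve()
for u in R:
    alloc = next(a for a in levels if value(b[u][a]) > 0.5)
    print(u, "->", alloc, "unit(s)")
print("Total return:", value(prob.objective))
```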

Okay, now on to RL. Grokking Deep Reinforcement Learning (2020) by Miguel Morales was a really good starting point, picking up where DP left off. The gridworld example it repeatedly uses (much like the OG Sutton and Barto RL book of 2018) to communicate core concepts is really effective. There is such a glut of content online on RL - video content swamping out texts and articles - that it is hard not to get confused about how each of the algorithms works. Morales' book cuts right through that haze. Once you religiously go through the basic definitions of terms such as an MDP, state, action, policy, and state- and action-value functions, followed by the first exact DP solution methods - Policy Iteration (and the distinctly brief Value Iteration discussion) - most of the algorithms that come after should be a lot more digestible, regardless of which resource you follow from there. Personally, I felt like the chapter after Policy and Value Iteration, on the Multi-Armed Bandit, was a bit of a digression, though I'm sure the author thought it best to keep it there as a prelude to the "actual" RL methods to come (Monte Carlo evaluation methods). I also felt there was a definite switch from an integrated discussion of the "prediction problem" (policy evaluation) and the "control problem" (policy improvement) to a style where concepts were siloed off into algorithms for policy evaluation and algorithms for policy improvement, which affected my motivation and attention and at times felt a little crammed.

This was when I restarted my search online to pick up after the exact methods (AKA the DP methods - Policy and Value Iteration). Oh, before that though, the Google DeepMind RL lecture series by David Silver was also really enlightening for the policy and value iteration parts. For the first "real" RL method - one that doesn't depend on having access to the MDP's transition probabilities to reach a solution - Monte Carlo sampling (First-Visit and Every-Visit variants), hands down the best down-to-earth explanation was provided by Prof. Saeed Saeedvand of NTU, Taiwan here. I would also like to mention Kimia Lab's videos by Prof. Tizhoosh, which provided a very high-level intuition of RL methods to get started. Stanford University also had a series on RL on YouTube, but it didn't catch my fancy for whatever reason. Only after watching Prof. Saeedvand's lecture on MC methods did I feel I understood them well enough to follow Sutton's chapter on the topic, after which I started to appreciate the dichotomy between policy evaluation and policy improvement algorithms (as I understand it so far, there's not a lot going on in the evaluation space) and why the Grokking textbook was structured the way it was (I think, anyway).

I have started reading up on what's apparently called the SARSA algorithm, and on Q-learning, but haven't implemented anything with those yet. However, I do have a simple gridworld application in JavaScript/TypeScript implemented with Policy/Value Iteration and Monte Carlo methods that was really fun and insightful.
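For anyone who wants to see what the Value Iteration part boils down to, here is a minimal sketch in Python (not the TypeScript app - the 4x4 grid, the -1-per-step reward, the single terminal corner and the discount factor are all assumptions chosen purely for illustration):

```python
# Minimal value iteration on a tiny deterministic gridworld (illustrative numbers only).
N = 4                                           # 4x4 grid; a state is (row, col)
GAMMA = 1.0                                     # discount factor
THETA = 1e-6                                    # convergence threshold
TERMINAL = (0, 0)                               # absorbing goal state
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic dynamics: move if possible (else stay put) for -1 reward;
    the terminal state is absorbing and yields no further reward."""
    if state == TERMINAL:
        return state, 0.0
    row = max(0, min(N - 1, state[0] + action[0]))
    col = max(0, min(N - 1, state[1] + action[1]))
    return (row, col), -1.0

def q(state, action, V):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    next_state, reward = step(state, action)
    return reward + GAMMA * V[next_state]

# Value iteration: sweep the state space applying the Bellman optimality backup
#   V(s) <- max_a [ r(s, a) + gamma * V(s') ]
# until the largest change in any state's value falls below THETA.
V = {(r, c): 0.0 for r in range(N) for c in range(N)}
while True:
    delta = 0.0
    for s in V:
        if s == TERMINAL:
            continue
        best = max(q(s, a, V) for a in ACTIONS)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Extract the greedy policy from the converged values.
policy = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in V if s != TERMINAL}
print(V[(N - 1, N - 1)])   # far corner: -6 here, i.e. minus the shortest path length
```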
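And, in the same spirit, a minimal first-visit Monte Carlo prediction sketch, which estimates the value of a uniformly random policy on the same toy grid purely from sampled episodes - no transition probabilities in sight (again, every number here is an illustrative assumption):

```python
# First-visit Monte Carlo prediction on the same toy gridworld (illustrative numbers only).
import random
from collections import defaultdict

N, GAMMA, TERMINAL = 4, 1.0, (0, 0)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    # Same toy dynamics as above: -1 per move, clamped at the walls.
    row = max(0, min(N - 1, state[0] + action[0]))
    col = max(0, min(N - 1, state[1] + action[1]))
    return (row, col), -1.0

def generate_episode():
    """One episode under a uniformly random policy, as a list of (state, reward) pairs."""
    state = (random.randrange(N), random.randrange(N))
    episode = []
    while state != TERMINAL:
        next_state, reward = step(state, random.choice(ACTIONS))
        episode.append((state, reward))
        state = next_state
    return episode

# First-visit MC: estimate V(s) as the average of the returns observed the first
# time s appears in each sampled episode - only experience is consulted.
returns_sum = defaultdict(float)
returns_count = defaultdict(int)
V = defaultdict(float)

for _ in range(5000):                              # number of episodes (arbitrary)
    episode = generate_episode()
    first_visit = {}
    for i, (s, _) in enumerate(episode):
        first_visit.setdefault(s, i)               # index of s's first occurrence
    G = 0.0
    for i in reversed(range(len(episode))):        # accumulate returns backwards
        s, r = episode[i]
        G = GAMMA * G + r
        if first_visit[s] == i:                    # only record at the first visit
            returns_sum[s] += G
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]

print(V[(N - 1, N - 1)])   # sampled estimate of the random policy's value at the far corner
```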