These days I've been working towards my PhD. As part of my research, I have had to understand Reinforcement Learning (RL) at a fundamental level. Searching for materials on the topic led me back to Dynamic Programming (DP), a subject I had previously studied as part of my Master's degree curriculum. It turns out DP is pretty closely related to RL. The problems they are designed to solve are very similar, if not the same: maximizing some kind of sum of rewards in an environment embodied by a Markov Decision Process (MDP). MDPs are a big rabbit hole of their own, so I'm going to refrain from going into the details. The important bit is that both fields, DP and RL, are trying to do the same thing, but RL is more general than DP. Historically, DP came first, followed by RL. To solve a problem with DP, you need to know completely how its environment, the governing MDP, works. RL has no such hard requirement. In exchange for that convenience, however, RL gives up DP's promise of an exact solution. For a lot of practical problems, that tradeoff is very worthwhile.
Anyway, to brush up on my knowledge, I revisited a textbook DP problem I was taught during my Master's degree: the water allocation problem. The problem is as follows:
A total of 6 units of water is to be allocated optimally to 3 users. The allocation is made in discrete steps of one unit, ranging from 0 to 6. With the 3 users denoted User1, User2 and User3, the return each user yields for a given allocation is as follows. The problem is to find allocations to the 3 users such that the total return is maximized.
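To make the DP structure concrete, here is a minimal TypeScript sketch of the classic stage-wise backward recursion for this problem. The numbers in the return tables below are placeholders I've made up for illustration, not the actual figures from the course problem; everything else follows the problem statement (6 units, 3 users, integer allocations).

```typescript
// Placeholder return tables (NOT the original problem's data):
// returns[u][x] = return obtained when user u is allocated x units of water.
const returns: number[][] = [
  [0, 5, 8, 10, 11, 11, 11], // User1
  [0, 4, 7, 9, 10, 10, 10],  // User2
  [0, 3, 6, 8, 9, 10, 10],   // User3
];

const TOTAL = 6; // total units of water to allocate

// bestValue[u][w] = maximum total return when users u..2 share w units.
// choice[u][w] remembers the allocation to user u that achieves it.
const bestValue: number[][] = Array.from({ length: 4 }, () => new Array(TOTAL + 1).fill(0));
const choice: number[][] = Array.from({ length: 3 }, () => new Array(TOTAL + 1).fill(0));

// Backward recursion over "stages" (users), the textbook DP formulation.
for (let u = 2; u >= 0; u--) {
  for (let w = 0; w <= TOTAL; w++) {
    for (let x = 0; x <= w; x++) {
      const value = returns[u][x] + bestValue[u + 1][w - x];
      if (value > bestValue[u][w]) {
        bestValue[u][w] = value;
        choice[u][w] = x;
      }
    }
  }
}

// Recover the optimal allocation by walking forward through the stored choices.
let remaining = TOTAL;
const allocation: number[] = [];
for (let u = 0; u < 3; u++) {
  allocation.push(choice[u][remaining]);
  remaining -= choice[u][remaining];
}
console.log("allocation:", allocation, "total return:", bestValue[0][TOTAL]);
```

Each user is treated as a "stage": bestValue[u][w] answers the question "with w units left for users u onward, what is the best achievable total return?", which is exactly the kind of recursive value function that reappears later in RL.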
Okay, now on to RL. Grokking Deep Reinforcement Learning (2020) by Miguel Morales was a really good starting point, picking up where DP left off. The gridworld example it uses repeatedly (much like the OG Sutton RL book of 2018) to communicate core concepts is really effective. There is such a glut of online content on RL, with video material swamping out texts and articles, that it is hard not to get confused about how each RL algorithm works. Morales' book cuts right through that haze. Once you religiously go through the basic definitions (MDP, state, action, policy, state- and action-value functions), followed by the first exact DP solution methods, Policy Iteration (and the distinctly brief Value Iteration discussion), most of the algorithms that come after should be a lot more digestible, regardless of which resource you follow from there. Personally, I felt the chapter after Policy and Value Iteration, on the Multi-Armed Bandit, was a bit of a digression, though I'm sure the author thought it best to keep it there as a prelude to the "actual" RL methods to come (Monte Carlo evaluation methods). I also felt there was a definite switch from an integrated discussion of the "prediction problem" (policy evaluation) and the "control problem" (policy improvement) to a style where concepts were siloed off into algorithms for policy evaluation and algorithms for policy improvement, which affected my motivation and attention and at times felt a little crammed. This was when I restarted my search online to pick up after the exact methods (AKA the DP methods, Policy and Value Iteration).

Oh, before that though, the Google DeepMind RL lecture series by David Silver was also really enlightening for the Policy and Value Iteration parts. For the first "real" RL method, one that doesn't depend on having access to the MDP's transition probabilities to reach a solution, namely Monte Carlo sampling (in its First-Visit and Every-Visit variants), hands down the best down-to-earth explanation was provided by Prof. Saeed Saeedvand of NTU, Taiwan here. I would also like to mention Kimia Lab's videos by Prof. Tizhoosh, which provided a very high-level intuition of RL methods to get started with. Stanford University also had a series on RL on YouTube, but it didn't catch my fancy for whatever reason.

Only after watching Prof. Saeedvand's lecture on MC methods did I feel I understood them well enough to follow Sutton's chapter on the topic, after which I started to appreciate the dichotomy between policy evaluation and policy improvement algorithms (as far as I can tell so far, there's not a lot going on in the evaluation space) and why the Grokking textbook was structured the way it was (I think, anyway). I have started reading up on what's apparently called the SARSA algorithm, and on Q-learning, but haven't implemented either yet. However, I do have a simple gridworld application in JavaScript/TypeScript, implemented with Policy/Value Iteration and Monte Carlo methods, that was really fun and insightful:
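To give a flavour of what the Monte Carlo part of that involves (this is a minimal sketch, not the app itself; the grid size, reward scheme and names here are illustrative), first-visit MC evaluation of a random policy on a tiny gridworld looks roughly like this in TypeScript:

```typescript
// First-visit Monte Carlo evaluation of a uniformly random policy on a 4x4
// gridworld where the last cell is terminal and every step costs -1.
// Illustrative sketch only; not taken from the gridworld app mentioned above.

const SIZE = 4;
const TERMINAL = SIZE * SIZE - 1;
const GAMMA = 1.0;
const ACTIONS = [-SIZE, SIZE, -1, 1]; // up, down, left, right

// Transition: bump into a wall and you stay in place.
function step(state: number, action: number): number {
  const col = state % SIZE;
  if (action === -1 && col === 0) return state;        // left wall
  if (action === 1 && col === SIZE - 1) return state;  // right wall
  const next = state + action;
  return next < 0 || next >= SIZE * SIZE ? state : next; // top/bottom walls
}

// Roll out one episode under the random policy: a list of (state, reward) pairs.
function episode(start: number): Array<[number, number]> {
  const trajectory: Array<[number, number]> = [];
  let state = start;
  while (state !== TERMINAL) {
    const action = ACTIONS[Math.floor(Math.random() * ACTIONS.length)];
    trajectory.push([state, -1]);
    state = step(state, action);
  }
  return trajectory;
}

// First-visit MC: average the return observed on the first visit to each state.
const returnsSum: number[] = new Array(SIZE * SIZE).fill(0);
const returnsCount: number[] = new Array(SIZE * SIZE).fill(0);

for (let ep = 0; ep < 5000; ep++) {
  const traj = episode(Math.floor(Math.random() * TERMINAL)); // random non-terminal start
  let G = 0;
  const firstVisit = new Map<number, number>(); // state -> return at its first visit
  for (let t = traj.length - 1; t >= 0; t--) {
    const [s, r] = traj[t];
    G = GAMMA * G + r;
    firstVisit.set(s, G); // earlier visits overwrite later ones, so the first visit wins
  }
  firstVisit.forEach((G0, s) => {
    returnsSum[s] += G0;
    returnsCount[s] += 1;
  });
}

const V = returnsSum.map((sum, s) => (returnsCount[s] ? sum / returnsCount[s] : 0));
console.log(V.map(v => v.toFixed(1)));
```

The Every-Visit variant differs only in that the return is accumulated for every occurrence of a state within an episode rather than just the first one.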