Reinforcement Learning

Graduate course, CentraleSupélec, Machine Learning Engineering, 2021

This Reinforcement Learning course focused more on algorithms than on theory; some theoretical results were stated without proof.

Below is a checklist that students can use to revise the concepts covered in class.

Control theory

• Definition of Markov Decision Process
• Markov property
• Discount factor
• Discounted value, Finite time horizon value
• Bellman operator, Bellman optimal operator
• Dynamic Programming principle
• Policy Evaluation: Direct computation, Iteration, Monte-Carlo
• Contraction of Bellman operator, of Bellman optimal operator
• Value Iteration
• Policy Iteration
• Bellman Q-operator
• Incremental Monte-Carlo updates: Temporal Difference, $TD(\lambda)$
• Q-temporal difference, Q-learning
• Function approximation for V (Least-Squares TD) and for Q (Fitted Q-Iteration)
• Projection vs Contraction
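As a concrete anchor for Value Iteration and the contraction of the Bellman optimal operator, here is a minimal NumPy sketch on a made-up 2-state, 2-action MDP (all transition and reward numbers are illustrative, not from the course):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
# P[a, s, s'] = transition probability, R[a, s] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([
    [1.0, 0.0],                 # action 0
    [0.0, 2.0],                 # action 1
])
gamma = 0.9                     # discount factor

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimal operator T*V = max_a (R_a + gamma * P_a V).

    T* is a gamma-contraction in sup-norm, so the iterates converge
    geometrically to the optimal value V*.
    """
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V            # Q[a, s]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # optimal value, greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

On this toy MDP, action 1 in state 1 is a self-loop with reward 2, so $V^*(1) = 2/(1-\gamma) = 20$.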

Multi-armed bandits

• The notions of regret and of the optimality gap $\Delta_a$
• What is Exploration? What is Exploitation?
• The optimism-in-the-face-of-uncertainty principle
• Hoeffding's inequality for finite samples
• Handling a random number of samples with a union bound
• The Upper Confidence Bound (UCB) strategy
• The Thompson sampling strategy
• Problem dependent regret lower bound: scaling in $T$, Kullback-Leibler
• Problem-free (minimax) regret lower bound: scaling in $T$, $A$
• KL-UCB strategy: the lower-bound approach
• IMED strategy
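The UCB strategy and the Hoeffding-based confidence bound can be sketched in a few lines (an illustrative setup with Bernoulli arms; the function name and arm means are made up for the example):

```python
import math
import random

def ucb1(arm_means, T, seed=0):
    """UCB on Bernoulli arms: after pulling each arm once, pull the arm
    maximizing  mu_hat_a + sqrt(2 * log(t) / N_a),
    the Hoeffding-based upper confidence bound (optimism principle).
    """
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [0] * K            # N_a: number of pulls of each arm
    sums = [0.0] * K            # cumulative reward of each arm
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1           # initialization: pull each arm once
        else:
            a = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                                  + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts

counts = ucb1([0.2, 0.5, 0.8], T=2000)
```

Suboptimal arms still get $O(\log T)$ pulls, which is where the logarithmic regret scaling comes from.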

Structured Multi-armed bandits

• What is Unimodal structure? Lipschitz structure? Linear structure?
• Graph seen as a linear structure
• Lower-bound for structured bandits: optimization problem
• IMED for Lipschitz bandits
• Linear regression setup
• Sub-Gaussian noise assumption
• Least-squares estimate
• Optimistic principle for linear bandits
• Information gain
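A minimal sketch of the optimistic principle for linear bandits (a LinUCB-style index built from the regularized least-squares estimate; the function name, the constant exploration parameter `beta`, and the Gaussian noise level are illustrative choices, not course specifics):

```python
import numpy as np

def linucb(features, theta_star, T, beta=1.0, lam=1.0, seed=0):
    """Linear bandit with optimistic index:
    theta_hat = A^{-1} b  with  A = lam*I + sum_t x_t x_t^T  (least squares),
    index(x)  = x^T theta_hat + beta * sqrt(x^T A^{-1} x).
    Rewards are x^T theta_star + Gaussian noise (sub-Gaussian assumption).
    """
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    A = lam * np.eye(d)                      # regularized design matrix
    b = np.zeros(d)
    pulls = np.zeros(len(features), dtype=int)
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                # least-squares estimate
        widths = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
        a = int(np.argmax(features @ theta_hat + beta * widths))
        x = features[a]
        r = x @ theta_star + 0.1 * rng.standard_normal()
        A += np.outer(x, x)                  # rank-one design update
        b += r * x
        pulls[a] += 1
    return pulls

# Two orthogonal arms; arm 1 has the higher expected reward.
pulls = linucb(np.eye(2), np.array([0.2, 0.8]), T=500)
```

With orthogonal arm features this reduces to per-coordinate UCB; the confidence width $\sqrt{x^\top A^{-1} x}$ is also what drives the information-gain analysis.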

Model-based MDPs

• Average gain criterion
• Poisson equation (gain and bias)
• Diameter of an MDP
• Value Iteration convergence issues
• Stopping criterion for Value Iteration
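For the average-gain criterion, undiscounted Value Iteration need not converge in norm, but the span of successive differences can still serve as a stopping criterion. A sketch on a hypothetical 2-state MDP (numbers illustrative; assumes the usual communication/aperiodicity conditions so the span actually shrinks):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[a, s].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

def span(v):
    """Span seminorm: max(v) - min(v)."""
    return np.max(v) - np.min(v)

def avg_gain_value_iteration(P, R, eps=1e-6, max_iter=10_000):
    """Undiscounted VI stopped when span(V_{k+1} - V_k) < eps.

    The iterates themselves diverge linearly (slope = gain g), so we
    monitor the difference: it converges to g * 1, and the midpoint
    (max + min) / 2 of the difference approximates the gain.
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        V_new = (R + P @ V).max(axis=0)
        diff = V_new - V
        if span(diff) < eps:
            gain = 0.5 * (diff.max() + diff.min())
            return gain, V_new - V_new.min()   # bias, up to a constant
        V = V_new
    raise RuntimeError("did not converge")

gain, bias = avg_gain_value_iteration(P, R)
```

Here the optimal policy reaches the self-loop with reward 2, so the gain is 2; the returned bias solves the Poisson equation up to an additive constant.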

Planning

• Monte Carlo Tree Search
• What are the 4 main steps of the MCTS strategy?
• UCT rule for the value of each node
• What is a Generative model?
• What is the Best-arm identification (BAI) objective?
• What is Simple regret?
• Fixed-budget objective vs Fixed-confidence objective
• Reduction from cumulative to simple regret
• What is forced exploration?
• UCT rule in a max node versus in a min node
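The UCT scoring rule for a child node, including the max-node/min-node distinction, fits in a few lines (the function name and the default exploration constant are illustrative):

```python
import math

def uct_value(child_value, child_visits, parent_visits,
              c=1.41, maximizing=True):
    """UCT score of a child: exploitation term + c * sqrt(log N_parent / N_child).

    In a min node the exploitation term is negated, so low values are
    preferred while the exploration bonus still rewards rarely-visited
    children. Unvisited children get +inf so each child is tried once.
    """
    if child_visits == 0:
        return math.inf
    bonus = c * math.sqrt(math.log(parent_visits) / child_visits)
    exploit = child_value if maximizing else -child_value
    return exploit + bonus
```

During the selection step, MCTS descends the tree by picking the child with the highest UCT score at each node.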

Deep Reinforcement Learning

• Model-based vs Model-free
• Critic algorithm, Actor algorithm, Actor-critic algorithm
• Examples of critic algorithms and of actor algorithms?
• Q-learning idea
• What are slow/fast (target) network updates? Why were they introduced?
• What is experience replay?
• What is prioritized experience replay?
• Double DQN
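Experience replay is simple enough to sketch without a deep-learning framework (a minimal buffer as used in DQN-style agents; the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions and sample uniform minibatches,
    breaking the temporal correlation of consecutive on-line samples.
    A deque with maxlen evicts the oldest transitions once full.
    """

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        """Store one transition (s, a, r, s', done)."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Sample a minibatch uniformly without replacement."""
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for s in range(150):                       # overfill to trigger eviction
    buf.push(s, 0, 1.0, s + 1, False)
batch = buf.sample(4)
```

Prioritized experience replay replaces the uniform sampling above with sampling proportional to (a power of) the TD error, with importance weights to correct the bias.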