Reinforcement Learning

Graduate course, Centrale Supélec, Machine learning Engineering, 2021

This Reinforcement Learning course focused more on algorithms than theory. Some theory was taught with results left unproved.

Below is a checklist for students to use and revise the concepts that were tackled in class.

Control theory

  • Definition of Markov Decision Process
  • Markov property
  • Discount factor
  • Discounted value, Finite time horizon value
  • Bellman operator, Bellman optimal operator
  • Dynamic Programming principle
  • Policy Evaluation: Direct computation, Iteration, Monte-Carlo
  • Contraction of Bellman operator, of Bellman optimal operator
  • Value Iteration
  • Policy Iteration
  • Quality function, Advantage function
  • Bellman Q-operator.
  • Incremental Monte-carlo updates: Temporal Difference, $TD(\lambda)$
  • Q-temporal difference, Q-learning
  • Function approximation for V: Least-squares TD and Q: Fitted Q-iteration
  • Projection vs Contraction

Multi-armed bandits

  • The notion of Regret, of optimality gap $\Delta_a$
  • What is Exploration? What is Exploitation?
  • Exploration-Exploitation trade-off
  • Follow the leader, Explore then Commit strategies
  • The optimism in face of uncertainty principle
  • Hoeffding inequality for finite samples
  • Handling random number of samples with Union bound
  • The Upper Confidence bound (UCB) strategy
  • The Thompson sampling strategy
  • Problem dependent regret lower bound: scaling in $T$, Kullback-Leibler
  • Problem-free (minimax) regret lower bound: scaling in $T$, $A$
  • KL-UCB strategy lower-bound approach
  • IMED strategy

Structured Multi-armed bandits

  • What is Unimodal structure? Lipschitz structure? Linear structure?
  • Graph seen as a linear structure
  • Lower-bound for structured bandits: optimization problem
  • IMED for Lipschitz bandits
  • Linear regression setup
  • Sub-Gaussian noise assumption
  • Least-squares estimate
  • Optimistic principle for linear bandits
  • Information gain

Model-based MDPs

  • Average gain criterion
  • Poisson equation (gain and bias)
  • Diameter of an MDP
  • Value Iteration convergence issues
  • Stopping criterion for Value Iteration


  • Monte Carlo Tree Search
  • What are the 4 main steps of of MCTS strategy?
  • UCT rule for the value of each node
  • What is a Generative model?
  • What is Best-armed identification (BAI) objective?
  • What is Simple regret?
  • Fixed-budget objective vs Fixed-confidence objective
  • Reduction from cumulative to simple regret
  • What is forced exploration?
  • UCT rule in max node, versus UCT rule in min node

Deep Reinforcement Learning

  • Model-based vs Model-free
  • Critic algorithm, Actor algorithm, Actor-critic algorithm
  • Example of Critic, Actor, algorithms?
  • Q-learning idea
  • What is slow/fast network updates? Why was it introduced?
  • What is experience replay?
  • What is prioritized experience replay?
  • Double DQN
  • Policy gradient theorem
  • Idea behind Reinforce strategy