# Reinforcement Learning

Graduate course, *Centrale SupĂ©lec, Machine learning Engineering*, 2021

This Reinforcement Learning course focused more on algorithms than theory. Some theory was taught with results left unproved.

Below is a checklist for students to use and revise the concepts that were tackled in class.

## Control theory

- Definition of Markov Decision Process
- Markov property
- Discount factor
- Discounted value, Finite time horizon value
- Bellman operator, Bellman optimal operator
- Dynamic Programming principle
- Policy Evaluation: Direct computation, Iteration, Monte-Carlo
- Contraction of Bellman operator, of Bellman optimal operator
- Value Iteration
- Policy Iteration
- Quality function, Advantage function
- Bellman Q-operator.
- Incremental Monte-carlo updates: Temporal Difference, $TD(\lambda)$
- Q-temporal difference, Q-learning
- Function approximation for V: Least-squares TD and Q: Fitted Q-iteration
- Projection vs Contraction

## Multi-armed bandits

- The notion of Regret, of optimality gap $\Delta_a$
- What is Exploration? What is Exploitation?
- Exploration-Exploitation trade-off
- Follow the leader, Explore then Commit strategies
- The optimism in face of uncertainty principle
- Hoeffding inequality for finite samples
- Handling random number of samples with Union bound
- The Upper Confidence bound (UCB) strategy
- The Thompson sampling strategy
- Problem dependent regret lower bound: scaling in $T$, Kullback-Leibler
- Problem-free (minimax) regret lower bound: scaling in $T$, $A$
- KL-UCB strategy lower-bound approach
- IMED strategy

## Structured Multi-armed bandits

- What is Unimodal structure? Lipschitz structure? Linear structure?
- Graph seen as a linear structure
- Lower-bound for structured bandits: optimization problem
- IMED for Lipschitz bandits
- Linear regression setup
- Sub-Gaussian noise assumption
- Least-squares estimate
- Optimistic principle for linear bandits
- Information gain

## Model-based MDPs

- Average gain criterion
- Poisson equation (gain and bias)
- Diameter of an MDP
- Value Iteration convergence issues
- Stopping criterion for Value Iteration

## Planning

- Monte Carlo Tree Search
- What are the 4 main steps of of MCTS strategy?
- UCT rule for the value of each node
- What is a Generative model?
- What is Best-armed identification (BAI) objective?
- What is Simple regret?
- Fixed-budget objective vs Fixed-confidence objective
- Reduction from cumulative to simple regret
- What is forced exploration?
- UCT rule in max node, versus UCT rule in min node

## Deep Reinforcement Learning

- Model-based vs Model-free
- Critic algorithm, Actor algorithm, Actor-critic algorithm
- Example of Critic, Actor, algorithms?
- Q-learning idea
- What is slow/fast network updates? Why was it introduced?
- What is experience replay?
- What is prioritized experience replay?
- Double DQN
- Policy gradient theorem
- Idea behind Reinforce strategy