Reinforcement learning algorithms may vary along the following axes (minimal code sketches illustrating each axis follow the table):

| Axis of variation | Category definitions | Category examples |
| --- | --- | --- |
| Model-free vs. model-based | Model-free: estimates the values of actions directly from sampled experience, without building a model of the environment; analogous to stimulus-response associations or operant conditioning.<br>Model-based: uses an internal representation of the environment, either given or learned during training, to choose the best policy (i.e., it estimates the probability distributions over next states and rewards for each action, then solves the resulting Markov decision process). | Model-free: Q-learning, REINFORCE, PPO, A3C<br>Model-based: value iteration, Monte Carlo tree search, Dyna |
| Learning strategy | Value-based: learns a value function, then derives a policy indirectly by applying a fixed rule that maps values to actions (e.g., greedy or ε-greedy).<br>Policy gradient: directly learns a parameterized policy (a distribution over actions conditioned on the state) by searching the space of policies and optimizing the parameters with gradient estimates. | Value-based: Monte Carlo methods, temporal-difference learning (Q-learning, SARSA), DQN<br>Policy gradient: REINFORCE, PPO, A3C |
| On-policy vs. off-policy | On-policy: learns from actions selected by the policy currently being executed in the MDP, so the behavior policy and the target policy are the same.<br>Off-policy: learns about a policy different from the one generating the agent's behavior (e.g., the greedy policy implied by the current state-action value estimates, experience replays, or another agent's policy). | On-policy: SARSA, REINFORCE, A2C, PPO<br>Off-policy: Q-learning |
| Representation | Tabular: values or policies are stored in tables; applies to problems with small, finite, discrete state and action spaces.<br>Function approximation: the value function (or policy) is approximated by a parameterized function; applies to large, continuous, or structured spaces. | Tabular: basic Q-learning, SARSA, TD(0), REINFORCE<br>Function approximation: DQN, PPO |
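
To make the model-free vs. model-based distinction concrete, the minimal sketch below solves the same MDP twice: value iteration (model-based) reads the known transition and reward tables directly, while Q-learning (model-free) only ever sees sampled transitions. The two-state MDP, discount factor, and step sizes are invented here purely for illustration and are not taken from any particular source.

```python
import random

import numpy as np

n_states, n_actions = 2, 2
# Hypothetical known model: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Model-based: value iteration solves the MDP directly from P and R.
V = np.zeros(n_states)
for _ in range(200):
    Q_model = R + gamma * P @ V        # Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * V(s')
    V = Q_model.max(axis=1)
print("value iteration V:", V.round(3))

# Model-free: Q-learning estimates action values purely from sampled
# transitions, never reading P or R directly.
rng = random.Random(0)

def sample_step(s, a):
    """The environment as a black box: sample a next state and return the reward."""
    s_next = rng.choices(range(n_states), weights=P[s, a])[0]
    return s_next, R[s, a]

Q = np.zeros((n_states, n_actions))
alpha, epsilon = 0.1, 0.1
s = 0
for _ in range(20000):
    a = rng.randrange(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = sample_step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print("Q-learning      V:", Q.max(axis=1).round(3))
```

The two estimates should roughly agree; the only difference is the information each learner is allowed to use.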
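
The learning-strategy row can be illustrated on a hypothetical three-armed bandit (a one-state MDP): the value-based learner estimates Q and then acts through a fixed ε-greedy rule, while the REINFORCE-style learner parameterizes a softmax policy directly and follows a gradient estimate of expected reward. Reward means, step sizes, and iteration counts below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])      # hypothetical reward means for the 3 arms

def pull(a):
    """Sample a reward for arm a."""
    return rng.normal(true_means[a], 0.1)

# Value-based: learn action values Q, then derive the policy indirectly with a
# fixed epsilon-greedy rule that maps values to actions.
Q, counts, epsilon = np.zeros(3), np.zeros(3), 0.1
for _ in range(2000):
    a = int(rng.integers(3)) if rng.random() < epsilon else int(Q.argmax())
    r = pull(a)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]          # incremental sample-average update
print("value-based Q:", Q.round(2), "-> greedy arm:", int(Q.argmax()))

# Policy gradient (REINFORCE): parameterize a softmax policy directly and move
# its parameters along a gradient estimate of expected reward; each pull is
# treated as a one-step episode, so the return is the immediate reward.
theta, alpha = np.zeros(3), 0.05
for _ in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                          # softmax policy over the 3 arms
    a = int(rng.choice(3, p=pi))
    r = pull(a)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                   # gradient of log softmax at the sampled arm
    theta += alpha * r * grad_log_pi        # REINFORCE update (no baseline)
print("policy-gradient pi:", pi.round(2))
```

Both learners end up favoring the best arm, but only the first ever stores value estimates; the second stores policy parameters.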
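
For the on-policy vs. off-policy row, the only change between SARSA and Q-learning in the sketch below is the bootstrap target: SARSA uses the action the ε-greedy behavior policy actually selects next, whereas Q-learning uses the greedy maximum regardless of what the behavior policy does. The five-state chain environment and hyperparameters are assumptions made for the example.

```python
import random

import numpy as np

n_states, n_actions = 5, 2                 # hypothetical chain; actions: 0 = left, 1 = right
gamma, alpha, epsilon = 0.95, 0.1, 0.1
rng = random.Random(0)

def step(s, a):
    """Move along the chain; reaching the right end (state 4) gives +1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

def eps_greedy(Q, s):
    return rng.randrange(n_actions) if rng.random() < epsilon else int(Q[s].argmax())

def run(on_policy):
    Q = np.zeros((n_states, n_actions))
    for _ in range(500):                   # episodes
        s = 0
        a = eps_greedy(Q, s)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = eps_greedy(Q, s_next)
            if on_policy:
                # SARSA: bootstrap on the action the behavior policy will actually take next.
                target = r + gamma * Q[s_next, a_next] * (not done)
            else:
                # Q-learning: bootstrap on the greedy action, whatever the behavior policy does.
                target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

print("SARSA      V:", run(on_policy=True).max(axis=1).round(2))
print("Q-learning V:", run(on_policy=False).max(axis=1).round(2))
```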
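
Finally, for the representation row, the sketch below runs TD(0) prediction on a hypothetical 20-state random walk twice: once with a tabular value array (one entry per state) and once with linear function approximation over a handful of state-aggregation features, trained by semi-gradient TD(0). The walk, feature grouping, and step size are invented for the example.

```python
import random

import numpy as np

n_states, alpha, gamma = 20, 0.05, 1.0     # hypothetical walk; +1 on the right exit, -1 on the left
rng = random.Random(0)

def episode():
    """Random walk under a fixed random policy; yields (s, reward, s_next), with s_next=None at termination."""
    s = n_states // 2
    while True:
        s_next = s + rng.choice([-1, 1])
        if s_next < 0:
            yield s, -1.0, None
            return
        if s_next >= n_states:
            yield s, 1.0, None
            return
        yield s, 0.0, s_next
        s = s_next

# Tabular: one value entry per state; exact, but storage grows with the state space.
V = np.zeros(n_states)
for _ in range(2000):
    for s, r, s_next in episode():
        target = r + gamma * (V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])

# Function approximation: v(s) ~ w . phi(s) with 4 state-aggregation features,
# trained by semi-gradient TD(0); only 4 weights, which generalize across states.
n_groups = 4
w = np.zeros(n_groups)

def phi(s):
    f = np.zeros(n_groups)
    f[s * n_groups // n_states] = 1.0      # one-hot feature marking the state's group
    return f

for _ in range(2000):
    for s, r, s_next in episode():
        v_next = w @ phi(s_next) if s_next is not None else 0.0
        w += alpha * (r + gamma * v_next - w @ phi(s)) * phi(s)

probe = [2, 10, 17]
print("tabular V:", V[probe].round(2))
print("approx  v:", np.array([w @ phi(s) for s in probe]).round(2))
```

The approximate values are coarser (all states in a group share one value) but need only four parameters, which is the trade-off function approximation makes to handle large or continuous spaces.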
