Reinforcement Learning — AI Engineering Course

01 MDPs, States, Actions & Rewards

02 Dynamic Programming — Policy Iteration & Value Iteration

03 Monte Carlo Methods — Learning from Complete Episodes

04 Temporal Difference — Q-Learning & SARSA

05 Deep Q-Networks (DQN)

06 Policy Gradient — REINFORCE from Scratch

07 Actor-Critic — A2C and A3C

08 Proximal Policy Optimization (PPO)

09 Reward Modeling & RLHF

10 Multi-Agent RL

11 Sim-to-Real Transfer

12 RL for Games — AlphaZero, MuZero, and the LLM-Reasoning Era