Ethics Safety Alignment
30 lessons
01 Instruction-Following as Alignment Signal
02 Reward Hacking and Goodhart's Law
03 The Direct Preference Optimization Family
04 Sycophancy as RLHF Amplification
05 Constitutional AI and RLAIF
06 Mesa-Optimization and Deceptive Alignment
07 Sleeper Agents — Persistent Deception
08 In-Context Scheming in Frontier Models
09 Alignment Faking
10 AI Control — Safety Despite Subversion
11 Scalable Oversight and Weak-to-Strong Generalization
12 Red-Teaming: PAIR and Automated Attacks
13 Many-Shot Jailbreaking
14 ASCII Art and Visual Jailbreaks
15 Indirect Prompt Injection — Production Attack Surface
16 Red-Team Tooling — Garak, Llama Guard, PyRIT
17 WMDP and Dual-Use Capability Evaluation
18 Frontier Safety Frameworks — RSP, PF, FSF
19 Anthropic's Model Welfare Program
20 Bias and Representational Harm in LLMs
21 Fairness Criteria — Group, Individual, Counterfactual
22 Differential Privacy for LLMs
23 Watermarking — SynthID, Stable Signature, C2PA
24 Regulatory Frameworks — EU, US, UK, Korea
25 EchoLeak and the Emergence of CVEs for AI
26 Model, System, and Dataset Cards
27 Data Provenance and Training-Data Governance
28 Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
29 Moderation Systems — OpenAI, Perspective, Llama Guard
30 Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift