01 Instruction-Following as Alignment Signal
02 Reward Hacking and Goodhart's Law
03 The Direct Preference Optimization Family
04 Sycophancy as RLHF Amplification
05 Constitutional AI and RLAIF
06 Mesa-Optimization and Deceptive Alignment
07 Sleeper Agents — Persistent Deception
08 In-Context Scheming in Frontier Models
09 Alignment Faking
10 AI Control — Safety Despite Subversion
11 Scalable Oversight and Weak-to-Strong Generalization
12 Red-Teaming: PAIR and Automated Attacks
13 Many-Shot Jailbreaking
14 ASCII Art and Visual Jailbreaks
15 Indirect Prompt Injection — Production Attack Surface
16 Red-Team Tooling — Garak, Llama Guard, PyRIT
17 WMDP and Dual-Use Capability Evaluation
18 Frontier Safety Frameworks — RSP, PF, FSF
19 Anthropic's Model Welfare Program
20 Bias and Representational Harm in LLMs
21 Fairness Criteria — Group, Individual, Counterfactual
22 Differential Privacy for LLMs
23 Watermarking — SynthID, Stable Signature, C2PA
24 Regulatory Frameworks — EU, US, UK, Korea
25 EchoLeak and the Emergence of CVEs for AI
26 Model, System, and Dataset Cards
27 Data Provenance and Training-Data Governance
28 Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
29 Moderation Systems — OpenAI, Perspective, Llama Guard
30 Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift