Multimodal AI: AI Engineering Course

01 Vision Transformers and the Patch-Token Primitive

02 CLIP and Contrastive Vision-Language Pretraining

03 From CLIP to BLIP-2: Q-Former as Modality Bridge

04 Flamingo and Gated Cross-Attention for Few-Shot VLMs

05 LLaVA and Visual Instruction Tuning

06 Any-Resolution Vision: Patch-n'-Pack and NaFlex

07 Open-Weight VLM Recipes: What Actually Matters

08 LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model

09 Qwen-VL Family and Dynamic-FPS Video

10 InternVL3: Native Multimodal Pretraining

11 Chameleon and Early-Fusion Token-Only Multimodal Models

12 Emu3: Next-Token Prediction for Image and Video Generation

13 Transfusion: Autoregressive Text + Diffusion Image in One Transformer

14 Show-o and Discrete-Diffusion Unified Models

15 Janus-Pro: Decoupled Encoders for Unified Multimodal Models

16 MIO and Any-to-Any Streaming Multimodal Models

17 Video-Language Models: Temporal Tokens and Grounding

18 Long-Video Understanding at Million-Token Context

19 Audio-Language Models: the Whisper to Audio Flamingo 3 Arc

20 Omni Models: Qwen2.5-Omni and the Thinker-Talker Split

21 Embodied VLAs: RT-2, OpenVLA, π0, GR00T

22 Document and Diagram Understanding

23 ColPali and Vision-Native Document RAG

24 Multimodal RAG and Cross-Modal Retrieval

25 Multimodal Agents and Computer-Use (Capstone)