Multimodal AI
25 lessons
01 Vision Transformers and the Patch-Token Primitive
02 CLIP and Contrastive Vision-Language Pretraining
03 From CLIP to BLIP-2: Q-Former as Modality Bridge
04 Flamingo and Gated Cross-Attention for Few-Shot VLMs
05 LLaVA and Visual Instruction Tuning
06 Any-Resolution Vision: Patch-n'-Pack and NaFlex
07 Open-Weight VLM Recipes: What Actually Matters
08 LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
09 Qwen-VL Family and Dynamic-FPS Video
10 InternVL3: Native Multimodal Pretraining
11 Chameleon and Early-Fusion Token-Only Multimodal Models
12 Emu3: Next-Token Prediction for Image and Video Generation
13 Transfusion: Autoregressive Text + Diffusion Image in One Transformer
14 Show-o and Discrete-Diffusion Unified Models
15 Janus-Pro: Decoupled Encoders for Unified Multimodal Models
16 MIO and Any-to-Any Streaming Multimodal Models
17 Video-Language Models: Temporal Tokens and Grounding
18 Long-Video Understanding at Million-Token Context
19 Audio-Language Models: the Whisper to Audio Flamingo 3 Arc
20 Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
21 Embodied VLAs: RT-2, OpenVLA, π0, GR00T
22 Document and Diagram Understanding
23 ColPali and Vision-Native Document RAG
24 Multimodal RAG and Cross-Modal Retrieval
25 Multimodal Agents and Computer-Use (Capstone)