01 Vision Transformers and the Patch-Token Primitive
02 CLIP and Contrastive Vision-Language Pretraining
03 From CLIP to BLIP-2 — Q-Former as Modality Bridge
04 Flamingo and Gated Cross-Attention for Few-Shot VLMs
05 LLaVA and Visual Instruction Tuning
06 Any-Resolution Vision: Patch-n'-Pack and NaFlex
07 Open-Weight VLM Recipes: What Actually Matters
08 LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
09 Qwen-VL Family and Dynamic-FPS Video
10 InternVL3: Native Multimodal Pretraining
11 Chameleon and Early-Fusion Token-Only Multimodal Models
12 Emu3: Next-Token Prediction for Image and Video Generation
13 Transfusion: Autoregressive Text + Diffusion Image in One Transformer
14 Show-o and Discrete-Diffusion Unified Models
15 Janus-Pro: Decoupled Encoders for Unified Multimodal Models
16 MIO and Any-to-Any Streaming Multimodal Models
17 Video-Language Models: Temporal Tokens and Grounding
18 Long-Video Understanding at Million-Token Context
19 Audio-Language Models: the Whisper to Audio Flamingo 3 Arc
20 Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
21 Embodied VLAs: RT-2, OpenVLA, π0, GR00T
22 Document and Diagram Understanding
23 ColPali and Vision-Native Document RAG
24 Multimodal RAG and Cross-Modal Retrieval
25 Multimodal Agents and Computer-Use (Capstone)