Speech And Audio — AI Engineering Course

01 Audio Fundamentals — Waveforms, Sampling, Fourier Transform

02 Spectrograms, Mel Scale & Audio Features

03 Audio Classification — From k-NN on MFCCs to AST and BEATs

04 Speech Recognition (ASR) — CTC, RNN-T, Attention

05 Whisper — Architecture & Fine-Tuning

06 Speaker Recognition & Verification

07 Text-to-Speech (TTS) — From Tacotron to F5 and Kokoro

08 Voice Cloning & Voice Conversion

09 Music Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake

10 Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio

11 Real-Time Audio Processing

12 Build a Voice Assistant Pipeline — The Phase 6 Capstone

13 Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split

14 Voice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush Trick

15 Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue

16 Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify

17 Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards