
Explain like I'm five
Imagine you have a wise old chef who knows every recipe by heart, but he's slow and takes up the whole kitchen. You train a young apprentice to watch him cook and learn the key tricks, so the apprentice can make almost the same dishes quickly and in a tiny food truck. That way, you get the chef's expertise without needing his full kitchen.

Why it matters
Knowledge Distillation makes AI models smaller and faster, so they can run on phones, smartwatches, or other devices with limited power. You encounter it whenever a voice assistant responds instantly or a camera app recognizes objects offline.

Common misconception
Many think the student model just memorizes the teacher's answers, but it actually learns the teacher's 'soft' reasoning—like how confident the teacher is between similar choices. This subtlety is what gives the student its surprising accuracy, not just copying final labels.

Formal definition
Knowledge Distillation is a model compression method where a smaller 'student' model is trained to replicate the output distribution of a larger 'teacher' model, often using softened probabilities (via a temperature parameter) to capture inter-class similarities. The student minimizes a loss that combines its own prediction error with a distillation loss that matches the teacher's softened outputs. This allows the student to achieve performance close to the teacher while requiring far fewer parameters and computations.