
Explain like I'm five
Imagine you have a team of chefs, each an expert in a different cuisine—Italian, Mexican, Japanese. When a customer orders sushi, a host (the gate) quickly sends that order to the Japanese chef, not the others. This way, each chef only works on what they know best, saving time and making better food.

Why it matters
Mixture of Experts lets AI models grow much larger and more capable without a proportional increase in computational cost, because only a subset of experts is activated per input. It appears in language models such as Mixtral 8x7B and Google's Switch Transformer, which use it to combine the capacity of a very large model with per-token compute closer to that of a much smaller dense one.
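To make the cost argument concrete, here is a minimal arithmetic sketch. The function name and every number in it are hypothetical, chosen only for illustration; they are not the real sizes of Mixtral, Switch Transformer, or any other model.

```python
def moe_param_counts(num_experts, params_per_expert, shared_params, k):
    """Return (total, active) parameter counts for a hypothetical MoE model.

    shared_params covers everything that runs for every input
    (for example attention and embeddings); only k experts run per input.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# Hypothetical sizes, for illustration only.
total, active = moe_param_counts(
    num_experts=8,
    params_per_expert=5_000_000_000,
    shared_params=2_000_000_000,
    k=2,
)
print(f"total parameters:  {total:,}")
print(f"active per input:  {active:,}")
print(f"fraction active:   {active / total:.2f}")
```

Under these assumed sizes the model stores 42 billion parameters but uses 12 billion for any single input, which is the sense in which capacity grows faster than compute.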

Common misconception
A common misconception is that all experts are used for every input. In reality, only a small number of experts (often just 1 or 2) are activated per input, which in a language model means per token, and a learned router chooses them. Another mistake is thinking experts are completely independent; they are trained jointly with the router, so they specialize and complement each other.

Formal definition
Mixture of Experts (MoE) is a neural network architecture comprising multiple expert subnetworks and a trainable gating function. For each input, the gate either weights all experts softly or hard-selects a small subset of them (typically the top k by gate score), and the output is a weighted combination of the selected experts' outputs. With hard selection, model capacity can scale while compute per input grows sublinearly in the parameter count, because only a fraction of the parameters is used in each forward pass.
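The definition above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a reference implementation: the gate is assumed to be a single linear layer, the selected scores are renormalised with a softmax (one common choice among several), and the names moe_forward, make_scale_expert, and gate_weights are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, k=2):
    """Top-k MoE layer for a single input vector x (a list of floats)."""
    # Gate: one logit per expert, here a linear function of the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Hard selection: keep the k experts with the highest logits.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Soft weighting: renormalise over the selected experts only.
    weights = softmax([logits[i] for i in top])
    # Only the selected experts are evaluated; the others are skipped.
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights


def make_scale_expert(c):
    """Hypothetical stand-in for an expert network: scales its input by c."""
    return lambda x: [c * xi for xi in x]


experts = [make_scale_expert(c) for c in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # fixed, not learned

out, top, weights = moe_forward([2.0, 1.0], gate_weights, experts, k=2)
print("selected experts:", top)
print("mixing weights:  ", weights)
print("output:          ", out)
```

In a real model the gate weights and the experts are learned jointly, and training usually adds a load-balancing term so the router does not send every input to the same few experts; both are omitted here to keep the sketch short.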