
Explain like I'm five
Imagine you're teaching a puppy to fetch. If you sometimes throw the ball far, sometimes close, the puppy gets confused. Batch Normalization is like always throwing the ball from the same spot—it makes learning consistent and faster by keeping the inputs to each step balanced.

Why it matters
It allows you to use higher learning rates and reduces sensitivity to initialization, which means models train faster and more reliably. You encounter it in nearly all modern deep learning architectures, from image classifiers to language models.

Common misconception
Many think Batch Normalization normalizes the output of a layer, but it actually normalizes the input to the next layer. Also, people assume it works the same for small batch sizes, but it becomes unstable when batches are tiny (e.g., batch size 1).

Formal definition
Batch Normalization is a technique that normalizes the activations of a layer across the mini-batch dimension by subtracting the batch mean and dividing by the batch standard deviation, then applies learnable scale and shift parameters (gamma and beta) to restore representational power. It is typically applied before the activation function in a neural network layer.