Artificial Intelligence Difficulty 65/100

Self-Attention

Every eye sees all eyes.

⚡ The 5-second answer

Self-attention lets a model weigh the importance of every word in a sentence when processing each word, capturing relationships regardless of distance.

Explain like I'm five

Imagine you're reading a sentence and need to understand each word by looking at all the other words around it, like when you see 'it' in 'The cat chased the mouse because it was hungry' and you check every word to figure out 'it' means 'the cat'. Self-attention does this for every word at once, making sure nothing gets missed.

Why it matters

Self-attention is the core mechanism behind transformers, which power models like GPT and BERT, enabling them to handle long-range dependencies in text, translation, and even images. It's why AI can now write coherent paragraphs, summarize articles, and understand context in search queries.

Common misconception

Many think self-attention means the model pays attention to itself in a narcissistic way, but it actually means each element in a sequence attends to all other elements. It's not about self-focus, but about relating every piece of data to every other piece.

Formal definition

Self-attention computes a weighted sum of all elements in a sequence for each element, where weights are derived from pairwise similarity scores (usually via dot products of learned query, key, and value vectors). This allows each position to dynamically aggregate information from the entire sequence, enabling context-aware representations without recurrence or convolution.