Self-attention and cross-attention are transformer attention mechanisms that differ in where they get information from. In self-attention, a sequence attends to itself. In cross-attention, one sequence attends to another sequence, source, or modality.
The short version is this: self-attention builds context inside the current sequence, while cross-attention injects information from outside the current sequence.
| Question | Self-attention | Cross-attention |
|---|---|---|
| What attends to what? | Tokens attend to other tokens in the same sequence | One sequence attends to another sequence or modality |
| Where do queries come from? | Current sequence | Current sequence, usually the decoder or target representation |
| Where do keys and values come from? | Current sequence | External sequence, encoder output, retrieved context, image features, or memory |
| Common use | Decoder-only LLMs, encoder blocks, prompt context | Encoder-decoder models, multimodal models, retrieval-conditioned systems |
| Simple memory rule | "What matters inside this sequence?" | "What external information matters for this sequence?" |
Self-Attention: Looking Within the Same Sequence
In self-attention, the queries, keys, and values are all derived from the same input sequence.
If the model is processing the sentence "The model answered the question because the context was relevant", then each token can compare itself with every other token in that same sentence or prompt.
This lets the model discover relationships such as:
- which noun a pronoun might refer to
- which adjectives modify which nouns
- which earlier phrases define the topic
- which distant tokens shape the meaning of the current one
Self-attention is the default mechanism inside most transformer blocks.
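To make the "same sequence" idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes identity projections (real layers apply learned query, key, and value matrices first), and the function names and toy embeddings are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption: identity projections, so queries, keys, and values are
    all x itself. Real layers apply learned matrices first.
    """
    d = len(x[0])
    outputs, weights = [], []
    for query in x:  # every token takes a turn as the query
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors (also taken from x).
        outputs.append([sum(w * value[j] for w, value in zip(row, x)) for j in range(d)])
    return outputs, weights

# Hypothetical 2-dimensional embeddings for a 3-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(len(weights), len(weights[0]))  # 3 3: one weight per (query token, key token) pair
```

The weight matrix is square because the same three tokens supply both the queries and the keys.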
Cross-Attention: Looking Outside the Current Sequence
In cross-attention, the queries come from one representation, while the keys and values come from another.
That means one sequence is not just consulting itself. It is consulting an external source of information.
This is useful when one representation needs to be conditioned on another, such as:
- a decoder attending to encoder outputs in translation
- a text decoder attending to image features in multimodal models
- a generation step attending to retrieved memory
So self-attention builds internal context, while cross-attention injects external context.
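The sketch below shows one way cross-attention can be written, under stated assumptions: the projection matrices `w_q`, `w_k`, and `w_v` are hand-picked stand-ins for learned weights, and the names `x` and `memory` are hypothetical labels for the current and external sequences. Note that the two sequences can differ in both length and feature size.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(x, memory, w_q, w_k, w_v):
    q = matmul(x, w_q)       # queries: from the current sequence
    k = matmul(memory, w_k)  # keys: from the external sequence
    v = matmul(memory, w_v)  # values: from the external sequence
    d_k = len(k[0])
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
        for qi in q
    ]
    return matmul(weights, v), weights

# Hypothetical shapes: 2 current tokens (dim 2), 4 external items (dim 3).
x = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]              # 2 x 2 (stand-in for learned weights)
w_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
w_v = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # 3 x 2

output, weights = cross_attention(x, memory, w_q, w_k, w_v)
print(len(output), len(output[0]))    # 2 2: one output row per query token
print(len(weights), len(weights[0]))  # 2 4: each query token weighs all 4 external items
```

The output keeps the length of the current sequence, while the weight matrix is rectangular: rows follow the queries, columns follow the external items.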
What Is the Difference Between Self-Attention and Cross-Attention?
The difference between self-attention and cross-attention is the source of the keys and values. Self-attention uses queries, keys, and values from the same sequence. Cross-attention uses queries from the current sequence but keys and values from an external sequence or representation.
This source difference changes the job of the attention layer:
- self-attention helps tokens understand each other inside the same prompt, sentence, document, or generated output
- cross-attention lets a model condition one representation on another, such as a decoder attending to encoder output
That is why self-attention is central to decoder-only LLMs, while cross-attention is common in encoder-decoder, multimodal, and retrieval-augmented systems.
The Fastest Way to Remember the Difference
Use this rule:
- self-attention = "what matters inside my own sequence?"
- cross-attention = "what matters in that other sequence for what I am doing right now?"
That framing is more useful than memorizing implementation details in isolation.
Query, Key, and Value Source Difference
The key structural difference is the source of the projections.
| Mechanism | Queries come from | Keys come from | Values come from |
|---|---|---|---|
| Self-attention | Current sequence | Current sequence | Current sequence |
| Cross-attention | Current sequence | External sequence | External sequence |
This means self-attention is about internal interaction, while cross-attention is about conditional interaction.
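The table can also be written in the usual scaled dot-product notation. The symbols here are assumptions introduced for illustration: $X$ is the current sequence, $M$ is the external sequence, $W_Q$, $W_K$, $W_V$ are learned projection matrices, and $d_k$ is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\text{Self-attention:}\quad Q = X W_Q,\quad K = X W_K,\quad V = X W_V

\text{Cross-attention:}\quad Q = X W_Q,\quad K = M W_K,\quad V = M W_V
```

The scoring formula is identical in both cases; only the inputs to the key and value projections change.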
For a deeper explanation of the scoring mechanism itself, see the related guide on what attention means in transformers.
A Translation Example
Suppose an encoder processes a source sentence in French, while a decoder generates an English sentence.
The decoder still needs self-attention to understand the partial English output generated so far. But it also needs access to the encoded French representation.
That is where cross-attention enters:
- decoder queries come from the partial English sequence
- keys and values come from the encoded French sequence
This lets the decoder ask, in effect, "Which parts of the source sentence should influence the next word I generate?"
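A toy version of that question can be sketched as follows. Everything here is hypothetical: the French sentence, the hand-picked encoder states, and the decoder query vector are chosen so the result is easy to read, and projections are left as identity for brevity.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder outputs for the French source "le chat dort".
source_tokens = ["le", "chat", "dort"]
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical decoder state after generating "the", about to produce the next word.
decoder_query = [0.2, 2.0, 0.1]

d = len(decoder_query)
# Keys come from the encoder: score each source token against the decoder query.
scores = [sum(q * k for q, k in zip(decoder_query, key)) / math.sqrt(d) for key in encoder_states]
weights = softmax(scores)

# Values also come from the encoder: the context vector is their weighted mix.
context = [sum(w * state[j] for w, state in zip(weights, encoder_states)) for j in range(d)]

most_attended = source_tokens[weights.index(max(weights))]
print(most_attended)  # chat
```

In a trained model the query and encoder states are learned rather than hand-picked, but the flow is the same: the decoder position scores every source token, then reads a weighted mix of their values.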
Encoder-decoder attention predates transformers, but the transformer applies it in every decoder layer and without recurrence, which is one reason it proved more flexible than recurrent models. The broader architecture shift is explained in why transformers replaced RNNs.
Why Self-Attention Alone Is Not Always Enough
If the model only had self-attention inside the current sequence, it could organize what it already has, but it could not directly pull in aligned information from another source.
That becomes limiting in tasks where output depends on external structure:
- translation
- speech-to-text
- image captioning
- grounded generation
- multimodal reasoning
Cross-attention gives the model a formal path for importing those external signals.
Where You See Self-Attention Most Often
Self-attention dominates in:
- standard decoder-only LLMs
- encoder stacks
- contextual representation learning
- prompt processing in generative models
Whenever the goal is to make each token context-aware relative to the rest of the same sequence, self-attention is the tool.
Where You See Cross-Attention Most Often
Cross-attention shows up when one representation must condition on another:
- encoder-decoder transformers
- text-to-image and image-to-text systems
- retrieval-conditioned generation
- systems that merge user input with external memory or tool outputs
It is especially important in architectures where the model needs controlled access to a source it did not generate itself.
Is Cross-Attention Used in Transformers?
Yes. Cross-attention is used in many transformer architectures, especially when a model must condition one sequence on another. The classic example is an encoder-decoder transformer: the decoder uses self-attention over the tokens it has generated so far, then cross-attention over the encoder output.
Cross-attention is also common in systems that combine modalities or external sources:
- text generation conditioned on an image
- captioning systems conditioned on visual features
- speech or audio systems conditioned on acoustic representations
- generation systems conditioned on retrieved documents
Not every transformer uses cross-attention. Decoder-only LLMs often rely mainly on masked self-attention over the prompt and generated context. Cross-attention appears when the architecture needs a distinct external representation.
Why This Matters in Multimodal Models
Multimodal AI gives a good modern example.
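If a model answers a question about an image, self-attention over the text tokens alone is not enough. The text side must be able to attend to visual representations. That is exactly the kind of information flow cross-attention supports, although some systems instead project visual features into the token sequence and reuse self-attention.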
So when people talk about a model grounding its text generation in an image, an audio clip, or a retrieved document, cross-attention is often part of the mechanism that makes that grounding possible.
Do LLMs Use Self-Attention or Cross-Attention?
Most decoder-only LLMs mainly use self-attention. The model attends over the prompt tokens and previously generated tokens to decide what should come next.
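The "prompt tokens and previously generated tokens" restriction is usually enforced with a causal mask. The sketch below is a simplified illustration assuming identity projections and hypothetical embeddings; it skips future positions directly, which has the same effect as the common practice of setting their scores to negative infinity before the softmax.

```python
import math

def causal_self_attention_weights(x):
    """Attention weights where position i only sees positions 0..i.

    Assumption: identity projections; real models use learned ones.
    """
    d = len(x[0])
    weights = []
    for i, query in enumerate(x):
        # Score only the current and earlier tokens; later ones are masked out.
        scores = [sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d) for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(x) - i - 1))
    return weights

# Hypothetical embeddings: two prompt tokens followed by two generated tokens.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights = causal_self_attention_weights(sequence)
print(weights[0])  # [1.0, 0.0, 0.0, 0.0]: the first token can only attend to itself
```

The resulting weight matrix is lower-triangular, so no position can draw information from tokens that come after it.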
Some LLM systems add cross-attention-like behavior through extra architecture, adapters, retrieval modules, or multimodal components. For example, a text model that answers questions about images may need a mechanism for text representations to attend to visual features.
So the practical answer is:
- base decoder-only LLMs mostly use self-attention
- encoder-decoder transformers use both self-attention and cross-attention
- multimodal and retrieval-conditioned systems may use cross-attention or similar conditioning mechanisms
Position information also matters because attention alone does not inherently know token order. That topic is covered in positional encoding explained.
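A small experiment illustrates the order-blindness claim. This sketch assumes identity projections and no positional encoding, with hypothetical toy embeddings: reversing the input simply reverses the output rows, so each token receives the same vector regardless of where it sits.

```python
import math

def self_attention(x):
    """Unmasked self-attention with identity projections (an assumption)."""
    d = len(x[0])
    outputs = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * value[j] for e, value in zip(exps, x)) for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])

# Undo the reversal on the second result and compare row by row.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_f, row_b in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_b)
)
print(same)  # True
```

Because the outputs carry no trace of the original ordering, models add position information separately.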
Common Misunderstandings
Is cross-attention a completely different mechanism?
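No. The scoring idea is the same. What changes is where the keys and values come from: the current sequence in self-attention, an external sequence or representation in cross-attention. The queries come from the current sequence in both cases.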
Do decoder-only LLMs use cross-attention everywhere?
Not typically in the standard base architecture. They mainly rely on self-attention over the prompt and generated context, though additional modules may introduce cross-attention-like behavior in extended systems.
Is self-attention weaker because it has no external source?
No. It solves a different problem. Self-attention is what gives a sequence rich internal context.
FAQ
What is the difference between self-attention and cross-attention?
Self-attention compares tokens within the same sequence. Cross-attention lets one sequence query another sequence or representation.
Is cross-attention used in transformers?
Yes. Cross-attention is used in encoder-decoder transformers, multimodal transformers, and systems where generation depends on an external representation.
When is cross-attention used instead of self-attention?
Cross-attention is used when the model needs information from a separate source, such as encoder output, image features, retrieved documents, or memory.
Do LLMs use self-attention or cross-attention?
Decoder-only LLMs mostly use self-attention. Encoder-decoder LLMs and multimodal systems may use both self-attention and cross-attention.
Why do encoder-decoder models need both self-attention and cross-attention?
Because the decoder must understand its own partial output and also consult the encoded source sequence.
Why is cross-attention useful in multimodal models?
Because text representations need a principled way to retrieve relevant information from images, audio, or other external inputs.
Which one is more important?
Neither in isolation. They serve different roles in moving information through transformer systems.