
Self-Attention vs Cross-Attention

Learn the difference between self-attention and cross-attention, how information flows in each mechanism, and why both matter for transformers, encoder-decoder systems, and multimodal models.

Self-attention and cross-attention use the same underlying attention idea, but they differ in where the information comes from. In self-attention, a sequence attends to itself. In cross-attention, one sequence attends to another sequence or modality.

That difference is simple to state, but it matters a lot for how transformers move information.

Self-Attention: Looking Within the Same Sequence

In self-attention, the queries, keys, and values are all derived from the same input sequence.

If the model is processing:

The model answered the question because the context was relevant

then each token can compare itself with every other token in that same sentence or prompt.

This lets the model discover relationships such as:

  • which noun a pronoun might refer to
  • which adjectives modify which nouns
  • which earlier phrases define the topic
  • which distant tokens shape the meaning of the current one

Self-attention is the default mechanism inside most transformer blocks.
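As a concrete sketch, here is minimal single-head self-attention in NumPy. The token embeddings and projection matrices are randomly initialized placeholders, not a trained model; the point is only that queries, keys, and values all come from the same sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions for illustration: 5 tokens, model width 8.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))   # embeddings of ONE sequence

# Learned projections (random here, for illustration only).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Self-attention: Q, K, and V are all derived from the SAME X.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)   # (5, 5): every token vs every other token
weights = softmax(scores, axis=-1)    # each row is a distribution over tokens
output = weights @ V                  # (5, 8): context-aware token vectors

print(output.shape)  # (5, 8)
```

Each output row is a mixture of value vectors from the same sentence, which is exactly the "looking within" behavior described above.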

Cross-Attention: Looking Outside the Current Sequence

In cross-attention, the queries come from one representation, while the keys and values come from another.

That means one sequence is not just consulting itself. It is consulting an external source of information.

This is useful when one representation needs to be conditioned on another, such as:

  • a decoder attending to encoder outputs in translation
  • a text decoder attending to image features in multimodal models
  • a generation step attending to retrieved memory

So self-attention builds internal context, while cross-attention injects external context.
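The same scoring rule with a different wiring gives cross-attention. In this hedged NumPy sketch (random weights, made-up shapes), the queries come from the current sequence while the keys and values come from an external one:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 4 tokens in the current sequence, 6 external states.
rng = np.random.default_rng(1)
d_model = 8
X_cur = rng.normal(size=(4, d_model))   # current sequence (source of queries)
H_ext = rng.normal(size=(6, d_model))   # external sequence (keys and values)

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Cross-attention: Q from the current sequence, K and V from the external one.
Q = X_cur @ W_q
K, V = H_ext @ W_k, H_ext @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)  # (4, 6)
output = weights @ V                                    # (4, 8)
```

Note the attention matrix is rectangular: each of the 4 current tokens distributes its attention over the 6 external states, which is the "injecting external context" step.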

The Fastest Way to Remember the Difference

Use this rule:

  • self-attention = "what matters inside my own sequence?"
  • cross-attention = "what matters in that other sequence for what I am doing right now?"

That framing is more useful than memorizing implementation details in isolation.

Query, Key, and Value Source Difference

The key structural difference is the source of the projections.

| Mechanism       | Queries come from | Keys come from    | Values come from  |
|-----------------|-------------------|-------------------|-------------------|
| Self-attention  | Current sequence  | Current sequence  | Current sequence  |
| Cross-attention | Current sequence  | External sequence | External sequence |

This means self-attention is about internal interaction, while cross-attention is about conditional interaction.
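That structural point can be made explicit in code: both mechanisms are one function whose only knob is where the keys and values come from. The weights and shapes in this NumPy sketch are arbitrary placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_source, kv_source, W_q, W_k, W_v):
    """One scoring mechanism; only the K/V source distinguishes the two modes."""
    Q = q_source @ W_q
    K = kv_source @ W_k
    V = kv_source @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(5, d))  # current sequence
Y = rng.normal(size=(7, d))  # external sequence

self_out = attention(X, X, W_q, W_k, W_v)   # self-attention: internal interaction
cross_out = attention(X, Y, W_q, W_k, W_v)  # cross-attention: conditional interaction
```

Passing the same tensor twice gives self-attention; passing a different tensor as `kv_source` gives cross-attention. Nothing else changes.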

A Translation Example

Suppose an encoder processes a source sentence in French, while a decoder generates an English sentence.

The decoder still needs self-attention to understand the partial English output generated so far. But it also needs access to the encoded French representation.

That is where cross-attention enters:

  • decoder queries come from the partial English sequence
  • keys and values come from the encoded French sequence

This lets the decoder ask, in effect, "Which parts of the source sentence should influence the next word I generate?"
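That two-step flow can be sketched numerically. To keep the sketch short, the projection matrices are omitted (effectively identity weights) and the state sizes are invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V, mask=None):
    # Same scoring rule in both steps; only the K/V source differs.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
english_partial = rng.normal(size=(3, d))  # decoder states: 3 English tokens so far
french_encoded = rng.normal(size=(5, d))   # encoder outputs: 5 French tokens

# Step 1: causal self-attention over the partial English output.
causal = np.tril(np.ones((3, 3), dtype=bool))
h = attend(english_partial, english_partial, english_partial, mask=causal)

# Step 2: cross-attention, with English queries against French keys/values.
h = attend(h, french_encoded, french_encoded)
print(h.shape)  # (3, 8)
```

The causal mask keeps each English position from seeing future output, while the cross-attention step has no mask: every decoder position may consult the entire encoded source sentence.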

Why Self-Attention Alone Is Not Always Enough

If the model only had self-attention inside the current sequence, it could organize what it already has, but it could not directly pull in aligned information from another source.

That becomes limiting in tasks where output depends on external structure:

  • translation
  • speech-to-text
  • image captioning
  • grounded generation
  • multimodal reasoning

Cross-attention gives the model a formal path for importing those external signals.

Where You See Self-Attention Most Often

Self-attention dominates in:

  • standard decoder-only LLMs
  • encoder stacks
  • contextual representation learning
  • prompt processing in generative models

Whenever the goal is to make each token context-aware relative to the rest of the same sequence, self-attention is the tool.

Where You See Cross-Attention Most Often

Cross-attention shows up when one representation must condition on another:

  • encoder-decoder transformers
  • text-to-image and image-to-text systems
  • retrieval-conditioned generation
  • systems that merge user input with external memory or tool outputs

It is especially important in architectures where the model needs controlled access to a source it did not generate itself.

Why This Matters in Multimodal Models

Multimodal AI gives a good modern example.

If a model answers a question about an image, the text side cannot rely on self-attention alone. It must be able to attend to visual representations. That is exactly the kind of information flow cross-attention supports.

So when people talk about a model grounding its text generation in an image, an audio clip, or a retrieved document, cross-attention is often part of the mechanism that makes that grounding possible.

Common Misunderstandings

Is cross-attention a completely different mechanism?

No. The scoring idea is the same. What changes is where the queries, keys, and values come from.

Do decoder-only LLMs use cross-attention everywhere?

Not typically in the standard base architecture. They mainly rely on self-attention over the prompt and generated context, though additional modules may introduce cross-attention-like behavior in extended systems.

Is self-attention weaker because it has no external source?

No. It solves a different problem. Self-attention is what gives a sequence rich internal context.

FAQ

What is the shortest distinction?

Self-attention connects tokens within the same sequence. Cross-attention connects one sequence to another.

Why do encoder-decoder models need both?

Because the decoder must understand its own partial output and also consult the encoded source sequence.

Why is cross-attention useful in multimodal models?

Because text representations need a principled way to retrieve relevant information from images, audio, or other external inputs.

Which one is more important?

Neither in isolation. They serve different roles in moving information through transformer systems.
