
Self-Attention

Single Headed

[Figure: matrix dimensions in single-headed self-attention]
  • The diagram leaves out the softmax operation and the scaling step (dividing the scores by the square root of d_model), because neither operation changes the dimensions of the matrices involved.
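
To make the dimension bookkeeping concrete, here is a minimal sketch in plain Python that tracks only matrix shapes, in the same spirit as the diagram. The names `matmul_shape` and `single_head_shapes` are hypothetical helpers, and the sketch assumes square projection matrices W_Q, W_K, W_V of shape (d_model, d_model), which the original does not state.

```python
def matmul_shape(a, b):
    """Return the shape of A @ B for 2-D shapes a=(m, k) and b=(k, n)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} @ {b}")
    return (a[0], b[1])


def single_head_shapes(seq_len, d_model):
    """Trace matrix shapes through single-headed self-attention."""
    x = (seq_len, d_model)   # input: one row of size d_model per token
    w = (d_model, d_model)   # assumed shape of W_Q, W_K and W_V

    q = matmul_shape(x, w)   # queries
    k = matmul_shape(x, w)   # keys
    v = matmul_shape(x, w)   # values

    k_t = (k[1], k[0])       # transposing K swaps its two dimensions
    # Scaling and softmax act elementwise / row-wise on the scores,
    # so the shape after them is the same as the shape of Q @ K^T.
    scores = matmul_shape(q, k_t)
    output = matmul_shape(scores, v)
    return {"Q": q, "K": k, "V": v, "scores": scores, "output": output}


shapes = single_head_shapes(seq_len=4, d_model=8)
for name, shape in shapes.items():
    print(name, shape)
```

With 4 tokens and d_model = 8, the score matrix is 4 x 4 and the output has the same shape as the input, 4 x 8.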

Multiheaded

[Figure: matrix dimensions in multi-headed self-attention]
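  • The diagram leaves out the softmax operation and the scaling step, because neither operation changes the dimensions of the matrices involved. Note that in the standard multi-head formulation the scores are divided by the square root of the per-head key dimension d_k = d_model / h, where h is the number of heads, rather than by the square root of d_model.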
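
The same shape-only bookkeeping can be sketched for the multi-headed case. This is an illustration under stated assumptions, not a reconstruction of the figure: it assumes the common convention where each of the h heads projects to d_k = d_model / h columns, the head outputs are concatenated, and an output projection W_O of shape (d_model, d_model) is applied. `matmul_shape` and `multi_head_shapes` are hypothetical helper names.

```python
def matmul_shape(a, b):
    """Return the shape of A @ B for 2-D shapes a=(m, k) and b=(k, n)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} @ {b}")
    return (a[0], b[1])


def multi_head_shapes(seq_len, d_model, num_heads):
    """Trace matrix shapes through multi-headed self-attention."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads   # assumed per-head width

    x = (seq_len, d_model)       # input: one row per token
    w_head = (d_model, d_k)      # assumed per-head W_Q, W_K, W_V shape

    # Every head has the same shapes, so tracing one head is enough.
    q = matmul_shape(x, w_head)
    k = matmul_shape(x, w_head)
    v = matmul_shape(x, w_head)

    # Scaling and softmax leave the score shape unchanged.
    scores = matmul_shape(q, (k[1], k[0]))
    head_output = matmul_shape(scores, v)

    # Concatenating the heads along the column axis restores d_model.
    concat = (head_output[0], head_output[1] * num_heads)
    w_o = (d_model, d_model)     # assumed output projection
    output = matmul_shape(concat, w_o)

    return {
        "per_head_Q": q,
        "scores": scores,
        "head_output": head_output,
        "concat": concat,
        "output": output,
    }


mh_shapes = multi_head_shapes(seq_len=4, d_model=8, num_heads=2)
for name, shape in mh_shapes.items():
    print(name, shape)
```

The score matrix of each head stays seq_len x seq_len regardless of the number of heads. Only the width of Q, K, V and the per-head output shrinks to d_k, and concatenation brings the result back to seq_len x d_model.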