This page provides an annotated view of a Transformer implementation. The left column describes each line's purpose; the right column shows the corresponding Python code, in the style of nn.labml.ai.
Computes the dot product between the query Q and the key K, then divides by √dₖ to stabilize gradients.
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
Applies a softmax over the scores to convert them into attention weights (probabilities).
attn = torch.softmax(scores, dim=-1)
Multiplies the attention weights by the values V to produce the attention output.
output = torch.matmul(attn, V)
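Taken together, these three lines implement scaled dot-product attention. Below is a minimal, self-contained sketch of how they might be wrapped into one function; the function name and the optional mask handling are assumptions, not part of the code above.

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, ..., seq_len, d_k)
    d_k = Q.size(-1)
    # similarity scores, scaled by sqrt(d_k) to keep the softmax well-behaved
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # positions where the mask is 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # normalize the scores into attention weights over the keys
    attn = torch.softmax(scores, dim=-1)
    # weighted sum of the values
    output = torch.matmul(attn, V)
    return output, attn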
Defines separate linear projections for Q, K, and V.
self.w_q = nn.Linear(d_model, d_model)
self.w_k = nn.Linear(d_model, d_model)
self.w_v = nn.Linear(d_model, d_model)
x = x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)  # reshape to (batch, num_heads, seq_len, d_k) so each head attends independently
out, attn = self.attention(Q, K, V, mask)  # scaled dot-product attention applied per head
out = self.w_o(torch.cat(heads, dim=-1))  # concatenate the heads and project back to d_model
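For context, here is one way the projection, head-splitting, attention, and output-projection lines can fit together in a module. This is a sketch rather than the page's exact implementation; the class name, constructor arguments, and the reuse of scaled_dot_product_attention from the sketch above are assumptions.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # separate projections for queries, keys, and values
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # output projection applied after the heads are recombined
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch, seq_len, _ = x.shape
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        batch, seq_len, _ = q.shape
        Q = self.split_heads(self.w_q(q))
        K = self.split_heads(self.w_k(k))
        V = self.split_heads(self.w_v(v))
        out, attn = scaled_dot_product_attention(Q, K, V, mask)
        # merge heads: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, self.num_heads * self.d_k)
        return self.w_o(out), attn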
x = x + self.dropout(attn_out)  # residual connection around the self-attention sub-layer
x = self.norm1(x)  # layer normalization after the attention sub-layer
x = x + self.dropout(ff_out)  # residual connection around the feed-forward sub-layer
x = self.norm2(x)  # layer normalization after the feed-forward sub-layer
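These four lines are the residual-and-normalization skeleton of an encoder layer: each sub-layer's output is added back to its input and then layer-normalized. A sketch of the surrounding layer, assuming a simple two-linear-layer feed-forward network and the post-norm ordering shown above:

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        # position-wise feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # self-attention sub-layer with residual connection and layer norm
        attn_out, _ = self.mha(x, x, x, mask=mask)
        x = x + self.dropout(attn_out)
        x = self.norm1(x)
        # feed-forward sub-layer with residual connection and layer norm
        ff_out = self.ff(x)
        x = x + self.dropout(ff_out)
        x = self.norm2(x)
        return x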
masked_out, masked_attn = self.masked_mha(x, x, x, mask=tgt_mask)  # masked self-attention over the target sequence
encdec_out, encdec_attn = self.mha(x, enc_out, enc_out, mask=src_mask)  # encoder-decoder (cross) attention over the encoder output
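These two calls are the core of a decoder layer: masked self-attention over the target sequence, followed by encoder-decoder attention in which the queries come from the decoder and the keys and values come from the encoder output enc_out. The sketch below shows one plausible way to wire them into a full layer; the residual/norm placement and the feed-forward block mirror the encoder layer above and are assumptions.

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_mha = MultiHeadAttention(d_model, num_heads)
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # masked self-attention: each target position attends only to earlier positions
        masked_out, masked_attn = self.masked_mha(x, x, x, mask=tgt_mask)
        x = self.norm1(x + self.dropout(masked_out))
        # cross-attention: queries from the decoder, keys/values from the encoder output
        encdec_out, encdec_attn = self.mha(x, enc_out, enc_out, mask=src_mask)
        x = self.norm2(x + self.dropout(encdec_out))
        # feed-forward sub-layer with residual connection and layer norm
        x = self.norm3(x + self.dropout(self.ff(x)))
        return x, masked_attn, encdec_attn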