Transformer_Attention

Transformer — Modular Rebuild (Code-by-Code Explanation)

This project contains a modular, educational implementation of the Transformer model with explicit mapping between the mathematical formulas from “Attention Is All You Need” and code.

Overview

A pedagogical rebuild of the Transformer architecture with a clear correspondence between the mathematical concepts in the original paper and their Python implementation, well suited for studying the inner workings of attention mechanisms and transformer models.

Mathematical Formulas and Implementation Mapping

| Component | Formula | Implementation |
|---|---|---|
| Scaled Dot-Product Attention | Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V | ScaledDotProductAttention |
| Multi-Head Attention | headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V); MultiHead(Q, K, V) = Concat(head₁, …, headₕ) W^O | MultiHeadAttention |
| Positional Encoding | PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) | PositionalEncoding |
| Layer Structure | Residual Add & Norm around the attention and FeedForward sublayers | EncoderLayer / DecoderLayer |
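
As a reference for how the first row of the table maps to code, here is a minimal PyTorch sketch of scaled dot-product attention; the repo's ScaledDotProductAttention module may differ in details such as masking, dropout, or whether it returns the attention weights.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V
    # Q, K: (..., seq, dₖ); V: (..., seq, d_v); mask broadcasts over the scores.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights

Returning the weights alongside the output is what makes exporting and visualizing attention (see Quick Start below) possible.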

Variable Definitions

| Symbol | Description |
|---|---|
| Q | Query matrix |
| K | Key matrix |
| V | Value matrix |
| dₖ | Dimension of the key vectors |
| Wᵢ^Q, Wᵢ^K, Wᵢ^V | Projection matrices for head i |
| W^O | Output projection matrix |
| h | Number of attention heads |
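
To make the symbols concrete, a short shape walk-through of the multi-head projections, using hypothetical sizes (the repo's SmallTransformer may use different dimensions):

import torch

batch, seq_len, d_model, h = 2, 10, 64, 8
d_k = d_model // h  # per-head dimension: 8

x = torch.randn(batch, seq_len, d_model)

# In practice the h per-head projections Wᵢ^Q are fused into one
# d_model × d_model matrix, applied once, then reshaped into h heads.
W_q = torch.nn.Linear(d_model, d_model, bias=False)
Q = W_q(x).view(batch, seq_len, h, d_k).transpose(1, 2)
print(Q.shape)  # torch.Size([2, 8, 10, 8]) = (batch, h, seq, dₖ)

# After attention, the heads are concatenated back and projected by W^O:
W_o = torch.nn.Linear(d_model, d_model, bias=False)
out = W_o(Q.transpose(1, 2).reshape(batch, seq_len, d_model))
print(out.shape)  # torch.Size([2, 10, 64]) = (batch, seq, d_model)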

Project Structure

src/
├── model.py              # Transformer modules (PositionalEncoding, ScaledDotProductAttention, MultiHeadAttention, FeedForward, EncoderLayer, DecoderLayer, SmallTransformer)
├── data.py               # Toy copy task dataset and loader
├── train.py              # Training script for the small transformer on copy task
├── export_attention_json.py  # Export attention tensors to JSON for interactive visualizer
└── visualize.py          # Plotting utilities to visualize attention matrices

notebooks/                # Original notebooks (if uploaded)
docs/assets/
└── Attention.png         # Architecture diagram (from the paper)

Quick Start

Train on Copy Task

python -m src.train --epochs 3 --save_dir checkpoints

Export Attention from Checkpoint

python -m src.export_attention_json --checkpoint checkpoints/ckpt_epoch1.pt --out attention_epoch1.json
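
The exact schema of the exported file is defined by src/export_attention_json.py; judging from the visualizer call below, it contains at least an 'attentions' mapping whose entries are nested as [layer][head]. A quick way to inspect a file, assuming that layout:

import json

with open('attention_epoch1.json') as f:
    data = json.load(f)

print(data.keys())
enc = data['attentions']['encoder_self']  # assumed [layer][head] nesting
print(len(enc), 'layers,', len(enc[0]), 'heads per layer')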

Visualize Attention

from src.visualize import load_attention_json, plot_attention_matrix

# Load exported attention data
data = load_attention_json('attention_epoch1.json')

# Plot layer 0, head 0, encoder self-attention
plot_attention_matrix(
    data['attentions']['encoder_self'][0][0], 
    title='Layer 0 Head 0 - Encoder Self-Attention'
)
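# Note: depending on how plot_attention_matrix is implemented, you may also
# need matplotlib.pyplot.show() to display the figure when running as a script.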

Features

- Explicit mapping between the paper's formulas and the code
- Modular PyTorch implementation (attention, positional encoding, encoder/decoder layers)
- Toy copy task for quick end-to-end training
- Attention export to JSON and plotting utilities for visualization

Requirements

pip install torch matplotlib numpy

License

MIT License © 2025 Arvind