- Sparse attention 3 -- inefficiency of extracting similar content
- Emergence of Induction Head Depends on Learning Rate Schedule
- Sparse attention 2 -- Unattention head, branching dynamics
- Sparse attention 1 -- sticky plateau and rank collapse
- Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule