- Emergence of induction head depends on learning rate schedule
- Sparse attention 2 -- unattention head, branching dynamics
- Sparse attention 1 -- sticky plateau and rank collapse
- Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule
- Fine-tuning with sparse updates? A toy teacher-student setup