-
Sparse attention 2 -- Unattention head, branching dynamics
-
Sparse attention 1 -- sticky plateau and rank collapse
-
Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule
-
Fine-tuning with sparse updates? A toy teacher-student setup
-
Multi-Head Cross Entropy Loss