-
Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule
-
Fine-tuning with sparse updates? A toy teacher-student setup
-
Multi-Head Cross Entropy Loss
-
What's the difference -- (physics of) AI, physics, math and interpretability
-
Representation anisotropy from nonlinear functions