| Date | Post |
| --- | --- |
| Jan 25, 2026 | Optimization 3 / Depth 2 -- Adding Bias After ReLU |
| Jan 24, 2026 | Optimization 2 -- Elementwise Scale Reparametrization |
| Jan 23, 2026 | Optimization 1 -- Norm Reparametrization |
| Jan 22, 2026 | Sparse Attention 5 -- Attention Sink |
| Jan 21, 2026 | Bigram 4 -- On the Difficulty of Spatial Map Emergence |
| Jan 20, 2026 | Depth 1 -- Understanding Pre-LN and Post-LN |
| Jan 19, 2026 | Bigram 3 -- Low Rank Structure |
| Jan 16, 2026 | Bigram 2 -- Emergence of Hyperbolic Spaces |
| Jan 15, 2026 | Bigram 1 -- Walk on a Circle |
| Jan 14, 2026 | Diffusion 1 -- Sparse and Dense Neurons |
| Jan 13, 2026 | Sparse Attention 4 -- Previous Token Head |
| Jan 12, 2026 | Sparse Attention 3 -- Inefficiency of Extracting Similar Content |
| Jan 11, 2026 | Emergence of Induction Head Depends on Learning Rate Schedule |
| Jan 10, 2026 | Sparse Attention 2 -- Unattention Head, Branching Dynamics |
| Jan 09, 2026 | Sparse Attention 1 -- Sticky Plateau and Rank Collapse |
| Jan 08, 2026 | Unigram Toy Model Is Surprisingly Rich -- Representation Collapse, Scaling Laws, Learning Rate Schedule |
| Jan 07, 2026 | Fine-Tuning with Sparse Updates? A Toy Teacher-Student Setup |
| Jan 06, 2026 | Multi-Head Cross Entropy Loss |
| Jan 05, 2026 | What's the Difference -- (Physics of) AI, Physics, Math, and Interpretability |
| Jan 04, 2026 | Representation Anisotropy from Nonlinear Functions |
| Jan 03, 2026 | Training Dynamics of a Single ReLU Neuron |
| Jan 02, 2026 | Physics of AI -- How to Begin |
| Jan 01, 2026 | Physics of Feature Learning 1 -- A Perspective from Nonlinearity |