2026

an archive of posts from this year

May 06, 2026 AI4AI needs its own "world model"
May 05, 2026 How does memorization affect l2?
May 03, 2026 Grokking in loss doesn't necessarily imply grokking in accuracy
May 03, 2026 Data efficiency of Bigram data
Apr 12, 2026 Must-read papers by Jürgen Schmidhuber
Apr 10, 2026 If not LLMs, what should I work on?
Apr 09, 2026 A Toy Model of Masked Autoencoder (MAE)
Apr 08, 2026 Understanding cardinality (as in ResNext) from representation collapse
Mar 18, 2026 Attention residual 2
Mar 17, 2026 Estimating structural information fraction of your dataset
Mar 16, 2026 When does Kimi's "Attention Residuals" work?
Mar 15, 2026 When does RandOpt work?
Mar 14, 2026 Tokenization 1 -- Factorized tokenization
Mar 13, 2026 How to ground your ideas?
Mar 12, 2026 A toy model of distillation
Mar 11, 2026 When does Muon work? Model depth is a key factor
Mar 10, 2026 MOE 1 -- Expressive power of MOEs through the lens of spectral bias and memorization capacity
Mar 09, 2026 A toy model of video generative models -- bottleneck dimension controls "classical"/"quantum" strategies
Mar 06, 2026 Memory 2 -- How many bits does each parameter store? An analysis of MLP
Mar 05, 2026 Sparse attention 8 -- Numeric randomness speeds up emergence of symbolic structure (induction head)
Mar 04, 2026 Drifting VQ-VAE -- How "drifting models" fix failure modes of VQ-VAE
Mar 03, 2026 Loss landscape visualization 1 -- Seeing sticky plateau
Feb 28, 2026 Research agent 1 -- Reproducing 2026-01-01 blog (physics of feature learning)
Feb 25, 2026 Research agents should target knowledge graphs, not papers
Feb 24, 2026 181-parameter transformer-like models for 10-digit addition
Feb 15, 2026 When should I use physics of AI?
Feb 09, 2026 Memory 1 -- How much do linear layers memorize?
Feb 08, 2026 Transformers don't learn Newton's laws? They learn Kepler's laws!
Feb 07, 2026 When I say "toy models", what do I mean?
Feb 06, 2026 On the physical interpretation of drifting generative models
Feb 05, 2026 Physics 2 -- Transformers fail to maintain physical consistency for circular motion
Feb 04, 2026 Physics 1 -- Attention can't exactly simulate uniform linear motion
Feb 03, 2026 Depth 4 -- Flat directions (in weight space) are high frequency modes (in function space)
Feb 02, 2026 Depth 3 -- Fun facts about loss hessian eigenvalues
Feb 01, 2026 Diffusion 2 -- Visualizing flow matching, temporal dynamics
Jan 31, 2026 Sparse attention 7 -- Stack of causal attention creates implicit positional embedding, and explaining "Lost in the middle"
Jan 30, 2026 Sparse attention 6 -- In-context Associative recall
Jan 29, 2026 MLP 2 -- Effective linearity, Generalized SiLU
Jan 28, 2026 MLP 1 -- Gating is good for polynomials
Jan 27, 2026 Optimization 4 -- Loss Spikes
Jan 25, 2026 Optimization 3 / Depth 2 -- Adding Bias After ReLU
Jan 24, 2026 Optimization 2 -- Elementwise Scale Reparametrization
Jan 23, 2026 Optimization 1 -- Norm reparametrization
Jan 22, 2026 Sparse attention 5 -- Attention sink
Jan 21, 2026 Bigram 4 -- On the difficulty of spatial map emergence
Jan 20, 2026 Depth 1 -- Understanding Pre-LN and Post-LN
Jan 19, 2026 Bigram 3 -- Low Rank Structure
Jan 16, 2026 Bigram 2 -- Emergence of Hyperbolic Spaces
Jan 15, 2026 Bigram 1 -- Walk on a Circle
Jan 14, 2026 Diffusion 1 -- Sparse and Dense Neurons
Jan 13, 2026 Sparse attention 4 -- previous token head
Jan 12, 2026 Sparse attention 3 -- inefficiency of extracting similar content
Jan 11, 2026 Emergence of Induction Head Depends on Learning Rate Schedule
Jan 10, 2026 Sparse attention 2 -- Unattention head, branching dynamics
Jan 09, 2026 Sparse attention 1 -- sticky plateau and rank collapse
Jan 08, 2026 Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule
Jan 07, 2026 Fine-tuning with sparse updates? A toy teacher-student setup
Jan 06, 2026 Multi-Head Cross Entropy Loss
Jan 05, 2026 What's the difference -- (physics of) AI, physics, math and interpretability
Jan 04, 2026 Representation anisotropy from nonlinear functions
Jan 03, 2026 Training dynamics of A Single ReLU Neuron
Jan 02, 2026 Physics of AI -- How to Begin
Jan 01, 2026 Physics of Feature Learning 1 -- A Perspective from Nonlinearity