AI | Ziming Liu

Mar 18, 2026	Attention residual 2
Mar 17, 2026	Estimating structural information fraction of your dataset
Mar 16, 2026	When does Kimi's "Attention Residuals" work?
Mar 15, 2026	When does RandOpt work?
Mar 14, 2026	Tokenization 1 -- Factorized tokenization
Mar 13, 2026	How to ground your ideas?
Mar 12, 2026	A toy model of distillation
Mar 11, 2026	When does Muon work? Model depth is a key factor
Mar 10, 2026	MOE 1 -- Expressive power of MOEs through the lens of spectral bias and memorization capacity
Mar 09, 2026	A toy model of video generative models -- bottleneck dimension controls "classical"/"quantum" strategies
Mar 06, 2026	Memory 2 -- How many bits does each parameter store? An analysis of MLP
Mar 05, 2026	Sparse attention 8 -- Numeric randomness speeds up emergence of symbolic structure (induction head)
Mar 04, 2026	Drifting VQ-VAE -- How "drifting models" fix failure modes of VQ-VAE
Mar 03, 2026	Loss landscape visualization 1 -- Seeing sticky plateau
Feb 28, 2026	Research agent 1 -- Reproducing 2026-01-01 blog (physics of feature learning)
Feb 25, 2026	Research agents should target knowledge graphs, not papers
Feb 24, 2026	181-parameter transformer-like models for 10-digit addition
Feb 15, 2026	When should I use physics of AI?
Feb 09, 2026	Memory 1 -- How much do linear layers memorize?
Feb 08, 2026	Transformers don't learn Newton's laws? They learn Kepler's laws!
Feb 07, 2026	When I say "toy models", what do I mean?
Feb 06, 2026	On the physical interpretation of drifting generative models
Feb 05, 2026	Physics 2 -- Transformers fail to maintain physical consistency for circular motion
Feb 04, 2026	Physics 1 -- Attention can't exactly simulate uniform linear motion
Feb 03, 2026	Depth 4 -- Flat directions (in weight space) are high frequency modes (in function space)
Feb 02, 2026	Depth 3 -- Fun facts about loss hessian eigenvalues
Feb 01, 2026	Diffusion 2 -- Visualizing flow matching, temporal dynamics
Jan 31, 2026	Sparse attention 7 -- Stack of causal attention creates implicit positional embedding, and explaining "Loss in the middle"
Jan 30, 2026	Sparse attention 6 -- In-context Associative recall
Jan 29, 2026	MLP 2 -- Effective linearity, Generalized SiLU
Jan 28, 2026	MLP 1 -- Gating is good for polynomials
Jan 27, 2026	Optimization 4 -- Loss Spikes
Jan 25, 2026	Optimization 3 / Depth 2 -- Adding Bias After ReLU
Jan 24, 2026	Optimization 2 -- Elementwise Scale Reparametrization
Jan 23, 2026	Optimization 1 -- Norm reparametrization
Jan 22, 2026	Sparse attention 5 -- Attention sink
Jan 21, 2026	Bigram 4 -- On the difficulty of spatial map emergence
Jan 20, 2026	Depth 1 -- Understanding Pre-LN and Post-LN
Jan 19, 2026	Bigram 3 -- Low Rank Structure
Jan 16, 2026	Bigram 2 -- Emergence of Hyperbolic Spaces
Jan 15, 2026	Bigram 1 -- Walk on a Circle
Jan 14, 2026	Diffusion 1 -- Sparse and Dense Neurons
Jan 13, 2026	Sparse attention 4 -- previous token head
Jan 12, 2026	Sparse attention 3 -- inefficiency of extracting similar content
Jan 11, 2026	Emergence of Induction Head Depends on Learning Rate Schedule
Jan 10, 2026	Sparse attention 2 -- Unattention head, branching dynamics
Jan 09, 2026	Sparse attention 1 -- sticky plateau and rank collapse
Jan 08, 2026	Unigram toy model is surprisingly rich -- representation collapse, scaling laws, learning rate schedule
Jan 07, 2026	Fine-tuning with sparse updates? A toy teacher-student Setup
Jan 06, 2026	Multi-Head Cross Entropy Loss
Jan 05, 2026	What's the difference -- (physics of) AI, physics, math and interpretability
Jan 04, 2026	Representation anisotropy from nonlinear functions
Jan 03, 2026	Training dynamics of A Single ReLU Neuron
Jan 02, 2026	Physics of AI – How to Begin
Jan 01, 2026	Physics of Feature Learning 1 – A Perspective from Nonlinearity
Dec 31, 2025	Physics of AI Requires Mindset Shifts
Dec 25, 2025	Achieving AGI Intelligently – Structure, Not Scale
May 27, 2024	Philosophical thoughts on Kolmogorov-Arnold Networks
Jun 16, 2023	A Good ML Theory is Like Physics -- A Physicist's Analysis of Grokking