- When does Muon work? Model depth is a key factor
- MoE 1 -- Expressive power of MoEs through the lens of spectral bias and memorization capacity
- A toy model of video generative models -- bottleneck dimension controls "classical"/"quantum" strategies
- Memory 2 -- How many bits does each parameter store? An analysis of MLPs
- Sparse attention 8 -- Numeric randomness speeds up emergence of symbolic structure (induction heads)