-
A toy model of distillation
-
When does Muon work? Model depth is a key factor
-
MOE 1 -- Expressive power of MOEs through the lens of spectral bias and memorization capacity
-
A toy model of video generative models -- bottleneck dimension controls "classical"/"quantum" strategies
-
Memory 2 -- How many bits does each parameter store? An analysis of MLP