-
Depth 3 -- Fun facts about loss hessian eigenvalues
-
Diffusion 2 -- Visualizing flow matching, temporal dynamics
-
Sparse attention 7 -- Stack of causal attention creates implicit positional embedding, and explaining "Lost in the middle"
-
Sparse attention 6 -- In-context Associative recall
-
MLP 2 -- Effective linearity, Generalized SiLU