Data efficiency of Bi-gram data

Black: agent-generated; Red: added by human

Introduction

Bi-gram data is a simple dataset that captures a basic aspect of natural language: the next token depends only on the current one. With \(V\) tokens in the vocabulary, a full bi-gram model has on the order of \(V^2\) transition probabilities, so one expects that on the order of \(V^2\) tokens must be seen to learn it well. I'm therefore interested in measuring the generalization gap as a function of vocabulary size and train size, which this post sweeps over.
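
As a rough illustration, here is a minimal sketch of how such a dataset could be generated. The function name, the low-rank construction, and all parameters are my assumptions for illustration; the actual "Bigram low-rank dataset" used below may be built differently.

```python
import numpy as np

def make_bigram_data(V, rank, n_tokens, seed=0):
    # Hypothetical sketch: sample a random low-rank V x V transition
    # matrix, then roll out a Markov chain of n_tokens tokens.
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(V, rank)) @ rng.normal(size=(rank, V))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # row-wise softmax

    tokens = np.empty(n_tokens, dtype=np.int64)
    tokens[0] = rng.integers(V)
    for t in range(1, n_tokens):
        tokens[t] = rng.choice(V, p=probs[tokens[t - 1]])
    return tokens, probs
```

With a full-rank transition matrix there are on the order of \(V^2\) free parameters, which is where the \(V^2\)-tokens intuition comes from; a low-rank construction reduces that count and should improve data efficiency.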

Experimental setup

Dataset: Bigram low-rank dataset

Model: MLP_token

Optimizer: Adam

Loss: cross-entropy

Observables wired to training: Accuracy, Train vs test gap

Trainer: Trainer
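
A minimal sketch of this setup, assuming MLP_token is a one-hidden-layer MLP over the current token (the class names, sizes, and training-loop details here are guesses, not the actual code behind this post):

```python
import torch
import torch.nn as nn

class MLPToken(nn.Module):
    # Hypothetical stand-in for MLP_token: embed the current token,
    # apply a nonlinearity, and predict logits for the next token.
    def __init__(self, V, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Embedding(V, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, V),
        )

    def forward(self, x):
        return self.net(x)

def train(model, train_pairs, test_pairs, steps=2000, lr=1e-3):
    # train_pairs / test_pairs: (inputs, targets) LongTensors of shape (N,).
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    x_tr, y_tr = train_pairs
    x_te, y_te = test_pairs
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_tr), y_tr)
        loss.backward()
        opt.step()
    with torch.no_grad():
        train_loss = loss_fn(model(x_tr), y_tr).item()
        test_loss = loss_fn(model(x_te), y_te).item()
        acc = (model(x_te).argmax(-1) == y_te).float().mean().item()
    # Generalization gap = test_loss - train_loss.
    return train_loss, test_loss, acc
```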

Sweep experiments

[Figure: Sweep comparison — Grid: vocab size × train size — Loss]

[Figure: Sweep comparison — Grid: vocab size × train size — Accuracy]

[Figure: Sweep comparison — Grid: vocab size × train size — Train vs test gap, last step (lines)]

The result matches expectations: the generalization gap (test loss minus train loss) shrinks as the training set grows, and larger vocabularies need more training data to close it. (TODO: automatic quantitative analysis)
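
Toward that TODO, one simple quantitative pass would be to fit a power law \( \text{gap} \approx C \, N^{\alpha} \) to the last-step gap at each vocabulary size. A sketch, under the assumption that the measured gaps stay positive:

```python
import numpy as np

def fit_gap_power_law(train_sizes, gaps):
    # Fit gap ~ C * N^alpha in log-log space; assumes all gaps > 0.
    logN = np.log(np.asarray(train_sizes, dtype=float))
    logG = np.log(np.asarray(gaps, dtype=float))
    alpha, logC = np.polyfit(logN, logG, 1)
    return alpha, np.exp(logC)
```

Comparing the fitted constants (or the \(N\) needed to reach a fixed gap) across vocabulary sizes would then quantify how the data requirement scales with \(V\).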

Code

Code can be downloaded here.



