DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT
Authors: Harvard University と OpenAI (GPT モデルを作ったところ) のグループ
[補足] Bias-Variance Trade-off
The bias-variance trade-off is a fundamental concept in classical statistical learning theory (e.g., Hastie et al. (2005)).
[補足] GTP-3 は 1750億個のパラメータを持つ巨大なモデル
“larger models are worse.”
“larger models are better”
“more data is always better”
Figure 1 Left: Train and test error as a function of model size, for ResNet18s of varying width on CIFAR-10 with 15% label noise.
Figure 1 Right: Test error, shown for varying train epochs. All models
trained using Adam for 4K epochs. The largest model (width 64) corresponds to standard ResNet18.
Figure 2 Left: Test error as a function of model size and train epochs. The horizontal line corresponds to model-wise double descent–varying model size while training for as long as possible. The vertical line corresponds to epoch-wise double descent, with test error undergoing double-descent as train time increases.
Figure 3: Test loss (per-token perplexity) as a function of Transformer model size (embedding dimension d_model) on language translation (IWSLT‘14 German-to-English). The curve for 18k samples is generally lower than the one for 4k samples, but also shifted to the right, since fitting 18k samples requires a larger model. Thus, for some models, the performance for 18k samples is worse than for 4k samples.
様々な条件で実験を行った結果、それぞれに特徴的な double descent の挙動が観測された。
より詳細が知りたい場合は論文の Appendix B を参照
We define the effective model complexity of T (w.r.t. distribution D) to be the maximum number of samples n on which T achieves on average ≈ 0 training error.
EMC と Interpolation threshold に関しては下図をみるとイメージしやすい
Figure 15: Sample-wise double-descent slice for Random Fourier Features on the Fashion MNIST dataset. In this figure the embedding dimension (number of random features) is 1000.
Ref: 技術ブログ アクセルユニバース株式会社
Our experiments suggest that there is a critical interval around the interpolation threshold when EMC = n: below and above this interval increasing complexity helps performance, while within this interval it may hurt performance.
There is a “plateau” in test error around the interpolation point with no label noise
Plataue: A comparatively stable level in something that varies.
we view adding label noise as merely a proxy for making distributions
“harder”— i.e. increasing the amount of model mis-specification