Database of scaling-law fits. Fields recorded per entry: Paper; Date; Category; Task; Task details; Architecture; Loss (metric); Dependent variable; Scaling variable 1; Scaling variable 2; Type; Functional form; Exponent 1; Exponent 2; Irreducible loss; Parameter values; Compute range (FLOP); Data range (mixed) with Data unit; Size range (parameters); Additional conditions; Scaling strategy. Entries are grouped by paper below; fields left empty in the source are omitted, and blank table cells mean the source records no value.
Scaling Vision Transformers (8 Jun 2022). Category: Vision, Transfer. Task: ImageNet. Architecture: Vision Transformer. Loss metric: accuracy, as a function of compute.
Type: power law with transition, L(C) = K·(C + C0)^-c + E.
Ranges (both fits): compute 2.13e18–1.06272e23 FLOP; data 1e8–1e10 images; size 5.4e6–1.8e9 parameters.

| Task details | K | C0 | c | E |
|---|---|---|---|---|
| Fine-tuning | 0.26 | 0.01 | 0.35 | 0.09 |
| Linear 10-shot | 0.63 | 0.52 | 0.32 | 0.12 |
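A minimal sketch of evaluating this functional form with the fitted constants above. The helper name is ours, and the compute units are those of the original fit, so the C values below are purely illustrative:

```python
def vit_loss(C, K, C0, c, E):
    """Power law with transition: L(C) = K * (C + C0)**(-c) + E.
    C0 shifts the curve at small compute; E is the loss floor."""
    return K * (C + C0) ** (-c) + E

finetune = dict(K=0.26, C0=0.01, c=0.35, E=0.09)   # ImageNet fine-tuning fit
ten_shot = dict(K=0.63, C0=0.52, c=0.32, E=0.12)   # linear 10-shot fit

for C in [0.1, 1.0, 10.0, 100.0]:  # illustrative compute values only
    print(f"C={C:7.1f}  fine-tune={vit_loss(C, **finetune):.3f}  "
          f"10-shot={vit_loss(C, **ten_shot):.3f}")
```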
Learning Curve Theory (8 Feb 2021). Category: Theory. No fit is recorded for this entry.
Scaling Scaling Laws with Board Games (7 Apr 2021). Category: Games, RL. Task: Hex. Architecture: AlphaZero. Loss metric: Elo, as a function of compute and board size.
Type: logarithmic with transition, L(C, Bs) = clamp(Mi·Bs + K·log C + ci, Mp·Bs + cp, 0): Elo rises log-linearly in compute, bounded below by a board-size-dependent plateau and above by 0.
Fitted values: Mi = -430, K = 510, ci = -4400, Mp = -270, cp = 570. Compute range: 1e9–1e17 FLOP.
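A sketch of the clamped form, reading clamp(x, lo, hi) as bounding the log-linear incline between the floor and 0. The source does not record the base of the logarithm, so base 10 here is an assumption, as are the function and variable names:

```python
import numpy as np

Mi, K, ci = -430.0, 510.0, -4400.0  # incline: board-size slope, log-compute slope, intercept
Mp, cp = -270.0, 570.0              # plateau: floor as a function of board size

def hex_elo(C, Bs):
    """Elo rises log-linearly in compute, clamped between a floor and 0
    (assuming log10; the fitted constants are from the entry above)."""
    incline = Mi * Bs + K * np.log10(C) + ci
    floor = Mp * Bs + cp
    return float(np.clip(incline, floor, 0.0))

print(hex_elo(1e13, Bs=9))  # a point inside the reported 1e9-1e17 FLOP range
```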
Data Scaling Laws in NMT: The Effect of Noise and Architecture (4 Feb 2022). Category: Language. Task: translation, English to German. Loss metric: cross-entropy, as a function of data.
Type: power law plus constant, L(D) = B·(D^-1 + E)^b.
Ranges (all fits): data 5.12e5–5.12e8 sentence pairs; size fixed at 3e8 parameters.

| Architecture | Additional conditions | B | E | b |
|---|---|---|---|---|
| Encoder-Decoder Transformer | | 1.969 | 0.057 | 0.285 |
| Hybrid Transformer-LSTM | | 1.817 | 0.11 | 0.285 |
| Decoder-only Transformer | | 2.011 | 0.078 | 0.285 |
| Encoder-Decoder Transformer | Source noise | 2.222 | 0.067 | 0.296 |
| Encoder-Decoder Transformer | Target noise | 2.772 | 0.323 | 0.296 |
| Encoder-Decoder Transformer | No filtering | 2.501 | 0.034 | 0.278 |
| Encoder-Decoder Transformer | CDS filtering | 2.235 | 0.054 | 0.278 |
| Encoder-Decoder Transformer | Bicleaner filtering | 2.130 | 0.064 | 0.278 |
| Encoder-Decoder Transformer | Back-translation 2L2L | 2.343 | 0.059 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 6L6L | 2.288 | 0.054 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 32L6L | 2.251 | 0.040 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 64L6L | 2.224 | 0.037 | 0.198 |
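A sketch comparing two of the fitted curves above (the clean encoder-decoder baseline against target-side noise). The source does not record the unit of D inside the fit; we assume D is in millions of sentence pairs, which places the D^-1-to-E transition inside the reported data range, so treat that as an assumption:

```python
def nmt_loss(D, B, E, b):
    """L(D) = B * (D**-1 + E)**b: data-limited at small D, approaching
    the asymptote B * E**b as D grows."""
    return B * (1.0 / D + E) ** b

baseline = dict(B=1.969, E=0.057, b=0.285)       # encoder-decoder, clean data
target_noise = dict(B=2.772, E=0.323, b=0.296)   # encoder-decoder, target noise

for D in [0.512, 5.12, 51.2, 512.0]:  # millions of sentence pairs (assumed unit)
    print(f"D={D:7.3f}M  clean={nmt_loss(D, **baseline):.3f}  "
          f"target-noise={nmt_loss(D, **target_noise):.3f}")
```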
Scaling Laws for a Multi-Agent Reinforcement Learning Model (29 Sept 2022). Category: Games, RL. Architecture: AlphaZero. Loss metric: player strength. Scaling strategy: width scaling.
Type: power laws, L(N) = N^a and L(C) = C^c (the exponents are positive because player strength improves with scale).

| Game | Scaling variable | Exponent | Compute range (FLOP) | Data range (training steps) | Size range (parameters) |
|---|---|---|---|---|---|
| Pentago | Parameters | a = 0.87 | 4e11–4e16 | 2e1–1e4 | 2e3–3e5 |
| Pentago | Compute | c = 0.55 | 4e11–4e16 | 2e1–1e4 | 2e3–3e5 |
| ConnectFour | Parameters | a = 0.88 | 1.5e11–3e16 | 2e1–1e4 | 6e2–2e5 |
| ConnectFour | Compute | c = 0.55 | 1.5e11–3e16 | 2e1–1e4 | 6e2–2e5 |
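Since these are pure power laws, the exponents translate directly into multiplicative gains; a one-line sketch of that reading, using the Pentago fits above:

```python
a, c = 0.87, 0.55  # Pentago exponents from the table above
print(f"10x parameters -> player strength x{10 ** a:.1f}")  # about 7.4x
print(f"10x compute    -> player strength x{10 ** c:.1f}")  # about 3.5x
```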
Training Compute-Optimal Large Language Models (29 Mar 2022). Category: Language. Task: language modeling. Architecture: decoder-only Transformer. Loss metric: cross-entropy, as a function of parameters and data.
Type: bivariate power law plus constant (sum), L(N, D) = A·N^-a + B·D^-b + E.
Fitted values: A = 406.4, a = 0.34, B = 410.7, b = 0.28, E = 1.69.
Ranges: compute 6e18–3e21 FLOP; data 1e7–1e9 BPE tokens; size 2e7–1.6e10 parameters.
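These fitted constants determine a compute-optimal split of a budget between N and D. A minimal sketch (our helper names), using the common C ≈ 6·N·D FLOP approximation for dense transformers, which is an assumption rather than part of this entry:

```python
import numpy as np

A, a, B, b, E = 406.4, 0.34, 410.7, 0.28, 1.69  # fitted values above

def loss(N, D):
    """L(N, D) = A*N**-a + B*D**-b + E."""
    return A * N ** (-a) + B * D ** (-b) + E

def optimal_split(C):
    """Grid-search over N, with D = C / (6*N), to minimize the fitted loss."""
    N = np.logspace(7, 11, 4000)   # candidate model sizes
    D = C / (6.0 * N)              # tokens implied by the budget
    i = int(np.argmin(loss(N, D)))
    return N[i], D[i]

N_opt, D_opt = optimal_split(1e21)  # a budget inside the fitted compute range
print(f"N* = {N_opt:.2e} params, D* = {D_opt:.2e} tokens, "
      f"D*/N* = {D_opt / N_opt:.0f}")
```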
Scaling Laws for Autoregressive Generative Modeling (28 Oct 2020). Architecture: decoder-only Transformer throughout. Loss metric: cross-entropy. Fits are power laws in parameters, L(N) = A·N^-a, or compute, L(C) = K·C^-c, with a "+ E" term where the fit includes an irreducible loss (the source also records an irreducible-loss value for some pure power-law fits).

| Category | Task details | Form | Fitted values | Irreducible loss | Compute range (FLOP) | Data (unit) | Size range (parameters) |
|---|---|---|---|---|---|---|---|
| Language | Language modeling | L(N) = A·N^-a | A = 9.810, a = 0.07 | | 8.64e14–4.32e23 | BPE tokens | 1e5–1.75e11 |
| Language | Language modeling | L(C) = K·C^-c | K = 23.27, c = 0.048 | | 8.64e14–4.32e23 | BPE tokens | 1e5–1.75e11 |
| Vision | 16x16 image modeling, VQ encoding | L(N) = A·N^-a + E | A = 3.767, a = 0.13 | E = 3.99 | 1e13–1.73e20 | 1e8 64x64 VQ256 images | 1e5–3e9 |
| Vision | 16x16 image modeling, VQ encoding | L(C) = K·C^-c + E | K = 32.31, c = 0.11 | E = 4.09 | 1e13–1.73e20 | 1e8 64x64 VQ256 images | 1e5–3e9 |
| Vision | 32x32 image modeling, VQ encoding | L(N) = A·N^-a + E | A = 3.972, a = 0.14 | E = 3.07 | 3.46e14–2.59e20 | 1e8 64x64 VQ1024 images | 1e5–3e9 |
| Vision | 32x32 image modeling, VQ encoding | L(C) = K·C^-c + E | K = 52.74, c = 0.12 | E = 3.17 | 3.46e14–2.59e20 | 1e8 64x64 VQ1024 images | 1e5–3e9 |
| Language | Mathematics | L(N) = A·N^-a | A = 4.432, a = 0.16 | 0.28 | 2.59e14–4.32e20 | characters (bytes) | 2e5–3e9 |
| Language | Mathematics | L(C) = K·C^-c | K = 366.4, c = 0.17 | 0.14 | 2.59e14–4.32e20 | characters (bytes) | 2e5–3e9 |
| Vision | 16x16 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 3.454, a = 0.22 | E = 2.64 | 1.73e14–8.64e20 | 1e8 16x16 images | 1e5–2e9 |
| Vision | 16x16 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 87.59, c = 0.16 | E = 2.64 | 1.73e14–8.64e20 | 1e8 16x16 images | 1e5–2e9 |
| Vision | 32x32 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 1.713, a = 0.13 | E = 2.2 | 1.73e14–8.64e20 | 1e8 32x32 images | 1e5–2e9 |
| Vision | 32x32 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 14.10, c = 0.1 | E = 2.21 | 1.73e14–8.64e20 | 1e8 32x32 images | 1e5–2e9 |
| Multimodal | Text-to-image generation, text loss | L(N) = A·N^-a | A = 2.107, a = 0.037 | | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Multimodal | Text-to-image generation, image loss | L(N) = A·N^-a + E | A = 3.919, a = 0.16 | E = 2 | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Multimodal | Text-to-image generation, combined loss | L(C) = K·C^-c + E | K = 130.8, c = 0.15 | E = 1.93 | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Video | Video generation | L(N) = A·N^-a | A = 12.48, a = 0.24 | 1.01 | 8.64e13–4.32e20 | 1e2 hours, 64x64 VQ256 video | 1e4–8e8 |
| Video | Video generation | L(C) = K·C^-c | K = 137.7, c = 0.14 | 0.95 | 8.64e13–4.32e20 | 1e2 hours, 64x64 VQ256 video | 1e4–8e8 |
| Vision | 8x8 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 2.862, a = 0.24 | E = 3.12 | 1e13–8.64e19 | 1e8 8x8 images | 1e5–3e8 |
| Vision | 8x8 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 207.2, c = 0.19 | E = 3.13 | 1e13–8.64e19 | 1e8 8x8 images | 1e5–3e8 |
| Multimodal | Image captioning, text loss | L(N) = A·N^-a | A = 2.212, a = 0.039 | | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
| Multimodal | Image captioning, image loss | L(N) = A·N^-a + E | A = 3.639, a = 0.15 | E = 2 | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
| Multimodal | Image captioning, combined loss | L(C) = K·C^-c + E | K = 181.1, c = 0.16 | E = 1.97 | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
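For the fits that include an irreducible term, the reducible part K·C^-c is what additional compute buys. A short sketch comparing several of the vision and multimodal fits above at a fixed budget (the names and the chosen budget are ours):

```python
fits = {  # K, c, E from the table above (forms with an explicit + E)
    "16x16 pixels":  dict(K=87.59, c=0.16, E=2.64),
    "32x32 pixels":  dict(K=14.10, c=0.10, E=2.21),
    "8x8 pixels":    dict(K=207.2, c=0.19, E=3.13),
    "text-to-image": dict(K=130.8, c=0.15, E=1.93),
}
C = 1e19  # FLOPs, inside the reported compute ranges
for name, f in fits.items():
    reducible = f["K"] * C ** (-f["c"])
    print(f"{name:14s} L = {f['E'] + reducible:.3f} "
          f"(floor {f['E']:.2f} + reducible {reducible:.3f})")
```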
A Scaling Law for Syn2real Transfer: How Much Is Your Pre-training Effective? (25 Aug 2021). Category: Vision, Transfer. Task: multiple. Architecture: ResNet. Loss metric: fine-tuning loss, as a function of pre-training data.
Type: power law plus constant, L(D) = B·D^-b + G; no fitted values or ranges are recorded.
Scaling Laws for Neural Language Models (23 Jan 2020). Category: Language. Task: language modeling. Architecture: decoder-only Transformer. Loss metric: cross-entropy.
Ranges (all fits): compute 1.69e10–3.45e19 FLOP; data 2.2e7–2.3e10 BPE tokens; size 7.68e2–1.5e9 parameters.

| Scaling variable(s) | Type | Form | Fitted values | Additional conditions |
|---|---|---|---|---|
| Compute | Power law | L(C) = K·C^-c | K = 26.38, c = 0.05 | |
| Parameters | Power law | L(N) = A·N^-a | A = 11.48, a = 0.076 | |
| Data | Power law | L(D) = B·D^-b | B = 20.81, b = 0.095 | |
| Parameters, data | Bivariate power law | L(N, D) = [(A·N^-a)^(1/b) + (B·D^-b)^(1/b)]^b | A = 11.48, a = 0.076, B = 20.81, b = 0.095 | |
| Parameters, training steps | Bivariate power law, sum | L(N, S) = A·N^-a + B·S^-b | A = 11.48, a = 0.076, B = 334.88, b = 0.76 | Training at the critical batch size |

The bivariate N, D form is written so that it reduces to the parameter-only fit as D grows and to the data-only fit as N grows.
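A quick numerical sketch of that consistency, checking the two limits against the single-variable fits (the helper name is ours):

```python
A, a = 11.48, 0.076   # parameter fit above
B, b = 20.81, 0.095   # data fit above

def loss_nd(N, D):
    """L(N, D) = [(A*N**-a)**(1/b) + (B*D**-b)**(1/b)]**b."""
    return ((A * N ** -a) ** (1 / b) + (B * D ** -b) ** (1 / b)) ** b

print(loss_nd(1.5e9, 1e30))      # huge D: recovers the N-only fit
print(11.48 * 1.5e9 ** -0.076)   # L(N) = A*N**-a, same value
print(loss_nd(1e30, 2.3e10))     # huge N: recovers the D-only fit
print(20.81 * 2.3e10 ** -0.095)  # L(D) = B*D**-b, same value
```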
Effect of scale on catastrophic forgetting in neural networks (21 Sep 2022). Category: Vision. No fit is recorded for this entry.
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? (21 Jul 2022). Category: Language. Task: language modeling. Five power-law fits are recorded per architecture: upstream negative cross-entropy and downstream accuracy against compute, L(C) = K·C^c; the same two metrics against parameters, L(N) = A·N^a; and downstream accuracy against upstream negative cross-entropy, Ld(Lu) = G·Lu^l. Only the exponents are recorded; they are positive where the metric improves with scale.

| Architecture | Compute range (FLOP) | Size range (parameters) | c, upstream | c, downstream | a, upstream | a, downstream | l |
|---|---|---|---|---|---|---|---|
| Switch Transformer | 3.25e12–4.33e13 | 1.74e8–2.96e10 | 0.23 | 0.14 | 0.13 | 0.08 | 0.58 |
| Encoder-Decoder Transformer | 1.21e12–6.38e13 | 1.6e7–2.9e9 | 0.54 | 0.28 | 0.47 | 0.24 | 0.49 |
| Funnel Transformer | 1.10e12–4.03e13 | 1.6e7–2.9e9 | 0.47 | 0.22 | 0.38 | 0.18 | 0.46 |
| MoS-Transformer | 1.29e12–1.12e14 | 2.7e7–2.9e9 | 0.43 | 0.21 | 0.43 | 0.20 | 0.47 |
| MLP-Mixer | 3.83e12–4.83e13 | 6.7e7–2.86e9 | 0.32 | -0.03 | 0.26 | 0.65 | -0.02 |
| GLU-Transformer | 1.29e12–6.13e13 | 2.6e7–2.85e9 | 0.49 | 0.24 | 0.42 | 0.22 | 0.46 |
| LConv | 1.20e12–7.70e13 | 1.7e7–2.3e9 | 0.32 | 0.13 | 0.29 | 0.11 | 0.48 |
| Evolved Transformer | 1.31e12–7.13e13 | 1.9e7–2.2e9 | 0.44 | 0.22 | 0.42 | 0.21 | 0.47 |
| DConv | 1.39e12–7.80e13 | 2.2e7–1.2e9 | | | | | |
| Performer | 1.14e12–3.28e13 | 1.6e7–7.39e8 | 0.25 | 0.05 | 0.24 | 0.05 | |
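Because only exponents are recorded, these entries support relative rather than absolute predictions. A short sketch of that reading for the downstream-accuracy-vs-compute column (values from the table above):

```python
c_downstream = {  # downstream-accuracy-vs-compute exponents from the table
    "Encoder-Decoder Transformer": 0.28,
    "GLU-Transformer": 0.24,
    "Evolved Transformer": 0.22,
    "Switch Transformer": 0.14,
    "LConv": 0.13,
    "Performer": 0.05,
    "MLP-Mixer": -0.03,  # downstream metric worsens as compute grows
}
for arch, c in c_downstream.items():
    # metric ~ C**c, so a 10x compute increase scales it by 10**c
    print(f"{arch:28s} 10x compute -> downstream metric x{10 ** c:.2f}")
```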