Row | Paper | Date | Category | Task | Task details | Architecture | Metric | Dependent Variable | Scaling Variable 1 | Scaling Variable 2 | Type | Functional form | Exponent 1 | Exponent 2 | Irreducible loss | Parameter values | Compute range min (FLOP) | Compute range max (FLOP) | Data range min | Data range max | Data unit | Size range min (parameters) | Size range max (parameters) | Additional Conditions | Scaling strategy
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2 | Scaling Vision Transformers | 8 Jun 2022 | Vision, Transfer | ImageNet | Fine-tuning | Vision Transformer | Accuracy | Loss | Compute | Power law with transition | L(C) = K(C+C0)^-c + E | 0.35 | 0.09 | K=0.26, C0=0.01, c=0.35, E=0.09 | 2.13E+18 | 1.06272E+23 | 1.00E+08 | 1.00E+10 | Images | 5.40E+06 | 1.80E+09 | ||||||||||||||||||
3 | Scaling Vision Transformers | 8 Jun 2022 | Vision, Transfer | ImageNet | Linear 10-shot | Vision Transformer | Accuracy | Loss | Compute | Power law with transition | L(C) = K(C+C0)^-c + E | 0.32 | 0.12 | K=0.63, C0=0.52, c=0.32, E=0.12 | 2.13E+18 | 1.06272E+23 | 1.00E+08 | 1.00E+10 | Images | 5.40E+06 | 1.80E+09 | ||||||||||||||||||
4 | Learning Curve Theory | 8 Feb 2021 | Theory | ||||||||||||||||||||||||||||||||||||
5 | Scaling Scaling Laws with Board Games | 7 Apr 2021 | Games, RL | Hex | AlphaZero | Elo | Loss | Compute | Board size | Logarithmic with transition | L(C,Bs) = clamp(Mi*Bs + K*log C + ci, min = Mp*Bs + cp, max = 0) | Mi=-430, K=510, ci=-4400, Mp=-270, cp=570 | 1.00E+09 | 1.00E+17 |
6 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=1.969, E=0.057, b=0.285 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | ||||||||||||||||||||||
7 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Hybrid Transformer-LSTM | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=1.817, E=0.11, b=0.285 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | ||||||||||||||||||||||
8 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Decoder-only Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.011, E=0.078, b=0.285 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | ||||||||||||||||||||||
9 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.222, E=0.067, b=0.296 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Source noise | |||||||||||||||||||||
10 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.772, E=0.323, b=0.296 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Target noise | |||||||||||||||||||||
11 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.501, E=0.034, b=0.278 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | No filtering | |||||||||||||||||||||
12 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.235, E=0.054, b=0.278 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | CDS filtering | |||||||||||||||||||||
13 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.130, E=0.064, b=0.278 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Bicleaner filtering | |||||||||||||||||||||
14 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.343, E=0.059, b=0.198 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Back-translation 2L2L | |||||||||||||||||||||
15 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.288, E=0.054, b=0.198 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Back-translation 6L6L | |||||||||||||||||||||
16 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.251, E=0.040, b=0.198 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Back-translation 32L6L | |||||||||||||||||||||
17 | Data Scaling Laws in NMT: The Effect of Noise and Architecture | 4 Feb 2022 | Language | Translation | English to German | Encoder-Decoder Transformer | Cross-entropy | Loss | Data | Power law plus constant | L(D) = B(D^-1 + E)^b | B=2.224, E=0.037, b=0.198 | 5.12E+05 | 5.12E+08 | Sentence pairs | 3.00E+08 | 3.00E+08 | Back-translation 64L6L | |||||||||||||||||||||
18 | Scaling Laws for a Multi-Agent Reinforcement Learning Model | 29 Sept 2022 | Games, RL | Pentago | AlphaZero | Player strength | Loss | Parameters | Power law | L(N) = N^a | 0.87 | a=0.87 | 4.00E+11 | 4.00E+16 | 2.00E+01 | 1.00E+04 | Training Steps | 2.00E+03 | 3.00E+05 | Width scaling | |||||||||||||||||||
19 | Scaling Laws for a Multi-Agent Reinforcement Learning Model | 29 Sept 2022 | Games, RL | Pentago | AlphaZero | Player strength | Loss | Compute | Power law | L(C) = C^c | 0.55 | c=0.55 | 4.00E+11 | 4.00E+16 | 2.00E+01 | 1.00E+04 | Training Steps | 2.00E+03 | 3.00E+05 | Width scaling |
20 | Scaling Laws for a Multi-Agent Reinforcement Learning Model | 29 Sept 2022 | Games, RL | ConnectFour | AlphaZero | Player strength | Loss | Parameters | Power law | L(N) = N^a | 0.88 | a=0.88 | 1.50E+11 | 3.00E+16 | 2.00E+01 | 1.00E+04 | Training Steps | 6.00E+02 | 2.00E+05 | Width scaling | |||||||||||||||||||
21 | Scaling Laws for a Multi-Agent Reinforcement Learning Model | 29 Sept 2022 | Games, RL | ConnectFour | AlphaZero | Player strength | Loss | Compute | Power law | L(C) = C^c | 0.55 | c=0.55 | 1.50E+11 | 3.00E+16 | 2.00E+01 | 1.00E+04 | Training Steps | 6.00E+02 | 2.00E+05 | Width scaling |
22 | Training Compute-Optimal Large Language Models | 29 Mar 2022 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Data | Bivariate power law plus constant - sum | L(N,D) = AN^-a + BD^-b + E | 0.34 | 0.28 | 1.69 | A=406.4, a=0.34, B=410.7, b=0.28, E=1.69 | 6.00E+18 | 3.00E+21 | 1.00E+07 | 1.00E+09 | BPE Tokens | 2.00E+07 | 1.60E+10 | |||||||||||||||||
23 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law | L(N) = AN^-a | 0.07 | A=9.810, a=0.07 | 8.64E+14 | 4.32E+23 | BPE Tokens | 1.00E+05 | 1.75E+11 | ||||||||||||||||||||||
24 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law | L(C) = KC^-c | 0.048 | K=23.27, c=0.048 | 8.64E+14 | 4.32E+23 | BPE Tokens | 1.00E+05 | 1.75E+11 | ||||||||||||||||||||||
25 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 16x16 Image modeling, VQ encoding | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.13 | 3.99 | A=3.767, a=0.13 | 1.00E+13 | 1.73E+20 | 1.00E+08 | 64x64 VQ256 Images | 1.00E+05 | 3.00E+09 | |||||||||||||||||||
26 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 16x16 Image modeling, VQ encoding | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.11 | 4.09 | K=32.31, c=0.11 | 1.00E+13 | 1.73E+20 | 1.00E+08 | 64x64 VQ256 Images | 1.00E+05 | 3.00E+09 | |||||||||||||||||||
27 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 32x32 Image modeling, VQ encoding | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.14 | 3.07 | A=3.972, a=0.14 | 3.46E+14 | 2.59E+20 | 1.00E+08 | 64x64 VQ1024 Images | 1.00E+05 | 3.00E+09 | |||||||||||||||||||
28 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 32x32 Image modeling, VQ encoding | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.12 | 3.17 | K=52.74, c=0.12 | 3.46E+14 | 2.59E+20 | 1.00E+08 | 64x64 VQ1024 Images | 1.00E+05 | 3.00E+09 | |||||||||||||||||||
29 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Language | Mathematics | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law | L(N) = AN^-a | 0.16 | 0.28 | A=4.432, a=0.16 | 2.59E+14 | 4.32E+20 | Characters (bytes) | 2.00E+05 | 3.00E+09 | |||||||||||||||||||||
30 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Language | Mathematics | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law | L(C) = KC^-c | 0.17 | 0.14 | K=366.4, c=0.17 | 2.59E+14 | 4.32E+20 | Characters (bytes) | 2.00E+05 | 3.00E+09 | |||||||||||||||||||||
31 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 16x16 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.22 | 2.64 | A=3.454, a=0.22 | 1.73E+14 | 8.64E+20 | 1.00E+08 | 16x16 Images | 1.00E+05 | 2.00E+09 | |||||||||||||||||||
32 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 16x16 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.16 | 2.64 | K=87.59, c=0.16 | 1.73E+14 | 8.64E+20 | 1.00E+08 | 16x16 Images | 1.00E+05 | 2.00E+09 | |||||||||||||||||||
33 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 32x32 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.13 | 2.2 | A=1.713, a=0.13 | 1.73E+14 | 8.64E+20 | 1.00E+08 | 32x32 Images | 1.00E+05 | 2.00E+09 | |||||||||||||||||||
34 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 32x32 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.1 | 2.21 | K=14.10, c=0.1 | 1.73E+14 | 8.64E+20 | 1.00E+08 | 32x32 Images | 1.00E+05 | 2.00E+09 | |||||||||||||||||||
35 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Text-to-Image generation | Decoder-only Transformer | Cross-entropy (text) | Loss | Parameters | Power law | L(N) = AN^-a | 0.037 | A=2.107, a=0.037 | 1.73E+16 | 3.46E+20 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 8.00E+08 | ||||||||||||||||||||||
36 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Text-to-Image generation | Decoder-only Transformer | Cross-entropy (image) | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.16 | 2 | A=3.919, a=0.16 | 1.73E+16 | 3.46E+20 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 8.00E+08 | |||||||||||||||||||||
37 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Text-to-Image generation | Decoder-only Transformer | Cross-entropy (combined) | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.15 | 1.93 | K=130.8, c=0.15 | 1.73E+16 | 3.46E+20 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 8.00E+08 | |||||||||||||||||||||
38 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Video | Video generation | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law | L(N) = AN^-a | 0.24 | 1.01 | A=12.48, a=0.24 | 8.64E+13 | 4.32E+20 | 1.00E+02 | 64x64 VQ256 Video hours | 1.00E+04 | 8.00E+08 | ||||||||||||||||||||
39 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Video | Video generation | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law | L(C) = KC^-c | 0.14 | 0.95 | K=137.7, c=0.14 | 8.64E+13 | 4.32E+20 | 1.00E+02 | 64x64 VQ256 Video hours | 1.00E+04 | 8.00E+08 | ||||||||||||||||||||
40 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 8x8 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.24 | 3.12 | A=2.862, a=0.24 | 1.00E+13 | 8.64E+19 | 1.00E+08 | 8x8 Images | 1.00E+05 | 3.00E+08 | |||||||||||||||||||
41 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Vision | Image modeling | 8x8 Image modeling, pixel encoding | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.19 | 3.13 | K=207.2, c=0.19 | 1.00E+13 | 8.64E+19 | 1.00E+08 | 8x8 Images | 1.00E+05 | 3.00E+08 | |||||||||||||||||||
42 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Image captioning | Decoder-only Transformer | Cross-entropy (text) | Loss | Parameters | Power law | L(N) = AN^-a | 0.039 | A=2.212, a=0.039 | 2.59E+16 | 3.46E+19 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 1.00E+08 | ||||||||||||||||||||||
43 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Image captioning | Decoder-only Transformer | Cross-entropy (image) | Loss | Parameters | Power law plus constant | L(N) = AN^-a + E | 0.15 | 2 | A=3.639, a=0.15 | 2.59E+16 | 3.46E+19 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 1.00E+08 | |||||||||||||||||||||
44 | Scaling Laws for Autoregressive Generative Modeling | 28 Oct 2020 | Multimodal | Image captioning | Decoder-only Transformer | Cross-entropy (combined) | Loss | Compute | Power law plus constant | L(C) = KC^-c + E | 0.16 | 1.97 | K=181.1, c=0.16 | 2.59E+16 | 3.46E+19 | Captions (32x32 Image, 128 BPE token) | 1.00E+05 | 1.00E+08 | |||||||||||||||||||||
45 | A Scaling Law for Syn2real Transfer: How Much Is Your Pre-training Effective? | 25 Aug 2021 | Vision, Transfer | Multiple | ResNet | Finetuning loss | Loss | Pretraining Data | Power law plus constant | L(D) = BD^-b + G | |||||||||||||||||||||||||||||
46 | Scaling Laws for Neural Language Models | 23 Jan 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Compute | Power law | L(C) = KC^-c | 0.05 | K=26.38, c=0.05 | 1.69E+10 | 3.45E+19 | 2.20E+07 | 2.30E+10 | BPE Tokens | 7.68E+02 | 1.50E+09 | ||||||||||||||||||||
47 | Scaling Laws for Neural Language Models | 23 Jan 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Power law | L(N) = AN^-a | 0.076 | A=11.48, a=0.076 | 1.69E+10 | 3.45E+19 | 2.20E+07 | 2.30E+10 | BPE Tokens | 7.68E+02 | 1.50E+09 | ||||||||||||||||||||
48 | Scaling Laws for Neural Language Models | 23 Jan 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Data | Power law | L(D) = BD^-b | 0.095 | B=20.81, b=0.095 | 1.69E+10 | 3.45E+19 | 2.20E+07 | 2.30E+10 | BPE Tokens | 7.68E+02 | 1.50E+09 | ||||||||||||||||||||
49 | Scaling Laws for Neural Language Models | 23 Jan 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Data | Bivariate power law | L(N,D) = [(A^(1/a)/N)^(a/b) + B^(1/b)/D]^b | 0.076 | 0.095 | A=11.48, a=0.076, B=20.81, b=0.095 | 1.69E+10 | 3.45E+19 | 2.20E+07 | 2.30E+10 | BPE Tokens | 7.68E+02 | 1.50E+09 |
50 | Scaling Laws for Neural Language Models | 23 Jan 2020 | Language | Language modeling | Decoder-only Transformer | Cross-entropy | Loss | Parameters | Training Steps | Bivariate power law - sum | L(N,S) = AN^-a + BS^-b | 0.076 | 0.76 | A=11.48, a=0.076, B=334.88, b=0.76 | 1.69E+10 | 3.45E+19 | 2.20E+07 | 2.30E+10 | BPE Tokens | 7.68E+02 | 1.50E+09 | Training at the critical batch size | |||||||||||||||||
51 | Effect of scale on catastrophic forgetting in neural networks | 21 Sep 2022 | Vision | ||||||||||||||||||||||||||||||||||||
52 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Switch Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.23 | 3.25E+12 | 4.33E+13 | 1.74E+08 | 2.96E+10 | ||||||||||||||||||||||||
53 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Switch Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.14 | 3.25E+12 | 4.33E+13 | 1.74E+08 | 2.96E+10 | ||||||||||||||||||||||||
54 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Switch Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.13 | 3.25E+12 | 4.33E+13 | 1.74E+08 | 2.96E+10 | ||||||||||||||||||||||||
55 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Switch Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.08 | 3.25E+12 | 4.33E+13 | 1.74E+08 | 2.96E+10 | ||||||||||||||||||||||||
56 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Switch Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.58 | 3.25E+12 | 4.33E+13 | 1.74E+08 | 2.96E+10 | ||||||||||||||||||||||||
57 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Encoder-Decoder Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.54 | 1.21E+12 | 6.38E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
58 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Encoder-Decoder Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.28 | 1.21E+12 | 6.38E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
59 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Encoder-Decoder Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.47 | 1.21E+12 | 6.38E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
60 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Encoder-Decoder Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.24 | 1.21E+12 | 6.38E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
61 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Encoder-Decoder Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.49 | 1.21E+12 | 6.38E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
62 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Funnel Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.47 | 1.10E+12 | 4.03E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
63 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Funnel Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.22 | 1.10E+12 | 4.03E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
64 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Funnel Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.38 | 1.10E+12 | 4.03E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
65 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Funnel Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.18 | 1.10E+12 | 4.03E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
66 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Funnel Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.46 | 1.10E+12 | 4.03E+13 | 1.60E+07 | 2.90E+09 | ||||||||||||||||||||||||
67 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MoS-Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.43 | 1.29E+12 | 1.12E+14 | 2.70E+07 | 2.90E+09 | ||||||||||||||||||||||||
68 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MoS-Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.21 | 1.29E+12 | 1.12E+14 | 2.70E+07 | 2.90E+09 | ||||||||||||||||||||||||
69 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MoS-Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.43 | 1.29E+12 | 1.12E+14 | 2.70E+07 | 2.90E+09 | ||||||||||||||||||||||||
70 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MoS-Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.2 | 1.29E+12 | 1.12E+14 | 2.70E+07 | 2.90E+09 | ||||||||||||||||||||||||
71 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MoS-Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.47 | 1.29E+12 | 1.12E+14 | 2.70E+07 | 2.90E+09 | ||||||||||||||||||||||||
72 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MLP-Mixer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.32 | 3.83E+12 | 4.83E+13 | 6.70E+07 | 2.86E+09 | ||||||||||||||||||||||||
73 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MLP-Mixer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | -0.03 | 3.83E+12 | 4.83E+13 | 6.70E+07 | 2.86E+09 | ||||||||||||||||||||||||
74 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MLP-Mixer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.26 | 3.83E+12 | 4.83E+13 | 6.70E+07 | 2.86E+09 | ||||||||||||||||||||||||
75 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MLP-Mixer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.65 | 3.83E+12 | 4.83E+13 | 6.70E+07 | 2.86E+09 | ||||||||||||||||||||||||
76 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | MLP-Mixer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | -0.02 | 3.83E+12 | 4.83E+13 | 6.70E+07 | 2.86E+09 | ||||||||||||||||||||||||
77 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | GLU-Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.49 | 1.29E+12 | 6.13E+13 | 2.60E+07 | 2.85E+09 | ||||||||||||||||||||||||
78 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | GLU-Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.24 | 1.29E+12 | 6.13E+13 | 2.60E+07 | 2.85E+09 | ||||||||||||||||||||||||
79 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | GLU-Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.42 | 1.29E+12 | 6.13E+13 | 2.60E+07 | 2.85E+09 | ||||||||||||||||||||||||
80 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | GLU-Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.22 | 1.29E+12 | 6.13E+13 | 2.60E+07 | 2.85E+09 | ||||||||||||||||||||||||
81 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | GLU-Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.46 | 1.29E+12 | 6.13E+13 | 2.60E+07 | 2.85E+09 | ||||||||||||||||||||||||
82 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | LConv | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.32 | 1.20E+12 | 7.70E+13 | 1.70E+07 | 2.30E+09 | ||||||||||||||||||||||||
83 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | LConv | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.13 | 1.20E+12 | 7.70E+13 | 1.70E+07 | 2.30E+09 | ||||||||||||||||||||||||
84 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | LConv | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.29 | 1.20E+12 | 7.70E+13 | 1.70E+07 | 2.30E+09 | ||||||||||||||||||||||||
85 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | LConv | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.11 | 1.20E+12 | 7.70E+13 | 1.70E+07 | 2.30E+09 | ||||||||||||||||||||||||
86 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | LConv | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.48 | 1.20E+12 | 7.70E+13 | 1.70E+07 | 2.30E+09 | ||||||||||||||||||||||||
87 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Evolved Transformer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.44 | 1.31E+12 | 7.13E+13 | 1.90E+07 | 2.20E+09 | ||||||||||||||||||||||||
88 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Evolved Transformer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.22 | 1.31E+12 | 7.13E+13 | 1.90E+07 | 2.20E+09 | ||||||||||||||||||||||||
89 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Evolved Transformer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.42 | 1.31E+12 | 7.13E+13 | 1.90E+07 | 2.20E+09 | ||||||||||||||||||||||||
90 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Evolved Transformer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.21 | 1.31E+12 | 7.13E+13 | 1.90E+07 | 2.20E+09 | ||||||||||||||||||||||||
91 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Evolved Transformer | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 0.47 | 1.31E+12 | 7.13E+13 | 1.90E+07 | 2.20E+09 | ||||||||||||||||||||||||
92 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | DConv | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 1.39E+12 | 7.80E+13 | 2.20E+07 | 1.20E+09 | |||||||||||||||||||||||||
93 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | DConv | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 1.39E+12 | 7.80E+13 | 2.20E+07 | 1.20E+09 | |||||||||||||||||||||||||
94 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | DConv | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 1.39E+12 | 7.80E+13 | 2.20E+07 | 1.20E+09 | |||||||||||||||||||||||||
95 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | DConv | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 1.39E+12 | 7.80E+13 | 2.20E+07 | 1.20E+09 | |||||||||||||||||||||||||
96 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | DConv | Downstream Accuracy | Loss | Upstream Negative Cross-entropy | Power law | Ld(Lu) = GLu^l | 1.39E+12 | 7.80E+13 | 2.20E+07 | 1.20E+09 | |||||||||||||||||||||||||
97 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Performer | Upstream Negative Cross-entropy | Loss | Compute | Power law | L(C) = KC^c | 0.25 | 1.14E+12 | 3.28E+13 | 1.60E+07 | 7.39E+08 | ||||||||||||||||||||||||
98 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Performer | Downstream Accuracy | Loss | Compute | Power law | L(C) = KC^c | 0.05 | 1.14E+12 | 3.28E+13 | 1.60E+07 | 7.39E+08 | ||||||||||||||||||||||||
99 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Performer | Upstream Negative Cross-entropy | Loss | Parameters | Power law | L(N) = AN^a | 0.24 | 1.14E+12 | 3.28E+13 | 1.60E+07 | 7.39E+08 | ||||||||||||||||||||||||
100 | Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? | 21 Jul 2022 | Language | Language modeling | Performer | Downstream Accuracy | Loss | Parameters | Power law | L(N) = AN^a | 0.05 | 1.14E+12 | 3.28E+13 | 1.60E+07 | 7.39E+08 |
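As a reading aid for the "Functional form" and "Parameter values" columns, here is a minimal sketch (Python, not taken from any of the cited papers) that evaluates two of the tabulated fits with the parameter values listed above: the bivariate power law plus constant from *Training Compute-Optimal Large Language Models* and the L(C) power law from *Scaling Laws for Neural Language Models*. Function names and example inputs are illustrative assumptions; units follow the table (parameters, BPE tokens, FLOP).

```python
# Sketch: evaluate two fitted scaling laws from the table above.
# Parameter defaults are copied from the corresponding rows; the function
# names and example inputs below are illustrative, not from the sources.

def chinchilla_loss(n_params: float, n_tokens: float,
                    A: float = 406.4, a: float = 0.34,
                    B: float = 410.7, b: float = 0.28,
                    E: float = 1.69) -> float:
    """Bivariate power law plus constant: L(N, D) = A*N^-a + B*D^-b + E."""
    return A * n_params ** (-a) + B * n_tokens ** (-b) + E


def kaplan_compute_loss(flops: float, K: float = 26.38, c: float = 0.05) -> float:
    """Power law in compute: L(C) = K*C^-c, with C in FLOP as in the table."""
    return K * flops ** (-c)


if __name__ == "__main__":
    # 70B parameters trained on 1.4T tokens -> roughly 1.94 nats under this fit.
    print(f"L(N=70e9, D=1.4e12) = {chinchilla_loss(70e9, 1.4e12):.3f}")
    # A training run of about 1e21 FLOP under the compute-only fit.
    print(f"L(C=1e21 FLOP)      = {kaplan_compute_loss(1e21):.3f}")
```

The same pattern applies to the other rows: substitute the row's functional form and its "Parameter values" entry, taking care that the compute, data, and parameter counts are expressed in the units given by the range columns.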