Database of scaling-law fits. Fields recorded per entry: Paper; Date; Category; Task; Task details; Architecture; Loss (metric); Dependent variable; Scaling variable 1; Scaling variable 2; Type; Functional form; Exponent 1; Exponent 2; Irreducible loss; Parameter values; Compute range (FLOP); Data range (mixed) with Data unit; Size range (parameters); Additional conditions; Scaling strategy. Entries are grouped by paper below; fields left empty in the source are omitted, and blank table cells mean the source records no value.
Scaling Vision Transformers (8 Jun 2022). Category: Vision, Transfer. Task: ImageNet. Architecture: Vision Transformer. Loss metric: accuracy, as a function of compute.
Type: power law with transition, L(C) = K·(C + C0)^-c + E.
Ranges (both fits): compute 2.13e18–1.06272e23 FLOP; data 1e8–1e10 images; size 5.4e6–1.8e9 parameters.

| Task details | K | C0 | c | E |
|---|---|---|---|---|
| Fine-tuning | 0.26 | 0.01 | 0.35 | 0.09 |
| Linear 10-shot | 0.63 | 0.52 | 0.32 | 0.12 |
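A minimal sketch of evaluating this functional form with the fitted constants above. The helper name is ours, and the compute units are those of the original fit, so the C values below are purely illustrative:

```python
def vit_loss(C, K, C0, c, E):
    """Power law with transition: L(C) = K * (C + C0)**(-c) + E.
    C0 shifts the curve at small compute; E is the loss floor."""
    return K * (C + C0) ** (-c) + E

finetune = dict(K=0.26, C0=0.01, c=0.35, E=0.09)   # ImageNet fine-tuning fit
ten_shot = dict(K=0.63, C0=0.52, c=0.32, E=0.12)   # linear 10-shot fit

for C in [0.1, 1.0, 10.0, 100.0]:  # illustrative compute values only
    print(f"C={C:7.1f}  fine-tune={vit_loss(C, **finetune):.3f}  "
          f"10-shot={vit_loss(C, **ten_shot):.3f}")
```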
Learning Curve Theory (8 Feb 2021). Category: Theory. No fit is recorded for this entry.
Scaling Scaling Laws with Board Games (7 Apr 2021). Category: Games, RL. Task: Hex. Architecture: AlphaZero. Loss metric: Elo, as a function of compute and board size.
Type: logarithmic with transition, L(C, Bs) = clamp(Mi·Bs + K·log C + ci, Mp·Bs + cp, 0): Elo rises log-linearly in compute, bounded below by a board-size-dependent plateau and above by 0.
Fitted values: Mi = -430, K = 510, ci = -4400, Mp = -270, cp = 570. Compute range: 1e9–1e17 FLOP.
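A sketch of the clamped form, reading clamp(x, lo, hi) as bounding the log-linear incline between the floor and 0. The source does not record the base of the logarithm, so base 10 here is an assumption, as are the function and variable names:

```python
import numpy as np

Mi, K, ci = -430.0, 510.0, -4400.0  # incline: board-size slope, log-compute slope, intercept
Mp, cp = -270.0, 570.0              # plateau: floor as a function of board size

def hex_elo(C, Bs):
    """Elo rises log-linearly in compute, clamped between a floor and 0
    (assuming log10; the fitted constants are from the entry above)."""
    incline = Mi * Bs + K * np.log10(C) + ci
    floor = Mp * Bs + cp
    return float(np.clip(incline, floor, 0.0))

print(hex_elo(1e13, Bs=9))  # a point inside the reported 1e9-1e17 FLOP range
```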
Data Scaling Laws in NMT: The Effect of Noise and Architecture (4 Feb 2022). Category: Language. Task: translation, English to German. Loss metric: cross-entropy, as a function of data.
Type: power law plus constant, L(D) = B·(D^-1 + E)^b.
Ranges (all fits): data 5.12e5–5.12e8 sentence pairs; size fixed at 3e8 parameters.

| Architecture | Additional conditions | B | E | b |
|---|---|---|---|---|
| Encoder-Decoder Transformer | | 1.969 | 0.057 | 0.285 |
| Hybrid Transformer-LSTM | | 1.817 | 0.11 | 0.285 |
| Decoder-only Transformer | | 2.011 | 0.078 | 0.285 |
| Encoder-Decoder Transformer | Source noise | 2.222 | 0.067 | 0.296 |
| Encoder-Decoder Transformer | Target noise | 2.772 | 0.323 | 0.296 |
| Encoder-Decoder Transformer | No filtering | 2.501 | 0.034 | 0.278 |
| Encoder-Decoder Transformer | CDS filtering | 2.235 | 0.054 | 0.278 |
| Encoder-Decoder Transformer | Bicleaner filtering | 2.130 | 0.064 | 0.278 |
| Encoder-Decoder Transformer | Back-translation 2L2L | 2.343 | 0.059 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 6L6L | 2.288 | 0.054 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 32L6L | 2.251 | 0.040 | 0.198 |
| Encoder-Decoder Transformer | Back-translation 64L6L | 2.224 | 0.037 | 0.198 |
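A sketch comparing two of the fitted curves above (the clean encoder-decoder baseline against target-side noise). The source does not record the unit of D inside the fit; we assume D is in millions of sentence pairs, which places the D^-1-to-E transition inside the reported data range, so treat that as an assumption:

```python
def nmt_loss(D, B, E, b):
    """L(D) = B * (D**-1 + E)**b: data-limited at small D, approaching
    the asymptote B * E**b as D grows."""
    return B * (1.0 / D + E) ** b

baseline = dict(B=1.969, E=0.057, b=0.285)       # encoder-decoder, clean data
target_noise = dict(B=2.772, E=0.323, b=0.296)   # encoder-decoder, target noise

for D in [0.512, 5.12, 51.2, 512.0]:  # millions of sentence pairs (assumed unit)
    print(f"D={D:7.3f}M  clean={nmt_loss(D, **baseline):.3f}  "
          f"target-noise={nmt_loss(D, **target_noise):.3f}")
```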
Scaling Laws for a Multi-Agent Reinforcement Learning Model (29 Sept 2022). Category: Games, RL. Architecture: AlphaZero. Loss metric: player strength. Scaling strategy: width scaling.
Type: power laws, L(N) = N^a and L(C) = C^c (the exponents are positive because player strength improves with scale).

| Game | Scaling variable | Exponent | Compute range (FLOP) | Data range (training steps) | Size range (parameters) |
|---|---|---|---|---|---|
| Pentago | Parameters | a = 0.87 | 4e11–4e16 | 2e1–1e4 | 2e3–3e5 |
| Pentago | Compute | c = 0.55 | 4e11–4e16 | 2e1–1e4 | 2e3–3e5 |
| ConnectFour | Parameters | a = 0.88 | 1.5e11–3e16 | 2e1–1e4 | 6e2–2e5 |
| ConnectFour | Compute | c = 0.55 | 1.5e11–3e16 | 2e1–1e4 | 6e2–2e5 |
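Since these are pure power laws, the exponents translate directly into multiplicative gains; a one-line sketch of that reading, using the Pentago fits above:

```python
a, c = 0.87, 0.55  # Pentago exponents from the table above
print(f"10x parameters -> player strength x{10 ** a:.1f}")  # about 7.4x
print(f"10x compute    -> player strength x{10 ** c:.1f}")  # about 3.5x
```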
Training Compute-Optimal Large Language Models (29 Mar 2022). Category: Language. Task: language modeling. Architecture: decoder-only Transformer. Loss metric: cross-entropy, as a function of parameters and data.
Type: bivariate power law plus constant (sum), L(N, D) = A·N^-a + B·D^-b + E.
Fitted values: A = 406.4, a = 0.34, B = 410.7, b = 0.28, E = 1.69.
Ranges: compute 6e18–3e21 FLOP; data 1e7–1e9 BPE tokens; size 2e7–1.6e10 parameters.
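These fitted constants determine a compute-optimal split of a budget between N and D. A minimal sketch (our helper names), using the common C ≈ 6·N·D FLOP approximation for dense transformers, which is an assumption rather than part of this entry:

```python
import numpy as np

A, a, B, b, E = 406.4, 0.34, 410.7, 0.28, 1.69  # fitted values above

def loss(N, D):
    """L(N, D) = A*N**-a + B*D**-b + E."""
    return A * N ** (-a) + B * D ** (-b) + E

def optimal_split(C):
    """Grid-search over N, with D = C / (6*N), to minimize the fitted loss."""
    N = np.logspace(7, 11, 4000)   # candidate model sizes
    D = C / (6.0 * N)              # tokens implied by the budget
    i = int(np.argmin(loss(N, D)))
    return N[i], D[i]

N_opt, D_opt = optimal_split(1e21)  # a budget inside the fitted compute range
print(f"N* = {N_opt:.2e} params, D* = {D_opt:.2e} tokens, "
      f"D*/N* = {D_opt / N_opt:.0f}")
```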
Scaling Laws for Autoregressive Generative Modeling (28 Oct 2020). Architecture: decoder-only Transformer throughout. Loss metric: cross-entropy. Fits are power laws in parameters, L(N) = A·N^-a, or compute, L(C) = K·C^-c, with a "+ E" term where the fit includes an irreducible loss (the source also records an irreducible-loss value for some pure power-law fits).

| Category | Task details | Form | Fitted values | Irreducible loss | Compute range (FLOP) | Data (unit) | Size range (parameters) |
|---|---|---|---|---|---|---|---|
| Language | Language modeling | L(N) = A·N^-a | A = 9.810, a = 0.07 | | 8.64e14–4.32e23 | BPE tokens | 1e5–1.75e11 |
| Language | Language modeling | L(C) = K·C^-c | K = 23.27, c = 0.048 | | 8.64e14–4.32e23 | BPE tokens | 1e5–1.75e11 |
| Vision | 16x16 image modeling, VQ encoding | L(N) = A·N^-a + E | A = 3.767, a = 0.13 | E = 3.99 | 1e13–1.73e20 | 1e8 64x64 VQ256 images | 1e5–3e9 |
| Vision | 16x16 image modeling, VQ encoding | L(C) = K·C^-c + E | K = 32.31, c = 0.11 | E = 4.09 | 1e13–1.73e20 | 1e8 64x64 VQ256 images | 1e5–3e9 |
| Vision | 32x32 image modeling, VQ encoding | L(N) = A·N^-a + E | A = 3.972, a = 0.14 | E = 3.07 | 3.46e14–2.59e20 | 1e8 64x64 VQ1024 images | 1e5–3e9 |
| Vision | 32x32 image modeling, VQ encoding | L(C) = K·C^-c + E | K = 52.74, c = 0.12 | E = 3.17 | 3.46e14–2.59e20 | 1e8 64x64 VQ1024 images | 1e5–3e9 |
| Language | Mathematics | L(N) = A·N^-a | A = 4.432, a = 0.16 | 0.28 | 2.59e14–4.32e20 | characters (bytes) | 2e5–3e9 |
| Language | Mathematics | L(C) = K·C^-c | K = 366.4, c = 0.17 | 0.14 | 2.59e14–4.32e20 | characters (bytes) | 2e5–3e9 |
| Vision | 16x16 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 3.454, a = 0.22 | E = 2.64 | 1.73e14–8.64e20 | 1e8 16x16 images | 1e5–2e9 |
| Vision | 16x16 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 87.59, c = 0.16 | E = 2.64 | 1.73e14–8.64e20 | 1e8 16x16 images | 1e5–2e9 |
| Vision | 32x32 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 1.713, a = 0.13 | E = 2.2 | 1.73e14–8.64e20 | 1e8 32x32 images | 1e5–2e9 |
| Vision | 32x32 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 14.10, c = 0.1 | E = 2.21 | 1.73e14–8.64e20 | 1e8 32x32 images | 1e5–2e9 |
| Multimodal | Text-to-image generation, text loss | L(N) = A·N^-a | A = 2.107, a = 0.037 | | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Multimodal | Text-to-image generation, image loss | L(N) = A·N^-a + E | A = 3.919, a = 0.16 | E = 2 | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Multimodal | Text-to-image generation, combined loss | L(C) = K·C^-c + E | K = 130.8, c = 0.15 | E = 1.93 | 1.73e16–3.46e20 | captions (32x32 image, 128 BPE tokens) | 1e5–8e8 |
| Video | Video generation | L(N) = A·N^-a | A = 12.48, a = 0.24 | 1.01 | 8.64e13–4.32e20 | 1e2 hours, 64x64 VQ256 video | 1e4–8e8 |
| Video | Video generation | L(C) = K·C^-c | K = 137.7, c = 0.14 | 0.95 | 8.64e13–4.32e20 | 1e2 hours, 64x64 VQ256 video | 1e4–8e8 |
| Vision | 8x8 image modeling, pixel encoding | L(N) = A·N^-a + E | A = 2.862, a = 0.24 | E = 3.12 | 1e13–8.64e19 | 1e8 8x8 images | 1e5–3e8 |
| Vision | 8x8 image modeling, pixel encoding | L(C) = K·C^-c + E | K = 207.2, c = 0.19 | E = 3.13 | 1e13–8.64e19 | 1e8 8x8 images | 1e5–3e8 |
| Multimodal | Image captioning, text loss | L(N) = A·N^-a | A = 2.212, a = 0.039 | | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
| Multimodal | Image captioning, image loss | L(N) = A·N^-a + E | A = 3.639, a = 0.15 | E = 2 | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
| Multimodal | Image captioning, combined loss | L(C) = K·C^-c + E | K = 181.1, c = 0.16 | E = 1.97 | 2.59e16–3.46e19 | captions (32x32 image, 128 BPE tokens) | 1e5–1e8 |
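For the fits that include an irreducible term, the reducible part K·C^-c is what additional compute buys. A short sketch comparing several of the vision and multimodal fits above at a fixed budget (the names and the chosen budget are ours):

```python
fits = {  # K, c, E from the table above (forms with an explicit + E)
    "16x16 pixels":  dict(K=87.59, c=0.16, E=2.64),
    "32x32 pixels":  dict(K=14.10, c=0.10, E=2.21),
    "8x8 pixels":    dict(K=207.2, c=0.19, E=3.13),
    "text-to-image": dict(K=130.8, c=0.15, E=1.93),
}
C = 1e19  # FLOPs, inside the reported compute ranges
for name, f in fits.items():
    reducible = f["K"] * C ** (-f["c"])
    print(f"{name:14s} L = {f['E'] + reducible:.3f} "
          f"(floor {f['E']:.2f} + reducible {reducible:.3f})")
```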
A Scaling Law for Syn2real Transfer: How Much Is Your Pre-training Effective? (25 Aug 2021). Category: Vision, Transfer. Task: multiple. Architecture: ResNet. Loss metric: fine-tuning loss, as a function of pre-training data.
Type: power law plus constant, L(D) = B·D^-b + G; no fitted values or ranges are recorded.
Scaling Laws for Neural Language Models (23 Jan 2020). Category: Language. Task: language modeling. Architecture: decoder-only Transformer. Loss metric: cross-entropy.
Ranges (all fits): compute 1.69e10–3.45e19 FLOP; data 2.2e7–2.3e10 BPE tokens; size 7.68e2–1.5e9 parameters.

| Scaling variable(s) | Type | Form | Fitted values | Additional conditions |
|---|---|---|---|---|
| Compute | Power law | L(C) = K·C^-c | K = 26.38, c = 0.05 | |
| Parameters | Power law | L(N) = A·N^-a | A = 11.48, a = 0.076 | |
| Data | Power law | L(D) = B·D^-b | B = 20.81, b = 0.095 | |
| Parameters, data | Bivariate power law | L(N, D) = [(A·N^-a)^(1/b) + (B·D^-b)^(1/b)]^b | A = 11.48, a = 0.076, B = 20.81, b = 0.095 | |
| Parameters, training steps | Bivariate power law, sum | L(N, S) = A·N^-a + B·S^-b | A = 11.48, a = 0.076, B = 334.88, b = 0.76 | Training at the critical batch size |

The bivariate N, D form is written so that it reduces to the parameter-only fit as D grows and to the data-only fit as N grows.
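A quick numerical sketch of that consistency, checking the two limits against the single-variable fits (the helper name is ours):

```python
A, a = 11.48, 0.076   # parameter fit above
B, b = 20.81, 0.095   # data fit above

def loss_nd(N, D):
    """L(N, D) = [(A*N**-a)**(1/b) + (B*D**-b)**(1/b)]**b."""
    return ((A * N ** -a) ** (1 / b) + (B * D ** -b) ** (1 / b)) ** b

print(loss_nd(1.5e9, 1e30))      # huge D: recovers the N-only fit
print(11.48 * 1.5e9 ** -0.076)   # L(N) = A*N**-a, same value
print(loss_nd(1e30, 2.3e10))     # huge N: recovers the D-only fit
print(20.81 * 2.3e10 ** -0.095)  # L(D) = B*D**-b, same value
```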
Effect of scale on catastrophic forgetting in neural networks (21 Sep 2022). Category: Vision. No fit is recorded for this entry.
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? (21 Jul 2022). Category: Language. Task: language modeling. Five power-law fits are recorded per architecture: upstream negative cross-entropy and downstream accuracy against compute, L(C) = K·C^c; the same two metrics against parameters, L(N) = A·N^a; and downstream accuracy against upstream negative cross-entropy, Ld(Lu) = G·Lu^l. Only the exponents are recorded; they are positive where the metric improves with scale.

| Architecture | Compute range (FLOP) | Size range (parameters) | c, upstream | c, downstream | a, upstream | a, downstream | l |
|---|---|---|---|---|---|---|---|
| Switch Transformer | 3.25e12–4.33e13 | 1.74e8–2.96e10 | 0.23 | 0.14 | 0.13 | 0.08 | 0.58 |
| Encoder-Decoder Transformer | 1.21e12–6.38e13 | 1.6e7–2.9e9 | 0.54 | 0.28 | 0.47 | 0.24 | 0.49 |
| Funnel Transformer | 1.10e12–4.03e13 | 1.6e7–2.9e9 | 0.47 | 0.22 | 0.38 | 0.18 | 0.46 |
| MoS-Transformer | 1.29e12–1.12e14 | 2.7e7–2.9e9 | 0.43 | 0.21 | 0.43 | 0.20 | 0.47 |
| MLP-Mixer | 3.83e12–4.83e13 | 6.7e7–2.86e9 | 0.32 | -0.03 | 0.26 | 0.65 | -0.02 |
| GLU-Transformer | 1.29e12–6.13e13 | 2.6e7–2.85e9 | 0.49 | 0.24 | 0.42 | 0.22 | 0.46 |
| LConv | 1.20e12–7.70e13 | 1.7e7–2.3e9 | 0.32 | 0.13 | 0.29 | 0.11 | 0.48 |
| Evolved Transformer | 1.31e12–7.13e13 | 1.9e7–2.2e9 | 0.44 | 0.22 | 0.42 | 0.21 | 0.47 |
| DConv | 1.39e12–7.80e13 | 2.2e7–1.2e9 | | | | | |
| Performer | 1.14e12–3.28e13 | 1.6e7–7.39e8 | 0.25 | 0.05 | 0.24 | 0.05 | |
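Because only exponents are recorded, these entries support relative rather than absolute predictions. A short sketch of that reading for the downstream-accuracy-vs-compute column (values from the table above):

```python
c_downstream = {  # downstream-accuracy-vs-compute exponents from the table
    "Encoder-Decoder Transformer": 0.28,
    "GLU-Transformer": 0.24,
    "Evolved Transformer": 0.22,
    "Switch Transformer": 0.14,
    "LConv": 0.13,
    "Performer": 0.05,
    "MLP-Mixer": -0.03,  # downstream metric worsens as compute grows
}
for arch, c in c_downstream.items():
    # metric ~ C**c, so a 10x compute increase scales it by 10**c
    print(f"{arch:28s} 10x compute -> downstream metric x{10 ** c:.2f}")
```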