Scaling laws
- Original Kaplan laws
- MoE laws
- Chinchilla scaling laws (worked example after this list)
- Broken scaling laws
- Newer scaling laws with data bottleneck
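The Chinchilla fit is easy to sanity-check numerically. Below is a minimal sketch assuming the published parametric loss L(N, D) = E + A/N^alpha + B/D^beta with the approximate Approach-3 constants, the C ~ 6ND compute estimate, and the ~20 tokens-per-parameter rule of thumb; the FLOP budget in the example is an assumption chosen to land near Chinchilla's own 70B-parameter / 1.4T-token point.

```python
# Minimal sketch of the Chinchilla parametric loss fit (Hoffmann et al. 2022).
# The constants below are the approximate published Approach-3 fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pre-training loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """Roughly compute-optimal (N, D) for a FLOP budget C, using C ~ 6*N*D
    and the ~20 tokens-per-parameter rule of thumb."""
    N = (C / (6 * 20)) ** 0.5   # D ~ 20*N and C = 6*N*D  =>  C = 120*N^2
    D = 20 * N
    return N, D

if __name__ == "__main__":
    C = 5.76e23                 # roughly Chinchilla's budget (70B params, 1.4T tokens)
    N, D = compute_optimal(C)
    print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens, "
          f"predicted loss ~ {chinchilla_loss(N, D):.3f}")
```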
LLM evaluation
- Survey of LLM evaluation as of July 2023
- Do LLMs recite or reason? Both!
- see also lmsys.org: the Vicuna benchmark is deprecated; use MT-Bench instead (judging sketch after this list)
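MT-Bench scores answers with a stronger LLM acting as judge. Below is a minimal sketch of the pairwise-judging idea only; the prompt wording is illustrative and `call_llm` is a hypothetical callable standing in for whatever model API you use, not the lmsys codebase.

```python
# Minimal sketch of pairwise LLM-as-judge evaluation in the spirit of MT-Bench.
# The prompt wording is illustrative; `call_llm` is any callable wrapping a judge model.
import re

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question.\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Reply with exactly one of [[A]], [[B]], or [[C]] for a tie."
)

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask the judge model for a verdict; returns 'A', 'B', or 'tie'."""
    reply = call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    match = re.search(r"\[\[([ABC])\]\]", reply)
    return {"A": "A", "B": "B", "C": "tie"}[match.group(1)] if match else "tie"

if __name__ == "__main__":
    # Stub judge that always prefers answer A, just to show the plumbing.
    print(judge_pair("What is 2+2?", "4", "5", call_llm=lambda prompt: "[[A]]"))
```

In practice the judge is also queried with the two answers in swapped order, since LLM judges show position bias.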
Long context
- Focused Transformer: trains attention to separate relevant documents from distractors in the context; provides LongLLaMA-3B
- NoPE: no positional encodings
- Lost in the middle: do long-context transformers suffer from a blindness to mid-context information, similar to LSTMs?
- LongNet: 1B tokens with dilated attention (idea borrowed from dilated convolution); see the sketch after this list
- Reddit thread
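On LongNet: dilated attention sparsifies which keys each query can see, in direct analogy to dilated convolution. The sketch below shows the index pattern for a single (segment length w, dilation rate r) pair; the actual method mixes several such pairs across heads and uses an optimized implementation rather than a dense mask.

```python
# Minimal sketch of one dilated-attention pattern (LongNet-style), illustration only.
# For segment length w and dilation rate r, each segment keeps every r-th position.
import numpy as np

def dilated_attention_mask(seq_len: int, w: int, r: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask: True where query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for start in range(0, seq_len, w):                       # split into segments
        idx = np.arange(start, min(start + w, seq_len), r)   # keep every r-th token
        mask[np.ix_(idx, idx)] = True                        # dense attention on kept tokens
    return mask

if __name__ == "__main__":
    m = dilated_attention_mask(seq_len=16, w=8, r=2)
    print(m.astype(int))
    # Cost per segment drops from w*w to (w/r)^2, which is the trade of density
    # for reach that lets the context length grow.
```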
Mixture of experts
- PanGu
- ST-MoE
- Tutel
- MegaBlocks
- expert collapse (the router collapsing onto a few experts); see the routing sketch after this list
- designing MoE
- LLM-Blender: ensembling LLMs
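On expert collapse and designing MoE: routers are usually trained with an auxiliary load-balancing loss so they do not collapse onto a few experts. Below is a minimal NumPy sketch of top-2 routing with a Switch/ST-MoE-style balancing term; the shapes, loss weighting, and random gate weights are illustrative only.

```python
# Minimal sketch of top-2 token routing with a load-balancing auxiliary loss,
# in the spirit of Switch/ST-MoE; NumPy only, all sizes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_top2(tokens, w_gate, n_experts):
    """tokens: (T, d), w_gate: (d, n_experts). Returns assignments, probs, aux loss."""
    logits = tokens @ w_gate                        # (T, E) router scores
    probs = softmax(logits, axis=-1)
    top2 = np.argsort(-probs, axis=-1)[:, :2]       # two best experts per token

    # Balancing term: fraction of tokens dispatched to each expert (top-1 choice)
    # times the mean router probability per expert. A uniform router minimizes it;
    # without this term the router tends to collapse onto a few experts.
    dispatch = np.bincount(top2[:, 0], minlength=n_experts) / len(tokens)
    importance = probs.mean(axis=0)
    aux_loss = n_experts * float(dispatch @ importance)
    return top2, probs, aux_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, E = 32, 16, 4
    tokens, w_gate = rng.normal(size=(T, d)), rng.normal(size=(d, E))
    top2, probs, aux = route_top2(tokens, w_gate, E)
    print("top-1 counts per expert:", np.bincount(top2[:, 0], minlength=E),
          "aux loss:", round(aux, 3))
```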