Scaling laws
- Original Kaplan laws
- MoE laws
- Chinchilla scaling laws (worked example after this list)
- Broken scaling laws
- Newer scaling laws with data bottleneck
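The Chinchilla fit is easy to sanity-check numerically. Below is a minimal sketch assuming the published parametric loss L(N, D) = E + A/N^alpha + B/D^beta with the approximate Approach-3 constants, the C ~ 6ND compute estimate, and the ~20 tokens-per-parameter rule of thumb; the FLOP budget in the example is an assumption chosen to land near Chinchilla's own 70B-parameter / 1.4T-token point.

```python
# Minimal sketch of the Chinchilla parametric loss fit (Hoffmann et al. 2022).
# The constants below are the approximate published Approach-3 fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pre-training loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """Roughly compute-optimal (N, D) for a FLOP budget C, using C ~ 6*N*D
    and the ~20 tokens-per-parameter rule of thumb."""
    N = (C / (6 * 20)) ** 0.5   # D ~ 20*N and C = 6*N*D  =>  C = 120*N^2
    D = 20 * N
    return N, D

if __name__ == "__main__":
    C = 5.76e23                 # roughly Chinchilla's budget (70B params, 1.4T tokens)
    N, D = compute_optimal(C)
    print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens, "
          f"predicted loss ~ {chinchilla_loss(N, D):.3f}")
```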
LLM evaluation
- Survey of LLM evaluation as of July 2023
- Do LLMs recite or reason? Both!
- see also lmsys.org: the Vicuna benchmark is deprecated; use MT-Bench instead (judging sketch after this list)
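MT-Bench scores answers with a stronger LLM acting as judge. Below is a minimal sketch of the pairwise-judging idea only; the prompt wording is illustrative and `call_llm` is a hypothetical callable standing in for whatever model API you use, not the lmsys codebase.

```python
# Minimal sketch of pairwise LLM-as-judge evaluation in the spirit of MT-Bench.
# The prompt wording is illustrative; `call_llm` is any callable wrapping a judge model.
import re

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question.\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Reply with exactly one of [[A]], [[B]], or [[C]] for a tie."
)

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask the judge model for a verdict; returns 'A', 'B', or 'tie'."""
    reply = call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    match = re.search(r"\[\[([ABC])\]\]", reply)
    return {"A": "A", "B": "B", "C": "tie"}[match.group(1)] if match else "tie"

if __name__ == "__main__":
    # Stub judge that always prefers answer A, just to show the plumbing.
    print(judge_pair("What is 2+2?", "4", "5", call_llm=lambda prompt: "[[A]]"))
```

In practice the judge is also queried with the two answers in swapped order, since LLM judges show position bias.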
Long context
- Focused Transformer: trains attention to separate relevant documents from distractors in the context; provides LongLLaMA-3B
- NoPE: no positional encodings
- Lost in the middle: do long-context transformers suffer from a blindness to mid-context information, similar to LSTMs?
- LongNet: 1B tokens with dilated attention (idea borrowed from dilated convolution); see the sketch after this list
- Reddit thread
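On LongNet: dilated attention sparsifies which keys each query can see, in direct analogy to dilated convolution. The sketch below shows the index pattern for a single (segment length w, dilation rate r) pair; the actual method mixes several such pairs across heads and uses an optimized implementation rather than a dense mask.

```python
# Minimal sketch of one dilated-attention pattern (LongNet-style), illustration only.
# For segment length w and dilation rate r, each segment keeps every r-th position.
import numpy as np

def dilated_attention_mask(seq_len: int, w: int, r: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask: True where query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for start in range(0, seq_len, w):                       # split into segments
        idx = np.arange(start, min(start + w, seq_len), r)   # keep every r-th token
        mask[np.ix_(idx, idx)] = True                        # dense attention on kept tokens
    return mask

if __name__ == "__main__":
    m = dilated_attention_mask(seq_len=16, w=8, r=2)
    print(m.astype(int))
    # Cost per segment drops from w*w to (w/r)^2, which is the trade of density
    # for reach that lets the context length grow.
```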
Mixture of experts
- PanGu
- ST-MoE
- Tutel
- MegaBlocks
- expert collapse (the router collapsing onto a few experts); see the routing sketch after this list
- designing MoE
- LLM-Blender: ensembling LLMs
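On expert collapse and designing MoE: routers are usually trained with an auxiliary load-balancing loss so they do not collapse onto a few experts. Below is a minimal NumPy sketch of top-2 routing with a Switch/ST-MoE-style balancing term; the shapes, loss weighting, and random gate weights are illustrative only.

```python
# Minimal sketch of top-2 token routing with a load-balancing auxiliary loss,
# in the spirit of Switch/ST-MoE; NumPy only, all sizes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_top2(tokens, w_gate, n_experts):
    """tokens: (T, d), w_gate: (d, n_experts). Returns assignments, probs, aux loss."""
    logits = tokens @ w_gate                        # (T, E) router scores
    probs = softmax(logits, axis=-1)
    top2 = np.argsort(-probs, axis=-1)[:, :2]       # two best experts per token

    # Balancing term: fraction of tokens dispatched to each expert (top-1 choice)
    # times the mean router probability per expert. A uniform router minimizes it;
    # without this term the router tends to collapse onto a few experts.
    dispatch = np.bincount(top2[:, 0], minlength=n_experts) / len(tokens)
    importance = probs.mean(axis=0)
    aux_loss = n_experts * float(dispatch @ importance)
    return top2, probs, aux_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, E = 32, 16, 4
    tokens, w_gate = rng.normal(size=(T, d)), rng.normal(size=(d, E))
    top2, probs, aux = route_top2(tokens, w_gate, E)
    print("top-1 counts per expert:", np.bincount(top2[:, 0], minlength=E),
          "aux loss:", round(aux, 3))
```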