Interesting findings on how LLMs do in-context learning.
TL;DR: with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also perform well when the natural-language targets are replaced with semantically-unrelated ones.
These are the different setups with examples: Regular ICL, Flipped-Label ICL, and Semantically-Unrelated Label ICL (SUL-ICL).
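To make the three setups concrete, here is a minimal sketch of how the prompts might be assembled. Everything in it is an assumption for illustration: the toy sentiment exemplars, the `Input:`/`Label:` template, the `foo`/`bar` placeholder targets, and the helper names `remap_label` and `build_prompt` are hypothetical and not taken from the paper's code.

```python
import random

# Hypothetical toy sentiment exemplars (not from the paper).
EXEMPLARS = [
    ("This movie was wonderful.", "positive"),
    ("A complete waste of time.", "negative"),
    ("I loved every minute.", "positive"),
    ("The plot made no sense.", "negative"),
]

# Flipped-Label ICL: every exemplar label is swapped for the opposite one.
FLIP = {"positive": "negative", "negative": "positive"}
# SUL-ICL: labels are replaced with targets that carry no task meaning.
UNRELATED = {"positive": "foo", "negative": "bar"}


def remap_label(label, setup):
    """Return the label shown in the prompt for the given ICL setup."""
    if setup == "regular":
        return label
    if setup == "flipped":
        return FLIP[label]
    if setup == "sul":
        return UNRELATED[label]
    raise ValueError(f"unknown setup: {setup}")


def build_prompt(exemplars, query, setup="regular", k=None, seed=0):
    """Build a few-shot prompt with up to k exemplars per class."""
    rng = random.Random(seed)
    # Group exemplar texts by their original (true) label.
    by_class = {}
    for text, label in exemplars:
        by_class.setdefault(label, []).append(text)
    # Sample k exemplars from each class, or keep all if k is None.
    chosen = []
    for label, texts in by_class.items():
        n = len(texts) if k is None else min(k, len(texts))
        chosen.extend((text, label) for text in rng.sample(texts, n))
    rng.shuffle(chosen)
    blocks = [
        f"Input: {text}\nLabel: {remap_label(label, setup)}"
        for text, label in chosen
    ]
    # The query is left unlabeled for the model to complete.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    for setup in ("regular", "flipped", "sul"):
        print(f"--- {setup} ---")
        print(build_prompt(EXEMPLARS, "Great acting.", setup=setup, k=2))
        print()
```

In the flipped setup, a model that follows the in-context mapping should answer with the opposite of its prior; in the SUL setup, the only way to pick the right target is to learn the input-label mapping from the exemplars.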
There are many cool results in the paper but this one is interesting: instruction-tuned LMs perform better at learning input-label mappings than pretraining-only LMs. Generally, results also improve with bigger models and more exemplars per class.
The paper claims that in-context learning with semantically unrelated labels emerges with scale. You can see in the chart below that "performance decreases more for small models than for large models when using semantically-unrelated targets instead of NL targets."
Some more interesting results for SUL-ICL setup: (top) Larger models benefit more from additional exemplars than smaller models. (bottom) SUL-ICL emerges with scale (using k=8 exemplars per class) for both PaLM and Codex models.
From a practical perspective, it's good to provide more exemplars when using in-context learning and to put effort into formatting, etc. I am curious how the paper's findings generalize across different task categories, especially where semantic prior knowledge is not available and you instead rely on a large LM to learn the task in context from input-label mappings. Really cool research thread to follow. Adding some of these notes to the Prompt Engineering guide too.