A new study from the University of Wisconsin investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations using the next-token prediction objective.

Paper: https://arxiv.org/abs/2307.03381

Large language models such as GPT-3/4, PaLM, and LaMDA have exhibited general-purpose capabilities, and sometimes emergent abilities, across a range of downstream tasks, including language and code translation, compositional reasoning, and basic arithmetic operations. Perhaps surprisingly, the model's training objective, which is typically an autoregressive loss based on predicting the next token, does not directly encode these goals. These …
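
To make that objective concrete, here is a minimal sketch (not the authors' code) of how a next-token prediction loss can be computed over arithmetic problems rendered as plain text. The character vocabulary, the `$` end-of-sequence marker, and the `model` placeholder are illustrative assumptions; any small decoder-only transformer producing per-position logits would fit.

```python
# Minimal sketch of the autoregressive next-token prediction objective
# applied to arithmetic strings. `model` is assumed to map token ids of
# shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size).

import torch
import torch.nn.functional as F

# Hypothetical character-level vocabulary for addition problems.
vocab = list("0123456789+=$")          # '$' marks end of sequence
stoi = {ch: i for i, ch in enumerate(vocab)}

def encode(s: str) -> torch.Tensor:
    """Map a string like '12+34=46$' to a tensor of token ids."""
    return torch.tensor([stoi[ch] for ch in s], dtype=torch.long)

def next_token_loss(model, batch: torch.Tensor) -> torch.Tensor:
    """Standard autoregressive loss: predict token t+1 from tokens up to t."""
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)             # (B, T, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Example batch: each row is one addition problem written out as text.
examples = ["12+34=46$", "58+17=75$", "90+09=99$"]
batch = torch.stack([encode(s) for s in examples])
# loss = next_token_loss(model, batch)  # `model`: any small randomly initialized transformer
```

Under this objective there is no explicit arithmetic supervision; the model only ever sees the task as sequences of characters to continue, which is what makes the learned arithmetic ability notable.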