Language Model Training and Inference: From Concept to Code | by Cameron R. Wolfe, Ph.D. | Jan, 2024

Learning and implementing next token prediction with a causal language model…

Cameron R. Wolfe, Ph.D.

Towards Data Science
(Photograph by Chris Ried on Unsplash)

Despite all that has been accomplished with large language models (LLMs), the underlying concept that powers all of these models is simple: we just need to accurately predict the next token! Though some may (reasonably) argue that recent research on LLMs goes beyond this basic idea, next token prediction still underlies the pre-training, fine-tuning (depending on the variant), and inference process of all causal language models, making it a fundamental and important concept for any LLM practitioner to understand.

“It is perhaps surprising that underlying all this progress is still the original autoregressive mechanism for generating text, which makes token-level decisions one by one and in a left-to-right fashion.” (from [10])
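
To make the "one by one, left-to-right" mechanism concrete, here is a minimal sketch of autoregressive generation in plain Python. The names `greedy_generate` and `toy_scores` are hypothetical and introduced only for illustration. The toy scoring function stands in for a real language model, and greedy selection (always taking the highest-scoring token) is just one of several possible decoding strategies.

```python
def greedy_generate(next_token_scores, prompt, max_new_tokens, eos_id=None):
    """Generate tokens left to right, one token per step.

    next_token_scores: callable mapping the token ids seen so far to a list
    of scores, one per vocabulary entry (a stand-in for a language model).
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        # Greedy decoding: pick the id with the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        # The chosen token becomes part of the input for the next step.
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens


def toy_scores(tokens):
    """Hypothetical toy 'model': favors (last token + 1) modulo a vocabulary of 4."""
    vocab_size = 4
    favored = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_scores, prompt=[0], max_new_tokens=5)
stopped_early = greedy_generate(toy_scores, prompt=[0], max_new_tokens=10, eos_id=3)
```

Each step only ever appends to the sequence, so earlier decisions are never revised. That is the property the quote above is pointing at.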

Within this overview, we will take a deep and practical dive into the concept of next token prediction to understand how it is used by language models during both training and inference. First, we will learn these ideas at a conceptual level. Then, we will walk through an actual implementation (in PyTorch) of the language model pretraining and inference processes to make the idea of next token prediction more concrete.
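
As a preview of the training side, the sketch below shows the next token prediction objective in its simplest form: the output at position t is scored against the token at position t + 1, and the loss is the average cross-entropy over positions. It uses only the Python standard library rather than PyTorch so that it runs without dependencies. The function names and the toy logits are assumptions made for illustration, not the article's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the scores produced after reading
    token_ids[: t + 1], so it has one entry fewer than token_ids.
    """
    targets = token_ids[1:]  # targets are the inputs shifted left by one
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy setup: vocabulary of 3 tokens, sequence of 3 tokens, so 2 predictions.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]

uniform_loss = next_token_loss(uniform_logits, token_ids)
confident_loss = next_token_loss(confident_logits, token_ids)
```

A model that spreads probability evenly over a vocabulary of size V gets a loss of log V. Putting more probability on the correct next token drives the loss toward zero.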

Prior to diving into the topic of this overview, there are a few fundamental ideas that we need to understand. Within this section, we will quickly overview these important concepts and provide links to further learning for each.

The transformer architecture. First, we need to have a working understanding of the transformer architecture [5], especially the decoder-only variant. Luckily, we have covered these ideas extensively in the past:

  • The Transformer Architecture
  • Decoder-Only Transformers

More fundamentally, we also need to understand the idea of self-attention and the role that it plays in the transformer architecture. More specifically, large causal language models (the kind that we will study in this overview) use a particular variant of self-attention called multi-headed causal…
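
The "causal" part of this attention variant means that each position may attend only to itself and to earlier positions, never to later ones. The sketch below illustrates just that masking step for a single head, in plain Python. The function name and the all-zero score matrix are assumptions for illustration; a real implementation would compute the scores from query and key projections and would run several heads in parallel.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # position i sees positions 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)  # zero weight on the future
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# With equal scores, each row spreads its weight evenly over the visible prefix.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
```

Because no weight ever lands on a future position, the output at position t depends only on tokens up to t, which is what allows the same model to be trained on full sequences and then sampled from one token at a time.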
