Demystifying AI Large Language Models (LLMs)


Dan Murphy


Introduction

How do cutting-edge language models like ChatGPT operate? How should we approach and conceptualize such AI tools? How can this understanding guide our practical applications of these technologies? In this blog series we hope to answer these questions and more.

Artificial Intelligence has become a bit of a buzzword, often used to refer to a broad set of technologies. In this blog post, we'd like to drill a little deeper into today's most popular type of AI - Large Language Models (LLMs) - which power ChatGPT and most of the new AI tools you've encountered. Understanding how LLMs work and the best practices for using them will help you wield them most effectively.

Our goals for this series are threefold:

  1. Demystify LLMs – In this first post, we'll provide insight into the underlying technologies and illuminate the revolutionary architecture that has been the driving force behind the recent breakneck pace of AI advancements. We'll peel back the layers and show you what's really going on under the hood.
     
  2. Share a Mental Framework – In the next post, An Effective Mental Framework for LLMs, we'll offer a helpful framework for thinking about these AI technologies. This will help you conceptualize their strengths and weaknesses, which will guide you towards applying them most effectively.
     
  3. Explain Best Practices – In the final post, Best Practices for LLM Powered Solutions, we'll apply these insights and cover some of the advanced features and best practices for using and building with AI. This will help you understand how to maximize the potential of these powerful tools while avoiding common pitfalls. We'll turn theory into practice, providing actionable strategies for building solutions powered by LLMs.
     

By the end of this journey, you'll have a clearer picture of what makes LLMs tick, how to think about them strategically, and how to leverage them effectively in your work or projects. Whether you're a developer looking to integrate AI into your applications, a business leader considering AI adoption, or simply a curious mind eager to understand this transformative technology, this guide will equip you with the knowledge to navigate the exciting - and sometimes perplexing - world of Large Language Models.

What are LLMs and How Do They Work?

In this first post, we'll explain the inner workings of LLMs, exploring their architecture, training process, and the technologies that power them in order to provide you with a solid foundation for understanding this transformative technology.

Terminology

Before diving into LLMs' intricacies, let's clarify some terminology that often gets used interchangeably. If you're already familiar with these concepts, feel free to skip to the next section.

  • Artificial Intelligence (AI) refers to software that can broadly perform tasks typically requiring human intelligence, such as reasoning, problem-solving, or understanding language.
     
  • Machine Learning (ML) is a subset of AI focused on developing algorithms and models that allow computers to learn from and make decisions based on data, without being explicitly programmed for each task. Unlike traditional computer programs that follow pre-defined rules, ML models improve their performance as they learn from more data.
     
  • Neural Networks are a specific and complex type of machine learning model inspired by the human brain. They consist of interconnected nodes (or artificial neurons) that process data by passing it through the network. These models "learn" by adjusting their parameters—specifically, the strengths of connections between nodes—based on patterns identified in the data. Once trained, neural networks can make predictions on new data using the relationships they've learned.
     
  • Generative AI refers to AI models designed to produce new, original content based on patterns learned from training data rather than simply analyzing or categorizing existing data.
     

What's an LLM?

Now that we've laid the groundwork, let's focus on LLMs. ChatGPT, Google's Gemini, and Anthropic's Claude are all powered by LLMs. But what exactly is an LLM?

An LLM is a type of machine learning model built using a neural network, and it falls under the umbrella of generative AI. As their name suggests, Large Language Models are characterized by two key features:

  1. They're large - in terms of the number of artificial neurons, their connections, and the amount of data used to train them.
  2. They focus on language - learning patterns and relationships in text based on extensive training with vast amounts of linguistic data.

What sets LLMs apart from other AI models is their specialized ability to understand and generate natural language, a field often referred to as Natural Language Processing (NLP). While some models focus on generating images or solving specific problems, LLMs are designed to process text inputs (and sometimes other modalities like images, audio, or video) and generate text outputs.
 

How do LLMs work?

To truly understand LLMs, we need to examine their architecture even more closely. Modern LLMs are built using a revolutionary design called transformer architecture.

Introduced in the seminal 2017 paper "Attention Is All You Need" by Google researchers, the transformer architecture has become the dominant paradigm in AI models. In fact, GPT stands for "Generative Pre-trained Transformer," highlighting its centrality to these models.

A transformer is particularly well-suited for efficiently learning relationships in sequential data, like the words in a sentence, and predicting the next item in that sequence. Let's break down the key components that make transformers so special:

  • Tokens: Transformers break down input data into basic units called tokens. For text, these could be words, subwords, characters, or even punctuation marks. This tokenization allows the model to work with any type of sequential data, be it text, audio, or video.

    An example sentence tokenized by GPT-4, with each token highlighted separately
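To make tokenization concrete, here's a toy greedy longest-match subword tokenizer. The vocabulary below is invented purely for illustration; real LLM tokenizers (such as the byte-pair-encoding tokenizer GPT-4 uses) learn their vocabularies from data and work quite differently in detail:

```python
# Illustrative greedy longest-match subword tokenizer.
# VOCAB is invented for demonstration; real tokenizers learn theirs from data.
VOCAB = {"the", "quick", "brown", "fox", "jump", "##s"}

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest matching piece first; "##" marks a word continuation.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                key = piece if start == 0 else "##" + piece
                if key in VOCAB:
                    tokens.append(key)
                    start = end
                    break
            else:
                tokens.append("[UNK]")  # no vocabulary piece matched
                break
    return tokens

print(tokenize("The quick brown fox jumps"))
# "jumps" is not in the vocabulary, so it splits into "jump" + "##s"
```

Note how a word outside the vocabulary still gets represented by combining smaller pieces; this is why LLMs can handle rare words and even typos gracefully.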


 

  • Attention: The attention mechanism is the transformer's secret sauce. It allows the model to weigh the importance of all tokens in the input sequence simultaneously and focus on the most relevant parts. This enables the model to capture long-range dependencies and understand context effectively. For example, in the text "I grew up in France...[several paragraphs]...I speak fluent..." the model can connect the earlier text "France" with the later text "I speak fluent" even though they're far apart, allowing it to predict that the next word will be "French".

 

Figure 5 from Attention Is All You Need, showing attention from just the word 'its'.
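The core attention computation is surprisingly compact. Below is a minimal NumPy sketch of the scaled dot-product attention from "Attention Is All You Need" — softmax(QKᵀ/√d_k)·V — applied as self-attention over random toy embeddings (the data is random, just to show the shapes and the weighting):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, as in "Attention Is All You Need"
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy self-attention over 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V
print(w.round(2))  # each row is one token's attention over all tokens; rows sum to 1
```

Each row of the weight matrix tells you how much one token "attends to" every other token — the mechanism behind connecting "France" to "I speak fluent" in the example above.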

  • Prediction and Probabilities: Transformers predict the next item in a sequence by generating a probability distribution of possible next items. For instance, given "The quick brown...," the model might assign high probabilities to words like "fox" or "bear," lower probabilities to "dog" or "cat," and very low probabilities to contextually irrelevant words like "house" or "blue." This probabilistic approach allows for coherent yet flexible text generation.

An example from A Hacker's Guide to Language Models showing the probability distribution returned by an LLM predicting the next token in the text sequence "The pandas were".
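To illustrate, here's how raw model scores (logits) become a probability distribution via the softmax function. The scores below are invented for the "The quick brown..." example, not taken from any real model:

```python
import math

# Hypothetical raw scores (logits) an LLM might assign to candidate next
# tokens after "The quick brown" -- the numbers are invented for illustration.
logits = {"fox": 6.0, "bear": 4.5, "dog": 2.0, "cat": 1.5, "house": -3.0}

def softmax(scores):
    # Exponentiate each score, then normalize so the values sum to 1.
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.3f}")
```

Because the output is a distribution rather than a single fixed answer, the model can sample from it — which is why the same prompt can yield different (but still plausible) continuations.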

  • Sequential Output: Transformers generate sequences one item at a time. After each prediction, the newly predicted item is added to the existing sequence, and this updated whole sequence is used to predict the next item. This iterative process, while computationally expensive, allows for coherent and contextually appropriate generation of long sequences.
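The append-and-repredict loop can be sketched with a toy stand-in for the model. The lookup table below is purely illustrative — a real transformer computes its prediction from the sequence rather than looking it up — but the control flow is the same:

```python
# Toy autoregressive loop: a stand-in "model" maps a full context to its next
# token, mimicking how a transformer appends each prediction and re-reads the
# entire sequence before predicting again.
NEXT = {
    ("the",): "quick",
    ("the", "quick"): "brown",
    ("the", "quick", "brown"): "fox",
    ("the", "quick", "brown", "fox"): "<eos>",  # end-of-sequence token
}

def generate(prompt):
    tokens = list(prompt)
    while True:
        nxt = NEXT.get(tuple(tokens), "<eos>")  # "model" call on the whole sequence
        if nxt == "<eos>":                      # stop token ends generation
            break
        tokens.append(nxt)
    return tokens

print(generate(["the"]))  # ['the', 'quick', 'brown', 'fox']
```

Notice that every new token requires re-processing the entire sequence so far — which is exactly why long generations are computationally expensive.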

Understanding these components helps explain why LLMs are so powerful at language tasks, but also why they require significant computational resources.

 

How are LLMs created?

Creating an LLM is a complex process involving several stages. Let's break it down:

  1. Data Collection and Preparation: The first step is gathering vast amounts of diverse data, including books, websites, articles, and more. This data forms the foundation of the model's knowledge.
     
  2. Self-Supervised Learning: In this initial phase, the model is trained on the massive raw dataset without explicit human-labeled examples. It learns to predict parts of the data, such as the next word in a sentence. This task allows the model to grasp the general structure of language, including grammar, context, and word relationships. As the model processes billions of sentences, it continuously adjusts the connections between its neurons, refining its ability to understand and generate coherent text.
     
  3. Supervised Learning: After self-supervised learning, the model undergoes supervised learning tailored for specific tasks, often conversational ones. The model is fine-tuned on datasets of question-answer pairs, dialogue examples, and other task-specific data. This helps the LLM specialize in generating contextually appropriate, relevant, and informative responses.
     
  4. Fine-Tuning: The LLM is further refined on specialized, human-curated datasets that reflect its intended use cases. For chatbots, this might involve tuning on data that captures the nuances of human dialogue, such as handling polite requests, managing conflicting information, or responding to emotionally charged situations.
     
  5. Reinforcement Learning from Human Feedback (RLHF): This final step is crucial for aligning the model's behavior with human expectations. Human evaluators interact with the model, providing feedback on its responses. For instance, evaluators might rank multiple possible replies or indicate which responses are more appropriate in a given context. The model uses this feedback to adjust its output generation process, learning to prioritize responses that are not only accurate but also aligned with user preferences.
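The self-supervised stage (step 2) hinges on the fact that raw text labels itself: the word that follows a context is the prediction target, so no human annotation is needed. A minimal sketch of how training pairs are carved out of text:

```python
# Self-supervised next-token pairs from raw text: each context's "label" is
# simply the word that follows it, so no human annotation is required.
def next_token_pairs(text):
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in next_token_pairs("the cat sat on the mat"):
    print(context, "->", target)
```

A real training pipeline does this at the scale of trillions of tokens, with subword tokens rather than whole words, but the principle — the data supervises itself — is the same.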

These stages require immense computational power and time. However, their combined effects create LLMs capable of engaging in dynamic, natural, and meaningful conversations.

Inference

Once an LLM has been trained, it can be used for inference. Inference is the process of applying the trained model to new, unseen data to generate outputs. In the case of LLMs, this typically involves providing a prompt or context (the input) and having the model generate text in response.

During inference, the model uses what it learned during training to process the input and produce results. This could be predicting the next word in a sentence, generating an entire piece of text, or providing an answer to a question.

It's important to note that, due to the transformer architecture, inference is an iterative process. The model generates one token (which could be a word, part of a word, or a punctuation mark) at a time. After each token is generated, it's added to the existing input, this entire new sequence is fed back through the model to predict the next token, and this process continues until the model determines an appropriate stopping point. This iterative approach allows the model to maintain context and coherence throughout longer generations, but it also means that generating lengthy responses can be computationally intensive.

The Future of LLMs

Despite being fundamentally "next token predictors," LLMs have demonstrated remarkable capabilities. Some researchers argue that to make accurate predictions, LLMs must have internalized a representation or understanding of the world based on their training data. An alternative perspective is that LLMs function as "lossy compression algorithms," capable of compressing vast amounts of information about the world into a model, sacrificing only some details in the process.

Regardless of the underlying mechanism, LLMs have proven to be immensely capable, showing steady improvement over time (as evidenced by advances from models like GPT-3 to GPT-4).

Research on LLMs has revealed certain scaling laws, which describe relationships between model size, training data, computational resources, and model performance:

  • Model Size: Larger models (with more parameters) generally perform better across a wide range of tasks, capturing more complex patterns and generating more coherent text.
  • Training Data: More diverse and extensive training data typically enhances model performance, improving generalization and robustness.
  • Compute: More computational power and training time usually lead to better performance, allowing for larger models and longer training periods.
(LEFT) A chart of how model size and compute (FLOPs) affect training efficiency; lower loss indicates better performance. Source: Training Compute-Optimal Large Language Models (https://arxiv.org/pdf/2203.15556)

(RIGHT) A graph showing how the compute requirements of newer, more powerful LLMs have increased over time. Source: What Is a Transformer Model? | NVIDIA Blogs (https://blogs.nvidia.com/blog/what-is-a-transformer-model/)

 

The best performance is achieved when model size, data, and compute are scaled together. However, this comes with trade-offs: larger models require significantly more resources, leading to increased costs, longer training times, and slower inference.
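As a rough illustration of these scaling relationships, here are two widely cited rules of thumb — approximations, not exact laws: the compute-optimal paper linked above (often called "Chinchilla") suggests training on roughly 20 tokens per model parameter, and training cost is commonly estimated at about 6 × parameters × tokens FLOPs:

```python
# Back-of-the-envelope scaling estimates (rules of thumb, not exact laws):
#   - compute-optimal training: ~20 tokens per parameter (Chinchilla heuristic)
#   - training cost: ~6 FLOPs per parameter per training token
def compute_optimal_estimate(n_params):
    tokens = 20 * n_params           # Chinchilla-style tokens-per-parameter heuristic
    flops = 6 * n_params * tokens    # approximate total training FLOPs
    return tokens, flops

tokens, flops = compute_optimal_estimate(70e9)  # e.g. a 70B-parameter model
print(f"~{tokens:.1e} training tokens, ~{flops:.1e} FLOPs")
```

Even these coarse estimates make the trade-off above vivid: doubling model size doesn't just double cost, because the compute-optimal token count grows alongside it.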

Looking to the future, several exciting developments are on the horizon:

  1. Multi-modal Capabilities – Some LLMs are already evolving to process not just text, but also images, audio, and video inputs. This continued expansion of input modalities could lead to more versatile and powerful models.
     
  2. Improved Reasoning – While current LLMs operate in a stream-of-consciousness manner, outputting one token after another, future models might be able to "think through" different complete outputs before responding, potentially leading to more thoughtful and accurate responses.
     
  3. Specialization and Customization – Research suggests that smaller, fine-tuned LLMs customized for specific applications can perform as well as larger, general-purpose models in narrow domains. This could lead to more efficient, task-specific AI solutions.
     
  4. Self-Improvement – LLMs can produce "synthetic data" for training, potentially allowing models to bootstrap their own improvement. "Synthetic data" refers to data (such as text) produced by LLMs and not by actual people. Some studies have demonstrated that including synthetic data in the right proportions during training can lead to improved model performance. This mechanism provides a potential pathway for LLMs to enhance their own capabilities, which could accelerate advancements in AI technology.
     

As LLM technology continues to evolve, we can expect to see not only more powerful and capable models but also more efficient and specialized applications of this transformative technology.