Jul 2, 2026

From Common Crawl to the Base Model: How an LLM Learns Language


Table of contents

Where the LLM Came From
What the Transformer Solved
The Raw Material: Common Crawl and FineWeb
Tokenization: Why the Model Doesn’t See Words
Pre-training: Predicting the Next Token, Trillions of Times
The Base Model: Powerful and Still Useless
Next Steps

Part 1 of 2. What happens inside a language model before you send the first prompt.

When you type a question into ChatGPT and get back coherent text, it feels like a conversation. Mechanically it’s nothing of the sort. What the model does is compute, token by token, which piece of text is most likely to come next, and that ability doesn’t arrive ready-made. It’s built in phases, over an expensive, time-consuming process that happens long before any user types anything.

Understanding these phases changes how you use the tool. A good prompt, a robust agent, and safe AI usage all depend on knowing what the model actually does under the hood. This first article covers the foundation: how an LLM goes from random parameters to what’s called the base model, passing through the Transformer architecture, data collection, tokenization, and pre-training. In the second article I’ll show how that base model is transformed into the assistant you know.

Where the LLM Came From

LLMs are part of Natural Language Processing (NLP), the field that studies how to make machines handle human language. For a long time this field was dominated by RNNs (Recurrent Neural Networks), networks that processed text sequentially, one word after another. They worked, and they were important to the field’s evolution, but they had serious structural limits: they lost context in long texts, struggled to capture relationships between distant words, and hit a performance ceiling that was hard to break through.

The turning point came in 2017, with the paper “Attention Is All You Need”, written by researchers at Google. It introduced the Transformer architecture, which abandoned the sequential logic of RNNs and put the attention mechanism at the center. The core idea is that the model observes the entire sentence at once, evaluating how each word relates to all the others, even when they are far apart in the sequence.

Comparison between an RNN, which passes context word by word, and a Transformer, in which each word observes all the others through attention
In the RNN, context crosses the whole chain; in the Transformer, each word looks directly at the others.

In the RNN, for “came back” to make sense, the information about “cat” has to cross the entire chain and arrives weakened. In the Transformer, “came back” looks directly at “cat” and “that ran,” without depending on the distance between them.
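
To make “looks directly at” a bit more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The two-dimensional vectors are made up for illustration, and I reuse the same vectors as queries, keys, and values; a real Transformer learns separate projections for each, and GPT-style models also mask out future positions, which this sketch doesn’t do.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence at once."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted mix of all the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: "cat" and "came back" point in a similar direction.
tokens = ["cat", "that", "ran", "came back"]
vectors = [[1.0, 0.0], [0.0, 0.2], [0.1, 0.9], [0.9, 0.1]]

_, weights = attention(vectors, vectors, vectors)
for token, weight in zip(tokens, weights[3]):
    print(f"'came back' -> {token}: {weight:.2f}")
```

The weight “came back” gives to “cat” depends only on how similar the two vectors are, not on how many words sit between them.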

What the Transformer Solved

The impact of this change was immediate on three fronts. Performance comfortably surpassed previous architectures. The architecture’s parallelism made it viable to train much larger models, because processing was no longer bound to the order of the words. And the representation of language gained depth, capturing relationships that older techniques simply couldn’t see.

It was on this foundation that GPT, BERT, GPT-2, GPT-3 and, later, ChatGPT emerged, with ChatGPT popularizing access to LLMs at the end of 2022 and marking the start of the mass adoption of generative AI. It’s worth retaining the essence without diving into the math: the Transformer captures the structure of language better because it analyzes text globally, identifies broad relationships, and perceives patterns that were invisible before. That’s what makes the models seem to reason, in the sense of generating answers that connect ideas, interpret nuance, and stay coherent across several paragraphs. Without this architecture, the current generation of LLMs and the entire ecosystem of applications that depends on it wouldn’t exist.

The Raw Material: Common Crawl and FineWeb

Before an LLM can hold a dialogue or answer anything, it goes through a large-scale learning phase called pre-training. The goal here is specific and worth pinning down: it’s not to memorize information from the web, it’s to internalize how language works. Sentence structure, transitions between ideas, variation in style, semantic relationships, the logic that connects words within a context. That’s the base that later lets the model produce coherent text.

A large chunk of this content comes from Common Crawl, a project that continuously sweeps the web and archives pages systematically. Over the years this became one of the largest public text repositories in the world, in the range of hundreds of billions of pages. The problem is that this repository is raw. It has useful text, repeated content, poorly structured pages, and a lot of low-quality material mixed in. It delivers the volume the network needs, but it can’t be thrown straight into training.

To become training data, it goes through curation pipelines like FineWeb. Alongside deduplication and quality filtering, which deal with the repeated and low-quality material mentioned above, this step does four main things. It filters URLs, removing domains tied to inappropriate content, spam, extremism, and the like, using blocklists maintained by various organizations. It extracts the pure text, throwing away HTML, tags, scripts, and the rest of the page structure, so the model doesn’t learn patterns that aren’t part of human language. It filters by language, keeping only the languages relevant to the training (FineWeb itself keeps English). And it tries to remove sensitive data that shows up in the raw collection, like addresses, documents, and banking information (FineWeb, for instance, anonymizes email addresses and public IP addresses).

Curation pipeline: Common Crawl passes through URL filtering, text extraction, language filtering, and sensitive-data removal until it becomes the final dataset
From Common Crawl's raw repository to the final dataset, through FineWeb's curation.
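
The four steps above can be sketched in a few lines of standard-library Python. This is only a toy under loud assumptions, not FineWeb’s actual code: the blocklist, the word-counting language check, and the email-only masking are all hypothetical stand-ins for the dedicated tools a real pipeline would use.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical blocklist
COMMON_ENGLISH = {"the", "and", "is", "of", "to", "a", "in"}  # crude language hint
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())  # collapse whitespace

def looks_english(text, threshold=0.15):
    """Toy check: share of very common English words among all words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    return sum(w in COMMON_ENGLISH for w in words) / len(words) >= threshold

def curate(url, html):
    """Return cleaned text, or None if the page is filtered out."""
    if urlparse(url).hostname in BLOCKED_DOMAINS:  # 1. URL filtering
        return None
    text = extract_text(html)                      # 2. text extraction
    if not looks_english(text):                    # 3. language filtering
        return None
    return EMAIL.sub("<EMAIL>", text)              # 4. sensitive-data masking

page = "<p>The cat is in the garden. Mail bob@example.com</p><script>var x=1;</script>"
print(curate("https://blog.example/post", page))
```

Each stage either drops the page or hands a cleaner version to the next one, which is the shape of the pipeline in the figure.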

At the end of this process, what’s left is a broad, varied, and linguistically relevant set that keeps diversity of topic and style without carrying the noise of the raw web. All this content is converted into a continuous sequence of tokens, as if it were a single giant dataset spanning formal writing, conversational text, and various knowledge domains all at once. At this point the model isn’t trying to answer a question or solve a problem. It’s only learning to predict the continuation of the text and to perceive the regularities that define language. This phase is the foundation of everything that comes after. Without this initial immersion in language, no later stage would have anything to build on.

Tokenization: Why the Model Doesn’t See Words

After gathering and cleaning the text, we run into a basic limit: models don’t work with words. A computer operates on numbers, not on sentences, letters, or punctuation. For the model to learn language, the text has to become a numerical sequence it can process. That process is tokenization, and it happens in layers.

The first step is to turn each character into a binary representation, using an encoding like UTF-8. Each letter, symbol, or space becomes a sequence of bits. This makes the text hardware-compatible, but it still isn’t good for training the network: the sequences get too long and don’t reflect the language units the model needs to learn well. To improve this, we group the bits. Joining 8 bits gives us a byte, which can take 256 possible values. These values stop being read as numbers with meaning and start working as symbolic identifiers. The byte becomes the smallest raw unit from which the vocabulary will be built.
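
A quick way to see these first two layers is to encode a string yourself. The sketch below assumes UTF-8 and uses only built-in Python; the helper name `to_byte_ids` is mine, not part of any library.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))

text = "Hi, how are you?"
ids = to_byte_ids(text)

# The first two characters as bits, then the same characters as byte values.
print(" ".join(f"{b:08b}" for b in ids[:2]))  # 01001000 01101001
print(ids[:2])                                # [72, 105]
print(len(ids))                               # 16

# Characters outside ASCII take more than one byte.
print(len(to_byte_ids("é")))                  # 2
```

At this level every text is just a sequence drawn from 256 possible symbols, which is the starting vocabulary the next step builds on.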

Even in bytes, the text is still too long to be processed efficiently. That’s where Byte Pair Encoding (BPE) comes in. BPE looks for repeating patterns: it counts adjacent pairs of symbols (bytes at first, already merged sequences later on) and replaces the most frequent pair with a new identifier. It does this iteratively: at each step, the most frequent pair becomes a new unique symbol. The vocabulary grows, but the total length of the data shrinks. The result is a set of tokens that ranges from a single character to a large fragment of a word. A common term tends to become a single token; a rare term is broken into several parts. This balance keeps the representation compact and flexible at the same time.
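
Here is a minimal sketch of that merge loop, assuming we start from raw UTF-8 bytes and give new symbols the IDs 256, 257, and so on. Production tokenizers add details this toy leaves out (pre-splitting text with a regex, special tokens, much larger corpora), and the function names are my own.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    """Expand token IDs back into the original text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "low lower lowest low low"
ids, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == text)
```

The sequence gets shorter with every merge while the vocabulary gets one entry longer, and decoding still recovers the exact original text.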

Tokenization layers: human text becomes binary via UTF-8, then bytes, then passes through BPE and ends up as token IDs
Human text descends through the layers until it becomes a sequence of token IDs.

Once tokenized, the model doesn’t see “Hi, how are you?”. It sees a sequence of numbers like [89191, 11, 2526, 12156, 7888, 30]. These numbers have no meaning of their own; they’re just identifiers that point to pieces of text. Tools like tiktokenizer make this visible, showing how each word is sliced into tokens and revealing internal breaks we don’t even notice while reading. A word like “Procrastination,” for example, can be cut into three different tokens.

This matters more than it seems. The Transformer works exclusively with tokens. All of its learning comes from observing patterns in these numerical sequences: which tokens appear together, which usually comes before which, which combination is most likely in a given context. That’s where it learns to predict the next token. Tokenization, then, defines how the model sees language. Good tokenization makes learning easier, improves performance, and reduces training cost. Bad tokenization hurts all of that. That’s why it’s a fundamental step and not a technical detail: it’s where human text turns into something the LLM can process.

Pre-training: Predicting the Next Token, Trillions of Times

With tokenization sorted out, we can explain the central mechanism that lets the model produce text: sequential token prediction. The interaction looks like a dialogue, but underneath it’s pure statistics. Given an input, the model computes the probability of each possible token and picks the next element of the sequence. Then it repeats this continuously until the answer ends.

The cycle starts by converting the input into tokens. Each word, fragment, punctuation mark, or space becomes a numerical identifier. That sequence goes into the Transformer, which computes a probability distribution over all tokens in the vocabulary, and we’re talking tens or hundreds of thousands of possibilities. In the usual setup the model doesn’t return a fixed answer: it samples the next token in proportion to the predicted probabilities. A token incompatible with the text gets a near-zero probability, a coherent token becomes a strong candidate, and the choice comes out of that distribution. The chosen token joins the sequence and the cycle starts over.

Prediction cycle: the input becomes tokens, the Transformer computes the probability of each token, sampling picks the next one, and it joins the sequence, repeating until it ends
At each step the model predicts a distribution, picks a token, and feeds the sequence back in.
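
The cycle can be sketched with a toy stand-in for the Transformer. Everything here is an assumption for illustration: the seven-token vocabulary, the hand-written scores in `LOGITS`, and the fact that this toy only looks at the last token, whereas a real model conditions on the whole context.

```python
import math
import random

VOCAB = ["Hi", ",", " how", " are", " you", "?", "<end>"]

# Hypothetical scores (logits) for the next token, keyed by the last token.
LOGITS = {
    "Hi":   [0.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.5],
    ",":    [0.0, 0.0, 4.0, 0.5, 0.5, 0.0, 0.0],
    " how": [0.0, 0.0, 0.0, 4.0, 0.5, 0.5, 0.0],
    " are": [0.0, 0.0, 0.0, 0.0, 4.0, 0.5, 0.0],
    " you": [0.0, 0.5, 0.0, 0.0, 0.0, 4.0, 1.0],
    "?":    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0],
}

def softmax(logits):
    """Convert scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, rng, max_tokens=10):
    """Predict a distribution, sample one token, append it, repeat."""
    sequence = list(prompt)
    for _ in range(max_tokens):
        probs = softmax(LOGITS[sequence[-1]])
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_token == "<end>":
            break
        sequence.append(next_token)
    return sequence

# Different seeds can produce different continuations of the same prompt.
for seed in range(3):
    print("".join(generate(["Hi"], random.Random(seed))))
```

Because the next token is sampled rather than looked up, the likely continuation usually wins, but the less likely ones still show up from time to time.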

This ability to assign the right probability doesn’t come out of the box. The model starts training with random parameters, an enormous number of degrees of freedom, and pre-training gradually calibrates those values until the model consistently infers which token should come next within a context window.

That context window is the fixed-length segment of tokens the training operates over. In the early GPT models it was short: about 1,000 tokens in GPT-2 and about 2,000 in GPT-3, reaching around 4,000 with the first ChatGPT models. The loop always follows the same repetitive logic:

Training loop in four steps: select a window, predict the next-token distribution, compare it with the real token, and adjust the parameters to reduce the error
The same loop repeated trillions of times, each pass adjusting the parameters a little.
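
To show the four steps in code, here is a deliberately tiny sketch: a table of scores with one row per previous token, trained with cross-entropy and plain gradient descent. It’s an assumption-heavy toy, since the “window” is a single token and there is no Transformer at all, but the predict, compare, adjust rhythm is the same.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(token_ids, vocab_size, steps=200, lr=0.5, seed=0):
    """Fit next-token scores so each token predicts the one that follows it."""
    rng = random.Random(seed)
    # Random starting parameters: weights[prev][next] is a score (logit).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
               for _ in range(vocab_size)]
    losses = []
    for _ in range(steps):
        total = 0.0
        # 1. Select a window: here just (previous token, real next token).
        for prev, target in zip(token_ids, token_ids[1:]):
            probs = softmax(weights[prev])        # 2. predict a distribution
            total += -math.log(probs[target])     # 3. compare with the real token
            for j in range(vocab_size):           # 4. adjust to reduce the error
                grad = probs[j] - (1.0 if j == target else 0.0)
                weights[prev][j] -= lr * grad
        losses.append(total / (len(token_ids) - 1))
    return weights, losses

TOKENS = ["Hi", ",", " how", " are", " you", "?"]
token_ids = list(range(len(TOKENS)))  # the training text, already tokenized

weights, losses = train(token_ids, len(TOKENS))
print(f"average loss: {losses[0]:.2f} -> {losses[-1]:.2f}")

probs = softmax(weights[TOKENS.index(" you")])
print("after ' you' the model prefers:", repr(TOKENS[probs.index(max(probs))]))
```

Each update nudges the probability of the real next token up and the others down, which is the adjustment the loop repeats at a vastly larger scale in real pre-training.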

To make it concrete: imagine the context “Hi, how are you”, where the expected next token is “?”. In the first attempts the model might assign high probability to completely wrong tokens. Comparison with the correct token triggers the adjustment, which raises the probability of the right token, lowers the probability of the wrong ones, and reorganizes the internal relationships involved in that prediction. Each of these corrections is tiny, but applied at massive scale they make the model learn the structure of language. That’s also why pre-training is expensive: it involves a huge volume of data, high-performance machines, and long stretches of computation on distributed infrastructure. Just to give you a sense, training a relatively small model like Llama 3 with 8B parameters is estimated to cost somewhere between 2 and 5 million dollars in compute alone, and that number skyrockets on larger models.
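
A rough way to arrive at a number in that range is GPU-hours times price per GPU-hour. In the sketch below, the 1.3 million H100 GPU-hours is the approximate figure Meta reports for Llama 3 8B in its model card, and the hourly prices are my own assumption about cloud rental rates, so treat the result as an order of magnitude and not a quote.

```python
def training_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Compute-only estimate: ignores staff, storage, and failed runs."""
    return gpu_hours * usd_per_gpu_hour

GPU_HOURS = 1_300_000  # approximate, as reported for Llama 3 8B

for price in (2.0, 4.0):  # assumed rental price per GPU-hour, in USD
    cost = training_cost_usd(GPU_HOURS, price)
    print(f"${price:.2f}/GPU-hour -> ${cost / 1e6:.1f}M")
# $2.00/GPU-hour -> $2.6M
# $4.00/GPU-hour -> $5.2M
```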

When the model’s performance stabilizes, training produces the base model. It has a statistical command of the structure of language, but it still doesn’t know how to follow instructions, interact in a controlled way, or behave the way we expect from an assistant. That depends on the next stages.

The Base Model: Powerful and Still Useless

After collection, data preparation, tokenization, and pre-training, we arrive at the base model. And here comes the part that tends to surprise anyone who’s never seen it up close: at this stage the LLM doesn’t work as any kind of assistant. It wasn’t instructed to follow commands, answer a question usefully, or interact in a structured way. The base model is essentially a sequence-continuation mechanism. It predicts the next token based on the statistical patterns of the training data, which is why it’s well described as a text simulator, reflecting the regularity of the content it processed.

This shows up in behavior. If you throw in a simple input like “What’s 2 + 2?”, the base model doesn’t understand it should return a number. It tends to prolong the sequence arbitrarily, return fragments that resemble excerpts from the dataset, repeat patterns that appear frequently, and produce a different variation on each run. That’s the expected behavior of something trained only to predict a sequence, with no tuning aimed at interpreting instructions.

Because it’s a statistical prediction mechanism, the base model also hallucinates. When it lands in a context that goes beyond what it saw in training, or in an incomplete format, it completes with something linguistically plausible, even if it’s wrong. It doesn’t evaluate whether it’s true; it just prolongs the learned pattern. That’s one of the reasons the base model, on its own, isn’t fit for applications that require reliability and control.

Even with these limitations, it has an important emergent property: in-context learning. The model can identify a pattern within the prompt itself and replicate it in its generation, without ever having been specifically trained for it.

In-context learning: given a prompt with Portuguese-to-English pairs, the model detects the systematic relation and continues the pattern, translating casa to house
The model detects the relation within the prompt itself and continues the pattern, without having been trained to translate.
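
In practice, this kind of prompt is just text with a repeated pattern and an open slot at the end. The helper below is a hypothetical sketch: the word pairs and the `->` separator are my choices, and any consistent format would do.

```python
def few_shot_prompt(examples, query):
    """Lay out solved examples, then leave the last line for the model to finish."""
    lines = [f"{source} -> {target}" for source, target in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

examples = [("gato", "cat"), ("livro", "book"), ("água", "water")]
print(few_shot_prompt(examples, "casa"))
# gato -> cat
# livro -> book
# água -> water
# casa ->
```

A base model that receives this text has no instruction to translate; it continues the line because “house” is the most likely way to keep the pattern going.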

That same property explains why you can “fake” an assistant with a base model. If the prompt brings a consistent dialogue format, with “User:” and “Assistant:” alternating, the model recognizes the structure and continues in the assistant’s role, because that’s the most likely next token given that format. But this isn’t understanding the task; it’s maintaining the statistical coherence of the presented format.
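
As a sketch of what that looks like, the helper below builds such a dialogue prompt and trims the completion afterwards. Both function names are hypothetical, and the trimming rests on an assumption worth stating: since the base model is only continuing text, it will often keep going and write the next “User:” turn too, so the caller has to cut the output there.

```python
def dialogue_prompt(turns, user_message):
    """Format past turns and end on an open 'Assistant:' line."""
    lines = []
    for user, assistant in turns:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)

def first_reply(completion):
    """Keep only the text before the model starts inventing a new user turn."""
    return completion.split("\nUser:")[0].strip()

prompt = dialogue_prompt([("What's the capital of France?", "Paris.")],
                         "What's 2 + 2?")
print(prompt)

# A made-up completion, as a base model might produce it.
print(first_reply(" 4\nUser: And 3 + 3?\nAssistant: 6"))  # 4
```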

To sum up what the base model actually is: it reproduces patterns but doesn’t execute tasks, it’s highly variable and barely controllable, it generates plausible but unreliable content, it adapts to the prompt’s format but doesn’t understand instructions, and it has knowledge scattered across the parameters with no steering mechanism. It’s a foundation and nothing more. Turning it into a system that follows commands, stays consistent, and operates predictably only happens in the next phases, supervised fine-tuning and reinforcement learning, which are exactly the subject of the second part.

Next Steps

This article covered the foundation: from data collection on Common Crawl to the base model, through the Transformer, tokenization, and pre-training. What you end up with is a powerful but raw text simulator, with no notion of instruction or reliability.

In Part 2: From Base Model to ChatGPT I take exactly this base model and show how it becomes the assistant you use day to day, through supervised fine-tuning, tool use, and reinforcement learning. If you want to see how this kind of model shows up inside a real software architecture, it’s also worth reading Amazon Bedrock in practice: AI as part of the architecture, where AI stops being a concept and becomes part of a system in production.

I wrote this while studying the inner workings of LLMs. If it made sense to you, the second part takes the base model from this article and shows how it becomes ChatGPT. Follow me on LinkedIn and GitHub.