The Token Factory

What I learned trying to understand how AI data centers actually make money, and why it changes everything about sovereignty, hardware, and the next decade of AI.

A few months ago, I found myself stuck on a question that sounds simple but turned out to be surprisingly hard to answer: how do large AI data centers actually make money? I had a rough understanding of the API pricing model, where you pay per token and providers such as OpenAI and Anthropic charge fractions of a cent per thousand tokens, but I could not figure out whether the economics actually worked. The GPU hardware is extraordinarily expensive. The energy costs are enormous. Yet these companies keep building more infrastructure. So I started digging. What I found reshaped how I think about AI infrastructure, open-source models, and European sovereignty in ways I did not expect.

Part One: The Physics of Token Generation

The question nobody told me to ask

When I first started experimenting with running models locally on my MacBook Air M4 with 16GB of RAM, I noticed that I was getting a relatively small number of tokens per second. That was not surprising, since it is a consumer laptop. But then I started watching videos of people running much more powerful setups: Mac Studio clusters, professional GPU rigs, dedicated cloud instances. And I kept seeing something that confused me. Someone with a single Mac Studio was generating maybe 30 tokens per second. When they chained four Mac Studios together in a cluster, they were getting maybe 40 or 50, not 120. They had multiplied the hardware cost by four but barely moved the output. I had assumed that more compute meant proportionally more tokens. That assumption was wrong, and understanding why turned out to be the key to understanding everything else.

Two phases, two bottlenecks

Running a language model is not a single operation. There are two distinct phases with completely different performance profiles, and most people conflate them in ways that lead to real confusion.

The first is the prefill phase. This is when the model processes your input, the prompt you typed and the context it needs to understand. This phase is heavily parallel and compute-intensive. The GPU's tensor cores are working hard, and if you have more raw computational power (measured in teraflops), the prefill phase gets faster. This is the phase people tend to imagine when they think about AI hardware, and it behaves the way most people expect: more compute, more speed.

The second is the decode phase. This is where each output token is generated, one at a time. And here everything changes. To generate a single token, the GPU must load the entire set of model weights from memory, perform a relatively small amount of arithmetic, and then repeat the process for the next token. The arithmetic intensity, the ratio of computation to memory access, drops dramatically. The GPU's compute cores are largely sitting idle, waiting for data to arrive from memory. The bottleneck is no longer how fast the chip can calculate. It is how fast the chip can read its own memory.

Two GPUs may perform the same floating-point operations per forward pass, but the one with higher memory bandwidth completes each pass faster because it reads weights from VRAM at a higher rate. More bandwidth means more tokens per second. This is the insight that unlocked everything for me: tokens per second during decoding is almost entirely a function of memory bandwidth, not raw compute power.
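
To make the compute-versus-bandwidth argument concrete, here is a rough sketch in Python. It assumes the common rule of thumb of about 2 floating-point operations per parameter per token, and that every weight is read from memory once per forward pass. The accelerator figures are hypothetical placeholders chosen for round numbers, not the specifications of any real card.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of each weight
    per pass, so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity above which a chip is limited by compute rather than memory."""
    return peak_flops / bandwidth_bytes_per_s


# Hypothetical accelerator: 100 TFLOP/s of compute, 1 TB/s of memory bandwidth.
ridge = ridge_point(100e12, 1e12)

# 16-bit weights (2 bytes each); a 2048-token prompt versus one decoded token.
prefill = arithmetic_intensity(tokens_per_pass=2048, bytes_per_param=2.0)
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2.0)

for name, value in (("prefill", prefill), ("decode", decode)):
    regime = "compute-bound" if value > ridge else "memory-bound"
    print(f"{name}: {value:.0f} FLOPs/byte vs ridge {ridge:.0f} -> {regime}")
```

Under these assumptions the prefill pass sits far above the ridge point and the decode pass far below it, which is the two-bottleneck picture described above.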

The formula that makes it concrete

Once I understood this, a simple approximation helped crystallize it:

Tokens per second ~= Memory Bandwidth (GB/s) / Model Size in Memory (GB)

Take an NVIDIA L40S GPU with around 864 GB/s of memory bandwidth. Load a Qwen 27B model quantized to 4-bit, which fits in roughly 14GB of VRAM. The theoretical ceiling is something in the range of 60 tokens per second, which matches closely what I observed in practice when running experiments on OVH cloud instances. This formula is approximate and ignores KV cache overhead, batching effects, and software stack efficiency, but it is directionally correct and explains why GPU performance in inference is so different from what people expect when they look at teraflop numbers.
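
The approximation is simple enough to write as a small Python helper. This is only the back-of-the-envelope ceiling from the formula above, using the L40S and Qwen 27B figures quoted in the text; the function name is my own, and it deliberately ignores KV cache reads, batching, and software overhead.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Each generated token requires reading all model weights once, so the
    ceiling is memory bandwidth divided by the size of the weights in memory.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Figures from the text: ~864 GB/s of bandwidth, ~14 GB of 4-bit weights.
ceiling = decode_tokens_per_second(864, 14)
print(f"theoretical ceiling: {ceiling:.1f} tokens/s")  # about 61.7
```

Real measurements should land below this number, never above it; if they land far below, the gap usually points at the software stack or at KV cache traffic rather than at the GPU's compute.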

VRAM, bandwidth, and why they are different things

This distinction matters enormously in practice, and it is one the industry does not explain clearly to newcomers. VRAM capacity determines which models you can run at all. If your model weights exceed your VRAM, you cannot load the model, or you start offloading to CPU RAM, which is catastrophically slower. A 70B parameter model in 4-bit quantization requires roughly 35GB of VRAM, which already exceeds what a single consumer GPU and most mid-range professional cards offer. Memory bandwidth, on the other hand, determines how fast you can run a model that already fits. When comparing GPUs for inference, VRAM capacity is the first priority. You need enough to fit your model. But once two cards have the same amount of VRAM, the one with higher bandwidth will almost always deliver more tokens per second and a smoother experience.
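
The capacity check can be sketched in a few lines of Python. The helper names are illustrative, the 24 GB and 48 GB cards are hypothetical examples, and the estimate covers weights only: a real deployment also needs room for the KV cache and runtime overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weights_gb(params_billions, bits_per_weight) <= vram_gb


print(weights_gb(70, 4))   # 35.0 GB, the 70B example from the text
print(fits(70, 4, 24))     # False: too large for a hypothetical 24 GB card
print(fits(70, 4, 48))     # True: fits a hypothetical 48 GB card, before KV cache
```

Capacity answers the yes-or-no question of whether a model runs at all; only once it passes does bandwidth decide how fast it runs.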

When I clustered multiple GPUs together, the reason the token throughput did not scale linearly became clear: adding more GPUs in a naive multi-GPU setup does not proportionally increase the memory bandwidth available to a single generation step. The model gets distributed across more VRAM, which is useful for fitting larger models, but the inter-GPU communication overhead and the sequential nature of the decode phase mean you do not simply multiply your output. The architecture of multi-GPU inference is a deep topic in itself.
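
A toy model in Python helps show why the scaling is sublinear. It is a simplification under my own assumptions, not a description of any particular inference engine: it contrasts a layer split, where the devices work one after another for a single stream, with a tensor split, where devices read their slices in parallel but pay a fixed synchronization cost per token. The bandwidth, model size, and synchronization cost are hypothetical.

```python
def layer_split_tps(bandwidth_gb_s: float, model_gb: float, n_gpus: int) -> float:
    """Single-stream decode speed when layers are split across devices.

    Each device reads only its 1/n slice, but the devices run in sequence,
    so the total time per token matches a single device.
    """
    time_per_device = (model_gb / n_gpus) / bandwidth_gb_s
    return 1.0 / (time_per_device * n_gpus)


def tensor_split_tps(bandwidth_gb_s: float, model_gb: float, n_gpus: int,
                     sync_s_per_token: float) -> float:
    """Single-stream decode speed when every layer is sharded across devices.

    Devices read their slices in parallel, then pay a communication cost.
    """
    read_time = (model_gb / n_gpus) / bandwidth_gb_s
    sync_time = sync_s_per_token if n_gpus > 1 else 0.0
    return 1.0 / (read_time + sync_time)


# Hypothetical setup: 800 GB/s per device, 20 GB model, 15 ms sync per token.
print(layer_split_tps(800, 20, 1))           # one device: ~40 tokens/s
print(layer_split_tps(800, 20, 4))           # four devices in sequence: still ~40
print(tensor_split_tps(800, 20, 4, 0.015))   # four devices sharded: ~47, not 160
```

With these made-up numbers, four devices yield roughly 47 tokens per second instead of 160, the same shape as the Mac Studio cluster results that first confused me. The faster the interconnect, the closer the tensor split gets to the ideal.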

Quantization: fitting more into less

One of the most practically important levers in inference is quantization, reducing the numerical precision of model weights from the full 32-bit or 16-bit floating point they are trained in to lower-precision formats like 8-bit integers (Q8), 4-bit (Q4), or even 3-bit representations. The gains are substantial. A model that requires 54GB of VRAM in full 16-bit precision might fit into 14GB at 4-bit quantization. This is not without cost: lower precision introduces some quality degradation, and the effect varies significantly across models and tasks. But for many practical applications, a well-quantized model is nearly indistinguishable from its full-precision counterpart, and the hardware requirements drop dramatically.
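
To give a feel for what reducing precision means, here is a minimal sketch of symmetric round-to-nearest quantization in plain Python. It uses a single scale for the whole list, which is an assumption made for brevity: the formats used by real inference engines are more elaborate (typically block-wise, with a scale per small group of weights). The weight values are made up.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                                # all-zero input: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Made-up weights, only to show how the error grows as precision drops.
weights = [0.12, -0.57, 0.33, 0.91, -0.08, 0.44, -0.76, 0.02]
for bits in (8, 4, 3):
    print(f"{bits}-bit: max error {max_abs_error(weights, bits):.4f}")
```

The error roughly doubles with each bit removed, which is why 8-bit is usually close to lossless while 3-bit needs more careful schemes to stay usable.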

Quantization applies not just to model weights but also to the KV cache, the intermediate state that grows during generation as the model tracks the context of the conversation. The KV cache is often underestimated: it grows linearly with sequence length and concurrency, and for long context windows or many simultaneous users it can consume more VRAM than the model weights themselves, making it a critical factor in hardware sizing. Quantizing the KV cache to 8-bit or 4-bit representations recovers significant memory headroom and can meaningfully increase the number of tokens per second, especially at longer contexts. This is one of the reasons llama.cpp, vLLM, and other inference engines expose granular quantization controls: the right combination of model quantization and KV cache quantization for a given hardware setup is a real optimization problem.
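
The linear growth is easy to see with a sizing helper in Python. The formula is the standard one for a transformer with a per-layer key and value cache; the model shape below (48 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not the published architecture of any specific model.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                concurrent_seqs: int, bytes_per_value: float) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes).

    The factor 2 covers keys and values; one entry is stored per layer,
    per KV head, per token, per concurrent sequence.
    """
    total_bytes = (2 * layers * kv_heads * head_dim
                   * seq_len * concurrent_seqs * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical model shape, 32k-token contexts, 8 concurrent users.
shape = dict(layers=48, kv_heads=8, head_dim=128, seq_len=32768, concurrent_seqs=8)
for label, nbytes in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
    print(f"{label} KV cache: {kv_cache_gb(bytes_per_value=nbytes, **shape):.1f} GB")
```

For this made-up configuration the 16-bit cache comes to about 51.5 GB, several times the 14 GB of 4-bit weights discussed earlier, and each halving of cache precision halves that figure.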

Inference software: llama.cpp, vLLM, and what they actually do

The choice of inference runtime matters more than most people realize. llama.cpp is a C++ inference engine that runs efficiently on a wide range of hardware, including CPUs and Apple Silicon, and supports virtually every open-source model format. It is what most people use when running models locally. It is fast, portable, and actively maintained. vLLM is a more production-oriented server designed for high-throughput multi-user inference. It implements continuous batching, a technique that significantly improves GPU utilization when serving multiple concurrent requests, and is the standard choice for teams deploying open-source models for internal or commercial use. When I was running experiments on OVH infrastructure with L40S, A100, and V100S instances, I was using llama.cpp and a simple server wrapper that let me benchmark different configurations directly. With a Qwen 27B model in 4-bit quantization on a single L40S, I was consistently seeing 30 to 60 tokens per second, completely usable for most applications, deployable by a small team, and at a fraction of the cost of frontier API services.
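
To illustrate why continuous batching helps, here is a toy scheduler in Python. It is not vLLM's implementation, only a sketch of the idea under two assumptions of mine: each decode step costs the same regardless of how many slots are occupied (plausible while decoding is memory-bound), and each request is described only by how many tokens it will generate, which must be at least one. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Decode steps when each batch must finish entirely before the next starts."""
    steps = 0
    for i in range(0, len(output_lengths), slots):
        steps += max(output_lengths[i:i + slots])   # batch waits for its longest request
    return steps


def continuous_batching_steps(output_lengths, slots):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = deque(output_lengths)
    active = []                                     # remaining tokens per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())          # admit waiting requests into free slots
        steps += 1                                  # one decode step: one token per request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Two long requests mixed with six short ones, four slots on the GPU.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 105
```

In this example the same 230 tokens are produced in 105 steps instead of 200, because short requests no longer hold their slots hostage until the longest request in their batch finishes.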

Part Two: The Token Economy and Why Sovereignty Matters

The data center as a token factory

Jensen Huang, at Nvidia's GTC 2026 press briefing in San Jose, insisted that the industry must stop thinking about computers as systems for data entry and retrieval. The new paradigm, he said, is a "token manufacturing system." This framing is not just marketing. It is a precise description of how the economics work. A data center receives electrical power as its primary input. It converts that power, through GPU clusters, networking, cooling systems, and software, into tokens as its output. The business model is a factory: maximize the number of useful tokens produced per watt of electricity consumed, and sell those tokens at a margin above cost.
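
The factory framing reduces to a few lines of arithmetic, sketched here in Python. All input numbers are hypothetical placeholders, and the calculation covers electricity only: hardware amortization, cooling, networking, and staff would all add to the real cost per token.

```python
JOULES_PER_KWH = 3.6e6


def tokens_per_kwh(tokens_per_second: float, power_watts: float) -> float:
    """Tokens produced per kilowatt-hour of electricity consumed."""
    joules_per_token = power_watts / tokens_per_second
    return JOULES_PER_KWH / joules_per_token


def energy_cost_per_million_tokens(tokens_per_second: float, power_watts: float,
                                   price_per_kwh: float) -> float:
    """Electricity cost of producing one million tokens."""
    return 1e6 / tokens_per_kwh(tokens_per_second, power_watts) * price_per_kwh


def gross_margin(price_per_million: float, cost_per_million: float) -> float:
    """Share of the selling price that remains after the given cost."""
    return (price_per_million - cost_per_million) / price_per_million


# Hypothetical server: 1,000 W draw, 2,000 tokens/s across all users, 0.10 per kWh.
cost = energy_cost_per_million_tokens(2000, 1000, 0.10)
print(f"tokens per kWh: {tokens_per_kwh(2000, 1000):,.0f}")
print(f"energy cost per million tokens: {cost:.4f}")
print(f"margin at a price of 1.00 per million: {gross_margin(1.00, cost):.1%}")
```

Seen this way, every improvement in tokens per second at constant power raises output per kilowatt-hour directly, which is the lever the factory model says to maximize.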

At GTC 2026, Huang effectively reduced the next decade of AI infrastructure to a handful of variables: power, throughput, memory, interconnect, orchestration, and security. The next wave of data center advantage will not belong only to those who can add megawatts. It will belong to those who can turn fixed power into the highest-value token output. This is the key insight for understanding Nvidia's strategy. They are not just selling GPUs. They are selling the full stack that converts power into tokens as efficiently as possible, and they are now positioning themselves to sell that stack as a turnkey system to data center operators around the world.

At GTC 2026, Huang announced that purchase orders for Blackwell and Vera Rubin are expected to reach $1 trillion through 2027, up from a $500 billion projection the previous year. The demand signal is extraordinary, and it is driven almost entirely by inference, not training.

The open-source shift nobody is pricing in

This is where my own view diverges from the mainstream narrative. The public conversation about AI is dominated by frontier model providers: OpenAI, Anthropic, Google. These companies produce extraordinarily capable models that are used heavily today and receive the vast majority of media attention. But when I look at the trajectory of open-source models, I think there is a significant repricing coming.

The 27-billion-parameter model in the Qwen 3 family from Alibaba's research lab runs comfortably on a single high-end consumer GPU or a small cloud instance. Yet its performance is now comparable to frontier closed models on most everyday workloads. Claude Opus 4.6, Anthropic's most capable model, is estimated to have somewhere around 1 to 2 trillion parameters. If that estimate is right, Qwen 27B is roughly 40 to 75 times smaller. The gap in capability for the tasks that 95% of users actually perform, such as writing, coding assistance, summarization, analysis, and answering questions, is narrowing rapidly. When I run Qwen 27B locally, what I can do with it covers the vast majority of my daily AI use cases. The remaining gap exists at the edges: very long documents, extremely complex multi-step reasoning, and tasks that require broad world knowledge synthesized with nuance. For everything else, a small model running locally or on a modest cloud instance is entirely sufficient.

The implication is that the token economics change completely when the model is open-source and small. Instead of paying a frontier provider fractions of a cent per token on their infrastructure, you can deploy your own instance at a cost that, at scale, is an order of magnitude lower. For companies and governments with predictable, high-volume use cases, this math will eventually be impossible to ignore.
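
The comparison can be framed as a break-even calculation, sketched below in Python. Every figure is a hypothetical placeholder rather than a quoted price, and the model is deliberately crude: it treats the self-hosted instance as a fixed monthly cost and ignores engineering time, redundancy, and quality differences between models.

```python
HOURS_PER_MONTH = 730.0


def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Cost of buying the given volume from a per-token API."""
    return tokens_per_month / 1e6 * price_per_million_tokens


def self_hosted_monthly_cost(hourly_instance_price: float) -> float:
    """Fixed cost of keeping one instance running all month."""
    return hourly_instance_price * HOURS_PER_MONTH


def break_even_tokens_per_month(hourly_instance_price: float,
                                price_per_million_tokens: float) -> float:
    """Monthly volume above which the fixed instance is cheaper than the API."""
    return self_hosted_monthly_cost(hourly_instance_price) / price_per_million_tokens * 1e6


def monthly_capacity(tokens_per_second: float, utilization: float) -> float:
    """Tokens one instance can produce in a month at the given utilization."""
    return tokens_per_second * utilization * HOURS_PER_MONTH * 3600


# Hypothetical: instance at 2.00 per hour, API at 5.00 per million tokens,
# 500 tokens/s aggregate throughput at 50% utilization.
print(f"self-hosted cost: {self_hosted_monthly_cost(2.0):,.0f} per month")
print(f"break-even volume: {break_even_tokens_per_month(2.0, 5.0):,.0f} tokens per month")
print(f"instance capacity: {monthly_capacity(500, 0.5):,.0f} tokens per month")
```

With these made-up inputs the instance pays for itself at 292 million tokens per month and can produce more than twice that, which is why predictable, high-volume workloads are where the argument bites first. Below the break-even volume, the per-token API remains the cheaper option.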

Sovereignty is a hardware problem

This is where the technical and the political converge in ways that I think Europeans, and especially the French, have not yet fully internalized.

Producing tokens requires hardware. That hardware today means, almost entirely, Nvidia GPUs. Nvidia's supply chain runs through TSMC in Taiwan for fabrication, and through ASML in the Netherlands for the extreme ultraviolet lithography machines that make modern chip manufacturing possible. This supply chain is extraordinarily concentrated and carries geopolitical risk that is not theoretical. If France and Europe want to be capable of producing tokens for their own citizens and businesses, not dependent on American clouds and not subject to export control decisions made in Washington, they need physical infrastructure on French and European soil, running hardware they can access, powered by energy they control.

France is actually in a surprisingly strong position for the energy side of this equation. Nuclear power provides stable, low-carbon baseload electricity at a cost that is competitive for data center operations. French data centers also have access to efficient cooling approaches, including closed water circuits that recycle rather than evaporate, that reduce environmental impact compared to air-cooled American facilities in warm climates. The energy bottleneck that constrains data center expansion in many parts of the world is less acute in France than almost anywhere else.

The hardware bottleneck is real but not insurmountable. The answer is not to wait for a European Nvidia. It is to build infrastructure now using the best available hardware, which today means Nvidia, while investing seriously in alternatives that could reduce dependence over time. Initiatives like OVHcloud in France are part of this picture, but the scale of investment required to become genuinely sovereign in token production is larger than what currently exists.

The Taalas signal

The most technically interesting development I followed during this research is a Toronto-based startup called Taalas. Their approach is radical: instead of running models on programmable GPUs where weights must be continuously shuttled from memory to compute, they etch the model's weights directly into the silicon of a custom chip.

Taalas's solution is to eliminate the memory-fetch cycle entirely. By using a proprietary automated design flow, they translate the computational graph of a specific model directly into the physical layout of a chip. In their HC1, the model's weights and architecture are literally etched into the wiring of the silicon. The result is a system that does not depend on difficult or exotic technologies: no HBM, advanced packaging, 3D stacking, or liquid cooling. Engineering simplicity enables an order-of-magnitude reduction in total system cost, and their silicon Llama achieves 17,000 tokens per second per user, nearly 10x faster than the current state of the art, while costing 20x less to build and consuming 10x less power.

Taalas unveiled the HC1 on February 19, 2026, fabricated on TSMC's 6nm process at 815mm² and 53 billion transistors, alongside a $169 million funding round led by Fidelity and Quiet Capital, lifting total capital raised to over $219 million against a reported $30 million in actual product spend.

The obvious limitation is flexibility: a chip hardwired for Llama 3.1 8B cannot run Qwen 27B. Taalas addresses this through manufacturing automation. They have built a compiler-like foundry system that takes model weights and generates a chip design, collapsing the turnaround time from weights to silicon to roughly two months by changing only the top metal masks of the chip rather than redesigning the entire die. This is an extraordinary compression of what would traditionally be a two-year, multi-hundred-million-dollar ASIC development cycle.

The Taalas model is not a solution for all inference use cases. It excels at high-volume serving of a stable, production-ready model, exactly the use case of a sovereign inference infrastructure designed to serve millions of citizens with a trusted, validated open-source model. Whether Taalas or a similar approach can scale to larger models remains an open question, and the economics of frequent chip respins at production volume have not yet been validated at scale. But the direction it points is important: breaking the memory wall through hardware specialization rather than incremental improvements to the general-purpose GPU paradigm.

What I find most significant about Taalas is that they are Canadian. There is no reason a similar initiative could not emerge in France or Europe. The technical knowledge exists. The capital markets for deep tech investment, while less developed than in North America, are growing. And the strategic motivation is arguably stronger in Europe than anywhere else.

The hardware stack of the future

My view on where this goes in the next three to five years is the following. Frontier models from OpenAI, Anthropic, and Google will remain important for the most complex tasks, those requiring the full reasoning capacity of the largest systems. But the everyday AI workload for the majority of users, across the majority of applications, will migrate toward smaller open-source models running closer to the user: on company infrastructure, on national cloud providers, and eventually on edge devices.

Apple's strategic direction reinforces this. They are investing heavily not in building the largest frontier models but in making powerful small models run efficiently on their hardware. A MacBook Pro today can run a 27B model at usable speeds. In two or three years, with continued hardware improvements, the gap between what runs locally and what runs in a data center will narrow further. Apple's silicon is already unusually well-suited to inference workloads because of how they integrate memory and compute: a unified memory architecture gives the GPU direct access to a large, relatively high-bandwidth memory pool, avoiding the copies between system RAM and VRAM that discrete GPU setups require.

The combination of better small models, better local hardware, and purpose-built inference chips like what Taalas is building points toward a world where token production is far more distributed and far less concentrated in a handful of American hyperscalers than it is today. For Europe, and for France specifically, the window to build meaningful sovereign capacity is open now, while the technology is accessible, while the open-source models are catching up, and before the infrastructure dependencies of the next decade get locked in.

The token is the unit of intelligence. Whoever controls the factory controls the product.