Cover image: The 11th specimen of Archeopteryx in the Senckenberg Museum. Photo by JensKunstfreund, CC BY-SA 4.0, via Wikimedia Commons.
If you ask a paleontologist or a knowledgeable fifth grader what happened to the dinosaurs, they will tell you that dinosaurs are everywhere. While many species went extinct, others survived, thrived and evolved into the birds that dominate our skies today. For every human on earth, there are six birds. Evolution is a process of optimization, and history reveals a recurring pattern where large, dominant entities are often outmaneuvered by smaller, specialized competitors. This transition from mega to micro is evident throughout technological innovation, and it is currently playing out in the conflict between Ukraine and Russia as well as the AI arms race.
But first, let’s look backward to understand what may lie ahead.
In 1990, when Tim Berners-Lee was at CERN developing the foundational ideas for the World Wide Web, he was trying to solve a practical problem for the research community. Deep research involves a tremendous amount of cross referencing of citations, appendices and materials. How could documents and files be authored, stored, and distributed on the wide-area computer network used by labs and universities in a way that was semantically connected and interactive?
Although the hypertext concept was an undoubtedly a brilliant set of primitives combining existing ideas, it was the proof of concept software Berners-Lee and his colleagues developed that changed the world. To make hypertext possible, he needed four core ingredients:
- New text file markup standards.
- A new network protocol.
- The software to serve files and broker the protocol over the network.
- The software to view the documents from anywhere on the network.
Within a few short years, Berners-Lee’s idea was rapidly extrapolated into an infinite number of use cases, which particularly attracted commercial interests. Structured, interactive information could be shared publicly and in real time by anyone who could establish a connection to the emerging global network of computer systems.
It’s remarkable to consider that the race to develop and establish all the necessary tooling was viewed as a gold rush, with early companies such as Netscape entering the NASDAQ at a valuation of 2.9 billion dollars while offering only a internet browser and a mail client that were free for personal use. Over the following decade, a series of clumsy and delusional acquisitions made it clear that the drunken party of Netscape lacked a defensible technological advantage or a sustainable business model. Once Microsoft, Apple, and various open-source projects began offering free equivalent alternatives, Netscape could no longer compete effectively.
The “picks and shovels” of the dot-com gold rush were companies like Hewlett-Packard, Sun Microsystems and Cisco systems, which built industrial-grade computers and network switches for data centers and telecoms to serve the exploding demand for online data at a massive scale.
Fast-moving companies like AOL not only required warehouses full of servers, cables and routers but also software that evolved so rapidly that only free and open-source solutions could meet the need. By the end of the 1990s, the free Apache HTTP Server dominated professional deployments, and the open-source, Unix-compatible operating system Linux proliferated through server farms, thanks to its free, flexible, and infinitely customizable architecture. Most of the software we use in enterprise engineering today is open source, from soup to nuts.

But simultaneous to the familiar hub-spoke client to server topology was the rise of distributed decentralized compute. In the late 1990s, the screensavers of millions of personal computers transformed from simple scrolling graphics into a massive, distributed research lab. The SETI at Home project invited ordinary users to download a small application that would process radio telescope data during periods of desktop inactivity. By harnessing the unused processing power of hundreds of thousands of home machines, researchers created a global supercomputer capable of analyzing complex patterns in the search for extraterrestrial intelligence. Rather than sending massive raw datasets to a single facility, the infrastructure pushed the workload to the edge of the network, where it could be processed in parallel.
The internet was designed from day 0 to be lean and scrappy. Because connections were slow and unreliable, engineers built the system to be as efficient and resilient as possible. Instead of one massive computer doing all the work, they spread the load across many different machines. Parts of the network could fail and packets would simply find another path in milliseconds. They created a smart storage pattern in the protocol that allowed every device along the way to your device to save copies of files with an expiration date. If you visited an address, your computer would ask the next system “up the wire” if it had the latest copy. Large companies, universities and ISPs utilized proxy servers to make the internet faster and more efficient. Much of AOL’s technology involved aggressive asset compression and optimization as part of the proxy to move faster over antiquated telephone lines. Today, the descendants of this approach are a standard pattern called “edge caching” that no professional software ships without.
All of this is important pretext to understanding the current phase of massive AI adoption and infrastructure expansion. The early development of large language models benefited from tremendous barriers to entry, thanks to the massive computing power required for training. However, that edge is rapidly diminishing as an international force of open-source model providers narrows the performance gap on frontier benchmarks by (often unscrupulously) using an expensive model to train a free one, and further optimizing it with reinforcement learning. Three years ago, as my colleagues and I began developing solutions with AI, we predicted that models would soon be free and inference was going to be a commodity. We had seen this movie before.
Well, here we are.
What I didn’t predict was how the “ah-ha” moment of agentic AI would present a sudden step increase in demand while simultaneous advances in model training would bend the cost curve. Nonetheless, Jevon’s paradox being what it is, there’s no clear limit to how much this will grow, but the economic incentives to drive down costs by any means necessary are going to be relentless.
The industry’s early assumptions that the deployment landscape would be dominated by massive, computationally intensive and heavily subsidized general purpose models with PhD-level knowledge overlooked the cumulative innovations from the past 35 years. Businesses and individuals are realizing that the marginal costs of inference are unlike those of traditional computing systems — in fact, they’re horrible. Fifteen years ago, engineers would procure a rack of servers and iteratively optimize every component to avoid the cost and hassle of buying, setting up, deploying, and maintaining additional hardware that usually had to run twenty-four-seven under variable load, waiting for it to pay for itself. This was a win-win for the CIO and the CFO. Engineering had strong incentives to minimize marginal costs.

Software engineers are thus deeply familiar with the concept of “right-sizing”, understanding that a semi-truck is a dumb solution for takeout delivery when electric scooters and drone carts exist. Business leaders who were bullish on AI are now experiencing sticker shock as their companies burn through generous annual inference budgets in months while struggling to realize significant productivity gains. Understanding the competitive necessity of inference, engineering teams are now doing what they do best: relentlessly optimizing to drive down costs. Smart teams and researchers are reporting that they can slash costs by 80 percent by right-sizing in scalable containers and shifting operations to appropriate open-weight models. For instance, marketing, research, finance and tech support departments may be using premium proprietary models and inference for basic research, ideation, support chat, retrieval augmented knowledge bases, communication, and document processing. Each one of those is a specific use case that could be using far more efficient alternatives like Gemma, Llama, Phi, Kimi, and DeepSeek while also gaining total control over data privacy for compliance-sensitive workloads.
Software developers are ahead of the curve on this principle. Often paying for personal inference out of pocket to remain competitive in the first sector under threat, they quickly realized that the latest generation of low cost open-weight models were highly suitable for a wide range of tasks, reserving the most powerful models for specialized jobs. Combined with increasingly sophisticated agentic orchestration, they may not need to rely on frontier models or per-token charges at all, finding that open models hosted on containerized cloud instances serve as drop-in substitutes for the overpriced premium offerings.

For lightweight tasks, the latest generation of small models are highly specialized and shockingly great, utilizing advanced training methods and lower bitrates that allow them to run smoothly on modern laptops and smartphones, offering the significant benefit of total data privacy. This also means distributed, selective orchestration will slash costs, following the same patterns we discussed earlier. A lightweight local orchestrator can assign basic tasks to local compute and selectively offload more complex tasks to various cloud models based on cost efficiency rubrics as an agent skill.
This shift toward open-weight models represents a fundamental optimization of enterprise operations, where the reduction of token-processing costs has become a strategic driver for smart businesses. In his 2026 paper, Inference Economics: A New Paradigm for the Economics of Artificial Intelligence, Ibrahim Niankara notes that the cost of processing individual tokens has emerged as the critical variable determining the operational viability of any large-scale AI deployment. By migrating from subscription-based APIs to self-hosted models, firms are effectively insulating themselves from the volatile overhead of proprietary vendor ecosystems desperate to realize a return on moonshot capital bets.

This operational transition is supported by advancements in model efficiency. In the 2026 MDPI publication Performance Trade-offs of Optimizing Small Language Models for E-commerce, J.T. Licardo details how parameter-efficient fine-tuning and post-training quantization allow organizations to tailor models for specific business tasks without necessitating expensive, high-end GPU clusters. This capability ensures that enterprises can achieve high-performance results while significantly curbing their recurring cloud expenditures.
Furthermore, the hardware barrier to effective AI deployment is lowering through the use of standard components. In the research paper Empirical Analysis of Small Language Model Quantization for CPU-only Cloud Infrastructure, Jessica Sciammarelli uses a quantitative analysis to show that aggressive 4-bit integer quantization allows for real-time inference on standard CPU-based architecture. By maintaining a minimal RAM footprint, these models fall into the “sweet spot” of cost and performance. They perform reliably on existing data center assets, allowing firms to bypass the rising rental costs of specialized HBM-based AI chips. The adoption of these strategies enables a leaner, more sustainable technical infrastructure that prioritizes practical utility over excessive raw computational power.
This has fundamentally altered the economics of enterprise inference, moving the industry toward a landscape where processing costs are no longer tethered to proprietary API pricing. In his 2026 paper, The End of the Foundation Model Era, J.J. Grogan notes that the traditional economic moats of proprietary providers—centered on the immense capital intensity of training frontier models—has largely eroded as open-weight alternatives achieve performance parity. Grogan further argues that as businesses increasingly adopt these smaller, specialized models, the market value is shifting toward domain-specific expertise and the capacity to warrant autonomous, locally-hosted outputs.
This is a vulnerability for hyperscalers, who risk falling into a commodity trap regarding their underlying compute infrastructure. In the working paper The Emerging Market for Intelligence: Pricing, Supply, and Demand for LLMs, published by NBER in 2026, M. Demirer observes that open-source models are now approximately 90% cheaper than proprietary counterparts of comparable capability. This dramatic price disparity, as C. Li details in the Berkeley Center for Management of Technology report The Coming Disruption, forces hyperscalers into a precarious position where they must continue to justify massive capital expenditures on data centers while their primary customers increasingly migrate to more cost-effective, self-hosted deployments.
The economic implications extend to the sustainability of the hardware ecosystem itself. Servaas Storm asserts that the current industry obsession with scaling parameters is hitting diminishing returns, leaving large infrastructure providers to support low-margin, high-volume inference workloads on expensive, specialized hardware.
Ultimately, the shift toward open-weight architectures is also driven by a requirement for data sovereignty. In their 2025 research, Performance Evaluation of Popular Open-Source Large Language Models in Healthcare, Saif Khairat and colleagues highlight that local deployment provides the requisite security for sensitive information, a factor that renders proprietary, cloud-only models untenable for many regulated industries. As Christian Catalini observes in Some Simple Economics of AGI, the future of the field is less about the sheer size of the model and more about the integration of verifiable data, finite human verficiation, and local compute efficiency, signaling a long-term recalibration of how value is captured in the AI technology stack.
These trends are converging toward a set of interesting conclusions:
- OpenAI and Anthropic may not have much of a moat.
- Compute is a race to the bottom, and the bottom is a computer in your office.
- Large, foundational models are less important than we assumed.
- Practical efficiency innovations will rapidly drive down the cost (and ROI) of applied AI.
We’ve been betting on T-Rex when a murmation of starlings are forming beautiful waves on the horizon.