Cover image: The 11th specimen of Archeopteryx in the Senckenberg Museum. Photo by JensKunstfreund, CC BY-SA 4.0, via Wikimedia Commons.
If you ask a paleontologist or a knowledgeable fifth grader what happened to the dinosaurs, they will tell you that while many species went extinct, others survived, thrived and evolved into the birds that dominate our skies today. For every human on earth, there are six birds. Evolution is a process of optimization, and history reveals a recurring pattern where large, dominant entities are often outmaneuvered by smaller, specialized competitors.
In 1990, when Tim Berners-Lee was at CERN developing the foundational ideas for the World Wide Web, he was trying to solve a practical problem for the research community. Research involves a lot of cross referencing of citations, appendices and materials. How could documents and files be authored, stored, and distributed on the wide-area computer network used by labs and universities in a way that was semantically connected and interactive?
Although the hypertext concept was a brilliant set of primitives combining existing ideas, it was the software he developed, and the standards that emerged from it that changed the world. To make hypertext possible, he needed four core ingredients:
- New text file markup standards.
- A new network protocol with rules and standards to make it efficient
- The software to serve files and broker the protocol over the network.
- The software to view the documents from anywhere on the network.
Within a few short years, his idea was rapidly extrapolated into an infinite number of use cases, which particularly attracted commercial interests. It’s remarkable to consider that the race to develop and establish all the necessary tooling was viewed as a gold rush, with early companies such as Netscape entering the NASDAQ at a valuation of 2.9 billion dollars while offering only a browser and a mail client that were free for personal use. Over the following decade, a series of clumsy and delusional acquisitions made it clear that the drunken party of Netscape lacked a defensible technological advantage or a sustainable business model. Once Microsoft, Apple, and various open-source projects began offering free equivalent alternatives, Netscape could no longer compete effectively. The business was worthless.
The “picks and shovels” of the dot-com gold rush were companies like Hewlett-Packard, Sun Microsystems and Cisco systems, which built industrial-grade computers and network switches for data centers and telecoms to serve the exploding demand for online data at a massive scale.
Fast-moving companies like AOL not only required warehouses full of servers, cables and routers but also software that evolved so rapidly that only free and open-source solutions could meet the need. By the end of the 1990s, the free Apache HTTP Server dominated professional deployments, and the open-source, Unix-compatible operating system Linux proliferated through server farms, thanks to their free and customizable architectures. Most of the software we use in enterprise engineering today is open source, from soup to nuts.

Simultaneous to the familiar hub-spoke topology, the Internet led to the rise of distributed decentralized compute. In the late 1990s, the SETI at Home project invited ordinary users to download a small application that would process radio telescope data during periods of desktop inactivity. By harnessing the unused processing power of hundreds of thousands of home machines, researchers created a global supercomputer capable of analyzing complex patterns in the search for extraterrestrial intelligence. Rather than sending massive raw datasets to a single facility, the infrastructure pushed the workload to the edge of the network, where it could be processed in parallel.
The internet was designed to be lean and scrappy. Because connections were slow and unreliable, engineers built the system to be as efficient and resilient as possible. Instead of one massive computer doing all the work, they spread the load across many different machines. Parts of the network could fail and packets would simply find another path in milliseconds. They created a smart storage pattern in the protocol that allowed every device along the way to your device to save copies of files with an expiration date. If you visited an address, your computer would ask the next system “up the wire” if it had the latest copy. Large companies, universities and ISPs utilized such proxy servers to make the internet faster and more efficient. Much of AOL’s technology involved aggressive asset compression and optimization as part of the proxy to move faster over antiquated telephone lines. Today, the descendants of this approach are standard.
All of this is important pretext to understanding the current phase of massive AI adoption and infrastructure expansion. The early development of large language models benefited from tremendous barriers to entry, thanks to the massive computing power required for training. However, that edge is rapidly diminishing as an international force of open-source model providers narrows the performance gap on frontier benchmarks by (often unscrupulously) using an expensive model to train a free one, and further optimizing it with reinforcement learning. Three years ago, as my colleagues and I began developing solutions with AI, we predicted that models would soon be free and inference was going to be a commodity. We had seen this movie before.
Well, here we are.
What I didn’t predict was how the “ah-ha” moment of agentic AI would present a sudden step increase in demand while simultaneous advances in model training would bend the cost curve. Nonetheless, Jevon’s paradox being what it is, there’s no clear limit to how much this will grow, but the economic incentives to cut costs by any means necessary are relentless.
The industry’s early assumptions that the deployment landscape would be dominated by massive, centralized, computationally intensive foundation models with PhD-level knowledge overlooked the cumulative innovations and patterns of the past. Businesses and individuals are quickly realizing that the marginal costs of inference are unlike those of traditional computing systems — in fact, they’re horrible. Fifteen years ago, engineers would procure a rack of servers and iteratively optimize every part to avoid the cost and hassle of buying, setting up, deploying, and maintaining a new server that usually had to run twenty-four-seven under variable load, waiting for it to pay for itself. This was a win-win for the CIO and the CFO. Engineering had strong incentives to minimize marginal costs.

Software engineers are thus deeply familiar with the concept of “right-sizing”, understanding that a semi-truck is a dumb solution for takeout delivery when electric scooters and drone carts exist. Business leaders who were bullish on AI are now experiencing sticker shock as their companies burn through generous annual inference budgets in months while struggling to realize significant productivity gains. Understanding the competitive necessity of inference, engineering teams are now doing what they do best: relentlessly optimizing. Smart teams are slashing costs by 80 percent by right-sizing in scalable containers and shifting operations to appropriate open-weight models. For instance, marketing, research, finance and tech support departments may be using premium proprietary models and inference for basic research, ideation, support chat, retrieval augmented knowledge bases, communication, and document processing. Each one of those is a specific use case that could be using far more efficient alternatives like Gemma, Llama, Phi, Kimi, and DeepSeek while also gaining total control over data privacy for compliance-sensitive workloads.
Software developers have been ahead of the curve on this principle. Often paying exorbitant costs for inference out of pocket to remain competitive, they quickly realized that the latest generation of low cost open-weight models were highly suitable for a wide range of coding tasks, reserving the most powerful models for specialized jobs. Combined with increasingly sophisticated agentic orchestration, they may not need to rely on frontier models or per-token charges at all, finding that open models hosted on their local machine or containerized cloud instances serve as drop-in substitutes for the overpriced premium offerings.

For lightweight tasks, the latest generation of small models are highly specialized and shockingly great, utilizing advanced training methods and lower bitrates that allow them to run smoothly on modern laptops and smartphones, offering the significant benefit of total data privacy. This also means distributed, selective orchestration will slash costs, following the same patterns we discussed earlier. A lightweight local orchestrator can assign basic tasks to local compute and selectively offload more complex tasks to various cloud models based on cost efficiency rubrics as an agent skill.
This shift toward open-weight models represents a fundamental optimization of enterprise operations, where the reduction of token-processing costs has become a strategic driver for smart businesses. In his 2026 paper, Inference Economics: A New Paradigm for the Economics of Artificial Intelligence, Ibrahim Niankara notes that the cost of processing individual tokens has emerged as the critical variable determining the operational viability of any large-scale AI deployment. By migrating from subscription-based APIs to self-hosted models, firms are effectively insulating themselves from the volatile overhead of proprietary vendor ecosystems desperate to realize a return on moonshot capital bets.

This operational transition is supported by advancements in model efficiency. J.T. Licardo details how parameter-efficient fine-tuning and post-training quantization allow organizations to tailor models for specific business tasks without necessitating expensive, high-end GPU clusters. This capability ensures that enterprises can achieve high-performance results while significantly curbing their recurring cloud expenditures.
The assumption that specialized GPU hardware was a barrier to AI deployment was wrong. In the research paper Empirical Analysis of Small Language Model Quantization for CPU-only Cloud Infrastructure, Jessica Sciammarelli uses a quantitative analysis to show that aggressive 4-bit integer quantization allows for real-time inference on standard CPU-based architecture. By maintaining a minimal RAM footprint, these models fall into the “sweet spot” of cost and performance. They perform reliably on existing legacy data center assets, allowing firms to bypass the rising rental costs of specialized HBM-based AI chips. The adoption of these strategies enables a leaner, more sustainable technical infrastructure that prioritizes practical utility over excessive raw computational power.
This has fundamentally altered the economics of enterprise inference, moving the industry toward a landscape where processing costs are no longer tethered to proprietary API pricing. In his 2026 paper, The End of the Foundation Model Era, J.J. Grogan notes that the traditional economic moats of proprietary providers—centered on the immense capital intensity of training frontier models—has largely eroded as open-weight alternatives achieve performance parity. Grogan further argues that as businesses increasingly adopt these smaller, specialized models, the market value is shifting toward domain-specific expertise and the capacity to warrant autonomous, locally-hosted outputs.
This is a vulnerability for hyperscalers, who risk falling into a commodity trap. In the working paper The Emerging Market for Intelligence: Pricing, Supply, and Demand for LLMs, published by NBER in 2026, Mert Demirer finds that open-source models are now approximately 90% cheaper than proprietary counterparts of comparable capability. Without value-added software and services, hosting is a brutally competitive commodity business. This dramatic price disparity, as C. Li details in the Berkeley Center for Management of Technology report The Coming Disruption, forces hyperscalers into a precarious position where they must continue to justify massive capital expenditures on data centers while their primary customers increasingly migrate to more cost-effective, self-hosted deployments.
The economic implications extend to the sustainability of the hardware ecosystem itself. Servaas Storm asserts that the current industry obsession with scaling parameters is hitting diminishing returns, leaving large infrastructure providers to support low-margin, high-volume inference workloads on expensive, specialized hardware.
Ultimately, the shift toward open-weight architectures is also driven by a requirement for data sovereignty. In their 2025 research, Performance Evaluation of Popular Open-Source Large Language Models in Healthcare, Saif Khairat and colleagues highlight that local deployment provides the requisite security for sensitive information, a factor that renders proprietary, cloud-only models untenable for many regulated industries. As Christian Catalini observes in Some Simple Economics of AGI, the future of the field is less about the sheer size of the model and more about the integration of verifiable data, finite human verficiation, and local compute efficiency, signaling a long-term recalibration of how value is captured in the AI technology stack. Similar to the thesis in my recent post Bot Shepherds, Catalini points out that the biggest problem with AI is going to be designing intent, and the verification and oversight of agentic decisions, which will ironically create a lot of jobs.
These trends are converging toward a set of interesting conclusions:
- Proprietary LLMs may not have much of a moat.
- Compute is a race to the bottom, and the bottom is a computer in your office or the servers you already have.
- Large, foundational models are less important than we assumed.
- Practical efficiency innovations will rapidly drive down the cost (and ROI) of applied AI.
We’ve been betting on T-Rex when a murmation of starlings are forming beautiful waves on the horizon.