The end of free tokens

May 18, 2026

Around 2018 I took an UberPool across Manhattan for $1.78. Cheaper than the subway. Today the same ride is about $20 and nobody finds that surprising. The rides were never $1.78. The difference came out of somebody else's wallet, and eventually that somebody wanted it back.

The same thing is coming for tokens. Last month ccusage estimated my Claude usage at $5,098.85 in API list-price terms. I paid Anthropic $200. Maybe API list price isn't Anthropic's real cost. Maybe ccusage over-counts cache reads. Fine. Divide by four. The gap is still embarrassing.

The obvious objection: we've been told the AI bill was about to land for two years now and it hasn't. That is fair. Frontier model output prices are down roughly 95% since GPT-4 launched, and the deflation engine has been doing real work. What changed in 2026 is that three things are pushing the other way at once.

First, the rate cards have stopped falling at the top. Anthropic's Opus tier has been $5/$25 per million tokens for three releases in a row (4.5, 4.6, 4.7), and Opus 4.7 ships with a new tokenizer that produces up to 35% more tokens for the same input. Same rate card, bigger invoice. Second, supply is behind demand. Anthropic tightened Pro and Max limits all spring, then signed a 300-megawatt compute deal with SpaceX in May, doubled Code limits, and is still triaging. The largest infrastructure buildout in the history of tech is not keeping up: agents outrun silicon. Third, the labs need presentable unit economics. Anthropic is reportedly raising $50 billion at a $900 billion valuation with an IPO as early as October; OpenAI's bankers are pricing a $1 trillion listing. Five-figure subsidies to power users do not survive a roadshow.

So what now? There's some good news; open-weight models are getting better quickly. The Chinese open frontier (DeepSeek, Qwen, GLM, Kimi) is good enough that the Stanford AI Index now puts the top US-Chinese model gap at 2.7%, down from 17+ two years ago. Nvidia Nemotron is even more open than that and is available today. And some of these models are starting to perform really nicely. A Mac M4 with 24GB of RAM makes any of these models in the 32B parameter range fly.

The real solution though is using an agent harness that can mix all three modes (closed-hosted, open-hosted, open-local) on demand and route per request. When you need the frontier model power, pay for it, when a task can be done locally, don't. Hosted open weight models for the cases in between. Goose gives you this today. The model is the bit that gets squeezed. The agent layer is where your workflows, permissions, and memory should live.

The end of free tokens isn't the end of accessible AI. It is the end of mistaking the onboarding flow for the architecture. The $1.78 UberPool ride was real. So was the bill that arrived after. Free tokens were never the model.