AI Inference Cost Fell 1,000x in 3 Years

Category: Pricing & Economics · Duration: 17 min

Speakers: Female Co-host · Marco

Segments (13)

00:00:05 · The 1,000x Cost Collapse
- The cost of AI inference has dropped roughly 1,000-fold in three years, a decline faster than Moore’s law. Marco calls the inference-cost curve the most important chart in technology.
00:00:48 · Three Forces Driving the Collapse
- The cost collapse is driven by three converging forces: dramatic improvements in model efficiency (like Mixture of Experts), fierce price competition among API providers, and hardware specialization for inference.
00:03:09 · Market Shift: Training to Inference
- The AI compute market is structurally shifting from training-dominated to inference-dominated, with inference projected to account for two-thirds of all AI compute by 2026.
00:03:38 · The Rise of Open Source
- The open-source AI market is exploding, with enterprise deployment of open-weights models jumping from 23% to 67%, further commoditizing the model layer.
00:04:20 · Winners & Losers: The Asia Advantage
- The cost collapse democratizes AI, disproportionately benefiting Asia, the fastest-growing AI inference market, by enabling small and medium businesses to deploy AI at scale.
00:05:38 · Sovereign AI and the New Threshold
- Cheap inference makes sovereign AI initiatives in countries like Singapore, India, and Japan viable without needing massive, expensive training clusters.
00:06:13 · NVIDIA’s Strategic Dilemma
- NVIDIA’s dominance, built on training-optimized GPUs, is threatened by the shift to inference, forcing the company to hedge by licensing inference-specific LPU technology from Groq.
00:07:27 · Hyperscalers and Custom Chips
- Hyperscalers like Amazon, Google, and Microsoft are also developing their own custom chips designed with inference efficiency as a primary goal, recognizing that’s where ongoing revenue lies.
00:07:53 · The Paradox: Costs Drop, Bills Rise
- Despite plummeting per-unit costs, total enterprise AI bills are rising because usage is exploding faster than costs are declining, as companies deploy AI everywhere.
00:08:44 · Contrarian Corner: Is Cheap Inference Bad for AI Companies?
- The contrarian view is that the cost collapse is terrible for most AI model providers: it drives commoditization and a race to the bottom on pricing, squeezing them on two fronts, price competition and volume lost to open source.
00:10:27 · The ROI Problem: AI for AI’s Sake
- Despite massive spending and usage, fewer than 1% of companies report significant ROI from AI, suggesting most current deployment is ineffective experimentation (‘AI for AI’s sake’).
00:11:57 · Value Capture Shifts to the Application Layer
- As the infrastructure and model layers commoditize, the real value will be captured by companies at the application layer that build specific, measurable business outcomes on top of cheap inference.
00:13:00 · Action Items for Investors
- Investors should track the inference-to-training ratio, monitor open-source deployment rates vs. proprietary API revenue, and focus on application-layer companies in Asia building measurable business outcomes.

Specific Prices (11)

Timestamp	Item	Value	Context
00:00:17	GPT-4 level query (early 2023)	$400 per million tokens	Cost of running a query at GPT-4 level performance in early 2023.
00:00:24	GPT-4 level query (March 2026)	40 cents per million tokens	Cost of running a query at the same performance level today (March 2026).
00:01:04	DeepSeek V3	14 cents/M input tokens, 28 cents/M output tokens	Pricing for DeepSeek’s V3 model, which is roughly 20 times cheaper than GPT-4 at launch.
00:01:27	Anthropic’s Claude Opus 4.1	$15 per million input tokens	Previous price before a major price cut.
00:01:32	Anthropic’s Claude Opus 4.6	$5 per million input tokens	New price, representing a 67% drop from the previous version.
00:01:35	Google’s Gemini 2.5 Pro	$1.25 per million input tokens	Pricing for Google’s premium model.
00:01:42	Gemini Flash-Lite	10 cents per million tokens	Pricing for Google’s lightweight model.
00:01:44	Claude Haiku	25 cents per million tokens	Pricing for Anthropic’s lightweight model, approaching database query costs.
00:02:03	NVIDIA license from Groq	$20 billion	Amount NVIDIA paid to license LPU technology from the startup Groq.
00:03:22	AI Inference Market Size (2026)	>$50 billion	Projected size of the inference market in 2026.
00:09:48	OpenAI Annual Revenue (2025)	~$4 billion	Reported annual revenue for OpenAI in 2025.
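
The headline figure can be checked against the first two rows of the table. The sketch below is illustrative arithmetic only: the two prices and the Opus price cut are the quoted values, the three-year span follows the episode title, and the Moore's law baseline is assumed here to mean a doubling every two years.

```python
# Illustrative arithmetic on the quoted prices; not an official pricing model.

PRICE_EARLY_2023 = 400.00  # USD per million tokens, as quoted for early 2023
PRICE_MARCH_2026 = 0.40    # USD per million tokens, as quoted for March 2026
YEARS = 3.0                # assumption: the "three years" in the episode title


def fold_reduction(old_price: float, new_price: float) -> float:
    """How many times cheaper the new price is than the old one."""
    return old_price / new_price


def annualized_factor(total_factor: float, years: float) -> float:
    """Constant yearly factor that compounds to total_factor over the period."""
    return total_factor ** (1.0 / years)


def percent_drop(old_price: float, new_price: float) -> float:
    """Price cut expressed as a percentage of the old price."""
    return (old_price - new_price) / old_price * 100.0


total = fold_reduction(PRICE_EARLY_2023, PRICE_MARCH_2026)
per_year = annualized_factor(total, YEARS)

# Assumed baseline: Moore's law as a 2x improvement every two years.
moore_over_period = 2.0 ** (YEARS / 2.0)

opus_cut = percent_drop(15.00, 5.00)  # quoted Opus input prices

print(f"total decline: {total:,.0f}x, about {per_year:.1f}x per year")
print(f"Moore's law baseline over the same period: {moore_over_period:.2f}x")
print(f"Opus input price cut: {opus_cut:.1f}%")
```

Under these assumptions the quoted prices imply a decline of about 10x per year, against about 2.8x over three years for the Moore's law baseline.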

Predictions (4)

[00:03:13, 2026] Inference workloads will account for two-thirds of all AI compute.
[00:03:21, 2026] The inference market will exceed $50 billion.
[00:04:30, Through 2035] The Asia Pacific AI inference market will have a compound annual growth rate of 24.7%.
[00:11:43, Unspecified future] AI will see a cycle similar to cloud computing: an initial phase of universal deployment driven by cheap cost, followed by a rationalization phase where companies cut workloads that don’t generate ROI.
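
To get a feel for what a 24.7% compound annual growth rate implies, the sketch below compounds it forward. The source gives only the end year, so the 2026 base year is an assumption, and the function names are hypothetical.

```python
import math

APAC_CAGR = 0.247  # quoted compound annual growth rate
BASE_YEAR = 2026   # assumption: the source does not state the base year
END_YEAR = 2035    # quoted horizon


def growth_multiple(rate: float, years: int) -> float:
    """Cumulative growth multiple after compounding a constant annual rate."""
    return (1.0 + rate) ** years


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


multiple = growth_multiple(APAC_CAGR, END_YEAR - BASE_YEAR)
years_to_double = doubling_time(APAC_CAGR)

print(f"growth multiple {BASE_YEAR}-{END_YEAR}: {multiple:.1f}x")
print(f"doubling time: {years_to_double:.1f} years")
```

With a 2026 base year, the rate compounds to a market roughly seven times larger by 2035, doubling about every three years. A different base year changes the multiple but not the doubling time.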

Companies Mentioned (27)

OpenAI · DeepSeek · Anthropic · Google · NVIDIA · Groq · TrendForce · Cerebras · AWS (Amazon Web Services) · Deloitte · Alibaba · Meta · Tencent · Samsung · Grab · Reliance Jio · AMD · Intel · SambaNova · GoPay · Paytm · GCash · Shopee · Lazada · Tokopedia · TikTok · ByteDance

Notable Quotes (6)

And I believe this single chart, inference cost over time, is the most important chart in technology right now. — Marco @ 00:00:37

So NVIDIA is essentially admitting that GPUs are not the optimal architecture for inference. — Marco @ 00:02:33

It means larger AI budgets spent on orders of magnitude more AI usage. — Marco @ 00:08:37

The contrarian position is that the inference cost collapse is actually terrible for most AI companies, and potentially for the AI industry as a whole. — Marco @ 00:09:00

Most enterprise AI usage today is what I would call AI for the sake of AI. — Marco @ 00:11:07

If the inference cost collapse leads to commoditization of the inference layer, the value in the AI stack shifts. It moves away from the model layer… and away from the chip layer… and toward the application layer. — Marco @ 00:12:00

Key Topics

AI inference cost · AI economics · GPU vs ASIC · inference hardware · open-source AI models · AI commoditization · AI value capture · application layer AI · sovereign AI · NVIDIA strategy · Asian tech market · AI ROI · training vs inference compute

Takeaways

The cost of AI inference has collapsed by 1,000x in three years, a rate faster than Moore’s Law, driven by model efficiency, hardware specialization, and intense price competition.
The AI compute market is structurally shifting from being training-dominated to inference-dominated, with inference projected to account for two-thirds of all AI compute by 2026.
This shift threatens NVIDIA’s training-optimized GPU business, forcing it to invest heavily ($20B to license Groq’s LPU technology) in specialized inference hardware as a hedge.
Value in the AI stack is moving away from the commoditizing infrastructure (chips, APIs) and model layers, and toward the application layer where companies with domain expertise and distribution can build businesses with measurable ROI.
The cost collapse is democratizing AI, enabling massive adoption in emerging markets, particularly in Asia, which is the fastest-growing region for AI inference.
Open-source models are rapidly gaining enterprise adoption (from 23% to 67% in a year), further commoditizing the model layer and putting pressure on proprietary API providers like OpenAI and Anthropic.
A paradox exists: while per-unit inference costs are plummeting, total enterprise AI spending is rising because usage is growing faster than prices are falling. Yet fewer than 1% of companies report significant ROI from AI, suggesting a phase of widespread experimentation before a likely rationalization.
The winners of the AI era may not be the chip or model makers, but the application-layer companies that effectively use cheap inference to solve specific business problems, especially those with deep local market knowledge in regions like Asia.
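
The paradox in the takeaways comes down to one multiplication: bill = unit price × usage. The sketch below uses entirely hypothetical numbers (a 10x yearly price decline and a 15x yearly usage increase, neither taken from the episode) to show how a bill can rise while the unit price collapses. `project_bills` is an illustrative helper, not something described in the source.

```python
def project_bills(start_price, start_usage, price_decline, usage_growth, years):
    """Project yearly bills when price falls and usage grows by constant factors.

    start_price   -- USD per million tokens in year 0
    start_usage   -- millions of tokens consumed in year 0
    price_decline -- factor by which the unit price is divided each year
    usage_growth  -- factor by which usage is multiplied each year
    """
    rows = []
    price, usage = start_price, start_usage
    for year in range(years + 1):
        rows.append({"year": year, "price": price, "usage": usage,
                     "bill": price * usage})
        price /= price_decline
        usage *= usage_growth
    return rows


# Hypothetical inputs chosen only to illustrate the mechanism.
rows = project_bills(start_price=4.00, start_usage=1_000,
                     price_decline=10, usage_growth=15, years=3)

for row in rows:
    print(f"year {row['year']}: ${row['price']:.4f}/M tokens x "
          f"{row['usage']:,.0f}M tokens = ${row['bill']:,.0f}")
```

With these made-up rates the unit price falls 10x a year while the bill still grows 1.5x a year, because 15 / 10 = 1.5. The bill shrinks only once usage growth drops below the rate of price decline.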