Breaking Through GPU Memory Wall (NVIDIA + VAST Data)

Category: Memory & HBM · Duration: 46 min

Speakers: Anat Heifetz (VAST Data) · Dr. Vikram Sharma (NVIDIA)

Segments (14)

00:00:01 · Introduction: The GPU Memory Wall
- The speaker introduces the GPU memory wall as a primary bottleneck for scaling AI, shifting focus from compute to memory management.
00:00:41 · Speaker Introductions
- Anat Heifetz from VAST Data and Dr. Vikram Sharma from NVIDIA are introduced, highlighting their roles in AI architecture and research.
00:01:50 · Presentation Outline
- The presentation is structured around three areas: the state of AI inference, current practical solutions, and next-generation architecture (CMX).
00:03:04 · Inference State-of-the-Art & Agentic AI
- The shift from single-prompt chatbots to multi-step, reasoning-based agentic AI workflows is discussed, emphasizing the need for persistent context.
00:05:07 · KV Cache: Basics and Challenges
- The prefill and decode stages of an inference pipeline are explained, along with how KV caching works and how the time to compute the cache grows sharply with context length.
00:06:19 · Inference Context as the New Bottleneck
- The presentation explains how inference context is the new bottleneck and introduces the concept of a context memory hierarchy.
00:07:35 · NVIDIA & VAST Collaboration: Accelerating Inference
- The collaboration between NVIDIA and VAST is detailed, focusing on the NVIDIA Dynamo framework for distributed inference at scale.
00:09:50 · VAST Contributions to Dynamo
- The integration of VAST’s data services into the Dynamo architecture is explained, highlighting performance gains and enterprise features.
00:12:36 · Performance Results and Savings
- Experimental results are presented, showing a 20x speedup in Time-to-First-Token (TTFT) and 90% savings in GPU time by using VAST’s KV Cache solution.
00:15:51 · Data Reduction and Security
- The benefits of 1.4x data reduction on KV cache are outlined, along with the critical security considerations for offloading sensitive context data.
00:17:38 · Designing for KV Acceleration & Introducing CMX
- The need for a new storage tier is established, leading to the introduction of NVIDIA’s Context Memory Storage (CMX) platform, powered by BlueField-4.
00:24:13 · CMX Architecture with VAST
- The architecture of CMX with VAST is detailed, showing how VAST’s DASE architecture and BlueField-4 DPUs enable a highly efficient, scalable solution.
00:27:47 · VAST CMX KV$ Sizing Guidance
- Practical sizing guidance is provided for different experience tiers, from instant resume to full agentic memory, showing capacity needs from terabytes to petabytes.
00:29:45 · Wrapping Up & Q&A
- The key takeaways are summarized, followed by a question and answer session with the audience.
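
The prefill and decode stages covered in the 00:05:07 segment can be illustrated with a toy loop that counts key/value projections. This is a minimal sketch, not the speakers' implementation and not an NVIDIA Dynamo API: the `ToyKVCache` class, the `project` stand-in, and the counting helper are hypothetical names chosen for illustration.

```python
# Toy illustration of why a KV cache avoids recomputation during decode.
# All names are hypothetical; `project` stands in for the key/value
# projections a real transformer layer would compute.

class ToyKVCache:
    def __init__(self):
        self.keys = []        # one entry per token already processed
        self.values = []
        self.projections = 0  # how many K/V projections were computed

    def project(self, token):
        # Stand-in for computing a (key, value) pair for one token.
        self.projections += 1
        return ("k", token), ("v", token)

    def prefill(self, prompt_tokens):
        # Prefill: compute K/V for every prompt token once and store them.
        for tok in prompt_tokens:
            k, v = self.project(tok)
            self.keys.append(k)
            self.values.append(v)

    def decode_step(self, new_token):
        # Decode: only the newest token needs a fresh K/V projection.
        k, v = self.project(new_token)
        self.keys.append(k)
        self.values.append(v)
        return len(self.keys)  # context length now held in the cache


def projections_without_cache(prompt_len, new_tokens):
    # Without a cache, each decode step recomputes K/V for the whole context.
    total = 0
    for step in range(1, new_tokens + 1):
        total += prompt_len + step
    return total


cache = ToyKVCache()
cache.prefill(range(1000))       # 1,000-token prompt
for t in range(1000, 1010):      # generate 10 tokens
    cache.decode_step(t)

with_cache = cache.projections
without_cache = projections_without_cache(1000, 10)
print(with_cache, without_cache)
```

The gap between the two counts widens as the prompt grows, which is the motivation for keeping the cache around (and, once GPU memory runs out, for offloading it rather than discarding it).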

Specific Prices (1)

Timestamp	Item	Value	Context
00:14:37	Tokens per Dollar	60%-130% More	A projection of achieving 60-130% more tokens per dollar with VAST KV cache acceleration in a real-world deployment.
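
As a quick check on what the tokens-per-dollar projection means for cost, the gain can be converted into the equivalent drop in cost per token. This is plain arithmetic on the quoted 60%-130% range; the function name is hypothetical and no pricing data from the talk is assumed.

```python
# Convert "X% more tokens per dollar" into the equivalent drop in cost per token.
def cost_per_token_reduction(tokens_per_dollar_gain):
    # The gain is a fraction: 0.60 means 60% more tokens per dollar.
    return 1.0 - 1.0 / (1.0 + tokens_per_dollar_gain)

low = cost_per_token_reduction(0.60)
high = cost_per_token_reduction(1.30)
print(f"{low:.1%} to {high:.1%} lower cost per token")
```

Under this reading, the projected range corresponds to roughly 37.5% to 56.5% lower cost per token.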

Memory Facts (4)

[00:00:06] The GPU memory wall is a significant challenge in scaling AI.
[00:04:51] KV cache is becoming a long-lived AI memory.
[00:05:44] A 125,000 token context length requires 64 GB of KV Cache memory.
- 125,000 tokens, 64 GB
[00:27:57] Sizing guidance: 10k users at 32 GB of KV cache per conversation require 48 PB for full ‘Agentic Memory’.
- 10k users, 32 GB, 48 PB
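
The figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 10**9 bytes), which the talk summary does not specify. It also only shows what the numbers imply: the summary does not say how the 48 PB figure is derived, so the multiple computed here is an observation, not the speakers' sizing formula.

```python
# Arithmetic on the quoted figures, assuming decimal units (1 GB = 10**9 bytes).
GB = 10**9
TB = 10**12
PB = 10**15

# 125,000 tokens of context -> 64 GB of KV cache.
bytes_per_token = 64 * GB / 125_000

# 10k users x 32 GB per conversation, compared with the 48 PB guidance.
one_conversation_each = 10_000 * 32 * GB
implied_multiple = 48 * PB / one_conversation_each

print(f"{bytes_per_token / 1000:.0f} KB of KV cache per token")
print(f"{one_conversation_each / TB:.0f} TB for one conversation per user")
print(f"48 PB is {implied_multiple:.0f}x that amount")
```

Holding a single 32 GB conversation for each of the 10k users would take 320 TB, so the 48 PB guidance implies about 150 times that amount retained for the full tier.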

Bottleneck Claims (3)

[00:00:12] The primary bottleneck for scaling AI is shifting from compute to memory management.
- Evidence: The speaker states that as AI becomes more complex and reasoning-based, the context of the conversation becomes as important as the model itself, stressing memory systems.
[00:06:23] Inference context is the new bottleneck in AI systems.
- Evidence: The context is large, dynamic, and must be shared across GPUs and nodes. Local memory is limited, and scaling traditional storage for this purpose is inefficient and costly.
[00:17:57] Traditional storage architectures become the bottleneck for throughput when dealing with gigascale context.
- Evidence: The latency of traditional storage delays the critical Time-to-First-Token (TTFT), and solving the speed problem with standard hardware costs too much in money, power, and space.

Predictions (2)

[00:17:47, Next-generation] Gigascale context requires a fundamental leap in both speed and economics beyond what current architectures offer.
[00:27:21, Future] NVIDIA Dynamo APIs will evolve to instruct memory systems directly, through shared services, to handle different datasets in different ways.

Key Technologies (9)

GPU (Graphics Processing Unit): The core processor for accelerating AI computations.
LLM (Large Language Model): The type of AI model being discussed, which requires large amounts of memory for its context.
KV Cache: A memory cache that stores the key and value states of previous tokens in an AI model to avoid re-computation and speed up inference.
NVIDIA Dynamo: A highly efficient, production-grade open-source framework with a modular design for distributed inference at scale.
VAST Data Platform: A disaggregated, shared-everything (DASE) data platform used for storing and accelerating access to the KV Cache.
CMX (Context Memory Storage): A new, AI-native, pod-level storage tier purpose-built for inference context and KV cache management, designed to reduce TCO and improve performance.
NVIDIA BlueField-4 DPU: A Data Processing Unit that powers the CMX platform, providing networking, compute, and storage processing capabilities to offload the host CPU/GPU.
NVIDIA Spectrum-X Ethernet: A networking platform that provides predictable, low-latency, high-bandwidth connectivity for AI workloads, connecting CMX trays.
NVIDIA DOCA: An SDK that provides software capabilities for connecting and interfacing with the inference infrastructure, including key-value APIs for CMX.
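
The idea of a context memory hierarchy, where KV cache blocks stay in a small fast tier and spill to a larger shared tier instead of being discarded, can be sketched as follows. This is an illustrative assumption, not the NVIDIA Dynamo, DOCA, or VAST API: the `TieredKVStore` class, the tier names, the capacity, and the LRU eviction policy are all hypothetical.

```python
from collections import OrderedDict


class TieredKVStore:
    """Hypothetical two-tier store: a small fast tier backed by a shared tier."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()  # stands in for GPU/host memory (LRU order)
        self.shared = {}           # stands in for a shared storage tier
        self.fast_capacity = fast_capacity

    def put(self, block_id, kv_block):
        self.fast[block_id] = kv_block
        self.fast.move_to_end(block_id)
        # Evict the least recently used block to the shared tier when full.
        while len(self.fast) > self.fast_capacity:
            old_id, old_block = self.fast.popitem(last=False)
            self.shared[old_id] = old_block

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id], "fast"
        if block_id in self.shared:
            # Promote the block back to the fast tier instead of recomputing it.
            block = self.shared.pop(block_id)
            self.put(block_id, block)
            return block, "shared"
        return None, "miss"  # the caller would have to recompute (prefill)


store = TieredKVStore(fast_capacity=2)
for i in range(3):
    store.put(f"session-{i}", f"kv-{i}")

hit_fast = store.get("session-2")
hit_shared = store.get("session-0")
miss = store.get("session-9")
print(hit_fast, hit_shared, miss)
```

Only the "miss" case forces a fresh prefill; a hit in the shared tier trades that recomputation for a transfer, which is the trade-off the talk attributes to KV cache offloading.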

Companies Mentioned (5)

NVIDIA · VAST Data · OpenAI · Llama Stack · Cisco

Notable Quotes (4)

The bottleneck is no longer just compute. It is how we manage memory. — Anat Heifetz @ 00:00:12

Inference context itself is becoming the key bottleneck, and not the primary compute. — Dr. Vikram Sharma @ 00:06:23

We’re not making the GPU faster… but we’re making it available more often, turning the storage into a compute force multiplier. — Anat Heifetz @ 00:15:41

Why are we working with VAST, right? So you are motivated to work with VAST. — Dr. Vikram Sharma @ 00:35:28

Key Topics

GPU Memory Wall · AI Inference Optimization · KV Cache Management · Agentic AI · Distributed Inference Systems · Data Storage Architecture · Total Cost of Ownership (TCO) Reduction · NVIDIA Dynamo · VAST Data Platform · Context Memory Storage (CMX)

Takeaways

The primary bottleneck in scaling modern, reasoning-based AI is shifting from raw compute power to memory management, specifically handling the large context required for agentic workflows.
Offloading the KV Cache from GPU memory to a specialized, high-speed storage tier is a critical strategy to overcome the GPU memory wall.
The collaboration between NVIDIA (with Dynamo and CMX) and VAST Data (with its DASE architecture) provides a solution that can accelerate Time-to-First-Token (TTFT) by up to 20x and save 90% of GPU time.
The Context Memory Storage (CMX) architecture, powered by BlueField-4 DPUs, adds a power-efficient storage tier that reduces total cost of ownership (TCO) by cutting power consumption and physical rack space by up to 75%.
Because the design turns the storage I/O bottleneck into a network-bound problem, system performance can scale directly with network bandwidth.
Because the KV cache contains sensitive data, moving it off the GPU requires robust enterprise data services, including encryption, multi-tenancy security, and audit trails, which the VAST platform provides.
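
To make the network-bound takeaway concrete, an idealized transfer time for a KV cache is simply its size divided by link bandwidth. This is a rough sketch: it ignores protocol overhead, latency, and parallel links, and the link speeds are illustrative assumptions, not figures from the talk. Only the 64 GB size comes from the session's 125,000-token example.

```python
# Idealized transfer time for a KV cache over a network link.
# Ignores protocol overhead, latency, and parallel links; the link speeds
# are illustrative assumptions, not figures from the talk.

def transfer_seconds(size_gb, link_gbit_per_s):
    size_gbit = size_gb * 8  # gigabytes -> gigabits
    return size_gbit / link_gbit_per_s


kv_cache_gb = 64  # the 125,000-token example quoted in the session
times = {link: transfer_seconds(kv_cache_gb, link) for link in (200, 400, 800)}
for link, seconds in times.items():
    print(f"{link} Gb/s link: {seconds:.2f} s")
```

In this simplified model, doubling the link speed halves the load time, which is what scaling with network bandwidth means in practice.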