Gemma 4 with Baseten and NVIDIA

Year: 2026 · ▶ Watch on YouTube

Host (Host) · Jay Rodge (Senior Developer Advocate) · Philip Kiley (Head of AI Education)

Switch language → zh

Segments (9)

  • 00:00:04 · Introduction — Host
    • The host introduces his guests, Jay Rodge from NVIDIA and Philip Kiley from Baseten, to discuss AI inference.
  • 00:00:42 · NVIDIA and Google Cloud Partnership Announcements — Jay Rodge
    • Jay announces that Google Cloud will be among the first to offer NVIDIA’s next-gen Vera Rubin hardware and is adding Blackwell RTX Pro 6000 GPUs.
  • 00:01:38 · The Meaning of Inference for Applications — Philip Kiley
    • Philip defines inference as the practical delivery of AI applications, emphasizing the need for low latency, high reliability, and scalability.
  • 00:02:27 · How Baseten Uses the Full Stack — Philip Kiley
    • Philip explains how Baseten leverages the full stack, from Google Cloud’s GKE for multi-region deployments to NVIDIA’s hardware (Hopper, Blackwell) and software (Dynamo) for optimized inference.
  • 00:04:33 · Leveraging Google’s Gemma Models — Philip Kiley
    • Philip discusses Baseten’s day-zero support for Gemma 4, highlighting the value of its multimodality and range of model sizes for fine-tuning.
  • 00:06:06 · Optimizing Inference Performance — Jay Rodge
    • Jay recommends using open-source tools like TensorRT-LLM, NVIDIA Dynamo, and the NvFP4 precision format to achieve the best inference performance on NVIDIA hardware.
  • 00:07:27 · Baseten Platform Demo — Philip Kiley
    • Philip demonstrates deploying and running a multimodal Gemma 4 model on the Baseten platform, showcasing one-click deployment, the playground, and autoscaling metrics.
  • 00:13:21 · The Role of Google Kubernetes Engine (GKE) — Philip Kiley
    • Philip praises GKE’s low-latency networking for complex agentic systems and its flexibility and scale for handling massive AI workloads.
  • 00:15:17 · Inference Engineering Book and Conference Highlights — Philip Kiley
    • Philip introduces his book ‘Inference Engineering’ as a comprehensive guide, and the guests share their excitement for engaging with developers at the conference.

Products Announced (2)

  • 00:00:57 · NVIDIA Vera Rubin (Next-generation hardware)
    • Designed for inference and training
    • Coming to Google Cloud in the later half of the year.
  • 00:01:09 · NVIDIA RTX Pro 6000 (Blackwell) (New GPU)
    • 96GB of VRAM · Allows deploying multiple models on a single GPU
    • Being added to Google Cloud.

Commitments (1)

  • 01:03 (Later this year) — Google Cloud will offer NVIDIA’s Vera Rubin hardware.

Demos (1)

  • 00:07:38 ✓ · Baseten Platform Demo with Gemma 4 — Philip Kiley
    • The demo showed the Baseten UI for deploying a Gemma 4 model from the model library onto an L4 GPU, viewing logs and metrics, and using the playground to run a multimodal inference on a picture of a dog. It also highlighted the autoscaling configuration and metrics dashboard.

Notable Quotes (4)

  • 01:39 — Philip Kiley:

    To me, what inference means is being able to actually deliver on the promise of AI applications.

  • 02:19 — Host:

    Full-stack seating chart.

  • 06:50 — Jay Rodge:

    With just a couple of lines of code, you can get the best performance available for on any kind of NVIDIA hardware.

  • 16:17 — Philip Kiley:

    Inference is not one thing… It’s everything from CUDA to infrastructure, from the on-GPU optimization to all the distributed systems problems, all together in a single stack.

Visual Signals

On-screen (5)

  • 00:10:00 · Google Cloud Next, Live from Vegas
    • Identifies the event and context of the broadcast.
  • 00:48 · Jay Rodge, Senior Developer Advocate, NVIDIA
    • Identifies the speaker, his role, and affiliation.
  • 01:44 · Philip Kiley, Head of AI Education, Baseten
    • Identifies the speaker, his role, and affiliation.
  • 03:32 · baseten.co homepage with the text 'Inference is everything'
    • Shows the branding and core value proposition of the Baseten platform.
  • 15:19 · A green book cover with the title 'INFERENCE ENGINEERING' by Philip Kiley
    • Highlights a resource for developers interested in the topic of the discussion.

Stage (2)

  • 00:09 · A wide shot shows three speakers sitting at a desk in a large conference hall, with microphones and laptops.
  • 18:15 · The three speakers give each other a fist bump to conclude the segment.

Visual demos (1)

  • 07:38 · A screen share of the Baseten platform.
    • The demo starts on the Baseten homepage, moves to a demo workspace, shows a list of deployed models, navigates to the Hugging Face page for Gemma 4, then to the Baseten model library, through the deployment UI, and finally to the deployed model’s dashboard with logs, metrics, and a playground for interaction.

Key Topics

AI Inference · LLM Optimization · GPU Hardware · NVIDIA Blackwell · NVIDIA Vera Rubin · Google Gemma · Baseten · TensorRT-LLM · Full-Stack AI · Model Deployment · Autoscaling · Google Kubernetes Engine (GKE) · Multimodal AI · Developer Experience

Takeaways

  • Google Cloud is deepening its partnership with NVIDIA, being among the first to offer next-gen Vera Rubin hardware and adding Blackwell RTX Pro 6000 GPUs.
  • Inference is a critical, multi-faceted engineering discipline that goes beyond just running a model, encompassing infrastructure, optimization, and distributed systems to ensure reliability and low latency.
  • Platforms like Baseten abstract away the complexity of the full AI stack, allowing developers to deploy, scale, and manage models on top of Google Cloud and NVIDIA hardware with ease.
  • Google’s open-source Gemma models, particularly the new Gemma 4, are valued for their range of sizes and multimodal capabilities, making them ideal for fine-tuning and various application use cases.
  • Optimizing inference is key to performance and cost-efficiency, with open-source tools like NVIDIA’s TensorRT-LLM and Dynamo playing a crucial role.
  • Google Kubernetes Engine (GKE) is a foundational component for scalable AI, providing the flexibility and low-latency networking required for complex, multi-model agentic systems.
  • The AI stack is layered, from Google Cloud’s infrastructure, to NVIDIA’s GPUs and software, to application platforms like Baseten, each providing value at a different level of abstraction.