Infer Summit

Inference at Scale

One day. Three Pillars. Real Engineering. 

Let’s build the future of inference together.

Pinterest
NVIDIA
Baseten
Databricks

What You'll Take Away

Lessons from running inference at scale

Emerging trends in the inference stack

New approaches for your stack

Hear From Experts

HOSTS

Hien Luu

Khawaja Shams

Salina Wu

Suman Tatiraju

Ying Chen

Philip Kiely

Emilio Andere

Meryem Arik

Chenyang Zhao

Daniel Svonava

Sinan Ozdemir

Sessions

Salina Wu

Engines

Building Pinterest’s VLM Serving Stack on NVIDIA Dynamo

Vision-language models introduce new inference challenges beyond text-only workloads: large multimodal payloads, expensive image-driven prefills, and heavy KV cache pressure from multi-turn interactions. In this talk, Pinterest shares how it built its VLM serving platform using NVIDIA Dynamo, vLLM, AWS EKS, and Blackwell GPUs to support production workloads such as the Pinterest Assistant and Multimodal Reranker. We’ll cover the key optimizations that make multimodal inference efficient at scale, including KV-aware routing, prefill/decode disaggregation, tiered KV cache offloading, and realistic benchmarking with AIPerf.

Key Takeaways

  • Why VLM workloads are uniquely prefill-heavy and cache-sensitive
  • How KV-aware routing, E/PD disaggregation, and cache offloading improve efficiency at scale
  • How Pinterest benchmarks complex multimodal and agentic workloads using AIPerf

Suman Tatiraju

Engines

How Dynamo Accelerates Agent Execution

Agentic workloads introduce new challenges for inference systems, including tool-call stalls, inefficient scheduling, and unpredictable KV cache behavior. In this talk, NVIDIA shares how Dynamo optimizes long-running agent workflows by turning execution traces into structured performance data that can be analyzed and replayed to identify bottlenecks. We’ll explore how agentic routing, workload hints, and programmatic KV cache management work together to improve efficiency, utilization, and responsiveness for production agent systems.

Key Takeaways

  • How execution tracing helps identify bottlenecks in agent workflows
  • How agentic routing and workload hints improve scheduling efficiency
  • How programmatic KV cache management reduces latency and improves utilization

Philip Kiely

Architectures

Inference Engineering for Product Differentiation

Increasingly, AI-native builders are turning to open-source, fine-tuned, and custom-trained AI models to build new product capabilities and achieve sustainable, competitive unit economics. Delivering these models requires a new capability: inference engineering.

Key Takeaways

  • Trading off effectively between cost, quality, and latency
  • Navigating the intersection between training and inference
  • Understanding optimization techniques for SOTA performance on frontier models

Ying Chen

Architectures

Building Reliable LLM Serving Infrastructure at Scale

LLM inference infrastructure must handle variable demand across concurrent workloads at scale. This talk covers autoscaling, load balancing, and recovery from silent failures without degrading performance.

Key Takeaways

  • Capacity and autoscaling under variable demand
  • Routing and load balancing across heterogeneous workloads
  • Engine-agnostic recovery from silent runtime failures

Khawaja Shams

Architectures

You Already Know More About Inference Than You Think

Inference can feel like a wall of new vocabulary: KV cache, prefill, decode, tensor parallelism, speculative decoding. But once you look past the names, the shape is familiar. It is routing, scheduling, memory pressure, cache eviction, noisy neighbors, tail latency, and state movement.

This talk is an invitation for systems engineers to step into the inference stack with confidence. We’ll build a practical mental model for how requests move through modern LLM serving systems, where performance gets lost, and why utilization is often governed by state rather than compute. You’ll leave with the anatomy of the stack, the vocabulary to reason about it, and the intuition to start asking sharper questions.

Key Takeaways

  • A practical map of the inference stack
  • The vocabulary behind prefill, decode, KV cache, and routing
  • Why KV cache becomes a capacity bottleneck
  • How familiar systems instincts apply to modern inference
  • When advanced techniques like disaggregation help, and when they just move the bottleneck

Abi Aryan

Architectures

Reliability Engineering for Inference Serving

Inference is where AI systems meet reality. As agentic workloads consume more compute and GPU costs continue to rise, reliability now extends beyond uptime to include latency, efficiency, and operational resilience.

Key Takeaways

  • New reliability principles for AI and inference workloads
  • Balancing cost, latency, and performance in production
  • Common failure modes and lessons from operating AI infrastructure at scale

Emilio Andere

Engines

State-of-the-Art LLM Inference on AMD

Learn how Wafer achieved state-of-the-art Qwen3.5-397B inference performance on AMD MI355X through custom MoE kernels, expert fusion, and caching optimizations.

Key Takeaways

  • Optimizing AMD GPUs for frontier-model inference
  • Eliminating MoE performance bottlenecks
  • Reducing TTFT with smarter caching
  • Balancing throughput, latency, and determinism
  • Lessons from production-scale deployments

Chenyang Zhao

Engines

SGLang Omni: Serving Multi-Stage Generative Models by Decode-Time Compute Characteristics

As multimodal and agentic models evolve beyond a single decoding loop, inference systems must adapt. This talk explores the principles behind multi-stage decoding and the architecture powering SGLang Omni.

Key Takeaways

  • Why multi-stage decoding matters more than modality
  • Scheduling compute-bound and latency-sensitive stages independently
  • Managing cross-stage memory contention efficiently
  • Reducing latency with tightly coupled stage execution
  • Design patterns for next-generation multimodal inference systems

Meryem Arik

Architectures

Inference for Async Agents in Production

As AI agents take on longer, multi-step workflows, inference costs and scaling challenges grow quickly. This talk explores practical techniques for building high-performance async agents while keeping costs under control.

Key Takeaways

  • Reducing token and inference costs
  • Effective context engineering and compaction
  • Cache management for long-running agents
  • Model routing and batching strategies
  • Scaling async agent workloads efficiently

Daniel Svonava

Operations

Should You Self-Host Inference?

Small models now match GPT-5.1 in capability and are relatively easy to run. By self-hosting, you can cut costs by 20x or more, gain control of your AI stack, and even bundle inference directly into your software product instead of treating it as a third-party dependency.

Key Takeaways

  • Check benchmarks to understand how capable sub-40B-parameter models are today
  • Compare open-source inference products for self-hosting at scale
  • Build a fully self-contained agent that runs on six different models hosted in one cluster

Sinan Ozdemir

Architectures

The Cheapest Token Is the One You Never Generate

Most inference optimization happens on the serving side: writing better kernels, using smarter caching, running faster hardware. Yet the biggest cost and latency wins tend to come from the model side: deciding which tokens never need to be generated by a large model in the first place. This talk covers practical techniques for right-sizing the model to maximize token ROI: distilling frontier-model behavior into small models you can afford to serve at scale, controlling reasoning budgets so models don’t think longer than the problem deserves, and routing between small and frontier models in real time.

Key Takeaways

  • Distill, cap, route: three model-side levers for cutting inference cost and latency before serving optimizations begin
  • The real cost math on what each technique actually saves in production, and the failure modes that quietly give the savings back
  • A practical decision framework for when a small model beats the frontier model that taught it

The Three Pillars of Modern Inference

Engines

Inside today’s inference engines: schedulers, KV caches, serving systems, and the code powering them.

Operations

Operating inference at scale: reliability, observability, and performance.

Architectures

Building faster, cheaper, and more efficient inference platforms.

Presented by

Momento