June 25, 2026
10:00 AM - 4:30 PM PT
Presented Online

Inference at Scale

One Day. Three Pillars. Real Engineering.

Let's build the future of inference together.

What You'll Take Away

Lessons from running inference at scale

Emerging trends in the inference stack

New approaches for your stack

Hear From Experts

Agenda

Inference for Async Agents in Production

As AI agents take on longer, multi-step workflows, inference costs and scaling challenges grow quickly. This talk explores practical techniques for building high-performance async agents while keeping costs under control.

Key Takeaways

Reducing token and inference costs
Effective context engineering and compaction
Cache management for long-running agents
Model routing and batching strategies
Scaling async agent workloads efficiently

Abi Aryan

Architectures

10:45 AM - 11:15 AM PT

Reliability Engineering for Inference Serving

Inference is where AI systems meet reality. As agentic workloads consume more compute and GPU costs continue to rise, reliability now extends beyond uptime to include latency, efficiency, and operational resilience.

Key Takeaways

New reliability principles for AI and inference workloads
Balancing cost, latency, and performance in production
Common failure modes and lessons from operating AI infrastructure at scale

Suman Tatiraju

Engines

11:15 AM - 11:45 AM PT

How Dynamo Accelerates Agent Execution

Agentic workloads introduce new challenges for inference systems, including tool-call stalls, inefficient scheduling, and unpredictable KV cache behavior. NVIDIA shares how Dynamo optimizes long-running agent workflows by turning execution traces into structured performance data.

Key Takeaways

How execution tracing helps identify bottlenecks in agent workflows
How agentic routing and workload hints improve scheduling efficiency
How programmatic KV cache management reduces latency and improves utilization

Khawaja Shams

Architectures

11:45 AM - 12:15 PM PT

You Already Know More About Inference Than You Think

Inference can feel like a wall of new vocabulary: KV cache, prefill, decode, tensor parallelism, speculative decoding. This talk gives systems engineers a practical mental model for how requests move through modern LLM serving systems.

Key Takeaways

A practical map of the inference stack
The vocabulary behind prefill, decode, KV cache, and routing
Why KV cache becomes a capacity bottleneck
How familiar systems instincts apply to modern inference

12:15 PM - 1:00 PM PT

Break for lunch

Philip Kiely

Architectures

1:00 PM - 1:30 PM PT

Inference Engineering for Product Differentiation

AI-native builders are turning to open source, fine-tuned, and custom-trained models to build differentiated product capabilities and sustainable unit economics. Delivering these models requires inference engineering.

Key Takeaways

Trading off effectively between cost, quality, and latency
Navigating the intersection between training and inference
Understanding the model performance techniques behind SOTA results on frontier models

Chenyang Zhao

Engines

1:30 PM - 2:00 PM PT

SGLang Omni: Serving Multi-Stage Generative Models by Decode-Time Compute Characteristics

As multimodal and agentic models evolve beyond a single decoding loop, inference systems must adapt. This talk explores the principles behind multi-stage decoding and the architecture powering SGLang Omni.

Key Takeaways

Why multi-stage decoding matters more than modality
Scheduling compute-bound and latency-sensitive stages independently
Managing cross-stage memory contention efficiently
Reducing latency with tightly coupled stage execution

Emilio Andere

Engines

2:00 PM - 2:30 PM PT

State-of-the-Art LLM Inference on AMD

Learn how Wafer achieved state-of-the-art Qwen3.5-397B inference performance on AMD MI355X through custom MoE kernels, expert fusion, and caching optimizations.

Key Takeaways

Optimizing AMD GPUs for frontier-model inference
Eliminating MoE performance bottlenecks
Reducing TTFT with smarter caching
Balancing throughput, latency, and determinism
Lessons from production-scale deployments

Salina Wu

Cristian Lopez

Engines

2:30 PM - 3:00 PM PT

Building Pinterest's VLM Serving Stack on NVIDIA Dynamo

Pinterest shares how it built its VLM serving platform using NVIDIA Dynamo, vLLM, AWS EKS, and Blackwell GPUs to support workloads such as Pinterest Assistant and Multimodal Reranker.

Key Takeaways

Why VLM workloads are uniquely prefill-heavy and cache-sensitive
How KV-aware routing, E/PD disaggregation, and cache offloading improve efficiency at scale
How Pinterest benchmarks complex multimodal and agentic workloads using AIPerf

Ying Chen

Architectures

3:00 PM - 3:30 PM PT

Building Reliable LLM Serving Infrastructure at Scale

LLM inference infrastructure must handle variable demand across concurrent workloads at scale. This talk covers autoscaling, load balancing, and recovery from silent failures without degrading performance.

Key Takeaways

Capacity and autoscaling under variable demand
Routing and load balancing across heterogeneous workloads
Engine-agnostic recovery from silent runtime failures

Daniel Svonava

Operations

3:30 PM - 4:00 PM PT

Should You Self-Host Inference?

Small models now match frontier-model smarts and are relatively easy to run. Self-hosting them can bring major cost savings, give you control of your AI stack, and let you bundle inference into your software product.

Key Takeaways

Check benchmarks to understand how good sub-40B-parameter models are today
Compare open-source inference products for self-hosting at scale
Build a fully self-contained agent that runs on six different models hosted in one cluster

Sinan Ozdemir

Architectures

4:00 PM - 4:30 PM PT

The Cheapest Token Is the One You Never Generate

The biggest cost and latency wins often come from the model side: deciding which tokens never need to be generated by a large model in the first place. This talk covers distillation, reasoning budgets, and model routing.

Key Takeaways

Distill, cap, route: three model-side levers for cutting inference cost and latency
The real production cost math behind each technique
A practical decision framework for when a small model beats the frontier model that taught it

The Three Pillars of Modern Inference

Engines

Inside today's inference engines: schedulers, KV caches, serving systems, and the code powering them.

Operations

Operating inference at scale: reliability, observability, and performance.

Architectures

Building faster, cheaper, and more efficient inference platforms.

Agenda

Summit Kickoff

Inference for Async Agents in Production

Reliability Engineering for Inference Serving

How Dynamo Accelerates Agent Execution

You Already Know More About Inference Than You Think

Break for lunch

Inference Engineering for Product Differentiation

SGLang Omni: Serving Multi-Stage Generative Models by Decode-Time Compute Characteristics

State-of-the-Art LLM Inference on AMD

Building Pinterest's VLM Serving Stack on NVIDIA Dynamo

Building Reliable LLM Serving Infrastructure at Scale

Should You Self-Host Inference?

The Cheapest Token Is the One You Never Generate
