One day. Three Pillars. Real Engineering.
Let’s build the future of inference together.
Lessons from running inference at scale
Emerging trends in the inference stack
New approaches for your stack
Engines
Vision-language models introduce new inference challenges beyond text-only workloads: large multimodal payloads, expensive image-driven prefills, and heavy KV cache pressure from multi-turn interactions. In this talk, Pinterest shares how it built its VLM serving platform using NVIDIA Dynamo, vLLM, AWS EKS, and Blackwell GPUs to support production workloads such as the Pinterest Assistant and Multimodal Reranker. We’ll cover the key optimizations that make multimodal inference efficient at scale, including KV-aware routing, prefill/decode disaggregation, tiered KV cache offloading, and realistic benchmarking with AIPerf.
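For readers new to the term, the general idea behind KV-aware routing can be sketched in a few lines: send each request to the worker that already holds the longest matching prompt prefix in its KV cache, discounted by how busy that worker is. This is an illustrative sketch only, not Pinterest's or NVIDIA Dynamo's implementation; `Worker`, `block_hashes`, `pick_worker`, the block size, and the `load_weight` value are all hypothetical names and parameters chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of one serving replica as seen by the router."""
    name: str
    active_requests: int = 0
    # Hashes of prompt-prefix blocks this worker is assumed to hold in KV cache.
    cached_blocks: set = field(default_factory=set)


def block_hashes(tokens, block_size=4):
    """Chain-hash each full block so identical prefixes yield identical hashes."""
    hashes = []
    running = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        running = hash((running, tuple(tokens[start:start + block_size])))
        hashes.append(running)
    return hashes


def cached_prefix_blocks(worker, hashes):
    """Count how many leading blocks of the prompt the worker already caches."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=0.5):
    """Prefer cache reuse, but penalize workers that are already busy."""
    hashes = block_hashes(tokens)

    def score(worker):
        return cached_prefix_blocks(worker, hashes) - load_weight * worker.active_requests

    best = max(workers, key=score)
    # Assume the chosen worker will cache this prompt's blocks after prefill.
    best.cached_blocks.update(hashes)
    best.active_requests += 1
    return best
```

In this sketch a repeated prompt keeps landing on the worker that served it first, until that worker's load penalty outweighs the prefill work saved by reusing its cache.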
Engines
Agentic workloads introduce new challenges for inference systems, including tool-call stalls, inefficient scheduling, and unpredictable KV cache behavior. In this talk, NVIDIA shares how Dynamo optimizes long-running agent workflows by turning execution traces into structured performance data that can be analyzed and replayed to identify bottlenecks. We’ll explore how agentic routing, workload hints, and programmatic KV cache management work together to improve efficiency, utilization, and responsiveness for production agent systems.
Architectures
Increasingly, AI-native builders are turning to open-source, fine-tuned, and custom-trained AI models to build new product capabilities and achieve sustainable, competitive unit economics. Delivering these models requires a new capability: inference engineering.
Architectures
LLM inference infrastructure must handle variable demand across concurrent workloads at scale. This talk covers autoscaling, load balancing, and recovery from silent failures without degrading performance.
Architectures
Inference can feel like a wall of new vocabulary: KV cache, prefill, decode, tensor parallelism, speculative decoding. But once you look past the names, the shape is familiar. It is routing, scheduling, memory pressure, cache eviction, noisy neighbors, tail latency, and state movement.
This talk is an invitation for systems engineers to step into the inference stack with confidence. We’ll build a practical mental model for how requests move through modern LLM serving systems, where performance gets lost, and why utilization is often governed by state rather than compute. You’ll leave with the anatomy of the stack, the vocabulary to reason about it, and the intuition to start asking sharper questions.
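One way to see why state can matter more than compute is to work out how much memory the KV cache needs per token. The sketch below uses the standard sizing for a transformer with multi-head or grouped-query attention (one key and one value vector per layer per KV head). The model shape and the 40 GiB cache budget are assumed values for illustration, not figures from the talk.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers the key vector and the value vector.
    bytes_per_value=2 assumes 16-bit (FP16/BF16) cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """How many tokens fit in the memory left over for KV cache."""
    return kv_budget_bytes // bytes_per_token


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Assumed budget: 40 GiB of GPU memory reserved for KV cache.
budget = 40 * 1024**3
total_tokens = max_cached_tokens(budget, per_token)
print(total_tokens)  # 327680 tokens

# With 8192-token contexts, that budget holds only 40 full sequences at once.
print(total_tokens // 8192)  # 40
```

Under these assumptions the number of sequences a GPU can serve concurrently is capped by cache memory long before its arithmetic units are saturated.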
Architectures
Inference is where AI systems meet reality. As agentic workloads consume more compute and GPU costs continue to rise, reliability now extends beyond uptime to include latency, efficiency, and operational resilience.
Engines
State-of-the-Art LLM Inference on AMD
Learn how Wafer achieved state-of-the-art Qwen3.5-397B inference performance on AMD MI355X through custom MoE kernels, expert fusion, and caching optimizations.
Engines
As multimodal and agentic models evolve beyond a single decoding loop, inference systems must adapt. This talk explores the principles behind multi-stage decoding and the architecture powering SGLang Omni.
Architectures
Inference for Async Agents in Production
As AI agents take on longer, multi-step workflows, inference costs and scaling challenges grow quickly. This talk explores practical techniques for building high-performance async agents while keeping costs under control.
Operations
Should You Self-Host Inference?
Small models now match GPT-5.1-level capability and are relatively easy to run. Self-hosting can deliver cost savings of 20x or more, give you control of your AI stack, and even let you bundle inference directly into your software product instead of treating it as a third-party dependency.
Architectures
Most inference optimization happens on the serving side: writing better kernels, using smarter caching, running faster hardware. Yet the biggest cost and latency wins tend to come from the model side, from deciding which tokens never need to be generated by a large model in the first place. This talk covers practical techniques for right-sizing the model to maximize token ROI: distilling frontier-model behavior into small models you can afford to serve at scale, controlling reasoning budgets so models don’t think longer than the problem deserves, and routing between small and frontier models in real time.
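One simple form of small-versus-frontier routing is a cascade: try the small model with a tight reasoning budget first, and escalate only when its confidence is low. The sketch below illustrates that pattern and is not necessarily the approach the talk presents. `answer_with_cascade`, the confidence threshold, the token budgets, and the two stub models are hypothetical stand-ins for real model calls.

```python
def answer_with_cascade(prompt, small_model, frontier_model,
                        min_confidence=0.8, small_budget=256, frontier_budget=2048):
    """Return (answer, model_used), escalating only when the small model is unsure.

    Each model is a callable taking (prompt, max_reasoning_tokens) and
    returning (answer_text, confidence in [0, 1]). Both are assumptions.
    """
    answer, confidence = small_model(prompt, small_budget)
    if confidence >= min_confidence:
        return answer, "small"
    # The frontier model gets a larger, but still capped, reasoning budget.
    answer, _ = frontier_model(prompt, frontier_budget)
    return answer, "frontier"


def stub_small(prompt, max_reasoning_tokens):
    """Stand-in small model: confident on short prompts, unsure on long ones."""
    confidence = 0.9 if len(prompt) < 20 else 0.3
    return "small-model answer", confidence


def stub_frontier(prompt, max_reasoning_tokens):
    """Stand-in frontier model: always answers with full confidence."""
    return "frontier-model answer", 1.0


print(answer_with_cascade("2 + 2?", stub_small, stub_frontier))
print(answer_with_cascade("Plan a multi-region database migration.", stub_small, stub_frontier))
```

The first call stays on the small model and the second escalates. In a real system the confidence signal might come from a verifier, a classifier, or token log-probabilities, and the budgets would be tuned per workload.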
The Three Pillars of Modern Inference
Engines
Inside today’s inference engines: schedulers, KV caches, serving systems, and the code powering them.
Operations
Operating inference at scale: reliability, observability, and performance.
Architectures
Building faster, cheaper, and more efficient inference platforms.