Official Video Picks


Official and related videos [Pop Goes the Stack | KV cache is the real inference bottleneck (Not GPUs) | Agentic AI]

#GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of #F5's Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack.

Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available can skip expensive recomputation and drastically improve time to first token.

From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state.

The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.

Chapters:
00:00 Welcome to Pop Goes the Stack
00:18 GPUs aren’t the inference bottleneck; KV cache memory is
00:42 Why inference became stateful (long context + agents)
01:59 What exactly is KV cache?
03:18 Distributed inference: Compute-bound prefill vs memory-bound decode
04:31 Routing matters: Send prompts back to the GPU with the cache
05:24 The 4-tier memory hierarchy: HBM → DRAM → SSD → NAS
06:32 Time-to-first-token: Why offload beats recompute
07:30 Semantic caching: Answer before the LLM even runs
08:58 The ugly math: Huge KV caches create “elephant flows”
10:43 NVIDIA NIXL: Tuned networking for KV cache transfers
12:00 Sticky sessions for LLMs: Persistence and “prompt-to-GPU” mapping
15:14 Enterprises need to optimize and retrofit, don’t just buy GPUs
18:01 Key takeaways: Move the small things, reuse what you can, and retrofit what you have
20:32 The upside, we've solved these problems before

Learn how you can stay ahead of the curve and keep your stack whole with additional insights on app security, multicloud, #AI, and emerging tech: https://go.f5.net/kp608z4f
More about F5: https://go.f5.net/pkxbtxgp
Read our blog: https://go.f5.net/s6xpjel3
Follow us on LinkedIn: https://go.f5.net/7e12jm0x
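To make the "ugly math" and the cache-aware routing idea from the episode concrete, here is a minimal sketch in Python. It is not taken from the episode: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an illustrative assumption, and `CacheAwareRouter`, the worker names, and the session ids are hypothetical. The size formula itself is the standard one: each token stores one key vector and one value vector per layer and per KV head.

```python
from dataclasses import dataclass, field


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


@dataclass
class CacheAwareRouter:
    """Hypothetical router that prefers the worker already holding a session's cache."""
    workers: list[str]
    owner: dict[str, str] = field(default_factory=dict)  # session id -> worker
    load: dict[str, int] = field(default_factory=dict)   # worker -> requests routed

    def route(self, session_id: str) -> str:
        worker = self.owner.get(session_id)
        if worker is None:
            # Cold start: no worker holds this session's state yet,
            # so fall back to the least-loaded worker and remember the choice.
            worker = min(self.workers, key=lambda w: self.load.get(w, 0))
            self.owner[session_id] = worker
        self.load[worker] = self.load.get(worker, 0) + 1
        return worker


if __name__ == "__main__":
    # Assumed model shape, for illustration only.
    per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
    full = kv_cache_bytes(128 * 1024, layers=32, kv_heads=8, head_dim=128)
    print(per_token // 1024, "KiB per token")
    print(full / 2**30, "GiB for a 128K-token context")

    router = CacheAwareRouter(workers=["gpu-0", "gpu-1"])
    print(router.route("alice"), router.route("bob"), router.route("alice"))
```

Under these assumed numbers, a single 128K-token conversation carries about 16 GiB of state, which is why moving it between tiers or nodes produces very large transfers and why sending the follow-up request back to the worker that already holds the cache is attractive. A production router would also need eviction, worker failure handling, and a way to fall back to a lower memory tier, none of which this sketch attempts.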