Pop Goes the Stack | KV cache is the real inference bottleneck (Not GPUs) | Agentic AI | Official Videos from U.S. Companies

FFIV

#GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of #F5's Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack.

Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. The KV cache is the bridge between them: keeping it available lets the system skip expensive recomputation and can drastically improve time to first token.
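To make the prefill/decode split concrete, here is a minimal sketch of a single attention head with a KV cache, in plain Python. Everything in it is an illustrative assumption rather than anything shown in the episode: the 2x2 weight matrices `W_Q`, `W_K`, `W_V`, the `KVCache` class, and the toy embeddings are made up, and a real model has many layers and heads. The point is only that the cached path and the recompute-everything path give the same answer, while the cached path projects one token per step instead of the whole history.

```python
import math

# Hypothetical toy weights for one attention head (not from any real model).
W_Q = [[0.9, 0.1], [0.2, 0.8]]
W_K = [[0.5, -0.3], [0.4, 0.7]]
W_V = [[1.0, 0.2], [-0.1, 0.6]]


def project(x, w):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Holds the key and value vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def prefill(prompt_embeddings, cache):
    """Prefill: project the whole prompt once and store K and V."""
    for x in prompt_embeddings:
        cache.append(project(x, W_K), project(x, W_V))


def decode_step(x, cache):
    """Decode with a cache: project only the new token, reuse the rest."""
    cache.append(project(x, W_K), project(x, W_V))
    return attend(project(x, W_Q), cache.keys, cache.values)


def decode_step_no_cache(history, x):
    """Decode without a cache: re-project every earlier token each step."""
    tokens = history + [x]
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(x, W_Q), keys, values)
```

If the cache from a previous turn is still available when the next prompt arrives, the prefill work for the shared history can be skipped, which is where the time-to-first-token improvement described above comes from.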

From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. Each lower tier is slower than keeping everything on-GPU, but reading from it is still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state.
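The hierarchy and the cache-aware routing idea can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not how any particular product implements it: `TieredKVStore`, `CacheAwareRouter`, the per-tier entry capacities, and the worker names are all hypothetical, and eviction is plain least-recently-used, chosen only for simplicity.

```python
from collections import OrderedDict

# Tier order follows the episode: HBM -> DRAM -> SSD -> NAS.
TIERS = ["hbm", "dram", "ssd", "nas"]


class TieredKVStore:
    """Toy KV-cache store that demotes old entries to slower tiers."""

    def __init__(self, capacity):
        # capacity: hypothetical max number of entries per tier name
        self.capacity = capacity
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put(self, key, blob, tier_index=0):
        if tier_index >= len(TIERS):
            return  # dropped from the last tier; must be recomputed later
        name = TIERS[tier_index]
        tier = self.tiers[name]
        tier[key] = blob
        tier.move_to_end(key)
        if len(tier) > self.capacity[name]:
            # Demote the least recently used entry one tier down.
            old_key, old_blob = tier.popitem(last=False)
            self.put(old_key, old_blob, tier_index + 1)

    def get(self, key):
        """Return (blob, tier_name) and promote hits back to HBM."""
        for name in TIERS:
            if key in self.tiers[name]:
                blob = self.tiers[name].pop(key)
                self.put(key, blob)
                return blob, name
        return None, "miss"  # cold start: caller must run prefill again


class CacheAwareRouter:
    """Sends a session back to the worker that already holds its cache."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.owner = {}  # session id -> worker name
        self.load = {w: 0 for w in self.workers}

    def route(self, session_id):
        worker = self.owner.get(session_id)
        if worker is None:
            # New session: fall back to least-loaded placement.
            worker = min(self.workers, key=lambda w: self.load[w])
            self.owner[session_id] = worker
        self.load[worker] += 1
        return worker
```

A lookup that returns `"ssd"` or `"nas"` costs a transfer, while `"miss"` costs a full prefill; a router that keeps returning sessions to the same worker is what makes the faster outcomes likely.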

The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.

Chapters:
00:00 Welcome to Pop Goes the Stack
00:18 GPUs aren’t the inference bottleneck—KV cache memory is
00:42 Why inference became stateful (long context + agents)
01:59 What exactly is KV cache? 
03:18 Distributed inference: Compute-bound prefill vs memory-bound decode
04:31 Routing matters: Send prompts back to the GPU with the cache
05:24 The 4-tier memory hierarchy: HBM → DRAM → SSD → NAS
06:32 Time-to-first-token: Why offload beats recompute
07:30 Semantic caching: Answer before the LLM even runs
08:58 The ugly math: Huge KV caches create “elephant flows”
10:43 NVIDIA NIXL: Tuned networking for KV cache transfers
12:00 Sticky sessions for LLMs: Persistence and “prompt-to-GPU” mapping
15:14 Enterprises need to optimize and retrofit, don’t just buy GPUs
18:01 Key takeaways: Move the small things, reuse what you can, and retrofit what you have
20:32 The upside: We've solved these problems before
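For a rough sense of the "ugly math" in the 08:58 chapter, the back-of-the-envelope sketch below estimates KV cache size with the usual per-token formula (one key and one value vector per layer and per KV head). The model dimensions, context length, and link speed are hypothetical values chosen for illustration, not figures from the episode, and the estimate ignores batching, paging overhead, and protocol overhead on the wire.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache added by one token."""
    # Factor of 2: one key vector plus one value vector per layer and head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(tokens, **model):
    """Bytes of KV cache for a context of `tokens` tokens."""
    return tokens * kv_bytes_per_token(**model)


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal time to move num_bytes over a link, ignoring overhead."""
    return num_bytes / (link_gbit_per_s * 1e9 / 8)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
HYPOTHETICAL_MODEL = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_bytes_per_token(**HYPOTHETICAL_MODEL)     # 131072 bytes (128 KiB)
session = kv_cache_bytes(32_768, **HYPOTHETICAL_MODEL)   # 2**32 bytes (4 GiB)
seconds = transfer_seconds(session, link_gbit_per_s=100)  # roughly 0.34 s
```

Under these assumptions a single long session is a multi-gigabyte object, so moving it between tiers or nodes looks like a large bulk flow rather than typical request/response traffic.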

Learn how you can stay ahead of the curve and keep your stack whole with additional insights on app security, multicloud, #AI, and emerging tech: https://go.f5.net/kp608z4f

More about F5: https://go.f5.net/pkxbtxgp

Read our blog: https://go.f5.net/s6xpjel3

Follow us on LinkedIn: https://go.f5.net/7e12jm0x 
