Official video and related videos [Pop Goes the Stack | KV cache is the real inference bottleneck (Not GPUs) | Agentic AI]
#GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of #F5's Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack.
Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available can skip expensive recomputation and drastically improve time to first token.
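To make the prefill/decode split concrete, here is a minimal sketch of a single attention head with a KV cache. It is a toy under stated assumptions, not how any particular inference engine is implemented: `project` is a hypothetical stand-in for a model's learned key/value projections, and the token vectors are made up. The point it illustrates is that decode only computes the key and value for the newest token and reuses everything prefill already stored.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def project(token_vec):
    """Hypothetical stand-in for learned K/V projections (assumption)."""
    key = [2.0 * x for x in token_vec]
    value = [x + 1.0 for x in token_vec]
    return key, value


class KVCache:
    """Holds one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def prefill(prompt_vecs):
    """Prefill: compute K/V for every prompt token once and store them."""
    cache = KVCache()
    for vec in prompt_vecs:
        cache.append(*project(vec))
    return cache


def decode_step(cache, new_vec):
    """Decode: compute K/V only for the new token, reuse the rest from cache."""
    cache.append(*project(new_vec))
    return attend(new_vec, cache.keys, cache.values)
```

If the cache from `prefill` is still available when the next turn arrives, `decode_step` can run immediately. If it has been discarded, the whole prompt has to go through `prefill` again, which is the recomputation the episode describes.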
From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state.
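The offload hierarchy can be sketched as a chain of least-recently-used stores, where an entry evicted from a faster tier is demoted to the next slower one instead of being thrown away. This is an illustration only: the class name `TieredKVStore`, the per-tier entry capacities, and the promote-on-hit policy are assumptions made for the example, and real systems move tensor blocks rather than Python objects.

```python
from collections import OrderedDict

# Fastest to slowest, following the tiers named in the episode.
TIERS = ["hbm", "dram", "ssd", "nas"]


class TieredKVStore:
    """Toy multi-tier KV cache store with LRU demotion (illustrative only)."""

    def __init__(self, capacities):
        # capacities maps tier name -> max number of entries (assumed values).
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def _insert(self, session_id, kv_blob, tier_index):
        tier = TIERS[tier_index]
        store = self.tiers[tier]
        store[session_id] = kv_blob
        store.move_to_end(session_id)  # mark as most recently used
        if len(store) > self.capacities[tier]:
            victim_id, victim_blob = store.popitem(last=False)  # evict LRU
            if tier_index + 1 < len(TIERS):
                self._insert(victim_id, victim_blob, tier_index + 1)
            # Past the last tier the entry is dropped; the next request
            # for that session would have to recompute from a cold start.

    def put(self, session_id, kv_blob):
        """Store a session's cache in the fastest tier, removing stale copies."""
        for store in self.tiers.values():
            store.pop(session_id, None)
        self._insert(session_id, kv_blob, 0)

    def get(self, session_id):
        """Return (kv_blob, tier_found) or (None, None) on a full miss."""
        for tier in TIERS:
            store = self.tiers[tier]
            if session_id in store:
                blob = store.pop(session_id)
                self._insert(session_id, blob, 0)  # promote on hit
                return blob, tier
        return None, None
```

A hit in `dram`, `ssd`, or `nas` costs a transfer back to the GPU, but the caller still skips prefill. Only the `(None, None)` case pays for a full recompute.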
The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.
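As a rough illustration of what "smarter routing" could mean, the sketch below pins each session to the worker that served it last, so follow-up prompts land where the KV cache already lives, and falls back to the least-loaded worker for new sessions. The names `CacheAwareRouter`, `route`, and `finish`, and the worker labels, are hypothetical and do not come from F5 or any specific product.

```python
class CacheAwareRouter:
    """Toy session-sticky router (illustrative; all names are assumptions)."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.load = {worker: 0 for worker in self.workers}  # in-flight requests
        self.session_to_worker = {}  # where each session's KV cache lives

    def route(self, session_id):
        """Pick a worker for this request, preferring the one holding its cache."""
        worker = self.session_to_worker.get(session_id)
        if worker is None:
            # No cached state anywhere: choose the least-loaded worker
            # (ties resolve to the earliest worker in the list).
            worker = min(self.workers, key=lambda w: self.load[w])
            self.session_to_worker[session_id] = worker
        self.load[worker] += 1
        return worker

    def finish(self, worker):
        """Call when a request completes so load counts stay accurate."""
        self.load[worker] -= 1

    def forget(self, session_id):
        """Drop the mapping, for example after the cache has been evicted."""
        self.session_to_worker.pop(session_id, None)


router = CacheAwareRouter(["gpu-0", "gpu-1"])
first = router.route("alice")      # new session, least-loaded worker
other = router.route("bob")        # new session, goes to the other worker
again = router.route("alice")      # returning session, same worker as before
```

A production balancer would also need to cap the load on a pinned worker and decide when moving the cache is cheaper than queueing behind it; this sketch deliberately leaves those policies out.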
Chapters:
00:00 Welcome to Pop Goes the Stack
00:18 GPUs aren’t the inference bottleneck—KV cache memory is
00:42 Why inference became stateful (long context + agents)
01:59 What exactly is KV cache?
03:18 Distributed inference: Compute-bound prefill vs memory-bound decode
04:31 Routing matters: Send prompts back to the GPU with the cache
05:24 The 4-tier memory hierarchy: HBM → DRAM → SSD → NAS
06:32 Time-to-first-token: Why offload beats recompute
07:30 Semantic caching: Answer before the LLM even runs
08:58 The ugly math: Huge KV caches create “elephant flows”
10:43 NVIDIA NIXL: Tuned networking for KV cache transfers
12:00 Sticky sessions for LLMs: Persistence and “prompt-to-GPU” mapping
15:14 Enterprises need to optimize and retrofit, don’t just buy GPUs
18:01 Key takeaways: Move the small things, reuse what you can, and retrofit what you have
20:32 The upside: we've solved these problems before
Learn how you can stay ahead of the curve and keep your stack whole with additional insights on app security, multicloud, #AI, and emerging tech: https://go.f5.net/kp608z4f
More about F5: https://go.f5.net/pkxbtxgp
Read our blog: https://go.f5.net/s6xpjel3
Follow us on LinkedIn: https://go.f5.net/7e12jm0x