The preface goes here; to be filled in later.

V1 Block

This chapter is based on the V1 code on the v0.9.0 branch and analyzes, bottom-up, the block management behind PagedAttention over the KV cache, vLLM's signature feature. For more background, see kaiyuan's detailed write-up: "vLLM的prefix cache为何零开销" (why vLLM's prefix cache has zero overhead).

At the lowest level, KVCacheBlock carries fairly heavyweight state: a reference count and doubly-linked-list pointers. block_hash=None marks a block that has not been hashed yet, i.e. a block that is not yet full.

@dataclass
class KVCacheBlock:
    # Block ID, ranging from 0 to num_gpu_blocks - 1.
    block_id: int
    # Reference count.
    ref_cnt: int = 0
    # The hash of the block composed of (block hash, tuple of token IDs).
    # It is only available when the block is full.
    _block_hash: Optional[BlockHashType] = None

    # Used to construct a doubly linked list for free blocks.
    # These two attributes should only be manipulated by FreeKVCacheBlockQueue.
    prev_free_block: Optional["KVCacheBlock"] = None
    next_free_block: Optional["KVCacheBlock"] = None

FreeKVCacheBlockQueue is one of the standout structural designs in the V1 architecture. It maintains a blocks list plus head/tail pointers, and supports the basic doubly-linked-list operations remove(any_existing_block), append(new_end_block), and popleft(front_block). The design philosophy is to have one structure that uniformly tracks blocks with zero references (all associated requests have finished) that may still be reusable later. When a new request tries to reuse cached content, it also probes this queue (by hash); on a hit, the block's contents are reused directly with no need to recompute the KV cache. When new blocks are requested and truly empty blocks run short, old blocks can be evicted from this list.

class FreeKVCacheBlockQueue:
    """This class organizes a list of KVCacheBlock objects to a doubly linked
    list of free blocks. We implement this class instead of using Python
    builtin deque to support removing a block in the middle of the queue
    in O(1) time. To close the performance gap to the builtin deque which is
    implemented in C++, this class does not allocate any Python objects when
    manipulating the linked list. Instead, this class manipulates the
    prev_free_block and next_free_block attributes of the given blocks.

    The queue is ordered by block ID in the beginning. When a block is allocated
    and then freed, it will be appended back with the eviction order:
    1. The least recent used block is at the front (LRU).
    2. If two blocks have the same last accessed time (allocated by the
       same sequence), the one with more hash tokens (the tail of a block
       chain) is at the front.
    Note that we maintain this order by reversing the block order when free
    blocks of a request. This operation is outside of this class.

    Args:
        blocks: A list of KVCacheBlock objects.
    """

    def __init__(self, blocks: list[KVCacheBlock]) -> None:
        self.num_free_blocks = len(blocks)

        # Initialize the doubly linked list of free blocks.
        self.free_list_head: Optional[KVCacheBlock] = blocks[0]
        self.free_list_tail: Optional[KVCacheBlock] = blocks[-1]
        for i in range(self.num_free_blocks):
            if i > 0:
                blocks[i].prev_free_block = blocks[i - 1]
            if i < self.num_free_blocks - 1:
                blocks[i].next_free_block = blocks[i + 1]
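The vLLM excerpt above only shows __init__; the three operations mentioned earlier can be sketched as follows. This is a simplified re-implementation for illustration (the Block and FreeQueue names are mine, not vLLM's), but the pointer re-wiring mirrors what FreeKVCacheBlockQueue does:

```python
from typing import Optional


class Block:
    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.prev_free_block: Optional["Block"] = None
        self.next_free_block: Optional["Block"] = None


class FreeQueue:
    def __init__(self, blocks: list[Block]) -> None:
        self.head: Optional[Block] = blocks[0]
        self.tail: Optional[Block] = blocks[-1]
        for a, b in zip(blocks, blocks[1:]):
            a.next_free_block, b.prev_free_block = b, a

    def remove(self, block: Block) -> None:
        # Unlink a block anywhere in the queue in O(1): no search,
        # just re-wire the neighbours' pointers.
        if block.prev_free_block:
            block.prev_free_block.next_free_block = block.next_free_block
        else:
            self.head = block.next_free_block
        if block.next_free_block:
            block.next_free_block.prev_free_block = block.prev_free_block
        else:
            self.tail = block.prev_free_block
        block.prev_free_block = block.next_free_block = None

    def append(self, block: Block) -> None:
        # Freed blocks go to the tail (most-recently-used end).
        if self.tail is None:
            self.head = self.tail = block
        else:
            self.tail.next_free_block = block
            block.prev_free_block = self.tail
            self.tail = block

    def popleft(self) -> Block:
        # Eviction takes the head (least-recently-used end).
        block = self.head
        assert block is not None, "no free blocks left"
        self.remove(block)
        return block
```

Because remove works on any interior node via its neighbour pointers, a prefix-cache hit can pull a block out of the middle of the queue without scanning it, which is exactly the O(1) property the docstring claims over Python's builtin deque.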

The vLLM team describes this as Zero-Overhead Prefix Caching, because the doubly linked list brings two benefits:

  1. On a reuse hit, the matched block can be removed from the queue in O(1).
  2. It naturally supports an LRU eviction policy: freeing a request's blocks appends them at the tail, while reclaiming space pops blocks from the head.

The BlockPool class keeps the full set of blocks in its blocks list, keeps the free-but-reusable blocks in the free_block_queue described above, and maintains a cached_block_hash_to_block dict mapping a block hash back to the block itself. According to the official comments, they deliberately tolerate multiple blocks sharing the same hash value and use block_id as a secondary key to keep entries unique, which I find rather curious.

class BlockPool:
    """BlockPool that manages KVCacheBlocks.
    It provides methods to allocate, free and cache the kv cache blocks. The
    free_block_queue stores the free blocks in eviction order to enable
    allocation, free, and cache eviction. The cached_block_hash_to_block
    maps between block hash and cached block to support finding cached blocks
    by their block hash.
    """

    def __init__(
        self,
        num_gpu_blocks: int,
        enable_caching: bool,
        enable_kv_cache_events: bool = False,
    ):
        assert isinstance(num_gpu_blocks, int) and num_gpu_blocks > 0
        self.num_gpu_blocks = num_gpu_blocks
        self.enable_caching = enable_caching
        # All kv-cache blocks.
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]
        # Free block queue that constructs and manipulates a doubly linked
        # list of free blocks (including eviction candidates when caching is
        # enabled).
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # {block_hash: {block ID: block}}. A cached block is
        # a full block with a block hash that can be used for prefix caching.
        # The cached block may be used by running requests or in the
        # free_block_queue that could potentially be evicted.
        # NOTE: We currently don't de-duplicate the blocks in the cache,
        # meaning that if a block becomes full and is cached, we don't check
        # if there is already an identical block in the cache. This is because
        # we want to make sure the allocated block IDs won't change so that
        # block tables are append-only.
        self.cached_block_hash_to_block: dict[BlockHashType, dict[
            int, KVCacheBlock]] = defaultdict(dict)

        # To represent a placeholder block with block_id=0.
        # The ref_cnt of null_block is not maintained, needs special care to
        # avoid freeing it.
        self.null_block = self.free_block_queue.popleft()

        self.enable_kv_cache_events = enable_kv_cache_events
        self.kv_event_queue: list[KVCacheEvent] = []

BlockPool provides the following key methods:

  • get_cached_block(block_hash): look up a block with the given hash in the mapping table; if several blocks match, the first one is returned by default.
  • cache_full_blocks(request, cached_blocks, goal_blocks, ...): hash a run of full blocks and record them in the cache. The hash function is pluggable and defaults to sha256. There is an optimization here: the current request may already have computed its prefix hashes while attempting a match, so the method accepts those cached results via a block_hashes parameter and only starts hashing from the first uncached block. One gripe about the request parameter: BlockPool should not have to know about that structure; it should instead take just the fields it needs (mainly the token IDs used for hashing).
  • get_new_blocks(num_blocks): claim num_blocks new blocks. They are popped from the head of the free_block_queue and their reference counts are incremented by 1. If any of them still carries a hash, the hash is cleared and the entry is removed from cached_block_hash_to_block.
  • touch(blocks): pin the blocks in the list so they cannot be freed, used when the current request gets a hash hit. It walks the blocks list and increments each reference count; a transition from 0 to 1 means the block must also be removed from the free_block_queue.
  • free_blocks(ordered_blocks): release the blocks in the list, used when the current request has finished. It walks ordered_blocks in order and decrements each reference count; a transition from 1 to 0 means the block is appended to the tail of the free_block_queue.
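The reference-count transitions in touch and free_blocks can be sketched as below. This is a hedged, simplified stand-in (a plain deque instead of the real linked list, and Block here is an illustrative type, not vLLM's KVCacheBlock); the point is the 0→1 and 1→0 edge handling, not the data structure:

```python
from collections import deque


class Block:
    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.ref_cnt = 0


def touch(free_queue: deque, blocks: list["Block"]) -> None:
    # A cache hit pins these blocks. A 0 -> 1 transition means the block
    # was sitting in the free queue as an eviction candidate and must be
    # pulled out so it cannot be reclaimed while in use.
    for block in blocks:
        if block.ref_cnt == 0:
            free_queue.remove(block)  # O(n) here; O(1) in the real linked list
        block.ref_cnt += 1


def free_blocks(free_queue: deque, ordered_blocks: list["Block"]) -> None:
    # When a request finishes, drop one reference per block. A 1 -> 0
    # transition means no running request uses the block any more, so it
    # becomes an eviction candidate at the tail of the queue (LRU order).
    for block in ordered_blocks:
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            free_queue.append(block)
```

Note the asymmetry: a block with ref_cnt > 0 is invisible to eviction, yet it can still sit in cached_block_hash_to_block, which is why a cached block may simultaneously be "in use" and "cached".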

Note that BlockPool also maintains a list of KVCacheEvent objects. Storing a newly hashed block or removing a hashed block from the mapping table triggers a BlockStored or BlockRemoved event respectively, with all relevant parameters recorded.
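A minimal sketch of that event recording, assuming illustrative field names (the actual vLLM event schema carries more parameters, e.g. token IDs and block size):

```python
from dataclasses import dataclass, field


@dataclass
class BlockStored:
    # Emitted when a full block is hashed and saved into the cache.
    block_id: int
    block_hash: int


@dataclass
class BlockRemoved:
    # Emitted when a hash entry is evicted from the mapping table.
    block_hash: int


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def on_cache(self, block_id: int, block_hash: int) -> None:
        self.events.append(BlockStored(block_id, block_hash))

    def on_evict(self, block_hash: int) -> None:
        self.events.append(BlockRemoved(block_hash))
```

Downstream consumers (e.g. an external KV-cache coordinator) can replay such an ordered event stream to mirror the pool's cache state without polling it.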

[To be continued]

Scheduler

Ascend