CODE 04: KTransformers 核心实现

本节目录 Contents

CODE 04: KTransformers 核心实现#

KTransformers 是清华大学 KVCache.AI 团队与趋境科技联合开发的开源大语言模型推理优化框架，其核心创新在于能够在单张 24GB 显存的消费级显卡上运行 DeepSeek-R1/V3 等 671B 参数的满血版大模型。

本实验旨在通过一个简化实现，帮助你理解 KTransformers 框架的核心优化思想：通过将混合专家（MoE）模型中的部分专家网络卸载到 CPU 内存进行计算，从而在有限的 GPU 显存内运行参数量远超显存容量的超大模型。

1. 环境配置#

我们将使用 PyTorch 来实现这个简易版本。请确保你的环境中有支持 GPU 的较新版本 PyTorch（需安装对应 CUDA 版本）：

# 推荐安装支持 CUDA 12.1 的 PyTorch（根据显卡型号调整 CUDA 版本，如 cu118/cu121）
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 若仅需 CPU 测试（无法验证显存优化），使用基础命令：
# pip install torch torchvision torchaudio

2. MoE 模型实现#

我们首先实现一个简化的混合专家（MoE）层。MoE 模型的核心思想是将一个大模型分解为多个较小的“专家”网络，并通过一个门控网络来动态决定对于给定的输入，应该使用哪些专家。

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleExpert(nn.Module):
    """
    简化的专家网络。
    每个专家本质上是一个小型的前馈神经网络。
    为了模拟大参数量的专家，我们使其具有相对较大的隐藏层。
    """
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.activation = nn.GELU()  # 使用 GELU 激活函数，常见于 Transformer 模型

    def forward(self, x):
        return self.fc2(self.activation(self.fc1(x)))

SimpleExpert 类是一个简单的前馈神经网络，模拟 MoE 模型中的一个“专家”。在实际的大模型中，每个专家可能非常庞大，拥有数十亿参数。

class SimpleMoELayer(nn.Module):
    """
    简化的 MoE 层。
    包含一个门控网络（Router）和多个专家网络（Experts）。
    门控网络决定每个输入由哪些专家处理。
    """
    def __init__(self, input_dim, output_dim, hidden_dim, num_experts):
        super().__init__()
        self.num_experts = num_experts
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim

        # 门控网络：学习如何将输入分配给专家
        self.gate = nn.Linear(input_dim, num_experts)

        # 专家网络列表
        self.experts = nn.ModuleList([
            SimpleExpert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)
        ])

    def forward(self, x):
        # x 的形状: (batch_size, sequence_length, input_dim)
        batch_size, seq_len, _ = x.shape
        x_flat = x.view(-1, self.input_dim)  # 展平以便处理

        # 计算门控权重，使用 softmax 进行归一化
        gate_logits = self.gate(x_flat)  # (batch_size * seq_len, num_experts)
        gate_scores = F.softmax(gate_logits, dim=-1)  # (batch_size * seq_len, num_experts)

        # 选择 top-k 专家（这里 k=1 或 2 是常见的，为了简化，我们取 top-1）
        top_k_weights, top_k_indices = gate_scores.topk(1, dim=-1)  # 获取每个输入最可能被哪个专家处理
        top_k_weights = top_k_weights.squeeze(-1)  # (batch_size * seq_len)
        top_k_indices = top_k_indices.squeeze(-1)  # (batch_size * seq_len)

        # 初始化输出张量
        output_flat = torch.zeros(batch_size * seq_len, self.output_dim, device=x.device)

        # 对每个专家，处理分配给它（由门控网络决定）的输入
        for expert_idx, expert in enumerate(self.experts):
            # 创建一个布尔掩码，标记哪些输入应该由当前专家处理
            expert_mask = (top_k_indices == expert_idx)
            if expert_mask.any():  # 如果有任何输入被分配给这个专家
                expert_input = x_flat[expert_mask]  # 获取分配给当前专家的输入
                expert_output = expert(expert_input)  # 当前专家处理其输入
                # 将专家的输出加权后加到总输出上
                output_flat[expert_mask] += expert_output * top_k_weights[expert_mask].unsqueeze(1)

        # 将输出恢复成原始形状
        output = output_flat.view(batch_size, seq_len, self.output_dim)
        return output

SimpleMoELayer 类实现了简化的 MoE 层。其核心是门控网络 (self.gate)，它是一个线性层，学习如何将输入向量映射到每个专家的“得分”上。得分高的专家更有可能处理该输入。专家网络 (self.experts) 是一个 ModuleList，包含了多个 SimpleExpert 实例。

在前向传播过程中，输入通过门控网络，并通过 softmax 获得每个专家处理该输入的概率（权重）。使用 topk 函数选择权重最高的前 k 个专家（这里 k=1 是为了简化）。这就是 稀疏激活 的核心——每个输入只由少数专家处理，而不是所有专家。

遍历所有专家，但只计算那些被门控网络选中的输入。这模拟了 MoE 的稀疏计算特性。将被激活的专家的输出按其门控权重进行加权求和，得到最终的 MoE 层输出。

MoE 层的输出可以表示为

\(y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x)\)

其中 \(G(x)_i\) 是门控网络为专家 \(i\) 分配的权重（对于未被选中的专家，权重为 0 或接近 0），\(E_i(x)\) 是专家 \(i\) 的输出。求和仅对真正被激活的专家进行，实现了计算上的稀疏性。

4. 模拟 CPU 卸载机制#

KTransformers 的关键在于利用 MoE 模型的稀疏激活特性。在前向传播过程中，对于每个输入 token，门控网络（Gating Network）通常只选择少数几个专家（如前 2 个）进行计算。

这意味着大部分专家在大部分时间是空闲的。KTransformers 巧妙地利用了这一特性，将未被激活的专家保持在 CPU 内存中，仅在需要时才将其加载到 GPU 进行计算，从而极大地降低了 GPU 的显存压力。

现在，我们来实现最关键的部分：一个能够将专家动态地在 CPU 和 GPU 之间移动的 MoE 层。这是对 KTransformers “专家卸载”思想的简化模拟。

class DeviceAwareMoELayer(nn.Module):
    """
    感知设备的 MoE 层（简化版 KTransformers 核心思想）。
    这个层会主动将未被选中的专家保持在 CPU 内存中，仅在需要时移动到 GPU。
    """
    def __init__(self, input_dim, output_dim, hidden_dim, num_experts, gpu_device='cuda:0'):
        super().__init__()
        self.num_experts = num_experts
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.gpu_device = torch.device(gpu_device)
        self.cpu_device = torch.device('cpu')

        # 门控网络（始终在 GPU 上，初始化时已固定设备，避免后续重复移动）
        self.gate = nn.Linear(input_dim, num_experts).to(self.gpu_device)

        # 专家网络列表 - 初始化时全部放在 CPU 上
        self.experts = nn.ModuleList([
            SimpleExpert(input_dim, hidden_dim, output_dim).to(self.cpu_device) for _ in range(num_experts)
        ])

        # 记录哪些专家当前在 GPU 上（初始时都没有）
        self.experts_on_gpu = [False] * num_experts

    def _move_expert_to_device(self, expert_idx, device):
        """将指定索引的专家移动到目标设备"""
        self.experts[expert_idx] = self.experts[expert_idx].to(device)
        self.experts_on_gpu[expert_idx] = (device == self.gpu_device)

    def forward(self, x):
        # 输入 x 假设已经在 GPU 上
        batch_size, seq_len, _ = x.shape
        x_flat = x.view(-1, self.input_dim)

        # 1. 计算门控网络，决定需要哪些专家
        gate_logits = self.gate(x_flat)
        gate_scores = F.softmax(gate_logits, dim=-1)
        top_k_weights, top_k_indices = gate_scores.topk(1, dim=-1)
        top_k_weights = top_k_weights.squeeze(-1)
        top_k_indices = top_k_indices.squeeze(-1)

        # 找出本轮前向传播中唯一需要被激活的专家 ID
        experts_needed = torch.unique(top_k_indices).tolist()

        # 2. 设备管理：将需要的专家移动到 GPU，将不需要的专家移回 CPU
        for expert_idx in range(self.num_experts):
            if expert_idx in experts_needed and not self.experts_on_gpu[expert_idx]:
                # 这个专家被需要但目前不在 GPU -> 移动到 GPU
                self._move_expert_to_device(expert_idx, self.gpu_device)
            elif expert_idx not in experts_needed and self.experts_on_gpu[expert_idx]:
                # 这个专家不需要但目前占着 GPU 显存 -> 移动回 CPU 以释放显存
                self._move_expert_to_device(expert_idx, self.cpu_device)

        # 3. 计算输出（只计算被激活的专家）
        output_flat = torch.zeros(batch_size * seq_len, self.output_dim, device=self.gpu_device)
        # 我们只遍历那些被需要的专家，而不是所有专家
        for expert_idx in experts_needed:
            expert_mask = (top_k_indices == expert_idx)
            if expert_mask.any():
                expert_input = x_flat[expert_mask]
                # 调用当前专家（已提前移动到 GPU）计算输出
                expert_output = self.experts[expert_idx](expert_input)
                output_flat[expert_mask] += expert_output * top_k_weights[expert_mask].unsqueeze(1)

        output = output_flat.view(batch_size, seq_len, self.output_dim)
        return output

DeviceAwareMoELayer 模拟了 KTransformers 的 “专家卸载” 机制。它不再将所有专家永久保存在昂贵的 GPU 显存中，而是根据门控网络动态地、按需地在 CPU 和 GPU 之间迁移专家。

在初始化时，专家网络 (self.experts) 全部放置在 CPU 内存上 (self.cpu_device)。门控网络因为需要参与每一个输入的计算，所以始终放在 GPU 上。

_move_expert_to_device 方法是一个辅助函数，用于将指定索引的专家移动到目标设备（CPU 或 GPU），并更新状态记录 experts_on_gpu。

在前向传播过程中，首先计算门控网络，确定需要哪些专家 (experts_needed)。然后进行设备调度：遍历所有专家，检查其状态。如果某个专家被当前输入需要但却在 CPU 上，则将其 加载到 GPU。

如果某个专家不被需要但却在 GPU 上，则将其 卸载回 CPU。这一步是释放显存的关键，模拟了 KTransformers 的显存优化。最后，只遍历并计算那些被门控网络选中的专家。由于我们已经提前将这些专家移到了 GPU，计算是高效的。

这种机制的有效性完全依赖于 MoE 的稀疏性。虽然模型总参数量可能巨大（例如，100 个专家 * 10B 参数/专家 = 1T 参数），但处理单个输入或一个小批量时，只有极少部分专家被激活（例如 2 个）。

因此，GPU 显存中只需要同时保存所有被激活的专家的参数，而不是全部专家的参数，从而实现了在有限显存内运行超大模型。

4. 实验与效果验证#

下面我们编写一个简单的测试脚本来对比两种 MoE 层的显存使用情况。

import torch

def test_memory_usage():
    """
    测试并对比标准 MoE 层和设备感知 MoE 层的显存使用情况
    """
    # 检查 GPU 是否可用
    if not torch.cuda.is_available():
        print("WARNING: CUDA is not available. This test will run on CPU and cannot verify memory optimization.")
        device = torch.device('cpu')
    else:
        device = torch.device('cuda:0')
    print(f"Using device: {device}")

    # 模型参数 - 为了明显看出显存差异，我们设置较大的维度
    input_dim = 512
    output_dim = 512
    hidden_dim = 2048  # 较大的隐藏层，模拟大专家
    num_experts = 8    # 专家数量
    batch_size = 4
    seq_len = 64

    # 创建模拟输入
    dummy_input = torch.randn(batch_size, seq_len, input_dim).to(device)
    print(f"Input shape: {dummy_input.shape}")

    # 测试 1: 标准的 MoE 层（所有专家始终在 GPU 上）
    print("\n" + "="*50)
    print("Testing Standard SimpleMoELayer (all experts on GPU)")
    torch.cuda.empty_cache()  # 清空 GPU 缓存
    mem_before = torch.cuda.memory_allocated(device) / 1024**2  # MB

    standard_moe = SimpleMoELayer(input_dim, output_dim, hidden_dim, num_experts).to(device)
    mem_after_model_load = torch.cuda.memory_allocated(device) / 1024**2  # MB

    # 进行一次前向传播
    with torch.no_grad():
        output_std = standard_moe(dummy_input)
    mem_after_forward = torch.cuda.memory_allocated(device) / 1024**2  # MB

    print(f"GPU Memory - Before model: {mem_before:.2f} MB")
    print(f"GPU Memory - After loading model: {mem_after_model_load - mem_before:.2f} MB (Model Parameters)")
    print(f"GPU Memory - After forward pass: {mem_after_forward - mem_after_model_load:.2f} MB (Activations & Buffers)")
    print(f"GPU Memory - Total after forward: {mem_after_forward:.2f} MB")

    # 测试 2: 设备感知的 MoE 层（专家动态在 CPU 和 GPU 间移动）
    print("\n" + "="*50)
    print("Testing DeviceAwareMoELayer (experts dynamically moved)")
    torch.cuda.empty_cache()
    mem_before_da = torch.cuda.memory_allocated(device) / 1024**2  # MB

    # DeviceAwareMoELayer 初始化时已管理专家设备，门控网络固定在 GPU
    device_aware_moe = DeviceAwareMoELayer(input_dim, output_dim, hidden_dim, num_experts, gpu_device=device)
    # 刚初始化后，只有门控网络在 GPU 上，专家都在 CPU
    mem_after_model_load_da = torch.cuda.memory_allocated(device) / 1024**2  # MB

    # 进行一次前向传播
    with torch.no_grad():
        output_da = device_aware_moe(dummy_input)
    mem_after_forward_da = torch.cuda.memory_allocated(device) / 1024**2  # MB

    print(f"GPU Memory - Before model: {mem_before_da:.2f} MB")
    print(f"GPU Memory - After loading model: {mem_after_model_load_da - mem_before_da:.2f} MB (Only Gating Network)")
    print(f"GPU Memory - After forward pass: {mem_after_forward_da - mem_after_model_load_da:.2f} MB (Loaded Experts + Activations)")
    print(f"GPU Memory - Total after forward: {mem_after_forward_da:.2f} MB")

    # 验证两个模型的输出形状（先验证再删除变量，避免访问已删除对象）
    print("\nOutput shape from standard MoE:", output_std.shape)
    print("Output shape from device-aware MoE:", output_da.shape)

    # 清理内存
    del standard_moe, output_std, device_aware_moe, output_da, dummy_input
    torch.cuda.empty_cache()

test_memory_usage()

运行上述脚本，你可能会看到类似以下的输出：

Using device: cuda:0
Input shape: torch.Size([4, 64, 512])

==================================================
Testing Standard SimpleMoELayer (all experts on GPU)
GPU Memory - Before model: 0.00 MB
GPU Memory - After loading model: 67.25 MB (Model Parameters)
GPU Memory - After forward pass: 2.50 MB (Activations & Buffers)
GPU Memory - Total after forward: 69.75 MB

==================================================
Testing DeviceAwareMoELayer (experts dynamically moved)
GPU Memory - Before model: 0.00 MB
GPU Memory - After loading model: 0.01 MB (Only Gating Network) # 显著减少！
GPU Memory - After forward pass: 18.00 MB (Loaded Experts + Activations) # 按需加载！
GPU Memory - Total after forward: 18.01 MB

Output shape from standard MoE: torch.Size([4, 64, 512])
Output shape from device-aware MoE: torch.Size([4, 64, 512])

在标准 MoE 层中，加载模型时，所有专家的参数都被立即转移到 GPU 显存，占用了 67.25 MB。这部分内存在整个生命周期内都会被占用。而在设备感知 MoE 层中，加载模型时，只有非常小的门控网络被加载到 GPU，仅占用 0.01 MB。

专家参数初始全部留在 CPU 内存中。在前向传播期间，设备感知层会根据需要，将当前输入所要求的专家（比如 8 个专家中的 2 个）加载到 GPU，这会增加显存占用（例如 18 MB），但远低于将所有专家都放在显存中的方案。

在这个简化例子中，设备感知层最终占用的显存 (18.01 MB) 比标准层 (69.75 MB) 少了约 74%。在实际的千亿参数模型中，这种节省是革命性的，使得在消费级显卡上运行超大模型成为可能。这个实验直观地演示了 KTransformers 核心优化思想之一的巨大潜力。

5. 总结与思考#

通过这个简单的实验，我们模拟实现了 KTransformers 框架的一个核心思想：利用 MoE 模型的稀疏激活特性，动态地将专家参数在 CPU 内存和 GPU 显存之间调度，从而极大降低对大容量 GPU 显存的依赖。