<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://project-hami.io/zh/blog</id>
    <title>HAMi Blog</title>
    <updated>2026-01-20T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://project-hami.io/zh/blog"/>
    <subtitle>HAMi Blog</subtitle>
    <icon>https://project-hami.io/zh/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[HAMi v2.8.0 Released: Full DRA Support and Highly Available Scheduling, a Step Toward Standardized GPU Resource Management]]></title>
        <id>https://project-hami.io/zh/blog/hami-v2-8-0-release</id>
        <link href="https://project-hami.io/zh/blog/hami-v2-8-0-release"/>
        <updated>2026-01-20T00:00:00.000Z</updated>
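        <summary type="html"><![CDATA[HAMi v2.8.0 has been released. It is a milestone release for architectural completeness, scheduling reliability, and ecosystem alignment, introducing DRA support, a leader election mechanism, CDI mode support, and other key features.]]></summary>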
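        <content type="html"><![CDATA[<p>The HAMi community has released <strong>HAMi v2.8.0</strong>, a milestone release for <strong>architectural completeness, scheduling reliability, and ecosystem alignment</strong>.</p>
<p>Beyond several key feature updates, v2.8.0 systematically strengthens <strong>alignment with Kubernetes-native standards, heterogeneous device support, production readiness, and observability</strong>, making HAMi a better fit for long-running AI production clusters that are sensitive to stability and upgrade paths.</p>
<p>This post walks through the main updates in v2.8.0.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="核心特性与能力更新">Core Features and Capability Updates<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E6%A0%B8%E5%BF%83%E7%89%B9%E6%80%A7%E4%B8%8E%E8%83%BD%E5%8A%9B%E6%9B%B4%E6%96%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>This section covers the core features and capability updates in HAMi v2.8.0, including standard interface support, high-availability mechanisms, and device compatibility.</p>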
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="正式支持-kubernetes-device-resource-assignmentdra">Official Support for Kubernetes Dynamic Resource Allocation (DRA)<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E6%AD%A3%E5%BC%8F%E6%94%AF%E6%8C%81-kubernetes-device-resource-assignmentdra" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi v2.8.0 adds support for <strong>Kubernetes Dynamic Resource Allocation (DRA)</strong>, delivered as a standalone project:</p>
<ul>
<li class=""><a href="https://github.com/Project-HAMi/HAMi-DRA" target="_blank" rel="noopener noreferrer" class="">https://github.com/Project-HAMi/HAMi-DRA</a></li>
</ul>
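<p>DRA is the next-generation mechanism for declaring and allocating device resources that the Kubernetes community is advancing. It aims to give GPUs, AI accelerators, and similar devices a <strong>more standardized, composable, and extensible</strong> resource management model.</p>
<p>HAMi's DRA support marks the point where the project's device resource management starts moving from "custom device scheduling logic" toward <strong>Kubernetes-native standard interfaces</strong>. This lays the groundwork for more complex GPU and AI accelerator usage patterns and opens room for HAMi's long-term evolution within the upstream ecosystem.</p>
<blockquote>
<p>DRA's design rationale, usage, and a comparison with the existing mode will be covered in a separate technical deep-dive.</p>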
</blockquote>
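<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="多-scheduler-实例的-leader-选举机制">Leader Election for Multiple Scheduler Instances<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E5%A4%9A-scheduler-%E5%AE%9E%E4%BE%8B%E7%9A%84-leader-%E9%80%89%E4%B8%BE%E6%9C%BA%E5%88%B6" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>For large clusters and high-availability deployments, HAMi v2.8.0 introduces a <strong>leader election mechanism for multiple Scheduler instances</strong> to improve the stability and operability of the scheduling layer. The mechanism offers the following benefits:</p>
<ul>
<li class="">Avoids resource conflicts caused by multiple instances scheduling concurrently</li>
<li class="">Improves the high availability of the Scheduler component</li>
<li class="">Provides a more robust operating model for long-running production clusters</li>
</ul>
<p>This makes HAMi better suited to production environments with strict stability and fault-tolerance requirements.</p>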
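<p>To make the idea concrete, here is a minimal sketch of lease-based leader election. It is an in-memory simulation for illustration, not HAMi's implementation: the <code>Lease</code> class, the <code>try_acquire</code> function, and the 15-second lease duration are all assumptions. Only the instance that currently holds an unexpired lease would be allowed to schedule.</p>

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical in-memory stand-in for a shared lease record."""
    holder: str = ""
    renew_time: float = 0.0
    duration: float = 15.0  # seconds; an assumed value

    def expired(self, now: float) -> bool:
        # An empty lease, or one not renewed within `duration`, is up for grabs.
        return self.holder == "" or now - self.renew_time >= self.duration


def try_acquire(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` is the leader after this attempt."""
    if lease.holder == candidate or lease.expired(now):
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
print(try_acquire(lease, "scheduler-a", now=0.0))   # a takes the empty lease
print(try_acquire(lease, "scheduler-b", now=5.0))   # b is rejected, a still holds it
print(try_acquire(lease, "scheduler-a", now=10.0))  # a renews its lease
print(try_acquire(lease, "scheduler-b", now=30.0))  # a stopped renewing, b takes over
```

<p>In this run <code>scheduler-b</code> only takes over once <code>scheduler-a</code> has gone a full lease duration without renewing, which is what keeps two instances from scheduling at the same time.</p>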
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="nvidia-设备支持-container-device-interfacecdi模式">NVIDIA Device Support for Container Device Interface (CDI) Mode<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#nvidia-%E8%AE%BE%E5%A4%87%E6%94%AF%E6%8C%81-container-device-interfacecdi%E6%A8%A1%E5%BC%8F" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi v2.8.0 adds support for the <strong>NVIDIA <a href="https://github.com/cncf-tags/container-device-interface" target="_blank" rel="noopener noreferrer" class="">CDI (Container Device Interface)</a></strong> mode, further reducing the coupling between device management and the container runtime. Key characteristics include:</p>
<ul>
<li class="">Uses a more standard way to inject devices</li>
<li class="">Provides clearer device declaration and lifecycle management</li>
<li class="">Lays the foundation for future multi-runtime, multi-device models</li>
</ul>
<p>Users can set the <code>deviceListStrategy</code> option in <code>values.yaml</code> to choose between the traditional environment variable mode (<code>envvar</code>) and CDI mode (<code>cdi-annotations</code>).</p>
<p>This capability moves HAMi further toward <strong>more cloud-native, composable device management</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="对齐-nvidia-k8s-device-plugin-v0180">Alignment with NVIDIA k8s-device-plugin v0.18.0<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E5%AF%B9%E9%BD%90-nvidia-k8s-device-plugin-v0180" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>In v2.8.0, HAMi upgrades to and aligns with the <strong>official NVIDIA <a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="noopener noreferrer" class="">k8s-device-plugin</a> v0.18.0</strong>, with the following goals:</p>
<ul>
<li class="">Stay compatible with NVIDIA's latest device management model</li>
<li class="">Reduce adaptation costs for users in mixed deployment scenarios</li>
<li class="">Ensure HAMi acts as an "enhancement layer" for device management and scheduling rather than a forked implementation</li>
</ul>
<p>This alignment helps users introduce HAMi smoothly into an existing NVIDIA GPU ecosystem.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mock-device-plugin-支持">Mock Device Plugin Support<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#mock-device-plugin-%E6%94%AF%E6%8C%81" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>To improve testability and development efficiency in engineering practice, v2.8.0 adds a <strong><a href="https://github.com/Project-HAMi/mock-device-plugin" target="_blank" rel="noopener noreferrer" class="">Mock Device Plugin</a></strong> capability for the following scenarios:</p>
<ul>
<li class="">Feature verification and development debugging</li>
<li class="">Device simulation in CI and test environments</li>
<li class="">Lower cost for validating new features and running regression tests</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="构建信息与-metrics-体系更新">Build Information and Metrics Updates<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E6%9E%84%E5%BB%BA%E4%BF%A1%E6%81%AF%E4%B8%8E-metrics-%E4%BD%93%E7%B3%BB%E6%9B%B4%E6%96%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi v2.8.0 supplements and tidies up its observability, specifically:</p>
<ul>
<li class="">Adds the <code>hami_build_info</code> metric</li>
<li class="">Prints more complete version and build information at startup</li>
<li class="">Removes legacy metrics that were previously marked as deprecated</li>
</ul>
<p>These improvements make version tracking, troubleshooting, and operational visibility clearer in production.</p>
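<p>As an illustration of what an info-style metric looks like, the sketch below renders <code>hami_build_info</code> in the Prometheus text exposition format. The label names (<code>version</code>, <code>revision</code>), their values, and the help text are assumptions, not the labels HAMi actually exports, and label-value escaping is omitted for brevity.</p>

```python
def render_build_info(labels: dict[str, str]) -> str:
    """Render an info-style gauge in the Prometheus text exposition format.

    The labels passed in are hypothetical; escaping of label values is skipped.
    """
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return (
        "# HELP hami_build_info Build metadata exposed as labels.\n"
        "# TYPE hami_build_info gauge\n"
        f"hami_build_info{{{pairs}}} 1\n"
    )


# Example values are made up for illustration.
print(render_build_info({"version": "v2.8.0", "revision": "abc123"}))
```

<p>The sample value is always 1; the useful information lives in the labels, which is why a metric of this shape is convenient for tracking which build runs where.</p>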
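<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="异构设备与厂商生态进展">Heterogeneous Devices and Vendor Ecosystem Progress<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E5%BC%82%E6%9E%84%E8%AE%BE%E5%A4%87%E4%B8%8E%E5%8E%82%E5%95%86%E7%94%9F%E6%80%81%E8%BF%9B%E5%B1%95" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>HAMi continues to evolve its unified management and scheduling of <strong>multiple types of GPUs and AI accelerators</strong>.</p>
<p>During the v2.8.0 cycle, the community kept advancing in the following directions:</p>
<ul>
<li class="">Adaptation and capability enhancements for different GPU and AI accelerator device models</li>
<li class="">Ongoing support and feature additions for domestic (Chinese-made) GPUs and AI chips</li>
<li class="">Continued merging of related features and bug fixes (see the GitHub PR history for details)</li>
</ul>
<p>These improvements further strengthen HAMi's usability and room for extension in heterogeneous compute environments.</p>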
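<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="上下游生态集成进展">Upstream and Downstream Ecosystem Integration<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E4%B8%8A%E4%B8%8B%E6%B8%B8%E7%94%9F%E6%80%81%E9%9B%86%E6%88%90%E8%BF%9B%E5%B1%95" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>HAMi is not only a standalone project; it also co-evolves with key components of the Kubernetes AI ecosystem. The main integration directions currently include:</p>
<ul>
<li class=""><strong>Kueue</strong>: enhancements contributed by the HAMi community to the Kueue project let it natively support HAMi's device resource management and scheduling model, bringing heterogeneous device scheduling to queue management for batch AI jobs</li>
<li class=""><strong>vLLM</strong>: compatibility issues in multi-GPU scenarios have been fixed; see the related issues (<a href="https://github.com/Project-HAMi/HAMi/issues/1461" target="_blank" rel="noopener noreferrer" class="">#1461</a> and <a href="https://github.com/Project-HAMi/HAMi/issues/1381" target="_blank" rel="noopener noreferrer" class="">#1381</a>)</li>
</ul>
<p>These integrations help users build a more complete compute scheduling and resource management solution for real AI workloads.</p>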
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="社区与项目进展">Community and Project Progress<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E7%A4%BE%E5%8C%BA%E4%B8%8E%E9%A1%B9%E7%9B%AE%E8%BF%9B%E5%B1%95" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>HAMi is not just a code repository; it is also an open source community and project organization that keeps evolving.</p>
<p>During the v2.8.0 cycle, the community remained active in the following areas:</p>
<ul>
<li class="">Real-world feedback from users and vendors, for example the <a href="https://www.cncf.io/case-studies/daocloud/" target="_blank" rel="noopener noreferrer" class="">DaoCloud builds a GPU cloud with HAMi</a> case study published on the CNCF website</li>
<li class="">Two community Meetups; see the <a class="" href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025/">first HAMi Meetup in Shanghai</a> and the <a class="" href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025/">second HAMi Meetup in Beijing</a></li>
</ul>
<p>The HAMi community welcomes more developers, users, and ecosystem partners to join the project and advance GPU virtualization and device scheduling together.</p>
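<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="总结">Summary<a href="https://project-hami.io/zh/blog/hami-v2-8-0-release#%E6%80%BB%E7%BB%93" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>HAMi v2.8.0 is an important release focused on <strong>standardization, production readiness, and ecosystem alignment</strong>.</p>
<p>By introducing DRA, strengthening scheduler high availability, aligning with mainstream device plugin and runtime standards, and continuing to expand heterogeneous device support and ecosystem integrations, HAMi is steadily maturing into a more sustainable GPU resource management and scheduling platform.</p>]]></content>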
        <author>
            <name>HAMi Community</name>
        </author>
        <category label="Release" term="Release"/>
        <category label="GPU" term="GPU"/>
        <category label="Kubernetes" term="Kubernetes"/>
        <category label="DRA" term="DRA"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Recap: The Second HAMi Meetup in Beijing]]></title>
        <id>https://project-hami.io/zh/blog/hami-meetup-beijing-2025</id>
        <link href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025"/>
        <updated>2025-12-27T00:00:00.000Z</updated>
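        <summary type="html"><![CDATA[On December 27, nearly a hundred engineers gathered at the HAMi Meetup in Beijing to share work on heterogeneous compute virtualization, scheduling strategies, and production practice.]]></summary>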
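        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="HAMi Meetup Beijing" src="https://project-hami.io/zh/assets/images/hami-meetup-beijing-banner-67403b3c1f9f43ecf07396681442376a.webp" width="1080" height="467" class="img_ev3q"></p>
<p>On December 27, the HAMi Meetup in Beijing wrapped up with nearly a hundred attendees. As the HAMi community's second in-person event, the Meetup focused on running domestic accelerators in production and on the engineering of heterogeneous scheduling. Engineers from Beike, Hygon, 4Paradigm, Kunlunxin, and other companies shared their first-hand experience.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="活动开场">Opening<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#%E6%B4%BB%E5%8A%A8%E5%BC%80%E5%9C%BA" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>In his opening remarks, Keith Chan, Linux Foundation Vice President and CNCF APAC China Chair, noted that AI development is shifting from the models themselves to a test of the underlying infrastructure and resource efficiency. High GPU costs and low utilization have become a global problem, and building more elastic AI infrastructure with cloud-native and open source technology is a challenge the whole industry shares.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="主题分享">Talks<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#%E4%B8%BB%E9%A2%98%E5%88%86%E4%BA%AB" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>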
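<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hami-新特性与能力矩阵标准化">HAMi New Features and Capability Matrix Standardization<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#hami-%E6%96%B0%E7%89%B9%E6%80%A7%E4%B8%8E%E8%83%BD%E5%8A%9B%E7%9F%A9%E9%98%B5%E6%A0%87%E5%87%86%E5%8C%96" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="HAMi community" src="https://project-hami.io/zh/assets/images/hami-community-be3c2fceafd04b8f78bfb1e13bf827f9.webp" width="1080" height="608" class="img_ev3q"></p>
<p>HAMi Maintainer Li Mengxuan (李孟轩) presented the project's technical evolution in heterogeneous compute scheduling. As a CNCF Sandbox project, HAMi has validated core capabilities such as non-intrusiveness to applications, strong isolation, and easy deployment across a range of AI accelerator scenarios.</p>
<p><strong>New features and roadmap:</strong></p>
<ul>
<li class="">Feature improvements such as CDI support and the Mock Device Plugin</li>
<li class="">Deep integration between the Ascend Device Plugin and the Volcano scheduler</li>
<li class="">A planned lightweight solution, HAMi-DRA, that simplifies the scheduling path on top of the Kubernetes DRA architecture</li>
<li class="">A device capability matrix that evaluates how well each device supports memory isolation, compute control, and related features</li>
</ul>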
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="dcu-软件虚拟化从基础到实践">DCU Software Virtualization: From Basics to Practice<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#dcu-%E8%BD%AF%E4%BB%B6%E8%99%9A%E6%8B%9F%E5%8C%96%E4%BB%8E%E5%9F%BA%E7%A1%80%E5%88%B0%E5%AE%9E%E8%B7%B5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="DCU practice" src="https://project-hami.io/zh/assets/images/dcu-practice-implementation-d149315c4ee1bac51f32952debebf04b.webp" width="1080" height="605" class="img_ev3q"></p>
<p>Wang Zhongqin (王忠勤), an R&D engineer at Hygon, shared DCU virtualization practice in cloud-native environments. He described how the hy-smi tool is used to slice vDCUs at fine granularity along the compute and memory dimensions, and how DCU-Device-Plugin works together with the HAMi scheduler.</p>
<p><strong>Key points:</strong></p>
<ul>
<li class="">Resource isolation and runtime consistency for vDCU</li>
<li class="">A standardized device plugin framework with support for multiple run modes</li>
<li class="">Using DCU-Exporter to monitor both physical DCUs and vDCUs</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="贝壳--hamivgpu-推理集群实践">Beike × HAMi: vGPU Inference Cluster in Practice<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#%E8%B4%9D%E5%A3%B3--hamivgpu-%E6%8E%A8%E7%90%86%E9%9B%86%E7%BE%A4%E5%AE%9E%E8%B7%B5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="Beike practice" src="https://project-hami.io/zh/assets/images/beike-vgpu-inference-cluster-practice-e7f909909a71986b327fa8bd756321e3.webp" width="1080" height="584" class="img_ev3q"></p>
<p>Wang Ni (王妮), a compute platform development engineer at Beike, shared lessons from rolling out HAMi for large-scale GPU management.</p>
<p><strong>Results:</strong></p>
<ul>
<li class="">Elastic vGPU pooling based on memory slicing</li>
<li class="">Coexistence of multiple GPU models and unified scheduling across clusters</li>
<li class="">Stable operation at tens of millions of requests per day</li>
<li class="">GPU utilization improved by about 3x</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hami-core-x-dra原生-dra-driver-实践">HAMi-Core x DRA: A Native DRA Driver in Practice<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#hami-core-x-dra%E5%8E%9F%E7%94%9F-dra-driver-%E5%AE%9E%E8%B7%B5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="HAMi-Core DRA" src="https://project-hami.io/zh/assets/images/hami-core-dra-architecture-4b3e27af5410b21112e5e2ddaf8f5f6d.webp" width="1080" height="809" class="img_ev3q"></p>
<p>Yang Shouren (杨守仁), an R&D engineer at 4Paradigm and HAMi Approver, shared how HAMi-Core is evolving from the traditional Device Plugin model to a native DRA Driver.</p>
<p><strong>Technical highlights:</strong></p>
<ul>
<li class="">Adopts KEP-5075 (DRA: Consumable Capacity)</li>
<li class="">Builds on the native ResourceClaim and ResourceSlice objects</li>
<li class="">Manages resources through CDI and libvgpu without intruding into application containers</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hami-设备插件新功能">New Features in HAMi Device Plugins<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#hami-%E8%AE%BE%E5%A4%87%E6%8F%92%E4%BB%B6%E6%96%B0%E5%8A%9F%E8%83%BD" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="New features" src="https://project-hami.io/zh/assets/images/device-plugin-new-features-329e2d450175d763d09fc07d054c46d9.webp" width="1080" height="605" class="img_ev3q"></p>
<p>James, a platform engineer at 4Paradigm, presented how HAMi device plugins integrate with the Volcano scheduler on Ascend hardware.</p>
<p><strong>Key improvements:</strong></p>
<ul>
<li class="">Device initialization, filtering, and allocation in the Ascend Device Plugin</li>
<li class="">A Mock Device Plugin approach that fills in missing resource dimensions such as device memory</li>
<li class="">Better observability for heterogeneous devices in multi-scheduler environments</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hami-v270-国产算力适配">HAMi v2.7.0 Domestic Accelerator Support<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#hami-v270-%E5%9B%BD%E4%BA%A7%E7%AE%97%E5%8A%9B%E9%80%82%E9%85%8D" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="HAMi v2.7.0" src="https://project-hami.io/zh/assets/images/hami-v2.7.0-domestic-compute-09f03b9b8d513d3ea1de3ed9138cf1c3.webp" width="1080" height="810" class="img_ev3q"></p>
<p>Ouyang Luwei (欧阳陆伟), an R&D engineer at 睿思智联 and HAMi Reviewer, shared engineering practice for Kunlunxin P800 vXPU.</p>
<p><strong>Highlights:</strong></p>
<ul>
<li class="">Topology-aware scheduling in HAMi-Scheduler</li>
<li class="">Sound scheduling decisions in multi-XPU, multi-node environments</li>
<li class="">Scheduling observability improvements: normalized logs and richer event information</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="交流环节">Discussion<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#%E4%BA%A4%E6%B5%81%E7%8E%AF%E8%8A%82" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Discussion session" src="https://project-hami.io/zh/assets/images/meetup-networking-session-da82c6c3d1d661902657d1714ac0539c.webp" width="1080" height="1080" class="img_ev3q"></p>
<p>Attendees had a lively discussion on GPU/DCU/XPU virtualization, strategies for co-locating inference and training, and the cost of adapting domestic accelerators.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ppt-分享">Slides<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#ppt-%E5%88%86%E4%BA%AB" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>Slides download: <a href="https://github.com/Project-HAMi/community/tree/main/hami-meetup/02-%E5%8C%97%E4%BA%AC-20251227" target="_blank" rel="noopener noreferrer" class="">HAMi Meetup Beijing slide collection</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="结语">Closing<a href="https://project-hami.io/zh/blog/hami-meetup-beijing-2025#%E7%BB%93%E8%AF%AD" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>Compute efficiency is not a single-point capability; it results from scheduling, virtualization, the software stack, and the business scenario working together. We welcome more HAMi users to share their stories and help the community grow.</p>]]></content>
        <author>
            <name>HAMi Community</name>
        </author>
        <category label="HAMi" term="HAMi"/>
        <category label="Meetup" term="Meetup"/>
        <category label="Heterogeneous Compute" term="异构算力"/>
        <category label="GPU Virtualization" term="GPU 虚拟化"/>
        <category label="Cloud Native" term="云原生"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Recap: The First HAMi Meetup in Shanghai]]></title>
        <id>https://project-hami.io/zh/blog/hami-meetup-shanghai-2025</id>
        <link href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025"/>
        <updated>2025-11-30T00:00:00.000Z</updated>
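        <summary type="html"><![CDATA[On November 30, the HAMi Meetup in Shanghai was held successfully. Technical experts from NIO, MetaX, DaoCloud, Transwarp, and other companies shared how they use HAMi for GPU virtualization, heterogeneous compute scheduling, and domestic accelerator support.]]></summary>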
        <content type="html"><![CDATA[<p>On November 30, the first HAMi Meetup concluded in Shanghai. Under the theme "compete on efficiency, not raw compute", nearly a hundred AI developers, operations engineers, and enterprise IT architects gathered to focus on the core questions of heterogeneous compute scheduling.</p>
<p><img decoding="async" loading="lazy" alt="Venue" src="https://project-hami.io/zh/assets/images/meetup-banner-df70f255ed1fa846a80be03f4ba4085b.png" width="1080" height="608" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="开场云原生-ai-基础设施">Opening: Cloud-Native AI Infrastructure<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E5%BC%80%E5%9C%BA%E4%BA%91%E5%8E%9F%E7%94%9F-ai-%E5%9F%BA%E7%A1%80%E8%AE%BE%E6%96%BD" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>In his opening talk, <strong>Keith Chan</strong>, Linux Foundation Vice President and CNCF APAC China Chair, pointed out:</p>
<ul>
<li class="">High GPU costs and low resource utilization have become a global problem</li>
<li class="">70%–80% of inference and training workloads already run on Kubernetes</li>
<li class="">More than 80% of enterprises consider "open source a key driver of AI maturity"</li>
<li class="">CNCF is advancing the Certified AI Platform for Kubernetes standardization program</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Keith's talk" src="https://project-hami.io/zh/assets/images/keith-opening-keynote-36addf74e451f656fa1b2b9b127cf226.png" width="1080" height="608" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="技术分享回顾">Talk Recap<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E6%8A%80%E6%9C%AF%E5%88%86%E4%BA%AB%E5%9B%9E%E9%A1%BE" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hami-270---280-版本演进">HAMi 2.7.0 to 2.8.0 Release Evolution<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#hami-270---280-%E7%89%88%E6%9C%AC%E6%BC%94%E8%BF%9B" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="HAMi version evolution" src="https://project-hami.io/zh/assets/images/hami-version-evolution-6e83afca22fe0cb5d31ba2d46de28da3.png" width="1080" height="608" class="img_ev3q"></p>
<p>HAMi core Maintainer <strong>Li Mengxuan (李孟轩)</strong> walked through how capabilities evolved from 2.7.0 to 2.8.0:</p>
<p><strong>2.7.0 usability improvements:</strong></p>
<ul>
<li class="">Visible scheduling reasons: see at a glance why a Pod is Pending</li>
<li class="">Improved resource quota monitoring: fixes the quota distortion introduced by virtualization</li>
</ul>
<p><strong>Ecosystem support:</strong></p>
<ul>
<li class="">GPUs from 9 vendors already supported</li>
<li class="">Extended to heterogeneous devices such as Kunlunxin XPU and AWS Trainium/Inferentia</li>
<li class="">A Web UI that presents capabilities in a friendlier way</li>
</ul>
<p><strong>2.8.0 roadmap:</strong></p>
<ul>
<li class="">Improve scheduling performance and the Web UI's support for heterogeneous devices</li>
<li class="">Use DRA to consolidate the existing scheduler and device plugin capabilities into a new DRA driver</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="metax-sgpu-on-hami">MetaX sGPU on HAMi<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#metax-sgpu-on-hami" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="MetaX sGPU" src="https://project-hami.io/zh/assets/images/metax-sgpu-hami-3ce696a9c01f132129de588ed65b47ed.png" width="1080" height="608" class="img_ev3q"></p>
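<p><strong>Guo Lei (郭磊)</strong>, a cloud-native infrastructure expert at MetaX, shared how sGPU has landed in the HAMi community:</p>
<p><strong>Core capabilities:</strong></p>
<ul>
<li class="">Device memory is configurable at 1 MB granularity and compute at 1% granularity</li>
<li class="">Pods can request virtual GPU resources such as "60% compute + 4 GB memory" on demand</li>
<li class="">Node-level and GPU-level binpack/spread policies can be combined flexibly</li>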
</ul>
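<p>The GPU-level half of the binpack/spread choice can be sketched as follows. This is a simplified illustration, not MetaX's or HAMi's actual scoring logic: the <code>Gpu</code> class, the <code>pick_gpu</code> function, and the ordering rule (compare free compute first, then free memory) are all assumptions.</p>

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    free_cores: int    # remaining compute, in percent of one GPU
    free_mem_mb: int   # remaining device memory, in MB


def pick_gpu(gpus, cores, mem_mb, policy="binpack"):
    """Pick a GPU that fits the request; return None if nothing fits."""
    fits = [g for g in gpus if g.free_cores >= cores and g.free_mem_mb >= mem_mb]
    if not fits:
        return None
    # binpack prefers the fullest GPU that still fits, spread the emptiest one.
    key = lambda g: (g.free_cores, g.free_mem_mb)
    return min(fits, key=key) if policy == "binpack" else max(fits, key=key)


gpus = [Gpu("gpu-0", 100, 65536), Gpu("gpu-1", 70, 8192)]
# A request for 60% compute and 4 GB (4096 MB) of memory.
print(pick_gpu(gpus, 60, 4096, "binpack").name)  # packs onto the busier gpu-1
print(pick_gpu(gpus, 60, 4096, "spread").name)   # spreads onto the idle gpu-0
```

<p>Applying the same choice once across nodes and once across the GPUs inside the chosen node gives the four node-level and GPU-level combinations mentioned above.</p>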
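<p><strong>QoS policies:</strong></p>
<ul>
<li class="">Multiple policies: best effort, fixed share, and burst share</li>
<li class="">Online/offline co-location, with low-priority tasks paused automatically when resources are tight</li>
<li class="">Topology-aware scheduling to optimize communication paths</li>
</ul>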
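<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="基于-vgpu-的性能优化">vGPU-Based Performance Optimization<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E5%9F%BA%E4%BA%8E-vgpu-%E7%9A%84%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="Performance optimization" src="https://project-hami.io/zh/assets/images/performance-optimization-9fd8b4a7055399e6625aebe0893bfaa3.png" width="1080" height="608" class="img_ev3q"></p>
<p><strong>Li Peng (李鹏)</strong>, head of training acceleration in NIO's cloud engineering department, shared a performance diagnosis framework for virtualized environments:</p>
<p><strong>Core approach:</strong></p>
<ul>
<li class="">HAMi's perf-based virtualization mechanism provides non-intrusive data collection</li>
<li class="">Intercepts core GPU libraries such as CUDA, cuBLAS, and NVML without changes to application code</li>
<li class="">Builds CPU-side and GPU-side timelines to accurately reconstruct task execution state</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li class="">Already used in NIO's autonomous driving training workloads</li>
<li class="">Identifies key bottlenecks such as low parallelism and communication stalls</li>
<li class="">Closes the diagnostic loop from low-level bottleneck location to upper-level code tracing</li>
</ul>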
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="daocloud-drun-的-gpu-虚拟化实践">GPU Virtualization at DaoCloud d.run<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#daocloud-drun-%E7%9A%84-gpu-%E8%99%9A%E6%8B%9F%E5%8C%96%E5%AE%9E%E8%B7%B5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="DaoCloud practice" src="https://project-hami.io/zh/assets/images/daocloud-drun-practice-d376020e3068862555486934a4b0afca.png" width="1080" height="608" class="img_ev3q"></p>
<p>DaoCloud product lead <strong>Lu Chuanjia (卢传佳)</strong> shared how the d.run intelligent compute scheduling platform is used for SaaS GPU rental:</p>
<p><strong>Challenges:</strong></p>
<ul>
<li class="">Self-hosted enterprise deployments are limited to whole-GPU allocation, which caps utilization</li>
<li class="">The SaaS rental model is exposed to swings in supply and demand, which fragments the GPU pool</li>
</ul>
<p><strong>What HAMi brings:</strong></p>
<ul>
<li class="">Dynamic slicing and oversubscription significantly reduce fragmentation</li>
<li class="">A single GPU can be offered as more SKUs (3G/6G/12G/24G, etc.)</li>
<li class="">Dynamic memory expansion avoids OOM-triggered container restarts</li>
<li class="">Supports multi-cluster pooling, unified scheduling of domestic GPUs, and tenant-level priority and preemption</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="星环科技国产算力适配实践">Transwarp: Domestic Accelerator Adaptation in Practice<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E6%98%9F%E7%8E%AF%E7%A7%91%E6%8A%80%E5%9B%BD%E4%BA%A7%E7%AE%97%E5%8A%9B%E9%80%82%E9%85%8D%E5%AE%9E%E8%B7%B5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p><img decoding="async" loading="lazy" alt="Transwarp" src="https://project-hami.io/zh/assets/images/transwarp-technology-40c8f6e8634aa44cbefa35b1e7be5301.png" width="1080" height="608" class="img_ev3q"></p>
<p><strong>Hou Yuxi (侯雨希)</strong>, an AI tooling platform developer at Transwarp, spoke about adapting an LLMOps platform to domestic accelerators such as Cambricon and Hygon:</p>
<p><strong>Cambricon adaptation:</strong></p>
<ul>
<li class="">Worked around the granularity limits of sMLU dynamic slicing</li>
<li class="">Addressed resource-name isolation across models and hard-coded memory units</li>
<li class="">Manages multiple models through node labels and model detection</li>
</ul>
<p><strong>Hygon DCU:</strong></p>
<ul>
<li class="">Resolved non-unique device IDs by reading hardware serial numbers through the driver SDK</li>
<li class="">Rewrote the exporter logic so that metrics stay consistent with scheduling</li>
</ul>
<p><strong>Future directions:</strong></p>
<ul>
<li class="">DRA will become the unified abstraction for heterogeneous GPUs</li>
<li class="">Complete the move from custom mechanisms to Kubernetes-native ones</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="交流环节">Discussion<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E4%BA%A4%E6%B5%81%E7%8E%AF%E8%8A%82" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Discussion session" src="https://project-hami.io/zh/assets/images/networking-session-bdecac27029f29a01eb7003e9361556f.png" width="1080" height="1080" class="img_ev3q"></p>
<p>Attendees had a lively discussion on GPU virtualization practice, model inference efficiency, and experience adapting domestic accelerators.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ppt-分享">Slides<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#ppt-%E5%88%86%E4%BA%AB" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>Slides download: <a href="https://github.com/Project-HAMi/community/tree/main/hami-meetup/01-shanghai-20251130" target="_blank" rel="noopener noreferrer" class="">HAMi Meetup Shanghai slide collection</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="结语">Closing<a href="https://project-hami.io/zh/blog/hami-meetup-shanghai-2025#%E7%BB%93%E8%AF%AD" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>This HAMi Meetup focused on improving compute efficiency and produced a number of enterprise-grade, hands-on talks. From technical details to business fit, and from open source ecosystem building to domestic accelerator innovation, the discussions showed how much attention the industry is paying to cloud-native AI infrastructure.</p>
<p>Going forward, the HAMi community will keep using open source as the link to bring in more industry partners and dig deeper into heterogeneous compute scheduling. We also look forward to more community users and practitioners contributing posts and sharing their experience!</p>]]></content>
        <author>
            <name>HAMi Community</name>
        </author>
        <category label="Meetup" term="Meetup"/>
        <category label="Shanghai" term="上海"/>
        <category label="Heterogeneous Compute Scheduling" term="异构算力调度"/>
        <category label="GPU Virtualization" term="GPU 虚拟化"/>
        <category label="Kubernetes" term="Kubernetes"/>
        <category label="Domestic Accelerators" term="国产算力"/>
        <category label="AI Training and Inference Optimization" term="AI 训练与推理优化"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[A Source Code Walkthrough of GPU Pod Scheduling in HAMi]]></title>
        <id>https://project-hami.io/zh/blog/2024/12/31/post</id>
        <link href="https://project-hami.io/zh/blog/2024/12/31/post"/>
        <updated>2024-12-31T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When using HAMi, it is common for a newly created Pod to fail to start, most often with one of the following two symptoms:]]></summary>
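        <content type="html"><![CDATA[<p>When using HAMi, it is common for a newly created Pod to fail to start, most often with one of the following two symptoms:</p>
<ul>
<li class="">Pod UnexpectedAdmissionError</li>
<li class="">Pod Pending</li>
</ul>
<p>With that in mind, this post is a rough walkthrough of the relevant code. It aims to explain how the components interact during scheduling and how resources are calculated; other details are left out.</p>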
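<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="调度流程">Scheduling Flow<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E8%B0%83%E5%BA%A6%E6%B5%81%E7%A8%8B" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>Before reading the code, take a look at the official documentation, which describes the overall flow fairly clearly:</p>
<p><img decoding="async" loading="lazy" src="https://github.com/Project-HAMi/HAMi/blob/master/docs/develop/imgs/flowchart.jpeg?raw=true" alt="flowchart" class="img_ev3q"></p>
<p>In detail, the flow can be split into three phases:</p>
<ul>
<li class="">
<p>Preparation phase: the diagram shows several prerequisites, such as the Mutating Webhook and the device-plugin.
This phase therefore covers setting up those prerequisites, which is only needed when the services first start.</p>
<p><img decoding="async" loading="lazy" src="https://github.com/elrondwong/elrond.wang/raw/master/img/posts/Hami-GPU-Pod-Scheduler/%E5%87%86%E5%A4%87%E5%B7%A5%E4%BD%9C.png" alt="Preparation before Pod creation" class="img_ev3q"></p>
</li>
<li class="">
<p>Pod scheduling phase: once preparation is complete, the Pod enters the processing flow and gets scheduled</p>
</li>
<li class="">
<p>Pod startup phase: how the Pod interacts with the GPUs on the Node, and so on</p>
</li>
</ul>
<p>This post looks closely at the preparation phase, with scheduling analysis as its main subject.</p>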
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pod-调度流程">Pod Scheduling Flow<a href="https://project-hami.io/zh/blog/2024/12/31/post#pod-%E8%B0%83%E5%BA%A6%E6%B5%81%E7%A8%8B" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<ul>
<li class="">The user sends a Pod creation request to kube-apiserver</li>
<li class="">The Admission Webhook is triggered and updates schedulerName in the Pod</li>
<li class="">Based on schedulerName, kube-apiserver hands the request to the scheduler</li>
<li class="">The scheduler processes it<!-- -->
<ul>
<li class="">Collect Node device information: gathered from node annotations, which the <code>hami-device-plugin</code> DaemonSet writes periodically</li>
<li class="">Score nodes based on the device information and the Pod's limits, then pick the highest-scoring node</li>
<li class="">Bind the Pod to the node, after which the Pod is created</li>
</ul>
</li>
</ul>
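<p>The scheduler steps above can be sketched as a filter-and-score loop. The annotation key, its JSON payload, and the scoring rule below are hypothetical and chosen for readability; they are not HAMi's actual annotation format or scoring logic.</p>

```python
import json

ANNOTATION = "example.com/node-devices"  # hypothetical key, not HAMi's real one


def node_score(annotations, want_mem_mb):
    """Return a score for the node, or None if no device can hold the request."""
    devices = json.loads(annotations.get(ANNOTATION, "[]"))
    free = [d["total_mb"] - d["used_mb"] for d in devices]
    if not any(f >= want_mem_mb for f in free):
        return None
    # Assumed scoring rule: prefer the node with the most free memory overall.
    return sum(free)


def pick_node(nodes, want_mem_mb):
    """`nodes` maps a node name to its annotations; return the best node or None."""
    scored = {}
    for name, annotations in nodes.items():
        score = node_score(annotations, want_mem_mb)
        if score is not None:
            scored[name] = score
    return max(scored, key=scored.get) if scored else None


nodes = {
    "node1": {ANNOTATION: json.dumps([{"total_mb": 16384, "used_mb": 15000}])},
    "node2": {ANNOTATION: json.dumps([{"total_mb": 16384, "used_mb": 4096}])},
    "node3": {},  # the device plugin has not reported any devices yet
}
print(pick_node(nodes, 4096))
```

<p>A node whose annotation is missing is filtered out in this sketch, which mirrors why a Pod can stay Pending when the device plugin has not yet written its device information.</p>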
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="常见问题排查">Troubleshooting Common Issues<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="pod-unexpectedadmissionerror">Pod UnexpectedAdmissionError<a href="https://project-hami.io/zh/blog/2024/12/31/post#pod-unexpectedadmissionerror" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>The Pod status shows <code>UnexpectedAdmissionError</code>.</p>
<p>Given the flow above, this error means kube-apiserver failed to call the extended scheduler. There are two likely causes; anything else requires checking the kube-apiserver logs.</p>
<ul>
<li class="">Communication failure: the HTTPS port of the extended scheduler is unreachable from kube-apiserver, which can happen in several ways<!-- -->
<ul>
<li class="">DNS resolution fails</li>
<li class="">Cross-node networking is broken</li>
<li class="">The extended scheduler service itself is unhealthy</li>
</ul>
</li>
<li class="">TLS verification error: the message is usually <code>webhook x509: certificate signed by unknown authority</code>. The Helm chart deployment includes a <code>jobs.batch</code> named <code>hami-vgpu.admission-pathch</code>; if that job has not completed, this error appears</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="调度问题">Scheduling Issues<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E8%B0%83%E5%BA%A6%E9%97%AE%E9%A2%98" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>The container stays in the Pending state. <code>kubectl describe</code> shows the specific reason, most commonly one of the following:</p>
<ul>
<li class="">
<p><code>card Insufficient remaining memory</code></p>
</li>
<li class="">
<p><code>calcScore:node not fit pod</code></p>
<p>The usual cause is either a genuine shortage of resources or a misconfiguration, meaning devicememoryscaling is not set as intended.
It can be configured in two places, and the node-level setting takes precedence over the global one. A common pitfall is that name must match the node name shown by kubectl get node for the setting to take effect.</p>
</li>
<li class="">
<p>Global configuration: <code>kubectl get cm hami-scheduler-device</code></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">deviceMemoryScaling</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><br></span></code></pre></div></div>
</li>
<li class="">
<p>Node configuration: <code>kubectl get cm hami-device-plugin</code></p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token property">"nodeconfig"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"name"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"node1"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"devicememoryscaling"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span 
class="token property">"devicesplitcount"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"migstrategy"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"none"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"filterdevices"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token property">"uuid"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token property">"index"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><br></span></code></pre></div></div>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mutatingwebhook">MutatingWebhook<a href="https://project-hami.io/zh/blog/2024/12/31/post#mutatingwebhook" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
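<p>Kubernetes provides admission webhooks, which are triggered by operations on Kubernetes resources. Their most common use is to intercept Pod creation and patch the Pod's YAML, for example by adding an init container that injects files.</p>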
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="webhook-配置">Webhook configuration<a href="https://project-hami.io/zh/blog/2024/12/31/post#webhook-%E9%85%8D%E7%BD%AE" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>hami-webhook:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook </span><span class="token parameter variable" style="color:rgb(191, 199, 213)">-o</span><span class="token plain"> yaml</span><br></span></code></pre></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> admissionregistration.k8s.io/v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> MutatingWebhookConfiguration</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">creationTimestamp</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2024-12-10T03:50:37Z"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">generation</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/managed-by</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Helm</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2307810"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">uid</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 2cdcebe4</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">f561</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">429f</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">9480</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token number" style="color:rgb(247, 140, 108)">701e65980687</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">webhooks</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">admissionReviewVersions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> v1beta1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">clientConfig</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">caBundle</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">:</span><span class="token plain"> LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">service</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">path</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span 
class="token plain"> /webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">port</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">443</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">failurePolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">matchPolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Equivalent</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> vgpu.hami.io</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">namespaceSelector</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">matchExpressions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">key</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token 
plain"> hami.io/webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">operator</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> NotIn</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">objectSelector</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">matchExpressions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">key</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami.io/webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">operator</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> NotIn</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key 
atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">reinvocationPolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Never</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">rules</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">apiGroups</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">""</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">apiVersions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> v1</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">operations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> CREATE</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">resources</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> pods</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">scope</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'*'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">sideEffects</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> None</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">timeoutSeconds</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><br></span></code></pre></div></div>
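<p>When a Pod is created, the API server calls <code>https://hami-scheduler.kube-system:443/webhook</code> over TLS and verifies the server certificate against the CA configured in <code>caBundle</code>.
The webhook is not triggered when the namespace carries the label <code>hami.io/webhook: ignore</code>; the <code>objectSelector</code> applies the same rule to labels on the Pod itself. Because <code>failurePolicy</code> is <code>Ignore</code>, a failed webhook call does not block Pod creation.</p>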
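<p>The selector rule described above can be sketched in a few lines of Go. This is only an illustration of how a <code>NotIn</code> match expression behaves, not code from HAMi or Kubernetes; <code>notInIgnore</code> and <code>shouldCallWebhook</code> are hypothetical helper names introduced here.</p>
<pre><code class="language-golang">package main

import "fmt"

// Label key and value taken from the webhook configuration shown above.
const (
	webhookLabel = "hami.io/webhook"
	ignoreValue  = "ignore"
)

// notInIgnore models one "NotIn [ignore]" match expression: it matches when
// the label is absent or has any value other than "ignore".
func notInIgnore(labels map[string]string) bool {
	return labels[webhookLabel] != ignoreValue
}

// shouldCallWebhook is a hypothetical helper. The webhook is called only when
// both the namespaceSelector and the objectSelector match.
func shouldCallWebhook(nsLabels, podLabels map[string]string) bool {
	if !notInIgnore(nsLabels) {
		return false
	}
	return notInIgnore(podLabels)
}

func main() {
	ignored := map[string]string{webhookLabel: ignoreValue}
	fmt.Println(shouldCallWebhook(nil, nil))     // true
	fmt.Println(shouldCallWebhook(ignored, nil)) // false
	fmt.Println(shouldCallWebhook(nil, ignored)) // false
}
</code></pre>
<p>An unlabeled namespace and Pod still match, which is why the webhook is opt-out rather than opt-in.</p>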
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="webhook-server-实现">Webhook Server implementation<a href="https://project-hami.io/zh/blog/2024/12/31/post#webhook-server-%E5%AE%9E%E7%8E%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>This requires an HTTP server that serves TLS and exposes a <code>/webhook</code> endpoint.</p>
<p>cmd/scheduler/main.go:84</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ...</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/webhook", routes.WebHookRoute())</span><br></span></code></pre></div></div>
<p><code>WebHookRoute</code> needs to implement the interface defined at <code>sigs.k8s.io/controller-runtime@v0.16.3/pkg/webhook/admission/webhook.go:98</code>.</p>
<p>pkg/scheduler/webhook.go:52</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> pod := &amp;corev1.Pod{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err := h.decoder.Decode(req, pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf("Failed to decode request: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return admission.Errored(http.StatusBadRequest, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(pod.Spec.Containers) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Warningf(template+" - Denying admission as pod has no containers", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return admission.Denied("pod has no containers")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof(template, req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> hasResource := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for idx, ctr := range pod.Spec.Containers 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  c := &amp;pod.Spec.Containers[idx]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ctr.SecurityContext != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if ctr.SecurityContext.Privileged != nil &amp;&amp; *ctr.SecurityContext.Privileged {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Warningf(template+" - Denying admission as container %s is privileged", req.Namespace, req.Name, req.UID, c.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   found, err := val.MutateAdmission(c, pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Errorf("validating pod failed:%s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    return admission.Errored(http.StatusInternalServerError, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   hasResource = hasResource || found</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !hasResource {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof(template+" - Allowing admission for pod: no resource found", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  //return admission.Allowed("no resource found")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> } else if len(config.SchedulerName) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  pod.Spec.SchedulerName = config.SchedulerName</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if pod.Spec.NodeName != "" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof(template+" - Pod already has node assigned", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return admission.Denied("pod has node assigned")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> marshaledPod, err := json.Marshal(pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf(template+" - Failed to marshal pod, error: %v", req.Namespace, req.Name, req.UID, err)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  return admission.Errored(http.StatusInternalServerError, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)</span><br></span></code></pre></div></div>
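<p>The handler decides whether a Pod should go through the extended scheduler mainly by inspecting the resources of its containers: if any device type reports a match and a scheduler name is configured, <code>pod.Spec.SchedulerName</code> is set to that name.</p>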
<p>pkg/device/nvidia/device.go:246</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (dev *NvidiaGPUDevices) MutateAdmission(ctr *corev1.Container, p *corev1.Pod) (bool, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*gpu related */</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> priority, ok := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourcePriority)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctr.Env = append(ctr.Env, corev1.EnvVar{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Name:  util.TaskPriority,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Value: fmt.Sprint(priority.Value()),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceNameOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCountName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if resourceNameOK {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return resourceNameOK, 
nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceCoresOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCoreName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceMemOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceMemoryName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceMemPercentageOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceMemoryPercentageName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if resourceCoresOK || resourceMemOK || resourceMemPercentageOK {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if dev.config.DefaultGPUNum &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCountName)] = *resource.NewQuantity(int64(dev.config.DefaultGPUNum), resource.BinarySI)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   resourceNameOK = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !resourceNameOK &amp;&amp; 
dev.config.OverwriteEnv {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctr.Env = append(ctr.Env, corev1.EnvVar{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Name:  "NVIDIA_VISIBLE_DEVICES",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Value: "none",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return resourceNameOK, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
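<p>It checks whether the container's resource limits contain any of the resource names configured in <code>device-config.yaml</code>; if they do, the Pod goes through the HAMi scheduling flow.</p>
<p><code>device-config</code>, using NVIDIA GPUs as an example:</p>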
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">nvidia</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceCountName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceMemoryName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceMemoryPercentageName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpumem</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">percentage</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceCoreName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourcePriorityName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 
nvidia.com/priority</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">overwriteEnv</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token boolean important" style="color:rgb(255, 88, 116)">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultMemory</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultCores</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultGPUNum</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceSplitCount</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceMemoryScaling</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceCoreScaling</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><br></span></code></pre></div></div>
<p>Once a Pod is determined to go through the HAMi scheduling flow, its <code>schedulerName</code> is patched to the name of the HAMi scheduler.</p>
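<p>As a rough illustration of that patch step, the sketch below builds a JSON Patch (RFC 6902) document that sets <code>spec.schedulerName</code>. It is not HAMi's webhook code: the <code>patchOp</code> type and the <code>schedulerNamePatch</code> helper are hypothetical names, and only the Go standard library is assumed.</p>

```go
package main

import "encoding/json"

// patchOp is one RFC 6902 JSON Patch operation (hypothetical helper type).
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// schedulerNamePatch returns a JSON Patch that sets spec.schedulerName.
// "add" on an object member replaces the value if the member already exists,
// so the same patch works whether or not the Pod has a scheduler name set.
func schedulerNamePatch(name string) ([]byte, error) {
	ops := []patchOp{{Op: "add", Path: "/spec/schedulerName", Value: name}}
	return json.Marshal(ops)
}
```

<p>A mutating admission webhook would typically return bytes like these as the patch in its admission response; calling <code>schedulerNamePatch("hami-scheduler")</code> yields a single-operation patch for the scheduler name used in the configuration shown in this post.</p>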
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="拓展-k8s-scheduler">Extending the k8s scheduler<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E6%8B%93%E5%B1%95-k8s-scheduler" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>With <a href="https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/" target="_blank" rel="noopener noreferrer" class="">KubeSchedulerConfiguration</a>, the scheduler can be extended by registering an extender that implements extension points.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubeschedulerconfiguration">KubeSchedulerConfiguration<a href="https://project-hami.io/zh/blog/2024/12/31/post#kubeschedulerconfiguration" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">kubectl get cm hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">newversion </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">o yaml</span><br></span></code></pre></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">data</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">config.yaml</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">|</span><span class="token scalar string" style="color:rgb(195, 232, 141)"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    apiVersion: kubescheduler.config.k8s.io/v1beta2</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    kind: KubeSchedulerConfiguration</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    leaderElection:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      leaderElect: false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    profiles:</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    - schedulerName: hami-scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    extenders:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    - urlPrefix: "https://127.0.0.1:443"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      filterVerb: filter</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      bindVerb: bind</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      nodeCacheCapable: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      weight: 1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      httpTimeout: 30s</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      enableHTTPS: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      tlsConfig:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        insecure: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      managedResources:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: 
nvidia.com/gpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpumem-percentage</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/priority</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: cambricon.com/vmlu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: 
hygon.com/dcunum</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: hygon.com/dcumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: hygon.com/dcucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: iluvatar.ai/vgpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> ConfigMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">creationTimestamp</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2024-12-10T03:50:36Z"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/component</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/instance</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token 
plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/managed-by</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Helm</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/version</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 2.4.1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">helm.sh/chart</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">2.4.1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">newversion</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2316275"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">uid</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 3a61a72c</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">0bab</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">432f</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">b4d7</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">5c1ae46ee14d</span><br></span></code></pre></div></div>
<p>The extender hooks into the scheduler through <a href="https://kubernetes.io/docs/reference/scheduling/config/#extension-points" target="_blank" rel="noopener noreferrer" class="">extension points</a>; here it extends filter and bind.</p>
<ul>
<li class="">filter: 找到最合适的 node</li>
<li class="">bind: 为 Pod 创建一个 binding 资源</li>
</ul>
<p>During scheduling, the extender's implementations are called in extension-point order: here <code>https://127.0.0.1:443/filter</code> is called first, then <code>https://127.0.0.1:443/bind</code>.</p>
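<p>To make the request flow concrete, here is a minimal sketch of an extender that serves both verbs over HTTP. It assumes only the Go standard library; <code>filterArgs</code>, <code>filterResult</code>, <code>keepSchedulable</code> and <code>newExtenderMux</code> are hypothetical, simplified stand-ins and do not reproduce the real extender v1 types or HAMi's handlers.</p>

```go
package main

import (
	"encoding/json"
	"net/http"
)

// filterArgs and filterResult are simplified, hypothetical stand-ins for the
// request and response bodies exchanged on the filter verb.
type filterArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

type filterResult struct {
	NodeNames   []string          `json:"nodeNames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

// keepSchedulable splits the candidate nodes into those accepted by ok and
// those rejected, recording a reason for each rejected node.
func keepSchedulable(nodes []string, ok func(string) bool) filterResult {
	res := filterResult{FailedNodes: map[string]string{}}
	for _, n := range nodes {
		if ok(n) {
			res.NodeNames = append(res.NodeNames, n)
		} else {
			res.FailedNodes[n] = "node rejected by filter"
		}
	}
	return res
}

// newExtenderMux registers one handler per verb. The scheduler builds each
// URL as urlPrefix + "/" + verb, hence the "/filter" and "/bind" paths.
func newExtenderMux(ok func(string) bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
		var args filterArgs
		var res filterResult
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			// Report decode failures in the body, as the filter route does.
			res = filterResult{Error: err.Error()}
		} else {
			res = keepSchedulable(args.NodeNames, ok)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(res)
	})
	mux.HandleFunc("/bind", func(w http.ResponseWriter, r *http.Request) {
		// A real extender would create the Binding here; this sketch only
		// acknowledges the request.
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{}`))
	})
	return mux
}
```

<p>Passing the returned mux to <code>http.ListenAndServeTLS</code> would expose the two endpoints; the certificate setup and the actual binding logic are left out of this sketch.</p>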
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="拓展调度器-http-server-启动">Starting the extender HTTP server<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E6%8B%93%E5%B1%95%E8%B0%83%E5%BA%A6%E5%99%A8-http-server-%E5%90%AF%E5%8A%A8" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p><code>cmd/scheduler/main.go:70</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher = scheduler.NewScheduler()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher.Start()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer sher.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start monitor metrics</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go sher.RegisterFromNodeAnnotations()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go initMetrics(config.MetricsBindAddress)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start http server</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router := httprouter.New()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/filter", routes.PredicateRoute(sher))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/bind", 
routes.Bind(sher))</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="filter-实现">filter implementation<a href="https://project-hami.io/zh/blog/2024/12/31/post#filter-%E5%AE%9E%E7%8E%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p><code>pkg/scheduler/routes/route.go:41</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func PredicateRoute(s *scheduler.Scheduler) httprouter.Handle {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infoln("Into Predicate Route outer func")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return func(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("Into Predicate Route inner func")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  checkBody(w, r)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var buf bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  body := io.TeeReader(r.Body, &amp;buf)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderArgs extenderv1.ExtenderArgs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderFilterResult *extenderv1.ExtenderFilterResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
if err := json.NewDecoder(body).Decode(&amp;extenderArgs); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("decode error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderFilterResult = &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderFilterResult, err = s.Filter(extenderArgs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Errorf("pod %v filter error, %v", extenderArgs.Pod.Name, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    extenderFilterResult = &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if resultBody, err := json.Marshal(extenderFilterResult); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span 
class="token plain">   klog.Errorf("Failed to marshal extenderFilterResult: %+v, %+v",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    err, extenderFilterResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusInternalServerError)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write([]byte(err.Error()))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusOK)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write(resultBody)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p><code>pkg/scheduler/scheduler.go:430</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) Filter(args extenderv1.ExtenderArgs) (*extenderv1.ExtenderFilterResult, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("begin schedule filter", "pod", args.Pod.Name, "uuid", args.Pod.UID, "namespaces", args.Pod.Namespace)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nums := k8sutil.Resourcereqs(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, n := range nums {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, k := range n {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   total += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if total == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(1).Infof("pod %v not find resource", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("does not request any resource"))</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   NodeNames:   args.NodeNames,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: nil,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error:       "",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos := args.Pod.Annotations</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeUsage, failedNodes, err := s.getNodesUsage(args.NodeNames, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(failedNodes) != 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(5).InfoS("getNodesUsage failed nodes", "nodes", failedNodes)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := 
s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len((*nodeScores).NodeList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(4).Infof("All node scores do not meet for pod %v", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("no available node, all node scores do not meet"))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: failedNodes,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(4).Infoln("nodeScores_len=", len((*nodeScores).NodeList))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sort.Sort(nodeScores)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> m := (*nodeScores).NodeList[len((*nodeScores).NodeList)-1]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("schedule %v/%v to %v %v", args.Pod.Namespace, args.Pod.Name, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedNodeAnnotations] = m.NodeID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.PatchAnnotations(&amp;annotations, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //InRequestDevices := util.EncodePodDevices(util.InRequestDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //supportDevices := util.EncodePodDevices(util.SupportDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, InRequestDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, supportDevices)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> s.addPod(args.Pod, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(args.Pod, annotations)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringSucceed, []string{m.NodeID}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := extenderv1.ExtenderFilterResult{NodeNames: &amp;[]string{m.NodeID}}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
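<p>The core logic here has two steps: collect each node's resource usage, then compute a score for every node from its allocated and total resources and pick the node with the highest score.</p>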
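<p>The selection step can be sketched as follows. This is a minimal illustration, not the HAMi implementation: <code>NodeScore</code>, <code>NodeScoreList</code> and <code>pickBest</code> are hypothetical names, and the ascending sort order is an assumption inferred from the code above taking the last element after <code>sort.Sort</code>.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// NodeScore is a hypothetical, simplified stand-in for a per-node score entry.
type NodeScore struct {
	NodeID string
	Score  float32
}

// NodeScoreList implements sort.Interface, ordering nodes by ascending score.
type NodeScoreList []NodeScore

func (l NodeScoreList) Len() int           { return len(l) }
func (l NodeScoreList) Swap(i, j int)      { l[i], l[j] = l[j], l[i] }
func (l NodeScoreList) Less(i, j int) bool { return l[i].Score < l[j].Score }

// pickBest sorts ascending and returns the last element, which is the
// highest-scoring node. The boolean is false when no node qualified.
func pickBest(scores NodeScoreList) (NodeScore, bool) {
	if len(scores) == 0 {
		return NodeScore{}, false
	}
	sort.Sort(scores)
	return scores[len(scores)-1], true
}

func main() {
	best, ok := pickBest(NodeScoreList{{"node-a", 1.5}, {"node-b", 3.0}, {"node-c", 2.25}})
	fmt.Println(best.NodeID, ok)
}
```

<p>Returning a boolean alongside the result mirrors the empty-list check in the filter code, where an empty score list means no node can host the Pod.</p>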
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="获取节点资源信息">Getting node resource information<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E8%8E%B7%E5%8F%96%E8%8A%82%E7%82%B9%E8%B5%84%E6%BA%90%E4%BF%A1%E6%81%AF" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h5>
<p><code>pkg/scheduler/scheduler.go:241</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) getNodesUsage(nodes *[]string, task *corev1.Pod) (*map[string]*NodeUsage, map[string]string, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> overallnodeMap := make(map[string]*NodeUsage)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> cachenodeMap := make(map[string]*NodeUsage)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> failedNodes := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //for _, nodeID := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> allNodes, err := s.ListNodes()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;overallnodeMap, failedNodes, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, node := range allNodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo := &amp;NodeUsage{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  userGPUPolicy := 
config.GPUSchedulerPolicy</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if task != nil &amp;&amp; task.Annotations != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if value, ok := task.Annotations[policy.GPUSchedulerPolicyAnnotationKey]; ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    userGPUPolicy = value</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo.Node = node.Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo.Devices = policy.DeviceUsageList{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Policy:      userGPUPolicy,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   DeviceLists: make([]*policy.DeviceListsScore, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, d := range node.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   nodeInfo.Devices.DeviceLists = append(nodeInfo.Devices.DeviceLists, &amp;policy.DeviceListsScore{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Score: 0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Device: &amp;util.DeviceUsage{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     ID:        d.ID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">     Index:     d.Index,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Used:      0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Count:     d.Count,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Usedmem:   0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Totalmem:  d.Devmem,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Totalcore: d.Devcore,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Usedcores: 0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     MigUsage: util.MigInUse{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      Index:     0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      UsageList: make(util.MIGS, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     },</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     MigTemplate: d.MIGTemplate,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Mode:        d.Mode,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Type:        d.Type,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Numa:        d.Numa,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Health:      d.Health,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    },</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   })</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  overallnodeMap[node.ID] = nodeInfo</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> podsInfo := s.ListPodsInfo()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, p := range podsInfo {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node, ok := overallnodeMap[p.NodeID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, podsingleds := range p.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, ctrdevs := range podsingleds {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, udevice := range ctrdevs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     for _, d := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      deviceID := udevice.UUID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      if strings.Contains(deviceID, "[") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       deviceID = strings.Split(deviceID, 
"[")[0]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      if d.Device.ID == deviceID {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Used++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Usedmem += udevice.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Usedcores += udevice.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       if strings.Contains(udevice.UUID, "[") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        tmpIdx, Instance := util.ExtractMigTemplatesFromUUID(udevice.UUID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        if len(d.Device.MigUsage.UsageList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">         util.PlatternMIG(&amp;d.Device.MigUsage, d.Device.MigTemplate, tmpIdx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        d.Device.MigUsage.UsageList[Instance].InUse = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        klog.V(3).Infoln("add mig usage", d.Device.MigUsage, "template=", d.Device.MigTemplate, "uuid=", d.Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(5).Infof("usage: pod %v assigned %v %v", p.Name, p.NodeID, p.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.overviewstatus = overallnodeMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, nodeID := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node, err := s.GetNode(nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   // The identified node does not have a gpu device, so the log here has no practical meaning,increase log priority.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("node unregistered", "node", nodeID, "error", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   failedNodes[nodeID] = "node unregistered"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cachenodeMap[node.ID] = overallnodeMap[node.ID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> s.cachedstatus = cachenodeMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;cachenodeMap, failedNodes, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
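<p>The accumulation of allocated resources in the function above can be sketched as follows. This is a simplified illustration under assumptions: <code>DeviceUsage</code>, <code>Allocation</code>, <code>baseDeviceID</code> and <code>accumulate</code> are hypothetical, trimmed-down names, and the sketch swaps the nested device loop for a map lookup. MIG bookkeeping is left out.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceUsage is a hypothetical, reduced view of a device's usage counters.
type DeviceUsage struct {
	ID        string
	Used      int32
	Usedmem   int32
	Usedcores int32
}

// Allocation is a hypothetical record of one device share granted to a container.
type Allocation struct {
	UUID      string
	Usedmem   int32
	Usedcores int32
}

// baseDeviceID drops a bracketed suffix (as carried by MIG-style UUIDs)
// so that the allocation can be matched against the physical device ID.
func baseDeviceID(uuid string) string {
	if i := strings.Index(uuid, "["); i >= 0 {
		return uuid[:i]
	}
	return uuid
}

// accumulate adds every allocation to the counters of the matching device.
// Allocations that refer to unknown devices are ignored.
func accumulate(devices []*DeviceUsage, allocs []Allocation) {
	byID := make(map[string]*DeviceUsage, len(devices))
	for _, d := range devices {
		byID[d.ID] = d
	}
	for _, a := range allocs {
		d, ok := byID[baseDeviceID(a.UUID)]
		if !ok {
			continue
		}
		d.Used++
		d.Usedmem += a.Usedmem
		d.Usedcores += a.Usedcores
	}
}

func main() {
	gpu := &DeviceUsage{ID: "GPU-0"}
	accumulate([]*DeviceUsage{gpu}, []Allocation{
		{UUID: "GPU-0", Usedmem: 1000, Usedcores: 30},
		{UUID: "GPU-0[1-0]", Usedmem: 2000, Usedcores: 20},
	})
	fmt.Printf("%+v\n", *gpu)
}
```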
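<p><code>getNodesUsage</code> gathers each Node's total resources and its already-allocated resources. It starts by fetching the Node information.</p>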
<p><code>pkg/scheduler/nodes.go:120</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (m *nodeManager) ListNodes() (map[string]*util.NodeInfo, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m.mutex.RLock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer m.mutex.RUnlock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return m.nodes, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Node information is served from an in-memory cache, and entries are added to that cache by <code>addNode</code>.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="node-缓存">Node cache<a href="https://project-hami.io/zh/blog/2024/12/31/post#node-%E7%BC%93%E5%AD%98" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h6>
<p><code>pkg/scheduler/nodes.go:46</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (m *nodeManager) addNode(nodeID string, nodeInfo *util.NodeInfo) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if nodeInfo == nil || len(nodeInfo.Devices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m.mutex.Lock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer m.mutex.Unlock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, ok := m.nodes[nodeID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(nodeInfo.Devices) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmp := make([]util.DeviceInfo, 0, len(nodeInfo.Devices))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   devices := device.GetDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   deviceType := ""</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, val := range devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if 
strings.Contains(nodeInfo.Devices[0].Type, val.CommonWord()) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     deviceType = val.CommonWord()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, val := range m.nodes[nodeID].Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !strings.Contains(val.Type, deviceType) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmp = append(tmp, val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   m.nodes[nodeID].Devices = tmp</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   m.nodes[nodeID].Devices = append(m.nodes[nodeID].Devices, nodeInfo.Devices...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  m.nodes[nodeID] = nodeInfo</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
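<p>The merge behaviour of <code>addNode</code> (replace the cached devices of the reporting vendor, keep the devices of other vendors) can be sketched as follows. The names <code>nodeCache</code>, <code>upsert</code> and <code>snapshot</code> are hypothetical, and the vendor keyword is passed in explicitly here, whereas the code above derives it from the first device's type.</p>

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// DeviceInfo is a hypothetical, reduced device record.
type DeviceInfo struct {
	ID   string
	Type string
}

// nodeCache maps a node ID to its known devices, guarded by a RWMutex.
type nodeCache struct {
	mu    sync.RWMutex
	nodes map[string][]DeviceInfo
}

func newNodeCache() *nodeCache {
	return &nodeCache{nodes: make(map[string][]DeviceInfo)}
}

// upsert drops cached devices whose Type contains vendor, then appends the
// fresh list, so devices reported by other vendors survive the update.
func (c *nodeCache) upsert(nodeID, vendor string, fresh []DeviceInfo) {
	if len(fresh) == 0 {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	kept := make([]DeviceInfo, 0, len(c.nodes[nodeID])+len(fresh))
	for _, d := range c.nodes[nodeID] {
		if !strings.Contains(d.Type, vendor) {
			kept = append(kept, d)
		}
	}
	c.nodes[nodeID] = append(kept, fresh...)
}

// snapshot returns a copy of the node's devices taken under the read lock.
func (c *nodeCache) snapshot(nodeID string) []DeviceInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]DeviceInfo(nil), c.nodes[nodeID]...)
}

func main() {
	c := newNodeCache()
	c.upsert("node-a", "NVIDIA", []DeviceInfo{{"gpu-0", "NVIDIA-A100"}})
	c.upsert("node-a", "Ascend", []DeviceInfo{{"npu-0", "Ascend910"}})
	c.upsert("node-a", "NVIDIA", []DeviceInfo{{"gpu-1", "NVIDIA-A100"}})
	fmt.Println(c.snapshot("node-a"))
}
```

<p>Returning a copy from <code>snapshot</code> is a design choice of this sketch: the caller can then read the slice after the lock is released without racing against later updates.</p>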
<p>In <code>addNode</code>, the key step is <code>device.GetDevices()</code>, which returns the registered devices.</p>
<p><code>pkg/device/devices.go:81</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func GetDevices() map[string]Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The device registry is also a cache and is analyzed later. First, look at when the Node cache gets populated.</p>
<p><code>pkg/scheduler/scheduler.go:155</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) RegisterFromNodeAnnotations() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).Infoln("Scheduler into RegisterFromNodeAnnotations")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ticker := time.NewTicker(time.Second * 15)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  select {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-s.nodeNotify:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-ticker.C:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-s.stopCh:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  labelSelector := labels.Everything()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(config.NodeLabelSelector) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   labelSelector = (labels.Set)(config.NodeLabelSelector).AsSelector()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  rawNodes, err := s.nodeLister.List(labelSelector)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nodes list failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var nodeNames []string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range rawNodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   nodeNames = append(nodeNames, val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for devhandsk, devInstance := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    health, needUpdate := devInstance.CheckHealth(devhandsk, val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.V(5).InfoS("device check health", "node", val.Name, "deviceVendor", devhandsk, "health", health, "needUpdate", needUpdate)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !health {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     err := devInstance.NodeCleanUp(val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     // If the device is not healthy, the device is removed from the node.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     // At 
the same time, this node needs to be removed from the cache.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorln("node cleanup failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     info, ok := s.nodes[val.Name]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Infof("node %v device %s:%v leave, %v remaining devices:%v", val.Name, devhandsk, info.ID, err, s.nodes[val.Name].Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      s.rmNodeDevice(val.Name, info, devhandsk)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !needUpdate {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    _, ok := util.HandshakeAnnos[devhandsk]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmppat := 
make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmppat[util.HandshakeAnnos[devhandsk]] = "Requesting_" + time.Now().Format("2006.01.02 15:04:05")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.V(4).InfoS("New timestamp", util.HandshakeAnnos[devhandsk], tmppat[util.HandshakeAnnos[devhandsk]], "nodeName", val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     n, err := util.GetNode(val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorln("get node failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     util.PatchNodeAnnotations(n, tmppat)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo := &amp;util.NodeInfo{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.ID = val.Name</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.Node = val</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodedevices, err := devInstance.GetNodeDevices(*val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if err != nil 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.Devices = make([]util.DeviceInfo, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, deviceinfo := range nodedevices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     nodeInfo.Devices = append(nodeInfo.Devices, *deviceinfo)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    s.addNode(val.Name, nodeInfo)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if s.nodes[val.Name] != nil &amp;&amp; len(nodeInfo.Devices) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.Infof("node %v device %s come node info=%s,%v total=%v", val.Name, devhandsk, nodeInfo.ID, nodeInfo.Devices, s.nodes[val.Name].Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  _, _, err = s.getNodesUsage(&amp;nodeNames, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("get node usage failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span 
class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
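<p>This starts a loop driven by a 15-second ticker (it can also be woken through <code>s.nodeNotify</code>) that fetches Node information and maintains the Node cache.</p>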
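<p>The wake-up pattern used by this loop (an empty <code>select</code> case for each trigger, followed by one shared body) can be sketched as follows. <code>runLoop</code> and <code>resync</code> are hypothetical names, and unlike the code above this sketch stops the ticker on exit.</p>

```go
package main

import (
	"fmt"
	"time"
)

// runLoop calls resync once per wake-up: when notify delivers a value or
// when the ticker fires. It returns as soon as stop is closed.
func runLoop(notify <-chan struct{}, stop <-chan struct{}, interval time.Duration, resync func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-notify: // explicit wake-up, fall through to resync
		case <-ticker.C: // periodic wake-up, fall through to resync
		case <-stop:
			return
		}
		resync()
	}
}

func main() {
	notify := make(chan struct{})
	stop := make(chan struct{})
	done := make(chan struct{})
	runs := 0
	go func() {
		runLoop(notify, stop, 15*time.Second, func() { runs++ })
		close(done)
	}()
	notify <- struct{}{} // wake the loop instead of waiting for the first tick
	close(stop)
	<-done
	fmt.Println("resync runs:", runs)
}
```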
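<p>The core logic of <code>RegisterFromNodeAnnotations</code> is the loop <code>for devhandsk, devInstance := range device.GetDevices()</code>, which iterates over all registered devices.
Each device type registers its own handler, and the handler's <code>devInstance.GetNodeDevices</code> is used to read that device's resource information from the node.</p>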
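<p>Through the registered device (NVIDIA in this environment), the call is dispatched to that vendor's <code>GetNodeDevices</code> implementation. The device registry itself is described in detail later.</p>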
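<p>The per-vendor dispatch can be sketched as follows. Everything here is a simplified assumption: <code>Node</code>, <code>Vendor</code>, <code>annotationVendor</code> and <code>collectDevices</code> are hypothetical names, the annotation keys are made up, and the comma-separated encoding only stands in for the real annotation format, which this sketch does not reproduce.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Node is a hypothetical, reduced node object carrying only annotations.
type Node struct {
	Name        string
	Annotations map[string]string
}

// Vendor is a hypothetical, trimmed-down per-device handler interface.
type Vendor interface {
	GetNodeDevices(n Node) ([]string, error)
}

// annotationVendor reads its device list from a single node annotation.
type annotationVendor struct{ key string }

func (v annotationVendor) GetNodeDevices(n Node) ([]string, error) {
	raw, ok := n.Annotations[v.key]
	if !ok {
		return nil, errors.New("annotation not found: " + v.key)
	}
	// Assumed encoding for this sketch only: comma-separated device IDs.
	return strings.Split(raw, ","), nil
}

// collectDevices asks every registered vendor for its devices on the node
// and skips vendors that report an error, as the loop above does.
func collectDevices(n Node, vendors map[string]Vendor) map[string][]string {
	found := make(map[string][]string)
	for name, v := range vendors {
		devs, err := v.GetNodeDevices(n)
		if err != nil {
			continue
		}
		found[name] = devs
	}
	return found
}

func main() {
	node := Node{Name: "node-a", Annotations: map[string]string{"example.com/gpu": "gpu-0,gpu-1"}}
	vendors := map[string]Vendor{
		"gpu": annotationVendor{key: "example.com/gpu"},
		"npu": annotationVendor{key: "example.com/npu"},
	}
	fmt.Println(collectDevices(node, vendors))
}
```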
<p><code>pkg/device/nvidia/device.go:209</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (dev *NvidiaGPUDevices) GetNodeDevices(n corev1.Node) ([]*util.DeviceInfo, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devEncoded, ok := n.Annotations[RegisterAnnos]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, errors.New("annos not found " + RegisterAnnos)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodedevices, err := util.DecodeNodeDevices(devEncoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "failed to decode node devices", "node", n.Name, "device annotation", devEncoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(nodedevices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.InfoS("no nvidia gpu device found", "node", n.Name, "device annotation", devEncoded)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, errors.New("no gpu found on node")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range nodedevices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if val.Mode == "mig" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   val.MIGTemplate = make([]util.Geometry, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, migTemplates := range dev.config.MigGeometriesList {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    found := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, migDevices := range migTemplates.Models {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if strings.Contains(val.Type, migDevices) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      found = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if found {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     val.MIGTemplate = append(val.MIGTemplate, migTemplates.Geometries...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devDecoded := util.EncodeNodeDevices(nodedevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).InfoS("nodes device information", "node", n.Name, "nodedevices", devDecoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nodedevices, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
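<p>At this point the basic logic is clear: the scheduler uses a timer to periodically read each node's annotations and keeps the decoded device information in its node cache, where it is available at scheduling time.</p>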
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">...</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-nvidia-register</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 'GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">24576</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">300</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">NVIDIA</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      GeForce RTX 3090</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">true</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><br></span></code></pre></div></div>
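<p>To make the annotation format concrete, here is a minimal sketch of a decoder for a value like the one above. It is not the real <code>util.DecodeNodeDevices</code>: the <code>DeviceInfo</code> struct, the field names, and the reading of each position (ID, count, memory, cores, type, NUMA, health) are assumptions inferred from the sample value alone.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo is a simplified, hypothetical stand-in for util.DeviceInfo.
// Field names and order are assumptions inferred from the sample annotation.
type DeviceInfo struct {
	ID      string
	Count   int32
	Devmem  int32
	Devcore int32
	Type    string
	Numa    int
	Health  bool
}

// decodeNodeDevices parses records of the form
// "id,count,mem,core,type,numa,health", each terminated by ':'.
func decodeNodeDevices(s string) ([]*DeviceInfo, error) {
	var out []*DeviceInfo
	for _, rec := range strings.Split(s, ":") {
		rec = strings.TrimSpace(rec)
		if rec == "" {
			continue // the trailing ':' leaves an empty record
		}
		f := strings.Split(rec, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("want 7 fields, got %d in %q", len(f), rec)
		}
		// Positions 1, 2, 3 and 5 hold integers.
		nums := make([]int64, 0, 4)
		for _, i := range []int{1, 2, 3, 5} {
			n, err := strconv.ParseInt(f[i], 10, 32)
			if err != nil {
				return nil, fmt.Errorf("field %d of %q: %w", i, rec, err)
			}
			nums = append(nums, n)
		}
		health, err := strconv.ParseBool(f[6])
		if err != nil {
			return nil, fmt.Errorf("health flag of %q: %w", rec, err)
		}
		out = append(out, &DeviceInfo{
			ID:      f[0],
			Count:   int32(nums[0]),
			Devmem:  int32(nums[1]),
			Devcore: int32(nums[2]),
			Type:    f[4],
			Numa:    int(nums[3]),
			Health:  health,
		})
	}
	return out, nil
}

func main() {
	devs, err := decodeNodeDevices("GPU-7aebc545-cbd3-18a0-afce-76cae449702a,10,24576,300,NVIDIA-NVIDIA GeForce RTX 3090,0,true:")
	if err != nil {
		panic(err)
	}
	for _, d := range devs {
		fmt.Printf("%+v\n", *d)
	}
}
```

<p>A node with several cards would simply carry several such records in the same annotation, one per device.</p>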
<p>This calls into device again. We will come back to that shortly; first, let's see who calls <code>RegisterFromNodeAnnotations</code>.</p>
<p><code>cmd/scheduler/main.go:70</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher = scheduler.NewScheduler()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher.Start()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer sher.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start monitor metrics</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go sher.RegisterFromNodeAnnotations()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go initMetrics(config.MetricsBindAddress)</span><br></span></code></pre></div></div>
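<p>It is launched as a goroutine as soon as the scheduler starts, so this part of the logic is now clear. Let's go back to the device code we skipped.</p>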
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="device">device<a href="https://project-hami.io/zh/blog/2024/12/31/post#device" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h6>
<p>device is initialized at <code>pkg/device/devices.go:85</code>.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func InitDevicesWithConfig(config *Config) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices = make(map[string]Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = []string{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[nvidia.NvidiaGPUDevice] = nvidia.InitNvidiaDevice(config.NvidiaConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[cambricon.CambriconMLUDevice] = cambricon.InitMLUDevice(config.CambriconConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[hygon.HygonDCUDevice] = hygon.InitDCUDevice(config.HygonConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[iluvatar.IluvatarGPUDevice] = iluvatar.InitIluvatarDevice(config.IluvatarConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[mthreads.MthreadsGPUDevice] = mthreads.InitMthreadsDevice(config.MthreadsConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[metax.MetaxGPUDevice] = metax.InitMetaxDevice(config.MetaxConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, 
nvidia.NvidiaGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, cambricon.CambriconMLUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, hygon.HygonDCUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, iluvatar.IluvatarGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, mthreads.MthreadsGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, metax.MetaxGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, dev := range ascend.InitDevices(config.VNPUs) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  devices[dev.CommonWord()] = dev</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  DevicesToHandle = append(DevicesToHandle, dev.CommonWord())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>We are using NVIDIA cards here, so <code>InitNvidiaDevice</code> is the only initializer that matters for this walkthrough.</p>
<p><code>pkg/device/devices.go:42</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type Devices interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CommonWord() string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> MutateAdmission(ctr *corev1.Container, pod *corev1.Pod) (bool, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckHealth(devType string, n *corev1.Node) (bool, bool)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> NodeCleanUp(nn string) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetNodeDevices(n corev1.Node) ([]*util.DeviceInfo, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckType(annos map[string]string, d util.DeviceUsage, n util.ContainerDeviceRequest) (bool, bool, bool)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // CheckUUID is check current device id whether in GPUUseUUID or GPUNoUseUUID set, return true is check success.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckUUID(annos map[string]string, d util.DeviceUsage) bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> LockNode(n *corev1.Node, p *corev1.Pod) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ReleaseNodeLock(n *corev1.Node, p *corev1.Pod) error</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> GenerateResourceRequests(ctr *corev1.Container) util.ContainerDeviceRequest</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> PatchAnnotations(annoinput *map[string]string, pd util.PodDevices) map[string]string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CustomFilterRule(allocated *util.PodDevices, request util.ContainerDeviceRequest, toAllicate util.ContainerDevices, device *util.DeviceUsage) bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> AddResourceUsage(n *util.DeviceUsage, ctr *util.ContainerDevice) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // This should not be associated with a specific device object</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //ParseConfig(fs *flag.FlagSet)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
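<p>This interface declares the operations the scheduler needs from a device. Each device type provides its own implementation, and all implementations are initialized when the scheduler starts so that they can be called at runtime.</p>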
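<p>As a rough illustration of that design, the sketch below registers two toy implementations in a map keyed by their common word and then dispatches through the interface. The trimmed-down <code>Device</code> interface, <code>fakeGPU</code>, <code>fakeMLU</code>, and <code>initDevices</code> are hypothetical and much simpler than the real <code>Devices</code> interface.</p>

```go
package main

import "fmt"

// Device is a trimmed-down, hypothetical version of the Devices interface.
type Device interface {
	CommonWord() string
	CheckHealth(node string) bool
}

// fakeGPU always reports healthy.
type fakeGPU struct{}

func (fakeGPU) CommonWord() string      { return "GPU" }
func (fakeGPU) CheckHealth(string) bool { return true }

// fakeMLU always reports unhealthy, to show per-device behaviour.
type fakeMLU struct{}

func (fakeMLU) CommonWord() string      { return "MLU" }
func (fakeMLU) CheckHealth(string) bool { return false }

// initDevices builds the registry once, keyed by each device's common word.
func initDevices(impls ...Device) map[string]Device {
	reg := make(map[string]Device, len(impls))
	for _, d := range impls {
		reg[d.CommonWord()] = d
	}
	return reg
}

func main() {
	devices := initDevices(fakeGPU{}, fakeMLU{})
	// At runtime the caller only needs the key, not the concrete type.
	for _, word := range []string{"GPU", "MLU"} {
		fmt.Println(word, devices[word].CheckHealth("node-a"))
	}
}
```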
<p>Once the resource status of every device on every node has been collected, scoring begins.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="根据节点资源信息打分">Scoring nodes from their resource information<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E6%A0%B9%E6%8D%AE%E8%8A%82%E7%82%B9%E8%B5%84%E6%BA%90%E4%BF%A1%E6%81%AF%E6%89%93%E5%88%86" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h5>
<p><code>pkg/scheduler/scheduler.go:458</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span></code></pre></div></div>
<p><code>pkg/scheduler/score.go:198</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) calcScore(nodes *map[string]*NodeUsage, nums util.PodDeviceRequests, annos map[string]string, task *corev1.Pod) (*policy.NodeScoreList, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> userNodePolicy := config.NodeSchedulerPolicy</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if annos != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if value, ok := annos[policy.NodeSchedulerPolicyAnnotationKey]; ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   userNodePolicy = value</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := policy.NodeScoreList{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Policy:   userNodePolicy,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  NodeList: make([]*policy.NodeScore, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //func calcScore(nodes *map[string]*NodeUsage, 
errMap *map[string]string, nums util.PodDeviceRequests, annos map[string]string, task *corev1.Pod) (*NodeScoreList, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // res := make(NodeScoreList, 0, len(*nodes))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for nodeID, node := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  viewStatus(*node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  score := policy.NodeScore{NodeID: nodeID, Node: node.Node, Devices: make(util.PodDevices), Score: 0}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  score.ComputeDefaultScore(node.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  //This loop is for different container request</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctrfit := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for ctrid, n := range nums {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   sums := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, k := range n {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    sums += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if sums == 0 {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">    for idx := range score.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     for len(score.Devices[idx]) &lt;= ctrid {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      score.Devices[idx] = append(score.Devices[idx], util.ContainerDevices{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     score.Devices[idx][ctrid] = append(score.Devices[idx][ctrid], util.ContainerDevice{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("fitInDevices", "pod", klog.KObj(task), "node", nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   fit, _ := fitInDevices(node, n, annos, task, &amp;score.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ctrfit = fit</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if !fit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.InfoS("calcScore:node not fit pod", "pod", klog.KObj(task), "node", nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ctrfit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   res.NodeList = append(res.NodeList, &amp;score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   score.OverrideScore(node.Devices, userNodePolicy)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
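<p>This function does three things: it iterates over the nodes and gives each one a default score, it iterates over the Pod's containers and tries to fit each container's device requests onto the node's devices, and it returns every node that can satisfy the resources requested in the containers' limits.</p>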
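<p>The filtering part of that flow can be sketched in a few lines. This is a deliberately simplified model rather than the real <code>calcScore</code>: it tracks only free memory per device, and <code>node</code>, <code>request</code>, <code>fitContainer</code>, and <code>candidates</code> are hypothetical names.</p>

```go
package main

import "fmt"

// node holds the free memory of each device on a node (hypothetical model).
type node struct {
	name string
	free []int32
}

// request is one container's ask: nums devices with mem memory on each.
type request struct {
	nums int
	mem  int32
}

// fitContainer reserves r.mem on r.nums devices and reports whether it succeeded.
// It mutates free so that later containers see the reduced capacity.
func fitContainer(free []int32, r request) bool {
	picked := 0
	for i := range free {
		if picked == r.nums {
			break
		}
		if free[i] >= r.mem {
			free[i] -= r.mem
			picked++
		}
	}
	return picked == r.nums
}

// candidates returns the nodes on which every container of the pod fits.
func candidates(nodes []node, ctrs []request) []string {
	var out []string
	for _, n := range nodes {
		free := append([]int32(nil), n.free...) // work on a copy
		ok := true
		for _, r := range ctrs {
			if !fitContainer(free, r) {
				ok = false
				break // one container that does not fit rules the node out
			}
		}
		if ok {
			out = append(out, n.name)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{name: "a", free: []int32{1000, 500}},
		{name: "b", free: []int32{200}},
	}
	pod := []request{{nums: 1, mem: 800}, {nums: 1, mem: 400}}
	fmt.Println(candidates(nodes, pod)) // prints [a]
}
```

<p>Node <code>a</code> survives because the first container takes 800 from the first device and the second container still finds 400 on the second one; node <code>b</code> fails on the first container.</p>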
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="计算出节点的分数">Computing the node score<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E8%AE%A1%E7%AE%97%E5%87%BA%E8%8A%82%E7%82%B9%E7%9A%84%E5%88%86%E6%95%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h5>
<p><code>pkg/scheduler/policy/node_policy.go:68</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (ns *NodeScore) ComputeDefaultScore(devices DeviceUsageList) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> used, usedCore, usedMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, device := range devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  used += device.Device.Used</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  usedCore += device.Device.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  usedMem += device.Device.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("node %s used %d, usedCore %d, usedMem %d,", ns.NodeID, used, usedCore, usedMem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total, totalCore, totalMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, deviceLists := range devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  total += deviceLists.Device.Count</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  totalCore += deviceLists.Device.Totalcore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  totalMem += deviceLists.Device.Totalmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> useScore := float32(used) / float32(total)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> coreScore := float32(usedCore) / float32(totalCore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> memScore := float32(usedMem) / float32(totalMem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ns.Score = float32(Weight) * (useScore + coreScore + memScore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("node %s computer default score is %f", ns.NodeID, ns.Score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
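<p>The node scoring rule is simple: sum the used and total values of device count, cores, and memory across all devices on the node, take the three used/total ratios, and multiply their sum by <code>Weight</code>.</p>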
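<p>A small worked example may help. The sketch below reimplements the same formula on a stripped-down struct; <code>deviceUsage</code>, <code>ratio</code>, and <code>defaultScore</code> are hypothetical names, the weight of 10 is an assumed value passed in as a parameter, and the zero-total guard is a choice made for this sketch (the quoted function divides without checking).</p>

```go
package main

import "fmt"

// deviceUsage mirrors only the fields the scoring formula reads.
type deviceUsage struct {
	Used, Count          int32
	Usedcores, Totalcore int32
	Usedmem, Totalmem    int32
}

// ratio returns used/total, or 0 when total is 0 (guard added for this sketch).
func ratio(used, total int32) float32 {
	if total == 0 {
		return 0
	}
	return float32(used) / float32(total)
}

// defaultScore computes weight * (used/total + usedCore/totalCore + usedMem/totalMem)
// over the sums of all devices on a node.
func defaultScore(weight float32, devs []deviceUsage) float32 {
	var used, total, usedCore, totalCore, usedMem, totalMem int32
	for _, d := range devs {
		used += d.Used
		total += d.Count
		usedCore += d.Usedcores
		totalCore += d.Totalcore
		usedMem += d.Usedmem
		totalMem += d.Totalmem
	}
	return weight * (ratio(used, total) + ratio(usedCore, totalCore) + ratio(usedMem, totalMem))
}

func main() {
	// One card that is half used in every dimension: 0.5 + 0.5 + 0.5 = 1.5.
	devs := []deviceUsage{{Used: 5, Count: 10, Usedcores: 50, Totalcore: 100, Usedmem: 12288, Totalmem: 24576}}
	fmt.Println(defaultScore(10, devs)) // prints 15
}
```

<p>Because the score grows with utilization, a busier node gets a higher default score; how that ordering is used depends on the node scheduling policy applied afterwards.</p>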
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="计算每个容器对应的设备的分数">Computing the device score for each container<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E8%AE%A1%E7%AE%97%E6%AF%8F%E4%B8%AA%E5%AE%B9%E5%99%A8%E5%AF%B9%E5%BA%94%E7%9A%84%E8%AE%BE%E5%A4%87%E7%9A%84%E5%88%86%E6%95%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h5>
<p><code>pkg/scheduler/score.go:149</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func fitInDevices(node *NodeUsage, requests util.ContainerDeviceRequests, annos map[string]string, pod *corev1.Pod, devinput *util.PodDevices) (bool, float32) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //devmap := make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devs := util.ContainerDevices{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total, totalCore, totalMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> free, freeCore, freeMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sums := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // computer all device score for one node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for index := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node.Devices.DeviceLists[index].ComputeScore(requests)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //This loop is for requests for different devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, k := range requests 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  sums += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if int(k.Nums) &gt; len(node.Devices.DeviceLists) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("request devices nums cannot exceed the total number of devices on the node.", "pod", klog.KObj(pod), "request devices nums", k.Nums, "node device nums", len(node.Devices.DeviceLists))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  sort.Sort(node.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  fit, tmpDevs := fitInCertainDevice(node, k, annos, pod, devinput)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if fit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for idx, val := range tmpDevs[k.Type] {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for nidx, v := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     //bc node.Devices has been sorted, so we should find out the correct device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if v.Device.ID != val.UUID {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     total 
+= v.Device.Count</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     totalCore += v.Device.Totalcore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     totalMem += v.Device.Totalmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     free += v.Device.Count - v.Device.Used</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     freeCore += v.Device.Totalcore - v.Device.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     freeMem += v.Device.Totalmem - v.Device.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     err := device.GetDevices()[k.Type].AddResourceUsage(node.Devices.DeviceLists[nidx].Device, &amp;tmpDevs[k.Type][idx])</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorf("AddResource failed:%s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.Infoln("After AddResourceUsage:", node.Devices.DeviceLists[nidx].Device)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   devs = append(devs, tmpDevs[k.Type]...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">   return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  (*devinput)[k.Type] = append((*devinput)[k.Type], devs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return true, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The main logic is:</p>
<ul>
<li class="">Score every device for the container, iterate over the resource limits requested by each container, and find the devices that can satisfy those limits.</li>
</ul>
<p><code>pkg/scheduler/policy/gpu_policy.go:58</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (ds *DeviceListsScore) ComputeScore(requests util.ContainerDeviceRequests) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> request, core, mem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Here we are required to use the same type device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, container := range requests {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  request += container.Nums</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  core += container.Coresreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if container.MemPercentagereq != 0 &amp;&amp; container.MemPercentagereq != 101 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   mem += ds.Device.Totalmem * (container.MemPercentagereq / 100.0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  mem += container.Memreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("device %s user 
%d, userCore %d, userMem %d,", ds.Device.ID, ds.Device.Used, ds.Device.Usedcores, ds.Device.Usedmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> usedScore := float32(request+ds.Device.Used) / float32(ds.Device.Count)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> coreScore := float32(core+ds.Device.Usedcores) / float32(ds.Device.Totalcore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> memScore := float32(mem+ds.Device.Usedmem) / float32(ds.Device.Totalmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ds.Score = float32(Weight) * (usedScore + coreScore + memScore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("device %s computer score is %f", ds.Device.ID, ds.Score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
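<p>The device scoring rule is similar to the node scoring rule: the score is the weighted sum of the slot, core, and memory utilization ratios after the request is added.</p>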
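<p>To make the formula concrete, here is a minimal sketch of the same arithmetic outside the scheduler. The <code>deviceUsage</code> type, the <code>score</code> helper, and the weight value of 10 are assumptions made for illustration; only the three ratios and their weighted sum come from <code>ComputeScore</code> above.</p>

```go
package main

import "fmt"

// deviceUsage is a hypothetical, trimmed-down stand-in for the device
// usage fields that ComputeScore reads.
type deviceUsage struct {
	Count, Used          int32 // shareable slots, and slots already in use
	Totalcore, Usedcores int32 // core budget, and cores already in use
	Totalmem, Usedmem    int32 // device memory, and memory already in use
}

// score mirrors the formula in ComputeScore: the sum of the three
// utilization ratios after adding the request, scaled by weight.
func score(d deviceUsage, reqNums, reqCores, reqMem int32, weight float32) float32 {
	used := float32(reqNums+d.Used) / float32(d.Count)
	core := float32(reqCores+d.Usedcores) / float32(d.Totalcore)
	mem := float32(reqMem+d.Usedmem) / float32(d.Totalmem)
	return weight * (used + core + mem)
}

func main() {
	idle := deviceUsage{Count: 10, Totalcore: 100, Totalmem: 24000}
	busy := deviceUsage{Count: 10, Used: 5, Totalcore: 100, Usedcores: 50, Totalmem: 24000, Usedmem: 12000}

	// The weight of 10 is an assumed value for this sketch.
	fmt.Println("idle device score:", score(idle, 1, 20, 6000, 10))
	fmt.Println("busy device score:", score(busy, 1, 20, 6000, 10))
}
```

<p>For the same request, a device that is already in use gets a higher score than an idle one, because every ratio includes the current usage.</p>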
<p><code>pkg/scheduler/score.go:65</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func fitInCertainDevice(node *NodeUsage, request util.ContainerDeviceRequest, annos map[string]string, pod *corev1.Pod, allocated *util.PodDevices) (bool, map[string]util.ContainerDevices) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> k := request</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> originReq := k.Nums</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> prevnuma := -1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("Allocating device for container request", "pod", klog.KObj(pod), "card request", k)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var tmpDevs map[string]util.ContainerDevices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmpDevs = make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for i := len(node.Devices.DeviceLists) - 1; i &gt;= 0; i-- {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.InfoS("scoring pod", "pod", klog.KObj(pod), "Memreq", k.Memreq, "MemPercentagereq", k.MemPercentagereq, "Coresreq", k.Coresreq, "Nums", k.Nums, "device index", i, "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  found, numa := checkType(annos, 
*node.Devices.DeviceLists[i].Device, k)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !found {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("card type mismatch,continuing...", "pod", klog.KObj(pod), (node.Devices.DeviceLists[i].Device).Type, k.Type)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if numa &amp;&amp; prevnuma != node.Devices.DeviceLists[i].Device.Numa {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("Numa not fit, resotoreing", "pod", klog.KObj(pod), "k.nums", k.Nums, "numa", numa, "prevnuma", prevnuma, "device numa", node.Devices.DeviceLists[i].Device.Numa)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Nums = originReq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   prevnuma = node.Devices.DeviceLists[i].Device.Numa</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmpDevs = make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !checkUUID(annos, *node.Devices.DeviceLists[i].Device, k) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("card uuid mismatch,", "pod", klog.KObj(pod), "current device info is:", *node.Devices.DeviceLists[i].Device)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memreq := int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Count &lt;= node.Devices.DeviceLists[i].Device.Used {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Coresreq &gt; 100 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(nil, "core limit can't exceed 100", "pod", klog.KObj(pod))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Coresreq = 100</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   //return false, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Memreq &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memreq = k.Memreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.MemPercentagereq != 101 &amp;&amp; k.Memreq == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   //This incurs an issue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memreq = node.Devices.DeviceLists[i].Device.Totalmem * 
k.MemPercentagereq / 100</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalmem-node.Devices.DeviceLists[i].Device.Usedmem &lt; memreq {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("card Insufficient remaining memory", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "device total memory", node.Devices.DeviceLists[i].Device.Totalmem, "device used memory", node.Devices.DeviceLists[i].Device.Usedmem, "request memory", memreq)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore-node.Devices.DeviceLists[i].Device.Usedcores &lt; k.Coresreq {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("card Insufficient remaining cores", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "device total core", node.Devices.DeviceLists[i].Device.Totalcore, "device used core", node.Devices.DeviceLists[i].Device.Usedcores, "request cores", k.Coresreq)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Coresreq=100 indicates it want this card exclusively</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore == 
100 &amp;&amp; k.Coresreq == 100 &amp;&amp; node.Devices.DeviceLists[i].Device.Used &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("the container wants exclusive access to an entire card, but the card is already in use", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "used", node.Devices.DeviceLists[i].Device.Used)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // You can't allocate core=0 job to an already full GPU</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore != 0 &amp;&amp; node.Devices.DeviceLists[i].Device.Usedcores == node.Devices.DeviceLists[i].Device.Totalcore &amp;&amp; k.Coresreq == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("can't allocate core=0 job to an already full GPU", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !device.GetDevices()[k.Type].CustomFilterRule(allocated, request, tmpDevs[k.Type], node.Devices.DeviceLists[i].Device) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">  if k.Nums &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("first fitted", "pod", klog.KObj(pod), "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Nums--</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmpDevs[k.Type] = append(tmpDevs[k.Type], util.ContainerDevice{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Idx:       int(node.Devices.DeviceLists[i].Device.Index),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    UUID:      node.Devices.DeviceLists[i].Device.ID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Type:      k.Type,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Usedmem:   memreq,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Usedcores: k.Coresreq,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Nums == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("device allocate success", "pod", klog.KObj(pod), "allocate device", tmpDevs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return true, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Mode == "mig" {</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">   i++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return false, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
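<p>The function walks the sorted device list from the end and checks whether each device's remaining resources (free slots, memory, and cores) are enough for the container. Devices that fit are collected, and the function returns as soon as the requested number of devices has been found.</p>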
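<p>The core of that loop can be sketched in isolation as follows. This is a simplified illustration, not the HAMi implementation: the <code>dev</code> type and the <code>pick</code> helper are hypothetical, and the type, UUID, NUMA, MIG, and <code>CustomFilterRule</code> checks are left out.</p>

```go
package main

import "fmt"

// dev is a hypothetical, simplified view of one device's usage.
type dev struct {
	ID                   string
	Count, Used          int32
	Totalcore, Usedcores int32
	Totalmem, Usedmem    int32
}

// pick walks the already sorted devices from the end of the slice and
// collects the first nums devices that have a free slot, enough free
// memory, and enough free cores. It reports false if fewer than nums fit.
func pick(devs []dev, nums int, memReq, coreReq int32) ([]string, bool) {
	if nums <= 0 {
		return nil, true
	}
	var picked []string
	for i := len(devs) - 1; i >= 0; i-- {
		d := devs[i]
		if d.Count <= d.Used {
			continue // no free slot on this device
		}
		if d.Totalmem-d.Usedmem < memReq {
			continue // not enough free memory
		}
		if d.Totalcore-d.Usedcores < coreReq {
			continue // not enough free cores
		}
		if d.Totalcore == 100 && coreReq == 100 && d.Used > 0 {
			continue // exclusive request, but the card is already shared
		}
		picked = append(picked, d.ID)
		if len(picked) == nums {
			return picked, true
		}
	}
	return nil, false
}

func main() {
	devs := []dev{
		{ID: "GPU-0", Count: 10, Totalcore: 100, Totalmem: 24000},
		{ID: "GPU-1", Count: 10, Used: 2, Totalcore: 100, Usedcores: 80, Totalmem: 24000, Usedmem: 20000},
	}
	ids, ok := pick(devs, 1, 8000, 30)
	fmt.Println(ids, ok)
}
```

<p>In this example GPU-1 is visited first but has too little free memory and too few free cores, so the request lands on GPU-0.</p>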
<p><code>pkg/scheduler/scheduler.go:458</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len((*nodeScores).NodeList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(4).Infof("All node scores do not meet for pod %v", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("no available node, all node scores do not meet"))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: failedNodes,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(4).Infoln("nodeScores_len=", len((*nodeScores).NodeList))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sort.Sort(nodeScores)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m := (*nodeScores).NodeList[len((*nodeScores).NodeList)-1]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("schedule %v/%v to %v %v", args.Pod.Namespace, args.Pod.Name, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedNodeAnnotations] = m.NodeID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.PatchAnnotations(&amp;annotations, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //InRequestDevices := util.EncodePodDevices(util.InRequestDevices, m.devices)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> //supportDevices := util.EncodePodDevices(util.SupportDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, InRequestDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, supportDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.addPod(args.Pod, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(args.Pod, annotations)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringSucceed, []string{m.NodeID}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := extenderv1.ExtenderFilterResult{NodeNames: &amp;[]string{m.NodeID}}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span></code></pre></div></div>
<p>After the iteration, the highest-scoring node is selected and the scheduling result is written to the Pod as annotations.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Pod</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-node</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> node1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"1733988480"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span 
class="token key atrule">hami.io/vgpu-devices-allocated</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">20000</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">80</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain">;</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-devices-to-allocate</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> ;</span><br></span></code></pre></div></div>
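<p>The value of <code>hami.io/vgpu-devices-allocated</code> can be read back with a few string splits. The sketch below assumes a layout inferred only from the example above: containers separated by <code>;</code>, devices separated by <code>:</code>, and four comma-separated fields per device (UUID, type, memory, cores). The <code>allocatedDevice</code> type and <code>parseAllocated</code> helper are hypothetical and are not HAMi's own decoder.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allocatedDevice is a hypothetical record for one device entry.
type allocatedDevice struct {
	UUID      string
	Type      string
	Usedmem   int32
	Usedcores int32
}

// parseAllocated splits the annotation value into containers and devices.
// Empty segments (such as the one after a trailing ";") are skipped, so
// containers without devices are not represented in the result.
func parseAllocated(value string) ([][]allocatedDevice, error) {
	var containers [][]allocatedDevice
	for _, c := range strings.Split(value, ";") {
		if c == "" {
			continue
		}
		var devs []allocatedDevice
		for _, d := range strings.Split(c, ":") {
			if d == "" {
				continue
			}
			f := strings.Split(d, ",")
			if len(f) != 4 {
				return nil, fmt.Errorf("unexpected device entry %q", d)
			}
			mem, err := strconv.ParseInt(f[2], 10, 32)
			if err != nil {
				return nil, fmt.Errorf("bad memory field in %q: %w", d, err)
			}
			cores, err := strconv.ParseInt(f[3], 10, 32)
			if err != nil {
				return nil, fmt.Errorf("bad cores field in %q: %w", d, err)
			}
			devs = append(devs, allocatedDevice{
				UUID:      f[0],
				Type:      f[1],
				Usedmem:   int32(mem),
				Usedcores: int32(cores),
			})
		}
		containers = append(containers, devs)
	}
	return containers, nil
}

func main() {
	v := "GPU-7aebc545-cbd3-18a0-afce-76cae449702a,NVIDIA,20000,80:;"
	containers, err := parseAllocated(v)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, devs := range containers {
		for _, d := range devs {
			fmt.Printf("container %d: %s type=%s mem=%d cores=%d\n", i, d.UUID, d.Type, d.Usedmem, d.Usedcores)
		}
	}
}
```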
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="binding-实现">Binding implementation<a href="https://project-hami.io/zh/blog/2024/12/31/post#binding-%E5%AE%9E%E7%8E%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>The bind logic is fairly simple: it binds the Pod to the Node.</p>
<p><code>pkg/scheduler/routes/route.go:82</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func Bind(s *scheduler.Scheduler) httprouter.Handle {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return func(w http.ResponseWriter, r *http.Request, ps httprouter.Params) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var buf bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  body := io.TeeReader(r.Body, &amp;buf)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderBindingArgs extenderv1.ExtenderBindingArgs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderBindingResult *extenderv1.ExtenderBindingResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err := json.NewDecoder(body).Decode(&amp;extenderBindingArgs); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "Decode extender binding args")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderBindingResult = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderBindingResult, err = s.Bind(extenderBindingArgs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if response, err := json.Marshal(extenderBindingResult); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "Marshal binding result", "result", extenderBindingResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusInternalServerError)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   errMsg := fmt.Sprintf("{'error':'%s'}", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write([]byte(errMsg))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("Return bind response", "result", extenderBindingResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusOK)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write(response)</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The route handler delegates to <code>Scheduler.Bind</code>:</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) Bind(args extenderv1.ExtenderBindingArgs) (*extenderv1.ExtenderBindingResult, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("Bind", "pod", args.PodName, "namespace", args.PodNamespace, "podUID", args.PodUID, "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var err error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var res *extenderv1.ExtenderBindingResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> binding := &amp;corev1.Binding{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ObjectMeta: metav1.ObjectMeta{Name: args.PodName, UID: args.PodUID},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Target:     corev1.ObjectReference{Kind: "Node", Name: args.Node},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> current, err := s.kubeClient.CoreV1().Pods(args.PodNamespace).Get(context.Background(), args.PodName, metav1.GetOptions{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Get pod failed")</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> node, err := s.kubeClient.CoreV1().Nodes().Get(context.Background(), args.Node, metav1.GetOptions{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Failed to get node", "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleBindingResultEvent(current, EventReasonBindingFailed, []string{}, fmt.Errorf("failed to get node %v", args.Node))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err = val.LockNode(node, current)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   goto ReleaseNodeLocks</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch[util.DeviceBindPhase] = "allocating"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch[util.BindTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(current, tmppatch)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "patch pod annotation failed")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err = s.kubeClient.CoreV1().Pods(args.PodNamespace).Bind(context.Background(), binding, metav1.CreateOptions{}); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Failed to bind pod", "pod", args.PodName, "namespace", args.PodNamespace, "podUID", args.PodUID, "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err == nil {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleBindingResultEvent(current, EventReasonBindingSucceed, []string{args.Node}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error: "",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("After Binding Process")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">ReleaseNodeLocks:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("bind failed", "err", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.ReleaseNodeLock(node, current)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleBindingResultEvent(current, EventReasonBindingFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }, nil</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="node-将设备情况写入-node-annotation">The node writes device information to node annotations<a href="https://project-hami.io/zh/blog/2024/12/31/post#node-%E5%B0%86%E8%AE%BE%E5%A4%87%E6%83%85%E5%86%B5%E5%86%99%E5%85%A5-node-annotation" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>The scheduler obtains a node's device information mainly by reading the node's annotations. The main steps are:</p>
<ul>
<li class="">Start the plugin</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-handshake</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Requesting_2024.12.24 03</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token datetime number" style="color:rgb(247, 140, 108)">31:30</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-handshake-dcu</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 
Deleted_2024.12.06 07</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token datetime number" style="color:rgb(247, 140, 108)">43:49</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-nvidia-register</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      "GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">73728</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">300</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">NVIDIA</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      GeForce RTX 3090</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span 
class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">true</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain">"</span><br></span></code></pre></div></div>
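<p>To make the register annotation easier to read, here is a minimal sketch that parses the <code>hami.io/node-nvidia-register</code> value shown above. It assumes each device record ends with <code>:</code> and holds seven comma-separated fields. The struct, its field names, and the function <code>parseRegisterAnnotation</code> are illustrative assumptions inferred from the sample value, not HAMi's own types or API.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deviceInfo mirrors one record of the annotation value.
// The field names are assumptions inferred from the sample, not HAMi's types.
type deviceInfo struct {
	ID      string // device UUID
	Count   int    // assumed: number of virtual slices the device is split into
	Devmem  int    // assumed: reported device memory
	Devcore int    // assumed: reported compute capacity
	Type    string // vendor and model string
	Numa    int    // assumed: NUMA node index
	Health  bool   // health flag
}

// parseRegisterAnnotation splits the value into ':'-terminated records and
// each record into exactly seven ','-separated fields.
func parseRegisterAnnotation(value string) ([]deviceInfo, error) {
	var devices []deviceInfo
	for _, record := range strings.Split(value, ":") {
		record = strings.TrimSpace(record)
		if record == "" {
			continue // skip the empty piece after the trailing ':'
		}
		fields := strings.Split(record, ",")
		if len(fields) != 7 {
			return nil, fmt.Errorf("expected 7 fields, got %d in %q", len(fields), record)
		}
		// Fields 1, 2, 3 and 5 are integers.
		nums := make([]int, 0, 4)
		for _, i := range []int{1, 2, 3, 5} {
			n, err := strconv.Atoi(fields[i])
			if err != nil {
				return nil, fmt.Errorf("field %d of %q: %w", i, record, err)
			}
			nums = append(nums, n)
		}
		healthy, err := strconv.ParseBool(fields[6])
		if err != nil {
			return nil, fmt.Errorf("health flag of %q: %w", record, err)
		}
		devices = append(devices, deviceInfo{
			ID: fields[0], Count: nums[0], Devmem: nums[1], Devcore: nums[2],
			Type: fields[4], Numa: nums[3], Health: healthy,
		})
	}
	return devices, nil
}

func main() {
	value := "GPU-7aebc545-cbd3-18a0-afce-76cae449702a,10,73728,300,NVIDIA-NVIDIA GeForce RTX 3090,0,true:"
	devices, err := parseRegisterAnnotation(value)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range devices {
		fmt.Printf("%+v\n", d)
	}
}
```

<p>The sketch rejects records with the wrong number of fields rather than guessing, so a format change shows up as an error instead of silently wrong values.</p>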
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="启动-device-plugin-服务">Starting the device-plugin service<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E5%90%AF%E5%8A%A8-device-plugin-%E6%9C%8D%E5%8A%A1" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>The service is launched as a command built with <code>github.com/urfave/cli/v2</code>. Note that <code>-v</code> here does not set the log level; it prints the version.</p>
<p><code>cmd/device-plugin/nvidia/main.go:40</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func main() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var configFile string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c := cli.NewApp()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Name = "NVIDIA Device Plugin"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Usage = "NVIDIA device plugin for Kubernetes"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Version = info.GetVersionString()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Action = func(ctx *cli.Context) error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return start(ctx, c.Flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="启动-plugin">Starting the plugin<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E5%90%AF%E5%8A%A8-plugin" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>Devices from different vendors each need their own plugin implementation. The function below acts as the plugin controller, handling start, restart, and exit. The call to focus on is <code>plugins, restartPlugins, err := startPlugins(c, flags, restarting)</code>.</p>
<p><code>cmd/device-plugin/nvidia/main.go:156</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start(c *cli.Context, flags []cli.Flag) error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting FS watcher.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> util.NodeName = os.Getenv(util.NodeNameEnvName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> watcher, err := newFSWatcher(kubeletdevicepluginv1beta1.DevicePluginPath)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("failed to create FS watcher: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer watcher.Close()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*Loading config files*/</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Start working on node %s", util.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting OS watcher.")</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var restarting bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var restartTimeout &lt;-chan time.Time</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var plugins []plugin.Interface</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">restart:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // If we are restarting, stop plugins from previous run.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if restarting {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := stopPlugins(plugins)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return fmt.Errorf("error stopping plugins from previous run: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting Plugins.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugins, restartPlugins, err := startPlugins(c, flags, restarting)</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("error starting plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if restartPlugins {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Info("Failed to start one or more plugins. Retrying in 30s...")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  restartTimeout = time.After(30 * time.Second)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> restarting = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Start an infinite loop, waiting for several indicators to either log</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // some messages, trigger a restart of the plugins, or exit the program.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  select {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // If the restart timeout has expired, then restart the plugins</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-restartTimeout:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Detect a kubelet restart by watching for a newly created</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // 'kubeletdevicepluginv1beta1.KubeletSocket' file. When this occurs, restart this loop,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // restarting all of the plugins in the process.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case event := &lt;-watcher.Events:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if event.Name == kubeletdevicepluginv1beta1.KubeletSocket &amp;&amp; event.Op&amp;fsnotify.Create == fsnotify.Create {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Infof("inotify: %s created, restarting.", kubeletdevicepluginv1beta1.KubeletSocket)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Watch for any other fs errors and log them.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case err := &lt;-watcher.Errors:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   
klog.Errorf("inotify: %s", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Watch for any signals from the OS. On SIGHUP, restart this loop,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // restarting all of the plugins in the process. On all other</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // signals, exit the loop and exit the program.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case s := &lt;-sigs:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   switch s {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   case syscall.SIGHUP:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Info("Received SIGHUP, restarting.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   default:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Infof("Received signal \"%v\", shutting down.", s)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto exit</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">exit:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain"> err = stopPlugins(plugins)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("error stopping plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
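<p>The control flow of <code>start</code> can be condensed into a small sketch: start the plugins, then wait until a retry timer, a kubelet restart, a reload signal, or a termination signal arrives. This is a simplified illustration under stated assumptions, not HAMi's code: the function <code>run</code>, its parameters, and the use of plain <code>struct{}</code> channels in place of the fsnotify watcher and OS signals are all hypothetical, and a <code>for</code> loop replaces the original <code>goto restart</code>.</p>

```go
package main

import (
	"fmt"
	"time"
)

// run (re)starts plugins via startFn until terminate fires and returns how
// many times startFn was called. startFn reports whether a retry is needed,
// which plays the role of restartPlugins in the original function.
func run(startFn func() bool, kubeletRestarted, reload, terminate <-chan struct{}, retryDelay time.Duration) int {
	starts := 0
	for {
		needRetry := startFn()
		starts++

		// A nil channel blocks forever in select, so the retry case is
		// only live when the last start attempt asked for a retry.
		var retry <-chan time.Time
		if needRetry {
			retry = time.After(retryDelay)
		}

		select {
		case <-retry:
			// Retry timer expired: loop and start again.
		case <-kubeletRestarted:
			// Stand-in for "kubelet socket was re-created".
		case <-reload:
			// Stand-in for SIGHUP.
		case <-terminate:
			// Stand-in for any other signal: leave the loop.
			return starts
		}
	}
}

func main() {
	terminate := make(chan struct{})
	calls := 0
	startFn := func() bool {
		calls++
		if calls == 2 {
			close(terminate) // simulate a shutdown signal after the second start
		}
		return calls == 1 // the first attempt fails and asks for a retry
	}
	n := run(startFn, nil, nil, terminate, 10*time.Millisecond)
	fmt.Println("start attempts:", n)
}
```

<p>Leaving the retry channel nil unless a retry was requested mirrors how <code>restartTimeout</code> is only assigned when <code>restartPlugins</code> is true.</p>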
<p><code>cmd/device-plugin/nvidia/main.go:239</code></p>
<p>This function starts the plugins; the key call is <code>p.Start()</code>.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func startPlugins(c *cli.Context, flags []cli.Flag, restarting bool) ([]plugin.Interface, bool, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Load the configuration file</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Loading configuration.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> config, err := loadConfig(c, flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("unable to load config: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> disableResourceRenamingInConfig(config)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*Loading config files*/</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //fmt.Println("NodeName=", config.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devConfig, err := generateDeviceConfigFromNvidia(config, c, flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err 
!= nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf("failed to load config file %s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Update the configuration file with default resources.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Updating config with default resource matching patterns.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = rm.AddDefaultResourcesToConfig(&amp;devConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("unable to add default resources to config: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Print the config to the output.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> configJSON, err := json.MarshalIndent(devConfig, "", "  ")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("failed to marshal config to JSON: %v", 
err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("\nRunning with config:\n%v", string(configJSON))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Get the set of plugins.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Retrieving plugins.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> pluginManager, err := NewPluginManager(&amp;devConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("error creating plugin manager: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugins, err := pluginManager.GetPlugins()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("error getting plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Loop through all plugins, starting them if they have any devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // to serve. 
If even one plugin fails to start properly, try</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // starting them all again.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> started := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, p := range plugins {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Just continue if there are no devices to serve for plugin p.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(p.Devices()) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Start the gRPC server for plugin p and connect it with the kubelet.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err := p.Start(); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("Could not contact Kubelet. 
Did you enable the device plugin feature gate?")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return plugins, true, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  started++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if started == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Info("No devices found. Waiting indefinitely.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return plugins, false, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Each plugin <code>p</code> must implement the following methods so that it can be managed.</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/api.go:37</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type Interface interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Devices() rm.Devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Start() error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Stop() error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
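<p>The start-up loop and the interface above can be tried out in isolation. The following is a minimal sketch, not HAMi's code: <code>fakePlugin</code> and <code>startAll</code> are hypothetical names, and <code>Devices()</code> returns a plain string slice instead of <code>rm.Devices</code> so that the example has no external dependencies.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Interface mirrors the three methods quoted above. Devices returns a plain
// slice here (an assumption) instead of rm.Devices.
type Interface interface {
	Devices() []string
	Start() error
	Stop() error
}

// fakePlugin is a hypothetical stand-in for a real device plugin.
type fakePlugin struct {
	devices []string
	failOn  bool // when true, Start returns an error
	running bool
}

func (p *fakePlugin) Devices() []string { return p.devices }

func (p *fakePlugin) Start() error {
	if p.failOn {
		return errors.New("could not contact kubelet")
	}
	p.running = true
	return nil
}

func (p *fakePlugin) Stop() error {
	p.running = false
	return nil
}

// startAll follows the loop shown earlier: plugins without devices are
// skipped, and the first start failure asks the caller to restart everything.
func startAll(plugins []Interface) (started int, restart bool) {
	for _, p := range plugins {
		if len(p.Devices()) == 0 {
			continue
		}
		if err := p.Start(); err != nil {
			return started, true
		}
		started++
	}
	return started, false
}

func main() {
	plugins := []Interface{
		&fakePlugin{devices: []string{"GPU-0", "GPU-1"}},
		&fakePlugin{}, // no devices, so it is skipped
	}
	started, restart := startAll(plugins)
	fmt.Println(started, restart) // prints "1 false"
}
```

<p>Keeping the interface this small means the start-up loop never needs to know which vendor or resource a plugin serves.</p>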
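<p>In addition, for kubelet to recognize extended resource fields such as <code>nvidia.com/gpu: 1</code> in a resource spec, the plugin must start a gRPC
service with <code>/var/lib/kubelet/device-plugins/</code> mounted and implement the methods below. This has little to do with scheduling, so it is not covered in detail here; see
<a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" target="_blank" rel="noopener noreferrer" class="">device-plugins</a>.</p>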
<p><code>k8s.io/kubelet@v0.28.3/pkg/apis/deviceplugin/v1beta1/api.pb.go:1419</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type DevicePluginServer interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // GetDevicePluginOptions returns options to be communicated with Device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Manager</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetDevicePluginOptions(context.Context, *Empty) (*DevicePluginOptions, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // ListAndWatch returns a stream of List of Devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Whenever a Device state change or a Device disappears, ListAndWatch</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // returns the new list</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ListAndWatch(*Empty, DevicePlugin_ListAndWatchServer) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // GetPreferredAllocation returns a preferred set of devices to allocate</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // from a list of available ones. 
The resulting preferred allocation is not</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // guaranteed to be the allocation ultimately performed by the</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // devicemanager. It is only designed to help the devicemanager make a more</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // informed allocation decision when possible.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetPreferredAllocation(context.Context, *PreferredAllocationRequest) (*PreferredAllocationResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Allocate is called during container creation so that the Device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Plugin can run device specific operations and instruct Kubelet</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // of the steps to make the Device available in the container</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Allocate(context.Context, *AllocateRequest) (*AllocateResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // PreStartContainer is called, if indicated by Device Plugin during registration phase,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // before each container start. 
Device plugin can run device specific operations</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // such as resetting the device before making devices available to the container</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> PreStartContainer(context.Context, *PreStartContainerRequest) (*PreStartContainerResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="nvidia-插件的实现">nvidia 插件的实现<a href="https://project-hami.io/zh/blog/2024/12/31/post#nvidia-%E6%8F%92%E4%BB%B6%E7%9A%84%E5%AE%9E%E7%8E%B0" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h4>
<p>The part to focus on is <code>plugin.WatchAndRegister()</code>.</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go:196</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) Start() error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugin.initialize()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err := plugin.Serve()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.cleanup()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Starting to serve '%s' on %s", plugin.rm.Resource(), plugin.socket)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = plugin.Register()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
klog.Infof("Could not register device plugin: %s", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Registered device plugin for '%s' with Kubelet", plugin.rm.Resource())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if plugin.operatingMode == "mig" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd := exec.Command("nvidia-mig-parted", "export")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var stdout, stderr bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd.Stdout = &amp;stdout</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd.Stderr = &amp;stderr</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := cmd.Run()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Fatalf("nvidia-mig-parted failed with %s\n", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  outStr := stdout.Bytes()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  yaml.Unmarshal(outStr, 
&amp;plugin.migCurrent)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  os.WriteFile("/tmp/migconfig.yaml", outStr, os.ModePerm)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(plugin.migCurrent.MigConfigs["current"]) == 1 &amp;&amp; len(plugin.migCurrent.MigConfigs["current"][0].Devices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   idx := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   plugin.migCurrent.MigConfigs["current"][0].Devices = make([]int32, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for idx &lt; GetDeviceNums() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    plugin.migCurrent.MigConfigs["current"][0].Devices = append(plugin.migCurrent.MigConfigs["current"][0].Devices, int32(idx))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    idx++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("Mig export", plugin.migCurrent)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go func() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := plugin.rm.CheckHealth(plugin.stop, plugin.health)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Failed to start 
health check: %v; continuing with health checks disabled", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go func() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.WatchAndRegister()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
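<p>This is a polling loop: every 30s it collects the node's device information and writes it to the node annotations. If an attempt fails, it retries after 5s.</p>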
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) WatchAndRegister() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting WatchAndRegister")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> errorSleepInterval := time.Second * 5</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> successSleepInterval := time.Second * 30</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := plugin.RegisterInAnnotation()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorf("Failed to register annotation: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Retrying in %v seconds...", errorSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   time.Sleep(errorSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Successfully registered annotation. 
Next check in %v seconds...", successSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   time.Sleep(successSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) RegisterInAnnotation() error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices := plugin.getAPIDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("start working on the devices", "devices", devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> node, err := util.GetNode(util.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorln("get node error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> encodeddevices := util.EncodeNodeDevices(*devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos[nvidia.HandshakeAnnos] = "Reported " + time.Now().String()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos[nvidia.RegisterAnnos] = encodeddevices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("patch node with 
the following annos %v", fmt.Sprintf("%v", annos))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchNodeAnnotations(node, annos)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorln("patch node error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
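<p>The body of <code>util.PatchNodeAnnotations</code> is not shown here. One common way to update annotations without sending the whole Node object is a JSON patch document that contains only <code>metadata.annotations</code>. The sketch below builds such a document with the standard library; the annotation keys, type names, and <code>buildAnnotationPatch</code> are assumptions made for illustration.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed annotation keys, used only for illustration.
const (
	handshakeAnno = "hami.io/node-handshake"
	registerAnno  = "hami.io/node-nvidia-register"
)

// patchMetadata and nodePatch describe the only fields the patch touches.
type patchMetadata struct {
	Annotations map[string]string `json:"annotations,omitempty"`
}

type nodePatch struct {
	Metadata patchMetadata `json:"metadata"`
}

// buildAnnotationPatch returns a patch body that sets the given annotations
// and leaves every other field of the node untouched.
func buildAnnotationPatch(annos map[string]string) ([]byte, error) {
	return json.Marshal(nodePatch{Metadata: patchMetadata{Annotations: annos}})
}

func main() {
	patch, err := buildAnnotationPatch(map[string]string{
		handshakeAnno: "Reported 2024-12-31 10:00:00",
		registerAnno:  "encoded-devices",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

<p>Because the handshake annotation carries a fresh timestamp on every round, a reader of the annotations can tell a live plugin from a stale report.</p>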
<p>The data collection logic is as follows.</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/register.go:110</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) getAPIDevices() *[]*util.DeviceInfo {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devs := plugin.Devices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).InfoS("getAPIDevices", "devices", devs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nvml.Init()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := make([]*util.DeviceInfo, 0, len(devs))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for UUID := range devs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ndev, ret := nvml.DeviceGetHandleByUUID(UUID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nvml new device by index error uuid=", UUID, "err=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  idx, ret := ndev.GetIndex()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nvml get index error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memoryTotal := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memory, ret := ndev.GetMemoryInfo()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret == nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memoryTotal = int(memory.Total)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("nvml get memory error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Model, ret := ndev.GetName()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("nvml get name error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
registeredmem := int32(memoryTotal / 1024 / 1024)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if plugin.schedulerConfig.DeviceMemoryScaling != 1 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   registeredmem = int32(float64(registeredmem) * plugin.schedulerConfig.DeviceMemoryScaling)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("MemoryScaling=", plugin.schedulerConfig.DeviceMemoryScaling, "registeredmem=", registeredmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  health := true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range devs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if strings.Compare(val.ID, UUID) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    // when NVIDIA-Tesla P4, the device info is : ID:GPU-e290caca-2f0c-9582-acab-67a142b61ffa,Health:Healthy,Topology:nil,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    // it is more reasonable to think of healthy as case-insensitive</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if strings.EqualFold(val.Health, "healthy") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     health = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     health = false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">    break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  numa, err := plugin.getNumaInformation(idx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "failed to get numa information", "idx", idx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = append(res, &amp;util.DeviceInfo{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ID:      UUID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Index:   uint(idx),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Count:   int32(plugin.schedulerConfig.DeviceSplitCount),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Devmem:  registeredmem,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Devcore: int32(plugin.schedulerConfig.DeviceCoreScaling * 100),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Type:    fmt.Sprintf("%v-%v", "NVIDIA", Model),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Numa:    numa,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Mode:    plugin.operatingMode,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">   Health:  health,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof("nvml registered device id=%v, memory=%v, type=%v, numa=%v", idx, registeredmem, Model, numa)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
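<p>The device information is obtained through the NVIDIA driver. Note the <code>DeviceMemoryScaling</code> setting, which configures device memory oversubscription.
Its value comes from the scheduler configuration given by the <code>--config-file</code> command-line flag and from the
<code>config/config.json</code> path fixed in the code; <code>config/config.json</code> takes precedence over <code>--config-file</code>.</p>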
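<p>The effect of <code>DeviceMemoryScaling</code> on the registered memory is easy to check by hand. The sketch below reproduces the arithmetic from <code>getAPIDevices</code> in a standalone function; the name <code>registeredMemMiB</code> is hypothetical.</p>

```go
package main

import "fmt"

// registeredMemMiB converts a byte count to MiB and applies the scaling
// factor. The int32 conversion truncates, as in the snippet above.
func registeredMemMiB(totalBytes uint64, scaling float64) int32 {
	mem := int32(totalBytes / 1024 / 1024)
	if scaling != 1 {
		mem = int32(float64(mem) * scaling)
	}
	return mem
}

func main() {
	const total = 16 * 1024 * 1024 * 1024 // a 16 GiB card, in bytes

	fmt.Println(registeredMemMiB(total, 1))   // prints 16384
	fmt.Println(registeredMemMiB(total, 1.5)) // prints 24576
}
```

<p>With a scaling factor of 1.5, a 16 GiB card is therefore reported to the scheduler as 24576 MiB, which is what makes oversubscription possible.</p>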
<p>At this point everything the scheduler needs is in place, and Pods can be assigned to suitable nodes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="参考">References<a href="https://project-hami.io/zh/blog/2024/12/31/post#%E5%8F%82%E8%80%83" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://kubernetes.io/" target="_blank" rel="noopener noreferrer" class="">kubernetes 官网</a></li>
<li class=""><a href="https://www.qikqiak.com/post/custom-kube-scheduler/" target="_blank" rel="noopener noreferrer" class="">自定义 Kubernetes 调度器</a></li>
<li class=""><a href="https://www.lixueduan.com/posts/kubernetes/21-device-plugin/" target="_blank" rel="noopener noreferrer" class="">自定义资源支持：K8s Device Plugin 从原理到实现</a></li>
</ul>]]></content>
        <author>
            <name>Elrond Wang</name>
            <uri>https://github.com/elrondwong</uri>
        </author>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing HAMi]]></title>
        <id>https://project-hami.io/zh/blog/2024/12/18/support-blog-post</id>
        <link href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post"/>
        <updated>2024-12-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[What is HAMi?]]></summary>
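        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="什么是-hami">What is HAMi?<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E4%BB%80%E4%B9%88%E6%98%AF-hami" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>HAMi (Heterogeneous AI Computing Virtualization Middleware), formerly known as k8s-vGPU-scheduler, is a solution
for managing heterogeneous AI computing devices in Kubernetes clusters. This all-in-one middleware lets various AI devices be shared
while keeping resources isolated between tasks. It offers a unified multiplexing interface across device types,
which raises the utilization of heterogeneous computing devices.</p>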
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="为什么选择-hami">Why HAMi?<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E4%B8%BA%E4%BB%80%E4%B9%88%E9%80%89%E6%8B%A9-hami" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-本机-api-兼容性">Kubernetes native API compatibility<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#kubernetes-%E6%9C%AC%E6%9C%BA-api-%E5%85%BC%E5%AE%B9%E6%80%A7" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>One of HAMi's standout features is its compatibility with the Kubernetes native API. Users can
upgrade to HAMi without modifying their existing configurations, which gives a seamless transition while keeping Kubernetes' default behavior.</p>
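<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="开放和中立">Open and neutral<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E5%BC%80%E6%94%BE%E5%92%8C%E4%B8%AD%E7%AB%8B" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi is a collaborative initiative involving stakeholders from many sectors, including internet services, finance, manufacturing, and cloud service providers.
The goal is to establish open governance under the Cloud Native Computing Foundation (CNCF), so that HAMi stays neutral and accessible to all users.</p>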
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="避免供应商锁定">Avoiding vendor lock-in<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E9%81%BF%E5%85%8D%E4%BE%9B%E5%BA%94%E5%95%86%E9%94%81%E5%AE%9A" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>With HAMi, users can integrate with mainstream cloud service providers without being tied to a proprietary vendor's orchestration.
This flexibility lets organizations choose the cloud solution they prefer while still using HAMi's features.</p>
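<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="资源隔离">Resource isolation<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E8%B5%84%E6%BA%90%E9%9A%94%E7%A6%BB" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi provides strong resource isolation inside containers. Each task running in a container is confined to its allocated resources,
which prevents any task from exceeding its quota. This strict isolation improves the security and stability of the computing environment.</p>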
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="支持多种异构计算设备">Support for multiple heterogeneous computing devices<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E6%94%AF%E6%8C%81%E5%A4%9A%E7%A7%8D%E5%BC%82%E6%9E%84%E8%AE%A1%E7%AE%97%E8%AE%BE%E5%A4%87" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>HAMi supports a wide range of heterogeneous computing devices. Whether they are GPUs, MLUs, or NPUs from different manufacturers,
HAMi enables device sharing and maximizes resource efficiency across hardware platforms.</p>
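<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="统一管理">Unified management<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E7%BB%9F%E4%B8%80%E7%AE%A1%E7%90%86" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h3>
<p>To simplify operations, HAMi provides a unified monitoring system and configurable scheduling policies such as binpack and spread.
This comprehensive approach makes resources easier to oversee and improves overall system performance.</p>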
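<p>To make the difference between the two policies concrete, here is a generic sketch of node selection. It is not HAMi's scoring code: the <code>node</code> type, <code>pickNode</code>, and the utilization-based score are assumptions chosen to illustrate the idea that binpack favors the fullest node that still fits the request, while spread favors the emptiest.</p>

```go
package main

import "fmt"

// node is a hypothetical view of one node's device capacity.
type node struct {
	name  string
	used  int
	total int
}

// pickNode returns the preferred node for a request under the given policy.
// "binpack" prefers the most utilized node that still fits the request;
// any other value is treated like "spread" only if it equals "spread".
func pickNode(nodes []node, request int, policy string) (string, bool) {
	best, bestScore, found := "", 0.0, false
	for _, n := range nodes {
		if n.total == 0 || n.total-n.used < request {
			continue // the request does not fit on this node
		}
		score := float64(n.used) / float64(n.total)
		if policy == "spread" {
			score = 1 - score
		}
		if !found || score > bestScore {
			best, bestScore, found = n.name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{{"node-a", 6, 8}, {"node-b", 1, 8}}
	a, _ := pickNode(nodes, 2, "binpack")
	b, _ := pickNode(nodes, 2, "spread")
	fmt.Println(a, b) // prints "node-a node-b"
}
```

<p>Binpack keeps whole nodes free for large jobs, while spread reduces contention between tasks that share a device.</p>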
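<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="结语">Conclusion<a href="https://project-hami.io/zh/blog/2024/12/18/support-blog-post#%E7%BB%93%E8%AF%AD" class="hash-link" aria-label="跳转到头部" title="跳转到头部" translate="no">​</a></h2>
<p>In short, HAMi is a significant step forward in managing heterogeneous AI computing resources in Kubernetes environments. Its compatibility with existing systems,
its commitment to open governance, and its strong resource management make it a valuable tool for organizations looking to optimize their AI computing infrastructure.</p>
<p>Join us in making AI computing more efficient and flexible with HAMi!</p>]]></content>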
        <author>
            <name>HAMi Community</name>
        </author>
    </entry>
</feed>