Monitoring GPUs on Kubernetes with DCGM and Prometheus


Introduction to DCGM

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies covering power and clock management. It can be used standalone by system administrators and integrates easily into the cluster-management, resource-scheduling, and monitoring products of NVIDIA partners. DCGM simplifies GPU management in the data center, improves resource reliability and uptime, automates administrative tasks, and helps raise overall infrastructure efficiency.

Per-pod GPU metrics in a Kubernetes cluster

dcgm-exporter collects metrics for all GPUs available on a node. In Kubernetes, however, when a pod requests GPU resources you do not necessarily know which of the node's GPUs will be assigned to it. Starting with v1.13, the kubelet added a device-monitoring feature: through the pod-resources socket you can look up the pod name, pod namespace, and device IDs of the devices assigned to each pod. The server inside dcgm-exporter connects to the kubelet pod-resources endpoint (/var/lib/kubelet/pod-resources) to identify the GPU devices running in each pod and attaches that pod information to the collected metrics.
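For illustration, a scraped series then carries the pod attribution as extra labels. A rough sketch of what one exposed sample can look like (label names and values here are illustrative and vary with the dcgm-exporter version):

# Illustrative exposition-format sample from /metrics (values are made up)
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a7c",device="nvidia0",namespace="default",pod="train-job-0",container="trainer"} 87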

Deploying the DCGM tooling

Project repository: DCGM-Exporter

GPU metrics collected

The metrics collected by dcgm-exporter and their meanings:

Node labels

We deploy dcgm-exporter as a DaemonSet, so we first label the GPU nodes to make deployment and later maintenance easier.

# Add the label to the node
kubectl label nodes fsyy nvidia-gpu=monitoring
# Check whether the label was applied
kubectl get nodes --show-labels
# Remove the label
kubectl label nodes fsyy nvidia-gpu-
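To list only the nodes that carry the label (a quick sanity check, using the same label key and value as above):

# Show just the GPU nodes selected for monitoring
kubectl get nodes -l nvidia-gpu=monitoring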

Deploying on Kubernetes

dcgm-metrics.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: "dcgm-metrics"
  namespace: monitoring
data:
  # Property-like keys; each key maps to a simple value
  default-counters.csv: |
    # Format,,
    # If line starts with a '#' it is considered a comment,,
    # DCGM FIELD, Prometheus metric type, help message
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    # Power,,
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # PCIE,,
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    # ECC,,
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    # Retired pages,,
    # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    # NVLink,,
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
    # VGPU License status,,
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    # Remapped rows,,
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

dcgm-exporter.yaml

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400

# Deploy the default-counters.csv ConfigMap
kubectl apply -f dcgm-metrics.yaml
# Deploy the dcgm-exporter DaemonSet and Service
kubectl apply -f dcgm-exporter.yaml

The services above are deployed in the monitoring namespace. Check the status of each resource:

# Check the Service and Pods
kubectl get svc,pod -n monitoring -l app.kubernetes.io/name=dcgm-exporter

Check that GPU metrics are being collected:

# Forward a local port to the service
kubectl -n monitoring port-forward service/dcgm-exporter 9400
# Fetch the GPU metrics
curl 127.0.0.1:9400/metrics

Integrating with Prometheus

Once the resources above are created, the corresponding target appears in the in-cluster prometheus-server. When its state shows up, Prometheus is collecting the cluster's GPU metrics normally; from then on the data is scraped into Prometheus storage continuously at a 10s interval. My Prometheus uses annotation-based automatic service discovery; the discovered target is shown below:
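For reference, a minimal sketch of such an annotation-based discovery job is shown below; the job name and exact relabel rules are assumptions and should be matched to your own Prometheus configuration:

# Sketch: scrape any Service endpoint annotated with prometheus.io/scrape: "true"
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose Service opts in via the scrape annotation
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Rewrite the scrape address to the port given in prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Expose namespace and service name as regular labels
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name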

Monitoring dashboards in Grafana

We can use a template from the official Grafana dashboards site; here we use dashboard 12639. The variables need to be adjusted to your own environment:

After the dashboard is imported and the variables are adjusted correctly, the final GPU monitoring view looks like this:

There is one problem here: the gpu_host IP address shown in Grafana is not the real node IP or node name. The value of gpu_host is actually the IP of the dcgm-exporter-xxxx pod, which you can verify with the following command:

kubectl get pod -n monitoring -o wide

This makes it less straightforward for developers to find the GPU metrics of a given node: they may not have permission to inspect pod details, so they depend heavily on the operations team, which hurts everyone's efficiency.

To address this, we simply need to change the "pod IP" to the "node IP". Before getting started, we need to understand the following:

ServiceMonitor configuration

In the "Integrating with Prometheus" section above, I ingested the GPU metrics via automatic service discovery. Now we need to rewrite the labels, so let's define a ServiceMonitor YAML file.

# First remove the auto-discovery annotations from the Service
kubectl edit svc dcgm-exporter -n monitoring
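Alternatively, if you prefer not to edit interactively, the two scrape annotations defined on the Service earlier can be dropped non-interactively (a trailing '-' removes an annotation); a small sketch:

# Remove the prometheus.io scrape annotations from the dcgm-exporter Service
kubectl annotate svc dcgm-exporter -n monitoring prometheus.io/scrape- prometheus.io/port-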

# cat service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    relabelings:
    - action: replace
      sourceLabels: [__meta_kubernetes_endpoint_node_name]
      targetLabel: node_name

# Deploy the ServiceMonitor
kubectl apply -f service-monitor.yaml

Then open Prometheus --> Targets and you should see an entry like serviceMonitor/monitoring/dcgm-exporter/0 (7/7 up).
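As a quick spot check, you can query any DCGM series in the Prometheus UI and confirm that the rewritten label is present; the node name below is a placeholder:

# PromQL: GPU utilization filtered by the new node_name label
DCGM_FI_DEV_GPU_UTIL{node_name="your-gpu-node"}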

Prometheus label rewriting

Common built-in labels

__address__       # the address of the current target instance
__scheme__        # the scheme (HTTP or HTTPS) of the target's scrape address
__metrics_path__  # the path on the target from which metrics are scraped
__param_<name>    # a request parameter included when scraping the target

Generally speaking, labels prefixed with __ on a target are built-in system labels, and they are not written into the metric data. There are exceptions, though: as we saw earlier, every sample Prometheus scrapes carries an instance label, and its value corresponds to the target's __address__ label. This is, in effect, the result of a label rewrite.
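As a minimal sketch of the mechanism (this mirrors what Prometheus already does by default for instance, shown here only to illustrate relabeling), a rule that copies the built-in __address__ label into a label that survives on the stored series would look like:

relabel_configs:
# Copy the built-in __address__ meta label into the visible instance label
- source_labels: [__address__]
  action: replace
  target_label: instance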

Custom labels

- job_name: 'node'
  static_configs:
  - targets: ['192.168.1.21:9100']
    labels:                  # add custom labels
      env: test1-cluster
      __hostname__: localhost

Note: labels with the __ prefix do not appear in the exposed metric data.

Configuring the Grafana dashboard

With the label rewriting process above understood, we reconfigure the variables of the current dashboard; adjust them to your own environment. The final result looks like this:
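For example, a node-picker variable built on the rewritten node_name label could use a Prometheus query such as the following (the variable query and metric are assumptions; adapt them to the dashboard you imported):

# Grafana templating variable query (Prometheus data source)
label_values(DCGM_FI_DEV_GPU_UTIL, node_name)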

GPU alerting

Here is an example for reference:

# Add a rule like the following to the Prometheus rules file
- alert: GPUMemCopyUtilHigh
  expr: DCGM_FI_DEV_MEM_COPY_UTIL{kubernetes_namespace="monitoring"} > 80
  labels:
    severity: warning
  annotations:
    description: "GPU memory utilization of pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is too high."
    identifier: "{{ $labels.hostname }}"
    summary: GPU memory utilization above 80%

Open Alerts in Prometheus to see the newly created alerting rule:

