Monitoring GPUs on Kubernetes with DCGM and Prometheus


Introduction to DCGM

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies covering power and clock management. It can be used standalone by system administrators and integrates easily into the cluster-management, resource-scheduling, and monitoring products of NVIDIA partners. DCGM simplifies GPU management in the data center, improves resource reliability and uptime, automates administrative tasks, and helps raise overall infrastructure efficiency.

Per-pod GPU metrics in a Kubernetes cluster

dcgm-exporter collects metrics for all GPUs available on a node. In Kubernetes, however, when a pod requests GPU resources you do not necessarily know which of the node's GPUs will be assigned to it. Starting with v1.13, the kubelet added a device-monitoring feature: through the pod-resources socket you can look up the pod name, pod namespace, and device IDs of the devices assigned to each pod. The server inside dcgm-exporter connects to the kubelet pod-resources endpoint (/var/lib/kubelet/pod-resources) to identify the GPU devices running in each pod and attaches that pod information to the collected metrics.
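For illustration, a scraped series then carries the pod attribution as extra labels. A rough sketch of what one exposed sample can look like (label names and values here are illustrative and vary with the dcgm-exporter version):

# Illustrative exposition-format sample from /metrics (values are made up)
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-6a7c",device="nvidia0",namespace="default",pod="train-job-0",container="trainer"} 87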

Deploying the DCGM tooling

Project repository: DCGM-Exporter

GPU metrics collected

The metrics collected by dcgm-exporter and their meanings:

Node labels

We deploy dcgm-exporter as a DaemonSet, so we first label the GPU nodes to make deployment and later maintenance easier.

# Add the label to the node
kubectl label nodes fsyy nvidia-gpu=monitoring
# Check whether the label was applied
kubectl get nodes --show-labels
# Remove the label
kubectl label nodes fsyy nvidia-gpu-
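To list only the nodes that carry the label (a quick sanity check, using the same label key and value as above):

# Show just the GPU nodes selected for monitoring
kubectl get nodes -l nvidia-gpu=monitoring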

Deploying on Kubernetes

dcgm-metrics.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: "dcgm-metrics"
  namespace: monitoring
data:
  # Property-like keys; each key maps to a simple value
  default-counters.csv: |
    # Format,,
    # If line starts with a '#' it is considered a comment,,
    # DCGM FIELD, Prometheus metric type, help message
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    # Power,,
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # PCIE,,
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    # ECC,,
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    # Retired pages,,
    # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    # NVLink,,
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
    # VGPU License status,,
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    # Remapped rows,,
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

dcgm-exporter.yaml

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configMap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400

# Deploy the default-counters.csv ConfigMap
kubectl apply -f dcgm-metrics.yaml
# Deploy the dcgm-exporter DaemonSet and Service
kubectl apply -f dcgm-exporter.yaml

The services above are deployed in the monitoring namespace. Check the status of each resource:

# Check the Service and Pods
kubectl get svc,pod -n monitoring -l app.kubernetes.io/name=dcgm-exporter

Check that GPU metrics are being collected:

# Forward a local port to the service
kubectl -n monitoring port-forward service/dcgm-exporter 9400
# Fetch the GPU metrics
curl 127.0.0.1:9400/metrics

Integrating with Prometheus

Once the resources above are created, the corresponding target appears in the in-cluster prometheus-server. When its state shows up, Prometheus is collecting the cluster's GPU metrics normally; from then on the data is scraped into Prometheus storage continuously at a 10s interval. My Prometheus uses annotation-based automatic service discovery; the discovered target is shown below:
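For reference, a minimal sketch of such an annotation-based discovery job is shown below; the job name and exact relabel rules are assumptions and should be matched to your own Prometheus configuration:

# Sketch: scrape any Service endpoint annotated with prometheus.io/scrape: "true"
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose Service opts in via the scrape annotation
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Rewrite the scrape address to the port given in prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Expose namespace and service name as regular labels
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name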

Monitoring dashboards in Grafana

We can use a template from the official Grafana dashboards site; here we use dashboard 12639. The variables need to be adjusted to your own environment:

After the dashboard is imported and the variables are adjusted correctly, the final GPU monitoring view looks like this:

There is one problem here: the gpu_host IP address shown in Grafana is not the real node IP or node name. The value of gpu_host is actually the IP of the dcgm-exporter-xxxx pod, which you can verify with the following command:

kubectl get pod -n monitoring -o wide

This makes it less straightforward for developers to find the GPU metrics of a given node: they may not have permission to inspect pod details, so they depend heavily on the operations team, which hurts everyone's efficiency.

To address this, we simply need to change the "pod IP" to the "node IP". Before getting started, we need to understand the following:

ServiceMonitor configuration

In the "Integrating with Prometheus" section above, I ingested the GPU metrics via automatic service discovery. Now we need to rewrite the labels, so let's define a ServiceMonitor YAML file.

# First remove the auto-discovery annotations from the Service
kubectl edit svc dcgm-exporter -n monitoring
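Alternatively, if you prefer not to edit interactively, the two scrape annotations defined on the Service earlier can be dropped non-interactively (a trailing '-' removes an annotation); a small sketch:

# Remove the prometheus.io scrape annotations from the dcgm-exporter Service
kubectl annotate svc dcgm-exporter -n monitoring prometheus.io/scrape- prometheus.io/port-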

# cat service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    relabelings:
    - action: replace
      sourceLabels: [__meta_kubernetes_endpoint_node_name]
      targetLabel: node_name

# Deploy the ServiceMonitor
kubectl apply -f service-monitor.yaml

Then open Prometheus --> Targets and you should see an entry like serviceMonitor/monitoring/dcgm-exporter/0 (7/7 up).
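As a quick spot check, you can query any DCGM series in the Prometheus UI and confirm that the rewritten label is present; the node name below is a placeholder:

# PromQL: GPU utilization filtered by the new node_name label
DCGM_FI_DEV_GPU_UTIL{node_name="your-gpu-node"}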

Prometheus label rewriting

Common built-in labels

__address__       # the address of the current target instance
__scheme__        # the scheme (HTTP or HTTPS) of the target's scrape address
__metrics_path__  # the path on the target from which metrics are scraped
__param_<name>    # a request parameter included when scraping the target

Generally speaking, labels prefixed with __ on a target are built-in system labels, and they are not written into the metric data. There are exceptions, though: as we saw earlier, every sample Prometheus scrapes carries an instance label, and its value corresponds to the target's __address__ label. This is, in effect, the result of a label rewrite.
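As a minimal sketch of the mechanism (this mirrors what Prometheus already does by default for instance, shown here only to illustrate relabeling), a rule that copies the built-in __address__ label into a label that survives on the stored series would look like:

relabel_configs:
# Copy the built-in __address__ meta label into the visible instance label
- source_labels: [__address__]
  action: replace
  target_label: instance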

Custom labels

- job_name: 'node'
  static_configs:
  - targets: ['192.168.1.21:9100']
    labels:                  # add custom labels
      env: test1-cluster
      __hostname__: localhost

Note: labels with the __ prefix do not appear in the exposed metric data.

Configuring the Grafana dashboard

With the label rewriting process above understood, we reconfigure the variables of the current dashboard; adjust them to your own environment. The final result looks like this:
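For example, a node-picker variable built on the rewritten node_name label could use a Prometheus query such as the following (the variable query and metric are assumptions; adapt them to the dashboard you imported):

# Grafana templating variable query (Prometheus data source)
label_values(DCGM_FI_DEV_GPU_UTIL, node_name)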

GPU alerting

Here is an example for reference:

# Add a rule like the following to the Prometheus rules file
- alert: GPUMemCopyUtilHigh
  expr: DCGM_FI_DEV_MEM_COPY_UTIL{kubernetes_namespace="monitoring"} > 80
  labels:
    severity: warning
  annotations:
    description: "GPU memory utilization of pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is too high."
    identifier: "{{ $labels.hostname }}"
    summary: GPU memory utilization above 80%

Open Alerts in Prometheus to see the newly created alerting rule:

