k8s VPA In-Depth Analysis


1. Introduction

1.1 VPA Components

VPA consists of three components: the recommender, the updater and the admission-controller.

recommender: computes recommended resource values from the container metrics obtained through the metrics-server API.

updater: compares the requests set on a pod with the recommendations computed by the recommender and, under certain conditions, evicts the pod.

admission-controller: a webhook that watches pod creation and writes the recommender's values into the pod's requests and limits.

1.2 Custom Resources

Two custom resources are involved: VerticalPodAutoscaler and VerticalPodAutoscalerCheckpoint.

VerticalPodAutoscaler:

Created by the user; it specifies the target object for vertical scaling and stores the recommendations computed by the recommender.

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "nginx"
      minAllowed:
        cpu: "90m"
        memory: "10Mi"
      maxAllowed:
        cpu: "1000m"
        memory: "1Gi"
status:
  conditions:
  - lastTransitionTime: "2021-11-24T08:54:41Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: nginx
      lowerBound:
        cpu: 90m
        memory: 131072k
      target:
        cpu: 90m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 131072k
    - containerName: yellow-pod-container
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 407m
        memory: 425500k

VerticalPodAutoscalerCheckpoint:

Created and maintained by the recommender; it persists the metric (histogram) state.

For the multiple containers covered by one VPA, one checkpoint resource is created per container.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  creationTimestamp: "2021-11-25T08:00:20Z"
  name: nginx-vpa-nginx
  namespace: default
spec:
  containerName: nginx
  vpaObjectName: nginx-vpa
status:
  cpuHistogram:
    bucketWeights:
      "0": 10000
      "9": 4
      "11": 2
      "12": 16
      "13": 2
      "15": 1
      "16": 11
      "17": 5
      "31": 40
      "34": 2
      "44": 2
      "60": 13
      "75": 744
    referenceTimestamp: "2021-11-28T00:00:00Z"
    totalWeight: 560.4882004282056
  firstSampleStart: null
  lastSampleStart: "2021-11-28T14:13:59Z"
  lastUpdateTime: null
  memoryHistogram:
    bucketWeights:
      "0": 10000
      "1": 67
      "34": 141
    referenceTimestamp: "2021-11-29T00:00:00Z"
    totalWeight: 23.976566526407623
  totalSamplesCount: 9371
  version: v3

2. The recommender Component

2.1 YAML Overview

2.1.1 Version differences

Version 0.7

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
  namespace: default
spec:
  targetRef:                 # the target resource; may also be a custom resource whose information the scaleClient can fetch, e.g. CloneSet, which can also serve as an HPA target
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx
  updatePolicy:              # Auto, Recreate or Off; with Off only recommendations are computed and the updater does not evict pods; with Auto the updater evicts pods
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:       # the containers for which recommendations apply
    - containerName: "nginx"
      minAllowed:            # if minAllowed/maxAllowed are set, the recommender clamps the final recommendation to this range
        cpu: "90m"
        memory: "10Mi"
      maxAllowed:
        cpu: "1000m"
        memory: "1Gi"

Version 0.8

Adds a controlledResources field.

It can be set to cpu only, memory only, or both.

The field is used by the recommender component.

Allowed values: enum: ["cpu", "memory"]

// Specifies the type of recommendations that will be computed
// (and possibly applied) by VPA.
// If not specified, the default of [ResourceCPU, ResourceMemory] will be used.
ControlledResources *[]v1.ResourceName `json:"controlledResources,omitempty" patchStrategy:"merge" protobuf:"bytes,5,rep,name=controlledResources"`

Version 0.9

Adds a controlledValues field.

The field is not used by the recommender; it is used by the admission component.

Allowed values: enum: ["RequestsAndLimits", "RequestsOnly"]

// Specifies which resource values should be controlled.
// The default is "RequestsAndLimits".
// +optional
ControlledValues *ContainerControlledValues `json:"controlledValues,omitempty" protobuf:"bytes,6,rep,name=controlledValues"`

2.1.2 targetRef

Specifies the target resource.

It can be one of: DaemonSet, Deployment, ReplicaSet, StatefulSet, ReplicationController, Job, CronJob.

It can also be a custom resource whose information the scaleClient can fetch, e.g. CloneSet, which can also serve as an HPA target.

Notes:

1. The updater's eviction tolerance fraction (evictionToleranceFraction) only takes effect when the target is a Deployment, ReplicaSet, ReplicationController, StatefulSet or Job. For other resources, all pods are evicted at once, because the computed tolerance is 0.

targetRef:
  apiVersion: "apps/v1"
  kind: Deployment
  name: nginx

Source code:

// Fetch returns the label selector of the VPA's target resource, e.g. a Deployment's selector.
func (f *vpaTargetSelectorFetcher) Fetch(vpa *vpa_types.VerticalPodAutoscaler) (labels.Selector, error) {
    if vpa.Spec.TargetRef == nil {
        return nil, fmt.Errorf("targetRef not defined. If this is a v1beta1 object switch to v1beta2.")
    }
    kind := wellKnownController(vpa.Spec.TargetRef.Kind)
    // informersMap := map[wellKnownController]cache.SharedIndexInformer{
    //     daemonSet:             factory.Apps().V1().DaemonSets().Informer(),
    //     deployment:            factory.Apps().V1().Deployments().Informer(),
    //     replicaSet:            factory.Apps().V1().ReplicaSets().Informer(),
    //     statefulSet:           factory.Apps().V1().StatefulSets().Informer(),
    //     replicationController: factory.Core().V1().ReplicationControllers().Informer(),
    //     job:                   factory.Batch().V1().Jobs().Informer(),
    //     cronJob:               factory.Batch().V1beta1().CronJobs().Informer(),
    // }
    informer, exists := f.informersMap[kind]
    if exists {
        return getLabelSelector(informer, vpa.Spec.TargetRef.Kind, vpa.Namespace, vpa.Spec.TargetRef.Name)
    }
    // not on a list of known controllers, use scale sub-resource
    // TODO: cache response
    groupVersion, err := schema.ParseGroupVersion(vpa.Spec.TargetRef.APIVersion)
    if err != nil {
        return nil, err
    }
    groupKind := schema.GroupKind{
        Group: groupVersion.Group,
        Kind:  vpa.Spec.TargetRef.Kind,
    }
    // fetch the resource's selector via the scaleClient
    selector, err := f.getLabelSelectorFromResource(groupKind, vpa.Namespace, vpa.Spec.TargetRef.Name)
    if err != nil {
        return nil, fmt.Errorf("Unhandled targetRef %s / %s / %s, last error %v",
            vpa.Spec.TargetRef.APIVersion, vpa.Spec.TargetRef.Kind, vpa.Spec.TargetRef.Name, err)
    }
    return selector, nil
}

2. The target resource must not itself have an owner.

If no recommendation is produced, it can be because the Deployment itself still has an owner reference.


2.1.3 updatePolicy

Can be set to Auto, Recreate, Off or Initial.

Off or Initial: only recommendations are computed and the updater does not evict pods. No difference between the two was found in the updater code.

Auto or Recreate: the updater evicts pods. No difference between the two was found in the code.

updatePolicy:
  updateMode: "Auto"

2.1.4 resourcePolicy

If minAllowed and maxAllowed are set, the recommender clamps the final recommendation to this range.

resourcePolicy:
  containerPolicies:           # the containers for which recommendations apply
  - containerName: "nginx"
    minAllowed:                # the final recommendation is clamped to [minAllowed, maxAllowed]
      cpu: "90m"
      memory: "10Mi"
    maxAllowed:
      cpu: "1000m"
      memory: "1Gi"
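A minimal sketch of that clamping step, assuming a single scalar resource value (the helper name is illustrative, not taken from the VPA source):

// clampRecommendation caps a computed target into [minAllowed, maxAllowed].
// Illustrative only; the real recommender works on model.Resources values.
func clampRecommendation(target, minAllowed, maxAllowed float64) float64 {
    if target < minAllowed {
        return minAllowed
    }
    if target > maxAllowed {
        return maxAllowed
    }
    return target
}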

2.2 Implementation Details

2.2.1 Flags and parameters

1. Metric storage

By default, each container's metrics are stored in a VerticalPodAutoscalerCheckpoint resource.

The --storage flag can be used to store the metrics in Prometheus instead.

This is a nice pattern: the component has its own default storage but can also be pointed at external storage. Could our own systems do the same, e.g. use a flag to decide whether to use SSO?

storage = flag.String("storage", "", `Specifies storage mode. Supported values: prometheus, checkpoint (default)`)

2. Metric fetch interval

Metrics are fetched and recommendations recomputed once per minute by default.

This can be customized via --recommender-interval.

metricsFetcherInterval = flag.Duration("recommender-interval", 1*time.Minute, `How often metrics should be fetched`)

3. Default minimum recommendation (defaultMinRecommend)

Minimum CPU recommendation: 25m * (1 / number of containers in the pod)

Minimum memory recommendation: 250Mi * (1 / number of containers in the VPA)

Note:

minRecommend = max(minAllowed, defaultMinRecommend)

The effective minimum of the recommendation is the larger of minAllowed and defaultMinRecommend.

They can be customized via --pod-recommendation-min-cpu-millicores and --pod-recommendation-min-memory-mb.

podMinCPUMillicores = flag.Float64("pod-recommendation-min-cpu-millicores", 25, `Minimum CPU recommendation for a pod`)
podMinMemoryMb      = flag.Float64("pod-recommendation-min-memory-mb", 250, `Minimum memory recommendation for a pod`)

Source code:

// containerNameToAggregateStateMap -- the containers under the pod/VPA
fraction := 1.0 / float64(len(containerNameToAggregateStateMap))
// minimum CPU: 25m * fraction
// minimum memory: 250Mi * fraction
// minResources: map[cpu:25 memory:262144000]
minResources := model.Resources{
    model.ResourceCPU:    model.ScaleResource(model.CPUAmountFromCores(*podMinCPUMillicores*0.001), fraction),
    model.ResourceMemory: model.ScaleResource(model.MemoryAmountFromBytes(*podMinMemoryMb*1024*1024), fraction),
}
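A worked example of the arithmetic above (the container count is an assumed value, not taken from the source): with two containers under the VPA, fraction = 0.5, so each container's default minimum is 12.5m CPU and 125Mi memory before minAllowed is applied.

package main

import "fmt"

// The pod-level defaults (25 millicores CPU, 250 MiB memory) are multiplied by
// 1/containerCount; the result is later combined with minAllowed via max().
func main() {
    const podMinCPUMillicores, podMinMemoryMb = 25.0, 250.0
    containerCount := 2
    fraction := 1.0 / float64(containerCount)
    fmt.Printf("min CPU per container: %.1fm\n", podMinCPUMillicores*fraction) // 12.5m
    fmt.Printf("min memory per container: %.0fMi\n", podMinMemoryMb*fraction) // 125Mi
}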

4. Final recommendation

final recommendation = bucketStart * (1 + recommendation-margin-fraction)

safetyMarginFraction = flag.Float64("recommendation-margin-fraction", 0.15, `Fraction of usage added as the safety margin to the recommended request`)
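A small worked example of the safety margin, with an assumed bucket start value:

package main

import "fmt"

// With the default recommendation-margin-fraction of 0.15, a bucket start of
// 100 millicores becomes a final recommendation of 115 millicores.
func main() {
    const safetyMarginFraction = 0.15
    bucketStart := 100.0 // millicores
    fmt.Printf("%.0fm\n", bucketStart*(1+safetyMarginFraction)) // 115m
}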

5. Prometheus metrics

Path of the recommender's own metrics.

1. InitFromCheckpoints

On startup the recommender restores the usage histograms from the VerticalPodAutoscalerCheckpoint resources. The bucketWeights stored in a checkpoint are integers scaled so that the largest bucket is 10000; when loading, each weight is multiplied back by ratio = totalWeight / sumWeight. Example:

bucketWeights in the checkpoint: map[0:10000 9:2 11:1 12:8 13:1 16:6 17:3 31:21 34:1 44:1 60:7 75:365]
TotalWeight: 549.3782628171234
sumWeight: 10416 = 10000+2+1+8+1+6+3+21+1+1+7+365
ratio = TotalWeight / sumWeight = 549.3782628171234 / 10416 = 0.052743688826528745
restored bucketWeight[i] = storedWeight[i] * ratio, e.g. 527.4368882652875 = 10000 * 0.052743688826528745
restored (non-zero) buckets: 0: 527.437, 9: 0.105, 11: 0.0527, 12: 0.422, 13: 0.0527, 16: 0.316, 17: 0.158, 31: 1.108, 34: 0.0527, 44: 0.0527, 60: 0.369, 75: 19.251; all other buckets are 0.
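A minimal sketch of that restore step (the function name is illustrative, not the VPA source):

// restoreBucketWeights scales the integer bucketWeights stored in a checkpoint
// back to floating-point weights: ratio = totalWeight / sum(storedWeights),
// restored[i] = stored[i] * ratio.
func restoreBucketWeights(stored map[int]uint32, totalWeight float64) map[int]float64 {
    sum := 0.0
    for _, w := range stored {
        sum += float64(w)
    }
    restored := make(map[int]float64, len(stored))
    if sum == 0 {
        return restored
    }
    ratio := totalWeight / sum
    for bucket, w := range stored {
        restored[bucket] = float64(w) * ratio
    }
    return restored
}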

2. LoadVPAs

Loads all VPAs into feeder.clusterState.Vpas.

3. LoadPods

Loads all pods into feeder.clusterState.pods.

Note:

aggregateContainerState = &AggregateContainerState{
    AggregateCPUUsage:    util.NewDecayingHistogram(CPUHistogramOptions, CPUHistogramDecayHalfLife),
    AggregateMemoryPeaks: util.NewDecayingHistogram(MemoryHistogramOptions, MemoryHistogramDecayHalfLife),
    CreationTime:         time.Now(),
}

1. pod.Containers[containerID.ContainerName] uses this aggregateContainerState.
2. cluster.aggregateStateMap[aggregateStateKey] = aggregateContainerState uses the same aggregateContainerState.
3. vpa.aggregateContainerStates inside clusterState's VPAs also uses the same aggregateContainerState.

Because all three places share the same aggregateContainerState, LoadRealTimeMetrics below updates the aggregates through pod[].containers, while UpdateVPAs reads the aggregates through the VPA.

Key source code:

// findOrCreateAggregateContainerState returns (possibly newly created) AggregateContainerState
// that should be used to aggregate usage samples from container with a given ID.
// The pod with the corresponding PodID must already be present in the ClusterState.
func (cluster *ClusterState) findOrCreateAggregateContainerState(containerID ContainerID) *AggregateContainerState {
    aggregateStateKey := cluster.aggregateStateKeyForContainerID(containerID)
    aggregateContainerState, aggregateStateExists := cluster.aggregateStateMap[aggregateStateKey]
    if !aggregateStateExists {
        aggregateContainerState = NewAggregateContainerState()
        cluster.aggregateStateMap[aggregateStateKey] = aggregateContainerState
        // Link the new aggregation to the existing VPAs.
        for _, vpa := range cluster.Vpas {
            // cluster.aggregateStateMap[aggregateStateKey] and vpa.aggregateContainerStates
            // end up pointing at the same aggregateContainerState.
            if vpa.UseAggregationIfMatching(aggregateStateKey, aggregateContainerState) {
                cluster.VpasWithMatchingPods[vpa.ID] = true
            }
        }
    }
    return aggregateContainerState
}

4. LoadRealTimeMetrics

The monitored values are added to the histograms as weights.

Based on the current usage fetched from metrics-server, the matching bucket is determined and its weight is increased:

CPU weight +0.1 per sample (in practice not exactly 0.1, closer to 0.14...)

Memory weight +1 per sample (in practice not exactly 1, closer to 1.4...)

Note:

Because aggregation is keyed by container name, a Deployment with 5 replicas adds 0.1 * 5 to container1's CPU weight per round.

The half-life here is 24 hours (CPUHistogramDecayHalfLife / MemoryHistogramDecayHalfLife in the source).

sampleTime is the time at which the metric sample was collected.

Each sample's weight is multiplied by 2^((sampleTime - referenceTimestamp) / halfLife), so that newer samples receive higher weight while older samples gradually decay. The default half-life is 24h: every 24 hours the weight (importance) of all samples in the histogram halves.
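A small sketch of that decay factor (illustrative, not the VPA source):

package main

import (
    "fmt"
    "math"
    "time"
)

// decayFactor returns 2^((sampleTime - referenceTimestamp) / halfLife):
// samples newer than the reference get weight > 1, older samples decay,
// and the weight halves for every halfLife of age.
func decayFactor(sampleTime, referenceTimestamp time.Time, halfLife time.Duration) float64 {
    return math.Exp2(float64(sampleTime.Sub(referenceTimestamp)) / float64(halfLife))
}

func main() {
    ref := time.Date(2021, 11, 28, 0, 0, 0, 0, time.UTC)
    halfLife := 24 * time.Hour
    fmt.Println(decayFactor(ref.Add(24*time.Hour), ref, halfLife))  // 2
    fmt.Println(decayFactor(ref.Add(-24*time.Hour), ref, halfLife)) // 0.5
}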

5. UpdateVPAs

targetPercentile = 0.9

lowerBoundPercentile = 0.5

upperBoundPercentile = 0.95

// Compute the recommendation from the bucketWeights in the VPA and write it into the VPA status.
partialSum := 0.0
threshold := percentile * h.totalWeight
bucket := h.minBucket
for ; bucket < h.maxBucket; bucket++ {
    partialSum += h.bucketWeight[bucket]
    if partialSum >= threshold {
        break
    }
}

The logic above locates the bucket, and the bucket start value is then derived from it. When the resulting bucket start is below minResources (as with the first bucket), minResources is used instead.
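A self-contained sketch of the same percentile lookup, simplified to a dense slice of bucket weights (the helper name and example values are assumptions):

package main

import "fmt"

// percentileBucket walks the buckets from the smallest and returns the index of
// the first bucket at which the accumulated weight reaches percentile*totalWeight.
func percentileBucket(bucketWeight []float64, percentile float64) int {
    totalWeight := 0.0
    for _, w := range bucketWeight {
        totalWeight += w
    }
    threshold := percentile * totalWeight
    partialSum := 0.0
    for bucket, w := range bucketWeight {
        partialSum += w
        if partialSum >= threshold {
            return bucket
        }
    }
    return len(bucketWeight) - 1
}

func main() {
    weights := []float64{5, 1, 1, 1, 2} // total weight 10
    fmt.Println(percentileBucket(weights, 0.5)) // 0 (lower bound percentile)
    fmt.Println(percentileBucket(weights, 0.9)) // 4 (target percentile)
}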


6. MaintainCheckpoints

Persists the bucketWeights into the checkpoints; for the detailed logic, see the InitFromCheckpoints example above.

7. GarbageCollect

Once per hour, stale in-memory data is cleaned up, e.g. the feeder.clusterState.pods and feeder.clusterState.Vpas mentioned above.

Note:

Stale VerticalPodAutoscalerCheckpoint objects are not deleted.

A VerticalPodAutoscalerCheckpoint is named vpaName-containerName. When the container no longer exists, the resource is not removed:

dz0400819@MacBook-Pro ~ kubectl get VerticalPodAutoscalerCheckpoint -n vpa
NAME                             AGE
nginx-vpa-nginx                  23d
nginx-vpa-nginxll                21h
nginx-vpa-yellow-pod-container   21h

3. The updater Component

3.1 Pod Eviction Conditions

3.1.1 Current pod count

The target must have at least this many replicas for the updater to perform evictions; customizable via --min-replicas.

minReplicas = flag.Int("min-replicas", 2,`Minimum number of replicas to perform update`)

3.1.2 Pods with resourceDiff < 0.1 are not evicted

For a pod with two containers C1 and C2:
resourceDiff = |(C1request + C2request) - (C1recommend + C2recommend)| / (C1request + C2request)

Source:

if updatePriority.resourceDiff < calc.config.MinChangePriority {
    klog.Info(fmt.Sprintf("not updating pod %v, resource diff too low: %v", pod.Name, updatePriority))
    return
}
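A sketch of that check for a single resource, with assumed example numbers (hypothetical helper, not the VPA source):

package main

import (
    "fmt"
    "math"
)

// resourceDiff for one resource: |sum(requests) - sum(recommendations)| / sum(requests).
// Pods whose diff stays below the threshold (0.1 by default) are left alone.
func resourceDiff(requests, recommendations []float64) float64 {
    var sumRequest, sumRecommended float64
    for i := range requests {
        sumRequest += requests[i]
        sumRecommended += recommendations[i]
    }
    return math.Abs(sumRequest-sumRecommended) / sumRequest
}

func main() {
    // Two containers: requests 100m and 200m, recommendations 110m and 210m.
    diff := resourceDiff([]float64{100, 200}, []float64{110, 210})
    fmt.Println(diff, diff < 0.1) // 0.0666..., true => the pod is not evicted
}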

3.1.3 If any container's request is outside the range in the VPA status (lowerBound - upperBound), the pod is evicted

Note:

This refers to lowerBound and upperBound in the VPA status, not minAllowed and maxAllowed.

lowerBound and upperBound themselves also lie within the minAllowed/maxAllowed range.

3.1.4 If the request is within the VPA range and the pod was not OOM-killed, the pod must have been running for 12 hours before resourceDiff is used to decide whether to evict it

Source:

if now.Before(pod.Status.StartTime.Add(podLifetimeUpdateThreshold)) {
    klog.Info(fmt.Sprintf("not updating a short-lived pod %v, request within recommended range", pod.Name))
    return
}

3.1.5 Fraction of pods evicted

eviction-tolerance: the fraction of pods that may be evicted within one updater cycle.

// The updater runs once per minute by default. With two nginx pods, the updater
// evicts one pod per minute until every pod carries the target recommendation.
evictionToleranceFraction = flag.Float64("eviction-tolerance", 0.5, `Fraction of replica count that can be evicted for update, if more than one pod can be evicted.`)

The eviction-tolerance fraction only takes effect when the target is a Deployment, ReplicaSet, ReplicationController, StatefulSet or Job. For any other resource, the updater evicts all pods within one cycle, because the computed tolerance is 0.
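A quick sketch of the tolerance arithmetic with assumed replica counts (the real logic is in the source below):

package main

import "fmt"

func main() {
    evictionToleranceFraction := 0.5

    // A ReplicaSet/StatefulSet with 2 configured replicas:
    configured := 2
    evictionTolerance := int(float64(configured) * evictionToleranceFraction) // 1
    shouldBeAlive := configured - evictionTolerance                           // 1
    fmt.Println(evictionTolerance, shouldBeAlive) // one pod may be evicted per updater cycle

    // An unknown controller kind: getReplicaCount returns 0, so the tolerance is 0
    // and shouldBeAlive is 0 -- every pod can be evicted in the same cycle.
    configured = 0
    evictionTolerance = int(float64(configured) * evictionToleranceFraction)
    fmt.Println(evictionTolerance, configured-evictionTolerance) // 0 0
}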

Source code:

// NewPodsEvictionRestriction creates PodsEvictionRestriction for a given set of pods.
func (f *podsEvictionRestrictionFactoryImpl) NewPodsEvictionRestriction(pods []*apiv1.Pod) PodsEvictionRestriction {
    // We can evict pod only if it is a part of replica set
    // For each replica set we can evict only a fraction of pods.
    // Evictions may be later limited by pod disruption budget if configured.

    // ----- map from each pod's owner (creator) to its list of pods -----
    livePods := make(map[podReplicaCreator][]*apiv1.Pod)
    for _, pod := range pods {
        creator, err := getPodReplicaCreator(pod)
        if err != nil {
            klog.Errorf("failed to obtain replication info for pod %s: %v", pod.Name, err)
            continue
        }
        if creator == nil {
            klog.Warningf("pod %s not replicated", pod.Name)
            continue
        }
        livePods[*creator] = append(livePods[*creator], pod)
    }
    // ----- map from each pod's owner (creator) to its list of pods -----

    // creator of each pod
    podToReplicaCreatorMap := make(map[string]podReplicaCreator)
    // per-creator stats: number of running pods, pending pods and configured replicas
    creatorToSingleGroupStatsMap := make(map[podReplicaCreator]singleGroupStats)
    for creator, replicas := range livePods {
        actual := len(replicas)
        if actual < f.minReplicas {
            klog.V(2).Infof("too few replicas for %v %v/%v. Found %v live pods", creator.Kind, creator.Namespace, creator.Name, actual)
            continue
        }
        // only the replica counts of job, statefulSet, replicationController and replicaSet are handled here
        var configured int
        if creator.Kind == job {
            // Job has no replicas configuration, so we will use actual number of live pods as replicas count.
            configured = actual
        } else {
            var err error
            // only for replicaSet, replicationController and statefulSet does getReplicaCount return the
            // replicas field; for other kinds it returns 0, so singleGroup.evictionTolerance = 0
            configured, err = f.getReplicaCount(creator)
            if err != nil {
                klog.Errorf("failed to obtain replication info for %v %v/%v. %v", creator.Kind, creator.Namespace, creator.Name, err)
                continue
            }
        }
        singleGroup := singleGroupStats{}
        singleGroup.configured = configured
        singleGroup.evictionTolerance = int(float64(configured) * f.evictionToleranceFraction)
        for _, pod := range replicas {
            podToReplicaCreatorMap[getPodID(pod)] = creator
            if pod.Status.Phase == apiv1.PodPending {
                singleGroup.pending = singleGroup.pending + 1
            }
        }
        singleGroup.running = len(replicas) - singleGroup.pending
        creatorToSingleGroupStatsMap[creator] = singleGroup
    }
    podToReplicaCreatorMapStr, _ := json.Marshal(podToReplicaCreatorMap)
    klog.Info(fmt.Sprintf("NewPodsEvictionRestriction----podToReplicaCreatorMapStr %s", podToReplicaCreatorMapStr))
    creatorToSingleGroupStatsMapStr, _ := json.Marshal(creatorToSingleGroupStatsMap)
    klog.Info(fmt.Sprintf("NewPodsEvictionRestriction----creatorToSingleGroupStatsMapStr %s", creatorToSingleGroupStatsMapStr))
    return &podsEvictionRestrictionImpl{
        client:                       f.client,
        podToReplicaCreatorMap:       podToReplicaCreatorMap,
        creatorToSingleGroupStatsMap: creatorToSingleGroupStatsMap}
}

func (e *podsEvictionRestrictionImpl) CanEvict(pod *apiv1.Pod) bool {
    cr, present := e.podToReplicaCreatorMap[getPodID(pod)]
    podToReplicaCreatorMapStr, err := json.Marshal(e.podToReplicaCreatorMap)
    if err != nil {
        klog.Warningf("CanEvict e.podToReplicaCreatorMap --marshal err: %v", err)
    }
    klog.Info(fmt.Sprintf("CanEvict podToReplicaCreatorMapStr %s", podToReplicaCreatorMapStr))
    if present {
        klog.Info("CanEvict pod present")
        singleGroupStats, present := e.creatorToSingleGroupStatsMap[cr]
        if pod.Status.Phase == apiv1.PodPending {
            return true
        }
        if present {
            klog.Info("CanEvict pod also present")
            // shouldBeAlive = 0 - 0 here for unknown controller kinds, which is why all pods get evicted
            shouldBeAlive := singleGroupStats.configured - singleGroupStats.evictionTolerance
            if singleGroupStats.running-singleGroupStats.evicted > shouldBeAlive {
                klog.Info("singleGroupStats.running-singleGroupStats.evicted > shouldBeAlive")
                return true
            }
            // If all pods are running and eviction tollerance is small evict 1 pod.
            if singleGroupStats.running == singleGroupStats.configured &&
                singleGroupStats.evictionTolerance == 0 &&
                singleGroupStats.evicted == 0 {
                klog.Info("singleGroupStats.evictionTolerance == 0")
                return true
            }
        }
    }
    return false
}

3.1.6 Pod eviction order

Pods that need to scale up are handled first.

Pods with a larger resourceDiff are handled first.

1. If any container wants to grow, the pod takes precedence. (scale-up pods are handled first)
2. A pod with a larger value of resourceDiff takes precedence.

func (list byPriority) Swap(i, j int) {
    list[i], list[j] = list[j], list[i]
}

// Less implements reverse ordering by priority (highest priority first).
func (list byPriority) Less(i, j int) bool {
    // 1. If any container wants to grow, the pod takes precedence.
    // TODO: A better policy would be to prioritize scaling down when
    // (a) the pod is pending
    // (b) there is general resource shortage
    // and prioritize scaling up otherwise.
    if list[i].scaleUp != list[j].scaleUp {
        return list[i].scaleUp
    }
    // 2. A pod with larger value of resourceDiff takes precedence.
    return list[i].resourceDiff > list[j].resourceDiff
}

4. The admission Component

The container's new request is simply recommendation.Target.

The container's new limit = oldLimit * recommendation / oldRequest, i.e. the original limit-to-request ratio is preserved.
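A small sketch of this proportional limit scaling, with assumed numbers:

package main

import "fmt"

// newLimit = oldLimit * newRequest / oldRequest keeps the original
// limit-to-request ratio after the webhook rewrites the request.
func scaleLimit(oldLimitMilli, oldRequestMilli, newRequestMilli int64) int64 {
    return oldLimitMilli * newRequestMilli / oldRequestMilli
}

func main() {
    // oldRequest = 100m, oldLimit = 200m, recommended target = 150m
    fmt.Printf("%dm\n", scaleLimit(200, 100, 150)) // 300m, the 1:2 ratio is preserved
}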

References

https://cloud.tencent.com/developer/news/841543

https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md
