diff --git a/content/en/docs/actions.md b/content/en/docs/actions.md
index e2d8d73..605361f 100644
--- a/content/en/docs/actions.md
+++ b/content/en/docs/actions.md
@@ -21,12 +21,13 @@ linktitle = "Actions"
 
 #### Overview
 
-The Enqueue action filters qualified jobs into the queue to be scheduled. When the minimum number of resource requests under a Job cannot be met, even if the scheduling action is performed for a pod under a Job, pod will not be able to schedule because the "Gang" constraint is not reached. A state refresh from "Pending" to "Inqueue" can only happen if the minimum resource size of the job is met. In general, the Enqueue action is an essential action for the scheduler configuration.
+The Enqueue action filters qualified jobs into the queue to be scheduled. When the minimum resource request of a Job cannot be met, its pods cannot be scheduled even if the scheduling action is performed for them, because the gang constraint is not satisfied. The state changes from "Pending" to "Inqueue" only when the job's minimum resource request is met. This state transition is a prerequisite for Pod creation: only after the PodGroup enters the Inqueue state will the vc-controller create Pods for that PodGroup. This mechanism ensures that Pods are created only when resources are available, which makes Enqueue an essential action in the scheduler configuration.
 
-#### Scenario
+#### Scenario
 
 Enqueue action is the preparatory stage in the scheduling process. Only when the cluster resources meet the minimum resource request for the job scheduling, the job state can be changed from "pending" to "Enqueue". In this way, Enqueue Action can prevent a large number of unscheduled pods in the cluster and improve the performance of the scheduler in the high-load scenarios where the cluster resources may be insufficient, such as AI/MPI/HPC.
 
+> Note: The enqueue action conflicts with the preempt/reclaim actions. If they are configured together and the enqueue action determines that a job cannot be queued, no Pending Pods may be generated for that job, so the preempt/reclaim actions cannot be triggered.
 
 ### Allocate
 
@@ -41,7 +42,15 @@ The Allocate action follows the commit mechanism. When a pod's scheduling reques
 
 In a clustered mixed business scenario, the Allocate pre-selected part enables specific businesses (AI, big data, HPC, scientific computing) to quickly filter, sort, and schedule according to their namespace quickly and centrally. In a complex computing scenario such as TensorFlow or MPI, where there are multiple tasks in a single job, the Allocate action traversal multiple task allocation options under the job to find the most appropriate node for each task.
 
+### Backfill
+
+#### Overview
+
+The Backfill action is the backfill step in the scheduling process. It handles the scheduling of BestEffort Pods, i.e. pods that do not specify resource requests. Similar to the Allocate action, Backfill traverses all nodes to find a suitable placement; the main difference is that it deals with pods that have no explicit resource requests.
+#### Scenario
+
+In a cluster, besides workloads with explicit resource requests, there are also workloads whose resource demands are unclear. Such workloads typically run in BestEffort mode, and the Backfill action is responsible for finding suitable placements for these Pods.
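+
+A minimal sketch of such a workload is shown below: a plain Pod that selects the Volcano scheduler via `schedulerName: volcano` but declares no resource requests or limits, which places it in the BestEffort QoS class and therefore in the scope of the Backfill action. The pod name and image are only illustrative examples, assuming Volcano is installed with the default scheduler name `volcano`:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: besteffort-demo            # example name; any BestEffort workload applies
+spec:
+  schedulerName: volcano           # let the Volcano scheduler place this Pod
+  containers:
+    - name: worker
+      image: busybox:1.36
+      command: ["sleep", "3600"]
+      # no resources.requests / resources.limits -> BestEffort QoS,
+      # so this Pod is handled by the Backfill action
+```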
 
 ### Preempt
 
@@ -54,33 +63,19 @@ The preempt action is used for resource preemption between jobs in a queue , or
 
 - Preemption between jobs in the same queue: Multiple departments in a company share a cluster, and each department can be mapped into a Queue. Resources of different departments cannot be preempted from each other. This mechanism can well guarantee the isolation of resources of departments..In complex scheduling scenarios, basic resources (CPUs, disks, GPUs, memory, network bandwidth) are allocated based on services: In computing-intensive scenarios, such as AI and high-performance scientific computing, queues require more computing resources, such as CPUs, GPUs, and memory. Big data scenarios, such as the Spark framework, have high requirements on disks. Different queues share resources. If AI jobs preempts all CPU resources, jobs in queues of other scenarios will starve. Therefore, the queue-based resource allocation is used to ensure service running.
 - Preemption between tasks in the same job: Usually, there can be multiple tasks in the same Job. For example, in complex AI application scenarios, a parameter server and multiple workers need to be set inside the TF-job, and preemption between multiple workers is supported by preemption within such scenarios.
 
-### Reserve
+### Reclaim
 
 #### Overview
 
-The action has been deprecated from v1.2 and replaced with SLA plugin.
-
-The Reserve action completes the resource reservation. Bind the selected target job to the node. The Reserve action, the elect action, and the Reservation plugin make up the resource Reservation mechanism. The Reserve action must be configured after the allocate action.
+The Reclaim action is the **cross-queue** resource reclamation step in the scheduling process. Unlike Preempt, Reclaim handles resource reclamation between different Queues. When a job in a Queue needs resources and that Queue is not overused, resources can be reclaimed from other reclaimable Queues.
 
 #### Scenario
 
-In practical applications, there are two common scenarios as follows:
-
-- In the case of insufficient cluster resources, it is assumed that for Job A and Job B in the state to be scheduled, the application amount of resource A is less than B or the priority of resource A is higher than that of job B. Based on the default scheduling policy, A will schedule ahead of B. In the worst case, if subsequent jobs with high priority or less application resources are added to the queue to be scheduled, B will be hungry for a long time and wait forever.
-
-- In the case of insufficient cluster resources, assume that there are jobs A and B to be scheduled. The priority of A is lower than that of B, but the resource application amount is smaller than that of B. Under the scheduling policy based on cluster throughput and resource utilization as the core, A will be scheduled first. In the worst case, B will remain hungry.
+- Cross-queue resource reclamation: In scenarios where multiple departments share a cluster, when the Queue of a high-priority department (such as an online business department) lacks resources, it can reclaim resources from the Queues of other reclaimable departments (such as an offline computing department). For example, online business Queues can reclaim resources from offline business Queues, but offline business Queues cannot reclaim resources from each other.
+- Resource utilization optimization: Through the cross-queue reclamation mechanism, the cluster can improve overall resource utilization while guaranteeing the SLA of high-priority businesses. When a high-priority Queue lacks resources, it can reclaim resources from lower-priority Queues to meet the resource requirements of critical businesses.
-Therefore, we need a fair scheduling mechanism that ensures that chronic hunger for some reason reaches a critical state when it is dispatched. Job reservation is such a fair scheduling mechanism.
-
-Resource reservation mechanisms need to consider node selection, number of nodes, and how to lock nodes. Volcano resource reservation mechanism reserves resources for target operations in the way of node group locking, that is, select a group of nodes that meet certain constraints and include them into the node group. Nodes within the node group will not accept new job delivery from the inclusion moment, and the total specification of nodes meets the requirements of target operations. It is important to note that target jobs can be scheduled throughout the cluster, while non-target jobs can only be scheduled with nodes outside the node group.
-
-### Backfill
-
-#### Overview
-
-Backfill action is a backfill step in the scheduling process. It deals with the pod scheduling that does not specify the resource application amount in the list of pod to be scheduled. When executing the scheduling action on a single pod, it traverse all nodes and schedule the pod to this node as long as the node meets the scheduling request of pod.
-
-#### Scenario
-
-In a cluster, the main resources are occupied by "fat jobs", such as AI model training. Backfill actions allow the cluster to quickly schedule "small jobs" such as single AI model identification and small data volume communication. Backfill can improve cluster throughput and resource utilization.
\ No newline at end of file
+> Note:
+>
+> 1. During execution, Reclaim checks several conditions: whether the target Queue is reclaimable, whether the task can be reclaimed (Preemptable), and whether the job's running requirements can still be met after reclamation, so that resource reclamation remains reasonable.
+> 2. To make the jobs in a Queue reclaimable by other Queues, the reclaimable field in the Queue's spec must be set to true.
\ No newline at end of file
diff --git a/content/en/docs/network_topology_aware_scheduling.md b/content/en/docs/network_topology_aware_scheduling.md
index 8982f47..544c8bd 100644
--- a/content/en/docs/network_topology_aware_scheduling.md
+++ b/content/en/docs/network_topology_aware_scheduling.md
@@ -255,6 +255,33 @@ Since the `spec.networkTopology.highestTierAllowed` of the Job is set to 2, the
 
 ## Best Practices
 
+### Optimizing Scheduler Configuration
+
+HyperNode scoring is based on the sum of the scores of all nodes it manages. To achieve better scheduling results, enable the binpack plugin in the scheduler configuration and set an appropriate weight. The binpack strategy prioritizes scheduling Pods onto nodes that already have workloads, which helps place Pods from the same job into the same lower-tier HyperNode, thereby reducing cross-switch communication and improving network transmission efficiency:
+
+```yaml
+kind: ConfigMap
+apiVersion: v1
+metadata:
+  name: volcano-scheduler-configmap
+  namespace: volcano-system
+data:
+  volcano-scheduler.conf: |
+    actions: "enqueue, allocate, backfill"
+    tiers:
+    - plugins:
+      - name: priority
+      - name: gang
+    - plugins:
+      - name: predicates
+      - name: proportion
+      - name: nodeorder
+      - name: binpack             # Enable binpack plugin
+        arguments:
+          binpack.weight: 10      # Set a higher weight value to make binpack scoring dominant, ensuring Pods are tightly scheduled within the same HyperNode
+```
+
+### Soft Mode Configuration
 The `spec.networkTopology.highestTierAllowed` field of a Job constrains the highest tier allowed for job deployment. This value is only meaningful when `spec.networkTopology.mode` is set to `hard`. Therefore, when `spec.networkTopology.highestTierAllowed` is set to the maximum tier in the cluster, the resource view of the Job during scheduling includes all nodes in the cluster, making the topology constraint consistent with the `soft` mode. Therefore, **to use the `soft` mode**, set `spec.networkTopology.highestTierAllowed` to the maximum tier in the cluster. Still using Figure 1 as an example, this value should be set to 3.
 
 ```yaml
diff --git a/content/en/docs/queue_resource_management.md b/content/en/docs/queue_resource_management.md
index 62d112e..d5ba975 100644
--- a/content/en/docs/queue_resource_management.md
+++ b/content/en/docs/queue_resource_management.md
@@ -231,11 +231,18 @@ spec:
 ```
 
 After submitting job3, system starts resource reclamation:
-* System reclaims resources exceeding deserved amount from default queue
-* job2 (3C) is evicted
-* job1 (1C) continues running
-* job3 (3C) starts running
+
+1. System reclaims resources exceeding deserved amount from default queue
+2. job2 (3C) is evicted
+3. job1 (1C) continues running
+4. job3 (3C) starts running
 
 This scenario works with both capacity plugin and proportion plugin:
-* capacity plugin: Directly configure deserved values (default=1C, test=3C)
-* proportion plugin: Configure weight values (default=1, test=3) resulting in the same deserved values
+
+ * capacity plugin: Directly configure deserved values (default=1C, test=3C)
+ * proportion plugin: Configure weight values (default=1, test=3) resulting in the same deserved values
+
+> **Note**:
+> 1. The capacity plugin and the proportion plugin are mutually exclusive and cannot be used at the same time
+> 2. The choice between them depends on whether you want to set deserved values directly (capacity) or have them derived automatically from weights (proportion)
+> 3. Since Volcano v1.9.0, the capacity plugin is recommended because it provides more intuitive resource configuration
diff --git a/content/zh/docs/actions.md b/content/zh/docs/actions.md
index 43113fc..64054c9 100644
--- a/content/zh/docs/actions.md
+++ b/content/zh/docs/actions.md
@@ -2,7 +2,7 @@
 title = "Actions"
 date = 2021-04-07
-lastmod = 2021-07-26
+lastmod = 2025-01-21
 draft = false  # Is this a draft? true/false
 toc = true  # Show table of contents? true/false
@@ -21,12 +21,13 @@ linktitle = "Actions"
 
 #### 简介
 
-Enqueue action筛选符合要求的作业进入待调度队列。当一个Job下的最小资源申请量不能得到满足时,即使为Job下的Pod执行调度动作,Pod也会因为gang约束没有达到而无法进行调度;只有当job的最小资源量得到满足,状态由"Pending"刷新为"Inqueue"才可以进行。一般来说Enqueue action是调度器配置必不可少的action。
+Enqueue action筛选符合要求的作业进入待调度队列。当一个Job下的最小资源申请量不能得到满足时,即使为Job下的Pod执行调度动作,Pod也会因为gang约束没有达到而无法进行调度。只有当集群资源满足作业声明的最小资源需求时,Enqueue action才允许该作业入队,使得PodGroup的状态由Pending状态转换为Inqueue状态。这个状态转换是Pod创建的前提,只有PodGroup进入Inqueue状态后,vc-controller才会为该PodGroup创建Pod。这种机制确保了Pod只会在资源满足的情况下被创建,是调度器配置中必不可少的action。
 
 #### 场景
 
 Enqueue action是调度流程中的准备阶段,只有当集群资源满足作业调度的最小资源请求,作业状态才可由"pending"变为"enqueue"。这样在AI/MPI/HPC这样的集群资源可能不足的高负荷的场景下,Enqueue action能够防止集群下有大量不能调度的pod,提高了调度器的性能。
 
+> 注意:enqueue action和preempt/reclaim action是互相冲突的,如果同时配置了enqueue action和preempt/reclaim action,且enqueue action判断作业无法入队,有可能导致无法生成Pending状态的Pod,从而无法触发preempt/reclaim action。
 
 ### Allocate
 
@@ -41,50 +42,43 @@ Allocate action遵循commit机制,当一个Pod的调度请求得到满足后
 
 在集群混合业务场景中,Allocate的预选部分能够将特定的业务(AI、大数据、HPC、科学计算)按照所在namespace快速筛选、分类,对特定的业务进行快速、集中的调度。在Tensorflow、MPI等复杂计算场景中,单个作业中会有多个任务,Allocate action会遍历job下的多个task分配优选,为每个task找到最合适的node。
 
-
-
-### Preempt
+### Backfill
 
 #### 简介
 
-Preempt action是调度流程中的抢占步骤,用于处理高优先级调度问题。Preempt用于同一个Queue中job之间的抢占,或同一Job下Task之间的抢占。
+Backfill action是调度流程中处理BestEffort Pod(即没有指定资源申请量的Pod)的调度步骤。与Allocate action类似,Backfill也会遍历所有节点寻找合适的调度位置,主要区别在于它处理的是没有明确资源申请量的Pod。
 
 #### 场景
 
-- Queue内job抢占:一个公司中多个部门共用一个集群,每个部门可以映射成一个Queue,不同部门之间的资源不能互相抢占,这种机制能够很好的保证部门资源的隔离性。多业务类型混合场景中,基于Queue的机制满足了一类业务对于某一类资源的集中诉求,也能够兼顾集群的弹性。例如,AI业务组成的queue对集群GPU占比90%,其余图像类处理的业务组成的queue占集群GPU10%。前者占用了集群绝大部分GPU资源但是依然有一小部分资源可以处理其余类型的业务。
-- Job内task抢占:同一Job下通常可以有多个task,例如复杂的AI应用场景中,tf-job内部需要设置一个ps和多个worker,Preempt action就支持这种场景下多个worker之间的抢占。
-
-
+在集群中,除了需要明确资源申请的工作负载外,还存在一些对资源需求不明确的工作负载。这些工作负载通常以BestEffort的方式运行,Backfill action负责为这类Pod寻找合适的调度位置。
 
-### Reserve
+### Preempt
 
 #### 简介
 
-Reserve action从v1.2开始已经被弃用,并且被SLA plugin替代。
-
-Reserve action完成资源预留。将选中的目标作业与节点进行绑定。Reserve action、elect action 以及Reservation plugin组成了资源预留机制。Reserve action必须配置在allocate action之后。
+Preempt action是调度流程中的抢占步骤,用于处理高优先级调度问题。Preempt用于同一个Queue中job之间的抢占,或同一Job下Task之间的抢占。
 
 #### 场景
 
-在实际应用中,常见以下两种场景:
-
-- 在集群资源不足的情况下,假设处于待调度状态的作业A和B,A资源申请量小于B或A优先级高于B。基于默认调度策略,A将优先于B进行调度。在最坏的情况下,若后续持续有高优先级或申请资源量较少的作业加入待调度队列,B将长时间处于饥饿状态并永远等待下去。
-- 在集群资源不足的情况下,假设存在待调度作业A和B。A优先级低于B但资源申请量小于B。在基于集群吞吐量和资源利用率为核心的调度策略下,A将优先被调度。在最坏的情况下,B将持续饥饿下去。
-
-因此我们需要一种公平调度机制:保证因为某种原因长期饥饿达到临界状态之后被调度。作业预留机制的就是这样一种公平调度机制。
+- Queue内job抢占:一个公司中多个部门共用一个集群,每个部门可以映射成一个Queue,不同部门之间的资源不能互相抢占,这种机制能够很好的保证部门资源的隔离性。多业务类型混合场景中,基于Queue的机制满足了一类业务对于某一类资源的集中诉求,也能够兼顾集群的弹性。例如,AI业务组成的queue对集群GPU占比90%,其余图像类处理的业务组成的queue占集群GPU10%。前者占用了集群绝大部分GPU资源但是依然有一小部分资源可以处理其余类型的业务。
 
-资源预留机制需要考虑节点选取、节点数量以及如何锁定节点。volcano资源预留机制采用节点组锁定的方式为目标作业预留资源,即选定一组符合某些约束条件的节点纳入节点组,节点组内的节点从纳入时刻起不再接受新作业投递,节点规格总和满足目标作业要求。需要强调的是,目标作业将可以在整个集群中进行调度,非目标作业仅可使用节点组外的节点进行调度。
+- Job内task抢占:同一Job下通常可以有多个task,例如复杂的AI应用场景中,tf-job内部需要设置一个ps和多个worker,Preempt action就支持这种场景下多个worker之间的抢占。
 
-### Backfill
+### Reclaim
 
 #### 简介
 
-Backfill action是调度流程中的回填步骤,处理待调度Pod列表中没有指明资源申请量的Pod调度,在对单个Pod执行调度动作的时候,遍历所有的节点,只要节点满足了Pod的调度请求,就将Pod调度到这个节点上。
+Reclaim action是调度流程中的**跨队列**资源回收步骤。与Preempt不同,Reclaim专门处理不同Queue之间的资源回收。当某个Queue中的作业需要资源且该Queue未超用时,可以从其他可回收队列中回收资源。
 
 #### 场景
 
-在一个集群中,主要资源被"胖业务"占用,例如AI模型的训练。Backfill action让集群可以快速调度诸如单次AI模型识别、小数据量通信的"小作业" 。Backfill能够提高集群吞吐量,提高资源利用率。
+- 跨队列资源回收:在多部门共用集群的场景下,当高优先级部门(如在线业务部门)的Queue资源不足时,可以从其他可回收的部门Queue(如离线计算部门)回收资源。例如,在线业务Queue可以从离线业务Queue回收资源,但离线业务Queue之间不能互相回收资源。
+
+- 资源利用率优化:通过跨队列资源回收机制,集群可以在保证高优先级业务SLA的同时,提高整体资源利用率。当高优先级Queue资源不足时,可以从低优先级Queue回收资源,确保关键业务的资源需求。
+
+> 注意:
+>
+> 1. Reclaim在执行时会检查多个条件:目标Queue是否可回收(Reclaimable)、任务是否可被回收(Preemptable)、资源回收后是否满足作业运行需求等,从而确保资源回收的合理性。
+> 2. 要使Queue中的作业可以被其他Queue回收资源,需要在Queue的spec中将reclaimable字段设置为true。
diff --git a/content/zh/docs/multi_cluster_scheduling.md b/content/zh/docs/multi_cluster_scheduling.md
index 089d87f..766e530 100644
--- a/content/zh/docs/multi_cluster_scheduling.md
+++ b/content/zh/docs/multi_cluster_scheduling.md
@@ -19,7 +19,7 @@ type = "docs" # Do not modify.
 
 随着企业业务的快速增长,单一Kubernetes集群往往无法满足大规模AI训练和推理任务的需求。用户通常需要管理多个Kubernetes集群,以实现工作负载的统一分发、部署和管理。目前,业界的多集群编排系统(如[Karmada](https://karmada.io/))主要针对微服务场景,提供了高可用性和容灾部署能力。然而,在AI作业调度方面,Karmada的能力仍然有限,缺乏对**Volcano Job**的支持,也无法满足队列管理、多租户公平调度和作业优先级调度等需求。
 
-为了解决多集群环境下AI作业的调度与管理问题,**Volcano社区**孵化了**[Volcano Global]([https://github.com/volcano-sh/volcano-global)**子项目。该项目基于Karmada,扩展了Volcano在单集群中的强大调度能力,为多集群AI作业提供了统一的调度平台,支持跨集群的任务分发、资源管理和优先级控制。
+为了解决多集群环境下AI作业的调度与管理问题,**Volcano社区**孵化了**[Volcano Global](https://github.com/volcano-sh/volcano-global)**子项目。该项目基于Karmada,扩展了Volcano在单集群中的强大调度能力,为多集群AI作业提供了统一的调度平台,支持跨集群的任务分发、资源管理和优先级控制。
 
 ## 功能
 
@@ -42,7 +42,7 @@ Volcano Global在Karmada的基础上,提供了以下增强功能,满足多
 
 Volcano global主要包含两个组件:
 
 - **Volcano Webhook:** 监听ResourceBinding资源的创建事件,将ResourceBinding设置为暂停状态。
-- **Volcnao Controller:** 监听处于暂停状态的ResourceBinding,根据Job所在队列的优先级、Job本身的优先级,对Job进行优先级和公平调度,并运行资源准入机制,决定是否可以调度Job,准入成功后将ResourceBinding解除暂停状态,由Karmada进行资源分发。
+- **Volcano Controller:** 监听处于暂停状态的ResourceBinding,根据Job所在队列的优先级、Job本身的优先级,对Job进行优先级和公平调度,并运行资源准入机制,决定是否可以调度Job,准入成功后将ResourceBinding解除暂停状态,由Karmada进行资源分发。
 
 ## 使用指导
 
diff --git a/content/zh/docs/network_topology_aware_scheduling.md b/content/zh/docs/network_topology_aware_scheduling.md
index fe9a989..4dacad6 100644
--- a/content/zh/docs/network_topology_aware_scheduling.md
+++ b/content/zh/docs/network_topology_aware_scheduling.md
@@ -255,6 +255,34 @@ spec:
 
 ## 最佳实践
 
+### 优化调度器配置
+
+HyperNode的打分是基于其管理的所有节点的打分总和。为了获得更好的调度效果,需要在调度器配置中开启binpack插件并设置合适的权重。binpack策略会优先将Pod调度到已有负载的节点上,这样可以让同一作业的Pod尽可能地调度到更低层的同一个HyperNode中,从而减少跨交换机通信,提高网络传输效率:
+
+```yaml
+kind: ConfigMap
+apiVersion: v1
+metadata:
+  name: volcano-scheduler-configmap
+  namespace: volcano-system
+data:
+  volcano-scheduler.conf: |
+    actions: "enqueue, allocate, backfill"
+    tiers:
+    - plugins:
+      - name: priority
+      - name: gang
+    - plugins:
+      - name: predicates
+      - name: proportion
+      - name: nodeorder
+      - name: binpack             #开启binpack插件
+        arguments:
+          binpack.weight: 10      #设置较高的权重值,使binpack策略的打分占主导地位,确保Pod能够被紧密地调度到同一个HyperNode中
+```
+
+### 软约束模式配置
+
 Job的`spec.networkTopology.highestTierAllowed`字段约束了Job允许部署的最高Tier,该值只有在`spec.networkTopology.mode`设置为`hard`时才有意义,因此将`spec.networkTopology.highestTierAllowed`设置为集群中最大的tier时,Job在调度时的资源视图为集群中的所有节点,此时拓扑约束与soft模式一致。因此**若要使用soft模式**,请将`spec.networkTopology.highestTierAllowed`设置为集群中最大的Tier,仍以图1为例,应该设置该值为3。
 
 ```yaml