Commit 46fbf5e: Update and fix some docs

Signed-off-by: JesseStutler <chenzicong4@huawei.com>
Committed by JesseStutler on Jan 23, 2025 (1 parent: a36f54f)
Showing 6 changed files with 109 additions and 58 deletions.
content/en/docs/actions.md (19 additions, 24 deletions)

#### Overview

The Enqueue action filters qualified jobs into the queue to be scheduled. When the minimum resource request of a Job cannot be met, even if the scheduling action is performed for the pods under the Job, the pods will not be scheduled because the "Gang" constraint is not satisfied. A state refresh from "Pending" to "Inqueue" can only happen when the Job's minimum resource request is met. This state transition is a prerequisite for Pod creation: only after the PodGroup enters the Inqueue state will the vc-controller create Pods for that PodGroup. This mechanism ensures that Pods are only created when resources are available, making Enqueue an essential action for scheduler configuration.

#### Scenario

Enqueue action is the preparatory stage of the scheduling process. Only when the cluster resources meet the minimum resource request of a job can the job's state change from "Pending" to "Inqueue". Enqueue action thus prevents a large number of unschedulable pods from accumulating in the cluster and improves scheduler performance in high-load scenarios where cluster resources may be insufficient, such as AI/MPI/HPC.

> Note: There is a conflict between the enqueue action and the preempt/reclaim actions. If both are configured and the enqueue action determines that a job cannot be queued, Pods in the Pending state may never be created, and the preempt/reclaim actions will therefore not be triggered.
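
To make the gang constraint concrete, here is a minimal PodGroup sketch showing the fields Enqueue evaluates; the name and resource values are illustrative, not from this commit:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-podgroup     # illustrative name
  namespace: default
spec:
  minMember: 4               # gang constraint: all 4 pods must be schedulable together
  minResources:              # minimum resources that must be available before the
    cpu: "8"                 # PodGroup can move from Pending to Inqueue
    memory: 16Gi
  queue: default
```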

### Allocate
#### Overview

The Allocate action follows the commit mechanism. When a pod's scheduling request…

#### Scenario

In a clustered mixed-business scenario, the Allocate pre-selection enables specific businesses (AI, big data, HPC, scientific computing) to quickly and centrally filter, sort, and schedule according to their namespaces. In complex computing scenarios such as TensorFlow or MPI, where a single job contains multiple tasks, the Allocate action traverses the multiple task allocation options under the job to find the most appropriate node for each task.
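
As an illustration of a multi-task job that Allocate places task by task, a hypothetical Volcano Job might look like this sketch (names and images are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-example           # illustrative name
spec:
  queue: default
  minAvailable: 3
  tasks:
    - replicas: 1
      name: ps               # parameter-server task
      template:
        spec:
          containers:
            - name: ps
              image: tensorflow/tensorflow:latest
    - replicas: 2
      name: worker           # worker tasks; Allocate finds a node for each task
      template:
        spec:
          containers:
            - name: worker
              image: tensorflow/tensorflow:latest
```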

### Backfill

#### Overview

Backfill action is a backfill step in the scheduling process. It handles the scheduling of BestEffort Pods (pods that do not specify resource requests). Like the Allocate action, Backfill traverses all nodes to find a suitable scheduling position, the main difference being that it handles pods without explicit resource requests.

#### Scenario

In a cluster, besides workloads that require explicit resource requests, there are also workloads with unclear resource demands. These typically run in BestEffort mode, and the Backfill action is responsible for finding suitable scheduling positions for such Pods.
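
For reference, a BestEffort pod is simply one whose containers declare no resource requests or limits; a minimal sketch, assuming the Volcano scheduler is used:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example   # illustrative name
spec:
  schedulerName: volcano     # let the Volcano scheduler place this pod
  containers:
    - name: app
      image: nginx:latest
      # no resources.requests / resources.limits, so the pod's QoS class
      # is BestEffort and it is handled by the Backfill action
```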

### Preempt

#### Overview

The preempt action is used for resource preemption between jobs in a queue, or between tasks in a job.

#### Scenario
- Preemption between jobs in the same queue: Multiple departments in a company share a cluster, and each department can be mapped to a Queue. Resources of different departments cannot be preempted from each other; this mechanism guarantees the isolation of department resources. In complex scheduling scenarios, basic resources (CPUs, disks, GPUs, memory, network bandwidth) are allocated based on services: computing-intensive scenarios, such as AI and high-performance scientific computing, require more computing resources such as CPUs, GPUs, and memory, while big data scenarios, such as the Spark framework, have high requirements on disks. Different queues share resources, so if AI jobs preempted all CPU resources, jobs in queues of other scenarios would starve. Queue-based resource allocation is therefore used to ensure service running (see the sketch after this list).
- Preemption between tasks in the same job: Usually, there can be multiple tasks in the same Job. For example, in complex AI application scenarios, a parameter server and multiple workers need to be set up inside a TF job, and preemption between workers is supported in such scenarios.
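
A hypothetical sketch of the intra-queue priority setup that preempt acts on; the PriorityClass name and values are illustrative, and the preempt action must also be enabled in the scheduler's actions list:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority              # illustrative name
value: 1000                        # higher value means higher priority
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: important-training        # illustrative name
spec:
  queue: default                  # same queue as the lower-priority jobs it may preempt
  priorityClassName: high-priority
  minAvailable: 2
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: tensorflow/tensorflow:latest
              resources:
                requests:
                  cpu: "4"
```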

### Reclaim

#### Overview

Reclaim action is a **cross-queue** resource reclamation step in the scheduling process. Unlike Preempt, Reclaim specifically handles resource reclamation between different Queues. When a job in a Queue needs resources and that Queue is not overused, resources can be reclaimed from other reclaimable queues.

#### Scenario

In practical applications, there are two common scenarios as follows:

- Cross-queue resource reclamation: In scenarios where multiple departments share a cluster, when the Queue of a high-priority department (such as an online business department) lacks resources, it can reclaim resources from the Queues of other departments (such as an offline computing department). For example, online business Queues can reclaim resources from offline business Queues, but offline business Queues cannot reclaim resources from each other.

- Resource utilization optimization: Through the cross-queue reclamation mechanism, the cluster can improve overall resource utilization while ensuring the SLA of high-priority businesses. When a high-priority Queue lacks resources, it can reclaim resources from low-priority Queues to meet the resource requirements of critical businesses.


> Note:
>
> 1. Reclaim checks multiple conditions during execution: whether the target Queue is reclaimable, whether the task can be preempted (Preemptable), and whether the job's running requirements can still be met after resource reclamation, to ensure that reclamation is reasonable.
> 2. To make jobs in a Queue reclaimable by other Queues, the reclaimable field in the Queue's spec must be set to true.
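
A minimal Queue sketch with `reclaimable` enabled, as described in the note above (the queue name and weight are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: offline-queue        # illustrative name
spec:
  reclaimable: true          # jobs in this queue may be reclaimed by other queues
  weight: 1
```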
content/en/docs/network_topology_aware_scheduling.md (27 additions)

Since the `spec.networkTopology.highestTierAllowed` of the Job is set to 2, the…

## Best Practices

### Optimizing Scheduler Configuration

HyperNode scoring is based on the sum of scores from all nodes it manages. To achieve better scheduling results, you need to enable the binpack plugin in the scheduler configuration and set an appropriate weight. The binpack strategy prioritizes scheduling Pods to nodes with existing workloads, which helps ensure Pods from the same job are scheduled to the same HyperNode at a lower tier, thereby reducing cross-switch communication and improving network transmission efficiency:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
    - plugins:
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack # Enable the binpack plugin
        arguments:
          binpack.weight: 10 # A higher weight makes binpack scoring dominant, packing Pods tightly within the same HyperNode
```

### Soft Mode Configuration

The `spec.networkTopology.highestTierAllowed` field of a Job constrains the highest tier allowed for job deployment. This value is only meaningful when `spec.networkTopology.mode` is set to `hard`. When `spec.networkTopology.highestTierAllowed` is set to the maximum tier in the cluster, the resource view of the Job during scheduling includes all nodes in the cluster, making the topology constraint equivalent to the `soft` mode. Therefore, **to use the `soft` mode**, set `spec.networkTopology.highestTierAllowed` to the maximum tier in the cluster. Using Figure 1 as an example again, this value should be set to 3.

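The example following the heading is collapsed in this diff; a minimal sketch of such a Job, assuming a cluster whose maximum tier is 3 as in Figure 1 (name and image are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: soft-mode-example    # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 4
  networkTopology:
    mode: hard               # soft-equivalent behavior via highestTierAllowed
    highestTierAllowed: 3    # maximum tier in the cluster, per Figure 1
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: busybox:latest
```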
content/en/docs/queue_resource_management.md (13 additions, 6 deletions)
After job3 is submitted, the system starts resource reclamation:

1. The system reclaims resources exceeding the deserved amount from the default queue
2. job2 (3C) is evicted
3. job1 (1C) continues running
4. job3 (3C) starts running

This scenario works with both the capacity plugin and the proportion plugin:

* capacity plugin: Directly configure deserved values (default=1C, test=3C)
* proportion plugin: Configure weight values (default=1, test=3) resulting in the same deserved values
> **Note**:
> 1. The capacity plugin and the proportion plugin are mutually exclusive; they cannot be used simultaneously
> 2. The choice between plugins depends on whether you want to set deserved directly (capacity) or calculate deserved automatically through weights (proportion)
> 3. Since Volcano v1.9.0, the capacity plugin is recommended, as it provides more intuitive resource configuration
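
A sketch of the two Queues from this scenario configured for the capacity plugin, with deserved set directly (assuming the scheduling.volcano.sh/v1beta1 Queue API):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  reclaimable: true
  deserved:                  # capacity plugin: deserved resources configured directly
    cpu: "1"
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: true
  deserved:
    cpu: "3"
```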