
Commit 61fc7ac

Authored Apr 17, 2024
Merge pull request #36 from BillyChen1/cqm_versioning
versioning documentation
2 parents eb5b956 + 9915720 commit 61fc7ac

82 files changed, +8405 -59 lines changed


docs/case-study/alibaba-case-study.md

Lines changed: 9 additions & 9 deletions
@@ -54,7 +54,7 @@ To improve efficiency of large-scale machine learning model training for Alibaba
#### Fluid
[Fluid](https://github.com/fluid-cloudnative/fluid) is an open source scalable distributed data orchestration and acceleration system. It enables data access for data-intensive applications such as AI and big data based on the Kubernetes standard without user awareness. It is intended to build an efficient support platform for data-intensive applications in cloud-native environments. Based on data layer abstraction provided by Kubernetes services, Fluid can flexibly and efficiently move, replicate, evict, transform, and manage data between storage sources such as HDFS, OSS, and Ceph and upper-layer cloud-native computing applications of Kubernetes. Specific data operations are performed without user awareness. You do not need to worry about the efficiency of accessing remote data, the convenience of managing data sources, or how to help Kubernetes make O&M and scheduling decisions. You can directly access abstracted data from Kubernetes-native persistent volumes (PVs) and persistent volume claims (PVCs). Remaining tasks and underlying details are all handled by Fluid.

-![Fluid](../../static/img/docs/case-study/ali-fluid.png)
+![Fluid](/img/docs/case-study/ali-fluid.png)

Fluid supports multiple runtimes, including JindoRuntime, AlluxioRuntime, JuiceFSRuntime, and GooseFSRuntime. JindoRuntime has outstanding capabilities, performance, and stability, and is applied in many scenarios. [JindoRuntime](https://github.com/aliyun/alibabacloud-jindodata/blob/master/docs/user/6.x/6.2.0/jindo_fluid/jindo_fluid_overview.md) is a distributed cache runtime of Fluid. It is built on JindoCache, a distributed cache acceleration engine.

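The Fluid description above boils down to two custom resources: a `Dataset` that points at remote storage such as OSS, and a cache runtime such as `JindoRuntime`; Fluid then exposes the data as a PVC named after the Dataset. The sketch below is an illustration only and is not part of this commit: the resource names, OSS path, replica count, and cache quota are placeholders, and the spec fields follow Fluid's public quick-start examples.

```python
# Minimal sketch (assumptions, not from this commit): declare a Fluid Dataset
# and a JindoRuntime through the Kubernetes Python client. Names, the OSS path,
# and the cache quota are placeholders; OSS credentials/options are omitted.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "train-data", "namespace": "default"},
    "spec": {
        # Fluid mounts the remote store and exposes it as a PVC named "train-data".
        "mounts": [{"mountPoint": "oss://example-bucket/train/", "name": "train"}],
    },
}

runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "JindoRuntime",
    "metadata": {"name": "train-data", "namespace": "default"},
    "spec": {
        "replicas": 2,
        # Cache tier: keep hot data in memory on the cache workers
        # (field names follow Fluid's public JindoRuntime examples).
        "tieredstore": {"levels": [{"mediumtype": "MEM", "quota": "20Gi"}]},
    },
}

for plural, body in (("datasets", dataset), ("jindoruntimes", runtime)):
    api.create_namespaced_custom_object(
        group="data.fluid.io", version="v1alpha1",
        namespace="default", plural=plural, body=body,
    )
```

A training pod would then mount the PVC `train-data` like any other volume instead of talking to OSS directly.
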
@@ -73,13 +73,13 @@ JindoCache is applicable to the following scenarios:

- AI and training acceleration, to reduce the costs of using AI clusters and provide more comprehensive capability support.

-![JindoCache](../../static/img/docs/case-study/ali-jindo.png)
+![JindoCache](/img/docs/case-study/ali-jindo.png)

#### KubeDL
KubeDL is a Kubernetes (ASI)-based AI workload orchestration system for managing the lifecycle of distributed AI workloads, interaction with layer-1 scheduling, failure tolerance and recovery, as well as dataset and runtime acceleration. It supports the stable operation of more than 10,000 AI training tasks on different platforms in the unified resource pool of Alibaba Group every day, including but not limited to tasks related to Taobao, Alimama, and DAMO Academy business domains. You can download the [open source edition of KubeDL](https://github.com/kubedl-io/kubedl) from GitHub.

#### Overall Project Architecture
-![architecture](../../static/img/docs/case-study/ali-architecture.png)
+![architecture](/img/docs/case-study/ali-architecture.png)

### 3.2 Benefits of JindoCache-based Fluid
1. Fluid can orchestrate datasets in Kubernetes clusters to co-deploy data and computing, and provides PVC-based APIs that integrate seamlessly with Kubernetes applications. JindoRuntime accelerates data access and caching in OSS. POSIX-based APIs of FUSE allow you to access large numbers of files in OSS the way you access local disks, so deep learning training tools such as PyTorch can read training data through POSIX-based APIs.
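To make item 1 concrete: from the training job's point of view, the cached OSS data is just a local directory. A hedged sketch follows; the mount path `/data` and the dataset layout are assumptions, not taken from the case study.

```python
# Minimal sketch: read training data through the Fluid/JindoRuntime FUSE mount
# as if it were a local disk. "/data" is a placeholder for wherever the
# Dataset's PVC is mounted in the training pod.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Plain POSIX reads; JindoCache decides whether they hit the local cache or OSS.
train_set = datasets.ImageFolder("/data/train", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)

for images, labels in loader:
    pass  # forward/backward pass goes here
```
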
@@ -136,27 +136,27 @@ Cluster and model: high-performance A800 server cluster equipped with remote dir

**Monitoring Data: Direct Connection without Caching**

-![w/o-cache-1](../../static/img/docs/case-study/ali-wo-cache-1.png)
+![w/o-cache-1](/img/docs/case-study/ali-wo-cache-1.png)

-![w/o-cache-2](../../static/img/docs/case-study/ali-wo-cache-2.png)
+![w/o-cache-2](/img/docs/case-study/ali-wo-cache-2.png)

-![w/o-cache-3](../../static/img/docs/case-study/ali-wo-cache-3.png)
+![w/o-cache-3](/img/docs/case-study/ali-wo-cache-3.png)

**Monitoring Data: Caching Enabled**

-![with-cache-1](../../static/img/docs/case-study/ali-with-cache-1.png)
+![with-cache-1](/img/docs/case-study/ali-with-cache-1.png)

The overall average GPU utilization is also close to 100%, and the load is uniform across GPUs, with each GPU close to 100%.

-![with-cache-2](../../static/img/docs/case-study/ali-with-cache-2.png)
+![with-cache-2](/img/docs/case-study/ali-with-cache-2.png)

#### Checkpoint Acceleration
**Training and Offline Inference Scenarios**
A distributed training task loads a checkpoint model file to continue training each time it is restarted. The model size ranges from hundreds of MB to tens of GB. In addition, a large number of offline inference tasks occupy many spot instance resources in the unified resource pool. Resources of an inference task can be preempted at any time, and the task will reload the model for offline inference after a failover. Therefore, a large number of jobs load the same checkpoint file after restart.

Distributed cache acceleration of Fluid converts multiple remote read operations into a single local read operation. This greatly accelerates job failovers and prevents bandwidth costs caused by multiple repeated read operations. In a typical large model scenario, the size of the model file is approximately 20 GB based on the 7B parameter size with FP16 precision. Fluid reduces the model loading time from 10 minutes to approximately 30 seconds.

-![Inference](../../static/img/docs/case-study/ali-inference.png)
+![Inference](/img/docs/case-study/ali-inference.png)

**Spot Scenarios of Training (write-through)**
In spot scenarios of distributed training, if resources of a synchronous training task are preempted, it is usually restarted globally through a failover to continue training. KubeDL cooperates with layer-1 scheduling to instruct, through interactive preemption, the rank 0 node of the training task to record an on-demand checkpoint to save the latest training progress. After the restart, the task can reload the latest checkpoint to continue training as soon as possible. This leverages the low costs of spot instance resources and minimizes the cost of training interruption.
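
The checkpoint flow described above, where rank 0 writes an on-demand checkpoint when preempted and every restarted worker reloads the same file through the Fluid cache, can be pictured with a small PyTorch sketch. This illustrates the flow only, not KubeDL's actual implementation; the checkpoint path and the use of SIGTERM as the preemption signal are assumptions.

```python
# Illustrative sketch only (not KubeDL's implementation): save an on-demand
# checkpoint from rank 0 on preemption, and reload it quickly after a failover.
# Assumes torch.distributed is already initialized; CKPT_PATH is a placeholder
# on the Fluid-backed mount.
import os
import signal
import torch
import torch.distributed as dist

CKPT_PATH = "/data/checkpoints/latest.pt"


class PreemptionCheckpointer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.step = 0
        # Spot preemption typically arrives as SIGTERM before the pod is killed.
        signal.signal(signal.SIGTERM, self._on_preemption)

    def _on_preemption(self, signum, frame):
        if dist.get_rank() == 0:  # only rank 0 records the on-demand checkpoint
            torch.save(
                {"model": self.model.state_dict(),
                 "optimizer": self.optimizer.state_dict(),
                 "step": self.step},
                CKPT_PATH,
            )

    def resume(self):
        """After a restart, reload the latest checkpoint if one exists."""
        if os.path.exists(CKPT_PATH):
            state = torch.load(CKPT_PATH, map_location="cpu")
            self.model.load_state_dict(state["model"])
            self.optimizer.load_state_dict(state["optimizer"])
            self.step = state["step"]
        return self.step
```

Because all restarted workers read the same checkpoint file, the distributed cache turns what would be many remote reads into roughly one remote read plus local cache hits, which matches the loading-time reduction described above.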

docs/case-study/haomo-case-study.md

Lines changed: 6 additions & 6 deletions
@@ -21,13 +21,13 @@ Data intelligence can also help these vertical products and consolidate their le

The rapid development of HAOMO.AI also reflects that higher-level intelligent driving will play a role in a wider range of scenarios, and autonomous driving is moving into the fast lane of commercial application.

-![](../../static/img/docs/case-study/haomo-arch.jpeg)
+![](/img/docs/case-study/haomo-arch.jpeg)

## 2. Training Effectiveness of Traditional Machine Learning Encounters a Bottleneck

The machine learning platform has played a central role in the widespread application of machine learning in autonomous driving scenarios. The platform adopts an architecture that separates storage and computing, decoupling computing resources from storage resources to provide flexible resource allocation, convenient storage expansion, and lower storage and O&M costs.

-![](../../static/img/docs/case-study/haomo-ml-arch.png)
+![](/img/docs/case-study/haomo-ml-arch.png)

However, this architecture also brings some challenges, the most critical of which lie in data access performance and stability:

@@ -63,7 +63,7 @@ It is necessary to improve the data localization on data access during model tra

We were eager to find a system platform with distributed cache acceleration capabilities on Kubernetes to achieve these goals. Fortunately, we found Fluid, a CNCF Sandbox project that meets our demands. We therefore designed a new architecture scheme based on Fluid and, after verification and comparison, chose JindoRuntime as the acceleration runtime.

-![](../../static/img/docs/case-study/haomo-with-fluid-arch.png)
+![](/img/docs/case-study/haomo-with-fluid-arch.png)

### 3.1 Technical Solution

@@ -108,15 +108,15 @@ We use different models to infer and train the same data. We conduct inference a

- *The test result of the model inferring 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-1.png)
+![](/img/docs/case-study/haomo-test-result-1.png)

- *The test result of another larger model inferring 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-2.png)
+![](/img/docs/case-study/haomo-test-result-2.png)

- *Time consumption of a model with 4 GPUs to train 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-3.png)
+![](/img/docs/case-study/haomo-test-result-3.png)

The efficiency of cloud training and inference improves significantly with Fluid + JindoRuntime, especially for some small models. JindoRuntime solves the I/O bottleneck, so training can be accelerated by up to about 300%. It also improves GPU utilization on the cloud and speeds up data-driven iteration on the cloud.
