Commit 9915720: versioning documentation

chenqiming committed
Signed-off-by: chenqiming <whqscqm@outlook.com>
1 parent eb5b956

File tree: 82 files changed, +8405 −59 lines


docs/case-study/alibaba-case-study.md

Lines changed: 9 additions & 9 deletions
@@ -54,7 +54,7 @@ To improve efficiency of large-scale machine learning model training for Alibaba

#### Fluid
[Fluid](https://github.com/fluid-cloudnative/fluid) is an open source, scalable distributed data orchestration and acceleration system. It provides transparent, Kubernetes-native data access for data-intensive applications such as AI and big data workloads, and aims to be an efficient support platform for such applications in cloud-native environments. Based on the data-layer abstraction provided by Kubernetes services, Fluid can flexibly and efficiently move, replicate, evict, transform, and manage data between storage sources such as HDFS, OSS, and Ceph and the upper-layer cloud-native computing applications on Kubernetes. These data operations are transparent to users: you do not need to worry about the efficiency of accessing remote data, the convenience of managing data sources, or how to help Kubernetes make O&M and scheduling decisions. You simply access the abstracted data through Kubernetes-native persistent volumes (PVs) and persistent volume claims (PVCs); Fluid handles the remaining tasks and underlying details.

-![Fluid](../../static/img/docs/case-study/ali-fluid.png)
+![Fluid](/img/docs/case-study/ali-fluid.png)

Fluid supports multiple runtimes, including JindoRuntime, AlluxioRuntime, JuiceFSRuntime, and GooseFSRuntime. JindoRuntime has outstanding capabilities, performance, and stability, and is applied in many scenarios. [JindoRuntime](https://github.com/aliyun/alibabacloud-jindodata/blob/master/docs/user/6.x/6.2.0/jindo_fluid/jindo_fluid_overview.md) is a distributed cache runtime of Fluid. It is built on JindoCache, a distributed cache acceleration engine.

@@ -73,13 +73,13 @@ JindoCache is applicable to the following scenarios:

- AI and training acceleration, to reduce the costs of using AI clusters and provide more comprehensive capability support.

-![JindoCache](../../static/img/docs/case-study/ali-jindo.png)
+![JindoCache](/img/docs/case-study/ali-jindo.png)

#### KubeDL
KubeDL is a Kubernetes (ASI)-based AI workload orchestration system that manages the lifecycle of distributed AI workloads, their interaction with layer-1 scheduling, failure tolerance and recovery, and dataset and runtime acceleration. It supports the stable operation of more than 10,000 AI training tasks per day on different platforms in Alibaba Group's unified resource pool, including but not limited to tasks from the Taobao, Alimama, and DAMO Academy business domains. The [open source edition of KubeDL](https://github.com/kubedl-io/kubedl) is available on GitHub.

#### Overall Project Architecture
-![architecture](../../static/img/docs/case-study/ali-architecture.png)
+![architecture](/img/docs/case-study/ali-architecture.png)

### 3.2 Benefits of JindoCache-based Fluid
1. Fluid can orchestrate datasets in Kubernetes clusters to co-deploy data and computing, and provides PVC-based APIs for seamless integration with Kubernetes applications. JindoRuntime accelerates data access and caching in OSS, and its FUSE-based POSIX interface lets you access large numbers of files in OSS as easily as local disks. Deep learning training tools such as PyTorch can read training data directly through these POSIX APIs, as in the sketch below.
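
To make that access pattern concrete, here is a minimal, illustrative sketch (not part of the changed files): it assumes a hypothetical Fluid dataset mounted into the training pod at `/data/train` through its PVC, and shows PyTorch treating the mount like an ordinary local directory. The path, file layout, and batch handling are placeholders.

```python
# Minimal sketch, assuming a Fluid dataset mounted into the training pod via a
# PVC at /data/train (hypothetical path). The training code only sees a POSIX
# directory; whether a read hits the JindoRuntime cache or goes back to OSS is
# decided by Fluid, not by this code.
import os
from torch.utils.data import Dataset, DataLoader

FLUID_MOUNT = "/data/train"  # hypothetical PVC mount point inside the pod

class MountedDataset(Dataset):
    """Treats the Fluid mount like a plain local folder of sample files."""

    def __init__(self, root: str):
        self.files = sorted(os.path.join(root, name) for name in os.listdir(root))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> bytes:
        # An ordinary POSIX read through the FUSE mount.
        with open(self.files[idx], "rb") as f:
            return f.read()

# Raw bytes are returned here, so keep each batch as a simple list;
# real training code would decode and transform the samples instead.
loader = DataLoader(MountedDataset(FLUID_MOUNT), batch_size=32,
                    num_workers=8, collate_fn=list)
```
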
@@ -136,27 +136,27 @@ Cluster and model: high-performance A800 server cluster equipped with remote dir

**Monitoring Data: Direct Connection without Caching**

-![w/o-cache-1](../../static/img/docs/case-study/ali-wo-cache-1.png)
+![w/o-cache-1](/img/docs/case-study/ali-wo-cache-1.png)

-![w/o-cache-2](../../static/img/docs/case-study/ali-wo-cache-2.png)
+![w/o-cache-2](/img/docs/case-study/ali-wo-cache-2.png)

-![w/o-cache-3](../../static/img/docs/case-study/ali-wo-cache-3.png)
+![w/o-cache-3](/img/docs/case-study/ali-wo-cache-3.png)

**Monitoring Data: Caching Enabled**

-![with-cache-1](../../static/img/docs/case-study/ali-with-cache-1.png)
+![with-cache-1](/img/docs/case-study/ali-with-cache-1.png)

The overall average GPU utilization is close to 100%, and the load is uniform across GPUs, with each one near 100%.

-![with-cache-2](../../static/img/docs/case-study/ali-with-cache-2.png)
+![with-cache-2](/img/docs/case-study/ali-with-cache-2.png)

#### Checkpoint Acceleration
**Training and Offline Inference Scenarios**
A distributed training task reloads a checkpoint model file to continue training each time it is restarted; model sizes range from hundreds of MB to tens of GB. In addition, a large number of offline inference tasks occupy many spot instance resources in the unified resource pool. The resources of an inference task can be preempted at any time, and the task reloads the model for offline inference after a failover. As a result, a large number of jobs load the same checkpoint file after a restart.

Fluid's distributed cache acceleration converts many remote read operations into a single local read. This greatly accelerates job failovers and avoids the bandwidth costs of repeated reads. In a typical large-model scenario, the model file is approximately 20 GB based on the 7B parameter size with FP16 precision; Fluid reduces the model loading time from 10 minutes to approximately 30 seconds.

-![Inference](../../static/img/docs/case-study/ali-inference.png)
+![Inference](/img/docs/case-study/ali-inference.png)

**Spot Scenarios of Training (write-through)**
In spot scenarios of distributed training, if the resources of a synchronous training task are preempted, the task is usually restarted globally through a failover to continue training. KubeDL cooperates with layer-1 scheduling to instruct, through interactive preemption, the rank-0 node of the training task to record an on-demand checkpoint that saves the latest training progress. After the restart, the task reloads the latest checkpoint and continues training as soon as possible. This takes advantage of the low cost of spot instance resources while minimizing the cost of training interruptions.
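
The failover pattern described above can be sketched roughly as follows. This is illustrative only, not the actual KubeDL integration: the checkpoint path under a Fluid-cached mount and the way the preemption signal would trigger the save are hypothetical placeholders.

```python
# Hedged sketch: many restarted jobs reload the same checkpoint from a
# Fluid-cached mount (local cache hits instead of repeated remote reads),
# and rank 0 writes an on-demand checkpoint when preemption is imminent.
import os
import torch
import torch.distributed as dist

CKPT_PATH = "/ckpt-cache/llm-7b/latest.pt"  # hypothetical Fluid mount path

def load_latest_checkpoint(model) -> int:
    """On (re)start, reload the newest checkpoint and return the saved step."""
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        return state.get("step", 0)
    return 0

def save_on_preemption(model, step: int) -> None:
    """What the text calls an on-demand checkpoint: only rank 0 records the
    latest progress when the scheduler signals that spot resources are about
    to be reclaimed (written through to the backing store)."""
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save({"model": model.state_dict(), "step": step}, CKPT_PATH)
```
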

docs/case-study/haomo-case-study.md

Lines changed: 6 additions & 6 deletions
@@ -21,13 +21,13 @@ Data intelligence can also help these vertical products and consolidate their le

The rapid development of HAOMO.AI also reflects that higher-level intelligent driving will play a role in a wider range of scenarios, and that autonomous driving is moving into the fast lane of commercial application.

-![](../../static/img/docs/case-study/haomo-arch.jpeg)
+![](/img/docs/case-study/haomo-arch.jpeg)

## 2. Training Effectiveness of Traditional Machine Learning Encounters a Bottleneck

The machine learning platform plays a central role in the widespread application of machine learning to autonomous driving scenarios. The platform adopts an architecture that separates storage from computing, decoupling computing resources from storage resources. This provides flexible resource allocation, convenient storage expansion, and lower storage and O&M costs.

-![](../../static/img/docs/case-study/haomo-ml-arch.png)
+![](/img/docs/case-study/haomo-ml-arch.png)

However, this architecture also brings some challenges, among which the most critical ones lie in data access performance and stability:
@@ -63,7 +63,7 @@ It is necessary to improve the data localization on data access during model tra

We were eager to find a system platform with distributed cache acceleration capabilities on Kubernetes to achieve these goals. Fortunately, we found Fluid, a CNCF Sandbox project that meets our demands. We therefore designed a new architecture scheme based on Fluid and, after verification and comparison, chose JindoRuntime as the acceleration runtime.

-![](../../static/img/docs/case-study/haomo-with-fluid-arch.png)
+![](/img/docs/case-study/haomo-with-fluid-arch.png)

### 3.1 Technical Solution

@@ -108,15 +108,15 @@ We use different models to infer and train the same data. We conduct inference a

- *The test result of the model inferring 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-1.png)
+![](/img/docs/case-study/haomo-test-result-1.png)

- *The test result of another, larger model inferring 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-2.png)
+![](/img/docs/case-study/haomo-test-result-2.png)

- *Time consumed by a model with 4 GPUs to train on 10,000 frames of images on the cloud*

-![](../../static/img/docs/case-study/haomo-test-result-3.png)
+![](/img/docs/case-study/haomo-test-result-3.png)

The efficiency of cloud training and inference improves significantly with Fluid + JindoRuntime, especially for smaller models. JindoRuntime resolves the I/O bottleneck, so training can be accelerated by up to about 300%. It also improves GPU utilization on the cloud and speeds up data-driven iteration there.

docs/case-study/metabit-trading-case-study.md

Lines changed: 5 additions & 5 deletions
@@ -21,15 +21,15 @@ Metabit Trading continues to make efforts in R&D investment, innovation support,

As an AI-powered hedge fund, we rely on strategy research through AI model training as our main research method. First, we need to extract features from the raw data before training the model. The signal-to-noise ratio of financial data is low; if we train directly on raw data, the resulting model will be noisy. In addition to market data (such as the stock prices and trading volumes commonly seen in the market), raw data includes non-price-and-volume data (such as research reports, financial reports, news, social media, and other unstructured data). Researchers extract features through a series of transformations and then train AI models. The following simplified diagram shows the strategy research pattern most closely related to machine learning in our research scenario.

-![](../../static/img/docs/case-study/metabit-trading-current-flow.png)
+![](/img/docs/case-study/metabit-trading-current-flow.png)

Model training produces models and signals. A signal is a judgment about future price trends, and its strength reflects the strength of the strategic orientation. The quantitative researcher uses this information to optimize the portfolio and form a real-time position for trading. In this process, the horizontal dimension (stocks) is considered for risk control, for example avoiding excessive holdings in any particular industry. Once a position strategy is formed, quantitative researchers place simulated orders and obtain the profit and loss information of the real-time position to understand the strategy's profitability. This is a complete quantitative research process.

## Requirements for Quantitative Research Basic Platform

1. **There are many unplanned tasks and high elasticity requirements.** During strategy research, quantitative researchers generate strategic ideas and test them through experiments. The computing platform therefore sees a large number of unplanned tasks, so we have high requirements for the auto scaling of computing capacity.

-![](../../static/img/docs/case-study/metabit-trading-workload.png)
+![](/img/docs/case-study/metabit-trading-workload.png)

The preceding figure shows the running-instance data of a cluster over a period of time. The number of instances in the whole cluster can reach several thousand at peak hours across multiple periods, while at other times the computing cluster can be scaled in to zero. The computing tasks of quantitative institutions are strongly correlated with researchers' R&D progress, so there are large gaps between peaks and troughs, which is also a characteristic of offline research tasks.

@@ -61,14 +61,14 @@ Considering POSIX compatibility, cost, and high throughput, we chose the JuiceFS

To this end, **we were eager to find software on Kubernetes with elastic distributed cache acceleration capability and good support for JuiceFS storage**. We found that Fluid works well with JuiceFS storage, and the JuiceFS team happens to be the main contributor and maintainer of JuiceFSRuntime in the Fluid project. Therefore, we designed our architecture solution based on Fluid and chose the native JuiceFSRuntime.

-![](../../static/img/docs/case-study/metabit-trading-arch.png)
+![](/img/docs/case-study/metabit-trading-arch.png)

## An Introduction to Architecture Components
### Fluid

Fluid differs from the traditional storage-oriented PVC abstraction: it abstracts *the process of computing tasks using data* on Kubernetes. It introduces the concept of an elastic Dataset, centered on an application's demand for data access, which attaches characteristics to the data, such as small files, read-only, or read-write. It also carves data out of storage and scopes it by those characteristics (for example, only the data a user cares about for a few days). On top of this, Fluid builds a scheduling system centered on the Dataset, focused on orchestrating the data itself and the applications that use it, with an emphasis on elasticity and lifecycle management. (A rough sketch of declaring such a Dataset and its runtime follows this excerpt.)

-![](../../static/img/docs/case-study/fluid-data-usage.png)
+![](/img/docs/case-study/fluid-data-usage.png)

### JuiceFSRuntime
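
As referenced above, the following is a rough, hedged sketch of what declaring such a Dataset together with a JuiceFSRuntime can look like, created here with the official Kubernetes Python client. It is illustrative only: the namespace, mount point, and replica count are made up, and real JuiceFS mounts need additional options (such as credentials) that are omitted.

```python
# Illustrative sketch only: a Fluid Dataset scoping out the slice of storage
# researchers care about, plus an elastic JuiceFSRuntime cache, created as
# Kubernetes custom resources. Names and sizes are hypothetical.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "research-data", "namespace": "quant"},
    "spec": {
        # The Dataset gives the data a scope and access characteristics
        # (here, a read-mostly feature directory for a given month).
        "mounts": [{"mountPoint": "juicefs:///features/2023-06", "name": "features"}],
    },
}

runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "JuiceFSRuntime",
    "metadata": {"name": "research-data", "namespace": "quant"},
    "spec": {"replicas": 3},  # elastic cache workers, scaled with demand
}

for plural, body in (("datasets", dataset), ("juicefsruntimes", runtime)):
    api.create_namespaced_custom_object(
        group="data.fluid.io", version="v1alpha1",
        namespace="quant", plural=plural, body=body,
    )
```
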

@@ -211,7 +211,7 @@ In the actual deployment evaluation, we use 20 ECS instances of ecs.g7.8xlarge s

For comparison, we measured the access times without Fluid and compared them with the access times using Fluid. The data is shown in the following figure:

-![](../../static/img/docs/case-study/metabit-trading-res.png)
+![](/img/docs/case-study/metabit-trading-res.png)

When the number of Pods started simultaneously is small, Fluid has no significant advantage over distributed storage. However, as more Pods start at the same time, Fluid shows a greater acceleration advantage. When concurrency is scaled up to 100 simultaneous Pods, Fluid reduces the average time consumption by more than 40% compared with traditional distributed storage. This speeds up tasks and also saves ECS costs that would otherwise be wasted waiting on I/O.

docs/case-study/weibo-case-study.md

Lines changed: 5 additions & 5 deletions
@@ -13,14 +13,14 @@ The deep learning platform plays an important role in Weibo's social business. U

## 1. Background
Sina Weibo is the largest social media platform in China. Hundreds of millions of pieces of content are generated and spread on it every day. The following figure shows Weibo's business ecosystem: quality content producers generate and spread premium content, other users enjoy this content and follow the microbloggers they like, and thus interaction is established and a healthy closed-loop ecosystem is formed.

-![background](../../static/img/docs/case-study/weibo-background.png)
+![background](/img/docs/case-study/weibo-background.png)

The main function of Weibo's machine learning platform is to make this whole process operate more efficiently and smoothly. By understanding the quality content, the platform builds user profiles and pushes content that interests users, which allows users to interact with content producers and encourages producers to produce more (and better) content. The result is a win-win for both information consumers and producers. As multimedia content becomes mainstream, deep learning technology becomes more important: from understanding multimedia content to optimizing CTR tasks, its support is indispensable.

## 2. Challenges of Training Large-Scale Deep Learning Models
With the widespread use of deep learning in Weibo's business scenarios, the deep learning platform plays a central role. The platform decouples computing resources from storage resources by separating storage and computing, which provides flexible resource allocation, convenient storage expansion, and lower storage costs.

-![challenge](../../static/img/docs/case-study/weibo-challenge.png)
+![challenge](/img/docs/case-study/weibo-challenge.png)

However, this architecture also brings some challenges, among which the most critical ones are data access performance and stability.

@@ -47,7 +47,7 @@ We need to achieve better data locality to meet the computational requirements o

We were eager to find software with distributed cache acceleration capabilities on Kubernetes to achieve these goals. Fortunately, we found Fluid, a CNCF Sandbox project that met our demands. We therefore designed a new architecture scheme based on Fluid and, after verification and comparison, chose JindoRuntime as the acceleration runtime.

-![fluid](../../static/img/docs/case-study/weibo-fluid.png)
+![fluid](/img/docs/case-study/weibo-fluid.png)

### 3.1 An Introduction to Architecture Components
#### 3.1.1 Fluid
@@ -88,11 +88,11 @@ Fluid JindoRuntime is used to prefetch data and train models.

## 3.5 Results of Performance Testing
With the Fluid + JindoRuntime solution, training speed improves once data has been prefetched. As shown in the figure below, in the scenario of 3 nodes and 12 GPUs, tests that read data through the HDFS interface were often interrupted by problems such as poor network communication, causing the tests to fail. After exception handling was added, the waiting time between workers became longer, so adding GPUs slowed training down instead of speeding it up: even though computing resources were scaled out, the overall training speed with 3 nodes and 12 GPUs was virtually the same as with 1 node and 8 GPUs. With the new scheme, compared with the HDFS interface, training is accelerated by about 5 times with 1 node and 4 GPUs, 9 times with 2 nodes and 8 GPUs, and 18 times with 3 nodes and 12 GPUs.

-![result-1](../../static/img/docs/case-study/weibo-res-1.png)
+![result-1](/img/docs/case-study/weibo-res-1.png)

With the speed and stability of training guaranteed, the model's end-to-end training time has been reduced from 389 hours (about 16 days) to 16 hours.

-![result-1](../../static/img/docs/case-study/weibo-res-2.png)
+![result-1](/img/docs/case-study/weibo-res-2.png)

## 4. Summary
After the integration of Fluid and JindoRuntime, the performance and stability of model training in small-file scenarios improved significantly. In distributed training with multiple nodes and multiple GPUs, the model training speed can be increased by 18 times: training that used to take two weeks now takes only 16 hours. Shorter training time and less pressure on HDFS also improve the stability of training tasks, with the success rate increasing from 37.1% to 98.3%. The amount of data in our production environment is currently 4 TB and will continue to grow with continuous iteration.
