Commit b4087ac
kosabogi, szabosteve, darnautov, and colleenmcginnis authored
[Serverless] Adds Trained model autoscaling page (#139)
* Adds Trained model autoscaling page
* Update serverless/pages/ml-nlp-auto-scale.mdx
* Changes paragraph placement
* Updates document based on feedback
* mdx to asciidoc
* Updates table

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: Dima Arnautov <arnautov.dima@gmail.com>
Co-authored-by: Colleen McGinnis <colleen.mcginnis@elastic.co>
1 parent a7cde6d commit b4087ac

File tree: 3 files changed (+209 −0 lines changed)
ml-nlp-deployment.png (106 KB)

serverless/index-serverless-general.asciidoc (+2)

@@ -32,3 +32,5 @@ include::./pages/service-status.asciidoc[leveloffset=+2]
 include::./pages/user-profile.asciidoc[leveloffset=+2]
 
 include::./pages/cloud-regions.asciidoc[leveloffset=+2]
+
+include::./pages/ml-nlp-auto-scale.asciidoc[leveloffset=+2]

serverless/pages/ml-nlp-auto-scale.asciidoc (+207, new file)

@@ -0,0 +1,207 @@

[[general-ml-nlp-auto-scale]]
= Trained model autoscaling

// :keywords: serverless

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows Elasticsearch to automatically adjust the resources the model deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in Kibana by enabling adaptive resources

Trained model autoscaling is available for both serverless and Cloud deployments. In serverless deployments, processing power is managed differently across Search, Observability, and Security projects, which impacts their costs and resource limits.

Security and Observability projects are charged only for data ingestion and retention. They are not charged for processing power (VCU usage), which is used for more complex operations, such as running advanced search models. For example, in Search projects, models such as ELSER require significant processing power to provide more accurate search results.

[discrete]
[[enabling-autoscaling-through-apis-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.
If you set a static number of allocations, they remain constant even when not all the available resources are fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process.
This can help you to manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, additional model allocations are automatically created as needed.
When the load is low, a model allocation is automatically removed.
You can explicitly set the minimum and maximum number of allocations; autoscaling will occur within these limits.

[NOTE]
====
If you set the minimum number of allocations to 1, you will be charged even if the system is not using those resources.
====

You can enable adaptive allocations by using:

* the create inference endpoint API for https://www.elastic.co/guide/en/elasticsearch/reference/master/infer-service-elser.html[ELSER], https://www.elastic.co/guide/en/elasticsearch/reference/master/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as inference services.
* the https://www.elastic.co/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html[start trained model deployment] or https://www.elastic.co/guide/en/elasticsearch/reference/master/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on machine learning nodes.
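
For example, you can enable adaptive allocations when creating an inference endpoint. The following is a minimal sketch: the endpoint ID `my-elser-endpoint` and the allocation limits are illustrative values, not recommendations.

[source,console]
----
// The endpoint ID and allocation limits below are illustrative.
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 10
    }
  }
}
----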

If the new allocations fit on the current machine learning nodes, they are immediately started.
If more resource capacity is needed to create new model allocations, and machine learning autoscaling is enabled, your machine learning node is scaled up to provide enough resources for the new allocations.
The number of model allocations can be scaled down to 0.
They cannot be scaled up to more than 32 allocations, unless you explicitly set the maximum number of allocations to a higher value.
Adaptive allocations must be set up independently for each deployment and https://www.elastic.co/guide/en/elasticsearch/reference/master/put-inference-api.html[inference endpoint].
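
For a trained model deployed on machine learning nodes, adaptive allocations can be enabled when starting the deployment. A minimal sketch, assuming the built-in ELSER model ID `.elser_model_2` and an illustrative deployment ID:

[source,console]
----
// Model ID and deployment ID are illustrative.
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=my_elser_for_search
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 10
  }
}
----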

When you create inference endpoints on Serverless using Kibana, adaptive allocations are automatically turned on, and there is no option to disable them.
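
To adjust the limits on an existing deployment, the update trained model deployment API accepts the same `adaptive_allocations` object. A minimal sketch that reuses the illustrative deployment ID from the previous example:

[source,console]
----
// The deployment ID is illustrative.
POST _ml/trained_models/my_elser_for_search/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 0,
    "max_number_of_allocations": 4
  }
}
----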

[discrete]
[[optimizing-for-typical-use-cases]]
=== Optimizing for typical use cases

You can optimize your model deployment for typical use cases, such as search and ingest.
When you optimize for ingest, the throughput will be higher, which increases the number of inference requests that can be performed in parallel.
When you optimize for search, the latency will be lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"threads_per_allocation": 1`).
* If you want to optimize for search, set the number of threads to greater than `1`.
Increasing the number of threads will make the search processes more performant.
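
For example, a search-optimized deployment can be started with more than one thread per allocation. A minimal sketch, assuming the illustrative ELSER model ID used earlier; `threads_per_allocation` must be a power of 2 (maximum 32):

[source,console]
----
// IDs are illustrative; threads_per_allocation > 1 optimizes for search.
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=my_elser_search&threads_per_allocation=2
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 4
  }
}
----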

[discrete]
[[enabling-autoscaling-in-kibana-adaptive-resources]]
== Enabling autoscaling in Kibana - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for Elasticsearch to scale up or down the available resources based on the load on the process.
This can help you to manage performance and cost more easily.
When adaptive resources are enabled, the number of VCUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of VCUs that the process can use is automatically increased.
When the load is low, the number of VCUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment; autoscaling will occur within the selected level's range.

Refer to the tables in the <<model-deployment-resource-matrix,model deployment resource matrix>> section to find out the settings for the level you selected.

image::images/ml-nlp-deployment.png[ML model deployment with adaptive resources enabled.]

Search projects are given access to more processing resources than Security and Observability projects. This difference is reflected in the UI: the resource limits you can configure for a Search project are higher, to accommodate its more complex operations.

On Serverless, adaptive allocations are automatically enabled for all project types.
However, the "Adaptive resources" control is not displayed in Kibana for Observability and Security projects.

[discrete]
[[model-deployment-resource-matrix]]
== Model deployment resource matrix

The resources used by trained model deployments depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether model autoscaling is enabled with adaptive allocations/resources to have dynamic resources, or disabled for static resources

The following tables show you the number of allocations, threads, and VCUs available on Serverless when adaptive resources are enabled or disabled.

[discrete]
[[deployments-on-serverless-optimized-for-ingest]]
=== Deployments on serverless optimized for ingest

In the case of ingest-optimized deployments, we maximize the number of model allocations.

[discrete]
[[adaptive-resources-enabled]]
==== Adaptive resources enabled

|===
| Level | Allocations | Threads | VCUs

| Low
| 0 to 2 dynamically
| 1
| 0 to 16 dynamically

| Medium
| 1 to 32 dynamically
| 1
| 8 to 256 dynamically

| High
a| 1 to 512 for Search +
1 to 128 for Security and Observability
| 1
a| 8 to 4096 for Search +
8 to 1024 for Security and Observability
|===

[discrete]
[[adaptive-resources-disabled-search-only]]
==== Adaptive resources disabled (Search only)

|===
| Level | Allocations | Threads | VCUs

| Low
| Exactly 2
| 1
| 16

| Medium
| Exactly 32
| 1
| 256

| High
a| 512 for Search +
No static allocations for Security and Observability
| 1
a| 4096 for Search +
No static allocations for Security and Observability
|===

[discrete]
[[deployments-on-serverless-optimized-for-search]]
=== Deployments on serverless optimized for search

[discrete]
[[adaptive-resources-enabled-for-search]]
==== Adaptive resources enabled

|===
| Level | Allocations | Threads | VCUs

| Low
| 0 to 1 dynamically
| Always 2
| 0 to 16 dynamically

| Medium
| 1 to 2 (if threads=16) dynamically
| Maximum (for example, 16)
| 8 to 256 dynamically

| High
a| 1 to 32 (if threads=16) dynamically for Search +
1 to 128 for Security and Observability
| Maximum (for example, 16)
a| 8 to 4096 for Search +
8 to 1024 for Security and Observability
|===

[discrete]
[[adaptive-resources-disabled-for-search]]
==== Adaptive resources disabled

|===
| Level | Allocations | Threads | VCUs

| Low
| 1 statically
| Always 2
| 16

| Medium
| 2 statically (if threads=16)
| Maximum (for example, 16)
| 256

| High
a| 32 statically (if threads=16) for Search +
No static allocations for Security and Observability
| Maximum (for example, 16)
a| 4096 for Search +
No static allocations for Security and Observability
|===
