===============
Presto on Spark
===============

Presto on Spark makes it possible to leverage Spark as an execution engine for Presto queries.
This is useful for queries that need to run on thousands of nodes,
require tens or hundreds of terabytes of memory, and consume many CPU years.

Spark adds several useful features like resource isolation, fine-grained resource
management, and a scalable materialized exchange mechanism.

Installation
------------

Download the Presto Spark package tarball, :maven_download:`spark-package`,
and the Presto Spark launcher, :maven_download:`spark-launcher`. Keep both files in the same directory.
The example assumes a two-node Spark cluster with four cores each, for a total of eight cores.
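
For illustration, after both downloads the directory could look like this,
where *<version>* stands for the Presto release that was downloaded:

.. code-block:: none

    $ ls
    presto-spark-launcher-<version>.jar
    presto-spark-package-<version>.tar.gz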

The following is an example ``config.properties``:

.. code-block:: properties

    task.concurrency=4
    task.max-worker-threads=4
    task.writer-count=4

The details about properties are available at :doc:`/admin/properties`.
Note that ``task.concurrency``, ``task.writer-count``, and ``task.max-worker-threads`` are set to 4 each,
since there are four cores per executor and this aligns with the Spark submit arguments below.
These values should be adjusted to keep all executor cores busy and
synchronize with :command:`spark-submit` parameters.
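
For example, with four cores per executor, the matching arguments in the
:command:`spark-submit` invocation shown below would be:

.. code-block:: none

    --executor-cores 4 \
    --conf spark.task.cpus=4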

Execution
---------

To execute Presto on Spark, first start the Spark cluster, which is assumed to have
the URL *spark://spark-master:7077*. Save the query in a file, for example, named *query.sql*.
Run the :command:`spark-submit` command from the directory where Presto on Spark is installed:

.. parsed-literal::

    /spark/bin/spark-submit \\
    --master spark://spark-master:7077 \\
    --executor-cores 4 \\
    --conf spark.task.cpus=4 \\
    --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \\
      presto-spark-launcher-\ |version|\ .jar \\
    --package presto-spark-package-\ |version|\ .tar.gz \\
    --config /presto/etc/config.properties \\
    --catalogs /presto/etc/catalogs \\
    --catalog hive \\
    --schema default \\
    --file query.sql
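
The file *query.sql* contains plain Presto SQL. As a minimal sketch, assuming the
``hive.default`` schema from the command above has a table named ``orders`` (a
hypothetical name), the file could contain:

.. code-block:: none

    $ cat query.sql
    SELECT orderstatus, count(*)
    FROM orders
    GROUP BY orderstatus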

The details about configuring catalogs are at :ref:`catalog_properties`.
In the Spark submit arguments, note the values of *executor-cores* (the number of cores per
executor in Spark) and *spark.task.cpus* (the number of cores to allocate to each task
in Spark). These are also equal to the number of cores (4 in the example) and are
the same as some of the ``config.properties`` settings discussed above. This is to ensure that
a single Presto on Spark task is run in a single Spark executor (this limitation may be
temporary and is introduced to avoid duplicating broadcasted hash tables for every task).