Implementation of pipeline to run LST-Bench on Trino in Azure (#242)

Closes #238

jcamachor authored Feb 22, 2024
1 parent 32b76a7 commit c9636ef
Showing 32 changed files with 768 additions and 76 deletions.
6 changes: 3 additions & 3 deletions run/README.md
@@ -24,9 +24,9 @@ This folder contains configurations for running LST-Bench on various systems as
- [x] Delta Lake 2.2.0
- [x] Apache Hudi 0.12.2
- [x] Apache Iceberg 1.1.0
- - [ ] Trino 420
- - [ ] Delta Lake
- - [ ] Apache Iceberg
+ - [x] Trino 420
+ - [x] Delta Lake
+ - [x] Apache Iceberg

## Folder Structure
While the folder for each engine may have a slightly different structure, they generally contain the following:
3 changes: 2 additions & 1 deletion run/spark-3.3.1/azure-pipelines/README.md
@@ -32,10 +32,11 @@ This directory comprises the necessary tooling for executing LST-Bench on Apache
- A VMSS cluster that will serve as the Spark worker nodes, within the same VNet as the head node.
- An Azure Storage Account accessible by both the VMSS and head node.
- An Azure SQL Database (or SQL Server-flavored RDBMS) that will run the Hive Metastore.
- The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
+ The Hive Metastore schema for version 2.3.9 should already be installed in the instance.
- Prior to running the pipeline, several variables need to be defined in your Azure Pipeline:
- `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_container`: Name of the container in the Azure Blob Storage account where the source data for the experiment is stored.
- `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
- `hms_jdbc_url`: JDBC URL for the Hive Metastore.
- `hms_jdbc_user`: Username for the Hive Metastore.
4 changes: 4 additions & 0 deletions run/spark-3.3.1/azure-pipelines/sh/hms.sh
@@ -5,6 +5,10 @@ if [ "$#" -ne 7 ]; then
fi

source env.sh
if [ -z "${USER}" ]; then
echo "ERROR: USER is not defined."
exit 1
fi
if [ -z "${HADOOP_HOME}" ]; then
echo "ERROR: HADOOP_HOME is not defined."
exit 1
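`env.sh` itself is not part of this diff; a minimal sketch of the file, assuming it only needs to provide the variables that `hms.sh` checks, could look like the following (variable names come from the script above, values are placeholders):

```bash
# Hypothetical env.sh sketch -- not the actual file used by the pipeline.
export USER="lstbench"
export HADOOP_HOME="/opt/hadoop"
```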
56 changes: 56 additions & 0 deletions run/trino-420/azure-pipelines/README.md
@@ -0,0 +1,56 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Azure Pipelines Deployment for LST-Bench on Trino 420
This directory comprises the necessary tooling for executing LST-Bench on Trino 420 with different LSTs using Azure Pipelines. The included tooling consists of:
- `run-lst-bench.yml`:
An Azure Pipelines script designed to deploy Trino and execute LST-Bench.
- `sh/`:
A directory containing shell scripts and engine configuration files supporting the deployment of Trino and the execution of experiments.
- `config/`:
A directory with LST-Bench configuration files necessary for executing the experiments that are part of the results.

## Prerequisites
- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipeline script expects the following setup:
- A VM named 'lst-bench-client' connected to the pipeline environment to run the LST-Bench client.
- A VM named 'lst-bench-head' to run the coordinator node of the Trino cluster, also connected to the pipeline environment.
- A VMSS cluster that will serve as the Trino worker nodes, within the same VNet as the coordinator node.
- An Azure Storage Account accessible by both the VMSS and coordinator node.
- An Azure SQL Database (or SQL Server-flavored RDBMS) that will run the Hive Metastore.
The Hive Metastore schema for version 2.3.9 should already be installed in the instance.
- Prior to running the pipeline, several variables need to be defined in your Azure Pipeline (see the illustrative YAML sketch below this list):
- `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_container`: Name of the container in the Azure Blob Storage account where the source data for the experiment is stored.
- `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
- `hms_jdbc_url`: JDBC URL for the Hive Metastore.
- `hms_jdbc_user`: Username for the Hive Metastore.
- `hms_jdbc_password` (secret): Password for the Hive Metastore.
- `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as the data_storage_account).
- `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- The LSTs to run experiments on can be modified via input parameters for the pipelines in the Azure Pipelines YAML file or from the Web UI.
Default values are assigned to these parameters.
Parameters also include experiment scale factor, machine type, and cluster size.
Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline.
Instead, they are recorded in the experiment telemetry for proper categorization and visualization of results later on.
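The exact contents of `run-lst-bench.yml` are not reproduced here; the sketch below only illustrates how the variables and parameters listed above could be declared in an Azure Pipelines YAML file. The names mirror the list above, all values are placeholders, and the JDBC driver/URL assume an Azure SQL Database-backed Hive Metastore; secrets must be defined in the pipeline UI or a variable group rather than in the YAML.

```yaml
# Illustrative sketch only; not the actual run-lst-bench.yml.
parameters:
  - name: lsts                        # LSTs to run experiments on
    type: object
    default: [delta, iceberg]
  - name: exp_scale_factor
    default: '100'
  - name: exp_machine
    default: Standard_E8s_v5          # placeholder VM SKU
  - name: exp_cluster_size
    default: '8'

variables:
  data_storage_account: mydatastorage
  data_storage_account_container: tpcds
  hms_jdbc_driver: com.microsoft.sqlserver.jdbc.SQLServerDriver
  hms_jdbc_url: 'jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=hms'
  hms_jdbc_user: hmsadmin
  hms_storage_account: myhmsstorage
  hms_storage_account_container: hmsdata
  # data_storage_account_shared_key, hms_jdbc_password, and hms_storage_account_shared_key
  # are secrets: define them in the Azure Pipelines UI or a variable group, not here.
```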

## Additional Notes
For workloads within LST-Bench that include an `optimize` step, particularly those involving partitioned tables, a [custom task](/docs/workloads.md#custom-tasks) is used to execute this step.
The task divides the `optimize` operation into batches, each covering up to 100 partitions (this batch size is configurable).
This approach was implemented to address issues where Trino would crash if the optimization step were applied to the entire table.
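The custom task itself lives in the LST-Bench code base and is not shown in this diff; the Java sketch below (hypothetical class and method names) only illustrates the batching idea, and assumes the connector accepts a partition predicate in the `WHERE` clause of `ALTER TABLE ... EXECUTE optimize`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the batching idea; not LST-Bench's actual custom task.
public class BatchedOptimizeSketch {

  // Upper bound on partitions per OPTIMIZE statement (configurable in the real task).
  private static final int MAX_PARTITIONS_PER_BATCH = 100;

  // Builds one OPTIMIZE statement per batch of partition values.
  static List<String> buildOptimizeStatements(
      String table, String partitionColumn, List<String> partitionValues) {
    List<String> statements = new ArrayList<>();
    for (int start = 0; start < partitionValues.size(); start += MAX_PARTITIONS_PER_BATCH) {
      int end = Math.min(start + MAX_PARTITIONS_PER_BATCH, partitionValues.size());
      String inList = String.join(", ", partitionValues.subList(start, end));
      // Assumes the connector accepts a partition predicate on OPTIMIZE.
      statements.add("ALTER TABLE " + table + " EXECUTE optimize WHERE "
          + partitionColumn + " IN (" + inList + ")");
    }
    return statements;
  }
}
```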
9 changes: 9 additions & 0 deletions run/trino-420/azure-pipelines/config/connections_config.yaml
@@ -0,0 +1,9 @@
# Description: Connections Configuration
---
version: 1
connections:
  - id: trino_0
    driver: io.trino.jdbc.TrinoDriver
    url: jdbc:trino://${TRINO_MASTER_HOST}:8080
    username: admin
    password: ''
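For reference, this is roughly how a JDBC client would use that connection entry; the sketch below is illustrative, with `localhost` standing in for the resolved `${TRINO_MASTER_HOST}` and a trivial query.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Minimal sketch of opening the connection described in connections_config.yaml.
public class TrinoConnectionSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("io.trino.jdbc.TrinoDriver");   // driver class from the config above
    String url = "jdbc:trino://localhost:8080";   // ${TRINO_MASTER_HOST} -> placeholder host
    Properties props = new Properties();
    props.setProperty("user", "admin");           // no password: cluster runs without auth
    try (Connection conn = DriverManager.getConnection(url, props);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```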
@@ -0,0 +1,30 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  table_format: delta
  table_format_version: undefined
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
  catalog: delta
  database: "delta_${EXP_NAME}"
  table_format: delta
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/delta/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
  partition_spec_keyword: 'partitioned_by'
@@ -0,0 +1,30 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  table_format: iceberg
  table_format_version: undefined
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
  catalog: iceberg
  database: "iceberg_${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
  partition_spec_keyword: 'partitioning'
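The two experiment configurations above differ only in catalog, table format, `mode` (`cow` vs. `mor`), and `partition_spec_keyword`: `partitioned_by` and `partitioning` are the partitioning table-property names used by Trino's Delta Lake and Iceberg connectors, respectively. The `${...}` placeholders are substituted before LST-Bench reads the files; assuming they come from environment variables exported by the pipeline, a sketch of those exports (placeholder values only) could be:

```bash
# Illustrative values; the actual exports are set by the Azure Pipelines scripts.
export EXP_NAME="trino_delta_run"        # hypothetical experiment name
export EXP_SCALE_FACTOR="100"
export EXP_MACHINE="Standard_E8s_v5"     # hypothetical VM SKU
export EXP_CLUSTER_SIZE="8"
export DATA_STORAGE_ACCOUNT="mydatastorage"
export DATA_STORAGE_ACCOUNT_CONTAINER="tpcds"
```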
20 changes: 20 additions & 0 deletions run/trino-420/azure-pipelines/config/setup_experiment_config.yaml
@@ -0,0 +1,20 @@
# Description: Experiment Configuration
---
version: 1
id: setup_experiment
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  scale_factor: "${EXP_SCALE_FACTOR}"
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
13 changes: 13 additions & 0 deletions run/trino-420/azure-pipelines/config/telemetry_config.yaml
@@ -0,0 +1,13 @@
# Description: Telemetry Configuration
---
version: 1
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:./telemetry-trino-420
execute_ddl: true
ddl_file: 'src/main/resources/scripts/logging/duckdb/ddl.sql'
insert_file: 'src/main/resources/scripts/logging/duckdb/insert.sql'
# The following parameter values will be used to replace the variables in the logging statements.
parameter_values:
  data_path: ''
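Telemetry ends up in the local DuckDB database file referenced by the JDBC URL above. Assuming the `duckdb` CLI is available on the client VM, the collected events can be inspected after a run, for example:

```bash
# List the tables created by ddl.sql in the telemetry database.
duckdb ./telemetry-trino-420 "SHOW TABLES;"
```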