Implementation of pipeline to run LST-Bench on Trino in Azure (#242)

Closes #238

jcamachor authored Feb 22, 2024
1 parent 32b76a7 commit c9636ef
Showing 32 changed files with 768 additions and 76 deletions.
6 changes: 3 additions & 3 deletions run/README.md
@@ -24,9 +24,9 @@ This folder contains configurations for running LST-Bench on various systems as
- [x] Delta Lake 2.2.0
- [x] Apache Hudi 0.12.2
- [x] Apache Iceberg 1.1.0
- - [ ] Trino 420
- - [ ] Delta Lake
- - [ ] Apache Iceberg
+ - [x] Trino 420
+ - [x] Delta Lake
+ - [x] Apache Iceberg

## Folder Structure
While the folder for each engine may have a slightly different structure, they generally contain the following:
3 changes: 2 additions & 1 deletion run/spark-3.3.1/azure-pipelines/README.md
@@ -32,10 +32,11 @@ This directory comprises the necessary tooling for executing LST-Bench on Apache
- A VMSS cluster that will serve as the Spark worker nodes, within the same VNet as the head node.
- An Azure Storage Account accessible by both the VMSS and head node.
- An Azure SQL Database (or SQL Server-flavored RDBMS) that will run the Hive Metastore.
- The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
+ The Hive Metastore schema for version 2.3.9 should already be installed in the instance.
- Prior to running the pipeline, several variables need to be defined in your Azure Pipeline:
- `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_container`: Name of the container in the Azure Blob Storage account where the source data for the experiment is stored.
- `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
- `hms_jdbc_url`: JDBC URL for the Hive Metastore.
- `hms_jdbc_user`: Username for the Hive Metastore.
4 changes: 4 additions & 0 deletions run/spark-3.3.1/azure-pipelines/sh/hms.sh
@@ -5,6 +5,10 @@ if [ "$#" -ne 7 ]; then
fi

source env.sh
if [ -z "${USER}" ]; then
echo "ERROR: USER is not defined."
exit 1
fi
if [ -z "${HADOOP_HOME}" ]; then
echo "ERROR: HADOOP_HOME is not defined."
exit 1
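`env.sh` itself is not part of this diff; a minimal sketch of the file, assuming it only needs to provide the variables that `hms.sh` checks, could look like the following (variable names come from the script above, values are placeholders):

```bash
# Hypothetical env.sh sketch -- not the actual file used by the pipeline.
export USER="lstbench"
export HADOOP_HOME="/opt/hadoop"
```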
56 changes: 56 additions & 0 deletions run/trino-420/azure-pipelines/README.md
@@ -0,0 +1,56 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Azure Pipelines Deployment for LST-Bench on Trino 420
This directory comprises the necessary tooling for executing LST-Bench on Trino 420 with different LSTs using Azure Pipelines. The included tooling consists of:
- `run-lst-bench.yml`:
An Azure Pipelines script designed to deploy Trino and execute LST-Bench.
- `sh/`:
A directory containing shell scripts and engine configuration files supporting the deployment of Trino and the execution of experiments.
- `config/`:
A directory with LST-Bench configuration files necessary for executing the experiments that are part of the results.

## Prerequisites
- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipeline script expects the following setup:
- A VM named 'lst-bench-client' connected to the pipeline environment to run the LST-Bench client.
- A VM named 'lst-bench-head' to run the coordinator node of the Trino cluster, also connected to the pipeline environment.
- A VMSS cluster that will serve as the Trino worker nodes, within the same VNet as the coordinator node.
- An Azure Storage Account accessible by both the VMSS and coordinator node.
- An Azure SQL Database (or SQL Server-flavored RDBMS) that will run the Hive Metastore.
The Hive Metastore schema for version 2.3.9 should already be installed in the instance.
- Prior to running the pipeline, several variables need to be defined in your Azure Pipeline (see the illustrative YAML sketch below this list):
- `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
- `data_storage_account_container`: Name of the container in the Azure Blob Storage account where the source data for the experiment is stored.
- `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
- `hms_jdbc_url`: JDBC URL for the Hive Metastore.
- `hms_jdbc_user`: Username for the Hive Metastore.
- `hms_jdbc_password` (secret): Password for the Hive Metastore.
- `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as the data_storage_account).
- `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- The LSTs to run experiments on can be modified via input parameters for the pipelines in the Azure Pipelines YAML file or from the Web UI.
Default values are assigned to these parameters.
Parameters also include experiment scale factor, machine type, and cluster size.
Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline.
Instead, they are recorded in the experiment telemetry for proper categorization and visualization of results later on.
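The exact contents of `run-lst-bench.yml` are not reproduced here; the sketch below only illustrates how the variables and parameters listed above could be declared in an Azure Pipelines YAML file. The names mirror the list above, all values are placeholders, and the JDBC driver/URL assume an Azure SQL Database-backed Hive Metastore; secrets must be defined in the pipeline UI or a variable group rather than in the YAML.

```yaml
# Illustrative sketch only; not the actual run-lst-bench.yml.
parameters:
  - name: lsts                        # LSTs to run experiments on
    type: object
    default: [delta, iceberg]
  - name: exp_scale_factor
    default: '100'
  - name: exp_machine
    default: Standard_E8s_v5          # placeholder VM SKU
  - name: exp_cluster_size
    default: '8'

variables:
  data_storage_account: mydatastorage
  data_storage_account_container: tpcds
  hms_jdbc_driver: com.microsoft.sqlserver.jdbc.SQLServerDriver
  hms_jdbc_url: 'jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=hms'
  hms_jdbc_user: hmsadmin
  hms_storage_account: myhmsstorage
  hms_storage_account_container: hmsdata
  # data_storage_account_shared_key, hms_jdbc_password, and hms_storage_account_shared_key
  # are secrets: define them in the Azure Pipelines UI or a variable group, not here.
```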

## Additional Notes
For workloads within LST-Bench that include an `optimize` step, particularly those involving partitioned tables, a [custom task](/docs/workloads.md#custom-tasks) is used to execute this step.
The task divides the `optimize` operation into batches, each covering up to 100 partitions (this batch size is configurable).
This approach was implemented to address issues where Trino would crash if the optimization step were applied to the entire table.
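The custom task itself lives in the LST-Bench code base and is not shown in this diff; the Java sketch below (hypothetical class and method names) only illustrates the batching idea, and assumes the connector accepts a partition predicate in the `WHERE` clause of `ALTER TABLE ... EXECUTE optimize`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the batching idea; not LST-Bench's actual custom task.
public class BatchedOptimizeSketch {

  // Upper bound on partitions per OPTIMIZE statement (configurable in the real task).
  private static final int MAX_PARTITIONS_PER_BATCH = 100;

  // Builds one OPTIMIZE statement per batch of partition values.
  static List<String> buildOptimizeStatements(
      String table, String partitionColumn, List<String> partitionValues) {
    List<String> statements = new ArrayList<>();
    for (int start = 0; start < partitionValues.size(); start += MAX_PARTITIONS_PER_BATCH) {
      int end = Math.min(start + MAX_PARTITIONS_PER_BATCH, partitionValues.size());
      String inList = String.join(", ", partitionValues.subList(start, end));
      // Assumes the connector accepts a partition predicate on OPTIMIZE.
      statements.add("ALTER TABLE " + table + " EXECUTE optimize WHERE "
          + partitionColumn + " IN (" + inList + ")");
    }
    return statements;
  }
}
```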
9 changes: 9 additions & 0 deletions run/trino-420/azure-pipelines/config/connections_config.yaml
@@ -0,0 +1,9 @@
# Description: Connections Configuration
---
version: 1
connections:
  - id: trino_0
    driver: io.trino.jdbc.TrinoDriver
    url: jdbc:trino://${TRINO_MASTER_HOST}:8080
    username: admin
    password: ''
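For reference, this is roughly how a JDBC client would use that connection entry; the sketch below is illustrative, with `localhost` standing in for the resolved `${TRINO_MASTER_HOST}` and a trivial query.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Minimal sketch of opening the connection described in connections_config.yaml.
public class TrinoConnectionSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("io.trino.jdbc.TrinoDriver");   // driver class from the config above
    String url = "jdbc:trino://localhost:8080";   // ${TRINO_MASTER_HOST} -> placeholder host
    Properties props = new Properties();
    props.setProperty("user", "admin");           // no password: cluster runs without auth
    try (Connection conn = DriverManager.getConnection(url, props);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```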
@@ -0,0 +1,30 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  table_format: delta
  table_format_version: undefined
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
  catalog: delta
  database: "delta_${EXP_NAME}"
  table_format: delta
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/delta/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
  partition_spec_keyword: 'partitioned_by'
@@ -0,0 +1,30 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  table_format: iceberg
  table_format_version: undefined
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
  catalog: iceberg
  database: "iceberg_${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
  partition_spec_keyword: 'partitioning'
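The two experiment configurations above differ only in catalog, table format, `mode` (`cow` vs. `mor`), and `partition_spec_keyword`: `partitioned_by` and `partitioning` are the partitioning table-property names used by Trino's Delta Lake and Iceberg connectors, respectively. The `${...}` placeholders are substituted before LST-Bench reads the files; assuming they come from environment variables exported by the pipeline, a sketch of those exports (placeholder values only) could be:

```bash
# Illustrative values; the actual exports are set by the Azure Pipelines scripts.
export EXP_NAME="trino_delta_run"        # hypothetical experiment name
export EXP_SCALE_FACTOR="100"
export EXP_MACHINE="Standard_E8s_v5"     # hypothetical VM SKU
export EXP_CLUSTER_SIZE="8"
export DATA_STORAGE_ACCOUNT="mydatastorage"
export DATA_STORAGE_ACCOUNT_CONTAINER="tpcds"
```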
20 changes: 20 additions & 0 deletions run/trino-420/azure-pipelines/config/setup_experiment_config.yaml
@@ -0,0 +1,20 @@
# Description: Experiment Configuration
---
version: 1
id: setup_experiment
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: trino
  system_version: 420
  scale_factor: "${EXP_SCALE_FACTOR}"
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: hive
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: textfile
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ''
  external_tblproperties_suffix: ", textfile_field_separator=',', null_format='', skip_header_line_count=1"
13 changes: 13 additions & 0 deletions run/trino-420/azure-pipelines/config/telemetry_config.yaml
@@ -0,0 +1,13 @@
# Description: Telemetry Configuration
---
version: 1
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:./telemetry-trino-420
execute_ddl: true
ddl_file: 'src/main/resources/scripts/logging/duckdb/ddl.sql'
insert_file: 'src/main/resources/scripts/logging/duckdb/insert.sql'
# The following parameter values will be used to replace the variables in the logging statements.
parameter_values:
  data_path: ''
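Telemetry ends up in the local DuckDB database file referenced by the JDBC URL above. Assuming the `duckdb` CLI is available on the client VM, the collected events can be inspected after a run, for example:

```bash
# List the tables created by ddl.sql in the telemetry database.
duckdb ./telemetry-trino-420 "SHOW TABLES;"
```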