diff --git a/run/README.md b/run/README.md
new file mode 100644
index 00000000..14e74f6c
--- /dev/null
+++ b/run/README.md
@@ -0,0 +1,46 @@
+
+
+# LST-Bench: Configurations and Results
+This folder contains the configurations used to run LST-Bench on the various systems depicted in the [LST-Bench dashboard](/metrics/app), along with details about the setups used to generate those results.
+
+## Systems Included
+- [x] Apache Spark 3.3.1
+  - [x] Delta Lake 2.2.0
+  - [x] Apache Hudi 0.12.2
+  - [x] Apache Iceberg 1.1.0
+- [ ] Trino 420
+  - [ ] Delta Lake
+  - [ ] Apache Iceberg
+
+## Folder Structure
+While the folder for each engine may have a slightly different structure, it generally contains the following:
+
+- `scripts/`:
+  This directory contains the SQL files used to execute LST-Bench workloads on the respective engine.
+  These SQL files may vary slightly across engines and LSTs depending on the supported SQL dialect.
+- `config/`:
+  This directory houses the LST-Bench configuration files required to execute the workload.
+  It includes LST-Bench phase/session/task libraries that reference the aforementioned SQL scripts (see the illustrative sketch at the end of this README).
+- Additional infrastructure and configuration automation folders, e.g., `azure-pipelines/`:
+  These folders contain scripts or files that facilitate running the benchmark on a specific infrastructure/engine.
+  For instance, Azure Pipelines scripts to deploy an engine with different LSTs and execute LST-Bench.
+  Generally, these folders should include an additional README.md file offering further details.
+- `results/`:
+  This folder stores the results of the LST-Bench runs, as captured by LST-Bench telemetry using DuckDB.
+  These results are processed and visualized in the [LST-Bench dashboard](/metrics/app).
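+
+To illustrate how the pieces under `config/` fit together, the sketch below shows a task library entry and a workload phase that references it. The field names and paths are illustrative assumptions rather than the exact LST-Bench schema (and, in practice, the task library and the workload definition live in separate files); see the files under each engine's `config/` directory for the authoritative format.
+
+```yaml
+# Illustrative sketch only; field names and paths are assumptions, not the
+# authoritative LST-Bench schema. See the files under config/ for real examples.
+
+# Task library: a task template is a named sequence of SQL scripts.
+task_templates:
+- id: setup
+  files:
+  - scripts/tpcds/setup/ddl-external-tables.sql
+
+# Workload library: phases group sessions, and sessions run tasks by template id.
+phases:
+- id: setup
+  sessions:
+  - tasks:
+    - template_id: setup
+```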
diff --git a/run/spark-3.3.1/azure-pipelines/README.md b/run/spark-3.3.1/azure-pipelines/README.md
index 53df878f..4488a12e 100644
--- a/run/spark-3.3.1/azure-pipelines/README.md
+++ b/run/spark-3.3.1/azure-pipelines/README.md
@@ -1,22 +1,50 @@
-TODO: FILL IN
-- Variables
- - DATA_STORAGE_ACCOUNT
- - DATA_STORAGE_ACCOUNT_SHARED_KEY
- - HMS_JDBC_DRIVER
- - HMS_JDBC_URL
- - HMS_JDBC_USER
- - HMS_JDBC_PASSWORD
- - HMS_STORAGE_ACCOUNT
- - HMS_STORAGE_ACCOUNT_SHARED_KEY
- - HMS_STORAGE_ACCOUNT_CONTAINER
+
+
+# Azure Pipelines Deployment for LST-Bench on Apache Spark 3.3.1
+This directory contains the tooling needed to run LST-Bench on Apache Spark 3.3.1 with different LSTs using Azure Pipelines. It consists of:
+- `run-lst-bench.yml`:
+  An Azure Pipelines script that deploys Apache Spark with various LSTs and executes LST-Bench.
+- `sh/`:
+  A directory containing shell scripts and engine configuration files that support deploying Spark with different LSTs and running the experiments.
+- `config/`:
+  A directory with the LST-Bench configuration files needed to execute the experiments included in the results.
+
+## Prerequisites
+- Deploying the Azure infrastructure to run LST-Bench is not automated. As a result, the Azure Pipelines script expects the following setup:
+  - A VM named 'lst-bench-client', connected to the pipeline environment, to run the LST-Bench client.
+  - A VM named 'lst-bench-head', also connected to the pipeline environment, to run the head node of the Spark cluster.
+  - A VMSS cluster that will serve as the Spark worker nodes, deployed within the same VNet as the head node.
+  - An Azure Storage Account accessible by both the VMSS and the head node.
+  - An Azure SQL Database (or a SQL Server-flavored RDBMS) that will host the Hive Metastore.
+    The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
+- Prior to running the pipeline, the following variables need to be defined in your Azure Pipeline:
+  - `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
+  - `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
+  - `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
+  - `hms_jdbc_url`: JDBC URL for the Hive Metastore.
+  - `hms_jdbc_user`: Username for the Hive Metastore.
+  - `hms_jdbc_password` (secret): Password for the Hive Metastore.
+  - `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as `data_storage_account`).
+  - `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
+  - `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
+- The LST versions and configurations to run can be modified via input parameters, either in the Azure Pipelines YAML file or from the web UI; default values are assigned to each parameter (see the illustrative sketch below).
+  Parameters also include the experiment scale factor, machine type, and cluster size.
+  Note that these parameters are not used to deploy the data or the infrastructure, as that process is not automated in the pipeline.
+  Instead, they are recorded in the experiment telemetry so that results can be properly categorized and visualized later on.
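+
+The sketch below shows how such input parameters are typically declared in an Azure Pipelines YAML file. The parameter names and default values here are hypothetical; the actual definitions live in `run-lst-bench.yml`.
+
+```yaml
+# Hypothetical parameter names and defaults, for illustration only;
+# see run-lst-bench.yml for the actual definitions.
+parameters:
+- name: lsts
+  displayName: 'LSTs to benchmark'
+  type: object
+  default: ['delta', 'hudi', 'iceberg']
+- name: scale_factor
+  displayName: 'Experiment scale factor'
+  type: string
+  default: '100'
+- name: machine_type
+  displayName: 'VM SKU used for the Spark cluster'
+  type: string
+  default: 'Standard_E8s_v5'
+- name: cluster_size
+  displayName: 'Number of Spark worker nodes'
+  type: number
+  default: 8
+```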