Documentation

microsoft · Feb 16, 2024 · 31be09e
1 parent 40bae21
commit 31be09e
Showing 2 changed files with 91 additions and 17 deletions.
diff --git a/run/README.md b/run/README.md
@@ -0,0 +1,46 @@
+<!--
+{% comment %}
+Copyright (c) Microsoft Corporation.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# LST-Bench: Configurations and Results
+This folder contains configurations for running LST-Bench on various systems as depicted in the [LST-Bench dashboard](/metrics/app), along with details about the setups used to generate those results.
+
+## Systems Included
+- [x] Apache Spark 3.3.1
+  - [x] Delta Lake 2.2.0
+  - [x] Apache Hudi 0.12.2
+  - [x] Apache Iceberg 1.1.0
+- [ ] Trino 420
+  - [ ] Delta Lake
+  - [ ] Apache Iceberg
+
+## Folder Structure
+While each engine's folder may have a slightly different structure, the folders generally contain the following:
+
+- `scripts/`: 
+  This directory contains SQL files used to execute LST-Bench workloads on the respective engine. 
+  These SQL files may vary slightly across engines and LSTs, depending on the supported SQL dialect.
+- `config/`: 
+  This directory houses LST-Bench configuration files required to execute the workload. 
+  It includes LST-Bench phase/session/task libraries that reference the aforementioned SQL scripts.
+- Additional infrastructure and configuration automation folders, e.g., `azure-pipelines/`: 
+  These folders contain scripts or files facilitating automation for running the benchmark on a specific infrastructure/engine.
+  For instance, Azure Pipelines scripts that deploy an engine with different LSTs and execute LST-Bench.
+  Generally, these folders should include an additional README.md file offering further details.
+- `results/`: 
+  This folder stores the results of the LST-Bench runs as captured by LST-Bench telemetry using DuckDB.
+  These results are processed and visualized in the [LST-Bench dashboard](/metrics/app).
diff --git a/run/spark-3.3.1/azure-pipelines/README.md b/run/spark-3.3.1/azure-pipelines/README.md
@@ -1,22 +1,50 @@
-TODO: FILL IN
+<!--
+{% comment %}
+Copyright (c) Microsoft Corporation.
 
-- Set up variables
-	- pwd database
-	- pwd storage account
-	- storage account name
-	- client host ip
-	- engine head ip
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
+    http://www.apache.org/licenses/LICENSE-2.0
 
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
 
-- Variables
-  - DATA_STORAGE_ACCOUNT
-  - DATA_STORAGE_ACCOUNT_SHARED_KEY
-  - HMS_JDBC_DRIVER
-  - HMS_JDBC_URL
-  - HMS_JDBC_USER
-  - HMS_JDBC_PASSWORD
-  - HMS_STORAGE_ACCOUNT
-  - HMS_STORAGE_ACCOUNT_SHARED_KEY
-  - HMS_STORAGE_ACCOUNT_CONTAINER
+# Azure Pipelines Deployment for LST-Bench on Apache Spark 3.3.1
+This directory comprises the necessary tooling for executing LST-Bench on Apache Spark 3.3.1 with different LSTs using Azure Pipelines. The included tooling consists of:
+- `run-lst-bench.yml`:
+  An Azure Pipelines script designed to deploy Apache Spark with various LSTs and execute LST-Bench.
+- `sh/`:
+  A directory containing shell scripts and engine configuration files supporting the deployment of Spark with different LSTs and the execution of experiments.
+- `config/`:
+  A directory with LST-Bench configuration files necessary for executing the experiments that are part of the results.
 
+## Prerequisites
+- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipeline script expects the following setup:
+  - A VM named 'lst-bench-client' connected to the pipeline environment to run the LST-Bench client.
+  - A VM named 'lst-bench-head' to run the head node of the Spark cluster, also connected to the pipeline environment.
+  - A VMSS cluster that will serve as the Spark worker nodes, within the same VNet as the head node.
+  - An Azure Storage Account accessible by both the VMSS and head node.
+  - An Azure SQL Database (or SQL Server-flavored RDBMS) that will be running the Hive Metastore.
+    The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
+- Prior to running the pipeline, several variables must be defined in your Azure Pipeline:
+  - `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
+  - `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
+  - `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
+  - `hms_jdbc_url`: JDBC URL for the Hive Metastore.
+  - `hms_jdbc_user`: Username for the Hive Metastore.
+  - `hms_jdbc_password` (secret): Password for the Hive Metastore.
+  - `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as `data_storage_account`).
+  - `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
+  - `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
+- The versions and configurations of LSTs to run can be modified via input parameters for the pipelines in the Azure Pipelines YAML file or from the Web UI.
+  Default values are assigned to these parameters. 
+  Parameters also include experiment scale factor, machine type, and cluster size. 
+  Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline. 
+  Instead, they are recorded in the experiment telemetry for proper categorization and visualization of results later on.
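
Editor's note (not part of the commit): the variables listed under Prerequisites could be checked before a run with a small pre-flight script. The sketch below is illustrative only. The helper `missing_variables` is a hypothetical name, not something shipped in this repository, and it assumes the variables reach the script as upper-cased environment variables, which is how Azure Pipelines exposes non-secret variables. Secret variables would need to be mapped explicitly in the step's `env:` section.

```python
import os

# Variable names copied from the Prerequisites list above.
# Assumption: the pipeline exposes each one to scripts as an
# upper-cased environment variable (e.g. HMS_JDBC_URL); secret
# variables must be mapped explicitly in the step's `env:` section.
REQUIRED_VARIABLES = [
    "data_storage_account",
    "data_storage_account_shared_key",
    "hms_jdbc_driver",
    "hms_jdbc_url",
    "hms_jdbc_user",
    "hms_jdbc_password",
    "hms_storage_account",
    "hms_storage_account_shared_key",
    "hms_storage_account_container",
]


def missing_variables(env=None):
    """Return the required variables that are absent or empty in `env`.

    `env` is any mapping of environment variable names to values;
    it defaults to the current process environment.
    """
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARIABLES if not env.get(name.upper())]
```

Calling `missing_variables()` at the start of a pipeline step and failing the step when the returned list is non-empty would surface an undefined variable before any deployment work begins.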