diff --git a/lib/workload/stateless/stacks/data-sharing-manager/Readme.md b/lib/workload/stateless/stacks/data-sharing-manager/Readme.md
index ad4cc35fa..16295a810 100644
--- a/lib/workload/stateless/stacks/data-sharing-manager/Readme.md
+++ b/lib/workload/stateless/stacks/data-sharing-manager/Readme.md
@@ -2,81 +2,177 @@

 ## Description

-The sharing manager works two main ways, as a 'push' or 'pull' step.
+The data sharing manager is divided into three main components:
+1. Package generation
+2. Package validation
+3. Package sharing
+
+For all three parts, we recommend using the data-sharing-tool provided.
+
+### Installing the Data Sharing Tool
+
+We recommend installing the data-sharing-tool by running the following command (from this directory).
+
+Please preface the command with `bash`: the `scripts/install.sh` script relies on `bash`-specific features and may fail if run under a different shell.
+
+```bash
+bash scripts/install.sh
+```
+
+## Package Generation
+
+> This component expects the user to have some familiarity with AWS Athena.
+
+We use the 'mart' tables to generate the appropriate manifests for package generation.
+
+You may use the UI to generate the manifests, or you can use the command line interface as shown below.
+
+In the example below, we collect the libraries associated with the project 'CUP' whose
+sequencing run date is on or after '2025-04-01'.
+
+Only the lims manifest is required when collecting fastq data.
+
+The workflow manifest (along with the lims manifest) is required when collecting secondary analysis data.
+
+```bash
+WORK_GROUP="orcahouse"
+DATASOURCE_NAME="orcavault"
+DATABASE_NAME="mart"
+
+# Initialise the query
+query_execution_id="$( \
+  aws athena start-query-execution \
+    --no-cli-pager \
+    --query-string " \
+      SELECT *
+      FROM lims
+      WHERE
+        project_id = 'CUP' AND
+        sequencing_run_date >= CAST('2025-04-01' AS DATE)
+    " \
+    --work-group "${WORK_GROUP}" \
+    --query-execution-context "Database=${DATABASE_NAME},Catalog=${DATASOURCE_NAME}" \
+    --output text \
+    --query 'QueryExecutionId' \
+)"
+
+# Wait for the query to complete
+while true; do
+  query_state="$( \
+    aws athena get-query-execution \
+      --no-cli-pager \
+      --output text \
+      --query-execution-id "${query_execution_id}" \
+      --query 'QueryExecution.Status.State' \
+  )"
+
+  if [[ "${query_state}" == "SUCCEEDED" ]]; then
+    break
+  elif [[ "${query_state}" == "FAILED" || "${query_state}" == "CANCELLED" ]]; then
+    echo "Query failed or was cancelled"
+    exit 1
+  fi
+
+  sleep 5
+done
+
+# Collect the query results location
+query_results_uri="$( \
+  aws athena get-query-execution \
+    --no-cli-pager \
+    --output text \
+    --query-execution-id "${query_execution_id}" \
+    --query 'QueryExecution.ResultConfiguration.OutputLocation' \
+)"
+
+# Download the results
+aws s3 cp "${query_results_uri}" ./lims_manifest.csv
+```

-Inputs are configured into the API, and then the step function is launched.
+Using the lims manifest, we can now generate the package.

-For pushing sharing types, if configuration has not been tried with the '--dryrun' flag first, the API will return an error.
-This is so we don't go accidentally pushing data to the wrong place.
+By using the `--wait` parameter, the CLI will only return once the package has completed.

-A job will then be scheduled and ran in the background, a user can check the status of the job by checking the job status in the API. 
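+If you omit `--wait`, the command returns immediately after printing the package id; you can check on the
+package later with the `get-package-status` subcommand, for example (using a placeholder package id):
+
+```bash
+data-sharing-tool get-package-status \
+  --package-id pkg.12345678910
+```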
+Package generation may take around 5 minutes to complete, depending on the size of the package.
+
+```bash
+data-sharing-tool generate-package \
+  --lims-manifest-csv lims_manifest.csv \
+  --wait
+```

-### Push or Pull?
+This will generate a package and print the package id to the console like so:

-When pushing, we use the s3 steps copy manager to 'push' data to a bucket. We assume that we have access to this bucket.
-When pulling, we generate a presigned url containing a script that can be used to download the data.
+```bash
+Generating package 'pkg.123456789'...
+```
+
+For the workflow manifest, we can use the same query as above, but we will need to change the final table name to 'workflow'.

-### Pushing Outputs
+An example of the SQL might be as follows:

-Once a Job has completed pushing data, the job response object can be queried to gather the following information:
-* fastq data that was pushed
-* portal run ids that were pushed
-* list the s3 objects that were pushed.
+```sql
+/*
+Get the libraries associated with the project 'CUP' whose sequencing run date is on or after '2025-04-01'.
+*/
+WITH libraries AS (
+    SELECT library_id
+    FROM lims
+    WHERE
+        project_id = 'CUP' AND
+        sequencing_run_date >= CAST('2025-04-01' AS DATE)
+)
+/*
+Select matching tumor-normal (TN) workflows for the libraries above
+*/
+SELECT *
+FROM workflow
+WHERE
+    workflow_name = 'tumor-normal' AND
+    library_id IN (SELECT library_id FROM libraries)
+```
+
+Download the workflow query results as `workflow_manifest.csv` in the same way, and supply them to
+`generate-package` with the `--workflow-manifest-csv` parameter.

-### Invoking a job
+## Package Validation

-The Job API launch comprises the following inputs:
+Once the package has finished generating, we can validate it using the following command:

-* instrumentRunIdList: The list of instrument run ids to be shared (used for fastq sharing only), can be used in tandem alongside one of the metadata attributes of libraryId, subjectId, individualId or projectId and will take an intersection of the two for fastq data.
-* libraryIdList: A list of library ids to be shared. Cannot be used alongside subjectIdList, individualIdList or projectIdList.
-* subjectIdList: A list of subject ids to share. Cannot be used alongside libraryIds, projectIdList or individualIdList.
-* projectIdList: A list of project names to share. Cannot be used alongside libraryIds, subjectIdList or individualIdList.
-* dataTypeList: A list of data types to share. Can be one or more of:
-  * 'Fastq'
-  * 'SecondaryAnalysis'
-* defrostArchivedFastqs: A boolean flag to determine if we should de-frost archived fastqs. This is only used for fastq data types.
-  If set to true, and the fastq data is archived, the data de-frosted will be triggered but the workflow will not wait for the data to be de-frosted and will fail with a DataDefrostingError.
-* secondaryAnalysisWorkflowList: A list of secondary analysis workflows to share, can be used in tandem with data types.
-  The possible values are one or more of:
-  * cttsov2 (or dragen-tso500-ctdna)
-  * tumor-normal (or dragen-wgts-dna)
-  * wts (or dragen-wgts-rna)
-  * oncoanalyser-wgts-dna
-  * oncoanalyser-wgts-rna
-  * oncoanalyser-wgts-dna-rna
-  * rnasum
-  * umccrise
-  * sash
-* portalRunIdList: A list of portal run ids to share.
-  For secondaryanalysis data types, this parameter will take precedence over any metadata specified or secondary workflow types specified.
-* portalRunIdExclusionList: A list of portal run ids NOT to share.
-  For secondaryanalysis data types, this parameter can be used in tandem with metadata or secondary workflow types specified. 
-  This is useful if a known workflow has been repeated and we do not wish to share the original.
-* shareType: The type of share, must be one of 'push' or 'pull'
-* shareDestination: The destination of the share, only required if shareType is 'push'. Can be an 'icav2' or 's3' uri.
-* dryrun: A boolean flag, used when we set the push type to true to determine if we should actually push the data or instead just print out to the console the list of s3 objects we would have sent.
+> If the BROWSER env var is set, the package report will automatically be opened in your browser!

+```bash
+data-sharing-tool view-package-report \
+  --package-id pkg.12345678910
+```

-### Steps Functions Output
+Look through the metadata, fastq and secondary analysis tabs to ensure that the package is correct.

-* The steps function will output two attributes:
-  * limsCsv presigned url - a presigned url to download a csv file containing the lims metadata to share
-  * data-download script presigned url - a presigned url to download a bash script that can be used to download the data.
-
-### Data Download Url for Pulling Data
+## Package Sharing

-The data download script will have the following options:
+### Pushing Packages

-* --data-download-path - the root path of the data to be downloaded, this directory must already exist.
-* --dryrun | --dry-run - a flag to indicate that the script should not download the data, but instead print the commands that would be run and directories that would be created.
-* --check-size-only - a flag to skip any downloading if the existing file is the same size as the file to be downloaded.
-* --skip-existing - a flag to skip downloading files that already exist in the destination directory (regardless of size).
-* --print-summary - a flag to print a summary of the files that would be downloaded and the total size of the download.
+We can use the following command to push the package to a destination location. This will generate a push job id.
+Like package generation, we can use the `--wait` parameter to wait for the job to complete.

+```bash
+data-sharing-tool push-package \
+  --package-id pkg.12345678910 \
+  --share-location s3://bucket/path-to-prefix/
+```

-
+### Presigning Packages
+
+Not all data receivers will have an S3 bucket or ICAV2 project for us to dump data in.
+
+Therefore, we also support the old-school presigned URL method.
+
+We can use the following command to generate a script of presigned URLs for the package:
+
+```bash
+data-sharing-tool presign-package \
+  --package-id pkg.12345678910
+```
+
+This will return a presigned URL for a shell script that can be used to download the package.
diff --git a/lib/workload/stateless/stacks/data-sharing-manager/scripts/data-sharing-tool.py b/lib/workload/stateless/stacks/data-sharing-manager/scripts/data-sharing-tool.py
new file mode 100755
index 000000000..dfa802378
--- /dev/null
+++ b/lib/workload/stateless/stacks/data-sharing-manager/scripts/data-sharing-tool.py
@@ -0,0 +1,764 @@
+#!/usr/bin/env python3
+
+"""
+data-sharing-tool ::: Data sharing packager
+
+Usage:
+    data-sharing-tool <command> [<args>...] 
+ +Command: + help Print this help message and exit + generate-package Generate a package + list-packages List package jobs + get-package-status Get Package Status + view-package-report View the package report + push-package Push a package to a destination + presign-package Presign a package + list-push-jobs List push jobs + get-push-job-status Get status of a push job +""" + +# Imports +import json +import sys +from os import environ +from pathlib import Path +from textwrap import dedent + +from time import sleep +from docopt import docopt +import requests +import pandas as pd +import pandera as pa +from pandera.typing import DataFrame +from typing import Optional, List, Dict, TypedDict +import typing +import boto3 +from requests import HTTPError +from subprocess import call + +if typing.TYPE_CHECKING: + from mypy_boto3_ssm import SSMClient + from mypy_boto3_secretsmanager import SecretsManagerClient + +# Global +DATA_SHARING_PREFIX = 'data-sharing' +AWS_HOSTNAME_SSM_PATH = '/hosted_zone/umccr/name' +AWS_ORCABUS_TOKEN_SECRET_ID = 'orcabus/token-service-jwt' +AWS_PRODUCTION_ACCOUNT_ID = '472057503814' + +# Models +class PackageRequestResponseDict(TypedDict): + id: str + packageName: str + stepsExecutionArn: str + status: str + requestTime: str + completionTime: Optional[str] + hasExpired: bool + + +class PackageRequestDict(TypedDict): + libraryIdList: List[str] + dataTypeList: List[str] + portalRunIdList: Optional[List[str]] + + +class PushJobRequestResponseDict(TypedDict): + id: str + stepFunctionsExecutionArn: str + status: str + startTime: str + packageId: str + shareDestination: str + logUri: str + endTime: Optional[str] + errorMessage: Optional[str] + +# Dataframe models +class LimsManifestDataFrame(pa.DataFrameModel): + sequencing_run_id: str = pa.Field(nullable=True) + sequencing_run_date: str = pa.Field(nullable=True) + library_id: str = pa.Field(nullable=True) + internal_subject_id: str = pa.Field(nullable=True) + external_subject_id: str = pa.Field(nullable=True) + sample_id: str = pa.Field(nullable=True) + external_sample_id: str = pa.Field(nullable=True) + experiment_id: str = pa.Field(nullable=True) + project_id: str = pa.Field(nullable=True) + owner_id: str = pa.Field(nullable=True) + workflow: str = pa.Field(nullable=True) + phenotype: str = pa.Field(nullable=True) + type: str = pa.Field(nullable=True) + assay: str = pa.Field(nullable=True) + quality: str = pa.Field(nullable=True) + source: str = pa.Field(nullable=True) + truseq_index: str = pa.Field(nullable=True) + load_datetime: str = pa.Field(nullable=True) + partition_schema_name: str = pa.Field(nullable=True) + partition_name: str = pa.Field(nullable=True) + + +class WorkflowManifestDataFrame(pa.DataFrameModel): + portal_run_id: str = pa.Field(nullable=True) + library_id: str = pa.Field(nullable=True) + workflow_name: str = pa.Field(nullable=True) + workflow_version: str = pa.Field(nullable=True) + workflow_status: str = pa.Field(nullable=True) + workflow_start: str = pa.Field(nullable=True) + workflow_end: str = pa.Field(nullable=True) + workflow_duration: int = pa.Field(coerce=True, nullable=True) + workflow_comment: str = pa.Field(nullable=True) + partition_schema_name: str = pa.Field(nullable=True) + partition_name: str = pa.Field(nullable=True) + + +# AWS functions +def get_hostname() -> str: + ssm_client: SSMClient = boto3.client('ssm') + return ssm_client.get_parameter( + Name=AWS_HOSTNAME_SSM_PATH + )['Parameter']['Value'] + + +def get_orcabus_token() -> str: + secrets_manager_client: SecretsManagerClient = 
boto3.client('secretsmanager') + return json.loads( + secrets_manager_client.get_secret_value( + SecretId=AWS_ORCABUS_TOKEN_SECRET_ID + )['SecretString'] + )['id_token'] + + +# Request functions +def get_base_api() -> str: + return f"https://{DATA_SHARING_PREFIX}.{get_hostname()}" + + +def get_default_get_headers() -> Dict: + return { + "Accept": "application/json", + "Authorization": f"Bearer {get_orcabus_token()}", + } + + +def get_default_post_headers() -> Dict: + return { + "Accept": "application/json", + "Content-Type": "application/json", + "Authorization": f"Bearer {get_orcabus_token()}", + } + + +def create_package( + package_name: str, + package_request: PackageRequestDict +) -> str: + """ + Create a package request + :param package_name: + :param package_request: + :return: + """ + response = requests.post( + headers=get_default_post_headers(), + json={ + "packageName": package_name, + "packageRequest": package_request, + }, + url=f"{get_base_api()}/api/v1/package", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return response.json()['id'] + + +def list_packages(package_name: Optional[str]) -> List[PackageRequestResponseDict]: + response = requests.get( + headers=get_default_get_headers(), + params=dict(filter( + lambda kv: kv[1] is not None, + { + "packageName": package_name, + "rowsPerPage": 1000 + }.items() + )), + url=f"{get_base_api()}/api/v1/package", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return list(filter( + lambda package_obj_iter_: package_obj_iter_['packageName'] == package_name, + response.json()['results'] + )) + + +def list_push_jobs(package_id: Optional[str]) -> List[PushJobRequestResponseDict]: + response = requests.get( + headers=get_default_get_headers(), + params=dict(filter( + lambda kv: kv[1] is not None, + { + "packageId": package_id, + "rowsPerPage": 1000 + }.items() + )), + url=f"{get_base_api()}/api/v1/push", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return response.json()['results'] + + +def get_package(package_id: str) -> PackageRequestResponseDict: + response = requests.get( + headers=get_default_get_headers(), + url=f"{get_base_api()}/api/v1/package/{package_id}", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return response.json() + + +def get_push_job(push_job_id: Optional[str]) -> PushJobRequestResponseDict: + response = requests.get( + headers=get_default_get_headers(), + url=f"{get_base_api()}/api/v1/push/{push_job_id}", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return response.json() + + +def get_package_report(package_id: str) -> str: + response = requests.get( + headers=get_default_get_headers(), + url=f"{get_base_api()}/api/v1/package/{package_id}:getSummaryReport", + ) + + try: + response.raise_for_status() + except HTTPError as e: + raise HTTPError(f"Got an error, response was {response.text}") from e + + return response.text + + +def push_package(package_id: str, location_uri: str) -> str: + response = requests.post( + headers=get_default_post_headers(), + json={ + "shareDestination": location_uri, + }, + 
url=f"{get_base_api()}/api/v1/package/{package_id}:push"
+    )
+
+    try:
+        response.raise_for_status()
+    except HTTPError as e:
+        raise HTTPError(f"Got an error, response was {response.text}") from e
+
+    return response.json()['id']
+
+
+def presign_package(package_id: str) -> str:
+    response = requests.get(
+        headers=get_default_get_headers(),
+        url=f"{get_base_api()}/api/v1/package/{package_id}:presign"
+    )
+
+    try:
+        response.raise_for_status()
+    except HTTPError as e:
+        raise HTTPError(f"Got an error, response was {response.text}") from e
+
+    return response.text
+
+
+# Sub functions
+def generate_package(
+        package_name: str,
+        lims_manifest: DataFrame[LimsManifestDataFrame],
+        workflow_manifest: Optional[DataFrame[WorkflowManifestDataFrame]] = None,
+        exclude_primary_data: bool = False
+) -> str:
+    """
+    Given a package name, the manifest for the LIMS and an optional workflow manifest,
+    generate and launch a package request.
+    :param package_name:
+    :param lims_manifest:
+    :param workflow_manifest:
+    :param exclude_primary_data:
+    :return:
+    """
+
+    # Get library ids from the lims manifest
+    library_ids = lims_manifest['library_id'].unique().tolist()
+
+    # Get the portal run ids from the workflow manifest
+    if workflow_manifest is not None:
+        portal_run_ids = workflow_manifest['portal_run_id'].unique().tolist()
+    else:
+        portal_run_ids = None
+
+    # Create the package request payload
+    # FASTQ is included unless the caller has asked to exclude primary data
+    package_request: PackageRequestDict = {
+        "libraryIdList": library_ids,
+        "dataTypeList": (
+            (["FASTQ"] if not exclude_primary_data else []) +
+            (["SECONDARY_ANALYSIS"] if workflow_manifest is not None else [])
+        ),
+        "portalRunIdList": portal_run_ids,
+    }
+
+    return create_package(
+        package_name=package_name,
+        package_request=package_request,
+    )
+
+
+class Command:
+    def __init__(self, command_argv):
+        # Initialise any req vars
+        self.cli_args = self._get_args(command_argv)
+
+    def _get_args(self, command_argv):
+        """
+        Get the command line arguments
+        :param command_argv:
+        :return:
+        """
+        return docopt(
+            dedent(self.__doc__),
+            argv=command_argv,
+            options_first=False
+        )
+
+
+class GeneratePackageSubCommand(Command):
+    """
+    Usage:
+    data-sharing-tool generate-package help
+    data-sharing-tool generate-package (--package-name=<package-name>)
+                                       (--lims-manifest-csv=<lims-manifest-csv>)
+                                       [--workflow-manifest-csv=<workflow-manifest-csv>]
+                                       [--exclude-primary-data]
+                                       [--wait]
+
+    Description:
+    Generate a package. Use the Athena mart tables to generate the lims and workflow manifest files;
+    more help can be found in the README.md file.
+
+    Options:
+        --package-name=<package-name>                    Name of the package
+        --lims-manifest-csv=<lims-manifest-csv>          The LIMS manifest CSV file
+        --workflow-manifest-csv=<workflow-manifest-csv>  The workflow manifest CSV file
+        --exclude-primary-data                           Exclude FASTQ files from the package
+                                                         Only applicable if --workflow-manifest-csv is provided
+        --wait                                           Wait for the package to be created before exiting
+
+    Environment variables:
+        AWS_PROFILE  The AWS profile used by boto3
+
+    Example:
+        data-sharing-tool generate-package --package-name 'latest-fastqs' --lims-manifest-csv /path/to/manifest.csv
+    """
+
+    def __init__(self, command_argv):
+        super().__init__(command_argv)
+
+        # Import args
+        self.package_name = self.cli_args['--package-name']
+        self.lims_manifest = self.cli_args['--lims-manifest-csv']
+        self.workflow_manifest = self.cli_args['--workflow-manifest-csv']
+        self.exclude_primary_data = self.cli_args['--exclude-primary-data']
+        self.wait = self.cli_args['--wait']
+
+        # Check args
+        if not Path(self.lims_manifest).is_file():
+            raise FileNotFoundError(f"LIMS manifest file {self.lims_manifest} does not exist")
+        if 
self.workflow_manifest is not None and not Path(self.workflow_manifest).is_file(): + raise FileNotFoundError(f"Workflow manifest file {self.workflow_manifest} does not exist") + + # Check package name doesn't have any spaces or special characters + if not self.package_name.replace("-", "").replace("_", "").isalnum(): + raise ValueError(f"Package name {self.package_name} contains invalid characters. Only alphanumeric characters are allowed.") + + # Generate the package + package_id = generate_package( + package_name=self.package_name, + lims_manifest=pd.read_csv(self.lims_manifest), + workflow_manifest=pd.read_csv(self.workflow_manifest) if self.workflow_manifest else None, + exclude_primary_data=self.exclude_primary_data + ) + + if self.wait: + print(f"Starting packaging with id '{package_id}'") + while True: + package_status = get_package(package_id)['status'] + if package_status == "SUCCEEDED": + print(f"Generated package: {json.dumps(package_id, indent=4)}") + break + if package_status == "FAILED": + print(f"Package generation failed, see sfn logs '{get_package(package_id)['stepsExecutionArn']}' for more information") + break + if get_package(package_id)['status'] == "RUNNING": + sleep(10) + + else: + print(f"Generating package: {json.dumps(package_id, indent=4)}") + + +class ListPackagesSubCommand(Command): + """ + Usage: + data-sharing-tool list-packages help + data-sharing-tool list-packages [--package-name=] + + Description: + List packages, you may specify the package name to filter by, this will match the + --package-name parameter used in the generate-package command + + Options: + --package-name= The package name to filter by + + Environment variables: + AWS_PROFILE The AWS profile used by boto3 + + Example: + data-sharing-tool list-packages --package-name 'latest-fastqs' + """ + + + def __init__(self, command_argv): + super().__init__(command_argv) + # Import args + self.package_name = self.cli_args['--package-name'] + + # Generate the package + print(json.dumps( + list_packages(package_name=self.package_name), + indent=4 + )) + + +class GetPackageStatusSubCommand(Command): + """ + Usage: + data-sharing-tool get-package-status help + data-sharing-tool get-package-status (--package-id=) + + Description: + Get the status of a package + + Options: + --package-id= The package id to get the status of + + Environment variables: + AWS_PROFILE The AWS profile used by boto3 + + Example: + data-sharing-tool get-package-status --package-id 'pkg.12345678910' + """ + + def __init__(self, command_argv): + super().__init__(command_argv) + # Import args + self.package_id = self.cli_args['--package-id'] + + # Generate the package + print(json.dumps( + get_package(package_id=self.package_id), + indent=4 + )) + + +class ViewPackageReportSubCommand(Command): + """ + Usage: + data-sharing-tool view-package-report help + data-sharing-tool view-package-report (--package-id=) + + Description: + View a package RMarkdown report. + One can set the BROWSER environment variable (to say 'firefox') to open the report in a browser. 
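+    If BROWSER is not set, the presigned report URL is simply printed to the console.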
+ + Options: + --package-id= View the package RMarkdown report + + Environment variables: + AWS_PROFILE The AWS profile used by boto3 + BROWSER Can be used by xdg-utils to automatically open the presigned url directly into a browser + + Example: + data-sharing-tool view-package-report --package-id 'pkg.12345678910' + """ + + def __init__(self, command_argv): + super().__init__(command_argv) + # Import args + self.package_id = self.cli_args['--package-id'] + + package_report_presigned_url = get_package_report(package_id=self.package_id).strip('"') + + # Check if the 'BROWSER' environment variable is set + if 'BROWSER' in environ: + call( + [environ['BROWSER'], package_report_presigned_url] + ) + + # Generate the package report presigned url + print(f"\"{package_report_presigned_url}\"") + + +class PushPackageSubCommand(Command): + """ + Usage: + data-sharing-tool push-package help + data-sharing-tool push-package (--package-id=) + (--share-location=) + [--wait] + + Description: + Push packages to a destination location. This can be either an S3 bucket with a prefix or an icav2 uri, in the + format of icav2:///path/to/prefix/ + + Options: + --package-id= The package id to push + --share-location= The location to push the package to + --wait Don't terminate the command until the push job is complete + + Environment variables: + AWS_PROFILE The AWS profile used by boto3 + + Example: + data-sharing-tool push-package --package-id 'pkg.12345678910' --share-location s3://bucket/path/to/dest/prefix/ + """ + + def __init__(self, command_argv): + super().__init__(command_argv) + # Import args + self.package_id = self.cli_args['--package-id'] + self.share_location = self.cli_args['--share-location'] + self.wait = self.cli_args['--wait'] + + # Generate the package report presigned url + push_job_id = push_package( + package_id=self.package_id, + location_uri=self.share_location + ) + + if self.wait: + print(f"Starting push job '{push_job_id}'") + while True: + package_status = get_push_job(push_job_id)['status'] + if package_status == "SUCCEEDED": + print(f"Generated push job: {json.dumps(push_job_id, indent=4)}") + break + if package_status == "FAILED": + print(f"Push to destination failed, see sfn logs '{get_push_job(push_job_id)['stepFunctionsExecutionArn']}' for more information") + break + if get_push_job(push_job_id)['status'] == "RUNNING": + sleep(10) + else: + print( + f"Pushing package '{self.package_id}' to '{self.share_location}' with push job id '{push_job_id}'" + ) + +class PresignPackageSubCommand(Command): + """ + Usage: + data-sharing-tool presign-package help + data-sharing-tool presign-package (--package-id=) + + Description: + Presign a package. This will generate a presigned url that can be used to download the package. + The presigned urls in the shell script will be valid for one week before expiring. 
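+    The script URL is printed to the console and, if the BROWSER environment variable is set, also opened in the browser.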
+
+    Options:
+        --package-id=<package-id>  The package id to presign
+
+    Environment variables:
+        AWS_PROFILE  The AWS profile used by boto3
+        BROWSER      Can be used by xdg-utils to automatically open the download script into a browser
+
+    Example:
+        data-sharing-tool presign-package --package-id 'pkg.12345678910'
+    """
+
+    def __init__(self, command_argv):
+        super().__init__(command_argv)
+        # Import args
+        self.package_id = self.cli_args['--package-id']
+
+        # Generate the presigned url for the package download script
+        package_script_presigned_url = presign_package(
+            package_id=self.package_id
+        ).strip('"')
+
+        # Check if the 'BROWSER' environment variable is set
+        if 'BROWSER' in environ:
+            call(
+                [environ['BROWSER'], package_script_presigned_url]
+            )
+
+        # Print the presigned url (quoted)
+        print(f"\"{package_script_presigned_url}\"")
+
+
+class ListPushJobsSubCommand(Command):
+    """
+    Usage:
+    data-sharing-tool list-push-jobs help
+    data-sharing-tool list-push-jobs [--package-id=<package-id>]
+
+    Description:
+    List Push Jobs.
+
+    One can filter by a package id; this will match the --package-id parameter used in the push-package command.
+
+    Options:
+        --package-id=<package-id>  The package id to filter by
+
+    Environment variables:
+        AWS_PROFILE  The AWS profile used by boto3
+        BROWSER      Can be used by xdg-utils to automatically open the download script into a browser
+
+    Example:
+        data-sharing-tool list-push-jobs --package-id 'pkg.12345678910'
+    """
+
+    def __init__(self, command_argv):
+        super().__init__(command_argv)
+        # Import args
+        self.package_id = self.cli_args['--package-id']
+
+        # List the push jobs
+        print(json.dumps(
+            list_push_jobs(package_id=self.package_id),
+            indent=4
+        ))
+
+
+class GetPushJobStatusSubCommand(Command):
+    """
+    Usage:
+    data-sharing-tool get-push-job-status help
+    data-sharing-tool get-push-job-status [--push-job-id=<push-job-id>]
+
+    Description:
+    Get Push Job status
+
+    Options:
+        --push-job-id=<push-job-id>  The push job id to get the status of
+
+    Environment variables:
+        AWS_PROFILE  The AWS profile used by boto3
+        BROWSER      Can be used by xdg-utils to automatically open the download script into a browser
+
+    Example:
+        data-sharing-tool get-push-job-status --push-job-id 'psh.12345678910'
+    """
+
+    def __init__(self, command_argv):
+        super().__init__(command_argv)
+        # Import args
+        self.push_job_id = self.cli_args['--push-job-id']
+
+        # Print the push job status
+        print(json.dumps(
+            get_push_job(push_job_id=self.push_job_id),
+            indent=4
+        ))
+
+
+# Subcommand functions
+def _dispatch():
+    # This variable comprises both the subcommand AND the args
+    global_args: dict = docopt(dedent(__doc__), sys.argv[1:], options_first=True)
+
+    command_argv = [global_args["<command>"]] + global_args["<args>"]
+
+    cmd = global_args['<command>']
+
+    # Yes, this is just a massive if-else statement
+    if cmd == "help":
+        # We have a separate help function for each subcommand
+        print(dedent(__doc__))
+        sys.exit(0)
+
+    # Configuration commands
+    elif cmd == "generate-package":
+        subcommand = GeneratePackageSubCommand
+    elif cmd == "list-packages":
+        subcommand = ListPackagesSubCommand
+    elif cmd == "get-package-status":
+        subcommand = GetPackageStatusSubCommand
+    elif cmd == "view-package-report":
+        subcommand = ViewPackageReportSubCommand
+    elif cmd == "push-package":
+        subcommand = PushPackageSubCommand
+    elif cmd == "presign-package":
+        subcommand = PresignPackageSubCommand
+    elif cmd == "list-push-jobs":
+        subcommand = ListPushJobsSubCommand
+    elif cmd == "get-push-job-status":
+        subcommand = GetPushJobStatusSubCommand
+
+    # NotImplemented Error
+    else:
+        print(dedent(__doc__))
+        
print(f"Could not find cmd \"{cmd}\". Please refer to usage above") + sys.exit(1) + + # Check AWS_PROFILE env var + if "AWS_PROFILE" not in environ: + print("AWS_PROFILE environment variable not set. Please set it to the profile you want to use.") + sys.exit(1) + + # Assume the role in AWS_PROFILE + boto3.setup_default_session( + profile_name=environ['AWS_PROFILE'], + ) + + # Check if the profile is valid + account_id = boto3.client('sts').get_caller_identity()['Account'] + if not account_id == AWS_PRODUCTION_ACCOUNT_ID: + print(f"Warning, you are not using the production account. You are using {account_id}") + + # Initialise / call the subcommand + subcommand(command_argv) + + +def main(): + # If only the script name is provided, show help + if len(sys.argv) == 1: + sys.argv.append('help') + try: + _dispatch() + except KeyboardInterrupt: + pass + + +if __name__ == "__main__": + main() diff --git a/lib/workload/stateless/stacks/data-sharing-manager/scripts/install.sh b/lib/workload/stateless/stacks/data-sharing-manager/scripts/install.sh new file mode 100644 index 000000000..9a7bc5a6e --- /dev/null +++ b/lib/workload/stateless/stacks/data-sharing-manager/scripts/install.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash + +set -euo pipefail + +: ' +Quick shell script to perform the following tasks + +1. Check if "uv" is installed, if not, install it +2. Create a virtual environment +3. Install the required dependencies into the virtual environment +4. Add the python script into the virtual env bin directory +5. Create an alias for the script to be placed in the users .rc file +' + +# Globals +DATA_SHARING_INSTALL_VENV="${HOME}/.local/data-sharing-cli-venv" + +# Get this directory +THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" + +# Check if "uv" is installed +if ! command -v uv &> /dev/null; then + echo "uv could not be found, installing..." + curl -LsSf https://astral.sh/uv/install.sh | sh +else + echo "uv is already installed" +fi + +# Create a virtual environment +uv venv --python '==3.12' --allow-existing "${DATA_SHARING_INSTALL_VENV}" + +# Activate the virtual environment +source "${DATA_SHARING_INSTALL_VENV}/bin/activate" + +# Install the required dependencies into the virtual environment +# Pretty 'meh' about versions here, but I guess we can always update them later +uv pip install --quiet \ + pandas \ + pandera \ + docopt \ + requests \ + boto3 + +# Copy the python script into the virtual environment bin directory +cp "${THIS_DIR}/data-sharing-tool.py" "${DATA_SHARING_INSTALL_VENV}/bin/data-sharing-tool" +chmod +x "${DATA_SHARING_INSTALL_VENV}/bin/data-sharing-tool" + +# Get user's shell +SHELL="$(basename "${SHELL}")" + +# Create an alias for the script to be placed in the users .rc file +if ! grep -q "alias data-sharing-tool" "${HOME}/.${SHELL}rc"; then + echo "alias data-sharing-tool='${DATA_SHARING_INSTALL_VENV}/bin/python3 ${DATA_SHARING_INSTALL_VENV}/bin/data-sharing-tool'" >> "${HOME}/.${SHELL}rc" + echo "Alias 'data-sharing-tool' added to .${SHELL}rc, please restart your terminal or run 'source ~/.${SHELL}rc' to use the alias." +else + echo "Alias already exists in .${SHELL}rc" +fi \ No newline at end of file