Commit 8232b6c

Added data-sharing manager and toolkit
Fix readme
1 parent db7a9ab commit 8232b6c

File tree

3 files changed: +972, -56 lines changed


lib/workload/stateless/stacks/data-sharing-manager/Readme.md

Lines changed: 151 additions & 56 deletions

## Description

The data sharing manager is divided into three main components:

1. Package generation
2. Package validation
3. Package sharing

For all three parts, we recommend using the data-sharing-tool provided.

### Installing the Data Sharing Tool

To generate a package, we recommend installing the data-sharing-tool by running the following command (from this directory).

Note that the command is prefaced with 'bash':

```bash
bash scripts/install.sh
```
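
After installing, we can sanity-check that the CLI is available. This is a minimal sketch: it assumes the install script places `data-sharing-tool` on your `PATH` and that the tool exposes a conventional `--help` flag (not shown elsewhere in this README).

```bash
# Confirm the CLI is on the PATH and responds
# ('--help' is assumed here, as is conventional for CLIs)
data-sharing-tool --help
```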

## Package Generation

> This component expects the user to have some familiarity with AWS Athena.

We use the 'mart' tables to generate the appropriate manifests for package generation.

You may use the UI to generate the manifests, or you can use the command line interface as shown below.

In the example below, we collect the libraries that are associated with the project 'CUP' and whose sequencing run date is greater than or equal to '2025-04-01'.

We require only the lims-manifest when collecting fastq data.

The workflow manifest (along with the lims-manifest) is required when collecting secondary analysis data.

```bash
WORK_GROUP="orcahouse"
DATASOURCE_NAME="orcavault"
DATABASE_NAME="mart"

# Initialise the query
query_execution_id="$( \
  aws athena start-query-execution \
    --no-cli-pager \
    --query-string " \
      SELECT *
      FROM lims
      WHERE
        project_id = 'CUP' AND
        sequencing_run_date >= CAST('2025-04-01' AS DATE)
    " \
    --work-group "${WORK_GROUP}" \
    --query-execution-context "Database=${DATABASE_NAME}, Catalog=${DATASOURCE_NAME}" \
    --output text \
    --query 'QueryExecutionId' \
)"

# Wait for the query to complete
while true; do
  query_state="$( \
    aws athena get-query-execution \
      --no-cli-pager \
      --output text \
      --query-execution-id "${query_execution_id}" \
      --query 'QueryExecution.Status.State' \
  )"

  if [[ "${query_state}" == "SUCCEEDED" ]]; then
    break
  elif [[ "${query_state}" == "FAILED" || "${query_state}" == "CANCELLED" ]]; then
    echo "Query failed or was cancelled" 1>&2
    exit 1
  fi

  sleep 5
done

# Collect the query results location
query_results_uri="$( \
  aws athena get-query-execution \
    --no-cli-pager \
    --output text \
    --query-execution-id "${query_execution_id}" \
    --query 'QueryExecution.ResultConfiguration.OutputLocation' \
)"

# Download the results
aws s3 cp "${query_results_uri}" ./lims_manifest.csv
```
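
Before generating the package, it is worth a quick look at the manifest we just downloaded. A minimal sketch (the exact columns depend on the lims table):

```bash
# Peek at the header row and the first few records
head -n 5 lims_manifest.csv

# Count the number of records collected (excluding the header row)
tail -n +2 lims_manifest.csv | wc -l
```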

Using the lims manifest, we can now generate the package.

By using the `--wait` parameter, the CLI will only return once the package has been completed.

This may take around 5 minutes to complete, depending on the size of the package.

```bash
data-sharing-tool generate-package \
  --lims-manifest-csv lims_manifest.csv \
  --wait
```

This will generate a package and print the package ID to the console like so:

```bash
Generating package 'pkg.123456789'...
```

For the workflow manifest, we can use the same query as above, but we will need to change the final table name to 'workflow'.

An example of the SQL might be as follows:

```sql
/*
Get the libraries associated with the project 'CUP' whose sequencing run date is greater than or equal to '2025-04-01'.
*/
WITH libraries AS (
    SELECT library_id
    FROM lims
    WHERE
        project_id = 'CUP' AND
        sequencing_run_date >= CAST('2025-04-01' AS DATE)
)
/*
Select the matching tumor-normal (TN) workflows for the libraries above.
*/
SELECT *
FROM workflow
WHERE
    workflow_name = 'tumor-normal' AND
    library_id IN (SELECT library_id FROM libraries)
```
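
To run this query from the command line, we can reuse the Athena pattern from the lims manifest example. The sketch below assumes the SQL above has been saved to a local file named `workflow_query.sql`; the `workflow_manifest.csv` output name is likewise illustrative.

```bash
# Submit the workflow query (SQL saved locally as 'workflow_query.sql')
query_execution_id="$( \
  aws athena start-query-execution \
    --no-cli-pager \
    --query-string "$(cat workflow_query.sql)" \
    --work-group "${WORK_GROUP}" \
    --query-execution-context "Database=${DATABASE_NAME}, Catalog=${DATASOURCE_NAME}" \
    --output text \
    --query 'QueryExecutionId' \
)"

# Wait for the query to complete (same loop as the lims example)
while true; do
  query_state="$( \
    aws athena get-query-execution \
      --no-cli-pager \
      --output text \
      --query-execution-id "${query_execution_id}" \
      --query 'QueryExecution.Status.State' \
  )"
  if [[ "${query_state}" == "SUCCEEDED" ]]; then
    break
  elif [[ "${query_state}" == "FAILED" || "${query_state}" == "CANCELLED" ]]; then
    echo "Query failed or was cancelled" 1>&2
    exit 1
  fi
  sleep 5
done

# Download the results alongside the lims manifest
query_results_uri="$( \
  aws athena get-query-execution \
    --no-cli-pager \
    --output text \
    --query-execution-id "${query_execution_id}" \
    --query 'QueryExecution.ResultConfiguration.OutputLocation' \
)"
aws s3 cp "${query_results_uri}" ./workflow_manifest.csv
```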

## Package Validation

Once the package has completed generating, we can validate the package using the following command:

> By setting the BROWSER environment variable, the package report will be automatically opened up in your browser!

```bash
data-sharing-tool view-package-report \
  --package-id pkg.12345678910
```
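
For example (the browser value below is illustrative; any command that opens a URL will do):

```bash
# Open the package report directly in Firefox via the BROWSER env var
BROWSER=firefox data-sharing-tool view-package-report \
  --package-id pkg.12345678910
```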

Look through the metadata, fastq and secondary analysis tabs to ensure that the package is correct.

## Package Sharing

### Pushing Packages

We can use the following command to push the package to a destination location. This will generate a push job id.

Like the package generation, we can use the `--wait` parameter to wait for the job to complete.

```bash
data-sharing-tool push-package \
  --package-id pkg.12345678910 \
  --share-location s3://bucket/path-to-prefix/
```
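
For example, to block until the push job has finished, add the `--wait` flag mentioned above:

```bash
# Push the package and wait for the push job to complete
data-sharing-tool push-package \
  --package-id pkg.12345678910 \
  --share-location s3://bucket/path-to-prefix/ \
  --wait
```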

### Presigning Packages

Not all data receivers will have an S3 bucket or ICAV2 project for us to dump data into.

Therefore, we also support the old-school presigned URL method.

We can use the following command to generate presigned urls in a script for the package:

```bash
data-sharing-tool presign-package \
  --package-id pkg.12345678910
```

This will return a presigned url for a shell script that can be used to download the package.
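
On the receiving end, the script can be fetched and run with standard tooling. A minimal sketch, assuming the returned presigned URL has been copied into the `PRESIGNED_URL` variable; the URL value and the `download-package.sh` file name are illustrative:

```bash
# Placeholder for the presigned url returned by 'presign-package'
PRESIGNED_URL="https://example-bucket.s3.amazonaws.com/download-package.sh?X-Amz-Signature=..."

# Fetch the download script, inspect it, then execute it
curl --fail --silent --show-error --output download-package.sh "${PRESIGNED_URL}"
less download-package.sh   # sanity-check before executing
bash download-package.sh
```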
