
Commit 443c9a2

alexiswl and Copilot authored
Added data-sharing manager and toolkit (#983)
* Added data-sharing manager and toolkit
* Fix readme
* Add explanation on bash prefix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 1337ff0 commit 443c9a2

File tree

3 files changed

+973 -56 lines changed


lib/workload/stateless/stacks/data-sharing-manager/Readme.md

Lines changed: 152 additions & 56 deletions
## Description
The data sharing manager is divided into three main components:

1. Package generation
2. Package validation
3. Package sharing

For all three parts, we recommend using the data-sharing-tool provided.
### Installing the Data Sharing Tool

In order to generate a package, we recommend installing the data-sharing-tool by running the following command (from this directory).

Please preface the command with 'bash' because the `scripts/install.sh` script relies on `bash`-specific features.
This ensures compatibility and prevents errors if your default shell is not `bash`.

```bash
bash scripts/install.sh
```
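Once installed, a quick sanity check is to ask the tool for its usage text. This assumes the install script has placed `data-sharing-tool` on your `PATH` and that the CLI follows the usual `--help` convention:

```bash
# Sanity check after installation (assumes data-sharing-tool is on PATH
# and supports the conventional --help flag).
data-sharing-tool --help
```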
## Package Generation

> This component expects the user to have some familiarity with AWS Athena.

We use the 'mart' tables to generate the appropriate manifests for package generation.

You may use the UI to generate the manifests, or you can use the command line interface as shown below.

In the example below, we collect the libraries that are associated with the project 'CUP' and whose
sequencing run date is greater than or equal to '2025-04-01'.

We require only the lims-manifest when collecting fastq data.

The workflow manifest (along with the lims-manifest) is required when collecting secondary analysis data.
```bash
WORK_GROUP="orcahouse"
DATASOURCE_NAME="orcavault"
DATABASE_NAME="mart"

# Initialise the query
# (--output text returns the raw execution id without JSON quoting)
query_execution_id="$( \
  aws athena start-query-execution \
    --no-cli-pager \
    --query-string " \
      SELECT *
      FROM lims
      WHERE
        project_id = 'CUP' AND
        sequencing_run_date >= CAST('2025-04-01' AS DATE)
    " \
    --work-group "${WORK_GROUP}" \
    --query-execution-context "Database=${DATABASE_NAME},Catalog=${DATASOURCE_NAME}" \
    --output text \
    --query 'QueryExecutionId' \
)"

# Wait for the query to complete
while true; do
  query_state="$( \
    aws athena get-query-execution \
      --no-cli-pager \
      --output text \
      --query-execution-id "${query_execution_id}" \
      --query 'QueryExecution.Status.State' \
  )"

  if [[ "${query_state}" == "SUCCEEDED" ]]; then
    break
  elif [[ "${query_state}" == "FAILED" || "${query_state}" == "CANCELLED" ]]; then
    echo "Query failed or was cancelled"
    exit 1
  fi

  sleep 5
done

# Collect the query results location
query_results_uri="$( \
  aws athena get-query-execution \
    --no-cli-pager \
    --output text \
    --query-execution-id "${query_execution_id}" \
    --query 'QueryExecution.ResultConfiguration.OutputLocation' \
)"

# Download the results
aws s3 cp "${query_results_uri}" ./lims_manifest.csv
```
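Before generating a package, it can be worth eyeballing the downloaded manifest. A minimal check with standard tools:

```bash
# Peek at the header row and first few records of the downloaded manifest
head -n 5 lims_manifest.csv

# Count the number of records (excluding the header row)
tail -n +2 lims_manifest.csv | wc -l
```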

Using the lims manifest we can now generate the package.

By using the `--wait` parameter, the CLI will only return once the package has been completed.

This may take around 5 mins to complete depending on the size of the package.

```bash
data-sharing-tool generate-package \
  --lims-manifest-csv lims_manifest.csv \
  --wait
```

This will generate a package and print the package to the console like so:

```bash
Generating package 'pkg.123456789'...
```

For the workflow manifest, we can use the same query as above, but we will need to change the final table name to 'workflow'.

An example of the SQL might be as follows:

```sql
/*
Get the libraries associated with the project 'CUP' whose sequencing run date is greater than or equal to '2025-04-01'.
*/
WITH libraries AS (
    SELECT library_id
    FROM lims
    WHERE
        project_id = 'CUP' AND
        sequencing_run_date >= CAST('2025-04-01' AS DATE)
)
/*
Select matching TN workflows for the libraries above
*/
SELECT *
FROM workflow
WHERE
    workflow_name = 'tumor-normal' AND
    library_id IN (SELECT library_id FROM libraries)
```
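The workflow query is run through Athena in exactly the same way as the lims query above. A minimal sketch, assuming the same `WORK_GROUP`, `DATASOURCE_NAME` and `DATABASE_NAME` variables are still set, the SQL above has been saved to `workflow_query.sql`, and the same polling loop is reused:

```bash
# Start the workflow query (reusing the Athena pattern shown for the lims manifest)
query_execution_id="$( \
  aws athena start-query-execution \
    --no-cli-pager \
    --query-string file://workflow_query.sql \
    --work-group "${WORK_GROUP}" \
    --query-execution-context "Database=${DATABASE_NAME},Catalog=${DATASOURCE_NAME}" \
    --output text \
    --query 'QueryExecutionId' \
)"

# Poll aws athena get-query-execution until the state is SUCCEEDED (as above),
# then fetch the results location and download the workflow manifest
query_results_uri="$( \
  aws athena get-query-execution \
    --no-cli-pager \
    --output text \
    --query-execution-id "${query_execution_id}" \
    --query 'QueryExecution.ResultConfiguration.OutputLocation' \
)"

aws s3 cp "${query_results_uri}" ./workflow_manifest.csv
```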

## Package Validation

Once the package has completed generating, we can validate the package using the following command:

> By using the BROWSER env var, the package report will be automatically opened up in our browser!

```bash
data-sharing-tool view-package-report \
  --package-id pkg.12345678910
```
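For example, with a one-off environment override (assuming the tool honours the common `BROWSER` convention and that `firefox` is installed; any browser command works here):

```bash
# BROWSER is assumed to select the browser command used to open the report;
# firefox is just an example and can be swapped for any installed browser.
BROWSER=firefox data-sharing-tool view-package-report \
  --package-id pkg.12345678910
```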

Look through the metadata, fastq and secondary analysis tabs to ensure that the package is correct.


## Package Sharing

### Pushing Packages

We can use the following command to push the package to a destination location. This will generate a push job id.

Like the package generation, we can use the `--wait` parameter to wait for the job to complete.

```bash
data-sharing-tool push-package \
  --package-id pkg.12345678910 \
  --share-location s3://bucket/path-to-prefix/
```
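As noted above, the same `--wait` flag used for package generation can be appended so the command only returns once the push job has finished:

```bash
data-sharing-tool push-package \
  --package-id pkg.12345678910 \
  --share-location s3://bucket/path-to-prefix/ \
  --wait
```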

### Presigning packages

Not all data receivers will have an S3 bucket or ICAV2 project for us to dump data in.

Therefore, we also support the old-school presigned url method.

We can use the following command to generate presigned urls in a script for the package:

```bash
data-sharing-tool presign-package \
  --package-id pkg.12345678910
```

This will return a presigned url for a shell script that can be used to download the package.
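On the receiving end, a typical (hypothetical) workflow is to download that shell script from the presigned url and run it; the url below is a placeholder for the one produced by `presign-package`:

```bash
# Hypothetical recipient-side usage; replace the placeholder with the presigned url you were sent.
PRESIGNED_SCRIPT_URL="https://example-bucket.s3.amazonaws.com/path/to/download-script.sh?X-Amz-Signature=..."

# Download the script, then run it with bash
curl -o download-package.sh "${PRESIGNED_SCRIPT_URL}"
bash download-package.sh
```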
