Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download files via htsget #371

Merged
merged 25 commits into from
Jul 1, 2024
Merged

Download files via htsget #371

merged 25 commits into from
Jul 1, 2024

Conversation

pahatz
Copy link
Contributor

@pahatz pahatz commented Apr 29, 2024

Related issue(s) and PR(s)
This PR closes #280.

Description
This adds a component called htsget to the sda-cli that allows to download (partial) files through the htsget server.

How to test
To test this, you need to run the docker compose of genomicDataInfra/starter-kit-storage-and-interfaces with:
docker compose -f docker-compose-demo.yml up -d
Then get the token as listed in that repo and insert it into the testing/s3cmd.conf file of the sda-cli repo with the name of access_token=<token>
Last step for the dev environment is to run the docker compose from genomicDataInfra/starter-kit-htsget with docker compose up -d
Then you can run the htsget sda-cli command with something like:
./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename htsnexus_test_NA12878 -reference 11 -host http://localhost:8088 -pubkey testing/c4gh.pub.pem -output ./result.c4gh --force-overwrite, where pubkey is your own c4gh key pair provided to be used.
This will download the file result.c4gh
You can then decrypt the file with:
./sda-cli decrypt -key c4gh.sec.pem result.c4gh
To check that it worked, check the file contents with samtools:
samtools view result

@MalinAhlberg
Copy link
Member

Nice job, @pahatz ! Works well for me!
I added some comments. Also, it would be nice with an error message when something goes wrong, instead of just an empty result file.

@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 52.46914% with 77 lines in your changes missing coverage. Please review.

Project coverage is 43.15%. Comparing base (360b5d8) to head (e514cce).

Files Patch % Lines
htsget/htsget.go 53.45% 57 Missing and 17 partials ⚠️
main.go 0.00% 2 Missing ⚠️
login/login.go 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #371      +/-   ##
==========================================
+ Coverage   42.25%   43.15%   +0.89%     
==========================================
  Files          12       13       +1     
  Lines        1678     1840     +162     
==========================================
+ Hits          709      794      +85     
- Misses        878      938      +60     
- Partials       91      108      +17     
Flag Coverage Δ
unittests 43.15% <52.46%> (+0.89%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@nanjiangshu nanjiangshu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works very well when I tested!
Apart from the comments I left in the code. I have a few general comments.

  1. I think it would be better to show a message when the file is successfully downloaded.

  2. When the output file exists already, currently the command just overwrites the existing file without any notification. It would be better to show if the file already exist or not, and probably also add an option for force overwriting.

  3. For the example given, when the filename does not have the .bam extension, the downloaded file also does not have a .bam extension. However, the file in s3 does have the .bam extension. I think it would be great to add a proper file extension if it is recognizable.

@aaperis
Copy link
Contributor

aaperis commented May 3, 2024

Is it ready for review?

@pahatz pahatz marked this pull request as ready for review May 6, 2024 09:43
@pahatz pahatz force-pushed the feature/htsget branch from 44d468a to a594a41 Compare May 7, 2024 14:32
MalinAhlberg
MalinAhlberg previously approved these changes May 17, 2024
Copy link
Member

@MalinAhlberg MalinAhlberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works well!
I added one suggestion to the Readme, but nothing blocking an approval.

@aaperis
Copy link
Contributor

aaperis commented May 22, 2024

fyi, once you rebase tests should work as expected (the waiting status is due to the update to go 1.22)

@aaperis
Copy link
Contributor

aaperis commented May 22, 2024

I followed the instructions on the description (also changed the name of the access_key to match the one on the token in the s3cmd) but I get the following error:

./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename htsnexus_test_NA12878 -reference 11 -htsgethost ht
tp://localhost:8088 -key c4gh.pub.pem
Error: failed to get the file, status code: 404

with logs:

oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type ","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 200 |   20.206094ms |      172.19.0.1 | GET      "/s3/DATASET0001/htsnexus_test_NA12878.bam.bai"
oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type encrypted","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed response from the reencrypt service, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 200 |    8.236127ms |      172.19.0.1 | HEAD     "/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh"
oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type encrypted","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"using range 0-131252","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"using key for reencryption LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"because user-agnt was htsget-search/0.6.6","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed response from the reencrypt service, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed to reencrypt the file header, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 500 |    8.372643ms |      172.19.0.1 | GET      "/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh"

Something strange is going on with re-encrypt as it seems?

@aaperis
Copy link
Contributor

aaperis commented May 22, 2024

Also, trying something stupid like neglecting to input a filename for the flag -filename:

./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename  -reference 11 -htsgethost http://localhost:8088 -key c4gh.pub.pem

produces the same error as above:

Error: failed to get the file, status code: 404

where logs download look "normal"

download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KTVJmd0JoeVNXcHpteFhVdVhLMGs5NTVMMG4za3g3K2xyRHRaMkZuMUIxND0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T19:31:11Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | [GIN] 2024/05/22 - 19:31:11 | 404 |   16.807661ms |      172.23.0.1 | GET      "/s3/DATASET0001/-reference.bam.bai"

however the filename requested is -reference, which is the next flag on the command.

@pahatz
Copy link
Contributor Author

pahatz commented May 23, 2024

Also, trying something stupid like neglecting to input a filename for the flag -filename:

./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename  -reference 11 -htsgethost http://localhost:8088 -key c4gh.pub.pem

produces the same error as above:

Error: failed to get the file, status code: 404

where logs download look "normal"

download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KTVJmd0JoeVNXcHpteFhVdVhLMGs5NTVMMG4za3g3K2xyRHRaMkZuMUIxND0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T19:31:11Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | {"level":"error","msg":"sql: no rows in result set","time":"2024-05-22T19:31:11Z"}
download     | [GIN] 2024/05/22 - 19:31:11 | 404 |   16.807661ms |      172.23.0.1 | GET      "/s3/DATASET0001/-reference.bam.bai"

however the filename requested is -reference, which is the next flag on the command.

This is not handled properly in general, we would need the following change in helpers/helpers.go:

@@ -123,7 +123,7 @@ func ParseS3ErrorResponse(respBody io.Reader) (string, error) {
 
 // Removes all positional arguments from args, and returns them.
 // This function assumes that all flags have exactly one value.
-func getPositional(args []string) ([]string, []string) {
+func getPositional(args []string) ([]string, []string, error) {
        argList := []string{"-r", "--r", "--force-overwrite", "-force-overwrite", "--force-unencrypted", "-force-unencrypted"}
        i := 1
        var positional []string
@@ -132,6 +132,9 @@ func getPositional(args []string) ([]string, []string) {
                case slices.Contains(argList, args[i]):
                        // if the current args is a boolean flag, skip it
                        i++
+               case args[i][0] == '-' && args[i+1][0] == '-':
+                       // bad input, exit
+                       return nil, nil, errors.New("flag requires a value")
                case args[i][0] == '-':
                        // if the current arg is a flag, skip the flag and its value
                        i += 2
@@ -143,15 +146,19 @@ func getPositional(args []string) ([]string, []string) {
                }
        }
 
-       return positional, args
+       return positional, args, nil
 }
 
 func ParseArgs(args []string, argFlags *flag.FlagSet) error {
        var pos []string
-       pos, args = getPositional(args)
+       var err error
+       pos, args, err = getPositional(args)
+       if err != nil {
+               return err
+       }
        // append positional args back at the end of args
        args = append(args, pos...)
-       err := argFlags.Parse(args[1:])
+       err = argFlags.Parse(args[1:])
 
        return err


edit:

On second thought, the above fix is wrong. Filenames could start with a dash character. The other solution I had in mind is to:

  • first remove the "special" flags ("-r", "--r", "--force-overwrite", "-force-overwrite", "--force-unencrypted", "-force-unencrypted")
  • check we have an even number of string in the leftover arguments.
  • continue

I could not think of some relatively small solution that could ensure this handled better. I'll amend the last commit for now. It might be a good idea to use a library to handle the parsing of the arguments in the future.

@pahatz
Copy link
Contributor Author

pahatz commented May 23, 2024

I followed the instructions on the description (also changed the name of the access_key to match the one on the token in the s3cmd) but I get the following error:

./sda-cli htsget -config testing/s3cmd.conf -dataset DATASET0001 -filename htsnexus_test_NA12878 -reference 11 -htsgethost ht
tp://localhost:8088 -key c4gh.pub.pem
Error: failed to get the file, status code: 404

with logs:

oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type ","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 200 |   20.206094ms |      172.19.0.1 | GET      "/s3/DATASET0001/htsnexus_test_NA12878.bam.bai"
oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type encrypted","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed response from the reencrypt service, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 200 |    8.236127ms |      172.19.0.1 | HEAD     "/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh"
oidc         | ga4gh token requested, trusted
download     | {"level":"warning","msg":"client key LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"Server-Public-Key ","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"type encrypted","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"using range 0-131252","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"using key for reencryption LS0tLS1CRUdJTiBDUllQVDRHSCBQVUJMSUMgS0VZLS0tLS0KYW9LMXhQWmlVNmJwM3kxSXM0bnN1SmJhY3lmaml6SURzVzIxNTNQdWwxbz0KLS0tLS1FTkQgQ1JZUFQ0R0ggUFVCTElDIEtFWS0tLS0tCg==","time":"2024-05-22T17:29:41Z"}
download     | {"level":"warning","msg":"because user-agnt was htsget-search/0.6.6","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed response from the reencrypt service, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | {"level":"error","msg":"Failed to reencrypt the file header, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded","time":"2024-05-22T17:29:41Z"}
download     | [GIN] 2024/05/22 - 17:29:41 | 500 |    8.372643ms |      172.19.0.1 | GET      "/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh"

Something strange is going on with re-encrypt as it seems?

Maybe some leftover volume/container?

docker compose -f docker-compose-demo.yml down -v --remove-orphans

pahatz and others added 5 commits May 28, 2024 12:38
- changed the way we input arguments
- added force-overwrite check and option
- added message for successful operation
Co-authored-by: Malin Klang <ahlberg.malin@gmail.com>
Copy link
Contributor

@jbygdell jbygdell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a tendency to prefix everything inside the package with the package name for the variables.

Some of the error messages are not what the user would expect but what you as a developer sees in the current context.

MalinAhlberg
MalinAhlberg previously approved these changes May 29, 2024
Copy link
Member

@MalinAhlberg MalinAhlberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works well for me!

Co-authored-by: Joakim Bygdell <joakim.bygdell@nbis.se>
Copy link
Contributor

@nanjiangshu nanjiangshu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works great when I tested and all my previous comments are addressed, good job @pahatz!
I left a minor comments about an probably unexpected indentation in the helping message.
Apart from that, I can approve it when the failing unit tests are solved.

Copy link
Member

@MalinAhlberg MalinAhlberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pahatz pahatz requested a review from nanjiangshu July 1, 2024 10:44
Copy link
Contributor

@nanjiangshu nanjiangshu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pahatz pahatz enabled auto-merge July 1, 2024 11:42
@pahatz pahatz dismissed jbygdell’s stale review July 1, 2024 11:43

added changes

@pahatz pahatz added this pull request to the merge queue Jul 1, 2024
Merged via the queue into main with commit 10bd4b7 Jul 1, 2024
6 checks passed
@pahatz pahatz deleted the feature/htsget branch July 1, 2024 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Download files via htsget with sda-cli
8 participants