Commit f935cbf

Document remote file staging (#5523)
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Chris Hakkaart <chris.hakkaart@seqera.io>
Co-authored-by: Ben Sherman <bentshermann@gmail.com>
1 parent 1fd5dc5 commit f935cbf

1 file changed: +20 -6 lines changed

docs/working-with-files.md

Lines changed: 20 additions & 6 deletions
@@ -228,29 +228,43 @@ In general, you should not need to manually copy files, because Nextflow will au
 
 ## Remote files
 
-Nextflow can work with many kinds of remote files and objects using the same interface as for local files. The following protocols are supported:
+Nextflow works with many types of remote files and objects using the same interface as for local files. The following protocols are supported:
 
-- HTTP(S) / FTP (`http://`, `https://`, `ftp://`)
+- HTTP(S)/FTP (`http://`, `https://`, `ftp://`)
 - Amazon S3 (`s3://`)
 - Azure Blob Storage (`az://`)
 - Google Cloud Storage (`gs://`)
 
-To reference a remote file, simple specify the URL when opening the file:
+To reference a remote file, simply specify the URL when opening the file:
 
 ```nextflow
 pdb = file('http://files.rcsb.org/header/5FID.pdb')
 ```
 
-You can then access it as a local file as described previously:
+It can then be used in the same way as a local file:
 
 ```nextflow
 println pdb.text
 ```
 
 :::{note}
-Not all operations are supported for all protocols. In particular, writing and directory listing are not supported for HTTP(S) and FTP paths.
+Not all operations are supported for all protocols. For example, writing and directory listing is not supported for HTTP(S) and FTP paths.
 :::
 
 :::{note}
-Additional configuration may be required to work with cloud object storage (e.g. to authenticate with a private bucket). Refer to the respective page for each cloud storage provider for more information.
+Additional configuration may be necessary for cloud object storage, such as authenticating with a private bucket. See the documentation for each cloud storage provider for further details.
+:::
+
+### Remote file staging
+
+When a process input file resides on a different file system than the work directory, Nextflow copies the file into the work directory using an appropriate Java SDK.
+
+Remote files are staged in a subdirectory of the work directory with the form `stage-<session-id>/<hash>/<filename>`, where `<hash>` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.
+
+:::{note}
+Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all of these file transfers.
+
+To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. Files should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.
+
+Alternatively, use {ref}`fusion-page` with the work directory set to object storage. In this case, tasks can access remote files directly without any prior staging, eliminating the bottleneck.
 :::
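
The custom download process suggested in the staging note could look roughly like the sketch below. It is only an illustration and is not part of the commit: the process name `DOWNLOAD`, the `download/` output directory, and the use of `wget` (which is assumed to be available in the task environment) are all hypothetical choices. The URL is the one used earlier in the changed document. Because the URL arrives as a `val` input, Nextflow treats it as a plain string and each task fetches its own file.

```nextflow
// Hypothetical process: downloads one remote file per task.
process DOWNLOAD {
    input:
    val url                 // a plain string, so Nextflow does not stage it

    output:
    path 'download/*'       // the fetched file, under its remote file name

    script:
    """
    mkdir -p download
    # Assumes wget exists in the task environment; -P sets the target directory
    wget -q -P download '${url}'
    """
}

workflow {
    // One task per URL; tasks can run in parallel on the executor
    urls = Channel.of('http://files.rcsb.org/header/5FID.pdb')
    DOWNLOAD(urls)
    DOWNLOAD.out.view()
}
```

Downstream processes can then take the `DOWNLOAD` output as an ordinary `path` input, since the files already live in a task work directory.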
