## **Workflow**
### 1. Copy Data

If you are running the pipeline on EBI infrastructure, it will copy the data from the original log file location to the path you specify.
Currently, the original log files are stored in a location that only the `datamover` queue can read. So, as the first step, the pipeline copies (`rsync`) the log files to the location you specified, which can be accessed by the `standard` queue.
Once this job completes, it automatically launches the next, dependent job, which processes the log files and performs the statistical analysis.

!!! note "Running for the first time"

    It can take 2-3 hours to copy the log files on the first run; subsequent runs take only a few minutes.
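The incremental copy step can be sketched in Python. This is only an illustration: the real pipeline shells out to `rsync` from the `datamover` queue, while here `shutil` stands in for it, and all paths and file names are invented for the example.

```python
# Sketch of the copy step: mirror log files into a staging area the
# `standard` queue can read. The real pipeline uses rsync; shutil is
# used here purely for illustration. Paths are hypothetical.
import shutil
from pathlib import Path

src = Path("source_logs")   # hypothetical datamover-only location
dst = Path("staged_logs")   # hypothetical user-specified location

# Set up a fake source log so the sketch is self-contained.
src.mkdir(exist_ok=True)
(src / "access.log").write_text("example log line\n")

dst.mkdir(exist_ok=True)
for log in src.glob("*.log"):
    target = dst / log.name
    # Skip files already staged with the same size (rsync-like behaviour).
    if not target.exists() or target.stat().st_size != log.stat().st_size:
        shutil.copy2(log, target)
```

On a second run, unchanged files are skipped, which is why subsequent runs finish much faster than the first.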
### 2. Process Log Files

This step collects the names of the log files, processes them in parallel, and applies a number of filters to exclude unwanted data.
The processed log files are stored in Parquet, a columnar storage format optimized for reading and writing large datasets.

 
### 3. Produce Statistics Report

Using the Dask framework, the Parquet files are queried and the statistics are generated.
This step produces the statistics report in HTML format and stores it in the location you specified.

 
Detailed workflow steps can be found in the [workflow documentation](../../misc/workflow).