The pipeline is built with Nextflow, a workflow tool for running tasks across multiple compute infrastructures in a portable manner. It collects file download statistics from the log files stored in the EBI infrastructure. These statistics show how files and projects are used and support decision-making.
This pipeline automates log file retrieval, transformation, and statistical analysis, producing structured outputs for downstream analysis and database updates. It is optimized for high-throughput processing and large-scale data aggregation.
- Retrieve Log Files (`get_log_files`)
  - Identifies and compiles a list of log files from the root directory.
  - Output: `file_list.txt`
- Compute Log File Statistics (`run_log_file_stat`)
  - Performs statistical analysis on extracted log files.
  - Output: `log_file_statistics.html`
- Process Log Files (`process_log_file`)
  - Extracts and transforms log data into Parquet format.
  - Output: `*.parquet` files.
- Merge Parquet Datasets (`merge_parquet_files`)
  - Aggregates individual Parquet datasets into a consolidated dataset.
  - Output: `output_parquet`
- Analyze Merged Dataset (`analyze_parquet_files`)
  - Conducts statistical analysis and generates JSON reports.
  - Outputs:
    - `project_level_download_counts.json`
    - `file_level_download_counts.json`
    - `project_level_yearly_download_counts.json`
    - `project_level_top_download_counts.json`
    - `all_data.json`
- Generate Download Statistics Report (`run_file_download_stat`)
  - Produces a visual analytics report in HTML format.
  - Output: `file_download_stat.html`
- Update Project-Level Download Metrics (`update_project_download_counts`)
  - Uploads project-level statistics to a database.
  - Output: `upload_response_file_downloads_per_project.txt`
- Update File-Level Download Metrics (`update_file_level_download_counts`)
  - Segments large JSON datasets for database ingestion.
  - Output: Server response files confirming successful uploads.
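
The later stages above (counting downloads per project and per file, then segmenting the file-level data for ingestion) can be sketched in plain Python. This is a minimal illustration and not the pipeline's actual code: the record shape with `project` and `file` keys, the function names, and the batch size are all assumptions made for the example, and the real Parquet schema may differ.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical record shape: one dict per download event with
# "project" and "file" keys. The real Parquet schema may differ.
Record = dict[str, str]


def count_downloads(records: Iterable[Record]) -> tuple[Counter, Counter]:
    """Count download events per project and per (project, file) pair."""
    per_project: Counter = Counter()
    per_file: Counter = Counter()
    for record in records:
        per_project[record["project"]] += 1
        per_file[(record["project"], record["file"])] += 1
    return per_project, per_file


def to_file_level_rows(per_file: Counter) -> list[dict]:
    """Flatten the per-file counter into JSON-serializable rows, sorted by key."""
    return [
        {"project": project, "file": name, "count": count}
        for (project, name), count in sorted(per_file.items())
    ]


def split_into_batches(rows: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


if __name__ == "__main__":
    # Made-up sample events, for illustration only.
    sample = [
        {"project": "P1", "file": "a.raw"},
        {"project": "P1", "file": "a.raw"},
        {"project": "P1", "file": "b.raw"},
        {"project": "P2", "file": "c.raw"},
    ]
    per_project, per_file = count_downloads(sample)
    print(json.dumps(dict(per_project)))
    # Each batch would be sent as one upload request in a real run.
    for batch in split_into_batches(to_file_level_rows(per_file), batch_size=2):
        print(json.dumps(batch))
```

Splitting the file-level rows into fixed-size batches keeps each upload request small, which is one plausible reason for the segmentation step; the appropriate batch size depends on the receiving database service.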
flowchart TD;
A[Retrieve Log Files] -->|file_list.txt| B[Compute Log File Statistics];
B -->|log_file_statistics.html| C[Process Log Files];
C -->|*.parquet| D[Merge Parquet Datasets];
D -->|output_parquet| E[Analyze Merged Dataset];
E -->|JSON Reports| F[Generate Download Statistics Report];
F -->|file_download_stat.html| G[Update Project-Level Metrics];
F --> H[Update File-Level Metrics];
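
The dependencies drawn in the flowchart can also be written down as a mapping and ordered programmatically. The sketch below copies the edges from the flowchart and uses Python's standard `graphlib` module (Python 3.9+) to derive a valid execution order. It only illustrates the stage ordering: Nextflow derives the order itself from how processes are connected, and the `DEPENDS_ON` mapping and `execution_order` helper are names introduced for this example.

```python
from graphlib import TopologicalSorter

# Edges copied from the flowchart: each stage maps to the stages it waits for.
DEPENDS_ON = {
    "get_log_files": set(),
    "run_log_file_stat": {"get_log_files"},
    "process_log_file": {"run_log_file_stat"},
    "merge_parquet_files": {"process_log_file"},
    "analyze_parquet_files": {"merge_parquet_files"},
    "run_file_download_stat": {"analyze_parquet_files"},
    "update_project_download_counts": {"run_file_download_stat"},
    "update_file_level_download_counts": {"run_file_download_stat"},
}


def execution_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order where every stage follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(execution_order(DEPENDS_ON), start=1):
        print(position, stage)
```

The two update stages depend only on the report stage, so they may appear in either order and could run concurrently.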
- Optimized for High-Throughput Processing: Uses parallel computation and efficient storage formats (Parquet).
- Modular Execution: Each step can be run independently or as part of the full pipeline.
- Database Integration: Supports structured uploads of processed metrics.
- Customizable Parameters: Configurable via `params.yml`.
make install
make clean
make uninstall
./run_download_stat.sh
For step-by-step instructions, please refer to the usage documentation.
To see the results of an example test run with a full-size dataset, refer to the Report.
For more details, please refer to the complete documentation.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on Slack at EBI.