file-download-stat

Introduction

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It computes file download statistics from the log files stored on the EBI infrastructure. These statistics show how files and projects are being used and support decisions based on that usage.

Pipeline in a Nutshell

Overview

This pipeline automates log file retrieval, transformation, and statistical analysis, producing structured outputs for downstream analysis and database updates. It is optimized for high-throughput processing and large-scale data aggregation.

Workflow Steps

Retrieve Log Files (get_log_files)
- Identifies and compiles a list of log files from the root directory.
- Output: file_list.txt
Compute Log File Statistics (run_log_file_stat)
- Performs statistical analysis on extracted log files.
- Output: log_file_statistics.html
Process Log Files (process_log_file)
- Extracts and transforms log data into Parquet format.
- Output: *.parquet files.
Merge Parquet Datasets (merge_parquet_files)
- Aggregates individual Parquet datasets into a consolidated dataset.
- Output: output_parquet
Analyze Merged Dataset (analyze_parquet_files)
- Conducts statistical analysis and generates JSON reports.
- Outputs:
  - project_level_download_counts.json
  - file_level_download_counts.json
  - project_level_yearly_download_counts.json
  - project_level_top_download_counts.json
  - all_data.json
Generate Download Statistics Report (run_file_download_stat)
- Produces a visual analytics report in HTML format.
- Output: file_download_stat.html
Update Project-Level Download Metrics (update_project_download_counts)
- Uploads project-level statistics to a database.
- Output: upload_response_file_downloads_per_project.txt
Update File-Level Download Metrics (update_file_level_download_counts)
- Segments large JSON datasets for database ingestion.
- Output: Server response files confirming successful uploads.
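
The analysis and file-level upload steps above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the pipeline's actual code: the record fields (`project`, `file`, `year`), the sample values, the helper names, and the batch size are all hypothetical, since this README does not describe the log schema or the upload API. It counts downloads per project and per file, then splits the file-level counts into fixed-size JSON batches, which is one plausible way to "segment large JSON datasets" before ingestion.

```python
import json
from collections import Counter
from itertools import islice

# Hypothetical download records; the real log schema is not given in this README.
records = [
    {"project": "PROJ1", "file": "a.raw", "year": 2023},
    {"project": "PROJ1", "file": "a.raw", "year": 2024},
    {"project": "PROJ1", "file": "b.raw", "year": 2024},
    {"project": "PROJ2", "file": "c.raw", "year": 2024},
]


def project_level_counts(rows):
    """Count downloads per project."""
    return Counter(row["project"] for row in rows)


def file_level_counts(rows):
    """Count downloads per (project, file) pair."""
    return Counter((row["project"], row["file"]) for row in rows)


def chunked(items, size):
    """Yield successive lists holding at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


project_counts = project_level_counts(records)

# Flatten the file-level counter into JSON-friendly rows in a stable order.
file_counts = [
    {"project": project, "file": name, "count": count}
    for (project, name), count in sorted(file_level_counts(records).items())
]

# Assumed batch size of 2 for the demo; a real upload would use a much larger one.
batches = [json.dumps(batch) for batch in chunked(file_counts, 2)]
```

Each string in `batches` is a self-contained JSON array, so a failed upload could be retried per batch without resending the whole dataset.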

Execution Flow

flowchart TD;
    A[Retrieve Log Files] -->|file_list.txt| B[Compute Log File Statistics];
    B -->|log_file_statistics.html| C[Process Log Files];
    C -->|*.parquet| D[Merge Parquet Datasets];
    D -->|output_parquet| E[Analyze Merged Dataset];
    E -->|JSON Reports| F[Generate Download Statistics Report];
    F -->|file_download_stat.html| G[Update Project-Level Metrics];
    F --> H[Update File-Level Metrics];
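
To make the ordering implied by the flowchart explicit, the sketch below models its edges as a dependency graph and derives one valid execution order. It is illustrative only: Nextflow works out the order itself from how processes are connected, and the mapping of flowchart boxes to process names is taken from the Workflow Steps list rather than from `main.nf`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on, mirroring the flowchart edges.
dependencies = {
    "get_log_files": set(),
    "run_log_file_stat": {"get_log_files"},
    "process_log_file": {"run_log_file_stat"},
    "merge_parquet_files": {"process_log_file"},
    "analyze_parquet_files": {"merge_parquet_files"},
    "run_file_download_stat": {"analyze_parquet_files"},
    "update_project_download_counts": {"run_file_download_stat"},
    "update_file_level_download_counts": {"run_file_download_stat"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

for position, step in enumerate(order, start=1):
    print(f"{position}. {step}")
```

The first six steps form a single chain, so their order is fixed. The two update steps depend only on the report step, so they may appear in either order and could run concurrently.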

Key Features

Optimized for High-Throughput Processing: Uses parallel computation and efficient storage formats (Parquet).
Modular Execution: Each step can be run independently or as part of the full pipeline.
Database Integration: Supports structured uploads of processed metrics.
Customizable Parameters: Configurable via params.yml.

Usage

Install everything

make install

Clean up

make clean

Uninstall everything

make uninstall

Run in EBI infrastructure

./run_download_stat.sh

For step-by-step instructions, please refer to the usage documentation.

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the Report.

Additional documentation and tutorial

For more details, please refer to the Complete documentation.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the EBI Slack.

sureshhewabi/nf-downloadstats

Folders and files

Latest commit

History

Repository files navigation

file-download-stat

Introduction

Pipeline in Nutshell

Overview

Workflow Steps

Execution Flow

Key Features

Usage

Install everything

Clean up

Uninstall everything

Run in EBI infastructure

Pipeline output

Additional documentation and tutorial

Contributions and Support

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages