MTBLS233-Pachyderm

On this page we introduce a metabolomics preprocessing workflow that you can run using Pachyderm, a distributed data-processing tool built on software containers that enables scalable and reproducible pipelines.

Introduction

The main goal of the study performed on MTBLS233 was to produce quantitative information for the highest possible number of reliable features in untargeted metabolomics. To this end, diverse approaches to tuning mass spectrometric acquisition parameters were tested in order to maximize the number of spectral features.

The workflow was originally implemented in OpenMS v. 1.1.1, followed by downstream analysis in KNIME. Here we show you how to run the preprocessing workflow using Pachyderm, a tool built on top of Kubernetes that allows you to process data in a distributed fashion and to keep track of the input/output data from every stage of the pipeline (think “git for data”), such that it is possible to track the provenance of results and accurately reproduce scientific workflows.

Run the preprocessing workflow

Once you are logged into the master node, start by making sure that Pachyderm is up and running:

$ pachctl version

If everything went well, you should see the versions of both the Pachyderm daemon (pachd) and the pachctl client.

Note: In order for Pachyderm to be accessible in your cluster, you first need to uncomment the install-pachyderm-minio-playbook in your deployment template.

Ingest the MTBLS233 dataset from MetaboLights

MetaboLights offers an FTP service, so we can ingest the MTBLS233 dataset in a terminal.

  1. First, create a folder where you will store the data and navigate to it.
  2. Ingest the dataset using wget:
# Dataset retrieval
mkdir dataset
cd dataset
wget ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS233/00*alternate_pos_low_mr.mzML
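Optionally, you can verify the download before moving on. This is a quick sanity check; the number of files you see depends on how many runs match the wget pattern above:

# Optional sanity check: list the downloaded mzML files and count them
ls -lh *.mzML
ls *.mzML | wc -l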

Add the MTBLS233 dataset to Pachyderm

Create a repo called mrpo and push the dataset into it.

# Dataset upload 
pachctl create-repo mrpo
pachctl put-file mrpo master -c -r -p 3 -f . 
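To confirm that the upload went through, you can list the repositories and the files committed to the master branch (standard Pachyderm 1.x commands; the exact output columns vary by version):

# Verify the upload
pachctl list-repo
pachctl list-file mrpo master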

Process the data

Now that the data is in the repository, it’s time to execute the pipeline. Five different stages compose the pipeline, and their specifications can be found in the ./pipelines directory. You can learn how to customise your pipelines in detail by visiting: http://docs.pachyderm.io/en/v1.6.6/reference/pipeline_spec.html

pachctl create-pipeline -f ./path/to/pipelines/FileFilter.json
pachctl create-pipeline -f ./path/to/pipelines/PeakPickerHiRes.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureFinderMetabo.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureLinkerUnlabeledQT.json
pachctl create-pipeline -f ./path/to/pipelines/TextExporter.json
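For reference, a Pachyderm 1.6 pipeline spec is a small JSON document. The sketch below illustrates roughly what a stage such as PeakPickerHiRes could look like; the Docker image name, command line, and glob pattern here are illustrative assumptions, not the exact contents of the repository’s JSON files:

{
  "pipeline": {
    "name": "PeakPickerHiRes"
  },
  "transform": {
    "image": "pharmbio/openms",
    "cmd": ["sh"],
    "stdin": [
      "for f in /pfs/FileFilter/*.mzML; do PeakPickerHiRes -in \"$f\" -out \"/pfs/out/$(basename $f)\"; done"
    ]
  },
  "input": {
    "atom": {
      "repo": "FileFilter",
      "glob": "/*"
    }
  }
}

Pachyderm mounts each input repository under /pfs/<repo-name> inside the container and collects everything written to /pfs/out into the pipeline’s output repository, which is how the stages above chain together.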

After the whole workflow has been successfully executed, the resulting CSV file generated by the TextExporter in OpenMS will be saved in the TextExporter repository. You can download the file simply by using:

pachctl get-file TextExporter master <path-to-file-in-repo> > <local-path-output>

The <path-to-file-in-repo> can be obtained by checking the list of files output to the TextExporter repository on a given branch:

pachctl list-file TextExporter master
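While the stages are running, or if you want to trace where a result came from, you can also list the jobs spawned by the pipelines and the commits in an output repository (standard Pachyderm 1.x commands; output format varies by version):

# Monitor job status and inspect output commits
pachctl list-job
pachctl list-commit TextExporter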
