In this page we introduce a metabolomics preprocessing workflow that you can run using Pachyderm, a distributed data-processing tool built on software containers that enables scalable and reproducible pipelines.
The main goal of the study performed on MTBLS233 was to produce quantitative information for the largest possible number of reliable features in untargeted metabolomics. To this end, diverse approaches to tuning the mass spectrometric acquisition parameters were tested in order to maximize the number of spectral features.
The workflow was originally implemented in OpenMS v1.1.1, followed by downstream analysis in KNIME. Here we show you how to run the preprocessing workflow using Pachyderm, a tool built on top of Kubernetes that allows you to process the data in a distributed fashion and to keep track of the input/output data from every stage of the pipeline (think “git for data”), so that it is possible to track the provenance of results and accurately reproduce scientific workflows.
Once you are logged into the master node, start by making sure that Pachyderm is up and running:
$ pachctl version
If everything is working, you should see the versions of both the pachctl client and the Pachyderm daemon (pachd).
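The output should look roughly like the following; the exact versions will depend on your deployment (1.6.6 is used here purely as an illustration):

COMPONENT           VERSION
pachctl             1.6.6
pachd               1.6.6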
Note: in order for Pachyderm to be accessible in your cluster, you first need to uncomment the install-pachyderm-minio-playbook in your deployment template.
MetaboLights offers an FTP service, so we can ingest the MTBLS233 dataset directly from a terminal.
- First, create a folder where you will store the data and navigate into it.
- Ingest the dataset using wget:
# Dataset retrieval
mkdir dataset
cd dataset
wget ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS233/00*alternate_pos_low_mr.mzML
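You can quickly verify that all the files were retrieved by listing the folder contents:

# Verify the download
ls -lh *.mzML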
Create a repo called mrpo and push the dataset into it.
# Dataset upload
pachctl create-repo mrpo
pachctl put-file mrpo master -c -r -p 3 -f .
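To double-check that the upload succeeded, list the repositories and the files now stored on the master branch of mrpo:

# Verify the upload
pachctl list-repo
pachctl list-file mrpo master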
Now that the data is in the repository, it’s time to execute the pipeline. The pipeline is composed of five stages, whose specifications can be found in the ./pipelines directory. You can learn how to customise your pipelines in detail by visiting: http://docs.pachyderm.io/en/v1.6.6/reference/pipeline_spec.html
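For reference, here is a minimal sketch of what one of these pipeline specifications might look like. The container image, command, and glob pattern below are illustrative assumptions, not the exact contents of FileFilter.json:

{
  "pipeline": { "name": "FileFilter" },
  "transform": {
    "image": "<your-openms-image>",
    "cmd": ["/bin/bash"],
    "stdin": [
      "for f in /pfs/mrpo/*.mzML; do FileFilter -in \"$f\" -out \"/pfs/out/$(basename \"$f\")\"; done"
    ]
  },
  "input": {
    "atom": { "repo": "mrpo", "glob": "/*" }
  }
}

Each stage reads its input files from /pfs/<input-repo> and writes its results to /pfs/out, which Pachyderm versions as the stage’s output repository. Create the five pipelines as follows: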
pachctl create-pipeline -f ./path/to/pipelines/FileFilter.json
pachctl create-pipeline -f ./path/to/pipelines/PeakPickerHiRes.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureFinderMetabo.json
pachctl create-pipeline -f ./path/to/pipelines/FeatureLinkerUnlabeledQT.json
pachctl create-pipeline -f ./path/to/pipelines/TextExporter.json
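Pachyderm triggers the jobs automatically once the pipelines are created. You can follow their progress with the standard pachctl listing commands:

# Monitor the pipelines and their jobs
pachctl list-pipeline
pachctl list-job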
After the whole workflow has been successfully executed, the resulting CSV file generated by the TextExporter tool in OpenMS will be saved in the TextExporter repository. You can download the file by using:
pachctl get-file TextExporter master <path-to-file-in-repo> > <local-path-output>
The <path-to-file-in-repo> can be obtained by checking the list of files written to the TextExporter repository on a given branch:
pachctl list-file TextExporter master
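For example, assuming the consensus table was written to a file called consensus.csv (a hypothetical name; use whatever path the listing above shows), the download would be:

# Hypothetical file name, substitute the actual path from list-file
pachctl get-file TextExporter master consensus.csv > consensus.csv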