Refactor tutorial organisation #3

Merged · 3 commits merged on Mar 26, 2019
README.md: 178 changes (125 additions & 53 deletions)
@@ -1,91 +1,137 @@
# POC versioning Machine Learning pipeline

The aim of this repository is to show a way to handle **pipelining** and **versioning**
of a **Machine Learning project**.

The processes shown in this tutorial are based on 3 existing tools:

- [Data Science Version Control](https://github.com/iterative/dvc) or DVC
- [MLflow tracking](https://github.com/mlflow/mlflow)
- [MLV-tools](https://github.com/peopledoc/ml-versioning-tools)

Use cases are based on a text classification task on the 20 newsgroups dataset. A simpler tutorial is also
available to show the tools' mechanisms.


**Requirements:**

Before starting, you must be familiar with the following commands:
- virtualenv or condaenv
- make
- git
- python3

## Tools Overview

DVC: an open-source tool for data science and machine learning projects. It is used to version, share and reproduce experiments.

MLflow tracking: an API and a UI to log and visualise metrics obtained during experiments.

MLV-tools: provides a set of tools to enhance Jupyter Notebook conversion and DVC versioning and pipelining.
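
For instance, logging the results of one experiment with the MLflow tracking API looks roughly like the sketch
below (the run name, parameter and metric values are made up for illustration):

    import mlflow

    # Record one experiment run: a hyperparameter and a resulting metric.
    with mlflow.start_run(run_name="fasttext-baseline"):
        mlflow.log_param("learning_rate", 0.1)
        mlflow.log_metric("accuracy", 0.87)

The logged runs can then be browsed with the MLflow UI (`mlflow ui`).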

Please have a look at the [presentation](https://peopledoc.github.io/mlv-tools-tutorial/talks/pyData/presentation.html).


## Keywords

**DVC meta files**: DVC metadata files generated by the DVC `add` and `run` commands. Extension: **.dvc**


## Our main features

- Notebook parametrized conversion ([MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

- Pipelining ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

- Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

## Realistic Tutorial Use Cases

For a more realistic use case, we consider a Natural Language Processing pipeline based on the well-known
20-newsgroups dataset. These use cases were chosen because we think they represent the typical day-to-day work of
a data scientist: easily reproducing a pipeline built some time ago or by a coworker, versioning intermediate
states of the data to avoid rerunning costly preprocessing steps, branching experiments to run tests with
different classifiers, and fine-tuning hyperparameters. The use cases are listed in the **Cases** section at the
end of this README.

The base pipeline is simplified to include the following steps:
1. Split the data between train and test sets;
2. Tokenize (split into words) the raw text input;
3. Classify the cleaned input with FastText;
4. Evaluate the model.

In addition, use case 4 showcases hyperparameter tuning with a scikit-learn classifier.
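
As a rough illustration of what a single pipeline step does, a minimal tokenisation step could look like the
sketch below. The file paths and the tokenisation rule are invented for this example; in the tutorial the actual
step scripts are generated from the Jupyter notebooks.

    from pathlib import Path

    def tokenize(input_path: str, output_path: str) -> None:
        """Lowercase and split each raw text line into whitespace-separated tokens."""
        raw_lines = Path(input_path).read_text(encoding="utf-8").splitlines()
        tokenized = [" ".join(line.lower().split()) for line in raw_lines]
        Path(output_path).write_text("\n".join(tokenized), encoding="utf-8")

    if __name__ == "__main__":
        # Illustrative paths: the input and output files are what DVC tracks as dependencies and outputs.
        tokenize("./data/train_raw.txt", "./data/train_tokenized.txt")
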
## Standard Versioning Process Establishment

**Goal:** find a way to version code, data and pipelines.

#### Existing project

Starting from an existing project composed of Python 3 module(s) and a set of **Jupyter notebooks**,
we want to create an automated pipeline in order to version, share and reproduce experiments.


│── classifier
│ ├── aggregate_classif.py
│ ├── __init__.py
│ ├── extract.py
│ └── ...
│── notebooks
│ ├── Augment train data.ipynb
│ ├── Check data and split and train.ipynb
│ ├── Extract data.ipynb
│ ├── Learn text classifier.ipynb
│ ├── Learn aggregated model.ipynb
│ ├── Preprocess image data.ipynb
│ └── Train CNN classifier on image data.ipynb
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py

The data flow is processed by applying steps whose intermediate results are versioned using metadata files. These steps are defined in **Jupyter notebooks**, which are then converted to Python scripts.

Keep in mind that:

- The reference for the code of each step remains the **Jupyter notebook**
- Pipelines are structured according to their inputs and outputs
- Hyperparameters are pipeline inputs
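
To make the last point concrete: a hyperparameter is just another input of a step script, typically passed on
the command line, so it ends up recorded in the corresponding **DVC** command. A minimal sketch (option names and
defaults are invented for illustration):

    import argparse

    def main() -> None:
        parser = argparse.ArgumentParser(description="Train a text classifier (sketch)")
        parser.add_argument("--input-file", required=True, help="Tokenized training data")
        parser.add_argument("--model-file", required=True, help="Where to write the trained model")
        # Hyperparameters are plain command-line inputs of the step.
        parser.add_argument("--learning-rate", type=float, default=0.1, help="Hyperparameter: learning rate")
        parser.add_argument("--epochs", type=int, default=5, help="Hyperparameter: number of epochs")
        args = parser.parse_args()
        print(f"Would train on {args.input_file} with lr={args.learning_rate} for {args.epochs} epochs, "
              f"then save the model to {args.model_file}")

    if __name__ == "__main__":
        main()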

#### Project after refactoring

│── classifier
│ ├── aggregate_classif.py
│ ├── __init__.py
│ ├── extract.py
│ └── ...
│── notebooks
│ ├── Augment train data.ipynb
│ ├── Check data and split and train.ipynb
│ ├── Extract data.ipynb
│ ├── Learn text classifier.ipynb
│ ├── Learn aggregated model.ipynb
│ ├── Preprocess image data.ipynb
│ └── Train CNN classifier on image data.ipynb
│── pipeline
│ ├── dvc ** DVC pipeline steps
│ │ ├─ mlvtools_augment_train_data_dvc
│ │ ├─ ..
│ ├── scripts ** Notebooks converted into Python 3 configurable scripts
│ │ ├─ mlvtools_augment_train_data.py
│ │ ├─ ..
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py


**Notebooks converted into configurable Python 3 scripts**: obtained by **Jupyter notebook** conversion.

**DVC pipeline steps**: DVC commands applied to the generated Python 3 scripts.


#### Applying the process

For each **Jupyter notebook**, a parameterizable and executable **Python 3** script is generated. This is how the
code is versioned and how its execution can be automated.

Pipelines are composed of **DVC** steps. Those steps can be generated directly from the **Jupyter notebook**, based
on the parameters described in its docstring (notebook -> Python script -> DVC command).
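
As a rough sketch of this mechanism (parameter names and paths are invented here; see the MLV-tools documentation
for the exact docstring syntax), the first cell of the notebook declares the step parameters in its docstring:

    # First cell of the notebook: the docstring declares the step parameters.
    """
    :param str input_csv_file: Path to the raw input data
    :param str output_csv_file: Path where the tokenized data is written
    :param float subset_ratio: Fraction of the data to keep
    """
    input_csv_file = './data/raw.csv'
    output_csv_file = './data/tokenized.csv'
    subset_ratio = 1.0

After conversion with *ipynb_to_python*, these parameters are exposed as options of the generated script; the
**DVC** step then calls that script with concrete inputs and outputs.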

For steps which reuse the same code (for example an evaluation step run once on train data and then on test data),
it is not necessary to convert the **Jupyter notebook** to a **Python 3** script and then to a **DVC** command again.
You just need to duplicate the corresponding **DVC** command file and then edit its inputs, outputs and meta file name.


Each time a **DVC** step is run, a **DVC meta file** (`[normalized_notebook_name].dvc`) is created. This metadata
file represents a pipeline step: it is the DVC result of a step execution and it tracks the step's outputs and
dependencies. Those files must be tracked using Git; they are what makes a pipeline reproducible. You can run
**dvc repro [DVC meta file]** to re-run all the pipeline steps needed to reach the one corresponding to this meta file.

**Application:**

> For each step in the tutorial, the process remains the same.

1. Write a **Jupyter notebook** which corresponds to a pipeline step. (See the **Jupyter notebook** syntax section in
the [MLV-tools documentation](https://github.com/peopledoc/ml-versioning-tools).)
2. Test your **Jupyter notebook**.
3. Add it under git.
4. Convert the **Jupyter notebook** into a configurable and executable **Python 3** script using *ipynb_to_python*.

ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]

@@ -106,12 +152,6 @@ For each step in the tutorial the process remain the same.
12. Add the **DVC meta file** under Git.



## Key Features
|Need| Feature|
|:---|:---|
@@ -133,3 +173,35 @@ different parameters.

It is a bad idea to modify the generated **Python 3** scripts. They are generated from **Jupyter notebooks**, so changes
should be made in the notebooks and the scripts then re-generated.



## Tutorial

#### Environment

To complete this tutorial, clone this repository:

git clone https://github.com/peopledoc/mlv-tools-tutorial

Activate your **Python 3** virtual environment.

Install requirements:

make develop

All other steps are explained in each use case.

#### Cases

- [How DVC works](./tutorial/dvc_overview.md)

- [MLV-tools pipeline features (on simple cases)](./tutorial/pipeline_features.md)

- Going further with more realistic use cases:

- [Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md)
- [Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) (Run an experiment)
- [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md)
- [Use Case 4: Hyperparameter optimisation and fine-tuning](./tutorial/use_case4.md)

tutorial/dvc_overview.md: 146 changes (146 additions & 0 deletions)
@@ -0,0 +1,146 @@
Data Version Control
====================


Overview
---------
- Each run is tracked and is reproducible
- Each run can be a part of a pipeline
- A complete pipeline is reproducible according to a chosen version
  (i.e. a chosen commit)
- The cache mechanism makes it possible to reproduce only sub-pipelines (the parts with outdated dependencies)
- Several kinds of storage can be configured to handle data files (AWS S3, Azure,
  Google Cloud Storage, SSH, HDFS)


- You need to be rigorous: the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- Commands can't be run through a Jupyter Notebook



How it works
------------

**DVC** depends on **Git**. You need a Git repository, and you manage your *code*
versioning yourself.
You can think of **DVC** as a Git extension.

1. As usual, create a git repository and version your files
2. Activate DVC (`dvc init`)
3. Add data files and manage their versioning with DVC (`dvc add [my_file]`).
   At this step DVC puts the data files into its cache and creates meta files to
   identify them.
   (see the **Add a data file** section)
4. Commit meta files using Git to save a version of a pipeline



Small tutorial
---------------

### Install DVC

pip install dvc

### Setup a git environment

mkdir test_dvc
cd test_dvc

git init
# Create a Python script which takes a file as input, reads it and writes it back in upper case
mkdir code
cat > code/python_script.py << 'EOF'
#!/usr/bin/env python
with open('./data/input_file.txt', 'r') as fd, open('./results/output_file.txt', 'w') as wfd:
    wfd.write(fd.read().upper())
EOF
chmod +x ./code/python_script.py

# Commit your script
git add ./code/python_script.py
git commit -m 'Initialize env'

### Setup DVC environment

# In ./test_dvc (top level directory)
dvc init
git commit -m 'Initialize dvc'

### Add a data file

# Create a data file for the example
mkdir data
echo "This is a text" > data/input_file.txt

dvc add data/input_file.txt

At this point you can check that the meta file was created (`git status data`), that the real file
is ignored by Git (`cat ./data/.gitignore`), and that a cache entry was created (`ls -la .dvc/cache/`).

# Commit meta files in git
git add .
git commit -m "Add input data file"

### Run a step

dvc run -d [input file] -o [output file] [cmd]

mkdir results
dvc run -d ./data/input_file.txt -o ./results/output_file.txt ./code/python_script.py

Check that the output file and the meta file are generated: *./results/output_file.txt* and *./output_file.txt.dvc*


### Run a pipeline
A pipeline is composed of several steps, so we need to create at least one more step here.

# Run another step and create a pipeline
MY_CMD="cat ./results/output_file.txt | wc -c > ./results/nb_letters.txt"
dvc run -d ./results/output_file.txt -o ./results/nb_letters.txt -f MyPipeline.dvc "$MY_CMD"

See the result

cat ./results/nb_letters.txt

At this step, the file *./MyPipeline.dvc* represents the pipeline for the current version of the code and data.

# Reproduce the pipeline
dvc repro MyPipeline.dvc

Nothing happens because nothing has changed. Try `dvc repro MyPipeline.dvc -v` to see the details.

# Force the pipeline run
dvc repro MyPipeline.dvc -v -f

git add .
git commit -m 'pipeline creation'

### Modify the input and re-run

echo "new input" >> data/input_file.txt

dvc repro MyPipeline.dvc -v

cat ./results/nb_letters.txt

git commit -am 'New pipeline version'


### See pipeline steps

dvc pipeline show MyPipeline.dvc

Need to be rigorous
-------------------

- the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- when you modify a data file you need to re-run the associated step to be able
  to version it (or reproduce the whole pipeline using the cache mechanism)

Various
-------

See [Data Version Control documentation](https://github.com/iterative/dvc)

See [Data Version Control tutorial](https://blog.dataversioncontrol.com/data-version-control-tutorial-9146715eda46)