Refactor tutorial organisation #3

Merged · 3 commits merged on Mar 26, 2019
README.md: 178 changes (125 additions & 53 deletions)
@@ -1,91 +1,137 @@
# POC versioning Machine Learning pipeline

The aim of this repository is to show a way to handle **pipelining** and **versioning**
of a **Machine Learning project**.

The processes shown in this tutorial are based on 3 existing tools:

- [Data Science Version Control](https://github.com/iterative/dvc) or DVC
- [MLflow tracking](https://github.com/mlflow/mlflow)
- [MLV-tools](https://github.com/peopledoc/ml-versioning-tools)

Use cases are based on a text classification task on the 20 newsgroups dataset. A simpler tutorial is also
available to show the tools' mechanisms.


**Requirements:**

Before starting, you must be familiar with the following commands:
- virtualenv or condaenv
- make
- git
- python3

## Tools Overview

DVC: an open-source tool for data science and machine learning projects. It is used to version, share and reproduce experiments.

MLflow tracking: an API and a UI to log and visualise metrics obtained during experiments.

MLV-tools: provides a set of tools to enhance Jupyter Notebook conversion and DVC versioning and pipelining.
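
For instance, logging the results of one experiment with the MLflow tracking API looks roughly like the sketch
below (the run name, parameter and metric values are made up for illustration):

    import mlflow

    # Record one experiment run: a hyperparameter and a resulting metric.
    with mlflow.start_run(run_name="fasttext-baseline"):
        mlflow.log_param("learning_rate", 0.1)
        mlflow.log_metric("accuracy", 0.87)

The logged runs can then be browsed with the MLflow UI (`mlflow ui`).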

Please have a look at the [presentation](https://peopledoc.github.io/mlv-tools-tutorial/talks/pyData/presentation.html).


## Keywords

**DVC meta files**: DVC metadata files generated by the DVC `add` and `run` commands. Extension: **.dvc**


## Our main features

- Notebook parametrized conversion ([MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

- Pipelining ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

- Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

## Realistic Tutorial Use Cases

For a more realistic use case, we consider a Natural Language Processing pipeline based on the well-known
20-newsgroups dataset. These use cases were chosen because we think they represent the typical day-to-day work of
a data scientist: easily reproducing a pipeline built some time ago or by a coworker, versioning intermediate
states of the data to avoid rerunning costly preprocessing steps, branching experiments to run tests with
different classifiers, and fine-tuning hyperparameters. The use cases are listed in the **Cases** section at the
end of this README.

The base pipeline is simplified to include the following steps:
1. Split the data between train and test sets;
2. Tokenize (split into words) the raw text input;
3. Classify the cleaned input with FastText;
4. Evaluate the model.

In addition, use case 4 showcases hyperparameter tuning with a scikit-learn classifier.
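
As a rough illustration of what a single pipeline step does, a minimal tokenisation step could look like the
sketch below. The file paths and the tokenisation rule are invented for this example; in the tutorial the actual
step scripts are generated from the Jupyter notebooks.

    from pathlib import Path

    def tokenize(input_path: str, output_path: str) -> None:
        """Lowercase and split each raw text line into whitespace-separated tokens."""
        raw_lines = Path(input_path).read_text(encoding="utf-8").splitlines()
        tokenized = [" ".join(line.lower().split()) for line in raw_lines]
        Path(output_path).write_text("\n".join(tokenized), encoding="utf-8")

    if __name__ == "__main__":
        # Illustrative paths: the input and output files are what DVC tracks as dependencies and outputs.
        tokenize("./data/train_raw.txt", "./data/train_tokenized.txt")
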
## Standard Versioning Process Establishment

**Goal:** find a way to version code, data and pipelines.

#### Existing project

Starting from an existing project composed of Python 3 module(s) and a set of **Jupyter notebooks**,
we want to create an automated pipeline in order to version, share and reproduce experiments.


│── classifier
│ ├── aggregate_classif.py
│ ├── __init__.py
│ ├── extract.py
│ └── ...
│── notebooks
│ ├── Augment train data.ipynb
│ ├── Check data and split and train.ipynb
│ ├── Extract data.ipynb
│ ├── Learn text classifier.ipynb
│ ├── Learn aggregated model.ipynb
│ ├── Preprocess image data.ipynb
│ └── Train CNN classifier on image data.ipynb
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py

The data flow is processed by applying steps whose intermediate results are versioned using metadata files. These steps are defined in **Jupyter notebooks**, which are then converted to Python scripts.

Keep in mind that:

- The reference for the code of each step remains the **Jupyter notebook**
- Pipelines are structured according to their inputs and outputs
- Hyperparameters are pipeline inputs
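
To make the last point concrete: a hyperparameter is just another input of a step script, typically passed on
the command line, so it ends up recorded in the corresponding **DVC** command. A minimal sketch (option names and
defaults are invented for illustration):

    import argparse

    def main() -> None:
        parser = argparse.ArgumentParser(description="Train a text classifier (sketch)")
        parser.add_argument("--input-file", required=True, help="Tokenized training data")
        parser.add_argument("--model-file", required=True, help="Where to write the trained model")
        # Hyperparameters are plain command-line inputs of the step.
        parser.add_argument("--learning-rate", type=float, default=0.1, help="Hyperparameter: learning rate")
        parser.add_argument("--epochs", type=int, default=5, help="Hyperparameter: number of epochs")
        args = parser.parse_args()
        print(f"Would train on {args.input_file} with lr={args.learning_rate} for {args.epochs} epochs, "
              f"then save the model to {args.model_file}")

    if __name__ == "__main__":
        main()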

#### Project after refactoring

│── classifier
│ ├── aggregate_classif.py
│ ├── __init__.py
│ ├── extract.py
│ └── ...
│── notebooks
│ ├── Augment train data.ipynb
│ ├── Check data and split and train.ipynb
│ ├── Extract data.ipynb
│ ├── Learn text classifier.ipynb
│ ├── Learn aggregated model.ipynb
│ ├── Preprocess image data.ipynb
│ └── Train CNN classifier on image data.ipynb
│── pipeline
│ ├── dvc ** DVC pipeline steps
│ │ ├─ mlvtools_augment_train_data_dvc
│ │ ├─ ..
│ ├── scripts ** Notebooks converted into Python 3 configurable scripts
│ │ ├─ mlvtools_augment_train_data.py
│ │ ├─ ..
│── README.md
│── requirements.yml
│── setup.cfg
│── setup.py


**Notebooks converted into configurable Python 3 scripts**: obtained by **Jupyter notebook** conversion.

**DVC pipeline steps**: DVC commands applied to the generated Python 3 scripts.


#### Applying the process

For each **Jupyter notebook**, a parameterizable and executable **Python 3** script is generated. This is how the
code is versioned and how its execution can be automated.

Pipelines are composed of **DVC** steps. Those steps can be generated directly from the **Jupyter notebook**, based
on the parameters described in its docstring (notebook -> Python script -> DVC command).
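
As a rough sketch of this mechanism (parameter names and paths are invented here; see the MLV-tools documentation
for the exact docstring syntax), the first cell of the notebook declares the step parameters in its docstring:

    # First cell of the notebook: the docstring declares the step parameters.
    """
    :param str input_csv_file: Path to the raw input data
    :param str output_csv_file: Path where the tokenized data is written
    :param float subset_ratio: Fraction of the data to keep
    """
    input_csv_file = './data/raw.csv'
    output_csv_file = './data/tokenized.csv'
    subset_ratio = 1.0

After conversion with *ipynb_to_python*, these parameters are exposed as options of the generated script; the
**DVC** step then calls that script with concrete inputs and outputs.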

For steps which reuse the same code (for example an evaluation step run once on train data and then on test data),
it is not necessary to convert the **Jupyter notebook** to a **Python 3** script and then to a **DVC** command again.
You just need to duplicate the corresponding **DVC** command file and then edit its inputs, outputs and meta file name.


Each time a **DVC** step is run, a **DVC meta file** (`[normalized_notebook_name].dvc`) is created. This metadata
file represents a pipeline step: it is the DVC result of a step execution and it tracks the step's outputs and
dependencies. Those files must be tracked using Git; they are what makes a pipeline reproducible. You can run
**dvc repro [DVC meta file]** to re-run all the pipeline steps needed to reach the one corresponding to this meta file.

**Application:**

> For each step in the tutorial, the process remains the same.

1. Write a **Jupyter notebook** which corresponds to a pipeline step. (See the **Jupyter notebook** syntax section in
the [MLV-tools documentation](https://github.com/peopledoc/ml-versioning-tools).)
2. Test your **Jupyter notebook**.
3. Add it under git.
4. Convert the **Jupyter notebook** into a configurable and executable **Python 3** script using *ipynb_to_python*.

ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]

@@ -106,12 +152,6 @@ For each step in the tutorial the process remain the same.
12. Add the **DVC meta file** under Git.



## Key Features
|Need| Feature|
|:---|:---|
@@ -133,3 +173,35 @@ different parameters.

It is a bad idea to modify the generated **Python 3** scripts. They are generated from **Jupyter notebooks**, so changes
should be made in the notebooks and the scripts then re-generated.



## Tutorial

#### Environment

To complete this tutorial, clone this repository:

git clone https://github.com/peopledoc/mlv-tools-tutorial

Activate your **Python 3** virtual environment.

Install requirements:

make develop

All other steps are explained in each use case.

#### Cases

- [How DVC works](./tutorial/dvc_overview.md)

- [MLV-tools pipeline features (on simple cases)](./tutorial/pipeline_features.md)

- Going further with more realistic use cases:

- [Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md)
- [Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) (Run an experiment)
- [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md)
- [Use Case 4: Hyperparameter optimisation and fine-tuning](./tutorial/use_case4.md)

tutorial/dvc_overview.md: 146 changes (146 additions & 0 deletions)
@@ -0,0 +1,146 @@
Data Version Control
====================


Overview
---------
- Each run is tracked and is reproducible
- Each run can be a part of a pipeline
- A complete pipeline is reproducible according to a chosen version
  (i.e. a chosen commit)
- The cache mechanism makes it possible to reproduce only sub-pipelines (the parts with outdated dependencies)
- Several kinds of storage can be configured to handle data files (AWS S3, Azure,
  Google Cloud Storage, SSH, HDFS)


- You need to be rigorous: the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- Commands can't be run through a Jupyter Notebook



How it works
------------

**DVC** depends on **Git**. You need a Git repository, and you manage your *code*
versioning yourself.
You can think of **DVC** as a Git extension.

1. As usual, create a git repository and version your files
2. Activate DVC (`dvc init`)
3. Add data files and manage their versioning with DVC (`dvc add [my_file]`).
   At this step DVC puts the data files into its cache and creates meta files to
   identify them.
   (see the **Add a data file** section)
4. Commit meta files using Git to save a version of a pipeline



Small tutorial
---------------

### Install DVC

pip install dvc

### Setup a git environment

mkdir test_dvc
cd test_dvc

git init
# Create a Python script which takes a file as input, reads it and writes it back in upper case
mkdir code
cat > code/python_script.py << 'EOF'
#!/usr/bin/env python
with open('./data/input_file.txt', 'r') as fd, open('./results/output_file.txt', 'w') as wfd:
    wfd.write(fd.read().upper())
EOF
chmod +x ./code/python_script.py

# Commit your script
git add ./code/python_script.py
git commit -m 'Initialize env'

### Setup DVC environment

# In ./test_dvc (top level directory)
dvc init
git commit -m 'Initialize dvc'

### Add a data file

# Create a data file for the example
mkdir data
echo "This is a text" > data/input_file.txt

dvc add data/input_file.txt

At this point you can check that the meta file was created (`git status data`), that the real file
is ignored by Git (`cat ./data/.gitignore`), and that a cache entry was created (`ls -la .dvc/cache/`).

# Commit meta files in git
git add .
git commit -m "Add input data file"

### Run a step

dvc run -d [input file] -o [output file] [cmd]

mkdir results
dvc run -d ./data/input_file.txt -o ./results/output_file.txt ./code/python_script.py

Check that the output file and the meta file are generated: *./results/output_file.txt* and *./output_file.txt.dvc*


### Run a pipeline
A pipeline is composed of several steps, so we need to create at least one more step here.

# Run another step and create a pipeline
MY_CMD="cat ./results/output_file.txt | wc -c > ./results/nb_letters.txt"
dvc run -d ./results/output_file.txt -o ./results/nb_letters.txt -f MyPipeline.dvc "$MY_CMD"

See the result

cat ./results/nb_letters.txt

At this step, the file *./MyPipeline.dvc* represents the pipeline for the current version of the code and data.

# Reproduce the pipeline
dvc repro MyPipeline.dvc

Nothing happens because nothing has changed. Try `dvc repro MyPipeline.dvc -v` to see the details.

# Force the pipeline run
dvc repro MyPipeline.dvc -v -f

git add .
git commit -m 'pipeline creation'

### Modify the input and re-run

echo "new input" >> data/input_file.txt

dvc repro MyPipeline.dvc -v

cat ./results/nb_letters.txt

git commit -am 'New pipeline version'


### See pipeline steps

dvc pipeline show MyPipeline.dvc

Need to be rigorous
-------------------

- the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- when you modify a data file you need to re-run the associated step to be able
  to version it (or reproduce the whole pipeline using the cache mechanism)

Various
-------

See [Data Version Control documentation](https://github.com/iterative/dvc)

See [Data Version Control tutorial](https://blog.dataversioncontrol.com/data-version-control-tutorial-9146715eda46)