
Commit f45c12d

Refactor tutorial organisation
1 parent 36415a4 commit f45c12d

3 files changed: +285 −67 lines changed

README.md (+124 −53)
@@ -1,91 +1,136 @@
# POC versioning Machine Learning pipeline

The aim of this repository is to show a way to handle **pipelining** and **versioning**
of a **Machine Learning project**.

The processes exposed in this tutorial are based on 3 existing tools:

- [Data Science Version Control](https://github.com/iterative/dvc) or DVC
- [MLflow tracking](https://github.com/mlflow/mlflow)
- [MLV-tools](https://github.com/peopledoc/ml-versioning-tools)

Use cases are based on a text classification task on the 20newsgroups dataset. A *dummy* tutorial is also available to show the tools' mechanisms.

**Requirements:**

Before starting, you must be able to run the following commands:

- virtualenv or condaenv
- make
- git
- python3
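
As a quick sanity check (assuming a Unix-like shell; the last command depends on your environment manager), you can verify that each tool is available:

    python3 --version
    git --version
    make --version
    virtualenv --version   # or: conda --version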

## Tools Overview

DVC: an open-source tool for data science and machine learning projects, used to version, share and reproduce experiments.

MLflow tracking: an API and UI to log and visualize the metrics obtained during experiments.

MLV-tools: a set of tools to enhance Jupyter notebook conversion, DVC versioning and pipelining.
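
For instance, once metrics have been logged during experiments, the MLflow tracking UI can be browsed locally (assuming MLflow is installed in the active environment):

    mlflow ui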

Please have a look at the presentation [TODO: link]


## What we bring out

- Parametrized notebook conversion ([MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
- Pipelining ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
- Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

## Standard Versioning Process Establishment

**Aim:** find a way to version code, data and pipelines.

#### Existing project

Starting from an existing project composed of Python 3 module(s) and a set of **Jupyter notebooks**, we want to create an automated pipeline in order to version, share and reproduce experiments.

    │── classifier
    │   ├── aggregate_classif.py
    │   ├── __init__.py
    │   ├── extract.py
    │   └── ...
    │── notebooks
    │   ├── Augment train data.ipynb
    │   ├── Check data and split and train.ipynb
    │   ├── Extract data.ipynb
    │   ├── Learn text classifier.ipynb
    │   ├── Learn aggregated model.ipynb
    │   ├── Preprocess image data.ipynb
    │   └── Train CNN classifier on image data.ipynb
    │── README.md
    │── requirements.yml
    │── setup.cfg
    │── setup.py

The applied transformations will let us handle data through metadata files and convert **Jupyter notebooks** to Python scripts. We need to keep in mind that:

- The step code reference remains in the **Jupyter notebook**
- Pipelines are structured according to their inputs and outputs
- Hyperparameters are pipeline inputs

#### Result

    │── classifier
    │   ├── aggregate_classif.py
    │   ├── __init__.py
    │   ├── extract.py
    │   └── ...
    │── notebooks
    │   ├── Augment train data.ipynb
    │   ├── Check data and split and train.ipynb
    │   ├── Extract data.ipynb
    │   ├── Learn text classifier.ipynb
    │   ├── Learn aggregated model.ipynb
    │   ├── Preprocess image data.ipynb
    │   └── Train CNN classifier on image data.ipynb
    │── pipeline
    │   ├── dvc       ** DVC pipeline steps
    │   │   ├─ mlvtools_augment_train_data_dvc
    │   │   ├─ ..
    │   ├── scripts   ** Notebooks converted into Python 3 configurable scripts
    │   │   ├─ mlvtools_augment_train_data.py
    │   │   ├─ ..
    │── README.md
    │── requirements.yml
    │── setup.cfg
    │── setup.py

**Notebooks converted into configurable Python 3 scripts**: obtained by **Jupyter notebook** conversion.

**DVC pipeline steps**: DVC commands applied to the generated Python 3 scripts.

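As an illustration, each generated script is a standalone command-line program. Assuming the (hypothetical) file name from the tree above, its available parameters can be listed with:

    ./pipeline/scripts/mlvtools_augment_train_data.py --help
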
#### Apply the process

For each **Jupyter notebook**, a parameterizable and executable **Python 3** script is generated. This is how we version the code and automate its runs.

Pipelines are composed of **DVC** steps. Those steps can be generated directly from the **Jupyter notebook**, based on the parameters described in its docstring. (notebook -> python script -> DVC command)


73-
For steps which reused same code part (for example the evaluation step run once on train data then on test data) it is
74-
not useful to convert again the **Jupyter notebook** to a **Python 3** script then to a **DVC** command again. You just
75-
need to duplicate the corresponding **DVC** command file and then to edit inputs, outputs and meta file name.
122+
Each time a **DVC** step is run a **DVC meta file** (`[normalize_notebook_name].dvc`) is created. This metadata
123+
file represent a pipeline step, it is the DVC result of a step execution. Those files must be tacked using Git.
124+
They are used to reproduce a pipeline..
76125

77-
78-
Each time a **DVC** step is run a **DVC meta file** (`[normalize_notebook_name].dvc`) is created. This meta file represent
79-
a pipeline step, it is used to track outputs and dependencies. You can use **dvc repro [DVC meta file]** to re-run
80-
all needed pipeline steps to achieve the one corresponding to this meta file.
81-
82-
For each step in the tutorial the process remain the same.
126+
**Application:**
127+
>For each step in the tutorial the process remain the same.
83128
84129
1. Write a **Jupyter notebook** which corresponds to a pipeline step. (See the **Jupyter notebook** syntax section in the
   [MLVtools documentation](https://github.com/peopledoc/ml-versioning-tools).)
2. Test your **Jupyter notebook**.
3. Add it under git.
4. Convert the **Jupyter notebook** into a configurable and executable **Python 3** script using *ipynb_to_python*:

        ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]

@@ -106,12 +151,6 @@ For each step in the tutorial the process remain the same.
12. Add the **DVC meta file** under git.


## Key Features

|Need| Feature|
|:---|:---|
@@ -133,3 +172,35 @@ different parameters.

It is a bad idea to modify the generated **Python 3** scripts. They are generated from **Jupyter notebooks**, so changes should be made in the notebooks and the scripts should then be re-generated.

## Tutorial

#### Environment

To complete this tutorial, clone this repository:

    git clone https://github.com/peopledoc/mlv-tools-tutorial

Activate your **Python 3** virtual environment.

Install the requirements:

    make develop

All other steps are explained in each use case.

#### Cases

- [How DVC works](./tutorial/dvc_overview.md)
- [MLV-tools pipeline features (on simple cases)](./tutorial/pipeline_features.md)
- Going further with more realistic use cases:
    - [Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md)
    - [Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) (run an experiment)
    - [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md)
    - [Use Case 4: Hyperparameter optimisation and fine-tuning](./tutorial/use_case4.md)

tutorial/dvc_overview.md (+146)
@@ -0,0 +1,146 @@
Data Version Control
====================

Overview
--------

- Each run is tracked and reproducible
- Each run can be part of a pipeline
- A complete pipeline is reproducible according to a chosen version
  (i.e. a chosen commit)
- The cache mechanism allows reproducing sub-pipelines (only the parts with outdated dependencies)
- Several kinds of storage can be configured to handle data files (AWS S3, Azure,
  Google Cloud Storage, SSH, HDFS)

- You need to be rigorous: the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- Commands can't be run through a Jupyter notebook
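
For example, a default remote storage can be configured with `dvc remote add` (a sketch; the bucket URL is hypothetical):

    dvc remote add -d myremote s3://my-bucket/dvc-storage
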

How it works
------------

**DVC** depends on **Git**: you need to have a Git repository and to manage your
*code* versioning yourself.
You should consider **DVC** as a Git extension.

1. As usual, create a git repository and version your files
2. Activate DVC (`dvc init`)
3. Add data files and manage their versioning with DVC (`dvc add [my_file]`).
   At this step DVC puts the data files in its cache and creates meta files to
   identify them.
   (see the **Add a data file** section)
4. Commit the meta files using Git to save a version of the pipeline

Small tutorial
--------------

### Install DVC

    pip install dvc
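
You can then check the installation:

    dvc --version
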

### Setup a git environment

    mkdir test_dvc
    cd test_dvc

    git init

    # Create a Python script which takes a file as input, reads it, and writes its content back in upper case
    mkdir code
    cat > code/python_script.py << 'EOF'
    #!/usr/bin/env python
    with open('./data/input_file.txt', 'r') as fd, open('./results/output_file.txt', 'w') as wfd:
        wfd.write(fd.read().upper())
    EOF
    chmod +x ./code/python_script.py

    # Commit your script
    git add ./code/python_script.py
    git commit -m 'Initialize env'

### Setup DVC environment

    # In ./test_dvc (top-level directory)
    dvc init
    git commit -m 'Initialize dvc'

### Add a data file

    # Create a data file for the example
    mkdir data
    echo "This is a text" > data/input_file.txt

    dvc add data/input_file.txt

Here you can check that a meta file has been created (`git status data`), that the real file
is ignored by git (`cat ./data/.gitignore`) and that a cache entry has been created (`ls -la .dvc/cache/`).

    # Commit the meta files in git
    git add .
    git commit -m "Add input data file"
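
To convince yourself that Git tracks only the meta file while DVC stores the data, you can inspect what was just committed:

    git show --stat HEAD
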
### Run a step

    dvc run -d [input file] -o [output file] [cmd]

    mkdir results
    dvc run -d ./data/input_file.txt -o ./results/output_file.txt ./code/python_script.py

Check that the output file and the meta file have been generated: *./results/output_file.txt*, *./output_file.txt.dvc*.
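
Since the script upper-cases its input, the output file should contain `THIS IS A TEXT`:

    cat ./results/output_file.txt
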
### Run a pipeline

A pipeline is composed of several steps, so we need to create at least one more step here.

    # Run another step and create a pipeline
    MY_CMD="cat ./results/output_file.txt | wc -c > ./results/nb_letters.txt"
    dvc run -d ./results/output_file.txt -o ./results/nb_letters.txt -f MyPipeline.dvc $MY_CMD

See the result:

    cat ./results/nb_letters.txt

At this step the file *./MyPipeline.dvc* represents the pipeline for the current version of the files and data.

    # Reproduce the pipeline
    dvc repro MyPipeline.dvc

Nothing happens because nothing has changed; try `dvc repro MyPipeline.dvc -v` to see why.

    # Force the pipeline run
    dvc repro MyPipeline.dvc -v -f

    git add .
    git commit -m 'pipeline creation'

### Modify the input and re-run

    echo "new input" >> data/input_file.txt

    dvc repro MyPipeline.dvc -v

    cat ./results/nb_letters.txt

    git commit -am 'New pipeline version'
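
Two versions of the pipeline are now committed. To get back to the first one, restore the meta files with Git and the corresponding data with DVC (the commit hash comes from `git log`):

    git checkout <first_commit_hash>
    dvc checkout
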
### See pipeline steps

    dvc pipeline show MyPipeline.dvc
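
Depending on your DVC version, the pipeline may also be rendered as an ASCII graph (an assumption to verify with `dvc pipeline show --help`):

    dvc pipeline show --ascii MyPipeline.dvc
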

Need to be rigorous
-------------------

- the inputs and outputs of each run must be explicitly
  specified so they can be handled as dependencies
- when you modify a data file you need to re-run the associated step to be able
  to version it (or reproduce the whole pipeline using the cache mechanism)

Various
-------

See the [Data Version Control documentation](https://github.com/iterative/dvc)

See the [Data Version Control tutorial](https://blog.dataversioncontrol.com/data-version-control-tutorial-9146715eda46)
