Commit 8ea6eaf

Update tutorial material

Author: Éric Lemoine
1 parent c8d7365

8 files changed, +326 -321 lines changed

Makefile

+1
@@ -1,5 +1,6 @@
 SHELL := /bin/bash
 develop:
+	pip install cython
 	pip install -r ./requirements.txt
 
 init-struct:

README.md

+123-112
@@ -1,56 +1,55 @@
-# POC versioning Machine Learning pipeline
+# Machine Learning Pipeline Versioning Tutorial
 
 The aim of this repository is to show a way to handle **pipelining** and **versioning**
 of a **Machine Learning project**.
 
-Processes exposed during this tutorial are based of 3 existing tools:
+Processes exposed during this tutorial are based on three tools:
 
-- [Data Science Version Control](https://github.com/iterative/dvc) or DVC
-- [MLflow tracking](https://github.com/mlflow/mlflow)
-- [MLV-tools](https://github.com/peopledoc/ml-versioning-tools)
-
-Use cases are based on a text classification task on 20newsgroup dataset. A *dummy* tutorial is also available
-to show tools mechanisms.
+* [DVC](https://github.com/iterative/dvc)
+* [MLflow tracking](https://github.com/mlflow/mlflow)
+* [mlvtools](https://github.com/peopledoc/mlvtools)
 
+Use cases are based on a text classification task on the 20newsgroup dataset. A *dummy*
+tutorial is also available to show how the tools work.
 
-**Requirements:**
+**Prerequisites**
+
+For this tutorial, you must be familiar with the following tools:
 
-Before starting, you must be familiar with the following commands:
 - virtualenv or condaenv
 - make
 - git
-- python3
-
+- python
 
 ## Tools Overview
 
-DVC: an open-source tool for data science and machine learning projects. Use to version, share and reproduce.
-
-MLflow tracking: API and UI to log and visualize metrics obtained during experiments.
+DVC is an open-source version control system for Machine Learning projects. It is used
+for versioning and sharing Machine Learning data, and reproducing Machine Learning
+experiments and pipeline stages.
 
-MLV-tools: provides a set of tools to enhance Jupyter Notebooks conversion and DVC versioning and pipelining.
-
-
-Please have a look to the [presentation](https://peopledoc.github.io/mlv-tools-tutorial/talks/pyData/presentation.html)
+mlvtools provides tools to generate Python scripts and DVC commands from Jupyter
+Notebooks.
 
+Please have a look at the
+[presentation](https://peopledoc.github.io/mlvtools-tutorial/talks/pyData/presentation.html).
 
 ## Our main features
 
-- Notebook parametrized conversion ([MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
-
-- Pipelining ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
-
-- Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
-
+* Notebook parametrized conversion ([mlvtools](https://github.com/peopledoc/mlvtools))
+* Pipelining ([DVC](https://github.com/iterative/dvc) and
+[mlvtools](https://github.com/peopledoc/mlvtools))
+* Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and
+[mlvtools](https://github.com/peopledoc/mlvtools))
 
 ## Standard Versioning Process Establishment
 
 **Goal:** find a way to version code, data and pipelines.
 
-#### Existing project
+### Initial project
 
-Starting from an existing project composed of Python 3 module(s) and a set of **Jupyter notebooks**,
-we want to create an automated pipeline in order to version, share and reproduce experiments.
+Starting from an existing project composed of Python module(s) and a set of Jupyter
+notebooks, we want to create an automated pipeline in order to version, share and
+reproduce experiments.
 
 
 │── classifier
@@ -70,16 +69,18 @@ we want to create an automated pipeline in order to version, share and reproduce
 │── requirements.yml
 │── setup.cfg
 │── setup.py
-
-The data flow is processed by applying steps and intermediary results are versioned using metadata files. These steps are defined in **Jupyter notebooks**, which are then converted to Python scripts.
+
+The data flow is processed by applying steps and intermediary results are versioned
+using metadata files. These steps are defined in Jupyter notebooks, which are then
+converted to Python scripts.
 
 Keep in mind that:
 
-- The reference for the code of the step remains in **Jupyter notebook**
+- The reference for the code of the step remains in the Jupyter notebook
 - Pipelines are structured according to their inputs and outputs
 - Hyperparameters are pipeline inputs
-
-#### Project after refactoring
+
+### Project after refactoring
 
 │── classifier
 │ ├── aggregate_classif.py
@@ -95,10 +96,10 @@ Keep in mind that:
 │ ├── Preprocess image data.ipynb
 │ └── Train CNN classifier on image data.ipynb
 │── pipeline
-│ ├── dvc ** DVC pipeline steps
-│ │ ├─ mlvtools_augment_train_data_dvc
+│ ├── dvc ** DVC pipeline steps
+│ │ ├─ mlvtools_augment_train_data_dvc
 │ │ ├─ ..
-│ ├── scripts ** Notebooks converted into Python 3 configurable scripts
+│ ├── scripts ** Notebooks converted into Python configurable scripts
 │ │ ├─ mlvtools_augment_train_data.py
 │ │ ├─ ..
 │── README.md
@@ -107,106 +108,116 @@ Keep in mind that:
 │── setup.py
 
 
-**Notebooks converted into configurable Python 3 scripts**: obtained by **Jupyter notebook** conversion.
+### Applying the process
 
-**DVC pipeline steps**: DVC command applied on generated Python 3 scripts
+For each Jupyter notebook, a parameterizable and executable Python script is generated.
+This is how the code is versioned and how its execution can be automated.
 
+Pipelines are composed of DVC steps. Those steps can be generated directly from the
+Jupyter notebook based on parameters described in the docstring (notebook -> Python
+script -> DVC command).
 
-#### Applying the process
+Each time a DVC step is run, a DVC meta file (`[normalize_notebook_name].dvc`) is
+created. This metadata file represents a pipeline step; it is the DVC result of a step
+execution. Those files must be tracked using Git. They are used to reproduce
+a pipeline.
 
-For each **Jupyter notebook** a **Python 3** parameterizable and executable script is generated. It is the way to
-version code and be able to automatize its run.
+**Application:**
 
-Pipelines are composed of **DVC** steps. Those steps can be generated directly from the **Jupyter notebook** based
-on parameters describe in the Docstring. (notebook -> python script -> DVC command)
+For each step in the tutorial, the process remains the same.
+
+1. Write a Jupyter notebook which corresponds to a pipeline step. (See Jupyter notebook
+syntax section in [mlvtools documentation](https://github.com/peopledoc/mlvtools))
+1. Test your Jupyter notebook.
+1. Add it under git.
+1. Convert the Jupyter notebook into a configurable and executable Python script
+using `ipynb_to_python`.
+```
+ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]
+```
+
+1. Ensure the executable and configurable Python script was correctly created in `./pipeline/steps/[python_script_name]`.
+```
+./pipeline/steps/[python_script_name] -h
+```
+1. Create a DVC command to run the Python script using DVC.
+```
+gen_dvc -i ./pipeline/steps/[python_script_name] \
+--out-dvc-cmd ./scripts/cmd/[dvc_cmd_name]
+```
+1. Ensure the DVC command was correctly created.
+1. Add the generated command and Python script under git.
+1. Add step inputs under DVC.
+1. Run the DVC command `./scripts/cmd/[dvc_cmd_name]`.
+1. Check that the DVC meta file `./[normalize_notebook_name].dvc` is created.
+1. Add the DVC meta file under git.
 
-Each time a **DVC** step is run a **DVC meta file** (`[normalize_notebook_name].dvc`) is created. This metadata
-file represent a pipeline step, it is the DVC result of a step execution. Those files must be tacked using Git.
-They are used to reproduce a pipeline..
-
-**Application:**
->For each step in the tutorial the process remain the same.
-
-1. Write a **Jupyter notebook** which correspond to a pipeline step. (See **Jupyter notebook** syntax section in
-[MLVtools documentation](https://github.com/peopledoc/ml-versioning-tools))
-2. Test your **Jupyter notebook**.
-3. Add it under git.
-4. Convert the **Jupyter notebook** into a configurable and executable **Python 3** script using *ipynb_to_python*.
-
-ipynb_to_python -n ./pipeline/notebooks/[notebook_name] -o ./pipeline/steps/[python_script_name]
-
-5. Ensure **Python 3** executable and configurable script is well created into `./pipeline/steps/[python_script_name]`.
-
-./pipeline/steps/[python_script_name] -h
-
-6. Create a **DVC** commands to run the **Python 3** script using **DVC**.
-
-gen_dvc -i ./pipeline/steps/[python_script_name] \
---out-dvc-cmd ./scripts/cmd/[dvc_cmd_name]
-
-7. Ensure **DVC** command is well created.
-8. Add generated command and **Python 3** script under git.
-9. Add step inputs under **DVC**.
-10. Run **DVC** command `./scripts/cmd/[dvc_cmd_name]`.
-11. Check **DVC meta file** is created `./[normalize notebook _name].dvc`
-12. Add **DVC meta file** under git/
-
 
 ## Key Features
+
 |Need| Feature|
 |:---|:---|
-| Ignore notebook cell | # No effect |
-| DVC input and ouptuts | **:dvc-in**, **:dvc-out**|
-| Add extra parameters | **:dvc-extra**|
-| Write DVC whole command | **:dvc-cmd**|
-| Convert Jupiter Notebook to Python 3 script | **ipynb_to_python**|
-| Generate DVC command | **gen_dvc**|
-| Create a pipeline step from a Jupiter Notebook | ipynb_to_python, gen_dvc |
-| Add a pipeline step with different IO | Copy **DVC** step then edit inputs, outputs and meta file name |
-| Reproduce a pipeline | **dvc repro [metafile]**|
-| Reproduce a pipeline with no cache | **dvc repro -f [metafile]**|
-| Reproduce a pipeline after an algo change | **dvc repro -f [metafile]** or run impacted step individually then complete the pipeline.|
-
-
-It is allowed to modify or duplicate a **DVC** command to change an hyperparameter or run a same step twice with
-different parameters.
-
-It is a bad idea to modify generated **Python 3** scripts. They are generated from **Jupyter notebooks**, so changes
-should be done in them and then scripts should be re-generated.
-
-
+| Ignore notebook cell | `# No effect` |
+| DVC inputs and outputs | `:dvc-in`, `:dvc-out`|
+| Add extra parameters | `:dvc-extra`|
+| Write the whole DVC command | `:dvc-cmd`|
+| Convert Jupyter Notebook to Python script | `ipynb_to_python`|
+| Generate DVC command | `gen_dvc`|
+| Create a pipeline step from a Jupyter Notebook | `ipynb_to_python`, `gen_dvc` |
+| Add a pipeline step with different IO | Copy DVC step then edit inputs, outputs and meta file name |
+| Reproduce a pipeline | `dvc repro [metafile]`|
+| Reproduce a pipeline with no cache | `dvc repro -f [metafile]`|
+| Reproduce a pipeline after an algo change | `dvc repro -f [metafile]` or run impacted step individually then complete the pipeline.|
+
+It is allowed to modify or duplicate a DVC command to change a hyperparameter or run
+the same step twice with different parameters.
+
+It is a bad idea to modify generated Python scripts. They are generated from Jupyter
+notebooks, so changes should be done in Jupyter notebooks and then scripts should be
+re-generated.
 
 ## Tutorial
 
-#### Environment
+### Environment
 
 To complete this tutorial clone this repository:
 
-git clone https://github.com/peopledoc/mlv-tools-tutorial
-
-Activate your **Python 3** virtual environment.
+```shell
+git clone https://github.com/peopledoc/mlvtools-tutorial
+```
 
-Install requirements:
+Create a Python virtual environment, and activate it:
 
-make develop
-
-All other steps are explain in each use cases.
+```shell
+virtualenv --python python3 venv
+source venv/bin/activate
+```
 
-#### Cases
+Install requirements:
 
-- [How DVC works](./tutorial/dvc_overview.md)
+```shell
+make develop
+```
 
-- [MLV-tools pipeline features (on simple cases)](./tutorial/pipeline_features.md)
+All other steps are explained in each use case.
 
-- Going further with more realistic use cases:
+### Cases
 
-- [Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md)
-- [Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) (Run an experiment)
-- [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md)
-- [Use Case 4: Hyperparameter optimisation and fine-tuning](./tutorial/use_case4.md)
+* [How DVC works](./tutorial/dvc_overview.md)
+* [mlvtools pipeline features (on simple cases)](./tutorial/pipeline_features.md)
+* Going further with more realistic use cases:
+* [Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md)
+* [Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) (Run an
+experiment)
+* [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md)
+* [Use Case 4: Hyperparameter optimisation and fine-tuning](./tutorial/use_case4.md)
 
 
 ## Talks
 
-- [PyData Paris - March 2019 Meetup](https://www.meetup.com/fr-FR/PyData-Paris/events/259187805/): [talk](https://peopledoc.github.io/mlv-tools-tutorial/talks/pyData/presentation.html)
-- [PyData Amsterdam - May 2019](https://pydata.org/amsterdam2019/schedule/presentation/32/): [tutorial](https://peopledoc.github.io/mlv-tools-tutorial/talks/workshop/presentation.html)
+* [PyData Paris - March 2019
+Meetup](https://www.meetup.com/fr-FR/PyData-Paris/events/259187805/):
+[talk](https://peopledoc.github.io/mlvtools-tutorial/talks/pyData/presentation.html)
+* [PyData Amsterdam - May
+2019](https://pydata.org/amsterdam2019/schedule/presentation/32/):
+[tutorial](https://peopledoc.github.io/mlvtools-tutorial/talks/workshop/presentation.html)
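
The conversion, generation and reproduction steps listed in the README diff above ("Applying the process" and "Key Features") can be sketched in Python as follows. This is only a minimal illustration that assembles the command lines without running them. The flags (`-n`, `-o`, `-i`, `--out-dvc-cmd`, `dvc repro -f`) are the ones shown in the README; the helper names `build_step_commands` and `build_repro_command`, and the example file names, are hypothetical and not part of mlvtools or DVC.

```python
import shlex


def build_step_commands(notebook, script, dvc_cmd):
    """Return the command lines for one pipeline step as argument lists.

    Hypothetical helper: it only mirrors the commands quoted in the README
    and does not execute anything.
    """
    # Notebook -> configurable Python script
    convert = ["ipynb_to_python", "-n", notebook, "-o", script]
    # Sanity check: the generated script should print its help
    check = [script, "-h"]
    # Python script -> DVC command file
    generate = ["gen_dvc", "-i", script, "--out-dvc-cmd", dvc_cmd]
    return [convert, check, generate]


def build_repro_command(metafile, force=False):
    """Return the `dvc repro` command line for a DVC meta file."""
    cmd = ["dvc", "repro"]
    if force:
        # -f forces the step to run again even if nothing changed
        cmd.append("-f")
    cmd.append(metafile)
    return cmd


if __name__ == "__main__":
    # Example file names are assumptions chosen for illustration only.
    steps = build_step_commands(
        "./pipeline/notebooks/train.ipynb",
        "./pipeline/steps/mlvtools_train.py",
        "./scripts/cmd/mlvtools_train_dvc",
    )
    for argv in steps + [build_repro_command("./mlvtools_train.dvc")]:
        print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps quoting explicit, so the same lists could later be passed to `subprocess.run` if actual execution is wanted.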

requirements.txt

+2-3
@@ -1,10 +1,9 @@
-scikit-learn==0.19
+scikit-learn
 dvc
 mlflow
 jupyter
 pandas
 numpy
 nltk
-cython
 pyfasttext
-ml-versioning-tools
+mlvtools
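
The requirements change above removes the only version pin (`scikit-learn==0.19`). As a rough illustration of what that means for the file, the sketch below splits requirement lines into pinned and unpinned entries. The helper name `split_pinned` is hypothetical, and the check is a plain substring test for `==`, not a full requirements parser.

```python
def split_pinned(requirement_lines):
    """Split requirement lines into (pinned, unpinned) lists.

    Hypothetical helper: an entry counts as pinned only if it contains '=='.
    Blank lines and comment lines are ignored.
    """
    pinned, unpinned = [], []
    for raw in requirement_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "==" in line:
            pinned.append(line)
        else:
            unpinned.append(line)
    return pinned, unpinned


# Contents of requirements.txt before and after the commit, taken from the diff.
OLD = ["scikit-learn==0.19", "dvc", "mlflow", "jupyter", "pandas",
       "numpy", "nltk", "cython", "pyfasttext", "ml-versioning-tools"]
NEW = ["scikit-learn", "dvc", "mlflow", "jupyter", "pandas",
       "numpy", "nltk", "pyfasttext", "mlvtools"]

if __name__ == "__main__":
    print("pinned before:", split_pinned(OLD)[0])
    print("pinned after:", split_pinned(NEW)[0])
```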
