Commit 53ed474

Tuto how to: re-use step, reproduce, change branch, shortcuts
1 parent 86733d6 commit 53ed474

File tree

4 files changed: +171 -27 lines


resources/setup_project/project/Makefile

+1

@@ -9,4 +9,5 @@ help:
 
 #: setup - Install dependencies.
 setup:
+	pip install cython
 	pip install -e . -r ./requirements.txt

resources/setup_project/project/setup.py

+1 -1

@@ -5,4 +5,4 @@
 from setuptools import setup
 
 if __name__ == '__main__':
-    setup()
+    setup(name='tuto_project')

tutorial/setup_project.md

+169 -26

@@ -1,14 +1,29 @@
 Setup a project using MLV-tools
 ===============================
 
-**GOAL**: TODO
+
+The aim of this tutorial is to understand how to set up a Machine Learning project
+development environment using MLV-tools. It explains how to:
+
+- Generate Python 3 scripts and a DVC pipeline from Jupyter Notebooks
+- Re-use pipeline steps with different I/O and parameters
+- Create an experiment using git branches
+- Re-run a pipeline with input changes
+
+
 
 Project example
 ----------------
 
 This tutorial is based on a text classification pipeline.
 
-**Dataset:** a set of labeled reviews from TripAdvisor (TODO ref) (review + rating)
+**Dataset:** a set of labeled reviews from TripAdvisor.
+
+> This dataset is a cleaned extract (2) of the publicly available TripAdvisor dataset (1).
+
+> (1) Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: A rating regression approach. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), pp. 783–792. Washington, US (2010)
+
+> (2) Marcheggiani, D., Täckström, O., Esuli, A., Sebastiani, F.: Hierarchical Multi-Label Conditional Random Fields for Aspect-Oriented Opinion Mining. In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014)
 
 To each review is associated a star rating from 1 to 5. We treat these values as categorical and tackle the problem as
 a classification problem given the small number of labels.
@@ -45,9 +60,6 @@ Create a workdir and copy resources:
     git add .
     git commit -m 'Project Initialization'
 
-    dvc init
-    git commit -m 'Project DVC Initialization'
-
 
 **Project structure:**
 
@@ -75,19 +87,31 @@ Setup the environment
 
 
     cd ..
-    virtualenv venv -p /usr/bin/python3
+    virtualenv venv -p /usr/bin/python3.6 (in the provided Docker image: /usr/local/bin/python3.6)
     . ./venv/bin/activate
 
 - Or a conda env
 
+
     cd ..
     conda create -n venv python=3 pip
     conda activate venv
 
 Install dependencies:
 
-    make -C project setup
     cd ./project
+    make setup
+
+
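+To check that the environment is ready before going further, a couple of quick sanity checks (illustrative only):
+
+    # Both commands should resolve from inside the activated environment
+    dvc --version
+    ipynb_to_dvc --help
+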
+Initialize DVC
+---------------
+
+In `sandbox/project`, run:
+
+    dvc init
+    git commit -m 'Project DVC Initialization'
+
+
 
 Step 1: create the project configuration
 ----------------------------------------
@@ -280,6 +304,16 @@ Perform the **Step 3** for all remaining notebooks.
 MLV-tools notebooks are available in: `./resources/setup_project/solution/mlvtools`
 
     cp ../../resources/setup_project/solution/mlvtools/* ./notebooks
+
+Then run:
+
+    ipynb_to_dvc -n ./notebooks/extract_data.ipynb
+    ipynb_to_dvc -n ./notebooks/preprocess_data.ipynb
+    ipynb_to_dvc -n ./notebooks/split_dataset.ipynb
+    ipynb_to_dvc -n ./notebooks/train_data_model.ipynb
+    ipynb_to_dvc -n ./notebooks/evaluate_model.ipynb
+
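+The same conversions can also be scripted in one pass; a minimal sketch, assuming every notebook under `./notebooks` belongs to the pipeline:
+
+    # Hypothetical convenience loop: convert each notebook found in ./notebooks
+    for notebook in ./notebooks/*.ipynb; do
+        ipynb_to_dvc -n "$notebook"
+    done
+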
 </details>
 
 <details>
@@ -376,6 +410,11 @@ Then run each step once using DVC commands.
 4 directories, 15 files
 </details>
 
+See the execution result:
+
+    cat ./data/result/metrics_test.txt
+
+
 Step 5: reuse a step
 --------------------
 
@@ -388,35 +427,139 @@ The code will be exactly the same than the one for the evaluation on the **test
 
     cp ./pipeline/dvc/mlvtools_evaluate_model_dvc ./pipeline/dvc/mlvtools_evaluate_model_train_dvc
 
-2. Update **dataset** dependency
+2. Update the **dataset** dependency in the new step `./pipeline/dvc/mlvtools_evaluate_model_train_dvc`
 
-There is a set of variables, you need to update :
-
-    MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
-    MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
-    MODEL_PATH="./data/model/classifier.bin"
-    TEST_DATASET_PATH="./data/intermediate/test_dataset.txt"
-    METRICS_PATH="./data/result/metrics_test.txt"
-
-    # META FILENAME, MODIFY IF DUPLICATE
-    MLV_DVC_META_FILENAME="mlvtools_evaluate_model.dvc"
-
+The following set of **bash variables** is used as `dvc run` command arguments.
+<pre>
+MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
+MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
+MODEL_PATH="./data/model/classifier.bin"
+<b>DATASET_PATH="./data/intermediate/test_dataset.txt"</b>
+<b>METRICS_PATH="./data/result/metrics_test.txt"</b>
+
+# META FILENAME, MODIFY IF DUPLICATE
+<b>MLV_DVC_META_FILENAME="mlvtools_evaluate_model.dvc"</b>
+</pre>
 
+We need to update:
+
+- **MLV_DVC_META_FILENAME**: the DVC step name.
+- **I/O**: DATASET_PATH and METRICS_PATH to provide the train dataset and generate the new metrics output.
 
+Update `./pipeline/dvc/mlvtools_evaluate_model_train_dvc` with:
+<pre>
+MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
+MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
+MODEL_PATH="./data/model/classifier.bin"
+DATASET_PATH="./data/intermediate/<b>train</b>_dataset.txt"
+METRICS_PATH="./data/result/metrics_<b>train</b>.txt"
+
+# META FILENAME, MODIFY IF DUPLICATE
+MLV_DVC_META_FILENAME="mlvtools_evaluate_model<b>_train</b>.dvc"
+</pre>
+
+
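+For orientation, here is a minimal sketch of how these variables typically feed a `dvc run` call. It is illustrative only: the exact command file generated by MLV-tools may differ, and the positional arguments passed to the Python script are hypothetical.
+
+    # Illustrative sketch only -- not the exact MLV-tools command file
+    dvc run -f "$MLV_DVC_META_FILENAME" \
+            -d "$MLV_PY_CMD_PATH" -d "$MODEL_PATH" -d "$DATASET_PATH" \
+            -o "$METRICS_PATH" \
+            python "$MLV_PY_CMD_PATH" "$DATASET_PATH" "$MODEL_PATH" "$METRICS_PATH"
+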
 3. Run the new command to create the new step.
 
-    ./pipeline/dvc/...
 
-Step 6: reproduce a pipeline
+    ./pipeline/dvc/mlvtools_evaluate_model_train_dvc
+    git add .
+    git commit -m 'Evaluate model on train data step run'
+
+<details>
+<summary>See data directory</summary>
+
+    #:> tree -a ./data/
+    ./data/
+    ├── input
+    │   ├── .gitignore
+    │   ├── conf.json
+    │   ├── conf.json.dvc
+    │   ├── trip_advisor.json
+    │   └── trip_advisor.json.dvc
+    ├── intermediate
+    │   ├── .gitignore
+    │   ├── extracted_data.json
+    │   ├── preprocessed_data.json
+    │   ├── test_dataset.txt
+    │   └── train_dataset.txt
+    ├── model
+    │   ├── .gitignore
+    │   └── classifier.bin
+    └── result
+        ├── .gitignore
+        ├── metrics_test.txt
+        └── metrics_train.txt
+
+    4 directories, 15 files
+</details>
+
+Step 6: see the pipeline
+------------------------
+
+It is possible to visualize the pipeline with the `dvc pipeline show` command.
+
+Run:
+
+    dvc pipeline show ./mlvtools_evaluate_model* --ascii
+
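+The same command also works on a single step meta file if you only want the chain leading to that step; for example, with a step used in this tutorial:
+
+    # Show only the pipeline ending at the training step
+    dvc pipeline show ./mlvtools_train_data_model.dvc --ascii
+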
+Step 7: reproduce a pipeline
 -----------------------------
 
-Modify the training learning rate in `./data/input/conf.json`
+DVC handles dependencies between steps, so it is possible to modify the configuration file and then
+reproduce only the steps needed to compute the updated metrics.
 
-Reproduce out of date step to obtain metrics:
+- Modify the training learning rate in `./data/input/conf.json`
 
-    dvc repro ....dvc
+- Run `dvc status` to see the impacted steps.
+
+      #:> dvc status
+      WARNING: Corrupted cache file .dvc/cache/c9/730ae77c8c040833a6c8588c0ec2e5.
+      mlvtools_train_data_model.dvc:
+          changed deps:
+              modified: data/input/conf.json
+      data/input/conf.json.dvc:
+          changed outs:
+              not in cache: data/input/conf.json
+
+- Track the pipeline input change
+
+      dvc add ./data/input/conf.json
+
+- Run `dvc status` again to see the remaining out-of-date steps
+
+      #:> dvc status
+      mlvtools_train_data_model.dvc:
+          changed deps:
+              modified: data/input/conf.json
+
+- Re-compute the metrics steps (i.e. `mlvtools_evaluate_model.dvc` and `mlvtools_evaluate_model_train.dvc`)
+
+      dvc repro mlvtools_evaluate_model*
+      git checkout -b exp_bigger_learning_rate && git add .
+      git commit -m 'Increased learning rate results'
+
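+To double-check the reproduction, `dvc status` can be run once more (it should report no remaining changes) and the regenerated metrics can be inspected directly:
+
+    dvc status
+    cat ./data/result/metrics*
+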
+Step 8: switch branches
+------------------------
+
+After step 7 you should be on the **exp_bigger_learning_rate** Git branch.
+
+Display metrics from **exp_bigger_learning_rate**:
+
+    cat ./data/result/metrics*
+
+Display metrics from the **master** branch:
+
+    git checkout master
+    dvc checkout
+    cat ./data/result/metrics*
+
+
+**Important**: don't forget to run **dvc checkout** after each **git checkout**.
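+If you tend to forget it, a small shell helper can combine the two commands; a minimal sketch (the `checkout-all` name is just an example):
+
+    # Hypothetical helper: switch Git branch and sync DVC-tracked files in one step
+    checkout-all() {
+        git checkout "$@" && dvc checkout
+    }
+    checkout-all master
+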
 
-
 
 
-TODO: do not forget to mention resources for tripadvisor data and medium topic
+You have reached the end of this tutorial. Your project is now set up to run with DVC. To explore
+more complex cases, try the [other tutorials](https://github.com/peopledoc/mlv-tools-tutorial#tutorial).
