Commit 53ed474

Tuto how to: re-use step, reproduce, change branch, shortcuts
1 parent 86733d6 commit 53ed474

File tree

4 files changed: +171 -27 lines


resources/setup_project/project/Makefile

+1

@@ -9,4 +9,5 @@ help:
 
 #: setup - Install dependencies.
 setup:
+	pip install cython
 	pip install -e . -r ./requirements.txt

resources/setup_project/project/setup.py

+1 -1

@@ -5,4 +5,4 @@
 from setuptools import setup
 
 if __name__ == '__main__':
-    setup()
+    setup(name='tuto_project')

tutorial/setup_project.md

+169 -26

@@ -1,14 +1,29 @@
 Setup a project using MLV-tools
 ===============================
 
-**GOAL**: TODO
+
+The aim of this tutorial is to understand how to set up a Machine Learning project
+development environment using MLV-tools. It explains how to:
+
+- Generate Python 3 scripts and a DVC pipeline from Jupyter Notebooks
+- Re-use pipeline steps with different I/O and parameters
+- Create an experiment using git branches
+- Re-run a pipeline with input changes
+
+
 
 Project example
 ----------------
 
 This tutorial is based on a text classification pipeline.
 
-**Dataset:** a set of labeled reviews from TripAdvisor (TODO ref) (review + rating)
+**Dataset:** a set of labeled reviews from TripAdvisor.
+
+> This dataset is a cleaned extract (2) of the publicly available TripAdvisor dataset (1).
+
+> (1) Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: A rating regression approach. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), pp. 783–792. Washington, US (2010)
+
+> (2) Marcheggiani, D., Täckström, O., Esuli, A., Sebastiani, F.: Hierarchical Multi-Label Conditional Random Fields for Aspect-Oriented Opinion Mining. In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014)
 
 To each review is associated a star rating from 1 to 5. We treat these values as categorical and tackle the problem as
 a classification problem given the small number of labels.
@@ -45,9 +60,6 @@ Create a workdir and copy resources:
     git add .
     git commit -m 'Project Initialization'
 
-    dvc init
-    git commit -m 'Project DVC Initialization'
-
 
 **Project structure:**
 
@@ -75,19 +87,31 @@ Setup the environment
 
 
     cd ..
-    virtualenv venv -p /usr/bin/python3
+    virtualenv venv -p /usr/bin/python3.6 (in the provided Docker image: /usr/local/bin/python3.6)
     . ./venv/bin/activate
 
 - Or a conda env
 
+
     cd ..
     conda create -n venv python=3 pip
     conda activate venv
 
 Install dependencies:
 
-    make -C project setup
     cd ./project
+    make setup
+
+
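+To check that the environment is ready before going further, a couple of quick sanity checks (illustrative only):
+
+    # Both commands should resolve from inside the activated environment
+    dvc --version
+    ipynb_to_dvc --help
+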
+Initialize DVC
+---------------
+
+In `sandbox/project`, run:
+
+    dvc init
+    git commit -m 'Project DVC Initialization'
+
+
 
 Step 1: create the project configuration
 ----------------------------------------
@@ -280,6 +304,16 @@ Perform the **Step 3** for all remaining notebooks.
 MLV-tools notebooks are available in: `./resources/setup_project/solution/mlvtools`
 
     cp ../../resources/setup_project/solution/mlvtools/* ./notebooks
+
+Then run:
+
+    ipynb_to_dvc -n ./notebooks/extract_data.ipynb
+    ipynb_to_dvc -n ./notebooks/preprocess_data.ipynb
+    ipynb_to_dvc -n ./notebooks/split_dataset.ipynb
+    ipynb_to_dvc -n ./notebooks/train_data_model.ipynb
+    ipynb_to_dvc -n ./notebooks/evaluate_model.ipynb
+
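+The same conversions can also be scripted in one pass; a minimal sketch, assuming every notebook under `./notebooks` belongs to the pipeline:
+
+    # Hypothetical convenience loop: convert each notebook found in ./notebooks
+    for notebook in ./notebooks/*.ipynb; do
+        ipynb_to_dvc -n "$notebook"
+    done
+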
 </details>
 
 <details>
@@ -376,6 +410,11 @@ Then run each step once using DVC commands.
 4 directories, 15 files
 </details>
 
+See the execution result:
+
+    cat ./data/result/metrics_test.txt
+
+
 Step 5: reuse a step
 --------------------
 
@@ -388,35 +427,139 @@ The code will be exactly the same than the one for the evaluation on the **test
 
     cp ./pipeline/dvc/mlvtools_evaluate_model_dvc ./pipeline/dvc/mlvtools_evaluate_model_train_dvc
 
-2. Update **dataset** dependency
+2. Update the **dataset** dependency in the new step `./pipeline/dvc/mlvtools_evaluate_model_train_dvc`
 
-There is a set of variables, you need to update :
-
-    MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
-    MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
-    MODEL_PATH="./data/model/classifier.bin"
-    TEST_DATASET_PATH="./data/intermediate/test_dataset.txt"
-    METRICS_PATH="./data/result/metrics_test.txt"
-
-    # META FILENAME, MODIFY IF DUPLICATE
-    MLV_DVC_META_FILENAME="mlvtools_evaluate_model.dvc"
-
+The following set of **bash variables** is used as `dvc run` command arguments.
+<pre>
+MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
+MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
+MODEL_PATH="./data/model/classifier.bin"
+<b>DATASET_PATH="./data/intermediate/test_dataset.txt"</b>
+<b>METRICS_PATH="./data/result/metrics_test.txt"</b>
+
+# META FILENAME, MODIFY IF DUPLICATE
+<b>MLV_DVC_META_FILENAME="mlvtools_evaluate_model.dvc"</b>
+</pre>
 
+We need to update:
+
+- **MLV_DVC_META_FILENAME**: the DVC step name.
+- **I/O**: DATASET_PATH and METRICS_PATH to provide the train dataset and generate the new metrics output.
 
+Update `./pipeline/dvc/mlvtools_evaluate_model_train_dvc` with:
+<pre>
+MLV_PY_CMD_PATH="pipeline/script/mlvtools_evaluate_model.py"
+MLV_PY_CMD_NAME="mlvtools_evaluate_model.py"
+MODEL_PATH="./data/model/classifier.bin"
+DATASET_PATH="./data/intermediate/<b>train</b>_dataset.txt"
+METRICS_PATH="./data/result/metrics_<b>train</b>.txt"
+
+# META FILENAME, MODIFY IF DUPLICATE
+MLV_DVC_META_FILENAME="mlvtools_evaluate_model<b>_train</b>.dvc"
+</pre>
+
+
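+For orientation, here is a minimal sketch of how these variables typically feed a `dvc run` call. It is illustrative only: the exact command file generated by MLV-tools may differ, and the positional arguments passed to the Python script are hypothetical.
+
+    # Illustrative sketch only -- not the exact MLV-tools command file
+    dvc run -f "$MLV_DVC_META_FILENAME" \
+            -d "$MLV_PY_CMD_PATH" -d "$MODEL_PATH" -d "$DATASET_PATH" \
+            -o "$METRICS_PATH" \
+            python "$MLV_PY_CMD_PATH" "$DATASET_PATH" "$MODEL_PATH" "$METRICS_PATH"
+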
 3. Run the new command to create the new step.
 
-    ./pipeline/dvc/...
 
-Step 6: reproduce a pipeline
+    ./pipeline/dvc/mlvtools_evaluate_model_train_dvc
+    git add .
+    git commit -m 'Evaluate model on train data step run'
+
+<details>
+<summary>See data directory</summary>
+
+    #:> tree -a ./data/
+    ./data/
+    ├── input
+    │   ├── .gitignore
+    │   ├── conf.json
+    │   ├── conf.json.dvc
+    │   ├── trip_advisor.json
+    │   └── trip_advisor.json.dvc
+    ├── intermediate
+    │   ├── .gitignore
+    │   ├── extracted_data.json
+    │   ├── preprocessed_data.json
+    │   ├── test_dataset.txt
+    │   └── train_dataset.txt
+    ├── model
+    │   ├── .gitignore
+    │   └── classifier.bin
+    └── result
+        ├── .gitignore
+        ├── metrics_test.txt
+        └── metrics_train.txt
+
+    4 directories, 15 files
+</details>
+
+Step 6: see the pipeline
+------------------------
+
+It is possible to visualize the pipeline with the `dvc pipeline show` command.
+
+Run:
+
+    dvc pipeline show ./mlvtools_evaluate_model* --ascii
+
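+The same command also works on a single step meta file if you only want the chain leading to that step; for example, with a step used in this tutorial:
+
+    # Show only the pipeline ending at the training step
+    dvc pipeline show ./mlvtools_train_data_model.dvc --ascii
+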
+Step 7: reproduce a pipeline
 -----------------------------
 
-Modify the training learning rate in `./data/input/conf.json`
+DVC handles dependencies between steps, so it is possible to modify the configuration file and then
+reproduce only the steps needed to compute the updated metrics.
 
-Reproduce out of date step to obtain metrics:
+- Modify the training learning rate in `./data/input/conf.json`
 
-    dvc repro ....dvc
+- Run `dvc status` to see the impacted steps.
+
+      #:> dvc status
+      WARNING: Corrupted cache file .dvc/cache/c9/730ae77c8c040833a6c8588c0ec2e5.
+      mlvtools_train_data_model.dvc:
+          changed deps:
+              modified: data/input/conf.json
+      data/input/conf.json.dvc:
+          changed outs:
+              not in cache: data/input/conf.json
+
+- Track the pipeline input change
+
+      dvc add ./data/input/conf.json
+
+- Run `dvc status` again to see the remaining out-of-date steps
+
+      #:> dvc status
+      mlvtools_train_data_model.dvc:
+          changed deps:
+              modified: data/input/conf.json
+
+- Re-compute the metrics steps (i.e. `mlvtools_evaluate_model.dvc` and `mlvtools_evaluate_model_train.dvc`)
+
+      dvc repro mlvtools_evaluate_model*
+      git checkout -b exp_bigger_learning_rate && git add .
+      git commit -m 'Increased learning rate results'
+
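+To double-check the reproduction, `dvc status` can be run once more (it should report no remaining changes) and the regenerated metrics can be inspected directly:
+
+    dvc status
+    cat ./data/result/metrics*
+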
+Step 8: switch branches
+------------------------
+
+After step 7 you should be on the **exp_bigger_learning_rate** Git branch.
+
+Display metrics from **exp_bigger_learning_rate**:
+
+    cat ./data/result/metrics*
+
+Display metrics from the **master** branch:
+
+    git checkout master
+    dvc checkout
+    cat ./data/result/metrics*
+
+
+**Important**: don't forget to run **dvc checkout** after each **git checkout**.
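+If you tend to forget it, a small shell helper can combine the two commands; a minimal sketch (the `checkout-all` name is just an example):
+
+    # Hypothetical helper: switch Git branch and sync DVC-tracked files in one step
+    checkout-all() {
+        git checkout "$@" && dvc checkout
+    }
+    checkout-all master
+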
 
-
 
 
-TODO: do not forget to mention resources for tripadvisor data and medium topic
+You have reached the end of this tutorial. Your project is now set up to run with DVC. To explore
+more complex cases, try the [other tutorials](https://github.com/peopledoc/mlv-tools-tutorial#tutorial).
