| 1 | +<section> |
| 2 | + <p>Enter <span class="emph"> DVC </span></p> |
| 3 | + <aside class="notes"> |
| 4 | + </aside> |
| 5 | +</section> |
| 6 | + |
| 7 | +<section> |
| 8 | + <h5>DVC - Data Version Control</h5> |
| 9 | + <div class="fragment"><span class="emph">Open-source</span> Version Control System for Machine Learning Projects</div> |
| 10 | + <blockquote class="fragment fade-up">"For data scientists, by data scientists"</blockquote> |
| 11 | + <p class="fragment"> |
| 12 | + <a href="https://github.com/iterative/dvc"> |
| 13 | + <img class="icon" src="./img/GitHub-logo.png" alt="GitHub logo" style="width: 1em"> iterative/dvc |
| 14 | + </a> |
| 15 | + </p> |
| 16 | + <p class="fragment"> |
| 17 | + <a href="https://youtu.be/4h6I9_xeYA4"> |
| 18 | + <img class="icon" src="./img/icons/youtube.png" alt="youtube logo" style="width: 1em">how it works! |
| 19 | + </a> |
| 20 | + </p> |
| 21 | + |
| 22 | + <aside class="notes"> |
| 23 | + @SBI<br/> |
| 24 | + |
| 25 | + - Open source - Apache License<br/> |
| 26 | + - Versioning tool for ML<br/> |
| 27 | + - for... by ...<br/> |
| 28 | + |
| 29 | + - see ... <br/> |
| 30 | + TRANSITION: Not just a versioning tool |
| 31 | + </aside> |
| 32 | +</section> |
| 33 | + |
| 34 | +<section> |
| 35 | + <h5>DVC - Data Version Control</h5> |
| 36 | + <div> |
| 37 | + <img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/dvc_home_page1.png"> |
| 38 | + <img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/dvc_home_page2.png"> |
| 39 | + <img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/dvc_home_page3.png"> |
| 40 | + </div> |
| 41 | + <aside class="notes"> |
| 42 | + @SBI |
| 43 | + - Versioning tool => DATA cache mechanism (REMOTE or LOCAL) <br/> |
| 44 | + => Metadata under Git (close to Git LFS)<br/><br/> |
| 45 | + |
| 46 | + - Amazing tool for running experiments<br/> |
| 47 | + - Project is handled as a PIPELINE composed of several STEPS and DEPENDENCIES<br/> |
| 48 | + |
| 49 | + - Collaboration: PUSH and PULL |
| 50 | + </aside> |
| 51 | +</section> |
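The notes above describe DVC as a data cache plus small metadata files kept under Git. A minimal sketch of that general idea follows; it is not DVC's actual implementation. The helper `add_to_cache`, the `.meta.json` suffix, and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file, cache_dir):
    """Copy a data file into a content-addressed cache and write a small metadata file.

    The large file lives in the cache under its hash; only the metadata file
    would be committed to Git.
    """
    data_file, cache_dir = Path(data_file), Path(cache_dir)
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    meta_file = data_file.parent / (data_file.name + ".meta.json")
    meta_file.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return meta_file


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train_set.csv"  # illustrative file name
    data.write_text("a,b\n1,2\n")
    meta = add_to_cache(data, Path(tmp) / "cache")
    print(meta.read_text())
```

Because the cache key is derived from the file content, the metadata file alone is enough to find the matching data again, whether the cache is local or remote.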
| 52 | +<section> |
| 53 | + <h5>DVC - How it works</h5> |
| 54 | + |
| 55 | + <div> |
| 56 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC1.png"> |
| 57 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC2.png"> |
| 58 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC3.png"> |
| 59 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC4.png"> |
| 60 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC5.png"> |
| 61 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC6.png"> |
| 62 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC7.png"> |
| 63 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC8.png"> |
| 64 | + <img class="fragment current-visible" src="./img/dvc/pipeline/DVC9.png"> |
| 65 | + </div> |
| 66 | + <p style="text-align: left"> Save all intermediate results: Metadata <span class="emph">+</span> data</p> |
| 67 | + <aside class="notes"> |
| 68 | + |
| 69 | + </aside> |
| 70 | +</section> |
| 71 | + |
| 72 | +<section> |
| 73 | + <h5>DVC - Reproduce only a sub-pipeline</h5> |
| 74 | + <div> |
| 75 | + <img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/pipeline/DVC9.png"> |
| 76 | + <img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/pipeline/DVC_change0.png"> |
| 77 | + <img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/pipeline/DVC_change1.png"> |
| 78 | + <img class="fragment current-visible" data-fragment-index="3" src="./img/dvc/pipeline/DVC_change2.png"> |
| 79 | + <img class="fragment current-visible" data-fragment-index="4" src="./img/dvc/pipeline/DVC_change2bis.png"> |
| 80 | + <img class="fragment current-visible" data-fragment-index="5" src="./img/dvc/pipeline/DVC_change2bis.png"> |
| 81 | + <pre class="fragment current-visible" data-fragment-index="5"><code>dvc repro evaluate.dvc</code></pre> |
| 82 | + <img class="fragment current-visible" data-fragment-index="6" src="./img/dvc/pipeline/DVC_change3.png"> |
| 83 | + <pre class="fragment current-visible" data-fragment-index="6"><code>dvc repro evaluate.dvc</code></pre> |
| 84 | + <img class="fragment current-visible" data-fragment-index="7" src="./img/dvc/pipeline/DVC_change4.png"> |
| 85 | + <pre class="fragment current-visible" data-fragment-index="7"><code>dvc repro evaluate.dvc</code></pre> |
| 86 | + <img class="fragment current-visible" data-fragment-index="8" src="./img/dvc/pipeline/DVC_change5.png"> |
| 87 | + <pre class="fragment current-visible" data-fragment-index="8"><code>dvc repro evaluate.dvc</code></pre> |
| 88 | + <img class="fragment current-visible" data-fragment-index="9" src="./img/dvc/pipeline/DVC_change5.png"> |
| 89 | + <pre class="fragment current-visible" data-fragment-index="9"><code>dvc repro evaluate.dvc</code></pre> |
| 90 | + </div> |
| 91 | + <p class="fragment current-visible" data-fragment-index="9" style="text-align: left">Do not re-run time-consuming tasks</p> |
| 92 | + <aside class="notes"> |
| 93 | + @SBI<br/> |
| 94 | + - Change param<br/> |
| 95 | + - Want to recalculate METRICS<br/> |
| 96 | + - run DVC ... evaluate step<br/> |
| 97 | + - DETECTS out of date steps<br/> |
| 98 | + - Re-run only the needed tasks<br/> |
| 99 | + - NO time-consuming re-runs |
| 100 | + |
| 101 | + |
| 102 | + </aside> |
| 103 | +</section> |
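The behaviour shown on this slide (change a parameter, ask for the evaluate step, re-run only what is out of date) can be illustrated with a small sketch. This is a simplified model of stale-step detection, not DVC's implementation; the function `stale_steps` and every file name in the pipeline below are hypothetical.

```python
def stale_steps(pipeline, changed):
    """Return the steps that must re-run when the paths in `changed` were modified.

    `pipeline` maps a step name to (dependencies, outputs), listed in execution order.
    """
    dirty = set(changed)
    to_run = []
    for name, (deps, outs) in pipeline.items():
        if dirty.intersection(deps):
            to_run.append(name)
            dirty.update(outs)  # outputs of a re-run step are out of date too
    return to_run


# Hypothetical three-step pipeline; all file names are illustrative.
pipeline = {
    "extract": (["data.tgz", "extract.py"], ["train_set.csv"]),
    "train": (["train_set.csv", "train.py"], ["model.pkl"]),
    "evaluate": (["model.pkl", "evaluate.py", "eval_params.json"], ["metrics.json"]),
}

print(stale_steps(pipeline, changed=["eval_params.json"]))  # ['evaluate']
print(stale_steps(pipeline, changed=["train.py"]))  # ['train', 'evaluate']
```

Changing only an evaluation parameter leaves the extract and train steps untouched, which is why the time-consuming tasks are not repeated.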
| 104 | + |
| 105 | + |
| 106 | +<section> |
| 107 | + <h5>How to use</h5> |
| 108 | + <p> |
| 109 | + <pre><code>dvc run -d <span class="step_input">[input_dep]</span> -o <span class="step_output">[output]</span> [command]</code></pre> |
| 110 | + </p> |
| 111 | +</section> |
| 112 | + |
| 113 | +<section> |
| 114 | + <h5>How to use - example</h5> |
| 115 | + <div class="fragment"> |
| 116 | + <p>Step: |
| 117 | + <pre><code>extract.py --input ./data.tgz --out ./out/train_set.csv</code></pre> |
| 118 | + </p> |
| 119 | + </div> |
| 120 | + <div class="fragment">With DVC: |
| 121 | + <pre class="fragment current-visible"><code> |
| 122 | +<span class="step_cmd">dvc run -d ./data.tgz -o ./out/train_set.csv \</span> |
| 123 | + extract.py --input ./data.tgz --out ./out/train_set.csv |
| 124 | + </code></pre> |
| 125 | + <pre class="fragment current-visible"><code> |
| 126 | +dvc run -d ./data.tgz -o ./out/train_set.csv \ |
| 127 | + extract.py --input ./data.tgz --out ./out/train_set.csv |
| 128 | + </code></pre> |
| 129 | + <pre class="fragment current-visible"><code> |
| 130 | +dvc run -d <span class="step_input">./data.tgz</span> -o ./out/train_set.csv \ |
| 131 | + extract.py --input <span class="step_input">./data.tgz</span> --out ./out/train_set.csv |
| 132 | + </code></pre> |
| 133 | + <pre class="fragment current-visible"><code> |
| 134 | +dvc run -d <span class="step_input">./data.tgz</span> -o <span class="step_output">./out/train_set.csv</span> \ |
| 135 | + extract.py --input <span class="step_input">./data.tgz</span> --out <span class="step_output">./out/train_set.csv</span> |
| 136 | + </code></pre> |
| 137 | + </p> |
| 138 | + </div> |
| 139 | + <aside class="notes"> |
| 140 | + @SBI:<br/> |
| 141 | + - python step <br/> |
| 142 | + - add DVC + dep + out<br/> |
| 143 | + - TEDIOUS |
| 144 | + |
| 145 | + |
| 146 | + </aside> |
| 147 | +</section> |
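The notes call writing these commands by hand tedious: every dependency and output appears twice, once for DVC and once for the script. A small sketch of assembling the command shown on the slide from lists, so each path is written once, might look like this. The helper `build_dvc_run` is hypothetical; it only builds the command string and does not invoke DVC.

```python
import shlex


def build_dvc_run(deps, outs, command):
    """Assemble a `dvc run` argument list from dependencies, outputs and the step command."""
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    return args + list(command)


data_in = "./data.tgz"
data_out = "./out/train_set.csv"
cmd = build_dvc_run(
    deps=[data_in],
    outs=[data_out],
    command=["extract.py", "--input", data_in, "--out", data_out],
)
print(shlex.join(cmd))
```

Reusing `data_in` and `data_out` for both the DVC flags and the script arguments removes one source of the inconsistencies mentioned later in the deck.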
| 148 | +<section> |
| 149 | + <h5>DVC</h5> |
| 150 | + <div class="two-halves"> |
| 151 | + <div class="half fragment"> |
| 152 | + <h4>... is great! </h4> |
| 153 | + <ul> |
| 154 | + <li class="fragment">Handle dependencies</li> |
| 155 | + <li class="fragment">Cache mechanism</li> |
| 156 | + <li class="fragment">Reproducibility</li> |
| 157 | + <li class="fragment">Facilitate collaboration</li> |
| 158 | + <li class="fragment">Language agnostic</li> |
| 159 | + </ul> |
| 160 | + </div> |
| 161 | + <div class="half fragment"> |
| 162 | + <h4>... but </h4> |
| 163 | + <ul> |
| 164 | + <li class="fragment">Risk of inconsistencies</li> |
| 165 | + <li class="fragment">Tedious to write/setup</li> |
| 166 | + </ul> |
| 167 | + </div> |
| 168 | + </div> |
| 169 | + <aside class="notes"> |
| 170 | + @SDG<br/> |
| 171 | + 1. Step order => Track dependencies (know which steps are needed to achieve a targeted step)<br/> |
| 172 | + 2. Cache mechanism => save intermediate results + reproduce sub-pipeline (avoid time-consuming tasks)<br/> |
| 173 | + 3. Reproduce on any server setup<br/> |
| 174 | + 4. Share data<br/> |
| 175 | + 5. works with any executable<br/> |
| 176 | + </aside> |
| 177 | +</section> |
| 178 | +<section> |
| 179 | + <h5>MLV-tools gen_dvc</h5> |
| 180 | + <pre><code>gen_dvc -i ./script.py -o ./commands/script_dvc</code></pre> |
| 181 | + |
| 182 | +</section> |
| 183 | +<section> |
| 184 | + <h5>The script</h5> |
| 185 | + <img class="fragment current-visible" data-fragment-index="2" src="./img/mlv_convert/script2.png"> |
| 186 | + <aside class="notes"> |
| 187 | + Previously generated with ipynb_to_dvc |
| 188 | + </aside> |
| 189 | +</section> |
| 190 | +<section> |
| 191 | + |
| 192 | + <div class="fragment current-visible"> |
| 193 | + <img src="./img/dvc/script_docstring_extract.png"/> |
| 194 | + <pre><code> |
| 195 | +""" |
| 196 | +:param str subset: Subset of data to load {'train', 'test'} |
| 197 | +:param str data_in: File directory path |
| 198 | +:param str output_path: Output file path |
| 199 | +""" |
| 200 | + </code></pre></div> |
| 201 | + <div class="fragment current-visible"> |
| 202 | + <img src="./img/dvc/script_docstring_extract.png"/> |
| 203 | + <pre><code> |
| 204 | +""" |
| 205 | +:param str subset: Subset of data to load {'train', 'test'} |
| 206 | +:param str <span class="highlight">data_in</span>: File directory path |
| 207 | +:param str output_path: Output file path |
| 208 | + |
| 209 | +<span class="highlight">:dvc-in data_in: ./data/all.zip</span> |
| 210 | +""" |
| 211 | + </code></pre></div> |
| 212 | + <div class="fragment current-visible"> |
| 213 | + <img src="./img/dvc/script_docstring_extract.png"/> |
| 214 | + <pre><code> |
| 215 | +""" |
| 216 | +:param str subset: Subset of data to load {'train', 'test'} |
| 217 | +:param str data_in: File directory path |
| 218 | +:param str <span class="highlight">output_path</span>: Output file path |
| 219 | + |
| 220 | +:dvc-in data_in: ./data/all.zip |
| 221 | +<span class="highlight">:dvc-out output_path: ./data/data_train.csv</span> |
| 222 | +""" |
| 223 | + </code></pre></div> |
| 224 | + <div class="fragment current-visible"> |
| 225 | + <img src="./img/dvc/script_docstring_extract.png"/> |
| 226 | + <pre><code> |
| 227 | +""" |
| 228 | +:param str <span class="highlight">subset</span>: Subset of data to load {'train', 'test'} |
| 229 | +:param str data_in: File directory path |
| 230 | +:param str output_path: Output file path |
| 231 | + |
| 232 | +:dvc-in data_in: ./data/all.zip |
| 233 | +:dvc-out output_path: ./data/data_train.csv |
| 234 | +<span class="highlight">:dvc-extra: --subset test</span> |
| 235 | +""" </code></pre></div> |
| 236 | +</section> |
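To make the docstring annotations above concrete, here is a rough sketch of how `:dvc-in`, `:dvc-out` and `:dvc-extra` lines could be turned into a `dvc run` command. It is not the MLV-tools implementation: the function `docstring_to_dvc_run`, the regular expression, and the mapping from a parameter such as `data_in` to a flag such as `--data-in` are all assumptions for illustration.

```python
import re
import shlex

DOCSTRING = """
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path

:dvc-in data_in: ./data/all.zip
:dvc-out output_path: ./data/data_train.csv
:dvc-extra: --subset test
"""


def docstring_to_dvc_run(docstring, script):
    """Build a `dvc run` argument list from the dvc-* annotations of a docstring."""
    deps, outs, script_args = [], [], []
    for line in docstring.splitlines():
        line = line.strip()
        match = re.match(r":dvc-(in|out) (\w+): (.+)", line)
        if match:
            kind, param, path = match.groups()
            (deps if kind == "in" else outs).append(path)
            # Assumption: parameter `data_in` is passed to the script as `--data-in`.
            script_args += ["--" + param.replace("_", "-"), path]
        elif line.startswith(":dvc-extra:"):
            # Extra arguments are forwarded to the script unchanged.
            script_args += shlex.split(line[len(":dvc-extra:"):])
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    return args + [script] + script_args


print(shlex.join(docstring_to_dvc_run(DOCSTRING, "./script.py")))
```

Since each path is declared once in the docstring, the DVC dependency list and the script arguments cannot drift apart.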
| 237 | +<section> |
| 238 | + <h5>The generation</h5> |
| 239 | + <img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/gen_dvc.png"> |
| 240 | + <aside class="notes"> |
| 241 | + easy to use |
| 242 | + </aside> |
| 243 | +</section> |
| 244 | +<section> |
| 245 | + <h5>The generated bash command</h5> |
| 246 | + <img src="./img/dvc/dvc_cmd.png"> |
| 247 | + <aside class="notes"> |
| 248 | + @SBI<br/> |
| 249 | + - You can see: dvc run CMD + dep + out<br/> |
| 250 | + - + step python |
| 251 | + </aside> |
| 252 | +</section> |
| 253 | + |
| 254 | + |
| 255 | +<section> |
| 256 | + <h5>Benefits</h5> |
| 257 | + <ul class="fragment"> |
| 258 | + <li class="fragment">Easily generate DVC steps</li> |
| 259 | + <li class="fragment">Easily create a DVC pipeline</li> |
| 260 | + <li class="fragment">Avoid inconsistencies</li> |
| 261 | + </ul> |
| 262 | + <aside class="notes"> |
| 263 | + @SDG |
| 264 | + |
| 265 | + +@SBI? |
| 266 | + - mention that it only works with a conf => see the online tutorial<br/> |
| 267 | + - Everything is in the docstring (initially in the Jupyter notebook or script)<br/> |
| 268 | + - Docstring can be a template using a pipeline conf (tutorial and README)<br/> |
| 269 | + </aside> |
| 270 | +</section> |
| 271 | + |
| 272 | +<section> |
| 273 | + <h5>Going further: ipynb_to_dvc</h5> |
| 274 | + <pre class="fragment"><code>ipynb_to_dvc -n ./notebook.ipynb</code></pre> |
| 275 | + <img class="fragment current-visible plain" src="./img/global_schema1.png"> |
| 276 | + <img class="fragment current-visible plain" src="./img/global_schema2.png"> |
| 277 | + <aside class="notes"> |
| 278 | + - MLV-tools tools are linked but can be used independently<br/> |
| 279 | + - mention that it only works with a conf => see the online tutorial<br/> |
| 280 | + - Keep ref in jupyter notebook<br/> |
| 281 | + </aside> |
| 282 | +</section> |