<section>
<p>Enter <span class="emph"> DVC </span></p>
<aside class="notes">
</aside>
</section>
<section>
<h5>DVC - Data Version Control</h5>
<div class="fragment"><span class="emph">Open-source</span> Version Control System for Machine Learning Projects</div>
<blockquote class="fragment fade-up">"For data scientists, by data scientists"</blockquote>
<p class="fragment">
<a href="https://github.com/iterative/dvc">
<img class="icon" src="./img/GitHub-logo.png" alt="GitHub logo" style="width: 1em"> iterative/dvc
</a>
</p>
<p class="fragment">
<a href="https://youtu.be/4h6I9_xeYA4">
<img class="icon" src="./img/icons/youtube.png" alt="youtube logo" style="width: 1em">how it works!
</a>
</p>
<aside class="notes">
@SBI<br/>
- Open source - Apache License<br/>
- Versioning tool for ML<br/>
- for... by ...<br/>
- see ... <br/>
TRANSITION: Not just a versioning tool
</aside>
</section>
<section>
<h5>DVC - Data Version Control</h5>
<div>
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/dvc_home_page1.png">
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/dvc_home_page2.png">
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/dvc_home_page3.png">
</div>
<aside class="notes">
@SBI
- Versioning tool => DATA cache mechanism (REMOTE or LOCAL) <br/>
=> Metadata under git (close to git lfs)<br/><br/>
- Amazing tool to perform experiments<br/>
- Project is handled as a PIPELINE composed of several STEPS and DEPENDENCIES<br/>
- Collaboration: PUSH and PULL
</aside>
</section>
<section>
<h5>DVC - How it works</h5>
<div>
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC1.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC2.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC3.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC4.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC5.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC6.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC7.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC8.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC9.png">
</div>
<p style="text-align: left"> Save all intermediate results: Metadata <span class="emph">+</span> data</p>
<aside class="notes">
</aside>
</section>
<section>
<h5>DVC - Reproduce only a sub-pipeline</h5>
<div>
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/pipeline/DVC9.png">
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/pipeline/DVC_change0.png">
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/pipeline/DVC_change1.png">
<img class="fragment current-visible" data-fragment-index="3" src="./img/dvc/pipeline/DVC_change2.png">
<img class="fragment current-visible" data-fragment-index="4" src="./img/dvc/pipeline/DVC_change2bis.png">
<img class="fragment current-visible" data-fragment-index="5" src="./img/dvc/pipeline/DVC_change2bis.png">
<pre class="fragment current-visible" data-fragment-index="5"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="6" src="./img/dvc/pipeline/DVC_change3.png">
<pre class="fragment current-visible" data-fragment-index="6"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="7" src="./img/dvc/pipeline/DVC_change4.png">
<pre class="fragment current-visible" data-fragment-index="7"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="8" src="./img/dvc/pipeline/DVC_change5.png">
<pre class="fragment current-visible" data-fragment-index="8"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="9" src="./img/dvc/pipeline/DVC_change5.png">
<pre class="fragment current-visible" data-fragment-index="9"><code>dvc repro evaluate.dvc</code></pre>
</div>
<p class="fragment current-visible" data-fragment-index="9" style="text-align: left">Do not re-run time-consuming tasks</p>
<aside class="notes">
@SBI<br/>
- Change param<br/>
- Want to recalculate METRICS<br/>
- run DVC ... evaluate step<br/>
- DETECTS out of date steps<br/>
- Re-run only needed tasks<br/>
- NOT time-consuming
</aside>
</section>
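<section data-markdown>
<script type="text/template">
##### DVC - Detecting out-of-date steps (sketch)

The out-of-date detection described on the previous slide can be pictured with a small Python sketch. It is an illustration only, not DVC's actual implementation: the function `stale_stages`, the stage tuples, and most file names are hypothetical, and it assumes that every output of a re-run step has changed.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Checksum used to tell whether a file changed since the last run."""
    return hashlib.md5(content).hexdigest()


def stale_stages(stages, current, recorded):
    """Return the names of the stages to re-run, in pipeline order.

    stages:   list of (name, deps, outs) tuples, already in execution order
    current:  {path: checksum} of the source files as they are now
    recorded: {path: checksum} saved when the pipeline last ran
    """
    changed = {path for path, h in current.items() if recorded.get(path) != h}
    to_run = []
    for name, deps, outs in stages:
        if any(dep in changed for dep in deps):
            to_run.append(name)
            # Simplification: outputs of a re-run stage count as changed.
            changed.update(outs)
    return to_run


# Hypothetical three-step pipeline: extract -> train -> evaluate
stages = [
    ("extract", ["data.tgz"], ["train_set.csv"]),
    ("train", ["train_set.csv", "params.json"], ["model.pkl"]),
    ("evaluate", ["model.pkl", "test_set.csv"], ["metrics.json"]),
]
recorded = {
    "data.tgz": fingerprint(b"raw data"),
    "params.json": fingerprint(b'{"lr": 0.1}'),
    "test_set.csv": fingerprint(b"test rows"),
}
# Only the parameter file was edited since the last run.
current = dict(recorded, **{"params.json": fingerprint(b'{"lr": 0.01}')})

print(stale_stages(stages, current, recorded))
```

With only `params.json` modified, `extract` is skipped while `train` and `evaluate` are re-run.
</script>
</section>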
<section>
<h5>How to use</h5>
<p>
<pre><code>dvc run -d <span class="step_input">[input_dep]</span> -o <span class="step_output">[output]</span> [command]</code></pre>
</p>
</section>
<section>
<h5>How to use - example</h5>
<div class="fragment">
<p>Step:
<pre><code>extract.py --input ./data.tgz --out ./out/train_set.csv</code></pre>
</p>
</div>
<div class="fragment">With DVC:
<pre class="fragment current-visible"><code>
<span class="step_cmd">dvc run -d ./data.tgz -o ./out/train_set.csv \</span>
extract.py --input ./data.tgz --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d ./data.tgz -o ./out/train_set.csv \
extract.py --input ./data.tgz --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d <span class="step_input">./data.tgz</span> -o ./out/train_set.csv \
extract.py --input <span class="step_input">./data.tgz</span> --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d <span class="step_input">./data.tgz</span> -o <span class="step_output">./out/train_set.csv</span> \
extract.py --input <span class="step_input">./data.tgz</span> --out <span class="step_output">./out/train_set.csv</span>
</code></pre>
</p>
</div>
<aside class="notes">
@SBI:<br/>
- python step <br/>
- add DVC + dep + out<br/>
- TEDIOUS
</aside>
</section>
<section>
<h5>DVC</h5>
<div class="two-halves">
<div class="half fragment">
<h4>... is great! </h4>
<ul>
<li class="fragment">Handle dependencies</li>
<li class="fragment">Cache mechanism</li>
<li class="fragment">Reproducibility</li>
<li class="fragment">Facilitate collaboration</li>
<li class="fragment">Language agnostic</li>
</ul>
</div>
<div class="half fragment">
<h4>... but </h4>
<ul>
<li class="fragment">Risk of inconsistencies</li>
<li class="fragment">Tedious to write/setup</li>
</ul>
</div>
</div>
<aside class="notes">
@SDG<br/>
1. Step order => Track dependencies (know which steps are needed to achieve a targeted step)<br/>
2. Cache mechanism => save intermediate results + reproduce sub-pipeline (avoid time-consuming tasks)<br/>
3. Reproduce on any server setup<br/>
4. Share data<br/>
5. works with any executable<br/>
</aside>
</section>
<section>
<h5>MLV-tools gen_dvc</h5>
<pre><code>gen_dvc -i ./script.py -o ./commands/script_dvc</code></pre>
</section>
<section>
<h5>The script</h5>
<img class="fragment current-visible" data-fragment-index="2" src="./img/mlv_convert/script2.png">
<aside class="notes">
Previously generated with ipynb_to_dvc
</aside>
</section>
<section>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str <span class="highlight">data_in</span>: File directory path
:param str output_path: Output file path
<span class="highlight">:dvc-in data_in: ./data/all.zip</span>
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str <span class="highlight">output_path</span>: Output file path
:dvc-in data_in: ./data/all.zip
<span class="highlight">:dvc-out output_path: ./data/data_train.csv</span>
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str <span class="highlight">subset</span>: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path
:dvc-in data_in: ./data/all.zip
:dvc-out output_path: ./data/data_train.csv
<span class="highlight">:dvc-extra: --subset test</span>
""" </code></pre></div>
</section>
<section>
<h5>The generation</h5>
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/gen_dvc.png">
<aside class="notes">
easy to use
</aside>
</section>
<section>
<h5>The generated bash command</h5>
<img src="./img/dvc/dvc_cmd.png">
<aside class="notes">
@SBI<br/>
- You can see: dvc run CMD + dep + out<br/>
- + step python
</aside>
</section>
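<section data-markdown>
<script type="text/template">
##### From docstring to command (sketch)

To make the mapping from the `:dvc-in`, `:dvc-out` and `:dvc-extra` docstring fields to a `dvc run` command concrete, here is a minimal Python sketch. It is not the MLV-tools implementation: `build_dvc_run`, the regular expression, and the convention that parameter `data_in` becomes the flag `--data-in` are all assumptions made for illustration.

```python
import re
import shlex

DOCSTRING = """
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path
:dvc-in data_in: ./data/all.zip
:dvc-out output_path: ./data/data_train.csv
:dvc-extra: --subset test
"""

# Matches ":dvc-in <param>: <path>", ":dvc-out <param>: <path>", ":dvc-extra: <args>"
FIELD = re.compile(r"^:dvc-(in|out|extra)(?:\s+(\w+))?:\s*(.+)$", re.MULTILINE)


def build_dvc_run(script, docstring):
    """Assemble a `dvc run` command line from the dvc-* docstring fields."""
    deps, outs, args = [], [], []
    for kind, param, value in FIELD.findall(docstring):
        value = value.strip()
        if kind == "extra":
            # Extra arguments are passed to the script unchanged.
            args += shlex.split(value)
        else:
            (deps if kind == "in" else outs).append(value)
            # Assumed convention: parameter data_in -> flag --data-in
            args += ["--" + param.replace("_", "-"), value]
    cmd = ["dvc", "run"]
    for dep in deps:
        cmd += ["-d", dep]
    for out in outs:
        cmd += ["-o", out]
    return " ".join(shlex.quote(part) for part in cmd + [script] + args)


print(build_dvc_run("./script.py", DOCSTRING))
```

Each path is written once in the docstring and reused for both the `-d`/`-o` options and the script arguments, which is how the inconsistencies of a hand-written command are avoided.
</script>
</section>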
<section>
<h5>Benefits</h5>
<ul class="fragment">
<li class="fragment">Easily generate DVC steps</li>
<li class="fragment">Easily create a DVC pipeline</li>
<li class="fragment">Avoid inconsistencies</li>
</ul>
<aside class="notes">
@SDG
+@SBI?
- mention that it only works with a conf => see the online tutorial<br/>
- Everything is in the docstring (initially in the Jupyter notebook or script)<br/>
- Docstring can be a template using a pipeline conf (tutorial and README)<br/>
</aside>
</section>
<section>
<h5>Going further: ipynb_to_dvc</h5>
<pre class="fragment"><code>ipynb_to_dvc -n ./notebook.ipynb</code></pre>
<img class="fragment current-visible plain" src="./img/global_schema1.png">
<img class="fragment current-visible plain" src="./img/global_schema2.png">
<aside class="notes">
- MLV-tools tools are linked but can be used independently<br/>
- mention that it only works with a conf => see the online tutorial<br/>
- Keep ref in jupyter notebook<br/>
</aside>
</section>