
Commit eaaab3b

sdgsbracaloni
sdg
authored and committed
Talk Pydata presentation
1 parent 36415a4 commit eaaab3b

133 files changed

+17337
-0
lines changed

talks/pyData/draft.md

+86
@@ -0,0 +1,86 @@
1+
Focus the talk on the opposition between the two worlds, especially near the beginning; as a result, the two come together at the end.
2+
3+
Overview:
4+
5+
- Presentation
6+
- Sarah: [short summary] + hook on the technologies => I use Jupyter notebooks
7+
and I need to ... reproduce results easily, keep flexibility, ...
8+
9+
- Stephanie: [short summary] + hook on automation, delivery, tests
10+
I need ... something that is easy to launch, that can be packaged, and that is reproducible
11+
on any environment
12+
13+
14+
- Why => Our story: basically a catchy title for the porting of the
15+
POC (multiple executable Jupyter notebooks on one machine) to production (or at least the "industrialization" step of the project)
16+
by mixing the developer and data scientist worlds
17+
18+
- the long and winding road from POC to production ... at the crossroads of the two worlds
19+
20+
- POC vs PROD ... vs Data scientist vs Software Developer
21+
22+
23+
- The POC:
24+
25+
- set of notebooks, some data, name versioning, specific server/user
26+
[Show a repo overview]
27+
28+
- Step 1: express our needs
29+
30+
- Automation/Scripting (first step)
31+
32+
ML side : keep using jupyter notebook
33+
Dev side: be able to easily run the tool and version a standardized format under git
34+
tests, CI
35+
36+
- Reproducibility/Pipelining/Versioning
37+
38+
=> Going further in the automation process
39+
=> No loss, be more confident
40+
=> Easily perform experiments
41+
=> Handle data sharing
42+
43+
44+
ML side: be able to experiment, avoid re-running time-consuming steps, keep track of data,
45+
share with the team. Organisation (no more inconsistent references to name-versioned notebooks, execution order
46+
and dependencies)
47+
48+
49+
Dev side: be able to reproduce any configuration (data + hyperparameters + code) on any server,
50+
keep track of the state-of-the-art pipeline for later delivery. Be able to handle client specificities
51+
52+
53+
[Diagram showing the needs]
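
To make the "avoid re-running time-consuming steps" need concrete, here is a minimal sketch of how out-of-date steps could be detected from file checksums. It is not how DVC or MLV-tools implement it; the step names, file names and the `stale_steps` helper are hypothetical, and the pipeline is assumed to be listed in execution order.

```python
import hashlib


def fingerprint(content):
    """Checksum of a file's bytes, used only to notice that something changed."""
    return hashlib.sha256(content).hexdigest()


def stale_steps(pipeline, recorded, current):
    """Return the steps to re-run, given checksums from the last run and from now.

    pipeline: list of (name, deps, outs) tuples, assumed to be in execution order.
    recorded / current: dicts mapping an input path to its checksum.
    """
    dirty = {path for path, digest in current.items() if recorded.get(path) != digest}
    to_run = []
    for name, deps, outs in pipeline:
        if any(dep in dirty for dep in deps):
            to_run.append(name)
            dirty.update(outs)  # outputs of a re-run step make downstream steps stale
    return to_run


# Hypothetical three-step pipeline: extract -> train -> evaluate.
pipeline = [
    ("extract", ["data.tgz"], ["train_set.csv"]),
    ("train", ["train_set.csv", "train_params.json"], ["model.pkl"]),
    ("evaluate", ["model.pkl", "eval_params.json"], ["metrics.json"]),
]
recorded = {
    "data.tgz": fingerprint(b"raw data"),
    "train_params.json": fingerprint(b'{"epochs": 10}'),
    "eval_params.json": fingerprint(b'{"threshold": 0.5}'),
}

# Only the evaluation parameter changes, so only "evaluate" is out of date.
current = dict(recorded)
current["eval_params.json"] = fingerprint(b'{"threshold": 0.7}')
print(stale_steps(pipeline, recorded, current))  # ['evaluate']
```

Changing `train_params.json` instead would mark both `train` and `evaluate` as stale, while the costly `extract` step would still be skipped.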
54+
55+
- Step 2: Organisation start: we need python scripts from jupyter notebooks
56+
57+
- Existing solutions: nbconvert
58+
59+
- Issues: the generated script is not parametrized and no-effect cells are not handled
60+
61+
- MLV-tools: ipynb_to_python
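
A rough sketch of what this step asks for, assuming the notebook follows the nbformat 4 JSON layout: build one parametrized function from the code cells and skip cells flagged as having no effect. This is not the MLV-tools implementation; the `notebook_to_script` helper and the `# No effect` first-line marker are assumptions made for the example.

```python
import json

# Assumed convention: a code cell whose first line is this comment is skipped.
NO_EFFECT_MARKER = "# No effect"


def notebook_to_script(notebook_json, func_name="run", params=()):
    """Turn notebook JSON (nbformat 4 layout) into the source of one function."""
    notebook = json.loads(notebook_json)
    body = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no code
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        if not source.strip() or source.lstrip().startswith(NO_EFFECT_MARKER):
            continue  # empty or exploration-only cell
        body.extend(source.splitlines())
    if not body:
        body = ["pass"]
    indented = "\n".join("    " + line if line else "" for line in body)
    return "def {}({}):\n{}\n".format(func_name, ", ".join(params), indented)


# Tiny in-memory notebook: one markdown cell, one real cell, one no-effect cell.
demo_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Extract step"]},
        {"cell_type": "code", "source": ["result = factor * 2\n", "print(result)"]},
        {"cell_type": "code", "source": ["# No effect\n", "print('just exploring')"]},
    ]
})
script = notebook_to_script(demo_notebook, params=["factor"])
print(script)
```

The generated text is plain Python, so it can be versioned under git and run by tests or CI while the notebook stays the place where the ML side works.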
62+
63+
64+
- Step 3: We need to handle data versioning and pipelining
65+
66+
- Existing solution: Git LFS => data OK, pipelining not OK
67+
- Existing solution: dvc => data OK, pipelining OK BUT... [not easy to use and based on bash commands,
68+
but good news: we already have scripts]
69+
- DVC example
70+
- show why it is tedious
71+
72+
- MLV-tools: from jupyter notebook to a pipeline step
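
One way to picture this step: read the `:dvc-in`, `:dvc-out` and `:dvc-extra` docstring annotations shown in the slides and assemble the matching `dvc run -d ... -o ... [command]` line, so that paths are written only once. The sketch below is not the real `gen_dvc`; the `build_dvc_command` helper, the regular expression and the mapping of a parameter name such as `data_in` to a `--data-in` flag are assumptions.

```python
import re
import shlex

# Matches ":dvc-in name: value", ":dvc-out name: value" and ":dvc-extra: value".
DVC_LINE = re.compile(r"^\s*:dvc-(in|out|extra)(?:\s+(\w+))?:\s*(.+?)\s*$")


def build_dvc_command(docstring, script):
    """Assemble a `dvc run` command line from docstring annotations."""
    deps, outs, script_args = [], [], []
    for line in docstring.splitlines():
        match = DVC_LINE.match(line)
        if not match:
            continue  # ordinary docstring line such as ":param str subset: ..."
        kind, name, value = match.groups()
        if kind == "extra":
            script_args.extend(shlex.split(value))  # passed to the script untouched
            continue
        (deps if kind == "in" else outs).append(value)
        if name:
            # Assumed flag naming: parameter data_in becomes --data-in
            script_args.extend(["--" + name.replace("_", "-"), value])
    parts = ["dvc", "run"]
    for dep in deps:
        parts += ["-d", dep]
    for out in outs:
        parts += ["-o", out]
    parts += [script] + script_args
    return " ".join(shlex.quote(part) for part in parts)


docstring = """
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path

:dvc-in data_in: ./data/all.zip
:dvc-out output_path: ./data/data_train.csv
:dvc-extra: --subset test
"""
command = build_dvc_command(docstring, "./script.py")
print(command)
```

Because each path appears once in the docstring and is reused for both the DVC option and the script argument, the dependency list and the actual command cannot drift apart, which is the inconsistency risk of hand-written commands.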
73+
74+
75+
76+
- REX (lessons learned)
77+
78+
=> flexibility for experimentation (Commerzbank)
79+
=> data loss
80+
81+
82+
83+
84+
85+
86+

talks/pyData/dvc.html

+282
@@ -0,0 +1,282 @@
1+
<section>
2+
<p>Enter <span class="emph"> DVC </span></p>
3+
<aside class="notes">
4+
</aside>
5+
</section>
6+
7+
<section>
8+
<h5>DVC - Data Version Control</h5>
9+
<div class="fragment"><span class="emph">Open-source</span> Version Control System for Machine Learning Projects</div>
10+
<blockquote class="fragment fade-up">"For data scientists, by data scientists"</blockquote>
11+
<p class="fragment">
12+
<a href="https://github.com/iterative/dvc">
13+
<img class="icon" src="./img/GitHub-logo.png" alt="GitHub logo" style="width: 1em"> iterative/dvc
14+
</a>
15+
</p>
16+
<p class="fragment">
17+
<a href="https://youtu.be/4h6I9_xeYA4">
18+
<img class="icon" src="./img/icons/youtube.png" alt="youtube logo" style="width: 1em">how it works!
19+
</a>
20+
</p>
21+
22+
<aside class="notes">
23+
@SBI<br/>
24+
25+
- Open source - Apache Licence<br/>
26+
- Versioning tool for ML<br/>
27+
- for... by ...<br/>
28+
29+
- see ... <br/>
30+
TRANSITION: Not just a versioning tool
31+
</aside>
32+
</section>
33+
34+
<section>
35+
<h5>DVC - Data Version Control</h5>
36+
<div>
37+
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/dvc_home_page1.png">
38+
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/dvc_home_page2.png">
39+
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/dvc_home_page3.png">
40+
</div>
41+
<aside class="notes">
42+
@SBI
43+
- Versioning tool => DATA cache mechanism (REMOTE or LOCAL) <br/>
44+
=> Metadata tracked in git (close to git lfs)<br/><br/>
45+
46+
- Amazing tool to perform experiment<br/>
47+
- Project is handled as a PIPELINE composed of several STEPS and DEPENDENCIES<br/>
48+
49+
- Collaboration: PUSH and PULL
50+
</aside>
51+
</section>
52+
<section>
53+
<h5>DVC - How it works</h5>
54+
55+
<div>
56+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC1.png">
57+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC2.png">
58+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC3.png">
59+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC4.png">
60+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC5.png">
61+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC6.png">
62+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC7.png">
63+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC8.png">
64+
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC9.png">
65+
</div>
66+
<p style="text-align: left"> Save all intermediate results: Metadata <span class="emph">+</span> data</p>
67+
<aside class="notes">
68+
69+
</aside>
70+
</section>
71+
72+
<section>
73+
<h5>DVC - Reproduce only sub pipeline</h5>
74+
<div>
75+
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/pipeline/DVC9.png">
76+
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/pipeline/DVC_change0.png">
77+
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/pipeline/DVC_change1.png">
78+
<img class="fragment current-visible" data-fragment-index="3" src="./img/dvc/pipeline/DVC_change2.png">
79+
<img class="fragment current-visible" data-fragment-index="4" src="./img/dvc/pipeline/DVC_change2bis.png">
80+
<img class="fragment current-visible" data-fragment-index="5" src="./img/dvc/pipeline/DVC_change2bis.png">
81+
<pre class="fragment current-visible" data-fragment-index="5"><code>dvc repro evaluate.dvc</code></pre>
82+
<img class="fragment current-visible" data-fragment-index="6" src="./img/dvc/pipeline/DVC_change3.png">
83+
<pre class="fragment current-visible" data-fragment-index="6"><code>dvc repro evaluate.dvc</code></pre>
84+
<img class="fragment current-visible" data-fragment-index="7" src="./img/dvc/pipeline/DVC_change4.png">
85+
<pre class="fragment current-visible" data-fragment-index="7"><code>dvc repro evaluate.dvc</code></pre>
86+
<img class="fragment current-visible" data-fragment-index="8" src="./img/dvc/pipeline/DVC_change5.png">
87+
<pre class="fragment current-visible" data-fragment-index="8"><code>dvc repro evaluate.dvc</code></pre>
88+
<img class="fragment current-visible" data-fragment-index="9" src="./img/dvc/pipeline/DVC_change5.png">
89+
<pre class="fragment current-visible" data-fragment-index="9"><code>dvc repro evaluate.dvc</code></pre>
90+
</div>
91+
<p class="fragment current-visible" data-fragment-index="9" style="text-align: left">Do not re-run time-consuming tasks</p>
92+
<aside class="notes">
93+
@SBI<br/>
94+
- Change param<br/>
95+
- Want to recalculate METRICS<br/>
96+
- run DVC ... evaluate step<br/>
97+
- DETECTS out of date steps<br/>
98+
- Re-run only needed tasks<br/>
99+
- NO time consuming
100+
101+
102+
</aside>
103+
</section>
104+
105+
106+
<section>
107+
<h5>How to use</h5>
108+
<p>
109+
<pre><code>dvc run -d <span class="step_input">[input_dep]</span> -o <span class="step_output">[output]</span> [command]</code></pre>
110+
</p>
111+
</section>
112+
113+
<section>
114+
<h5>How to use - example</h5>
115+
<div class="fragment">
116+
<p>Step:
117+
<pre><code>extract.py --input ./data.tgz --out ./out/train_set.csv</code></pre>
118+
</p>
119+
</div>
120+
<div class="fragment">With DVC:
121+
<pre class="fragment current-visible"><code>
122+
<span class="step_cmd">dvc run -d ./data.tgz -o ./out/train_set.csv \</span>
123+
extract.py --input ./data.tgz --out ./out/train_set.csv
124+
</code></pre>
125+
<pre class="fragment current-visible"><code>
126+
dvc run -d ./data.tgz -o ./out/train_set.csv \
127+
extract.py --input ./data.tgz --out ./out/train_set.csv
128+
</code></pre>
129+
<pre class="fragment current-visible"><code>
130+
dvc run -d <span class="step_input">./data.tgz</span> -o ./out/train_set.csv \
131+
extract.py --input <span class="step_input">./data.tgz</span> --out ./out/train_set.csv
132+
</code></pre>
133+
<pre class="fragment current-visible"><code>
134+
dvc run -d <span class="step_input">./data.tgz</span> -o <span class="step_output">./out/train_set.csv</span> \
135+
extract.py --input <span class="step_input">./data.tgz</span> --out <span class="step_output">./out/train_set.csv</span>
136+
</code></pre>
137+
</p>
138+
</div>
139+
<aside class="notes">
140+
@SBI:<br/>
141+
- python step <br/>
142+
- add DVC + dep + out<br/>
143+
- TEDIOUS
144+
145+
146+
</aside>
147+
</section>
148+
<section>
149+
<h5>DVC</h5>
150+
<div class="two-halves">
151+
<div class="half fragment">
152+
<h4>... is great! </h4>
153+
<ul>
154+
<li class="fragment">Handle dependencies</li>
155+
<li class="fragment">Cache mechanism</li>
156+
<li class="fragment">Reproducibility</li>
157+
<li class="fragment">Facilitate collaboration</li>
158+
<li class="fragment">Language agnostic</li>
159+
</ul>
160+
</div>
161+
<div class="half fragment">
162+
<h4>... but </h4>
163+
<ul>
164+
<li class="fragment">Risk of inconsistencies</li>
165+
<li class="fragment">Tedious to write/setup</li>
166+
</ul>
167+
</div>
168+
</div>
169+
<aside class="notes">
170+
@SDG<br/>
171+
1. Step order => Track dependencies (know which steps are needed to achieve a targeted step)<br/>
172+
2. Cache mechanism => save intermediate results + reproduce a sub-pipeline (avoid time-consuming tasks)<br/>
173+
3. Repro on any setup server<br/>
174+
4. Share data<br/>
175+
5. works with any executable<br/>
176+
</aside>
177+
</section>
178+
<section>
179+
<h5>MLV-tools gen_dvc</h5>
180+
<pre><code>gen_dvc -i ./script.py -o ./commands/script_dvc</code></pre>
181+
182+
</section>
183+
<section>
184+
<h5>The script</h5>
185+
<img class="fragment current-visible" data-fragment-index="2" src="./img/mlv_convert/script2.png">
186+
<aside class="notes">
187+
Previously generated with ipynb_to_dvc
188+
</aside>
189+
</section>
190+
<section>
191+
192+
<div class="fragment current-visible">
193+
<img src="./img/dvc/script_docstring_extract.png"/>
194+
<pre><code>
195+
"""
196+
:param str subset: Subset of data to load {'train', 'test'}
197+
:param str data_in: File directory path
198+
:param str output_path: Output file path
199+
"""
200+
</code></pre></div>
201+
<div class="fragment current-visible">
202+
<img src="./img/dvc/script_docstring_extract.png"/>
203+
<pre><code>
204+
"""
205+
:param str subset: Subset of data to load {'train', 'test'}
206+
:param str <span class="highlight">data_in</span>: File directory path
207+
:param str output_path: Output file path
208+
209+
<span class="highlight">:dvc-in data_in: ./data/all.zip</span>
210+
"""
211+
</code></pre></div>
212+
<div class="fragment current-visible">
213+
<img src="./img/dvc/script_docstring_extract.png"/>
214+
<pre><code>
215+
"""
216+
:param str subset: Subset of data to load {'train', 'test'}
217+
:param str data_in: File directory path
218+
:param str <span class="highlight">output_path</span>: Output file path
219+
220+
:dvc-in data_in: ./data/all.zip
221+
<span class="highlight">:dvc-out output_path: ./data/data_train.csv</span>
222+
"""
223+
</code></pre></div>
224+
<div class="fragment current-visible">
225+
<img src="./img/dvc/script_docstring_extract.png"/>
226+
<pre><code>
227+
"""
228+
:param str <span class="highlight">subset</span>: Subset of data to load {'train', 'test'}
229+
:param str data_in: File directory path
230+
:param str output_path: Output file path
231+
232+
:dvc-in data_in: ./data/all.zip
233+
:dvc-out output_path: ./data/data_train.csv
234+
<span class="highlight">:dvc-extra: --subset test</span>
235+
""" </code></pre></div>
236+
</section>
237+
<section>
238+
<h5>The generation</h5>
239+
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/gen_dvc.png">
240+
<aside class="notes">
241+
easy to use
242+
</aside>
243+
</section>
244+
<section>
245+
<h5>The generated bash command</h5>
246+
<img src="./img/dvc/dvc_cmd.png">
247+
<aside class="notes">
248+
@SBI<br/>
249+
- You can see: dvc run CMD + dep + out<br/>
250+
- + step python
251+
</aside>
252+
</section>
253+
254+
255+
<section>
256+
<h5>Benefits</h5>
257+
<ul class="fragment">
258+
<li class="fragment">Easily generate DVC steps</li>
259+
<li class="fragment">Easily create a DVC pipeline</li>
260+
<li class="fragment">Avoid inconsistencies</li>
261+
</ul>
262+
<aside class="notes">
263+
@SDG
264+
265+
+@SBI?
266+
- mention that it only works with a configuration => see the online tutorial<br/>
267+
- Everything is in the docstring (initially in the Jupyter notebook or the script)<br/>
268+
- Docstring can be a template using a pipeline configuration (tutorial and README)<br/>
269+
</aside>
270+
</section>
271+
272+
<section>
273+
<h5>Going further: ipynb_to_dvc</h5>
274+
<pre class="fragment"><code>ipynb_to_dvc -n ./notebook.ipynb</code></pre>
275+
<img class="fragment current-visible plain" src="./img/global_schema1.png">
276+
<img class="fragment current-visible plain" src="./img/global_schema2.png">
277+
<aside class="notes">
278+
- MLV-tools tools are linked but can be used independently<br/>
279+
- mention that it only works with a configuration => see the online tutorial<br/>
280+
- Keep ref in jupyter notebook<br/>
281+
</aside>
282+
</section>
