Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial setup for reveal.js slides #2

Merged
merged 1 commit into from
Mar 26, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 86 additions & 0 deletions talks/pyData/draft.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
Axé la conf sur l'opposition des mondes surtout vers le debut. et rsultat on se rassemble à la fin.

Overview:

- Presentation
- Sarah : [petit resumé] + Accroche sur les technos => J'utilise des jupyter notebook
et j'ai besoin de ...pouvoir reproduire facilement, garder de la souplesse, ...

- Stephanie: [petit résumé] + Accroche Automatisation, Livaison, Tests
J'ai besoin de... un truc qui se lance facilement, qui se package, qui soit reproductible
sur n'importe quel environnement


- Why => Notre histoire: en gros un titre stylé pour dire le portage du
poc (multi jupyter executables sur 1 machine) vers la prod (enfin au moins le step d'"industrialisation" du projet)
en mixant monde dev et data scientist

- long et sinueu chemin du poc vers la prod ... à la croisée des deux mondes

- POC vs PROD ... vs Data scientist vs Software Developer


- The POC:

- set of notebooks, some data, name versioning, specific server/user
[Show a repo overview]

- Step 1: express our needs

- Automation/Scripting (first step)

ML side : keep using jupyter notebook
Dev side: be able to easily run the tool and version a standardized format under git
tests, CI

- Reproducibility/Pipelining/Versioning

=> Going further in the automation process
=> No loss, be more confident
=> Easily perform experiments
=> Handle data sharing


ML side: be able to experiment, avoid to reproduce time consuming steps, keep tracking data
share with the team. organistation (no more inconstent reference on name versioned notebooks and execution order
and dependencies)


Dev side: be able to reproduce any configuration (data + hyperparam + code) on any server
keep tracking the state of the art pipeline for further delivery. be ble to handle client specificities


[Schema représentant besoins]

- Step 2: Organisation start: we need python scripts from jupyter notebooks

- Existing solutions: nb convert

- Issues: not parametrized and no effect cells

- MLV-tools: ipynb_to_python


- Step 3: We need to handle data versioning and pipelining

- Existing solution: git lfs => data ok, pipelining nok
- Existing solution: dvc => data ok, pipelining ok BUT... [ not easy to use and based on bash cmd
mais bonne nouvelle on a deja des scripts]
- example DVC
- montrer pkoi c'est relou

- MLV-tools: from jupyter notebook to a pipeline step



- REX

=> souplesse expérimentation commerzbank
=> perte de données







282 changes: 282 additions & 0 deletions talks/pyData/dvc.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,282 @@
<section>
<p>Enters <span class="emph"> DVC </span></p>
<aside class="notes">
</aside>
</section>

<section>
<h5>DVC - Data Version Control</h5>
<div class="fragment"><span class="emph">Open-source</span> Version Control System for Machine Learning Projects</div>
<blockquote class="fragment fade-up">"For data scientists, by data scientists"</blockquote>
<p class="fragment">
<a href="https://github.com/iterative/dvc">
<img class="icon" src="./img/GitHub-logo.png" alt="GitHub logo" style="width: 1em"> iterative/dvc
</a>
</p>
<p class="fragment">
<a href="https://youtu.be/4h6I9_xeYA4">
<img class="icon" src="./img/icons/youtube.png" alt="youtube logo" style="width: 1em">how it works!
</a>
</p>

<aside class="notes">
@SBI<br/>

- Open source - Apache Licence<br/>
- Versionign tool for ml<br/>
- for... by ...<br/>

- see ... <br/>
TRANSITION: Not just a versioning tool
</aside>
</section>

<section>
<h5>DVC - Data Version Control</h5>
<div>
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/dvc_home_page1.png">
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/dvc_home_page2.png">
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/dvc_home_page3.png">
</div>
<aside class="notes">
@SBI
- Versioning tool => DATA cache mechanism (REMOTE or LOCAL) <br/>
=> Meatadat sous git (close to git lfs)<br/><br/>

- Amazing tool to perform experiment<br/>
- Project is handle as a PIPELINE composed of several STEP and DEPENDENCIES<br/>

- Collaboration: PUSH and PULL
</aside>
</section>
<section>
<h5>DVC - How it works</h5>

<div>
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC1.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC2.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC3.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC4.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC5.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC6.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC7.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC8.png">
<img class="fragment current-visible" src="./img/dvc/pipeline/DVC9.png">
</div>
<p style="text-align: left"> Save all intermediate results: Metadata <span class="emph">+</span> data</p>
<aside class="notes">

</aside>
</section>

<section>
<h5>DVC - Reproduce only sub pipeline</h5>
<div>
<img class="fragment current-visible" data-fragment-index="0" src="./img/dvc/pipeline/DVC9.png">
<img class="fragment current-visible" data-fragment-index="1" src="./img/dvc/pipeline/DVC_change0.png">
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/pipeline/DVC_change1.png">
<img class="fragment current-visible" data-fragment-index="3" src="./img/dvc/pipeline/DVC_change2.png">
<img class="fragment current-visible" data-fragment-index="4" src="./img/dvc/pipeline/DVC_change2bis.png">
<img class="fragment current-visible" data-fragment-index="5" src="./img/dvc/pipeline/DVC_change2bis.png">
<pre class="fragment current-visible" data-fragment-index="5"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="6" src="./img/dvc/pipeline/DVC_change3.png">
<pre class="fragment current-visible" data-fragment-index="6"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="7" src="./img/dvc/pipeline/DVC_change4.png">
<pre class="fragment current-visible" data-fragment-index="7"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="8" src="./img/dvc/pipeline/DVC_change5.png">
<pre class="fragment current-visible" data-fragment-index="8"><code>dvc repro evaluate.dvc</code></pre>
<img class="fragment current-visible" data-fragment-index="9" src="./img/dvc/pipeline/DVC_change5.png">
<pre class="fragment current-visible" data-fragment-index="9"><code>dvc repro evaluate.dvc</code></pre>
</div>
<p class="fragment current-visible" data-fragment-index="9" style="text-align: left">Do not re-run time consuming tasks</p>
<aside class="notes">
@SBI<br/>
- Change param<br/>
- Want to recalculate METRICS<br/>
- run DVC ... evaluate step<br/>
- DETECTS out of date steps<br/>
- Re- run only needed tasks<br/>
- NO time consuming


</aside>
</section>


<section>
<h5>How to use</h5>
<p>
<pre><code>dvc run -d <span class="step_input">[input_dep]</span> -o <span class="step_output">[output]</span> [command]</code></pre>
</p>
</section>

<section>
<h5>How to use - example</h5>
<div class="fragment">
<p>Step:
<pre><code>extract.py --input ./data.tgz --out ./out/train_set.csv</code></pre>
</p>
</div>
<div class="fragment">With DVC:
<pre class="fragment current-visible"><code>
<span class="step_cmd">dvc run -d ./data.tgz -o ./out/train_set.csv \</span>
extract.py --input ./data.tgz --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d ./data.tgz -o ./out/train_set.csv \
extract.py --input ./data.tgz --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d <span class="step_input">./data.tgz</span> -o ./out/train_set.csv \
extract.py --input <span class="step_input">./data.tgz</span> --out ./out/train_set.csv
</code></pre>
<pre class="fragment current-visible"><code>
dvc run -d <span class="step_input">./data.tgz</span> -o <span class="step_output">./out/train_set.csv</span> \
extract.py --input <span class="step_input">./data.tgz</span> --out <span class="step_output">./out/train_set.csv</span>
</code></pre>
</p>
</div>
<aside class="notes">
@SBI:<br/>
- python step <br/>
- add DVC + dep + out<br/>
- TEDIOUS


</aside>
</section>
<section>
<h5>DVC</h5>
<div class="two-halves">
<div class="half fragment">
<h4>... is great! </h4>
<ul>
<li class="fragment">Handle dependencies</li>
<li class="fragment">Cache mechanism</li>
<li class="fragment">Reproducibility</li>
<li class="fragment">Facilitate collaboration</li>
<li class="fragment">Language agnostic</li>
</ul>
</div>
<div class="half fragment">
<h4>... but </h4>
<ul>
<li class="fragment">Risk of inconsistencies</li>
<li class="fragment">Tedious to write/setup</li>
</ul>
</div>
</div>
<aside class="notes">
@SDG<br/>
1. Step order => Track dependencies (know which steps are needed to achieve a targeted step)<br/>
2. Cache meca => save intermediate res + Reproduce sub pipeline (avoid time consuming task)<br/>
3. Repro on any setup server<br/>
4. Share data<br/>
5. works with any executable<br/>
</aside>
</section>
<section>
<h5>MLV-tools gen_dvc</h5>
<pre><code>gen_dvc -i ./script.py -o ./commands/script_dvc</code></pre>

</section>
<section>
<h5>The script</h5>
<img class="fragment current-visible" data-fragment-index="2" src="./img/mlv_convert/script2.png">
<aside class="notes">
Previously generated with ipynb_to_dvc
</aside>
</section>
<section>

<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str <span class="highlight">data_in</span>: File directory path
:param str output_path: Output file path

<span class="highlight">:dvc-in data_in: ./data/all.zip</span>
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str subset: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str <span class="highlight">output_path</span>: Output file path

:dvc-in data_in: ./data/all.zip
<span class="highlight">:dvc-out output_path: ./data/data_train.csv</span>
"""
</code></pre></div>
<div class="fragment current-visible">
<img src="./img/dvc/script_docstring_extract.png"/>
<pre><code>
"""
:param str <span class="highlight">subset</span>: Subset of data to load {'train', 'test'}
:param str data_in: File directory path
:param str output_path: Output file path

:dvc-in data_in: ./data/all.zip
:dvc-out output_path: ./data/data_train.csv
<span class="highlight">:dvc-extra: --subset test</span>
""" </code></pre></div>
</section>
<section>
<h5>The generation</h5>
<img class="fragment current-visible" data-fragment-index="2" src="./img/dvc/gen_dvc.png">
<aside class="notes">
easy to use
</aside>
</section>
<section>
<h5>The generated bash command</h5>
<img src="./img/dvc/dvc_cmd.png">
<aside class="notes">
@SBI<br/>
- You can see: dvc run CMD + dep + out<br/>
- + step python
</aside>
</section>


<section>
<h5>Benefits</h5>
<ul class="fragment">
<li class="fragment">Easily generate DVC steps</li>
<li class="fragment">Easily create a DVC pipeline</li>
<li class="fragment">Avoid inconsistencies</li>
</ul>
<aside class="notes">
@SDG

+@SBI?
- mentioner que marche que avec conf => voir tuto online<br/>
- Everything is in the Docstring (initialy in the jupyter or script)<br/>
- Docstring can be a template using a pipeline conf (tuto et readme)<br/>
</aside>
</section>

<section>
<h5>Going furter: ipynb_to_dvc</h5>
<pre class="fragment"><code>ipynb_to_dvc -n ./notebook.ipynb</code></pre>
<img class="fragment current-visible plain" src="./img/global_schema1.png">
<img class="fragment current-visible plain" src="./img/global_schema2.png">
<aside class="notes">
- MLV-tools tools are linked but can be used idependantly<br/>
- mentioner que marche que avec conf => voir tuto online<br/>
- Keep ref in jupyter notebook<br/>
</aside>
</section>
Loading