- Pipelining ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))
- Data x Code x Hyperparameters versioning ([DVC](https://github.com/iterative/dvc) and [MLV-tools](https://github.com/peopledoc/ml-versioning-tools))

For a more realistic use case, we consider a Natural Language Processing pipeline based on the well-known 20-newsgroups dataset.

The base pipeline is simplified to include the following steps:

1. Split the data between train and test sets;
2. Tokenize (split into words) the raw text input;
3. Classify the cleaned input with FastText;
4. Evaluate the model.
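
To make the steps concrete, here is a minimal sketch of the same four stages in plain Python. It is only an illustration, not the tutorial's actual step scripts: it assumes the `fasttext` package is installed and uses scikit-learn's copy of 20-newsgroups.

```python
# Minimal sketch of the four pipeline steps (illustrative only; in the
# tutorial each step lives in its own notebook / DVC-managed script).
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import fasttext

# 1. Split the data between train and test sets
data = fetch_20newsgroups(subset="all")
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

def tokenize(text: str) -> str:
    # 2. Tokenize: a crude lowercase/whitespace split stands in for real cleaning
    return " ".join(text.lower().split())

def write_fasttext_file(path, texts, labels):
    # FastText's supervised format: one "__label__<y> <text>" line per sample
    with open(path, "w") as f:
        for text, label in zip(texts, labels):
            f.write(f"__label__{label} {tokenize(text)}\n")

write_fasttext_file("train.txt", x_train, y_train)
write_fasttext_file("test.txt", x_test, y_test)

# 3. Classify the cleaned input with FastText
model = fasttext.train_supervised(input="train.txt")

# 4. Evaluate: test() returns (number of samples, precision@1, recall@1)
n, precision, recall = model.test("test.txt")
print(f"precision={precision:.3f} recall={recall:.3f} on {n} samples")
```
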
In addition, use case 4 showcases hyperparameter tuning with a scikit-learn classifier.
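
Purely as an illustration of this kind of tuning, here is a rough scikit-learn sketch; the TF-IDF + `SGDClassifier` pipeline and the parameter grid are assumptions made for the example, not the tutorial's exact setup (see [Use Case 4](./tutorial/use_case4.md) for that):

```python
# Illustrative scikit-learn hyperparameter search on 20-newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset="train")

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier()),
])

# Each grid point is one experiment whose metrics can be saved and versioned.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__alpha": [1e-3, 1e-4, 1e-5]},
    cv=3,
)
grid.fit(train.data, train.target)
print(grid.best_params_, grid.best_score_)
```
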
The use cases considered are the following:

- Reproduce a pipeline / experiment on new data ([Use Case 1: Build and Reproduce a Pipeline](./tutorial/use_case1.md))
- Versioning and storage of intermediate (preprocessed) datasets, with partial execution of the pipeline starting from precomputed data
  ([Use Case 2: Create a new version of a pipeline](./tutorial/use_case2.md) and [Use Case 3: Build a Pipeline from an Existing Pipeline](./tutorial/use_case3.md))
- Hyperparameter optimisation and fine-tuning (saving results, [Use Case 4: Combine Metrics](./tutorial/use_case4.md))

These use cases were chosen because we think they represent the typical day-to-day work of a data scientist: easily reproducing a pipeline built some time ago or by a coworker, versioning intermediate states of the data to avoid rerunning costly preprocessing steps, branching experiments to test different classifiers, and fine-tuning hyperparameters.

## Standard Versioning Process Establishment
**Aim:** find a way to version code, data and pipelines.
#### Existing project
Starting from an existing project composed of Python 3 module(s) and a set of **Jupyter notebooks**,
we want to create an automated pipeline so that we can version, share and reproduce experiments.
```
├── classifier
│   ├── aggregate_classif.py
│   ├── __init__.py
│   ├── extract.py
│   └── ...
├── notebooks
│   ├── Augment train data.ipynb
│   ├── Check data and split and train.ipynb
│   ├── Extract data.ipynb
│   ├── Learn text classifier.ipynb
│   ├── Learn aggregated model.ipynb
│   ├── Preprocess image data.ipynb
│   └── Train CNN classifier on image data.ipynb
├── README.md
├── requirements.yml
├── setup.cfg
└── setup.py
```

The applied transformations will let us handle data through metadata files and convert **Jupyter notebooks** to Python scripts. We need to keep in mind that:

- The reference for each step's code remains the **Jupyter notebook**;
- The reference for input and output data is the set of parameters defined in each **DVC** command: the pipeline is structured according to its inputs and outputs (see the sketch below).