Lightwood works by generating code for `Predictor` objects out of structured data (e.g. a data frame) and a problem definition. The simplest possible problem definition is just the name of the column to predict.
The data can be anything. It can contain numbers, dates, categories, text (in any language, though English is currently the primary focus), quantities, arrays, matrices, images, audio, or video. The last three are referenced as file-system paths or URLs, since storing them as binary data can be cumbersome.
The generated `Predictor` object can be fitted by calling its `learn` method, or through a lower-level step-by-step API. It can then make predictions on similar data (the same columns, except for the target) by calling its `predict` method. That's the gist of it.
There's an intermediate representation called `JsonAI` that gets turned into the final Python code. This provides an easy way to edit the `Predictor` being generated from the original data and problem specifications. It also enables prototyping custom code without modifying the library itself, or even having a "development" version of the library installed.
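To make this concrete, a `JsonAI`-style specification might look roughly like the dictionary below. The field names, encoder names, mixer names, and ensemble name here are illustrative assumptions for the sketch, not Lightwood's exact schema:

```python
# Hypothetical sketch of a JsonAI-style spec; all names below are illustrative
# assumptions, not Lightwood's actual schema.
json_ai_sketch = {
    "problem_definition": {"target": "price"},
    "encoders": {
        "price": "NumericEncoder",        # assumed encoder name
        "description": "TextEncoder",     # assumed encoder name
    },
    "model": {
        "mixers": ["Neural", "LightGBM"],  # assumed mixer names
        "ensemble": "BestOf",              # assumed ensemble name
    },
}

# Editing the spec before code generation is how custom behavior would be
# prototyped, e.g. swapping the encoder used for one column:
json_ai_sketch["encoders"]["description"] = "MyCustomEncoder"
```

The point is that the whole pipeline is data: changing one entry changes the generated code, without touching the library.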
Pipeline
------------
Lightwood abstracts the ML pipeline into 3 core steps:
1. Pre-processing and data cleaning
2. Feature engineering
3. Model building and training
.. Figure: Lightwood "under-the-hood" (centered diagram; image path not recovered)
By default, each of them entails:
i) Pre-processing and cleaning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For each column in your dataset, Lightwood will infer the likely data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a `JsonAI` object.
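A toy version of such type inference might look like the following. This is a naive sketch with made-up thresholds, not Lightwood's actual analysis:

```python
def infer_type(values):
    """Naively infer a column's type from a sample of its values (sketch only;
    Lightwood's real analysis is more thorough)."""
    sample = [v for v in values if v is not None]
    if sample and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in sample
    ):
        return "numeric"
    # Few distinct values relative to the sample size suggests a categorical
    # column; the 10% threshold here is an arbitrary illustrative choice.
    if len(set(sample)) <= max(1, len(sample) // 10):
        return "categorical"
    return "text"
```

For example, `infer_type([1, 2, 3.5])` yields `"numeric"`, while a column of mostly-unique strings falls through to `"text"`.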
Lightwood will perform a brief pre-processing pass to clean each column according to its identified data type (e.g. dates represented as a mix of string formats and timestamp floats are converted to datetime objects). From there, it will split the data into train/dev/test sets.
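A minimal sketch of those two pieces, assuming a date column with mixed representations and an 80/10/10 split (the date formats and split ratios are illustrative choices, not Lightwood's defaults):

```python
from datetime import datetime, timezone
import random

def clean_date(value):
    """Coerce a mixed-format date value to a datetime, or None if unparseable."""
    if isinstance(value, (int, float)):  # timestamp float
        return datetime.fromtimestamp(value, tz=timezone.utc)
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # assumed known string formats
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            pass
    return None  # unparseable -> treated as missing

def split(rows, pct=(0.8, 0.1, 0.1), seed=0):
    """Shuffle rows and cut them into train/dev/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    a = int(len(rows) * pct[0])
    b = a + int(len(rows) * pct[1])
    return rows[:a], rows[a:b], rows[b:]
```

In the real pipeline these roles are played by the `cleaner` and `splitter` described below, which are per-column-type and configurable.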
The `cleaner` and `splitter` objects respectively refer to the pre-processing and the data splitting functions.
ii) Feature Engineering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into a numerical representation that a model can use.
Encoders can be **rule-based** or **learned**. A rule-based encoder transforms data per a specific set of instructions (e.g. normalizing numerical data), whereas a learned encoder produces a representation of the data after training (e.g. the "[CLS]" token in a language model).
Encoders are assigned to each column of data based on the data type, and depending on the type there can be inter-column dependencies (e.g. time series). Users can override this assignment either at the column-based level or at the datatype-based level. Encoders inherit from the `BaseEncoder` class.
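A rule-based encoder in this style could be sketched as follows. The `BaseEncoder` here is a minimal stand-in written for illustration; Lightwood's actual base class has a richer interface:

```python
class BaseEncoder:
    """Minimal stand-in for an encoder interface: prepare(), then encode().
    Illustrative only; not Lightwood's actual BaseEncoder."""
    def prepare(self, column_data):
        raise NotImplementedError
    def encode(self, column_data):
        raise NotImplementedError

class MinMaxNumericEncoder(BaseEncoder):
    """Rule-based encoder: min-max normalization of a numeric column."""
    def prepare(self, column_data):
        # "Training" a rule-based encoder just records column statistics.
        self.lo, self.hi = min(column_data), max(column_data)
    def encode(self, column_data):
        span = (self.hi - self.lo) or 1.0  # guard against constant columns
        return [(x - self.lo) / span for x in column_data]
```

A learned encoder would have the same `prepare`/`encode` shape, but `prepare` would train a model instead of recording statistics.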
iii) Model Building and Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We call a predictive model that takes *encoded* feature data as input and outputs a prediction for the target of interest a `mixer` model. Users can either use Lightwood's default mixers or create their own approaches by inheriting from the `BaseMixer` class.
We predominantly use PyTorch-based approaches, but can support other models.
Multiple mixers can be trained for any given `Predictor`. After mixer training, an ensemble is created (and potentially trained) to decide which mixers to use and how to combine their predictions.
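One simple ensembling strategy along these lines is to keep only the mixer whose predictions score best on the dev set. This is a generic sketch of that idea, not necessarily how Lightwood's ensembles are implemented:

```python
def best_of(mixers, dev_x, dev_y, loss):
    """Return the mixer whose predictions minimize total loss on the dev set.
    A 'keep the single best mixer' ensemble, sketched generically."""
    def total_loss(mixer):
        return sum(loss(mixer(x), y) for x, y in zip(dev_x, dev_y))
    return min(mixers, key=total_loss)

# Two toy "mixers": plain callables standing in for trained models.
double = lambda x: 2 * x
triple = lambda x: 3 * x
```

Richer ensembles (e.g. weighted averaging of several mixers) would slot into the same place in the pipeline: they consume per-mixer predictions and dev-set scores.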
Finally, a "model analysis" step looks at the whole ensemble and extracts some stats about it, in addition to building confidence models that allow us to output a confidence score and prediction interval for each prediction. We also use this step to generate some explanations about model behavior.
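One standard way to obtain such intervals is split-conformal prediction: take the absolute residuals on a held-out calibration set and use their (1 - alpha) quantile as a symmetric bound around each prediction. A sketch of that idea (not necessarily the analysis Lightwood ships):

```python
def conformal_interval(calib_residuals, prediction, alpha=0.1):
    """Symmetric prediction interval from held-out absolute residuals,
    in the split-conformal style (illustrative sketch)."""
    qs = sorted(abs(r) for r in calib_residuals)
    # Index of the conformal quantile, clamped to the largest residual.
    k = min(len(qs) - 1, int((1 - alpha) * (len(qs) + 1)))
    q = qs[k]
    return prediction - q, prediction + q
```

The appeal of this family of methods is that the interval's coverage guarantee depends only on the calibration split, not on which mixers ended up in the ensemble.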
Predicting is very similar: data is cleaned and encoded, then mixers make their predictions and the results are ensembled. Finally, explainer modules determine things like confidence, prediction bounds, and column importances.
Strengths and drawbacks
------------------------
The main benefit of Lightwood's architecture is that it is very easy to extend. Full understanding (or even any understanding) of the pipeline is not required to improve a specific component. Users can easily integrate their custom code with minimal hassle, even if PRs are not accepted, while still pulling everything else from upstream. This works well with the open-source nature of the project.
The second advantage is that the pipeline is relatively easy to parallelize, since most tasks are done per feature. The bits that operate on all the data (mixer training and model analysis) are made up of multiple blocks with similar APIs, which can themselves run in parallel.
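The per-feature structure parallelizes naturally; for example, each column can be encoded as an independent task. A generic sketch using the standard library (not Lightwood's actual scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_columns(columns, encode_fn):
    """Encode each column in parallel; per-column tasks share no state,
    so they can be dispatched independently."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(encode_fn, data) for name, data in columns.items()
        }
        return {name: f.result() for name, f in futures.items()}
```

The same pattern applies to the analysis blocks: anything with a uniform per-unit API can be fanned out to a pool.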
Finally, most of Lightwood is built on PyTorch, and PyTorch mixers and encoders are first-class citizens insofar as the data format makes them the easiest to work with. In that sense, performance on specialized hardware and continued compatibility are taken care of for us, which frees up time to work on other things.
The main drawback, however, is that the pipeline separation doesn't allow phases to exert much influence on each other or run jointly. This means you can't easily have mixer gradients propagate back through to train the encoders, nor analysis blocks that inspect the model and decide the data cleaning procedure should change. Granted, there's no hard limit on this, but any such implementation would be rather unwieldy in terms of code complexity.