Commit 0d8e420 (2 parents: 84b0767 + f2e7aa5)
Merge pull request #817 from mindsdb/staging
Release 22.1.4.0

40 files changed: +1878 −1249 lines

docssrc/source/conf.py (+1 −1)

@@ -11,7 +11,7 @@
 import sys
 import os
 import datetime
-import sphinx_rtd_theme
+
 
 # ----------------- #
 # Project information

docssrc/source/index.rst (+8 −4)

@@ -5,7 +5,7 @@
    contain the root ``toctree`` directive.
 
 ****************************************
-Welcome to Lightwood's Documentation!
+Lightwood
 ****************************************
 
 :Release: |release|
@@ -26,7 +26,6 @@ Quick Guide
 - :ref:`Installation <Installation>`
 - :ref:`Example Use Cases <Example Use Cases>`
 - :ref:`Contribute to Lightwood <Contribute to Lightwood>`
-- :ref:`Hacktoberfest 2021 <Hacktoberfest 2021>`
 
 Installation
 ============
@@ -225,7 +224,12 @@ Other Links
 .. toctree::
    :maxdepth: 8
 
-   lightwood_philosophy
    tutorials
    api
-   data
+   data
+   encoder
+   mixer
+   ensemble
+   analysis
+   helpers
+   lightwood_philosophy
docssrc/source/lightwood_philosophy.rst (+44 −5)

@@ -1,6 +1,21 @@
 :mod:`Lightwood Philosophy`
 ================================
 
+
+Introduction
+------------
+
+Lightwood works by generating code for `Predictor` objects out of structured data (e.g. a data frame) and a problem definition; the simplest possible definition is just the column to predict.
+
+The data can be anything: numbers, dates, categories, text (in any language, though English is currently the primary focus), quantities, arrays, matrices, images, audio, or video. The last three are passed as file-system paths or URLs, since storing them as binary data can be cumbersome.
+
+The generated `Predictor` object can be fitted by calling a learn method, or through a lower-level step-by-step API. It can then make predictions on similar data (same columns except for the target) by calling a predict method. That's the gist of it.
+
+There is an intermediate representation, called `JsonAI`, that gets turned into the final `Python` code. It provides an easy way to edit the `Predictor` being generated from the original data and problem specification, and it enables prototyping custom code without modifying the library itself, or even having a "development" version of the library installed.
+
+Pipeline
+------------
+
 Lightwood abstracts the ML pipeline into 3 core steps:
 
 1. Pre-processing and data cleaning
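The generate-fit-predict flow described in the introduction above can be sketched in plain Python. This is a toy stand-in, not Lightwood's actual API: the spec dict, the template, and the helper names are all invented for illustration, and the real `JsonAI` representation is far richer than this.

```python
# Toy sketch of "spec -> generated code -> Predictor" (illustrative only;
# the real library renders a much richer JsonAI object into Python code).

SPEC = {"target": "y", "model": "mean"}  # invented stand-in for a JsonAI spec

TEMPLATE = '''
class Predictor:
    """Generated from spec: target={target!r}, model={model!r}."""
    target = {target!r}

    def learn(self, rows):
        # Toy "mean" model: remember the average of the target column.
        vals = [r[self.target] for r in rows]
        self.avg = sum(vals) / len(vals)

    def predict(self, rows):
        # Rows have the same columns as training data, minus the target.
        return [self.avg for _ in rows]
'''

def code_from_spec(spec):
    # Render Python source from the intermediate representation.
    return TEMPLATE.format(**spec)

def predictor_from_code(code):
    # Turn the generated source into a live Predictor instance.
    ns = {}
    exec(code, ns)
    return ns["Predictor"]()

predictor = predictor_from_code(code_from_spec(SPEC))
predictor.learn([{"x": 1, "y": 2.0}, {"x": 2, "y": 4.0}])
print(predictor.predict([{"x": 3}]))  # -> [3.0]
```

Editing the spec (or the rendered source itself) before instantiation is what makes the intermediate-representation approach convenient for prototyping.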
@@ -11,24 +26,48 @@ Lightwood abstracts the ML pipeline into 3 core steps:
    :align: center
    :alt: Lightwood "under-the-hood"
 
+By default, each step entails the following:
+
 i) Pre-processing and cleaning
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-For each column in your dataset, Lightwood will identify the suspected data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a JSON-AI syntax.
+For each column in your dataset, Lightwood will infer the suspected data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a JsonAI object.
 
-If the user keeps default behavior, Lightwood will perform a brief pre-processing approach to clean each column according to its identified data type. From there, it will split the data into train/dev/test splits.
+Lightwood will perform a brief pre-processing pass to clean each column according to its identified data type (e.g. dates represented as a mix of string formats and timestamp floats are converted to datetime objects). From there, it will split the data into train/dev/test splits.
 
 The `cleaner` and `splitter` objects respectively refer to the pre-processing and the data splitting functions.
 
 ii) Feature Engineering
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into a numerical representations that a model can be used.
+Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into a numerical representation that a model can use.
 
 Encoders can be **rule-based** or **learned**. A rule-based encoder transforms data per a specific set of instructions (ex: normalized numerical data) whereas a learned encoder produces a representation of the data after training (ex: a "\[CLS\]" token in a language model).
 
-Encoders are assigned to each column of data based on the data type; users can override this assignment either at the column-based level or at the data-type based level. Encoders inherit from the `BaseEncoder` class.
+Encoders are assigned to each column of data based on its data type, and depending on the type there can be inter-column dependencies (e.g. time series). Users can override this assignment either at the column level or at the data-type level. Encoders inherit from the `BaseEncoder` class.
 
 iii) Model Building and Training
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 We call a predictive model that intakes *encoded* feature data and outputs a prediction for the target of interest a `mixer` model. Users can either use Lightwood's default mixers or create their own approaches inherited from the `BaseMixer` class.
 
-We predominantly use PyTorch based approaches, but can support other models.
+We predominantly use PyTorch-based approaches, but other models can be supported.
+
+Multiple mixers can be trained for any given `Predictor`. After mixer training, an ensemble is created (and potentially trained) to decide which mixers to use and how to combine their predictions.
+
+Finally, a "model analysis" step looks at the whole ensemble and extracts statistics about it, in addition to building confidence models that let us output a confidence score and prediction intervals for each prediction. We also use this step to generate explanations of model behavior.
+
+Predicting is very similar: data is cleaned and encoded, then the mixers make their predictions and those get ensembled. Finally, explainer modules determine things like confidence, prediction bounds, and column importances.
+
+Strengths and drawbacks
+------------------------
+
+The main benefit of Lightwood's architecture is that it is very easy to extend: full understanding of the pipeline (or even any understanding) is not required to improve a specific component. Users can easily integrate their custom code with minimal hassle, even if their PRs are not accepted, while still pulling everything else from upstream. This works well with the open-source nature of the project.
+
+The second advantage is that the pipeline is relatively trivial to parallelize, since most tasks are done per-feature. The steps that operate on all the data (mixer training and model analysis) are made up of multiple blocks with similar APIs, which can themselves be run in parallel.
+
+Finally, most of Lightwood is built on PyTorch, and PyTorch mixers and encoders are first-class citizens insofar as the data format makes them the easiest to work with. Performance on specialized hardware and continued compatibility are therefore taken care of for us, which frees up time to work on other things.
+
+The main drawback is that the separation into phases doesn't allow them to influence each other strongly or run jointly: you can't easily have mixer gradients propagate back to train the encoders, nor analysis blocks inspect the model and decide that the data cleaning procedure should change. There is no hard limit on this, but any such implementation would be rather unwieldy in terms of code complexity.
+
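The three-step split described above (clean, encode, then mix and ensemble) can be illustrated with a self-contained toy. Every class here is an invented stand-in for Lightwood's real cleaner/encoder/mixer/ensemble machinery, kept deliberately tiny:

```python
# Illustrative mini-pipeline: encode -> train mixers -> pick best via ensemble.
# All classes are toy stand-ins for the concepts in the text, not real APIs.

class RuleBasedEncoder:
    """Rule-based encoder: min-max normalisation, no training needed."""
    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
    def encode(self, v):
        return (v - self.lo) / (self.hi - self.lo or 1.0)

class MeanMixer:
    """Mixer that predicts the training mean, ignoring features."""
    def fit(self, X, y):
        self.avg = sum(y) / len(y)
    def __call__(self, X):
        return [self.avg for _ in X]

class NearestMixer:
    """Mixer that predicts the target of the nearest training row (1-NN)."""
    def fit(self, X, y):
        self.X, self.y = X, y
    def __call__(self, X):
        def nearest(x):
            i = min(range(len(self.X)), key=lambda j: abs(self.X[j] - x))
            return self.y[i]
        return [nearest(x) for x in X]

def best_of_ensemble(mixers, X_dev, y_dev):
    """Ensemble step: keep the mixer with the lowest dev-set absolute error."""
    def err(m):
        return sum(abs(p - t) for p, t in zip(m(X_dev), y_dev))
    return min(mixers, key=err)

# Train/dev split of a toy numeric column and target (target is roughly 2*x).
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
x_dev, y_dev = [2.4, 3.6], [4.8, 7.2]

enc = RuleBasedEncoder()
enc.fit(x_train)
X_train = [enc.encode(v) for v in x_train]
X_dev = [enc.encode(v) for v in x_dev]

mixers = []
for m in (MeanMixer(), NearestMixer()):
    m.fit(X_train, y_train)
    mixers.append(m)

model = best_of_ensemble(mixers, X_dev, y_dev)
print(type(model).__name__)  # -> NearestMixer
```

The per-feature structure (one encoder per column, mixers and analysis composed of independent blocks) is what makes the parallelization argument above work.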

docssrc/source/mixer.rst (+7)

@@ -0,0 +1,7 @@
+:mod:`Mixers`
+==========================
+
+Mixers learn to map encoded representations to predictions; they are the core of Lightwood's AutoML.
+
+.. automodule:: mixer
+   :members:
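As a rough sketch of the custom-mixer idea from the philosophy page: a mixer is anything that fits on encoded features and maps them to predictions. The interface below is invented for illustration; Lightwood's real `BaseMixer` has a richer API.

```python
# Toy BaseMixer-style interface (illustrative only; not the real class).

class BaseMixer:
    def fit(self, X, y):
        raise NotImplementedError
    def __call__(self, X):
        raise NotImplementedError

class MedianMixer(BaseMixer):
    """Custom mixer: always predicts the training median, ignoring features."""
    def fit(self, X, y):
        s = sorted(y)
        n = len(s)
        self.med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    def __call__(self, X):
        return [self.med for _ in X]

m = MedianMixer()
m.fit([[0.1], [0.5], [0.9]], [1.0, 2.0, 10.0])
print(m([[0.3]]))  # -> [2.0]
```

Any object with this fit/call shape can slot into the pipeline alongside the default mixers, which is the extension point the documentation is describing.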

docssrc/source/mixer/helpers/helpers.rst (−4)

This file was deleted.

docssrc/source/mixer/mixer.rst (−9)

This file was deleted.

docssrc/source/tutorials.rst (+5 −25)

@@ -4,31 +4,11 @@
    :maxdepth: 1
    :caption: Table of Contents:
 
-
-Getting started with Lightwood and JSON-AI
-----------------------------------------------
-The following tutorial will walk you through a simple tabular dataset with JSON-AI.
-
-| How to use Lightwood for your data (Coming Soon!)
-| `Lightwood for a quick data analysis <tutorials/tutorial_data_analysis/tutorial_data_analysis.ipynb>`_
-
-
-Run models with more complex data types
-------------------------------------------------
-
-Below, you can see how Lightwood handles language and time-series data.
-
-| Using Language Models (Coming Soon!)
-| Make your own timeseries predictor (Coming Soon!)
-
-
-Bring your own custom methods
-------------------------------------------------
-We support users bringing their custom methods. To learn how to build your own pipelines, check out the following notebooks:
-
 | `Construct a custom preprocessor to clean your data <tutorials/custom_cleaner/custom_cleaner.ipynb>`_
 | `Make your own train and test split <tutorials/custom_splitter/custom_splitter.ipynb>`_
-| `Create your own encoder to featurize your data <tutorials/custom_encoder_rulebased/custom_encoder_rulebased.ipynb>`_ (Rule-based)
-| Create your own encoder to featurize your data using a learned representation (Coming Soon!)
+| `Create your own encoder <tutorials/custom_encoder_rulebased/custom_encoder_rulebased.ipynb>`_
 | `Design a custom mixer model <tutorials/custom_mixer/custom_mixer.ipynb>`_
-| `Use your own model explainer <tutorials/custom_explainer/custom_explainer.ipynb>`_
+| `Use your own model explainer <tutorials/custom_explainer/custom_explainer.ipynb>`_
+| `Solve a timeseries problem <tutorials/tutorial_time_series/tutorial_time_series.ipynb>`_
+| `Using lightwood for data analysis <tutorials/tutorial_data_analysis/tutorial_data_analysis.ipynb>`_
+| `Update existing mixers with new data <tutorials/tutorial_update_models/tutorial_update_models.ipynb>`_
