Linear algebra, in particular matrix-matrix and matrix-vector products, is a prerequisite for understanding Machine Learning. The Matrix Cookbook is an excellent resource for identities and various calculations.
https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
In general, only differentiation is needed from calculus: it provides the partial derivatives used in gradient descent updates.
https://en.wikipedia.org/wiki/Derivative
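For reference, the basic gradient descent update, with parameters $\theta$, learning rate $\eta$, and loss $L$:

$$\theta \leftarrow \theta - \eta\,\frac{\partial L}{\partial \theta}.$$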
Machine learning is a numerical technique in which data is sampled from a system of interest and a parameterized computational model is fit to that data. The fitted model is then used to make predictions given new data. The most basic task is to take a data point and classify it as one of a number of classes. Historically, machine learning relied upon very shallow but well-understood architectures to model data, for example linear regression and logistic regression, which rely upon forms like the following:
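$$\hat{y} = W x + b \qquad \text{(linear regression)},$$

$$\hat{y} = \sigma(W x + b) \qquad \text{(logistic regression)},$$

where $W$ and $b$ are learned weights and bias, and $\sigma$ is the logistic sigmoid defined below.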
Deep learning, instead of relying upon more complex building blocks to model data with more intricate patterns, relies upon composing simple models. Typically this takes the form of a series of alternating linear and nonlinear layers, for example:
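$$f(x) = W_2\,\sigma(W_1 x + b_1) + b_2,$$

or, with more stages,

$$f(x) = W_L\,\sigma\big(\cdots\,\sigma(W_1 x + b_1)\,\cdots\big) + b_L,$$

where each $(W_i, b_i)$ is a learned linear layer and $\sigma$ is an elementwise nonlinearity (e.g. ReLU).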
Note that without the nonlinear stages, additional linear stages have no impact: they compose into a single linear stage, except that a low-rank factorization still constrains the resulting map.
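Concretely, for two stacked linear stages:

$$W_2(W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2),$$

which is again a single linear stage; however, if $W_1 \in \mathbb{R}^{r \times n}$ and $W_2 \in \mathbb{R}^{m \times r}$ with small $r$, the product has rank at most $r$, which is the low-rank case noted above.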
Deep models are typically represented graphically, as a directed graph of operations; operations can also have multiple inputs.
As a graphical model: (TODO)
Supervised learning is the fitting of a model to explicit input/output pairs (the outputs a.k.a. annotations or labels). This is typically stated as:
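$$\min_\theta \sum_i L\big(f_\theta(x_i), y_i\big),$$

i.e. find the parameters $\theta$ of the model $f_\theta$ that minimize a loss $L$ over the labeled pairs $(x_i, y_i)$; the choice of $L$ depends on the task.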
Unsupervised learning is the process of finding patterns in data without explicit labels, i.e. only the inputs $x_i$ are available and any structure must be inferred from the data itself.
Semi-supervised learning (in particular its self-supervised form) generally doesn't have annotated input/output pairs. Instead, the target value is generated using some heuristic, such as predicting a value based upon its spatial context.
Multi-class classification is the task of assigning 1 of N classes to a data point. In this case the output values are typically normalized to sum to one, often using a softmax function:
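$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}},$$

applied to the vector $z$ of raw model outputs (logits).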
Binary classification, applied independently per class, is the task of assigning any number of the N classes to a data point. In this case each output value is compressed to the range [0, 1], often using a sigmoid function:
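$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

applied independently to each of the $N$ output values.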
In both of these cases, the output values can be interpreted as probabilities. From a computational perspective, however, they don't necessarily carry any significance as probabilities; they are more a result of the engineered properties of the architecture and the training process. This is further reinforced by the fact that probability is ill-defined in most real-world cases, relying upon prior assumptions about the underlying distribution and parameter estimates; other literature discusses the definition of probability at length. That said, it is a useful interpretation.
Regression is the task of outputting real numbers, predicting a value in a finite or infinite range.
Tabular data means data that comes in the form of a table where each row represents a different data point. Each data point is a vector concatenating a number of values for specific fields/columns. In this case models are generally used to predict a specific field from the other fields, a form of supervised learning.
Given pairs $(x_i, y_i)$, where $x_i$ collects the other fields and $y_i$ is the field to predict, we want to fit a function $f$ such that $f(x_i) \approx y_i$.
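As a minimal sketch of this fitting process in JAX (a linear model trained with plain gradient descent; the data and names here are illustrative, not from these notes):

```python
import jax
import jax.numpy as jnp

# Toy table: two feature columns per row; predict the target column y.
X = jnp.array([[1.0, 2.0],
               [2.0, 0.5],
               [3.0, 1.5],
               [4.0, 3.0]])
y = jnp.array([5.0, 4.5, 7.5, 11.0])

def predict(params, x):
    w, b = params
    return x @ w + b  # linear model: f(x) = x.w + b

def loss(params, X, y):
    return jnp.mean((predict(params, X) - y) ** 2)  # mean squared error

params = (jnp.zeros(X.shape[1]), jnp.zeros(()))
grad_fn = jax.grad(loss)  # d(loss)/d(params) via automatic differentiation

for _ in range(500):
    grads = grad_fn(params, X, y)
    # Gradient descent update on every parameter in the pytree.
    params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)

print(predict(params, X))  # should be close to y after training
```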
In signal processing we aim to make predictions given a sequence of data. The input can be either a finite window of historical data or all previous data, but never future data.
Often this is solved by using a sliding window of values in order to fix the number of inputs:
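$$\hat{y}_t = f(x_{t-k+1}, \ldots, x_t),$$

where $k$ is the fixed window length.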
Or by using a state variable in a recurrent model:
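$$h_t = g(h_{t-1}, x_t), \qquad \hat{y}_t = f(h_t),$$

where the state $h_t$ summarizes all inputs seen so far.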
In computer vision we typically make predictions given a single image, a temporal sequence of images, or some other image array; the main difference from 1D signals is that at each time step the data is a 2D array. This includes tasks such as image classification, object detection, object identification, segmentation, and object tracking. 3D perception is also often performed on 2D data to extract 3D structure, such as in multi-view reconstruction (an extension of stereo vision).
In natural language processing, strings are processed similarly to signals. The most basic task is classifying an input sequence.
In the case of a language model, the task is to classify (predict) the next word given a sequence. The most familiar example of such a system is autocomplete. The large language models of today, despite their impressive capabilities, are still trained to perform this same task, only with much higher-capacity architectures.
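In symbols, the model estimates

$$p(w_{t+1} \mid w_1, \ldots, w_t),$$

and is typically trained by minimizing the cross-entropy $-\sum_t \log p(w_{t+1} \mid w_1, \ldots, w_t)$ over a text corpus.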
- https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
- https://github.com/pliang279/awesome-multimodal-ml
Sparsity is not merely a way to accelerate neural networks; it is integral to designing scalable architectures.
References:
Essentially all modern tools work by allowing the user to construct a computational graph of operations containing undefined parameters, which are learned via some gradient descent optimization approach. Each framework has its own flavor. For specific applications it is always worth searching for an open-source software (OSS) repository where someone else has wired together a more task-specific implementation, with input/output encoding/decoding and a reasonable model architecture.
TensorFlow was the original major framework for graphical ML, built by Google. It is still widely deployed, but new development largely happens in JAX and PyTorch. Within TensorFlow, the Keras frontend/syntax for building models is preferred by most users, but it is limited in terms of what you can build.
PyTorch is a framework built by Facebook/Meta that is broadly similar to TensorFlow but with many of the syntax oddities fixed.
Recommended toolkit for future development:
JAX is the successor to TensorFlow from Google. It is heavily used by machine learning researchers today.
Main benefits:
- Differentiable NumPy interface (see jax.numpy; a short sketch follows this list)
- Arbitrary-order derivatives, via rewriting expressions into the same language (so a gradient function can itself be differentiated)
- Highly optimized for a wide variety of platforms (e.g. CPU, Nvidia, AMD, Google TPU)
- Highly distributed processing (e.g. FSDP)
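A minimal sketch of the first two benefits (the function here is illustrative; only jax.numpy and jax.grad are used):

```python
import jax
import jax.numpy as jnp

# Ordinary numpy-style code, written against the jax.numpy interface.
def f(x):
    return jnp.sin(x) * x ** 2

df = jax.grad(f)    # derivative of f, itself an ordinary JAX function
d2f = jax.grad(df)  # gradients compose, giving arbitrary-order derivatives

print(f(1.0), df(1.0), d2f(1.0))
```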
Notes:
- Flax: neural-network module library built on JAX
- Orbax: checkpointing and persistence utilities
- Optax: gradient processing and optimization library