Trying to fit all polynomial terms into a linear regression or logistic regression can be very expensive when the model has many features. The total number of features grows on the order of O(n^m), where n is the number of features and m is the maximum order of the polynomial we want to include (e.g. x^6). The quadratic features alone are already on the order of O(n²/2).
Example from computer vision: recognizing a car.
50x50 pixel images -> 2500 pixels
n = 2500 (grayscale) / 7500 (RGB)
x = [ p1 p2 p3 ... p2500 ]
# where:
# pi = pixel i intensity (0-255)
If we were to include all the quadratic features (xi * xj), we would have roughly 3 million features.
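As a quick sanity check on that count (a minimal sketch; the exact figure depends on whether the squared terms xi² are counted):

# back-of-the-envelope count of quadratic features for the 50x50 grayscale case
n = 2500
quadratic_terms = n * (n + 1) // 2   # all products xi * xj with i <= j
print(quadratic_terms)               # 3126250 -- roughly 3 million features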
Neural networks are algorithms that try to mimic the brain. They came from the hypothesis that the brain runs one single learning algorithm, based on experiments such as:
- Seeing with your tongue: a camera that transmits signals to the tongue
- Human echolocation (sonar): learning how to move according to sound vibrations
- Haptic belt: direction sense
- Implanting a 3rd eye in a frog: the frog will learn how to use it.
Analogous to how a biological neuron works, the basic building block of an artificial neural network is a simple unit that receives inputs, performs some computation, and outputs a value based on those inputs.
x1 ------------> a1 ----------\
x2 ------------> a2 -----------> * -----------> h(x)
x3 ------------> a3 ----------/

    Layer 1          Layer 2          Layer 3
    (input)          (hidden)         (output)
# Where:
#   Θ^(j)  = matrix of weights controlling the function mapping from layer j to layer (j+1)
#   ai^(j) = "activation" of unit i in layer j
#   (written below with superscript digits: a1² = unit 1 in layer 2, Θ¹ = weights from layer 1 to 2)

a1² = g(Θ¹10 * x0 + Θ¹11 * x1 + Θ¹12 * x2 + Θ¹13 * x3)
a2² = g(Θ¹20 * x0 + Θ¹21 * x1 + Θ¹22 * x2 + Θ¹23 * x3)
a3² = g(Θ¹30 * x0 + Θ¹31 * x1 + Θ¹32 * x2 + Θ¹33 * x3)
h(x) = a1³ = g(Θ²10 * a0² + Θ²11 * a1² + Θ²12 * a2² + Θ²13 * a3²)

# If a network has s_j units in layer j and s_(j+1) units in layer (j+1), then
# Θ^(j) will be of dimension s_(j+1) x (s_j + 1)
# e.g. here Θ¹ is 3 x 4: 3 hidden units, 3 inputs plus the bias unit.
Anything that is not input or output layer is a hidden layer.
Forward Propagation: Vectorized implementation
ai² = g(Θ¹i0 * x0 + Θ¹i1 * x1 + Θ¹i2 * x2 + Θ¹i3 * x3)
zi² = Θ¹i0 * x0 + Θ¹i1 * x1 + Θ¹i2 * x2 + Θ¹i3 * x3
# where the superscript ² denotes the layer (layer 2 being the hidden layer)
# and i indexes the unit within that layer, so ai² = g(zi²)
x = [x0; x1; x2; x3]
z² = [z1²; z2²; z3²]
z² = Θ¹ * x
a² = g(z²)
# we can think of the inputs x as the activations of layer 1 (a¹):
z² = Θ¹ * a¹
# To take care of the bias unit, add:
a0² = 1
z³ = Θ² * a²
h(x) = a³ = g(z³)
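A minimal NumPy sketch of this vectorized forward pass, assuming the 3-3-1 architecture drawn above (the weight values here are just illustrative placeholders, not learned parameters):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # Theta1: 3 x 4 matrix mapping layer 1 (3 inputs + bias) to layer 2
    # Theta2: 1 x 4 matrix mapping layer 2 (3 units + bias) to the output
    a1 = np.insert(x, 0, 1.0)              # a¹ = [x0; x1; x2; x3] with x0 = 1
    z2 = Theta1 @ a1                       # z² = Θ¹ * a¹
    a2 = np.insert(sigmoid(z2), 0, 1.0)    # a² = g(z²), then add bias a0² = 1
    z3 = Theta2 @ a2                       # z³ = Θ² * a²
    return sigmoid(z3)                     # h(x) = a³ = g(z³)

Theta1 = np.random.randn(3, 4)             # placeholder weights
Theta2 = np.random.randn(1, 4)
print(forward(np.array([1.0, 0.0, 1.0]), Theta1, Theta2))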
Other network architectures:
x1 ------> *
           *
x2 ------> * ------> * -----> h(x)
           *
x3 ------> *

   Layer 1    Layer 2    Layer 3    Layer 4
   (input)    (hidden)   (hidden)   (output)
Given this neural network:
x1, x2 are binary (0 or 1)
x0 ---\
x1 ----> h(x)
x2 ---/
We can compute simple logical functions by assigning weights to the edges.
AND function:
Θ = [-30; 20; 20]
h(x) = g(-30 + 20x1 + 20x2)
OR function:
Θ = [-10; 20; 20]
h(x) = g(-10 + 20x1 + 20x2)
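A quick numerical check of those two gates, assuming NumPy and the sigmoid activation g:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(theta, x1, x2):
    # single sigmoid unit with bias input x0 = 1
    return sigmoid(theta @ np.array([1.0, x1, x2]))

theta_and = np.array([-30.0, 20.0, 20.0])
theta_or  = np.array([-10.0, 20.0, 20.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, unit(theta_and, x1, x2), unit(theta_or, x1, x2))
# outputs are ≈0 or ≈1 and match the AND / OR truth tables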
Non-linear classification example. Given this neural network:
x0 ---\
x1 ----> h(x)
We can compute these functions by assigning weights to the edges:
NOT function:
Θ = [10; -20]
h(x) = g(10 - 20x1)
(NOT x1) AND (NOT x2) function:
Θ = [10; -20; -20]
h(x) = g(10 - 20x1 - 20x2)
x1 XNOR x2 function:
+1 -----\
x1 ------> a1² -----\
x2 ------> a2² ------> a1³ -----> h(x)

# Layer 1 -> 2 weights (the two rows of Θ¹):
Θ¹ row 1 = [-30; 20; 20]     # a1² = x1 AND x2
Θ¹ row 2 = [ 10; -20; -20]   # a2² = (NOT x1) AND (NOT x2)

# hidden activations over the four inputs (x1,x2) = (0,0), (0,1), (1,0), (1,1):
a1² = [0; 0; 0; 1]
a2² = [1; 0; 0; 0]

# Layer 2 -> 3 weights (Θ²):
Θ² = [-10; 20; 20]           # a1³ = a1² OR a2²

a1³ = h(x) = [1; 0; 0; 1]    # x1 XNOR x2
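The same composition can be checked numerically; a sketch assuming NumPy, using the weights above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30.0,  20.0,  20.0],    # row 1: x1 AND x2
                   [ 10.0, -20.0, -20.0]])   # row 2: (NOT x1) AND (NOT x2)
Theta2 = np.array([[-10.0,  20.0,  20.0]])   # a1² OR a2²

def xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                  # input plus bias (+1)
    a2 = np.insert(sigmoid(Theta1 @ a1), 0, 1.0)  # hidden activations plus bias
    return sigmoid(Theta2 @ a2)[0]                # h(x) = a1³

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))   # ≈1 when x1 == x2, ≈0 otherwise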
To do multi-class classification with a neural network, we simply have each unit in the output layer output 1 or 0 for its own class. This is called the "one-vs-all" strategy.
h(x) will then be of the form [y1; y2; y3; ...; yn]
for n classes, where each yi is binary (i.e. yi ∈ {0,1}).
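The predicted class is then simply the output unit with the largest activation; a minimal sketch with a made-up 4-class output vector (the class names and values are illustrative, not from a real network):

import numpy as np

# hypothetical output for 4 classes (e.g. pedestrian / car / motorcycle / truck)
h = np.array([0.02, 0.91, 0.05, 0.01])   # illustrative activations
predicted_class = int(np.argmax(h))      # index of the unit closest to 1
print(predicted_class)                   # 1 -> the second class in this ordering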