
Neural Networks: Representation

Non-linear Hypotheses

Trying to fit all polynomial terms into a linear regression or logistic regression can be very expensive when the model has many features. The total number of features grows on the order of O(n^m), where n is the number of original features and m is the maximum order of polynomial term we want to include (e.g. x1^6). The quadratic features alone are on the order of O(n^2 / 2).

Computer vision to recognize a car:

50x50 pixel images -> 2500 pixels
n = 2500 (grayscale) / 7500 (RGB)
x = [ p1 p2 p3 ... p2500 ]
# where:
# pi = pixel i intensity (0-255)

If we were to include all the quadratic features (xi * xj), we would have roughly 3 million features (≈ 2500^2 / 2).
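A quick sanity check of that count, as a minimal sketch in Python (assuming "all the quadratic features" means the terms xi * xj with i <= j, of which there are n(n+1)/2):

# Counting the quadratic features xi * xj (with i <= j) for n = 2500 pixels:
n = 2500
quadratic_terms = n * (n + 1) // 2
print(quadratic_terms)   # 3126250, roughly 3 million, i.e. ~ n^2 / 2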

Neurons and the Brain

Neural networks are algorithms that try to mimic the brain. They grew out of the hypothesis that the brain runs one single learning algorithm, which is supported by experiments such as:

  • Seeing with your tongue: a camera that transmits signals to the tongue
  • Human echolocation (sonar): learning to move around based on sound echoes
  • Haptic belt: gives a sense of direction
  • Implanting a third eye in a frog: the frog learns how to use it

Model Representation I

Similarly to how a biological neuron works, an artificial neural network can be represented as simple units that receive inputs, perform some computation, and output a value according to the received inputs.

x1 ----------> a1 ----------\
x2 ----------> a2 -----------> * -----------> h(x)
x3 ----------> a3 ----------/
Layer 1        Layer 2        Layer 3
(input)        (hidden)       (output)

# Where:
Θ(j)  = matrix of weights controlling the function mapping from layer j to layer (j+1)
ai(j) = "activation" of unit i in layer j

a1² = g(Θ10 * x0 + Θ11 * x1 + Θ12 * x2 + Θ13 * x3)
a2² = g(Θ20 * x0 + Θ21 * x1 + Θ22 * x2 + Θ23 * x3)
a3² = g(Θ30 * x0 + Θ31 * x1 + Θ32 * x2 + Θ33 * x3)

h(x) = a1³ = g(Θ10 * a0² + Θ11 * a1² + Θ12 * a2² + Θ13 * a3²)
# if a network has s(j) units in layer j and s(j+1) units in layer (j+1), then
# Θ(j) will be of dimension s(j+1) x (s(j) + 1); the +1 accounts for the bias unit

Any layer that is neither the input layer nor the output layer is called a hidden layer.
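For concreteness, here is a minimal sketch of this per-unit computation in Python/NumPy (assuming g is the sigmoid function, as in the course, and random example weights):

import numpy as np

def g(z):                               # sigmoid activation (assumed)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2, 0.7])     # [x0 = 1 (bias); x1; x2; x3]
Theta1 = np.random.randn(3, 4)          # s2 = 3 units, s1 + 1 = 4 -> 3 x 4 matrix

a2 = g(Theta1 @ x)                      # [a1²; a2²; a3²]: one g(Θi . x) per unit
print(a2)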

Model Representation II

Forward Propagation: Vectorized implementation

ai² = g(zi²)
zi² = Θi0 * x0 + Θi1 * x1 + Θi2 * x2 + Θi3 * x3
# where i indexes the units in layer 2 (the hidden layer)
x  = [x0; x1; x2; x3]
z² = [z1²; z2²; z3²]

z² = Θ¹ * x
a² = g(z²)
# we can think of the inputs x as the activations of layer 1 (a¹):
z² = Θ¹ * a¹

# To take care of the bias input, add a0² = 1 to a², then:
z³   = Θ² * a²
h(x) = a³ = g(z³)
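Putting the vectorized steps together, a minimal forward-propagation sketch in Python/NumPy (assuming sigmoid activations and random example weights; the 3-input, 3-hidden-unit, 1-output shape matches the network above):

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.insert(x, 0, 1.0)        # a¹ = x with bias unit x0 = 1
    z2 = Theta1 @ a1                 # z² = Θ¹ * a¹
    a2 = np.insert(g(z2), 0, 1.0)    # a² = g(z²), with bias a0² = 1
    z3 = Theta2 @ a2                 # z³ = Θ² * a²
    return g(z3)                     # h(x) = a³ = g(z³)

Theta1 = np.random.randn(3, 4)       # s2 x (s1 + 1)
Theta2 = np.random.randn(1, 4)       # s3 x (s2 + 1)
print(forward(np.array([0.2, -0.4, 0.9]), Theta1, Theta2))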

Other network architectures:

x1 ---------> *
                ---------> *
x2 ---------> *              ---------> * -----> h(x)
                ---------> *
x3 ---------> *
Layer 1    Layer 2      Layer 3      Layer 4

Examples and Intuitions I

Given this neural network:

x1, x2 are binary (0 or 1)
x0 ---\
x1 ----> h(x)
x2 ---/

We can compute simple logical functions by assigning weights to the edges.

AND function:
Θ = [-30; 20; 20]
h(x) = g(-30 + 20x1 + 20x2)

OR function:
Θ = [-10; 20; 20]
h(x) = g(-10 + 20x1 + 20x2)
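These units can be checked directly; a minimal sketch in Python/NumPy using the weights above (the sigmoid saturates close to 0 or 1 whenever |z| >= 10):

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_and = np.array([-30, 20, 20])
theta_or  = np.array([-10, 20, 20])

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([1, x1, x2])    # x0 = 1 is the bias
        print(x1, x2, round(g(theta_and @ x)), round(g(theta_or @ x)))
# Prints the truth tables: AND -> 0,0,0,1 and OR -> 0,1,1,1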

Examples and Intuitions II

Non-linear classification example. Given this neural network:

x0 ---\
        h(x)
x1 ---/

We can compute these functions by assigning weights to the edges:

NOT function:
Θ = [10; -20]
h(x) = g(10 - 20x1)

(NOT x1) AND (NOT x2) function (using a network with two inputs x1, x2):
Θ = [10; -20; -20]
h(x) = g(10 - 20x1 - 20x2)


x1 XNOR x2 function:

+1 ---------->
               a1² ---------->
x1 ---------->                a1³ -----> h(x)
               a2² ---------->
x2 ---------->

Θ¹ (row 1) = [-30;  20;  20]	# a1² = x1 AND x2
Θ¹ (row 2) = [ 10; -20; -20]	# a2² = (NOT x1) AND (NOT x2)
a1² = [0;0;0;1]	# values over inputs (0,0), (0,1), (1,0), (1,1)
a2² = [1;0;0;0]

Θ² = [-10;  20;  20]	# a1³ = a1² OR a2²
a1³ = h(x) = [1;0;0;1]	# x1 XNOR x2
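The whole XNOR network can be verified the same way; a minimal sketch in Python/NumPy composing the weights above:

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30,  20,  20],    # a1² = x1 AND x2
                   [ 10, -20, -20]])   # a2² = (NOT x1) AND (NOT x2)
Theta2 = np.array([[-10,  20,  20]])   # a1³ = a1² OR a2²

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = np.array([1, x1, x2])             # +1 bias unit
        a2 = np.insert(g(Theta1 @ a1), 0, 1)   # add bias a0² = 1
        h  = g(Theta2 @ a2)
        print(x1, x2, round(float(h[0])))
# Prints 1, 0, 0, 1 over (0,0), (0,1), (1,0), (1,1): x1 XNOR x2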

Multiclass Classification

To do multi-class classification with neural networks, the only thing needed is to have each unit in the output layer output 1 or 0 for a given class. This is called the "one-vs-all" strategy.

h(x) will then be of the form [y1; y2; y3; ...; yn] for n classes, where each yi is binary (yi ∈ {0, 1}).
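For illustration, a minimal sketch of such target vectors in Python/NumPy (assuming four classes; the labels are hypothetical examples):

import numpy as np

def one_hot(label, n_classes=4):
    # Turn a class index into the binary target vector the output layer should produce
    y = np.zeros(n_classes, dtype=int)
    y[label] = 1
    return y

print(one_hot(0))   # [1 0 0 0] -> class 1
print(one_hot(2))   # [0 0 1 0] -> class 3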