# Artificial Neural Networks: Training for Deep Learning – IIb

This post, like the series provides a pathway into deep learning by introducing some of the concepts using some common reference points. This is not designed to be an exhaustive research review of deep learning techniques. I have also tried to keep the description neutral of any programming language, though the backing code is written in Java.

So far we have visited shallow neural networks and their building blocks (post 1), investigated their performance on difficult problems and explored their limitations (post 2). Then we jumped into the world of deep networks and described the concept behind them (post 3) and the RBM building block (post 4). Then we started discussing a possible local (greedy) training method for such deep networks (post 5). In the previous post we started talking about the global training and also about the two possible ‘modes’ of operation (discriminative and generative).

In the previous post the difference between the two modes was made clear. Now we can talk a bit more about how the global training works.

As you might have guessed the two operating modes need two different approaches to global training. The differences in flow for the two modes and the required outputs also means there will be structural differences when in the two modes as well.

The image below shows a standard discriminative network where flow of propagation is from input to the output layer. In such networks the standard back-propagation algorithm can be used to do the learning closer to the output layers. More about this in a bit.

The image below shows a generative network where the flow is from the hidden layers to the visible layers. The target is to generate an input, label pair. This network needs to learn to associate the labels with inputs. The final hidden layer is usually lot larger as it needs to learn the joint probability of the label and input. One of the algorithms used for global training of such networks is called the ‘wake-sleep’ algorithm. We will briefly discuss this next.

Wake-Sleep Algorithm:

The basic idea behind the wake-sleep algorithm is that we have two sets of weights between each layer – one to propagate in the Input => Hidden direction (so called discriminative weights) and the other to propagate in the reverse direction (Hidden => Input – so called generative weights). The propagation and training are always in opposite directions.

The central assumption behind wake-sleep is that hidden units are independent of each other – which holds true for Restricted Boltzmann Machines as there are no intra-layer connections between hidden units.

Then the algorithm proceeds in two phases:

1. Wake Phase: Drive the system using input data from the training set and the discriminative weights (Input => Hidden). We learn (tune) the generative weights (Hidden => Input) – thus we are trying to learn how to recreate the inputs by tuning the generative weights
2. Sleep Phase: Drive the system using a random data vector at the top most hidden layer and the generative weights (Hidden => Input). We learn (tune) the discriminative weights (Input => Hidden) – thus we are trying to learn how to recreate the hidden states by tuning the discriminative weights

As our primary target is to understand how deep learning networks can be used to classify data we are not going to get into details of wake-sleep.

There are some excellent papers for Wake-Sleep by Hinton et. al. that you can read to further your knowledge. I would suggest you start with this one and the references contained in it.

Back-propagation:

You might be wondering why we are talking about back-prop (BP) again when we listed all those ‘problems’ with it and ‘deep networks’. Won’t we be affected by issues such as ‘vanishing gradients’ and being trapped in sub-optimal local minima?

The trick here is that we do the pre-training before BP which ensures that we are tuning all the layers (in a local – greedy way) and giving BP a head start by not using randomly initialised weights. Once we start BP we don’t care if the layers closer to the input layer do not change their weights that much because we have already ‘pointed’ them in a sensible direction.

What we do care about is that the features closer to the output layer get associated with the right label and we know BP for those outer layers will work.

The issue of sub-optimal local minima is addressed by the pre-training and the stochastic nature of the networks. This means that there is no hard convergence early on and the network can ‘jump’ its way out of a sub-optimal local minima (with decreasing probability though as the training proceeds).

Classification Example – MNIST:

The easiest way to go about this is to use ‘shallow’ back propagation where we put a layer of logistic units on top of the existing deep network of hidden units (i.e. the Output Layer in the discriminative arrangement) and only this top layer is trained. The number of logistic units is equal to the number of classes we have in the classification task if using one-hot encoding to encode the classes.

An example is provided on my github, the test file is: rd.neuron.neuron.test.TestRBMMNISTRecipeClassifier

This may not give record breaking accuracy but it is a good way of testing discriminative deep networks. It also takes less time to train as we are splitting the training into two stages and always ever training one layer at a time:

1. Greedy training of the hidden layers
2. Back-prop training of the output layer

The other advantage this arrangement has is that it is easy to reason about. In stage 1 we train the feature extractors and in stage 2 we train the feature – class associations.

One example network for MNIST is:

Input Image > 784 > 484 > 484 > 484 > 10 > Output Class

This has 3 RBM based Hidden Layers with 484 neurons per layer and a 10 unit wide Logistic Output Layer (we can also use a SoftMax layer). The Hidden Layers are trained using CD10 and the Output Layer is trained using back propagation.

To evaluate we do peak matching – the index of the highest value at the output layer must match the one-hot encoded label index. So if the label vector is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] then the index value for the peak is 3 (we use index starting at 0). If in the output layer the 4th neuron has the highest activation value out of the 10 then we can say it detected the right digit.

Using such a method we can easily get an accuracy of upwards of 95%. While this is not a phenomenal result (the state of the art full network back-prop gives > 99% accuracy for MNIST), it does prove the concept of a discriminative deep network.

The trained model that results is: network.discrm.25.nw and can be found on my github here. The model is simply a list of network layers (LayerIf).

The model can be loaded using:

You can use the Propagate class to use it to ‘predict’ the label.

The PatternBuilder class can be used to measure the performance in two ways:

1. Match Score: Matches the peak index of the one-hot encoded label vector from the test data with the generated label vector. It is a successful match (100%) is the peaks in the two vectors have the same indexes. This does not tell us much about the ‘quality’ of the assigned label because our ‘peak’ value could just be slightly bigger than other values (more of a speed breaker on the road than a peak!) as long as it is strictly the ‘largest’ value. For example this would be a successful match:
1. Test Data Label: [0, 0, 1, 0] => Actual Label: [0.10, 0.09, 0.11, 0.10] as the peak indexes are the same ( = 2 for zero indexed vector)
2. and this would be an unsuccessful one: Test Data Label: [0, 0, 1, 0] => Actual Label: [0.10, 0.09, 0.10, 0.11] as the peak indexes are not the same
2. Score: Also includes the quality aspect by measuring how close the Test Data and Actual Label values are to each other. This measure of closeness is controlled by a threshold which can be set by the user and incorporates ALL the values in the vector. For example if the threshold is set to 0.1 then:
1. Test Data Label: [0, 0, 1, 0] => Actual Label: [0.09, 0.09, 0.12, 0.11] the score will be 2 out of 4 (or 50%) as the last index is not within the threshold of 0.1 as | 0 – 0.11 | = 0.11 which is > 0.1 and same with | 1 – 0.12 | = 0.88 which is > 0.1 thus we score them both a 0. All other values are within the threshold so we score +1 for them. In this case the Match Score would have given a score of 100%.

Next Steps:

So far we have just taken a short stroll at the edge of the Deep Learning forest. We have not really looked at different types of deep learning configurations (such as convolution networks, recurrent networks and hybrid networks) nor have we looked at other computational models of the brain (such as integrate and fire models).

One more thing that we have not discussed so far is how can we incorporate the independent nature of neurons. If you think about it, the neurons in our brains are not arranged neatly in layers with a repeating pattern of inter-layer connections. Neither are they synchronized like in our ANN examples where all the neurons in a layer were guaranteed to process input and decide their output state at the SAME time. What if we were to add a time element to this? What would happen if certain neurons changed state even as we are examining the output? In other words what would happen if the network state also became a function of time (along with the inputs, weights and biases)?

In my future posts I will move to a proper framework (most probably DL4J – deep learning for java or TensorFlow) and show how different types of networks work. I can spend time and implement each type of network but with a host of high quality deep learning libraries available, I believe one should not try and ‘reinvent the wheel’.

If you have found these blog posts useful or have found any mistakes please do comment! My human neural network (i.e. the brain!) is always being trained!

# Artificial Neural Networks: Introduction to Deep Learning

Firstly sorry for the break! Been busy with few things. But here goes – the next instalment of our ANN series.

So far we have covered in the first and second posts:

a) To classify more complex and real world data which is not linearly separable we need more processing units, these are usually added in the Hidden Layer

b) To feed the processing units (i.e. the Hidden Layer) and to encode the input we utilise an Input Layer which has only one task –  to present the input in a consistent way to the hidden layer, it will not learn or change as the network is trained.

c) To work with multiple Hidden Layer units and to encode the output properly we need an aggregation layer to collect output of the Hidden Layer, this aggregation layer is also called an Output Layer

d) The representation of the input and output can have a big influence on the performance of the neural network especially when it encounters noisy data

To the first point, while it is good to be able to add more hidden units and multiple hidden layers, we quickly come up against the problem of how to train such networks. This is not only a theoretical problem (e.g. vanishing gradient – see the first post) but also a computational one.

The Challenge:

• More hidden units mean complex features can be learnt from the data
• Multiple layers are difficult to train using standard back prop due to the vanishing gradient problem
• Multiple layers are also difficult to train because we need to distribute the training effort so that each layer is able to support the other layers at the end of the training (a hint: perhaps we need to look at independent training of each layer to delink them?)
• Adding more hidden units also increases the computational load especially if the training data set is massive, Stochastic Gradient Descent only partially solves the problem

The Solution:

• Add more hidden layers (i.e. more hidden units) to make the network deeper (thus the ‘deep’ in the ‘deep learning’)
• Use a combination of bottom-up and top-down learning to train this ‘deep’ network
• Use specialised libraries that support GPU based distributed data processing (e.g. Tensorflow, DL4J)

Image above shows a deep learning network setup for the MNIST data set where inputs are gray-scale images (of constant pixel count 28×28) of single handwritten digits (0 to 9) which need to be mapped to their corresponding number. Each pixel in the 28×28 image is normalised and treated as an input. This gives us a full input length of 784 neurons. There are 10 digits (0 to 9) as a possible output class so the full output length is 10.

We have previously tried shallow networks and gotten good performance of about 95% with 300 hidden units organised as a single hidden layer, compare this to deep learning networks which achieve 99.7%

Deep learning works by distributing the ‘learning’ load across the hidden layers by ‘learning and aggregating’ smaller features to make larger features.

Image below shows how a Face Detector, which detects if an image contains a face of a child, baby or an adult, may distribute the feature learning and aggregation between the hidden layers.

The Building Blocks:

There are quite a few different types of ‘deep learning’ networks out there specialised for different application types (such as image tagging, document processing etc.). Some of them are not even ‘deep’ (e.g. Word2Vec) yet they incorporate novel training methods that allow them to deal with complex tasks (such as language translation) as if by magic.

I believe it will be more useful if I describe some of the common building blocks with reference to the ‘Deep Belief Network’ (Hinton et.al. 2006) as it is relatively simple (it has ‘boolean’ states in the hidden units instead of real valued ones).

The common building blocks include:

• One-Hot Encoding
• Softmax Layer
• Sigmoid and ReLU Activation Functions
• Restricted Boltzmann Machines (RBM)
• Distributed Layer-wise Unsupervised Training (Contrastive Divergence)
• Back-prop based Supervised Training (Fine Tuning)

One-Hot Encoding:

One-Hot Encoding is a really simple way of encoding outputs related to states.

The idea is that to represent S different states you need to have a S-bit binary string where for each state s in only one bit in the binary string is set.

This can also be used to represent mutually exclusive classes and we have to pick one (for example a transaction cannot be both fraudulent and normal at the same time)

Let us assume we have a classifier which has to classify all inputs into one of 5 classes A, B, C, D, E.

How do we put this as an output for a neural network? Should we have just one output neuron and divide the possible output range between the 5 classes (e.g. 0-10 = Class A, 11-20 = Class B etc.), no that sounds weird especially because the output values could mean anything and nothing, also this complicates the training.

The easiest option in this case is to have one output per class, using one-hot encoding (thus a 5-bit output vector). When we train the model we simply require that the corresponding output value be significantly higher than all others so as to indicate that particular class.

One possible scheme:

A = [1,0,0,0,0]

B = [0,1,0,0,0]

C = [0,0,1,0,0]

so on..

Then if we get an output vector like:

[0.16, 0.23, 0.67, 0.03, 0.1]

we can be reasonably sure that our model is telling us the input belongs to Class C as for that class position 3 has the highest value in the label. One thing to note is that the output vector still does not tell us anything about how close two values in it are because these are just numeric values without a comparative scale (e.g. such as those found in case of probabilistic scores where all scores are compared to the value of 1 and the highest score wins).

This is where Softmax comes into the picture.

Softmax Layer:

The Softmax Layer is really straight forward to understand. It is usually found as the outermost layer of the network because it has the very important property of being able to convert ANY set of inputs into probability values such that all the values sum to 1 ( a very important property for probability values).

The softmax function is:

P(i) = e^(x(i))/Sum(e^(x(j)))

Where x(i) is the ith output that we want to convert to a probability value.

and Sum(P(i)) = 1

Below is the function in Java.

 Source code   ```/**
* Softmax function
* @param input - to be converted to probabilities
* @return
*/
public static double[] softmax(double[] input) {
double prob[] = new double[input.length];

double sum = 0;
for (double val : input) {
sum += Math.exp(val);
}
if (sum == 0) {
throw new IllegalStateException("Sum cannot be zero");
}
for (int i = 0; i < input.length; i++) {
prob[i] = Math.exp(input[i]) / sum;
}

return prob;
}```

Example:

Assume there are 5 output units (thus length of output vector = 5) where each unit represents a class (as defined during the supervised learning phase e.g. Output Unit 1 = Class A; Output Unit 2 = Class B etc.).

For a certain input we get the following outputs (say using the Sigmoid function – see below) at Unit 1 -> 5:

[0.01, 0.23, 0.55, 0.29, 0.1]

While we can say score for Class C (Unit 3) = 0.55 looks like a good answer we cannot be 100% sure as this is not a probability value. Also Class D (Unit 4) = 0.29 is not that far off or is it? We can’t say for sure because we do not have a scale to compare against. Wouldn’t it be great if we could convert these to a probability value then we could compare them with each other and simply pick the largest value as the most probable class and ALSO provide our ‘confidence’ at the result?

If we use Softmax we get the following probability values:

[0.157, 0.195, 0.269, 0.207, 0.172]

(values are rounded to fit here so they may not exactly total to 1 – the problem with floating point arithmetic in Java)

The result is very interesting! Now we see Class C and D are just 6% away from each other (27% and 21% approx.). This is far closer than the original output vector. So while we can play it safe and choose Class C (the largest probability value) we will have to indicate somehow that we are not very sure about it as Class D came very close as well.

This can be converted to the following one-hot with a suitable confidence warning:

[0, 0, 1, 0, 0]

The interesting point about Softmax is that larger the scores, clearer is the separation between the probability values.

For example, if the output was (perhaps from ReLU units – see below):

[1, 2, 4, 3, 0]

we get probability score as: [0.032, 0.0861, 0.636, 0.234, 0.012]

This tells us there is a high probability that Class C is the correct class. The separation is now 40% between Class C and D.

Sigmoid and Rectified Linear Unit (ReLU) Activation Function:

The Sigmoid function ensures all output values are between 0 and 1. This allows us to use a probabilistic interpretation of the output. The closer the value is to 0 or 1 more confident we are about it. if the value is around 0.5 then we are not really sure.

It has the following form:

Sigmoid (x) =  1 / (1 + e^(-x))

One interesting property of the Sigmoid (which lends itself to back-prop training) is that its differential can be represented using Sigmoid:

Sigmoid'(x) = Sigmoid(x)(1-Sigmoid(x))

This also compares well with a Bernoulli Trial where Sigmoid(x) = P(x) and 1-Sigmoid(x) = Q (not x) = 1 – P(x).

This function is also used in ‘logistic regression’ where for a two class problem each class is represented by the edge values of the Sigmoid (0 and 1).

When we look at multi-class problems a common encoding follows the one-hot pattern where there is one sigmoid output per class which tells us how confident we are about that class. Remember it tells us NOTHING about how these compare with each other.

So if for 5 classes we have 5 sigmoid outputs where:

• a value close to 0 means we are very sure the input does not belong to the class
• a value close to 1 means we are very sure the input does belong to the class

we would still need something like a Softmax to compare these outputs with each other to choose a single class out of the 5 and reason about how confident we were about that choice.

The ReLU function comes from the behavior of a half-wave rectifier unit in electrical engineering where these units convert AC to DC. The function is VERY easy to model and process, there are no messy exponential terms. These allow just the right amount of non-linearity into a neural network thereby allowing us to handle non-linear classification tasks (more info here).

The function is:

ReLU(x) = max(0,x)

Its differential for x > 0 (required for back-prop) is a constant value of 1

The ReLU has several advantages including the fact that it is very easy to calculate and differentiate. If you see the difference it brings to the back-prop equations you will never want to even think about using Sigmoid.

The one problem with ReLUs is that once it is closed during a forward pass (i.e. x <=0) it will forever remain closed (even for backward passes). This is called the ‘dying ReLU problem’.

In the next post(s) we will cover Restricted Boltzmann Machines, Contrastive Divergence and Fine-tuning.

As usual – if I have made any mistakes do let me know!