Analytics, Machine Learning, AI and Automation

In the last few years buzzwords such as Machine Learning (ML), Deep Learning (DL), Artificial Intelligence (AI) and Automation have taken over from the excitement of Analytics and Big Data.

Often ML, DL and AI are placed in the same context, especially in product and job descriptions. This not only creates confusion as to the end target, it can also lead to loss of credibility and wasted investment (e.g. in product development).

Figure 1: Framework for Automation

Figure 1 shows a simplified framework for automation: all the ingredients required to automate the handling of a ‘System’. The main components of this framework are:

  1. A system to be observed and controlled (e.g. telecoms network, supply chain, trading platform, deep space probe …)
  2. Some way of getting data (e.g. telemetry, inventory data, market data …) out of the system via some interface (e.g. APIs, service endpoints, USB ports, radio links …) [Interface <1> Figure 1]
  3. A ‘brain’ that converts input data into actions or output data. It contains one or more ‘models’ (e.g. trained neural networks, decision trees etc.) that hold its ‘understanding’ of the system being controlled. The ‘training’ interface that creates the model(s) and helps maintain them is not shown separately
  4. Some way of getting data/commands back into the system to control it (e.g. control commands, trade transactions, purchase orders, recommendations for next action etc.) [Interface <2> Figure 1]
  5. Supervision capability which allows the ‘creators’ and ‘maintainers’ of the ‘brain’ to evaluate its performance and if required manually tune the system using generated data [Interface <3> Figure 1] – this itself is another Brain (see Recursive Layering)

This is a so-called automated ‘closed-loop’ system with human supervision. In such a system the control can be fully automated, fully manual, or any combination of the two for different types of actions. For example, in safety-critical systems the automated closed loop can have cut-out conditions that disable Interface <2> in Figure 1. This means all control passes to the human user (via Interface <4> in Figure 1).
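To make the cut-out idea concrete, here is a minimal Python sketch of such a supervised closed loop. All the names (read_telemetry, decide, cut_out_condition etc.) are hypothetical placeholders, not a real API:

def control_loop(system, brain, operator):
    """A closed loop with human supervision and a safety cut-out."""
    while True:
        data = system.read_telemetry()           # Interface <1>: data out of the System
        action = brain.decide(data)              # the 'Brain' converts data into an action
        if brain.cut_out_condition(data):
            operator.take_manual_control(data)   # Interface <4>: control passes to the human
        else:
            system.send_command(action)          # Interface <2>: automated control
        operator.log_for_review(data, action)    # Interface <3>: supervision data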

A Note about the Brain

The big fluffy cloud in the middle called the ‘Brain’ hides a lot of complexity, not so much in terms of algorithms and infrastructure but in terms of even talking about the differences between things like ML, DL and AI.

There are two useful concepts for putting all these different buzzwords in context when it comes to the ‘Brain’ of the system. In other words, the next time some clever person tells you that there is a ‘brain’ in their software/hardware that learns, ask them two questions:

  1. How old is the brain?
  2. How dense is the brain?

Age of the Brain

Age is a very important criterion for most tasks. Games that preschool children struggle with are ‘child’s play’ for teenagers. Voting and driving are reserved for ‘adults’. In the same way, for an automated system, the age of the brain says a lot about how ‘smart’ it is.

At its simplest a ‘brain’ can contain a set of unchanging rules that are applied to the observed data again and again [so-called static rule-based systems]. This is similar to a newborn baby that has fairly well defined behaviours (e.g. hungry -> cry). This sort of brain is pretty helpless when the data has large variability. It will not be able to generate insights about the system being observed, and the rules can quickly become error prone (thus the age-old question: ‘why does my baby cry all the time!’).
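As a toy illustration, a static rule-based ‘brain’ might be nothing more than a fixed list of condition-action pairs (the rules and thresholds below are invented):

RULES = [
    (lambda obs: obs["temperature_c"] > 90, "throttle_down"),
    (lambda obs: obs["error_rate"] > 0.05, "raise_alarm"),
]

def static_brain(observation):
    # Apply the same fixed rules to every observation; nothing is ever learnt
    return [action for condition, action in RULES if condition(observation)]

print(static_brain({"temperature_c": 95, "error_rate": 0.01}))  # ['throttle_down']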

Next comes the brain of a toddler, which can think and learn but only in straight lines, and that too after extensive training and explanations (unless you are a very ‘lucky’ parent and your toddler is great at solving ‘problems’!). This is similar to a ‘machine learning system’ that is specialised to handle specific tasks. Give it a task it has not been trained for and it falls apart.

Next comes the brain of a pre-teen, which is maturing and learning all kinds of things with or without extensive training and explanations. ‘Deep learning systems’ have similar properties. For example, a Convolutional Neural Network (CNN) can extract features (such as edges) out of a raw image without requiring any kind of pre-processing, and can be used on different types of images (generalisation).

At its most complex (e.g. a healthy adult), the ‘brain’ is able not only to learn new rules but, more importantly, to evaluate existing rules for their usefulness. Furthermore, it is capable of chaining rules and applying often unrelated rules to different situations. Processing different types of input data is also relatively easy (e.g. facial expressions, tone, gestures, alongside other data). This is what you should expect from ‘artificial intelligence‘. In fact, with a true AI Brain you should not need Interface <4>, and perhaps only a very limited Interface <3> (almost a psychiatrist/psycho-analyst to a brain).

Brain Density

Brain density increases with age, then plateaus and eventually starts to decrease. From a processing perspective it’s as if the CPU in your phone or laptop kept adding additional processor cores, becoming capable of ever more complex tasks.

Static rule-based systems may not require massive computational power. Here more processing power may be required at Interfaces <1>/<2>, to prepare the data for input and output.

Machine-learning algorithms definitely benefit from massive computational power, especially when the ‘brain’ is being trained. Once the model is trained, however, applying it may not require much computing power. Again, more power may be required to massage the data to fit the model parameters than to actually use the model.

Deep-learning algorithms require computational power throughout the cycle of prep, train and use. The training and use times are massively reduced when using special-purpose hardware (e.g. GPUs for neural networks). One rule of thumb: ‘if it doesn’t need special-purpose hardware then it’s probably not a real deep-learning brain; it may simply be a machine learning algorithm pretending to be one’. CPUs are mostly good for the data prep tasks before and after the ‘brain’ has done its work.

Analytics System

If we were to have only Interfaces <1> and <3> (see Figure 1), we can call it an analytics solution. This type of system has no ability to influence the System; it is merely an observer. This is very popular, especially on the business support side. Here Interface <4> may not always be something tangible (such as a REST API or a command console); it might represent strategic and tactical decisions. The ‘Analytics’ block in this case consists of data visualisation and user interface components.

True Automation

To enable true automation we must close the loop (i.e. Interface <2> must exist). But there is something important for true automation that I have not shown in Figure 1: the ability to process event-based data. This is very important, especially for systems that are time dependent (real-time or near-real-time) such as trading systems, network orchestrators etc. This is shown in Figure 2.

Figure 2: Automation and different types of data flows

Note: Events are not only generated by the System being controlled but also by the ‘Brain’. Therefore, the ‘Brain’ must be capable of handling both time dependent as well as time independent data. It should also be able to generate commands that are time dependent as well as time independent.

Recursive Layers

Recursive Layering is a powerful concept where an architecture allows its implementations to be layered on top of each other. This is possible with ML, DL and AI components. The System in Figures 1 and 2 can itself be another combination of a Brain and a controlled System, where the various outputs are fed into another Brain (a super-brain? a supervisor brain?). An example is shown in Figure 3. This is a classic Analytics-over-ML example where the ‘Analytics’ block from Figures 1 and 2 has a Brain inside it (it is not restricted to visualisation and UI). That Brain may be a simple newborn brain (e.g. static SQL data processing queries) or a sophisticated deep learning system.

Figure 3: Recursive layering in ML, DL and AI systems.

The Analytics feed is another API point that can be an input data source (Interface <1>) to another ‘Brain’, say one supervising the Brain that is generating the analytics data.

Conclusion

So next time you get a project that involves automation (implementing or using it), think about the interfaces and components shown in Figure 1. Think about what type of brain you need (age and density).

If you are on the product side then by all means make bold claims, but not illogical or blatantly false ones. Just as you would not ask a toddler to do a teenager’s job, don’t advertise one as the other.

Finally, think hard about how the users will be included in the automation loop. What conditions will disable Interface <2> in Figure 1 and cut out to manual control? How can the users monitor the ‘Brain’? Fully automated closed-loop systems are not good for anyone (just ask John Connor from the Terminator series, or the people from Knight Capital https://en.wikipedia.org/wiki/Knight_Capital_Group). Humans often provide deeper insights, based on practical experience and knowledge, than ML or DL is capable of.

Reduce Food Wastage using Machine Learning

A scenario the readers might be familiar with: food items hiding in our refrigerator way past their expiry date. Once discovered, these are quickly transferred to the bin with promises to self that next time it will be different for sure; or, worse yet, we stuff the items in our freezer!

Estimates of waste range from 20% to 50% (in countries like the USA). This is a big shame given that hundreds of millions of people around the world don’t have any form of food security and face acute shortages of food.

What can we do about this? 

One solution is to help people be a bit more organised by reminding them of the expiry dates of various items. The registration of items has to be automated and smart. 

Automated:

If we insist on manual entry of items with their expiry dates, people are likely not to want to do this, especially right after a long shop! Instead, as the items are checked out at the shop, an option should be available to email the receipt, which should also contain an electronic record of the expiry date of the purchased items. This should include all groceries as well as ‘ready to eat’ meals. Alternatively, one can provide different integration options using open APIs and some sort of a mobile app.

Smarter:

Once we have the expiry dates we need to ensure we provide the correct support and advice to the users of the app. To make it more user-friendly we should suggest recipes based on the purchased groceries and put those on the calendar, creating a ‘burn-down’ chart for the groceries (taking inspiration from Agile) that optimises for things like freshness of groceries, minimising the use of ‘packaged foods’ and maintaining variety in the recipes.

Setup:

Steps are as follows:

  1. When buying groceries the expiry and nutrition information are loaded into the system
  2. Using a matrix of expiry to items and items to recipes (for raw groceries) we get an optimised ordering of usage dates mapped to recipes
  3. With the item consumption-recipe schedule we can then interleave ready to eat items, take-away days and calendar entries related to dinner/lunch meetings (all of these are constraints)
  4. Add feedback loop allowing users to provide feedback as to what recipes they cooked, what they didn’t cook, what items were wasted and where ‘unscheduled’ ready to eat items were used or take-away called for
  5. This will help in encouraging users to buy the items they consume and warn against buying (or prioritise after?) items that users ‘ignore’ 

I provide a dummy implementation in Python using Pandas to sketch out some of the points and to bring out some tricky problems.

The output is a list of purchased items and a list of available recipes, followed by a list of recommendations with a ‘score’ metric that maximises ingredient use and minimises delay in usage.
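The full dummy implementation is not reproduced here, but a minimal sketch of the scoring logic might look like the following (the shelf lives and the recipe-to-ingredient map are invented for illustration):

import pandas as pd

# Days until expiry for each purchased item (invented numbers)
items = pd.Series({"chicken": 2, "cream": 3, "tomato": 5, "onion": 10,
                   "potato": 14, "fish": 2})

recipes = {
    "butter_chicken": ["chicken", "cream", "tomato", "onion"],
    "fish_n_chips": ["fish", "potato"],
    "mince_pie": ["meat_mince", "onion"],  # meat_mince is not in the pantry
}

def score(ingredients, pantry, horizon=15):
    # Higher score for recipes that use up soon-to-expire items
    if not all(i in pantry.index for i in ingredients):
        return None
    return sum(horizon - pantry[i] for i in ingredients)

for name, ingredients in recipes.items():
    s = score(ingredients, items)
    if s is None:
        print(f"Not enough ingredients for {name}")
    else:
        pct = 100 * len(ingredients) // len(items)
        print(f"{name}: Score:{s} Percentage items consumed:{pct}%")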

Item: 0:cabbage
Item: 1:courgette
Item: 2:potato
Item: 3:meat_mince
Item: 4:lemon
Item: 5:chicken
Item: 6:fish
Item: 7:onion
Item: 8:carrot
Item: 9:cream
Item: 10:tomato


Recipe: 0:butter_chicken
Recipe: 1:chicken_in_white_sauce
Recipe: 2:mince_pie
Recipe: 3:fish_n_chips
Recipe: 4:veg_pasta
Recipe: 5:chicken_noodles
Recipe: 6:veg_soup

Recommendations

butter_chicken:         Score:30    Percentage items consumed:36%
chicken_in_white_sauce: Score:26    Percentage items consumed:27%
Not enough ingredients for mince_pie
fish_n_chips:           Score:20    Percentage items consumed:27%
veg_pasta:              Score:26    Percentage items consumed:27%
chicken_noodles:        Score:28    Percentage items consumed:36%
veg_soup:               Score:20    Percentage items consumed:27%

The recommendation is to start with ‘butter chicken’ as we use up some items that have a short shelf life. Here is a ‘real’ recipe – as a thank you for reading this post: 

http://maunikagowardhan.co.uk/cook-in-a-curry/butter-chicken-murgh-makhani-chicken-cooked-in-a-spiced-tomato-gravy/h

Tricky Problems:

There are some tricky bits that can be solved but will need some serious thinking:

  1. Updating recommendations as recipes are cooked
  2. Updating recommendations as unscheduled things happen (e.g. item going bad early or re-ordering of recipes being cooked)
  3. Keeping track of cooked items and other interleaved schedules (e.g. item being frozen to use later)
  4. Learning from usage without requiring the user to update all entries (e.g. using RFID? Deep Learning – from images taken of your fridge with the door open)
  5. Coming up with innovative metrics to encourage people to eat healthy and eat fresh – lots of information can be extracted (e.g. nutrition information) if we have a list of purchased items
  6. Scheduling recipes around other events in a calendar or routine items (e.g. avoiding a heavy meal before a scheduled gym appointment)

House Market Analysis

House prices in the UK are at it again. A combination of Brexit, changes in housing stock, easy loans and growing consumer debt is making things interesting again.

Figure 1: Number of Transactions per month from 1995 to August 2018

Figure 1 shows the number of transactions every month since 1995: the massive fall post-2007 because of the financial crisis, then the surge in transactions since 2013. The lonely spot (top-right, March 2016) is just before the new Stamp Duty changes made buying a second house an expensive proposition. But this is relatively boring!

Visual Analytics: Relation between Quantity and Value of Transactions

Let us look at the Transaction Count (quantity) and the Total Value of those transactions, aggregated on a monthly basis. I used a Spark cluster to aggregate the full transaction set (a 4GB CSV data file). The base data set has about 280 rows with the following structure:

{month, year, sum, count}

The month and year values are converted into dates and added to each row, then the data set is sorted by date:

{date, month, year, sum, count}
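For reference, the aggregation step might look something like this in PySpark (a hedged sketch: the column positions, _c1 for price and _c2 for the date of transfer, are assumptions about the Land Registry price-paid CSV layout):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("housing").getOrCreate()

# Read the raw 4GB price-paid file and keep just the price and transfer date
raw = (spark.read.csv("pp-complete.csv")
       .select(F.col("_c1").cast("double").alias("price"),
               F.to_date(F.col("_c2")).alias("date")))

# One row per {Year, Month}: total value and number of transactions
monthly = (raw
           .withColumn("Year", F.year("date"))
           .withColumn("Month", F.month("date"))
           .groupBy("Year", "Month")
           .agg(F.sum("price").alias("Sum"), F.count("*").alias("Count")))

monthly.toPandas().to_csv("housing_monthly.csv", index=False)  # ~280 rows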

This leads us to three plots: Sum and Count against time, and Sum against Count. These are shown below:

Figure 2: Total Transaction value by date, grouped by year (each dot represents a month in that year)

Figure 2 shows the Total Transaction value by date (Y-axis). The plot is grouped by year, where each dot represents a month in that year. The current year (2018) only has complete data up to August, hence fewer dots.

Figure 3: Total Quantity of Transactions by date, grouped by year (each dot represents a month in that year)

Figure 3 shows the Total Quantity of Transactions (Y-axis), once again grouped by year. As in Figure 2, the data is complete only up to August 2018.

Figure 4: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis)

Figure 4 shows how the value of the transactions relates to the number of transactions. Each dot represents a month in a year. As expected, there is a slightly positive correlation between the total value of transactions and the number of transactions. A point to note: the total value of transactions depends on the sale prices (which depend on the properties sold) as well as on the number of transactions in a given month. For the same number of transactions the value could be high or low (year on year) depending on whether prices are inflationary or a higher number of good quality houses are part of that month’s transactions.

Figure 5: Total Transaction value (Y-axis) against Total number of transaction (X-axis), each point represents a particular month in a year

Figure 5 enhances Figure 4 by using a colour gradient to show the year of each observation. Each year should have 12 points associated with it (except 2018). This concept is further extended by using different marker shapes depending on whether the observation was made before the financial crisis (circle: before 2008), during the financial crisis (square: between 2008 and 2012) or after the crisis (plus: after 2012). These year boundaries were picked using Figures 2 and 3.

Figure 6: The housing market contracting during the crisis and then expanding

Figure 6 shows the effect of the financial crisis nicely. The circles represent pre-crisis transactions, the squares transactions during the crisis, and the plus symbols post-crisis transactions.

The rapid decrease in transactions can be seen as the market contracted in 2007-2008. As the number and value of transactions start falling, the relative fall in the number of transactions is larger than in the total value of the transactions. This indicates that prices did fall, but the bigger effect was that not enough houses were being sold. Given the difficulty in getting a mortgage, this reduction in the number of transactions could be caused by a lack of demand.

Discovering Data Clusters

Using a three-class split (pre-crisis, crisis, post-crisis) provides some interesting results, as described in the previous section. But what happens if a clustering algorithm is used on the data?

A clustering algorithm attempts to assign each observation to a cluster. Depending on the algorithm, the total number of clusters may be required as an input. Clustering is often helpful when trying to build initial models of the input data, especially when no labels are available. In that case the cluster id (represented by the cluster centre) becomes the label. The following clustering algorithms were evaluated:

  1. k-means clustering
  2. Gaussian mixture model

The data-set for the clustering algorithm has three columns: Date, Monthly Transaction Sum and Monthly Transaction Count.

Given the claw-mark distribution of the data it was highly unlikely that k-means would give good results. That is exactly what we see in Figure 7 with a cluster count of 3 (since we previously had three labels: before, during and after the crisis). The clustering cuts across the claws.

Figure 7: k-mean clustering with cluster size of 3 – total value of transactions (Y-axis) vs total number of transactions
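For reference, the k-means run behind Figure 7 is something like the snippet below, assuming the df_pure frame (Count, Sum, Year columns) built in the code at the end of this section. Note the features are unscaled, so the very large Sum values dominate the distance metric, which is part of why the clusters cut across the claws:

from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

# Three clusters, matching the three hand-picked labels used earlier
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_pure)

fig, ax = plt.subplots()
ax.scatter(df_pure["Count"], df_pure["Sum"], c=kmeans.labels_)
plt.show()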

If a Gaussian mixture model (GMM) is used with a component count of 3 and covariance type ‘full’ (using the sklearn implementation – see code below), some nice clusters emerge, as seen in Figure 8.

Figure 8: Gaussian Mixture model with three components.

Each of the components corresponds to a ‘band’ in the observations. The lowest band corresponds loosely with the pre-crisis market; the middle (yellow) band somewhat expands the crisis market to include entries from before the crisis. Finally, the top-most band (green) corresponds nicely with the post-crisis market.

But what other number of components could we choose? Should we try the other GMM covariance types (‘spherical’, ‘diag’ and ‘tied’ as well as ‘full’)? To answer these questions we can run a ‘Bayesian Information Criterion’ (BIC) test across different numbers of components and different covariance types. The covariance type and component count that give the lowest BIC are preferred.

The result is shown in Figure 9.

Figure 9: BIC analysis of the data – BIC score against number of components (X-axis)

From Figure 9 it seems the ‘full’ type consistently gives the lowest BIC on the data-set. Furthermore, going from 3 to 4 components improves the BIC score (lower is better). Another such jump is from 7 to 8 components. Therefore the number of components should be 4 (see Figure 10) or 8 (see Figure 11).

Figure 10: Transaction value (Y-axis) against Total Number of Transactions – with 4 components.

Figure 11: Transaction value (Y-axis) against Total Number of Transactions – with 8 components.

Comparing the 4-component results (Figure 10) with Figure 5 indicates an expansion at the start of the data-set (year: 1995) – this is the jump from yellow to green. Then during the crisis there is a contraction (green to purple). Post-crisis there is another expansion (purple to blue). This is shown in Figure 12.

Figure 12: Expansion and contraction in the housing market

Comparing the 8-component results (Figure 11) with Figure 5 shows the stratification of the data-set based on the Year value. Within the different colours one can see multiple phases of expansion and contraction.

The interesting thing is that for both the 4 and 8 component models, the crisis-era cluster is fairly well defined.

Code for this is given below:

from matplotlib import pyplot as plt
from datetime import datetime as dt
import pandas as pd
from sklearn.mixture import GaussianMixture


csv = "c:\\ML Stats\\housing_sep_18_no_partial_mnth_cnt_sum.csv"

df = pd.read_csv(csv)

# Build a proper date (mid-month) from the Year and Month columns
dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

df_pure = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year})
df_pure = df_pure.sort_values(["Date"])
df_pure = df_pure.set_index("Date")

# BIC analysis: fit GMMs with 1-9 components for each covariance type
bics = {}
for cmp in range(1, 10):
    clust_sph = GaussianMixture(n_components=cmp, covariance_type='spherical').fit(df_pure)
    clust_tied = GaussianMixture(n_components=cmp, covariance_type='tied').fit(df_pure)
    clust_diag = GaussianMixture(n_components=cmp, covariance_type='diag').fit(df_pure)
    clust_full = GaussianMixture(n_components=cmp, covariance_type='full').fit(df_pure)

    clusts = [clust_full, clust_diag, clust_sph, clust_tied]
    bics[cmp] = [c.bic(df_pure) for c in clusts]

# One line per covariance type: BIC score against number of components
plt.plot(list(bics.keys()), list(bics.values()))
plt.legend(["full", "diag", "sph", "tied"])
plt.show()

# Fit the chosen model and label each month with its cluster
num_components = 4
clust = GaussianMixture(n_components=num_components, covariance_type='full').fit(df_pure)
lbls = clust.predict(df_pure)

df_clus = pd.DataFrame({"Count": df_pure.Count, "Sum": df_pure.Sum, "Year": df_pure.Year, "Cluster": lbls})
color = df_clus["Cluster"]

# Total value vs total count, coloured by cluster
fig, ax = plt.subplots()
ax.scatter(df_clus["Count"], df_clus["Sum"], c=color)

# Count and Sum against Year, coloured by cluster
fig, ax2 = plt.subplots()
ax2.scatter(df_clus["Year"], df_clus["Count"], c=color)

fig, ax3 = plt.subplots()
ax3.scatter(df_clus["Year"], df_clus["Sum"], c=color)

plt.show()

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

Recurrent Neural Networks to Predict Pricing Trends in UK Housing Market

Recurrent Neural Networks (RNN):

RNNs are used when temporal relationships have to be learnt. Some common examples include time series data (e.g. stock prices), sequences of words (e.g. predictive text) and so on.

The basic concept of RNNs is that we train an additional set of weights (along with the standard input-output pair) that associate the past state (time: t-1) with the current state (time: t). This can then be used to predict the future state (time: t+1) given the current state (time: t). In other words, RNNs are NNs with state!

When used for standard time series prediction, the input and output values are taken from the same time series (usually a scalar value). This is a degenerate case of single-valued inputs and outputs. Thus we need to learn the relationship between x(t-1) and x(t) so that we can predict the value of x(t+1) given x(t). This is what I did for this post.

Time series can be made more complicated by making the input a vector of different parameters; the output may still remain a scalar value (a component of x) or be a vector itself. One reason for doing this is to add all the factors that may impact the value to be predicted (e.g. x(t+1)). In our example of average house prices, we may want to add factors such as the time of year, interest rates, salary levels, inflation etc. to provide some more “independent” variables in the input.

Two final points:

  • Use-cases for RNNs: Speech to Text, Predictive Text, Music Tagging, Machine Translation
  • RNNs include the additional complexity of training in Time as well as Space; therefore our standard Back-Propagation becomes Back-Propagation Through Time

RNN Structure for Predicting House Prices:

RNN simple time series

The basic time series problem is that we have a sequence of numbers, the average price of houses for a given month and year (e.g. given: X(1), X(2), … X(t-1), X(t)) with a regular step size, and our task is to predict the next number in the sequence (i.e. predict: X(t+1)). In our problem the average price is calculated for every month since January 1995 (thus the step size is 1 month). As a first step we need to define a fixed sequence size to use for training the RNN. For the input data we select a sub-sequence of a given length equal to the number of inputs (in the diagram above there are three inputs). For the training output we select a sub-sequence of the same length as the input, but with the values shifted one step into the future.

Thus if the input sub-sequence is X(3), X(4) and X(5) then the output sub-sequence must be X(4), X(5) and X(6). In general, if the input sub-sequence spans time steps a to b (where b > a and b - a + 1 is the sub-sequence length), then the output sub-sequence must span a+1 to b+1.

Once the training has been completed, if we provide the last sub-sequence as input we will get the next number in the series as the output. We can see how well the RNN is able to replicate the signal by starting with a sub-sequence in the middle, moving ahead one time step at a time, and plotting actual vs predicted values for the next number in the sequence.

Remember to NORMALISE the data!
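A minimal sketch of this data prep, with a synthetic placeholder series standing in for the real monthly average prices:

import numpy as np

prices = np.linspace(60_000, 230_000, 284)  # placeholder for the real monthly series

# Min-max normalise to [0, 1]
prices = (prices - prices.min()) / (prices.max() - prices.min())

# Cut into input/output sub-sequences shifted by one time step
n_steps = 36
X_list, y_list = [], []
for a in range(len(prices) - n_steps):
    X_list.append(prices[a : a + n_steps])          # X(a) ... X(a + 35)
    y_list.append(prices[a + 1 : a + n_steps + 1])  # X(a+1) ... X(a + 36)

X_all = np.array(X_list).reshape(-1, n_steps, 1)    # [batch, n_steps, n_inputs]
y_all = np.array(y_list).reshape(-1, n_steps, 1)    # [batch, n_steps, n_outputs]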

The parameters are as below:

n_steps = 36 # Number of time steps (thus a = 0 and b = 35, total of 36 months)

n_inputs = 1 # Number of inputs per step (the avg. price for the current month)

n_neurons = 1000 # Number of neurons in the middle layer

n_outputs = 1 # Number of outputs per step (the avg. price for the next month)

learning_rate = 0.0001 # Learning Rate

n_iter = 2000 # Number of iterations

batch_size = 50 # Batch size

I am using TensorFlow’s BasicRNNCell (complete code at the end of the post) but the basic setup is:

# Placeholders for the input and target sub-sequences: [batch, steps, values]
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

# Basic RNN cell wrapped in a projection layer that maps n_neurons down to n_outputs
cell = tf.contrib.rnn.OutputProjectionWrapper(tf.contrib.rnn.BasicRNNCell(num_units = n_neurons, activation = tf.nn.relu), output_size=n_outputs)

outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32)

# Mean squared error between the predicted and target sequences
loss = tf.reduce_mean(tf.square(outputs-y))
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
training = opt.minimize(loss)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
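The training loop itself (not reproduced in full here) is then a matter of feeding random batches until the loss drops below the chosen threshold; a hedged sketch, assuming the X_all/y_all arrays from the data-prep sketch above:

mse_threshold = 1e-4

with tf.Session() as sess:
    init.run()
    for it in range(n_iter):
        # Pick a random batch of sub-sequences
        idx = np.random.randint(0, len(X_all), batch_size)
        sess.run(training, feed_dict={X: X_all[idx], y: y_all[idx]})
        mse = loss.eval(feed_dict={X: X_all[idx], y: y_all[idx]})
        if mse < mse_threshold:
            break
    saver.save(sess, "./housing_rnn")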

Results:

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-4, we get the following values for the error:

  1. 8.6831e-05
  2. 9.05436e-05
  3. 9.86998e-05

Run 3 fitting and predictions are shown below:

Orange dots represent the prediction by the RNN and Blue dots represent the actual data

Run 3 prediction against existing data 3 years before October 2017

Then we start from October 2017 (Month 24 in the figure below) and forecast ahead to October 2018. This predicts a rise in average prices which starts to plateau in the third quarter of 2018. Given that average house prices across a country like the UK are determined by a large number of noisy factors, we should take this prediction with a pinch of salt.

Run 3 Forecasting from Month 24 (October 2017 for the year ahead till October 2018)

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-3, we get the following values for the error:

  1. 3.4365e-04
  2. 4.1512e-04
  3. 2.1874e-04

With a higher error threshold we find, when comparing against actual data (Runs 2 and 3 below), that the predicted values have a lot less overlap with the actual values. This is expected, as we have traded accuracy for a reduction in training time.

predicted avg price vs actual avg price (Run 2)

predicted avg price vs actual avg price (Run 3)

The projections in this case are a lot different. We see a linearly decreasing average price in 2018.

predicted avg price vs actual avg price with forecast

Next Steps:

I would like to add more parameters to the input, but it is difficult to get correlated data for different factors such as interest rates, inflation etc.

I would also like to try other types of networks (e.g. LSTM) but I am not sure if that would be the equivalent of using a cannon to kill a mosquito.

Finally, if anyone has any ideas on this I would be happy to collaborate!

Source code can be found here: housing_tf

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

Artificial Neural Networks: Problems with Multiple Hidden Layers

In the last post I described how we work with Multi-layer Perceptron (MLP) model of artificial neural networks. I had also shared my repository on GitHub (https://github.com/amachwe/NeuralNetwork).

I have now added Single Perceptron (SP) and Multi-class Logistic Regression (MCLR) implementations to it.

The idea is to set the stage for deep learning by showing where these types of ANN models fail and why we need to keep adding more layers.

Single Perceptron:

Let us take a step back from a MLP network to a Single Perceptron to delve a bit deeper into its workings.

The Single Perceptron (with a single output) acts as a Simple Linear Classifier. In other words, for a two-class problem, it finds a single hyper-plane (an n-dimensional plane) that separates the inputs based on their class.

Neural Network Single Perceptron

The image above describes the basic operation of the Perceptron. It can have N inputs, with each input having a corresponding weight. The Perceptron itself has a bias or threshold term. A weighted sum is taken of all the inputs and the bias is added to this value (a linear model). This value is then put through a function to get the actual output of the Perceptron.

The function is an activation function with the simplest case being the so-called Step Function:

f(x) = 1 if [ Sum(weight * input) + bias ] > 0

f(x) = -1 if [ Sum(weight * input) + bias ] <= 0

The Perceptron might look simple but it is a powerful model. Implement your own version, or walk through my example (rd.neuron.neuron.perceptron.Perceptron) and the associated tests on GitHub if you are not convinced.
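For illustration, here is a minimal Python version of the step-function Perceptron described above (the weights and bias are hand-picked to implement a 2-input AND gate):

def perceptron(inputs, weights, bias):
    # Weighted sum plus bias, put through the step activation function
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], weights=[1, 1], bias=-1.5))  # 1 only for (1, 1)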

At the same time, not all two-class problems are created equal. As mentioned before, the Single Perceptron partitions the input space using a hyper-plane to provide a classification model. But what if no such separation exists?

Single Perceptron – Linear Separation

The image above represents two very simple test cases: the 2-input AND and XOR logic gates. The two inputs (A and B) can take values of 0 or 1. The single output can similarly take a value of 0 or 1. The colourful lines represent the models which should separate the two classes of output (0 and 1, the white and black dots) and allow us to classify incoming data.

For AND the training/test data is:

  • 0, 0 -> 0 (white dot)
  • 0, 1 -> 0 (white dot)
  • 1, 0 -> 0 (white dot)
  • 1, 1 -> 1 (black dot)

We can see it is very easy to draw a single straight line (i.e. a linear model with a single neuron) that separates the two classes (white and black dots): the orange line in the figure above.

For XOR the training/test data is:

  • 0, 0 -> 0 (white dot)
  • 0, 1 -> 1 (black dot)
  • 1, 0 -> 1 (black dot)
  • 1, 1 -> 0 (white dot)

For XOR it is clear that no single straight line can be drawn that separates the two classes. Instead what we need are multiple constructs (see figure above). As the single Perceptron can only model using a single linear construct, it will be impossible for it to classify this case.

If you run a test with the XOR data you will find that the accuracy comes out to be 50%. That might sound good, but it is the exact same accuracy you would get by constantly guessing one of the classes when the classes are equally distributed. For the XOR case here, as the 0s and 1s are equally distributed, if we kept guessing 0 or 1 constantly we would still be right 50% of the time.

Put this in contrast to a Multi-Layer Perceptron, which gives an accuracy of 100%. What is the main difference between a MLP and a Single Perceptron? Obviously the presence of multiple Perceptrons organised in layers! This makes it possible to create models with multiple linear constructs (hyper-planes), which are represented by the blue and green lines in the figure above.

Can you figure out how many units we would need as a minimum for this task? Read on for the answer.

Solving XOR using MLP:

If you used the logic that for the XOR example we need 2 hyper-planes therefore 2 Perceptrons would be required your reasoning would be correct!

Such a MLP network would usually be arranged in a 2 -> 2 -> 1 formation: two input nodes (as there are 2 inputs, A and B), a hidden layer with 2 Perceptrons and an aggregation layer with a single Perceptron to provide a single output (as there is just one output). The input layer doesn’t do anything interesting except present the values to both the hidden layer Perceptrons. So the main differences between this MLP and a Single Perceptron are that:

  • we add 1 more processing unit (in the hidden layer)
  • we add an aggregation unit (the output layer) to collapse the output to a single variable, as shown in the sketch below
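A minimal NumPy sketch of such a 2 -> 2 -> 1 network trained with back-propagation is given below. It is illustrative only: with an unlucky random seed or learning rate it can settle in a poor local minimum, so treat it as a toy:

import numpy as np

rng = np.random.default_rng(42)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs A, B
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)   # hidden -> output

lr = 1.0
for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)                    # hidden layer activations
    out = sigmoid(h @ W2 + b2)                  # network output
    d_out = (out - y) * out * (1 - out)         # gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)          # gradient at the hidden layer
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [0, 1, 1, 0]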

If you check the activation of the individual Perceptrons in the hidden layer (i.e. the processing layer) of a MLP trained for XOR, you will find a pattern in the activations when presented with a type 1 Class (A = B, white dot) and when presented with a type 2 Class (A != B, black dot). One possibility for such a MLP is that:

  • For Class 1 (A = B – white dot): Both the neurons either activate or not (i.e. outputs of the 2 hidden layer Perceptrons are comparable – so either both are high or both are low)
  • For Class 2 (A != B – black dot): The neurons activate asymmetrically (i.e. there is a clear difference between the outputs of the 2 hidden layer Perceptrons)

Conclusions:

Thus there are three takeaways from this post:

a) To classify more complex, real-world data which is not linearly separable we need more processing units; these are usually added in the Hidden Layer

b) To feed the processing units (i.e. the Hidden Layer) and to encode the input we utilise an Input Layer which has only one task: to present the input in a consistent way to the hidden layer. It will not learn or change as the network is trained.

c) To work with multiple Hidden Layer units and to encode the output properly we need an aggregation layer to collect the output of the Hidden Layer; this aggregation layer is also called an Output Layer

I would again like to bring up the point of input representation and encoding of output:

  • We have to be careful in choosing the right input and output encoding
  • For XOR and other logic gate examples we can simply map the number of bits to the number of inputs/outputs, but what if we were trying to process handwritten documents – would you have one character per output? How would you organise the inputs, given that handwritten data can be of different lengths?

In the next post we will start talking about Deep Learning, as we have now provided two very important reasons why so-called shallow networks fail.

Artificial Neural Networks: An Introduction

Artificial Neural Networks (ANNs) are back in town after a rather long exile to the edges of the Artificial Intelligence (AI) product space. Therefore I thought I would do a post to provide an introduction.

For a one line intro: An Artificial Neural Network is a Machine Learning paradigm that mimics the structure of the human brain.

Some of the biggest tech companies in the world (e.g. Google, Microsoft and IBM) are investing heavily in ANN research and in creating new AI products such as driver-less cars, language translation software and virtual assistants (e.g. Siri and Cortana).

There are three main reasons for a resurgence in ANNs:

  1. Availability of cheap computing power in the form of multi-core CPUs and GPUs, which enables machines to process and learn from ‘big data’ using increasingly sophisticated networks (e.g. deep learning networks)
  2. Problems with using existing Machine Learning methods against high-volume data with complex representations (e.g. images, videos and sound), required for novel applications such as driver-less cars and virtual assistants
  3. Availability of free/open-source general-purpose ANN libraries for major programming languages (e.g. TensorFlow/Theano for Python; DL4J for Java); previously you either had to code ANNs from scratch or shell out money for specialised software (e.g. Matlab plugins)

My aim is to provide a trail up to the current state of the art (Deep Learning) over the space of 3-4 posts. To start with, in this post I will talk about one of the simplest (and oldest) forms of ANN, called a Multi-Layer Perceptron Neural Network (MLP).

Application Use-Case:

We are going to investigate a supervised learning classification task using simple MLP networks with a single hidden layer, trained using back-propagation.

Simple Multi-Layer Perceptron Network:

Neural Network (MLP)

The image above describes a simple MLP neural network with 5 neurons in the input layer, 3 in the hidden layer and 2 in the output layer.

Data Set for Training ANNs:

For supervised learning classification tasks we need labelled data sets. Think of it as a set of input-expected output pairs. The input can be an image, video, sound clip, sensor readings etc.; the label(s) can be a set of tags, words, classes, expected states etc.

The important thing to understand is that whatever the input, we need to define a representation that optimally describes the features of interest that will help with the classification.

Representation and feature identification is a very important task that machines find difficult to do. For a brain that has developed normally this is a trivial task. Because this is a very important point I want to get into the details (part of my Ph.D. was on this topic as well!).

Let us assume we have a set of grey-scale images as the input, with labels against them to describe the main subject of the image. To keep it simple let us also assume a one-to-one mapping between images and tags (one tag per image). Now there are several ways of representing these images. One option is to flatten each image into an array where each element represents the grey-scale value of a pixel. Another option is to take the average of every 2 pixels and use that as an array element. Yet another option is to chop the image into a fixed number of squares and take the average of each. But the one thing to keep in mind is that whatever representation we use, it should not hide features of importance. For example, if there are features at the level of individual pixels and we use an averaging representation, then we might lose a lot of information.

The labels (if less in number) can be encoded using binary notation otherwise we can use other representations such as word vectors.

To formalise:

If X is a given input at the Input Layer;

Y is the expected output at the Output Layer;

Y’ is the actual output at the Output Layer;

Then our aim is to learn a model (M) such that:

Y’ = M(X), where the Error calculated by comparing Y and Y’ is minimised.

One method of calculating the Error for a single example is (Y’ – Y)^2

To calculate the total error across all training examples we just use the Mean Squared Error formula (https://en.wikipedia.org/wiki/Mean_squared_error)

Working of a Network:

The MLP works on the principle of value propagation through the different layers till it is presented as an output at the output layer. For a three-layer network the propagation of value is as follows:

Input -> Hidden -> Output -> Actual Output

The propagation of the Error is in reverse.

Error at Output -> Output -> Hidden -> Input

When we propagate the Error back through the network we adjust the weights and biases between the Output-Hidden and Hidden-Input layers. The adjustment is computed one layer at a time, keeping the other layers fixed, and the updates are then applied to the entire network in a single step. This process is called ‘Back-propagation’. The idea is to minimise the Error using ‘gradient descent’, sort of like walking through a hilly region but always downhill. What gradient descent does not guarantee is that the lowest point (i.e. Error) you reach will be the Global Minimum – i.e. there are no guarantees that the lowest Error figure you found is the lowest possible Error figure, unless the error is zero!

This excellent post describes the process of ‘Back-propagation’ in detail with a worked example: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

The one key point of the process is that as we move from the Output to the Input layer, tweaking the weights as we perform gradient descent, a chain of interactions is formed (e.g. Input Neuron 1 affects all Hidden Neurons, which in turn affect all Output Neurons). This chain becomes more volatile as the number of Hidden Layers increases (e.g. Input Neuron 1 affects all Hidden Layer 1 Neurons, which affect all Hidden Layer 2 Neurons … which affect all Hidden Layer M Neurons, which affect all the Output Neurons). As we go deeper into the network, the effect of an individual hidden neuron on the final Error at the output layer becomes smaller.

This leads to the problem of the ‘Vanishing Gradient’, which limits the use of traditional learning methods on ‘deep’ topologies (i.e. more than 1 hidden layer), because this chained adjustment to the weights becomes unstable and for deeper layers the process no longer resembles following a downhill path. The gradient can become insignificant very quickly or it can become very large.

During training, all training examples are presented one at a time. For each example the network is adjusted (gradient descent). Each loop through the FULL set of training examples is called an epoch.

A problem can arise if there are a very large number of training examples and their presentation order does not change. This is because the initial examples lead to larger changes in the network. So if the first 10 examples (say) are similar, the network will become very efficient at classifying those classes of cases but will generalise to other classes very poorly.

A variation of this is called stochastic gradient descent, where training examples are randomly selected so the danger of premature convergence is reduced.

Working of a Single Neuron:

A single neuron in a MLP network works by combining the input it receives through all the connections with the previous layer, each weighted by the connection weight; adding an offset (bias) value; and putting the result through an activation function.

  1. For each input connection we calculate the weighted value (w*x)
  2. Sum it across all inputs to the neuron (sum(w*x))
  3. Apply bias (sum(w*x)+bias)
  4. Apply activation function and obtain actual output (Output = f( sum(w*x)+bias ))
  5. Present the output value to all the neurons connected to this one in the next layer

When we look at the collective interactions between layers, the above equations become Matrix Equations. Therefore value propagation is nothing but matrix multiplication and summation.
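A small NumPy illustration of steps 1-5 for a whole layer at once (the numbers are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])       # outputs of the previous layer (3 neurons)
W = np.array([[0.2, -0.4],
              [0.7,  0.1],
              [-0.3, 0.5]])         # one column of weights per neuron in this layer
b = np.array([0.1, -0.2])           # one bias per neuron

output = sigmoid(x @ W + b)         # weighted sum + bias, then activation
print(output)                       # values presented to the next layer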

Activation functions introduce non-linearity into an otherwise linear process (see Steps 3 and 4). This allows the network to handle non-trivial problems. Two common activation functions are the Sigmoid Function and the Step Function.

More info here: https://en.wikipedia.org/wiki/Activation_function

Implementation:

I wanted to dig deep into the workings of ANNs, which is difficult if you use a library like DL4J. So I implemented my own, using just the JBLAS matrix libraries for the matrix calculations.

The code can be found here: https://github.com/amachwe/NeuralNetwork

It also includes two examples that can be used to evaluate how it works.

  1. XOR Gate
    1. Has 4 training instances with 2 inputs and a single output, the instances are: {0,0} -> 0; {1,1} -> 0; {1,0} -> 1; {0,1} -> 1;
  2. MNIST Handwritten Numbers
    1. Has two sets of instances (single handwritten digits as images of constant size with corresponding labels) – 60k set and 10k set
    2. Data can be downloaded here: http://yann.lecun.com/exdb/mnist/

MNIST Example:

The MNIST dataset is one of the most common ‘test’ problems one can find. The data set is both interesting and relevant. It consists of images of handwritten numbers with corresponding labels. All the images are 28×28 pixels and each image contains a single digit.

We use the 10k instances to train and 60k to evaluate. Stochastic Gradient Descent is used to train a MLP with a single hidden layer. The Sigmoid activation function is used throughout.

The input representation is simply a flattened array of pixels with normalised values (between 0 and 1). A 28×28 image results in an array of 784 values; thus the input layer has 784 neurons.

The output has to be a label value between 0 and 9 (as the images contain only single digits). We encode this by having 10 output neurons, with each neuron representing one digit label.
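In other words the label is one-hot encoded; a tiny sketch:

import numpy as np

def one_hot(label, n_classes=10):
    # Digit label -> 10 output neurons, one 'hot' neuron per digit
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]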

That just leaves the number of hidden neurons. We can try all kinds of values and measure the accuracy to decide what suits best. In general performance will improve as we add more hidden units, up to a point; after that we encounter the law of diminishing returns. Also remember that more hidden units means longer training times, as the size of our weight matrices explodes.

For 15 hidden units:

  • a total of 11,760 weights have to be learnt between the input and hidden layer 
  • a total of 150 weights have to be learnt between the hidden and output layer

For 100 hidden units:

  • a total of 78,400 weights have to be learnt between the input and hidden layer
  • a total of 1000 weights have to be learnt between the hidden and output layer

Hidden Units and Performance

The graph above shows what happens to performance as the number of hidden layer units (neurons) is increased. Initially, from 15 up to about 100 units, decent performance gains are achieved at the expense of increased processing time. But after 100 units the performance increase slows down dramatically. A fixed learning rate of 0.05 is used. The SGD is based on a single example at a time (mini-batch size = 1).

Vanishing Gradient in MNIST:

Remember the problem of the vanishing gradient? Let us see if we can highlight its effect using MNIST. The chaining here is not so bad because there is a single hidden layer, but we should still expect the hidden-to-output weights to have, on average, a larger step size when the weights are adjusted, compared to the input-to-hidden weights (as the chain goes from output -> hidden -> input). Let us try to visualise this by sampling the delta (adjustment) being made to the weights, along with which layer they are in and how many training examples have been shown.

Weights update by layer and number of training examples

After collecting millions of samples of delta weight values in the hidden and input layers (remember, for a 100-hidden-unit network each training instance results in almost 80,000 weight updates, so it doesn’t take long to collect millions of samples), we can take their average, grouped by layer and stage of learning, to see if there is a significant difference in the step sizes.

What we find (see image above) is as expected. The delta weight updates in the outer layer are much higher than in the hidden layer to start with, but they converge rapidly as more training examples are presented. Thus the first 250 training examples have the most effect.

If we had multiple hidden layers, the chances are that the delta updates for the deeper layers would be negligible (maybe even zero). Thus the adaptation, or learning, is limited to the outer layer and the hidden layer just before it. This is called shallow learning. As we shall see, to train multiple hidden layers we have to use a divide-and-rule strategy, as compared to our current layer-by-layer strategy.

Keep this in mind as in our next post we will talk about transitioning from shallow to deep networks and examine the reasons behind this shift.

Bots using Microsoft Bot Platform and Heroku: Customer Life-cycle Management

This post is about using the Microsoft Bot Platform with Heroku to build a bot!

The demo scenario is very simple:

  1. User starts the conversation
  2. Bot asks for an account number
  3. Customer provides an account number or indicates they are not a customer
  4. Bot retrieves details if available for a personalised greeting and asks how it can be of help today
  5. Customer states the problem/reason for contact
  6. Bot uses sentiment analysis to provide the appropriate response

Bots

Bots are nothing but automated programs that carry out some well-defined set of tasks. They are old technology (think web crawlers).

Recent developments, such as the Facebook/Skype platform APIs being made available for free, the easy availability of cloud-computing platforms and the relative sophistication of machine learning as a service, have renewed interest in this technology, especially for customer life-cycle management applications.

The three main components of a modern, customer-facing bot app are:

  • Communication Platform (e.g. Facebook Messenger, Web-portal,  Skype etc.): the eyes, ears and mouth of the bot
  • Machine Learning Platform: the brain of the bot
  • Back end APIs for integration with other systems (e.g. order management): the hands of the bot

Other aspects include giving a proper face to the bot in terms of branding, but from a technical perspective the above three complete the picture.

Heroku Setup

Heroku provides various flavours of virtual containers (including ‘free’ and ‘hobby’ ones) for different types of applications. To be clear: a ‘dyno’ is a lightweight Linux container which runs a single command that you specify.

Another important reason to use Heroku is that it provides an ‘https’ endpoint for your app, which makes it more secure. This is very important, as most platforms (e.g. Facebook Messenger) will not allow you to use a plain ‘http’ endpoint. So unless you are ready to fork out big bucks for proper web-hosting and SSL certificates, start out with something like Heroku.

Therefore for a Node.JS dyno you will run something like node <js file name>.

The cool thing about Heroku (in my view) is that it integrates with Git, so deploying your code is as simple as ‘git push heroku <branch name to push from>’.

You will need to follow a step by step process to make yourself comfortable with Heroku (including installing the Heroku CLI) here: https://devcenter.heroku.com/start

We will be using a Node.JS flavour of Heroku ‘dynos’.

Heroku has an excellent ‘hello world’ guide here: https://devcenter.heroku.com/articles/getting-started-with-nodejs#introduction

Microsoft Bot Platform

The Microsoft Bot Platform allows you to create, test and publish bots easily. It also provides connectivity to a large number of communication platforms (such as Facebook Messenger). Registration and publishing is FREE at the time of writing.

You can find more information on the Node.js base framework here: http://docs.botframework.com/builder/node/overview/

The dialog framework in the MS Bot Platform is based on REST paths. This is a very important concept to master before you can start building bots.

Architecture

Microsoft provides a publishing platform to register your bot.

Once you have the bot correctly published on a channel (e.g. Web, Skype etc.) messages will be passed on to it via the web-hook.

To publish your bot you need to provide an endpoint (i.e. the web-hook) to a Node.JS web app which implements the bot dialog framework. This web app is in essence the front door to your ‘bot’.

You can test the bot locally by downloading the Microsoft Bot Framework simulator.

The demo architecture is outlined below:

Bot Demo Architecture

Detailed Architecture for the Demo

There are three main components to the above architecture as used for the demo:

  1. Publish the bot in the Bot Registry (Microsoft) for a channel – you will need your Custom Bot application endpoint to complete this step; in the demo I am publishing only to a web channel, which is the easiest to work with in my opinion. Once registered you will get an application id and secret, which you will need to add to the bot app to ‘authorise’ it.
  2. Custom Bot Application (Node.JS) with the embedded bot dialog – the endpoint where the app is deployed needs to be public, and an HTTPS endpoint is always better! I have used Heroku to deploy my app, which gives me a public HTTPS endpoint to use in the above step.
  3. Machine Learning Services – to provide the functionality that makes the Bot intelligent. We could have a statically scripted bot with just the embedded dialog, but where is the fun in that? For the demo I am using the Watson Sentiment Analysis API to detect the user’s sentiment during the chat.

*One item that I have purposely left out of the architecture, within the Custom Bot app, is the service that provides access to the data which drives the dialog (i.e. Customer Information based on the Account Number). In the demo a dummy service is used that returns hard-coded values for the Customer Name when queried with an Account Number.

The main custom bot app JavaScript file is available below; right click and save-as to download.

Microsoft Bot Demo App

Enjoy!!