UK Housing Data Analysis: Additional Price Paid Entry

I have been exploring the HM Land Registry Price Paid Data and have discovered a few more things of interest.

The data contains a ‘Price Paid Data Category Type’ field (the second-to-last column at the time of writing this post). As per the schema description, this field can have one of two values:

A = Standard Price Paid entry, includes single residential property sold for full market value.
B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals.

Therefore there seems to be a way of looking at how properties sold for full market value differ from buy-to-lets, repossessions and power of sale transactions. Note that proper Category B tracking only starts from October 2013.

Before we do this it is worthwhile using the ‘Property Type’ field to filter out properties of type ‘Other’, which add to the overall noise because they are usually high-value properties such as office buildings. The ‘Property Type’ field has the following values:

  • D = Detached
  • S = Semi-Detached
  • T = Terraced
  • F = Flats/Maisonettes
  • O = Other

Data Pipeline for all transactions:

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts

 

Data Pipeline for Standard and Additional Price Paid Transactions (separate):

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Price Paid Data Category Type, Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts

Tech stuff:

I used a combination of MongoDB (aggregation pipelines for standard heavyweight aggregations – such as simple grouping), Apache Spark (Java based, for heavyweight custom aggregations) and Python (for creating graphs and summarising aggregated data).
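The aggregation itself is simple enough to sketch. Below is a minimal Spark (Java) version of the pipelines above – the column names, cast/parse details and file path are assumptions (the raw Price Paid CSV has no header row), so treat it as an illustration rather than the exact code used:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PricePaidAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("PricePaidAggregation").getOrCreate();

        // Assumed column names for the headerless Price Paid CSV; the path is a placeholder
        Dataset<Row> pricePaid = spark.read().csv("/data/pp-complete.csv")
                .toDF("transactionId", "price", "transferDate", "postcode", "propertyType",
                      "oldNew", "duration", "paon", "saon", "street", "locality",
                      "town", "district", "county", "categoryType", "recordStatus")
                .withColumn("price", col("price").cast("double"))
                .withColumn("transferDate", to_date(col("transferDate")));

        Dataset<Row> summary = pricePaid
                .filter(col("propertyType").notEqual("O"))              // Step 1: drop 'Other'
                .groupBy(col("categoryType"),                           // Step 2: group (drop this column
                         year(col("transferDate")).alias("year"),       //         for the combined pipeline)
                         month(col("transferDate")).alias("month"))
                .agg(avg("price").alias("avgPrice"),                    // Step 3: averages, std. dev., counts
                     stddev("price").alias("stdDevPrice"),
                     count("price").alias("transactions"));

        summary.orderBy("year", "month").show();
        spark.stop();
    }
}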

Results:

In all graphs Orange points represent Category B related data, Blue represents Category A related data and Green represents a combination of both the Categories.

Transaction Counts

Price Paid Data Category A/B Transaction Count

Category B transactions form a small percentage of the overall transactions (approx. 5-8%).

As the Category B data starts from October 2013 we see a rapid increase in Category B transactions, which then settles to a steady rate until 2017, when transactions start falling as it becomes less lucrative to buy a second house to generate rental income. There is much larger variation in the overall and Category A transaction counts, but here as well we see a downward trend in 2017.

We can also see the sharp fall in transactions due to the financial crisis around 2008.


Average Price

Price Paid Data Category A/B Average Price

Here we find an interesting result. Category B prices are consistently lower than pure Category A prices. But given the relatively small number of Category B transactions, the average price of the combined transactions stays fairly close to the average price of Category A transactions. This also points to the fact that in buy-to-let, repossession and power of sale conditions the price paid is below the average price for Category A. Several reasons could exist for such a result:

  1. People buy cheaper properties as buy-to-let and use more expensive properties as their main residence.
  2. Under stressful conditions (e.g. forced sale or repossession) there is urgency to sell and therefore full market rate may not be obtainable.

Standard Deviation of Prices

Price Paid Data Category A/B Price Standard Dev.

The variation in price for Category B properties is quite high when compared with Category A (the standard price paid transaction). This can point to a few things about the Category B market:

  1. A lot more speculative activity is carried out here therefore the impact of ‘expectation’ on price paid is very high – particularly:
    1. ‘expected rental returns’: The tendency here will be to buy cheap (i.e. lowest possible mortgage) and profit from the difference between monthly rental and mortgage payments over a long period of time.
    2. ‘expected profit from a future sale’: The tendency here will be to keep a shorter horizon and buy cheap then renovate and sell at a higher price – either through direct value add or because of natural increase in demand.
  2. For a Standard transaction (Category A) the incentive to speculate may not be present, as the purchase is of a primary residence – a basic necessity.

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

UK House Sales Analysis

I have been looking at house sales data from the UK (actually England and Wales). This is derived from the Land Registry data set (approx. 4 GB), which contains all house sales data from the mid-1990s. The data contains full address information, so one can use geo-coding to get the location of each sale.

Sales Density Over the Years

If we compare the number of sales over the years an interesting picture emerges. Below is the geographical distribution of active regions (w.r.t. number of sales).

In the years 2004-2007 there is strong activity in the housing market – this is especially true for London (the big patch of green), the South coast and the South West of England.

The activity penetrates deeper (look at Wales and the South West) as saturation starts to kick in.
Then the financial crisis hits and we immediately see a weakening of sales across England and Wales as it becomes more difficult to get a mortgage. The market shows its first signs of recovery, especially around London, and the recovery then starts to gain momentum outside London as well.
The recovery becomes fairly widespread thanks to various Government initiatives, rock-bottom interest rates and a generally positive feeling about the future. Then Brexit and other factors kick in – the main issue being around ‘buy-to-let’ properties, which are made less lucrative by a three-pronged attack: an increase in stamp duty on a second house, removal of tax breaks for landlords and tightening of lending for a second home (especially interest-only mortgages).

Finally, in 2017 one can see that the market is again cooling down. The latest data suggests house prices have started falling once again, and with the recent rise in interest rates it will make landing a good deal on a mortgage all the more difficult.

Average House Prices

The graph above shows how the average sale price has changed over the years. We see there is a slump in prices starting from 2017. It will be interesting to see how house prices behave as we start 2018. It will be a challenge for people to afford higher mortgages as inflation outstrips income growth. This is especially true for first-time buyers. Given the recent bonanza of zero percent stamp duty for first-time buyers, I am not sure how much of a (positive) impact this will have.

Returns on Properties

The graph above shows how the returns and risks associated with a house change after a given number of years. It is clear that it is easier to get a return when a house is held for at least 5 years. Below that there is a risk of losing money on the property. Properties resold within two years are most likely to make a loss. This also ties in with a ‘distress’ sale scenario, where the house is sold without waiting for the best possible offer, or with times of slowdown when easy-term mortgages are not available.
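As a toy illustration of the return calculation behind this graph (the property, prices and dates below are made up, and the matching of repeat sales of the same address is not shown):

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class HoldingReturn {
    public static void main(String[] args) {
        // Hypothetical repeat sale of the same property
        LocalDate bought = LocalDate.of(2010, 6, 1);
        LocalDate resold = LocalDate.of(2016, 6, 1);
        double buyPrice = 180_000, sellPrice = 210_000;

        double yearsHeld = ChronoUnit.DAYS.between(bought, resold) / 365.25;
        double totalReturn = (sellPrice - buyPrice) / buyPrice;   // fraction gained (or lost) over the holding period

        System.out.printf("Held for %.1f years, return of %.1f%%%n", yearsHeld, totalReturn * 100);
    }
}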

Number of Times Re-sold

The graph above shows the number of times a house is re-sold (vertical axis) against the number of years it is held before being re-sold. Most houses are re-sold within 5 years. But why is there a massive spike for houses re-sold within 2 years? One possible explanation is that these are houses bought by a developer, improved and then re-sold within a year or so.

House Transactions by Month of Year

Transaction by month

What is the best time of the year to sell your house? Counting the number of transactions by month (figure above) we can see the number of transactions increases as Spring starts and continues to grow until the end of Summer. In fact 60% more houses are sold in the Jun-Aug period compared to the Jan-Mar period.

Transactions tend to decrease slightly as Autumn starts and fall off towards the end of the year. This is expected, as people would not want to move right after Christmas or early in the new year (winter moves are difficult!).

Infrastructure

I have used Apache Spark (using Java) to summarise the data from approximately 4 GB to 1-1.5 GB CSV files and then Python to do next round of aggregations and to generate the plots.

 

Next step will be to incorporate some Machine Learning into the process.

House Price and Transactions with UK Elections

We are just getting over the not-so-shocking election result in the UK (8th June 2017).

I wanted to look at house prices and how they are affected by election results.

The graphs below plot House Price/ Number of Transactions against date (blue dots). The data is averaged over a month and is normalised to 1.0.

The vertical lines represent UK general elections with blue representing clear results (clear majority) and black lines representing hung Parliament. There is a black line (2nd from right) that represents EU Referendum (‘Brexit’).

The orange dots represent the performance of GBP (Sterling) against the INR (Indian Rupee) and CNY (Chinese Yuan). The data is a daily average, normalised to 1.0.
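(As an aside, here is a minimal sketch of the normalisation step, assuming ‘normalised to 1.0’ means each series is scaled by its own maximum so that its peak equals 1.0 – the exact scheme is not spelled out above:)

// Scale a series by its maximum so the largest value becomes 1.0
public static double[] normaliseToOne(double[] series) {
    double max = Double.NEGATIVE_INFINITY;
    for (double value : series) {
        max = Math.max(max, value);
    }
    double[] scaled = new double[series.length];
    for (int i = 0; i < series.length; i++) {
        scaled[i] = series[i] / max;
    }
    return scaled;
}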

We can see house prices grow aggressively after clear results. The period from 2008 onward is the ‘financial crisis’ era, which is further complicated by a hung Parliament in 2010. The actual recovery takes a few years and by 2014 the boom times are back! The growth is further enhanced by a Conservative majority in 2015.

It is too early to see the impact of Brexit on the housing market but as far as GBP goes there has been a fall against all major currencies.

This means investment into the UK housing market has become cheaper for ‘international’ buyers. The growth in house prices is compensated for by the fall in the pound (we can see this in the relative falls in the two graphs).

Already the house price increase is cooling off (prices are falling in many regions where they were over-inflated to begin with). With the messy general election of 2017 increasing the uncertainty, especially around Brexit, house prices driven by internal demand should decrease or flatten out. We can already see this starting. People might rush in to lock in their mortgage rates (thereby boosting short-term demand) as the Bank of England has indicated a rise in interest rates in the near future.

What happens if we look at the number of transactions? The normalised graph above shows that during the financial crisis era transactions fell sharply, then began to revive (correlating with the rise in house prices). The strong position of the Conservatives further supported the market.

But as soon as the Stamp Duty increase came into the picture the number of transactions started reducing, and after ‘Brexit’, leading up to the 2017 General Election, we can see a sharp fall in transactions.

All of these things indicate that people are not sure about what will happen in the future so are not willing to take positions of risk.

Stamp duty change (1st April 2016)

A final interesting titbit – why is there a massive spike in transactions in a subdued period of house sales (the red arrow)? And no, this is not an error! The month is March 2016, and the spike is there because stamp duty changes were being introduced from 1st April 2016, which meant buying a second home (without selling the first one) would become a lot more expensive!

[This analysis uses the Land Registry data set which is processed using Apache Spark, Python was used to further process and plot the data]

Raspberry Pi Cluster and Apache Spark!

So over the Christmas holidays I have been busy playing with my 4 x Raspberry Pi 3 (Model B) units which I have assembled into a stack. They each have a 16 GB Memory Card with Raspbian.

Spark Pi Cluster

The Spark Master is running on a NUC (the Spark driver program runs there or I simply use the ‘spark-shell’).

If you want to make your own cluster here is what you will need:

  • Raspberry Pi 3 Model B (I bought 4 of them – just the Pi’s – don’t bother with the ‘Kit’ because you won’t need the individual cases or power supplies).
  • Raspbian on a memory card (16 GB will work fine) for each Pi.
  • A stacking plate set (one per Raspberry Pi to mount it) and one pair of ‘end plates’. This acts as a ‘rack’ for your Pi cluster. It also makes sure your Pi boards get enough ventilation and you can place the whole set neatly in a corner instead of having them lying around on the dining table!
  • Multi-device USB power supply (I would suggest Anker 60W PowerPort with 6 USB ports – which can support up to 6 Pi 3’s) so that you end up with one power plug instead of one plug per Pi.
  • To connect the Pi boards to the Internet (and to each other – for the Spark cluster) you will need a multi-port Gigabit switch – I would suggest buying one with at least 8 ports as you will need 1 port per Pi and 1 port to connect to your existing network.
  • A wireless keyboard-trackpad to setup each Pi (just once per Pi).
  • A single HDMI cable to connect with a TV/Monitor (just once per Pi).

Setting up the Pi boards:

Once you have assembled the rack and mounted the boards, install the memory cards on all the boards and connect them to the power supply and the network. Wait for the Pi boards to boot up.

Then one Pi at a time:

  • Connect a keyboard, mouse and monitor – ensure the Pi is working properly then:
    • Set hostname
    • Disable Wireless LAN (as you have Ethernet connectivity – which is more stable)
    • Check SSH works – this will make sure you can remotely work on the Pi

Raspberry Pi Cluster Image

Once all that is done and you can SSH into the Pi boards – time to install Spark:

Again one Pi at a time:

  • SSH into the Pi and use curl -O <spark download url> to download the Spark tar.gz
  • tar -xvf <spark tar.gz file> to unzip the tar.gz to a standard location (I use ‘/spark/’ on all the Pi boards)
  • Make sure correct permissions are assigned to the spark folder
  • Add the master machine hostname to the /etc/hosts file
  • Edit your ~/.bashrc and add the following: export SPARK_HOME=<the standard location for your spark>

Similarly install Spark on a node which you will use as the ‘spark cluster master’ – use the same standard location.

Start up the master using Spark’s ‘start-master.sh’ script. If you go to http://<IP of the Master Node>:8080/ you should see the Spark web page with the status of the Workers (empty to start with) and various other bits of useful information, such as the Spark master URL (which we will need for the next step), the number of available CPUs and application information. The last item – application information – is particularly useful for tracking running applications.

SSH into each of the Pi boards and execute the following: ‘start-slave.sh spark://<IP of the Master Node>:7077’ to convert each Pi board into a Spark slave.

Now if you look at the Spark webpage you will see each of the Slave nodes up (give it a couple of minutes) and you will also see the cluster resources available to you. If you have 4 Pi boards as slaves you will see 4 * 4 = 16 Cores and 4 * 1 GB = 4 GB Memory available.

Running Spark applications:

There are two main things to remember when running a Spark application:

  1. All the code that you are running should be available to ALL the nodes in your cluster (including the master)
  2. All the data that you are using should be available to ALL the nodes in your cluster (including the master).

For the code – you can easily package it up in the appropriate format (language dependent – I used Java so I used Maven to build a JAR with dependencies) and network share a folder. This reference can be used when using the spark-submit command (as the location of the application package).

For the data – you have two options – either use a network share as for the code or copy the data to the SAME location on ALL the nodes (including the master). So for example if on the master you create a local copy of the data at ‘/spark/data’ then you must use the SAME location on all the Pi boards! A local copy is definitely required if you are dealing with large data files.

Some tests:

For my test I used a 4 GB data file (text-csv) and a simple Spark program (using ‘spark-shell’) to load the text file and do a line count.
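For reference, a Java equivalent of that line-count test might look like the sketch below (the master URL placeholder and the file path are assumptions; the file must exist at the same path on every node or on a network share):

import org.apache.spark.sql.SparkSession;

public class LineCountTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PiClusterLineCount")
                .master("spark://<IP of the Master Node>:7077")   // same master URL as used for start-slave.sh
                .getOrCreate();

        // Placeholder path – must be identical on all nodes (or a network share)
        long lines = spark.read().textFile("/spark/data/data.csv").count();
        System.out.println("Line count: " + lines);

        spark.stop();
    }
}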

1: Pi Cluster (4 x Raspberry Pi 3 Model B)

  • Pi with Network shared data file: > 6 minutes (not good at all – this is just a simple line count!)
  • Pi with local copies of the data file: ~ 51 seconds (massive difference by making the data local to the node)

2: Spark standalone running on my laptop (i7 6th Gen., 5600 RPM HDD, SATA3 SSD, 16 GB RAM)

  • Local data file on HDD: > 1 min 30 seconds (worse than a Pi cluster with locally copied data file)
  • Local data file on SSD: ~ 20 seconds (massive difference due to the raw speed of the SSD!)

Conclusion (Breaking the Cluster):

I did manage to kill the cluster! I set up a more complicated data pipeline which does grouping and calculations using the 4 GB data file. It runs within 5 minutes on my laptop (Spark local). The cluster collapsed after processing about 50%. I am not sure if the issue was related to the network (as a bottleneck) or just the Pi boards not being able to take the load. The total file size is greater than the total available memory in the cluster (some RAM is required for the local OS as well).

So my Spark cluster is not going to break any records; in fact I would be better off using Spark standalone on my laptop if it is a one-shot job (i.e. process a large data file and store the results somewhere).

Things get interesting if we had to do this once every few hours and we could automate the ‘local data copy’ step – which should be fairly easy to do. The other option is to create a fast network share (e.g. using SSDs).

What next:

Some nice project which would suit the capabilities of a Pi cluster? Periodic data processing/stream processing task? Node.JS Servers? Please comment and let me know!

Currency Data, Efficient Markets and Influx DB

This post is about processing currency data which I have been collecting since the end of 2014. The data is collected once every hour from Monday 12am till Friday 11pm.

The data-set itself is not large as the frequency of collection is low, but it does cover lots of interesting world events such as Nigerian currency devaluation, Brexit, Trump Presidency, BJP Government in India, EU financial crisis, Demonetisation in India etc.

The image below shows the percentage change histogram for three common currencies (GBP – British Pound, USD – US Dollar and INR – Indian Rupee). The value for Percentage Change (X-Axis) is between -4% and 2%

Percentage Change histogram

What is immediately clear is the so-called ‘fat tail’ shape. The data is highly skewed and shows clear features of ‘power law’ statistics. In other words, the percentage change is related to frequency by an inverse power law. Larger changes (up or down) are rarer than small changes but not impossible (compared with other distributions such as the Normal Distribution).

The discontinuity around Percentage Change = 0% is intentional. We do not want very small changes to be included as these would ‘drown out’ medium and large changes.

Mean Currency Movement

We can use the R code snippet below to repeatedly draw samples of size 100, with replacement, from the movement data (combined across all currencies) and calculate each sample mean. The 1,000 sample means can be plotted on a histogram, which should give us the familiar Normal Distribution [this is the ‘Central Limit Theorem’ in action]. The most common sample mean is 0% – not an unexpected result given the presence of both positive and negative change percentages.

mean_curr_movement <- replicate(1000, {
  mean(sample(data$Percent.Change, 100, replace = TRUE))
})

Compare this with a Normal Distribution where, as we move away from the mean, the probability of occurrence reduces super-exponentially, making large changes almost impossible (a super-exponential quantity reduces a lot faster than a square or a cube).

Equilibrium Theory (or the so-called Efficient Market Hypothesis) would have us believe that the market can be modelled using a Bell Curve (Normal Distribution), where things might deviate from the ‘mean’ but rarely by a large amount, and in the end everything converges back to the ‘equilibrium’ condition. Unfortunately, with the reality of power laws we cannot sleep so soundly, because a different definition of ‘rare’ applies there.

Incidentally earthquakes follow a similar power law with respect to magnitude. This means that while powerful quakes are less frequent than milder ones they are still far from non-existent.

Another magical quality of such systems is that fluctuations and stability often come in clusters. The image below shows the percentage movement over the full two years (approx.). We see a relative period of calm (green area) bracketed by periods of high volatility (red areas).

Movement Over Time

The above graph shows that there are no ‘equilibrium’ states in the price. The invisible hand has not magically appeared to calm things down and reduce any gaps between demand and supply to allow the price of the currency to re-adjust. Otherwise we would have found that the larger the change, the larger the damping force resisting it – thereby making sudden large changes impossible.

For the curious:

All the raw currency data is collected in an InfluxDB instance and then pulled out and processed using custom window functions I wrote in Java. The processed data is then dumped into a CSV (about 6,000 rows) to be processed in R.
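The window processing itself boils down to computing hour-on-hour percentage changes; a minimal sketch of that step (the InfluxDB query/extraction is omitted and the rate values are hypothetical):

import java.util.Arrays;
import java.util.List;

public class PercentChange {
    // Assumes 'rates' is an hourly series for one currency pair, oldest first
    public static double[] percentChanges(List<Double> rates) {
        double[] changes = new double[rates.size() - 1];
        for (int i = 1; i < rates.size(); i++) {
            double previous = rates.get(i - 1);
            double current = rates.get(i);
            changes[i - 1] = 100.0 * (current - previous) / previous;   // percentage change
        }
        return changes;
    }

    public static void main(String[] args) {
        double[] changes = percentChanges(Arrays.asList(0.79, 0.80, 0.78));   // hypothetical GBP per USD rates
        for (double change : changes) {
            System.out.printf("%.2f%%%n", change);
        }
    }
}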

We will explore this data-set a bit more in future posts! This was to get you interested in the topic. There are large amounts of time series data sets available out there that you can start to analyse in the same way.

All the best!

Artificial Neural Networks: Training for Deep Learning – IIa

  1. Artificial Neural Networks: An Introduction
  2. Artificial Neural Networks: Problems with Multiple Hidden Layers
  3. Artificial Neural Networks: Introduction to Deep Learning
  4. Artificial Neural Networks: Restricted Boltzmann Machines
  5. Artificial Neural Networks: Training for Deep Learning – I

This is the second post on Training a Deep Learning network. The best way to read through is by starting from the first post (see above).

This post, like the rest of the series, provides a pathway into deep learning by introducing some of the concepts using common reference points. It is not designed to be an exhaustive research review of deep learning techniques. I have also tried to keep the description neutral of any programming language, though the backing code is written in Java.

So far we have visited shallow neural networks and their building blocks (post 1), investigated their performance on difficult problems and explored their limitations (post 2). Then we jumped into the world of deep networks and described the concept behind them (post 3) and the RBM building block (post 4). Finally, in the previous post we started describing a possible training method for such deep networks (post 5), where we take a local view of the network.

In this post we describe the other side of the training process – where we take the global view of the network.

Network Usage:

Before we start that though, it is very important to take a step back and review what we are trying to do.

Our target is to train a neural network that can be used to classify complex data to a high degree of accuracy for tasks that are relatively easy for Humans to do.

Classification can be done in one of two ways: Discriminative or Generative. We have touched on these in the previous post as well. From a practical perspective the choice needs to be made on the basis of what we want our network to do. If we want to use it purely for a label-generation task for an input then it is enough to have a discriminative model (which basically calculates p(label | input)). Here we are attempting to assign a label to a set of features extracted from the input. That is why discriminative training requires labelled training data.

If you want to actually create new inputs based on certain features then you need to have a generative model (which calculates p(label, input)). In the case of a generative model we do not ‘discriminate’ between inputs based on features using labels (i.e. try and find the label/class boundary). Instead we treat them as a pair of variables and try to model their joint probability. This allows us to create new pairs of inputs and features based on the learned joint probabilities.

For example: if we are using MNIST just to recognise and label handwritten digits then we can work with a discriminative model. To get the discriminative output we need some sort of a ‘capping’ output layer (e.g. softmax) which gives us one clear label (for this example there is one to one correspondence between input and label). We cannot directly work with a probability distribution of features (similar to what we saw in the last post) as an output. The process here is inherently one way, present an input and get the label as an output (thus the propagation is away from the input layer).

But what if we wanted to generate new ‘handwritten’ digits (think of an app that translates a typed letter into a handwritten one which matches your handwriting!)? If we learn p(input, label) we can easily reverse it: we could start with a label and get an ‘input’ (handwritten digit). The direction of generative propagation is opposite to the discriminative one (the propagation is towards the input layer).

Does this mean that we should always target a generative model as it gives us more flexibility? The short answer is No, because generative models usually have poor performance as compared to their discriminative cousins. The long answer is ‘depends on the use-case’.

Symbol Grounding Problem:

Another reason why we show special interest in generative models is because the standard ‘data’ labeling process is very artificial. In real life no such clear labels exist for most of what we experience or even worse: there may be too many labels. For example if we show an image of a cartoon car to say 10 different people and ask them to assign one label to it we are more than likely to get multiple labels such as: cartoon car, car, cartoon… and that is just in the English language! If we had people in that group whose first language was not English they might use other labels which may or may not have a direct correlation with the corresponding English language labels. In fact all these labels are just different symbols that assign meaning to the data. This is the ‘symbol grounding problem’ in AI.

Our brain definitely does not work with strict labels. In fact it matches the joint distribution behavior better – the cartoon in the above example can be analysed at different levels such as: a cartoon, a cartoon car, a cartoon sports car, a cartoon sports car driving very fast…. so as we analyse the same input we have a growing set of labels associated with it.

It would be very messy if we had to learn a different discriminative model for each of the associated labels that operates on the same input data. Also it would be impossible if we were asked to draw a cartoon sports car without some kind of generative model that takes into account all its possible ‘characteristics’ and returns a learned representation (shape, components, size etc.).

If we also take a look at human cognition (which is what we are trying to mimic), simple classification is just one half of the process. Without the generative ability we would not be able to react to the result of the classification. Our brain may classify the weather as ‘likely to be wet’ as the image of the sky travels from the eye to the brain, but it is the reverse propagation from the brain to our muscles that ensures we pick up the umbrella. For our example: as our brain classifies and breaks down the task of drawing a cartoon sports car, it needs to switch into generative mode to actually draw it out.

Here we also have a good reason why generative models should NOT be very accurate or rigid. If we had rigidly learnt generative models that did not change over time (or were very difficult to re-train), there would be no concept of ‘training’, ‘skill’ or ‘creativity’. Given a set of features we all would produce the same (or similar) cartoon sports car! There would be very little difference between the cartoon sports car drawn by a professional cartoonist and one drawn by a child as after a certain point in time a rigid generative model would not respond to additional training.

Note: the above description is an over-simplification of some very complex cognitive processes and is intended only as an aid in understanding the concepts being presented in this post.

MNIST Example:

We can generate digits as we learn to classify them using the greedy learning algorithm described in the previous post. This can be done by simply reversing the direction of propagation from Input => Hidden to Hidden => Input and doing some sampling using clamped hidden vectors.

The process is very simple:

  1. Randomly generate a binary vector equal in length to the top most hidden layer
  2. Clamp this vector to the hidden layer and then propagate down to the visible and back up to the hidden ‘n’ number of times (thus feeding back the result at both hidden and visible layers)
  3. For the last iteration do not propagate back to the hidden unit instead convert the vector on the visible layer into an image

For the test we have the standard MNIST input layer (28 x 28 = 784 inputs). Following that we have 3 hidden layers of 100 neurons each. Each hidden layer is trained using CD-10 on a mini-batch of the MNIST dataset. I will be uploading the associated test files to my GitHub. The file is: rd.neuron.neuron.test.TestRBMMNISTRecipe
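In code the sampling loop might look like the sketch below. This is a single-RBM illustration of the three steps above (the real test uses the trained three-layer stack); the weights here are random placeholders just to keep the sketch self-contained:

import java.util.Random;

public class RBMSamplingSketch {
    static final Random RNG = new Random(42);   // fixed seed so runs are repeatable

    public static void main(String[] args) {
        int visible = 784, hidden = 100, n = 5;

        // In the real test the weights and biases come from the trained RBM stack;
        // random values are used here only so the sketch compiles and runs
        double[][] w = new double[visible][hidden];
        double[] visibleBias = new double[visible];
        double[] hiddenBias = new double[hidden];
        for (double[] row : w)
            for (int j = 0; j < hidden; j++) row[j] = RNG.nextGaussian() * 0.01;

        // Step 1: random binary vector clamped to the top hidden layer
        int[] h = new int[hidden];
        for (int j = 0; j < hidden; j++) h[j] = RNG.nextBoolean() ? 1 : 0;

        // Step 2: propagate down to the visible layer and back up, 'n' times
        int[] v = new int[visible];
        for (int i = 0; i < n; i++) {
            v = down(h, w, visibleBias);
            h = up(v, w, hiddenBias);
        }

        // Step 3: final pass is down only; 'v' is the generated visible vector (reshape to 28 x 28 to view)
        v = down(h, w, visibleBias);
    }

    static int[] down(int[] h, double[][] w, double[] bias) {
        int[] v = new int[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = bias[i];
            for (int j = 0; j < h.length; j++) sum += w[i][j] * h[j];
            v[i] = bernoulli(sigmoid(sum));
        }
        return v;
    }

    static int[] up(int[] v, double[][] w, double[] bias) {
        int[] h = new int[bias.length];
        for (int j = 0; j < bias.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < v.length; i++) sum += w[i][j] * v[i];
            h[j] = bernoulli(sigmoid(sum));
        }
        return h;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    static int bernoulli(double p) { return RNG.nextDouble() < p ? 1 : 0; }
}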

When we set n = 0 we get very fuzzy generated digits:

Generated Digits

I can make out a few rough 2s, some half-formed digits and a lot of ‘0’s!

Let us set n = 5 (therefore we do down–up 5 times and then the 6th pass is just down):

Generated Numbers 6

As you can see the generated digits are a lot cleaner and we also have some relatively complicated digits (‘3’ and ‘6’) and a rough ‘8’ (3rd row from bottom, 4th column from right).

This shows that our network has learnt the features associated with handwritten digits, which it uses to generate new data.

As a final example, let us set n = 50 and generate a larger set of digits:

Generated Digits 50

In the next post we delve deeper into the ‘feature’ – ‘label’ training process and show how we can get our deep network to classify hand-written digits.

Artificial Neural Networks: Restricted Boltzmann Machines

The next building block for Deep Learning Networks is the Restricted Boltzmann Machine (RBM). We are going to use all that we have learnt so far in the ‘Building Blocks’ section (see here).

Bernoulli Trials

There is perhaps one small topic I should cover before we delve into the world of RBMs. This is the concept of a Bernoulli Trial (we have already hinted at it in the previous post). A Bernoulli Trial is an experiment (in the loosest sense) with the following properties:

  • There are always only 2 mutually exclusive outcomes – it is a binary result
    • Coin toss – can only result in a heads OR a tails for a particular toss, there is no third option
    • A team playing a game with a guaranteed result, can either win OR lose the game
  • The probabilities of the two outcomes have finite values (we may not know them) which sum to 1 (because one of the two possible results must be observed after we perform the experiment)

The formulation of a Bernoulli Trial is:

P(s) = (p^s) * (1 - p)^(1 - s), where s = 0 or 1

If we are just interested in one of the events (say s = 1) then:

P(s = 1) = p, where 0 < p < 1

If p = 0 or 1 then it is not a probabilistic scenario – instead it becomes a deterministic scenario because we can accurately predict the result every time (e.g. tossing a ‘fake’ coin with two heads or two tails).

Restricted Boltzmann Machines:

A Boltzmann Machine has a set of fully connected stochastic units. A Restricted Boltzmann Machine does not have fully connected units (there are no connections between units in the same layer) but we retain the stochastic units.

RBM and Boltzmann Machine

When we remove the connections between units in the same layer we arrive at a now familiar structure for a feed-forward neural network. The figure above shows a simple Boltzmann Machine with 4 visible units and an RBM with 2 visible (orange) and 2 hidden units (blue).

The name Boltzmann comes from the fact that these networks learn by minimizing the Boltzmann energy of the system. In statistical physics stable systems are those that have minimum energy levels associated with the state of the different components of the system under the required constraints.

The one interesting thing that separates RBMs from normal neural networks is the stochastic nature of the units.

Taking a simple case (as done by Hinton et al. 2006) of binary outputs (i.e. a unit is either on or off), a stochastic unit can be defined as:

“neuron-like units that make stochastic decisions about whether to be on or off”

(http://www.scholarpedia.org/article/Boltzmann_machine)

Normally (e.g. in Multi Layer Perceptrons) the activation level or score is calculated by:

  1. summing over the incoming scores and weights only from the previous layer (the restricted part)
  2. adding a bias term associated with the unit for which we are calculating the activation score
  3. pushing through an activation function (e.g. sigmoid or ReLU) to get an output value

The whole system is deterministic. If you know the weights, biases and inputs you can, with 100% accuracy calculate the activation scores at each layer.

To make the system stochastic with binary output (on or off), we use the sigmoid function in step 3 (as it gives values between 0 and 1) and add a fourth step to the above:

4. use the sigmoid value as the probability score for a Bernoulli Trial to decide if the unit is on or off

As a worked example:

If sigmoid( sum( x[i] * w[i][j] ) + bias[j] ) for the jth unit is sig[j]

Then the jth unit is on if, upon choosing a uniformly distributed random number between 0 and 1, we find that its value is less than sig[j]. Otherwise it is off.

So if sig[j] = 0.56 and the random number we get is 0.42 then the unit is on with an output of 1; if the random number we get is 0.63 then the unit is off with an output value of 0.
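A small sketch of such a stochastic binary unit (the variable names follow the worked example above; this is illustrative rather than the exact implementation used in the backing code):

import java.util.Random;

public class StochasticUnit {
    static final Random RNG = new Random();

    // Activation of the jth unit given binary inputs x, weights w[i][j] and bias[j]
    static int activate(int[] x, double[][] w, double[] bias, int j) {
        double sum = bias[j];
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * w[i][j];
        }
        double sig = 1.0 / (1.0 + Math.exp(-sum));   // sigmoid output, e.g. 0.56
        return RNG.nextDouble() < sig ? 1 : 0;       // Bernoulli Trial: on (1) or off (0)
    }
}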

Thus the higher the sigmoid output, the lower the chance of the unit being turned off; however high the value, there will always be a small but finite chance that the unit is off.

Similarly, the lower the sigmoid output, the higher the chance of the unit being turned off; however low the value, there will always be a small but finite chance that the unit is on.

This leads to something very interesting:

We end up with a ‘distribution’ of outputs for every input given the weights and biases, as compared to a deterministic approach where the output is constant for a given input as long as we do not change the parameters of the model (i.e. weights and biases).

Before we move on to an example, you might be wondering why we add this extra level of complexity. How can we ever unit-test our model? Well, we can easily unit-test by setting a constant seed for our pseudo-random number generator. As to why we add this extra level of complexity, the answer is simple: it allows us to give a probabilistic result, which in turn allows us to deal with errors in input patterns as well as with patterns that contradict the norm (by making the network less likely to be trapped in a local minimum and less likely to form rigid, hard-to-change associations between input and output).

This is also how our brain works. If you are familiar with hash-maps or computer memory, we need the full key / address to retrieve a piece of information (i.e. it is deterministic). Even if a single character is missing or incorrect we will not be able to retrieve the required information without a lengthy search through all the keys to find the best match (which may use some sort of a probabilistic method to measure the ‘match’).

But our brain is able to retrieve the full set of information based on small samples of itself (e.g. hearing half a dialog from a movie is enough for us to recall the full scene). This is called auto-associative memory.

Worked Example:

Assume we have an RBM with 12 input units and 6 hidden units. The units are of the binary stochastic type.

On the input pass the 12 inputs are passed to all 6 hidden units (through the weighted links), the bias for each hidden unit is added on, the sigmoid is applied and then a Bernoulli Trial is used to determine if the hidden unit is on or off.

Let the input also be a binary pattern. For 12 inputs there can be at most 2^12 = 4096 possible patterns. As the output at the hidden layer is also a binary pattern, we know there are a finite number of them (2^6 = 64 possible patterns).

Given the stochastic nature of the hidden units, for a particular binary input pattern we are likely to see some sort of distribution across the 64 possible output patterns of the hidden layer.

We can use the finite number of possible patterns to examine the distribution for a given input and see how the network is able to distinguish between patterns based on this distribution while keeping the network parameters (weights, biases) the same.

Distribution Visible to Hidden

The image above shows a few plots of this distribution. Each graph is created by a single 12-bit binary input pattern.

The X axis shows the categorical (one of 64 possible) output patterns and the Y axis the count of the number of times each is seen – in each graph we keep the category (X axis) ordering the same so that we can compare the distributions across different inputs.

These are without any kind of training (i.e. weights and biases are not updated), created by pure sampling. The distributions above allow us to distinguish between different input patterns although not with a great deal of accuracy.

We can clearly see that for the input vector (1 0 1 0 0 0 0 0 0 0 0 1) – 2nd graph from the left – the distribution is fairly skewed, with high counts towards the edges and a maximum of about 3165 for output (1 1 1 0 1 0); whereas for the input vector (0 0 0 0 1 0 0 0 0 1 0 0) – 1st graph from the left – the counts are lower and the peaks are evenly spread out.

These distributions tell us about P(h | v) for a given W and Hb which is the probability of seeing a pattern at the hidden units (h) given an input (v) for a given set of weights from visible to hidden (W) and Hidden Unit bias values (Hb). This is also called the ‘up’ phase.

We can also do the reverse of the above: P (v | h) for a given W’ and Vb which is the probability of seeing a pattern at the input units (v) given a hidden layer pattern (h) for a given set of weights from hidden to visible (W’ – transpose of W from above) and Visible Unit bias values (Vb). This is also called the ‘down’ phase.

Distribution Hidden to Visible

This is similar to the Visible to Hidden case. Here each graph represents a particular pattern at the hidden layer (one of a possible 64). The X axis shows the 4096 possible patterns we can get at the input layer when we use the transposed weights matrix and the visible unit bias values (along with the sigmoid – Bernoulli Trial steps). The Y axis is the count.

Again we see the difference in distributions allows us to distinguish between hidden values based on the frequency of resulting input values.

Next Steps:

So far we have dealt with clear cut small dimensional inputs (e.g. 2 dimensional XOR gates, 12 bit input patterns etc.) but what if we want to process a complex picture (made up of millions of pixels in this ‘megapixel’ camera age)?

In our brain, because there are no such deterministic links, it is able (as a very simplistic example) to associate a cartoon of a car and a picture of a car with the abstract label of ‘car’. It does high-level feature abstraction to enable this probabilistic mapping. We are also able to deal with a lot of error (e.g. correctly identifying a car as drawn by a child). The one disadvantage is that we may not be able to reliably recall the same piece of information all the time (like when giving an exam we cannot recall an answer, but it comes to us at some point later in time).

With a probabilistic activation it would be like the 2nd from left image in the Hidden to Visible example, where there are clearly defined peaks for certain input patterns that result when using that hidden pattern. For example, when we think of a car we get associations of ‘a cartoon car’, ‘a poster of a car’, ‘the car we drive’ etc.

If it was a deterministic system we would only ever get one input pattern for a hidden pattern (unless we updated the weights/biases). The relationship (for a generic input – whether at the ‘input layer’ or ‘hidden layer’) between inputs and outputs is always one-to-one.

Note: Don’t confuse one-to-one to mean one input to one-class. One input can activate multiple classes but the point is that it will always activate the same set of classes unless we change the weights/biases.

Also even if we updated weights and biases to take into account new training data, if the association between a feature-label pair was really strong (because of multiple examples), we would not be able to modify it easily.

That brings us to the most important piece of the puzzle – how to build a network out of such layers and then train it to distinguish between even more complex patterns.

Another key point of the training is to make the output distributions highly distinguishable, and in the case of multiple peaks (i.e. multiple high-probability outputs) to ensure all of those are relevant. As an example, for predictive text we want multiple peaks (i.e. multiple probable next words) but we want all of them to be relevant (e.g. ‘eat’ should have high-probability next words such as ‘food’, ‘dinner’, ‘lunch’ but not ‘car’ or ‘rain’).

Contrastive Divergence allows us to do just that! We cover it next time!

As usual – please point out any mistakes… and comment if you found this useful.

Artificial Neural Networks: Introduction to Deep Learning

Firstly, sorry for the break! I have been busy with a few things. But here goes – the next instalment of our ANN series.

So far we have covered in the first and second posts:

a) To classify more complex and real world data which is not linearly separable we need more processing units, these are usually added in the Hidden Layer

b) To feed the processing units (i.e. the Hidden Layer) and to encode the input we utilise an Input Layer which has only one task –  to present the input in a consistent way to the hidden layer, it will not learn or change as the network is trained.

c) To work with multiple Hidden Layer units and to encode the output properly we need an aggregation layer to collect output of the Hidden Layer, this aggregation layer is also called an Output Layer

d) The representation of the input and output can have a big influence on the performance of the neural network especially when it encounters noisy data

To the first point, while it is good to be able to add more hidden units and multiple hidden layers, we quickly come up against the problem of how to train such networks. This is not only a theoretical problem (e.g. vanishing gradient – see the first post) but also a computational one.

The Challenge:

  • More hidden units mean complex features can be learnt from the data
  • Multiple layers are difficult to train using standard back prop due to the vanishing gradient problem
  • Multiple layers are also difficult to train because we need to distribute the training effort so that each layer is able to support the other layers at the end of the training (a hint: perhaps we need to look at independent training of each layer to delink them?)
  • Adding more hidden units also increases the computational load especially if the training data set is massive, Stochastic Gradient Descent only partially solves the problem

The Solution:

  • Add more hidden layers (i.e. more hidden units) to make the network deeper (thus the ‘deep’ in the ‘deep learning’)
  • Use a combination of bottom-up and top-down learning to train this ‘deep’ network
  • Use specialised libraries that support GPU based distributed data processing (e.g. Tensorflow, DL4J)
Deep Network MNIST

The image above shows a deep learning network set up for the MNIST data set, where the inputs are gray-scale images (of constant pixel count 28×28) of single handwritten digits (0 to 9) which need to be mapped to their corresponding number. Each pixel in the 28×28 image is normalised and treated as an input. This gives us a full input length of 784 neurons. There are 10 digits (0 to 9) as possible output classes, so the full output length is 10.

We have previously tried shallow networks and got good performance of about 95% with 300 hidden units organised as a single hidden layer; compare this with deep learning networks, which achieve 99.7%.

Deep learning works by distributing the ‘learning’ load across the hidden layers by ‘learning and aggregating’ smaller features to make larger features.

Image below shows how a Face Detector, which detects if an image contains a face of a child, baby or an adult, may distribute the feature learning and aggregation between the hidden layers.

Deep Learning Face Recognition

The Building Blocks:

There are quite a few different types of ‘deep learning’ networks out there specialised for different application types (such as image tagging, document processing etc.). Some of them are not even ‘deep’ (e.g. Word2Vec) yet they incorporate novel training methods that allow them to deal with complex tasks (such as language translation) as if by magic.

I believe it will be more useful if I describe some of the common building blocks with reference to the ‘Deep Belief Network’ (Hinton et al. 2006) as it is relatively simple (it has ‘boolean’ states in the hidden units instead of real-valued ones).

The common building blocks include:

  • One-Hot Encoding
  • Softmax Layer
  • Sigmoid and ReLU Activation Functions
  • Restricted Boltzmann Machines (RBM)
  • Distributed Layer-wise Unsupervised Training (Contrastive Divergence)
  • Back-prop based Supervised Training (Fine Tuning)

One-Hot Encoding:

One-Hot Encoding is a really simple way of encoding outputs related to states.

The idea is that to represent S different states you need an S-bit binary string where, for each state s, only one bit in the string is set.

This can also be used to represent mutually exclusive classes where we have to pick exactly one (for example a transaction cannot be both fraudulent and normal at the same time).

Let us assume we have a classifier which has to classify all inputs into one of 5 classes A, B, C, D, E.

How do we put this as an output for a neural network? Should we have just one output neuron and divide the possible output range between the 5 classes (e.g. 0-10 = Class A, 11-20 = Class B etc.)? No, that sounds weird, especially because the output values could mean anything and nothing, and it also complicates the training.

The easiest option in this case is to have one output per class, using one-hot encoding (thus a 5-bit output vector). When we train the model we simply require that the corresponding output value be significantly higher than all others so as to indicate that particular class.

One possible scheme:

A = [1,0,0,0,0]

B = [0,1,0,0,0]

C = [0,0,1,0,0]

so on..

Then if we get an output vector like:

[0.16, 0.23, 0.67, 0.03, 0.1]

we can be reasonably sure that our model is telling us the input belongs to Class C, as position 3 (the Class C position) has the highest value. One thing to note is that the output vector still does not tell us anything about how close two values in it are, because these are just numeric values without a comparative scale (unlike probabilistic scores, where all scores are compared to the value of 1 and the highest score wins).
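A small helper pair for this scheme might look like the sketch below (hypothetical method names, just for illustration):

// Encode class index k (0-based) as a one-hot vector of length numClasses
public static double[] oneHot(int k, int numClasses) {
    double[] encoded = new double[numClasses];   // all zeros by default
    encoded[k] = 1.0;
    return encoded;
}

// Read a network output back into a class index by picking the position with the largest value
public static int argMax(double[] output) {
    int best = 0;
    for (int i = 1; i < output.length; i++) {
        if (output[i] > output[best]) {
            best = i;
        }
    }
    return best;
}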

This is where Softmax comes into the picture.

Softmax Layer:

The Softmax Layer is really straightforward to understand. It is usually found as the outermost layer of the network because it has the very important property of being able to convert ANY set of inputs into probability values such that all the values sum to 1 (a very important property for probability values).

The softmax function is:

P(i) = e^(x(i))/Sum(e^(x(j)))

Where x(i) is the ith output that we want to convert to a probability value.

and Sum(P(i)) = 1

Below is the function in Java.

/**
 * Softmax function
 * @param input scores to be converted to probabilities
 * @return probability values (one per input) that sum to 1
 */
public static double[] softmax(double[] input) {
	double[] prob = new double[input.length];

	// Denominator: sum of exponentials of all the inputs
	double sum = 0;
	for (double val : input) {
		sum += Math.exp(val);
	}
	if (sum == 0) {
		throw new IllegalStateException("Sum cannot be zero");
	}
	// Each probability is the exponential of that input divided by the sum
	for (int i = 0; i < input.length; i++) {
		prob[i] = Math.exp(input[i]) / sum;
	}

	return prob;
}

 

Example:

Assume there are 5 output units (thus length of output vector = 5) where each unit represents a class (as defined during the supervised learning phase e.g. Output Unit 1 = Class A; Output Unit 2 = Class B etc.).

For a certain input we get the following outputs (say using the Sigmoid function – see below) at Unit 1 -> 5:

[0.01, 0.23, 0.55, 0.29, 0.1]

While we can say the score for Class C (Unit 3) = 0.55 looks like a good answer, we cannot be 100% sure, as this is not a probability value. Also, Class D (Unit 4) = 0.29 is not that far off – or is it? We can’t say for sure because we do not have a scale to compare against. Wouldn’t it be great if we could convert these to probability values? Then we could compare them with each other, simply pick the largest value as the most probable class and ALSO provide our ‘confidence’ in the result.

If we use Softmax we get the following probability values:

[0.157, 0.195, 0.269, 0.207, 0.172]

(values are rounded to fit here so they may not total exactly 1 – an artefact of rounding and floating point arithmetic)

The result is very interesting! Now we see that Class C and D are just 6% away from each other (approx. 27% and 21%). This is far closer than in the original output vector. So while we can play it safe and choose Class C (the largest probability value), we will have to indicate somehow that we are not very sure about it, as Class D came very close as well.

This can be converted to the following one-hot with a suitable confidence warning:

[0, 0, 1, 0, 0]

The interesting point about Softmax is that the larger the scores, the clearer the separation between the probability values.

For example, if the output was (perhaps from ReLU units – see below):

[1, 2, 4, 3, 0]

we get probability score as: [0.032, 0.0861, 0.636, 0.234, 0.012]

This tells us there is a high probability that Class C is the correct class. The separation is now 40% between Class C and D.

 

Sigmoid and Rectified Linear Unit (ReLU) Activation Function:

The Sigmoid function ensures all output values are between 0 and 1. This allows us to use a probabilistic interpretation of the output. The closer the value is to 0 or 1, the more confident we are about it. If the value is around 0.5 then we are not really sure.

It has the following form:

Sigmoid(x) = 1 / (1 + e^(-x))

One interesting property of the Sigmoid (which lends itself to back-prop training) is that its differential can be represented using Sigmoid:

Sigmoid'(x) = Sigmoid(x)(1-Sigmoid(x))

This also compares well with a Bernoulli Trial where Sigmoid(x) = P(x) and 1-Sigmoid(x) = Q (not x) = 1 – P(x).

This function is also used in ‘logistic regression’ where for a two class problem each class is represented by the edge values of the Sigmoid (0 and 1).

When we look at multi-class problems a common encoding follows the one-hot pattern where there is one sigmoid output per class which tells us how confident we are about that class. Remember it tells us NOTHING about how these compare with each other.

So if for 5 classes we have 5 sigmoid outputs where:

  • a value close to 0 means we are very sure the input does not belong to the class
  • a value close to 1 means we are very sure the input does belong to the class

we would still need something like a Softmax to compare these outputs with each other to choose a single class out of the 5 and reason about how confident we were about that choice.

The ReLU function comes from the behaviour of a half-wave rectifier unit in electrical engineering, where such units convert AC to DC. The function is VERY easy to model and process – there are no messy exponential terms. It allows just the right amount of non-linearity into a neural network, thereby allowing us to handle non-linear classification tasks (more info here).

The function is:

ReLU(x) = max(0,x)

Its derivative for x > 0 (required for back-prop) is a constant value of 1

The ReLU has several advantages including the fact that it is very easy to calculate and differentiate. If you see the difference it brings to the back-prop equations you will never want to even think about using Sigmoid.

The one problem with ReLUs is that if a unit’s input stays at or below zero its gradient is zero, so its weights stop being updated and the unit can get stuck permanently outputting zero. This is called the ‘dying ReLU problem’.
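For reference, both activation functions and their derivatives are tiny to implement (a sketch):

public static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
}

public static double sigmoidDerivative(double x) {
    double s = sigmoid(x);
    return s * (1.0 - s);              // Sigmoid'(x) = Sigmoid(x) * (1 - Sigmoid(x))
}

public static double relu(double x) {
    return Math.max(0.0, x);
}

public static double reluDerivative(double x) {
    return x > 0 ? 1.0 : 0.0;          // zero gradient for x <= 0 (the cause of 'dying' ReLUs)
}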

In the next post(s) we will cover Restricted Boltzmann Machines, Contrastive Divergence and Fine-tuning.

As usual – if I have made any mistakes do let me know!

 

Time-Series Dashboards with Grafana and Influx DB

I have been collecting currency data (once every hour) over the last couple of years, using a Node.JS retriever. The data goes into a Mongo DB instance. We store the timestamp and all the currency pairings against the US Dollar (e.g. how many GBP get you 1 USD).

Some other features we may need:

  • What if we want different pairings? Say we wanted to have pairings against Gold or Euros?
  • We may need to normalise the data if we want to compare the movement of different currencies?
  • We may also need ways of visualising the data (as time series etc.)

To cater for this I decided to bring in Grafana (I did not want to use the ELK stack because I believe if you don’t need text search don’t use elasticSEARCH!). Grafana does not support Mongo out of the box, so I had to bring in Influx DB.

Bringing in Influx DB allowed me to integrate painlessly with Grafana (sort of). But as we all know, when you are integrating systems there is a certain minimum amount of pain you have to experience; juggling things around just shifts that pain to (hopefully) a location where you can deal with it efficiently. In my case I moved the pain to the integration between Mongo and Influx.

I had to create a component that pulls out the raw data from Mongo DB, pumps it through a pipeline to filter (to get the pairings of interest – the full data set has 168 currencies in it!), normalise and inject into Influx DB.

A side note: I used the Influx DB Java API, which was REALLY easy to use; this encouraged me to go ahead with the Influx – Mongo integration in Java.
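As an illustration, writing a point with the InfluxDB Java API looks roughly like the sketch below (the URL, credentials, database name and measurement/tag names are all placeholders):

import java.util.concurrent.TimeUnit;

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

public class InfluxWriter {
    public static void main(String[] args) {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "password");

        Point point = Point.measurement("currency_rate")          // hypothetical measurement name
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .tag("pair", "GBP-XAU")                           // hypothetical tag
                .addField("rate", 0.00065)                        // hypothetical normalised value
                .build();

        influxDB.write("currency_db", "autogen", point);          // database name + retention policy
    }
}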

Grafana – Influx

I also wanted the Influx – Mongo integration to be ‘on-demand’ so that we can create different pairings against different targets – such as the big 3 against Gold, major currencies against SDR (International Monetary Fund – Special Drawing Rights) etc. and populate different time-series databases in Influx. So I gave the integration a REST interface using Spark-Java.

I did encounter one problem with the Grafana – InfluxDB integration: the ‘easy’ query builder did not work, so I had to create the query manually – which is pretty straightforward as the InfluxDB documentation is decent.

Results

I was able to get the dashboards to work with normalised time-series of 6 currencies against Gold (XAU): US Dollar (USD), Indian Rupee (INR), Nigerian Naira (NGN), Euro (EUR), British Pound (GBP) and Chinese Yuan (CNY). I discovered something interesting when I was looking at the data around the ‘Brexit’ vote (23/24 June 2016).

Brexit

The screenshot above is from Grafana, filtered to show EUR-XAU, GBP-XAU, USD-XAU and NGN-XAU. We see a massive dip in the Nigerian Naira before the actual Brexit vote. I was surprised, so I googled it, and it seems that a few days before the Brexit vote the Naira was allowed to float against the USD (earlier it was pegged to it), which led to a massive devaluation.

Then on Brexit day we see a general trend for the selected currencies to fall against Gold, as it suddenly became the asset of choice once the ‘unthinkable’ had happened, with GBP-XAU registering the largest fall. As you can see, the Naira also registers a dip against Gold (XAU).

This also shows how universal Gold is. At one time all currencies were pegged against Gold; now, unofficially, the USD is the currency of trade, but this shows that when something unthinkable or truly ‘world-changing’ happens people run back to the safety of Gold.

Quality of Life Reduced Question Set: Bristol Open Data

This visualisation operates upon a reduced set of questions from the Quality of Life indicators. This data has been provided by the Bristol City Council under the open data initiative (https://opendata.bristol.gov.uk/).

Using this view the reduced question set can be examined as an average of the beta value for a particular question across all wards in Bristol.

Click on a question to focus on it and to examine the beta value across all the wards. A count of wards with positive and negative beta values is also shown; these should correspond to the total green/red marks seen.
Then click on a ward to examine the response over time and see the trend line (associated with beta).

Java and Apache Spark were used to generate the CSV data files.

Link: Dashboard

Criteria for beta calculation: minimum three years data should be available.

Reduced Question Set: