Housing Market: Auto Correlation Analysis

In this post we take a look at housing market data, which consists of all the transactions registered with the UK Land Registry since 1996. So let's get the copyright out of the way:

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

The data-set from HM Land Registry has information about all registered property transactions in England and Wales. The data-set used for this post has all transactions till the end of October 2018. 

To keep things simple, and to focus on the price paid and number-of-transactions metrics, I have removed most of the columns from the data-set and aggregated (sum) by month and year of the transaction. This gives us roughly 280 observations with the following data:

{ month, year, total price paid, total number of transactions }

Since this is a simple time-series, it is relatively easy to process. Figure 1 shows this series in a graph. Note the periodic nature of the graph.

Figure 1: Total Price Paid aggregated (sum) over a month; time on X axis (month/year) and Total Price Paid on Y axis.

The first thing that one can try is auto-correlation analysis to answer the question: given the data available (till end-October 2018), how similar have the last N months been to other periods in the series? Once we identify the periods of high similarity, we should get a good idea of the current market state.

To predict future market state we can use time-series forecasting methods which I will keep for a different post.

Auto-correlation

Auto-correlation is the correlation (Pearson correlation coefficient) of a given sample (A) from a time series against other samples (B) drawn from the same series. Both samples are of the same size.

The correlation value lies between -1 and 1. A value of 1 means perfect correlation between the two samples: they are directly proportional (when A increases, B also increases). A value of 0 implies no correlation, and a value of -1 implies the two samples are inversely proportional (when A increases, B decreases).

The simplest way to explain this is with an example. Assume:

  1. monthly data is available from Jan. 1996 to Oct. 2018
  2. we choose a sample size of 12 (months)
  3. the sample to be compared is the last 12 months (Oct. 2018 – Nov. 2017)
  4. value to be correlated is the Total Price Paid (summed by month).

As the sample size is fixed (12 months) we start generating samples from the series:

Sample to be compared: [Oct. 2018 – Nov. 2017]

Sample 1: [Oct. 2018 – Nov. 2017], this should give correlation value of 1 as both the samples are identical.

Sample 2: [Sep. 2018 – Oct. 2017], the correlation value should start to decrease as we skip back one month.

Sample N: [Dec. 1996 – Jan. 1996], this is the earliest period we can correlate against.
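As a minimal sketch of this sample generation loop (assuming a frame like df_dates from the code at the end of this post, sorted by date, with the monthly Total Price Paid in a ‘Sum’ column):

import numpy as np

window = 12
series = df_dates["Sum"].values
latest = series[-window:]                      # the sample to be compared

corrs = []
for lag in range(len(series) - window + 1):    # lag 0 = the sample itself
    past = series[len(series) - window - lag: len(series) - lag]
    corrs.append(np.corrcoef(latest, past)[0, 1])
# corrs[0] == 1.0; larger lags correlate against windows further back in time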

Now we present two graphs for different sample sizes:

  1. correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year) – to show yearly spread
  2. correlation coefficient visualised going back in time – to show periods of high correlation

One thing to note in all the graphs is that the starting value (right-most) is always 1. That is where we compare the selected sample (the last 12 months) with the first generated sample (also the last 12 months).

In the ‘back in time’ graph we can see the seasonal fluctuations in the correlation, between 1 and -1. This tells us that Total Price Paid has a seasonal aspect to it. This makes sense, as we see many more houses for sale in the summer months than in winter: most people prefer to move when the weather is nice!

Fig 2: Example of In and Out of Phase correlation.

So if we correlate a 12-month period (like this one) one year apart (e.g. Oct. 2018 – Nov. 2017 and Oct. 2017 – Nov. 2016), we should get positive correlation, as the variation of Total Price Paid should have the same shape. This is ‘in phase’ correlation. This can be seen in Figure 2 as the ‘first’ correlation, which is in phase (in fact it is perfectly in phase and the values are identical – thus the correlation value of 1).

Similarly, if the comparison is made ‘out of phase’ (e.g. Oct. 2018 – Nov. 2017 and Jul. 2018 – Aug. 2017), where the variations are opposite, then negative correlation will be seen. This is the ‘second’ correlation in Figure 2.

This is exactly what we see in the figures below. Sample sizes are 6, 12, 18 and 24 months, with two figures for each sample size. The first figure is the spread of the auto-correlation coefficient for a given year. The second figure is the time series plot of the auto-correlation coefficient, where we move back in time and correlate against the last N months. The correlation values fluctuate between 1 and -1 in a periodic manner.


Fig. 3a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year), Sample size: 6 months

Fig. 3b: Correlation coefficient visualised going back in time; Sample size: 6 months


Fig. 4a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 12 months

Fig. 4b: Correlation coefficient visualised going back in time; Sample size: 12 months


Fig. 5a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 18 months

Fig. 5b: Correlation coefficient visualised going back in time; Sample size: 18 months


Fig. 6a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 24 months

Fig. 6b: Correlation coefficient visualised going back in time; Sample size: 24 months

Conclusions

Firstly, if we compare the scatter + box plot figures, especially for 12 months (Figure 4a), we find the correlation coefficients are spread around ‘0’ for most years. One period where this is not so, and where the correlation spread is consistently above ‘0’, is the year 2008, the year that marked the start of the financial crisis. The spread is also ‘tight’, which means all the months of that year showed consistent correlation, for the Total Price Paid, against the last 12 months from October 2018.

The second conclusion we can draw from the positive correlation between the last 12 months (Figure 4b) and the period of the financial crisis is that the variations in the Total Price Paid are similar (weakly correlated) to those at the time of the financial crisis. This obviously does not guarantee that a new crisis is upon us, but it does suggest that the market is slowing down. This is a reasonable conclusion given the double whammy of impending Brexit and the onset of the winter/holiday season (which traditionally marks a ‘slow’ time of the year for property transactions).

The code is once again in Python and attached below:

from matplotlib import pyplot as plt
from datetime import datetime as dt
from matplotlib.dates import YearLocator, MonthLocator
import pandas as pd
import numpy as np

months = MonthLocator(range(1, 13), bymonthday=1, interval=3)
year_loc = YearLocator()

window_size = 12


def is_crisis(year):
    # Label a year: 0 = pre-crisis, 1 = crisis (2008-2012), 2 = post-crisis
    if year < 2008:
        return 0
    elif year > 2012:
        return 2
    return 1


def is_crisis_start(year):
    # True only for 2008, the year that marked the start of the crisis
    return year == 2008


def process_timeline(do_plot=False):
    # Correlate the current window ('current', a global set in the loop below)
    # against every window in the series, moving back one month at a time.
    col = "Count"
    y = []
    x = []
    x_d = []
    box_d = []
    year_d = []
    year = 0
    years_pos = []
    crisis_corr = []
    for i in range(0, size - window_size):
        try:
            # When the year changes, bank the correlations collected for it
            if year != df_dates["Year"].iloc[size - 1 - i]:
                if year > 0:
                    box_d.append(year_d)
                    years_pos.append(year)
                year_d = []
                year = df_dates["Year"].iloc[size - 1 - i]

            corr = np.corrcoef(df_dates[col].iloc[size - i - window_size: size - i].values,
                               current[col].values)
            year_d.append(corr[0, 1])
            y.append(corr[0, 1])
            if is_crisis_start(year):
                crisis_corr.append(corr[0, 1])
            x.append(year)
            month = df_dates["Month"].iloc[size - 1 - i]
            x_d.append(dt(year, month, 15))
        except Exception as e:
            print(e)

    box_d.append(year_d)
    years_pos.append(year)

    corr_np = np.array(crisis_corr)
    corr_mean = corr_np.mean()
    corr_std = corr_np.std()

    print("Crisis year correlation: mean and std.: {} / {}".format(corr_mean, corr_std))

    if do_plot:
        # Scatter + box plot of the correlation coefficient, grouped by year
        fig, sp = plt.subplots()
        sp.scatter(x, y)
        sp.boxplot(box_d, positions=years_pos)
        plt.show()

        # Time series of the correlation coefficient, going back in time
        fig, ax = plt.subplots()
        ax.plot(x_d, y, '-o')
        ax.grid(True)
        ax.xaxis.set_major_locator(year_loc)
        ax.xaxis.set_minor_locator(months)
        plt.show()

    return corr_mean, corr_std


csv = "c:\\ML Stats\\housing_oct_18_no_partial_mnth_cnt_sum.csv"
full_csv = "c:\\ML Stats\\housing_oct_18.csv_mnth_cnt_sum.csv"

df = pd.read_csv(full_csv)

dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))
crisis = list(map(lambda r: is_crisis(int(r[1]["Year"])), df.iterrows()))

df_dates = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum,
                         "Year": df.Year, "Month": df.Month, "Crisis": crisis})
df_dates = df_dates.sort_values(["Date"])
df_dates = df_dates.set_index("Date")

# Figure 1: Total Price Paid (Sum) over time
plt.plot(df_dates["Sum"], '-o')
plt.ylim(ymin=0)
plt.show()

size = len(df_dates["Count"])

for i in range(0, size - window_size):
    end = size - i
    current = df_dates[end - window_size:end]
    print("Length of current: {}, window size: {}".format(len(current), window_size))

    ret = process_timeline(do_plot=True)
    break  # Exit early: only correlate against the most recent window




House Market Analysis

The house prices in UK are at it again. A combination of Brexit, change in housing stock, easy loans and growing consumer debt is making things interesting again.

Figure 1: Number of Transactions per month from 1995 to August 2018

Figure 1 shows the number of transactions every month since 1995: the massive fall after 2007 caused by the financial crisis, then the surge in transactions since 2013. The lonely spot (top-right, March 2016) is just before the new Stamp Duty changes made buying a second house an expensive proposition. But this is relatively boring!

Visual Analytics: Relation between Quantity and Value of Transactions

Let us look at Transaction Count (quantity) and the Total Value of those transactions, aggregated on a monthly basis. I used a Spark cluster to aggregate the full transaction set (a 4 GB CSV data file). The base data set has about 280 rows with the following structure:

{month, year, sum, count}

The month and year values are converted into dates and added to the row, then the data set is sorted by date:

{date, month, year, sum, count}

This leads us to three plots: Sum against time, Count against time, and Sum against Count. These are shown below:

Figure 2: Total Transaction value by date, grouped by year (each dot represents a month in that year)

Figure 2 shows Total Transaction Value by date (Y-axis). The plot is grouped by year, where each dot represents a month in that year. The current year (2018) only has complete data up to August, hence the smaller number of dots.

Figure 3: Total Quantity of Transactions by date, grouped by year (each dot represents a month in that year)

Figure 3 shows Total Quantity of Transactions (Y-axis), once again grouped by year. Similar to Figure 2 the data is complete till August 2018.

Figure 4: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis)

Figure 4 shows how the value of the transactions relates to the number of transactions. Each dot represents a month in a year. As expected there is a slightly positive correlation between the total value of transactions and the number of transactions. A point to note: the total value of transactions depends on the sale prices (which depend on the properties sold) as well as on the number of transactions in a given month. For the same number of transactions the value could be high or low (year on year) depending on whether prices are inflationary or a higher number of good quality houses are part of that month's transactions.

Figure 5: Total Transaction value (Y-axis) against Total number of transaction (X-axis), each point represents a particular month in a year

Figure 5 enhances Figure 4 by using a colour gradient to show the year of the observation. Each year should have 12 points associated with it (except 2018). This concept is further extended by using different marker shapes depending on whether the observation was made before the financial crisis (circle: year of observation before 2008), during the financial crisis (square: between 2008 and 2012) or after the crisis (plus: after 2012). These year boundaries were picked using Figures 2 and 3.

Figure 6: Showing the housing market contracting during the crisis and then expanding

Figure 6 shows the effect of the financial crisis nicely. The circles represent pre-crisis transactions. The squares represent transactions during the crisis. The plus symbol represents post-crisis transactions. 

The rapid decrease in transactions can be seen as the market contracted in 2007-2008. As the number of transactions and the value of transactions start falling, the relative fall in the number of transactions is larger than that in the total value of the transactions. This indicates that prices did fall, but mostly that not enough houses were being sold. Given the difficulty in getting a mortgage, this reduction in the number of transactions could be caused by a lack of demand.

Discovering Data Clusters

Using a three class split (pre-crisis, crisis, post-crisis) provides some interesting results. These were described in the previous section. But what happens if a clustering algorithm is used on the data?

A clustering algorithm attempts to assign each observation to a cluster. Depending on the algorithm, the total number of clusters may be required as an input. Clustering is often helpful when trying to build initial models of the input data, especially when no labels are available; in that case, the cluster id (represented by the cluster centre) becomes the label. The following clustering algorithms were evaluated:

  1. k-means clustering
  2. gaussian mixture model

The data-set for the clustering algorithm has three columns: Date, Monthly Transaction Sum and Monthly Transaction Count.

Given the claw-mark distribution of the data it was highly unlikely that k-means would give good results. That is exactly what we see in Figure 7, with a cluster count of 3 (given that we previously had three labels: before, during and after the crisis). The clustering cuts across the claws.

Figure 7: k-mean clustering with cluster size of 3 – total value of transactions (Y-axis) vs total number of transactions
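For reference, a minimal sketch of a k-means fit like the one behind Figure 7, assuming df_pure is the monthly {Count, Sum, Year} frame built in the code at the end of this post (the full listing below only shows the GMM fit):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0).fit(df_pure[["Count", "Sum"]])
df_pure["KMeansCluster"] = kmeans.labels_
# Plot Sum vs Count coloured by cluster label to reproduce Figure 7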

If a Gaussian mixture model (GMM) is used with a component count of 3 and covariance type ‘full’ (using the sklearn implementation – see code below), some nice clusters emerge, as seen in Figure 8.

Figure 8: Gaussian Mixture model with three components.

Each of the components corresponds to a ‘band’ in the observations. The lowest band corresponds loosely with the pre-crisis market; the middle (yellow) band somewhat expands the crisis market to include entries from before the crisis. Finally, the top-most band (green) corresponds nicely with the post-crisis market.

But what other number of components could we choose? Should we try other GMM covariance types (‘spherical’, ‘diag’ and ‘tied’ as well as ‘full’)? To answer these questions we can run a ‘Bayesian Information Criterion’ (BIC) test across different numbers of components and different covariance types. The method and component count that give the lowest BIC are preferred.
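For reference, the BIC reported by sklearn's bic() method is BIC = k·ln(n) − 2·ln(L̂), where k is the number of free parameters in the mixture, n is the number of observations and L̂ is the maximised likelihood; the k·ln(n) term penalises models that buy extra likelihood with extra components.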

The result is shown in Figure 9.

Figure 9: BIC analysis of the data – BIC score against number of components (X-axis)

From Figure 9 it seems the ‘full’ type consistently gives the lowest BIC on this data-set. Furthermore, going from 3 to 4 components improves the BIC score (the lower the better). Another such jump is from 7 to 8 components. Therefore the number of components should be 4 (see Figure 10) or 8 (see Figure 11).

Figure 10: Transaction value (Y-axis) against Total Number of Transactions – with 4 components.

Figure 11: Transaction value (Y-axis) against Total Number of Transactions – with 8 components.

The 4 component results (Figure 10), when compared with Figure 5, indicate an expansion at the start of the data-set (year: 1995): this is the jump from yellow to green. Then during the crisis there is a contraction (green to purple). Post-crisis there is another expansion (purple to blue). This is shown in Figure 12.

Figure 12: Expansion and contraction in the housing market

The 8 component results (Figure 11), when compared with Figure 5, show the stratification of the data-set based on the Year value. Within the different colours one can see multiple phases of expansion and contraction.

The interesting thing is that for both 4 and 8 component models, the crisis era cluster is fairly well defined.

Code for this is given below:

from matplotlib import pyplot as plt
from datetime import datetime as dt
import pandas as pd
from sklearn.mixture import GaussianMixture

csv = "c:\\ML Stats\\housing_sep_18_no_partial_mnth_cnt_sum.csv"

df = pd.read_csv(csv)

dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

df_pure = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year})
df_pure = df_pure.sort_values(["Date"])
df_pure = df_pure.set_index("Date")

# BIC analysis: fit GMMs with 1 to 9 components for each covariance type
bics = {}
for cmp in range(1, 10):
    clust_sph = GaussianMixture(n_components=cmp, covariance_type='spherical').fit(df_pure)
    clust_tied = GaussianMixture(n_components=cmp, covariance_type='tied').fit(df_pure)
    clust_diag = GaussianMixture(n_components=cmp, covariance_type='diag').fit(df_pure)
    clust_full = GaussianMixture(n_components=cmp, covariance_type='full').fit(df_pure)

    clusts = [clust_full, clust_diag, clust_sph, clust_tied]
    bics[cmp] = []
    for c in clusts:
        bics[cmp].append(c.bic(df_pure))

# Figure 9: BIC score against number of components, one line per covariance type
plt.plot(list(bics.keys()), list(bics.values()))
plt.legend(["full", "diag", "sph", "tied"])
plt.show()

# Fit the chosen model and label each observation with its component
num_components = 4

clust = GaussianMixture(n_components=num_components, covariance_type='full').fit(df_pure)
lbls = clust.predict(df_pure)

df_clus = pd.DataFrame({"Count": df_pure.Count, "Sum": df_pure.Sum,
                        "Year": df_pure.Year, "Cluster": lbls})
color = df_clus["Cluster"]

# Cluster visualisations: Sum vs Count, Count vs Year and Sum vs Year
fig, ax = plt.subplots()
ax.scatter(df_clus["Count"], df_clus["Sum"], c=color)

fig, ax2 = plt.subplots()
ax2.scatter(df_clus["Year"], df_clus["Count"], c=color)

fig, ax3 = plt.subplots()
ax3.scatter(df_clus["Year"], df_clus["Sum"], c=color)

plt.show()

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

Recurrent Neural Networks to Predict Pricing Trends in UK Housing Market

Recurrent Neural Networks (RNN):

RNNs are used when temporal relationships have to be learnt. Some common examples include time series data (e.g. stock prices), sequence of words (e.g. predictive text) and so on.

The basic concept of RNNs is that we train an additional set of weights (along with the standard input-output pair) that associate the past state (time: t-1) with the current state (time: t). This can then be used to predict the future state (time: t+1) given the current state (time: t). In other words, RNNs are NNs with state!

When used for standard time series prediction, the input and output values are taken from the same time series (usually a scalar value). This is a degenerate case of single-valued inputs and outputs. We need to learn the relationship between x(t-1) and x(t) so that we can predict the value of x(t+1) given x(t). This is what I did for this post.

A time series can be made more complicated by making the input a vector of different parameters; the output may still remain a scalar value (a component of x) or be a vector. One reason this is done is to add all the factors that may impact the value to be predicted (e.g. x(t+1)). In our example of average house prices, we may want to add factors such as time of the year, interest rates, salary levels, inflation etc. to provide more "independent" variables in the input.

Two final points:

  • Use-cases for RNNs: Speech to Text, Predictive Text, Music Tagging, Machine Translation
  • RNNs include the additional complexity of training in Time as well as Space; therefore our standard Back-Propagation becomes Back-Propagation Through Time

RNN Structure for Predicting House Prices:

RNN simple time series

The basic time series problem is that we have a sequence of numbers – the average price of houses for a given month and year (e.g. given: X(1), X(2), … X(t-1), X(t)) with a regular step size – and our task is to predict the next number in the sequence (i.e. predict: X(t+1)). In our problem the average price is calculated for every month since January 1995 (thus the step size is 1 month). As a first step we need to define a fixed sequence size to use for training the RNN. For the input data we select a sub-sequence of a given length equal to the number of inputs (in the diagram above there are three inputs). For the training output we select a sub-sequence of the same length as the input, but with the values shifted one step into the future.

Thus if the input sub-sequence is: X(3), X(4) and X(5), then the output sub-sequence must be: X(4), X(5) and X(6). In general, if the input sub-sequence spans time steps a to b, where b > a and b − a + 1 = sub-sequence length, then the output sub-sequence must span a+1 to b+1.

Once the training has been completed, if we provide the last sub-sequence as input we will get the next number in the series as the output. We can see how well the RNN is able to replicate the signal by starting with a sub-sequence in the middle, moving ahead one time step at a time, and plotting actual vs predicted values for the next number in the sequence.

Remember to NORMALISE the data!
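A small sketch of the normalisation and batch construction, assuming ‘prices’ holds the raw average-price series as a 1-D NumPy array (the helper name is mine, not from the original code):

import numpy as np

# Min-max normalise the series to [0, 1]
series = (prices - prices.min()) / (prices.max() - prices.min())

def make_batch(series, n_steps, batch_size):
    # Random input windows, with targets shifted one step into the future
    starts = np.random.randint(0, len(series) - n_steps, size=batch_size)
    X = np.stack([series[s: s + n_steps] for s in starts])
    y = np.stack([series[s + 1: s + n_steps + 1] for s in starts])
    return X.reshape(-1, n_steps, 1), y.reshape(-1, n_steps, 1)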

The parameters are as below:

n_steps = 36 # Number of time steps (thus a = 0 and b = 35, total of 36 months)

n_inputs = 1 # Number of inputs per step (the avg. price for the current month)

n_neurons = 1000 # Number of neurons in the middle layer

n_outputs = 1 # Number of outputs per step (the avg. price for the next month)

learning_rate = 0.0001 # Learning Rate

n_iter = 2000 # Number of iterations

batch_size = 50 # Batch size

I am using TensorFlow’s BasicRNNCell (complete code at the end of the post) but the basic setup is:

import tensorflow as tf

# Input and target sub-sequences: [batch, time steps, features]
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

# Basic RNN cell with a projection layer to map n_neurons down to n_outputs
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

# Unroll the RNN over the time dimension
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# Mean squared error between the predicted and target sequences
loss = tf.reduce_mean(tf.square(outputs - y))
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
training = opt.minimize(loss)

saver = tf.train.Saver()
init = tf.global_variables_initializer()
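The listing above only builds the graph; a minimal training-loop sketch in the same TF1 style might look like the following (make_batch is the hypothetical helper from earlier and the checkpoint path is illustrative):

with tf.Session() as sess:
    init.run()
    for it in range(n_iter):
        X_batch, y_batch = make_batch(series, n_steps, batch_size)
        sess.run(training, feed_dict={X: X_batch, y: y_batch})
        if it % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(it, "MSE:", mse)
            if mse < 1e-4:   # the error threshold discussed below
                break
    saver.save(sess, "./housing_rnn")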

Results:

From a sample of 3 runs, using a Mean Squared Error threshold of 1e-4, we get the following error values:

  1. 8.6831e-05
  2. 9.05436e-05
  3. 9.86998e-05

Run 3 fitting and predictions are shown below:

Orange dots represent the prediction by the RNN and Blue dots represent the actual data

 

Run 3 prediction against existing data 3 years before October 2017

Then we start from October 2017 (Month 24 in the figure below) and forecast ahead to October 2018. This predicts a rise in average prices which starts to plateau in the 3rd quarter of 2018. Given that average house prices across a country like the UK are determined by a large number of noisy factors, we should take this prediction with a pinch of salt.

Run 3 Forecasting from Month 24 (October 2017 for the year ahead till October 2018)

From a sample of 3 runs, using a Mean Squared Error threshold of 1e-3, we get the following error values:

  1. 3.4365e-04
  2. 4.1512e-04
  3. 2.1874e-04

With a higher error threshold, when comparing against actual data (Runs 2 and 3 below), we find the predicted values have a lot less overlap with the actual values. This is expected, as we have traded accuracy for a reduction in training time.

predicted avg price vs actual avg price (Run 2)

predicted avg price vs actual avg price (Run 3)

The projections in this case are a lot different: we see a linearly decreasing average price in 2018.

predicted avg price vs actual avg price with forecast

Next Steps:

I would like to add more parameters to the input, but it is difficult to get correlated data for different factors such as interest rates, inflation etc.

I would also like to try other types of networks (e.g. LSTM) but I am not sure if that would be the equivalent of using a cannon to kill a mosquito.

Finally, if anyone has any ideas on this I would be happy to collaborate!

 

Source code can be found here: housing_tf

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

UK Housing Data Analysis: Additional Price Paid Entry

I have been exploring the HM Land Registry Price Paid Data and have discovered a few more things of interest.

The data contains a ‘Price Paid Data Category Type’ (at the time of writing, the second-to-last column). As per the description of the schema, this field can have one of two values:

A = Standard Price Paid entry, includes single residential property sold for full market value.
B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals.

Therefore there is a way of looking at how properties sold for full market value differ from buy-to-lets, repossessions and power-of-sale transactions. Note that proper Category B tracking only starts from October 2013.

Before we do this it is worthwhile using the ‘Property Type’ field to filter out properties of type ‘Other’, which contribute to the overall noise because they are usually high-value properties such as office buildings. The ‘Property Type’ field has the following values:

D = Detached,

S = Semi-Detached,

T = Terraced,

F = Flats/Maisonettes,

O = Other

Data Pipeline for all transactions:

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts

 

Data Pipeline for Standard and Additional Price Paid Transactions (separate):

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Price Paid Data Category Type, Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts
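As a rough sketch, the second pipeline expressed in pandas terms (the column names are illustrative, not the raw Land Registry headers) could look like:

import pandas as pd

tx = pd.read_csv("price_paid.csv")   # hypothetical extract of the raw data

# Step 1: filter out all transactions with Property Type of Other
tx = tx[tx["PropertyType"] != "O"]

# Steps 2 and 3: group by category, year and month, then aggregate prices
by_cat = (tx.groupby(["PPDCategory", "Year", "Month"])["Price"]
            .agg(["std", "mean", "count"]))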

Tech stuff:

I used a combination of MongoDB (aggregation pipelines for standard heavy-weight aggregations, such as simple grouping), Apache Spark (Java-based, for heavy-weight custom aggregations) and Python (for creating graphs and summarising aggregated data).

Results:

In all graphs Orange points represent Category B related data, Blue represents Category A related data and Green represents a combination of both the Categories.

Transaction Counts

Price Paid Data Category A/B Transaction Count

Category B transactions form a small percentage of the overall transactions (approx. 5-8%).

As the Category B data starts from October 2013, we see a rapid increase in Category B transactions which then settles to a steady rate until 2017, when we see transactions falling as it becomes less lucrative to buy a second house to generate rental income. There is massive variation in the overall and Category A transactions, but here as well we see a downward trend in 2017.

We can also see the sharp fall in transactions due to the financial crisis around 2008.


Average Price

Price Paid Data Category A/B Average Price

Here we find an interesting result. Category B prices are consistently lower than pure Category A prices. But given the relatively small number of Category B transactions, the average price of combined transactions is fairly close to the average price of Category A transactions. This also seems to indicate that in the case of buy-to-let, repossession and power-of-sale conditions the price paid is below the Category A average. Several reasons could exist for such a result:

  1. People buy cheaper properties as buy-to-let and use more expensive properties as their main residence.
  2. Under stressful conditions (e.g. forced sale or repossession) there is urgency to sell and therefore full market rate may not be obtainable.

Standard Deviation of Prices

Price Paid Data Category A/B Price Standard Dev.

The variation in the price for Category B properties is quite high when compared with Category A (the standard price paid transaction). This points to a few things about the Category B market:

  1. A lot more speculative activity is carried out here therefore the impact of ‘expectation’ on price paid is very high – particularly:
    1. ‘expected rental returns’: The tendency here will be to buy cheap (i.e. lowest possible mortgage) and profit from the difference between monthly rental and mortgage payments over a long period of time.
    2. ‘expected profit from a future sale’: The tendency here will be to keep a shorter horizon and buy cheap then renovate and sell at a higher price – either through direct value add or because of natural increase in demand.
  2. For a Standard transaction (Category A) the incentive to speculate may not be present, as housing is a basic necessity.

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

UK House Sales Analysis

I have been looking at house sales data from the UK (actually England and Wales). This is derived from the Land Registry data set (approx. 4 GB) which contains all house sales data from the mid-1990s. The data contains full address information, so one can use reverse geo-coding to get the location of each sale.

Sales Density Over the Years

If we compare the number of sales over the years an interesting picture emerges. Below is the geographical distribution of active regions (w.r.t. number of sales).

In the years 2004-2007 there is strong activity in the housing market – this is especially true for London (the big patch of green), the South coast and the South West of England. The activity penetrates deeper (look at Wales and the South West) as saturation starts to kick in.

Then the financial crisis hits and we can immediately see a weakening of sales across England and Wales as it becomes more difficult to get a mortgage. The market then shows its first signs of recovery, especially around London, and the recovery starts to gain momentum outside London as well. The recovery eventually becomes fairly widespread thanks to various Government initiatives, rock-bottom interest rates and a generally positive feeling about the future. Then Brexit and other factors kick in – the main issue being ‘buy-to-let’ properties, which are made less lucrative by a three-pronged attack: an increase in stamp duty on a second house, the removal of tax breaks for landlords and a tightening of lending for a second home (especially interest-only mortgages).

Finally, in 2017 one can see that the market is again cooling down. The latest data suggests house prices have started falling once again, and with the recent rise in interest rates it will be all the more difficult to land a good deal on a mortgage.

Average House Prices

The graph above shows how the average sale price has changed over the years. We see there is a slump in prices starting in 2017. It will be interesting to see how house prices behave as we start 2018. It will be a challenge for people to afford higher mortgages as inflation outstrips income growth. This is especially true for first-time buyers. Given the recent bonanza of zero per cent stamp duty for first-time buyers, I am not sure how much of a (positive) impact this will have.

Returns on Properties

The graph above shows how the returns and risks associated with a house change with the number of years it is held. It is clear that it is easier to get a return when a house is held for at least 5 years; below that there is a risk of losing money on the property. Properties resold within two years are most likely to make a loss. This also ties in with a ‘distress’ sale scenario, where the house is sold without waiting for the best possible offer, or with times of slowdown, where easy-term mortgages are not available.

Number of Times Re-sold

The graph above shows the number of times a house is re-sold (vertical axis) against the number of years it is held before being re-sold. Most houses are re-sold within 5 years. But why the massive spike for houses re-sold within 2 years? One possible explanation is that these are houses bought by developers, improved and then re-sold within a year or so.

House Transactions by Month of Year

Transaction by month

What is the best time of the year to sell your house? Counting the number of transactions by month (figure above), we can see the number of transactions increases as Spring starts and continues to grow till the end of Summer. In fact 60% more houses are sold in the Jun-Aug period than in the Jan-Mar period.
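A quick sketch of that month-of-year count, assuming a monthly summary frame like the ones used in the earlier posts (with ‘Month’ and ‘Count’ columns):

monthly = df.groupby("Month")["Count"].sum()
print(monthly.loc[[6, 7, 8]].sum() / monthly.loc[[1, 2, 3]].sum())   # ~1.6 per the text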

Transactions tend to decrease slightly as Autumn starts and fall off towards the end of the year. This is expected, as people would not want to move right after Christmas or early in the new year (winter moves are difficult!).

Infrastructure

I have used Apache Spark (using Java) to summarise the data from approximately 4 GB to 1-1.5 GB CSV files, and then Python to do the next round of aggregations and generate the plots.

 

Next step will be to incorporate some Machine Learning into the process.

House Price and Transactions with UK Elections

We are just getting over the not-so-shocking election result in the UK (8th June 2017).

I wanted to look at house prices and how they are affected by election results.

The graphs below plot House Price / Number of Transactions against date (blue dots). The data is averaged over a month and normalised to 1.0.

The vertical lines represent UK general elections, with blue lines representing clear results (a clear majority) and black lines representing a hung Parliament. There is also a black line (2nd from right) that represents the EU Referendum (‘Brexit’).

The orange dots represent GBP (Sterling) performance against INR (Indian Rupee) and CNY (Chinese Yuan). The data is a daily average, normalised to 1.0.

We can see house prices grow aggressively after clear results. The period from 2008 onward is the ‘financial crisis’ era, which is further complicated by a hung Parliament in 2010. The actual recovery takes a few years, and by 2014 the boom times are back! The growth is further enhanced by a Conservative majority in 2015.

It is too early to see the impact of Brexit on the housing market but as far as GBP goes there has been a fall against all major currencies.

This means investment into the UK housing market has been made cheaper for ‘international’ buyers. The growth in house prices is compensated for by the fall in the pound (we can see this in the relative falls in the two graphs).

Already the house price increase is cooling off (prices are falling in many regions where they were over-inflated to begin with). With the messy general election of 2017 increasing the uncertainty, especially around Brexit, house prices driven by internal demand should decrease or flatten out. We can already see this starting. People might rush in to lock in their mortgages (thereby boosting short-term demand), as the Bank of England has indicated a rise in interest rates in the near future.

What happens if we look at the number of transactions? The normalised graph above shows that during the financial crisis era transactions fell sharply, then began to revive (correlating with the rise in house prices). The strong position of the Conservatives further supported the market.

But as soon as the Stamp Duty increase came into the picture the number of transactions started falling, and after ‘Brexit’, leading up to the 2017 General Election, we can see a sharp fall in transactions.

All of this indicates that people are not sure what will happen in the future and so are not willing to take on risk.

Stamp duty change (1st April 2016)

A final interesting titbit: why is there a massive spike in transactions in a subdued period of house sales (the red arrow)? And no, this is not an error! The month is March 2016 – the spike is there because stamp duty changes were being introduced from 1st April 2016, which meant buying a second home (without selling the first one) would become a lot more expensive!

[This analysis uses the Land Registry data set which is processed using Apache Spark, Python was used to further process and plot the data]

Raspberry Pi Cluster and Apache Spark!

So over the Christmas holidays I have been busy playing with my 4 x Raspberry Pi 3 (Model B) units, which I have assembled into a stack. They each have a 16 GB memory card with Raspbian.

Spark Pi Cluster

The Spark Master is running on a NUC (the Spark driver program runs there or I simply use the ‘spark-shell’).

If you want to make your own cluster here is what you will need:

  • Raspberry Pi 3 Model B (I bought 4 of them – just the Pi’s – don’t bother with the ‘Kit’ because you won’t need the individual cases or power supplies).
  • Raspbian on a memory card (16 GB will work fine) for each Pi.
  • A stacking plate set (one per Raspberry Pi to mount it) and one pair of ‘end plates’. This acts as a ‘rack’ for your Pi cluster. It also makes sure your Pi boards get enough ventilation and you can place the whole set neatly in a corner instead of having them lying around on the dining table!
  • Multi-device USB power supply (I would suggest Anker 60W PowerPort with 6 USB ports – which can support up to 6 Pi 3’s) so that you end up with one power plug instead of one plug per Pi.
  • To connect the Pi boards to the Internet (and to each other – for the Spark cluster) you will need a multi-port Gigabit switch – I would suggest buying one with at least 8 ports as you will need 1 port per Pi and 1 port to connect to your existing network.
  • A wireless keyboard-trackpad to setup each Pi (just once per Pi).
  • A single HDMI cable to connect with a TV/Monitor (just once per Pi).

Setting up the Pi boards:

Once you have assembled the rack and mounted the boards, install the memory cards on all the boards and connect them to the power supply and the network. Wait for the Pi boards to boot up.

Then one Pi at a time:

  • Connect a keyboard, mouse and monitor – ensure the Pi is working properly then:
    • Set hostname
    • Disable Wireless LAN (as you have Ethernet connectivity, which is more stable)
    • Check SSH works – this will make sure you can remotely work on the Pi

Raspberry Pi Cluster Image

Once all that is done and you can SSH into the Pi boards – time to install Spark:

Again one Pi at a time:

  • SSH into the Pi and use curl -O <spark download url> to download the Spark tar.gz
  • tar -xvf <spark tar.gz file> to extract the tar.gz to a standard location (I use ‘/spark/’ on all the Pi boards)
  • Make sure the correct permissions are assigned to the spark folder
  • Add the master machine hostname to the /etc/hosts file
  • Edit your ~/.bashrc and add the following: export SPARK_HOME=<the standard location for your spark>

Similarly install Spark on a node which you will use as the ‘spark cluster master’ – use the same standard location.

Start up the master using the Spark ‘start-master.sh’ script. If you go to http://<IP of the Master Node>:8080/ you should see the Spark webpage with the status of the Workers (empty to start with) and various other bits of useful information, such as the Spark master URL (which we will need for the next step), the number of available CPUs and application information. The last item – application information – is particularly useful for tracking running applications.

SSH into each of the Pi boards and execute the following: ‘start-slave.sh spark://<IP of the Master Node>:7077’ to turn each Pi board into a Spark slave.

Now if you look at the Spark webpage you will see each of the Slave nodes up (give it a couple of minutes) and you will also see the cluster resources available to you. If you have 4 Pi boards as slaves you will see 4 * 4 = 16 Cores and 4 * 1 GB = 4 GB Memory available.

Running Spark applications:

There are two main things to remember when running a Spark application:

  1. All the code that you are running should be available to ALL the nodes in your cluster (including the master)
  2. All the data that you are using should be available to ALL the nodes in your cluster (including the master).

For the code – you can easily package it up in the appropriate format (language dependent – I used Java, so I used Maven to build a JAR with dependencies) and share a folder over the network. This reference can then be used in the spark-submit command (as the location of the application package).

For the data – you have two options: either use a network share, as for the code, or copy the data to the SAME location on ALL the nodes (including the master). For example, if on the master you create a local copy of the data at ‘/spark/data’, then you must use the SAME location on all the Pi boards! A local copy is definitely required if you are dealing with large data files.

Some tests:

For my test I used a 4 GB data file (text-csv) and a simple Spark program (using ‘spark-shell’) to load the text file and do a line count.
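I used ‘spark-shell’ for this, but a PySpark sketch of the same line-count job would be (the file path and master URL are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://<IP of the Master Node>:7077")
         .appName("line-count")
         .getOrCreate())
print(spark.sparkContext.textFile("/spark/data/housing.csv").count())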

1: Pi Cluster (4 x Raspberry Pi 3 Model B)

  • Pi with Network shared data file: > 6 minutes (not good at all – this is just a simple line count!)
  • Pi with local copies of the data file: ~ 51 seconds (massive difference by making the data local to the node)

2: Spark standalone running on my laptop (i7 6th Gen., 5600 RPM HDD, SATA3 SSD, 16 GB RAM)

  • Local data file on HDD: > 1 min 30 seconds (worse than a Pi cluster with locally copied data file)
  • Local data file on SSD: ~ 20 seconds (massive difference due to the raw speed of the SSD!)

Conclusion (Breaking the Cluster):

I did manage to kill the cluster! I set up a more complicated data pipeline which does grouping and calculations using the 4 GB data file. It runs within 5 minutes on my laptop (Spark local). The cluster collapsed after processing about 50% of the data. I am not sure if the issue was the network (as a bottleneck) or just the Pis not being able to take the load. The total file size is greater than the total memory available in the cluster (and some RAM is required for the local OS as well).

So my Spark cluster is not going to break any records; in fact I would be better off using Spark standalone on my laptop if it is a one-shot job (i.e. process a large data file and store the results somewhere).

Things get interesting if we had to do this once every few hours and we could automate the ‘local data copy’ step – which should be fairly easy to do. The other option is to create a fast network share (e.g. using SSDs).

What next:

Some nice project which would suit the capabilities of a Pi cluster? Periodic data processing/stream processing task? Node.JS Servers? Please comment and let me know!

Currency Data, Efficient Markets and Influx DB

This post is about processing currency data which I have been collecting since the end of 2014. The data is collected once every hour from Monday 12am till Friday 11pm.

The data-set itself is not large, as the frequency of collection is low, but it does cover lots of interesting world events such as the Nigerian currency devaluation, Brexit, the Trump Presidency, the BJP Government in India, the EU financial crisis, demonetisation in India etc.

The image below shows the percentage change histogram for three common currencies (GBP – British Pound, USD – US Dollar and INR – Indian Rupee). The value for Percentage Change (X-axis) is between -4% and 2%.

Percentage Change histogram

What is immediately clear is the so-called ‘fat tail’ configuration. The data is highly skewed and shows clear features of ‘power law’ statistics. In other words, the percentage change is related to frequency by an inverse power law: larger changes (up or down) are rarer than small changes, but not impossible (in contrast to other distributions such as the Normal Distribution).

The discontinuity around Percentage Change = 0% is intentional: we do not want very small changes to be included, as these would ‘drown out’ medium and large changes.

Mean Currency Movement

We can use the R code snippet below to repeatedly draw 100 samples with replacement from the movement data (combined across all currencies) and calculate the sample mean. The sample means can be plotted on a histogram, which should give us the familiar Normal Distribution [this is the ‘Central Limit Theorem’ in action]. The most common sample mean is 0% – not an unexpected result given the presence of both positive and negative change percentages.

mean_curr_movement <- replicate(1000, {
  # Mean of 100 draws (with replacement) from the percentage changes
  mean(sample(data$Percent.Change, 100, replace = TRUE))
})

Compare this with a Normal distribution where, as we move away from the mean, the probability of occurrence falls super-exponentially, making large changes almost impossible (a super-exponential quantity reduces much faster than a square or a cube).

Equilibrium theory (the so-called Efficient Market Hypothesis) would have us believe that the market can be modelled using a Bell Curve (Normal Distribution), where things might deviate from the ‘mean’ but rarely by a large amount, and in the end everything converges back to the ‘equilibrium’ condition. Unfortunately, with the reality of power laws we cannot sleep so soundly, because a different definition of ‘rare’ applies there.

Incidentally earthquakes follow a similar power law with respect to magnitude. This means that while powerful quakes are less frequent than milder ones they are still far from non-existent.
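To make the contrast concrete, a small illustrative sketch – the 4-sigma threshold and the power-law exponent are made up for illustration, not fitted to this data:

from scipy import stats

# Probability of a move at least 4 standard deviations out under a Normal
print(stats.norm.sf(4))   # ~3.2e-05: effectively 'never'

# Survival probability under a power-law (Pareto-style) tail with exponent 2
print(4.0 ** -2)          # 0.0625: rare, but far from impossible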

Another magical quality of such systems is that fluctuations and stability often come in clusters. The image below shows the percentage movement over the full two years (approx.). We see a relative period of calm (green area) bracketed by periods of high volatility (red areas).

Movement Over Time

The graph above shows that there are no ‘equilibrium’ states within the price. The invisible hand has not magically appeared to calm things down and close the gap between demand and supply to allow the price of the currency to re-adjust. Otherwise we would have found that the larger the change, the larger the damping force resisting it – thereby making sudden large changes impossible.

For the curious:

All the raw currency data is collected in an InfluxDB instance, then pulled out and processed using custom window functions I wrote in Java. The processed data is then dumped into a CSV file (about 6000 rows) to be processed in R.

We will explore this data-set a bit more in future posts – this was just to get you interested in the topic. There are large numbers of time series data sets available out there that you can start to analyse in the same way.

All the best!

Using Scala Spark and K-Means on Geo Data

The code (Scala+Maven) can be found here: https://github.com/amachwe/Scala-Machine-Learning

The idea is simple… I found an open geo data (points) set provided by Microsoft (~24 million points). The data is NOT uniformly distributed across the world; in fact it is highly skewed, with large concentrations of location data around China (Beijing specifically) and the US (West Coast).

The data can be found here: https://www.microsoft.com/en-us/download/details.aspx?id=52367

As per the description:

This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). Last published: August 9, 2012.

 

Loading the Data:

The data set is fairly simple: it contains longitude, latitude, altitude and date-time information. All the details are available with the data set (being Microsoft, they have complicated matters by creating a very complex folder structure), but my GeoTrailsLoader object makes easy work of traversing and loading the data into Mongo, ready for you to play around with.

The data is loaded as Points (WGS 84) and indexed using a 2dsphere index. Once the data is in Mongo you can easily test its ‘geographic’ nature by running a geo-query:

{
  $near: {
     $geometry: {
        type: "Point" ,
        coordinates: [ <longitude> , <latitude> ]
     }
  }
}

 

More Query types here: https://docs.mongodb.com/v3.2/applications/geospatial-indexes/

Clustering the Data:

The ScalaWorker does the K-Means training on the geo-data within Mongo using Spark and the Mongo-Spark connector.

We use a local Spark instance (standalone) but you can just as easily use a Spark cluster if you are lucky enough to have access to multiple machines. Just provide the IP Address and Port of your Spark master instead of ‘local[*]’ in the ‘setMaster’ call.

In the example the data is loaded from Mongo into RDDs, and then we initiate K-Means clustering with a cluster count of 2000, using Spark MLlib. Only the longitude and latitude are used for clustering (so we have a simple 2D clustering problem).
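The repository code is Scala, but as a rough PySpark sketch of the same MLlib call (the file name and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master("local[*]").appName("geo-kmeans").getOrCreate()
points = spark.read.csv("points.csv", header=True, inferSchema=True)

# Cluster on longitude and latitude only (a simple 2D problem)
features = VectorAssembler(inputCols=["longitude", "latitude"],
                           outputCol="features").transform(points)
model = KMeans(k=2000, seed=1).fit(features)
centres = model.clusterCenters()   # each centre is a (longitude, latitude) pair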

The clustering operation takes between 2 to 3 hrs on a i7 (6th Gen), 16GB RAM, 7200RPM HDD.

One way of making this work on a ‘lighter’ machine is to limit the amount of data used for K-Means. If you run it with a small data set (say 1 million points) then the operation takes just 10-15 minutes on my machine.

Feel free to play around with the code!

The Results:

The simple 2D cluster centres obtained as a result of the K-Means clustering are nothing but longitudes and latitudes. They represent ‘centre points’ of all the locations present in the data set.

We should expect the centres to lie around high concentrations of location data.

Furthermore, a high concentration of location data implies a ‘popular’ location.

As these cluster centres are nothing but longitudes and latitudes, let us plot them on the world map to see which popular centres of location data are contained within the data set.

Geocluster data (cluster centres) with city names

The image above is a ‘zoomed’ plot of the cluster centres (blue dots). I chose an area with relatively fewer cluster centres to make sure we do not get influenced by the highly skewed data set.

I have provided a sample 2000 cluster centre file here: https://github.com/amachwe/Scala-Machine-Learning/blob/master/cluster_centre_example/clusters_2000.csv

The red text marks the ‘popular areas’ these cluster centres represent. So without knowing anything about the major cities of Eurasia we have managed to locate many of them (Paris, Madrid, Rome, Moscow etc.) just by clustering location data!

We could have obtained a lot of this ‘label’ information automatically by using a reverse geo-coding service (or geo-decoding service), where we pass in the cluster centre and obtain meta-data about that location. For example the cluster centre 41.8963978, 12.4818856 (reversed for the geo-decoding service – in the CSV file it is 12.4818856, 41.8963978) is the following location in Rome:

Piazza Venezia

Wikipedia describes Piazza Venezia as the ‘central hub’ of Rome.

The geo-decoding service I used (with the sample cluster centre) is: http://noc.to/geodecode#41.8963978,12.4818856

Enjoy!

 

Artificial Neural Networks: Training for Deep Learning – IIa

  1. Artificial Neural Networks: An Introduction
  2. Artificial Neural Networks: Problems with Multiple Hidden Layers
  3. Artificial Neural Networks: Introduction to Deep Learning
  4. Artificial Neural Networks: Restricted Boltzmann Machines
  5. Artificial Neural Networks: Training for Deep Learning – I

This is the second post on Training a Deep Learning network. The best way to read through is by starting from the first post (see above).

This post, like the rest of the series, provides a pathway into deep learning by introducing some of the concepts using common reference points. It is not designed to be an exhaustive research review of deep learning techniques. I have also tried to keep the description neutral of any programming language, though the backing code is written in Java.

So far we have visited shallow neural networks and their building blocks (post 1), investigated their performance on difficult problems and explored their limitations (post 2). Then we jumped into the world of deep networks and described the concept behind them (post 3) and the RBM building block (post 4). Finally, in the previous post we started describing a possible training method for such deep networks (post 5), where we take a local view of the network.

In this post we describe the other side of the training process – where we take the global view of the network.

Network Usage:

Before we start that though, it is very important to take a step back and review what we are trying to do.

Our target is to train a neural network that can be used to classify complex data to a high degree of accuracy for tasks that are relatively easy for Humans to do.

Classification can be done in one of two ways: discriminative or generative. We touched on these in the previous post as well. From a practical perspective the choice needs to be made on the basis of what we want our network to do. If we want to use it purely for a label generation task, then it is enough to have a discriminative model (which basically calculates p(label | input)). Here we are attempting to assign a label to a set of features extracted from the input. That is why discriminative training requires labelled training data.

If you want to actually create new inputs based on certain features, then you need a generative model (which calculates p(label, input)). In the case of a generative model we do not ‘discriminate’ between inputs based on features using labels (i.e. try to find the label/class boundary). Instead we treat the labels and inputs as a pair of variables and try to model their joint probability. This allows us to create new pairs of inputs and features based on the learned joint probabilities.

For example: if we are using MNIST just to recognise and label handwritten digits, then we can work with a discriminative model. To get the discriminative output we need some sort of ‘capping’ output layer (e.g. softmax) which gives us one clear label (in this example there is a one-to-one correspondence between input and label). We cannot directly work with a probability distribution of features (similar to what we saw in the last post) as an output. The process here is inherently one way: present an input and get the label as an output (thus the propagation is away from the input layer).

But what if we wanted to generate new ‘handwritten’ digits (think of an app that translates a typed letter into a handwritten one that matches your handwriting!)? If we learn p(input, label) we can easily reverse it: we could start with a label and get an ‘input’ (a handwritten digit). The direction of generative propagation is opposite to the discriminative one (the propagation is towards the input layer).

Does this mean that we should always target a generative model as it gives us more flexibility? The short answer is No, because generative models usually have poor performance as compared to their discriminative cousins. The long answer is ‘depends on the use-case’.

Symbol Grounding Problem:

Another reason why we show special interest in generative models is that the standard data labelling process is very artificial. In real life no such clear labels exist for most of what we experience – or, even worse, there may be too many labels. For example, if we show an image of a cartoon car to 10 different people and ask them to assign one label to it, we are more than likely to get multiple labels such as: cartoon car, car, cartoon… and that is just in the English language! If we had people in that group whose first language was not English, they might use other labels which may or may not have a direct correlation with the corresponding English language labels. In fact all these labels are just different symbols that assign meaning to the data. This is the ‘symbol grounding problem’ in AI.

Our brain definitely does not work with strict labels. In fact it matches the joint distribution behaviour better – the cartoon in the above example can be analysed at different levels, such as: a cartoon, a cartoon car, a cartoon sports car, a cartoon sports car driving very fast… So as we analyse the same input we have a growing set of labels associated with it.

It would be very messy if we had to learn a different discriminative model for each of the associated labels operating on the same input data. It would also be impossible to draw a cartoon sports car, if asked, without some kind of generative model that takes into account all its possible ‘characteristics’ and returns a learned representation (shape, components, size etc.).

If we also take a look at human cognition (which is what we are trying to mimic), simple classification is just one half of the process. Without the generative ability we would not be able to react to the result of the classification. Our brain may classify the weather as ‘likely to be wet’ as the image of the sky travels from the eye to the brain, but it is the reverse propagation from the brain to our muscles that ensures we pick up the umbrella. For our example: as our brain classifies and breaks down the task of drawing a cartoon sports car, it needs to switch into generative mode to actually draw it out.

Here we also have a good reason why generative models should NOT be very accurate or rigid. If we had rigidly learnt generative models that did not change over time (or were very difficult to re-train), there would be no concept of ‘training’, ‘skill’ or ‘creativity’: given a set of features we would all produce the same (or a similar) cartoon sports car! There would be very little difference between the cartoon sports car drawn by a professional cartoonist and one drawn by a child, as after a certain point in time a rigid generative model would not respond to additional training.

Note: the above description is an over-simplification of some very complex cognitive processes and is intended only as an aid in understanding the concepts being presented in this post.

MNIST Example:

We can generate digits as we learn to classify them using the greedy learning algorithm described in the previous post. This can be done by simply reversing the direction of propagation from Input => Hidden to Hidden => Input and doing some sampling using clamped hidden vectors.

The process is very simple:

  1. Randomly generate a binary vector equal in length to the top-most hidden layer
  2. Clamp this vector to the hidden layer, then propagate down to the visible layer and back up to the hidden layer ‘n’ times (thus feeding back the result at both the hidden and visible layers)
  3. For the last iteration do not propagate back to the hidden layer; instead convert the vector on the visible layer into an image

For the test we have the standard MNIST input layer (28 x 28 = 784 inputs). Following that we have 3 hidden layers of 100 neurons each. Each hidden layer is trained using CD-10 on a mini-batch of the MNIST dataset. I will be uploading the associated test files to my GitHub. The file is: rd.neuron.neuron.test.TestRBMMNISTRecipe
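The backing code is Java, but the clamped sampling loop itself is small; a single-RBM NumPy sketch (the real network stacks three hidden layers and clamps the top-most one) might look like:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Draw a binary vector from element-wise Bernoulli probabilities
    return (np.random.rand(*p.shape) < p).astype(float)

def generate_digit(W, b_vis, b_hid, n=5):
    # W: visible x hidden weights; b_*: biases. All assumed pre-trained.
    hidden = sample(np.full(W.shape[1], 0.5))         # step 1: random hidden vector
    for _ in range(n):                                # step 2: down-up 'n' times
        visible = sample(sigmoid(W @ hidden + b_vis))
        hidden = sample(sigmoid(W.T @ visible + b_hid))
    # step 3: final down pass only; probabilities become pixel intensities
    return sigmoid(W @ hidden + b_vis).reshape(28, 28)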

When we set n = 0 we get very fuzzy generated digits:

Generated Digits

I can make out a few rough 2s, some half-formed digits and a lot of ‘0’s!

Let us set n = 5 (so we do down-up passes 5 times and then the 6th pass is just down):

Generated Numbers 6

As you can see, the generated digits are a lot cleaner, and we also have some relatively complicated digits (‘3’ and ‘6’) and a rough ‘8’ (3rd row from bottom, 4th column from right).

This shows that our network has learnt the features associated with handwritten digits, which it uses to generate new data.

As a final example, let us set n = 50 and generate a larger set of digits:

Generated Digits 50

In the next post we delve deeper into the ‘feature’ – ‘label’ training process and show how we can get our deep network to classify hand-written digits.