House Market Analysis

House prices in the UK are at it again. A combination of Brexit, changes in housing stock, easy loans and growing consumer debt is making things interesting again.

Figure 1: Number of Transactions per month from 1995 to August 2018

Figure 1 shows the number of transactions every month since 1995. There is a massive fall post-2007 because of the financial crisis, followed by a surge in transactions since 2013. The lonely spot (top-right, March 2016) is just before the new Stamp Duty changes made buying a second house an expensive proposition. But this is relatively boring!

Visual Analytics: Relation between Quantity and Value of Transactions

Let us look at the Transaction Count (quantity) and the Total Value of those transactions, aggregated on a monthly basis. I used a Spark cluster to aggregate the full transaction set (a 4 GB CSV file). The resulting base data set has about 280 rows with the following structure:

{month, year, sum, count}
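For reference, this kind of monthly aggregation can be sketched in PySpark as below. This is an illustration only – the actual job ran as a Spark job on a cluster, and the column positions, names and date format are assumptions based on the Price Paid CSV layout:

# Minimal PySpark sketch of the monthly aggregation (assumed column layout).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("housing_monthly_agg").getOrCreate()

pp = (spark.read.csv("pp-complete.csv", header=False)    # the Price Paid file has no header row
           .withColumnRenamed("_c1", "Price")            # assumed: column 1 is the price paid
           .withColumnRenamed("_c2", "DateOfTransfer"))  # assumed: column 2 is the transfer date

monthly = (pp.withColumn("Year", F.substring("DateOfTransfer", 1, 4).cast("int"))
             .withColumn("Month", F.substring("DateOfTransfer", 6, 2).cast("int"))
             .groupBy("Year", "Month")
             .agg(F.sum(F.col("Price").cast("double")).alias("Sum"),
                  F.count(F.lit(1)).alias("Count")))

monthly.coalesce(1).write.csv("housing_monthly_sum_count", header=True)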

The month and year values are converted into dates and added to the row, then the data set is sorted by date:

{date, month, year, sum, count}

This leads us to three plots: Sum against time, Count against time, and Sum against Count. These are shown below:

Figure 2: Total Transaction value by date, grouped by year (each dot represents a month in that year)

Figure 2 shows the Total Transaction Value by date (Y-axis). The plot is grouped by year, where each dot represents a month in that year. The current year (2018) only has complete monthly data up to August, hence the smaller number of dots.

Figure 3: Total Quantity of Transactions  by date, grouped by year (each dot represents a month in that year)

Figure 3 shows the Total Quantity of Transactions (Y-axis), once again grouped by year. As in Figure 2, the 2018 data is complete up to August.

Figure 4: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis)

Figure 4 shows how the value of the transactions relates to the number of transactions. Each dot represents a month in a year. As expected, there is a slightly positive correlation between the total value of transactions and the number of transactions. A point to note: the total value of transactions depends on the sale prices (which depend on the properties sold) as well as on the number of transactions in a given month. For the same number of transactions the value could be high or low (year on year), depending on whether prices are inflationary or a higher number of good quality houses are part of that month's transactions.

Figure 5: Total Transaction Value (Y-axis) against Total Number of Transactions (X-axis); each point represents a particular month in a year

Figure 5 enhances Figure 4 by using a colour gradient to show the year of the observation. Each year has 12 points associated with it (except 2018). This concept is further extended by using different marker shapes depending on whether the observation was made before the financial crisis (circle: year of observation before 2008), during the financial crisis (square: year of observation between 2008 and 2012) or after the crisis (plus: year of observation after 2012). These year boundaries were picked using Figures 2 and 3.

Figure 6: The housing market contracting during the crisis and then expanding

Figure 6 shows the effect of the financial crisis nicely. The circles represent pre-crisis transactions. The squares represent transactions during the crisis. The plus symbol represents post-crisis transactions. 

The rapid decrease in transactions can be seen as the market contracted in 2007-2008. As the number and value of transactions start falling, the relative fall in the number of transactions is larger than the fall in total value. This indicates that prices did fall, but the bigger effect was that far fewer houses were being sold. Given the difficulty in getting a mortgage at the time, this reduction in the number of transactions could be caused by a lack of demand.

Discovering Data Clusters

Using a three class split (pre-crisis, crisis, post-crisis) provides some interesting results. These were described in the previous section. But what happens if a clustering algorithm is used on the data?

A clustering algorithm attempts to assign each observation to a cluster. Depending on the algorithm, the total number of clusters may be required as an input. Clustering is often helpful when trying to build initial models of the input data, especially when no labels are available. In that case, the cluster id (represented by the cluster centre) becomes the label. The following clustering algorithms were evaluated:

  1. k-means clustering
  2. Gaussian mixture model

The data-set for the clustering algorithm has three columns: Date, Monthly Transaction Sum and Monthly Transaction Count.

Given the claw-mark distribution of the data, it was highly unlikely that k-means would give good results. That is exactly what we see in Figure 7 with a cluster count of 3 (chosen because we previously had three labels: before, during and after the crisis). The clustering seems to cut across the claws.

Figure 7: k-means clustering with 3 clusters – total value of transactions (Y-axis) vs total number of transactions (X-axis)
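For reference, the k-means fit behind Figure 7 can be sketched as follows (it assumes the df_pure data frame built in the code listing at the end of this section):

# Minimal k-means sketch (assumes df_pure from the listing below: Count, Sum, Year columns).
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

kmeans = KMeans(n_clusters=3, random_state=0).fit(df_pure)

fig, ax = plt.subplots()
ax.scatter(df_pure["Count"], df_pure["Sum"], c=kmeans.labels_)
ax.set_xlabel("Total number of transactions (per month)")
ax.set_ylabel("Total transaction value (per month)")
plt.show()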

If a Gaussian mixture model (GMM) is used with a component count of 3 and covariance type ‘full’ (using the sklearn implementation – see the code below), some nice clusters emerge, as seen in Figure 8.

Figure 8: Gaussian Mixture model with three components.

Each of the components corresponds to a ‘band’ in the observations. The lowest band corresponds loosely with the pre-crisis market, while the middle (yellow) band somewhat expands the crisis market to include entries from before the crisis. Finally, the top-most band (green) corresponds nicely with the post-crisis market.

But what other number of components could we choose? Should we try other GMM covariance types (‘spherical’, ‘diag’ and ‘tied’, as well as ‘full’)? To answer these questions we can compute the Bayesian Information Criterion (BIC) for different numbers of components and different covariance types. The covariance type and component count that give the lowest BIC are preferred.

The result is shown in Figure 9.

Figure 9: BIC analysis of the data – BIC score against number of components (X-axis)

From Figure 9 it seems the ‘full’ type consistently gives the lowest BIC on the data-set. Furthermore, going from 3 to 4 components improves the BIC score (lower is better). Another such jump is from 7 to 8 components. Therefore, the number of components should be 4 (see Figure 10) or 8 (see Figure 11).

Figure 10: Transaction value (Y-axis) against  Total Number of Transactions – with 4 components.

Figure 11: Transaction value (Y-axis) against  Total Number of Transactions – with 8 components.

The 4-component results (Figure 10), when compared with Figure 5, indicate an expansion at the start of the data-set (year: 1995); this is the jump from yellow to green. Then, during the crisis, there is a contraction (green to purple). Post-crisis there is another expansion (purple to blue). This is shown in Figure 12.

Figure 12: Expansion and contraction in the housing market

The 8-component results (Figure 11), when compared with Figure 5, show the stratification of the data-set based on the Year value. Within the different colours one can see multiple phases of expansion and contraction.

The interesting thing is that for both 4 and 8 component models, the crisis era cluster is fairly well defined.

Code for this is given below:

from matplotlib import pyplot as plt
from datetime import datetime as dt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


csv = "c:\\ML Stats\\housing_sep_18_no_partial_mnth_cnt_sum.csv"

# Monthly aggregates: Year, Month, Sum (total value) and Count (number of transactions)
df = pd.read_csv(csv)

# Build a proper date (15th of each month) from the Year and Month columns
dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

df_pure = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year})

# Sort chronologically and index by date
df_pure = df_pure.sort_values(["Date"])
df_pure = df_pure.set_index("Date")

# BIC for each covariance type and component count from 1 to 9
bics = {}
for cmp in range(1, 10):
    clust_sph = GaussianMixture(n_components=cmp, covariance_type='spherical').fit(df_pure)
    clust_tied = GaussianMixture(n_components=cmp, covariance_type='tied').fit(df_pure)
    clust_diag = GaussianMixture(n_components=cmp, covariance_type='diag').fit(df_pure)
    clust_full = GaussianMixture(n_components=cmp, covariance_type='full').fit(df_pure)

    clusts = [clust_full, clust_diag, clust_sph, clust_tied]
    bics[cmp] = []
    for c in clusts:
        bics[cmp].append(c.bic(df_pure))

plt.plot(list(bics.keys()), list(bics.values()))
plt.legend(["full", "diag", "sph", "tied"])
plt.show()

# Fit the chosen model (4 components, full covariance) and label each monthly observation
num_components = 4
clust = GaussianMixture(n_components=num_components, covariance_type='full').fit(df_pure)
lbls = clust.predict(df_pure)

df_clus = pd.DataFrame({"Count": df_pure.Count, "Sum": df_pure.Sum, "Year": df_pure.Year, "Cluster": lbls})
color = df_clus["Cluster"]

# Total value vs. number of transactions, coloured by cluster
fig, ax = plt.subplots()
ax.scatter(df_clus["Count"], df_clus["Sum"], c=color)

# Number of transactions by year, coloured by cluster
fig, ax2 = plt.subplots()
ax2.scatter(df_clus["Year"], df_clus["Count"], c=color)

# Total value by year, coloured by cluster
fig, ax3 = plt.subplots()
ax3.scatter(df_clus["Year"], df_clus["Sum"], c=color)

plt.show()

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

Recurrent Neural Networks to Predict Pricing Trends in UK Housing Market

Recurrent Neural Networks (RNN):

RNNs are used when temporal relationships have to be learnt. Some common examples include time series data (e.g. stock prices), sequences of words (e.g. predictive text) and so on.

The basic concept of RNNs is that we train an additional set of weights (alongside the standard input–output weights) that associate the past state (time: t-1) with the current state (time: t). This can then be used to predict the future state (time: t+1) given the current state (time: t). In other words, RNNs are NNs with state!
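As a toy illustration of that extra set of weights and the hidden state, a single vanilla RNN step can be sketched in NumPy (sizes and weights below are arbitrary and purely illustrative):

# Toy vanilla RNN step in NumPy: the hidden state h carries information forward in time.
import numpy as np

n_inputs, n_neurons = 1, 4                  # arbitrary sizes for illustration
Wx = np.random.randn(n_inputs, n_neurons)   # input -> hidden weights
Wh = np.random.randn(n_neurons, n_neurons)  # hidden -> hidden weights (the "additional" set)
b = np.zeros(n_neurons)

def rnn_step(x_t, h_prev):
    # New state depends on the current input and the previous state
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

h = np.zeros(n_neurons)                         # initial state
for x_t in np.array([[0.1], [0.2], [0.3]]):     # a short input sequence
    h = rnn_step(x_t, h)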

When used for standard time series prediction, the input and output values are taken from the same time series (usually a scalar value). This is a degenerate case of single-valued inputs and outputs. Thus we need to learn the relationship between x(t-1) and x(t) so that we can predict the value of x(t+1) given x(t). This is what I did for this post.

The time series can be made more complicated by making the input a vector of different parameters; the output may still remain a scalar value (one component of x) or become a vector itself. One reason this is done is to include all the factors that may impact the value to be predicted (e.g. x(t+1)). In our example of average house prices, we may want to add factors such as the time of year, interest rates, salary levels, inflation etc. to provide some more “independent” variables in the input.

Two final points:

  • Use-cases for RNNs: Speech to Text, Predictive Text, Music Tagging, Machine Translation
  • RNNs include the additional complexity of training in Time as well as Space, therefore our standard Back-Propagation becomes Back-Propagation Through Time

RNN Structure for Predicting House Prices:

RNN simple time series

The basic time series problem is that we have a sequence of numbers – the average price of houses for a given month and year (e.g. given: X(1), X(2), … X(t-1), X(t)) – with a regular step size, and our task is to predict the next number in the sequence (i.e. predict: X(t+1)). In our problem the average price is calculated for every month since January 1995 (thus the step size is 1 month). As a first step we need to define a fixed sub-sequence length that we are going to use for training the RNN. For the input data we select a sub-sequence of that length (in the diagram above there are three inputs). For the training output we select a sub-sequence of the same length as the input, but with the values shifted one step into the future.

Thus if the input sub-sequence is X(3), X(4) and X(5), then the output sub-sequence must be X(4), X(5) and X(6). In general, if the input sub-sequence spans time steps a to b, where b > a and b - a + 1 equals the sub-sequence length, then the output sub-sequence must span a+1 to b+1.

Once the training has been completed, if we provide the last sub-sequence as input we will get the next number in the series as the output. We can see how well the RNN is able to replicate the signal by starting with a sub-sequence in the middle, moving ahead one time step at a time, and plotting actual vs predicted values for the next number in the sequence.
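A minimal sketch of how such training batches might be cut from the (normalised) series of monthly average prices is shown below; the exact batching in the linked housing_tf code may differ:

# Sketch: build (input, target) sub-sequences where the target is the input shifted one step ahead.
import numpy as np

def next_batch(series, batch_size, n_steps):
    # Pick random sub-sequences of length n_steps; targets are the same windows shifted by 1
    max_start = len(series) - n_steps - 1
    starts = np.random.randint(0, max_start, size=batch_size)
    X = np.array([series[s     : s + n_steps]     for s in starts])
    y = np.array([series[s + 1 : s + n_steps + 1] for s in starts])
    # Reshape to [batch_size, n_steps, 1] as expected by the RNN below
    return X.reshape(-1, n_steps, 1), y.reshape(-1, n_steps, 1)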

Remember to NORMALISE the data!
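For example, a simple min-max scaling into [0, 1] will do, keeping the min and max so that predictions can be scaled back to actual prices (avg_price_per_month is an assumed array of monthly average prices):

# Min-max normalisation of the average-price series (and the inverse for reading predictions back).
import numpy as np

prices = np.asarray(avg_price_per_month, dtype=np.float32)   # assumed: one value per month
p_min, p_max = prices.min(), prices.max()

series = (prices - p_min) / (p_max - p_min)                  # normalised series fed to the RNN

def denormalise(x):
    return x * (p_max - p_min) + p_min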

The parameters are as below:

n_steps = 36 # Number of time steps (thus a = 0 and b = 35, total of 36 months)

n_inputs = 1 # Number of inputs per step (the avg. price for the current month)

n_neurons = 1000 # Number of neurons in the middle layer

n_outputs = 1 # Number of outputs per step (the avg. price for the next month)

learning_rate = 0.0001 # Learning Rate

n_iter = 2000 # Number of iterations

batch_size = 50 # Batch size

I am using TensorFlow’s BasicRNNCell (complete code at the end of the post) but the basic setup is:

# Placeholders for the input and target sub-sequences: [batch, time steps, features]
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

# Basic RNN cell wrapped with an output projection so each time step emits a single value
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

# Unroll the RNN over the time dimension
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# Mean squared error between the predicted and actual next-step values
loss = tf.reduce_mean(tf.square(outputs - y))
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
training = opt.minimize(loss)

saver = tf.train.Saver()

init = tf.global_variables_initializer()
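The training loop itself is not shown above; roughly, it runs Adam updates until the chosen MSE threshold is reached. A sketch, using the next_batch helper and the normalised series from earlier (the checkpoint path is an assumption):

# Sketch of the training loop: run Adam updates until the MSE threshold is reached.
mse_threshold = 1e-4

with tf.Session() as sess:
    init.run()
    for it in range(n_iter):
        X_batch, y_batch = next_batch(series, batch_size, n_steps)
        sess.run(training, feed_dict={X: X_batch, y: y_batch})
        if it % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(it, "MSE:", mse)
            if mse < mse_threshold:
                break
    saver.save(sess, "./housing_rnn_model")   # checkpoint path is an assumption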

Results:

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-4, we get the following error values:

  1. 8.6831e-05
  2. 9.05436e-05
  3. 9.86998e-05

Run 3 fitting and predictions are shown below:

Orange dots represent the prediction by the RNN and Blue dots represent the actual data

 

Run 3 prediction against existing data 3 years before October 2017

Then we start from October 2017 (Month 24 in the figure below) and forecast ahead to October 2018. This predicts a rise in average prices which starts to plateau in the third quarter of 2018. Given that average house prices across a country like the UK are determined by a large number of noisy factors, we should take this prediction with a pinch of salt.

Run 3 forecasting from Month 24 (October 2017) for the year ahead, till October 2018
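Such a forecast can be produced by repeatedly feeding the model its own predictions: take the last n_steps months, predict the next month, append it to the window and repeat. A sketch, continuing from the graph definition and helpers above (the checkpoint path is an assumption):

# Sketch: forecast 12 months ahead by repeatedly feeding the model its own predictions.
with tf.Session() as sess:
    saver.restore(sess, "./housing_rnn_model")   # checkpoint path is an assumption
    window = list(series[-n_steps:])             # last 36 normalised monthly prices
    forecast = []
    for _ in range(12):                          # October 2017 -> October 2018
        X_in = np.array(window[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_in})
        next_val = y_pred[0, -1, 0]              # prediction for the next month
        forecast.append(next_val)
        window.append(next_val)

print([denormalise(v) for v in forecast])        # back to actual price levels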

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-3, we get the following error values:

  1. 3.4365e-04
  2. 4.1512e-04
  3. 2.1874e-04

With a higher error threshold, when comparing against actual data (Runs 2 and 3 below), we find that the predicted values have a lot less overlap with the actual values. This is expected, as we have traded accuracy for a reduction in training time.

predicted avg price vs actual avg price (Run 2)

predicted avg price vs actual avg price (Run 3)

The projections in this case are a lot different. We see a linearly decreasing average price in 2018.

predicted avg price vs actual avg price with forecast

Next Steps:

I would like to add more parameters to the input – but it is difficult to get matching time series for factors such as interest rates, inflation etc.

I would also like to try other types of networks (e.g. LSTM), but I am not sure if that would be the equivalent of using a cannon to kill a mosquito.

Finally, if anyone has any ideas on this, I would be happy to collaborate!

 

Source code can be found here: housing_tf

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

UK Housing Data Analysis: Additional Price Paid Entry

I have been exploring the HM Land Registry Price Paid Data and have discovered a few more things of interest.

The data contains a ‘Price Paid Data Category Type’ (the second-to-last column at the time of writing this post). As per the description of the schema, this field can have one of two values:

A = Standard Price Paid entry, includes single residential property sold for full market value.
B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals.

Therefore it seems there is a way of looking at how properties sold for full market value differ from buy-to-lets, repossessions and power of sale transactions. Proper Category B tracking only starts from October 2013.

Before we do this it is worthwhile to use the ‘Property Type’ field to filter out properties of type ‘Other’, which contribute to the overall noise because these are usually high-value properties such as office buildings. The ‘Property Type’ field has the following values:

D = Detached
S = Semi-Detached
T = Terraced
F = Flats/Maisonettes
O = Other

Data Pipeline for all transactions:

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts

 

Data Pipeline for Standard and Additional Price Paid Transactions (separate):

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Price Paid Data Category Type, Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts
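These two pipelines could be sketched in pandas as follows. This is an illustration only – the actual aggregations ran in MongoDB and Spark, and the file and column names (Price, Date, PropertyType, Category) are assumptions based on the Price Paid schema:

# Sketch of the two pipelines in pandas (assumed column names; the raw Price Paid CSV has no
# header, so the frame is assumed to have been loaded with these names already).
import pandas as pd

tx = pd.read_csv("price_paid_named_columns.csv", parse_dates=["Date"])   # hypothetical file

tx = tx[tx["PropertyType"] != "O"]                       # Step 1: drop 'Other' property types
tx["Year"], tx["Month"] = tx["Date"].dt.year, tx["Date"].dt.month

# Pipeline for all transactions: Steps 2 and 3
all_monthly = (tx.groupby(["Year", "Month"])["Price"]
                 .agg(["std", "mean", "count"])
                 .rename(columns={"std": "PriceStd", "mean": "AvgPrice", "count": "Count"}))

# Pipeline per Price Paid Data Category Type (A vs B): add Category to the grouping
cat_monthly = (tx.groupby(["Category", "Year", "Month"])["Price"]
                 .agg(["std", "mean", "count"])
                 .rename(columns={"std": "PriceStd", "mean": "AvgPrice", "count": "Count"}))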

Tech stuff:

I used a combination of MongoDB (aggregation pipelines for standard heavy-weight aggregations, such as simple grouping), Apache Spark (Java-based, for heavy-weight custom aggregations) and Python (for creating graphs and summarising aggregated data).

Results:

In all graphs Orange points represent Category B related data, Blue represents Category A related data and Green represents a combination of both the Categories.

Transaction Counts

Price Paid Data Category A/B Transaction Count

Category B transactions form a small percentage of the overall transactions (approx. 5-8%).

As the Category B data starts from October 2013, we see a rapid increase in Category B transactions which then settles to a steady rate until 2017, when transactions start falling as it becomes less lucrative to buy a second house to generate rental income. There is a massive variation in the overall and Category A transaction counts, but here as well we see a downward trend in 2017.

We can also see the sharp fall in transactions due to the financial crisis around 2008.


Average Price

Price Paid Data Category A/B Average Price

Here we find an interesting result. Category B prices are consistently lower than pure Category A prices. But given the relatively small number of Category B transactions, the average price of the combined transactions is fairly close to the average price of Category A transactions. This also seems to point to the fact that in the case of buy-to-let, repossession and power-of-sale conditions the price paid is below the average price for Category A. Several reasons could exist for such a result:

  1. People buy cheaper properties as buy-to-let and use more expensive properties as their main residence.
  2. Under stressful conditions (e.g. forced sale or repossession) there is urgency to sell and therefore full market rate may not be obtainable.

Standard Deviation of Prices

Price Paid Data Category A/B Price Standard Dev.

The variation in the price for Category B properties is quite high when compared with Category A (the standard price paid transaction). This can point to a few things about the Category B market:

  1. A lot more speculative activity is carried out here therefore the impact of ‘expectation’ on price paid is very high – particularly:
    1. ‘expected rental returns’: The tendency here will be to buy cheap (i.e. lowest possible mortgage) and profit from the difference between monthly rental and mortgage payments over a long period of time.
    2. ‘expected profit from a future sale’: The tendency here will be to keep a shorter horizon and buy cheap then renovate and sell at a higher price – either through direct value add or because of natural increase in demand.
  2. For a Standard transaction (Category A) the incentive to speculate may not be present, as a home is a basic necessity.

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

UK House Sales Analysis

I have been looking at house sales data from the UK (actually England and Wales). This is derived from the Land Registry data set (approx. 4 GB) which contains all house sales data from the mid-1990s. The data contains full address information, so one can use geo-coding to get the location of each sale.

Sales Density Over the Years

If we compare the number of sales over the years, an interesting picture emerges. Below is the geographical distribution of active regions (with respect to the number of sales).

In the years 2004-2007 there is strong activity in the housing market – this is especially true for London (the big patch of green), the South coast and the South West of England. The activity penetrates deeper (look at Wales and the South West) as saturation starts to kick in.

Then the financial crisis hits and we can immediately see a weakening of sales across England and Wales as it becomes more difficult to get a mortgage. The market shows first signs of recovery, especially around London, and the recovery then starts to gain momentum outside London as well. The recovery becomes fairly widespread thanks to various initiatives by the Government, rock-bottom interest rates and a generally positive feeling about the future. Finally, Brexit and other factors kick in – the main issue is around ‘buy-to-let’ properties, which are made less lucrative thanks to a three-pronged attack: an increase in stamp duty on a second house, removal of tax breaks for landlords, and a tightening of lending for a second home (especially interest-only mortgages).

Finally, in 2017 one can see that the market is again cooling down. The latest data suggests house prices have started falling once again, and with the recent rise in interest rates, landing a good deal on a mortgage will be all the more difficult.

Average House Prices

The graph above shows how the average sale price has changed over the years. We see there is a slump in prices starting from 2017. It will be interesting to see how house prices behave as we start 2018. It will be a challenge for people to afford higher mortgages as inflation outstrips income growth; this is especially true for first-time buyers. Given the recent bonanza of zero percent stamp duty for first-time buyers, I am not sure how much of a (positive) impact this will have.

Returns on Properties

The graph above shows how the returns and risks associated with a house change with the number of years it is held. It is clear that it is easier to get a return when a house is held for at least 5 years. Below that there is a risk of losing money on the property; properties resold within two years are the most likely to make a loss. This also ties in with a ‘distress’ sale scenario, where the house is sold without waiting for the best possible offer, or with times of slowdown where easy-term mortgages are not available.

Number of Times Re-sold

The graph above shows the number of times a house is re-sold (vertical axis) against the number of years it is held before being re-sold. Most houses are re-sold within 5 years. But why the massive spike for houses re-sold within 2 years? One possible explanation is that these are houses bought by a developer, improved and then re-sold within a year or so.

House Transactions by Month of Year

Transactions by month

What is the best time of the year to sell your house? Counting the number of transactions by month (figure above), we can see the number of transactions increases as Spring starts and continues to grow till the end of Summer. In fact about 60% more houses are sold in the Jun-Aug period than in the Jan-Mar period.

Transactions tend to decrease slightly as Autumn starts and fall off towards the end of the year. This is expected, as people would not want to move right after Christmas or early in the new year (winter moves are difficult!).
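The month-of-year comparison above is easy to reproduce from the monthly aggregates; a sketch, assuming a frame like the df_pure used earlier in the clustering analysis (monthly Count values indexed by Date):

# Sketch: total transactions per calendar month, and the summer vs. winter comparison quoted above.
by_month = df_pure.groupby(df_pure.index.month)["Count"].sum()

summer = by_month.loc[[6, 7, 8]].sum()      # Jun - Aug
winter = by_month.loc[[1, 2, 3]].sum()      # Jan - Mar
print("Jun-Aug vs Jan-Mar: %.0f%% more transactions" % (100 * (summer / winter - 1)))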

Infrastructure

I have used Apache Spark (with Java) to summarise the data from approximately 4 GB down to 1-1.5 GB of CSV files, and then Python to do the next round of aggregations and to generate the plots.

 

The next step will be to incorporate some Machine Learning into the process.