UK Housing Data Analysis: Additional Price Paid Entry

I have been exploring the HM Land Registry Price Paid Data and have discovered few more things of interest.

The data contains a ‘Price Paid Data Category Type’ (this is the second last column at the time of writing this post. As per the description of the schema this field can have one of two values:

A = Standard Price Paid entry, includes single residential property sold for full market value.
B = Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals.

Therefore it seems there is a way of looking at how properties sold for full market value differ from buy-to-lets, repossessions and power of sale transactions. Proper Category B tracking only starts from October 2013.

Before we do this it is worthwhile to use the ‘Property Type’ field to filter out properties of type ‘Other’ which contribute to the overall noise because these are usually high value properties such as office buildings. The ‘Property Type’ field has the following values:

D = Detached,

S = Semi-Detached,

T = Terraced,

F = Flats/Maisonettes,

O = Other

Data Pipeline for all transactions:

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts


Data Pipeline for Standard and Additional Price Paid Transactions (separate):

Step 1: Filter out all transactions with Property Type of Other

Step 2: Group using Price Paid Data Category Type, Year and Month of Transaction

Step 3: Calculate Standard Deviations in Price, Average Price and Counts

Tech stuff:

I used a combination of MongoDB (aggregation pipelines for standard heavy weight aggregations – such as simple grouping), Apache Spark (Java based for heavy weight custom aggregations) and Python (for creating graphs and summarising aggregated data)


In all graphs Orange points represent Category B related data, Blue represents Category A related data and Green represents a combination of both the Categories.

Transaction Counts

Price Paid Data Category A/B Transaction Count

Price Paid Data Category A/B Transaction Count

Category B transactions form a small percentage of the overall transactions (5-8% appprox.)

As the Category B data starts from October 2013 we see a rapid increase in Category B transactions which then settles to a steady rate till 2017 where we can see transactions falling as it becomes less lucrative to buy a second house to generate rental income. There is a massive variation in terms of overall and Category A transactions. But here as well we see a downward trend in 2017.

We can also see the sharp fall in transactions due to the financial crisis around 2008.

In all graphs Orange points represent Category B related data, Blue represents Category A related data and Green represents a combination of both the Categories.

Average Price

Price Paid Data Category A/B Average Price

Price Paid Data Category A/B Average Price

Here we find an interesting result. Category B prices are consistently lower than pure Category A. But given the relatively small number of Category B transactions the average price of combined transactions is fairly close to the average price of Category A transactions. This also seems to point to the fact that in case of buy to let, repossessions and power of sale conditions the price paid is below the average price for Category A. Several reasons could exist for such a result:

  1. People buy cheaper properties as buy-to-let and use more expensive properties as their main residence.
  2. Under stressful conditions (e.g. forced sale or repossession) there is urgency to sell and therefore full market rate may not be obtainable.

Standard Deviation of Prices

Price Paid Data Category A/B Price Standard Dev.

Price Paid Data Category A/B Price Standard Dev.

The variation in the price for Category B properties is quite high when compared with Category A (the standard price paid transaction). This can point to few things about the Category B market:

  1. A lot more speculative activity is carried out here therefore the impact of ‘expectation’ on price paid is very high – particularly:
    1. ‘expected rental returns’: The tendency here will be to buy cheap (i.e. lowest possible mortgage) and profit from the difference between monthly rental and mortgage payments over a long period of time.
    2. ‘expected profit from a future sale’: The tendency here will be to keep a shorter horizon and buy cheap then renovate and sell at a higher price – either through direct value add or because of natural increase in demand.
  2. For a Standard transaction (Category A) the incentive to speculate may not be present as it is a basic necessity.

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

Using Scala Spark and K-Means on Geo Data

The code (Scala+Maven) can be found here:

The idea is simple… I found an open Geo data (points) set provided by Microsoft (~24 million points). The data is NOT uniformly distributed across the world, in fact the data is highly skewed and there are large concentrations of location data around China (Beijing specifically) and the US (West-Coast).

The data can be found here:

As per the description:

This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). Last published: August 9, 2012.


Loading the Data:

The data set is fairly simple, it contains longitude, latitude, altitude and time-date information. All the details are available with the data set (being Microsoft they have complicated matters by creating a very complex folder structure – but my GeoTrailsLoader Object makes easy work of traversing and loading the data into Mongo ready for you to play around with it.

The data is loaded as Points (WGS 84) and indexed using a 2dSphere. Once the data is in Mongo you can easily test the ‘geographic’ nature of it by running a geo-query:

  $near: {
     $geometry: {
        type: "Point" ,
        coordinates: [ <longitude> , <latitude> ]


More Query types here:

Clustering the Data:

The ScalaWorker does the K-Means training on the geo-data within Mongo using Spark and the Mongo-Spark connector.

We use a local Spark instance (standalone) but you can just as easily use a Spark cluster if you are lucky enough to have access to multiple machines. Just provide the IP Address and Port of your Spark master instead of ‘local[*]’ in the ‘setMaster’ call.

In the example the data is loaded from Mongo into RDDs and then we initiate K-Means clustering on it with a cluster count of 2000. We use Spark ML Lib for this. Only the longitude and latitude are used for clustering (so we have a simple 2D clustering problem).

The clustering operation takes between 2 to 3 hrs on a i7 (6th Gen), 16GB RAM, 7200RPM HDD.

One way of making this work on a ‘lighter’ machine is to limit the amount of data used for K-Means. If you run it with a small data set (say 1 million) then the operation on my machine just takes a 10-15 mins.

Feel free to play around with the code!

The Results:

The simple 2D cluster centres obtained as a result of the K-Means clustering are nothing but longitudes and latitudes. They represent ‘centre points’ of all the locations present in the data set.

We should expect the centres to be around high concentration of location data.

Furthermore a high concentration of location data implies a ‘popular’ location.

As these cluster centres are nothing but longitudes and latitudes let us plot them on the world map to see what are the popular centres of location data contained within the data set.

Geocluster data (cluster centres) with city names

Geocluster data (cluster centres) with city names

The image above is a ‘zoomed’ plot of the cluster centres (blue dots). I chose an area with relatively fewer cluster centres to make sure we do not get influenced by the highly skewed data set.

I have provided a sample 2000 cluster centre file here:

The red text is the ‘popular area’ these cluster centres represent. So without knowing anything about the major cities of Eurasia we have managed to locate many of them (Paris, Madrid, Rome, Moscow etc.) just by clustering location data!

We could have obtained a lot of this ‘label’ information automatically by using a reverse geo-coding service (or geo-decoding service) where we pass the cluster centre and obtain meta-data about that location. For example for the cluster centre: 41.8963978, 12.4818856 (reversed for the  geo-decoding service – in the CSV file it is: 12.4818856, 41.8963978) is the following location in Rome:

Piazza Venezia

Wikipedia describes Piazza Venezia as the ‘central hub’ of Rome.

The geo-decoding service I used (with the sample cluster centre) is:,12.4818856