Housing Market: Auto Correlation Analysis

In this post we take a look at housing market data, which consists of all the transactions registered with the UK Land Registry since 1996. So let's get the copyright notice out of the way:

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

The data-set from HM Land Registry has information about all registered property transactions in England and Wales. The data-set used for this post has all transactions till the end of October 2018. 

To keep things simple and to focus on the price-paid and number-of-transactions metrics, I have removed most of the columns from the data-set and aggregated (sum) by month and year of the transaction. This gives us roughly 280 observations with the following fields:

{ month, year, total price paid, total number of transactions }
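For reference, this aggregation is a straightforward group-by. The sketch below uses hypothetical file and column names ("Price", "Date") – the raw Land Registry CSV has its own headers – but the shape of the computation is the same:

import pandas as pd

# Hypothetical file and column names - the raw Land Registry CSV uses its own headers.
raw = pd.read_csv("land_registry_price_paid.csv", parse_dates=["Date"])

monthly = (
    raw.assign(Year=raw["Date"].dt.year, Month=raw["Date"].dt.month)
       .groupby(["Year", "Month"])
       .agg(Sum=("Price", "sum"), Count=("Price", "count"))
       .reset_index()
)
monthly.to_csv("housing_mnth_cnt_sum.csv", index=False)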

Since this is a simple time-series, it is relatively easy to process. Figure 1 shows this series in a graph. Note the periodic nature of the graph.

Figure 1: Total Price Paid aggregated (sum) over a month; time on X axis (month/year) and Total Price Paid on Y axis.

The first thing that one can try is auto-correlation analysis to answer the question: Given the data available (till end-October 2018) how similar have the last N months been to other periods in the series? Once we identify the periods of high similarity, we should get a good idea of current market state.

To predict future market state we can use time-series forecasting methods which I will keep for a different post.

Auto-correlation

Auto-correlation is the correlation (Pearson correlation coefficient) of a given sample (A) from a time series against other samples (B) drawn from the same series. Both samples are of the same size.

The correlation value lies between -1 and 1. A value of 1 means perfect correlation between the two samples, where they are directly proportional (when A increases, B also increases). A value of 0 implies no correlation, and a value of -1 implies the two samples are inversely proportional (when A increases, B decreases).

The simplest way to explain this is with an example. Assume:

  1. monthly data is available from Jan. 1996 to Oct. 2018
  2. we choose a sample size of 12 (months)
  3. the sample to be compared is the last 12 months (Oct. 2018 – Nov. 2017)
  4. value to be correlated is the Total Price Paid (summed by month).

As the sample size is fixed (12 months) we start generating samples from the series:

Sample to be compared: [Oct. 2018 – Nov. 2017]

Sample 1: [Oct. 2018 – Nov. 2017], this should give a correlation value of 1 as both samples are identical.

Sample 2: [Sep. 2018 – Oct. 2017], the correlation value should start to decrease as we skip back one month.

Sample N: [Dec. 1996 – Jan. 1996], this is the earliest period we can correlate against.
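A minimal sketch of this sliding-window correlation, assuming `series` holds the monthly Total Price Paid values ordered oldest to newest:

import numpy as np

def back_in_time_correlations(series, window=12):
    # Correlate the last `window` values against every earlier window,
    # sliding back one month at a time (lag 0 first, i.e. correlation 1.0).
    values = np.asarray(series, dtype=float)
    latest = values[-window:]
    corrs = []
    for lag in range(0, len(values) - window + 1):
        end = len(values) - lag
        sample = values[end - window:end]
        corrs.append(np.corrcoef(latest, sample)[0, 1])
    return corrs  # corrs[0] == 1.0 (the sample against itself)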

Now we present two graphs for different sample sizes:

  1. correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year) – to show yearly spread
  2. correlation coefficient visualised going back in time – to show periods of high correlation

One thing to note in all the graphs is that the starting value (right-most) is always 1: that is where we compare the selected sample (the last 12 months) with the first generated sample, which is the same last 12 months.

In the ‘back in time’ graph we can see seasonal fluctuations in the correlation, swinging between 1 and -1. This tells us that Total Price Paid has a seasonal aspect to it, which makes sense: far more houses are put up for sale in the summer months than in winter, as most people prefer to move when the weather is nice!

Fig 2: Example of In and Out of Phase correlation.

So if we correlate two 12-month periods one year apart (e.g. Oct. 2018 – Nov. 2017 and Oct. 2017 – Nov. 2016) we should get a positive correlation, as the variation in Total Price Paid should have the same shape. This is ‘in phase’ correlation. It can be seen in Figure 2 as the ‘first’ correlation, which is in phase (in fact it is perfectly in phase and the values are identical – hence the correlation value of 1).

Similarly, if the comparison is made ‘out of phase’ (e.g. Oct. 2018 – Nov. 2017 and Jul. 2018 – Aug. 2017), where the variations are opposite, then negative correlation will be seen. This is the ‘second’ correlation in Figure 2.
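To make the in-phase / out-of-phase idea concrete, here is a toy example with a purely seasonal signal (12-month period): a window correlated against the same window one full year earlier gives +1, while a 6-month shift gives -1.

import numpy as np

months = np.arange(48)
seasonal = np.sin(2 * np.pi * months / 12.0)   # pure 12-month cycle

last12 = seasonal[-12:]
in_phase = seasonal[-24:-12]    # the same window one full year earlier
out_phase = seasonal[-18:-6]    # shifted by half a cycle (6 months)

print(np.corrcoef(last12, in_phase)[0, 1])    # ~ +1.0 (in phase)
print(np.corrcoef(last12, out_phase)[0, 1])   # ~ -1.0 (out of phase)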

This is exactly what we can see in the figures below, for sample sizes of 6, 12, 18 and 24 months. There are two figures for each sample size: the first shows the spread of the auto-correlation coefficient for a given year; the second is the time-series plot of the auto-correlation coefficient as we move back in time and correlate against the last N months. The correlation values fluctuate between 1 and -1 in a periodic manner.


Fig. 3a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year), Sample size: 6 months

Fig. 3b: Correlation coefficient visualised going back in time; Sample size: 6 months


Fig. 4a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 12 months

Fig. 4b: Correlation coefficient visualised going back in time; Sample size: 12 months


Fig. 5a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 18 months

Fig. 5b: Correlation coefficient visualised going back in time; Sample size: 18 months


Fig. 6a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 24 months

Fig. 6b: Correlation coefficient visualised going back in time; Sample size: 24 months

Conclusions

Firstly, if we compare the scatter + box plot figures, especially for 12 months (Figure 4a), we find the correlation coefficients are spread around ‘0’ for most of the years. One period where this is not so, and where the correlation spread is consistently above ‘0’, is the year 2008 – the year that marked the start of the financial crisis. The spread is also ‘tight’, which means all the months of that year saw consistent correlation, for the Total Price Paid, against the last 12 months from October 2018.

The second conclusion we can draw from the positive correlation between the last 12 months (Figure 4b) and the period of the financial crisis is that the variations in the Total Price Paid are similar (weakly correlated) to those at the time of the financial crisis. This obviously does not guarantee that a new crisis is upon us, but it does suggest that the market is slowing down. This is a reasonable conclusion given the double whammy of impending Brexit and the onset of the winter/holiday season (which traditionally marks a ‘slow’ time of the year for property transactions).

Code is once again in python and attached below:

from datetime import datetime as dt

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator

# Tick locators for the 'back in time' plot.
months = MonthLocator(range(1, 13), bymonthday=1, interval=3)
year_loc = YearLocator()

window_size = 12  # sample size in months


def is_crisis(year):
    # Label a year: 0 = pre-crisis, 1 = crisis (2008-2012), 2 = post-crisis.
    if year < 2008:
        return 0
    elif year > 2012:
        return 2
    return 1


def is_crisis_start(year):
    # True only for 2008, the year the financial crisis started.
    return year == 2008


def process_timeline(do_plot=False):
    # Correlate every window_size-month window (sliding back in time)
    # against `current`, the most recent window_size months.
    col = "Count"
    y = []            # correlation coefficients
    x = []            # year of each window (for the box plot)
    x_d = []          # date of each window (for the time-series plot)
    box_d = []        # per-year groups of coefficients
    year_d = []
    year = 0
    years_pos = []
    crisis_corr = []  # coefficients for windows in the crisis-start year (2008)

    for i in range(0, size - window_size):
        try:
            if year != df_dates["Year"].iloc[size - 1 - i]:
                if year > 0:
                    box_d.append(year_d)
                    years_pos.append(year)
                year_d = []
                year = df_dates["Year"].iloc[size - 1 - i]

            corr = np.corrcoef(df_dates[col][size - i - window_size: size - i].values,
                               current[col].values)
            year_d.append(corr[0, 1])
            y.append(corr[0, 1])
            if is_crisis_start(year):
                crisis_corr.append(corr[0, 1])
            x.append(year)
            month = df_dates["Month"].iloc[size - 1 - i]
            x_d.append(dt(int(year), int(month), 15))

        except Exception as e:
            print(e)

    box_d.append(year_d)
    years_pos.append(year)

    corr_np = np.array(crisis_corr)
    corr_mean = corr_np.mean()
    corr_std = corr_np.std()

    print("Crisis year correlation: mean and std.: {} / {}".format(corr_mean, corr_std))

    if do_plot:
        # Scatter + box plot of the coefficients grouped by year (Figures 3a-6a).
        fig, sp = plt.subplots()
        sp.scatter(x, y)
        sp.boxplot(box_d, positions=years_pos)
        plt.show()

        # Coefficient going back in time (Figures 3b-6b).
        fig, ax = plt.subplots()
        ax.plot(x_d, y, '-o')
        ax.grid(True)
        ax.xaxis.set_major_locator(year_loc)
        ax.xaxis.set_minor_locator(months)
        plt.show()

    return corr_mean, corr_std


csv = "c:\\ML Stats\\housing_oct_18_no_partial_mnth_cnt_sum.csv"
full_csv = "c:\\ML Stats\\housing_oct_18.csv_mnth_cnt_sum.csv"

df = pd.read_csv(full_csv)

# Month number to short name (kept for labelling, not used below).
mnth = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
        7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec"}

dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))
crisis = list(map(lambda r: is_crisis(int(r[1]["Year"])), df.iterrows()))

df_dates = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum,
                         "Year": df.Year, "Month": df.Month, "Crisis": crisis})
df_dates = df_dates.sort_values(["Date"])
df_dates = df_dates.set_index("Date")

# Figure 1: Total Price Paid (Sum) aggregated per month.
plt.plot(df_dates["Sum"], '-o')
plt.ylim(bottom=0)
plt.show()

size = len(df_dates["Count"])

corr_mean_arr = []
corr_std_arr = []
corr_rat = []
idx = []
for i in range(0, size - window_size):
    end = size - i
    current = df_dates[end - window_size:end]  # the most recent window
    print("Length of current: {}, window size: {}".format(len(current), window_size))

    ret = process_timeline(do_plot=True)
    break  # exit early: only correlate against the latest window




UK House Sales Analysis

I have been looking at house sales data from the UK (actually England and Wales). This is derived from the Land Registry data set (approx. 4GB) which contains all house sales data from mid 1990s. Data contains full address information so one can use reverse geo-coding to get the location of the sales.

Sales Density Over the Years

If we compare the number of sales over the years an interesting picture emerges. Below is the geographical distribution of active regions (w.r.t. number of sales).

Reading the maps in sequence:

  • In the years 2004-2007 there is strong activity in the housing market – this is especially true for London (the big patch of green), the South coast and the South West of England.
  • The activity penetrates deeper (look at Wales and the South West) as saturation starts to kick in.
  • The financial crisis hits and we can immediately see a weakening of sales across England and Wales; it becomes more difficult to get a mortgage.
  • The market shows first signs of recovery, especially around London, and the recovery then starts to gain momentum outside London as well.
  • The recovery becomes fairly widespread thanks to various Government initiatives, rock-bottom interest rates and a generally positive feeling about the future.
  • Brexit and other factors kick in – the main issue is around ‘buy-to-let’ properties, which are made less lucrative by a three-pronged attack: an increase in stamp duty on a second house, removal of tax breaks for landlords and tightening of lending for a second home (especially interest-only mortgages).

Finally, in 2017 one can see that the market is again cooling down. The latest data suggests house prices have started falling once again, and the recent rise in interest rates will make landing a good deal on a mortgage all the more difficult.

Average House Prices

The graph above shows how the average sale price has changed over the years. We see there is a slump in prices starting from 2017. It will be interesting to see how house prices behave as we start 2018. It will be a challenge for people to afford higher mortgages as inflation outstrips income growth. This is especially true for first-time buyers. Given the recent bonanza of zero percent stamp duty for first-time buyers, I am not sure how much of a (positive) impact this will have.

Returns on Properties

The graph above shows how the returns and risks associated with a house change after a given number of years. It is clear that it is easier to get a return when a house is held for at least 5 years. Below that there is a risk of losing money on the property. Properties resold within two years are most likely to make a loss. This also ties in with a ‘distress’ sale scenario, where the house is sold without waiting for the best possible offer, or in times of slowdown where easy-term mortgages are not available.
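The underlying calculation is essentially a pairing of successive sales of the same property. A rough sketch, assuming a transactions table with hypothetical 'address', 'date' and 'price' columns (the real data set identifies a property by its full address fields):

import pandas as pd

def holding_period_returns(sales: pd.DataFrame) -> pd.DataFrame:
    # Pair each sale with the previous sale of the same address and work out
    # how long the property was held and what return the resale made.
    sales = sales.sort_values(["address", "date"])
    grouped = sales.groupby("address")
    sales["prev_price"] = grouped["price"].shift(1)
    sales["prev_date"] = grouped["date"].shift(1)
    resold = sales.dropna(subset=["prev_price"]).copy()
    resold["years_held"] = (resold["date"] - resold["prev_date"]).dt.days / 365.25
    resold["return"] = resold["price"] / resold["prev_price"] - 1.0
    return resold[["address", "years_held", "return"]]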

Number of Times Re-sold

The graph above shows the number of times a house is re-sold (vertical) against the number of years it is held for before being re-sold. Most houses are re-sold within 5 years. But why is there a massive spike where houses are re-sold within 2 years? One possible explanation is that these are houses that are bought by a developer, improved and then re-sold within a year or so.

House Transactions by Month of Year

Transactions by month

What is the best time of the year to sell your house? Counting the number of transactions by month (figure above), we can see that the number of transactions increases as spring starts and continues to grow till the end of summer. In fact, around 60% more houses are sold in the Jun – Aug period as compared to Jan – Mar.

Transactions tend to decrease slightly as autumn starts and fall off towards the end of the year. This is expected, as people would not want to move right after Christmas or early in the new year (winter moves are difficult!).
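The month-of-year comparison is a one-liner on the aggregated data. As a sketch, re-using the `df_dates` frame (with its 'Month' and 'Count' columns) from the auto-correlation post above:

# Total transactions per calendar month, summed across all years of data.
by_month = df_dates.groupby("Month")["Count"].sum()

summer = by_month.loc[6:8].sum()   # Jun - Aug
winter = by_month.loc[1:3].sum()   # Jan - Mar
print("Jun-Aug vs Jan-Mar uplift: {:.0%}".format(summer / winter - 1.0))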

Infrastructure

I have used Apache Spark (via its Java API) to summarise the data from approximately 4 GB down to 1-1.5 GB CSV files, and then Python to do the next round of aggregations and to generate the plots.
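My summarisation step was written in Java; purely as an illustration, a PySpark sketch of the same raw-to-monthly reduction might look like the following (the file name and column positions are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("housing-summary").getOrCreate()

# Assumed layout: no header, price in the second column, transfer date in the third.
raw = spark.read.csv("pp-complete.csv", header=False, inferSchema=True)
summary = (
    raw.withColumn("date", F.to_date(F.col("_c2")))
       .groupBy(F.year("date").alias("Year"), F.month("date").alias("Month"))
       .agg(F.sum("_c1").alias("Sum"), F.count("*").alias("Count"))
)
summary.coalesce(1).write.csv("housing_mnth_cnt_sum", header=True)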

 

Next step will be to incorporate some Machine Learning into the process.

House Price and Transactions with UK Elections

We are just getting over the not-so-shocking election result in the UK (8th June 2017).

I wanted to look at house prices and how they are affected by election results.

The graphs below plot House Price/ Number of Transactions against date (blue dots). The data is averaged over a month and is normalised to 1.0.

The vertical lines represent UK general elections with blue representing clear results (clear majority) and black lines representing hung Parliament. There is a black line (2nd from right) that represents EU Referendum (‘Brexit’).

The orange dots represent GBP (Sterling) performing against INR (Indian Rupee) and CNY (Chinese Yuan). The data is daily average normalised to 1.0.
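A note on the normalisation: I am taking 'normalised to 1.0' to mean each series is divided by its own maximum, so that the curves peak at 1.0 and can share an axis. A sketch of that step, re-using the monthly 'Sum' and 'Count' columns:

def normalise(series):
    # Scale so the maximum value becomes 1.0 (assumed normalisation).
    return series / series.max()

monthly_avg_price = df_dates["Sum"] / df_dates["Count"]   # average price per month
price_norm = normalise(monthly_avg_price)
count_norm = normalise(df_dates["Count"].astype(float))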

We can see house prices grow aggressively after clear results. The period from 2008 onward is the ‘financial’ crisis era which is further complicated by a hung Parliament in 2010. The actual recovery takes a few years and by 2014 the boom times are back! The growth is further enhanced by a Conservative majority in 2015.

It is too early to see the impact of Brexit on the housing market but as far as GBP goes there has been a fall against all major currencies.

This means investment into the UK housing market is made cheaper for ‘international’ buyers. The growth in house prices is compensated by the fall in the pound (we can see this by the relative falls in the two graphs).

Already the increase in house prices is cooling off (they are falling in many regions where they were over-inflated to begin with). With the messy general election of 2017 increasing uncertainty, especially around Brexit, house prices driven by internal demand should decrease or flatten out. We can already see this starting. People might rush in to lock in a mortgage (thereby boosting short-term demand) as the Bank of England has indicated a rise in interest rates in the near future.

What happens if we look at the number of transactions? The normalised graph above shows that during the financial-crisis era transactions fell sharply, and then began to revive (which correlates with the rise in house prices). The strong position of the Conservatives further supported the market.

But as soon as the Stamp Duty increase came into the picture the number of transactions started reducing and after ‘Brexit’ leading up to the 2017 General Election we can see a sharp fall in transactions.

All of these things indicate that people are not sure about what will happen in the future so are not willing to take positions of risk.

Stamp duty change (1st April 2016)

A final interesting titbit: why is there a massive spike in transactions in an otherwise subdued period of house sales (the red arrow)? And no, this is not an error! The month is March 2016, and the spike is there because stamp duty changes were being introduced from 1st April 2016, which meant buying a second home (without selling the first one) would become a lot more expensive!

[This analysis uses the Land Registry data set which is processed using Apache Spark, Python was used to further process and plot the data]

Raspberry Pi Cluster and Apache Spark!

So over the Christmas holidays I have been busy playing with my 4 x Raspberry Pi 3 (Model B) units which I have assembled into a stack. They each have a 16 GB Memory Card with Raspbian.

Spark Pi Cluster

The Spark Master is running on a NUC (the Spark driver program runs there or I simply use the ‘spark-shell’).

If you want to make your own cluster here is what you will need:

  • Raspberry Pi 3 Model B (I bought 4 of them – just the Pi’s – don’t bother with the ‘Kit’ because you won’t need the individual cases or power supplies).
  • Raspbian on a Memory Card (16 GB will work fine) for each Pi.
  • A stacking plate set (one per Raspberry Pi to mount it) and one pair of ‘end plates’. This acts as a ‘rack’ for your Pi cluster. It also makes sure your Pi boards get enough ventilation and you can place the whole set neatly in a corner instead of having them lying around on the dining table!
  • Multi-device USB power supply (I would suggest Anker 60W PowerPort with 6 USB ports – which can support up to 6 Pi 3’s) so that you end up with one power plug instead of one plug per Pi.
  • To connect the Pi boards to the Internet (and to each other – for the Spark cluster) you will need a multi-port Gigabit switch – I would suggest buying one with at least 8 ports as you will need 1 port per Pi and 1 port to connect to your existing network.
  • A wireless keyboard-trackpad to setup each Pi (just once per Pi).
  • A single HDMI cable to connect with a TV/Monitor (just once per Pi).

Setting up the Pi boards:

Once you have assembled the rack and mounted the boards, install the memory cards on all the boards and connect them to the power supply and the network. Wait for the Pi boards to boot up.

Then one Pi at a time:

  • Connect a keyboard, mouse and monitor – ensure the Pi is working properly then:
    • Set hostname
    • Disable Wireless LAN (as you have Ethernet connectivity, which is more stable)
    • Check SSH works – this will make sure you can remotely work on the Pi

Raspberry Pi Cluster Image

Once all that is done and you can SSH into the Pi boards – time to install Spark:

Again one Pi at a time:

  • SSH into the Pi and use curl -O <spark download url> to download the Spark tar.gz
  • tar -xvf <spark tar.gz file> to extract it to a standard location (I use ‘/spark/’ on all the Pi boards)
  • Make sure correct permissions are assigned to the spark folder
  • Add the master machine hostname to the /etc/hosts file
  • Edit your ~/.bashrc and add the following: export SPARK_HOME=<the standard location for your spark>

Similarly install Spark on a node which you will use as the ‘spark cluster master’ – use the same standard location.

Start up master using the spark ‘start-master.sh’ script. If you go to http://<IP of the Master Node>:8080/ you should see the Spark webpage with the status of the Workers (empty to start with) and various other bits of useful information such as the spark master URL (which we will need for the next step), number of available CPUs and application information. The last item – application information – is particularly useful to track running applications.

SSH into each of the Pi boards and execute the following: ‘start-slave.sh spark://<IP of the Master Node>:7077’ to convert each Pi board into a Spark slave.

Now if you look at the Spark webpage you will see each of the Slave nodes up (give it a couple of minutes) and you will also see the cluster resources available to you. If you have 4 Pi boards as slaves you will see 4 * 4 = 16 Cores and 4 * 1 GB = 4 GB Memory available.

Running Spark applications:

There are two main things to remember when running a Spark application:

  1. All the code that you are running should be available to ALL the nodes in your cluster (including the master)
  2. All the data that you are using should be available to ALL the nodes in your cluster (including the master).

For the code – you can easily package it up in the appropriate format (language dependent – I used Java so I used Maven to build a JAR with dependencies) and network share a folder. This reference can be used when using the spark-submit command (as the location of the application package).

For the data – you have two options – either use a network share as for the code or copy the data to the SAME location on ALL the nodes (including the master). So for example if on the master you create a local copy of the data at ‘/spark/data’ then you must use the SAME location on all the Pi boards! A local copy is definitely required if you are dealing with large data files.

Some tests:

For my test I used a 4 GB data file (text-csv) and a simple Spark program (using ‘spark-shell’) to load the text file and do a line count.
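The test was run in ‘spark-shell’ (Scala); for reference, the PySpark equivalent of the line count is essentially one call (the master URL and data path below are placeholders):

from pyspark.sql import SparkSession

# Placeholder master URL and data path - adjust to your cluster and file.
spark = (SparkSession.builder
         .master("spark://<IP of the Master Node>:7077")
         .appName("line-count")
         .getOrCreate())

# The file must be readable at the SAME location on every node (or on a network share).
line_count = spark.read.text("/spark/data/big_file.csv").count()
print(line_count)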

1: Pi Cluster (4 x Raspberry Pi 3 Model B)

  • Pi with Network shared data file: > 6 minutes (not good at all – this is just a simple line count!)
  • Pi with local copies of the data file: ~ 51 seconds (massive difference by making the data local to the node)

2: Spark standalone running on my laptop (i7 6th Gen., 5600 RPM HDD, SATA3 SSD, 16 GB RAM)

  • Local data file on HDD: > 1 min 30 seconds (worse than a Pi cluster with locally copied data file)
  • Local data file on SSD: ~ 20 seconds (massive difference due to the raw speed of the SSD!)

Conclusion (Breaking the Cluster):

I did manage to kill the cluster! I set up a more complicated data pipeline which does grouping and calculations using the 4 GB data file. It runs within 5 minutes on my laptop (Spark local). The cluster collapsed after processing about 50% of the file. I am not sure if the issue was related to the network (as a bottleneck) or the Pi simply not being able to take the load. The total file size is greater than the total available memory in the cluster (some RAM is required for the local OS as well).

So my Spark cluster is not going to break any records; in fact I would be better off using Spark standalone on my laptop if it is a one-shot job (i.e. process the large data file once and store the results somewhere).

Things would get interesting if we had to do this once every few hours and could automate the ‘local data copy’ step – which should be fairly easy to do. The other option is to create a fast network share (e.g. using SSDs).

What next:

Some nice project which would suit the capabilities of a Pi cluster? Periodic data processing/stream processing task? Node.JS Servers? Please comment and let me know!

Data Analysis: Dengue Disease Prediction

Delhi suffers from an annual dengue epidemic between the months of July and October. It is only the cooler and drier weather at the start of November that stops the mosquitoes that spread this disease.

The year 2015 was a bad year for dengue and all kinds of records were broken. Thankfully due to increased awareness the death toll did not set any records. In fact it was not as high as it could have been (in my view even 1 death is high!).

So I wanted to see whether there is a relation between rainfall and the number of dengue cases, and whether there is any way of predicting the number of dengue cases in 2016.

I used the historic data available from http://nvbdcp.gov.in/den-cd.html and from the MCD (Delhi).

Data

Year, Rainfall (mm), Cases
2006, 618.70, 3340
2007, 601.60, 548
2008, 815.00, 1216
2009, 595.50, 1154
2010, 953.10, 6259
2011, 661.80, 1131
2012, 559.40, 2093
2013, 1531.40, 5574
2014, 778.60, 995
2015, 1123.10, 15836

Rainfall vs Dengue

More rain means more water-logging, which creates more opportunities for mosquitoes to multiply, so we would expect some relationship between rainfall and the number of dengue cases. Given the dramatic growth of Delhi over the last five years, we only go as far back as 2010.

Using the available data for rainfall and dengue cases if we fit a straight line and 2nd degree polynomial we get Diagram 1 below.
Diagram 1: Rainfall vs Dengue Cases.

We see that for a linear model there is a clear trend of a higher number of cases with increasing rainfall. The R-Square value is approximately 0.35, which is not a good fit, but that is expected given the fluctuations.

What is more interesting is the 2nd degree polynomial, which gives an R-Square value of approximately 0.94 – a very good fit, although this could also point to over-fitting.
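For reference, both fits and their R-Square values can be reproduced with numpy's polyfit. The sketch below uses only the 2010-2015 rows, in line with the restriction mentioned above:

import numpy as np

# 2010-2015 rows of the table above (rainfall, dengue cases).
rainfall = np.array([953.10, 661.80, 559.40, 1531.40, 778.60, 1123.10])
cases = np.array([6259, 1131, 2093, 5574, 995, 15836], dtype=float)

def r_squared(x, y, degree):
    # Fit a polynomial of the given degree and return its R-Square value.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("Linear R-Square:    {:.2f}".format(r_squared(rainfall, cases, 1)))
print("Quadratic R-Square: {:.2f}".format(r_squared(rainfall, cases, 2)))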

Another way of interpreting it is that there is a ‘sweet spot’ for dengue to spread rapidly. If the rain is below a certain amount, there is not enough water around for the dengue vector (the mosquito) to breed. If there is too much rain, there is also a lack of ‘still’ water in which mosquitoes can breed.

The ‘sweet spot’ seems to be rain at a certain level that then tapers off, leaving enough ‘stagnant’ water for mosquitoes to breed.

 

Growth of Dengue over the Years

Diagram 2 shows the growth trend of dengue over the years. In 1996 the dengue epidemic broke all records; in 2015 it broke all records once again. If we plot the number of cases over the years we see that the graph is steadily marching upwards.

If all other factors remain constant we should see about 6000 cases in 2016. 


Diagram 2: Dengue growth rate over the years.

This is a very simple analysis of dengue. There are lots of other variables that can be added (for example – growing population, temperature profiles, regional variance). But I wanted to show how even simple analysis can produce some interesting results.

Another important point I wanted to highlight was the lack of easily accessible data on diseases and epidemics. If we had better data then public health initiatives could be better targeted to combat such occurrences.

Quality of Life Reduced Question Set: Bristol Open Data

This visualisation operates upon a reduced set of questions from the Quality of Life indicators. This data has been provided by the Bristol City Council under the open data initiative (https://opendata.bristol.gov.uk/).

Using this view, the reduced question set can be examined across all the wards, showing the average beta value for a particular question across all wards in Bristol.

Click on a question to focus on it and to examine the beta value across all the wards. A count of wards with positive and negative beta values is also shown; these should correspond to the total green/red marks seen.

Then click on a ward to examine the response over time and see the trend line (associated with beta).

Java and Apache Spark were used to generate the CSV data files.

Link: Dashboard

Criteria for beta calculation: minimum three years data should be available.

Reduced Question Set:


Bristol Government: Open Data Initiative

Bristol City Council (BCC) is now publishing some of their data sets online as part of the Open Data initiative.
This is a VERY positive move and I too hope that this leads to the development of ‘new’ solutions to the city’s problems.
More information can be found here: https://opendata.bristol.gov.uk

The Tableau Viz below uses the Quality of Life Indicators data from across Bristol. This is available from the BCC website. The data set has a set of questions (about 540) asked across the different wards in Bristol (about 35) on a yearly basis from 2005 till 2013. Obviously data is not available across all the dimensions; for example, the question “% respondents who travel for shopping by bus” for the Redland ward is available only from 2006-2010.

The raw data from the Open Data website was processed using Apache Spark’s Java Libraries. This was then dumped into a data file which was imported into Tableau.

Link: Dashboard

The heat map below plots the regression slope of the survey results over the years (beta) against the Questions and Wards.
Criteria for beta calculation: minimum three years data should be available.
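The beta here is just the slope of a least-squares line fitted to a question's yearly results for a ward. A sketch, assuming a long-format table with hypothetical 'ward', 'question', 'year' and 'value' columns and applying the minimum-three-years criterion:

import numpy as np
import pandas as pd

def question_betas(responses: pd.DataFrame) -> pd.DataFrame:
    # Regression slope (beta) of each ward/question series over the years.
    rows = []
    for (ward, question), grp in responses.groupby(["ward", "question"]):
        grp = grp.dropna(subset=["value"])
        if grp["year"].nunique() < 3:
            continue  # criteria: minimum three years of data
        slope, _intercept = np.polyfit(grp["year"], grp["value"], 1)
        rows.append({"ward": ward, "question": question, "beta": slope})
    return pd.DataFrame(rows)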