The Advantage of Covid-19

Covid-19 has been wreaking havoc across the globe. This was not entirely unexpected, given that we have not been the best of tenants for Mother Earth.

All the doom and gloom aside, Covid-19 and the mass lockdowns are teaching us a very important lesson about the future of automation and technology.

In a single line:

A secure future requires smart people working on smart devices using smart infrastructure!

Figure 1: Relation between Smart People, Things and Infrastructure.

Figure 1 shows the interactions between Smart People, Things and Infrastructure.

The Covid-19 crisis, which has brought life to a standstill, has exposed the immaturity of our automation. Services from haircutting to garbage collection have been trimmed back, mostly as a proactive step. Whatever automation we do have has helped tremendously (e.g. online grocery shopping), even as people’s behaviour changed overnight when panic set in.

So what is the panic about? What are the basics that we need? The panic is about running out of essentials like food, due to the collapse of supply chains that have been optimised to minimise warehousing costs.

Supply chains (logistics) are heavily dependent on people: farmers growing crops, workers building products and drivers transporting them to the shops (or directly to your home).

This is not the only critical system that breaks down if a large number of people fall ill at the same time.

Healthcare is another area that has been impacted by the lockdown. Care has to be taken to protect vulnerable people, which means minimising contact. This, in turn, increases their vulnerability through isolation.

Education has also been impacted, with schools closed and exams postponed or cancelled. This might not seem like a big problem, but consider the impact on future outcomes.

Another area of concern is the utility networks. Can we truly survive disruptions to our electricity or water networks?

If automation is improved in the above areas then we become more resilient (though not immune) to such events in the future. That is as difficult to achieve as it sounds!

Bottom-up Automation

Before a drone can be piloted remotely for hundreds of miles, or a truck driven under human supervision from a port to a local warehouse, we need robust telecom infrastructure that provides reliable, medium-to-high bandwidth, low-latency, temporary data connections.

This magic network has three basic ingredients:

  1. Programmable networks – devices that can be treated like ‘software’ and provide the same agility > significant progress has already been made in this area.
  2. Network slicing – to efficiently provide the right resources to the requesting service > a lot of work is ongoing in the context of 5G networks.
  3. Closed-loop, light-touch orchestration – to help people look after a complex network and make changes quickly and safely when required (e.g. providing a reliable mobile data link to a drone carrying a shipment of food from a wholesaler to a shop, for the remote-piloting use-case) > significant progress has been made and a lot of work is ongoing.

Using such a network we can build other parts of the puzzle such as smart roads, smart rails and then smart cities. All of these help improve automation and support increasingly light touch automation use-cases.

Smart Things

Once we have the Smart Infrastructure, we need Smart Things to use it.

For logistics, and to maintain a robust supply chain during a pandemic, we need a fleet of autonomous/remotely supervised/remotely piloted vehicles such as heavy-lift drones and self-driving trains/cars/ships/trucks. We also need similar assistance inside warehouses and factories, with robots carrying out operations under human supervision (the so-called Industry 4.0 / lights-out factory use-case).

Healthcare – requires logistics as well as the development of autonomous personal health-monitoring kits that augment the doctor by allowing them to examine a patient virtually. These kits need to become as common as a thermometer and should fulfil multiple functions.

For scenarios involving the care of vulnerable people, semi-autonomous robots are required that can do a lot of the work (e.g. serve dinner).

In case of a lockdown, a teacher should be able to create virtual classrooms with a similar level of interactivity (e.g. via AR/VR) as a real classroom.

To maintain water, electricity and other utilities we need sensors that provide a snapshot of the network as well as actuators, remote inspection and repair platforms etc.

For all of this to be done remotely (e.g. in a lockdown scenario) we need a robust telecoms network. Clearly, without a data connection, people would find it far harder to deal with the economic, mental, physical and emotional shock caused by a lockdown.

Smart People

So who will these people be, who can pilot/supervise a drone carrying a crate of toilet rolls from a warehouse in Bristol to a shop in Bath, all from a remote location? Well-trained people, of course!

This requires two important things:

  1. Second Job: Everyone should be encouraged to take up a second discipline (of their interest) in a semi-professional capacity. This helps increase redundancy in the system. For example, if you are a taxi driver with an interest in radio, maybe your second job can be as a radio maintenance technician.
  2. Thinking beyond data science and AI: Tech is everywhere and AI is not the final word in hi-tech. People should receive everyday technology training and, if possible, advanced training in at least one topic. E.g. everyone should be taught how to operate a computer, but they should also be allowed to choose a topic for deeper study, such as security, software development or IT administration.

Augmentation technologies should be made more accessible, including basic training in Augmented and Virtual Reality systems, so that in case of a lockdown human presence can be projected via a mobile platform such as a drone, or an integrated platform within, say, a forklift or a truck.

Adaptation: This is perhaps the most important requirement. It means not leaving anyone behind in the tech race and ensuring all technologies allow broad access. This will ensure that in times of trouble, technology can be accessed not only by those most able to deal with the issues but also by those who are most vulnerable.

All of the above require the presence of smart things!

Conclusion

Thus we have four themes – Logistics, Healthcare, Education and Utilities – running across three layers: Smart People -> Smart Things -> Smart Infrastructure. That is what Covid-19 has taught us. A very important lesson indeed, so that the next time around (and there WILL be a next time), we are better prepared!

Digitisation of Services and Automation of Labour

Digitisation of services is all around us. Where we used to call for food, taxis, hotels and flights, we now have apps. This ‘app’ based economy has resulted in a large number of highly specialised jobs (e.g. app developers, web designers). It also impacts unskilled or lower-skilled jobs, as gaps in the digitisation are filled in with human labour (e.g. physical delivery of food, someone to drive the taxi).

The other side of digitisation is automation. Where manual steps are digitised, the data-processing steps can still involve human labour (e.g. you fill in a form online, a human processes it, a response letter is generated and a human puts it in an envelope for posting).

In the case of a fully automated and digitised service, processing your data would involve ‘machine labour’ (with different levels of automation [see http://fisheyefocus.com/fisheyeview/?p=863]) and any communication would also be electronic (e.g. email, SMS). One very good example of this is motor insurance: you enter your details via a website or app, risk models calculate the premium on the fly, and once payment is made all insurance documents are emailed to you. The only involvement of human labour is in the processing of claims and the physical validation of documents. This is called an ‘e-insurer’.

Machine Labour

Automation involves replacing or augmenting human labour with machine labour. Machines can work 24×7 and are not paid salaries – thus the cost savings. However, machines need electricity and infrastructure to work, and they cannot self-assemble, self-program or self-maintain (the so-called Judgement Day scenario from the Terminator series). Human labour is still required to develop and maintain an increasingly large number of (complex) automated systems. Human labour is also required to develop and maintain the infrastructure (e.g. power grids, telecom networks, logistics supply chains) that works alongside the automated systems.

So humans earn indirectly from machine labour, but in the end automation and digitisation help companies save large amounts of money by reducing operational costs (salaries, office space rentals etc.). Another side-effect is that certain types of jobs are no longer required as automation and digitisation pick up pace.

Impact on Consumption

Now we know from basic economics that all consumption results in someone earning an income. 

For a company, the income is the difference between the value of what they sell and their total costs (fixed + variable) in making and selling it.

A company will increase digitisation and automation with a view to increasing its total income. This can happen by targeting processes whose automation increases sales or decreases costs. A company will also automate to maintain levels of service so as not to lose customers to the competition, but there will always be some element of income increase involved here as well.

If costs are reduced by digitisation (e.g. less need for a physical ‘front office’) and/or automation (e.g. fewer people for the same level of service), it can lead to a loss or reduction of income as people are made redundant or move to suboptimal roles (e.g. a bank teller working in a supermarket). This also contributes to the ‘gig’ economy, where apps provide more ‘on-demand’ access to labour (e.g. Uber).

People consume either from what they earn (income) or from borrowing (e.g. credit cards and loans). If incomes go down, it can either impact consumption or, in the short term, lead to increased borrowing. The resulting decrease in consumption can impact the same companies that sought an increase in income through automation and digitisation.

To Summarise:

  1. Automation and digitisation lead to cost savings by introducing electronic systems in place of manual processes. 
  2. If fewer people are required to do the same job/maintain a given level of output, then employers are likely to hire fewer new workers and/or reduce the size of the workforce over time. 
  3. This will reduce the income of people who are impacted by redundancies and changes of job role. 
  4. This in turn will reduce the consumption of those people, which may hit the very same companies that are introducing automation and digitisation.
  5. This in turn will further squeeze margins and thereby force further reductions in costs or an increase in consumption from some quarter… 
  6. And we seem to be trapped in a vicious circle!

This Sounds Like Bad News!

So looking at the circular nature of flows in an economy, as described in the previous section, we can predict some sort of impact on consumption when large scale digitisation and automation takes place. 

As an aside, this is a major reason why ‘basic income’ or universal income is a very popular topic around the world (read more: https://en.wikipedia.org/wiki/Basic_income). With basic income we can guarantee everyone a minimum lifestyle and thereby promise a minimum level of consumption.

The actual manifestation of this issue is not as straightforward as our circular reasoning, from the previous section, would indicate. This is because the income of a company depends upon several factors:

  1. External Consumption (exports)
  2. Amount consumed by those whose income increases due to automation and digitisation
  3. Amount consumed by those whose income decreases due to automation and digitisation
  4. Labour costs attributed to those who implement and support automation and digitisation
  5. Labour costs attributed to those who are at risk of being made redundant due to automation and digitisation (a reducing value)
  6. Variable costs (e.g. resource costs)
  7. Fixed costs

Exports can help provide a net boost to income – this external consumption may not be directly impacted by automation and digitisation (A&D). It may even be indirectly boosted, if the A&D activities lead to imports from the same countries.

The two critical factors are (2) and (3): namely how much of the output (or service) is sold to people who benefit from A&D and how much is sold to those who do not benefit from A&D. 

If a company employs a large number of people who can be made redundant via A&D activities, and a large portion of its consumers are those whose incomes will be impacted by A&D, then we have a very tight feedback loop – one that can lead to serious loss of income for the employer, especially if it coincides with an external shock (e.g. an increase in a variable cost like petroleum).

On the other hand, if a company caters to people whose incomes increase with A&D (e.g. software developers), then the impact on its income will be a lot less pronounced and its income may even increase significantly.

What works best is when a company can sell to both and has enough space for both A&D activities and manual labour. This means it can make money from both sides of the market. Good examples of this are companies like Amazon, McDonald’s and Uber, which have human components integrated with A&D that then act as a force multiplier. 

Using this framework we can analyse any given company and figure out how automation will impact it. We can also understand that in the short term A&D can have a positive effect, acting as a force multiplier, opening new avenues of work and creating demand for different skills.

Breaking Point

Real issues can arise if automation is stretched further into complex tasks such as driving, parcel delivery and cooking food, or if digitisation is taken to an extreme (e.g. e-banks with no physical branches). This will have a large-scale impact on incomes, leading to a direct reduction in demand.

One way to force a minimum level of consumption is for the government to levy special taxes and transfer that income as-is to those who need it. This will make sure those who are unskilled or have only basic skills are not left behind. This is a ‘means tested’ version of basic income, similar to a benefits system.

The next step will be to re-skill people to allow them to re-enter the job market or start their own business.

Analytics, Machine Learning, AI and Automation

In the last few years buzzwords such as Machine Learning (ML), Deep Learning (DL), Artificial Intelligence (AI) and Automation have taken over from the excitement of Analytics and Big Data.

Often ML, DL and AI are placed in the same context, especially in product and job descriptions. This not only creates confusion as to the end target, it can also lead to loss of credibility and wasted investment (e.g. in product development).

Figure 1: Framework for Automation

Figure 1 shows a simplified version of the framework for automation. It shows all the required ingredients to automate the handling of a ‘System’. The main components of this framework are:

  1. A system to be observed and controlled (e.g. telecoms network, supply chain, trading platform, deep space probe …)
  2. Some way of getting data (e.g. telemetry, inventory data, market data …) out of the system via some interface (e.g. APIs, service endpoints, USB ports, radio links …) [Interface <1> Figure 1]
  3. A ‘brain’ that can effectively convert input data into some sort of actions or output data. It has one or more ‘models’ (e.g. trained neural networks, decision trees) that contain its ‘understanding’ of the system being controlled. The ‘training’ interface that creates the model(s) and helps maintain them is not shown separately
  4. Some way of getting data/commands back into the system to control it (e.g. control commands, trade transactions, purchase orders, recommendations for next action etc.) [Interface <2> Figure 1]
  5. Supervision capability which allows the ‘creators’ and ‘maintainers’ of the ‘brain’ to evaluate its performance and if required manually tune the system using generated data [Interface <3> Figure 1] – this itself is another Brain (see Recursive Layering)

This is a so-called automated ‘closed-loop’ system with human supervision. In such a system the control can be fully automated, only manual, or any combination of the two for different types of actions. For example, in safety-critical systems the automated closed loop can have cut-out conditions that disable Interface <2> in Figure 1. This means all control passes to the human user (via Interface <4> in Figure 1).
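Such a closed loop with a safety cut-out can be sketched in a few lines of Python. This is a toy illustration only: the readings, thresholds, rules and action names are all invented, not taken from any real system.

```python
class Brain:
    """Toy rule-based 'brain': maps an observed reading to a control action.
    The thresholds and action names here are invented for illustration."""
    def decide(self, reading):
        if reading > 80:
            return "scale_down"
        if reading < 20:
            return "scale_up"
        return "no_op"

def control_loop(readings, cutout_threshold=95):
    """Closed loop with a safety cut-out: when a reading crosses the
    cut-out threshold, the automated control path (Interface <2>) is
    disabled and control passes to the human user (Interface <4>)."""
    brain = Brain()
    log = []
    for reading in readings:                          # Interface <1>: data in
        if reading >= cutout_threshold:
            log.append((reading, "MANUAL_CONTROL"))   # hand over via <4>
        else:
            log.append((reading, brain.decide(reading)))  # act via <2>
    return log

actions = control_loop([10, 50, 96])
# -> [(10, 'scale_up'), (50, 'no_op'), (96, 'MANUAL_CONTROL')]
```

The key design point is that the cut-out check sits outside the brain, so a misbehaving model can never bypass it.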

A Note about the Brain

The big fluffy cloud in the middle called the ‘Brain’ hides a lot of complexity – not just in terms of algorithms and infrastructure, but even in terms of talking about the differences between things like ML, DL and AI.

There are two useful concepts for putting all these different buzzwords in context when it comes to the ‘Brain’ of the system. In other words, the next time some clever person tells you that there is a ‘brain’ in their software/hardware that learns, ask them two questions:

  1. How old is the brain?
  2. How dense is the brain?

Age of the Brain

Age is a very important criterion for most tasks. Games that preschool children struggle with are ‘child’s play’ for teenagers. Voting and driving are reserved for ‘adults’. In the same way, for an automated system the age of the brain says a lot about how ‘smart’ it is.

At its simplest, a ‘brain’ can contain a set of unchanging rules that are applied to the observed data again and again [so-called static rule-based systems]. This is similar to a newborn baby that has fairly well-defined behaviours (e.g. hungry -> cry). This sort of brain is pretty helpless if the data has large variability. It will not be able to generate insights about the system being observed, and the rules can quickly become error-prone (thus the age-old question: ‘why does my baby cry all the time!’).
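A static rule-based ‘brain’ of this kind is easy to sketch. The conditions and actions below are invented purely for illustration; the point is that the rules never change, no matter what data arrives.

```python
# A 'new-born' brain: a fixed set of rules applied to observed data.
# The conditions and action names below are invented for illustration.
RULES = [
    (lambda obs: obs["temp_c"] > 30, "turn_on_cooling"),
    (lambda obs: obs["temp_c"] < 15, "turn_on_heating"),
]

def static_brain(observation):
    """Apply the unchanging rules in order; fall through to a default.
    The rules never adapt, so data outside their assumptions simply
    produces the (possibly unhelpful) default action."""
    for condition, action in RULES:
        if condition(observation):
            return action
    return "do_nothing"

static_brain({"temp_c": 35})   # -> 'turn_on_cooling'
static_brain({"temp_c": 20})   # -> 'do_nothing'
```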

Next comes the brain of a toddler, which can think and learn, but only in straight lines and only after extensive training and explanations (unless you are a very ‘lucky’ parent and your toddler is great at solving ‘problems’!). This is similar to a ‘machine learning system’ that is specialised to handle specific tasks. Give it a task it has not been trained for and it falls apart.

Next comes the brain of a pre-teen, which is maturing and learning all kinds of things with or without extensive training and explanations. ‘Deep learning systems’ have similar properties. For example, a Convolutional Neural Network (CNN) can extract features (such as edges) out of a raw image without requiring any pre-processing, and can be used on different types of images (generalisation).

At its most complex (e.g. a healthy adult), the ‘brain’ is able not only to learn new rules but, more importantly, to evaluate existing rules for their usefulness. Furthermore, it is capable of chaining rules and applying often-unrelated rules to different situations. Processing different types of input data is also relatively easy (e.g. facial expressions, tone and gestures, alongside other data). This is what you should expect from ‘artificial intelligence’. In fact, with a true AI Brain you should not need Interface <4>, and perhaps only a very limited Interface <3> (almost a psychiatrist/psycho-analyst to a brain).

Brain Density

Brain density increases with age, then plateaus and eventually starts to decrease. From a processing perspective, it is as if the CPU in your phone or laptop kept adding processors, becoming capable of ever more complex tasks.

Static rule-based systems may not require massive computational power. Here more processing power may be required at Interfaces <1> and <2>, to prepare the data for input and output.

Machine-learning algorithms definitely benefit from massive computational power, especially when the ‘brain’ is being trained. Once the model is trained, however, applying it may not require much computing power. Again, more power may be required to massage the data to fit the model parameters than to actually use the model.

Deep-learning algorithms require computational power throughout the cycle of prep, train and use. The training and use times are massively reduced when using special-purpose hardware (e.g. GPUs for neural networks). One rule of thumb: ‘if it doesn’t need special-purpose hardware then it’s probably not a real deep-learning brain; it may simply be a machine-learning algorithm pretending to be a deep-learning brain’. CPUs are mostly good for the data-prep tasks before and after the ‘brain’ has done its work.

Analytics System

If we were to have only Interfaces <1> and <3> (see Figure 1), we can call it an analytics solution. This type of system has no ability to influence the system it observes; it is merely an observer. This is very popular, especially on the business-support side. Here Interface <4> may not always be something tangible (such as a REST API or a command console); it might represent strategic and tactical decisions. The ‘Analytics’ block in this case consists of data visualisation and user interface components.

True Automation

To enable true automation we must close the loop (i.e. Interface <2> must exist). But there is something I have not shown in Figure 1 that is important for true automation: the ability to process event-based data. This is very important for systems that are time-dependent – real-time or near-real-time – such as trading systems, network orchestrators etc. This is shown in Figure 2.

Figure 2: Automation and different types of data flows

Note: Events are generated not only by the System being controlled but also by the ‘Brain’. Therefore, the ‘Brain’ must be capable of handling both time-dependent and time-independent data, and it should also be able to generate commands that are time-dependent as well as time-independent.
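One way to picture a ‘Brain’ serving both paths is a dispatcher that drains urgent events before turning to batch data. This is a sketch only: the event names, record contents and emitted actions are invented.

```python
from queue import Empty, Queue

class Brain:
    """A toy 'Brain' with separate paths for time-dependent events and
    time-independent (batch) data; the actions it emits are invented."""
    def handle_event(self, event):
        # Time-dependent path: must be acted on promptly (e.g. an alarm).
        return f"urgent_action_for:{event}"

    def handle_batch(self, records):
        # Time-independent path: processed at leisure (e.g. daily telemetry).
        return f"report_over:{len(records)}_records"

def run_cycle(brain, events, batch):
    """Drain all pending events first, then process the batch data."""
    actions = []
    while True:
        try:
            actions.append(brain.handle_event(events.get_nowait()))
        except Empty:
            break
    actions.append(brain.handle_batch(batch))
    return actions

events = Queue()
events.put("link_down")                      # an event from the System
actions = run_cycle(Brain(), events, [1, 2, 3])
# -> ['urgent_action_for:link_down', 'report_over:3_records']
```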

Recursive Layers

Recursive Layering is a powerful concept whereby an architecture allows its implementations to be layered on top of each other. This is possible with ML, DL and AI components. The System in Figures 1 and 2 can itself be another combination of a Brain and a controlled System, with the various outputs fed into yet another Brain (a super-brain? a supervisor brain?). An example is shown in Figure 3. This is a classic Analytics-over-ML example, where the ‘Analytics’ block from Figures 1 and 2 has a Brain inside it (it is not restricted to visualisation and UI). It may be a simple newborn brain (e.g. static SQL data-processing queries) or a sophisticated deep learning system.
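The layering idea can be sketched minimally: a brain plus its controlled system becomes, as a unit, the ‘system’ observed by a supervisor brain. The rules and labels below are invented for illustration.

```python
class Brain:
    """A 'brain' that maps observations to actions via a model function."""
    def __init__(self, model):
        self.model = model

    def act(self, observation):
        return self.model(observation)

# Layer 1: a simple brain controlling the raw system (rule invented here).
inner = Brain(lambda reading: "high" if reading > 10 else "low")

# Layer 2: a supervisor brain whose observed 'system' is the inner
# brain's output stream, not the raw system itself.
supervisor = Brain(lambda action: f"audit:{action}")

# Recursive layering: inner brain + system become the supervised system.
result = supervisor.act(inner.act(42))   # -> 'audit:high'
```

Because both layers share the same `Brain` shape, nothing stops a third layer from supervising the supervisor in turn.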

Figure 3: Recursive layering in ML, DL and AI systems.

The Analytics feed is another API point that can be an input data source (Interface <1>) to another ‘Brain’ that is say supervising the one that is generating the analytics data.

Conclusion

So the next time you get a project that involves automation (implementing or using it), think about the interfaces and components shown in Figure 1. Think about what type of brain you need (age and density).

If you are on the product side, then make sure the claims you make are bold but not illogical or blatantly false. Just as you would not ask a toddler to do a teenager’s job, don’t advertise one as the other.

Finally, think hard about how users will be included in the automation loop. What conditions will disable Interface <2> in Figure 1 and cut over to manual control? How can users monitor the ‘Brain’? Fully automated, closed-loop systems are not good for anyone (just ask John Connor from the Terminator series, or the people at Knight Capital: https://en.wikipedia.org/wiki/Knight_Capital_Group). Humans often provide deeper insights, based on practical experience and knowledge, than ML or DL is capable of.

Reduce Food Wastage using Machine Learning

A scenario readers might be familiar with: food items hiding in the refrigerator way past their expiry date. Once discovered, these are quickly transferred to the bin with promises to ourselves that next time it will be different, or, worse yet, we stuff the items in the freezer!

Estimates of waste range from 20% to 50% (in countries like the USA). This is a great shame, given that hundreds of millions of people around the world don’t have any form of food security and face acute shortages of food.

What can we do about this? 

One solution is to help people be a bit more organised by reminding them of the expiry dates of various items. The registration of items has to be automated and smart. 

Automated:

If we insist on manual entry of items with their expiry dates, people are unlikely to want to do this, especially right after a long shop! Instead, as items are checked out at the shop, an option should be available to email the receipt, which should also contain an electronic record of the expiry date of each purchased item. This should cover all groceries as well as ‘ready to eat’ meals. Alternatively, one could provide different integration options using open APIs and some sort of mobile app.

Smarter:

Once we have the expiry dates, we need to ensure we provide the correct support and advice to users of the app. To make it more user-friendly, we should suggest recipes based on the purchased groceries and put those on the calendar, creating a ‘burn-down’ chart for the groceries (taking inspiration from Agile) that optimises for things like freshness, minimising the use of ‘packaged foods’ and maintaining variety in the recipes.

Setup:

Steps are as follows:

  1. When buying groceries, the expiry and nutrition information are loaded into the system
  2. Using a matrix of expiry-to-items and items-to-recipes (for raw groceries), we get an optimised ordering of usage dates mapped to recipes
  3. With the item-consumption/recipe schedule, we can then interleave ready-to-eat items, take-away days and calendar entries related to dinner/lunch meetings (all of these act as constraints)
  4. Add a feedback loop allowing users to report which recipes they cooked, which they didn’t, which items were wasted, and where ‘unscheduled’ ready-to-eat items were used or a take-away was called for
  5. This will help encourage users to buy the items they consume and warn against buying (or deprioritise?) items that users ‘ignore’ 

I provide a dummy implementation in Python using Pandas to sketch out some of the points and to bring out some tricky problems.
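That implementation is not reproduced here, but the core expiry-driven scoring idea can be sketched in a few lines of Pandas. Everything below is hypothetical: the pantry contents, days-to-expiry values, the two recipes and the scoring formula are all invented, and this sketch will not reproduce the exact scores listed below.

```python
import pandas as pd

# Hypothetical pantry: days left until each item expires (values invented).
pantry = pd.Series({"chicken": 2, "cream": 3, "tomato": 4,
                    "potato": 10, "fish": 1, "onion": 14})

# Simplified item-to-recipe matrix (only two recipes, for illustration).
recipes = {
    "butter_chicken": ["chicken", "cream", "tomato", "onion"],
    "fish_n_chips": ["fish", "potato"],
}

def score(ingredients, pantry):
    """Score a recipe: higher when it uses more items that expire sooner."""
    if not all(item in pantry.index for item in ingredients):
        return None  # not enough ingredients
    # Items closer to expiry contribute more (inverse of days left).
    return sum(1.0 / pantry[item] for item in ingredients)

recommendations = {name: score(ingredients, pantry)
                   for name, ingredients in recipes.items()}
# 'butter_chicken' scores highest here: it uses more soon-to-expire items.
```

A real version would also weigh delay in usage, variety and the constraints from the setup steps above.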

The output is a list of purchased items and a list of available recipes followed by a list of recommendations with a ‘score’ metric that maximises ingredient use and minimises delay in usage.

Item: 0:cabbage
Item: 1:courgette
Item: 2:potato
Item: 3:meat_mince
Item: 4:lemon
Item: 5:chicken
Item: 6:fish
Item: 7:onion
Item: 8:carrot
Item: 9:cream
Item: 10:tomato


Recipe: 0:butter_chicken
Recipe: 1:chicken_in_white_sauce
Recipe: 2:mince_pie
Recipe: 3:fish_n_chips
Recipe: 4:veg_pasta
Recipe: 5:chicken_noodles
Recipe: 6:veg_soup

Recommendations

butter_chicken:         Score: 30    Percentage items consumed: 36%
chicken_in_white_sauce: Score: 26    Percentage items consumed: 27%
Not enough ingredients for mince_pie
fish_n_chips:           Score: 20    Percentage items consumed: 27%
veg_pasta:              Score: 26    Percentage items consumed: 27%
chicken_noodles:        Score: 28    Percentage items consumed: 36%
veg_soup:               Score: 20    Percentage items consumed: 27%

The recommendation is to start with ‘butter chicken’, as it uses up some items that have a short shelf life. Here is a ‘real’ recipe, as a thank-you for reading this post: 

http://maunikagowardhan.co.uk/cook-in-a-curry/butter-chicken-murgh-makhani-chicken-cooked-in-a-spiced-tomato-gravy/h

Tricky Problems:

There are some tricky bits that can be solved but will need some serious thinking:

  1. Updating recommendations as recipes are cooked
  2. Updating recommendations as unscheduled things happen (e.g. an item going bad early, or the order of recipes being cooked changing)
  3. Keeping track of cooked items and other interleaved schedules (e.g. an item being frozen for later use)
  4. Learning from usage without requiring the user to update all entries (e.g. using RFID? Deep learning from images taken of your fridge with the door open?)
  5. Coming up with innovative metrics to encourage people to eat healthily and eat fresh – lots of information (e.g. nutrition information) can be extracted if we have a list of purchased items
  6. Scheduling recipes around other events in a calendar or routine items (e.g. avoiding a heavy meal before a scheduled gym appointment)

Housing Market: Auto Correlation Analysis

In this post we take a look at housing market data, which consists of all transactions registered with the UK Land Registry since 1996. So let’s get the copyright out of the way:

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

The data-set from HM Land Registry has information about all registered property transactions in England and Wales. The data-set used for this post has all transactions till the end of October 2018. 

To keep things simple, and to focus on the price-paid and number-of-transactions metrics, I have removed most of the columns from the data-set and aggregated (summed) by month and year of the transaction. This gives us roughly 280 observations with the following data:

{ month, year, total price paid, total number of transactions }

Since this is a simple time-series, it is relatively easy to process. Figure 1 shows this series in a graph. Note the periodic nature of the graph.

Figure 1: Total Price Paid aggregated (sum) over a month; time on X axis (month/year) and Total Price Paid on Y axis.

The first thing one can try is auto-correlation analysis, to answer the question: given the data available (till end-October 2018), how similar have the last N months been to other periods in the series? Once we identify the periods of high similarity, we should get a good idea of the current market state.

To predict future market state we can use time-series forecasting methods which I will keep for a different post.

Auto-correlation

Auto-correlation is the correlation (Pearson correlation coefficient) of a given sample (A) from a time series against other samples (B) drawn from the same series. Both samples are of the same size. 

The correlation value lies between 1 and -1. A value of 1 means perfect correlation between two samples that are directly proportional (when A increases, B also increases). A value of 0 implies no correlation, and a value of -1 implies the two samples are inversely proportional (when A increases, B decreases).
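The two extremes are easy to demonstrate with NumPy’s `corrcoef` on a small made-up sample:

```python
import numpy as np

# A tiny made-up sample to illustrate the extremes of the coefficient.
a = np.array([1.0, 2.0, 3.0, 4.0])

perfect = np.corrcoef(a, 2 * a)[0, 1]    # directly proportional -> 1.0
inverse = np.corrcoef(a, -a + 10)[0, 1]  # inversely proportional -> -1.0
```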

The simplest way to explain this is with an example. Assume:

  1. monthly data is available from Jan. 1996 to Oct. 2018
  2. we choose a sample size of 12 (months)
  3. the sample to be compared is the last 12 months (Oct. 2018 – Nov. 2017)
  4. value to be correlated is the Total Price Paid (summed by month).

As the sample size is fixed (12 months) we start generating samples from the series:

Sample to be compared: [Oct. 2018 – Nov. 2017]

Sample 1: [Oct. 2018 – Nov. 2017], this should give correlation value of 1 as both the samples are identical.

Sample 2: [Sep. 2018 – Oct. 2017], the correlation value should start to decrease as we skip back one month.

Sample N: [Dec. 1996 – Jan. 1996], this is the earliest period we can correlate against.
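The sliding-window procedure above can be sketched with Pandas and NumPy. The synthetic sine series below (period of 12 months, mimicking the seasonality in the real data) merely stands in for the actual Total Price Paid series, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for Total Price Paid: a sine wave
# with a 12-month period, covering Jan. 1996 to Oct. 2018.
months = pd.date_range("1996-01", "2018-10", freq="MS")
series = pd.Series(np.sin(np.arange(len(months)) * 2 * np.pi / 12),
                   index=months)

window = 12
target = series.iloc[-window:].to_numpy()   # the sample to be compared

# Slide back one month at a time and correlate each window with the target.
corr = [np.corrcoef(target, series.iloc[i:i + window].to_numpy())[0, 1]
        for i in range(len(series) - window, -1, -1)]

# corr[0] compares the target with itself (value 1); six months back the
# windows are perfectly out of phase, giving a value near -1.
```

With the real data the fluctuations are noisier, but the same 1 to -1 seasonal swing appears.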

Now we present two graphs for different sample sizes:

  1. correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year) – to show yearly spread
  2. correlation coefficient visualised going back in time – to show periods of high correlation

A thing to note in all the graphs is that the starting value (right-most) is always 1. That is where we compare the selected sample (the last 12 months) against itself.

In the ‘back in time’ graph we can see seasonal fluctuations in the correlation, between 1 and -1. This tells us that Total Price Paid has a seasonal aspect to it. That makes sense, as we see more houses for sale in the summer months than in winter: most people prefer to move when the weather is nice!

Fig 2: Example of In and Out of Phase correlation.

So if we correlate two 12-month periods one year apart (e.g. Oct. 2018 – Nov. 2017 and Oct. 2017 – Nov. 2016) we should get positive correlation, as the variation of Total Price Paid should have the same shape. This is ‘in phase’ correlation. It can be seen in Figure 2 as the ‘first’ correlation, which is in phase (in fact it is perfectly in phase, the values being identical – thus the correlation value of 1).

Similarly, if the comparison is made ‘out of phase’ (e.g. Oct. 2018 – Nov. 2017 and Jul. 2018 – Aug. 2017), where the variations are opposite, then negative correlation will be seen. This is the ‘second’ correlation in Figure 2.

This is exactly what we can see in the figures below. Sample sizes are 6, 12, 18 and 24 months, with two figures for each sample size. The first figure is the spread of the auto-correlation coefficient for a given year. The second figure is the time series plot of the auto-correlation coefficient, where we move back in time and correlate against the last N months. The correlation values fluctuate between 1 and -1 in a periodic manner.


Fig. 3a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year), Sample size: 6 months

Fig. 3b: Correlation coefficient visualised going back in time; Sample size: 6 months


Fig. 4a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 12 months

Fig. 4b: Correlation coefficient visualised going back in time; Sample size: 12 months


Fig. 5a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 18 months

Fig. 5b: Correlation coefficient visualised going back in time; Sample size: 18 months


Fig. 6a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 24 months

Fig. 6b: Correlation coefficient visualised going back in time; Sample size: 24 months

Conclusions

Firstly, if we compare the scatter + box plot figures, especially for 12 months (Figure 4a), we find the correlation coefficients are spread around ‘0’ for most of the years. One period where this is not so and the correlation spread is consistently above ‘0’ is the year 2008, the year that marked the start of the financial crisis. The spread is also ‘tight’ which means all the months of that year saw consistent correlation, for the Total Price Paid, against the last 12 months from October 2018.

The second conclusion we can draw, from the positive correlation between the last 12 months and the period of the financial crisis (Figure 4b), is that the variations in Total Price Paid are similar (weakly correlated) to those at the time of the financial crisis. This obviously does not guarantee that a new crisis is upon us, but it does mean that the market is slowing down. This is a reasonable conclusion given the double whammy of impending Brexit and the onset of the winter/holiday season (which traditionally marks a ‘slow’ time of the year for property transactions).

The code, once again in Python, is attached below:

from matplotlib import pyplot as plt
from pandas import DataFrame as df
from datetime import datetime as dt
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.mixture import GaussianMixture

months = MonthLocator(range(1, 13), bymonthday=1, interval=3)
year_loc = YearLocator()

window_size = 12

def is_crisis(year):
    if year < 2008:
        return 0
    elif year > 2012:
        return 2
    return 1

def is_crisis_start(year):
    if year < 2008:
        return False
    elif year > 2008:
        return False
    return True

def process_timeline(do_plot=False):
    col = "Count"
    y = []
    x = []
    x_d = []
    box_d = []
    year_d = []
    year = 0
    years_pos = []
    crisis_corr = []
    for i in range(0, size - window_size):
        try:
            if year != df_dates["Year"][size - 1 - i]:
                if year > 0:
                    box_d.append(year_d)
                    years_pos.append(year)
                    year_d = []
                year = df_dates["Year"][size - 1 - i]

            corr = np.corrcoef(df_dates[col][size - i - window_size: size - i].values,
                               current[col].values)
            year_d.append(corr[0, 1])
            y.append(corr[0, 1])
            if is_crisis_start(year):
                crisis_corr.append(corr[0, 1])
            x.append(year)
            month = df_dates["Month"][size - 1 - i]
            x_d.append(dt(year, month, 15))

        except Exception as e:
            print(e)

    box_d.append(year_d)
    years_pos.append(year)

    corr_np = np.array(crisis_corr)
    corr_mean = corr_np.mean()
    corr_std = corr_np.std()

    print("Crisis year correlation: mean and std.: {} / {}".format(corr_mean, corr_std))
    if do_plot:
        fig, sp = plt.subplots()
        sp.scatter(x, y)
        sp.boxplot(box_d, positions=years_pos)
        plt.show()

        fig, ax = plt.subplots()
        ax.plot(x_d, y, '-o')
        ax.grid(True)
        ax.xaxis.set_major_locator(year_loc)
        ax.xaxis.set_minor_locator(months)
        plt.show()

    return corr_mean, corr_std

csv = "c:\\ML Stats\\housing_oct_18_no_partial_mnth_cnt_sum.csv"
full_csv = "c:\\ML Stats\\housing_oct_18.csv_mnth_cnt_sum.csv"

df = pd.read_csv(full_csv)

mnth = {
    1: "Jan",
    2: "Feb",
    3: "Mar",
    4: "Apr",
    5: "May",
    6: "Jun",
    7: "Jul",
    8: "Aug",
    9: "Sep",
    10: "Oct",
    11: "Nov",
    12: "Dec"
}

dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

crisis = list(map(lambda r: is_crisis(int(r[1]["Year"])), df.iterrows()))

df_dates = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year, "Month": df.Month, "Crisis": crisis})

df_dates = df_dates.sort_values(["Date"])

df_dates = df_dates.set_index("Date")

plt.plot(df_dates["Sum"], '-o')
plt.ylim(ymin=0)
plt.show()

size = len(df_dates["Count"])

corr_mean_arr = []
corr_std_arr = []
corr_rat = []
idx = []
for i in range(0, size - window_size):
    end = size - i
    current = df_dates[end - window_size:end]
    print("Length of current: {}, window size: {}".format(len(current), window_size))

    ret = process_timeline(do_plot=True)
    break  # Exit early




House Market Analysis

House prices in the UK are at it again. A combination of Brexit, changes in housing stock, easy loans and growing consumer debt is making things interesting again.

Figure 1: Number of Transactions per month from 1995 to August 2018

Figure 1 shows the number of transactions every month since 1995: the massive fall post-2007 because of the financial crisis, then the surge in transactions since 2013. The lonely spot (top-right, March 2016) is just before the new Stamp Duty changes made buying a second house an expensive proposition. But this is relatively boring!

Visual Analytics: Relation between Quantity and Value of Transactions

Let us look at Transaction Count (quantity) and Total Value of those transactions, aggregated on a monthly basis. I used a Spark cluster to aggregate the full transaction set (4GB csv data file). The base data set has about 280 rows with the following structure:

{month, year, sum, count}

The month and year values are converted into dates and added to the row, then the data set is sorted by date:

{date, month, year, sum, count}

This leads us to three plots. Sum and Count against time and Sum against Count. These are shown below:

Figure 2: Total Transaction value by date, grouped by year (each dot represents a month in that year)

Figure 2 shows Total Transaction value by date (Y-axis). The plot is grouped by year, where each dot represents a month in that year. The current year (2018) has complete monthly data only up to August, hence fewer dots.

Figure 3: Total Quantity of Transactions  by date, grouped by year (each dot represents a month in that year)

Figure 3 shows Total Quantity of Transactions (Y-axis), once again grouped by year. Similar to Figure 2 the data is complete till August 2018.

Figure 4: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis)

Figure 4 shows how the value of the transactions relates to the number of transactions. Each dot represents a month in a year. As expected there is a slight positive correlation between the total value of transactions and the number of transactions. A point to note: the total value of transactions depends on the sale prices (which depend on the properties sold) as well as the number of transactions in a given month. For the same number of transactions the value could be high or low (year on year) depending on whether prices are inflationary or a higher number of good quality houses are part of that month's transactions.

Figure 5: Total Transaction value (Y-axis) against Total number of transaction (X-axis), each point represents a particular month in a year

Figure 5 enhances Figure 4 by using a colour gradient to show the year of the observation. Each year should have 12 points associated with it (except 2018). This concept is further extended by using different shapes for the markers depending on whether the observation was made before the financial crisis (circle: year of observation before 2008), during the financial crisis (square: year of observation between 2008 and 2012) or after the crisis (plus: year of observation after 2012). These year values have been picked using Figures 2 and 3.

Figure 6: Showing the housing market contract during the Crisis and then expand

Figure 6 shows the effect of the financial crisis nicely. The circles represent pre-crisis transactions. The squares represent transactions during the crisis. The plus symbol represents post-crisis transactions. 

The rapid decrease in transactions can be seen as the market contracted in 2007-2008. As the number of transactions and the value of transactions start falling, the relative fall in the number of transactions is larger than that in the total value of the transactions. This indicates that prices did fall, but mostly that not enough houses were being sold. Given the difficulty in getting a mortgage, this reduction in the number of transactions could be caused by a lack of demand.

Discovering Data Clusters

Using a three class split (pre-crisis, crisis, post-crisis) provides some interesting results. These were described in the previous section. But what happens if a clustering algorithm is used on the data?

A clustering algorithm attempts to assign each observation to a cluster. Depending on the algorithm, the total number of clusters may be required as an input. Clustering is often helpful when trying to build initial models of the input data, especially when no labels are available. In that case the cluster id (represented by the cluster centre) becomes the label. The following clustering algorithms were evaluated:

  1. k-means clustering
  2. Gaussian mixture model

The data-set for the clustering algorithm has three columns: Date, Monthly Transaction Sum and Monthly Transaction Count.

Given the claw-mark distribution of the data it was highly unlikely that k-means would give good results. That is exactly what we see in Figure 7, with a cluster size of 3 (given we previously had three labels: before, during and after the crisis). The clustering seems to cut across the claws.

Figure 7: k-mean clustering with cluster size of 3 – total value of transactions (Y-axis) vs total number of transactions

If a Gaussian mixture model (GMM) is used with a component count of 3 and covariance type ‘full’ (using the sklearn implementation – see code below), some nice clusters emerge, as seen in Figure 8.

Figure 8: Gaussian Mixture model with three components.

Each of the components corresponds to a ‘band’ in the observations. The lowest band corresponds loosely with pre-crisis market, the middle (yellow) band somewhat expands the crisis market to include entries from before the crisis. Finally, the top-most band (green) corresponds nicely with the post-crisis market.

But what other number of components could we choose? Should we try other GMM covariance types (sklearn supports ‘spherical’, ‘tied’, ‘diag’ and ‘full’)? To answer these questions we can run a Bayesian Information Criterion (BIC) test for different numbers of components and different covariance types. The method and component count that give the lowest BIC are preferred.

The result is shown in Figure 9.

Figure 9: BIC analysis of the data – BIC score against number of components (X-axis)

From Figure 9 it seems the ‘full’ type consistently gives the lowest BIC on the data-set. Furthermore, going from 3 to 4 components improves the BIC score (the lower the better). Another such jump is from 7 to 8 components. Therefore the number of components should be 4 (see Figure 10) or 8 (see Figure 11).
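The BIC sweep can also pick the winning combination automatically. A sketch of that selection step on a synthetic stand-in data set (two well-separated blobs; the real input is the monthly {Count, Sum, Year} table):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: two well-separated 2D blobs
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 1.0, (150, 2)),
                  rng.normal(8.0, 1.0, (150, 2))])

# Sweep component counts and covariance types; keep the combination
# with the lowest BIC (min on the tuple compares the BIC value first)
best = min(
    (GaussianMixture(n_components=k, covariance_type=cov, random_state=0)
         .fit(data).bic(data), k, cov)
    for k in range(1, 7)
    for cov in ("spherical", "tied", "diag", "full")
)
best_bic, best_k, best_cov = best
print(best_k, best_cov)
```

On this toy data the sweep lands on 2 components, matching the two blobs; on the housing data the same loop reproduces the Figure 9 comparison.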

Figure 10: Transaction value (Y-axis) against  Total Number of Transactions – with 4 components.

Figure 11: Transaction value (Y-axis) against  Total Number of Transactions – with 8 components.

The 4-component result (Figure 10), when compared with Figure 5, indicates an expansion at the start of the data-set (year: 1995); this is the jump from yellow to green. Then during the crisis there is a contraction (green to purple). Post-crisis there is another expansion (purple to blue). This is shown in Figure 12.

Figure 12: Expansion and contraction in the housing market

The 8-component result (Figure 11), when compared with Figure 5, shows the stratification of the data-set based on the Year value. Within the different colours one can see multiple phases of expansion and contraction.

The interesting thing is that for both 4 and 8 component models, the crisis era cluster is fairly well defined.

Code for this is given below:

from matplotlib import pyplot as plt
from pandas import DataFrame as df
from datetime import datetime as dt
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.mixture import GaussianMixture

csv = "c:\\ML Stats\\housing_sep_18_no_partial_mnth_cnt_sum.csv"

df = pd.read_csv(csv)

dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

df_pure = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year})

df_pure = df_pure.sort_values(["Date"])

df_pure = df_pure.set_index("Date")

bics = {}
for cmp in range(1, 10):
    clust_sph = GaussianMixture(n_components=cmp, covariance_type='spherical').fit(df_pure)
    clust_tied = GaussianMixture(n_components=cmp, covariance_type='tied').fit(df_pure)
    clust_diag = GaussianMixture(n_components=cmp, covariance_type='diag').fit(df_pure)
    clust_full = GaussianMixture(n_components=cmp, covariance_type='full').fit(df_pure)

    clusts = [clust_full, clust_diag, clust_sph, clust_tied]
    bics[cmp] = []
    for c in clusts:
        bics[cmp].append(c.bic(df_pure))

plt.plot(list(bics.keys()), list(bics.values()))
plt.legend(["full", "diag", "sph", "tied"])
plt.show()

num_components = 4

clust = GaussianMixture(n_components=num_components, covariance_type='full').fit(df_pure)

lbls = clust.predict(df_pure)

df_clus = pd.DataFrame({"Count": df_pure.Count, "Sum": df_pure.Sum, "Year": df_pure.Year, "Cluster": lbls})
color = df_clus["Cluster"]

fig, ax = plt.subplots()
ax.scatter(df_clus["Count"], df_clus["Sum"], c=color)

fig, ax2 = plt.subplots()
ax2.scatter(df_clus["Year"], df_clus["Count"], c=color)

fig, ax3 = plt.subplots()
ax3.scatter(df_clus["Year"], df_clus["Sum"], c=color)

plt.show()

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

Recurrent Neural Networks to Predict Pricing Trends in UK Housing Market

Recurrent Neural Networks (RNN):

RNNs are used when temporal relationships have to be learnt. Some common examples include time series data (e.g. stock prices), sequence of words (e.g. predictive text) and so on.

The basic concept of RNNs is that we train an additional set of weights (along with the standard input – output pair) that associate past state (time: t-1) with the current state (time: t). This can then be used to predict the future state (time: t+1) given the current state (time: t). In other words RNNs are NNs with state!
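In equation form this is h(t) = f(x(t)·W_x + h(t-1)·W_h + b): the new state mixes the current input with the previous state. A minimal numpy sketch of one such cell (toy sizes and random weights, not the trained network from this post):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 1, 4                      # toy sizes, not the post's parameters

W_x = rng.normal(size=(n_inputs, n_neurons))    # input -> hidden weights
W_h = rng.normal(size=(n_neurons, n_neurons))   # hidden -> hidden ("state") weights
b = np.zeros(n_neurons)

def rnn_step(x_t, h_prev):
    # The new state depends on the current input AND the previous state
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

h = np.zeros(n_neurons)                         # initial state
for x_t in [0.1, 0.2, 0.3]:                     # a tiny scalar input sequence
    h = rnn_step(np.array([x_t]), h)
```

After the loop, `h` carries information about the whole sequence seen so far, which is what lets the network learn temporal relationships.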

When applied to standard time series prediction, the input and output values are taken from the same time series (usually a scalar value). This is a degenerate case of single-valued inputs and outputs. Thus we need to learn the relationship between x(t-1) and x(t) so that we can predict the value of x(t+1) given x(t). This is what I did for this post.

Time series can be made more complicated by making the input a vector of different parameters, the output may still remain a scalar value which is a component of x or be a vector. One reason this is done is to add all the factors that may impact the value to be predicted (e.g. x(t+1)). In our example of average house prices – we may want to add factors such as time of the year, interest rates, salary levels, inflation etc. to provide some more “independent” variables in the input.

Two final points:

  • Use-cases for RNNs: Speech to Text, Predictive Text, Music Tagging, Machine Translation
  • RNNs include the additional complexity of training in Time as well as Space therefore our standard Back-Propagation becomes Back-Propagation Through Time

RNN Structure for Predicting House Prices:

RNN simple time series

The basic time series problem is that we have a sequence of numbers – the average price of houses for a given month and year (e.g. given: X(1), X(2), … X(t-1), X(t) ) with a regular step size and our task is to predict the next number in the sequence (i.e. predict: X(t+1)). In our problem the avg price is calculated for every month since January 1995 (thus step size is 1 month). As a first step we need to define a fixed sequence size that we are going to use for training the RNN. For the input data we will select a sub-sequence of a given length equal to the number of inputs (in the diagram above there are three inputs). For training output we will select a sub-sequence of the same length as the input but the values will be shifted one step in the future.

Thus if the input sub-sequence is X(3), X(4) and X(5), then the output sub-sequence must be X(4), X(5) and X(6). In general, if the input sub-sequence spans time steps a to b (where b > a and b − a + 1 is the sub-sequence length), then the output sub-sequence must span a+1 to b+1.
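The shifted input/output pairing can be sketched as follows (a hypothetical helper with made-up data, not the post's training code):

```python
import numpy as np

def make_pairs(series, n_steps):
    """Slice every (input, output) pair where the output is the input shifted one step ahead."""
    X, y = [], []
    for a in range(len(series) - n_steps):
        X.append(series[a:a + n_steps])          # spans a .. a+n_steps-1
        y.append(series[a + 1:a + n_steps + 1])  # spans a+1 .. a+n_steps
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)              # stand-in for the monthly avg. price series
X, y = make_pairs(series, n_steps=3)
print(X[0], y[0])                                # the output is the input shifted by one step
```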

Once the training has been completed, if we provide the last sub-sequence as input we will get the next number in the series as the output. We can see how well the RNN is able to replicate the signal by starting with a sub-sequence in the middle, moving ahead one time step at a time, and plotting actual vs predicted values for the next number in the sequence.

Remember to NORMALISE the data!
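The post does not say which normalisation scheme it used; min-max scaling is one common choice, sketched here with made-up prices:

```python
import numpy as np

prices = np.array([55000.0, 61000.0, 72000.0, 69000.0])  # made-up average prices

lo, hi = prices.min(), prices.max()
scaled = (prices - lo) / (hi - lo)       # all values now in [0, 1]

# Invert the scaling to map network predictions back into price space
restored = scaled * (hi - lo) + lo
```

Keeping `lo` and `hi` around is important: forecasts come out of the network in the scaled space and must be mapped back.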

The parameters are as below:

n_steps = 36 # Number of time steps (thus a = 0 and b = 35, total of 36 months)

n_inputs = 1 # Number of inputs per step (the avg. price for the current month)

n_neurons = 1000 # Number of neurons in the middle layer

n_outputs = 1 # Number of outputs per step (the avg. price for the next month)

learning_rate = 0.0001 # Learning Rate

n_iter = 2000 # Number of iterations

batch_size = 50 # Batch size

I am using TensorFlow’s BasicRNNCell (complete code at the end of the post) but the basic setup is:

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.OutputProjectionWrapper(tf.contrib.rnn.BasicRNNCell(num_units = n_neurons, activation = tf.nn.relu), output_size=n_outputs)

outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32)

loss = tf.reduce_mean(tf.square(outputs-y))
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
training = opt.minimize(loss)

saver = tf.train.Saver()

init = tf.global_variables_initializer()

Results:

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-4, we get the following Error values:

  1. 8.6831e-05
  2. 9.05436e-05
  3. 9.86998e-05

Run 3 fitting and predictions are shown below:

Orange dots represent the prediction by the RNN and Blue dots represent the actual data


Run 3 prediction against existing data 3 years before October 2017

Then we start from October 2017 (Month 24 in the figure below) and forecast ahead to October 2018. This predicts a rise in average prices which starts to plateau in the 3rd quarter of 2018. Given that average house prices across a country like the UK are determined by a large number of noisy factors, we should take this prediction with a pinch of salt.

Run 3 Forecasting from Month 24 (October 2017 for the year ahead till October 2018)

For a sample of 3 runs, using a Mean Squared Error threshold of 1e-3, we get the following Error values:

  1. 3.4365e-04
  2. 4.1512e-04
  3. 2.1874e-04

With a higher Error threshold we find, when comparing against actual data (Runs 2 and 3 below), that the predicted values have a lot less overlap with the actual values. This is expected, as we have traded accuracy for a reduction in training time.

predicted avg price vs actual avg price (Run 2)

predicted avg price vs actual avg price (Run 3)

The projections in this case are a lot different. We see a linearly decreasing avg. price in 2018.

predicted avg price vs actual avg price with forecast

Next Steps:

I would like to add more parameters to the input – but it is difficult to get correlated data for different things such as interest rates, inflation etc.

I would also like to try other types of networks (e.g. LSTM) but I am not sure if that would be the equivalent of using a cannon to kill a mosquito.

Finally if anyone has any ideas on this I would be happy to collaborate with them on this!


Source code can be found here: housing_tf

Contains HM Land Registry data © Crown copyright and database right 2017. This data is licensed under the Open Government Licence v3.0.

Currency Data, Efficient Markets and Influx DB

This post is about processing currency data which I have been collecting since the end of 2014. The data is collected once every hour from Monday 12am till Friday 11pm.

The data-set itself is not large as the frequency of collection is low, but it does cover lots of interesting world events such as Nigerian currency devaluation, Brexit, Trump Presidency, BJP Government in India, EU financial crisis, Demonetisation in India etc.

The image below shows the percentage change histogram for three common currencies (GBP – British Pound, USD – US Dollar and INR – Indian Rupee). The value for Percentage Change (X-axis) is between -4% and 2%.

Percentage Change histogram


What is immediately clear is the so-called ‘fat tail’ configuration. The data is highly skewed and shows clear features of ‘power law’ statistics. In other words, the percentage change is related to frequency by an inverse power law: larger changes (up or down) are rarer than small changes but not impossible (compared with other distributions such as the Normal Distribution).

The discontinuity around Percentage Change = 0% is intentional. We do not want very small changes to be included as these would ‘drown out’ medium and large changes.
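That exclusion of tiny moves amounts to a simple threshold filter applied before histogramming (the threshold value and data here are my own illustration, not the post's):

```python
import numpy as np

# Made-up hourly percentage changes
pct_change = np.array([-3.2, -0.004, 0.8, 0.002, -1.1, 0.05, 1.9, -0.01])

threshold = 0.02                                  # hypothetical cut-off, in percent
kept = pct_change[np.abs(pct_change) >= threshold]
print(kept)                                       # only medium and large moves survive
```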

Mean Currency Movement


We can use the R code snippet below to repeatedly draw samples of size 100 (with replacement) from the movement data (combined across all currencies) and calculate each sample mean. The sample means can be plotted on a histogram, which should give us the familiar Normal Distribution [this is the ‘Central Limit Theorem’ in action]. The sample mean that is most common is 0% – not an unexpected result given the presence of both positive and negative change percentages.

mean_curr_movement <- replicate(1000, {
	mean(sample(data$Percent.Change, 100, replace = TRUE))
})

Compare this with a Normal distribution where, as we move away from the mean, the probability of occurrence reduces super-exponentially, making large changes almost impossible (a super-exponential quantity falls off a lot faster than a square or a cube).

Equilibrium Theory (or so called Efficient Market Hypothesis) would have us believe that the market can be modelled using a Bell Curve (Normal Distribution) where things might deviate from the ‘mean’ but rarely by a large amount and in the end it always converges back to the ‘equilibrium’ condition. Unfortunately with the reality of power-law we cannot sleep so soundly because a different definition of rare is applicable there.

Incidentally earthquakes follow a similar power law with respect to magnitude. This means that while powerful quakes are less frequent than milder ones they are still far from non-existent.

Another magical quality of such systems is that fluctuations and stability often come in clusters. The image below shows the percentage movement over the full two years (approx.). We see a relative period of calm (green area) bracketed by periods of high volatility (red areas).

Movement Over Time


The above graph shows that there are no ‘equilibrium’ states within the price. The invisible hand has not magically appeared to calm things down and reduce any gaps between demand and supply to allow the price of the currency to re-adjust. Otherwise we would have found that the larger the change, the larger the damping force resisting it – thereby making sudden large changes impossible.

For the curious:

All the raw currency data is collected in an InfluxDB instance and then pulled out and processed using custom window functions I wrote in Java. The processed data is then dumped into a CSV (about 6000 rows) to be processed in R.

We will explore this data-set a bit more in future posts! This was to get you interested in the topic. There are large amounts of time series data sets available out there that you can start to analyse in the same way.

All the best!

Using Scala Spark and K-Means on Geo Data

The code (Scala+Maven) can be found here: https://github.com/amachwe/Scala-Machine-Learning

The idea is simple… I found an open Geo data (points) set provided by Microsoft (~24 million points). The data is NOT uniformly distributed across the world; in fact it is highly skewed, with large concentrations of location data around China (Beijing specifically) and the US (West Coast).

The data can be found here: https://www.microsoft.com/en-us/download/details.aspx?id=52367

As per the description:

This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). Last published: August 9, 2012.


Loading the Data:

The data set is fairly simple: it contains longitude, latitude, altitude and date-time information. All the details are available with the data set (being Microsoft, they have complicated matters by creating a very complex folder structure – but my GeoTrailsLoader object makes easy work of traversing and loading the data into Mongo), ready for you to play around with.

The data is loaded as Points (WGS 84) and indexed using a 2dsphere index. Once the data is in Mongo you can easily test its ‘geographic’ nature by running a geo-query:

{
  $near: {
     $geometry: {
        type: "Point" ,
        coordinates: [ <longitude> , <latitude> ]
     }
  }
}


More Query types here: https://docs.mongodb.com/v3.2/applications/geospatial-indexes/

Clustering the Data:

The ScalaWorker does the K-Means training on the geo-data within Mongo using Spark and the Mongo-Spark connector.

We use a local Spark instance (standalone) but you can just as easily use a Spark cluster if you are lucky enough to have access to multiple machines. Just provide the IP Address and Port of your Spark master instead of ‘local[*]’ in the ‘setMaster’ call.

In the example the data is loaded from Mongo into RDDs and then we initiate K-Means clustering on it with a cluster count of 2000. We use Spark ML Lib for this. Only the longitude and latitude are used for clustering (so we have a simple 2D clustering problem).

The clustering operation takes between 2 and 3 hrs on an i7 (6th Gen), 16GB RAM, 7200RPM HDD.

One way of making this work on a ‘lighter’ machine is to limit the amount of data used for K-Means. With a small data set (say 1 million points) the operation on my machine takes just 10-15 mins.
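On a single machine the same idea can be prototyped with scikit-learn's MiniBatchKMeans on a subsample (a sketch with synthetic stand-in points, not the repo's Scala/Spark code):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(1)
# Stand-in for (longitude, latitude) points: two dense blobs mimicking the skewed geo data
points = np.vstack([
    rng.normal([116.4, 39.9], 0.5, (5000, 2)),    # roughly around Beijing
    rng.normal([-122.3, 47.6], 0.5, (5000, 2)),   # roughly around the US West Coast
])

# Limit the amount of data used for clustering, as suggested above
sample = points[rng.choice(len(points), 2000, replace=False)]
km = MiniBatchKMeans(n_clusters=20, n_init=3, random_state=0).fit(sample)
print(km.cluster_centers_.shape)                  # 20 (longitude, latitude) centres
```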

Feel free to play around with the code!

The Results:

The simple 2D cluster centres obtained as a result of the K-Means clustering are nothing but longitudes and latitudes. They represent ‘centre points’ of all the locations present in the data set.

We should expect the centres to be around high concentrations of location data.

Furthermore a high concentration of location data implies a ‘popular’ location.

As these cluster centres are nothing but longitudes and latitudes, let us plot them on the world map to see the popular centres of location data contained within the data set.

Geocluster data (cluster centres) with city names


The image above is a ‘zoomed’ plot of the cluster centres (blue dots). I chose an area with relatively fewer cluster centres to make sure we do not get influenced by the highly skewed data set.

I have provided a sample 2000 cluster centre file here: https://github.com/amachwe/Scala-Machine-Learning/blob/master/cluster_centre_example/clusters_2000.csv

The red text is the ‘popular area’ these cluster centres represent. So without knowing anything about the major cities of Eurasia we have managed to locate many of them (Paris, Madrid, Rome, Moscow etc.) just by clustering location data!

We could have obtained a lot of this ‘label’ information automatically by using a reverse geo-coding service (or geo-decoding service), where we pass the cluster centre and obtain meta-data about that location. For example, the cluster centre 41.8963978, 12.4818856 (latitude first for the geo-decoding service – in the CSV file it is longitude first: 12.4818856, 41.8963978) is the following location in Rome:

Piazza Venezia

Wikipedia describes Piazza Venezia as the ‘central hub’ of Rome.

The geo-decoding service I used (with the sample cluster centre) is: http://noc.to/geodecode#41.8963978,12.4818856

Enjoy!


Artificial Neural Networks: Training for Deep Learning – IIb

  1. Artificial Neural Networks: An Introduction
  2. Artificial Neural Networks: Problems with Multiple Hidden Layers
  3. Artificial Neural Networks: Introduction to Deep Learning
  4. Artificial Neural Networks: Restricted Boltzmann Machines
  5. Artificial Neural Networks: Training for Deep Learning – I
  6. Artificial Neural Networks: Training for Deep Learning – IIa

This post, like the rest of the series, provides a pathway into deep learning by introducing some of the concepts using common reference points. It is not designed to be an exhaustive research review of deep learning techniques. I have also tried to keep the description neutral of any programming language, though the backing code is written in Java.

So far we have visited shallow neural networks and their building blocks (post 1), investigated their performance on difficult problems and explored their limitations (post 2). Then we jumped into the world of deep networks and described the concept behind them (post 3) and the RBM building block (post 4). Then we started discussing a possible local (greedy) training method for such deep networks (post 5). In the previous post we started talking about the global training and also about the two possible ‘modes’ of operation (discriminative and generative).

Having made the difference between the two modes clear, we can now talk a bit more about how the global training works.

As you might have guessed, the two operating modes need two different approaches to global training. The differences in flow and in required outputs also mean there are structural differences between the two modes.

The image below shows a standard discriminative network where flow of propagation is from input to the output layer. In such networks the standard back-propagation algorithm can be used to do the learning closer to the output layers. More about this in a bit.

Discriminative Arrangement

The image below shows a generative network where the flow is from the hidden layers to the visible layers. The target is to generate an (input, label) pair, so the network needs to learn to associate the labels with inputs. The final hidden layer is usually a lot larger as it needs to learn the joint probability of the label and the input. One of the algorithms used for global training of such networks is called the ‘wake-sleep’ algorithm. We will briefly discuss this next.

Generative Arrangement

Wake-Sleep Algorithm:

The basic idea behind the wake-sleep algorithm is that we have two sets of weights between each layer – one to propagate in the Input => Hidden direction (the so-called discriminative, or recognition, weights) and the other to propagate in the reverse direction (Hidden => Input – the so-called generative weights). Propagation and training always run in opposite directions.

The central assumption behind wake-sleep is that hidden units are independent of each other – which holds true for Restricted Boltzmann Machines as there are no intra-layer connections between hidden units.

Then the algorithm proceeds in two phases:

  1. Wake Phase: Drive the system using input data from the training set and the discriminative weights (Input => Hidden). We learn (tune) the generative weights (Hidden => Input) – thus we are trying to learn how to recreate the inputs by tuning the generative weights
  2. Sleep Phase: Drive the system using a random data vector at the top most hidden layer and the generative weights (Hidden => Input). We learn (tune) the discriminative weights (Input => Hidden) – thus we are trying to learn how to recreate the hidden states by tuning the discriminative weights
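The two phases above can be sketched in code. What follows is a heavily simplified, single layer-pair sketch with binary stochastic units; the class and method names are my own illustrative choices, and this is not Hinton's full algorithm nor code from the repository:

```java
import java.util.Random;

// Simplified sketch of wake-sleep for ONE visible/hidden layer pair with
// binary stochastic units. R holds the recognition (Input => Hidden) weights,
// G the generative (Hidden => Input) weights - two separate matrices, each
// trained while the network is driven in the opposite direction.
public class WakeSleepSketch {
    final int nVisible, nHidden;
    final double[][] R; // recognition weights [hidden][visible]
    final double[][] G; // generative weights  [visible][hidden]
    final Random rnd = new Random(42);

    public WakeSleepSketch(int nVisible, int nHidden) {
        this.nVisible = nVisible;
        this.nHidden = nHidden;
        R = new double[nHidden][nVisible];
        G = new double[nVisible][nHidden];
        for (int j = 0; j < nHidden; j++)
            for (int i = 0; i < nVisible; i++) {
                R[j][i] = 0.1 * rnd.nextGaussian(); // small random init
                G[i][j] = 0.1 * rnd.nextGaussian();
            }
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Sample a binary hidden state given a visible vector (uses R).
    double[] sampleHidden(double[] v) {
        double[] h = new double[nHidden];
        for (int j = 0; j < nHidden; j++) {
            double z = 0;
            for (int i = 0; i < nVisible; i++) z += R[j][i] * v[i];
            h[j] = rnd.nextDouble() < sigmoid(z) ? 1 : 0;
        }
        return h;
    }

    // Reconstruction probabilities for the visible layer given a hidden state (uses G).
    double[] reconstructVisible(double[] h) {
        double[] p = new double[nVisible];
        for (int i = 0; i < nVisible; i++) {
            double z = 0;
            for (int j = 0; j < nHidden; j++) z += G[i][j] * h[j];
            p[i] = sigmoid(z);
        }
        return p;
    }

    // Wake phase: drive with a data vector, tune the GENERATIVE weights so
    // the hidden state gets better at recreating the input.
    public void wakeStep(double[] data, double lr) {
        double[] h = sampleHidden(data);
        double[] recon = reconstructVisible(h);
        for (int i = 0; i < nVisible; i++)
            for (int j = 0; j < nHidden; j++)
                G[i][j] += lr * (data[i] - recon[i]) * h[j];
    }

    // Sleep phase: drive with a random hidden 'fantasy', tune the RECOGNITION
    // weights so they get better at recreating that hidden state from the
    // generated input.
    public void sleepStep(double lr) {
        double[] fantasyH = new double[nHidden];
        for (int j = 0; j < nHidden; j++) fantasyH[j] = rnd.nextBoolean() ? 1 : 0;
        double[] fantasyV = reconstructVisible(fantasyH);
        for (int j = 0; j < nHidden; j++) {
            double z = 0;
            for (int i = 0; i < nVisible; i++) z += R[j][i] * fantasyV[i];
            double p = sigmoid(z);
            for (int i = 0; i < nVisible; i++)
                R[j][i] += lr * (fantasyH[j] - p) * fantasyV[i];
        }
    }

    public static void main(String[] args) {
        WakeSleepSketch ws = new WakeSleepSketch(4, 3);
        double[] data = {1, 0, 1, 0};
        for (int t = 0; t < 500; t++) { ws.wakeStep(data, 0.1); ws.sleepStep(0.1); }
        System.out.println(java.util.Arrays.toString(ws.reconstructVisible(ws.sampleHidden(data))));
    }
}
```

Note how each phase only ever touches the weight set it is *not* using to drive the network – that is the essence of the opposite-direction training described above.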

As our primary target is to understand how deep learning networks can be used to classify data, we are not going to get into the details of wake-sleep.

There are some excellent papers on wake-sleep by Hinton et al. that you can read to further your knowledge. I would suggest you start with this one and the references contained in it.

Back-propagation:

You might be wondering why we are talking about back-prop (BP) again when we listed all those problems it has with deep networks. Won’t we be affected by issues such as vanishing gradients and getting trapped in sub-optimal local minima?

The trick here is that we do the pre-training before BP which ensures that we are tuning all the layers (in a local – greedy way) and giving BP a head start by not using randomly initialised weights. Once we start BP we don’t care if the layers closer to the input layer do not change their weights that much because we have already ‘pointed’ them in a sensible direction.

What we do care about is that the features closer to the output layer get associated with the right label and we know BP for those outer layers will work.

The issue of sub-optimal local minima is addressed by the pre-training and the stochastic nature of the networks. This means there is no hard convergence early on and the network can ‘jump’ its way out of a sub-optimal local minimum (though with decreasing probability as the training proceeds).

Classification Example – MNIST:

The easiest way to go about this is to use ‘shallow’ back-propagation, where we put a layer of logistic units on top of the existing deep network of hidden units (i.e. the Output Layer in the discriminative arrangement) and train only this top layer. The number of logistic units equals the number of classes in the classification task if we use one-hot encoding to encode the classes.
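To make the ‘shallow’ back-propagation idea concrete, here is a minimal sketch of a logistic output layer trained on top of fixed features, using the per-unit delta rule (the cross-entropy gradient for a sigmoid unit). The class and method names are my own, not the repository’s API, and the features stand in for the output of the pre-trained hidden layers:

```java
// Sketch of 'shallow' back-propagation: only the logistic output layer is
// trained; the deep feature vector below it is treated as a fixed input.
// One logistic unit per class, one-hot targets.
public class ShallowTopLayer {
    final int numFeatures, numClasses;
    final double[][] w;   // weights [class][feature]
    final double[] bias;

    public ShallowTopLayer(int numFeatures, int numClasses) {
        this.numFeatures = numFeatures;
        this.numClasses = numClasses;
        this.w = new double[numClasses][numFeatures]; // zero init
        this.bias = new double[numClasses];
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Activations of the output layer for a given (fixed) feature vector.
    public double[] forward(double[] features) {
        double[] out = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            double z = bias[c];
            for (int f = 0; f < numFeatures; f++) z += w[c][f] * features[f];
            out[c] = sigmoid(z);
        }
        return out;
    }

    // One gradient step on a single (features, one-hot target) pair:
    // delta = target - output is the cross-entropy gradient for a sigmoid unit.
    public void train(double[] features, double[] target, double lr) {
        double[] out = forward(features);
        for (int c = 0; c < numClasses; c++) {
            double delta = target[c] - out[c];
            for (int f = 0; f < numFeatures; f++) w[c][f] += lr * delta * features[f];
            bias[c] += lr * delta;
        }
    }

    public static void main(String[] args) {
        // Toy problem: 2 'feature detectors', 2 classes.
        ShallowTopLayer top = new ShallowTopLayer(2, 2);
        double[][] feats = {{1, 0}, {0, 1}};
        double[][] labels = {{1, 0}, {0, 1}};
        for (int epoch = 0; epoch < 2000; epoch++)
            for (int i = 0; i < feats.length; i++)
                top.train(feats[i], labels[i], 0.5);
        double[] out = top.forward(feats[0]);
        System.out.println(out[0] > out[1]); // prints true
    }
}
```

Because the features are frozen, this stage is just logistic regression on top of the pre-trained stack – which is what makes it so cheap to train.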

An example is provided on my github, the test file is: rd.neuron.neuron.test.TestRBMMNISTRecipeClassifier

This may not give record-breaking accuracy but it is a good way of testing discriminative deep networks. It also takes less time to train as we split the training into two stages and only ever train one layer at a time:

  1. Greedy training of the hidden layers
  2. Back-prop training of the output layer

The other advantage this arrangement has is that it is easy to reason about. In stage 1 we train the feature extractors and in stage 2 we train the feature – class associations.

One example network for MNIST is:

Input Image > 784 > 484 > 484 > 484 > 10 > Output Class

This has 3 RBM-based Hidden Layers with 484 neurons per layer and a 10-unit wide Logistic Output Layer (we could also use a SoftMax layer). The Hidden Layers are trained using CD-10 (contrastive divergence with 10 steps of Gibbs sampling) and the Output Layer is trained using back-propagation.

To evaluate we do peak matching – the index of the highest value at the output layer must match the one-hot encoded label index. So if the label vector is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] then the index value for the peak is 3 (we use index starting at 0). If in the output layer the 4th neuron has the highest activation value out of the 10 then we can say it detected the right digit.
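The peak matching described above boils down to comparing arg-max indexes. A minimal sketch (the names are illustrative, not the repository’s code):

```java
// Minimal sketch of peak matching: a prediction counts as correct when the
// index of the largest output activation equals the index of the 1 in the
// one-hot encoded label.
public class PeakMatch {

    // Index of the largest value in the vector (first one wins on ties).
    public static int argMax(double[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) {
            if (v[i] > v[best]) best = i;
        }
        return best;
    }

    // True if the network output peaks at the same index as the one-hot label.
    public static boolean isMatch(double[] oneHotLabel, double[] output) {
        return argMax(oneHotLabel) == argMax(output);
    }

    public static void main(String[] args) {
        double[] label = {0, 0, 0, 1, 0, 0, 0, 0, 0, 0}; // digit '3', peak at index 3
        double[] output = {0.01, 0.02, 0.05, 0.80, 0.03, 0.02, 0.03, 0.01, 0.02, 0.01};
        System.out.println(isMatch(label, output)); // prints true
    }
}
```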

Using such a method we can easily get an accuracy upwards of 95%. While this is not a phenomenal result (state-of-the-art full-network back-prop gives > 99% accuracy on MNIST), it does prove the concept of a discriminative deep network.

The trained model that results is: network.discrm.25.nw and can be found on my github here. The model is simply a list of network layers (LayerIf).

The model can be loaded using:

List<LayerIf> network = StochasticNetwork.load(fileName);

You can then use the Propagate class to ‘predict’ the label.

 

The PatternBuilder class can be used to measure the performance in two ways:

  1. Match Score: Matches the peak index of the one-hot encoded label vector from the test data with the generated label vector. It is a successful match (100%) if the peaks in the two vectors have the same indexes. This does not tell us much about the ‘quality’ of the assigned label because our ‘peak’ value could be just slightly bigger than the other values (more of a speed breaker on the road than a peak!) as long as it is strictly the ‘largest’ value. For example this would be a successful match:
    1. Test Data Label: [0, 0, 1, 0] => Actual Label: [0.10, 0.09, 0.11, 0.10] as the peak indexes are the same ( = 2 for zero indexed vector)
    2. and this would be an unsuccessful one: Test Data Label: [0, 0, 1, 0] => Actual Label: [0.10, 0.09, 0.10, 0.11] as the peak indexes are not the same
  2. Score: Also includes the quality aspect by measuring how close the Test Data and Actual Label values are to each other. This measure of closeness is controlled by a threshold which can be set by the user and incorporates ALL the values in the vector. For example if the threshold is set to 0.1 then:
    1. Test Data Label: [0, 0, 1, 0] => Actual Label: [0.09, 0.09, 0.12, 0.11] the score will be 2 out of 4 (or 50%): | 0 – 0.11 | = 0.11 and | 1 – 0.12 | = 0.88 are both greater than the threshold of 0.1, so those two indexes score 0, while the other two values are within the threshold and score +1 each. In this case the Match Score would have given a score of 100%.
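The threshold-based Score can be sketched as follows; this is a minimal sketch of the idea with illustrative names, not the actual PatternBuilder code:

```java
// Sketch of the threshold-based Score: every element of the actual output
// that lies within `threshold` of the corresponding label element scores +1.
// The final score is the fraction of elements within the threshold.
public class ThresholdScore {

    public static double score(double[] label, double[] actual, double threshold) {
        int within = 0;
        for (int i = 0; i < label.length; i++) {
            if (Math.abs(label[i] - actual[i]) <= threshold) within++;
        }
        return (double) within / label.length;
    }

    public static void main(String[] args) {
        double[] label = {0, 0, 1, 0};
        double[] actual = {0.09, 0.09, 0.12, 0.11};
        // Indexes 0 and 1 are within 0.1 of the label; indexes 2 and 3 are not.
        System.out.println(score(label, actual, 0.1)); // prints 0.5
    }
}
```

Unlike the Match Score, this measure penalises a ‘speed breaker’ peak – an output can match on peak index and still score poorly here.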

 

Next Steps:

So far we have just taken a short stroll at the edge of the Deep Learning forest. We have not really looked at different types of deep learning configurations (such as convolutional networks, recurrent networks and hybrid networks) nor have we looked at other computational models of the brain (such as integrate-and-fire models).

One more thing that we have not discussed so far is how we can incorporate the independent nature of neurons. If you think about it, the neurons in our brains are not arranged neatly in layers with a repeating pattern of inter-layer connections. Neither are they synchronized like in our ANN examples where all the neurons in a layer were guaranteed to process input and decide their output state at the SAME time. What if we were to add a time element to this? What would happen if certain neurons changed state even as we are examining the output? In other words what would happen if the network state also became a function of time (along with the inputs, weights and biases)?

In my future posts I will move to a proper framework (most probably DL4J – Deep Learning for Java – or TensorFlow) and show how different types of networks work. I could spend time implementing each type of network myself but with a host of high-quality deep learning libraries available, I believe one should not try and ‘reinvent the wheel’.

If you have found these blog posts useful or have found any mistakes please do comment! My human neural network (i.e. the brain!) is always being trained!