Analytics, Machine Learning, AI and Automation

In the last few years buzzwords such as Machine Learning (ML), Deep Learning (DL), Artificial Intelligence (AI) and Automation have taken over from the excitement of Analytics and Big Data.

Often ML, DL and AI are placed in the same context, especially in product and job descriptions. This not only creates confusion about the end goal, it can also lead to loss of credibility and wasted investment (e.g. in product development).

Figure 1: Framework for Automation

Figure 1 shows a simplified version of the framework for automation. It shows all the required ingredients to automate the handling of a ‘System’. The main components of this framework are:

  1. A system to be observed and controlled (e.g. telecoms network, supply chain, trading platform, deep space probe …)
  2. Some way of getting data (e.g. telemetry, inventory data, market data …) out of the system via some interface (e.g. APIs, service endpoints, USB ports, radio links …) [Interface <1> Figure 1]
  3. A ‘brain’ that can effectively convert input data into actions or output data. The brain contains one or more ‘models’ (e.g. trained neural networks, decision trees etc.) that encode its ‘understanding’ of the system being controlled. The ‘training’ interface that creates the model(s) and helps maintain them is not shown separately
  4. Some way of getting data/commands back into the system to control it (e.g. control commands, trade transactions, purchase orders, recommendations for next action etc.) [Interface <2> Figure 1]
  5. Supervision capability which allows the ‘creators’ and ‘maintainers’ of the ‘brain’ to evaluate its performance and, if required, manually tune the system using generated data [Interface <3> Figure 1] – this is itself another Brain (see Recursive Layering)

This is a so-called automated ‘closed-loop’ system with human supervision. In such a system the control can be fully automated, fully manual, or any combination of the two for different types of actions. For example, in safety-critical systems the automated closed loop can have cut-out conditions that disable Interface <2> in Figure 1. This means all control passes to the human user (via Interface <4> in Figure 1).

A Note about the Brain

The big fluffy cloud in the middle called the ‘Brain’ hides a lot of complexity – not so much in terms of the algorithms and infrastructure, but in terms of even talking about the differences between things like ML, DL and AI.

There are two useful concepts for putting all these buzzwords in context when it comes to the ‘Brain’ of the system. In other words, next time some clever person tells you that there is a ‘brain’ in their software/hardware that learns, ask them two questions:

  1. How old is the brain?
  2. How dense is the brain?

Age of the Brain

Age is a very important criterion in most tasks. Games that preschool children struggle with are ‘child’s play’ for teenagers. Voting and driving are reserved for ‘adults’. In the same way, for an automated system the age of the brain says a lot about how ‘smart’ it is.

At its simplest a ‘brain’ can contain a set of unchanging rules that are applied to the observed data again and again [so-called static rule-based systems]. This is similar to a newborn baby that has fairly well defined behaviours (e.g. hungry -> cry). This sort of brain is pretty helpless if the data has large variability. It will not be able to generate insights about the system being observed and the rules can quickly become error prone (thus the age-old question – ‘why does my baby cry all the time!’). A minimal sketch of such a system is shown below.
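As a toy illustration (the rules and telemetry fields here are invented, not from any real product), such a brain is little more than a fixed lookup from observations to actions:

# A static rule-based 'brain': a fixed set of rules applied to every observation.
# The telemetry fields and thresholds are hypothetical.
RULES = [
    (lambda obs: obs["cpu_load"] > 0.9, "scale_out"),
    (lambda obs: obs["error_rate"] > 0.05, "raise_alarm"),
]

def decide(observation):
    # Return the actions fired by the (unchanging) rule set.
    return [action for condition, action in RULES if condition(observation)]

print(decide({"cpu_load": 0.95, "error_rate": 0.01}))  # ['scale_out']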

Next comes the brain of a toddler, which can think and learn but only in straight lines, and that too after extensive training and explanations (unless you are a very ‘lucky’ parent and your toddler is great at solving ‘problems’!). This is similar to a ‘machine learning system’ that is specialised to handle specific tasks. Give it a task it has not been trained for and it falls apart.

Next comes the brain of a pre-teen, which is maturing and learning all kinds of things with or without extensive training and explanations. ‘Deep learning systems’ have similar properties. For example a Convolutional Neural Network (CNN) can extract features (such as edges) out of a raw image without requiring any kind of pre-processing, and can be used on different types of images (generalisation). A minimal sketch follows.
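As a hedged sketch of that point (PyTorch is used purely for illustration; the post names no framework), a convolutional layer maps a raw image tensor straight to feature maps with no hand-crafted pre-processing:

# Minimal illustration: a convolutional layer produces feature maps
# (e.g. edge-like responses) directly from raw pixels.
import torch
import torch.nn as nn

image = torch.randn(1, 1, 28, 28)   # a batch of one 28x28 grayscale image
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
features = conv(image)              # 8 feature maps, learned during training
print(features.shape)               # torch.Size([1, 8, 26, 26])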

At its most complex (e.g. a healthy adult), the ‘brain’ is able not only to learn new rules but, more importantly, to evaluate existing rules for their usefulness. Furthermore, it is capable of chaining rules and applying often unrelated rules to different situations. Processing different types of input data is also relatively easy (e.g. facial expressions, tone, gestures, alongside other data). This is what you should expect from ‘artificial intelligence’. In fact with a true AI Brain you should not need Interface <4>, and perhaps only a very limited Interface <3> (almost a psychiatrist/psycho-analyst to a brain).

Brain Density

Brain density increases with age, then plateaus and eventually starts to decrease. From a processing perspective it’s as if the CPU in your phone or laptop kept adding processor cores, becoming capable of more complex tasks.

Static rule-based systems may not require massive computational power. Here more processing power may be required at Interfaces <1>/<2>, to prepare the data for input and output.

Machine-learning algorithms definitely benefit from massive computational power, especially when the ‘brain’ is being trained. Once the model is trained, however, applying it may not require much computing power. Again, more power may be required to massage the data to fit the model parameters than to actually use the model.

Deep-learning algorithms require computational power throughout the cycle of prep, train and use. Training and inference times are massively reduced when using special-purpose hardware (e.g. GPUs for neural networks). One rule of thumb: ‘if it doesn’t need special-purpose hardware then it’s probably not a real deep-learning brain; it may simply be a machine learning algorithm pretending to be one’. CPUs are mostly good for the data prep tasks before and after the ‘brain’ has done its work.

Analytics System

If we were to have only interfaces <1> and <3> (see Figure 1) we can call it an analytics solution. This type of system has no ability to influence the System; it is merely an observer. This is very popular especially on the business support side. Here interface <4> may not always be something tangible (such as a REST API or a command console); it might represent strategic and tactical decisions. The ‘Analytics’ block in this case consists of data visualisation and user interface components.

True Automation

To enable true automation we must close the loop (i.e. Interface <2> must exist). But there is something that I have not shown in Figure 1 which is important for true automation. This missing item is the ability to process event-based data. This is very important especially for systems that are time dependent – real-time or near-real-time – such as trading systems, network orchestrators etc. This is shown in Figure 2.

Figure 2: Automation and different types of data flows

Note: Events are not only generated by the System being controlled but also by the ‘Brain’. Therefore, the ‘Brain’ must be capable of handling both time dependent as well as time independent data. It should also be able to generate commands that are time dependent as well as time independent.

Recursive Layers

Recursive Layering is a powerful concept where an architecture allows its implementations to be layered on top of each other. This is possible with ML, DL and AI components. The System in Figures 1 and 2 can itself be another combination of a Brain and a controlled System, where the various outputs are fed into another Brain (a super-brain? a supervisor brain?). An example is shown in Figure 3. This is a classic Analytics-over-ML example where the ‘Analytics’ block from Figures 1 and 2 has a Brain inside it (it is not just restricted to visualisation and UI). It may be a simple newborn brain (e.g. static SQL data processing queries) or a sophisticated deep learning system.

Figure 3: Recursive layering in ML, DL and AI systems.

The Analytics feed is another API point that can be an input data source (Interface <1>) to another ‘Brain’ that is, say, supervising the one generating the analytics data.

Conclusion

So next time you get a project that involves automation (implementing or using it), think about the interfaces and components shown in Figure 1. Think about what type of brain you need (age and density).

If you are on the product side then make sure the claims you make are bold, not illogical or blatantly false. Just as you would not ask a toddler to do a teenager’s job, don’t advertise one as the other.

Finally, think hard about how the users will be included in the automation loop. What conditions will disable Interface <2> in Figure 1 and cut out to manual control? How can the users monitor the ‘Brain’? Fully automated, closed-loop systems are not good for anyone (just ask John Connor from the Terminator series or the people at Knight Capital https://en.wikipedia.org/wiki/Knight_Capital_Group). Humans often provide deeper insights, based on practical experience and knowledge, than ML or DL is capable of.

Reduce Food Wastage using Machine Learning

A scenario readers might be familiar with: food items hiding in our refrigerator way past their expiry date. Once discovered, these are quickly transferred to the bin with promises to self that next time it will be different for sure – or, worse yet, we stuff the items in our freezer!

Estimates of waste range from 20% to 50% (in countries like the USA). This is a big shame given that hundreds of millions of people around the world don’t have any form of food security and face acute food shortages.

What can we do about this? 

One solution is to help people be a bit more organised by reminding them of the expiry dates of various items. The registration of items has to be automated and smart. 

Automated:

If we insist on manual entry of items with their expiry dates, people are unlikely to want to do this – especially right after a long shop! Instead, as the items are checked out at the shop, an option should be available to email the receipt, which should also contain an electronic record of the expiry date of each purchased item. This should include all groceries as well as ‘ready to eat’ meals. Alternatively, one can provide different integration options using open APIs and some sort of mobile app.

Smarter:

Once we have the expiry dates we need to ensure we provide the correct support and advice to the users of the app. To make it more user-friendly we should suggest recipes from the purchased groceries and put those on the calendar, creating a ‘burn-down’ chart for the groceries (taking inspiration from Agile) that optimises for things like freshness of groceries, minimising the use of ‘packaged foods’ and maintaining variety in the recipes.

Setup:

Steps are as follows:

  1. When buying groceries the expiry and nutrition information are loaded into the system
  2. Using a matrix of expiry to items and items to recipes (for raw groceries) we get an optimised ordering of usage dates mapped to recipes
  3. With the item consumption-recipe schedule we can then interleave ready to eat items, take-away days and calendar entries related to dinner/lunch meetings (all of these are constraints)
  4. Add feedback loop allowing users to provide feedback as to what recipes they cooked, what they didn’t cook, what items were wasted and where ‘unscheduled’ ready to eat items were used or take-away called for
  5. This will help encourage users to buy the items they actually consume, and warn against (or deprioritise?) items that users ‘ignore’

I provide a dummy implementation in Python using Pandas to sketch out some of the points and to bring out some tricky problems.
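A cut-down sketch in that spirit (my reconstruction rather than the full implementation; item shelf lives, recipe contents and the scoring weights are invented) scores each recipe by how many purchased items it uses and how urgently those items need using:

# Score recipes against the pantry: reward using more items, and reward
# using items that are close to expiry. All data here is hypothetical.
import pandas as pd

pantry = pd.DataFrame({
    "item": ["chicken", "tomato", "cream", "onion", "fish", "potato"],
    "days_to_expiry": [2, 3, 4, 6, 1, 14],
})

recipes = {
    "butter_chicken": ["chicken", "tomato", "cream", "onion"],
    "mince_pie": ["meat_mince", "potato", "onion"],
    "fish_n_chips": ["fish", "potato"],
}

def score(ingredients, pantry, horizon=15):
    avail = pantry.set_index("item")["days_to_expiry"]
    used = [i for i in ingredients if i in avail.index]
    if len(used) < len(ingredients):
        return None                          # an ingredient is missing
    urgency = (horizon - avail[used]).sum()  # short shelf life scores higher
    return urgency, len(used) / len(pantry)

for name, ingredients in recipes.items():
    result = score(ingredients, pantry)
    if result is None:
        print("Not enough ingredients for {}".format(name))
    else:
        print("{}: Score:{} Percentage items consumed:{:.0%}".format(name, *result))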

The output is a list of purchased items and a list of available recipes followed by a list of recommendations with a ‘score’ metric that maximises ingredient use and minimises delay in usage.

Item: 0:cabbage
Item: 1:courgette
Item: 2:potato
Item: 3:meat_mince
Item: 4:lemon
Item: 5:chicken
Item: 6:fish
Item: 7:onion
Item: 8:carrot
Item: 9:cream
Item: 10:tomato


Recipe: 0:butter_chicken
Recipe: 1:chicken_in_white_sauce
Recipe: 2:mince_pie
Recipe: 3:fish_n_chips
Recipe: 4:veg_pasta
Recipe: 5:chicken_noodles
Recipe: 6:veg_soup

Recommendations

butter_chicken: Score: 30, Percentage items consumed: 36%
chicken_in_white_sauce: Score: 26, Percentage items consumed: 27%
Not enough ingredients for mince_pie
fish_n_chips: Score: 20, Percentage items consumed: 27%
veg_pasta: Score: 26, Percentage items consumed: 27%
chicken_noodles: Score: 28, Percentage items consumed: 36%
veg_soup: Score: 20, Percentage items consumed: 27%

The recommendation is to start with ‘butter chicken’ as we use up some items that have a short shelf life. Here is a ‘real’ recipe – as a thank you for reading this post: 

http://maunikagowardhan.co.uk/cook-in-a-curry/butter-chicken-murgh-makhani-chicken-cooked-in-a-spiced-tomato-gravy/

Tricky Problems:

There are some tricky bits that can be solved but will need some serious thinking:

  1. Updating recommendations as recipes are cooked
  2. Updating recommendations as unscheduled things happen (e.g. item going bad early or re-ordering of recipes being cooked)
  3. Keeping track of cooked items and other interleaved schedules (e.g. item being frozen to use later)
  4. Learning from usage without requiring the user to update all entries (e.g. using RFID? Deep Learning – from images taken of your fridge with the door open)
  5. Coming up with innovative metrics to encourage people to eat healthy and eat fresh – lots of information (e.g. nutrition information) can be extracted if we have a list of purchased items
  6. Scheduling recipes around other events in a calendar or routine items (e.g. avoiding a heavy meal before a scheduled gym appointment)

Parallel Processing of Large Files using Rust

In previous posts I have discussed Shared State and State Change. Let us put all this to good use in this post!

A standard use-case is where we have a large data file (e.g. CSV like this: Price Paid Data) that needs to be crunched where each row is transformed to a new row (there is no aggregation going on). The simplest processing paradigm is to do everything sequentially.

Step 1: Read next row from file

Step 2: Process row

Step 3: Write row to output file

Step 4: Go to 1 if not End Of File

At the end of the process we end up with a file having the same number of rows but a different set of columns.

This is easy to code up (a few lines of Python!) but impossible to scale: as our file grows in size, the longer we have to wait. Reading and writing in sequence increases linearly with the size of the file, and the more processing we need to do in each loop, the longer the wait becomes.

If we have to process the file again and again (to apply different aggregations and transformations) things become more difficult. 

Figure 1: Creating scope for parallel processing, using an iterator (top) and chunks (bottom).

If the large data file is ‘read-only’, things become easier to process. We can do parallel reads without any fear. Even if we do sequential reads we can process each row in parallel and do sequential batching in the aggregate (which in this case is a simple row write) – where we don’t write one row but a batch of rows. This can be seen in Figure 1 (top). This is still not the best option: the iteration time will increase as the file size increases.

The real power comes from the use of a distributed file system, which means we can fully parallelise the pipeline up to the aggregation step (where we still batch-write rows). This can be done by breaking the file into blocks or chunks so we can iterate in parallel.

The chunking of the file is still a sequential step but it needs to be done only once as the file is loaded into the File System. Then we can perform any operation.

As a means of learning Rust, I have attempted to write a single iterator-based program with parallel processing of the transformation function and a batched writer.

There is a pitfall here: we can’t keep adding handlers, as the overhead of parallelisation will start to creep up. At the end of the day the writer is a single-threaded processor, so it cannot be scaled up and can only keep up with a given set of handlers at a time (depending on the production rate of the handlers).
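To make the shape of this pipeline concrete, here is a Python analogue of the same idea (the original Rust source is not reproduced here; the file name, column positions and the transformation itself are assumptions for illustration): a pool of workers transforms rows in parallel while a single writer drains results in batches.

# Parallel transform with a single batching writer.
import csv
from multiprocessing import Pool

def transform(row):
    # Hypothetical per-row transform: keep the price and the year of transfer.
    return [row[1], row[2][:4]]

def main():
    with open("pp-data.csv") as src, open("out.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        with Pool(processes=4) as pool:
            batch = []
            # imap preserves row order; chunksize amortises IPC overhead
            for out_row in pool.imap(transform, csv.reader(src), chunksize=1000):
                batch.append(out_row)
                if len(batch) >= 10000:   # batched write, as in Figure 1 (top)
                    writer.writerows(batch)
                    batch.clear()
            writer.writerows(batch)       # flush the final partial batch

if __name__ == "__main__":
    main()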

There is a way of parallelising the writer as well: we can produce a chunked file as output. This allows us to run parallel iterate-transform-write pipelines on each chunk, as shown in Figure 2.

Figure 2: Chunking for source and destination.

The file for the Rust program is below. 

Expectation and State Change

Continuing on from the previous post about shared state, let us try to get some concept of expected and actual state. This is looking in from the outside.

A few assumptions first:

  1. State can be written or read.
  2. In terms of multiple processes writing to a shared state store, we only care about state read and state written, and assume that half-updates are not allowed at the lowest level. To simplify, we are not looking at object updates where, say, out of three items two are updated to the new value but the third one is not: at the level of the object we have a partial update, but at the level of the individual items we will either see the new value or the old value.
  3. Reads can be with expectation or without expectation (other than the core expectation of ‘success’ and a valid ‘format’). Either the reader uses the value read (e.g. in an ‘if’ condition) or simply passes it on without using it (e.g. as an access API). What the reader does with the value is independent of the issue of shared state.
  4. Writes can be with or without checks: either the writer doesn’t care about the current state before writing the new state, or it does (so-called compare and swap)

Therefore the standard model for un-secured parallel read-write of state involves all types of Reads (R) and Writes (W) to the same shared State (S). From the previous post we know the conditions that lead to issues with access to shared state.

Figure 1: What is the state written by writers (Wa, Wb, Wc) and what is the state read by readers (Ra, Rb, Rc)?

In Figure 1 there is no way of knowing the final state of S given three parallel writes by Wa, Wb and Wc. The expectation is that S is equal to 1, 2 or 3. Similarly, we cannot know what values are read by the parallel readers Ra, Rb and Rc. The expectation again is that the values are 1, 2 or 3. But depending on timing, readers may see non-initialised values as well (for example if a read happens before any writes).

The writer in this case treats the State as a black box. Therefore, there are no expectations of the current value of S and writes are done in a fire-and-forget manner.

One way of improving this is to make writes that do compare and swap (CAS) – that is, forming an expectation of the existing state before changing it.

Figure 2: Compare and Swap – with pitfalls

Figure 2 shows this at work. Two writers (W1 and W2) are writing to State S at the same time. We cannot be sure when they start and finish. The initial value of S is 100. W1 and W2 are attempting to change the state of S from 100 to 300 and 200 respectively. Since these are overlapping writes we need to provide some sort of ordering to achieve consistency.

To provide ordering we write with expectation. To build the expectation each writer must perform a read first. This initial read gives the writer a foundation for its own write. Usually the subsequent compare-and-swap operation is atomic, with support at the lowest levels (e.g. see sun.misc.Unsafe and its use by the concurrency API in Java for CAS).

W1 initiates the write by forming an expectation of the current state (100). While it attempts the CAS operation on S, W2 also initiates a write by forming an expectation. As W1’s CAS has not finished yet, the current state is still 100. W2 is then able to slip its CAS ahead of W1 (due to, say, thread priority). That CAS compares expectation to actual and then sets S to the new value of 200. Now W1 catches up and issues a CAS request. CAS checks the current state against the expected state – this no longer matches, as the current state is 200 but the expected state is 100. The operation fails and W1 is free to retry (with, say, exponential back-off).
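A toy sketch makes the retry behaviour concrete (the CasCell class is invented for illustration; real CAS is a single atomic CPU instruction, for which a lock stands in here):

# Compare-and-swap: set the value only if it still equals the expectation.
import threading

class CasCell:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:                 # stands in for hardware atomicity
            if self._value == expected:
                self._value = new
                return True
            return False                 # another writer got in first: retry

s = CasCell(100)
w1_expected = s.read()                       # W1 forms its expectation (100)
print(s.compare_and_swap(100, 200))          # W2 slips in first -> True
print(s.compare_and_swap(w1_expected, 300))  # W1's CAS fails -> False, retry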

Putting Expectations to Work

Let’s put forward a question at this stage:

What value does R2 get when it tries to read S? As we can see, the value of S changes from 100 to 200 while the read is taking place. For example, if we had instantaneous reads then R2 should get the value 100.

The answer is that it depends on the specific implementation. 

We can use some sort of locking mechanism to coordinate between reads and writes. This would solve the parallel access issue by making sure there is only one reader or writer at a given time. But this is not efficient, as we are blocking parallel reads as well.

The main issue is that we don’t want to lock large chunks (e.g. a full document or a full row), but we may have to, as we don’t want partial state updates to be visible.

To be ‘atomic’ in the case of a multi-part object (like a document or a row with multiple columns) means seeing all the changes that are part of a ‘write transaction’ reflected, or none at all.

For example, say we have a document that contains details of a game player: ‘games played’, ‘games lost’, ‘games won’, ‘no results’. If we don’t provide atomic behaviour when updating it after a game, we could allow reads that don’t make sense (e.g. where ‘games played’ is different from the sum of ‘games lost’, ‘games won’ and ‘no results’).

As we spread the scope of the transaction the atomic update guarantee requires locks on more objects.

The other solution (rather than object-wide locks) is to use a pointer to the object from the ‘table’ or ‘index’. Here an object is immutable: new versions require the creation of new objects and updating a reference to point to the new version.

A single reference value can be easier to update atomically (e.g. using Compare and Swap) than a whole object. Therefore there is no need to block parallel reads and writes if we can guarantee that at any time the pointer to the object will be updated atomically. In this case, reads will see either the new value or the old value.

Furthermore, what if there are relationships between entities that need to be updated simultaneously (a group of interrelated objects changing together)? For example between game result data and player data, or between a money sender and receiver?

In the above case we need to guarantee that all related pointer values (as required by the transaction) will be updated atomically. This is a lot harder to achieve.

One way this could be done is by creating a duplicate set of interrelated objects containing the newer versions. This allows reads to continue (writes are blocked so that we don’t have multiple competing new versions) while the updated versions are being created. The problem comes when it’s time to switch all of the related objects to the newer version without exposing a mix of old and new versions.

For this the traditional solution is to lock all connections (for reads or writes) while the version references are updated. Practically this should be a quick operation, as the new structures already exist. The old versions can then be removed permanently (usually via a compaction procedure that releases storage space).

Network Traffic Monitoring using OpenFlow

These days you don’t have to shell out thousands of pounds for an OpenFlow switch, especially if you don’t mind a smaller number of ports and a device that is not blazingly fast. I purchased a Zodiac FX OpenFlow switch and have been trying out different projects with it. The switch has 4 ports at 100 Mbps each, with Port 4 reserved for management.

Figure 1: Network Setup for Traffic Snooper

To make a Traffic Snooper we need to mirror the ports that allow ingress into and egress from the internal network. Port mirroring means sending the same traffic to the target port as well as to a mirror port. We may also filter the traffic that is mirrored (e.g. we may only want to mirror Layer 4 UDP traffic).

For the Zodiac FX, I selected Port 2 to be the mirror port. Traffic exiting the internal network will arrive on Port 1 and be sent out from Port 2 and Port 3 (Apply Action List). Similarly traffic entering the internal network from the Internet will arrive on Port 3 and be sent out from Port 2 and Port 1.

Thus Port 2 will receive packets from both directions. Port 2 is attached directly to a Raspberry Pi with link-local addressing enabled. This means the port on the Raspberry Pi has a link-local IP address assigned to it. We don’t really care about this because we are already forcing the traffic this way using a static flow.

On the Traffic Snooper Pi I run my Python-based traffic collector. It uses raw sockets to listen for traffic. All data is converted into hex and dumped into the Analytics Store database (MongoDB). This is a fairly simple utility, perfect for a lightweight platform like the Pi. It does not do any processing of the network data; it simply stores it in a database. The source code is attached for both the Snooper and the static controller, which uses Ryu.

Snooper.py: https://docs.google.com/document/d/e/2PACX-1vTVV-G17M-TrLfGd2gt0B-5aK_NshjZ1-F1tWvrQwbTHR4Z-FYoaAzfOYdMVtGxP3B1ODLoWWiQiWS3/pub

The Traffic Snooper needs the following to read from the raw socket that is getting data from port 2 of the switch (the mirrored port):

# ETH_P_ALL = 3: capture all protocols; htons converts it to network byte order
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(3))

The line above can be interpreted as:

Open a socket that uses the Packet Address Family (AF_PACKET), which allows access to Layer 2 information (you will need to run the snooper with sudo). The socket is of raw type, therefore Layer 2 headers will not be stripped. Finally we pass the protocol number 3 (ETH_P_ALL, i.e. capture all protocols), converted with the host-to-network byte order function (htons).

This gives us a socket that pulls the packet with all the information (headers) intact and also ensures the byte ordering is correct.

The Traffic Snooper also stores the packet hash to ensure we do not store duplicate packets (this can be disabled if required).

Note: port 2 does not get any direct assignment of an IP address (we don’t want any traffic to use this port for communication – only mirrored traffic should use it) and should default to a ‘link-local’ IP address. In the case of IPv4, link-local addresses are defined in the address block 169.254.0.0/16.

Static_Controller.py: https://docs.google.com/document/d/e/2PACX-1vRU8ZAa5Vl03UwC5K61Rt9Me0y0tvKq_0s8lCm7aH9t7vN_Z6qnUMQgINPFdCrt9BM-kBkJh3uuJCyw/pub

The installed flows:

The OpenFlow Specification allows multiple Apply Actions; we use this to create duplicated traffic flows.

Flow 1: All traffic coming in from port 3 is forwarded to ports 2 and 1. Here port 2 is the port connected to the analyser.

Flow 2: All traffic coming in from port 1 is forwarded to ports 2 and 3.

Note: The controller is a static controller. In other words we use Ryu to install flows and then the controller is disconnected (thus flow timeout = 0). To achieve this we use the ‘safe’ mode on the Zodiac FX, which does not remove the flows that have been installed. As the Zodiac FX is a pure OpenFlow switch it does not support standalone mode.
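As a hedged sketch (not the linked controller verbatim, and assuming OpenFlow 1.3 plus the port roles described above), installing the two mirror flows with Ryu looks roughly like this:

# Install the two static mirror flows when the switch connects.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class StaticMirror(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def features_handler(self, ev):
        dp = ev.msg.datapath
        parser = dp.ofproto_parser
        # Flow 1: in_port=3 -> output to ports 2 (mirror) and 1
        # Flow 2: in_port=1 -> output to ports 2 (mirror) and 3
        for in_port, out_ports in ((3, (2, 1)), (1, (2, 3))):
            match = parser.OFPMatch(in_port=in_port)
            actions = [parser.OFPActionOutput(p) for p in out_ports]
            inst = [parser.OFPInstructionActions(
                dp.ofproto.OFPIT_APPLY_ACTIONS, actions)]
            # timeouts of 0 make the flows permanent, as described above
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                          match=match, instructions=inst,
                                          idle_timeout=0, hard_timeout=0))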

Next Step: Next post will look at the traffic analyser that breaks down the incoming packet and pulls out various protocol related information.

Follow-up: I have used the Zodiac FX for this post, and not something like Open vSwitch (which has several advanced features such as port mirroring and a built-in learning switch in ‘standalone’ mode), because I wanted a pure OpenFlow device that does nothing until you provide the flows.

OVS Implementation

It is fairly straightforward to set up your Pi as an OVS switch. You will need USB-Ethernet plugs and a freshly formatted Pi. I used the lightweight no-desktop ‘Stretch’.

This is a good guide to follow: 
https://www.telematika.org/post/piovs-raspberry-pi-open-vswitch/

I only needed the ‘Install OVS’ and ‘Configure Interfaces’ steps.

Below are the three interfaces I created, ‘eth2’ is the interface to the snooper and ‘eth3’ the ‘internet’ interface.

auto eth1
iface eth1 inet manual
hwaddress ether 00:0e:c6:de:54:15
auto eth2
iface eth2 inet manual
hwaddress ether 00:0e:c6:df:ae:ac
auto eth3
iface eth3 inet manual
hwaddress ether 00:0e:c6:df:ae:c2
auto ovsbr0
allow-ovs ovsbr0
iface ovsbr0 inet manual
ovs_type OVSBridge

There are a few things to watch out for:

  1. The interface on the Raspberry Pi running the snooper application should not be reachable from the network, because we do not want any traffic headed for that port; we only want to record traffic flowing between ports 1 and 3. Therefore, the Pi connected to port 2 of the switch should have two interfaces: one with a valid local network address (to allow the snooper to access the database server, for example), and a second connected to switch port 2, which receives the copied traffic. This second interface should have a link-local IP address, ensuring that all traffic received on it is mirrored traffic destined for port 1 or 3.
  2. Set the fail mode to ‘secure’ in OVS. If you do not, it will fall back to learning-switch mode (standalone mode) and start faithfully switching traffic. This will mean (in short) that your snooper Pi will have an IP address assigned to the port that is sniffing the mirrored traffic. Once you install the flows the traffic will be mirrored, but you can still get extra packets that are not part of the flows we are monitoring.
  3. Use ‘sudo ovs-vsctl get-fail-mode <bridge_name>‘ to get the fail mode and then ‘sudo ovs-vsctl set-fail-mode <bridge_name> secure‘ to set to secure mode (replace bridge_name with the name of your OVS bridge). This will disable the learning switch and you will need to use the static-controller to setup the snooper flows.
  4. You can use ‘sudo ovs-appctl fdb/show <bridge_name>‘ to show the forwarding db (this stores the result of the mac learning) and ‘sudo ovs-appctl fdb/flush <bridge_name>‘ to clear the forwarding db.

A Question of Shared State

This post is about one of my favourite topics in Computer Science: shared state and the challenges related to concurrency.  Mature programming languages have constructs that allow us to manage concurrent access to state. Functional programming style is all about keeping ‘state’ out of the ‘code’ to avoid bugs when providing concurrent access. Databases also use different tricks to side-step this issue such as transactions and versioning.

Let us first define ‘shared state’: any value that is accessible, for reading and/or writing, by one or more entities (processes, threads, humans) using a handle (variable name, URL, memory address). For example:

  1. in a database, the value associated against a key or a cell in a table
  2. in a program it is a value stored at a memory address 

Conceptual Model

Figure 1: Conceptual Model

The conceptual model is shown in Figure 1. Time flows from left to right and the State changes as Entities write to it (E1 and E2).

Entities (E1 and E3) read the State some time later.

A few things to note:

  1. State existed (with some value ‘1’) before E1 changed it
  2. E1 and E2 are sequentially writing the State
  3. E1 and E3 (some time later) are concurrently reading the State

Now that we understand the basic building blocks of this model as Entities, State(s) and the Operations of reading and writing of State(s), let us look at where things can go wrong.

Concurrent Access Problems

The most common problem with uncontrolled concurrent access is non-deterministic behaviour. When two processes are writing and reading the same State at the same time, there is no way to guarantee the result of the read. Imagine writing a test that checks the result of the read in such a situation. It could be that the test passes on a slower system because the read thread has to wait (so the write has finished). Or it could be that the test passes on a faster system because the write happens quickly (and the read gets the right value). Worse yet: the test passes but the associated feature does not work in production, or works sporadically. The sketch below shows how easily this arises.
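A minimal demonstration (plain Python threads; the counter stands in for any shared State Store): two writers perform an unguarded read-modify-write on the same value, so the final state varies from run to run.

# Two threads increment a shared counter without coordination. The
# read-modify-write is not atomic, so updates are lost non-deterministically.
import threading

counter = 0

def worker():
    global counter
    for _ in range(1000000):
        tmp = counter       # read
        counter = tmp + 1   # write: another thread may have written in between

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # expected 2000000, but typically less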

For problems like the above to occur in a concurrent access scenario, all three factors given below must be present:

  1. State is shared between Entities (which can be multiple threads of the same process)
  2. State once created can be modified (mutability) – even if by a sub-set of the Entities that have ‘write’ access to the State
  3. Unrestricted Concurrent Read and Write Operations are allowed on the State Store

All we need to do, to avoid problems when enabling concurrent access, is to block any one of the three factors given above.  If we do that then some form of parallel access will be available and it will be safe.

Blocking each can open the system up to other issues (there is no such thing as a free lunch!). One thing to keep in mind is that we are talking about a single shared copy of the State here. We are not discussing replicated shared States here. Let us look at what happens when we block each of the factors.

Preventing Shared Access

Here we are blocking factor 1 (no shared state), even though shared access may be a desired feature. This can be seen in Figure 2 (left), where two entities (E1 and E2) have access to different State Stores and can do whatever they want with them because they are not shared. E1 creates a State Store with value ‘1’ and E2 creates another State Store with value ‘A’. One thing to understand here is that the ownership of a State Store may be ‘transferred’ but not ‘shared’. In Figure 2 (right) the State Store is used by Entity E1 and then ownership is transferred to E2. E1 is not able to use the State Store once it has been transferred to E2.

This is used in the Rust programming language. Once a State Store (i.e. a variable) has been transferred out, it can no longer be accessed by the previously owning Entity (function) unless the new owner returns the State Store. This is good because we don’t know what the new function will do with the variables (e.g. library functions) – what if it spins up a new thread? If the new owner Entity does not return the State Store, or transfer it further, the memory used by the store can be reclaimed. If this sounds like a bad idea don’t worry: Rust has another mechanism called ‘borrowing’, which we will discuss towards the end of this post.

The previous section described interactions between functions. What about between threads? Rust allows closure-based sharing of state between threads; in this context the ‘move’ keyword is used to transfer ownership of state to the closure. Go channels use a related idea, cloning state before sharing it.

Figure 2: No shared state (left) and borrowing of state (right)

This solution makes parallel computation difficult (e.g. we need to duplicate the State Store to make it local to every Entity), and the results may then need to be collated for further processing. We can improve this by using the concept of transfer, but that leads to its own set of issues (e.g. ensuring short transfer chains).

Preventing Mutability

This option is an important part of the ‘functional’ style of programming. Most functional languages make creating immutable structures easier than creating mutable ones (e.g. special keywords have to be used, or libraries imported, for mutability). Even in object-oriented languages mutable State Stores are discouraged. For example, in Java the best practice is to make objects immutable, and a whole range of immutable structures is provided. In certain databases, when state has to change, a new object with the changed state is created with some sort of version identifier. Older versions may or may not be available to readers.

Figure 3: Immutable State

Figure 3 shows this concept in action. Entity E1 creates a State Store with value ‘1’. Later on, E2 reads the State and creates a new State Store with value ‘2’ (perhaps based on the value of the previous State Store). If at that point there is no need to hold the State Store with value ‘1’, it may be reclaimed or marked for deletion.

Following on from this, the State Store with value ‘2’ is read by two Entities (E3 and E4), which in turn produce new State Stores with values ‘3’ and ‘4’ respectively. The important thing to realise is that once the State Store with value ‘2’ was created it could not be modified, therefore it can be read concurrently as many times as required without any issues.

Where this does get interesting is if the two divergent state transitions (from value ‘2’ -> ‘3’ and ‘2’ -> ‘4’) have to be reconciled. Here too the issue is one of reconciliation logic rather than concurrent access, because the State Stores with values ‘3’ and ‘4’ are themselves immutable.

One problem with this solution is that if we are not careful about reclaiming resources used by old State Stores we can quickly run out of resources for new instances. Creating a new State Store also involves a certain overhead, as the existing State Store must first be cloned. Java, for example, offers multiple utility methods that help create immutable State Stores from mutable ones.

Restricting Concurrent Read and Write Operations

Trouble arises when we mix concurrent reads AND writes. If we are only ever reading then there are no issues (see the previous section). What if there was a way of synchronising access to the State Store between different Entities?

Figure 4: Preventing Concurrent Read and Write Operations

Figure 4 shows such a synchronisation mechanism that uses explicit locks. An Entity can access the State Store only when it holds the lock on the State Store. In Figure 4 we can see E1 acquires the lock and changes value of State Store from ‘1’ to ‘2’. At the same time E2 wants to access the State Store but is blocked till E1 finishes. As soon as E1 finishes E2 acquires the lock and changes the value from ‘2’ to ‘3’.

Some time later E2 starts to change the value from ‘3’ to ‘4’; while that is happening E1 reads the value. Without synchronised access it would be difficult to guarantee what value E1 reads back. The value would depend on whether E2 has finished updating it or not. If we use locks to synchronise, E1 will not be able to read until E2 finishes updating.

Using synchronisation mechanisms can lead to some very interesting bugs such as ‘deadlocks’ where under certain conditions (called Coffman conditions) Entities are never able to acquire locks on resources they need and are therefore deadlocked forever. Most often these bugs are very difficult to reproduce as they depend on various factors such as speed of processing, size of processing task, load on the system etc.

Concepts used in Rust

Since I have briefly mentioned Rust, it is interesting to see how the solutions presented are used in Rust to ensure strict scope checking of variables. Compile-time checks ensure that the issues mentioned cannot arise.

  1. Transfer of State: Rust models two main types of transfer – a hard transfer of ownership and a softer ‘borrowing’ of a reference. If a reference is borrowed then the original State Store can still be used by the original owning Entity, as no transfer of ownership has taken place.
  2. Restrictions on Borrowing: Since references can be both mutable and immutable (the default type), we can quickly get into trouble with mutable references handed out alongside immutable ones. To prevent this from happening there is a strict rule about what types of references can coexist: we can have as many immutable references as we like at the same time, OR a single mutable reference. The protection mechanism therefore changes depending on the type of operations required on the State Store. For read-and-write, the State Store is not shared (a single mutable reference). For parallel reads, the State Store cannot be changed. This keeps the language on the right side of the trade-off.
  3. Move: This allows state transfer between threads. We need to be sure that any closure-based state transfer is also safe. It uses the same concept as shown in Figure 2, except that ownership is transferred to the closure and the state is no longer accessible from the original thread.

Go Channels

Go channels provide state cloning and sharing out of the box. When you create a channel of a struct type and send an instance of that struct over it, a copy is created and sent over the channel. Any changes made by the sender or receiver are made on different instances of the struct.

One way to mess this up nicely is to use a channel of a pointer type. Then you are passing the reference, which means you are sharing state (and creating a bug for yourself to debug at a later date).

Conclusion

We have investigated a simple model to understand how shared state works. The next step would be to investigate how things change when we add replication to the shared state so that we can scale horizontally.

International Recycle Card

It is encouraging to see the availability of recyclable packaging such as plastic wrappers, cans and food containers. But we see the problem of incorrect disposal, littering and lack of waste segregation everywhere (here, I believe, developed and developing countries are alike).

What incentive can the public be given to not only correctly dispose of their own litter but also to pick up after others?

One common method has been the use of bottle/can banks, where you return empty bottles and/or cans and get some money in return.

My idea is to extend this and make it streamlined.

The concept is simple.

Prerequisites:

  1. All packaging to be uniquely identified using RFID/barcode/QR code etc. – this should identify the source of the packaging and the unique package itself, something like a barcode
  2. Everyone buying packaged items has a Recycle Card (app or physical)
  3. (Optional) People buy items using electronic cash (e.g. credit cards) – to attach personal details

Process:

  1. Person scans the item (and the package code is also scanned alongside) – over time these could be the same code.
  2. Alongside the bill, a full (electronic) list of all the packaging you have purchased is provided on the Recycle Card app at the time of purchase.
  3. The ‘value’ of that packaging in terms of the local currency will also be shown.
  4. Upon successfully recycling the packaging, a part of that ‘value’ will be credited to the person. This can be a monthly or weekly process.
  5. Any litter found is scanned. The full ‘value’, along with a small fine, is debited from the associated Recycle Card. The Recycle Card of the person who found the litter and correctly disposed of it gets a small credit.

This means we recognise the value (in terms of money) of the packaging and not just the contents. This, I believe, is already partially happening where ‘green’ products with innovative packaging attract premium prices.

Furthermore, we should attach a loss (again in terms of money) to improper disposal of the packaging. Today that is done only through fines, without direct accountability.

Key Factors:

There are two important steps here:

  1. Detecting successful disposal: This should be automated, probably at the recycling centre: some sort of machine which can scan and tally the packaging and indicate which Recycle Card should be credited. Packaging is unlikely to arrive intact at the recycling centre, therefore multiple markers need to be provided. RFIDs are a good solution but may be too expensive for regular use. One option is a dye that exhibits fluorescence under certain light. This would give a code that can be detected using machine vision, similar to the Automatic Number Plate Recognition software that has become very popular at parking lots, toll plazas and petrol pumps.
  2. Registration of the Recycle Card: This should be a global system, mainly because the problem of plastics and other packaging materials will impact everyone, especially if these end up in our oceans. People should be obligated to correctly dispose of packaging wherever they are in the world. Those who do so should be rewarded and those who don’t penalised. To ensure this, every piece of packaging must be uniquely identified. This is a big task and I am sure there will be manufacturers (perhaps small/medium-sized ones or those from the informal sector) who will not follow this system (at least in the beginning, due to cost etc.). But the idea is to target the 80% before we target the 20%, in the sense that big companies like Unilever and Nestle, and fast-food joints like McDonalds, have the capacity to upgrade their packaging. These are also mass-consumption products, so it would have a noticeable impact.

Do let me know what you think about this idea!

Somewhere in there there are a few good machine learning and big-data use-cases. 🙂

Housing Market: Auto Correlation Analysis

In this post we take a look at housing market data, which consists of all the transactions registered with the UK Land Registry since 1996. So let’s get the copyright out of the way:

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

The data-set from HM Land Registry has information about all registered property transactions in England and Wales. The data-set used for this post has all transactions till the end of October 2018. 

To keep things simple and to focus on the price-paid and number-of-transactions metrics, I have removed most of the columns from the data-set and aggregated (sum) by month and year of the transaction. This gives us roughly 280 observations with the following data:

{ month, year, total price paid, total number of transactions }

Since this is a simple time-series, it is relatively easy to process. Figure 1 shows this series in a graph. Note the periodic nature of the graph.

Figure 1: Total Price Paid aggregated (sum) over a month; time on X axis (month/year) and Total Price Paid on Y axis.

The first thing one can try is auto-correlation analysis, to answer the question: given the data available (till end-October 2018), how similar have the last N months been to other periods in the series? Once we identify the periods of high similarity, we should get a good idea of the current market state.

To predict future market state we can use time-series forecasting methods which I will keep for a different post.

Auto-correlation

Auto-correlation is correlation (Pearson correlation coefficient) of a given sample (A) from a time series against other available samples (B). Both samples are of the same size. 

The correlation value lies between -1 and 1. A value of 1 means perfect correlation between the two samples, where they are directly proportional (when A increases, B also increases). A value of 0 implies no correlation, and a value of -1 implies the two samples are inversely proportional (when A increases, B decreases).

The simplest way to explain this is with an example. Assume:

  1. monthly data is available from Jan. 1996 to Oct. 2018
  2. we choose a sample size of 12 (months)
  3. the sample to be compared is the last 12 months (Oct. 2018 – Nov. 2017)
  4. value to be correlated is the Total Price Paid (summed by month).

As the sample size is fixed (12 months) we start generating samples from the series:

Sample to be compared: [Oct. 2018 – Nov. 2017]

Sample 1: [Oct. 2018 – Nov. 2017], this should give a correlation value of 1 as the two samples are identical.

Sample 2: [Sep. 2018 – Oct. 2017], the correlation value should start to decrease as we skip back one month.

Sample N: [Dec. 1996 – Jan. 1996], this is the earliest period we can correlate against.
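A compact sketch of this sliding-window procedure (on synthetic data, not the Land Registry series) looks like this:

# Sliding-window auto-correlation: correlate the most recent window against
# every earlier window of the same size. A sine wave stands in for the
# monthly Total Price Paid series.
import numpy as np

series = np.sin(np.linspace(0, 8 * np.pi, 274))  # ~274 'months' of data
window = 12
current = series[-window:]                       # the last 12 'months'

corrs = []
for lag in range(len(series) - window + 1):
    past = series[len(series) - window - lag : len(series) - lag]
    corrs.append(np.corrcoef(past, current)[0, 1])

print(corrs[0])  # 1.0 – the current window correlated with itself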

Now we present two graphs for different sample sizes:

  1. correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year) – to show yearly spread
  2. correlation coefficient visualised going back in time – to show periods of high correlation

One thing to note in all the graphs is that the starting value (right-most) is always 1. That is when we compare the selected sample (the last 12 months) with the first generated sample (the same last 12 months).

In the ‘back in time’ graphs we can see the seasonal fluctuations in the correlation, between 1 and -1. This tells us that Total Price Paid has a seasonal aspect to it, which makes sense: we see more houses for sale in the summer months than in winter, as most people prefer to move when the weather is nice!

Fig 2: Example of In and Out of Phase correlation.

So if we correlate a 12-month period (like this one) one year apart (e.g. Oct. 2018 – Nov. 2017 and Oct. 2017 – Nov. 2016) we should get positive correlation, as the variation of Total Price Paid should have the same shape. This is ‘in phase’ correlation. It can be seen in Figure 2 as the ‘first’ correlation, which is in phase (in fact it is perfectly in phase and the values are identical – thus the correlation value of 1).

Similarly, if the comparison is made ‘out of phase’ (e.g. Oct. 2018 – Nov. 2017 and Jul. 2018 – Aug. 2017), where the variations are opposite, then negative correlation will be seen. This is the ‘second’ correlation in Figure 2.

This is exactly what we can see in the figures below. Sample sizes are 6, 12, 18 and 24 months, with two figures per sample size. The first figure is the spread of the auto-correlation coefficient for a given year. The second figure is the time-series plot of the auto-correlation coefficient as we move back in time, correlating against the last N months. The correlation values fluctuate between 1 and -1 in a periodic manner.


Fig. 3a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year), Sample size: 6 months

Fig. 3b: Correlation coefficient visualised going back in time; Sample size: 6 months


Fig. 4a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 12 months

Fig. 4b: Correlation coefficient visualised going back in time; Sample size: 12 months


Fig. 5a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 18 months

Fig. 5b: Correlation coefficient visualised going back in time; Sample size: 18 months


Fig. 6a: Correlation coefficient visualised going back in time, grouped by Year (scatter and box plot per year); Sample size: 24 months

Fig. 6b: Correlation coefficient visualised going back in time; Sample size: 24 months

Conclusions

Firstly, if we compare the scatter + box plot figures, especially for 12 months (Figure 4a), we find the correlation coefficients are spread around ‘0’ for most of the years. One period where this is not so, and the correlation spread is consistently above ‘0’, is the year 2008 – the year that marked the start of the financial crisis. The spread is also ‘tight’, which means all the months of that year saw consistent correlation, for Total Price Paid, against the last 12 months from October 2018.

The second conclusion we can draw from the positive correlation between the last 12 months and the period of the financial crisis is that the variations in Total Price Paid are similar (weakly correlated) to those at the time of the financial crisis. This obviously does not guarantee that a new crisis is upon us, but it does suggest the market is slowing down. This is a reasonable conclusion given the double whammy of impending Brexit and the onset of the winter/holiday season (which traditionally marks a ‘slow’ time of the year for property transactions).

The code is once again in Python, attached below:

from matplotlib import pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
from datetime import datetime as dt
import pandas as pd
import numpy as np

months = MonthLocator(range(1, 13), bymonthday=1, interval=3)
year_loc = YearLocator()

window_size = 12

def is_crisis(year):
    # Label a year: 0 = pre-crisis, 1 = crisis (2008-2012), 2 = post-crisis.
    if year < 2008:
        return 0
    elif year > 2012:
        return 2
    return 1

def is_crisis_start(year):
    # True only for 2008, the year the crisis started.
    return year == 2008

def process_timeline(do_plot=False):
    col = "Sum"        # Total Price Paid (the post's metric); "Count" correlates transaction counts
    y = []             # correlation coefficient per step back in time
    x = []             # year of each step
    x_d = []           # date of each step (for the time-series plot)
    box_d = []         # coefficients grouped by year (for the box plot)
    year_d = []
    year = 0
    years_pos = []
    crisis_corr = []   # coefficients observed for the crisis-start year
    for i in range(0, size - window_size):
        try:
            if year != df_dates["Year"].iloc[size - 1 - i]:
                if year > 0:
                    box_d.append(year_d)
                    years_pos.append(year)
                year_d = []
                year = df_dates["Year"].iloc[size - 1 - i]

            # correlate the window ending i months ago with the current window
            past = df_dates[col].iloc[size - i - window_size: size - i].values
            corr = np.corrcoef(past, current[col].values)
            year_d.append(corr[0, 1])
            y.append(corr[0, 1])
            if is_crisis_start(year):
                crisis_corr.append(corr[0, 1])
            x.append(year)
            month = df_dates["Month"].iloc[size - 1 - i]
            x_d.append(dt(year, month, 15))
        except Exception as e:
            print(e)

    box_d.append(year_d)
    years_pos.append(year)

    corr_np = np.array(crisis_corr)
    corr_mean = corr_np.mean()
    corr_std = corr_np.std()

    print("Crisis year correlation: mean and std.: {} / {}".format(corr_mean, corr_std))

    if do_plot:
        # scatter + box plot: spread of coefficients per year
        fig, sp = plt.subplots()
        sp.scatter(x, y)
        sp.boxplot(box_d, positions=years_pos)
        plt.show()

        # time-series plot: coefficient going back in time
        fig, ax = plt.subplots()
        ax.plot(x_d, y, '-o')
        ax.grid(True)
        ax.xaxis.set_major_locator(year_loc)
        ax.xaxis.set_minor_locator(months)
        plt.show()

    return corr_mean, corr_std

full_csv = "c:\\ML Stats\\housing_oct_18.csv_mnth_cnt_sum.csv"

df = pd.read_csv(full_csv)

# build a mid-month date and a crisis label for every {month, year} row
dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))
crisis = list(map(lambda r: is_crisis(int(r[1]["Year"])), df.iterrows()))

df_dates = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum,
                         "Year": df.Year, "Month": df.Month, "Crisis": crisis})
df_dates = df_dates.sort_values(["Date"])
df_dates = df_dates.set_index("Date")

# Figure 1: Total Price Paid aggregated by month
plt.plot(df_dates["Sum"], '-o')
plt.ylim(bottom=0)
plt.show()

size = len(df_dates["Count"])

for i in range(0, size - window_size):
    end = size - i
    current = df_dates.iloc[end - window_size:end]   # the most recent window
    print("Length of current: {}, window size: {}".format(len(current), window_size))
    ret = process_timeline(do_plot=True)
    break  # exit early: only correlate against the latest window

House Market Analysis

House prices in the UK are at it again. A combination of Brexit, changes in housing stock, easy loans and growing consumer debt is making things interesting again.

Figure 1: Number of Transactions per month from 1995 to August 2018

Figure 1 shows the number of transactions every month since 1995: the massive fall post-2007 because of the financial crisis, then the surge in transactions from 2013 onwards. The lonely spot (top-right, March 2016) is just before the new Stamp Duty changes made buying a second house an expensive proposition. But this is relatively boring!

Visual Analytics: Relation between Quantity and Value of Transactions

Let us look at the Transaction Count (quantity) and the Total Value of those transactions, aggregated on a monthly basis. I used a Spark cluster to aggregate the full transaction set (a 4 GB CSV data file). The base data set has about 280 rows with the following structure:

{month, year, sum, count}

The month and year values are converted into dates and added to the row, then the data set is sorted by date:

{date, month, year, sum, count}
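A hedged sketch of that aggregation (the raw PPD file has no header row, so the column positions and date format below are assumptions):

# Monthly {year, month, sum, count} aggregation of the Price Paid CSV.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ppd-monthly").getOrCreate()

ppd = (spark.read.csv("pp-complete.csv")        # _c1 = price, _c2 = date (assumed)
       .withColumn("price", F.col("_c1").cast("double"))
       .withColumn("date", F.to_date(F.col("_c2"), "yyyy-MM-dd HH:mm")))

monthly = (ppd.groupBy(F.year("date").alias("Year"),
                       F.month("date").alias("Month"))
           .agg(F.sum("price").alias("Sum"),
                F.count("*").alias("Count")))

monthly.coalesce(1).write.option("header", True).csv("monthly_agg")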

This leads us to three plots. Sum and Count against time and Sum against Count. These are shown below:

Figure 2: Total Transaction value by date, grouped by year (each dot represents a month in that year)

Figure 2 shows the Total Transaction Value by date (Y-axis). The plot is grouped by year, where each dot represents a month in that year. The current year (2018) only has complete months up to August, therefore it has fewer dots.

Figure 3: Total Quantity of Transactions by date, grouped by year (each dot represents a month in that year)

Figure 3 shows Total Quantity of Transactions (Y-axis), once again grouped by year. As in Figure 2, the data for 2018 is complete up to August.

Figure 4: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis)

Figure 4 shows how the value of the transactions relates to the number of transactions. Each dot represents a month in a year. As expected, there is a slight positive correlation between the total value of transactions and the number of transactions. A point to note: the total value of transactions depends on the sale prices (which depend on the properties sold) as well as the number of transactions in a given month. For the same number of transactions the value could be high or low (year on year), depending on whether prices are inflationary or a higher number of good quality houses are part of that month's transactions.
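The strength of that correlation is easy to check directly; a one-line sketch using the df_dates frame built in the earlier listing (the exact coefficient is not quoted here):

# Pearson correlation between monthly transaction count and total value
print(df_dates["Count"].corr(df_dates["Sum"]))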

Figure 5: Total Transaction value (Y-axis) against Total Number of Transactions (X-axis); each point represents a particular month in a year

Figure 5 enhances Figure 4 by using a colour gradient to show the year of the observation. Each year should have 12 points associated with it (except 2018). This concept is further extended by using different marker shapes depending on whether the observation was made before the financial crisis (circle: year of observation before 2008), during the financial crisis (square: year of observation between 2008 and 2012) or after the crisis (plus: year of observation after 2012). These year boundaries have been picked using Figures 2 and 3.
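A plot in the style of Figure 5 can be sketched as follows. This is a reconstruction rather than the original plotting code: the era boundaries (2008 and 2012) follow the text above, and 'o', 's' and 'P' are matplotlib's circle, square and filled-plus markers.

# Colour by year, marker shape by era
eras = [("o", df_dates["Year"] < 2008),                                  # pre-crisis
        ("s", (df_dates["Year"] >= 2008) & (df_dates["Year"] <= 2012)),  # crisis
        ("P", df_dates["Year"] > 2012)]                                  # post-crisis

fig, ax = plt.subplots()
for marker, mask in eras:
    sub = df_dates[mask]
    sc = ax.scatter(sub["Count"], sub["Sum"], c=sub["Year"], marker=marker,
                    vmin=df_dates["Year"].min(), vmax=df_dates["Year"].max())
fig.colorbar(sc, label="Year")
ax.set_xlabel("Total number of transactions")
ax.set_ylabel("Total transaction value")
plt.show()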

Figure 6: The housing market contracting during the crisis and then expanding

Figure 6 shows the effect of the financial crisis nicely. The circles represent pre-crisis transactions. The squares represent transactions during the crisis. The plus symbol represents post-crisis transactions. 

The rapid decrease in transactions can be seen as the market contracted in 2007-2008. As the number and value of transactions start falling, the relative fall in the number of transactions is larger than the fall in their total value. This indicates that prices did fall, but the dominant effect was that fewer houses were being sold. Given the difficulty in getting a mortgage at the time, this reduction in the number of transactions could be caused by a lack of demand.

Discovering Data Clusters

Using a three class split (pre-crisis, crisis, post-crisis) provides some interesting results. These were described in the previous section. But what happens if a clustering algorithm is used on the data?

A clustering algorithm attempts to assign each observation to a cluster. Depending on the algorithm, the total number of clusters may be required as an input. Clustering is often helpful when trying to build initial models of the input data, especially when no labels are available. In that case, the cluster id (represented by the cluster centre) becomes the label. The following clustering algorithms were evaluated:

  1. k-means clustering
  2. Gaussian mixture model

The data-set for the clustering algorithm has three columns: Date, Monthly Transaction Sum and Monthly Transaction Count.

Given the claw-mark distribution of the data, it was highly unlikely k-means would give good results. That is exactly what we see in Figure 7 with a cluster count of 3 (chosen because we previously had three labels: before, during and after the crisis). The clustering seems to cut across the claws.

Figure 7: k-means clustering with a cluster count of 3 – total value of transactions (Y-axis) vs total number of transactions (X-axis)
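The k-means run is not included in the listing at the end of this section (which covers only the GMM), so here is a minimal sketch of it, assuming the same df_pure frame of Count, Sum and Year that the listing builds:

from sklearn.cluster import KMeans

# Three clusters, mirroring the pre-crisis / crisis / post-crisis split
km = KMeans(n_clusters=3, random_state=0).fit(df_pure)

plt.scatter(df_pure["Count"], df_pure["Sum"], c=km.labels_)
plt.show()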

If a Gaussian mixture model (GMM) is used with a component count of 3 and covariance type 'full' (using the sklearn implementation – see code below), some nice clusters emerge, as seen in Figure 8.

Figure 8: Gaussian Mixture model with three components.

Each of the components corresponds to a 'band' in the observations. The lowest band corresponds loosely with the pre-crisis market, while the middle (yellow) band somewhat expands the crisis market to include entries from before the crisis. Finally, the top-most band (green) corresponds nicely with the post-crisis market.

But what other number of components could we choose? Should we try other GMM covariance types ('spherical', 'diag' and 'tied', in addition to 'full')? To answer these questions we can compute the Bayesian Information Criterion (BIC) across different numbers of components and covariance types. The covariance type and component count that give the lowest BIC are preferred.

The result is shown in Figure 9.

Figure 9: BIC analysis of the data – BIC score against number of components (X-axis)

From Figure 9 it seems the 'full' type consistently gives the lowest BIC on the data-set. Furthermore, going from 3 to 4 components improves the BIC score (the lower the better). Another such jump occurs from 7 to 8. Therefore, the number of components should be 4 (see Figure 10) or 8 (see Figure 11).

Figure 10: Transaction value (Y-axis) against Total Number of Transactions (X-axis) – with 4 components.

Figure 11: Transaction value (Y-axis) against Total Number of Transactions (X-axis) – with 8 components.

The 4 component results (Figure 10), when compared with Figure 5, indicate an expansion at the start of the data-set (year: 1995): this is the jump from yellow to green. Then, during the crisis, there is a contraction (green to purple). Post-crisis there is another expansion (purple to blue). This is shown in Figure 12.

Figure 12: Expansion and contraction in the housing market

The 8 component results (Figure 11), when compared with Figure 5, show the stratification of the data-set based on the Year value. Within the different colours one can see multiple phases of expansion and contraction.

The interesting thing is that for both the 4 and 8 component models, the crisis-era cluster is fairly well defined.

Code for this is given below:

from matplotlib import pyplot as plt
from datetime import datetime as dt
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.mixture import GaussianMixture


csv = "c:\\ML Stats\\housing_sep_18_no_partial_mnth_cnt_sum.csv"


df = pd.read_csv(csv)


dates = list(map(lambda r: dt(int(r[1]["Year"]), int(r[1]["Month"]), 15), df.iterrows()))

df_pure = pd.DataFrame({"Date": dates, "Count": df.Count, "Sum": df.Sum, "Year": df.Year})


df_pure = df_pure.sort_values(["Date"])

df_pure = df_pure.set_index("Date")


bics = {}
for cmp in range(1, 10):
    # Fit one GMM per covariance type at this component count
    clust_sph = GaussianMixture(n_components=cmp, covariance_type='spherical').fit(df_pure)
    clust_tied = GaussianMixture(n_components=cmp, covariance_type='tied').fit(df_pure)
    clust_diag = GaussianMixture(n_components=cmp, covariance_type='diag').fit(df_pure)
    clust_full = GaussianMixture(n_components=cmp, covariance_type='full').fit(df_pure)

    clusts = [clust_full, clust_diag, clust_sph, clust_tied]
    bics[cmp] = []
    for c in clusts:
        bics[cmp].append(c.bic(df_pure))

# One line per covariance type: BIC score (Y) against component count (X)
plt.plot(list(bics.keys()), list(bics.values()))
plt.legend(["full", "diag", "sph", "tied"])
plt.show()

num_components = 4

# Fit the final model and label each month with its cluster id
clust = GaussianMixture(n_components=num_components, covariance_type='full').fit(df_pure)

lbls = clust.predict(df_pure)

df_clus = pd.DataFrame({"Count": df_pure.Count, "Sum": df_pure.Sum, "Year": df_pure.Year, "Cluster": lbls})
color = df_clus["Cluster"]

# Cluster-coloured views: value vs count, then count and value by year
fig, ax = plt.subplots()
ax.scatter(df_clus["Count"], df_clus["Sum"], c=color)

fig, ax2 = plt.subplots()
ax2.scatter(df_clus["Year"], df_clus["Count"], c=color)

fig, ax3 = plt.subplots()
ax3.scatter(df_clus["Year"], df_clus["Sum"], c=color)

plt.show()

Contains HM Land Registry data © Crown copyright and database right 2018. This data is licensed under the Open Government Licence v3.0.

Agile and Waterfall for Innovation

Agile methods are iterative and incremental. This, in theory, should prevent implementation death marches which end up with products that do not meet the customer’s needs.

Waterfall, on the other hand, is all about having predictable stages with clear milestones at the end of each stage. There is no concept of iterations or increments. All of a stage (like design) is done and validated, and only then is the next stage started. The concept here is that validation at the end of each stage keeps the implementation aligned.

Unfortunately, neither says anything about the twin human-factor problems of over-excitement and incompetence of the people involved.

Often, well understood and repeatable projects (like building a house) follow a waterfall model.

Agile better suits 'non-repeatable' projects such as building a software product, where each product has its own challenges, risks and 'hidden dangers' while being driven by changes in the 'business' environment. Therefore it is very important, when doing Agile for new and innovative products, to:

  1. give clear guidance about what is working and what is not working back into the process
  2. keep overall focus on the problem that the product is attempting to solve (i.e. always keep in sight that gold-paved road to sales)

If Agile allows software to 'flow' freely, then it needs a proper pipe for it to reach its destination (i.e. the hands of the customer). If the pipe's shape keeps changing, or it has leaks in it, there is no way the software will reach the right destination.

One thing that is very easy to do (especially in a startup environment) is to get too excited about the problem domain without staking out which specific parts of that domain the product addresses. Even worse is not sticking to those parts once they are identified! Agile methods can be abused to hide (but not ultimately solve) problems with changing requirements and scope creep, resulting in big failures.

This process is made more difficult by the fact that for a new product idea one needs to find a gap in the market. The irony is: the bigger the gap you find (the Total Addressable Market, or TAM), the more funding you are able to attract on the basis of future demand for your product, the bigger the promises made, and the higher the chance of getting lost in multiple interesting aspects of the 'gap', especially if there are conflicting views in management or if the gap area is not well understood.

Here multiple sources of tension exist: what constitutes the business's view of the minimum viable product (MVP), and how is it different from the potential customer's view (ideally both should be the same)?

The answer to the question 'which part of the MVP should we do first?' is the launching point for the Agile process. Ideally, a set of features is decided and then iterative and incremental development starts. As long as there is tough resistance to the business asking for massive changes of direction, and clear feedback flows into the development from the 'customer', the end result should be aligned with the initial expectations.

I believe for big gaps where both investors and company owners see big $$$ signs, the so-called 'disruptive innovation', it may be difficult to start off with Agile and maintain the discipline of clear feedback and clear definitions of done in terms of the MVP. In such a case it may be better to start with a waterfall model, with a low expectation of success, and then switch to Agile. Hopefully, after one attempt at waterfall, one will end up with a product that can be compared against the MVP concept, deltas calculated, and the gaps fed into an Agile process to be filled incrementally.

Why start with waterfall? Because waterfall imposes a strict condition of no iteration, so it is more difficult to abuse. It forces you to commit to requirements before doing any design, to commit to a design before writing any code, and so on. And as I said, in the end it gives you a good target to destroy when you start the Agile method. It can also give you the strength to push back on requirement changes and scope creep later on.

One may say that it is a waste to do waterfall. But one must remember that in an innovation or new-product environment the target is usually not to 'boil the ocean', so it may be possible to quickly attempt waterfall to get a starting point for Agile. Also, in most innovation environments the initial team size and skill distribution do not allow for a proper Agile implementation in any case; for example, it is not typical to find abundant testing or quality assurance resources.