A Question of Shared State

This post is about one of my favourite topics in Computer Science: shared state and the challenges related to concurrency.  Mature programming languages have constructs that allow us to manage concurrent access to state. Functional programming style is all about keeping ‘state’ out of the ‘code’ to avoid bugs when providing concurrent access. Databases also use different tricks to side-step this issue such as transactions and versioning.

Let us first define ‘shared state’: in general, it is any value that is accessible, for reading and/or writing, by one or more entities (processes, threads, humans) via a handle (variable name, URL, memory address). For example:

  1. in a database, it is the value associated with a key or stored in a cell of a table
  2. in a program, it is a value stored at a memory address

Conceptual Model

Figure 1: Conceptual Model

The conceptual model is shown in Figure 1. Time flows from left to right and the State changes as Entities write to it (E1 and E2).

Entities (E1 and E3) read the State some time later.

A few things to note:

  1. State existed (with some value ‘1’) before E1 changed it
  2. E1 and E2 are sequentially writing the State
  3. E1 and E3 (some time later) are concurrently reading the State

Now that we understand the basic building blocks of this model as Entities, State(s) and the Operations of reading and writing of State(s), let us look at where things can go wrong.

Concurrent Access Problems

The most common problem with uncontrolled concurrent access is non-deterministic behaviour. When two processes are writing and reading the same State at the same time, there is no way to guarantee the result of the read. Imagine writing a test that checks the result of the read in such a situation. It could be that the test passes on a slower system because the read thread has to wait (so the write has finished). Or it could be that the test passes on a faster system because the write happens quickly (and the read gets the right value). Worse yet: the test passes but the associated feature does not work in production, or works only sporadically.

For problems like the above to occur in a concurrent access scenario, all three of the factors given below must be present:

  1. State is shared between Entities (which can be multiple threads of the same process)
  2. State, once created, can be modified (mutability) – even if only by a sub-set of the Entities that have ‘write’ access to the State
  3. Unrestricted Concurrent Read and Write Operations are allowed on the State Store

All we need to do, to avoid problems when enabling concurrent access, is to block any one of the three factors given above.  If we do that then some form of parallel access will be available and it will be safe.

Blocking each factor can open the system up to other issues (there is no such thing as a free lunch!). One thing to keep in mind is that we are talking about a single shared copy of the State here; we are not discussing replicated shared State. Let us look at what happens when we block each of the factors.

Preventing Shared Access

Here we stop treating shared access as a desired feature: this option is the equivalent of blocking factor 1 (no shared state). This can be seen in Figure 2 (left) where two entities (E1 and E2) have access to different State Stores and can do whatever they want with them because they are not shared. E1 creates a State Store with value ‘1’ and E2 creates another State Store with value ‘A’. One thing to understand here is that the ownership of the State Store may be ‘transferred’ but not ‘shared’. In Figure 2 (right) the State Store is used by Entity E1 and then ownership is transferred to E2. E1 is not able to use the State Store once it has been transferred to E2.

This approach is used in the Rust programming language. Once a State Store (i.e. variable) has been transferred out, it can no longer be accessed from the previous owning Entity (function) unless the new owner returns the State Store. This is good because we don’t know what the new function will do with the variables (e.g. library functions). What if it spins up a new thread? If the new owner Entity does not return or further transfer the State Store, the memory being utilised for the store can be re-claimed. If this sounds like a bad idea don’t worry! Rust has another mechanism called ‘borrowing’ which we will discuss towards the end of this post.

The previous section described interactions between functions. What about between threads? Rust allows closure-based transfer of state between threads; in this context the ‘move’ keyword is used to transfer ownership of state to the closure. In Go, channels use a related concept: state is cloned (copied) before it is shared.

Figure 2: No shared state (left) and transfer of state (right)

This solution makes parallel computation difficult (e.g. we need to duplicate the State Store to make it local for every Entity). The results may then need to be collated for further processing. We can improve this by using the concept of transfer, but that leads to its own set of issues (e.g. ensuring short transfer chains).

Preventing Mutability

This option is an important part of the ‘functional’ style of programming. Most functional languages make creating immutable structures easier than creating mutable ones (e.g. special keywords have to be used or libraries imported to get mutability). Even in Object Oriented languages mutable State Stores are discouraged. For example, in Java the best practice is to make objects immutable and a whole range of immutable structures is provided. In certain databases, when state has to change, a new object with the changed state is created with some sort of version identifier. Older versions may or may not be available to the reader.

Figure 3: Immutable State

Figure 3 shows this concept in action. Entity E1 creates a State Store with value ‘1’. Later on, E2 reads the State and creates a new State Store with value ‘2’ (possibly based on the value of the previous State Store). If at that point there is no need to hold the State Store with value ‘1’, it may be reclaimed or marked for deletion.

Following on from this – the State Store with value ‘2’ is read by two Entities (E3 and E4) and they in turn produce new State Stores with values ‘3’ and ‘4’ respectively. The important thing to realise is that once the State Store with value ‘2’ has been created it cannot be modified, therefore it can be read concurrently as many times as required without any issues.

Where this does get interesting is if the two divergent state transitions (from value ‘2’ -> ‘3’ and ‘2’ -> ‘4’) have to be reconciled. Here, though, the issue is one of reconciliation logic rather than concurrent access, because the State Stores with values ‘3’ and ‘4’ are themselves immutable.

One problem with this solution is that if we are not careful about reclaiming resources used by old State Stores we can quickly run out of resources for new instances. Creating a new State Store also involves a certain overhead: the existing State Store must first be cloned. Java, for example, offers multiple utility methods that help create immutable State Stores from mutable ones.
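As an illustration, here is a minimal Java sketch of this copy-then-publish pattern, using the Collections utility methods mentioned above (the class and variable names are only illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImmutableStateExample {

    public static void main(String[] args) {
        // Build up a mutable structure first...
        List<String> mutable = new ArrayList<>();
        mutable.add("1");

        // ...then publish an immutable copy as the shared State Store.
        List<String> state1 = Collections.unmodifiableList(new ArrayList<>(mutable));

        // A "change" produces a brand new immutable State Store instead of
        // modifying the existing one.
        List<String> next = new ArrayList<>(state1);
        next.add("2");
        List<String> state2 = Collections.unmodifiableList(next);

        // state1.add("2") would throw UnsupportedOperationException here.
        System.out.println(state1 + " -> " + state2);
    }
}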

Restricting Concurrent Read and Write Operations

Trouble arises when we mix concurrent reads AND writes. If we are only ever reading then there are no issues (see the previous section). What if there were a way of synchronising access to the State Store between different Entities?

Figure 4: Preventing Concurrent Read and Write Operations

Figure 4 shows such a synchronisation mechanism that uses explicit locks. An Entity can access the State Store only when it holds the lock on the State Store. In Figure 4 we can see E1 acquire the lock and change the value of the State Store from ‘1’ to ‘2’. At the same time E2 wants to access the State Store but is blocked till E1 finishes. As soon as E1 finishes, E2 acquires the lock and changes the value from ‘2’ to ‘3’.

Some time later E2 starts to change the value from ‘3’ to ‘4’; while that is happening, E1 reads the value. Without synchronised access it would be difficult to guarantee what value E1 reads back. The value would depend on whether E2 has finished updating it or not. If we use locks to synchronise, E1 will not be able to read till E2 finishes updating.
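A minimal Java sketch of such a lock-protected State Store, assuming java.util.concurrent.locks (the class is illustrative, not from the original post):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockedStateStore {

    private final Lock lock = new ReentrantLock();
    private int value = 1; // the shared State Store

    // A writer must hold the lock before changing the value.
    public void write(int newValue) {
        lock.lock();
        try {
            value = newValue;
        } finally {
            lock.unlock();
        }
    }

    // A reader also acquires the lock, so it never observes a half-finished write.
    public int read() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }
}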

Using synchronisation mechanisms can lead to some very interesting bugs such as ‘deadlocks’ where under certain conditions (called Coffman conditions) Entities are never able to acquire locks on resources they need and are therefore deadlocked forever. Most often these bugs are very difficult to reproduce as they depend on various factors such as speed of processing, size of processing task, load on the system etc.

Concepts used in Rust

Since I have briefly mentioned Rust, it would be interesting to see how the solutions presented have been used in Rust to ensure strict scope checking of variables. Compile time checks ensure that the issues mentioned cannot arise.

  1. Transfer of State: Rust models two main types of transfer – a hard transfer of ownership and a softer ‘borrowing’ of a reference. If a reference is borrowed then the original State Store can still be used by the original owning Entity, as no transfer of ownership has taken place.
  2. Restrictions on Borrowing: Since references can be both mutable and immutable (the default type), we can quickly get into trouble with mutable references handed out alongside immutable ones. To prevent this from happening there is a strict rule about what types of references can exist together: we can have as many immutable references as we like at the same time OR a single mutable reference. Therefore the protection mechanism changes depending on the type of operations required on the State Store. For read and write – the State Store is not shared (single mutable reference). For parallel reads – the State Store cannot be changed. This keeps Rust on the right side of the trade-off (a rough runtime analogue in Java is sketched after this list).
  3. Move: This allows state transfer between threads. We need to be sure that any closure-based state transfer is also safe. It uses the same concept as shown in Figure 2, except that ownership is transferred to the closure and the State Store is no longer accessible from the original thread.
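Rust enforces ‘many readers OR one writer’ at compile time. Java has no compile-time equivalent, but a ReadWriteLock gives a loose runtime analogue of the same rule (this is my own comparison, not taken from the Rust documentation):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteStateStore {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int value = 1;

    // Many threads may hold the read lock at the same time...
    public int read() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    // ...but the write lock is exclusive: no readers and only one writer.
    public void write(int newValue) {
        lock.writeLock().lock();
        try {
            value = newValue;
        } finally {
            lock.writeLock().unlock();
        }
    }
}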

Go Channels

Go Channels provide state cloning and sharing out of the box. When you create a channel of a struct and send an instance of that struct over it then a copy is created and sent over the channel. Any changes made in the sender or receiver are made on different instances of the struct. 

One way to mess this up nicely is to use a channel of a pointer type. Then you are passing a reference, which means you are sharing state (and creating a bug for yourself to debug at a later date).

Conclusion

We have investigated a simple model to understand how shared state works. The next step would be to investigate how things change when we add replication to the shared state so that we can scale horizontally.

Revisiting Apache Karaf – Custom Commands

And we are not in Kansas any more!

I wanted to talk about the new-ish  Apache Karaf  custom command system. Things have been made very easy using Annotations. It took me a while to get all the pieces together as most of the examples out there were using the deprecated Command system.

To create a custom command in Karaf shell we need the following:

  1. Custom Command Class (one per Custom Command)
  2. Entry in the Manifest to indicate that Custom Commands are present (or the correct POM entry if using maven-bundle-plugin and package type of ‘bundle’)

This is a lot simpler than before where multiple configuration settings were required to get a custom command to work.

The Custom Command Class

This is a class that contains the implementation of the command, it also contains the command definition (including the name, scope and arguments). We can also define custom ‘completers’ to allow tabbed command completion. This can be extended to provide state-based command completion (i.e. completion can adapt to what commands have been executed previously in the session).

A new instance of the Custom Command Class is spun up every time the command is executed so it is inherently thread-safe, but we have to make sure any heavy lifting is not done directly by the Custom Command Class. [see here]

Annotations

There are a few important annotations that we need to use to define our own commands:

@Command

Used to Annotate the Custom Command Class – this defines the command (name, scope etc.).

 

@Service

Placed after @Command, just before the Custom Command Class definition starts. It ensures there is a standard way of getting a reference to the custom command.

 

@Argument

Within the Custom Command Class, used to define arguments for your command. This is required only if your command requires command line arguments (obviously!).

 

@Reference

Another optional annotation – used if your Custom Command Class requires references to other services/beans. The important point to note here is that the Custom Command Class (if you use the auto-magical way of setting it up) needs to have a default no-args constructor. You cannot do custom instantiation (or at least I was not able to find a way – please comment if you know how) by passing in any beans/service refs your command may require to work. These can only be injected via the @Reference annotation. The reason for this is pretty straightforward: we want loose coupling (via interfaces) so that we can swap out the Services without having to change any config/wiring files.

 

Example

Let us create a simple command which takes in a top level directory location and recursively lists all the files and folders in it.

Now we want to keep the traversal logic separate and expose it as a ‘Service’ from the Custom Command Class because it is a highly re-usable function.
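The exact shape of that service is up to you; a minimal sketch of such an interface (the name DirectoryService matches the listing below, but the method itself is purely illustrative) could be:

package custom.command;

import java.util.List;

// Hypothetical service interface backing the command; the traversal logic
// lives behind this interface so it can be reused outside the shell.
public interface DirectoryService
{
    // Recursively list all files and folders under the given directory.
    List<String> list(String topLevelDirectory);
}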

The listing below will declare a command to be used as:

karaf prompt> custom:listdir 'c:\\top_level_dir\\'

package custom.command;
 
import org.apache.karaf.shell.api.action.Action;
import org.apache.karaf.shell.api.action.Argument;
import org.apache.karaf.shell.api.action.Command;
import org.apache.karaf.shell.api.action.lifecycle.Reference;
import org.apache.karaf.shell.api.action.lifecycle.Service;
 
@Command(name = "listdir", scope = "custom", description = "list all files and folders in the provided directory")
@Service
public class DirectoryListCommand implements Action
{
    // Inject our directory service to provide the listing.
    @Reference
    DirectoryService service;
 
    // Command line arguments - just one argument.
    @Argument(index = 0, name = "topLevelDir", required = true, description = "top level directory absolute path")
    String topLevelDirectory = null;
 
    // Creating a no-args constructor for clarity.
    public DirectoryListCommand()
    {}
 
    // Logic of the command goes here.
    @Override
    public Object execute() throws Exception
    {
        // Use the directory service we injected to get the
        // listing and print it to the console.
 
        return "Command executed";
    }
}

The main thing in the above listing is the ‘Action’ interface which provides an ‘execute’ method to contain the logic of the command. We don’t see a ‘BundleContext’ anywhere to help us get the Service References because we use the @Reference tag and inject what we need.

This has the positive side effect of forcing us to create a Service out of the File/Directory handling functionality, thereby promoting re-use across our application. Previously we would instead have used the BundleActivator to initialise commands out of POJOs and register them with Karaf.

Declaring the Custom Command

To declare the presence of custom commands you need to add the following tag in the MANIFEST.MF file within the bundle:

Karaf-Commands: *

That will work for any command. Yes! it is a generic flag that you need to add in the Manifest. It tells Karaf that there are custom commands in this bundle.

Listing below is from an actual bundle:

Bundle-Name: nlp.digester
Bundle-SymbolicName: rd.ml.Digester
Bundle-Version: 0.0.1
Karaf-Commands: *
Export-Package: rd.ml.digester.service

Artificial Neural Networks: Problems with Multiple Hidden Layers

In the last post I described how we work with the Multi-Layer Perceptron (MLP) model of artificial neural networks. I had also shared my repository on GitHub (https://github.com/amachwe/NeuralNetwork).

I have now added Single Perceptron (SP) and Multi-class Logistic Regression (MCLR) implementations to it.

The idea is to set the stage for deep learning by showing where these types of ANN models fail and why we need to keep adding more layers.

Single Perceptron:

Let us take a step back from an MLP network to a Single Perceptron to delve a bit deeper into its workings.

The Single Perceptron (with a single output) acts as a Simple Linear Classifier. In other words, for a two-class problem, it finds a single hyper-plane (n-dimensional plane) that separates the inputs based on their class.

 

Neural Network Single Perceptron

The image above describes the basic operation of the Perceptron. It can have N inputs with each input having a corresponding weight. The Perceptron itself has a bias or threshold term. A weighted sum is taken of all the inputs and the bias is added to this value (a linear model). This value is then put through a function to get the actual output of the Perceptron.

The function is an activation function with the simplest case being the so-called Step Function:

f(x) = 1 if [ Sum(weight * input) + bias ] > 0

f(x) = -1 if [ Sum(weight * input) + bias ] <= 0

The Perceptron might look simple but it is a powerful model. Implement your own version, or walk through my example (rd.neuron.neuron.perceptron.Perceptron) and the associated tests on GitHub, if you are not convinced.
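For illustration only (this is not the repository implementation), a bare-bones perceptron with the step activation described above might look like this in Java:

// Minimal illustrative perceptron: weighted sum plus bias, followed by a step function.
public class SimplePerceptron {

    private final double[] weights;
    private final double bias;

    public SimplePerceptron(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // Returns +1 if the linear combination is positive, -1 otherwise.
    public int output(double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum > 0 ? 1 : -1;
    }
}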

At the same time not all two-class problems are created equal. As mentioned before, the Single Perceptron partitions the input space using a hyper-plane to provide a classification model. But what if no such separation exists?

Single Perceptron – Linear Separation

The image above represents two very simple test cases: the 2-input AND and XOR logic gates. The two inputs (A and B) can take values of 0 or 1. The single output can similarly take the value of 0 or 1. The colourful lines represent the model which should separate the two classes of output (0 and 1 – the white and black dots) and allow us to classify incoming data.

For AND the training/test data is:

  • 0, 0 -> 0 (white dot)
  • 0, 1 -> 0 (white dot)
  • 1, 0 -> 0 (white dot)
  • 1, 1 -> 1 (black dot)

We can see it is very easy to draw a single straight line (i.e. a linear model with a single neuron) that separates the two classes (white and black dots), the orange line in the figure above.

For XOR the training/test data is:

  • 0, 0 -> 0 (white dot)
  • 0, 1 -> 1 (black dot)
  • 1, 0 -> 1 (black dot)
  • 1, 1 -> 0 (white dot)

For the XOR it is clear that no single straight line can be drawn that can separate the two classes. Instead what we need are multiple constructs (see figure above). As the single perceptron can only model using a single linear construct it will be impossible for it to classify this case.

If you run a test with the XOR data you will find that the accuracy comes out to be 50%. That might sound good, but it is exactly the same accuracy you would get by constantly guessing one of the classes when the classes are equally distributed. For the XOR case here, as the 0’s and 1’s are equally distributed, if we kept guessing 0 or 1 constantly we would still be right 50% of the time.

Contrast this with a Multi-Layer Perceptron, which gives an accuracy of 100%. What is the main difference between an MLP and a Single Perceptron? Obviously the presence of multiple Perceptrons organised in layers! This makes it possible to create models with multiple linear constructs (hyper-planes), which are represented by the blue and green lines in the figure above.

Can you figure out how many units we would need as a minimum for this task? Read on for the answer.

Solving XOR using MLP:

If you reasoned that the XOR example needs 2 hyper-planes, and therefore 2 Perceptrons would be required, you would be correct!

Such an MLP network would usually be arranged in a 2 -> 2 -> 1 formation, where we have two input nodes (as there are 2 inputs, A and B), a hidden layer with 2 Perceptrons and an aggregation layer with a single Perceptron to provide a single output (as there is just one output). The input layer doesn’t do anything interesting except present the values to both hidden layer Perceptrons. So the main differences between this MLP and a Single Perceptron are that:

  • we add one more processing unit (in the hidden layer)
  • we add an aggregation unit (the output layer) to collect the output into a single variable

If you check the activation of the individual Perceptrons in the hidden layer (i.e. the processing layer) of an MLP trained for XOR, you will find a pattern in the activation when presented with a type 1 Class (A = B – white dot) and when presented with a type 2 Class (A != B – black dot). One possibility for such an MLP is that:

  • For Class 1 (A = B – white dot): Both the neurons either activate or not (i.e. outputs of the 2 hidden layer Perceptrons are comparable – so either both are high or both are low)
  • For Class 2 (A != B – black dot): The neurons activate asymmetrically (i.e. there is a clear difference between the outputs of the 2 hidden layer Perceptrons)

Conclusions:

Thus there are three takeaways from this post:

a) To classify more complex and real world data which is not linearly separable we need more processing units, these are usually added in the Hidden Layer

b) To feed the processing units (i.e. the Hidden Layer) and to encode the input we utilise an Input Layer which has only one task – to present the input in a consistent way to the hidden layer; it will not learn or change as the network is trained.

c) To work with multiple Hidden Layer units and to encode the output properly we need an aggregation layer to collect the output of the Hidden Layer; this aggregation layer is also called an Output Layer.

 

I would again like to bring up the point of input representation and encoding of output:

  • We have to be careful in choosing the right input and output encoding
  • For the XOR and other logic gate examples we can simply map the number of bits to the number of inputs/outputs, but what if we were trying to process handwritten documents – would you have one character per output? How would you organise the inputs given that the handwritten data can be of different lengths?

In the next post we will start talking about Deep Learning, as we have now seen two very important reasons why so-called shallow networks fail.

Thoughts on Error Handling

Most code has natural boundaries as defined by classes, functions and remote interfaces.

The execution path for a program creates a chain of calls across these boundaries, tears it down as the calls complete and again builds it up as new calls are made.

All is well till one of the calls does not complete successfully. Then an exception is thrown which travels all the way up the chain and somewhere along the line it comes across your code. Or maybe it was a call to your code that does not complete successfully!

What to do when this happens? How to handle the exception?

Do you log it and carry on, do you stop the execution and bomb out, or do you just carry on pretending nothing is wrong?

There is no single right answer to this question, just a set of good options that you get to pick from:

1) Log a warning message

This option is easy to understand and easier to forget while writing code. It should be combined with all the other options to give better visibility.

The key to effective logging is first choosing the right Logging API and then using the chosen Logging API correctly! It is a common feature of software to have too little or too much logging. Another is the bad use of error levels, where Level ERROR gives a trickle of messages whereas Level INFO floods the logs with messages. Level WARN is often bypassed and Level DEBUG is often misused for ‘machine-gun’ logging.

For secure systems logging should be done carefully so as to not expose any information in an unencrypted log file (e.g. logging user credentials, database server access settings etc.).

Use Level ERROR for when you cannot continue with normal execution (e.g. required data files are missing or required data is not valid)

Use Level WARN for when you can continue but with limited functionality (e.g. not able to connect to remote services – waiting to retry)

Use Level INFO for when you want to inform the user about interesting events (like successfully established a connection or processed a certain number of records)

Use Level DEBUG for when you want to peek under the hood of the application (like logging properties used to initiate a connection or requests sent/response received – beware this is not very secure if logged to a general-access plain text file)

This option should be used no matter which of the other options is chosen. There is nothing as annoying as an application failing with just an error message and nothing in the logs, or an exception flashing on the console a second before it closes.
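A small sketch of these levels in use, assuming an SLF4J-style logger (the class and the file-handling details are purely illustrative):

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RecordLoader {

    private static final Logger LOG = LoggerFactory.getLogger(RecordLoader.class);

    public void load(String path) {
        LOG.debug("Loading records from {}", path);                  // DEBUG: peek under the hood
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            long count = reader.lines().count();
            LOG.info("Processed {} records from {}", count, path);   // INFO: interesting event
        } catch (FileNotFoundException e) {
            LOG.error("Required data file is missing: {}", path, e); // ERROR: cannot continue
        } catch (IOException e) {
            LOG.warn("Could not read {}, will retry later", path, e); // WARN: degraded, retrying
        }
    }
}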

2) Return a constant neutral value

In case of a problem we return a constant neutral value and carry on as if nothing happened. For example, if we are supposed to return a Set of objects (either from our code or by calling another method) and we are unable to do that for some reason, then we can return a blank Set with no items – this would be a constant Set variable which is returned as a neutral value.

For the code that calls this method, we absorb the exception propagation. The only way the calling code can detect any problems is if it treats the returned ‘neutral’ value as an ‘illegal value’. It can use one of the options presented here or ignore it and carry on.

Best Practice: If you are using neutral constant return value(s) in case of an error, make sure you do two things: log the error internally for your reference and, if it is an API method, document the fact. This will make sure anyone who calls your code knows the constant neutral value(s) and can treat them as illegal if required.

Another way to use a neutral constant value is to define a max and min range for the return value. In case the actual value is above the max or below the min value then replace it with the relevant constant value (MAX_VALUE or MIN_VALUE).
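A sketch of both ideas in Java (the class, method and range names are hypothetical):

import java.util.Collections;
import java.util.Set;

public class CustomerLookup {

    private static final double MIN_SCORE = 0.0;
    private static final double MAX_SCORE = 100.0;

    // Returns the customer's tags, or an empty (neutral) set if the lookup fails.
    // The API documentation should say that an empty set may mean "lookup failed".
    public Set<String> findTags(String customerId) {
        try {
            return queryTags(customerId);
        } catch (RuntimeException e) {
            // Log the error internally, then absorb it and return the neutral value.
            return Collections.emptySet();
        }
    }

    // Clamp a computed score into the allowed [MIN_SCORE, MAX_SCORE] range.
    public double clampScore(double score) {
        if (score > MAX_SCORE) return MAX_SCORE;
        if (score < MIN_SCORE) return MIN_SCORE;
        return score;
    }

    private Set<String> queryTags(String customerId) {
        // Placeholder for a real lookup (database, remote service, etc.).
        throw new IllegalStateException("backing store unavailable");
    }
}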

3) Substitute previous/next valid piece of data

In case of a problem we return the last known or next available valid value. This is fairly useful at the edge of your system where you are dealing with data streams or large quantities of data and it is required that all calls return valid data rather than throw exceptions or revert to constant values (for example a stream of currency data where one call to the remote service fails). You would also want to provide a neutral constant value in case there are issues at the beginning, when no valid values are present yet.

For the calling code this provides no mechanism to detect any exceptions down the chain, so the called code that implements this behaviour absorbs all exceptions. That is why this is really useful at the edge of your system when dealing with other remote services, databases and files. If you use this technique, make sure you log the fact that you are skipping some invalid values till you get a valid one, or that you have not been able to get a new valid value so you are re-using the previous one. That will make sure you can detect issues with the remote systems and inform the user (e.g. database login credentials not valid, remote service unavailable or a few data file entries are corrupt) while making sure your internal code remains stable.

Also make sure you document this behaviour properly!
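A minimal sketch of a ‘last known good value’ wrapper in Java (PriceFeed and the currency example are hypothetical):

// Wraps a price feed and substitutes the last known good value when a call fails.
public class CachingPriceFeed {

    public interface PriceFeed {
        double latestPrice(String symbol) throws Exception;
    }

    private final PriceFeed delegate;
    private volatile double lastGoodPrice = Double.NaN; // neutral value until the first success

    public CachingPriceFeed(PriceFeed delegate) {
        this.delegate = delegate;
    }

    public double latestPrice(String symbol) {
        try {
            lastGoodPrice = delegate.latestPrice(symbol);
        } catch (Exception e) {
            // Log that we are re-using the previous value so remote issues remain visible.
        }
        return lastGoodPrice;
    }
}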

4) Return an error code response

This is fairly useful when building a remote or packaged API for external consumption especially when indicating internal errors which the user can do little about. Some examples include: an internal service is no longer responding, internal file I/O errors, issues related to memory management on the remote system etc.

Error codes make it easier for users to log trouble tickets with the help-desk.

Once with the help-desk the trouble ticket can then be routed based on the error code (e.g. does O&M Team just need to restart a failed service or is this a memory leak issue which needs to be passed on to the Dev Team).

We should be careful not to return error codes for issues that can be resolved by the user. In those cases a descriptive error message is the way to go.

As an example: assume you have a form which takes in personal details of the user and then uses one or more remote services to process that data.

– For form validations (email addresses, telephone numbers etc.) we should return a proper descriptive error message.

– For issues related to network connectivity (remote service not reachable) we should return a proper descriptive error message.

– For issues related to the remote service which the user cannot do anything about (as described earlier) the error code should be returned with link to the help-desk contact details and perhaps more information (maybe an auto generated trouble ticket id – see next section).

5) Call an error processing routine/service

This is one where we detect an error response and call an error processing routine or service. This is especially useful not just for complex rule-based logging but also for automatic error reporting, trouble ticket creation, service performance management, self-monitoring etc.

It is often useful to have a service that encapsulates error handling logic rather than have your catch block or return value checks peppered with if-else blocks.

In this case the error response or exception is passed on to a service or routine that encapsulates the error processing logic. Some of the things that such a service or routine might do:

– Decide which log file to log the error in

– Decide the level of the error and create self-monitoring events and/or change life-cycle state of the system (restart, soft-shutdown etc.)

– Interface with trouble ticketing systems (e.g. when you get a major exception in Windows 7 OS it offers to send details to Microsoft)

– Interface with performance monitoring systems to report the health of the service
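As a sketch, such a service can be as simple as an interface that catch blocks hand the failure to (all names here are illustrative):

public interface ErrorProcessingService {

    // Centralises the handling logic so callers do not branch inline on every failure.
    void process(ErrorReport report);
}

// Minimal report object carrying whatever the handling logic needs.
class ErrorReport {
    final String component;
    final Throwable cause;

    ErrorReport(String component, Throwable cause) {
        this.component = component;
        this.cause = cause;
    }
}

// Usage from a catch block: errorService.process(new ErrorReport("file-import", e));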

6) Shutdown (Fail-fast)

This means that the system is shut down or made unavailable as soon as any exception of significance is detected.

This behaviour is often required from critical pieces of software which should not work in a degraded state (so called mission critical software). For example you don’t want the auto-pilot of an A380 to work when it is getting internal errors while performing I/O. You want to kill that instance and switch over to a secondary system or warn the pilot and immediately transfer control to manual.

This is also very important for systems that deal with sensitive data such as online-banking applications (it is better to be not available to process online payments than to provide unreliable service). Users might accept a ‘Site Down’ notice but they will definitely NOT accept incorrect processing of their online payment instructions.

From the example above, because we failed fast and made the banking web-site unavailable we did not allow the impact of the error to spread to the user’s financial transaction.

 

Java: Getting a webpage using java.net

There is an easy way to get a web page as an HTML string. In fact there are two basic ways to do this using the basic java.net package. Why would we need this?

Some use-cases include:

  • Getting information from a web-page for text aggregation (say from a news site)
  • Creating a ‘page filter’ which pre-processes web-pages to strip out unsafe content

For both examples we will use the basic java.net.URL and java.net.URLConnection classes in the java.net package.

 

Type 1: Encapsulated Side-effects

The first example opens up the URL connection and obtains an InputStream from it. A BufferedReader wraps the InputStream, which is then read into a StringBuilder. Then we return the String containing the HTML page source (if all goes well).

The code listing:

 

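The original listing was embedded as an image; a sketch of the approach described above (the class and method names are my own) could look like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageFetcher {

    // Fetch the page at the given address and return its source as a String.
    // The side-effect of talking to the network stays inside this method.
    public static String fetchPage(String address) throws IOException {
        URL url = new URL(address);
        URLConnection connection = url.openConnection();
        StringBuilder page = new StringBuilder();
        InputStream in = connection.getInputStream();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        } finally {
            in.close(); // always close the stream, as noted below
        }
        return page.toString();
    }
}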

A few points to remember:

– Always use a StringBuilder instead of the concat ‘+’ operator: Strings are immutable, which means every time you call ‘+’ a new String object is born, whereas StringBuilder uses a character array internally.

– Always make sure to have proper exception handling and a finally block which closes the InputStream.

This code encapsulates the ‘side-effect’ of reading from a remote URL and returns the data as a String.

Type 2: Exposed Side-effects

The second example is very simple. It is a sub-set of the code shown in the previous section.

In this second example we simply open the URL connection and obtain an InputStream from it which we return to the calling program. The responsibility of using the InputStream to get the page source is left to the calling function. This is especially useful if you want to work directly with an InputStream instead of a String representation. One such example is when using a parser to parse the page source.

The big disadvantage of using this method is that it exposes the side-effect related code to the main application. For example if the Internet connection goes down or the server goes down while the InputStream is being read the calling application will encounter an error and therefore behave unpredictably.

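Again, the original listing was an image; a minimal sketch of the stream-returning variant (names are my own) might be:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class PageStreamFetcher {

    // Open the connection and hand the raw InputStream back to the caller.
    // The caller now owns the stream and must handle errors and close it.
    public static InputStream openPageStream(String address) throws IOException {
        URL url = new URL(address);
        URLConnection connection = url.openConnection();
        return connection.getInputStream();
    }
}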

The way to get the best of both worlds (encapsulated side-effects and providing an InputStream to a calling function) is to use Example 1 and return a String object which can then be converted into a ‘byte stream’.

Efficient Data Load using Java with Oracle

Getting your data from its source (where it is generated) to its destination (where it is used) can be very challenging, especially when you have performance and data-size constraints (how is that for a general statement?).

The standard Extract-Transform-Load sequence explains what is involved in any such Source -> Destination data-transfer at a high level.

We have a data-source (a file, a database, a black-box web-service) from which we need to ‘extract’ data, then we need to ‘transform’ it from source format to destination format (filtering, mapping etc.) and finally ‘load’ it into the destination (a file, a database, a black-box web-service).

In many situations, using a commercial third-party data-load tool or a data-loading component integrated with the destination  (e.g. SQL*Loader) is not a viable option. This scenario can be further complicated if the data-load task itself is a big one (say upwards of 500 million records within 24 hrs.).

One example of the above situation is when loading data into a software product using a ‘data loader’ specific to it. Such ‘customized’ data-loaders allow the decoupling of the products’ internal data schema (i.e. the ‘Transform’ and ‘Load’ steps) from the source format (i.e. the ‘Extract’ step).

The source format can then remain fixed (a good thing for the customers/end users) and the internal data schema can be changed down the line (a good thing for product developers/designers), simply by modifying the custom data-loader sitting between the product and the data source.

In this post I will describe some of the issues one can face while designing such a data-loader in Java (1.6 and upwards) for an Oracle (11g R2 and upwards) destination. This is not a comprehensive post on efficient Java or Oracle optimization. This post is based on real-world experience designing and developing such components. I am also going to assume that you have a decent ‘server’ spec’d to run a large Oracle database.

Preparing the Destination 

We prepare the Oracle destination by making sure our database is fully optimized to handle large data-sizes. Below are some of the things that you can do at database creation time:

– Make sure you use BIGFILE table-spaces for large databases. BIGFILE table-spaces provide efficient storage for large databases.

– Make sure you have large enough data-files for TEMP and SYSTEM table-space.

– Make sure the constraints, indexes and primary keys are defined properly as these can have a major impact on performance.

For further information on Oracle database optimization at creation time you can use Google (yes! Google is our friend!).

 

Working with Java and Using JDBC

This is the first step to welcoming the data into your system. We need to extract the data from the source using Java, transform it and then use JDBC to inject it into Oracle (using the product-specific schema).

There are two separate interfaces for the Java component here:

1) Between Data Source and the Java Code

2) Between the Java Code and the Data Destination (Oracle)

Between Data Source and Java Code

Let us use a CSV (comma-separated values) format data-file as the data-source. This will add a bit of variety to the example.

Using a ‘BufferedReader’ (java.io) one can easily read gigabyte-size files line by line. This works best if each line in the CSV contains one data row, so that we can read-process-discard each line. Not needing to store more than a line at a time in memory allows your application to have a small memory footprint.
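A sketch of this read-process-discard loop (the naive split(",") stands in for whatever real CSV parsing you need):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvReader {

    // Read a large CSV file one line at a time: read, process, discard,
    // keeping the memory footprint small regardless of file size.
    public static void process(String csvPath) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(","); // naive split; real CSV may need a proper parser
                // transform the fields and hand the row to the loader here
            }
        }
    }
}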

Between the Java Code and the Destination

The second interface is where things get really interesting: making Java work efficiently with Oracle via JDBC. Here the most important feature while inserting data into the database, the one you cannot do without, is the batched prepared statement. Using Prepared Statements (PS) without batching is like taking two steps forward and ten steps back; in fact, using PS without batching can be worse than using normal statements. Therefore always use PSs, batch them together and execute them as a batch (using the executeBatch method).

A point about the Oracle JDBC drivers: make sure the batch size is reasonable (i.e. less than 10K). This is because, when using certain versions of the Oracle JDBC driver, if you create a very large batch the batched insert can fail silently while you are left feeling pleased that you just loaded a large chunk of data in a flash. You will discover the problem only if you check the row count in the database after the load.
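A minimal sketch of batched inserts with a modest batch size (the table and column names are invented for the example):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInserter {

    private static final int BATCH_SIZE = 100; // keep batches small, as discussed above

    // Insert rows using a single batched PreparedStatement.
    public static void insert(Connection connection, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO customer (id, name) VALUES (?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // flush the current batch
                }
            }
            ps.executeBatch(); // flush any remaining statements
        }
    }
}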

If the data-load involves sequential updates (i.e. a mix of inserts, updates and deletes) then batching can still be used without destroying data integrity. Create separate batches for the insert, update and delete prepared statements and execute them in the following order:

  1. Insert batches
  2. Update batches
  3. Delete batches

One drawback of using batches is that if a statement in the batch fails, it fails the whole batch, which makes it problematic to detect exactly which statement failed (another reason to use small batch sizes of ~100).

Constraints and Primary Keys

The Constraints and Primary Keys (CoPs) on the database act as the gatekeepers at the destination. The data-load program is like a truck driving into the destination with a lot of cargo (data).

CoPs can either be disabled while the data-load is carried out or they can remain on. If we disable them during the data-load, then when re-enabling them we can have Oracle check the existing data against them, or ignore existing data and only enable them for any new operations.

Whether CoPs are enabled or disabled, and whether post-load validation of existing data is carried out, can have a major effect on the total data-load time. We have three main options when it comes to CoPs and data loading:

  1. Obviously the quickest option, in terms of our data-load, is to drive the truck through the gates (CoPs disabled) and dump the cargo (data) at the destination, without stopping for a check at the gate or after unloading (CoPs enabled for future changes but existing data not validated). This is only possible if the contract with the data-source provider puts the full responsibility for data accuracy with the source.
  2. The slowest option will be if the truck is stopped at the gates (CoPs enabled), unloaded and each cargo item examined by the gatekeepers (all the inserts checked for CoPs violations) before being allowed inside the destination.
  3. A compromise between the two (i.e. the middle path) would be to allow the truck to drive into the destination (CoPs disabled), unload the truck and, at the time of transferring the cargo to the destination, check it (CoPs enabled after the load and existing data validated).

The option chosen depends on the specific problem and the various data-integrity requirements. It might be easier to do the data file validation ‘in memory’ before an expensive data-load process is carried out, and then we can use the first option.
Indexes

We need indexes and primary keys for performing updates and deletes (try a large table update or delete with and without indexes – then thank God for indexes!).

If your data load consists only of inserts and you are loading data into an empty or nearly empty table (w.r.t. the amount of data being loaded), it might be a good idea to drop any indexes on it before starting the load.

This is so because as data is inserted into a table, any indexes on it are updated as well, which takes additional time. If the table already contains a lot of data as compared to the size of the new data being loaded, then the time saved by dropping indexes will be wasted when trying to rebuild the index.

After the data is loaded we need to rebuild any dropped indexes and re-enable CoPs. Be warned that re-building indexes and re-enabling CoPs can be very time consuming and can take a lot of SYSTEM and TEMP space.
Logging

Oracle, being a ‘safe’ database, maintains a ‘redo’ log so that in case of a database failure we can perform recovery and return the database to its original state. This logging can be disabled by using the nologging option, which can lead to a significant performance boost for inserts and index creation.

A major drawback of using nologging is that you lose the ability to ‘repeat’ any operations performed while this option is set. When using this option it is very important to take a database backup before and after the load process.

Nologging is something that should be used judiciously and with a lot of planning to handle any side-effects.
Miscellaneous

There are several other, more exotic techniques for improving large data loads on the Oracle side, such as partitioned tables. But these require more than ‘basic’ changes to the destination database.

Data-loading optimization for ‘big’ data is like a journey without end. I will keep updating this post as I discover new things. Please feel free to share your comments and suggestions with me!

 

 

iProcess Connecting to Action Processor through Proxy and JMS – Part 2

Following on from the first part, where I explained the problem scenario and outlined the solution, in this post I present the implementation.

We need to create the following components:

1) Local Proxy – which will be the target for the Workspace Browser instead of the Action Processor which is sitting inside the ‘fence’ and therefore not accessible over HTTP.

2) Proxy for JMS – Proxy which puts the http request in a JMS message and gets the response back to the Local Proxy which returns it to the Workspace Browser.

3) JMS Queues – Queues to act like channels through the ‘fence’.

4) Service for JMS – Service to handle the requests sent over JMS inside the ‘fence’ and to send the response back over JMS.

 

I will group the above into three ‘work-packages’:

1) JMS Queues – TIBCO EMS based queues.

2) Proxy and Service for JMS – implemented using BusinessWorks.

3) Local Proxy – implemented using JSP.

 

Creating the TIBCO EMS Queues

Using the EMS administrator create two queues:

1) iPRequestQueue – to put requests from Workspace Browser.

2) iPResponseQueue – to return response from Action Processor.

Command: create queue <queue name>

 

Proxy and Service for JMS

For both the Proxy and Service to work we will need to store the session information and refer to it when making HTTP requests to the Action Processor. To carry through the session information we use the JMS Header field: JMSCorrelationID.

Otherwise we will get a ‘There is no node context associated with this session, a Login is required.’ error. We use a Shared-Variable resource to store the session information.

Proxy for JMS:

The logic for the proxy is simple.

1) HTTP Receiver process starter listens to requests from the Workspace Browser.

2) Upon receiving a request it sends the request content to a JMS Queue sender which inserts the request in a JMS message and puts it on the iPRequestQueue.

3) Then we wait for the JMS Message on iPResponseQueue which contains the response.

4) The response data is picked up from the JMS Message and sent as response to the Workspace Browser.

5) If the returned response is a complaint about ‘a Login is required’ then remove any currently held session information in the shared variable (so that we can get a fresh session next time).

In the HTTP Receiver we will need to add two parameters ‘action’ and ‘cachecircumvention’ with ‘action’ as a required parameter. The ‘action’ parameter value will then be sent in the body of the JMS Message through the ‘fence’.

In the HTTP Response we put the response JMS Message’s body in as both ASCII and binary content (converting the text to base64); the session information from JMSCorrelationID goes into the Set-Cookie HTTP header of the response; the Content-Type header of the response will be “application/xml;charset=utf-8”; Date can be set to the current date and Content-Length to the length of the ASCII content (using the string-length function).

 

Service for JMS:

The logic for the Service sitting inside the fence waiting for requests from the Proxy, over JMS, is as follows:

1) JMS Queue Receiver process starter is waiting for requests on iPRequestQueue.

2) On receiving a message it sends the request from the JMS Message body to the Action Processor using Send HTTP Request activity.

3) A Get Variable activity gets us the session information to use in the request to the Action Processor.

4) The response is then sent to a JMS Queue Sender activity which sends the response out as a JMS Message on iPResponseQueue.

5) If the session information shared variable is blank then we set the session information received in the response.

The Send HTTP Request will also have two parameters: ‘action’ and ‘cachecircumvention’ (optional). We will populate the ‘action’ parameter with the contents from the received JMS Message’s body. The session information will be fetched from the shared variable and put in the Cookie header field of the request. We will also put the contents of the JMS Message’s body in PostData field of RequestActivityInput. Make sure also to populate the Host, Port and Request URI to point to the ActionProcessor.

As an example, if your Action Processor is located at http://CoreWebServer1:8080/TIBCOActProc/ActionProcessor.servlet [using the servlet version] then Host = CoreWebServer1, Port = 8080 and RequestURI = /TIBCOActProc/ActionProcessor.servlet. If you expect these values to change, make them into global variables.

 

Local Proxy

This was the most important, difficult and frustrating component to create. The reason I am using a local proxy based on JSP and not implementing the functionality in the BW Proxy was given in the first part, but to repeat it here in one line: using a Local Proxy allows us to separate the ‘behaviour’ of the Action Processor from the task of sending the message through the ‘fence’.

The source jsp file can be found here.

The logic for the proxy is as follows:

1) Receive the incoming request from the Workspace Browser.

2) Forward the request received from the Workspace Browser, as it is, to the BusinessWork Proxy.

3) Receive the response from the BusinessWork Proxy also get any session information from the response.

4) Process the response:

a) Trim the response and remove any newline/carriage-return characters.

b) Check the type of response and take appropriate action – if the response contains HTML then set the content type in the header to “text/html”, if the response is an HTTP address then redirect the response, and if it is normal XML then set the content type to “application/xml”.

c) Set session information in the header.

5) Pass on the response received from the BusinessWorks Proxy, back to the Workspace Browser.

 

Once everything is in place we need to change the Action Processor location in the Workspace Browser config.xml file. This file is located in the <iProcess Workspace Browser Root>/JSXAPPS/ipc/ folder. Open up the XML file and locate the <ActionProcessor> tag. Change the ‘baseUrl’ attribute to point it to the Local Proxy. Start the BW Proxy and Service, the iProcess Process Sentinels, the web server for the ActionProcessor and the Workspace Browser. Also test whether the Local Proxy is accessible by typing the location into a browser window.

The screenshots given below show the proxy setup in action. The Local Proxy is at: http://glyph:8080/IPR/iprocess.jsp (we would put this location in the ‘baseUrl’). We track the requests in Firefox using Firebug.

In the picture above we can see normal operation of the Workspace Browser, with requests going directly to the ActionProcessor without any proxy.

 

Above we see the same login screen but this time the Workspace Browser is using a proxy. All requests are being sent to the Local Proxy.

 

Above we can see the Workspace Browser showing the Work Queues and Work Items (I have blacked out the queue names and work item information on purpose). Tracking it in FireBug we see requests being sent to the Local Proxy (iprocess.jsp).

 

That’s all folks!

 

 

 

Continuing with Java

The continue keyword in the Java programming language provides functionality to skip the current iteration of a loop. Unlike the break keyword, which ‘breaks’ the execution flow out of the loop, continue allows us to skip an iteration. This functionality is often replicated using an if block as below:

for (int i = 0; i < 100; i++) {
    if (i % 2 == 0) {
        // Execute the loop body only for even values of i, otherwise skip the iteration
    }
}

The above piece of code can be re-written using the continue keyword:

for (int i = 0; i < 100; i++) {
    if (i % 2 != 0) {
        continue; // Skip the iteration if i is odd
    }

    // Loop continues as normal if i is even
}

 

While this simple example might not bring out the power of the humble continue keyword, imagine if there were several conditions under which the iteration should be suspended. That would mean having a complicated set of nested if-else blocks to check the flow of execution within the iteration. Whereas with continue we just need to check the condition for skipping the iteration. There is no need for the else part.

Another advantage of using continue  is that it allows the control to be transferred to a labelled statement, as we shall see in a bit.

Before going further let me state the two different ways continue can be used:

Form 1:   continue;

Form 2:   continue <label>;

Example:

The scenario: imagine you need to extract keywords from a text source. This is a multi-step process, with the first step being converting the text into tokens (i.e. breaking down the paragraphs and sentences into individual words based on some rules). Then the set of tokens is filtered and common usage words, such as ‘the’, are removed (stop word removal). Following this you may again want to filter the set using a domain-specific set of common use words, or you may want to ignore the text source altogether (i.e. not process it) if it has certain words in it. For example, if you are interested only in processing articles about Apple products you would want to ignore articles which talk about the fruit.

As you might have guessed, the above process requires several iterations over different sets of data. This is where a continue statement would be most effective. The code given below shows how to use continue. Obviously there are several other ways of implementing the desired functionality (including not using continue).

// Stop and ignore word lists
static String stopWordList[] = { "at", "in", "the", "and", "if", "of", "am", "who" };
static String ignoreWordList[] = { "king", "queen" };

// ------------------------------

// Sample strings
String samples[] = { "I am the king of the world!", "For I am the Red Queen", "A man who wasn't there" };

outer: for (String sample : samples) {

    System.out.println("Original String: " + sample + "\n");

    // Create tokens
    String[] tokens = tokenize(sample);

    System.out.println("Unfiltered Tokens:");
    printToken(tokens);

    // Filter tokens on stop words
    ArrayList<String> filteredTokens = new ArrayList<String>();

    for (String token : tokens) {

        if (filterStopWord(token)) {
            continue; // ---------------- (1)
        }

        if (filterIgnoreWord(token)) {
            System.out.println("Ignore - " + sample + "\n\n");
            continue outer; // ---------- (2)
        }

        filteredTokens.add(token); // --- (3)
    }

    // Print filtered tokens
    System.out.println("Filtered Tokens:");
    printToken(filteredTokens);
    System.out.println("\n");
} // End of outer: for

The full .java file can be found here (right click and save-as).

The logic flow is as follows:

1) Initialise a set of samples (can be any source – taken simple sentences for this example).

2) The for loop labelled ‘outer’ iterates through the set of samples.

3) Create unfiltered token set from the sample.

4) Print the token set (unfiltered).

5) Initialise array list to store the filtered set of tokens.

6) Iterate through the unfiltered token set to filter out tokens.

7) Within the iteration if the current token is a ‘stop word’ then skip the inner loop (using continue – line (1)) as it should not be added to filtered set.

8) If the current token is not a ‘stop word’ then the current iteration will continue as normal.

9) Next we check if the token is on the ‘ignore’ list, if it is then we stop processing the sample and skip the iteration of the outer for loop (using labelled continue – line (2)).

10) If the token is not on the ignore list then we continue with the current iteration and add the token to the filtered set.

If we run the above program with the required methods in place we will see the following output:

Original String: I am the king of the world!
Unfiltered Tokens:
[I]  [am]  [the]  [king]  [of]  [the]  [world!]
Ignore - I am the king of the world!

Original String: For I am the Red Queen
Unfiltered Tokens:
[For]  [I]  [am]  [the]  [Red]  [Queen]
Ignore - For I am the Red Queen

Original String: A man who wasn't there
Unfiltered Tokens:
[A]  [man]  [who]  [wasn't]  [there]
Filtered Tokens:
[A]  [man]  [wasn't]  [there]

As per the logic, the first two samples are ignored and the third one is processed.

SQL Server Express 2008: Timeout and Pre-login Errors

If you are getting Timeout or Pre-login Errors when trying to access MS SQL Server Express 2008 from a Windows Vista based client then read on!

Some of the reasons behind these errors (and the solutions) are given here:

http://technet.microsoft.com/en-us/library/ms190181.aspx

If the above does NOT resolve the error (as in my case) then before you start cursing Microsoft try this:

Go to the Network and Internet –> Network Connections

Right click on the connection being used (WiFi or LAN or any other).

I found the IPv6 and Link-Layer Topology protocols enabled, which I disabled.

This partially resolved the issues. Database access was quite stable and I was able to do some of the work. But still things were not 100% normal. I was losing the wireless connection and it was showing me connected to an ‘unidentified network’. Even the internet stopped working if the computer went into standby.

This it seems is a very common problem and more info can be found on the Microsoft site.

This problem results from

A) Issues with DHCP implementation difference between Vista and XP.

Read for solution: http://support.microsoft.com/kb/928233

B) Issues with power management profiles and WiFi networks

Read for solution: http://support.microsoft.com/kb/928152

[SQLServer JDBC Driver]Character set 437 not found in …

If you face the following error message when you try and connect with Microsoft SQL Server:

[SQLServer JDBC Driver]Character set 437 not found in <PATH OF SQL SERVER LIBRARY e.g. com.microsoft.util >

1) Locate the jar file: msutil.jar – search for it or look in the MS SQL Server Drivers directory.

2) Unjar the file using WinRar or 7zip and extract all the files.

3) Go to the util folder (if msutil is the main directory then the path to the util folder is: msutil\com\microsoft\util)

4) Within the util folder locate a file called: transliteration.properties

5) Open the file in Notepad or a similar text editor and add the following:

translit.type.437=VM
translit.name.437=Cp437

6) Save the file and either point the classpath at the unjarred folder containing the edited transliteration.properties file, or re-jar the files and replace the old msutil.jar with the new msutil.jar containing the edited file.