Tuesday 12 February 2008

An Accident Waiting to Happen

My client is in the process of rolling out a multi-tier J2EE application which has been in development for a couple of years. A couple of weeks ago, it started to run into mysterious 'hanging' states in production. Many crisis meetings took place, DBAs and app server experts were engaged, actions happened and everybody had a generally stressful time without quite getting to the bottom of the issue.

In a quieter moment, going over emails from the DBAs, thread dumps and source code, I spotted a problem. A Java class was querying Oracle for the next value of a sequence, putting it in an instance variable and then using it later to do an insert or an update. Nothing too scary on the face of it... until you realise that an instance of this class was being held in an instance variable in another class, and that class was a WebLogic web service skeleton.

Now the WebLogic manual is very clear that you must write thread-safe code for your skeleton classes because the server will use a single instance of the skeleton to service all client requests. End result: a single instance of our class is running in multiple threads and the database keys are getting mixed up between threads, resulting in multiple threads trying to lock the same database row and causing the app to hang.
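
To make the failure mode concrete, here is a minimal sketch in plain Java. It is not the client's code: the names SharedSkeletonDemo, UnsafeDao, SafeDao and runTwoRequests are invented for illustration, an AtomicLong stands in for the Oracle sequence, and a CyclicBarrier forces the unlucky interleaving that production only hit now and then. It assumes nothing about WebLogic beyond "one shared instance, many threads".

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

final class SharedSkeletonDemo {

    // Hypothetical stand-in for the Oracle sequence (the real code would select NEXTVAL).
    static final AtomicLong SEQUENCE = new AtomicLong();

    interface KeyUser {
        long reserveThenUse(CyclicBarrier bothReserved) throws Exception;
    }

    /** Mirrors the problem class: the key lives in an instance field between two steps. */
    static final class UnsafeDao {
        // volatile only makes the demo deterministic; it does NOT make the class thread-safe.
        private volatile long currentKey;

        long reserveThenUse(CyclicBarrier bothReserved) throws Exception {
            currentKey = SEQUENCE.incrementAndGet(); // step 1: fetch the next key
            bothReserved.await();                    // the other request runs step 1 as well
            return currentKey;                       // step 2: the row this thread would write
        }
    }

    /** Same two steps, but the key never leaves the calling thread's stack. */
    static final class SafeDao {
        long reserveThenUse(CyclicBarrier bothReserved) throws Exception {
            long key = SEQUENCE.incrementAndGet();
            bothReserved.await();
            return key;
        }
    }

    /** Runs two concurrent "requests" against one shared object, as a single skeleton would. */
    static long[] runTwoRequests(KeyUser shared) {
        CyclicBarrier barrier = new CyclicBarrier(2);
        long[] keysUsed = new long[2];
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                try {
                    keysUsed[slot] = shared.reserveThenUse(barrier);
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return keysUsed;
    }

    public static void main(String[] args) {
        UnsafeDao unsafeDao = new UnsafeDao();
        SafeDao safeDao = new SafeDao();
        long[] unsafe = runTwoRequests(unsafeDao::reserveThenUse);
        long[] safe = runTwoRequests(safeDao::reserveThenUse);
        System.out.println("instance field: " + unsafe[0] + " and " + unsafe[1]);
        System.out.println("local variable: " + safe[0] + " and " + safe[1]);
    }
}
```

With the instance field, both threads end up holding the same key, which is exactly how two requests come to fight over one database row. With the local variable, each thread keeps the key it fetched.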

So should we blame the developers of this class? Well, maybe, but bear in mind that the class in question is not itself a web service skeleton - it just happens to be used by one. Maybe we should blame the developer of the skeleton? Well, that may be closer to the mark, but I have another option - maybe we should blame the developer of the web service framework!

My question is: is this rule reasonable? The expectation nowadays is that developers will quickly be up to speed with new technologies and that businesses will want to take advantage of them quickly. We can't expect development shops to be peopled entirely by seasoned professionals, so surely the burden should fall on the developers of frameworks to implement default behaviours which are safe.

Having hit this issue, I cross-checked what Axis does in the same situation. At first, I couldn't find any definitive statement in the manual on the subject, so I ran a service in the debugger and found that it created lots of instances of my skeleton as calls arrived. I eventually found that the behaviour is configurable via the 'scope' parameter in the web service deployment descriptor (WSDD), which can be 'request', 'session' or 'application'. The default is 'request' - i.e. a new instance of the skeleton is created for each incoming call, which matches what I saw in the debugger. Not 100% safe (static or otherwise shared state is still exposed), not 100% optimal performance-wise, but probably safe enough for most situations.

My client now needs to either check all web-service related code for thread safety or switch to another web service framework that at least makes this behaviour configurable... and then roll out the change into production and retro-fit the two releases which are still in the pipeline. This is not going to be a painless process.

I will just be thankful that I wasn't around when the framework was chosen or when the code in question was written - would I have spotted this accident before it happened?
