Thursday, January 28, 2010

Killing a particular Tomcat thread

Update: This JSP does not work on a thread that is executing native code. On many occasions I had a thread stuck in JNI code, and it would not work there.
Also, in some cases Thread.stop can cause the JVM to hang. According to the Javadocs: "This method is inherently unsafe. Stopping a thread with Thread.stop causes it to unlock all of the monitors that it has locked".

I have used it only on rare occasions when I wanted to avoid a system shutdown. In some of those cases we ended up doing a shutdown anyway because the JVM hung, so I have had a 70-80% success rate with it.
We had an interesting requirement. A Tomcat thread spawned from an ExecutorService thread pool had gone rogue and was causing a lot of disk churn. We couldn't bring down the production server, as that would involve downtime. Killing this thread was harmless, but how to kill it? The ExecutorService variable was private, so we couldn't call shutdownNow on it. The solution was to generate a thread dump, get the thread name, and write a JSP to kill it. Much of the code is borrowed, but as this is a unique issue, I thought of helping fellow Googlers ;).
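
The lookup step can be sketched roughly as follows. This is not the original JSP, only a minimal illustration of finding a live thread by the name seen in a thread dump. The class name ThreadKiller and both method names are made up for this example. It calls Thread.interrupt rather than Thread.stop, because stop is deprecated and recent JDK releases refuse it with an UnsupportedOperationException; interrupt only helps if the rogue code actually responds to interruption.

```java
import java.util.Optional;

public class ThreadKiller {

    /** Finds a live thread whose name exactly matches the one copied from the thread dump. */
    public static Optional<Thread> findByName(String name) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().equals(name))
                .findFirst();
    }

    /**
     * Interrupts the named thread and waits up to waitMillis for it to die.
     * Returns true only if the thread was found and is no longer alive.
     * waitMillis should be positive, since join(0) would wait forever.
     */
    public static boolean interruptByName(String name, long waitMillis) {
        Optional<Thread> target = findByName(name);
        if (!target.isPresent()) {
            return false; // no live thread with that name
        }
        Thread t = target.get();
        t.interrupt();
        try {
            t.join(waitMillis);
        } catch (InterruptedException e) {
            // Restore our own interrupt flag and report whatever state we can see.
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

Inside a JSP the thread name would typically come from a request parameter, and the page should be locked down, since anyone who can reach it can interrupt arbitrary threads in the JVM.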

Friday, January 22, 2010

Partition your Rolling Fact by time or not

We are creating a data warehouse to store event logs for actions done by users. The requirements are to keep 6 months of historical data and to allow users to run audit reports. The challenge is whether or not to partition the facts by the creation time of the record. The advantage of partitioning by creation time is that we can drop a partition within seconds when we want to purge data, and all new data goes into the current month's partition, so ETL data loads become fast. The disadvantage is that every query fired on the data warehouse has to include a time horizon, otherwise it will do a FULL SCAN across all partitions. The alternative is to not partition by time, or to create global indexes on the time-partitioned tables, but when you drop data these indexes/tables become fragmented. This is a very important decision. If you can get the user requirements and all the queries will contain a time range, then go ahead and partition your fact by time. Otherwise it's better to pay the performance penalty during ETL, as it's a background job, rather than make the user suffer on every query.
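
To make the purge side concrete, here is a small sketch of how a job could work out which monthly partitions have aged out of the retention window. Everything in it is an assumption for illustration: the class name, the one-partition-per-month layout, and the P_yyyy_MM naming scheme are hypothetical, and the actual drop statement depends on your database.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

public class PartitionRetention {

    // Hypothetical naming scheme: one partition per month, e.g. "P_2010_01".
    static String partitionName(YearMonth month) {
        return String.format("P_%04d_%02d", month.getYear(), month.getMonthValue());
    }

    /**
     * Lists the monthly partitions, starting at 'oldest', that are older than
     * the retention window. The current month plus the previous
     * 'retentionMonths' months are kept.
     */
    public static List<String> partitionsToDrop(YearMonth oldest, YearMonth current, int retentionMonths) {
        YearMonth firstKept = current.minusMonths(retentionMonths);
        List<String> names = new ArrayList<>();
        for (YearMonth m = oldest; m.isBefore(firstKept); m = m.plusMonths(1)) {
            names.add(partitionName(m));
        }
        return names;
    }
}
```

With data going back to May 2009, a current month of January 2010 and 6 months of retention, the list contains only the May and June 2009 partitions; each name would then go into a single partition drop instead of a row-by-row delete.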

Tuesday, January 19, 2010

Managing User Perception in long running operations

It's important to manage user perception properly and give the user feedback when you are doing a long-running synchronous operation. In the web world a user can get frustrated or impatient and retry the same operation, which sends even more load to your server.

We recently ran into an issue like this. Registration was taking a long time due to some server issue: according to the Apache logs it took 10 sec, but users were reporting 30-40 sec. Doing a registration from a browser with an empty cache confirmed that the slow part was not registration itself but the confirmation page, which is a plain HTML page. It was all because the confirmation page was making 50 requests to the server to download images/JS/CSS. The solution that clicked for me was simple: reduce the number of image, CSS and JS requests. The confirmation page was downloading:
  1. 25 images/CSS/JS files for rendering the header/footer
  2. 5 images for rendering rounded corners
  3. 5-6 JS/CSS/image files for Google Analytics and other stuff.
The fix was simple: we just whacked the header and footer off the confirmation page, as they were not necessary, and used CSS for the rounded corners (Thanks to for implementing rounded corners so easy and fast). Now the confirmation page was fast and users were happy. Registration itself was still taking 10 sec, which we are still working to reduce. To give the user some feedback that we are doing something heavy when they click the Register button, we are adding an Ajax popup that shows Registering... (where ... is an animated GIF), and we will mask out the entire screen so that the user can't click any of the buttons again.

One more example of user perception: I was looking at the Tomcat access log of one of our production servers and saw the same search request 4 times within 20 sec. The server was heavily loaded and the full-text keyword search was taking a long time. The user's 4 successive clicks put even more load on the server and sent the load average through the roof.

The solution is simple and is even applied by Microsoft. Have you ever noticed that when you copy 1 GB or more of data, Windows tells you it will take 2 min to copy the files and shows a progress bar, but depending on your machine configuration and other activity it can take anywhere from 2 to 3 min to do the copy? A normal user won't notice this small difference, as he can see that something is going on. In our case things were not as easy as they are for Microsoft, so for the time being we will disable the search button on click and re-enable it when the results are downloaded.
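
Disabling the button happens in the browser. The same idea can also be backed up on the server, so that a repeated click that still gets through does not start the expensive search a second time. The sketch below is only an illustration of that idea and is not what we deployed: the class name InFlightGuard, the method runOnce and the way the key is built (for example user id plus query string) are all assumptions.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class InFlightGuard {

    // Keys of the requests that are currently being processed.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs 'work' unless an identical request (same key) is already running.
     * Returns Optional.empty() when the call is rejected as a duplicate,
     * or when the work itself returns null.
     */
    public <T> Optional<T> runOnce(String key, Supplier<T> work) {
        if (!inFlight.add(key)) {
            return Optional.empty(); // duplicate click while the first one is still running
        }
        try {
            return Optional.ofNullable(work.get());
        } finally {
            inFlight.remove(key); // allow the same request again once this one is done
        }
    }
}
```

A search handler could wrap the full-text query in runOnce and answer a rejected duplicate with a short "search already in progress" response rather than running the query again.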