Programming fun at startup

Posts

Showing posts from May, 2011

Java count open file handles

Encountered an issue in production where JVM ran out of file handles due to code bug. It took five minutes for file handles to build up but had there been any trending of open file handles we would have caught it as soon as release was pushed as on some nodes it didnt exhausted the file handles but the number was high enough to have caught suspicion. Now I can do lsof and put it in cron but I am not fond of crons as you have to configure it manually and if a box has 4 tomcats then you have to configure for each one of them on 20-30 nodes. So I wanted to get count of open file handles every five minutes and push it to graphite for trending. Here is a sample code to do it public long getOpenFileDescriptorCount() { OperatingSystemMXBean osStats = ManagementFactory.getOperatingSystemMXBean(); if(osStats instanceof UnixOperatingSystemMXBean) { return ((UnixOperatingSystemMXBean)osStats).getOpenFileDescriptorCount(); } return 0; }

RateLimiting based on load on nodes

We are a cloud based file storage company and we allow many access point to the cloud. One of the access point is Webdav api and people can use any webdav client to access the cloud. But some of the webdav client especially on mac OS are really abusive. Though the user is not doing any abusive action, the webdav client does aggressive caching so even when you are navigating at a top directory it does PROPFINDs for depth at 5 or 6 level to make the user experience seamless as if he is navigating local drive. This makes life miserable on the server because from some clients we get more than 1000 requests in a minute. If there are 5-10 clients do the webdav activity then it causes 100 or more propfinds per sec. Luckily the server is able to process these but it hurts other activities. So we needed to rate limit this. Now as the user is really not doing any abusive action it would be bad to slow down on penalize the user in normal circumstance, however if the server is under load then it w

ubuntu 11.04 natty upgrade hangs

It hung there for more than one hour. It was trying to upgrade cups. I have no idea but then I tried doing "sudo stop cups" and that let the upgrade resume.

Quartz stop a job

Quartz did a good job on implementing this concept. It was very easy to add this feature by implementing a base class that abstract the details of interrupt and have every job extend this class. If you can rely on thread.interrupt() then its the best way to interrupt a job that is blocked on some I/O or native call. However if its a normal job then a simple boolean flag would do the work. You would need to use scheduler.interrupt(jobName, groupName); to interrupt a running Quartz job. public abstract class BaseInterruptableJob implements InterruptableJob { private static final AppLogger logger = AppLogger.getLogger(BaseInterruptableJob.class); private Thread thread; @Override public void interrupt() throws UnableToInterruptJobException { logger.info("Interrupting job " + getClass().getName()); if (thread != null) { thread.interrupt(); } } @Override final public void execute(JobExecutionContext context) throws JobExecutionException { try { thread

Pitfalls of defensive programming

We programmers some times add too much defensive code in order to protect ourselves from the caller not asserting preconditions before making the call. So for e.g. if we have to save a file in some directory, we would first go and check if the directory exists and if it exists then create the file. Now NFS is not designed to work at cloud scale and we saw lots of calls just stuck in file.exists call in threaddumps. The solution was simple, some of these directories could be created at tomcat startup or app node installer can create them. Also some code can assume that directory exists and if if gets a FileNotFoundExcpetion then create it and retry the operation. Removing these defensive coding practices reduced a lot of unnecessary stat calls on filers and improved performance. This is just an example but similar pattern can be observed in other areas of the code and fixed. Defensive programming is good but too much of it is bad and can be improved by making some assumptions or provid

Contextual Thread dumps

Due to some business policy changes we recently started seeing some changes in usage pattern of our application leading to unexplained app node spikes. These spikes were temporary and by the time we go and try to take jstacks it might have disappeared. So we configured a quartz job to take jstack every 5 min(wrote a quartz instead of cron because cron needs to be manually configured on each node and we have tons of nodes to ops was always missing or misconfiguring it) and dump it in to a folder and we keep last 500 copies. That way I can go and correlate what was going on in the tomcat during the time of the spike (I had to get lucky for spike to happen when quartz job was running but I was lucky as most spikes spanned 3-5 mins). Now from those thread dumps I can figure out what was going on like how many thread are doing "searches" v/s how many thread are coming from Webdav or how many threads are doing add file. But one question that keep on coming was who are the customer