Programming fun at startup

Showing posts from March, 2013

Spring 3.2, Quartz 2.1: Jobs added with no trigger must be durable.

I am trying to enable HA on our nodes, and in the process I found that in a two-node test setup a job that runs every 10 seconds was running into a deadlock. So I tried upgrading from Quartz 1.8 to 2.1 by following the migration guide, but I ran into an exception that says "Jobs added with no trigger must be durable." After looking into the Spring and Quartz code I figured out that Quartz is now stricter. Earlier, scheduler.addJob had a replace parameter which, if set to true, would skip the durable check. In the latest Quartz this is fixed, but Spring hasn't caught up. So what do you do? Well, I just subclassed the factory and set durability to true: public class DurableJobDetailFactoryBean extends JobDetailFactoryBean { public DurableJobDetailFactoryBean() { setDurability(true); } } and used this instead of JobDetailFactoryBean in the Spring bean definition <bean id="restoreJob" class="com.xxx.infrastructure.quar

Graphite dynamic counters trending

Every day a cron job generates a report of the top exceptions, top queries, and top URLs across all datacentres and sends it to me via email. The problem is that after every release the numbers go up and down, because some bug makes a new exception pop up or an old exception resurface. How do I trend and correlate these dynamic counters? The solution came from a colleague in an informal chat: he recommended that I MD5-hash the URL, create a Graphite counter for it, and make the count in the email a hyperlink. Now I can trend the query, because clicking the count shows me a graph. My next plan is to inline the graph for the top 10 URLs in the email itself so I don't even need to click them.
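A minimal sketch of the hashing idea, not the code we actually run: it turns a URL into a fixed-length Graphite metric name, formats a line for Graphite's plaintext protocol (metric path, value, timestamp), and builds the link that goes behind the count in the email. The class name, the `reports.top_urls.` prefix, and the `graphite.example.com` host are assumptions made up for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class UrlCounter {
    // Hypothetical metric prefix and Graphite host; replace with your own.
    static final String PREFIX = "reports.top_urls.";
    static final String GRAPHITE = "http://graphite.example.com";

    /** Returns the 32-character lowercase hex MD5 digest of the text. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** The hash keeps dots and slashes in the URL out of the metric path. */
    static String metricName(String url) {
        return PREFIX + md5Hex(url);
    }

    /** One line of Graphite's plaintext protocol: path, value, epoch seconds. */
    static String plaintextLine(String url, long count, long epochSeconds) {
        return metricName(url) + " " + count + " " + epochSeconds + "\n";
    }

    /** Link for the email: the render endpoint graphs the counter over time. */
    static String graphLink(String url) {
        return GRAPHITE + "/render?target=" + metricName(url);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        System.out.print(plaintextLine("/files/download", 42, now));
        System.out.println(graphLink("/files/download"));
    }
}
```

Because the same URL always hashes to the same name, a URL that drops out of the top list and comes back a few releases later lands on the same counter, which is what makes the trend readable.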

Being Analytics driven vs firefight driven

We have doubled our incoming traffic every 6 months, and for the past 3 years we have always been in firefighting mode: some customer reports an issue or a node goes down, and we try to analyze the root cause and fix it. Lately I am trying to move away from a firefight-driven mode to an analytics-driven mode. What I mean is being proactive: monitoring and understanding the system by gathering various metrics, and fixing issues before a customer notices them. For example, to put our nodes in HA mode I had to store a sessionId to userId mapping in the database. The only real reason to do this was our Flash file uploader, because it makes a request but doesn't pass the sessionId in a cookie; it passes it as a request parameter. This causes the request to go to a completely different node. To handle this we wrote a session listener that would save the sessionId to userId mapping in the db. The code went live, and suddenly after some days the db went down. What happened was that the developer forgot to

HAProxy and Tomcat JSESSIONID

One of the biggest problems I have been trying to solve at our startup is putting our Tomcat nodes in HA mode. Right now, if a customer comes, he lands on a node and remains there forever. This has three major issues: 1) We have to overprovision each node so it can handle worst-case capacity. 2) If two or three high-profile customers land on the same node, we need to move them manually. 3) We need to cut over new nodes, and we already have over 100 nodes. It's a pain managing these nodes, and I waste a lot of my time chasing node-specific issues. I loathe it when I know I have to chase an env issue. I really hate human intervention; if it were up to me I would just automate things, enjoy the fruits of automation, and spend quality time on major issues rather than mundane tasks. Call me lazy, but that's a good quality. So finally I am at a stage where I can put nodes behind HAProxy in the QA env. Today we were testing the HA config and the first problem I immediately

%E2%80%90 and links issue

Ran into an issue where a customer creates a public link to a file and pastes it into Word, where it works fine, but when he converts the document to PDF the link no longer works, even though the link in the browser URL bar looks exactly the same. I reproduced it and saw that "h-s" in the link URL was getting converted to h%E2%80%90s. After some Googling it turned out to be an Adobe bug http://forums.adobe.com/message/2807241 related to the hyphen character.
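For context, %E2%80%90 is the percent-encoded UTF-8 form of U+2010 (the Unicode HYPHEN), which renders almost identically to the ASCII hyphen-minus U+002D but is a different character. The sketch below shows where those three bytes come from and one possible server-side cleanup that maps the character back to a plain hyphen. The class and method names are made up for illustration, and the cleanup is only an assumption about how one might work around the bug, not something we shipped.

```java
import java.nio.charset.StandardCharsets;

class HyphenFix {
    /** Percent-encodes every UTF-8 byte of the text, to inspect a character. */
    static String percentEncode(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Maps U+2010, raw or percent-encoded in either case, back to ASCII '-'. */
    static String restoreAsciiHyphen(String url) {
        return url.replaceAll("(?i)%E2%80%90", "-").replace('\u2010', '-');
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("\u2010"));           // Unicode HYPHEN
        System.out.println(percentEncode("-"));                // ASCII hyphen-minus
        System.out.println(restoreAsciiHyphen("h%E2%80%90s")); // back to a plain hyphen
    }
}
```

A blanket replacement like this is only safe if no legitimate link ever contains U+2010 on purpose.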

Final nail in BDB coffin - Part 2

Missing indexes can be a real pain. We were migrating data from BDB to MySQL, and the migration on a few nodes had been going on for 3-4 days. As I was involved in a firefight, I didn't get a chance to look at it. But on one node only 10% of the workgroups had been migrated, and while chasing a customer-reported issue I found that an index was missing. We created that index and restarted the migration, and wow, it finished in 5 hours. Finally we ended up with 400M+ more rows in our MySQL infrastructure, and now BDB is finally out of the product. Hurray!!