Wednesday, March 27, 2013

Spring 3.2, Quartz 2.1: "Jobs added with no trigger must be durable."

I am trying to enable HA on our nodes, and in the process I found that in a two-node test setup a job that runs every 10 seconds was deadlocking. So I tried upgrading from Quartz 1.8 to 2.1 by following the migration guide, but I ran into an exception that says "Jobs added with no trigger must be durable."

After digging into the Spring and Quartz code I figured out that Quartz is now more strict: earlier, scheduler.addJob had a replace parameter which, if set to true, would skip the durability check. The latest Quartz closes that loophole, but Spring hasn't caught up to it yet. So what do you do? Well, I just extended the factory, set durability to true, and used that:

import org.springframework.scheduling.quartz.JobDetailFactoryBean;

public class DurableJobDetailFactoryBean extends JobDetailFactoryBean {
    public DurableJobDetailFactoryBean() {
        setDurability(true); // Quartz 2.x requires jobs registered without a trigger to be durable
    }
}

and used this instead of JobDetailFactoryBean in the Spring bean definition (the package and job class below are illustrative):

    <bean id="restoreJob" class="com.example.DurableJobDetailFactoryBean">
        <property name="jobClass" value="com.example.RestoreJob"/>
    </bean>

Saturday, March 23, 2013

graphite dynamic counters trending

Using cron, I generate a daily report of the top exceptions, top queries, and top URLs across all datacentres, and send it via email.

The problem is that after every release the numbers jump around: due to some bug a new exception will pop up, or an old one will resurface. How do I trend and correlate these dynamic counters?

The solution came from a colleague in an informal chat: he recommended I md5-hash the URL, create a graphite counter keyed on the hash, and in the email make each count a hyperlink to that counter's graph.

Now I can trend any query: clicking the link shows me its graph. My next plan is to inline the graphs for the top 10 URLs in the email itself, so I don't even need to click through.
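Here is a minimal sketch of the hashing idea in Java (the graphite host and metric prefix are made up; graphite's plaintext protocol accepts "metric value timestamp" lines on port 2003):

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class UrlCounter {
        // md5-hash the url so it becomes a short, stable, graphite-safe metric name
        static String metricFor(String url) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return "top_urls." + hex; // metric prefix is illustrative
        }

        // graphite plaintext protocol: one "<metric> <value> <epoch-seconds>" line per data point
        static void send(String metric, long count) throws IOException {
            try (Socket s = new Socket("graphite.example.com", 2003);
                 Writer w = new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8)) {
                w.write(metric + " " + count + " " + System.currentTimeMillis() / 1000 + "\n");
            }
        }
    }

The hyperlink in the email then just points at graphite's render URL for that same metric name.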

Being analytics-driven vs firefight-driven

Our incoming traffic has doubled every 6 months, and for the past 3 years we have always been in firefighting mode: some customer reports an issue or a node goes down, and we scramble to analyze the root cause and fix it.

Lately I am trying to move away from working in a firefight-driven mode to an analytics-driven mode. What I mean is being proactive: monitoring and understanding the system by gathering various metrics, and fixing issues before the customer notices them.

For example, to put our nodes in HA mode I had to store a sessionId-to-userId mapping in the database. The only real reason for this was our Flash file uploader: it makes a request but passes the sessionId as a request parameter instead of in a cookie, which can cause the request to land on a completely different node. So to handle this we wrote a session listener that saves the sessionId-to-userId mapping in the db. The code went live, and some days later the db went down. It turned out the developer forgot to delete the row on session purge, and the table had grown to 140M rows. I never expected this table to hold more than 1-2M rows at a time, but the code bug made it balloon.

So the first thing I did was add a monitor on every table in our transient db that alerts me if any table has more than 10M rows. It immediately caught another code bug, where an audit table had grown past 70M rows for one customer, so we fixed that in a hotpatch without causing another downtime.
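A sketch of the kind of monitor I mean, assuming the MySQL JDBC driver is on the classpath (the connection details are made up, and TABLE_ROWS is only an estimate for InnoDB, which is good enough for catching runaway tables):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TableRowMonitor {
        public static void main(String[] args) throws Exception {
            String sql = "SELECT table_schema, table_name, table_rows "
                       + "FROM information_schema.tables WHERE table_rows > 10000000";
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://db.example.com/information_schema", "monitor", "secret");
                 PreparedStatement ps = c.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // println stands in for whatever alerting you have (email, pager, graphite)
                    System.out.printf("ALERT %s.%s has ~%d rows%n",
                        rs.getString(1), rs.getString(2), rs.getLong(3));
                }
            }
        }
    }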

Another incident happened two weeks back when I travelled to the Bay Area: for two days, a db went down exactly between 11:30 and 11:45 and restarted on its own. What we found was that the session table had again reached 20M rows, even after the session-destroy delete fix, and the backend purge jobs from all nodes were trying to delete 18M of those 20M records. It turned out that some code flows (WebDAV and FTP upload) that are not supposed to be session-dependent were setting the user in the session. Users had uploaded 20M files via FTP/WebDAV, and that caused the table to balloon. The only reason I was able to find the issue was that I log stats on each node recording which methods ran in every 5-minute window, so I grepped the queries issued between 11:30 and 11:45 on all nodes.

Immediately I realized that I need something more global: we have 40 MySQL servers, and I needed a report of how many queries are happening on each node and overall at the DC level. If I had had this report, I would have caught the session issue before it became a problem. Scribe to the rescue: now I write those 5-minute stats to Scribe, which aggregates them in a central Scribe instance across all datacentres. I roll the log nightly and wrote a Python cron to generate a report listing each query, the number of times it ran, and its average time.
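The cron itself is Python, but the aggregation logic is simple; sketched in Java it would look roughly like this (the tab-separated log format is made up for illustration):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class QueryReport {
        static class Stat { long count; double totalMs; }

        public static void main(String[] args) throws IOException {
            Map<String, Stat> stats = new HashMap<String, Stat>();
            // assumed line format: <query>\t<count>\t<total-ms>, one line per node per 5-min window
            for (String line : Files.readAllLines(
                    Paths.get("/var/log/scribe/queries.log"), StandardCharsets.UTF_8)) {
                String[] f = line.split("\t");
                Stat s = stats.get(f[0]);
                if (s == null) { s = new Stat(); stats.put(f[0], s); }
                s.count += Long.parseLong(f[1]);
                s.totalMs += Double.parseDouble(f[2]);
            }
            for (Map.Entry<String, Stat> e : stats.entrySet()) {
                System.out.printf("%10d %8.1fms avg  %s%n",
                    e.getValue().count,
                    e.getValue().totalMs / e.getValue().count,
                    e.getKey());
            }
        }
    }

Sorting by count and keeping the top N gives the report that goes out in the email.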

Yesterday it went live in UAT, and immediately I found 2-3 issues where the top queries were not what we expected (a caching bug :)). Lesson learnt: be analytics-driven rather than hunch-driven or firefight-driven.

HAProxy and Tomcat JSESSIONID

One of the biggest problems I have been trying to solve at our startup is putting our Tomcat nodes in HA mode. Right now when a customer comes in, he lands on a node and remains there forever. This has three major issues:

1) We have to overprovision each node with the ability to handle worst-case capacity.

2) If two or three high-profile customers land on the same node, we have to move them manually.

3) We need to keep cutting over new nodes, and we already have 100+ nodes.

It's a pain managing these nodes, and I waste a lot of my time chasing node-specific issues. I loathe the moment I realize I have to chase yet another environment issue.

I really hate human intervention; if it were up to me, I would just automate everything, enjoy the fruits of automation, and spend quality time on major issues rather than mundane tasks. Call me lazy, but that's a good quality.

So finally I am at a stage where I can put nodes behind HAProxy in the QA env. Today we were testing the HA config, and the first problem I saw immediately is that we have to use sticky sessions, due to a design issue that will take a long time to solve. We were doing sticky sessions keyed on JSESSIONID, like:

      appsession JSESSIONID len 32 timeout 12h request-learn

Immediately I realized that two Tomcats can generate the same JSESSIONID, so there is a potential security breach (one user landing in another user's session). Thank god Tomcat has a way to add a node identifier to the JSESSIONID to solve this issue :). You can go to conf/server.xml and add a jvmRoute to the Engine element, like
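    <Engine name="Catalina" defaultHost="localhost" jvmRoute="node1">

where "node1" is whatever unique name you assign to each Tomcat instance.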

This way your JSESSIONID would be generated like BBF8B5EF74EAAECE0278DC92A9F1353D.node1, with the node identifier appended after the dot.

As a side effect, we can now tell from the cookie which node is serving a request, which will help in troubleshooting node-specific issues.

I hope to cut the 100+ nodes down to 40 after this HA work. I will keep 40 because we have pods/farms in each DC and need to overprovision each pod; otherwise this could have been reduced to 10 or 15 nodes. The pods are there to avoid a DC-wide meltdown.

Thursday, March 7, 2013

%E2%80%90 and links issue

Ran into an issue where a customer creates a public link to a file and pastes it into Word, where it works fine; but when he converts the document to PDF, the link no longer works, even though the link in the browser URL bar looks exactly the same. I reproduced it and saw that the "h-s" in the link URL was getting converted to h%E2%80%90s. After some Googling it turned out to be an Adobe bug related to the hyphen character: %E2%80%90 is the UTF-8 percent-encoding of U+2010, the typographic hyphen, which is a different character from the ASCII hyphen-minus in the original URL.
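A quick way to see the difference (a minimal Java sketch; it just demonstrates that the two hyphen characters percent-encode differently):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class HyphenDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // the ASCII hyphen-minus is a safe character and survives URL encoding untouched
            System.out.println(URLEncoder.encode("h-s", "UTF-8"));      // prints h-s
            // U+2010, the typographic hyphen that crept in, does not
            System.out.println(URLEncoder.encode("h\u2010s", "UTF-8")); // prints h%E2%80%90s
        }
    }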

Friday, March 1, 2013

Final nail in the BDB coffin - Part 2

Missing indexes can be a real pain. We were migrating data from BDB to MySQL, and the migration on a few nodes had been running for 3-4 days. As I was tied up in a firefight, I didn't get a chance to look at it. But on one node only 10% of the workgroups had migrated, and while chasing a customer-reported issue I found that an index was missing. We created that index and restarted the migration, and wow, it finished in 5 hours.
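For what it's worth, a standard way to spot this kind of thing (the query, table, and column names here are made up) is to EXPLAIN the migration's hot query and look for a full scan:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainCheck {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://db.example.com/app", "user", "secret");
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(
                     "EXPLAIN SELECT * FROM files WHERE workgroup_id = 42")) {
                while (rs.next()) {
                    // type=ALL with a huge rows estimate means a full table scan: an index is missing
                    System.out.println(rs.getString("type") + " rows=" + rs.getLong("rows"));
                }
            }
        }
    }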

In the end we added 400M+ rows to our MySQL infrastructure, and BDB is finally out of the product. Hurray!!