Sunday, November 23, 2014

Webhooks and integrating Assembla svn commits with Pivotal Tracker stories

I am intrigued by the webhook concept; it seems a very nice way to do B2B communication. Webhooks are powerful because they eliminate polling when integrating with third parties. All you need is a registered REST endpoint that will be called when an event occurs. A good example of a webhook for a cloud storage provider would be "automatically print this document on registered printers when a file is dropped in this folder". Let's assume all the customer needs to do is register a webhook URL; the cloud storage provider can then call this URL and POST the body of the document to it.

Recently I had a chance to play with webhooks when I was moving Jenkins to EC2 for a friend, and as part of that I moved his svn to Assembla. I saw that Assembla supports webhooks and thought I could link commits in the Assembla-hosted svn to Pivotal Tracker tickets. It took just an hour and it was fun. Apparently there are post-commit hooks that you can add in svn, but this was much easier because Assembla has a webhook concept, so all I needed was to find a Pivotal Tracker API that accepts commit messages.

It turns out Pivotal Tracker already has such an API. The challenge was that the Assembla webhook doesn't let you set custom headers, so I couldn't pass the X-TrackerToken header, but after some trawling through the API documentation I found that you can pass the token in the query string instead, and I was done. After configuring the integration, the commit messages show up in the Pivotal Tracker story.
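As a rough illustration of the query-string workaround, here is a minimal Java sketch that appends a token to a webhook target URL. The host tracker.example.com, the path and the parameter name "token" are placeholders I made up for illustration, not the real Pivotal Tracker endpoint; the actual URL and parameter name should come from the Pivotal Tracker API documentation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class WebhookUrl {
    // Appends name=value to the URL, using "&" if the URL already has a query string.
    // The value is URL-encoded so that special characters in the token stay intact.
    static String withToken(String baseUrl, String paramName, String token) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + paramName + "="
                + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and token, for illustration only.
        System.out.println(withToken("https://tracker.example.com/commits", "token", "my-api-token"));
    }
}
```

The resulting URL is what you would paste into the webhook configuration when the sender cannot set headers. Keep in mind that a token in a URL can end up in access logs, so a header is preferable whenever the sender supports one.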

Thursday, November 20, 2014

Data-driven performance issues and New Relic

New Relic really shines at discovering these data-driven performance issues. Earlier we would find them late, or they would stay buried, but now they are obvious if the engineer is paying attention. I was casually browsing New Relic and sorted all apps by average time per API call, and one of our core applications in one DC was taking twice the average time per call compared to all other DCs. I immediately compared that DC with the others, and I saw a graph like the one below in DC1

and I saw this in DC2

Clearly DC1 is spending an abnormal amount of time in the database. So I went to the database view and saw this in DC1

and I saw this in DC2

Clearly something is weird in DC1 even though it's the same codebase. 309K queries per minute seems abnormal. Within 5 minutes I found out it's an N+1 query problem. Apparently one customer has 4000 users and has created 3000 groups, and the group_member table has 40K rows for this customer. Normally our customers create 10-50 groups, and there is code that iterates over each group and calls get members.

For a normal customer, 100 calls per minute to this API would cause about 100*10 = 1K queries per minute, but in this DC it causes 100*3000 = 300K queries per minute. As we are close to the weekend release, for now I replaced the per-group queries with a bulk query; in the next release we will optimize this code, or work with the customer to see whether his data modelling has flaws and the same goal can be achieved some other way.
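To make the difference concrete, here is a small Java sketch that counts queries against an in-memory stand-in for the database. FakeDb, findMembers and findMembersForGroups are hypothetical names invented for this illustration, not our actual code; the real fix was a single bulk SQL query in place of one query per group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupMemberQueries {
    // In-memory stand-in for the database that counts every query it serves.
    static class FakeDb {
        final Map<Integer, List<String>> membersByGroup = new HashMap<>();
        int queryCount = 0;

        // One query per group: the pattern that caused the problem.
        List<String> findMembers(int groupId) {
            queryCount++;
            return membersByGroup.getOrDefault(groupId, List.of());
        }

        // One query for all groups: the bulk replacement.
        Map<Integer, List<String>> findMembersForGroups(List<Integer> groupIds) {
            queryCount++;
            Map<Integer, List<String>> result = new HashMap<>();
            for (int groupId : groupIds) {
                result.put(groupId, membersByGroup.getOrDefault(groupId, List.of()));
            }
            return result;
        }
    }

    // Builds a fake database with the given number of groups, one member each.
    static FakeDb dbWithGroups(int groupCount) {
        FakeDb db = new FakeDb();
        for (int id = 0; id < groupCount; id++) {
            db.membersByGroup.put(id, List.of("user-" + id));
        }
        return db;
    }

    // Issues one query per group.
    static int countMembersPerGroup(FakeDb db) {
        int total = 0;
        for (int groupId : db.membersByGroup.keySet()) {
            total += db.findMembers(groupId).size();
        }
        return total;
    }

    // Issues a single query covering all groups.
    static int countMembersBulk(FakeDb db) {
        int total = 0;
        List<Integer> ids = new ArrayList<>(db.membersByGroup.keySet());
        for (List<String> members : db.findMembersForGroups(ids).values()) {
            total += members.size();
        }
        return total;
    }

    public static void main(String[] args) {
        FakeDb perGroup = dbWithGroups(3000);
        countMembersPerGroup(perGroup);
        FakeDb bulk = dbWithGroups(3000);
        countMembersBulk(bulk);
        System.out.println("queries per API call, one per group: " + perGroup.queryCount);
        System.out.println("queries per API call, bulk: " + bulk.queryCount);
    }
}
```

With 3000 groups the per-group version issues 3000 queries per API call, which at 100 calls per minute gives the 300K queries per minute seen in DC1, while the bulk version issues one query per call.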

Sunday, November 16, 2014

AWS and rise of devops

I always used to wonder how Snapchat, Pinterest and Instagram were able to scale to millions of users with just 10-15 engineers. I am a Java architect, but when it comes to networking, operations and the like I am a noob beyond basic skills. Recently our ops team made some subnet and IP changes and added a 10G network between some services. All this is a grey area to me, and I thought: you really need to hire operations people for this, so how did these other startups manage without so many people? One of my friends had been after me for months to help him move his Jenkins servers from Ukraine to EC2, as Ukraine is in turmoil. I have no ops expertise so this was tricky, but here is how I got it done over 2 weekends, as Dallas is freezing due to a cold front and I don't have a driver's license due to an immigration fiasco by USCIS. So this friend really benefited, as I had nothing else to do on the weekend.

  1. I took a vanilla CentOS AMI and launched an instance in EC2. When launching, it asked me whether I wanted to launch in EC2-Classic or EC2-VPC. I was curious, so I read up and learned that EC2-VPC lets you isolate your instances and gives you better security by using security groups to block outside traffic to internal servers.
  2. Creating a VPC was a piece of cake, as I followed the wizard and read the docs. My first choice of VPC configuration required an extra NAT instance, so I went with a different one.
  3. Finally I launched an EC2 instance with CentOS and installed Jenkins using sudo yum install jenkins
  4. One thing I wanted was a banner when I ssh to the instance, so I googled and found that all you need to do is generate a text banner online and put it in /etc/motd, and you are done. Now when you log in to the instance it prints "Jenkins".
  5. Now, how do I move Jenkins? His old instance had been up for 2 years, and I read that moving Jenkins from one box to another just requires copying the Jenkins home directory across. But when I ran du -hsm I got 40G. I was like, no way I want to move all this shit.
  6. Finally I installed the Jenkins ThinBackup plugin on both servers. On the old one I took a ThinBackup backup and zipped it, then on the new server I unzipped it over the JENKINS_HOME directory /var/lib/jenkins.
  7. Then I ran "chown -R jenkins /var/lib/jenkins"
  8. I asked him to switch DNS to the new server.
  9. I took a dump of svn and imported it into a new Assembla account, so I didn't need to set up an svn server separately.
  10. We restarted Jenkins using "service jenkins restart" and the new Jenkins came up with the same configs as the old server. I changed all svn repo paths to point to the Assembla credentials/URLs and we were done.
Now comes the hard part. After all this was done, the agent on a box he already had in EC2 for running Selenium tests was down. No matter what I did, it wouldn't connect. I checked the Jenkins config page and it was using port 15001. I edited the EC2 security group to allow port 15001, and it still wouldn't connect.

Then I thought maybe I needed to run this box in the same VPC. Me being a noob, this was a bad idea that derailed me. In EC2 you can't migrate an instance from EC2-Classic to EC2-VPC. The only way is to create an AMI from the old instance and launch a new instance from it. I did that, and it took 4 hours including reading the docs on creating an AMI, but even then it wouldn't connect. I deleted that AMI/snapshot and the new instance and fired up the old instance.

Finally I switched to debugging with raw telnet and saw that from the Jenkins instance I could telnet to localhost 15001, but I couldn't from his laptop or from the Selenium box. I finally figured out that the CentOS AMI I picked had its own firewall, with only ports 80 and 443 open.

We added
sudo iptables -I INPUT -p tcp --dport 15001 -j ACCEPT

and finally all was done.
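The telnet check can also be scripted. Below is a minimal Java sketch that does the same thing with a plain TCP connect; the class name and the 2-second timeout are my own choices, and the host and port default to localhost and 15001 from the story above. Like telnet, it only tells you whether a connection could be made, not whether a firewall or a missing listener is to blame, so running it from both the server itself and a remote box is what narrows it down.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortCheck {
    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable: all reported as "cannot connect".
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 15001;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```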

In short, EC2 has put bare-bones operations jobs in jeopardy; ops people need to move to devops. AWS is innovating like crazy, and today he sent me some links on AWS Lambda and AWS Aurora. I was like, hmm, if startups have all this, do they need to hire MySQL DBAs before the site has gained real momentum?

Finally, after doing all this, I understood how Pinterest, Snapchat and Instagram were able to keep up with a large infrastructure with so few employees, even fewer than my employer had at a similar scale. For storage you use S3, for the database you use Aurora, for load balancing you use ELB, and you automate build/deploy via Jenkins/Puppet and nowadays Docker. I remember 10-14 years ago when I started working for startups, they had to hire an army of people to get a prototype out of the door, and that meant VCs needed to write a Series A check. It seems today people at YC are doing it with 3-4 people and only seed funding. So seed funding has become the old Series A. Who needs an army of people to install and manage 100s of servers when automation tools combined with the power of AWS and GCS can do the job for you?

Of course all this comes at a cost: when the site becomes really big, AWS bills get higher. Recently at my employer one of our GCS performance testing environments ran up a bill of $10K for 1 month, which went up in a poof since you don't own the hardware, you lease it, so it's like renting vs owning a house. But we were able to finish perf testing quickly, as bringing up a new env was faster than being stuck in the hardware pipeline. Also, Adrian Cockcroft once said that if you are leasing SSDs it's even better, as SSD wear-out is not your problem. But startups have an opportunity cost: if they can get the product out with fewer developers in less time, then later when they become big they can hire specialists. Also, each developer in the Bay Area nowadays has a fully loaded cost of more than $150-200K, so for startups strapped for cash, AWS and GCS can be a boon, and they can solve the high-bill problem when they really reach that stage. Isn't it great if you get an AWS bill of $100K, because that means you must be making 10-30 times that.

Monday, November 10, 2014

A well-intentioned public API can bring down a server

APIs are powerful creatures and people can use them to do tons of weird things. We had exposed a public API to create a link, and because our UI could select multiple files and generate links for them in bulk in one call, our public API mimicked this behavior.

Today a server was running hot with full GCs, so I took a heap dump and restarted it. Upon analysing the heap dump in Eclipse Memory Analyzer I found that the syslog appender was choked, with a queue of 10K messages of 2MB each. I copied a message value and found the class name in the log message.

Apparently, whenever a link was created, the code would iterate over the files and log a line for each one. There was a bug in the log statement: it logged the entire event instead of just that file.

for (Object target : linkRequest.getTargets()) {
    logger.info("queuing preview generation for {}", event); // bug: logs the whole event; the log level here is a guess
}

QA and developers can't easily detect this, and most people in code review pay less attention to logger messages.

But today a customer did some automation and created a link for each file in a folder in one call, and he had 10K files in that folder. This means the 2MB payload of the link generation request was logged 10K times in the for loop, causing the full GCs.
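The blow-up is quadratic: the message size grows with the number of files, and so does the number of messages. The Java sketch below compares the two logging patterns by counting the characters that would be written. Everything in it (class name, sample file names, and the use of a comma-joined string as a stand-in for the request's string form) is made up for illustration and is not our production code.

```java
import java.util.ArrayList;
import java.util.List;

class LogVolumeSketch {
    static final String PREFIX = "queuing preview generation for ";

    // Characters logged when every iteration logs the whole request (the buggy pattern).
    static long charsLoggedWholeRequest(List<String> targets) {
        String wholeRequest = String.join(",", targets); // stand-in for the request's string form
        long total = 0;
        for (String target : targets) {
            // Only lengths are summed, so no large strings are actually built here.
            total += PREFIX.length() + wholeRequest.length();
        }
        return total;
    }

    // Characters logged when every iteration logs only its own target (the fixed pattern).
    static long charsLoggedPerTarget(List<String> targets) {
        long total = 0;
        for (String target : targets) {
            total += PREFIX.length() + target.length();
        }
        return total;
    }

    // Generates hypothetical file names to act as link targets.
    static List<String> sampleTargets(int count) {
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            targets.add("file-" + i + ".pdf");
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> targets = sampleTargets(10_000);
        System.out.println("whole request per line: " + charsLoggedWholeRequest(targets) + " chars");
        System.out.println("one target per line:    " + charsLoggedPerTarget(targets) + " chars");
    }
}
```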

The fix was simply to change the logger message in the for loop, but I will talk to the API designer about putting an upper limit on the number of links created in one call.

for (Object target : linkRequest.getTargets()) {
    logger.info("queuing preview generation for {}", target); // fixed: logs only the current target; the log level here is a guess
}
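The upper limit could look something like the Java sketch below. The class name, the limit of 100 and the error messages are assumptions for illustration; the real limit is something to agree on with the API designer.

```java
import java.util.List;

class LinkRequestLimit {
    // Hypothetical cap; the real value is a product decision.
    static final int MAX_TARGETS_PER_CALL = 100;

    // Rejects empty requests and requests with more targets than the cap allows.
    static void validate(List<String> targets) {
        if (targets == null || targets.isEmpty()) {
            throw new IllegalArgumentException("at least one target is required");
        }
        if (targets.size() > MAX_TARGETS_PER_CALL) {
            throw new IllegalArgumentException(
                    "too many targets in one call: " + targets.size()
                            + " (max " + MAX_TARGETS_PER_CALL + ")");
        }
    }

    public static void main(String[] args) {
        validate(List.of("a.pdf", "b.pdf")); // within the limit, passes silently
        System.out.println("request accepted");
    }
}
```

Rejecting oversized requests up front means a client with 10K files has to split the work into batches, which keeps the cost of any single call bounded.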