Saturday, May 31, 2014

New Relic aha moments with a Java app

I am integrating New Relic at our startup, and in the last 2 weeks there were several aha moments. Below are some of them.

1) Our DBAs did some profiling a year ago and told me that a "Select 1 from dual" query was taking most of the time in the database. My reaction was that this is the query we use to validate connections from the commons-dbcp pool, so I can't take it out, and why would select 1 from dual take time in the database anyway? But then I installed New Relic on 16 app servers in a data centre, went to the Database tab, and immediately saw this query being fired 40-50K times. This was an aha moment, and I immediately started looking for alternatives.

Finally I settled on tomcat-dbcp because it has a property called validationInterval (default is 30 sec). It still fires select 1 from dual, but on a given connection only if it hasn't already done so in the last 30 sec. The fix is going live this weekend, so fingers crossed that there are no serious issues with tomcat-dbcp in a highly concurrent environment. UAT tests look promising so far.
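The idea behind a validation interval can be sketched in a few lines. This is a toy model only, not the pool's actual implementation: PooledConnection, queries_fired and the injectable clock are hypothetical names made up for illustration.

```python
import time


class PooledConnection:
    """Toy stand-in for a pooled connection (hypothetical, not a real pool API)."""

    def __init__(self, validation_interval=30.0, clock=time.monotonic):
        self.validation_interval = validation_interval  # seconds
        self.clock = clock
        self.last_validated = None
        self.queries_fired = 0  # counts simulated "select 1 from dual" round trips

    def validate(self):
        now = self.clock()
        recently_checked = (
            self.last_validated is not None
            and now - self.last_validated < self.validation_interval
        )
        if recently_checked:
            # Validated within the interval: skip the database round trip.
            return True
        # Otherwise fire the validation query (simulated here) and remember when.
        self.queries_fired += 1
        self.last_validated = now
        return True
```

With a 30 sec interval, a connection borrowed 100 times within a few seconds fires the validation query once instead of 100 times, which is where the drop in "select 1 from dual" traffic would come from.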

2) The second aha moment was with plugin support in New Relic. I used to write platform software at my previous startup, so I really like extensible software, especially platforms like Eclipse or Facebook that are extensible with plugins. We use RabbitMQ as our queuing backbone and it's rock solid. I asked ops to install the New Relic plugin and immediately saw that one queue had 1.5M messages. After asking the developer, I found out they had changed the queue names and the old queue was never purged. These 1.5M messages were consuming 75M of memory, so it was not a big issue, but it was still an aha moment. We purged the queue and the problem was solved.

3) The third aha moment was when I installed the memcached plugin. Immediately I could see that one class of memcached servers, which stores static metadata, was using 10% of its memory, while another class, which stores transactional metadata, was using 80-90%. Memcached uses slab allocation, so at 80-90% memory usage you are sure to see evictions, and we were seeing them at a rate of 10 per sec. That is not much, but why should we see evictions at all? All this data was already present in the memcached stats, but we were not looking at it proactively, because both memcached and RabbitMQ have rarely given us issues, and spotting this kind of trend across 20 memcached servers was not easy from the stats output. This weekend we are redistributing the memory allocation, giving more memory to the transactional servers and reclaiming unused memory from the metadata servers.
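For reference, the same numbers can be pulled out of the raw stats text by hand. A minimal sketch, assuming the usual "STAT name value" line format of the memcached stats command and its bytes, limit_maxbytes and evictions counters; parse_stats and summarize are hypothetical helper names, and the sample values in the usage note are made up.

```python
def parse_stats(raw):
    """Turn the text of a memcached 'stats' reply into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        # Keep only well-formed "STAT <name> <value>" lines; skip "END" etc.
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(raw):
    """Return (memory used in percent, total evictions) for one server."""
    stats = parse_stats(raw)
    used_percent = 100.0 * int(stats["bytes"]) / int(stats["limit_maxbytes"])
    return used_percent, int(stats["evictions"])
```

Feeding it a made-up reply such as "STAT bytes 900", "STAT limit_maxbytes 1000", "STAT evictions 42" gives 90% used and 42 evictions. Note that evictions is a running total, so a per-second rate needs two samples taken some time apart.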

4) One more thing that was good to know is our average response time for all APIs across the board. We had all these stats at the daily level but no trend information on an hourly basis. Below, as you can see, there is not much difference between one DC and the other over the last 34 hours.

Of course not everything is perfect with New Relic. I have some minor cribs too:

1) There is no aggregation for plugins. I wanted plugin data from all servers aggregated the way they do it for web transactions. We have 400+ servers, and I don't want to look at a dashboard comparing IO averages that looks like this, as I really want to focus on the servers that are anomalies. I have 100+ MySQL servers, and growing fast, but I don't want to install the MySQL plugin because the graphs are per server, so it's useless. I have the same issue with the HAProxy plugin. I already have all this in Cacti and Graphite, so I am seeing if I can use OpenTSDB for this and produce one graph that lets me focus on anomalies. VividCortex has a nice bubble paradigm for this kind of analysis, but that tool didn't hold much interest for me for MySQL apart from the bubble view.

2) New Relic is good at detecting surface-level issues and pointing the developer in the right direction, but after that, what? Our app uses a fair share thread pool and every API request is routed through it. The fair share thread pool ensures that no one customer hogs all the resources on a machine. The problem is that New Relic doesn't follow the transaction trace into async activity: it tells me 99% of the time is spent in the method that delegates to the fair share thread pool, which is useless to me. I have not given up on it, but I would need to spend time changing the code to add custom @Trace annotations, or deploying a YAML file with cut points to trace on each server. I will try this later in my free time, and if we decide to buy New Relic.

3) New Relic only shows errors that are sent back to the browser as errors, but what about exceptions that are swallowed? Or it tells me the /webdav URL has errors and is sending 500s at a rate of 15 per minute, but now what? Which exception happened? The same URL could have failed with a different exception for different customers, and I can't tail logs on 100+ servers. I have a home-grown APM that probably does a better job here, so we will probably stick with that for the time being.

Overall I am very happy with New Relic, and maybe I need to play with the more advanced features. It's good at detecting general trends and pointing you in the right direction. It has less clutter than the other APM tools I tried, and best of all it has been super fast for me so far, even when I am analyzing the last 3 days' worth of data.

Monday, May 19, 2014

Programming Epiphany

I hate doing tasks that are repetitive, or that someone else should be doing, because they are a waste of my time. Earlier we used to store user and customer data in LDAP, and there were all sorts of BS requests from marketing like "all customers that are on PlanY and are buy domain with >5 users". The problem was that there were 40 LDAPs, and each of these requests would require writing a custom script. Teaching other programmers LDAP was not reasonable. One of my goals at my employer is to "Empower people to retrieve information themselves". I don't want to be a bottleneck in their chain of thought, and they need not be a bottleneck in what I want to do. I don't want to involve humans if at all possible.

So when we migrated from LDAP to MySQL, the first thing I did was consolidate 40 LDAPs per data centre into 4 MySQL servers per data centre. I could have consolidated to 1, but that would suffer from the noisy neighbour issue. The problem is that there are 3 data centres, so 12 MySQL servers in total. Still, people would come to me for data just because I architected Ldap2Mysql. People need to understand that my job is to "Empower people by writing tools/frameworks so that they can do their job themselves".

Two weeks back I got a custom request where someone wanted the emails of all users in 11 thousand customer accounts. My reaction was that this is BS: either I can give you all or I can give you none. It would waste 2-3 hours of my time if I had to do it. I even tried creating a SQL query with 11K domains in the IN clause. MySQL would run it, but my Python program that queries the 12 MySQL servers would bomb due to the command-line argument length limit. So I refused to do it, and it was given to someone else on the team, who would split the list into chunks of 1K domains and combine the CSVs. But then I realized that the issue for me was not splitting the list into 1K domains and constructing the IN clauses, as I can do that programmatically or in Excel in 10-20 min. The problem was that my program generated output in Python tuple format, so someone would have to combine the 11 log files and do all the data scrubbing to generate the final CSV.
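The splitting step is mechanical enough to script. A minimal sketch, assuming a hypothetical users table with email and domain columns (the real schema is not shown in this post) and a MySQL driver that accepts %s placeholders; chunked and build_queries are made-up helper names.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_queries(domains, size=1000):
    """Return one (sql, params) pair per chunk of domains."""
    queries = []
    for chunk in chunked(domains, size):
        # One placeholder per domain, so the values travel as parameters
        # instead of being pasted into the SQL text or the command line.
        placeholders = ", ".join(["%s"] * len(chunk))
        sql = "SELECT email FROM users WHERE domain IN (" + placeholders + ")"
        queries.append((sql, list(chunk)))
    return queries
```

For 11K domains and a chunk size of 1000 this yields 11 parameterized queries, which a driver could run one after another against each server.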

Today the developer attached to the internal apps team was again asked to produce a list of all admin users of all customers who want to receive our newsletter. The developer sent me an email saying that I should do it. My reaction was wth, it's not my job; the marketing team should hire someone with SQL skills to do it, and I can give them pointers on how to retrieve the information. The problem was that the one other developer I could delegate to was on medical leave today, so the thing came back to me. I had to go to my son's school in 20 min, so I was in a time crunch. Then I thought: I have everything. I have a program that, given a template query, executes it on all MySQL databases and prints the results in tuple format, and wanting the results as CSV is a common requirement, so why not change the program to do both?

Within 5 min I changed the code like this
        # before: print each row as a Python tuple
        for row in pc_db_conn.execute(query):
            print(row)
        # after: also write each row as a quoted CSV line
        for row in pc_db_conn.execute(query):
            print(row)
            # I know this is BS and I should use the csv module, but this is an internal tool
            csv_file.write(','.join(['"' + str(s) + '"' for s in row]) + '\n')

and tada, the job was done. Now it seems that for this kind of query I have eliminated the need for that developer. So I eliminated a human from one more task, and this gave me my programming epiphany for the day.
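For comparison, the csv module route mentioned in the code comment is only a few lines. A minimal sketch, assuming each row is a plain tuple of values; write_rows is a hypothetical helper, not part of the original program.

```python
import csv


def write_rows(rows, out):
    """Write rows to the file-like object `out` as fully quoted CSV."""
    # QUOTE_ALL mirrors the hand-rolled version, which quotes every field;
    # the csv module additionally escapes quotes that appear inside a value.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
```

The practical difference shows up when a value itself contains a double quote or a comma: the hand-rolled join would produce a broken line, while csv.writer doubles the embedded quote.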

I spent the whole day battling with New Relic, AppDynamics and VividCortex to find performance anomalies, but this 5 min change gave me more joy than anything I found using those tools.

This blog post is dedicated to a fellow developer who once asked me how I come up with all these ideas. I am just trying to describe the thought process: trying to eliminate myself from grunt work forces me to think of these things.

Sunday, May 18, 2014

Jenkins Execute shell FAILURE

I was trying to run a Jenkins Execute shell step that greps exceptions out of the Tomcat logs, something like

echo "grepping logs for exceptions"
grep Exception purato.log* | grep -v WARN | grep -v DEBUG > error_summary.txt

The problem was that if the log files had no exceptions, this step would fail and mark the job as failed. I wanted the job marked as successful even when the grep didn't yield any output. I tried various solutions, such as putting "exit 0" as the last command, but Jenkins would fail immediately at the first shell command that failed.

The problem was that Jenkins starts the shell with this command

[workspace] $ /bin/sh -xe /tmp/
From the man page:
-e errexit  If not interactive, exit immediately if any untested command fails. 
            The exit status of a command is considered to be explicitly tested 
            if the command is used to control an if, elif, while, or until;
            or if the command is the left hand operand of an “&&” or “||” operator.

This -e means stop at the first error.

Adding "#!/bin/sh" at the top of the script solved the issue, because Jenkins now respects this and won't start the shell with the -xe switch. Now it starts with

[workspace] $ /bin/sh /tmp/
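The effect of -e is easy to reproduce outside Jenkins. A minimal sketch, assuming /bin/sh and grep are available on the machine; run_step is a hypothetical helper, and the two-line script only mimics the build step above rather than reading real Tomcat logs.

```python
import subprocess

# A grep that finds nothing exits with status 1, like the build step
# when the logs contain no exceptions.
SCRIPT = "echo 'INFO all good' | grep Exception > /dev/null\necho 'after grep'"


def run_step(shell_flags):
    """Run SCRIPT under /bin/sh with extra flags; return (exit code, stdout)."""
    result = subprocess.run(
        ["/bin/sh"] + list(shell_flags) + ["-c", SCRIPT],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


with_errexit = run_step(["-e"])  # like Jenkins' default "sh -xe" (minus -x)
without_errexit = run_step([])   # like the script with its own "#!/bin/sh"
```

Under -e the shell stops at the failed grep pipeline, so the second command never runs and the exit status is non-zero. Without -e the script carries on, and the exit status is that of the last command.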

Monday, May 12, 2014

Dell Latitude: find your Windows 7 product key

Huh, what a bummer. I had obliterated Windows from my laptop and installed Ubuntu on it. I had an old XP VM that I was using, but since Microsoft stopped supporting XP it's a security nightmare to keep using it every time I sign in to Skype or GTM.

So I wanted a Windows 7 VM, but for that I needed the license and had to find the product key. I found all sorts of weird answers while searching for how to find it. Logging in to the BIOS didn't work, and logging in to the Dell support site didn't work.

Finally I remembered that when I had bought the laptop, I had removed the battery to solve a hardware issue with the mouse, and there was some sticker under it.

That was a eureka moment :). I remember that on my old Dell the sticker was directly under the laptop, and after 2 years the letters were barely visible. So Dell did a good thing, but they need to either say on the website that the Windows product key is under the battery, or put some sticker on the battery, or better, just put the key in the BIOS settings.

Predictive weather forecasting

Checked the weather today and saw this: "Rain will start in 15 min", and within 10 min rain started falling.

They are calling it MinuteCast.

It is interesting that we are reaching this level of predictability.

Thursday, May 8, 2014

Reading concentration: mobile vs desktop

I read a hell of a lot of content these days on Zite and Quora. I am not sure why, but something has changed in the past 4-6 months: earlier I was a big-time consumer of Netflix, and nowadays I rarely watch it. I cut the cord on cable 2 years ago and I rarely miss it.

I put my son to bed every day, and the guy takes more than an hour to fall asleep. He will find all sorts of excuses, from "I am thirsty" to "I have to pee", just to get out of the bedroom. With kids you have to stick to the routine and be consistent, or else if he is up beyond a certain time he gets cranky, and what usually takes an hour takes 2 hours. So for that 1 hour I am in a dark room waiting for him to sleep. I recently started pulling a blanket over myself and reading on my phone, but the wifi in the bedroom sucks, as it is far from the router in the study where I work during the day. Also, I used to have an iPhone 3GS, which was good for calling, but apps were too slow and the T-Mobile 3G network used to suck. I switched to an iPhone 5 a month ago and it's super fast. On top of that I switched back to AT&T, and 4G LTE is super fast. So this has been a blessing. Now I have almost 30-45 min daily of uninterrupted time to think or read.

So I use Zite/Quora on my mobile to read things. One thing I noticed is that when you are reading on mobile you have the utmost concentration on that one and only one article, and there are very few things disturbing you. I don't use WhatsApp or Facebook, and I haven't even synced Gmail on my phone. I don't want to be disturbed while I am in the bedroom. I had a discussion with a relative who is a big shot at a very big company. He lives his life on mobile as he is always travelling, and so does his son, who is on campus. He and his son peeked at my phone and their reaction was "man this phone is virgin": why do you even have this phone, there are no apps on it, not even email. I work from home, so 90% of the time I am at home in front of my laptop, and that's why I don't need any apps. Now, after getting the fast phone and the faster data plan, for that 1 hour daily I am on the phone with no distractions. By the same logic I am in front of the laptop daily too, but there are so many distractions: if I am reading an article or the news in the morning, I am getting pings from many people, or calls on Skype, or IMs on Jabber, and even on the web there are so many ads and other things on the rendered page. There is also multi-tab browsing on the desktop. The problem with that is your concentration breaks, and if one page is loading slowly you immediately switch to another. On mobile you have no choice: if it's slow, you wait. So lately I am observing that when I read Quora/Zite on mobile I have more concentration than when I am in Chrome or FF on the desktop.

To end the post on concentration: I found a very good answer on Quora about what programming is. I do this a lot of the time: I stand in front of the study window staring outside, or I sit in the chair staring at a blank window, or I take a 5 min break to watch my fig plants, to relax and think.