1) Our DBAs had done some profiling a year ago and told me that a "Select 1 from dual" query was taking most of the time in the database. My reaction was: this is the query we use to validate connections borrowed from the commons-dbcp pool, so I can't take it out, and why would select 1 from dual take time in the database anyway? But then I installed New Relic on 16 app servers in one data centre, went to the Database tab, and immediately saw this query being fired 40-50K times. This was an Aha moment, and I immediately started looking for alternatives.
Finally I settled on tomcat-dbcp because it has a property called validationInterval (default is 30 sec). It still fires select 1 from dual, but only on a connection that has not been validated in the last 30 sec. The fix goes live this weekend, so fingers crossed that there are no serious issues with tomcat-dbcp in a highly concurrent environment; UAT tests look promising so far.
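To make the validationInterval idea concrete, here is a minimal sketch in Python of the "validate only if not validated recently" check. It is not the pool's actual implementation: PooledConnection and validate_on_borrow are hypothetical names, and the query is only counted, not executed.

```python
import time


class PooledConnection:
    """Hypothetical stand-in for a pooled connection; it only counts validations."""

    def __init__(self):
        self.last_validated = None   # time of the last successful validation
        self.validation_queries = 0  # how many times the query was "fired"

    def run_validation_query(self):
        # A real pool would execute "SELECT 1 FROM dual" against the database here.
        self.validation_queries += 1
        return True


def validate_on_borrow(conn, validation_interval=30.0, now=None):
    """Fire the validation query only if conn was not validated within the interval."""
    now = time.monotonic() if now is None else now
    recently_validated = (
        conn.last_validated is not None
        and now - conn.last_validated < validation_interval
    )
    if recently_validated:
        return True  # skip the round trip to the database
    ok = conn.run_validation_query()
    if ok:
        conn.last_validated = now
    return ok


conn = PooledConnection()
# Borrow the same connection once per second for two minutes of simulated time.
for second in range(120):
    validate_on_borrow(conn, validation_interval=30.0, now=float(second))
print(conn.validation_queries)  # 4 validations instead of 120
```

With a 30 sec interval, 120 borrows of the same connection trigger the query only at seconds 0, 30, 60 and 90, which is where the saving comes from.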
2) The second Aha moment was with the plugin support in New Relic. I used to write platform software at my previous startup, so I really like extensible software, especially platforms like Eclipse or Facebook that are extensible with plugins. We use RabbitMQ as our queuing backbone and it's rock solid. I asked ops to install the New Relic plugin and immediately saw that one queue had 1.5M messages. After asking the developer, I found out they had changed queue names and this old one had never been purged. These 1.5M messages were consuming 75 MB of memory, so it was not a big issue, but it was still an Aha moment. We purged the queue and the problem was solved.
3) The third Aha moment came when I installed the memcached plugin. I could immediately see that one class of memcached servers, which stores static metadata, was using 10% of its memory, while another class, which stores transactional metadata, was using 80-90%. Memcached uses slab allocation, so at 80-90% memory usage you are sure to see evictions, and we were seeing them at a rate of about 10 per sec. That is not much, but why should we see evictions at all? All this data was already available in the memcached stats, but we were not looking at it proactively because both memcached and RabbitMQ have rarely given us issues, and spotting this kind of trend across 20 memcached servers from raw stats output was not easy. This weekend we are redistributing the memory allocation: giving more memory to the transactional servers and reclaiming unused memory from the metadata servers.
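The same comparison can be scripted against the raw stats output. The sketch below parses the text that memcached returns for the stats command and lists the servers at or above a memory threshold. The stat names bytes, limit_maxbytes and evictions are standard memcached stats; the server names and numbers are made up for illustration, and fetching the text from each server is left out.

```python
def parse_stats(raw):
    """Turn 'STAT name value' lines into a dict of name -> value (as strings)."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(raw_by_server, threshold_pct=80.0):
    """Return (server, used_pct, evictions) for servers at or above the threshold."""
    hot = []
    for server, raw in raw_by_server.items():
        stats = parse_stats(raw)
        used_pct = round(100.0 * int(stats["bytes"]) / int(stats["limit_maxbytes"]), 1)
        if used_pct >= threshold_pct:
            # evictions is a cumulative counter; a per-second rate needs two samples.
            hot.append((server, used_pct, int(stats["evictions"])))
    return sorted(hot, key=lambda row: row[1], reverse=True)


# Made-up sample output for two hypothetical servers.
sample = {
    "meta-1": "STAT bytes 100000000\nSTAT limit_maxbytes 1000000000\nSTAT evictions 0\nEND",
    "txn-1": "STAT bytes 850000000\nSTAT limit_maxbytes 1000000000\nSTAT evictions 123456\nEND",
}
for server, used_pct, evictions in summarize(sample):
    print(server, used_pct, evictions)
```

Run over all 20 servers, this gives one short list of the servers that need more memory instead of 20 separate stats dumps.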
4) One more thing that was good to know is our average response time for all APIs across the board. We had all these stats at a daily level but no trend information on an hourly basis. As you can see below, there is not much difference between one DC and the other over the last 34 hours.
1) For plugins there is no aggregation. I wanted plugin data from all servers aggregated the way they do for web transactions. We have 400+ servers, and I don't want a dashboard for comparing IO averages that looks like this, because I really want to focus on the servers that are anomalies. I have 100+ MySQL servers and the number is growing fast, but I don't want to install the MySQL plugin because the graphs are per server, which makes it useless to me. I have the same issue with the HAProxy plugin. I already have all this data in Cacti and Graphite, so I am checking whether I can use OpenTSDB to produce one graph that lets me focus on anomalies. VividCortex has a nice bubble paradigm for this kind of analysis, but apart from the bubble view that tool did not have much of interest to me for MySQL.
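As a rough illustration of the "show me only the anomalies" idea, the sketch below takes one averaged metric per server and keeps the servers whose value is far from the fleet median. It uses the modified z-score (based on the median absolute deviation) with the commonly used 3.5 cutoff. This is just one possible technique, not what New Relic, OpenTSDB or VividCortex do, and the server names and IO numbers are made up.

```python
from statistics import median


def find_anomalies(metric_by_server, cutoff=3.5):
    """Return {server: modified z-score} for servers far from the fleet median."""
    values = list(metric_by_server.values())
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return {}  # at least half the fleet is identical; this score is undefined
    anomalies = {}
    for server, value in metric_by_server.items():
        score = 0.6745 * (value - med) / mad
        if abs(score) > cutoff:
            anomalies[server] = round(score, 1)
    return anomalies


# Made-up IO wait averages (percent) for six hypothetical servers.
io_wait = {
    "db-01": 4.0, "db-02": 5.0, "db-03": 4.5,
    "db-04": 5.5, "db-05": 42.0, "db-06": 4.8,
}
print(sorted(find_anomalies(io_wait)))  # only db-05 stands out
```

The median and MAD are used instead of mean and standard deviation so that one badly behaving server does not shift the baseline it is compared against.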
2) New Relic is good at detecting surface-level issues and pointing the developer in the right direction, but after that, what? Our app uses a Fair Share thread pool (http://neopatel.blogspot.com/2013/06/java-fair-share-threadpool.html) and every API request is routed through it. The Fair Share thread pool ensures that no one customer hogs all the resources on one machine. The problem is that New Relic does not follow the transaction trace into async activity: it tells me 99% of the time is spent in the method that delegates to the Fair Share thread pool, which is useless to me. I have not given up on it, but I would need to spend time either changing code to add custom @Trace annotations or deploying a yaml file with the cut points to trace on each server. I will try this later in my free time, and if we decide to buy New Relic.
3) New Relic only shows errors that are sent back to the browser as errors, but what about exceptions that are swallowed? Or it tells me the /webdav URL has errors and is returning 500s at a rate of 15 per minute, but now what? Which exception happened? The same URL could fail with a different exception for different customers, and I can't tail logs on 100+ servers. I have a home-grown APM that probably does a better job here, so we will probably stick with that for the time being.
Overall I am very happy with New Relic, and maybe I need to play with its more advanced features. It is good at detecting general trends and pointing you in the right direction. It has less clutter than the other APM tools I tried, and best of all it has been super fast for me so far, even when I am analyzing the last 3 days' worth of data.