Tuesday, December 31, 2013

Penetration testing and crowdsourcing

We take the security of our data and our customers seriously, and any security issue found is patched ASAP.  We hired outside security testing companies to do our testing in case the developers missed something. Initially we hired a company, let's call it YYYhat, and they were great in the first few months, finding many standard issues like XSS, XSRF, and session hijacking, but after some time no new issues were found. Then one day a customer reported an issue that YYYhat had not detected, so we hired another company, let's call it YYYsec, and this company was good at finding some SQL injection and some XSS in other parts of the system that YYYhat had consistently missed.  But again the pipeline from YYYsec dried up, and we thought we were secure.

We even asked our engineers to start downloading penetration testing tools and automate them to detect standard issues.  But again, they didn't find much.

Lately I have been observing that these specialized security testing companies are one-skill or some-skill shops, not jacks of all trades. They are good at finding some set of security issues but completely miss other kinds. We observed this with YYYhat, YYYsec, and our own engineers.

This week we hired another company, let's call it crowdYYY. I have high hopes for crowdYYY because you just spawn a testbed for them and post a reward; bounty hunters (aka penetration engineers) come, penetrate your system, and submit vulnerabilities. If you think a submission is really a good vulnerability, you can reward them anywhere from $50 to $100 or even $1000.  I liked the idea behind crowdYYY because it's crowdsourced security testing: different security engineers are good at different penetration testing skills, so in turn we get a wider variety of testing. This should make our site more secure.

Combine mysql alter statements for performance benefits

We have 1200+ shards spread across 20 MySQL servers. I am working on a denormalization project to remove joins and improve the performance of a big snapshot query.  I had to alter a table to add 8 denormalized columns, so initially my script looked like:

alter table file_p1_mdb1_t1 add column ctime BIGINT;
alter table file_p1_mdb1_t1 add column size BIGINT;
alter table file_p1_mdb1_t1 add column user_id BIGINT;

In the performance environment, when I generated the alter script and started applying it, I saw output like the below, where each alter takes around 30 seconds. At that rate it could take 80 hours to alter 1200 shards with 8 alters per table; even doing all 20 servers in parallel, it would take 4 hours.

Query OK, 1446841 rows affected (33.58 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (31.66 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (31.86 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (32.15 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

So I rewrote it as one combined alter:
alter table file_p1_mdb1_t1 add column ctime BIGINT,
add column size BIGINT,
add column user_id BIGINT;

and this takes roughly the same constant time as a single-column alter, because MySQL rebuilds the table once either way. That brings the total down to about 10 hours serially, and with all 20 servers running in parallel it should finish in about 30 minutes.
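Since every shard table needs the same set of columns, the combined alters can be generated with a few lines of script. A minimal sketch (table and column names here are illustrative, not our real schema):

```python
# Sketch: emit one combined ALTER per shard table instead of one per column,
# so each table is rebuilt only once. Names below are illustrative.
COLUMNS = [("ctime", "BIGINT"), ("size", "BIGINT"), ("user_id", "BIGINT")]

def combined_alter(table, columns=COLUMNS):
    # Join all "add column" clauses into a single ALTER statement.
    adds = ",\n  ".join("add column %s %s" % (name, typ) for name, typ in columns)
    return "alter table %s\n  %s;" % (table, adds)

def alter_script(tables):
    # One combined statement per shard table.
    return "\n\n".join(combined_alter(t) for t in tables)
```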

Query OK, 1446841 rows affected (33.58 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Lesson learned: when you are operating at scale, even small optimizations count.

Monday, December 30, 2013

Mysql replication and DEFAULT Timestamp

If you are using MySQL replication to replicate data across data centres, and you are using statement-based replication, then don't use TIMESTAMP columns with DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP.


We recently set up a simple ring replication across 3 data centres to manage our user account identities and ran into this issue.  Because of it, when the same statement is applied in another data centre, the timestamp column can get a different value.  For now we will remove these column defaults and generate the time from Java, so that we can run a hash-based consistency checker on replicated rows.
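A hash-based checker of the kind mentioned above can be sketched roughly like this; fetching the rows from each data centre is left abstract, and the row layout is hypothetical:

```python
import hashlib

def row_hash(row):
    # Hash the replicated column values of one row; the same row in two
    # data centres should hash identically if replication is consistent.
    payload = "|".join(str(v) for v in row)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def diff_rows(rows_a, rows_b):
    # rows_*: {primary_key: tuple_of_column_values} for the same table in
    # two data centres; returns primary keys whose two copies differ.
    ha = {pk: row_hash(r) for pk, r in rows_a.items()}
    hb = {pk: row_hash(r) for pk, r in rows_b.items()}
    return sorted(pk for pk in ha.keys() & hb.keys() if ha[pk] != hb[pk])
```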

Url rewriting XSS and tomcat7

If you have URL rewriting enabled and your site has an XSS vulnerability, then the site can be hacked by reading the session id out of the URL in JavaScript and sending it to a remote server.

To disable this on tomcat7 (Servlet 3.0) the fix is simple.

add this to your web.xml:

    <session-config>
        <tracking-mode>COOKIE</tracking-mode>
        <cookie-config>
            <http-only>true</http-only>
            <secure>true</secure>
        </cookie-config>
    </session-config>

The tracking-mode stops tomcat from ever putting the session id in the URL. Also you should mark your session id cookie secure and httpOnly, which the cookie-config block above does.

Wednesday, December 25, 2013

Velocity, code quality and manual QA

Unless you can remove manual QA, you can have either code quality or release velocity, not both; you have to compromise on one of them.

We recently did a big release where we replaced a lot of old DWR APIs with REST APIs and revamped a big part of the old ExtJS UI with a backbone-based UI. The release was done over 4 weeks, and even after delaying it by a week, some bugs were found by customers.  I had expected some issues, and I am a proponent of "fail fast and fix fast".  Many of these issues we were able to find and fix over the weekend itself.

The following week we found an issue on Friday afternoon, and we had a similar bug already fixed for the next release that would fix this bug too, but QA was unsure about merging the fix as most of the QA team was out. This is the second-to-last week of December and it seems everyone is out for Christmas, so it was decided to delay the fix until QA was available. Now I have to run a query daily to detect the bad data and run cleanup scripts.

After doing this cleanup for the last 2 days, I realized this is BS and started asking how we can break this dependency on human touch points. We have code coverage, but only 45% line coverage and some 31% branch coverage.  It seems we have two conflicting goals, "velocity of feature releases" and "code quality".  We do one release every 2-3 weeks, and until we can automate the hell out of the code base to get close to 90%+ coverage, we will keep running into this situation. We want to release every week, or eventually every day, and unless we can eliminate human QA that goal seems far-fetched.

nginx enable compression and JS/CSS files

One of my old colleagues was trying to find a quick way to improve performance on his pet project website, so he asked me to take a look.  The first thing I did was run PageSpeed, and I immediately saw that compression was off and CSS/JS was not aggregated. JS/CSS aggregation would require either build changes or installing pagespeed, so I figured a quick one-hour fix would be to just enable compression. I went and enabled "gzip on;" in the nginx conf, and it gave a good bump on the login page, but on the home page after login, PageSpeed kept complaining that JS/CSS was not compressed. I was scratching my head and left it as is that day.

Last night I finally found out that turning gzip on only compresses text/html by default in nginx.  You need to explicitly turn it on for other types, so I added the below and the problem was solved.

    gzip_types  text/plain application/xml text/css text/js text/xml application/x-javascript text/javascript application/javascript application/json;
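Putting the two directives together, the relevant part of the nginx conf ends up looking roughly like this (the gzip_min_length threshold is an illustrative addition, not from the original config):

```nginx
gzip            on;
# nginx compresses only text/html by default; list other types explicitly
gzip_types      text/plain application/xml text/css text/xml
                application/x-javascript text/javascript
                application/javascript application/json;
gzip_min_length 1024;  # skip very small responses; illustrative value
```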

While doing this I also installed pagespeed with nginx. It was not as much fun as installing it on apache, because with nginx, if you want to add a module, you have to recompile nginx from source. But it was worth it: the PageSpeed score went from 60 up to 90.

I really like products that remove human touchpoints, and PageSpeed is one of them; thanks to Google for creating this awesome product.  At my employer's website, before pagespeed, I had to hunt down developers every release to check that they had added proper version tags and followed the build-time aggregation guidelines. Given our release velocity, this was the first thing sacrificed, and the debt would keep piling up release after release. One day some customer would complain and everyone would scramble to get it fixed in a week. But thanks to pagespeed, no more of this shit.

Tuesday, December 24, 2013

120K lines of code changed and 7000 commits in 4+ years and going strong

Wow, svn stats reports are interesting.  I recently ran a report and found that I have made 7000+ commits to svn and changed 120K+ lines at my startup.  Still, I feel like I just joined and do not know enough parts of the system.

I wonder if there is a tool that can estimate how many lines of code I have read.  If I changed 120K lines, and assuming I read three times the code I change, that estimate still seems low.

Sunday, December 22, 2013

how a loose wire can lead to strange network issues

On Thursday I started having weird network issues where skype would disconnect and the network would be choppy. I had to use the 3G on my phone to make skype calls with the team, yet streaming services like netflix worked fine.  I tried everything from changing DNS to restarting the router to shutting down extra devices.  Finally I remembered that the last time this happened, Time Warner Cable had asked me to plug the laptop directly into the modem to rule out router issues.

When I connected to the modem over ethernet, the connection was also choppy, and suddenly I thought, let me check the connections. I found that the incoming cable connection to the modem needed 1-2 turns to tighten it. That's it, all issues solved. But this loose wire definitely gave me heartburn for 2 days, and my son too, who kept complaining because his netflix would constantly show the red download bar :).

Friday, December 20, 2013

eclipse kepler pydev not working

On my new laptop I installed eclipse kepler and then tried installing pydev, and no matter what I did, pydev wouldn't show up.  Finally I figured out that the latest pydev requires jdk1.7, and unless I boot eclipse in a jdk1.7 VM it won't show up. Wtf, that's pretty lame; at least it could show a warning dialog or something.

Anyway, my startup uses jdk1.6 and I didn't want to pollute the default jdk, so the solution was to download the jdk1.7 .tar.gz from oracle and explode it into some directory. Then I opened the eclipse.ini in the eclipse install directory and added a -vm entry (it must come before the -vmargs line; the path below is wherever you extracted the jdk):

    -vm
    /path/to/jdk1.7/bin/java

and that's it. After I restarted eclipse, pydev showed up.


Change is always hard for me. I installed ubuntu 12.04 and didn't like unity, as I was so used to ubuntu 10.  I did everything possible to make the new ubuntu look like classic ubuntu.  But if you want classic ubuntu in ubuntu 12.04, it's bare bones and you have to install everything manually.  Many things kept crashing with no fix available, and some things, like seeing skype and pidgin in the systray, were a must for me. Ultimately I gave up and tried unity.

Honestly, I still don't prefer unity that much, but I have adjusted to its nuances and am slowly starting to like it.

Tuesday, December 17, 2013

skype crashes on 64 bit ubuntu12.04

I set up the new laptop, and incoming skype calls would crash skype. I disabled video and it was stable for 1-2 minutes but then crashed again.  Finally I googled and found it was an issue with the 64-bit plugin for sound.  Running this fixed it:

sudo apt-get install libasound2-plugins:i386

Friday, December 13, 2013


A lot of inertia builds up when you have been using a laptop for 4 years. I got a new company laptop and kept procrastinating about switching to it.  The primary reason was that some of the things I use daily were not working on the new laptop, mostly pidgin and pydev. Also, evolution and other settings have changed in ubuntu 12.04.

Finally I found out why pidgin was not working on my new laptop. It seems I have an 8-year-old router, and somehow one of the devices in my home keeps hijacking the IP address assigned to the new laptop. I found this out because web browsing was randomly slow on the new laptop, and when I did ifconfig I realized it had been assigned an IP that I remembered being hijacked by the Roku or some other device. Anyway, assigning a static IP to my wireless and ethernet connections seems to have solved the issue.

I am up and running on the new laptop, but lots of small things are still annoying when you change laptops: the new unity sucks, so I switched to ubuntu classic, but now the weather indicator is not working, pidgin and skype are not displaying in the systray, the external monitor somehow is not picking the proper resolution when mirroring the screen, and Alt+Tab is not working.

Anyway, the only solution, it seems to me right now, is to just start using the new laptop and keep plodding.

But there are also lots of good things about the new laptop: it has 16G of RAM and an SSD, and the mysql schema rebuild that used to take 8 minutes on the old laptop now takes 1 minute.

Tuesday, December 10, 2013

Building reusable tools saves precious time in future

I hate cases where I have made a code fix, but the fix is going live in 4 days and the customer can't wait for it. Then you have to go fix the data manually or give some workaround.  Anyway, today I ran into an issue where, due to a recent code change, a memcache key was not getting flushed, and the code fix for it is supposed to go live in 3 days. But a production support guy called me to fix it for a customer, and I can understand the frustration on the customer's part. So the choices were:

1) Ask the production support guy to flush the key in all memcached instances with the help of ops.
2) Manually go flush the key myself. This could take 10-15 minutes, plus data cleanup is boring.
3) Flush the entire memcached cluster. This is a big NO NO.

I was about to go down path #2 when I realized there is an internal REST API that the production support guy can call to flush an arbitrary key from memcached. Hurray! I just told him which keys to flush, and he knew how to call such APIs.  So thanks to whoever wrote that internal API thinking about its future use and making it generic; it saved me some time :).
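For reference, option #2, manually flushing a key on every instance, is roughly this much code. The host list and key are hypothetical; memcached's text protocol answers a delete with DELETED or NOT_FOUND:

```python
import socket

# Hypothetical memcached instances; the real list would come from config.
MEMCACHED_HOSTS = [("cache1.internal", 11211), ("cache2.internal", 11211)]

def delete_command(key):
    # memcached text protocol: "delete <key>\r\n" -> DELETED or NOT_FOUND
    return b"delete " + key.encode("utf-8") + b"\r\n"

def flush_key(key, hosts=MEMCACHED_HOSTS, timeout=2.0):
    # Send the delete to every instance, since we may not know which one
    # holds the key; collect each server's one-line reply.
    results = {}
    for host, port in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(delete_command(key))
                results[(host, port)] = s.recv(64).strip().decode("utf-8")
        except OSError as e:
            results[(host, port)] = "error: %s" % e
    return results
```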

Friday, December 6, 2013

you just have to be desperate

I recently set up a new ubuntu machine, and coincidentally my company Jabber account stopped working on both the new and the old laptop. I was busy, so I didn't spend much time on it and sent it to ops, and they said it was an issue on my side. I was puzzled, because how could it be an issue on both laptops?

But using skype is a pain, and it's also not secure, so I had to find a way to fix it.  Finally I found that you can start pidgin in debug mode using
pidgin -d > ~/debug.log 2>&1

Then I tailed the log and immediately saw:

> (17:44:44) account: Connecting to account kpatel@mycompany.com/.
> (17:44:44) connection: Connecting. gc = 0x1fc9ae0
> (17:44:44) dnssrv: querying SRV record for mycompany.com: _xmpp-client._tcp.mycompany.com
> (17:44:44) dnssrv: found 1 SRV entries
> (17:44:44) dns: DNS query for '' queued
> (17:44:44) dnsquery: IP resolved for

Doing a dig on the SRV record showed it was correct:

dig +short SRV _xmpp-client._tcp.mycompany.com
10 0 5222 mycompany-im.mycompany.com.

So I was puzzled.

Finally I saw that the proxy setting in pidgin was somehow set to the Gnome proxy. Changing it to "No proxy" fixed it. See screenshot.

I would never have dug deep and loaded pidgin in debug mode, but I was desperate to fix my jabber client and had no other option. Today I learned a lesson: if you are desperate, you will venture into exploring even uninteresting things.

Thursday, December 5, 2013

Human touch and scalable systems

I am a big fan of eliminating human touch points when it comes to scalable systems.  When I say "human touch" I mean a step that relies on a human to perform the operation rather than on bots.  A good example is db schema updates.  We use mysql sharding, and recently we split our global db into 12 sharded databases with identical schemas but different hosts.  As we were crunched for time, I didn't get time to write an automatic schema applier; the process agreed upon was to put the ddl statements in the deployment notes, and devops would apply them to all databases.

As usual, I am always skeptical of human touch points, so 2 months after going live, just out of curiosity, I wrote a mysql schema differ that dumps all 12 schemas and diffs them, and to my surprise there were schema differences.  Four of the mysql servers were set up with the latin1 character set rather than utf8 (utf8_bin). No one caught the issue because the affected table was created in sleeper mode and the feature is yet to go live next month.

We use sharding in other parts of the system too, so I ran one more script against those databases, and indexes were missing on some shards. The alter scripts are auto-generated by python but applied by humans, and this human touch caused the inconsistency.

Lesson learned: either remove the human touch point, or write a consistency checker to verify that the humans did the job correctly. Once I get time, I will write the schema applier bot instead of relying on humans.
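A schema differ like the one above can be as simple as grouping shards by a hash of their schema dump; any shard outside the majority group is suspect. A minimal sketch (collecting the SHOW CREATE TABLE output per shard is assumed to happen elsewhere):

```python
import hashlib
from collections import defaultdict

def schema_fingerprints(schemas):
    # schemas: {shard_name: {table_name: create_table_ddl}}, e.g. gathered
    # via SHOW CREATE TABLE on each shard (collection not shown here).
    # Shards with identical schemas share a fingerprint, so any shard
    # outside the majority group is an inconsistency to investigate.
    groups = defaultdict(list)
    for shard, tables in sorted(schemas.items()):
        dump = "\n".join("%s: %s" % (t, ddl) for t, ddl in sorted(tables.items()))
        fp = hashlib.sha1(dump.encode("utf-8")).hexdigest()[:12]
        groups[fp].append(shard)
    return dict(groups)
```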

Being lazy is one of the key principles of a good software engineer, and when someone tries to involve me as a human touch point, my first goal is to figure out how to remove myself from the process :).  For example, I was asked after every release to grep logs from the production nodes to see how the release was doing. I would rather have a bot do this job and enjoy my weekend. It was a pain to write the bot, but once I wrote it, I was out of the equation, and if I leave the company the process still works because I am not in the critical path.