Wednesday, May 16, 2012

From NOSQL to MySql (BDB to mysql) - Part5

This weekend was painful. Hardware came but out of 12 mysql servers, 3 had hardware issues, so only 8 mysql available (4 pair of Master-slave). I didn't wanted to take a risk of having a cluster up with only 1 master so left that one intentionally.

Started migration on 1st DC and after 4 hours the DC guys were doing some inbound network maintenance and took the in bound network to DC down and it took them 3 hours to restore it, so 3 hours were lost. We were up till 6:00 AM on Saturday morning and only able to finish 8-10 nodes.

We worked on Saturday night and as usual Berkely db gave us last minute jitters, one node was badly fragmented and took 10 hours to migration and one other node stuck in middle of migration so had to stop migration and run db_recovery and then resume again. In all 12-15 nodes migrated on that day.

We are now almost done 60% done in all DCs over last 3 weeks.

But all this pain is worth having a calm Monday where no system alerts or no Sales/Operations team hovering around you to diagnose scalability issues. This also in turn translates to better customer conversion and satisfaction.

Its first Monday in last few months where I had no single nagios alert raised for an app node spiking, hurray.


Monday, May 14, 2012

Useless comments in code

I hate when developers writes useless comments in code like this
}finally {
            // Closing the connection

Anyone reading the function name close connection can figure out this is closing the connection.

I try to give function a meaningful name so that it conveys the intent and comments are not required. I write method comments only when function is doing some complex algorithm or some trick.

I also hate comments like this

     * @param customer
     * @param ids
     * @return

This is also useless comment generated by eclipse automatically but the developer didn't added a comment.

Screen size is limited and I would rather see code in it than useless comments.

Thursday, May 10, 2012

Mysql sharding at our company - Part4 (Cell architecture)

Interesting to know how different people can come to same architecture to solve scalability issues.  I just read this article published today called as cell architecture  and I came up with same architecture in our company as highlighted in the  diagram below and I am calling it as "Cluster" instead of "cell" or "pod" but the concept is same.

You can read a bit more on the link I published above but the main reason why we chose this architecture were:

  1. Failure of a cell doesn't cause the entire DC to go down.
  2. We can update one cell and watch out for a week before pushing the release to all cells.
  3. We can have difference capacity for different cells (Enterprise customers vs trial customers).
  4. We can add more cells or mores mysql host to once cell if it has a capacity problem in one component.
  5. Ideally you want to make it as homogeneous as possible but let say for some reason in one cell people are doing more add/delete files than just adds and delete causes index fragmentation so you can do aggressive index defrag on one cell than others.
Part1 of series

Part2 of series

Part3 of series

Part4 of series

Mysql sharding at our company - Part3 (Shard schema)

As discussed in Part2 of the series we do Horizontal sharding for any schemas that will store 100M+ rows. As we are a cloud file server no data is shared between two customers a perfectly isolation can be achieved easily. One year back when I was thinking on designing the schema there were many alternatives

  1. One shard per customer mapped to one database schema : Rejected this idea because mysql stores 1 or more files per table in physical file system and linux file system chokes after some no of files in a folder. We had faced this problem when storing the real files on filers (topic of another blog post).
  2. One shard per customer maped to one set of tables in database schema : This would solve the issue of multiple files in a folder but again it would lead to too many files on the disk and operating system can choke on it. Also we have customers to do a trial for 15 day and never signup, so too much for ops team to manage for these trials.
  3. Many customers in one shard mapped to one database schema: This would solve both issue one and two, but this is again  too many schemas to manage for operations team when they have to setup replication or write any scripts to manage the schemas.
  4. Many customers in one shard mapped to one set of tables in one database schema. : This is the approach we finally ended up picking as it suits both engineering and operations needs.
On each Mysql server we create a set of schemas and within each schema we have a cluster of tables that comprises a shard. To figure out what customer lives in what set of tables we use a master db called a lookup db aka DNS db. Each query first looks up the master db to figure out what shard this customer lives in, this is a highly  read intensive db so we cache this data in memcache. Once we figure out the shard then we lookup what schema and what server this shard lives on. based on that information the application looks up appropriate connection pool in spring and executes the query. This is how the topology looks like at 10000 ft.

This is the structure of our dns db tables

 A typical schema in a shard db has tables with name like
folders_${TBL_SUFFIX}, file_${TBL_SUFFIX}.  Here TBL_SUFFIX is unique within cluster so that shard can be moved easily. To make it unique for now we just append schema_name and table set number to it to it. So let say for schema c1_db1 the tables for shard 10 and 15 would look like

We could have just appended shard_id to the table names also to make then unique but this makes logical and physical mapping hard. Later if we move the shard to a different host all we need to do is move the entire schema containing many shards to a diff host and change metadata db mappings and flush cache and post a message to zookeeper. App nodes listen to zookeeper for such events and refresh connection pools.

Part1 of series

Part2 of series

Part3 of series

Part4 of series 

Saturday, May 5, 2012

From NOSQL to MySql (BDB to mysql) - Part4

No hardware this weekend due to delays by dell so only  2 nodes were migrated. Ran into another crappy Berkely DB issue where for no reason it would get stuck in native code.

java.lang.Thread.State: RUNNABLE
at com.sleepycat.db.internal.db_javaJNI.Dbc_get(Native Method)
at com.sleepycat.db.internal.Dbc.get(
at com.sleepycat.db.Cursor.getNext(

It was already 4:00 AM in the morning and I was pissed that Berkely db is giving last minute pains. In India they say "Bujhta Diya last main Tej Jalta hai" or "dying lantern runs more bright at end" .  We tried doing a defrag of database but it didnt helped. Ultimately we moved the customer db to its own folder and then ran catastrophic recovery and restarted the migration. From past Berkely db migrations I had the experience that you would get these kind of issues and you might have to restart migrations. I had deliberately written the migration script in such a way that you can run it for 1 customer or 5 customer or all and restart migration in middle.

Anyway good news is that now one cluster in a DC has 160M rows and its going rocking fast.

There were no spikes reported by operations on nodes that are migrated to mysql. This ultimately leads to better customer satisfaction and even I can focus on more longer term tasks. Last 1 year has been frustrating with scaling Berkely db as I think we pushed both Berkely DB and NFS to its limits.


Thursday, May 3, 2012

Mysql Sharding at our company- Part2 (picking least used shard)

in Mysql sharding at my company part1 I covered that when a customer registers we assign him next free shard. Now this problem seems simple at first but it has its own twists. You could have different strategies to determine next free shard id
  1. Based on no of rows in shard
  2. Based on no of customers in shard
After lots of thinking I use the first approach. When a customer registers I pick up the next free shard by looking up information schema and query least used 8 shards and then randomly pick one of those.  Picking shards based on no of customers was rejected because we use a nagios monitor to test registration and that causes lots of dummy registrations and also QA team does registrations and some times people will register a trial use for 15 days and wont convert as they are either spammers or just want to use us for sharing large files for <15 days.

The reason to not always pick the first least used shard is that 6 months down the line if we add 2 more shards to cluster then every new registration would pick those two shards only.  The way our customers use the system is that they register and play with the system for a while during the 15 day trial period and if they like the solution (which they do) then they upload their real dataset.  During the trial period the customer would upload may be 100-1000 files/folders but when they upload the real dataset then they would upload 1-2 million files. So always picking the least used shard would cause hotspots. That's  why we pick the least 8 used shards and randomly pick one of those shards.

This gives us uniform distribution here is a screenshot of 5 random shards from our Graphite dashboard.

As you can see the rows are evenly distributed, you could get many small customers on 1 shard and 1 big customer can take up 1 shard.  As we use memcache many small customers on 1 shard will not cause issues for reads. Its usually the bigger customers with 3-5 M files that cause issues.

One key thing to note is that information schema can be slow when you run select query on it. If the statistics on table are obsolete, then  select query on information schema can cause mysql to scan the table and calculate statistics, which can be bad for registration. So I wrote a quartz job that every hour find the 8 least used shard per cluster and populates in memcache  and registration process just uses that. If for some reason the data is not in cache cache then registration process queries and populates it.

Part1 of series

Part2 of series

Part3 of series

Part4 of series