From NoSQL to MySQL (BDB to MySQL) - Part1

Before I joined, the company was evolving from a search-and-share product into a cloud file server. We were using Lucene as the datastore for metadata, but it always lagged on updates because you can't update data in Lucene in real time. We then moved metadata to NFS, but it did not scale for metadata calls, as the NFS protocol was never meant to handle web 2.0 volume. So someone picked Berkeley DB as the backend. The only reason no other NoSQL product was picked was that none of them had multi-row transaction capabilities, and a relational database was not picked because it would have taken too long to port the code. I had recently joined at that time; then the company became a hit and we scaled from 3-4 nodes in one data centre to 60-100 nodes across multiple data centres. We are still going through the scaling phase and, believe me, it is not fun scaling Berkeley DB. Berkeley DB works fine if you can hold your dataset in memory, but as soon as you start hitting dbs of size 10G or more it becomes a nightmare. The main issues we found with Berkeley DB are:

  1. It's difficult, or I should say impossible, to find people qualified enough to diagnose Berkeley DB crashes. We hired a consultant at a costly hourly rate, but all he would give us was theoretical knowledge, never enough detail about the actual crash.
  2. As it's an in-memory db, if Tomcat is hung and kill -HUP won't work, we always have to cross our fingers doing a kill -9, as sometimes we would lose data.
  3. Sometimes the db would get corrupt pages and we would have to do a catastrophic recovery and lose data, so as a temporary measure I had to build an alternate replicated MySQL db and write some ETL code for partial recovery of metadata.
  4. As Berkeley DB is an in-memory db, the only way to query data is to write a JSP. Simple queries from the marketing or product teams would require us to write a JSP (see the query sketch after this list). Updating data to fix corruption was no fun either.
  5. We tried hard to enable replication to another node to scale read operations, but our write rates were so high that replication would always lag; it works in development but never in production.
  6. Backing up Berkeley DB and rsyncing its logs to the filers would cause IO issues on the nodes.
  7. Log churn in Berkeley DB would cause IO issues.
  8. To add an index you have to write code, and the db would become fragmented, so lately we have been shutting down nodes every week and copying the db to another folder and moving it back to defrag it, or in the worst case dumping and reloading the db.
  9. Due to the above issues we were left with pinning customers to a node, so nodes would become hotspots and one bad customer would affect other customers.
  10. We had to put throttles in the system, as we can't serve a customer from two or more nodes because of the in-memory design.
  11. Very few engineers have the aptitude to dig deep into Berkeley DB code and fix issues. It took me some time to digest, but it broadened my understanding of NoSQL dbs, and I was able to apply the same concepts when we added Cassandra for another use case.
  12. Sometimes we throw more hardware at an issue, and that is not cost effective in the longer term.
  13. Indexes were another issue: you have to write code to add an index, which means going through a whole release cycle just to get an index on a db (see the secondary-index sketch after this list).
  14. Adding salt to the injury, indexes don't get created on db startup; they only get created when someone accesses the db. For a customer with 3-4M files this could take anywhere from 10-20 minutes if the server is busy, or 1 minute if the server is idle. This unpredictable performance of BDB was driving us nuts. Whenever we created an index we had to remember to run empty snapshot queries over the weekend to generate the indexes, a painful exercise to repeat on 60 nodes, as one node or another would get hung and you would end up spending 4-5 hours fixing it.
  15. BDB locks were another pain. BDB supports page-level locks instead of row-level locks, which means more chance of deadlocks. We would get constant deadlocks when a background thread, like the trash-purge Quartz job, was aggressively purging data while the same customer was adding files. We had to implement application-level locks to prevent such scenarios (see the per-customer lock sketch after this list), and that complicates the code.
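To give a flavour of point 4, here is a rough sketch of what a "simple query" looked like for us: a JSP that opens a BDB cursor inside the running webapp and scans records. MetadataStore, the record encoding and the filtering are hypothetical placeholders, not our real code.

```jsp
<%@ page import="com.sleepycat.je.*" %>
<%
    // Hypothetical singleton that exposes the webapp's open BDB handles.
    Database filesDb = MetadataStore.getInstance().getFilesDatabase();
    String customerId = request.getParameter("customerId");

    Cursor cursor = filesDb.openCursor(null, null);
    try {
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry data = new DatabaseEntry();
        // Full scan: decode each record and keep only the rows for this customer.
        while (cursor.getNext(key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            String record = new String(data.getData(), "UTF-8"); // illustrative encoding
            if (record.contains(customerId)) {
                out.println(record + "<br/>");
            }
        }
    } finally {
        cursor.close();
    }
%>
```

Every ad-hoc question from marketing meant writing, reviewing and deploying something like this to a production node.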
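For points 13 and 14, this is roughly what "writing code to add an index" means in the Java edition of Berkeley DB (com.sleepycat.je); the database names and record layout are illustrative, not our real schema. Note setAllowPopulate: the index is only built when the secondary database is opened and used, which is where the unpredictable multi-minute stalls came from.

```java
import com.sleepycat.je.*;
import java.util.Arrays;

// Sketch only: shipping code like this through a full release cycle was the
// price of a new index on BDB Java Edition.
public class FilesByCustomerIndex {

    public static SecondaryDatabase open(Environment env, Database filesDb) {
        SecondaryConfig cfg = new SecondaryConfig();
        cfg.setAllowCreate(true);
        cfg.setTransactional(true);
        // The index is populated when the secondary database is opened/first used,
        // which is why a busy node with millions of files can stall for many minutes.
        cfg.setAllowPopulate(true);
        cfg.setKeyCreator(new SecondaryKeyCreator() {
            public boolean createSecondaryKey(SecondaryDatabase secondary, DatabaseEntry key,
                                              DatabaseEntry data, DatabaseEntry result) {
                // Hypothetical record format: first 8 bytes of the value hold the customer id.
                result.setData(Arrays.copyOfRange(data.getData(), 0, 8));
                return true;
            }
        });
        return env.openSecondaryDatabase(null, "files_by_customer", filesDb, cfg);
    }
}
```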
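For point 15, the application-level locking was conceptually a per-customer mutex in front of BDB, something along these lines (class and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of application-level, per-customer locking layered on top of BDB's
// page locks: the trash-purge Quartz job and a foreground upload for the same
// customer serialise here instead of deadlocking inside BDB.
public class CustomerLocks {

    private static final ConcurrentHashMap<Long, ReentrantLock> LOCKS =
            new ConcurrentHashMap<Long, ReentrantLock>();

    private static ReentrantLock lockFor(long customerId) {
        ReentrantLock lock = LOCKS.get(customerId);
        if (lock == null) {
            ReentrantLock fresh = new ReentrantLock();
            ReentrantLock existing = LOCKS.putIfAbsent(customerId, fresh);
            lock = (existing != null) ? existing : fresh;
        }
        return lock;
    }

    // Both the purge job and the upload path wrap their BDB writes in this.
    public static void withCustomerLock(long customerId, Runnable work) {
        ReentrantLock lock = lockFor(customerId);
        lock.lock();
        try {
            work.run();
        } finally {
            lock.unlock();
        }
    }
}
```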

Ultimately we got tired of it and decided to move to MySQL. It took almost a year to port the code, with so many fires going on to scale Berkeley DB due to 20% month-over-month growth. In short, NoSQL is good and is still used at the company for many use cases, but when it comes to transactions and ad-hoc querying MySQL is still the champion, and unless you are writing a desktop client application, in-memory dbs are a bad idea. Every day I regret that Berkeley DB was ever used for a server application.
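For comparison, a multi-row, multi-table change on MySQL/InnoDB is just a plain transaction. A sketch over JDBC, with an illustrative schema and connection details (not our real ones):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch of the kind of multi-row transaction that is trivial on MySQL/InnoDB
// but needed careful application-level handling on BDB.
public class MoveFileExample {

    public static void moveFile(long fileId, long fromFolder, long toFolder) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://metadata-db:3306/metadata", "app", "secret"); // illustrative
        try {
            con.setAutoCommit(false);

            PreparedStatement move = con.prepareStatement(
                    "UPDATE files SET folder_id = ? WHERE file_id = ?");
            move.setLong(1, toFolder);
            move.setLong(2, fileId);
            move.executeUpdate();

            PreparedStatement dec = con.prepareStatement(
                    "UPDATE folders SET file_count = file_count - 1 WHERE folder_id = ?");
            dec.setLong(1, fromFolder);
            dec.executeUpdate();

            PreparedStatement inc = con.prepareStatement(
                    "UPDATE folders SET file_count = file_count + 1 WHERE folder_id = ?");
            inc.setLong(1, toFolder);
            inc.executeUpdate();

            con.commit();   // all three rows change together or not at all
        } catch (Exception e) {
            con.rollback();
            throw e;
        } finally {
            con.close();
        }
    }
}
```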

Finally, the MySQL backend is going live today for some customers, and once it's live and scaling we will start removing the throttles and serving one customer from multiple nodes in the coming months. My expectation is that we will consolidate hardware and cut the number of servers in half, as the app nodes will now be sitting mostly idle. I will cover more details on MySQL sharding, and how scaling with MySQL goes, in the coming months.

Part1
Part2
Part3
Part4
Part5
