Saturday, April 12, 2008

Scaling Up Vs Scaling Out

You've got the website up and running, your clients love it, more and more people want to use it. Everything seems to be going great. One fine day you realize that your systems are getting chocked and clients have started complaining. You keep adding new machines to your environment (doing quick-fixes in the application to support this distributed architecture) and believe that this scaling-out approach is the right way to move forward (after all, the big guys like google n yahoo have thousands of machines). After a couple of years, there are 10s (or probably 100s) of machines serving your website traffic and the infrastructure, administration etc. costs have gone up considerably. And due to the quick-fixes, its really difficult to work on a new clean architecture and add more features to your application.

Lets consider what Googles got and how the scaling-out approach works great for them:

1) Possibly the best Engineers in the world
2) Google File System (GFS)
3) Map-Reduce
4) An infrastructure where you can treat a server class machine like a plug-n-play device
5) Applications which are designed keeping GFS and MapReduce in mind
... and god knows what else

If you've got anything close to this, then scaling-out is the obvious answer. Otherwise, read on...

There are 3 major components to consider while choosing a Server:

1) CPUs
2) Primary Memory (RAM)
3) RAID configuration (RAID 0, RAID 1, RAID 5 etc.)

A server has certain limitations in terms of the amount of Memory and number of CPUs it can hold. (mid-level, server class systems come with support of up to 32 GB memory and 4 CPUs). Adding more CPU or Memory becomes very expensive after this. So, there is linear cost of adding more memory and CPU to a certain extent and after that it becomes exponential.

Example: Lets say you've got a 100 GB database. It works comfortably with a 16 GB(expandable upto 32GB) RAM, 2-CPU Server. Once the database size goes up and the users increase, this single server might not be able to handle the load. The option is either to increase RAM (most database servers need more memory and not CPU power) or add another machine. The economical solution will probably be to add more RAM (addition of another 16 GB memory will cost significantly less than what a new server would). After a certain point addition of RAM might be more costly than adding a new server and thus the better option is to scale-out at this point.

The key is to scale-up till the cost is linear and side-by-side start work on the Application architecture such that you can run your application smoothly on multiple servers and scale-out thereafter.

Choosing the right RAID configuration is also important, it depends on what operation you perform the most (read, write or read+write). I'm not an expert in RAID configurations so do a bit of googling and you'll get a number of articles on this.