The Lost Guide: Scaling a Web App

At a party recently, I talked to a developer who was very interested in how we are scaling Squidoo. I stressed to him that the art of scaling is almost impossible to learn without a real-life situation to apply it to. That being said, learning something new and having to apply it immediately is a bit scary, so I thought I'd write up a guide for those who are already on this path and could use some help.

The Golden Rule

Scaling is the art of designing an infrastructure that's able to grow along with usage. Premature optimization, however, can waste valuable cycles and cripple a project's success.

What This Tutorial Doesn't Cover 

For the most part, this guide focuses on the entire web app platform, not specifically on how to build a scalable code architecture; the web is teeming with design patterns and best practice guides in your language of choice. The most important rule to take away is to build your code in layers. The most common format for web apps is MVC, although some purists are not always content with this approach. The rest of this tutorial assumes that you have, from the very beginning, structured your code in such a way that database calls are easily accessible as their own layer of the application. This is critical for tweaking slow queries and implementing a granular caching system.

All apps are different. This tutorial doesn't cover edge cases, and assumes that your app is on a LAMP (Linux + Apache + MySQL + Python/Perl/PHP) architecture. Many of these concepts will work in other environments, but I can only speak for LAMP.

If you haven't tweaked your code yet... 

Head First Design Patterns

A great introduction to code architecture and common design patterns. This book uses examples in Java, but can be easily applied to your language of choice.

Refactoring: Improving the Design of Existing Code

If you're working from an existing code base and not sure where to start, Refactoring is the perfect guide. It gets a little dry in places, but even a quick glance over the book can help a great deal.

When to Scale 

The first thing to know about scaling is that you should hold off on it for as long as possible. Scaling adds all kinds of complexity to your process, and it eats up time both during setup and in ongoing maintenance.

However, it is crucial that you always think ahead so that at any given time you know, and are prepared for, the next step in scaling your app. But don't pull the trigger until you need to. Once you start noticing trouble, it's time to scale to the next level as quickly as possible. Don't wait, because scaling problems have a tendency to grow exponentially. So here's the rule: Don't scale until the first signs of trouble, but then scale to the next level as quickly as possible. Always stay one step ahead (not three or four).

Step 1: Dedicated Server 

If you're on a shared server and begin to notice performance problems, start by switching to a dedicated server. There are a number of really great companies in this space. At Squidoo we use RackSpace.

If you're new to server administration, it would be wise to select a host who will help you troubleshoot web or database server configuration problems. Obviously, you will have to pay more for a host that does (this is commonly called Managed Hosting).

Another route is to use a managed grid system like Mosso or MediaTemple, but because of the lack of flexibility I'm going to assume that this is not an option for your app. If it is, by all means consider it.

For your first server, don't worry about getting one that has tons of RAM (2 GB should be enough). If data integrity is a priority, installing a RAID 1 hard drive config will give you added protection against hard drive failure (albeit at some cost to write performance). More on this in a minute.

Backups 

It's simple: backups are a requirement. I'm going to assume you're already using a version control system like Subversion to back up your code, and will focus specifically on server backups.

You'll want to back up everything required to rebuild your server from scratch. If you haven't started already, consider maintaining a folder on each server with archives of every software package you've installed, along with notes on how they were compiled.

Back up this folder along with any config files needed to run your system. Most hosting providers offer an off-site backup service, but if yours doesn't, you could consider using Amazon S3 or a number of other third party storage providers.

Most importantly, don't forget to back up your database. If you're using InnoDB tables in MySQL, a binary file backup of your data directory is not enough. Luckily, there's an excellent Perl script to make MySQL backups completely painless. I've used this script for years and it has never let me down.
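
To illustrate what a logical (rather than binary) database backup looks like, here is a minimal Python sketch that wraps mysqldump. It is not the Perl script mentioned above. The function names, the backup directory, and the file naming scheme are all assumptions made for the example. The --single-transaction flag asks mysqldump for a consistent snapshot of InnoDB tables without locking them.

```python
import datetime
import gzip
import shutil
import subprocess
from pathlib import Path


def build_dump_command(user, host="localhost"):
    """Return the mysqldump argument list (hypothetical helper)."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent InnoDB snapshot without table locks
        "--all-databases",
        "--user=" + user,
        "--host=" + host,
    ]


def backup_filename(backup_dir, when):
    """Build a dated archive path, e.g. <backup_dir>/mysql-2009-12-15.sql.gz."""
    return Path(backup_dir) / ("mysql-" + when.strftime("%Y-%m-%d") + ".sql.gz")


def run_backup(user, backup_dir):
    """Stream mysqldump output into a gzip archive.

    Assumes the password is supplied through a client option file
    such as ~/.my.cnf rather than on the command line.
    """
    target = backup_filename(backup_dir, datetime.date.today())
    with gzip.open(target, "wb") as archive:
        proc = subprocess.Popen(build_dump_command(user), stdout=subprocess.PIPE)
        shutil.copyfileobj(proc.stdout, archive)
        if proc.wait() != 0:
            raise RuntimeError("mysqldump exited with an error")
    return target
```

Something like run_backup("backup_user", "/var/backups/mysql") could then be called from a nightly cron job, with the resulting archive copied off-site.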

 

Backup and Recovery

A Quick Intro to RAID 

RAID stands for Redundant Array of Inexpensive Disks. The idea behind RAID is that you can group a set of hard disks together to achieve things you normally couldn't otherwise. Here is a quick breakdown of some of the most common RAID configurations present in web server configurations. I'm glossing over many of the finer points of RAID here - for a more formal analysis, make sure to check out Wikipedia.

At Squidoo, we've found that the sweet spot is RAID 5 for our application servers and RAID 10 for our database servers.
  • RAID 0 is all about performance. Data is striped, or partitioned, across two or more drives, so reads and writes are spread across all of them in parallel. Available storage is 100% of total hard drive capacity, but there is no redundancy: if one drive fails, the whole array is lost.
  • RAID 1 is about redundancy. No matter what your situation, you should be taking data backups daily. But what happens when your hard disk becomes corrupted an hour before your next backup is scheduled? With RAID 1, your data is mirrored on a second drive. Although you only get the storage capacity of a single drive, RAID 1 is crucial for data integrity. RAID 1 can slow writes somewhat because every write must be copied to the second drive.
  • RAID 1 + 0, or RAID 10 as it is commonly known, is a combination of the above configurations. It gives you the data integrity of mirroring along with the performance of striping. RAID 10 is also the most expensive to implement because it requires at least 4 drives. Available storage is only 50% of total capacity.
  • When you can't afford the four drives required for RAID 10, a RAID 5 configuration affords you limited fault tolerance and decent performance. RAID 5 needs at least three drives and gives you (size of smallest drive * (number of drives - 1)) of usable storage.
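
The capacity figures in the list above can be worked out with a few lines of Python. This is only a sketch: the function name is made up for the example, and it assumes the array treats every drive as if it were the size of the smallest one, which is how mixed-size arrays commonly behave.

```python
def usable_capacity_gb(level, drive_sizes_gb):
    """Return usable storage in GB for a RAID 0, 1, 5, or 10 array."""
    count = len(drive_sizes_gb)
    smallest = min(drive_sizes_gb)

    if level == 0:
        if count < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return smallest * count            # striping only, no redundancy
    if level == 1:
        if count < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return smallest                    # every drive holds the same data
    if level == 5:
        if count < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return smallest * (count - 1)      # one drive's worth goes to parity
    if level == 10:
        if count < 4 or count % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return smallest * count // 2       # half the drives are mirrors
    raise ValueError("unsupported RAID level: %r" % level)


four_drives = [500, 500, 500, 500]
for level in (0, 5, 10):
    print("RAID", level, "->", usable_capacity_gb(level, four_drives), "GB usable")
```

With four 500 GB drives, RAID 10 leaves half of the raw capacity usable, while RAID 5 gives up only one drive's worth.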

Step 2: Database Server 

Once your dedicated server starts seeing performance bottlenecks, it might be time to configure a separate database server. This will allow you to tune each of your servers for their respective tasks.

The database server should be more powerful than your web server, as it is one of the more difficult elements to change later on, requiring you to take your entire site down. Lots of RAM and a RAID 10 hard drive configuration are desirable.

RAID 10 gives you the data integrity benefits without as much performance sacrifice as a RAID 1 by itself. See the RAID section above for more details.

No matter what hard disk configuration you choose, make sure you give yourself enough storage capacity to grow for a while. Migrating to a new array of hard disks is no fun.

 

High Performance MySQL

Step 3: Server Tuning 

Now that you've got two servers, it's time to tweak them for the specific tasks they were born to do.

Run the command 'ps aux' and pay attention to any non-essential applications lurking on your servers. For your web server, this means anything not related to basic system function, security, Apache, or mail. For the database server, it means anything not related to basic system function, security, or MySQL. The startup scripts for these applications are generally located in the /etc/init.d directory. Stop the program by running '/etc/init.d/[appname] stop'. Then delete the symbolic link to it in the /etc/rc.d/rc3.d directory to prevent it from starting up again the next time the system boots. You can quickly identify which is the symbolic link by running 'ls -al /etc/rc.d/rc3.d'. Disabling unused applications frees up RAM (which we'll need in just a second) and might even make your server more secure.
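
If the raw 'ps aux' listing is hard to scan, a short Python sketch like the one below can sort it by memory use so the biggest consumers float to the top. The function names are made up for the example, and it assumes the usual Linux column layout, where RSS (resident memory, in KB) is the sixth column and COMMAND is the eleventh.

```python
import subprocess


def parse_ps_aux(output):
    """Parse `ps aux` text into (command, rss_kb) tuples, largest first."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # Split into at most 11 fields so a COMMAND containing spaces stays whole.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        rows.append((fields[10], int(fields[5])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def top_memory_users(limit=10):
    """Run `ps aux` and return the processes using the most resident memory."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return parse_ps_aux(result.stdout)[:limit]


if __name__ == "__main__":
    for command, rss_kb in top_memory_users():
        print("%8d KB  %s" % (rss_kb, command))
```

Anything near the top of that list that isn't part of your web or database stack is a good candidate for a closer look.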

Next, tune Apache on your web server (psst, by now Apache shouldn't be running on your database server at all!). Begin by opening the httpd.conf file and commenting out all non-essential Apache modules. If you're unfamiliar with a particular module, try Googling it. If you're still unsure, try disabling modules one by one, as opposed to all at once. Since your web server is now left alone to perform one primary task, raise the number of Apache servers started at launch (the StartServers directive) and the minimum number of spare servers (MinSpareServers). This will significantly increase RAM usage, but hopefully you were able to free some up above.
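
Before raising those numbers, it helps to do some rough arithmetic on how many Apache processes your RAM can actually hold. The sketch below is a back-of-the-envelope estimate only; the function name and the sample figures (memory reserved for the operating system, memory per Apache process) are assumptions you should replace with measurements from your own server.

```python
def max_apache_processes(total_ram_mb, reserved_mb, per_process_mb):
    """Estimate how many Apache processes fit in RAM without swapping.

    total_ram_mb   -- physical RAM on the web server
    reserved_mb    -- RAM set aside for the OS and other services (assumed)
    per_process_mb -- typical resident size of one Apache process (measured)
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    return max(available_mb // per_process_mb, 0)


# Illustrative numbers only: a 2 GB server, 512 MB reserved, 30 MB per process.
print(max_apache_processes(2048, 512, 30))
```

Keeping the number of running servers comfortably below this estimate leaves headroom for traffic spikes.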

You can often make your web site quicker by enabling gzip compression. Just about all modern browsers accept compressed responses: the web server compresses its output and the browser decompresses it on arrival. This results in faster download times for your users, and saves bandwidth on your end. There's an excellent article on enabling gzip compression with mod_deflate on HowToForge.
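
To get a feel for why this helps, the short Python sketch below compresses a chunk of repetitive HTML with the standard gzip module and compares sizes. It only illustrates how well text compresses; on a real server, mod_deflate does this work for you. The function name and the sample markup are made up for the example.

```python
import gzip


def compression_savings(text):
    """Return (raw_bytes, gzipped_bytes) for a piece of text."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw), len(compressed)


# Repetitive markup, like most HTML, CSS, and JavaScript, compresses very well.
page = '<li class="item"><a href="/lens/example">Scaling a web app</a></li>\n' * 200
raw_size, gz_size = compression_savings(page)
print("raw: %d bytes, gzipped: %d bytes" % (raw_size, gz_size))
```

Images and other already-compressed formats gain little from gzip, which is why it is normally applied to text-based files only.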

Step 4: Static File Hosting 

The Ins and Outs of KeepAlives

Output compression using gzip is by far the easiest thing you can do to give your app an instant speed boost.

Another thing to keep in mind is Apache KeepAlives. The Apache KeepAlive setting saves Apache from opening new connections by keeping a single connection open for a browser while it downloads all the external images, CSS, JavaScript, and other static content associated with a page. Ordinarily this is great for performance, but in a high volume environment it can become tricky. Here's the situation.

When a visitor opens your app in a web browser, an Apache instance is assigned to fulfill the request. While processing the request, the Apache instance's memory usage grows to the size demanded by your application (let's say 300 MB of virtual memory, for example). Once the main page is rendered, however, the instance's memory usage does not shrink. It stays the same size, or even grows, for as long as the connection is open, so Apache ends up holding 10 times (or more) the amount of memory required to serve the miscellaneous static content associated with your app.

The best way around this is to use a special lightweight web server dedicated to hosting only static content. lighttpd (pronounced "lighty") works great for this.

Configure the static server to run on a different port, or on another physical machine altogether. Create a subdomain like static.yourdomain.com and use it to host all your CSS includes, JavaScript source files, and images. Make sure this new server is configured to use gzip compression for all text-based files.
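
In your application code, pointing assets at the new subdomain can be as simple as routing every asset path through one helper. Here is a minimal sketch; the helper name and the list of extensions are assumptions, and static.yourdomain.com is the placeholder hostname used above.

```python
STATIC_HOST = "static.yourdomain.com"  # placeholder hostname from the text
STATIC_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def asset_url(path, host=STATIC_HOST):
    """Rewrite a site-relative asset path to point at the static server.

    Paths that don't look like static files are returned unchanged,
    so dynamic pages keep going to the main web server.
    """
    if not path.lower().endswith(STATIC_EXTENSIONS):
        return path
    return "http://" + host + "/" + path.lstrip("/")


print(asset_url("/css/main.css"))
print(asset_url("/lens/edit"))
```

Keeping this logic in one place also makes it easy to change the hostname later, for example if you move static files to another machine.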

Finally, edit Apache's httpd.conf file and disable KeepAlives for your primary web server. Then monitor the logs to ensure that Apache is no longer serving up static files.
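
One way to do that monitoring is to scan the Apache access log for requests that end in a static file extension. The sketch below assumes the log uses the common or combined log format, where the request line appears in double quotes. The function name, extension list, and sample log lines are made up for the example.

```python
import re

# Matches the quoted request line, e.g. "GET /images/logo.png HTTP/1.1"
REQUEST_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')
STATIC_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def static_requests(log_lines):
    """Return the paths of static files found in Apache access log lines."""
    hits = []
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if not match:
            continue
        path = match.group(1).split("?", 1)[0]  # ignore any query string
        if path.lower().endswith(STATIC_EXTENSIONS):
            hits.append(path)
    return hits


sample_log = [
    '10.0.0.1 - - [15/Dec/2009:10:00:00 -0500] "GET /lens/scaling HTTP/1.1" 200 5120',
    '10.0.0.1 - - [15/Dec/2009:10:00:01 -0500] "GET /images/logo.png HTTP/1.1" 200 2326',
    '10.0.0.2 - - [15/Dec/2009:10:00:02 -0500] "GET /css/main.css?v=3 HTTP/1.1" 200 880',
]
print(static_requests(sample_log))
```

If this still turns up hits after the switch, some page or template is probably still linking to the old asset URLs.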

Step 5: Caching 

So you've got one or more web servers, a database server, and a static content server. Your web servers seem to be holding up OK, but your database is getting progressively busier as more people visit your site. At this point you deserve congratulations for building a service popular enough to have this problem!

From now on, almost all of your scaling problems will be database-related. Since most applications have a high volume of reads compared to writes, you can usually relieve most of the strain on your database using an object cache like Memcache. Once integrated with your app, Memcache can keep your most commonly accessed data right in memory, where it can be retrieved much faster and with less overhead.
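
The usual way to wire a cache into the database layer is to check the cache first and only query the database on a miss. The sketch below shows that pattern with a tiny in-process class standing in for a real Memcache client, so the example runs on its own. The class, the function names, and the "lens:7" key are all hypothetical; a real client library will have its own API.

```python
import time


class SimpleCache:
    """In-process stand-in for a Memcache client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.time():
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds=300):
        self._store[key] = (value, time.time() + ttl_seconds)


def cached_fetch(cache, key, load_from_db, ttl_seconds=300):
    """Return the cached value for key, loading from the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db()  # the expensive query only runs on a miss
        cache.set(key, value, ttl_seconds)
    return value


db_calls = []


def load_lens():
    """Pretend database query; records each call so we can count them."""
    db_calls.append("SELECT ... WHERE id = 7")
    return {"id": 7, "title": "Scaling a Web App"}


cache = SimpleCache()
first = cached_fetch(cache, "lens:7", load_lens)
second = cached_fetch(cache, "lens:7", load_lens)
print(len(db_calls), "database call(s) for two reads")
```

This is where having database calls in their own layer pays off: the caching logic wraps the query functions without touching the rest of the app. Remember to delete or update the cached entry whenever the underlying row changes.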

All Apologies

This lens isn't quite done yet, but I promise to finish it soon. Thanks for taking the time to read it so far, and please let me know if you have any questions.

 

Linux in a Nutshell, 5th Edition

Apache: The Definitive Guide (3rd Edition)

by giltotherescue

Gil Hildebrand, Jr. is an experienced software developer based in New York City. He is currently running things as the Chief Engineer of Squidoo, and... (more)