December 2014 update: READ THIS -> http://12factor.net/ While I talk about 4 principles, someone else went a little further and came up with 12! It's in a slightly different vein, but on the right track.

When I write software or architect a solution, I keep these 4 guiding ideas at the top of everything.

Partition everything

I inherited a system that stored user accounts. The person who wrote the software did not envisage the system growing, and when we hit 32,000 user accounts it stopped. Why? UNIX filesystem limits in that version of the kernel. Have you tried to list a directory with 32,000 user accounts and 2TB of user data under it? Try backing that up. Most backup agents traverse the directory to build a list of files to copy, so try traversing 2TB of data before you write any of it to tape. Nope, ain't going to happen, so no backups.

A better strategy is to use the first letter of the account name, so you only have 26 directories to traverse at the top level, then use “aa” to “az” under “a” (and so on for the other letters). That is much more efficient and easy to scale, as well as to back up.
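Here is a minimal sketch of that layout in Python. The function name `partitioned_path`, the `/data/users` root and the `_` bucket for anything that is not a letter a-z are my own assumptions for illustration, not what the original system did.

```python
import string
from pathlib import PurePosixPath

_LETTERS = set(string.ascii_lowercase)


def _bucket_char(ch: str) -> str:
    """Map a character to its bucket: a-z stay as-is, anything else goes to '_'."""
    return ch if ch in _LETTERS else "_"


def partitioned_path(root: str, username: str) -> PurePosixPath:
    """Return root/<first letter>/<first two letters>/<username>.

    Hypothetical helper: one-character names get '_' as their second letter.
    """
    name = username.strip().lower()
    if not name:
        raise ValueError("username must not be empty")
    first = _bucket_char(name[0])
    second = _bucket_char(name[1]) if len(name) > 1 else "_"
    return PurePosixPath(root) / first / (first + second) / name


print(partitioned_path("/data/users", "Alice"))  # /data/users/a/al/alice
```

A backup job can then walk one small bucket at a time instead of one directory holding every account.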

So your new system works nicely on just one server and will NEVER grow beyond that (after all, Google ran on 1 server once)? I don’t think so! If you need to store data for a lot of users, use a lot of servers. The 26 letters of the alphabet allow a lot of scalability. Still not enough? Try Server-AA to Server-AZ, then Server-BA to Server-BZ, all the way to Server-ZZ (676 servers). Now think what you can do using digits!
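The same idea applied to servers might look like the sketch below. The `server_for` and `all_server_names` helpers are hypothetical, and I am assuming every account name starts with two ASCII letters; anything else is rejected rather than guessed at.

```python
import string

LETTERS = string.ascii_uppercase  # 26 letters, so 26 * 26 = 676 two-letter names


def server_for(username: str) -> str:
    """Pick a server name from the first two letters of the account name."""
    name = username.strip().upper()
    if len(name) < 2 or name[0] not in LETTERS or name[1] not in LETTERS:
        raise ValueError("this sketch only handles names starting with two letters A-Z")
    return f"Server-{name[:2]}"


def all_server_names() -> list[str]:
    """Every name from Server-AA to Server-ZZ."""
    return [f"Server-{a}{b}" for a in LETTERS for b in LETTERS]


print(server_for("alice"))      # Server-AL
print(len(all_server_names()))  # 676
```

Names are not spread evenly across the alphabet, so some servers will be busier than others; the point is only that the routing rule is fixed and trivial to compute.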

Async is your friend

The same programmer who stuffed up the filesystem storage above also didn’t know how to offload tasks to smaller systems for processing later, so if one system in the chain went down, everything stopped! Not everything has to happen right now, does it?

The more you offload for other systems to do, the quicker you can respond to the important things now. Queuing is your friend (the world works on queues) and many hands make light work. RabbitMQ appears to be a great, solid open source solution for queuing. For commercial implementations, use IBM WebSphere MQ.
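To show the shape of it, here is a small sketch using only Python's standard library `queue` and `threading` modules as a stand-in for a real broker such as RabbitMQ. The names `start_worker` and `handle_request` and the "welcome email" task are made up for the example.

```python
import queue
import threading


def start_worker(tasks: queue.Queue, handler) -> threading.Thread:
    """Start a background thread that runs handler(item) for each queued item.

    Putting None on the queue tells the worker to stop.
    """
    def run():
        while True:
            item = tasks.get()
            try:
                if item is None:
                    return
                handler(item)
            finally:
                tasks.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def handle_request(tasks: queue.Queue, user: str) -> str:
    """Answer the caller immediately and leave the slow work on the queue."""
    tasks.put(f"send welcome email to {user}")
    return "accepted"
```

The caller gets its answer as soon as the task is queued. With a real broker the queue also survives the worker being down, which an in-process queue like this one does not.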

Dividing up tasks allows loose coupling of workloads and lets additional systems process data independently of the application that generated it. It also future-proofs your apps: you can build new functionality and deploy it with no downtime simply by routing copies of messages to a new queue for your new application to process, with no ill effects on the original processes. Use it.
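Routing copies of messages to a new queue can be sketched like this. The `Fanout` class is a hypothetical in-process illustration of the pattern, not the API of RabbitMQ or WebSphere MQ; a real broker does this routing for you.

```python
import copy
import queue


class Fanout:
    """Deliver a copy of every published message to each bound queue."""

    def __init__(self):
        self._queues = {}

    def bind(self, name: str) -> queue.Queue:
        """Attach a new named queue; it only sees messages published from now on."""
        q = queue.Queue()
        self._queues[name] = q
        return q

    def publish(self, message) -> None:
        """Put an independent copy of the message on every bound queue."""
        for q in self._queues.values():
            q.put(copy.deepcopy(message))


exchange = Fanout()
billing = exchange.bind("billing")        # the original consumer
exchange.publish({"event": "signup", "user": "alice"})

reporting = exchange.bind("reporting")    # new application added later
exchange.publish({"event": "signup", "user": "bob"})

print(billing.qsize(), reporting.qsize())  # 2 1
```

The publisher and the billing consumer are untouched when the reporting queue is added, which is what makes the no-downtime rollout possible.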

Automation – Everywhere

Use automation to clean up after your apps, and use it everywhere so that systems can grow without stopping. No programmer ever thinks of maintenance. Those session files keep building up, but do you need 4 years’ worth sitting in a filesystem cluttering up the disks when the data is useless after 24 hours?
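A cleanup job for those session files can be very small. The sketch below assumes the files sit in a single directory and that "old" means a modification time more than 24 hours ago; `purge_old_files` is a name I made up, and you would run something like it from cron or another scheduler.

```python
import time
from pathlib import Path


def purge_old_files(directory, max_age_seconds=24 * 60 * 60, now=None):
    """Delete regular files in directory older than max_age_seconds.

    Returns the paths that were removed. Subdirectories are left alone.
    """
    cutoff = (time.time() if now is None else now) - max_age_seconds
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```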

Did you clean up those temp files after you finished writing temporary data to them? I bet the thought never entered your mind.
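Better still, let the language clean up for you. This sketch uses Python's standard `tempfile.TemporaryDirectory`, which removes the directory and everything in it when the `with` block ends, even if the code inside raises. The function name and the `work.bin` file are just for illustration.

```python
import tempfile
from pathlib import Path


def process_with_scratch_space(data: bytes):
    """Write data to a scratch file, measure it, and leave nothing behind.

    Returns (size in bytes, path of the scratch directory that was used).
    """
    with tempfile.TemporaryDirectory() as scratch:
        scratch_file = Path(scratch) / "work.bin"
        scratch_file.write_bytes(data)
        size = scratch_file.stat().st_size
    # The scratch directory and its contents no longer exist at this point.
    return size, scratch
```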

Archiving old data is the single biggest headache of any moderately complex system. No programmer ever thinks seriously about how much data will exist in the future, or how to archive it automatically so the working set stays small and access to it stays fast. Automation that moves older data to archive systems, and that handles year-to-year transitions on its own, is vital.
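As a rough sketch of automatic archiving, the function below moves files older than a cutoff into an archive directory named after the year of each file, so the year rollover needs no manual step. The name `archive_old_files`, the 90 day default and the `archive/<year>/` layout are assumptions for the example, and the archive here is just another directory rather than a separate system.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def archive_old_files(live_dir, archive_root, keep_days=90, now=None):
    """Move files older than keep_days from live_dir to archive_root/<year>/.

    The year comes from each file's modification time. Returns the new paths.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=keep_days)
    moved = []
    for path in sorted(Path(live_dir).iterdir()):
        if not path.is_file():
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < cutoff:
            target_dir = Path(archive_root) / str(modified.year)
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```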

Everything Fails Eventually

I could also call this “Code for Failure”. The 80-20 rule applies to software in more ways than you think: 80% fault tolerance code and 20% business logic will make sure that when errors occur, you can handle them quickly and in a controlled fashion. Write clean, simple daily logs for EVERY SCRIPT, APP and SYSTEM. Logs that record 1. Business Activity, 2. Program Debug Activity and 3. Low Level Data IO will save your ass when something fails, and chances are that when you have the other 3 rules in effect, very little will fail 🙂
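One way to get those three daily logs in Python is the standard library's `TimedRotatingFileHandler`, which starts a new file at midnight. This is only a sketch: the helper names, the `myapp` prefix, the log format and the 30 day retention are my own assumptions.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def make_daily_logger(name, log_dir, level=logging.INFO):
    """Return a logger that writes to log_dir/<name>.log and rotates at midnight."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep each log in its own file only
    if not logger.handlers:   # avoid adding a second handler on repeat calls
        handler = TimedRotatingFileHandler(
            str(Path(log_dir) / f"{name}.log"), when="midnight", backupCount=30
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def setup_logs(log_dir, app="myapp"):
    """Create the three logs: business activity, program debug, low level data IO."""
    business = make_daily_logger(f"{app}.business", log_dir, logging.INFO)
    debug = make_daily_logger(f"{app}.debug", log_dir, logging.DEBUG)
    data_io = make_daily_logger(f"{app}.io", log_dir, logging.DEBUG)
    return business, debug, data_io
```

Typical use would be `business, debug, data_io = setup_logs("/var/log/myapp")`, then `business.info("order 1 accepted")` in the business logic and `data_io.debug(...)` around every read and write.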

Sid Young
Senior Systems Engineer
That means 35 years of fixing other people’s crap!