- All processes require checklists.
- Automate everything, and continuously test your automation.
- Do not reinvent the wheel. Defer to idiomatic or popular solutions.
- Everything fails. Anticipate this and offer degraded service.
- Instrument everything.
Checklists may seem absurdly simple, but empirically they are one of the most effective ways of improving the outcome of highly technical and complex activities [2].
The company where you work may have a fancy name for this: maybe "Method of Procedure", or "Standard Operating Procedure", or just "Procedure". Whatever you call it, you need one for major activities like starting, stopping, and restarting your service or subservices. You also need simple steps for colleagues to follow when debugging failures: where the logs are, which system logs to look at, what the common failure modes are, which symptoms to look out for, etc.
Use a Software Configuration Management (SCM) product (e.g. Puppet, Chef, Ansible, Salt) to provision and maintain your service. You automate because humans need sleep, get tired, forget things, and each come with their own idiosyncrasies and competencies. Automation allows you to encode the correct process once and reproduce it reliably.
Continuously verify that your SCM configuration is valid by applying it to fresh virtual machines (e.g. VirtualBox) or lightweight containers (e.g. Docker). If it's untested, it's probably already broken. For example, Fedora 19 shipped with a Unicode release name, which broke both Puppet [3] and Celery [4]; we discovered this during exactly this kind of CI testing.
You have a professional responsibility to deliver value to your customers using proven solutions, not to stroke your own ego by reinventing existing technologies [5]. Your toy implementation will probably miss edge cases and failure modes that years of accumulated wisdom have encoded in popular or idiomatic solutions. Moreover, using something that "just works" is cheaper to implement and maintain.
You want to execute long-running jobs asynchronously? Use a job queue like Gearman, Celery, beanstalkd, etc.
You want to create a Single Page Application in JavaScript, but think you can pull it off using just jQuery? Use Backbone.js, Angular, Ember.js, etc.
You want to provision a Linux PC to have a set of software or libraries installed on it? Use a Software Configuration Management product (e.g. Puppet, Chef, Ansible, Salt) or your operating system's native package format (RPM, DEB). Critically, both of these methods are idempotent, allow anyone to interrogate the system for its current provisioning status, and are idiomatic enough for others in your team to understand immediately.
You want to see if you can re-invent an Object Relational Mapper (ORM) on top of SQLite? Why not use an ORM?
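The idempotency property mentioned above for configuration managers and package formats can be illustrated with a minimal sketch. The `ensure_line` helper below is hypothetical and is not how Puppet, Chef, Ansible, or Salt are implemented; it only shows the idea that applying the same desired state twice leaves the system unchanged the second time, and that the caller can tell whether anything changed.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Ensure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was already
    in the desired state. Running it repeatedly is safe (idempotent).
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already provisioned, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

The first call reports a change and the second does not, which is the same "changed / unchanged" signal that a real configuration manager gives you, with far more care taken over edge cases.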
Failure is sometimes planned. We have a master/slave architecture for executing tests, and slaves report test results back to the master. Restarting the master to include new functionality or fixes will surely have some impact on the slaves, but it should not render entire test runs invalid. Rather, slaves should quiesce and wait patiently for the master to come back up.
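One way a slave could "wait patiently" is to probe the master with exponential backoff before reporting results. This is a minimal sketch, not a description of our actual system: `wait_for_master`, its parameters, and the `is_master_up` probe are all hypothetical names, and the `sleep` argument is injectable only so the logic can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_master(
    is_master_up: Callable[[], bool],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the master until it responds, doubling the wait each time.

    Returns True as soon as the probe succeeds, or False once
    `max_attempts` probes have failed.
    """
    delay = initial_delay
    for _ in range(max_attempts):
        if is_master_up():
            return True
        sleep(delay)
        # Back off so a restarting master is not flooded with probes.
        delay = min(delay * 2, max_delay)
    return False
```

A slave would hold on to its unreported results while this returns False, and only discard them after the master has acknowledged receipt.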
Failure is often unplanned and unanticipated. Expect different parts of your company to unplug your Ethernet switch at 8pm on Friday. Expect all interfaces to be unreliable (e.g. the external DNS resolver) and consider the impact of their failure. Consider attritional rather than outright failures; what would happen if TCP RTTs or NFS RTTs slowly crept up to the thousands of milliseconds?
You will never be able to plan for all failure modes. Use heuristics [1] to be able to detect them and reliably offer degraded service when they occur.
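As an illustration of degraded service, consider the unreliable DNS resolver mentioned above. The sketch below is an assumption-laden example rather than a recommended design: `DegradingResolver` is a hypothetical class that remembers the last good answer for each name and serves it, flagged as degraded, when the live lookup fails.

```python
from typing import Callable, Dict, Tuple


class DegradingResolver:
    """Wrap a lookup function and fall back to the last known good answer."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        self._lookup = lookup
        self._last_good: Dict[str, str] = {}

    def resolve(self, name: str) -> Tuple[str, bool]:
        """Return (address, degraded).

        `degraded` is True when the answer came from the stale cache
        because the live lookup raised OSError.
        """
        try:
            address = self._lookup(name)
        except OSError:
            if name in self._last_good:
                return self._last_good[name], True
            raise  # nothing to fall back on; surface the failure
        self._last_good[name] = address
        return address, False
```

In practice the wrapper could be constructed as `DegradingResolver(socket.gethostbyname)`, since that function raises a subclass of `OSError` on failure. The `degraded` flag matters as much as the fallback itself: it is what lets you count and alert on how often you are running on stale data.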
Use a system metric gatherer and dashboard like Diamond [6] and Graphite [7] to gather metrics on all servers and services. Understand what certain metrics like iowait [8] and free memory [9] actually mean. Curate your dashboard and be able to answer questions like "How long until the NAS runs out of space?".
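A question like "How long until the NAS runs out of space?" can be answered roughly by fitting a straight line to recent usage samples and extrapolating. The following is a minimal sketch using a least-squares fit; `seconds_until_full` is a hypothetical helper, not a Diamond or Graphite function, and it assumes growth is approximately linear over the sampled window.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches `capacity`.

    `samples` is a time-ordered sequence of (timestamp_seconds, bytes_used).
    Returns None if there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope: bytes of growth per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never fills the disk
    intercept = mean_u - slope * mean_t
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])
```

For instance, samples of 100, 200, and 300 bytes used at t = 0, 10, and 20 seconds against a capacity of 1000 bytes give an estimate of 70 seconds remaining.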
Gather application-level exceptions, on both the front-end and back-end, using e.g. Sentry [10].
[1] Hamilton, James R. "On Designing and Deploying Internet-Scale Services." LISA. Vol. 7. 2007. (http://goo.gl/eQCm10)
[2] Gawande, Atul. "The Checklist". The New Yorker, 10th December 2007. (http://goo.gl/pJczVW)
[3] Puppet Labs, "Puppet reads STDERR for facts", 17th July 2013. (http://goo.gl/1piZMM)
[4] Celery, "Celery won't start due to UnicodeDecodeError", May 2013. (http://goo.gl/7q1GI4)
[5] Hoover and Oshineye. "Apprenticeship Patterns", Chapter 3 ("Craft over Art"), O'Reilly. 2010. (http://goo.gl/U0EZmZ)
[6] Diamond, GitHub. (https://github.com/BrightcoveOS/Diamond)
[7] Graphite (http://graphite.wikidot.com/)
[8] "Can anyone explain precisely what IOWait is?", ServerFault. (http://serverfault.com/q/12679)
[9] Linux ate my ram! (http://www.linuxatemyram.com/)
[10] Sentry (https://getsentry.com/welcome/)